Evaluating Multi-Way Joins over Discounted Hitting Time
Cited by 3 (2 self)
The discounted hitting time (DHT), a random-walk similarity measure for graph node pairs, is useful in various applications, including link prediction, collaborative recommendation, and reputation ranking. We examine a novel query, called the multi-way join (or n-way join), on DHT scores. Given a graph and n sets of nodes, the n-way join retrieves a set of n-tuples with the k highest scores, according to some aggregation function of DHT values. This query enables analysis and prediction of complex relationships among n sets of nodes. Since an n-way join is expensive to compute, we develop the Partial Join algorithm (or PJ). This solution decomposes an n-way join into a number of top-m 2-way joins, and combines their results to construct the answer of the n-way join. Since PJ may necessitate the computation of top-(m + 1) 2-way joins, we study an incremental solution, which allows the top-(m + 1) 2-way join to be derived quickly from the top-m 2-way join results computed earlier. We further examine fast processing and pruning algorithms for 2-way joins. An extensive evaluation on three real datasets shows that PJ accurately evaluates n-way joins, and is four orders of magnitude faster than basic solutions.
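To make the query definition concrete, here is a minimal sketch of the kind of brute-force baseline that the n-way join generalizes away from. It is not the Partial Join algorithm. The function name `nway_join_topk`, the `query_edges` parameter (which positions of a tuple are scored against each other), the choice of `sum` as aggregation function, and the toy score table are all assumptions made for illustration; real DHT values would come from random-walk computations on the graph.

```python
import heapq
from itertools import product


def nway_join_topk(node_sets, pair_score, query_edges, k, aggregate=sum):
    """Score every n-tuple and keep the k best (hypothetical brute-force baseline).

    node_sets   -- list of n lists of nodes
    pair_score  -- callable (u, v) -> similarity score (stands in for a DHT value)
    query_edges -- list of (i, j) tuple positions whose scores are aggregated
    aggregate   -- aggregation function over the pairwise scores (assumed: sum)
    """
    scored = []
    for tup in product(*node_sets):
        score = aggregate(pair_score(tup[i], tup[j]) for i, j in query_edges)
        scored.append((score, tup))
    return heapq.nlargest(k, scored)


# Toy pairwise scores (made-up numbers, not real DHT values).
toy_scores = {
    ("a1", "b1"): 0.9, ("a1", "b2"): 0.2,
    ("a2", "b1"): 0.4, ("a2", "b2"): 0.1,
    ("b1", "c1"): 0.5, ("b2", "c1"): 0.8,
}


def toy_pair_score(u, v):
    return toy_scores.get((u, v), 0.0)


# 3-way join over sets A, B, C with a chain-shaped query A-B-C.
result = nway_join_topk(
    [["a1", "a2"], ["b1", "b2"], ["c1"]],
    toy_pair_score,
    query_edges=[(0, 1), (1, 2)],
    k=2,
)
top_tuples = [tup for _, tup in result]
```

The baseline enumerates the full cross product of the n sets, which is why the abstract calls the query expensive and motivates decomposing it into top-m 2-way joins.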
Efficient SimRank-based Similarity Join Over Large Graphs
2013
Cited by 3 (0 self)
Graphs have been widely used to model complex data in many real-world applications. Answering vertex join queries over large graphs is meaningful and interesting, and can benefit applications such as friend recommendation in social networks and link prediction. In this paper, we adopt “SimRank” to evaluate the similarity of two vertices in a large graph because of its generality. Note that “SimRank” is purely structure-dependent and does not rely on domain knowledge. Specifically, we define a SimRank-based join (SRJ) query to find all the vertex pairs satisfying the threshold in a data graph G. In order to reduce the search space, we propose an upper bound for SimRank scores, based on estimated shortest-path distance, to prune unpromising vertex pairs. In the verification step, we propose a novel index, called h-go cover, to efficiently compute the SimRank score of a single vertex pair. Given a graph G, we only materialize the SimRank scores of a small proportion of vertex pairs (called h-go covers), based on which the SimRank score of any vertex pair can be computed easily. In order to handle large graphs, we extend our technique to a partition-based framework. Thorough theoretical analysis and extensive experiments over both real and synthetic datasets confirm the efficiency and effectiveness of our solution.
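As a point of reference for the SRJ query, the following sketch computes SimRank with the standard naive iteration and then filters pairs by a threshold. It does not implement the paper's shortest-path upper bound or the h-go cover index; it only shows what the join returns. The function names, the decay factor `C=0.8`, the iteration count, and the toy graph are assumptions for illustration.

```python
from itertools import combinations


def simrank_all_pairs(in_neighbors, C=0.8, iterations=10):
    """Naive iterative SimRank over all pairs.

    in_neighbors maps every node to the list of its in-neighbors.
    s(u, u) = 1; otherwise s(u, v) is C times the average similarity
    of the in-neighbor pairs of u and v (0 if either has no in-neighbors).
    """
    nodes = list(in_neighbors)
    sim = {(u, v): 1.0 if u == v else 0.0 for u in nodes for v in nodes}
    for _ in range(iterations):
        new = {}
        for u in nodes:
            for v in nodes:
                iu, iv = in_neighbors[u], in_neighbors[v]
                if u == v:
                    new[(u, v)] = 1.0
                elif iu and iv:
                    total = sum(sim[(i, j)] for i in iu for j in iv)
                    new[(u, v)] = C * total / (len(iu) * len(iv))
                else:
                    new[(u, v)] = 0.0
        sim = new
    return sim


def simrank_join(in_neighbors, threshold, C=0.8, iterations=10):
    """Return all unordered vertex pairs whose SimRank score is >= threshold."""
    sim = simrank_all_pairs(in_neighbors, C, iterations)
    return sorted(
        (u, v) for u, v in combinations(in_neighbors, 2) if sim[(u, v)] >= threshold
    )


# Toy graph: u points to a and b; a and b both point to c.
toy_graph = {"u": [], "a": ["u"], "b": ["u"], "c": ["a", "b"]}
scores = simrank_all_pairs(toy_graph)
pairs = simrank_join(toy_graph, threshold=0.5)
```

Every iteration of this baseline touches all vertex pairs, which is the cost that pruning and indexing are meant to avoid on large graphs.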
Efficient Partial-Pairs SimRank Search on Large Networks
The assessment of node-to-node similarities based on graph topology arises in a myriad of applications, e.g., web search. SimRank is a notable measure of this type, with the intuition that “two nodes are similar if their in-neighbors are similar”. While most existing work on retrieving SimRank only considers all-pairs SimRank s(⋆, ⋆) and single-source SimRank s(⋆, j) (scores between every node and query j), there are appealing applications for partial-pairs SimRank, e.g., similarity join. Given two node subsets A and B in a graph, partial-pairs SimRank assessment aims to retrieve only {s(a, b)} for all a ∈ A and b ∈ B. However, the best-known solution does not appear to be self-contained, since it hinges on the premise that the SimRank scores of node-pairs in an h-go cover set must be given beforehand. This paper focuses on efficient assessment of partial-pairs SimRank in a self-contained manner. (1) We devise a novel “seed germination” model that computes partial-pairs SimRank in O(k|E| min{|A|, |B|}) time and O(|E| + k|V|) memory for k iterations on a graph of |V| nodes and |E| edges. (2) We further eliminate unnecessary edge access to improve the time of partial-pairs SimRank to O(m min{|A|, |B|}), where m ≤ min{k|E|, ∆^(2k)}, and ∆ is the maximum degree. (3) We show that our partial-pairs SimRank model can also handle the computations of all-pairs and single-source SimRank. (4) We empirically verify that our algorithms are (a) 38x faster than the best-known competitors, and (b) memory-efficient, allowing scores to be assessed accurately on graphs with tens of millions of links.
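The idea of computing only {s(a, b)} for a ∈ A and b ∈ B, rather than all pairs, can be illustrated with a simple memoized recursion that expands just the node pairs reachable backwards from the query pairs. This is not the "seed germination" model and makes no claim to its complexity bounds; the function name, `C=0.8`, and the toy graph are assumptions for illustration.

```python
from functools import lru_cache


def partial_pairs_simrank(in_neighbors, A, B, C=0.8, k=5):
    """Compute k-iteration SimRank only for pairs (a, b) with a in A, b in B.

    in_neighbors maps every node to a tuple of its in-neighbors.
    Pairs that no query pair depends on are never evaluated.
    """

    @lru_cache(maxsize=None)
    def s(a, b, depth):
        if a == b:
            return 1.0
        if depth == 0:
            return 0.0  # s_0(a, b) = 0 for a != b
        ia, ib = in_neighbors[a], in_neighbors[b]
        if not ia or not ib:
            return 0.0
        total = sum(s(i, j, depth - 1) for i in ia for j in ib)
        return C * total / (len(ia) * len(ib))

    return {(a, b): s(a, b, k) for a in A for b in B}


# Toy graph: a has in-neighbor u; b has in-neighbors u and w; c has a and b.
toy_graph = {"u": (), "w": (), "a": ("u",), "b": ("u", "w"), "c": ("a", "b")}
partial = partial_pairs_simrank(toy_graph, A=["a"], B=["b", "c"])
```

Here s(a, b) depends only on the pairs (u, u) and (u, w), so the recursion never evaluates pairs involving c when c is not in the query.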
Efficient Top-K SimRank-based Similarity Join
SimRank is a popular and widely adopted similarity measure to evaluate the similarity between nodes in a graph. It is time- and space-consuming to compute the SimRank similarities for all pairs of nodes, especially for large graphs. In real-world applications, users are only interested in the most similar pairs. To address this problem, in this paper we study the top-k SimRank-based similarity join problem, which finds the k most similar pairs of nodes with the largest SimRank similarities among all possible pairs. To the best of our knowledge, this is the first attempt to address this problem. We encode each node as a vector by summarizing its neighbors, and transform the calculation of the SimRank similarity between two nodes into computing the dot product between the corresponding vectors. We devise an efficient two-step framework to compute top-k similar pairs using the vectors. For large graphs, exact algorithms cannot meet the high-performance requirement, so we also devise an approximate algorithm which can efficiently identify top-k similar pairs under a user-specified accuracy requirement. Experiments on both real and synthetic datasets show that our method achieves high performance and good scalability.
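Once each node is represented by a vector, finding the top-k pairs reduces to ranking pairwise dot products. The sketch below shows only that final ranking step, by brute force. How the vectors are built from a node's neighbors is not described in this listing, so the vectors here are placeholder values, and the function name is an assumption; the two-step framework and the approximate algorithm are not reproduced.

```python
import heapq
from itertools import combinations


def topk_pairs_by_dot(vectors, k):
    """Return the k node pairs with the largest dot products.

    vectors maps node -> sequence of numbers (placeholder encodings).
    Results are (score, u, v) tuples, best first.
    """

    def dot(x, y):
        return sum(a * b for a, b in zip(x, y))

    candidates = (
        (dot(vectors[u], vectors[v]), u, v)
        for u, v in combinations(sorted(vectors), 2)
    )
    return heapq.nlargest(k, candidates)


# Placeholder vectors, not derived from any real graph.
toy_vectors = {"a": [1, 0, 2], "b": [1, 1, 0], "c": [0, 3, 1]}
top2 = topk_pairs_by_dot(toy_vectors, k=2)
```

`heapq.nlargest` keeps only k candidates in memory at a time, but the loop still visits every pair, which is the cost a faster exact or approximate method would need to reduce.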
Discovering Meta-Paths in Large Heterogeneous Information Networks
The Heterogeneous Information Network (HIN) is a graph data model in which nodes and edges are annotated with class and relationship labels. Large and complex datasets, such as Yago or DBLP, can be modeled as HINs. Recent work has studied how to make use of these rich information sources. In particular, meta-paths, which represent sequences of node classes and edge types between two nodes in a HIN, have been proposed for such tasks as information retrieval, decision making, and product recommendation. Current methods assume meta-paths are found by domain experts. However, in a large and complex HIN, retrieving meta-paths manually can be tedious and difficult. We thus study how to discover meta-paths automatically. Specifically, users are asked to provide example pairs of nodes that exhibit high proximity. We then investigate how to generate meta-paths that can best explain the relationship between these node pairs. Since this problem is computationally intractable, we propose a greedy algorithm to select the most relevant meta-paths. We also present a data structure to enable efficient execution of this algorithm. We further incorporate hierarchical relationships among node classes in our solutions. Extensive experiments on real-world HINs show that our approach captures important meta-paths in an efficient and scalable manner.
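To illustrate what a meta-path is and how example pairs can suggest candidates, the sketch below enumerates short paths between each example pair in a toy HIN, abstracts each path to its sequence of node classes and edge types, and ranks the resulting meta-paths by how many example pairs they connect. This simple frequency ranking is a stand-in, not the paper's greedy selection algorithm or its data structure; all names, labels, and the toy network are assumptions.

```python
from collections import Counter


def meta_paths_between(edges, node_class, src, dst, max_len=3):
    """Count meta-paths over simple paths src -> dst with at most max_len edges.

    edges      -- list of (source, edge_type, target) triples
    node_class -- dict mapping node -> class label
    A meta-path is the tuple (class, edge_type, class, ..., class).
    """
    adj = {}
    for u, rel, v in edges:
        adj.setdefault(u, []).append((rel, v))
    counts = Counter()

    def walk(node, visited, signature):
        if node == dst and len(signature) > 1:
            counts[tuple(signature)] += 1
            return
        if (len(signature) - 1) // 2 >= max_len:
            return  # path already uses max_len edges
        for rel, nxt in adj.get(node, []):
            if nxt not in visited:
                walk(nxt, visited | {nxt}, signature + [rel, node_class[nxt]])

    walk(src, {src}, [node_class[src]])
    return counts


def rank_meta_paths(edges, node_class, example_pairs, max_len=3):
    """Rank meta-paths by the number of example pairs they connect."""
    support = Counter()
    for src, dst in example_pairs:
        for meta_path in meta_paths_between(edges, node_class, src, dst, max_len):
            support[meta_path] += 1
    return support.most_common()


# Toy HIN with hypothetical labels.
toy_edges = [
    ("alice", "writes", "p1"), ("p1", "writtenBy", "bob"),
    ("carol", "writes", "p2"), ("p2", "writtenBy", "dave"),
    ("carol", "worksAt", "u1"), ("u1", "employs", "dave"),
]
toy_classes = {
    "alice": "Author", "bob": "Author", "carol": "Author", "dave": "Author",
    "p1": "Paper", "p2": "Paper", "u1": "Org",
}
ranking = rank_meta_paths(toy_edges, toy_classes, [("alice", "bob"), ("carol", "dave")])
```

In this toy network the co-authorship meta-path connects both example pairs, while the shared-employer meta-path connects only one, so the former ranks first.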