Scaling link-based similarity search (2004)
Citations: 27 (1 self)
BibTeX
@TECHREPORT{Fogaras04scalinglink-based,
author = {Dániel Fogaras and Balázs Rácz},
title = {Scaling link-based similarity search},
institution = {},
year = {2004}
}
Abstract
To exploit the similarity information hidden in the hyperlink structure of the web, this paper introduces algorithms that scale to graphs with billions of vertices on a distributed architecture. The similarity of multi-step neighborhoods of vertices is numerically evaluated by similarity functions including SimRank [20], a recursive refinement of cocitation; PSimRank, a novel variant with better theoretical characteristics; and the Jaccard coefficient, extended to multi-step neighborhoods. Our methods are presented in a general framework of Monte Carlo similarity search algorithms that precompute an index database of random fingerprints; at query time, similarities are estimated from the fingerprints. The performance and quality of the methods were tested on the Stanford WebBase [19] graph of 80M pages by comparing our scores to similarities extracted from the ODP directory [26]. Our experimental results suggest that the hyperlink structure of vertices within four to five steps provides more adequate information for similarity search than single-step neighborhoods.
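
To illustrate the precompute-then-query idea described in the abstract, the following is a minimal Python sketch of a Monte Carlo SimRank estimate. It relies on the standard random-walk reading of SimRank: the score of two vertices is the expected value of c^t, where t is the first step at which two walks following links backwards meet. The function names, the in-memory index layout, the decay value 0.6, and the shared per-vertex choice that keeps walks together once they meet are all assumptions made for illustration; this is not the paper's distributed implementation.

```python
import random


def build_fingerprints(in_links, num_fingerprints, path_len, seed=0):
    """Precompute the index (hypothetical layout, for illustration only).

    in_links maps each vertex to the list of vertices linking to it.
    For every fingerprint and every step we store one random in-neighbor
    per vertex. Sharing that choice means two walks that reach the same
    vertex at the same step continue together.
    """
    rng = random.Random(seed)
    index = []
    for _ in range(num_fingerprints):
        steps = []
        for _ in range(path_len):
            step = {v: rng.choice(nbrs) for v, nbrs in in_links.items() if nbrs}
            steps.append(step)
        index.append(steps)
    return index


def estimate_simrank(index, u, v, decay=0.6):
    """Estimate SimRank(u, v) as the average of decay**t over fingerprints,
    where t is the first step at which the two reverse walks meet.
    Walks are truncated at the precomputed path length."""
    if u == v:
        return 1.0
    total = 0.0
    for steps in index:
        a, b = u, v
        for t, step in enumerate(steps, start=1):
            if a not in step or b not in step:
                break  # one walk reached a vertex with no in-links
            a, b = step[a], step[b]
            if a == b:
                total += decay ** t
                break
    return total / len(index)


# Tiny example graph: "a" and "b" are both linked only from "c".
in_links = {"a": ["c"], "b": ["c"], "c": []}
index = build_fingerprints(in_links, num_fingerprints=100, path_len=5)
score = estimate_simrank(index, "a", "b")
```

In this toy graph both walks step to "c" immediately, so every fingerprint contributes decay to the power 1 and the estimate equals the decay factor. On a real graph, accuracy would depend on the number of fingerprints, and the path length bounds how many steps of neighborhood the estimate can see.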







