by Paat Rusmevichientong, David M. Pennock, Steve Lawrence, C. Lee Giles
http://www-users.cs.york.ac.uk/~tw/fall/Proceedings/pennock-legal.ps

Abstract:

We present two new algorithms for generating uniformly random samples of pages from the World Wide Web, building upon recent work by Henzinger et al. (2000) and Bar-Yossef et al. (2000). Both algorithms are based on a weighted random-walk methodology. The first algorithm (DIRECTED-SAMPLE) operates on arbitrary directed graphs, and so is naturally applicable to the web. We show that, in the limit, this algorithm generates samples that are uniformly random. The second algorithm (UNDIRECTED-SAMPLE) operates on undirected graphs, and thus requires a mechanism for obtaining inbound links to web pages (e.g., access to a search engine). With this additional knowledge of inbound links, the algorithm can arrive at a uniform distribution faster than DIRECTED-SAMPLE, and we derive explicit bounds on the time to convergence. In addition, we evaluate the two algorithms on simulated web data, showing that both yield reliably uniform samples of pages. We also compare our results with those of previous algorithms, and discuss the theoretical relationships among the various proposed methods.
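The abstract does not spell out either algorithm, so the sketch below is not UNDIRECTED-SAMPLE itself. It only illustrates the general idea that a random walk on an undirected graph can be reweighted to converge to a uniform distribution over nodes. The reweighting used here is a standard Metropolis-Hastings acceptance rule, swapped in for illustration. The names `TOY_GRAPH`, `mh_step`, `sample_node`, and `transition_matrix` are hypothetical, and the four-node graph stands in for a web graph whose inbound links are known.

```python
import random

# Hypothetical toy graph: adjacency lists of an undirected, connected graph.
# Every edge appears in both directions (inbound links are assumed known).
TOY_GRAPH = {
    "a": ["b", "c", "d"],
    "b": ["a", "c"],
    "c": ["a", "b"],
    "d": ["a"],
}


def mh_step(graph, node, rng):
    """Take one Metropolis-Hastings step that targets the uniform distribution."""
    neighbors = graph[node]
    candidate = rng.choice(neighbors)
    # A plain random walk favors high-degree nodes. Accepting the move with
    # probability min(1, deg(node) / deg(candidate)) cancels that bias.
    accept = min(1.0, len(neighbors) / len(graph[candidate]))
    return candidate if rng.random() < accept else node


def sample_node(graph, start, steps, rng):
    """Walk for a fixed number of steps and return the final node."""
    node = start
    for _ in range(steps):
        node = mh_step(graph, node, rng)
    return node


def transition_matrix(graph):
    """Exact transition probabilities of the walk, used to check stationarity."""
    nodes = sorted(graph)
    P = {u: {v: 0.0 for v in nodes} for u in nodes}
    for u in nodes:
        for v in graph[u]:
            propose = 1.0 / len(graph[u])
            accept = min(1.0, len(graph[u]) / len(graph[v]))
            P[u][v] += propose * accept
        # Rejected proposals keep the walk at u.
        P[u][u] += 1.0 - sum(P[u][v] for v in nodes if v != u)
    return P
```

Calling `sample_node(TOY_GRAPH, "a", 50, random.Random(0))` returns one node of the toy graph. The columns of `transition_matrix(TOY_GRAPH)` each sum to 1, which is the condition for the uniform distribution to be stationary under this walk. How many steps are needed before the walk is close to uniform depends on the graph, and the sketch makes no claim about the convergence bounds derived in the paper.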

Citations

1631 The anatomy of a large-scale hypertextual web search engine – Brin, Page - 1998
325 Focused crawling: A new approach to topic-specific Web resource discovery – Chakrabarti, Berg, et al. - 1999
162 Geometric bounds for eigenvalues of Markov chains – Diaconis, Stroock - 1991
142 Focused Crawling Using Context Graphs – Diligenti, Coetzee, et al. - 2000
130 A technique for measuring the relative size and overlap of public Web search engines – Bharat, Broder - 1998
127 Graph structure in the web – Broder, Kumar, et al. - 2000
94 Emergence of scaling in random networks. Science 286 – Barabasi, Albert - 1999
64 On near-uniform URL sampling – Henzinger, Heydon, et al. - 2000
13 Preserving the Internet. Scientific American – Kahle - 1997
4 Web surpasses one billion documents. Inktomi/NEC press release, http://www.inktomi.com - January 2000
4 Searching the World Wide Web. Science 280(5360):98--100 – Lawrence, Giles - 1998