Abstract:
We present two new algorithms for generating uniformly random samples of pages from the World Wide Web, building upon recent work by Henzinger et al. (2000) and Bar-Yossef et al. (2000). Both algorithms are based on a weighted random-walk methodology. The first algorithm (DIRECTED-SAMPLE) operates on arbitrary directed graphs, and so is naturally applicable to the web. We show that, in the limit, this algorithm generates samples that are uniformly random. The second algorithm (UNDIRECTED-SAMPLE) operates on undirected graphs, and therefore requires a mechanism for obtaining the inbound links to web pages (e.g., access to a search engine). With this additional knowledge of inbound links, the algorithm can arrive at a uniform distribution faster than DIRECTED-SAMPLE, and we derive explicit bounds on its time to convergence. In addition, we evaluate the two algorithms on simulated web data, showing that both yield reliably uniform samples of pages. We also compare our results with those of previous algorithms, and discuss the theoretical relationships among the various proposed methods.
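As a rough illustration of why a random walk on an undirected graph can be corrected to sample uniformly, the sketch below uses a standard max-degree walk: from a node of degree d, the walk moves to a uniformly chosen neighbor with probability d / max_degree and otherwise stays put. The resulting transition matrix is symmetric, so on a connected graph its stationary distribution is uniform. This is not claimed to be the UNDIRECTED-SAMPLE algorithm itself; the toy graph, the function names `max_degree_walk` and `sample_counts`, and all parameters are assumptions made for the example.

```python
import random
from collections import Counter

# Hypothetical toy graph (undirected, connected), given as adjacency lists.
# Edges: 0-1, 0-2, 0-3, 1-2. Node 0 has the maximum degree, 3.
TOY_GRAPH = {
    0: [1, 2, 3],
    1: [0, 2],
    2: [0, 1],
    3: [0],
}


def max_degree_walk(graph, start, steps, max_degree, rng):
    """Run a max-degree random walk and return the final node.

    At each step the walk moves to a uniformly chosen neighbor with
    probability deg(node) / max_degree and stays in place otherwise,
    so every edge is crossed with probability 1 / max_degree.
    """
    node = start
    for _ in range(steps):
        neighbors = graph[node]
        if rng.random() < len(neighbors) / max_degree:
            node = rng.choice(neighbors)
    return node


def sample_counts(graph, num_samples, steps, seed=0):
    """Draw num_samples end nodes from independent walks and count them."""
    rng = random.Random(seed)
    max_degree = max(len(nbrs) for nbrs in graph.values())
    start = next(iter(graph))  # arbitrary fixed start node
    return Counter(
        max_degree_walk(graph, start, steps, max_degree, rng)
        for _ in range(num_samples)
    )


if __name__ == "__main__":
    counts = sample_counts(TOY_GRAPH, num_samples=4000, steps=50)
    for node in sorted(counts):
        print(node, counts[node] / 4000)
```

On this small graph, each node's empirical frequency should come out close to 1/4 even though the degrees differ, whereas an uncorrected walk would favor node 0. Building the undirected adjacency lists is exactly the step that needs inbound-link information on the real web.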
Citations
| 1631 | The anatomy of a large-scale hypertextual web search engine – Brin, Page - 1998 |
| 325 | Focused crawling: A new approach to topic-specific Web resource discovery – Chakrabarti, Berg, et al. - 1999 |
| 162 | Geometric bounds for eigenvalues of Markov chains – Diaconis, Stroock - 1991 |
| 142 | Focused Crawling Using Context Graphs – Diligenti, Coetzee, et al. - 2000 |
| 130 | A technique for measuring the relative size and overlap of public Web search engines – Bharat, Broder - 1998 |
| 127 | Graph structure in the web – Broder, Kumar, et al. - 2000 |
| 94 | Emergence of scaling in random networks. Science 286 – Barabasi, Albert - 1999 |
| 64 | On near-uniform URL sampling – Henzinger, Heydon, et al. - 2000 |
| 13 | Preserving the Internet. Scientific American – Kahle - 1997 |
| 4 | Web surpasses one billion documents. Inktomi/NEC press release, http://www.inktomi.com – January - 2000 |
| 4 | Searching the World Wide Web. Science 280(5360):98--100 – Lawrence, Giles - 1998 |

