Abstract: We compare several algorithms for identifying mirrored hosts on the World Wide
Web. The algorithms operate on the basis of URL strings and linkage data: the type of
information easily available from web proxies and crawlers.
Identification of mirrored hosts can improve web-based information retrieval in several
ways: First, by identifying mirrored hosts, search engines can avoid storing and
returning duplicate documents. Second, several new information retrieval techniques for
the Web make...
Similar documents based on text:
0.4: Web Information Retrieval - an Algorithmic Perspective - Henzinger (2000)
0.3: Who Links to Whom: Mining Linkage between Web Sites - Bharat, Chang, Henzinger, Ruhl (2003)
0.3: Information Retrieval on the Web - Page 6 - Broder, Henzinger (1998)
Related documents from co-citation:
11: Syntactic clustering of the Web - Broder, Glassman et al. - 1997
9: The anatomy of a large-scale hypertextual Web search engine - Brin, Page
9: Authoritative sources in a hyperlinked environment - Kleinberg - 1997
BibTeX entry:
Bharat, K.; Broder, A.; Dean, J.; and Henzinger, M. 1999. A comparison of techniques to find mirrored hosts on the WWW. In Proceedings of the ACM Digital Library Workshop on Organizing Web Space (WOWS). http://citeseer.ist.psu.edu/bharat99comparison.html
@article{ bharat00comparison,
author = "Krishna Bharat and Andrei Z. Broder and Jeffrey Dean and Monika Rauch Henzinger",
title = "A comparison of techniques to find mirrored hosts on the {WWW}",
journal = "Journal of the American Society for Information Science",
volume = "51",
number = "12",
pages = "1114-1122",
year = "2000",
url = "citeseer.ist.psu.edu/bharat99comparison.html" }
Citations (may not include all citations):
576: Authoritative sources in a hyperlinked environment - Kleinberg - 1998
463: Term-weighting approaches in automatic text retrieval - Salton, Buckley - 1988
344: The PageRank citation ranking: Bringing order to the Web - Page, Brin et al.
163: Improved algorithms for topic distillation in hyperlinked en.. - Bharat, Henzinger - 1998
154: Automatic resource compilation by analyzing hyperlink struct.. - Chakrabarti, Dom et al.
136: Syntactic clustering of the Web - Broder, Glassman et al. - 1997
79: Web document clustering: A feasibility demonstration - Zamir, Etzioni - 1998
70: Clustering algorithms - Rasmussen - 1992
68: A technique for measuring the relative size and overlap of p.. - Bharat, Broder
24: Recent trends in hierarchical document clustering: a critica.. - Willett - 1988
20: Experiments in topic distillation - Chakrabarti, Dom et al. - 1998
18: Finding near-replicas of documents on the web - Shivakumar, Garcia-Molina - 1998
15: mirror on the web: A study of host pairs with replicated con.. - Bharat, Broder
14: Filtering near-duplicate documents - Broder - 1998
3: Computing document clusters on the web - Cho, Shivakumar et al. - 1999
2: A methodology for sampling the world wide web - O'Neill, McClain et al. - 1997
Documents on the same site (http://www.research.digital.com/SRC/personal/monika/papers3.html):
Finding Related Pages in the World Wide Web - Dean, Henzinger (1999)