
A Comparison of Techniques to Find Mirrored Hosts on the WWW (1999)  (19 citations)
Krishna Bharat, Andrei Broder, Jeffrey Dean, Monika Rauch Henzinger
Journal of the American Society of Information Science



Abstract: We compare several algorithms for identifying mirrored hosts on the World Wide Web. The algorithms operate on the basis of URL strings and linkage data: the type of information easily available from web proxies and crawlers. Identification of mirrored hosts can improve web-based information retrieval in several ways: First, by identifying mirrored hosts, search engines can avoid storing and returning duplicate documents. Second, several new information retrieval techniques for the Web make...
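
The abstract says only that the algorithms work from URL strings and linkage data; it does not describe them. As a rough illustration of what a URL-string signal could look like, the sketch below scores host pairs by the Jaccard overlap of the paths seen on each host. The function names, the 0.5 threshold, and the choice of Jaccard overlap are assumptions made for this example, not the method evaluated in the paper.

```python
from collections import defaultdict
from itertools import combinations
from urllib.parse import urlsplit


def paths_by_host(urls):
    """Group URL paths by host name. URLs must include a scheme (e.g. http://)."""
    paths = defaultdict(set)
    for url in urls:
        parts = urlsplit(url)
        if parts.hostname:  # skip strings with no recognizable host
            paths[parts.hostname].add(parts.path or "/")
    return paths


def candidate_mirrors(urls, threshold=0.5):
    """Return (host_a, host_b, score) for host pairs with similar path sets.

    The score is the Jaccard overlap of the two hosts' path sets.
    The threshold is an arbitrary illustrative value, not one from the paper.
    """
    paths = paths_by_host(urls)
    pairs = []
    for a, b in combinations(sorted(paths), 2):
        shared = len(paths[a] & paths[b])
        total = len(paths[a] | paths[b])  # never zero: each host has >= 1 path
        score = shared / total
        if score >= threshold:
            pairs.append((a, b, score))
    return pairs


# Hypothetical crawl log: two hosts share every path, a third shares none.
urls = [
    "http://a.example/docs/index.html",
    "http://a.example/docs/faq.html",
    "http://mirror.example/docs/index.html",
    "http://mirror.example/docs/faq.html",
    "http://other.example/news.html",
]
print(candidate_mirrors(urls))
```

Comparing every host pair is quadratic in the number of hosts, so a sketch like this only suits small samples; a crawl-scale system would need some way to narrow the candidate pairs first.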

Cited by:
Discovering Large Dense Subgraphs in Massive Graphs - David Gibson Ravi (2005)
A Systematic Study of Parameter Correlations in - Large Scale Duplicate
Undue Influence: Eliminating the Impact of Link Plagiarism on.. - Wu, Davison (2006)

Similar documents (at the sentence level):
13.9%:   Special Issue on Data Cleaning - Sarawagi (2000)
6.6%:   A URL-String-Based Algorithm for Finding WWW Mirror Hosts - Submitted To Committee

Active bibliography (related documents):
0.3:   Beyond Document Similarity: Understanding.. - Paepcke.. (2000)
0.1:   Recognizing Nepotistic Links on the Web - Davison (2000)
0.1:   Background Readings for Collection Synthesis - Bibliography (2002)

Similar documents based on text:
0.4:   Web Information Retrieval - an Algorithmic Perspective - Henzinger (2000)
0.3:   Who Links to Whom: Mining Linkage between Web Sites - Bharat, Chang, Henzinger, Ruhl (2003)
0.3:   Information Retrieval on the Web - Page 6 - Broder, Henzinger (1998)

Related documents from co-citation:
11:   Syntactic clustering of the Web - Broder, Glassman et al. - 1997
9:   The anatomy of a large-scale hypertextual Web search engine - Brin, Page
9:   Authoritative sources in a hyperlinked environment - Kleinberg - 1997

BibTeX entry:

Bharat, K.; Broder, A.; Dean, J.; and Henzinger, M. 1999. A comparison of techniques to find mirrored hosts on the WWW. In Proceedings of the ACM Digital Library Workshop on Organizing Web Space (WOWS). http://citeseer.ist.psu.edu/bharat99comparison.html

@article{ bharat00comparison,
    author = "Krishna Bharat and Andrei Z. Broder and Jeffrey Dean and Monika Rauch Henzinger",
    title = "A comparison of techniques to find mirrored hosts on the {WWW}",
    journal = "Journal of the American Society of Information Science",
    volume = "51",
    number = "12",
    pages = "1114-1122",
    year = "2000",
    url = "citeseer.ist.psu.edu/bharat99comparison.html" }

Citations (may not include all citations):
576   Authoritative sources in a hyperlinked environment - Kleinberg - 1998
463   Term-weighting approaches in automatic text retrieval - Salton, Buckley - 1988
344   The PageRank citation ranking: Bringing order to the Web - Page, Brin et al.
163   Improved algorithms for topic distillation in hyperlinked en.. - Bharat, Henzinger - 1998
154   Automatic resource compilation by analyzing hyperlink struct.. - Chakrabarti, Dom et al.
136   Syntactic clustering of the Web - Broder, Glassman et al. - 1997
79   Web document clustering: A feasibility demonstration - Zamir, Etzioni - 1998
70   Clustering algorithms - Rasmussen - 1992
68   A technique for measuring the relative size and overlap of p.. - Bharat, Broder
24   Recent trends in hierarchical document clustering: a critica.. - Willett - 1988
20   Experiments in topic distillation - Chakrabarti, Dom et al. - 1998
18   Finding near-replicas of documents on the web - Shivakumar, Garcia-Molina - 1998
15   mirror on the web: A study of host pairs with replicated con.. - Bharat, Broder
14   Filtering near-duplicate documents - Broder - 1998
3   Computing document clusters on the web - Cho, Shivakumar et al. - 1999
2   A methodology for sampling the world wide web - O'Neill, McClain et al. - 1997





Documents on the same site (http://www.research.digital.com/SRC/personal/monika/papers3.html):
Finding Related Pages in the World Wide Web - Dean, Henzinger (1999)

