Finding Replicated Web Collections (2000)  (19 citations)
Junghoo Cho, Narayanan Shivakumar, Hector Garcia-Molina



Abstract: Many web documents (such as JAVA FAQs) are being replicated on the Internet. Often entire document collections (such as hyperlinked Linux manuals) are being replicated many times. In this paper, we make the case for identifying replicated documents and collections to improve web crawlers, archivers, and ranking functions used in search engines. The paper describes how to efficiently identify replicated documents and hyperlinked document collections. The challenge is to identify these...

Cited by:
A Systematic Study of Parameter Correlations in - Large Scale Duplicate
Undue Influence: Eliminating the Impact of Link Plagiarism on.. - Wu, Davison (2006)
LSH Forest: Self-Tuning Indexes for Similarity Search - Mayank Bawa (2005)

Similar documents (at the sentence level):
5.3%:   Finding Replicated Web Collections - Cho, Shivakumar, Garcia-Molina (1999)

Active bibliography (related documents):
0.2:   Scalable Techniques for Clustering the Web (Extended.. - Haveliwala, Gionis, Indyk (2000)
0.2:   Optimizing Selections over Data Cubes - Ross, Zaman (1998)
0.2:   Predicting the cost-quality trade-off for.. - Blok, de Jong..

Similar documents based on text:
0.3:   Crawler-Friendly Web Servers - Brandman, Cho, Garcia-Molina.. (2000)
0.2:   The Evolution of the Web and Implications for an Incremental .. - Cho, Garcia-Molina (1999)
0.2:   Estimating Frequency of Change - Cho, Garcia-Molina (2000)

Related documents from co-citation:
10:   The anatomy of a large-scale hypertextual Web search engine - Brin, Page
8:   Authoritative sources in a hyperlinked environment - Kleinberg - 1997
8:   Syntactic clustering of the Web - Broder, Glassman et al. - 1997

BibTeX entry:

J. Cho, N. Shivakumar, H. Garcia-Molina. Finding Replicated Web Collections. Proc. SIGMOD Conference, 2000. http://citeseer.ist.psu.edu/article/cho00finding.html

@inproceedings{ cho00finding,
    author = "Junghoo Cho and Narayanan Shivakumar and Hector Garcia-Molina",
    title = "Finding replicated {Web} collections",
    pages = "355--366",
    year = "2000",
    url = "citeseer.ist.psu.edu/article/cho00finding.html" }
Citations (may not include all citations):
3972   Introduction to algorithms - Cormen, Leiserson et al. - 1991
150   Accessibility of information on the web - Lawrence, Giles - 1999 - http://www.wwwmetrics.com/
136   Syntactic clustering of the web - Broder, Glassman et al. - 1997
67   On the resemblance and containment of documents - Broder - 1997
62   Adaptive web sites: Automatically synthesizing web pages - Perkowitz, Etzioni - 1998
37   Life, death, and lawfulness on the electronic frontier - Pitkow, Pirolli et al. - 1997
28   SCAM: a copy detection mechanism for digital documents - Shivakumar, Garcia-Molina - 1995
24   Building a scalable and accurate copy detection mechanism - Shivakumar, Garcia-Molina - 1996
6   Mirror, mirror on the Web: A study of host pairs with replicated content - Bharat, Broder et al. - 1999
5   Google search engine - Brin, Page - 1999
4   Computing iceberg queries efficiently - Fang, Shivakumar et al. - 1998
2   Introduction to modern information retrieval - Salton - 1983




Documents on the same site (http://www-db.stanford.edu/pub/papers/):
Replicated Data Management in Mobile Environments.. - Barbará-Millá..
Extracting Semistructured Information from the Web - Hammer, Garcia-Molina, Cho, .. (1997)
U-PAI: A Universal Payment Application Interface, v 0.93 - Ketchpel, Garcia-Molina, .. (1996)

CiteSeer.IST - Copyright Penn State and NEC