Abstract: Many web documents (such as JAVA FAQs) are being replicated on the Internet. Often
entire document collections (such as hyperlinked Linux manuals) are being replicated many
times. In this paper, we make the case for identifying replicated documents and collections
to improve web crawlers, archivers, and ranking functions used in search engines. The paper
describes how to efficiently identify replicated documents and hyperlinked document collections.
The challenge is to identify these...
Cited by:
A Systematic Study of Parameter Correlations in - Large Scale Duplicate
Undue Influence: Eliminating the Impact of Link Plagiarism on.. - Wu, Davison (2006)
LSH Forest: Self-Tuning Indexes for Similarity Search - Bawa, Condie, Ganesan (2005)
Similar documents (at the sentence level):
5.3%: Finding Replicated Web Collections - Cho, Shivakumar, Garcia-Molina (1999)
Active bibliography (related documents):
0.2: Scalable Techniques for Clustering the Web (Extended.. - Haveliwala, Gionis, Indyk (2000)
0.2: Optimizing Selections over Data Cubes - Ross, Zaman (1998)
0.2: Predicting the cost-quality trade-off for.. - Blok, de Jong..
Similar documents based on text:
0.3: Crawler-Friendly Web Servers - Brandman, Cho, Garcia-Molina.. (2000)
0.2: The Evolution of the Web and Implications for an Incremental .. - Cho, Garcia-Molina (1999)
0.2: Estimating Frequency of Change - Cho, Garcia-Molina (2000)
Related documents from co-citation:
10: The anatomy of a large-scale hypertextual Web search engine - Brin, Page
8: Authoritative sources in a hyperlinked environment - Kleinberg - 1997
8: Syntactic clustering of the Web - Broder, Glassman et al. - 1997
BibTeX entry:
J. Cho, N. Shivakumar, H. Garcia-Molina. Finding Replicated Web Collections. Proc. SIGMOD Conference, 2000. http://citeseer.ist.psu.edu/article/cho00finding.html
@inproceedings{ cho00finding,
author = "Junghoo Cho and Narayanan Shivakumar and Hector Garcia-Molina",
title = "Finding replicated {Web} collections",
pages = "355--366",
year = "2000",
url = "citeseer.ist.psu.edu/article/cho00finding.html" }
Citations (may not include all citations):
3972: Introduction to algorithms - Cormen, Leiserson et al. - 1991
150: Accessibility of information on the web - Lawrence, Giles - 1999 - http://www.wwwmetrics.com/
136: Syntactic clustering of the web - Broder, Glassman et al. - 1997
67: the resemblance and containment of documents - Broder - 1997
62: Adaptive web sites: Automatically synthesizing web pages - Perkowitz, Etzioni - 1998
37: and lawfulness on the electronic frontier - Pitkow, Pirolli et al. - 1997
28: SCAM: a copy detection mechanism for digital documents - Shivakumar, Garcia-Molina - 1995
24: Building a scalable and accurate copy detection mechanism - Shivakumar, Garcia-Molina - 1996
6: the Web: A study of host pairs with replicated content - Bharat, Broder et al. - 1999
5: Google search engine - Brin, Page - 1999
4: Computing iceberg queries efficiently - Fang, Shivakumar et al. - 1998
2: Introduction to modern information retrieval - Salton - 1983
Documents on the same site (http://www-db.stanford.edu/pub/papers/):
Replicated Data Management in Mobile Environments.. - Barbará-Millá..
Extracting Semistructured Information from the Web - Hammer, Garcia-Molina, Cho, .. (1997)
U-PAI: A Universal Payment Application Interface, v 0.93 - Ketchpel, Garcia-Molina, .. (1996)
CiteSeer.IST - Copyright Penn State and NEC