Results 1 - 10 of 206
The PARSEC benchmark suite: Characterization and architectural implications
- Princeton University
, 2008
"... This paper presents and characterizes the Princeton Application Repository for Shared-Memory Computers (PARSEC), a benchmark suite for studies of Chip-Multiprocessors (CMPs). Previous available benchmarks for multiprocessors have focused on high-performance computing applications and used a limited ..."
Abstract
-
Cited by 518 (4 self)
- Add to MetaCart
(Show Context)
This paper presents and characterizes the Princeton Application Repository for Shared-Memory Computers (PARSEC), a benchmark suite for studies of Chip-Multiprocessors (CMPs). Previously available benchmarks for multiprocessors have focused on high-performance computing applications and used a limited number of synchronization methods. PARSEC includes emerging applications in recognition, mining and synthesis (RMS) as well as systems applications which mimic large-scale multithreaded commercial programs. Our characterization shows that the benchmark suite covers a wide spectrum of working sets, locality, data sharing, synchronization and off-chip traffic. The benchmark suite has been made available to the public.
On the Resemblance and Containment of Documents
- In Compression and Complexity of Sequences (SEQUENCES’97)
, 1997
"... Given two documents A and B we define two mathematical notions: their resemblance r(A, B)andtheircontainment c(A, B) that seem to capture well the informal notions of "roughly the same" and "roughly contained." The basic idea is to reduce these issues to set intersection probl ..."
Abstract
-
Cited by 506 (6 self)
- Add to MetaCart
(Show Context)
Given two documents A and B we define two mathematical notions: their resemblance r(A, B) and their containment c(A, B), which seem to capture well the informal notions of "roughly the same" and "roughly contained." The basic idea is to reduce these issues to set intersection problems that can be easily evaluated by a process of random sampling that can be done independently for each document. Furthermore, the resemblance can be evaluated using a fixed size sample for each document.
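As a rough illustration of the sampling idea (not the paper's exact construction), the sketch below estimates resemblance from word-level shingles and a fixed-size sample of the smallest shingle hashes; the shingle size, sample size, and hash function are illustrative assumptions.

```python
import hashlib
import re

def shingles(text, k=4):
    """Set of contiguous k-word sequences (shingles) of a document."""
    words = re.findall(r"\w+", text.lower())
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def min_sample(shingle_set, sample_size=50):
    """Fixed-size sample: the sample_size smallest hash values of the shingles."""
    hashes = sorted(int(hashlib.sha1(s.encode()).hexdigest(), 16) for s in shingle_set)
    return set(hashes[:sample_size])

def resemblance_estimate(doc_a, doc_b, sample_size=50):
    """Estimate r(A, B) = |S(A) ∩ S(B)| / |S(A) ∪ S(B)| from the two samples:
    take the sample_size smallest hashes of the union of both samples and
    count how many of them lie in both samples."""
    sa = min_sample(shingles(doc_a), sample_size)
    sb = min_sample(shingles(doc_b), sample_size)
    union_sample = set(sorted(sa | sb)[:sample_size])
    return len(union_sample & sa & sb) / len(union_sample)

# Example: two nearly identical sentences should give a resemblance close to 1.
print(resemblance_estimate("the quick brown fox jumps over the lazy dog by the river",
                           "the quick brown fox jumps over the lazy dog by the sea"))
```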
A Low-bandwidth Network File System
, 2001
"... This paper presents LBFS, a network file system designed for low bandwidth networks. LBFS exploits similarities between files or versions of the same file to save bandwidth. It avoids sending data over the network when the same data can already be found in the server's file system or the client ..."
Abstract
-
Cited by 394 (3 self)
- Add to MetaCart
(Show Context)
This paper presents LBFS, a network file system designed for low bandwidth networks. LBFS exploits similarities between files or versions of the same file to save bandwidth. It avoids sending data over the network when the same data can already be found in the server's file system or the client's cache. Using this technique, LBFS achieves up to two orders of magnitude reduction in bandwidth utilization on common workloads, compared to traditional network file systems.
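A plausible reading of the bandwidth-saving mechanism is content-defined chunking: file data is split at boundaries chosen by a rolling hash of the content, so identical regions in different files yield identical chunks that can be named by their hash and skipped if the other side already has them. The sketch below substitutes a simple polynomial rolling hash and illustrative parameters for the Rabin fingerprints LBFS actually uses, and omits minimum/maximum chunk-size limits.

```python
import hashlib

WINDOW = 48                 # bytes in the rolling-hash window (illustrative)
MASK = (1 << 13) - 1        # matching the low 13 bits -> roughly 8 KB expected chunks
BASE, MOD = 257, (1 << 61) - 1

def chunk(data: bytes):
    """Split data at content-defined boundaries; return (sha1, chunk) pairs."""
    pow_w = pow(BASE, WINDOW - 1, MOD)
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * pow_w) % MOD   # drop the byte leaving the window
        h = (h * BASE + byte) % MOD                    # add the new byte
        if i - start + 1 >= WINDOW and (h & MASK) == MASK:
            piece = data[start:i + 1]
            chunks.append((hashlib.sha1(piece).hexdigest(), piece))
            start = i + 1
    if start < len(data):                              # trailing chunk
        piece = data[start:]
        chunks.append((hashlib.sha1(piece).hexdigest(), piece))
    return chunks

# A client would send only the chunk hashes; the server requests chunks it has not seen.
```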
Winnowing: Local Algorithms for Document Fingerprinting
- In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data
, 2003
"... Digital content is for copying: quotation, revision, plagiarism, and file sharing all create copies. Document fingerprinting is concerned with accurately identifying copying, including small partial copies, within large sets of documents. We introduce the class of local document fingerprinting algor ..."
Abstract
-
Cited by 264 (5 self)
- Add to MetaCart
(Show Context)
Digital content is for copying: quotation, revision, plagiarism, and file sharing all create copies. Document fingerprinting is concerned with accurately identifying copying, including small partial copies, within large sets of documents. We introduce the class of local document fingerprinting algorithms, which seems to capture an essential property of any fingerprinting technique guaranteed to detect copies. We prove a novel lower bound on the performance of any local algorithm. We also develop winnowing, an efficient local fingerprinting algorithm, and show that winnowing's performance is within 33% of the lower bound. Finally, we give experimental results on Web data, and report experience with MOSS, a widely-used plagiarism detection service.
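The core of the winnowing algorithm is easy to state: hash all k-grams, slide a window of w consecutive hashes over that sequence, and select the minimum hash in each window (taking the rightmost occurrence on ties), recording each selected position only once. The sketch below is a minimal version with illustrative choices of k, w, and hash function.

```python
import zlib

def winnow(text, k=5, w=4):
    """Return winnowing fingerprints as (hash, position) pairs: in every window
    of w consecutive k-gram hashes, pick the rightmost minimal hash, and record
    a selection only when it differs from the previously recorded one."""
    hashes = [zlib.crc32(text[i:i + k].encode()) for i in range(len(text) - k + 1)]
    fingerprints, last = [], None
    for i in range(max(len(hashes) - w + 1, 0)):
        window = hashes[i:i + w]
        min_val = min(window)
        j = max(idx for idx, h in enumerate(window) if h == min_val)  # rightmost minimum
        pick = (min_val, i + j)
        if pick != last:
            fingerprints.append(pick)
            last = pick
    return fingerprints

# Documents sharing a sufficiently long substring are guaranteed to share a fingerprint.
print(winnow("a do run run run, a do run run"))
```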
What's New on the Web? The Evolution of the Web from a Search Engine Perspective
, 2004
"... We seek to gain improved insight into how Web search engines should cope with the evolving Web, in an attempt to provide users with the most up-to-date results possible. For this purpose we collected weekly snapshots of some 150 Web sites over the course of one year, and measured the evolution of co ..."
Abstract
-
Cited by 220 (15 self)
- Add to MetaCart
(Show Context)
We seek to gain improved insight into how Web search engines should cope with the evolving Web, in an attempt to provide users with the most up-to-date results possible. For this purpose we collected weekly snapshots of some 150 Web sites over the course of one year, and measured the evolution of content and link structure. Our measurements focus on aspects of potential interest to search engine designers: the evolution of link structure over time, the rate of creation of new pages and new distinct content on the Web, and the rate of change of the content of existing pages under search-centric measures of degree of change.
Filtering near-duplicate documents
- Proc. FUN 98
, 1998
"... Abstract. The mathematical concept of document resemblance captures well the informal notion of syntactic similarity. The resemblance can be estimated using a fixed size “sketch ” for each document. For a large collection of documents (say hundreds of millions) the size of this sketch is of the orde ..."
Abstract
-
Cited by 157 (6 self)
- Add to MetaCart
(Show Context)
The mathematical concept of document resemblance captures well the informal notion of syntactic similarity. The resemblance can be estimated using a fixed size “sketch” for each document. For a large collection of documents (say hundreds of millions) the size of this sketch is of the order of a few hundred bytes per document. However, for efficient large scale web indexing it is not necessary to determine the actual resemblance value: it suffices to determine whether newly encountered documents are duplicates or near-duplicates of documents already indexed. In other words, it suffices to determine whether the resemblance is above a certain threshold. In this talk we show how this determination can be made using a “sample” of less than 50 bytes per document. The basic approach for computing resemblance has two aspects: first, resemblance is expressed as a set (of strings) intersection problem, and second, the relative size of intersections is evaluated by a process of random sampling that can be done independently for each document. The process of estimating the relative size of intersection of sets and the threshold test discussed above can be applied to arbitrary sets, and thus might be of independent interest. The algorithm for filtering near-duplicate documents discussed here has been successfully implemented and has been used for the last three years in the context of the AltaVista search engine.
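One way to read the "sample of less than 50 bytes" is as a handful of "super" features: a vector of min-hashes is computed per document, the vector is cut into a few groups, each group is hashed down to one short value, and two documents are flagged as near-duplicates when enough of these values coincide. The group count, hash function, and match threshold below are illustrative assumptions, not the paper's exact parameters.

```python
import hashlib

def minhashes(text, num_hashes=84, k=8):
    """One min-hash per seeded hash function over the document's character k-grams."""
    grams = {text[i:i + k] for i in range(max(len(text) - k + 1, 1))}
    return [min(int(hashlib.sha1(f"{seed}:{g}".encode()).hexdigest(), 16) for g in grams)
            for seed in range(num_hashes)]

def super_features(text, groups=6):
    """Collapse the min-hash vector into a few short features (the small sample)."""
    mh = minhashes(text)
    size = len(mh) // groups
    return {hashlib.sha1(repr(mh[i * size:(i + 1) * size]).encode()).hexdigest()[:16]
            for i in range(groups)}

def probably_near_duplicate(doc_a, doc_b, min_matches=2):
    """Threshold test: flag the pair if enough super features agree."""
    return len(super_features(doc_a) & super_features(doc_b)) >= min_matches
```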
Scam: A copy detection mechanism for digital documents
- In Digital Library
, 1995
"... Copy detection in Digital Libraries may provide the necessary guarantees for publishers and newsfeed ser-vices to oer valuable on-line data. We consider the case for a registration server that maintains regis-tered documents against which new documents can be checked for overlap. In this paper we pr ..."
Abstract
-
Cited by 148 (10 self)
- Add to MetaCart
(Show Context)
Copy detection in Digital Libraries may provide the necessary guarantees for publishers and newsfeed services to offer valuable on-line data. We consider the case for a registration server that maintains registered documents against which new documents can be checked for overlap. In this paper we present a new scheme for detecting copies based on comparing the word frequency occurrences of the new document against those of registered documents. We also report on an experimental comparison between our proposed scheme and COPS [6], a detection scheme based on sentence overlap. The tests involve over a million comparisons of netnews articles and show that in general the new scheme performs better in detecting documents that have partial overlap.
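As a simplified stand-in for the word-frequency comparison described above (plain cosine similarity over word counts rather than SCAM's actual relative frequency model), a registration-server check could look roughly like this; the similarity threshold is an illustrative assumption.

```python
import math
import re
from collections import Counter

def word_frequencies(text):
    """Word-occurrence counts used as the document's frequency profile."""
    return Counter(re.findall(r"\w+", text.lower()))

def frequency_overlap(profile_a, profile_b):
    """Cosine similarity of two frequency profiles (a simplification of SCAM's measure)."""
    dot = sum(profile_a[w] * profile_b[w] for w in profile_a.keys() & profile_b.keys())
    norm = (math.sqrt(sum(c * c for c in profile_a.values())) *
            math.sqrt(sum(c * c for c in profile_b.values())))
    return dot / norm if norm else 0.0

def check_against_registry(new_doc, registry, threshold=0.7):
    """Compare a new document against registered ones; return ids with high overlap."""
    profile = word_frequencies(new_doc)
    return [doc_id for doc_id, registered in registry.items()
            if frequency_overlap(profile, word_frequencies(registered)) >= threshold]
```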
Challenges in Web Search Engines
, 2002
"... This article presents a high-level discussion of some problems in information retrieval that are unique to web search engines. The goal is to raise awareness and stimulate research in these areas. ..."
Abstract
-
Cited by 128 (0 self)
- Add to MetaCart
This article presents a high-level discussion of some problems in information retrieval that are unique to web search engines. The goal is to raise awareness and stimulate research in these areas.
Finding near-duplicate web pages: a large-scale evaluation of algorithms
- In Proc. of Int. Conf. on Information Retrieval (SIGIR)
, 2006
"... Broder et al.’s [3] shingling algorithm and Charikar’s [4] ran-dom projection based approach are considered “state-of-the-art ” algorithms for finding near-duplicate web pages. Both algorithms were either developed at or used by popular web search engines. We compare the two algorithms on a very lar ..."
Abstract
-
Cited by 118 (2 self)
- Add to MetaCart
(Show Context)
Broder et al.’s [3] shingling algorithm and Charikar’s [4] random projection based approach are considered “state-of-the-art” algorithms for finding near-duplicate web pages. Both algorithms were either developed at or used by popular web search engines. We compare the two algorithms on a very large scale, namely on a set of 1.6B distinct web pages. The results show that neither of the algorithms works well for finding near-duplicate pairs on the same site, while both achieve high precision for near-duplicate pairs on different sites. Since Charikar’s algorithm finds more near-duplicate pairs on different sites, it achieves a better precision overall, namely 0.50 versus 0.38 for Broder et al.’s algorithm. We present a combined algorithm which achieves precision 0.79 with 79% of the recall of the other algorithms.
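For reference, a bare-bones version of the random-projection (simhash) side of this comparison: every token's hash casts a +1/-1 vote on each bit position, the sign of the accumulated vote gives the fingerprint bit, and pages whose fingerprints differ in only a few bits are treated as near-duplicates. Token weighting and the exact bit-difference cutoff used in the evaluation are omitted; the values here are assumptions.

```python
import hashlib
import re

def simhash(text, bits=64):
    """Random-projection fingerprint: each token votes on every bit position."""
    votes = [0] * bits
    for token in re.findall(r"\w+", text.lower()):
        h = int(hashlib.md5(token.encode()).hexdigest(), 16)
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if votes[i] > 0)

def hamming_distance(x, y):
    return bin(x ^ y).count("1")

def near_duplicate(page_a, page_b, max_distance=3):
    """Treat pages as near-duplicates when their fingerprints differ in few bits."""
    return hamming_distance(simhash(page_a), simhash(page_b)) <= max_distance
```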
Efficient similarity joins for near duplicate detection
- In WWW
, 2008
"... With the increasing amount of data and the need to integrate data from multiple data sources, one of the challenging issues is to identify near duplicate records efficiently. In this paper, we focus on efficient algorithms to find pair of records such that their similarities are no less than a given ..."
Abstract
-
Cited by 103 (9 self)
- Add to MetaCart
(Show Context)
With the increasing amount of data and the need to integrate data from multiple data sources, one of the challenging issues is to identify near duplicate records efficiently. In this paper, we focus on efficient algorithms to find pairs of records whose similarities are no less than a given threshold. Several existing algorithms rely on the prefix filtering principle to avoid computing similarity values for all possible pairs of records. We propose new filtering techniques that exploit the token ordering information; they integrate into the existing methods and drastically reduce the candidate sizes, hence improving efficiency. We have also studied the implementation of our proposed algorithm in stand-alone and RDBMS-based settings. Experimental results show that our proposed algorithms outperform previous algorithms on several real datasets.
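A minimal sketch of the prefix filtering principle the paper builds on (not the proposed positional/suffix filters themselves): with records represented as token lists sorted by a global token order, two records can reach a Jaccard threshold t only if their prefixes of length |x| - ceil(t·|x|) + 1 share a token, so an inverted index over prefix tokens generates the candidate pairs, which are then verified exactly. The records, ordering, and threshold below are illustrative.

```python
import math
from collections import defaultdict

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def similarity_self_join(records, threshold=0.8):
    """Prefix-filtering self-join. `records` are token lists already sorted by a
    global ordering (e.g. rarest tokens first). Returns id pairs whose Jaccard
    similarity is at least `threshold`."""
    index = defaultdict(list)          # prefix token -> ids of records containing it
    results = []
    for rid, tokens in enumerate(records):
        prefix_len = len(tokens) - math.ceil(threshold * len(tokens)) + 1
        candidates = set()
        for token in tokens[:prefix_len]:
            candidates.update(index[token])   # earlier records sharing a prefix token
            index[token].append(rid)
        results.extend((cid, rid) for cid in candidates
                       if jaccard(records[cid], tokens) >= threshold)
    return results

# Example: both records use the same global token order; expected output is [(0, 1)].
print(similarity_self_join([["x", "a", "b", "c", "d"], ["x", "a", "b", "c", "e"]], 0.6))
```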