Results 1 - 10 of 118
Efficient Parallel Set-Similarity Joins Using MapReduce
"... In this paper we study how to efficiently perform set-similarity joins in parallel using the popular MapReduce framework. We propose a 3-stage approach for end-to-end setsimilarity joins. We take as input a set of records and output a set of joined records based on a set-similarity condition. We eff ..."
Abstract
-
Cited by 109 (7 self)
- Add to MetaCart
(Show Context)
In this paper we study how to efficiently perform set-similarity joins in parallel using the popular MapReduce framework. We propose a 3-stage approach for end-to-end set-similarity joins. We take as input a set of records and output a set of joined records based on a set-similarity condition. We efficiently partition the data across nodes in order to balance the workload and minimize the need for replication. We study both self-join and R-S join cases, and show how to carefully control the amount of data kept in main memory on each node. We also propose solutions for the case where, even if we use the most fine-grained partitioning, the data still does not fit in the main memory of a node. We report results from extensive experiments on real datasets, synthetically increased in size, to evaluate the speedup and scaleup properties of the proposed algorithms using Hadoop.
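To make the set-similarity join idea concrete, here is a minimal plain-Python sketch of the token-grouping step that a MapReduce job would distribute across reducers; it is not the paper's actual 3-stage algorithm, and the record contents and Jaccard threshold are illustrative assumptions.

```python
from collections import defaultdict
from itertools import combinations

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def similarity_self_join(records, threshold=0.7):
    # "Map" phase stand-in: emit (token, record_id) pairs; a real job would emit
    # only a prefix of each record's tokens to cut replication.
    groups = defaultdict(list)
    for rid, tokens in records.items():
        for tok in set(tokens):
            groups[tok].append(rid)
    # "Reduce" phase stand-in: verify candidate pairs that share at least one token.
    results = set()
    for rids in groups.values():
        for r1, r2 in combinations(sorted(rids), 2):
            if jaccard(records[r1], records[r2]) >= threshold:
                results.add((r1, r2))
    return results

# Hypothetical toy records: record id -> token list.
print(similarity_self_join({1: ["a", "b", "c"], 2: ["a", "b", "c", "d"], 3: ["x", "y"]}))
# -> {(1, 2)}
```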
Efficient similarity joins for near duplicate detection
- In WWW
, 2008
"... With the increasing amount of data and the need to integrate data from multiple data sources, one of the challenging issues is to identify near duplicate records efficiently. In this paper, we focus on efficient algorithms to find pair of records such that their similarities are no less than a given ..."
Abstract
-
Cited by 103 (9 self)
- Add to MetaCart
With the increasing amount of data and the need to integrate data from multiple data sources, one of the challenging issues is to identify near-duplicate records efficiently. In this paper, we focus on efficient algorithms to find pairs of records such that their similarities are no less than a given threshold. Several existing algorithms rely on the prefix filtering principle to avoid computing similarity values for all possible pairs of records. We propose new filtering techniques by exploiting the token ordering information; they are integrated into the existing methods and drastically reduce the candidate sizes and hence improve the efficiency. We have also studied the implementation of our proposed algorithm in stand-alone and RDBMS-based settings. Experimental results show that our proposed algorithms can outperform previous algorithms on several real datasets.
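The prefix filtering principle mentioned above can be illustrated with a short sketch: under a global token ordering, a record needs to index only a small prefix of its tokens, and two records can reach a Jaccard threshold t only if their prefixes share a token. The frequency-based ordering and threshold below are assumptions, not the paper's exact method.

```python
import math
from collections import defaultdict

def prefix_length(size, t):
    # Under Jaccard >= t, a record of `size` distinct tokens must share at least
    # one of its first size - ceil(t * size) + 1 tokens with any matching record.
    return size - math.ceil(t * size) + 1

def candidate_pairs(records, t=0.8):
    # Global token ordering: rare tokens first, so prefixes are selective.
    freq = defaultdict(int)
    for toks in records.values():
        for tok in set(toks):
            freq[tok] += 1
    ordered = {rid: sorted(set(toks), key=lambda tok: (freq[tok], tok))
               for rid, toks in records.items()}
    index = defaultdict(set)   # prefix token -> record ids already seen with it
    cands = set()
    for rid, toks in ordered.items():
        for tok in toks[:prefix_length(len(toks), t)]:
            for other in index[tok]:
                cands.add(tuple(sorted((rid, other))))
            index[tok].add(rid)
    return cands   # candidates still need exact similarity verification
```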
Detecting Near-Duplicates for Web Crawling
- WWW 2007
, 2007
"... Near-duplicate web documents are abundant. Two such documents differ from each other in a very small portion that displays advertisements, for example. Such differences are irrelevant for web search. So the quality of a web crawler increases if it can assess whether a newly crawled web page is a nea ..."
Abstract
-
Cited by 92 (0 self)
- Add to MetaCart
Near-duplicate web documents are abundant. Two such documents differ from each other in a very small portion that displays advertisements, for example. Such differences are irrelevant for web search. So the quality of a web crawler increases if it can assess whether a newly crawled web page is a near-duplicate of a previously crawled web page or not. In the course of developing a near-duplicate detection system for a multi-billion page repository, we make two research contributions. First, we demonstrate that Charikar’s fingerprinting technique is appropriate for this goal. Second, we present an algorithmic technique for identifying existing f-bit fingerprints that differ from a given fingerprint in at most k bit-positions, for small k. Our technique is useful for both online queries (single fingerprints) and batch queries (multiple fingerprints). Experimental evaluation over real data confirms the practicality of our design.
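A minimal sketch of a Charikar-style (simhash) fingerprint and the k-bit Hamming test appears below; the paper's contribution is an efficient index over permuted fingerprint tables rather than the pairwise comparison shown here, and the hash function and f = 64 are assumptions.

```python
import hashlib

def simhash(tokens, f=64):
    # Sum signed hash bits per position, then keep the sign as the fingerprint bit.
    v = [0] * f
    for tok in tokens:
        h = int(hashlib.md5(tok.encode()).hexdigest(), 16)
        for i in range(f):
            v[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(f) if v[i] > 0)

def near_duplicate(fp1, fp2, k=3):
    # Near-duplicate if the fingerprints differ in at most k bit positions.
    return bin(fp1 ^ fp2).count("1") <= k

a = simhash("near duplicate web documents are abundant".split())
b = simhash("near duplicate web documents are everywhere".split())
print(near_duplicate(a, b))
```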
Ed-Join: An Efficient Algorithm for Similarity Joins With Edit Distance Constraints
, 2008
"... There has been considerable interest in similarity join in the research community recently. Similarity join is a fundamental operation in many application areas, such as data integration and cleaning, bioinformatics, and pattern recognition. We focus on efficient algorithms for similarity join with ..."
Abstract
-
Cited by 72 (8 self)
- Add to MetaCart
(Show Context)
There has been considerable interest in similarity join in the research community recently. Similarity join is a fundamental operation in many application areas, such as data integration and cleaning, bioinformatics, and pattern recognition. We focus on efficient algorithms for similarity join with edit distance constraints. Existing approaches are mainly based on converting the edit distance constraint to a weaker constraint on the number of matching q-grams between pairs of strings. In this paper, we propose the novel perspective of investigating mismatching q-grams. Technically, we derive two new edit distance lower bounds by analyzing the locations and contents of mismatching q-grams. A new algorithm, Ed-Join, is proposed that exploits the new mismatch-based filtering methods; it achieves substantial reduction of the candidate sizes and hence saves computation time. We demonstrate experimentally that the new algorithm outperforms alternative methods on large-scale real datasets under a wide range of parameter settings.
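For context, the classical q-gram count filter that mismatch-based filtering improves upon can be sketched as follows; the bound shown is the standard one, not Ed-Join's tighter mismatch-based lower bounds.

```python
from collections import Counter

def qgrams(s, q=2):
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def passes_count_filter(s1, s2, tau, q=2):
    # Count filter: if ed(s1, s2) <= tau, the strings share at least
    # max(|s1|, |s2|) - q + 1 - q * tau q-grams (counting multiplicity).
    common = sum((Counter(qgrams(s1, q)) & Counter(qgrams(s2, q))).values())
    return common >= max(len(s1), len(s2)) - q + 1 - q * tau

print(passes_count_filter("similarity", "similarly", tau=2))  # survives the filter
```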
Spotsigs: robust and efficient near duplicate detection in large web collections
- In SIGIR ’08: Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
, 2008
"... Motivated by our work with political scientists who need to manually analyze large Web archives of news sites, we present SpotSigs, a new algorithm for extracting and matching signatures for near duplicate detection in large Web crawls. Our spot signatures are designed to favor naturallanguage porti ..."
Abstract
-
Cited by 45 (2 self)
- Add to MetaCart
(Show Context)
Motivated by our work with political scientists who need to manually analyze large Web archives of news sites, we present SpotSigs, a new algorithm for extracting and matching signatures for near duplicate detection in large Web crawls. Our spot signatures are designed to favor natural-language portions of Web pages over advertisements and navigational bars. The contributions of SpotSigs are twofold: 1) by combining stopword antecedents with short chains of adjacent content terms, we create robust document signatures with a natural ability to filter out noisy components of Web pages that would otherwise distract pure n-gram-based approaches such as Shingling; 2) we provide an exact and efficient, self-tuning matching algorithm that exploits a novel combination of collection partitioning and inverted index pruning for high-dimensional similarity search. Experiments confirm an increase in combined precision and recall of more than 24 percent over state-of-the-art approaches such as Shingling or I-Match and up to a factor of 3 faster execution times than Locality Sensitive Hashing (LSH), over a demonstrative “Gold Set” of manually assessed near-duplicate news articles as well as the TREC WT10g Web collection.
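A rough sketch of spot-signature extraction is shown below: anchor on a stopword antecedent and keep a short chain of the content terms that follow it. The stopword list, chain length, and tokenization are illustrative assumptions, not the paper's tuned parameters.

```python
STOPWORDS = {"the", "a", "an", "is", "was", "there", "it", "that"}

def spot_signatures(text, chain=2):
    tokens = text.lower().split()
    sigs = []
    for i, tok in enumerate(tokens):
        if tok in STOPWORDS:
            # Chain of the next `chain` content (non-stopword) terms after the antecedent.
            terms = [t for t in tokens[i + 1:] if t not in STOPWORDS][:chain]
            if len(terms) == chain:
                sigs.append(tok + ":" + ":".join(terms))
    return sigs

print(spot_signatures("there is a new algorithm for near duplicate detection"))
# -> ['there:new:algorithm', 'is:new:algorithm', 'a:new:algorithm']
```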
Top-k Set Similarity Joins
"... Abstract — Similarity join is a useful primitive operation underlying many applications, such as near duplicate Web page detection, data integration, and pattern recognition. Traditional similarity joins require a user to specify a similarity threshold. In this paper, we study a variant of the simil ..."
Abstract
-
Cited by 42 (2 self)
- Add to MetaCart
(Show Context)
Similarity join is a useful primitive operation underlying many applications, such as near duplicate Web page detection, data integration, and pattern recognition. Traditional similarity joins require a user to specify a similarity threshold. In this paper, we study a variant of the similarity join, termed top-k set similarity join. It returns the top-k pairs of records ranked by their similarities, thus eliminating the guesswork users have to perform when the similarity threshold is unknown beforehand. An algorithm, topk-join, is proposed to answer top-k similarity joins efficiently. It is based on the prefix filtering principle and employs tight upper bounding of similarity values of unseen pairs. Experimental results demonstrate the efficiency of the proposed algorithm on large-scale real datasets.
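For illustration, a brute-force version of the top-k set similarity join interface is sketched below; the paper's topk-join avoids enumerating most pairs by upper-bounding the similarity of unseen pairs, which this sketch does not do.

```python
import heapq
from itertools import combinations

def jaccard(a, b):
    return len(a & b) / len(a | b)

def topk_join_bruteforce(records, k):
    # Rank every pair by similarity; no threshold has to be guessed up front.
    pairs = combinations(records.items(), 2)
    scored = ((jaccard(set(t1), set(t2)), (r1, r2)) for (r1, t1), (r2, t2) in pairs)
    return heapq.nlargest(k, scored)

records = {1: ["a", "b", "c"], 2: ["a", "b", "d"], 3: ["a", "x", "y"]}
print(topk_join_bruteforce(records, k=2))
```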
Web graph similarity for anomaly detection
- Journal of Internet Services and Applications
, 2010
"... Web graphs are approximate snapshots of the web, created by search engines. They are essential to monitor the evolution of the web and to compute global properties like PageRank values of web pages. Their continuous monitoring requires a notion of graph similarity to help measure the amount and sign ..."
Abstract
-
Cited by 30 (4 self)
- Add to MetaCart
(Show Context)
Web graphs are approximate snapshots of the web, created by search engines. They are essential to monitor the evolution of the web and to compute global properties like PageRank values of web pages. Their continuous monitoring requires a notion of graph similarity to help measure the amount and significance of changes in the evolving web. As a result, these measurements provide means to validate how well search engines acquire content from the web. In this paper we propose five similarity schemes: three of them we adapted from existing graph similarity measures, and two we adapted from well-known document and vector similarity methods (namely, the shingling method and the random projection based method). We empirically evaluate and compare all five schemes using a sequence of web graphs from Yahoo!, and study whether the schemes can identify anomalies that may occur due to hardware or other problems.
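One plausible way to adapt a document-similarity idea to graph snapshots (not necessarily the paper's exact scheme) is to represent each snapshot by its set of edge features and compare snapshots with Jaccard similarity, as in the sketch below.

```python
def edge_features(graph):
    # graph: dict mapping a page to the set of pages it links to.
    return {(src, dst) for src, dsts in graph.items() for dst in dsts}

def snapshot_similarity(g1, g2):
    f1, f2 = edge_features(g1), edge_features(g2)
    return len(f1 & f2) / len(f1 | f2) if (f1 | f2) else 1.0

old = {"a": {"b", "c"}, "b": {"c"}}
new = {"a": {"b", "c"}, "b": {"d"}}
print(snapshot_similarity(old, new))  # a sharp drop between snapshots could flag an anomaly
```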
Local text reuse detection
- In Proc. of SIGIR
, 2008
"... Text reuse occurs in many different types of documents and for many different reasons. One form of reuse, duplicate or near-duplicate documents, has been a focus of researchers because of its importance in Web search. Local text reuse occurs when sentences, facts or passages, rather than whole docum ..."
Abstract
-
Cited by 28 (3 self)
- Add to MetaCart
(Show Context)
Text reuse occurs in many different types of documents and for many different reasons. One form of reuse, duplicate or near-duplicate documents, has been a focus of researchers because of its importance in Web search. Local text reuse occurs when sentences, facts or passages, rather than whole documents, are reused and modified. Detecting this type of reuse can be the basis of new tools for text analysis. In this paper, we introduce a new approach to detecting local text reuse and compare it to other approaches. This comparison involves a study of the amount and type of reuse that occurs in real documents, including TREC newswire and blog collections.
Finding text reuse on the web
- In WSDM ’09: Proceedings of the Second ACM International Conference on Web Search and Data Mining
, 2009
"... With the overwhelming number of reports on similar events originating from different sources on the web, it is often hard, using existing web search paradigms, to find the original source of “facts”, statements, rumors, and opinions, and to track their development. Several techniques have been previ ..."
Abstract
-
Cited by 28 (1 self)
- Add to MetaCart
(Show Context)
With the overwhelming number of reports on similar events originating from different sources on the web, it is often hard, using existing web search paradigms, to find the original source of “facts”, statements, rumors, and opinions, and to track their development. Several techniques have been previously proposed for detecting such text reuse between different sources; however, these techniques have been tested against relatively small and homogeneous TREC collections. In this work, we test the feasibility of text reuse detection techniques in the setting of web search. In addition to text reuse detection, we develop a novel technique that addresses the unique challenges of finding original sources on the web, such as defining a timeline. We also explore the use of link analysis for identifying reliable and relevant reports. Our experimental results show that the proposed techniques can operate on the scale of the web, are significantly more accurate than standard web search for finding text reuse, and provide a richer representation for tracking the information flow.
V-SMART-Join: A Scalable MapReduce Framework for All-Pair Similarity Joins of Multisets and Vectors
"... This work proposes V-SMART-Join, a scalable MapReducebased framework for discovering all pairs of similar entities. The V-SMART-Join framework is applicable to sets, multisets, and vectors. V-SMART-Join is motivated by the observed skew in the underlying distributions of Internet traffic, and is a f ..."
Abstract
-
Cited by 26 (1 self)
- Add to MetaCart
(Show Context)
This work proposes V-SMART-Join, a scalable MapReduce-based framework for discovering all pairs of similar entities. The V-SMART-Join framework is applicable to sets, multisets, and vectors. V-SMART-Join is motivated by the observed skew in the underlying distributions of Internet traffic, and is a family of 2-stage algorithms, where the first stage computes and joins the partial results, and the second stage computes the similarity exactly for all candidate pairs. The V-SMART-Join algorithms are very efficient and scalable in the number of entities, as well as their cardinalities. They were up to 30 times faster than the state-of-the-art algorithm, VCL, when compared on a small real dataset. We also established the scalability of the proposed algorithms by running them on a dataset of a realistic size, on which VCL failed to finish. Experiments were run using real datasets of IPs and cookies, where each IP is represented as a multiset of cookies, and the goal is to discover similar IPs to identify Internet proxies.
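As a small illustration of the exact verification a multiset join like this must perform, the sketch below computes generalized Jaccard similarity over cookie counts (sum of minimum counts over sum of maximum counts); the two-stage MapReduce pipeline and candidate generation are not shown, and the example data is hypothetical.

```python
from collections import Counter

def multiset_jaccard(c1, c2):
    # Generalized Jaccard for multisets: sum of minimum counts / sum of maximum counts.
    keys = set(c1) | set(c2)
    num = sum(min(c1[k], c2[k]) for k in keys)
    den = sum(max(c1[k], c2[k]) for k in keys)
    return num / den if den else 1.0

# Hypothetical IPs represented as multisets of the cookies seen behind them.
ip_a = Counter({"cookie1": 3, "cookie2": 1})
ip_b = Counter({"cookie1": 2, "cookie3": 1})
print(multiset_jaccard(ip_a, ip_b))  # 2 / 5 = 0.4
```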