## ClusterJoin: A Similarity Joins Framework using Map-Reduce

Citations: | 2 - 0 self |

### Citations

3206 | Mapreduce: simplified data processing on large clusters
- Dean, Ghemawat
Citation Context ...Triangle Inequality: d(x, z) ≤ d(x, y) + d(y, z) The framework we propose in this work is generic and can handle any metric distance function, including those mentioned above. 3.2 MapReduce MapReduce [12] is a popular framework for parallel computation. In the MapReduce programming model, data is expressed through (key, value) pairs, and computation is represented by a Map function and a Reduce functi... |
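The (key, value) model described in the context above can be illustrated with a minimal single-process sketch. This is not the paper's code or any real MapReduce API: `run_mapreduce`, `word_count_map`, and `word_count_reduce` are hypothetical names chosen for illustration, and the grouping step stands in for the distributed shuffle.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn):
    """Simulate one MapReduce round in memory (illustrative only)."""
    groups = defaultdict(list)
    # Map phase: each record emits zero or more (key, value) pairs.
    for record in records:
        for key, value in map_fn(record):
            # Shuffle phase: values are grouped by their key.
            groups[key].append(value)
    # Reduce phase: each key's list of values is folded into one result.
    return {key: reduce_fn(key, values) for key, values in groups.items()}


def word_count_map(line):
    # Hypothetical Map function: emit (word, 1) for every word in a line.
    for word in line.split():
        yield word, 1


def word_count_reduce(key, values):
    # Hypothetical Reduce function: sum the counts collected for one word.
    return sum(values)
```

For example, `run_mapreduce(["a b a", "b c"], word_count_map, word_count_reduce)` groups the emitted pairs by word and sums them per key.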

585 | Spatial tessellations: concepts and applications of Voronoi diagrams
- Okabe, Boots, et al.
- 1992
Citation Context ...lated to computing a generalized Voronoi diagram for A in high dimensions. Yet even for the well-behaved Euclidean distance function, computing Voronoi diagram is nontrivial and a discipline by itself [20]. Generalizing this to other distance functions that are considerably more complex, such as distributional distances like Jensen Shannon [15] is a daunting task. Instead of solving the hard problem of... |

516 |
Syntactic clustering of the Web
- Broder, Glassman, et al.
- 1997
Citation Context ...function (or distance values less than a distance threshold). It is an essential operation in a variety of applications, including data cleaning [11], web page deduplication [16], document clustering [8], plagiarism detection [17], click fraud detection [18], entity resolution [26], data integration [14], etc. As these applications need to handle increasingly vast amounts of data, the problem of scal... |

397 | When is "nearest neighbor" meaningful?
- Beyer, Goldstein, et al.
- 1999
Citation Context ...n compared to the 2-dimensional Euclidean filter in Figure 5a, and produce good pruning only with low distance (high similarity) thresholds. This is partly attributable to the curse of dimensionality [7] – high dimensional data may just be inherently hard to prune away. However, we argue that this is still very useful, because in many real applications people are more interested in finding pairs of r... |

295 | The State of Record Linkage and Current Research Problems
- Winkler
- 1999
Citation Context ...al operation in a variety of applications, including data cleaning [11], web page deduplication [16], document clustering [8], plagiarism detection [17], click fraud detection [18], entity resolution [26], data integration [14], etc. As these applications need to handle increasingly vast amounts of data, the problem of scaling up similarity joins is getting ever more important. Performing similarity j... |

190 | SCOPE: easy and efficient parallel processing of massive data sets
- Chaiken, Jenkins, et al.
- 2008
Citation Context ...n [13] that handles set-based and string-based similarities, which are not the focus of this work. We implement all algorithms described above and conduct experiments in a production MapReduce system [10]. All algorithms were executed concurrently with other production jobs, at the normal cluster workload, using a fixed amount of virtual resources. 8.2 Experimental evaluation 8.2.1 Filter effectivenes... |

150 | Scaling up all pairs similarity search
- Bayardo, Ma, et al.
Citation Context ...Reduce. 2. RELATED WORK The problem of performing efficient similarity joins has a wide variety of applications. Numerous techniques have been proposed, including prefix-based filters [11], All-pairs [6], PP-Join [27], and many others. This long and fruitful line of work has led to a significant improvement in the scalability of similarity joins. More recently, similarity join using MapReduce have at... |

133 | Efficient exact set-similarity joins
- Arasu, Ganti, et al.
- 2006
Citation Context ...sed similarity functions have direct metric distance counterparts. For instance Jaccard similarity has a metric distance equivalent. Given that Jaccard similarity is heavily studied in the literature [5, 6, 24, 27], and we don’t have a special filter for Jaccard yet, it would be interesting to see how our approach performs using the general filter. Our results using the news data set and Jaccard similarity of 0... |

122 | A primitive operator for similarity joins in data cleaning
- Chaudhuri, Ganti, et al.
- 2006
Citation Context ...edefined similarity threshold under a given similarity function (or distance values less than a distance threshold). It is an essential operation in a variety of applications, including data cleaning [11], web page deduplication [16], document clustering [8], plagiarism detection [17], click fraud detection [18], entity resolution [26], data integration [14], etc. As these applications need to handle ... |

112 | Finding near-duplicate Web pages: A large-scale evaluation of algorithms
- Henzinger
- 2006
Citation Context ... under a given similarity function (or distance values less than a distance threshold). It is an essential operation in a variety of applications, including data cleaning [11], web page deduplication [16], document clustering [8], plagiarism detection [17], click fraud detection [18], entity resolution [26], data integration [14], etc. As these applications need to handle increasingly vast amounts of ... |

104 | Efficient parallel set-similarity joins using mapreduce
- Vernica, Carey, et al.
- 2010
Citation Context ...s of great practical importance. In this work we propose a general framework to compute similarity joins in MapReduce on metric distance functions. While similarity join in MapReduce has been studied [13, 18, 24, 25], most existing approaches focus on set-based or string-based similarity metrics (Jaccard similarity, set-based Cosine similarity, and edit distances). In this work we focus on general metric distance... |

100 | Efficient similarity joins for near duplicate detection
- Xiao, Wang, et al.
- 2008
Citation Context ...ELATED WORK The problem of performing efficient similarity joins has a wide variety of applications. Numerous techniques have been proposed, including prefix-based filters [11], All-pairs [6], PP-Join [27], and many others. This long and fruitful line of work has led to a significant improvement in the scalability of similarity joins. More recently, similarity join using MapReduce have attracted signi... |

86 | A new metric for probability distributions, Information Theory
- Endres, Schindelin
- 2003
Citation Context ...g Voronoi diagram is nontrivial and a discipline by itself [20]. Generalizing this to other distance functions that are considerably more complex, such as distributional distances like Jensen Shannon [15] is a daunting task. Instead of solving the hard problem of finding the exact outer partition membership M(Q,A) above, we choose to solve it approximately. Specifically, we find a superset S(Q,A) ⊇M(Q... |

73 | Methods for identifying versioned and plagiarized documents
- Hoad, Zobel
- 2003
Citation Context ...es less than a distance threshold). It is an essential operation in a variety of applications, including data cleaning [11], web page deduplication [16], document clustering [8], plagiarism detection [17], click fraud detection [18], entity resolution [26], data integration [14], etc. As these applications need to handle increasingly vast amounts of data, the problem of scaling up similarity joins is ... |

44 | Processing theta-joins using mapreduce
- Okcan, Riedewald
- 2011
Citation Context ... their anchor points approach. However, their approach can be viewed as uniform space-partitioning, which is likely to lead to imbalanced partition with skewed data distributions. Okcan and Riedewald [21] design a Theta-Join framework that can handle joins for arbitrary predicates. Their approach is very general and is capable of handling any joins. However this approach cannot prune away candidate pa... |

24 | Metric Spaces: Iteration and Application
- Bryant
- 1985
Citation Context ...on distance, Total Variation distance and Earth Mover distance, etc. Metric distances have a number of nice properties that we use to design candidate pruning filters. DEFINITION 3.1 (METRIC DISTANCE [9]). Let D be the domain of all records. A metric distance on D is any function d : D×D→ R satisfying the following properties ∀x, y, z ∈ D: • Non-negativity: d(x, y) ≥ 0 • Coincidence Axiom: d(x, y) = 0... |
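The metric properties quoted above can be spot-checked numerically on a finite sample. The sketch below is an illustration only, not part of the cited paper: the snippet is truncated, so the coincidence axiom (d(x, y) = 0 exactly when x = y) and symmetry are taken from the standard textbook definition of a metric, and the names `find_metric_violations`, `euclidean`, and `squared_gap` are hypothetical. Passing the check on a sample does not prove that a function is a metric.

```python
import itertools
import math


def euclidean(x, y):
    # Euclidean distance between two equal-length tuples (a known metric).
    return math.dist(x, y)


def squared_gap(x, y):
    # Squared difference of two numbers: fails the triangle inequality.
    return (x - y) ** 2


def find_metric_violations(d, points, tol=1e-9):
    """Return the metric axioms that d violates on the given sample."""
    violations = []
    for x, y in itertools.product(points, repeat=2):
        if d(x, y) < -tol:
            violations.append(("non-negativity", x, y))
        # Standard coincidence axiom: zero distance exactly when x == y.
        if (abs(d(x, y)) <= tol) != (x == y):
            violations.append(("coincidence", x, y))
        if abs(d(x, y) - d(y, x)) > tol:
            violations.append(("symmetry", x, y))
    for x, y, z in itertools.product(points, repeat=3):
        # Triangle inequality: d(x, z) <= d(x, y) + d(y, z).
        if d(x, z) > d(x, y) + d(y, z) + tol:
            violations.append(("triangle", x, y, z))
    return violations
```

On the sample `[(0, 0), (3, 4), (1, 1)]`, `euclidean` produces no violations, while `squared_gap` on `[0, 1, 2]` fails the triangle inequality because 4 > 1 + 1.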

23 | V-smart-join: A scalable mapreduce framework for all-pair similarity joins of multisets and vectors
- Metwally, Faloutsos
Citation Context ...shold). It is an essential operation in a variety of applications, including data cleaning [11], web page deduplication [16], document clustering [8], plagiarism detection [17], click fraud detection [18], entity resolution [26], data integration [14], etc. As these applications need to handle increasingly vast amounts of data, the problem of scaling up similarity joins is getting ever more important.... |

17 | Fuzzy joins using mapreduce
- Afrati, Sarma, et al.
- 2012
Citation Context ... token level to compute pairwise similarity. They show that their approach works well for sparse data sets with a large alphabet. Their approach does not prune away any candidate pairs. Afrati et al. [4] study techniques such as ball hashing and anchor points analytically. Our Cluster-Join algorithm draws inspiration from their anchor points approach. However, their approach can be viewed as uniform ... |

17 | Bayesian locality sensitive hashing for fast similarity search
- Satuluri, Parthasarathy
- 2012
Citation Context ... the more powerful filters with our dynamic load balancing scheme, our approach is experimentally shown to be up to an order of magnitude more efficient than MAPSS. Approximate similarity join (e.g., [22]) is the related problem of discovering similar pairs with a small false negative probability. In this paper, we focus on the exact similarity join problem, where all matching pairs are to be found, w... |

9 | Principles of Data Integration
- Doan, Halevy, et al.
- 2012
Citation Context ...ty of applications, including data cleaning [11], web page deduplication [16], document clustering [8], plagiarism detection [17], click fraud detection [18], entity resolution [26], data integration [14], etc. As these applications need to handle increasingly vast amounts of data, the problem of scaling up similarity joins is getting ever more important. Performing similarity joins on massive amounts... |

6 | Exploiting MapReduce-based similarity joins
- Silva, Reed
- 2012
Citation Context ...on. Here, the cost of broadcasting is insignificant, while the benefit of balancing load is quite significant. We also note that although splitting skewed partitions has been used for similarity join [23, 25], previous approaches split partitions in ad-hoc manners that cannot provide an upper bound for the size of each partition. In fact, when more than T records are mapped to the same partition, they som... |

4 | Scalable all-pairs similarity search in metric spaces - Wang, Metwally, et al. |

3 | Anchor-Points Algorithms for Hamming and Edit Distances Using MapReduce - Afrati, Sarma, et al. - 2014 |

1 | Massjoin: A mapreduce-based algorithm for string similarity joins
- Deng, Li, et al.
- 2013
Citation Context ...s of great practical importance. In this work we propose a general framework to compute similarity joins in MapReduce on metric distance functions. While similarity join in MapReduce has been studied [13, 18, 24, 25], most existing approaches focus on set-based or string-based similarity metrics (Jaccard similarity, set-based Cosine similarity, and edit distances). In this work we focus on general metric distance... |

1 | The set-union knapsack problem. University of Texas at - Nehme-Haily - 1995 |