Results 1 – 10 of 105
Constructing Virtual Documents for Ontology Matching
 In: 15th International World Wide Web Conference
, 2006
Cited by 78 (9 self)
Abstract. Ontology matching is a crucial task for data integration and management on the Semantic Web. Today's ontology matching techniques can solve many problems arising from the heterogeneity of ontologies to some extent. However, for matching large ontologies, most ontology matchers take too long to run and place strong requirements on the running environment. Based on the MapReduce framework and the virtual document technique, in this paper we propose a 3-stage MapReduce-based approach called V-Doc+ for matching large ontologies, which significantly reduces the run time while keeping good precision and recall. Firstly, we establish four MapReduce processes to construct a virtual document for each entity (class, property or instance), consisting of a simple process for the descriptions of entities, an iterative process for the descriptions of blank nodes, and two processes for exchanging descriptions with neighbors. Then, we use a word-weight-based partition method to calculate similarities between entities in the corresponding reducers. We report results from two experiments on an OAEI dataset and a dataset from the biology domain, assessing performance by comparison with existing ontology matchers. Additionally, we show how run time decreases as the cluster size grows.
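The core idea above can be sketched compactly: a virtual document collects an entity's own words plus down-weighted words from its neighbors, and entities are then compared by cosine similarity of these word-weight maps. The function names and the 0.5 neighbor weight below are illustrative assumptions, not details from the paper.

```python
# Hypothetical sketch of the virtual-document idea: each entity's document
# combines its own textual description with down-weighted words from
# neighboring entities; names and weights are illustrative, not the paper's.
from collections import Counter

def virtual_document(entity, local_text, neighbors, neighbor_weight=0.5):
    """Return a word -> weight map for one entity."""
    doc = Counter()
    for word in local_text.split():
        doc[word] += 1.0
    for n_text in neighbors:                      # descriptions of adjacent entities
        for word in n_text.split():
            doc[word] += neighbor_weight          # neighbor words count less
    return doc

def cosine(d1, d2):
    common = set(d1) & set(d2)
    num = sum(d1[w] * d2[w] for w in common)
    den = (sum(v * v for v in d1.values()) ** 0.5) * \
          (sum(v * v for v in d2.values()) ** 0.5)
    return num / den if den else 0.0

d1 = virtual_document("Person", "person human being", ["agent"])
d2 = virtual_document("Human", "human person", ["agent"])
print(round(cosine(d1, d2), 3))
```

In the MapReduce setting described in the abstract, the document construction and the pairwise cosine computations would each run in separate map/reduce stages.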
Hyracks: A Flexible and Extensible Foundation for Data-Intensive Computing
 In: Proceedings of the International Conference on Data Engineering (ICDE)
, 2011
Cited by 60 (12 self)
Abstract—Hyracks is a new partitioned-parallel software platform designed to run data-intensive computations on large shared-nothing clusters of computers. Hyracks allows users to express a computation as a DAG of data operators and connectors. Operators operate on partitions of input data and produce partitions of output data, while connectors repartition operators' outputs to make the newly produced partitions available at the consuming operators. We describe the Hyracks end-user model for authors of dataflow jobs, and the extension model for users who wish to augment Hyracks' built-in library with new operator and/or connector types. We also describe our initial Hyracks implementation. Since Hyracks occupies roughly the same space as the open-source Hadoop platform, we compare Hyracks with Hadoop experimentally for several different kinds of use cases. The initial results demonstrate that Hyracks holds significant promise as a next-generation platform for data-intensive applications.
Processing Theta-Joins using MapReduce
 In: SIGMOD Conference
, 2011
Cited by 48 (1 self)
Joins are essential for many data analysis tasks, but are not supported directly by the MapReduce paradigm. While there has been progress on equi-joins, the implementation of join algorithms in MapReduce in general is not sufficiently understood. We study the problem of how to map arbitrary join conditions to Map and Reduce functions, i.e., to a parallel infrastructure that controls data flow based on key-equality only. Our proposed join model simplifies the creation of, and reasoning about, joins in MapReduce. Using this model, we derive a surprisingly simple randomized algorithm, called 1-Bucket-Theta, for implementing arbitrary joins (theta-joins) in a single MapReduce job. This algorithm requires only minimal statistics (input cardinality), and we provide evidence that for a variety of join problems it is either close to optimal or the best possible option. For some of the problems where 1-Bucket-Theta is not the best choice, we show how to achieve better performance by exploiting additional input statistics. All algorithms can be made 'memory-aware', and they do not require any modifications to the MapReduce environment. Experiments show the effectiveness of our approach.
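The randomized scheme behind 1-Bucket-Theta can be sketched in a few lines: cover the |R| x |S| join matrix with a grid of reducer regions, send each R tuple to every region in a random row and each S tuple to every region in a random column, so every (r, s) pair meets in exactly one region where an arbitrary predicate can be tested. The grid shape and predicate below are illustrative choices, not the paper's cost-optimized tuning.

```python
# Minimal single-process sketch of the 1-Bucket-Theta idea: grid shape and
# the join predicate are illustrative, not the paper's optimized parameters.
import random

def one_bucket_theta(R, S, rows, cols, theta, seed=0):
    rng = random.Random(seed)
    regions = {(i, j): ([], []) for i in range(rows) for j in range(cols)}
    for r in R:
        i = rng.randrange(rows)           # random row for this R tuple
        for j in range(cols):             # replicate across the row
            regions[(i, j)][0].append(r)
    for s in S:
        j = rng.randrange(cols)           # random column for this S tuple
        for i in range(rows):             # replicate across the column
            regions[(i, j)][1].append(s)
    out = []                              # "reduce": test theta on each region's block
    for (rs, ss) in regions.values():
        out.extend((r, s) for r in rs for s in ss if theta(r, s))
    return sorted(out)

# An inequality join R.x < S.x, which plain key-equality MapReduce cannot express.
pairs = one_bucket_theta([1, 4, 7], [2, 5], rows=2, cols=2, theta=lambda r, s: r < s)
print(pairs)
```

Because each R tuple occupies one full grid row and each S tuple one full column, their region intersection is unique, so no pair is missed or duplicated regardless of the random choices.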
Efficient processing of k nearest neighbor joins using MapReduce
, 2012
Cited by 32 (2 self)
The k nearest neighbor join (kNN join), designed to find the k nearest neighbors in a dataset S for every object in another dataset R, is a primitive operation widely adopted by many data mining applications. As a combination of the k nearest neighbor query and the join operation, the kNN join is expensive. Given the increasing volume of data, it is difficult to perform a kNN join efficiently on a centralized machine. In this paper, we investigate how to perform the kNN join using MapReduce, a well-accepted framework for data-intensive applications over clusters of computers. In brief, the mappers cluster objects into groups; the reducers perform the kNN join on each group of objects separately. We design an effective mapping mechanism that exploits pruning rules for distance filtering, and hence reduces both the shuffling and computational costs. To reduce the shuffling cost further, we propose two approximate algorithms that minimize the number of replicas. Extensive experiments on our in-house cluster demonstrate that the proposed methods are efficient, robust, and scalable.
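The map/reduce split described above can be illustrated with a toy one-dimensional version: mappers assign each R object to its nearest pivot ("cluster objects into groups"), and reducers compute the kNN join per group. For brevity every group receives all of S here; the paper's distance-based pruning rules exist precisely to shrink that replication, so this is a deliberately simplified sketch.

```python
# Hedged sketch of a pivot-based kNN join. Pruning is omitted: in the real
# method, each group would receive only the S objects it might need.
def knn_join(R, S, pivots, k):
    groups = {p: [] for p in pivots}
    for r in R:                                   # "map": nearest-pivot grouping
        p = min(pivots, key=lambda piv: abs(r - piv))
        groups[p].append(r)
    result = {}
    for members in groups.values():               # "reduce": per-group kNN
        for r in members:
            result[r] = sorted(S, key=lambda s: abs(r - s))[:k]
    return result

res = knn_join(R=[1, 9], S=[2, 3, 8, 10], pivots=[0, 10], k=2)
print(res)
```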
Pass-Join: A partition-based method for similarity joins
, 2011
Cited by 27 (13 self)
As an essential operation in data cleaning, the similarity join has attracted considerable attention from the database community. In this paper, we study string similarity joins with edit-distance constraints, which find similar string pairs from two large sets of strings whose edit distance is within a given threshold. Existing algorithms are efficient either for short strings or for long strings; no algorithm can efficiently and adaptively support both. To address this problem, we propose a partition-based method called Pass-Join. Pass-Join partitions a string into a set of segments and creates inverted indices for the segments. Then, for each string, Pass-Join selects some of its substrings and uses them to find candidate pairs via the inverted indices. We devise efficient techniques to select the substrings and prove that our method minimizes the number of selected substrings. We develop novel pruning techniques to efficiently verify the candidate pairs. Experimental results show that our algorithms are efficient for both short strings and long strings, and outperform state-of-the-art methods on real datasets.
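The partition scheme rests on a pigeonhole argument: if a string is split into tau+1 segments, any string within edit distance tau must contain at least one of those segments verbatim, so substring probes against a segment index yield a complete candidate set, which is then verified exactly. The sketch below probes all substrings of the query; the paper's contribution is selecting far fewer, so treat this as the naive baseline variant.

```python
# Illustrative Pass-Join-style pipeline: partition, index, probe, verify.
# The "probe every substring" step is the naive variant, for clarity only.
def segments(s, tau):
    n, parts, start = len(s), [], 0
    for i in range(tau + 1):                      # tau+1 near-equal segments
        size = n // (tau + 1) + (1 if i < n % (tau + 1) else 0)
        parts.append(s[start:start + size])
        start += size
    return parts

def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def pass_join_like(strings, queries, tau):
    index = {}
    for s in strings:                             # inverted index on segments
        for seg in segments(s, tau):
            index.setdefault(seg, set()).add(s)
    matches = set()
    for q in queries:
        cands = set()
        for i in range(len(q)):                   # naive: probe every substring
            for j in range(i + 1, len(q) + 1):
                cands |= index.get(q[i:j], set())
        matches |= {(q, c) for c in cands if edit_distance(q, c) <= tau}
    return matches

found = pass_join_like(["kitten", "mitten"], ["sitten"], tau=1)
print(found)
```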
Efficient K-Nearest Neighbor Graph Construction for Generic Similarity Measures
Cited by 26 (0 self)
K-Nearest Neighbor Graph (K-NNG) construction is an important operation with many web-related applications, including collaborative filtering, similarity search, and many others in data mining and machine learning. Existing methods for K-NNG construction either do not scale or are specific to certain similarity measures. We present NN-Descent, a simple yet efficient algorithm for approximate K-NNG construction with arbitrary similarity measures. Our method is based on local search, has minimal space overhead, and does not rely on any shared global index. Hence, it is especially suitable for large-scale applications where data structures need to be distributed over the network. We have shown with a variety of datasets and similarity measures that the proposed method typically converges to above 90% recall, with each point compared against only a few percent of the whole dataset on average.
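The local-search loop can be sketched as follows: start each point with random neighbor candidates, then repeatedly try neighbors-of-neighbors (and reverse neighbors) as better candidates until no list improves. This is a compact approximation of NN-Descent, not the paper's sampled, incremental version, and it works with an arbitrary distance function.

```python
# Compact NN-Descent-style local search; parameters are illustrative and the
# full algorithm adds sampling and incremental bookkeeping this sketch omits.
import random

def nn_descent(points, dist, k, seed=0):
    rng = random.Random(seed)
    ids = list(range(len(points)))
    knn = {i: rng.sample([j for j in ids if j != i], k) for i in ids}
    improved = True
    while improved:
        improved = False
        for i in ids:
            cands = set(knn[i])
            for j in list(cands):                 # neighbors of neighbors
                cands |= set(knn[j])
            for j in ids:                         # reverse neighbors
                if i in knn[j]:
                    cands.add(j)
            cands.discard(i)
            best = sorted(cands, key=lambda j: (dist(points[i], points[j]), j))[:k]
            if best != knn[i]:                    # keep only if strictly better
                knn[i], improved = best, True
    return knn

pts = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
g = nn_descent(pts, dist=lambda a, b: abs(a - b), k=2)
print(g[0], g[3])
```

Each list can only improve under the deterministic (distance, id) order, so the loop terminates; no global index is ever consulted, matching the abstract's claim.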
V-SMART-Join: A Scalable MapReduce Framework for All-Pair Similarity Joins of Multisets and Vectors
Cited by 25 (1 self)
This work proposes V-SMART-Join, a scalable MapReduce-based framework for discovering all pairs of similar entities. The V-SMART-Join framework is applicable to sets, multisets, and vectors. V-SMART-Join is motivated by the observed skew in the underlying distributions of Internet traffic, and is a family of 2-stage algorithms in which the first stage computes and joins partial results, and the second stage computes the similarity exactly for all candidate pairs. The V-SMART-Join algorithms are very efficient and scale in the number of entities as well as their cardinalities. They were up to 30 times faster than the state-of-the-art algorithm, VCL, when compared on a small real dataset. We also established the scalability of the proposed algorithms by running them on a dataset of realistic size, on which VCL never managed to finish. Experiments were run on real datasets of IPs and cookies, where each IP is represented as a multiset of cookies and the goal is to discover similar IPs in order to identify Internet proxies.
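The 2-stage shape described above can be illustrated loosely on the paper's IP/cookie scenario: stage 1 joins inverted lists per token to produce candidate pairs, and stage 2 computes the exact multiset similarity for each candidate. The function names and the Jaccard choice are illustrative assumptions, not the paper's exact algorithms.

```python
# Loose two-stage sketch in the spirit of the abstract; names and the
# multiset-Jaccard measure are illustrative, not the paper's exact design.
from collections import Counter
from itertools import combinations

def two_stage_multiset_join(entities, threshold):
    inverted = {}                                  # stage 1: token -> entity ids
    for eid, ms in entities.items():
        for tok in ms:
            inverted.setdefault(tok, set()).add(eid)
    candidates = set()
    for eids in inverted.values():                 # join the partial results
        candidates |= set(combinations(sorted(eids), 2))
    out = {}
    for a, b in candidates:                        # stage 2: exact similarity
        ca, cb = Counter(entities[a]), Counter(entities[b])
        inter = sum((ca & cb).values())            # multiset intersection size
        union = sum((ca | cb).values())            # multiset union size
        sim = inter / union
        if sim >= threshold:
            out[(a, b)] = sim
    return out

ips = {"ip1": ["c1", "c1", "c2"], "ip2": ["c1", "c2", "c2"], "ip3": ["c9"]}
sims = two_stage_multiset_join(ips, threshold=0.4)
print(sims)
```

Note how the skew motivation surfaces even here: a hot cookie shared by many IPs inflates the candidate set, which is exactly what the paper's algorithms are engineered to handle.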
Fast-join: An efficient method for fuzzy token matching based string similarity join
 In: ICDE
, 2011
Cited by 21 (11 self)
Abstract—A string similarity join, which finds similar string pairs between two string sets, is an essential operation in many applications and has attracted significant attention recently in the database community. A significant challenge in similarity joins is to implement an effective fuzzy match operation that finds all similar string pairs, which may not match exactly. In this paper, we propose a new similarity metric, called "fuzzy token matching based similarity", which extends token-based similarity functions (e.g., Jaccard similarity and cosine similarity) by allowing fuzzy matches between two tokens. We study the similarity-join problem under this new metric and present a signature-based method to address it. We propose new signature schemes and develop effective pruning techniques to improve performance. Experimental results show that our approach achieves high efficiency and result quality, and significantly outperforms state-of-the-art methods.
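The metric can be sketched concretely: two token sets get a Jaccard-style score in which tokens may match approximately (edit similarity at least delta) rather than exactly. The paper computes a maximum-weight matching between tokens; the sketch below greedily pairs the most similar tokens first, which is only an approximation of that matching.

```python
# Hedged sketch of fuzzy-token-matching similarity; greedy pairing stands in
# for the paper's maximum-weight matching, and delta=0.8 is an example value.
def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def edit_similarity(a, b):
    return 1 - edit_distance(a, b) / max(len(a), len(b))

def fuzzy_jaccard(t1, t2, delta=0.8):
    pairs = sorted(((edit_similarity(a, b), a, b) for a in t1 for b in t2),
                   reverse=True)
    used1, used2, score = set(), set(), 0.0
    for sim, a, b in pairs:                        # greedy matching, best first
        if sim >= delta and a not in used1 and b not in used2:
            used1.add(a)
            used2.add(b)
            score += sim
    return score / (len(t1) + len(t2) - score)     # fuzzy-Jaccard normalization

score = fuzzy_jaccard({"nba", "mcgrady"}, {"macgrady", "nba"})
print(round(score, 3))
```

Plain Jaccard would score these two token sets at 1/3 because "mcgrady" and "macgrady" fail the exact match; the fuzzy variant credits their closeness.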
Efficient Parallel kNN Joins for Large Data in MapReduce
Cited by 20 (2 self)
In data mining applications and in spatial and multimedia databases, a useful tool is the kNN join, which produces the k nearest neighbors (NN), from a dataset S, of every point in a dataset R. Since it involves both the join and the NN search, performing kNN joins efficiently is a challenging task. Meanwhile, applications continue to witness a rapid (in some cases exponential) increase in the amount of data to be processed. A popular model nowadays for large-scale data processing is the shared-nothing cluster of commodity machines using MapReduce [6]. Hence, how to execute kNN joins efficiently on large data stored in a MapReduce cluster is an intriguing problem that meets many practical needs. This work proposes novel (exact and approximate) algorithms in MapReduce to perform efficient parallel kNN joins on large data. We demonstrate our ideas using Hadoop. Extensive experiments on large real and synthetic datasets, with tens or hundreds of millions of records in both R and S and up to 30 dimensions, demonstrate the efficiency, effectiveness, and scalability of our methods.
Upper and Lower Bounds on the Cost of a MapReduce Computation
Cited by 20 (1 self)
In this paper we study the tradeoff between parallelism and communication cost in a MapReduce computation. For any problem that is not "embarrassingly parallel", the finer we partition the work of the reducers so that more parallelism can be extracted, the greater will be the total communication between mappers and reducers. We introduce a model of problems that can be solved in a single round of MapReduce computation. This model enables a generic recipe for discovering lower bounds on communication cost as a function of the maximum number of inputs that can be assigned to one reducer. We use the model to analyze the tradeoff for three problems: finding pairs of strings at Hamming distance d, finding triangles and other patterns in a larger graph, and matrix multiplication. For strings at Hamming distance 1, we have upper and lower bounds that match exactly. For triangles and many other graphs, we have upper and lower bounds that match to within a constant factor. For matrix multiplication, we have matching upper and lower bounds for one-round MapReduce algorithms. We also explore two-round MapReduce algorithms for matrix multiplication, and show that these never incur more communication, for a given reducer size, than the best one-round algorithm, and often incur significantly less.
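The Hamming-distance-1 upper bound admits a short worked example, under the standard splitting construction: divide each b-bit string into two halves; two strings at distance 1 agree exactly on one half, so sending every string to the two keys (half-index, half-value) guarantees each such pair meets at some reducer, at replication rate 2. The code below is a single-process illustration of that mapping, not the paper's full model.

```python
# Worked sketch of the halving construction for Hamming distance 1:
# replication rate 2, and every distance-1 pair meets in some bucket.
from itertools import product

def hamming1_pairs(strings):
    reducers = {}
    for s in strings:
        h = len(s) // 2
        for idx, half in ((0, s[:h]), (1, s[h:])):   # two keys per string
            reducers.setdefault((idx, half), []).append(s)
    found = set()
    for bucket in reducers.values():                 # each reducer scans its bucket
        for a in bucket:
            for b in bucket:
                if a < b and sum(x != y for x, y in zip(a, b)) == 1:
                    found.add((a, b))
    return found

all4 = ["".join(bits) for bits in product("01", repeat=4)]
pairs = hamming1_pairs(all4)
print(len(pairs))
```

For all 16 strings of length 4, every one of the 16 * 4 / 2 = 32 distance-1 pairs is recovered, matching the brute-force answer.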