Efficient similarity joins for near duplicate detection
- In WWW, 2008
Cited by 32 (5 self)
With the increasing amount of data and the need to integrate data from multiple data sources, one of the challenging issues is to identify near duplicate records efficiently. In this paper, we focus on efficient algorithms to find pairs of records such that their similarities are no less than a given threshold. Several existing algorithms rely on the prefix filtering principle to avoid computing similarity values for all possible pairs of records. We propose new filtering techniques by exploiting the token ordering information; they are integrated into the existing methods and drastically reduce the candidate sizes and hence improve the efficiency. We have also studied the implementation of our proposed algorithm in stand-alone and RDBMS-based settings. Experimental results show our proposed algorithms can outperform previous algorithms on several real datasets.
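The prefix filtering principle mentioned in this abstract can be illustrated with a minimal sketch. It assumes Jaccard similarity over token sets, a threshold in (0, 1], string tokens, and a global token order of rarest token first; the function names are hypothetical, and the sketch covers only baseline prefix filtering, not the additional filters the paper proposes.

```python
from collections import defaultdict
from math import ceil


def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 if both are empty)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def prefix_filter_join(records, threshold):
    """Return sorted index pairs (i, j), i < j, with Jaccard similarity >= threshold."""
    sets = [set(r) for r in records]

    # Global token order: rare tokens first, ties broken alphabetically.
    freq = defaultdict(int)
    for s in sets:
        for tok in s:
            freq[tok] += 1
    ordered = [sorted(s, key=lambda t: (freq[t], t)) for s in sets]

    index = defaultdict(list)  # token -> ids of earlier records with it in their prefix
    pairs = []
    for i, toks in enumerate(ordered):
        if not toks:
            continue
        # A record of size n needs overlap >= ceil(threshold * n) with any match,
        # so a match must share a token with its first n - ceil(threshold * n) + 1
        # tokens. The small epsilon guards against floating-point round-up.
        prefix_len = len(toks) - ceil(threshold * len(toks) - 1e-9) + 1
        candidates = set()
        for tok in toks[:prefix_len]:
            candidates.update(index[tok])
            index[tok].append(i)
        # Verification: exact similarity is computed only for surviving candidates.
        for j in candidates:
            if jaccard(sets[i], sets[j]) >= threshold:
                pairs.append((j, i))
    return sorted(pairs)
```

Only records that share a prefix token are ever compared, which is how the all-pairs similarity computation is avoided.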
Ranking-based clustering of heterogeneous information networks with star network schema
- In: Proc. 2009 ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD 2009), 2009
Cited by 16 (13 self)
A heterogeneous information network is an information network composed of multiple types of objects. Clustering on such a network may lead to better understanding of both hidden structures of the network and the individual role played by every object in each cluster. However, although clustering on homogeneous networks has been studied over decades, clustering on heterogeneous networks has not been addressed until recently. A recent study proposed a new algorithm, RankClus, for clustering on bi-typed heterogeneous networks. However, a real-world network may consist of more than two types, and the interactions among multi-typed objects play a key role in disclosing the rich semantics that a network carries. In this paper, we study clustering of multi-typed heterogeneous networks with a star network schema and propose a novel algorithm, NetClus, that utilizes links across multi-typed objects to generate high-quality net-clusters. An iterative enhancement method is developed that leads to effective ranking-based clustering in such heterogeneous networks. Our experiments on DBLP data show that NetClus generates more accurate clustering results than the baseline topic model algorithm PLSA and the recently proposed algorithm, RankClus. Further, NetClus generates informative clusters, presenting good ranking and cluster membership information for each attribute object in each net-cluster.
PathSim: Meta path-based top-k similarity search in heterogeneous information networks
- In VLDB’11, 2011
Cited by 3 (3 self)
Similarity search is a primitive operation in database and Web search engines. With the advent of large-scale heterogeneous information networks that consist of multi-typed, interconnected objects, such as the bibliographic networks and social media networks, it is important to study similarity search in such networks. Intuitively, two objects are similar if they are linked by many paths in the network. However, most existing similarity measures are defined for homogeneous networks. Different semantic meanings behind paths are not taken into consideration. Thus they cannot be directly applied to heterogeneous networks. In this paper, we study similarity search that is defined among the same type of objects in heterogeneous networks. Moreover, by considering different linkage paths in a network, one could derive various similarity semantics. Therefore, we introduce the concept
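The intuition that different linkage paths yield different similarity semantics can be made concrete with a small path-counting sketch. The abstract is cut off before the paper's own measure is defined, so the code below only counts paths along a chosen sequence of link types and does not reproduce that measure; the toy bibliographic network and all names are hypothetical.

```python
from collections import Counter, defaultdict


def invert(adj):
    """Reverse an adjacency map, e.g. author -> papers becomes paper -> authors."""
    rev = defaultdict(list)
    for src, targets in adj.items():
        for dst in targets:
            rev[dst].append(src)
    return dict(rev)


def path_counts(start, hops):
    """Count paths from `start` that follow the given sequence of adjacency maps."""
    frontier = Counter({start: 1})
    for adj in hops:
        nxt = Counter()
        for node, n in frontier.items():
            for neighbour in adj.get(node, ()):
                nxt[neighbour] += n
        frontier = nxt
    return frontier


# Toy network (made up for illustration): authors, papers and venues.
author_paper = {"ann": ["p1", "p2"], "bob": ["p1"], "cat": ["p3"]}
paper_venue = {"p1": ["KDD"], "p2": ["VLDB"], "p3": ["VLDB"]}
paper_author = invert(author_paper)
venue_paper = invert(paper_venue)

# author-paper-author: connects authors who wrote a paper together.
coauthor = path_counts("ann", [author_paper, paper_author])
# author-paper-venue-paper-author: connects authors who publish in the same venue.
same_venue = path_counts("ann", [author_paper, paper_venue, venue_paper, paper_author])
```

In this toy data "cat" is unreachable from "ann" along the co-author path but reachable along the shared-venue path, so the two paths rank the same pair of authors differently.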
Finding associations and computing similarity via biased pair sampling
- Knowledge and Information Systems (to appear)
Cited by 2 (2 self)
Sampling-based methods have previously been proposed for the problem of finding interesting associations in data, even for low-support items. While these methods do not guarantee precise results, they can be vastly more efficient than approaches that rely on exact counting. However, for many similarity measures no such methods have been known. In this paper we show how a wide variety of measures can be supported by a simple biased sampling method. The method also extends to find high-confidence association rules. We demonstrate theoretically that our method is superior to exact methods when the threshold for “interesting similarity/confidence” is above the average pairwise similarity/confidence, and the average support is not too low. Our method is particularly good when transactions contain many items. We confirm in experiments on standard association mining benchmarks that this gives a significant speedup on real data sets (sometimes much larger than the theoretical guarantees). Reductions in computation time of over an order of magnitude, and significant savings in space, are observed. Keywords: algorithms; sampling; data mining; association rules.
Bed-Tree: An All-Purpose Index Structure for String Similarity Search Based on Edit Distance
Cited by 2 (0 self)
Strings are ubiquitous in computer systems and hence string processing has attracted extensive research effort from computer scientists in diverse areas. One of the most important problems in string processing is to efficiently evaluate the similarity between two strings based on a specified similarity measure. String similarity search is a fundamental problem in information retrieval, database cleaning, biological sequence analysis, and more. While a large number of dissimilarity measures on strings have been proposed, edit distance is the most popular choice in a wide spectrum of applications. Existing indexing techniques for similarity search queries based on edit distance, e.g., approximate selection and join queries, rely mostly on n-gram signatures coupled with inverted list structures. These techniques are tailored for specific query types only, and their performance remains unsatisfactory especially in scenarios with strict memory constraints or frequent data updates. In this paper we propose the Bed-tree, a B+-tree based index structure for evaluating all types of similarity queries on edit distance and normalized edit distance. We identify the necessary properties of a mapping from the string space to the integer space for supporting searching and pruning for these queries. Three transformations are proposed that capture different aspects of information inherent in strings, enabling efficient pruning during the search process on the tree. Compared to state-of-the-art methods on string similarity search, the Bed-tree is a complete solution that meets the requirements of all applications, providing high scalability and fast response time.
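For reference, the edit distance that these queries are defined on can be computed with the standard dynamic program. The sketch below pairs it with a brute-force range query, which is the linear scan that an index is meant to avoid; it is not the index structure from the paper, and the function names are hypothetical.

```python
def edit_distance(s, t):
    """Minimum number of single-character insertions, deletions and substitutions."""
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, cs in enumerate(s, 1):
        curr = [i]  # turning s[:i] into the empty string takes i deletions
        for j, ct in enumerate(t, 1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def range_query(strings, query, max_dist):
    """Brute-force selection: every string within max_dist edits of the query."""
    return [s for s in strings if edit_distance(s, query) <= max_dist]
```

Each comparison costs time proportional to the product of the two string lengths, which is why pruning most of the collection before computing distances matters.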
Weighted Set-Based String Similarity
Consider a universe of tokens, each of which is associated with a weight, and a database consisting of strings that can be represented as subsets of these tokens. Given a query string, also represented as a set of tokens, a weighted string similarity query identifies all strings in the database whose similarity to the query is larger than a user-specified threshold. Weighted string similarity queries are useful in applications like data cleaning and integration for finding approximate matches in the presence of typographical mistakes, multiple formatting conventions, data transformation errors, etc. We show that this problem has semantic properties that can be exploited to design index structures that support very efficient algorithms for query answering.
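The query described here can be sketched as a linear scan. The abstract does not name the similarity function, so weighted Jaccard is used purely as an assumption, and the token weights and function names are hypothetical; the index structures the paper refers to are not shown.

```python
def weighted_jaccard(a, b, weight):
    """Weight of the shared tokens divided by the weight of all tokens (assumed measure)."""
    union = sum(weight[t] for t in a | b)
    return sum(weight[t] for t in a & b) / union if union else 0.0


def similarity_query(database, query, weight, threshold):
    """Return every string (as a token tuple) whose similarity to the query exceeds threshold."""
    q = set(query)
    return [s for s in database if weighted_jaccard(set(s), q, weight) > threshold]


# Made-up weights: rare, informative tokens weigh more than common ones.
weight = {"main": 4, "oak": 4, "st": 1, "street": 1}
database = [("main", "st"), ("main", "street"), ("oak", "st")]
matches = similarity_query(database, ("main", "st"), weight, 0.5)
```

With these weights, ("main", "street") still matches the query ("main", "st") because the heavy token is shared, while ("oak", "st") shares only a light token and falls below the threshold.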
Efficient Processing of Exact Top-k Queries over Sorted Lists
The top-k query is employed in a wide range of applications to generate a ranked list of data that have the highest aggregate scores over certain attributes. As the pool of attributes for selection by individual queries may be large, the data are indexed with per-attribute sorted lists, and a threshold algorithm is applied on the lists involved in each query. The threshold algorithm executes in two phases: find a cut-off threshold for the top-k result scores, then evaluate all the records that could score above the threshold. In this paper, we focus on exact top-k queries that involve monotonic linear scoring functions over disk-resident sorted lists. We introduce a model for estimating the depths to which each sorted list needs to be processed
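As background for the threshold idea, here is a minimal sketch of the textbook threshold algorithm with random access, not the two-phase variant or the depth-estimation model of the paper. It assumes in-memory lists sorted by descending score, every record present in every list, and non-negative weights; the function name and data layout are hypothetical.

```python
import heapq


def threshold_topk(lists, weights, k):
    """Top-k records under a weighted-sum score.

    lists: one list per attribute of (record_id, score), sorted by score descending.
    Returns (record_id, total) pairs, best first.
    """
    lookup = [dict(lst) for lst in lists]  # random access to each attribute score
    seen = set()
    top = []  # min-heap of (total, record_id) holding the best k seen so far
    for depth in range(len(lists[0])):
        for lst in lists:
            rid = lst[depth][0]
            if rid in seen:
                continue
            seen.add(rid)
            total = sum(w * lk[rid] for w, lk in zip(weights, lookup))
            if len(top) < k:
                heapq.heappush(top, (total, rid))
            elif total > top[0][0]:
                heapq.heapreplace(top, (total, rid))
        # No unseen record can score above the aggregate of the scores at this depth.
        threshold = sum(w * lst[depth][1] for w, lst in zip(weights, lists))
        if len(top) == k and top[0][0] >= threshold:
            break
    return [(rid, total) for total, rid in sorted(top, key=lambda p: (-p[0], p[1]))]
```

The scan stops as soon as the k-th best score reaches the threshold, so the tail of each sorted list is never read.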
Efficient fuzzy full-text type-ahead search
Traditional information systems return answers after a user submits a complete query. Users often feel “left in the dark” when they have limited knowledge about the underlying data and have to use a try-and-see approach for finding information. A recent trend of supporting autocomplete in these systems is a first step toward solving this problem. In this paper, we study a new information-access paradigm, called “type-ahead search”, in which the system searches the underlying data “on the fly” as the user types in query keywords. It extends autocomplete interfaces by allowing keywords to appear at different places in the underlying data. This framework allows users to explore data as they type, even in the presence of minor errors. We study research challenges in this framework for large amounts of data. Since each keystroke of the user could invoke a query on the backend, we need efficient algorithms to process each query within milliseconds. We develop various incremental-search algorithms for both single-keyword queries and multi-keyword queries, using previously computed and cached results in order to achieve a high interactive speed. We develop novel techniques to support fuzzy search by allowing mismatches
Supervisor
I would like to thank Professor Rasmus Pagh for these years of research during which he has been a precious guide. If I have gained any skill as a researcher during this period in time, it is because of his lessons and his advice. Special thanks go to Professor Gerth Brodal, Professor Thore Husfeldt and Professor Piotr Indyk for having accepted the burden of being in my assessment committee. I thank Professor Ronitt Rubinfeld for having been a very nice host and for the great support that she has given me in order to deal with a completely new research topic. Thanks go to the Mitteleuropa for the nice people, maybe friends. A special thought goes to the one person who, in these years, made a better being of me and then killed me. I fear I did the same to you. Finally, my mind turns towards those who have given me the strength and resources to walk all the path to this point. In my brain I have

