Results 1 - 10
of
52
Large-scale deduplication with constraints using dedupalog
- in: Proceedings of the 25th International Conference on Data Engineering (ICDE
"... Abstract — We present a declarative framework for collective deduplication of entity references in the presence of constraints. Constraints occur naturally in many data cleaning domains and can improve the quality of deduplication. An example of a constraint is “each paper has a unique publication v ..."
Abstract
-
Cited by 16 (1 self)
- Add to MetaCart
Abstract — We present a declarative framework for collective deduplication of entity references in the presence of constraints. Constraints occur naturally in many data cleaning domains and can improve the quality of deduplication. An example of a constraint is “each paper has a unique publication venue”; iftwo paper references are duplicates, then their associated conference references must be duplicates as well. Our framework supports collective deduplication, meaning that we can dedupe both paper references and conference references collectively in the example above. Our framework is based on a simple declarative Datalogstyle language with precise semantics. Most previous work on deduplication either ignore constraints or use them in an ad-hoc domain-specific manner. We also present efficient algorithms to support the framework. Our algorithms have precise theoretical guarantees for a large subclass of our framework. We show, using a prototype implementation, that our algorithms scale to very large datasets. We provide thorough experimental results over real-world data demonstrating the utility of our framework for high-quality and scalable deduplication. I.
Query-time entity resolution
- In The ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD
, 2006
"... Entity resolution is the problem of reconciling database references corresponding to the same real-world entities. Given the abundance of publicly available databases that have unresolved entities, we motivate the problem of query-time entity resolution: quick and accurate resolution for answering q ..."
Abstract
-
Cited by 12 (3 self)
- Add to MetaCart
Entity resolution is the problem of reconciling database references corresponding to the same real-world entities. Given the abundance of publicly available databases that have unresolved entities, we motivate the problem of query-time entity resolution: quick and accurate resolution for answering queries over such ‘unclean ’ databases at query-time. Since collective entity resolution approaches — where related references are resolved jointly — have been shown to be more accurate than independent attribute-based resolution for off-line entity resolution, we focus on developing new algorithms for collective resolution for answering entity resolution queries at query-time. For this purpose, we first formally show that, for collective resolution, precision and recall for individual entities follow a geometric progression as neighbors at increasing distances are considered. Unfolding this progression leads naturally to a two stage ‘expand and resolve ’ query processing strategy. In this strategy, we first extract the related records for a query using two novel expansion operators, and then resolve the extracted records collectively. We then show how the same strategy can be adapted for query-time entity resolution by identifying and resolving only those database references that are the most helpful for processing the query. We validate our approach on two large real-world publication databases where we show the usefulness of collective resolution and at the same time demonstrate the need for adaptive strategies for query processing. We then show how the same queries can be answered in real-time using our adaptive approach while preserving the gains of collective resolution. In addition to experiments on real datasets, we use synthetically generated data to empirically demonstrate the validity of the performance trends predicted by our analysis of collective entity resolution over a wide range of structural characteristics in the data. 1.
Web people search via connection analysis
- IEEE Transactions on Knowledge and Data Engineering (IEEE TKDE
, 2008
"... Abstract—Nowadays, searches for the web pages of a person with a given name constitute a notable fraction of queries to Web search engines. Such a query would normally return web pages related to several namesakes, who happened to have the queried name, leaving the burden of disambiguating and colle ..."
Abstract
-
Cited by 7 (5 self)
- Add to MetaCart
Abstract—Nowadays, searches for the web pages of a person with a given name constitute a notable fraction of queries to Web search engines. Such a query would normally return web pages related to several namesakes, who happened to have the queried name, leaving the burden of disambiguating and collecting pages relevant to a particular person (from among the namesakes) on the user. In this paper, we develop a Web People Search approach that clusters web pages based on their association to different people. Our method exploits a variety of semantic information extracted from web pages, such as named entities and hyperlinks, to disambiguate among namesakes referred to on the web pages. We demonstrate the effectiveness of our approach by testing the efficacy of the disambiguation algorithms and its impact on person search. Index Terms—Web people search, entity resolution, graph-based disambiguation, social network analysis, clustering. Ç 1
Automatic Record Linkage using Seeded Nearest Neighbour and Support Vector Machine Classification
"... The task of linking databases is an important step in an increasing number of data mining projects, because linked data can contain information that is not available otherwise, or that would require time-consuming and expensive collection of specific data. The aim of linking is to match and aggregat ..."
Abstract
-
Cited by 5 (2 self)
- Add to MetaCart
The task of linking databases is an important step in an increasing number of data mining projects, because linked data can contain information that is not available otherwise, or that would require time-consuming and expensive collection of specific data. The aim of linking is to match and aggregate all records that refer to the same entity. One of the major challenges when linking large databases is the efficient and accurate classification of record pairs into matches and non-matches. While traditionally classification was based on manually-set thresholds or on statistical procedures, many of the more recently developed classification methods are based on supervised learning techniques. They therefore require training data, which is often not available in real world situations or has to be prepared manually, an expensive, cumbersome and time-consuming process. A novel two-step approach to automatic record pair classification has previously been presented by the author. In its first step, training examples of high quality are automatically selected from the compared record pairs, and used in the second step to train a support vector machine (SVM) classifier. Initial experiments showed the feasibility of this approach, achieving results that outperformed k-means clustering. Two variations of this approach are presented in this paper. The first is based on a nearest-neighbour classifier, while the second improves a SVM classifier by iteratively adding more examples into the training sets. Experimental results show that this two-step approach can achieve better classification results than other unsupervised approaches.
M.: Linked movie data base
- In: Workshop on Linked Data on the Web (LDOW 2009
, 2009
"... The Linked Movie Database (LinkedMDB) project provides a demonstration of the first open linked dataset connecting several major existing (and highly popular) movie web resources. The database exposed by LinkedMDB contains millions of RDF triples with hundreds of thousands of RDF links to existing w ..."
Abstract
-
Cited by 5 (0 self)
- Add to MetaCart
The Linked Movie Database (LinkedMDB) project provides a demonstration of the first open linked dataset connecting several major existing (and highly popular) movie web resources. The database exposed by LinkedMDB contains millions of RDF triples with hundreds of thousands of RDF links to existing web data sources that are part of the growing Linking Open Data cloud, as well as to popular movierelated web pages such as IMDb. LinkedMDB uses a novel way of creating and maintaining large quantities of high quality links by employing state-of-the-art approximate join techniques for finding links, and providing additional RDF metadata about the quality of the links and the techniques used for deriving them.
Modeling and Querying Possible Repairs in Duplicate Detection
"... One of the most prominent data quality problems is the existence of duplicate records. Current duplicate elimination procedures usually produce one clean instance (repair) of the input data, by carefully choosing the parameters of the duplicate detection algorithms. Finding the right parameter setti ..."
Abstract
-
Cited by 5 (2 self)
- Add to MetaCart
One of the most prominent data quality problems is the existence of duplicate records. Current duplicate elimination procedures usually produce one clean instance (repair) of the input data, by carefully choosing the parameters of the duplicate detection algorithms. Finding the right parameter settings can be hard, and in many cases, perfect settings do not exist. Furthermore, replacing the input dirty data with one possible clean instance may result in unrecoverable errors, for example, identification and merging of possible duplicate records in health care systems. In this paper, we treat duplicate detection procedures as data processing tasks with uncertain outcomes. We concentrate on a family of duplicate detection algorithms that are based on parameterized clustering. We propose a novel uncertainty model that compactly encodes the space of possible repairs corresponding to different parameter settings. We show how to efficiently support relational queries under our model, and to allow new types of queries on the set of possible repairs. We give an experimental study illustrating the scalability and the efficiency of our techniques in different configurations. 1.
Automatic Training Example Selection for Scalable Unsupervised Record Linkage
"... Abstract. Linking records from two or more databases is becoming increasingly important in the data preparation step of many data mining projects, as linked data can enable analysts to conduct studies that are not feasible otherwise, or that would require expensive and timeconsuming collection of sp ..."
Abstract
-
Cited by 4 (3 self)
- Add to MetaCart
Abstract. Linking records from two or more databases is becoming increasingly important in the data preparation step of many data mining projects, as linked data can enable analysts to conduct studies that are not feasible otherwise, or that would require expensive and timeconsuming collection of specific data. The aim of such linkages is to match all records that refer to the same entity. One of the main challenges in record linkage is the accurate classification of record pairs into matches and non-matches. With traditional techniques, classification thresholds have to be set either manually or using an EM-based approach. Many modern classification techniques, on the other hand, are based on supervised machine learning and thus require training data, which is often not available in real world situations. A novel two-step approach to unsupervised record pair classification is presented in this paper. In the first step, training examples are selected automatically, and in the second step these examples are used to train a binary classifier. An experimental evaluation shows that this approach can outperform k-means clustering and can also be much faster than other classification techniques.
A Query Language for Analyzing Networks
"... With more and more large networks becoming available, mining and querying such networks are increasingly important tasks which are not being supported by database models and querying languages. This paper wants to alleviate this situation by proposing a data model and a query language for facilitati ..."
Abstract
-
Cited by 3 (1 self)
- Add to MetaCart
With more and more large networks becoming available, mining and querying such networks are increasingly important tasks which are not being supported by database models and querying languages. This paper wants to alleviate this situation by proposing a data model and a query language for facilitating the analysis of networks. Key features include support for executing external tools on the networks, flexible contexts on the network each resulting in a different graph, primitives for querying subgraphs (including paths) and transforming graphs. The data model provides for a closure property, in which the output of every query can be stored in the database and used for further querying.

