Parallel Computing Techniques for High-Performance Probabilistic Record Linkage (0)

by P Christen

Results 1 - 2 of 2

P-Swoosh: Parallel Algorithm for Generic Entity Resolution

by Hideki Kawai, Hector Garcia-Molina, Omar Benjelloun, David Menestrina, Euijong Whang, Heng Gong, 2006
Cited by 5 (1 self)
Entity Resolution (ER) is a problem that arises in many information integration applications. The ER process identifies duplicated records that refer to the same real-world entity (match process), and derives composite information about the entity (merge process). Additionally, a merged record can recursively match other records. Since the ER process is typically compute-intensive, it is important to distribute the ER workload across multiple processors. In this paper, we propose a parallel algorithm for ER, P-Swoosh, which uses generic match and merge functions and allows load balancing between processors. Our evaluation results using Yahoo! shopping data demonstrate almost linear scalability from 2 to 15 processors.

Citation Context

... multiple processors for a very large record set. The simplest approach for the parallelism is splitting the initial records into small pieces using “blocking” techniques (blocking-based parallelism) [9]. In this approach, the initial record set is split according to the values of blocking attribute. For example, customer records with the same age are moved into separate blocks, and only records with...
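
The blocking step described in this context can be illustrated with a minimal Python sketch. It is not code from P-Swoosh or from [9]; the function names, the sample customer records, and the choice of "age" as the blocking attribute are all assumptions made for illustration. Records are grouped by the value of one attribute, and candidate pairs are generated only inside each block.

```python
from collections import defaultdict
from itertools import combinations


def block_records(records, key):
    """Group records by the value of one blocking attribute (hypothetical helper)."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[rec[key]].append(rec)
    return dict(blocks)


def candidate_pairs(blocks):
    """Yield record pairs within each block; records in different blocks are never paired."""
    for recs in blocks.values():
        yield from combinations(recs, 2)


# Made-up sample data, for illustration only.
customers = [
    {"id": 1, "name": "Ann Lee", "age": 34},
    {"id": 2, "name": "A. Lee", "age": 34},
    {"id": 3, "name": "Bob Ray", "age": 51},
    {"id": 4, "name": "Anne Lee", "age": 34},
]

blocks = block_records(customers, "age")
pairs = [(a["id"], b["id"]) for a, b in candidate_pairs(blocks)]
print(pairs)
```

Because no pair spans two blocks, each block could in principle be handed to a different processor without any communication during matching.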

Effective Blocking for Combining Multiple Entity Resolution Systems

by Aye Chan Mon, Mie Mie, Su Thwin
An important aspect of maintaining information quality in data repositories is determining which sets of records refer to the same real-world entity. This so-called entity resolution problem comes up frequently in data cleaning and integration. Entity Resolution (ER) is a problem that arises in many information integration applications. The ER process identifies duplicated records that refer to the same real-world entity, and derives composite information about the entity. The cost of the ER process is high. In this paper, input data is split according to the blocking variables. As no comparisons are conducted between different blocks, each block can be processed independently from all others. Blocks can contain different numbers of records, which results in varying processing times. We propose an effective blocking approach for combining multiple entity resolution systems. Keywords: entity resolution, data integration, data reduction, indexing, pre-processing
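
The remark that blocks of different sizes lead to varying processing times can be made concrete with a small Python sketch. It is not the method of this paper: the cost model (all pairwise comparisons within a block, n(n-1)/2 for n records) and the greedy largest-first assignment are assumptions chosen for illustration, and the block names and sizes are made up.

```python
import heapq


def comparison_cost(block_size):
    """Number of pairwise comparisons inside one block (assumed cost model)."""
    return block_size * (block_size - 1) // 2


def assign_blocks(block_sizes, n_workers):
    """Greedy largest-first assignment: each block goes to the least-loaded worker."""
    heap = [(0, w) for w in range(n_workers)]  # (current load, worker id)
    heapq.heapify(heap)
    assignment = {w: [] for w in range(n_workers)}
    by_cost = sorted(block_sizes.items(),
                     key=lambda kv: comparison_cost(kv[1]),
                     reverse=True)
    for name, size in by_cost:
        load, worker = heapq.heappop(heap)
        assignment[worker].append(name)
        heapq.heappush(heap, (load + comparison_cost(size), worker))
    loads = {w: sum(comparison_cost(block_sizes[b]) for b in names)
             for w, names in assignment.items()}
    return assignment, loads


# Hypothetical block sizes (records per block).
sizes = {"a": 100, "b": 60, "c": 50, "d": 40, "e": 10}
assignment, loads = assign_blocks(sizes, 2)
print(assignment)  # {0: ['a'], 1: ['b', 'c', 'd', 'e']}
print(loads)       # {0: 4950, 1: 3820}
```

Under this cost model the single 100-record block costs more than the other four blocks combined, so it occupies one worker on its own. This shows why one oversized block can limit the speedup of blocking-based parallelism.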

Citation Context

...e of the most important factors for efficient and accurate indexing for record linkage and deduplication is the proper definition of blocking keys. First ideas for parallel matching were described in [11]. P. Christen [11] showed how the record linkage processes can be partitioned and parallelized. M. Michelson and S. A. Macskassy [9] proposed a measurement technique for Entity Resolution. The authors propos...
