P-Swoosh: Parallel Algorithm for Generic Entity Resolution, 2006
Abstract - Cited by 5 (1 self)
Entity Resolution (ER) is a problem that arises in many information integration applications. The ER process identifies duplicate records that refer to the same real-world entity (the match process) and derives composite information about the entity (the merge process). Additionally, a merged record can recursively match other records. Since the ER process is typically compute-intensive, it is important to distribute the ER workload across multiple processors. In this paper, we propose a parallel algorithm for ER, P-Swoosh, which uses generic match and merge functions and allows load balancing between processors. Our evaluation results using Yahoo! shopping data demonstrate almost linear scalability from 2 to 15 processors.
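The match and merge processes described in this abstract can be illustrated with a small sequential sketch. This is not the P-Swoosh algorithm itself, which is parallel and is not specified in the abstract; it only shows how a merged record can go on to match further records. The record layout, the `KEY_ATTRS` set, and the `match`, `merge`, and `resolve` functions are all assumptions made for illustration.

```python
# Sequential sketch of generic entity resolution with pluggable
# match and merge functions. Hypothetical names throughout; this is
# NOT the parallel P-Swoosh algorithm from the paper.

# A record is a frozenset of (attribute, value) pairs.
KEY_ATTRS = {"email", "phone"}  # assumed identifying attributes


def match(a, b):
    """Two records match if they share a value for any key attribute."""
    return any(attr in KEY_ATTRS for attr, _ in a & b)


def merge(a, b):
    """The composite record keeps every (attribute, value) pair seen."""
    return a | b


def resolve(records, match, merge):
    """Repeatedly match and merge until no two records match."""
    pending = list(records)
    resolved = []
    while pending:
        current = pending.pop()
        partner = next((r for r in resolved if match(current, r)), None)
        if partner is None:
            resolved.append(current)
        else:
            # The merged record goes back to pending, because it may
            # now match records that neither original matched alone.
            resolved.remove(partner)
            pending.append(merge(current, partner))
    return resolved


records = [
    frozenset({("name", "J. Doe"), ("email", "jd@example.com")}),
    frozenset({("name", "John Doe"), ("email", "jd@example.com"),
               ("phone", "555-0100")}),
    frozenset({("name", "Johnny"), ("phone", "555-0100")}),
    frozenset({("name", "Alice"), ("email", "alice@example.com")}),
]

entities = resolve(records, match, merge)
for entity in entities:
    print(sorted(entity))
```

In this toy data the first and third records share no key attribute, yet they end up in the same entity because each matches the second record. Passing `match` and `merge` as arguments mirrors the "generic" aspect of the abstract: the resolution loop does not depend on how records are compared or combined.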
Effective Blocking for Combining Multiple Entity Resolution Systems
Abstract
An important aspect of maintaining information quality in data repositories is determining which sets of records refer to the same real-world entity. This so-called entity resolution problem comes up frequently in data cleaning and integration. Entity Resolution (ER) is a problem that arises in many information integration applications. The ER process identifies duplicate records that refer to the same real-world entity and derives composite information about the entity. The cost of the ER process is high. In this paper, input data is split according to the blocking variables. As no comparisons are conducted between different blocks, each block can be processed independently from all others. Blocks can contain different numbers of records, which results in varying processing times. We propose an effective blocking for combining multiple entity resolution systems.
Keywords: entity resolution, data integration, data reduction, indexing, pre-processing
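The blocking step described in this abstract can be sketched as follows. This is a minimal illustration of standard blocking, not the paper's method for combining multiple ER systems, which the abstract does not detail. The record fields, the choice of `zip` as the blocking variable, and the helper names `build_blocks` and `candidate_pairs` are assumptions.

```python
# Minimal blocking sketch: split records by a blocking variable and
# compare only within each block. Hypothetical names and data.
from collections import defaultdict
from itertools import combinations


def build_blocks(records, blocking_key):
    """Group records by the value of the blocking variable."""
    blocks = defaultdict(list)
    for record in records:
        blocks[blocking_key(record)].append(record)
    return dict(blocks)


def candidate_pairs(blocks):
    """Yield record pairs within each block; never across blocks."""
    for key, members in blocks.items():
        for a, b in combinations(members, 2):
            yield key, a, b


records = [
    {"id": 1, "name": "John Doe", "zip": "94305"},
    {"id": 2, "name": "Jon Doe", "zip": "94305"},
    {"id": 3, "name": "J. Doe", "zip": "94305"},
    {"id": 4, "name": "Alice Roe", "zip": "10001"},
    {"id": 5, "name": "Alice Row", "zip": "10001"},
    {"id": 6, "name": "Bob Poe", "zip": "60601"},
]

# Assumed blocking variable: the zip code.
blocks = build_blocks(records, lambda r: r["zip"])
pairs = list(candidate_pairs(blocks))

# Comparisons per block, n * (n - 1) / 2: uneven block sizes give
# uneven processing cost.
block_costs = {key: len(m) * (len(m) - 1) // 2 for key, m in blocks.items()}

all_pairs = len(records) * (len(records) - 1) // 2
print(f"{len(pairs)} comparisons with blocking, {all_pairs} without")
print(block_costs)
```

Because no pair spans two blocks, each block could be handed to a separate worker. The per-block costs show why the abstract mentions varying processing times: the largest block dominates the work.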