  A Duplicate Detection Benchmark for XML (and Relational) Data

by Melanie Weis, Felix Naumann, Franziska Brosy
http://www.informatik.hu-berlin.de/mac/publications/benchmark_iqis06.pdf

Abstract:

Duplicate detection, an important subtask of data cleaning, is the task of identifying multiple representations of the same real-world object. Numerous approaches exist for both relational and XML data. Their goal is either to improve the quality of the detected duplicates (effectiveness) or to save computation time (efficiency). For the first goal in particular, the “goodness” of an approach is usually evaluated through experimental studies. Although some methods and data sets have gained popularity, it is still difficult to compare different approaches or to assess the quality of one’s own approach. This difficulty is mainly due to a lack of documentation of the algorithms, data, software, and hardware used, and/or to limited resources that do not allow rebuilding systems described by others. In this paper, we propose a benchmark for duplicate detection, specialized to XML, which can be part of a broader duplicate detection or even data cleansing benchmark. We discuss all elements necessary to make up a benchmark: data provisioning, clearly defined operations (the benchmark workload), and metrics to evaluate quality. The proposed benchmark is a step toward representative comparisons of duplicate detection algorithms. We note that this benchmark is yet to be implemented and that this paper is meant as a starting point for discussion.
