Hernandez, M., "A generalization of band joins and the merge/purge problem", Ph.D. Thesis, Dept. Computer Science, Columbia Univ. (1996).

This paper is cited in the following contexts:
Information Retrieval on the Web - Kobayashi, Takeda (2000)   (22 citations)

....(MICA) 39 reports weekly on indexing coverage and quality of indexing by a few select search engines, which claim to index at least one fifth of the Web. Other studies on estimating the extent of Web pages which have been indexed by popular search engines include [Baldonado, Winograd 1997], [Hernández 1996], [Hernández, Stolfo 1995], [Hylton 1996], [Monge, Elkan], [Selberg, Etzioni 1997], [Silberschatz et al. 1995]. 35 Snap: www.snap.com 36 Microsoft: www.msn.com 37 Google: www.google.com 38 Euroseek: www.euroseek.com 39 Melee's Indexing Coverage Analysis (MICA) Report: ....

....versions of news and newspaper sites. [Broder et al. 1997] and [Shivakumar, García-Molina 1998] estimate that 30% of Web pages are duplicates or near duplicates. Tools for removing redundant URLs or URLs of near and perfectly identical sites have been investigated by [Baldonado, Winograd 1997], [Hernández 1996], [Hernández, Stolfo 1995], [Hylton 1996], [Monge, Elkan], [Selberg, Etzioni 1997], [Silberschatz et al. 1995]. [Henzinger et al. 1999] has suggested a method for evaluating the quality of pages in a search engine's index. In the past, the volume of pages indexed was used as the primary ....

[Article contains additional citation context not shown here]

Hernandez, M., "A generalization of band joins and the merge/purge problem", Ph.D. Thesis, Dept. Computer Science, Columbia Univ. (1996).


Information Retrieval on the Web: Selected Topics - Kobayashi, Takeda (1999)   (1 citation)

....Northern Light™ http://www.nlsearch.com Excite™ http://www.excite.com Infoseek™ http://www.infoseek.com Lycos™ http://www.lycos.com Several other people have also estimated the extent of Web pages which have been indexed by popular search engines: [Baldonado, Winograd 1997a], [Hernández 1996], [Hernández, Stolfo 1995], [Hylton 1996], [Monge, Elkan], [Silberschatz et al. 1995]. Bharat and Broder estimated that, in November of 1997, the numbers of pages indexed by HotBot, AltaVista, Excite and Infoseek were 77 million, 100 million, 32 million and 17 million, respectively. Furthermore, they ....

.... Multiple copies of identical or near identical pages are abundant (e.g. FAQ 22 postings, mirror sites, old and updated versions of news and newspaper sites). Tools for removing redundant URLs or URLs of near and perfectly identical sites have been investigated by [Baldonado, Winograd 1997a], [Hernández 1996], [Hernández, Stolfo 1995], [Hylton 1996], [Monge, Elkan], [Silberschatz et al. 1995]. The development of effective indexing to aid in filtering is another major class of problems associated with Web based search and retrieval. Removal of spurious information is a particularly challenging problem ....

[Article contains additional citation context not shown here]

Hernandez, M., "A generalization of band joins and the merge/purge problem", Ph.D. Thesis, Dept. Computer Science, Columbia Univ. (1996).


An Efficient Domain-Independent Algorithm for Detecting.. - Monge, Elkan (1997)   (33 citations)

....identifier. We can do this because only a small portion of the total number of records in the database is ever kept in the priority queue. 6. Experimental results. The first experiments reported here use databases that are mailing lists generated randomly by software designed and implemented by [Her96]. Each record in a mailing list contains nine fields: social security number, first name, middle initial, last name, address, apartment, city, state, and zip code. All field values are chosen randomly and independently. Personal names are chosen from a list of 63000 real names. Address fields are ....

.... to frequencies known from previous research on spelling correction algorithms [Kuk92]. Edit distance algorithms are designed to detect some of the errors introduced; however, our algorithm was developed without knowledge of the particular error probabilities used by the database generator of [Her96]. The pairwise record matching algorithm of [HS95] has special rules for transpositions of entire words, complete changes in names and zip codes, and social security number omissions, while our Smith-Waterman algorithm variant does not. Table 1 contains example pairs of records chosen as ....

[Article contains additional citation context not shown here]

Mauricio Hernández. A Generalization of Band Joins and the Merge/Purge Problem. Ph.D. thesis, Columbia University, 1996.


A Generalization of Band Joins and The Merge/Purge Problem - Hernandez (1996)   (4 citations)  Self-citation (Hernández)

....solution methods. The generated data simulates a mailing list and the task of the merge/purge application is to remove duplicates from the generated list. Chapter 5 explores parallel processing of these solutions and briefly explores load balancing issues. In our initial presentation of this work [Hernández, 1995], we proposed the implementation of two other merge/purge applications to test our techniques over different domains. One application involved the implementation of intersection spatial joins while the other involved the re-implementation of the engine for the ALEXSYS expert system. Given the ....

M. A. Hernández. A Generalization of Band-Joins and the Merge/Purge Problem. Technical Report CUCS-005-1995, Department of Computer Science, Columbia University, February 1995.


A Generalization of Band Joins and The Merge/Purge Problem - Hernandez (1996)   (4 citations)  Self-citation (Hernández)

....solution methods. The generated data simulates a mailing list and the task of the merge/purge application is to remove duplicates from the generated list. Chapter 5 explores parallel processing of these solutions and briefly explores load balancing issues. In our initial presentation of this work [Hernández, 1995], we proposed the implementation of two other merge/purge applications to test our techniques over different domains. One application involved the implementation of intersection spatial joins while the other involved the re-implementation of the engine for the ALEXSYS expert system. Given the ....

....procedure, including an incremental merge/purge procedure, all fully implemented in a general and widely useful system. Most of the results we present in this thesis have been published in the literature [Hernández and Stolfo, 1995b] and have recently been submitted for archival publication in [Hernández and Stolfo, 1995a]. Chapter 2: Previous Work. Several lines of work have an impact on efficient solutions for the merge/purge problem. The semantic integration problem [Kent, 1991] seeks to identify a multiplicity of database objects that represent the same or related real-world entity, even though their ....

M. Hernández and S. Stolfo. A Generalization of Band-Joins and the Merge/Purge Problem. Submitted for review to IEEE's Transactions on Knowledge and Data Engineering, November 1995.


The Merge/Purge Problem for Large Databases - Hernandez, Stolfo (1995)   (41 citations)  Self-citation (Hernández)

....in one pass. The differences between this previous work and ours are in the use of a complex function (the equational theory) to determine if records under consideration match, and in our concern for the accuracy of the computed result, since matching records may not appear within a common band. In [9], we describe the sorted neighborhood method as a generalization of band joins and provide an alternative algorithm for the sorted neighborhood method based on the duplicate elimination algorithm described in [3]. This duplicate elimination algorithm takes advantage of the fact that matching ....

M. A. Hernández. A Generalization of Band-Joins and the Merge/Purge Problem. Technical Report CUCS-005-1995, Department of Computer Science, Columbia University, February 1995.

CiteSeer.IST - Copyright Penn State and NEC