  Recognizing structure in Web pages using similarity queries (1999)

by William W. Cohen
In AAAI-99
http://www.research.whizbang.com/~wcohen/postscript/aaai-99-extract.ps

Abstract:

We present general-purpose methods for recognizing certain types of structure in HTML documents. The methods are implemented using WHIRL, a "soft" logic that incorporates a notion of textual similarity developed in the information retrieval community. In an experimental evaluation on 82 Web pages, the structure ranked first by our method is "meaningful" (i.e., a structure that was used in a hand-coded "wrapper", or extraction program, for the page) nearly 70% of the time. This improves on a value of 50% obtained by an earlier method. With appropriate background information, the structure-recognition methods we describe can also be used to learn a wrapper from examples, or for maintaining a wrapper as a Web page changes format. In these settings, the top-ranked structure is meaningful nearly 85% of the time.
