Abstract:
We present general-purpose methods for recognizing certain types of structure in HTML documents. The methods are implemented using WHIRL, a "soft" logic that incorporates a notion of textual similarity developed in the information retrieval community. In an experimental evaluation on 82 Web pages, the structure ranked first by our method is "meaningful" (i.e., a structure that was used in a hand-coded "wrapper", or extraction program, for the page) nearly 70% of the time. This improves on a value of 50% obtained by an earlier method. With appropriate background information, the structure-recognition methods we describe can also be used to learn a wrapper from examples, or to maintain a wrapper as a Web page changes format. In these settings, the top-ranked structure is meaningful nearly 85% of the time.