by Jane Yung-jen Hsu and Wen-tau Yih
http://hugo.csie.ntu.edu.tw/~yjhsu/courses/u1760/papers/TbIE.ps

Abstract:

Tools for mining information from data can create added value for the Internet. As the majority of electronic documents available over the network are in unstructured textual form, extracting useful information from a document usually involves information retrieval techniques or manual processing. This paper presents a novel approach to mining information from HTML documents using tree-structured templates. In addition to syntactic and semantic descriptions, each template is designed to capture the logical structure of a class of documents. Experiments have been conducted to extract FAQ information automatically from over one hundred HTML documents collected from the Web. Using two basic templates, the prototype FAQ Miner accurately analyzed 65% of the collection of FAQ documents. With additional processing to handle "near-pass" cases, the success rate is approximately 75%. These preliminary results demonstrate the utility of structural templates for mining information from semi-structured text-based documents.
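
To make the idea of a tree-structured template concrete, here is a minimal sketch in Python. It is not the FAQ Miner described in the abstract, whose templates and matching procedure are not given here. The template shape (a `dl` element whose children repeat a `dt`/`dd` pattern), the names `FAQ_TEMPLATE`, `match_template`, and `mine`, and the sample page are all hypothetical, and the tree builder assumes well-nested HTML with explicit closing tags.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "meta", "link", "input"}  # elements with no closing tag

class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.parts = []  # text chunks (str) and child Nodes, in document order

    @property
    def children(self):
        return [p for p in self.parts if isinstance(p, Node)]

    def text(self):
        return "".join(p if isinstance(p, str) else p.text() for p in self.parts)

class TreeBuilder(HTMLParser):
    """Builds a minimal element tree; assumes well-nested input (an assumption)."""
    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.parts.append(node)
        if tag not in VOID_TAGS:
            self.current = node  # descend only into elements that can hold content

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.parts.append(data)

# Hypothetical template: a <dl> whose element children repeat (dt, dd).
FAQ_TEMPLATE = ("dl", [("dt", "question"), ("dd", "answer")])

def match_template(node, template):
    """Return a list of records if node fits the template, else None."""
    tag, pattern = template
    kids = node.children
    if node.tag != tag or not kids or len(kids) % len(pattern) != 0:
        return None
    records = []
    for i in range(0, len(kids), len(pattern)):
        record = {}
        for child, (child_tag, field) in zip(kids[i:i + len(pattern)], pattern):
            if child.tag != child_tag:
                return None
            record[field] = " ".join(child.text().split())  # normalize whitespace
        records.append(record)
    return records

def mine(html, template):
    """Walk the tree top-down and collect records from every matching subtree."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    results, stack = [], [builder.root]
    while stack:
        node = stack.pop()
        found = match_template(node, template)
        if found is not None:
            results.extend(found)
        else:
            stack.extend(reversed(node.children))
    return results

SAMPLE = """
<html><body><h1>Widget FAQ</h1>
<dl>
<dt>What is a widget?</dt><dd>A small <b>reusable</b> component.</dd>
<dt>Is it free?</dt><dd>Yes.</dd>
</dl>
</body></html>
"""

faq = mine(SAMPLE, FAQ_TEMPLATE)
for item in faq:
    print(item["question"], "->", item["answer"])
```

A document that does not fit the template yields no records. A fuller system would presumably try several templates and relax the match for pages that almost fit, which this sketch does not attempt.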

Citations

957 Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer – Salton
273 A Scalable Comparison-Shopping Agent for the World-Wide Web – Doorenbos, Etzioni, et al. - 1997
202 The Harvest Information Discovery and Access System – Bowman, Danzig, et al. - 1994
171 Multi-service search and comparison using the MetaCrawler – Selberg, Etzioni - 1995
122 Scalable Internet resource discovery: Research problems and approaches – Bowman, Danzig, et al. - 1994
55 The World-Wide Web: quagmire or gold mine – Etzioni - 1996
21 FAQ Finder: A case-based approach to knowledge navigation – Hammond, Burke, et al. - 1995
15 Auto-FAQ: an Experiment in Cyberspace Leveraging – Whitehead - 1994
12 Knowledge-based information retrieval from semistructured text – Burke, Hammond, et al. - 1995
9 HTML 3.2 Reference Specification – Raggett - 1997
8 Document processing for automatic knowledge acquisition – Tang, Yan, et al. - 1994
3 Automatic abstract generation based on document structure analysis and its evaluation as a document retrieval presentation function. Systems and Computers in Japan 26(13):32-43 – Sumita, Miike, et al. - 1995