by Yalin Wang, Jianying Hu
http://www.research.avayalabs.com/techreport/ALR-2001-024-paper.pdf

Abstract:

Tables are a commonly used presentation scheme, especially for describing relational information. However, table understanding remains an open problem in both document image analysis and information retrieval. In this paper, we consider the problem of table detection in web documents. Its potential applications include web mining, knowledge management, and web content summarization and delivery to narrow-bandwidth devices. We describe a machine learning based approach to classify each given table entity as either genuine or non-genuine. Various features reflecting the layout as well as the content characteristics of tables are studied. To facilitate the training and evaluation of our table classifier, we designed a novel web document table ground truthing protocol and used it to build a large table ground truth database. The database consists of 1,393 HTML files collected from hundreds of different web sites and contains 11,477 leaf <TABLE> elements, out of which 1,740 are genuine tables. Experiments were conducted using the cross validation method and an F-measure of 95.89% was achieved.
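
For readers unfamiliar with the evaluation metric, the following is a minimal sketch of how an F-measure can be computed from classifier counts. It assumes the common balanced definition (the harmonic mean of precision and recall), which the abstract itself does not spell out. The function name `f_measure` and the counts in the usage line are hypothetical and are not results from the paper.

```python
def f_measure(true_pos: int, false_pos: int, false_neg: int) -> float:
    """Balanced F-measure: harmonic mean of precision and recall.

    Assumed definition, not taken from the abstract:
    precision = TP / (TP + FP), recall = TP / (TP + FN).
    """
    if true_pos == 0:
        # No genuine table was correctly detected, so the score is zero.
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for illustration only; not data from the paper.
score = f_measure(true_pos=90, false_pos=10, false_neg=10)
```

In a cross validation setting, a score like this would typically be computed on each held-out fold and then averaged; how the authors aggregate across folds is not stated in the abstract.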
