Tables are a commonly used presentation scheme, especially for describing relational information. However, table understanding remains an open problem in both document image analysis and information retrieval. In this paper, we consider the problem of table detection in web documents. Its potential applications include web mining, knowledge management, and web content summarization and delivery to narrow-bandwidth devices. We describe a machine learning based approach to classify each given table entity as either genuine or non-genuine. Various features reflecting the layout as well as the content characteristics of tables are studied. To facilitate the training and evaluation of our table classifier, we designed a novel web document table ground truthing protocol and used it to build a large table ground truth database. The database consists of 1,393 HTML files collected from hundreds of different web sites and contains 11,477 leaf <TABLE> elements, out of which 1,740 are genuine tables. Experiments were conducted using the cross validation method and an F-measure of 95.89% was achieved.
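As an illustration of the evaluation metric named above, here is a minimal sketch of how an F-measure can be computed from classification counts. It assumes the standard balanced F-measure (the harmonic mean of precision and recall); the abstract does not state which variant was used, and the function name and the example counts are hypothetical, not taken from the paper.

```python
def f_measure(true_pos, false_pos, false_neg):
    """Balanced F-measure: harmonic mean of precision and recall.

    Assumption: "genuine table" is treated as the positive class.
    true_pos  - genuine tables correctly labeled genuine
    false_pos - non-genuine tables wrongly labeled genuine
    false_neg - genuine tables wrongly labeled non-genuine
    """
    if true_pos == 0:
        # No correct positive predictions: precision and recall are both 0
        # (or undefined), so report 0 rather than dividing by zero.
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    print(f_measure(true_pos=90, false_pos=10, false_neg=10))
```

With cross validation, one reasonable choice is to compute this value on each held-out fold and average the results, though the abstract does not specify how the folds were aggregated.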