CiteSeerX
A survey of table recognition: Models, observations, transformations and inferences (2004)

by R. Zanibbi, D. Blostein, J. R. Cordy
Venue: IJDAR
Results 1 - 10 of 50

WebTables: Exploring the power of tables on the web

by Michael J. Cafarella, Alon Halevy, Daisy Zhe Wang, Eugene Wu, Yang Zhang - PVLDB, 2008
Abstract - Cited by 122 (13 self)
The World-Wide Web consists of a huge number of unstructured documents, but it also contains structured data in the form of HTML tables. We extracted 14.1 billion HTML tables from Google’s general-purpose web crawl, and used statistical classification techniques to find the estimated 154M that contain high-quality relational data. Because each relational table has its own “schema” of labeled and typed columns, each such table can be considered a small structured database. The resulting corpus of databases is larger than any other corpus we are aware of, by at least five orders of magnitude. We describe the WebTables system to explore two fundamental questions about this collection of databases. First, what are effective techniques for searching for structured
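The relational-table filtering the abstract describes can be sketched as a toy feature-based classifier. The features and thresholds below are illustrative assumptions, not the trained statistical classifiers WebTables actually uses:

```python
# Toy sketch of relational-table filtering. The features and thresholds
# are illustrative assumptions; WebTables used trained statistical
# classifiers over a much richer feature set.

def table_features(rows):
    """Compute simple features from a table given as a list of row lists."""
    n_rows = len(rows)
    n_cols = max((len(r) for r in rows), default=0)
    # Relational tables tend to have consistent row lengths...
    consistent = (sum(1 for r in rows if len(r) == n_cols) / n_rows) if n_rows else 0.0
    # ...and short, data-like cells rather than long layout text.
    cells = [str(c) for r in rows for c in r]
    avg_len = sum(len(c) for c in cells) / len(cells) if cells else 0.0
    return {"rows": n_rows, "cols": n_cols,
            "consistency": consistent, "avg_cell_len": avg_len}

def looks_relational(rows):
    f = table_features(rows)
    return (f["rows"] >= 2 and f["cols"] >= 2
            and f["consistency"] > 0.9 and f["avg_cell_len"] < 40)

data_table = [["City", "Population"], ["Innsbruck", "132493"], ["Leipzig", "628718"]]
layout_table = [["A very long navigation sidebar used purely for page layout, not data."]]
print(looks_relational(data_table), looks_relational(layout_table))  # → True False
```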

Citation Context

...hough it is just slightly more than 1.1% of raw HTML tables. Previous work on HTML tables focused on the problem of recognizing good tables or extracting additional information from individual tables [26, 29, 30]. In this paper we consider a corpus of tables that is five orders of magnitude larger than the largest one considered to date [26], and address two fundamental questions: (1) what are effective metho...

What have Innsbruck and Leipzig in common? Extracting Semantics from Wiki Content

by Sören Auer, Jens Lehmann - In Franconi et al. (eds), Proceedings of European Semantic Web Conference (ESWC 2007), LNCS 4519, 2007
Abstract - Cited by 106 (13 self)
Wikis are established means for the collaborative authoring, versioning and publishing of textual articles. The Wikipedia project, for example, succeeded in creating the by far largest encyclopedia just on the basis of a wiki. Recently, several approaches have been proposed on how to extend wikis to allow the creation of structured and semantically enriched content. However, the means for creating semantically enriched structured content are already available and are, although unconsciously, even used by Wikipedia authors. In this article, we present a method for revealing this structured content by extracting information from template instances. We suggest ways to efficiently query the vast amount of extracted information (e.g. more than 8 million RDF statements for the English Wikipedia version alone), leading to astonishing query answering possibilities (such as for the title question). We analyze the quality of the extracted content, and propose strategies for quality improvements with just minor modifications of the wiki systems being currently used.
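The template-instance extraction described here can be sketched roughly as follows. The simplified template syntax, the regular expression, and the triple format are assumptions for illustration, not the authors' actual extraction code:

```python
import re

# Simplified sketch: attribute/value pairs in a wiki template instance
# become RDF-style triples about the article's subject. The template
# syntax and regular expression here are illustrative assumptions.

def template_to_triples(subject, wikitext):
    triples = []
    # Match "| key = value" lines inside a template instance.
    for key, value in re.findall(r"\|\s*([\w ]+?)\s*=\s*([^|\n}]+)", wikitext):
        triples.append((subject, key.strip(), value.strip()))
    return triples

infobox = """{{Infobox city
| name = Innsbruck
| country = Austria
| population = 132493
}}"""
print(template_to_triples("Innsbruck", infobox))
# → [('Innsbruck', 'name', 'Innsbruck'), ('Innsbruck', 'country', 'Austria'),
#    ('Innsbruck', 'population', '132493')]
```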

Citation Context

...Web research [8]. Additionally, there are strong links to knowledge extraction from table structures. A general overview of work on recognizing tables and drawing inferences from them can be found in [27]. [21] is an approach for automatic generation of F-Logic frames out of tables, which subsequently supports the automatic population of ontologies from table-like structures. Different approaches for ...

Information extraction

by Sunita Sarawagi - FnT Databases
Abstract - Cited by 95 (4 self)
The automatic extraction of information from unstructured sources has opened up new avenues for querying, organizing, and analyzing data by drawing upon the clean semantics of structured databases and the abundance of unstructured data. The field of information extraction has its genesis in the natural language processing community where the primary impetus came from competitions centered around the recognition of named entities like people names and organization from news articles. As society became more data oriented with easy online access to both structured and unstructured data, new applications of structure extraction came around. Now, there is interest in converting our personal desktops to structured databases, the knowledge in scientific publications to structured records, and harnessing the Internet for structured fact finding queries. Consequently, there are many different communities of researchers bringing in techniques from machine learning, databases, information retrieval, and computational linguistics for various aspects of the information extraction problem. This review is a survey of information extraction research of over two decades from these diverse communities. We create a taxonomy of the field along various dimensions derived from the nature of the extraction task, the techniques used for extraction, the variety of input resources exploited, and the type of output produced. We elaborate on rule-based and statistical methods for entity and relationship extraction. In each case we highlight the different kinds of models for capturing the diversity of clues driving the recognition process and the algorithms for training and efficiently deploying the models. We survey techniques for optimizing the various steps in an information extraction pipeline, adapting to dynamic data, integrating with existing entities and handling uncertainty in the extraction process.

Citation Context

...is topic in the survey to contain its scope and volume. On the topic of table extraction there is an extensive research literature spanning many different communities, including the document analysis [84, 109, 134, 222], information retrieval [164], web [62, 96], database [36, 165], and machine learning [164, 216] communities. A survey can be found in [84].

TableSeer: automatic table metadata extraction and searching in digital libraries

by Ying Liu, Kun Bai, Prasenjit Mitra, C. Lee Giles - In JCDL, 2007
Abstract - Cited by 43 (16 self)
Tables are ubiquitous in digital libraries. In scientific documents, tables are widely used to present experimental results or statistical data in a condensed fashion. However, current search engines do not support table search. The difficulty of automatically extracting tables from un-tagged documents, the lack of a universal table metadata specification, and the unsuitability/limitation of the existing ranking schemes make the table search problem challenging. In this paper, we describe TableSeer, a search engine for tables. TableSeer crawls digital libraries, detects tables from documents, represents tables with metadata, indexes tables, and provides a user-friendly search interface. We propose an extensive set of medium-independent metadata for tables that scientists and other users can adopt for representing table information. Given a query, TableSeer ranks the matched tables using an innovative ranking algorithm – TableRank. TableRank rates the <query, table> pairs with a tailored vector space model and a specific term weighting scheme. Overall, TableSeer aggregates impact factors from three levels: the term, the table, and the document level. We demonstrate the value of TableSeer with empirical studies on scientific documents.
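The vector-space ranking idea behind TableRank can be sketched as below. Plain term-frequency weighting and cosine similarity stand in for the paper's tailored weighting scheme and multi-level impact factors:

```python
import math
from collections import Counter

# Sketch of vector-space table ranking: queries and table metadata become
# term vectors scored by cosine similarity. Plain term frequency stands in
# for TableRank's tailored term weighting and multi-level impact factors.

def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rank_tables(query, table_metadata):
    qv = Counter(query.lower().split())
    scored = [(cosine(qv, Counter(meta.lower().split())), name)
              for name, meta in table_metadata.items()]
    return [name for score, name in sorted(scored, reverse=True)]

tables = {
    "t1": "population growth statistics by city",
    "t2": "protein folding energy results",
}
print(rank_tables("city population", tables))  # → ['t1', 't2']
```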

Citation Context

...orts the automatic table metadata extraction and table search. Researchers in the automatic table extraction field largely focus on analyzing the table structure in a specific document media. Zanibbi [15] provides a survey with detailed description of each method. All the methods can be divided into three categories: pre-defined layout based [10], heuristics based [7][9][11][16], and statistical based...

Using Visual Cues for Extraction of Tabular Data from Arbitrary HTML Documents

by Bernhard Krüpl, Marcus Herzog, Wolfgang Gatterbauer - In Proc. of the 14th Int’l Conf. on World Wide Web, 2005
Abstract - Cited by 35 (3 self)
We describe a method to extract tabular data from web pages. Rather than just analyzing the DOM tree, we also exploit visual cues in the rendered version of the document to extract data from tables which are not explicitly marked with an HTML table element. To detect tables, we rely on a variant of the well-known X-Y cut algorithm as used in the OCR community. We implemented the system by directly accessing Mozilla's box model that contains the positional data for all HTML elements of a given web page.
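A minimal sketch of the recursive X-Y cut idea on rendered bounding boxes might look like this. The box format, gap threshold, and cut order are assumptions for illustration, not the paper's exact variant:

```python
# Minimal sketch of recursive X-Y cut over element bounding boxes
# (x0, y0, x1, y1): alternately look for a gap wide enough to split the
# region, recursing on each side. Box format, threshold, and cut order
# are illustrative assumptions.

def xy_cut(boxes, min_gap=10, vertical=True):
    if len(boxes) <= 1:
        return [boxes]
    axis = 0 if vertical else 1  # vertical cuts split on x, horizontal on y
    boxes = sorted(boxes, key=lambda b: b[axis])
    for i in range(1, len(boxes)):
        prev_end = max(b[axis + 2] for b in boxes[:i])
        if boxes[i][axis] - prev_end >= min_gap:
            # Gap found: split here and recurse with the cut direction flipped.
            return (xy_cut(boxes[:i], min_gap, not vertical)
                    + xy_cut(boxes[i:], min_gap, not vertical))
    if vertical:  # no vertical gap; try a horizontal cut before giving up
        return xy_cut(boxes, min_gap, False)
    return [boxes]

# Four cells arranged in a 2x2 grid with 50px column and 10px row gaps:
boxes = [(0, 0, 50, 20), (0, 30, 50, 50), (100, 0, 150, 20), (100, 30, 150, 50)]
print(len(xy_cut(boxes)))  # → 4
```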

Uncovering the relational web

by Michael J. Cafarella, Eugene Wu - Under review, 2008
Abstract - Cited by 28 (8 self)
The World-Wide Web consists of a huge number of unstructured hypertext documents, but it also contains structured data in the form of HTML tables. Many of these tables contain both relational-style data and a small “schema” of labeled and typed columns, making each such table a small structured database. The WebTables project is an effort to extract and make use of the huge number of these structured tables on the Web. A clean collection of relational-style tables could be useful for improving web search, schema design, and many other applications. This paper describes the first stage of the WebTables project. First, we give an in-depth study of the Web’s HTML table corpus. For example, we extracted 14.1 billion HTML tables from a several-billion-page portion of Google’s general-purpose web crawl, and estimate that 154 million of these tables contain high-quality relational-style data. We also describe the crawl’s distribution of table sizes and data types. Second, we describe a system for performing relation recovery. The Web mixes relational and non-relational tables indiscriminately (often on the same page), so there is no simple way to distinguish the 1.1% of good relations from the remainder, nor to recover column label and type information. Our mix of hand-written detectors and statistical classifiers takes a raw Web crawl as input, and generates a collection of databases that is five orders of magnitude larger than any other collection we are aware of. Relation recovery achieves precision and recall that are comparable to other domain-independent information extraction systems.

Citation Context

...tured data elements on the Web. A number of authors have studied the problem of information extraction from a single table, some of which serve a role similar to that of the WebTables relation filter [3, 6, 10, 14, 15]. As discussed in Section 3.1, Wang and Hu detected “true” tables with a classifier and features that involved both content and layout [12]. This last paper processed the most tables, taking as input ...

From Tables to Frames

by Aleksander Pivk, York Sure - in 3rd International Semantic Web Conference, 2004
Abstract - Cited by 22 (2 self)
Turning the current Web into a Semantic Web requires automatic approaches for annotation of existing data since manual approaches will not scale in general. We here present an approach for automatic generation of F-Logic frames out of tables which subsequently supports the automatic population of ontologies from table-like structures. The approach consists of a methodology, an accompanying implementation and a thorough evaluation. It is based on a grounded cognitive table model which is stepwise instantiated by our methodology.
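The table-to-frame mapping can be illustrated with a toy row-to-frame generator. The simplified F-Logic-like syntax and the example table are assumptions for illustration, not the paper's actual methodology:

```python
# Toy sketch of the table-to-frame idea: each data row becomes a
# frame-style assertion relating the row's key to its attribute values.
# The F-Logic-like syntax emitted here is deliberately simplified.

def rows_to_frames(header, rows, class_name):
    frames = []
    for row in rows:
        key = row[0]
        slots = ", ".join(f"{attr}->{val!r}"
                          for attr, val in zip(header[1:], row[1:]))
        frames.append(f"{key}:{class_name}[{slots}].")
    return frames

header = ["Country", "Capital", "Population"]
rows = [["Austria", "Vienna", "8900000"]]
print(rows_to_frames(header, rows, "Country")[0])
# → Austria:Country[Capital->'Vienna', Population->'8900000'].
```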

Citation Context

...of the 21 tables, such that in general we conclude that our results are certainly very promising. A very recent systematic overview of related work on table recognition can be found in [18]. Several conclusions can be drawn from this survey. Firstly, only few table models have been described explicitly. Apart from the table model of Hurst which we have applied in our approach [7, 8] the...

Conceptual Modeling Foundations for a Web of Knowledge

by David W. Embley, Stephen W. Liddle, Deryle W. Lonsdale
Abstract - Cited by 11 (9 self)
The semantic web purports to be a web of knowledge that can answer our questions, help us reason about everyday problems as well as scientific endeavors, and service many of our wants and needs. Researchers and others expound various views about exactly what this means. Here we propose an answer with conceptual modeling as its foundation. We define a web of knowledge as a collection of interconnected knowledge bundles superimposed over a web of documents. Knowledge bundles are conceptual model instances augmented with facilities that provide for both extensional and intensional facts, for linking between knowledge bundles yielding a web of data, and for linking to an underlying document collection providing a means of authentication. We formally define both the component parts of these augmented conceptual models and their synergistic interconnections. As for practicalities, we discuss problems regarding the potentially high cost of constructing a web of knowledge and explain how they may be mitigated. We also discuss usage issues and show how untrained users can interact with and gain benefit from

Notes on Contemporary Table Recognition

by David W. Embley, Daniel Lopresti, George Nagy - in Proc. Document Analysis Systems VII, 7th International Workshop, DAS 2006, 2006
Abstract - Cited by 8 (2 self)
The shift of interest to web tables in HTML and PDF files, coupled with the incorporation of table analysis and conversion routines in commercial desktop document processing software, is likely to turn table recognition into more of a systems than an algorithmic issue. We illustrate the transition by some actual examples of web table conversion. We then suggest that the appropriate target format for table analysis, whether performed by conventional customized programs or by off-the-shelf software, is a representation based on the abstract table introduced by X. Wang in 1996. We show that the Wang model is adequate for some useful tasks that prove elusive for less explicit representations, and outline our plans to develop a semi-automated table processing system to demonstrate this approach. Screen-snapshots of a prototype tool to allow table mark-up in the style of Wang are also presented.
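Wang's abstract table model, which the paper advocates as a target representation, can be sketched as a map from category labels to cell content rather than a 2-D grid. The flat labels below are a simplification of Wang's nested category trees, and the example values are invented:

```python
# Sketch of Wang's abstract table model: a table as a map from category
# labels (one per dimension) to cell content, rather than a 2-D grid.
# The flat labels and values below simplify Wang's nested category trees.

wang_table = {
    (("Year", "1991"), ("Term", "Winter")): 3.4,
    (("Year", "1991"), ("Term", "Spring")): 3.7,
    (("Year", "1992"), ("Term", "Winter")): 3.5,
    (("Year", "1992"), ("Term", "Spring")): 3.9,
}

def lookup(table, **labels):
    """Retrieve a cell by naming one label per category, in any order."""
    for key, value in table.items():
        if dict(key) == labels:
            return value
    return None

print(lookup(wang_table, Term="Spring", Year="1991"))  # → 3.7
```

The point of the representation is that layout (which category runs along rows vs. columns) becomes a rendering decision, separate from the table's logical content.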

Citation Context

...ng the appropriate array model for a table with multi-line cells without complete rulings has proved difficult. In the last 15 years, over 200 research papers have been published on table recognition [1, 2, 3, 4, 5]. Most published algorithms for cell alignment treat the table as a 2-D array of cells, and attempt to identify the coordinates and contents of each cell. Methods vary depending on whether the table i...

Semantically Conceptualizing and Annotating Tables

by Stephen Lynn, David W. Embley
Abstract - Cited by 7 (4 self)
Enabling a system to automatically conceptualize and annotate a human-readable table is one way to create interesting semantic-web content. But exactly “how?” is not clear. With conceptualization and annotation in mind, we investigate a semantic-enrichment procedure as a way to turn syntactically observed table layout into semantically coherent ontological concepts, relationships, and constraints. Our semantic-enrichment procedure shows how to make use of auxiliary world knowledge to construct rich ontological structures and to populate these ontological structures with instance data. The system uses auxiliary knowledge (1) to recognize concepts and which data values belong to which concepts, (2) to discover relationships among concepts and which data-value combinations represent relationship instances, and (3) to discover constraints over the concepts and relationships that the data values and data-value combinations should satisfy. Experimental evaluations indicate that the automatic conceptualization and annotation processes perform well, yielding F-measures of 90% for concept recognition, 77% for relationship discovery, and 90% for constraint discovery in web tables selected from the geopolitical domain.
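For reference, the F-measures quoted here are the standard balanced harmonic mean of precision and recall:

```python
# The F-measures quoted above are the balanced (F1) harmonic mean of
# precision and recall.
def f_measure(precision, recall):
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

print(round(f_measure(0.9, 0.9), 2))  # → 0.9
```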

Citation Context

...opulation computed as the sum of the populations from the states in the region. Automated “table understanding” has been the subject of research in the document analysis community for several decades [10, 13]. Most of these efforts end, however, after only identifying table labels and table instance data. Some researchers have described a semantic-enrichment step in the table-understanding process, but as...


Developed at and hosted by The College of Information Sciences and Technology

© 2007-2019 The Pennsylvania State University