WebTables: exploring the power of tables on the web. PVLDB, 2008.
"... The World-Wide Web consists of a huge number of unstruc-tured documents, but it also contains structured data in the form of HTML tables. We extracted 14.1 billion HTML ta-bles from Google’s general-purpose web crawl, and used sta-tistical classification techniques to find the estimated 154M that co ..."
Abstract (Cited by 122, 13 self):
The World-Wide Web consists of a huge number of unstructured documents, but it also contains structured data in the form of HTML tables. We extracted 14.1 billion HTML tables from Google’s general-purpose web crawl, and used statistical classification techniques to find the estimated 154M that contain high-quality relational data. Because each relational table has its own “schema” of labeled and typed columns, each such table can be considered a small structured database. The resulting corpus of databases is larger than any other corpus we are aware of, by at least five orders of magnitude. We describe the WebTables system to explore two fundamental questions about this collection of databases. First, what are effective techniques for searching for structured data at search-engine scales? Second, what additional power can be derived by analyzing such a huge corpus?
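
The statistical classification step mentioned above is easy to picture with a toy filter. Below is a minimal sketch, not the authors' code: a hand-written heuristic of the kind such pipelines use to discard obvious layout tables before any learned classifier runs (the function name and thresholds are ours).

```python
# Toy heuristic filter (illustrative only, not the WebTables classifier):
# relational tables tend to have several rows and a consistent column count.
def looks_relational(rows):
    if len(rows) < 2:                # need a header row plus at least one data row
        return False
    widths = {len(row) for row in rows}
    return len(widths) == 1          # every row has the same number of cells

# A small header+data table passes; a one-row navigation strip does not.
assert looks_relational([["city", "pop."], ["Innsbruck", "130k"]])
assert not looks_relational([["nav", "menu", "footer"]])
```
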
What have Innsbruck and Leipzig in common? Extracting Semantics from Wiki Content. In Franconi et al. (eds.), Proceedings of the European Semantic Web Conference (ESWC 2007), LNCS 4519, 2007.
"... Abstract Wikis are established means for the collaborative authoring, versioning and publishing of textual articles. The Wikipedia project, for example, succeeded in creating the by far largest encyclopedia just on the basis of a wiki. Recently, several approaches have been proposed on how to extend ..."
Abstract (Cited by 106, 13 self):
Wikis are established means for the collaborative authoring, versioning, and publishing of textual articles. The Wikipedia project, for example, succeeded in creating by far the largest encyclopedia on the basis of a wiki alone. Recently, several approaches have been proposed for extending wikis to allow the creation of structured and semantically enriched content. However, the means for creating semantically enriched structured content are already available and are, although unconsciously, even used by Wikipedia authors. In this article, we present a method for revealing this structured content by extracting information from template instances. We suggest ways to efficiently query the vast amount of extracted information (e.g., more than 8 million RDF statements for the English Wikipedia version alone), leading to astonishing query-answering possibilities (such as for the title question). We analyze the quality of the extracted content and propose strategies for quality improvements requiring only minor modifications of the wiki systems currently in use.
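
The template-instance extraction described above can be illustrated in a few lines. This is our own sketch of the idea, not the paper's implementation; the regex and the sample infobox are invented for the example.

```python
import re

# Turn one MediaWiki template instance into (subject, predicate, object)
# triples, the core move behind extracting RDF from Wikipedia infoboxes.
WIKITEXT = """{{Infobox city
| name = Innsbruck
| country = Austria
| elevation_m = 574
}}"""

def template_to_triples(page_title, wikitext):
    pairs = re.findall(r"\|\s*(\w+)\s*=\s*([^\n|]+)", wikitext)
    return [(page_title, key, value.strip()) for key, value in pairs]

print(template_to_triples("Innsbruck", WIKITEXT))
# [('Innsbruck', 'name', 'Innsbruck'), ('Innsbruck', 'country', 'Austria'),
#  ('Innsbruck', 'elevation_m', '574')]
```
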
Information extraction. Foundations and Trends in Databases.
"... The automatic extraction of information from unstructured sources has opened up new avenues for querying, organizing, and analyzing data by drawing upon the clean semantics of structured databases and the abundance of unstructured data. The field of information extraction has its genesis in the natu ..."
Abstract (Cited by 95, 4 self):
The automatic extraction of information from unstructured sources has opened up new avenues for querying, organizing, and analyzing data by drawing upon the clean semantics of structured databases and the abundance of unstructured data. The field of information extraction has its genesis in the natural language processing community, where the primary impetus came from competitions centered around the recognition of named entities like person and organization names in news articles. As society became more data-oriented, with easy online access to both structured and unstructured data, new applications of structure extraction came about. Now there is interest in converting our personal desktops to structured databases, the knowledge in scientific publications to structured records, and harnessing the Internet for structured fact-finding queries. Consequently, there are many different communities of researchers bringing in techniques from machine learning, databases, information retrieval, and computational linguistics for various aspects of the information extraction problem. This review is a survey of information extraction research of over two decades from these diverse communities. We create a taxonomy of the field along various dimensions derived from the nature of the extraction task, the techniques used for extraction, the variety of input resources exploited, and the type of output produced. We elaborate on rule-based and statistical methods for entity and relationship extraction. In each case we highlight the different kinds of models for capturing the diversity of clues driving the recognition process and the algorithms for training and efficiently deploying the models. We survey techniques for optimizing the various steps in an information extraction pipeline, adapting to dynamic data, integrating with existing entities, and handling uncertainty in the extraction process.
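
As a concrete instance of the rule-based methods the survey covers, here is a toy pattern-driven entity extractor; the patterns are invented for illustration and are far cruder than anything the survey evaluates.

```python
import re

# Toy rule-based named-entity extractor in the spirit of early NER systems:
# each entity type is recognized by a hand-written surface pattern.
PATTERNS = {
    "PERSON": re.compile(r"\b(?:Mr\.|Ms\.|Dr\.)\s+[A-Z][a-z]+"),
    "ORG": re.compile(r"\b[A-Z][A-Za-z]+\s+(?:Inc\.|Corp\.|University)"),
}

def extract_entities(text):
    return [(label, match.group())
            for label, pattern in PATTERNS.items()
            for match in pattern.finditer(text)]

print(extract_entities("Dr. Smith joined Stanford University in 1998."))
# [('PERSON', 'Dr. Smith'), ('ORG', 'Stanford University')]
```
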
TableSeer: automatic table metadata extraction and searching in digital libraries. In JCDL, 2007.
"... Tables are ubiquitous in digital libraries. In scientific documents, tables are widely used to present experimental results or statistical data in a condensed fashion. However, current search engines do not support table search. The difficulty of automatic extracting tables from un-tagged documents, ..."
Abstract (Cited by 43, 16 self):
Tables are ubiquitous in digital libraries. In scientific documents, tables are widely used to present experimental results or statistical data in a condensed fashion. However, current search engines do not support table search. The difficulty of automatically extracting tables from untagged documents, the lack of a universal table metadata specification, and the limitations of existing ranking schemes make the table search problem challenging. In this paper, we describe TableSeer, a search engine for tables. TableSeer crawls digital libraries, detects tables in documents, represents tables with metadata, indexes tables, and provides a user-friendly search interface. We propose an extensive set of medium-independent metadata for tables that scientists and other users can adopt for representing table information. Given a query, TableSeer ranks the matched tables using an innovative ranking algorithm, TableRank. TableRank rates each <query, table> pair with a tailored vector space model and a specific term weighting scheme. Overall, TableSeer aggregates impact factors from three levels: the term, the table, and the document. We demonstrate the value of TableSeer with empirical studies on scientific documents.
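
The vector-space scoring behind TableRank can be sketched with plain TF-IDF and cosine similarity. This is our simplification: the paper's term weighting aggregates term-, table-, and document-level factors, which we collapse into a single TF-IDF weight here.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    # docs are term lists; weight = term frequency * inverse document frequency
    df = Counter(t for d in docs for t in set(d))
    n = len(docs)
    return [{t: tf * math.log(n / df[t]) for t, tf in Counter(d).items()}
            for d in docs]

def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

tables = [["year", "rainfall", "mm"], ["team", "wins", "losses"]]
vecs = tfidf_vectors(tables + [["rainfall", "year"]])  # last entry is the query
print(max(range(len(tables)), key=lambda i: cosine(vecs[-1], vecs[i])))  # 0
```
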
Using Visual Cues for Extraction of Tabular Data from Arbitrary HTML Documents. In Proc. of the 14th Int’l Conf. on World Wide Web, 2005.
"... We describe a method to extract tabular data from web pages. Rather than just analyzing the DOM tree, we also exploit visual cues in the rendered version of the document to extract data from tables which are not explicitly marked with an HTML table element. To detect tables, we rely on a variant of ..."
Abstract (Cited by 35, 3 self):
We describe a method to extract tabular data from web pages. Rather than just analyzing the DOM tree, we also exploit visual cues in the rendered version of the document to extract data from tables which are not explicitly marked with an HTML table element. To detect tables, we rely on a variant of the well-known X-Y cut algorithm as used in the OCR community. We implemented the system by directly accessing Mozilla's box model that contains the positional data for all HTML elements of a given web page.
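
The X-Y cut variant mentioned above is easy to sketch. The following is our minimal recursive version of the classic OCR algorithm, operating on element bounding boxes such as those Mozilla's box model provides; the gap threshold and box format are assumptions for the example.

```python
# Recursive X-Y cut sketch (our illustration, not the paper's code): split a
# set of (x0, y0, x1, y1) boxes at whitespace gaps, alternating axes, until
# no gap remains; the leaf groups are candidate cells.
def xy_cut(boxes, axis=0, min_gap=5):
    if len(boxes) <= 1:
        return [boxes]
    order = sorted(boxes, key=lambda b: b[axis])
    groups, current = [], [order[0]]
    for box in order[1:]:
        prev_end = max(b[axis + 2] for b in current)
        if box[axis] - prev_end >= min_gap:   # found a whitespace gap
            groups.append(current)
            current = [box]
        else:
            current.append(box)
    groups.append(current)
    if len(groups) == 1:                      # no gap here: try the other axis
        return [boxes] if axis == 1 else xy_cut(boxes, 1, min_gap)
    return [leaf for g in groups for leaf in xy_cut(g, 1 - axis, min_gap)]

cells = xy_cut([(0, 0, 40, 10), (60, 0, 100, 10), (0, 20, 40, 30)])
print(len(cells))  # 3: two columns, and the left column splits by row
```
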
Uncovering the relational web. Under review, 2008.
"... The World-Wide Web consists of a huge number of unstructured hypertext documents, but it also contains structured data in the form of HTML tables. Many of these tables contain both relational-style data and a small “schema ” of labeled and typed columns, making each such table a small structured dat ..."
Abstract (Cited by 28, 8 self):
The World-Wide Web consists of a huge number of unstructured hypertext documents, but it also contains structured data in the form of HTML tables. Many of these tables contain both relational-style data and a small “schema” of labeled and typed columns, making each such table a small structured database. The WebTables project is an effort to extract and make use of the huge number of these structured tables on the Web. A clean collection of relational-style tables could be useful for improving web search, schema design, and many other applications. This paper describes the first stage of the WebTables project. First, we give an in-depth study of the Web’s HTML table corpus. For example, we extracted 14.1 billion HTML tables from a several-billion-page portion of Google’s general-purpose web crawl, and estimate that 154 million of these tables contain high-quality relational-style data. We also describe the crawl’s distribution of table sizes and data types. Second, we describe a system for performing relation recovery. The Web mixes relational and non-relational tables indiscriminately (often on the same page), so there is no simple way to distinguish the 1.1% of good relations from the remainder, nor to recover column label and type information. Our mix of hand-written detectors and statistical classifiers takes a raw Web crawl as input, and generates a collection of databases that is five orders of magnitude larger than any other collection we are aware of. Relation recovery achieves precision and recall that are comparable to other domain-independent information extraction systems.
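
Precision and recall for relation recovery are the standard set-overlap measures, computed against human judgments of which extracted tables are truly relational. A quick reference implementation (the table IDs below are made-up placeholders, not the paper's data):

```python
def precision_recall(predicted, gold):
    # predicted: items the system output; gold: items judged correct by a human
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall

predicted = {"tbl1", "tbl2", "tbl3"}   # tables the system kept as relational
gold = {"tbl1", "tbl3", "tbl4"}        # tables a human judged relational
print(precision_recall(predicted, gold))  # (0.666..., 0.666...)
```
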
From Tables to Frames. In 3rd International Semantic Web Conference, 2004.
"... Abstract. Turning the current Web into a Semantic Web requires automatic ap-proaches for annotation of existing data since manual approaches will not scale in general. We here present an approach for automatic generation of F-Logic frames out of tables which subsequently supports the automatic popul ..."
Abstract (Cited by 22, 2 self):
Turning the current Web into a Semantic Web requires automatic approaches for annotation of existing data, since manual approaches will not scale in general. We present an approach for the automatic generation of F-Logic frames from tables, which subsequently supports the automatic population of ontologies from table-like structures. The approach consists of a methodology, an accompanying implementation, and a thorough evaluation. It is based on a grounded cognitive table model which is stepwise instantiated by our methodology.
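
What generating F-Logic frames from tables amounts to can be shown with a toy emitter. The frame syntax and helper below are our own rough rendering of the idea, not the paper's methodology.

```python
# Instantiate an F-Logic-style frame from each data row of a table,
# using the header row as attribute (method) names. Illustrative only.
def rows_to_frames(concept, header, rows, key_col=0):
    frames = []
    for row in rows:
        oid = row[key_col].lower().replace(" ", "_")   # object identifier
        attrs = ", ".join(f"{h}->'{v}'" for h, v in zip(header, row)
                          if h != header[key_col])
        frames.append(f"{oid}:{concept}[{attrs}].")
    return frames

header = ["City", "Country", "Population"]
rows = [["Innsbruck", "Austria", "130000"], ["Leipzig", "Germany", "600000"]]
print("\n".join(rows_to_frames("City", header, rows)))
# innsbruck:City[Country->'Austria', Population->'130000'].
# leipzig:City[Country->'Germany', Population->'600000'].
```
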
Conceptual Modeling Foundations for a Web of Knowledge
"... The semantic web purports to be a web of knowledge that can answer our questions, help us reason about everyday problems as well as scientific endeavors, and service many of our wants and needs. Researchers and others expound various views about exactly what this means. Here we propose an answer w ..."
Abstract (Cited by 11, 9 self):
The semantic web purports to be a web of knowledge that can answer our questions, help us reason about everyday problems as well as scientific endeavors, and service many of our wants and needs. Researchers and others expound various views about exactly what this means. Here we propose an answer with conceptual modeling as its foundation. We define a web of knowledge as a collection of interconnected knowledge bundles superimposed over a web of documents. Knowledge bundles are conceptual-model instances augmented with facilities that provide for both extensional and intensional facts, for linking between knowledge bundles yielding a web of data, and for linking to an underlying document collection providing a means of authentication. We formally define both the component parts of these augmented conceptual models and their synergistic interconnections. As for practicalities, we discuss problems regarding the potentially high cost of constructing a web of knowledge and explain how they may be mitigated. We also discuss usage issues and show how untrained users can interact with and gain benefit from a web of knowledge.
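
One hypothetical data-structure reading of these definitions (the class and field names are ours, not the paper's): a knowledge bundle pairs extensional facts, intensional rules that derive further facts, and provenance links back to the underlying documents.

```python
from dataclasses import dataclass, field

@dataclass
class KnowledgeBundle:
    facts: set = field(default_factory=set)         # extensional: stored triples
    rules: list = field(default_factory=list)       # intensional: triple -> triple or None
    provenance: dict = field(default_factory=dict)  # triple -> source-document URL

    def derived_facts(self):
        # Apply each rule once to every stored fact; keep non-None results.
        derived = {rule(f) for rule in self.rules for f in self.facts}
        return self.facts | (derived - {None})

kb = KnowledgeBundle()
kb.facts.add(("Innsbruck", "locatedIn", "Austria"))
kb.rules.append(lambda f: (f[0], "inContinent", "Europe")
                if f[1:] == ("locatedIn", "Austria") else None)
print(kb.derived_facts())  # both the stored and the derived triple
```
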
Notes on Contemporary Table Recognition. In Proc. Document Analysis Systems VII, 7th International Workshop (DAS 2006), 2006.
"... Abstract. The shift of interest to web tables in HTML and PDF files, coupled with the incorporation of table analysis and conversion routines in commercial desktop document processing software, are likely to turn table recognition into more of a systems than an algorithmic issue. We illustrate the t ..."
Abstract (Cited by 8, 2 self):
The shift of interest to web tables in HTML and PDF files, coupled with the incorporation of table analysis and conversion routines in commercial desktop document processing software, is likely to turn table recognition into more of a systems issue than an algorithmic one. We illustrate the transition with some actual examples of web table conversion. We then suggest that the appropriate target format for table analysis, whether performed by conventional customized programs or by off-the-shelf software, is a representation based on the abstract table introduced by X. Wang in 1996. We show that the Wang model is adequate for some useful tasks that prove elusive for less explicit representations, and outline our plans to develop a semi-automated table processing system to demonstrate this approach. Screen snapshots of a prototype tool that allows table mark-up in the style of Wang are also presented.
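
Wang's abstract table, as we read it, separates a table's logical content from its layout: each cell value is addressed by one category-value pair per dimension. A minimal sketch (the category names and numbers are invented for the example):

```python
# Address each cell by an unordered set of (dimension, value) pairs,
# independent of how the table happens to be laid out on the page.
def key(**coords):
    return frozenset(coords.items())

table = {
    key(Year="2005", Measure="Precision"): 0.81,
    key(Year="2005", Measure="Recall"): 0.74,
    key(Year="2006", Measure="Precision"): 0.85,
    key(Year="2006", Measure="Recall"): 0.79,
}

# The same logical table could render years as rows or as columns;
# lookups do not change.
print(table[key(Measure="Recall", Year="2006")])  # 0.79
```
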
Semantically Conceptualizing and Annotating Tables
"... Abstract. Enabling a system to automatically conceptualize and annotate a human-readable table is one way to create interesting semanticweb content. But exactly “how? ” is not clear. With conceptualization and annotation in mind, we investigate a semantic-enrichment procedure as a way to turn syntac ..."
Abstract (Cited by 7, 4 self):
Enabling a system to automatically conceptualize and annotate a human-readable table is one way to create interesting semantic-web content. But exactly “how?” is not clear. With conceptualization and annotation in mind, we investigate a semantic-enrichment procedure as a way to turn syntactically observed table layout into semantically coherent ontological concepts, relationships, and constraints. Our semantic-enrichment procedure shows how to make use of auxiliary world knowledge to construct rich ontological structures and to populate these structures with instance data. The system uses auxiliary knowledge (1) to recognize concepts and which data values belong to which concepts, (2) to discover relationships among concepts and which data-value combinations represent relationship instances, and (3) to discover constraints over the concepts and relationships that the data values and data-value combinations should satisfy. Experimental evaluations indicate that the automatic conceptualization and annotation processes perform well, yielding F-measures of 90% for concept recognition, 77% for relationship discovery, and 90% for constraint discovery in web tables selected from the geopolitical domain.
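
The F-measures quoted above combine precision and recall in the standard way; for reference, a quick implementation (the inputs in the example are arbitrary, not the paper's numbers):

```python
def f_measure(precision, recall, beta=1.0):
    # Weighted harmonic mean of precision and recall; beta=1 gives F1.
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

print(round(f_measure(0.92, 0.88), 3))  # 0.9
```
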