Results 1–10 of 76
WebTables: exploring the power of tables on the web
- PVLDB, 2008
Cited by 122 (13 self)
The World-Wide Web consists of a huge number of unstructured documents, but it also contains structured data in the form of HTML tables. We extracted 14.1 billion HTML tables from Google’s general-purpose web crawl, and used statistical classification techniques to find the estimated 154M that contain high-quality relational data. Because each relational table has its own “schema” of labeled and typed columns, each such table can be considered a small structured database. The resulting corpus of databases is larger than any other corpus we are aware of, by at least five orders of magnitude. We describe the WebTables system to explore two fundamental questions about this collection of databases. First, what are effective techniques for searching for structured …
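The pipeline the abstract describes, extracting every HTML table and then filtering for relational ones, can be sketched as follows. This is a minimal illustration using only the standard library; the `looks_relational` heuristic is a crude stand-in for the paper's statistical classifier, and all names and thresholds here are hypothetical.

```python
from html.parser import HTMLParser

class TableCollector(HTMLParser):
    """Collect the cell grid of every <table> in an HTML page."""
    def __init__(self):
        super().__init__()
        self.tables = []       # list of tables; each table is a list of rows
        self._rows = None      # rows of the table currently open, if any
        self._row = None       # cells of the row currently open, if any
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._rows = []
        elif tag == "tr" and self._rows is not None:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self._rows.append(self._row)
            self._row = None
        elif tag == "table" and self._rows is not None:
            self.tables.append(self._rows)
            self._rows = None

def looks_relational(rows, min_rows=2, min_cols=2):
    """Crude stand-in for the paper's classifier: keep tables that
    are rectangular and big enough (layout tables usually are not)."""
    if len(rows) < min_rows:
        return False
    widths = {len(r) for r in rows}
    return len(widths) == 1 and widths.pop() >= min_cols

page = """<table><tr><th>City</th><th>Pop</th></tr>
<tr><td>Oslo</td><td>709k</td></tr></table>
<table><tr><td>layout cell</td></tr></table>"""
p = TableCollector()
p.feed(page)
relational = [t for t in p.tables if looks_relational(t)]
print(len(p.tables), len(relational))  # 2 tables found, 1 kept
```

In the real system the filter is a trained classifier over features such as row/column counts and cell-type consistency; the rectangularity check above only conveys the shape of the task.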
ViDE: A Vision-Based Approach for Deep Web Data Extraction
- IEEE Transactions On Knowledge And Data Engineering
Cited by 40 (2 self)
Deep Web contents are accessed by queries submitted to Web databases, and the returned data records are enwrapped in dynamically generated Web pages (called deep Web pages in this paper). Extracting structured data from deep Web pages is a challenging problem due to the underlying intricate structures of such pages. Until now, a large number of techniques have been proposed to address this problem, but all of them have inherent limitations because they are Web-page-programming-language-dependent. As a popular two-dimensional medium, the contents on Web pages are always displayed regularly for users to browse. This motivates us to seek a different way for deep Web data extraction that overcomes the limitations of previous works by utilizing some interesting common visual features of deep Web pages. In this paper, a novel vision-based approach that is Web-page-programming-language-independent is proposed. This approach primarily utilizes the visual features of deep Web pages to implement deep Web data extraction, including data record extraction and data item extraction. We also propose a new evaluation measure, revision, to capture the amount of human effort needed to produce perfect extraction. Our experiments on a large set of Web databases show that the proposed vision-based approach is highly effective for deep Web data extraction. Index Terms—Web mining, Web data extraction, visual features of deep Web pages, wrapper generation.
Uncovering the relational web
- Under review, 2008
Cited by 28 (8 self)
The World-Wide Web consists of a huge number of unstructured hypertext documents, but it also contains structured data in the form of HTML tables. Many of these tables contain both relational-style data and a small “schema” of labeled and typed columns, making each such table a small structured database. The WebTables project is an effort to extract and make use of the huge number of these structured tables on the Web. A clean collection of relational-style tables could be useful for improving web search, schema design, and many other applications. This paper describes the first stage of the WebTables project. First, we give an in-depth study of the Web’s HTML table corpus. For example, we extracted 14.1 billion HTML tables from a several-billion-page portion of Google’s general-purpose web crawl, and estimate that 154 million of these tables contain high-quality relational-style data. We also describe the crawl’s distribution of table sizes and data types. Second, we describe a system for performing relation recovery. The Web mixes relational and non-relational tables indiscriminately (often on the same page), so there is no simple way to distinguish the 1.1% of good relations from the remainder, nor to recover column label and type information. Our mix of hand-written detectors and statistical classifiers takes a raw Web crawl as input, and generates a collection of databases that is five orders of magnitude larger than any other collection we are aware of. Relation recovery achieves precision and recall that are comparable to other domain-independent information extraction systems.
WebSets: Extracting Sets of Entities from the Web Using Unsupervised Information Extraction
Cited by 27 (8 self)
We describe an open-domain information extraction method for extracting concept-instance pairs from an HTML corpus. Most earlier approaches to this problem rely on combining clusters of distributionally similar terms and concept-instance pairs obtained with Hearst patterns. In contrast, our method relies on a novel approach for clustering terms found in HTML tables, and then assigning concept names to these clusters using Hearst patterns. The method can be efficiently applied to a large corpus, and experimental results on several datasets show that our method can accurately extract large numbers of concept-instance pairs.
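The Hearst patterns mentioned in the abstract are lexical templates like "X such as A, B and C" that yield concept-instance pairs from plain text. A minimal sketch of one such pattern, covering only the "such as" template and not the paper's full pattern set:

```python
import re

# One illustrative Hearst pattern: "X such as A, B and C"
# yields concept-instance pairs (X, A), (X, B), (X, C).
HEARST = re.compile(
    r"(?P<concept>\w+)\s+such\s+as\s+"
    r"(?P<instances>\w+(?:\s*,\s*\w+)*(?:\s*(?:,\s*)?and\s+\w+)?)",
    re.IGNORECASE,
)

def concept_instance_pairs(text):
    """Return (concept, instance) pairs matched by the pattern above."""
    pairs = []
    for m in HEARST.finditer(text):
        concept = m.group("concept").lower()
        instances = re.split(r"\s*,\s*|\s+and\s+", m.group("instances"))
        pairs.extend((concept, inst) for inst in instances if inst)
    return pairs

print(concept_instance_pairs("He visited cities such as Paris, Oslo and Rome."))
# [('cities', 'Paris'), ('cities', 'Oslo'), ('cities', 'Rome')]
```

In the WebSets setting these patterns name clusters of terms taken from table columns, rather than being matched against running text alone.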
Using Wikipedia to Bootstrap Open Information Extraction
Cited by 22 (0 self)
We often use ‘Data Management’ to refer to the manipulation of relational or semi-structured information, but much of the world’s data is unstructured, for example the vast amount of natural-language text on the Web. The ability to manage …
Web Data Extraction, Applications and Techniques: A Survey
, 2010
Cited by 18 (10 self)
The World Wide Web contains a huge amount of unstructured and semi-structured information, which is increasing exponentially with the advent of Web 2.0, thanks to User-Generated Content (UGC). In this paper we briefly survey the fields of application, in particular enterprise and social applications, and the techniques used to approach and solve the problem of extracting information from Web sources: over recent years many approaches have been developed, some inherited from past studies on Information Extraction (IE) systems, and many others designed ad hoc to solve specific problems.
Automatic hidden-web table interpretation by sibling page comparison
- In Proceedings of the 26th International Conference on Conceptual Modeling (ER’07), 2007
Cited by 15 (5 self)
The longstanding problem of automatic table interpretation still eludes us. Its solution would not only be an aid to table processing applications such as large-volume table conversion, but would also be an aid in solving related problems such as information extraction and semi-structured data management. In this paper, we offer a conceptual modeling solution for the common special case in which so-called sibling pages are available. The sibling pages we consider are pages on the hidden web, commonly generated from underlying databases. We compare them to identify and connect nonvarying components (category labels) and varying components (data values). We tested our solution using more than 2,000 tables in source pages from three different domains—car advertisements, molecular biology, and geopolitical information. Experimental results show that the system can successfully identify sibling tables, generate structure patterns, interpret tables using the generated patterns, and automatically adjust the structure patterns, if necessary, as it processes a sequence of hidden-web pages. For these activities, the system was able to achieve an overall F-measure of 94.5%.
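The core idea of sibling-page comparison, cells that stay fixed across pages are category labels while cells that vary are data values, can be sketched as below. This simplifies the tables to flat cell lists (the paper works over full nested table structures), and all names are illustrative.

```python
# Sketch of the sibling-page idea: compare same-shaped cell lists
# taken from sibling hidden-web pages. A position whose value is
# identical on every page is a nonvarying component (category label);
# a position whose value differs is a varying component (data value).
def split_labels_and_values(sibling_tables):
    """sibling_tables: list of equal-length cell lists, one per page."""
    labels, values = {}, {}
    for i in range(len(sibling_tables[0])):
        column = [table[i] for table in sibling_tables]
        if len(set(column)) == 1:   # nonvarying -> category label
            labels[i] = column[0]
        else:                        # varying -> data values
            values[i] = column
    return labels, values

page_a = ["Make", "Toyota", "Price", "9500"]
page_b = ["Make", "Honda", "Price", "8200"]
labels, values = split_labels_and_values([page_a, page_b])
print(labels)   # {0: 'Make', 2: 'Price'}
print(values)   # {1: ['Toyota', 'Honda'], 3: ['9500', '8200']}
```

The real system additionally aligns the pages structurally and adjusts its patterns as new pages arrive; this sketch only shows the varying/nonvarying split that drives the interpretation.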
Scalable Web Data Extraction for Online Market Intelligence
Cited by 15 (4 self)
Online market intelligence (OMI), in particular competitive intelligence for product pricing, is a very important application area for Web data extraction. However, OMI presents non-trivial challenges to data extraction technology. Sophisticated and highly parameterized navigation and extraction tasks are required. On-the-fly data cleansing is necessary in order to identify identical products from different suppliers. It must be possible to smoothly define data flow scenarios that merge and filter streams of extracted data stemming from several Web sites and store the resulting data into a data warehouse, where the data is subjected to market intelligence analytics. Finally, the system must be highly scalable, in order to be able to extract and process massive amounts of data in a short time. Lixto (www.lixto.com), a company offering data extraction tools and services, has been providing OMI solutions for several customers. In this paper we show how Lixto has tackled each of the above challenges by improving and extending its original data extraction software. Most importantly, we show how high scalability is achieved through cloud computing. This paper also features a case study from the computers and electronics market.
Dynamic hierarchical Markov random fields for integrated web data extraction
- JMLR
Cited by 13 (5 self)
Existing template-independent web data extraction approaches adopt highly ineffective decoupled strategies—attempting to do data record detection and attribute labeling in two separate phases. In this paper, we propose an integrated web data extraction paradigm with hierarchical models. The proposed model is called Dynamic Hierarchical Markov Random Fields (DHMRFs). DHMRFs take structural uncertainty into consideration and define a joint distribution over both model structure and class labels. The joint distribution is an exponential-family distribution. As a conditional model, DHMRFs relax the independence assumption made in directed models. Since exact inference is intractable, a variational method is developed to learn the model’s parameters and to find the MAP model structure and label assignments. We apply DHMRFs to a real-world web data extraction task. Experimental results show that: (1) integrated web data extraction models can achieve significant improvements on both record detection and attribute labeling compared to decoupled models; and (2) in diverse web data extraction, DHMRFs can potentially address the blocky-artifact issue suffered by fixed-structure hierarchical models.
Automatic Hidden-Web Table Interpretation, Conceptualization, and Semantic Annotation
Cited by 12 (5 self)
The longstanding problem of automatic table interpretation still eludes us. Its solution would not only be an aid to table processing applications such as large-volume table conversion, but would also be an aid in solving related problems such as information extraction, semantic annotation, and semi-structured data management. In this paper, we offer a solution for the common special case in which so-called sibling pages are available. The sibling pages we consider are pages on the hidden web, commonly generated from underlying databases. Our system compares them to identify and connect nonvarying components (category labels) and varying components (data values). We tested our solution using more than 2,000 tables in source pages from three different domains — car advertisements, molecular biology, and geopolitical information. Experimental results show that the system can successfully identify sibling tables, generate structure patterns, interpret tables using the generated patterns, and automatically adjust the structure patterns as it processes a sequence of hidden-web pages. For these activities, the system was able to achieve an overall F-measure of 94.5%. Further, given that we can automatically interpret tables, we next show that this leads immediately to a conceptualization of the data in these interpreted tables and thus also to a way to semantically annotate these interpreted tables with respect to the ontological conceptualization. Labels in nested table structures yield ontological concepts and interrelationships among these concepts, and associated data values become annotated information. We further show that semantically annotated data leads immediately to queriable data. Thus, the entire process, which is fully automatic, transforms facts embedded within tables into facts accessible by standard query engines.