
Extracting patterns and relations from the World Wide Web (1998)

by S. Brin
Venue: Proc. WebDB ’98

Results 1 - 10 of 471

Unsupervised Models for Named Entity Classification

by Michael Collins, Yoram Singer - In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, 1999
"... This paper discusses the use of unlabeled examples for the problem of named entity classification. A large number of rules is needed for coverage of the domain, suggesting that a fairly large number of labeled examples should be required to train a classifier. However, we show that the use of unlabe ..."
Cited by 542 (4 self)
This paper discusses the use of unlabeled examples for the problem of named entity classification. A large number of rules is needed for coverage of the domain, suggesting that a fairly large number of labeled examples should be required to train a classifier. However, we show that the use of unlabeled data can reduce the requirements for supervision to just 7 simple “seed” rules. The approach gains leverage from natural redundancy in the data: for many named-entity instances both the spelling of the name and the context in which it appears are sufficient to determine its type. We present two algorithms. The first method uses a similar algorithm to that of (Yarowsky 95), with modifications motivated by (Blum and Mitchell 98). The second algorithm extends ideas from boosting algorithms, designed for supervised learning tasks, to the framework suggested by (Blum and Mitchell 98).
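The two-view redundancy the abstract describes lends itself to a short illustration. Below is a minimal bootstrapping sketch in that spirit, assuming toy examples with a spelling view and a context view and a trivial rule-induction step; the paper's confidence-based rule selection and boosting variant are omitted.

```python
# Minimal two-view bootstrapping sketch (hypothetical data and rule format,
# not the authors' implementation). Rules learned from one view label
# examples that the other view then learns from.

# Each example: (spelling_features, context_features)
examples = [
    ({"contains_Mr."}, {"said"}),
    ({"all_caps"}, {"headquartered_in"}),
    ({"contains_Inc."}, {"headquartered_in"}),
    ({"contains_Mr."}, {"spokesman_for"}),
]

seed_rules = {"contains_Mr.": "PERSON", "contains_Inc.": "ORGANIZATION"}

def bootstrap(examples, spelling_rules, rounds=3):
    context_rules = {}
    for _ in range(rounds):
        # 1. Label examples whose spelling view matches a known spelling rule.
        labeled = [(ctx, spelling_rules[f])
                   for sp, ctx in examples for f in sp if f in spelling_rules]
        # 2. Induce context rules from those labels.
        for ctx, label in labeled:
            for f in ctx:
                context_rules[f] = label
        # 3. Use the context rules to label spelling features of other examples.
        for sp, ctx in examples:
            for f in ctx:
                if f in context_rules:
                    for s in sp:
                        spelling_rules.setdefault(s, context_rules[f])
    return spelling_rules, context_rules

spelling, context = bootstrap(examples, dict(seed_rules))
print(spelling)   # "all_caps" inherits ORGANIZATION via "headquartered_in"
print(context)
```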

Extracting Relations from Large Plain-Text Collections

by Eugene Agichtein, Luis Gravano, 2000
"... Text documents often contain valuable structured data that is hidden in regular English sentences. This data is best exploited if available as a relational table that we could use for answering precise queries or for running data mining tasks. We explore a technique for extracting such tables fr ..."
Cited by 494 (25 self)
Text documents often contain valuable structured data that is hidden in regular English sentences. This data is best exploited if available as a relational table that we could use for answering precise queries or for running data mining tasks. We explore a technique for extracting such tables from document collections that requires only a handful of training examples from users. These examples are used to generate extraction patterns, which in turn result in new tuples being extracted from the document collection. We build on this idea and present our Snowball system. Snowball introduces novel strategies for generating patterns and extracting tuples from plain-text documents. At each iteration of the extraction process, Snowball evaluates the quality of these patterns and tuples without human intervention, and keeps only the most reliable ones for the next iteration. In this paper we also develop a scalable evaluation methodology and metrics for our task, and present a t...

Citation Context

...per we develop the Snowball system for extracting structured data from plain-text documents with minimal human participation. Our techniques build on the ideas and general approach introduced by Brin [2], which we describe next. DIPRE: Dual Iterative Pattern Expansion To extract a structured relation (or table) from a collection of HTML documents, Brin introduced the DIPRE method [2]. DIPRE works bes...
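The citation context summarizes DIPRE at a high level; a compact sketch may make the loop concrete. Everything below (the toy corpus, the seed pair, the fixed-width prefix/suffix windows) is an illustrative simplification of Brin's method, which operates over crawled HTML and includes URL prefixes and pattern-specificity checks not shown here.

```python
import re

# Toy DIPRE-style loop: seed tuples -> occurrence contexts -> patterns ->
# new tuples, iterated. Corpus and seeds are illustrative strings.
corpus = [
    "The book Foundation by Isaac Asimov is a classic.",
    "The book Dune by Frank Herbert is a classic.",
    "The book Neuromancer by William Gibson is a classic.",
]
seeds = {("Isaac Asimov", "Foundation")}

def find_patterns(corpus, tuples):
    """Record a (prefix, middle, suffix) context around each seed occurrence."""
    patterns = set()
    for doc in corpus:
        for author, title in tuples:
            m = re.search(re.escape(title) + r"(.*?)" + re.escape(author), doc)
            if m:
                patterns.add((doc[max(0, m.start() - 9):m.start()],  # fixed-width
                              m.group(1),                            # windows are
                              doc[m.end():m.end() + 5]))             # illustrative
    return patterns

def apply_patterns(corpus, patterns):
    """Match each pattern against the corpus to extract new (author, title) pairs."""
    found = set()
    for doc in corpus:
        for prefix, middle, suffix in patterns:
            rx = (re.escape(prefix) + r"(.+?)" + re.escape(middle)
                  + r"(.+?)" + re.escape(suffix))
            for title, author in re.findall(rx, doc):
                found.add((author, title))
    return found

tuples = set(seeds)
for _ in range(2):              # tuples -> patterns -> more tuples -> ...
    tuples |= apply_patterns(corpus, find_patterns(corpus, tuples))
print(tuples)                   # recovers all three (author, title) pairs
```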

Open information extraction from the web

by Michele Banko, Michael J Cafarella, Stephen Soderland, Matt Broadhead, Oren Etzioni - In IJCAI, 2007
"... Traditionally, Information Extraction (IE) has focused on satisfying precise, narrow, pre-specified requests from small homogeneous corpora (e.g., extract the location and time of seminars from a set of announcements). Shifting to a new domain requires the user to name the target relations and to ma ..."
Cited by 373 (39 self)
Traditionally, Information Extraction (IE) has focused on satisfying precise, narrow, pre-specified requests from small homogeneous corpora (e.g., extract the location and time of seminars from a set of announcements). Shifting to a new domain requires the user to name the target relations and to manually create new extraction rules or hand-tag new training examples. This manual labor scales linearly with the number of target relations. This paper introduces Open IE (OIE), a new extraction paradigm where the system makes a single data-driven pass over its corpus and extracts a large set of relational tuples without requiring any human input. The paper also introduces TEXTRUNNER, a fully implemented, highly scalable OIE system where the tuples are assigned a probability and indexed to support efficient extraction and exploration via user queries. We report on experiments over a 9,000,000 Web page corpus that compare TEXTRUNNER with KNOWITALL, a state-of-the-art Web IE system. TEXTRUNNER achieves an error reduction of 33% on a comparable set of extractions. Furthermore, in the amount of time it takes KNOWITALL to perform extraction for a handful of pre-specified relations, TEXTRUNNER extracts a far broader set of facts reflecting orders of magnitude more relations, discovered on the fly. We report statistics on TEXTRUNNER’s 11,000,000 highest probability tuples, and show that they contain over 1,000,000 concrete facts and over 6,500,000 more abstract assertions.
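As a rough illustration of the single-pass, relation-agnostic extraction the abstract describes, here is a toy open extractor. This is not TEXTRUNNER itself, which uses a learned extractor and assigns probabilities to tuples; here capitalized word runs stand in for candidate arguments, and the tokens between adjacent arguments become the relation phrase.

```python
import re

# Candidate arguments: runs of capitalized words (a crude stand-in for
# noun-phrase detection).
ARG = r"(?:[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*)"

def open_extract(sentence):
    """One relation-agnostic pass: emit (arg1, relation_phrase, arg2) tuples."""
    triples = []
    args = [(m.start(), m.end(), m.group()) for m in re.finditer(ARG, sentence)]
    for (s1, e1, a1), (s2, e2, a2) in zip(args, args[1:]):
        rel = sentence[e1:s2].strip(" ,.")
        if rel:                     # keep pairs linked by some relation phrase
            triples.append((a1, rel, a2))
    return triples

print(open_extract("Edison founded General Electric in Schenectady."))
# [('Edison', 'founded', 'General Electric'),
#  ('General Electric', 'in', 'Schenectady')]
```

No relation names are specified in advance; whatever phrase links two arguments becomes the relation, which is the defining property of the Open IE paradigm the paper introduces.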

Unsupervised named-entity extraction from the web: An experimental study.

by Oren Etzioni, Michael Cafarella, Doug Downey, Ana-Maria Popescu, Tal Shaked, Stephen Soderland, Daniel S. Weld, Alexander Yates - Artificial Intelligence, 2005
"... Abstract The KNOWITALL system aims to automate the tedious process of extracting large collections of facts (e.g., names of scientists or politicians) from the Web in an unsupervised, domain-independent, and scalable manner. The paper presents an overview of KNOW-ITALL's novel architecture and ..."
Cited by 372 (39 self)
The KNOWITALL system aims to automate the tedious process of extracting large collections of facts (e.g., names of scientists or politicians) from the Web in an unsupervised, domain-independent, and scalable manner. The paper presents an overview of KNOWITALL's novel architecture and design principles, emphasizing its distinctive ability to extract information without any hand-labeled training examples. In its first major run, KNOWITALL extracted over 50,000 class instances, but suggested a challenge: How can we improve KNOWITALL's recall and extraction rate without sacrificing precision? This paper presents three distinct ways to address this challenge and evaluates their performance. Pattern Learning learns domain-specific extraction rules, which enable additional extractions. Subclass Extraction automatically identifies sub-classes in order to boost recall (e.g., "chemist" and "biologist" are identified as sub-classes of "scientist"). List Extraction locates lists of class instances, learns a "wrapper" for each list, and extracts elements of each list. Since each method bootstraps from KNOWITALL's domain-independent methods, the methods also obviate hand-labeled training examples. The paper reports on experiments, focused on building lists of named entities, that measure the relative efficacy of each method and demonstrate their synergy. In concert, our methods gave KNOWITALL a 4-fold to 8-fold increase in recall at precision of 0.90, and discovered over 10,000 cities missing from the Tipster Gazetteer.
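A minimal sketch of the domain-independent pattern idea the abstract builds on, assuming two hypothetical generic patterns and an in-memory text. KNOWITALL itself instantiates such patterns as search-engine queries and validates extractions with pointwise-mutual-information statistics, both omitted here.

```python
import re

# Hypothetical generic, class-independent extraction patterns.
GENERIC_PATTERNS = [
    "{cls} such as {inst}",
    "{inst} and other {cls}",
]

def extract_instances(cls, text):
    """Instantiate each generic pattern with a class name and collect matches."""
    instances = set()
    for tmpl in GENERIC_PATTERNS:
        rx = tmpl.format(cls=re.escape(cls), inst=r"([A-Z][a-z]+)")
        instances |= set(re.findall(rx, text))
    return instances

text = ("We met scientists such as Curie, and Darwin and other scientists "
        "were discussed.")
print(extract_instances("scientists", text))   # {'Curie', 'Darwin'} (any order)
```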

Citation Context

...or efficiency (the number of unique instances produced per search engine query) is also important. For a given class, we first select the top patterns according to the following heuristics: H1: As in [6], we prefer patterns that appear for multiple distinct seeds. By banning all patterns found for just a single seed (i.e. requiring that EstimatedRecall > 1/S in Equation 4), 96% of the potential rules...
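The H1 filter in this context is easy to state in code. Below, hypothetical (pattern, seed) observations stand in for real extraction logs, and the paper's EstimatedRecall quantity is reduced to a distinct-seed count.

```python
from collections import defaultdict

# Hypothetical (pattern, seed) observations gathered during extraction.
observations = [
    ("cities such as X", "Paris"),
    ("cities such as X", "Tokyo"),
    ("X is a city", "Paris"),
    ("moved to X", "Paris"),        # seen for one seed only: banned by H1
]

# Group the distinct seeds observed for each pattern.
seeds_per_pattern = defaultdict(set)
for pattern, seed in observations:
    seeds_per_pattern[pattern].add(seed)

# H1: keep only patterns that fired for more than one distinct seed.
kept = [p for p, s in seeds_per_pattern.items() if len(s) > 1]
print(kept)   # ['cities such as X']
```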

Extracting structured data from web pages

by Arvind Arasu - In ACM SIGMOD, 2003
"... Many web sites contain a large collection of “structured” web pages. These pages encode data from an underlying structured source, and are typically generated dynamically. An example of such a collection is the set of book pages in Amazon. There are two important characteristics of such a collection ..."
Cited by 310 (0 self)
Many web sites contain a large collection of “structured” web pages. These pages encode data from an underlying structured source, and are typically generated dynamically. An example of such a collection is the set of book pages in Amazon. There are two important characteristics of such a collection: first, all the pages in the collection contain structured data conforming to a common schema; second, the pages are generated using a common template. Our goal is to automatically extract structured data from a collection of pages described above, without any human input like manually generated rules or training sets. Extracting structured data gives us greater querying power over the data and is useful in information integration systems. Most of the existing work on extracting structured data assumes significant human input, for example, in the form of training examples of the data to be extracted. To the best of our knowledge, the ROADRUNNER project is the only other work that tries to automatically extract structured data. However, ROADRUNNER makes several simplifying assumptions. These assumptions and their implications are discussed in our paper [2]. Structured data denotes data conforming to a schema or type. We borrow the definition of complex types from [1]. Any value conforming to a schema is an instance of the schema. A template is a pattern that describes how instances of a schema are encoded. ...

Citation Context

...the one discussed here, possibly by sacrificing accuracy for efficiency. Also when we work at the scale of the entire web we might be able to leverage the redundancy of the data on the web as in Brin [3]. The second direction of work is to develop techniques for automatically annotating the extracted data, possibly using the words that appear in the template. Acknowledgments We thank Mayank Bawa and ...

Dependency tree kernels for relation extraction

by Aron Culotta - In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04), 2004
"... We extend previous work on tree kernels to estimate the similarity between the dependency trees of sentences. Using this kernel within a Support Vector Machine, we detect and classify relations between entities in the Automatic Content Extraction (ACE) corpus of news articles. We examine the utility ..."
Cited by 263 (2 self)
We extend previous work on tree kernels to estimate the similarity between the dependency trees of sentences. Using this kernel within a Support Vector Machine, we detect and classify relations between entities in the Automatic Content Extraction (ACE) corpus of news articles. We examine the utility of different features such as Wordnet hypernyms, parts of speech, and entity types, and find that the dependency tree kernel achieves a 20% F1 improvement over a “bag-of-words” kernel.
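A highly simplified recursive tree-similarity sketch motivated by the abstract: the Node class and the exact-word matching rule are illustrative assumptions, whereas the paper's kernel matches subsequences of children and compares richer features (POS, entity type, WordNet hypernyms).

```python
# Toy recursive similarity between two dependency trees; usable as a kernel
# value inside an SVM, though much cruder than the paper's formulation.

class Node:
    def __init__(self, word, children=()):
        self.word, self.children = word, list(children)

def kernel(a, b):
    """Count matching root-anchored substructures of two dependency trees."""
    if a.word != b.word:
        return 0.0
    score = 1.0                        # the matching roots themselves
    for ca in a.children:              # add similarity of all child pairs
        for cb in b.children:
            score += kernel(ca, cb)
    return score

t1 = Node("acquired", [Node("Google"), Node("YouTube")])
t2 = Node("acquired", [Node("Google"), Node("Motorola")])
print(kernel(t1, t2))   # 2.0: roots match, plus one matching child ('Google')
```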

Distant supervision for relation extraction without labeled data

by Mike Mintz, Steven Bills, Rion Snow, Dan Jurafsky - In ACL-IJCNLP, 2009
"... Modern models of relation extraction for tasks like ACE are based on supervised learning of relations from small hand-labeled corpora. We investigate an alternative paradigm that does not require labeled corpora, avoiding the domain dependence of ACEstyle algorithms, and allowing the use of corpora ..."
Cited by 239 (3 self)
Modern models of relation extraction for tasks like ACE are based on supervised learning of relations from small hand-labeled corpora. We investigate an alternative paradigm that does not require labeled corpora, avoiding the domain dependence of ACE-style algorithms, and allowing the use of corpora of any size. Our experiments use Freebase, a large semantic database of several thousand relations, to provide distant supervision. For each pair of entities that appears in some Freebase relation, we find all sentences containing those entities in a large unlabeled corpus and extract textual features to train a relation classifier. Our algorithm combines the advantages of supervised IE (combining 400,000 noisy pattern features in a probabilistic classifier) and unsupervised IE (extracting large numbers of relations from large corpora of any domain). Our model is able to extract 10,000 instances of 102 relations at a precision of 67.6%. We also analyze feature performance, showing that syntactic parse features are particularly helpful for relations that are ambiguous or lexically distant in their expression.
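The alignment step the abstract describes is easy to sketch. Below, a toy knowledge base and corpus stand in for Freebase and the unlabeled text; the classifier trained over the collected features is omitted.

```python
# Distant-supervision alignment sketch: pair each KB fact with every
# sentence mentioning both entities, and treat the words between the
# entities as (noisy) training features for that relation.

kb = {("Obama", "Honolulu"): "born_in",       # toy stand-in for Freebase
      ("Paris", "France"): "located_in"}

corpus = [
    "Obama was born in Honolulu in 1961.",
    "Paris is the capital of France.",
]

def training_examples(kb, corpus):
    examples = []
    for (e1, e2), relation in kb.items():
        for sent in corpus:
            if e1 in sent and e2 in sent:
                between = sent.split(e1)[1].split(e2)[0].split()
                examples.append((between, relation))
    return examples

for features, label in training_examples(kb, corpus):
    print(label, features)
# born_in ['was', 'born', 'in']
# located_in ['is', 'the', 'capital', 'of']
```

The features are noisy because not every sentence containing both entities actually expresses the relation; the paper's probabilistic classifier over many such features is what absorbs that noise.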

Citation Context

...ting relations may not be easy to map to relations needed for a particular knowledge base. A third approach has been to use a very small number of seed instances or patterns to do bootstrap learning (Brin, 1998; Riloff and Jones, 1999; Agichtein and Gravano, 2000; Ravichandran and Hovy, 2002; Etzioni et al., 2005; Pennacchiotti and Pantel, 2006; Bunescu and Mooney, 2007; Rozenfeld and Feldman, 2008). These ...

A survey of named entity recognition and classification.

by D Nadeau, S Sekine - Lingvisticae Investigationes, 2007
"... ..."
Cited by 235 (2 self)

Citation Context

...and “scientist” (O. Etzioni et al. 2005), “email address” and “phone number” (I. Witten et al. 1999, D. Maynard et al. 2001), “research area” and “project name” (J. Zhu et al. 2005), “book title” (S. Brin 1998, I. Witten et al. 1999), “job title” (W. Cohen & Sarawagi 2004) and “brand” (E. Bick 2004). A recent interest in bioinformatics, and the availability of the GENIA corpus (T. Ohta et al. 2002) led to ...

Searching the Web

by Arvind Arasu, Junghoo Cho, Hector Garcia-Molina, Andreas Paepcke, Sriram Raghavan - ACM Transactions on Internet Technology, 2001
"... We offer an overview of current Web search engine design. After introducing a generic search engine architecture, we examine each engine component in turn. We cover crawling, local Web page storage, indexing, and the use of link analysis for boosting search performance. The most common design and im ..."
Cited by 162 (1 self)
We offer an overview of current Web search engine design. After introducing a generic search engine architecture, we examine each engine component in turn. We cover crawling, local Web page storage, indexing, and the use of link analysis for boosting search performance. The most common design and implementation techniques for each of these components are presented. For this presentation we draw from the literature and from our own experimental search engine testbed. Emphasis is on introducing the fundamental concepts and the results of several performance analyses we conducted to compare different designs.
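As a concrete instance of the link-analysis component the survey covers, here is a minimal PageRank-style power iteration on a toy link graph; the graph, damping factor, and iteration count are illustrative values, not anything from the paper.

```python
# Power-iteration sketch of link analysis: a page's score is a damped sum
# of the scores of pages linking to it, iterated to a fixed point.

links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}   # page -> outlinks
pages = list(links)
d = 0.85                                            # damping factor
rank = {p: 1.0 / len(pages) for p in pages}         # uniform start

for _ in range(50):                                 # iterate until stable
    new = {p: (1 - d) / len(pages) for p in pages}
    for p, outs in links.items():
        for q in outs:
            new[q] += d * rank[p] / len(outs)       # distribute p's score
    rank = new

print({p: round(r, 3) for p, r in rank.items()})
```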

Citation Context

...art of B and may even be weighted more heavily than the actual text in B. Interestingly, mutually reinforcing relationships have been identified and exploited for other Web tasks, see for instance [9]. ... only focuses the link analysis on the most relevant part of the Web, but also reduces the amount of work for the next phase. (Since both the subgraph selection and its analysis are done at query ...

Unsupervised Personal Name Disambiguation

by Gideon S. Mann, David Yarowsky - Proceedings of the seventh conference on Natural language learning at HLT-NAACL 2003, 2003
"... This paper presents a set of algorithms for distinguishing personal names with multiple real referents in text, based on little or no supervision. The approach utilizes an unsupervised clustering technique over a rich feature space of biographic facts, which are automatically extracted via a languag ..."
Cited by 161 (4 self)
This paper presents a set of algorithms for distinguishing personal names with multiple real referents in text, based on little or no supervision. The approach utilizes an unsupervised clustering technique over a rich feature space of biographic facts, which are automatically extracted via a language-independent bootstrapping process. The induced clusters of named entities are then partitioned and linked to their real referents via the automatically extracted biographic data. Performance is evaluated based on both a test set of hand-labeled multi-referent personal names and via automatically generated pseudonames.
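A small sketch of the clustering step described above, assuming toy per-document feature sets and a simple single-link Jaccard threshold in place of the paper's richer biographic features and clustering model.

```python
# Group mentions of one ambiguous name by overlap of extracted facts:
# mentions with similar biographic features are assumed to share a referent.

mentions = {
    "doc1": {"born_1946", "occupation_politician"},
    "doc2": {"occupation_politician", "state_arkansas"},
    "doc3": {"occupation_golfer", "born_1975"},
}

def jaccard(a, b):
    return len(a & b) / len(a | b)

def cluster(mentions, threshold=0.2):
    clusters = []
    for doc, feats in mentions.items():
        for c in clusters:            # join the first compatible cluster
            if any(jaccard(feats, mentions[d]) >= threshold for d in c):
                c.append(doc)
                break
        else:                          # no compatible cluster: start a new one
            clusters.append([doc])
    return clusters

print(cluster(mentions))   # [['doc1', 'doc2'], ['doc3']]
```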

Citation Context

...: Learning Extraction Patterns from Filled Templates and Web Pages In the late 90s, there was a substantial body of research on learning information extraction patterns from templates (Huffman, 1995; Brin, 1998; Califf and Mooney, 1998; Freitag and McCallum, 1999; Yangarber et al., 2000; Ravichandran and Hovy, 2002). These techniques provide a way to bootstrap information extraction patterns from a set of e...
