Exploiting dictionaries in named entity extraction: Combining semi-Markov extraction processes and data integration methods
- In Proceedings of the ACM SIGKDD Conference
, 2004
Cited by 98 (6 self)
We consider the problem of improving named entity recognition (NER) systems by using external dictionaries—more specifically, the problem of extending state-of-the-art NER systems by incorporating information about the similarity of extracted entities to entities in an external dictionary. This is difficult because most high-performance named entity recognition systems operate by sequentially classifying words as to whether or not they participate in an entity name; however, the most useful similarity measures score entire candidate names. To correct this mismatch we formalize a semi-Markov extraction process which relaxes the usual Markov assumptions. This process is based on sequentially classifying segments of several adjacent words, rather than single words. In addition to allowing a natural way of coupling NER and high-performance record linkage methods, this formalism also allows the direct use of other useful entity-level features, and provides a more natural formulation of the NER problem than sequential word classification. Experiments in multiple domains show that the new model can substantially improve extraction performance, relative to previously published methods for using external dictionaries in NER.
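The segment-level idea in this abstract can be illustrated with a minimal sketch. It is not the authors' model: the hand-set `segment_score`, the `threshold` value, and the use of `difflib.SequenceMatcher` as the similarity function are all assumptions standing in for learned weights and a real record linkage measure. What it shows is the structural point, namely that decoding over segments lets a whole candidate name be scored against the dictionary at once.

```python
from difflib import SequenceMatcher


def dictionary_similarity(candidate, dictionary):
    """Best string similarity between a candidate name and any dictionary entry."""
    return max(
        (SequenceMatcher(None, candidate.lower(), entry.lower()).ratio()
         for entry in dictionary),
        default=0.0,
    )


def segment_score(words, dictionary, threshold=0.85):
    # Hypothetical hand-set score: positive only when the whole segment
    # closely matches a dictionary entry. A trained model would learn this.
    return dictionary_similarity(" ".join(words), dictionary) - threshold


def semi_markov_decode(tokens, dictionary, max_len=4):
    """Viterbi-style search over segmentations instead of per-word labels."""
    n = len(tokens)
    best = [0.0] + [float("-inf")] * n   # best[i]: best score for tokens[:i]
    back = [None] * (n + 1)              # back[i]: (segment start, is_entity)
    for end in range(1, n + 1):
        # Option 1: the last token lies outside any entity (score 0).
        best[end] = best[end - 1]
        back[end] = (end - 1, False)
        # Option 2: an entity segment tokens[start:end] of up to max_len words.
        for start in range(max(0, end - max_len), end):
            score = best[start] + segment_score(tokens[start:end], dictionary)
            if score > best[end]:
                best[end] = score
                back[end] = (start, True)
    # Follow back-pointers to recover the entity segments.
    entities = []
    pos = n
    while pos > 0:
        start, is_entity = back[pos]
        if is_entity:
            entities.append(" ".join(tokens[start:pos]))
        pos = start
    return entities[::-1]


tokens = "talk by William Cohen at CMU".split()
dictionary = ["William W. Cohen", "Carnegie Mellon University"]
print(semi_markov_decode(tokens, dictionary))
```

A word-by-word classifier would have to judge "William" and "Cohen" separately, and neither token alone resembles the dictionary entry closely; scoring the two-word segment as a unit is what makes the dictionary match usable.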
Extracting Web data using instance-based learning
- In WISE-05
, 2005
Cited by 21 (1 self)
This paper studies structured data extraction from Web pages, e.g., online product description pages. Existing approaches to data extraction include wrapper induction and automatic methods. In this paper, we propose an instance-based learning method, which performs extraction by comparing each new instance (or page) to be extracted with labeled instances (or pages). The key advantage of our method is that it does not need an initial set of labeled pages to learn extraction rules as in wrapper induction. Instead, the algorithm is able to start extraction from a single labeled instance (or page). Only when a new page cannot be extracted does the page need labeling. This avoids unnecessary page labeling, which solves a major problem with inductive learning (or wrapper induction), i.e., the set of labeled pages may not be representative of all other pages. The instance-based approach is very natural because structured data on the Web usually follow some fixed templates and pages of the same template usually can be extracted using a single page instance of the template. The key issue is the similarity or distance measure. Traditional measures based on the Euclidean distance or text similarity are not easily applicable in this context because items to be extracted from different pages can be entirely different. This paper proposes a novel similarity measure for the purpose, which is suitable for templated Web pages. Experimental results with product data extraction from 1200 pages in 24 diverse Web sites show that the approach is surprisingly effective. It significantly outperforms existing state-of-the-art systems.
Hierarchical Text Categorization and Its Application to Bioinformatics
, 2005
Cited by 16 (0 self)
In a hierarchical categorization problem, categories are partially ordered to form a hierarchy. In this dissertation, we explore two main aspects of hierarchical categorization: learning algorithms and performance evaluation. We introduce the notion of consistent hierarchical classification that makes classification results more comprehensible and easily interpretable for end-users. Among the previously introduced hierarchical learning algorithms, only a local top-down approach produces consistent classification. The present work extends this algorithm to the general case of DAG class hierarchies and possible internal class assignments. In addition, a new global hierarchical approach aimed at performing consistent classification is proposed. This is a general framework for converting a conventional "flat" learning algorithm into a hierarchical one. An extensive set of experiments on real and synthetic data indicates that the proposed approach significantly outperforms the corresponding "flat" as well as the local top-down method. For evaluation purposes, we use a novel hierarchical evaluation measure that is superior to the existing hierarchical and non-hierarchical evaluation techniques according to a number
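One way to picture consistent classification over a DAG hierarchy is as an ancestor closure of the predicted label set. The sketch below assumes that "consistent" means every predicted category is accompanied by all of its ancestors; the abstract does not spell out the definition, so this reading, along with the `parents` mapping and the category names, is an assumption for illustration only.

```python
def ancestors(label, parents):
    """All ancestors of a label in a DAG given as child -> list of parents."""
    seen = set()
    stack = [label]
    while stack:
        node = stack.pop()
        for parent in parents.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def make_consistent(labels, parents):
    """Extend a predicted label set with every ancestor of each label."""
    closed = set(labels)
    for label in labels:
        closed |= ancestors(label, parents)
    return closed


# Hypothetical DAG: "kinase" has two parents, so this is not a tree.
parents = {
    "kinase": ["enzyme", "signaling"],
    "enzyme": ["protein"],
    "signaling": ["process"],
}
print(sorted(make_consistent({"kinase"}, parents)))
```

Because a node may have several parents, the closure follows every parent link rather than a single path to the root.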
Learning to extract gene-protein names from weakly-labeled text
- In preparation
, 2006
Cited by 4 (2 self)
Training a named entity recognizer (NER) has always been a difficult task due to the effort required to generate a significant amount of annotated training data. In this paper, we reduce or eliminate the effort required to create training data by automatically converting other sources of data into annotated training data. The performance of this approach is tested on a gene-protein name extractor by using the mouse and fly data obtained from the BioCreAtIvE challenge. Results show that our methods are effective and that our trained NER system outperforms all of our baseline results.
ProtChew: Automatic Extraction of Protein Names from
- In Proceedings of the International Workshop on Biomedical Data Engineering (BMDE 2005, in conjunction with ICDE 2005)
, 2005
Cited by 3 (2 self)
With the increasing amount of biomedical literature, there is a need for automatic extraction of information to support biomedical researchers. Due to incomplete biomedical information databases, the extraction is not straightforward using dictionaries, and several approaches using contextual rules and machine learning have previously been proposed. Our work is inspired by the previous approaches, but is novel in that it is fully automatic and does not rely on expert-tagged corpora. The main ideas are 1) unigram tagging of corpora using known protein names, to obtain training examples for the protein name extraction classifier, and 2) tight positive and negative examples, with protein-related words as negative examples and protein names/synonyms as positive examples. We present preliminary results on Medline abstracts about gastrin; further work will test the approach on BioCreative benchmark data sets.
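The two ideas listed in this abstract can be sketched as a simple lookup-based tagger. This is only an illustration under assumptions: the function name `tag_tokens`, the label names `POS`, `NEG`, and `UNK`, and the example word lists are hypothetical, and the paper's actual tagging procedure may differ.

```python
def tag_tokens(tokens, protein_names, protein_related):
    """Label each token by lookup, with no expert-tagged corpus involved.

    POS: token is a known protein name or synonym (positive example).
    NEG: token is a protein-related word that is not a name (negative example).
    UNK: token gives no training signal.
    """
    tagged = []
    for token in tokens:
        key = token.lower()
        if key in protein_names:
            label = "POS"
        elif key in protein_related:
            label = "NEG"
        else:
            label = "UNK"
        tagged.append((token, label))
    return tagged


# Hypothetical word lists, lower-cased for case-insensitive lookup.
protein_names = {"gastrin", "cck2"}
protein_related = {"receptor", "binds"}

sentence = "Gastrin binds the CCK2 receptor".split()
print(tag_tokens(sentence, protein_names, protein_related))
```

Choosing protein-related words as negatives, rather than arbitrary words, gives the classifier examples that sit close to the decision boundary.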
Analyzing Gene Relationships for Down Syndrome with Labeled Transition Graphs
- in Proceedings of Formal Methods in Computer Aided Design (FMCAD), 2007
, 2010
Cited by 3 (3 self)
The relationship between changes in gene expression and physical characteristics associated with Down syndrome is not well understood. Chromosome 21 genes interact with non-chromosome 21 genes to produce Down syndrome characteristics. This indirect influence, however, is difficult to empirically define due to the number, size, and complexity of the involved gene regulatory networks. This work links chromosome 21 genes to non-chromosome 21 genes known to interact in a Down syndrome phenotype through a reachability analysis of labeled transition graphs extracted from published gene regulatory network databases. The analysis provides new relations in a recently discovered link between a specific gene and Down syndrome phenotype. This type of formal analysis helps scientists direct empirical studies to unravel chromosome 21 gene interactions with the hope for therapeutic intervention.
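Reachability over a labeled transition graph can be sketched with a breadth-first search that records the labeled edges along each path. The graph encoding (a dict from node to `(label, target)` pairs), the node names, and the edge labels below are hypothetical placeholders, not data from the paper, and the paper's formal analysis may use a different method.

```python
from collections import deque


def reachable(graph, source):
    """Map every node reachable from source to one shortest labeled path.

    graph: dict node -> list of (label, target) transitions.
    Each path is a list of (from_node, label, to_node) triples.
    """
    paths = {source: []}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for label, target in graph.get(node, ()):
            if target not in paths:
                paths[target] = paths[node] + [(node, label, target)]
                queue.append(target)
    return paths


# Hypothetical regulatory graph with placeholder gene names.
graph = {
    "chr21_gene": [("activates", "gene_a")],
    "gene_a": [("inhibits", "gene_b")],
    "gene_c": [("activates", "gene_b")],
}

for target, path in reachable(graph, "chr21_gene").items():
    print(target, path)
```

Here `gene_c` is never reported because no transition leads to it from the source, which is the kind of distinction a reachability analysis is meant to expose.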
Using natural language processing and the gene ontology to populate a structured pathway database
- IEEE CSB’03 Poster paper
Exploiting Dictionaries in Named Entity Extraction: Combining Semi-Markov Extraction Processes and Data Integration Methods
We consider the problem of improving named entity recognition (NER) systems by using external dictionaries: more specifically, the problem of extending state-of-the-art NER systems by incorporating information about the similarity of extracted entities to entities in an external dictionary. This is difficult because most high-performance named entity recognition systems operate by sequentially classifying words as to whether or not they participate in an entity name; however, the most useful similarity measures score entire candidate names. To correct this mismatch we formalize a semi-Markov extraction process, which is based on sequentially classifying segments of several adjacent words, rather than single words. In addition to allowing a natural way of coupling high-performance NER methods and high-performance similarity functions, this formalism also allows the direct use of other useful entity-level features, and provides a more natural formulation of the NER problem than sequential word classification. Experiments in multiple domains show that the new model can substantially improve extraction performance over previous methods for using external dictionaries in NER.
A simple approach for protein name identification: prospects and limits
, 2005
Construction of Gene Correlation Networks and Text Classification via Biomedical Literature Mining
Automatic extraction of information from biomedical texts has become a necessity given the massive and growing volume of the relevant scientific literature. A special feature that makes this task more challenging is the over-abundance and heterogeneity of the relevant gene/protein terminology. In this paper we introduce a novel term-identification process and propose an effective data structure based on TRIE trees. It enables the storage of millions of biomedical terms and reflects their semantic relations in a compressed and memory-efficient way. Gene-Gene and Gene-Disease correlations are induced using the entropic Mutual Information Measure. Moreover, we introduce a novel text classification process that uses the term-identification process and a novel similarity matching metric. The induced correlation networks reveal valuable biomedical information. Text classification results show high accuracy, in the range of 90 to 97.5%, indicating the reliability of the whole approach.
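The two building blocks named in this abstract, a trie for term identification and mutual information for term correlation, can be sketched as follows. Several choices here are assumptions rather than the paper's design: the trie is keyed on whole tokens instead of characters, matching is greedy longest-match, and mutual information is computed over binary term occurrence per document. The example terms and documents are made up.

```python
import math


def build_trie(terms):
    """Token-level trie; the key None marks the end of a complete term."""
    root = {}
    for term in terms:
        node = root
        for token in term.lower().split():
            node = node.setdefault(token, {})
        node[None] = term
    return root


def find_terms(tokens, trie):
    """Greedy longest-match lookup of known terms in a token sequence."""
    found = []
    i = 0
    while i < len(tokens):
        node, match, j = trie, None, i
        while j < len(tokens) and tokens[j].lower() in node:
            node = node[tokens[j].lower()]
            j += 1
            if None in node:
                match = (node[None], j)   # remember the longest match so far
        if match:
            found.append(match[0])
            i = match[1]
        else:
            i += 1
    return found


def mutual_information(docs, a, b):
    """Mutual information (in bits) between occurrence of terms a and b.

    docs: non-empty list of sets, each holding the terms found in one document.
    """
    n = len(docs)
    mi = 0.0
    for has_a in (True, False):
        for has_b in (True, False):
            p_a = sum((a in d) == has_a for d in docs) / n
            p_b = sum((b in d) == has_b for d in docs) / n
            p_ab = sum((a in d) == has_a and (b in d) == has_b for d in docs) / n
            if p_ab > 0:
                mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi


trie = build_trie(["tumor necrosis factor", "tumor", "BRCA1"])
print(find_terms("the tumor necrosis factor and BRCA1".split(), trie))

docs = [{"gene_x", "disease_y"}, {"gene_x", "disease_y"}, set(), set()]
print(mutual_information(docs, "gene_x", "disease_y"))
```

Shared prefixes such as "tumor" are stored once in the trie, which is where the memory saving for large term lists comes from.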