Results 1 - 10
of
44
Prediction of complete gene structures in human genomic DNA
- J. Mol. Biol
, 1997
"... The problem of identifying genes in genomic DNA sequences by computational methods has attracted considerable research attention in recent years. From one point of view, the problem is closely ..."
Abstract
-
Cited by 497 (7 self)
- Add to MetaCart
The problem of identifying genes in genomic DNA sequences by computational methods has attracted considerable research attention in recent years. From one point of view, the problem is closely
A Generalized Hidden Markov Model for the Recognition of Human Genes in DNA
, 1996
"... We present a statistical model of genes in DNA. A Generalized Hidden Markov Model (GHMM) provides the framework for describing the grammar of a legal parse of a DNA sequence (Stormo & Haussler 1994). Probabilities are assigned to transitions between states in the GHMM and to the generation of each n ..."
Abstract
-
Cited by 122 (13 self)
- Add to MetaCart
We present a statistical model of genes in DNA. A Generalized Hidden Markov Model (GHMM) provides the framework for describing the grammar of a legal parse of a DNA sequence (Stormo & Haussler 1994). Probabilities are assigned to transitions between states in the GHMM and to the generation of each nucleotide base given a particular state. Machine learning techniques are applied to optimize these probabilities using a standardized training set. Given a new candidate sequence, the best parse is deduced from the model using a dynamic programming algorithm to identify the path through the model with maximum probability. The GHMM is flexible and modular, so new sensors and additional states can be inserted easily. In addition, it provides simple solutions for integrating cardinality constraints, reading frame constraints, "indels", and homology searching. The description and results of an implementation of such a gene-finding model, called Genie, is presented. The exon sensor is a codon fre...
Two methods for improving performance of an HMM and their application for gene finding
, 1997
"... A hidden Markov model for gene finding consists of submodels for coding regions, splice sites, introns, intergenic regions and possibly more. It is described how to estimate the model as a whole from labeled sequences instead of estimating the individual parts independently from subsequences. It is ..."
Abstract
-
Cited by 98 (5 self)
- Add to MetaCart
A hidden Markov model for gene finding consists of submodels for coding regions, splice sites, introns, intergenic regions and possibly more. It is described how to estimate the model as a whole from labeled sequences instead of estimating the individual parts independently from subsequences. It is argued that the standard maximum likelihood estimation criterion is not optimal for training such a model. Instead of maximizing the probability of the DNA sequence, one should maximize the probability of the correct prediction. Such a criterion, called conditional maximum likelihood, is used for the gene finder `HMMgene '. A new (approximative) algorithm is described, which finds the most probable prediction summed over all paths yielding the same prediction. We show that these methods contribute significantly to the high performance of HMMgene. Keywords: Hidden Markov model, gene finding, maximum likelihood, statistical sequence analysis. Introduction As the genome projects evolve autom...
Gene Structure Prediction by Linguistic Methods
- Genomics
, 1994
"... The higher-order structure of genes and other features of biological sequences can be described by means of formal grammars. These grammars can then be used by general-purpose parsers to detect and assemble such structures by means of syntactic pattern recognition. We describe a grammar and parser f ..."
Abstract
-
Cited by 55 (2 self)
- Add to MetaCart
The higher-order structure of genes and other features of biological sequences can be described by means of formal grammars. These grammars can then be used by general-purpose parsers to detect and assemble such structures by means of syntactic pattern recognition. We describe a grammar and parser for eukaryotic protein-encoding genes, which by some measures is as effective as current connectionist and combinatorial algorithms in predicting gene structures for sequence database entries. Parameters on the grammar rules are optimized for several different species, and mixing experiments performed to determine the degree of species specificity and the relative importance of compositional, signal-based, and syntactic components in gene prediction. Introduction Formal language theory views languages as sets of strings over some alphabet, and specifies potentially infinite languages with concise sets of rules called grammars [10]. Grammars are an exceptionally well-studied methodology, fami...
Improved Splice Site Detection in Genie
- J. COMPUT. BIOL
, 1997
"... We present an improved splice site predictor for the genefinding program Genie. Genie is based on a generalized Hidden Markov Model (GHMM) that describes the grammar of a legal parse of a multi-exon gene in a DNA sequence. In Genie, probabilities are estimated for gene features by using dynamic prog ..."
Abstract
-
Cited by 43 (3 self)
- Add to MetaCart
We present an improved splice site predictor for the genefinding program Genie. Genie is based on a generalized Hidden Markov Model (GHMM) that describes the grammar of a legal parse of a multi-exon gene in a DNA sequence. In Genie, probabilities are estimated for gene features by using dynamic programming to combine information from multiple content and signal sensors, including sensors that integrate matches to homologous sequences from a database. One of the hardest problems in genefinding is to determine the complete gene structure correctly. The splice site sensors are the key signal sensors that address this problem. We replaced the existing splice site sensors in Genie with two novel neural networks based on dinucleotide frequencies. Using these novel sensors, Genie shows significant improvements in the sensitivity and specificity of gene structure identification. Experimental results in tests using a standard set of annotated genes showed that Genie identified 86% of coding nuc...
A Method for Identifying Splice Sites and Translational Start Sites in Eukaryotic mRNA
, 1997
"... This paper describes a new method for determining the consensus sequences that signal the start of translation and the boundaries between exons and introns (donor and acceptor sites) in eukaryotic mRNA. The method takes into account the dependencies between adjacent bases, in contrast to the usual t ..."
Abstract
-
Cited by 40 (1 self)
- Add to MetaCart
This paper describes a new method for determining the consensus sequences that signal the start of translation and the boundaries between exons and introns (donor and acceptor sites) in eukaryotic mRNA. The method takes into account the dependencies between adjacent bases, in contrast to the usual technique of considering each position independently. When coupled with a dynamic program to compute the most likely sequence, new consensus sequences emerge. The consensus sequence information is summarized in conditional probability matrices which, when used to locate signals in uncharacterized genomic DNA, have greater sensitivity and specificity than conventional matrices. Species-specific versions of these matrices are especially effective at distinguishing true and false sites.
Computational Methods for the Identification of Genes in Vertebrate Genomic Sequences
- Hum. Mol. Genet
, 1997
"... Research into new methods to identify genes in anonymous genomic sequences has been going on for more than 15 years. Over this period of time, the field has evolved from the designing of programs to identify protein coding regions in compact mitochondrial or bacterial genomes, to the challenge of pr ..."
Abstract
-
Cited by 36 (3 self)
- Add to MetaCart
Research into new methods to identify genes in anonymous genomic sequences has been going on for more than 15 years. Over this period of time, the field has evolved from the designing of programs to identify protein coding regions in compact mitochondrial or bacterial genomes, to the challenge of predicting the detailed organization of multi-exon vertebrate genes. The best program currently available perfectly locates more than 80 % of the internal coding exons, and only 5 % of the predictions do not overlap a real exon. Given such accuracy, computational methods are indeed very useful; however, they do not alleviate the need for experimental validation. If the performances are satisfactory for the identification of the coding moiety of genes (internal coding exons), the determination of the full extent of the transcript (5 ′ and 3 ′ extremities of the gene) and the location of promoter regions are still unreliable. As the human and mouse genome sequencing projects enter a production mode, the fully automated annotation of megabase-long anonymous genomic sequences is the next big challenge in bioinformatics.
Finding Genes in DNA with a Hidden Markov Model
- Journal of Computational Biology
, 1997
"... This study describes a new Hidden Markov Model (HMM) system for segmenting uncharacterized genomic DNA sequences into exons, introns, and intergenic regions. Separate HMM modules were designed and trained for specific regions of DNA: exons, introns, intergenic regions, and splice sites. The models w ..."
Abstract
-
Cited by 36 (0 self)
- Add to MetaCart
This study describes a new Hidden Markov Model (HMM) system for segmenting uncharacterized genomic DNA sequences into exons, introns, and intergenic regions. Separate HMM modules were designed and trained for specific regions of DNA: exons, introns, intergenic regions, and splice sites. The models were then tied together to form a biologically feasible topology. The integrated HMM was trained further on a set of eukaryotic DNA sequences, and tested by using it to segment a separate set of sequences. The resulting HMM system, which is called VEIL (Viterbi Exon-Intron Locator), obtains an overall accuracy on test data of 92% of total bases correctly labelled, with a correlation coefficient of 0.73. Using the more stringent test of exact exon prediction, VEIL correctly located both ends of 53% of the coding exons, and 49% of the exons it predicts are exactly correct. These results compare favorably to the best previous results for gene structure prediction, and demonstrate the benefits of...
KDD for Science Data Analysis: Issues and Examples
- In Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining
, 1996
"... The analysis of the massive data sets collected by scientific instruments demands automation as a pre-requisite to analysis. There is an urgent need to create an intermediate level at which scientists can operate effectively; isolating them from the massive sizes and harnessing human analysis capabi ..."
Abstract
-
Cited by 33 (2 self)
- Add to MetaCart
The analysis of the massive data sets collected by scientific instruments demands automation as a pre-requisite to analysis. There is an urgent need to create an intermediate level at which scientists can operate effectively; isolating them from the massive sizes and harnessing human analysis capabilities to focus on tasks in which machines do not even remotely approach humans---namely, creative data analysis, theory and hypothesis formation, and drawing insights into underlying phenomena. We give an overview of the main issues in the exploitation of scientific datasets, present five case studies where KDD tools play important and enabling roles, and conclude with future challenges for data mining and KDD techniques in science data analysis. keywords: Applications in Science, Data Analysis, overview article, large databases, automated analysis, scietific data sets, scientific discovery. To appear: Proceedings of the Second International Conference on Knowledge Discovery and Data Mining...
The Conserved Exon Method for Gene Finding
, 2000
"... A new approach to gene finding is introduced called the "Conserved Exon Method" (CEM). It is based on the idea of looking for conserved protein sequences by comparing pairs of DNA sequences, identifying putative exon pairs based on conserved regions and splice junction signals then chaining pairs of ..."
Abstract
-
Cited by 33 (0 self)
- Add to MetaCart
A new approach to gene finding is introduced called the "Conserved Exon Method" (CEM). It is based on the idea of looking for conserved protein sequences by comparing pairs of DNA sequences, identifying putative exon pairs based on conserved regions and splice junction signals then chaining pairs of putative exons together. It simultaneously predicts gene structures in both human and mouse genomic sequences (or in other pairs of sequences at the appropriate evolutionary distance). Experimental results indicate the potential usefulness of this approach.

