Identification of protein coding regions by database similarity search
- Nature Genetics
, 1993
Cited by 64 (1 self)
Correspondence should be addressed to W.G. Summary: Sequence similarity between a translated nucleotide sequence and a known biological protein can provide strong evidence for the presence of a homologous coding region, and such similarities can often be identified even between distantly related genes. The computer program BLASTX performed conceptual translation of a nucleotide query sequence followed by a protein database search in one programmatic step. The BLAST search algorithm combined with Karlin-Altschul statistics yields a predictable selectivity that has been parameterized. We characterized the sensitivity of BLASTX recognition to the presence of substitution, insertion and deletion errors in the query sequence and to sequence divergence. Reading frames were reliably identified in the presence of 1% query errors, a rate that is typical for primary nucleotide sequence data. BLASTX is appropriate for use in moderate and large scale sequencing projects at the earliest opportunity, when the data are most prone to containing errors.
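As a rough illustration of the conceptual translation step mentioned in this abstract, the sketch below translates a nucleotide sequence in all six reading frames using the standard genetic code. It is not BLASTX code: the function names `reverse_complement`, `translate` and `six_frame_translations` are hypothetical, unknown codons are written as `X` by assumption, and the protein database search and Karlin-Altschul statistics are left out entirely.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon, with codons enumerated in
# TCAG order (first base varies slowest). '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons from position 0; unknown codons become 'X'."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(seq):
    """Map frame number (+1..+3 forward, -1..-3 reverse strand) to a protein string."""
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(seq[offset:])
        frames[-(offset + 1)] = translate(rc[offset:])
    return frames


if __name__ == "__main__":
    for frame, protein in sorted(six_frame_translations("ATGGCCTAA").items()):
        print(frame, protein)
```

Each of the six protein strings would then be compared against a protein database; a frameshift error in the query moves the true coding region from one frame to another, which is one reason all frames are examined.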
Gene Structure Prediction by Linguistic Methods
- Genomics
, 1994
Cited by 55 (2 self)
The higher-order structure of genes and other features of biological sequences can be described by means of formal grammars. These grammars can then be used by general-purpose parsers to detect and assemble such structures by means of syntactic pattern recognition. We describe a grammar and parser for eukaryotic protein-encoding genes, which by some measures is as effective as current connectionist and combinatorial algorithms in predicting gene structures for sequence database entries. Parameters on the grammar rules are optimized for several different species, and mixing experiments performed to determine the degree of species specificity and the relative importance of compositional, signal-based, and syntactic components in gene prediction. Introduction: Formal language theory views languages as sets of strings over some alphabet, and specifies potentially infinite languages with concise sets of rules called grammars [10]. Grammars are an exceptionally well-studied methodology, fami...
Computational Methods for the Identification of Genes in Vertebrate Genomic Sequences
- Hum. Mol. Genet
, 1997
Cited by 36 (3 self)
Research into new methods to identify genes in anonymous genomic sequences has been going on for more than 15 years. Over this period, the field has evolved from designing programs to identify protein coding regions in compact mitochondrial or bacterial genomes to the challenge of predicting the detailed organization of multi-exon vertebrate genes. The best program currently available perfectly locates more than 80% of the internal coding exons, and only 5% of the predictions do not overlap a real exon. Given such accuracy, computational methods are indeed very useful; however, they do not alleviate the need for experimental validation. While performance is satisfactory for the identification of the coding moiety of genes (internal coding exons), the determination of the full extent of the transcript (5′ and 3′ extremities of the gene) and the location of promoter regions are still unreliable. As the human and mouse genome sequencing projects enter a production mode, the fully automated annotation of megabase-long anonymous genomic sequences is the next big challenge in bioinformatics.
Identification of Genes in Human Genomic DNA
, 1997
Cited by 23 (1 self)
A general probabilistic model of the gene structural and compositional properties of human genomic DNA is introduced and applied to the problem of identifying genes in unannotated human genomic sequences. The model uses a "Hidden semi-Markov" or semi-Markov source architecture which incorporates probabilistic descriptions of fundamental transcriptional, translational and splicing signals, as well as length distributions and compositional features of exons, introns and intergenic regions. Distinct sets of model parameters are derived which account for many of the substantial differences in gene density and structure observed in distinct C+G compositional regions ("isochores") of the human genome. A novel model building procedure, termed Maximal Dependence Decomposition, is introduced which captures potentially important dependencies between non-adjacent as well as adjacent positions in a biological signal. Application of this model to the donor splice signal not only gives better discrimination of potential donor sites than previous probabilistic models, but also reveals subtle properties of this signal which suggest aspects of its biochemical function. Acceptor
Optimally Parsing a Sequence into Different Classes Based on Multiple Types of Evidence
- In Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology
, 1994
Cited by 15 (4 self)
We consider the problem of parsing a sequence into different classes of subsequences. Two common examples are finding the exons and introns in genomic sequences and identifying the secondary structure domains of protein sequences. In each case there are various types of evidence that are relevant to the classification, but none are completely reliable, so we expect some weighted average of all the evidence to provide improved classifications. For example, in the problem of identifying coding regions in genomic DNA, the combined use of evidence such as codon bias and splice junction patterns can give more reliable predictions than either type of evidence alone. We show three main results: 1. For a given weighting of the evidence a dynamic programming algorithm returns the optimal parse and any number of sub-optimal parses. 2. For a given weighting of the evidence a dynamic programming algorithm determines the probability of the optimal parse and any number of sub-optimal parses under a ...
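To make the idea of an optimal parse under a fixed weighting of the evidence more concrete, here is a minimal Viterbi-style sketch. It is an illustration only, not the algorithm of the paper: the per-position class scores stand in for an already-weighted combination of evidence, the `switch_penalty` parameter and the function name `optimal_parse` are hypothetical, and the sub-optimal parses and parse probabilities mentioned in the abstract are not computed.

```python
def optimal_parse(scores, switch_penalty):
    """Label each position with a class so that the total score is maximal.

    scores: non-empty list of dicts, one per position, mapping a class label
            to that position's combined (already weighted) evidence score.
    switch_penalty: amount subtracted whenever adjacent positions differ in
            class (a hypothetical stand-in for boundary evidence).
    Returns (labels, total_score).
    """
    labels = list(scores[0])
    best = {c: scores[0][c] for c in labels}  # best score of a parse ending in c
    back = []                                 # back-pointers, one dict per step

    for pos_scores in scores[1:]:
        new_best, pointers = {}, {}
        for c in labels:
            # Choose the predecessor class that maximizes the running score.
            prev = max(
                labels,
                key=lambda p: best[p] - (switch_penalty if p != c else 0.0),
            )
            penalty = switch_penalty if prev != c else 0.0
            new_best[c] = best[prev] - penalty + pos_scores[c]
            pointers[c] = prev
        best = new_best
        back.append(pointers)

    # Trace back from the best final class to recover the labeling.
    last = max(labels, key=lambda c: best[c])
    total = best[last]
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, total


if __name__ == "__main__":
    evidence = [
        {"exon": 2.0, "intron": 0.0},
        {"exon": 1.5, "intron": 0.5},
        {"exon": 0.0, "intron": 2.0},
        {"exon": 0.2, "intron": 1.8},
    ]
    print(optimal_parse(evidence, switch_penalty=1.0))
```

The running time is proportional to the sequence length times the square of the number of classes, which is the usual cost of this kind of dynamic program.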
Linear-Time Algorithms for Computing Maximum-Density Sequence Segments with Bioinformatics Applications
Cited by 9 (1 self)
We study an abstract optimization problem arising from biomolecular sequence analysis. For a sequence $A$ of pairs $(a_i, w_i)$ for $i = 1, \ldots, n$ and $w_i > 0$, a segment $A(i, j)$ is a consecutive subsequence of $A$ starting with index $i$ and ending with index $j$. The width of $A(i, j)$ is $w(i, j) = \sum_{i \le k \le j} w_k$, and the density is $\left(\sum_{i \le k \le j} a_k\right) / w(i, j)$. The maximum-density segment problem takes $A$ and two values $L$ and $U$ as input and asks for a segment of $A$ with the largest possible density among those of width at least $L$ and at most $U$. When $U$ is unbounded, we provide a relatively simple, $O(n)$-time algorithm, improving upon the $O(n \log L)$-time algorithm by Lin, Jiang and Chao. When both $L$ and $U$ are specified, there are no previous nontrivial results. We solve the problem in $O(n)$ time if $w_i = 1$ for all $i$, and more generally in $O(n + n \log(U - L + 1))$ time when $w_i \ge 1$ for all $i$.
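The problem statement in this abstract can be pinned down with a brute-force reference implementation. The sketch below is deliberately the naive quadratic-time baseline built on prefix sums, not the linear-time algorithm the paper contributes; the function name `max_density_segment` and the 0-based, inclusive index convention are assumptions made for illustration.

```python
from itertools import accumulate


def max_density_segment(a, w, L, U=float("inf")):
    """Brute-force search for a maximum-density segment.

    a, w: equal-length sequences of values a_k and positive widths w_k.
    L, U: lower and upper bounds on the segment width w(i, j).
    Returns (density, i, j) with 0-based inclusive indices, or None if no
    segment has width within [L, U].
    """
    n = len(a)
    prefix_a = [0.0] + list(accumulate(a))  # prefix_a[k] = a_0 + ... + a_{k-1}
    prefix_w = [0.0] + list(accumulate(w))

    best = None
    for i in range(n):
        for j in range(i, n):
            width = prefix_w[j + 1] - prefix_w[i]
            if width > U:
                break  # widths only grow with j because every w_k > 0
            if width >= L:
                density = (prefix_a[j + 1] - prefix_a[i]) / width
                if best is None or density > best[0]:
                    best = (density, i, j)
    return best


if __name__ == "__main__":
    print(max_density_segment([1, 5, 3, 2, 8], [1, 1, 1, 1, 1], L=2, U=3))
```

A baseline like this is mainly useful for checking a faster implementation on small random inputs.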
A Segment-based Dynamic Programming Algorithm for Parsing Gene Structure
, 1996
Cited by 8 (0 self)
Predicting gene structure requires search within a combinatorially large space of possible gene structures. The search space may be narrowed by two types of computational tools: optimality criteria and consistency constraints. Consistency constraints are requirements concerning reading frame and stop codons, namely: the total exon length must be a multiple of three; exons may not contain internal stop codons in their reading frame; and exon-exon junctions may not form stop codons in their reading frame. I present a segment-based dynamic programming algorithm that explores the space of globally consistent gene structures, and finds the optimally scoring gene structure within that space. The algorithm may be modified to provide an arbitrary number of near-optimal solutions and to allow cardinality constraints that limit the number of exons in the gene structure. The algorithm maintains reading frame information that may be used to improve scoring estimates of the likelihood of exons. Seg...
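The three consistency constraints listed in this abstract can be checked directly on a candidate set of exons. The following sketch assumes the exons are given as strings in coding orientation and in transcript order, and that the first exon begins in frame 0; the function name `is_consistent` and the `allow_terminal_stop` option (which tolerates a stop codon as the very last codon) are hypothetical additions, and this is a validity check only, not the segment-based dynamic programming algorithm itself.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def is_consistent(exons, allow_terminal_stop=True):
    """Check reading-frame consistency of an ordered list of exon sequences.

    Enforces that (1) the total exon length is a multiple of three and
    (2) no in-frame stop codon occurs inside an exon or across an exon-exon
    junction. Joining the exons first lets one scan cover both cases.
    """
    cds = "".join(exons).upper()
    if len(cds) % 3 != 0:
        return False

    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    if allow_terminal_stop and codons and codons[-1] in STOP_CODONS:
        codons = codons[:-1]  # assumption: a final stop codon is acceptable
    return not any(codon in STOP_CODONS for codon in codons)


if __name__ == "__main__":
    print(is_consistent(["ATGGC", "CAAATAA"]))  # in-frame, stop only at the end
    print(is_consistent(["ATGT", "AACCC"]))     # junction creates an in-frame TAA
    print(is_consistent(["ATGGC", "CAA"]))      # total length not a multiple of 3
```

Checking candidates one at a time like this does not scale to the combinatorially large search space the abstract describes, which is why the constraints are built into the dynamic program instead.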
Super-pattern matching
- Algorithmica
, 1995
Cited by 5 (0 self)
Some recognition problems are either too complex or too ambiguous to be expressed as a simple pattern matching problem using a sequence or regular expression pattern. In these cases, a richer environment is needed to describe the “patterns” and recognition techniques used to perform the recognition. Some researchers have turned to artificial intelligence techniques and multi-step matching approaches for the problems of gene recognition [5, 7, 18], protein structure recognition [13] and on-line character recognition [6]. This paper presents a class of problems which involve finding matches to “patterns of patterns” or super-patterns, given solutions to the lower-level patterns. The expressiveness of this problem class rivals that of traditional artificial intelligence characterizations, and yet polynomial time algorithms are described for each problem in the class.
Fast Algorithms for Finding Maximum-Density Segments of a Sequence with Applications to Bioinformatics
- in Proceedings of the Second International Workshop on Algorithms in Bioinformatics
, 2002
Cited by 3 (1 self)
We study an abstract optimization problem arising from biomolecular sequence analysis. For a sequence $A = \langle a_1, a_2, \ldots, a_n \rangle$ of real numbers, a segment $S$ is a consecutive subsequence $\langle a_i, a_{i+1}, \ldots, a_j \rangle$.
Computational Genefinding
, 1998
Cited by 2 (1 self)
Introduction Computational methodology for finding genes and other functional sites in genomic DNA has evolved significantly over the last 20 years. Excellent recent surveys have been given by Guigó [10], Claverie [3], Krogh [14] and others. Among the types of functional sites in genomic DNA that researchers have sought to recognize are splice sites, start and stop codons, branch points, promoters and terminators of transcription, polyadenylation sites, ribosomal binding sites, topoisomerase II binding sites, topoisomerase I cleavage sites, and various transcription factor binding sites [8]. Local sites such as these are called signals and methods for detecting them may be called signal sensors. Genomic DNA signals can be contrasted with extended and variable length regions such as exons and introns, which are recognized by different methods that may be called content sensors [26]. 2 Signal Sensors The most bas

