Prediction of complete gene structures in human genomic DNA
J. Mol. Biol., 1997
Cited by 497 (7 self)
The problem of identifying genes in genomic DNA sequences by computational methods has attracted considerable research attention in recent years. From one point of view, the problem is closely
On Pattern Frequency Occurrences In A Markovian Sequence?
Algorithmica, 1997
Cited by 56 (22 self)
Consider a given pattern H and a random text T generated by a Markovian source. We study the frequency of pattern occurrences in a random text when overlapping copies of the pattern are counted separately. We present exact and asymptotic formulae for all moments (including the variance), and probability of r pattern occurrences for three different regions of r, namely: (i) r = O(1), (ii) central limit regime, and (iii) large deviations regime. In order to derive these results, we first construct some language expressions that characterize pattern occurrences which are later translated into generating functions. Finally, we use analytical methods to extract asymptotic behaviors of the pattern frequency. Applications of these results include molecular biology, source coding, synchronization, wireless communications, approximate pattern matching, game theory, and stock market analysis. These findings are of particular interest to information theory (e.g., second-order properties of the re...
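The counting convention in the abstract above (overlapping copies counted separately) and the first moment of the count can be illustrated with a minimal Python sketch. It assumes a stationary first-order Markov source; the function names and the dictionary layout for `stationary` and `transition` are hypothetical choices for illustration, not notation from the paper. Only the mean is computed, which follows from linearity of expectation; the variance and the asymptotic regimes discussed in the abstract are not covered.

```python
from typing import Dict


def count_overlapping(pattern: str, text: str) -> int:
    """Count occurrences of pattern in text, counting overlapping copies separately."""
    m = len(pattern)
    if m == 0:
        return 0
    return sum(1 for i in range(len(text) - m + 1) if text[i:i + m] == pattern)


def expected_occurrences(
    pattern: str,
    n: int,
    stationary: Dict[str, float],
    transition: Dict[str, Dict[str, float]],
) -> float:
    """Expected number of overlapping occurrences of pattern in a text of length n.

    Assumes the text comes from a stationary first-order Markov chain, so the
    probability that the pattern starts at any given position is the same.
    """
    m = len(pattern)
    if m == 0 or n < m:
        return 0.0
    # Probability that the pattern occupies one fixed window of length m.
    prob = stationary[pattern[0]]
    for a, b in zip(pattern, pattern[1:]):
        prob *= transition[a][b]
    # Linearity of expectation over the n - m + 1 possible start positions.
    return (n - m + 1) * prob


if __name__ == "__main__":
    uniform = {c: 0.25 for c in "ACGT"}
    chain = {a: dict(uniform) for a in "ACGT"}  # memoryless special case
    print(count_overlapping("ACA", "ACACACA"))
    print(expected_occurrences("ACA", 10, uniform, chain))
```

The uniform chain in the usage block is the memoryless special case; any row-stochastic `transition` table paired with its stationary distribution can be passed instead.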
Eukaryotic Promoter Recognition
Genome Res., 1997
Cited by 55 (0 self)
http://gnomic.stanford.edu/~chris/GENSCANW.html). Because the signals that control the start and stop of transcription and translation, and the location of splicing, are still not very well understood, it is not uncommon for a gene-finding algorithm to confuse internal with initial and terminal exons, thus wrongly partitioning the exons. The problem is compounded by our incomplete understanding of alternative splicing control elements. Another line of development in gene identification is based on homology (e.g., Gish and States 1993; Gelfand et al. 1996). If there is a close homolog in the databases to one of the genes in the sequence under analysis, sequence similarity will usually group the exons for this gene correctly. Still, in many cases there is no close homolog and no guarantee when there is some homolog that the encoded protein lacks insertions/deletions. Clearly, some means of recognizing the beginnings of genes, probably via the promoter, or the ends, probabl
Computational Methods for the Identification of Genes in Vertebrate Genomic Sequences
Hum. Mol. Genet., 1997
Cited by 36 (3 self)
Research into new methods to identify genes in anonymous genomic sequences has been going on for more than 15 years. Over this period of time, the field has evolved from designing programs to identify protein coding regions in compact mitochondrial or bacterial genomes, to the challenge of predicting the detailed organization of multi-exon vertebrate genes. The best program currently available perfectly locates more than 80% of the internal coding exons, and only 5% of the predictions do not overlap a real exon. Given such accuracy, computational methods are indeed very useful; however, they do not alleviate the need for experimental validation. Although performance is satisfactory for the identification of the coding moiety of genes (internal coding exons), the determination of the full extent of the transcript (5′ and 3′ extremities of the gene) and the location of promoter regions are still unreliable. As the human and mouse genome sequencing projects enter a production mode, the fully automated annotation of megabase-long anonymous genomic sequences is the next big challenge in bioinformatics.
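The two accuracy figures quoted in the abstract above (the share of real exons located perfectly, and the share of predictions that overlap no real exon) can be computed as in the following Python sketch. It assumes exons are given as inclusive `(start, end)` coordinate pairs on a single strand; the function names and dictionary keys are hypothetical, and the review's own evaluation protocol may differ in detail.

```python
from typing import Dict, List, Tuple

Exon = Tuple[int, int]  # inclusive (start, end) coordinates, assumed same strand


def overlaps(a: Exon, b: Exon) -> bool:
    """True if the two inclusive intervals share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]


def exon_level_metrics(predicted: List[Exon], actual: List[Exon]) -> Dict[str, float]:
    """Fraction of real exons predicted exactly, and fraction of predictions
    that do not overlap any real exon."""
    predicted_set = set(predicted)
    exact = sum(1 for exon in actual if exon in predicted_set)
    wrong = sum(1 for p in predicted if not any(overlaps(p, a) for a in actual))
    return {
        "exact_fraction": exact / len(actual) if actual else 0.0,
        "wrong_fraction": wrong / len(predicted) if predicted else 0.0,
    }


if __name__ == "__main__":
    real = [(10, 20), (30, 40), (50, 60)]
    pred = [(10, 20), (30, 42), (70, 80)]
    # (10, 20) is exact, (30, 42) overlaps but misses a boundary, (70, 80) is wrong.
    print(exon_level_metrics(pred, real))
```

A prediction that overlaps a real exon but misses a boundary, such as `(30, 42)` here, counts toward neither figure, which is why the two fractions do not sum to one.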
What is bioinformatics? A proposed definition and overview of the field
Cited by 24 (2 self)
BACKGROUND: The recent flood of data from genome sequencing and functional genomics has given rise to a new field, bioinformatics, which combines elements of biology and computer science. OBJECTIVES: Here we propose a definition for this new field and review some of the research that is being pursued, particularly in relation to transcriptional regulatory systems. METHODS: Our definition is as follows: Bioinformatics is conceptualizing biology in terms of macromolecules (in the sense of physical chemistry) and then applying "informatics" techniques (derived from disciplines such as applied maths, computer science, and statistics) to understand and organize the information associated with these molecules, on a large scale. RESULTS & CONCLUSIONS: Analyses in bioinformatics predominantly focus on three types of large datasets available in molecular biology: macromolecular structures, genome sequences, and the results of functional genomics experiments (e.g., expression data). Additional information includes the text of scientific papers and "relationship data" from metabolic pathways, taxonomy trees, and protein-protein interaction networks. Bioinformatics employs a wide range of computational topics including sequence and structural alignment, database design and data mining, macromolecular geometry, phylogenetic tree construction, prediction of protein structure and function, gene finding, and expression data clustering. The emphasis is on approaches that integrate a variety of computational techniques and heterogeneous data sources. Finally, bioinformatics is a practical discipline. We survey some representative applications, such as finding homologues, designing drugs, and performing large-scale censuses. Additional information pertinent to the review is available over the w...
Identification of Genes in Human Genomic DNA
1997
Cited by 23 (1 self)
A general probabilistic model of the gene structural and compositional properties of human genomic DNA is introduced and applied to the problem of identifying genes in unannotated human genomic sequences. The model uses a "Hidden semi-Markov" or semi-Markov source architecture which incorporates probabilistic descriptions of fundamental transcriptional, translational and splicing signals, as well as length distributions and compositional features of exons, introns and intergenic regions. Distinct sets of model parameters are derived which account for many of the substantial differences in gene density and structure observed in distinct C+G compositional regions ("isochores") of the human genome. A novel model building procedure, termed Maximal Dependence Decomposition, is introduced which captures potentially important dependencies between non-adjacent as well as adjacent positions in a biological signal. Application of this model to the donor splice signal not only gives better discrimination of potential donor sites than previous probabilistic models, but also reveals subtle properties of this signal which suggest aspects of its biochemical function. Acceptor
A Comparative Genomics Approach to Prediction of New Members of Regulons
2001
Cited by 21 (6 self)
In this article, an approach is presented to evaluate the computational predictions of regulatory sites. We combine the prediction of transcription units having orthologous genes with the prediction of transcription factor binding sites based on probabilistic models. We augment the sets of genes in Escherichia coli that are expected to be regulated by two transcription factors, the cAMP receptor protein and the fumarate and nitrate reduction regulatory protein, through a comparison with the Haemophilus influenzae genome. At the same time, we learned more about the regulatory networks of H. influenzae, a species with much less experimental knowledge than E. coli. By studying orthologous genes subject to regulation by the same transcription factor, we also gained understanding of the evolution of the entire regulatory systems. The number of complete microbial genome sequences is increasing at an unprecedented rate. To date, 29 bacterial genomes have been determined, 11 more are in annotation stage, and 83 are in progress. This surge of sequence information provides an enormous amount of data for comparative genomics analysis. During the earlier stage of genomic analysis, most of the effort was devoted to analyses of protein-coding regions because, in the course of evolution, protein-coding sequences change much slower than the noncoding sequences (Koonin et al. 1997, 1998). These comparative genomics studies have proved highly informative, allowing functional assignments for many putative proteins in poorly studied organisms (Overbeek et al. 1999). One surprising result from these analyses was the lack of long-range conservation of gene order in bacterial genomes, with the exception of species within the same genus (Tatusov et al. 1996; Himmelreich et al. 1997). For species of intermediate phylogenetic distance, such as in Escherichi...
New Techniques for DNA Sequence Classification
1999
Cited by 9 (5 self)
DNA sequence classification is the activity of determining whether or not an unlabeled sequence S belongs to an existing class C. This paper proposes two new techniques for DNA sequence classification. The first technique works by comparing the unlabeled sequence S with a group of active motifs discovered from the elements of C and by distinction with elements outside of C. The second technique generates and matches gapped fingerprints of S with elements of C. Experimental results obtained by running these algorithms on long and well conserved Alu sequences demonstrate the good performance of the presented methods compared with FASTA. When applied to less conserved and relatively short functional sites such as splice-junctions, a variation of the second technique combining fingerprinting with consensus sequence analysis gives better results than the current classifiers employing text compression and machine learning algorithms.
2 INTRODUCTION
DNA sequence classification is an import...
Performance-Guarantee Gene Predictions via Spliced Alignment
1998
Cited by 9 (0 self)
An important and still unsolved problem in gene prediction is designing an algorithm that not only predicts genes but estimates the quality of individual predictions as well. Since experimental biologists are interested mainly in the reliability of individual predictions (rather than in the average reliability of an algorithm), we attempted to develop a gene recognition algorithm that guarantees a certain quality of predictions. We demonstrate here that the similarity level with a related protein is a reliable quality estimator for the spliced alignment approach to gene recognition. We also study the average performance of the spliced alignment algorithm for different targets on a complete set of human genomic sequences with known relatives and demonstrate that the average performance of the method remains high even for very distant targets. Using plant, fungal, and prokaryotic target proteins for recognition of human genes leads to accurate predictions with 95, 93, and 91% correlation coefficient, respectively. For target proteins with similarity score above 60%, not only is the average correlation coefficient very high (97% and up) but the quality of individual predictions is also guaranteed to be at least 82%. This indicates that for this level of similarity the worst case performance of the spliced alignment algorithm is better than the average case performance of many statistical gene recognition methods.
A Unified Approach to Word Statistics
1998
Cited by 8 (0 self)
Evaluation of the frequency of occurrences of a given set of patterns in a DNA sequence has numerous applications and has been extensively studied recently. We provide a unified framework for this evaluation that adapts to various constraints and allows previous results to be extended. We assume successively that the patterns may, then may not, overlap. We derive asymptotic and exact formulae for the moments in a Markovian model. We show that our formulae, which occasionally simplify previous results, are computable at low cost, which makes them useful for practical applications.
1 Introduction
Repeated patterns and related phenomena in sequences (also called words or strings) are studied in molecular biology. A survey on various methods can be found in [Li97]. One fundamental question that arises is the frequency of pattern occurrences in another string known as the text. This question is addressed below for a set of patterns (H_i) and various assumptions on the counting of possible overl...
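The difference between the two counting assumptions mentioned above (patterns may overlap, or may not) can be sketched in Python for a set of patterns. The non-overlapping rule used here is one common left-to-right convention, assumed for illustration: an occurrence is counted only if it starts after the end of the previously counted one. The function name and the `allow_overlap` parameter are hypothetical and not taken from the paper.

```python
from typing import Iterable


def count_pattern_set(patterns: Iterable[str], text: str, allow_overlap: bool = True) -> int:
    """Count occurrences of any pattern from a set in text.

    allow_overlap=True  : every occurrence of every pattern is counted.
    allow_overlap=False : scanning left to right by end position, an occurrence
                          is counted only if it starts after the last counted
                          occurrence has ended (assumed convention).
    """
    unique = sorted({p for p in patterns if p})  # drop duplicates and empty strings
    count = 0
    last_end = -1  # index of the final character of the last counted occurrence
    for end in range(len(text)):
        for p in unique:
            start = end - len(p) + 1
            if start < 0 or text[start:end + 1] != p:
                continue
            if allow_overlap:
                count += 1
            elif start > last_end:
                count += 1
                last_end = end
                break  # at most one counted occurrence may end at this position
    return count


if __name__ == "__main__":
    text = "ATATATA"
    print(count_pattern_set(["ATA"], text, allow_overlap=True))
    print(count_pattern_set(["ATA"], text, allow_overlap=False))
    print(count_pattern_set(["ATA", "TAT"], text, allow_overlap=True))
    print(count_pattern_set(["ATA", "TAT"], text, allow_overlap=False))
```

Self-overlapping words such as `ATA` are where the two conventions diverge, which is why the moments have to be derived separately for each case.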

