The UCSC table browser data retrieval tool
Nucleic Acids Res, 2004
Cited by 187 (18 self)
hgText) provides text-based access to a large collection of genome assemblies and annotation data stored in the Genome Browser Database. A flexible alternative to the graphical-based Genome Browser, this tool offers an enhanced level of query support that includes restrictions based on field values, free-form SQL queries and combined queries on multiple tables. Output can be filtered to restrict the fields and lines returned, and may be organized into one of several formats, including a simple tab-delimited file that can be loaded into a spreadsheet or database as well as advanced formats that may be uploaded into the Genome Browser as custom annotation tracks. The Table Browser User's Guide located on the UCSC website provides instructions and detailed examples for constructing queries and configuring output.
Spectral learning
In IJCAI, 2003
Cited by 106 (6 self)
We present a simple, easily implemented spectral learning algorithm which applies equally whether we have no supervisory information, pairwise link constraints, or labeled examples. In the unsupervised case, it performs consistently with other spectral clustering algorithms. In the supervised case, our approach achieves high accuracy on the categorization of thousands of documents given only a few dozen labeled training documents for the 20 Newsgroups data set. Furthermore, its classification accuracy increases with the addition of unlabeled documents, demonstrating effective use of unlabeled data. By using normalized affinity matrices which are both symmetric and stochastic, we also obtain both a probabilistic interpretation of our method and certain guarantees of performance.
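The abstract's idea of an affinity matrix that is both symmetric and stochastic can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the Gaussian affinity, the way pairwise link constraints overwrite entries, the additive normalization N = (A + d_max I - D) / d_max (one construction that has both properties), and all function names are hypothetical and not confirmed by the abstract.

```python
import math

def affinity_matrix(points, sigma=1.0):
    """Symmetric Gaussian affinities (assumed kernel); the diagonal is 1."""
    n = len(points)
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            A[i][j] = math.exp(-d2 / (2 * sigma ** 2))
    return A

def apply_constraints(A, must_link=(), cannot_link=()):
    """Overwrite affinities for constrained pairs, keeping A symmetric."""
    for i, j in must_link:
        A[i][j] = A[j][i] = 1.0
    for i, j in cannot_link:
        A[i][j] = A[j][i] = 0.0
    return A

def normalize_additive(A):
    """Return N = (A + d_max*I - D) / d_max.

    Only the diagonal is padded, so N stays symmetric, and each row
    sums to degree/d_max + (d_max - degree)/d_max = 1.
    """
    n = len(A)
    degrees = [sum(row) for row in A]
    d_max = max(degrees)
    N = [[A[i][j] / d_max for j in range(n)] for i in range(n)]
    for i in range(n):
        N[i][i] += (d_max - degrees[i]) / d_max
    return N

# Toy data: two well-separated pairs of points, plus one constraint of each kind.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
A = apply_constraints(affinity_matrix(points),
                      must_link=[(0, 1)], cannot_link=[(1, 2)])
N = normalize_additive(A)
for row in N:
    print([round(x, 3) for x in row])
```

Because each row of N sums to one, N can be read as the transition matrix of a random walk over the data points, which is one way to arrive at a probabilistic interpretation; a clustering or classification step would then work with the leading eigenvectors of N.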
Reconstructing large regions of an ancestral mammalian genome in silico
2004
Cited by 63 (7 self)
It is believed that most modern mammalian lineages arose from a series of rapid speciation events near the Cretaceous-Tertiary boundary. It is shown that such a phylogeny makes the common ancestral genome sequence an ideal target for reconstruction. Simulations suggest that with methods currently available, we can expect to get 98% of the bases correct in reconstructing megabase-scale euchromatic regions of a eutherian ancestral genome from the genomes of ∼20 optimally chosen modern mammals. Using actual genomic sequences from 19 extant mammals, we reconstruct 1.1 Mb of ancient genome sequence around the CFTR locus. Detailed examination suggests the reconstruction is accurate and that it allows us to identify features in modern species, such as remnants of ancient transposon insertions, that were not identified by direct analysis. Tracing the predicted evolutionary history of the bases in the reconstructed region, estimates are made of the amount of DNA turnover due to insertion, deletion, and substitution in the different placental mammalian lineages since the common eutherian ancestor, showing considerable variation between lineages. In coming years, such reconstructions may help in identifying and understanding the genetic features common to eutherian mammals and may shed light on the evolution of human or primate-specific traits.
Computational Identification of Evolutionarily Conserved Exons
2004
Cited by 62 (13 self)
Phylogenetic hidden Markov models (phylo-HMMs) have recently been proposed as a means for addressing a multispecies version of the ab initio gene prediction problem. These models allow sequence divergence, a phylogeny, patterns of substitution, and base composition all to be considered simultaneously, in a single unified probabilistic model. Here, we apply phylo-HMMs to a restricted version of the gene prediction problem in which individual exons are sought that are evolutionarily conserved across a diverse set of species. We discuss two new methods for improving prediction performance: (1) the use of context-dependent phylogenetic models, which capture phenomena such as a strong CpG effect in noncoding regions and a preference for synonymous rather than nonsynonymous substitutions in coding regions; and (2) a novel strategy for incorporating insertions and deletions (indels) into the state-transition structure of the model, which captures the different characteristic patterns of alignment gaps in coding and noncoding regions. We also discuss the technique, previously used in pairwise gene predictors, of explicitly modeling conserved noncoding sequence to help reduce false positive predictions. These methods have been incorporated into an exon prediction program called ExoniPhy, and tested with two large datasets. Experimental results indicate that all three methods produce significant improvements in prediction performance. In combination, they lead to prediction accuracy comparable to that of some of the best available gene predictors, despite several limitations of our current models.
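As a rough illustration of how a phylo-HMM combines a phylogeny with a hidden Markov model, the sketch below scores each alignment column with Felsenstein's pruning algorithm under a state-specific substitution rate, then runs a scaled forward algorithm across columns. This is not ExoniPhy or any published model: the Jukes-Cantor substitution model, the two states, the toy tree, the rate multipliers, and the transition probabilities are all assumptions chosen for brevity, and gaps are not handled.

```python
import math

BASES = "ACGT"

# Hypothetical tree: a leaf is a species name; an internal node is a tuple
# of (child, branch_length) pairs.
TREE = (
    ((("human", 0.05), ("chimp", 0.05)), 0.10),
    ("mouse", 0.30),
)

# Assumed HMM: each state scales all branch lengths by a rate multiplier.
RATES = {"conserved": 0.3, "neutral": 1.0}
START = {"conserved": 0.5, "neutral": 0.5}
TRANS = {"conserved": {"conserved": 0.9, "neutral": 0.1},
         "neutral": {"conserved": 0.1, "neutral": 0.9}}

def jc_prob(a, b, t):
    """Jukes-Cantor probability that base a becomes base b over length t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e

def partials(node, column, rate):
    """Pruning step: P(leaves below node | base at node), for each base."""
    if isinstance(node, str):  # leaf: compare with the observed base
        return [1.0 if base == column[node] else 0.0 for base in BASES]
    result = [1.0] * 4
    for child, length in node:
        below = partials(child, column, rate)
        for i, a in enumerate(BASES):
            result[i] *= sum(jc_prob(a, b, rate * length) * below[j]
                             for j, b in enumerate(BASES))
    return result

def column_likelihood(tree, column, rate):
    """Likelihood of one alignment column, with a uniform root distribution."""
    return sum(0.25 * p for p in partials(tree, column, rate))

def log_likelihood(tree, alignment):
    """Scaled forward algorithm over the alignment columns."""
    forward, total = None, 0.0
    for column in alignment:
        emit = {s: column_likelihood(tree, column, r) for s, r in RATES.items()}
        if forward is None:
            forward = {s: START[s] * emit[s] for s in RATES}
        else:
            forward = {s: emit[s] * sum(forward[p] * TRANS[p][s] for p in RATES)
                       for s in RATES}
        scale = sum(forward.values())
        total += math.log(scale)
        forward = {s: v / scale for s, v in forward.items()}
    return total

# Toy ungapped alignment, one dict per column mapping species to base.
seqs = {"human": "ACGTAC", "chimp": "ACGTAC", "mouse": "ACGAAT"}
alignment = [{sp: s[i] for sp, s in seqs.items()} for i in range(6)]
print(log_likelihood(TREE, alignment))
```

A gene or exon predictor in this style would use more states (for example coding, noncoding, and splice-site states), richer substitution models per state, and Viterbi or posterior decoding rather than only the total likelihood.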
Learning Nonsingular Phylogenies and Hidden Markov Models
Proceedings of the thirty-seventh annual ACM Symposium on Theory of Computing, Baltimore (STOC05), 2005
Cited by 45 (7 self)
In this paper, we study the problem of learning phylogenies and hidden Markov models. We call the Markov model nonsingular if all transition matrices have determinants bounded away from 0 (and 1). We highlight the role of the nonsingularity condition for the learning problem. Learning hidden Markov models without the nonsingularity condition is at least as hard as learning parity with noise. On the other hand, we give a polynomial-time algorithm for learning nonsingular phylogenies and hidden Markov models.
Characterization of intron loss events in mammals. Genome Res, epub in advance of publication:gr.5703406
Multiple-sequence functional annotation and the generalized hidden Markov phylogeny
2004
Cited by 37 (5 self)
Motivation: Phylogenetic shadowing is a comparative genomics principle that allows for the discovery of conserved regions in sequences from multiple closely related organisms. We develop a formal probabilistic framework for combining phylogenetic shadowing with feature-based functional annotation methods. The resulting model, a generalized hidden Markov phylogeny (GHMP), applies to a variety of situations where functional regions are to be inferred from evolutionary constraints. Results: We show how GHMPs can be used to predict complete shared gene structures in multiple primate sequences. We also describe shadower, our implementation of such a prediction system. We find that shadower outperforms previously reported ab initio gene finders, including comparative human–mouse approaches, on a small sample of diverse exonic regions. Finally, we report on an empirical analysis of shadower’s performance which reveals that as few as five well-chosen species may suffice to attain maximal sensitivity and specificity in exon demarcation. Availability: A Web server is available at
Phylogenetic hidden Markov models
In Statistical Methods in Molecular Evolution, 2005
Cited by 37 (6 self)
Phylogenetic hidden Markov models, or phylo-HMMs, are probabilistic models that consider not only the way substitutions occur through evolutionary history at each site of a genome, but also the way this process changes from one site to the next. By treating molecular evolution as a combination of two Markov processes, one that operates in the dimension of space (along a genome) and one that operates in the dimension of time (along the branches of a phylogenetic tree), these models allow aspects of both sequence structure and sequence evolution to be captured. Moreover, as we will discuss, they permit key computations to be performed exactly and efficiently. Phylo-HMMs allow evolutionary information to be brought to bear on a wide variety of problems of sequence “segmentation,” such as gene prediction and the identification of conserved elements. Phylo-HMMs were first proposed as a way of improving phylogenetic models that allow for variation among sites in the rate of substitution [8, 52]. Soon afterward, they were adapted for the problem of secondary structure
CONTRAST: A Discriminative, Phylogeny-free Approach to Multiple Informant De Novo Gene Prediction
Cited by 14 (1 self)
We describe CONTRAST, the first system for vertebrate protein-coding gene prediction to successfully use the information present in multiple alignments to achieve greater accuracy than the best method based on two-species alignments. CONTRAST predicts exact coding region structures for 65% more genes than the previous state-of-the-art method, misses 46% fewer exons, and displays comparable gains in specificity. Background: In this work, we consider the task of predicting the locations and structures of the protein-coding genes in a genome. Gene recognition is one of the best-studied problems in computational biology, and as such has been approached through the use of a wide variety of different methods. Gene recognition methods can be broadly divided into three categories, depending on the type of information they employ. Ab initio predictors use only DNA sequence from the genome in which predictions are desired (the target genome). Predictors such as GENSCAN [1], Genie [2], and CRAIG [3] fall into this category. De novo gene predictors additionally make use of aligned DNA sequence from other genomes (informant genomes). Alignments can increase predictive accuracy since protein-coding genes exhibit