Results 1 - 10 of 1,344
Pfam protein families database
- Nucleic Acids Research, 2008, 36(Database issue): D281–D288
Cited by 771 (13 self)
Pfam is a comprehensive collection of protein domains and families, represented as multiple sequence alignments and as profile hidden Markov models. The current release of Pfam (22.0) contains 9318 protein families. Pfam is now based not only on the UniProtKB sequence database, but also on NCBI GenPept and on sequences from selected metagenomics projects. Pfam is available on the web from the consortium members using a new, consistent and improved website design in the UK
Protein homology detection by HMM-HMM comparison
- Bioinformatics, 2005
Cited by 401 (8 self)
Motivation: Protein homology detection and sequence alignment are at the basis of protein structure prediction, function prediction, and evolution. Results: We have generalized the alignment of protein sequences with a profile hidden Markov model (HMM) to the case of pairwise alignment of profile HMMs. We present a method for detecting distant homologous relationships between proteins based on this approach. The method (HHsearch) is benchmarked together with BLAST, PSI-BLAST, HMMER, and the profile-profile comparison tools PROF_SIM and COMPASS, in an all-against-all comparison of a database of 3691 protein domains from SCOP 1.63 with pairwise sequence identities below 20%. Sensitivity: When predicted secondary structure is included in the HMMs, HHsearch is able to detect between 2.7 and 4.2 times more homologs than PSI-BLAST or HMMER and between 1.44 and 1.9 times more than COMPASS or PROF_SIM for a rate of false positives of 10%. Approximately half of the improvement over the profile-profile comparison methods is attributable to the use of profile HMMs in place of simple profiles. Alignment quality: Higher sensitivity is mirrored by an increased alignment quality. HHsearch produced 1.2, 1.7, and 3.3 times more good alignments (“balanced” score > 0.3) than the next best method (COMPASS), and 1.6, 2.9, and 9.4 times more than PSI-BLAST, at the family, superfamily, and fold level. Speed: HHsearch scans a query of 200 residues against 3691 domains in 33 s on an AMD64 3 GHz PC. This is 10 times faster than PROF_SIM and 17 times faster than
A hidden Markov model for predicting transmembrane helices in protein sequences
- In Proceedings of the 6th International Conference on Intelligent Systems for Molecular Biology (ISMB), 1998
Cited by 373 (9 self)
A novel method to model and predict the location and orientation of alpha helices in membrane-spanning proteins is presented. It is based on a hidden Markov model (HMM) with an architecture that corresponds closely to the biological system. The model is cyclic with 7 types of states for helix core, helix caps on either side, loop on the cytoplasmic side, two loops for the non-cytoplasmic side, and a globular domain state in the middle of each loop. The two loop paths on the non-cytoplasmic side are used to model short and long loops separately, which corresponds biologically to the two known different membrane insertion mechanisms. The close mapping between the biological and computational states allows us to infer which parts of the model architecture are important to capture the information that encodes the membrane topology, and to gain a better understanding of the mechanisms and constraints involved. Models were estimated both by maximum likelihood and a discriminative method, and a method for reassignment of the membrane helix boundaries was developed. In a cross-validated test on single sequences, our transmembrane HMM, TMHMM, correctly predicts the entire topology for 77% of the sequences in a standard dataset of 83 proteins with known topology. The same accuracy was achieved on a larger dataset of 160 proteins. These results compare favourably with existing methods.
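To make the idea of labelling residues with an HMM concrete, here is a minimal sketch in Python. It is not the TMHMM model: it assumes a hypothetical two-state model (L for loop, M for membrane helix) over a two-letter alphabet (h for hydrophobic, p for polar), with made-up probabilities, and it uses standard Viterbi decoding, which the abstract does not name as the decoding method.

```python
import math


def viterbi(obs, states, start, trans, emit):
    """Return the most probable state path for a non-empty observation string."""
    lg = math.log
    # Best log-probability of any path ending in each state at the first position.
    score = {s: lg(start[s]) + lg(emit[s][obs[0]]) for s in states}
    backptrs = []
    for symbol in obs[1:]:
        new_score, ptr = {}, {}
        for s in states:
            # Choose the predecessor state that maximizes the path score into s.
            best_prev = max(states, key=lambda r: score[r] + lg(trans[r][s]))
            new_score[s] = (
                score[best_prev] + lg(trans[best_prev][s]) + lg(emit[s][symbol])
            )
            ptr[s] = best_prev
        score = new_score
        backptrs.append(ptr)
    # Trace back from the best final state.
    state = max(states, key=lambda s: score[s])
    path = [state]
    for ptr in reversed(backptrs):
        state = ptr[state]
        path.append(state)
    return "".join(reversed(path))


# Hypothetical toy parameters (illustrative only, not taken from the paper).
STATES = ("L", "M")
START = {"L": 0.5, "M": 0.5}
TRANS = {"L": {"L": 0.9, "M": 0.1}, "M": {"L": 0.1, "M": 0.9}}
EMIT = {"L": {"h": 0.3, "p": 0.7}, "M": {"h": 0.8, "p": 0.2}}

labels = viterbi("pppphhhhhhhhpppp", STATES, START, TRANS, EMIT)
print(labels)
```

The architecture described in the abstract has 7 state types and a cyclic layout; the same decoding routine would apply to it once the corresponding transition and emission tables are supplied.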
Sequence Comparisons Using Multiple Sequences Detect Three Times as Many Remote . . .
, 1998
Cited by 244 (16 self)
The sequences of related proteins can diverge beyond the point where their relationship can be recognised by pairwise sequence comparisons. In attempts to overcome this limitation, methods have been developed that use as a query, not a single sequence, but sets of related sequences or a representation of the characteristics shared by related sequences. Here we describe an assessment of three of these methods: the SAM-T98 implementation of a hidden Markov model procedure; PSI-BLAST; and the intermediate sequence search (ISS) procedure. We determined the extent to which these procedures can detect evolutionary relationships between the members of the sequence database PDBD40-J. This database, derived from the structural classification of proteins (SCOP), contains the sequences of proteins of known structure whose sequence identities with each other are 40% or less. The evolutionary relationships that exist between those that have low sequence identities were found by the examination of their structural details and, in many cases, their functional
Assignment of homology to genome sequences using a library of hidden Markov models that represent all proteins of known structure
- J. Mol. Biol., 2001
Cited by 203 (25 self)
Protein structure prediction, to discover the fold and hence information about the probable function of the sequence of a gene about which nothing is known, is possible via homology to a sequence of
Identification of regulatory regions which confer muscle-specific gene expression
- J. Mol. Biol., 1998
Cited by 181 (13 self)
For many newly sequenced genes, sequence analysis of the putative protein yields no clue on function. It would be beneficial to be able to identify in the genome the regulatory regions that confer temporal and spatial expression patterns for the uncharacterized genes. Additionally, it would be advantageous to identify regulatory regions within genes of known expression pattern without performing the costly and time-consuming laboratory studies now required. To achieve these goals, the wealth of case studies performed over the past 15 years will have to be collected into predictive models of expression. Extensive studies of genes expressed in skeletal muscle have identified specific transcription factors which bind to regulatory elements to control gene expression. However, potential binding sites for these factors occur with sufficient frequency that it is rare for a gene to be found without one. Analysis of experimentally determined muscle regulatory sequences indicates that muscle expression requires multiple elements in close proximity. A model is generated with predictive capability for identifying these muscle-specific regulatory modules. Phylogenetic footprinting, the identification of sequences conserved between distantly related species, complements the statistical predictions. Through the use of logistic regression analysis, the model promises to be easily modified to take advantage of the elucidation of additional factors, cooperation rules, and spacing constraints.
Review: Protein Secondary Structure Prediction Continues to Rise
- J. Struct. Biol., 2001
Cited by 180 (22 self)
f prediction accuracy? We shall see. 2001 Academic Press INTRODUCTION History. Linus Pauling correctly guessed the formation of helices and strands (14, 15) (and falsely hypothesized other structures). Three years before Pauling's guess was verified by the publications of the first X-ray structures (16, 17), one group had already ventured to predict secondary structure from sequence (18). The first-generation prediction methods following in the 1960s and 1970s were all based on single amino acid propensities (19). The second-generation methods dominating the scene until the early 1990s used propensities for segments of 3–51 adjacent residues (19). Basically any imaginable theoretical algorithm had been applied to the problem of predicting secondary structure from sequence. However, it seemed that prediction accuracy stalled at levels slightly above 60% (percentage of residues predicted correctly in one of the three states: helix, strand, and other). The reason for this limit was the
Dirichlet Mixtures: A Method for Improving Detection of Weak but Significant Protein Sequence Homology
, 1996
Cited by 175 (24 self)
This paper presents the mathematical foundations of Dirichlet mixtures, which have been used to improve database search results for homologous sequences, when a variable number of sequences from a protein family or domain are known. We present a method for condensing the information in a protein database into a mixture of Dirichlet densities. These mixtures are designed to be combined with observed amino acid frequencies, to form estimates of expected amino acid probabilities at each position in a profile, hidden Markov model, or other statistical model. These estimates give a statistical model greater generalization capacity, such that remotely related family members can be more reliably recognized by the model. Dirichlet mixtures have been shown to outperform substitution matrices and other methods for computing these expected amino acid distributions in database search, resulting in fewer false positives and false negatives for the families tested. This paper corrects a previously p...
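As an illustration of combining a Dirichlet mixture with observed counts, the following Python sketch computes posterior mean probabilities for one profile column. It uses the standard Bayesian formula for a mixture-of-Dirichlets prior: each component is weighted by its posterior probability given the counts, and contributes (n_i + alpha_i) / (N + |alpha|). The function names, the 4-letter toy alphabet, and the parameter values are assumptions made for this example, not values from the paper.

```python
import math


def log_beta(v):
    """Log of the multivariate Beta function: sum(lgamma(v_i)) - lgamma(sum(v))."""
    return sum(math.lgamma(x) for x in v) - math.lgamma(sum(v))


def dirichlet_mixture_estimate(counts, weights, alphas):
    """Posterior mean symbol probabilities for one column.

    counts  : observed count per symbol (non-negative)
    weights : mixture coefficients, one per component, summing to 1
    alphas  : one vector of positive Dirichlet parameters per component
    """
    # Unnormalized log posterior of each component given the counts:
    # log q_j + log B(n + alpha_j) - log B(alpha_j).
    # The multinomial coefficient is the same for all components and cancels.
    log_post = [
        math.log(q)
        + log_beta([n + a for n, a in zip(counts, alpha)])
        - log_beta(alpha)
        for q, alpha in zip(weights, alphas)
    ]
    # Normalize in a numerically stable way (subtract the max before exp).
    top = max(log_post)
    post = [math.exp(lp - top) for lp in log_post]
    norm = sum(post)
    post = [p / norm for p in post]

    total = sum(counts)
    probs = [0.0] * len(counts)
    for p_j, alpha in zip(post, alphas):
        denom = total + sum(alpha)
        for i, (n, a) in enumerate(zip(counts, alpha)):
            probs[i] += p_j * (n + a) / denom
    return probs


# Hypothetical example: 4-symbol alphabet, two components with made-up parameters.
counts = [3, 1, 0, 0]
weights = [0.6, 0.4]
alphas = [[2.0, 0.5, 0.5, 0.5], [0.5, 0.5, 2.0, 2.0]]
estimate = dirichlet_mixture_estimate(counts, weights, alphas)
print(estimate)
```

With a single component the estimate reduces to ordinary pseudocount smoothing; with several components, the counts decide which component's pseudocounts dominate, which is what lets a small number of observed sequences generalize to related family members.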
RSEARCH: Finding homologs of single structured RNA sequences
- BMC Bioinformatics, 2003
Cited by 170 (3 self)
Background: Many trans-acting noncoding RNA genes and cis-acting RNA regulatory elements conserve secondary structure rather than primary sequence. Most homology search tools only look at the primary sequence level, however.