Results 1 - 10
of
193
On prediction using variable order Markov models
- JOURNAL OF ARTIFICIAL INTELLIGENCE RESEARCH
, 2004
"... This paper is concerned with algorithms for prediction of discrete sequences over a finite alphabet, using variable order Markov models. The class of such algorithms is large and in principle includes any lossless compression algorithm. We focus on six prominent prediction algorithms, including Cont ..."
Abstract
-
Cited by 105 (1 self)
- Add to MetaCart
(Show Context)
This paper is concerned with algorithms for prediction of discrete sequences over a finite alphabet, using variable order Markov models. The class of such algorithms is large and in principle includes any lossless compression algorithm. We focus on six prominent prediction algorithms, including Context Tree Weighting (CTW), Prediction by Partial Match (PPM) and Probabilistic Suffix Trees (PSTs). We discuss the properties of these algorithms and compare their performance using real life sequences from three domains: proteins, English text and music pieces. The comparison is made with respect to prediction quality as measured by the average log-loss. We also compare classification algorithms based on these predictors with respect to a number of large protein classification tasks. Our results indicate that a “decomposed” CTW (a variant of the CTW algorithm) and PPM outperform all other algorithms in sequence prediction tasks. Somewhat surprisingly, a different algorithm, which is a modification of the Lempel-Ziv compression algorithm, significantly outperforms all algorithms on the protein classification problems.
Profile-based string kernels for remote homology detection and motif extraction
- Journal of Bioinformatics and Computational Biology
, 2004
"... We introduce novel profile-based string kernels for use with support vector machines (SVMs) for the problems of protein classification and remote homology detection. These kernels use probabilistic profiles, such as those produced by the PSI-BLAST algorithm, to define position-dependent mutation nei ..."
Abstract
-
Cited by 103 (9 self)
- Add to MetaCart
(Show Context)
We introduce novel profile-based string kernels for use with support vector machines (SVMs) for the problems of protein classification and remote homology detection. These kernels use probabilistic profiles, such as those produced by the PSI-BLAST algorithm, to define position-dependent mutation neighborhoods along protein sequences for inexact matching of k-length subsequences (“k-mers”) in the data. By use of an efficient data structure, the kernels are fast to compute once the profiles have been obtained. For example, the time needed to run PSI-BLAST in order to build the profiles is significantly longer than both the kernel computation time and the SVM training time. We present remote homology detection experiments based on the SCOP database where we show that profile-based string kernels used with
Fast string kernels using inexact matching for protein sequences
- Journal of Machine Learning Research
, 2004
"... We describe several families of k-mer based string kernels related to the recently presented mismatch kernel and designed for use with support vector machines (SVMs) for classification of protein sequence data. These new kernels – restricted gappy kernels, substitution kernels, and wildcard kernels ..."
Abstract
-
Cited by 63 (1 self)
- Add to MetaCart
(Show Context)
We describe several families of k-mer based string kernels related to the recently presented mismatch kernel and designed for use with support vector machines (SVMs) for classification of protein sequence data. These new kernels – restricted gappy kernels, substitution kernels, and wildcard kernels – are based on feature spaces indexed by k-length subsequences (“k-mers”) from the string alphabet Σ. However, for all kernels we define here, the kernel value K(x,y) can be computed in O(cK(|x|+|y|)) time, where the constant cK depends on the parameters of the kernel but is independent of the size |Σ | of the alphabet. Thus the computation of these kernels is linear in the length of the sequences, like the mismatch kernel, but we improve upon the parameter-dependent constant cK = k m+1 |Σ | m of the (k,m)-mismatch kernel. We compute the kernels efficiently using a trie data structure and relate our new kernels to the recently described transducer formalism. In protein classification experiments on two benchmark SCOP data sets, we show that our new faster kernels achieve SVM classification performance comparable to the mismatch kernel and the Fisher kernel derived from profile hidden Markov models, and we investigate the dependence of the kernels on parameter choice.
Protein-ligand interaction prediction: an improved chemogenomics approach. Bioinformatics
, 2008
"... Motivation: Predicting interactions between small molecules and proteins is a crucial step to decipher many biological processes, and plays a critical role in drug discovery. When no detailed 3D struc-ture of the protein target is available, ligand-based virtual screening allows the construction of ..."
Abstract
-
Cited by 43 (3 self)
- Add to MetaCart
(Show Context)
Motivation: Predicting interactions between small molecules and proteins is a crucial step to decipher many biological processes, and plays a critical role in drug discovery. When no detailed 3D struc-ture of the protein target is available, ligand-based virtual screening allows the construction of predictive models by learning to discri-minate known ligands from non-ligands. However the accuracy of ligand-based models quickly degrades when the number of known ligands decreases, and in particular the approach is not applicable for orphan receptors with no known ligand. Results: We propose a systematic method to predict ligand-protein interactions, even for targets with no known 3D structure and few or no known ligands. Following the recent chemogenomics trend, we adopt a cross-target view and attempt to screen the chemical space against whole families of proteins simultaneously. The lack of known ligand for a given target can then be compensated by the availability of known ligands for similar targets. We test this strategy on three important classes of drug targets, namely enzymes, G-protein coupled recep-tors (GPCR) and ion channels, and report dramatic improvements in prediction accuracy over classical ligand-based virtual screening, in particular for targets with few or no known ligands. Availability: All data and algorithms are available as supplementary material. Contact:
Accurate identification of alternatively spliced exons using support vector machine
, 2004
"... Motivation: Alternative splicing is a major component of the regulation acting on mammalian transcriptomes. It is estimated that over half of all human genes have more than one splice variant. Previous studies have shown that alternatively spliced exons possess several features that distinguish them ..."
Abstract
-
Cited by 43 (2 self)
- Add to MetaCart
Motivation: Alternative splicing is a major component of the regulation acting on mammalian transcriptomes. It is estimated that over half of all human genes have more than one splice variant. Previous studies have shown that alternatively spliced exons possess several features that distinguish them from constitutively spliced ones. Recently, we have demonstrated that such features can be used to distinguish alternative from constitutive exons. In the current study we use advanced machine learning methods to generate robust alternative exons classifier. Results: We extracted several hundred local sequence features of constitutive as well as alternative exons. Using feature selection methods we find seven attributes that are dominant for the task of classification. Several less informative features help to slightly increase the performance of the classifier. The classifier achieves a true positive rate of 50 % for a false positive rate of 0.5%. This result enables one to reliably identify alternatively spliced exons in exon databases that are believed to be dominated by constitutive exons. Availability: Upon request from the authors.
Predicting linear B-cell epitopes using string kernels
, 2008
"... The identification and characterization of B-cell epitopes play an important role in vaccine design, immunodiagnostic tests, and antibody production. Therefore, computational tools for reliably predicting linear B-cell epitopes are highly desirable. We evaluated Support Vector Machine (SVM) classifi ..."
Abstract
-
Cited by 42 (3 self)
- Add to MetaCart
The identification and characterization of B-cell epitopes play an important role in vaccine design, immunodiagnostic tests, and antibody production. Therefore, computational tools for reliably predicting linear B-cell epitopes are highly desirable. We evaluated Support Vector Machine (SVM) classifiers trained utilizing five different kernel methods using fivefold cross-validation on a homology-reduced data set of 701 linear B-cell epitopes, extracted from Bcipep database, and 701 non-epitopes, randomly extracted from SwissProt sequences. Based on the results of our computational experiments, we propose BCPred, a novel method for predicting linear B-cell epitopes using the subsequence kernel. We show that the predictive performance of BCPred (AUC 0.758) outperforms 11 SVM-based classifiers developed and evaluated in our experiments as well as our implementation of AAP (AUC 0.7), a recently proposed method for predicting linear B-cell epitopes using amino acid pair antigenicity. Furthermore, we compared BCPred with AAP and ABCPred, a method that uses recurrent neural networks, using two data sets of unique B-cell epitopes that had been previously used to evaluate ABCPred. Analysis of the data sets used and the results of this comparison show that conclusions about the relative performance of different B-cell epitope prediction methods drawn on the basis of experiments using data sets of unique B-cell epitopes are likely to yield overly optimistic estimates of performance of evaluated methods. This argues for the use of carefully homology-reduced data sets in comparing B-cell epitope prediction methods to avoid misleading conclusions about how different methods compare to each other. Our homology-reduced data set and implementations of BCPred as well as the APP method are publicly available through
Large-scale prediction of disulphide bridges using kernel methods, two-dimensional recursive neural networks, and weighted graph matching. Proteins,
, 2006
"... ABSTRACT The formation of disulphide bridges between cysteines plays an important role in protein folding, structure, function, and evolution. Here, we develop new methods for predicting disulphide bridges in proteins. We first build a large curated data set of proteins containing disulphide bridge ..."
Abstract
-
Cited by 40 (5 self)
- Add to MetaCart
ABSTRACT The formation of disulphide bridges between cysteines plays an important role in protein folding, structure, function, and evolution. Here, we develop new methods for predicting disulphide bridges in proteins. We first build a large curated data set of proteins containing disulphide bridges to extract relevant statistics. We then use kernel methods to predict whether a given protein chain contains intrachain disulphide bridges or not, and recursive neural networks to predict the bonding probabilities of each pair of cysteines in the chain. These probabilities in turn lead to an accurate estimation of the total number of disulphide bridges and to a weighted graph matching problem that can be addressed efficiently to infer the global disulphide bridge connectivity pattern. This approach can be applied both in situations where the bonded state of each cysteine is known, or in ab initio mode where the state is unknown. Furthermore, it can easily cope with chains containing an arbitrary number of disulphide bridges, overcoming one of the major limitations of previous approaches. It can classify individual cysteine residues as bonded or nonbonded with 87% specificity and 89% sensitivity. The estimate for the total number of bridges in each chain is correct 71% of the times, and within one from the true value over 94% of the times. The prediction of the overall disulphide connectivity pattern is exact in about 51% of the chains. In addition to using profiles in the input to leverage evolutionary information, including true (but not predicted) secondary structure and solvent accessibility information yields small but noticeable improvements. Finally, once the system is trained, predictions can be computed rapidly on a proteomic or protein-engineering scale. The disulphide bridge prediction server (DIpro), software, and datasets are available through www.igb.uci.edu/servers/pass.html. Proteins 2006;62:617-629.
A computational study of offtarget effects of RNA interference
- Nucleic Acids Res
, 2005
"... RNA interference (RNAi) is an intracellular mechanism for post-transcriptional gene silencing that is frequently used to study gene function. RNAi is initiated by short interfering RNA (siRNA) of 21 nt in length, either generated from the double-stranded RNA(dsRNA)byusingtheenzymeDicerorintroduced e ..."
Abstract
-
Cited by 40 (0 self)
- Add to MetaCart
(Show Context)
RNA interference (RNAi) is an intracellular mechanism for post-transcriptional gene silencing that is frequently used to study gene function. RNAi is initiated by short interfering RNA (siRNA) of 21 nt in length, either generated from the double-stranded RNA(dsRNA)byusingtheenzymeDicerorintroduced experimentally. Following association with an RNAi silencing complex, siRNA targets mRNA transcripts that have sequence identity for destruction. A phenotype resulting from this knockdown of expression may inform about the function of the targeted gene. However, ‘off-target effects ’ compromise the specificity of RNAi if sequence identity between siRNA and random mRNA transcripts causes RNAi to
Linear-Time Computation of Similarity Measures for Sequential Data
, 2008
"... Efficient and expressive comparison of sequences is an essential procedure for learning with sequential data. In this article we propose a generic framework for computation of similarity measures for sequences, covering various kernel, distance and non-metric similarity functions. The basis for comp ..."
Abstract
-
Cited by 38 (24 self)
- Add to MetaCart
Efficient and expressive comparison of sequences is an essential procedure for learning with sequential data. In this article we propose a generic framework for computation of similarity measures for sequences, covering various kernel, distance and non-metric similarity functions. The basis for comparison is embedding of sequences using a formal language, such as a set of natural words, k-grams or all contiguous subsequences. As realizations of the framework we provide linear-time algorithms of different complexity and capabilities using sorted arrays, tries and suffix trees as underlying data structures. Experiments on data sets from bioinformatics, text processing and computer security illustrate the efficiency of the proposed algorithms—enabling peak performances of up to 10^6 pairwise comparisons per second. The utility of distances and non-metric similarity measures for sequences as alternatives to string kernels is demonstrated in applications of text categorization, network intrusion detection and transcription site recognition in DNA.
G.: Support vector machines and kernels for computational biology
- PLoS Computational Biology
, 2008
"... The increasing wealth of biological data coming from a large variety of plat-forms and the continued development of new high-throughput methods for probing biological systems require increasingly more sophisticated compu-tational approaches. Putting all these data in simple to use databases is ..."
Abstract
-
Cited by 37 (6 self)
- Add to MetaCart
(Show Context)
The increasing wealth of biological data coming from a large variety of plat-forms and the continued development of new high-throughput methods for probing biological systems require increasingly more sophisticated compu-tational approaches. Putting all these data in simple to use databases is