Results 1 - 10 of 376
Statistical pattern recognition: A review
- IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2000
"... The primary goal of pattern recognition is supervised or unsupervised classification. Among the various frameworks in which pattern recognition has been traditionally formulated, the statistical approach has been most intensively studied and used in practice. More recently, neural network techniques ..."
Abstract
-
Cited by 1035 (30 self)
- Add to MetaCart
Abstract: The primary goal of pattern recognition is supervised or unsupervised classification. Among the various frameworks in which pattern recognition has been traditionally formulated, the statistical approach has been most intensively studied and used in practice. More recently, neural network techniques and methods imported from statistical learning theory have been receiving increasing attention. The design of a recognition system requires careful attention to the following issues: definition of pattern classes, sensing environment, pattern representation, feature extraction and selection, cluster analysis, classifier design and learning, selection of training and test samples, and performance evaluation. In spite of almost 50 years of research and development in this field, the general problem of recognizing complex patterns with arbitrary orientation, location, and scale remains unsolved. New and emerging applications, such as data mining, web searching, retrieval of multimedia data, face recognition, and cursive handwriting recognition, require robust and efficient pattern recognition techniques. The objective of this review paper is to summarize and compare some of the well-known methods used in various stages of a pattern recognition system and identify research topics and applications which are at the forefront of this exciting and challenging field.
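A minimal sketch of the recognition-system stages the abstract enumerates (representation, feature selection, classifier learning, train/test splitting, evaluation), assuming scikit-learn and synthetic data; every name and parameter below is an illustrative assumption, not taken from the paper.

```python
# Sketch of a statistical pattern-recognition pipeline on synthetic data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=500, n_features=30, n_informative=8,
                           random_state=0)                    # synthetic patterns
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)                      # training and test samples

clf = Pipeline([
    ("scale", StandardScaler()),                  # pattern representation / normalisation
    ("select", SelectKBest(f_classif, k=10)),     # feature extraction and selection
    ("model", LogisticRegression(max_iter=1000)), # statistical classifier design and learning
])
clf.fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, clf.predict(X_test)))  # performance evaluation
```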
Large scale multiple kernel learning
- JOURNAL OF MACHINE LEARNING RESEARCH, 2006
"... While classical kernel-based learning algorithms are based on a single kernel, in practice it is often desirable to use multiple kernels. Lanckriet et al. (2004) considered conic combinations of kernel matrices for classification, leading to a convex quadratically constrained quadratic program. We s ..."
Abstract
-
Cited by 340 (20 self)
- Add to MetaCart
Abstract: While classical kernel-based learning algorithms are based on a single kernel, in practice it is often desirable to use multiple kernels. Lanckriet et al. (2004) considered conic combinations of kernel matrices for classification, leading to a convex quadratically constrained quadratic program. We show that it can be rewritten as a semi-infinite linear program that can be efficiently solved by recycling the standard SVM implementations. Moreover, we generalize the formulation and our method to a larger class of problems, including regression and one-class classification. Experimental results show that the proposed algorithm scales to hundreds of thousands of examples or hundreds of kernels to be combined, and helps with automatic model selection, improving the interpretability of the learning result. In a second part we discuss general speed-up mechanisms for SVMs, especially when used with sparse feature maps as they appear for string kernels, allowing us to train a string kernel SVM on a 10-million-example real-world splice data set from computational biology. We integrated multiple kernel learning in our machine learning toolbox SHOGUN, for which the source code is publicly available at
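A minimal sketch of the object the abstract studies, a conic (nonnegative-weighted) combination of kernel matrices used by a standard SVM. The weights beta are fixed here purely for illustration; the paper's contribution is learning them via a semi-infinite linear program, which this sketch does not implement, and the data and kernel choices are assumptions.

```python
# Conic combination of precomputed kernels fed to an SVM (simplified MKL setup).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics.pairwise import linear_kernel, rbf_kernel, polynomial_kernel
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=20, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=0)

def combined_kernel(A, B, beta):
    """Conic combination sum_k beta_k * K_k(A, B) with beta_k >= 0."""
    kernels = [linear_kernel(A, B),
               rbf_kernel(A, B, gamma=0.1),
               polynomial_kernel(A, B, degree=2)]
    return sum(b * K for b, K in zip(beta, kernels))

beta = np.array([0.2, 0.5, 0.3])   # assumed fixed weights; MKL would learn these
svm = SVC(kernel="precomputed", C=1.0)
svm.fit(combined_kernel(Xtr, Xtr, beta), ytr)
pred = svm.predict(combined_kernel(Xte, Xtr, beta))
print("accuracy:", (pred == yte).mean())
```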
PROBCONS: Probabilistic consistency-based multiple sequence alignment
- Genome Res, 2005
"... To study gene evolution across a wide range of organisms, biologists need accurate tools for multiple sequence alignment of protein families. Obtaining accurate alignments, however, is a difficult computational problem because of not only the high computational cost but also the lack of proper objec ..."
Abstract
-
Cited by 256 (10 self)
- Add to MetaCart
(Show Context)
Abstract: To study gene evolution across a wide range of organisms, biologists need accurate tools for multiple sequence alignment of protein families. Obtaining accurate alignments, however, is a difficult computational problem because of not only the high computational cost but also the lack of proper objective functions for measuring alignment quality. In this paper, we introduce probabilistic consistency, a novel scoring function for multiple sequence comparisons. We present PROBCONS, a practical tool for progressive protein multiple sequence alignment based on probabilistic consistency, and evaluate its performance on several standard alignment benchmark datasets. On the BAliBASE, SABmark, and PREFAB benchmark alignment databases, PROBCONS achieves statistically significant improvement over other leading methods while maintaining practical speed. PROBCONS is publicly available as a web resource. Source code and executables are available under the GNU Public License at
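A toy sketch of the consistency transformation behind probabilistic-consistency scoring: posterior alignment probabilities between sequences x and y are re-estimated by averaging over intermediate sequences z, P'(x_i ~ y_j) = (1/|S|) Σ_z Σ_k P(x_i ~ z_k) P(z_k ~ y_j), which reduces to matrix products of posterior matrices. The posteriors below are random placeholders; PROBCONS derives them from a pair-HMM, which is not implemented here.

```python
# Consistency transformation on toy posterior alignment-probability matrices.
import numpy as np

rng = np.random.default_rng(0)
lengths = {"x": 6, "y": 7, "z": 5}             # toy sequence lengths
seqs = list(lengths)

def toy_posterior(m, n):
    """Random stand-in for a pair-HMM posterior alignment-probability matrix."""
    P = rng.random((m, n))
    return P / P.sum(axis=1, keepdims=True)

# P[a][b][i, j] ~ posterior probability that position i of a aligns to position j of b
P = {a: {b: (np.eye(lengths[a]) if a == b else toy_posterior(lengths[a], lengths[b]))
         for b in seqs} for a in seqs}

def consistency(P, a, b, seqs):
    return sum(P[a][z] @ P[z][b] for z in seqs) / len(seqs)

P_xy_new = consistency(P, "x", "y", seqs)
print(P_xy_new.shape)   # (6, 7): re-estimated posteriors used for progressive alignment
```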
Evaluating the predictive performance of habitat models developed using logistic regression
- Ecological Modelling, 2000
"... The use of statistical models to predict the likely occurrence or distribution of species is becoming an increasingly important tool in conservation planning and wildlife management. Evaluating the predictive performance of models using independent data is a vital step in model development. Such eva ..."
Abstract
-
Cited by 191 (3 self)
- Add to MetaCart
(Show Context)
Abstract: The use of statistical models to predict the likely occurrence or distribution of species is becoming an increasingly important tool in conservation planning and wildlife management. Evaluating the predictive performance of models using independent data is a vital step in model development. Such evaluation assists in determining the suitability of a model for specific applications, facilitates comparative assessment of competing models and modelling techniques, and identifies aspects of a model most in need of improvement. The predictive performance of habitat models developed using logistic regression needs to be evaluated in terms of two components: reliability or calibration (the agreement between predicted probabilities of occurrence and observed proportions of sites occupied), and discrimination capacity (the ability of a model to correctly distinguish between occupied and unoccupied sites). Lack of reliability can be attributed to two systematic sources, calibration bias and spread. Techniques are described for evaluating both of these sources of error. The discrimination capacity of logistic regression models is often measured by cross-classifying observations and predictions in a two-by-two table, and calculating indices of classification performance. However, this approach relies on the essentially arbitrary choice of a threshold probability to determine whether or not a site is predicted to be occupied. An alternative approach is described which measures discrimination capacity in terms of the area under a relative operating characteristic (ROC) curve relating relative proportions of correctly and incorrectly classified predictions over a wide and continuous range of threshold levels. Wider application of the techniques promoted in this paper could greatly improve understanding of the usefulness, and potential limitations, of habitat models developed for use in conservation planning and wildlife management. © 2000 Elsevier
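A hedged sketch of the two evaluation components the abstract names for a logistic-regression habitat model: calibration (predicted occurrence probabilities versus observed occupancy) and threshold-free discrimination via the area under the ROC curve. The covariates and occupancy labels are synthetic assumptions.

```python
# Calibration and ROC-AUC evaluation of a logistic-regression habitat model (synthetic data).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.calibration import calibration_curve

X, y = make_classification(n_samples=1000, n_features=6, n_informative=4,
                           random_state=1)        # stand-in for site covariates / occupancy
X_fit, X_eval, y_fit, y_eval = train_test_split(X, y, test_size=0.5, random_state=1)

model = LogisticRegression(max_iter=1000).fit(X_fit, y_fit)
p = model.predict_proba(X_eval)[:, 1]             # predicted probability of occurrence

# Discrimination: area under the ROC curve, independent of any single threshold
print("ROC AUC:", roc_auc_score(y_eval, p))

# Calibration: observed proportion occupied within bins of predicted probability
obs, pred = calibration_curve(y_eval, p, n_bins=10)
for o, q in zip(obs, pred):
    print(f"predicted {q:.2f}  observed {o:.2f}")
```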
Principles and practical application of the receiver-operating characteristic analysis for diagnostic tests
- Preventive Veterinary Medicine, 2000
"... Abstract We review the principles and practical application of receiver-operating characteristic (ROC) analysis for diagnostic tests. ROC analysis can be used for diagnostic tests with outcomes measured on ordinal, interval or ratio scales. The dependence of the diagnostic sensitivity and specifici ..."
Abstract
-
Cited by 122 (0 self)
- Add to MetaCart
(Show Context)
Abstract: We review the principles and practical application of receiver-operating characteristic (ROC) analysis for diagnostic tests. ROC analysis can be used for diagnostic tests with outcomes measured on ordinal, interval or ratio scales. The dependence of the diagnostic sensitivity and specificity on the selected cut-off value must be considered for a full test evaluation and for test comparison. All possible combinations of sensitivity and specificity that can be achieved by changing the test's cut-off value can be summarised using a single parameter: the area under the ROC curve. The ROC technique can also be used to optimise cut-off values with regard to a given prevalence in the target population and cost ratio of false-positive and false-negative results. However, plots of optimisation parameters against the selected cut-off value provide a more direct method for cut-off selection. Candidates for such optimisation parameters are linear combinations of sensitivity and specificity (with weights selected to reflect the decision-making situation), odds ratio, chance-corrected measures of association (e.g. kappa) and likelihood ratios. We discuss some recent developments in ROC analysis, including meta-analysis of diagnostic tests, correlated ROC curves (paired-sample design) and chance- and prevalence-corrected ROC curves. © 2000 Elsevier Science B.V. All rights reserved.
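A hedged sketch of two ideas from this abstract: summarising all sensitivity/specificity pairs by the area under the ROC curve, and choosing a cut-off by maximising an optimisation parameter (here a weighted sum of sensitivity and specificity) over candidate cut-off values. The simulated test results and the weights are invented for illustration.

```python
# Area under the ROC curve and cut-off optimisation for a simulated diagnostic test.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(0)
# Continuous test results for diseased (1) and healthy (0) animals
y = np.concatenate([np.ones(200), np.zeros(400)])
score = np.concatenate([rng.normal(2.0, 1.0, 200), rng.normal(0.0, 1.0, 400)])

fpr, tpr, thresholds = roc_curve(y, score)
print("area under ROC curve:", roc_auc_score(y, score))

w_se, w_sp = 0.7, 0.3                        # assumed weights reflecting the decision context
objective = w_se * tpr + w_sp * (1 - fpr)    # weighted combination of sensitivity, specificity
best = np.argmax(objective)
print("chosen cut-off:", thresholds[best],
      "sensitivity:", tpr[best], "specificity:", 1 - fpr[best])
```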
Learning when data sets are imbalanced and when costs are unequal and unknown
- ICML-2003 Workshop on Learning from Imbalanced Data Sets II, 2003
"... The problem of learning from imbalanced data sets, while not the same problem as learning when misclassification costs are unequal and unknown, can be handled in a similar manner. That is, in both contexts, we can use techniques from roc analysis to help with classifier design. We present results fr ..."
Abstract
-
Cited by 87 (0 self)
- Add to MetaCart
(Show Context)
Abstract: The problem of learning from imbalanced data sets, while not the same problem as learning when misclassification costs are unequal and unknown, can be handled in a similar manner. That is, in both contexts, we can use techniques from ROC analysis to help with classifier design. We present results from two studies in which we dealt with skewed data sets and unequal, but unknown, costs of error. For one domain, we also compare these results to those obtained by over-sampling and under-sampling the data set. The operations of sampling, moving the decision threshold, and adjusting the cost matrix produced sets of classifiers that fell on the same ROC curve.
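A hedged sketch of the abstract's central observation: for a probabilistic classifier, moving the decision threshold traces out the ROC curve, so different cost or imbalance assumptions select different operating points on the same curve rather than requiring a new model. The skewed data set and cost ratios below are assumptions.

```python
# Threshold moving under different cost assumptions on an imbalanced synthetic data set.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve

X, y = make_classification(n_samples=3000, n_features=10,
                           weights=[0.95, 0.05], random_state=0)   # 5% minority class
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.4, random_state=0)

p = LogisticRegression(max_iter=1000).fit(Xtr, ytr).predict_proba(Xte)[:, 1]
fpr, tpr, thr = roc_curve(yte, p)
pos_rate = yte.mean()

for c_fn_over_c_fp in (1, 5, 20):   # assumed ratios of false-negative to false-positive cost
    # pick the threshold minimising expected cost; every choice lies on the same ROC curve
    cost = c_fn_over_c_fp * pos_rate * (1 - tpr) + (1 - pos_rate) * fpr
    i = np.argmin(cost)
    print(f"cost ratio {c_fn_over_c_fp:>2}: threshold={thr[i]:.3f} "
          f"TPR={tpr[i]:.2f} FPR={fpr[i]:.2f}")
```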
The Prediction of Faulty Classes Using Object-Oriented Design Metrics
, 1999
"... Contemporary evidence suggests that most field faults in software applications are found in a smafi percentage of the software's components. This means that if these faulty software components can be detected early in the development project's life cycle, mitigating actions can be taken, ..."
Abstract
-
Cited by 74 (3 self)
- Add to MetaCart
Abstract: Contemporary evidence suggests that most field faults in software applications are found in a small percentage of the software's components. This means that if these faulty software components can be detected early in the development project's life cycle, mitigating actions can be taken, such as a redesign. For object-oriented applications, prediction models using design metrics can be used to identify faulty classes early on. In this paper we report on a study that used object-oriented design metrics to construct such prediction models. The study used data collected from one version of a commercial Java application for constructing a prediction model. The model was then validated on a subsequent release of the same application. Our results indicate that the prediction model has high accuracy. Furthermore, we found that an export coupling metric had the strongest association with fault-proneness, indicating a structural feature that may be symptomatic of a class with a high probability of latent faults.
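A hedged sketch of the kind of class-level fault-proneness model the abstract describes: a logistic regression fit on object-oriented design metrics from one release and validated on the next. The metric names and generated data are hypothetical placeholders, not the study's data.

```python
# Fit a fault-proneness model on release N, validate on release N+1 (synthetic metrics).
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def fake_release(n_classes):
    """Synthetic per-class design metrics and fault labels for one release."""
    export_coupling = rng.poisson(4, n_classes)
    methods = rng.poisson(10, n_classes)
    depth = rng.integers(1, 6, n_classes)
    # assume export coupling drives fault-proneness, echoing the study's finding
    p_fault = 1 / (1 + np.exp(-(0.5 * export_coupling - 3)))
    faulty = rng.random(n_classes) < p_fault
    return pd.DataFrame({"export_coupling": export_coupling, "methods": methods,
                         "depth": depth, "faulty": faulty})

train, test = fake_release(300), fake_release(250)   # release N fits, release N+1 validates
features = ["export_coupling", "methods", "depth"]
model = LogisticRegression(max_iter=1000).fit(train[features], train["faulty"])
print("validation AUC:",
      roc_auc_score(test["faulty"], model.predict_proba(test[features])[:, 1]))
print("coefficients:", dict(zip(features, model.coef_[0])))
```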
ROC analysis of statistical methods used in functional MRI: Individual Subjects
- NeuroImage 9, 1999
"... The complicated structure of fMRI signals and associated noise sources make it difficult to assess the validity of various steps involved in the statistical analysis of brain activation. Most methods used for fMRI analysis assume that observations are independent and that the noise can be treated as ..."
Abstract
-
Cited by 62 (8 self)
- Add to MetaCart
(Show Context)
Abstract: The complicated structure of fMRI signals and associated noise sources make it difficult to assess the validity of various steps involved in the statistical analysis of brain activation. Most methods used for fMRI analysis assume that observations are independent and that the noise can be treated as white Gaussian noise. These assumptions are usually not true, but it is difficult to assess how severely they are violated and what their practical consequences are. In this study a direct comparison is made between the power of various analytical methods used to detect activations, without reference to estimates of statistical significance. The statistics used in fMRI are treated as metrics designed to detect activations and are not interpreted probabilistically. The receiver operating characteristic (ROC) method is used to compare the efficacy of various steps in calculating an activation map in the study of a single subject, based on optimizing the ratio of the number of detected activations to the number of false-positive findings. The main findings are as follows. Preprocessing: the removal of intensity drifts and high-pass filtering applied at the voxel time-course level is beneficial to the efficacy of analysis; temporal normalization of the global image intensity, smoothing in the temporal domain, and low-pass filtering do not improve the power of analysis. Choice of statistics: the cross-correlation coefficient and t-statistic, as well as the nonparametric Mann–Whitney statistic, prove to be the most effective and are similar in performance by our criterion. Task design: the proper design of task protocols is shown to be crucial; in an alternating block design the optimal block length is approximately 18 s. Spatial clustering: an initial spatial smoothing of images is more efficient than cluster filtering of the statistical parametric activation maps. © 1999 Academic Press
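A hedged sketch of the evaluation idea in this abstract: treat candidate voxel statistics purely as detection metrics and compare them by ROC area on simulated data where the truly active voxels are known. The signal model (boxcar task plus white noise) is a deliberate simplification chosen only to illustrate the comparison.

```python
# Compare two detection statistics by ROC area on simulated voxel time courses.
import numpy as np
from scipy import stats
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n_vox, n_t = 2000, 120
design = (np.arange(n_t) // 18) % 2            # alternating block design, 18-scan blocks
active = np.zeros(n_vox, dtype=bool)
active[:200] = True                            # 10% truly active voxels

data = rng.normal(0, 1, (n_vox, n_t))
data[active] += 0.4 * design                   # add task-locked signal to active voxels

# Two candidate detection statistics computed per voxel
t_stat = np.array([stats.ttest_ind(ts[design == 1], ts[design == 0]).statistic
                   for ts in data])
corr = np.array([np.corrcoef(ts, design)[0, 1] for ts in data])

print("ROC area, t-statistic:      ", roc_auc_score(active, t_stat))
print("ROC area, cross-correlation:", roc_auc_score(active, corr))
```

On this toy signal the two statistics are monotone transforms of each other, so their ROC areas coincide, consistent with the abstract's report that they perform similarly.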
Statistical strategies for avoiding false discoveries in metabolomics and related experiments
, 2006
"... Many metabolomics, and other high-content or high-throughput, experiments are set up such that the primary aim is the discovery of biomarker metabolites that can discriminate, with a certain level of certainty, between nominally matched ‘case ’ and ‘control ’ samples. However, it is unfortunately ve ..."
Abstract
-
Cited by 61 (11 self)
- Add to MetaCart
(Show Context)
Abstract: Many metabolomics, and other high-content or high-throughput, experiments are set up such that the primary aim is the discovery of biomarker metabolites that can discriminate, with a certain level of certainty, between nominally matched ‘case’ and ‘control’ samples. However, it is unfortunately very easy to find markers that are apparently persuasive but that are in fact entirely spurious, and there are well-known examples in the proteomics literature. The main types of danger are not entirely independent of each other, but include bias, inadequate sample size (especially relative to the number of metabolite variables and to the required statistical power to prove that a biomarker is discriminant), excessive false discovery rate due to multiple hypothesis testing, inappropriate choice of particular numerical methods, and overfitting (generally caused by the failure to perform adequate validation and cross-validation). Many studies fail to take these into account, and thereby fail to discover anything of true significance (despite their claims). We summarise these problems, and provide pointers to a substantial existing literature that should assist in the improved design and evaluation of metabolomics experiments, thereby allowing robust scientific conclusions to be drawn from the available data. We provide a list of some of the simpler checks that might improve one’s confidence that a candidate biomarker is not simply a statistical artefact, and suggest a series of preferred tests and visualisation tools that can assist readers and authors in assessing papers. These tools can be applied to individual metabolites by using multiple univariate tests performed in parallel across all metabolite peaks. They may also be applied to the validation of multivariate models. We stress in
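A hedged sketch of one of the simpler checks this abstract alludes to: controlling the false discovery rate when many univariate tests are run in parallel across metabolite peaks, here via the Benjamini-Hochberg procedure on simulated pure-noise data, so that no "discriminatory" peaks are reported when none exist. No real metabolomics measurements are used.

```python
# Naive per-test thresholding vs. Benjamini-Hochberg FDR control on pure-noise peaks.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_peaks, n_per_group = 500, 20                 # many variables, few samples: the risky regime
case = rng.normal(0, 1, (n_per_group, n_peaks))
control = rng.normal(0, 1, (n_per_group, n_peaks))   # no true biomarkers by construction

p = stats.ttest_ind(case, control, axis=0).pvalue    # one univariate test per peak

raw_hits = np.sum(p < 0.05)                    # naive per-test threshold

# Benjamini-Hochberg: largest k with p_(k) <= (k / m) * q
q = 0.05
order = np.argsort(p)
thresh = (np.arange(1, n_peaks + 1) / n_peaks) * q
passed = np.nonzero(p[order] <= thresh)[0]
fdr_hits = passed[-1] + 1 if passed.size else 0

print(f"peaks with raw p < 0.05: {raw_hits} (about 25 false positives expected)")
print(f"peaks surviving FDR control at q = 0.05: {fdr_hits}")
```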
Using substitution probabilities to improve position-specific scoring matrices
- Computer Applications in the Biosciences, 1996
"... blocks Subject classification: proteins *To whom reprint requests should be sent Running head: Improved position-specific scoring matrices Each column of amino acids in a multiple alignment of protein sequences can be represented as a vector of 20 amino acid counts. For alignment and searching appli ..."
Abstract
-
Cited by 53 (1 self)
- Add to MetaCart
(Show Context)
Abstract: Each column of amino acids in a multiple alignment of protein sequences can be represented as a vector of 20 amino acid counts. For alignment and searching applications, the count vector is an imperfect representation of a position, because the observed sequences are an incomplete sample of the full set of related sequences. One general solution to this problem is to model unobserved sequences by adding artificial "pseudo-counts" to the observed counts. We introduce a simple method for computing pseudo-counts that combines the diversity observed in each alignment position with amino acid substitution probabilities. In extensive empirical tests, this position-based method outperformed other pseudo-count methods and was a substantial improvement over the traditional average score method used for constructing profiles.
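A hedged sketch of the position-based pseudo-count idea the abstract describes: the total pseudo-count for a column scales with the column's diversity (number of distinct residues observed), and it is distributed among amino acids using substitution probabilities conditioned on the residues actually seen. The four-letter alphabet, substitution matrix, and multiplier m are toy assumptions, not the paper's values.

```python
# Position-based pseudo-counts for one alignment column (toy alphabet).
import numpy as np

alphabet = ["A", "C", "D", "E"]                      # toy alphabet instead of 20 amino acids
# q[i, a] ~ probability of substituting residue i by residue a (rows sum to 1)
q = np.array([[0.70, 0.10, 0.10, 0.10],
              [0.10, 0.70, 0.10, 0.10],
              [0.10, 0.10, 0.60, 0.20],
              [0.10, 0.10, 0.20, 0.60]])

def column_probabilities(counts, q, m=5.0):
    """Estimated residue probabilities for one alignment column.

    counts: observed residue counts in the column
    m: multiplier converting column diversity into a total pseudo-count
    """
    counts = np.asarray(counts, dtype=float)
    N = counts.sum()
    diversity = np.count_nonzero(counts)             # distinct residues observed
    B = m * diversity                                # total pseudo-count for the column
    b = B * (counts / N) @ q                         # distribute via substitution probabilities
    return (counts + b) / (N + B)

print(column_probabilities([8, 0, 2, 0], q))         # column dominated by A with some D
```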