Results 1–10 of 21
Top 10 algorithms in data mining, 2007
Cited by 113 (2 self)
This paper presents the top 10 data mining algorithms identified by the IEEE International Conference on Data Mining (ICDM) in December 2006: C4.5, k-Means, SVM, Apriori, EM, PageRank, AdaBoost, kNN, Naive Bayes, and CART. These top 10 algorithms are among the most influential data mining algorithms in the research community. For each algorithm, we provide a description, discuss its impact, and review current and further research on it. These 10 algorithms cover classification, …
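As a toy illustration of one of the listed algorithms, here is a minimal sketch of k-Means (Lloyd's algorithm) on 1-D data; the data and parameters are invented for the example, not taken from the paper.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's algorithm on 1-D data: assign each point to the
    nearest centroid, then move each centroid to its cluster mean."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: abs(p - centroids[j]))
            clusters[nearest].append(p)
        # keep the old centroid if a cluster ends up empty
        centroids = [sum(c) / len(c) if c else centroids[j]
                     for j, c in enumerate(clusters)]
    return sorted(centroids)

data = [1.0, 1.2, 0.8, 9.8, 10.0, 10.2]
print(kmeans(data, 2))  # two clear clusters, near 1.0 and 10.0
```

On such well-separated data Lloyd's iterations converge to the two group means regardless of which points the seed picks as initial centroids.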
Computational aspects of feedback in neural circuits
PLoS Computational Biology, 2007
Cited by 37 (7 self)
It has previously been shown that generic cortical microcircuit models can perform complex real-time computations on continuous input streams, provided that these computations can be carried out with a rapidly fading memory. We investigate the computational capability of such circuits in the more realistic case where not only the readout neurons but also a few neurons within the circuit have been trained for specific tasks. This is essentially equivalent to the case where the output of trained readout neurons is fed back into the circuit. We show that this new model overcomes the limitation of a rapidly fading memory. In fact, we prove that in the idealized case without noise it can carry out any conceivable digital or analog computation on time-varying inputs. But even with noise, the resulting computational model can perform a large class of biologically relevant real-time computations that require a non-fading memory. We demonstrate these computational implications of feedback both theoretically and through computer simulations of detailed cortical microcircuit models that are subject to noise and have complex inherent dynamics. We show that the application of simple learning procedures (such as linear regression or perceptron learning) to a few neurons enables such circuits to represent time over behaviorally relevant long time spans, to integrate evidence from incoming spike trains over longer periods of time, and to process new information contained in such spike trains in diverse ways according to the current internal state of the circuit. In particular, we show that such generic cortical microcircuits with feedback provide a new model for working memory that is consistent with a large set of biological constraints.
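The abstract's central contrast, fading versus non-fading memory, can be sketched with a deliberately tiny toy model (a single leaky unit, not the paper's detailed microcircuit): without feedback the response to an input pulse decays geometrically, while a self-feedback weight of 1 holds it indefinitely.

```python
def run(inputs, feedback):
    """Single linear unit with a self-connection. With feedback < 1 its
    memory of a pulse fades geometrically; with feedback = 1 the closed
    loop behaves like an integrator and retains the pulse value."""
    x = 0.0
    trace = []
    for u in inputs:
        x = feedback * x + u  # leaky decay vs. self-feedback
        trace.append(x)
    return trace

pulse = [1.0] + [0.0] * 49
no_fb = run(pulse, feedback=0.5)    # fading memory: decays toward 0
with_fb = run(pulse, feedback=1.0)  # feedback loop: pulse value held
```

This only illustrates the qualitative distinction; the paper's result concerns feedback from trained readouts in noisy, nonlinear circuit models.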
Improving the Caenorhabditis elegans genome annotation using machine learning
PLoS Computational Biology, 2007
Cited by 13 (2 self)
For modern biology, precise genome annotations are of prime importance, as they allow the accurate definition of genic regions. We employ state-of-the-art machine learning methods to assay and improve the accuracy of the genome annotation of the nematode Caenorhabditis elegans. The proposed machine learning system is trained to recognize exons and introns on the unspliced mRNA, utilizing recent advances in support vector machines and label sequence learning. In 87% (coding and untranslated regions) and 95% (coding regions only) of all genes tested in several out-of-sample evaluations, our method correctly identified all exons and introns. Notably, only 37% and 50%, respectively, of the presently unconfirmed genes in the C. elegans genome annotation agree with our predictions; we therefore hypothesize that a sizable fraction of those genes are not correctly annotated. A retrospective evaluation of the WormBase WS120 annotation [1] of C. elegans reveals that splice form predictions on unconfirmed genes in WS120 are inaccurate in about 18% of the considered cases, while our predictions deviate from the truth in only 10%–13%. We experimentally analyzed 20 controversial genes on which our system and the annotation disagree, confirming the superiority of our predictions: while our method correctly predicted 75% of those cases, the standard annotation was never completely correct. The accuracy of our system is further corroborated by a comparison with two other recently proposed systems that can be used for splice form prediction, SNAP and ExonHunter. We conclude that the genome annotation of C. elegans …
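As a much-simplified illustration of the splice-signal setting the abstract refers to: introns in C. elegans almost always begin with GT and end with AG, and the learned detectors score such candidate sites. The sketch below shows only the rule-based candidate-generation step, on an invented toy sequence, not the paper's SVM scoring.

```python
def candidate_introns(seq, min_len=4):
    """Enumerate candidate introns by the canonical GT...AG rule;
    a trained splice-site classifier would then score each candidate."""
    cands = []
    for i in range(len(seq) - 1):
        if seq[i:i + 2] == "GT":              # candidate donor site
            for j in range(i + min_len, len(seq) - 1):
                if seq[j:j + 2] == "AG":      # candidate acceptor site
                    cands.append((i, j + 2))  # half-open [donor, end) span
    return cands

print(candidate_introns("AAGTCCCTAGTT"))  # [(2, 10)]
```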
Local Dimensionality Reduction for Non-Parametric Regression, 2009
Cited by 4 (1 self)
Locally-weighted regression is a computationally efficient technique for non-linear regression. However, for high-dimensional data, this technique becomes numerically brittle and computationally too expensive if many local models need to be maintained simultaneously. Thus, local linear dimensionality reduction combined with locally-weighted regression seems to be a promising solution. In this context, we review linear dimensionality-reduction methods, compare their performance on non-parametric locally-linear regression, and discuss their ability to extend to incremental learning. The considered methods belong to three groups: (1) reducing dimensionality only on the input data, (2) modeling the joint input-output data distribution, and (3) optimizing the correlation between projection directions and output data. Group 1 contains principal component regression (PCR); group 2 contains principal component analysis (PCA) in joint input and output space, factor analysis, and probabilistic PCA; and group 3 contains reduced-rank regression (RRR) and partial least squares (PLS) regression. Among the tested methods, only group 3 managed to achieve robust performance even for a non-optimal number of components (factors or projection directions). In contrast, groups 1 and 2 failed for fewer components, since these methods rely on a correct estimate of the true intrinsic dimensionality. In group 3, PLS is the only method for which a computationally efficient incremental implementation exists.
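The distinction the abstract draws between group 1 (directions chosen by input variance, as in PCR) and group 3 (directions chosen by correlation with the output, as in PLS) can be sketched with a single PLS1 component in plain Python; the tiny dataset is invented for illustration.

```python
def pls1_one_component(X, y):
    """One PLS1 component: after centering, the projection direction is
    proportional to X^T y (input directions weighted by covariance with
    the output), unlike PCR, which uses input variance alone."""
    n, d = len(X), len(X[0])
    xm = [sum(row[j] for row in X) / n for j in range(d)]
    ym = sum(y) / n
    Xc = [[row[j] - xm[j] for j in range(d)] for row in X]
    yc = [v - ym for v in y]
    # weight vector w proportional to Xc^T yc, normalized
    w = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(d)]
    norm = sum(v * v for v in w) ** 0.5
    w = [v / norm for v in w]
    # latent scores t = Xc w, then least-squares slope of yc on t
    t = [sum(Xc[i][j] * w[j] for j in range(d)) for i in range(n)]
    b = sum(ti * yi for ti, yi in zip(t, yc)) / sum(ti * ti for ti in t)
    return w, b

# y depends only on dimension 0; dimension 1 is constant and irrelevant,
# so the PLS direction aligns with dimension 0.
X = [[1.0, 5.0], [2.0, 5.0], [3.0, 5.0], [4.0, 5.0]]
y = [1.0, 2.0, 3.0, 4.0]
```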
Insights from Classifying Visual Concepts with Multiple Kernel Learning
Cited by 1 (1 self)
Combining information from various image features has become a standard technique in concept recognition tasks. However, the optimal way of fusing the resulting kernel functions is usually unknown in practical applications. Multiple kernel learning (MKL) techniques make it possible to determine an optimal linear combination of such similarity matrices. Classical approaches to MKL promote sparse mixtures. Unfortunately, 1-norm regularized MKL variants are often observed to be outperformed by an unweighted sum kernel. The main contributions of this paper are the following: we apply a recently developed non-sparse MKL variant to state-of-the-art concept recognition tasks from the application domain of computer vision. We provide insights on the benefits and limits of non-sparse MKL and compare it against its direct competitors, the sum-kernel SVM and sparse MKL. We report empirical results for the PASCAL VOC 2009 Classification and ImageCLEF 2010 Photo Annotation challenge data sets. Data sets (kernel matrices) as well as further information are available at …
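The object MKL optimizes, a linear mixture of base kernel matrices, is simple to write down. The sketch below uses two invented 2×2 kernels standing in for image-feature kernels, and shows the unweighted-sum kernel as the uniform-weights special case.

```python
def combine_kernels(kernels, betas):
    """Linear kernel mixture K = sum_m beta_m * K_m. MKL learns the
    betas (on the simplex); the unweighted-sum kernel is the special
    case beta_m = 1/M for all M base kernels."""
    M = len(kernels)
    n = len(kernels[0])
    return [[sum(betas[m] * kernels[m][i][j] for m in range(M))
             for j in range(n)] for i in range(n)]

K_color = [[1.0, 0.8], [0.8, 1.0]]  # hypothetical color-feature kernel
K_shape = [[1.0, 0.2], [0.2, 1.0]]  # hypothetical shape-feature kernel
K_sum = combine_kernels([K_color, K_shape], [0.5, 0.5])  # uniform mixture
```

A sparse 1-norm MKL solution would instead drive some betas to exactly zero, discarding those feature channels entirely.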
from Frankfurt am Main
Dean: Prof. Dr. Volker Müller. First examiner: Prof. Dr. Gisbert Schneider. Second examiner: PD Dr. Silja Wessler. Date of the defense: ……………………………………………………… Table of contents
Estimating Quality of Support Vector Machines Learning Under Probabilistic and Interval Uncertainty: Algorithms and Computational Complexity
Summary. Support Vector Machines (SVM) are one of the most widely used techniques in machine learning. After the SVM algorithms process the data and produce some classification, it is desirable to learn how well this classification fits the data. Several measures of fit exist; among them, the most widely used is kernel target alignment. These measures, however, assume that the data are known exactly. In reality, whether the data points come from measurements or from expert estimates, they are only known with uncertainty. As a result, even if we know that the classification perfectly fits the nominal data, this same classification can be a bad fit for the actual values (which are somewhat different from the nominal ones). In this paper, we show how to take this uncertainty into account when estimating the quality of the resulting classification.

1 Formulation of the Problem

Machine learning: main problem. In many practical situations, we have examples of several types of objects, and we would like to use these examples to teach computers to distinguish between objects of different types.
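Kernel target alignment, the measure of fit named above, is the cosine between the kernel matrix K and the ideal label kernel y y^T under the Frobenius inner product. A minimal sketch on invented labels, ignoring the paper's interval-uncertainty extension:

```python
def kernel_target_alignment(K, y):
    """Alignment A(K, y y^T) = <K, y y^T>_F / (||K||_F * ||y y^T||_F),
    i.e. the cosine between the two matrices viewed as vectors.
    A value of 1.0 means K perfectly matches the label structure."""
    n = len(y)
    num = sum(K[i][j] * y[i] * y[j] for i in range(n) for j in range(n))
    k_norm = sum(K[i][j] ** 2 for i in range(n) for j in range(n)) ** 0.5
    yy_norm = sum((y[i] * y[j]) ** 2
                  for i in range(n) for j in range(n)) ** 0.5
    return num / (k_norm * yy_norm)

y = [1, 1, -1, -1]
K_ideal = [[y[i] * y[j] for j in range(4)] for i in range(4)]
print(kernel_target_alignment(K_ideal, y))  # 1.0 for a perfect fit
```

The paper's question is then how this quantity varies when each entry of K or y is only known to lie within an interval.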
Pattern Clustering Using a Swarm Intelligence Approach
Summary. Clustering aims at representing large datasets by a smaller number of prototypes or clusters. It brings simplicity to modeling data and thus plays a central role in the process of knowledge discovery and data mining. Data mining tasks these days require fast and accurate partitioning of huge datasets, which may come with a variety of attributes or features. This, in turn, imposes severe computational requirements on the relevant clustering techniques. A family of bio-inspired algorithms, well known as Swarm Intelligence (SI), has recently emerged that meets these requirements and has successfully been applied to a number of real-world clustering problems. This chapter explores the role of SI in clustering different kinds of datasets. It finally describes a new SI technique for partitioning a linearly non-separable dataset into an optimal number of clusters in the kernel-induced feature space. Computer simulations undertaken in this research are also provided to demonstrate the effectiveness of the proposed algorithm.
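One common SI clustering recipe, sketched here with invented parameters (and not claimed to be the chapter's specific algorithm), encodes candidate cluster centers in each particle of a particle swarm and minimizes the total distance of points to their nearest center:

```python
import random

def fitness(centers, points):
    """Clustering objective: total distance from each point to its
    nearest candidate center (lower is better)."""
    return sum(min(abs(p - c) for c in centers) for p in points)

def pso_cluster(points, k=2, swarm=20, iters=60, seed=1):
    """Bare-bones particle swarm on 1-D data: each particle encodes k
    centers; velocities are pulled toward personal and global bests."""
    rng = random.Random(seed)
    lo, hi = min(points), max(points)
    pos = [[rng.uniform(lo, hi) for _ in range(k)] for _ in range(swarm)]
    vel = [[0.0] * k for _ in range(swarm)]
    pbest = [p[:] for p in pos]
    gbest = min(pbest, key=lambda c: fitness(c, points))[:]
    for _ in range(iters):
        for i in range(swarm):
            for d in range(k):
                vel[i][d] = (0.7 * vel[i][d]                    # inertia
                             + 1.5 * rng.random() * (pbest[i][d] - pos[i][d])
                             + 1.5 * rng.random() * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            if fitness(pos[i], points) < fitness(pbest[i], points):
                pbest[i] = pos[i][:]
                if fitness(pos[i], points) < fitness(gbest, points):
                    gbest = pos[i][:]
    return sorted(gbest)

data = [1.0, 1.2, 0.8, 9.8, 10.0, 10.2]
centers = pso_cluster(data)  # candidate centers for the two groups
```

Unlike k-Means, nothing here requires the objective to be differentiable, which is what lets SI variants swap in kernel-induced distances as the chapter describes.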
Neural Process Lett, DOI 10.1007/s11063-009-9098-0: Local Dimensionality Reduction for Non-Parametric Regression
Locally-weighted regression is a computationally efficient technique for non-linear regression. However, for high-dimensional data, this technique becomes numerically brittle and computationally too expensive if many local models need to be maintained simultaneously. Thus, local linear dimensionality reduction combined with locally-weighted regression seems to be a promising solution. In this context, we review linear dimensionality-reduction methods, compare their performance on non-parametric locally-linear regression, and discuss their ability to extend to incremental learning. The considered methods belong to three groups: (1) reducing dimensionality only on the input data, (2) modeling the joint input-output data distribution, and (3) optimizing the correlation between projection directions and output data. Group 1 contains principal component regression (PCR); group 2 contains principal component analysis (PCA) in joint input and output space, factor analysis, and probabilistic PCA; and group 3 contains reduced-rank regression (RRR) and partial least squares (PLS) regression. Among the tested methods, only group 3 managed to achieve robust performance even for a non-optimal number of components (factors or projection directions). In contrast, groups 1 and 2 failed for fewer components, since these methods rely on a correct estimate of the true intrinsic dimensionality. In group 3, PLS is the only method for which a computationally efficient incremental implementation exists.
Improving the Caenorhabditis elegans Genome Annotation Using Machine Learning
For modern biology, precise genome annotations are of prime importance, as they allow the accurate definition of genic regions. We employ state-of-the-art machine learning methods to assay and improve the accuracy of the genome annotation of the nematode Caenorhabditis elegans. The proposed machine learning system is trained to recognize exons and introns on the unspliced mRNA, utilizing recent advances in support vector machines and label sequence learning. In 87% (coding and untranslated regions) and 95% (coding regions only) of all genes tested in several out-of-sample evaluations, our method correctly identified all exons and introns. Notably, only 37% and 50%, respectively, of the presently unconfirmed genes in the C. elegans genome annotation agree with our predictions; we therefore hypothesize that a sizable fraction of those genes are not correctly annotated. A retrospective evaluation of the WormBase WS120 annotation [1] of C. elegans reveals that splice form predictions on unconfirmed genes in WS120 are inaccurate in about 18% of the considered cases, while our predictions deviate from the truth in only 10%–13%. We experimentally analyzed 20 controversial genes on which our system and the annotation disagree, confirming the superiority of our predictions: while our method correctly predicted 75% of those cases, the standard annotation was never completely correct. The accuracy of our system is further corroborated by a comparison with two other recently proposed systems that can be used for splice form prediction, SNAP and ExonHunter. We conclude that the genome annotation of C. elegans …