Results 1–10 of 21
Nyström method vs random Fourier features: A theoretical and empirical comparison
In Advances in NIPS’12, 2012
Abstract

Cited by 29 (3 self)
Both random Fourier features and the Nyström method have been successfully applied to efficient kernel learning. In this work, we investigate the fundamental difference between these two approaches, and how that difference affects their generalization performance. Unlike approaches based on random Fourier features, where the basis functions (i.e., cosine and sine functions) are sampled from a distribution independent of the training data, the basis functions used by the Nyström method are randomly sampled from the training examples and are therefore data dependent. By exploiting this difference, we show that when there is a large gap in the eigenspectrum of the kernel matrix, approaches based on the Nyström method can yield a markedly better generalization error bound than approaches based on random Fourier features. We empirically verify our theoretical findings on a wide range of large data sets.
Co-clustering for directed graphs; the Stochastic Co-Blockmodel and a spectral algorithm
2012
Abstract

Cited by 12 (1 self)
Communities of highly connected actors are an essential feature of the structure of many empirical directed and undirected networks. However, compared to the amount of research on clustering for undirected graphs, there is relatively little understanding of clustering in directed networks. This paper extends the spectral clustering algorithm to directed networks in a way that co-clusters, or bi-clusters, the rows and columns of a graph Laplacian. Co-clustering leverages the increased complexity of asymmetric relationships to gain new insight into the structure of the directed network. To understand this algorithm and to study its asymptotic properties in a canonical setting, we propose the Stochastic Co-Blockmodel to encode co-clustering structure. This is the first statistical model of co-clustering, and it is derived using the concept of stochastic equivalence that motivated the original Stochastic Blockmodel. Although directed spectral clustering is not derived from the Stochastic Co-Blockmodel, we show that the algorithm can consistently estimate the blocks in a high-dimensional asymptotic setting in which the number of blocks grows with the number of nodes. The algorithm, model, and asymptotic results can all be extended to bipartite graphs.
Semi-supervised Learning using Sparse Eigenfunction Bases
Abstract

Cited by 12 (0 self)
We present a new framework for semi-supervised learning with sparse eigenfunction bases of kernel matrices. It turns out that when the data is clustered, that is, when the high-density regions are sufficiently separated by low-density valleys, each high-density area corresponds to a unique representative eigenvector. Linear combinations of such eigenvectors (or, more precisely, of their Nyström extensions) provide good candidates for classification functions when the cluster assumption holds. By first choosing an appropriate basis of these eigenvectors from unlabeled data and then using labeled data with the Lasso to select a classifier in the span of these eigenvectors, we obtain a classifier with a very sparse representation in this basis. Importantly, the sparsity corresponds naturally to the cluster assumption. Experimental results on a number of real-world datasets show that our method is competitive with state-of-the-art semi-supervised learning algorithms and outperforms the natural baseline algorithm (the Lasso in the kernel PCA basis).
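The pipeline the abstract describes — eigenvector basis from unlabeled data, then a Lasso fit on a few labels — can be sketched as below. This is a hypothetical minimal version: the Lasso step is solved with plain iterative soft-thresholding (ISTA), and the bandwidth, basis size, and penalty are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)
# Two well-separated clusters, so the cluster assumption clearly holds.
n = 100
X = np.vstack([rng.normal(-2, 0.5, size=(n // 2, 2)),
               rng.normal(+2, 0.5, size=(n // 2, 2))])
y = np.repeat([-1.0, 1.0], n // 2)

# Basis: top eigenvectors of the Gaussian kernel matrix (unlabeled step).
sq = ((X[:, None] - X[None]) ** 2).sum(-1)
K = np.exp(-sq / 2.0)
B = np.linalg.eigh(K)[1][:, -5:]        # 5 leading eigenvectors

# A few labeled points per cluster; sparse coefficients via ISTA (Lasso).
lab = np.concatenate([rng.choice(n // 2, 5, replace=False),
                      n // 2 + rng.choice(n // 2, 5, replace=False)])
A, b = B[lab], y[lab]
w = np.zeros(5)
step = 1.0 / np.linalg.norm(A, 2) ** 2  # 1 / Lipschitz constant
for _ in range(500):                    # gradient step + soft-threshold
    g = w - step * A.T @ (A @ w - b)
    w = np.sign(g) * np.maximum(np.abs(g) - step * 0.01, 0.0)

acc = (np.sign(B @ w) == y).mean()      # classify all points
```

On data this cleanly clustered, the leading eigenvectors act as cluster indicators, so ten labels suffice for near-perfect accuracy.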
Inverse density as an inverse problem: The Fredholm equation approach (Technical Report 1304.5575). arXiv
2013
Abstract

Cited by 6 (1 self)
In this paper we address the problem of estimating the ratio q/p, where p is a density function and q is another density or, more generally, an arbitrary function. Knowing or approximating this ratio is needed in various problems of inference and integration, in particular when one needs to average a function with respect to one probability distribution given a sample from another. This is often referred to as importance sampling in statistical inference and is also closely related to the problem of covariate shift in transfer learning, as well as to various MCMC methods. It may also be useful for separating the underlying geometry of a space, say a manifold, from the density function defined on it. Our approach is based on reformulating the problem of estimating q/p as an inverse problem in terms of an integral operator corresponding to a kernel, thus reducing it to an integral equation known as the Fredholm problem of the first kind. This formulation, combined with the techniques of regularization and kernel methods, leads to a principled kernel-based framework for constructing algorithms and for analyzing them theoretically. The resulting family of algorithms (FIRE, for Fredholm Inverse Regularized Estimator) is flexible, simple, and easy to implement. We provide detailed theoretical analysis, including concentration bounds and convergence rates for the Gaussian kernel, in the case of densities defined on R^d, compact domains in R^d, and smooth d-dimensional submanifolds of Euclidean space. We also show experimental results, including applications to classification and semi-supervised learning within the covariate shift framework, and demonstrate some encouraging experimental comparisons. We also show how the parameters of our algorithms can be chosen in a completely unsupervised manner.
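A bare-bones numerical sketch of the Fredholm formulation (not the full FIRE algorithm, which works in an RKHS and selects its parameters in a principled way): discretize the first-kind equation on the p-sample and solve it with Tikhonov regularization. The distributions, bandwidth, and regularization parameter below are all illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
# Samples from p = N(0, 1) and q = N(1, 1); true ratio q/p = exp(x - 1/2).
xp = rng.normal(0.0, 1.0, size=300)
xq = rng.normal(1.0, 1.0, size=300)

def gauss(a, b, h=0.5):
    """Gaussian kernel matrix between two 1-D samples."""
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * h * h))

# Discretized Fredholm equation of the first kind:
#   (1/n_p) sum_j k(x_i, x_j) f(x_j)  =  (1/n_q) sum_j k(x_i, x'_j),
# solved for f at the p-sample points with Tikhonov regularization.
Kp = gauss(xp, xp) / len(xp)
h = gauss(xp, xq).mean(axis=1)
lam = 1e-3
f = np.linalg.solve(Kp.T @ Kp + lam * np.eye(len(xp)), Kp.T @ h)
```

Here `f` is the estimated density ratio q/p at the p-sample points; for these two Gaussians it should increase with x, roughly tracking exp(x − 1/2).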
Comparing data-dependent and data-independent embeddings for classification and ranking of Internet images
In CVPR, 2011
Abstract

Cited by 5 (0 self)
This paper presents a comparative evaluation of feature embeddings for classification and ranking in large-scale Internet image datasets. We follow a popular framework for scalable visual learning, in which the data is first transformed by a nonlinear embedding and an efficient linear classifier is then trained in the resulting space. Our study includes data-dependent embeddings inspired by the semi-supervised learning literature, and data-independent ones based on approximating specific kernels (such as the Gaussian kernel for GIST features and the histogram intersection kernel for bags of words). Perhaps surprisingly, we find that data-dependent embeddings, despite being computed from large amounts of unlabeled data, do not have any advantage over data-independent ones in the regime of scarce labeled data. On the other hand, we find that several data-dependent embeddings are competitive with popular data-independent choices for large-scale classification.
Eigenanalysis of nonlinear PCA with polynomial kernels
 Statistical Analysis and Data Mining
Abstract

Cited by 1 (0 self)
There has been growing interest in kernel methods for classification, clustering, and dimension reduction. For example, kernel Fisher discriminant analysis, spectral clustering, and kernel principal component analysis are widely used in statistical learning and data mining applications. The empirical success of the kernel method is generally attributed to the nonlinear feature mapping induced by the kernel, which in turn determines a low-dimensional data embedding. It is important to understand the effect of a kernel and its associated kernel parameter(s) on the embedding in relation to data distributions. In this paper, we examine the geometry of the nonlinear embedding for kernel PCA when polynomial kernels are used. We carry out an eigenanalysis of the polynomial kernel operator associated with data distributions and investigate the effect of the polynomial degree. The results provide both insight into the geometry of nonlinear data embeddings and practical guidelines for choosing an appropriate degree for dimension reduction with polynomial kernels. We further comment on the effect of centering kernels on the spectral properties of the polynomial kernel operator.
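One concrete consequence of this kind of eigenanalysis can be checked numerically on a toy sample (illustrative sketch, not the paper's analysis): the degree-d polynomial feature space in p dimensions has dimension C(p+d, d), and centering removes the constant feature, so the centered polynomial kernel matrix has at most C(p+d, d) − 1 nonzero eigenvalues, capping how many dimensions kernel PCA can extract regardless of sample size.

```python
import numpy as np
from math import comb

rng = np.random.default_rng(4)
X = rng.normal(size=(50, 2))            # n = 50 points in p = 2 dims
d = 2                                   # polynomial degree
K = (X @ X.T + 1.0) ** d                # inhomogeneous polynomial kernel

# Centre in feature space, then eigendecompose (kernel PCA).
n = len(X)
H = np.eye(n) - np.ones((n, n)) / n
vals = np.linalg.eigvalsh(H @ K @ H)[::-1]

# Feature space of (x.y + 1)^d in p dims has dimension C(p + d, d);
# centring kills the constant feature, so at most C(p + d, d) - 1
# eigenvalues can be nonzero regardless of n.
rank_bound = comb(2 + d, d) - 1         # = 5 here
num_nonzero = int((vals > 1e-8 * vals[0]).sum())
```

So even with 50 data points, the centered degree-2 kernel matrix in two dimensions yields at most 5 usable embedding directions.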
An improved bound for the Nyström method for large eigengap. arXiv preprint arXiv:1209.0001
2012
Robust Classification of Multivariate Time Series by Imprecise Hidden Markov Models
Abstract

Cited by 1 (1 self)
A novel technique to classify time series with imprecise hidden Markov models is presented. The learning of these models is achieved by coupling the EM algorithm with the imprecise Dirichlet model. In the stationarity limit, each model corresponds to an imprecise mixture of Gaussian densities, reducing the problem to the classification of static, imprecise-probabilistic information. Two classifiers, one based on the expected value of the mixture and the other on the Bhattacharyya distance between pairs of mixtures, are developed. The computation of the bounds of these descriptors with respect to the imprecise quantification of the parameters reduces to, respectively, linear and quadratic optimization tasks, and is hence efficiently solved. Classification is performed by extending the k-nearest neighbors approach to interval-valued data. The classifiers are credal, meaning that multiple class labels can be returned as output. Experiments on benchmark datasets for computer vision show that these methods achieve the required robustness while outperforming other precise and imprecise methods.
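The credal output can be illustrated with a much-simplified interval-valued k-NN (hypothetical code, not the authors' classifier, which also involves the imprecise-HMM descriptors): each object is a hyperbox, distances from a query to an object are therefore intervals, and every label that could belong to the k nearest neighbors under some choice within those intervals is returned, so the answer may contain several classes.

```python
import numpy as np

def interval_dist(query, lo, hi):
    """Min and max squared distance from a point to the box [lo, hi]."""
    d_lo = np.maximum(np.maximum(lo - query, query - hi), 0.0)
    d_hi = np.maximum(np.abs(query - lo), np.abs(query - hi))
    return (d_lo ** 2).sum(), (d_hi ** 2).sum()

def credal_knn(query, boxes, labels, k):
    """Return every label that can be among the k nearest neighbors."""
    ivals = [interval_dist(query, lo, hi) for lo, hi in boxes]
    out = set()
    for i, (lo_i, _) in enumerate(ivals):
        # Object i can be among the k nearest unless at least k others
        # are certainly closer (their max distance below i's min).
        certainly_closer = sum(hi_j < lo_i
                               for j, (_, hi_j) in enumerate(ivals) if j != i)
        if certainly_closer < k:
            out.add(labels[i])
    return out

boxes = [(np.array([0.0, 0.0]), np.array([1.0, 1.0])),
         (np.array([5.0, 5.0]), np.array([6.0, 6.0])),
         (np.array([5.5, 5.5]), np.array([6.5, 6.5]))]
labels = ["near", "far", "far"]
query = np.array([0.5, 0.5])
```

With k = 1 only `{"near"}` is possible for this query; with k = 2 the two overlapping `far` boxes both become eligible and the credal answer is `{"near", "far"}`.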
Temporal data classification by imprecise dynamical models
 Seidenfeld (Eds.), Proceedings of the Eighth International Symposium on Imprecise Probability: Theories and Applications (ISIPTA ’13), SIPTA
Abstract

Cited by 1 (1 self)
We propose a new methodology to classify temporal data with imprecise hidden Markov models. For each sequence we learn a different model by coupling the EM algorithm with the imprecise Dirichlet model. As a model descriptor, we consider the expected value of the observable variable in the stationarity limit of the Markov chain. In the imprecise case, only the bounds of this descriptor can be evaluated. In practice, the sequence, which can be regarded as a trajectory in the feature space, is summarized by a hyperbox in the same space. We classify these static but interval-valued data by a credal generalization of the k-nearest neighbors algorithm. Experiments on benchmark datasets for computer vision show that the method achieves the required robustness while outperforming other precise and imprecise methods.
Dissimilarity Data in Statistical Model Building and Machine Learning
Abstract
We explore three papers concerned with two methods for incorporating discrete, noisy, incomplete dissimilarity data into statistical/machine learning models for supervised, semi-supervised, or unsupervised machine learning. The two methods are RKE (Regularized Kernel Estimation) and RMU (Regularized Manifold Unfolding). Briefly put, the methods use dissimilarity information between objects in a training set to obtain a nonnegative-definite matrix of (usually) relatively low rank, which is then used to embed the objects into a (usually) relatively low-dimensional Euclidean space, where their coordinates can be used as attributes in learning models of various types. Some suggestions for further work are noted.
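The embedding step this abstract describes can be sketched with a classical-MDS-style computation (a crude stand-in for RKE's regularized semidefinite fit; the data here is synthetic and illustrative): double-centre the squared dissimilarities into a nonnegative-definite matrix, clip any negative eigenvalues, and use the leading eigenvectors as Euclidean coordinates.

```python
import numpy as np

rng = np.random.default_rng(5)
# Ground-truth 2-D points; only pairwise squared dissimilarities observed.
P = rng.normal(size=(30, 2))
D2 = ((P[:, None] - P[None]) ** 2).sum(-1)

# Double-centre to recover a (here exactly) nonnegative-definite Gram
# matrix, clip the negative eigenvalues that noisy or incomplete
# dissimilarities would introduce, and embed with the leading eigenvectors.
n = len(D2)
H = np.eye(n) - np.ones((n, n)) / n
B = -0.5 * H @ D2 @ H
vals, vecs = np.linalg.eigh(B)
vals = np.clip(vals, 0.0, None)
emb = vecs[:, -2:] * np.sqrt(vals[-2:])  # low-dimensional coordinates

# The embedded coordinates reproduce the observed dissimilarities.
D2_hat = ((emb[:, None] - emb[None]) ** 2).sum(-1)
```

Because the toy dissimilarities come from exact 2-D Euclidean points, a rank-2 embedding reproduces them exactly; with real noisy, incomplete data the regularized fit in RKE replaces this direct eigendecomposition.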