Results 1–7 of 7
Subspace clustering of high-dimensional data: a predictive approach. Data Mining and Knowledge Discovery 28, 2014
Predictive Subspace Clustering

Cited by 2 (1 self)
Abstract—The problem of detecting clusters in high-dimensional data is increasingly common in machine learning applications, for instance in computer vision and bioinformatics. Recently, a number of approaches in the field of subspace clustering have been proposed which search for clusters in subspaces of unknown dimensions. Learning the number of clusters, the dimension of each subspace, and the correct assignments is a challenging task, and many existing algorithms either perform poorly in the presence of subspaces that have different dimensions and possibly overlap, or are computationally expensive. In this work we present a novel approach to subspace clustering that learns the number of clusters and the dimensionality of each subspace in an efficient way. We assume that the data points in each cluster are well represented in low dimensions by a PCA model. We propose a measure of the predictive influence of data points modelled by PCA, which we minimise to drive the clustering process. The proposed predictive subspace clustering algorithm is assessed on both simulated data and on the popular Yale faces database, where state-of-the-art performance and speed are obtained.
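The abstract's actual predictive-influence criterion is not reproduced in the snippet; as a hedged illustration of the underlying idea it describes (one low-dimensional PCA model per cluster, with points scored by how well each model represents them), here is a minimal sketch that assumes a PCA reconstruction-error criterion in place of the paper's measure:

```python
import numpy as np

def pca_fit(X, q):
    """Fit a rank-q PCA model: mean and top-q principal directions."""
    mu = X.mean(axis=0)
    # SVD of the centred data; rows of Vt are principal directions.
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    return mu, Vt[:q]

def reconstruction_error(x, mu, V):
    """Squared distance from x to the affine PCA subspace (mu, span V)."""
    r = x - mu
    return float(r @ r - (V @ r) @ (V @ r))

# Toy data: two clusters lying near different 1-D subspaces in R^3.
rng = np.random.default_rng(0)
t = rng.normal(size=(50, 1))
A = t * np.array([1.0, 0.0, 0.0]) + 0.01 * rng.normal(size=(50, 3))
B = t * np.array([0.0, 1.0, 0.0]) + 0.01 * rng.normal(size=(50, 3)) + 5.0
muA, VA = pca_fit(A, 1)
muB, VB = pca_fit(B, 1)

x = np.array([2.0, 0.0, 0.0])          # lies on cluster A's subspace
errs = [reconstruction_error(x, muA, VA),
        reconstruction_error(x, muB, VB)]
print(int(np.argmin(errs)))            # point is assigned to cluster 0
```

Driving assignments by per-cluster model fit, rather than by distances to centroids, is what lets such schemes handle subspaces of different dimensions.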
Discovering Correlated Subspace Clusters in 3D Continuous-Valued Data
In Proceedings of the IEEE International Conference on Data Mining, 2010

Cited by 2 (0 self)
Abstract—Subspace clusters represent useful information in high-dimensional data. However, mining significant subspace clusters in continuous-valued 3D data, such as stock–financial ratio–year data or gene–sample–time data, is difficult. Firstly, typical metrics either find subspaces with very few objects, or they find too many insignificant subspaces – those which exist by chance. Moreover, typical 3D subspace clustering approaches abound with parameters, which are usually set under biased assumptions, making the mining process a 'guessing game'. We address these concerns by proposing an information-theoretic measure which allows us to identify 3D subspace clusters that stand out from the data. We also develop a highly effective, efficient and parameter-robust algorithm, which is a hybrid of information-theoretic and statistical techniques, to mine these clusters. Through extensive experiments, we show that our approach can discover significant 3D subspace clusters embedded in 110 synthetic datasets of varying conditions. We also perform a case study on real-world stock datasets, which shows that our clusters can generate higher profits compared to those mined by other approaches. Keywords: 3D subspace clustering, financial data mining, information theory.
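The paper's information-theoretic measure is not given in the snippet; purely as an illustrative sketch of how an entropy-based score can make a tight cluster "stand out" from background data, here is a toy comparison using a Gaussian-approximation differential entropy (an assumption of this sketch, not the paper's measure):

```python
import numpy as np

def gauss_entropy(x):
    """Differential entropy of x under a Gaussian approximation."""
    var = x.var() + 1e-12            # guard against zero variance
    return 0.5 * np.log(2.0 * np.pi * np.e * var)

# A tight candidate cluster should score much lower entropy than background.
rng = np.random.default_rng(2)
background = rng.uniform(-1.0, 1.0, 1000)   # diffuse background values
cluster = rng.normal(0.0, 0.01, 100)        # tightly concentrated cluster
score = gauss_entropy(background) - gauss_entropy(cluster)
print(score > 0)                            # positive: cluster stands out
```

A significance measure of this flavour needs no density or distance threshold, which is one way an information-theoretic criterion can reduce parameter sensitivity.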
Multiclass Data Segmentation using Diffuse Interface Methods on Graphs

Cited by 1 (0 self)
Abstract—We present two graph-based algorithms for multiclass segmentation of high-dimensional data on graphs. The algorithms use a diffuse interface model based on the Ginzburg–Landau functional, related to total variation and graph cuts. A multiclass extension is introduced using the Gibbs simplex, with the functional's double-well potential modified to handle the multiclass case. The first algorithm minimizes the functional using a convex splitting numerical scheme. The second algorithm uses a graph adaptation of the classical numerical Merriman–Bence–Osher (MBO) scheme, which alternates between diffusion and thresholding. We demonstrate the performance of both algorithms experimentally on synthetic data, image labeling, and several benchmark data sets such as MNIST, COIL and WebKB. We also make use of fast numerical solvers for finding the eigenvectors and eigenvalues of the graph Laplacian, and take advantage of the sparsity of the matrix. Experiments indicate that the results are competitive with or better than the current state-of-the-art in multiclass graph-based segmentation algorithms for high-dimensional data.
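The MBO-style alternation the abstract describes (diffusion with the graph Laplacian, then thresholding to a Gibbs-simplex corner) can be sketched in a few lines; this is a minimal illustration using a dense spectral approximation of the heat flow, not the paper's actual solvers, functional, or convex-splitting scheme:

```python
import numpy as np

def graph_mbo(W, labels0, dt=0.1, n_iter=20, k_eig=10):
    """MBO-style scheme on a graph: diffuse the class matrix with the
    graph Laplacian heat flow, then threshold each row to a hard label."""
    d = W.sum(axis=1)
    L = np.diag(d) - W                      # unnormalised graph Laplacian
    # Spectral approximation of exp(-dt * L) via the lowest k_eig modes.
    vals, vecs = np.linalg.eigh(L)
    vals, vecs = vals[:k_eig], vecs[:, :k_eig]
    U = np.eye(labels0.max() + 1)[labels0]  # one-hot class matrix
    for _ in range(n_iter):
        # Diffusion step: U <- exp(-dt * L) U, applied in the eigenbasis.
        U = vecs @ (np.exp(-dt * vals)[:, None] * (vecs.T @ U))
        # Thresholding step: snap each row to the nearest simplex corner.
        U = np.eye(U.shape[1])[U.argmax(axis=1)]
    return U.argmax(axis=1)

# Toy two-class graph: two dense blocks joined by one weak edge.
W = np.zeros((6, 6))
W[:3, :3] = 1.0
W[3:, 3:] = 1.0
W[2, 3] = W[3, 2] = 0.1
np.fill_diagonal(W, 0.0)
init = np.array([0, 0, 0, 1, 1, 1])
print(graph_mbo(W, init, k_eig=6))
```

Working in the Laplacian eigenbasis is what makes the diffusion step cheap at scale, which matches the abstract's remark about fast eigensolvers and sparsity.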
Projection Based Models for High Dimensional Data, 2011
I certify that this thesis, and the research to which it refers, are the product of my own work, and that any ideas or quotations from the work of other people, published or otherwise, are fully acknowledged in accordance with the standard referencing practices of the discipline. Signed:
THE K-DISCS ALGORITHM AND ITS KERNELIZATION
Abstract. A new clustering algorithm called k-discs is introduced. This algorithm, though similar to k-means, addresses the deficiencies of k-means as well as its variant k-subspaces. The k-discs algorithm is applied to the recovery of manifolds from noisy samplings. A kernelization of k-discs is exhibited, and the advantages of this modification are demonstrated on synthetic data. For this project, the kernelized and linear algorithms were implemented from scratch in MATLAB.
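The k-discs update itself is not given in the snippet; as a hedged illustration of the kernelization idea it mentions (running a k-means-style loop purely through a Gram matrix, so any kernel can be swapped in), here is a minimal kernel k-means sketch with a linear kernel for a checkable toy case:

```python
import numpy as np

def kernel_kmeans(K, labels, n_clusters=2, n_iter=10):
    """Lloyd-style iterations expressed only through the Gram matrix K.
    Feature-space distance to a cluster centroid expands as
    ||phi(x_i) - m_c||^2 = K_ii - 2*mean_j K_ij + mean_{j,l} K_jl."""
    n = K.shape[0]
    for _ in range(n_iter):
        dists = np.empty((n, n_clusters))
        for c in range(n_clusters):
            idx = np.flatnonzero(labels == c)
            dists[:, c] = (np.diag(K)
                           - 2.0 * K[:, idx].mean(axis=1)
                           + K[np.ix_(idx, idx)].mean())
        labels = dists.argmin(axis=1)
    return labels

# Two well-separated point sets; with a linear kernel this is plain k-means.
A = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.], [.5, .5]])
X = np.vstack([A, A + 10.0])
K = X @ X.T                      # linear-kernel Gram matrix
labels = kernel_kmeans(K, np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0]))
print(labels)                    # first five points split from the rest
```

Because only K enters the loop, replacing `X @ X.T` with, say, an RBF Gram matrix kernelizes the algorithm with no other changes, which is the appeal of kernelizing k-means-like methods.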
Keywords: Subspace clustering; Global linear correlation; Divide and conquer strategy
... correlation in noisy and sparse data sets. By using the classical divide and conquer strategy, it first divides ... where S is spanned by k orthogonal d-dimensional vectors v1, ..., vk. Parameters t1, ..., tk are scalars, 1 ≤ k < d, and x0 is a point in S. In this paper, "linear correlation", "linear subspace" and "correlation pattern" mean the same and are used interchangeably. Fig. 1 shows an example data set D with two different subspaces S1 and S2 in R3. The data set contains three subsets ... [8,7,9,33,22,18,10,14,21,13,3,5,36,37]. Some works on projected clustering and subspace clustering only find axis-parallel subspaces, not arbitrarily oriented subspaces [9,22,18,10]. Others assume that the data points of a correlation cluster should be close to one another [34,2,1,35,16]. However, in Fig. 1, S1 has both dense areas and sparse areas. Local linear correlation clustering algorithms are able to identify the correlations in dense areas, but have difficulty capturing the correlations in sparse areas. Without the "locality assumption", finding the global correlations is regarded as a "chicken-and-egg" problem [29]: we do not know which subspace has a data subset of a certain size, or which data ...
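The snippet defines S through k orthogonal d-dimensional vectors v1, ..., vk, scalar parameters t1, ..., tk, and a point x0 in S, but the equation they belong to was lost in extraction; a plausible reconstruction consistent with those definitions is:

```latex
x \;=\; x_0 + \sum_{i=1}^{k} t_i\, v_i,
\qquad v_i^{\top} v_j = \delta_{ij}, \quad 1 \le k < d,
```

i.e. each point of the k-dimensional affine subspace S in R^d is an orthogonal combination of the spanning vectors offset from x0.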