Results 11-20 of 141
Towards unsupervised pattern discovery in speech
2008
Cited by 78 (10 self)
We present a novel approach to speech processing based on the principle of pattern discovery. Our work represents a departure from traditional models of speech recognition, where the end goal is to classify speech into categories defined by a prespecified inventory of lexical units (i.e., phones or words). Instead, we attempt to discover such an inventory in an unsupervised manner by exploiting the structure of repeating patterns within the speech signal. We show how pattern discovery can be used to automatically acquire lexical entities directly from an untranscribed audio stream. Our approach to unsupervised word acquisition utilizes a segmental variant of a widely used dynamic programming technique, which allows us to find matching acoustic patterns between spoken utterances. By aggregating information about these matching patterns across audio streams, we demonstrate how to group similar acoustic sequences together to form clusters corresponding to lexical entities such as words and short multiword phrases. On a corpus of academic lecture material, we demonstrate that clusters found using this technique exhibit high purity and that many of the corresponding lexical identities are relevant to the underlying audio stream.
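The abstract above relies on a segmental variant of dynamic programming alignment to match acoustic patterns. As a rough illustration only, the sketch below implements classic (non-segmental) dynamic time warping on two sequences of numbers; the function name `dtw_distance` and the absolute-difference local cost are assumptions made for this example, not the authors' method.

```python
def dtw_distance(a, b):
    """Classic DTW alignment cost between two sequences of floats.

    Assumption: local cost is the absolute difference between frames.
    Real speech systems would compare feature vectors instead.
    """
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed predecessor cells.
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b repeats
                cost[i][j - 1],      # b advances, a repeats
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]
```

A sequence and a time-stretched copy of it align at zero cost, which is the property that makes this family of techniques useful for finding repeated patterns spoken at different rates.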
A Unified Framework for Model-based Clustering
Journal of Machine Learning Research, 2003
Cited by 77 (7 self)
Model-based clustering techniques have been widely used and have shown promising results in many applications involving complex data. This paper presents a unified framework for probabilistic model-based clustering based on a bipartite graph view of data and models that highlights the commonalities and differences among existing model-based clustering algorithms. In this view, clusters are represented as probabilistic models in a model space that is conceptually separate from the data space. For partitional clustering, the view is conceptually similar to the Expectation-Maximization (EM) algorithm. For hierarchical clustering, the graph-based view helps to visualize critical/important distinctions between similarity-based approaches and model-based approaches.
Learning Affinity Functions for Image Segmentation: Combining Patch-based and Gradient-based Approaches
Submitted to CVPR 2003, 2003
Cited by 73 (3 self)
This paper studies the problem of combining region and boundary cues for natural image segmentation. We employ a large database of manually segmented images in order to learn an optimal affinity function between pairs of pixels. These pairwise affinities can then be used to cluster the pixels into visually coherent groups. Region cues are computed as the similarity in brightness, color, and texture between image patches. Boundary cues are incorporated by looking for the presence of an “intervening contour”, a large gradient along a straight line connecting two pixels. We first use the dataset of human segmentations to individually optimize parameters of the patch and gradient features for brightness, color, and texture cues. We then quantitatively measure the power of different feature combinations by computing the precision and recall of classifiers trained using those features. The mutual information between the output of the classifiers and the same-segment indicator function provides an alternative evaluation technique that yields identical conclusions. As expected, the best classifier makes use of brightness, color, and texture features, in both patch and gradient forms. We find that for brightness, the gradient cue outperforms the patch similarity. In contrast, using color patch similarity yields better results than using color gradients. Texture is the most powerful of the three channels, with both patches and gradients carrying significant independent information. Interestingly, the proximity of the two pixels does not add any information beyond that provided by the similarity cues. We also find that the convexity assumptions made by the intervening contour approach are supported by the ecological statistics of the dataset.
A Unifying Theorem for Spectral Embedding and Clustering
2003
Cited by 66 (0 self)
Spectral methods use selected eigenvectors of a data affinity matrix to obtain a data representation that can be trivially clustered or embedded in a low-dimensional space. We present a theorem that explains, for broad classes of affinity matrices and eigenbases, why this works: For successively smaller eigenbases (i.e., using fewer and fewer of the affinity matrix's dominant eigenvalues and eigenvectors), the angles between "similar" vectors in the new representation shrink while the angles between "dissimilar" vectors grow. Specifically, the sum of the squared cosines of the angles is strictly increasing as the dimensionality of the representation decreases. Thus spectral methods work because the truncated eigenbasis amplifies structure in the data so that any heuristic postprocessing is more likely to succeed. We use this result to construct a nonlinear dimensionality reduction (NLDR) algorithm for data sampled from manifolds whose intrinsic coordinate system has linear and cyclic axes, and a novel clustering-by-projections algorithm that requires no postprocessing and gives superior performance on "challenge problems" from the recent literature.
Parallel Spectral Clustering in Distributed Systems
Cited by 63 (1 self)
Spectral clustering algorithms have been shown to be more effective in finding clusters than some traditional algorithms such as k-means. However, spectral clustering suffers from a scalability problem in both memory use and computational time when the size of a data set is large. To perform clustering on large data sets, we investigate two representative ways of approximating the dense similarity matrix. We compare one approach by sparsifying the matrix with another by the Nyström method. We then pick the strategy of sparsifying the matrix via retaining nearest neighbors and investigate its parallelization. We parallelize both memory use and computation on distributed computers. Through
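To make the nearest-neighbor sparsification idea concrete, here is a minimal single-machine sketch. The function name `knn_sparsify`, the dense list-of-lists input, and the choice to symmetrize by keeping an entry when either endpoint selects the other are all assumptions for illustration; the abstract does not specify these details, and the paper's distributed implementation is not reproduced here.

```python
def knn_sparsify(sim, k):
    """Keep only each point's k most similar neighbors in a similarity matrix.

    Assumptions: sim is a symmetric dense matrix (list of lists); self-similarity
    on the diagonal is dropped; an entry (i, j) survives if j is among i's top-k
    neighbors or i is among j's, so the result stays symmetric.
    """
    n = len(sim)
    keep = set()
    for i in range(n):
        others = [j for j in range(n) if j != i]
        # Sort neighbors by decreasing similarity and retain the first k.
        top = sorted(others, key=lambda j: sim[i][j], reverse=True)[:k]
        for j in top:
            keep.add((i, j))
            keep.add((j, i))
    return [[sim[i][j] if (i, j) in keep else 0.0 for j in range(n)]
            for i in range(n)]
```

With n points, at most 2nk entries remain nonzero, which is what lets a sparse storage format cut memory use relative to the dense n-by-n matrix.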
Untangling Cycles for Contour Grouping
Cited by 56 (11 self)
We introduce a novel topological formulation for contour grouping. Our grouping criterion, called untangling cycles, exploits the inherent topological 1D structure of salient contours to extract them from the otherwise 2D image clutter. To define a measure for topological classification robust to clutter and broken edges, we use a graph formulation instead of the standard computational topology. The key insight is that a pronounced 1D contour should have a clear ordering of edgels, to which all graph edges adhere, and no long-range entanglements persist. Finding the contour grouping by optimizing these topological criteria is challenging. We introduce a novel concept of circular embedding to encode this combinatorial task. Our solution leads to computing the dominant complex eigenvectors/eigenvalues of the random walk matrix of the contour grouping graph. We demonstrate major improvements over state-of-the-art approaches on challenging real images.
Event Detection by Eigenvector Decomposition using Object and Frame Features
Proceedings Conference on Computer Vision and Pattern Recognition Workshops, 2004
Cited by 51 (3 self)
We develop an event detection framework that has two significant advantages over past work. First
Generative model-based document clustering: a comparative study
Knowledge and Information Systems, 2005
Cited by 50 (0 self)
Semi-supervised learning has become an attractive methodology for improving classification models and is often viewed as using unlabeled data to aid supervised learning. However, it can also be viewed as using labeled data to help clustering, namely, semi-supervised clustering. Viewing semi-supervised learning from a clustering angle is useful in practical situations when the set of labels available in labeled data is not complete, i.e., unlabeled data contain new classes that are not present in labeled data. This paper analyzes several multinomial model-based semi-supervised document clustering methods under a principled model-based clustering framework. The framework naturally leads to a deterministic annealing extension of existing semi-supervised clustering approaches. We compare three (slightly) different semi-supervised approaches for clustering documents: Seeded damnl, Constrained damnl, and Feedback-based damnl, where damnl stands for multinomial model-based deterministic annealing algorithm. The first two are extensions of the seeded k-means and constrained k-means algorithms studied by Basu et al. (2002); the last one is motivated by Cohn et al. (2003). Through empirical experiments on text datasets, we show that: (a) deterministic annealing can often significantly improve the performance of semi-supervised clustering; (b) the constrained approach is the best when available labels are complete whereas the feedback-based approach excels when available labels are incomplete.
Multi-Label Image Segmentation for Medical Applications Based on Graph-Theoretic Electrical Potentials
ECCV, 2004
Cited by 48 (10 self)
A novel method is proposed for performing multi-label, semi-automated image segmentation. Given a small number of pixels with user-defined labels, one can analytically (and quickly) determine the probability that a random walker starting at each unlabeled pixel will first reach one of the pre-labeled pixels. By assigning each pixel to the label for which the greatest probability is calculated, a high-quality image segmentation may be obtained. Theoretical properties of this algorithm are developed along with the corresponding connections to discrete potential theory and electrical circuits. This algorithm is formulated in discrete space (i.e., on a graph) using combinatorial analogues of standard operators and principles from continuous potential theory, allowing it to be applied in arbitrary dimension.
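The first-arrival probabilities described above satisfy a simple averaging property: at every unlabeled node, the probability equals the weighted mean of its neighbors' probabilities. The sketch below exploits that property with plain iterative relaxation on a small graph. This is a stand-in chosen for brevity, not the analytic solution the abstract refers to; the function name `random_walker_probs`, the dictionary graph format, and the fixed iteration count are assumptions for this example.

```python
def random_walker_probs(weights, seeds, label, iters=2000):
    """Probability that a random walker from each node first hits a seed of `label`.

    Assumptions: weights maps node -> {neighbor: positive edge weight} and is
    symmetric; seeds maps labeled nodes to their labels; the system is solved
    by repeated neighbor averaging (Gauss-Seidel style), not a direct solve.
    """
    # Seeds are fixed at 1 (target label) or 0 (any other label).
    prob = {v: (1.0 if seeds.get(v) == label else 0.0) for v in weights}
    for _ in range(iters):
        for v, nbrs in weights.items():
            if v in seeds:
                continue  # boundary values never change
            total = sum(nbrs.values())
            prob[v] = sum(w * prob[u] for u, w in nbrs.items()) / total
    return prob


# Usage: a 5-node chain with seed "A" at one end and seed "B" at the other.
chain = {0: {1: 1.0}, 1: {0: 1.0, 2: 1.0}, 2: {1: 1.0, 3: 1.0},
         3: {2: 1.0, 4: 1.0}, 4: {3: 1.0}}
p_a = random_walker_probs(chain, {0: "A", 4: "B"}, "A")
```

On this unit-weight chain the probabilities fall off linearly from the "A" seed to the "B" seed, so a node would be assigned to whichever seed it sits closer to; in an image, edge weights that drop across strong intensity changes would bend that boundary toward object edges.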
Relationship-Based Clustering and Visualization for High-Dimensional Data Mining
INFORMS Journal on Computing, 2002
Cited by 44 (10 self)
In several real-life data-mining... This paper proposes a relationship-based approach that alleviates both problems, sidestepping the "curse-of-dimensionality" issue by working in a suitable similarity space instead of the original high-dimensional attribute space. This intermediary similarity space can be suitably tailored to satisfy business criteria such as requiring customer clusters to represent comparable amounts of revenue. We apply efficient and scalable graph-partitioning-based clustering techniques in this space. The output from the clustering algorithm is used to reorder the data points so that the resulting permuted similarity matrix can be readily visualized in two dimensions, with clusters showing up as bands. While two-dimensional visualization of a similarity matrix is by itself not novel, its combination with the order-sensitive partitioning of a graph that captures the relevant similarity measure between objects provides three powerful properties: (i) the high dimensionality of the data does not affect further processing once the similarity space is formed; (ii) it leads to clusters of (approximately) equal importance; and (iii) related clusters show up adjacent to one another, further facilitating the visualization of results. The visualization is very helpful for assessing and improving clustering. For example, actionable recommendations for splitting or merging of clusters can be easily derived, and it also guides the user toward the right number of clusters.
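The reordering step mentioned above can be pictured with a short sketch: given cluster labels, permute the rows and columns of the similarity matrix so that members of the same cluster sit next to each other. The helper name `permute_by_cluster` and the simple sort-by-label ordering are assumptions for illustration; the paper's ordering also places related clusters adjacently, which this sketch does not attempt.

```python
def permute_by_cluster(sim, labels):
    """Reorder a similarity matrix so points in the same cluster are contiguous.

    Assumptions: sim is a square list of lists and labels[i] is the cluster id
    of point i. Returns the new point order and the permuted matrix.
    """
    # A stable sort keeps the original relative order inside each cluster.
    order = sorted(range(len(labels)), key=lambda i: labels[i])
    permuted = [[sim[i][j] for j in order] for i in order]
    return order, permuted
```

When the clustering is good, high similarities gather in square blocks along the diagonal of the permuted matrix, which is what makes the band structure visible when the matrix is drawn as an image.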