Results 1–10 of 18
Clustering partially observed graphs via convex optimization.
 Journal of Machine Learning Research, 2014
Abstract

Cited by 47 (13 self)
This paper considers the problem of clustering a partially observed unweighted graph, i.e., one where for some node pairs we know there is an edge between them, for some others we know there is no edge, and for the remaining we do not know whether or not there is an edge. We want to organize the nodes into disjoint clusters so that there is relatively dense (observed) connectivity within clusters, and sparse connectivity across clusters. We take a novel yet natural approach to this problem, focusing on finding the clustering that minimizes the number of "disagreements", i.e., the sum of the number of (observed) missing edges within clusters and (observed) present edges across clusters. Our algorithm uses convex optimization; its basis is a reduction of disagreement minimization to the problem of recovering an (unknown) low-rank matrix and an (unknown) sparse matrix from their partially observed sum. We evaluate the performance of our algorithm on the classical Planted Partition/Stochastic Block Model. Our main theorem provides sufficient conditions for the success of our algorithm as a function of the minimum cluster size, edge density and observation probability; in particular, the results characterize the tradeoff between the observation probability and the edge density gap. When there are a constant number of clusters of equal size, our results are optimal up to logarithmic factors.
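The "disagreement" objective this abstract describes can be made concrete with a tiny brute-force sketch (a toy instance with hypothetical numbers; the paper itself minimizes this objective via convex relaxation, not enumeration):

```python
from itertools import product

# Toy instance: 6 nodes, planted clusters {0,1,2} and {3,4,5}.
# Entry +1 = observed edge, -1 = observed non-edge, 0 = unobserved pair.
O = [[0] * 6 for _ in range(6)]
def put(i, j, v):
    O[i][j] = O[j][i] = v

for i, j in [(0, 1), (1, 2), (3, 4), (3, 5), (4, 5)]:
    put(i, j, +1)   # observed within-cluster edges ((0,2) stays unobserved)
for i, j in [(0, 3), (0, 4), (0, 5), (1, 3), (1, 5), (2, 4), (2, 5)]:
    put(i, j, -1)   # observed across-cluster non-edges ((1,4) unobserved)
put(2, 3, +1)       # one noisy across-cluster edge

def disagreements(labels):
    # (observed) missing edges within clusters + (observed) edges across
    d = 0
    for i in range(6):
        for j in range(i + 1, 6):
            if O[i][j] == 0:
                continue
            same = labels[i] == labels[j]
            if same and O[i][j] == -1:
                d += 1
            if not same and O[i][j] == +1:
                d += 1
    return d

# brute force over all bipartitions; the planted clustering wins with
# a single disagreement (the noisy edge (2,3))
best = min(product([0, 1], repeat=6), key=disagreements)
```

Enumeration is exponential in the number of nodes, which is exactly why the paper reduces this minimization to low-rank-plus-sparse matrix recovery.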
Efficient active algorithms for hierarchical clustering
 in International Conference on Machine Learning (ICML)
, 2012
Abstract

Cited by 16 (3 self)
Advances in sensing technologies and the growth of the internet have resulted in an explosion in the size of modern datasets, while storage and processing power continue to lag behind. This motivates the need for algorithms that are efficient both in the number of measurements needed and in running time. To combat the challenges associated with large datasets, we propose a general framework for active hierarchical clustering that repeatedly runs an off-the-shelf clustering algorithm on small subsets of the data and comes with guarantees on performance, measurement complexity and runtime complexity. We instantiate this framework with a simple spectral clustering algorithm and provide concrete results on its performance, showing that, under some assumptions, this algorithm recovers all clusters of size Ω(log n) using O(n log² n) similarities and runs in O(n log³ n) time for a dataset of n objects. Through extensive experimentation we also demonstrate that this framework is practically alluring.
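A minimal sketch of the "cluster a small subset, then place the rest" idea, instantiated with spectral bipartitioning (a toy similarity matrix with hypothetical values; not the paper's exact algorithm or guarantees):

```python
import numpy as np

# hypothetical similarity matrix: 12 objects, two planted clusters of 6,
# within-cluster similarity 1.0, across-cluster similarity 0.1
n = 12
W = np.full((n, n), 0.1)
W[:6, :6] = 1.0
W[6:, 6:] = 1.0
np.fill_diagonal(W, 0.0)

# spectral bipartition on a small subset only (the measurement savings),
# then place every remaining object by its similarities to the subset
subset = [0, 1, 2, 6, 7, 8]
Ws = W[np.ix_(subset, subset)]
L = np.diag(Ws.sum(axis=1)) - Ws        # unnormalized graph Laplacian
_, vecs = np.linalg.eigh(L)
side = vecs[:, 1] > 0                   # sign of the Fiedler vector

labels = np.empty(n, dtype=bool)
labels[subset] = side
for i in [j for j in range(n) if j not in subset]:
    sims = W[i, subset]
    # average similarity to each side of the split decides membership
    labels[i] = sims[side].mean() > sims[~side].mean()
```

Only the subset-by-subset block and each point's similarities to the subset are ever read, which is the flavor of measurement saving the framework formalizes.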
Active learning using smooth relative regret approximations with applications
, 2000
Abstract

Cited by 10 (3 self)
The disagreement coefficient of Hanneke has become a central data-independent invariant in proving active learning rates. It has been shown in various ways that a concept class with low complexity, together with a bound on the disagreement coefficient at an optimal solution, allows active learning rates that are superior to passive learning ones. We present a different tool for pool-based active learning which follows from the existence of a certain uniform version of a low disagreement coefficient, but is not equivalent to it. In fact, we present two fundamental active learning problems of significant interest for which our approach allows non-trivial active learning bounds, whereas any general-purpose method relying only on disagreement coefficient bounds fails to guarantee any useful bounds for these problems. The applications of interest are: learning to rank from pairwise preferences, and clustering with side information (a.k.a. semi-supervised clustering). The tool we use is based on the learner's ability to compute an estimator of the difference between the loss of any hypothesis and some fixed "pivotal" hypothesis to within an absolute error of at most ε times the disagreement measure (ℓ1 distance) between the two hypotheses. We prove that such an estimator implies the existence of a learning algorithm which, at each iteration, reduces its in-class excess risk to within a constant factor. Each iteration replaces the current pivotal hypothesis with the minimizer of the estimated loss difference function with respect to the previous pivotal hypothesis. The label complexity essentially becomes that of computing this estimator.
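A toy illustration of the pivot-update loop (hypothetical setup: threshold classifiers on [0, 1]; here the loss difference is computed exactly on the disagreement region, whereas the paper only requires an estimator accurate to within ε times the ℓ1 distance):

```python
# pool of n points on [0, 1], labeled by a true threshold at 0.5
n = 1001
X = [i / (n - 1) for i in range(n)]
y = [x >= 0.5 for x in X]

def loss_diff(h, h0):
    # h and h0 agree outside [min(h,h0), max(h,h0)), so the loss
    # difference only needs labels from the disagreement region
    lo, hi = min(h, h0), max(h, h0)
    idx = [i for i, x in enumerate(X) if lo <= x < hi]
    err_h = sum((X[i] >= h) != y[i] for i in idx)
    err_h0 = sum((X[i] >= h0) != y[i] for i in idx)
    return (err_h - err_h0) / n

grid = [i / 100 for i in range(101)]   # hypothesis class: thresholds
pivot = 0.0
for _ in range(3):
    # each round replaces the pivot with the minimizer of the
    # loss-difference function relative to the previous pivot
    pivot = min(grid, key=lambda h: loss_diff(h, pivot))
```

In this clean one-dimensional case the loop locks onto the true threshold; the paper's contribution is showing that an ε-accurate estimator of `loss_diff` suffices in far more general hypothesis classes.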
Active Clustering of Biological Sequences
, 2012
Abstract

Cited by 8 (3 self)
Given a point set S and an unknown metric d on S, we study the problem of efficiently partitioning S into k clusters while querying few distances between the points. In our model we assume access to one-versus-all queries that, given a point s ∈ S, return the distances between s and all other points. We show that, given a natural assumption about the structure of the instance, we can efficiently find an accurate clustering using only O(k) distance queries. Our algorithm uses an active selection strategy to choose a small set of points that we call landmarks, and considers only the distances between landmarks and other points to produce a clustering. We use our procedure to cluster proteins by sequence similarity. This setting fits our model nicely because we can use a fast sequence database search program to query a sequence against an entire dataset. We conduct an empirical study showing that even though we query a small fraction of the distances between the points, we produce clusterings that are close to a desired clustering given by manual classification.
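A minimal sketch of landmark-based clustering in this query model (hypothetical Euclidean data standing in for the unknown metric; farthest-first selection is one natural active strategy, not necessarily the paper's exact rule):

```python
import numpy as np

# hypothetical data: 3 well-separated planted clusters of 20 points each
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
pts = np.vstack([c + 0.5 * rng.standard_normal((20, 2)) for c in centers])

def one_vs_all(i):
    # a "one-versus-all" query: distances from point i to every point
    return np.linalg.norm(pts - pts[i], axis=1)

def landmark_cluster(k):
    landmarks = [0]                     # start from an arbitrary point
    dist = one_vs_all(0)[None, :]       # one row of distances per landmark
    for _ in range(k - 1):
        # farthest-first: next landmark is the point farthest from
        # all landmarks chosen so far
        nxt = int(np.argmax(dist.min(axis=0)))
        landmarks.append(nxt)
        dist = np.vstack([dist, one_vs_all(nxt)])
    return dist.argmin(axis=0)          # assign each point to nearest landmark

labels = landmark_cluster(3)
```

Only k one-versus-all queries are issued, matching the O(k) query budget the abstract describes.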
A Novel Approximation to Dynamic Time Warping allows Anytime Clustering of Massive Time Series Datasets
Abstract

Cited by 3 (0 self)
Given the ubiquity of time series data, the data mining community has spent significant time investigating the best time series similarity measure to use for various tasks and domains. After more than a decade of extensive efforts, there is increasing evidence that Dynamic Time Warping (DTW) is very difficult to beat. Given that, recent efforts have focused on making the intrinsically slow DTW algorithm faster. For the similarity-search task, an important subroutine in many data mining algorithms, significant progress has been made by replacing the vast majority of expensive DTW calculations with cheap-to-compute lower-bound calculations. However, these lower-bound-based optimizations do not directly apply to clustering, and thus for some realistic problems, clustering with DTW can take days or weeks. In this work, we show that we can mitigate this untenable lethargy by casting DTW clustering as an anytime algorithm. At the heart of our algorithm is a novel data-adaptive approximation to DTW which can be quickly computed, and which produces approximations to DTW that are much better than the best currently known linear-time approximations. We demonstrate our ideas on real-world problems, showing that we can get virtually all the accuracy of a batch DTW clustering algorithm in a fraction of the time.
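For reference, the DTW distance this abstract is approximating is the classic quadratic-time dynamic program (standard textbook formulation, not the paper's data-adaptive approximation):

```python
import math

def dtw(a, b):
    # O(len(a) * len(b)) dynamic program: D[i][j] is the cheapest
    # warping cost aligning a[:i] with b[:j]
    n, m = len(a), len(b)
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            D[i][j] = cost + min(D[i - 1][j],      # insertion
                                 D[i][j - 1],      # deletion
                                 D[i - 1][j - 1])  # match
    return D[n][m]

# two time-shifted copies of the same shape align perfectly under DTW
d = dtw([0, 0, 1, 2, 1, 0], [0, 1, 2, 1, 0, 0])
```

The quadratic cost of this inner loop, multiplied over all pairs in a clustering, is exactly the "days or weeks" bottleneck the paper targets; similarity search escapes it via lower bounds such as LB_Keogh, which is why clustering needed a different trick.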
Completion of high-rank ultrametric matrices using selective entries
 In IEEE International Conference on Signal Processing and Communications
, 2012
Abstract

Cited by 3 (1 self)
Ultrametric matrices are hierarchically structured matrices that arise naturally in many scenarios, e.g., delay covariance of packets sent from a source to a set of clients in a computer network, interactions between multi-scale communities in a social network, and genome sequence alignment scores in phylogenetic tree reconstruction problems. In this work, we show that it is possible to complete n × n ultrametric matrices using only n log² n entries. Since ultrametric matrices are high-rank matrices, our results extend recent work on completion of n × n low-rank matrices, which requires n log n randomly sampled entries. In the ultrametric setting, a random sampling of entries does not suffice, and we require selective sampling of entries using feedback obtained from entries observed at a previous stage.
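The structure that makes selective sampling possible is the ultrametric three-point property: in any triple of points, the two largest pairwise distances are equal. A minimal sketch of how one missing entry can be inferred from a third "witness" point (toy 4-point matrix with hypothetical values; the paper's full adaptive algorithm is more involved):

```python
# hypothetical ultrametric from a dendrogram: {a,b} merge at height 1,
# {c,d} merge at height 2, and everything merges at height 4
D = [
    [0, 1, 4, 4],
    [1, 0, 4, 4],
    [4, 4, 0, 2],
    [4, 4, 2, 0],
]

def infer_missing(D, i, k, j):
    # three-point rule: the two largest of d(i,j), d(j,k), d(i,k) are
    # equal, so when d(i,j) != d(j,k), the unobserved d(i,k) must be
    # the larger of the two
    dij, djk = D[i][j], D[j][k]
    if dij != djk:
        return max(dij, djk)
    return None  # this witness alone cannot pin the entry down

# treat D[0][2] as unobserved and recover it via witness j = 1
recovered = infer_missing(D, 0, 2, 1)
```

When the witness gives equal distances the entry stays ambiguous, which is why a random sample can fail and the sampling must be steered by previously observed entries.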
Hierarchical clustering using randomly selected measurements
 In Proceedings of the IEEE Statistical Signal Processing Workshop
, 2012
Comprehensive Cross-Hierarchy Cluster Agreement Evaluation
Abstract

Cited by 1 (1 self)
Hierarchical clustering represents a family of widely used clustering approaches that can organize objects into a hierarchy based on the similarity in objects' feature values. One significant obstacle facing hierarchical clustering research today is the lack of general and robust evaluation methods. Existing works rely on a range of evaluation techniques, including both internal measures (no ground truth is considered in evaluation) and external measures (results are compared to a ground-truth semantic structure). The existing internal techniques may have strong hierarchical validity, but the available external measures were not developed specifically for hierarchies. This lack of specificity prevents them from comparing hierarchy structures in a holistic, principled way. To address this problem, we propose the Hierarchy Agreement Index, a novel hierarchy …
Estimating Intrinsic Dimension via Clustering
Abstract

Cited by 1 (0 self)
Estimating the intrinsic dimension of a data set from pairwise distances is a critical issue for a wide range of disciplines, including genomics, finance, and networking. Current estimation techniques are agnostic to structure in the data, failing to exploit properties that can improve efficiency. In this paper, we present a methodology that uses the inherent clustering present in data to efficiently and accurately estimate intrinsic dimension. Our experiments show that this approach has greater accuracy and better scalability than prior techniques, even when the data does not conform to an obvious clustering structure.
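As a point of contrast, a structure-agnostic estimator of the kind this abstract improves on can be sketched with the classic correlation-dimension slope (hypothetical synthetic data: a 2-D plane embedded in 5-D; radii and sample size are arbitrary choices, not the paper's method):

```python
import numpy as np

rng = np.random.default_rng(0)
# hypothetical data: 400 points on a 2-D plane embedded in 5-D space
n = 400
uv = rng.uniform(0, 1, size=(n, 2))
basis = np.array([[1, 0, 1, 0, 1],
                  [0, 1, 0, 1, 0]], dtype=float)
X = uv @ basis

# all pairwise distances (the only input the estimator sees)
diff = X[:, None, :] - X[None, :, :]
dists = np.sqrt((diff ** 2).sum(-1))[np.triu_indices(n, 1)]

# correlation integral C(r) = fraction of pairs closer than r;
# for a d-dimensional set, C(r) ~ r^d at small r, so the log-log
# slope estimates the intrinsic dimension
rs = np.array([0.1, 0.2, 0.4, 0.8])
C = np.array([(dists < r).mean() for r in rs])
slope = np.polyfit(np.log(rs), np.log(C), 1)[0]
```

This baseline examines every pair uniformly; the paper's point is that exploiting the clustering already present in the data yields the same kind of estimate with better accuracy and scalability.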