Results 1–10 of 54
Are stable instances easy?
2008
Abstract
Cited by 30 (2 self)
We introduce the notion of a stable instance for a discrete optimization problem, and argue that in many practical situations only sufficiently stable instances are of interest. The question then arises whether stable instances of NP-hard problems are easier to solve. In particular, whether there exist algorithms that solve correctly and in polynomial time all sufficiently stable instances of some NP-hard problem. The paper focuses on the Max-Cut problem, for which we show that this is indeed the case.
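To make the Max-Cut objective concrete, here is a minimal local-search heuristic sketch: it is NOT the paper's algorithm for stable instances, only an illustration of the cut value being optimized. The graph representation (a dict from vertex pairs to edge weights) is an assumption of this sketch.

```python
# Illustrative local-search heuristic for the Max-Cut objective
# (NOT the paper's algorithm for stable instances): repeatedly flip
# any vertex whose flip increases the total weight of cut edges.

def max_cut_local_search(weights, n):
    """weights: dict mapping frozenset({u, v}) -> edge weight; n vertices 0..n-1."""
    side = {v: v % 2 for v in range(n)}  # arbitrary starting partition

    def gain(v):
        # Change in cut weight if v switches sides: edges to v's own side
        # become cut (+w), edges to the other side become uncut (-w).
        g = 0.0
        for u in range(n):
            w = weights.get(frozenset({u, v}), 0.0)
            g += w if side[u] == side[v] else -w
        return g

    improved = True
    while improved:
        improved = False
        for v in range(n):
            if gain(v) > 0:
                side[v] = 1 - side[v]
                improved = True
    cut = sum(w for e, w in weights.items()
              if len({side[v] for v in e}) == 2)
    return side, cut
```

On a 4-cycle with unit weights, the bipartition cutting all four edges is optimal, and the local search finds it.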
Clusterability: A Theoretical Study
Proceedings of AISTATS 09, JMLR: W&CP 5, 2009
Abstract
Cited by 22 (4 self)
We investigate measures of the clusterability of data sets, namely, ways to define how 'strong' or 'conclusive' the clustering structure of a given data set is. We address this issue with generality, aiming for conclusions that apply regardless of any particular clustering algorithm or any specific data generation model. We survey several notions of clusterability that have been discussed in the literature, and propose a new notion of data clusterability. Our comparison of these notions reveals that, although they all attempt to evaluate the same intuitive property, they are pairwise inconsistent. Our analysis discovers an interesting phenomenon: although most of the common clustering tasks are NP-hard, finding a close-to-optimal clustering for well-clusterable data sets is computationally easy. We prove instances of this general claim with respect to the various clusterability notions that we discuss. Finally, we investigate how hard it is to determine the clusterability value of a given data set. In most cases, it turns out that this is an NP-hard problem.
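As a hypothetical example of what a clusterability-style score can look like (not necessarily one of the notions compared in the paper), one can take the ratio of the smallest between-cluster distance to the largest within-cluster distance of a given partition; larger values indicate a more clearly separated clustering.

```python
# A hypothetical separation-ratio score for a labeled clustering:
# min between-cluster distance / max within-cluster distance.
# Illustrative only; not necessarily a notion from the paper.
import math

def separation_ratio(points, labels):
    within = max((math.dist(p, q)
                  for i, p in enumerate(points)
                  for j, q in enumerate(points)
                  if i < j and labels[i] == labels[j]), default=0.0)
    between = min((math.dist(p, q)
                   for i, p in enumerate(points)
                   for j, q in enumerate(points)
                   if i < j and labels[i] != labels[j]), default=float("inf"))
    return between / within if within > 0 else float("inf")
```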
Center-based clustering under perturbation stability
Abstract
Cited by 19 (4 self)
Optimal clustering under most popular objective functions is NP-hard, and therefore unlikely to be efficiently solvable in the worst case. Recently, Bilu and Linial [11] suggested an approach aimed instead at understanding the complexity of clustering instances which arise in practice. They argue that such instances should be stable to perturbations in the metric space, and give an efficient algorithm for clustering instances which are stable to perturbations of size O(√n) for Max-Cut based clustering. In addition, they conjecture that instances stable to as little as O(1) perturbations should be solvable in polynomial time. In this paper we prove that this conjecture is true for any center-based clustering objective (such as k-median, k-means, and k-center). I.e., we can efficiently find the optimal clustering assuming only stability to constant-magnitude perturbations of the underlying metric. Specifically, we show that for center-based clustering instances which are stable to O(1) perturbations, the popular Single-Linkage algorithm combined with dynamic programming will find the optimal clustering. Keywords: Clustering, k-median, k-means, Stability Conditions
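The Single-Linkage step named in this abstract can be sketched as follows (the paper's dynamic-programming phase over the linkage tree is omitted). Single-linkage k-clustering is equivalent to deleting the k−1 heaviest edges of a minimum spanning tree of the points, which is what this sketch does.

```python
# Minimal Single-Linkage k-clustering sketch: build an MST (Prim's
# algorithm), drop the k-1 longest edges, and return component labels.
# This is only the linkage step; the paper's DP phase is not shown.
import math

def single_linkage(points, k):
    n = len(points)
    in_tree = {0}
    best = {v: (math.dist(points[0], points[v]), 0) for v in range(1, n)}
    mst = []
    while len(in_tree) < n:
        v = min(best, key=lambda u: best[u][0])
        d, parent = best.pop(v)
        mst.append((d, parent, v))
        in_tree.add(v)
        for u in list(best):
            du = math.dist(points[v], points[u])
            if du < best[u][0]:
                best[u] = (du, v)
    mst.sort()
    keep = mst[: n - k]          # union along the n-k shortest MST edges
    label = list(range(n))

    def find(x):                 # union-find with path halving
        while label[x] != x:
            label[x] = label[label[x]]
            x = label[x]
        return x

    for _, u, v in keep:
        label[find(u)] = find(v)
    return [find(i) for i in range(n)]
```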
Predicting Trust and Distrust in Social Networks
Abstract
Cited by 19 (1 self)
As user-generated content and interactions have overtaken the web as the default mode of use, questions of whom and what to trust have become increasingly important. Fortunately, online social networks and social media have made it easy for users to indicate whom they trust and whom they do not. However, this does not solve the problem, since each user is only likely to know a tiny fraction of other users; we must have methods for inferring trust and distrust between users who do not know one another. In this paper, we present a new method for computing both trust and distrust (i.e., positive and negative trust). We do this by combining an inference algorithm that relies on a probabilistic interpretation of trust based on random graphs with a modified spring-embedding algorithm. Our algorithm correctly classifies hidden trust edges as positive or negative with high accuracy. These results are useful in a wide range of social web applications where trust is important to user behavior and satisfaction.
Stability yields a PTAS for k-Median and k-Means Clustering
2010
Abstract
Cited by 18 (8 self)
We consider k-median clustering in finite metric spaces and k-means clustering in Euclidean spaces, in the setting where k is part of the input (not a constant). For the k-means problem, Ostrovsky et al. [ORSS06] show that if the input satisfies the condition that the optimal (k − 1)-means clustering is more expensive than the optimal k-means clustering by a factor of max{100, 1/α²}, then one can achieve a (1 + f(α))-approximation to the k-means optimal in time polynomial in n and k by using a variant of Lloyd's algorithm. In this work we substantially improve this approximation guarantee. We show that given only the condition that the (k − 1)-means optimal is more expensive than the k-means optimal by a factor 1 + α for some constant α > 0, we can obtain a PTAS. In particular, under this assumption, for any ε > 0 we achieve a (1 + ε)-approximation to the k-means optimal in time polynomial in n and k, and exponential in 1/ε and 1/α. We thus decouple the strength of the assumption from the quality of the approximation ratio. We also give a PTAS for the k-median problem in finite metrics under the analogous assumption as well. For k-means, we in addition give a randomized algorithm with improved running time n^O(1) · (k log n)^poly(1/ε, 1/α). Our technique also obtains a PTAS under the assumption of Balcan et al. [BBG09] that all (1 + α)-approximations are δ-close to a desired target clustering, when all target clusters have size greater than 2δn and α > 0 is constant. Note that the motivation of [BBG09] is that the true goal in clustering is often to get the points right, with objective values serving just as a proxy, and [BBG09] already get O(δ/α)-close for general α and arbitrary target cluster sizes. So the primary advance here is in further elucidating the approximation implications and in formally relating the assumptions. In particular, both results are based on a new notion of clustering stability that extends both the notion of [ORSS06] and that of [BBG09].
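A minimal sketch of inspecting this separation assumption on data: compare the heuristic k-means cost at k−1 and k and take the ratio (the assumption asks for cost(k−1)/cost(k) ≥ 1 + α). Lloyd's algorithm only reaches a local optimum, so this illustrates the condition rather than certifying it; the function names are this sketch's own.

```python
# Heuristic check of the (1 + alpha) separation condition via Lloyd's
# k-means. Lloyd's finds only a local optimum, so the ratio below is an
# illustration of the assumption, not a proof that it holds.
import math
import random

def kmeans_cost(points, k, iters=50, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda j: math.dist(p, centers[j]))
            clusters[i].append(p)
        centers = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
    return sum(min(math.dist(p, c) ** 2 for c in centers) for p in points)

def separation(points, k):
    """Ratio cost(k-1) / cost(k); the assumption asks for >= 1 + alpha."""
    return kmeans_cost(points, k - 1) / kmeans_cost(points, k)
```

For two tight, far-apart groups the ratio is very large, i.e. the instance is highly separated at k = 2.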
Robust hierarchical clustering
2010
Abstract
Cited by 17 (2 self)
One of the most widely used techniques for data clustering is agglomerative clustering. Such algorithms have long been used across many different fields, ranging from computational biology to social sciences to computer vision, in part because their output is easy to interpret. Unfortunately, it is well known that many of the classic agglomerative clustering algorithms are not robust to noise [14]. In this paper we propose and analyze a new robust algorithm for bottom-up agglomerative clustering. We show that our algorithm can be used to cluster accurately in cases where the data satisfies a number of natural properties and where the traditional agglomerative algorithms fail. We also show how to adapt our algorithm to the inductive setting, where our given data is only a small random sample of the entire data set.
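For contrast with the single nearest-pair rule that makes classic linkage noise-sensitive, here is a bottom-up merge that averages many pairwise distances per decision (average linkage). This is an illustrative example of robustness through aggregation, NOT the specific algorithm proposed in the paper.

```python
# Illustrative bottom-up merging with average linkage: each merge
# decision aggregates all pairwise distances between two blobs,
# rather than trusting a single closest pair. Not the paper's algorithm.
import math

def average_linkage(points, k):
    clusters = [[i] for i in range(len(points))]

    def avg_dist(a, b):
        return sum(math.dist(points[i], points[j])
                   for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > k:
        # Merge the pair of clusters with the smallest average distance.
        _, i, j = min((avg_dist(a, b), i, j)
                      for i, a in enumerate(clusters)
                      for j, b in enumerate(clusters) if i < j)
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```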
Robust Subspace Clustering
2013
Abstract
Cited by 16 (1 self)
Subspace clustering refers to the task of finding a multi-subspace representation that best fits a collection of points taken from a high-dimensional space. This paper introduces an algorithm inspired by sparse subspace clustering (SSC) [17] to cluster noisy data, and develops some novel theory demonstrating its correctness. In particular, the theory uses ideas from geometric functional analysis to show that the algorithm can accurately recover the underlying subspaces under minimal requirements on their orientation, and on the number of samples per subspace. Synthetic as well as real data experiments complement our theoretical study, illustrating our approach and demonstrating its effectiveness.
Finding low error clusterings
In COLT, 2009
Abstract
Cited by 15 (5 self)
A common approach for solving clustering problems is to design algorithms to approximately optimize various objective functions (e.g., k-means or min-sum) defined in terms of some given pairwise distance or similarity information. However, in many learning-motivated clustering applications (such as clustering proteins by function) there is some unknown target clustering; in such cases the pairwise information is merely based on heuristics, and the real goal is to achieve low error on the data. In these settings, an arbitrary c-approximation algorithm for some objective would work well only if any c-approximation to that objective is close to the target clustering. In recent work, Balcan et al. [7] have shown how, for both the k-means and k-median objectives, this property allows one to produce clusterings of low error, even for values c such that getting a c-approximation to these objective functions is provably NP-hard. In this paper we analyze the min-sum objective from this perspective. While [7] also considered the min-sum problem, the results they derived for this objective were substantially weaker. In this work we derive new and more subtle structural properties for min-sum in this context and use these to design efficient algorithms for producing accurate clusterings, both in the transductive and in the inductive case. We also analyze the correlation clustering problem from this perspective, and point out interesting differences between this objective and the k-median, k-means, or min-sum objectives.
Active learning using smooth relative regret approximations with applications (full version)
In arXiv:1110.2136, 2012
Abstract
Cited by 10 (3 self)
The disagreement coefficient of Hanneke has become a central data-independent invariant in proving active learning rates. It has been shown in various ways that a concept class with low complexity, together with a bound on the disagreement coefficient at an optimal solution, allows active learning rates that are superior to passive learning ones. We present a different tool for pool-based active learning which follows from the existence of a certain uniform version of a low disagreement coefficient, but is not equivalent to it. In fact, we present two fundamental active learning problems of significant interest for which our approach allows nontrivial active learning bounds, while any general-purpose method relying only on disagreement coefficient bounds fails to guarantee any useful bounds for these problems. The applications of interest are: learning to rank from pairwise preferences, and clustering with side information (a.k.a. semi-supervised clustering). The tool we use is based on the learner's ability to compute an estimator of the difference between the loss of any hypothesis and some fixed "pivotal" hypothesis to within an absolute error of at most ε times the disagreement measure (ℓ1 distance) between the two hypotheses. We prove that such an estimator implies the existence of a learning algorithm which, at each iteration, reduces its in-class excess risk to within a constant factor. Each iteration replaces the current pivotal hypothesis with the minimizer of the estimated loss difference function with respect to the previous pivotal hypothesis. The label complexity essentially becomes that of computing this estimator.
Efficient Clustering with Limited Distance Information
Abstract
Cited by 9 (4 self)
Given a point set S and an unknown metric d on S, we study the problem of efficiently partitioning S into k clusters while querying few distances between the points. In our model we assume that we have access to one-versus-all queries that, given a point s ∈ S, return the distances between s and all other points. We show that, given a natural assumption about the structure of the instance, we can efficiently find an accurate clustering using only O(k) distance queries. We use our algorithm to cluster proteins by sequence similarity. This setting fits our model nicely because we can use a fast sequence database search program to query a sequence against an entire dataset. We conduct an empirical study showing that even though we query only a small fraction of the distances between the points, we produce clusterings that are close to a desired clustering given by manual classification.
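The one-versus-all query model described above can be sketched as follows: farthest-first traversal picks k landmarks using exactly k queries, and each point is then assigned to its nearest landmark using those same k answers. This is an illustrative use of the query model, not the paper's algorithm; `query` and `landmark_cluster` are names introduced for this sketch.

```python
# Sketch of clustering with O(k) one-versus-all distance queries:
# query(s) returns the list of distances from point s to every point.
# Farthest-first traversal selects k landmarks with exactly k queries;
# the stored answers then give each point's nearest landmark for free.

def landmark_cluster(n, k, query):
    landmarks = [0]                  # arbitrary first landmark
    answers = [query(0)]             # total of k one-versus-all queries
    for _ in range(k - 1):
        # Next landmark: the point farthest from all current landmarks.
        far = max(range(n), key=lambda p: min(a[p] for a in answers))
        landmarks.append(far)
        answers.append(query(far))
    # Assign every point to its nearest landmark (no new queries needed).
    return [min(range(k), key=lambda i: answers[i][p]) for p in range(n)]
```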