Results 1–10 of 31
Distributed k-means and k-median clustering on general topologies. In Advances in Neural Information Processing Systems, 2013.
Cited by 10 (4 self)

Abstract
This paper provides new algorithms for distributed clustering for two popular center-based objectives, k-median and k-means. These algorithms have provable guarantees and improve communication complexity over existing approaches. Following a classic approach in clustering by [13], we reduce the problem of finding a clustering with low cost to the problem of finding a coreset of small size. We provide a distributed method for constructing a global coreset which improves over the previous methods by reducing the communication complexity, and which works over general communication topologies. Experimental results on large scale data sets show that this approach outperforms other coreset-based distributed clustering algorithms.
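The coreset idea in this abstract can be illustrated with a generic sensitivity-style sampling sketch: sample points with probability tilted toward those far from a rough solution, and reweight so the sampled cost is an unbiased estimate of the full cost. This is not the paper's construction; the function names and the 50/50 mix of uniform and distance-proportional sampling are illustrative choices.

```python
import numpy as np

def kmeans_cost(points, centers, weights=None):
    # squared distance from each (weighted) point to its nearest center
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(-1).min(1)
    if weights is None:
        weights = np.ones(len(points))
    return float((weights * d2).sum())

def coreset(points, rough_centers, m, rng):
    # sensitivity-style sampling: probability proportional to the point's
    # squared distance to a rough solution, mixed with a uniform term
    d2 = ((points[:, None, :] - rough_centers[None, :, :]) ** 2).sum(-1).min(1)
    prob = 0.5 * d2 / d2.sum() + 0.5 / len(points)
    idx = rng.choice(len(points), size=m, p=prob)
    # reweight so the coreset cost is an unbiased estimate of the full cost
    return points[idx], 1.0 / (m * prob[idx])
```

Evaluating any candidate centers on the small weighted sample then approximates their cost on the full data, which is what lets each node communicate only its coreset.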
Clustering under Approximation Stability, 2009.
Cited by 7 (3 self)

Abstract
A common approach to clustering data is to view data objects as points in a metric space, and then to optimize a natural distance-based objective such as the k-median, k-means, or min-sum score. For applications such as clustering proteins by function or clustering images by subject, the implicit hope in taking this approach is that the optimal solution to the chosen objective will closely match the desired “target” clustering (e.g., a correct clustering of proteins by function or of images by who is in them). However, most distance-based objectives, including those above, are NP-hard to optimize. So, this assumption by itself is not sufficient, assuming P ≠ NP, to achieve low-error clusterings via polynomial-time algorithms. In this paper, we show that we can bypass this barrier if we slightly extend this assumption to ask that for some small constant c, not only the optimal solution, but also all c-approximations to the optimal solution, differ from the target on at most some ε fraction of points; we call this (c, ε)-approximation-stability. We show that under this condition, it is possible to efficiently obtain low-error clusterings even if the property holds only for values c for which the objective is known to be NP-hard to approximate. Specifically, for any constant c > 1, (c, ε)-approximation-stability of the k-median or k-means objective can be used to efficiently produce a clustering of error O(ε).
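The (c, ε) definition can be made concrete on a toy instance by brute force: enumerate every k-labeling of a few points on a line, find all c-approximations to the k-median objective, and measure the worst disagreement with the target clustering up to relabeling. All names here are ours, purely for illustration.

```python
from itertools import product, permutations

def kmedian_cost(points, labels, k):
    # k-median cost: each cluster pays distances to its best medoid (a data point)
    total = 0.0
    for c in range(k):
        cluster = [p for p, l in zip(points, labels) if l == c]
        if cluster:
            total += min(sum(abs(p - m) for p in cluster) for m in cluster)
    return total

def disagreement(a, b, k):
    # number of points on which labelings a and b differ, minimized over relabelings
    return min(sum(x != perm[y] for x, y in zip(a, b))
               for perm in permutations(range(k)))

def worst_c_approx_error(points, target, k, c):
    # fraction of points on which the worst c-approximation disagrees with target
    all_labelings = list(product(range(k), repeat=len(points)))
    opt = min(kmedian_cost(points, l, k) for l in all_labelings)
    worst = max(disagreement(l, target, k) for l in all_labelings
                if kmedian_cost(points, l, k) <= c * opt)
    return worst / len(points)
```

On two well-separated groups of three points, every 1.5-approximation coincides with the target, i.e. the instance is (1.5, 0)-approximation-stable in the sense above.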
Approximating Minimum-Cost k-Node Connected Subgraphs via Independence-Free Graphs, 2012.
Cited by 6 (0 self)

Abstract
We present a 6-approximation algorithm for the minimum-cost k-node connected spanning subgraph problem, assuming that the number of nodes is at least k³(k − 1) + k. We apply a combinatorial preprocessing, based on the Frank–Tardos algorithm for k-outconnectivity, to transform any input into an instance such that the iterative rounding method gives a 2-approximation guarantee. This is the first constant-factor approximation algorithm even in the asymptotic setting of the problem, that is, the restriction to instances where the number of nodes is lower bounded by a function of k.
An improved approximation for k-median, and positive correlation in budgeted optimization. In Proceedings of SODA, 2015.
Cited by 5 (0 self)

Abstract
Dependent rounding is a useful technique for optimization problems with hard budget constraints. This framework naturally leads to negative correlation properties. However, what if an application naturally calls for dependent rounding on the one hand, and desires positive correlation on the other? More generally, we develop algorithms that guarantee the known properties of dependent rounding, but also have nearly best-possible behavior (near-independence, which generalizes positive correlation) on “small” subsets of the variables. The recent breakthrough of Li & Svensson for the classical k-median problem has to handle positive correlation in certain dependent-rounding settings, and does so implicitly. We improve upon Li–Svensson’s approximation ratio for k-median from 2.732 + ε to 2.611 + ε by developing an algorithm that improves upon various aspects of their work. Our dependent-rounding approach helps us improve the dependence of the runtime on the parameter ε from Li–Svensson’s N^O(1/ε²) to N^O((1/ε) log(1/ε)).
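The basic dependent-rounding primitive the abstract refers to can be sketched in a few lines: repeatedly pick two fractional coordinates and transfer mass between them with probabilities chosen so the expectation is unchanged, which preserves both the sum (the budget) and every marginal. This is the textbook primitive, not the paper's near-independent variant.

```python
import random

def dependent_round(x, rng, eps=1e-9):
    # Round fractional x (with integer sum) to a 0/1 vector of the same sum,
    # preserving each marginal Pr[X_i = 1] = x_i.
    x = list(x)
    frac = [i for i in range(len(x)) if eps < x[i] < 1 - eps]
    while len(frac) >= 2:
        i, j = frac[0], frac[1]
        a = min(1 - x[i], x[j])   # move that raises x[i] and lowers x[j]
        b = min(x[i], 1 - x[j])   # move that lowers x[i] and raises x[j]
        if rng.random() < b / (a + b):   # probabilities make E[x] unchanged
            x[i] += a
            x[j] -= a
        else:
            x[i] -= b
            x[j] += b
        # each step makes at least one of the pair integral
        frac = [k for k in (i, j) if eps < x[k] < 1 - eps] + frac[2:]
    return [1 if v > 0.5 else 0 for v in x]
```

Because one coordinate becomes integral per step, the loop runs at most n − 1 times; the negative correlation the abstract mentions comes from the fact that raising one variable always lowers its partner.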
Recovery guarantees for exemplar-based clustering. arXiv preprint arXiv:1309.3256, 2013.
Cited by 2 (0 self)

Abstract
For a certain class of distributions, we prove that the linear programming relaxation of k-medoids clustering (a variant of k-means clustering where means are replaced by exemplars from within the dataset) distinguishes points drawn from non-overlapping balls with high probability once the number of points drawn and the separation distance between any two balls are sufficiently large. Our results hold in the nontrivial regime where the separation distance is small enough that points drawn from different balls may be closer to each other than points drawn from the same ball; in this case, clustering by thresholding pairwise distances between points can fail. We also exhibit numerical evidence of high-probability recovery in a substantially more permissive regime.
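The standard LP relaxation of k-medoids can be set up directly with `scipy.optimize.linprog`: assignment variables z[i, j] and exemplar variables y[j], with each point fully assigned, z[i, j] ≤ y[j], and exactly k exemplars. On a well-separated toy instance the relaxation comes out integral, mirroring the recovery phenomenon described above. The formulation is the generic one; the function name is ours.

```python
import numpy as np
from scipy.optimize import linprog

def kmedoids_lp(D, k):
    # LP relaxation of k-medoids on an n x n dissimilarity matrix D.
    # Variables: z[i, j] (point i assigned to exemplar j), then y[j].
    n = D.shape[0]
    nz = n * n
    c = np.concatenate([D.ravel(), np.zeros(n)])
    # equalities: each point fully assigned; total exemplar mass equals k
    A_eq = np.zeros((n + 1, nz + n))
    for i in range(n):
        A_eq[i, i * n:(i + 1) * n] = 1.0
    A_eq[n, nz:] = 1.0
    b_eq = np.concatenate([np.ones(n), [k]])
    # inequalities: z[i, j] <= y[j]
    A_ub = np.zeros((nz, nz + n))
    for i in range(n):
        for j in range(n):
            A_ub[i * n + j, i * n + j] = 1.0
            A_ub[i * n + j, nz + j] = -1.0
    return linprog(c, A_ub=A_ub, b_ub=np.zeros(nz),
                   A_eq=A_eq, b_eq=b_eq, bounds=(0, 1))
```

Note the quadratic variable count (n² + n) this relaxation carries, which is the scalability obstacle taken up by the block coordinate descent paper later in this list.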
Network cross-validation for determining the number of communities in network data. arXiv:1411.1715v1, 2014.
Cited by 2 (0 self)

Abstract
The stochastic block model and its variants have been a popular tool in analyzing large network data with community structures. Model selection for these network models, such as determining the number of communities, has been a challenging statistical inference task. In this paper we develop an efficient cross-validation approach to determine the number of communities, as well as to choose between the regular stochastic block model and the degree-corrected block model. Our method, called network cross-validation, is based on a block-wise edge splitting technique, combined with an integrated step of community recovery using sub-blocks of the adjacency matrix. The solid performance of our method is supported by theoretical analysis of the sub-block parameter estimation, and is demonstrated in extensive simulations and a data example. Extensions to more general network models are also discussed.
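The scoring step behind such block-wise splitting can be sketched as follows: estimate block probabilities from one sub-block of the adjacency matrix and evaluate Bernoulli log-likelihood on a held-out sub-block. This omits the paper's community-recovery step (labels are taken as given) and every name is ours.

```python
import numpy as np

def heldout_loglik(A_fit, A_hold, labels_fit, labels_hold, K):
    # Estimate SBM block probabilities P[a, b] from the fit sub-block.
    P = np.zeros((K, K))
    for a in range(K):
        for b in range(K):
            block = A_fit[np.ix_(labels_fit == a, labels_fit == b)]
            P[a, b] = block.mean() if block.size else 0.5
    P = np.clip(P, 1e-6, 1 - 1e-6)
    # Score: Bernoulli log-likelihood of the held-out rows-vs-fit-columns block.
    probs = P[labels_hold[:, None], labels_fit[None, :]]
    return float((A_hold * np.log(probs)
                  + (1 - A_hold) * np.log(1 - probs)).sum())
```

Repeating this for each candidate K (with labels re-estimated per K) and picking the best held-out likelihood is the shape of the cross-validation the abstract describes.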
Scalable Exemplar Clustering and Facility Location via Augmented Block Coordinate Descent with Column Generation
Cited by 1 (1 self)

Abstract
In recent years exemplar clustering has become a popular tool for applications in document and video summarization, active learning, and clustering with general similarity, where cluster centroids are required to be a subset of the data samples rather than their linear combinations. The problem is also well known as facility location in the operations research literature. While the problem has a well-developed convex relaxation with approximation and recovery guarantees, its number of variables grows quadratically with the number of samples. Therefore, state-of-the-art methods can hardly handle more than 10^4 samples (i.e., 10^8 variables). In this work, we propose an Augmented-Lagrangian with Block Coordinate Descent (ALBCD) algorithm that utilizes problem structure to obtain a closed-form solution for each block subproblem, and exploits a low-rank representation of the dissimilarity matrix to search active columns without computing the entire matrix. Experiments show our approach to be orders of magnitude faster than existing approaches, handling problems of up to 10^6 samples. We also demonstrate successful applications of the algorithm on world-scale facility location, document summarization, and active learning.
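For reference, the objective being relaxed here is plain uncapacitated facility location: opening costs plus each client's distance to its nearest open facility. A naive greedy on a toy instance (no approximation guarantee claimed; all names ours) shows the objective in action:

```python
def facility_cost(D, open_cost, S):
    # opening costs plus each client's distance to its nearest open facility
    return (sum(open_cost[j] for j in S)
            + sum(min(row[j] for j in S) for row in D))

def greedy_facility(D, open_cost):
    # open facilities one at a time while the total cost keeps dropping
    n = len(open_cost)
    S, best = set(), float("inf")
    while True:
        cand = [(facility_cost(D, open_cost, S | {j}), j)
                for j in range(n) if j not in S]
        if not cand:
            return S, best
        cost, j = min(cand)
        if cost >= best:
            return S, best
        S.add(j)
        best = cost
```

In the exemplar-clustering setting, clients and candidate facilities are the same point set, so D is the n x n dissimilarity matrix whose quadratic size is exactly the bottleneck the abstract addresses.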
The hardness of approximation of Euclidean k-means. In Proc. 31st Int. Symp. Computational Geometry, 2015.
Cited by 1 (0 self)

Abstract
The Euclidean k-means problem is a classical problem that has been extensively studied in the theoretical computer science, machine learning, and computational geometry communities. In this problem, we are given a set of n points in Euclidean space R^d, and the goal is to choose k center points in R^d so that the sum of squared distances of each point to its nearest center is minimized. The best approximation algorithms for this problem include a polynomial-time constant-factor approximation for general k and a (1 + ε)-approximation which runs in time poly(n) exp(k/ε). At the other extreme, the only known computational complexity result for this problem is NP-hardness [1]. The main difficulty in obtaining hardness results stems from the Euclidean nature of the problem, and the fact that any point in R^d can be a potential center. This gap in understanding left open the intriguing possibility that the problem might admit a PTAS for all k, d. In this paper we provide the first hardness of approximation for the Euclidean k-means problem. Concretely, we show that there exists a constant ε > 0 such that it is NP-hard to approximate the k-means objective to within a factor of (1 + ε). We show this via an efficient reduction from the vertex cover problem on triangle-free graphs: given a triangle-free graph, the goal is to choose the fewest vertices that are incident on all the edges. Additionally, we give a proof that the current best hardness results for vertex cover can be carried over to triangle-free graphs. To show this we transform G, a known hard vertex cover instance, by taking a graph product with a suitably chosen graph H, and showing that the size of the (normalized) maximum independent set is almost exactly preserved in the product graph using a spectral analysis, which might be of independent interest.
On Uniform Capacitated k-Median Beyond the Natural LP Relaxation
Cited by 1 (1 self)

Abstract
In this paper, we study the uniform capacitated k-median problem. In the problem, we are given a set F of potential facility locations, a set C of clients, a metric d over F ∪ C, an upper bound k on the number of facilities we can open, and an upper bound u on the number of clients each facility can serve. We need to open a subset S ⊆ F of k facilities and connect clients in C to facilities in S so that each facility is connected to at most u clients. The goal is to minimize the total connection cost over all clients. Obtaining a constant approximation algorithm for this problem is a notorious open problem; most previous works gave constant approximations by either violating the capacity constraints or the cardinality constraint. Notably, all these algorithms are based on the natural LP relaxation for the problem. The LP relaxation has unbounded integrality gap, even when we are allowed to violate the capacity constraints or the cardinality constraint by a factor of 2 − ε. Our result is an exp(O(1/ε²))-approximation algorithm for the problem that violates the cardinality constraint by a factor of 1 + ε. That is, we find a solution that opens at most (1 + ε)k facilities whose cost is at most exp(O(1/ε²)) times the optimum solution.
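Once a set S of open facilities is fixed, the inner assignment problem in this objective is solvable exactly: duplicate each open facility u times and run a min-cost bipartite matching. A sketch with scipy (assumes |C| ≤ u·|S| so a full assignment exists; names are ours):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def capacitated_cost(dist, S, u):
    # dist: |C| x |F| client-to-facility distances; S: open facilities.
    # Each open facility is duplicated u times (its capacity), so the
    # capacity-respecting assignment is a min-cost bipartite matching.
    cols = [j for j in sorted(S) for _ in range(u)]
    cost = dist[:, cols]          # assumes len(dist) <= u * len(S)
    rows, chosen = linear_sum_assignment(cost)
    return float(cost[rows, chosen].sum())
```

The hard part of the problem, of course, is choosing S under the cardinality bound k; the abstract's point is that the natural LP guiding that choice has unbounded integrality gap.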