Results 1–5 of 5
Diversity Maximization via Composable Coresets
Abstract

Cited by 1 (0 self)
Given a set S of points in a metric space, and a diversity measure div(·) defined over subsets of S, the goal of the diversity maximization problem is to find a subset T ⊆ S of size k that maximizes div(T). Motivated by applications in massive data processing, we consider the composable coreset framework, in which a coreset for a diversity measure is called α-composable if, for any collection of sets and their corresponding coresets, the maximum diversity of the union of the coresets α-approximates the maximum diversity of the union of the sets. We present composable coresets with near-optimal approximation factors for several notions of diversity, including remote-clique, remote-cycle, and remote-tree. We also prove a general lower bound on the approximation factor of composable coresets for a large class of diversity maximization problems.
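The composability idea above lends itself to a short sketch. The toy pipeline below is our own illustration, not the paper's construction: each part's coreset is built by a greedy farthest-first traversal (in the style of Gonzalez's k-center heuristic), the diversity notion is remote-clique (sum of pairwise distances), and the final solve over the small union is brute force. All function names are hypothetical.

```python
import itertools

def farthest_first(points, k, dist):
    """Greedy farthest-first traversal (Gonzalez-style): start anywhere,
    then repeatedly add the point farthest from those already chosen."""
    chosen = [points[0]]
    while len(chosen) < k:
        chosen.append(max(points, key=lambda p: min(dist(p, c) for c in chosen)))
    return chosen

def remote_clique(subset, dist):
    """Remote-clique diversity: sum of pairwise distances in the subset."""
    return sum(dist(p, q) for p, q in itertools.combinations(subset, 2))

def diverse_subset_via_coresets(partitions, k, dist):
    """Composable-coreset pipeline: summarize each part independently
    with a size-k coreset, then solve (here: brute force) on the union."""
    union = [p for part in partitions for p in farthest_first(part, k, dist)]
    return max(itertools.combinations(union, k),
               key=lambda T: remote_clique(T, dist))
```

On large inputs the brute-force final step would be replaced by an approximation algorithm; the point is only that each part can be summarized independently and the summaries composed.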
The power of randomization: Distributed submodular maximization on massive datasets
In arXiv, 2015
Abstract

Cited by 1 (0 self)
A wide variety of problems in machine learning, including exemplar clustering, document summarization, and sensor placement, can be cast as constrained submodular maximization problems. Unfortunately, the resulting submodular optimization problems are often too large to be solved on a single machine. We consider a distributed, greedy algorithm that combines previous approaches with randomization. The result is an algorithm that is embarrassingly parallel and achieves provable, constant-factor, worst-case approximation guarantees. In our experiments, we demonstrate its efficiency on large problems with different kinds of constraints, with objective values always close to what is achievable in the centralized setting.
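The two-round structure described above (random partition, per-machine greedy, then a final greedy over the pooled local solutions) can be sketched as follows. This is an illustration of the general pattern, not the paper's exact algorithm; the example objective is set coverage, a monotone submodular function, and `distributed_greedy` simulates the machines sequentially.

```python
import random

def greedy(elements, k, f):
    """Standard greedy for a monotone submodular f under a cardinality-k
    constraint: repeatedly add the element with the largest marginal gain."""
    S = []
    for _ in range(min(k, len(elements))):
        best = max((e for e in elements if e not in S),
                   key=lambda e: f(S + [e]) - f(S), default=None)
        if best is None:
            break
        S.append(best)
    return S

def distributed_greedy(ground_set, k, f, machines, seed=0):
    """Two-round distributed greedy: randomly partition the data, run
    greedy per machine, then run greedy again on the pooled solutions
    and return the best candidate seen."""
    rng = random.Random(seed)
    parts = [[] for _ in range(machines)]
    for e in ground_set:
        parts[rng.randrange(machines)].append(e)   # random partition
    local_solutions = [greedy(p, k, f) for p in parts]
    pooled = [e for sol in local_solutions for e in sol]
    candidates = [greedy(pooled, k, f)] + local_solutions
    return max(candidates, key=f)
```

The random partition is what makes the worst-case analysis go through in this line of work; with an adversarial partition, per-machine greedy alone can be arbitrarily bad.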
Selecting Representative Objects Considering Coverage and Diversity
Abstract
We say that an object o attracts a user u if o is one of the top-k objects according to the preference function defined by u. Given a set of objects (e.g., restaurants) and a set of users, in this paper, we study the problem of computing a set of representative objects considering two criteria: coverage and diversity. The coverage of a set S of objects is the number of distinct users attracted by the objects in S. Although a set of objects with high coverage attracts a large number of users, it is possible that all of these users have quite similar preferences. Consequently, the set of objects may be attractive only for a specific class of users with similar preference functions, which may disappoint other users with widely different preferences. The diversity criterion addresses this issue by selecting a set S of objects such that the set of users attracted by each object in S is as different as possible from the sets of users attracted by the other objects in S. Existing work on representative objects considers only one of the coverage and diversity criteria. We are the first to consider both criteria, where the importance of each criterion can be controlled using a parameter. Our algorithm has two phases. In the first phase, we prune the objects that cannot be among the representative objects and compute the set of attracted users (also called the reverse top-k) for each of the remaining objects. In the second phase, the reverse top-k sets of these objects are used to compute the representative objects maximizing coverage and diversity. Since this problem is NP-hard, the second phase employs a greedy algorithm. For the sake of time and space efficiency, we adopt MinHash and KMV synopses to assist the set operations. We prove that the proposed greedy algorithm is ε-approximate. Our extensive experimental study on real and synthetic data sets demonstrates the effectiveness of our proposed techniques.
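The greedy second phase could look roughly like the sketch below. The scoring is our own simplification: coverage is counted exactly rather than approximated with MinHash/KMV synopses, diversity is proxied by pairwise Jaccard distance between the objects' attracted-user sets, and `lam` plays the role of the paper's importance parameter. Names and proxies are illustrative assumptions.

```python
def jaccard_distance(a, b):
    """1 - |a ∩ b| / |a ∪ b|; 0 for two empty sets."""
    union = a | b
    return 1 - len(a & b) / len(union) if union else 0.0

def score(S, attracted, lam):
    """lam weights coverage (distinct attracted users) against diversity
    (sum of pairwise Jaccard distances between attracted-user sets)."""
    cov = len(set().union(*(attracted[o] for o in S))) if S else 0
    div = sum(jaccard_distance(attracted[a], attracted[b])
              for i, a in enumerate(S) for b in S[i + 1:])
    return lam * cov + (1 - lam) * div

def greedy_representatives(attracted, k, lam=0.5):
    """Greedily add the object with the largest marginal score gain.
    `attracted` maps each object to its set of attracted users
    (the reverse top-k computed in phase one)."""
    S = []
    remaining = list(attracted)
    for _ in range(min(k, len(remaining))):
        best = max(remaining, key=lambda o: score(S + [o], attracted, lam))
        S.append(best)
        remaining.remove(best)
    return S
```

With `lam=1` this degenerates to pure maximum coverage; with `lam=0` it only spreads the attracted-user sets apart, which matches the trade-off the parameter is meant to control.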
MapReduce and Streaming Algorithms for Diversity Maximization in Metric Spaces of Bounded Doubling Dimension
Abstract
Given a dataset of points in a metric space and an integer k, a diversity maximization problem requires determining a subset of k points maximizing some diversity objective measure, e.g., the minimum or the average distance between two points in the subset. Diversity maximization is computationally hard, hence only approximate solutions can be hoped for. Although its applications are mainly in massive data analysis, most past research on diversity maximization has focused on the sequential setting. In this work we present space- and pass/round-efficient diversity maximization algorithms for the Streaming and MapReduce models and analyze their approximation guarantees for the relevant class of metric spaces of bounded doubling dimension. Like other approaches in the literature, our algorithms rely on the determination of high-quality coresets, i.e., (much) smaller subsets of the input which contain good approximations to the optimal solution for the whole input. For a variety of diversity objective functions, our algorithms attain an (α + ε)-approximation ratio, for any constant ε > 0, where α is the best approximation ratio achieved by a polynomial-time, linear-space sequential algorithm for the same diversity objective. This improves substantially over the approximation ratios attainable in Streaming and MapReduce by state-of-the-art algorithms for general metric spaces. We provide extensive experimental evidence of the effectiveness of our algorithms on both real-world and synthetic datasets, scaling up to over a billion points.
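Not the paper's algorithm, but the flavor of one-pass coreset construction this line of work builds on can be illustrated with a classic doubling-style streaming k-center sketch: maintain at most k well-separated representatives and double the separation threshold whenever one more would be needed. All names here are our own.

```python
def streaming_kcenter(stream, k, dist):
    """One-pass doubling sketch: keep at most k centers that are pairwise
    more than tau apart; when a (k+1)-th would be needed, double tau and
    merge centers that now fall within the new threshold."""
    it = iter(stream)
    centers = []
    for p in it:                       # seed with up to k+1 points
        centers.append(p)
        if len(centers) == k + 1:
            break
    if len(centers) <= k:
        return centers
    tau = min(dist(a, b) for i, a in enumerate(centers)
              for b in centers[i + 1:])   # initial scale

    def thin():
        nonlocal centers, tau
        while len(centers) > k:
            tau *= 2
            kept = []
            for c in centers:
                if all(dist(c, q) > tau for q in kept):
                    kept.append(c)
            centers = kept

    thin()
    for p in it:                       # single pass over the rest
        if all(dist(p, c) > tau for c in centers):
            centers.append(p)
            thin()
    return centers
```

The surviving centers form a small summary in which every stream point has a nearby representative, which is the role the coresets play in the algorithms described above.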
k-variates++: more pluses in the k-means++
, 2016
Abstract
 Add to MetaCart
k-means++ seeding has become a de facto standard for hard clustering algorithms. In this paper, our first contribution is a two-way generalisation of this seeding, k-variates++, that includes the sampling of general densities rather than just a discrete set of Dirac densities anchored at the point locations, and a generalisation of the well-known Arthur-Vassilvitskii (AV) approximation guarantee, in the form of a bias+variance approximation bound on the global optimum. This approximation exhibits a reduced dependency on the "noise" component with respect to the optimal potential, actually approaching the statistical lower bound. We show that k-variates++ reduces to efficient (biased seeding) clustering algorithms tailored to specific frameworks; these include distributed, streaming and online clustering, with direct approximation results for these algorithms. Finally, we present a novel application of k-variates++ to differential privacy. For the specific frameworks considered here, as well as for the differential privacy setting, there are little to no prior results on the direct application of k-means++ and its approximation bounds; state-of-the-art contenders appear to be significantly more complex and/or display less favorable (approximation) properties. We stress that our algorithms can still be run in cases where there is no closed-form solution for the population minimizer. We demonstrate the applicability of our analysis via experimental evaluation on several domains and settings, displaying competitive performance against the state of the art.
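For reference, the classic k-means++ (Arthur-Vassilvitskii) seeding that this paper generalises picks the first center uniformly at random, then samples each subsequent center with probability proportional to its squared distance to the nearest center already chosen. A minimal 1-D sketch (the real algorithm works in any dimension and is usually followed by Lloyd iterations):

```python
import random

def kmeans_pp_seed(points, k, rng=random):
    """k-means++ seeding on 1-D points: D^2-weighted sampling of centers."""
    centers = [rng.choice(points)]
    for _ in range(k - 1):
        # squared distance of each point to its nearest chosen center
        d2 = [min((x - c) ** 2 for c in centers) for x in points]
        r = rng.random() * sum(d2)
        acc = 0.0
        for x, w in zip(points, d2):   # roulette-wheel selection
            acc += w
            if acc >= r:
                centers.append(x)
                break
    return centers
```

Points already chosen have squared distance zero and are thus (almost surely) never re-sampled, which is what spreads the seeds out and yields the AV O(log k) expected approximation guarantee.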