Results 1 - 7 of 7
Generating a Diverse Set of High-Quality Clusterings
Abstract

Cited by 1 (0 self)
Abstract. We provide a new framework for generating multiple good-quality partitions (clusterings) of a single data set. Our approach decomposes this problem into two components: generating many high-quality partitions, and then grouping these partitions to obtain k representatives. The decomposition makes the approach extremely modular and allows us to optimize various criteria that control the choice of representative partitions.
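The two-stage decomposition described in this abstract can be sketched in a few lines. This is a hedged illustration, not the paper's actual method: it generates partitions via random-restart k-means on 1-D data, measures how much two partitions disagree by counting pairs of points grouped differently, and picks representatives with a simple greedy diversity heuristic. All function names here are hypothetical.

```python
import random

def kmeans_1d(points, k, seed):
    """Lloyd's algorithm on 1-D points; returns one partition as a label list."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(20):
        labels = [min(range(k), key=lambda j: abs(p - centers[j])) for p in points]
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centers[j] = sum(members) / len(members)
    return labels

def disagreement(a, b):
    """Fraction of point pairs the two partitions treat differently
    (together in one partition, apart in the other)."""
    n = len(a)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    diff = sum((a[i] == a[j]) != (b[i] == b[j]) for i, j in pairs)
    return diff / len(pairs)

# Stage 1: generate many candidate partitions from random restarts.
data = [1.0, 1.2, 0.9, 5.0, 5.3, 4.8, 9.1, 9.0]
partitions = [kmeans_1d(data, 3, seed=s) for s in range(10)]

# Stage 2: choose k representatives greedily, each new one maximizing
# its minimum disagreement with those already chosen (a diversity heuristic).
reps = [partitions[0]]
while len(reps) < 2:
    reps.append(max(partitions,
                    key=lambda p: min(disagreement(p, r) for r in reps)))
```

The separation of stages is what makes the framework modular: stage 2 can swap in any distance between partitions, and any criterion for choosing representatives, without touching stage 1.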
EP-MEANS: An Efficient Nonparametric Clustering of Empirical Probability Distributions
Abstract
ABSTRACT. Given a collection of m continuous-valued, one-dimensional empirical probability distributions {P1, . . . , Pm}, how can we cluster these distributions efficiently with a nonparametric approach? Such problems arise in many real-world settings where keeping the moments of the distribution is not appropriate, because either some of the moments are not defined or the distributions are heavy-tailed or bimodal. Examples include mining distributions of inter-arrival times and phone-call lengths. We present an efficient algorithm with a nonparametric model for clustering empirical, one-dimensional, continuous probability distributions. Our algorithm, called EP-MEANS, is based on the Earth Mover's Distance and k-means clustering. We illustrate the utility of EP-MEANS on various data sets and applications. In particular, we demonstrate that EP-MEANS effectively and efficiently clusters probability distributions of mixed and arbitrary shapes, recovering ground-truth clusters exactly in cases where existing methods perform at baseline accuracy. We also demonstrate that EP-MEANS outperforms moment-based classification techniques and discovers useful patterns in a variety of real-world applications.
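The combination of Earth Mover's Distance and k-means described above can be sketched as follows. This is an illustrative approximation, not the authors' implementation: each distribution is summarized by a quantile vector, 1-D EMD reduces to the average gap between inverse CDFs, and a Lloyd-style loop assigns and re-averages. The function names and the deterministic initialization are assumptions of this sketch.

```python
import random

def quantiles(samples, q=20):
    """Summarize a 1-D empirical distribution by q evenly spaced quantiles."""
    s = sorted(samples)
    return [s[min(len(s) - 1, int(i * len(s) / q))] for i in range(q)]

def emd_1d(qa, qb):
    """1-D Earth Mover's Distance between two quantile vectors:
    the average vertical gap between the two inverse CDFs."""
    return sum(abs(x - y) for x, y in zip(qa, qb)) / len(qa)

def ep_means_sketch(dists, k, iters=15):
    """k-means-style clustering of distributions under 1-D EMD.
    Centroids are coordinate-wise means of member quantile vectors."""
    qs = [quantiles(d) for d in dists]
    centers = [qs[i * len(qs) // k] for i in range(k)]  # spread-out init
    labels = [0] * len(qs)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: emd_1d(q, centers[j])) for q in qs]
        for j in range(k):
            members = [q for q, l in zip(qs, labels) if l == j]
            if members:
                centers[j] = [sum(col) / len(col) for col in zip(*members)]
    return labels

# Two well-separated groups of distributions: centered near 0 and near 10.
rng = random.Random(0)
low = [[rng.gauss(0, 1) for _ in range(200)] for _ in range(3)]
high = [[rng.gauss(10, 1) for _ in range(200)] for _ in range(3)]
labels = ep_means_sketch(low + high, k=2)
```

The quantile summary is what makes the approach nonparametric: no moments are computed, so heavy-tailed or bimodal distributions pose no special difficulty.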
Date Approved
, 2012
Abstract
The computing landscape is undergoing a major change, primarily enabled by ubiquitous wireless networks and the rapid increase in the use of mobile devices which access a web-based information infrastructure. It is expected that most intensive computing may either happen in servers housed in large datacenters (warehouse-scale computers), e.g., cloud computing and other web services, or in many-core high-performance computing (HPC) platforms in scientific labs. It is clear that the primary challenge to scaling such computing systems into the exascale realm is the efficient supply of large amounts of data to hundreds or thousands of compute cores, i.e., building an efficient memory system. Main memory systems are at an inflection point, due to the convergence of several major application and technology trends. Examples include the increasing importance of energy consumption, reduced access stream locality, increasing failure rates, limited pin counts, increasing heterogeneity and complexity, and the diminished importance of cost-per-bit. In light of these trends, the memory system requires a major overhaul. The key to architecting the next generation of memory systems is a combination of the prudent incorporation
Power to the Points: Validating Data Memberships in Clusterings
, 2013
Abstract
A clustering is an implicit assignment of labels to points, based on proximity to other points. It is these labels that are then used for downstream analysis (either focusing on individual clusters, or identifying representatives of clusters, and so on). Thus, in order to trust a clustering as a first step in exploratory data analysis, we must trust the labels assigned to individual data. Without supervision, how can we validate this assignment? In this paper, we present a method to attach affinity scores to the implicit labels of individual points in a clustering. The affinity scores capture the confidence level of the cluster that claims to "own" the point. This method is very general: it can be used with clusterings derived from Euclidean data, kernelized data, or even data derived from information spaces. It smoothly incorporates importance functions on clusters, allowing us to weight different clusters differently. It is also efficient: assigning an affinity score to a point depends only polynomially on the number of clusters and is independent of the number of points in the data. The dimensionality of the underlying space only appears in preprocessing. We demonstrate the value of our approach with an experimental study that illustrates the use of these scores in different data analysis tasks, as well as the efficiency and flexibility of the method. We also demonstrate useful visualizations of these scores; these might prove useful within an interactive analytics framework.
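The idea of a per-point affinity score can be illustrated with a much simpler proxy than the paper's construction. The sketch below (hypothetical names; not the authors' method) scores each point by comparing its distance to its own center against its distance to the nearest rival center, yielding a value near 1 for confidently owned points and near 0 for boundary points. Note that, as the abstract requires, the per-point cost depends only on the number of clusters, not on the number of points.

```python
def affinity_scores(points, centers, labels):
    """For each 1-D point, a score in [0, 1] measuring how confidently its
    assigned cluster 'owns' it: 1 means the point sits on its own center,
    0 means it is at least as close to a rival center.
    (Hypothetical distance-ratio proxy, not the paper's construction.)"""
    scores = []
    for p, l in zip(points, labels):
        d_own = abs(p - centers[l])
        d_rival = min(abs(p - c) for j, c in enumerate(centers) if j != l)
        scores.append(max(0.0, 1.0 - d_own / d_rival) if d_rival > 0 else 0.0)
    return scores

centers = [0.0, 10.0]
points = [0.1, 4.9, 9.8]   # two confident points and one boundary point
labels = [0, 0, 1]
scores = affinity_scores(points, centers, labels)
# The boundary point 4.9 receives a near-zero score; the others score near 1.
```

A downstream analysis could then discard or flag points whose score falls below a threshold before trusting cluster-level summaries.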