Results 1–10 of 23
Low-Noise Density Clustering
Abstract

Cited by 18 (7 self)
We study density-based clustering under low-noise conditions. Our framework allows for sharply defined clusters, such as clusters on lower-dimensional manifolds. We show that accurate clustering is possible even in high dimensions. We propose two data-based methods for choosing the bandwidth, and we study the stability properties of density clusters. We show that a simple graph-based algorithm, known as the “friends-of-friends” algorithm, successfully approximates the high-density clusters.
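The friends-of-friends step admits a compact sketch (a hypothetical implementation under assumed parameter names, not the paper's code): keep the sample points whose kernel density estimate exceeds a level, link every pair of kept points within distance eps, and report connected components.

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def friends_of_friends(X, eps, h, level):
    """Connected components of the eps-linkage graph on the points whose
    Gaussian KDE value exceeds `level` (illustrative sketch)."""
    n, d = X.shape
    sq = cdist(X, X, "sqeuclidean")
    # Gaussian kernel density estimate at each sample point
    dens = np.exp(-sq / (2 * h**2)).sum(axis=1) / (n * (np.sqrt(2 * np.pi) * h)**d)
    keep = np.where(dens > level)[0]
    # friends-of-friends: edge between kept points closer than eps
    adj = csr_matrix(sq[np.ix_(keep, keep)] <= eps**2)
    _, labels = connected_components(adj, directed=False)
    return keep, labels

# two well-separated blobs should yield two high-density clusters
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (50, 2)), rng.normal(3, 0.1, (50, 2))])
keep, labels = friends_of_friends(X, eps=0.5, h=0.3, level=0.01)
print(len(set(labels)))
```

The graph-based step avoids computing the level set itself: only pairwise distances among the retained points are needed.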
Stability of Density-Based Clustering
Abstract

Cited by 16 (4 self)
High-density clusters can be characterized by the connected components of a level set L(λ) = {x : p(x) > λ} of the underlying probability density function p generating the data, at some appropriate level λ ≥ 0. The complete hierarchical clustering can be characterized by a cluster tree T = ⋃_λ L(λ). In this paper, we study the behavior of a density level set estimate L̂(λ) and cluster tree estimate T̂ based on a kernel density estimator with kernel bandwidth h. We define two notions of instability to measure the variability of L̂(λ) and T̂ as a function of h, and investigate the theoretical properties of these instability measures.
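The paper defines its instability measures precisely; a loose illustration of the underlying idea is to compute the plug-in level set estimate at two nearby bandwidths and measure how much membership flips (the fraction-of-flips proxy below is my own stand-in, not the paper's definition):

```python
import numpy as np

def kde(X, h):
    """Gaussian kernel density estimate of a 1-d sample, at the sample points."""
    diff = X[:, None] - X[None, :]
    return np.exp(-diff**2 / (2 * h**2)).mean(axis=1) / (np.sqrt(2 * np.pi) * h)

def level_set(X, h, lam):
    """Plug-in level set estimate: indices of sample points where the KDE > lam."""
    return set(np.where(kde(X, h) > lam)[0])

rng = np.random.default_rng(1)
X = np.concatenate([rng.normal(-2, 0.5, 200), rng.normal(2, 0.5, 200)])

# crude instability proxy: fraction of sample points whose level-set
# membership flips between two nearby bandwidths
a, b = level_set(X, 0.20, 0.1), level_set(X, 0.25, 0.1)
instability = len(a ^ b) / len(X)
print(instability)
```

Scanning such a quantity over a grid of bandwidths, and choosing h where it is small, is one way a stability criterion can drive bandwidth selection.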
Clustering via Nonparametric Density Estimation
 Statistics and Computing
, 2007
Abstract

Cited by 12 (1 self)
The R package pdfCluster performs cluster analysis based on a nonparametric estimate of the density of the observed variables. After summarizing the main aspects of the methodology, we describe the features and usage of the package, and finally illustrate its use with the aid of two datasets.
Pruning Nearest Neighbor Cluster Trees
Abstract

Cited by 12 (3 self)
Nearest-neighbor (k-NN) graphs are widely used in machine learning and data mining applications, and our aim is to better understand what they reveal about the cluster structure of the unknown underlying distribution of points. Moreover, is it possible to identify spurious structures that might arise due to sampling variability? Our first contribution is a statistical analysis that reveals how certain subgraphs of a k-NN graph form a consistent estimator of the cluster tree of the underlying distribution of points. Our second, and perhaps most important, contribution is the following finite-sample guarantee. We carefully work out the trade-off between aggressive and conservative pruning, and are able to guarantee the removal of all spurious cluster structures at all levels of the tree while at the same time guaranteeing the recovery of salient clusters. This is the first such finite-sample result in the context of clustering.
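The subgraph idea above can be sketched as follows: estimate a k-NN density at each point, keep points above a level, and take connected components of the k-NN graph restricted to the kept points. This is an illustrative reading, not the paper's exact estimator or pruning rule:

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def knn_level_clusters(X, k, lam):
    """Connected components of the k-NN graph restricted to points whose
    k-NN density estimate exceeds lam (illustrative sketch)."""
    n, d = X.shape
    dist, idx = cKDTree(X).query(X, k + 1)   # first neighbor is the point itself
    dens = k / (n * dist[:, -1]**d)          # unnormalized k-NN density estimate
    keep = np.where(dens > lam)[0]
    pos = np.full(n, -1)
    pos[keep] = np.arange(keep.size)
    rows, cols = [], []
    for i in keep:                            # edges of the restricted k-NN graph
        for j in idx[i, 1:]:
            if pos[j] >= 0:
                rows.append(pos[i]); cols.append(pos[j])
    adj = csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(keep.size,) * 2)
    _, labels = connected_components(adj, directed=False)
    return keep, labels

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (200, 2)), rng.normal(6, 0.5, (200, 2))])
keep, labels = knn_level_clusters(X, k=10, lam=0.05)
print(len(set(labels)))
```

Sweeping lam from high to low traces out a cluster tree estimate; the paper's pruning then decides which of the resulting components are spurious.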
Clusters and water flows: a novel approach to modal clustering through Morse theory
, 2014
Abstract

Cited by 5 (2 self)
The problem of finding groups in data (cluster analysis) has been extensively studied by researchers from the fields of Statistics and Computer Science, among others. However, despite its popularity, it is widely recognized that the investigation of some theoretical aspects of clustering has been relatively sparse. One of the main reasons for this lack of theoretical results is surely the fact that, unlike the situation with other statistical problems such as regression or classification, for some of the cluster methodologies it is quite difficult to specify a population goal to which data-based clustering algorithms should try to get close. This paper aims to provide some insight into the theoretical foundations of the usual nonparametric approach to clustering, which understands clusters as regions of high density, by presenting an explicit formulation for the ideal population clustering.
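The Morse-theoretic population clustering assigns each point to the domain of attraction of a density mode (the "water flows" picture). A standard finite-sample stand-in is the mean-shift iteration; the sketch below is a hedged illustration, not the paper's construction, and the tolerance for grouping converged iterates is an arbitrary choice:

```python
import numpy as np

def mean_shift_labels(X, h, steps=100, tol=0.1):
    """Assign each point the KDE mode its mean-shift iterates converge to,
    grouping end points that land within `tol` of each other."""
    Y = X.copy()
    for _ in range(steps):
        # Gaussian weights of all sample points for each current iterate
        w = np.exp(-((Y[:, None, :] - X[None, :, :])**2).sum(-1) / (2 * h**2))
        Y = (w[:, :, None] * X[None, :, :]).sum(1) / w.sum(1, keepdims=True)
    labels, modes = np.empty(len(Y), dtype=int), []
    for i, y in enumerate(Y):
        for m, mode in enumerate(modes):
            if np.linalg.norm(y - mode) < tol:
                labels[i] = m
                break
        else:
            labels[i] = len(modes)
            modes.append(y)
    return labels

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-2, 0.3, (60, 2)), rng.normal(2, 0.3, (60, 2))])
labels = mean_shift_labels(X, h=0.5)
print(len(set(labels)))
```

Each iterate ascends the kernel density estimate, so the groups approximate the basins of attraction of its modes.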
Multiparameter Hierarchical Clustering Methods
Abstract

Cited by 3 (0 self)
We propose an extension of hierarchical clustering methods, called multiparameter hierarchical clustering methods, which are designed to exhibit sensitivity to density while retaining desirable theoretical properties. The input of the method we propose is a triple (X, d, f), where (X, d) is a finite metric space and f : X → R is a function defined on the data X, which could be a density estimate or could represent some other type of information. The output of our method is more general than dendrograms in that we track two parameters: the usual scale parameter and a parameter related to the function f. Our construction is motivated by the methods of persistent topology [6], the Reeb graph, and cluster trees [16]. We present both a characterization and a stability theorem. Key words: hierarchical clustering, single linkage, persistent topology.
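One concrete reading of the two-parameter output: for each pair (r, t), take connected components of the r-neighborhood graph on the sublevel-restricted points where f ≥ t. The sketch below is illustrative (Euclidean d, my own parameter names), not the paper's formal construction:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def two_parameter_clusters(X, f, r, t):
    """Clusters at the parameter pair (r, t): connected components of the
    r-neighborhood graph on the points where f >= t."""
    keep = np.where(f >= t)[0]
    # pairwise Euclidean distances among the retained points
    D = np.linalg.norm(X[keep][:, None, :] - X[keep][None, :, :], axis=-1)
    _, labels = connected_components(csr_matrix(D <= r), directed=False)
    return keep, labels

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
f = np.array([1.0, 1.0, 1.0, 0.2])   # e.g. a density estimate at each point
keep, labels = two_parameter_clusters(X, f, r=0.5, t=0.5)
print(keep, labels)
```

Varying r alone recovers ordinary single linkage; varying t as well makes the output density-sensitive, which is the point of the extension.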
Beyond Hartigan consistency: merge distortion metric for hierarchical clustering.
 In Proceedings of the 28th Conference on Learning Theory,
, 2015
Abstract

Cited by 1 (0 self)
Hierarchical clustering is a popular method for analyzing data which associates a tree to a dataset. Hartigan consistency has been used extensively as a framework to analyze such clustering algorithms from a statistical point of view. Still, as we show in this paper, a tree which is Hartigan consistent with a given density can look very different from the correct limit tree. Specifically, Hartigan consistency permits two types of undesirable configurations, which we term oversegmentation and improper nesting. Moreover, Hartigan consistency is a limit property and does not directly quantify the difference between trees. In this paper we identify two limit properties, separation and minimality, which address both oversegmentation and improper nesting and together imply (but are not implied by) Hartigan consistency. We proceed to introduce a merge distortion metric between hierarchical clusterings and show that convergence in our distance implies both separation and minimality. We also prove that uniform separation and minimality imply convergence in the merge distortion metric. Furthermore, we show that our merge distortion metric is stable under perturbations of the density. Finally, we demonstrate the applicability of these concepts by proving convergence results for two clustering algorithms. First, we show convergence (and hence separation and minimality) of the recent robust single linkage algorithm of
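The merge height of two points is the highest level λ at which they still lie in the same cluster of the level set L(λ); the merge distortion metric compares two trees through the sup difference of their merge heights. A discrete illustration on a graph with vertex density values (a sketch of the idea, not the paper's population-level definition):

```python
import numpy as np

def merge_heights(dens, edges):
    """m[i, j] = highest level lam at which i and j belong to the same
    connected component of the graph restricted to {v : dens[v] >= lam}.
    Vertices are activated in order of decreasing density; union-find merges."""
    n = len(dens)
    parent = list(range(n))
    members = {i: [i] for i in range(n)}
    adj = {i: [] for i in range(n)}
    for i, j in edges:
        adj[i].append(j)
        adj[j].append(i)

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    m = np.zeros((n, n))
    np.fill_diagonal(m, dens)
    active = [False] * n
    for v in sorted(range(n), key=lambda i: -dens[i]):
        active[v] = True
        for u in adj[v]:
            if active[u]:
                ru, rv = find(u), find(v)
                if ru != rv:
                    for a in members[ru]:        # pairs straddling the two
                        for b in members[rv]:    # components merge at dens[v]
                            m[a, b] = m[b, a] = dens[v]
                    parent[ru] = rv
                    members[rv] += members.pop(ru)
    return m

# path graph 0-1-2-3-4 with a density dip at vertex 1
dens = [1.0, 0.2, 0.9, 0.8, 0.3]
m = merge_heights(dens, [(0, 1), (1, 2), (2, 3), (3, 4)])
print(m[0, 2], m[2, 3])  # -> 0.2 0.8
```

Given two such matrices m1 and m2 on the same points, the merge distortion is np.abs(m1 - m2).max(): a single number that directly quantifies how far one tree is from another, which is exactly what Hartigan consistency alone does not provide.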
Confidence Regions for Level Sets
, 2012
Abstract

Cited by 1 (0 self)
This paper discusses a universal approach to the construction of confidence regions for level sets {x : h(x) ≥ 0} ⊂ R^d of a function h of interest. The proposed construction is based on a plug-in estimate of the level sets using an appropriate estimate ĥn of h. The approach provides finite-sample upper and lower confidence limits. This leads to generic conditions under which the constructed confidence regions achieve a prescribed coverage level asymptotically. The construction requires an estimate of quantiles of the distribution of sup_{x ∈ ∆n} |ĥn(x) − h(x)| for appropriate sets ∆n ⊂ R^d. In contrast to related work from the literature, the existence of a weak limit for an appropriately normalized process {ĥn(x), x ∈ D} is not required. This adds significantly to the challenge of deriving asymptotic results for the corresponding coverage level. Our approach is exemplified in the case of a density level set utilizing a kernel density estimator and a bootstrap procedure.
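The exemplified case (KDE plus bootstrap) can be sketched crudely: bootstrap a quantile q of the sup-deviation of the KDE, then sandwich the density level set {p > λ} between an inner region {ĥn > λ + q} and an outer region {ĥn > λ − q}. The paper restricts the sup to appropriate sets ∆n; the sketch below takes it over a whole grid and is only an illustration of the recipe:

```python
import numpy as np

def kde_grid(X, grid, h):
    """Gaussian KDE of the 1-d sample X, evaluated on a grid."""
    return (np.exp(-(grid[:, None] - X[None, :])**2 / (2 * h**2)).mean(1)
            / (np.sqrt(2 * np.pi) * h))

rng = np.random.default_rng(3)
X = rng.normal(0, 1, 500)
grid = np.linspace(-4, 4, 200)
h, lam, B = 0.3, 0.1, 200

f_hat = kde_grid(X, grid, h)
# bootstrap the sup-deviation of the KDE over the grid
sups = np.empty(B)
for b in range(B):
    Xb = rng.choice(X, size=X.size, replace=True)
    sups[b] = np.abs(kde_grid(Xb, grid, h) - f_hat).max()
q = np.quantile(sups, 0.95)

inner = grid[f_hat - lam > q]     # points confidently inside the level set
outer = grid[f_hat - lam > -q]    # points that cannot be ruled out
print(inner.size, outer.size)
```

The pair (inner, outer) is the confidence region: the true level set is claimed to contain `inner` and to be contained in `outer` with (approximately) the prescribed coverage.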
Consistency and rates for clustering with DBSCAN
 Journal of Machine Learning Research, Proceedings Track, 22:1090–1098
, 2012
Abstract

Cited by 1 (0 self)
We propose a simple and efficient modification of the popular DBSCAN clustering algorithm. This modification is able to detect the most interesting vertical threshold level in an automated, data-driven way. We establish both consistency and optimal learning rates for this modification.
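The paper's automated selection rule is not reproduced here; as a hedged illustration of the idea of scanning threshold levels with off-the-shelf DBSCAN (scikit-learn), the "most persistent cluster count" heuristic below is my own stand-in:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 0.2, (100, 2)), rng.normal(3, 0.2, (100, 2))])

# scan a grid of eps values (each corresponds to a density threshold level)
eps_grid = np.linspace(0.05, 1.0, 20)
counts = []
for eps in eps_grid:
    labels = DBSCAN(eps=eps, min_samples=5).fit_predict(X)
    # DBSCAN marks noise points with label -1; exclude them from the count
    counts.append(len(set(labels)) - (1 if -1 in labels else 0))

# most frequent nontrivial cluster count across the scan
values, freq = np.unique([c for c in counts if c > 0], return_counts=True)
best = values[freq.argmax()]
print(best)
```

Cluster counts that persist over a wide range of eps are the ones a data-driven threshold rule should recover; here the two-blob structure dominates the scan.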