## On Clustering Validation Techniques (2001)

Venue: Journal of Intelligent Information Systems

Citations: 191 (2 self)

### BibTeX

@ARTICLE{Halkidi01onclustering,

author = {Maria Halkidi and Yannis Batistakis and Michalis Vazirgiannis},

title = {On Clustering Validation Techniques},

journal = {Journal of Intelligent Information Systems},

year = {2001},

volume = {17},

pages = {107--145}

}

### Abstract

Cluster analysis aims at identifying groups of similar objects and therefore helps to discover the distribution of patterns and interesting correlations in large data sets. It has been the subject of wide research since it arises in many application domains in engineering, business and the social sciences. In recent years especially, the availability of huge transactional and experimental data sets and the emerging requirements of data mining have created a need for clustering algorithms that scale and can be applied in diverse domains.

### Citations

2061 | Data Mining: Concepts and Techniques
- Han, Kamber
- 2001
Citation Context ...fy the cluster in which he/she can be classified and based on this decision his/her medication can be made. More specifically, some typical applications of the clustering are in the following fields (Han & Kamber, 2001): • Business. In business, clustering may help marketers discover significant groups in their customers’ database and characterize them based on purchasing patterns. • Biology. In biology, it can be ... |

1995 | Some methods for classification and analysis of multivariate observations
- MacQueen
- 1967
Citation Context ...has been proposed and is available in the literature. Some representative algorithms of the above categories follow. 2.1 Partitional Algorithms In this category, K-Means is a commonly used algorithm (MacQueen, 1967). The aim of K-Means clustering is the optimisation of an objective function that is described by the equation $E = \sum_{i=1}^{c} \sum_{x \in C_i} d(x, m_i)$. In the above equation, $m_i$ is the center of cluster $C_i$, wh... |
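The objective function quoted in this context can be evaluated directly for a given partition. The following is a minimal sketch, not the authors' implementation: it assumes Euclidean distance for d (the snippet is cut off before d is defined) and takes each center to be the arithmetic mean of its cluster. The names `kmeans_objective` and `centroid` are hypothetical.

```python
import math


def centroid(points):
    """Arithmetic mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))


def kmeans_objective(clusters, centers, dist=math.dist):
    """E = sum over clusters C_i of sum over x in C_i of d(x, m_i).

    `dist` defaults to Euclidean distance, which is an assumption here.
    """
    return sum(
        dist(x, m)
        for points, m in zip(clusters, centers)
        for x in points
    )


# Two small hand-made clusters in the plane (illustrative data only).
clusters = [
    [(0.0, 0.0), (0.0, 2.0)],
    [(10.0, 0.0), (10.0, 4.0)],
]
centers = [centroid(c) for c in clusters]
E = kmeans_objective(clusters, centers)
print(centers, E)
```

K-Means itself alternates between reassigning points to the nearest center and recomputing centers; this sketch only scores one fixed partition.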

1179 | A Density-Based Algorithm for Discovering Clusters
- Ester, Kriegel, et al.
- 1996
Citation Context ...hms Density based algorithms typically regard clusters as dense regions of objects in the data space that are separated by regions of low density. A widely known algorithm of this category is DBSCAN (Ester, et al., 1996). The key idea in DBSCAN is that for each point in a cluster, the neighbourhood of a given radius has to contain at least a minimum number of points. DBSCAN can handle noise (outliers) and discover c... |
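The neighbourhood condition described in this context can be illustrated with a short check. This is a sketch of the density test only, not the full DBSCAN algorithm (cluster expansion and border-point handling are omitted). The parameter names `eps` and `min_pts`, the use of Euclidean distance, and counting the point itself as part of its own neighbourhood are all assumptions.

```python
import math


def region_query(points, p, eps):
    """All points within distance eps of p (p itself included if present)."""
    return [q for q in points if math.dist(p, q) <= eps]


def is_core_point(points, p, eps, min_pts):
    """True if the eps-neighbourhood of p holds at least min_pts points."""
    return len(region_query(points, p, eps)) >= min_pts


# Four points packed in a unit square plus one far-away outlier.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (8.0, 8.0)]

dense = is_core_point(points, (0.0, 0.0), eps=1.5, min_pts=4)
isolated = is_core_point(points, (8.0, 8.0), eps=1.5, min_pts=4)
print(dense, isolated)
```

The linear scan in `region_query` keeps the sketch short; a spatial index would normally replace it on large data sets.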

575 | CURE: An Efficient Clustering Algorithm for Large Databases
- Guha, Rastogi, et al.
- 1998
Citation Context ...erlying data. Clustering problem is about partitioning a given data set into groups (clusters) such that the data points in a cluster are more similar to each other than points in different clusters (Guha, et al., 1998). For example, consider a retail database records containing items purchased by customers. A clustering procedure could group the customers in such a way that customers with similar buying patterns a... |

476 | Pattern Recognition
- Theodoridis, Koutroumbas
- 1999
Citation Context ...different names in different contexts, such as unsupervised learning (in pattern recognition), numerical taxonomy (in biology, ecology), typology (in social sciences) and partition (in graph theory) (Theodoridis & Koutroumbas, 1999). [Figure 1. Steps of Clustering Process. Stages: Feature Selection, Clustering Algorithm Selection, Validation of Results, Interpretation; labels: Data, Data for process, Algorithm results, Final Clusters, Knowledge.] In the c... |

351 | ROCK: A Robust Clustering Algorithm for Categorical Attributes
- Guha, Rastogi, et al.
- 1999
Citation Context ...For each of above categories there is a wealth of subtypes and different algorithms for finding the clusters. Thus, according to the type of variables allowed in the data set can be categorized into (Guha, et al., 1999; Huang, et al., 1997; Rezaee, et al., 1998): • Statistical, which are based on statistical analysis concepts. They use similarity measures to partition objects and they are limited to numeric data. •... |

327 | A cluster separation measure
- Davies, Bouldin
- 1979
Citation Context ... the clusters $C_i$ and $C_j$ is defined based on a measure of dispersion of a cluster $C_i$ and a dissimilarity measure between two clusters $d_{ij}$. The $R_{ij}$ index is defined to satisfy the following conditions (Davies & Bouldin, 1979): 1. $R_{ij} \ge 0$ 2. $R_{ij} = R_{ji}$ 3. if $s_i = 0$ and $s_j = 0$ then $R_{ij} = 0$ 4. if $s_j > s_k$ and $d_{ij} = d_{ik}$ then $R_{ij} > R_{ik}$ 5. if $s_j = s_k$ and $d_{ij} < d_{ik}$ then $R_{ij} > R_{ik}$. These conditions state that $R_{ij}$ is nonnegative an... |
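The five conditions listed in this context can be checked against a concrete similarity measure. The sketch below assumes the common choice R_ij = (s_i + s_j) / d_ij and the usual Davies-Bouldin index built from it (the average, over clusters, of the largest R_ij); neither formula appears in the snippet above, so both are assumptions here. The function names and sample values are hypothetical.

```python
def similarity(s_i, s_j, d_ij):
    """Assumed similarity measure R_ij = (s_i + s_j) / d_ij.

    s_i, s_j are cluster dispersions; d_ij is the dissimilarity between
    the two clusters and must be positive.
    """
    if d_ij <= 0:
        raise ValueError("d_ij must be positive")
    return (s_i + s_j) / d_ij


def davies_bouldin(dispersions, distances):
    """Average over clusters i of max over j != i of R_ij (assumed form)."""
    n = len(dispersions)
    worst = [
        max(
            similarity(dispersions[i], dispersions[j], distances[i][j])
            for j in range(n)
            if j != i
        )
        for i in range(n)
    ]
    return sum(worst) / n


# Illustrative dispersions and a symmetric inter-cluster distance matrix.
s = [1.0, 2.0, 1.0]
d = [
    [0.0, 4.0, 8.0],
    [4.0, 0.0, 4.0],
    [8.0, 4.0, 0.0],
]
db = davies_bouldin(s, d)
print(db)
```

Lower values of the index correspond to clusters that are compact relative to their separation, which is why the largest R_ij is taken for each cluster.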

318 | An examination of procedures for determining the number of clusters in a data set - Milligan, Cooper - 1985 |

241 | Unsupervised Optimal Fuzzy Clustering
- Gath, Geva
- 1989
Citation Context ...rong decisions. The problems of deciding the number of clusters better fitting a data set as well as the evaluation of the clustering results has been subject of several research efforts (Dave, 1996; Gath & Geva, 1989; Rezaee, et al., 1998; Smyth, 1996; Theodoridis & Koutroumbas, 1999; Xie & Beni, 1991). In the sequel, we discuss the fundamental concepts of clustering validity and we present the most important crit... |

215 | An Efficient Approach to Clustering in Large Multimedia Databases with Noise
- Hinneburg, Keim
- 1998
Citation Context ...ering only in the neighbourhood of this object and thus efficient algorithms based on DBSCAN can be given for incremental insertions and deletions to an existing clustering (Ester, et al., 1998). In (Hinneburg & Keim, 1998) another density-based clustering algorithm, DENCLUE, is proposed. This algorithm introduces a new approach to cluster large multimedia databases. The basic idea of this approach is to model the overa... |

197 | A validity measure for fuzzy clustering
- Xie, Beni
- 1991
Citation Context ...set as well as the evaluation of the clustering results has been subject of several research efforts (Dave, 1996; Gath & Geva, 1989; Rezaee, et al., 1998; Smyth, 1996; Theodoridis & Koutroumbas, 1999; Xie & Beni, 1991). In the sequel, we discuss the fundamental concepts of clustering validity and we present the most important criteria in the context of clustering validity assessment. 4.2 Fundamental concepts of cl... |

177 | Applied multivariate techniques
- Sharma
- 1996
Citation Context ...e defined under certain assumptions and parameters. A number of validity indices have been defined and proposed in literature for each of above approaches (Halkidi, et al., 2000; Rezaee, et al., 1998; Sharma, 1996; Theodoridis & Koutroumbas, 1999; Xie & Beni, 1991). Figure 3. Confidence interval for (a) two-tailed index, (b) right-tailed index, (c) left-tailed index, where $q_p^0$ is the ρ proportion of q under h... |

128 | Well Separated Clusters and Optimal Fuzzy Partitions
- Dunn
- 1974
Citation Context ...ion of the number of clusters that underlie the data. We note, that for nc =1 and nc=N the index is not defined. Dunn and Dunn-like indices. A cluster validity index for crisp clustering proposed in (Dunn, 1974), attempts to identify “compact and well separated clusters”. The index is defined by equation (8) for a specific number of clusters: $D_{nc} = \min_{i=1,\ldots,nc} \{ \min_{j=i+1,\ldots,nc} ( \ldots ) \}$ ... |

113 | Incremental clustering for mining in a data warehousing environment
- Ester, Kriegel, et al.
- 1998
Citation Context ...ast a minimum number of points. DBSCAN can handle noise (outliers) and discover clusters of arbitrary shape. Moreover, DBSCAN is used as the basis for an incremental clustering algorithm proposed in (Ester, et al., 1998). Due to its density-based nature, the insertion or deletion of an object affects the current clustering only in the neighbourhood of this object and thus efficient algorithms based on DBSCAN can be ... |

90 | A Fast Clustering Algorithm to Cluster Very Large Categorical Data Sets in Data Mining - Huang - 1997 |

65 | Clustering using Monte Carlo cross-validation
- Smyth
- 1996
Citation Context ...the number of clusters better fitting a data set as well as the evaluation of the clustering results has been subject of several research efforts (Dave, 1996; Gath & Geva, 1989; Rezaee, et al., 1998; Smyth, 1996; Theodoridis & Koutroumbas, 1999; Xie & Beni, 1991). In the sequel, we discuss the fundamental concepts of clustering validity and we present the most important criteria in the context of clustering v... |

38 | Quality Scheme Assessment in the Clustering Process - Halkidi, Vazirgiannis, et al. - 2000 |

38 | The effect of cluster size, dimensionality, and the number of clusters on recovery of true cluster structure
- Milligan, Soon, et al.
- 1983
Citation Context ...Bnc index are proposed. They are based on MST, RNG and GG concepts similarly to the cases of the Dunn-like indices. Other validity indices for crisp clustering have been proposed in (Dave, 1996) and (Milligan, et al., 1983). The implementation of most of these indices is very computationally expensive, especially when the number of clusters and number of objects in the data set grows very large (Xie & Beni, 1991). In (... |

38 | A new cluster validity index for the fuzzy c-mean
- Rezaee, Lelieveldt, et al.
- 1998
Citation Context ...es. Since clustering algorithms define clusters that are not known a priori, irrespective of the clustering methods, the final partition of data requires some kind of evaluation in most applications (Rezaee, et al., 1998). ♦ Interpretation of the results. In many cases, the experts in the application area have to integrate the clustering results with other experimental evidence and analysis in order to draw the right... |

27 | Cluster Validation Using Graph Theoretic Concepts
- Pal, Biswas
- 1997
Citation Context ...for its computation, ii) its sensitivity to the presence of noise in datasets, since noise is likely to increase the values of diam(c) (i.e., the denominator of equation (8)). Three indices are proposed in (Pal & Biswas, 1997) that are more robust to the presence of noise. They are widely known as Dunn-like indices since they are based on the Dunn index. Moreover, the three indices use for their definition the concepts of the... |

23 | Validating fuzzy partitions obtained through c-shells clustering
- Dave
- 1996
Citation Context ...leading to wrong decisions. The problems of deciding the number of clusters better fitting a data set as well as the evaluation of the clustering results has been subject of several research efforts (Dave, 1996; Gath & Geva, 1989; Rezaee, et al., 1998; Smyth, 1996; Theodoridis & Koutroumbas, 1999; Xie & Beni, 1991). In the sequel, we discuss the fundamental concepts of clustering validity and we present the ... |

13 | WaveCluster: A Multi-Resolution Clustering Approach for Very Large Spatial Databases
- Sheikholeslami, Chatterjee, et al.
- 1998
Citation Context ...ormation at different levels. Based on this structure STING enables the usage of clustering information to search for queries or the efficient assignment of a new object to the clusters. WaveCluster (Sheikholeslami, et al., 1998) is the latest grid-based algorithm proposed in literature. It is based on signal processing techniques (wavelet transformation) to convert the spatial data into frequency domain. More specifically, ... |

5 | FCM: Fuzzy C-Means Algorithm, Computers and Geoscience
- Bezdek, Ehrlich, et al.
- 1984
Citation Context ...f algorithms leads to clustering schemes that are compatible with everyday life experience as they handle the uncertainty of real data. The most important fuzzy clustering algorithm is Fuzzy C-Means (Bezdek, et al., 1984). • Crisp clustering, considers non-overlapping partitions meaning that a data point either belongs to a class or not. Most of the clustering algorithms result in crisp clusters, and thus can be cate... |

4 | Spatial Datasets: an "unofficial" collection. http://dias.cti.gr/~ytheod/research/datasets/spatial.html
- Theodoridis
- 1999
Citation Context ...tic two-dimensional data sets further referred to as DataSet1, DataSet2, DataSet3 and DataSet4 (see Figure 5a-d) and a real data set Real_Data1 (Figure 5e), representing a part of Greek road network (Theodoridis, 1999). Table 8 summarizes the results of the validity indices (RS, RMSSDT, DB, SD), for different clustering schemes of the above-mentioned data sets as resulting from a clustering algorithm. For our stud... |

3 | Origins of Information Science - unknown authors - 1993 |

3 | Efficient and Effective Clustering Methods for Spatial Data Mining - Ng, Han - 1994 |

1 | Gordon Linoff - Berry - 1996 |