| Sudipto Guha, Rajeev Rastogi, Kyuseok Shim: "CURE: A Clustering Algorithm for Large Databases". Proc. of the ACM SIGMOD Int'l Conf. Management of Data, 1998, pp. 73-84. |
....such as marketing and customer segmentation. Clustering typically groups data into sets in such a way that the intra-cluster similarity is maximized while the inter-cluster similarity is minimized. Many efficient clustering algorithms, such as ROCK [1], DBSCAN [3], BIRCH [4], C2P [2], CURE [5], CHAMELEON [6], WaveCluster [7], and CLIQUE [8], have been proposed by the database research community. Most previous work in clustering focuses on numerical data whose inherent geometric properties can be exploited naturally to define distance functions between data points. However, much of the ....
Sudipto Guha, Rajeev Rastogi, Kyuseok Shim: "CURE: A Clustering Algorithm for Large Databases". Proc. of the ACM SIGMOD Int'l Conf. Management of Data, 1998, pp. 73-84.
....clustering algorithm. 2 Related Work. Clustering has been extensively studied by researchers in psychology, statistics, biology and so on. Surveys of clustering algorithms can be found in [DH73, JD88]. More recently, clustering algorithms for mining large databases have been proposed in [NH94, ZRL96, EKSX96, GRS98]. Most of these, however, are variants of either partitional (e.g., [NH94]) or centroid-based hierarchical clustering (e.g., [ZRL96, GRS98]). As a result, as pointed out in Section 1.1, these algorithms are more suitable for clustering numeric data rather than data sets with categorical ....
.... of clustering algorithms can be found in [DH73, JD88]. More recently, clustering algorithms for mining large databases have been proposed in [NH94, ZRL96, EKSX96, GRS98]. Most of these, however, are variants of either partitional (e.g., [NH94]) or centroid-based hierarchical clustering (e.g., [ZRL96, GRS98]). As a result, as pointed out in Section 1.1, these algorithms are more suitable for clustering numeric data rather than data sets with categorical attributes. Recently, in [HKKM97], the authors address the problem of clustering related customer transactions in a market basket database. Frequent ....
Sudipto Guha, Rajeev Rastogi, and Kyuseok Shim. CURE: A clustering algorithm for large databases. In Proc. of the ACM SIGMOD Conference on Management of Data, May 1998.
....Furthermore, since the scattered points for w are chosen from the original scattered points for clusters u and v, they can be expected to be fairly well spread out. 3.3 Time and Space Complexity. The worst-case time complexity of our clustering algorithm can be shown to be $O(n^2 \log n)$. In [GRS97], we show that when the dimensionality of data points is small, the time complexity further reduces to $O(n^2)$. Since both the heap and the k-d tree require linear space, it follows that the space complexity of our algorithm is $O(n)$. 4 Enhancements for Large Data Sets. Most hierarchical ....
....Theorem 4.1: For a cluster $u$, if the sample size $s$ satisfies $$s \ge fN + \frac{N}{|u|}\log\frac{1}{\delta} + \frac{N}{|u|}\sqrt{\left(\log\frac{1}{\delta}\right)^{2} + 2f|u|\log\frac{1}{\delta}} \qquad (1)$$ then the probability that the sample contains fewer than $f|u|$ points belonging to cluster $u$ is less than $\delta$, $0 \le \delta \le 1$. Proof: See [GRS97]. Thus, based on the above equation, we conclude that for the sample to contain at least $f|u|$ points belonging to cluster $u$ (with high probability), we need the sample to contain more than a fraction $f$ of the total number of points, which seems intuitive. Also, suppose $u_{min}$ is the smallest ....
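The bound in Theorem 4.1 can be evaluated numerically to get a feel for how large a sample it asks for. The following is a minimal sketch, not code from the cited paper: the function name `min_sample_size` and its parameter names are hypothetical, and the logarithm is assumed to be the natural logarithm, which the excerpt does not state.

```python
import math


def min_sample_size(n_points, cluster_size, f, delta):
    """Right-hand side of the sample-size bound in Theorem 4.1.

    n_points     -- N, total number of points in the data set
    cluster_size -- |u|, number of points in cluster u
    f            -- fraction of the cluster the sample should contain
    delta        -- allowed failure probability, 0 < delta <= 1
    Assumption: log is the natural logarithm (not stated in the excerpt).
    """
    if not 0 < delta <= 1:
        raise ValueError("delta must lie in (0, 1]")
    if not 0 < cluster_size <= n_points:
        raise ValueError("cluster_size must lie in (0, n_points]")

    log_term = math.log(1.0 / delta)   # log(1/delta)
    ratio = n_points / cluster_size    # N / |u|
    return (
        f * n_points
        + ratio * log_term
        + ratio * math.sqrt(log_term ** 2 + 2 * f * cluster_size * log_term)
    )


if __name__ == "__main__":
    # Illustrative values only; they do not come from the paper.
    bound = min_sample_size(1_000_000, 10_000, f=0.1, delta=0.001)
    print(math.ceil(bound))
```

When `delta` is 1 the logarithmic terms vanish and the bound reduces to `f * N`. For any smaller `delta` it is strictly larger, which matches the remark that the sample must contain more than a fraction f of all points.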
Sudipto Guha, R. Rastogi, and K. Shim. CURE: A clustering algorithm for large databases. Technical report, Bell Laboratories, Murray Hill, 1997.