## Automatic Subspace Clustering of High Dimensional Data (2005)

Venue: Data Mining and Knowledge Discovery

Citations: 724 (12 self)

### Citations

4844 |
Pattern classification and scene analysis
- Duda, Hart
- 1973
Citation Context ...ks to identify homogeneous groups of objects based on the values of their attributes (dimensions) [24] [25]. Clustering techniques have been studied extensively in statistics [3], pattern recognition [11] [19], and machine learning [9] [31]. Recent work in the database community includes CLARANS [33], Focused CLARANS [14], BIRCH [45], and DBSCAN [13]. Current clustering techniques can be broadly class... |

3784 |
An Introduction to Statistical Pattern Recognition
- Fukunaga
- 1972
Citation Context ... identify homogeneous groups of objects based on the values of their attributes (dimensions) [24] [25]. Clustering techniques have been studied extensively in statistics [3], pattern recognition [11] [19], and machine learning [9] [31]. Recent work in the database community includes CLARANS [33], Focused CLARANS [14], BIRCH [45], and DBSCAN [13]. Current clustering techniques can be broadly classified ... |

2899 |
The Design and Analysis of Computer Algorithms
- Aho, Hopcroft, et al.
- 1974
Citation Context ...e in the same cluster. On the other hand, units corresponding to vertices in different components cannot be connected, and therefore cannot be in the same cluster. We use a depth-first search algorithm [2] to find the connected components of the graph. We start with some unit u in D, assign it the first cluster number, and find all the units it is connected to. Then, if there still are units in D that have n... |
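
The depth-first labeling described in this citation context can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes dense units are represented as tuples of interval indices and that two units are connected when they share a common face (differ by one index in exactly one dimension). The names `connected_components` and `grid_neighbors` are hypothetical.

```python
def grid_neighbors(unit):
    """Yield units that differ from `unit` by one interval in exactly one dimension.

    Assumption: adjacency means sharing a common face on a regular grid.
    """
    for d in range(len(unit)):
        for delta in (-1, 1):
            yield unit[:d] + (unit[d] + delta,) + unit[d + 1:]


def connected_components(units, neighbors=grid_neighbors):
    """Assign a cluster number to every unit using iterative depth-first search."""
    units = set(units)
    cluster_of = {}
    next_cluster = 0
    # Sorting only makes the cluster numbering deterministic.
    for start in sorted(units):
        if start in cluster_of:
            continue
        # Start a new cluster and explore everything reachable from `start`.
        cluster_of[start] = next_cluster
        stack = [start]
        while stack:
            u = stack.pop()
            for v in neighbors(u):
                if v in units and v not in cluster_of:
                    cluster_of[v] = next_cluster
                    stack.append(v)
        next_cluster += 1
    return cluster_of


# Hypothetical dense units in a 2-dimensional subspace.
dense_units = {(0, 0), (0, 1), (1, 1), (4, 4), (5, 4)}
labels = connected_components(dense_units)
```

Units that receive the same number belong to the same cluster; units in different components never share a number.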

2797 |
Algorithms for clustering data
- Jain, Dubes
- 1988
Citation Context ... clusters in large high dimensional datasets. 1 Introduction Clustering is a descriptive task that seeks to identify homogeneous groups of objects based on the values of their attributes (dimensions) [24] [25]. Clustering techniques have been studied extensively in statistics [3], pattern recognition [11] [19], and machine learning [9] [31]. Recent work in the database community includes CLARANS [33],... |

2229 |
Finding groups in data: an introduction to cluster analysis, volume 344
- Kaufman, Rousseeuw
- 2009
Citation Context ...ters in large high dimensional datasets. 1 Introduction Clustering is a descriptive task that seeks to identify homogeneous groups of objects based on the values of their attributes (dimensions) [24] [25]. Clustering techniques have been studied extensively in statistics [3], pattern recognition [11] [19], and machine learning [9] [31]. Recent work in the database community includes CLARANS [33], Focu... |

1786 | A density-based algorithm for discovering clusters in large spatial databases with noise
- Ester, Kriegel, et al.
- 1996
Citation Context ... extensively in statistics [3], pattern recognition [11] [19], and machine learning [9] [31]. Recent work in the database community includes CLARANS [33], Focused CLARANS [14], BIRCH [45], and DBSCAN [13]. Current clustering techniques can be broadly classified into two categories [24] [25]: partitional and hierarchical. Given a set of objects and a clustering criterion [39], partitional clustering obt... |

776 | A threshold of ln n for approximating set cover.
- Feige
- 1998
Citation Context ...r setting. For the general set cover problem, the best known algorithm for approximating the smallest set cover gives an approximation factor of ln n where n is the size of the universe being covered [16] [28]. This problem is similar to the problem of constructive solid geometry formulae in solid-modeling [44]. It is also related to the problem of covering marked boxes in a grid with rectangles in lo... |

722 | CURE: An efficient clustering algorithm for large databases - Guha, Rastogi, et al. - 1998 |

709 | Efficient and effective clustering methods for spatial data mining
- Ng, Han
- 1994
Citation Context ... [24] [25]. Clustering techniques have been studied extensively in statistics [3], pattern recognition [11] [19], and machine learning [9] [31]. Recent work in the database community includes CLARANS [33], Focused CLARANS [14], BIRCH [45], and DBSCAN [13]. Current clustering techniques can be broadly classified into two categories [24] [25]: partitional and hierarchical. Given a set of objects and a c... |

615 | Dynamic itemset counting and implication rules for market basket data
- Brin, Motwani, et al.
- 1997
Citation Context ...algorithm makes k passes over the database. It follows that the running time of our algorithm is O(c^k + m k) for a constant c. The number of database passes can be reduced by adapting ideas from [41] [8]. 3.1.2 Making the bottom-up algorithm faster While the procedure just described dramatically reduces the number of units that are tested for being dense, we still may have a computationally infeasibl... |

599 |
Stochastic complexity.
- Rissanen
- 1987
Citation Context ...pply the MDL (Minimal Description Length) principle. The basic idea underlying the MDL principle is to encode the input data under a given model and select the encoding that minimizes the code length [35]. Assume we have the subspaces S1, S2, ..., Sn. Our pruning technique first groups together the dense units that lie in the same subspace. Then, for each subspace, it computes the fraction of th... |

581 |
Fast discovery of association rules
- Agrawal, Mannila, et al.
- 1995
Citation Context ...hm that exploits the monotonicity of the clustering criterion with respect to dimensionality to prune the search space. This algorithm is similar to the Apriori algorithm for mining Association rules [1]. A somewhat similar bottom-up scheme was also used in [10] for determining modes in high dimensional histograms. Lemma 1 (Monotonicity): If a collection of points S is a cluster in a k-dimensional sp... |

576 | BIRCH: an efficient data clustering method for very large databases.
- Zhang, Ramakrishnan, et al.
- 1996
Citation Context ...have been studied extensively in statistics [3], pattern recognition [11] [19], and machine learning [9] [31]. Recent work in the database community includes CLARANS [33], Focused CLARANS [14], BIRCH [45], and DBSCAN [13]. Current clustering techniques can be broadly classified into two categories [24] [25]: partitional and hierarchical. Given a set of objects and a clustering criterion [39], partitio... |

567 |
Bayesian classification (AutoClass): Theory and results
- Cheeseman, Stutz
- 1996
Citation Context ...s of objects based on the values of their attributes (dimensions) [24] [25]. Clustering techniques have been studied extensively in statistics [3], pattern recognition [11] [19], and machine learning [9] [31]. Recent work in the database community includes CLARANS [33], Focused CLARANS [14], BIRCH [45], and DBSCAN [13]. Current clustering techniques can be broadly classified into two categories [24] ... |

470 | Sampling large databases for association rules
- Toivonen
- 1996
Citation Context ... The algorithm makes k passes over the database. It follows that the running time of our algorithm is O(c^k + m k) for a constant c. The number of database passes can be reduced by adapting ideas from [41] [8]. 3.1.2 Making the bottom-up algorithm faster While the procedure just described dramatically reduces the number of units that are tested for being dense, we still may have a computationally infea... |

457 | Efficiently mining long patterns from databases.
- Bayardo
- 1998
Citation Context ... finding dense units. If the user is only interested in clusters in the subspaces of highest dimensionality, we can use techniques based on recently proposed algorithms for discovering maximal itemsets [5] [26]. These techniques will allow CLIQUE to find dense units of high dimensionality without having to find all of their projections. Acknowledgment The code for CLIQUE builds on several components that R... |

444 | Mining quantitative association rules in large relational tables
- Srikant, Agrawal
- 1996
Citation Context ...to get closer to an optimal solution. The subspace identification problem is related to the problem of finding quantitative association rules that also identify interesting regions of various attributes [40] [32]. However, the techniques proposed are quite different. One can also imagine adapting a tree-classifier designed for data mining (e.g. [30] [37]) for subspace clustering. In the tree-growth phase, t... |

426 |
On the hardness of approximating minimization problems
- Lund, Yannakakis
- 1994
Citation Context ...ting. For the general set cover problem, the best known algorithm for approximating the smallest set cover gives an approximation factor of ln n where n is the size of the universe being covered [16] [28]. This problem is similar to the problem of constructive solid geometry formulae in solid-modeling [44]. It is also related to the problem of covering marked boxes in a grid with rectangles in logic m... |

338 |
Fast discovery of association rules
- Agrawal, Mannila, et al.
- 1996
Citation Context ...hm that exploits the monotonicity of the clustering criterion with respect to dimensionality to prune the search space. This algorithm is similar to the Apriori algorithm for mining Association rules [1]. A somewhat similar bottom-up scheme was also used in [10] for determining modes in high dimensional histograms. Lemma 1 (Monotonicity): If a collection of points S is a cluster in a k-dimensional sp... |

320 |
On the ratio of optimal integral and fractional covers
- Lovász
- 1975
Citation Context ...he procedure until the whole cluster is covered. For general set cover, the addition heuristic is known to give a cover within a factor ln n of the optimum where n is the number of units to be covered [27]. Thus it would appear that the addition heuristic, since its quality of approximation matches the negative results of [16] [28], would be the obvious choice. However, its implementation in our high d... |
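
The ln n guarantee quoted in this context is the classical bound for greedy set cover. The sketch below shows that generic greedy rule, under the assumption that the "addition heuristic" repeatedly adds the candidate covering the most still-uncovered units; the excerpt is truncated, so this may differ in detail from the authors' procedure. The function name `greedy_cover` and the sample data are hypothetical.

```python
def greedy_cover(universe, candidate_sets):
    """Pick candidates greedily until every element of `universe` is covered.

    At each step the candidate covering the most uncovered elements is added.
    """
    uncovered = set(universe)
    chosen = []
    while uncovered:
        best = max(candidate_sets, key=lambda s: len(uncovered & s))
        if not uncovered & best:
            raise ValueError("candidate sets do not cover the universe")
        chosen.append(best)
        uncovered -= best
    return chosen


# Hypothetical units 1..6 and four candidate regions, each given as the set of units it contains.
units = {1, 2, 3, 4, 5, 6}
regions = [
    frozenset({1, 2, 3, 4}),
    frozenset({3, 4, 5}),
    frozenset({5, 6}),
    frozenset({1, 6}),
]
cover = greedy_cover(units, regions)
```

Each iteration scans every candidate, which is cheap here but hints at why the quoted passage calls the heuristic's implementation problematic in a high dimensional setting.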

312 | SPRINT: A scalable parallel classifier for data mining
- Shafer, Agrawal, et al.
- 1996
Citation Context ...identify interesting regions of various attributes [40] [32]. However, the techniques proposed are quite different. One can also imagine adapting a tree-classifier designed for data mining (e.g. [30] [37]) for subspace clustering. In the tree-growth phase, the splitting criterion will have to be changed so that some clustering criterion (e.g. average cluster diameter) is optimized. In the tree-pruning ... |

305 |
Learning from observation: conceptual clustering
- Michalski, Stepp
- 1983
Citation Context ... objects based on the values of their attributes (dimensions) [24] [25]. Clustering techniques have been studied extensively in statistics [3], pattern recognition [11] [19], and machine learning [9] [31]. Recent work in the database community includes CLARANS [33], Focused CLARANS [14], BIRCH [45], and DBSCAN [13]. Current clustering techniques can be broadly classified into two categories [24] [25]: ... |

267 |
Pattern Classification and Scene Analysis
- Duda, Hart
- 1973
Citation Context ...ks to identify homogeneous groups of objects based on the values of their attributes (dimensions) [24] [25]. Clustering techniques have been studied extensively in statistics [3], pattern recognition [11] [19], and machine learning [9] [31]. Recent work in the database community includes CLARANS [33], Focused CLARANS [14], BIRCH [45], and DBSCAN [13]. Current clustering techniques can be broadly class... |

263 |
BIRCH: an efficient data clustering method for very large databases
- Zhang, Ramakrishnan, et al.
- 1996
Citation Context ...have been studied extensively in statistics [3], pattern recognition [11] [19], and machine learning [9] [31]. Recent work in the database community includes CLARANS [33], Focused CLARANS [14], BIRCH [45], and DBSCAN [13]. Current clustering techniques can be broadly classified into two categories [24] [25]: partitional and hierarchical. Given a set of objects and a clustering criterion [39], partition... |

240 | SLIQ: A fast scalable classifier for data mining
- Mehta, Agrawal, et al.
- 1996
Citation Context ...also identify interesting regions of various attributes [40] [32]. However, the techniques proposed are quite different. One can also imagine adapting a tree-classifier designed for data mining (e.g. [30] [37]) for subspace clustering. In the tree-growth phase, the splitting criterion will have to be changed so that some clustering criterion (e.g. average cluster diameter) is optimized. In the tree-pru... |

203 | A cost model for nearest neighbor search in high-dimensional data space
- Keim, Berchtold, et al.
- 1997
Citation Context ...omain for each attribute can be large. It is not meaningful to look for clusters in such a high dimensional space as the average density of points anywhere in the data space is likely to be quite low [6]. Compounding this problem, many dimensions or combinations of dimensions can have noise or values that are uniformly distributed. Therefore, distance functions that use all the dimensions of the data... |

194 | Finding generalized projected clusters in high dimensional spaces - Aggarwal, Yu |

191 | Almost optimal set covers in finite VC-dimension - Brönnimann, Goodrich - 1995 |

128 | Pincer-search: a new algorithm for discovering the maximum frequent set
- Lin, Kedem
- 1998
Citation Context ...ng dense units. If the user is only interested in clusters in the subspaces of highest dimensionality, we can use techniques based on recently proposed algorithms for discovering maximal itemsets [5] [26]. These techniques will allow CLIQUE to find dense units of high dimensionality without having to find all of their projections. Acknowledgment The code for CLIQUE builds on several components that Ramakr... |

113 |
An algorithm for point clustering and grid generation.
- Berger, Rigoutsos
- 1991
Citation Context ...ulae in solid-modeling [44]. It is also related to the problem of covering marked boxes in a grid with rectangles in logic minimization (e.g. [22]). Some clustering algorithms in image analysis (e.g. [7] [36] [42]) also find rectangular dense regions. In these domains, datasets are in low dimensional spaces and the techniques used are computationally too expensive for large datasets of high dimensional... |

107 |
Stochastic Complexity in Statistical Inquiry. World Scientific
- Rissanen
- 1989
Citation Context ...pply the MDL (Minimal Description Length) principle. The basic idea underlying the MDL principle is to encode the input data under a given model and select the encoding that minimizes the code length [35]. Assume we have the subspaces S1, S2, ..., Sn. Our pruning technique first groups together the dense units that lie in the same subspace. Then, for each subspace, it computes the fraction of the database t... |

104 | A Monte Carlo algorithm for fast projective clustering - Procopiuc, Jones, et al. - 2002 |

93 |
Association rules over interval data
- Miller
- 1997
Citation Context ...t closer to an optimal solution. The subspace identification problem is related to the problem of finding quantitative association rules that also identify interesting regions of various attributes [40] [32]. However, the techniques proposed are quite different. One can also imagine adapting a tree-classifier designed for data mining (e.g. [30] [37]) for subspace clustering. In the tree-growth phase, the sp... |

77 | Data mining, hypergraph transversals, and machine learning.
- Gunopulos, Khardon, et al.
- 1997
Citation Context ...subset of the k dimensions, that is, O(2^k) different combinations, are also dense. The running time of our algorithm is therefore exponential in the highest dimensionality of any dense unit. As in [1] [20], it can be shown that the candidate generation procedure produces the minimal number of candidates that can guarantee that all dense units will be found. Let k be the highest dimensionality of any den... |

64 | Range queries in OLAP data cubes.
- Ho, Agrawal, et al.
- 1997
Citation Context ...ee structure is used to store sparse regions. Currently, users are required to specify dense and sparse dimensions [4]. Similarly, the precomputation techniques for range queries over OLAP data cubes [21] require identification of dense regions in sparse data cubes. CLIQUE can be used for this purpose. In future work, we plan to address the problem of evaluating the quality of clusterings in different s... |

58 | A database interface for clustering in large spatial databases.
- Ester, Kriegel, et al.
- 1995
Citation Context ... techniques have been studied extensively in statistics [3], pattern recognition [11] [19], and machine learning [9] [31]. Recent work in the database community includes CLARANS [33], Focused CLARANS [14], BIRCH [45], and DBSCAN [13]. Current clustering techniques can be broadly classified into two categories [24] [25]: partitional and hierarchical. Given a set of objects and a clustering criterion [39... |

39 |
Hierarchical image segmentation by multi-dimensional clustering and orientation-adaptive boundary refinement.
- Schroeter, Bigün
- 1995
Citation Context ... in solid-modeling [44]. It is also related to the problem of covering marked boxes in a grid with rectangles in logic minimization (e.g. [22]). Some clustering algorithms in image analysis (e.g. [7] [36] [42]) also find rectangular dense regions. In these domains, datasets are in low dimensional spaces and the techniques used are computationally too expensive for large datasets of high dimensionality. ... |

37 |
Efficient and effective clustering methods for spatial data mining
- Ng, Han
- 1994
Citation Context ... [24] [25]. Clustering techniques have been studied extensively in statistics [3], pattern recognition [11] [19], and machine learning [9] [31]. Recent work in the database community includes CLARANS [33], Focused CLARANS [14], BIRCH [45], and DBSCAN [13]. Current clustering techniques can be broadly classified into two categories [24] [25]: partitional and hierarchical. Given a set of objects and a cl... |

34 |
A threshold for approximating set cover,
- Feige
- 1998
Citation Context ... setting. For the general set cover problem, the best known algorithm for approximating the smallest set cover gives an approximation factor of ln n where n is the size of the universe being covered [16] [28]. This problem is similar to the problem of constructive solid geometry formulae in solid-modeling [44]. It is also related to the problem of covering marked boxes in a grid with rectangles in lo... |

28 | Subspace clustering of high dimensional data - Domeniconi, Papadopoulos, et al. |

26 |
Bayesian classification (AutoClass): theory and results
- Cheeseman, Stutz
- 1996
Citation Context ...s of objects based on the values of their attributes (dimensions) [24] [25]. Clustering techniques have been studied extensively in statistics [3], pattern recognition [11] [19], and machine learning [9] [31]. Recent work in the database community includes CLARANS [33], Focused CLARANS [14], BIRCH [45], and DBSCAN [13]. Current clustering techniques can be broadly classified into two categories [24] [... |

23 |
SLIQ: A fast scalable classifier for data mining
- Mehta, Agrawal, et al.
- 1996
Citation Context ...t also identify interesting regions of various attributes [40] [32]. However, the techniques proposed are quite different. One can also imagine adapting a tree-classifier designed for data mining (e.g. [30] [37]) for subspace clustering. In the tree-growth phase, the splitting criterion will have to be changed so that some clustering criterion (e.g. average cluster diameter) is optimized. In the tree-pruni... |

21 | SPRINT: A scalable parallel classifier for data mining
- Shafer, Agrawal, et al.
- 1996
Citation Context ...o identify interesting regions of various attributes [40] [32]. However, the techniques proposed are quite different. One can also imagine adapting a tree-classifier designed for data mining (e.g. [30] [37]) for subspace clustering. In the tree-growth phase, the splitting criterion will have to be changed so that some clustering criterion (e.g. average cluster diameter) is optimized. In the tree-pruning ph... |

18 |
Performance guarantees on a sweep-line heuristic for covering rectilinear polygons with rectangles
- Franzblau
- 1989
Citation Context ...es. The best approximate algorithm known for the special case of finding a cover of a 2-dimensional rectilinear polygon with no holes produces a cover of size bounded by a factor of 2 times the optimal [17]. Since this algorithm only works for the 2-dimensional case, it cannot be used in our setting. For the general set cover problem, the best known algorithm for approximating the smallest set cover giv... |

16 |
Some NP-Complete Set Covering Problems
- Masek
- 1979
Citation Context ...ver of C if every region R ∈ R is contained in C, and each unit in C is contained in at least one of the regions in R. Computing the optimal cover is known to be NP-hard, even in the 2-dimensional case [29] [34]. The optimal cover is the cover with the minimal number of rectangles. The best approximate algorithm known for the special case of finding a cover of a 2-dimensional rectilinear polygon with no h... |

13 |
A comparative study of clustering methods. Future Generation Computer Systems
- Zait, Messatfa
- 1997
Citation Context ...tion. The data resided in the AIX file system and was stored on a 2GB SCSI drive with sequential throughput of about 2 MB/second. 4.1 Synthetic data generation We use the synthetic data generator from [43] to produce datasets with clusters of high density in specific subspaces. The data generator allows control over the structure and the size of datasets through parameters such as the number of records... |

12 | An Algorithm for Constructing Regions with Rectangles - Franzblau, Kleitman - 1984 |

11 |
A comparative study of clustering methods
- Zait, Messatfa
- 1997
Citation Context ...ation. The data resided in the AIX file system and was stored on a 2GB SCSI drive with sequential throughput of about 2 MB/second. 4.1 Synthetic data generation We use the synthetic data generator from [43] to produce datasets with clusters of high density in specific subspaces. The data generator allows control over the structure and the size of datasets through parameters such as the number of records,... |

11 | Minimum dissection of a rectilinear polygon with arbitrary holes into rectangles - Gorpinevich, Soltan - 1992 |

10 | Covering a simple orthogonal polygon with a minimum number of orthogonally convex polygons
- Culberson, Reckhow
- 1987
Citation Context ...f C if every region R ∈ R is contained in C, and each unit in C is contained in at least one of the regions in R. Computing the optimal cover is known to be NP-hard, even in the 2-dimensional case [29] [34]. The optimal cover is the cover with the minimal number of rectangles. The best approximate algorithm known for the special case of finding a cover of a 2-dimensional rectilinear polygon with no holes ... |

7 | Fast algorithms for projected clustering - Aggarwal, Wolf, et al. - 1999 |

6 |
MINI: A heuristic algorithm for two-level logic minimization
- Hong
- 1987
Citation Context ...similar to the problem of constructive solid geometry formulae in solid-modeling [44]. It is also related to the problem of covering marked boxes in a grid with rectangles in logic minimization (e.g. [22]). Some clustering algorithms in image analysis (e.g. [7] [36] [42]) also find rectangular dense regions. In these domains, datasets are in low dimensional spaces and the techniques used are computation... |

6 |
CSG Set-Theoretical Solid Modelling and NC Machining of Blend Surfaces
- Zhang, Bowyer
- 1986
Citation Context ...ver gives an approximation factor of ln n where n is the size of the universe being covered [16] [28]. This problem is similar to the problem of constructive solid geometry formulae in solid-modeling [44]. It is also related to the problem of covering marked boxes in a grid with rectangles in logic minimization (e.g. [22]). Some clustering algorithms in image analysis (e.g. [7] [36] [42]) also find rect... |

3 |
Method and Apparatus for Storing and Retrieving Multi-dimensional data in Computer Memory
- Earle
- 1994
Citation Context ...ification [23]. Automatic subspace clustering can be useful in other applications besides data mining. To index OLAP data, for instance, the data space is first partitioned into dense and sparse regions [12]. Data in dense regions is stored in an array whereas a tree structure is used to store sparse regions. Currently, users are required to specify dense and sparse dimensions [4]. Similarly, the precomp... |

3 |
A Generalized Histogram Clustering for Multidimensional Image Data
- Wharton
- 1983
Citation Context ...olid-modeling [44]. It is also related to the problem of covering marked boxes in a grid with rectangles in logic minimization (e.g. [22]). Some clustering algorithms in image analysis (e.g. [7] [36] [42]) also find rectangular dense regions. In these domains, datasets are in low dimensional spaces and the techniques used are computationally too expensive for large datasets of high dimensionality. Our s... |

2 |
A numerical classification method for partitioning of a large multidimensional mixed data set
- Chhikara, Register
- 1979
Citation Context ...ion with respect to dimensionality to prune the search space. This algorithm is similar to the Apriori algorithm for mining Association rules [1]. A somewhat similar bottom-up scheme was also used in [10] for determining modes in high dimensional histograms. Lemma 1 (Monotonicity): If a collection of points S is a cluster in a k-dimensional space, then S is also part of a cluster in any (k-1)-di... |

2 | IBM Intelligent Miner User’s Guide, Version 1 Release 1, SH12-6213-00 edition - International Business Machines - 1996 |

1 |
An overview of combinatorial data analysis
- Arabie, Hubert
- 1996
Citation Context ...descriptive task that seeks to identify homogeneous groups of objects based on the values of their attributes (dimensions) [24] [25]. Clustering techniques have been studied extensively in statistics [3], pattern recognition [11] [19], and machine learning [9] [31]. Recent work in the database community includes CLARANS [33], Focused CLARANS [14], BIRCH [45], and DBSCAN [13]. Current clustering techn... |

1 |
A numerical classification method for partitioning of a large multidimensional mixed data set
- Chhikara, Register
- 1979
Citation Context ...ion with respect to dimensionality to prune the search space. This algorithm is similar to the Apriori algorithm for mining Association rules [1]. A somewhat similar bottom-up scheme was also used in [10] for determining modes in high dimensional histograms. Lemma 1 (Monotonicity): If a collection of points S is a cluster in a k-dimensional space, then S is also part of a cluster in any (k-1)-dimensio... |

1 |
Optimizing a noisy function of many variables with application to data mining
- Friedman
- 1997
Citation Context ...ing subspaces, and also generate minimal descriptions for the clusters. A different technique to find rectangular clusters of high density in a projection of the data space has been proposed by Friedman [18]. This algorithm works in a top down fashion. Starting from the full space, it greedily chooses which projection should be taken and reevaluates the solution after each step in order to get closer to ... |

1 |
Performance guarantees on a sweep-line heuristic for covering rectilinear polygons with rectangles
- Franzblau
- 1989
Citation Context ...les. The best approximate algorithm known for the special case of finding a cover of a 2-dimensional rectilinear polygon with no holes produces a cover of size bounded by a factor of 2 times the optimal [17]. Since this algorithm only works for the 2-dimensional case, it cannot be used in our setting. For the general set cover problem, the best known algorithm for approximating the smallest set cover ... |

1 | On the ratio of the optimal integral and fractional covers. Discrete Mathematics - Lovász - 1975 |

1 |
Stochastic Complexity in Statistical Inquiry
- Rissanen
- 1989
Citation Context ...pply the MDL (Minimal Description Length) principle. The basic idea underlying the MDL principle is to encode the input data under a given model and select the encoding that minimizes the code length [35]. Assume we have the subspaces S1, S2, ..., Sn. Our pruning technique first groups together the dense units that lie in the same subspace. Then, for each subspace, it computes the fraction of the da... |

1 |
Mining Quantitative Association Rules in Large Relational Tables
- Freeman
- 1973
Citation Context ...4], BIRCH [45], and DBSCAN [13]. Current clustering techniques can be broadly classified into two categories [24] [25]: partitional and hierarchical. Given a set of objects and a clustering criterion [39], partitional clustering obtains a partition of the objects into clusters such that the objects in a cluster are more similar to each other than to objects in different clusters. The popular K-means a... |

1 |
Sampling large databases for association rules
- Toivonen
- 1996
Citation Context ... The algorithm makes k passes over the database. It follows that the running time of our algorithm is O(c^k + m k) for a constant c. The number of database passes can be reduced by adapting ideas from [41] [8]. 3.1.2 Making the bottom-up algorithm faster While the procedure just described dramatically reduces the number of units that are tested for being dense, we still may have a computationally infea... |

1 |
On the hardness of approximating minimization problems
- Lund, Yannakakis
- 1993
Citation Context ...g. For the general set cover problem, the best known algorithm for approximating the smallest set cover gives an approximation factor of ln n where n is the size of the universe being covered [16] [28]. This problem is similar to the problem of constructive solid geometry formulae in solid-modeling [44]. It is also related to the problem of covering marked boxes in a grid with rectangles in logic m... |

1 |
Covering simple orthogonal polygon with a minimum number of orthogonally convex polygons
- Reckhow, Culberson
- 1987
Citation Context ... C if every region R ∈ R is contained in C, and each unit in C is contained in at least one of the regions in R. Computing the optimal cover is known to be NP-hard, even in the 2-dimensional case [29] [34]. The optimal cover is the cover with the minimal number of rectangles. The best approximate algorithm known for the special case of finding a cover of a 2-dimensional rectilinear polygon with no holes ... |

1 |
Mining Quantitative Association Rules in Large Relational Tables
- Freeman
- 1973
Citation Context ...4], BIRCH [45], and DBSCAN [13]. Current clustering techniques can be broadly classified into two categories [24] [25]: partitional and hierarchical. Given a set of objects and a clustering criterion [39], partitional clustering obtains a partition of the objects into clusters such that the objects in a cluster are more similar to each other than to objects in different clusters. The popular K-mea... |

1 |
A comparative study of clustering methods. Future Generation Computer Systems
- Zait, Messatfa
- 1997
Citation Context ...tion. The data resided in the AIX file system and was stored on a 2GB SCSI drive with sequential throughput of about 2 MB/second. 4.1 Synthetic data generation We use the synthetic data generator from [43] to produce datasets with clusters of high density in specific subspaces. The data generator allows control over the structure and the size of datasets through parameters such as the number of records... |