Tian Zhang, Raghu Ramakrishnan, and Miron Livny. BIRCH: A new data clustering algorithm and its applications. Data Mining and Knowledge Discovery, 1(2):141--182, 1997.
....we must evaluate how well this model represents the complexity of the data. Clustering may be performed using methods such as K-means [18], expectation maximization (EM) [16], or optimization models [8]. Recently, a set of novel clustering algorithms has been proposed in the database community [26,55]. For instance, Agrawal et al. [1] present an order-independent clustering algorithm, CLIQUE, that forms clusters in large data sets. The problem of determining an appropriate model in unsupervised learning has gained popularity in the machine learning, pattern recognition, and data mining ....
T. Zhang, R. Ramakrishnan and M. Livny, BIRCH: A new data clustering algorithm and its applications, Data Mining and Knowledge Discovery 1(2) (1997), 141--182.
.... [12, 22]. Different improvements to alleviate the effects of initialization on the results of K-Means have been proposed [4, 34]. While we are not concerned here with scalability issues, we note that classic K-Means has numerous extensions to scalable unsupervised learning in databases: CLARANS [14] and BIRCH [35]. The IR algorithm is an iterative optimization of an objective function (equal to a reduction in mutual information). It can be viewed in the context of the general EM framework [9, 28, 30]. Two specific examples of particular algorithms are AutoClass [6] and MCLUST [18]. New approaches are emerging for ....
Zhang, T., Ramakrishnan, R., Livny, M., BIRCH: A New Data Clustering Algorithm and its Applications, Data Mining and Knowledge Discovery, v. 1, no. 2, 1997.
....to compute certain data summaries (sufficient statistics) [DVJ99]. The obtained summaries are then used instead of the original data for further clustering. The pivotal role here belongs to the algorithm BIRCH (Balanced Iterative Reduction and Clustering using Hierarchies) by Zhang et al. [ZRL96], [ZRL97]. This work had a significant impact on the overall direction of scalability research in clustering. BIRCH creates a height-balanced tree of nodes that summarize data by accumulating its zeroth, first, and second moments. A node, called a Cluster Feature (CF), is a tight small cluster of numerical data. The ....
....into clusters. Correspondingly, in partitioning the data the preferable way to deal with outliers is to keep one extra outliers set, so as not to pollute factual clusters. Data preprocessing schemes used to address scalability issues also provide outlier handling. The algorithm BIRCH [ZRL96], [ZRL97] revisits outliers during the major CF tree rebuilds, but in general handles them separately. This approach is shared by other similar systems [CFCW01]. The Bradley et al. [BFR98] framework utilizes a multiphase approach to outliers. The algorithm CURE [GRS98] uses shrinkage of cluster s representing ....
Zhang, T., Ramakrishnan, R. and Livny, M. BIRCH: A new data clustering algorithm and its applications. Journal of Data Mining and Knowledge Discovery, 1(2), 141-182, 1997.
....the worst case. This assumption is reasonable for a sequence of events of the same type since: (a) we only need to keep track of arrival instants and (b) this is a shorter sequence than raw data that contains a mixture of event types. When memory is problematic, an indexing mechanism such as the CF tree [15] can be used to control the total number of buckets. The algorithm takes as input the points s in q, the tolerance , and confidence level (e.g. 95%) 1. For i 2 to I 1 (b) If 6 does not exist, then 6 1 (c) Else, 6 6 1 2. AdjustCounts(6) Adjust counts to deal with time ....
....events by type so that events in each group can be read out sequentially to find possible periods using algorithm discussed in Section 3.3. Alternatively, this step can be implemented in parallel to find periods for all event types simultaneously. However, tree indexing schemes, as those used in [15], are needed to handle a large amount of events. Step 2 finds all patterns for each period. It first finds event types Ap: a GAla has a period p . It then seeds initial candidate set C l and specifies the minimum support and window size. Last, the level wise mining (Section 3.3) can be ....
T. Zhang, R. Ramakrishnan, and M. Livny. Birch: A new data clustering algorithm and its applications. Data Mining and Knowledge Discovery, pages 141-182, 1997.
....not much attention has been given to integrating them with database systems, some of these techniques are beginning to be scaled to operate on large databases. Examples in classification (Section 3) include decision trees [18], in summarization (Section 3) association rules [1], and in clustering [28]. Predictive Modeling: The goal is to predict some field(s) in a database based on other fields. If the field being predicted is a numeric (continuous) variable (such as a physical measurement of e.g. height) then the prediction problem is a regression problem. If the field is categorical then it ....
....server needs to provide (data-intensive operations to derive sufficient statistics) versus what a data mining client needs to do (consume sufficient statistics to build a model). This is likely to be a good direction towards scaling mining algorithms to effectively work with large databases. See [18, 5, 28, 3] for examples in classification and clustering. Partitioning methods are suitable to large databases. Most data mining methods are essentially partitioning methods: find local partitions in the data and build a model for each local region. This is a particularly suitable model for large ....
T. Zhang, R. Ramakrishnan, and M. Livny. Birch: A new data clustering algorithm and its applications. Data Mining and Knowledge Discovery, 1(2), 1997.
....worst case. This assumption is reasonable for a sequence of events of the same type since: (a) we only need to keep track of arrival instants and (b) this is a shorter sequence than raw data that contains a mixture of event types. When memory is problematic, an indexing mechanism such as the CF tree [19] can be used to control the total number of buckets. The algorithm takes as input the points s in q, the tolerance c, and confidence level (e.g. 95%) 1. For i 2 to I 1 (b) If C does not exist, then C 1 (c) Else, C C 1 2. AdjustCounts( C) 5) Adjust counts to deal with time ....
....events by type so that events in each group can be read out sequentially to find possible periods using algorithm discussed in Section 3.3. Alternatively, this step can be implemented in parallel to find periods for all event types simultaneously. However, tree indexing schemes, as those used in [19], are needed to handle a large number of events. Step 2 finds all patterns for each period. It first finds event types Ap a Ala has a period 9) It then seeds initial candidate set el and specifies the minimum support and window size. Last, the level wise mining (Section 3.3) can be performed ....
T. Zhang, R. Ramakrishnan, and M. Livny. Birch: A new data clustering algorithm and its applications. Data Mining and Knowledge Discovery, pages 141-182, 1997.
....al. suggested a technique called Generalization-based Clustering, which uses page URLs to construct a hierarchy which is then used to categorize the pages [6]. The page accesses in each user session are described using these page categorizations and are then clustered using the BIRCH algorithm [20]. Banerjee et al. utilized the combination of time spent on a page and Longest Common Subsequences (LCS) to cluster the user sessions [1]. The LCS algorithm is first applied on all pairs of user sessions. Then each LCS path is reduced using page hierarchy in a generalization-based approach ....
Zhang, T., Ramakrishnan, R., and Livny, M. BIRCH: A New Data Clustering Algorithm and Its Applications. Data Mining and Knowledge Discovery, 1(2), 1997. 141-182.
....to other imagery problems. The SKICAT [FDW96] applied data mining technology to the classification of astronomical objects, discovering new types in the process. JARtool [BFP 94] looked for specific features in overhead images; a specific application was identifying volcanoes on Venus. Birch [ZRL97] used data mining technology to study foliage. A closer application to the one presented here was Quakefinder[SD96] which looked for a specific type of change (earth movements) in overhead imagery. One aspect of Quakefinder that is particularly relevant to this paper is the ability to accurately ....
Tian Zhang, Raghu Ramakrishnan, and Miron Livny. BIRCH: A new data clustering algorithm and its applications. Data Mining and Knowledge Discovery, 1(2):141--182, 1997.
....and corresponding boundaries. Like the nearest neighbour technique, clustering is completely dependent on the distance measure to be applied. Clustering is an important application area for many fields including data mining [47], statistical data analysis [75, 49], data compression or reduction [161], psychology, biology, sociology, and business applications [16]. For instance, clustering techniques can be used in marketing for finding customer groups; to find patients with similar profiles in the health care industry; to derive a taxonomy for grouping all living organisms in biology; and ....
....patterns for multi dimensional data points in a large database. Their techniques include building a hierarchical structure of clusters and identification of subspace in high dimensional data. In BIRCH, the clusters are merged or split according to some meta information describing each cluster [161]. The meta information contains the number of data points, the linear sum and the square sum of those data points in a cluster. In the process of scanning the database, data points are inserted into the dynamic hierarchical clusters. The cluster structure will reorganize if necessary according to ....
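The meta information described in this excerpt (point count, linear sum, square sum) can be sketched in a few lines of Python. This is an illustration only, not the authors' implementation: the names `ClusterSummary` and `try_absorb`, and the use of a radius threshold to decide whether a point joins a cluster, are assumptions made for the example.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class ClusterSummary:
    """Hypothetical container for the per-cluster meta information."""
    n: int                    # number of data points
    linear_sum: List[float]   # component-wise sum of the points
    square_sum: float         # sum of squared norms of the points

    @classmethod
    def from_point(cls, point: Sequence[float]) -> "ClusterSummary":
        return cls(1, list(point), sum(x * x for x in point))

    def merged(self, other: "ClusterSummary") -> "ClusterSummary":
        # All three statistics are additive, so merging needs no raw data.
        return ClusterSummary(
            self.n + other.n,
            [a + b for a, b in zip(self.linear_sum, other.linear_sum)],
            self.square_sum + other.square_sum,
        )

    def radius(self) -> float:
        # Root-mean-square distance of the points from their centroid,
        # computed from the summary alone: sqrt(SS/N - ||LS/N||^2).
        centroid_sq = sum((s / self.n) ** 2 for s in self.linear_sum)
        return math.sqrt(max(self.square_sum / self.n - centroid_sq, 0.0))


def try_absorb(summary: ClusterSummary, point: Sequence[float],
               threshold: float) -> Optional[ClusterSummary]:
    """Return the enlarged summary, or None if the point would make the
    cluster wider than the (assumed) radius threshold."""
    candidate = summary.merged(ClusterSummary.from_point(point))
    return candidate if candidate.radius() <= threshold else None
```

A `None` result is where a real system would start a new cluster or split a node; that policy is outside the scope of this sketch.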
Zhang T., Ramakrishnan R. and Livny M., "BIRCH: A New Data Clustering Algorithm and its Applications", Data Mining and Knowledge Discovery 1(2), 1997.
....Gini index, mutations, crossover operations. 1 Introduction. Clustering is a main topic of computer science and is particularly important in the field of data mining. Research of this problem has resulted in many clustering algorithms: [ABKS99], [AGGR98], [ChWcY99], [MHPJX96], [SRK99], [HGV97], [TRL96], [SCC00a]. For an overview of clustering algorithms see [Mir96], [KR90] and [JMF99]. In this paper we investigate the application of information theoretical methods to genetic algorithms for clustering. The objects that we cluster are represented by tuples in a table and the goal of the genetic ....
Zhang Tian, Raghu Ramakrishnan, and Miron Livny. Birch: A new data clustering algorithm and its applications. Proceedings of the ACM SIGMOD Conference on Management of Data, Montreal, Canada, June 1996.
....proceeds successively by building a tree of clusters. It can be viewed as a nested sequence of partitioning. The tree of clusters, often called a dendrogram, shows the relationship of the clusters, and a clustering of the data set can be obtained by cutting the dendrogram at a desired level. BIRCH [43, 44], CURE [18] and ROCK [19] are representative hierarchical clustering algorithms. Density-based clustering is to group the neighboring points of a data set into clusters based on density conditions. DBSCAN [11, 10], OPTICS [3] and DENCLUE [21] are well-known algorithms of this category. Grid based ....
Tian Zhang, Raghu Ramakrishnan, and Miron Livny. BIRCH: A new data clustering algorithm and its applications. Data Mining and Knowledge Discovery, 1(2):141-182, 1997.
....of data points in the cluster, 2. $LS_i$ is the linear sum of the $N_i$ data points (i.e. $LS_i = \sum_{j=1}^{N_i} X_j$, where $\{X_j\}_{j=1}^{N_i}$ is the cluster of data points), and 3. $RpclCenter_i$ is the cluster center calculated by RPCL clustering. The tuple $F_i$ is similar to the Clustering Feature (CF) in [72]. We keep this tuple in each non-leaf node because it can help us to calculate the centroid of the cluster for retrieval. The centroid $C$ of a cluster $\{X_i\}_{i=1}^{N}$ is defined as $C = \left(\sum_{i=1}^{N} X_i\right) / N$, which can be easily computed from information in the tuple. Based on Definitions 5.1 and ....
....which can be easily computed from information in the tuple. Based on Definitions 5.1 and 5.2, RPCL b tree satisfies the following properties. Property 5.1: Each leaf node contains between 1 and M data point(s). Property 5.2: Each non-leaf node has two children. Property 5.3: It has been proven in [72] that $N$ and $LS$ for a non-leaf node can be easily calculated from the clustering information of its child nodes with $F_1 = (N_1, LS_1, RpclCenter_1)$ and $F_2 = (N_2, LS_2, RpclCenter_2)$ as $N = N_1 + N_2$, $LS = LS_1 + LS_2$. ....
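As a small illustration of Property 5.3 and the centroid definition in the excerpt above, the following Python sketch combines the counts and linear sums of two child nodes and derives the parent's centroid. The names `NodeSummary`, `combine`, and `centroid` are hypothetical, and the RPCL cluster center is left out because the excerpt does not say how it is obtained for a parent node.

```python
from typing import NamedTuple, Tuple


class NodeSummary(NamedTuple):
    """Hypothetical (N, LS) part of the tuple F kept in each node."""
    n: int                  # N: number of data points under the node
    ls: Tuple[float, ...]   # LS: component-wise linear sum of those points


def combine(f1: NodeSummary, f2: NodeSummary) -> NodeSummary:
    # N = N1 + N2 and LS = LS1 + LS2 for a non-leaf node with two children.
    return NodeSummary(f1.n + f2.n,
                       tuple(a + b for a, b in zip(f1.ls, f2.ls)))


def centroid(f: NodeSummary) -> Tuple[float, ...]:
    # C = LS / N, available without revisiting the data points.
    return tuple(s / f.n for s in f.ls)
```

For example, a child summarizing two points with linear sum (2, 4) and a child summarizing one point (4, 2) combine into a parent with N = 3, LS = (6, 6), and centroid (2, 2).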
T. Zhang, R. Ramakrishnan, and M. Livny. "BIRCH: A New Data Clustering Algorithm and Its Applications". Data Mining and Knowledge Discovery, 1(2):141--182, 1997.
....well for a small number of pages. Another method is to group pages according to their hierarchy [FSS99]. Thus, the home page of the Web server is considered as the root page, and other pages that are related are either node pages or leaf pages. For clustering such a data format, the BIRCH algorithm [ZRL97] was used. BIRCH is a hierarchical and incremental clustering algorithm for a large dataset. In order to apply this algorithm, web data have to be structured in a hierarchical way, in addition to the grouping into sessions. Clustering of client information or data items on Web transaction logs, ....
T. Zhang, R. Ramakrishnan and M. Livny, "BIRCH: A New Data Clustering Algorithm and Its Applications", Data Mining and Knowledge Discovery, vol. 1, no. 2, pp. 141-182, 1997, url = citeseer.nj.nec.com/zhang97birch.html
....datasets. In [18, 59] an algorithm is proposed for refining a given clustering initialization targeted at clustering massive datasets. In [19, 20] a scalable clustering framework is presented and instantiated on the k-Mean and EM algorithms. A scalable clustering approach is also presented in [168]. We next consider methods in which the clusters themselves are described by a centroid in $R^n$. 3.1 Clustering to Centroids. We consider the unsupervised assignment of elements of a given set to groups or clusters of like points. Many approaches to this problem include statistical [81, ....
T. Zhang, R. Ramakrishnan, and M. Livny. Birch: A new data clustering algorithm and its applications. Data Mining and Knowledge Discovery, 1(2):141--182, 1997.
No context found.
Tian Zhang, Raghu Ramakrishnan, and Miron Livny. BIRCH: A new data clustering algorithm and its applications. Data Mining and Knowledge Discovery, 1(2):141--182, 1997.
No context found.
Tian Zhang, Raghu Ramakrishnan, and Miron Livny. BIRCH: A new data clustering algorithm and its applications. Data Mining and Knowledge Discovery, 1(2):141--182, 1997.
No context found.
T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: A New Data Clustering Algorithm and Its Applications. Data Mining and Knowledge Discovery, 2(1), 1997.
No context found.
T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: a new data clustering algorithm and its applications. volume 1, pages 141-182, 1997.
No context found.
Zhang, T., Ramakrishnan, R., Livny, M.: BIRCH: A New Data Clustering Algorithm and Its Applications. Data Mining and Knowledge Discovery, 1(2) (1997) 141--182
No context found.
R. Ramakrishnan, T. Zhang, and M. Livny. Birch: a new data clustering algorithm and its applications. In Data Mining and Knowledge Discovery, 1997.
No context found.
Zhang, T. and Ramakrishnan, R. (1997) BIRCH: A new data clustering algorithm and its applications. Data Mining and Knowledge Discovery, Vol. 1, No.
No context found.
T. Zhang, R. Ramakrishnan, and M. Livny, "BIRCH: A new data clustering algorithm and its applications," Data Mining and Knowledge Discovery, vol. 1, no. 2, pp. 141--182, 1997.
No context found.
T. Zhang, R. Ramakrishnan, and M. Livny, "BIRCH: A New Data Clustering Algorithm and Its Applications", in Data Mining and Knowledge Discovery, volume 1, pages 141--182, 1997.
No context found.
Zhang, T., Ramakrishnan, R., and Livny, M. (1997). BIRCH: A new data clustering algorithm and its applications. Data Mining and Knowledge Discovery, 1(2):141--182.
No context found.
Zhang, T., Ramakrishnan, R., Livny, M.: BIRCH: A New Data Clustering Algorithm and Its Applications. Data Mining and Knowledge Discovery, 1(2) (1997) 141--182