BIRCH: A new data clustering algorithm and its applications. Data Mining and Knowledge Discovery, 1(2):141–182 (1997)

by T Zhang, R Ramakrishnan, M Livny
Venue: In 15th ISEE conference
Results 1 - 10 of 89

Survey of clustering data mining techniques

by Pavel Berkhin , 2002
"... Accrue Software, Inc. Clustering is a division of data into groups of similar objects. Representing the data by fewer clusters necessarily loses certain fine details, but achieves simplification. It models data by its clusters. Data modeling puts clustering in a historical perspective rooted in math ..."
Abstract - Cited by 400 (0 self) - Add to MetaCart
Accrue Software, Inc. Clustering is a division of data into groups of similar objects. Representing the data by fewer clusters necessarily loses certain fine details, but achieves simplification. It models data by its clusters. Data modeling puts clustering in a historical perspective rooted in mathematics, statistics, and numerical analysis. From a machine learning perspective clusters correspond to hidden patterns, the search for clusters is unsupervised learning, and the resulting system represents a data concept. From a practical perspective clustering plays an outstanding role in data mining applications such as scientific data exploration, information retrieval and text mining, spatial database applications, Web analysis, CRM, marketing, medical diagnostics, computational biology, and many others. Clustering is the subject of active research in several fields such as statistics, pattern recognition, and machine learning. This survey focuses on clustering in data mining. Data mining adds to clustering the complications of very large datasets with very many attributes of different types. This imposes unique

Citation Context

...summaries are then used instead of the original data for clustering. Here the most important role belongs to the algorithm BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) [226], [227]. This work had a significant impact on the overall direction of scalability research in clustering. BIRCH creates a height-balanced tree of nodes that summarizes data by accumulating its zero, first, and...
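To make that summary concrete, here is a minimal sketch (in Python, with illustrative names; not the paper's own pseudocode) of such a clustering feature: the zeroth, first, and second moments of the points a subcluster has absorbed. The moments are additive, so entries can be merged cheaply, and they suffice to recover the centroid and radius without revisiting the data.

```python
import numpy as np

class CF:
    """Clustering feature: zeroth, first, and second moments of a set of points."""
    def __init__(self, dim):
        self.n = 0                    # number of points (zeroth moment)
        self.ls = np.zeros(dim)       # linear sum of the points (first moment)
        self.ss = 0.0                 # sum of squared norms (second moment)

    def add(self, x):
        x = np.asarray(x, dtype=float)
        self.n += 1
        self.ls += x
        self.ss += float(x @ x)

    def merge(self, other):
        """CF vectors are additive, so subcluster summaries combine componentwise."""
        self.n += other.n
        self.ls += other.ls
        self.ss += other.ss

    def centroid(self):
        return self.ls / self.n

    def radius(self):
        """Average distance of member points from the centroid, from the moments alone."""
        c = self.centroid()
        return np.sqrt(max(self.ss / self.n - float(c @ c), 0.0))
```

A tree node holds a handful of such entries; an incoming point descends to the closest entry and is absorbed as long as the entry stays tight enough, which is what keeps the tree a compact summary of the full dataset.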

Refining Initial Points for K-Means Clustering

by P. S. Bradley, Usama M. Fayyad , 1998
"... Practical approaches to clustering use an iterative procedure (e.g. K-Means, EM) which converges to one of numerous local minima. It is known that these iterative techniques are especially sensitive to initial starting conditions. We present a procedure for computing a refined starting condition fro ..."
Abstract - Cited by 308 (5 self) - Add to MetaCart
Practical approaches to clustering use an iterative procedure (e.g. K-Means, EM) which converges to one of numerous local minima. It is known that these iterative techniques are especially sensitive to initial starting conditions. We present a procedure for computing a refined starting condition from a given initial one that is based on an efficient technique for estimating the modes of a distribution. The refined initial starting condition allows the iterative algorithm to converge to a "better" local minimum. The procedure is applicable to a wide class of clustering algorithms for both discrete and continuous data. We demonstrate the application of this method to the popular K-Means clustering algorithm and show that refined initial starting points indeed lead to improved solutions. Refinement run time is considerably lower than the time required to cluster the full database. The method is scalable and can be coupled with a scalable clustering algorithm to address the large-scale cl...
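As a rough illustration of the refinement idea (a hedged sketch, not the authors' exact procedure; function and parameter names are invented, and scikit-learn's KMeans stands in for the base clustering step): cluster several small subsamples starting from the given initial point, pool the resulting centers as smoothed estimates of the modes, and keep the candidate solution that fits the pool best.

```python
import numpy as np
from sklearn.cluster import KMeans

def refine_initial_centers(X, k, init_centers, n_subsamples=10, sample_frac=0.05, seed=0):
    """Hypothetical sketch: refine K-Means starting points from small subsamples."""
    rng = np.random.default_rng(seed)
    m = max(k, int(sample_frac * len(X)))
    candidates = []
    for _ in range(n_subsamples):
        sample = X[rng.choice(len(X), size=m, replace=False)]
        km = KMeans(n_clusters=k, init=init_centers, n_init=1).fit(sample)
        candidates.append(km.cluster_centers_)

    pooled = np.vstack(candidates)            # smoothed estimates of the distribution's modes
    best_centers, best_inertia = None, np.inf
    for cand in candidates:                   # try each candidate as a start over the pooled set
        km = KMeans(n_clusters=k, init=cand, n_init=1).fit(pooled)
        if km.inertia_ < best_inertia:
            best_centers, best_inertia = km.cluster_centers_, km.inertia_
    return best_centers                       # refined starting point for full-data K-Means
```

The expensive full-data clustering runs only once, from the refined start, which is why the refinement cost stays well below the cost of clustering the whole database.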

Scaling Clustering Algorithms to Large Databases

by P. S. Bradley, Usama Fayyad, Cory Reina , 1998
"... Practical clustering algorithms require multiple data scans to achieve convergence. For large databases, these scans become prohibitively expensive. We present a scalable clustering framework applicable to a wide class of iterative clustering. We require at most one scan of the database. In this wor ..."
Abstract - Cited by 299 (5 self) - Add to MetaCart
Practical clustering algorithms require multiple data scans to achieve convergence. For large databases, these scans become prohibitively expensive. We present a scalable clustering framework applicable to a wide class of iterative clustering. We require at most one scan of the database. In this work, the framework is instantiated and numerically justified with the popular K-Means clustering algorithm. The method is based on identifying regions of the data that are compressible, regions that must be maintained in memory, and regions that are discardable. The algorithm operates within the confines of a limited memory buffer. Empirical results demonstrate that the scalable scheme outperforms a sampling-based approach. In our scheme, data resolution is preserved to the extent possible based upon the size of the allocated memory buffer and the fit of current clustering model to the data. The framework is naturally extended to update multiple clustering models simultaneously. We empirically evaluate on synthetic and publicly available data sets.
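The buffer discipline can be sketched roughly as follows (hedged: the names and the single distance-threshold discard rule are simplifications of the compress/retain/discard framework, which also compresses dense sub-regions and respects an explicit memory budget).

```python
import numpy as np

def scalable_kmeans_pass(stream_chunks, centers, discard_radius=1.0):
    """One-scan sketch: fold well-explained points into per-cluster sufficient
    statistics (discard) and keep only ambiguous points in the buffer (retain)."""
    k, d = centers.shape
    disc_n = np.zeros(k)                 # counts of discarded points per cluster
    disc_sum = np.zeros((k, d))          # linear sums of discarded points per cluster
    retained = np.empty((0, d))

    for chunk in stream_chunks:          # each chunk is one buffer-load of records
        data = np.vstack([retained, chunk])
        dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        assign = dists.argmin(axis=1)

        # update centers from buffered points plus the discarded statistics
        for j in range(k):
            pts = data[assign == j]
            total = disc_n[j] + len(pts)
            if total > 0:
                centers[j] = (disc_sum[j] + pts.sum(axis=0)) / total

        # discard points that are well explained by their center; retain the rest
        near = dists[np.arange(len(data)), assign] < discard_radius
        for j in range(k):
            sel = near & (assign == j)
            disc_n[j] += sel.sum()
            disc_sum[j] += data[sel].sum(axis=0)
        retained = data[~near]
    return centers
```

The memory buffer never grows beyond the retained set plus a fixed amount of per-cluster statistics, which is the property that lets the scheme finish in at most one scan.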

Towards higher disk head utilization: extracting free bandwidth from busy disk drives

by Christopher R. Lumb, Jiri Schindler, Gregory R. Ganger, David F. Nagle - Symposium on Operating Systems Design and Implementation , 2000
"... Abstract Freeblock scheduling is a new approach to utilizing more of a disk's potential media bandwidth. By filling rotational latency periods with useful media transfers, 20-50 % of a never-idle disk's bandwidth can often be provided to background applications with no effect on foreground ..."
Abstract - Cited by 96 (20 self) - Add to MetaCart
Freeblock scheduling is a new approach to utilizing more of a disk's potential media bandwidth. By filling rotational latency periods with useful media transfers, 20-50% of a never-idle disk's bandwidth can often be provided to background applications with no effect on foreground response times. This paper describes freeblock scheduling and demonstrates its value with simulation studies of two concrete applications: segment cleaning and data mining. Free segment cleaning often allows an LFS file system to maintain its ideal write performance when cleaning overheads would otherwise reduce performance by up to a factor of three. Free data mining can achieve over 47 full disk scans per day on an active transaction processing system, with no effect on its disk performance.
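The underlying arithmetic is simple enough to sketch (illustrative only; parameter names are not from the paper): whatever portion of the next foreground request's rotational latency is not consumed by repositioning can be spent transferring background blocks at no cost to the foreground workload.

```python
def free_background_blocks(rotational_latency_ms, repositioning_ms, block_transfer_ms):
    """Back-of-the-envelope count of background blocks that fit 'for free' inside
    the rotational latency of the next foreground request (hypothetical parameters)."""
    usable = rotational_latency_ms - repositioning_ms   # time left after extra seeks
    return max(int(usable // block_transfer_ms), 0)
```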

Citation Context

...f records for statistical features and correlations. Many data mining operations, including nearest neighbor search, association rules [2], ratio and singular value decomposition [34], and clustering [62, 26], eventually translate into a few scans of the entire dataset. Further, individual records can be processed immediately and in any order, matching three of the criteria of appropriate free bandwidth u...

Mining Partially Periodic Event Patterns With Unknown Periods

by Sheng Ma, Joseph L. Hellerstein - Proc. ICDE , 2000
"... Periodic behavior is common in real-world applications. However, in many cases, periodicities are partial in that they are present only intermittently. Herein, we study such intermittent patterns, which we refer to as p-patterns. Our formulation of p-patterns takes into account imprecise time inf ..."
Abstract - Cited by 72 (1 self) - Add to MetaCart
Periodic behavior is common in real-world applications. However, in many cases, periodicities are partial in that they are present only intermittently. Herein, we study such intermittent patterns, which we refer to as p-patterns. Our formulation of p-patterns takes into account imprecise time information (e.g., due to unsynchronized clocks in distributed environments), noisy data (e.g., due to extraneous events), and shifts in phase and/or periods. We structure mining for p-patterns as two sub-tasks: (1) finding the periods of p-patterns and (2) mining temporal associations. For (2), a level-wise algorithm is used. For (1), we develop a novel approach based on a chi-squared test, and study its performance in the presence of noise.
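One way to instantiate the chi-squared idea is sketched below (hedged: the binning, the Poisson null model, and all names are simplifications relative to the paper's algorithm): an inter-arrival time that occurs far more often than random arrivals would predict is flagged as a candidate period.

```python
import numpy as np

def candidate_periods(timestamps, delta, chi2_crit=3.84):
    """Flag candidate periods of a partially periodic event type: compare the
    observed count of each inter-arrival time (within tolerance delta) against
    the count expected under a random (Poisson) arrival null, via a chi-squared
    statistic with one degree of freedom (3.84 ~ 95% confidence)."""
    t = np.sort(np.asarray(timestamps, dtype=float))
    gaps = np.diff(t)
    n, span = len(gaps), t[-1] - t[0]
    rate = len(t) / span                          # Poisson rate under the null
    periods = []
    for tau in np.unique(np.round(gaps / delta) * delta):
        observed = np.sum(np.abs(gaps - tau) <= delta)
        # null probability that a random inter-arrival lands in [tau - delta, tau + delta]
        p = np.exp(-rate * max(tau - delta, 0)) - np.exp(-rate * (tau + delta))
        expected = n * p
        if expected > 0:
            chi2 = (observed - expected) ** 2 / (expected * (1 - p))
            if chi2 > chi2_crit and observed > expected:
                periods.append(tau)
    return periods
```

Because the test only needs counts of inter-arrival times, it tolerates the noise and phase shifts that break exact periodicity detection.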

Citation Context

...we only need to keep track of arrival instants and (b) this is a shorter sequence than the raw data, which contains a mixture of event types. When memory is problematic, an indexing mechanism such as the CF-tree [19] can be used to control the total number of buckets. The algorithm takes as input the points s in q, the tolerance c, and the confidence level (e.g. 95%). 1. For i = 2 to ... (b) If C does not exist, then...

Alternatives to the k-Means Algorithm That Find Better Clusterings

by Greg Hamerly, Charles Elkan
"... We investigate here the behavior of the standard k-means clustering algorithm and several alternatives to it: the k- harmonic means algorithm due to Zhang and colleagues, fuzzy k-means, Gaussian expectation-maximization, and two new variants of k-harmonic means. Our aim is to nd which aspect ..."
Abstract - Cited by 62 (5 self) - Add to MetaCart
We investigate here the behavior of the standard k-means clustering algorithm and several alternatives to it: the k-harmonic means algorithm due to Zhang and colleagues, fuzzy k-means, Gaussian expectation-maximization, and two new variants of k-harmonic means. Our aim is to find which aspects of these algorithms contribute to finding good clusterings, as opposed to converging to a low-quality local optimum. We describe each algorithm in a unified framework that introduces separate cluster membership and data weight functions.
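That unified framework can be sketched as a single center-update rule parameterized by a membership function m(c_j|x_i) and a data weight w(x_i) (a hedged sketch; the soft-membership example is illustrative and not the exact k-harmonic-means formulas).

```python
import numpy as np

def center_update(X, centers, membership, weight):
    """One iteration in the membership/weight view of center-based clustering:
    the algorithms differ only in how m(c_j|x_i) and w(x_i) are defined."""
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)   # (n, k) distances
    m = membership(d)                      # (n, k), each row a distribution over centers
    w = weight(d)                          # (n,) per-point data weights
    return ((m * w[:, None]).T @ X) / (m * w[:, None]).sum(axis=0)[:, None]

def hard_membership(d):                    # K-Means: all weight on the closest center
    m = np.zeros_like(d)
    m[np.arange(len(d)), d.argmin(axis=1)] = 1.0
    return m

def soft_membership(d, p=2):               # illustrative soft alternative (not the exact KHM form)
    inv = 1.0 / (d + 1e-12) ** p
    return inv / inv.sum(axis=1, keepdims=True)

unit_weight = lambda d: np.ones(len(d))    # K-Means and fuzzy k-means use unit data weights
```

In this view, an algorithm's sensitivity to initialization largely comes down to how sharply its membership function concentrates on the nearest center and whether its weight function boosts poorly covered points.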

Mathematical Programming for Data Mining: Formulations and Challenges

by P. S. Bradley, Usama M. Fayyad, O. L. Mangasarian - INFORMS Journal on Computing , 1998
"... This paper is intended to serve as an overview of a rapidly emerging research and applications area. In addition to providing a general overview, motivating the importance of data mining problems within the area of knowledge discovery in databases, our aim is to list some of the pressing research ch ..."
Abstract - Cited by 61 (0 self) - Add to MetaCart
This paper is intended to serve as an overview of a rapidly emerging research and applications area. In addition to providing a general overview, motivating the importance of data mining problems within the area of knowledge discovery in databases, our aim is to list some of the pressing research challenges, and outline opportunities for contributions by the optimization research communities. Towards these goals, we include formulations of the basic categories of data mining methods as optimization problems. We also provide examples of successful mathematical programming approaches to some data mining problems. Keywords: data analysis, data mining, mathematical programming methods, challenges for massive data sets, classification, clustering, prediction, optimization. To appear: INFORMS Journal on Computing, special issue on Data Mining, A. Basu and B. Golden (guest editors). Also appears as Mathematical Programming Technical Report 98-01, Computer Sciences Department, University of Wi...
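As a representative example of casting clustering as a mathematical program (a generic k-means/k-median style formulation, not necessarily the paper's exact one):

```latex
% Clustering as an optimization problem over centers c_j and assignment variables t_{ij}:
\min_{c,\,t}\;\sum_{i=1}^{n}\sum_{j=1}^{k} t_{ij}\,\lVert x_i - c_j \rVert^{2}
\quad\text{s.t.}\quad \sum_{j=1}^{k} t_{ij} = 1,\;\; t_{ij}\ge 0 \qquad (i = 1,\dots,n).
```

Alternately fixing t and optimizing over c, then fixing c and optimizing over t, recovers the familiar K-Means iterations; replacing the squared norm with a 1-norm distance gives a k-median-style variant.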

Citation Context

...bases. In classification, examples include scalable decision tree algorithms [95, 33] and scalable approaches to computing classification surfaces [21, 102]. In clustering scalable approaches include [17, 125, 60, 2]. In data summarization, examples include [18, 78, 3]. We provide example formulations of some data mining problems as mathematical programs. The formulations are intended as general guidelines and do...

Scaling EM (Expectation-Maximization) Clustering to Large Databases

by Paul S. Bradley, Usama M. Fayyad, Cory A. Reina, 1999
"... Practical statistical clustering algorithms typically center upon an iterative refinement optimization procedure to compute a locally optimal clustering solution that maximizes the fit to data. These algorithms typically require many database scans to converge, and within each scan they require the ..."
Abstract - Cited by 52 (1 self) - Add to MetaCart
Practical statistical clustering algorithms typically center upon an iterative refinement optimization procedure to compute a locally optimal clustering solution that maximizes the fit to data. These algorithms typically require many database scans to converge, and within each scan they require access to every record in the data table. For large databases, the scans become prohibitively expensive. We present a scalable implementation of the Expectation-Maximization (EM) algorithm. The database community has focused on distance-based clustering schemes, and methods have been developed to cluster either numerical or categorical data. Unlike distance-based algorithms (such as K-Means), EM constructs proper statistical models of the underlying data source and naturally generalizes to cluster databases containing both discrete-valued and continuous-valued data. The scalable method is based on a decomposition of the basic statistics the algorithm needs: identifying regions of the data that...
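A hedged sketch of how compressed regions can enter the model update through sufficient statistics alone (illustrative names; only the mean update of an M-step is shown, and responsibilities for a compressed block are evaluated once at its centroid, which simplifies the scalable-EM bookkeeping):

```python
import numpy as np

def m_step_means_with_compressed(resp_points, X, resp_blocks, block_n, block_sum):
    """Gaussian-mixture mean update where compressed regions contribute only their
    count and linear sum, alongside the retained points kept in the buffer.
    resp_points: (n_pts, k) responsibilities of retained points X: (n_pts, d)
    resp_blocks: (n_blk, k) responsibilities at block centroids
    block_n: (n_blk,) point counts; block_sum: (n_blk, d) linear sums."""
    num = resp_points.T @ X + resp_blocks.T @ block_sum
    den = resp_points.sum(axis=0) + (resp_blocks * block_n[:, None]).sum(axis=0)
    return num / den[:, None]          # new component means, shape (k, d)
```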

Citation Context

...Data clustering is important in many fields, including data mining [FPSU96], statistical data analysis [KR89, BR93], compression [ZRL97], and vector quantization. Applications include data analysis and modeling [FDW97, FHS96], image segmentation, marketing, fraud detection, predictive modeling, data summarization, and general data repo...

Data Cleansing: Beyond Integrity Analysis

by Jonathan I. Maletic, Andrian Marcus , 2000
"... The paper analyzes the problem of data cleansing and automatically identifying potential errors in data sets. An overview of the diminutive amount of existing literature concerning data cleansing is given. Methods for error detection that go beyond integrity analysis are reviewed and presented. The ..."
Abstract - Cited by 51 (0 self) - Add to MetaCart
The paper analyzes the problem of data cleansing and automatically identifying potential errors in data sets. An overview of the diminutive amount of existing literature concerning data cleansing is given. Methods for error detection that go beyond integrity analysis are reviewed and presented. The applicable methods include: statistical outlier detection, pattern matching, clustering, and data mining techniques. Some brief results supporting the use of such methods are given. The future research directions necessary to address the data cleansing problem are discussed. Keywords: data cleansing, data cleaning, data quality, error detection.
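A minimal example of the simplest method family mentioned, statistical outlier detection (illustrative only; real cleansing pipelines combine this with pattern matching, clustering, and association-rule techniques):

```python
import numpy as np

def flag_statistical_outliers(values, z_threshold=3.0):
    """Flag values whose z-score exceeds a threshold as potential data errors."""
    v = np.asarray(values, dtype=float)
    z = (v - v.mean()) / v.std()
    return np.abs(z) > z_threshold     # boolean mask of suspect records
```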

A Fast Parallel Clustering Algorithm for Large Spatial Databases

by Xiaowei Xu, Jochen Jäger, Hans-Peter Kriegel - Data Mining and Knowledge Discovery, 3, 263–290, 1999
"... The clustering algorithm DBSCAN relies on a density-based notion of clusters and is designed to discover clusters of arbitrary shape as well as to distinguish noise. In this paper, we present PDBSCAN, a parallel version of this algorithm. We use the ‘shared-nothing’ architecture with multiple compu ..."
Abstract - Cited by 51 (1 self) - Add to MetaCart
The clustering algorithm DBSCAN relies on a density-based notion of clusters and is designed to discover clusters of arbitrary shape as well as to distinguish noise. In this paper, we present PDBSCAN, a parallel version of this algorithm. We use the ‘shared-nothing’ architecture with multiple computers interconnected through a network. A fundamental component of a shared-nothing system is its distributed data structure. We introduce the dR∗-tree, a distributed spatial index structure in which the data is spread among multiple computers and the indexes of the data are replicated on every computer. We implemented our method using a number of workstations connected via Ethernet (10 Mbit). A performance evaluation shows that PDBSCAN offers nearly linear speedup and has excellent scaleup and sizeup behavior.
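The density criterion that PDBSCAN distributes can be stated in a few lines (a brute-force sketch; the paper replaces the quadratic neighborhood query with the distributed dR*-tree index):

```python
import numpy as np

def core_points(X, eps, min_pts):
    """DBSCAN's density test: a point is a core point if its eps-neighborhood
    contains at least min_pts points; clusters grow outward from core points."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)   # all-pairs distances
    return (d <= eps).sum(axis=1) >= min_pts                    # boolean mask of core points
```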