Results 11  20
of
722
CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling
, 1999
"... Clustering in data mining is a discovery process that groups a set of data such that the intracluster similarity is maximized and the intercluster similarity is minimized. Existing clustering algorithms, such as Kmeans, PAM, CLARANS, DBSCAN, CURE, and ROCK are designed to find clusters that fit s ..."
Abstract

Cited by 268 (19 self)
 Add to MetaCart
Clustering in data mining is a discovery process that groups a set of data such that the intracluster similarity is maximized and the intercluster similarity is minimized. Existing clustering algorithms, such as Kmeans, PAM, CLARANS, DBSCAN, CURE, and ROCK are designed to find clusters that fit some static models. These algorithms can breakdown if the choice of parameters in the static model is incorrect with respect to the data set being clustered, or if the model is not adequate to capture the characteristics of clusters. Furthermore, most of these algorithms breakdown when the data consists of clusters that are of diverse shapes, densities, and sizes. In this paper, we present a novel hierarchical clustering algorithm called CHAMELEON that measures the similarity of two clusters based on a dynamic model. In the clustering process, two clusters are merged only if the interconnectivity and closeness (proximity) between two clusters are high relative to the internal intercon...
Evaluation of Hierarchical Clustering Algorithms for Document Datasets
 Data Mining and Knowledge Discovery
, 2002
"... Fast and highquality document clustering algorithms play an important role in providing intuitive navigation and browsing mechanisms by organizing large amounts of information into a small number of meaningful clusters. In particular, hierarchical clustering solutions provide a view of the data at ..."
Abstract

Cited by 261 (6 self)
 Add to MetaCart
Fast and highquality document clustering algorithms play an important role in providing intuitive navigation and browsing mechanisms by organizing large amounts of information into a small number of meaningful clusters. In particular, hierarchical clustering solutions provide a view of the data at different levels of granularity, making them ideal for people to visualize and interactively explore large document collections.
Outlier detection for high dimensional data
, 2001
"... The outlier detection problem has important applications in the eld of fraud detection, netw ork robustness analysis, and intrusion detection. Most suc h applications are high dimensional domains in whic hthe data can con tain hundreds of dimensions. Many recen t algorithms use concepts of pro ximit ..."
Abstract

Cited by 235 (4 self)
 Add to MetaCart
(Show Context)
The outlier detection problem has important applications in the eld of fraud detection, netw ork robustness analysis, and intrusion detection. Most suc h applications are high dimensional domains in whic hthe data can con tain hundreds of dimensions. Many recen t algorithms use concepts of pro ximity in order to nd outliers based on their relationship to the rest of the data. Ho w ever, in high dimensional space, the data is sparse and the notion of proximity fails to retain its meaningfulness. In fact, the sparsity of high dimensional data implies that every point is an almost equally good outlier from the perspective ofproximitybased de nitions. Consequently, for high dimensional data, the notion of nding meaningful outliers becomes substantially more complex and nonobvious. In this paper, w e discuss new techniques for outlier detection whic h nd the outliers by studying the behavior of projections from the data set. 1.
Criterion Functions for Document Clustering: Experiments and Analysis
, 2002
"... In recent years, we have witnessed a tremendous growth in the volume of text documents available on the Internet, digital libraries, news sources, and companywide intranets. This has led to an increased interest in developing methods that can help users to effectively navigate, summarize, and org ..."
Abstract

Cited by 202 (13 self)
 Add to MetaCart
(Show Context)
In recent years, we have witnessed a tremendous growth in the volume of text documents available on the Internet, digital libraries, news sources, and companywide intranets. This has led to an increased interest in developing methods that can help users to effectively navigate, summarize, and organize this information with the ultimate goal of helping them to find what they are looking for. Fast and highquality document clustering algorithms play an important role towards this goal as they have been shown to provide both an intuitive navigation/browsing mechanism by organizing large amounts of information into a small number of meaningful clusters as well as to greatly improve the retrieval performance either via clusterdriven dimensionality reduction, termweighting, or query expansion. This everincreasing importance of document clustering and the expanded range of its applications led to the development of a number of new and novel algorithms with different complexityquality tradeoffs. Among them, a class of clustering algorithms that have relatively low computational requirements are those that treat the clustering problem as an optimization process which seeks to maximize or minimize a particular clustering criterion function defined over the entire clustering solution.
Finding Generalized Projected Clusters in High Dimensional Spaces
"... High dimensional data has always been a challenge for clustering algorithms because of the inherent sparsity of the points. Recent research results indicate that in high dimensional data, even the concept of proximity or clustering may not be meaningful. We discuss very general techniques for projec ..."
Abstract

Cited by 194 (8 self)
 Add to MetaCart
(Show Context)
High dimensional data has always been a challenge for clustering algorithms because of the inherent sparsity of the points. Recent research results indicate that in high dimensional data, even the concept of proximity or clustering may not be meaningful. We discuss very general techniques for projected clustering which are able to construct clusters in arbitrarily aligned subspaces of lower dimensionality. The subspaces are specific to the clusters themselves. This definition is substantially more general and realistic than currently available techniques which limit the method to only projections from the original set of attributes. The generalized projected clustering technique may also be viewed as a way of trying to rede ne clustering for high dimensional applications by searching for hidden subspaces with clusters which are created by interattribute correlations. We provide a new concept of using extended cluster feature vectors in order to make the algorithm scalable for very large databases. The running time and space requirements of the algorithm are adjustable, and are likely to tradeoff with better accuracy.
Entropybased Subspace Clustering for Mining Numerical Data
, 1999
"... Mining numerical data is a relatively difficult problem in data mining. Clustering is one of the techniques. We consider a database with numerical attributes, in which each transaction is viewed as a multidimensional vector. By studying the clusters formed by these vectors, we can discover certain ..."
Abstract

Cited by 162 (1 self)
 Add to MetaCart
Mining numerical data is a relatively difficult problem in data mining. Clustering is one of the techniques. We consider a database with numerical attributes, in which each transaction is viewed as a multidimensional vector. By studying the clusters formed by these vectors, we can discover certain behaviors hidden in the data. Traditional clustering algorithms find clusters in the full space of the data sets. This results in high dimensional clusters, which are poorly comprehensible to human. One important task in this setting is the ability to discover clusters embedded in the subspaces of a highdimensional data set. This problem is known as subspace clustering. We follow the basic assumptions of previous work CLIQUE. It is found that the number of subspaces with clustering is very large, and a criterion called the coverage is proposed in CLIQUE for the pruning. In addition to coverage, we identify new useful criteria for this problem and propose an entropybased algorithm called ENC...
Finding interesting associations without support pruning
 IN ICDE
, 2000
"... Associationrule mining has heretofore relied on the condition of high support to do its work efficiently. In particular, the wellknown apriori algorithm is only effective when the only rules of interest are relationships that occur very frequently. However, there are a number of applications, suc ..."
Abstract

Cited by 162 (15 self)
 Add to MetaCart
(Show Context)
Associationrule mining has heretofore relied on the condition of high support to do its work efficiently. In particular, the wellknown apriori algorithm is only effective when the only rules of interest are relationships that occur very frequently. However, there are a number of applications, such as data mining, identification of similar web documents, clustering, and collaborative filtering, where the rules of interest have comparatively few instances in the data. In these cases, we must look for highly correlated items, or possibly even causal relationships between infrequent items. We develop a family of algorithms for solving this problem, employing a combination of random sampling and hashing techniques. We provide analysis of the algorithms developed, and conduct experiments on real and synthetic data to obtain a comparative performance analysis.
SimilarityBased Queries for Time Series Data
 Proc. 1997 ACMSIGMOD Conf
, 1997
"... We study a set of linear transformations on the Fourier series representation of a sequence that can be used as the basis for similarity queries on timeseries data. We show that our set of transformations is rich enough to formulate operations such as moving average and time warping. We present a q ..."
Abstract

Cited by 158 (6 self)
 Add to MetaCart
We study a set of linear transformations on the Fourier series representation of a sequence that can be used as the basis for similarity queries on timeseries data. We show that our set of transformations is rich enough to formulate operations such as moving average and time warping. We present a query processing algorithm that uses the underlying Rtree index of a multidimensional data set to answer similarity queries efficiently. Our experiments show that the performance of this algorithm is competitive to that of processing ordinary (exact match) queries using the index, and much faster than sequential scanning. We relate our transformations to the general framework for similarity queries of Jagadish et al. 1
Clustering data streams: Theory and practice
 IEEE TKDE
, 2003
"... The data stream model has recently attracted attention for its applicability to numerous types of data, including telephone records, Web documents, and clickstreams. For analysis of such data, the ability to process the data in a single pass, or a small number of passes, while using little memory, ..."
Abstract

Cited by 157 (5 self)
 Add to MetaCart
(Show Context)
The data stream model has recently attracted attention for its applicability to numerous types of data, including telephone records, Web documents, and clickstreams. For analysis of such data, the ability to process the data in a single pass, or a small number of passes, while using little memory, is crucial. We describe such a streaming algorithm that effectively clusters large data streams. We also provide empirical evidence of the algorithm’s performance on synthetic and real data streams.
CLARANS: A Method for Clustering Objects for Spatial Data Mining
 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING
, 2002
"... Spatial data mining is the discovery of interesting relationships and characteristics that may exist implicitly in spatial databases. To this end, this paper has three main contributions. First, we propose a new clustering method called CLARANS, whose aim is to identify spatial structures that may ..."
Abstract

Cited by 142 (0 self)
 Add to MetaCart
(Show Context)
Spatial data mining is the discovery of interesting relationships and characteristics that may exist implicitly in spatial databases. To this end, this paper has three main contributions. First, we propose a new clustering method called CLARANS, whose aim is to identify spatial structures that may be present in the data. Experimental results indicate that, when compared with existing clustering methods, CLARANS is very efficient and effective. Second, we investigate how CLARANS can handle not only points objects, but also polygon objects efficiently. One of the methods considered, called the IRapproximation, is very efficient in clustering convex and nonconvex polygon objects. Third, building on top of CLARANS, we develop two spatial data mining algorithms that aim to discover relationships between spatial and nonspatial attributes. Both algorithms can discover knowledge that is difficult to find with existing spatial data mining algorithms.