Results 1–10 of 20
Using the Fractal Dimension to Cluster Datasets
 In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
, 2000
Abstract

Cited by 52 (5 self)
Clustering is a widely used knowledge discovery technique. It helps uncover structures in data that were not previously known. The clustering of large data sets has received a lot of attention in recent years; however, clustering is still a challenging task, since many published algorithms fail to do well in scaling with the size of the data set and the number of dimensions that describe the points, in finding arbitrary shapes of clusters, or in dealing effectively with the presence of noise. In this paper, we present a new clustering algorithm based on the fractal properties of the data sets. The new algorithm, which we call Fractal Clustering (FC), places points incrementally in the cluster for which the change in the fractal dimension after adding the point is the least. This is a very natural way of clustering points, since points in the same cluster have a great degree of self-similarity among them (and much less self-similarity with respect to points in other clusters). FC requires one scan of the data, is suspendable at will (providing the best answer possible at that point), and is incremental. We show via experiments that FC effectively deals with large data sets, high dimensionality, and noise, and is capable of recognizing clusters of arbitrary shape.
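The incremental rule the abstract describes can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes 2-D points, a crude box-counting estimate of the fractal dimension over a fixed set of grid resolutions, and a hypothetical `assign` helper; the paper's actual estimator and data structures are not given in the abstract.

```python
import math

def box_count(points, cell):
    """Number of grid cells of side `cell` occupied by the 2-D points."""
    return len({(math.floor(x / cell), math.floor(y / cell)) for x, y in points})

def box_dimension(points, cells=(0.5, 0.25, 0.125, 0.0625)):
    """Crude box-counting estimate of the fractal dimension: the slope of
    log N(r) versus log(1/r), fit by least squares over a few resolutions."""
    xs = [math.log(1.0 / c) for c in cells]
    ys = [math.log(box_count(points, c)) for c in cells]
    n = len(cells)
    mx, my = sum(xs) / n, sum(ys) / n
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))

def assign(point, clusters):
    """FC-style rule: place the point in the cluster whose fractal
    dimension changes least when the point is added."""
    def change(cluster):
        return abs(box_dimension(cluster + [point]) - box_dimension(cluster))
    best = min(clusters, key=change)
    best.append(point)
    return clusters.index(best)
```

A point lying on an existing cluster's structure occupies already-counted cells at every scale, so its dimension change is near zero; a point far from a cluster adds a new cell at every scale and perturbs the slope, which is why the minimal-change rule captures self-similarity.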
Pixnostics: Towards measuring the value of visualization
 Symposium On Visual Analytics Science And Technology
Abstract

Cited by 15 (7 self)
During the last two decades, a wide variety of advanced methods for the visual exploration of large data sets have been proposed. For most of these techniques, user interaction has become a crucial element, since there are many situations in which a user or an analyst has to select the right parameter settings from among many, or select a subset of the available attribute space for the visualization process, in order to construct valuable visualizations that provide insight into the data and reveal interesting patterns. The right choice of input parameters is often essential, since suboptimal parameter settings or the investigation of irrelevant data dimensions make the exploration process more time-consuming and may result in wrong conclusions. In this paper, we propose a novel method for automatically determining meaningful parameter and attribute settings based on the information content of the resulting visualizations. Our technique, called Pixnostics in analogy to Scagnostics [1], automatically analyses pixel images resulting from diverse parameter mappings and ranks them according to their potential value for the user. This allows a more effective and more efficient visual data analysis process, since the attribute/parameter space is reduced to meaningful selections and the analyst thus obtains faster insight into the data. Real-world applications are provided to show the benefit of the proposed approach.
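The abstract does not state which information measure Pixnostics uses, so the following is only a hedged stand-in: ranking candidate pixel images by the Shannon entropy of their pixel-value histogram, one plausible proxy for "information content". The function names are illustrative, not from the paper.

```python
import math
from collections import Counter

def pixel_entropy(image):
    """Shannon entropy (in bits) of the pixel-value histogram of a
    grayscale image given as a list of rows; a crude proxy for the
    information content of a rendered visualization."""
    counts = Counter(v for row in image for v in row)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def rank_visualizations(images):
    """Order candidate images from most to least informative, so an
    analyst inspects the highest-ranked parameter settings first."""
    return sorted(images, key=pixel_entropy, reverse=True)
```

A constant image scores zero bits and sinks to the bottom of the ranking, which mirrors the paper's goal of filtering out uninformative parameter settings before the analyst sees them.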
An Efficient Clustering Algorithm for Market Basket Data Based on Small Large Ratios
 In: Proc. of COMPSAC’01
, 2001
Abstract

Cited by 9 (1 self)
In this paper, we devise an efficient algorithm for clustering market-basket data items. In view of the nature of clustering market-basket data, we devise a novel measurement, called the small-large (abbreviated as SL) ratio, and utilize this ratio to perform the clustering. With this SL-ratio measurement, we develop an efficient clustering algorithm for data items that minimizes the SL ratio in each group. The proposed algorithm not only incurs an execution time significantly smaller than that of prior work but also leads to clustering results of very good quality.
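The abstract does not define the SL ratio precisely, so the sketch below assumes one common formulation: within a cluster of transactions, an item is "large" if its in-cluster support reaches a threshold and "small" otherwise, and the cluster's SL ratio is the count of small items over the count of large items. Parameter names and the threshold default are illustrative only.

```python
from collections import Counter

def sl_ratio(cluster, min_support=0.5):
    """Small-large ratio of a cluster of transactions (item sets),
    under the assumption stated above. An item is 'large' if it
    appears in at least `min_support` of the cluster's transactions,
    otherwise 'small'; the clustering tries to minimize |small|/|large|.
    """
    counts = Counter(item for t in cluster for item in set(t))
    n = len(cluster)
    large = {i for i, c in counts.items() if c / n >= min_support}
    small = [i for i in counts if i not in large]
    # A cluster with no large item is maximally bad for this criterion.
    return len(small) / len(large) if large else float('inf')
```

A low ratio means most items in the cluster are shared by its transactions, i.e. the baskets genuinely belong together.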
Using self-similarity to cluster large data sets
 Data Mining and Knowledge Discovery
, 2003
Abstract

Cited by 7 (1 self)
Clustering is a widely used knowledge discovery technique. It helps uncover structures in data that were not previously known. The clustering of large data sets has received a lot of attention in recent years; however, clustering is still a challenging task, since many published algorithms fail to do well in scaling with the size of the data set and the number of dimensions that describe the points, in finding arbitrary shapes of clusters, or in dealing effectively with the presence of noise. In this paper, we present a new clustering algorithm based on the self-similarity properties of the data sets. Self-similarity is the property of being invariant with respect to the scale used to look at the data set. While fractals are self-similar at every scale used to look at them, many data sets exhibit self-similarity over a range of scales. Self-similarity can be measured using the fractal dimension. The new algorithm, which we call Fractal Clustering (FC), places points incrementally in the cluster for which the change in the fractal dimension after adding the point is the least. This is a very natural way of clustering points, since points in the same cluster have a great degree of self-similarity among them (and much less self-similarity with respect to points in other clusters). FC requires one scan of the data, is suspendable at will (providing the best answer possible at that point), and is incremental. We show via experiments that FC effectively deals with large data sets, high dimensionality, and noise, and is capable of recognizing clusters of arbitrary shape.
Towards a Simple Clustering Criterion Based on Minimum Length Encoding
 In Proceedings of the 13th European Conference on Machine Learning (ECML’02)
, 2002
Abstract

Cited by 3 (2 self)
We propose a simple and intuitive clustering evaluation criterion based on the minimum description length principle, which yields a particularly simple way of describing and encoding a set of examples. The basic idea is to view a clustering as a restriction of the attribute domains, given an example's cluster membership. As a special operational case, we develop the so-called rectangular uniform message length measure, which can be used to evaluate clusterings described as sets of hyperrectangles.
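The "restriction of the attribute domains" idea can be made concrete for discrete attributes. The sketch below is one plausible reading, not the paper's exact measure: each example costs log2(k) bits to name its cluster, plus a uniform code over each attribute's restricted domain inside that cluster (the hyperrectangle). The function name is an assumption.

```python
import math

def message_length(clusters):
    """Rectangular-uniform code length (bits) for a clustering of
    discrete-valued examples given as tuples. Per example:
    log2(#clusters) to name the cluster, plus log2 of the size of
    each attribute's restricted domain within that cluster."""
    k = len(clusters)
    total = 0.0
    for cluster in clusters:
        # Restricted domain of each attribute inside this cluster.
        domains = [set(col) for col in zip(*cluster)]
        per_example = math.log2(k) + sum(math.log2(len(d)) for d in domains)
        total += per_example * len(cluster)
    return total
```

Splitting correlated data shrinks the per-cluster domains enough to outweigh the log2(k) membership cost, so the criterion prefers clusterings whose clusters are tight hyperrectangles.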
Density-Based Centroid Approximation for Initializing Iterative Clustering Algorithms
, 2002
Abstract

Cited by 1 (0 self)
We present KDI (Kernel Density Initialization), a density-based procedure for approximating centroids for the initialization step of iteration-based clustering algorithms. We show empirically that a rather low number of distance calculations, in conjunction with a fast algorithm for finding the highest peaks, is sufficient for effectively and efficiently finding a pre-specified number of good centroids, which can subsequently be used as initial cluster centers. Finally, we evaluate our algorithm on several real-world data sets against two well-known methods from the literature and show that KDI achieves favorable results.
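The peak-picking idea can be sketched as follows. This is a toy 1-D illustration under stated assumptions, not the paper's algorithm: density is estimated with a Gaussian kernel at the data points themselves, and peaks closer than a separation threshold to an already chosen peak are skipped; the `min_sep` heuristic and all names are assumptions.

```python
import math

def density(p, data, bandwidth=1.0):
    """Gaussian kernel density estimate at point p (1-D for brevity)."""
    return sum(math.exp(-((p - x) / bandwidth) ** 2 / 2) for x in data)

def kdi(data, k, bandwidth=1.0, min_sep=1.0):
    """Pick k initial centroids at the highest density peaks, skipping
    candidates within `min_sep` of an already chosen peak so that all
    k centroids land in distinct dense regions."""
    ranked = sorted(data, key=lambda p: density(p, data, bandwidth),
                    reverse=True)
    centroids = []
    for p in ranked:
        if all(abs(p - c) >= min_sep for c in centroids):
            centroids.append(p)
        if len(centroids) == k:
            break
    return centroids
```

Seeding an iterative algorithm such as k-means with these peaks, rather than random points, tends to avoid the empty-cluster and merged-cluster pathologies of bad initializations.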
Image Cluster Compression using Partitioned Iterated Function Systems and efficient Inter-Image Similarity Features
 in SITIS 2007
Abstract

Cited by 1 (1 self)
When dealing with large-scale image archive systems, efficient data compression is crucial for the economic storage of data. Currently, most image compression algorithms work only on a per-picture basis; however, most image databases (both private and commercial) contain high redundancies between images, especially when many images of the same objects, persons, or locations exist, or many images were made with the same camera. In order to exploit those correlations, it is desirable to apply image compression not only to individual images but also to groups of images, in order to gain better compression rates by exploiting inter-image redundancies. This paper proposes to employ a multi-image fractal Partitioned Iterated Function System (PIFS) for compressing image groups and exploiting correlations between images. In order to partition an image database into optimal groups to be compressed with this algorithm, a number of metrics are derived based on the normalized compression distance (NCD) of the PIFS algorithm. We compare a number of relational and hierarchical clustering algorithms based on the said metric. In particular, we show how a reasonably good approximation of optimal image clusters can be obtained by an approximation of the NCD and nCut clustering. While the results in this paper are primarily derived from PIFS, they can also be applied to other compression algorithms for image groups.
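The NCD itself is compressor-agnostic and easy to demonstrate. The sketch below uses `zlib` purely as a stand-in for the paper's PIFS-based compressor: two inputs that compress well together (low NCD) are candidates for the same group.

```python
import zlib

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance with a generic compressor
    (zlib here, standing in for the paper's PIFS compressor):
    NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
    where C(s) is the compressed size of s."""
    cx = len(zlib.compress(x))
    cy = len(zlib.compress(y))
    cxy = len(zlib.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)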
Abstract
Daniel Barbara, WAIM’00. Corporations and organizations have huge databases containing a wealth of knowledge, yet current DBMS and data warehouse (DW) tools offer very little to extract that knowledge. Data mining: the discovery of (previously unknown) patterns in (large) data sets.
A Genetic Algorithm to Exploit Genetic Data
Abstract
In this chapter, we are interested in discovering genetic and environmental factors that are involved in multifactorial diseases. To that end, experiments have been carried out by the Biological Institute of Lille (France), and a lot of data has been generated. To exploit this data, data mining tools are required, and we propose a two-phase optimization approach using a specific genetic algorithm. During the first step, we select significant features from a very large set with a genetic algorithm.
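The first step, feature selection with a genetic algorithm, can be sketched generically. This is a textbook-style illustration and not the chapter's specific algorithm: chromosomes are bit lists (1 = feature kept), with truncation selection, one-point crossover, and bit-flip mutation; the `score` function and all parameters are assumptions.

```python
import random

def select_features(score, n_features, pop_size=20, generations=30,
                    mutation_rate=0.05, seed=0):
    """Tiny genetic algorithm for feature-subset selection. `score`
    maps a bit list to a fitness value to maximize (e.g. accuracy of
    a model trained on the selected features)."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_features)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=score, reverse=True)
        survivors = pop[:pop_size // 2]          # truncation selection
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            cut = rng.randrange(1, n_features)   # one-point crossover
            child = a[:cut] + b[cut:]
            child = [bit ^ (rng.random() < mutation_rate)
                     for bit in child]           # bit-flip mutation
            children.append(child)
        pop = survivors + children
    return max(pop, key=score)
```

Because the fittest chromosomes always survive, the best subset found never degrades across generations; the GA merely needs enough generations for crossover and mutation to explore combinations a greedy filter would miss.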
A data mining approach to discover genetic and environmental factors involved in multifactorial diseases
, 2006
Abstract
A data mining approach to discover genetic and environmental factors involved in multifactorial diseases