Results 1–10 of 20
Using the Fractal Dimension to Cluster Datasets
 In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
, 2000
Abstract

Cited by 52 (5 self)
Clustering is a widely used knowledge discovery technique. It helps uncover structures in data that were not previously known. The clustering of large data sets has received a lot of attention in recent years; however, clustering is still a challenging task, since many published algorithms fail to do well in scaling with the size of the data set and the number of dimensions that describe the points, in finding arbitrary shapes of clusters, or in dealing effectively with the presence of noise. In this paper, we present a new clustering algorithm based on the fractal properties of the data sets. The new algorithm, which we call Fractal Clustering (FC), places points incrementally in the cluster for which the change in the fractal dimension after adding the point is the least. This is a very natural way of clustering points, since points in the same cluster have a great degree of self-similarity among them (and much less self-similarity with respect to points in other clusters). FC requires one scan of the data, is suspendable at will (providing the best answer possible at that point), and is incremental. We show via experiments that FC effectively deals with large data sets, high dimensionality, and noise, and is capable of recognizing clusters of arbitrary shape.
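The incremental rule the abstract describes can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes 2-D points, a crude box-counting estimate of the fractal dimension over a fixed set of grid resolutions, and a hypothetical `assign` helper; the paper's actual estimator and data structures are not given in the abstract.

```python
import math

def box_count(points, cell):
    """Number of grid cells of side `cell` occupied by the 2-D points."""
    return len({(math.floor(x / cell), math.floor(y / cell)) for x, y in points})

def box_dimension(points, cells=(0.5, 0.25, 0.125, 0.0625)):
    """Crude box-counting estimate of the fractal dimension: the slope of
    log N(r) versus log(1/r), fit by least squares over a few resolutions."""
    xs = [math.log(1.0 / c) for c in cells]
    ys = [math.log(box_count(points, c)) for c in cells]
    n = len(cells)
    mx, my = sum(xs) / n, sum(ys) / n
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))

def assign(point, clusters):
    """FC-style rule: place the point in the cluster whose fractal
    dimension changes least when the point is added."""
    def change(cluster):
        return abs(box_dimension(cluster + [point]) - box_dimension(cluster))
    best = min(clusters, key=change)
    best.append(point)
    return clusters.index(best)
```

A point lying on an existing cluster's structure occupies already-counted cells at every scale, so its dimension change is near zero; a point far from a cluster adds a new cell at every scale and perturbs the slope, which is why the minimal-change rule captures self-similarity.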
Pixnostics: Towards measuring the value of visualization
 Symposium On Visual Analytics Science And Technology
Abstract

Cited by 15 (7 self)
During the last two decades, a wide variety of advanced methods for the visual exploration of large data sets have been proposed. For most of these techniques, user interaction has become a crucial element, since there are many situations in which a user or an analyst has to select the right parameter settings from among many, or select a subset of the available attribute space for the visualization process, in order to construct valuable visualizations that provide insight into the data and reveal interesting patterns. The right choice of input parameters is often essential, since suboptimal parameter settings or the investigation of irrelevant data dimensions make the exploration process more time-consuming and may result in wrong conclusions. In this paper, we propose a novel method for automatically determining meaningful parameter and attribute settings based on the information content of the resulting visualizations. Our technique, called Pixnostics in analogy to Scagnostics [1], automatically analyses pixel images resulting from diverse parameter mappings and ranks them according to their potential value for the user. This allows a more effective and more efficient visual data analysis process, since the attribute/parameter space is reduced to meaningful selections and the analyst thus obtains faster insight into the data. Real-world applications are provided to show the benefit of the proposed approach.
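The abstract does not state which information measure Pixnostics uses, so the following is only a hedged stand-in: ranking candidate pixel images by the Shannon entropy of their pixel-value histogram, one plausible proxy for "information content". The function names are illustrative, not from the paper.

```python
import math
from collections import Counter

def pixel_entropy(image):
    """Shannon entropy (in bits) of the pixel-value histogram of a
    grayscale image given as a list of rows; a crude proxy for the
    information content of a rendered visualization."""
    counts = Counter(v for row in image for v in row)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def rank_visualizations(images):
    """Order candidate images from most to least informative, so an
    analyst inspects the highest-ranked parameter settings first."""
    return sorted(images, key=pixel_entropy, reverse=True)
```

A constant image scores zero bits and sinks to the bottom of the ranking, which mirrors the paper's goal of filtering out uninformative parameter settings before the analyst sees them.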
An Efficient Clustering Algorithm for Market Basket Data Based on Small Large Ratios
 In: Proc. of COMPSAC’01
, 2001
Abstract

Cited by 9 (1 self)
In this paper, we devise an efficient algorithm for clustering market-basket data items. In view of the nature of clustering market-basket data, we devise a novel measurement, called the small-large (abbreviated as SL) ratio, and utilize this ratio to perform the clustering. With this SL-ratio measurement, we develop an efficient clustering algorithm for data items that minimizes the SL ratio in each group. The proposed algorithm not only incurs an execution time significantly smaller than that of prior work but also leads to clustering results of very good quality.
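The abstract does not define the SL ratio precisely, so the sketch below assumes one common formulation: within a cluster of transactions, an item is "large" if its in-cluster support reaches a threshold and "small" otherwise, and the cluster's SL ratio is the count of small items over the count of large items. Parameter names and the threshold default are illustrative only.

```python
from collections import Counter

def sl_ratio(cluster, min_support=0.5):
    """Small-large ratio of a cluster of transactions (item sets),
    under the assumption stated above. An item is 'large' if it
    appears in at least `min_support` of the cluster's transactions,
    otherwise 'small'; the clustering tries to minimize |small|/|large|.
    """
    counts = Counter(item for t in cluster for item in set(t))
    n = len(cluster)
    large = {i for i, c in counts.items() if c / n >= min_support}
    small = [i for i in counts if i not in large]
    # A cluster with no large item is maximally bad for this criterion.
    return len(small) / len(large) if large else float('inf')
```

A low ratio means most items in the cluster are shared by its transactions, i.e. the baskets genuinely belong together.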
Using self-similarity to cluster large data sets
 Data Mining and Knowledge Discovery
, 2003
Abstract

Cited by 7 (1 self)
Clustering is a widely used knowledge discovery technique. It helps uncover structures in data that were not previously known. The clustering of large data sets has received a lot of attention in recent years; however, clustering is still a challenging task, since many published algorithms fail to do well in scaling with the size of the data set and the number of dimensions that describe the points, in finding arbitrary shapes of clusters, or in dealing effectively with the presence of noise. In this paper, we present a new clustering algorithm based on the self-similarity properties of the data sets. Self-similarity is the property of being invariant with respect to the scale used to look at the data set. While fractals are self-similar at every scale used to look at them, many data sets exhibit self-similarity over a range of scales. Self-similarity can be measured using the fractal dimension. The new algorithm, which we call Fractal Clustering (FC), places points incrementally in the cluster for which the change in the fractal dimension after adding the point is the least. This is a very natural way of clustering points, since points in the same cluster have a great degree of self-similarity among them (and much less self-similarity with respect to points in other clusters). FC requires one scan of the data, is suspendable at will (providing the best answer possible at that point), and is incremental. We show via experiments that FC effectively deals with large data sets, high dimensionality, and noise, and is capable of recognizing clusters of arbitrary shape.
Towards a Simple Clustering Criterion Based on Minimum Length Encoding
 In Proceedings of the 13th European Conference on Machine Learning (ECML’02)
, 2002
Abstract

Cited by 3 (2 self)
We propose a simple and intuitive clustering evaluation criterion based on the minimum description length principle, which yields a particularly simple way of describing and encoding a set of examples. The basic idea is to view a clustering as a restriction of the attribute domains, given an example's cluster membership. As a special operational case, we develop the so-called rectangular uniform message length measure, which can be used to evaluate clusterings described as sets of hyperrectangles.
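The "restriction of the attribute domains" idea can be made concrete for discrete attributes. The sketch below is one plausible reading, not the paper's exact measure: each example costs log2(k) bits to name its cluster, plus a uniform code over each attribute's restricted domain inside that cluster (the hyperrectangle). The function name is an assumption.

```python
import math

def message_length(clusters):
    """Rectangular-uniform code length (bits) for a clustering of
    discrete-valued examples given as tuples. Per example:
    log2(#clusters) to name the cluster, plus log2 of the size of
    each attribute's restricted domain within that cluster."""
    k = len(clusters)
    total = 0.0
    for cluster in clusters:
        # Restricted domain of each attribute inside this cluster.
        domains = [set(col) for col in zip(*cluster)]
        per_example = math.log2(k) + sum(math.log2(len(d)) for d in domains)
        total += per_example * len(cluster)
    return total
```

Splitting correlated data shrinks the per-cluster domains enough to outweigh the log2(k) membership cost, so the criterion prefers clusterings whose clusters are tight hyperrectangles.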
Density-Based Centroid Approximation for Initializing Iterative Clustering Algorithms
, 2002
Abstract

Cited by 1 (0 self)
We present KDI (Kernel Density Initialization), a density-based procedure for approximating centroids for the initialization step of iteration-based clustering algorithms. We show empirically that a rather low number of distance calculations, in conjunction with a fast algorithm for finding the highest peaks, is sufficient for effectively and efficiently finding a pre-specified number of good centroids, which can subsequently be used as initial cluster centers. Finally, we evaluate our algorithm on several real-world data sets against two well-known methods from the literature and show that KDI achieves favorable results.
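The peak-picking idea can be sketched as follows. This is a toy 1-D illustration under stated assumptions, not the paper's algorithm: density is estimated with a Gaussian kernel at the data points themselves, and peaks closer than a separation threshold to an already chosen peak are skipped; the `min_sep` heuristic and all names are assumptions.

```python
import math

def density(p, data, bandwidth=1.0):
    """Gaussian kernel density estimate at point p (1-D for brevity)."""
    return sum(math.exp(-((p - x) / bandwidth) ** 2 / 2) for x in data)

def kdi(data, k, bandwidth=1.0, min_sep=1.0):
    """Pick k initial centroids at the highest density peaks, skipping
    candidates within `min_sep` of an already chosen peak so that all
    k centroids land in distinct dense regions."""
    ranked = sorted(data, key=lambda p: density(p, data, bandwidth),
                    reverse=True)
    centroids = []
    for p in ranked:
        if all(abs(p - c) >= min_sep for c in centroids):
            centroids.append(p)
        if len(centroids) == k:
            break
    return centroids
```

Seeding an iterative algorithm such as k-means with these peaks, rather than random points, tends to avoid the empty-cluster and merged-cluster pathologies of bad initializations.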
Image Cluster Compression using Partitioned Iterated Function Systems and efficient Inter-Image Similarity Features
 in SITIS 2007
Abstract

Cited by 1 (1 self)
When dealing with large-scale image archive systems, efficient data compression is crucial for the economic storage of data. Currently, most image compression algorithms work only on a per-picture basis; however, most image databases (both private and commercial) contain high redundancies between images, especially when many images of the same objects, persons, or locations exist, or many images were made with the same camera. In order to exploit those correlations, it is desirable to apply image compression not only to individual images but also to groups of images, in order to gain better compression rates by exploiting inter-image redundancies. This paper proposes to employ a multi-image fractal Partitioned Iterated Function System (PIFS) for compressing image groups and exploiting correlations between images. In order to partition an image database into optimal groups to be compressed with this algorithm, a number of metrics are derived based on the normalized compression distance (NCD) of the PIFS algorithm. We compare a number of relational and hierarchical clustering algorithms based on the said metric. In particular, we show how a reasonably good approximation of optimal image clusters can be obtained by an approximation of the NCD and nCut clustering. While the results in this paper are primarily derived from PIFS, they can also be applied to other compression algorithms for image groups.
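The NCD itself is compressor-agnostic and easy to demonstrate. The sketch below uses `zlib` purely as a stand-in for the paper's PIFS-based compressor: two inputs that compress well together (low NCD) are candidates for the same group.

```python
import zlib

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance with a generic compressor
    (zlib here, standing in for the paper's PIFS compressor):
    NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
    where C(s) is the compressed size of s."""
    cx = len(zlib.compress(x))
    cy = len(zlib.compress(y))
    cxy = len(zlib.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)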
Abstract
Daniel Barbara, WAIM’00. Corporations and organizations have huge databases containing a wealth of knowledge, yet current DBMS and data warehouse (DW) tools offer very little to extract that knowledge. Data mining: the discovery of (previously unknown) patterns in (large) data sets.
A Genetic Algorithm to Exploit Genetic Data
Abstract
In this chapter, we are interested in discovering genetic and environmental factors that are involved in multifactorial diseases. To that end, experiments have been carried out by the Biological Institute of Lille (France), and a lot of data has been generated. To exploit this data, data mining tools are required, and we propose a two-phase optimization approach using a specific genetic algorithm. During the first step, we select significant features from a very large set with a genetic algorithm.
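The first step, feature selection with a genetic algorithm, can be sketched generically. This is a textbook-style illustration and not the chapter's specific algorithm: chromosomes are bit lists (1 = feature kept), with truncation selection, one-point crossover, and bit-flip mutation; the `score` function and all parameters are assumptions.

```python
import random

def select_features(score, n_features, pop_size=20, generations=30,
                    mutation_rate=0.05, seed=0):
    """Tiny genetic algorithm for feature-subset selection. `score`
    maps a bit list to a fitness value to maximize (e.g. accuracy of
    a model trained on the selected features)."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_features)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=score, reverse=True)
        survivors = pop[:pop_size // 2]          # truncation selection
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            cut = rng.randrange(1, n_features)   # one-point crossover
            child = a[:cut] + b[cut:]
            child = [bit ^ (rng.random() < mutation_rate)
                     for bit in child]           # bit-flip mutation
            children.append(child)
        pop = survivors + children
    return max(pop, key=score)
```

Because the fittest chromosomes always survive, the best subset found never degrades across generations; the GA merely needs enough generations for crossover and mutation to explore combinations a greedy filter would miss.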
A data mining approach to discover genetic and environmental factors involved in multifactorial diseases
, 2006
Abstract
A data mining approach to discover genetic and environmental factors involved in multifactorial diseases