Results 1  10
of
193
Survey of clustering data mining techniques
, 2002
"... Accrue Software, Inc. Clustering is a division of data into groups of similar objects. Representing the data by fewer clusters necessarily loses certain fine details, but achieves simplification. It models data by its clusters. Data modeling puts clustering in a historical perspective rooted in math ..."
Abstract

Cited by 400 (0 self)
 Add to MetaCart
(Show Context)
Accrue Software, Inc. Clustering is a division of data into groups of similar objects. Representing the data by fewer clusters necessarily loses certain fine details, but achieves simplification. It models data by its clusters. Data modeling puts clustering in a historical perspective rooted in mathematics, statistics, and numerical analysis. From a machine learning perspective clusters correspond to hidden patterns, the search for clusters is unsupervised learning, and the resulting system represents a data concept. From a practical perspective clustering plays an outstanding role in data mining applications such as scientific data exploration, information retrieval and text mining, spatial database applications, Web analysis, CRM, marketing, medical diagnostics, computational biology, and many others. Clustering is the subject of active research in several fields such as statistics, pattern recognition, and machine learning. This survey focuses on clustering in data mining. Data mining adds to clustering the complications of very large datasets with very many attributes of different types. This imposes unique
Subspace clustering for high dimensional data: a review
 ACM SIGKDD Explorations Newsletter
, 2004
"... Subspace clustering for high dimensional data: ..."
Using the Triangle Inequality to Accelerate kMeans
, 2003
"... The kmeans algorithm is by far the most widely used method for discovering clusters in data. We show how to accelerate it dramatically, while still always computing exactly the same result as the standard algorithm. The accelerated algorithm avoids unnecessary distance calculations by applying the ..."
Abstract

Cited by 153 (1 self)
 Add to MetaCart
The kmeans algorithm is by far the most widely used method for discovering clusters in data. We show how to accelerate it dramatically, while still always computing exactly the same result as the standard algorithm. The accelerated algorithm avoids unnecessary distance calculations by applying the triangle inequality in two different ways, and by keeping track of lower and upper bounds for distances between points and centers. Experiments show that the new algorithm is effective for datasets with up to 1000 dimensions, and becomes more and more effective as the number k of clusters increases. For k>=20 it is many times faster than the best previously known accelerated kmeans method.
CBSA: contentbased soft annotation for multimodal image retrieval using Bayes point machines
 IEEE Transactions on Circuits and Systems for Video Technology
, 2003
"... ..."
Clustering of Time Series Subsequences is Meaningless: Implications for Past and Future Research
 In Proc. of the 3rd IEEE International Conference on Data Mining
, 2003
"... Time series data is perhaps the most frequently encountered type of data examined by the data mining community. Clustering is perhaps the most frequently used data mining algorithm, being useful in it’s own right as an exploratory technique, and also as a subroutine in more complex data mining algor ..."
Abstract

Cited by 112 (17 self)
 Add to MetaCart
(Show Context)
Time series data is perhaps the most frequently encountered type of data examined by the data mining community. Clustering is perhaps the most frequently used data mining algorithm, being useful in it’s own right as an exploratory technique, and also as a subroutine in more complex data mining algorithms such as rule discovery, indexing, summarization, anomaly detection, and classification. Given these two facts, it is hardly surprising that time series clustering has attracted much attention. The data to be clustered can be in one of two formats: many individual time series, or a single time series, from which individual time series are extracted with a sliding window. Given the recent explosion of interest in streaming data and online algorithms, the latter case has received much attention. In this work we make a surprising claim. Clustering of streaming time series is completely meaningless. More concretely, clusters extracted from streaming time series are forced to obey a certain constraint that is pathologically unlikely to be satisfied by any dataset, and because of this, the clusters extracted by any clustering algorithm are essentially random. While this constraint can be intuitively demonstrated with a simple illustration and is simple to prove, it has never appeared in the literature. We can justify calling our claim surprising, since it invalidates the contribution of dozens of previously published papers. We will justify our claim with a theorem, illustrative examples, and a comprehensive set of experiments on reimplementations of previous work. Although the primary contribution of our work is to draw attention to the fact that an apparent solution to an important problem is incorrect and should no longer be used, we also introduce a novel method which, based on the concept of time series motifs, is able to meaningfully cluster some streaming time series datasets.
Databionic visualization of music collections according to perceptual distance
 In Proceedings of the 6th International Conference on Music Information Retrieval (ISMIR’05
, 2005
"... We describe the MusicMiner system for organizing large collections of music with databionic mining techniques. Low level audio features are extracted from the raw audio data on short time windows during which the sound is assumed to be stationary. Static and temporal statistics were consistently and ..."
Abstract

Cited by 34 (1 self)
 Add to MetaCart
(Show Context)
We describe the MusicMiner system for organizing large collections of music with databionic mining techniques. Low level audio features are extracted from the raw audio data on short time windows during which the sound is assumed to be stationary. Static and temporal statistics were consistently and systematically used for aggregation of low level features to form high level features. A supervised feature selection targeted to model perceptual distance between different sounding music lead to a small set of nonredundant sound features. Clustering and visualization based on these feature vectors can discover emergent structures in collections of music. Visualization based on Emergent SelfOrganizing Maps in particular enables the unsupervised discovery of timbrally consistent clusters that may or may not correspond to musical genres and artists. We demonstrate the visualizations capabilities of the UMap, displaying local sound differences based on the new audio features. An intuitive browsing of large music collections is offered based on the paradigm of topographic maps. The user can navigate the sound space and interact with the maps to play music or show the context of a song.
The curse of dimensionality in data mining and time series prediction
 Computational Intelligence and Bioinspired Systems, Lecture Notes in Computer Science 3512
, 2005
"... www.ucl.ac.be/mlg Abstract. Modern data analysis tools have to work on highdimensional data, whose components are not independently distributed. Highdimensional spaces show surprising, counterintuitive geometrical properties that have a large influence on the performances of data analysis tools. ..."
Abstract

Cited by 32 (0 self)
 Add to MetaCart
(Show Context)
www.ucl.ac.be/mlg Abstract. Modern data analysis tools have to work on highdimensional data, whose components are not independently distributed. Highdimensional spaces show surprising, counterintuitive geometrical properties that have a large influence on the performances of data analysis tools. Among these properties, the concentration of the norm phenomenon results in the fact that Euclidean norms and Gaussian kernels, both commonly used in models, become inappropriate in highdimensional spaces. This papers presents alternative distance measures and kernels, together with geometrical methods to decrease the dimension of the space. The methodology is applied to a typical time series prediction example. 1
AngleBased Outlier Detection in Highdimensional Data
"... Detectingoutliersinalargesetofdataobjectsisamajor data mining task aiming at finding different mechanisms responsible for different groups of objects in a data set. All existing approaches, however, are based on an assessment of distances (sometimes indirectly by assuming certain distributions) in t ..."
Abstract

Cited by 31 (8 self)
 Add to MetaCart
(Show Context)
Detectingoutliersinalargesetofdataobjectsisamajor data mining task aiming at finding different mechanisms responsible for different groups of objects in a data set. All existing approaches, however, are based on an assessment of distances (sometimes indirectly by assuming certain distributions) in the fulldimensional Euclidean data space. In highdimensional data, these approaches are bound to deteriorate due to the notorious “curse of dimensionality”. In this paper, we propose a novel approach named ABOD (AngleBased Outlier Detection) and some variants assessing the variance in the angles between the difference vectors of a point to the other points. This way, the effects of the “curse of dimensionality ” are alleviated compared to purely distancebased approaches. A main advantage of our new approach is that our method does not rely on any parameter selection influencing the quality of the achieved ranking. In a thorough experimental evaluation, we compare ABOD to the wellestablished distancebased method LOF for various artificial and a real world data set and show ABOD to perform especially well on highdimensional data.
Robust Similarity Measures for Mobile Object Trajectories
 Proc. of DEXA Workshops
, 2002
"... We investigate techniques for similarity analysis of spatiotemporal trajectories for mobile objects. Such kind of data may contain a great amount of outliers, which degrades the performance of Euclidean and Time Warping Distance. Therefore, here we propose the use of nonmetric distance functions b ..."
Abstract

Cited by 31 (1 self)
 Add to MetaCart
(Show Context)
We investigate techniques for similarity analysis of spatiotemporal trajectories for mobile objects. Such kind of data may contain a great amount of outliers, which degrades the performance of Euclidean and Time Warping Distance. Therefore, here we propose the use of nonmetric distance functions based on the Longest Common Subsequence (LCSS), in conjunction with a sigmoidal matching function. Finally, we compare these new methods to various L p Norms and also to Time Warping distance (for real and synthetic data) and we present experimental results that validate the accuracy and efficiency of our approach, especially under the strong presence of noise.