Determining the number of clusters/segments in hierarchical clustering/segmentation algorithms
, 2003
Cited by 36 (1 self)
Many clustering and segmentation algorithms suffer from the limitation that the number of clusters/segments is specified by a human user. It is often impractical to expect a human with sufficient domain knowledge to be available to select the number of clusters/segments to return. In this paper, we investigate techniques to determine the number of clusters or segments to return from hierarchical clustering and segmentation algorithms. We propose an efficient algorithm, the L method, that finds the “knee” in a ‘# of clusters vs. clustering evaluation metric’ graph. Using the knee is a well-known, but not particularly well-understood, method to determine the number of clusters. We explore the feasibility of this method, and attempt to determine in which situations it will and will not work. We also compare the L method to existing methods in terms of efficiency and the accuracy of the number of clusters determined. Our results show favorable performance for these criteria compared to the existing methods that were evaluated.
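As a rough illustration of the knee-finding idea in the abstract above, the sketch below fits one straight line to the left part of the evaluation curve and another to the right part, then picks the split with the lowest size-weighted RMSE. The function names `fit_line_rmse` and `find_knee` and the exact weighting are assumptions made for this example, not details taken from the abstract.

```python
from math import sqrt


def fit_line_rmse(xs, ys):
    """Fit a least-squares line to (xs, ys) and return the RMSE of the fit."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    intercept = mean_y - slope * mean_x
    sq_err = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    return sqrt(sq_err / n)


def find_knee(xs, ys):
    """Return the x value at which a two-line fit of the curve is best.

    xs: candidate numbers of clusters, in increasing order.
    ys: the evaluation metric observed for each entry of xs.
    """
    n = len(xs)
    if n < 4:
        raise ValueError("need at least four points to fit two lines")
    best_x, best_err = None, float("inf")
    # Each side of the split must keep at least two points.
    for split in range(2, n - 1):
        left = fit_line_rmse(xs[:split], ys[:split])
        right = fit_line_rmse(xs[split:], ys[split:])
        # Weight each side's error by the share of points it covers.
        total = (split / n) * left + ((n - split) / n) * right
        if total < best_err:
            best_err, best_x = total, xs[split - 1]
    return best_x


# Hypothetical curve: the metric drops steeply, then flattens out.
knee = find_knee(list(range(1, 11)), [100, 80, 60, 40, 20, 18, 17, 16, 15, 14])
```

The returned value is the last x on the steep side of the curve, which can be read as the suggested number of clusters.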
Visually mining and monitoring massive time series
- In Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
, 2004
Cited by 29 (9 self)
Moments before the launch of every space vehicle, engineering discipline specialists must make a critical go/no-go decision. The cost of a false positive, allowing a launch in spite of a fault, or a false negative, stopping a potentially successful launch, can be measured in the tens of millions of dollars, not including the cost in morale and other more intangible detriments. The Aerospace Corporation is responsible for providing engineering assessments critical to the go/no-go decision for every Department of Defense space vehicle. These assessments are made by constantly monitoring streaming telemetry data in the hours before launch. We will introduce VizTree, a novel time-series visualization tool to aid the Aerospace analysts who must make these engineering assessments. VizTree was developed at the University of California, Riverside and is unique in that the same tool is used for mining archival data and monitoring incoming live telemetry. The use of a single tool for both aspects of the task allows a natural and intuitive transfer of mined knowledge to the monitoring task. Our visualization approach works by transforming the time series into a symbolic representation, and encoding the data in a modified suffix tree in which the frequency and other properties of patterns are mapped onto colors and other visual properties. We demonstrate the utility of our system by comparing it with state-of-the-art batch algorithms on several real and synthetic datasets.
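The abstract above does not say which symbolic representation is used. As a hedged illustration of the general idea, the sketch below uses one common discretization: z-normalize the series, average it over equal-width segments, and map each average to a letter using equiprobable breakpoints of the standard normal distribution (the scheme used by SAX). The function name `symbolize` and its parameters are assumptions made for this example.

```python
from bisect import bisect_right
from statistics import NormalDist, mean, pstdev


def symbolize(series, n_segments, alphabet="abcd"):
    """Turn a numeric series into a short string, one letter per segment."""
    if len(series) % n_segments:
        raise ValueError("series length must be a multiple of n_segments")
    mu, sigma = mean(series), pstdev(series)
    # z-normalize so the Gaussian breakpoints below are meaningful.
    z = [(x - mu) / sigma if sigma else 0.0 for x in series]
    width = len(z) // n_segments
    # Piecewise aggregate approximation: one mean per segment.
    paa = [mean(z[i:i + width]) for i in range(0, len(z), width)]
    size = len(alphabet)
    # Breakpoints that split N(0, 1) into `size` equally likely regions.
    breakpoints = [NormalDist().inv_cdf(i / size) for i in range(1, size)]
    return "".join(alphabet[bisect_right(breakpoints, v)] for v in paa)


# Hypothetical series: a low plateau followed by a high plateau.
word = symbolize([0, 0, 0, 0, 10, 10, 10, 10], n_segments=2)
```

Strings produced this way can be inserted into a suffix tree or compared with string distance measures, which is what makes a symbolic form convenient for pattern counting.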
A Wavelet-Based Anytime Algorithm for K-Means Clustering of Time Series
- In Proc. Workshop on Clustering High Dimensionality Data and Its Applications
, 2003
Cited by 19 (2 self)
The emergence of the field of data mining in the last decade has sparked an increasing interest in clustering of time series. Although there has been much research on clustering in general, most classic machine learning and data mining algorithms do not work well for time series due to their unique structure. In particular, the high dimensionality, very high feature correlation, and the (typically) large amount of noise that characterize time series data present a difficult challenge. In this work we address these challenges by introducing a novel anytime version of k-Means clustering algorithm for time series. The algorithm works by leveraging off the multi-resolution property of wavelets. In particular, an initial clustering is performed with a very coarse resolution representation of the data. The results obtained from this "quick and dirty" clustering are used to initialize a clustering at a slightly finer level of approximation. This process is repeated until the clustering results stabilize or until the "approximation" is the raw data. In addition to casting k-Means as an anytime algorithm, our approach has two other very unintuitive properties. The quality of the clustering is often better than the batch algorithm, and even if the algorithm is run to completion, the time taken is typically much less than the time taken by the original algorithm. We explain, and empirically demonstrate these surprising and desirable properties with comprehensive experiments on several publicly available real data sets.
Three Myths about Dynamic Time Warping Data Mining
- In the Proceedings of SIAM International Conference on Data Mining
, 2005
Cited by 18 (8 self)
The Dynamic Time Warping (DTW) distance measure is a technique that has long been known in the speech recognition community. It allows a non-linear mapping of one signal to another by minimizing the distance between the two. A decade ago, DTW was introduced into the data mining community as a utility for various time series tasks, including classification, clustering, and anomaly detection. The technique has flourished, particularly in the last three years, and has been applied to a variety of problems in various disciplines. In spite of DTW’s great success, there are still several persistent “myths” about it. These myths have caused confusion and led to much wasted research effort. In this work, we will dispel these myths with the most comprehensive set of time series experiments ever conducted.
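For readers unfamiliar with the measure, the following is a minimal sketch of the textbook dynamic-programming formulation of DTW, with an optional band that limits how far the alignment may drift from the diagonal. It is not taken from the paper; the function name `dtw_distance`, the squared-difference local cost, and the `window` parameter are assumptions made for this example.

```python
def dtw_distance(a, b, window=None):
    """DTW distance between two numeric sequences.

    window: maximum allowed |i - j| along the warping path;
            None means the path is unconstrained.
    """
    n, m = len(a), len(b)
    # The band must be at least |n - m| wide or no path reaches the end.
    w = max(n, m) if window is None else max(window, abs(n - m))
    inf = float("inf")
    # cost[i][j] = cheapest alignment of a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - w), min(m, i + w) + 1):
            local = (a[i - 1] - b[j - 1]) ** 2
            cost[i][j] = local + min(
                cost[i - 1][j],      # a[i-1] also matches the previous b
                cost[i][j - 1],      # b[j-1] also matches the previous a
                cost[i - 1][j - 1],  # both sequences advance
            )
    return cost[n][m] ** 0.5


# With window=0 and equal lengths, only the diagonal is allowed,
# so the result reduces to the Euclidean distance.
unconstrained = dtw_distance([1, 2, 3], [2, 3, 4])
diagonal_only = dtw_distance([1, 2, 3], [2, 3, 4], window=0)
```

Narrowing the window trades alignment flexibility for speed, since fewer cells of the cost table are filled in.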
Iterative Incremental Clustering of Time Series
- In EDBT
, 2004
Cited by 17 (1 self)
We present a novel anytime version of partitional clustering algorithms, such as k-Means and EM, for time series. The algorithm works by leveraging off the multi-resolution property of wavelets. The dilemma of choosing the initial centers is mitigated by initializing the centers at each approximation level, using the final centers returned by the coarser representations. In addition to casting the clustering algorithms as anytime algorithms, this approach has two other very desirable properties. By working at lower dimensionalities we can efficiently avoid local minima. Therefore, the quality of the clustering is usually better than that of the batch algorithm. In addition, even if the algorithm is run to completion, our approach is much faster than its batch counterpart. We explain, and empirically demonstrate these surprising and desirable properties with comprehensive experiments on several publicly available real data sets. We further demonstrate that our approach can be generalized to a framework covering a much broader range of algorithms or data mining problems.
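The coarse-to-fine scheme described in the abstract above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: series lengths are assumed to be a power of two, Haar approximations are computed by pairwise averaging, the first `k` series seed the coarsest level, and coarse centers are carried to the next level by repeating each coordinate. All function names are hypothetical.

```python
def haar_levels(series):
    """Haar approximations of a series, coarsest first, raw data last."""
    levels = [list(series)]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([(prev[i] + prev[i + 1]) / 2 for i in range(0, len(prev), 2)])
    return levels[::-1]


def sq_dist(u, v):
    return sum((x - y) ** 2 for x, y in zip(u, v))


def kmeans(points, centers, max_iter=100):
    """Plain Lloyd iterations from the given centers; returns (labels, centers)."""
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centers)), key=lambda c: sq_dist(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        for c in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


def multires_kmeans(dataset, k, start_level=1):
    """Cluster at increasing resolutions, seeding each level from the last."""
    pyramids = [haar_levels(s) for s in dataset]
    labels, centers = None, None
    for level in range(start_level, len(pyramids[0])):
        points = [p[level] for p in pyramids]
        if centers is None:
            # Assumption: seed the coarsest level with the first k series.
            centers = [list(points[i]) for i in range(k)]
        else:
            # Double the resolution of each coarse center.
            centers = [[v for v in c for _ in (0, 1)] for c in centers]
        labels, centers = kmeans(points, centers)
    return labels


# Hypothetical data: two rising and two falling series of length 4.
data = [[0, 0, 4, 4], [4, 4, 0, 0], [0, 1, 4, 5], [5, 4, 1, 0]]
labels = multires_kmeans(data, k=2)
```

Because `labels` is available after every level, the loop can be interrupted early and still return a usable clustering, which is what makes the scheme an anytime algorithm.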
A multiresolution symbolic representation of time series
- In: Proc. IEEE Int. Conf. on Data Engineering (ICDE05)
, 2005
Cited by 11 (4 self)
Efficiently and accurately searching for similarities among time series and discovering interesting patterns is an important and non-trivial problem. In this paper, we introduce a new representation of time series, the Multiresolution Vector Quantized (MVQ) approximation, along with a new distance function. The novelty of MVQ is that it keeps both local and global information about the original time series in a hierarchical mechanism, processing the original time series at multiple resolutions. Moreover, the proposed representation is symbolic employing key subsequences and potentially allows the application of text-based retrieval techniques into the similarity analysis of time series. The proposed method is fast and scales linearly with the size of
Approximate embedding-based subsequence matching of time series
- In SIGMOD ’08: Proceedings of the 2008 ACM SIGMOD international conference on Management of data
, 2008
Cited by 10 (5 self)
A method for approximate subsequence matching is introduced, that significantly improves the efficiency of subsequence matching in large time series data sets under the dynamic time warping (DTW) distance measure. Our method is called EBSM, shorthand for Embedding-Based Subsequence Matching. The key idea is to convert subsequence matching to vector matching using an embedding. This embedding maps each database time series into a sequence of vectors, so that every step of every time series in the database is mapped to a vector. The embedding is computed by applying full dynamic time warping between reference objects and each database time series. At runtime, given a query object, an embedding of that object is computed in the same manner, by running dynamic time warping between the reference objects and the query. Comparing the embedding of the query with the database vectors is used to efficiently identify relatively few areas of interest in the database sequences. Those areas of interest are then fully explored using the exact DTW-based subsequence matching algorithm. Experiments on a large, public time series data set produce speedups of over one order of magnitude compared to brute-force search, with very small losses (< 1%) in retrieval accuracy.
Global distance-based segmentation of trajectories
- In KDD
, 2006
Cited by 9 (1 self)
This work introduces distance-based criteria for segmentation of object trajectories. Segmentation leads to simplification of the original objects into smaller, less complex primitives that are better suited for storage and retrieval purposes. Previous work on trajectory segmentation attacked the problem locally, segmenting each trajectory of the database separately. Therefore, it did not directly optimize the inter-object separability, which is necessary for mining operations such as searching, clustering, and classification on large databases. In this paper we analyze the trajectory segmentation problem from a global perspective, utilizing data-aware distance-based optimization techniques, which optimize pairwise distance estimates, hence leading to more efficient object pruning. We first derive exact solutions of the distance-based formulation. Due to the intractable complexity of the exact solution, we present an approximate, greedy solution that exploits forward searching of locally optimal solutions. Since the greedy solution also imposes a prohibitive computational cost, we also put forward more lightweight variance-based segmentation techniques, which intelligently “relax” the pairwise distance only in the areas that least affect the mining operations.
Local correlation tracking in time series
- In ICDM
, 2006
Cited by 7 (1 self)
We address the problem of capturing and tracking local correlations among time evolving time series. Our approach is based on comparing the local auto-covariance matrices (via their spectral decompositions) of each series and generalizes the notion of linear cross-correlation. In this way, it is possible to concisely capture a wide variety of local patterns or trends. Our method produces a general similarity score, which evolves over time, and accurately reflects the changing relationships. Finally, it can also be estimated incrementally, in a streaming setting. We demonstrate its usefulness, robustness and efficiency on a wide range of real datasets.
Escalation: Complex Event Detection in Wireless Sensor Networks
Cited by 6 (2 self)
We present a new approach for the detection of complex events in Wireless Sensor Networks. Complex events are sets of data points that correspond to interesting or unusual patterns in the underlying phenomenon that the network monitors. Our approach is inspired by time-series data mining techniques and transforms a stream of real-valued sensor readings into a symbolic representation. Complex event detection is then performed using distance metrics, allowing us to detect events that are difficult or even impossible to describe using traditional declarative SQL-like languages and thresholds. We have tested our approach with four distinct data sets and the experimental results were encouraging in all cases. We have implemented our approach for the TinyOS and Contiki Operating Systems, for the Sky mote platform.

