Results 1 - 10
of
62
Locally Adaptive Dimensionality Reduction for Indexing Large Time Series Databases
- In proceedings of ACM SIGMOD Conference on Management of Data
, 2002
"... Similarity search in large time series databases has attracted much research interest recently. It is a difficult problem because of the typically high dimensionality of the data.. The most promising solutions' involve performing dimensionality reduction on the data, then indexing the reduced data w ..."
Abstract
-
Cited by 185 (22 self)
- Add to MetaCart
Similarity search in large time series databases has attracted much research interest recently. It is a difficult problem because of the typically high dimensionality of the data.. The most promising solutions' involve performing dimensionality reduction on the data, then indexing the reduced data with a multidimensional index structure. Many dimensionality reduction techniques have been proposed, including Singular Value Decomposition (SVD), the Discrete Fourier transform (DFT), and the Discrete Wavelet Transform (DWT). In this work we introduce a new dimensionality reduction technique which we call Adaptive Piecewise Constant Approximation (APCA). While previous techniques (e.g., SVD, DFT and DWT) choose a common representation for all the items in the database that minimizes the global reconstruction error, APCA approximates each time series by a set of constant value segments' of varying lengths' such that their individual reconstruction errors' are minimal. We show how APCA can be indexed using a multidimensional index structure. We propose two distance measures in the indexed space that exploit the high fidelity of APCA for fast searching: a lower bounding Euclidean distance approximation, and a non-lower bounding, but very tight Euclidean distance approximation and show how they can support fast exact searchin& and even faster approximate searching on the same index structure. We theoretically and empirically compare APCA to all the other techniques and demonstrate its' superiority.
Discovering similar multidimensional trajectories
- In ICDE
, 2002
"... We investigate techniques for analysis and retrieval of object trajectories in a two or three dimensional space. Such kind of data usually contain a great amount of noise, that makes all previously used metrics fail. Therefore, here we formalize non-metric similarity functions based on the Longest C ..."
Abstract
-
Cited by 138 (5 self)
- Add to MetaCart
We investigate techniques for analysis and retrieval of object trajectories in a two or three dimensional space. Such kind of data usually contain a great amount of noise, that makes all previously used metrics fail. Therefore, here we formalize non-metric similarity functions based on the Longest Common Subsequence (LCSS), which are very robust to noise and furthermore provide an intuitive notion of similarity between trajectories by giving more weight to the similar portions of the sequences. Stretching of sequences in time is allowed, as well as global translating of the sequences in space. Efficient approximate algorithms that compute these similarity measures are also provided. We compare these new methods to the widely used Euclidean and Time Warping distance functions (for real and synthetic data) and show the superiority of our approach, especially under the strong presence of noise. We prove a weaker version of the triangle inequality and employ it in an indexing structure to answer nearest neighbor queries. Finally, we present experimental results that validate the accuracy and efficiency of our approach. 1
Outlier detection for high dimensional data
, 2001
"... The outlier detection problem has important applications in the eld of fraud detection, netw ork robustness analysis, and intrusion detection. Most suc h applications are high dimensional domains in whic hthe data can con tain hundreds of dimensions. Many recen t algorithms use concepts of pro ximit ..."
Abstract
-
Cited by 128 (0 self)
- Add to MetaCart
The outlier detection problem has important applications in the eld of fraud detection, netw ork robustness analysis, and intrusion detection. Most suc h applications are high dimensional domains in whic hthe data can con tain hundreds of dimensions. Many recen t algorithms use concepts of pro ximity in order to nd outliers based on their relationship to the rest of the data. Ho w ever, in high dimensional space, the data is sparse and the notion of proximity fails to retain its meaningfulness. In fact, the sparsity of high dimensional data implies that every point is an almost equally good outlier from the perspective ofproximity-based de nitions. Consequently, for high dimensional data, the notion of nding meaningful outliers becomes substantially more complex and non-obvious. In this paper, w e discuss new techniques for outlier detection whic h nd the outliers by studying the behavior of projections from the data set. 1.
A monte carlo algorithm for fast projective clustering
, 2002
"... We propose a mathematical formulation for the notion of optimal projective cluster, starting from natural requirements on the density of points in subspaces. This allows us to develop a Monte Carlo algorithm for iteratively computing projective clusters. We prove that the computed clusters are good ..."
Abstract
-
Cited by 67 (0 self)
- Add to MetaCart
We propose a mathematical formulation for the notion of optimal projective cluster, starting from natural requirements on the density of points in subspaces. This allows us to develop a Monte Carlo algorithm for iteratively computing projective clusters. We prove that the computed clusters are good with high probability. We implemented a modified version of the algorithm, using heuristics to speed up computation. Our extensive experiments show that our method is significantly more accurate than previous approaches. In particular, we use our techniques to build a classifier for detecting rotated human faces in cluttered images. 1. PROJECTIVE CLUSTERING Clustering is a widely used technique for data mining, indexing, and classification. Many practical methods proposed in the last few years, such as CLARANS [11], BIRCH [15], DBSCAN [5, 6], and CURE [7], are “full-dimensional, ” in the sense that they give equal importance to all the dimensions when computing the distance between two points. While such approaches have proven successful for low-dimensional datasets, their accuracy and/or efficiency decrease significantly in higher dimensional spaces (see [9] for an excellent analysis and discussion). The reason for this performance deterioration is the so-called dimensionality curse. Recent research shows that for moderate-to-high dimensional spaces (tens or hundreds of dimensions), a full-dimensional distance is often irrelevant, as the farthest neighbor of a point is expected to be almost as ÝAuthor did this research when he was associated with the Compaq
Indexing the Distance: An Efficient Method to KNN Processing
, 2001
"... In this paper, we present an efficient method, called iDistance, for K-nearest neighbor (KNN) search in a high-dimensional space. iDistance partitions the data and selects a reference point for each partition. The data in each cluster are transformed into a single dimensional space based on their si ..."
Abstract
-
Cited by 52 (12 self)
- Add to MetaCart
In this paper, we present an efficient method, called iDistance, for K-nearest neighbor (KNN) search in a high-dimensional space. iDistance partitions the data and selects a reference point for each partition. The data in each cluster are transformed into a single dimensional space based on their similarity with respect to a reference point. This allows the points to be indexed using a B + -tree structure and KNN search be performed using one-dimensional range search. The choice of partition and reference point provides the iDistance technique with degrees of freedom most other techniques do not have. We describe how appropriate choices here can effectively adapt the index structure to the data distribution. We conducted extensive experiments to evaluate the iDistance technique, and report results demonstrating its effectiveness.
Indexing Spatio-Temporal Trajectories with Chebyshev Polynomials
- Proc. 2004 SIGMOD, toappear
"... In this thesis, we investigate the subject of indexing large collections of spatiotemporal trajectories for similarity matching. Our proposed technique is to first mitigate the dimensionality curse problem by approximating each trajectory with a low order polynomial-like curve, and then incorporate ..."
Abstract
-
Cited by 41 (0 self)
- Add to MetaCart
In this thesis, we investigate the subject of indexing large collections of spatiotemporal trajectories for similarity matching. Our proposed technique is to first mitigate the dimensionality curse problem by approximating each trajectory with a low order polynomial-like curve, and then incorporate a multidimensional index into the reduced space of polynomial coefficients. There are many possible ways to choose the polynomial, including Fourier transforms, splines, non-linear regressions, etc. Some of these possibilities have indeed been studied before. We hypothesize that one of the best approaches is the polynomial that minimizes the maximum deviation from the true value, which is called the minimax polynomial. Minimax approximation is particularly meaningful for indexing because in a branch-and-bound search (i.e., for finding nearest neighbours), the smaller the maximum deviation, the more pruning opportunities there exist. In general, among all the polynomials of the same degree, the optimal minimax polynomial is very hard to compute. However, it has been shown that the Chebyshev approximation is almost identical to the optimal minimax polynomial, and is easy to compute [32]. Thus, we shall explore how to use
Computing Clusters of Correlation Connected Objects
, 2004
"... The detection of correlations between different features in a set of feature vectors is a very important data mining task because correlation indicates a dependency between the features or some association of cause and effect between them. This association can be arbitrarily complex, i.e. one or mor ..."
Abstract
-
Cited by 29 (10 self)
- Add to MetaCart
The detection of correlations between different features in a set of feature vectors is a very important data mining task because correlation indicates a dependency between the features or some association of cause and effect between them. This association can be arbitrarily complex, i.e. one or more features might be dependent from a combination of several other features. Well-known methods like the principal components analysis (PCA) can perfectly find correlations which are global, linear, not hidden in a set of noise vectors, and uniform, i.e. the same type of correlation is exhibited in all feature vectors. In many applications such as medical diagnosis, molecular biology, time sequences, or electronic commerce, however, correlations are not global since the dependency between features can be different in different subgroups of the set. In this paper, we propose a method called 4C (Computing Correlation Connected Clusters) to identify local subgroups of the data objects sharing a uniform but arbitrarily complex correlation. Our algorithm is based on a combination of PCA and density-based clustering (DBSCAN). Our method has a determinate result and is robust against noise. A broad comparative evaluation demonstrates the superior performance of 4C over competing methods such as DBSCAN, CLIQUE and ORCLUS.
Variable Selection for Model-Based Clustering
- Journal of the American Statistical Association
, 2006
"... We consider the problem of variable or feature selection for model-based clustering. We recast the problem of comparing two nested subsets of variables as a model comparison problem, and address it using approximate Bayes factors. We develop a greedy search algorithm for finding a local optimum in m ..."
Abstract
-
Cited by 27 (4 self)
- Add to MetaCart
We consider the problem of variable or feature selection for model-based clustering. We recast the problem of comparing two nested subsets of variables as a model comparison problem, and address it using approximate Bayes factors. We develop a greedy search algorithm for finding a local optimum in model space. The resulting method selects variables (or features), the number of clusters, and the clustering model simultaneously. We applied the method to several simulated and real examples, and found that removing irrelevant variables often improved performance. Compared to methods based on all the variables, our variable selection method consistently yielded more accurate estimates of the number of clusters, and lower classification error rates, as well as more parsimonious clustering models and easier visualization of results.
Indexing of network constrained moving objects
- In Proc. of the 11th Intl. Symp. on Advances in Geographic Information Systems (ACM-GIS
, 2003
"... With the proliferation of mobile computing, the ability to index efficiently the movements of mobile objects becomes important. Objects are typically seen as moving in two-dimensional (x,y) space, which means that their movements across time may be embedded in the three-dimensional (x,y,t) space. Fu ..."
Abstract
-
Cited by 26 (2 self)
- Add to MetaCart
With the proliferation of mobile computing, the ability to index efficiently the movements of mobile objects becomes important. Objects are typically seen as moving in two-dimensional (x,y) space, which means that their movements across time may be embedded in the three-dimensional (x,y,t) space. Further, the movements are typically represented as trajectories, sequences of connected line segments. In certain cases, movement is restricted, and specifically in this paper, we aim at exploiting that movements occur in transportation networks to reduce the dimensionality of the data. Briefly, the idea is to reduce movements to occur in one spatial dimension. As a consequence, the movement data becomes two-dimensional (x,t). The advantages of considering such lowerdimensional trajectories are the reduced overall size of the data and the lower-dimensional indexing challenge. Since off-the-shelf database management systems typically do not offer higherdimensional indexing, this reduction in dimensionality allows us to use such DBMSes to store and index trajectories. Moreover, we argue that, given the right circumstances, indexing these dimensionality-reduced trajectories can be more efficient than using a three-dimensional index. This hypothesis is verified by an experimental study that incorporates trajectories stemming from real and synthetic road networks.
Finding localized associations in market basket data
- Knowledge and Data Engineering
, 2002
"... In this paper, we discuss a technique for discovering localized associations in segments of the data using clustering. Often the aggregate behavior of a data set may be very di erent from localized segments. In such cases, it is desirable to design algorithms which are e ective in discovering locali ..."
Abstract
-
Cited by 25 (0 self)
- Add to MetaCart
In this paper, we discuss a technique for discovering localized associations in segments of the data using clustering. Often the aggregate behavior of a data set may be very di erent from localized segments. In such cases, it is desirable to design algorithms which are e ective in discovering localized associations, because they expose a customer pattern which is more speci c than the aggregate behavior. This information may bevery useful for target marketing. We present empirical results which show that the method is indeed able to nd a signi cantly larger number of associations than what can be discovered by analysis of the aggregate data.

