Results 1  10
of
103
Online Identification and Tracking of Subspaces from Highly Incomplete Information
, 2010
"... This work presents GROUSE (Grassmanian RankOne Update Subspace Estimation), an efficient online algorithm for tracking subspaces from highly incomplete observations. GROUSE requires only basic linear algebraic manipulations at each iteration, and each subspace update can be performed in linear time ..."
Abstract

Cited by 78 (10 self)
 Add to MetaCart
(Show Context)
This work presents GROUSE (Grassmanian RankOne Update Subspace Estimation), an efficient online algorithm for tracking subspaces from highly incomplete observations. GROUSE requires only basic linear algebraic manipulations at each iteration, and each subspace update can be performed in linear time in the dimension of the subspace. The algorithm is derived by analyzing incremental gradient descent on the Grassmannian manifold of subspaces. With a slight modification, GROUSE can also be used as an online incremental algorithm for the matrix completion problem of imputing missing entries of a lowrank matrix. GROUSE performs exceptionally well in practice both in tracking subspaces and as an online algorithm for matrix completion. 1
Less is more: Compact matrix decomposition for large sparse graphs
, 2007
"... Given a large sparse graph, how can we find patterns and anomalies? Several important applications can be modeled as large sparse graphs, e.g., network traffic monitoring, research citation network analysis, social network analysis, and regulatory networks in genes. Low rank decompositions, such as ..."
Abstract

Cited by 34 (3 self)
 Add to MetaCart
(Show Context)
Given a large sparse graph, how can we find patterns and anomalies? Several important applications can be modeled as large sparse graphs, e.g., network traffic monitoring, research citation network analysis, social network analysis, and regulatory networks in genes. Low rank decompositions, such as SVD and CUR, are powerful techniques for revealing latent/hidden variables and associated patterns from high dimensional data. However, those methods often ignore the sparsity property of the graph, and hence usually incur too high memory and computational cost to be practical. We propose a novel method, the Compact Matrix Decomposition (CMD), to compute sparse low rank approximations. CMD dramatically reduces both the computation cost and the space requirements over existing decomposition methods (SVD, CUR). Using CMD as the key building block, we further propose procedures to efficiently construct and analyze dynamic graphs from realtime application data. We provide theoretical guarantee for our methods, and present results on two real, large datasets, one on network flow data (100GB trace of 22K hosts over one month) and one on DBLP (200MB over 25 years). We show that CMD is often an order of magnitude more efficient than the state of the art (SVD and CUR): it is over 10X faster, but requires less than 1/10 of the space, for the same reconstruction accuracy. Finally, we demonstrate how CMD is used for detecting anomalies and monitoring timeevolving graphs, in which it successfully detects wormlike hierarchical scanning patterns in real network data.
Colibri: Fast Mining of Large Static and Dynamic Graphs
"... Lowrank approximations of the adjacency matrix of a graph are essential in finding patterns (such as communities) and detecting anomalies. Additionally, it is desirable to track the lowrank structure as the graph evolves over time, efficiently and within limited storage. Real graphs typically have ..."
Abstract

Cited by 30 (6 self)
 Add to MetaCart
(Show Context)
Lowrank approximations of the adjacency matrix of a graph are essential in finding patterns (such as communities) and detecting anomalies. Additionally, it is desirable to track the lowrank structure as the graph evolves over time, efficiently and within limited storage. Real graphs typically have thousands or millions of nodes, but are usually very sparse. However, standard decompositions such as SVD do not preserve sparsity. This has led to the development of methods such as CUR and CMD, which seek a nonorthogonal basis by sampling the columns and/or rows of the sparse matrix. However, these approaches will typically produce overcomplete bases, which wastes both space and time. In this paper we propose the family of Colibri methods to deal with these challenges. Our version for static graphs, ColibriS, iteratively finds a nonredundant basis and we prove that it has no loss of accuracy compared to the best competitors (CUR and CMD), while achieving significant savings in space and time: on real data, ColibriS requires much less space and is orders of magnitude faster (in proportion to the square of the number of nonredundant columns). Additionally, we propose an efficient update algorithm for dynamic, timeevolving graphs, ColibriD. Our evaluation on a large, real network traffic dataset shows that ColibriD is over 100 times faster than the best published competitor (CMD).
Stream Monitoring under the Time Warping Distance
"... Data stream processing has recently attracted an increasing amount of interest. The goal of this paper is to monitor numerical streams, and to find subsequences that are similar to a given query sequence, under the DTW (Dynamic Time Warping) distance. Applications include word spotting, sensor patte ..."
Abstract

Cited by 29 (2 self)
 Add to MetaCart
(Show Context)
Data stream processing has recently attracted an increasing amount of interest. The goal of this paper is to monitor numerical streams, and to find subsequences that are similar to a given query sequence, under the DTW (Dynamic Time Warping) distance. Applications include word spotting, sensor pattern matching, and monitoring of biomedical signals (e.g., EKG, ECG), and monitoring of environmental (seismic and volcanic) signals. DTW is a very popular distance measure, permitting accelerations and decelerations, and it has been studied for finite, stored sequence sets. However, in many applications such as network analysis and sensor monitoring, massive amounts of data arrive continuously and it is infeasible to save all the historical data. We propose SPRING, a novel algorithm that can solve the problem. We provide a theoretical analysis and prove that SPRING does not sacrifice accuracy, while it requires constant space and time per timetick. These are dramatic improvements over the naive method. Our experiments on real and realistic data illustrate that SPRING does indeed detect the qualifying subsequences correctly and that it can offer dramatic improvements in speed (up to 650,000 times) over the naive implementation. 1
Fast Approximate Correlation for Massive Timeseries Data
"... We consider the problem of computing allpair correlations in a warehouse containing a large number (e.g., tens of thousands) of timeseries (or, signals). The problem arises in automatic discovery of patterns and anomalies in data intensive applications such as data center management, environmental ..."
Abstract

Cited by 24 (5 self)
 Add to MetaCart
(Show Context)
We consider the problem of computing allpair correlations in a warehouse containing a large number (e.g., tens of thousands) of timeseries (or, signals). The problem arises in automatic discovery of patterns and anomalies in data intensive applications such as data center management, environmental monitoring, and scientific experiments. However, with existing techniques, solving the problem for a large stream warehouse is extremely expensive, due to the problem’s inherent quadratic I/O and CPU complexities. We propose novel algorithms, based on Discrete Fourier Transformation (DFT) and graph partitioning, to reduce the endtoend response time of an allpair correlation query. To minimize I/O cost, we partition a massive set of input signals into smaller batches such that caching the signals one batch at a time maximizes data reuse and minimizes disk I/O. To reduce CPU cost, we propose two approximation algorithms. Our first algorithm efficiently computes approximate correlation coefficients of similar signal pairs within a given error bound. The second algorithm efficiently identifies, without any false positives or negatives, all signal pairs with correlations above a given threshold. For many real applications, our approximate solutions are as useful as corresponding exact solutions, due to our strict error guarantees. However, compared to the stateoftheart exact algorithms, our algorithms are up to 17× faster for several real datasets.
Windowbased Tensor Analysis on Highdimensional and Multiaspect Streams
"... Data stream values are often associated with multiple aspects. For example, each value from environmental sensors may have an associated type (e.g., temperature, humidity, etc) as well as location. Aside from timestamp, type and location are the two additional aspects. How to model such streams? How ..."
Abstract

Cited by 22 (3 self)
 Add to MetaCart
(Show Context)
Data stream values are often associated with multiple aspects. For example, each value from environmental sensors may have an associated type (e.g., temperature, humidity, etc) as well as location. Aside from timestamp, type and location are the two additional aspects. How to model such streams? How to simultaneously find patterns within and across the multiple aspects? How to do it incrementally in a streaming fashion? In this paper, all these problems are addressed through a general data model, tensor streams, and an effective algorithmic framework, windowbased tensor analysis (WTA). Two variations of WTA, independentwindow tensor analysis (IW) and movingwindow tensor analysis (MW), are presented and evaluated extensively on real datasets. Finally, we illustrate one important application, MultiAspect Correlation Analysis (MACA), which uses WTA and we demonstrate its effectiveness on an environmental monitoring application. 1
Spade: On shapebased pattern detection in streaming time series
 in ICDE, 2007
"... Monitoring predefined patterns in streaming time series is useful to applications such as trendrelated analysis, sensor networks and video surveillance. Most current studies on such monitoring employ Euclidean distance to calculate the similarities between given query patterns and subsequences of s ..."
Abstract

Cited by 22 (2 self)
 Add to MetaCart
(Show Context)
Monitoring predefined patterns in streaming time series is useful to applications such as trendrelated analysis, sensor networks and video surveillance. Most current studies on such monitoring employ Euclidean distance to calculate the similarities between given query patterns and subsequences of streaming time series. Euclidean distance has been shown to be ineffective in measuring distances of time series in which shifting and scaling usually exist. Consequently, warping distances such as dynamic time warping (DTW), longest common subsequence (LCSS), have been proposed to handle warps in temporal dimension. However, they are inadequate in handling shifting and scaling in amplitude dimension. Moreover, they have been designed mainly for full sequence matching, whereas in online monitoring applications, we typically have no knowledge on the positions and lengths of possible matching subsequences. In this paper, we first discuss the weaknesses of existing warping distances on detecting patterns from streaming time series. We then propose a novel warping distance, which we name Spatial Assembling Distance (SpADe), that is able to handle shifting and scaling in both temporal and amplitude dimensions. We further propose an efficient approach for continuous pattern detection using SpADe, that is fundamental for subsequence matching on streaming data. Finally, our experimental results show that SpADe is effective and efficient for continuous pattern detection in streaming time series. 1
Sustainable Operation and Management of Data Center Chillers using Temporal Data Mining
"... Motivation: Data centers are a critical component of modern IT infrastructure but are also among the worst environmental offenders through their increasing energy usage and the resulting large carbon footprints. Efficient management of data centers, including power management, networking, and coolin ..."
Abstract

Cited by 20 (5 self)
 Add to MetaCart
(Show Context)
Motivation: Data centers are a critical component of modern IT infrastructure but are also among the worst environmental offenders through their increasing energy usage and the resulting large carbon footprints. Efficient management of data centers, including power management, networking, and cooling infrastructure, is hence crucial to sustainability. In the absence of a “firstprinciples ” approach to manage these complex components and their interactions, datadriven approaches have become attractive and tenable. Results: We present a temporal data mining solution to model and optimize performance of data center chillers, a key component of the cooling infrastructure. It helps bridge raw, numeric, timeseries information from sensor streams toward higher level characterizations of chiller behavior, suitable for a data center engineer. To aid in this transduction, temporal data streams are first encoded into a symbolic representation, next runlength encoded segments are mined to form frequent motifs in time series, and finally these metrics are evaluated by their contributions to sustainability. A key innovation in our application is the ability to intersperse “don’t care ” transitions (e.g., transients) in continuousvalued time series data, an advantage we inherit by the application of frequent episode mining to symbolized representations of numeric time series. Our approach provides both qualitative and quantitative characterizations of the sensor streams to the data center engineer, to aid him in tuning chiller operating characteristics. This system is currently being prototyped for a data center managed by HP and experimental results from this application reveal the promise of our approach.
Gamps: Compressing multi sensor data by grouping and amplitude scaling
 In: ACM SIGMOD. (2009
"... We consider the problem of collectively approximating a set of sensor signals using the least amount of space so that any individual signal can be efficiently reconstructed within a given maximum (L∞) error ε. The problem arises naturally in applications that need to collect large amounts of data fr ..."
Abstract

Cited by 19 (0 self)
 Add to MetaCart
(Show Context)
We consider the problem of collectively approximating a set of sensor signals using the least amount of space so that any individual signal can be efficiently reconstructed within a given maximum (L∞) error ε. The problem arises naturally in applications that need to collect large amounts of data from multiple concurrent sources, such as sensors, servers and network routers, and archive them over a long period of time for offline data mining. We present GAMPS, a general framework that addresses this problem by combining several novel techniques. First, it dynamically groups multiple signals together so that signals within each group are correlated and can be maximally compressed jointly. Second, it appropriately scales the amplitudes of different signals within a group and compresses them within the maximum allowed reconstruction error bound. Our schemes are polynomial time O(α, β) approximation schemes, meaning that the maximum (L∞) error is at most αε and it uses at most β times the optimal memory. Finally, GAMPS maintains an index so that various queries can be issued directly on compressed data. Our experiments on several realworld sensor datasets show that GAMPS significantly reduces space without compromising the quality of search and query. Categories and Subject Descriptors