Results 1 - 10 of 70
Mining Data Streams: A Review.
- SIGMOD Record, 2005
"... Abstract The recent advances in hardware and software have enabled the capture of different measurements of data in a wide range of fields. These measurements are generated continuously and in a very high fluctuating data rates. Examples include sensor networks, web logs, and computer network traff ..."
Cited by 113 (6 self)
The recent advances in hardware and software have enabled the capture of different measurements of data in a wide range of fields. These measurements are generated continuously and at very high, fluctuating data rates. Examples include sensor networks, web logs, and computer network traffic. The storage, querying, and mining of such data sets are highly computationally challenging tasks. Mining data streams is concerned with extracting knowledge structures, represented as models and patterns, from non-stopping streams of information. Research in data stream mining has attracted considerable attention owing to the importance of its applications and the increasing generation of streaming information. Applications of data stream analysis range from critical scientific and astronomical applications to important business and financial ones. Algorithms, systems, and frameworks that address streaming challenges have been developed over the past three years. In this review paper, we present the state-of-the-art in this growing, vital field.

1. Introduction
Intelligent data analysis has passed through a number of stages, each addressing novel research issues that have arisen. Statistical exploratory data analysis represents the first stage; the goal was to explore the available data in order to test a specific hypothesis. With the advances in computing power, the field of machine learning arose, with the objective of finding computationally efficient solutions to data analysis problems. Along with the progress in machine learning research, new data analysis problems have been addressed. Due to the increase in database sizes, new algorithms have been proposed to deal with the scalability issue. Moreover, machine learning and statistical analysis techniques have been adopted and modified to address the problem of very large databases. Data mining is the interdisciplinary field of study that can extract models and patterns from large amounts of information stored in data repositories. Recently, the data generation rates of some data sources have become faster than ever before. This rapid generation of continuous streams of information has challenged our storage, computation, and communication capabilities in computing systems. Systems, models, and techniques have been proposed and developed over the past few years to address these challenges. In this paper, we review the theoretical foundations of data stream analysis. Data stream mining techniques and systems are critically reviewed. Finally, we outline and discuss open research problems in the stream mining field; these issues should be addressed in order to realize robust systems that are capable of fulfilling the needs of data stream mining applications. The paper is organized as follows. Section 2 presents the theoretical background of data stream analysis. Mining data stream techniques and systems are reviewed in Sections 3 and 4, respectively. Open and addressed research issues in this growing field are discussed in Section 5. Finally, Section 6 summarizes this review paper.

2. Theoretical Foundations
Research problems and challenges that arise in mining data streams have solutions drawing on well-established statistical and computational approaches. We can categorize these solutions into data-based and task-based ones. In data-based solutions, the idea is to examine only a subset of the whole dataset, or to transform the data vertically or horizontally to an approximate, smaller-size representation. On the other hand, in task-based solutions, techniques from computational theory have been adopted to achieve time- and space-efficient solutions.
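To make the data-based category concrete, here is a minimal sketch of reservoir sampling, one of the classical approximation techniques such reviews cover: it maintains a uniform random sample of fixed size over an unbounded stream. Function and variable names are illustrative, not from the paper.

```python
import random

def reservoir_sample(stream, k):
    """Keep a uniform random sample of size k from a stream of unknown length."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            # Item i survives with probability k / (i + 1), preserving uniformity.
            j = random.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

# Example: sample 5 readings from a stream of 10,000 sensor measurements.
print(reservoir_sample(iter(range(10_000)), 5))
```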
Mining correlated bursty topic patterns from coordinated text streams
- ACM SIGKDD Conference (KDD), 2007
"... Previous work on text mining has almost exclusively focused on a single stream. However, we often have available mul-tiple text streams indexed by the same set of time points (called coordinated text streams), which oer new opportu-nities for text mining. For example, when a major event happens, all ..."
Cited by 54 (6 self)
Previous work on text mining has almost exclusively focused on a single stream. However, we often have available multiple text streams indexed by the same set of time points (called coordinated text streams), which offer new opportunities for text mining. For example, when a major event happens, all the news articles published by different agencies in different languages tend to cover the same event for a certain period, exhibiting a correlated bursty topic pattern in all the news article streams. In general, mining correlated bursty topic patterns from coordinated text streams can reveal interesting latent associations or events behind these streams. In this paper, we define and study this novel text mining problem. We propose a general probabilistic algorithm which can effectively discover correlated bursty patterns and their bursty periods across text streams even if the streams have completely different vocabularies (e.g., English vs. Chinese). Evaluation of the proposed method on a news data set and a literature data set shows that it can effectively discover quite meaningful topic patterns from both data sets: the patterns discovered from the news data set accurately reveal the major common events covered in the two streams of news articles (in English and Chinese, respectively), while the patterns discovered from two database publication streams match well with the major research paradigm shifts in database research. Since the proposed method is general and does not require the streams to share vocabulary, it can be applied to any coordinated text streams to discover correlated topic patterns that burst in multiple streams in the same period.
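The paper's actual method is a probabilistic topic model; purely as an illustration of what a correlated bursty pattern means, the toy sketch below computes a per-term burst curve in each stream and correlates the curves across streams, which works even when the streams share no vocabulary. All names and the tiny data set are hypothetical.

```python
from collections import Counter

def burst_curve(stream, term):
    """Relative frequency of `term` at each time point of a coordinated stream.

    `stream` is a list of documents per time point; each document is a list of tokens.
    """
    curve = []
    for docs in stream:
        counts = Counter(tok for doc in docs for tok in doc)
        total = sum(counts.values()) or 1
        curve.append(counts[term] / total)
    return curve

def correlation(xs, ys):
    """Pearson correlation of two equally long burst curves."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy) if vx and vy else 0.0

# Two streams indexed by the same three time points; "flood"/"洪水" burst together.
english = [[["markets", "rally"]], [["flood", "flood", "warning"]], [["flood", "relief"]]]
chinese = [[["股市", "上涨"]], [["洪水", "洪水", "预警"]], [["洪水", "救灾"]]]
print(correlation(burst_curve(english, "flood"), burst_curve(chinese, "洪水")))
```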
A general framework for mining concept-drifting data streams with skewed distributions
- In Proc. SDM’07, 2007
"... In recent years, there have been some interesting studies on predictive modeling in data streams. However, most such studies assume relatively balanced and stable data streams but cannot handle well rather skewed (e.g., few positives but lots of negatives) and stochastic distributions, which are typ ..."
Cited by 47 (6 self)
In recent years, there have been some interesting studies on predictive modeling in data streams. However, most such studies assume relatively balanced and stable data streams, and cannot handle well the rather skewed (e.g., few positives but lots of negatives) and stochastic distributions that are typical in many data stream applications. In this paper, we propose a new approach to mine data streams by estimating reliable posterior probabilities using an ensemble of models trained to match the distribution over under-samples of negatives and repeated samples of positives. We formally show some interesting and important properties of the proposed framework, e.g., reliability of estimated probabilities on the skewed positive class, accuracy of estimated probabilities, efficiency, and scalability. Experiments are performed on several synthetic as well as real-world datasets with skewed distributions, and they demonstrate that our framework has substantial advantages over existing approaches in estimation reliability and prediction accuracy.
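A minimal sketch of the ensemble idea described above, assuming scikit-learn is available: each member model is trained on every positive example plus a fresh under-sample of negatives, and the posterior estimates are averaged. The paper's framework adds formal reliability guarantees that this toy version does not.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def skewed_ensemble_proba(X, y, X_new, n_models=10, seed=0):
    """Average posterior estimates from models trained on balanced resamples.

    Each model sees every positive example plus a different under-sample of
    negatives, so the rare class is not swamped. Assumes negatives outnumber
    positives (the skewed case).
    """
    rng = np.random.default_rng(seed)
    pos, neg = np.where(y == 1)[0], np.where(y == 0)[0]
    probas = []
    for _ in range(n_models):
        sample_neg = rng.choice(neg, size=len(pos), replace=False)
        idx = np.concatenate([pos, sample_neg])
        model = LogisticRegression().fit(X[idx], y[idx])
        probas.append(model.predict_proba(X_new)[:, 1])
    return np.mean(probas, axis=0)
```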
Incremental linear discriminant analysis for classification of data streams
- IEEE Transactions on Systems, Man and Cybernetics - Part B
"... Abstract—This paper presents a constructive method for de-riving an updated discriminant eigenspace for classification when bursts of data that contains new classes is being added to an initial discriminant eigenspace in the form of random chunks. Basically, we propose an incremental linear discrimi ..."
Cited by 36 (5 self)
This paper presents a constructive method for deriving an updated discriminant eigenspace for classification when bursts of data that contain new classes are added to an initial discriminant eigenspace in the form of random chunks. Basically, we propose an incremental linear discriminant analysis (ILDA) in two forms: sequential ILDA and chunk ILDA. In experiments, we have tested ILDA using datasets with a small number of classes and low-dimensional features, as well as datasets with a large number of classes and high-dimensional features. We have compared the proposed ILDA against traditional batch LDA in terms of discriminability, execution time, and memory usage as the volume of added data increases. The results show that the proposed ILDA can effectively evolve a discriminant eigenspace over a fast and large data stream, and extract features with superior discriminability in classification when compared with other methods. Index Terms: classification, data stream, incremental linear discriminant analysis, incremental principal component analysis, linear discriminant analysis, pattern recognition, principal component analysis.
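The sufficient statistics behind LDA (per-class counts, sums, and within-class scatter) can be merged chunk by chunk exactly, which is the intuition behind chunk-wise incremental LDA. The sketch below maintains those statistics incrementally and recomputes the discriminant eigenspace on demand; the paper's algorithm updates the eigenspace itself, so treat this only as an illustration of the update pattern, with all names invented here.

```python
import numpy as np

class ChunkLDA:
    """Toy chunk-incremental LDA: per-class counts, sums, and within-class
    scatter are merged chunk by chunk (exactly, via the pairwise-merge
    formula), and the discriminant eigenspace is recomputed on demand."""

    def __init__(self, dim):
        self.dim = dim
        self.n, self.sums, self.scatter = {}, {}, {}

    def partial_fit(self, X, y):
        for label in np.unique(y):
            Xc = X[y == label]
            n2, sum2 = len(Xc), Xc.sum(axis=0)
            mean2 = sum2 / n2
            s2 = (Xc - mean2).T @ (Xc - mean2)
            if label not in self.n:  # a new class arriving mid-stream
                self.n[label], self.sums[label], self.scatter[label] = n2, sum2, s2
                continue
            n1, sum1 = self.n[label], self.sums[label]
            d = (sum1 / n1 - mean2).reshape(-1, 1)
            # Exact merge of two scatter matrices around their own means.
            self.scatter[label] += s2 + (n1 * n2 / (n1 + n2)) * (d @ d.T)
            self.n[label] += n2
            self.sums[label] += sum2

    def eigenspace(self, k):
        total = sum(self.n.values())
        gmean = sum(self.sums.values()) / total
        sw = sum(self.scatter.values())
        sb = np.zeros((self.dim, self.dim))
        for label, n in self.n.items():
            d = (self.sums[label] / n - gmean).reshape(-1, 1)
            sb += n * (d @ d.T)
        # Discriminant directions: top eigenvectors of pinv(Sw) @ Sb.
        vals, vecs = np.linalg.eig(np.linalg.pinv(sw) @ sb)
        order = np.argsort(-vals.real)
        return vecs[:, order[:k]].real
```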
On Appropriate Assumptions to Mine Data Streams: Analysis and Practice
"... Recent years have witnessed an increasing number of studies in stream mining, which aim at building an accurate model for continuously arriving data. Somehow most existing work makes the implicit assumption that the training data and the yet-to-come testing data are always sampled from the “same dis ..."
Cited by 27 (3 self)
Recent years have witnessed an increasing number of studies in stream mining, which aim at building an accurate model for continuously arriving data. Yet most existing work makes the implicit assumption that the training data and the yet-to-come testing data are always sampled from the “same distribution”, and that this “same distribution” evolves over time. We demonstrate that this may not be true, and that one may never know either “how” or “when” the distribution changes. Thus, a model that fits well on the observed distribution can have unsatisfactory accuracy on the incoming data. Practically, one can assume only the bare minimum: that learning from observed data is better than both random guessing and always predicting exactly the same class label. Importantly, we formally and …
Anytime classification using the nearest neighbor algorithm with applications to stream mining
- IEEE International Conference on Data Mining (ICDM), 2006
"... For many real world problems we must perform classification under widely varying amounts of computational resources. For example, if asked to classify an instance taken from a bursty stream, we may have from milliseconds to minutes to return a class prediction. For such problems an anytime algorithm ..."
Cited by 24 (12 self)
For many real-world problems we must perform classification under widely varying amounts of computational resources. For example, if asked to classify an instance taken from a bursty stream, we may have anywhere from milliseconds to minutes to return a class prediction. For such problems an anytime algorithm may be especially useful. In this work we show how to convert the ubiquitous nearest neighbor classifier into an anytime algorithm that can produce an instant classification or, given the luxury of additional time, can use that time to increase classification accuracy. We demonstrate the utility of our approach with a comprehensive set of experiments on data from diverse domains.
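A minimal sketch of the anytime idea: scan training instances in a precomputed priority order and return the best neighbor found when the deadline arrives. How to build a good ordering is the paper's contribution; here the ordering is simply given, and all names are illustrative.

```python
import time

def anytime_1nn(query, train, order, deadline):
    """Anytime 1-NN: scan training points in a precomputed priority order
    and return the best label found so far when the deadline arrives.

    `train` is a list of (vector, label); `order` ranks instances by
    expected usefulness.
    """
    best_dist, best_label = float("inf"), None
    for idx in order:
        if time.monotonic() >= deadline:
            break  # interrupted: answer with the best neighbor seen so far
        vec, label = train[idx]
        dist = sum((a - b) ** 2 for a, b in zip(query, vec))
        if dist < best_dist:
            best_dist, best_label = dist, label
    return best_label

# Usage: allow at most 2 milliseconds for this prediction.
train = [((0.0, 0.0), "a"), ((1.0, 1.0), "b"), ((0.1, 0.2), "a")]
label = anytime_1nn((0.05, 0.1), train, order=[2, 0, 1],
                    deadline=time.monotonic() + 0.002)
print(label)
```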
One Sketch For All: Theory and Application of Conditional Random Sampling
"... Conditional Random Sampling (CRS) was originally proposed for efficiently computing pairwise (l2, l1) distances, in static, large-scale, and sparse data. This study modifies the original CRS and extends CRS to handle dynamic or streaming data, which much better reflect the real-world situation than ..."
Cited by 21 (11 self)
Conditional Random Sampling (CRS) was originally proposed for efficiently computing pairwise (l2, l1) distances in static, large-scale, and sparse data. This study modifies the original CRS and extends it to handle dynamic or streaming data, which reflect the real-world situation much better than assuming static data. Compared with many other sketching algorithms for dimension reduction, such as stable random projections, CRS exhibits a significant advantage in that it is “one-sketch-for-all”. In particular, we demonstrate the effectiveness of CRS in efficiently computing the Hamming norm, the Hamming distance, the lp distance, and the χ² distance. A generic estimator and an approximate variance formula are also provided for approximating any type of distance. We recommend CRS as a promising tool for building highly scalable systems in machine learning, data mining, recommender systems, and information retrieval.
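A toy sketch of the CRS construction under simplifying assumptions (a single shared permutation, edge cases ignored): each sparse vector keeps its k nonzero entries with the smallest permuted column ids, and a pair of sketches yields a conditional random sample from which an inner product (or a distance) is scaled up. The estimator below is the simple scaled-sample version, not the paper's generic estimator.

```python
import random

def crs_sketch(vec, perm, k):
    """CRS sketch of a sparse vector.

    `vec` maps column id -> value; `perm` maps column id -> permuted id.
    Keep the k nonzero entries with the smallest permuted ids.
    """
    post = sorted((perm[i], v) for i, v in vec.items())
    return post[:k]

def crs_estimate_inner(s1, s2, dim):
    """Estimate an inner product from two CRS sketches.

    Ds is the smaller of the two sketch ranges; entries with permuted id
    below Ds behave like a random sample of Ds out of `dim` columns, so
    the sample inner product is scaled by dim / Ds.
    """
    ds = min(s1[-1][0], s2[-1][0])  # conditional sample size
    d1 = {i: v for i, v in s1 if i < ds}
    d2 = {i: v for i, v in s2 if i < ds}
    sample_inner = sum(v * d2[i] for i, v in d1.items() if i in d2)
    return sample_inner * dim / ds

# Hypothetical usage with a shared random permutation of 10,000 columns.
dim = 10_000
perm = list(range(dim)); random.shuffle(perm)
x = {i: 1.0 for i in range(0, 2000, 2)}
y = {i: 1.0 for i in range(0, 2000, 4)}
sx, sy = crs_sketch(x, perm, 200), crs_sketch(y, perm, 200)
print(crs_estimate_inner(sx, sy, dim))  # true inner product is 500
```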
Compressed counting
- CoRR
"... We propose Compressed Counting (CC) for approximating the αth frequency moments (0 < α ≤ 2) of data streams under a relaxed strict-Turnstile model, using maximallyskewed stable random projections. Estimators based on the geometric mean and the harmonic mean are developed. When α = 1, a simple cou ..."
Cited by 21 (13 self)
We propose Compressed Counting (CC) for approximating the αth frequency moments (0 < α ≤ 2) of data streams under a relaxed strict-Turnstile model, using maximally-skewed stable random projections. Estimators based on the geometric mean and the harmonic mean are developed. When α = 1, a simple counter suffices for counting the first moment (i.e., the sum). The geometric mean estimator of CC has asymptotic variance ∝ Δ = |α − 1|, capturing the intuition that the complexity should decrease as Δ = |α − 1| → 0. However, the previous classical algorithms based on symmetric stable random projections [12, 15] required O(1/ε²) space in order to approximate the αth moments within a 1 + ε factor, for any 0 < α ≤ 2 including α = 1. We show that, using the geometric mean estimator, CC requires O(1/log(1 + ε) + 2√Δ/log^{3/2}(1 + ε) + o(√Δ)) space as Δ → 0. Therefore, in the neighborhood of α = 1, the complexity of CC is essentially O(1/ε) instead of O(1/ε²). CC may be useful for estimating Shannon entropy, which can be approximated by certain functions of the αth moments with α → 1. [10, 9] suggested using α = 1 + Δ with (e.g.) Δ < 0.0001 and ε < 10⁻⁷ to rigorously ensure reasonable approximations. Thus, unfortunately, CC is “theoretically impractical” for estimating Shannon entropy, despite its empirical success reported in [16].
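For context, a minimal sketch of the classical baseline the abstract contrasts CC with: symmetric stable random projections for α = 1 (Cauchy entries) maintained under turnstile updates, with the sample-median estimator for the first moment. CC keeps the same update pattern but uses maximally-skewed stable entries and geometric/harmonic mean estimators, which this sketch does not implement; all class and function names are invented here.

```python
import math, random

def cauchy(rng):
    """Sample a standard Cauchy variate (symmetric 1-stable)."""
    return math.tan(math.pi * (rng.random() - 0.5))

class StableSketch:
    """Symmetric stable random projections for alpha = 1: maintain y = S x
    under turnstile updates and estimate F_1 = sum_i |x_i| as the sample
    median of |y_j| (median of |standard Cauchy| is 1)."""

    def __init__(self, k, seed=0):
        self.k, self.seed = k, seed
        self.y = [0.0] * k

    def _row(self, i):
        # Pseudo-random projection column for coordinate i, regenerated
        # on demand so the matrix never needs to be stored.
        rng = random.Random(f"{self.seed}-{i}")
        return [cauchy(rng) for _ in range(self.k)]

    def update(self, i, delta):            # turnstile update: x_i += delta
        for j, s in enumerate(self._row(i)):
            self.y[j] += delta * s

    def estimate_f1(self):
        a = sorted(abs(v) for v in self.y)
        return a[len(a) // 2]

# Usage: increments and decrements to a few coordinates.
sk = StableSketch(k=101)
for i, d in [(3, 5.0), (7, -2.0), (3, 1.0), (9, 4.0)]:
    sk.update(i, d)
print(sk.estimate_f1())  # close to |6| + |-2| + |4| = 12
```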
A SURVEY OF SYNOPSIS CONSTRUCTION IN DATA STREAMS
"... The large volume of data streams poses unique space and time constraints on the computation process. Many query processing, database operations, and mining algorithms require efficient execution which can be difficult to achieve with a fast data stream. In many cases, it may be acceptable to generat ..."
Cited by 15 (2 self)
The large volume of data streams poses unique space and time constraints on the computation process. Many query processing, database operations, and mining algorithms require efficient execution, which can be difficult to achieve with a fast data stream. In many cases, it may be acceptable to generate approximate solutions for such problems. In recent years a number of synopsis structures have been developed which can be used in conjunction with a variety of mining and query processing techniques in data stream processing. Key synopsis methods include sampling, wavelets, sketches, and histograms. In this chapter, we will provide a survey of the key synopsis techniques and the mining techniques supported by such methods. We will discuss the challenges and tradeoffs associated with using different kinds of techniques, and the important research directions for synopsis construction.
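As one concrete example of the sketch synopses surveyed here, a minimal Count-Min sketch (Cormode and Muthukrishnan): d rows of w counters indexed by independent hashes, where a point query returns the minimum counter and therefore never underestimates a frequency. Parameter names follow convention; the implementation details are a sketch of my own.

```python
import random

class CountMinSketch:
    """Minimal Count-Min sketch: d hash rows of w counters each. A point
    query overestimates a true frequency by at most (e/w) * N of the total
    count N, with probability at least 1 - exp(-d)."""

    def __init__(self, w, d, seed=0):
        self.w, self.d = w, d
        rng = random.Random(seed)
        self.salts = [rng.getrandbits(64) for _ in range(d)]
        self.rows = [[0] * w for _ in range(d)]

    def _cols(self, item):
        return (hash((salt, item)) % self.w for salt in self.salts)

    def add(self, item, count=1):
        for row, col in zip(self.rows, self._cols(item)):
            row[col] += count

    def query(self, item):
        return min(row[col] for row, col in zip(self.rows, self._cols(item)))

# Usage: approximate frequencies of items in a stream with 4 x 272 counters.
cms = CountMinSketch(w=272, d=4)
for item in ["a"] * 100 + ["b"] * 10 + list("cdefgh"):
    cms.add(item)
print(cms.query("a"), cms.query("b"))  # ~100, ~10 (never underestimates)
```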
Using Association Rules for Fraud Detection in Web Advertising Networks
- 2005
"... Discovering associations between elements oc-curring in a stream is applicable in numerous applications, including predictive caching and fraud detection. These applications require a new model of association between pairs of el-ements in streams. We develop an algorithm, Streaming-Rules, to report ..."
Cited by 14 (5 self)
Discovering associations between elements occurring in a stream is applicable in numerous applications, including predictive caching and fraud detection. These applications require a new model of association between pairs of elements in streams. We develop an algorithm, Streaming-Rules, to report association rules with tight guarantees on errors, using limited processing per element and minimal space. The modular design of Streaming-Rules allows for integration with current stream management systems, since it employs existing techniques for finding frequent elements. The presentation emphasizes the applicability of the algorithm to fraud detection in advertising networks. Such fraud instances have not been successfully detected by current techniques. Our experiments on synthetic data demonstrate scalability and efficiency. On real data, potential fraud was discovered.
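The abstract says Streaming-Rules employs existing frequent-element techniques; as an illustration of that building block, here is the Space-Saving counter idea applied to (antecedent, consequent) pairs. This is not the full algorithm, which adds tight error guarantees and a more careful pair-tracking scheme; the click data below is made up.

```python
def space_saving(stream, m):
    """Space-Saving frequent-element counters: keep at most m counters.

    When a new element arrives with no free counter, it evicts the smallest
    counter and inherits its count, so reported counts overestimate true
    frequencies by at most the minimum counter value.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < m:
            counters[item] = 1
        else:
            # Evict the smallest counter and inherit its count (+1).
            victim = min(counters, key=counters.get)
            counters[item] = counters.pop(victim) + 1
    return counters

# Usage: track pairs (x followed by y) to approximate rule support x -> y.
clicks = ["ad1", "buy", "ad1", "buy", "ad2", "ad1", "buy"]
pairs = zip(clicks, clicks[1:])
print(space_saving(pairs, m=4))
```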