Results 1–10 of 113
BBM: Bayesian Browsing Model from Petabyte-scale Data
Cited by 21 (4 self)
Abstract
Given a quarter petabyte of click log data, how can we estimate the relevance of each URL for a given query? In this paper, we propose the Bayesian Browsing Model (BBM), a new modeling technique with the following advantages: (a) it does exact inference; (b) it is single-pass and parallelizable; (c) it is effective. We present two sets of experiments to test model effectiveness and efficiency. On the first set of over 50 million search instances of 1.1 million distinct queries, BBM outperforms the state-of-the-art competitor by 29.2% in log-likelihood while being 57 times faster. On the second click-log set, spanning a quarter petabyte of data, we showcase the scalability of BBM: we implemented it on a commercial MapReduce cluster, and it took only 3 hours to compute the relevance for 1.15 billion distinct query-URL pairs.
Resource-aware knowledge discovery in data streams
 Proceedings of the First International Workshop on Knowledge Discovery in Data Streams, in conjunction with the 15th European Conference on Machine Learning (ECML 2004) and the 8th European Conference on the Principles and Practice of Knowledge Discovery in Databases
, 2004
Cited by 18 (5 self)
Abstract
Mining data streams is a field of increasing interest due to the importance of its applications and the proliferation of data stream generators. Most of the streaming techniques developed so far have not addressed the need for resource-aware computing in data stream analysis. The fact that streaming information is often generated or received on board resource-constrained computational devices such as sensors and mobile devices motivates the need for resource-awareness in data stream processing systems. In this paper, we propose a generic framework that enables resource-awareness in streaming computation using algorithm granularity settings to change the resource consumption patterns periodically. This generic framework is applied to a novel threshold-based micro-clustering algorithm to test its validity and feasibility. We have termed this algorithm RA-Cluster. RA-Cluster is the first stream clustering algorithm that can adapt to the changing availability of different resources. The experimental results showed the applicability of the framework and the algorithm in terms of resource-awareness and accuracy.
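The granularity-adaptation idea above can be sketched as a small feedback controller: when resource use exceeds a budget, the algorithm coarsens (processes a smaller fraction of arrivals); when use falls well below budget, it refines again. The class name, thresholds, and step size below are illustrative assumptions, not the paper's RA-Cluster interface:

```python
class GranularityController:
    """Toy sketch of algorithm-granularity adaptation: periodically compare
    resource usage (e.g. items held by a clusterer) against a budget and
    lower or raise the fraction of stream items actually processed.
    Budget semantics and step size are hypothetical."""

    def __init__(self, budget, step=0.1):
        self.budget = budget  # assumed resource budget (e.g. max stored items)
        self.rate = 1.0       # fraction of arriving items processed
        self.step = step

    def adapt(self, current_usage):
        if current_usage > self.budget:            # over budget: coarsen
            self.rate = max(0.1, self.rate - self.step)
        elif current_usage < 0.5 * self.budget:    # ample headroom: refine
            self.rate = min(1.0, self.rate + self.step)
        return self.rate

ctrl = GranularityController(budget=1000)
ctrl.adapt(1200)   # over budget, so the processing rate drops
ctrl.adapt(300)    # under half the budget, so the rate climbs back
```

A real framework would adapt several knobs (window size, cluster count, output rate) per resource; this shows only the periodic adjust-to-budget loop.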
Online Mining of Temporal Maximal Utility Itemsets from Data Streams
 In Proc. of Annual ACM Symposium on Applied Computing
, 2010
Cited by 12 (5 self)
Abstract
Data stream mining has become an emerging research topic in the data mining field, and finding frequent itemsets is an important task in data stream mining with wide applications. Recently, utility mining has been receiving extensive attention, with two issues reconsidered: first, the utility (e.g., profit) of each item may differ in real applications; second, the frequent itemsets might not produce the highest utility. In this paper, we propose a novel algorithm named GUIDE (Generation of temporal maximal Utility Itemsets from Data strEams) which can find temporal maximal utility itemsets from data streams. A novel data structure, namely the TMUI-tree (Temporal Maximal Utility Itemset tree), is also proposed for efficiently capturing the utility of each itemset with a single scan. The main contributions of this paper are as follows: 1) GUIDE is the first one-pass utility-based algorithm for mining temporal maximal utility itemsets from data streams, and 2) the TMUI-tree is efficient and easy to maintain. The experimental results show that our approach outperforms other existing utility mining algorithms, such as the Two-Phase algorithm, in data stream environments.
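The two reconsidered issues can be made concrete with the utility computation standard in utility mining: the utility of an itemset in a transaction is the sum over its items of unit profit (external utility) times purchased quantity (internal utility). The item names and profit table below are made up for illustration:

```python
def utility(itemset, transaction, profit):
    """Utility of `itemset` in one transaction: sum of unit profit times
    quantity over its items; 0 if the itemset is not fully contained."""
    if not set(itemset) <= set(transaction):
        return 0
    return sum(profit[item] * transaction[item] for item in itemset)

profit = {"A": 3, "B": 1, "C": 10}   # external utility: profit per unit (assumed)
t = {"A": 2, "B": 5, "C": 1}         # internal utility: quantities in one transaction

utility(("A", "C"), t, profit)  # 3*2 + 10*1 = 16
utility(("A", "B"), t, profit)  # 3*2 + 1*5  = 11
```

Note how ("A", "B") appears with larger quantities yet yields lower utility than ("A", "C"): frequency and utility can rank itemsets differently, which is exactly why frequent-itemset mining alone is insufficient here.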
Budgeted Nonparametric Learning from Data Streams
Cited by 12 (7 self)
Abstract
We consider the problem of extracting informative exemplars from a data stream. Examples of this problem include exemplar-based clustering and nonparametric inference such as Gaussian process regression on massive data sets. We show that these problems require maximization of a submodular function that captures the informativeness of a set of exemplars, over a data stream. We develop an efficient algorithm, StreamGreedy, which is guaranteed to obtain a constant fraction of the value achieved by the optimal solution to this NP-hard optimization problem. We extensively evaluate our algorithm on large real-world data sets.
Network Sampling: From Static to Streaming Graphs
, 2013
Cited by 12 (3 self)
Abstract
Network sampling is integral to the analysis of social, information, and biological networks. Since many real-world networks are massive in size, continuously evolving, and/or distributed in nature, the network structure is often sampled in order to facilitate study. For these reasons, a more thorough and complete understanding of network sampling is critical to support the field of network science. In this paper, we outline a framework for the general problem of network sampling by highlighting the different objectives, populations and units of interest, and classes of network sampling methods. In addition, we propose a spectrum of computational models for network sampling methods, ranging from the traditionally studied model based on the assumption of a static domain to a more challenging model that is appropriate for streaming domains. We design a family of sampling methods based on the concept of graph induction that generalizes across the full spectrum of computational models (from static to streaming) while efficiently preserving many of the topological properties of the input graphs. Furthermore, we demonstrate how traditional static sampling algorithms can be modified for graph streams for each of the three main classes of sampling methods: node, edge, and topology-based sampling. Experimental results indicate that our proposed family of sampling methods more accurately preserves the underlying properties of the graph in both static and streaming domains. Finally, we study the impact of network sampling algorithms on the parameter estimation and performance evaluation of relational classification algorithms.
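A rough illustration of the graph-induction idea in a streaming setting (my own simplification, not the paper's exact algorithm): admit nodes from arriving edges until a target sample size is reached, then keep only edges induced between already-sampled nodes.

```python
def stream_sample_with_induction(edge_stream, n_target):
    """Sketch: grow a node sample from the first arriving edges, then
    retain only edges whose endpoints are both already sampled (the
    induction step). A real method would admit edges probabilistically."""
    nodes, kept = set(), []
    for u, v in edge_stream:
        if len(nodes) < n_target:
            nodes.update((u, v))   # still filling the node sample
            kept.append((u, v))
        elif u in nodes and v in nodes:
            kept.append((u, v))    # induced edge between sampled nodes
    return nodes, kept

edges = [(1, 2), (3, 4), (1, 3), (5, 6), (2, 3)]
nodes, kept = stream_sample_with_induction(edges, n_target=4)
# nodes == {1, 2, 3, 4}; (5, 6) is dropped, (1, 3) and (2, 3) are induced
```

The induction step is what lets the sample recover connectivity among sampled nodes without revisiting the stream, which is why it preserves topological properties better than independent edge sampling.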
Sliding Window Query Processing over Data Streams
, 2006
Cited by 11 (0 self)
Abstract
Database management systems (DBMSs) have been used successfully in traditional business applications that require persistent data storage and an efficient querying mechanism. Typically, it is assumed that the data are static unless explicitly modified or deleted by a user or application. Database queries are executed when issued and their answers reflect the current state of the data. However, emerging applications, such as sensor networks, real-time Internet traffic analysis, and online financial trading, require support for processing of unbounded data streams. The fundamental assumption of a data stream management system (DSMS) is that new data are generated continually, making it infeasible to store a stream in its entirety. At best, a sliding window of recently arrived data may be maintained, meaning that old data must be removed as time goes on. Furthermore, as the contents of the sliding windows evolve over time, it makes ...
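The eviction behaviour described above is easy to see with a count-based window (time-based windows work the same way but expire tuples by timestamp); the continuous query here, a running count of positive values, is just an illustrative choice:

```python
from collections import deque

def windowed_positive_counts(stream, window):
    """Evaluate a toy continuous query over a count-based sliding window:
    as each tuple arrives, the oldest falls out once `window` is exceeded."""
    buf = deque(maxlen=window)   # deque evicts the expired tuple automatically
    results = []
    for item in stream:
        buf.append(item)
        results.append(sum(1 for x in buf if x > 0))
    return results

windowed_positive_counts([1, -1, 2, 3, -5], window=3)  # [1, 1, 2, 2, 2]
```

Note the answer at each step reflects only the window's current contents; once the leading 1 expires, it no longer contributes, which is exactly the re-evaluation problem sliding-window query processing must handle efficiently.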
Towards visual sedimentation
 VisWeek 2012 Electronic Conference Proceedings
, 2012
Cited by 7 (3 self)
Abstract
Fig. 1. The Visual Sedimentation metaphor applied to a bar chart (left), a pie chart (center), and a bubble chart (right).
We introduce Visual Sedimentation, a novel design metaphor for visualizing data streams directly inspired by the physical process of sedimentation. Visualizing data streams (e.g., tweets, RSS, emails) is challenging as incoming data arrive at unpredictable rates and have to remain readable. For data streams, clearly expressing chronological order while avoiding clutter, and keeping aging data visible, are important. The metaphor is drawn from real-world sedimentation processes: objects fall due to gravity and aggregate into strata over time. Inspired by this metaphor, data are visually depicted as falling objects using a force model to land on a surface, aggregating into strata over time. In this paper, we discuss how this metaphor addresses the specific challenge of smoothing the transition between incoming and aging data. We describe the metaphor’s design space, a toolkit developed to facilitate its implementation, and example applications to a range of case studies. We then explore the generative capabilities of the design space through our toolkit. We finally illustrate creative extensions of the metaphor when applied to real streams of data.
Index Terms—Design, information visualization, dynamic visualization, dynamic data, data stream, real time, metaphor
Finding surprising patterns in textual data streams
 In: 2010 IAPR Workshop on Cognitive Information Processing (2010)
Cited by 7 (2 self)
Abstract
We address the task of detecting surprising patterns in large textual data streams. These can reveal events in the real world when the data streams are generated by online news media, emails, Twitter feeds, movie subtitles, scientific publications, and more. The volume of text in such streams often exceeds human capacity for analysis, so automatic pattern recognition tools are indispensable. In particular, we are interested in surprising changes in the frequency of n-grams of words, or more generally of symbols from an alphabet of unbounded size. Despite the exponentially large number of possible n-grams in the size of the alphabet (which is itself unbounded), we show how these can be detected efficiently. To this end, we rely on a data structure known as a generalised suffix tree, which is additionally annotated with a limited amount of statistical information. Crucially, we show how the generalised suffix tree as well as these statistical annotations can be efficiently updated in an online fashion.
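A crude stand-in for the surprise test (a frequency-ratio check on character bigrams with add-one smoothing; the paper's suffix-tree machinery instead maintains such statistics incrementally, for all n-gram lengths at once):

```python
from collections import Counter

def surprising_ngrams(old_text, new_text, n=2, min_ratio=3.0):
    """Flag n-grams whose relative frequency in the new window is at least
    `min_ratio` times their smoothed frequency in the old window.
    The threshold and smoothing scheme are illustrative choices."""
    def freqs(text):
        grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
        return grams, sum(grams.values())

    old, old_total = freqs(old_text)
    new, new_total = freqs(new_text)
    flagged = []
    for g, c in new.items():
        p_new = c / new_total
        p_old = (old[g] + 1) / (old_total + 1)   # add-one smoothing for unseen grams
        if p_new / p_old >= min_ratio:
            flagged.append(g)
    return flagged

surprising_ngrams("abcabcabc", "xyxyxyxy")  # "xy" and "yx" never occurred before
```

Recomputing counts from scratch like this is quadratic in stream length; the annotated generalised suffix tree is what makes the online version tractable.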
Streaming Submodular Maximization: Massive Data Summarization on the Fly
, 2014
Cited by 7 (3 self)
Abstract
How can one summarize a massive data set “on the fly”, i.e., without even having seen it in its entirety? In this paper, we address the problem of extracting representative elements from a large stream of data. That is, we would like to select a subset of, say, k data points from the stream that are most representative according to some objective function. Many natural notions of “representativeness” satisfy submodularity, an intuitive notion of diminishing returns. Thus, such problems can be reduced to maximizing a submodular set function subject to a cardinality constraint. Classical approaches to submodular maximization require full access to the data set. We develop the first efficient streaming algorithm with a constant-factor 1/2 − ε approximation guarantee to the optimum solution, requiring only a single pass through the data and memory independent of data size. In our experiments, we extensively evaluate the effectiveness of our approach on several applications, including training large-scale kernel methods and exemplar-based clustering, on millions of data points. We observe that our streaming method, while achieving practically the same utility value, runs about 100 times faster than previous work.
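The single-pass thresholding idea behind the guarantee can be sketched as follows. The full algorithm runs many guesses of the optimal value in parallel; this simplification assumes a single guess `opt_guess` is given and uses set coverage as the monotone submodular function f:

```python
def threshold_stream(stream, f, k, opt_guess):
    """One-guess sketch of sieve-style streaming submodular maximization:
    keep an arriving element only if its marginal gain clears an adaptive
    threshold derived from the guessed optimum. f must be monotone submodular."""
    S = []
    for e in stream:
        if len(S) == k:
            break
        gain = f(S + [e]) - f(S)
        # threshold rises as f(S) grows, so late picks must still be valuable
        if gain >= (opt_guess / 2 - f(S)) / (k - len(S)):
            S.append(e)
    return S

# Coverage (number of distinct items a family of sets covers) is submodular.
coverage = lambda S: len(set().union(*S)) if S else 0
picked = threshold_stream([{1}, {1, 2, 3}, {4, 5}, {6}], coverage, k=2, opt_guess=5)
# the small first set fails the threshold; {1,2,3} and {4,5} are kept
```

Greedily taking the first k elements would have wasted a slot on {1}; the adaptive threshold is what preserves a constant fraction of the optimum in one pass with memory independent of stream length.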