Results 1–10 of 44
Bypassing the embedding: Algorithms for low-dimensional metrics
In Proceedings of the 36th ACM Symposium on the Theory of Computing (STOC), 2004
"... The doubling dimension of a metric is the smallest k such that any ball of radius 2r can be covered using 2 k balls of radius r. This concept for abstract metrics has been proposed as a natural analog to the dimension of a Euclidean space. If we could embed metrics with low doubling dimension into l ..."
Abstract

Cited by 77 (3 self)
 Add to MetaCart
(Show Context)
The doubling dimension of a metric is the smallest k such that any ball of radius 2r can be covered using 2^k balls of radius r. This concept for abstract metrics has been proposed as a natural analog to the dimension of a Euclidean space. If we could embed metrics with low doubling dimension into low-dimensional Euclidean spaces, they would inherit several algorithmic and structural properties of the Euclidean spaces. Unfortunately, such a restriction on dimension does not suffice to guarantee embeddability in a normed space. In this paper we explore the option of bypassing the embedding. In particular, we show the following for low-dimensional metrics:
• A quasi-polynomial-time (1+ε)-approximation algorithm for various optimization problems such as TSP, k-median, and facility location.
• A (1+ε)-approximate distance labeling scheme with optimal label length.
• A (1+ε)-stretch routing scheme with polylogarithmic storage.
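The covering-number definition above can be checked empirically on a finite point set. A minimal sketch (function name, sample points, and the greedy-net strategy are my own; it gives a rough upper bound on the doubling constant 2^k by covering each 2r-ball with r-balls centered at data points):

```python
import math

def doubling_constant(points, dist, radii):
    """Rough empirical upper bound on the doubling constant of a finite
    metric: for each point p and radius r, greedily cover the ball
    B(p, 2r) with radius-r balls centered at data points."""
    worst = 1
    for r in radii:
        for p in points:
            ball = [q for q in points if dist(p, q) <= 2 * r]
            centers = []
            for q in ball:  # greedy r-net: every point in the ball ends
                if all(dist(q, c) > r for c in centers):  # up within r of
                    centers.append(q)                     # some center
            worst = max(worst, len(centers))
    return worst

# Illustrative example: a few points in the plane under Euclidean distance.
pts = [(0, 0), (1, 0), (0, 1), (2, 2), (3, 1)]
euclid = lambda a, b: math.hypot(a[0] - b[0], a[1] - b[1])
k = doubling_constant(pts, euclid, radii=[0.5, 1.0, 2.0])
print("doubling dimension estimate:", math.log2(k))
```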
An Optimal Algorithm for the Distinct Elements Problem
"... We give the first optimal algorithm for estimating the number of distinct elements in a data stream, closing a long line of theoretical research on this problem begun by Flajolet and Martin in their seminal paper in FOCS 1983. This problem has applications to query optimization, Internet routing, ne ..."
Abstract

Cited by 67 (7 self)
 Add to MetaCart
(Show Context)
We give the first optimal algorithm for estimating the number of distinct elements in a data stream, closing a long line of theoretical research on this problem begun by Flajolet and Martin in their seminal paper in FOCS 1983. This problem has applications to query optimization, Internet routing, network topology, and data mining. For a stream of indices in {1, ..., n}, our algorithm computes a (1 ± ε)-approximation using an optimal O(ε⁻² + log n) bits of space with 2/3 success probability, where 0 < ε < 1 is given. This probability can be amplified by independent repetition. Furthermore, our algorithm processes each stream update in O(1) worst-case time, and can report an estimate at any point mid-stream in O(1) worst-case time, thus settling both the space and time complexities simultaneously.
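This is not the paper's optimal algorithm, which needs considerably more machinery, but the idea it refines is easy to sketch: estimate distinct counts from hash values of the stream. A minimal k-minimum-values (KMV) estimator, with all names illustrative:

```python
import hashlib
import heapq

class KMV:
    """k-minimum-values sketch: keep the k smallest hash values seen;
    if the k-th smallest hash (as a fraction of the hash range) is v,
    estimate the number of distinct elements as (k - 1) / v."""
    def __init__(self, k=256):
        self.k = k
        self.heap = []      # max-heap via negation: k smallest hashes seen
        self.stored = set() # hash values currently in the heap

    def _hash(self, x):
        h = hashlib.sha1(str(x).encode()).digest()
        return int.from_bytes(h[:8], "big") / 2**64  # uniform in [0, 1)

    def add(self, x):
        v = self._hash(x)
        if v in self.stored:
            return                              # duplicate of a stored hash
        if len(self.heap) < self.k:
            heapq.heappush(self.heap, -v); self.stored.add(v)
        elif v < -self.heap[0]:                 # smaller than current k-th min
            old = -heapq.heappushpop(self.heap, -v)
            self.stored.discard(old); self.stored.add(v)

    def estimate(self):
        if len(self.heap) < self.k:
            return len(self.heap)               # saw fewer than k distinct
        return (self.k - 1) / (-self.heap[0])

sketch = KMV(k=256)
for i in range(100_000):
    sketch.add(i % 10_000)                      # 10,000 distinct values
print(round(sketch.estimate()))                 # ≈ 10000
```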
Streaming and sublinear approximation of entropy and information distances
In ACM-SIAM Symposium on Discrete Algorithms, 2006
"... In most algorithmic applications which compare two distributions, information theoretic distances are more natural than standard ℓp norms. In this paper we design streaming and sublinear time property testing algorithms for entropy and various information theoretic distances. Batu et al posed the pr ..."
Abstract

Cited by 67 (13 self)
 Add to MetaCart
(Show Context)
In most algorithmic applications which compare two distributions, information-theoretic distances are more natural than standard ℓp norms. In this paper we design streaming and sublinear-time property testing algorithms for entropy and various information-theoretic distances. Batu et al. posed the problem of property testing with respect to the Jensen-Shannon distance. We present optimal algorithms for estimating bounded, symmetric f-divergences (including the Jensen-Shannon divergence and the Hellinger distance) between distributions in various property testing frameworks. Along the way, we close a (log n)/H gap between the upper and lower bounds for estimating entropy H, yielding an optimal algorithm over all values of the entropy. In a data stream setting (sublinear space), we give the first algorithm for estimating the entropy of a distribution. Our algorithm runs in polylogarithmic space and yields an asymptotic constant-factor approximation scheme. An integral part of the algorithm is an interesting use of an F0 (the number of distinct elements in a set) estimation algorithm; we also provide other results along the space/time/approximation tradeoff curve. Our results have interesting structural implications that connect sublinear-time and space-constrained algorithms. The mediating model is the random-order streaming model, which assumes the input is a random permutation of a multiset and was first considered by Munro and Paterson in 1980. We show that any property testing algorithm in the combined oracle model for calculating a permutation-invariant function can be simulated in the random-order model in a single pass. This addresses a question raised by Feigenbaum et al. regarding the relationship between property testing and stream algorithms. Further, we give a polylog-space PTAS for estimating the entropy of a one-pass random-order stream. This bound cannot be achieved in the combined oracle (generalized property testing) model.
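For reference, the two bounded symmetric f-divergences named above, computed directly from explicit distributions. The paper's contribution is estimating these from samples and streams; this sketch only pins down the definitions:

```python
import math

def jensen_shannon(p, q):
    """Jensen-Shannon divergence between two discrete distributions given
    as equal-length probability vectors (base-2 logs, so JS is in [0, 1])."""
    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def hellinger(p, q):
    """Hellinger distance, another bounded symmetric f-divergence."""
    return math.sqrt(sum((math.sqrt(x) - math.sqrt(y)) ** 2
                         for x, y in zip(p, q)) / 2)

p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]
print(jensen_shannon(p, q), hellinger(p, q))
```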
Algorithms for Distributed Functional Monitoring
2008
"... We study what we call functional monitoring problems. We have k players each tracking their inputs, say player i tracking a multiset Ai(t) up until time t, and communicating with a central coordinator. The coordinator’s task is to monitor a given function f computed over the union of the inputs ∪iAi ..."
Abstract

Cited by 60 (12 self)
 Add to MetaCart
We study what we call functional monitoring problems. We have k players, each tracking their inputs, say player i tracking a multiset Ai(t) up until time t, and communicating with a central coordinator. The coordinator’s task is to monitor a given function f computed over the union of the inputs ∪iAi(t), continuously at all times t. The goal is to minimize the number of bits communicated between the players and the coordinator. A simple example is when f is the sum, and the coordinator is required to alert when the sum of a distributed set of values exceeds a given threshold τ. Of interest is the approximate version where the coordinator outputs 1 if f ≥ τ and 0 if f ≤ (1 − ε)τ. This defines the (k, f, τ, ε) distributed functional monitoring problem. Functional monitoring problems are fundamental in distributed systems, in particular sensor networks, where we must minimize communication; they also connect to problems in communication complexity, communication theory, and signal processing. Yet few formal bounds are known for functional monitoring. We give upper and lower bounds for the (k, f, τ, ε) problem for some of the basic f’s. In particular, we study frequency moments (F0, F1, F2). For F0 and F1, we obtain continuous monitoring algorithms with costs almost the same as their one-shot computation algorithms. However, for F2 the monitoring problem seems much harder. We give a carefully constructed multi-round algorithm that uses “sketch summaries” at multiple levels of detail and solves the (k, F2, τ, ε) problem with communication Õ(k²/ε + (√k/ε)³). Since frequency moment estimation is central to other problems, our results have immediate applications to histograms, wavelet computations, and others. Our algorithmic techniques are likely to be useful for other functional monitoring problems as well.
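For the F1 (sum) case the folklore baseline is simple enough to sketch: each player stays silent until its local count has grown by ετ/k since its last report, so the coordinator's view undercounts the true sum by less than ετ. A toy version with illustrative class names (not the paper's tighter constructions):

```python
import random

class Player:
    """One site; reports only when its count grows by `slack`."""
    def __init__(self, slack):
        self.count, self.reported, self.slack = 0, 0, slack

    def update(self, delta):
        self.count += delta
        if self.count - self.reported >= self.slack:
            msg = self.count - self.reported
            self.reported = self.count
            return msg                  # increment sent to the coordinator
        return None

class Coordinator:
    def __init__(self, k, tau, eps):
        self.tau, self.eps, self.total = tau, eps, 0
        self.players = [Player(eps * tau / k) for _ in range(k)]

    def feed(self, pid, delta=1):
        msg = self.players[pid].update(delta)
        if msg is not None:
            self.total += msg
        # Reported counts undercount the true sum by < k*(eps*tau/k) = eps*tau,
        # so firing once the reported sum reaches (1-eps)*tau never misses a
        # true crossing of tau.
        return self.total >= (1 - self.eps) * self.tau

coord = Coordinator(k=10, tau=1000, eps=0.1)
fired = False
while not fired:
    fired = coord.feed(random.randrange(10))
```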
Fast and Approximate Stream Mining of Quantiles and Frequencies Using Graphics Processors
2005
"... We present algorithms for fast quantile and frequency estimation in large data streams using graphics processor units (GPUs). We exploit the high computational power and memory bandwidth of graphics processors and present a novel sorting algorithm that performs rasterization operations on the GPUs. ..."
Abstract

Cited by 56 (11 self)
 Add to MetaCart
(Show Context)
We present algorithms for fast quantile and frequency estimation in large data streams using graphics processing units (GPUs). We exploit the high computational power and memory bandwidth of graphics processors and present a novel sorting algorithm that performs rasterization operations on the GPU. We use sorting as the main computational component for histogram approximation and the construction of ε-approximate quantile and frequency summaries. Our overall algorithms for computing numerical statistics on data streams are deterministic, applicable to fixed- or variable-sized sliding windows, and use a limited memory footprint. We use the GPU as a coprocessor and minimize the data transmission between the CPU and GPU by taking into account the low bus bandwidth. We have implemented our algorithms on a PC with an NVIDIA GeForce FX 6800 Ultra GPU and a 3.4 GHz Pentium IV CPU and applied them to large data streams consisting of more than 100 million values. We have compared their performance against optimized implementations of prior CPU-based algorithms. Overall, our results demonstrate that the graphics processor available on a commodity computer system is an efficient stream processor and a useful coprocessor for mining data streams.
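Once the data is sorted, the ε-approximate quantile summary mentioned above has a very small core (the paper's contribution is doing the sort fast on the GPU). A CPU-only toy with illustrative names:

```python
import math

def build_summary(window, eps):
    """Sorted eps-approximate quantile summary: keep every
    ceil(eps * n)-th element of the sorted window."""
    data = sorted(window)            # the paper performs this sort on the GPU
    step = max(1, math.ceil(eps * len(data)))
    return data[::step]

def quantile(summary, phi):
    """Value whose rank in the original window is within eps*n of phi*n."""
    idx = min(len(summary) - 1, round(phi * (len(summary) - 1)))
    return summary[idx]

s = build_summary(range(100_000), eps=0.01)
print(len(s), quantile(s, 0.5))      # ~100 stored values; median ≈ 50000
```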
Estimating PageRank on Graph Streams
"... This study focuses on computations on large graphs (e.g., the webgraph) where the edges of the graph are presented as a stream. The objective in the streaming model is to use small amount of memory (preferably sublinear in the number of nodes n) and a few passes. In the streaming model, we show ho ..."
Abstract

Cited by 44 (6 self)
 Add to MetaCart
(Show Context)
This study focuses on computations on large graphs (e.g., the web-graph) where the edges of the graph are presented as a stream. The objective in the streaming model is to use a small amount of memory (preferably sublinear in the number of nodes n) and a few passes. In the streaming model, we show how to perform several graph computations, including estimating the probability distribution after a random walk of length l, the mixing time, and the conductance. We estimate the mixing time M of a random walk in Õ(nα + Mα√(Mn/α)) space and Õ(√(M/α)) passes. Furthermore, the relation between mixing time and conductance gives us an estimate for the conductance of the graph. By applying our algorithm for computing probability distributions on the web-graph, we can estimate the PageRank p of any node up to an additive error of √(εp) in Õ(√(M/α)) passes and Õ(min(nα + …)) space.
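Not the paper's streaming algorithm, but the Monte Carlo view such estimators build on: a node's PageRank is approximately the probability that a random walk with restarts ends at that node. An in-memory sketch with illustrative names:

```python
import random
from collections import Counter

def pagerank_mc(adj, walks=10_000, restart=0.15, length=50):
    """adj: dict node -> list of out-neighbors. Estimates PageRank by
    recording the endpoints of independent random walks with restart."""
    hits = Counter()
    nodes = list(adj)
    for _ in range(walks):
        v = random.choice(nodes)
        for _ in range(length):
            if random.random() < restart or not adj[v]:
                v = random.choice(nodes)     # teleport (or dangling node)
            else:
                v = random.choice(adj[v])    # follow a random out-edge
        hits[v] += 1
    return {u: hits[u] / walks for u in nodes}

g = {1: [2], 2: [3], 3: [1, 2]}
print(pagerank_mc(g))
```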
Analyzing Graph Structure via Linear Measurements
"... We initiate the study of graph sketching, i.e., algorithms that use a limited number of linear measurements of a graph to determine the properties of the graph. While a graph on n nodes is essentially O(n 2)dimensional, we show the existence of a distribution over random projections into ddimensio ..."
Abstract

Cited by 41 (11 self)
 Add to MetaCart
We initiate the study of graph sketching, i.e., algorithms that use a limited number of linear measurements of a graph to determine the properties of the graph. While a graph on n nodes is essentially O(n²)-dimensional, we show the existence of a distribution over random projections into d-dimensional “sketch” space (d ≪ n²) such that the relevant properties of the original graph can be inferred from the sketch with high probability. Specifically, we show that:
1. d = O(n · polylog n) suffices to evaluate properties including connectivity, k-connectivity, and bipartiteness, and to return any constant approximation of the weight of the minimum spanning tree.
2. d = O(n^(1+γ)) suffices to compute graph sparsifiers, the exact MST, and approximate maximum weighted matchings if we permit O(1/γ)-round adaptive sketches, i.e., a sequence of projections where each projection may be chosen dependent on the outcome of earlier sketches.
Our results have two main applications, both of which have the potential to give rise to fruitful lines of further research. First, our results can be thought of as giving the first compressed-sensing-style algorithms for graph data. Secondly, our work initiates the study of dynamic graph streams. There is already an extensive literature on processing massive graphs in the data-stream model. However, the existing work focuses on graphs defined by a sequence of inserted edges and does not consider edge deletions. We think this is a curious omission given the existing work on both dynamic graphs in the non-streaming setting and dynamic geometric streaming. Our results include the first dynamic graph semi-streaming algorithms for connectivity, spanning trees, sparsification, and matching problems.
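The defining property, linearity, is worth seeing concretely: encode the graph as an edge-indicator vector, and a random projection of that vector handles deletions as −1 updates, while sketches of separate update streams simply add. A toy demonstration (the paper's actual sketches are structured, e.g., built from ℓ0-samplers, to support the queries listed above; names here are illustrative):

```python
import random

class LinearSketch:
    """Random +/-1 projection of an edge-indicator vector."""
    def __init__(self, num_edge_slots, d, seed=0):
        rng = random.Random(seed)               # shared randomness
        self.proj = [[rng.choice((-1, 1)) for _ in range(num_edge_slots)]
                     for _ in range(d)]
        self.s = [0] * d

    def update(self, edge_id, weight):          # weight +1 insert, -1 delete
        for i in range(len(self.s)):
            self.s[i] += weight * self.proj[i][edge_id]

a, b = LinearSketch(100, 10), LinearSketch(100, 10)
a.update(5, +1); a.update(7, +1)                # two insertions
b.update(7, -1)                                 # a deletion arriving later
combined = [x + y for x, y in zip(a.s, b.s)]    # sketch of the net graph
print(combined == [w * p[5] for w, p in zip([1] * 10, a.proj)])  # only edge 5 remains
```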
Estimating statistical aggregates on probabilistic data streams
ACM Trans. Database Syst., 2008
"... The probabilistic stream model was introduced by Jayram, Kale, and Vee [2007]. It is a generalization of the data stream model that is suited to handling “probabilistic ” data, where each item of the stream represents a probability distribution over a set of possible events. Therefore, a probabilist ..."
Abstract

Cited by 39 (5 self)
 Add to MetaCart
The probabilistic stream model was introduced by Jayram, Kale, and Vee [2007]. It is a generalization of the data stream model that is suited to handling “probabilistic” data, where each item of the stream represents a probability distribution over a set of possible events. Therefore, a probabilistic stream determines a distribution over a potentially exponential number of classical “deterministic” streams, where each item is deterministically one of the domain values. We present algorithms for computing commonly used aggregates on a probabilistic stream. We present the first one-pass streaming algorithms for estimating the expected mean of a probabilistic stream. Next, we consider the problem of estimating frequency moments for probabilistic data. We propose a general approach to obtain unbiased estimators working over probabilistic data by utilizing unbiased estimators designed for standard streams. Applying this approach, we extend a classical data stream algorithm to obtain a one-pass algorithm for estimating F2, the second frequency moment. We present the first known streaming algorithms for estimating F0, the number of distinct items, on probabilistic streams. Our work also gives an efficient one-pass algorithm for estimating the median and a two-pass algorithm for estimating the range.
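To see why some aggregates are easy in this model: by linearity of expectation, the expected SUM of a probabilistic stream is computable exactly in one pass, whereas the expected mean divides by a random stream length, which is why it needs the estimation machinery above. A toy illustration (the input format is my own assumption):

```python
def expected_sum(stream):
    """stream yields items as lists of (value, probability) outcomes;
    probabilities may sum to < 1 (the item may fail to appear at all)."""
    total = 0.0
    for outcomes in stream:
        # linearity of expectation: each item contributes its expected value
        total += sum(v * p for v, p in outcomes)
    return total

items = [
    [(10.0, 0.5), (20.0, 0.5)],   # appears with value 10 or 20
    [(5.0, 0.4)],                 # appears with prob 0.4, else absent
]
print(expected_sum(items))        # 0.5*10 + 0.5*20 + 0.4*5 = 17.0
```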
Approximation algorithms for clustering uncertain data
In PODS Conference, 2008
"... There is an increasing quantity of data with uncertainty arising from applications such as sensor network measurements, record linkage, and as output of mining algorithms. This uncertainty is typically formalized as probability density functions over tuple values. Beyond storing and processing such ..."
Abstract

Cited by 39 (1 self)
 Add to MetaCart
(Show Context)
There is an increasing quantity of data with uncertainty arising from applications such as sensor network measurements, record linkage, and the output of mining algorithms. This uncertainty is typically formalized as probability density functions over tuple values. Beyond storing and processing such data in a DBMS, it is necessary to perform other data analysis tasks such as data mining. We study the core mining problem of clustering on uncertain data, and define appropriate natural generalizations of standard clustering optimization criteria. Two variations arise, depending on whether a point is automatically associated with its optimal center, or whether it must be assigned to a fixed cluster no matter where it is actually located. For uncertain versions of k-means and k-median, we show reductions to their corresponding weighted versions on data with no uncertainties. These are simple in the unassigned case, but require some care for the assigned version. Our most interesting results are for uncertain k-center, which generalizes both the traditional k-center and k-median objectives. We show a variety of bicriteria approximation algorithms. One picks O(kε⁻¹ log² n) centers and achieves a (1 + ε)-approximation to the best uncertain k-centers. Another picks 2k centers and achieves a constant-factor approximation. Collectively, these results are the first known guaranteed approximation algorithms for the problems of clustering uncertain data.
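A sketch of the flavor of reduction mentioned above, under stated assumptions (discrete 1-D pdfs; the paper's precise handling of the assigned and unassigned variants differs): when an uncertain point X is tied wholesale to a center c, E|X − c|² = |E[X] − c|² + Var(X), so replacing each pdf by its mean, weighted by its total probability mass, gives a weighted k-means instance equivalent up to an additive constant:

```python
def reduce_to_weighted(uncertain_points):
    """Each uncertain point is a list of (location, probability) outcomes;
    the variances contribute only an additive, center-independent constant."""
    weighted = []
    for pdf in uncertain_points:
        mass = sum(p for _, p in pdf)
        mean = sum(x * p for x, p in pdf) / mass
        weighted.append((mean, mass))   # (representative point, weight)
    return weighted

pts = [
    [(0.0, 0.5), (2.0, 0.5)],   # expected location 1.0
    [(4.0, 1.0)],               # a certain point
]
print(reduce_to_weighted(pts))  # [(1.0, 1.0), (4.0, 1.0)]
```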
Conquering the divide: Continuous clustering of distributed data streams
In Intl. Conf. on Data Engineering, 2007
"... Data is often collected over a distributed network, but in many cases, is so voluminous that it is impractical and undesirable to collect it in a central location. Instead, we must perform distributed computations over the data, guaranteeing high quality answers even as new data arrives. In this pap ..."
Abstract

Cited by 35 (4 self)
 Add to MetaCart
(Show Context)
Data is often collected over a distributed network, but in many cases is so voluminous that it is impractical and undesirable to collect it in a central location. Instead, we must perform distributed computations over the data, guaranteeing high-quality answers even as new data arrives. In this paper, we formalize and study the problem of maintaining a clustering of such distributed data that is continuously evolving. In particular, our goal is to minimize the communication and computational cost while still providing guaranteed accuracy of the clustering. We focus on k-center clustering, and provide a suite of algorithms that vary based on which centralized algorithm they derive from, and on whether they maintain a single global clustering or many local clusterings that can be merged together. We show that these algorithms can be designed to give accuracy guarantees that are close to the best possible even in the centralized case. In our experiments, we see clear trends among these algorithms, showing that the choice of algorithm is crucial, and that we can achieve a clustering that is as good as the best centralized clustering, with only a small fraction of the communication required to collect all the data in a single location.
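The classic centralized building block such schemes derive from is Gonzalez's farthest-point traversal, a 2-approximation for k-center; distributed variants run it locally and merge the resulting centers. A minimal sketch on 2-D points, with illustrative names:

```python
import math

def gonzalez_k_center(points, k):
    """Greedy farthest-point k-center; a 2-approximation."""
    dist = lambda a, b: math.hypot(a[0] - b[0], a[1] - b[1])
    centers = [points[0]]
    while len(centers) < k:
        # next center: the point farthest from its nearest chosen center
        far = max(points, key=lambda p: min(dist(p, c) for c in centers))
        centers.append(far)
    radius = max(min(dist(p, c) for c in centers) for p in points)
    return centers, radius

pts = [(0, 0), (1, 0), (10, 0), (10, 1), (5, 5)]
print(gonzalez_k_center(pts, k=2))
```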