Results 21–30 of 324
Better Streaming Algorithms for Clustering Problems
 In Proc. of 35th ACM Symposium on Theory of Computing (STOC)
, 2003
Abstract

Cited by 90 (1 self)
We study clustering problems in the streaming model, where the goal is to cluster a set of points by making one pass (or a few passes) over the data using a small amount of storage space. Our main result is a randomized algorithm for the k-Median problem which produces a constant factor approximation in one pass using storage space O(k polylog n). This is a significant improvement over the previous best algorithm, which yielded a 2^{O(1/ε)} approximation using O(n^ε) space.
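The one-pass idea behind streaming clustering can be illustrated with a much simpler divide-and-conquer baseline (not the paper's algorithm): reduce each small batch of the stream to k medians, keep only those representatives, and recluster them at the end. The sketch below is a toy for 1-D points; all function names are illustrative.

```python
import random

def batch_k_medians(points, k, iters=10):
    # Lloyd-style k-medians on a small in-memory batch (1-D points,
    # with the coordinate-wise median as the center update).
    centers = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda j: abs(p - centers[j]))].append(p)
        centers = [sorted(c)[len(c) // 2] if c else centers[j]
                   for j, c in enumerate(clusters)]
    return centers

def streaming_k_medians(stream, k, batch_size=100):
    # One pass over the stream: summarize each batch by k medians,
    # discard the raw points, and recluster the representatives.
    reps, batch = [], []
    for p in stream:
        batch.append(p)
        if len(batch) == batch_size:
            reps.extend(batch_k_medians(batch, k))
            batch = []
    if batch:
        reps.extend(batch_k_medians(batch, min(k, len(batch))))
    return batch_k_medians(reps, k) if len(reps) > k else reps
```

This baseline stores only O(k · m) representatives for m batches, far less than the full stream; the paper's contribution is getting a provable constant-factor guarantee in O(k polylog n) space.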
Near-optimal lower bounds on the multi-party communication complexity of set disjointness
 In IEEE Conference on Computational Complexity
, 2003
Abstract

Cited by 89 (7 self)
We study the communication complexity of the set disjointness problem in the general multi-party model. For t players, each holding a subset of a universe of size n, we establish a near-optimal lower bound of Ω(n/(t log t)) on the communication complexity of the problem of determining whether their sets are disjoint. In the more restrictive one-way communication model, in which the players are required to speak in a predetermined order, we improve our bound to an optimal Ω(n/t). These results improve upon the earlier bounds of Ω(n/t^2) in the general model, and Ω(ε^2 n/t^{1+ε}) in the one-way model, due to Bar-Yossef, Jayram, Kumar, and Sivakumar [5]. As in the case of earlier results, our bounds apply to the unique intersection promise problem. This communication problem is known to have connections with the space complexity of approximating frequency moments in the data stream model. Our results lead to an improved space complexity lower bound of Ω(n^{1−2/k}/log n) for approximating the k-th frequency moment with a constant number of passes over the input, and a technical improvement to Ω(n^{1−2/k}) if only one pass over the input is permitted. Our proofs rely on the information theoretic direct sum decomposition paradigm of Bar-Yossef et al. [5]. Our improvements stem from novel analytical techniques.
What's New: Finding Significant Differences in Network Data Streams
 in Proc. of IEEE Infocom
, 2004
Abstract

Cited by 85 (8 self)
Monitoring and analyzing network traffic usage patterns is vital for managing IP Networks. An important problem is to provide network managers with information about changes in traffic, informing them about "what's new". Specifically, we focus on the challenge of finding significantly large differences in traffic: over time, between interfaces and between routers. We introduce the idea of a deltoid: an item that has a large difference, whether the difference is absolute, relative or variational. We present novel...
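The absolute-difference case of a deltoid can be illustrated exactly, without the paper's small-space machinery, by comparing per-item counts between two traffic snapshots and flagging items whose count change exceeds a threshold. Function and variable names here are illustrative.

```python
from collections import Counter

def absolute_deltoids(stream_before, stream_after, threshold):
    # Flag items whose count changes by more than `threshold` in
    # absolute value between the two streams. This exact version keeps
    # full per-item counts; the paper's point is detecting such items
    # in small space without storing every count.
    before, after = Counter(stream_before), Counter(stream_after)
    return {x: after[x] - before[x]
            for x in before.keys() | after.keys()
            if abs(after[x] - before[x]) > threshold}
```

Relative and variational deltoids follow the same pattern with a different per-item test.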
Comparing data streams using Hamming norms (how to zero in)
, 2003
Abstract

Cited by 82 (7 self)
Massive data streams are now fundamental to many data processing applications. For example, Internet routers produce large scale diagnostic data streams. Such streams are rarely stored in traditional databases and instead must be processed “on the fly” as they are produced. Similarly, sensor networks produce multiple data streams of observations from their sensors. There is growing focus on manipulating data streams and, hence, there is a need to identify basic operations of interest in managing data streams, and to support them efficiently. We propose computation of the Hamming norm as a basic operation of interest. The Hamming norm formalizes ideas that are used throughout data processing. When applied to a single stream, the Hamming norm gives the number of distinct items that are present in that data stream, which is a statistic of great interest in databases. When applied to a pair of streams, the Hamming norm gives an important measure of (dis)similarity: the number of unequal item counts in the two streams. Hamming norms have many uses in comparing data streams. We present a novel approximation technique for estimating the Hamming norm for massive data streams; this relies on what we call the “l0 sketch” and we prove its accuracy. We test our approximation method on a large quantity of synthetic and real stream data, and show that the estimation is accurate to within a few percentage points.
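The two quantities the abstract describes are easy to pin down exactly (the paper's l0 sketch approximates them in small space). A minimal sketch of the exact definitions, with illustrative names:

```python
from collections import Counter

def hamming_norm(stream):
    # Hamming norm of one stream: the number of nonzero entries of its
    # count vector, i.e. the number of distinct items present.
    return sum(1 for v in Counter(stream).values() if v != 0)

def hamming_distance(stream_a, stream_b):
    # Hamming norm of the difference of two count vectors: the number
    # of items whose counts disagree between the two streams.
    a, b = Counter(stream_a), Counter(stream_b)
    return sum(1 for x in a.keys() | b.keys() if a[x] != b[x])
```

Note that `hamming_distance` compares counts, not just presence: two streams over the same items but with different multiplicities still differ.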
Data streaming algorithms for estimating entropy of network traffic
 In Proceedings of the joint international conference on Measurement and modeling of computer systems (SIGMETRICS). ACM
, 2006
Abstract

Cited by 72 (12 self)
• Given n flows of sizes a1, ..., an, let s ≡ Σ_i a_i. The empirical entropy is defined as H ≡ −Σ_i (a_i/s) log(a_i/s).
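The empirical entropy above can be computed directly when the flow counts fit in memory; the paper's concern is estimating it in the streaming setting without storing them. A direct version (illustrative names, log base 2):

```python
import math
from collections import Counter

def empirical_entropy(stream):
    # H = -sum_i (a_i/s) * log2(a_i/s), where a_i is the size of
    # flow i and s is the total number of items seen.
    counts = Counter(stream)
    s = sum(counts.values())
    return -sum((a / s) * math.log2(a / s) for a in counts.values())
```

A uniform two-flow stream gives H = 1 bit; a single-flow stream gives H = 0, the two extremes that make entropy a useful anomaly signal for traffic.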
The String Edit Distance Matching Problem with Moves
, 2006
Abstract

Cited by 72 (3 self)
The edit distance between two strings S and R is defined to be the minimum number of character inserts, deletes and changes needed to convert R to S. Given a text string t of length n, and a pattern string p of length m, informally, the string edit distance matching problem is to compute the smallest edit distance between p and substrings of t. We relax the problem so that (a) we allow an additional operation, namely, substring moves, and (b) we allow approximation of this string edit distance. Our result is a near linear time deterministic algorithm to produce a factor of O(log n log* n) approximation to the string edit distance with moves. This is the first known significantly subquadratic algorithm for a string edit distance problem in which the distance involves nontrivial alignments. Our results are obtained by embedding strings into L1 vector space using a simplified parsing technique we call Edit Sensitive Parsing.
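The base metric being relaxed is the classic edit distance, computable by the textbook O(|r|·|s|) dynamic program, far from the paper's near-linear approximation, but it pins down the definition:

```python
def edit_distance(r, s):
    # Minimum number of single-character inserts, deletes and changes
    # (no substring moves) needed to convert r into s.
    m, n = len(r), len(s)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i          # delete all of r[:i]
    for j in range(n + 1):
        d[0][j] = j          # insert all of s[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + 1,                       # delete r[i-1]
                          d[i][j - 1] + 1,                       # insert s[j-1]
                          d[i - 1][j - 1] + (r[i - 1] != s[j - 1]))  # change
    return d[m][n]
```

Allowing a substring move as a single unit-cost operation changes the metric substantially, which is why nontrivial embedding machinery is needed for the moves variant.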
Graph distances in the streaming model: the value of space
 In ACM-SIAM Symposium on Discrete Algorithms
, 2005
Abstract

Cited by 69 (11 self)
We investigate the importance of space when solving problems based on graph distance in the streaming model. In this model, the input graph is presented as a stream of edges in an arbitrary order. The main computational restriction of the model is that we have limited space and therefore cannot store all the streamed data; we are forced to make space-efficient summaries of the data as we go along. For a graph of n vertices and m edges, we show that testing many graph properties, including connectivity (ergo any reasonable decision problem about distances) and bipartiteness, requires Ω(n) bits of space. Given this, we then investigate how the power of the model increases as we relax our space restriction. Our main result is an efficient randomized algorithm that constructs a (2t + 1)-spanner in one pass. With high probability, it uses O(t · n^{1+1/t} log^2 n) bits of space and processes each edge in the stream in O(t^2 · n^{1/t} log n) time. We find approximations to diameter and girth via the constructed spanner. For t = Ω(log n / log log n), the space requirement of the algorithm is O(n · polylog n), and the per-edge processing time is O(polylog n). We also show a corresponding lower bound of t for the approximation ratio achievable when the space restriction is O(t · n^{1+1/t} log^2 n). We then consider the scenario in which we are allowed multiple passes over the input stream. Here, we investigate whether allowing these extra passes will compensate for a given space restriction.
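Connectivity, the simplest of the properties above, illustrates the model: one pass over an arbitrarily ordered edge stream, with per-vertex state (so Θ(n) words, consistent with the Ω(n)-bit lower bound the abstract mentions). A minimal sketch using union-find, with illustrative names:

```python
class UnionFind:
    # Per-vertex state only: one parent pointer per vertex.
    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, x, y):
        self.parent[self.find(x)] = self.find(y)

def is_connected(n, edge_stream):
    # One pass over the edges, in arbitrary order; edges are never stored.
    uf = UnionFind(n)
    for u, v in edge_stream:
        uf.union(u, v)
    root = uf.find(0)
    return all(uf.find(v) == root for v in range(n))
```

The spanner construction in the paper goes further: it keeps a sparse O(t · n^{1+1/t})-edge subgraph that approximately preserves all pairwise distances, not just connectivity.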
Low-Distortion Embeddings of Finite Metric Spaces
 in Handbook of Discrete and Computational Geometry
, 2004
Abstract

Cited by 65 (2 self)
INTRODUCTION An n-point metric space (X, D) can be represented by an n × n table specifying the distances. Such tables arise in many diverse areas. For example, consider the following scenario in microbiology: X is a collection of bacterial strains, and for every two strains, one is given their dissimilarity (computed, say, by comparing their DNA). It is difficult to see any structure in a large table of numbers, and so we would like to represent a given metric space in a more comprehensible way. For example, it would be very nice if we could assign to each x ∈ X a point f(x) in the plane in such a way that D(x, y) equals the Euclidean distance of f(x) and f(y). Such a representation would allow us to see the structure of the metric space: tight clusters, isolated points, and so on. Another advantage would be that the metric would now be represented by only 2n real numbers, the coordinates of the n points in the plane, instead of n(n−1)/2 numbers as before. Moreover, many quantities concern
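An exact planar representation as described rarely exists, which is why distortion, how much the embedding stretches and shrinks distances in the worst case, is the central quantity of the survey. A direct check for a given finite metric and candidate embedding (names illustrative):

```python
import itertools
import math

def distortion(points, dist, embed):
    # Distortion of an embedding `embed` of the finite metric
    # (points, dist) into the plane: worst-case expansion times
    # worst-case contraction over all pairs. A value of 1 means the
    # embedding is exact (isometric).
    expansion = contraction = 1.0
    for x, y in itertools.combinations(points, 2):
        d = dist(x, y)
        e = math.dist(embed[x], embed[y])
        expansion = max(expansion, e / d)
        contraction = max(contraction, d / e)
    return expansion * contraction
```

Three collinear points embed into the plane with distortion 1; four points with the "star" metric already cannot, which is the kind of obstruction the survey studies.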
Processing Set Expressions over Continuous Update Streams
, 2003
Abstract

Cited by 63 (14 self)
There is growing interest in algorithms for processing and querying continuous data streams (i.e., data that is seen only once in a fixed order) with limited memory resources. In its most general form, a data stream is actually an update stream, i.e., comprising data-item deletions as well as insertions. Such massive update streams arise naturally in several application domains (e.g., monitoring of large IP network installations, or processing of retail-chain transactions). Estimating the cardinality of set expressions defined over several (perhaps, distributed) update streams is perhaps one of the most fundamental query classes of interest; as an example, such a query may ask “what is the number of distinct IP source addresses seen in passing packets from both router R1 and R2 but not router R3?”. Earlier work has only addressed very restricted forms of this problem, focusing solely on the special case of insert-only streams and specific operators (e.g., union). In this paper, we propose the first space-efficient algorithmic solution for estimating the cardinality of full-fledged set expressions over general update streams. Our estimation algorithms are probabilistic in nature and rely on a novel, hash-based synopsis data structure, termed “2-level hash sketch”. We demonstrate how our 2-level hash sketch synopses can be used to provide low-error, high-confidence estimates for the cardinality of set expressions (including operators such as set union, intersection, and difference) over continuous update streams, using only small space and small processing time per update. Furthermore, our estimators never require rescanning or resampling of past stream items, regardless of the number of deletions in the stream. We also present lower bounds for the problem, demonstrating that the space usage of our estimation algorithms is within small factors of the optimal. Preliminary experimental results verify the effectiveness of our approach.
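The example query has a simple exact formulation once the update streams are replayed into net counts; exact sets here stand in for the paper's small-space 2-level hash sketches, and all names are illustrative:

```python
from collections import Counter

def present_items(updates):
    # Replay an update stream of (item, +1/-1) pairs into net counts;
    # an item is present iff its net count is positive, so deletions
    # can cancel earlier insertions.
    counts = Counter()
    for item, delta in updates:
        counts[item] += delta
    return {x for x, c in counts.items() if c > 0}

def expression_cardinality(r1, r2, r3):
    # Exact answer to the example query: distinct sources seen at both
    # router R1 and router R2 but not at router R3.
    return len((present_items(r1) & present_items(r2)) - present_items(r3))
```

The difficulty the paper addresses is producing this count without materializing the sets, while staying robust to arbitrarily many deletions.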
Distributed streams algorithms for sliding windows
 In Proc. ACM Symp. on Parallel Algorithms and Architectures (SPAA)
, 2002
Abstract

Cited by 62 (11 self)
Massive data sets often arise as physically distributed, parallel data streams, and it is important to estimate various aggregates and statistics on the union of these streams. This paper presents algorithms for estimating aggregate functions over a “sliding window” of the N most recent data items in one or more streams. Our results include: 1. For a single stream, we present the first ε-approximation scheme for the number of 1’s in a sliding window that is optimal in both worst case time and space. We also present the first ε-approximation scheme for the sum of integers in [0..R] in a sliding window that is optimal in both worst case time and space (assuming R is at most polynomial in N). Both algorithms are deterministic and use only logarithmic memory words. 2. In contrast, we show that any deterministic algorithm that estimates, to within a small constant relative error, the number of 1’s (or the sum of integers) in a sliding window on the union of distributed streams requires Ω(N) space.
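The problem being approximated is easy to state exactly: maintain the number of 1's among the N most recent bits. The naive solution below stores the entire window, exactly the O(N) cost that the paper's ε-approximation schemes reduce to logarithmic memory; names are illustrative.

```python
from collections import deque

def sliding_ones(bits, window):
    # Exact running count of 1's in a sliding window over the last
    # `window` bits, emitting the count after each arrival. Stores the
    # whole window, i.e. O(window) bits of state.
    buf, ones, out = deque(), 0, []
    for b in bits:
        buf.append(b)
        ones += b
        if len(buf) > window:
            ones -= buf.popleft()
        out.append(ones)
    return out
```

Replacing `b` by an integer in [0..R] and `ones` by a running sum gives the exact version of the second problem in the abstract.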