Results 31–40 of 324
Sampling Algorithms: Lower Bounds and Applications (Extended Abstract)
, 2001
Abstract

Cited by 60 (2 self)
Ziv Bar-Yossef, Computer Science Division, U.C. Berkeley, Berkeley, CA 94720 (zivi@cs.berkeley.edu); Ravi Kumar, IBM Almaden, 650 Harry Road, San Jose, CA 95120 (ravi@almaden.ibm.com); D. Sivakumar, IBM Almaden, 650 Harry Road, San Jose, CA 95120 (siva@almaden.ibm.com). ABSTRACT: We develop a framework to study probabilistic sampling algorithms that approximate general functions of the form f : A^n → B, where A and B are arbitrary sets. Our goal is to obtain lower bounds on the query complexity of functions, namely the number of input variables x_i that any sampling algorithm needs to query in order to approximate f(x_1, ..., x_n). We define two quantitative properties of functions, the block sensitivity and the minimum Hellinger distance, that give us techniques for proving lower bounds on the query complexity. These techniques are quite general and easy to use, yet powerful enough to yield tight results. Our applications include the mean and higher statistical moments, the median and other selection functions, and the frequency moments, where we obtain lower bounds that are close to the corresponding upper bounds. We also point out some connections between sampling and streaming algorithms and lossy compression schemes.
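The query-complexity framework above asks how many coordinates a sampling algorithm must read. As a minimal illustrative sketch (assuming values in [0, 1]; this is the textbook Chebyshev-style estimator, not the paper's lower-bound machinery), approximating the mean to additive error ε takes on the order of 1/ε² queries:

```python
import random

def sample_mean(x, eps, delta=0.05, seed=0):
    """Estimate the mean of x to additive error ~eps (values assumed in [0, 1])
    by querying a random subset of coordinates. The sample size ~ 1/(eps^2 * delta)
    comes from a Chebyshev-style variance bound; Hoeffding would give
    O(log(1/delta)/eps^2) instead."""
    rng = random.Random(seed)
    k = min(len(x), int(2.0 / (eps * eps * delta)) + 1)
    queried = [x[rng.randrange(len(x))] for _ in range(k)]  # k coordinate queries
    return sum(queried) / k
```

The lower bounds in the paper show that for functions like the mean, dependence of this general shape on ε cannot be avoided by any sampling algorithm.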
QuickSAND: Quick Summary and Analysis of Network Data
, 2001
Abstract

Cited by 57 (10 self)
Monitoring and analyzing traffic data generated from large ISP networks imposes challenges both at the data-gathering phase and in the data analysis itself. Still, both tasks are crucial for responding to the day-to-day challenges of engineering large networks with thousands of customers. In this paper we build on the premise that approximation is a necessary evil of handling massive datasets such as network data. We propose building compact summaries of the traffic data, called sketches, at distributed network elements and centers. These sketches are able to respond well to queries that seek features that stand out of the data. We call such features "heavy hitters." In this paper, we describe sketches and show how to use them to answer aggregate and trend-related queries and to identify heavy hitters. This may be used for exploratory data analysis of interest to network operations. We support our proposal by experimentally studying AT&T WorldNet data and performing a feasibility study on Cisco NetFlow data collected at several routers.
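The abstract describes sketches only at a high level. As an illustration of the general idea (not the authors' construction), a Count-Min-style sketch is one standard compact summary supporting approximate frequency queries, from which heavy hitters can be flagged; the salted hashing below stands in for the pairwise-independent hash families a real implementation would use:

```python
import random

class CountMinSketch:
    """Minimal Count-Min sketch: depth hash rows of the given width.
    A point query never undercounts, and overcounts by at most eps*N
    with probability 1-delta when width ~ e/eps and depth ~ ln(1/delta)."""
    def __init__(self, width=272, depth=5, seed=0):
        rng = random.Random(seed)
        self.w, self.d = width, depth
        self.salts = [rng.getrandbits(64) for _ in range(depth)]  # illustrative salted hashing
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        for row, salt in enumerate(self.salts):
            yield row, hash((salt, item)) % self.w

    def update(self, item, count=1):
        for row, col in self._cells(item):
            self.table[row][col] += count

    def query(self, item):
        # True count <= estimate in every row; the minimum is the tightest.
        return min(self.table[row][col] for row, col in self._cells(item))
```

Items whose estimated frequency exceeds a threshold (say, a fraction of the total stream length) are reported as heavy hitters.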
The fast Johnson–Lindenstrauss transform and approximate nearest neighbors
 SIAM J. Comput
, 2009
Abstract

Cited by 57 (0 self)
Abstract. We introduce a new low-distortion embedding of ℓ₂^d into ℓ_p^{O(log n)} (p = 1, 2) called the fast Johnson–Lindenstrauss transform (FJLT). The FJLT is faster than standard random projections and just as easy to implement. It is based upon the preconditioning of a sparse projection matrix with a randomized Fourier transform. Sparse random projections alone are unsuitable for low-distortion embeddings. We overcome this handicap by exploiting the "Heisenberg principle" of the Fourier transform, i.e., its local-global duality. The FJLT can be used to speed up search algorithms based on low-distortion embeddings in ℓ₁ and ℓ₂. We consider the case of approximate nearest neighbors in ℓ₂^d. We provide a faster algorithm using classical projections, which we then speed up further by plugging in the FJLT. We also give a faster algorithm for searching over the hypercube.
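For context, the "standard random projection" the FJLT improves on is a dense Gaussian matrix applied to each point; the sketch below shows that baseline (the FJLT itself replaces the dense matrix with a sparse one preconditioned by a randomized Fourier/Hadamard transform to speed up the matrix-vector product):

```python
import math
import random

def jl_project(points, k, seed=0):
    """Dense Johnson-Lindenstrauss projection of d-dimensional points into
    k dimensions. Entries are i.i.d. N(0, 1/k), so squared distances are
    preserved in expectation, and within (1 +/- eps) for k ~ log(n)/eps^2."""
    rng = random.Random(seed)
    d = len(points[0])
    R = [[rng.gauss(0.0, 1.0 / math.sqrt(k)) for _ in range(d)] for _ in range(k)]
    return [[sum(R[i][j] * p[j] for j in range(d)) for i in range(k)] for p in points]
```

The dense product above costs O(dk) per point; the FJLT's point is to get the same distortion guarantee with a substantially cheaper transform.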
On the Impossibility of Dimension Reduction in l_1
 In Proc. 35th Annu. ACM Sympos. Theory Comput
, 2003
Abstract

Cited by 56 (1 self)
The Johnson–Lindenstrauss Lemma shows that any n points in Euclidean space (with distances measured by the ℓ₂ norm) may be mapped down to O((log n)/ε²) dimensions such that no pairwise distance is distorted by more than a (1 + ε) factor. Determining whether such dimension reduction is possible in ℓ₁ has been an intriguing open question. Charikar and Sahai [7] recently showed lower bounds for dimension reduction in ℓ₁ that can be achieved by linear projections, and positive results for shortest-path metrics of restricted graph families. However, the question of general dimension reduction in ℓ₁ was still open. For example, it was not known whether it is possible to reduce the number of dimensions to O(log n) with 1 + ε distortion. We show strong lower bounds for general dimension reduction in ℓ₁. We give an explicit family of n points in ℓ₁ such that any embedding with distortion d requires n^{Ω(1/d²)} dimensions. This proves that there is no analog of the Johnson–Lindenstrauss Lemma for ℓ₁.
One-Pass Wavelet Decompositions of Data Streams
 IEEE TKDE
, 2003
Abstract

Cited by 53 (7 self)
We present techniques for computing small-space representations of massive data streams. These are inspired by traditional wavelet-based approximations, which consist of specific linear projections of the underlying data. We present general "sketch"-based methods for capturing various linear projections and use them to provide pointwise and range-sum estimation over data streams.
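Because sketches of this kind are linear in the data, any fixed linear projection (such as a wavelet coefficient) can be estimated from them. As one common instantiation (not necessarily the authors'; the salted hashing is illustrative, standing in for pairwise-independent hash families), a Count-Sketch-style summary answers pointwise queries via a median of unbiased per-row estimates:

```python
import random
import statistics

class CountSketch:
    """AMS/Count-Sketch style linear summary: each row hashes items to
    buckets and multiplies by a random sign; a point query takes the median
    of the per-row unbiased estimates, which controls the variance."""
    def __init__(self, width=101, depth=7, seed=0):
        rng = random.Random(seed)
        self.w, self.d = width, depth
        self.salts = [(rng.getrandbits(64), rng.getrandbits(64)) for _ in range(depth)]
        self.table = [[0.0] * width for _ in range(depth)]

    def _loc(self, item, row):
        hsalt, ssalt = self.salts[row]
        col = hash((hsalt, item)) % self.w
        sign = 1 if hash((ssalt, item)) & 1 else -1
        return col, sign

    def update(self, item, count=1.0):
        for row in range(self.d):
            col, sign = self._loc(item, row)
            self.table[row][col] += sign * count

    def query(self, item):
        ests = []
        for row in range(self.d):
            col, sign = self._loc(item, row)
            ests.append(sign * self.table[row][col])
        return statistics.median(ests)
```

Range-sum queries can then be reduced to a small number of such point queries, e.g. via a dyadic decomposition of the range.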
Estimating Rarity and Similarity over Data Stream Windows
 In Proceedings of 10th Annual European Symposium on Algorithms, volume 2461 of Lecture Notes in Computer Science
, 2002
Abstract

Cited by 50 (8 self)
In the windowed data stream model, we observe items arriving over time. At any time t, we consider the window of the last N observations a_{t−(N−1)}, a_{t−(N−2)}, ..., a_t, each a_i ∈ {1, ..., u}; we are allowed to ask queries about the data in the window, say, to compute the minimum or the median of the items in the window. A crucial restriction is that we are only allowed o(N) (often polylogarithmic in N) storage space, that is, space smaller than the window size, so the items within the window cannot be archived. The windowed data stream model arose out of the need to formally reason about the underlying data analysis problems in applications like internetworking and transaction processing.
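To make the windowed query itself concrete, here is the classical exact sliding-window minimum via a monotonic deque. Note that it may store up to N items, so it is the exact baseline the o(N)-space model rules out, not the paper's small-space technique:

```python
from collections import deque

def window_minimums(stream, N):
    """Exact minimum over the last N items, reported after each arrival.
    Amortized O(1) time per item, but up to N stored items in the worst
    case, which is exactly what the o(N)-space windowed model forbids."""
    dq = deque()  # holds (index, value) pairs with strictly increasing values
    out = []
    for t, v in enumerate(stream):
        while dq and dq[-1][1] >= v:
            dq.pop()              # dominated values can never again be the min
        dq.append((t, v))
        if dq[0][0] <= t - N:
            dq.popleft()          # front item has expired from the window
        out.append(dq[0][1])
    return out
```

Approximating such queries with only polylogarithmic space is what requires the randomized summaries developed in this line of work.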
Reverse Nearest Neighbor Aggregates Over Data Streams
, 2001
Abstract

Cited by 46 (2 self)
Reverse Nearest Neighbor (RNN) queries have been studied for finite, stored data sets and are of interest for decision support.
Algorithms for dynamic geometric problems over data streams
 In STOC ’04: Proceedings of the thirty-sixth annual ACM symposium on Theory of computing
, 2004
Graph sketches: sparsification, spanners, and subgraphs
 In PODS
, 2012
Abstract

Cited by 46 (9 self)
When processing massive data sets, a core task is to construct synopses of the data. To be useful, a synopsis data structure should be easy to construct while also yielding good approximations of the relevant properties of the data set. A particularly useful class of synopses are sketches, i.e., those based on linear projections of the data. These are applicable in many models including various parallel, stream, and compressed sensing settings. A rich body of analytic and empirical work exists for sketching numerical data such as the frequencies of a set of entities. Our work investigates graph sketching where the graphs of interest encode the relationships between these entities. The main challenge is to capture this richer structure and build the necessary synopses with only linear measurements. In this paper we consider properties of graphs including the size of the cuts, the distances between nodes, and the prevalence of …