Results 1 - 7 of 7
What you can do with coordinated samples
 In the 17th International Workshop on Randomization and Computation (RANDOM), 2013
Is Min-Wise Hashing Optimal for Summarizing Set Intersection?
Abstract

Cited by 2 (1 self)
Minwise hashing is an important method for estimating the size of the intersection of sets, based on a succinct summary (a “minhash”) independently computed for each set. One application is estimation of the number of data points that satisfy the conjunction of m ≥ 2 simple predicates, where a minhash is available for the set of points satisfying each predicate. This has applications in query optimization and in approximate computation of COUNT aggregates. In this paper we address the question: How many bits must be allocated to each summary in order to get an estimate with 1 ± ε relative error? The state-of-the-art technique for minimizing the encoding size, for any desired estimation error, is b-bit minwise hashing due to Li and König (Communications of the ACM, 2011). We give new lower and upper bounds:
• Using information complexity arguments, we show that b-bit minwise hashing is space optimal for m = 2 predicates, in the sense that the estimator’s variance is within a constant factor of the smallest possible among all summaries with the given space usage. But for conjunctions of m > 2 predicates we show that the performance of b-bit minwise hashing (and more generally of any method based on “k-permutation” minhashes) deteriorates as m grows.
• We describe a new summary that nearly matches our lower bound for m ≥ 2. It asymptotically outperforms all k-permutation schemes (by roughly a factor Ω(m / log m)), as well as methods based on subsampling (by a factor Ω(log n_max), where n_max is the maximum set size).
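As a concrete illustration of the classic scheme this paper analyzes, a minimal k-permutation minwise-hashing sketch for estimating an intersection size; the salted blake2b hash, the choice k = 256, and the example sets are illustrative assumptions, not the paper's optimal summary:

```python
import hashlib

def minhash(items, k):
    """k-permutation minhash: the minimum of each of k salted hash functions."""
    sig = []
    for i in range(k):
        salt = i.to_bytes(4, "big")
        sig.append(min(
            hashlib.blake2b(salt + str(x).encode(), digest_size=8).digest()
            for x in items))
    return sig

def estimate_intersection(a, b, k=256):
    """Estimate |a ∩ b| from the two minhash signatures."""
    sa, sb = minhash(a, k), minhash(b, k)
    jaccard = sum(x == y for x, y in zip(sa, sb)) / k  # matching fraction ≈ J
    # |A ∩ B| = J · |A ∪ B|, and |A ∪ B| = (|A| + |B|) / (1 + J)
    return jaccard * (len(a) + len(b)) / (1 + jaccard)

a = set(range(0, 600))
b = set(range(300, 900))   # true intersection size: 300, Jaccard ≈ 1/3
est = estimate_intersection(a, b)
```

The fraction of matching signature entries estimates the Jaccard similarity J, from which the intersection size follows; the paper's question is how few bits per summary suffice for a 1 ± ε estimate.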
Get the most out of your sample: Optimal unbiased estimators using partial information
 In Proc. of the 2011 ACM Symp. on Principles of Database Systems (PODS 2011). ACM, 2011
Abstract

Cited by 2 (2 self)
Random sampling is an essential tool in the processing and transmission of data. It is used to summarize data too large to store or manipulate and to meet resource constraints on bandwidth or battery power. Estimators applied to the sample facilitate fast approximate processing of queries posed over the original data, and the value of the sample hinges on the quality of these estimators. Our work targets data sets such as request and traffic logs and sensor measurements, where data is repeatedly collected over multiple instances: time periods, locations, or snapshots. We are interested in queries that span multiple instances, such as distinct counts and distance measures over selected records. These queries are used for applications ranging from planning to anomaly and change detection. Unbiased low-variance estimators are particularly effective, as the relative error decreases with the number of selected record keys. The Horvitz-Thompson estimator, known to minimize variance for sampling with “all or nothing” outcomes (where an outcome reveals either the exact value or no information about the estimated quantity), is not optimal for multi-instance operations, for which an outcome may provide partial information. We present a general, principled methodology for the derivation of (Pareto) optimal unbiased estimators over sampled instances and aim to understand its potential. We demonstrate significant improvement in the estimate accuracy of fundamental queries for common sampling schemes.
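The Horvitz-Thompson estimator discussed in this abstract can be sketched in its simplest “all or nothing” setting, a subset sum under independent (Poisson) sampling; the weights and inclusion probabilities below are illustrative assumptions, not the paper's setup:

```python
import random

random.seed(7)
weights = {k: float(k + 1) for k in range(100)}   # true sum = 5050.0

# Poisson sampling: key k enters the sample independently with probability p[k].
p = {k: min(1.0, weights[k] / 50) for k in weights}
sample = {k for k in weights if random.random() < p[k]}

# Horvitz-Thompson: each sampled key contributes w_k / p_k, so the estimate
# is unbiased: E[estimate] = sum_k p_k * (w_k / p_k) = sum_k w_k.
ht_estimate = sum(weights[k] / p[k] for k in sample)
```

Each observed key either reveals its weight exactly or not at all, which is precisely the “all or nothing” regime where Horvitz-Thompson is variance-optimal; the paper's contribution concerns outcomes that carry only partial information.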
Estimation for monotone sampling: Competitiveness and customization
 In PODC. ACM, 2014
Coordinated Weighted Sampling: Estimation of Multiple-Assignment Aggregates
Abstract
Many data sources are naturally modeled by multiple weight assignments over a set of keys: snapshots of an evolving database at multiple points in time, measurements collected over multiple time periods, requests for resources served at multiple locations, and records with multiple numeric attributes. Over such vector-weighted data we are interested in aggregates with respect to one set of weights, such as weighted sums, and in aggregates over multiple sets of weights, such as the difference. Sample-based summarization is highly effective for data sets that are too large to be stored or manipulated. The summary facilitates approximate processing of queries that may be specified after the summary was generated. Current designs, however, are geared toward data sets where a single scalar weight is associated with each key. We develop a sampling framework based on coordinated weighted samples that is suited for multiple weight assignments and obtain estimators that are orders of magnitude tighter than previously possible. We demonstrate the power of our methods through an extensive empirical evaluation on diverse data sets ranging from IP network data to stock quotes.
Multi-Objective Weighted Sampling
Abstract
Key-value data sets of the form {(x, w_x)}, where w_x > 0, are prevalent. Common queries over such data are segment f-statistics Q(f, H) = Σ_{x∈H} f(w_x), specified by a segment H of the keys and a function f. Different choices of f correspond to count, sum, moment, capping, and threshold statistics. When the data set is large, we can compute a smaller sample from which we can quickly estimate statistics. A weighted sample of keys taken with respect to f(w_x) provides estimates with statistically guaranteed quality for f-statistics. Such a sample S(f) can be used to estimate g-statistics for g ≠ f, but quality degrades with the disparity between g and f. In this paper we address applications that require quality estimates for a set F of different functions. A naive solution is to compute and work with a different sample S(f) for each f ∈ F. Instead, this can be achieved more effectively and seamlessly using a single multi-objective sample S(F) of a much smaller size. We review multi-objective sampling schemes and place them in our context of estimating f-statistics. We show that a multi-objective sample for F provides quality estimates for any f that is a positive linear combination of functions from F. We then establish a surprising and powerful result when the target set M is all monotone non-decreasing functions, noting that M includes most natural statistics. We provide efficient multi-objective sampling algorithms for M and show that a sample of size k ln n (where n is the number of active keys) provides the same estimation quality, for any f ∈ M, as a dedicated weighted sample of size k for f.
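A minimal sketch of the single-objective baseline this abstract starts from: a weighted sample taken with respect to f(w_x) = w_x, and an inverse-probability estimate of a segment statistic Q(f, H). The data, the threshold tau, and the Poisson sampling scheme are illustrative assumptions, not the paper's multi-objective algorithms:

```python
import random

random.seed(3)
data = {x: random.uniform(0.5, 5.0) for x in range(2000)}  # {(x, w_x)}, w_x > 0

# Weighted (PPS) Poisson sample taken with respect to f(w) = w.
tau = 0.02
p = {x: min(1.0, tau * w) for x, w in data.items()}
sample = {x for x in data if random.random() < p[x]}

H = set(range(1000))                         # query segment
true_q = sum(data[x] for x in H)             # Q(f, H) with f(w) = w
# Inverse-probability estimate from the sampled keys that fall in H.
est_q = sum(data[x] / p[x] for x in sample & H)
```

This sample is tuned to f(w) = w; estimating a g-statistic for a very different g from the same sample degrades, which is the gap the multi-objective sample S(F) is designed to close.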