Results 1 - 10 of 19
Join synopses for approximate query answering
In SIGMOD, 1999
"... In large data warehousing environments, it is often advantageous to provide fast, approximate answers to complex aggregate queries based on statistical summaries of the full data. In this paper, we demonstrate the difficulty of providing good approximate answers for join-queries using only statistic ..."
Cited by 169 (9 self)
Abstract:
In large data warehousing environments, it is often advantageous to provide fast, approximate answers to complex aggregate queries based on statistical summaries of the full data. In this paper, we demonstrate the difficulty of providing good approximate answers for join-queries using only statistics (in particular, samples) from the base relations. We propose join synopses (join samples) as an effective solution for this problem and show how precomputing just one join synopsis for each relation suffices to significantly improve the quality of approximate answers for arbitrary queries with foreign key joins. We present optimal strategies for allocating the available space among the various join synopses when the query workload is known and identify heuristics for the common case when the workload is not known. We also present efficient algorithms for incrementally maintaining join synopses in the presence of updates to the base relations. One of our key contributions is a detailed analysis of the error bounds obtained for approximate answers that demonstrates the trade-offs in various methods, as well as the advantages in certain scenarios of a new subsampling method we propose. Our extensive set of experiments on the TPC-D benchmark database shows the effectiveness of join synopses and the various other techniques proposed in this paper.
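As a rough illustration of the idea described above, the sketch below precomputes a uniform sample of a fact relation in which each sampled row is pre-joined along its foreign key, and then answers aggregate queries from the synopsis alone by scaling. The function names and toy data are hypothetical; the paper's space-allocation and subsampling strategies are not shown.

```python
import random

def build_join_synopsis(fact_rows, dim_index, fk, sample_size, seed=0):
    """Build a join synopsis: a uniform sample of the fact relation in which
    each sampled row is pre-joined with its foreign-key dimension row."""
    rng = random.Random(seed)
    sample = rng.sample(fact_rows, min(sample_size, len(fact_rows)))
    return [{**row, **dim_index[row[fk]]} for row in sample]

def approx_sum(synopsis, full_size, value, predicate=lambda r: True):
    """Estimate SUM(value) over the full join for rows matching `predicate`
    by scaling the sample total by full_size / sample_size."""
    if not synopsis:
        return 0.0
    total = sum(value(r) for r in synopsis if predicate(r))
    return total * full_size / len(synopsis)

# Hypothetical toy data: orders (fact) joined to customers via a foreign key.
customers = {1: {"region": "EU"}, 2: {"region": "US"}}
orders = [{"cust": 1, "amount": 10.0} if i % 2 else {"cust": 2, "amount": 20.0}
          for i in range(1000)]

syn = build_join_synopsis(orders, customers, "cust", sample_size=100)
est = approx_sum(syn, len(orders), lambda r: r["amount"],
                 lambda r: r["region"] == "EU")
# The exact answer is 500 * 10.0 = 5000; `est` should be close to it.
```

Answering the same query from independent per-relation samples would first require joining the samples, which is exactly the step the abstract argues performs poorly.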
Tracking join and self-join sizes in limited storage
2002
"... This paper presents algorithms for tracking (approximate) join and self-join sizes in limited storage, in the presence of insertions and deletions to the data set(s). Such algorithms detect changes in join and self-join sizes without an expensive recomputation from the base data, and without the lar ..."
Cited by 123 (0 self)
Abstract:
This paper presents algorithms for tracking (approximate) join and self-join sizes in limited storage, in the presence of insertions and deletions to the data set(s). Such algorithms detect changes in join and self-join sizes without an expensive recomputation from the base data, and without the large space overhead required to maintain such sizes exactly. Query optimizers rely on fast, high-quality estimates of join sizes in order to select between various join plans, and estimates of self-join sizes are used to indicate the degree of skew in the data. For self-joins, we consider two approaches proposed in [Alon, Matias, and Szegedy. The Space Complexity of Approximating the Frequency Moments. JCSS, vol. 58, 1999, pp. 137-147], which we denote tug-of-war and sample-count. We present fast algorithms for implementing these approaches, and extensions to handle deletions as well as insertions. We also report on the first experimental study of the two approaches, on a range of synthetic and real-world data sets. Our study shows that tug-of-war provides more accurate estimates for a given storage limit than sample-count, which in turn is far more accurate than a standard sampling-based approach. For example, tug-of-war needed only 4-256 memory words, depending on the data set, in order to estimate the self-join size
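The tug-of-war approach mentioned above (from Alon, Matias, and Szegedy) can be sketched as follows. This is a simplified illustration: deletions are handled as negative updates, but a salted hash stands in for the 4-wise independent hash functions the formal analysis requires.

```python
import hashlib
import random

class TugOfWarSketch:
    """Sketch of the AMS "tug-of-war" estimator for the self-join size
    F2 = sum_i f_i**2. Each counter maintains z_j = sum_i s_j(i) * f_i,
    where s_j(i) is a pseudo-random +1/-1 sign; each z_j**2 is an unbiased
    estimate of F2, and averaging counters reduces the variance."""

    def __init__(self, num_counters=64, seed=0):
        rng = random.Random(seed)
        self.salts = [rng.getrandbits(64) for _ in range(num_counters)]
        self.z = [0] * num_counters

    def _sign(self, salt, item):
        # Deterministic +1/-1 sign per (counter, item); the paper's analysis
        # assumes 4-wise independent hash functions here.
        h = hashlib.blake2b(f"{salt}:{item}".encode(), digest_size=1).digest()[0]
        return 1 if h & 1 else -1

    def update(self, item, delta=1):
        # delta = +1 for an insertion, -1 for a deletion.
        for j, salt in enumerate(self.salts):
            self.z[j] += self._sign(salt, item) * delta

    def estimate_self_join_size(self):
        # Mean of z_j**2 over all counters estimates sum_i f_i**2.
        return sum(zj * zj for zj in self.z) / len(self.z)
```

For a stream where items 0..9 each appear 10 times, the true self-join size is 10 * 10**2 = 1000, and the estimate concentrates around that value as the number of counters grows; deleting every copy of an item removes its contribution exactly.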
Synopsis Data Structures for Massive Data Sets
"... Abstract. Massive data sets with terabytes of data are becoming commonplace. There is an increasing demand for algorithms and data structures that provide fast response times to queries on such data sets. In this paper, we describe a context for algorithmic work relevant to massive data sets and a f ..."
Cited by 116 (13 self)
Abstract:
Massive data sets with terabytes of data are becoming commonplace. There is an increasing demand for algorithms and data structures that provide fast response times to queries on such data sets. In this paper, we describe a context for algorithmic work relevant to massive data sets and a framework for evaluating such work. We consider the use of "synopsis" data structures, which use very little space and provide fast (typically approximate) answers to queries. The design and analysis of effective synopsis data structures offer many algorithmic challenges. We discuss a number of concrete examples of synopsis data structures, and describe fast algorithms for keeping them up-to-date in the presence of online updates to the data sets.
AQUA: System and techniques for approximate query answering
1998
"... In large data recording and warehousing environments, it is often advantageous to provide fast, approximate answers to queries. The goal is to provide an estimated response in orders of magnitude less time than the time to compute an exact answer, by avoiding or minimizing the number of accesses to ..."
Cited by 24 (5 self)
Abstract:
In large data recording and warehousing environments, it is often advantageous to provide fast, approximate answers to queries. The goal is to provide an estimated response in orders of magnitude less time than the time to compute an exact answer, by avoiding or minimizing the number of accesses to the base data. This paper presents the Approximate QUery Answering (AQUA) system for fast, highly accurate approximate answers to queries. Aqua provides approximate answers using small, precomputed synopses (samples, counts, etc.) of the underlying base data. An important feature of Aqua is that it provides accuracy guarantees without any a priori assumptions on the data distribution, the order in which the base data is loaded, or the layout of the data on the disks. Currently, the system provides fast approximate answers for queries with selects, aggregates, group-bys and/or joins (especially, the multi-way foreign key joins that are popular in OLAP). We present several new techniques for improving the accuracy of approximate query answers for this class of queries. We show how join sampling can significantly improve the approximation quality. We describe how biased sampling can be used to overcome the problem of group size disparities
A dip in the reservoir: Maintaining sample synopses of evolving datasets
In Proc. VLDB, 2006
"... Perhaps the most flexible synopsis of a database is a random sample of the data; such samples are widely used to speed up processing of analytic queries and data-mining tasks, enhance query optimization, and facilitate information integration. In this paper, we study methods for incrementally mainta ..."
Cited by 15 (6 self)
Abstract:
Perhaps the most flexible synopsis of a database is a random sample of the data; such samples are widely used to speed up processing of analytic queries and data-mining tasks, enhance query optimization, and facilitate information integration. In this paper, we study methods for incrementally maintaining a uniform random sample of the items in a dataset in the presence of an arbitrary sequence of insertions and deletions. For “stable” datasets whose size remains roughly constant over time, we provide a novel sampling scheme, called “random pairing” (RP), which maintains a bounded-size uniform sample by using newly inserted data items to compensate for previous deletions. The RP algorithm is the first extension of the almost 40-year-old reservoir sampling algorithm to handle deletions. Experiments show that, when dataset-size fluctuations over time are not too extreme, RP is the algorithm of choice with respect to speed and sample-size stability. For “growing” datasets, we consider algorithms for periodically “resizing” a bounded-size random sample upwards. We prove that any such algorithm cannot avoid accessing the base data, and provide a novel resizing algorithm that minimizes the time needed to increase the sample size.
Accuracy vs. Lifetime: Linear Sketches for Approximate Aggregate Range Queries in Sensor Networks
2004
"... Query processing in sensor networks is critical for several sensor based monitoring applications and poses several challenging research problems. The in--network aggregation paradigm in sensor networks provides a versatile approach for evaluating simple aggregate queries, in which an aggregation-- ..."
Cited by 7 (3 self)
Abstract:
Query processing in sensor networks is critical for several sensor-based monitoring applications and poses several challenging research problems. The in-network aggregation paradigm in sensor networks provides a versatile approach for evaluating simple aggregate queries, in which an aggregation tree is imposed on the sensor network, rooted at the base station, and the data gets aggregated as it is forwarded up the tree. In this paper we consider two kinds of aggregate queries: value range queries, which compute the number of sensors that report values in a given range, and location range queries, which compute the sum of values reported by sensors in a given location range. Such queries can be answered using the in-network aggregation approach, where only sensors that fall within the range contribute to the aggregate being maintained. However, this requires a separate aggregate to be computed and communicated for each query and hence does not scale well with the number of queries. Many
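The baseline the abstract describes, in which each query is answered by a single aggregate flowing up the tree, can be sketched as follows. The tree layout and sensor values are hypothetical, and the paper's actual contribution (linear sketches shared across many queries) is not shown.

```python
def aggregate_range_count(node, lo, hi):
    """In-network aggregation of a value range query: count the sensors in
    the subtree rooted at `node` whose reading lies in [lo, hi]. Each node
    adds its own contribution to the partial counts received from its
    children, so only a single number travels up each tree edge."""
    count = 1 if lo <= node["value"] <= hi else 0
    for child in node.get("children", []):
        count += aggregate_range_count(child, lo, hi)
    return count

# Hypothetical aggregation tree rooted just below the base station.
tree = {
    "value": 18,
    "children": [
        {"value": 25, "children": [{"value": 30}, {"value": 22}]},
        {"value": 15, "children": [{"value": 27}]},
    ],
}

print(aggregate_range_count(tree, 20, 30))  # 4 sensors report values in [20, 30]
```

As the abstract notes, each additional range query requires its own such aggregate to be maintained and forwarded, which is why this baseline scales poorly with the number of queries.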
Maintaining bounded-size sample synopses of evolving datasets
2007
"... Perhaps the most flexible synopsis of a database is a uniform random sample of the data; such samples are widely used to speed up processing of analytic queries and data-mining tasks, enhance query optimization, and facilitate information integration. The ability to bound the maximum size of a sam ..."
Cited by 5 (4 self)
Abstract:
Perhaps the most flexible synopsis of a database is a uniform random sample of the data; such samples are widely used to speed up processing of analytic queries and data-mining tasks, enhance query optimization, and facilitate information integration. The ability to bound the maximum size of a sample can be very convenient from a system-design point of view, because the task of memory management is simplified, especially when many samples are maintained simultaneously. In this paper, we study methods for incrementally maintaining a bounded-size uniform random sample of the items in a dataset in the presence of an arbitrary sequence of insertions and deletions. For “stable” datasets whose size remains roughly constant over time, we provide a novel sampling scheme, called “random pairing” (RP), that maintains a bounded-size uniform sample by using newly inserted data items to compensate for previous deletions. The RP algorithm is the first extension of the 45-year-old reservoir sampling algorithm to handle deletions; RP reduces to the “passive” algorithm of Babcock et al. when the insertions and deletions correspond to a moving window over a data stream. Experiments show that, when dataset-size fluctuations over time are not too extreme, RP is the algorithm of choice with respect to speed and sample-size stability. For “growing” datasets, we consider algorithms for periodically resizing a bounded-size random sample upwards. We prove that any such algorithm cannot avoid accessing the base data,
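A minimal sketch of the random pairing scheme as described in the abstract: deletions are remembered in two counters, and later insertions “compensate” them, taking a sample slot with probability c1 / (c1 + c2). This follows the published description at a high level; the paper's exact bookkeeping may differ in detail.

```python
import random

class RandomPairingSample:
    """Sketch of "random pairing" (RP): a bounded-size uniform sample
    maintained under insertions and deletions, with new insertions
    compensating for earlier deletions."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.rng = random.Random(seed)
        self.sample = []
        self.n = 0    # current dataset size
        self.c1 = 0   # uncompensated deletions that were in the sample
        self.c2 = 0   # uncompensated deletions that were not in the sample

    def insert(self, item):
        self.n += 1
        d = self.c1 + self.c2
        if d > 0:
            # Pair the insertion with a previous deletion: the new item
            # takes a sample slot with probability c1 / (c1 + c2).
            if self.rng.random() < self.c1 / d:
                self.sample.append(item)
                self.c1 -= 1
            else:
                self.c2 -= 1
        elif len(self.sample) < self.capacity:
            self.sample.append(item)
        else:
            # Classic reservoir sampling step: include the n-th item with
            # probability capacity / n, evicting a uniform victim.
            j = self.rng.randrange(self.n)
            if j < self.capacity:
                self.sample[j] = item

    def delete(self, item):
        self.n -= 1
        if item in self.sample:
            self.sample.remove(item)
            self.c1 += 1
        else:
            self.c2 += 1
```

When there are no uncompensated deletions, the scheme reduces to ordinary reservoir sampling, matching the abstract's claim that RP extends the classic algorithm.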
Approximate Spatial Query Processing Using Raster Signatures
In Proceedings of VI Brazilian Symposium on GeoInformatics, 2004
"... Abstract: Nowadays, the database characteristics, such as the huge volume of data, the complexity of the queries, and even the data availability, can demand minutes or hours to process a query. On the other hand, in many cases it may be enough to the user to get a fast approximate answer, since it h ..."
Cited by 2 (2 self)
Abstract:
Nowadays, database characteristics such as the huge volume of data, the complexity of the queries, and even the data availability can demand minutes or hours to process a query. On the other hand, in many cases it may be enough for the user to get a fast approximate answer, provided it has the desired precision. The challenge of giving the user an exact query answer within a reasonable time becomes even bigger in the spatial database field. This work proposes the use of the Four Color Raster Signature (4CRS) for approximate query processing. The main goal is to reduce the time required to process a query by executing it on approximate data (the 4CRS signature) instead of accessing the real datasets. The experimental tests demonstrated the good results of our proposal. For the most important algorithm tested, the time required to process an approximate query answer averages 7.22% of the time to get an exact answer, disk accesses average 7.04%, and the average error is 1% relative to exact processing. Besides, the 4CRS storage requirements are also quite small, averaging only 3.57% of the space required to store the real datasets.
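A much-simplified illustration of the raster-signature idea: each cell of a grid over the data is classified into one of four classes by coverage fraction, and queries such as intersection area are then estimated from the compact signatures alone. The class thresholds, representative fractions, and per-cell independence assumption below are illustrative choices, not the exact 4CRS definition or query algorithms.

```python
# Four classes, each represented by a nominal coverage fraction
# (illustrative thresholds, not the exact 4CRS definition).
CLASS_FRACTION = {"empty": 0.0, "weak": 0.25, "strong": 0.75, "full": 1.0}

def signature(coverage_grid):
    """Quantize a grid of exact cell coverage fractions (0..1) into a
    four-color raster signature."""
    def classify(f):
        if f == 0.0:
            return "empty"
        if f == 1.0:
            return "full"
        return "weak" if f < 0.5 else "strong"
    return [[classify(f) for f in row] for row in coverage_grid]

def approx_intersection_area(sig_a, sig_b, cell_area):
    """Estimate the intersection area of two spatial objects from their
    signatures alone, assuming coverage is independent within each cell."""
    total = 0.0
    for row_a, row_b in zip(sig_a, sig_b):
        for ca, cb in zip(row_a, row_b):
            total += CLASS_FRACTION[ca] * CLASS_FRACTION[cb] * cell_area
    return total

sig_a = signature([[1.0, 0.3], [0.0, 0.8]])
sig_b = signature([[0.6, 1.0], [1.0, 0.1]])
area = approx_intersection_area(sig_a, sig_b, cell_area=1.0)
```

The point of the technique is that only the small signatures, never the real geometries, are touched at query time, which is what yields the time and disk-access savings the abstract reports.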
The Design and Architecture of the τ-Synopses System
"... Data synopses are concise representations of data sets, that enable effective processing of approximate queries to the data sets. Approximate query processing provides important alternatives when exact query answers are not required. τ-Synopses is a system designed to provide a runtime environment f ..."
Cited by 1 (1 self)
Abstract:
Data synopses are concise representations of data sets that enable effective processing of approximate queries to those data sets. Approximate query processing provides important alternatives when exact query answers are not required. τ-Synopses is a system designed to provide a runtime environment for remote execution of various synopses for both relational and XML databases. It enables easy registration of new synopses from remote platforms, after which the system can manage these synopses, including triggering their construction, rebuild, and update, and invoking them for approximate query processing. The system captures and analyzes query workloads, enabling its registered synopses to significantly boost their effectiveness (efficiency, accuracy, confidence) by exploiting workload information for synopsis construction and update. The system can also serve as a research platform for experimental evaluation and comparison of different synopses.
Approximate Query Processing in Spatial Databases Using Raster Signatures
"... Abstract. Traditional query processing provides exact answers to queries. However, in many applications, the response time of exact answers is often longer than what is acceptable. Approximate query processing has emerged as an alternative approach to give to the user an answer in a short time. The ..."
Cited by 1 (0 self)
Abstract:
Traditional query processing provides exact answers to queries. However, in many applications, the response time of exact answers is often longer than what is acceptable. Approximate query processing has emerged as an alternative approach that gives the user an answer in a short time. The goal is to provide an estimated result in one order of magnitude less time than the time to compute the exact answer. There is a large set of techniques for approximate query processing; however, most of them are only suitable for traditional data. This work proposes new algorithms for a set of spatial operations that can be processed approximately using the 4CRS (Four-Color Raster Signature).