Results 1 – 7 of 7
A Pay-As-You-Go Framework for Query Execution Feedback
Abstract

Cited by 11 (1 self)
Past work has suggested that query execution feedback can be useful in improving the quality of plans by correcting cardinality estimation errors in the query optimizer. The state-of-the-art approach for obtaining execution feedback is “passive” monitoring, which records the cardinality of each operator in the execution plan. We observe that there are many cases where, even after repeated executions of the same query with use of feedback from passive monitoring, suboptimal choices in the execution plan cannot be corrected. We present a novel “pay-as-you-go” framework in which a query potentially incurs a small overhead on each execution but obtains cardinality information that is not available with passive monitoring alone. Such a framework can significantly extend the reach of query execution feedback in obtaining better plans. We have implemented our techniques in Microsoft SQL Server, and our evaluation on real-world and synthetic queries suggests that plan quality can improve significantly compared to passive monitoring, even at low overheads.
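The passive-monitoring baseline the abstract describes can be sketched as follows: record each operator's actual cardinality during execution and flag operators whose optimizer estimate was far off. All names here are illustrative stand-ins, not SQL Server internals.

```python
# Sketch of passive execution feedback: compare the optimizer's estimated
# cardinality with the actual cardinality observed for each plan operator.
# The plan representation and thresholds are illustrative assumptions.

def collect_feedback(plan):
    """Return {operator: (estimated, actual)} for every operator in a plan.

    `plan` is a list of dicts with 'op', 'est_rows', 'actual_rows' keys,
    a stand-in for an instrumented execution plan.
    """
    return {node["op"]: (node["est_rows"], node["actual_rows"]) for node in plan}

def misestimated(feedback, factor=4.0):
    """Operators whose actual cardinality is off by more than `factor`x."""
    bad = []
    for op, (est, act) in feedback.items():
        ratio = max(est, act) / max(min(est, act), 1)
        if ratio > factor:
            bad.append(op)
    return bad

plan = [
    {"op": "scan(orders)", "est_rows": 10000, "actual_rows": 10000},
    {"op": "filter(status='open')", "est_rows": 1000, "actual_rows": 12},
    {"op": "join(orders, customers)", "est_rows": 900, "actual_rows": 15},
]
print(misestimated(collect_feedback(plan)))
# -> ["filter(status='open')", "join(orders, customers)"]
```

The paper's observation is that such feedback only covers operators that actually appear in the executed plan; cardinalities of alternative sub-expressions stay unknown, which is the gap the pay-as-you-go framework pays a small per-execution overhead to close.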
Understanding Cardinality Estimation using Entropy Maximization
Abstract

Cited by 8 (0 self)
Cardinality estimation is the problem of estimating the number of tuples returned by a query; it is a fundamentally important task in data management, used in query optimization, progress estimation, and resource provisioning. We study cardinality estimation in a principled framework: given a set of statistical assertions about the number of tuples returned by a fixed set of queries, predict the number of tuples returned by a new query. We model this problem using the probability space, over possible worlds, that satisfies all provided statistical assertions and maximizes entropy. We call this the Entropy Maximization model for statistics (MaxEnt). In this paper we develop the mathematical techniques needed to use the MaxEnt model for predicting the cardinality of conjunctive queries.
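A toy illustration of the MaxEnt idea: among all distributions consistent with given single-attribute cardinality assertions (marginals), the maximum-entropy choice treats the attributes as independent. Iterative proportional fitting (IPF) computes it; this is a didactic sketch under that simplified setting, not the paper's algorithm for general conjunctive queries.

```python
# MaxEnt with marginal constraints via iterative proportional fitting.
# Starting from a uniform table (the unconstrained entropy maximizer),
# alternately rescale rows and columns to match the asserted totals.

def ipf(row_totals, col_totals, iters=50):
    """Fit a contingency table with the given row/column sums."""
    n, m = len(row_totals), len(col_totals)
    t = [[1.0] * m for _ in range(n)]
    for _ in range(iters):
        for i in range(n):                        # scale rows to match
            s = sum(t[i])
            t[i] = [v * row_totals[i] / s for v in t[i]]
        for j in range(m):                        # scale columns to match
            s = sum(t[i][j] for i in range(n))
            for i in range(n):
                t[i][j] *= col_totals[j] / s
    return t

# Assertions over 100 tuples: 30 have A=0, 70 have A=1; 40 have B=0, 60 have B=1.
table = ipf([30, 70], [40, 60])
# The MaxEnt model predicts |A=0 AND B=0| = 30 * 40 / 100 = 12.
print(table[0][0])
```

With only marginal assertions the fixed point is the independence (product) distribution, matching the intuition that MaxEnt adds no correlation the statistics do not assert; the paper's contribution is making this tractable for richer assertions involving joins.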
Consistent Histograms In The Presence of Distinct Value Counts
Abstract

Cited by 3 (1 self)
Self-tuning histograms have been proposed in the past as an attempt to leverage feedback from query execution. However, the focus thus far has been on histograms that only store cardinalities. In this paper, we study consistent histogram construction from query feedback that also takes distinct value counts into account. We first show how the entropy maximization (EM) principle can be leveraged to identify a distribution that approximates the data given the execution feedback while making the least additional assumptions. This EM model takes both distinct value counts and cardinalities into account; however, we find that it is computationally prohibitively expensive. We thus consider an alternative formulation for consistency: for a given query workload, the goal is to minimize the L2 distance between the true and estimated cardinalities. This approach also handles both cardinalities and distinct value counts. We propose an efficient one-pass algorithm with several theoretical properties for this formulation. Our experiments show that this approach produces similar improvements in accuracy as the EM-based approach while being computationally significantly more efficient.
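The L2 formulation can be illustrated with a minimal sketch: given an observed cardinality for a range query, apply the smallest (least-squares) change to the covered buckets so the histogram reproduces the observation. This is a didactic stand-in for the paper's one-pass algorithm, which additionally handles distinct value counts.

```python
# Minimum-L2 repair of a histogram against one piece of query feedback:
# spreading the correction evenly over the covered buckets is exactly the
# least-squares projection onto the constraint "covered buckets sum to
# the observed count".

def apply_feedback(buckets, lo, hi, observed):
    """Adjust buckets[lo:hi] in place so they sum to `observed`."""
    k = hi - lo
    err = observed - sum(buckets[lo:hi])
    for i in range(lo, hi):
        buckets[i] += err / k
    return buckets

h = [100.0, 100.0, 100.0, 100.0]   # initial per-bucket cardinalities
apply_feedback(h, 0, 2, 260.0)     # feedback: the first two buckets hold 260 tuples
print(h)                           # -> [130.0, 130.0, 100.0, 100.0]
```

Spreading the error uniformly minimizes the L2 norm of the adjustment; more sophisticated variants would weight buckets by width or frequency, but the projection idea is the same.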
General Database Statistics Using Entropy
Abstract
We propose a framework in which query sizes can be estimated from arbitrary statistical assertions on the data. In its most general form, a statistical assertion states that the size of the output of a conjunctive query over the data is a given number. A very simple example is a histogram, which makes assertions about the sizes of the output of several range queries. Our model also allows much more complex assertions that include joins and projections. To model such complex statistical assertions we propose to use the Entropy-Maximization (EM) probability distribution. In this model any set of statistics that is consistent has a precise semantics, and every query has a precise size estimate.  We show that several classes of statistics can be solved in closed form.
Warm Cache Costing – A Feedback Optimization Technique for Buffer Pool Aware Costing
Abstract
Most modern RDBMSs depend on the query optimizer’s cost model to choose the best execution plan for a given query. Since physical IO (PIO) is a costly operation to execute, it naturally has an important weight in classical RDBMS cost models, which assume that the data is disk-resident and does not fit in the available main memory. However, this assumption is no longer true with the advent of cheap, large main memories. In this paper, we discuss the importance of considering buffer-cache occupancy during optimization and propose the Warm Cache Costing (WCC) model as a new technique for buffer-pool-aware query optimization. The WCC model is a novel feedback optimization technique based on execution statistics: it learns PIO-compensation (PIOC) factors. The PIOC factor gives the average percentage of a table that is cached in the buffer pool. These PIOC factors are used in subsequent query optimizations to better estimate the PIO, thus leading to better plans. These techniques have been implemented in the Sybase Adaptive Server Enterprise (ASE) database system. We have observed that they provide considerable improvements in query timings in Decision Support environments, with almost negligible regression (if any) in other environments. This model enjoys the advantage of requiring no change to the buffer manager or other modules underlying the query processing layer and is therefore easy to implement. Also, since this model is part of an extensive feedback optimization architecture, other techniques using the feedback optimization framework can be plugged in easily.
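The PIO-compensation idea can be sketched in a few lines: learn from execution statistics the fraction of a table's page requests served from the buffer pool, then scale the cold-cache PIO estimate by the uncached remainder. The function names and the costing formula are illustrative assumptions, not Sybase ASE internals.

```python
# Sketch of learning and applying a PIO-compensation (PIOC) factor.

def learn_pioc(logical_reads, physical_reads):
    """PIOC factor: fraction of page requests served from the buffer pool."""
    if logical_reads == 0:
        return 0.0
    return 1.0 - physical_reads / logical_reads

def warm_cache_pio(estimated_pages, pioc):
    """Compensate a cold-cache PIO estimate by the learned cache residency."""
    return estimated_pages * (1.0 - pioc)

# Execution feedback for a table: 10,000 page requests, only 500 hit disk,
# so roughly 95% of the table is resident in the buffer pool.
pioc = learn_pioc(10_000, 500)
print(warm_cache_pio(2_000, pioc))   # cold estimate of 2000 page reads -> 100.0
```

Because the compensation happens purely inside the cost model, nothing below the query processing layer needs to change, which is the ease-of-implementation property the abstract highlights.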
Determining Essential Statistics for Cost Based Optimization of an ETL Workflow
Abstract
Many of the ETL products in the market today provide tools for the design of ETL workflows, with very little or no support for optimization of such workflows. Optimization of ETL workflows poses several new challenges compared to traditional query optimization in database systems. There have been many attempts, both in industry and in the research community, to support cost-based optimization techniques for ETL workflows, but with limited success. Non-availability of source statistics in ETL is one of the major challenges that preclude the use of a cost-based optimization strategy. However, the basic philosophy of ETL workflows, design once and execute repeatedly, allows interesting possibilities for determining the statistics of the input. In this paper, we propose a framework to determine various sets of statistics to collect for a given workflow, using which the optimizer can estimate the cost of any alternative plan for the workflow. The initial few runs of the workflow are used to collect the statistics, and future runs are optimized based on the learned statistics. Since there can be several alternative sets of statistics that are sufficient, we propose an optimization framework to choose a set of statistics that can be measured with the least overhead. We experimentally demonstrate the effectiveness and efficiency of the proposed algorithms.
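Choosing a sufficient set of statistics with the least measurement overhead has the flavor of weighted set cover, so a greedy sketch conveys the framing: each candidate statistic covers some of the workflow's estimation needs at a given cost. All statistic names and costs below are hypothetical, and the greedy heuristic is an illustration of the optimization problem, not the paper's exact algorithm.

```python
# Greedy weighted set cover over candidate statistics: repeatedly pick the
# statistic with the best cost per newly covered estimation need.

def choose_statistics(candidates, required):
    """Pick statistics until all required estimates are covered.

    `candidates`: {stat_name: (cost, set_of_covered_estimates)}.
    Returns the chosen statistic names in selection order.
    """
    uncovered, chosen = set(required), []
    while uncovered:
        name = min(
            (n for n in candidates if candidates[n][1] & uncovered),
            key=lambda n: candidates[n][0] / len(candidates[n][1] & uncovered),
        )
        chosen.append(name)
        uncovered -= candidates[name][1]
    return chosen

cands = {
    "hist(src.date)":   (5, {"filter_sel", "sort_size"}),
    "count(src)":       (1, {"scan_card"}),
    "ndv(src.cust_id)": (4, {"join_card"}),
    "sample(src)":      (8, {"filter_sel", "join_card", "scan_card"}),
}
print(choose_statistics(cands, {"filter_sel", "join_card", "scan_card", "sort_size"}))
```

Here the greedy picks the cheap row count first, then the histogram (which covers two needs at once), then the distinct-value count, and never pays for the expensive sample; collecting the chosen statistics during the workflow's first few runs then enables cost-based optimization of later runs.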