Results 1 - 10 of 26
Predicting Query Execution Time: Are Optimizer Cost Models Really Unusable?
In ICDE, IEEE, 2013
"... Abstract — Predicting query execution time is useful in many database management issues including admission control, query scheduling, progress monitoring, and system sizing. Recently the research community has been exploring the use of statistical machine learning approaches to build predictive mod ..."
Abstract
-
Cited by 10 (6 self)
- Add to MetaCart
(Show Context)
Predicting query execution time is useful in many database management issues including admission control, query scheduling, progress monitoring, and system sizing. Recently the research community has been exploring the use of statistical machine learning approaches to build predictive models for this task. An implicit assumption behind this work is that the cost models used by query optimizers are insufficient for query execution time prediction. In this paper we challenge this assumption and show that while the simple approach of scaling the optimizer’s estimated cost indeed fails, a properly calibrated optimizer cost model is surprisingly effective. However, even a well-tuned optimizer cost model will fail in the presence of errors in cardinality estimates. Accordingly, we investigate the novel idea of spending extra resources to refine estimates for the query plan after it has been chosen by the optimizer but before execution. In our experiments we find that a well-calibrated query optimizer model along with cardinality estimation refinement provides a low-overhead way to produce estimates that are always competitive with and often much better than the best reported numbers from the machine learning approaches.
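To make the calibration idea above concrete, here is a minimal sketch in the spirit of the paper: treat the optimizer's cost as a linear combination of resource counts (PostgreSQL-style cost units are assumed) and fit the per-unit time coefficients by least squares on profiled queries. All resource counts and timings below are invented for illustration.

```python
# Minimal sketch of optimizer cost-model calibration: the optimizer's
# cost is a linear combination of resource counts, and we fit the
# per-unit time coefficients by least squares on profiled queries.
import numpy as np

# Per-query resource counts estimated by the optimizer:
# columns = [seq_pages, rand_pages, cpu_tuples, cpu_index_tuples, cpu_operators]
X = np.array([
    [1000,   10, 50000,   500, 120000],
    [ 200, 4000,  8000, 20000,  30000],
    [5000,    0, 90000,     0, 250000],
    [ 800,  600, 20000,  3000,  60000],
    [3000,  100, 70000,  1000, 150000],
], dtype=float)

# Observed execution times (seconds) for the same calibration queries.
y = np.array([2.1, 3.8, 5.6, 1.9, 4.4])

# Fit calibrated per-unit costs (seconds per page read, per tuple, ...).
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print("calibrated unit costs:", coef)

# Predict the running time of a new plan from its resource counts.
new_plan = np.array([1500, 50, 60000, 800, 130000], dtype=float)
print("predicted time (s):", new_plan @ coef)
```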
Robust Estimation of Resource Consumption for SQL Queries using Statistical Techniques
"... The ability to estimate resource consumption of SQL queries is crucial for a number of tasks in a database system such as admission control, query scheduling and costing during query optimization. Recent work has explored the use of statistical techniques for resource estimation in place of the manu ..."
Abstract
-
Cited by 8 (0 self)
- Add to MetaCart
(Show Context)
The ability to estimate resource consumption of SQL queries is crucial for a number of tasks in a database system such as admission control, query scheduling and costing during query optimization. Recent work has explored the use of statistical techniques for resource estimation in place of the manually constructed cost models used in query optimization. Such techniques, which require as training data examples of resource usage in queries, offer the promise of superior estimation accuracy since they can account for factors such as hardware characteristics of the system or bias in cardinality estimates. However, the proposed approaches lack robustness in that they do not generalize well to queries that are different from the training examples, resulting in significant estimation errors. Our approach aims to address this problem by combining knowledge of database query processing with statistical models. We model resource usage at the level of individual operators, with different models and features for each operator type, and explicitly model the asymptotic behavior of each operator. This results in significantly better estimation accuracy and the ability to estimate resource usage of arbitrary plans, even when they are very different from the training instances. We validate our approach using various large-scale real-life and benchmark workloads on Microsoft SQL Server.
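The per-operator modeling idea lends itself to a small sketch: one regression model per operator type, with operator-specific features that encode each operator's asymptotic behavior (e.g., an N log N term for sort), and a plan estimate obtained by summing its operators' predictions. The feature sets and training samples here are assumptions for illustration, not the paper's.

```python
# Sketch: separate model per operator type, with features capturing
# each operator's asymptotic behavior; a plan estimate is the sum of
# its operators' predictions. Training data is invented.
import numpy as np

def sort_features(n_rows, row_bytes):
    # The N log N term models sort's asymptotic CPU cost.
    return np.array([1.0, n_rows, n_rows * np.log2(max(n_rows, 2)), n_rows * row_bytes])

def scan_features(n_pages, n_rows):
    return np.array([1.0, n_pages, n_rows])

def fit(feature_fn, samples):
    # samples: list of (feature_args, observed_seconds)
    X = np.array([feature_fn(*a) for a, _ in samples])
    y = np.array([t for _, t in samples])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

models = {
    "Sort": (sort_features, fit(sort_features, [((10_000, 100), 0.4),
                                                ((200_000, 100), 9.1),
                                                ((50_000, 200), 2.8),
                                                ((500_000, 50), 21.0)])),
    "Scan": (scan_features, fit(scan_features, [((1_000, 50_000), 0.5),
                                                ((20_000, 900_000), 8.7),
                                                ((5_000, 250_000), 2.4)])),
}

def predict_plan(plan):
    # plan: list of (operator_type, feature_args)
    total = 0.0
    for op, args in plan:
        fn, coef = models[op]
        total += float(fn(*args) @ coef)
    return total

print(predict_plan([("Scan", (8_000, 400_000)), ("Sort", (400_000, 80))]))
```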
Towards Multi-Tenant Performance SLOs
In Proc. IEEE 28th ICDE, 2012
"... Abstract—As traditional and mission-critical relational database workloads migrate to the cloud in the form of Database-as-a-Service (DaaS), there is an increasing motivation to provide performance goals in Service Level Objectives (SLOs). Providing such performance goals is challenging for DaaS pro ..."
Abstract
-
Cited by 8 (0 self)
- Add to MetaCart
(Show Context)
As traditional and mission-critical relational database workloads migrate to the cloud in the form of Database-as-a-Service (DaaS), there is an increasing motivation to provide performance goals in Service Level Objectives (SLOs). Providing such performance goals is challenging for DaaS providers as they must balance the performance that they can deliver to tenants against the data center’s operating costs. In general, aggressively aggregating tenants on each server reduces the operating costs but degrades performance for the tenants, and vice versa. In this paper, we present a framework that takes as input the tenant workloads, their performance SLOs, and the server hardware that is available to the DaaS provider, and outputs a cost-effective recipe that specifies how much hardware to provision and how to schedule the tenants on each hardware resource. We evaluate our method and show that it produces effective solutions that can reduce the costs for the DaaS provider while meeting performance goals.
Index Terms—Database management, relational databases
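One plausible reading of the provisioning step is a packing problem: given each tenant's estimated resource demand implied by its SLO, assign tenants to as few servers as possible without exceeding capacity. The first-fit-decreasing heuristic below is only an illustrative stand-in for the paper's scheduling method; the demands and capacity are invented.

```python
# Toy sketch of SLO-driven tenant consolidation: pack tenants onto as
# few servers as possible without exceeding per-server capacity.
# First-fit-decreasing is an illustrative baseline only.
def pack_tenants(demands, capacity):
    """demands: {tenant: fraction of a server needed to meet its SLO}."""
    servers = []  # each server is a list of (tenant, demand)
    for tenant, need in sorted(demands.items(), key=lambda kv: -kv[1]):
        for srv in servers:
            if sum(d for _, d in srv) + need <= capacity:
                srv.append((tenant, need))
                break
        else:
            servers.append([(tenant, need)])  # provision a new server
    return servers

demands = {"t1": 0.6, "t2": 0.5, "t3": 0.3, "t4": 0.2, "t5": 0.2}
for i, srv in enumerate(pack_tenants(demands, capacity=1.0)):
    print(f"server {i}: {srv}")
```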
Towards Predicting Query Execution Time for Concurrent and Dynamic Database Workloads
"... Predicting query execution time is crucial for many database management tasks including admission control, query scheduling, and progress monitoring. While a number of recent papers have explored this problem, the bulk of the existing work either considers prediction for a single query, or predictio ..."
Abstract
-
Cited by 7 (4 self)
- Add to MetaCart
(Show Context)
Predicting query execution time is crucial for many database management tasks including admission control, query scheduling, and progress monitoring. While a number of recent papers have explored this problem, the bulk of the existing work either considers prediction for a single query, or prediction for a static workload of concurrent queries, where by “static” we mean that the queries to be run are fixed and known. In this paper, we consider the more general problem of dynamic concurrent workloads. Unlike most previous work on query execution time prediction, our proposed framework is based on analytic modeling rather than machine learning. We first use the optimizer’s cost model to estimate the I/O and CPU requirements for each pipeline of each query in isolation, and then use a combined queueing model and buffer pool model that merges the I/O and CPU requests from concurrent queries to predict running times. We compare the proposed approach with a machine-learning based approach that is a variant of previous work. Our experiments show that our analytic-model based approach can lead to competitive and often better prediction accuracy than its machine-learning based counterpart.
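A heavily simplified sketch of the analytic approach: each query carries CPU and I/O service demands (assumed here to come from a calibrated optimizer cost model), and concurrent queries share the CPU and the disk in processor-sharing fashion until their demands drain. This ignores the paper's buffer pool model and per-pipeline structure; the numbers are invented.

```python
# Simplified analytic model: concurrent queries share CPU and disk in
# processor-sharing fashion; we advance time in small steps until each
# query's CPU and I/O demands (seconds of service time) are drained.
def predict_finish_times(queries, dt=0.01):
    """queries: {name: [cpu_seconds_remaining, io_seconds_remaining]}"""
    remaining = {q: list(d) for q, d in queries.items()}
    finish, t = {}, 0.0
    while remaining:
        t += dt
        cpu_users = [q for q, d in remaining.items() if d[0] > 0]
        io_users = [q for q, d in remaining.items() if d[1] > 0]
        for q in cpu_users:
            remaining[q][0] -= dt / len(cpu_users)  # fair CPU share
        for q in io_users:
            remaining[q][1] -= dt / len(io_users)   # fair disk share
        for q in list(remaining):
            if remaining[q][0] <= 0 and remaining[q][1] <= 0:
                finish[q] = round(t, 2)
                del remaining[q]
    return finish

print(predict_finish_times({"q1": [2.0, 1.0], "q2": [0.5, 3.0]}))
```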
GSLPI: A cost-based query progress indicator
In ICDE, 2012
"... Abstract—Progress indicators for SQL queries were first published in 2004 with the simultaneous and independent proposals from Chaudhuri et al. and Luo et al. In this paper, we implement both progress indicators in the same commercial RDBMS to investigate their performance. We summarize common cases ..."
Abstract
-
Cited by 6 (5 self)
- Add to MetaCart
(Show Context)
Progress indicators for SQL queries were first published in 2004 with the simultaneous and independent proposals from Chaudhuri et al. and Luo et al. In this paper, we implement both progress indicators in the same commercial RDBMS to investigate their performance. We summarize common cases in which they are both accurate and cases in which they fail to provide reliable estimates. Although there are differences in their performance, much more striking is the similarity in the errors they make due to a common simplifying assumption of uniform future speed. While the developers of these progress indicators were aware that this assumption could cause errors, they neither explored how large the errors might be nor investigated the feasibility of removing the assumption. To rectify this we propose a new query progress indicator, similar to these early progress indicators but without the uniform speed assumption. Experiments show that on the TPC-H benchmark, on queries for which the original progress indicators have errors of up to 30X the query running time, the new progress indicator is accurate to within 10 percent. We also discuss the sources of the errors that still remain and shed some light on what would need to be done to eliminate them.
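The uniform-speed pitfall is easy to illustrate: if remaining work is estimated at the current speed, a fast scan pipeline followed by a slow sort pipeline yields wildly optimistic estimates, whereas keeping a per-pipeline speed estimate does not. The work units and speeds below are invented.

```python
# Contrast between the two estimators: the classic indicators assume
# the current speed holds for all remaining work; a speed-aware
# indicator keeps a per-pipeline speed estimate.
def remaining_uniform_speed(pipelines, current_speed):
    # Classic assumption: all remaining work proceeds at today's speed.
    return sum(p["remaining_work"] for p in pipelines) / current_speed

def remaining_per_pipeline(pipelines):
    # Speed-aware: each pipeline drains at its own estimated speed.
    return sum(p["remaining_work"] / p["est_speed"] for p in pipelines)

pipelines = [
    {"remaining_work": 1e6, "est_speed": 5e5},  # fast scan pipeline
    {"remaining_work": 1e6, "est_speed": 2e4},  # slow sort pipeline
]
print("uniform-speed estimate (s):", remaining_uniform_speed(pipelines, 5e5))
print("per-pipeline estimate (s):", remaining_per_pipeline(pipelines))
```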
DBSeer: Resource and Performance Prediction for Building a Next Generation Database Cloud
Cited by 6 (1 self)
Cloud computing is characterized by shared infrastructure and a decoupling between its operators and tenants. These two characteristics impose new challenges for database applications hosted in the cloud, namely: (i) how to price database services, (ii) how to isolate database tenants, and (iii) how to optimize database performance on this shared infrastructure. We argue that today’s solutions, based on virtual machines, do not properly address these challenges. We hint at new research directions to tackle these problems and argue that these three challenges share a common need for accurate predictive models of performance and resource utilization. We present our approach, called DBSeer, with our initial results on predictive models for the important class of OLTP/Web workloads and show how they can be used to address these challenges.
Same Queries, Different Data: Can we Predict Runtime Performance?
In SMDB, 2012
"... Abstract — We consider MapReduce workloads that are produced by analytics applications. In contrast to ad hoc query workloads, analytics applications are comprised of fixed data flows that are run over newly arriving data sets or on different portions of an existing data set. Examples of such worklo ..."
Abstract
-
Cited by 5 (1 self)
- Add to MetaCart
(Show Context)
We consider MapReduce workloads that are produced by analytics applications. In contrast to ad hoc query workloads, analytics applications consist of fixed data flows that are run over newly arriving data sets or on different portions of an existing data set. Examples of such workloads include document analysis/indexing, social media analytics, and ETL (Extract Transform Load). Motivated by these workloads, we propose a technique that predicts the runtime performance for a fixed set of queries running over varying input data sets. Our prediction technique splits each query into several segments where each segment’s performance is estimated using machine learning models. These per-segment estimates are plugged into a global analytical model to predict the overall query runtime. Our approach uses minimal statistics about the input data sets (e.g., tuple size, cardinality), which are complemented with historical information about prior query executions (e.g., execution time). We analyze the accuracy of predictions for several segment granularities on both standard analytical benchmarks such as TPC-DS [17], and on several real workloads. We obtain prediction errors below 25% for 90% of predictions.
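A minimal sketch of the segment-based scheme described above: fit each segment's runtime from historical executions as a function of simple input statistics, then combine the per-segment estimates with a global model (here, a plain sum over sequential segments). All history and statistics are invented.

```python
# Sketch: per-segment runtime models fitted from historical runs of a
# fixed dataflow, combined by a global model (a sum over sequential
# segments) to predict the overall runtime on a new input.
import numpy as np

history = {
    # segment -> list of ((input_tuples, avg_tuple_bytes), seconds)
    "map":    [((1e6, 200), 30.0), ((4e6, 200), 118.0), ((2e6, 400), 80.0)],
    "reduce": [((1e6, 200), 12.0), ((4e6, 200),  50.0), ((2e6, 400), 26.0)],
}

def fit_segment(samples):
    X = np.array([[1.0, n, n * b] for (n, b), _ in samples])
    y = np.array([t for _, t in samples])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

models = {seg: fit_segment(s) for seg, s in history.items()}

def predict(n_tuples, tuple_bytes):
    x = np.array([1.0, n_tuples, n_tuples * tuple_bytes])
    # Global model: segments run sequentially, so total time is the sum.
    return sum(float(x @ c) for c in models.values())

print("predicted runtime (s):", predict(3e6, 250))
```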
A Statistical Approach Towards Robust Progress Estimation
"... The need for accurate SQL progress estimation in the context of decision support administration has led to a number of techniques proposed for this task. Unfortunately, no single one of these progress estimators behaves robustly across the variety of SQL queries encountered in practice, meaning that ..."
Abstract
-
Cited by 5 (0 self)
- Add to MetaCart
(Show Context)
The need for accurate SQL progress estimation in the context of decision support administration has led to a number of techniques proposed for this task. Unfortunately, no single one of these progress estimators behaves robustly across the variety of SQL queries encountered in practice, meaning that each technique performs poorly for a significant fraction of queries. This paper proposes a novel estimator selection framework that uses a statistical model to characterize the sets of conditions under which certain estimators outperform others, leading to a significant increase in estimation robustness. The generality of this framework also enables us to add a number of novel “special purpose” estimators which increase accuracy further. Most importantly, the resulting model generalizes well to queries very different from the ones used to train it. We validate our findings using a large number of industrial real-life and benchmark workloads.
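The selection idea can be sketched with a toy selector: characterize each query by a few plan features and pick the estimator that performed best on the most similar training queries. A 1-nearest-neighbor rule stands in for the paper's statistical model; the features and estimator names are hypothetical.

```python
# Toy estimator selector: learn from training queries which progress
# estimator had the lowest error, then pick the estimator of the most
# similar training query at runtime (1-nearest-neighbor stand-in).
import numpy as np

# Training queries: plan features -> best-performing estimator.
train_X = np.array([
    [2, 0, 1],   # columns = [num_joins, has_sort, scan_heavy]
    [5, 1, 0],
    [1, 0, 1],
    [6, 1, 1],
])
train_best = ["dne", "gnm", "dne", "luo"]  # hypothetical estimator names

def pick_estimator(features):
    dists = np.linalg.norm(train_X - np.asarray(features), axis=1)
    return train_best[int(np.argmin(dists))]

print(pick_estimator([4, 1, 0]))  # -> "gnm"
```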
Resource and Performance Prediction for Building a Next Generation Database Cloud
"... Cloud computing is characterized by shared infrastructure and a decoupling between its operators and tenants. These two characteristics impose new challenges to databases applications hosted in the cloud, namely: (i) how to price database services, (ii) how to isolate database tenants, and (iii) how ..."
Abstract
-
Cited by 3 (1 self)
- Add to MetaCart
(Show Context)
Cloud computing is characterized by shared infrastructure and a decoupling between its operators and tenants. These two characteristics impose new challenges for database applications hosted in the cloud, namely: (i) how to price database services, (ii) how to isolate database tenants, and (iii) how to optimize database performance on this shared infrastructure. We argue that today’s solutions, based on virtual machines, do not properly address these challenges. We hint at new research directions to tackle these problems and argue that these three challenges share a common need for accurate predictive models of performance and resource utilization. We present initial predictive models for the important class of OLTP/Web workloads and show how they can be used to address these challenges.
PREDIcT: Towards Predicting the Runtime of Large Scale Iterative Analytics
"... Machine learning algorithms are widely used today for analytical tasks such as data cleaning, data categorization, or data filtering. At the same time, the rise of social media motivates recent uptake in large scale graph processing. Both categories of algorithms are dominated by iterative subtasks, ..."
Abstract
-
Cited by 2 (0 self)
- Add to MetaCart
(Show Context)
Machine learning algorithms are widely used today for analytical tasks such as data cleaning, data categorization, or data filtering. At the same time, the rise of social media motivates the recent uptake of large-scale graph processing. Both categories of algorithms are dominated by iterative subtasks, i.e., processing steps which are executed repetitively until a convergence condition is met. Optimizing cluster resource allocations among multiple workloads of iterative algorithms motivates the need for estimating their runtime, which in turn requires: (i) predicting the number of iterations, and (ii) predicting the processing time of each iteration. As both parameters depend on the characteristics of the dataset and on the convergence function, estimating their values before execution is difficult. This paper proposes PREDIcT, an experimental methodology for predicting the runtime of iterative algorithms. PREDIcT uses sample runs to capture the algorithm’s convergence trend and per-iteration key input features that are well correlated with the actual processing requirements of the complete input dataset. Using this combination of characteristics we predict the runtime of iterative algorithms, including algorithms with very different runtime patterns among subsequent iterations. Our experimental evaluation of multiple algorithms on scale-free graphs shows a relative prediction error of 10%-30% for predicting runtime, including algorithms with up to 100x runtime variability among consecutive iterations.
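The two ingredients named above invite a small sketch: (i) extrapolate the number of iterations from the convergence trend of a sample run, and (ii) scale per-iteration time from the sample to the full input. A geometric decay of the convergence metric and linear scaling in input size are assumptions made here for illustration; real algorithms need richer trend models.

```python
# Sketch of runtime prediction for an iterative algorithm: estimate the
# convergence decay from a sample run, extrapolate the iteration count,
# and scale per-iteration time to the full dataset. All numbers invented.
import math

# Convergence metric (e.g., rank delta) measured on a sample run.
sample_deltas = [1.0, 0.42, 0.18, 0.075]

# Estimate the geometric decay ratio from consecutive iterations.
ratios = [b / a for a, b in zip(sample_deltas, sample_deltas[1:])]
r = sum(ratios) / len(ratios)

def predict_iterations(initial_delta, threshold):
    # Iterations k needed until initial_delta * r**k < threshold.
    return math.ceil(math.log(threshold / initial_delta, r))

iters = predict_iterations(initial_delta=1.0, threshold=1e-6)

# Scale per-iteration cost from the sample to the full dataset
# (assumed here to scale linearly in input size).
sample_iter_seconds, sample_size, full_size = 2.0, 1e6, 5e7
predicted = iters * sample_iter_seconds * (full_size / sample_size)
print(f"{iters} iterations, ~{predicted:.0f} s total")
```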