Results 1 - 10 of 36
Predicting multiple metrics for queries: Better decisions enabled by machine learning
In ICDE, 2009
"... Abstract — One of the most challenging aspects of managing a very large data warehouse is identifying how queries will behave before they start executing. Yet knowing their performance characteristics — their runtimes and resource usage — can solve two important problems. First, every database vendo ..."
Abstract
-
Cited by 66 (1 self)
One of the most challenging aspects of managing a very large data warehouse is identifying how queries will behave before they start executing. Yet knowing their performance characteristics — their runtimes and resource usage — can solve two important problems. First, every database vendor struggles with managing unexpectedly long-running queries. When these long-running queries can be identified before they start, they can be rejected or scheduled when they will not cause extreme resource contention for the other queries in the system. Second, deciding whether a system can complete a given workload in a given time period (or whether a bigger system is necessary) depends on knowing the resource requirements of the queries in that workload. We have developed a system that uses machine learning to accurately predict the performance metrics of database queries whose execution times range from milliseconds to hours. For training and testing our system, we used both real customer queries and queries generated from an extended set of TPC-DS templates; the extensions mimic queries that caused customer problems. We used these queries to compare how accurately different techniques predict metrics such as elapsed time, records used, disk I/Os, and message bytes. The most promising technique was not only the most accurate, but also predicted these metrics simultaneously, using only information available prior to query execution. We validated the accuracy of this machine learning technique on a number of HP Neoview configurations. We were able to predict individual query elapsed time within 20% of its actual time for 85% of the test queries. Most importantly, we were able to correctly identify both the short and the long-running (up to two hours) queries to inform workload management and capacity planning.
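The core idea above, predicting several metrics at once from information available before execution, can be sketched as a multi-output regression. The feature set, the random-forest model, and the toy numbers below are illustrative assumptions, not the paper's actual setup:

```python
# Illustrative sketch: multi-output regression over pre-execution plan features.
# Features and model choice are assumptions, not the system described in the paper.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical pre-execution features per query:
# [estimated cardinality, number of joins, tables scanned, optimizer cost estimate]
X_train = np.array([
    [1e4, 1, 2, 350.0],
    [5e6, 3, 4, 9200.0],
    [2e2, 0, 1, 12.0],
    [8e5, 2, 3, 4100.0],
])
# Targets predicted jointly: elapsed time (s), records used, disk I/Os, message bytes.
y_train = np.array([
    [0.8,   1e4, 120,   2e5],
    [310.0, 5e6, 9e4,   7e7],
    [0.05,  2e2, 4,     1e3],
    [95.0,  8e5, 2.1e4, 1.5e7],
])

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)            # one model, several metrics at once

new_query = np.array([[3e5, 2, 3, 2800.0]])
elapsed, records, ios, msg_bytes = model.predict(new_query)[0]
print(f"predicted elapsed time: {elapsed:.1f}s, disk I/Os: {ios:.0f}")
```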
XSEED: Accurate and fast cardinality estimation for XPath queries
In Proc. 22nd Int. Conf. on Data Engineering (ICDE), 2006
"... We propose XSEED, a synopsis of path queries for cardinality estimation that is accurate, robust, efficient, and adaptive to memory budgets. XSEED starts from a very small kernel, and then incrementally updates information of the synopsis. With such an incremental construction, a synopsis structure ..."
Abstract
-
Cited by 22 (1 self)
We propose XSEED, a synopsis for cardinality estimation of path queries that is accurate, robust, efficient, and adaptive to memory budgets. XSEED starts from a very small kernel and then incrementally adds information to the synopsis. With such an incremental construction, the synopsis structure can be dynamically configured to accommodate different memory budgets. Cardinality estimation based on XSEED can be performed very efficiently and accurately. Extensive experiments on both synthetic and real data sets show that, even with less memory, XSEED achieves accuracy an order of magnitude better than that of other synopsis structures. The cardinality estimation time is under 2% of the actual querying time for a wide range of queries in all test cases.
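As a rough illustration of estimating path-query cardinality from a small summary, the sketch below builds a toy label-pair synopsis and multiplies average fan-outs along a simple child-axis path; this is only an illustration of the general idea, not the XSEED kernel structure:

```python
# Toy path synopsis: count nodes per label and edges per (parent, child) label pair,
# then estimate the cardinality of a simple /a/b/c path by chaining average fan-outs.
from collections import Counter
import xml.etree.ElementTree as ET

def build_synopsis(root):
    node_counts, edge_counts = Counter(), Counter()
    def walk(node):
        node_counts[node.tag] += 1
        for child in node:
            edge_counts[(node.tag, child.tag)] += 1
            walk(child)
    walk(root)
    return node_counts, edge_counts

def estimate_path(path, node_counts, edge_counts):
    """Estimate cardinality of a child-axis path such as '/site/people/person'."""
    labels = [l for l in path.split("/") if l]
    est = float(node_counts.get(labels[0], 0))
    for parent, child in zip(labels, labels[1:]):
        parents = node_counts.get(parent, 0)
        if parents == 0:
            return 0.0
        est *= edge_counts.get((parent, child), 0) / parents   # average fan-out
    return est

doc = ET.fromstring(
    "<site><people><person/><person/></people><people><person/></people></site>")
nodes, edges = build_synopsis(doc)
print(estimate_path("/site/people/person", nodes, edges))   # 3.0 (exact on this tiny tree)
```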
Active and Accelerated Learning of Cost Models for Optimizing Scientific Applications
In Proc. of the 2006 Intl. Conf. on Very Large Data Bases, 2006
"... We present the NIMO system that automatically learns cost models for predicting the execution time of computationalscience applications running on large-scale networked utilities such as computational grids. Accurate cost models are important for selecting efficient plans for executing these applica ..."
Abstract
-
Cited by 20 (6 self)
We present the NIMO system, which automatically learns cost models for predicting the execution time of computational-science applications running on large-scale networked utilities such as computational grids. Accurate cost models are important for selecting efficient plans for executing these applications on the utility. Computational-science applications are often scripts (written, e.g., in languages like Perl or Matlab) connected using a workflow-description language, and therefore pose different challenges compared to modeling the execution of plans for declarative queries with well-understood semantics. NIMO generates appropriate training samples for these applications to learn fairly accurate cost models quickly using statistical learning techniques. NIMO's approach is active and noninvasive: it actively deploys and monitors the application under varying conditions, and obtains its training data from passive instrumentation streams that require no changes to the operating system or applications. Our experiments with real scientific applications demonstrate that NIMO significantly reduces the number of training samples and the time needed to learn fairly accurate cost models.
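A minimal sketch of the accelerated-learning idea: acquire training runs one at a time and stop as soon as a held-out check says the cost model is accurate enough, rather than running every candidate experiment. The resource knobs and the synthetic application below are assumptions for illustration only:

```python
# Incremental sample acquisition with an early-stopping accuracy check (illustrative).
import numpy as np
from sklearn.linear_model import LinearRegression

def run_application(cpu_frac, net_latency_ms):
    # Stand-in for deploying and instrumenting the real application.
    return 200.0 - 120.0 * cpu_frac + 0.8 * net_latency_ms

candidate_assignments = [(c, l) for c in (0.25, 0.5, 0.75, 1.0) for l in (1, 10, 50, 100)]
holdout = [(0.6, 20), (0.9, 5)]
holdout_y = np.array([run_application(*a) for a in holdout])

X, y = [], []
for assignment in candidate_assignments:       # acquire training runs incrementally
    X.append(list(assignment))
    y.append(run_application(*assignment))
    if len(y) < 4:
        continue
    model = LinearRegression().fit(np.array(X), np.array(y))
    rel_err = np.abs(model.predict(np.array(holdout)) - holdout_y) / holdout_y
    if rel_err.max() < 0.15:                    # "fairly accurate": stop sampling early
        break

print(f"stopped after {len(y)} of {len(candidate_assignments)} candidate runs")
```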
Learning-based Query Performance Modeling and Prediction
"... Abstract — Accurate query performance prediction (QPP) is central to effective resource management, query optimization and query scheduling. Analytical cost models, used in current generation of query optimizers, have been successful in comparing the costs of alternative query plans, but they are po ..."
Abstract
-
Cited by 18 (2 self)
Accurate query performance prediction (QPP) is central to effective resource management, query optimization, and query scheduling. Analytical cost models, used in the current generation of query optimizers, have been successful in comparing the costs of alternative query plans, but they are poor predictors of execution latency. As a more promising approach to QPP, this paper studies the practicality and utility of sophisticated learning-based models, which have recently been applied to a variety of predictive tasks with great success, in both static (i.e., fixed) and dynamic query workloads. We propose and evaluate predictive modeling techniques that learn query execution behavior at different granularities, ranging from coarse-grained plan-level models to fine-grained operator-level models. We demonstrate that these two extremes offer a tradeoff between high accuracy for static workload queries and generality to unforeseen queries in dynamic workloads, respectively, and introduce a hybrid approach that combines their respective strengths by selectively composing them in the process of QPP. We discuss how a training workload can be used to (i) pre-build and materialize such models offline, so that they are readily available for future predictions, and (ii) build new models online as new predictions are needed. All prediction models are built using only static features (available prior to query execution) and the performance values obtained from the offline execution of the training workload. We fully implemented all these techniques and extensions on top of PostgreSQL and evaluated them experimentally by quantifying their effectiveness over analytical workloads, represented by well-established TPC-H data and queries. The results provide quantitative evidence that learning-based modeling for QPP is both feasible and effective for both static and dynamic workload scenarios.
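The two modeling granularities can be sketched as follows; the features, the linear models, and the toy training data are illustrative assumptions rather than the paper's design:

```python
# Plan-level vs. operator-level latency models (illustrative sketch).
import numpy as np
from sklearn.linear_model import LinearRegression

# Plan-level model: one vector of aggregate plan features -> total latency.
plan_features = np.array([   # e.g. [optimizer cost, #joins, est. total rows]
    [1200.0, 2, 5e5],
    [80.0,   0, 1e4],
    [9300.0, 4, 3e6],
])
plan_latency = np.array([14.2, 0.6, 210.0])
plan_model = LinearRegression().fit(plan_features, plan_latency)

# Operator-level models: one model per operator type (feature: estimated input rows);
# the plan's latency is composed from the per-operator estimates.
op_training = {
    "SeqScan":  (np.array([[1e4], [5e5], [3e6]]), np.array([0.3, 9.0, 60.0])),
    "HashJoin": (np.array([[2e4], [8e5], [2e6]]), np.array([0.5, 20.0, 55.0])),
}
op_models = {op: LinearRegression().fit(X, y) for op, (X, y) in op_training.items()}

def predict_plan(operators):
    """operators: list of (operator_type, estimated_input_rows)."""
    return sum(float(op_models[op].predict([[rows]])[0]) for op, rows in operators)

print(plan_model.predict([[2500.0, 1, 8e5]])[0])            # plan-level prediction
print(predict_plan([("SeqScan", 8e5), ("HashJoin", 8e5)]))  # operator-level prediction
```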
TGV: a Tree Graph View for Modelling Untyped XQuery
"... XML [7] has become a de facto standard to exchange and represent any kind of data in various contexts. XML data can be manipulated using the XQuery [31] language, which can express though a compact and comprehensive way any queries and transformations. An XQuery expression is evaluated as follows. ( ..."
Abstract
-
Cited by 9 (3 self)
XML [7] has become a de facto standard for exchanging and representing any kind of data in various contexts. XML data can be manipulated using the XQuery [31] language, which can express arbitrary queries and transformations in a compact and comprehensive way. An XQuery expression is evaluated as follows: (1) the expression is rewritten into a "canonical XQuery", then (2) it is modeled in an internal representation, and finally (3) it is optimized and executed. In [9], Chen et al. proposed rules to rewrite XQuery expressions into "canonical XQuery". However, their solution is restricted to a rather limited subset of XQuery and does not support complex XQuery expressions. [4] introduced the TPQ model to represent a single FWR statement as a tree pattern plus a formula. Then [9] proposed the GTP model, which generalizes this approach to support several FWR clauses. However, their model cannot fully capture the expressiveness of XQuery, cannot handle mediation problems, and does not support extensible optimizations. In this paper, we make three contributions. First, we extend the rules developed by [9] to rewrite any XQuery expression into a "canonical XQuery". Second, we design a new model called TGV, which supports all the functionality of XQuery, uses an intuitive representation suited to mediation scenarios, and provides support for optimization and cost information. Finally, we provide a generic cost model coupled with the TGV.
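A minimal sketch of the tree-pattern idea underlying TPQ/GTP-style models (the TGV model itself is richer than this), assuming a simple node structure with child/descendant axes, output flags, and predicate formulas:

```python
# Toy tree-pattern representation of an XQuery FWR clause (illustrative only).
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class PatternNode:
    label: str
    axis: str = "child"              # "child" ('/') or "descendant" ('//')
    returned: bool = False           # projected by the RETURN clause?
    predicate: Optional[str] = None  # attached formula, kept as text here
    children: List["PatternNode"] = field(default_factory=list)

# for $b in doc("bib.xml")//book where $b/year > 2000 return $b/title
pattern = PatternNode("book", axis="descendant", children=[
    PatternNode("year", predicate="> 2000"),
    PatternNode("title", returned=True),
])
print(pattern)
```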
Robust Estimation of Resource Consumption for SQL Queries using Statistical Techniques
"... The ability to estimate resource consumption of SQL queries is crucial for a number of tasks in a database system such as admission control, query scheduling and costing during query optimization. Recent work has explored the use of statistical techniques for resource estimation in place of the manu ..."
Abstract
-
Cited by 8 (0 self)
The ability to estimate the resource consumption of SQL queries is crucial for a number of tasks in a database system, such as admission control, query scheduling, and costing during query optimization. Recent work has explored the use of statistical techniques for resource estimation in place of the manually constructed cost models used in query optimization. Such techniques, which require examples of query resource usage as training data, offer the promise of superior estimation accuracy since they can account for factors such as the hardware characteristics of the system or bias in cardinality estimates. However, the proposed approaches lack robustness in that they do not generalize well to queries that are different from the training examples, resulting in significant estimation errors. Our approach aims to address this problem by combining knowledge of database query processing with statistical models. We model resource usage at the level of individual operators, with different models and features for each operator type, and explicitly model the asymptotic behavior of each operator. This results in significantly better estimation accuracy and the ability to estimate the resource usage of arbitrary plans, even when they are very different from the training instances. We validate our approach using various large-scale real-life and benchmark workloads on Microsoft SQL Server.
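Per-operator modeling with explicit asymptotic behavior might look roughly like the following sketch, where each operator type gets its own model and its input size is transformed by an assumed asymptotic form before fitting so the model can extrapolate beyond the training range; the scaling functions and features are illustrative, not the paper's:

```python
# Per-operator resource models with explicit asymptotic scaling (illustrative sketch).
import math
import numpy as np
from sklearn.linear_model import LinearRegression

ASYMPTOTIC = {
    "Scan": lambda n: n,                              # linear in input rows
    "Sort": lambda n: n * math.log2(max(n, 2)),       # n log n
    "NestedLoopJoin": lambda n: n ** 2,               # quadratic (illustrative)
}

def fit_operator_model(op_type, rows, observed_cpu):
    scaled = np.array([[ASYMPTOTIC[op_type](n)] for n in rows])
    return LinearRegression().fit(scaled, np.array(observed_cpu))

def predict(op_type, model, n_rows):
    return float(model.predict([[ASYMPTOTIC[op_type](n_rows)]])[0])

sort_model = fit_operator_model("Sort", [1e3, 1e4, 1e5], [0.02, 0.25, 3.1])
print(predict("Sort", sort_model, 1e7))   # extrapolates along the n log n shape
```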
Towards Adaptive Costing of Database Access Methods
In Proc. of the Int’l. Workshop on Self-Managing Database Systems (SMDB), 2007
"... Most database query optimizers use cost models to identify good query execution plans. Inaccuracies in the cost models can cause query optimizers to select poor plans. In this paper, we consider the problem of accurately estimating the I/O costs of database access methods, such as index scans. We pr ..."
Abstract
-
Cited by 7 (1 self)
Most database query optimizers use cost models to identify good query execution plans. Inaccuracies in the cost models can cause query optimizers to select poor plans. In this paper, we consider the problem of accurately estimating the I/O costs of database access methods, such as index scans. We present experimental results showing that existing analytical I/O cost models can be very inaccurate, along with a simple analysis showing that larger cost estimation errors cause the query optimizer to make larger mistakes in plan selection. We propose the use of an adaptive, black-box statistical cost estimation methodology to achieve better estimates.
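A minimal sketch of a black-box, adaptive cost estimator for index-scan I/O, which refits a statistical model from execution feedback instead of relying on a fixed analytical formula; the feature set and model choice are assumptions for illustration:

```python
# Adaptive, black-box I/O cost estimation for an index scan (illustrative sketch).
import numpy as np
from sklearn.linear_model import LinearRegression

class AdaptiveIndexScanCost:
    def __init__(self):
        self.history_X, self.history_y = [], []
        self.model = None

    def observe(self, selectivity, table_pages, pages_read):
        # Record feedback from an executed plan and refit once enough data exists.
        self.history_X.append([selectivity, table_pages, selectivity * table_pages])
        self.history_y.append(pages_read)
        if len(self.history_y) >= 3:
            self.model = LinearRegression().fit(np.array(self.history_X),
                                                np.array(self.history_y))

    def estimate(self, selectivity, table_pages):
        if self.model is None:                     # crude fallback before any feedback
            return selectivity * table_pages
        return float(self.model.predict([[selectivity, table_pages,
                                          selectivity * table_pages]])[0])

cost = AdaptiveIndexScanCost()
for sel, pages, observed in [(0.01, 10_000, 450), (0.05, 10_000, 1900), (0.10, 10_000, 3600)]:
    cost.observe(sel, pages, observed)
print(cost.estimate(0.02, 10_000))
```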
Active Sampling for Accelerated Learning of Performance Models
In First Workshop on Tackling Computer Systems Problems with Machine Learning Techniques (SysML), 2006
"... Models of system behavior are useful for prediction, diagnosis, and optimization in self-managing systems. Statistical learning approaches have an important role in part because they can infer system models automatically from instrumentation data collected as the system operates. Work in this domain ..."
Abstract
-
Cited by 6 (2 self)
Models of system behavior are useful for prediction, diagnosis, and optimization in self-managing systems. Statistical learning approaches have an important role, in part because they can infer system models automatically from instrumentation data collected as the system operates. Work in this domain often takes a “given the right data, we learn the right model” approach, and leaves the issue of acquiring the “right data” unaddressed. This gap is problematic for two reasons: (i) the accuracy of the models depends on adequate coverage of the system operating range in the observations; and (ii) it may take a long time to obtain adequate coverage with passive observations. This paper describes our approach to bridging this gap in the NIMO system, which incorporates active learning of performance models in a self-managing computing utility.
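The active-sampling idea can be sketched as an uncertainty-driven loop: fit a model on the samples gathered so far and run the next experiment where the model is least certain. The Gaussian-process model, the single workload knob, and the stand-in benchmark function are illustrative assumptions:

```python
# Uncertainty-driven active sampling of a performance model (illustrative sketch).
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def run_experiment(x):
    # Stand-in for deploying the system at operating point x and measuring runtime.
    return 0.5 * x + 0.1 * x ** 1.5

X = [[1.0], [10.0]]                              # two seed measurements
y = [run_experiment(v[0]) for v in X]
candidates = np.linspace(1.0, 10.0, 50).reshape(-1, 1)

for _ in range(5):                               # active-sampling loop
    gp = GaussianProcessRegressor(alpha=1e-6).fit(np.array(X), np.array(y))
    _, std = gp.predict(candidates, return_std=True)
    next_x = float(candidates[int(np.argmax(std)), 0])   # most uncertain point
    X.append([next_x])
    y.append(run_experiment(next_x))

print(f"collected {len(X)} samples; largest remaining predictive std: {std.max():.3f}")
```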
Same Queries, Different Data: Can We Predict Runtime Performance?
In SMDB, 2012
"... Abstract — We consider MapReduce workloads that are produced by analytics applications. In contrast to ad hoc query workloads, analytics applications are comprised of fixed data flows that are run over newly arriving data sets or on different portions of an existing data set. Examples of such worklo ..."
Abstract
-
Cited by 5 (1 self)
We consider MapReduce workloads that are produced by analytics applications. In contrast to ad hoc query workloads, analytics applications consist of fixed data flows that are run over newly arriving data sets or over different portions of an existing data set. Examples of such workloads include document analysis/indexing, social media analytics, and ETL (Extract, Transform, Load). Motivated by these workloads, we propose a technique that predicts the runtime performance of a fixed set of queries running over varying input data sets. Our prediction technique splits each query into several segments, where each segment's performance is estimated using machine learning models. These per-segment estimates are plugged into a global analytical model to predict the overall query runtime. Our approach uses minimal statistics about the input data sets (e.g., tuple size, cardinality), which are complemented with historical information about prior query executions (e.g., execution time). We analyze the accuracy of predictions for several segment granularities on both standard analytical benchmarks such as TPC-DS [17] and several real workloads. We obtain prediction errors below 25% for 90% of predictions.
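A minimal sketch of the per-segment approach under simplifying assumptions (three fixed segments, a plain additive combination, and made-up historical runs), not the paper's exact models:

```python
# Per-segment runtime models combined into a whole-flow prediction (illustrative sketch).
import numpy as np
from sklearn.linear_model import LinearRegression

# Historical executions of the same flow on different inputs:
# features per run = [input cardinality, average tuple size in bytes]
runs = np.array([[1e6, 200], [5e6, 210], [2e7, 195], [5e7, 205]])
segment_times = {                        # observed seconds per segment for those runs
    "map":     np.array([12.0,  58.0, 230.0, 575.0]),
    "shuffle": np.array([ 4.0,  21.0,  85.0, 210.0]),
    "reduce":  np.array([ 6.0,  29.0, 118.0, 300.0]),
}
segment_models = {s: LinearRegression().fit(runs, t) for s, t in segment_times.items()}

def predict_runtime(cardinality, tuple_size):
    # Global combination here is a plain sum of the per-segment estimates.
    x = [[cardinality, tuple_size]]
    return sum(float(m.predict(x)[0]) for m in segment_models.values())

print(f"{predict_runtime(3e7, 200):.0f} s")   # same flow, a new input data set
```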
Self-Tuning Distribution of DB-Operations on Hybrid CPU/GPU Platforms
"... A current research trend focuses on accelerating database operations with the help of GPUs (Graphics Processing Units). Since GPU algorithms are not necessarily faster than their CPU counterparts, it is important to use them only if they outperform their CPU counterparts. In this paper, we address t ..."
Abstract
-
Cited by 4 (3 self)
A current research trend focuses on accelerating database operations with the help of GPUs (Graphics Processing Units). Since GPU algorithms are not necessarily faster than their CPU counterparts, it is important to use them only when they outperform the CPU implementations. In this paper, we address this problem by constructing a decision model for a framework that distributes database operations across CPUs and GPUs so as to minimize response time. Furthermore, we discuss the quality measures necessary for evaluating our model.
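A rough sketch of such a decision model under simplifying assumptions (one learned runtime model per device, a single input-size feature, and made-up measurements); the real model would use richer features and operation-specific calibration:

```python
# Route each operation to the device with the lower predicted response time
# (illustrative sketch, not the paper's decision model).
import numpy as np
from sklearn.linear_model import LinearRegression

sizes = np.array([[1e4], [1e5], [1e6], [1e7]])
cpu_times = np.array([0.4, 3.8, 41.0, 405.0])   # observed ms on the CPU
gpu_times = np.array([2.1, 3.0, 12.0, 98.0])    # transfer overhead, but better scaling

models = {
    "CPU": LinearRegression().fit(sizes, cpu_times),
    "GPU": LinearRegression().fit(sizes, gpu_times),
}

def choose_device(input_size):
    predictions = {dev: float(m.predict([[input_size]])[0]) for dev, m in models.items()}
    return min(predictions, key=predictions.get), predictions

device, preds = choose_device(2e4)
print(device, preds)   # the CPU wins for this small input; try 5e6 and the GPU wins
```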