Automating Workload Management for Enterprise-Scale Business Intelligence Systems: Old Problems Just Won't Go Away




Automating Workload Management for Enterprise-Scale Business Intelligence Systems: Old Problems Just Won't Go Away. Umeshwar Dayal (umeshwar.dayal@hp.com), Harumi Kuno, Janet Wiener, Kevin Wilkinson, Abhay Mehta, Chetan Gupta; Intelligent Information Management Lab, Hewlett-Packard Labs, Palo Alto, CA, USA. Stefan Krompass, Alfons Kemper (TU Munich), Archana Ganapathi (UC Berkeley). 2008 Hewlett-Packard.

Outline
- Context: Next-Generation Business Intelligence
- Workload Management Challenges
- Predicting Query Performance
- Managing Batch Workloads
- Managing Interactive Workloads
- Experimental Framework for Studying Mixed Workload Management

Business Intelligence Today

Technologies, tools, and practices for collecting, integrating, analyzing, and presenting large volumes of information to enable better decision making.

- Strategic decision making: analytics by experts
- Off-line, serial pipeline: batch ETL (Extract-Transform-Load) from operational systems into an Enterprise Data Warehouse, feeding analysis, reporting, and applications
- Workloads run in isolation on different platforms

BI is the #1 technology priority for CIOs, 2006-2008. The market for BI is huge and growing (2006 numbers):
- Business analytics market: $19B, 10.3% growth
- Data warehouse platform software market: $5.7B, 12.5% growth
- BI services market (data integration, DW design, and vertical solutions): $31B

Next-Generation BI: Operational BI (bringing BI into the mainstream)

Data flows from operational systems, sensors, external feeds, and structured and unstructured information sources, with near-real-time extraction and integration, through high-ingest data access into analysis, model building, and real-time model execution for analytic applications.

A change in concurrency, reliability, and workload profile:
- Yesterday: a small number of back-office power users. Tomorrow: 10,000 users performing business actions dozens of times a day
- Near real time: no end-to-end latency
- Mixed workloads
- Robust, predictable performance
- Integrate structured and less-structured data

Gartner: 90% of Global 2000 will have mission-critical DW applications in place in the next 5 years.

Operational BI: Research Challenges

The Operational BI platform connects online operational systems (OLTP, external feeds, streams/events) through an operational data store and an extraction & integration layer to a parallel data warehouse (relational query processor, transactional storage manager), massively scalable analytics, and delivery and interaction services for online analytic applications with visual interaction and collaboration interfaces.

Challenges:
- Low-latency data integration pipeline; end-to-end optimization
- Integration of structured and less-structured data
- Massively scalable analytics
- Self-managing: automated design, tuning, and maintenance
- Robust, stable performance; highly scalable and available
- Mixed workloads: continuous, online updates alongside complex batch and ad hoc queries
- Exploiting highly parallel computer architectures, multi-core processors, and storage hierarchies

Outline
- Context: Next-Generation Business Intelligence
- Workload Management Challenges
- Predicting Query Performance
- Managing Batch Workloads
- Managing Interactive Workloads
- Experimental Framework for Studying Mixed Workload Management

Mixed Workload Performance Management

Managing data warehouse performance is like juggling feathers, golf balls, and bowling balls.
- Enterprise Data Warehouses are very complex and expensive to manage: hundreds of knobs
- Workloads are becoming increasingly complex and dynamic (especially for operational BI)
- Some queries run in seconds, some take hours
- Mixed workloads: batch (rollups, reports) and interactive (ad hoc queries)
- Different performance objectives: some queries are urgent, some have lower priority
- Automation is needed

"An Enterprise Data Warehouse must service a wide spectrum of queries: from simple, sub-second lookups, to complex multi-hour rollups. An EDW is expected to provide the best response time for on-demand queries while simultaneously meeting deadlines for batch queries. All of this must be done without overloading (or underloading) the system, and with effective controls for rogue queries. This is a difficult problem. Managing the query workload effectively is one of the major challenges in running an EDW."

"Through 2010, mixed workload performance will remain the single most important performance issue in data warehousing" [Gartner].

Workload Management Problems

- How long should this query take? With other queries running concurrently? What resources will it consume? (performance prediction)
- Should we allow this query into the system? (admission control)
- When should we start the query? (workload scheduling)
- Will this query ever finish? When? (progress indicator)
- What should we do if it is taking too long? Wait? Kill it? Kill it and restart it? When? Kill some other query? Which one? (execution control)

Workload Management Approach

- Machine learning models for predicting the performance of queries and workloads
  - Predict query run times under load
  - Simultaneously predict multiple performance metrics: CPU time, memory usage, message bytes, disk I/O, ...
- Experimental framework
  - Create a taxonomy of problem queries and operational BI workloads
  - Simulate system configurations, workload execution, and workload management policies
  - Systematically study the ability of workload management policies to deal with problem workloads
- Embed lessons learnt into policy-driven self-management tools

Workload Management Architecture

The architecture (figure) shows queries flowing through four workload-management components, coordinated by an SLA Manager that translates SLAs into workload SLOs:
1. Prediction of query runtime: the query optimizer's cost estimates, combined with system load, yield dynamic query cost estimates.
2. Query workload management: admission control and a scheduler place queries into queues (Queue 1 ... Queue k) feeding an execution manager that controls the execution engine.
3. Query progress indicator: a progress monitor tracks execution cost and speed (e.g., for Query_101: 18% complete, elapsed time 53:22, estimated cost 352, execution speed 1.2 units/min at time t*, estimated time left 4:09:20).
4. Early warning system: driven by system load from a load monitor and execution cost feedback from the engine.

Outline
- Context: Next-Generation Business Intelligence
- Workload Management Challenges
- Predicting Query Performance
- Managing Batch Workloads
- Managing Interactive Workloads
- Experimental Framework for Studying Mixed Workload Management

Currently, no good way to estimate execution time*

(Figure: cost as a predictor of execution time, over 559 HPIT canned reports and Yotta-generated queries. Plotting actual time taken against query optimizer cost estimates shows almost no correlation: best-fit line y = 0.2206x + 13.067, R² = 0.0766, RMS error = 26.70.)

* Even on a quiet system with no load and no query concurrency.

Prediction of Query Run Time (PQR)

Goal: predict the execution time of a query, allowing the user to trade off accuracy and precision.
- Learn from execution histories
- Understand the problem as one of classification with unknown classes
- At the heart of this approach is a machine learning model called a PQR Tree (Prediction of Query Runtime)

Approach:
- Obtaining a PQR Tree: from historic queries, extract features that describe the query plan and the system load (plan and load vectors); train a PQR Tree on these features.
- Obtaining a time range for a new query: extract the new query's plan and load features; apply the PQR Tree to these features to predict the time range (the predicted execution time range) of the new query.

A. Mehta, C. Gupta, U. Dayal, "PQR: Predicting Query Execution Times for Autonomous Workload Management," Proc. Intl. Conf. on Autonomic Computing, 2008.

PQR Tree

A PQR Tree, denoted T_s, is a binary tree such that:
1. For every node u of T_s, there is an associated 2-class classifier f_u.
2. The node u contains examples E_u, on which the classifier f_u is trained.
3. f_u decides, for each new query q with execution time t_q in [t_ua, t_ub], whether q should go to [t_ua, t_ua + Δ) or [t_ua + Δ, t_ub], where Δ lies in (0, t_ub − t_ua).
4. For every node u of T_s, there is an associated accuracy, measured as the percentage of correct predictions made by f_u on the example set E_u.
5. Every node and leaf of the tree corresponds to a time range.

Example tree (time ranges in seconds): N1 (classification tree, [1, 2690]) splits into N11 ([1, 170)) and N12 ([170, 2690)); N11 splits into N111 ([1, 13)) and N112 ([13, 170)); N12 splits into N121 ([170, 1000)) and N122 ([1000, 2690)), which splits into N1221 ([1000, 1629)) and N1222 ([1629, 2690)). Node accuracies in the figure range from 84.6% to 100%.

Feature Vectors

Query plan vector: select features through enumeration, construction, and domain expertise. Examples: bushiness, disk partition parallelism, process parallelism, number of I/Os, I/O cardinality, I/O cost, non-I/O cost, number of joins, join cardinality, number of probes, table count, number of sorts, sort cardinality, total cost, total estimated cardinality, etc.

Load vector: the CPU characteristics tend to be erratic and unpredictable. The "stretch" of a query is a good indicator of the load the query experiences while running, and MPL is a determining factor for stretch and a good variable for prediction. Hence the new load features: query stretched cost, query stretched process parallelism, query stretched operator cost, process stretched cost, process stretched process parallelism, process stretched operator cost.

Example query and plan:
SELECT n.name FROM tbnation n, tbregion r
WHERE n.region = r.region AND r.name = 'EUROPE'
ORDER BY n.name;

Plan: root, exchange, sort, and nested_join over two split/partitioning branches: file_scan [tbregion] and file_scan [tbnation].

Creating the PQR Tree

Recursively create nodes until one of the termination conditions is met:
- The time range is too small
- The number of training examples in a node is too small
- The accuracy of prediction falls below a threshold

To create a node:
- Take all the queries in the training set and arrange them in ascending order of running time. Each running time is a potential point at which the node's time range can be split.
- For every classifier in a fixed set of classifiers, compute the accuracy for each potential split point.
- Choose the combination of classifier and split point with the highest accuracy on the test set.

Enhancements:
- Top K: compute the percentage change between successive values in this list, and choose some fixed number of the largest changes as candidate split points.
- Skip split points: skip n_skip percent of points from both ends of an interval as candidate split points.
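The "Top K" and "skip split points" enhancements above can be sketched in a few lines: sort the training runtimes, compute the percentage change between successive values, and keep the largest jumps as candidate split points. Function and parameter names here are illustrative assumptions, not the paper's code.

```python
# Hypothetical sketch of the Top-K split-point heuristic: the biggest
# relative jumps between successive sorted runtimes mark natural class
# boundaries. skip_percent implements the "skip split points" variant.

def candidate_split_points(runtimes, k=3, skip_percent=0.1):
    """Return up to k candidate runtimes at which a node's range may split."""
    times = sorted(runtimes)
    lo = int(len(times) * skip_percent)      # ignore both ends of the
    hi = len(times) - lo                     # interval as candidates
    candidates = []
    for i in range(max(lo, 1), hi):
        prev, cur = times[i - 1], times[i]
        if prev > 0:
            change = (cur - prev) / prev     # percentage change
            candidates.append((change, cur))
    candidates.sort(reverse=True)            # largest jumps first
    return [t for _, t in candidates[:k]]
```

For example, in the runtimes [1, 2, 3, 100, 101, 102, 1000] the two largest relative jumps are at 100 and 1000, which is where a human would draw the class boundaries too.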

Applying the PQR Tree

To predict the time interval for the execution time of a new query:
- Decompose the query and the system load into the feature vector described earlier.
- Apply the classifier at each node to this feature vector to determine whether the new query belongs to the left or the right child.
- Recurse at each node until reaching a leaf of the PQR tree. The leaf's associated time range becomes the predicted execution time of the new query.

As the range of execution times becomes narrower, corresponding to greater depth in the tree, the prediction accuracy will be lower. This approach allows the user to choose a time range they are comfortable with.
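The descent described above is a simple loop. A minimal sketch, with the node structure and classifier interface assumed for illustration:

```python
# Sketch of applying a trained PQR tree to a new query: descend from the
# root, letting each node's 2-class classifier route the feature vector
# left or right, until a leaf's time range is reached.

class PQRNode:
    def __init__(self, time_range, classifier=None, left=None, right=None):
        self.time_range = time_range   # (low, high) in seconds
        self.classifier = classifier   # callable: features -> 'left'/'right'
        self.left, self.right = left, right

def predict_time_range(node, features):
    """Walk the PQR tree; return the leaf's predicted execution-time range."""
    while node.classifier is not None:
        node = node.left if node.classifier(features) == 'left' else node.right
    return node.time_range
```

With the example tree from the previous slide, a cheap query would be routed to the [1, 170) subtree and an expensive one to [170, 2690).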

Resulting Prediction of Query Runtime (PQR) Model (loaded system)

Test set: 170 actual customer queries run concurrently on a loaded system. Prediction accuracy by method:

  Optimizer cost:       36.5%
  Total operator cost:  37.1%
  Static PQR:           55.9%
  PQR:                  86.5%

The resulting PQR tree mirrors the structure shown earlier: a root over [1, 2690] with classification-tree nodes down to leaves such as [1000, 1629) and [1629, 2690), with node accuracies between 84.6% and 100%.

PQR Results: TPC-H

        MPL = 10           MPL = 8            MPL = 6            MPL = 4
Test    Accuracy  Buckets  Accuracy  Buckets  Accuracy  Buckets  Accuracy  Buckets
1       85.59     6        89.34     6        88.57     5        91.18     6
2       80.00     9        73.83     9        83.16     11       88.89     9
3       84.00     5        84.21     6        85.71     8        87.61     8
4       85.19     9        87.50     3        80.87     10       88.46     10
5       75.49     11       88.04     8        82.65     9        79.38     8
6       88.78     7        84.09     10       87.50     7        91.30     9
7       81.63     6        92.66     9        80.41     10       84.78     8
8       75.25     6        67.57     8        85.54     10       90.43     7
9       85.45     8        81.13     9        91.89     5        89.08     6
10      75.23     8        87.63     7        81.52     5        82.98     6
Avg     81.66     7.5      83.60     7.5      84.78     8        87.41     7.7

Results on a real customer workload

(Two figures: accuracy vs. number of time ranges for the Customer X database, Workload #2 and Workload #3.)

Beyond Run-Time Prediction: Predicting Multiple Performance Metrics

Goal: for a specific machine configuration,
- Predict the performance of a query run at MPL = 1, using only the query and compiler output: execution time, CPU time, message bytes, disk I/O, ...
- Predict the performance of a workload run at a given MPL > 1

Applications:
- Initial sizing: what system configuration for a given workload?
- Capacity planning: what happens to performance when the workload or configuration changes?
- Workload management: what to expect from queries and workloads? What is the ideal MPL?

Create a Query Vector from Compile-Time Features for Input to Prediction

Example query:
SELECT n.name FROM tbnation n, tbregion r
WHERE n.region = r.region AND r.name = 'EUROPE'
ORDER BY n.name;

From the plan (root, exchange, sort, nested_join over split/partitioning/file_scan branches on tbregion and tbnation), build a vector with, for each operator, (a) the number of instances of that operator in the query and (b) the sum of cardinalities over those instances. For this plan: exchange (1, 5.00), file_scan (2, 7.02), nested_join (1, 5.00), partitioning (2, 4.52), root (1, 5), sort (1, 5), split (2, 4.52).
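The (instance count, summed cardinality) pairs above can be computed with a small helper; the input representation, a list of operator/cardinality pairs extracted from the compiled plan, is an assumption for illustration:

```python
# Sketch of query-vector construction: one (count, summed-cardinality)
# pair per plan operator, as described on the slide.

from collections import defaultdict

def query_vector(plan_ops):
    """plan_ops: list of (operator_name, estimated_cardinality) pairs."""
    counts = defaultdict(int)
    cards = defaultdict(float)
    for op, card in plan_ops:
        counts[op] += 1
        cards[op] += card
    # one (count, summed-cardinality) pair per operator name
    return {op: (counts[op], cards[op]) for op in sorted(counts)}
```

Feeding in the two file_scan instances of the example plan yields the pair (2, summed cardinality), matching the slide's per-operator encoding.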

Training (through Kernel Canonical Correlation Analysis, KCCA)

- Compile-time features: from each query plan, build a row of the query plan feature matrix (one column pair per operator, e.g., Op1: 1.4E4, 23E5; Op2: 80E2, 2E4).
- Run-time features: from measured statistics (elapsed time, execution time, I/Os, ...), build a performance feature matrix (e.g., time: 4.3E7, 23839; I/O: 27E2, 7.3E3).
- Using a Gaussian kernel function, form an N×N query plan similarity matrix L and an N×N performance similarity matrix P over all query pairs.
- KCCA solves a generalized eigenproblem of the form

    [ 0    LP ] [A]        [ LL   0  ] [A]
    [ PL   0  ] [B]  = λ   [ 0    PP ] [B]

  to find projections A and B that co-locate queries based on both plan and performance features, yielding a query plan projection (L × A) and a performance projection (P × B).

Prediction (through KCCA)

For a new query, form its compile-time feature vector, map it through the KCCA query plan projection, find its nearest neighbors among the training queries in the projection space, and use their coordinates in the performance projection to produce the predicted performance vector.
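Once training queries have coordinates in the shared projection space, the prediction step reduces to a nearest-neighbor lookup. A hedged sketch, with the KCCA projection itself abstracted away and all names assumed:

```python
# Sketch of the KCCA prediction step: given projected coordinates for
# the training queries and their measured performance vectors, predict
# a new query's performance as the mean over its k nearest neighbors.

import math

def knn_predict(train_points, train_perf, new_point, k=3):
    """train_points: projected coords; train_perf: metric vectors per query."""
    order = sorted(range(len(train_points)),
                   key=lambda i: math.dist(train_points[i], new_point))
    nearest = order[:k]
    dims = len(train_perf[0])
    # average each metric (elapsed time, cpu, message bytes, I/O, ...)
    return [sum(train_perf[i][d] for i in nearest) / k for d in range(dims)]
```

In the real system the neighbors are found in the query-plan projection and the prediction is read off the performance projection; here both roles are played by the same coordinate lists for brevity.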

Results: Predicted vs. Actual Elapsed Time

(Figure: log-log scatter of actual vs. KCCA-predicted elapsed time, from under a second to over an hour. Most points lie close to the perfect-prediction diagonal; outliers include queries whose accessed-record counts were under-estimated and one whose disk I/O estimate was too high.)

Optimizer Cost Estimates

(Figure: for comparison, actual elapsed time vs. optimizer cost on the same queries. Cost estimates scatter widely around the line of best fit, with many estimates 10x, and some 100x, away from it.)

Predicting Other Metrics

(Figure: log-log scatter of actual vs. KCCA-predicted records used, spanning 1 to 10^8. Points cluster near the perfect-prediction diagonal, with some under-estimates of records accessed.)

Performance Prediction Summary

- Predict performance for queries AND workloads: train and predict for a given configuration (hardware / software / MPL), predicting using only query plan features of new queries
- Test results show good prediction accuracy
- Predict multiple aspects of performance: CPU time, memory usage, message bytes, disk I/O, ...
- Multiple metrics are useful: they explain inaccuracies and suggest which queries to run together (complementary use of memory, disk)
- Training is time-consuming: many hours to run queries and workloads, a few hours to train models. But prediction is fast: only 10-15 seconds per query. Continuous retraining will be needed.

A. Ganapathi, H. Kuno, U. Dayal, J. Wiener, A. Fox, M. Jordan, D. Patterson, "Predicting Multiple Metrics for Queries: Better Decisions Enabled by Machine Learning," Proc. ICDE 2009.

Outline
- Context: Next-Generation Business Intelligence
- Workload Management Challenges
- Predicting Query Performance
- Managing Batch Workloads
- Managing Interactive Workloads
- Experimental Framework for Studying Mixed Workload Management

Batch Workload Management

Goal: maximize the throughput of BI batch workloads (while protecting against underload and overload).

Intuition: find a good manipulated variable, make the system stable over a wide range of this variable, and target somewhere in the middle of the range.

MPL* has been used, but:
- The optimal MPL is lower for resource-intensive queries and higher for less intensive queries
- A typical BI workload can fluctuate rapidly between long, resource-intensive queries and short, less intensive queries

(Figure: throughput vs. MPL curves for large and medium workloads, each rising through an underload region to an optimal range, with a target in the middle, then falling into overload/thrashing.)

* MPL = number of concurrent streams

Peak Memory Consumption: A Better Manipulated Variable

(Figure: results at scale factors SF100, SF200, and SF400.)

Abhay Mehta, Chetan Gupta, and Umeshwar Dayal, "BI Batch Manager: A System for Managing Batch Workloads on Enterprise Data Warehouses," EDBT 2008: 640-651, Nantes, France.

PGM (Priority Gradient Multiprogramming)

- Equal Priority Multiprogramming (EPM), the default execution control: m concurrent streams (MPL = m), each running at an equal priority (e.g., 148)
- Priority Gradient Multiprogramming (PGM), the proposed execution control: m concurrent streams (MPL = m), each running at a DIFFERENT priority (e.g., 148, 146, 144, 142, ..., 148 - 2(m - 1))
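The two priority assignments are simple to state in code; the base value 148 and step of 2 follow the slide's example, and the function names are assumptions for illustration:

```python
# Sketch of the two multiprogramming schemes: EPM runs every stream at
# the same priority, PGM spreads the m streams down a priority gradient
# so they acquire and release resources in a strict order.

def pgm_priorities(m, base=148, step=2):
    """Priority Gradient Multiprogramming: stream i runs at base - step*i."""
    return [base - step * i for i in range(m)]

def epm_priorities(m, base=148):
    """Equal Priority Multiprogramming: every stream at the same priority."""
    return [base] * m
```

Under PGM the highest-priority stream tends to finish first and release its memory quickly, producing the saw-tooth memory-pressure pattern described on the next slide.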

PGM Protects Against Overload...

(Figure: memory pressure curves for the EPM and PGM schemes; SF200, MPL = 15, Query 4 data from MEASURE.) PGM is stable and exhibits a saw-tooth pattern of acquisition, consumption, and release. EPM shows the concurrent buildup of memory, resulting in overflow.

...Hence, PGM Stabilizes Execution Across Memory Prediction Errors

(Figure: throughput vs. memory prediction error, from 1/kF through 1F to kF, for the SF100, SF200, and SF400 complex workload, Q4. PGM sustains higher throughput than EPM across the range.)

Reasons for stability: PGM is unfair, and it releases resources quickly.

Outline
- Context: Next-Generation Business Intelligence
- Workload Management Challenges
- Predicting Query Performance
- Managing Batch Workloads
- Managing Interactive Workloads
- Experimental Framework for Studying Mixed Workload Management

Interactive Workloads: Scheduling

Goal: design a strategy for the online scheduling of queries on an Enterprise Data Warehouse.

(Figure: frequency histogram of query sizes from 0.1 to 10,000 seconds, spanning OLTP-style and BI-style queries. Four example schedules (a)-(d) contrast average response time (ART) and average stretch (AS), e.g., ART = 74.50 / AS = 1.48 vs. ART = 99.50 / AS = 50.50, showing that response time and stretch can rank schedules very differently.)

Stretch = (p_i + w_i) / p_i, where p_i is the processing time and w_i the wait time of query i.

Criteria for a Good Online Scheduler: FEED

- Fair: avoid starvation of queries; lower the max, or the variance, of query times
- Effective: the majority of queries should do well; lower the average
- Efficient: the strategy should be lightweight; the computational complexity of the implementation should be sub-linear
- Differentiated: different service levels should give differentiated performance without unduly penalizing the overall average and max (the "CEO query")

How Our Scheduler Works

Maintain a queue of queries ordered by their rank. When a new query is submitted, compute its rank and insert it into the queue in rank order. Run the queries in order of rank, highest rank first. How is rank computed?

Abhay Mehta, Chetan Gupta, Song Wang, and Umesh Dayal, "rFEED: A Mixed Workload Scheduler for Enterprise Data Warehouses," ICDE 2009.

Design the Rank Function Using the FEED Criteria

1. Start with a proven effective algorithm (SJF): r = 1/p, where p is the estimated cost of the query.
2. Add a wait component that achieves fairness: r = 1/p + Kw, where K is a constant and w is the wait time of the query.
3. Include a component that allows differentiation between service levels: r = ψ/p + Kw, where ψ indicates the service level.
4. Implementing this function as a queue has sub-linear computational complexity: O(log n).
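The rank function and highest-rank-first dispatch can be sketched directly; the dispatch loop and the default value of K are illustrative assumptions, not the rFEED implementation:

```python
# Sketch of the FEED rank function r = psi/p + K*w, recomputed at
# dispatch time (w = time the query has waited so far). Small queries
# start with high rank (SJF-like); waiting raises rank, preventing
# starvation; psi lifts high-service-level queries.

def rank(psi, p, wait, K=0.01):
    """psi: service level, p: estimated cost, wait: time spent in queue."""
    return psi / p + K * wait

def next_query(queue, now, K=0.01):
    """queue: list of dicts with 'psi', 'p', 'arrival'. Highest rank first."""
    best = max(queue, key=lambda q: rank(q['psi'], q['p'], now - q['arrival'], K))
    queue.remove(best)
    return best
```

With K = 0 this degenerates to pure SJF; with a very large K it behaves like FIFO, which is exactly the spectrum the next slide analyzes.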

Computing the Parameters

Based on the value of K, the rank function r = ψ/p + Kw generalizes known scheduling policies:
- SJF: K = 0
- FIFO: K > 1 − 1/Δ, where Δ is the execution time of the longest query
- Tradeoff between fairness and effectiveness: 0 < K < 1 − 1/Δ. We derive a value for K in this range that provably does not cause starvation.

Mapping service levels: the ratio r of stretches of a query of size p with service level ψ versus service level 1 can be computed in closed form, and for a Pareto distribution of query sizes this yields the value of ψ to use for a desired ratio; the closed-form expressions are derived in the rFEED paper (ICDE 2009).

Experimental Setup

- Model workloads: Pareto distributions; optimizer cost estimate vs. actual processing time
- Study the L2 norm of stretch: L2 = (s_1² + s_2² + ... + s_n²)^(1/2), where s_i is the stretch of query i
- Study differentiation: two models for rFEED, rFEED1 and rFEED10; compare with an optimal resource-slicing system, RS_OPT
- Study peak load and steady-state load
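The stretch metric and its L2 norm, as used in these experiments, can be computed as follows (function names are assumed for illustration):

```python
# Sketch of the evaluation metric: stretch_i = (p_i + w_i) / p_i, and the
# L2 norm over all queries' stretches, which penalizes schedules that
# starve a few queries badly more than the plain average would.

import math

def stretch(p, w):
    """p: processing time, w: wait time of a query."""
    return (p + w) / p

def l2_stretch(jobs):
    """jobs: list of (processing_time, wait_time) pairs."""
    return math.sqrt(sum(stretch(p, w) ** 2 for p, w in jobs))
```

The squaring is what makes the norm "fairness-sensitive": one query with stretch 50 dominates fifty queries with stretch 1.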

Execution Time with Pareto Distribution (Steady State)

(Figure: L2 norm ratios for the Pareto distribution under steady-state load, comparing RS_OPT, rFEED1, and rFEED10 overall and by service level (highest, medium, lowest); ratios range up to about 3.5.)

Execution Time with Optimizer Estimate (Peak Load)

(Figure: L2 norm ratios for cost-based estimates under peak load, comparing RS_OPT, rFEED1, and rFEED10 overall and by service level; ratios range up to about 40.)

Outline
- Context: Next-Generation Business Intelligence
- Workload Management Challenges
- Predicting Query Performance
- Managing Batch Workloads
- Managing Interactive Workloads
- Experimental Framework for Studying Mixed Workload Management

Experimental Framework for Workload Management

It's non-trivial to gauge the impact of system configuration and workload on a BI system's performance. The framework gives you knobs to control:
- Workload characteristics: problem queries, variance
- Workload management policies: admission control, scheduling, execution controls
- Multiprogramming levels
- Resource models and behavior

So you can see:
- The impact of admission control, scheduling, and execution control policies
- The impact of problem queries
- The impact on different objectives (e.g., complete 100% of the workload, minimize makespan, maximize useful work)

Taxonomy of Long-Running Queries

Workload Management Actions

- Admission control policies
- Types of scheduling queues and policies
- Execution control actions

Workload Management Policies Implemented by Commercial Products

Execution Controller

A fuzzy controller observes the current state of the running queries (Query 1 ... Query n) via runtime statistics and query information from the database, and issues execution-control actions to the execution engine. Example rules:

IF relExecTime IS high AND (progress IS low OR progress IS medium) THEN cancel IS applicable
IF relExecTime IS high AND progress IS high THEN reprioritize IS applicable
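Collapsing the fuzzy rules above to crisp thresholds gives a rough illustrative controller; the threshold values (2.0 for "high" relative execution time, 0.75 for "high" progress) are assumptions, not the actual fuzzy membership functions:

```python
# Crisp sketch of the execution-control rules: a query far beyond its
# estimated time with little progress is a cancel candidate, while one
# that is nearly done is only reprioritized rather than killed.

def control_action(rel_exec_time, progress):
    """rel_exec_time: actual/estimated time; progress: fraction in [0, 1]."""
    if rel_exec_time > 2.0:        # "relExecTime IS high"
        if progress < 0.75:        # "progress IS low OR progress IS medium"
            return 'cancel'
        return 'reprioritize'      # "progress IS high"
    return 'continue'
```

A real fuzzy controller would instead compute graded memberships and pick the action with the highest applicability, which avoids abrupt behavior changes at the thresholds.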

Preliminary Results from the Experimental Framework

- Huge variance in query sizes and resource demands: for short queries, wait time can outweigh execution time; even a few long queries can significantly slow the system
- Admission control is effective when execution cost estimates are accurate
- When execution costs are underestimated, threshold-based execution control actions have to be taken, but guidelines are needed for setting the thresholds
- Corrective actions against long-running queries can help, but overly aggressive actions can hurt performance; simple actions (kill, resubmit) are about as effective as newly proposed actions (suspend/resume)

S. Krompass, H. Kuno, U. Dayal, A. Kemper, "Juggling Feathers and Bowling Balls," VLDB 2007.
S. Krompass, H. Kuno, J. Wiener, U. Dayal, K. Wilkinson, A. Kemper, "Managing Long-Running Queries," HPL Tech Report, 2008.
U. Dayal, H. Kuno, J. Wiener, K. Wilkinson, A. Ganapathi, S. Krompass, "Managing Operational Business Intelligence Workloads," Operating Systems Review (to appear).

Example of Experimental Results

(Figure: interactive jobs and batch jobs, showing the impact of problem queries and the impact of control actions.)

Thank You!