Characterizing Task Usage Shapes in Google's Compute Clusters




Characterizing Task Usage Shapes in Google's Compute Clusters
Qi Zhang (1), Joseph L. Hellerstein (2), Raouf Boutaba (1)
(1) University of Waterloo, (2) Google Inc.

Introduction
- Cloud computing is becoming a key component of today's IT infrastructure.
- Characteristics of cloud computing: extremely large-scale infrastructure and workloads; diversity in workload composition (user-facing vs. batch applications, e.g. MapReduce); different performance objectives.
- Workload management becomes a challenging problem in cloud computing environments: we need to understand the impact of management activities (e.g. a scheduler change or a capacity upgrade) on workload performance.

Motivation
- Use performance benchmarks to assess the impact of management activities.
- Existing approach: use historical traces as the performance benchmark.
- Advantage: high accuracy. Disadvantage: expensive, and it only reproduces performance as observed in the past.

Motivation (cont.)
- We therefore need to construct workload models.
- In this work, we create a workload model of task usage shapes that describes task resource consumption at runtime.
- The accuracy of the model is evaluated by its ability to reproduce the performance characteristics of real workloads. Key performance metrics: task wait time and resource utilization.
- Our previous work [1] characterized workloads at a medium-grained level, but it was not clear whether that model is sufficient for predicting workload performance.
[1] "Towards Characterizing Cloud Backend Workloads: Insights From Google Compute Clusters," A. Mishra et al., SIGMETRICS Performance Evaluation Review, 2010.

Dataset Description
- Workload traces from 6 clusters over 5 days.
- 4 types of tasks: Type 1 comprises high-priority user-facing tasks, Type 4 comprises low-priority batch tasks, and Types 2 and 3 fall between them.

Compute Cluster | No. of machines | Type 1 (%) | Type 2 (%) | Type 3 (%) | Type 4 (%)
A               | 10000s          | 3.12       | 0.26       | 3.14       | 93.47
B               | 1000s           | 1.46       | 0.86       | 2.52       | 95.16
C               | 1000s           | 4.54       | 0.34       | 4.67       | 90.45
D               | 1000s           | 5.86       | 2.42       | 31.77      | 59.95
E               | 1000s           | 39.26      | 1.48       | 34.27      | 24.99
F               | 10s             | 1.23       | 0.2        | 72.93      | 25.64

Experiment Methodology
- Performance metrics: task wait time and resource utilization.
- A stress generator increases workload intensity by randomly removing a fraction of the machines.
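The stress-generator idea can be sketched as follows. This is a minimal illustration, not the authors' simulator: the machine count, per-machine capacity, and aggregate demand below are hypothetical, and the point is simply that removing machines while holding demand fixed raises utilization on the survivors.

```python
import random

def stress_workload(machines, fraction):
    """Randomly remove a fraction of machines to raise workload intensity.

    `machines` is a list of per-machine capacities (hypothetical
    representation). The surviving machines must absorb the same
    workload, so utilization rises.
    """
    keep = int(len(machines) * (1 - fraction))
    return random.sample(machines, keep)

# Example: a cluster of 1000 machines, each with 4.0 cores of capacity.
cluster = [4.0] * 1000
stressed = stress_workload(cluster, fraction=0.2)

total_demand = 2400.0  # aggregate CPU demand of the workload, in cores
utilization_before = total_demand / sum(cluster)   # 2400 / 4000 = 0.6
utilization_after = total_demand / sum(stressed)   # 2400 / 3200 = 0.75
```

Sweeping `fraction` upward lets one observe how the performance metrics degrade as the cluster approaches saturation, which is how the slides vary workload intensity per cluster.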

Characteristics of Performance Metrics
[Figures: effect of removing machines on the performance metrics for Cluster A; day-to-day variability of the performance metrics for Cluster A]

Workload Characteristics
- Most tasks have a low coefficient of variation (CV) in resource usage; CPU has the highest CV, but its mean usage is low.
- This suggests that we can simply use the mean usage of each task as a model for capturing workload characteristics. We call this the mean usage model.
- Mean CPU usage = 0.239 cores; mean memory usage = 0.219 GB; mean disk usage = 0.33 GB.
[Figure: distribution of CV for CPU, memory, and disk for tasks in Compute Cluster D]
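The two statistics behind the mean usage model can be computed per task as below. The usage samples are made up for illustration; the takeaway is that when CV is low, replacing a task's full usage time series by its mean loses little information about its shape.

```python
import statistics

def task_mean_and_cv(samples):
    """Mean and coefficient of variation (stddev / mean) of one task's
    resource-usage samples over its lifetime."""
    mean = statistics.fmean(samples)
    cv = statistics.pstdev(samples) / mean if mean > 0 else 0.0
    return mean, cv

# Hypothetical CPU usage samples (in cores) for one task:
cpu = [0.22, 0.25, 0.24, 0.23, 0.26]
mean, cv = task_mean_and_cv(cpu)
# mean = 0.24 cores; CV is well under 0.1, so the flat mean usage
# model represents this task's shape almost exactly.
```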

Evaluating the Mean Usage Model: Resource Utilization
[Figures: resource utilization for Compute Clusters A through F]
- The mean model accurately reproduces resource utilization in each compute cluster.

Evaluating the Mean Usage Model: Task Wait Time
[Figures: task wait time for Compute Clusters A through F]
- The mean model is accurate for reproducing average task wait time.

Analysis of Simulation Results
- The mean usage model performs well for predicting resource utilization for all resource types (about 5% error), and moderately well for predicting task wait time (10-20% error on average).
- Interpreting the model errors:
  1. Understand the impact of utilization on the performance metrics.
  2. Correlate the estimation error with the theoretical model error (i.e., the CV of task usage shapes).

Analysis Result for Task Wait Time
- Both the task wait time and the difference in task wait time grow exponentially with utilization.
- The ratio of the growth rates correlates positively with the average CV.

Analysis Result for Resource Utilization
- Utilization has a low impact on the model error.
- We correlate the model error with the CV of the aggregate usage, weighted by task running time, using the fact that the variance of a sum of independent task usages is the sum of their variances.
- Short tasks have less impact on the model error than long tasks.
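One plausible reading of this aggregation, sketched below under stated assumptions (the exact weighting in the paper may differ): each task contributes its usage variance to the variance of the cluster-wide sum, scaled by the fraction of time it runs, so a short bursty task inflates the aggregate CV far less than a long one with the same per-sample variance. The task tuples are hypothetical.

```python
import math

def aggregate_cv(tasks):
    """CV of total usage, weighting each task by its running time.

    `tasks` is a list of (mean, variance, runtime) tuples. Assuming tasks
    vary independently, the variance of the sum is the sum of the
    variances; runtime weights capture that short tasks contribute
    less to the model error than long tasks.
    """
    total_time = sum(t for _, _, t in tasks)
    mean_sum = sum(m * t / total_time for m, _, t in tasks)
    var_sum = sum(v * t / total_time for _, v, t in tasks)
    return math.sqrt(var_sum) / mean_sum

# Hypothetical tasks: (mean cores, usage variance, runtime in hours)
tasks = [(0.2, 0.0004, 10.0),   # long, stable task dominates the weight
         (0.5, 0.0100, 0.5)]    # short, bursty task: small weight
cv_total = aggregate_cv(tasks)
```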

Conclusion and Future Work
- We studied the problem of deriving characterization models for task usage shapes in Google's compute cloud, for performance forecasting and analysis in hypothetical scenarios.
- We showed that simply capturing the mean usage of each task (i.e., the mean usage model) is sufficient for reproducing workload performance in terms of resource utilization and task wait time.
- Future (ongoing) work: capture more fine-grained workload characteristics; use clustering algorithms to find more accurate task clusters.

Thank you!