Stop the Guessing Performance Methodologies for Production Systems Brendan Gregg Lead Performance Engineer, Joyent

Audience This is for developers, support, DBAs, sysadmins When perf isn't your day job, but you want to: - Fix common performance issues, quickly - Have guidance for using performance monitoring tools Environments with small to large scale production systems

whoami Lead Performance Engineer: analyze everything from apps to metal Work/Research: tools, visualizations, methodologies Methodologies is the focus of my next book

Joyent High-Performance Cloud Infrastructure - Public/private cloud provider - OS virtualization for bare-metal performance - KVM for Linux and Windows guests - Core developers of SmartOS and node.js

Performance Analysis Where do I start? Then what do I do?

Performance Methodologies Provide - Beginners: a starting point - Casual users: a checklist - Guidance for using existing tools: pose questions to ask The following six are for production system monitoring

Production System Monitoring Guessing Methodologies - 1. Traffic Light Anti-Method - 2. Average Anti-Method - 3. Concentration Game Anti-Method Not Guessing Methodologies - 4. Workload Characterization Method - 5. USE Method - 6. Thread State Analysis Method

Traffic Light Anti-Method

Traffic Light Anti-Method 1. Open monitoring dashboard 2. All green? Everything good, mate. [traffic-light graphics: red = BAD, green = GOOD]

Traffic Light Anti-Method, cont. Performance is subjective - Depends on environment, requirements - No universal thresholds for good/bad Latency outlier example: - customer A) 200 ms is bad - customer B) 2 ms is bad (an eternity) Developer may have chosen thresholds by guessing

Traffic Light Anti-Method, cont. Performance is complex - Not just one threshold required, but multiple different tests For example, a disk traffic light: - Utilization-based: one disk at 100% for less than 2 seconds means green (variance), for more than 2 seconds is red (outliers or imbalance), but if all disks are at 100% for more than 2 seconds, that may be green (FS flush) provided it is async write I/O, if sync then red, also if their IOPS is less than 10 each (errors), that's red (sloth disks), unless those I/O are actually huge, say, 1 Mbyte each or larger, as that can be green,... etc... - Latency-based: I/O more than 100 ms means red, except for async writes which are green, but slowish I/O more than 20 ms can be red in combination, unless they are more than 1 Mbyte each as that can be green...
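
Written out as code, the rules above stop looking simple. The following is a rough sketch, not from the talk: the field names and the exact shape of the logic are hypothetical, and every threshold is a guess, which is the point.

    def disk_light(disks):
        """disks: list of dicts with hypothetical fields:
        util_pct, busy_secs, iops, avg_io_bytes, sync_writes."""
        all_busy = all(d["util_pct"] >= 100 for d in disks)
        for d in disks:
            if d["util_pct"] >= 100 and d["busy_secs"] > 2:
                if all_busy and not d["sync_writes"]:
                    # maybe an async FS flush... unless the disks are also slow
                    if d["iops"] < 10 and d["avg_io_bytes"] < 1024 * 1024:
                        return "red"          # "sloth disks"
                    continue                  # treated as green
                return "red"                  # outliers or imbalance
        return "green"                        # short bursts: "variance"

    # One busy disk doing small synchronous writes for 5 seconds:
    print(disk_light([{"util_pct": 100, "busy_secs": 5, "iops": 300,
                       "avg_io_bytes": 8192, "sync_writes": True}]))   # -> "red"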

Traffic Light Anti-Method, cont. Types of error: - I. False positive: red instead of green - Team wastes time - II. False negative: green instead of red - Performance issues remain undiagnosed - Team wastes more time looking elsewhere

Traffic Light Anti-Method, cont. Subjective metrics (opinion): - utilization, IOPS, latency Objective metrics (fact): - errors, alerts, SLAs For subjective metrics, use weather icons - implies an inexact science, with no hard guarantees - also attention grabbing A dashboard can use both as appropriate for the metric http://dtrace.org/blogs/brendan/2008/11/10/status-dashboard

Traffic Light Anti-Method, cont. Pros: - Intuitive, attention grabbing - Quick (initially) Cons: - Type I error (red not green): time wasted - Type II error (green not red): more time wasted & undiagnosed errors - Misleading for subjective metrics: green might not mean what you think it means - depends on tests - Over-simplification

Average Anti-Method

Average Anti-Method 1. Measure the average (mean) 2. Assume a normal-like distribution (unimodal) 3. Focus investigation on explaining the average

Average Anti-Method: You Have [figure: a latency axis annotated only with the mean, ±1 stddev, and the 99th percentile]

Average Anti-Method: You Guess [figure: a normal-looking latency distribution drawn to fit those statistics]

Average Anti-Method: Reality [figure: the actual latency distribution, with the same mean, stddev, and 99th percentile]

Average Anti-Method: Reality x50 [figure: fifty real-world latency distributions] http://dtrace.org/blogs/brendan/2013/06/19/frequency-trails

Average Anti-Method: Examine the Distribution Many distributions aren't normal, Gaussian, or unimodal Many distributions have outliers - seen by the max; may not be visible in the 99...th percentiles - outliers influence the mean and stddev
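
To make the pitfall concrete, here is a minimal sketch (synthetic, made-up numbers; not from the talk) showing how a mean can describe a latency that almost no real request experienced:

    import random, statistics

    random.seed(1)
    # Bimodal latency: mostly fast cache hits, plus a slow tail of disk reads.
    latencies = [random.gauss(2.0, 0.3) for _ in range(9500)] + \
                [random.gauss(80.0, 10.0) for _ in range(500)]

    mean = statistics.mean(latencies)
    stdev = statistics.pstdev(latencies)
    p99 = sorted(latencies)[int(0.99 * len(latencies)) - 1]

    print(f"mean={mean:.1f} ms  stddev={stdev:.1f} ms  "
          f"99th={p99:.1f} ms  max={max(latencies):.1f} ms")
    # The mean (~5.9 ms) matches almost none of the requests: most are ~2 ms,
    # and the ones users complain about are ~80 ms.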

Average Anti-Method: Outliers [figure: latency distribution with outliers far beyond the mean, stddev, and 99th percentile markers]

Average Anti-Method: Visualizations Distribution is best understood by examining it - Histogram: summary - Density Plot: detailed summary (shown earlier) - Frequency Trail: detailed summary, highlights outliers (previous slides) - Scatter Plot: shows the distribution over time - Heat Map: shows the distribution over time, and is scalable

Average Anti-Method: Heat Map [figure: latency heat map, Latency (us) vs Time (s)] http://dtrace.org/blogs/brendan/2013/05/19/revealing-hidden-latency-patterns http://queue.acm.org/detail.cfm?id=1809426
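
As a rough illustration of how such a heat map is assembled (synthetic data and bucket sizes are assumptions, and the output here is just ASCII shading), each I/O is bucketed by time and latency, and the per-cell count becomes the color depth:

    import random
    from collections import Counter

    random.seed(1)
    # Synthetic I/O events: (time in seconds, latency in ms), bimodal latency.
    events = [(t, random.choice([random.uniform(0.1, 2.0), random.uniform(8.0, 12.0)]))
              for t in range(600) for _ in range(20)]

    TIME_BUCKET, LAT_BUCKET = 60, 2      # 60 s wide columns, 2 ms tall rows
    cells = Counter((int(t // TIME_BUCKET), int(lat // LAT_BUCKET)) for t, lat in events)

    shades = " .:-=+*#"                  # "darkness" == number of I/O in the cell
    max_count = max(cells.values())
    for row in range(max(r for _, r in cells), -1, -1):   # latency rows, highest on top
        line = "".join(shades[cells[(col, row)] * (len(shades) - 1) // max_count]
                       for col in range(600 // TIME_BUCKET))
        print(f"{row * LAT_BUCKET:3d} ms |{line}|")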

Average Anti-Method Pros: - Averages are versatile: time series line graphs, Little's Law Cons: - Misleading for multimodal distributions - Misleading when outliers are present - Averages are average

Concentration Game Anti-Method

Concentration Game Anti-Method 1. Pick one metric 2. Pick another metric 3. Do their time series look the same? - If so, investigate correlation! 4. Problem not solved? goto 1
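
A hypothetical sketch of playing this game programmatically (metric names and values are made up; requires Python 3.10+ for statistics.correlation): rank candidate metrics by how strongly their time series correlate with application latency.

    from statistics import correlation   # Python 3.10+

    app_latency = [12, 11, 13, 45, 44, 12, 13, 47, 11, 12]        # made-up series
    candidates = {
        "cpu_util":   [30, 31, 29, 33, 30, 31, 32, 30, 29, 31],
        "disk_iops":  [200, 210, 190, 900, 880, 205, 195, 910, 200, 198],
        "net_kbytes": [50, 52, 49, 55, 51, 50, 53, 54, 52, 50],
    }

    for name, series in sorted(candidates.items(),
                               key=lambda kv: abs(correlation(app_latency, kv[1])),
                               reverse=True):
        print(f"{name:12s} r={correlation(app_latency, series):+.2f}")
    # disk_iops should come out on top and is worth investigating, but a high r
    # can be a shared symptom rather than the cause.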

Concentration Game Anti-Method, cont. [figure: App Latency time series, to be matched against other metrics]

Concentration Game Anti-Method, cont. [figure: App Latency vs. another metric - shapes don't match: NO]

Concentration Game Anti-Method, cont. [figure: App Latency vs. another metric - shapes match: YES!]

Concentration Game Anti-Method, cont. Pros: - Ages 3 and up - Can discover important correlations between distant systems Cons: - Time consuming: can discover many symptoms before the cause - Incomplete: missing metrics

Workload Characterization Method

Workload Characterization Method 1. Who is causing the load? 2. Why is the load called? 3. What is the load? 4. How is the load changing over time?

Workload Characterization Method, cont. 1. Who: PID, user, IP addr, country, browser 2. Why: code path, logic 3. What: targets, URLs, I/O types, request rate (IOPS) 4. How: minute, hour, day The target is the system input (the workload), not the resulting performance [diagram: Workload -> System]
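
As a minimal sketch (the log format and field names are hypothetical), three of the four questions can often be answered from an access log alone:

    from collections import Counter

    # Assumed format: "<epoch_secs> <client_ip> <method> <url>" per line.
    log_lines = [
        "1371600000 10.1.1.5 GET /api/items",
        "1371600001 10.1.1.5 GET /api/items",
        "1371600002 10.1.2.9 POST /api/checkout",
        "1371600065 10.1.1.5 GET /api/items",
    ]

    who, what, how = Counter(), Counter(), Counter()
    for line in log_lines:
        ts, ip, method, url = line.split()
        who[ip] += 1                     # who is causing the load
        what[(method, url)] += 1         # what the load is
        how[int(ts) // 60] += 1          # how it changes (requests per minute)

    print("top clients:", who.most_common(3))
    print("top requests:", what.most_common(3))
    print("req/min:", sorted(how.items()))
    # "Why" (code path, logic) usually needs profiling or tracing rather than logs.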

Workload Characterization Method, cont. Pros: - Potentially largest wins: eliminating unnecessary work Cons: - Only solves a class of issues - load - Can be time consuming and discouraging - most attributes examined will not be a problem

USE Method

USE Method For every resource, check: 1. Utilization 2. Saturation 3. Errors

USE Method, cont. For every resource, check: 1. Utilization: time resource was busy, or degree used 2. Saturation: degree of queued extra work 3. Errors: any errors Identifies resource bottlenecks quickly [diagram: a resource annotated with Utilization, Saturation, and Errors]

USE Method, cont. Hardware Resources: - CPUs - Main Memory - Network Interfaces - Storage Devices - Controllers - Interconnects Find the functional diagram and examine every item in the data path...

USE Method, cont.: System Functional Diagram [diagram: CPUs and DRAM connected by memory buses and a CPU interconnect; an I/O bridge and I/O bus, expander interconnect, I/O controller with disks, and network controller with ports/transports. For each component, check: 1. Utilization 2. Saturation 3. Errors]

USE Method, cont.: Linux System Checklist
Resource / Type / Metric:
- CPU / Utilization: per-cpu: mpstat -P ALL 1, %idle; sar -P ALL, %idle; system-wide: vmstat 1, id; sar -u, %idle; dstat -c, idl; per-process: top, %CPU; htop, CPU%; ps -o pcpu; pidstat 1, %CPU; per-kernel-thread: top/htop (K to toggle), where VIRT == 0 (heuristic)
- CPU / Saturation: system-wide: vmstat 1, r > CPU count [2]; sar -q, runq-sz > CPU count; dstat -p, run > CPU count; per-process: /proc/PID/schedstat 2nd field (sched_info.run_delay); perf sched latency (shows Average and Maximum delay per-schedule); dynamic tracing, eg, SystemTap schedtimes.stp queued(us)
- CPU / Errors: perf (LPE) if processor-specific error events (CPC) are available; eg, AMD64's 04Ah Single-bit ECC Errors Recorded by Scrubber
... (the checklist continues for the remaining resources)
http://dtrace.org/blogs/brendan/2012/03/07/the-use-method-linux-performance-checklist
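
A minimal sketch of scripting one row of such a checklist (Linux-only, /proc based; an illustration, not the checklist's own tooling): CPU utilization and saturation, system-wide.

    import os, time

    def cpu_times():
        with open("/proc/stat") as f:
            # First line: "cpu  user nice system idle iowait irq softirq steal ..."
            fields = [int(x) for x in f.readline().split()[1:]]
        return fields[3], sum(fields)      # idle jiffies, total jiffies (iowait counted as busy here)

    idle1, total1 = cpu_times()
    time.sleep(1)
    idle2, total2 = cpu_times()
    util = 100.0 * (1.0 - (idle2 - idle1) / (total2 - total1))

    with open("/proc/loadavg") as f:
        runnable = int(f.read().split()[3].split("/")[0])   # currently runnable tasks

    ncpu = os.cpu_count()
    print(f"CPU utilization: {util:.1f}% (across {ncpu} CPUs)")
    print(f"CPU saturation: {runnable} runnable tasks (a concern if consistently > {ncpu})")
    # Errors would come from CPU hardware counters (e.g. via perf), not /proc.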

USE Method, cont.: Monitoring Tools Average metrics don't work: individual components can become bottlenecks Eg, CPU utilization [figure: utilization heat map of 5,312 CPUs for 60 secs; x-axis: time, y-axis: utilization 0-100%, darkness == # of CPUs] - individual hot CPUs can still be identified http://dtrace.org/blogs/brendan/2011/12/18/visualizing-device-utilization

USE Method, cont.: Other Targets For cloud computing, must study any resource limits as well as physical; eg: - physical network interface U.S.E. - AND instance network cap U.S.E. Other software resources can also be studied with USE metrics: - Mutex Locks - Thread Pools The application environment can also be studied - Find or draw a functional diagram - Decompose into queueing systems
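
For instance, a thread pool can be given its own USE metrics. The sketch below assumes hypothetical counters (pool size, busy threads, queue length, rejected tasks) that most pool implementations expose in some form:

    from dataclasses import dataclass

    @dataclass
    class ThreadPoolStats:
        pool_size: int
        busy_threads: int
        queue_length: int     # tasks waiting for a free thread
        rejected_tasks: int   # submissions refused because the queue was full

        def use(self):
            return {
                "utilization_pct": 100.0 * self.busy_threads / self.pool_size,
                "saturation": self.queue_length,
                "errors": self.rejected_tasks,
            }

    print(ThreadPoolStats(pool_size=16, busy_threads=16,
                          queue_length=42, rejected_tasks=3).use())
    # 100% utilized with a growing queue: the pool, not the hardware, is the bottleneck.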

USE Method, cont.: Homework Your ToDo: - 1. find a system functional diagram - 2. based on it, create a USE checklist on your internal wiki - 3. fill out metrics based on your available toolset - 4. repeat for your application environment You get: - A checklist for all staff for quickly finding bottlenecks - Awareness of what you cannot measure: - unknown unknowns become known unknowns -... and known unknowns can become feature requests!

USE Method, cont. Pros: - Complete: all resource bottlenecks and errors - Not limited in scope by available metrics - No unknown unknowns - at least known unknowns - Efficient: picks three metrics for each resource from what may be hundreds available Cons: - Limited to a class of issues: resource bottlenecks

Thread State Analysis Method

Thread State Analysis Method 1. Divide thread time into operating system states 2. Measure states for each application thread 3. Investigate largest non-idle state

Thread State Analysis Method, cont.: 2 State A minimum of two states: On-CPU Off-CPU

Thread State Analysis Method, cont.: 2 State A minimum of two states: On-CPU: executing, spinning on a lock. Off-CPU: waiting for a turn on-CPU, waiting for storage or network I/O, waiting for swap-ins or page-ins, blocked on a lock, idle waiting for work. Simple, but the off-CPU state is ambiguous without further division

Thread State Analysis Method, cont.: 6 State Six states, based on Unix process states: Executing Runnable Anonymous Paging Sleeping Lock Idle

Thread State Analysis Method, cont.: 6 State Six states, based on Unix process states: Executing: on-CPU; Runnable: waiting for a turn on-CPU; Anonymous Paging: runnable, but blocked waiting for page-ins; Sleeping: waiting for I/O (storage, network, and data/text page-ins); Lock: waiting to acquire a synchronization lock; Idle: waiting for work. Generic: works for all applications

Thread State Analysis Method, cont. As with other methodologies, these pose questions to answer - Even if they are hard to answer Measuring states isn't currently easy, but can be done - Linux: /proc, schedstats, delay accounting, I/O accounting, DTrace - SmartOS: /proc, microstate accounting, DTrace Idle state may be the most difficult: applications use different techniques to wait for work
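
A minimal sketch (Linux-only, /proc based; an approximation, not a complete state breakdown) of recovering two of the states for one process:

    import sys

    pid = sys.argv[1] if len(sys.argv) > 1 else "self"

    with open(f"/proc/{pid}/schedstat") as f:
        oncpu_ns, rundelay_ns, _ = (int(x) for x in f.read().split())

    with open(f"/proc/{pid}/stat") as f:
        # The field after "pid (comm)" is the state letter: R running,
        # S sleeping, D uninterruptible (often I/O), etc.
        state = f.read().rsplit(")", 1)[1].split()[0]

    print(f"Executing (on-CPU):   {oncpu_ns / 1e9:.2f} s")
    print(f"Runnable (run-queue): {rundelay_ns / 1e9:.2f} s")
    print(f"Current state letter: {state}")
    # Sleeping/Lock/Idle need more work: delay accounting, off-CPU tracing, or
    # application-specific knowledge of how it waits for work.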

Thread State Analysis Method, cont. States lead to further investigation and actionable items: Executing: profile stacks; split into usr/sys; sys = analyze syscalls. Runnable: examine CPU load for the entire system, and caps. Anonymous Paging: check main memory free, and process memory usage. Sleeping: identify the resource the thread is blocked on; syscall analysis. Lock: lock analysis

Thread State Analysis Method, cont. Compare to the database query time metric: this alone can be misleading, since it includes: - swap time (anonymous paging) due to a memory misconfig - CPU scheduler latency due to another application The same applies to any time spent in... metric - is it really spent in...?

Thread State Analysis Method, cont. Pros: - Identifies common problem sources, including from other applications - Quantifies application effects: compare times numerically - Directs further analysis and actions Cons: - Currently difficult to measure all states

More Methodologies Include: - Drill Down Analysis - Latency Analysis - Event Tracing - Scientific Method - Micro Benchmarking - Baseline Statistics - Modelling For when performance is your day job

Stop the Guessing The anti-methodologies involved: - guesswork - beginning with the tools or metrics (answers) The actual methodologies posed questions, then sought metrics to answer them You don't need to guess - post-DTrace, practically everything can be known Stop guessing and start asking questions!

Thank You! email: brendan@joyent.com twitter: @brendangregg github: https://github.com/brendangregg blog: http://dtrace.org/blogs/brendan blog resources: - http://dtrace.org/blogs/brendan/2008/11/10/status-dashboard - http://dtrace.org/blogs/brendan/2013/06/19/frequency-trails - http://dtrace.org/blogs/brendan/2013/05/19/revealing-hidden-latency-patterns - http://dtrace.org/blogs/brendan/2012/03/07/the-use-method-linux-performance-checklist - http://dtrace.org/blogs/brendan/2011/12/18/visualizing-device-utilization