IO latency tracking Daniel Lezcano (dlezcano) Linaro Power Management Team
CPUIDLE CPUidle is a framework divided into three parts: the generic framework; the back-end driver (low-level calls, or hidden behind firmware such as ACPI, PSCI, MWAIT); the governor. [Diagram: the idle task asks the governor, through the generic framework, to select an idle state, then enters it via the PM back-end driver]
CPUIDLE: the generic framework Provides an API: to register a driver and a governor; to ask the governor for a suggestion; to enter idle through the state abstraction. It implements no algorithm of its own.
CPUIDLE: the back-end drivers Provide the abstraction over the complex PM code. Very generic for hardware based on a firmware abstraction (e.g. ACPI, PSCI, ...). Very hardware-dependent and complex if handled directly in the kernel (e.g. ARM SoC specific drivers).
CPUIDLE: the governors Two governors are used today: ladder, with the periodic tick configuration, for servers focused on performance; menu, with the nohz configuration, for most platforms, trying to provide the best trade-off between performance and energy saving.
CPUIDLE: the menu governor Tries to predict the next event on the system. Keeps per-CPU statistics on the sleep duration. Weights the idle states according to the pending IOs.
CPUIDLE: the menu governor Sources of wakeups: IRQs from timers, IOs, the keyboard, and IPIs (the latter mostly generated by the scheduler). Most of the interrupts come from timers, IOs and rescheduling IPIs.
CPUIDLE: the menu governor It takes all the sources of interrupts without distinction: Is it up to the scheduler to predict its own behavior (reschedule IPIs)? What if a task is moved to another CPU? The CPU could stay in a very shallow state for seconds because the statistics differ. Why not focus on the events which are predictable, timers and IOs, and consider the rest as noise preventing us from being accurate?
Why could an IO be predictable? The duration of an IO falls within a reasonable interval, with a repeating pattern. Tasks are blocked on IO, hence this latency information can be tied to the task. The information can follow the task when it is migrated.
Experimentation Latency measurements for: an SSD at 6Gb/s and an HDD at 6Gb/s, with different block sizes from 4KB to 512KB.
SSD behavior
SSD behavior
HDD behavior
HDD behavior
HDD behavior
HDD behavior
Observations Let's group the different latencies into buckets. Experiment with different bucket sizes: 50us, 100us, 200us and 500us. And observe the distribution: how many times is a bucket hit?
SSD 4KB Random RW
SSD 8KB Random RW
SSD 16KB Random RW
SSD 32KB Random RW
SSD 64KB Random RW
SSD 128KB Random RW
SSD 256KB Random RW
SSD 512KB Random RW
HDD 4KB Random RW
HDD 8KB Random RW
HDD 16KB Random RW
HDD 32KB Random RW
HDD 64KB Random RW
HDD 128KB Random RW
HDD 256KB Random RW
HDD 512KB Random RW
Observations - Conclusions The smaller the bucket, the more buckets there are, which reduces the probability of estimating the right bucket. On slow media, there is a significant deviation from the average latency. The 200us bucket size shows an interesting trade-off between the number of buckets and the accuracy. We can rely on these observations to build a model to predict the next IO.
IO latency tracking infrastructure Each time a task is unblocked after an IO completion, measure the duration and add this latency to the IO latency tracking infrastructure. The next time the task is blocked on an IO, ask the IO latency tracking infrastructure for the guessed blocking time. When going to idle, take the remaining time the task is expected to stay blocked on the IO as part of the equation to compute the sleep duration.
IO latency tracking infrastructure [Sequence diagram: the task calls io_schedule(); the scheduler calls io_latency_begin() (io_latency_guess + io_latency_insert_node); on wakeup, io_latency_end() is called (io_latency_delete_node + io_latency_add)]
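A minimal sketch of where these hooks could sit around io_schedule(), using the io_latency_begin/io_latency_end names from the diagram; the signatures and the wrapper function are assumptions for illustration, not the actual patch:

	/* Illustrative only: where the tracking hooks would wrap the
	 * scheduler's io_schedule() path. Helper signatures are assumed. */
	void io_schedule_tracked(void)
	{
		ktime_t start = ktime_get();

		/* Guess the blocking time for this task and insert a node
		 * in the per-CPU tree (io_latency_guess + insert_node). */
		io_latency_begin(current);

		io_schedule();	/* the task sleeps until the IO completes */

		/* Remove the node and feed the measured latency back into
		 * the statistics (delete_node + io_latency_add). */
		io_latency_end(current, ktime_us_delta(ktime_get(), start));
	}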
IO latency tracking infrastructure Group the latencies into buckets, each representing an interval (200us). Each bucket has an index, and each index gives the bucket's interval: index 0 => [0, 199], index 5 => [1000, 1199], index 10 => [2000, 2199].
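With a fixed 200us bucket width, mapping a latency to its index is a plain integer division; a minimal sketch (the constant name is an assumption):

	#define BUCKET_INTERVAL_US	200	/* assumed name for the 200us bucket width */

	/* index 0 => [0, 199], index 5 => [1000, 1199], index 10 => [2000, 2199] */
	static inline unsigned int bucket_index(unsigned int latency_us)
	{
		return latency_us / BUCKET_INTERVAL_US;
	}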
IO latency tracking infrastructure io_latency_add(413us): find_bucket: 413 / 200 = 2 => bucket2 (bucket list: bucket2, bucket3, bucket6, bucket12); update_history(bucket2, 413us): bucket2.hits++, bucket2.suchits++, bucket2.avg = ...
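A sketch of the update step for the 413us sample; the bucket fields follow the slide, the structure layout is an assumption and the running-average update is left elided, as on the slide:

	/* Illustrative bucket layout; field names follow the slide. */
	struct bucket {
		unsigned int hits;
		unsigned int suchits;
		unsigned int avg;
	};

	static void update_history(struct bucket *b, unsigned int latency_us)
	{
		b->hits++;
		b->suchits++;
		/* b->avg = ...; running average update, elided on the slide */
	}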
IO latency tracking infrastructure How do we guess the next blocking time? The buckets are sorted: the position in the list reflects the history (the first happening more, the last happening less). Each bucket has a number of hits and a position in the list; weight these numbers and compute a score from the position and the number of hits with a decaying function.
IO latency tracking infrastructure Score = nrhits / (pos + 1)². First: bucket2.hits = 123, score = 123 / (0+1)² = 123; bucket3.hits = 321, score = 321 / (1+1)² = 80; bucket6.hits = 12, score = 12 / (2+1)² = 1; last: bucket12.hits = 32, score = 32 / (3+1)² = 2. Bucket 2 is the next expected IO latency.
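A sketch of how the winning bucket could be selected with this decaying score, assuming the history is kept as an array of hit counts ordered by recency (names are illustrative):

	/* Pick the history position with the highest hits / (pos + 1)^2.
	 * 'hits' is ordered by recency: hits[0] is the most recent bucket. */
	static int guess_best_pos(const unsigned int *hits, int nr)
	{
		int pos, best_pos = 0;
		unsigned int best_score = 0;

		for (pos = 0; pos < nr; pos++) {
			unsigned int score = hits[pos] / ((pos + 1) * (pos + 1));

			if (score > best_score) {
				best_score = score;
				best_pos = pos;
			}
		}

		return best_pos;
	}

With the numbers on the slide, {123, 321, 12, 32} gives scores {123, 80, 1, 2}, so position 0 (bucket2) wins.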
IO latency tracking infrastructure What happens when there are several tasks blocked on an IO? A red-black tree per CPU: the number of elements in the tree is the number of tasks blocked on an IO; each node holds the guessed IO duration for the corresponding task; the leftmost node is the smallest guessed IO duration.
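A sketch of the per-CPU tree lookup with the kernel rb-tree API; the node layout and helper name are assumptions:

	#include <linux/rbtree.h>

	/* One node per task blocked on an IO, keyed by its guessed
	 * completion time (layout is an assumption). */
	struct io_latency_node {
		struct rb_node node;
		u64 expected_end_us;
	};

	/* The next IO completion on this CPU is simply the leftmost node. */
	static u64 next_io_event_us(struct rb_root *root)
	{
		struct rb_node *first = rb_first(root);

		if (!first)
			return U64_MAX;	/* no task blocked on IO on this CPU */

		return rb_entry(first, struct io_latency_node, node)->expected_end_us;
	}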
IO latency tracking infrastructure + CPUidle When entering idle we now have: the next timer event, which is reliable (next_timer_event), and the next IO completion (next_io_event). The next event is: next_event = min(next_timer_event, next_io_event)
IO latency tracking + CPUidle Choosing the idle state is straightforward:

	for (i = 0; i < drv->state_count; i++) {
		struct cpuidle_state *s = &drv->states[i];
		struct cpuidle_state_usage *su = &dev->states_usage[i];

		/* skip states disabled by the driver or via sysfs */
		if (s->disabled || su->disable)
			continue;
		/* the state must pay off before the next predicted event */
		if (s->target_residency > next_event)
			continue;
		/* and respect the PM QoS latency constraint */
		if (s->exit_latency > latency_req)
			continue;

		/* deepest state satisfying all the constraints so far */
		idle_state = i;
	}
The big picture [Diagram: tasks waiting for an IO feed the IO latency tracking (next IO completion t1); tasks waiting for a timer feed the timer framework (next timer events tn, tn+1, ..., tn+5); the idle task calls CPUIDLE, which chooses an idle state with the basic selection loop, sleeps for min(t1, tn) and enters the state through the back-end driver]
Some results Tools used to measure: some extra information added to sysfs for each CPU (over-predicted: the effective sleep duration was shorter than predicted; under-predicted: the effective sleep duration was longer than predicted); idlestat, a tool computing the idle state statistics: http://git.linaro.org/power/idlestat.git; iolatsimu, a tool implementing the prediction algorithm and randomly reading/writing a file: http://git.linaro.org/people/daniel.lezcano/iolatsimu.git
Some results Tests done under controlled conditions: the process is pinned on one CPU to prevent migration; runs on the media described at the beginning of the presentation; for each block size, the prediction is measured against the idle state which has been chosen.
Some results The IO latency framework + timer give an expected sleep time of 30us, so the state with a 20us target residency is chosen (available residencies: 10, 20, 40, 70, 150us, from shallow to deep). 1. Effective sleep time = 15: failed prediction (over-predicted, shorter than the 20us residency). 2. Effective sleep time = 35: correct prediction. 3. Effective sleep time = 45: failed prediction (under-predicted, the 40us-residency state would have fit).
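The success criterion in this example boils down to checking whether the effective sleep time falls inside the chosen state's residency window; a small sketch (names are illustrative):

	#include <linux/types.h>

	/* A prediction is counted as correct when the effective sleep time is
	 * at least the chosen state's target residency and strictly less than
	 * the next (deeper) state's target residency. With an expected sleep
	 * of 30us and residencies 10/20/40/70/150, the window is [20, 40). */
	static bool prediction_ok(unsigned int effective_sleep_us,
				  unsigned int chosen_residency_us,
				  unsigned int next_residency_us)
	{
		return effective_sleep_us >= chosen_residency_us &&
		       effective_sleep_us < next_residency_us;
	}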
SSD results - 4KB
SSD results - 8KB
SSD results - 16KB
SSD results - 32KB
SSD results - 64KB
SSD results - 128KB
SSD results - 256KB
SSD results - 512KB
HDD results - 4KB
HDD results - 8KB
HDD results - 16KB
HDD results - 32KB
HDD results - 64KB
HDD results - 128KB
HDD results - 256KB
HDD results - 512KB
Conclusion A noticeable improvement in prediction, more than 30% under some circumstances. The resulting design makes the idle state decision within the scheduler: the idling decision is integrated, and smarter decisions could be made in the future.
Next steps Improve the IO latency tracking framework code. Investigate the pointless wakeup resulting from an IPI when an IO completes. Fix the IO completion measurement probes. Make the latency tracking per device (big kernel impact). Improve the scheduler to take smarter decisions regarding idle.
THANKS!