Overview of the Current Approaches to Enhance the Linux Scheduler. Preeti U. Murthy preeti@linux.vnet.ibm.com IBM Linux Technology Center




Overview of the Current Approaches to Enhance the Linux Scheduler
Preeti U. Murthy, preeti@linux.vnet.ibm.com, IBM Linux Technology Center
Linux Foundation Collaboration Summit, San Francisco, CA, 16 April 2013

Agenda
- Introduction to Linux Scheduling
- Overview of the Power Aware Scheduler
- Usage of Per Entity Load Tracking Metric in Power Aware Scheduler
- Where does Power Aware Scheduler stand today?
- Experimental Evaluation of the Power Aware Scheduler
- Other Efforts towards Power Efficiency in Scheduling
- References

What does the Linux scheduler do?
The Linux scheduler does two primary jobs:
- Scheduling multiple runnable jobs fairly on a CPU
- Scheduling jobs optimally across multiple CPUs in an SMP system

Scheduling multiple runnable jobs fairly on a CPU
Requirement: CPU bandwidth must be distributed fairly among tasks.
- The data structure used is a red-black tree. Tasks are kept sorted in this tree in increasing order of CPU bandwidth received, and the leftmost task is chosen to run on the CPU.
- A new task or a woken-up task is positioned at the leftmost end of the tree.
- The CPU bandwidth consumed by a task is called its vruntime.
- vruntime and the red-black tree together ensure that all tasks receive a fair share of CPU bandwidth, which in turn ensures better interactivity of workloads.
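The pick-the-leftmost-task idea can be sketched in a few lines. This is only an illustration, not kernel code: a binary heap stands in for the red-black tree, and the names RunQueue, wake, run_next and simulate are hypothetical.

```python
import heapq


class RunQueue:
    """Toy fair run queue ordered by vruntime.

    Assumption: a binary heap stands in for the kernel's red-black tree;
    both give cheap access to the task with the smallest vruntime.
    """

    def __init__(self):
        self._heap = []  # entries are (vruntime, sequence, task_name)
        self._seq = 0    # tie-breaker so equal vruntimes stay in FIFO order

    def enqueue(self, name, vruntime):
        heapq.heappush(self._heap, (vruntime, self._seq, name))
        self._seq += 1

    def min_vruntime(self):
        return self._heap[0][0] if self._heap else 0.0

    def wake(self, name):
        # A new or woken-up task gets the current minimum vruntime,
        # which places it at the left end of the ordering.
        self.enqueue(name, self.min_vruntime())

    def run_next(self, slice_ms):
        # Pick the task that has received the least CPU time so far,
        # charge it for the slice it ran, and put it back.
        vruntime, _, name = heapq.heappop(self._heap)
        self.enqueue(name, vruntime + slice_ms)
        return name


def simulate(names, slices, slice_ms=1.0):
    """Run `slices` time slices and count how often each task ran."""
    rq = RunQueue()
    for name in names:
        rq.wake(name)
    counts = {name: 0 for name in names}
    for _ in range(slices):
        counts[rq.run_next(slice_ms)] += 1
    return counts
```

With three always-runnable tasks and nine slices, each task ends up running three times, which is the fairness property the slide describes.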

Scheduling jobs optimally across multiple CPUs in an SMP system
Goals of SMP scheduling:
- No CPU must have too many or too few tasks to run
- A new task or a woken-up task must quickly find the right CPU to run on

Scheduling jobs optimally across multiple CPUs in an SMP system
Work-conserving nature of the scheduler: if there is CPU bandwidth available, schedule the task on it.
[Figure: SMP Scheduling. A queue of tasks T1 to T4 to be scheduled on CPU0, CPU1 and CPU2, with panels labeled "New Task arrives", "CPU2 goes idle" and "Load Balancing".]

Implementation schemes for SMP Scheduling
Implementing the goals of SMP scheduling:
- Efficient load balancing
- Scalable scheduling
Taking SMP scheduling one step forward: consolidation of tasks onto fewer CPUs belonging to fewer power domains, for power efficiency, through Power Aware Scheduling.

Power Aware Scheduling
Goal: Consolidation of load onto fewer CPUs belonging to fewer power domains, without degrading the power efficiency of the current scheduler.
Power Efficiency = Performance / Power
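As a small worked example of the Power Efficiency ratio, the sketch below compares two scheduler configurations. The function names and the numbers in the usage note are hypothetical and are not measurements from this talk.

```python
def power_efficiency(performance, power_watts):
    """Power Efficiency = Performance / Power.

    `performance` is any benchmark score where higher is better
    (for example records per second); `power_watts` is average power.
    """
    if power_watts <= 0:
        raise ValueError("power must be positive")
    return performance / power_watts


def is_no_worse(candidate, baseline):
    """True if the candidate (performance, power) pair is at least as
    power efficient as the baseline pair."""
    return power_efficiency(*candidate) >= power_efficiency(*baseline)
```

For instance, with made-up figures, a run scoring 950 at 40 W (23.75 per watt) is more power efficient than one scoring 1000 at 50 W (20 per watt), even though its raw performance is slightly lower.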

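Scheduling Domains and Scheduling groups
Scheduling domains and groups define the scheduling entities in a hierarchical fashion.
[Figure: scheduling domain hierarchy for CPUs 0 to 15, top to bottom:
- NODE level (NUMA nodes; CPU packages in the system or node): scheduling groups {0, 1, 2, 3, 8, 9, 10, 11} and {4, 5, 6, 7, 12, 13, 14, 15}
- Multi core (MC) level (cores in a CPU package): scheduling domain {0, 1, 2, 3, 8, 9, 10, 11} with scheduling groups {0, 8}, {1, 9}, {2, 10}, {3, 11}
- Sibling CPU level (H/w threads): scheduling domain {0, 8} with scheduling groups {0} and {8}]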

How does Power Aware Scheduling work?
During forking of tasks: schedule the new task on idlest_cpu(group_leader).
During waking of tasks: schedule the woken-up task on leader_cpu(group_leader) if task_is_light_weight().
[Figure: placement of a new task P5 and of a woken-up task P5 among tasks P1 to P4 in the "Group Leader" and "Group Min" groups, with leader_cpu marked.]
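The fork and wake-up rules above might be sketched as follows. This is a simplified illustration, not the posted patches: the Cpu type, the LIGHT_TASK_UTIL threshold, and the exact definitions of the group leader (the busiest group that still has spare capacity) and leader_cpu (the busiest CPU in it that is not saturated) are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Cpu:
    cpu_id: int
    util: float  # % utilization, 0 to 100


CPU_CAPACITY = 100.0    # each CPU can be at most 100% utilized
LIGHT_TASK_UTIL = 20.0  # hypothetical cut-off for a "light weight" task


def group_util(group):
    return sum(cpu.util for cpu in group)


def pick_group_leader(groups):
    """Assumption: the leader is the busiest group that still has room."""
    with_room = [g for g in groups if group_util(g) < CPU_CAPACITY * len(g)]
    return max(with_room, key=group_util) if with_room else None


def idlest_cpu(group):
    return min(group, key=lambda cpu: cpu.util)


def leader_cpu(group):
    """Assumption: the busiest CPU in the group that is not saturated."""
    open_cpus = [cpu for cpu in group if cpu.util < CPU_CAPACITY]
    return max(open_cpus, key=lambda cpu: cpu.util)


def select_cpu_on_fork(groups):
    leader = pick_group_leader(groups)
    # None means: no group can absorb the task, use default scheduling.
    return idlest_cpu(leader).cpu_id if leader else None


def select_cpu_on_wake(groups, task_util):
    leader = pick_group_leader(groups)
    if leader is None or task_util >= LIGHT_TASK_UTIL:
        return None  # fall back to default scheduling
    return leader_cpu(leader).cpu_id
```

Returning None stands for "leave the decision to the default scheduler", so a heavy task or a fully loaded system is never forced into consolidation.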

Is the Power Aware Scheduler against Race to Idle?
No, because:
- The power aware scheduler consolidates load within a power domain when it can afford to, i.e. when the power domain is underutilized and idle CPUs exist. Tasks will not be throttled.
- The power aware scheduler defaults to race to idle when it cannot afford to consolidate within a power domain, i.e. when the power domain is overloaded.
[Figure: two panels showing tasks placed in the "Group Leader" and "Group Min" groups. "Task Consolidation" shows tasks P1 to P4; "Race To Idle" shows tasks P1 to P5.]

Measuring the idleness of a CPU
What determines whether a CPU is busy, and how busy it is? If we were to measure the % utilization of a CPU, then, at the time the measurement is made:
% utilization of a CPU = (time_runnable_tasks_exist / total_lifetime_of_cpu_rq) * 100
- If there is some task running on the CPU run queue all the time, then the CPU is 100% utilized.
- If there is no task to run on the CPU at some point, then the utilization of the CPU declines. What is it then? 0%, or 0% < x < 100%? If it is 0%, why not use nr_running (number of tasks running) instead?
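A minimal sketch of the % utilization formula, assuming the busy time is available as a list of non-overlapping (start, end) intervals; the function name and the interval representation are illustrative only.

```python
def cpu_utilization_percent(busy_intervals, rq_lifetime):
    """% utilization = time_runnable_tasks_exist / total_lifetime_of_cpu_rq * 100.

    `busy_intervals` is a list of non-overlapping (start, end) pairs during
    which at least one runnable task existed; `rq_lifetime` is the total
    observed lifetime of the run queue, in the same time unit.
    """
    if rq_lifetime <= 0:
        raise ValueError("run queue lifetime must be positive")
    busy = sum(end - start for start, end in busy_intervals)
    return 100.0 * busy / rq_lifetime
```

A CPU that was busy for 5 of the last 10 time units reports 50% even if it has no runnable task at the instant of measurement, whereas nr_running would report 0 at that instant.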

Measuring the idleness of a CPU
Importance of choosing the right metric for measuring utilization of a CPU
[Figure: CPU0 and CPU1 with tasks T1 and T2. Panel labels: "Idle for a long time", "Just idle", "New Task on CPU1", "Move task from lower utilized CPU to higher utilized CPU".]
The above move ensures that idle CPUs have higher residency times in idle states. This can be ensured only with the %util metric and not with nr_running.

Measuring the idleness of a CPU
Behavior of tasks: how they affect the amount of idleness of a CPU
[Figure: three task timelines, each running from "Task forks" to "Task dies" over 1s:
- A task that never sleeps: a 100% task when measured anytime within 1s.
- A task that is repeatedly awake for 1ms and asleep for 9ms: a 10% task if measured every 10ms.
- A task that is awake for 100ms and then sleeps for 900ms: a 10% task if measured at the end of 1s.]
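The three task patterns can be reproduced with a short sketch that measures the awake fraction of a periodic task over a chosen window. The function and its 1 ms resolution are assumptions for illustration.

```python
def awake_fraction(awake_ms, sleep_ms, window_start, window_end):
    """Fraction of the window [window_start, window_end) in which a
    periodic task is awake. The task starts at t=0, is awake for
    `awake_ms`, then sleeps for `sleep_ms`, and repeats.
    Time is sampled at 1 ms resolution.
    """
    period = awake_ms + sleep_ms
    awake = sum(
        1 for t in range(window_start, window_end) if t % period < awake_ms
    )
    return awake / (window_end - window_start)
```

The 100ms-awake, 900ms-asleep task measures as 10% over the full second, but as 100% if only its first 100 ms are observed and as 0% over the remaining 900 ms. The measurement window therefore matters as much as the task's behavior.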

Measuring the idleness of a CPU
Importance of choosing the right metric for measuring utilization of a CPU
The first comparison below is how the scheduler currently measures the idleness of a CPU; the second is what we get if we use the % utilization of tasks as the metric instead.
[Figure: tasks T1 to T4 on CPU0 and CPU1, with fewer but heavier (100%) tasks on CPU0 and more but lighter (10%) tasks on CPU1.]
- By number of tasks: #tasks on cpu0 < #tasks on cpu1, thus cpu0 appears more idle than cpu1.
- By % utilization: %util of cpu1 < %util of cpu0, thus cpu1 is more idle than cpu0.
Thus %util identifies the busier CPU appropriately.
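The disagreement between the two metrics is easy to show in code. In this sketch each CPU is represented only by the list of its tasks' utilizations in percent; the task mix in the usage note is an assumed example, not data from the talk.

```python
def busier_by_nr_running(cpus):
    """Pick the busier CPU by counting tasks on its run queue."""
    return max(cpus, key=lambda name: len(cpus[name]))


def busier_by_util(cpus):
    """Pick the busier CPU by summing the % utilization of its tasks."""
    return max(cpus, key=lambda name: sum(cpus[name]))
```

For an assumed mix of one 100% task on cpu0 and three 10% tasks on cpu1, `busier_by_nr_running` picks cpu1 while `busier_by_util` picks cpu0, which carries 100% against cpu1's 30%.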

Per Entity Load Tracking Metric
The power aware scheduler uses the % utilization of a CPU to measure how idle the CPU is. Hence a group_leader is a group in which the utilization of the CPUs adds up to very nearly the group's capacity.
How do we measure the utilization of a task? This is where the Per Entity Load Tracking metric comes in. It brings many improvements over the current approach of tracking CPU loads.

Per Entity Load Tracking Metric
Both approaches calculate:
cpu->utilization = amount_of_time_cpu_rq_is_active / amount_of_time_cpu_rq_is_(active+idle)
However:
Current approach to calculating cpu->utilization
- All points in the history of cpu->util are given equal importance.
- Coarse-grained tracking of cpu->util compared to the PJT metric.
Per Entity Load Tracking approach to calculating cpu->utilization
- The history decays with time, and more importance is given to recent observations of cpu->utilization.
- Fine-grained tracking of cpu->util.
[Figure: time-line of cpu->util for each approach.]
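The difference between equal weighting and decayed weighting of history can be sketched as below. The decay factor follows the upstream per entity load tracking design as I understand it (history is split into periods, and a contribution halves after 32 periods). The floating-point arithmetic and the normalization by accumulated weight are simplifications, and the function names are hypothetical.

```python
# Decay factor per period, chosen so that Y ** 32 == 0.5: a sample that is
# 32 periods old counts half as much as the most recent one.
Y = 0.5 ** (1 / 32)


def flat_utilization(history):
    """Every point in the history is given equal importance."""
    return sum(history) / len(history)


def decayed_utilization(history):
    """History decays with time; recent periods weigh more.

    `history` lists, oldest first, the fraction of each period during
    which the run queue was active (0.0 to 1.0).
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for sample in history:
        weighted_sum = weighted_sum * Y + sample
        weight_total = weight_total * Y + 1.0
    return weighted_sum / weight_total
```

For a CPU that was fully busy for 100 periods and then idle for 100 periods, the flat average reports 50% while the decayed average reports roughly 10%, which better reflects that the CPU is idle now.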

Per Entity Load Tracking Metric
All the above points make the measurement of the utilization of a CPU more accurate and stable. But there are a few drawbacks to this metric:
- Time taken to pick up cpu->util
- Time taken to decay cpu->util
Both could lead to incorrect prediction of CPU utilization.

Per Entity Load Tracking Metric
For example, when many tasks wake up on a CPU that has been idle for a very long time, cpu_rq->utilization will take time to climb to 100% to reflect those tasks. When the tasks leave the run queue, cpu_rq->utilization will take time to decay to reflect that there are no tasks.
Hence, in cases such as bursty wakeups, we might need to fall back to number_of_tasks_running on the CPU to measure its utilization.
In short: under stable load, the per entity load tracking metric gives an accurate picture of CPU utilization. Under a steep increase or decrease of load, nr_running is a better measure.
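The lag described here can be estimated with a small sketch. It assumes a decaying average whose contributions halve every 32 periods, which is my understanding of the upstream metric; the function name and the floating-point update rule are illustrative simplifications.

```python
Y = 0.5 ** (1 / 32)  # per-period decay factor, so that Y ** 32 == 0.5


def periods_to_reach(target, start, busy):
    """Number of periods a decaying average needs to move from `start`
    to `target` while the CPU is continuously busy (busy=True) or
    continuously idle (busy=False).
    """
    if not 0.0 < target < 1.0:
        raise ValueError("target must lie strictly between 0 and 1")
    sample = 1.0 if busy else 0.0
    util = start
    periods = 0
    # Keep updating until the average has crossed the target.
    while (util < target) if busy else (util > target):
        util = util * Y + (1.0 - Y) * sample
        periods += 1
    return periods
```

Under these assumptions, a long-idle CPU that suddenly becomes fully busy needs 107 periods before the metric reports 90%, and the same number of periods to fall back to 10% once it goes idle. With periods of roughly 1 ms, that is about a tenth of a second during which nr_running describes the burst better than the tracked utilization does.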

Where does the Power Aware Scheduler stand today?
Alex Shi from Intel has posted a series of patches implementing the Power Aware Scheduler as explained above. We scheduler developers had discussions around:
- The usage of the Per Entity Load Tracking metric in the Power Aware Scheduler.
- Areas where the metric fell short in evaluating CPU utilization.
- Scheduling policies of the Power Aware Scheduler.
- Implementation of the Power Aware Scheduler during fork and wake-up of tasks, and more implementation specifics.
As a result, significant improvements in the code and design of the Power Aware Scheduler have been achieved.

Where does the Power Aware Scheduler stand today?
The Per Entity Load Tracking metric has been merged upstream into mainline. However, its usage in the recent efforts, load balancing and power aware scheduling, is still under discussion. The reason is that any major change to the Linux scheduler demands significant positive experimental results over the current implementation. There could be points we missed during the discussions that show up during experimental evaluation.
Thus the following slides evaluate the Power Aware Scheduler on two aspects:
- Behavior: consolidation of tasks on lightly loaded systems; fallback to default scheduling otherwise.
- Performance per Watt

Experimental Evaluation of Power Aware Scheduler
Demonstration of the behavior of the Power Aware Scheduler against today's scheduler: load consolidation against spreading of load.
[Figure: plots labeled "Power_Aware_sched" and "Sched".]
Benchmark: Ebizzy
Platform: 2 socket, 8 cores per socket x86 machine

Experimental Evaluation of Power Aware Scheduler
Demonstration of the behavior of the Power Aware Scheduler against today's scheduler: load consolidation to fewer power domains on lightly loaded systems.
[Figure: plots labeled "Power_Aware_sched" and "Sched".]
Benchmark: Kernbench
Platform: 2 socket, 8 cores per socket x86 machine

Experimental Evaluation of Power Aware Scheduler
Demonstration of the behavior of the Power Aware Scheduler against today's scheduler: load consolidation to fewer power domains on lightly loaded systems.
[Figure: plots labeled "Power_Aware_sched" and "Sched".]
Benchmark: Ebizzy
Platform: 2 socket, 8 cores per socket, SMT-4, Power machine

Experimental Evaluation of Power Aware Scheduler
Demonstration of the behavior of the Power Aware Scheduler against today's scheduler: load consolidation to fewer power domains on lightly loaded systems.
[Figure: plots labeled "Power_Aware_sched" and "Sched".]
Benchmark: Kernbench
Platform: 2 socket, 8 cores per socket, SMT-4, Power machine

Experimental Evaluation of Power Aware Scheduler
Demonstration of the impact of the power aware scheduler, relative to today's scheduler, on the performance per Watt of benchmarks.
[Figure: two performance-per-Watt plots.]
- Benchmark: Ebizzy. Platform: 2 socket, 8 cores per socket x86 machine
- Benchmark: Kernbench. Platform: 2 socket, 8 cores per socket x86 machine

Other Efforts towards Power Efficient Scheduling
Heterogeneous Platform Scheduling
Map the nature of tasks to CPU power. Example: ARM's big.LITTLE CPUs
- Do not wake up the higher power consuming big CPUs if the tasks do not demand that much performance
- Schedule such tasks on LITTLE CPUs
- The nature of a task is known from the Per Entity Load Tracking measurement of the task's CPU utilization
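A minimal sketch of mapping tasks to clusters by their tracked utilization. The single threshold BIG_THRESHOLD and the function names are hypothetical; the actual patch set may use different criteria.

```python
BIG_THRESHOLD = 0.8  # hypothetical: tracked utilization above which a task
                     # is considered to demand a big CPU


def pick_cluster(task_util):
    """Choose a cluster from the task's tracked utilization (0.0 to 1.0)."""
    return "big" if task_util >= BIG_THRESHOLD else "LITTLE"


def big_cluster_needed(task_utils):
    """The big CPUs only need to be woken if some task demands them."""
    return any(pick_cluster(util) == "big" for util in task_utils)
```

With only low-utilization tasks runnable, `big_cluster_needed` returns False and the big CPUs can stay idle.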

Other Efforts towards Power Efficient Scheduling
Packing Small Tasks
Pack small tasks:
- Ensure that only selected power domains are responsible for handling lightweight tasks, so that the other power domains can go to an idle state and save power.
- This ensures that the CPUs of those other power domains are not frequently woken up to handle small tasks and can have longer residency times in idle states.
- Identify lightweight tasks using the Per Entity Load Tracking metric.

References
Links to my discussions around the Per Entity Load Tracking Metric with the community:
- Initial efforts towards integrating the Per-Entity-Load-Tracking metric into the load balancer: https://lwn.net/articles/521272/
- Subsequent challenges of integrating the Per-Entity-Load-Tracking metric into the Normal Load Balancing of the Linux Scheduler: https://lkml.org/lkml/2013/1/1/89
Links to the discussions around the Power Efficiency of the scheduler with the community:
- Alex Shi's patchset on the Power Aware Scheduler: http://lwn.net/articles/545910/
- Vincent Guittot's patchset on Packing small Tasks: http://lwn.net/articles/518834/
- Morten Rasmussen's patchset on Heterogeneous CPU scheduling: http://lwn.net/articles/544358/
- Per-Entity-Load-Tracking Metric patchset and discussions around it: https://lkml.org/lkml/2012/6/27/644

QUESTIONS?

Legal Statement
Copyright International Business Machines Corporation 2013. Permission to redistribute in accordance with Linux Foundation Collaboration Summit submission guidelines is granted; all other rights reserved.
This work represents the view of the authors and does not necessarily represent the view of IBM.
IBM, IBM logo, ibm.com are trademarks of International Business Machines Corporation in the United States, other countries, or both. Intel is a trademark or registered trademark of Intel Corporation or its subsidiaries in the United States and other countries. Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both. Other company, product, and service names may be trademarks or service marks of others.
References in this publication to IBM products or services do not imply that IBM intends to make them available in all countries in which IBM operates.
INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of express or implied warranties in certain transactions, therefore, this statement may not apply to you.
This information could include technical inaccuracies or typographical errors. Changes are periodically made to the information herein; these changes will be incorporated in new editions of the publication. IBM may make improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time without notice.