Update on big.LITTLE scheduling experiments. Morten Rasmussen, Technology Researcher



Agenda
- Why is big.LITTLE different from SMP?
- Summary of previous experiments on emulated big.LITTLE
- New results for big.LITTLE in silicon (ARM TC2)
- Next steps...

Why is big.LITTLE different from SMP?
SMP: the scheduling goal is to distribute work evenly across all available CPUs to get maximum performance. With DVFS support this can save power too.
big.LITTLE: the scheduling goal is to maximize power efficiency with only a modest performance sacrifice. Tasks should be distributed unevenly: only critical tasks should execute on big CPUs, to minimize power consumption. Contrary to SMP, it matters where a task is scheduled.

What is the (mainline) status?
Example: Android UI render thread execution time. It matters where a task is scheduled.
[Plots: render thread execution time on 4-core SMP, on emulated 2+2 big.LITTLE, and on emulated 2+2 big.LITTLE with big.LITTLE-aware scheduling]

big.LITTLE hardware platform
We are now in the process of investigating scheduling issues on real big.LITTLE hardware.
ARM TC2 big.LITTLE test chip:
- Two CPU clusters: 2x Cortex-A15 (big) + 3x Cortex-A7 (LITTLE)
- Per-cluster L2 caches, cache-coherent interconnect
- No GPU
- cpufreq support
- cpuidle support
- Linux SMP across all five cores
[Block diagram: Cortex-A15 MPCore and Cortex-A7 MPCore clusters, each with its own L2 cache and interrupt control, connected via the CCI-400 coherent interconnect]

Running on real HW: ARM TC2
Bbench on Android:
[Histogram: SurfaceFlinger execution time over 50 runs, SMP vs. HMP; x-axis: exec. time [s] (4-20), y-axis: occurrence (0-40)]

Mainline Linux Scheduler (CFS)
We need proper big.LITTLE/heterogeneous system support in CFS. Load balancing is currently based on an expression of CPU load which is basically:

cpu_load = cpu_power * Σ_task prio(task)

The scheduler does not know how much CPU time is consumed by each task. The current scheduler can distribute tasks fairly evenly based on cpu_power on a big.LITTLE system, but this is not what we want for power efficiency. Embedded use cases focus mainly on responsiveness. It is therefore important that each task is scheduled on an appropriate CPU to get the best performance and power efficiency.
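The load expression above can be sketched in plain Python (illustrative only, not kernel code). The weight values come from the kernel's prio_to_weight table; the helper function itself is hypothetical:

```python
# Subset of the kernel's prio_to_weight table: nice level -> load weight.
PRIO_TO_WEIGHT = {-5: 3121, 0: 1024, 5: 335, 10: 110}

def cpu_load(nice_levels, cpu_power=1):
    """cpu_load = cpu_power * sum of the runnable tasks' priority weights.

    Note what is missing: nothing here reflects how much CPU time each
    task actually consumes, which is exactly the problem described above.
    """
    return cpu_power * sum(PRIO_TO_WEIGHT[n] for n in nice_levels)

cpu_load([0, 0])    # two nice-0 tasks -> 2048
cpu_load([0, 10])   # a nice-10 task adds only 110 -> 1134
```

A busy nice-0 task and a mostly idle nice-0 task contribute identically, so this metric cannot tell the balancer which task deserves a big core.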

Tracking task load
The load contribution of a particular task is needed to make an appropriate scheduling decision. We have experimented internally with identifying task characteristics based on each task's time-slice utilization. Meanwhile, Paul Turner (Google) posted an RFC patch set on LKML with similar features (https://lkml.org/lkml/2012/2/1/763). It focuses on improving fair group scheduling, but is very useful for task placement on asymmetric systems, and can potentially be used for aspects of power-aware scheduling too. It is now out in v2 (v3?). Mainline plans?

Entity load-tracking summary
Tracks the time each task spends on the runqueue (executing or waiting), updated approximately every ms. Note that t_runqueue >= t_executing. The contributed load is a geometric series over the history of time spent on the runqueue, scaled by the task priority. Per-task CPU usage and runqueue load are tracked as well.
[Figure: task load over time for a task alternating between executing and sleeping; the load builds up while the task is runnable and decays while it sleeps]
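The decaying geometric series can be sketched in plain Python (illustrative, not the kernel's fixed-point implementation). The decay factor follows PJT's patch set, where a period's contribution halves every 32 periods:

```python
# Toy model of per-entity load tracking: runnable time is accumulated
# in ~1 ms periods and decayed geometrically, so recent activity
# dominates the tracked load. Y is chosen so that a contribution
# halves after 32 periods, as in Paul Turner's patch set.
Y = 0.5 ** (1 / 32)

def update_load(load, runnable_ms):
    """Decay the accumulated sum by one period, add this period's time."""
    return load * Y + runnable_ms

load = 0.0
for _ in range(100):            # a task fully runnable for 100 periods
    load = update_load(load, 1.0)
# load approaches the geometric-series limit 1 / (1 - Y), about 46.7
```

A task that sleeps simply keeps decaying (runnable_ms = 0), so its tracked load fades instead of dropping to zero instantly.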

big.LITTLE scheduling: First stab
Policy: keep all tasks on LITTLE cores unless:
1. the runqueue residency is above a fixed threshold, and
2. the task priority is default or higher (nice 0).
Goal: only use big cores when necessary. Frequent but low-intensity tasks are assumed to suffer minimally from being stuck on a LITTLE core. High-intensity, low-priority tasks will not be scheduled on big cores to finish earlier when it is not necessary. Tasks can migrate to match their current requirements.
[Figure: task load traces for two tasks, showing migration to big when the load rises above the threshold and back to LITTLE when it falls]
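The two-condition policy can be written out as a short sketch (illustrative Python; the threshold value and function name are invented, not the real tunables):

```python
# Hypothetical sketch of the "first stab" policy: a task is eligible
# for a big core only if its runqueue residency exceeds a fixed
# threshold AND its priority is default or higher (nice <= 0).
BIG_THRESHOLD = 0.7   # illustrative value, not the actual tunable

def select_cluster(residency, nice):
    """residency: fraction of tracked history spent on the runqueue."""
    if residency > BIG_THRESHOLD and nice <= 0:
        return "big"
    return "LITTLE"

select_cluster(0.9, 0)    # high intensity, default priority -> "big"
select_cluster(0.9, 10)   # high intensity but low priority  -> "LITTLE"
select_cluster(0.2, 0)    # low intensity                    -> "LITTLE"
```

The second condition is what keeps a heavy background batch job from occupying a big core it does not actually need.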

Experimental implementation
Scheduler modifications:
- Apply PJT's load-tracking patch set.
- Set up big and LITTLE sched_domains with no load balancing between them.
- select_task_rq_fair() checks the task load history to select an appropriate target CPU for tasks waking up.
- Add a forced-migration mechanism to push the currently running task to a big core, similar to the existing active load-balancing mechanism: run_rebalance_domains() periodically checks the current task on each LITTLE runqueue for tasks that need to be forcibly migrated to a big core.
Note: there are known issues related to global load balancing.
Forced migration latency: ~160 us on vexpress-a9 (migration -> schedule).
[Diagram: LITTLE and big runqueues; select_task_rq_fair()/forced migration moves tasks between clusters, load_balance operates within each cluster]
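The periodic forced-migration check can be modeled as follows (a toy sketch with invented names and data structures, not the kernel mechanism):

```python
# Toy model of the forced-migration pass: inspect the task currently
# running on each LITTLE runqueue and push it to the least-loaded big
# runqueue when its tracked load crosses the threshold and its
# priority qualifies (nice <= 0), mirroring the policy above.
def forced_migration_pass(little_rqs, big_rqs, threshold=0.7):
    migrated = []
    for rq in little_rqs:
        task = rq["current"]
        if task and task["load"] > threshold and task["nice"] <= 0:
            rq["current"] = None                      # pulled off LITTLE
            target = min(big_rqs, key=lambda r: len(r["tasks"]))
            target["tasks"].append(task)              # pushed to big
            migrated.append(task["name"])
    return migrated

little = [{"current": {"name": "render", "load": 0.9, "nice": 0}},
          {"current": {"name": "logger", "load": 0.9, "nice": 10}}]
big = [{"tasks": []}, {"tasks": []}]
forced_migration_pass(little, big)   # "render" moves; "logger" stays
```

The point of forcing the migration is that an already-running task is never pulled by normal idle balancing, so it has to be actively pushed, just like active load balancing does today.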

Example: Bbench on Android
Filesystem: Android ICS (4.0). Browser benchmark: renders a new webpage every ~50 s using JavaScript and scrolls each page after a fixed delay.
Three main threads involved:
- WebViewCoreThread: WebKit rendering thread
- SurfaceFlinger: Android UI rendering thread
- android.browser: browser thread

Bbench@TC2 SMP example analysis [trace screenshot]

Bbench@TC2 HMP example analysis [trace screenshot]

Bbench@TC2: WebViewCoreThread
Bbench on Android:
[Histogram: WebViewCoreThread execution time over 50 runs, SMP vs. HMP; x-axis: exec. time [s] (4-26), y-axis: frequency (0-35)]

Bbench@TC2: android.browser
Bbench on Android:
[Histogram: android.browser execution time over 50 runs, SMP vs. HMP; x-axis: exec. time [s] (4-23), y-axis: frequency (0-40)]

Next step: Reimplementation of asymmetric task placement
The experimental implementation disables the existing load-balancing mechanism in order to override it. The current public patch set (in the open Linaro repository) is not meant for direct adoption; it serves as a tool for demonstration and evaluation. Ideally, similar functionality should instead be integrated with the existing load balancer. Investigate the need for more control over task migrations (extra knobs); PJT's patches might need tuning knobs. Work on generalizing the patch set to support multiple CPU clusters is ongoing, although we only have two clusters (big.LITTLE) for testing. Different target-cluster selection policies for multi-cluster systems might be possible, but this is not our main focus for now.

Next step: Spread/fill task placement
There are ongoing LKML discussions about power-aware scheduling after SCHED_MC was removed. A spread/fill task-distribution tuning knob is needed per CPU cluster for asymmetric systems like big.LITTLE: for low-leakage CPUs, spreading might be the best power/performance trade-off; for high-performance CPUs, filling might be better, since leakage can be minimized. Task load could potentially be used for better CPU filling, but more investigation is needed. The current implementation of tracked load might not be ideal, as the individual task load is affected by the total CPU load. A scale-invariant task load metric might be needed, but is not trivial to define.
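One way to picture the spread/fill distinction is a per-cluster placement sketch (illustrative Python; the heuristics and numbers are invented, not a proposed implementation):

```python
# Illustrative spread vs. fill placement within one cluster:
# "spread" picks the least-loaded CPU (may suit low-leakage LITTLE
# cores), while "fill" packs work onto the busiest CPU that still has
# capacity, letting the remaining CPUs stay idle and power-gated.
def place(task_load, cpu_loads, policy, capacity=1.0):
    cpus = range(len(cpu_loads))
    if policy == "spread":
        return min(cpus, key=lambda c: cpu_loads[c])
    # fill: busiest CPU that can still accommodate the task
    fits = [c for c in cpus if cpu_loads[c] + task_load <= capacity]
    if fits:
        return max(fits, key=lambda c: cpu_loads[c])
    return min(cpus, key=lambda c: cpu_loads[c])   # fall back to spread

place(0.2, [0.5, 0.1, 0.0], "spread")   # -> CPU 2, the idle one
place(0.2, [0.5, 0.1, 0.0], "fill")     # -> CPU 0, still has room
```

The fill heuristic is exactly where the tracked-load caveat above bites: if the load metric is not scale invariant, the "remaining capacity" estimate is unreliable.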

Next step: Integration with cpuidle
Task load tracking gives the scheduler much more information about the tasks and the CPU load. This information can be used to improve power-aware scheduling in general, not just on asymmetric systems. Example: when waking up an idle CPU, select the one in the cheapest C-state. Related: selection of appropriate IRQ affinity. If CPU 0 is big, we need to be able to specify a different default target.
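The wake-up example can be reduced to a one-liner (illustrative Python; the exit latencies are invented for the sake of the example):

```python
# Sketch of the cpuidle-aware wake-up idea: among the idle CPUs, pick
# the one whose current C-state is cheapest to leave, i.e. has the
# lowest exit latency (and, in practice, the lowest wake-up energy cost).
def cheapest_idle_cpu(idle_exit_latency_us):
    """idle_exit_latency_us: {cpu: exit latency of its current C-state}."""
    return min(idle_exit_latency_us, key=idle_exit_latency_us.get)

cheapest_idle_cpu({0: 500, 1: 10, 2: 100})   # -> CPU 1, shallowest state
```

Today the scheduler cannot make this choice, because it has no view of which C-state each idle CPU is in.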

Next step: cpufreq intersections
With task load tracking, the scheduler is in a good position to predict the CPU load every time a task is scheduled. Instead of waiting for cpufreq to figure out that the load has increased, it might be more efficient to drive or hint cpufreq from the scheduler. This would allow much more responsive and aggressive frequency scaling; the frequency transition latency is well below the scheduling period on ARM TC2. Counterproductive scheduling behaviour can also be avoided, e.g. the scheduler migrating tasks to another (idle) CPU before cpufreq has had a chance to increase the frequency. This applies to SMP systems as well. For HMP, we also need to consider per-cluster policies. An interactive/performance policy configuration on A7/A15 has shown good results in the lab on ARM TC2.
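A scheduler-driven frequency hint might look something like this (a sketch only; the OPP table, headroom factor, and helper name are all invented for illustration):

```python
# Sketch of driving cpufreq from the scheduler: when a task is
# enqueued, the predicted runqueue load selects an operating point
# immediately, instead of waiting for a governor sampling period to
# observe the higher load. The OPP table below is hypothetical.
OPPS_KHZ = [350_000, 700_000, 1_000_000]

def freq_for_load(load):
    """Map predicted load (0..1 of max capacity) to the lowest OPP
    that can serve it, with 20% headroom so the CPU is not saturated."""
    for f in OPPS_KHZ:
        if load <= 0.8 * (f / OPPS_KHZ[-1]):
            return f
    return OPPS_KHZ[-1]

freq_for_load(0.25)   # fits in the lowest OPP with headroom
freq_for_load(0.9)    # needs the highest OPP
```

Because the hint fires at enqueue time, the frequency can be raised before the task runs, avoiding the window in which the scheduler would otherwise migrate it to an idle CPU.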

Questions?