Linux Load Balancing



1. Linux Load Balancing (Hyunmin Yoon)

2. Load Balancing

The Linux scheduler attempts to distribute load evenly across CPUs. The load of a CPU (of its run queue) is the sum of the weights of the tasks on it, as illustrated in the sketch below. Load balancing is triggered by:
- Timer interrupts: the load-balancing code is invoked periodically.
- Scheduling events: the load-balancing code is executed when the CPU becomes idle, when fork() or exec() has been executed, or when a task has woken up.
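For intuition, a minimal sketch of what "load of a run queue" means. The struct task and struct runqueue below are illustrative stand-ins, not the kernel's structures; the weights come from the kernel's nice-to-weight table (nice 0 maps to 1024, nice 5 to 335):

	#include <stdio.h>

	/* Simplified stand-ins for the kernel's task/run-queue structures. */
	struct task { int weight; };            /* weight from the nice level */
	struct runqueue { struct task tasks[8]; int nr_running; };

	/* Run-queue load: the sum of the weights of its runnable tasks. */
	static unsigned long rq_load(const struct runqueue *rq)
	{
		unsigned long load = 0;
		for (int i = 0; i < rq->nr_running; i++)
			load += rq->tasks[i].weight;
		return load;
	}

	int main(void)
	{
		/* Three nice-0 tasks (weight 1024) and one nice-5 task (335). */
		struct runqueue rq = { { {1024}, {1024}, {1024}, {335} }, 4 };
		printf("rq load = %lu\n", rq_load(&rq));    /* prints 3407 */
		return 0;
	}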

3. Scheduling Domains and Groups

Scheduling domains:
- Load balancing takes place within a scheduling domain.
- Scheduling domains define the scheduling entities in a hierarchical fashion.
- Each scheduling domain spans a number of CPUs; a domain's span MUST be a superset of its child's span.
- Each scheduling domain must have one or more scheduling groups.

Scheduling groups:
- Each scheduling group contains one or more (virtual) CPUs.
- Load balancing takes place between scheduling groups.

(The two structures are sketched below.)
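A rough, heavily abbreviated sketch of the two kernel structures (a small subset of the fields of struct sched_domain and struct sched_group; most members are omitted here):

	/* Heavily abbreviated view of the kernel structures. */
	struct sched_group {
		struct sched_group *next;     /* circular list of the domain's groups */
		/* ... group capacity, cpumask of member CPUs ... */
	};

	struct sched_domain {
		struct sched_domain *parent;  /* wider domain (e.g. DIE above MC)  */
		struct sched_domain *child;   /* narrower domain; its span is a    */
		                              /* subset of this domain's span      */
		struct sched_group *groups;   /* balancing happens between these   */
		unsigned long min_interval;   /* balancing interval bounds (ms)    */
		unsigned long max_interval;
		int flags;                    /* SD_LOAD_BALANCE, SD_BALANCE_*, ...*/
		/* ... spanned CPUs follow ... */
	};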

4. CPU Topology

(figure: CPU topology diagram)

5. Scheduling Domain

Default scheduling-domain topology levels:
- SMT domain: for hardware multi-threading within a package (ARM uses a GMC domain instead)
- MC domain: for multiple cores within a package
- DIE domain: for multiple packages

Domains have different configurations, which implies differences in flags:
- SD_SHARE_PKG_RESOURCES: the groups share package resources such as the cache
- SD_SHARE_POWERDOMAIN: reflects whether the groups of CPUs in a sched_domain level can reach different power states or not

(The default topology table is shown below.)
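In kernels of this vintage (~3.16-4.0) the default levels come from a small table in kernel/sched/core.c, abridged here; architectures such as ARM can install their own table (e.g. with a GMC level) via set_sched_topology():

	/* kernel/sched/core.c (abridged): the default topology table. */
	static struct sched_domain_topology_level default_topology[] = {
	#ifdef CONFIG_SCHED_SMT
		{ cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
	#endif
	#ifdef CONFIG_SCHED_MC
		{ cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
	#endif
		{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
		{ NULL, },
	};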

6. Timer-Driven Load Balancing

Load balancing is triggered by scheduling ticks:
- kernel/sched/core.c: scheduler_tick(), invoked by a timer interrupt, checks whether it is time to do load balancing.
- kernel/sched/fair.c: if it is time for load balancing, the tick path marks it for the softirq handler, which then performs the load balancing (see the check below).
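The check itself is small; abridged from kernel/sched/fair.c of this era:

	/* kernel/sched/fair.c (abridged): called from scheduler_tick(). */
	void trigger_load_balance(struct rq *rq)
	{
		/* Nothing to do while this CPU is attached to a NULL domain. */
		if (unlikely(on_null_domain(rq)))
			return;

		/* Periodic balancing: defer the real work to softirq context. */
		if (time_after_eq(jiffies, rq->next_balance))
			raise_softirq(SCHED_SOFTIRQ);
	#ifdef CONFIG_NO_HZ_COMMON
		/* Possibly kick an idle CPU to balance on behalf of others. */
		if (nohz_kick_needed(rq))
			nohz_balancer_kick();
	#endif
	}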

7. run_rebalance_domains()

kernel/sched/fair.c: run_rebalance_domains() is the SCHED_SOFTIRQ handler; it runs rebalance_domains() on the local run queue (see the sketch below).
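The code for this slide did not survive transcription; roughly, the handler in kernels of this era looks like this (the exact ordering of the nohz call varies across versions):

	/* kernel/sched/fair.c (abridged): the SCHED_SOFTIRQ handler. */
	static void run_rebalance_domains(struct softirq_action *h)
	{
		struct rq *this_rq = this_rq();
		enum cpu_idle_type idle = this_rq->idle_balance ?
							CPU_IDLE : CPU_NOT_IDLE;

		rebalance_domains(this_rq, idle);

		/* If this CPU has a pending nohz_balance_kick, also balance
		 * on behalf of the other idle CPUs whose ticks are stopped. */
		nohz_idle_balance(this_rq, idle);
	}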

8. rebalance_domains()

kernel/sched/fair.c:

	static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
	{
		int continue_balancing = 1;
		int cpu = rq->cpu;
		unsigned long interval;
		struct sched_domain *sd;
		/* Interval for the next periodic execution of the load
		 * balancer: at most 1 minute from now. */
		unsigned long next_balance = jiffies + 60*HZ;
		int update_next_balance = 0;

		/* Starting from the current domain, all parent domains are
		 * visited. */
		for_each_domain(cpu, sd) {
			/* Balancing interval for the current domain, roughly
			 * (number of CPUs in this domain) x (busy_factor = 32)
			 * milliseconds. */
			interval = get_sd_balance_interval(sd, idle != CPU_IDLE);

			/* Rebalance this sched_domain only if its balancing
			 * interval has elapsed. */
			if (time_after_eq(jiffies, sd->last_balance + interval)) {
				if (load_balance(cpu, rq, sd, idle, &continue_balancing)) {
					idle = idle_cpu(cpu) ? CPU_IDLE : CPU_NOT_IDLE;
					sd->last_balance = jiffies;
					interval = get_sd_balance_interval(sd, idle != CPU_IDLE);
				}
			}
			if (time_after(next_balance, sd->last_balance + interval)) {
				next_balance = sd->last_balance + interval;
				update_next_balance = 1;
			}
		}
		/* ... (rq->next_balance is updated from next_balance) ... */
	}

9. Default Load Balancing Method

- Find the busiest group, i.e., the group with the highest average load in the domain.
- Find the busiest run queue, i.e., the run queue with the highest load in the busiest group.
- Pull tasks from the busiest run queue to the run queue of the CPU calling the load balancer (see the condensed flow below).

(figure: a domain with groups 1 and 2, each with run queues RQ1-RQ4; the busiest group is identified first, then the busiest run queue within it, and tasks migrate to the run queue that called the load balancer)
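Condensing the three steps: find_busiest_group(), find_busiest_queue(), detach_tasks() and attach_tasks() are the kernel's functions, but the wrapper below is an illustrative condensation, not the real body of load_balance():

	/* Condensed flow of load_balance() in kernel/sched/fair.c. */
	static int load_balance_sketch(struct lb_env *env)
	{
		struct sched_group *busiest_group;
		struct rq *busiest_rq;

		/* 1. Highest average load among the domain's groups
		 *    (also computes env->imbalance). */
		busiest_group = find_busiest_group(env);
		if (!busiest_group)
			return 0;                    /* out_balanced */

		/* 2. Highest load among that group's run queues. */
		busiest_rq = find_busiest_queue(env, busiest_group);

		/* 3. Detach up to env->imbalance worth of tasks from the
		 *    busiest run queue and attach them to the local one. */
		detach_tasks(env);
		attach_tasks(env);
		return 1;
	}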

10. Domain Load

load_balance() -> find_busiest_group() -> update_sd_lb_stats()

	static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sds)
	{
		struct sched_domain *child = env->sd->child;
		struct sched_group *sg = env->sd->groups;
		struct sg_lb_stats tmp_sgs;
		int load_idx;
		bool overload = false;
		/* ... */
		do {
			struct sg_lb_stats *sgs = &tmp_sgs;
			int local_group;

			/* Designate the local group (the group containing the
			 * balancing CPU) and update its group capacity. */
			local_group = cpumask_test_cpu(env->dst_cpu, sched_group_cpus(sg));
			if (local_group) {
				sds->local = sg;
				sgs = &sds->local_stat;

				if (env->idle != CPU_NEWLY_IDLE ||
				    time_after_eq(jiffies, sg->sgc->next_update))
					update_group_capacity(env->sd, env->dst_cpu);
			}

			/* Update this group's load information. */
			update_sg_lb_stats(env, sg, load_idx, local_group, sgs, &overload);

			if (local_group)
				goto next_group;

			/* Pick the busiest group in the domain. */
			if (update_sd_pick_busiest(env, sds, sg, sgs)) {
				sds->busiest = sg;
				sds->busiest_stat = *sgs;
			}

	next_group:
			/* Accumulate the domain-wide load. */
			sds->total_load += sgs->group_load;
			sds->total_capacity += sgs->group_capacity;

			sg = sg->next;
		} while (sg != env->sd->groups);
	}

11. Group Load

load_balance() -> find_busiest_group() -> update_sd_lb_stats() -> update_sg_lb_stats()

	static inline void update_sg_lb_stats(struct lb_env *env,
				struct sched_group *group, int load_idx,
				int local_group, struct sg_lb_stats *sgs,
				bool *overload)
	{
		unsigned long load;
		int i;

		memset(sgs, 0, sizeof(*sgs));

		for_each_cpu_and(i, sched_group_cpus(group), env->cpus) {
			struct rq *rq = cpu_rq(i);

			/* Bias balancing toward the CPUs of our own domain. */
			if (local_group)
				load = target_load(i, load_idx);
			else
				load = source_load(i, load_idx);

			sgs->group_load += load;
			sgs->sum_nr_running += rq->nr_running;

			if (rq->nr_running > 1)
				*overload = true;

			sgs->sum_weighted_load += weighted_cpuload(i);
			if (idle_cpu(i))
				sgs->idle_cpus++;
		}

		/* Scale the group load by the group's relative CPU capacity. */
		sgs->group_capacity = group->sgc->capacity;
		sgs->avg_load = (sgs->group_load * SCHED_CAPACITY_SCALE) / sgs->group_capacity;

		if (sgs->sum_nr_running)
			sgs->load_per_task = sgs->sum_weighted_load / sgs->sum_nr_running;

		sgs->group_weight = group->group_weight;

		sgs->group_imb = sg_imbalanced(group);
		sgs->group_capacity_factor = sg_capacity_factor(env, group);

		if (sgs->group_capacity_factor > sgs->sum_nr_running)
			sgs->group_has_free_capacity = 1;
	}

12. calculate_imbalance()

	static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *sds)
	{
		unsigned long max_pull, load_above_capacity = ~0UL;
		struct sg_lb_stats *local, *busiest;

		local = &sds->local_stat;
		busiest = &sds->busiest_stat;

		/* If the busiest group is not above the domain average, or the
		 * local group is already above it, only a small imbalance is
		 * left to fix. */
		if (busiest->avg_load <= sds->avg_load ||
		    local->avg_load >= sds->avg_load) {
			env->imbalance = 0;
			return fix_small_imbalance(env, sds);
		}

		if (!busiest->group_imb) {
			/* Load carried beyond the busiest group's capacity. */
			load_above_capacity = (busiest->sum_nr_running -
					       busiest->group_capacity_factor);
			load_above_capacity *= (SCHED_LOAD_SCALE * SCHED_CAPACITY_SCALE);
			load_above_capacity /= busiest->group_capacity;
		}

		/* Pull no more than what brings the busiest group down to the
		 * domain average, and no more than what fills the local group
		 * up to it. */
		max_pull = min(busiest->avg_load - sds->avg_load, load_above_capacity);
		env->imbalance = min(
			max_pull * busiest->group_capacity,
			(sds->avg_load - local->avg_load) * local->group_capacity
		) / SCHED_CAPACITY_SCALE;
	}

13. Out-Balancing Conditions

Many out-balancing (early-exit) conditions exist in the load balancer. For example, balancing is abandoned when the imbalance is within a specified limit in find_busiest_group():

	if (env->idle == CPU_IDLE) {
		/* This CPU is idle: if the busiest group has no more tasks
		 * than CPUs and has more idle CPUs than the local group,
		 * treat the domain as balanced. */
		if ((local->idle_cpus < busiest->idle_cpus) &&
		    busiest->sum_nr_running <= busiest->group_weight)
			goto out_balanced;
	} else {
		/* Otherwise require the imbalance to exceed imbalance_pct. */
		if (100 * busiest->avg_load <=
		    env->sd->imbalance_pct * local->avg_load)
			goto out_balanced;
	}

imbalance_pct values (a worked example follows):
- default = 125%
- with SD_SHARE_CPUCAPACITY set = 110%
- with SD_SHARE_PKG_RESOURCES set = 117%
- for NUMA migration = 112%
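A quick worked example with the 125% default: if the busiest group's avg_load is 1200 and the local group's is 1000, then 100 x 1200 = 120000 <= 125 x 1000 = 125000, so the balancer takes the out_balanced exit. The busiest group's load must exceed 1.25x the local load before any tasks are pulled.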

14. Choosing the Tasks

load_balance() looks for tasks that are inactive (and therefore likely to be cache cold). It skips tasks that are:
- likely to be cache warm,
- currently running on a CPU, or
- not allowed to run on the current CPU (as indicated by the cpus_allowed bitmask in the task_struct).

(A condensed version of these checks follows.)
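These checks are implemented by can_migrate_task() in kernel/sched/fair.c. Below is a condensed paraphrase: the *_sketch name is ours, while cpumask_test_cpu(), task_running(), and task_hot() are the kernel's helpers:

	/* Condensed from can_migrate_task() in kernel/sched/fair.c. */
	static int can_migrate_task_sketch(struct task_struct *p, struct lb_env *env)
	{
		/* Not allowed on the destination CPU? */
		if (!cpumask_test_cpu(env->dst_cpu, tsk_cpus_allowed(p)))
			return 0;

		/* Currently running? */
		if (task_running(env->src_rq, p))
			return 0;

		/* Cache hot? Migrate it anyway only if the balancer has
		 * already failed repeatedly on this domain. */
		if (task_hot(p, env) &&
		    env->sd->nr_balance_failed <= env->sd->cache_nice_tries)
			return 0;

		return 1;
	}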

15. Handling Imbalance after Load Balancing

- If the load balancer has failed many times to move load, it sets active_balance.
- active_balance uses a push mechanism to resolve physical/logical imbalance: active_load_balance_cpu_stop() pushes one task from the busiest CPU to an idle CPU, even if the busiest CPU has only one task.
- If active_balance was set, the load balancer does not need to run at the next interval, because the work is covered by the push mechanism.

(The trigger threshold is shown below.)
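The "failed many times" threshold is checked in need_active_balance(); its final test, abridged, is roughly:

	/* kernel/sched/fair.c, need_active_balance() (abridged): fall back
	 * to the push mechanism after repeated pull failures. */
	return unlikely(sd->nr_balance_failed > sd->cache_nice_tries + 2);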

16. Event-Driven Load Balancing

Event-driven load balancing is enabled per domain by setting flags (include/linux/sched.h):

	#define SD_LOAD_BALANCE     0x0001  /* Do load balancing on this domain. */
	#define SD_BALANCE_NEWIDLE  0x0002  /* Balance when about to become idle */
	#define SD_BALANCE_EXEC     0x0004  /* Balance on exec */
	#define SD_BALANCE_FORK     0x0008  /* Balance on fork, clone */
	#define SD_BALANCE_WAKE     0x0010  /* Balance on wakeup */

When a task is newly created or woken up via fork(), exec(), or wakeup():
- Select the least loaded group in its current domain.
- Move the task to the least loaded CPU.

When the CPU becomes newly idle:
- Select the most loaded group in its current domain.
- Move tasks from the most loaded CPU to this CPU.

(The corresponding call sites are sketched below.)
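Where the flags take effect, roughly: each event passes the matching flag into select_task_rq(), which walks up the domains that have that flag set and picks a target CPU. Abridged call sites from kernel/sched/core.c:

	/* Abridged call sites in kernel/sched/core.c. */
	cpu = select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0);          /* wake_up_new_task() */
	cpu = select_task_rq(p, task_cpu(p), SD_BALANCE_EXEC, 0);          /* sched_exec()       */
	cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags); /* try_to_wake_up()   */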

17. Tickless Idle

Traditional systems use a periodic timer interrupt (the 'tick') to, among other things, update the system clock; the tick forces a wakeup from the idle state. Tickless idle eliminates the periodic timer tick while the CPU is idle, so the CPU can remain in power-saving states for longer periods, reducing overall system power consumption.
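For reference, this behavior is selected at kernel build time; a minimal .config fragment:

	# Kernel .config fragment: omit scheduling ticks on idle CPUs.
	CONFIG_NO_HZ_IDLE=y

Note that an idle, tickless CPU no longer runs its own periodic balancer; the nohz_idle_balance() path from slide 7 lets another CPU balance on its behalf.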
