Linux Load Balancing

1. Linux Load Balancing
Hyunmin Yoon

2. Load Balancing

- The Linux scheduler attempts to distribute load evenly across CPUs.
  - Load of a CPU (run queue): the sum of the weights of its tasks.
- Load balancing is triggered by:
  - Timer interrupts: the load balancing code is invoked periodically.
  - Scheduling events: the load balancing code is executed when
    - the CPU becomes idle,
    - fork() has been executed,
    - exec() has been executed,
    - a task has woken up.
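
The definition of CPU load as a sum of task weights can be illustrated with a minimal sketch. The names `toy_task` and `toy_rq_load` are hypothetical and do not appear in the kernel; the real scheduler keeps this sum incrementally instead of recomputing it.

```c
#include <stddef.h>

/* Hypothetical, simplified task: only the scheduling weight matters here. */
struct toy_task {
    unsigned long weight;
};

/* Load of a run queue = sum of the weights of the tasks queued on it. */
static unsigned long toy_rq_load(const struct toy_task *tasks, size_t n)
{
    unsigned long load = 0;

    for (size_t i = 0; i < n; i++)
        load += tasks[i].weight;
    return load;
}
```

For example, a run queue holding tasks with weights 1024, 1024 and 335 has a load of 2383 under this definition.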

3. Scheduling Domains and Groups

- Scheduling domains
  - Load balancing takes place within a scheduling domain.
  - Scheduling domains define the scheduling entities hierarchically.
  - Each scheduling domain spans a number of CPUs.
  - A domain's span MUST be a superset of its child's span.
  - Each scheduling domain must have one or more scheduling groups.
- Scheduling groups
  - Each scheduling group contains one or more (virtual) CPUs.
  - Load balancing takes place between scheduling groups.
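
The superset rule for domain spans can be checked mechanically. The sketch below assumes a hypothetical 64-bit CPU mask (`toy_cpumask`) and a hypothetical `toy_domain` type linked to its child; the kernel uses `struct cpumask` and `struct sched_domain` instead.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical CPU mask: bit i set means CPU i is in the span (up to 64 CPUs). */
typedef uint64_t toy_cpumask;

struct toy_domain {
    toy_cpumask span;
    const struct toy_domain *child;   /* NULL at the lowest level */
};

/* Walk from a domain down to its lowest child and verify that every
 * child's span is fully contained in its parent's span. */
static bool toy_span_is_valid(const struct toy_domain *sd)
{
    for (; sd && sd->child; sd = sd->child) {
        if ((sd->child->span & ~sd->span) != 0)
            return false;   /* child has a CPU the parent lacks */
    }
    return true;
}
```

A three-level hierarchy with spans 0x3, 0xF and 0xFF passes the check, while a parent with span 0x5 over a child with span 0x3 fails it because CPU 1 is missing from the parent.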

4. CPU Topology

[Figure: CPU topology]

5. Scheduling Domain

- Default scheduling domain topology levels
  - SMT domain: for multi-threading in a package
    - ARM uses the GMC domain
  - MC domain: for multi-core in a package
  - DIE domain: for multi-package
- Domains have different configurations, which implies differences in flags
  - SD_SHARE_PKG_RESOURCES: groups share resources such as the cache
  - SD_SHARE_POWERDOMAIN: reflects whether or not the groups of CPUs at a sched_domain level can reach different power states

6. Timer-Driven Load Balancing

- Load balancing is triggered by scheduling ticks.
- kernel/sched/core.c
  - Invoked by a timer interrupt.
  - Checks whether it is time to do load balancing.
- kernel/sched/fair.c
  - If it is time for load balancing, marks it for the softirq handler.
  - Performs load balancing.
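
The two-stage flow above (a cheap check in the tick, the actual work deferred to a softirq handler) can be sketched as follows. All `toy_` names are hypothetical stand-ins, the tick counter replaces jiffies, and the balancing work itself is left out.

```c
#include <stdbool.h>

/* Hypothetical per-CPU state for this sketch. */
struct toy_rq {
    unsigned long next_balance;   /* tick count at which balancing is due */
    bool softirq_pending;         /* set by the tick, cleared by the handler */
};

/* Wraparound-safe "a is at or after b" on a free-running tick counter. */
static bool toy_time_after_eq(unsigned long a, unsigned long b)
{
    return (long)(a - b) >= 0;
}

/* Tick path: only check whether balancing is due and mark it. */
static void toy_scheduler_tick(struct toy_rq *rq, unsigned long now)
{
    if (toy_time_after_eq(now, rq->next_balance))
        rq->softirq_pending = true;
}

/* Deferred path: do the (omitted) balancing work and schedule the next run.
 * Returns true if balancing was performed. */
static bool toy_run_softirq(struct toy_rq *rq, unsigned long now,
                            unsigned long interval)
{
    if (!rq->softirq_pending)
        return false;
    rq->softirq_pending = false;
    /* ... load balancing would happen here ... */
    rq->next_balance = now + interval;
    return true;
}
```

Keeping the tick path down to a single comparison is the point of the split: the expensive work runs later, outside the timer interrupt.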

7. run_rebalance_domains()

kernel/sched/fair.c

8. rebalance_domains()

kernel/sched/fair.c

    static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
    {
        /* Latest time for the next balance: one minute from now */
        unsigned long next_balance = jiffies + 60*HZ;
        int update_next_balance = 0;
        /* ... */

        /* Start from the current domain and visit all parent domains */
        for_each_domain(cpu, sd) {
            /* ... */
            /* Balancing interval for the current domain */
            interval = get_sd_balance_interval(sd, idle != CPU_IDLE);
            /* ... */
            /* Rebalance only if the interval has elapsed */
            if (time_after_eq(jiffies, sd->last_balance + interval)) {
                if (load_balance(cpu, rq, sd, idle, &continue_balancing)) {
                    idle = idle_cpu(cpu) ? CPU_IDLE : CPU_NOT_IDLE;
                }
                sd->last_balance = jiffies;
                interval = get_sd_balance_interval(sd, idle != CPU_IDLE);
            }
            /* ... */
            if (time_after(next_balance, sd->last_balance + interval)) {
                next_balance = sd->last_balance + interval;
                update_next_balance = 1;
            }
        }
        /* ... */
    }

- Set the interval for periodic execution of the load balancer: 1 minute.
- Starting from the current domain, all parent domains are visited.
- Set the balancing interval for the current domain: interval = (the number of CPUs of this domain) x (busy factor = 32) milliseconds.
- Check the balancing interval to see whether the sched_domain should be rebalanced.
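
The interval arithmetic can be made concrete with a small sketch. It assumes a base interval of one millisecond per CPU in the domain, multiplied by the busy factor only when the CPU is busy, and it ignores counter wraparound and locking. `toy_domain`, `toy_interval` and `toy_rebalance` are hypothetical names, and setting `last_balance` stands in for the call to load_balance().

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_BUSY_FACTOR 32u

/* Hypothetical domain: CPU count and time of the last balance, in ms. */
struct toy_domain {
    unsigned int nr_cpus;
    unsigned long last_balance;
};

/* Assumed rule: nr_cpus ms when idle, nr_cpus * busy factor ms when busy. */
static unsigned long toy_interval(const struct toy_domain *sd, bool busy)
{
    unsigned long interval = sd->nr_cpus;

    return busy ? interval * TOY_BUSY_FACTOR : interval;
}

/* Visit domains from lowest to highest, balance the ones that are due,
 * and return the earliest time at which any domain is due again. */
static unsigned long toy_rebalance(struct toy_domain *sds, size_t n,
                                   unsigned long now, bool busy)
{
    unsigned long next = now + 60000;   /* upper bound: one minute */

    for (size_t i = 0; i < n; i++) {
        unsigned long interval = toy_interval(&sds[i], busy);

        if (now - sds[i].last_balance >= interval)
            sds[i].last_balance = now;  /* stand-in for load_balance() */
        if (sds[i].last_balance + interval < next)
            next = sds[i].last_balance + interval;
    }
    return next;
}
```

With a 2-CPU and an 8-CPU domain on a busy CPU, the intervals are 64 ms and 256 ms, so at time 100 only the smaller domain is rebalanced and the next run is due at 164.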

9. Default Load Balancing Method

- Find the busiest group: the group with the highest average load in the domain.
- Find the busiest run queue: the run queue with the highest load in the busiest group.
- Pull tasks from the busiest run queue to the run queue that called the load balancer.

[Figure: a domain with two groups (Group 1, Group 2) and four run queues (RQ1 to RQ4); tasks migrate from the busiest run queue in the busiest group to the run queue that called the load balancer.]
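
The first two steps can be sketched as two maximum searches. This is a simplified illustration with hypothetical types: it averages raw run queue loads per group and ignores CPU capacity, which the kernel does take into account.

```c
#include <stddef.h>

#define TOY_MAX_RQS 4

/* Hypothetical group: the loads of the run queues it contains. */
struct toy_group {
    unsigned long rq_load[TOY_MAX_RQS];
    size_t nr_rqs;
};

/* Average run queue load of a group (0 for an empty group). */
static unsigned long toy_group_avg(const struct toy_group *g)
{
    unsigned long sum = 0;

    for (size_t i = 0; i < g->nr_rqs; i++)
        sum += g->rq_load[i];
    return g->nr_rqs ? sum / g->nr_rqs : 0;
}

/* Step 1: index of the group with the highest average load (n >= 1). */
static size_t toy_busiest_group(const struct toy_group *groups, size_t n)
{
    size_t best = 0;

    for (size_t i = 1; i < n; i++)
        if (toy_group_avg(&groups[i]) > toy_group_avg(&groups[best]))
            best = i;
    return best;
}

/* Step 2: index of the most loaded run queue inside that group. */
static size_t toy_busiest_rq(const struct toy_group *g)
{
    size_t best = 0;

    for (size_t i = 1; i < g->nr_rqs; i++)
        if (g->rq_load[i] > g->rq_load[best])
            best = i;
    return best;
}
```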

10. Domain Load

load_balance() -> find_busiest_group() -> update_sd_lb_stats()

    static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sds)
    {
        struct sched_domain *child = env->sd->child;
        struct sched_group *sg = env->sd->groups;
        /* ... */

        do {
            struct sg_lb_stats *sgs = &tmp_sgs;
            int local_group;

            /* Designate the local group (= current group) and update the group capacity */
            local_group = cpumask_test_cpu(env->dst_cpu, sched_group_cpus(sg));
            if (local_group) {
                sds->local = sg;
                sgs = &sds->local_stat;

                if (env->idle != CPU_NEWLY_IDLE ||
                    time_after_eq(jiffies, sg->sgc->next_update))
                    update_group_capacity(env->sd, env->dst_cpu);
            }

            /* Update the group load information */
            update_sg_lb_stats(env, sg, load_idx, local_group, sgs, &overload);

            if (local_group)
                goto next_group;
            /* ... */

            /* Pick the busiest group in the domain */
            if (update_sd_pick_busiest(env, sds, sg, sgs)) {
                sds->busiest = sg;
                sds->busiest_stat = *sgs;
            }

    next_group:
            /* Calculate the domain load */
            sds->total_load += sgs->group_load;
            sds->total_capacity += sgs->group_capacity;

            sg = sg->next;
        } while (sg != env->sd->groups);
        /* ... */
    }

11. Group Load

load_balance() -> find_busiest_group() -> update_sd_lb_stats() -> update_sg_lb_stats()

    static inline void update_sg_lb_stats(struct lb_env *env,
                struct sched_group *group, int load_idx,
                int local_group, struct sg_lb_stats *sgs,
                bool *overload)
    {
        /* ... */
        for_each_cpu_and(i, sched_group_cpus(group), env->cpus) {
            struct rq *rq = cpu_rq(i);

            if (local_group)
                load = target_load(i, load_idx);
            else
                load = source_load(i, load_idx);

            sgs->group_load += load;
            sgs->sum_nr_running += rq->nr_running;

            if (rq->nr_running > 1)
                *overload = true;

            sgs->sum_weighted_load += weighted_cpuload(i);
            if (idle_cpu(i))
                sgs->idle_cpus++;
        }

        sgs->group_capacity = group->sgc->capacity;
        sgs->avg_load = (sgs->group_load * SCHED_CAPACITY_SCALE) / sgs->group_capacity;

        if (sgs->sum_nr_running)
            sgs->load_per_task = sgs->sum_weighted_load / sgs->sum_nr_running;

        sgs->group_weight = group->group_weight;

        sgs->group_imb = sg_imbalanced(group);
        sgs->group_capacity_factor = sg_capacity_factor(env, group);

        if (sgs->group_capacity_factor > sgs->sum_nr_running)
            sgs->group_has_free_capacity = 1;
    }
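
The two derived statistics, avg_load and load_per_task, are plain integer arithmetic, and a worked sketch makes the scaling visible. The struct and function names below are hypothetical, and `TOY_CAPACITY_SCALE` is assumed to be 1024, the value of SCHED_CAPACITY_SCALE in kernels of this era.

```c
#define TOY_CAPACITY_SCALE 1024UL

/* Hypothetical subset of the per-group statistics used above. */
struct toy_group_stats {
    unsigned long group_load;
    unsigned long group_capacity;     /* must be non-zero */
    unsigned long sum_weighted_load;
    unsigned int sum_nr_running;
};

/* Group load normalised by capacity, so that groups of different
 * sizes can be compared on the same scale. */
static unsigned long toy_avg_load(const struct toy_group_stats *s)
{
    return s->group_load * TOY_CAPACITY_SCALE / s->group_capacity;
}

/* Average weighted load per running task (0 if nothing is running). */
static unsigned long toy_load_per_task(const struct toy_group_stats *s)
{
    return s->sum_nr_running ? s->sum_weighted_load / s->sum_nr_running : 0;
}
```

A group with load 3072 and capacity 2048 (two full-capacity CPUs) has an avg_load of 3072 * 1024 / 2048 = 1536; with three running tasks and a weighted load of 3072, its load_per_task is 1024.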

12. calculate_imbalance()

    static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *sds)
    {
        unsigned long max_pull, load_above_capacity = ~0UL;
        struct sg_lb_stats *local, *busiest;

        local = &sds->local_stat;
        busiest = &sds->busiest_stat;
        /* ... */

        if (busiest->avg_load <= sds->avg_load ||
            local->avg_load >= sds->avg_load) {
            env->imbalance = 0;
            return fix_small_imbalance(env, sds);
        }

        if (!busiest->group_imb) {
            load_above_capacity =
                (busiest->sum_nr_running - busiest->group_capacity_factor);

            load_above_capacity *= (SCHED_LOAD_SCALE * SCHED_CAPACITY_SCALE);
            load_above_capacity /= busiest->group_capacity;
        }

        max_pull = min(busiest->avg_load - sds->avg_load, load_above_capacity);

        env->imbalance = min(
            max_pull * busiest->group_capacity,
            (sds->avg_load - local->avg_load) * local->group_capacity
        ) / SCHED_CAPACITY_SCALE;
        /* ... */
    }
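
The final min() expression is easier to follow with numbers. The sketch below reproduces only that arithmetic under simplifying assumptions: it returns 0 where the kernel would fall back to fix_small_imbalance(), and it takes load_above_capacity as a parameter. All `toy_` names are hypothetical.

```c
#define TOY_CAPACITY_SCALE 1024UL

/* Hypothetical subset of the per-group statistics. */
struct toy_stats {
    unsigned long avg_load;
    unsigned long group_capacity;
};

static unsigned long toy_min(unsigned long a, unsigned long b)
{
    return a < b ? a : b;
}

/* Amount of load to move: limited both by how far the busiest group is
 * above the domain average and by how far the local group is below it. */
static unsigned long toy_imbalance(const struct toy_stats *busiest,
                                   const struct toy_stats *local,
                                   unsigned long domain_avg,
                                   unsigned long load_above_capacity)
{
    unsigned long max_pull;

    /* Nothing to pull if the busiest group is not above the average,
     * or the local group is not below it (simplified: no small-imbalance fix). */
    if (busiest->avg_load <= domain_avg || local->avg_load >= domain_avg)
        return 0;

    max_pull = toy_min(busiest->avg_load - domain_avg, load_above_capacity);

    return toy_min(max_pull * busiest->group_capacity,
                   (domain_avg - local->avg_load) * local->group_capacity)
           / TOY_CAPACITY_SCALE;
}
```

With a busiest avg_load of 3072, a local avg_load of 1024, a domain average of 2048 and both capacities at 1024, the imbalance is 1024: moving that much load brings both groups to the domain average.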

13. Out Balancing Condition

- The load balancer has many conditions under which it treats the domain as already balanced and exits (out_balanced).
- Example: the imbalance is within the specified limit in find_busiest_group()

    if (env->idle == CPU_IDLE) {
        if ((local->idle_cpus < busiest->idle_cpus) &&
            busiest->sum_nr_running <= busiest->group_weight)
            goto out_balanced;
    } else {
        if (100 * busiest->avg_load <=
                env->sd->imbalance_pct * local->avg_load)
            goto out_balanced;
    }

- Values of imbalance_pct:
  - Default: 125%
  - When SD_SHARE_CPUCAPACITY is set: 110%
  - When SD_SHARE_PKG_RESOURCES is set: 117%
  - NUMA migration: 112%
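
The percentage test in the non-idle branch can be isolated into a small predicate. `toy_within_limit` is a hypothetical name; the comparison itself is the one shown in the excerpt.

```c
#include <stdbool.h>

/* True when the busiest group's average load does not exceed the local
 * group's average load by more than imbalance_pct percent, i.e. when
 * the balancer should leave things as they are. */
static bool toy_within_limit(unsigned long busiest_avg,
                             unsigned long local_avg,
                             unsigned int imbalance_pct)
{
    return 100 * busiest_avg <= (unsigned long)imbalance_pct * local_avg;
}
```

With local_avg = 1000 and the default of 125, a busiest load of 1200 is tolerated (120000 <= 125000) while 1300 is not; with a limit of 110, 1200 already triggers balancing.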

14. Choosing the Tasks

- load_balance() looks for tasks that are
  - inactive (likely to be cache cold).
- load_balance() skips tasks that are
  - likely to be cache warm,
  - currently running on a CPU,
  - not allowed to run on the current CPU (as indicated by the cpus_allowed bitmask in the task_struct).
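
These three skip rules amount to a short filter. The sketch below uses a hypothetical `toy_task` with a 64-bit affinity mask and a precomputed `cache_hot` flag; it does not model any override that lets a cache-warm task move after repeated failures.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical task state relevant to migration. */
struct toy_task {
    bool running;            /* currently executing on a CPU */
    bool cache_hot;          /* ran recently, cache likely still warm */
    uint64_t cpus_allowed;   /* bit i set: task may run on CPU i */
};

/* Returns true if the task may be pulled to dst_cpu (dst_cpu < 64). */
static bool toy_can_migrate(const struct toy_task *p, unsigned int dst_cpu)
{
    if (!(p->cpus_allowed & (UINT64_C(1) << dst_cpu)))
        return false;        /* affinity forbids the destination CPU */
    if (p->running)
        return false;        /* cannot move a task that is on a CPU */
    if (p->cache_hot)
        return false;        /* moving it would waste a warm cache */
    return true;
}
```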

15. Handling Imbalance after Load Balancing

- If the load balancer has failed many times to move load, it sets active_balance.
- active_balance uses a push mechanism to avoid physical/logical imbalance.
  - active_load_balance_cpu_stop() pushes one task from the busiest CPU to an idle CPU, even if the busiest CPU has only one task.
- If active_balance was set, the load balancer does not need to work at the next interval because it is covered by the push mechanism.
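
The "failed many times" trigger can be pictured as a failure counter with a threshold. This is only a sketch: `toy_domain`, `fail_threshold` and `toy_note_balance_result` are hypothetical, and the exact threshold used by the kernel is not given in the text above.

```c
#include <stdbool.h>

/* Hypothetical per-domain failure bookkeeping. */
struct toy_domain {
    unsigned int nr_balance_failed;   /* consecutive failed pull attempts */
    unsigned int fail_threshold;      /* assumed tunable, not from the source */
};

/* Record the outcome of one pull attempt. Returns true when the push
 * mechanism (active balance) should be requested. */
static bool toy_note_balance_result(struct toy_domain *sd, bool moved_tasks)
{
    if (moved_tasks) {
        sd->nr_balance_failed = 0;    /* success resets the counter */
        return false;
    }
    sd->nr_balance_failed++;
    return sd->nr_balance_failed > sd->fail_threshold;
}
```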

16. Event-Driven Load Balancing

- Event-driven load balancing is enabled by setting flags (include/linux/sched.h):

    #define SD_LOAD_BALANCE     0x0001  /* Do load balancing on this domain. */
    #define SD_BALANCE_NEWIDLE  0x0002  /* Balance when about to become idle */
    #define SD_BALANCE_EXEC     0x0004  /* Balance on exec */
    #define SD_BALANCE_FORK     0x0008  /* Balance on fork, clone */
    #define SD_BALANCE_WAKE     0x0010  /* Balance on wakeup */

- When a task is newly created or woken up (fork(), exec(), or a wakeup):
  - Select the least loaded group in its current domain.
  - Move the task to the least loaded CPU.
- When a CPU becomes newly idle:
  - Select the most loaded group in its current domain.
  - Move tasks from the most loaded CPU to this CPU.
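
How a flag gates placement on fork, exec or wakeup can be sketched as below. The flag values are those listed above; `toy_select_cpu` is a hypothetical helper that searches a flat array of CPU loads, skipping the group level for brevity.

```c
#include <stddef.h>

#define SD_BALANCE_EXEC 0x0004
#define SD_BALANCE_FORK 0x0008
#define SD_BALANCE_WAKE 0x0010

/* If the domain enables balancing for this event, return the index of the
 * least loaded CPU; otherwise keep the task on prev_cpu (nr_cpus >= 1). */
static size_t toy_select_cpu(unsigned int sd_flags, unsigned int event_flag,
                             const unsigned long *cpu_load, size_t nr_cpus,
                             size_t prev_cpu)
{
    size_t best = 0;

    if (!(sd_flags & event_flag))
        return prev_cpu;

    for (size_t i = 1; i < nr_cpus; i++)
        if (cpu_load[i] < cpu_load[best])
            best = i;
    return best;
}
```

A domain with only SD_BALANCE_FORK and SD_BALANCE_EXEC set would place a forked task on the least loaded CPU but leave a woken task where it last ran.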

17. Tickless Idle

- Traditional systems use a periodic interrupt, the 'tick'.
  - It updates the system clock.
  - The tick requires a wakeup from the idle state.
- Tickless idle eliminates the periodic timer tick when the CPU is idle.
  - The CPU can remain in power-saving states for a longer period of time, reducing overall system power consumption.
