Linux Load Balancing
Hyunmin Yoon
Load Balancing
- The Linux scheduler attempts to distribute load evenly across CPUs
- Load of a CPU (run queue): the sum of the weights of its tasks
- Load balancing is triggered by:
  - Timer interrupts: load balancing code is invoked periodically
  - Scheduling events: load balancing code is executed when
    - the CPU becomes idle
    - fork() has been executed
    - exec() has been executed
    - a task has woken up
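As a rough illustration of the load definition above, here is a minimal sketch (not kernel code): a run queue's load is simply the sum of the weights of the tasks queued on it. The weight values in the test assume the nice-0 task weight of 1024 used by CFS.

```c
#include <assert.h>
#include <stddef.h>

/* Sketch: a run queue's load is the sum of its task weights.
 * The kernel derives each weight from the task's nice value
 * (nice 0 -> 1024); here the weights are passed in directly. */
unsigned long rq_load(const unsigned long *task_weights, size_t n)
{
    unsigned long load = 0;
    for (size_t i = 0; i < n; i++)
        load += task_weights[i];   /* sum of task weights */
    return load;
}
```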
Scheduling Domains and Groups
- Scheduling domains
  - Load balancing takes place within a scheduling domain
  - Scheduling domains define the scheduling entities in a hierarchical fashion
  - Each scheduling domain spans a number of CPUs
  - A domain's span MUST be a superset of its child's span
  - Each scheduling domain must have one or more scheduling groups
- Scheduling groups
  - Each scheduling group contains one or more (virtual) CPUs
  - Load balancing takes place between scheduling groups
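The span invariant above can be sketched with plain bitmasks (the kernel uses struct cpumask and cpumask_subset(); this is an illustrative model, not kernel code):

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the domain-span invariant: a parent domain's CPU span
 * must be a superset of each of its children's spans.
 * CPU masks are modeled as 64-bit bitmasks, one bit per CPU. */
int span_is_superset(uint64_t parent_span, uint64_t child_span)
{
    /* every CPU bit set in the child must also be set in the parent */
    return (child_span & ~parent_span) == 0;
}
```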
CPU Topology
(figure: CPU topology diagram)
Scheduling Domain
- Default scheduling-domain topology levels
  - SMT domain: for hardware multi-threading within a package (ARM uses the GMC domain)
  - MC domain: for multiple cores within a package
  - DIE domain: for multiple packages
- Domains have different configurations, which implies differences in their flags
  - SD_SHARE_PKG_RESOURCES: groups share package resources such as the cache
  - SD_SHARE_POWERDOMAIN: reflects whether groups of CPUs in a sched_domain level can reach different power states
Timer-Driven Load Balancing
- Load balancing is triggered by scheduling ticks
- kernel/sched/core.c
  - Invoked by a timer interrupt
  - Checks whether it is time to do load balancing
- kernel/sched/fair.c
  - If it is time for load balancing, marks it for the softirq handler
  - Performs load balancing
run_rebalance_domains()
- kernel/sched/fair.c
rebalance_domains()
- kernel/sched/fair.c
- Starting from the current domain, all parent domains are visited

    static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
    {
        /* earliest time we have to rebalance again: at most 1 minute away */
        unsigned long next_balance = jiffies + 60*HZ;
        int update_next_balance = 0;

        for_each_domain(cpu, sd) {
            /* balancing interval for the current domain:
             * (number of CPUs in this domain) x (busy factor = 32) ms */
            interval = get_sd_balance_interval(sd, idle != CPU_IDLE);

            /* check the balancing interval to see whether this
             * sched_domain should be rebalanced */
            if (time_after_eq(jiffies, sd->last_balance + interval)) {
                if (load_balance(cpu, rq, sd, idle, &continue_balancing))
                    idle = idle_cpu(cpu) ? CPU_IDLE : CPU_NOT_IDLE;
                sd->last_balance = jiffies;
                interval = get_sd_balance_interval(sd, idle != CPU_IDLE);
            }
            if (time_after(next_balance, sd->last_balance + interval)) {
                next_balance = sd->last_balance + interval;
                update_next_balance = 1;
            }
        }
    }
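The gating logic above can be modeled in a few lines (a sketch under simplifying assumptions, not kernel code): the interval grows with the domain's CPU count, and the due-check uses signed subtraction so that jiffies wraparound is handled, as the kernel's time_after_eq() macro does.

```c
#include <assert.h>

typedef unsigned long jiffies_t;

/* interval sketch: (CPUs in domain) x busy_factor, here in raw ticks
 * (the kernel converts the millisecond value to jiffies) */
jiffies_t domain_interval(unsigned int span_weight, unsigned int busy_factor)
{
    return (jiffies_t)span_weight * busy_factor;
}

/* equivalent to time_after_eq(now, last_balance + interval):
 * signed subtraction makes the comparison wraparound-safe */
int balance_due(jiffies_t now, jiffies_t last_balance, jiffies_t interval)
{
    return (long)(now - (last_balance + interval)) >= 0;
}
```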
Default Load Balancing Method
- Find the busiest group: the group with the highest average load in the domain
- Find the busiest run queue: the run queue with the highest load in the busiest group
- Pull tasks from the busiest run queue to the run queue of the CPU calling the load balancer
(figure: a domain with Group 1 and Group 2; RQ1..RQ4; the busiest run queue in the busiest group migrates tasks to the run queue that called the load balancer)
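The two "find the busiest" steps above reduce to the same argmax operation, sketched here over plain arrays (the kernel walks sched_group and rq structures instead; this is an illustrative model):

```c
#include <assert.h>
#include <stddef.h>

/* Sketch: return the index of the entry with the highest load.
 * Applied once to group average loads, then again to the run-queue
 * loads inside the chosen group. */
size_t find_busiest(const unsigned long *load, size_t n)
{
    size_t busiest = 0;
    for (size_t i = 1; i < n; i++)
        if (load[i] > load[busiest])
            busiest = i;
    return busiest;
}
```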
Domain Load
- load_balance() -> find_busiest_group() -> update_sd_lb_stats()

    static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sds)
    {
        struct sched_domain *child = env->sd->child;
        struct sched_group *sg = env->sd->groups;

        do {
            struct sg_lb_stats *sgs = &tmp_sgs;
            int local_group;

            local_group = cpumask_test_cpu(env->dst_cpu, sched_group_cpus(sg));
            if (local_group) {
                /* designate the local group (= current group)
                 * and update the group capacity */
                sds->local = sg;
                sgs = &sds->local_stat;

                if (env->idle != CPU_NEWLY_IDLE ||
                    time_after_eq(jiffies, sg->sgc->next_update))
                    update_group_capacity(env->sd, env->dst_cpu);
            }

            /* update group load information */
            update_sg_lb_stats(env, sg, load_idx, local_group, sgs, &overload);

            if (local_group)
                goto next_group;

            /* pick the busiest group in the domain */
            if (update_sd_pick_busiest(env, sds, sg, sgs)) {
                sds->busiest = sg;
                sds->busiest_stat = *sgs;
            }

    next_group:
            /* accumulate the domain load */
            sds->total_load += sgs->group_load;
            sds->total_capacity += sgs->group_capacity;

            sg = sg->next;
        } while (sg != env->sd->groups);
    }
Group Load
- load_balance() -> find_busiest_group() -> update_sd_lb_stats() -> update_sg_lb_stats()

    static inline void update_sg_lb_stats(struct lb_env *env,
                struct sched_group *group, int load_idx,
                int local_group, struct sg_lb_stats *sgs,
                bool *overload)
    {
        for_each_cpu_and(i, sched_group_cpus(group), env->cpus) {
            struct rq *rq = cpu_rq(i);

            /* bias balancing toward the CPUs of our own domain */
            if (local_group)
                load = target_load(i, load_idx);
            else
                load = source_load(i, load_idx);

            sgs->group_load += load;
            sgs->sum_nr_running += rq->nr_running;

            if (rq->nr_running > 1)
                *overload = true;

            sgs->sum_weighted_load += weighted_cpuload(i);
            if (idle_cpu(i))
                sgs->idle_cpus++;
        }

        /* adjust by the relative CPU capacity of the group */
        sgs->group_capacity = group->sgc->capacity;
        sgs->avg_load = (sgs->group_load * SCHED_CAPACITY_SCALE) / sgs->group_capacity;

        if (sgs->sum_nr_running)
            sgs->load_per_task = sgs->sum_weighted_load / sgs->sum_nr_running;

        sgs->group_weight = group->group_weight;

        sgs->group_imb = sg_imbalanced(group);
        sgs->group_capacity_factor = sg_capacity_factor(env, group);

        if (sgs->group_capacity_factor > sgs->sum_nr_running)
            sgs->group_has_free_capacity = 1;
    }
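The key statistics computed at the end of the function can be illustrated numerically (a sketch, not kernel code): avg_load scales the raw group load by the group's capacity so that a group of small cores is not compared 1:1 against a group of big cores. SCHED_CAPACITY_SCALE is 1024 in the kernel.

```c
#include <assert.h>

#define SCHED_CAPACITY_SCALE 1024UL

/* avg_load = group_load * SCHED_CAPACITY_SCALE / group_capacity:
 * the same raw load looks "heavier" on a lower-capacity group */
unsigned long group_avg_load(unsigned long group_load, unsigned long capacity)
{
    return group_load * SCHED_CAPACITY_SCALE / capacity;
}

/* load_per_task = sum_weighted_load / sum_nr_running (guarded) */
unsigned long load_per_task(unsigned long sum_weighted_load,
                            unsigned long sum_nr_running)
{
    return sum_nr_running ? sum_weighted_load / sum_nr_running : 0;
}
```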
calculate_imbalance()

    static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *sds)
    {
        unsigned long max_pull, load_above_capacity = ~0UL;
        struct sg_lb_stats *local, *busiest;

        local = &sds->local_stat;
        busiest = &sds->busiest_stat;

        if (busiest->avg_load <= sds->avg_load ||
            local->avg_load >= sds->avg_load) {
            env->imbalance = 0;
            return fix_small_imbalance(env, sds);
        }

        if (!busiest->group_imb) {
            load_above_capacity =
                (busiest->sum_nr_running - busiest->group_capacity_factor);

            load_above_capacity *= (SCHED_LOAD_SCALE * SCHED_CAPACITY_SCALE);
            load_above_capacity /= busiest->group_capacity;
        }

        max_pull = min(busiest->avg_load - sds->avg_load, load_above_capacity);

        env->imbalance = min(
            max_pull * busiest->group_capacity,
            (sds->avg_load - local->avg_load) * local->group_capacity
        ) / SCHED_CAPACITY_SCALE;
    }
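A numeric sketch of the computation above: the amount to pull is capped both by how far the busiest group sits above the domain average and by how far the local group sits below it, then scaled back from capacity units. This simplified model assumes SCHED_CAPACITY_SCALE = 1024 and omits the group_imb / load_above_capacity special case.

```c
#include <assert.h>

#define SCHED_CAPACITY_SCALE 1024UL

static unsigned long min_ul(unsigned long a, unsigned long b)
{
    return a < b ? a : b;
}

/* Simplified model of env->imbalance (ignores load_above_capacity):
 * returns 0 when the busiest group is not above the domain average
 * or the local group is already above it (the kernel then falls back
 * to fix_small_imbalance()). */
unsigned long imbalance(unsigned long busiest_avg, unsigned long local_avg,
                        unsigned long domain_avg,
                        unsigned long busiest_cap, unsigned long local_cap)
{
    if (busiest_avg <= domain_avg || local_avg >= domain_avg)
        return 0;

    unsigned long max_pull = busiest_avg - domain_avg;
    return min_ul(max_pull * busiest_cap,
                  (domain_avg - local_avg) * local_cap) / SCHED_CAPACITY_SCALE;
}
```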
Out-Balanced Conditions
- Many conditions in the load balancer bail out without balancing
- Example: the imbalance is within the specified limit in find_busiest_group()

    if (env->idle == CPU_IDLE) {
        if ((local->idle_cpus < busiest->idle_cpus) &&
            busiest->sum_nr_running <= busiest->group_weight)
            goto out_balanced;
    } else {
        if (100 * busiest->avg_load <=
            env->sd->imbalance_pct * local->avg_load)
            goto out_balanced;
    }

- imbalance_pct
  - default = 125%
  - with SD_SHARE_CPUCAPACITY set = 110%
  - with SD_SHARE_PKG_RESOURCES set = 117%
  - for NUMA migration = 112%
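The imbalance_pct test above can be checked in isolation (a sketch, not kernel code): with the default 125%, the busiest group must carry at least 25% more average load than the local group before balancing is attempted. The 100* vs. pct* form avoids floating point, as in the kernel.

```c
#include <assert.h>

/* true => the imbalance is within the limit: goto out_balanced,
 * i.e. no balancing is attempted */
int within_imbalance_limit(unsigned long busiest_avg_load,
                           unsigned long local_avg_load,
                           unsigned int imbalance_pct)
{
    return 100UL * busiest_avg_load <=
           (unsigned long)imbalance_pct * local_avg_load;
}
```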
Choosing the Tasks
- load_balance() looks for tasks that are inactive (likely to be cache cold)
- load_balance() skips tasks that are
  - likely to be cache warm
  - currently running on a CPU
  - not allowed to run on the current CPU (as indicated by the cpus_allowed bitmask in the task_struct)
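The three skip rules above can be sketched as a single filter. The struct fields here are illustrative stand-ins; the kernel's can_migrate_task() checks task_running(), cpumask operations on cpus_allowed, and task_hot() instead.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative task model (field names are assumptions, not the
 * kernel's task_struct layout) */
struct task {
    int running;             /* currently executing on a CPU */
    uint64_t cpus_allowed;   /* affinity bitmask, one bit per CPU */
    int cache_hot;           /* likely to be cache warm */
};

/* Sketch of the selection rules: a task may migrate to dst_cpu only
 * if it is not running, is allowed on dst_cpu, and is cache cold. */
int can_migrate(const struct task *p, int dst_cpu)
{
    if (p->running)
        return 0;
    if (!(p->cpus_allowed & (1ULL << dst_cpu)))
        return 0;
    if (p->cache_hot)
        return 0;
    return 1;
}
```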
Handling Imbalance after Load Balancing
- If the load balancer has failed many times to move load, it sets active_balance
- active_balance uses a push mechanism to avoid physical/logical imbalance
  - active_load_balance_cpu_stop() pushes one task from the busiest CPU to an idle CPU, even if the busiest CPU has only one task
- If active_balance was set, the load balancer does not need to run at the next interval, because the work is covered by the push mechanism
Event-Driven Load Balancing
- Event-driven load balancing is controlled by setting flags (include/linux/sched.h)

    #define SD_LOAD_BALANCE     0x0001  /* Do load balancing on this domain. */
    #define SD_BALANCE_NEWIDLE  0x0002  /* Balance when about to become idle */
    #define SD_BALANCE_EXEC     0x0004  /* Balance on exec */
    #define SD_BALANCE_FORK     0x0008  /* Balance on fork, clone */
    #define SD_BALANCE_WAKE     0x0010  /* Balance on wakeup */

- When a task is newly created or woken up via fork(), exec(), or wakeup()
  - Select the least loaded group in its current domain
  - Move the task to the least loaded CPU
- When a CPU becomes newly idle
  - Select the most loaded group in its current domain
  - Move tasks from the most loaded CPU to this CPU
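How the flags gate event-driven balancing can be sketched as a bitmask test (a simplified model of the pattern used in the scheduler, not the exact kernel code path): each event type checks its own flag on the domain, and SD_LOAD_BALANCE must be set for any balancing at all.

```c
#include <assert.h>

/* flag values match the sched.h excerpt on this slide */
#define SD_LOAD_BALANCE     0x0001
#define SD_BALANCE_NEWIDLE  0x0002
#define SD_BALANCE_EXEC     0x0004
#define SD_BALANCE_FORK     0x0008
#define SD_BALANCE_WAKE     0x0010

/* Sketch: balance in this domain for this event only if the domain
 * allows balancing at all AND allows it for this event type */
int should_balance_on(unsigned int sd_flags, unsigned int event_flag)
{
    return (sd_flags & SD_LOAD_BALANCE) && (sd_flags & event_flag);
}
```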
Tickless Idle
- Traditional systems use a periodic interrupt ('tick') to update the system clock
- The tick requires a wakeup from the idle state
- Tickless idle eliminates the periodic timer tick while the CPU is idle
  - The CPU can remain in power-saving states for longer, reducing overall system power consumption