SOFT CONTAINER TOWARDS 100% RESOURCE UTILIZATION
ACCELA ZHAO, LAYNE PENG
WHAT IS RESOURCE UTILIZATION?
[Figure: what we buy vs. what we use; the gap between them is $$$ wasted]
ENERGY AND RESOURCE UTILIZATION
- Energy-related costs are 42% of the total (including buying new machines)
- An idle server can consume as much as 70% of the energy it uses at full speed
- Low resource utilization is energy inefficient
- Waste energy, waste money
A CLOSER LOOK AT CLOUD
- The key advantage of cloud: workload consolidation
- Improved resource utilization
- Fewer machines, more apps
- Energy efficient, and saves money
CLOUD RESOURCE UTILIZATION BIG PICTURE
- Scheduling: choose the best resource placement when an app starts
  Examples: Green Cloud, Paragon, and the schedulers in OpenStack, Kubernetes, Mesos
- Migration: continuously optimize the resource placement while an app is running
  Examples: OpenStack Watcher, VMware DRS
- Soft Container: dynamically bubble resource constraints up or down in response to co-located apps
  Related: Google Heracles
CLOUD RESOURCE UTILIZATION BIG PICTURE
[Diagram: Apps and the three components that manage their resources]
- Scheduler: manages resource utilization at app kick-off
- Soft Container: manages resource utilization at fine granularity inside a host
- Migration: manages resource utilization across hosts while the app is running
CLOUD RESOURCE UTILIZATION BIG PICTURE
- A battle between putting more apps on each host and guaranteeing app SLAs
- The key problem: resource interference
THE KEY PROBLEM: RESOURCE INTERFERENCE
- What is resource interference?
  - Apps co-located on one host share resources such as CPU, cache, and memory
  - They interfere with each other, which results in poorer performance than running standalone
  - Resource interference makes SLAs easy to violate
- Related readings
  - Google Heracles: an analysis of resource interference
  - Paragon: resource interference-aware scheduling
  - Bubble-up: a way to measure resource interference
RESOURCE INTERFERENCE: WHAT DOES IT LOOK LIKE?
[Figure: MySQL running standalone vs. co-located with a CPU- and disk-hungry task]
RESOURCE INTERFERENCE: HOW TO MEASURE?
- Bubble-up
- The setup
  - Run the app co-located with resource benchmarks; each benchmark stresses one type of resource
- Interference tolerated by the app
  - Slowly increase the benchmark stress until the app fails its SLA. The critical point shows how much resource interference the app can tolerate.
- Interference caused by the app
  - Run the app at the load its SLA requires. The stress it puts on each type of resource is the interference the app causes.
- Where to use it?
  - Better resource utilization management: scheduling, migration, Soft Container
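The "tolerated interference" step above can be sketched as a simple ramp. This is only an illustration under stated assumptions: the function names, the stress scale from 0.0 to 1.0, and the toy latency model are all hypothetical, and a real measurement would drive an actual stress benchmark and read the app's real latency.

```python
from typing import Callable, Optional, Sequence


def tolerated_interference(
    stress_levels: Sequence[float],
    measure_latency_ms: Callable[[float], float],
    sla_latency_ms: float,
) -> Optional[float]:
    """Return the highest stress level at which the app still meets its SLA.

    stress_levels must be sorted in ascending order. Returns None when the
    app violates its SLA even at the lowest level.
    """
    tolerated = None
    for level in stress_levels:
        if measure_latency_ms(level) > sla_latency_ms:
            break  # critical point reached: the SLA fails from here on
        tolerated = level
    return tolerated


def fake_mysql_latency(cpu_stress: float) -> float:
    """Hypothetical stand-in for a real measurement of the co-located app."""
    return 10.0 + 100.0 * cpu_stress ** 2


# Ramp the (hypothetical) CPU benchmark from 0% to 100% in 10% steps.
levels = [i / 10 for i in range(11)]
critical = tolerated_interference(levels, fake_mysql_latency, sla_latency_ms=50.0)
print(critical)
```

Repeating the ramp once per resource type (CPU, disk, and so on) gives one critical point per resource, which is the kind of profile a scheduler or Soft Container could consume.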
RESOURCE INTERFERENCE: HOW TO MEASURE?
[Figure: MySQL running standalone vs. co-located with CPU stress vs. co-located with disk stress]
In my case, MySQL is much more sensitive to CPU interference.
INTRODUCING SOFT CONTAINER
- Motivations
  - Increase resource utilization by co-locating more apps
    E.g. business services are critical but may not use all the resources on the host. Add low-priority Hadoop batch tasks to fill what is left.
  - Respond to the dynamic nature of time-varying workloads
    E.g. a business service may become more idle at lunch time; Hadoop tasks can then expand their resource bubble and use the leftover.
  - Guarantee the SLA of critical apps
    E.g. when the business service suddenly requires more resources for processing, Hadoop tasks shrink instantly to give resources back.
- Challenges
  - Resource control and isolation of interference
  - Responding to dynamic workload change
RESOURCES
- CPU: Core, Time, Quota
- Disk I/O: IOPS, Throughput
- Memory: Size, Bandwidth
RESOURCES - MISSING
- CPU: Core, Time, Quota
- Disk I/O: IOPS, Throughput
- Memory: Size, Bandwidth*
- Cache: LLC
- Network: Ulimit, Bandwidth
- GPU: Device*
* Still waiting; some implemented in house
ISOLATE THE RESOURCES - NAMESPACE
- clone(): creates a new process and places it in a new namespace
- unshare(): creates a new namespace and attaches the existing (calling) process to it
- setns(): attaches the calling process to an existing namespace
- /proc/<pid>/ns:
  lrwxrwxrwx 1 root root 0 Jun 21 18:38 ipc -> ipc:[4026532509]
  lrwxrwxrwx 1 root root 0 Jun 21 18:38 mnt -> mnt:[4026532507]
  lrwxrwxrwx 1 root root 0 Jun 16 18:24 net -> net:[4026532512]
  lrwxrwxrwx 1 root root 0 Jun 21 18:38 pid -> pid:[4026532510]
  lrwxrwxrwx 1 root root 0 Jun 21 18:38 user -> user:[4026531837]
  lrwxrwxrwx 1 root root 0 Jun 21 18:38 uts -> uts:[4026532508]
- We are still waiting for:
  - security namespace
  - security keys namespace
  - device namespace
  - time namespace
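The /proc/<pid>/ns listing above can be used to check which namespaces two processes share: two processes are in the same namespace when the inode numbers in their link targets match. The following is a minimal sketch; the helper names are hypothetical, and the "host" values are made up for illustration (only the "container" values come from the listing).

```python
import os
import re
from typing import Dict, List, Tuple

_NS_TARGET = re.compile(r"^(\w+):\[(\d+)\]$")


def parse_ns_target(target: str) -> Tuple[str, int]:
    """Split a link target such as 'net:[4026532512]' into (type, inode)."""
    match = _NS_TARGET.match(target)
    if match is None:
        raise ValueError(f"not a namespace link target: {target!r}")
    return match.group(1), int(match.group(2))


def read_namespaces(pid: int) -> Dict[str, int]:
    """Read /proc/<pid>/ns on a Linux host; may need root for other users' processes."""
    ns_dir = f"/proc/{pid}/ns"
    result = {}
    for name in os.listdir(ns_dir):
        _, inode = parse_ns_target(os.readlink(os.path.join(ns_dir, name)))
        result[name] = inode
    return result


def shared_namespaces(a: Dict[str, int], b: Dict[str, int]) -> List[str]:
    """Names of the namespaces that both processes are members of."""
    return sorted(name for name in a if name in b and a[name] == b[name])


# Values from the listing above.
container = {"ipc": 4026532509, "mnt": 4026532507, "net": 4026532512,
             "pid": 4026532510, "user": 4026531837, "uts": 4026532508}
# Hypothetical values for a process outside the container.
host = {"ipc": 4026531839, "mnt": 4026531840, "net": 4026531956,
        "pid": 4026531836, "user": 4026531837, "uts": 4026531838}

print(shared_namespaces(container, host))
```

On a Linux machine, `read_namespaces(os.getpid())` would return the same kind of dictionary for the current process.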
LIMIT THE RESOURCE - CGROUP
- Task, Control Group & Hierarchy
- Subsystem: what can be controlled
  blkio, cpu, cpuacct, cpuset, devices, freezer, memory, net_cls, net_prio, ns
- Usage
  - Create a cgroup in a subsystem
  - Change the limit:
    # echo 524288000 > /sys/fs/cgroup/memory/foo/memory.limit_in_bytes
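The two usage steps above (create a cgroup, then change its limit) can be sketched programmatically. This assumes the cgroup v1 layout used in the echo command; the function names are hypothetical. So that the sketch runs without root, the demo points at a temporary directory standing in for /sys/fs/cgroup/memory; on a real cgroup mount the kernel creates the control files when the directory is made.

```python
import os
import tempfile


def create_cgroup(subsystem_root: str, name: str) -> str:
    """Create a cgroup by making a directory under the subsystem mount point."""
    path = os.path.join(subsystem_root, name)
    os.makedirs(path, exist_ok=True)
    return path


def set_memory_limit(cgroup_dir: str, limit_bytes: int) -> None:
    """Equivalent of: echo <limit_bytes> > <cgroup_dir>/memory.limit_in_bytes"""
    with open(os.path.join(cgroup_dir, "memory.limit_in_bytes"), "w") as f:
        f.write(str(limit_bytes))


# Stand-in for /sys/fs/cgroup/memory (assumption: a real run needs root and cgroup v1).
fake_root = tempfile.mkdtemp()
foo = create_cgroup(fake_root, "foo")
set_memory_limit(foo, 500 * 1024 * 1024)  # 500 MiB = 524288000 bytes

with open(os.path.join(foo, "memory.limit_in_bytes")) as f:
    written = f.read()
print(written)
```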
MISSING - NETWORK
Community attempts: based on Traffic Control (tc)
MISSING - GPU
NVIDIA's efforts:
a. GPUs are exposed as separate, ordinary devices in /dev
b. devices cgroup => partially supported: Allow/Deny/List access
   i. R (read)
   ii. W (write)
   iii. M (mknod)
Ref: https://github.com/nvidia/nvidia-docker/wiki/gpu-isolation
MISSING - CACHE
Intel's efforts:
- Cache Monitoring Technology (CMT)
  - Lets an OS or VMM assign a software-defined ID to each application or VM scheduled to run on a core. This ID is called the Resource Monitoring ID (RMID).
  - Monitors cache occupancy on a per-RMID basis
  - Lets an OS or VMM read the LLC occupancy for a given RMID at any time
- Cache Allocation Technology (CAT)
  - The ability to enumerate the CAT capability and the associated LLC allocation support via CPUID
  - Interfaces for the OS/hypervisor to group applications into classes of service (CLOS) and indicate the amount of last-level cache available to each CLOS. These interfaces are based on MSRs (Model-Specific Registers).
- Code and Data Prioritization (CDP)
  - An extension to CAT
  - A new CPUID feature flag is added within the CAT sub-leaves at CPUID.0x10.[ResID=1]:ECx[bit 2] to indicate support
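With CAT, the amount of LLC given to each class of service is expressed as a capacity bitmask in which each set bit grants one portion of the cache, and the set bits are expected to be contiguous. The following sketch only does the bit arithmetic; it does not touch MSRs or any kernel interface, and the 20-bit mask width and the split between a critical and a batch class are assumptions chosen for illustration.

```python
def contiguous_mask(num_bits: int, offset: int = 0) -> int:
    """Build a capacity bitmask with num_bits contiguous ones, shifted left by offset."""
    if num_bits < 1 or offset < 0:
        raise ValueError("need at least one bit and a non-negative offset")
    return ((1 << num_bits) - 1) << offset


def is_contiguous(mask: int) -> bool:
    """True if the set bits of mask form one unbroken run."""
    if mask <= 0:
        return False
    lowest = mask & -mask          # isolate the lowest set bit
    run = mask // lowest           # drop the trailing zeros
    return run & (run + 1) == 0    # a run of ones plus one is a power of two


# Assumption: a 20-bit mask. The critical class gets the upper 16 portions,
# the batch class the lower 4, so the two classes do not overlap.
critical_clos = contiguous_mask(16, offset=4)
batch_clos = contiguous_mask(4)

print(hex(critical_clos), hex(batch_clos))
print(critical_clos & batch_clos == 0)
```

Shrinking or expanding a class's cache share then amounts to recomputing its mask and writing it through whatever interface the platform provides.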
MISSING - MEMORY BANDWIDTH
- Monitor: Memory Bandwidth Monitoring (MBM)
  - Mechanisms in hardware to monitor cache occupancy and bandwidth statistics, as applicable to a given product generation, on a per-software-ID basis
  - Mechanisms for the OS or hypervisor to read back the collected metrics, such as L3 occupancy or memory bandwidth, for a given software ID at any point during runtime
- Control
  - Ref: Memory Bandwidth Management for Efficient Performance Isolation in Multi-core Platform: http://pertsserver.cs.uiuc.edu/~mcaccamo/papers/private/ieee_tc_journal_submitted_c.pdf
  - Code: https://github.com/heechul/memguard
WATCH THE WORKLOAD CHANGE
- Latencies: app request latency, disk IO await, network response time
- Queue length: CPU load average, disk request queue size, network queue length
- Utilization: CPU util rate, disk util rate, network util rate
- Bandwidth: DRAM bandwidth, CPU bandwidth, disk bandwidth
- Request count: app request count, disk IOPS / req/s, network IOPS / req/s
- Granularity: global level, per container level
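As one example of a global-level utilization metric, the CPU util rate can be derived from two samples of the aggregate "cpu" line in /proc/stat on Linux. This is a sketch, not the Watcher's actual code; the function names are hypothetical and the sample lines are made up. Whether iowait counts as busy or idle is a choice, so it is exposed as a parameter.

```python
from typing import List


def parse_cpu_line(line: str) -> List[int]:
    """Return the first eight counters of a /proc/stat 'cpu' line:
    user, nice, system, idle, iowait, irq, softirq, steal."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(value) for value in fields[1:9]]


def cpu_utilization(prev: List[int], curr: List[int], iowait_is_busy: bool = False) -> float:
    """Fraction of CPU time spent busy between two samples."""
    deltas = [c - p for p, c in zip(prev, curr)]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    idle = deltas[3]
    if not iowait_is_busy:
        idle += deltas[4]  # treat time waiting on IO as idle
    return 1.0 - idle / total


# Two made-up samples taken some interval apart.
before = parse_cpu_line("cpu  100 0 50 800 50 0 0 0 0 0")
after = parse_cpu_line("cpu  160 0 70 1200 170 0 0 0 0 0")

print(round(cpu_utilization(before, after), 4))
print(round(cpu_utilization(before, after, iowait_is_busy=True), 4))
```

A per-container variant would read the container's cpuacct counters instead of the host-wide line.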
THE FEEDBACK CONTROL LOOP
[Diagram: inside Soft Container, the Watcher observes the Containers and reports to the Controller; the Controller drives the Limiter, which resizes the Containers]
- The loop must respond immediately
- How to immediately resize the containers?
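One way to picture the loop is a controller that shrinks the low-priority bubble quickly when the critical app's latency exceeds its SLA and expands it slowly when there is headroom. The sketch below is an assumption for illustration, not the controller algorithm used by Soft Container; every name, threshold, and step size in it is hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BubbleController:
    """Decides the low-priority task's quota from the critical app's latency."""
    min_quota: float
    max_quota: float
    sla_latency_ms: float
    shrink_factor: float = 0.5   # shrink aggressively to protect the SLA
    expand_step: float = 5.0     # expand cautiously to reclaim idle capacity
    headroom: float = 0.8        # expand only when latency is well under the SLA

    def next_quota(self, quota: float, latency_ms: float) -> float:
        if latency_ms > self.sla_latency_ms:
            quota *= self.shrink_factor
        elif latency_ms < self.headroom * self.sla_latency_ms:
            quota += self.expand_step
        # Otherwise hold: latency is close to the SLA but not violating it.
        return max(self.min_quota, min(self.max_quota, quota))


def run_loop(controller: BubbleController, quota: float, latencies: List[float]) -> List[float]:
    """Feed watcher samples to the controller; the returned list is what the limiter would apply."""
    applied = []
    for latency_ms in latencies:          # watcher output
        quota = controller.next_quota(quota, latency_ms)
        applied.append(quota)             # limiter input
    return applied


controller = BubbleController(min_quota=10.0, max_quota=90.0, sla_latency_ms=50.0)
history = run_loop(controller, quota=40.0, latencies=[20.0, 20.0, 80.0, 45.0, 20.0])
print(history)
```

The asymmetry (multiplicative shrink, additive expand) is one common design choice for protecting the critical app first; other policies fit the same watcher, controller, limiter shape.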
HOW DO WE LOOK AT SHRINK & EXPANSION?
a. Create a new container
b. Live migrate the contents to the new container:
   1. Transfer the existing data to the new container
   2. Transfer the instant (in-flight) data to the new container
c. Stop the old container
d. Start the new container
e. Route the traffic to the new container
IN THE CONTAINER'S WORLD
Process: 9527 /usr/sbin/httpd
a. Mount to a new cgroup, or change the values of the current cgroup
b. Done!
Control Groups (cgroup):
  CPU time: 20
  System memory: 1G
  Disk bandwidth: 2000
  Network bandwidth: 100 Mbps
Control Groups (cgroup):
  CPU time: 70
  System memory: 5G
  Disk bandwidth: 8000
  Network bandwidth: 1 Gbps
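Both options in step a can be sketched as plain file writes, assuming a cgroup v1 memory hierarchy: either rewrite the limit of the cgroup the process is already in, or write the PID into the cgroup.procs file of another, differently sized cgroup. The function names are hypothetical, and the demo uses temporary directories as stand-ins for real cgroup directories so that it runs without root.

```python
import os
import tempfile

GIB = 1024 ** 3


def resize_in_place(cgroup_dir: str, limit_bytes: int) -> None:
    """Option 1: change the value of the cgroup the process already belongs to."""
    with open(os.path.join(cgroup_dir, "memory.limit_in_bytes"), "w") as f:
        f.write(str(limit_bytes))


def move_to_cgroup(cgroup_dir: str, pid: int) -> None:
    """Option 2: attach the process to another cgroup by writing its PID to cgroup.procs."""
    with open(os.path.join(cgroup_dir, "cgroup.procs"), "w") as f:
        f.write(str(pid))


# Stand-ins for two cgroup directories (assumption: not a real cgroup mount).
root = tempfile.mkdtemp()
small = os.path.join(root, "small")
large = os.path.join(root, "large")
os.makedirs(small)
os.makedirs(large)

resize_in_place(small, 1 * GIB)   # the 1G cgroup from the slide
resize_in_place(large, 5 * GIB)   # the 5G cgroup from the slide
move_to_cgroup(large, 9527)       # expand httpd by moving it, no restart needed

with open(os.path.join(large, "cgroup.procs")) as f:
    moved_pid = f.read()
with open(os.path.join(large, "memory.limit_in_bytes")) as f:
    large_limit = f.read()
print(moved_pid, large_limit)
```

Neither option stops the process or migrates its data, which is the contrast with the create, migrate, stop, start, reroute sequence.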
IN THE CONTAINER'S WORLD
We need to take a fresh look at resource management from the container's perspective.
SOFT CONTAINER: IMPLEMENTATION
[Architecture diagram]
- Controller: Algorithm expand, Algorithm pin_idle, Algorithm plugin N
- Container Repo: RunC plugin, Docker plugin, Container type N
- Watcher: CPU plugin, Disk plugin, Watcher plugin N
- Metrics Store: CPU statistics, Disk, More
- Auto discovery
- Limiter: RunC plugin, Docker plugin, Limiter plugin N
- Containers
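The plugin layout above (watcher plugins feeding a metrics store) could be organized around a small registry. The sketch below is not Soft Container's actual plugin API; the decorator, the registry, and the fixed placeholder readings are all hypothetical and only show the shape of the design.

```python
from typing import Callable, Dict

# Registry of watcher plugins: metric name -> function returning the current reading.
WATCHER_PLUGINS: Dict[str, Callable[[], float]] = {}


def watcher_plugin(name: str) -> Callable[[Callable[[], float]], Callable[[], float]]:
    """Decorator that registers a metric source under the given name."""
    def register(fn: Callable[[], float]) -> Callable[[], float]:
        WATCHER_PLUGINS[name] = fn
        return fn
    return register


@watcher_plugin("cpu")
def cpu_metric() -> float:
    return 0.42  # placeholder reading; a real plugin would sample the host


@watcher_plugin("disk")
def disk_metric() -> float:
    return 0.9  # placeholder reading


def collect() -> Dict[str, float]:
    """Poll every registered plugin; the result is what a metrics store would keep."""
    return {name: fn() for name, fn in WATCHER_PLUGINS.items()}


metrics_store = collect()
print(metrics_store)
```

Adding "Watcher plugin N" then means registering one more function, with no change to the collection code; limiter and algorithm plugins could follow the same pattern.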
SOFT CONTAINER: CURRENT STATUS
- Supports RunC and Docker containers
- A few effective controller algorithms
- Extensible with more plugins
- Completely runnable!
Demo Time :-)
BENCHMARK RESULTS: BEFORE
If uncontrolled, the MySQL workload suffers severe interference from a co-located low-priority task.
BENCHMARK RESULTS: BEFORE
CPU utilization is far from saturation as the workload varies over time. (Although in my case, disk IO is highly utilized.)
BENCHMARK RESULTS: SOFT CONTAINER
With Soft Container (green line), the latency impact is controlled. (We can improve the algorithm to cope better with peak workload.)
BENCHMARK RESULTS: SOFT CONTAINER
Soft Container helps improve CPU utilization by co-locating new tasks with MySQL.
BENCHMARK RESULTS: SOFT CONTAINER
CPU utilization looks close to saturation once iowait time is added in.
BENCHMARK RESULTS: SOFT CONTAINER
How the resource bubble floats under the control of Soft Container. (The vibration thresholds are made very sensitive to workload change.)
Q&A