Cache-Aware Compositional Analysis of Real-Time Multicore Virtualization Platforms
Meng Xu, Linh T.X. Phan, Insup Lee, Oleg Sokolsky, Sisu Xi, Chenyang Lu and Christopher D. Gill
Complex Systems on Multicore Platforms
- Embedded systems become more and more complex and consist of multiple sub-systems
- Multicore platforms: the number of cores keeps increasing
[Image sources: http://www.codeproject.com/articles/16165/robotics-embedded-systems-part-i; International Technology Roadmap for Semiconductors, 2007 edition: System drivers]
Virtualization
- The benefits of virtualization: consolidate legacy systems; integrate large, complex systems
[Figure: VMs (VM0, VM1, VM2), each with a guest OS and VCPUs, running on a virtual machine monitor over CPUs with per-core caches]
Compositional Analysis for RT Guarantees
- Step 1: Abstract each component (VM) into an interface
- Step 2: Transform each interface into a set of VCPUs
- Step 3: Abstract the VCPUs of all VMs into the system's interface
- VCPU: (Period, Budget)
[Figure: interfaces of VM0, VM1, VM2 composed into the interface of the system, above the virtual machine monitor and CPUs]
Limitations of Existing Multicore Compositional Analysis
- Existing multicore compositional analysis does not consider platform overhead
- In practice, platform overhead is not negligible; examples: cache overhead due to task preemption, VCPU preemption, and VCPU completion
- Result: unsafe analysis! Reason: the analysis does not consider the effect of cache overhead in virtualization and under-estimates the resource a component needs
Contributions
- Introduce overhead-free compositional analysis based on DMPR, an improved MPR resource model
- Quantify the events that cause cache overhead: task-preemption events, VCPU-preemption events, VCPU-completion events
- Propose cache-aware compositional analysis: a hybrid analysis combining task-centric analysis and model-centric analysis
Deterministic Multi-Processor Resource Model (DMPR)
- A DMPR interface µ = (Π, Θ, m) provides m full VCPUs (each with bandwidth 1) and one partial VCPU with period Π and budget Θ
- Interface bandwidth = m + Θ/Π
[Figure: worst-case resource supply of the DMPR µ = (5, 1, 2): two full VCPUs and one partial VCPU (VP1, VP2, VP3)]
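The DMPR interface above can be sketched in Python. This is a minimal illustration, not the paper's implementation: it assumes the classical periodic supply bound for the partial VCPU (budget delivered as late as possible in the first period, as early as possible afterwards), and the function names are ours.

```python
from math import floor

def dmpr_bandwidth(Pi, Theta, m):
    # Bandwidth of a DMPR interface mu = (Pi, Theta, m):
    # m full VCPUs (bandwidth 1 each) plus Theta/Pi for the partial VCPU.
    return m + Theta / Pi

def dmpr_sbf(Pi, Theta, m, t):
    # Worst-case (overhead-free) resource supply of mu over an interval
    # of length t: the m full VCPUs contribute m*t, and the partial VCPU
    # is bounded by the classical periodic supply bound (assumption).
    if t <= 0:
        return 0.0
    full = m * t
    if Theta == 0:
        return full
    x = Pi - Theta                      # offset before the first guaranteed supply
    y = max(0, floor((t - x) / Pi))     # complete periods after the offset
    partial = y * Theta + max(0.0, t - x - y * Pi - (Pi - Theta))
    return full + partial
```

For µ = (5, 1, 2) this gives bandwidth 2.2, and the partial VCPU supplies nothing until the blackout interval of length 2(Π − Θ) = 8 has elapsed.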
Assumptions
- Each core has a private cache; no shared cache
- The period of each component's interface is given by designers
- The maximum cache overhead per task preemption or migration (CRPMD) in the system is upper bounded by Δ_crpmd
- The virtual machine monitor uses hybrid EDF (hEDF): full VCPUs are pinned to cores, and the remaining VCPUs are scheduled by EDF
[Figure: hEDF scheduling of VCPUs VP1-VP5 on cpu1-cpu4, with pinned full VCPUs]
Outline
- Introduction
- Events that cause cache overhead
- Cache-aware compositional analysis
- Evaluation
Event 1: Task-Preemption Event
- Definition: a task-preemption event happens when a task preempts another task within the same VM
[Figure: example with tasks τ1, τ2, τ3 on cpu1 and cpu2; τ1 has higher priority than τ3 and preempts it, so τ3 incurs cache overhead when it resumes]
Event 2: VCPU-Preemption Event
- Definition: a VCPU-preemption event occurs when a VCPU is preempted by a VCPU of another VM
[Figure: (a) VM configurations with interfaces µ1, µ2, µ3; (b) the resulting full and partial VCPUs VP1-VP5 under hEDF, with full VCPUs pinned to cores]
Event 2: VCPU-Preemption Event
[Figure: (c) scheduling of the partial VCPUs; (d) cache overhead of tasks in component C2: while VP2 is preempted it is unavailable, and the task running on it incurs cache overhead when VP2 resumes]
Event 3: VCPU-Completion Event
- Definition: a VCPU-completion event of a VCPU happens when the VCPU exhausts its budget in a period and stops its execution
[Figure: example for component C2: when the partial VCPU VP2 exhausts its budget and stops, it is unavailable, and the task running on it incurs cache overhead caused by the VCPU-completion event]
Outline
- Introduction
- Events that cause cache overhead
- Cache-aware compositional analysis
- Evaluation
Task-Centric Analysis
- Task-preemption event: inflate the higher-priority task with one cache overhead
- VCPU-preemption/completion events: inflate each task with the number of cache overheads caused by VCPU-preemption/completion events during the task's period
- Inflated WCET of task τ_k: e'_k = e_k + Δ_crpmd + Δ_crpmd · (N_2,k + N_3,k), where N_2,k and N_3,k are the numbers of VCPU-preemption and VCPU-completion events of τ_k's VCPU VP_i during a period of τ_k
- See the paper for how to compute the number of VCPU-preemption/completion events
Task-Centric Analysis
- Inflated WCET of each task: e'_k = e_k + Δ_crpmd + Δ_crpmd · (N_2,k + N_3,k)
- The system is schedulable under cache overhead if the inflated workload is schedulable
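The inflation step above can be sketched in Python. The per-task event counts are taken as inputs, since the paper's method for bounding them is not reproduced here; all names are ours.

```python
def inflate_wcet(e_k, delta_crpmd, n_preempt_events, n_complete_events):
    # Inflated WCET: original WCET, plus one CRPMD for the task-preemption
    # event, plus one CRPMD per VCPU-preemption/completion event that can
    # occur during the task's period.
    return e_k + delta_crpmd + delta_crpmd * (n_preempt_events + n_complete_events)

def inflate_taskset(tasks, delta_crpmd, event_counts):
    # tasks: list of (period, wcet); event_counts: (N2, N3) per task.
    return [(p, inflate_wcet(e, delta_crpmd, n2, n3))
            for (p, e), (n2, n3) in zip(tasks, event_counts)]
```

The inflated task set is then fed to an ordinary (overhead-free) schedulability test.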
Pessimistic When the Number of Tasks Is Large
- Only two tasks incur cache overhead in a VCPU-preemption/completion event
- But we don't know which two tasks they are
- To be safe, we have to inflate every task's WCET with one cache overhead per VCPU-preemption/completion event
[Figure: cache overhead in a VCPU-completion event on VP2; only two of the tasks actually incur cache overhead due to the event]
Model-Centric Approach
- Subtract the overhead due to VCPU-preemption/completion events from the original resource supply of the interface to obtain its effective resource supply
- How to compute the effective resource supply (red line)?
[Figure: supply timeline annotated with VCPU-preemption/completion event overhead and task-preemption event overhead]
Effective SBF of a DMPR Interface
- Effective SBF of the interface = effective SBF of the partial VCPU + effective SBF of the m full VCPUs
- Reason: a DMPR interface provides resource through one partial VCPU and m full VCPUs
Worst-Case Scenario of the Effective Resource Supply of the Partial VCPU
- The worst case happens when:
  (1) the partial VCPU incurs all VCPU-preemption/completion events;
  (2) the partial VCPU incurs the overhead as late as possible in the first period and as early as possible in the remaining periods;
  (3) the time interval t begins when the VCPU finishes supplying its effective resource in the first period
- The maximum number of VCPU-preemption/completion events during a partial VCPU's period is computed in the paper
- Proof is in the paper
[Figure: worst-case effective resource supply of the partial VCPU VP_i over t1, ..., t8]
Effective Resource Supply of the Partial VCPU

  SBF_stop(t) = y·Θ* + max{0, t − x − y·Π − z}   if Θ* > 0
  SBF_stop(t) = 0                                 if Θ* = 0

where VP_i belongs to the interface µ = (Π, Θ, m), and
  Θ* = max{0, Θ − N_stop,VPi(t) · Δ_crpmd},
  x = Π − Θ*,  y = ⌊(t − x)/Π⌋,  z = Π − Θ*.

[Figure: worst-case effective resource supply of the partial VCPU VP_i over t1, ..., t8, with x and z marked]
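This effective supply bound can be sketched in Python. The event bound N_stop is passed in as a callable, since its computation appears only in the paper; all function and parameter names are ours.

```python
from math import floor

def effective_partial_sbf(Pi, Theta, t, n_stop, delta_crpmd):
    # Effective budget Theta* after discounting the cache overhead of
    # VCPU-preemption/completion events (n_stop(t) bounds their number).
    theta_star = max(0.0, Theta - n_stop(t) * delta_crpmd)
    if theta_star == 0:
        return 0.0
    x = Pi - theta_star                 # offset of the first guaranteed supply
    y = max(0, floor((t - x) / Pi))     # complete periods within [x, t]
    z = Pi - theta_star                 # latest start of supply in a period
    return y * theta_star + max(0.0, t - x - y * Pi - z)
```

With n_stop(t) = 0 this reduces to the overhead-free supply bound of a periodic resource (Π, Θ).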
Effective SBF of the Interface
- Effective SBF of the interface = effective SBF of the partial VCPU + effective SBF of the m full VCPUs
Model-Centric Analysis
- Step 1: Account for task-preemption event overhead
- Step 2: Account for VCPU-preemption/completion event overhead
- Step 3: Check that the effective resource supply >= the resource demand
[Example: a component with tasks τ1, ..., τ5 = (20, 5) and its DMPR interface with Π = 10, Θ = 8.5]
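Step 3 can be sketched as a pointwise comparison of supply and demand. This is a hypothetical discretized check, not the paper's exact procedure: both bound functions and the horizon are passed in as assumptions.

```python
def schedulable(effective_sbf, demand_bound, horizon, step=1):
    # Model-centric check: the interface's effective resource supply must
    # cover the component's resource demand at every check point in
    # (0, horizon]. A single violated point means "not schedulable".
    t = step
    while t <= horizon:
        if effective_sbf(t) < demand_bound(t):
            return False
        t += step
    return True
```

In practice the horizon and the set of check points must come from the schedulability test being used; a coarse step can only make the check optimistic, so a real implementation would test all relevant time points.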
Pessimistic When the Number of Full VCPUs Is Large
- In practice, only one full VCPU is affected per VCPU-preemption/completion event
- But all full VCPUs are marked unavailable at each VCPU-preemption/completion event when we compute the effective SBF of the m full VCPUs
[Figure: full VCPUs VP1-VP4 over t1, ..., t7; at each CRPMD event the analysis treats the resource on all full VCPUs as unavailable]
Task-Centric vs. Model-Centric
- Neither of the two analyses dominates the other
- Example where task-centric is better: bandwidth of task-centric analysis 4.94 vs. bandwidth of model-centric analysis 6.90
- Example where model-centric is better: bandwidth of task-centric analysis 3.82 vs. bandwidth of model-centric analysis 2.86
[Figure: two example systems, each with components C1 (period 20) and C2 (period 50) composed under hEDF into C (period 5); the two systems differ in task WCETs and Δ_crpmd]
Hybrid Cache-Aware Analysis
[Example system: components C1 (period 20, tasks τ1, ..., τ5) and C2 (period 50, tasks τ6, ..., τ10), each task = (100, 25), composed under hEDF into the system component C (period 5)]
Hybrid Cache-Aware Analysis
- Task-centric analysis: C1: µ1 = (20, 9.8, 0); C2: µ2 = (50, 36.1, 1); system interface µ = (5, 4.1, 3), bandwidth 3.82
- Model-centric analysis: C1: µ1 = (20, 8.8, 0); C2: µ2 = (50, 39.7, 1); system interface µ = (5, 4.3, 2), bandwidth 2.86
- The hybrid analysis applies both analyses and keeps the result with the smaller bandwidth
[Example workload: τ1, ..., τ5 = (100, 25) in C1 and τ6, ..., τ10 = (100, 25) in C2]
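The hybrid selection rule can be sketched in Python, under the assumption (consistent with this example) that the hybrid analysis runs both analyses and keeps the lower-bandwidth interface; the function names are ours.

```python
def dmpr_bandwidth(mu):
    # Bandwidth of a DMPR interface mu = (Pi, Theta, m).
    Pi, Theta, m = mu
    return m + Theta / Pi

def hybrid_interface(mu_task_centric, mu_model_centric):
    # Keep whichever analysis yields the cheaper (lower-bandwidth) interface.
    if dmpr_bandwidth(mu_task_centric) <= dmpr_bandwidth(mu_model_centric):
        return mu_task_centric
    return mu_model_centric
```

For the system interfaces above, hybrid_interface((5, 4.1, 3), (5, 4.3, 2)) returns (5, 4.3, 2), whose bandwidth is 2.86.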
Outline
- Introduction
- Events that cause cache overhead
- Cache-aware compositional analysis
- Evaluation
Experimental Setup
- Hardware: Dell Optiplex-980 quad-core workstation (3 cores for guest VMs, 1 core for VM0)
- Hypervisor: RT-Xen with the hEDF scheduler
- Δ_crpmd = 1.9 ms, measured with LITMUS^RT (WSS = 256KB)
- Task set: utilization 1.8; task utilization distribution: uniformly in [0.001, 0.1]
[Figure: four guest domains D1-D4 and their VCPU periods (Π1 = 256, Π2 = 128, Π3 = 64)]
Cache Overhead Is Not Negligible
- Unsafe: a taskset claimed schedulable by overhead-free analysis is not schedulable in practice

               MPR                  DMPR
               Theory   RT-Xen      Theory   RT-Xen
  Schedulable  Yes      No          Yes      No

- Safe: the same taskset is claimed NOT schedulable by cache-aware analysis

               Cache-aware Hybrid   Cache-aware Task-centric
               Theory   RT-Xen      Theory   RT-Xen
  Schedulable  No       No          No       No
Simulation Setup
- hEDF with four guest domains D1-D4 (same structure as the experimental setup)
- Δ_crpmd = 0.9 ms
- Task's period: uniformly in [350ms, 850ms]
- Task's utilization:
  uniform: uniformly in [0.001, 0.1]
  light bimodal: 8/9 in [0.1, 0.4] and 1/9 in [0.5, 0.9]
  medium bimodal: 6/9 in [0.1, 0.4] and 3/9 in [0.5, 0.9]
  heavy bimodal: 4/9 in [0.1, 0.4] and 5/9 in [0.5, 0.9]
Hybrid Analysis Saves Bandwidth
- The hybrid approach saves bandwidth over task-centric analysis for 64% of the tasksets
[Figure: bandwidth saved by hybrid analysis over task-centric analysis per taskset, by utilization; Δ_crpmd / average WCET = 0.003]
Hybrid Analysis Saves Bandwidth
- Hybrid analysis still saves bandwidth over task-centric analysis when the distribution of task utilizations changes
[Figure: results for (a) bimodal-light (Δ_crpmd / average WCET = 0.0005), (b) bimodal-medium (0.0004), and (c) bimodal-heavy (0.0003) distributions]
Related Work
- Overhead-free compositional analysis:
  S. Baruah and N. Fisher. Component-based design in multiprocessor real-time systems. In ICESS, 2009.
  A. Easwaran, I. Shin, and I. Lee. Optimal virtual cluster-based multiprocessor scheduling. Real-Time Systems, 43(1):25-59, 2009.
  H. Leontyev and J. H. Anderson. A hierarchical multiprocessor bandwidth reservation scheme with timing guarantees. In ECRTS, 2008.
  G. Lipari and E. Bini. A framework for hierarchical scheduling on multiprocessors: From application requirements to run-time allocation. In RTSS, 2010.
  E. Bini, M. Bertogna, and S. Baruah. Virtual multiprocessor platforms: Specification and use. In RTSS, 2009.
- Overhead-aware analysis in non-virtualized environments:
  B. B. Brandenburg. Scheduling and Locking in Multiprocessor Real-Time Operating Systems. PhD thesis, The University of North Carolina at Chapel Hill, 2011.
- Methods for obtaining the cache overhead value:
  A. Bastoni, B. B. Brandenburg, and J. H. Anderson. Cache-Related Preemption and Migration Delays: Empirical Approximation and Impact on Schedulability. In OSPERT, 2010.
  S. Altmeyer, R. I. Davis, and C. Maiza. Improved cache related preemption delay aware response time analysis for fixed priority preemptive systems. Real-Time Systems, 2012.
Conclusion
- Contributions: propose the DMPR resource model; introduce overhead-free compositional analysis under DMPR; quantify the events that cause cache overhead; propose cache-aware compositional analysis
- Future work: extend our method to multi-level cache hierarchies with shared caches; explore cache management methods to reduce cache overhead