Challenges for Worst-Case Execution Time Analysis of Multi-core Architectures
Jan Reineke, Saarland University, Computer Science
Intel, Braunschweig, April 29, 2013
The Context: Hard Real-Time Systems
Safety-critical applications: avionics, automotive, train industries, manufacturing.
- Side airbag in a car: reaction in < 10 msec.
- Crankshaft-synchronous tasks: reaction in < 45 microsec.
Embedded controllers must finish their tasks within given time bounds. Developers would like to know the worst-case execution time (WCET) to give such a guarantee.
The Timing Analysis Problem
Given embedded software with timing requirements, running on a given microarchitecture: does the software meet its timing requirements?
What does the execution time depend on?
- The input, which determines the path taken through the program.
- The state of the hardware platform, due to caches, pipelining, speculation, etc.
- Interference from the environment: external interference, as seen from the analyzed task, on shared buses, caches, and memory.
The hardware ranges from a simple CPU with plain memory, through a complex CPU (out-of-order execution, branch prediction, etc.) with an L1 cache and main memory, to multicores in which several complex CPUs with private L1 caches share an L2 cache and main memory.
Example of the Influence of Microarchitectural State
x = a + b; compiles to:
LOAD r2, _a
LOAD r1, _b
ADD  r3, r2, r1
(Figure omitted: distribution of execution times of this snippet on the PowerPC 755, depending on the initial microarchitectural state.)
Example of the Influence of Co-running Tasks in Multicores
Radojkovic et al. (ACM TACO, 2012), on the Intel Atom and Intel Core 2 Quad: up to 14x slowdown due to interference on the shared L2 cache and memory controller.
Challenges
1. Modeling: how to construct sound timing models?
2. Analysis: how to bound the WCET precisely and efficiently?
3. Design: how to design microarchitectures that enable precise and efficient WCET analysis?
The Modeling Challenge
Microarchitecture → Timing Model
A timing model is a formal specification of the microarchitecture's timing. An incorrect timing model may yield an incorrect WCET bound.
Current Process of Deriving a Timing Model
Today, the timing model is constructed manually from the available documentation of the microarchitecture. This is
→ time-consuming, and
→ error-prone.
1. Future Process of Deriving a Timing Model
Microarchitecture → VHDL Model → Timing Model
Derive the timing model automatically from a formal specification of the microarchitecture, e.g., its VHDL model.
→ Less manual effort, thus less time-consuming, and
→ provably correct.
2. Future Process of Deriving a Timing Model
Microarchitecture → perform measurements on the hardware, infer a model → Timing Model
Derive the timing model automatically from measurements on the hardware, using ideas from automata learning.
→ No manual effort,
→ (under certain assumptions) provably correct, and
→ also useful to validate assumptions about the microarchitecture.
Proof of Concept: Automatic Modeling of the Cache Hierarchy
The cache model is an important part of the timing model. It can be characterized by a few parameters:
- ABC: associativity (A), block size (B), capacity (C)
- the replacement policy
A cache consists of N cache sets, each holding A ways of tag/data entries for blocks of B bytes.
chi [Abel and Reineke, RTAS 2013] derives all of these parameters fully automatically.
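The idea behind measuring the capacity can be illustrated in simulation (a hypothetical sketch, not chi's actual code): traverse arrays of increasing size and look for the point where a second traversal stops hitting. A simulated LRU cache with 4 sets of associativity 2 (capacity 8 blocks) stands in for hardware measurements here; all parameters are illustrative.

```python
def misses_for_size(size, sets=4, assoc=2):
    """Misses when an array of `size` blocks is traversed twice (LRU cache)."""
    cache = [[] for _ in range(sets)]   # per-set LRU stacks, most recent first
    misses = 0
    for block in list(range(size)) * 2:
        s, tag = block % sets, block // sets
        if tag in cache[s]:
            cache[s].remove(tag)        # hit: move tag back to the front
        else:
            misses += 1
            if len(cache[s]) >= assoc:
                cache[s].pop()          # evict the least recently used tag
        cache[s].insert(0, tag)
    return misses

# The capacity (in blocks) is the largest size whose second traversal is
# miss-free, i.e., for which the miss count equals the cold misses alone.
capacity_blocks = max(s for s in range(1, 64) if misses_for_size(s) == s)
```

On real hardware, chi observes miss counts via performance counters instead of a simulator, but the jump in the miss curve plays the same role.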
Example: Intel Core 2 Duo E6750, L1 Data Cache
(Figure omitted: number of L1 misses as a function of the size of the accessed array. The curve reveals a capacity of 32 KB and a way size of 4 KB.)
Replacement Policy
The approach is inspired by methods to learn finite automata, but heavily specialized to the problem domain.
It discovered a, to our knowledge, undocumented policy of the Intel Atom D525: for example, starting from state [a, b, c, d, e, f], an access to d yields [d, c, a, b, e, f], and a subsequent access to x yields [x, e, d, c, a, b].
More information: http://embedded.cs.uni-saarland.de/chi.php
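The inference idea can be sketched as follows (a simplified illustration, not chi's actual algorithm): simulate candidate policies on a probe sequence and keep those consistent with the observed hit/miss pattern. A probe that re-touches a block before forcing an eviction separates LRU from FIFO:

```python
def simulate(policy, accesses, assoc=4):
    """Return the hit/miss pattern of `accesses` under a candidate policy."""
    cache, pattern = [], []
    for x in accesses:
        if x in cache:
            pattern.append('H')
            if policy == 'LRU':            # FIFO leaves order unchanged on hits
                cache.remove(x)
                cache.append(x)
        else:
            pattern.append('M')
            if len(cache) >= assoc:
                cache.pop(0)               # evict oldest / least recently used
            cache.append(x)
    return ''.join(pattern)

# a, b, c, d fill the cache; re-touching a and then inserting e separates the
# policies: LRU keeps the re-touched a, FIFO evicts it anyway.
probe = ['a', 'b', 'c', 'd', 'a', 'e', 'a']
```

Here `simulate('LRU', probe)` ends in a hit and `simulate('FIFO', probe)` in a miss, so a single hardware measurement of the final access rules one candidate out.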
Modeling Challenge: Future Work
Extend the automation to other parts of the microarchitecture:
- translation lookaside buffers and branch predictors,
- shared caches in multicores, including their coherency protocols,
- out-of-order pipelines?
The Analysis Challenge
Given the timing model, perform a precise and efficient timing analysis. The analysis must consider all possible program inputs and all possible initial states of the hardware:

WCET_H(P) := max_{i ∈ Inputs} max_{h ∈ States(H)} ET_H(P, i, h)

Explicitly evaluating ET for all inputs and all hardware states is not feasible in practice: there are simply too many.
→ Need for abstraction, and thus approximation!
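The definition above can be evaluated literally for a toy "hardware" in which the execution time depends on both the input (path choice) and the initial cache state; all names and numbers below are illustrative, and the point is only that real input and state spaces are far too large for this brute force:

```python
def exec_time(program, inp, hw_state):
    """ET_H(P, i, h): time of one run, for a made-up hardware model."""
    base = program(inp)                      # path length chosen by the input
    penalty = 10 if hw_state == 'cache_cold' else 0
    return base + penalty

def wcet(program, inputs, hw_states):
    # WCET_H(P) = max over inputs i and initial hardware states h of ET_H(P, i, h)
    return max(exec_time(program, i, h) for i in inputs for h in hw_states)

toy = lambda i: 5 if i < 0 else 20           # negative inputs take the short path
bound = wcet(toy, inputs=[-1, 3], hw_states=['cache_cold', 'cache_warm'])
# bound: long path (20) plus cold-cache penalty (10)
```

Abstract interpretation replaces the inner enumeration over states with a sound over-approximation, trading precision for feasibility.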
The Analysis Challenge: State of the Art

Component: Caches, branch target buffers.
Status: Precise and efficient abstractions for LRU [Ferdinand, 1999]; not-so-precise but efficient abstractions for FIFO, PLRU, MRU [Grund and Reineke, 2008-2011].

Component: Complex pipelines.
Status: Precise but very inefficient; little abstraction. Major challenge: timing anomalies.

Component: Shared resources, e.g., buses, shared caches, DRAM.
Status: No realistic approaches yet. Major challenge: interference between hardware threads → execution time depends on co-running tasks.
Timing Anomalies
A timing anomaly occurs when the local worst case does not imply the global worst case.
Scheduling anomaly: A, D, and E execute on resource 1, B and C on resource 2. Depending on when A finishes, resource 2 executes either C before B or B before C, and the locally faster outcome for A can lead to the longer overall schedule.
Timing Anomalies (cont.)
Speculation anomaly: a cache hit on A leaves time to speculatively prefetch B before the branch condition is evaluated; the prefetch evicts C's line, so C misses, and the overall execution is slower than if A had missed in the first place.
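The speculation anomaly can be rendered as a back-of-the-envelope calculation (all latencies are illustrative, chosen only to exhibit the effect):

```python
HIT, MISS = 2, 10
PREFETCH_PENALTY = 2   # the prefetch of B also occupies the bus, delaying C

def total_time(a_hits):
    """Total time of A followed by C, for a hit or a miss on A."""
    a = HIT if a_hits else MISS
    # If A hits, the speculative prefetch of B evicts C's line: C misses,
    # and the prefetch traffic further delays C's reload.
    c = MISS + PREFETCH_PENALTY if a_hits else HIT
    return a + c
```

Here `total_time(a_hits=True)` exceeds `total_time(a_hits=False)`: assuming the local worst case (a miss on A) does not cover the global worst case, which is why analyses of anomalous pipelines cannot simply follow local worst cases.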
The Design Challenge
Wanted: a multi-/many-core architecture with
- no timing anomalies → precise and efficient analysis of individual cores,
- temporal isolation between cores → independent/incremental development and analysis,
and high performance!
Approaches to the Design Challenge
At the level of individual cores:
- simple in-order pipelines, with static or no branch prediction,
- scratchpad memories or LRU caches,
- software-controlled caches.
Approaches to the Design Challenge
For resources shared among multiple cores:
- Temporal partitioning, e.g.,
  - TDMA arbitration of buses and shared memories,
  - the thread-interleaved pipeline in PRET.
- Spatial partitioning, e.g.,
  - partitioning shared caches,
  - partitioning shared DRAM.
→ Temporal isolation.
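TDMA arbitration, the first bullet above, can be sketched in a few lines (slot and core counts are illustrative): each core owns a fixed slot in a periodic schedule, so a core's worst-case wait for the bus depends only on the schedule, never on what the other cores do.

```python
SLOT, CORES = 4, 4              # 4-cycle slots, one slot per core, period 16

def wait_cycles(core, t):
    """Cycles `core` waits at time `t` before its slot (0 if already inside it)."""
    period = SLOT * CORES
    offset = (t - core * SLOT) % period
    return 0 if offset < SLOT else period - offset

# Worst case for a core: requesting just after its own slot has closed.
worst = max(wait_cycles(0, t) for t in range(SLOT * CORES))
```

The bound `worst == (CORES - 1) * SLOT` holds for every core and every traffic pattern, which is exactly the temporal isolation the slide asks for, bought at the price of idle slots when other cores have nothing to send.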
Design Challenge: Predictable Pipelining
(Figure from Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 2007.)
Pipelining: Hazards
(Figure from Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 2007.)
Forwarding helps, but not all the time
LD   R1, 45(R2)
DADD R5, R1, R7
BE   R5, R3, R0
ST   R5, 48(R2)
The dream: every instruction flows through the F D E M W stages back to back, perfectly overlapped. The reality: this sequence stalls on a memory hazard, a data hazard, and a branch hazard.
Solution: PTARM Thread-Interleaved Pipelines [Lickly et al., CASES 2008]
Five hardware threads T1-T5 are fetched in round-robin order, so each thread occupies only one stage of the pipeline at a time.
→ No hazards; perfect utilization of the pipeline.
→ Simple hardware implementation (no forwarding, etc.).
→ Each instruction takes the same amount of time.
→ Temporal isolation between different hardware threads.
Drawback: reduced single-thread performance.
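The round-robin interleaving can be tabulated directly (a sketch of the scheduling idea, not of PTARM's hardware): with five stages and five threads, thread t fetches at cycles t, t+5, t+10, ..., and in steady state each cycle finds the five threads in five distinct stages.

```python
STAGES = ['F', 'D', 'E', 'M', 'W']
THREADS = 5

def stage_of(thread, cycle):
    """Stage occupied by `thread` at `cycle` (steady state, round-robin fetch)."""
    return STAGES[(cycle - thread) % THREADS]

# In any cycle, the five threads occupy five distinct stages: no thread ever
# has two instructions in flight, so there is nothing to forward or stall on.
occupancy = [stage_of(t, cycle=7) for t in range(THREADS)]
```

Since `occupancy` is always a permutation of the stages, every instruction of every thread sees an identical, hazard-free pipeline, which is where the constant per-instruction latency comes from.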
Design Challenge: DRAM Controller
A DRAM controller translates sequences of memory accesses by clients (CPUs and I/O) into legal sequences of DRAM commands. It
- needs to obey all timing constraints,
- needs to insert refresh commands sufficiently often, and
- needs to translate physical memory addresses into row/column/bank tuples.
The clients reach the DRAM module through an interconnect with arbitration and the memory controller.
Dynamic RAM Timing Constraints
DRAM memory controllers have to conform to timing constraints that define minimal distances between consecutive DRAM commands. Almost all of these constraints are due to the sharing of resources at different levels of the hierarchy:
- rows within a bank share the sense amplifiers and row buffer,
- banks within a DRAM device share the I/O gating and control logic,
- different ranks share the data, address, and command buses.
In addition, the controller needs to insert refresh commands sufficiently often.
(Figure omitted: structure of a DIMM with two ranks of x16 devices, each device containing multiple banks of DRAM arrays with row and column decoders.)
General-Purpose DRAM Controllers
They schedule DRAM commands dynamically. The timing is hard to predict even for a single client:
- the timing of a request depends on past requests: to the same or a different bank? to an open or a closed row within the bank?
- the controller might reorder requests to minimize latency, and
- the controller dynamically schedules refreshes.
There is no temporal isolation: the timing depends on the behavior of other clients, because
- they influence the sequence of past requests, and
- arbitration may or may not provide guarantees.
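The history dependence in the first bullet can be made concrete with a toy latency model (the numbers are illustrative, not from any datasheet): an access costs little if the bank's row buffer already holds the requested row, and much more if the row must first be opened.

```python
ROW_HIT, ROW_MISS = 10, 30   # CAS only vs. precharge + activate + CAS

def total_latency(requests):
    """Sum latencies of (bank, row) requests against the per-bank open rows."""
    open_row, total = {}, 0
    for bank, row in requests:
        total += ROW_HIT if open_row.get(bank) == row else ROW_MISS
        open_row[bank] = row
    return total

same_row = [(1, 3), (1, 3), (1, 3)]   # stays in the open row after the first access
alt_rows = [(1, 3), (1, 4), (1, 3)]   # ping-pongs between rows of one bank
```

The two request streams have identical lengths but very different latencies (`total_latency(same_row)` is 50, `total_latency(alt_rows)` is 90 under these numbers), and in a shared controller the interleaving, and hence the row-buffer history, is decided by the co-running clients.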
General-Purpose DRAM Controllers (cont.)
Example: thread 1 issues Load B1.R3.C2, Load B2.R4.C3, Store B4.R3.C5, while thread 2 issues Load B3.R3.C2, Load B3.R5.C3, Store B2.R3.C5. Depending on the arbitration, the memory controller sees these requests in different interleavings, and thus each request's latency varies with the co-running thread.
General-Purpose DRAM Controllers (cont.)
Even for a single thread issuing Load B1.R3.C2, Load B1.R4.C3, Load B1.R3.C5, the controller may either issue the commands in order (RAS B1.R3, CAS B1.C2, RAS B1.R4, CAS B1.C3, RAS B1.R3, CAS B1.C5) or reorder them to exploit the open row (RAS B1.R3, CAS B1.C2, CAS B1.C5, RAS B1.R4, CAS B1.C3).
PRET DRAM Controller: Three Innovations [Reineke et al., CODES+ISSS 2011]
- Expose the internal structure of DRAM devices: expose individual banks within the DRAM device as multiple independent resources.
- Defer refreshes to the end of transactions: allows the controller to hide refresh latency.
- Perform refreshes "manually": replace the standard refresh command with multiple reads.
PRET DRAM Controller: Exploiting the Internal Structure of the DRAM Module
- The module consists of 4-8 banks in 1-2 ranks, which share only the command and data buses and are otherwise independent.
- Partition the banks into four groups in alternating ranks.
- Cycle through the groups in a time-triggered fashion.
Successive accesses to the same group obey the timing constraints, and reads/writes to different groups do not interfere.
→ Provides four independent and predictable resources.
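The time-triggered group cycling can be sketched as a periodic table (the concrete assignment of banks to groups below is illustrative): the slot serving each group is fixed in advance, so a client of one group gets service at known times regardless of the other groups' traffic.

```python
# Four bank groups in alternating ranks, visited round-robin.
GROUPS = [('rank 0', 'banks 0/1'), ('rank 1', 'banks 0/1'),
          ('rank 0', 'banks 2/3'), ('rank 1', 'banks 2/3')]

def group_at(slot):
    """Bank group served in time slot `slot` of the periodic schedule."""
    return GROUPS[slot % len(GROUPS)]

# Each group recurs with a fixed period of len(GROUPS) slots, and consecutive
# slots serve alternating ranks, giving the intervening rank time to recover.
period = len(GROUPS)
```

Because `group_at` depends only on the slot number, never on the access stream, each of the four groups behaves like an independent memory with a constant access period, which is the predictability claim of the slide.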
Conventional DRAM Controller (DRAMSim2) vs. PRET DRAM Controller: Latency Evaluation
(Figures omitted: latencies of the conventional and PRET memory controllers. One plot varies the interference, i.e., the number of other occupied threads, for fixed transfer sizes of 1024 B and 4096 B; the other varies the transfer size at maximal interference.)
More information: http://chess.eecs.berkeley.edu/pret/
Emerging Challenge: Microarchitecture Selection and Configuration
Given embedded software with timing requirements and a family of microarchitectures (a platform), select a microarchitecture that
a) satisfies all timing requirements, and
b) minimizes cost/size/energy.
Choices include the processor frequency, the sizes and latencies of local memories, the latency and bandwidth of the interconnect, and the presence of a floating-point unit.
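The selection problem can be stated as a brute-force search over a tiny hypothetical configuration space (all numbers, field names, and the WCET estimator below are made up for illustration): keep the configurations whose estimated WCET meets the deadline, then pick the cheapest.

```python
configs = [
    {'freq_mhz': 100, 'spm_kb': 8,  'cost': 1},
    {'freq_mhz': 200, 'spm_kb': 8,  'cost': 2},
    {'freq_mhz': 200, 'spm_kb': 32, 'cost': 4},
]

def wcet_us(cfg):
    """Estimated WCET in microseconds; a small scratchpad doubles memory stalls."""
    cycles = 10_000 * (2 if cfg['spm_kb'] < 16 else 1)
    return cycles / cfg['freq_mhz']     # MHz = cycles per microsecond

DEADLINE_US = 120
feasible = [c for c in configs if wcet_us(c) <= DEADLINE_US]
best = min(feasible, key=lambda c: c['cost'])   # cheapest feasible configuration
```

In practice the WCET estimate per configuration would come from the analysis of the previous sections, and the search space is far too large to enumerate, which is what makes this an emerging challenge rather than a solved one.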
Conclusions
Challenges in modeling, analysis, and design remain. Progress, based on automation, abstraction, and partitioning, has been made.
Thank you for your attention!