CLASS: Combined Logic Architecture Soft Error Sensitivity Analysis Mojtaba Ebrahimi, Liang Chen, Hossein Asadi, and Mehdi B. Tahoori INSTITUTE OF COMPUTER ENGINEERING (ITEC) CHAIR FOR DEPENDABLE NANO COMPUTING (CDNC) 0 KIT University of the State of Baden-Wuerttemberg and National Research Center of the Helmholtz Association www.kit.edu
Soft Error Rate (Normalized) Soft Errors Soft Errors caused by radiation-induced particle strikes Strike releases electron & hole pairs Absorbed by source & drain Alter transistor state source + + - + + - - - Particle Strike drain Soft error at logic-level and higher: Bit-flip in flip-flops Transient pulses in combinational gates Transistor Device SER vs. Design Rule [adapted from Ibe10TED] 5 System Soft Error Rate (SER) grows exponentially with Moore s law [Dixit11IRPS][Ibe10TED] Number of transistors per system is proportional to SER 4 3 2 1 0 250 180 130 90 65 45 32 22 Design Rule (nm) 1
Motivation: Importance of SER Estimation Not all soft errors cause system failures Circuit level masking (logical, timing, and electrical) Architecture level masking Application level masking Nodes not equally susceptible to soft errors Finding most susceptible modules Applying selective protection schemes Balancing reliability, cost, and performance SER Estimation technique is needed Rank the nodes according to their SER 2
Motivation: What is missing? Previous Soft Error Rate (SER) estimation techniques Fault injection (time consuming) Analytical (accuracy) Architecture-level techniques regular structures (register-file, cache) Circuit-level techniques irregular structures (controller unit, pipeline) Microprocessors include both regular and irregular structures Independent analysis is not accurate Proposed: CLASS: Combined Logic Architecture Soft error Sensitivity analysis Combines a circuit-level technique with an architecture-level analysis Discrete time Markov chains for handling different error propagation and masking scenarios 3
Outline Preliminaries CLASS Experimental Results Conclusions 4
Outline Preliminaries CLASS Experimental Results Conclusions 5
Analytical Architecture Level SER Estimation Techniques ACE (Architecturally Correct Execution) analysis is the most common technique [Mukherjee03Micro] a.k.a Life-time analysis Lifetime of a bit is divided into : ACE e.g. Program counter almost always in ACE state un-ace (unnecessary for ACE) e.g. Branch predictor always in un-ace state AVF (Architecture Vulnerability Factor) Fraction of time that system is on ACE state Program counter ~ 100%, Branch predictor = 0% ACE analysis overestimates failure rate In some cases up to 7x [George10DSN] Overdesign 6
Analytical Circuit Level SER Estimation Techniques Binary/Algebraic Decision Diagrams (BDD/ADD)-based techniques [Zhang06ISQED] [Norman05TCAD] [Krishnaswamy05DATE] [Miskov08TCAD] Based on decision diagrams Highly accurate Scalability problem Error Probability Propagation (EPP)-based techniques [Asadi06ISCAS] [Fazeli10DSN] [Chen12IOLTS] Based on propagation rules Reasonable accuracy Scalable 7
Shortcoming of Previous techniques Circuit-level techniques consider: A circuit What seen by circuit level techniques ACE cannot completely model effect of irregular structures in error propagation from regular structures 8
Overview Preliminaries CLASS Experimental Results Conclusions 9
CLASS: A Simple Example Primary Inputs D SET CLR 0.4 0.3 Q Q D SET CLR Q Q Primary Ouputs 0.1 0.2 0.6 Memory 1 1 Error at Error Site Error at Error Site 0.2 0.5 0.3 0.2 0.5 0.3 0.4 Masked Masked Propagation from memory error site output contents (Iterative) to its output Saturates in few iterations Latent in Memory Latent in Memory 0.5 0.6 Error at Memory 0.1 Outputs 0.4 Error at Output (Failure) Error at Output (Failure) Discrete Time Markov Chain (DTMC) 10
Overview of CLASS Phase I System Decomposition Phase II Independent Analysis Phase III Joint Analysis Irregular Structures Irregular Structures (Combinational and Sequential Elements) Enhanced EPP Analysis Regular Structures Memory 1 Memory 2 Memory m Memory ACE Analysis (MACE) DTMC 11
Important Error Propagations Propagation from irregular structures to regular structures and vice versa : 1 2 3 Short-term effect: error in memory inputs affects its outputs immediately Long-term effect: error in memory inputs contaminates its contents MVF (Memory Vulnerability Factor): propagation probability of an error from memory contents to its output Primary Inputs Irregular Structure Primary Ouputs Memory Inputs Memory 2 3 1 Memory Outputs 1 Short-term Effect 2 Long-term Effect 3 MVF 12
Computation of Short- and Long-term Effects Error at Port None Probability of Error Output Contents (Short-term) (Long-term) Both A simple memory Address 0 1-SP(WEn) SP(WEn) 0 Data-in 1-SP(WEn) 0 0 SP(Wen) WEn 0 0 0 1 Probability of having short- and long-term effect for each input port of memory Short-term effect Long-term effect 13
Memory ACE (MACE) Looking at memory boundaries Intervals leading to a read access are ACE MVF Word : Fraction of time that it has an ACE state MVF Memory : Average of all MVF Word Write Read Write Read Read Write Write Time ACE unace ACE unace 14
Enhanced Circuit-level Technique CLASS is compatible with different circuit-level techniques It inherits characteristics of employed circuit-level technique Multi-cycle 4-value EPP [Fazeli10DSN] is employed Based on topological traverse of netlist O(n 2 ), Inaccuracy within 2% Circuit-level technique should be enhanced to: EP=1 D SET CLR Model short-term effect of error in memories Compute long-term effect probability (treated as independent error) Q Q 1 0.8=0.8 SP=0.8 A Error Probability (EP) EP=0.8 Address Data-in WEn SP=0.25 Short-term 0.8 0.75=0.6 Memory Lang-term 0.8 0.25=0.2 SP=0.5 B EP=0.3 EP=0.6 CLR Q 0.6 0.5=0.3 MVF EPP(Memory output to primary output) D SET Q EP=0.3 15
Integration of MACE and EPP Using DTMC 1 Primary Inputs Logic Primary Ouputs Independent of error site P ELT 1-MVF Error at Error Site Masked 1-P ELT -P EO P EO Memory A simple system Latent in Memory MVF P MLT Error at Memory Outputs 1-P MLT -P MO P MO Error at Output (Failure) Steady state solution of DTMC Symbol P EO P ELT P MO P MLT MVF Error propagation DTMC Definition P(Error site Primary Outputs) P(Error site Latent in Memory) P(Memory Output Primary Outputs) P(Memory Output Latent in Memory) P(Latent in Memory Memory Outputs) 16
Overall Flow System Specification System Specification System Specification Done by Commercial Tools Done by Commercial Tools CLASS MACE Analysis Application Application Application Post-synthesis Simulation Post-synthesis Simulation VCD File VCD File VCD file Analysis Signal Probabilites HDL Discritpion HDL Discritpion HDL Discritpion Synthesis Synthesis Gate-level Netlist Gate-level Netlist EPP Analysis MVF DTMC Matrix EPPs Vulnerability of Cells Solve Matrix for Each Cell 17
Outline Preliminaries CLASS Experimental Results Conclusions 18
Experimental Setup OpenRISC 1200 processor Synthesis results: 44,480 gates 4,960 flip-flops 4 Memory Units Overall: 49,444 error sites Benchmarks are selected from Mibench: QSort, CRC32, Stringsearch [www.opencores.org] 19
Comparison with Other Techniques EPP [Asadi06ISCAS][Fazeli10DSN] ACE[Biswas05ISCA] implemented for caches and their tags Simulation-based Statistical Fault Injection (SFI) Faults are injected during post-synthesis simulation One fault per running entire application Each FI on average takes 87 seconds Only logic masking is considered Comparable with functional simulation System configuration: Intel Xeon E5520 16 GB RAM 20
Failure Probability Inaccuracy Analysis Regular structures: OpenRISC Regular Structures CLASS ACE Instruction Cache (IC) Instruction Cache Tag (ICT) Data Cache (DC) Data Cache Tag (DCT) Average inaccuracy: 3.4% Maximum inaccuracy : 9.2% Average inaccuracy : 11.7% Maximum inaccuracy : 23.3% CLASS may overestimate or underestimate Due to inaccuracy of 4-value EPP SFI CLASS ACE 21
Failure Probability Inaccuracy Analysis Irregular structures: Three representative blocks of OpenRISC: ALU Instruction Cache FSM (IC FSM) Pipeline fetch unit CLASS EPP Average inaccuracy : 2.1% Maximum inaccuracy : 4.0% Average inaccuracy : 7.2% Maximum inaccuracy : 26.3% Average inaccuracy of individual error sites is less than 7%. SFI CLASS EPP 22
Runtime Individual inaccuracy of 7% needs 196 faults at each error site 49,444 possible error sites 49,444 x 196 x 87 more than 27 years CLASS is 5 orders of magnitude faster than simulation-based SFI Less than one hour Runtime Breakdown: VCD Analysis 85% EPP 14% Solving DTMCs 1% 23
Outline Preliminaries CLASS Experimental Results Conclusions 24
Conclusion CLASS: SER estimation technique for entire systems Unlike existing techniques which only focused on either regular or irregular structures Interactions between regular and irregular structures are considered Compared to SFI: Inaccuracy is less than 7% 5 orders of magnitude faster Multiple times more accurate than EPP (4x) and ACE (3x) 25
Thanks for your attention! Q & A 26
Backup Slides 27
Sources of Inaccuracy Sources of inaccuracy in current implementation Propagation of an error to several memory units Multiple propagation of an error to memory outputs 28
ACE Analysis: Instruction Cache Access Types: Read Miss (RM) and Read Hit (RH) Un-ACE Intervals * RM ACE Intervals: * RH 29 How to Estimate the Soft Error Rate of an Embedded Processor?
ACE Analysis: Data Cache (Write-back) Access types: Read Miss (RM), Read Hit (RH), Write Miss (WM), Write Hit (WH) ACE intervals: * RH Un-ACE intervals: * WH Potentially ACE intervals: * WM, * RM 30
Enhanced EPP Short-term effect of error in memories Long-term effect are considered as propagated as independent error Enhanced to calculate the probability of error propagation to memory inputs Only logic masking is considered Multi-cycle 4-value EPP [Fazeli10DSN] is employed CLASS has no limitation regarding circuit-level techniques 31