Error Characterization of Petascale Machines: A study of the error logs from Blue Waters, Academic Year 2012-2013

1 Master's thesis: Error Characterization of Petascale Machines
Advisor: Prof. Domenico Cotroneo
Co-Advisors: Prof. Ravishankar K. Iyer, Dr. Catello Di Martino, Prof. Zbigniew Kalbarczyk
Student: Fabio Baccanico M

2 Bibliography
1. Franck Cappello, Al Geist, Bill Gropp, Laxmikant Kale, Bill Kramer, and Marc Snir. Toward Exascale Resilience. Int. J. High Perform. Comput. Appl. 23, 4 (November 2009), 374-388.
2. B. Schroeder and G.A. Gibson. A Large-Scale Study of Failures in High-Performance Computing Systems. IEEE Transactions on Dependable and Secure Computing, vol. 7, no. 4, pp. 337-350, Oct.-Dec. 2010.
3. Catello Di Martino. One size does not fit all: Clustering supercomputer failures using a multiple time window approach. In International Supercomputing Conference - Supercomputing, volume 7905 of Lecture Notes in Computer Science. Springer Berlin Heidelberg.
4. Franck Cappello. Fault tolerance in petascale/exascale systems: Current knowledge, challenges and research opportunities. Int. J. High Perform. Comput. Appl., 23(3), August 2009.
5. A. Oliner and J. Stearley. What supercomputers say: A study of five system logs. In Dependable Systems and Networks, DSN '07, 37th Annual IEEE/IFIP Int. Conference on, June 2007.
6. Y. Liang, A. Sivasubramaniam, J. Moreira, Y. Zhang, R.K. Sahoo, and M. Jette. Filtering failure logs for a BlueGene/L prototype. In Proceedings of the 2005 International Conference on Dependable Systems and Networks, Washington, DC, USA. IEEE Computer Society.
7. C. Spitz and A. Koehler. Tips and Tricks for Diagnosing Lustre Problems on Cray Systems. Cray User Group 2011 Proceedings.
8. BIOS and Kernel Developer's Guide for AMD Athlon 64 and AMD Opteron Processors.

3 Context
Large-scale high-performance computing.
Blue Waters: the fastest supercomputer on a university campus, with 14 petaflops of peak performance.
The trend is toward systems with several million CPU cores running up to a billion threads [1].
Failures are no longer rare events and should be treated as normal events.
Resiliency is one of the major challenges in large-scale HPC infrastructure [1-5].
This is a first work on analyzing the errors of a petascale machine via log analysis.

4 Contribution and Findings
Behind a petascale machine: how is it made? Hardware and software architecture.
Analysis of petascale machine error logs: a big data problem.
Not all logs are created equal: data management and harmonization.
The needle in the haystack: filtering event logs looking for errors.
Even a single bit matters: decoding error codes.
Reliability at petascale: major findings.
Availability is 95.5%; MTTR is 6h 46m.
53% of system failures are due to Lustre.
Hardware is not the problem! Almost all hardware errors are masked by the ECC/Chipkill technique protecting the memory.
Lustre (parallel file system) errors are highly critical.
The timeout mechanisms used for failover are not adequate at this scale.

5 About Blue Waters
Hybrid (CPU/GPU) Cray architecture delivering 14 PF (peak).
27648 AMD Opteron nodes, 724,480 cores.
Cray Gemini network.
Lustre file system over

6 Blue Waters nodes
Cray XE6 blade: four nodes, each with two AMD Opteron 6272 (16 cores each); 64 GB DDR3 RAM per node; Gemini communication interface.
Cray XK7 blade (total 3072): one AMD Opteron 6272 per node; 32 GB DDR3 RAM per node; NVIDIA K20X accelerator with 6 GB on-board GDDR5 RAM and 2272 cores; Gemini communication interface.
[Figure: blade layout with Node 0 to Node 3, RAM banks, two NVIDIA GPU accelerators, voltage regulators, Gemini network ASICs, and the blade controller]

7 User Perceived Availability (1/2)
Data source: emails from the Blue Waters team signaling general failures that affect user access or job scheduling. Example:
"Blue Waters has experienced a service interruption on the compute and job scheduler resource due to unexpected cabinet power fault. Estimated return to service time is not available yet, but is not currently expected to be before 6AM on 1/18. The BW team"
Time Between Failures = time between two consecutive failures.
Time To Repair = time between a failure and the restore.
46 emails analyzed, from 03/06/2013 to 01/11/2013.
The result is an estimate of the user-perceived availability.
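The two definitions above can be sketched in a few lines of Python. This is only an illustration: the outage records, the observation window, and the helper names (`time_to_repair`, `time_between_failures`, `availability`) are hypothetical and are not taken from the Blue Waters notifications.

```python
from datetime import datetime, timedelta

# Hypothetical (failure time, restore time) pairs, invented for illustration.
outages = [
    (datetime(2013, 1, 17, 20, 0), datetime(2013, 1, 18, 6, 0)),
    (datetime(2013, 1, 25, 9, 30), datetime(2013, 1, 25, 12, 0)),
    (datetime(2013, 2, 3, 14, 0), datetime(2013, 2, 3, 21, 30)),
]


def time_to_repair(outages):
    """Time To Repair: time between each failure and its restore."""
    return [restore - failure for failure, restore in outages]


def time_between_failures(outages):
    """Time Between Failures: time between two consecutive failure starts."""
    starts = sorted(failure for failure, _ in outages)
    return [later - earlier for earlier, later in zip(starts, starts[1:])]


def mean(deltas):
    """Arithmetic mean of a non-empty list of timedeltas."""
    return sum(deltas, timedelta()) / len(deltas)


def availability(outages, window_start, window_end):
    """Fraction of the observation window during which the system was up."""
    downtime = sum(time_to_repair(outages), timedelta())
    return 1 - downtime / (window_end - window_start)


ttr = time_to_repair(outages)
tbf = time_between_failures(outages)
window = (datetime(2013, 1, 1), datetime(2013, 3, 1))  # hypothetical window

print("MTTR:", mean(ttr))
print("Mean TBF:", mean(tbf))
print("Availability:", round(availability(outages, *window), 4))
```

Measuring availability against a fixed observation window, rather than as MTTF / (MTTF + MTTR), is a design choice of this sketch; the two estimates generally differ slightly.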

8 User Perceived Availability (2/2)
User-perceived availability: 0.955, i.e., 1.5 days/month offline.
Uptime: 150 days 3h. Downtime: 6 days 18h.
System-wide statistics:
MTTI = 6 days 16h (min 2 days 21h, max 18 days 4h)
MTTF = 5 days 12h (min 1h 26min, max 15 days 13h)
MTTR = 6h 46min (min 26min, max 1 day 7h)
Cause of downtime: Failure 76%, Maintenance 13%, Expansion 8%, Queue policy changed 3%.
System-wide unavailability (reasons): File system 52%, Scheduler 35%, Power 4%, System reboot 3%, Storage 3%, Globus 3%.

9 Log Analysis Workflow
[Figure: log analysis workflow; data volumes shown: 4.4 TB, 640 GB, ~220 GB]

10 Initial Breakdown of Errors
MTTE (mean time to error): 890 s (almost 15 min).
76,419,404 errors grouped into 17,135 error clusters (tuples).
Machine Check Errors (MCE) are the major cause of errors (55%), but also the least critical.
In 92% of the tuples where they appear, MCEs are the only cause of errors.
76% of the other error messages come from Gemini.
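Grouping raw error entries into tuples can be sketched as a time-window coalescence. The sketch below uses a single fixed window, which is simpler than the multiple time window approach of [3]; the window length, the event timestamps, and the function names are assumptions made for illustration only.

```python
def coalesce(timestamps, window_s):
    """Group event timestamps (in seconds) into tuples.

    An event joins the current tuple when it occurs within window_s seconds
    of the previous event; otherwise it starts a new tuple.
    """
    tuples = []
    for t in sorted(timestamps):
        if tuples and t - tuples[-1][-1] <= window_s:
            tuples[-1].append(t)
        else:
            tuples.append([t])
    return tuples


def mean_time_to_error(tuples):
    """Mean gap between the start times of consecutive tuples."""
    starts = [tup[0] for tup in tuples]
    gaps = [later - earlier for earlier, later in zip(starts, starts[1:])]
    return sum(gaps) / len(gaps)


# Hypothetical error timestamps in seconds, not taken from the Blue Waters logs.
events = [0, 5, 12, 900, 903, 2500, 2510, 2515]
tuples = coalesce(events, window_s=60)  # 60 s is an assumed window length

print(len(tuples), "tuples")
print("MTTE:", mean_time_to_error(tuples), "s")
```

A shorter window splits bursts into more tuples and a longer one merges unrelated errors, so the chosen window directly affects the reported MTTE.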

11 Decoding Machine Check Data
Extract: 1. Extraction of Machine Check Exceptions (MCE) from the logs.
Decode: 2. Decoding of the MCE.
   1. Read the status register.
   2. Decode based on the bank, using information from the manual.
   3. Add the result from mcelog (AMD).
   4. Decode the status register.
Analyze: 3. Data analysis.
Banks: Bank 0 = Data Cache; Bank 1 = Instruction Cache (IC); Bank 2 = Bus Unit (BU); Bank 3 = Load/Store Unit (LS); Bank 4 = Northbridge (NB).
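A minimal sketch of the decode step, assuming the commonly documented layout of the 64-bit machine check status register (VAL at bit 63 down to PCC at bit 57, CECC and UECC at bits 46 and 45 on AMD, error code in bits 15:0). These bit positions and the `decode_status` helper are assumptions of this sketch and should be checked against the processor manual; the bank names are the ones listed on the slide.

```python
# Bank numbering as listed on the slide.
BANKS = {
    0: "Data Cache",
    1: "Instruction Cache (IC)",
    2: "Bus Unit (BU)",
    3: "Load/Store Unit (LS)",
    4: "Northbridge (NB)",
}

# Assumed flag positions in the status register; verify against the manual.
FLAGS = {
    "VAL": 63,    # register contains a valid error
    "OVER": 62,   # an earlier error was overwritten
    "UC": 61,     # error was not corrected by hardware
    "EN": 60,     # error reporting was enabled
    "MISCV": 59,  # miscellaneous register is valid
    "ADDRV": 58,  # address register is valid
    "PCC": 57,    # processor context may be corrupt
    "CECC": 46,   # correctable ECC error
    "UECC": 45,   # uncorrectable ECC error
}


def decode_status(bank, status):
    """Decode a raw status value into named flags, bank name, and error code."""
    decoded = {name: bool((status >> bit) & 1) for name, bit in FLAGS.items()}
    decoded["bank"] = BANKS.get(bank, "unknown")
    decoded["error_code"] = status & 0xFFFF  # low 16 bits
    return decoded


# Hypothetical status value: valid, enabled, correctable ECC, error code 0x0A13.
example = (1 << 63) | (1 << 60) | (1 << 46) | 0x0A13
result = decode_status(4, example)
print(result)
```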

12 Hardware is resilient!
1,544,398 machine check events.
45.5% of nodes have at least one machine check.
6,252 machine checks per day on average.
The majority are memory errors (97.70%).
Only 28 uncorrectable errors (0.002%).
Chipkill/ECC is effective in correcting almost all memory errors.
Time between failures for uncorrectable errors: 292h (12.1 days).
The node usage pattern has an impact on the machine check rate.
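The quoted share of uncorrectable errors follows directly from the two counts on the slide; the short check below only redoes that arithmetic (the variable names are illustrative).

```python
machine_checks = 1_544_398  # machine check events reported on the slide
uncorrectable = 28          # uncorrectable errors reported on the slide

# Share of machine check events that were uncorrectable, in percent.
uncorrectable_pct = 100 * uncorrectable / machine_checks
print(f"{uncorrectable_pct:.3f}%")
```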

13 Lustre Error Codes
Example error: "Transport endpoint is not connected".
Lustre error detection is based on timeouts.
More than 80% of errors are due to timeouts.
The timeout period tends to be long.
Use of distributed locks.
Sensitive to network congestion (which depends on the system load) and to the number of connected clients.

14 LUSTRE Time To Error
Weibull and Fatigue Life are good fits for the Time To Error (TTE), i.e., the time between two consecutive tuples.
Fatigue Life is used to model the accumulation of errors.
Exponential is not a good fit: small p-value, which indicates dependence between consecutive errors.
About 25% of errors happen within 1h of a former error.
[Figure: distribution fitting for Lustre; histogram of TBE in hours with Exponential, Fatigue Life, and Weibull (3P) fits]
[Figure: cumulative distribution function for Lustre; sample of TBE in hours with Exponential, Fatigue Life, and Weibull (3P) fits]

Distribution    G.O.F. (Kolmogorov)    Critical value     Parameters
Weibull         0.037                  0.044 (α<0.2)      α=0.57, β=4.93
Fatigue Life    0.043                  0.044 (α<0.2)      α=2.14, β=2.16
Exponential     0.2362                 0.06 (α<0.01)      λ=0.13, γ=0
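The goodness-of-fit figures above are Kolmogorov statistics, i.e., the largest gap between the empirical distribution of the sample and the fitted distribution. The sketch below computes that statistic in plain Python for an exponential and a Weibull candidate, using the parameter values reported on the slide. The sample is hypothetical (the thesis data is not reproduced here), and the function names are illustrative.

```python
import math


def exponential_cdf(x, lam):
    """CDF of the exponential distribution with rate lam."""
    return 1 - math.exp(-lam * x) if x > 0 else 0.0


def weibull_cdf(x, shape, scale, loc=0.0):
    """CDF of the three-parameter Weibull distribution (loc defaults to 0)."""
    if x <= loc:
        return 0.0
    return 1 - math.exp(-(((x - loc) / scale) ** shape))


def ks_statistic(sample, cdf):
    """Kolmogorov statistic: largest gap between empirical and model CDF."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # Compare the model CDF with the empirical CDF just after and just before x.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d


# Hypothetical time-between-errors sample in hours: many short gaps, few long ones.
tbe_hours = [0.1, 0.2, 0.3, 0.5, 0.8, 1.5, 3.0, 6.0, 12.0, 30.0]

d_exp = ks_statistic(tbe_hours, lambda x: exponential_cdf(x, 0.13))
d_wei = ks_statistic(tbe_hours, lambda x: weibull_cdf(x, 0.57, 4.93))
print("Exponential D:", round(d_exp, 3))
print("Weibull D:", round(d_wei, 3))
```

A fit is rejected when the statistic exceeds the critical value for the chosen significance level; on a bursty sample such as this one, the exponential candidate shows the larger gap.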

15 Conclusions and Future Work
Availability is 95.5%, with one critical failure every 6d 16h on average.
Lustre (file system) is the reason for 52% of system-wide failures.
Hardware is highly resilient: error-correcting codes cover almost all errors.
Lustre is sensitive to timeout errors.
Error times follow a Weibull distribution with α < 1.
Accumulation of errors (e.g., due to high load) might be responsible for the high error rate.
Larger systems may need different mechanisms.
Future work:
Improve the analysis by taking into account the impact of errors on user jobs.
Transform logs into actionable information, e.g., use machine learning to predict failures and decide proactive recovery actions and/or to reduce the MTTR.

16 Backup Slides

17 Increasing Scale of Problems
Large-scale machines have a large number of components, i.e., more computing power means more failures.

18 Log Analysis: Objectives
Logs are just data; processed and analyzed, they become information.
Ideal goal: use logs to measure, quantify, and diagnose.
Example notification: "Blue Waters has experienced a service interruption on the compute and job scheduler resource due to unexpected cabinet power fault. Estimated return to service time is not available yet, but is not currently expected to be before 6AM on 1/18. The BW team"

19 Resiliency mechanisms in Blue Waters: Hardware Supervisory System (HSS)

20 LUSTRE Hazard Rate
The more errors have happened, the more will happen.
MTTR: 3.5 hours.
After that, errors tend to be memoryless.
[Figure: hazard function h(x) for the Exponential, Fatigue Life, and Weibull (3P) fits]

21 Formulas: Fatigue Life Hazard Rate
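The formulas on this slide did not survive extraction. As a reference point, the standard textbook forms are given below, assuming the slide used the usual Fatigue Life (Birnbaum-Saunders) parametrization with shape α and scale β and a zero location parameter; Φ and φ denote the standard normal CDF and density. This is a reconstruction from the general definitions, not a copy of the slide.

```latex
% Hazard rate of any distribution with density f and CDF F
h(x) = \frac{f(x)}{1 - F(x)}

% Fatigue Life (Birnbaum-Saunders), shape \alpha, scale \beta, x > 0
z(x) = \frac{1}{\alpha}\left(\sqrt{\frac{x}{\beta}} - \sqrt{\frac{\beta}{x}}\right),
\qquad
F(x) = \Phi\bigl(z(x)\bigr),
\qquad
f(x) = \frac{\sqrt{x/\beta} + \sqrt{\beta/x}}{2\,\alpha\,x}\,\varphi\bigl(z(x)\bigr)

h_{\mathrm{FL}}(x) =
\frac{\sqrt{x/\beta} + \sqrt{\beta/x}}{2\,\alpha\,x}\cdot
\frac{\varphi\bigl(z(x)\bigr)}{1 - \Phi\bigl(z(x)\bigr)}

% For comparison: Weibull (shape \alpha, scale \beta) and exponential (rate \lambda)
h_{\mathrm{W}}(x) = \frac{\alpha}{\beta}\left(\frac{x}{\beta}\right)^{\alpha - 1},
\qquad
h_{\mathrm{E}}(x) = \lambda
```

With α < 1 the Weibull hazard decreases over time, so a new error is most likely shortly after the previous one, while the exponential hazard is constant (memoryless).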
