Error Characterization of Petascale Machines: A Study of the Error Logs from Blue Waters (Academic Year 2012-2013)

1 Master's thesis (tesi di laurea magistrale): Error Characterization of Petascale Machines
Academic Year 2012-2013
Advisor: Prof. Domenico Cotroneo
Co-Advisor: Prof. Ravishankar K. Iyer, Dr. Catello Di Martino, Prof. Zbigniew Kalbarczyk
Student: Fabio Baccanico M

2 Bibliography
1. Franck Cappello, Al Geist, Bill Gropp, Laxmikant Kale, Bill Kramer, and Marc Snir. Toward Exascale Resilience. Int. J. High Perform. Comput. Appl. 23, 4 (November 2009).
2. B. Schroeder and G.A. Gibson. A Large-Scale Study of Failures in High-Performance Computing Systems. IEEE Transactions on Dependable and Secure Computing, vol. 7, no. 4, pp. 337-350, Oct.-Dec.
3. Catello Di Martino. One size does not fit all: Clustering supercomputer failures using a multiple time window approach. In International Supercomputing Conference - Supercomputing, volume 7905 of Lecture Notes in Computer Science. Springer Berlin Heidelberg.
4. Franck Cappello. Fault tolerance in petascale/exascale systems: Current knowledge, challenges and research opportunities. Int. J. High Perform. Comput. Appl., 23(3), August.
5. A. Oliner and J. Stearley. What supercomputers say: A study of five system logs. Dependable Systems and Networks, DSN '07. 37th Annual IEEE/IFIP Int. Conference on, June.
6. Y. Liang, A. Sivasubramaniam, J. Moreira, Y. Zhang, R.K. Sahoo, and M. Jette. Filtering failure logs for a BlueGene/L prototype. Proceedings of the 2005 International Conference on Dependable Systems and Networks, Washington, DC, USA. IEEE Computer Society.
7. C. Spritz and A. Koehler. Tips and Tricks for Diagnosing Lustre Problems on Cray Systems. Cray User Group 2011 Proceedings.
8. BIOS and Kernel Developer's Guide for AMD Athlon 64 and AMD Opteron Processors.

3 Context
- Large-scale high-performance computing
- Blue Waters: the fastest supercomputer on a university campus, with 14 petaflops of peak performance
- The trend is towards systems with several million CPU cores running up to a billion threads [1]
- Failures are no longer rare events and should be considered normal events
- Resiliency is one of the major challenges in large-scale HPC infrastructure [1-5]
- First work on the analysis of a petascale machine's errors via log analysis

4 Contribution and Findings
- Behind a petascale machine: how is it made? Hardware and software architecture
- Analysis of petascale machine error logs: a big data problem
  - Not all logs are created equal: data management and harmonization
  - The needle in the haystack: filtering event logs looking for errors
  - Even a single bit matters: decoding error codes
- Reliability at petascale: major findings
  - Availability is 95.5%; MTTR is 6h 46m
  - 53% of system failures are due to Lustre
  - Hardware is not the problem! % of hardware errors are masked by the ECC/Chipkill technique protecting the memory
  - Lustre (parallel filesystem) errors are highly critical
  - Timeout mechanisms used in the failover mechanisms are not adequate at this scale

5 About Blue Waters
- Hybrid (CPU/GPU) Cray architecture delivering 14 PF (peak)
- AMD Opteron nodes
- 724,480 cores
- Cray Gemini network
- Lustre file system over

6 Blue Waters nodes
- Cray XE6 blade (tot ): four nodes, each with two AMD Opteron 6272 (16 cores each); 64 GB DDR3 RAM per node; Gemini communication interface
- Cray XK7 blade (tot 3072): one AMD Opteron 6272 per node; 32 GB DDR3 RAM per node; NVIDIA K20X accelerator with 6 GB on-board GDDR5 RAM and 2272 cores; Gemini communication interface
[Figure: blade layout showing nodes 0-3 with their RAM, the NVIDIA GPU accelerators, voltage regulators, Gemini network ASICs, and the blade controller]

7 User Perceived Availability (1/2)
Example notice: "Blue Waters has experienced a service interruption on the compute and job scheduler resource due to unexpected cabinet power fault. Estimated return to service time is not available yet, but is not currently expected to be before 6AM on 1/18. The BW team"
- Source: notices from the Blue Waters team signaling general failures of Blue Waters (user access or job scheduling affected)
- Time Between Failures = time between two failures
- Time To Repair = time between a failure and the restore
- 46 analyzed, from 03/06/2013 to 01/11/2013
- The result is an estimate of user-perceived availability

8 User Perceived Availability (2/2)
- User-perceived availability: i.e., 1.5 days/month offline
- Uptime: 150 days 3h
- Downtime: 6 days 18h

System-wide statistics
- MTTI = 6 days 16h (min 2 days 21h, max 18 days 4h)
- MTTF = 5 days 12h (min TTF 1h 26min, max TTF 15 days 13h)
- MTTR = 6h 46min (min TTR 26min, max TTR 1 day 7h)

[Figures: two pie charts, titled "System-wide unavailability (reasons)" and "Cause of downtime". The extracted labels fall into two groups, each summing to 100%:
- Failure 76%, Maintenance 13%, Expansion 8%, Queue policy changed 3%
- FileSystem 52%, Scheduler 35%, Power 4%, System reboot 3%, Storage 3%, Globus 3%]
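The availability figure can be cross-checked from the uptime and downtime totals on this slide. The sketch below is not the thesis's own computation (the slides do not state the formula used); it shows two common estimators, the uptime fraction and the steady-state ratio MTTF / (MTTF + MTTR), applied to the slide's numbers. Both land near the reported 95.5% without matching it exactly.

```python
from datetime import timedelta


def availability_from_totals(uptime: timedelta, downtime: timedelta) -> float:
    """Fraction of the observation period during which the system was up."""
    return uptime / (uptime + downtime)


def availability_steady_state(mttf: timedelta, mttr: timedelta) -> float:
    """Classic steady-state estimator MTTF / (MTTF + MTTR)."""
    return mttf / (mttf + mttr)


# Values taken from the slide.
uptime = timedelta(days=150, hours=3)
downtime = timedelta(days=6, hours=18)
mttf = timedelta(days=5, hours=12)
mttr = timedelta(hours=6, minutes=46)

a_totals = availability_from_totals(uptime, downtime)
a_steady = availability_steady_state(mttf, mttr)

print(f"uptime fraction:      {a_totals:.2%}")
print(f"MTTF / (MTTF + MTTR): {a_steady:.2%}")
print(f"offline days per 30-day month: {(1 - a_totals) * 30:.2f}")
```

The two estimators bracket the slide's value, which suggests the reported figure comes from a closely related definition.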

9 Log Analysis Workflow
[Figure: workflow diagram; data volumes shown: 4.4 TB, 640 GB, ~220 GB]

10 Initial Breakdown of Errors
- MTTE (mean time to error): 890 s (almost 15 min)
- 76,419,404 errors grouped in 17,135 error clusters (tuples)
- Machine Check Errors (MCE) are the major cause of errors (55%), but also the least critical
  - Present in tuples; in 92% of those, they are the only cause of errors
- 76% of the other error messages are from Gemini
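Grouping raw error entries into tuples is a time-coalescence step. The slides do not give the exact algorithm or window used (reference 3 describes a multiple time window approach), so the following is only a minimal sliding-window sketch: an event joins the current tuple if it arrives within `window` seconds of the previous event. The function names, the 60-second window, and the sample timestamps are hypothetical.

```python
def coalesce(timestamps, window):
    """Group event timestamps (in seconds) into tuples.

    An event is appended to the current tuple when it occurs at most
    `window` seconds after the previous event; otherwise it starts a new tuple.
    """
    tuples = []
    for t in sorted(timestamps):
        if tuples and t - tuples[-1][-1] <= window:
            tuples[-1].append(t)
        else:
            tuples.append([t])
    return tuples


def mean_time_to_error(tuples):
    """Mean time between the starts of consecutive tuples, or None if fewer than two."""
    starts = [group[0] for group in tuples]
    gaps = [b - a for a, b in zip(starts, starts[1:])]
    return sum(gaps) / len(gaps) if gaps else None


# Hypothetical timestamps, in seconds since the start of the log.
events = [0, 5, 12, 900, 905, 2000]
groups = coalesce(events, window=60)
print(groups)
print(mean_time_to_error(groups))
```

A larger window merges more events into each tuple and therefore raises the measured mean time to error, which is why the choice of window matters for figures such as the 890 s MTTE.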

11 Decoding Machine Check Data
1. Extract: extraction of Machine Check Exceptions (MCE) from the logs
2. Decode: decoding of each MCE
   1. Read the status register
   2. Decode based on the bank, using information from the manual
   3. Add the result from mcelog (AMD)
   4. Decode the status register
3. Analyze: data analysis

Bank mapping
- Bank 0 = Data Cache (DC)
- Bank 1 = Instruction Cache (IC)
- Bank 2 = Bus Unit (BU)
- Bank 3 = Load/Store Unit (LS)
- Bank 4 = Northbridge (NB)
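The decode step can be illustrated with a short sketch. It uses the bank mapping from this slide and the generic x86 machine-check status layout (bit 63 VAL, 62 OVER, 61 UC, 60 EN, 59 MISCV, 58 ADDRV, 57 PCC, error code in bits 15:0). Vendor-specific fields documented in the AMD manual are deliberately left out, and the sample status value is made up for illustration; this is not the thesis's decoder.

```python
# Bank mapping as listed on the slide.
BANKS = {
    0: "Data Cache (DC)",
    1: "Instruction Cache (IC)",
    2: "Bus Unit (BU)",
    3: "Load/Store Unit (LS)",
    4: "Northbridge (NB)",
}

# Generic x86 machine-check status flag positions.
FLAG_BITS = {
    "VAL": 63,    # register contains a valid error
    "OVER": 62,   # an earlier error was overwritten
    "UC": 61,     # error was not corrected by hardware
    "EN": 60,     # error reporting was enabled
    "MISCV": 59,  # the MISC register holds valid data
    "ADDRV": 58,  # the ADDR register holds a valid address
    "PCC": 57,    # processor context may be corrupt
}


def decode_status(bank: int, status: int) -> dict:
    """Decode the generic fields of a machine-check status register."""
    decoded = {name: bool((status >> bit) & 1) for name, bit in FLAG_BITS.items()}
    decoded["unit"] = BANKS.get(bank, f"unknown bank {bank}")
    decoded["error_code"] = status & 0xFFFF
    return decoded


# Hypothetical status value: valid, enabled, address valid, corrected error.
sample = (1 << 63) | (1 << 60) | (1 << 58) | 0x0135
result = decode_status(4, sample)
print(result["unit"], "corrected" if not result["UC"] else "uncorrected",
      hex(result["error_code"]))
```

Counting records with the UC flag set is one way to separate corrected from uncorrectable errors in a later analysis step.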

12 Hardware is resilient!
- 1,544,398 machine check events
- 45.5% of nodes have at least one machine check
- 6,252 machine checks per day on average
- The majority are memory errors (97.70%)
- Only 28 uncorrectable errors (0.002%)
- Chipkill/ECC effective in correcting % of memory errors
- TBF for uncorrectable errors: 292h (12.1 days)
- Node usage patterns affect machine check rates

13 Lustre Error Codes
- Example error: "Transport end-point is not connected"
- Lustre error detection is based on timeouts
- More than 80% of errors are due to timeouts
- The timeout period tends to be long
- Use of distributed locks
- Sensitive to:
  - Network congestion
  - System load
  - Number of clients connected
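Messages such as "Transport end-point is not connected" correspond to standard POSIX error codes (here ENOTCONN). A rough way to measure the share of timeout errors is to tally error codes by symbolic name. The sketch below assumes that each log line carries a negative errno-style return code in an `rc = -N` field; that field name and the sample lines are assumptions for illustration, not taken from the Blue Waters logs.

```python
import errno
import re
from collections import Counter

# Assumed log convention: a negative errno value after "rc =".
RC_PATTERN = re.compile(r"rc\s*=\s*-(\d+)")


def tally_error_codes(lines):
    """Count log lines per symbolic errno name (e.g. ETIMEDOUT, ENOTCONN)."""
    counts = Counter()
    for line in lines:
        match = RC_PATTERN.search(line)
        if match is None:
            continue
        code = int(match.group(1))
        counts[errno.errorcode.get(code, f"errno {code}")] += 1
    return counts


# Hypothetical log lines, built from the platform's own errno values.
sample_lines = [
    f"LustreError: request timed out: rc = -{errno.ETIMEDOUT}",
    f"LustreError: request timed out: rc = -{errno.ETIMEDOUT}",
    f"LustreError: operation failed: rc = -{errno.ENOTCONN}",
    "Lustre: connection restored",
]

counts = tally_error_codes(sample_lines)
total = sum(counts.values())
print(counts)
print(f"timeout share: {counts['ETIMEDOUT'] / total:.0%}")
```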

14 LUSTRE Time To Error: distribution fitting
- Weibull and Fatigue Life are a good fit for the Time To Error (TTE), i.e., the time between two consecutive tuples
- Fatigue Life is used to model the accumulation of errors
- Exponential is not a good fit (small p-value): dependence between consecutive errors
- About 25% of errors happen within 1h after a former error

[Figures: histogram of TBE [h] and cumulative distribution function of TBE [h] for Lustre, each comparing the sample with the Exponential, Fatigue Life, and Weibull (3P) fits]

Distribution    G.O.F. (Kolmogorov)   Critical value    Parameters
Weibull         0.037                 0.044 (α<0.2)     α=0.57, β=4.93
Fatigue Life    0.043                 0.044 (α<0.2)     α=2.14, β=2.16
Exponential     0.2362                0.06 (α<0.01)     λ=0.13, γ=0
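The goodness-of-fit column reports the Kolmogorov statistic: the largest distance between the empirical CDF of the sample and the CDF of the candidate distribution. A minimal sketch of that computation follows. The thesis data are not available here, so the sample is synthetic, drawn from a Weibull distribution with the slide's parameters under the assumption that α is the shape and β the scale; the exponential rate is estimated from the sample mean.

```python
import math
import random


def ks_statistic(sample, cdf):
    """Kolmogorov distance between the empirical CDF of `sample` and `cdf`."""
    xs = sorted(sample)
    n = len(xs)
    distance = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # The empirical CDF jumps from (i - 1) / n to i / n at x.
        distance = max(distance, i / n - f, f - (i - 1) / n)
    return distance


def exponential_cdf(rate):
    return lambda x: 1.0 - math.exp(-rate * x)


def weibull_cdf(shape, scale):
    return lambda x: 1.0 - math.exp(-((x / scale) ** shape))


# Synthetic times between errors in hours (assumption: shape 0.57, scale 4.93).
random.seed(0)
sample = [random.weibullvariate(4.93, 0.57) for _ in range(600)]

d_weibull = ks_statistic(sample, weibull_cdf(0.57, 4.93))
d_exponential = ks_statistic(sample, exponential_cdf(len(sample) / sum(sample)))
print(f"Weibull:     D = {d_weibull:.3f}")
print(f"Exponential: D = {d_exponential:.3f}")
```

A fit is rejected when D exceeds the critical value for the chosen significance level, which is how the table rules out the exponential model while accepting Weibull and Fatigue Life.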

15 Conclusions and Future Work
- Availability is 95.5%, with one critical failure every 6d 16h on average
- Lustre (filesystem) is the reason for 52% of system-wide failures
- Hardware is highly resilient thanks to the use of error-correcting codes: % of coverage
- Lustre is sensitive to timeout errors
- Error times follow a Weibull distribution with α < 1
  - Accumulation of errors (e.g., due to high load) might be responsible for the high error rate
  - Larger systems may need different mechanisms

Future Work
- Improve the analysis by taking into account the impact of errors on user jobs
- Transform logs into actionable information, e.g., use machine learning to predict failures and decide proactive recovery actions and/or to reduce MTTR

16 Backup Slides

17 Increasing Scale of Problems
- Large-scale machines have a large number of components
- i.e., more computing power = more failures!

18 Log Analysis: Objectives
- Logs are just data... processed and analyzed, they become information
- Ideal goals: use logs to measure, quantify, diagnose
[Example shown: the Blue Waters service-interruption notice quoted on slide 7]

19 Resiliency mechanisms in Blue Waters: Hardware Supervisory System (HSS)

20 LUSTRE Hazard Rate
- The more errors have happened, the more will happen
- MTTR: 3.5 hours
- Errors then tend to be memoryless
[Figure: hazard function h(x) versus x for the Exponential, Fatigue Life, and Weibull (3P) fits]
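The hazard rate h(x) = f(x) / (1 - F(x)) makes the clustering behaviour explicit: for a Weibull distribution with shape below 1 the hazard is highest right after an error and then decays, while the exponential hazard is constant (memoryless). The sketch below evaluates the textbook Weibull and Fatigue Life (Birnbaum-Saunders) hazards with the fitted parameters, assuming α is the shape and β the scale in both cases. That reading of the table, and the use of the two-parameter forms, are assumptions.

```python
import math


def weibull_hazard(x, shape, scale):
    """Weibull hazard: (shape / scale) * (x / scale) ** (shape - 1)."""
    return (shape / scale) * (x / scale) ** (shape - 1)


def fatigue_life_hazard(x, shape, scale):
    """Birnbaum-Saunders hazard, computed as pdf / (1 - cdf)."""
    root = math.sqrt(x / scale)
    z = (root - 1.0 / root) / shape
    normal_pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    normal_cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    pdf = (root + 1.0 / root) / (2.0 * shape * x) * normal_pdf
    return pdf / (1.0 - normal_cdf)


# Hours since the previous error.
for hours in (0.5, 1.0, 4.0, 12.0, 24.0):
    w = weibull_hazard(hours, shape=0.57, scale=4.93)
    f = fatigue_life_hazard(hours, shape=2.14, scale=2.16)
    print(f"x = {hours:5.1f} h   Weibull h(x) = {w:.3f}   Fatigue Life h(x) = {f:.3f}")
```

With shape 1 the Weibull hazard reduces to the constant 1 / scale, which is the exponential case the slide contrasts against.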

21 Formulas: Fatigue Life Hazard Rate


More information

Creating A Highly Available Database Solution

Creating A Highly Available Database Solution WHITE PAPER Creating A Highly Available Database Solution Advantage Database Server and High Availability TABLE OF CONTENTS 1 Introduction 1 High Availability 2 High Availability Hardware Requirements

More information

GPU Hardware and Programming Models. Jeremy Appleyard, September 2015

GPU Hardware and Programming Models. Jeremy Appleyard, September 2015 GPU Hardware and Programming Models Jeremy Appleyard, September 2015 A brief history of GPUs In this talk Hardware Overview Programming Models Ask questions at any point! 2 A Brief History of GPUs 3 Once

More information

Event Log based Dependability Analysis of Windows NT and 2K Systems

Event Log based Dependability Analysis of Windows NT and 2K Systems Event Log based Dependability Analysis of Windows NT and 2K Systems Cristina Simache, Mohamed Kaâniche, and Ayda Saidane LAAS-CNRS 7 avenue du Colonel Roche 31077 Toulouse Cedex 4 France {crina, kaaniche,

More information

Beyond The CPU: Defeating Hardware Based RAM Acquisition (part I: AMD case)

Beyond The CPU: Defeating Hardware Based RAM Acquisition (part I: AMD case) Beyond The CPU: Defeating Hardware Based RAM Acquisition (part I: AMD case) Joanna Rutkowska COSEINC Advanced Malware Labs Black Hat DC 2007 February 28 th, 2007, Washington, DC Focus In this presentation

More information

A Very Brief History of High-Performance Computing

A Very Brief History of High-Performance Computing A Very Brief History of High-Performance Computing CPS343 Parallel and High Performance Computing Spring 2016 CPS343 (Parallel and HPC) A Very Brief History of High-Performance Computing Spring 2016 1

More information

New Storage System Solutions

New Storage System Solutions New Storage System Solutions Craig Prescott Research Computing May 2, 2013 Outline } Existing storage systems } Requirements and Solutions } Lustre } /scratch/lfs } Questions? Existing Storage Systems

More information

Evoluzione dell Infrastruttura di Calcolo e Data Analytics per la ricerca

Evoluzione dell Infrastruttura di Calcolo e Data Analytics per la ricerca Evoluzione dell Infrastruttura di Calcolo e Data Analytics per la ricerca Carlo Cavazzoni CINECA Supercomputing Application & Innovation www.cineca.it 21 Aprile 2015 FERMI Name: Fermi Architecture: BlueGene/Q

More information

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB Planet Size Data!? Gartner s 10 key IT trends for 2012 unstructured data will grow some 80% over the course of the next

More information

LSKA 2010 Survey Report Job Scheduler

LSKA 2010 Survey Report Job Scheduler LSKA 2010 Survey Report Job Scheduler Graduate Institute of Communication Engineering {r98942067, r98942112}@ntu.edu.tw March 31, 2010 1. Motivation Recently, the computing becomes much more complex. However,

More information

Lecture 1: the anatomy of a supercomputer

Lecture 1: the anatomy of a supercomputer Where a calculator on the ENIAC is equipped with 18,000 vacuum tubes and weighs 30 tons, computers of the future may have only 1,000 vacuum tubes and perhaps weigh 1½ tons. Popular Mechanics, March 1949

More information

APPLICATIONS OF LINUX-BASED QT-CUDA PARALLEL ARCHITECTURE

APPLICATIONS OF LINUX-BASED QT-CUDA PARALLEL ARCHITECTURE APPLICATIONS OF LINUX-BASED QT-CUDA PARALLEL ARCHITECTURE Tuyou Peng 1, Jun Peng 2 1 Electronics and information Technology Department Jiangmen Polytechnic, Jiangmen, Guangdong, China, typeng2001@yahoo.com

More information

Big Data Challenges In Leadership Computing

Big Data Challenges In Leadership Computing Big Data Challenges In Leadership Computing Presented to: Data Direct Network s SC 2011 Technical Lunch November 14, 2011 Galen Shipman Technology Integration Group Leader Office of Science Computing at

More information

Dell Reliable Memory Technology

Dell Reliable Memory Technology Dell Reliable Memory Technology Detecting and isolating memory errors THIS WHITE PAPER IS FOR INFORMATIONAL PURPOSES ONLY, AND MAY CONTAIN TYPOGRAPHICAL ERRORS AND TECHNICAL INACCURACIES. THE CONTENT IS

More information

VMware Virtual SAN Remote Office / Branch Office Deployment

VMware Virtual SAN Remote Office / Branch Office Deployment SOLUTION OVERVIEW VMware Virtual SAN Radically Simple Storage for Remote and Branch Offices VMware Virtual SAN is VMware s radically simple, enterprise-class, software-defined storage solution for Hyper-Converged

More information

The safer, easier way to help you pass any IT exams. Industry Standard Architecture and Technology. Title : Version : Demo 1 / 5

The safer, easier way to help you pass any IT exams. Industry Standard Architecture and Technology. Title : Version : Demo 1 / 5 Exam : HP2-T16 Title : Industry Standard Architecture and Technology Version : Demo 1 / 5 1.How does single-mode fiber compare with multimode fiber? A. Single mode fiber has a higher bandwidth and lower

More information

The Green Index: A Metric for Evaluating System-Wide Energy Efficiency in HPC Systems

The Green Index: A Metric for Evaluating System-Wide Energy Efficiency in HPC Systems 202 IEEE 202 26th IEEE International 26th International Parallel Parallel and Distributed and Distributed Processing Processing Symposium Symposium Workshops Workshops & PhD Forum The Green Index: A Metric

More information

Energy-aware job scheduler for highperformance

Energy-aware job scheduler for highperformance Energy-aware job scheduler for highperformance computing 7.9.2011 Olli Mämmelä (VTT), Mikko Majanen (VTT), Robert Basmadjian (University of Passau), Hermann De Meer (University of Passau), André Giesler

More information

XenData Product Brief: SX-520 Series Servers for Sony Optical Disc Archives

XenData Product Brief: SX-520 Series Servers for Sony Optical Disc Archives XenData Product Brief: SX-520 Series Servers for Sony Optical Disc Archives The SX-520 Series of Archive Servers creates highly scalable Optical Disc Digital Video Archives that are optimized for broadcasters,

More information

Current Status of FEFS for the K computer

Current Status of FEFS for the K computer Current Status of FEFS for the K computer Shinji Sumimoto Fujitsu Limited Apr.24 2012 LUG2012@Austin Outline RIKEN and Fujitsu are jointly developing the K computer * Development continues with system

More information

Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you?

Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you? In FAST'7: 5th USENIX Conference on File and Storage Technologies, San Jose, CA, Feb. -6, 7. Disk failures in the real world: What does an MTTF of,, hours mean to you? Bianca Schroeder Garth A. Gibson

More information

Accelerating Real Time Big Data Applications. PRESENTATION TITLE GOES HERE Bob Hansen

Accelerating Real Time Big Data Applications. PRESENTATION TITLE GOES HERE Bob Hansen Accelerating Real Time Big Data Applications PRESENTATION TITLE GOES HERE Bob Hansen Apeiron Data Systems Apeiron is developing a VERY high performance Flash storage system that alters the economics of

More information

高 通 量 科 学 计 算 集 群 及 Lustre 文 件 系 统. High Throughput Scientific Computing Clusters And Lustre Filesystem In Tsinghua University

高 通 量 科 学 计 算 集 群 及 Lustre 文 件 系 统. High Throughput Scientific Computing Clusters And Lustre Filesystem In Tsinghua University 高 通 量 科 学 计 算 集 群 及 Lustre 文 件 系 统 High Throughput Scientific Computing Clusters And Lustre Filesystem In Tsinghua University 清 华 信 息 科 学 与 技 术 国 家 实 验 室 ( 筹 ) 公 共 平 台 与 技 术 部 清 华 大 学 科 学 与 工 程 计 算 实 验

More information

InfiniBand Strengthens Leadership as the High-Speed Interconnect Of Choice

InfiniBand Strengthens Leadership as the High-Speed Interconnect Of Choice InfiniBand Strengthens Leadership as the High-Speed Interconnect Of Choice Provides the Best Return-on-Investment by Delivering the Highest System Efficiency and Utilization TOP500 Supercomputers June

More information

Backup & Recovery. 10 Suite PARAGON. Data Sheet. Automatization Features

Backup & Recovery. 10 Suite PARAGON. Data Sheet. Automatization Features PARAGON Backup & Recovery 10 Suite Data Sheet Automatization Features Paragon combines our latest patented technologies with 15 years of expertise to deliver a cutting edge solution to protect home Windows

More information

Mining Invariant Relationships for Failure Analysis of Batch Software Systems

Mining Invariant Relationships for Failure Analysis of Batch Software Systems tesi di laurea magistrale Mining Invariant Relationships for Failure Analysis of Batch Software Systems Anno Accademico 2012/2013 relatori Ch.mo Prof. Stefano Russo Ch.mo Prof. Marcello Cinque correlatori

More information

Upgrading Small Business Client and Server Infrastructure E-LEET Solutions. E-LEET Solutions is an information technology consulting firm

Upgrading Small Business Client and Server Infrastructure E-LEET Solutions. E-LEET Solutions is an information technology consulting firm Thank you for considering E-LEET Solutions! E-LEET Solutions is an information technology consulting firm that specializes in low-cost high-performance computing solutions. This document was written as

More information

One Solution for Real-Time Data protection, Disaster Recovery & Migration

One Solution for Real-Time Data protection, Disaster Recovery & Migration One Solution for Real-Time Data protection, Disaster Recovery & Migration Built-in standby virtualisation server Backs up every 15 minutes up to 12 servers On and Off-site Backup User initialed file, folder

More information

Monitoring DoubleTake Availability

Monitoring DoubleTake Availability Monitoring DoubleTake Availability eg Enterprise v6 Restricted Rights Legend The information contained in this document is confidential and subject to change without notice. No part of this document may

More information

Active-Active and High Availability

Active-Active and High Availability Active-Active and High Availability Advanced Design and Setup Guide Perceptive Content Version: 7.0.x Written by: Product Knowledge, R&D Date: July 2015 2015 Perceptive Software. All rights reserved. Lexmark

More information

Open-E DSS V7 Active-Active vs. Active-Passive

Open-E DSS V7 Active-Active vs. Active-Passive 1 Open-E DSS V7 Active-Active vs. Active-Passive Performance comparison of failover solutions Contents INTRODUCTION 2 HIGH AVAILABILITY OVERVIEW 2 ACTIVE-PASSIVE 3 ACTIVE-ACTIVE 3 TESTING METHODOLOGY AND

More information