Error Characterization of Petascale Machines: A study of the error logs from Blue Waters
1 Master's thesis (tesi di laurea magistrale): Error Characterization of Petascale Machines
- Advisor: Prof. Domenico Cotroneo
- Co-advisors: Prof. Ravishankar K. Iyer, Dr. Catello Di Martino, Prof. Zbigniew Kalbarczyk
- Student: Fabio Baccanico M
2 Bibliography
1. Franck Cappello, Al Geist, Bill Gropp, Laxmikant Kale, Bill Kramer, and Marc Snir. Toward Exascale Resilience. Int. J. High Perform. Comput. Appl. 23, 4 (November 2009).
2. B. Schroeder and G. A. Gibson. A Large-Scale Study of Failures in High-Performance Computing Systems. IEEE Transactions on Dependable and Secure Computing, vol. 7, no. 4, pp. 337-350, Oct.-Dec.
3. Catello Di Martino. One size does not fit all: Clustering supercomputer failures using a multiple time window approach. In International Supercomputing Conference - Supercomputing, volume 7905 of Lecture Notes in Computer Science. Springer Berlin Heidelberg.
4. Franck Cappello. Fault tolerance in petascale/exascale systems: Current knowledge, challenges and research opportunities. Int. J. High Perform. Comput. Appl., 23(3), August.
5. A. Oliner and J. Stearley. What supercomputers say: A study of five system logs. Dependable Systems and Networks, DSN '07. 37th Annual IEEE/IFIP Int. Conference on, June.
6. Y. Liang, A. Sivasubramaniam, J. Moreira, Y. Zhang, R. K. Sahoo, and M. Jette. Filtering failure logs for a BlueGene/L prototype. Proceedings of the 2005 International Conference on Dependable Systems and Networks, Washington, DC, USA. IEEE Computer Society.
7. C. Spritz and A. Koehler. Tips and Tricks for Diagnosing Lustre Problems on Cray Systems. Cray User Group 2011 Proceedings.
8. BIOS and Kernel Developer's Guide for AMD Athlon 64 and AMD Opteron Processors.
3 Context
- Large-scale high-performance computing
- Blue Waters: the fastest supercomputer on a university campus, with 14 petaflops of peak performance
- Trend towards systems with several million CPU cores running up to a billion threads [1]
- Failures are no longer rare events and should be treated as normal events
- Resiliency is one of the major challenges in large-scale HPC infrastructure [1-5]
- First work on analyzing the errors of a petascale machine via log analysis
4 Contribution and Findings
- Behind a petascale machine: how is it made? Hardware and software architecture
- Analysis of petascale machine error logs: a big data problem
- Not all logs are created equal: data management and harmonization
- The needle in the haystack: filtering event logs to find errors
- Even a single bit matters: decoding error codes
- Reliability at petascale, major findings:
- Availability is 95.5%; MTTR is 6h 46m
- 53% of system failures are due to Lustre
- Hardware is not the problem! % of hardware errors are masked by the ECC/Chipkill technique protecting the memory
- Lustre (parallel filesystem) errors are highly critical
- The timeouts used in the failover mechanisms are not adequate at this scale
5 About Blue Waters
- Hybrid (CPU/GPU) Cray architecture delivering 14 PF (peak)
- AMD Opteron nodes
- 724,480 cores
- Cray Gemini network
- Lustre file system over
6 Blue Waters nodes
- Cray XE6 blade (tot ): four nodes, each with two AMD Opteron 6272 (16 cores each), 64 GB DDR3 RAM per node, Gemini communication interface
- Cray XK7 blade (tot 3072): one AMD Opteron 6272 per node, 32 GB DDR3 RAM per node, NVIDIA K20X accelerator (6 GB on-board GDDR5 RAM, 2272 cores), Gemini communication interface
[Figure: blade diagram showing Node 0 to Node 3 with their RAM, the NVIDIA GPU accelerators, the voltage regulators, the Gemini network ASICs and the blade controller]
7 User Perceived Availability (1/2)
Example notice: "Blue Waters has experienced a service interruption on the compute and job scheduler resource due to unexpected cabinet power fault. Estimated return to service time is not available yet, but is not currently expected to be before 6AM on 1/18. The BW team"
- Source: notices from the BW team signaling Blue Waters general failures in which user access or job scheduling was affected
- Time Between Failures = time between two consecutive failures
- Time To Repair = time between a failure and the restoration of service
- 46 analyzed, from 03/06/2013 to 01/11/2013
- The result is an estimate of user-perceived availability
8 User Perceived Availability (2/2)
- User-perceived availability: i.e., 1.5 days/month offline
- Uptime: 150 days 3h
- Downtime: 6 days 18h
System-wide statistics
- MTTI = 6 days 16h (min = 2 days 21h, max = 18 days 4h)
- MTTF = 5 days 12h (min TTF = 1h 26min, max TTF = 15 days 13h)
- MTTR = 6h 46min (min TTR = 26min, max TTR = 1 day 7h)
[Pie chart: Cause of downtime. Failure 76%, Maintenance 13%, Expansion 8%, System Reboot 3%]
[Pie chart: System-wide unavailability (reasons). FileSystem 52%, Scheduler 35%, Queue Policy Changed and Power (4%, 3%), Storage 3%, Globus 3%]
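The availability figure follows from the uptime and downtime totals on this slide. The sketch below is only an illustration of that arithmetic: the function name `availability` and the 30-day month are assumptions, not part of the original analysis. With the totals as transcribed it gives a value close to, but not identical to, the 95.5% reported elsewhere in the slides, which may have been computed over a slightly different window.

```python
from datetime import timedelta


def availability(uptime: timedelta, downtime: timedelta) -> float:
    """Fraction of the observed period during which the system was usable."""
    return uptime / (uptime + downtime)


# Totals as transcribed from the slide.
uptime = timedelta(days=150, hours=3)
downtime = timedelta(days=6, hours=18)

a = availability(uptime, downtime)
# Assumption: a 30-day month, used only to express downtime per month.
offline_days_per_month = (1 - a) * 30

print(f"availability: {a:.1%}")
print(f"offline per 30-day month: {offline_days_per_month:.2f} days")
```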
9 Log Analysis Workflow
[Figure: workflow reducing the log data from 4.4 TB to 640 GB to ~220 GB]
10 Initial Breakdown of Errors
- MTTE (mean time to error): 890 s (almost 15 min)
- 76,419,404 errors grouped into 17,135 error clusters (tuples)
- Machine check errors (MCE) are the major cause of errors (55%), but also the least critical
- In 92% of the tuples in which they appear, MCEs are the only cause of errors
- 76% of the other error messages come from Gemini
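Grouping raw error entries into tuples is commonly done by time coalescence: entries that follow each other closely are treated as one error cluster. The following is a minimal sketch of that idea, not the thesis's actual tool; the function name `coalesce`, the 10-second window and the sample timestamps are all made up for illustration.

```python
def coalesce(timestamps, window):
    """Group timestamps into tuples: a new tuple starts whenever the gap
    to the previous entry exceeds `window` (same unit as the timestamps)."""
    tuples = []
    for t in sorted(timestamps):
        if tuples and t - tuples[-1][-1] <= window:
            tuples[-1].append(t)   # close enough: same error cluster
        else:
            tuples.append([t])     # gap too large: start a new cluster
    return tuples


# Hypothetical error timestamps in seconds (not taken from Blue Waters logs).
errors = [0, 5, 8, 100, 103, 500]
clusters = coalesce(errors, window=10)
print(clusters)
```

The choice of window matters: too small and one incident is split into many tuples, too large and unrelated errors collapse into one.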
11 Decoding Machine Check Data
1. Extract: extraction of Machine Check Exceptions (MCE) from the logs
2. Decode: decoding of the MCE
- Read the status register
- Decode based on the bank, using information from the manual
- Add the result from mcelog (AMD)
- Decode the status register
3. Analyze: data analysis
Banks
- Bank 0 = Data cache (DC)
- Bank 1 = Instruction cache (IC)
- Bank 2 = Bus unit (BU)
- Bank 3 = Load/store unit (LS)
- Bank 4 = Northbridge (NB)
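The decode step can be sketched as below. This is a hedged illustration rather than the decoder used in the thesis: the bank names come from the slide, the flag positions (bits 63 to 57) follow the x86 machine-check status register layout and should be checked against the processor manual [8], and the function name `decode_status` and the sample status value are hypothetical.

```python
# Bank numbering as listed on the slide.
BANKS = {
    0: "Data cache (DC)",
    1: "Instruction cache (IC)",
    2: "Bus unit (BU)",
    3: "Load/store unit (LS)",
    4: "Northbridge (NB)",
}

# High-order flag bits of the machine-check status register
# (assumed layout; verify against the processor manual).
FLAGS = {
    63: "VAL",    # register contains a valid error
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # error was not corrected by hardware
    60: "EN",     # error reporting was enabled
    59: "MISCV",  # MISC register holds valid data
    58: "ADDRV",  # ADDR register holds a valid address
    57: "PCC",    # processor context may be corrupt
}


def decode_status(bank: int, status: int) -> dict:
    """Decode the bank number and the generic fields of a 64-bit status value."""
    flags = [name for bit, name in FLAGS.items() if (status >> bit) & 1]
    return {
        "bank": BANKS.get(bank, f"unknown bank {bank}"),
        "flags": flags,
        "uncorrected": "UC" in flags,
        "error_code": status & 0xFFFF,  # low 16 bits: MCA error code
    }


# Hypothetical status value, made up for this example.
decoded = decode_status(4, 0x9C00000000000135)
print(decoded)
```

Bank-specific fields (for example the memory error details reported by the northbridge) would be decoded in a further step from the manual's per-bank tables.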
12 Hardware is resilient!
- 1,544,398 machine check events
- 45.5% of nodes have at least 1 machine check
- 6252 machine checks per day on average
- The majority are memory errors (97.70%)
- Only 28 uncorrectable errors (0.002%)
- Chipkill/ECC effective in correcting % of memory errors
- TBF for uncorrectable errors: 292h (12.1 days)
- Node usage patterns affect the machine check rates
13 Lustre Error Codes
- Example: "Transport endpoint is not connected"
- Lustre error detection is based on timeouts
- More than 80% of errors are due to timeouts
- Timeout periods tend to be long
- Use of distributed locks
- Sensitive to: network congestion, system load, number of clients connected
14 Lustre Time To Error
- Weibull and Fatigue Life are a good fit for the Time To Error (TTE), i.e., the time between two consecutive tuples
- Fatigue Life is used to model the accumulation of errors
- Exponential is not a good fit: small p-value, indicating dependence between consecutive errors
- About 25% of errors happen within 1h of a previous error
[Figure: distribution fitting for Lustre. Histogram of TBE [h] with Exponential, Fatigue Life and Weibull (3P) fits]
[Figure: cumulative distribution function for Lustre. Sample vs. Exponential, Fatigue Life and Weibull (3P); x axis: TBE [h]]
Distribution    G.O.F. (Kolmogorov)    Critical value    Parameters
Weibull         0.037                  0.044 (α<0.2)     α=0.57, β=4.93
Fatigue Life    0.043                  0.044 (α<0.2)     α=2.14, β=2.16
Exponential     0.2362                 0.06 (α<0.01)     λ=0.13, γ=0
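The goodness-of-fit column reports the Kolmogorov statistic, i.e. the largest gap between the empirical CDF of the sample and the CDF of the candidate distribution. The sketch below shows how that statistic can be computed with the standard library only. It runs on synthetic data, not on the Blue Waters measurements: the sample is drawn from a Weibull distribution using the table's parameters, under the assumption that α is the shape and β the scale, and all function names are illustrative.

```python
import math
import random


def ks_statistic(sample, cdf):
    """Kolmogorov statistic: max distance between the empirical CDF and `cdf`."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


def exponential_cdf(rate):
    return lambda x: 1.0 - math.exp(-rate * x)


def weibull_cdf(shape, scale):
    return lambda x: 1.0 - math.exp(-((x / scale) ** shape))


# Synthetic times between errors in hours (assumption: alpha = shape, beta = scale).
SHAPE, SCALE = 0.57, 4.93
rng = random.Random(0)
# random.weibullvariate takes the scale first, then the shape.
sample = [rng.weibullvariate(SCALE, SHAPE) for _ in range(1000)]

d_weibull = ks_statistic(sample, weibull_cdf(SHAPE, SCALE))
# Exponential candidate with the rate estimated as 1 / sample mean.
d_exp = ks_statistic(sample, exponential_cdf(len(sample) / sum(sample)))
# Asymptotic 5% critical value for large samples.
critical = 1.36 / math.sqrt(len(sample))

print(f"Weibull D = {d_weibull:.3f}, Exponential D = {d_exp:.3f}, "
      f"critical = {critical:.3f}")
```

A candidate is rejected when its statistic exceeds the critical value, which is the pattern the table shows for the exponential fit.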
15 Conclusions and Future Work
- Availability is 95.5%, with one critical failure every 6d 16h on average
- Lustre (filesystem) is the reason for 52% of system-wide failures
- Hardware is highly resilient thanks to the use of error-correcting codes: % of coverage
- Lustre is sensitive to timeout errors
- Error times follow a Weibull distribution with α < 1
- Accumulation of errors (e.g., due to high load) might be responsible for the high error rate
- Larger systems may need different mechanisms
Future Work
- Improve the analysis by taking into account the impact of errors on user jobs
- Transform logs into actionable information, e.g., use machine learning to predict failures and decide proactive recovery actions and/or to reduce MTTR
16 Backup Slides
17 Increasing Scale of Problems
- Large-scale machines have a large number of components, i.e., more computing power = more failures!
18 Log Analysis: Objectives
- Logs are just data; processed and analyzed, they become information
- Ideal goal: use logs to measure, quantify, and diagnose
19 Resiliency mechanisms in Blue Waters: Hardware Supervisory System (HSS)
20 Lustre Hazard Rate
- The more errors have happened, the more will happen
- MTTR 3.5 hours
- Then errors tend to be memoryless
[Figure: hazard function h(x) for Exponential, Fatigue Life and Weibull (3P)]
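The link between a Weibull shape parameter below 1 and clustered errors can be seen from the hazard rate, which then decreases with the time elapsed since the last error. The sketch below assumes a two-parameter Weibull (location 0) and uses shape 0.57 and scale 4.93 as illustrative values; the slides fit a three-parameter variant, so this is an approximation for intuition only.

```python
def weibull_hazard(t: float, shape: float, scale: float) -> float:
    """Hazard rate h(t) = (shape / scale) * (t / scale) ** (shape - 1), for t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def exponential_hazard(rate: float) -> float:
    """The exponential distribution is memoryless: its hazard rate is constant."""
    return rate


# Illustrative parameters (assumption: shape 0.57, scale 4.93 hours).
SHAPE, SCALE = 0.57, 4.93

h_soon = weibull_hazard(0.5, SHAPE, SCALE)    # shortly after an error
h_later = weibull_hazard(24.0, SHAPE, SCALE)  # a day after an error

print(f"h(0.5 h) = {h_soon:.3f} per hour, h(24 h) = {h_later:.3f} per hour")
```

With shape < 1 the hazard is highest right after an error and falls off afterwards, whereas a shape of exactly 1 reduces to the constant hazard of the exponential case.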
21 Fatigue Life Hazard Rate (formulas)
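The formulas on this slide did not survive transcription. For reference, the standard textbook definition of the Fatigue Life (Birnbaum-Saunders) distribution with shape α and scale β is given below; this is an assumption about what the slide showed, and it omits the location parameter γ that a three-parameter variant would add. Here φ and Φ are the standard normal density and distribution function.

```latex
\[
z(x) = \frac{1}{\alpha}\left(\sqrt{\frac{x}{\beta}} - \sqrt{\frac{\beta}{x}}\right),
\qquad x > 0
\]
\[
F(x) = \Phi\bigl(z(x)\bigr),
\qquad
f(x) = \frac{\sqrt{x/\beta} + \sqrt{\beta/x}}{2\,\alpha\,x}\;\varphi\bigl(z(x)\bigr)
\]
\[
h(x) = \frac{f(x)}{1 - F(x)}
\]
```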
More information高 通 量 科 学 计 算 集 群 及 Lustre 文 件 系 统. High Throughput Scientific Computing Clusters And Lustre Filesystem In Tsinghua University
高 通 量 科 学 计 算 集 群 及 Lustre 文 件 系 统 High Throughput Scientific Computing Clusters And Lustre Filesystem In Tsinghua University 清 华 信 息 科 学 与 技 术 国 家 实 验 室 ( 筹 ) 公 共 平 台 与 技 术 部 清 华 大 学 科 学 与 工 程 计 算 实 验
More informationAgenda. Enterprise Application Performance Factors. Current form of Enterprise Applications. Factors to Application Performance.
Agenda Enterprise Performance Factors Overall Enterprise Performance Factors Best Practice for generic Enterprise Best Practice for 3-tiers Enterprise Hardware Load Balancer Basic Unix Tuning Performance
More informationECLIPSE Performance Benchmarks and Profiling. January 2009
ECLIPSE Performance Benchmarks and Profiling January 2009 Note The following research was performed under the HPC Advisory Council activities AMD, Dell, Mellanox, Schlumberger HPC Advisory Council Cluster
More informationINDIA 28-30 September 2011 virtual techdays
Building highly Available Services on Windows Azure Platform Pooja Singh Technical Architect, Accenture Aakash Sharma Technical Lead, Accenture Laxmikant Bhole Senior Architect, Accenture Assumptions You
More informationOnline Event Correlations Analysis in System Logs of Large-Scale Cluster Systems
Online Event Correlations Analysis in System Logs of Large-Scale Cluster Systems Wei Zhou, Jianfeng Zhan, Dan Meng, Zhihong Zhang To cite this version: Wei Zhou, Jianfeng Zhan, Dan Meng, Zhihong Zhang.
More informationWeb Server (Step 1) Processes request and sends query to SQL server via ADO/OLEDB. Web Server (Step 2) Creates HTML page dynamically from record set
Dawn CF Performance Considerations Dawn CF key processes Request (http) Web Server (Step 1) Processes request and sends query to SQL server via ADO/OLEDB. Query (SQL) SQL Server Queries Database & returns
More informationCOM 444 Cloud Computing
COM 444 Cloud Computing Lec 2: Computer Clusters for Scalable Parallel Computing Computer Clusters for Scalable Parallel Computing 1. Clustering for Massive Parallelism 2. Computer Clusters and MPP Architectures
More informationSciDAC Petascale Data Storage Institute
SciDAC Petascale Data Storage Institute Advanced Scientific Computing Advisory Committee Meeting October 29 2008, Gaithersburg MD Garth Gibson Carnegie Mellon University and Panasas Inc. SciDAC Petascale
More informationPetascale Software Challenges. Piyush Chaudhary piyushc@us.ibm.com High Performance Computing
Petascale Software Challenges Piyush Chaudhary piyushc@us.ibm.com High Performance Computing Fundamental Observations Applications are struggling to realize growth in sustained performance at scale Reasons
More informationEnergy-aware job scheduler for highperformance
Energy-aware job scheduler for highperformance computing 7.9.2011 Olli Mämmelä (VTT), Mikko Majanen (VTT), Robert Basmadjian (University of Passau), Hermann De Meer (University of Passau), André Giesler
More informationParallel Programming Survey
Christian Terboven 02.09.2014 / Aachen, Germany Stand: 26.08.2014 Version 2.3 IT Center der RWTH Aachen University Agenda Overview: Processor Microarchitecture Shared-Memory
More informationData on Kernel Failures and Security Incidents
Data on Kernel Failures and Security Incidents Ravishankar K. Iyer (W. Gu, Z. Kalbarczyk, G. Lyle, A. Sharma, L. Wang ) Center for Reliable and High-Performance Computing Coordinated Science Laboratory
More information