An Overview of High-Performance Computing and Challenges for the Future
The 2006 International Conference on Computational Science and its Applications (ICCSA 2006)

An Overview of High-Performance Computing and Challenges for the Future
Jack Dongarra, University of Tennessee and Oak Ridge National Laboratory

Overview
- Look at the current state of high performance computing
- Past, present, and a look ahead
- Potential gains by exploiting lower precision devices: GPUs, Cell, SSE2, AltiVec
- New performance evaluation tools: HPCS HPC Challenge
TOP500: H. Meuer, H. Simon, E. Strohmaier, & J. Dongarra
- Listing of the 500 most powerful computers in the world
- Yardstick: Rmax from the LINPACK MPP benchmark: Ax = b, dense problem, TPP performance
- Updated twice a year: SCxy in the States in November, meeting in Germany in June
- All data available from the TOP500 web site

Performance Development
[Figure: TOP500 performance over time on a log scale from 100 Mflop/s to 1 Pflop/s, plotting the list SUM (reaching 2.3 PF/s), N=1, and N=500 (59.7 GF/s and 0.4 GF/s on the first list), with the Fujitsu 'NWT' (NAL), Intel ASCI Red (Sandia), IBM ASCI White (LLNL), NEC Earth Simulator, IBM BlueGene/L, and 'My Laptop' marked.]
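The LINPACK yardstick converts the time for a dense Ax = b solve into a flop rate using the benchmark's fixed operation count of 2/3 n^3 + 2 n^2. A minimal sketch of that bookkeeping, using NumPy's LAPACK-backed solver as a stand-in for HPL (the problem size and the scaled-residual check are illustrative choices, not the benchmark's rules):

```python
import time
import numpy as np

def linpack_like_gflops(n=2000, seed=0):
    """Time a dense solve Ax = b and convert to Gflop/s with the
    2/3*n^3 + 2*n^2 operation count assumed by the LINPACK benchmark."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((n, n))
    b = rng.standard_normal(n)
    t0 = time.perf_counter()
    x = np.linalg.solve(A, b)          # LU factorization plus triangular solves
    elapsed = time.perf_counter() - t0
    flops = (2.0 / 3.0) * n**3 + 2.0 * n**2
    # Basic sanity check on the computed solution
    resid = np.linalg.norm(A @ x - b) / (np.linalg.norm(A) * np.linalg.norm(x))
    return flops / elapsed / 1e9, resid

if __name__ == "__main__":
    gflops, resid = linpack_like_gflops()
    print(f"{gflops:.1f} Gflop/s, scaled residual {resid:.2e}")
```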
Architecture/Systems Continuum
- Tightly coupled: custom processor with custom interconnect (Cray X1, NEC SX-8, IBM Regatta, IBM Blue Gene/L)
  - Best processor performance for codes that are not cache friendly
  - Good communication performance
  - Simpler programming model
  - Most expensive
- Hybrid: commodity processor with custom interconnect (SGI Altix with Intel Itanium 2, Cray XT3/XD1 with AMD Opteron, NEC TX7, IBM eServer, Dawning)
  - Good communication performance
  - Good scalability
- Loosely coupled: commodity processor with commodity interconnect (clusters of Pentium, Itanium, Opteron, Alpha with GigE, Infiniband, Myrinet, Quadrics)
  - Best price/performance (for codes that work well with caches and are latency tolerant)
  - More complex programming model
[Figure: share of custom, hybrid, and commodity systems in the TOP500 from Jun-93 to Jun-04, stacked from 0% to 100%.]

Processor Types
[Figure: processor families in the TOP500 over time (0 to 500 systems): SIMD, Sparc, Vector, MIPS, Alpha, HP, AMD, IBM Power, Intel; Intel + IBM Power + AMD = 91%.]
Interconnects / Systems
[Figure: interconnect families in the TOP500 over time: Cray Interconnect, SP Switch, Crossbar, Quadrics, Infiniband, Myrinet (101 systems), Gigabit Ethernet (249 systems), others, and N/A; GigE + Myrinet = 70%.]

Parallelism in the TOP500
[Figure: distribution of systems by processor count, including bins 4k-8k, 8k-16k, 16k-32k, 32k-64k, and 64k-128k.]
26th List: The TOP10
1. IBM BlueGene/L (eServer Blue Gene), DOE Lawrence Livermore Nat Lab, USA, custom
2. IBM BGW (eServer Blue Gene), IBM Thomas Watson Research, USA, custom
3. IBM ASC Purple (Power5), DOE Lawrence Livermore Nat Lab, USA, custom
4. SGI Columbia (Altix, Itanium/Infiniband), NASA Ames, USA, 2004, hybrid
5. Dell Thunderbird (Pentium/Infiniband), DOE Sandia Nat Lab, USA, commod
6. Cray Red Storm (Cray XT3, AMD), DOE Sandia Nat Lab, USA, hybrid
7. NEC Earth-Simulator (SX), Earth Simulator Center, Japan, 2002, custom
8. IBM MareNostrum (PPC 970/Myrinet), Barcelona Supercomputer Center, Spain, commod
9. IBM eServer Blue Gene, ASTRON / University Groningen, Netherlands, custom
10. Cray Jaguar (Cray XT3, AMD), DOE Oak Ridge Nat Lab, USA, hybrid

Performance Projection
[Figure: extrapolation of TOP500 performance (SUM, N=1, N=500, and 'My Laptop') on a log scale from 100 Mflop/s to 1 Eflop/s, with the DARPA HPCS target marked.]
A PetaFlop Computer by the End of the Decade
- 10 companies working on building a Petaflop system by the end of the decade:
  - Cray, IBM, Sun
  - Dawning, Galactic, Lenovo (Chinese companies)
  - Hitachi, NEC (Life Simulator, 10 Pflop/s), Fujitsu (Japanese companies)
  - Bull

Increasing the number of gates into a tight knot and decreasing the cycle time of the processor
[Figure: surface power density (W/cm^2) on a log scale from the 486 to the Pentium, passing the hot-plate level and heading toward nuclear-reactor and rocket-nozzle levels, with the Sun's surface near 10,000 W/cm^2. Source: Pat Gelsinger, Intel Developer Forum, Spring (Pentium at 90 W).]
- Square relationship between the cycle time and power.
Lower Voltage, Increase Clock Rate & Transistor Density
[Figure: a single core with cache evolving into two, four, and many cores (C1-C4), each group sharing cache.]
- We have seen increasing numbers of gates on a chip and increasing clock speed.
- Heat is becoming an unmanageable problem; Intel processors exceed 100 Watts.
- We will not see the dramatic increases in clock speeds in the future.
- However, the number of gates on a chip will continue to increase.

Free Lunch / No Free Lunch
- Free lunch for traditional software: it just runs twice as fast every 18 months with no change to the code.
- No free lunch for traditional software: without highly concurrent software it won't get any faster.
[Figure: operations per second for serial code along a projected single-core path (3, 6, 12, 24 GHz) versus multicore parts (2, 4, and 8 cores at 3 GHz); the additional operations per second are available only if the code can take advantage of concurrency. From Craig Mundie, Microsoft.]
CPU Desktop Trends: Change is Coming
- Relative processing power will continue to double every 18 months
- 256 logical processors per chip late in the decade
[Figure: projected hardware threads per chip and cores per processor chip by year.]

Commodity Processor Trends: Bandwidth/Latency is the Critical Issue, not FLOPS ("Got Bandwidth?")
- Single-chip floating-point performance: +59% per year; 4 GFLOP/s (2006), 32 GFLOP/s (2010), 3300 GFLOP/s (2020)
- Front-side bus bandwidth: +23% per year; 1 GWord/s = 0.25 word/flop (2006), 3.5 GWord/s = 0.11 word/flop (2010), 27 GWord/s = 0.008 word/flop (2020)
- DRAM latency: improving only about 5.5% per year; 70 ns = 280 FP ops = 70 loads (2006), 50 ns = 1600 FP ops = 170 loads (2010), 28 ns = 94,000 FP ops = 780 loads (2020)

Source: Getting Up to Speed: The Future of Supercomputing, National Research Council, 222 pages, 2004, National Academies Press, Washington DC.
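The derived columns in this table are simple ratios: words per flop is bus bandwidth divided by flop rate, and latency expressed in FP ops (or loads) is latency multiplied by the flop (or word) rate. A small sketch that recomputes them from the raw numbers on the slide (treating those raw numbers as the only inputs; the 2020 row lands close to, but not exactly at, the slide's rounded figures):

```python
# Recompute the derived columns of the commodity-processor trends table.
# Raw inputs per year: peak flop rate (GFLOP/s), front-side bus bandwidth
# (GWord/s), and DRAM latency (ns).
trends = {
    2006: {"gflops": 4.0,    "gwords": 1.0,  "latency_ns": 70.0},
    2010: {"gflops": 32.0,   "gwords": 3.5,  "latency_ns": 50.0},
    2020: {"gflops": 3300.0, "gwords": 27.0, "latency_ns": 28.0},
}

for year, v in trends.items():
    words_per_flop = v["gwords"] / v["gflops"]
    # FP ops (and loads) that could have issued while one memory access is outstanding
    fp_ops_per_access = v["latency_ns"] * v["gflops"]
    loads_per_access = v["latency_ns"] * v["gwords"]
    print(f"{year}: {words_per_flop:.3f} word/flop, "
          f"{fp_ops_per_access:,.0f} FP ops and {loads_per_access:,.0f} loads per DRAM access")
```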
That Was the Good News
- Bad news: the effect of the hardware change on the existing software base
- Must rethink the design of our software
- Another disruptive technology: rethink and rewrite the applications, algorithms, and software

LAPACK - ScaLAPACK
- Numerical libraries for linear algebra
- LAPACK: late 1980s, sequential machines and SMPs
- ScaLAPACK: early 1990s, message passing systems
Right-Looking LU Factorization (LAPACK)
- DGETF2: unblocked LU of the current panel
- DLASWP: row swaps
- DTRSM: triangular solve with many right-hand sides
- DGEMM: matrix-matrix multiply

Steps in the LAPACK LU
- DGETF2 (LAPACK)
- DLASWP (LAPACK)
- DLASWP (LAPACK)
- DTRSM (BLAS)
- DGEMM (BLAS)
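A minimal NumPy sketch of this right-looking blocked factorization, with each phase labeled by the LAPACK/BLAS routine it mirrors (the block size, the pure-Python panel loop, and the dense triangular solve are illustrative simplifications, not LAPACK's implementation):

```python
import numpy as np

def blocked_lu(A, nb=64):
    """Right-looking blocked LU with partial pivoting (illustrative sketch).
    Returns the packed L/U factors and the pivot vector."""
    A = np.array(A, dtype=float)
    n = A.shape[0]
    piv = np.arange(n)
    for k in range(0, n, nb):
        kb = min(nb, n - k)
        # DGETF2: unblocked, partially pivoted LU of the panel A[k:, k:k+kb]
        for j in range(k, k + kb):
            p = j + int(np.argmax(np.abs(A[j:, j])))
            piv[j] = p
            if p != j:                                   # swap rows within the panel
                A[[j, p], k:k + kb] = A[[p, j], k:k + kb]
            A[j + 1:, j] /= A[j, j]
            A[j + 1:, j + 1:k + kb] -= np.outer(A[j + 1:, j], A[j, j + 1:k + kb])
        # DLASWP: apply the panel's row interchanges left and right of the panel
        for j in range(k, k + kb):
            p = piv[j]
            if p != j:
                A[[j, p], :k] = A[[p, j], :k]
                A[[j, p], k + kb:] = A[[p, j], k + kb:]
        if k + kb < n:
            # DTRSM: U12 = L11^{-1} A12, unit lower triangular solve with the panel
            L11 = np.tril(A[k:k + kb, k:k + kb], -1) + np.eye(kb)
            A[k:k + kb, k + kb:] = np.linalg.solve(L11, A[k:k + kb, k + kb:])
            # DGEMM: trailing update A22 -= L21 @ U12, where the bulk of the flops go
            A[k + kb:, k + kb:] -= A[k + kb:, k:k + kb] @ A[k:k + kb, k + kb:]
    return A, piv
```

Because almost all of the O(n^3) work lands in the final DGEMM update, the factorization runs at close to matrix-multiply speed when that update is done by an optimized BLAS.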
LU Timing Profile
[Figure: time spent in each component (DGETF2, DLASWP(L), DLASWP(R), DTRSM, DGEMM) for LAPACK with threaded BLAS; 1D decomposition, SGI Origin.]

LU Timing Profile
[Figure: the same breakdown comparing LAPACK with threaded BLAS against an explicitly threaded LU with no lookahead; 1D decomposition, SGI Origin.]
- In this case the performance difference comes from parallelizing the row exchanges (DLASWP) and from threading the LU algorithm itself.
Right-Looking LU Factorization
[Figure: execution profile of the right-looking LU factorization.]

Right-Looking LU with a Lookahead
[Figure: execution profile with the next panel factored ahead of the trailing update.]
Pivot Rearrangement and Lookahead
[Figure: runs on 4 processors with varying lookahead depth.]

Pivot Rearrangement and Lookahead
[Figure: runs on a 16-processor SMP.]
Motivated by the Cell
- The PlayStation 3's CPU is based on a chip codenamed "Cell"
- Each Cell contains 8 APUs
  - An APU is a self-contained vector processor which acts independently from the others
  - 4 floating point units capable of a total of 32 Gflop/s (8 Gflop/s each)
  - 256 Gflop/s peak!
- 32-bit floating point; 64-bit floating point at 25 Gflop/s
- IEEE format, but in 32-bit mode it only rounds toward zero and sets overflow to the largest representable value
- According to IBM, the SPE's double precision unit is fully IEEE 854 compliant
- "Datapaths lite"

GPU Performance
- NVIDIA 6800 Ultra: 60 GFLOPS (32-bit)
- NVIDIA 7800 GTX: 200 GFLOPS (32-bit)
- ATI X1900 XTX: 400 GFLOPS (32-bit)
- 64-bit performance must be emulated in software
Idea: Something Like This
- Exploit 32-bit floating point as much as possible, especially for the bulk of the computation
- Correct or update the solution with selective use of 64-bit floating point to provide a refined result
- Intuitively:
  - compute a 32-bit result,
  - calculate a correction to the 32-bit result using selected higher precision, and
  - perform the update of the 32-bit result with the correction using higher precision.

32 and 64 Bit Floating Point Arithmetic
- Iterative refinement for dense systems can work this way:
  - Solve Ax = b in lower precision, saving the factorization (L*U = P*A); O(n^3)
  - Compute, in higher precision, r = b - A*x; O(n^2) (requires the original data A, stored in high precision)
  - Solve Az = r using the lower precision factorization; O(n^2)
  - Update the solution x = x + z using high precision; O(n)
  - Iterate until converged.
- Wilkinson, Moler, Stewart, & Higham provide error bounds for single precision floating point results when double precision is used in this way.
- We can show that with this approach the solution is computed to 64-bit floating point precision.
  - Requires extra storage: the total is 1.5 times normal
  - O(n^3) work is done in lower precision
  - O(n^2) work is done in higher precision
  - Problems arise if the matrix is ill-conditioned in single precision (condition number around 10^8)
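A compact SciPy sketch of this single/double refinement loop for a dense system (the test matrix, tolerance, and iteration cap are illustrative choices; the factorization is done once in single precision and reused for every correction):

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

def mixed_precision_solve(A, b, tol=1e-12, max_iter=30):
    """Solve Ax = b with the O(n^3) LU in float32 and O(n^2) refinement in float64."""
    A32 = A.astype(np.float32)
    lu, piv = lu_factor(A32)                                   # single precision factorization
    x = lu_solve((lu, piv), b.astype(np.float32)).astype(np.float64)
    for _ in range(max_iter):
        r = b - A @ x                                          # residual with the original double data
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            break
        z = lu_solve((lu, piv), r.astype(np.float32)).astype(np.float64)
        x += z                                                 # correction applied in double
    return x

# Example on a reasonably well-conditioned random system
rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 1000)) + 1000.0 * np.eye(1000)
b = rng.standard_normal(1000)
x = mixed_precision_solve(A, b)
print(np.linalg.norm(A @ x - b) / np.linalg.norm(b))
```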
On the Way to Understanding How to Use the Cell, Something Else Happened
- We realized we have a similar situation on our commodity processors: SP is 2X as fast as DP on many systems
- The Intel Pentium and AMD Opteron have SSE2: 2 flops/cycle DP, 4 flops/cycle SP
- The IBM PowerPC has AltiVec: 8 flops/cycle SP, 4 flops/cycle DP (no DP on AltiVec itself)

[Table: performance of single and double precision matrix multiply (SGEMM and DGEMM, GFlop/s) and the SP/DP speedup with n = m = k = 1000, for Pentium III Katmai (0.6 GHz), Pentium III CopperMine (0.9 GHz), Pentium Xeon Northwood (2.4 GHz), Pentium Xeon Prescott (3.2 GHz), Pentium IV Prescott (3.4 GHz), and AMD Opteron 240 (1.4 GHz), all with Goto BLAS, and PowerPC G5 (2.7 GHz) with AltiVec.]

[Table: speedups (ratios of times) - DGEMM/SGEMM, DP solve/SP solve, DP solve/iterative refinement, and iteration counts - for Cray X1 (libsci), Intel Pentium IV-M Northwood (Goto), Intel Pentium III Katmai (Goto), Intel Pentium III Coppermine (Goto), Intel Pentium IV Prescott (Goto), AMD Opteron (Goto), Sun UltraSPARC IIe (Sunperf), IBM PowerPC G5 2.7 GHz (VecLib), Compaq Alpha EV6 (CXML), IBM SP Power3 (ESSL), and SGI Octane (ATLAS).]

[Table: the same ratios for parallel runs - AMD Opteron with Goto BLAS and Open MPI over Myrinet MX - as a function of processor count and problem size.]
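The SP/DP ratio is easy to reproduce by timing single and double precision matrix multiply directly; a NumPy sketch (the matrix size, repetition count, and best-of-N timing policy are arbitrary choices):

```python
import time
import numpy as np

def gemm_gflops(dtype, n=1000, reps=5):
    """Best-of-reps GFlop/s for an n x n matrix multiply in the given precision."""
    rng = np.random.default_rng(0)
    a = rng.standard_normal((n, n)).astype(dtype)
    b = rng.standard_normal((n, n)).astype(dtype)
    best = float("inf")
    for _ in range(reps):
        t0 = time.perf_counter()
        a @ b                                  # SGEMM or DGEMM via the underlying BLAS
        best = min(best, time.perf_counter() - t0)
    return 2.0 * n**3 / best / 1e9

sgemm = gemm_gflops(np.float32)
dgemm = gemm_gflops(np.float64)
print(f"SGEMM {sgemm:.1f} GFlop/s, DGEMM {dgemm:.1f} GFlop/s, SP/DP speedup {sgemm / dgemm:.2f}x")
```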
Refinement Technique Using Single/Double Precision
- Linear systems: LU (dense and sparse), Cholesky, QR factorization
- Eigenvalue problems: symmetric eigenvalue problem, SVD
  - Same idea as with dense systems: reduce to tridiagonal/bidiagonal form in lower precision, retain the original data, and improve with an iterative technique that uses the lower precision to solve systems and the higher precision to compute residuals with the original data; O(n^2) per value/vector
- Iterative linear systems: relaxed GMRES, inner/outer scheme
- LAPACK Working Note

Motivation for Additional Benchmarks
- Linpack Benchmark
  - The good: one number; simple to define and easy to rank; allows the problem size to change with the machine and over time
  - The bad: emphasizes only peak CPU speed and number of CPUs; does not stress local bandwidth; does not stress the network; does not test gather/scatter; ignores Amdahl's Law (only does weak scaling)
  - The ugly: benchmarketeering hype
- From the Linpack Benchmark and the Top500: no single number can reflect overall performance
- Clearly we need something more than Linpack: the HPC Challenge Benchmark
  - A test suite that stresses not only the processors, but the memory system and the interconnect
  - The real utility of the HPCC benchmarks is that architectures can be described with a wider range of metrics than just Flop/s from Linpack
DARPA's High Productivity Computing Systems
[Figure: program roadmap - concept study, advanced design & prototypes, and full scale development leading to petascale systems; half-way point Phase 2 technology assessment review; vendors plus a productivity team; validated procurement evaluation methodology; test and new evaluation frameworks; Phase 1 $10M, Phase 2 (2003- ) $50M, Phase 3 $100M.]

HPC Challenge Benchmark: Goals
- Stress CPU, memory system, and interconnect
- Complement the Top500 list
- Provide benchmarks that bound the performance of many real applications as a function of memory access characteristics, e.g., spatial and temporal locality
- Allow for optimizations; record the effort needed for tuning
- Base run requires MPI and BLAS
- Provide verification of results
- Archive results
Tests on Single Processor and System
- Local: only a single processor is performing computations.
- Embarrassingly Parallel: each processor in the entire system is performing computations, but they do not communicate with each other explicitly.
- Global: all processors in the system are performing computations and they explicitly communicate with each other.

HPC Challenge Benchmark
Consists of basically 7 benchmarks; think of it as a framework or harness for adding benchmarks of interest.
1. HPL (LINPACK): MPI Global (Ax = b)
2. STREAM: Local (single CPU); *STREAM: Embarrassingly Parallel
3. PTRANS (A = A + B^T): MPI Global
4. RandomAccess (random integer read, update, and write): Local (single CPU); *RandomAccess: Embarrassingly Parallel; RandomAccess: MPI Global
5. Bandwidth and Latency: MPI (between processors i and k)
6. FFT: Global, single CPU, and EP
7. Matrix Multiply: single CPU and EP
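Two of these kernels bracket the memory-access spectrum: the STREAM triad is pure contiguous streaming, while RandomAccess is a read-modify-write at essentially random table locations. A small NumPy illustration of the two access patterns (the table size and the xor-update rule are simplified stand-ins for the official specifications, not the benchmark codes):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1_000_000

# STREAM triad: a = b + alpha * c  (three long contiguous streams, high spatial locality)
alpha = 3.0
b = rng.standard_normal(N)
c = rng.standard_normal(N)
a = b + alpha * c

# RandomAccess-style updates: xor values into random table entries
# (irregular access, essentially no spatial or temporal locality)
table = np.zeros(N, dtype=np.uint64)
updates = rng.integers(0, 2**63, size=N, dtype=np.uint64)
idx = (updates % np.uint64(N)).astype(np.intp)
np.bitwise_xor.at(table, idx, updates)    # unbuffered, so repeated indices are handled

print(a[:3], table[:3])
```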
Computational Resources and HPC Challenge Benchmarks
- Computational resources: CPU computational speed, memory bandwidth, node interconnect bandwidth

Computational Resources and HPC Challenge Benchmarks
- CPU computational speed: HPL, Matrix Multiply
- Memory bandwidth: STREAM
- Node interconnect bandwidth: Random & Natural Ring Bandwidth & Latency
Memory Access Patterns
[Figure: example applications placed on axes of spatial locality (low to high) and temporal locality (low to high): Computational Fluid Dynamics, Traveling Salesperson, Radar Cross Section, Digital Signal Processing.]

Memory Access Patterns
[Figure: the same axes with the HPCC tests placed on them: STREAM (EP & SP) and PTRANS (G) high in spatial but low in temporal locality; RandomAccess (G, EP, & SP) low in both; FFT (G, EP, & SP) in between; HPL/Linpack (G) and Matrix Multiply (EP & SP) high in both.]
[Figure: HPC Challenge results web pages.]
Summary of Current Unmet Needs
- Performance / portability
- Fault tolerance
- Memory bandwidth/latency
- Adaptability: some degree of autonomy to self optimize, test, or monitor; able to change mode of operation (static or dynamic)
- Better programming models
  - Global shared address space
  - Visible locality
  - Maybe coming soon (incremental, yet offering real benefits): Global Address Space (GAS) languages (UPC, Co-Array Fortran, Titanium, Chapel, X10, Fortress)
    - Minor extensions to existing languages
    - More convenient than MPI
    - Have performance transparency via explicit remote memory references
- What's needed is a long-term, balanced investment in hardware, software, algorithms, and applications in the HPC Ecosystem.

Real Crisis With HPC Is With The Software
- Our ability to configure a hardware system capable of 1 PetaFlop (10^15 ops/s) is without question just a matter of time and money.
- A supercomputer application and its software are usually much longer-lived than the hardware
  - Hardware life is typically five years at most; applications live far longer
  - Fortran and C are the main programming models (still!!)
- The REAL CHALLENGE is software
  - Programming hasn't changed since the 70s
  - HUGE manpower investment
  - MPI: is that all there is?
  - Often requires HERO programming
  - Investment in the entire software stack is required (OS, libraries, etc.)
- Software is a major cost component of modern technologies
  - The tradition in HPC system procurement is to assume that the software is free
  - SOFTWARE COSTS (over and over)
- What's needed is a long-term, balanced investment in the HPC Ecosystem: hardware, software, algorithms, and applications.
Collaborators / Support
- Top500 Team: Erich Strohmaier, NERSC; Hans Meuer, Mannheim; Horst Simon, NERSC
- Sca/LAPACK: Julien Langou, Jakub Kurzak, Piotr Luszczek, Stan Tomov, Julie Langou