High Performance Computing Trey Breckenridge Computing Systems Manager Engineering Research Center Mississippi State University
What is High Performance Computing? HPC is ill-defined and context dependent. In the late 1980s, the US Government defined supercomputers as processors capable of more than 100 MFlops. This definition is clearly obsolete, as modern desktop PCs are capable of ~5 GFlops. Another approach is to describe HPC as the fastest computers at any point in time; however, that is more a budget-dependent definition. For the intent of this presentation, we will define HPC as: Computing resources which provide at least an order of magnitude more computing power than is normally available on a desktop computer.
What does the definition really mean? That definition sounds like HPC is hardware only. Isn't the software important too? HPC covers the full range of supercomputing activities, including existing supercomputer systems, special-purpose and experimental systems, and the new generation of large-scale parallel architectures. HPC exists on a broad range of computer systems, from departmental clusters of desktop workstations to large parallel processing systems.
Why High Performance Computing? To achieve the maximum amount of computation in a minimum amount of time SPEED! To solve problems that couldn't otherwise be solved without large computer systems. Traditionally, HPC has been used in scientific and engineering fields for work with massively complex simulations. Computations are typically floating-point intensive.
Areas of HPC Use Traditional: Computational Fluid Dynamics (CFD) Climate, Weather, and Ocean Modeling and Simulation (CWO) Nuclear Modeling and Simulation Geophysical/Petroleum Modeling Emerging: Computer Graphics/Scientific Visualization Financial Modeling Database Applications Bioinformatics Biomedical
Parallel Computing A collection of processing elements that can communicate and cooperate to solve large problems more quickly than a single processing element. Simultaneous use of multiple processors to execute different parts of a program. Goal: to reduce the wall-clock time of a run. No single processor is ever again likely to match the performance of existing parallel HPC systems: HPC => Parallel
Types of Parallelism Overt: Parallelism is visible to the programmer. May be difficult to program (correctly). Large improvements in performance. Covert: Parallelism is not visible to the programmer. The compiler is responsible for parallelism. Easy to do. Small improvements in performance are typical.
Speed Up Speed up is one quantitative measure of the benefit of parallelism. Speed up is defined as S / T(N), where S = best serial time and T(N) = time required with N processors. Since S/N is the best possible parallel time, speed up typically should not exceed N. S is sometimes difficult to measure, causing many people to substitute T(1) for S.
Types of Speed Up
Efficiency Speed up does not measure how efficiently the processors are being used. Is it worth using 100 processors to get a speed up of 2? Efficiency is defined as the ratio of the speed up to the number of processors required to achieve it. The best efficiency is 1; in reality, it is between 0 and 1.
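Both metrics can be computed directly from measured run times. A minimal sketch, using the slide's own question as the example (the timings are made-up illustrative numbers):

```python
def speedup(serial_time, parallel_time):
    """S / T(N): best serial time divided by the time on N processors."""
    return serial_time / parallel_time

def efficiency(serial_time, parallel_time, n_procs):
    """Speed up divided by the number of processors; 1.0 is ideal."""
    return speedup(serial_time, parallel_time) / n_procs

# Hypothetical job: 100 s serially, 50 s on 100 processors.
s = speedup(100.0, 50.0)           # speed up of 2
e = efficiency(100.0, 50.0, 100)   # efficiency of only 0.02
print(s, e)
```

An efficiency of 0.02 makes the answer to the question concrete: 98% of the machine's capacity is wasted on that run.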
HPC Architecture and Design
Vector and Scalar Processors Vector: large rows of data are operated on simultaneously. Scalar: data is operated on in a sequential fashion. Instruction sets: Complex Instruction Set Computer (CISC), Reduced Instruction Set Computer (RISC), Post-RISC or CISC/RISC (UltraSPARC, IBM POWER4, IA-64).
Scalar vs. Vector Arithmetic

    DO 10 i = 1,n
      a(i) = b(i) + c(i)
    10 CONTINUE

Scalar: a(1) = b(1) + c(1), a(2) = b(2) + c(2), ..., a(n) = b(n) + c(n): n instructions. Vector: a = b + c: one vector instruction.
Where is Scalar better? If the vector length is small If the loop contains IF statements If partial vectorization involves large overhead If recursion is used Small budget for capital expenditures!
Architectural Classifications Published by Flynn in 1972 (Flynn's Taxonomy). Outdated, but still widely used. Categorizes machines by instruction streams and data streams: a stream of instructions (the algorithm) tells the computer what to do; a stream of data (the input) is affected by these instructions. Four categories: SISD Single Instruction, Single Data; MISD Multiple Instruction, Single Data; SIMD Single Instruction, Multiple Data; MIMD Multiple Instruction, Multiple Data.
SISD Single Instruction, Single Data Conventional single processor computers Each arithmetic instruction initiates an operation on a data item taken from a single stream of data elements. Historical supercomputers and most contemporary microprocessors are SISD
SIMD Single Instruction, Multiple Data Many simple processing elements (1000s). Each processor has its own local memory. Each processor runs the same program. Each processor processes different data streams. All processors work in lock-step (synchronously). Very efficient for array/matrix operations. Most older vector/array computers are SIMD. Example machines: Cray YMP, Thinking Machines' CM-200
MISD Multiple Instruction, Single Data Very few machines fit this category None have been commercially successful or have had any impact on computational science
MIMD Multiple Instruction, Multiple Data Most diverse of the four classifications Multiple processors Each processor either has own, or accesses shared, memory Each processor can run the same or different programs Each processor processes different data streams Processors can work synchronously or asynchronously
MIMD cont. Processors can be either tightly or loosely coupled. Examples include: Processors and memory units specifically designed to be components of a parallel architecture (e.g., Intel Paragon). Large-scale parallel machines built from off-the-shelf workstations (e.g., Beowulf cluster). Small-scale multiprocessors made by connecting multiple vector processors together (e.g., Cray T90). Wide variety of other designs as well.
SPMD Computing Not a Flynn category, per se, but instead a combination of categories. SPMD stands for Single Program, Multiple Data. The same program is run on the processors of an MIMD machine. Occasionally the processors may synchronize. Because an entire program is executed on separate data, it is possible that different branches are taken, leading to asynchronous parallelism. SPMD came about as a desire to do SIMD-like calculations on an MIMD machine. SPMD is not a hardware paradigm, but instead the software equivalent of SIMD.
Memory Classifications Organization: Shared Memory (SM-MIMD) Bus based, Interconnection network. Distributed Memory (DM-MIMD) Local, Message passing. Virtual Shared Memory (VSM-MIMD) Physically distributed, but appears as one image. Access: Uniform Memory Access (UMA) All processors take the same time to reach all memory locations. Non-Uniform Memory Access (NUMA) Access time depends on which memory location a processor references.
Memory Organization Shared Memory One common memory block between all processors. Bus based: since the bus has limited bandwidth, the number of processors which can be used is limited to a few tens of processors. Examples include typical multiprocessor PCs, SGI Challenge
Memory Organization Switch based Utilizes a (complex) interconnection network to connect processors to shared memory modules. May use multi-stage networks (NUMA). Increases bandwidth to memory over bus-based systems. Every processor still has access to global memory. Examples include Sun E10000
Memory Organization Distributed Memory Message Passing. Memory physically distributed through the machine. Each processor has private memory. Contents of private memory can only be accessed by that processor. If required by another processor, then it must be sent explicitly. In general, machines can be scaled to thousands of processors. Requires special programming techniques. Examples include Cray T3E, IBM SP
Memory Organization Virtual Shared Memory Objective is to have the scalability of distributed memory with the programmability of shared memory Global address space mapped onto physically distributed memory Data moves between processors on demand or as it is accessed
Compute Clusters Connecting multiple standalone machines via a network interconnect, utilizing software to access the combined systems as one computer. The standalone machines could be inexpensive single-processor workstations or multi-million dollar multiprocessor servers. Individual machines can be connected via numerous networking technologies using a variety of topologies. 100BaseT Ethernet: inexpensive, low performance, high latency. Myrinet (2 Gb/s): expensive, high performance, low latency. Proprietary high-speed networks. Nearly 20% of the 500 fastest supercomputers in the world are clusters.
Beowulf Clusters First developed in 1994 at NASA Goddard. Goal is to build a supercomputer utilizing a large number of inexpensive, commodity off-the-shelf (COTS) parts. Increasingly used for HPC applications due to the high cost of MPPs and the wide availability of networked workstations. Not a panacea for HPC: many applications require shared memory or vector solutions. Existing Beowulf clusters range from 2 to 4000 processors and are likely to reach 10,000 processors in the near future.
Metacomputing Metacomputing is a dynamic environment with an informal pool of nodes that can join or leave the environment whenever they desire (e.g., SETI@HOME). Why do we need metacomputing? Our computational needs are infinite; our financial resources are finite. Someday we will utilize computing cycles just like we utilize electricity from the power company, buying cycles on an as-needed basis. Commonly referred to as The Grid or Computational Grids
Job Execution Most HPC systems do not allow interactive access. Batch-style jobs are submitted to the system via a queuing mechanism. Schedulers determine the order in which jobs should be run. Factors include User priority Resource availability The goal of the Scheduler is to maximize system utilization. Scheduler optimization is an important component and is a field of study of its own.
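The queue-plus-scheduler flow can be sketched in a few lines. This is a toy priority scheduler (names and priorities are invented; real schedulers such as PBS or LSF also weigh resource availability, fair share, and backfill):

```python
import heapq

class Scheduler:
    """Toy batch scheduler: jobs leave the queue in priority order,
    with FIFO ordering among jobs of equal priority."""
    def __init__(self):
        self._queue = []
        self._counter = 0   # submission order, used as a tie-break

    def submit(self, name, priority):
        # heapq is a min-heap, so negate priority: larger number runs first.
        heapq.heappush(self._queue, (-priority, self._counter, name))
        self._counter += 1

    def next_job(self):
        return heapq.heappop(self._queue)[2]

sched = Scheduler()
sched.submit("cfd_run", priority=1)
sched.submit("climate_model", priority=5)
sched.submit("test_job", priority=1)
order = [sched.next_job() for _ in range(3)]
print(order)   # ['climate_model', 'cfd_run', 'test_job']
```

Even this sketch shows why scheduling is a field of its own: pure priority ordering can starve low-priority jobs, which is one reason production schedulers add aging and fair-share policies.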
HPC Software
Programming Languages It has been said, "I don't know what language they will be using to program high performance computers 10 years from now, but we do know it will be called FORTRAN." C and C++ are making strides in the HPC community due to their ability to create complex data structures and better I/O routines. FORTRAN 90 incorporated many of the features of C (e.g., pointers). High Performance Fortran (HPF) is FORTRAN 90 with directive-based extensions allowing for shared and distributed memory machines: clusters, traditional supercomputers, and massively parallel processors. Today, many programmers prefer to do their data structures, communications, etc. in C, while doing the computations in FORTRAN.
Compilers Compilers are an often overlooked area of HPC, but are of critical importance. Application run times are directly related to the ability of the compiler to produce highly optimized code. Poor compiler optimization could result in run times increasing by an order of magnitude. Optimization Levels None, Basic, Interprocedural analysis, Runtime profile analysis, Floating-point, Data flow analysis, Advanced
Distributed Memory Parallel Programming Message passing is a programming paradigm where one effectively writes multiple programs for parallel execution. The problem must be decomposed, typically by domain or function Each process knows only about its own local data. If data is required from a different process, it must send a message to that process asking for the data Access to remote data is much slower than to local data, so a major objective is to minimize remote communications.
Message Passing Environments PVM Parallel Virtual Machine Portable and operable across heterogeneous computers Performance sacrificed for flexibility Well defined protocol allows for interoperability between different implementations MPI Message Passing Interface Today s standard for message passing Widely adopted by most vendors Portable and operable across heterogeneous computers Good performance with reasonable efficiency No standard for interoperability between implementations
Shared Memory Parallel Programming Every processor has direct access to the memory of every other processor in the system. Not widely used at the programmer level, but widely used at the system level (even on single-processor systems via multithreading). Allows low-latency, high-bandwidth communications. Portability is poor. Easy to program (compared to message passing). Directive-controlled parallelism
Shared Memory Environments POSIX Threads (Pthreads) SHMEM OpenMP Quickly becoming the standard API for shared memory programming Emphasis on performance and scalability Allows for fine-grain or coarse-grain parallelism Some implementations are interoperable with MPI and PVM Message Passing
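The shared-address-space model behind these environments can be illustrated with ordinary threads. This is a minimal Python sketch (threading standing in for OpenMP-style shared memory; it is not an OpenMP binding), showing direct access to common data with a lock ordering the updates:

```python
import threading

total = 0
lock = threading.Lock()

def accumulate(values):
    """Every thread reads and writes the same variable directly; the lock
    serializes the updates, much as an OpenMP critical section would."""
    global total
    for v in values:
        with lock:
            total += v

threads = [threading.Thread(target=accumulate, args=([i] * 100,))
           for i in (1, 2, 3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(total)   # 600
```

Contrast this with the message-passing model: no data is ever "sent", but without the lock the concurrent updates would race, which is the classic hazard of shared-memory programming.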
Benchmarking Benchmarking is an important aspect of HPC and is used for purchase decisions, system configuration, and application tuning. Rule 1: All vendors lie about their benchmarks!! Purchase decisions should not be based on published benchmark results. If at all possible, run your code on the exact machine you are considering for purchase. LINPACK The mother of all benchmarks. Not originally designed to be a benchmark, but rather a set of high-performance library routines for linear algebra. Reports average megaflop rates by dividing the total number of floating-point operations by time. Used for the TOP500 Supercomputing Sites report
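The LINPACK rate calculation is simple arithmetic once the operation count is known: solving a dense n-by-n linear system costs roughly (2/3)n^3 + 2n^2 floating-point operations. A sketch (the 0.5 s timing is a made-up example, not a measured result):

```python
def linpack_flops(n):
    """Approximate flop count for solving a dense n-by-n linear system,
    the operation count the LINPACK benchmark is based on."""
    return (2.0 / 3.0) * n**3 + 2.0 * n**2

def mflops(n, seconds):
    """Average megaflop rate: total operations divided by run time."""
    return linpack_flops(n) / seconds / 1e6

# Hypothetical run: a 1000x1000 system solved in 0.5 s.
rate = mflops(1000, 0.5)
print(round(rate, 1))   # 1337.3 MFlops
```

Note this is an *average* rate over the whole solve, which is why LINPACK numbers flatter machines relative to real application performance.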
www.top500.org
Summary HPC is parallel computing. HPC involves a broad spectrum of components, and is only as fast as the weakest component, whether that be processor, memory, network interconnect, compiler, or software. HPC exists on a broad range of computer systems, from departmental clusters of desktop workstations to large parallel processing systems.
Additional Information Dowd, Kevin and Severance, Charles. High Performance Computing, Second Edition. O'Reilly & Associates, Inc., 1998. Dongarra, Jack. High Performance Computing: Technology, Methods and Applications. Elsevier, 1995. Buyya, Rajkumar. High Performance Cluster Computing, Volume 1. Prentice Hall PTR, 1999. Foster, Ian and Kesselman, Carl. The Grid: Blueprint for a New Computing Infrastructure. Morgan Kaufmann Publishers, Inc., 1999.