Introduction to High Performance Cluster Computing: Cluster Training for UCL, Part 1
What is HPC?
- HPC = High Performance Computing (includes supercomputing)
- HPCC = High Performance Cluster Computing (note: these are NOT High Availability clusters)
- HPTC = High Performance Technical Computing
- The ultimate aim of HPC users is to max out the CPUs!
Agenda
- Parallel Computing Concepts
- Clusters
- Cluster Usage
Concurrency and Parallel Computing
A central concept in computer science is concurrency: computing in which multiple tasks are active at the same time. There are many ways to use concurrency:
- Concurrency is key to all modern operating systems as a way to hide latencies.
- Concurrency can be used together with redundancy to provide high availability.
- Parallel computing uses concurrency to decrease program run times.
HPC systems are based on parallel computing.
Hardware for Parallel Computing
Parallel computers are classified in terms of streams of instructions and streams of data:
- MIMD computers: multiple streams of instructions acting on multiple streams of data.
- SIMD computers: a single stream of instructions acting on multiple streams of data.
Parallel hardware comes in many forms:
- On chip: instruction-level parallelism (e.g. IPF)
- Multicore: multiple execution cores inside a single CPU
- Multiprocessor: multiple processors inside a single computer
- Multicomputer: networks of computers working together
Hardware for Parallel Computing (taxonomy)
- Parallel Computers
  - Single Instruction Multiple Data (SIMD)
  - Multiple Instruction Multiple Data (MIMD)
    - Shared address space: Symmetric Multiprocessor (SMP), Non-Uniform Memory Architecture (NUMA)
    - Disjoint address space: Massively Parallel Processor (MPP), Cluster, Distributed Computing
What is an HPC Cluster?
A cluster is a type of parallel or distributed processing system consisting of a collection of interconnected stand-alone computers working together cooperatively as a single, integrated computing resource. A typical cluster uses:
- Commodity off-the-shelf (COTS) parts
- Low-latency communication protocols between the disjoint address spaces (memory)
What is HPCC?
A typical cluster brings together:
- Master node
- File server / gateway
- Compute nodes
- Cluster management tools
Cluster Architecture View (the layered stack)
- Application: parallel benchmarks (Perf, Ring, HINT, NAS), real applications
- Middleware: shmem, MPI, PVM (see the MPI example below)
- OS: Linux, other OSes
- Protocol: TCP/IP, VIA, proprietary
- Interconnect: Ethernet, Quadrics, InfiniBand, Myrinet
- Hardware: desktop, workstation, 1P/2P server, 4U+ server
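The middleware layer is the one most users program against. As a concrete illustration of the MPI layer in this stack, a minimal MPI program in C is sketched below. It is not taken from any of the benchmarks listed above, and the compile/launch commands in the comment are typical examples that vary from site to site.

```c
/* Minimal MPI example: each process reports its rank and the node it runs on.
 * Typically compiled with an MPI wrapper (e.g. mpicc hello.c -o hello) and
 * launched with the site's launcher (e.g. mpirun -np 4 ./hello); exact
 * commands vary by installation. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's id (0..size-1) */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes */
    MPI_Get_processor_name(name, &len);     /* which node this rank landed on */

    printf("Rank %d of %d running on %s\n", rank, size, name);

    MPI_Finalize();
    return 0;
}
```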
Cluster Hardware: The Node
A node is a single element within the cluster.
- Compute node: just computes, little else; private IP address, no user access
- Master/head/front-end node: user login, job scheduler; public IP address, connects to the external network
- Management/administrator node: systems/cluster management functions; secure administrator address
- I/O node: access to data; generally internal to the cluster or to the data centre
Interconnect
Interconnect            Typical latency (usec)   Typical bandwidth (MB/s)
100 Mbit/s Ethernet     75                       80
1 Gbit/s Ethernet       60-90                    90
10 Gbit/s Ethernet      12-20                    800
Myricom Myrinet         2.2-3                    2500
InfiniBand              2-4                      1400-2500
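A rough first-order model makes these numbers concrete: the time to move a message is approximately

$$ T_{\text{message}} \approx T_{\text{latency}} + \frac{\text{message size}}{\text{bandwidth}} $$

Using mid-range figures from the table, a 1 MB message takes roughly 60 usec + 1 MB / 90 MB/s, about 11 ms, over Gigabit Ethernet, but only about 3 usec + 1 MB / 2000 MB/s, about 0.5 ms, over InfiniBand. For small messages the latency term dominates, which is why tightly coupled applications favour the low-latency interconnects.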
Agenda
- Parallel Computing Concepts
- Clusters
- Cluster Usage
Cluster Usage
- Performance Measurements
- Usage Model
- Application Classification
- Application Behaviour
The Mysterious FLOPS
- 1 GFLOPS = 1 billion floating point operations per second
- Theoretical vs real GFLOPS on a Xeon processor (worked example below):
  - Theoretical peak = 4 x clock speed per core
  - Xeons have 128-bit SSE registers, allowing the processor to carry out 2 double precision floating point adds and 2 multiplies per clock cycle
  - 2 computational cores per processor, 2 processors per node (4 cores per node)
  - Sustained (Rmax) = ~35-80% of theoretical peak (interconnect dependent)
- You'll NEVER hit peak!
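As a worked example, taking a 3.0 GHz clock (the figure quoted for Woodcrest two slides below):

$$ R_{\text{peak}} = 4\,\tfrac{\text{FLOP}}{\text{cycle}} \times 3.0\,\text{GHz} \times 2\ \text{cores} \times 2\ \text{sockets} = 48\ \text{GFLOPS per node} $$

At 35-80% of peak, the sustained (Rmax) figure for such a node would be roughly 17-38 GFLOPS, depending largely on the interconnect.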
Other Measures of CPU Performance
- SPEC
  - SPEC CPU2000/2006 base: single-core performance indicator
  - SPEC CPU2000/2006 rate: node performance indicator
  - SPECfp: floating point performance
  - SPECint: integer performance
- Many other performance metrics may be required:
  - STREAM: memory bandwidth (see the sketch below)
  - HPL: High Performance Linpack
  - NPB: a suite of performance tests
  - Pallas Parallel Benchmark: another suite
  - IOzone: file system throughput
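To show what a memory-bandwidth metric such as STREAM actually measures, here is a minimal sketch of a STREAM-style "triad" kernel in C. This is not the official STREAM benchmark: the array size, single-pass timing and clock() timer are simplifying assumptions, and the arrays must be far larger than the caches for the figure to approximate main-memory bandwidth.

```c
/* Sketch of a STREAM-style "triad" kernel: a[i] = b[i] + s*c[i].
 * Bandwidth is estimated as bytes moved / elapsed time.
 * Not the official STREAM code; array size and timer are illustrative. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 20000000L                 /* 3 arrays x 160 MB, well beyond cache sizes */

int main(void)
{
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));
    if (!a || !b || !c) return 1;

    for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    clock_t t0 = clock();
    for (long i = 0; i < N; i++)
        a[i] = b[i] + 3.0 * c[i];   /* 2 reads + 1 write per element */
    double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;

    double mbytes = 3.0 * N * sizeof(double) / 1.0e6;   /* bytes moved, in MB */
    printf("Triad: ~%.0f MB/s (a[0]=%g)\n", mbytes / secs, a[0]);

    free(a); free(b); free(c);
    return 0;
}
```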
Technology Advancements in 5 Years
Codename    Release date   GHz   Cores   Peak FLOP per CPU cycle   Peak GFLOPS per CPU   Linpack on 256 processors (GFLOPS)
Westmere    Nov 2009       3.0   6       4                         72                    14500
Woodcrest   June 2006      3.0   2       4                         24                    4781
* From November 2001 Top500 supercomputer list (cluster of Dell Precision 530)
** Intel internal cluster built in 2006
Usage Model
Capacity: many serial jobs ("normal mixed usage")
- Typical workloads: electronic design, Monte Carlo, design optimisation, parallel search
- Many users; mixed-size parallel/serial jobs
- Ability to partition and allocate jobs to nodes for best performance
- Job scheduling very important
Capability: one big parallel job ("appliance usage")
- Typical workloads: meteorology, seismic analysis, fluid dynamics, molecular chemistry
- Batch usage; load balancing more important
- Interconnect more important
Application and Usage Model
HPC clusters run parallel applications, and applications in parallel! A single application can take advantage of multiple computing platforms in several ways:
- Fine-grained application: uses many systems to run one application; shares data heavily across systems. Example: PDVR3D (eigenvalues and eigenstates of a matrix)
- Coarse-grained application: uses many systems to run one application; infrequent data sharing among systems. Example: Casino (Monte Carlo stochastic methods)
- Pleasingly parallel/HTC application: an instance of the entire application runs on each node; little or no data sharing among compute nodes. Example: BLAST (pattern matching)
A shared memory machine will run all sorts of applications.
Types of Applications
- Forward Modelling
- Inversion
- Signal Processing
- Searching/Comparing
Forward Modelling
- Solving linear equations
- Grid based
- Parallelised by domain decomposition: split and distribute the data (see the sketch below)
- Finite element/finite difference
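A minimal sketch, in C with MPI, of what 1D domain decomposition with halo (ghost-cell) exchange looks like for a finite-difference update. The grid size, the 3-point update rule and the step count are placeholders for illustration, not taken from any particular code.

```c
/* 1D domain decomposition with halo exchange: each rank owns a slab of the
 * grid plus one ghost cell on each side, and swaps boundary values with its
 * neighbours before every update. */
#include <mpi.h>
#include <stdlib.h>

#define NLOCAL 1000               /* interior points owned by each rank */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int left  = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
    int right = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

    /* u[0] and u[NLOCAL+1] are ghost cells holding the neighbours' edge values */
    double *u    = calloc(NLOCAL + 2, sizeof(double));
    double *unew = calloc(NLOCAL + 2, sizeof(double));

    for (int step = 0; step < 100; step++) {
        /* exchange halos: send my edge values, receive the neighbours' edges */
        MPI_Sendrecv(&u[1],          1, MPI_DOUBLE, left,  0,
                     &u[NLOCAL + 1], 1, MPI_DOUBLE, right, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&u[NLOCAL],     1, MPI_DOUBLE, right, 1,
                     &u[0],          1, MPI_DOUBLE, left,  1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* simple 3-point stencil update on the interior points */
        for (int i = 1; i <= NLOCAL; i++)
            unew[i] = 0.5 * u[i] + 0.25 * (u[i - 1] + u[i + 1]);

        double *tmp = u; u = unew; unew = tmp;
    }

    free(u); free(unew);
    MPI_Finalize();
    return 0;
}
```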
Inversion
From measurements (F) compute models (M) representing properties (d) of the measured object(s).
- Deterministic: matrix inversions, conjugate gradient
- Stochastic: Monte Carlo, Markov chain, genetic algorithms
- Generally large amounts of shared memory
- Parallelism through multiple runs with different models (see the sketch below)
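A minimal sketch of the "multiple independent runs" pattern in C with MPI: every rank evaluates its own randomised trials with its own seed, and a single reduction combines the results. Estimating pi stands in here for a real misfit or likelihood evaluation; the seeds, trial counts and use of rand() are illustrative only (a production code would use a proper parallel random number generator).

```c
/* Stochastic parallelism: independent randomised runs per rank, combined
 * with one reduction at the end. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    srand(12345u + rank);                 /* different stream per rank (illustrative) */

    long trials = 1000000, hits = 0;
    for (long i = 0; i < trials; i++) {
        double x = (double)rand() / RAND_MAX;
        double y = (double)rand() / RAND_MAX;
        if (x * x + y * y <= 1.0) hits++;   /* point falls inside the quarter circle */
    }

    long total = 0;
    MPI_Reduce(&hits, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("pi estimate: %f from %ld samples\n",
               4.0 * total / ((double)trials * size), trials * size);

    MPI_Finalize();
    return 0;
}
```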
Signal Processing/Quantum Mechanics
- Convolution model (stencil); see the sketch below
- Matrix computations (eigenvalues, ...)
- Conjugate gradient methods (matrix methods)
- Normally not very demanding on latency and bandwidth
- Some algorithms are embarrassingly parallel
- Examples: seismic migration/processing, medical imaging, SETI@Home
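A convolution/stencil kernel is just a weighted sum over a sliding window; a minimal ("valid"-mode) 1D version in C is sketched below. The filter length and data layout are illustrative assumptions; real codes work on 2D/3D volumes and often use FFT-based methods instead.

```c
/* Minimal 1D "valid" convolution: out[i] = sum_k filt[k] * in[i+k].
 * Each output point depends only on a local neighbourhood of the input,
 * which is why this class of work parallelises so easily. */
void convolve1d(const double *in, long n,
                const double *filt, int flen,
                double *out)          /* out must hold n - flen + 1 values */
{
    for (long i = 0; i + flen <= n; i++) {
        double acc = 0.0;
        for (int k = 0; k < flen; k++)
            acc += filt[k] * in[i + k];
        out[i] = acc;
    }
}
```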
Searching/Comparing
- Integer operations are more dominant than floating point
- I/O intensive
- Pattern matching
- Embarrassingly parallel: very suitable for grid computing (see the sketch below)
- Example uses: encryption/decryption, message interception, bioinformatics, data mining
- Example codes: BLAST, HMMER
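A sketch of the embarrassingly parallel pattern for search/compare workloads, assuming the input has already been split into one shard file per rank. The file naming (shard_<rank>.txt) and the literal pattern are made up for illustration; BLAST and HMMER do far more sophisticated matching.

```c
/* Embarrassingly parallel search: each rank scans its own shard of the data
 * and counts matches; the counts are summed with a single reduction. */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char fname[64];
    snprintf(fname, sizeof fname, "shard_%d.txt", rank);   /* hypothetical per-rank input */

    long count = 0;
    char line[4096];
    FILE *fp = fopen(fname, "r");
    if (fp) {
        while (fgets(line, sizeof line, fp))
            if (strstr(line, "GATTACA"))   /* placeholder pattern */
                count++;
        fclose(fp);
    }

    long total = 0;
    MPI_Reduce(&count, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("total matches: %ld\n", total);

    MPI_Finalize();
    return 0;
}
```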
Application Classes
- FEA (Finite Element Analysis)
  - Simulation of hard physical materials, e.g. metal, plastic
  - Crash testing, product design, suitability for purpose
  - Examples: MSC Nastran, Ansys, LS-DYNA, Abaqus, ESI PAM-CRASH, Radioss
- CFD (Computational Fluid Dynamics)
  - Simulation of soft physical materials, gases and fluids
  - Engine design, airflow, oil reservoir modelling
  - Examples: Fluent, STAR-CD, CFX
- Geophysical sciences
  - Seismic imaging: taking echo traces and building a picture of the subsurface geology
  - Reservoir simulation: CFD specific to oil asset management
  - Examples: Omega, Landmark VIP and ProMAX, GeoQuest Eclipse
Application Classes (continued)
- Life sciences
  - Understanding the living world: genome matching, protein folding, drug design, bioinformatics, organic chemistry
  - Examples: BLAST, Gaussian and others
- High energy physics
  - Understanding the atomic and sub-atomic world
  - Software from Fermilab or CERN, or home-grown
- Financial modelling
  - Meeting internal and external financial targets, particularly regarding investment positions
  - VaR (Value at Risk): assessing the impact of economic and political factors on the bank's investment portfolio
  - Trader risk analysis: what is the risk on a trader's position, or on a group of traders' positions?