David Rioja Redondo
Telecommunication Engineer
Englobe Technologies and Systems
About me
David Rioja Redondo, Telecommunication Engineer - Universidad de Alcalá
>2 years building and managing clusters:
UPM - School of Aeronautic and Space Engineers (280 cores)
Universidade de Coimbra (56 cores)
Universidad Carlos III de Madrid (176 cores)
Universidad de Castilla-La Mancha (96 cores)
Network and storage projects
Microsoft Certified Technology Specialist
About Englobe
Origins in a world-reference CFD research group; building clusters since 2002 (Aeolos)
HPC Department: computation, high-performance networks, storage systems, service and support
Microsoft HPC Partner
Contents
Basics on supercomputing (knowing the problem): the power issue, parallelization, performance
HPC hardware (alternatives and what they mean for us): architectures, types of parallel machine, high-performance networks, storage
HPC management (how to control the machine): manager's wishlist, data center strategies, examples & solutions
BASICS ON SUPERCOMPUTING (knowing the problem)
The power issue (or «why parallel?»)
Old limit: size (MOS capacitance effect)
New limit: dissipation
Let's increase clock frequency: the extra heat must leave by convection, $\frac{dq}{dt} = h A_s (T_s - T_e)$ with $h = F(\text{fluid}, \text{surface})$, and air-conditioning power is limited. Oops!
Let's put smaller components instead: the heat must first conduct out of the die, $\frac{dq_x}{dt} = -k A \frac{\partial T}{\partial x}$, so dissipation limits this path too.
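The slide does not spell out why a higher clock frequency raises the heat load; a standard CMOS approximation (the symbols $\alpha$, $C$, $V$, $f$ are not from the slide) makes the link explicit:

$$P_{\text{dyn}} \approx \alpha\, C\, V^2 f$$

Raising $f$ (and the supply voltage $V$ needed to sustain it) increases the power that the convection and conduction equations above must carry away, and that removal capacity is exactly what is limited.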
Parallelization
Multicore processors: profit from smaller devices by adding more circuits (diagram: one CPU package with several cores, each with its own cache)
Multiple sockets: multiple processors in a server, sharing resources (memory, I/O) through the chipset
Parallelization
Concurrent programming, fork-join paradigm (diagram: a big problem is forked into parallel threads, which are then joined into the solution)
Developing: OpenMP, threading libraries
Processes (threads) accessing the same memory space: mutexes and semaphores needed
Most applications exploit multicore systems, but this is usually not enough
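A minimal fork-join sketch with OpenMP, shown only to make the paradigm concrete (the array size and the summation workload are illustrative, not from the slides):

```c
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void)
{
    static double data[N];
    double sum = 0.0;

    for (int i = 0; i < N; i++)
        data[i] = 0.5 * i;

    /* Fork: the iterations are split among a team of threads that share
       the same memory.  The reduction clause gives each thread a private
       partial sum and combines them at the implicit join, so no explicit
       mutex or semaphore is needed in this particular case. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += data[i];

    /* Join: execution is sequential again; sum holds the combined result. */
    printf("sum = %.1f (up to %d threads)\n", sum, omp_get_max_threads());
    return 0;
}
```

Compiled with, e.g., gcc -fopenmp, the same binary uses however many cores the node offers; as the slide notes, this shared-memory approach alone is usually not enough.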
Parallelization
Distributed memory system: put N servers to work together (diagram: nodes joined by a network)
More alternatives in the next block
Parallelization
Message Passing Interface (MPI)
Each process owns its memory; processes «ask» for data belonging to others
Multiple implementations; MS-MPI is based on MPICH2
(diagram: a big problem is partitioned, the pieces are run with mpirun, and the partial results are merged into the solution)
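For the distributed-memory case, a minimal partition/merge sketch in C with MPI (the vector-sum workload is an illustrative assumption, not from the slides):

```c
#include <mpi.h>
#include <stdio.h>

#define N 1000000

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Partition: each process works only on its own slice of the index
       space; there is no shared memory, every rank owns its local data. */
    double local = 0.0;
    for (int i = rank; i < N; i += size)
        local += 0.5 * i;

    /* Merge: partial results are combined on rank 0 by message passing. */
    double total = 0.0;
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("total = %.1f (computed by %d processes)\n", total, size);

    MPI_Finalize();
    return 0;
}
```

Run with something like mpirun -np 8 ./example (or mpiexec under MS-MPI); data only moves between ranks through explicit calls such as MPI_Reduce.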
Performance
Speedup: $S_u = \frac{T_s}{T_p}$
Speedup of «N»: where N means... cores? $? watts?
Cores: developer's choice, but not valid to compare different architectures
$: not only acquisition (TCO, developing cost, waiting time while running)
Power: related to TCO, but a compromise solution is needed (example: Atom vs. Xeon)
Performance also depends on the application
Performance
Causes of performance loss:
Sequential: $S_u = 1$
Embarrassingly parallel: $S_u \approx N$
Process messaging: $S_u < N$
Unbalanced system: $S_u = ?$
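These cases are the usual reading of Amdahl's law; the formula below is not on the slide, but it is the model the next example applies, with $s$ the serial time and $p$ the parallelizable time:

$$T_p(N) = s + \frac{p}{N}, \qquad S_u(N) = \frac{s + p}{s + p/N} \xrightarrow[N \to \infty]{} \frac{s + p}{s}$$

However large $N$ gets, the speedup is capped by the serial fraction; messaging overheads and load imbalance only push it further below that bound.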
Example 1: performance problem
Agilent ADS, Momentum solver: RF circuit simulation
Calculating 70 frequency points, taking 8:48 minutes on an 8-core server, using 1 process with threads
Strange behaviour: 8 threads taking longer than 4

Threads   User       Wall       User [s]   Wall [s]
1         00:14:41   00:14:43   881        883
2         00:16:43   00:09:29   1003       569
4         00:20:26   00:08:11   1226       491
8         00:24:25   00:08:48   1465       528

How large is the parallel part? Model the user time as $s_N + p_N$ and the wall time as $s_N + p_N/N$:
$s_1 + p_1 = 881$ (a single run cannot separate them)
$s_2 + p_2 = 1003$, $s_2 + p_2/2 = 569$, so $s_2 = 135$, $p_2 = 868$
$s_4 + p_4 = 1226$, $s_4 + p_4/4 = 491$, so $s_4 = 246$, $p_4 = 980$
$s_8 + p_8 = 1465$, $s_8 + p_8/8 = 528$, so $s_8 \approx 394$, $p_8 \approx 1071$
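The per-thread values above come from solving the two equations in each row; the closed form is not written on the slide but follows directly, with $U_N$ the user time and $W_N$ the wall time:

$$p_N = \frac{N\,(U_N - W_N)}{N - 1}, \qquad s_N = U_N - p_N$$

For $N = 2$: $p_2 = \frac{2\,(1003 - 569)}{1} = 868$ s and $s_2 = 1003 - 868 = 135$ s, matching the slide. Note that both $s_N$ and $p_N$ grow with the thread count, which is the overhead behind the strange behaviour.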
Example 1: performance solution
Problem: cores not being exploited 100%; a single iteration is too short
Not one problem of 9 minutes, but 70 little problems of ~8 seconds each
(plots: CPU usage per iteration, dropping from 100% towards 12.5% over the 8-24 s axis; 70 such iterations before the change vs. 9 after)
Analysing the process: iterations were independent (no need for data from previous frequencies)
Perfect for a parametric sweep (a sketch of the idea follows below)
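A sketch of the parametric-sweep idea, not of the actual queue-system setup used with Momentum: because the frequency points are independent, even the simplest split over MPI processes works (solve_frequency_point is a hypothetical placeholder for launching one solver run):

```c
#include <mpi.h>
#include <stdio.h>

#define NPOINTS 70  /* independent frequency points, as in the example */

/* Hypothetical placeholder: a real job would invoke the solver for
   frequency index i instead of printing a message. */
static void solve_frequency_point(int i)
{
    printf("solving frequency point %d\n", i);
}

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Independence means no communication is needed at all: each rank
       simply takes every size-th point.  This is the parametric sweep. */
    for (int i = rank; i < NPOINTS; i += size)
        solve_frequency_point(i);

    MPI_Finalize();
    return 0;
}
```

In the actual solution the sweep was handled by the queue system (next slide), but the principle is the same: independent points can be dispatched to whatever cores and nodes are free.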
Example 1: performance results
Momentum interacting with a queue system
Hybrid solution: jobs used as many cores as available
Tried up to 8 parametric tasks; this mode allows using more than 1 node
Before: >8.5 min. After: <2 min
Bigger problems scale better, though
HPC HARDWARE (alternatives and what they mean for us)
Architectures
Vector processors (SIMD): data has to be aligned in memory; an operation is performed on a long array of data
GPU accelerators: high memory bandwidth and performance; bottleneck in the communications to the CPU
General-purpose CPU architectures: Power, x86_64
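A minimal sketch of the SIMD idea on an x86_64 CPU using SSE intrinsics (the arrays and values are illustrative; dedicated vector machines and GPUs use their own, much wider instruction sets):

```c
#include <stdio.h>
#include <xmmintrin.h>   /* SSE intrinsics: 128-bit vectors of 4 floats */

int main(void)
{
    /* SIMD loads require 16-byte aligned data, hence the alignment. */
    _Alignas(16) float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    _Alignas(16) float b[8] = {10, 20, 30, 40, 50, 60, 70, 80};
    _Alignas(16) float c[8];

    /* Each iteration issues one instruction that adds 4 floats at once. */
    for (int i = 0; i < 8; i += 4) {
        __m128 va = _mm_load_ps(&a[i]);
        __m128 vb = _mm_load_ps(&b[i]);
        _mm_store_ps(&c[i], _mm_add_ps(va, vb));
    }

    for (int i = 0; i < 8; i++)
        printf("%.0f ", c[i]);
    printf("\n");
    return 0;
}
```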
Parallel machines
Distributed memory system: put N servers to work together (diagram: nodes joined by a network)
Parallel machines
Shared memory system: lots of sockets accessing the same memory and resources
This is expensive and not so scalable
Parallel machines
Virtual shared memory: resources are «exported» through a network
(diagram: a virtualization layer on each node, joined by a network backplane, presents the hardware of all nodes to a single operating system as one virtual system)
HPC networks
Target: reduce wait times in MPI programs; high bandwidth, low latency and scalability needed
Approaches:
Avoid congestion by using switched fabrics
Direct memory access for lower latency
Reduce protocol overheads
Intelligent NICs for communication process automation
Examples: Gigabit Ethernet, 10-Gigabit Ethernet, InfiniBand, Myricom
HPC networks
What's being used? Using top500.org as an indicator
Storage
HPC systems are continuously generating output data, and data has to be stored for post-processing
Critical component: has to be reliable, has to be accessible
Example: wrapping + redundancy + replication (RAID controller)
Storage
Parallel file systems (diagram: clients reach several storage servers and a metadata server over the LAN)
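Parallel file systems let every compute node write its own part of the output concurrently. A minimal sketch using MPI-IO, which typically sits on top of such file systems (the file name and data are illustrative, not from the slides):

```c
#include <mpi.h>

#define COUNT 1024  /* doubles written by each process */

int main(int argc, char **argv)
{
    int rank;
    double buf[COUNT];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int i = 0; i < COUNT; i++)
        buf[i] = rank + 0.001 * i;   /* illustrative output data */

    /* All processes open the same file on the parallel file system... */
    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "results.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* ...and each writes to its own, non-overlapping region, so the
       requests can be serviced by different storage servers in parallel. */
    MPI_Offset offset = (MPI_Offset)rank * COUNT * sizeof(double);
    MPI_File_write_at(fh, offset, buf, COUNT, MPI_DOUBLE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}
```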
HPC MANAGEMENT (how to control the machine)
Manager's wishlist
What do I want as an HPC system manager?
Work from home, or at least be able to do things from home if needed
Automatic deployment (WDS)
Remote management: if the system is up, RDP, cluster console, event viewer; if the system is down, IPMI or other KVM redirection
Infrastructure integration: AD/DS, users and groups policies
Quick troubleshooting: diagnostics and events; reimaging can save time (HPC nodes don't need much customization, and WDS can perform all operations needed to have a freshly installed node)
Data center strategies
Basic data center
Data center strategies
Redundancy: on critical components (power supplies, coolers, some disks...); reduces downtime
Enhanced monitoring: detailed information (temperature, power); failure and warning notification
Failure protection: auto power-off on failure (temperature, long power cut)
Remote operation: remote desktop and console or KVM; hardware control (IPMI)
Example: data center airflow
Energy efficiency: good practices can drastically reduce TCO
Study of the airflow through the equipment
Cold/hot corridor (aisle) approach