HPC Hardware Overview
John Lockman III
February 7, 2012
Texas Advanced Computing Center
The University of Texas at Austin
Outline
- Some general comments
- Lonestar System
  - Dell blade-based system
  - InfiniBand (QDR)
  - Intel processors
- Ranger System
  - Sun blade-based system
  - InfiniBand (SDR)
  - AMD processors
- Longhorn System
  - Dell blade-based system
  - InfiniBand (QDR)
  - Intel processors
About this Talk
- We will focus on TACC systems, but much of the information applies to HPC systems in general
- As an applications programmer you may not care about hardware details, but we need to think about them because of performance issues
- I will try to point you to the most relevant architectural characteristics
- Do not hesitate to ask questions as we go
High Performance Computing
- In our context, HPC refers to hardware and software tools dedicated to computationally intensive tasks
- The distinction between an HPC center (throughput-focused) and a data center (data-focused) is becoming fuzzy
- High bandwidth, low latency:
  - Memory
  - Network
Lonestar: Intel hexa-core system
Lonestar: Introduction
- Lonestar cluster configuration and diagram
- Server blades: Dell PowerEdge M610 (Intel hexa-core) server nodes
- Microprocessor architecture features
  - Instruction pipeline
  - Speeds and feeds
  - Block diagram
- Node interconnect hierarchy
  - InfiniBand switch and adapters
  - Performance
Lonestar Cluster Overview
lonestar.tacc.utexas.edu
Lonestar Cluster Overview

Hardware          Components                     Characteristics
Peak performance  --                             302 TFLOPS
Nodes             2x hexa-core Xeon 5680         1,888 nodes / 22,656 cores
Memory            1333 MHz DDR3 DIMMs            24 GB/node, 45 TB total
Shared disk       Lustre parallel file system    1 PB
Local disk        SATA                           146 GB/node, 276 TB total
Interconnect      InfiniBand                     4 GB/sec P-2-P
Blade : Rack : System
- 1 node: 2 x 6 cores = 12 cores
- 1 chassis: 16 blades = 192 cores
- 1 rack: 3 chassis x 16 blades = 576 cores
- 39⅓ racks: 22,656 cores
Lonestar login nodes
- Dell PowerEdge M610
- Dual-socket Intel hexa-core Xeon, 3.33 GHz
- 24 GB DDR3 1333 MHz DIMMs
- Intel 5520 chipset (QPI)
- Dell PowerVault 1200: 15 TB HOME disk, 1 GB user quota (5x)
Lonestar compute nodes
- 16 blades per 10U chassis
- Dell PowerEdge M610
- Dual-socket Intel hexa-core Xeon, 3.33 GHz (1.25x)
- 13.3 GFLOPS/core (1.25x)
- 64 KB L1 cache per core (32 KB instruction + 32 KB data), 12 MB L3 cache (shared)
- 24 GB DDR3 1333 MHz DIMMs
- Intel 5520 chipset, 2x QPI at 6.4 GT/s
- 146 GB 10k RPM SAS local disk (/tmp)
Motherboard block diagram: two hexa-core Xeons (12 CPU cores total) connected by QPI, each socket with its own DDR3-1333 memory channels; QPI also links the sockets to the I/O hub (IOH), which provides PCIe and connects over DMI to the I/O controller hub (ICH).
Intel Xeon 5600 (Westmere)
- 32 KB L1 cache per core
- 256 KB L2 cache per core
- Shared 12 MB L3 cache
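These cache boundaries are visible from user code. Below is a minimal timing sketch (sizes and iteration counts are illustrative): the cost per access rises noticeably as the working set outgrows each level of the hierarchy.

```c
/* Sketch: time repeated sweeps over working sets of increasing size.
 * Cost per access rises as the set spills out of L1 (32 KB),
 * L2 (256 KB), and the shared 12 MB L3 described above.
 * Compile with -O2. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

int main(void)
{
    const size_t max_bytes = 64 * 1024 * 1024;
    char *buf = malloc(max_bytes);
    memset(buf, 1, max_bytes);

    for (size_t set = 16 * 1024; set <= max_bytes; set *= 2) {
        const size_t passes = (max_bytes / set) * 16;
        volatile char sink = 0;
        clock_t t0 = clock();
        for (size_t p = 0; p < passes; p++)
            for (size_t i = 0; i < set; i += 64)   /* one touch per 64 B line */
                sink += buf[i];
        double s = (double)(clock() - t0) / CLOCKS_PER_SEC;
        /* Total touches are constant across set sizes, so times compare directly. */
        double touches = (double)passes * (set / 64);
        printf("%8zu KB working set: %6.2f ns/access\n",
               set / 1024, s / touches * 1e9);
    }
    free(buf);
    return 0;
}
```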
Cluster Interconnect
- 16 independent 4 GB/s connections per chassis
Lonestar Parallel File Systems: Lustre & NFS
- $HOME: 97 TB, 1 GB/user quota (NFS, over Ethernet)
- $WORK: 226 TB, 200 GB/user quota (Lustre)
- $SCRATCH: 806 TB (Lustre, over InfiniBand; uses IP over IB)
Ranger: AMD Quad-core system
Ranger: Introduction
- A unique instrument for computational scientific research
- Housed in TACC's new machine room
- Over 2½ years of initial planning and deployment effort
- Funded by the National Science Foundation (Office of Cyberinfrastructure) as part of a unique program to reinvigorate High Performance Computing in the United States
- ranger.tacc.utexas.edu
How Much Did it Cost and Who's Involved?
- TACC was selected for the very first NSF Track2 HPC system
- $30M system acquisition; Sun Microsystems (now Oracle) was the vendor
- A very large InfiniBand installation:
  - ~4,100 endpoint hosts
  - >1,350 MT47396 switches
- TACC, ICES, the Cornell Theory Center, and Arizona State HPCI are teamed to operate and support the system for four years ($29M)
Ranger: Performance
- Ranger debuted at #4 on the Top500 list (ranked #25 as of November 2011)
Ranger Cluster Overview

Hardware          Components                                  Characteristics
Peak performance  --                                          579 TFLOPS
Nodes             4x quad-core Opteron                        3,936 nodes / 62,976 cores
Memory            667 MHz DDR2 DIMMs, 1 GHz HyperTransport    32 GB/node, 123 TB total
Shared disk       Lustre parallel file system                 1.7 PB
Local disk        None (8 GB flash)                           --
Interconnect      InfiniBand (generation 2)                   1 GB/sec P-2-P, 2.3 µs latency
Ranger Hardware Summary
- Compute power: 579 TFLOPS
  - 3,936 Sun four-socket blades
  - 15,744 AMD Barcelona processors
  - Quad-core, four flops/cycle (dual pipelines)
- Memory: 123 TB
  - 2 GB/core, 32 GB/node
  - ~20 GB/sec memory bandwidth per node (667 MHz DDR2)
- Disk subsystem: 1.7 PB
  - 72 Sun x4500 Thumper I/O servers, 24 TB each
  - 40 GB/sec total aggregate I/O bandwidth
  - 1 PB raw capacity in the largest file system
- Interconnect: 10 Gbps, 1.6-2.9 µs latency
  - Two Sun InfiniBand-based switches, up to 3,456 4x ports each
  - Full non-blocking 7-stage Clos fabric
  - Mellanox ConnectX InfiniBand (second generation)
Ranger Hardware Summary (cont.)
- 25 management servers: Sun four-socket x4600s
  - 4 login servers with quad-core processors
  - 1 Rocks master, containing the software stack for the nodes
  - 2 SGE servers: primary batch server and backup
  - 2 Sun Connection Management servers, monitoring hardware
  - 2 InfiniBand subnet managers: primary and backup
  - 6 Lustre metadata servers, enabled with failover
  - 4 archive data movers, moving data to the tape library
  - 4 GridFTP servers for external multi-stream transfer
- Ethernet networking: 10 Gbps connectivity
  - Two external 10 GigE networks: TeraGrid, NLR
  - 10 GigE fabric for login, data-mover, and GridFTP nodes, integrated into the existing TACC network infrastructure
  - Force10 S2410P and E1200 switches
InfiniBand Cabling in Ranger
- The Sun switch design reduces the cable count; manageable, but still a challenge to cable
- 1,312 InfiniBand 12x-to-12x cables
- 78 InfiniBand 12x-to-three-4x splitter cables
- Cable lengths range from 7 to 16 m, averaging 11 m
- 9.3 miles (15.4 km) of InfiniBand cable in total
Space, Power, and Cooling
- Power: 3.0 MW total; system itself: 2.4 MW
- ~90 racks in a 6-row arrangement
- ~100 in-row cooling units
- ~4,000 ft² total footprint
- Cooling: ~0.6 MW
  - In-row units fed by three 400-ton chillers
  - Enclosed hot aisles
  - Supplemental 280 tons of cooling from CRAC units
- Observations:
  - Space is less of an issue than power
  - Cooling more than 25 kW per rack is difficult
  - Power distribution is a challenge: more than 1,200 circuits
External Power and Cooling Infrastructure
Switches in Place
InfiniBand Cabling in Progress
Ranger Features
- AMD processors, HPC features:
  - 4 flops/CP per core
  - 4 sockets on a board, 4 cores per socket
  - HyperTransport (direct connection between sockets)
  - 2.3 GHz cores
- Any idea what the peak floating-point performance of a node is?
  - 2.3 GHz * 4 flops/CP * 16 cores = 147.2 GFLOPS peak
- Any idea how much an application can sustain?
  - Over 80% of peak with DGEMM (matrix-matrix multiply); see the sketch below
- NUMA node architecture (16 cores per node; think hybrid)
- Two-tier InfiniBand (NEM + Magnum) switch system
- Multiple Lustre (parallel) file systems
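The DGEMM claim is easy to test yourself. A minimal sketch, assuming a CBLAS implementation is installed and linked (the matrix size and `now()` helper are illustrative): it times one dgemm call and reports sustained GFLOPS, which can be compared against the per-core peak of 9.2 GFLOPS, or against the node peak with a threaded BLAS.

```c
/* Sketch: measure sustained GFLOPS with DGEMM (C = alpha*A*B + beta*C).
 * A dgemm of order n performs ~2*n^3 flops; compare against the node
 * peak of 2.3 GHz * 4 flops/CP * 16 cores = 147.2 GFLOPS quoted above.
 * Link against any CBLAS (e.g. -lblas). */
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <cblas.h>

static double now(void)                      /* wall-clock seconds */
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + 1e-6 * tv.tv_usec;
}

int main(void)
{
    const int n = 4096;
    double *A = malloc((size_t)n * n * sizeof *A);
    double *B = malloc((size_t)n * n * sizeof *B);
    double *C = malloc((size_t)n * n * sizeof *C);
    for (long i = 0; i < (long)n * n; i++) { A[i] = 1.0; B[i] = 2.0; C[i] = 0.0; }

    double t0 = now();
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, A, n, B, n, 0.0, C, n);
    double t = now() - t0;

    printf("n=%d: %.1f GFLOPS sustained\n", n, 2.0 * n * n * n / t / 1e9);
    free(A); free(B); free(C);
    return 0;
}
```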
Ranger Architecture (diagram): login nodes (Sun x4600s) face the internet; 82 racks of C48 blades form the compute nodes (4 sockets x 4 cores each); Magnum InfiniBand switches provide 3,456 12x ports each, with every 12x line splitting into three 4x lines (bisection bandwidth = 110 Tbps); 72 Thumper x4500 I/O servers (24 TB each) host the $WORK and other file systems, with one x4600 metadata server per file system; GigE and InfiniBand networks tie it all together.
Ranger InfiniBand Topology (diagram): 78 NEMs (network express modules) in the compute racks connect to the central Magnum switches over 12x InfiniBand links, each formed by combining three 4x cables.
MPI Tests: P2P Bandwidth
Point-to-point MPI measured performance (Ranger: OFED 1.2, MVAPICH 0.9.9; Lonestar: OFED 1.1, MVAPICH 0.9.8), message sizes from 1 byte to ~100 MB:
- Shelf latencies: ~1.6 µs
- Rack latencies: ~2.0 µs
- Peak bandwidth: ~965 MB/s
- Effective bandwidth is improved at smaller message sizes
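Curves like these come from a ping-pong microbenchmark. A minimal MPI version is sketched below; the message sizes and repetition count are illustrative, not the exact harness used for these measurements.

```c
/* Sketch: classic two-rank ping-pong to reproduce a point-to-point
 * bandwidth curve. Run with: mpirun -np 2 ./pingpong */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int reps = 100;
    char *buf = malloc(1 << 23);             /* up to 8 MB messages */

    for (int size = 1; size <= (1 << 23); size *= 2) {
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int r = 0; r < reps; r++) {
            if (rank == 0) {
                MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t = (MPI_Wtime() - t0) / (2.0 * reps);   /* one-way time */
        if (rank == 0)
            printf("%9d bytes: %8.2f MB/s  (%.2f us one-way)\n",
                   size, size / t / 1e6, t * 1e6);
    }
    free(buf);
    MPI_Finalize();
    return 0;
}
```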
Ranger: Bisection BW Across 2 Magnums
Measured vs. ideal bisection bandwidth (GB/sec) and full-bisection-bandwidth efficiency, plotted as the number of compute racks grows from 1 to 82: Ranger is able to sustain ~73% bisection bandwidth efficiency with all 3,936 nodes communicating simultaneously (82 racks).
Sun Motherboard for AMD Barcelona Chips
- Compute blade: 4 sockets, 4 cores per socket
- 8 memory slots per socket
- HyperTransport at 1 GHz
Sun Motherboard for AMD Barcelona Chips (diagram): a maximum-neighbor NUMA configuration for three-port HyperTransport; 8.3 GB/s links cross the passive midplane; two PCIe x8 (32 Gbps) and one PCIe x4 (16 Gbps) connect to the NEM switch; HyperTransport is 6.4 GB/s bidirectional (3.2 GB/s unidirectional); memory is dual-channel, 533 MHz, registered ECC.
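On a four-socket NUMA board like this, the socket on which a memory page physically lives is decided by the core that first touches it. A minimal sketch of the standard first-touch idiom with OpenMP (array size and the computation are illustrative):

```c
/* Sketch: first-touch placement on a NUMA node. Pages are bound to the
 * socket whose core first writes them, so initializing in parallel with
 * the same loop schedule as the compute phase keeps each socket's cores
 * working out of local memory. Compile with -fopenmp. */
#include <stdlib.h>

#define N (1L << 26)

int main(void)
{
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);

    /* First touch: each thread faults in its own share of the pages. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++) { a[i] = 0.0; b[i] = (double)i; }

    /* Compute phase: same static schedule -> mostly local accesses. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++)
        a[i] += 2.0 * b[i];

    free(a); free(b);
    return 0;
}
```

Initializing the arrays serially instead would place every page on the first socket and turn three quarters of the accesses into remote HyperTransport traffic.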
Intel/AMD Dual- to Quad-core Evolution (diagrams): AMD Opteron dual-core vs. AMD Barcelona/Phenom quad-core, and Intel Woodcrest vs. Intel Clovertown (the Intel parts shown with an off-chip memory controller hub, MCH).
Caches in Quad-Core CPUs (diagram): in Intel quad-cores the L2 caches are not independent (each L2 is shared by a pair of cores), while in AMD quad-cores each of the four cores has an independent L2 cache in front of a shared L3; each design's memory controller sits behind the cache hierarchy.
Cache Sizes in AMD Barcelona (diagram)
- Per core: 64 KB L1 and 512 KB L2 caches
- Shared 2 MB L3 cache, attached to the system request interface and crossbar switch
- Three HyperTransport links (HT0-HT2): 24 GB/s total bandwidth (8 GB/s per link)
- Memory controller bandwidth up to 10.7 GB/s (667 MHz)
Other Important Features
- AMD quad-core (K10, code name Barcelona):
  - Instruction fetch bandwidth is now 32 bytes/cycle
  - 2 MB L3 cache on-die; 4 x 512 KB L2 caches; 64 KB L1 instruction and data caches
  - SSE units are now 128 bits wide, giving single-cycle throughput; improved ALU and FPU throughput (see the sketch below)
  - Larger branch prediction tables, with higher accuracy
  - Dedicated stack engine to pull stack-related ESP updates out of the instruction stream
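The 128-bit SSE width means one packed instruction processes two doubles at once. A minimal sketch with SSE2 intrinsics (the values are illustrative; any SSE2-capable compiler will do):

```c
/* Sketch: the 128-bit SSE units operate on two packed doubles per
 * instruction; with single-cycle throughput for packed mul and add,
 * this is how 4 flops/CP is reached (one mul + one add pipe, two
 * doubles each). Compile for x86-64 (SSE2 is on by default). */
#include <emmintrin.h>
#include <stdio.h>

int main(void)
{
    double x[4] = {1.0, 2.0, 3.0, 4.0};
    double y[4] = {5.0, 6.0, 7.0, 8.0};
    double z[4];

    for (int i = 0; i < 4; i += 2) {
        __m128d a = _mm_loadu_pd(&x[i]);    /* two doubles per register */
        __m128d b = _mm_loadu_pd(&y[i]);
        __m128d c = _mm_add_pd(_mm_mul_pd(a, b), a);   /* z = x*y + x */
        _mm_storeu_pd(&z[i], c);
    }
    for (int i = 0; i < 4; i++)
        printf("z[%d] = %g\n", i, z[i]);
    return 0;
}
```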
AMD 10h Processor
Speeds and Feeds

Level                            Size      Load speed   Store speed   Latency
Registers (on die)               --        4 W/CP       2 W/CP        --
L1 data                          64 KB     2 W/CP       1 W/CP        3 CP
L2                               512 KB    0.5 W/CP     0.5 W/CP      ~15 CP
L3                               2 MB      --           --            ~25 CP
Memory (2x 533 MHz DDR2 DIMMs)   external  --           --            ~300 CP

- W: FP word (64-bit); CP: clock period
- Cache line size (L1/L2) is 8 W
- 4 flops/CP
- Cache states: MOESI (Modified, Owner, Exclusive, Shared, Invalid)
- MOESI is beneficial when latency/bandwidth between CPUs is significantly better than to main memory
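The latency column can be measured with dependent pointer chasing, which keeps the prefetchers and the out-of-order engine from hiding the miss cost. A minimal sketch (working-set size and iteration count are illustrative):

```c
/* Sketch: dependent pointer chasing pays the full load latency of
 * whichever level holds the working set (roughly 3 / ~15 / ~25 /
 * ~300 CP per the table above). Indices form a random cycle so
 * hardware prefetchers cannot follow the chain. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
    const size_t n = 1 << 22;           /* 32 MB of size_t: DRAM-resident */
    size_t *next = malloc(n * sizeof *next);

    /* Build a single random cycle (Sattolo's algorithm). */
    for (size_t i = 0; i < n; i++) next[i] = i;
    srand(1);
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;
        size_t t = next[i]; next[i] = next[j]; next[j] = t;
    }

    size_t p = 0;
    clock_t t0 = clock();
    for (long k = 0; k < 100000000L; k++)
        p = next[p];                    /* each load depends on the last */
    double s = (double)(clock() - t0) / CLOCKS_PER_SEC;
    printf("%.1f ns per load (p=%zu)\n", s / 1e8 * 1e9, p);
    free(next);
    return 0;
}
```

Shrinking `n` until the table fits in L2 or L1 makes the reported latency step down level by level.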
Ranger Disk Subsystem - Lustre
- Object storage servers (OSSs) are based on the Sun x4500 Thumper
  - Each server has 48 SATA II 500 GB drives (24 TB total), running internal software RAID
  - Dual-socket, dual-core Opterons at 2.6 GHz
  - 72 servers total: 1.7 PB raw storage (that's 288 cores just to drive the file systems)
- Metadata servers (MDSs) are based on SunFire x4600s
  - Each MDS is Fibre Channel-connected to 9 TB of FlexLine storage
- Target performance: 40 GB/sec aggregate bandwidth
Ranger Parallel File Systems: Lustre (diagram)
- $HOME: 96 TB, 6 GB/user quota; 36 OSTs on 6 Thumpers
- $WORK: 193 TB, 200 GB/user quota; 72 OSTs on 12 Thumpers
- $SCRATCH: 773 TB; 300 OSTs on 50 Thumpers
- Compute blades (12 per chassis, racks 0-81) reach the file systems through the InfiniBand switch, using IP over IB
I/O with Lustre over Native InfiniBand
$SCRATCH file-system throughput and single-application performance, plotted as write speed (GB/sec) vs. number of writing clients (1 to 10,000) at stripe counts of 1 and 4:
- Maximum total aggregate performance of 38 or 46 GB/sec depending on stripe count (design target = 32 GB/sec)
- External users have reported performance of ~35 GB/sec with a 4K-core application run
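Measurements like these are typically taken with an IOR-style benchmark in which every rank writes its own block of one shared file. A minimal MPI-IO sketch in that spirit (the output path and block size are illustrative, not the actual test harness used here; the stripe count would be set on the directory beforehand, e.g. with lfs setstripe):

```c
/* Sketch: each rank writes a contiguous 64 MB block of a shared file
 * and rank 0 reports aggregate bandwidth, in the spirit of the
 * throughput test above. */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const MPI_Offset block = 64L << 20;          /* 64 MB per rank */
    char *buf = calloc((size_t)block, 1);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "/scratch/iotest.dat",  /* illustrative path */
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    MPI_File_write_at_all(fh, (MPI_Offset)rank * block, buf, (int)block,
                          MPI_BYTE, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);                         /* close flushes to the OSTs */
    double t = MPI_Wtime() - t0;

    if (rank == 0)
        printf("%d writers: %.2f GB/s aggregate\n",
               nprocs, nprocs * (double)block / t / 1e9);
    free(buf);
    MPI_Finalize();
    return 0;
}
```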
Longhorn: Intel Quad-core system
Longhorn: Introduction
- First NSF eXtreme Digital Visualization (XD Vis) grant
- Designed for scientific visualization and data analysis
- Very large memory per computational core
- Two NVIDIA graphics cards per node
- Rendering performance: 154 billion triangles/second
Longhorn Cluster Overview

Hardware                     Components                      Characteristics
Peak performance (CPUs)      512 Intel Xeon E5540            20.7 TFLOPS
Peak performance (GPUs, SP)  128 NVIDIA QuadroPlex 2200 S4   500 TFLOPS
System memory                DDR3 DIMMs                      48 GB/node (240 nodes), 144 GB/node (16 nodes), 13.5 TB total
Graphics memory              512 Quadro FX 5800 x 4 GB       2 TB
Disk                         Lustre parallel file system     210 TB
Interconnect                 QDR InfiniBand                  4 GB/sec P-2-P
Longhorn fat nodes
- Dell R710
- Dual-socket Intel quad-core Xeon E5540 at 2.53 GHz
- 144 GB DDR3 (18 GB/core)
- Intel 5520 chipset
- NVIDIA QuadroPlex 2200 S4:
  - 4 NVIDIA Quadro FX 5800 GPUs, each with 240 CUDA cores, 4 GB memory, and 102 GB/s memory bandwidth
Longhorn standard nodes
- Dell R610
- Dual-socket Intel quad-core Xeon E5540 at 2.53 GHz
- 48 GB DDR3 (6 GB/core)
- Intel 5520 chipset
- NVIDIA QuadroPlex 2200 S4:
  - 4 NVIDIA Quadro FX 5800 GPUs, each with 240 CUDA cores, 4 GB memory, and 102 GB/s memory bandwidth
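A node's GPUs can be enumerated with the CUDA runtime API. A minimal query sketch (assumes the CUDA toolkit is installed; compile with nvcc):

```c
/* Sketch: list the GPUs a Longhorn node exposes. Each Quadro FX 5800
 * should report 30 multiprocessors (30 SMs x 8 cores = the 240 CUDA
 * cores quoted above, on this GPU generation) and 4 GB of memory. */
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int n = 0;
    cudaGetDeviceCount(&n);
    for (int d = 0; d < n; d++) {
        struct cudaDeviceProp p;
        cudaGetDeviceProperties(&p, d);
        printf("GPU %d: %s, %d SMs, %.1f GB global memory\n",
               d, p.name, p.multiProcessorCount,
               p.totalGlobalMem / 1073741824.0);
    }
    return 0;
}
```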
Motherboard (R610/R710) (diagram): two quad-core Intel Xeon 5500s, each with its own DDR3 channels; 12 DIMM slots on the R610, 18 on the R710; the Intel 5520 chipset provides 42 lanes of PCI Express, or 36 lanes of PCI Express 2.0.
Cache Sizes in Intel Nehalem (diagram)
- Per core: 32 KB L1 and 256 KB L2 caches
- Shared 8 MB L3 cache
- Integrated memory controller
- Two 20-lane QPI links (four quadrants of 10 lanes each); total QPI bandwidth up to 25.6 GB/s (at 3.2 GHz)
Nehalem μarchitecture
Storage
- Ranch: long-term tape storage
- Corral: 1 PB of spinning disk
Ranch Archival System
- Sun StorageTek silo
- 10,000 T10000B tapes
- 6,000 T10000C tapes
- ~40 PB total capacity
- Used for long-term storage
Corral
- 6 PB of DataDirect Networks online disk storage
- 8 Dell 1950 servers, 8 Dell 2950 servers, 12 Dell R710 servers
- High-performance parallel file system
- Multiple databases
- iRODS data management
- Replication to the tape archive
- Multiple levels of access control
- Web and other data access available globally
Coming Soon (2013): Stampede
- 10 PFLOPS peak performance
- 272 TB total memory
- 14 PB of disk storage
- Intel Xeon Processor E5 family (Sandy Bridge)
- Also includes a special pre-release shipment of the Intel Many Integrated Core (MIC) co-processor
- www.tacc.utexas.edu/stampede
References
Lonestar Related References
- User Guide: services.tacc.utexas.edu/index.php/lonestar-user-guide
- Developers:
  - www.tomshardware.com
  - www.intel.com/p/en_us/products/server/processor/xeon5000/technical-documents
Ranger Related References
- User Guide: services.tacc.utexas.edu/index.php/ranger-user-guide
- Forums:
  - forums.amd.com/devforum
  - www.pgroup.com/userforum/index.php
- Developers:
  - developer.amd.com/home.jsp
  - developer.amd.com/rec_reading.jsp
Longhorn Related References
- User Guide: services.tacc.utexas.edu/index.php/longhorn-user-guide
- General information:
  - www.intel.com/technology/architecture-silicon/next-gen/
  - www.nvidia.com/object/cuda_what_is.html
- Developers:
  - developer.intel.com/design/pentium4/manuals/index2.htm
  - developer.nvidia.com/forums/index.php