Programming Techniques for Supercomputers: Multicore processors. There is no way back Modern multi-/manycore chips Basic Compute Node Architecture

Transcription

1 Programming Techniques for Supercomputers: Multicore processors. There is no way back. Modern multi-/manycore chips. Basic Compute Node Architecture. Simultaneous Multithreading (SMT). Prof. Dr. G. Wellein (a,b), Dr. G. Hager (a), M. Wittmann (a). (a) HPC Services, Regionales Rechenzentrum Erlangen; (b) Department of Computer Science, University of Erlangen-Nürnberg. Summer semester 2016

2 Introduction: Moore's law. Transistor counts: Intel Sandy Bridge EP: 2.3 billion; Nvidia Kepler: 7 billion; Intel Broadwell: 7.2 billion; Nvidia Pascal: 15 billion. 1965: G. Moore claimed #transistors on microchip doubles every months. May 12, 2016 PTfS

3 Introduction: Moore's law, clock speeds saturate. Figure: Intel x86 clock speed (frequency in MHz, logarithmic axis) over time; labeled data points include 1-core Nocona, Sandy Bridge, 12-core Ivy Bridge and 18-core Haswell.

4 Power consumption: the root of all evil. By courtesy of D. Vrsalovic, Intel. Relative to a single core at max frequency (N transistors; performance 1.00x, power 1.00x): over-clocked (+20%): performance 1.13x, power 1.73x; dual-core (-20%), 2N transistors: performance 1.73x, power 1.02x. Power envelope: Max W. Power consumption: $P = f \cdot (V_\mathrm{core})^2$; V core ~ V. Same process technology: $V_\mathrm{core} \sim f$, hence $P \sim f^3$.
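The power figures on this slide follow directly from the cubic scaling law. The sketch below is only an illustration of that arithmetic, assuming that power adds up linearly over cores and that every core obeys P ~ f^3; the function name `relative_power` is made up for this example.

```python
def relative_power(freq_scale, n_cores=1):
    """Chip power relative to one core at base clock.

    Assumption: P ~ f * V^2 with V ~ f (same process technology),
    so each core contributes freq_scale**3.
    """
    return n_cores * freq_scale ** 3


# One core, clock raised by 20 %.
overclocked = relative_power(1.2)
# Two cores, each clocked 20 % lower.
dual_core = relative_power(0.8, n_cores=2)

print(f"over-clocked (+20%): {overclocked:.2f}x power")
print(f"dual-core (-20%):    {dual_core:.2f}x power")
```

Both results round to the values quoted on the slide (1.73x and 1.02x), which is why two slower cores fit into nearly the same power envelope as one fast core.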

5 Introduction: Trends to consider. Clock speed of multicore chips will not increase. Power/energy saving mechanisms in hardware: clock speed depends on execution-time parameters, e.g. number of cores used, type of application executed, environment temperature. Transistor budget can be invested in various directions: execution units; width of execution units; cores; caches; (additional functionalities, e.g. PCIe or GPU on-chip).

6 Multi-Core: Intel Xeon 2600v3 (2014). One Xeon E5-2600v3 "Haswell EP" chip: up to 18 cores running at 2.3 GHz (max 3.6 GHz); Simultaneous Multithreading (SMT), reports as 36-way chip; up to 40 MB cache & 40 PCIe 3.0 lanes; 5.7 billion transistors / 22 nm; die size: 662 mm^2. Standard HPC configuration: 2-socket server (18 cores + 18 cores).

7 Modern multi- and manycore chips: Intel Broadwell; NVIDIA GK110 / K20; Intel Xeon Phi. Be prepared for more cores with less complexity and slower clock!

8 There is no longer a single driving force for chip performance! Floating Point (FP) peak performance of a single chip: $P_\mathrm{chip} = n_\mathrm{core} \cdot P_\mathrm{core}$ with $P_\mathrm{core} = n_\mathrm{super}^\mathrm{FP} \cdot n_\mathrm{FMA} \cdot n_\mathrm{SIMD} \cdot f$. Intel Xeon EP ("Broadwell"): up to 22-core variants are available. Top Intel Xeon E v4 ("Broadwell"): f = 2.2 GHz; $n_\mathrm{core} = 22$; $n_\mathrm{super}^\mathrm{FP} = 2$; $n_\mathrm{FMA} = 2$; $n_\mathrm{SIMD} = 4$, so $P_\mathrm{chip} = 774.4$ GF/s (double). But: $P_\mathrm{chip} = 8.8$ GF/s for serial, non-vectorized code.
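The peak performance formula can be checked with a few lines of arithmetic. This is a minimal sketch using the Broadwell parameters from the slide; the function name `peak_gflops` is hypothetical, and "serial, non-vectorized" is modeled here by setting the core count and the SIMD factor to 1.

```python
def peak_gflops(n_core, n_super, n_fma, n_simd, f_ghz):
    """P_chip = n_core * n_super * n_FMA * n_SIMD * f, in GF/s for f in GHz."""
    return n_core * n_super * n_fma * n_simd * f_ghz


# All 22 cores, 2 FP instructions per cycle, FMA, 4 double-precision SIMD lanes.
full_chip = peak_gflops(n_core=22, n_super=2, n_fma=2, n_simd=4, f_ghz=2.2)
# One core, scalar code: only superscalarity and FMA remain.
serial_scalar = peak_gflops(n_core=1, n_super=2, n_fma=2, n_simd=1, f_ghz=2.2)

print(f"full chip:     {full_chip:.1f} GF/s")
print(f"serial scalar: {serial_scalar:.1f} GF/s")
print(f"ratio:         {full_chip / serial_scalar:.0f}x")
```

The factor between the two numbers is simply the core count times the SIMD width, i.e. the part of the peak that only parallel, vectorized code can reach.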

9 NVIDIA Kepler GK110 architecture: K20. 13 SMX GHz, each with 192 sp (64 dp) FMA units, 64 kB L1/shared memory and 32k registers. Peak performance (dp): 1165 GF/s (3495 GF/s for single precision). 5 GB GDDR5, 208 GB/s. 1.5 MB L2 cache. 7B transistors. Single Instruction Multiple Threads (SIMT). Programming: CUDA / OpenCL, OpenACC?! Massive threading! (In-order architecture.) Figure: NVIDIA Corp. Used with permission. (K20x shown)

10 Intel Xeon Phi 5110P ("Knights Corner"). 60 cores @ 1.05 GHz, each with a 512-bit SIMD/vector unit (FMA), 32 kB L1 and 0.5 MB L2/core; in-order; 4-way SMT. 3B transistors. Peak performance (dp): 1008 GF/s (2016 GF/s for single precision). 8 GB GDDR5, 320 GB/s. 64 byte/cy. Programming: Intel Fortran and C/C++ compiler, OpenMP. Code vectorization!
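The Xeon Phi peak numbers can be reproduced from the quantities on the slide once the SIMD factor is derived from the vector width. The following sketch assumes one vector FMA instruction per core and cycle (that assumption is not stated on the slide, but it reproduces the quoted peaks); `simd_lanes` and `phi_peak_gflops` are names invented for this example.

```python
def simd_lanes(register_bits, element_bits):
    """Number of elements that fit into one SIMD register."""
    return register_bits // element_bits


def phi_peak_gflops(element_bits, cores=60, f_ghz=1.05,
                    register_bits=512, fma_factor=2):
    """Peak GF/s, assuming one vector FMA instruction per core and cycle."""
    return cores * simd_lanes(register_bits, element_bits) * fma_factor * f_ghz


dp_peak = phi_peak_gflops(element_bits=64)  # 8 double-precision lanes
sp_peak = phi_peak_gflops(element_bits=32)  # 16 single-precision lanes

print(f"dp: {dp_peak:.0f} GF/s, sp: {sp_peak:.0f} GF/s")
```

Halving the element size doubles the lane count, which is why the single-precision peak is exactly twice the double-precision one.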

11 There is no single driving force for single core performance! $P_\mathrm{chip} = n_\mathrm{core} \cdot n_\mathrm{super}^\mathrm{FP} \cdot n_\mathrm{FMA} \cdot n_\mathrm{SIMD} \cdot f$
Table columns: Server; $n_\mathrm{core}$ [cores]; $n_\mathrm{super}^\mathrm{FP}$ [inst./cy] (superscalarity); $n_\mathrm{FMA}$ (FMA factor); $n_\mathrm{SIMD}$ [ops/inst] (SIMD factor); clock speed f [GHz]; $P_\mathrm{chip}$ [GF/s].
Table rows: Nehalem Q1/2009 X; Westmere Q1/2010 X; Sandy Bridge Q1/2012 E; Ivy Bridge Q3/2013 E v; Haswell Q3/2014 E v; Broadwell Q1/2016 E v; IBM POWER Q2/2014 S822L; Nvidia K; Phi 5110P.

12 Attainable bandwidth (BW): a[:] = b[:] + s * c[:]. BW saturation in NUMA domain. Single core does not saturate BW. Plot panels: Intel Sandy Bridge; AMD Interlagos; Intel Xeon Phi 5110P (ECC=on); NVIDIA K20 (ECC=on).
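To turn a measured runtime of the triad kernel into an attained bandwidth, the data traffic per loop iteration has to be counted. The sketch below assumes double-precision arrays (8 bytes per word) and counts two loads and one store per iteration; the optional write-allocate transfer for the store to `a` is an assumption about the cache hierarchy that the slide does not address, and the runtime in the usage line is a made-up number, not a measurement.

```python
def triad_bytes(n, bytes_per_word=8, write_allocate=False):
    """Memory traffic in bytes for a[:] = b[:] + s * c[:] over n elements.

    Counts loads of b and c plus the store of a. With write_allocate=True
    an extra load of a is added (assumption: store misses first read the
    cache line from memory).
    """
    words_per_iteration = 4 if write_allocate else 3
    return n * words_per_iteration * bytes_per_word


def bandwidth_gbs(n, seconds, **traffic_options):
    """Attained bandwidth in GB/s for one triad sweep that took `seconds`."""
    return triad_bytes(n, **traffic_options) / seconds / 1e9


# Hypothetical example: 10 million elements processed in 10 ms.
example_bw = bandwidth_gbs(10_000_000, 0.010)
print(f"{example_bw:.1f} GB/s")
```

Whichever traffic convention is used should be stated together with the result, since the two conventions differ by a factor of 4/3.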

13 A brief view on basic compute node architecture: From UMA to ccNUMA (more details in the next presentation)

14 There is no longer a single flat memory: From UMA to ccNUMA (2-way nodes). Yesterday: dual-socket Intel "Core2" node: Uniform Memory Architecture (UMA); flat memory; symmetric MPs. But: system "anisotropy". Shared address space within the node! Today: dual-socket Intel (Westmere) node: Cache-coherent Non-Uniform Memory Architecture (ccNUMA). HT / QPI provide scalable bandwidth at the expense of ccNUMA architectures: Where does my data finally end up? It is even more complicated: ccNUMA within a chip!

15 ccNUMA in a single socket! AMD Magny-Cours+ & Intel Cluster on Die mode. Shared resources are hard to scale at hardware level: 2 x 2 memory channels vs. 1 x 4 memory channels per socket. AMD: single-chip ccNUMA since Magny-Cours: 1 socket is built from two multicore chips with separate memory controllers (hardware), i.e. 2 NUMA domains. Intel: Cluster on Die (CoD) mode since Haswell (BIOS option; software solution)... Standard 2-socket HPC server: 4 NUMA domains.

16 Multicore nomenclature. Node: a single shared cache-coherent address space. Socket: physical package that is equipped with leads or pins and can be replaced. NUMA domain: UMA building block; single memory controller; flat memory. Cache group: cores sharing a given cache level (L1-, L2-, L3-group). Core = processor = CPU. Figure labels: Node, Socket, NUMA domain, Cache group, Chipset, Memory.

17 Vector-Triads. Saturation of shared resources: main memory bandwidth. Performance saturation of shared data paths inside a NUMA domain. Plot annotations: shared resource; saturation with 2, 3 or 4 threads, depending on the system; 1 thread cannot saturate bandwidth.

18 Bandwidth limitations: Outer-level cache. Scalability of shared data paths in L3 cache.

19 Compute nodes: Parallel and shared resources. Parallel and shared resources within a shared-memory node (numbers refer to the node diagram, which also shows two GPUs attached via PCIe links and other I/O).
Parallel resources: (1) execution/SIMD units; (2) cores; (3) inner cache levels; (4) sockets / ccNUMA domains; (5) multiple accelerators.
Shared resources: (6) outer cache level per socket; (7) memory bus per socket; (8) intersocket link; (9) PCIe bus(es); (10) other I/O resources.
Which resource is my bottleneck?

20 Parallel/shared resources: Scalable/saturating behavior. Clearly distinguish between "saturating" and "scalable" performance on the chip level: shared resources may show saturating performance; parallel resources show scalable performance.
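The distinction between scalable and saturating behavior can be illustrated with a deliberately simple model. It is an assumption of this sketch, not something the slides state, that a shared resource scales linearly until it hits a hard ceiling; the numbers (15 GB/s per thread, 40 GB/s ceiling) are hypothetical.

```python
def parallel_resource(n_threads, per_thread):
    """Scalable behavior: every thread brings its own resource."""
    return n_threads * per_thread


def shared_resource(n_threads, per_thread, ceiling):
    """Saturating behavior: linear scaling capped by the shared resource."""
    return min(n_threads * per_thread, ceiling)


# Hypothetical memory bandwidth in GB/s: 15 per thread, 40 for the NUMA domain.
scalable = [parallel_resource(n, 15) for n in range(1, 5)]
saturating = [shared_resource(n, 15, 40) for n in range(1, 5)]

print("threads:   ", list(range(1, 5)))
print("scalable:  ", scalable)
print("saturating:", saturating)
```

With these example numbers the shared resource saturates at three threads, so a fourth thread adds nothing; the parallel resource keeps scaling.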

21 Simultaneous Multithreading: Technology to improve single core utilization

22 Simultaneous Multithreading (SMT). Single-threaded execution often occupies only a small fraction of CPU resources, e.g.: FP pipelines are not completely busy (short loops, dependencies); CPU is completely idling (waiting for data from main memory); unbalanced instruction mix (FP Add only, no FP at all). Have another thread ready to use the underutilized resources. SMT: replicate the architectural state multiple (n) times: n-way SMT. A single core appears as n logical cores. Architectural state: data registers; status & control registers; stack / instruction pointers. All other resources (caches, FP units, etc.) are shared. The relation between instructions and different architectural states (i.e. threads) is maintained by hardware.

23 Simultaneous Multithreading (SMT). SMT principle (2-way example). Figure panels: single threaded; 2-way SMT.

24 SMT impact. SMT is primarily suited for increasing processor throughput, with multiple threads/processes running concurrently on the same core. Scientific codes tend to utilize chip resources quite well: standard optimizations (loop fusion, blocking, etc.); high data and instruction-level parallelism; exceptions do exist. SMT is an important topology issue: SMT threads share almost all core resources (pipelines, caches, data paths). Affinity matters! If SMT is not needed, pin threads to physical cores or switch it off via BIOS etc.

25 SMT impact example. Example: running two codes on the MULT pipeline, with one having a dependency. Possible benefit: better pipeline throughput by filling otherwise unused pipelines and by filling pipeline bubbles with the other thread's executing instructions.
Thread 0:
  do i=1,n
    a(i) = a(i-1)*c
  enddo
Dependency: the pipeline stalls until the previous MULT is over.
Thread 1:
  do i=1,n
    b(i) = func(i)*d
  enddo
Unrelated work in the other thread can fill the pipeline bubbles.
Note: Executing it all in a single thread (if possible) may reach the same goal without SMT:
  do i=1,n
    a(i) = a(i-1)*c
    b(i) = func(i)*d
  enddo

26 Simultaneous recursive updates with SMT. Intel Sandy Bridge (desktop) 4-core; 3.5 GHz; SMT. MULT pipeline depth: 5 stages, hence 1 F / 5 cycles for a recursive update. Fill bubbles via: SMT; multiple streams.
Thread 0:
  do i=1,n
    A(i)=A(i-1)*c
  enddo
Thread 1:
  do i=1,n
    B(i)=B(i-1)*d
  enddo
Figure: MULT pipe with multiplications from both threads interleaved in the pipeline stages.

27 Simultaneous recursive updates with SMT. Intel Sandy Bridge (desktop) 4-core; 3.5 GHz; SMT. MULT pipeline depth: 5 stages, hence 1 F / 5 cycles for a recursive update.
Thread 0:
  do i=1,n
    A(i)=A(i-1)*s
    B(i)=B(i-1)*s
    C(i)=C(i-1)*s
    D(i)=D(i-1)*s
    E(i)=E(i-1)*s
  enddo
Figure: MULT pipe holding B(2)*s, A(2)*s, E(1)*s, D(1)*s, C(1)*s.
5 independent updates on a single thread do the same job!
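The effect of pipeline depth on recursive updates can be reproduced with a toy cycle counter. This is a simplified model under stated assumptions, not a description of the real Sandy Bridge core: one multiply can be issued per cycle, a result becomes available `depth` cycles after issue, and each stream must wait for its own previous result. The function name `simulate_mult_pipeline` is made up for this sketch.

```python
def simulate_mult_pipeline(n_streams, updates_per_stream, depth=5):
    """Cycles until all recursive updates of all streams have completed."""
    ready_at = [0] * n_streams                     # cycle at which a stream may issue again
    remaining = [updates_per_stream] * n_streams   # multiplies left per stream
    cycle = 0
    last_completion = 0
    while any(remaining):
        # Issue at most one multiply per cycle, taken from the first
        # stream whose previous result has left the pipeline.
        for s in range(n_streams):
            if remaining[s] and ready_at[s] <= cycle:
                remaining[s] -= 1
                ready_at[s] = cycle + depth
                last_completion = cycle + depth
                break
        cycle += 1
    return last_completion


for streams in (1, 2, 5):
    cycles = simulate_mult_pipeline(streams, 100)
    print(f"{streams} stream(s): {streams * 100 / cycles:.2f} flops/cycle")
```

One stream gives 0.2 flops per cycle (1 F / 5 cycles), two streams (as with two SMT threads) roughly double that, and five independent streams keep the pipeline almost full. The model does not care whether the streams come from different hardware threads or from one thread, which is the point of the slide.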

28 Simultaneous recursive updates with SMT. Intel Sandy Bridge (desktop) 4-core; 3.5 GHz; SMT. Pure update benchmark can be vectorized: 2 F / cycle (store limited). Recursive update: SMT can fill pipeline bubbles; a single thread can do so as well; bandwidth does not increase through SMT. SMT cannot replace SIMD!

29 SMT: Retrieving topology information. Topology information, e.g. /proc/cpuinfo. Current Intel x86 CPUs support 2-way SMT; Intel Xeon Phi & IBM Blue Gene/Q: 4-way SMT. SMT enabled: correct pinning of threads / processes is mandatory!
likwid-topology
CPU name: Intel Core i7 processor
CPU clock: Hz
******************************************************
Hardware Thread Topology
******************************************************
Sockets: 2
Cores per socket: 4
Threads per core:
HWThread Thread Core Socket
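As a rough illustration of what such topology information contains, the following sketch groups hardware threads by physical core from text in the `/proc/cpuinfo` format. It assumes the x86 Linux field names `processor`, `physical id` and `core id`; other architectures report different fields. The sample text and the function name `hardware_threads_per_core` are hypothetical, and the sketch is no substitute for `likwid-topology`.

```python
from collections import defaultdict

# Hypothetical excerpt: 1 socket, 2 cores, 2-way SMT (4 hardware threads).
SAMPLE_CPUINFO = """\
processor   : 0
physical id : 0
core id     : 0

processor   : 1
physical id : 0
core id     : 1

processor   : 2
physical id : 0
core id     : 0

processor   : 3
physical id : 0
core id     : 1
"""


def hardware_threads_per_core(cpuinfo_text):
    """Map (socket, core) to the list of hardware thread IDs on that core."""
    cores = defaultdict(list)
    for block in cpuinfo_text.strip().split("\n\n"):
        fields = {}
        for line in block.splitlines():
            if ":" in line:
                key, value = line.split(":", 1)
                fields[key.strip()] = value.strip()
        if "processor" in fields:
            socket = int(fields.get("physical id", 0))
            core = int(fields.get("core id", 0))
            cores[(socket, core)].append(int(fields["processor"]))
    return dict(cores)


topology = hardware_threads_per_core(SAMPLE_CPUINFO)
# Pick the first hardware thread of every physical core for pinning.
one_thread_per_core = [threads[0] for _, threads in sorted(topology.items())]
print(topology)
print(one_thread_per_core)
```

On a real Linux node the text would come from reading `/proc/cpuinfo`, and a list like `one_thread_per_core` could then be handed to a pinning mechanism, for example Python's `os.sched_setaffinity` or the pinning tool of the job environment.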

30 Multicore: Lessons to be learned. Parallel programming is mandatory: serial codes will not run (substantially) faster in the future; highly threaded and/or vectorized implementation for accelerators. Complex core / chip / node topologies: simultaneous multithreading; in-order architectures; shared vs. parallel ("core-local") caches; ccNUMA topologies within nodes and sockets; heterogeneous hardware devices (CPUs + GPGPUs). Parallel vs. shared (potentially saturated) resources: main memory bandwidth typically saturates within a NUMA domain and basically scales between NUMA domains; shared cache performance may scale or saturate (depending on implementation).