High Performance Computing: The Processor
Erik Saule (esaule@uncc.edu)
ITCS 6010/8010: Special Topics: High Performance Computing. 01/16/2014
Outline
1 General architecture
2 Vector processing
3 Hardware Threading
4 Resources
Schematic (Nehalem: Intel Core i7)
Source: intel.com
A classic RISC pipeline

Typical architecture
Instructions are pushed into a pipeline one after the other.
Source: a Georgia Tech assignment by Hyesoon Kim
ADD R0, R1
LI R0, +42
LOAD R0, R1
BNZ R0, +3

Instruction pipelining
One instruction per cycle (throughput).
Source: Wikipedia

Pipeline Stall (Bubbles)
Source: Wikipedia
How to avoid bubbles?

Control Hazard
Happens when the processor has not yet decided which instruction to execute next. Typically because of a jump, which might be any of:
Jump +12
JumpR R0
BNZ R0, +2

Branch Prediction
Keep a small table of what was chosen before, and guess which choice will be taken. If wrong, flush the pipeline.

Data Hazard
An instruction uses the result of the previous instruction.
R0 = R1+R2
R3 = R0+R4

Loop in the ALU (forwarding)
Add an output from the ALU back to the next Instruction Decode stage in case it is needed. (But complicated for more than 1 cycle of difference.)

Out-of-order Execution
Keep a pool of upcoming instructions and execute one that is safe.
Core architecture (Nehalem)
Source: TAMU supercomputing facility. http://sc.tamu.edu/systems/eos/
Summary
Modern multicore processors are complicated.
There are resources shared between cores:
- I/O port
- L3 cache
- Memory Controller(s)
It is VERY difficult to predict their behavior.
One core can retire more than one instruction per cycle.
The philosophy of RISC optimization still applies.
Writing efficient code manually is going to be tedious.
Outline
1 General architecture
2 Vector processing
3 Hardware Threading
4 Resources
Single Instruction Multiple Data

Flynn's Taxonomy
- Single Instruction Single Data: simple processor model.
- Single Instruction Multiple Data: vector processor.
- Multiple Instruction Single Data: odd and uncommon.
- Multiple Instruction Multiple Data: typically shared memory.

Principle
Perform the same operation on multiple data in a single instruction. In a typical processor, this uses special registers which contain multiple values, often called packed registers (Packed Integer, Packed Double, ...).
History

Intel processors
- MMX (1996): 64-bit integer packed registers (overlap with the FPU).
- SSE (1999): new 128-bit registers for single precision.
- SSE2 (2001): extends SSE to integers.
- SSE3 (2004): horizontal operations, misaligned reads, conversions.
- SSSE3 (2006): new minor (mostly horizontal) operations.
- SSE4 (2006): minor; strings, bit counts.
- AES (2008): to help AES encryption.
- AVX (2008): introduces 256-bit registers and 3-operand instructions. Overlaps with the SSE registers.
- AVX2 (2013): extends AVX to most instructions. FMA. Gather.
Some nuances with AMD, but mostly the same.
Other classical ones: AltiVec for Power and Neon for ARM.
How to use vector instructions

Code and pray
The compiler can do some of it. It typically does some vectorization with -O2. It might be necessary to use some -march/-mtune options.

Guide the compiler
#pragma simd
#pragma vector aligned
Intel Cilk Plus elemental extension: __declspec(vector)

Write assembly
Long and tedious. The previous section showed how difficult understanding architectures can be.

Use high-level types
Use C++ types and operator overloading to achieve some abstraction.

Use intrinsics/builtins
Most compilers expose vector types on which built-in/intrinsic functions can be applied.
AVX (Sandy Bridge) and AVX2 (Haswell)

Registers
There are 16 256-bit registers called YMM[0-15]. They overlap with the XMM[0-15] registers of SSE.
Declare them in C (with the proper include, immintrin.h):
- __m256: single precision
- __m256d: double precision
- __m256i: integers
Notice the lack of signedness and size for the integers.
All operations are suffixed with the type of data they operate on:
- ps: packed single precision
- pd: packed double precision
- epi32: signed 32-bit integers (mostly AVX2)
- epu8: unsigned 8-bit integers (mostly AVX2)
Note: read the documentation carefully, many are confusing.
Different kinds of instructions

Unary vertical operations
__m256d _mm256_sqrt_pd(__m256d a);
__m256 _mm256_rsqrt_ps(__m256 a);
__m256 _mm256_rcp_ps(__m256 a);

For instance:
__m256d a = _mm256_set_pd(4.0, 9.0, 16.0, 25.0);  -> 4 9 16 25
__m256d b = _mm256_sqrt_pd(a);                     -> 2 3 4 5
Different kinds of instructions

Binary vertical operations
__m256i _mm256_add_epi16(__m256i m1, __m256i m2); (AVX2)
__m256 _mm256_mul_ps(__m256 m1, __m256 m2);
__m256d _mm256_div_pd(__m256d m1, __m256d m2);
__m256 _mm256_max_ps(__m256 m1, __m256 m2);
__m256d _mm256_xor_pd(__m256d m1, __m256d m2);
__m256 _mm256_fmadd_ps(__m256 a, __m256 b, __m256 c); (AVX2/FMA)

For instance:
__m256d a = _mm256_set_pd(1., 2., 3., 4.);     -> 1 2 3 4
__m256d b = _mm256_set_pd(10., 20., 30., 40.); -> 10 20 30 40
__m256d c = _mm256_add_pd(a, b);               -> 11 22 33 44
Different kinds of instructions

Horizontal operations
__m256d _mm256_hadd_pd(__m256d m1, __m256d m2);
__m256 _mm256_dp_ps(__m256 m1, __m256 m2, const int mask);

For instance:
__m256d a = _mm256_set_pd(1., 2., 3., 4.);     -> 1 2 3 4
__m256d b = _mm256_set_pd(10., 20., 30., 40.); -> 10 20 30 40
__m256d c = _mm256_hadd_pd(a, b);              -> pairwise sums 3 and 7 from a, 30 and 70 from b
(note: the actual instruction interleaves the sums from a and b per 128-bit lane)
Different kinds of instructions

Getting data from regular registers
__m256i _mm256_setzero_si256(void);
__m256 _mm256_set1_ps(float);
__m256d _mm256_set_pd(double, double, double, double);

Getting data from a vector register
__m128i _mm256_extractf128_si256(__m256i m1, const int offset);
__m256d _mm256_shuffle_pd(__m256d m1, __m256d m2, const int select);
__m256 _mm256_blend_ps(__m256 m1, __m256 m2, const int mask);
__m256 _mm256_permute_ps(__m256 m1, int control);
Different kinds of instructions

Getting data from memory
__m256 _mm256_load_ps(float const *a);
__m256d _mm256_loadu_pd(double const *a); (unaligned)
__m256 _mm256_broadcast_ps(__m128 const *a);
__m256d _mm256_broadcast_sd(double const *a);
__m128d _mm_i64gather_pd(double const *base, __m128i vindex, const int scale); (AVX2)
Data Layout
Array of Structures vs Structure of Arrays (AoS/SoA). A gaming example.
Capabilities

Being careful
Of course, not all processors can use all the instructions. Using a non-existing instruction leads to an illegal instruction exception (most of the time). You can know which processor you are using with the CPUID assembly instruction. Linux exposes the available instruction sets in /proc/cpuinfo.

On the software side
Many libraries and programs probe the capabilities of the CPU and choose which code path to take.
- FFmpeg: http://www.ffmpeg.org/doxygen/0.6/cpuid_8c-source.html
- Intel MKL. It used to be way too restrictive.
Summary
Lots of performance is available by using vector operations. AVX can process in a single instruction:
- 4 double precision floating point values
- 8 single precision floating point values
- 8 32-bit integers (AVX2)
- 16 16-bit integers (AVX2)
- 32 8-bit integers (AVX2)
Fused Multiply-Add can potentially double that number again (AVX2).
Not only vertical operations:
- horizontal operations
- data reordering
- gather (AVX2)
Outline
1 General architecture
2 Vector processing
3 Hardware Threading
4 Resources
Why Hardware Threading?

Context switching
Sometimes the CPU idles because it waits for data:
- there is a cache/page fault.
- it waits for the system to wake it up (waiting on a condition to be met by another process).
- it is waiting for I/O.
- its time quantum has elapsed.

Low-level reasons
Sometimes some resources of the processor are left idle:
- waiting for data in the L2 or L3 cache to reach L1.
- contention on the L1 cache.
- pipeline bubbles.
What is it?

Multiple hardware contexts
The principle is to have multiple hardware contexts within a core:
- Replicate the registers.
- Do not replicate the execution components.
Switching from one context to the other is now virtually free. This keeps more potential work available for the execution components, so efficiency increases.

Multiple kinds
- One thread runs until it blocks, then switch.
- Round robin over the threads.
- Decode everything at once and let instructions execute as they can (SMT).
It is a very old idea, originally used to overlap disk I/O and computation. Now this (or similar ideas) is used by most computing hardware.
Interesting facts about threading
- Often used to overlap data transfers and operations.
- Can cause more contention on shared resources, leading to a performance drop.
- On compute kernels, 30% can be gained.
- On I/O-bound applications, a large factor can be gained.
- On x86 CPUs, rarely more than 2-way SMT. (Xeon Phi is 4-way.)
- Sparc had 16-way threading on some processors.
Conclusion (before talking about OpenMP)

Generalities
There is parallelism at many levels:
- at the processor level (multiple cores)
- at the core level with SMT
- within a core (multiple instruction ports)
- within a port (SIMD)
To understand performance, you need to understand how things are going to be executed. It is very difficult to predict actual performance.

Things that matter
- Frequency
- Number of cores
- Instruction set
- Architecture of the core
Outline
1 General architecture
2 Vector processing
3 Hardware Threading
4 Resources
Resources
- Classical RISC: http://en.wikipedia.org/wiki/classic_risc_pipeline
- Detailed explanation of Nehalem: http://sc.tamu.edu/systems/eos/
- Composer XE documentation for SIMD instructions: http://software.intel.com/sites/products/documentation/doclib/stdxe/2013/composerxe/compiler/cpp-mac/guid-64e5bfbb-fe9e-47cb-82a7-76c2ad57ed9c.htm
- On pragma simd: http://software.intel.com/en-us/articles/requirements-for-vectorizing-loops-with-pragma-simd
- Intel compiler ways to vectorize automatically: http://software.intel.com/en-us/articles/using-intel-avx-without-writing-avx
- SIMD in gcc: http://ds9a.nl/gcc-simd/index.html (a little old)
- Builtins in gcc: http://gcc.gnu.org/onlinedocs/gcc-4.8.1/gcc/target-builtins.html
- Intel developer manual: http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html
- SMT on Wikipedia: http://en.wikipedia.org/wiki/simultaneous_multithreading
- CPUID: http://en.wikipedia.org/wiki/cpuid
- On Stream Programming (vectorization): http://www.akkadia.org/drepper/summit09-stream.pdf
- Sandy Bridge/Haswell architectural comparison: http://www.realworldtech.com/haswell-cpu/