Exascale Challenges and General Purpose Processors
Avinash Sodani, Ph.D.
Chief Architect, Knights Landing Processor
Intel Corporation
Exponential Compute Growth
[Chart: compute demand growing exponentially from 1993 through 2020, spanning roughly ten orders of magnitude; application drivers labeled Understand Universe, Health, Energy, Weather]
- Appetite for compute will continue to grow exponentially
- Fueled by the need to solve many fundamental and life-changing problems
Many Challenges to Reach Exascale
- Power efficiency: fit in the data center power budget
- Space efficiency: fit in the available floor space
- Memory technology: feed the compute power-efficiently
- Network technology: connect nodes power-efficiently
- Reliability: control the increased probability of failures
- Software: utilize the full capability of the hardware
- And more
Challenge 1: Compute Power
At the system level:
- Today: 33 PF at 18 MW, i.e. ~550 pJ/op
- Exaflop: 1000 PF at 20 MW, i.e. 20 pJ/op
- Needs improvements in all system components
- Processor subsystem needs to reduce to 10 pJ/op
- ~28x improvement needed for Exascale by 2018/19
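As a sanity check on these targets, a minimal sketch of the energy-per-operation arithmetic (the system figures are the ones above; the helper function is illustrative only):

```c
#include <stdio.h>

/* Energy per operation in picojoules: power (MW) divided by throughput (PFLOP/s).
 * 1 MW = 1e6 J/s, 1 PFLOP/s = 1e15 op/s, 1 pJ = 1e-12 J. */
static double pj_per_op(double megawatts, double pflops) {
    return (megawatts * 1e6) / (pflops * 1e15) * 1e12;
}

int main(void) {
    double today = pj_per_op(18.0, 33.0);    /* ~545 pJ/op, the ~550 quoted above */
    double exa   = pj_per_op(20.0, 1000.0);  /* 20 pJ/op */
    printf("today: %.0f pJ/op, Exascale: %.0f pJ/op, gap: ~%.0fx\n",
           today, exa, today / exa);         /* ratio ~27-28x */
    return 0;
}
```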
Challenge 2: Memory
- Memory bandwidth is fundamental to HPC performance
- Need to balance bandwidth with capacity and power
- A 1 ExaFlop machine needs ~200-300 PB/s at ~2-3 pJ/bit
- GDDR will be out of steam
- Periphery-connected solutions will run into pin-count issues
- Existing technology trends leave a 3-4x gap on pJ/bit
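To see why the pJ/bit target is so aggressive, a rough sketch of the memory power implied by that bandwidth (the 20 pJ/bit starting point is taken from the memory slide later in this deck; the 250 PB/s midpoint is an assumption):

```c
#include <stdio.h>

/* Memory subsystem power in MW for an aggregate bandwidth (PB/s)
 * and a per-bit transfer energy (pJ/bit). */
static double mem_power_mw(double pb_per_s, double pj_per_bit) {
    double bits_per_s = pb_per_s * 1e15 * 8.0;        /* bytes/s -> bits/s */
    return bits_per_s * pj_per_bit * 1e-12 / 1e6;     /* J/s -> MW */
}

int main(void) {
    /* At today's ~20 pJ/bit the memory alone would blow the 20 MW system budget;
     * at 2-3 pJ/bit it fits as one component among many. */
    printf("250 PB/s at 20.0 pJ/bit: %.0f MW\n", mem_power_mw(250.0, 20.0)); /* ~40 MW */
    printf("250 PB/s at  2.5 pJ/bit: %.0f MW\n", mem_power_mw(250.0, 2.5));  /* ~5 MW  */
    return 0;
}
```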
Power: Commonly Held Myths
- General-purpose processors cannot achieve the required efficiencies; need special-purpose processors
- Single thread performance features and legacy features too power hungry
- IA memory model, cache coherence too power hungry
- Caches don't help for HPC; they waste power
Myth 1: General-purpose processors cannot achieve the required efficiencies. Need special-purpose processors
Performance/Power Progression
- Process roadmap: 32nm (2009), 22nm (2011); in development: 14nm (2013*), 10nm (2015*), 7nm (2017*); in research: 2019+
- Moore's Law scaling continues to be alive and well
- Process: 1.3x - 1.4x per generation
- Arch/uarch: 1.1x - 2.0x per generation
- Recurring improvement: 1.4x - 3.0x every 2 years
Energy/Op Reduction over Time
[Chart: projected energy per operation over time against the Exascale target, gap ~2.5x]
- Gap reduces to ~2.5x (from ~30x) with existing techniques!
- Do not need special-purpose processing to bridge this gap
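A rough compounding sketch of how the recurring gains on the previous slide close most of that gap (the three-generation horizon to 2018/19 is an assumption):

```c
#include <stdio.h>
#include <math.h>

int main(void) {
    /* 1.4x - 3.0x energy-efficiency gain every 2 years (process + arch/uarch),
     * compounded over ~3 generations between 2013 and the 2018/19 timeframe. */
    double low  = pow(1.4, 3);   /* ~2.7x */
    double high = pow(3.0, 3);   /* ~27x  */
    printf("cumulative gain over 3 generations: %.1fx - %.1fx\n", low, high);
    /* Set against the ~28x improvement needed for Exascale, gains of this size
     * are why the residual gap shrinks to a few x rather than tens of x. */
    return 0;
}
```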
Myth 2: Single thread performance features and legacy features too power hungry
Typical Core-Level Power Distribution
[Pie chart: core power for a compute-heavy floating-point application, split across FP Execution (~75%), Fetch+Decode, OOO+speculation, Integer Execution, Caches, TLBs, Legacy, and Others]
- Power is dominated by compute, as it should be
- OOO/speculation/TLBs: < 10%
- x86 legacy + decode: ~1%
Typical Chip-Level Power Distribution
[Pie chart: whole-chip power split across the core blocks (Fetch+Decode, OOO+speculation, Integer Execution, FP Execution, Caches, TLBs, Legacy, Others) and the uncore]
- At the chip level, core power is an even smaller portion (~15%)
- x86 support, OOO, TLBs: ~6% of chip power
- Their benefits outweigh the gains from removing them
Myth 3: IA memory model, Cache Coherence too power hungry
Coherency Power Distribution
[Pie charts: total power split across Core+Cache, Memory, IO, and the on-die interconnect; interconnect traffic further broken into data transfer, address, snoops/responses, and overhead]
- Coherency traffic is typically ~4% of total power
- The programming benefits outweigh the power cost
Myth 4: Caches don't help for HPC; they waste power
MPKI in HPC Workloads
[Chart: cache misses per kilo-instruction (MPKI) for HPC workloads across a range of cache sizes]
- Most HPC workloads benefit from caches
- Less than 20 MPKI for 1M-4M caches
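For reference, MPKI just normalizes cache misses to every thousand instructions retired; a minimal sketch (the counter values are illustrative only):

```c
#include <stdio.h>

/* Misses per kilo-instruction: cache misses per 1000 retired instructions. */
static double mpki(unsigned long long misses, unsigned long long instructions) {
    return 1000.0 * (double)misses / (double)instructions;
}

int main(void) {
    /* Illustrative counts: 1.5M cache misses over 100M instructions -> 15 MPKI,
     * inside the "< 20 MPKI" range quoted above. */
    printf("MPKI = %.1f\n", mpki(1500000ULL, 100000000ULL));
    return 0;
}
```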
Caches Save Power
[Bar chart: relative bandwidth and relative bandwidth per watt for memory, L2 cache, and L1 cache]
- Caches save power because memory communication is avoided
- Caches are 8x - 45x better at BW/Watt than memory
- The power break-even point is around an 11% hit rate (L2 cache)
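A simple energy model shows where that break-even hit rate comes from (a sketch assuming every access probes the cache and only misses go to memory; the per-access energies are illustrative, chosen so the cache is ~9x cheaper per access than memory, the low end of the range above):

```c
#include <stdio.h>

int main(void) {
    double e_cache = 1.0;   /* energy of one cache lookup (arbitrary units) */
    double e_mem   = 9.0;   /* energy of one memory access (arbitrary units) */

    /* With a cache, every access pays e_cache and only misses pay e_mem:
     *     E(h) = e_cache + (1 - h) * e_mem
     * Without a cache, every access pays e_mem. Break-even when E(h) = e_mem,
     * i.e. at hit rate h = e_cache / e_mem. */
    double breakeven = e_cache / e_mem;
    printf("break-even hit rate: %.0f%%\n", breakeven * 100.0);   /* ~11% */
    return 0;
}
```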
General purpose processors can achieve Exascale power efficiencies
Memory: Approach Forward
- Significant power is consumed in memory
- Need to drive 20 pJ/bit down to 2-3 pJ/bit
- Balancing BW, capacity, and power is a hard problem
- More hierarchical memories
- Progressively more integration
[Diagram: integration progression from multi-package usage, to multi-chip package usage (memory plus logic die alongside the CPU die in one package), to direct-attach usage]
Next Intel Xeon Phi Processor: Knights Landing
- Designed using Intel's cutting-edge 14nm process
- Not bound by offloading bottlenecks: standalone CPU or PCIe coprocessor
- Leadership compute and memory bandwidth
- Integrated on-package memory
(All products, computer systems, dates and figures specified are preliminary based on current expectations, and are subject to change without notice.)
Benefits of General Purpose Programming
- Familiar SW tools and programming model: compilers, libraries, and parallel models such as MPI and OpenMP; no new languages or models required
- Maintain a single code base: the same SW runs on multi-core and many-core (Intel MIC Architecture) systems
- Optimize code just once: optimizations for many-core improve performance on multi-core as well
[Diagram: a single source fed through compilers, libraries, and parallel models, targeting multi-core CPUs, many-core processors, and mixed multi-core and many-core clusters]
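As a concrete illustration of the single-code-base point, a minimal OpenMP sketch (the kernel, problem size, and values are illustrative; the same source compiles unchanged for a multi-core Xeon or a many-core Xeon Phi):

```c
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

/* DAXPY-style kernel: one parallel, vectorizable loop that scales from a
 * handful of multi-core threads to many-core thread counts. */
static void daxpy(double a, const double *x, double *y, long n) {
    #pragma omp parallel for
    for (long i = 0; i < n; i++)
        y[i] += a * x[i];
}

int main(void) {
    long n = 1L << 20;                       /* illustrative problem size */
    double *x = malloc(n * sizeof *x);
    double *y = malloc(n * sizeof *y);
    for (long i = 0; i < n; i++) { x[i] = 1.0; y[i] = 2.0; }

    daxpy(3.0, x, y, n);
    printf("threads: %d, y[0] = %.1f\n", omp_get_max_threads(), y[0]);  /* y[0] = 5.0 */
    free(x); free(y);
    return 0;
}
```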
Future Xeon Phi
- Many IA cores, each with lots of wide vectors and lots of IA threads
- Coherent cache hierarchy
- Large on-package high-bandwidth memory in addition to DDR
- Standalone general-purpose CPU: no PCIe overhead
[Diagram: a grid of cores, each with vectors and threads, alongside on-package high-BW memory]
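The deck itself does not name a programming interface for the on-package memory; purely as one hedged illustration, the memkind library's hbwmalloc API is one way high-bandwidth memory has been exposed to code alongside ordinary DDR allocations. Treat the snippet as an assumption about tooling, not as the mechanism described here:

```c
#include <stdio.h>
#include <stdlib.h>
#include <hbwmalloc.h>   /* memkind's high-bandwidth-memory allocator (assumed interface) */

int main(void) {
    size_t n = 1 << 20;   /* illustrative array size */

    /* Bandwidth-critical data: place it in on-package high-BW memory when
     * available, otherwise fall back to ordinary DDR. */
    int have_hbw = (hbw_check_available() == 0);
    double *hot  = have_hbw ? hbw_malloc(n * sizeof *hot) : malloc(n * sizeof *hot);
    double *cold = malloc(n * sizeof *cold);   /* capacity data stays in DDR */

    for (size_t i = 0; i < n; i++) { hot[i] = 1.0; cold[i] = 2.0; }
    printf("hot[0] = %.1f, cold[0] = %.1f\n", hot[0], cold[0]);

    if (have_hbw) hbw_free(hot); else free(hot);
    free(cold);
    return 0;
}
```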
What Does It Mean for Programmers?
- Existing CPU SW will work, but effort is needed to prepare SW to utilize Xeon Phi's full compute capability (a small vectorization and blocking sketch follows this list):
- Expose parallelism in programs to use all cores: MPI ranks, threads, Cilk+
- Remove constructs that prevent the compiler from vectorizing
- Block data in caches as much as possible; it is power efficient
- Partition data per node to maximize on-package memory usage
- Code remains portable; the optimizations improve performance on Xeon processors as well
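A small sketch of the vectorize-and-block guidance (the loop structure, tile size, and matrix sizes are illustrative; restrict-qualified pointers and a unit-stride inner loop remove the kinds of constructs that keep the compiler from vectorizing):

```c
#include <stdio.h>
#include <stdlib.h>

#define N    512   /* illustrative problem size */
#define TILE 64    /* illustrative tile size, chosen to keep the working set cache-resident */

/* Cache-blocked matrix multiply, C += A * B. Blocking the i and k loops keeps
 * tiles of A and rows of B/C hot in cache; the contiguous inner j loop vectorizes. */
static void matmul_blocked(size_t n, const double *restrict A,
                           const double *restrict B, double *restrict C) {
    for (size_t ii = 0; ii < n; ii += TILE)
        for (size_t kk = 0; kk < n; kk += TILE)
            for (size_t i = ii; i < ii + TILE && i < n; i++)
                for (size_t k = kk; k < kk + TILE && k < n; k++) {
                    double a = A[i * n + k];
                    for (size_t j = 0; j < n; j++)       /* unit stride, vectorizable */
                        C[i * n + j] += a * B[k * n + j];
                }
}

int main(void) {
    double *A = calloc((size_t)N * N, sizeof *A);
    double *B = calloc((size_t)N * N, sizeof *B);
    double *C = calloc((size_t)N * N, sizeof *C);
    for (size_t i = 0; i < (size_t)N * N; i++) { A[i] = 1.0; B[i] = 1.0; }

    matmul_blocked(N, A, B, C);
    printf("C[0] = %.1f\n", C[0]);   /* every element of C is N = 512.0 */
    free(A); free(B); free(C);
    return 0;
}
```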
Summary
- Many challenges to reach Exascale; power is one of them
- General-purpose processors will achieve Exascale power efficiencies
- Energy/op trends show a bridgeable gap of ~2x to Exascale (not 50x)
- General-purpose programming allows use of existing tools and programming methods
- Effort is needed to prepare SW to utilize Xeon Phi's full compute capability, but optimized code remains portable to general-purpose processors
- More integration over time to reduce power and increase reliability