Exascale Challenges and General Purpose Processors. Avinash Sodani, Ph.D. Chief Architect, Knights Landing Processor Intel Corporation



1 Exascale Challenges and General Purpose Processors Avinash Sodani, Ph.D. Chief Architect, Knights Landing Processor Intel Corporation


2 Exponential Compute Growth
[Chart: compute growth on a log-scale axis (1.E-01 to 1.E+10) versus time, Jun-93 to Apr-20]
Application drivers: Understand Universe, Health, Energy, Weather
Appetite for compute will continue to grow exponentially, fueled by the need to solve many fundamental and life-changing problems.

3 Many Challenges to Reach Exascale
Power efficiency: fit in the data center power budget
Space efficiency: fit in available floor space
Memory technology: feed compute power-efficiently
Network technology: connect nodes power-efficiently
Reliability: control the increased probability of failures
Software: utilize the full capability of the hardware
And more


4 Challenge 1: Compute Power
At system level:
Today: 33 PF, 18 MW, ~550 pJ/op
Exaflop: 1000 PF, 20 MW, 20 pJ/op
Needs improvements in all system components
Processor subsystem needs to reduce to 10 pJ/op
~28x improvement needed for Exascale by 2018/19
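The energy-per-operation figures on this slide follow from dividing system power by throughput. A minimal sketch of that arithmetic is below; the function name `picojoules_per_op` is illustrative and not from the talk, and the sketch assumes 1 PF means 10^15 operations per second.

```python
def picojoules_per_op(power_megawatts: float, petaflops: float) -> float:
    """Energy per operation in pJ, given system power (MW) and throughput (PF)."""
    watts = power_megawatts * 1e6          # MW -> W (joules per second)
    ops_per_second = petaflops * 1e15      # PF -> operations per second
    joules_per_op = watts / ops_per_second
    return joules_per_op * 1e12            # J -> pJ


today = picojoules_per_op(18, 33)          # the slide's "today" system
exascale = picojoules_per_op(20, 1000)     # the slide's Exaflop target
improvement = today / exascale

print(f"today:    {today:.0f} pJ/op")
print(f"exascale: {exascale:.0f} pJ/op")
print(f"required improvement: {improvement:.1f}x")
```

The unrounded result for today's system is about 545 pJ/op, which the slide rounds to 550; dividing the rounded figure by 20 pJ/op gives the quoted ~28x.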


5 Challenge 2: Memory
Memory bandwidth is fundamental to HPC performance
Need to balance bandwidth with capacity and power
1 ExaFlop machine: ~ PB/sec, ~2-3 pJ/bit
GDDR will be out of steam
Periphery-connected solutions will run into pin count issues
Existing technology trends leave a 3-4x gap on pJ/bit


6 Power: Commonly Held Myths
General-purpose processors cannot achieve the required efficiencies; special-purpose processors are needed.
Single-thread performance features and legacy features are too power hungry.
The IA memory model and cache coherence are too power hungry.
Caches don't help for HPC; they waste power.


7 Myth 1: General-purpose processors cannot achieve the required efficiencies. Need special-purpose processors

8 Performance/Power Progression
[Roadmap graphic: process nodes 32nm, nm, 14nm, 10nm, 7nm; years 2011, 2013*, 2015*, 2017*; stages labeled DEVELOPMENT and RESEARCH]
Moore's Law scaling continues to be alive and well
Process: 1.3x - 1.4x (per generation)
Arch/Uarch: 1.1x - 2.0x (per generation)
Recurring improvement: x every 2 years
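To see how the two per-generation ranges might combine, the sketch below multiplies them and compounds the result over several generations. It assumes the process and architecture gains are independent and multiplicative, which the slide does not state; the function names and the choice of three generations are hypothetical, and the output is not a reconstruction of the slide's own recurring-improvement figure.

```python
def combined_gain(process_gain: float, arch_gain: float) -> float:
    """Per-generation gain, ASSUMING process and architecture gains multiply."""
    return process_gain * arch_gain


def gain_after(generations: int, per_generation: float) -> float:
    """Compound a per-generation gain over a number of generations."""
    return per_generation ** generations


low = combined_gain(1.3, 1.1)    # lower ends of both ranges on the slide
high = combined_gain(1.4, 2.0)   # upper ends of both ranges on the slide

GENERATIONS = 3                  # hypothetical count, chosen for illustration
print(f"per generation: {low:.2f}x to {high:.2f}x")
print(f"after {GENERATIONS} generations: "
      f"{gain_after(GENERATIONS, low):.1f}x to {gain_after(GENERATIONS, high):.1f}x")
```

Under these assumptions the per-generation range is 1.43x to 2.8x, so the spread after a few generations is wide.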


9 Energy/Op Reduction over Time
[Chart: energy per operation over time, with the remaining gap marked as ~2.5x]
Gap reduces to ~2.5x from ~30x with existing techniques!
Do not need special-purpose processing to bridge this gap

10 Myth 2: Single thread performance features and legacy features too power hungry

11 Typical Core-Level Power Distribution
[Pie chart: core power for a compute-heavy floating-point application, by category: Fetch+Decode, OOO+speculation, Integer Execution, FP Execution, Caches, TLBs, Legacy, Others. Values shown: 1%, 5%, 2%, 12%, 75%, 3%, 0%, 2%]
Power is dominated by compute, as should be the case
OOO/Speculation/TLB: < 10%
X86 Legacy+Decode = ~1%


12 Typical Chip-level Power Distribution
[Pie chart: chip power by category: Fetch+Decode, OOO+speculation, Integer Execution, FP Execution, Caches, TLBs, Legacy, Others, Uncore. Values shown: 40%, 45%]
At chip level, core power is an even smaller portion (~15%).
X86 support, OOO, and TLBs are ~6% of the chip power
Benefits outweigh the gains from removing them


13 Myth 3: IA memory model, Cache Coherence too power hungry

14 Coherency Power Distribution
[Pie charts: chip power across Core+Cache, Memory, IO, and On-die interconnect, with a breakdown into Data Transfer, Address, Snoops/Resp, and Overhead. Values shown: 2%, 3%, 1%, 15%, 9%, 5%, 20%, 60%]
Typically coherency traffic is 4% of total power
Programming benefits outweigh the power cost


15 Myth 4: Caches don't help for HPC; they waste power

16 MPKI in HPC Workloads
[Charts: MPKI (misses per thousand instructions) across HPC workloads]
Most HPC workloads benefit from caches
Less than 20 MPKI for 1M-4M caches

17 Caches save power
[Charts: relative BW and relative BW/Watt for memory, L2 cache, and L1 cache]
Caches save power since memory communication is avoided
Caches are 8x-45x better at BW/Watt compared to memory
Power break-even point is around an 11% hit rate (L2 cache)
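The slide does not say how the break-even hit rate was derived. One simple model that gives a figure in the same range is sketched below: every access pays the cache's energy cost, and only misses additionally pay the memory cost. The model, the normalized energy units, and the function names are assumptions made for illustration, not the speaker's method.

```python
def energy_with_cache(hit_rate: float, efficiency_ratio: float,
                      memory_energy: float = 1.0) -> float:
    """Average energy per access when a cache sits in front of memory.

    ASSUMED model: each access costs memory_energy / efficiency_ratio in the
    cache, and each miss costs a further memory_energy to reach memory.
    """
    cache_energy = memory_energy / efficiency_ratio
    return cache_energy + (1.0 - hit_rate) * memory_energy


def break_even_hit_rate(efficiency_ratio: float) -> float:
    """Hit rate at which the cached path costs the same as memory alone."""
    return 1.0 / efficiency_ratio


for ratio in (8, 9, 45):
    h = break_even_hit_rate(ratio)
    print(f"cache {ratio}x better per byte -> break-even hit rate {h:.1%}")
```

Under this model a cache that is about 9x more efficient than memory breaks even near an 11% hit rate, and any higher hit rate saves energy. The slide's 8x-45x range would put the break-even point between roughly 2% and 12.5%.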


18 General purpose processors can achieve Exascale power efficiencies

19 Memory: Approach Forward
Significant power is consumed in memory
Need to drive 20 pJ/bit down to 2-3 pJ/bit
Balancing BW, capacity and power is a hard problem
More hierarchical memories
Progressively more integration
[Diagram: three integration stages labeled Multi-package, Multi-chip Package, and Direct Attach, each showing memory, CPU die, and logic die placement]
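To put the pJ/bit targets in terms of power, the sketch below converts a sustained memory bandwidth and an energy per bit into kilowatts. The 1 PB/s bandwidth is an illustrative unit rather than a figure from the talk, the function name is hypothetical, and 1 PB is taken as 10^15 bytes.

```python
def memory_power_kilowatts(bandwidth_pb_per_s: float,
                           picojoules_per_bit: float) -> float:
    """Power needed to move data at the given bandwidth and energy per bit."""
    bits_per_second = bandwidth_pb_per_s * 1e15 * 8   # PB/s -> bits/s
    watts = bits_per_second * picojoules_per_bit * 1e-12
    return watts / 1e3


BANDWIDTH = 1.0  # PB/s, illustrative unit only (not a figure from the talk)

for pj_per_bit in (20, 3, 2):
    kw = memory_power_kilowatts(BANDWIDTH, pj_per_bit)
    print(f"{pj_per_bit:>2} pJ/bit at {BANDWIDTH} PB/s -> {kw:.0f} kW")
```

Each PB/s of sustained bandwidth costs about 160 kW at 20 pJ/bit, against 16 to 24 kW at 2-3 pJ/bit, which is why the slide treats energy per bit as the quantity to drive down.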


20 Next Intel Xeon Phi Processor: Knights Landing
Designed using Intel's cutting-edge 14nm process
Not bound by offloading bottlenecks
Standalone CPU or PCIe coprocessor
Leadership compute & memory bandwidth
Integrated on-package memory
All products, computer systems, dates and figures specified are preliminary based on current expectations, and are subject to change without notice.


21 Benefits of General Purpose Programming
Familiar SW tools: new languages/models not required
Familiar programming model: MPI, OpenMP
[Diagram: a single source base passes through compilers, libraries, and parallel models to multi-core CPUs, many-core processors, and multi-core and many-core clusters]
Maintain single code base: same SW can run for multi-cores and many-cores
Optimize code just once: optimizations for many-cores improve performance for multi-core as well
Intel MIC Architecture

22 Lots of wide vectors
Many IA cores
Lots of IA threads
Coherent cache hierarchy
Large on-package high-bandwidth memory in addition to DDR
Standalone general-purpose CPU: no PCIe overhead
[Diagram labeled Future Xeon-Phi: cores with vectors and threads, alongside on-package high-BW memory]

23 What Does It Mean For Programmers
Existing CPU SW will work, but effort is needed to prepare SW to utilize Xeon Phi's full compute capability.
Expose parallelism in programs to use all cores: MPI ranks, threads, Cilk+
Remove constructs that prevent the compiler from vectorizing
Block data in caches as much as possible (power efficient)
Partition data per node to maximize on-package memory usage
Code remains portable. Optimization improves performance on Xeon processors as well.
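As an illustration of the "block data in caches" advice, the sketch below shows the loop structure of a tiled matrix multiply next to the naive version. It is written in Python only to show the loop ordering; Python lists will not exhibit the cache or vectorization benefit, and the function names and block size are hypothetical. In a compiled language, the unit-stride innermost loop over `j` is also the form a vectorizing compiler typically handles well.

```python
def naive_matmul(a, b):
    """Reference triple loop: c[i][j] = sum over k of a[i][k] * b[k][j]."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            for k in range(m):
                c[i][j] += a[i][k] * b[k][j]
    return c


def blocked_matmul(a, b, block=2):
    """Same result, computed tile by tile so each tile's data is reused."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    for i0 in range(0, n, block):              # tile over rows of a and c
        for k0 in range(0, m, block):          # tile over the shared dimension
            for j0 in range(0, p, block):      # tile over columns of b and c
                for i in range(i0, min(i0 + block, n)):
                    for k in range(k0, min(k0 + block, m)):
                        a_ik = a[i][k]         # reused across the inner loop
                        for j in range(j0, min(j0 + block, p)):
                            c[i][j] += a_ik * b[k][j]
    return c


a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
b = [[9, 8, 7], [6, 5, 4], [3, 2, 1]]
print(blocked_matmul(a, b) == naive_matmul(a, b))
```

The block size would normally be chosen so that the tiles being worked on fit in the target cache level; the value 2 here only keeps the example small.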

24 Summary
Many challenges to reach Exascale; power is one of them
General-purpose processors will achieve Exascale power efficiencies
Energy/op trends show a bridgeable gap of ~2x to Exascale (not 50x)
General-purpose programming allows use of existing tools and programming methods.
Effort is needed to prepare SW to utilize Xeon Phi's full compute capability, but optimized code remains portable for general-purpose processors.
More integration over time to reduce power and increase reliability
