New Dimensions in Configurable Computing: configurability at runtime simultaneously allows Big Data and fine-grain HPC
Alan Gara, Intel Fellow, Exascale Chief Architect
Legal Disclaimer Today's presentations contain forward-looking statements. All statements made that are not historical facts are subject to a number of risks and uncertainties, and actual results may differ materially. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. INTEL PRODUCTS ARE NOT INTENDED FOR USE IN MEDICAL, LIFE SAVING, OR LIFE SUSTAINING APPLICATIONS. Intel does not control or audit the design or implementation of third-party benchmarks or Web sites referenced in this document. Intel encourages all of its customers to visit the referenced Web sites or others where similar performance benchmarks are reported and confirm whether the referenced benchmarks are accurate and reflect performance of systems available for purchase. Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families. See www.intel.com/products/processor_number for details. Intel processors, chipsets, and desktop boards may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request. Intel, Intel Xeon, Intel Core microarchitecture, and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. *Other names and brands may be claimed as the property of others. Copyright 2011, Intel Corporation. All rights reserved.
The next step in perf/$

A typical DRAM memory die in 2016 (~8 Gb) will be about 100 mm^2 (as always). A processor floating-point unit is ~0.03 mm^2 (2 flops/cycle). Even if the core is 100x bigger than the FPU, at 1.0 GB/core we have far more silicon in memory than in processing. This is not cost balanced.

For cost balance we need to either:
1) use much less memory per compute, or
2) make the physical size of capacity much smaller.

Threading gives us a mechanism to change this balance, if we have enough bandwidth to support much higher compute/memory ratios. New memory architectures allow us to get a significant step in perf/$.

DARPA Exascale computing study: Exascale_Final_report_100208.pdf
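The cost-balance claim on this slide can be checked with back-of-envelope arithmetic. The sketch below uses only the figures quoted on the slide (die area, FPU area, 100x core overhead, 1 GB/core); the exact ratio depends entirely on those assumed inputs.

```python
# Back-of-envelope silicon-area balance between memory and compute,
# using the illustrative figures quoted on the slide.

DRAM_DIE_MM2 = 100.0   # ~8 Gb DRAM die area (2016 projection), mm^2
DRAM_DIE_GB = 1.0      # 8 Gb = 1 GB of capacity per die
FPU_MM2 = 0.03         # floating-point unit area (2 flops/cycle), mm^2
CORE_OVERHEAD = 100.0  # assume the whole core is 100x the FPU area
GB_PER_CORE = 1.0      # memory capacity provisioned per core

core_mm2 = FPU_MM2 * CORE_OVERHEAD                      # 3 mm^2 of compute
memory_mm2 = GB_PER_CORE / DRAM_DIE_GB * DRAM_DIE_MM2   # 100 mm^2 of memory

ratio = memory_mm2 / core_mm2
print(f"memory silicon per core:  {memory_mm2:.0f} mm^2")
print(f"compute silicon per core: {core_mm2:.0f} mm^2")
print(f"memory/compute area ratio: {ratio:.0f}x")
```

With these literal inputs the ratio comes out around 30x; with a leaner core (or counting only the FPU) it is far larger. Either way, memory silicon dominates, which is the imbalance the slide argues against.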
Big Data meets HPC

                            Big Data         HPC
Memory Capacity             Large            Small to Large
Bandwidth to Memory         Small to Large   Large
System Fabric Bandwidth     Small to Large   Large
System compute capability   Small to Large   Large

Big Data bandwidth requirements to data vary: random access requires high bandwidth, while cacheable accesses can tolerate much lower bandwidth to memory. Where data can be cached matters: cached at the processor, fabric requirements are lower; at remote memory, fabric requirements are high.
Will HPC and Big Data drive a different system balance point?

HPC and Big Data allocate cost differently across storage/NVM, interconnect, memory, and processor (synthetic data for illustration only), but the variation in budgeting will remain bounded (roughly 1:10).
System commonality between big data and HPC

We need to understand future memory technology characteristics. It comes down to bandwidth (assuming we have better $/bit): does the new memory deliver DRAM-like bandwidth or not? This can be a fundamental property of the technology or a market/microarchitecture choice.

System architectures for Big Data and HPC are very similar, and both benefit from new technologies; Big Data will benefit more. The architectures will not be identical, but configurability will allow for crossover.
This will also drive user effort

If new memory technologies replace/augment DRAM: memory capacity per compute is 5x-10x better than with DRAM; the need for threading is modest when the new technologies are available, and task scaling can be effectively applied to many applications.

If DRAM remains the dominant load-store memory technology: memory capacity per performance drops 10x to 20x from current levels; aggressive threading becomes commonplace/necessary; programming-model changes focus on thread scaling, aggressively striving for more performance at similar task counts.
Microarchitecture choices will drive the bandwidth/capacity tradeoff

[Figure: density versus bandwidth trade-off, DRAM (approx) vs NVM (optimal). Illustrative curves: not based on actual data.]
Courtesy of James Hutchby SRC
Two Design Options For Supercomputing: either a processor with >10B transistors on a die in 2020, or a processor with fewer transistors on a smaller, cost-effective die.
Option 1: Large Die With >10B Transistors

a) More cache, fewer cores, everything integrated: cache size beyond a certain threshold is not utilized by the programmer.
b) More cores, enough cache for HPC, everything integrated: high FLOPS count on a die, but enough on-package memory becomes difficult to implement, and extreme performance levels result in problematic off-package memory usage.
c) A flavor of cores (powerful cores for single-thread performance, smaller cores for highly parallel work), enough cache for HPC, everything integrated: again, enough on-package memory becomes difficult to implement, and extreme performance levels result in problematic off-package memory usage.

Does any of these enable on-package memory?
Option 2: Cost-Effective Die That Supports On-Package Memory

Building block: stacked memory (TSV), a scalable fabric, and a processor die matched to performance, which can be much smaller than the memory.

Broad usage: with the right memory capacity per building block, it can address a large portion of the HPC market.
Cost: building blocks can replace the compute and DRAM in a node (at the right price point).
Scalability: configure a building block as memory or memory+compute.
Power: better thermal solution with disaggregated compute blocks.
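One way to read the scalability point is as a parts list: every building block contributes stacked memory and, optionally, compute to a node. The toy model below sketches that idea; all class names, per-block capacities, and bandwidths are hypothetical, not Intel specifications.

```python
from dataclasses import dataclass

@dataclass
class BuildingBlock:
    """One stacked-memory building block; the compute die is optional."""
    memory_gb: float           # capacity of the memory stack
    memory_bw_gbs: float       # bandwidth of the memory stack
    compute_tflops: float = 0  # 0 when configured as memory-only

def node_totals(blocks):
    """Aggregate capacity, bandwidth, and compute over a node's blocks."""
    return (sum(b.memory_gb for b in blocks),
            sum(b.memory_bw_gbs for b in blocks),
            sum(b.compute_tflops for b in blocks))

# A node mixing memory+compute blocks with memory-only blocks
# (illustrative per-block numbers).
node = ([BuildingBlock(64, 400, 4.0)] * 4   # memory + compute
        + [BuildingBlock(64, 400)] * 4)     # memory only

cap_gb, bw_gbs, tflops = node_totals(node)
print(f"{cap_gb} GB, {bw_gbs} GB/s, {tflops} TFLOPS")
```

The point of the sketch is that capacity and bandwidth scale with every block added, while compute scales only with the blocks configured as memory+compute, which is how one design can serve both Big Data and HPC balance points.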
The Possibilities With the Building Block Approach at Exascale

                                     Building block   Evolved
Cost                                 1                1
Memory capacity (in-package)         2 TB             300 GB
Memory capacity (outside package)    Assume none      2 TB (DDR4/5)
Number of cores                      8000             1000
Memory bandwidth (in-package)        50 TB/s          5 TB/s
Memory bandwidth (outside package)   Assume none      400 GB/s
Performance peak                     512 TF           64 TF

Synthetic data for illustration only.
1) On-package memory has 8-10x the bandwidth compared to external memory.
2) At iso cost and memory capacity, on-package memory enables 8-10x additional compute to be placed under the memory.
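The two design points can be compared on the classic bytes-per-flop metric. The numbers below are copied from the slide's (synthetic, illustrative) table; the comparison itself is this editor's arithmetic, not part of the original deck.

```python
# Bytes-per-flop comparison of the two exascale design points,
# using the synthetic figures from the slide's table.

designs = {
    # name: (total memory bandwidth in TB/s, peak performance in TF)
    "building_block": (50.0 + 0.0, 512.0),  # 50 TB/s in-package, none outside
    "evolved":        (5.0 + 0.4, 64.0),    # 5 TB/s in-package + 400 GB/s DDR
}

# TB/s divided by TF is dimensionally bytes per flop.
ratios = {name: bw_tbs / peak_tf
          for name, (bw_tbs, peak_tf) in designs.items()}

for name, bpf in ratios.items():
    print(f"{name}: {bpf:.3f} bytes/flop")
```

Both points land near ~0.1 bytes/flop, so the building-block design holds roughly the same memory balance while delivering 8x the peak, consistent with footnote 2's claim that on-package bandwidth lets 8-10x more compute sit under the same memory.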
The Motherboard in 2020: Just a Backplane of Cards?
Summary

While Big Data and HPC have different memory access patterns, cost balance will drive them to similar system balances. New memory technologies will drive future system architecture design points, especially for Big Data. New packaging technologies open up new directions, allowing for a new dimension of disaggregation. We must remember that power is and will remain the biggest challenge: we can no longer improve performance faster than we improve energy efficiency.