Exascale Challenges and General Purpose Processors. Avinash Sodani, Ph.D., Chief Architect, Knights Landing Processor, Intel Corporation




Exponential Compute Growth
[Chart: exponential growth in compute demand from 1993 onward, driven by applications such as understanding the universe, health, energy, and weather.]
Appetite for compute will continue to grow exponentially, fueled by the need to solve many fundamental and life-changing problems.

Many Challenges to Reach Exascale
- Power efficiency: fit in the data center power budget
- Space efficiency: fit in the available floor space
- Memory technology: feed compute power-efficiently
- Network technology: connect nodes power-efficiently
- Reliability: control the increased probability of failures
- Software: utilize the full capability of the hardware
- And more

Challenge 1: Compute Power
At the system level:
- Today: 33 PF at 18 MW, about 550 pJ/op
- Exaflop: 1000 PF at 20 MW, about 20 pJ/op
Needs improvements in all system components; the processor subsystem needs to come down to ~10 pJ/op.
~28x improvement needed for Exascale by 2018/19.
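The per-operation targets above follow directly from dividing the power budget by the throughput; a quick check of the arithmetic (my own rounding, not from the slides):

\[
\frac{18\ \text{MW}}{33\times 10^{15}\ \text{op/s}} \approx 545\ \text{pJ/op},\qquad
\frac{20\ \text{MW}}{10^{18}\ \text{op/s}} = 20\ \text{pJ/op},\qquad
\frac{545}{20} \approx 28.
\]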

Challenge 2: Memory
Memory bandwidth is fundamental to HPC performance and needs to be balanced with capacity and power.
For a 1 ExaFlop machine: ~200-300 PB/s at ~2-3 pJ/bit.
- GDDR will be out of steam
- Periphery-connected solutions will run into pin-count issues
- Existing technology trends leave a 3-4x gap on pJ/bit
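As a rough check of why 2-3 pJ/bit is the target (a back-of-the-envelope estimate assuming ~250 PB/s of sustained bandwidth, not a figure from the slides):

\[
250\ \text{PB/s}\times 8\ \tfrac{\text{bit}}{\text{B}}\times 2.5\ \tfrac{\text{pJ}}{\text{bit}}
= 2\times 10^{18}\ \tfrac{\text{bit}}{\text{s}}\times 2.5\times 10^{-12}\ \tfrac{\text{J}}{\text{bit}}
\approx 5\ \text{MW},
\]

i.e. memory alone would take roughly a quarter of the 20 MW system budget; at today's ~20 pJ/bit it would take ~40 MW by itself.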

Power: Commonly Held Myths
- General-purpose processors cannot achieve the required efficiencies; special-purpose processors are needed.
- Single-thread performance features and legacy features are too power hungry.
- The IA memory model and cache coherence are too power hungry.
- Caches don't help for HPC; they waste power.

Myth 1: General-purpose processors cannot achieve the required efficiencies; special-purpose processors are needed

Performance/Power Progression
Process roadmap: 32nm (2009), 22nm (2011), with 14nm, 10nm, and 7nm in development (2013*, 2015*, 2017*) and research beyond (2019+).
Moore's Law scaling continues to be alive and well:
- Process: 1.3x-1.4x per generation
- Architecture/microarchitecture: 1.1x-2.0x per generation
- Recurring improvement: 1.4x-3.0x every 2 years
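Multiplying the per-generation factors is what yields the stated two-year cadence (my arithmetic, not an extra claim from the slides):

\[
1.3\times 1.1 \approx 1.4 \quad\text{up to}\quad 1.4\times 2.0 = 2.8 \approx 3.0,
\]

i.e. roughly 1.4x-3.0x combined process plus architecture improvement per generation (~2 years).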

Energy/Op Reduction over Time
[Chart: projected energy per operation over time, showing a remaining gap of ~2.5x.]
With existing techniques the gap reduces to ~2.5x, down from ~30x. Special-purpose processing is not needed to bridge this gap.

Myth 2: Single-thread performance features and legacy features are too power hungry

Typical Core-Level Power Distribution
[Pie chart: core power for a compute-heavy (floating-point) application, broken into Fetch+Decode, OOO+speculation, Integer Execution, FP Execution, Caches, TLBs, Legacy, and Others; FP execution is ~75% of core power.]
- Power is dominated by compute, as should be the case
- OOO/speculation/TLBs: < 10%
- x86 legacy + decode: ~1%

Typical Chip-Level Power Distribution
[Pie chart: chip power split roughly into caches (~40%), uncore (~45%), and core logic (~15%), with the core portion further broken into Fetch+Decode, OOO+speculation, Integer Execution, FP Execution, TLBs, Legacy, and Others.]
- At the chip level, core power is an even smaller portion (~15%)
- x86 support, OOO, and TLBs are ~6% of chip power
- Their benefits outweigh the gains from removing them

Myth 3: The IA memory model and cache coherence are too power hungry

Coherency Power Distribution
[Pie charts: chip power split roughly into core+cache (~60%), memory (~20%), on-die interconnect (~15%), and IO (~5%); interconnect traffic further broken into data transfer, address, snoops/responses, and overhead.]
- Coherency traffic is typically ~4% of total power
- The programming benefits outweigh the power cost

Myth 4: Caches don't help for HPC; they waste power

MPKI in HPC Workloads
[Chart: MPKI across HPC workloads for several cache sizes.]
- Most HPC workloads benefit from caches
- Less than 20 MPKI for 1M-4M caches
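For reference, MPKI is the standard misses-per-kilo-instruction metric (general definition, not specific to these slides):

\[
\text{MPKI} = \frac{\text{cache misses}}{\text{instructions retired}}\times 1000 .
\]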

Caches Save Power
[Chart: relative bandwidth and relative bandwidth per watt for memory, L2 cache, and L1 cache.]
- Caches save power because memory communication is avoided
- Caches are 8x-45x better in BW/Watt than memory
- The power break-even point is around an 11% hit rate (L2 cache)
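The ~11% break-even point is consistent with a simple energy model (my own sketch, using symbols not in the slides): if every access pays the cache lookup energy E_c and only misses additionally pay the memory access energy E_m, the cache saves power when

\[
E_c + (1-h)\,E_m \le E_m \quad\Longleftrightarrow\quad h \ge \frac{E_c}{E_m},
\]

so for a cache roughly 9x more energy-efficient per access than memory (within the 8x-45x range above), the break-even hit rate is about 1/9, i.e. ~11%.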

General purpose processors can achieve Exascale power efficiencies

Memory: Approach Forward
- Significant power is consumed in memory: need to drive ~20 pJ/bit down to 2-3 pJ/bit
- Balancing bandwidth, capacity, and power is a hard problem
- More hierarchical memories
- Progressively more integration: from multi-package memory, to multi-chip package memory, to direct-attach memory on the CPU package

Next Intel Xeon Phi Processor: Knights Landing
- Designed using Intel's cutting-edge 14nm process
- Not bound by offloading bottlenecks: standalone CPU or PCIe coprocessor
- Leadership compute and memory bandwidth
- Integrated on-package memory
All products, computer systems, dates and figures specified are preliminary based on current expectations, and are subject to change without notice.

Benefits of General Purpose Programming
- Familiar SW tools: compilers, libraries, parallel models; new languages/models not required
- Familiar programming models: MPI, OpenMP
- Maintain a single code base: the same SW runs on multi-core CPUs, many-core (Intel MIC Architecture), and multi-core/many-core clusters
- Optimize code just once: optimizations for many cores improve performance on multi-core as well (see the sketch below)
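As a minimal illustration of the single-code-base point, a sketch of an OpenMP loop in C that compiles unchanged for a multi-core Xeon or a many-core Xeon Phi (the function and array names are illustrative, not from the talk):

#include <stddef.h>

/* y[i] += a * x[i]: the same source serves multi-core and many-core targets.
 * OpenMP spreads iterations across however many cores and threads exist,
 * and the compiler can vectorize the loop body for the available SIMD width. */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    #pragma omp parallel for
    for (size_t i = 0; i < n; ++i)
        y[i] += a * x[i];
}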

Future Xeon Phi
- Lots of wide vectors
- Many IA cores
- Lots of IA threads
- Coherent cache hierarchy
- Large on-package high-bandwidth memory in addition to DDR
- Standalone general-purpose CPU: no PCIe overhead

What Does It Mean for Programmers?
Existing CPU SW will work, but effort is needed to prepare SW to utilize Xeon Phi's full compute capability:
- Expose parallelism in programs to use all cores: MPI ranks, threads, Cilk+
- Remove constructs that prevent the compiler from vectorizing
- Block data in caches as much as possible (power efficient); see the sketch below
- Partition data per node to maximize on-package memory usage
Code remains portable, and the optimization improves performance on Xeon processors as well.
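To make the vectorization and cache-blocking advice concrete, a hedged sketch in C (the tile size and names are illustrative assumptions, not from the talk): a blocked matrix multiply written so the innermost loop is unit-stride and dependence-free, which lets the compiler auto-vectorize it, while OpenMP uses all cores.

#include <stddef.h>

#define TILE 64  /* illustrative tile size; tune so the working set fits in cache */

/* C += A * B for n x n row-major matrices. The loops are tiled so each block
 * of A, B, and C stays cache resident; the inner j loop is unit-stride with
 * no dependences or calls, so it can be auto-vectorized. Parallelizing over
 * ii tiles is safe because different tiles write disjoint rows of C. */
void matmul_blocked(size_t n, const double *A, const double *B, double *C)
{
    #pragma omp parallel for
    for (size_t ii = 0; ii < n; ii += TILE)
        for (size_t kk = 0; kk < n; kk += TILE)
            for (size_t jj = 0; jj < n; jj += TILE)
                for (size_t i = ii; i < ii + TILE && i < n; ++i)
                    for (size_t k = kk; k < kk + TILE && k < n; ++k) {
                        double a = A[i * n + k];
                        for (size_t j = jj; j < jj + TILE && j < n; ++j)
                            C[i * n + j] += a * B[k * n + j];
                    }
}

Built with the usual optimization and OpenMP flags, the same source serves multi-core and many-core targets; typically only the tile size and thread count need tuning.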

Summary
- Many challenges to reach Exascale; power is one of them
- General purpose processors will achieve Exascale power efficiencies
- The energy/op trend shows a bridgeable gap of ~2x to Exascale (not 50x)
- General purpose programming allows use of existing tools and programming methods
- Effort is needed to prepare SW to utilize Xeon Phi's full compute capability, but optimized code remains portable for general purpose processors
- More integration over time to reduce power and increase reliability