Kalray MPPA Massively Parallel Processing Array

Similar documents

Going Linux on Massive Multicore

GPU System Architecture. Alan Gray EPCC The University of Edinburgh

Seeking Opportunities for Hardware Acceleration in Big Data Analytics

FPGA Accelerator Virtualization in an OpenPOWER cloud. Fei Chen, Yonghua Lin IBM China Research Lab

OpenPOWER Outlook AXEL KOEHLER SR. SOLUTION ARCHITECT HPC

Accelerating the Data Plane With the TILE-Mx Manycore Processor

Xeon+FPGA Platform for the Data Center

FPGA Acceleration using OpenCL & PCIe Accelerators MEW 25

Parallel Programming Survey

Energy efficient computing on Embedded and Mobile devices. Nikola Rajovic, Nikola Puzovic, Lluis Vilanova, Carlos Villavieja, Alex Ramirez

Emerging storage and HPC technologies to accelerate big data analytics Jerome Gaysse JG Consulting

Data Center and Cloud Computing Market Landscape and Challenges

A Scalable VISC Processor Platform for Modern Client and Cloud Workloads

HETEROGENEOUS HPC, ARCHITECTURE OPTIMIZATION, AND NVLINK

Trends in High-Performance Computing for Power Grid Applications

All Programmable Logic. Hans-Joachim Gelke Institute of Embedded Systems. Zürcher Fachhochschule

PCIe Over Cable Provides Greater Performance for Less Cost for High Performance Computing (HPC) Clusters. from One Stop Systems (OSS)

Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi

Embedded Systems: map to FPGA, GPU, CPU?

Introducing PgOpenCL A New PostgreSQL Procedural Language Unlocking the Power of the GPU! By Tim Child

Exascale Challenges and General Purpose Processors. Avinash Sodani, Ph.D. Chief Architect, Knights Landing Processor Intel Corporation

Cloud Data Center Acceleration 2015

Next Generation GPU Architecture Code-named Fermi

Parallel Firewalls on General-Purpose Graphics Processing Units

Maximize Performance and Scalability of RADIOSS* Structural Analysis Software on Intel Xeon Processor E7 v2 Family-Based Platforms

Graphics Cards and Graphics Processing Units. Ben Johnstone Russ Martin November 15, 2011

LS DYNA Performance Benchmarks and Profiling. January 2009

Introduction to Cloud Computing

Intel Xeon +FPGA Platform for the Data Center

Case Study on Productivity and Performance of GPGPUs

Moving Beyond CPUs in the Cloud: Will FPGAs Sink or Swim?

Multicore Parallel Computing with OpenMP

A Low Latency Library in FPGA Hardware for High Frequency Trading (HFT)

A New, High-Performance, Low-Power, Floating-Point Embedded Processor for Scientific Computing and DSP Applications

FPGA-based MapReduce Framework for Machine Learning

Data Centric Systems (DCS)

OpenSoC Fabric: On-Chip Network Generator

Supercomputing Clusters with RapidIO Interconnect Fabric

Digitale Signalverarbeitung mit FPGA (DSF) Soft Core Prozessor NIOS II Stand Mai Jens Onno Krah

Programming models for heterogeneous computing. Manuel Ujaldón Nvidia CUDA Fellow and A/Prof. Computer Architecture Department University of Malaga

Embedded Development Tools

7a. System-on-chip design and prototyping platforms

QorIQ T4 Family of Processors. Our highest performance processor family. freescale.com

Evaluation of CUDA Fortran for the CFD code Strukti

STLinux Software development environment

Making Multicore Work and Measuring its Benefits. Markus Levy, president EEMBC and Multicore Association

ECLIPSE Performance Benchmarks and Profiling. January 2009

Stream Processing on GPUs Using Distributed Multimedia Middleware

A Case Study - Scaling Legacy Code on Next Generation Platforms

LS-DYNA Best-Practices: Networking, MPI and Parallel File System Effect on LS-DYNA Performance

GPUs for Scientific Computing

Infrastructure Matters: POWER8 vs. Xeon x86

HPC Wales Skills Academy Course Catalogue 2015

Development With ARM DS-5. Mervyn Liu FAE Aug. 2015

Overview on Modern Accelerators and Programming Paradigms Ivan Giro7o

High Performance. CAEA elearning Series. Jonathan G. Dudley, Ph.D. 06/09/ CAE Associates

Petascale Software Challenges. Piyush Chaudhary High Performance Computing

Overview. Lecture 1: an introduction to CUDA. Hardware view. Hardware view. hardware view software view CUDA programming

A Survey on ARM Cortex A Processors. Wei Wang Tanima Dey

Model-based system-on-chip design on Altera and Xilinx platforms

Performance Monitoring of Parallel Scientific Applications

A GPU COMPUTING PLATFORM (SAGA) AND A CFD CODE ON GPU FOR AEROSPACE APPLICATIONS

1 Bull, 2011 Bull Extreme Computing

How To Build An Ark Processor With An Nvidia Gpu And An African Processor

Multi-core Programming System Overview

Networking Virtualization Using FPGAs

Multicore Programming with LabVIEW Technical Resource Guide

How To Build A Cloud Computer

ARM Processors for Computer-On-Modules. Christian Eder Marketing Manager congatec AG

Enabling Technologies for Distributed Computing

HPC Software Requirements to Support an HPC Cluster Supercomputer

A quick tutorial on Intel's Xeon Phi Coprocessor

Big Data Technologies for Ultra-High-Speed Data Transfer and Processing

Next Generation Operating Systems

Best Practises for LabVIEW FPGA Design Flow. uk.ni.com ireland.ni.com

Computer Graphics Hardware An Overview

Agenda. HPC Software Stack. HPC Post-Processing Visualization. Case Study National Scientific Center. European HPC Benchmark Center Montpellier PSSC

Parallel Computing. Benson Muite. benson.

Lecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip.

Intel Application Software Development Tool Suite 2.2 for Intel Atom processor. In-Depth

Attention. restricted to Avnet s X-Fest program and Avnet employees. Any use

Enabling Technologies for Distributed and Cloud Computing

ANALYSIS OF SUPERCOMPUTER DESIGN

Bricata Next Generation Intrusion Prevention System A New, Evolved Breed of Threat Mitigation

CFD Implementation with In-Socket FPGA Accelerators

Debugging in Heterogeneous Environments with TotalView. ECMWF HPC Workshop 30 th October 2014

Supercomputing Status und Trends (Conference Report) Peter Wegner

Chapter 2 Parallel Architecture, Software And Performance

High Performance Computing. Course Notes HPC Fundamentals

FPGA-Accelerated Heterogeneous Hyperscale Server Architecture for Next-Generation Compute Clusters

Multi-Threading Performance on Commodity Multi-Core Processors

Architekturen und Einsatz von FPGAs mit integrierten Prozessor Kernen. Hans-Joachim Gelke Institute of Embedded Systems Professur für Mikroelektronik

FLOW-3D Performance Benchmark and Profiling. September 2012

Experiences With Mobile Processors for Energy Efficient HPC

Cray XT3 Supercomputer Scalable by Design CRAY XT3 DATASHEET

Transcription:

Kalray MPPA Massively Parallel Processing Array Next-Generation Accelerated Computing February 2015 2015 Kalray, Inc. All Rights Reserved February 2015 1

Accelerated Computing 2015 Kalray, Inc. All Rights Reserved February 2015 2

Motivations of Accelerated Computing Computing challenges are performance and energy efficiency Portable electronics (mobile phones, tablets, health) High-performance embedded computing Video compression and transcoding High-speed networking and storage Numerical supercomputing Big data processing Parallel computing increases performance or energy efficiency P = ½ C sw V dd V f + I st V dd + I static V dd N units operating at f/n consume less energy Decreasing f allows to reduce V dd and V N units operating at f compute N times faster Assuming computation is concurrent Other improvements from the specialization of compute units 2015 Kalray, Inc. All Rights Reserved February 2015 3

HPC and Enterprise Applications (Intel) 2015 Kalray, Inc. All Rights Reserved February 2015 4

High Performance Embedded Computing (TI) 2015 Kalray, Inc. All Rights Reserved February 2015 5

Classes of Computing Accelerators Application-Specific Integrated Circuits (ASIC) Most effective accelerators, based on specialization Long design times, not adaptable to evolving applications / standard Field-Programmable Gate Arrays (FPGA) Most effective on bit-level computations Require HDL programming Digital Signal Processors (DSP) Most effective on fixed-point arithmetic Require assembler-level programming Graphics Processing Units (GPU) Most effective on regular computations with dense memory accesses Require massive parallelism and CUDA or OpenCL programming Intel Many Integrated Core (MIC) Require multicore programming + exploitation of SIMD instructions (AVX) 2015 Kalray, Inc. All Rights Reserved February 2015 6

Efficiency of CPUs, DSPs, FPGAs, ASICs 2015 Kalray, Inc. All Rights Reserved February 2015 7

FPGA Acceleration in Datacenters (Nallatech) 2015 Kalray, Inc. All Rights Reserved February 2015 8

GPU and MIC Acceleration in HPC (Top500) 2015 Kalray, Inc. All Rights Reserved February 2015 9

DSP vs GPU and MIC Acceleration (TI 2012) 2015 Kalray, Inc. All Rights Reserved February 2015 10

Energy Efficiency Drives Locality (NVIDIA) Communications take the bulk of power consumption Architectures should expose a memory hierarchy to the programmer 2015 Kalray, Inc. All Rights Reserved February 2015 11

3D Memory: 5x Bandwidth of DDR4 Micron: Hybrid Memory Cube (HMC) SK hynix: High Bandwidth Memory (HBM) Tezzaron: 3DDRAM memory 2015 Kalray, Inc. All Rights Reserved February 2015 12

MIC Knights Landing (Intel 2015Q2) Xeon-Phi (MIC) accelerator, 60+ cores, 2D mesh fabric System in package with Micron HMC / low-power interface DDR4 external memory 2015 Kalray, Inc. All Rights Reserved February 2015 13

Abstract Machine Model of an Exascale Node «Abstract Machine Models and Proxy Architectures for Exascale Computing» 2014 report from Sandia and Lawrence Berkeley National Laboratories 2015 Kalray, Inc. All Rights Reserved February 2015 14

3D Stacked Memory-Side Acceleration (2014) Near-data computing reduces the overhead on performance and energy consumption incurred by the data movement between memory and computational elements Acceleration logic should be co-located with the 3D memory stack Best applied to Micron HMC type of 3D memory, where a logic layer exists for packet switching between the host and the memory chips 2015 Kalray, Inc. All Rights Reserved February 2015 15

Xilinx Zinq UltraScale MPSoC Architecture Five Processing Bottlenecks Digital Signal Processing Graphics processing Network processing Real-time processing General computing Scalable from 32 to 64 bits Coherent interconnect and memory FPGA ASIC-class logic fabric Extreme real-time performance Multi-level security and safety Enhanced anti-tamper features Trust and information assurance Advanced power management High-level design abstractions Based on C, C++, OpenCL 2015 Kalray, Inc. All Rights Reserved February 2015 16

TI KeyStone-II 66AK2H12 SoC 15-20W typical 4x ARM A15 1.2 GHz 4MB L2 shared by cores 19.2/9.6 GFLOPS SP/SP 8x TI C66x DSP 1.2GHz 1MB L2 private per core 160/60 GFLOPS SP/DP Shared memory 6MB on-chip 2x DDR3 Peripherals 4x 1G Ethernet 4x SRIO 2.1 2x PCIe 2015 Kalray, Inc. All Rights Reserved February 2015 17

Altera FPGA OpenCL Tool Chain The OpenCL host program is a pure C/C++ x86 software routine The OpenCL kernel code executes in parallel on the FPGA 2015 Kalray, Inc. All Rights Reserved February 2015 18

Xilinx FPGA C/C++/OpenCL Tool Chain 2015 Kalray, Inc. All Rights Reserved February 2015 19

TI KeyStone-II OpenCL/OpenMP Tool Chain Classic OpenCL The OpenCL host program is a pure C/C++ ARM software routine The OpenCL kernel code executes in parallel on multiple DSPs OpenMP4 accelerator model Directive-based offloading from ARM execution to multi-core DSP 2015 Kalray, Inc. All Rights Reserved February 2015 20

Kalray MPPA Technology 2015 Kalray, Inc. All Rights Reserved February 2015 21

Kalray MPPA Acceleration Highlights DSP type of acceleration Energy efficiency Timing predictability Software programmability CPU ease of programming C/C++ GNU programming environment 32-bit or 64-bit addresses, little-endian Rich operating system environment Integrated multi-core processor 256 application cores on a chip High-performance low-latency I/O Massively parallel computing MPPA processors can be tiled together Acceleration benefits from thousands of cores 2015 Kalray, Inc. All Rights Reserved February 2015 22

Kalray MPPA, a Global Solution Powerful, Low Power and Programmable Processors Core IP for programmable processor array based systems. C/C++ based Software Development Kit for massively parallel programing Development platform Reference Design Board Reference Design board Application specific boards Multi-MPPA or Single-MPPA board 2015 Kalray, Inc. All Rights Reserved February 2015 23

MPPA -256 Andey Processor in CMOS 28nm 400MHz 211 SP GFLOPS 70 DP FLOPS 8W to 15W Production since 2013 in TSMC High processing performance Very low power consumption High predictability & low latency High-level software development Architecture and software scalability 2015 Kalray, Inc. All Rights Reserved February 2015 24

Kalray MPPA Processor Roadmap 2014 2015 2016 2017 MPPA -1024 MPPA -256 Volume Samples Volume Samples Volume MPPA -64 Kalray 1 st Andey generation 211 GFLOPS SP 70 GFLOPS DP 32bit architecture 400Mhz 28nm Bostan Kalray 2 nd generation 845 GFLOPS SP 422 GFLOPS DP 64bit architecture 800Mhz 16/14nm Coolidge Kalray 3 rd generation 1056 GFLOPS SP 527 GFLOPS DP 64bit architecture 1000Mhz Production Development Under Specification 2015 Kalray, Inc. All Rights Reserved February 2015 25

Supercomputer Clustered Architecture IBM BlueGene series, Cray XT series Compute nodes with multiple cores and shared memory I/O nodes with high-speed devices and a Linux operating system Dedicated networks between the compute nodes and the I/O nodes 2015 Kalray, Inc. All Rights Reserved February 2015 26

MPPA -256 Bostan Processor in CMOS 28nm 64-bit VLIW cores Ethernet + PCIe subsystem crypto acceleration 288 64-bit VLIW cores 600MHz / 800MHz (overdrive) Floating-point performance Simple 845 GFLOPs @800 MHz Double 422 GFLOPs @800 MHz Network on Chip (NoC) bandwidth East West 2x 12,8 GB/s @800 MHz North South 2x 12,8 GB/s @800 MHz Ethernet processing throughput Bits 2 x 40 Gbps Packets 2 x 60 MPPs @ 64B Mission critical support Time predictability Functional safety 2015 Kalray, Inc. All Rights Reserved February 2015 27

Quad MPPA -256 TURBOCARD Architecture 1024 Kalray application cores and 8 DDR3-1600 under 80W TURBOCARD Board Manycore Processor Compute Tile VLIW Core 4 MPPA-256 processors 8 DDR3 SODIM PCIe Gen3 16x to host NoCX between MPPA 16 compute tiles 4 I/O tiles with quad-core SMP and DDR memory 2 Networks-on-Chip (NoC) 16 PE cores + 1 RM core NoC Tx and Rx interfaces Debug Support Unit (DSU) 2 MB of shared memory 5-issue VLIW architecture Fully timing compositional 32-bit/64-bit IEEE 754 FPU MMU for rich OS support 2015 Kalray, Inc. All Rights Reserved February 2015 28

Kalray TURBOCARD Family Kalray TURBOCARD2 (2014) 4 Andey MPPA -256 processors 0.83 TFLOPS SP / 0.28 TFLOPS DP 8x DDR3L @1233 MT/s => 79 GB/s First engineering samples: Q4-14 Volume Production: Q1-15 Kalray TURBOCARD3 (2015) 4 Bostan MPPA -256 processors 2.5 TFLOPS SP / 1.25 TFLOPS DP 8x DDR3 @ 2133 MT/s => 136 GB/s First engineering samples: Q2-15 Volume Production: Q3-15 2015 Kalray, Inc. All Rights Reserved February 2015 29

MPPA Software Stack Communication Libraries (RDMA, MPI, POSIX, APEX) GNU C, C++, Fortran OpenCL C Language Libraries POSIX threads + OpenMP Linux RTOS Low-level Hypervisor programming (Exokernel) interfaces MPPA platforms OpenCL runtime 2015 Kalray, Inc. All Rights Reserved February 2015 30

Compilers, Debuggers & System Trace Compilers GCC version 4.9 C89, C99, C++ and Fortran GNU Binary utilities 2011 (assembler, linker, objdump, objcopy, etc.) C libraries: uclibc, Newlib + F.P. tuning Linear Assembly Optimizer (LAO) for VLIW Optimizations Debuggers GDB version 7.3 Each core seen as a thread Watchpoint & breakpoint System trace Linux Trace Toolkit NG based Low-latency trace points On-demand activation Eclipse-based system trace viewer Deep insight of execution at system level 2015 Kalray, Inc. All Rights Reserved February 2015 31

OpenCL Programming Environment MPPA OpenCL global memory implemented as software DSM DDR Host Memory Processing Element Compute Unit Compute Device VLIW Core Compute Tile MPPA Processor 2015 Kalray, Inc. All Rights Reserved February 2015 32

MPPA OpenCL Roadmap Mega OpenCL Device The MPPA OpenCL global memory is implemented by software DSM This implementation allows to pool several DDR memory controllers A single OpenCL device can be composed of several MPPA processors connected by their NoC extensions For the Kalray TURBOCARD with 4 MPPA processors, the OpenCL device will offer 1K VLIW cores and 64GB of DDR global memory Time-Critical OpenCL The OpenCL host API can be hosted on one I/O subsystem quad-core The OpenCL kernels can be dispatched on the same MPPA processor The OpenCL run-time can be implemented with controlled worst-cases This enables time-critical OpenCL computing 2015 Kalray, Inc. All Rights Reserved February 2015 33

MPPA Manycore Computing Energy Efficiency 5-issue 32-bit / 64-bit VLIW core Optimum instruction pipelining Shallow memory hierarchy Software cache coherence Performance Scalability Linear scaling with number of cores Network on chip extension (NoCX) 2x DDR controllers per processor 24x 10 Gb/s lanes per processor Execution Predictability Fully timing compositional cores Multi-banked parallel local memory Core-private busses to local memory Network on chip guaranteed services Configurable DDR address mapping Ease of Use Standard GCC C/C++/Fortran Command-line and Eclipse IDE Full featured debug environment System trace based on LTTNG Linux + RTOS operating systems 2015 Kalray, Inc. All Rights Reserved February 2015 34