MAXware: acceleration in HPC. R. Dimond, M. J. Flynn, O. Mencer and O. Pell, Maxeler Technologies. Contact: flynn@maxeler.com


MAXware: acceleration in HPC R. Dimond, M. J. Flynn, O. Mencer and O. Pell Maxeler Technologies contact: flynn@maxeler.com

HPC: the case for accelerators
Recent HPC systems built from commodity processors optimize latency, not throughput, and multi-core / multi-threaded designs are frequently limited by the memory hierarchy. Structured arrays (FPGAs, GPUs, hyper-core or Cell dies) offer complementary throughput acceleration that avoids the memory bottleneck: an array can provide a 10-100x improvement in arithmetic (SIMD) or in memory bandwidth (by streaming, i.e. MISD).

Design space for large applications
Some distinctions:
- Memory limited vs. compute limited
- Static vs. dynamic control structure
- Homogeneous vs. core+accelerator at the node

Accelerate tasks by streaming
MISD structured computation: stream computations across a long array before storing the results in memory. This can achieve a 100x improvement in the use of memory.
[Figure: data flows from node memory through computation #1 and computation #2 on the structured array, with intermediate results buffered on chip, before the results return to memory.]
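As a rough software analogy (not Maxeler's actual API), the MISD streaming idea can be sketched with chained Python generators: each stage consumes the previous stage's output one element at a time, so only per-element intermediates are buffered and the full intermediate array never touches memory. The stage operations (`computation1`, `computation2`) are hypothetical placeholders.

```python
def computation1(stream):
    """First pipeline stage: scale each sample (hypothetical operation)."""
    for x in stream:
        yield 2.0 * x

def computation2(stream):
    """Second pipeline stage: running sum of its inputs (hypothetical)."""
    acc = 0.0
    for x in stream:
        acc += x
        yield acc

def stream_pipeline(data):
    # Chain the stages: data flows from node memory through both
    # computations before results are stored back; intermediates are
    # buffered one element at a time, never written to memory.
    return list(computation2(computation1(data)))

print(stream_pipeline([1.0, 2.0, 3.0]))  # [2.0, 6.0, 12.0]
```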

FPGA acceleration
- One-tenth the frequency, but 10^5 cells per die: the magnitude of parallelism overcomes the frequency limitation.
- Stream data across the large cell array, minimizing memory bandwidth.
- Customized data representations, e.g. 17-bit floating point: always just enough precision.
- A software-(re)configurable technology.
- An in-depth application study is needed to realize acceleration; acceleration is not automatic, and it requires more programming effort.
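To illustrate what "just enough precision" means, here is a hedged software sketch (not Maxeler's number format) of rounding a value to a reduced number of mantissa bits, in the spirit of a 17-bit FP type:

```python
import math

def quantize(x, mantissa_bits):
    """Round x to a float with the given number of significand bits.
    A software illustration of reduced-precision FPGA arithmetic."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)             # x = m * 2**e, with 0.5 <= |m| < 1
    scale = 1 << mantissa_bits
    m = round(m * scale) / scale     # keep only mantissa_bits of fraction
    return math.ldexp(m, e)

# 53 bits (full double significand) preserves pi exactly; 8 bits do not.
print(quantize(math.pi, 53) == math.pi)   # True
print(quantize(math.pi, 8))               # 3.140625, a coarse pi
```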

A different programming model
- A cylindrical rather than a layered model suits static applications.
- A streaming (MISD) computational model suits memory-limited and (often) compute-limited applications.

Cylindrical Model
[Figure: the cylinder spans from the high level (choices) down to the low level (no choice); the critical kernel crosses from software on the CPU into hardware.]

Cylindrical Model
- High-level choices: application structure, algorithms, types of parallelism, data representations; this is where to optimize.
- Low-level choices: communications, routing, timing, gate delays.
- Critical kernel: first accelerate the (initially small) kernel; later, extend the kernel's size.

Cylindrical Model
The programmer sees the kernel application through a cylinder that spans high- and low-level design decisions. The model applies to (1) large, (2) static applications with (3) an identifiable critical section. Speedup is achieved by understanding the whole process, from the application down to the gates. After speeding up the kernel, enlarge the application scope. Speedup requires effort.

Speedup with the cylindrical model
- Transform the application to execute multiple simultaneous strips using DRAM pipes.
- Stream computations through each pipe; the strip size is limited by FPGA area and DRAM bandwidth.
- Application-specific data precision multiplies both the effective FPGA area and the effective DRAM bandwidth.

MAXware: acceleration support
- Acceleration methodology
- Application tool set
- Acceleration hardware (MAX2 board)

Acceleration methodology
Four stages:
1. Analysis of the program (static and dynamic).
2. Transformation: create the dataflow graph, the data layout and the data representation.
3. Partitioning: optimize the dataflow, data access and data-representation options.
4. Implementation: use the Maxeler stream compiler to configure the application for the FPGA and set up the run-time interfaces.

Maxeler tool set
- Parton profiles the running application and finds hot spots and their suitability for the FPGA; the key metric is computations per memory access.
- The performance modeler estimates the execution time of alternative FPGA designs, accurate to 10%.
- MaxelerOS provides support for the MAX2 board.
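The "computations per memory access" metric that Parton reports is essentially arithmetic intensity. A hedged sketch of how such a metric might be computed from profile counts (the example counts are made up for illustration):

```python
def arithmetic_intensity(flops, memory_accesses):
    """Computations per memory access: a high value suggests a kernel is
    compute-bound and a good candidate for FPGA streaming; a low value
    suggests it is memory-bound."""
    return flops / memory_accesses

# Hypothetical profile counts for two kernels:
stencil = arithmetic_intensity(flops=40, memory_accesses=4)   # 10.0
copy    = arithmetic_intensity(flops=1,  memory_accesses=2)   # 0.5
print(stencil, copy)
```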

MAX2 board

MAX2 board
- 2x Virtex-5 LX330T, with a triple-channel interface to 24GB of DDR2 memory and a PCI Express x16 link to the host.
- 8GB/s inter-FPGA link; total off-chip bandwidth of 47GB/s.
- MaxelerOS provides Linux device drivers and manages all the PCI and DRAM plumbing.

Example: seismic data processing
- For oil & gas exploration: distribute a grid of sensors over a large area.
- Sonic-impulse the area and record the reflections: frequency, amplitude and delay at each sensor.
- A sea-based survey uses 30,000 sensors to record data (120dB dynamic range), each sampled at more than 2kbps, with a new sonic impulse every 10 seconds.
- That is on the order of terabytes of data each day.
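A back-of-the-envelope check of the daily data volume. The per-sensor rate of roughly 2 kB/s is our assumption (the slide says "more than 2kbps", which could also mean bits); either reading lands in the terabytes-per-day range the slide claims.

```python
sensors = 30_000
bytes_per_sensor_per_sec = 2_000      # assumed ~2 kB/s per sensor
seconds_per_day = 24 * 60 * 60

total_bytes_per_day = sensors * bytes_per_sensor_per_sec * seconds_per_day
print(f"{total_bytes_per_day / 1e12:.1f} TB/day")  # ~5.2 TB/day
```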

Seismic data processing: there's a lot of data to be processed
- The data can be interpreted: frequency and amplitude indicate structure; delay indicates depth (the z axis).
- Process the data to determine the location of structures of interest.
- There are many different ways to process it. One option: simulate the wave propagation from source to receivers and compare it to the recorded data.

Computations per output point on an Intel Xeon

Kernel             Cycles  FLOPs  Other Ops  L1 Cache Miss Rate  CPI
2nd order X pass    11%    40.2    72.3       0.2%               0.6
2nd order Y pass    15%    40.2    72.3       3.8%               0.8
2nd order Z pass    21%    40.2    72.4       7.3%               1.1
Vector add           1%     1.0     9.0       1.3%               0.9
4th order X pass    10%    40.2    64.7       0.2%               0.6
4th order Y pass    15%    40.2    64.7       4.0%               0.9
4th order Z pass    21%    39.6    65.3       7.8%               1.2
Update pressure      1%     1.9     9.9       1.0%               0.7
Boundary sponge      5%     0.8     4.0       5.6%               5.8

Computations
- On average, about one data-cache miss per 10 floating-point ops.
- The Xeon achieves about 1.0 CPI, so it starts with a 20x frequency advantage over an FPGA-based computation.
- BUT the FPGA uses massive parallelism to significant advantage.

Streaming solution (FPGA)
- Convolve 4 input points (strips) simultaneously (per FPGA); buffer the intermediate results; forward the 4 outputs to the next pipeline stage.
- Continue pipelining (streaming) until the silicon runs out (468 stages).
- Size the floating point so that there is just enough range and precision.
O. Pell, T. Nemeth, J. Stefani and R. Ergas, "Design Space Analysis for the Acoustic Wave Equation Implementation on FPGA Circuits", European Association of Geoscientists and Engineers (EAGE) Conference, Rome, June 2008.


[Figure: the dataflow graph, implemented in 468 pipeline stages; four inputs and a constant stream through to four outputs. 4 input x 4 output points; no cache misses; no memory accesses.]
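A software sketch of the streaming idea (the real design is a 468-stage hardware pipeline over the acoustic wave equation; this hypothetical Python version just shows the dataflow shape): each "stage" applies a small convolution and forwards its outputs straight to the next stage, so no intermediate arrays are ever stored to memory. The 3-point stencil and its coefficients are made up for illustration.

```python
def stencil_stage(stream, coeffs=(0.25, 0.5, 0.25)):
    """One pipeline stage: a 3-point convolution over a sample stream.
    Intermediates live in a small shift-register-like window, mirroring
    the on-chip buffering of the FPGA pipeline."""
    window = []
    for x in stream:
        window.append(x)
        if len(window) == 3:
            yield coeffs[0]*window[0] + coeffs[1]*window[1] + coeffs[2]*window[2]
            window.pop(0)

def pipeline(data, stages=4):
    # Chain stages back to back, as the 468-stage design does in silicon.
    stream = iter(data)
    for _ in range(stages):
        stream = stencil_stage(stream)
    return list(stream)

# An impulse spreads out a little more at each stage:
print(pipeline([0, 0, 1, 0, 0, 0, 0, 0, 0, 0], stages=2))
```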

Achieving speedup > 100x
- Stream the computation through the 468-stage pipeline.
- Execute 8 points simultaneously.
- Eliminate cache misses; eliminate overhead operations (load, store, branch, ...).
- So: 8 (points processed) x 468 (stages) x 2 (overhead ops) / 20 (frequency penalty) = 374 (maximum possible speedup).
- Operate at one-twentieth the frequency, reducing power and space.
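The speedup bound on the slide is straightforward to check, using the factors exactly as stated there:

```python
points_per_cycle = 8     # points processed simultaneously
pipeline_stages = 468    # depth of the streaming pipeline
overhead_factor = 2      # eliminated overhead ops (load, store, branch)
frequency_penalty = 20   # FPGA runs at ~1/20th of the Xeon clock

max_speedup = points_per_cycle * pipeline_stages * overhead_factor / frequency_penalty
print(max_speedup)  # 374.4, i.e. ~374x maximum possible speedup
```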

So how typical is this?
Other application areas (ongoing studies): financial simulation (Monte Carlo based derivative pricing), image processing, CFD.
Kernels, referenced to a Xeon:
- Gaussian random number generation: 300x
- 3D finite difference: 160x
- Seismic offset gathers: 170x
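Gaussian random number generation (the 300x kernel above) maps well to streaming hardware because every output needs the same fixed feed-forward sequence of operations. A hedged software sketch using the Box-Muller transform, one common method (the slide does not say which algorithm Maxeler implemented):

```python
import math
import random

def box_muller(u1, u2):
    """Turn two uniform (0,1] samples into two independent N(0,1) samples.
    The fixed log/sqrt/cos/sin dataflow per output is what pipelines well."""
    r = math.sqrt(-2.0 * math.log(u1))
    return r * math.cos(2.0 * math.pi * u2), r * math.sin(2.0 * math.pi * u2)

def gaussian_stream(n, seed=0):
    rng = random.Random(seed)
    out = []
    for _ in range(n // 2):
        # 1.0 - random() lies in (0, 1], avoiding log(0)
        out.extend(box_muller(1.0 - rng.random(), rng.random()))
    return out

samples = gaussian_stream(10_000)
mean = sum(samples) / len(samples)
print(f"mean ~ {mean:.3f}")  # close to 0 for a standard normal
```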

So how can an emulation (FPGA) be better than the x86 processor(s)?
- Lack of robustness in streaming hardware (spanning area, time, power).
- Lack of a robust parallel-software methodology and tools.
- Effort (and support tools) enables large dataflows with an excellent A x T x P (area x time x power) product.
- Effort (and support tools) provides significant application speedup; there is no free lunch in parallel programming.

Summary
- Acceleration-based speedup using structured arrays.
- More attention to memory and/or arithmetic: data representation, streaming and RAM.
- Lower power through non-aggressive use of frequency.
- Programming uses the cylindrical model, but speedup requires a lot of low-level program optimization.
- Good tools are golden.