CFD Implementation with In-Socket FPGA Accelerators




Ivan Gonzalez
UAM Team at DOVRES
FuSim-E Programme Symposium: CFD on Future Architectures
C²A²S²E, DLR Braunschweig, 14th-15th October 2009

Outline
- Project Goal
- FPGA Design Methodology
- In-Socket Accelerators
  - XtremeData XD2000i
- High Level Languages
  - Mitrionics SDK
- Euler 1D Implementation
- Conclusions and Future Work

Project Goal
The main goal is to reduce the design time of airplanes by acting at two stages of the design process:
- First, by providing guidelines on how to improve mathematical methods so that they take advantage of parallel hardware
- Second, by using reconfigurable computing platforms to significantly accelerate the execution of CFD algorithms

FPGA Design Methodology
- This methodology is completely different from that of other acceleration technologies
- It involves both hardware and software design
- Knowledge and expertise in HW design, SW programming and HW/SW codesign is a must
- Hardware: design the custom hardware
  - Imagine that Intel develops a chip for your special needs
  - This step is critical, complex and requires a lot of time
- Software: program the custom hardware
  - Similar to other acceleration technologies
  - But requires implementing custom APIs

In-Socket Accelerators
- There are several FPGA-based solutions: reconfigurable computers, PCI-e boards, In-Socket Accelerators (ISAs), etc.
- In-Socket Accelerators: tightly coupling FPGAs with x86 processors
  - The FPGA is located at the same level as the microprocessor
  - Access to host memory
  - Local memory
  - Custom coprocessors
  - Reconfigurable logic
  - DSP units
  - Internal memory
[Figure: XtremeData XD2000i (Intel FSB compatible); labels: 8 MB QDRII+ SRAM, reconfigurable HW]

High Level Languages (HLLs)
- Traditional Hardware Description Languages (HDLs) such as VHDL or Verilog
  - Long development cycle
  - Better performance
  - Only for HW experts
  - Poor productivity, where P = Performance / Dev. Time
- A new approach is coming to FPGA development: HLLs
  - Minimize the development time
  - Increase productivity (reduce dev. time)
  - Make the FPGA easy to use for non-expert users
- Some examples:
  - From Matlab: DSPLogic
  - From C: Codeveloper Impulse C
  - New languages: Mitrion SDK (similar to C)
- We are evaluating Mitrion and Impulse
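The productivity metric P = Performance / Dev. Time can be made concrete with a small sketch. The 2-week design time and the factor-of-10 VHDL advantage are taken from later slides of this talk; the 40-week HDL development time is a hypothetical placeholder, not a measured value.

```python
def productivity(performance: float, dev_time: float) -> float:
    """P = Performance / Dev. Time, as defined on the slide."""
    if dev_time <= 0:
        raise ValueError("dev_time must be positive")
    return performance / dev_time

# Relative performance 1.0 in 2 weeks (Mitrion-C design time reported in the talk).
hll = productivity(performance=1.0, dev_time=2.0)

# 10x the performance (claimed for VHDL in the conclusions) in a
# HYPOTHETICAL 40 weeks of development; the talk gives no HDL figure.
hdl = productivity(performance=10.0, dev_time=40.0)

print(f"HLL productivity: {hll:.2f} per week")
print(f"HDL productivity: {hdl:.2f} per week")
```

Under these assumed numbers the HLL flow comes out ahead, but the ranking depends entirely on the development time assumed for the HDL flow.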

High Level Languages (HLLs): Mitrionics
- Mitrion-C
  - A language specifically intended for developing applications on FPGAs
  - Single-assignment dataflow language with native support for wide (vector) and deep (pipeline) parallelism
- Mitrion Virtual Processor (MVP)
  - A fine-grained, massively parallel processor
  - Runs software written in the Mitrion-C programming language in FPGAs
  - Has a unique architecture that lets it be adapted to each program it is running in order to maximize performance
  - The MVP performs thousands of operations simultaneously by allocating multiple computational units for each instruction

Euler 1D Implementation: Testbed
DOVRES-UAM cluster
- Two compute nodes
- Intel Xeon Quad-core 5408, 2.13 GHz @ 1066 MHz FSB
- One XtremeData ISA
- One GPU Tesla C1060
- 32 GB memory
- Infiniband 4x DDR dual port

Euler 1D Implementation: XtremeData ISA Bandwidth Analysis
Streaming Transfer Test
- Study the bandwidth between the FPGA and the host memory
- The FPGA moves data from/to host memory doing simultaneous reads and writes
  - Overlapping communication and computation
- Using Mitrion-C
- A simple loopback is implemented in the FPGA
[Figure: host memory connected to the FPGA]

Euler 1D Implementation: XtremeData ISA Bandwidth Analysis
Current ISA BW numbers:
- 2 GB/s Host to Bridge
- 1 GB/s Bridge to Host
- 1 GB/s Bridge to Host and Host to Bridge
Future ISA BW numbers:
- 3.5 GB/s Host to Bridge
- 2.5 GB/s Bridge to Host
- 1.5 GB/s Bridge to Host and Host to Bridge
Streaming Transfer Test Results
[Plot not reproduced; annotation: data packets larger than 1 MB]
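To get a feel for what these bandwidth figures mean for a streamed grid, the following sketch estimates the time needed to move a grid once through the FPGA at the bidirectional rates quoted above. It assumes decimal gigabytes, 4 bytes per grid point (one single-precision value) and no per-packet overhead; none of these assumptions are stated in the talk, and the function name is illustrative.

```python
GB = 1e9  # assumption: 1 GB/s = 1e9 bytes per second

def transfer_time_s(n_points: int,
                    bytes_per_point: int = 4,
                    bandwidth_gb_s: float = 1.0) -> float:
    """Time in seconds to stream n_points at a sustained bandwidth.

    Ignores latency and packet overhead, so it is a lower bound that is
    only meaningful for large transfers.
    """
    return n_points * bytes_per_point / (bandwidth_gb_s * GB)

# One million points at the current and future bidirectional rates.
current = transfer_time_s(1_000_000, bandwidth_gb_s=1.0)
future = transfer_time_s(1_000_000, bandwidth_gb_s=1.5)
print(f"current: {current * 1e3:.2f} ms, future: {future * 1e3:.2f} ms")
```

Because the model ignores fixed costs per transfer, it says nothing about small packets, which is the regime the talk identifies as the bottleneck.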

Euler 1D Implementation: Results
- Full implementation of the Euler 1D algorithm
- FPGA-adapted version
- One FPGA
- Single precision (float)
- Mitrion SDK
- Design time: 2 weeks
- FPGA logic utilization: 67%
- Build time: 2.5 hours!!
Build log (Sep 14 2009, Euler1D.mitc, Quartus II 8.1.163):
  Started synthesis [10:47]
  Started place&route [11:26]
  FIT reported 67% Logic utilization
  FIT reported 117,356/203,520 (58%) dedicated registers
  FIT reported 667,011/15,040,512 (4%) block memory bits
  FIT reported 132/768 (17%) 18-bit DSP elements
  Creating device programming image [13:12]
  Running timing analysis [13:14]
  Finished [13:19]
  SPR success!
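The figures in the build log can be cross-checked with a few lines of arithmetic. The sketch below recomputes the resource percentages and the elapsed build time from the values printed in the log; the helper name `pct` is illustrative.

```python
from datetime import datetime

def pct(used: int, total: int) -> int:
    """Resource utilization as a whole-number percentage."""
    return round(100 * used / total)

# Values copied from the build log on the slide.
registers = pct(117_356, 203_520)
block_memory_bits = pct(667_011, 15_040_512)
dsp_elements = pct(132, 768)

# Wall-clock time from "Started synthesis" to "Finished".
start = datetime.strptime("10:47", "%H:%M")
end = datetime.strptime("13:19", "%H:%M")
elapsed_hours = (end - start).total_seconds() / 3600

print(registers, block_memory_bits, dsp_elements)
print(f"build time: {elapsed_hours:.2f} h")
```

The recomputed percentages match the log, and the elapsed time of 2 h 32 min is consistent with the "2.5 hours" quoted on the slide.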

Euler 1D Implementation: How does it work?
- Software: one big instruction
  - Euler1D(float *grid, uint grid_size, uint n_iterations)
- Hardware: Euler 1D processing unit
  - The FPGA will process the complete grid in each iteration
  - Supports any grid size (streaming approach)
- Streaming approach
  - Simultaneous reads and writes allow us to overlap communication and computation
  - The memory-FPGA bandwidth is the key to good performance
[Figure: host memory connected to the FPGA]
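As a software reference for the "one big instruction" idea, the sketch below sweeps the complete grid once per iteration, mirroring the shape of the Euler1D call. The talk does not say which numerical scheme the processing unit implements; the Lax-Friedrichs update, the ideal-gas constant GAMMA, the periodic boundaries and the dt_dx parameter are all assumptions made for illustration.

```python
GAMMA = 1.4  # assumption: ideal gas with the ratio of specific heats of air

def flux(u):
    """Euler flux for one cell; u = (density, momentum, total energy)."""
    rho, mom, energy = u
    v = mom / rho
    p = (GAMMA - 1.0) * (energy - 0.5 * rho * v * v)
    return (mom, mom * v + p, (energy + p) * v)

def euler1d(grid, n_iterations, dt_dx):
    """Return the grid after n_iterations Lax-Friedrichs steps.

    grid is a list of (rho, rho*v, E) tuples. Each iteration reads the
    whole grid and writes a new one, like a streaming pass over the data.
    Boundaries are periodic (assumption). dt_dx = dt / dx must respect
    the CFL limit for the chosen state.
    """
    n = len(grid)
    for _ in range(n_iterations):
        f = [flux(u) for u in grid]
        new_grid = []
        for i in range(n):
            left, right = (i - 1) % n, (i + 1) % n
            new_grid.append(tuple(
                0.5 * (grid[left][k] + grid[right][k])
                - 0.5 * dt_dx * (f[right][k] - f[left][k])
                for k in range(3)
            ))
        grid = new_grid
    return grid

# Usage: a small density and energy bump on a uniform gas at rest.
initial = [(1.0, 0.0, 2.5)] * 16
initial[8] = (1.2, 0.0, 3.0)
result = euler1d(initial, n_iterations=10, dt_dx=0.1)
```

Each output cell depends only on its two neighbours from the previous iteration, which is the property that makes a streaming, fully pipelined hardware pass over the grid plausible.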

Euler 1D Implementation: SpeedUp
[Plot: speedup versus grid size (number of points)]
- FPGA clock is 100 MHz
- Low bandwidth
- Euler 1D PU has only 1 core

Conclusions
- FPGA technology offers promising results for accelerating CFD algorithms
- It is necessary to increase the bandwidth for small data packets
- A large speedup (7x) is obtained when computation time is larger than communication time
  - And the FPGA is working at 100 MHz!!!
- The current design can be improved
  - More than one Euler 1D processing unit per FPGA
    - This will require using fixed-point arithmetic
  - There is another FPGA available in the current ISA
  - Local memory of the FPGA can be used to store small grids
  - New ISA BW numbers double the current ones
- A VHDL solution can easily increase performance by a factor of 10 over a Mitrion-C implementation
  - But the development time will increase too
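The link between overlap and speedup can be illustrated with a deliberately simple timing model. It assumes that perfectly overlapped communication and computation cost the larger of the two times, and that without overlap the two add up. The timings in the example are hypothetical and were chosen only so that the overlapped case reproduces a 7x ratio; they are not measurements from the talk.

```python
def fpga_time(t_comm: float, t_comp: float, overlap: bool = True) -> float:
    """Accelerator time under a simplified model.

    With perfect overlap the slower of the two activities dominates;
    without overlap they run back to back.
    """
    return max(t_comm, t_comp) if overlap else t_comm + t_comp

def speedup(t_cpu: float, t_comm: float, t_comp: float,
            overlap: bool = True) -> float:
    """Ratio of CPU time to accelerator time."""
    return t_cpu / fpga_time(t_comm, t_comp, overlap)

# HYPOTHETICAL timings in seconds.
print(speedup(35.0, t_comm=2.0, t_comp=5.0))                 # overlapped
print(speedup(35.0, t_comm=2.0, t_comp=5.0, overlap=False))  # serialized
```

In this model, once computation dominates, extra bandwidth no longer helps and further gains have to come from more processing units or a higher clock, which matches the improvements listed in the conclusions.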

Future Work
- Continue working on Euler 1D for testing purposes
  - Improve the current software to solve some issues related to overlapping communication and computation
  - Apply the hardware improvements described before: two Euler 1D units per FPGA, use the second FPGA, use the local memory of the FPGA, etc.
- Study a multi-node approach
  - Identify how communication between nodes can affect the performance
  - The DOVRES-UAM cluster is equipped with DDR Infiniband
- Use a new tool: Codeveloper Impulse-C
  - A different approach from the Mitrion SDK
- Currently working on an Euler 2D version

Questions?