FPGA-Accelerated Heterogeneous Hyperscale Server Architecture for Next-Generation Compute Clusters



Similar documents
All Programmable Logic. Hans-Joachim Gelke Institute of Embedded Systems. Zürcher Fachhochschule

Architekturen und Einsatz von FPGAs mit integrierten Prozessor Kernen. Hans-Joachim Gelke Institute of Embedded Systems Professur für Mikroelektronik

ARM Processors for Computer-On-Modules. Christian Eder Marketing Manager congatec AG

SABRE Lite Development Kit

FPGA Acceleration using OpenCL & PCIe Accelerators MEW 25

Seeking Opportunities for Hardware Acceleration in Big Data Analytics

Accelerate Cloud Computing with the Xilinx Zynq SoC

HETEROGENEOUS HPC, ARCHITECTURE OPTIMIZATION, AND NVLINK

GPU System Architecture. Alan Gray EPCC The University of Edinburgh

Standardization with ARM on COM Qseven. Zeljko Loncaric, Marketing engineer congatec

Zynq SATA Storage Extension (Zynq SSE) - NAS. Technical Brief from Missing Link Electronics:

PCIe Over Cable Provides Greater Performance for Less Cost for High Performance Computing (HPC) Clusters. from One Stop Systems (OSS)

Parallel Programming Survey

CORRIGENDUM TO TENDER FOR HIGH PERFORMANCE SERVER

Cisco Unified Computing System Hardware

EDUCATION. PCI Express, InfiniBand and Storage Ron Emerick, Sun Microsystems Paul Millard, Xyratex Corporation

FPGA Accelerator Virtualization in an OpenPOWER cloud. Fei Chen, Yonghua Lin IBM China Research Lab

How To Build An Ark Processor With An Nvidia Gpu And An African Processor

BEAGLEBONE BLACK ARCHITECTURE MADELEINE DAIGNEAU MICHELLE ADVENA

OpenPOWER Outlook AXEL KOEHLER SR. SOLUTION ARCHITECT HPC

Model-based system-on-chip design on Altera and Xilinx platforms

CFD Implementation with In-Socket FPGA Accelerators

Getting Started with the Xilinx Zynq All Programmable SoC Mini-ITX Development Kit

PCI Express and Storage. Ron Emerick, Sun Microsystems

Network Security Appliance. Overview Performance Platform Mainstream Platform Desktop Platform Industrial Firewall

HPC Update: Engagement Model

System Design Issues in Embedded Processing

PCI Express Impact on Storage Architectures and Future Data Centers. Ron Emerick, Oracle Corporation

Digitale Signalverarbeitung mit FPGA (DSF) Soft Core Prozessor NIOS II Stand Mai Jens Onno Krah

System Performance Analysis of an All Programmable SoC

ARM Cortex -A8 SBC with MIPI CSI Camera and Spartan -6 FPGA SBC1654

Cloud Data Center Acceleration 2015

SC1-ALLEGRO CompactPCI Serial CPU Card Intel Core i7-3xxx Processor Quad-Core (Ivy Bridge)

Supercomputing Clusters with RapidIO Interconnect Fabric

Purchase of High Performance Computing (HPC) Central Compute Resources by Northwestern Researchers

A GPU COMPUTING PLATFORM (SAGA) AND A CFD CODE ON GPU FOR AEROSPACE APPLICATIONS

præsentation oktober 2011

Chapter 4 System Unit Components. Discovering Computers Your Interactive Guide to the Digital World

Introduction to GP-GPUs. Advanced Computer Architectures, Cristina Silvano, Politecnico di Milano 1

7a. System-on-chip design and prototyping platforms

High-Density Network Flow Monitoring

Chapter 5 Cubix XP4 Blade Server

Lesson 7: SYSTEM-ON. SoC) AND USE OF VLSI CIRCUIT DESIGN TECHNOLOGY. Chapter-1L07: "Embedded Systems - ", Raj Kamal, Publs.: McGraw-Hill Education

Storage Architectures. Ron Emerick, Oracle Corporation

Emerging storage and HPC technologies to accelerate big data analytics Jerome Gaysse JG Consulting

High speed pattern streaming system based on AXIe s PCIe connectivity and synchronization mechanism

LS DYNA Performance Benchmarks and Profiling. January 2009

HP Moonshot: An Accelerator for Hyperscale Workloads

Industry First X86-based Single Board Computer JaguarBoard Released

SUN HARDWARE FROM ORACLE: PRICING FOR EDUCATION


NVIDIA Jetson TK1 Development Kit

Reconfigurable System-on-Chip Design

ECLIPSE Performance Benchmarks and Profiling. January 2009

Intel Xeon Processor E5-2600

High Performance Computing in CST STUDIO SUITE

Architecting High-Speed Data Streaming Systems. Sujit Basu

Experiences With Mobile Processors for Energy Efficient HPC

Evoluzione dell Infrastruttura di Calcolo e Data Analytics per la ricerca

FPGA-based MapReduce Framework for Machine Learning

Copyright 2013, Oracle and/or its affiliates. All rights reserved.

HANIC 100G: Hardware accelerator for 100 Gbps network traffic monitoring

Building a Top500-class Supercomputing Cluster at LNS-BUAP

Our innovation, Your Applications. Your Own Custom Embedded Board in 5 weeks!

High-Level Synthesis for FPGA Designs

A Smart Investment for Flexible, Modular and Scalable Blade Architecture Designed for High-Performance Computing.

ZigBee Technology Overview

Networking Virtualization Using FPGAs

Innovative development and deployment of Intuitive Human Machine Interface for embedded applications

Energy efficient computing on Embedded and Mobile devices. Nikola Rajovic, Nikola Puzovic, Lluis Vilanova, Carlos Villavieja, Alex Ramirez

PCI Express Impact on Storage Architectures and Future Data Centers. Ron Emerick, Oracle Corporation

Open Flow Controller and Switch Datasheet

Pre-tested System-on-Chip Design. Accelerates PLD Development

Kalray MPPA Massively Parallel Processing Array

High Performance. CAEA elearning Series. Jonathan G. Dudley, Ph.D. 06/09/ CAE Associates

Sun Constellation System: The Open Petascale Computing Architecture

Product Brief. R7A-200 Processor Card. Rev 1.0

WiSER: Dynamic Spectrum Access Platform and Infrastructure

OpenSoC Fabric: On-Chip Network Generator

VPX Implementation Serves Shipboard Search and Track Needs

sontheim Wir leben Elektronik! We live electronics! Industrie Elektronik GmbH Computer-on-Modules Overview of our Computer-on-Modules

A-CLASS The rack-level supercomputer platform with hot-water cooling

760 Veterans Circle, Warminster, PA Technical Proposal. Submitted by: ACT/Technico 760 Veterans Circle Warminster, PA

Appro Supercomputer Solutions Best Practices Appro 2012 Deployment Successes. Anthony Kenisky, VP of North America Sales

PCI Express Impact on Storage Architectures. Ron Emerick, Sun Microsystems

Discovering Computers Living in a Digital World

FPO. Expanding Intel Architecture Flexibility in the Data Center. Markus Leberecht Data Center Solutions Architect, Intel EMEA March 20, 2013

Video/Cameras, High Bandwidth Data Handling on imx6 Cortex-A9 Single Board Computer

HUAWEI TECHNOLOGIES CO., LTD. HUAWEI FusionServer X6800 Data Center Server

Nutaq. PicoDigitizer 125-Series 16 or 32 Channels, 125 MSPS, FPGA-Based DAQ Solution PRODUCT SHEET. nutaq.com MONTREAL QUEBEC

Redefining Flash Storage Solution

Servers, Clients. Displaying max. 60 cameras at the same time Recording max. 80 cameras Server-side VCA Desktop or rackmount form factor

The search engine you can see. Connects people to information and services

Motherboard- based Servers versus ATCA- based Servers

Pedraforca: ARM + GPU prototype

HUAWEI Tecal E6000 Blade Server

Features Rich Expansion. Specifications Dimensions Optional Kit. Packing List Ordering Information Optional Modules

Xeon+FPGA Platform for the Data Center

Accelerating I/O- Intensive Applications in IT Infrastructure with Innodisk FlexiArray Flash Appliance. Alex Ho, Product Manager Innodisk Corporation

Transcription:

FPGA-Accelerated Heterogeneous Hyperscale Server Architecture for Next-Generation Clusters Rene Griessl, Peykanu Meysam, Jens Hagemeyer, Mario Porrmann Bielefeld University, Germany Stefan Krupop, Micha vor dem Berge Christmann, Germany Lars Kosmann, Patrick Knocke OFFIS, Germany Michał Kierzynka, Ariel Oleksiak Poznan Supercomputing and Networking Center, Poland

FiPS Motivation Rising demand for computing power Complexity of calculations Increasing number of users Energy consumption is rising 15 % of electrical energy consumption by computers CO 2 emissions Systems have to be cooled Complexity/Cost rising FiPS (Field Programmable Servers) Founded by EC Integrate FPGA / ARM 2

RECS Box Concept One single rack with up to 100 TFLOPS Up to 2,400 Microservers Up to 200 accelerators (e.g. GPGPU, Xeon Phi, FPGA) Up to 2.5 PB storage Up to 70 TB of RAM Main components RCU: RECS Box Unit Contains CPU nodes, networking, management RPU: RECS Box Power Unit Intelligent power supply for RCUs Can deliver power for multiple RCUs 3

Backplane Backplane Backplane RECS Box System Overview Modular Microserver Architecture Arbitrary combinations of up to 72 Microservers in a single 1 RU Server Front Panel 4

Backplane Backplane Backplane RECS Box System Overview Modular Microserver Architecture Arbitrary combinations of up to 72 Microservers in a single 1 RU Server Communication Backplane Scalable from 6 to 18 compute boards Flexible communication using dedicated Net-s s Modular Architecture Integrate microservers based on traditional CPUs and mobile CPUs Hardware Accelerators Integrated in specialized microserver Attached via PCIe Front Panel 5

RECS Box Communication Backplane 18 x 1/10 GbE from Net s KVM/SM-Bus Three Levels of Interconnect Monitoring and Control Distributed monitoring environment Continuous sensing of volt., curr., temp., Integrated KVM Management Gb Ethernet, switched on backplane High Throughput 10 GbE Infiniband Network (PCIe/GbE) RECS RECS RECS Network (PCIe/GbE) Network (PCIe/GbE) Network 18 x 1/10 GbE to ext. Switch 2 x 1/10 GbE Net (1/10 GbE) 2 x 1/10 GbE Net (1/10 GbE) 2 x 1/10 GbE Net (1/10 GbE) GbE Switch 4 x GbE KVM, Monitoring & Control USB, HDMI, Ethernet Backplane Management GbE Network (PCIe/GbE) RECS Network (PCIe/GbE) RECS Network (PCIe/GbE) RECS Front Panel 6

RECS Box Microserver Overview Apalis-based microservers COM Express-based microservers ARM CPUs ranging from Cortex- A9 to Cortex-A15 Xilinx Zynq-based module x86 modules range from Atom single cores to i7 quad core FPGA-based COM Express module developed, integrating Xilinx Zynq SoC 7

RECS Box Microserver Zynq COM Express Module Zynq-7000, ARM A9 Dual Core, 1 GHz Tightly integrated Programmable Logic Used to extend Processing System High performance ARM AXI interfaces IP cores on PL Memory interfaces 1 GByte of DDR3 (32-bit) PS Memory up to 4 GByte DDR3 SO-DIMM module (64-bit-wide) PL Memory emmc memory (16 GByte) Secure Digital (SD) card slot Management network: Gigabit Ethernet (PS) High-speed serial links (PCIe 2.1, 5 Gb/s) PCIe-based high toughput network can be implemented 8

Processing System Programmable Logic I/O Zynq-7000 Overview Zynq-7000 AP SoC Devices Z-7010 Z-7015 Z-7020 Z-7030 Z-7035 Z-7045 Z-7100 Processor Core Dual ARM Cortex -A9 MPCore Processor Extensions NEON & Single / Double Precision Floating Point Max Frequency 866 MHz (-3) / 766 MHz (-2) 1 GHz (-3) / 800 MHz (-2) Memory External Memory Support Peripherals L1 Cache 32KB I / D, L2 Cache 512KB, on-chip Memory 256KB DDR3, DDR2, LPDDR2, 2x QSPI, NAND, NOR 2x USB 2.0 (OTG), 2x Tri-mode Gigabit Ethernet, 2x SD/SDIO, 2x UART, 2x CAN 2.0B, 2x I2C, 2x SPI, 4x 32b GPIO Approximate ASIC Gates Peak DSP Performance (Symmetric FIR) PCI Express (Root Complex or Endpoint) Agile Mixed Signal (XADC) ~430K (30k LC) ~1.1K (74k LC) ~1.3M (85k LC) ~1.9M (125k LC) ~4.1M (275k LC) ~5.2M (350k LC) ~6.6M (444k LC) Block RAM 240KB 380KB 560KB 1,060KB 2,000KB 2,180KB 3,020KB 100 GMACS 160 GMACS 276 GMACS 593 GMACS 1,334 GMACS 1334 GMACS 2622 GMACS - Gen2 x4 Gen2 x8 Gen2 x8 Gen2 x8 2x 12bit 1Msps A/D Converter Processor System IO 130 Multi Standards 3.3V IO 100 150 200 100 100 100 250 Multi Standards High Performance 1.8V IO - - - 150 150 150 150 Multi Gigabit Transceivers - 4 (6.25 Gbit/s) - 4 (12.5 Gbit/s) 8 (12.5 Gbit/s) 8 (12.5 Gbit/s) 16 (12.5 Gbit/s) [Xilinx, Zynq-7000 All Programmable SoC Overview, 2014] 9

SO-DIMM Module Zynq-7000 COM Express Mechanical Overview PS-DDR3 RAM HS Connector Piggyback Connector Zynq-7000 Crosspoint Switch Array COM-Express Connector Microcontroller HS Connector 10

RECS BOX Communication Backplane 18 x 1/10 GbE from Net s KVM/SM-Bus Fourth Level of Interconnect Direct low latency 8x 12.5 Gb/s 300 ns Network (PCIe/GbE) RECS 2 x 1/10 GbE Net (1/10 GbE) 2 x 1/10 GbE Network (PCIe/GbE) RECS High Throughput 10 GbE Infiniband Management Gb Ethernet Network (PCIe/GbE) RECS Network (PCIe/GbE) RECS Net (1/10 GbE) 2 x 1/10 GbE Net (1/10 GbE) 4 x GbE Network (PCIe/GbE) RECS Network (PCIe/GbE) RECS Monitoring and Control GbE Switch Backplane Network 18 x 1/10 GbE to ext. Switch KVM, Monitoring & Control USB, HDMI, Ethernet Management GbE Front Panel 11

Benchmark DNA sequencing GPU implementation Needleman-Wunsch and dynamic programming (DP): Data dependencies: left, upper and diagonal elements are needed H i, j = max H i 1, j G penalty H i, j 1 G penalty H i 1, j 1 + SM(s 1 i, s 2 [j]) GPU implementation: The whole matrix is processed by a single GPU thread, thousands of threads work in parallel MxN matrix is divided into sub-matricies of KxK (K is the unroll factor) Up to 256 cells computed from a single data fetch Highly optimised for NVIDIA Fermi architecture using CUDA 12

Benchmark DNA sequencing FPGA implementation Based on Vivado HLS Starting from basic C Needleman- Wunsch implementation Each PE calculates one submatrix of KxK nucleoids Systolic array style pipeline structure 11 PEs fit in Zynq-7045 Zynq PS initializes and manages data flow Direct low latency links can be used for multi FPGA architecture 32-bit unsigned int ACAGTATAGATTACAT ACAGTATAGATTACAT ACAGTATAGATTACAT ACAGTATAGATTACAT PE 1 PE 2 PE 3 PE 1 PE 2 PE 3 PE 1 PE 2 PE 3 Total FPGA resources (11 PE) 13

Benchmark DNA sequencing Results Fastest implementation on Tesla GPU Most energy efficient implementation on FPGA Live Demo available Visit Booth 2303 14

Summary and Outlook RECS Resource Efficient Cluster Server o Integration of CPUs, embedded CPUs, GPGPUs and FPGAs Evaluation using DNA sequencing o HLS FPGA implementation provides maximum energy efficency Outlook: EC Horizon 2020 project M2DC Modular Microserver Data Center Architectural improvements focusing on communication New Microservers (64-bit ARM, Zynq Ultrascale+, Hybrid Memory Cube) Large scale testbeds / applications 15

Many Thanks! Visit RECS at booth 2303 René Griessl Cognitronics and Sensor Systems Center of Excellence Cognitive Interaction Technology Bielefeld University, Germany rgriessl@cit-ec.uni-bielefeld.de www.ks.cit-ec.uni-bielefeld.de

Zynq-7000 COM Express System Architecture COM Express Connector I²C UART CAN COM-Express-Carrier DDR3 SODIMM 8 GByte/ 64 Bit USB Hub USB2514B USB SATA TUSB9261 SD Card QSPI Flash S25FL256S emmc Flash KLMxGxxE2x I²C Display Port ANX9804 USB 2.0 PHY USB3320 USB 2.0 PHY USB3320 DDR3 1 GByte/ 32 Bit GbE PHY 88E1116R ZYNQ-7000-PL AXI EMIO ZYNQ-7000-PS Compatible with COM Express Type 6 Standards 17