Intel Xeon+FPGA Platform for the Data Center




Intel Xeon+FPGA Platform for the Data Center
FPL'15 Workshop on Reconfigurable Computing for the Masses
PK Gupta, Director of Cloud Platform Technology, DCG/CPG

Overview
- Data Center and Workloads
- Xeon+FPGA Accelerator Platform
- Applications and Eco-system

Digital Services Economy
- 50 billion devices¹
- $450B in new services²
- $120B build-out of the cloud³
1: Sources: AMS Research, Gartner, IDC, McKinsey Global Institute, and various other industry analysts and commentators
2: Source: IDC, 2013; 2016 figure calculated based on reported CAGR 13-17
3: Source: IDATE DigiWorld, 2013

Fueling Cloud Computing Growth

Cloud Economics
Workload performance metrics: VMs per system, web transactions per second, storage capacity, Hadoop queries.
Amazon's TCO analysis¹ shows that performance / TCO is the key metric.
1: Source: James Hamilton, Amazon* http://perspectives.mvdirona.com/2010/09/overall-data-center-costs/

Diverse Data Center Demands
Accelerators can increase performance at lower TCO for targeted workloads.
(Chart of workloads by CPU intensity; Intel estimates, bubble size is relative CPU intensity.)

Overview
- Data Center and Workloads
- Xeon+FPGA Accelerator Platform
- Applications and Eco-system

Accelerator Architecture Landscape
(Chart: CPUs, reconfigurable accelerators, and fixed-function accelerators plotted by ease of programming/development versus application flexibility.)

Accelerator Attach
Attach options range from PCIe attach and QPI attach to on-package, on-chip, and on-core integration, trading latency and granularity against distance from the core. The best attach technology might be application- or even algorithm-dependent.

Coherency and Programming Model
Data movement:
- In-line: accelerator processes data fully or partially from direct I/O
- Shared virtual memory: virtual addressing eliminates the need for pinning memory buffers; zero-copy data buffers
Interaction between core and accelerator:
- Off-load
- Hybrid: algorithm implemented on host and accelerator

Proposed Platform for the Data Center
FPGA with coherent low-latency interconnect:
- Simplified programming model
- Support for virtual addressing
- Data caching
Enables new classes of algorithms for acceleration, with:
- Full access to system memory
- Support for efficient irregular data-pattern access
- Remapping of algorithms from an off-load model to a hybrid processing model
- Fine-grained interactions

IVB+FPGA Software Development Platform
Software development for accelerating workloads using a Xeon and a coherently attached, in-socket FPGA.
- Processor: Intel Xeon E5-26xx v2 (Intel Xeon E5-2600 v2 product family), DDR3
- FPGA module: Altera Stratix V, attached over QPI
- QPI speed: 6.4 GT/s full width (target 8.0 GT/s at full width)
- Memory to FPGA module: 2 channels of DDR3 (up to 64 GB)
- Expansion connector to FPGA module: PCIe 3.0 x8 lanes; may be used for direct I/O, e.g. Ethernet
- FPGA features: Configuration Agent, Caching Agent, (optional) Memory Controller
- Software: Accelerator Abstraction Layer (AAL) runtime, drivers, sample applications
Heterogeneous architecture with homogeneous platform support.

System Logical View
- AFUs can access the coherent cache on the FPGA
- AFUs cannot implement a second-level cache
- The Intel QuickPath Interconnect (Intel QPI) IP participates in cache coherency with the processors

Programming Interfaces
CPU side: the host application runs on the Accelerator Abstraction Layer (Service API, Virtual Memory API, Physical Memory API) over the uncore (address translation; QPI/KTI link, protocol, and PHY).
FPGA side: Accelerator Function Units (AFUs) attach through the CCI extended and CCI standard interfaces over QPI/KTI.
- Programming interfaces will be forward compatible from the SDP to future MCP solutions
- A simulation environment is available for development of SW and RTL

Programming Interfaces: OpenCL
The OpenCL application splits into OpenCL host code, running on the CPU over the OpenCL runtime and the Accelerator Abstraction Layer (Service API, Virtual Memory API, Physical Memory API), and OpenCL kernel code, running as kernels on the FPGA with virtual- and physical-memory access to system memory over CCI extended, CCI standard, and QPI/UPI/PCIe.
- Unified application code, abstracted from the hardware environment
- Portable across generations and families of CPUs and FPGAs

Overview
- Data Center and Workloads
- Xeon+FPGA Accelerator Platform
- Applications and Eco-system

Intel Xeon + FPGA¹ in the Cloud: Vision
Cloud users launch workloads through orchestration software, which places them on a resource pool of compute, network, and storage (Software Defined Infrastructure) and handles static/dynamic FPGA programming on Intel Xeon+FPGA nodes.
Workload accelerators come from a workload IP library: end-user developed IP, Intel-developed IP, FPGA-vendor developed IP, and 3rd-party developed IP.
1: Field Programmable Gate Array (FPGA)

Example Usage: Deep Learning
Framework for visual understanding, mapped hierarchically: cluster → node → device (CCI interface, SRAM controller, read/write/register-access DMA, IP registers, processing tiles 0..n) → primitives (control state machine, weights, inputs, outputs, PEs).
CNN (Convolutional Neural Network) function accelerated on FPGA: power-performance of CNN classification boosted up to 2.2X.
Source: Intel measured (Intel Xeon processor E5-2699 v3 results); Altera estimated (4x Arria 10 results). 2S Intel Xeon E5-2699 v3 + 4x GX1150 PCI Express cards; most computations executed on the Arria 10 FPGAs, with the 2S Intel Xeon E5-2699 v3 host assumed to be near idle, doing miscellaneous networking/housekeeping functions. Arria 10 results estimated by Altera with an Altera custom classification network. 2x Intel Xeon E5-2699 v3 power estimated at 139 W while doing housekeeping for the GX1150 cards, based on an Intel-measured microbenchmark.
To sustain ~2400 img/s, an I/O bandwidth of ~500 MB/s is needed, which can be supported by a 10GigE link and software stack.

Example Usage: Genomics Analysis Toolkit
Pipeline: BWA-MEM (Smith-Waterman) → HaplotypeCaller (PairHMM).
PairHMM function accelerated on FPGA: power-performance of PairHMM boosted up to 3.8X.
Algorithm performance is measured in millions of cell updates per second (MCUP/s). Performance projections:
- CPU: 1 core of an Intel Xeon processor E5-2680 v2 @ 2.8 GHz delivers 2101.1 MCUP/s (measured); the estimated value assumes linear scaling to 10 cores on the E5-2680 v2 @ 2.8 GHz (115 W TDP).
- FPGA: 1 FPGA PE (Processing Engine) delivers 408.9 MCUP/s @ 200 MHz (measured); the estimated value assumes linear scaling to 32 PEs and 90% frequency scaling to 400 MHz on a Stratix V A7, based on RTL synthesis results (35 W TDP).
Intel estimate based on 1S Xeon E5-2680 v2 + 1 Stratix V A7 with QPI 1.1 @ 6.4 GT/s full width, using Intel QuickAssist FPGA System Release 3.3, ICC. (The CPU is essentially idle when the workload is offloaded to the FPGA.)

Example Usage: Database Query Processing
A DB application issues a query (e.g. select * from table where a < 100); compressed data is fetched from NAS over the network router, and the FPGA performs data decompression plus query execution on the path from disk to the application.
Decompression function accelerated on FPGA: power-performance of LZO decompression boosted up to 1.9X.
LZO decompression performance is measured in bytes decompressed per second. Performance projections are for stream files of 111 kB, where the decompression matches fall within the FPGA buffer and require no system-memory R/W requests:
- FPGA (estimated): 0.48 clocks/byte per LZOD PE (Processing Engine), i.e. 727 MB/s throughput @ 350 MHz, based on cycle-accurate RTL simulation measurements; assumes linear scaling to 20 LZOD PEs on an Arria 10 1150 @ 350 MHz (60 W TDP). (The CPU is essentially idle when the workload is offloaded to the FPGA.)
- CPU: 4.5 clocks/byte measured on one thread of an E5-2699 v3 using IPP 9.0.0, i.e. 511 MB/s throughput @ 2.3 GHz; assumes linear scaling to 36 threads on a 1S E5-2699 v3 @ 2.3 GHz (145 W TDP).

Academic Research in FPGA Usages
Intel and Altera jointly launched the Hardware Accelerator Research Program:
- Q1'15: Call for proposals, offering faculty computer systems that contain Intel microprocessors and an Altera Stratix V FPGA module incorporating Intel QuickAssist Technology, to spur research in programming tools, operating systems, and innovative applications for accelerator-based computing systems
- Q2'15: Proposals reviewed and selected
- Q3'15: Systems shipped to universities

Q & A