Cloud Data Center Acceleration 2015





Agenda
- Computer & Storage Trends
- Server and Storage System
  - Memory and Homogeneous Architecture
  - Direct Attachment
- Memory Trends
- Acceleration Introduction
- FPGA Adoption Examples

Server (Computer) & Storage Trends
- Cloud computing, virtualization, convergence
  - Server & storage consolidation & virtualization
  - Convergence to a PCIe backplane and low-latency 25GbE
  - Distributed storage and cache for cloud computing
  - Convergence is enabling lower power and higher density
- Lots of interest in storage and storage-class memory
  - Capacity expansion: DRAM to flash, and flash cache
  - Intermediate storage
  - Disaggregation of storage and data
- Rapid change & new cloud architectures
  - Verticalization, disaggregation, dense computing for cloud servers
  - Acceleration option: an FPGA per node, or a pool of heterogeneous accelerators

Data Center Challenges
- Memory & I/O bottlenecks limit utilization
  - Typical server workloads run at ~20% processor utilization
- Virtualization is driving application consolidation
  - But memory and I/O are the limiting factors to better utilization
  - Big Data configurations are also bottlenecked, especially search and analytics workloads
- The processor mostly waits for RAM
  - Flash / disk are 100,000 to 1,000,000 clocks away from the CPU
  - RAM is ~100 clocks away unless you have locality (cache)
  - If you want 1 CPI (clock per instruction), the data has to be in cache (program cache is easy)
  - This requires cache-conscious data structures and algorithms with sequential (or predictable) access patterns
  - In-memory databases are going to be common (e.g. the Spark architecture)
(Diagram: FPGA attached to CPU. Source: Microsoft)
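The cache-locality point above can be demonstrated with a small, illustrative experiment (a sketch, not from the deck): summing the same array sequentially versus in shuffled order. At the Python level the gap is exaggerated by interpreter indexing overhead, but the direction matches the hardware argument: predictable sequential access streams through the cache, shuffled access defeats it.

```python
import array
import random
import time

N = 1_000_000
data = array.array("d", (float(i) for i in range(N)))

def sum_sequential(a):
    """Cache-friendly traversal: the hardware prefetcher streams the array."""
    total = 0.0
    for x in a:
        total += x
    return total

def sum_random(a, idx):
    """Cache-hostile traversal: each indexed access is a likely cache miss."""
    total = 0.0
    for i in idx:
        total += a[i]
    return total

indices = list(range(N))
random.shuffle(indices)

t0 = time.perf_counter()
s1 = sum_sequential(data)
t1 = time.perf_counter()
s2 = sum_random(data, indices)
t2 = time.perf_counter()

# Both orders sum the same integers exactly (all partial sums fit in a double).
assert s1 == s2
print(f"sequential: {t1 - t0:.3f}s  shuffled: {t2 - t1:.3f}s")
```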

O/S Bypass: DMA, RDMA, Zero Copy, Direct to CPU Cache
- Avoid memory copies: NICs, clusters, accelerators
- DMA, RDMA: Mellanox RoCE, InfiniBand
- Intel PCIe steering hints: DMA directly into the CPU cache
- Heterogeneous System Architecture (HSA) for accelerators
- Direct access to the CPU cache: QPI, CAPI
  - Low latency
  - Simplified programming model
  - Huge benefit for flash cache
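The zero-copy idea can be illustrated in software: a Python `memoryview` shares the underlying buffer instead of copying it. This is only a minimal stand-in for the DMA/RDMA kernel-bypass mechanisms the slide names, but it shows the same principle of passing references to data rather than duplicating it.

```python
import time

buf = bytearray(100 * 1024 * 1024)  # 100 MB buffer

t0 = time.perf_counter()
copy = bytes(buf[10:10_000_010])        # slicing a bytearray copies ~10 MB
t1 = time.perf_counter()
view = memoryview(buf)[10:10_000_010]   # a memoryview slice shares the memory
t2 = time.perf_counter()

buf[10] = 0x42
assert view[0] == 0x42   # the view observes the write: no copy was made
assert copy[0] == 0x00   # the copy, made earlier, does not
print(f"copy: {t1 - t0:.4f}s  view: {t2 - t1:.6f}s")
```

Creating the view is effectively free regardless of size, while the copy scales with the number of bytes moved, which is exactly the cost RDMA and zero-copy I/O paths are designed to avoid.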

Computing Bottlenecks
- Memory bottleneck: need faster & larger DRAM
  - CPU core growth > memory bandwidth growth
  - CPUs have a limited number of pins
  - DRAM process geometry limits
- Emerging
  - Stacked DRAM in package
  - Optics from the CPU package
  - Optics controllers for clusters and cluster networking over optics
- Main storage data response time impacts Big Data processing
(Diagram: optics controller to the ToR switch; optics controller with switching)

Emerging: Data Stream Mining, Real-Time Analytics
- Data stream examples: computer network traffic, data feeds, sensor data
- Benefits
  - Real-time analytics: predict the class or value of new instances, e.g. security threats with machine learning
  - Filtering data to store
- Topology: single or multiple FPGA accelerators
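As a concrete illustration of "filtering data to store", here is a minimal streaming filter that keeps only readings far from a trailing-window mean and discards the rest before storage. The window size, threshold, and sample data are illustrative assumptions, not from the deck; an FPGA version would apply the same per-element logic at line rate.

```python
from collections import deque

def stream_filter(stream, window=5, k=3.0):
    """Yield readings more than k std-devs from the trailing-window mean."""
    hist = deque(maxlen=window)
    for x in stream:
        if len(hist) == window:
            mean = sum(hist) / window
            var = sum((v - mean) ** 2 for v in hist) / window
            std = var ** 0.5
            if abs(x - mean) > k * max(std, 1e-9):
                yield x          # anomaly: worth storing / alerting on
        hist.append(x)

# One spike hidden in otherwise steady sensor readings.
readings = [10, 10, 11, 10, 10, 50, 10, 9, 10, 11]
print(list(stream_filter(readings)))
```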

Enter New Memory Solutions (A New Dawn Awaits)
- Did the 14nm NAND delay drive these solutions to become next generation, or did the need for more flexible memory and storage applications drive this transition?
- New memories are complementary to existing solutions
- How to adopt them: where do they go, and how do they fit in tomorrow's server/storage architectures?

3D XPoint vs. NAND
- 1000X faster writes
- Much better endurance
- 5X to 7X faster SSDs
- Cost & price in between DRAM and flash
- Altera FPGA controller options

Rapid Change in the Cloud Data Center
- Rapid change & new cloud architectures
  - Verticalization, disaggregation, dense computing for cloud servers
  - Intel offering 35 custom Xeon SKUs for Grantley
- Software-Defined Data Center
  - Pool resources (compute, network, storage); automate provisioning and monitoring
- Intel MCM & Microsoft Bing FPGA announcements
(Chart: Intel standard-to-custom roadmap showing the 35 Grantley SKUs)

Accelerator Spectrum
- Application spectrum: computer vision, data analytics, language processing, visual analytics, search ranking, data streaming, computational medical diagnosis, image pattern recognition, best-match engines
- Algorithms: database, graph, machine learning, numeric computing

Innovation Roadmap
(Roadmap chart, flexibility axis: virtual machines, Linux containers, high VM counts; local virtual storage, disaggregated virtual storage, software-defined data center; mainstream disk backup over SATA/SAS/SOP/NVMe; SSD fabric/array with easy replication; HDD, SSD, SSD cache, 3D XPoint (3DRS), torus cache, auto-tier; multiprotocol scale-out; direct-attach sub-systems)

Efficient Data-Centric Computing Topologies
- Server with unstructured search topology, e.g. Hadoop + Map/Reduce: small processors (P1 ... Pn) close to storage, each with memory and flash drive(s) running e.g. the Map function, feeding a switch and/or large aggregating processor for map result collection and the Reduce function. Application: data analytics / data search / video server.
- Server with balanced FLOPs, bytes/s, and FLOPs/byte depth: X TFLOP processor with X TB/s and X TBytes of memory, network/storage attach. Application: large-dataset HPC with compute-intensive functions that do not scale well, e.g. FEA.
- Server with 3D torus configurations: network/storage attach. Application: classic HPC, e.g. QCD, CFD, weather modeling.
- Server with multi-node pipeline: nodes P1 ... Pn, each with 130 GB of memory, network/storage attach. Application: deep-pipeline DSP, e.g. video analytics.

Microsoft SmartNIC with FPGA for Azure (Hot Chips presentation, 8-25-15)
- Scaling to 40 Gb/s and beyond requires significant computation for packet processing
- Use FPGAs for reconfigurable functions
  - Already used in Bing; software-configurable
- Programmed with Generic Flow Tables (GFT), the SDN interface to the hardware
- The SmartNIC also does crypto, QoS, storage acceleration, and more
http://tinyurl.com/p4sghaq

FPGA AlexNet Classification Demo (Intel IDF, August 2015)
- CNN AlexNet classification
  - 2X+ performance/W vs. CPU (Arria 10)
  - 5X+ performance from Arria 10 to Stratix 10 (3X DSP blocks, 2X clock speed)
- Microsoft projection
  - 880 images/s for A10 GX115
  - 2X performance/W versus GPU AlexNet
- Altera OpenCL AlexNet example
  - 600+ images/s for A10 GX115 by year end

CNN classification results:

  Platform                                 Power (W)   Performance (images/s)   Efficiency (images/s/W)
  Dual Xeon E5-2699 (18 cores per Xeon)    321         1320                     4.11
  PCIe card with dual Arria 10 1150        130*        1200                     9.27

  * Includes CPU low-power state of 65 W.
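The efficiency column is simply performance divided by power, which is easy to sanity-check. Note that 1200/130 comes out to about 9.23 rather than the 9.27 quoted on the slide, presumably due to rounding in the source figures:

```python
# Sanity-check of the efficiency column (images/s per watt) using only
# the power and throughput numbers quoted in the table above.
platforms = {
    "Dual Xeon E5-2699":       (321.0, 1320.0),  # (power W, images/s)
    "Dual Arria 10 PCIe card": (130.0, 1200.0),
}
for name, (watts, imgs) in platforms.items():
    eff = imgs / watts
    print(f"{name}: {eff:.2f} images/s/W")
```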

Why Expansion Memory?
- Enable memory-intensive computation
- Increase user productivity
- Change the way we look at data
- Boost scientific output
- Broaden participation
(Word cloud: machine learning, graph-based informatics, high-productivity languages, load balancing, data exploration, algorithm expression, statistics, interactivity, Big Data, ISV apps)

Advanced Memory Controller Market
Memory innovation will change how computing is done.
- An emerging market for advanced memory controllers: devices that interface to the processor by attaching directly to its existing memory bus
- New memory types will require new controller implementations
- Memory offload applications: filtering, acceleration, capacity, sub-systems
- An FPGA can translate between the existing memory-interface electricals and a plethora of back-end devices, interfaces, or protocols, enabling a wide variety of applications. Initial examples:
  - Bridging between DDR4 and other memory technologies such as NAND flash, MRAM, or memristor
  - Memory depth expansion, up to 8X the memory density available per memory controller
  - Enabling new memory adoption quickly
  - Accelerating data processing for analytics applications
  - Offloading data-management functions such as compression or encryption

Application: DDR4 DIMM Replacement - Memory Bridging and/or In-line Acceleration
(Block diagram: the Xeon DDR4 controller drives DDR4 slots 0 and 1; slot 1 holds a DIMM module whose FPGA contains a DDR4 slave, control/acceleration logic, memory filter/search logic, on-chip cache, and an advanced memory controller fronting DDR4, NAND, MRAM, memristor, or other 3DRS memory.)
Key memory attributes: capacity, mixed-memory sub-system, memory-optimized solution for application/database acceleration.

Acceleration Solutions - Data: Making Money the New Way

Acceleration Memory Applications

  Accelerator Application     Memory Function      Memory Type   Future
  Data Analytics              Temporary storage    DDR3/4        Storage class, HBM, HMC
  Computer Vision/OCR         Buffer               DDR3/4        Storage class
  Image Pattern Recognition   Storage, buffer      SSD, DDR      Storage class, HBM, HMC
  Search Ranking              Storage, working     DDR3          Storage class
  Visual Analytics            Buffer               DDR3          Storage class
  Medical Imaging             Storage, buffer      SSD, DDR3/4   Storage class, DDR4

- As FLOPs increase, memory bandwidth will need to scale
- As data increases, capacity will also increase to sustain computation

Accelerator Board Block Diagram

Dual Arria 10 High-Memory-Bandwidth FPGA Accelerator
- GPU form-factor card with 2x Arria 10 10A1150GX FPGAs
  - Dual-slot standard configuration; single-slot width possible if the user design fits within a ~100 W power footprint
- 410 GB/s peak aggregate memory bandwidth
  - 85 GB/s peak DDR4 memory bandwidth per FPGA
  - 60 GB/s write + 60 GB/s read peak HMC bandwidth per FPGA
- 132 GB memory depth, or 260 GB with soft memory controllers
  - 4 GB of HMC memory shared between the FPGAs
- 60 GB/s board-to-board pipelining bandwidth (7.5 GB/s per channel per direction)
  - (4) communication channels running at 15 Gb/s, or (4) 40GbE network I/O channels
(Block diagram: each Arria 10 1150GX FPGA drives two x72 memory channels at 2666 MT/s to 32 GB of DDR4 SODIMMs and two more to 32 GB of DDR4 discretes; x32 transceivers link the two FPGAs through a 4 GB HMC delay buffer; each FPGA connects by PCIe x8 to a PCIe switch on the PCIe x16 Gen 3 edge connector.)
NOTE: Performance numbers are absolute maximum capability & peak data rates.
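The headline figures above follow directly from the per-FPGA numbers; a quick check using only values quoted on the slide:

```python
# Sanity-check of the Dual Arria 10 card's quoted peak numbers.
n_fpga = 2
ddr4_bw_per_fpga = 85            # GB/s peak DDR4 bandwidth per FPGA
hmc_bw_per_fpga = 60 + 60        # GB/s HMC write + read per FPGA
aggregate = n_fpga * (ddr4_bw_per_fpga + hmc_bw_per_fpga)
print(aggregate)                 # slide quotes 410 GB/s

depth = n_fpga * (32 + 32) + 4   # SODIMM + discrete DDR4 per FPGA, + shared HMC
print(depth)                     # slide quotes 132 GB

b2b = 4 * 7.5 * 2                # 4 channels x 7.5 GB/s/ch/direction x 2 directions
print(b2b)                       # slide quotes 60 GB/s
```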

Dual Stratix 10 3D-Torus-Scalable FPGA Accelerator
- GPU form-factor card with 2x Stratix 10 FPGAs
  - Supports the majority of the Stratix 10 family, both large and small devices, from 2 to 10 TFLOPs
- 204 GB/s peak aggregate memory bandwidth
  - 102 GB/s peak DDR4 memory bandwidth per FPGA
- 256 GB memory depth
- 336 GB/s board-to-board scaling bandwidth (14 GB/s per channel per direction)
  - Board-to-board scaling interconnect for 2D/3D mesh/torus topologies
(Block diagram: each Stratix 10 FPGA drives two x72 memory channels at 3200 MT/s to 32 GB of DDR4 SODIMMs and two more to 32 GB of DDR4 discretes; x4 transceiver bundles provide the scaling links, x8 transceivers link the two FPGAs, and each FPGA connects by PCIe x16 to a PCIe switch on the PCIe x16 Gen 3 edge connector.)
NOTE: Performance numbers are absolute maximum capability & peak data rates.

Minimise Multiple Accesses to External Memory
Traditional CPU/GPU implementation: functions A through E each access the entire volume of data in system memory, iterating many times; the result is read from the P & Q buffers after many thousands of iterations.


Minimise Multiple Accesses to External Memory
FPGA implementation: functions A through E are chained into a deep pipeline, so intermediate results stream from function to function instead of round-tripping through system memory; the result is read from the P & Q buffers after many thousands of iterations.


Try to Minimise Multiple Accesses to External Memory
FPGA implementation with delay lines: where large algorithm data alignment is required, delay lines deeper than block RAM spill to external memory to further extend the deep pipeline; the result is read from the P & Q buffers after many thousands of iterations.
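The pipelining idea in these slides is essentially loop fusion: instead of one full pass over external memory per function, each element streams through all functions once. A toy sketch with illustrative stand-in functions (not the actual kernels from the deck):

```python
# "CPU/GPU" style vs. "FPGA pipeline" style over the same data.
def fa(x): return x + 1
def fb(x): return x * 2
def fc(x): return x - 3
def fd(x): return x // 2
def fe(x): return x + 5

data = list(range(8))

# CPU/GPU style: each function sweeps the whole buffer in "external memory",
# so the data set is read and written five times.
buf = data[:]
for fn in (fa, fb, fc, fd, fe):
    buf = [fn(x) for x in buf]   # one full memory pass per function

# FPGA style: one deep pipeline; each element streams through all stages,
# so the data set is read from memory exactly once.
pipelined = [fe(fd(fc(fb(fa(x))))) for x in data]

assert buf == pipelined
print(pipelined)
```

Both variants compute the same result; the difference is the number of passes over the data, which is exactly the external-memory bandwidth the slides are trying to save.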

Summary
- FPGAs use less external memory bandwidth for Reverse Time Migration, CNNs, and other common acceleration algorithms.
- The growth in data and TFLOPs for acceleration will require more bandwidth, scaled in an orderly fashion; new memories will require higher bandwidth and controller changes.
- Memory and system solutions that increase compute efficiency are changing architectures, networks, and the types of memory used.
