Scaling Mobile Compute to the Data Center. John Goodacre



Similar documents
Hardware accelerated Virtualization in the ARM Cortex Processors

7a. System-on-chip design and prototyping platforms

Copyright 2013, Oracle and/or its affiliates. All rights reserved.

OpenPOWER Outlook AXEL KOEHLER SR. SOLUTION ARCHITECT HPC

Xeon+FPGA Platform for the Data Center

High Performance or Cycle Accuracy?

Cloud Data Center Acceleration 2015

ARM Webinar series. ARM Based SoC. Abey Thomas

All Programmable Logic. Hans-Joachim Gelke Institute of Embedded Systems. Zürcher Fachhochschule

Introduction to AMBA 4 ACE and big.little Processing Technology

Applied Micro development platform. ZT Systems (ST based) HP Redstone platform. Mitac Dell Copper platform. ARM in Servers

A Scalable VISC Processor Platform for Modern Client and Cloud Workloads

Digitale Signalverarbeitung mit FPGA (DSF) Soft Core Prozessor NIOS II Stand Mai Jens Onno Krah

ARM Processors and the Internet of Things. Joseph Yiu Senior Embedded Technology Specialist, ARM

Support a New Class of Applications with Cisco UCS M-Series Modular Servers

Emerging storage and HPC technologies to accelerate big data analytics Jerome Gaysse JG Consulting

Supercomputing Clusters with RapidIO Interconnect Fabric

HUAWEI Tecal E6000 Blade Server

FPGA Accelerator Virtualization in an OpenPOWER cloud. Fei Chen, Yonghua Lin IBM China Research Lab

760 Veterans Circle, Warminster, PA Technical Proposal. Submitted by: ACT/Technico 760 Veterans Circle Warminster, PA

LSI SAS inside 60% of servers. 21 million LSI SAS & MegaRAID solutions shipped over last 3 years. 9 out of 10 top server vendors use MegaRAID

A-CLASS The rack-level supercomputer platform with hot-water cooling

Data Center and Cloud Computing Market Landscape and Challenges

What is a System on a Chip?

The Future of Computing Cisco Unified Computing System. Markus Kunstmann Channels Systems Engineer

Power Efficiency Comparison: Cisco UCS 5108 Blade Server Chassis and IBM FlexSystem Enterprise Chassis

Seeking Opportunities for Hardware Acceleration in Big Data Analytics

SAN Conceptual and Design Basics

White Paper Solarflare High-Performance Computing (HPC) Applications

OpenSoC Fabric: On-Chip Network Generator

Data Centric Systems (DCS)

ICRI-CI Retreat Architecture track

Architekturen und Einsatz von FPGAs mit integrierten Prozessor Kernen. Hans-Joachim Gelke Institute of Embedded Systems Professur für Mikroelektronik

PCI Express Impact on Storage Architectures and Future Data Centers. Ron Emerick, Oracle Corporation

How To Design A Single Chip System Bus (Amba) For A Single Threaded Microprocessor (Mma) (I386) (Mmb) (Microprocessor) (Ai) (Bower) (Dmi) (Dual

big.little Technology Moves Towards Fully Heterogeneous Global Task Scheduling Improving Energy Efficiency and Performance in Mobile Devices

ARM Processors for Computer-On-Modules. Christian Eder Marketing Manager congatec AG

Qsys and IP Core Integration

EDUCATION. PCI Express, InfiniBand and Storage Ron Emerick, Sun Microsystems Paul Millard, Xyratex Corporation

White Paper. S2C Inc Technology Drive, Suite 620 San Jose, CA 95110, USA Tel: Fax:

Overview: X5 Generation Database Machines

CoreSight SoC enabling efficient design of custom debug and trace subsystems for complex SoCs

How To Build A Cloud Computer

Architectures and Platforms

HETEROGENEOUS HPC, ARCHITECTURE OPTIMIZATION, AND NVLINK

Photonic Networks for Data Centres and High Performance Computing

Network connectivity controllers

Mainframe. Large Computing Systems. Supercomputer Systems. Mainframe

Server Forum Copyright 2014 Micron Technology, Inc

MPSoC Designs: Driving Memory and Storage Management IP to Critical Importance

Optimizing Data Center Networks for Cloud Computing

A+ Guide to Managing and Maintaining Your PC, 7e. Chapter 1 Introducing Hardware

Intel Ethernet Switch Load Balancing System Design Using Advanced Features in Intel Ethernet Switch Family

Intel Xeon +FPGA Platform for the Data Center

Memory Channel Storage ( M C S ) Demystified. Jerome McFarland

Implementing a unified Datacenter architecture for service providers

Accelerating the Data Plane With the TILE-Mx Manycore Processor

Fibre Channel over Ethernet in the Data Center: An Introduction

Power Efficiency Comparison: Cisco UCS 5108 Blade Server Chassis and Dell PowerEdge M1000e Blade Enclosure

From Ethernet Ubiquity to Ethernet Convergence: The Emergence of the Converged Network Interface Controller

Outline. Introduction. Multiprocessor Systems on Chip. A MPSoC Example: Nexperia DVP. A New Paradigm: Network on Chip

Going Linux on Massive Multicore

Accelerating I/O- Intensive Applications in IT Infrastructure with Innodisk FlexiArray Flash Appliance. Alex Ho, Product Manager Innodisk Corporation

Vorlesung Rechnerarchitektur 2 Seite 178 DASH

Network Virtualization and Data Center Networks Data Center Virtualization - Basics. Qin Yin Fall Semester 2013

GadgetGatewayIa Configurable LON to IP Router and/or Remote Packet Monitor. ANSI (LonTalk ) and ANSI 852 (IP) standards based.

HP Moonshot: An Accelerator for Hyperscale Workloads

PCIe Over Cable Provides Greater Performance for Less Cost for High Performance Computing (HPC) Clusters. from One Stop Systems (OSS)

præsentation oktober 2011

A Survey on ARM Cortex A Processors. Wei Wang Tanima Dey

Managing Data Center Power and Cooling

Data Sheet FUJITSU Server PRIMERGY CX400 M1 Multi-Node Server Enclosure

High speed pattern streaming system based on AXIe s PCIe connectivity and synchronization mechanism

OpenSPARC T1 Processor

Management of VMware ESXi. on HP ProLiant Servers

Netvisor Software Defined Fabric Architecture

HP ProLiant BL660c Gen9 and Microsoft SQL Server 2014 technical brief

SeaMicro SM Server

An Alternative Storage Solution for MapReduce. Eric Lomascolo Director, Solutions Marketing

Embedded Systems: map to FPGA, GPU, CPU?

How To Build An Ark Processor With An Nvidia Gpu And An African Processor

PCI Express and Storage. Ron Emerick, Sun Microsystems

Virtualised MikroTik

I/O Virtualization Using Mellanox InfiniBand And Channel I/O Virtualization (CIOV) Technology

Motherboard- based Servers versus ATCA- based Servers

From Bus and Crossbar to Network-On-Chip. Arteris S.A.

Zadara Storage Cloud A

Advancing Applications Performance With InfiniBand

Pedraforca: ARM + GPU prototype

Chapter 2 Heterogeneous Multicore Architecture

Making Multicore Work and Measuring its Benefits. Markus Levy, president EEMBC and Multicore Association

Resource Efficient Computing for Warehouse-scale Datacenters

Cisco UCS B-Series M2 Blade Servers

AXIe: AdvancedTCA Extensions for Instrumentation and Test

The Internet of Things: Opportunities & Challenges

Secure Containers. Jan Imagination Technologies HGI Dec, 2014 p1

GPU System Architecture. Alan Gray EPCC The University of Edinburgh

High-Speed SERDES Interfaces In High Value FPGAs

SGI High Performance Computing

Transcription:

Scaling Mobile Compute to the Data Center John Goodacre Director Technology and Systems, ARM Ltd. Cambridge Professor Computer Architectures, APT. Manchester

EuroServer Project EUROSERVER is a European commission FP7 part-funded project which is combining the technology trends of nanotechnology 3D integration, low-power SoC processor integration and the impossible requirements from next generation compute to investigate and build a solution for scalable, cost effective and flexible ARM-based micro-server system architecture suitable across multiple markets. 2

Why: The Perfect Storm? The ability to inexpensively extract heat from chips of any size maxed out. The ability to lower voltages with decreasing feature size slowed dramatically. The design complexity of single core microprocessors hits point of diminishing returns where more transistors could add little to the core s per cycle performance. We are approaching a limit on economically viable off-chip interconnect with the available technologies because of electrical issues. The cost of increasing wire-based signalling rates begins to grow considerably, especially in power and complexity of the interface circuits. The economics of chip production stops the growth in size of die. The electrical and power issues associated with driving off of a chip at high rates through inexpensive commodity packaging (such as found on commodity DIMMs) to a microprocessor chip more than a few inches away reached a point where further improvements become fairly difficult and/or expensive. 3

Trends in Data Centre In summary, it s a compute/thermal density issue

Scalability Beyond SMP Systems with Non-Unified (coherent) Get a few more CPUs within the coherent processor domain Challenges of non-unified access latencies and bandwidth Expensive in terms of power consumption Communicating systems through high-speed IO Opportunities for integration of IO directly onto processor bus High-latency through driver and IO abstractions System architecture defines software architecture Extend processor bus system wide (global addressing) Application virtual- inter-processor communication system-wide Caches need to be invisible to software, but coherence protocols will not scale

Scalability Concept: ARM Compute Unit Coherent view into hierarchy of this compute unit Compute Unit Local Coherent Interconnect Path to local addressable by this unit Path to address locations not accessible by this unit The Compute Unit unifies a number of localized compute resources Potentially both general purpose and other local heterogeneous accelerators Provides coherent and symmetric access across local resources Enabling a SMP capable operating system and resource sharing Each Compute Unit is registered at a partition within a system s global address space (GAS), including units with heterogeneous capability Any unit can access any remote location in the GAS (including cached) DMA can transfer between (virtually address cached) partitions

ARM: Example #1 Heterogeneous Compute Unit AMBA Extensions Interface (Master) Access to remote partitions Cortex-A15 Cluster CPU 0 L2 Cache CPU 2 CPU 1 CPU 3 Cortex-A7 Cluster CPU 0 L2 Cache CPU 2 CPU 1 CPU 3 Mali T600 series GPU Shader Core 0 Shader Core 1 L2 Cache SMMU Shader Core 2 Shader Core 3 Debug & Trace NIC 400 CoreSight Resources Mgt System Power JTAG & Trace PMIC/ APB Bus Cache Coherent Interconnect (CCI-400) AMBA Extensions Interface (Master) Unified coherent virtual TrustZone enabled DMC-400 2012 Compute Subsystem LPDDR2/DDR3 Controller DDR PHY or DDR Memory Local partition OpenCL TBA - H.S.A. Shared pointers Shared data Shared dispatch AMBA Extensions Interface (Slave) On-Chip Memories (RAM, ROM) Requests from remote partitions NIC 400 Base Peripheral

Example #2: 32-way ARM Cortex-A50 Series Up to 24 ports of AMBA 4 AXI4/ACE-Lite interfaces that support: Any ARM compatible CPU or GPU ARM compatible hardware accelerators ARM compatible high speed mastering I/O DSP or other compatible 3 rd party accelerator

EUROSERVER Approach: Next Generation Compute Market Competition is good one size doesn t fit all Make it flexible Critical mass is good a problem shared Keep it compatible Innovation is expensive and time consuming. Exploiting it even more so Collaborate appropriately Technology Towards the end of the road Utilize and extend the advancements in nanotechnology integration Increasingly high costs, NRE, Recurring Make the best economically accessible SoC, 3DIC, FDSOI/FINFET Use Embedded processors now feature capable for new markets The Common Requirement More for the same, more for less, specialized, specific, accepted by the critical mass software community leverage common Compute Units 9

Headline Objectives EUROSERVER will design and prototype technology, architecture, and systems software for the next generation of "Micro-Servers" to be used in building data centers. We will progress on the current state of the art in the following domains: Reduced Energy consumption by: (i) using ARM (64-bit) cores, which are the world-leading low-power processors, (ii) drastically reducing the core-to- distance by using novel silicon interposer packaging technology, and (iii) improving on the "energy proportionality Reduced Cost to build and operate each microserver, owing to (i) improved manufacturing yield when multiple smaller chiplets are placed on a larger silicon interposer(2.5d), and (ii) reduced physical volume of the packaged interposer module and (iii) and energy efficient semiconductor process (FDSOI) Better Software efficiency: Next Generation system SW that (i) manages the resources in a server that consists of multiple coherence-islands, and (ii) isolates and protects the multiple workloads from each other when they use the shared server resources of I/O, storage,, and interconnects. 10

Physical Distance: #1 challenge to address? What is a pico Joule? Joule is the energy of a given number of Watts of power for a Second duration For example 1 Joule is 1 Watt for 1 Second Consuming 20pJ at exascale level means delivering 20MW for 1 Second Examples of the issue: DDR 50-100mJ to deliver 1GB/s 1 bit of data across PCIe over 100pJ 1 bit of data 3-5mm along today s conductors ~1pJ/bit 1 DP flop ~20pJ Optical becomes a cheaper conductor only on distances more than 15-20cm

EUROSERVER Vision Low power Low cost SW efficiency Software interaction Classical blade architecture Active Interposer Integration EUROSERVER server on a chip Euroserver redesigns the entreprise server Lower cost through Improved Chiplet Yielding and Flexible System Integration Energy efficiency : low-power 64 bit processor, Everything local, Software Mutualization and sharing of I/O and common resources 12

EUROSERVER Applications Application Area Data Centres & Cloud Telecom infrastructures High-end Embedded Typical workloads Web-server hosting (LAMP/WAMP) Distributed databases (HADOOP) OLAP, OLTP workloads Traditional, relational databases (MySQL) Network communications Vehicle on-board computer Automatic vehicle location tracking Advanced security and surveillance 13

Focus of Developements 1. Next Generation Compute System Architecture Scale-out and flexible heterogeneous compute Define the Unimem model of a virtual-mappable shared GAS 2. Nanoscale, silicon-on-silicon 3D integration and IP Chiplet concept in 3DIC integration and test Heterogeneous silicon interface bridging compressing controller 3. Software Architecture and Frameworks Utilize the Unimem hierarchy Resource sharing and system wide reallocation 4. Applicability of Euroserver solution a) Device PCB Realization: Development System b) Embedded Mircoservers: Wireless Basestation c) Scale-out Servers: Cloud Services 14

Focus Area 1 / 4 System and Memory Architecture System Architecture Unimem Unified Memory Multiple independent nodes Heterogeneous IO and compute within and between nodes All share access to a common GAS Provide consistent/coherent access to all of its local space Topology Agnostic Address based routing across GAS Hierarchical address ownership Globally aware virtualization layer Sharing common IO/Peripherals System-wide power management Multiple coherent Islands Heterogeneous Compute 40-48bits Coherent Local Memory Essential IO and Peripherals Access to shared GAS Additional 40-48bits PA Shared IO and Peripheral Memory Model Any node can locally VA map any GAS location Single node can cache VA map of shared GAS 15

Thermal/ Cooling Peripherals Storage Power /Ctrl/ Clock Fabric & I/O Connectors Connector s Focus Area 2/5: Nanoscale Integration Actual configurations to be confirmed Compute chiplets Fabric & I/O chiplet Interposer -Structure of chiplets - IP to bridge silicon -Technology to realize Memory chiplets Thermal Solution Euroserver device Memory / storage Power/Clock Ctrl Fabric & I/O Micro-server board view -How to reuse across different markets - Market specific requirements Microserver board 1 Microserver board 2 Microserver board N-1 Microserver board N 16

Modeling Device Cost Impact Consider a traditional 300mm2 2D-SoC Currently yielding means $400-800 per device In < 20nm, that cost nearly doubles! Assume a 60:40 split between logic best in < 20nm and that in >= 20nm + = Rest of SoC in Key IP/processors Replicated chiplets low cost > 20nm in <=20nm And 3D stack increases yield yet again. Both die are smaller so yield higher and hence ~20x lower cost, even when adding today s 3D cost 17

Example Packaged Devices On schedule for end 2015 Proposal for end 2016

Focus Area 3/5: Software Architecture Systems software level - Resource agregation System architecture level Chiplet level 19

Intralink (L1) Intralink (L1) Intralink (L1) Intralink (L1) System Architecture Chiplet: One or more CPU cores Level-0 Interconnect Single coherence island Compute Node 0 DMA Node: One or more chiplets Level-1 interconnect Shared IO (Ethernet) and Storage Compute Node 1 DMA Compute Node 2 DMA μserver: (EuroServer) 1 or more Nodes Scale-out server using Local-IO or HPC via Level-3 interconnect EuroServer System Compute Node. DMA ARM 64b ARM 64b ARM 64b ARM 64b ARM 64b ARM 64b ARM 64b ARM 64b HPC System: Multiple Nodes sharing Level-3 interconnect (topology agnostic) Node-SSD Intralink (L1) Local-IO FPGA/OpenC L Intralink (L1) Intralink (L1) Node-SSD Local-IO Node-SSD Local-IO (Optional) Global System Intralink (L3) Node-SSD Intralink (L1) Local-IO Each Coherence Island has its own local independent global (coherent) address space (GAS L ) Coherence Islands communicating through multi-level Interconnect Sharing via page mapping a common remote global address space (GAS R ) Either Remote DMA or direct Remote Load/Store from application virtual page mapping

EUROSERVER: Unimem Memory Model Today s server have simple DRAM or DEVICE types Sequentially consistent cached dram is very expensive.. Even more expensive to scale beyond a single processor socket Key observations of Euroserver No need for sequential consistency in DC workloads Applications tend to partition datasets and its access Best to place the processor (and its cache) near dataset of an application task Unimem extends today s model and enables: Maintains a consistent and coherent access from Node to local DRAM Adds access to any system-wide resource by any workload Allow local processors to cache local on remote accesses Allow dynamically change ownership of regions Seamlessly supported in today s communication and shared API Can be implemented efficiently using ARM + SoC design principles does not require modifications to software applications

Demonstrating low latency high bandwidth resource sharing Access anything anywhere Resources do not need to be duplicated per node Uses numa-like cost tables System level MMU enables multiple independent PA plus a shared virtually mappable PA Various OS and library services already ported TCP/UDP over direct access and DMA RAM stealing to create increase local pool mmap, and other shmem schemes Multi host shared NIC

Access to Petabytes of DRAM from in a single application Use the DRAM physically connected to different coherent islands in same and remote devices Allows a RAM demanding application access to capacities higher than can be supported by a single device Memory consistency rules allow peer (same package) to be have similar latencies to local

Scalable shared model support Allows multiple independent coherent islands to share global addresses Virtual mapped DMA copies Absolute shared address pointers Memory based sycronizations Can be managed via Numa OS No inter-island coherence protocol No coherence directory Direct coherent r/w between islands Pipelined for high bandwidth Native processor addressing for low latency communication

Low Latency Communications Either direct load/store for single transaction communication, or virtually mapped DMA for block transfers Aliased memories for broadcasts Native processor addresses used for nonabstracted communication Consistency rules allow data movement directly between LLC of nodes

Architecture enables duplicating only the resources the app needs Scaling using current commodity architecture + + + Then connect via some interface card via PCIe Scaling with EuroServer 26 Shared GAS Constructed using chiplets then direct virtually map remote and devices

Initial software model compliant discrete prototype Enables OS and library porting. Targeting KVM with NUMA guest Development of share IO resources and FPGA images Forms basis of subsequent silicon based compute node 32x Cortex-A53 (4x by 8xSMP) per packaged part Enables subsequent projects with silicon speed prototype

Focus Area 4/4 part-a: Demonstration of a Micro-server Tentative specs Form factor: standard ComExpress or Custom Size: ~ 8 x 12 cm Components: 1 Euroserver Device (compute node) Single board-to-carrier connector Local storage, local Network Fabric Interconnect Board management and control logic and sensors Local test, debug and peripherals to consider The Micro-server board will populate (in different number) both the Embedded Server and the Enterprise Server Thermal Solution 28 28

Key Focus 4/4 part-b: Demo in an Embedded Server Tentative specs Target form factor: Eurotech Mounted Mobile Computer Size: 13 x 25 x 8 cm Prototype targeted for rugged sealed enclosure and extended temperature range Modular system design Shared design for micro-server board and carrier board with Enterprise Server version Up to 2 micro-server slots Support for optional general purpose card (using 1 slot) for specific function or peripheral (Reference only) (Reference only) 9 to 36 Vdc in; 30W max; Passive conduction cooling Support for Backup battery pack to investigate 29

Key focus 4/4 part-c: Demo as an Enterprise Server Tentative Specs: Compatible form factor with standard server rack design (Reference only) Minimum density design 1 node pitch (16 nodes per rack width) It enables 64 micro-server cards (16 x 4) in a standard cabinet and 12 kw compute power in a 42U rack Carrier board design shared with Embedded Server version Population options: any combination of micro-server node cards and general purpose cards (each card uses one slot) Support for multiple thermal and cooling options 30

Summary ARM Processor technology has the capabilities for high performance compute Best to think more than swap the ISA Its likely the business around competition that will decide on how ARM is used in the data centre EUROSERVER is a next small step in taking a longer term view at how the ARM technology could enable a new silicon ecosystem around ARM in the data center 31