Scaling Mobile Compute to the Data Center John Goodacre Director Technology and Systems, ARM Ltd. Cambridge Professor Computer Architectures, APT. Manchester
EuroServer Project: EUROSERVER is a European Commission FP7 part-funded project that combines the technology trends of nanotechnology 3D integration and low-power SoC processor integration with the impossible requirements of next-generation compute, to investigate and build a scalable, cost-effective and flexible ARM-based micro-server system architecture suitable across multiple markets.
Why: The Perfect Storm? The ability to inexpensively extract heat from chips of any size has maxed out. The ability to lower voltages with decreasing feature size has slowed dramatically. The design complexity of single-core microprocessors has hit the point of diminishing returns, where more transistors add little to the core's per-cycle performance. We are approaching a limit on economically viable off-chip interconnect with the available technologies because of electrical issues. The cost of increasing wire-based signalling rates is beginning to grow considerably, especially in the power and complexity of the interface circuits. The economics of chip production stop the growth in die size. The electrical and power issues associated with driving off a chip at high rates through inexpensive commodity packaging (such as found on commodity DIMMs) to a microprocessor chip more than a few inches away have reached a point where further improvements become fairly difficult and/or expensive.
Trends in Data Centre: In summary, it's a compute/thermal density issue.
Scalability Beyond SMP: (1) Systems with non-unified (coherent) memory: get a few more CPUs within the coherent processor domain; challenges of non-unified access latencies and bandwidth; expensive in terms of power consumption. (2) Communicating systems through high-speed IO: opportunities for integrating IO directly onto the processor bus; high latency through driver and IO abstractions; system architecture defines software architecture. (3) Extend the processor bus system-wide (global addressing): application virtual-memory inter-processor communication system-wide; caches need to be invisible to software, but coherence protocols will not scale.
Scalability Concept: ARM Compute Unit. [Diagram labels: coherent view into the memory hierarchy of this compute unit; Compute Unit Local Coherent Interconnect; path to local memory addressable by this unit; path to address locations not accessible by this unit.] The Compute Unit unifies a number of localized compute resources, potentially both general-purpose processors and other local heterogeneous accelerators. It provides coherent and symmetric access across local resources, enabling an SMP-capable operating system and resource sharing. Each Compute Unit is registered at a partition within a system's global address space (GAS), including units with heterogeneous capability. Any unit can access any remote location in the GAS (including cached locations). DMA can transfer between (virtually addressed, cached) partitions.
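The partition idea on this slide can be illustrated with a small sketch. It is not the EUROSERVER implementation: the class names, the unit names `cu0`/`cu1` and the 1 TiB partition sizes are all assumptions made for illustration. It only shows how a registry of partitions lets a unit decide whether a global address is served locally or must leave through the path to remote partitions.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class Partition:
    base: int   # first global address of the partition
    size: int   # partition length in bytes
    unit: str   # hypothetical name of the owning Compute Unit


class GlobalAddressSpace:
    """Toy registry of Compute Unit partitions (illustrative only)."""

    def __init__(self):
        self._parts = []  # kept sorted by base address

    def register(self, unit, base, size):
        # Reject partitions that overlap an existing registration.
        for p in self._parts:
            if base < p.base + p.size and p.base < base + size:
                raise ValueError(f"{unit} overlaps partition of {p.unit}")
        self._parts.append(Partition(base, size, unit))
        self._parts.sort(key=lambda p: p.base)

    def owner(self, addr):
        # Find the last partition whose base is <= addr, then bounds-check.
        bases = [p.base for p in self._parts]
        i = bisect_right(bases, addr) - 1
        if i >= 0 and addr < self._parts[i].base + self._parts[i].size:
            return self._parts[i].unit
        return None

    def route(self, requester, addr):
        # "local" stays on the unit's coherent interconnect; "remote"
        # leaves through the path to other partitions.
        target = self.owner(addr)
        if target is None:
            raise LookupError(f"address {addr:#x} is not registered")
        return "local" if target == requester else "remote"


gas = GlobalAddressSpace()
gas.register("cu0", 0, 1 << 40)        # assumed 1 TiB partition
gas.register("cu1", 1 << 40, 1 << 40)  # assumed 1 TiB partition
print(gas.route("cu0", 0x1000), gas.route("cu0", (1 << 40) + 0x1000))
```

In hardware this decision would be made by address decoding in the interconnect rather than by a software lookup; the sketch only captures the ownership rule.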
ARM Example #1: Heterogeneous Compute Unit. [Block diagram: a Cortex-A15 cluster (CPU 0-3, L2 cache), a Cortex-A7 cluster (CPU 0-3, L2 cache) and a Mali T600 series GPU (shader cores 0-3, L2 cache, SMMU) attached to a Cache Coherent Interconnect (CCI-400); a DMC-400 LPDDR2/DDR3 controller with DDR PHY or DDR memory forming the local partition; NIC-400 interconnects for on-chip memories (RAM, ROM) and base peripherals; CoreSight debug and trace (JTAG & trace); system power and resource management (PMIC/APB bus). AMBA Extensions Interfaces: masters for access to remote partitions, a slave for requests from remote partitions. Annotations: unified coherent virtual memory; TrustZone enabled; 2012 Compute Subsystem; OpenCL; H.S.A. (TBA) with shared pointers, shared data and shared dispatch.]
Example #2: 32-way ARM Cortex-A50 Series. Up to 24 ports of AMBA 4 AXI4/ACE-Lite interfaces that support: any ARM-compatible CPU or GPU; ARM-compatible hardware accelerators; ARM-compatible high-speed mastering I/O; DSP or other compatible 3rd-party accelerators.
EUROSERVER Approach: Next Generation Compute. Market: competition is good, one size doesn't fit all: make it flexible. Critical mass is good, a problem shared: keep it compatible. Innovation is expensive and time consuming, and exploiting it even more so: collaborate appropriately. Technology: towards the end of the road: utilize and extend the advancements in nanotechnology integration. Increasingly high costs (NRE, recurring): make the best economically accessible (SoC, 3DIC, FDSOI/FinFET). Use: embedded processors are now feature-capable for new markets. The common requirement: more for the same, more for less, specialized, specific, accepted by the critical-mass software community: leverage common Compute Units.
Headline Objectives: EUROSERVER will design and prototype technology, architecture, and systems software for the next generation of "Micro-Servers" to be used in building data centers. We will progress on the current state of the art in the following domains. Reduced energy consumption by: (i) using ARM (64-bit) cores, which are the world-leading low-power processors, (ii) drastically reducing the core-to-memory distance by using novel silicon interposer packaging technology, and (iii) improving on the "energy proportionality". Reduced cost to build and operate each microserver, owing to (i) improved manufacturing yield when multiple smaller chiplets are placed on a larger silicon interposer (2.5D), (ii) reduced physical volume of the packaged interposer module, and (iii) an energy-efficient semiconductor process (FDSOI). Better software efficiency: next-generation system SW that (i) manages the resources in a server that consists of multiple coherence islands, and (ii) isolates and protects the multiple workloads from each other when they use the shared server resources of I/O, storage, memory, and interconnects.
Physical Distance: #1 challenge to address? What is a picojoule? A joule is the energy delivered by a given number of watts of power over one second; for example, 1 joule is 1 watt for 1 second. Consuming 20 pJ per operation at exascale level means delivering 20 MW for every second of operation. Examples of the issue: DDR: 50-100mJ to deliver 1GB/s; 1 bit of data across PCIe: over 100 pJ; 1 bit of data 3-5 mm along today's conductors: ~1 pJ/bit; 1 DP flop: ~20 pJ. Optical becomes a cheaper conductor only over distances of more than 15-20 cm.
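The arithmetic behind the 20 MW figure can be checked with a short sketch. It assumes that "exascale" means 10^18 operations per second; the function name and the 1 GB/s PCIe stream are illustrative choices, not figures from the slide.

```python
PICO = 1e-12  # one picojoule expressed in joules


def power_watts(energy_per_event_joules, events_per_second):
    # Power is energy per event multiplied by the event rate (J * 1/s = W).
    return energy_per_event_joules * events_per_second


# 20 pJ per double-precision flop at an assumed 10^18 flop/s.
exascale_power = power_watts(20 * PICO, 1e18)

# Illustrative: streaming 1 GB/s (8e9 bit/s) over PCIe at 100 pJ per bit.
pcie_power = power_watts(100 * PICO, 8e9)

print(f"exascale compute: {exascale_power / 1e6:.1f} MW")
print(f"1 GB/s over PCIe: {pcie_power:.2f} W")
```

The same function applies to any per-bit or per-operation energy on the slide, which makes it easy to compare data movement against computation at a given rate.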
EUROSERVER Vision: low power, low cost, SW efficiency. [Figure labels: software interaction; classical blade architecture; active interposer integration; EUROSERVER server on a chip.] EUROSERVER redesigns the enterprise server. Lower cost: through improved chiplet yield and flexible system integration. Energy efficiency: low-power 64-bit processor, everything local. Software: mutualization and sharing of I/O and common resources.
EUROSERVER Applications (application area: typical workloads). Data Centres & Cloud: web-server hosting (LAMP/WAMP), distributed databases (HADOOP), OLAP and OLTP workloads, traditional relational databases (MySQL). Telecom infrastructures: network communications. High-end Embedded: vehicle on-board computer, automatic vehicle location tracking, advanced security and surveillance.
Focus of Developments. 1. Next Generation Compute System Architecture: scale-out and flexible heterogeneous compute; define the Unimem model of a virtual-mappable shared GAS. 2. Nanoscale, silicon-on-silicon 3D integration and IP: chiplet concept in 3DIC integration and test; heterogeneous silicon interface bridging compressing controller. 3. Software Architecture and Frameworks: utilize the Unimem hierarchy; resource sharing and system-wide reallocation. 4. Applicability of the EUROSERVER solution: a) Device PCB Realization: Development System; b) Embedded Microservers: Wireless Basestation; c) Scale-out Servers: Cloud Services.
Focus Area 1/4: System and Memory Architecture. System architecture: Unimem unified memory; multiple independent nodes; heterogeneous IO and compute within and between nodes; all share access to a common GAS; each provides consistent/coherent access to all of its local space. Topology agnostic: address-based routing across the GAS; hierarchical address ownership. Globally aware virtualization layer: sharing common IO/peripherals; system-wide power management. Multiple coherent islands: heterogeneous compute; 40-48 bits of coherent local memory; essential IO and peripherals. Access to shared GAS: additional 40-48 bits of PA; shared IO and peripherals. Memory model: any node can locally VA-map any GAS location; a single node can cache a VA map of shared GAS.
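To put the 40-48 bit address widths into perspective, a brief sketch converts an address width into a byte-addressable capacity. The helper name is hypothetical and byte addressing is assumed.

```python
TIB = 1 << 40  # bytes in one tebibyte


def addressable_bytes(address_bits):
    # A byte-addressed space of n bits spans 2**n bytes.
    return 1 << address_bits


for bits in (40, 48):
    print(f"{bits}-bit space: {addressable_bytes(bits) // TIB} TiB")
```

Under these assumptions a 40-bit space covers 1 TiB and a 48-bit space covers 256 TiB, so the additional shared PA range is what takes a node beyond the capacity of its own local space.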
Focus Area 2/4: Nanoscale Integration (actual configurations to be confirmed). [Figure: a EUROSERVER device built from compute chiplets, memory chiplets and a fabric & I/O chiplet on an interposer, with a thermal solution; a micro-server board view showing the EUROSERVER device with memory/storage, power/clock control, fabric & I/O, peripherals, thermal/cooling and connectors; microserver boards 1 to N.] Device-level questions: structure of chiplets; IP to bridge silicon; technology to realize. Board-level questions: how to reuse across different markets; market-specific requirements.
Modeling Device Cost Impact: Consider a traditional 300 mm2 2D SoC. Current yields mean $400-800 per device. In < 20 nm, that cost nearly doubles! Assume a 60:40 split between logic that is best in < 20 nm and logic that can stay in >= 20 nm. [Figure: key IP/processors as replicated chiplets in <= 20 nm, plus the rest of the SoC in low-cost > 20 nm.] A 3D stack increases yield yet again: both dies are smaller, so yield is higher, and hence ~20x lower cost, even when adding today's 3D cost.
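The yield argument can be made concrete with a toy model. The slide does not state which yield model the project used; the sketch below assumes the textbook Poisson model, yield = exp(-area x defect density), and every number in it (defect density, cost per mm2, assembly cost) is a hypothetical placeholder in arbitrary units. It ignores wafer-edge losses and test cost.

```python
import math


def die_yield(area_mm2, defects_per_cm2):
    # Poisson yield model: probability that a die has zero defects.
    return math.exp(-(area_mm2 / 100.0) * defects_per_cm2)


def cost_per_good_die(area_mm2, cost_per_mm2, defects_per_cm2):
    # Silicon cost of one die divided by the fraction of dies that work.
    return area_mm2 * cost_per_mm2 / die_yield(area_mm2, defects_per_cm2)


D0 = 0.2             # hypothetical defects per cm^2
COST_ADVANCED = 2.0  # hypothetical cost per mm^2 below 20 nm
COST_MATURE = 1.0    # hypothetical cost per mm^2 at 20 nm and above
ASSEMBLY = 50.0      # hypothetical 3D/interposer assembly cost

# Monolithic 300 mm^2 SoC built entirely in the advanced node.
monolithic = cost_per_good_die(300, COST_ADVANCED, D0)

# 60:40 split: 180 mm^2 in the advanced node, 120 mm^2 in the mature node.
chiplets = (cost_per_good_die(180, COST_ADVANCED, D0)
            + cost_per_good_die(120, COST_MATURE, D0)
            + ASSEMBLY)

print(f"monolithic: {monolithic:.0f}, chiplets: {chiplets:.0f}")
```

With these placeholder values the split design is cheaper because each smaller die yields better and only part of the area pays the advanced-node price; the size of the saving depends entirely on the assumed inputs.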
Example Packaged Devices. [Figure captions: on schedule for end 2015; proposal for end 2016.]
Focus Area 3/4: Software Architecture. Systems software level: resource aggregation. System architecture level. Chiplet level.
System Architecture. Chiplet: one or more CPU cores; Level-0 interconnect; single coherence island. Node: one or more chiplets; Level-1 interconnect; shared IO (Ethernet) and storage. μserver (EuroServer): one or more nodes; scale-out server using Local-IO, or HPC via Level-3 interconnect. HPC system: multiple nodes sharing a Level-3 interconnect (topology agnostic). [Figure: a EuroServer system of compute nodes, each with DMA and ARM 64-bit cores, joined by Intralink (L1) to node SSDs, Local-IO and an FPGA/OpenCL block, with an optional global system Intralink (L3).] Each coherence island has its own local, independent global (coherent) address space (GAS_L). Coherence islands communicate through the multi-level interconnect, sharing by page-mapping a common remote global address space (GAS_R), using either remote DMA or direct remote load/store from an application virtual page mapping.
EUROSERVER: Unimem Memory Model. Today's servers have simple DRAM or DEVICE memory types. Sequentially consistent cached DRAM is very expensive, and even more expensive to scale beyond a single processor socket. Key observations of EUROSERVER: there is no need for sequential consistency in DC workloads; applications tend to partition datasets and their access; it is best to place the processor (and its cache) near the dataset of an application task. Unimem extends today's model: it maintains consistent and coherent access from a node to its local DRAM; adds access to any system-wide resource by any workload; allows local processors to cache local memory on remote accesses; allows ownership of regions to change dynamically; is seamlessly supported in today's communication and shared-memory APIs; can be implemented efficiently using ARM + SoC design principles; and does not require modifications to software applications.
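One way to picture the ownership rule is a table that records which node may cache each region. The sketch below is an interpretation, not the Unimem specification: the class, the node names and the "cached"/"uncached" labels are assumptions, and the cache maintenance a real ownership transfer would need is only noted in a comment.

```python
class RegionOwnership:
    """Toy model: each shared region is cacheable by exactly one node."""

    def __init__(self):
        self._owner = {}  # region name -> owning node

    def add_region(self, region, owner):
        self._owner[region] = owner

    def mapping_mode(self, node, region):
        # Only the owner maps the region cacheable; every other node
        # still reaches it, but with uncached accesses, so no
        # inter-node coherence protocol is needed.
        return "cached" if self._owner[region] == node else "uncached"

    def transfer(self, region, new_owner):
        # Assumption: the old owner would clean and invalidate its
        # cached lines before this point; that step is not modelled.
        old_owner = self._owner[region]
        self._owner[region] = new_owner
        return old_owner


table = RegionOwnership()
table.add_region("dataset_a", "node0")
before = table.mapping_mode("node1", "dataset_a")
table.transfer("dataset_a", "node1")
after = table.mapping_mode("node1", "dataset_a")
print(before, after)
```

Moving ownership to the node that runs the task follows the slide's observation that the processor and its cache are best placed near the dataset.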
Demonstrating low-latency, high-bandwidth resource sharing. Access anything anywhere: resources do not need to be duplicated per node; uses NUMA-like cost tables. A system-level MMU enables multiple independent PA spaces plus a shared, virtually mappable PA space. Various OS and library services already ported: TCP/UDP over direct access and DMA; RAM stealing to increase the local pool; mmap and other shmem schemes; multi-host shared NIC.
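A NUMA-like cost table can drive a simple placement decision such as RAM stealing. The sketch below is illustrative only: the function, the node names, the cost values and the free-memory figures are assumptions, not data from the project.

```python
def pick_memory_node(requester, size_gib, cost, free_gib):
    # Keep only the nodes that can satisfy the request at all.
    candidates = [n for n, avail in free_gib.items() if avail >= size_gib]
    if not candidates:
        return None
    # Among those, choose the node that is cheapest to reach.
    return min(candidates, key=lambda n: cost[requester][n])


# Hypothetical relative access costs (lower is closer).
COST = {
    "n0": {"n0": 10, "n1": 12, "n2": 40},
    "n1": {"n0": 12, "n1": 10, "n2": 40},
    "n2": {"n0": 40, "n1": 40, "n2": 10},
}
FREE_GIB = {"n0": 1, "n1": 8, "n2": 64}  # hypothetical free memory

print(pick_memory_node("n0", 4, COST, FREE_GIB))
```

A request that fits locally stays local; larger requests spill to the cheapest node with enough free memory, which mirrors how a NUMA-aware allocator treats distance tables.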
Access to petabytes of DRAM from within a single application. Uses the DRAM physically connected to different coherent islands in the same and remote devices. Allows a RAM-demanding application access to capacities higher than can be supported by a single device. Memory consistency rules allow peers (same package) to have latencies similar to local.
Scalable shared model support. Allows multiple independent coherent islands to share global addresses: virtually mapped DMA copies; absolute shared address pointers; memory-based synchronization; can be managed via a NUMA OS. No inter-island coherence protocol; no coherence directory. Direct coherent r/w between islands: pipelined for high bandwidth; native processor addressing for low-latency communication.
Low Latency Communications: either direct load/store for single-transaction communication, or virtually mapped DMA for block transfers. Aliased memories for broadcasts. Native processor addresses used for non-abstracted communication. Consistency rules allow data movement directly between the LLCs of nodes.
Architecture enables duplicating only the resources the app needs. Scaling using current commodity architecture: replicate complete servers [figure], then connect them through an interface card over PCIe. Scaling with EuroServer: a shared GAS constructed using chiplets, then directly virtually map remote memory and devices.
Initial software-model-compliant discrete prototype. Enables OS and library porting, targeting KVM with a NUMA guest. Development of shared IO resources and FPGA images. Forms the basis of the subsequent silicon-based compute node: 32x Cortex-A53 (4x by 8x SMP) per packaged part. Enables subsequent projects with a silicon-speed prototype.
Focus Area 4/4 part-a: Demonstration of a Micro-server. Tentative specs: form factor: standard ComExpress or custom; size: ~8 x 12 cm. Components: 1 EUROSERVER device (compute node); single board-to-carrier connector; local storage and local network; fabric interconnect; board management and control logic and sensors; thermal solution; local test, debug and peripherals to consider. The micro-server board will populate (in different numbers) both the Embedded Server and the Enterprise Server.
Focus Area 4/4 part-b: Demo in an Embedded Server. Tentative specs: target form factor: Eurotech Mounted Mobile Computer; size: 13 x 25 x 8 cm. Prototype targeted for a rugged sealed enclosure and extended temperature range. Modular system design: design of the micro-server board and carrier board shared with the Enterprise Server version; up to 2 micro-server slots; support for an optional general-purpose card (using 1 slot) for a specific function or peripheral. 9 to 36 Vdc in; 30 W max; passive conduction cooling. Support for a backup battery pack to be investigated. [Figures: reference only.]
Focus Area 4/4 part-c: Demo as an Enterprise Server. Tentative specs: form factor compatible with standard server rack design [figure: reference only]. Minimum-density design: 1 node pitch (16 nodes per rack width). It enables 64 micro-server cards (16 x 4) in a standard cabinet and 12 kW of compute power in a 42U rack. Carrier board design shared with the Embedded Server version. Population options: any combination of micro-server node cards and general-purpose cards (each card uses one slot). Support for multiple thermal and cooling options.
Summary: ARM processor technology has the capabilities for high-performance compute. It is best to think beyond simply swapping the ISA. It's likely the business around competition that will decide how ARM is used in the data centre. EUROSERVER is a next small step in taking a longer-term view of how ARM technology could enable a new silicon ecosystem around ARM in the data center.