NUMA-like architecture for Microservers

Similar documents
Scaling Mobile Compute to the Data Center. John Goodacre

Seeking Opportunities for Hardware Acceleration in Big Data Analytics

A Study on the Scalability of Hybrid LS-DYNA on Multicore Architectures

Power Efficiency Comparison: Cisco UCS 5108 Blade Server Chassis and IBM FlexSystem Enterprise Chassis

LS DYNA Performance Benchmarks and Profiling. January 2009

Cloud Computing through Virtualization and HPC technologies

Accelerate Cloud Computing with the Xilinx Zynq SoC

High Performance Computing in CST STUDIO SUITE

High Performance. CAEA elearning Series. Jonathan G. Dudley, Ph.D. 06/09/ CAE Associates

Microsoft Private Cloud Fast Track Reference Architecture

Pedraforca: ARM + GPU prototype

GPU System Architecture. Alan Gray EPCC The University of Edinburgh

Solving I/O Bottlenecks to Enable Superior Cloud Efficiency

Copyright 2013, Oracle and/or its affiliates. All rights reserved.

ECLIPSE Performance Benchmarks and Profiling. January 2009

Infrastructure Matters: POWER8 vs. Xeon x86

Microsoft Private Cloud Fast Track

Power Efficiency Comparison: Cisco UCS 5108 Blade Server Chassis and Dell PowerEdge M1000e Blade Enclosure

Parallel Programming Survey

Unified Computing Systems

How System Settings Impact PCIe SSD Performance

Cloud Storage. Parallels. Performance Benchmark Results. White Paper.

Getting Started with the Xilinx Zynq All Programmable SoC Mini-ITX Development Kit

Optimizing Web Infrastructure on Intel Architecture

IBM System x family brochure

FPO. Expanding Intel Architecture Flexibility in the Data Center. Markus Leberecht Data Center Solutions Architect, Intel EMEA March 20, 2013

ECLIPSE Best Practices Performance, Productivity, Efficiency. March 2009

International Journal of Computer & Organization Trends Volume20 Number1 May 2015

præsentation oktober 2011

Networking Virtualization Using FPGAs

VPX Implementation Serves Shipboard Search and Track Needs

How To Write An Article On An Hp Appsystem For Spera Hana

GUEST OPERATING SYSTEM BASED PERFORMANCE COMPARISON OF VMWARE AND XEN HYPERVISOR

Oracle Database Scalability in VMware ESX VMware ESX 3.5

All Programmable Logic. Hans-Joachim Gelke Institute of Embedded Systems. Zürcher Fachhochschule

Concepts Introduced in Chapter 6. Warehouse-Scale Computers. Important Design Factors for WSCs. Programming Models for WSCs

SERVER CLUSTERING TECHNOLOGY & CONCEPT

Sun Microsystems Special Promotions for Education and Research January 9, 2007

Sun Constellation System: The Open Petascale Computing Architecture

Evoluzione dell Infrastruttura di Calcolo e Data Analytics per la ricerca

Can High-Performance Interconnects Benefit Memcached and Hadoop?

I/O Performance of Cisco UCS M-Series Modular Servers with Cisco UCS M142 Compute Cartridges

24/12/8 UP Server Nodes in 3U

Emerging storage and HPC technologies to accelerate big data analytics Jerome Gaysse JG Consulting

HPC Update: Engagement Model

The virtualization of SAP environments to accommodate standardization and easier management is gaining momentum in data centers.

Enabling Technologies for Distributed Computing

HP Moonshot: An Accelerator for Hyperscale Workloads

The Transition to PCI Express* for Client SSDs

Cisco UCS B-Series M2 Blade Servers

Recommended hardware system configurations for ANSYS users

Petascale Software Challenges. Piyush Chaudhary High Performance Computing

Exploiting Remote Memory Operations to Design Efficient Reconfiguration for Shared Data-Centers over InfiniBand

Performance Evaluation of VMXNET3 Virtual Network Device VMware vsphere 4 build

HPC Cluster Decisions and ANSYS Configuration Best Practices. Diana Collier Lead Systems Support Specialist Houston UGM May 2014

Motherboard- based Servers versus ATCA- based Servers

The High Performance Internet of Things: using GVirtuS for gluing cloud computing and ubiquitous connected devices

Building an energy dashboard. Energy measurement and visualization in current HPC systems

Support a New Class of Applications with Cisco UCS M-Series Modular Servers

Cray Gemini Interconnect. Technical University of Munich Parallel Programming Class of SS14 Denys Sobchyshak

Cisco UCS and Fusion- io take Big Data workloads to extreme performance in a small footprint: A case study with Oracle NoSQL database

Parallel Algorithm Engineering

How To Build A Cloud Computer

Clusters: Mainstream Technology for CAE

MaxDeploy Ready. Hyper- Converged Virtualization Solution. With SanDisk Fusion iomemory products

Cisco for SAP HANA Scale-Out Solution on Cisco UCS with NetApp Storage

Michael Kagan.

David Rioja Redondo Telecommunication Engineer Englobe Technologies and Systems

UCS M-Series Modular Servers

System Models for Distributed and Cloud Computing

HUAWEI Tecal E6000 Blade Server

Patriot Hardware and Systems Software Requirements

Scaling from Datacenter to Client

A+ Guide to Managing and Maintaining Your PC, 7e. Chapter 1 Introducing Hardware

DIABLO TECHNOLOGIES MEMORY CHANNEL STORAGE AND VMWARE VIRTUAL SAN : VDI ACCELERATION

SGI High Performance Computing

HP ProLiant Gen8 vs Gen9 Server Blades on Data Warehouse Workloads

ICRI-CI Retreat Architecture track

White paper. ATCA Compute Platforms (ACP) Use ACP to Accelerate Private Cloud Deployments for Mission Critical Workloads. Rev 01

LS-DYNA Best-Practices: Networking, MPI and Parallel File System Effect on LS-DYNA Performance

Technology and Cost Considerations for Cloud Deployment: Amazon Elastic Compute Cloud (EC2) Case Study

HP Proliant BL460c G7

Avid ISIS


Energy Efficient MapReduce

Built for Business. Ready for the Future.

Recent Advances in HPC for Structural Mechanics Simulations

Enabling Technologies for Distributed and Cloud Computing

EDUCATION. PCI Express, InfiniBand and Storage Ron Emerick, Sun Microsystems Paul Millard, Xyratex Corporation

A Taxonomy and Survey of Energy-Efficient Data Centers and Cloud Computing Systems

Amazon EC2 XenApp Scalability Analysis

Operating System Support for Multiprocessor Systems-on-Chip

Transcription:

Foundation for Research and Technology Hellas (FORTH) Institute of Computer Science (ICS) NUMA-like architecture for Microservers Iakovos Mavroidis (jacob@ics.forth.gr) FORTH-ICS, Greece MPSoC 14, July 8, Margaux, France

Outline Cha a te isti s a d e ui e e ts of today s Data e te s Importance of Energy Efficiency Energy Proportionality Small form-factor Microservers Energy Efficient Architecture Intel or ARM? EUROSERVER approach EUROSERVER Architecture Unimem Architecture Testing Environment FMC Fan-Out Daughtercard 2

Why is Size/Power/Energy Efficiency Important? Utility costs Management Cost Improve Size Power and Cooling costs Improve PUE (Power Usage Effectiveness) Electricity growth 56% increase 05-10 (US increase 36%) 19% increase 11-12 7% increase in 13 1.1%-1.5% of global electricity 10 (US 1.7-2.2%) Google report PUE=1.16 in 10 PUE=1.14 in 11 Environmental friendliness programs Energy Star (US), TopRunner (Japan), FOE (Switzerland) 3

Energy Proportionality in Datacenters Most of the time at 10 50% Challenge: Power not proportional to utilization Server underutilized Two approaches: Turn off hardware when not used Dynamic Voltage Scaling (DVS) Clock Gating Keep CPU utilization high Multiple Virtual Machines Overprovisioning QoS guarantees? 4 4

Aligning energy use with workloads 5

Why many small cores? Scale out applications require large number of cores no brute processors Smaller cores more power-efficient for several workloads static web page serving, entry dedicated hosting, and basic content delivery, among others Less power consumption (sub-10w levels) Lower running costs, lower PUE Energy proportionality Easy to turn off idle cores (parts of the system) Easier maintenance and management Small form factor allows tightly packed clusters and less physical space Easier more efficient implementation? CPU partitioning instead of sharing (no sharing overhead) But No compute power for single-threaded application Hard to parallelize an application Not so efficient for HPC domain? 6

Energy-efficient architecture: Microservers Low-power components CPU (ARM, Intel Atom) Memory () Storage (NVM) Small form factor Small CPUs Fast interconnections (high-speed serial links) High integration Microservers are still in their infancy 7

Intel or ARM in Microservers? Diversity of ARM ecosystem Custom microservers using ARM-based SoCs Hundreds of customers More than 50 variations of Intel Atom and Xeon Xeon E3 suitable for webscale applications, online gaming, cloud Atom C00 suitable for lightweight scale-out workloads Hard to compete hundreds of chip-makers (Samsung Exynos, AMD Opteron A1100 with 8 A-57, APM s X-Gene, Google, Facebook, However Intel first released 64-bit SoC with ECC (Atom Avoton) Intel 3-D technology smaller die area less energy consumption Most datacenter software run on x86 (porting on ARM in progress) Calxeda: ARM-based servers didn't have the software support or hardware needed to win enterprise customers 8

EUROSERVER Challenges and Approach Energy-efficient architecture Use of highly-integrated, high-performance, energy-efficient components in a Microserver arcitecture Many low-power 3D - Interposer Technology main memory NVM memory for storage Suitable from cloud data-centers to embedded applications Unimem Architecture (Focus of this presentation) Take advantage of fast communication Scalable architecture Many coherent islands Global Address Space Facilitate maintenance and management Small form factor Energy proportionality 9

EUROSERVER Architecture Chiplet: Cores+L0 Coherent Interconnect 1 coherence island μserver: Nodes+L2 Interconnect Scale-out or HPC Node: Chiplets+L1 Interconnect Shared IO and Storage EuroServer System ARM. 64b Node-SSD Local-IO memory memory ARM 64b ARM 64b ARM 64b Local-IO memory ARM 64b ARM 64b Node-SSD memory Node-SSD memory ARM 64b memory memory memory Compute Node. Compute Node 2 Compute Node 1 Compute Node 0 ARM 64b Local-IO Node-SSD Local-IO Interlink System: Nodes+L3 Interconnect other μservers Ethernet John Goodacre, ARM Clustered Architecture: Coherence Islands communicating through multi-level Interconnect Sha ed IO s Each Coherence Island has its own local independent global (coherent) address space (GAS L) 10

Unimem Architecture μserver0 Compute Node 0 Compute Node 1 Compute Node 2 Compute Node 3 Interlink other MicroServers Ethernet μserver1 Compute Node 0 Compute Node 1 Compute Node 2 Compute Node 3 Interlink Ethernet other MicroServers John Goodacre, ARM Every memory page has a single owner (coherence island) A p o esso a a ess a y page i the syste th ough the page ow e s ohe e t i te o e t Every page can be cacheable either locally (single borrower) or remotely (owner) but not both 11

EUROSERVER environment Coherence Island0 Coherence Island1 Lite Local Cache Coherent Interconnect AXI DMC Lite AXI Chip2 Chip Lite Local Cache Coherent Interconnect Lite AXI Chip2 Chip AXI DMC 8-core Chiplet 8-core Chiplet Multi-level Global Interconnect Two coherence islands might belong in the same Compute Node (intralink communication) or not (intralink + interlink communication) 12

Remote Page Borrowing Coherence Island0 Coherence Island1 Lite Miss/Replace Lite Local Cache Coherent Interconnect AXI DMC Lite AXI Chip2 Chip Local Cache Coherent Interconnect Lite AXI Chip2 Chip AXI DMC 8-core Chiplet 8-core Chiplet Multi-level Global Interconnect Lo ally a hea le i itiato s a he 13

Shared Memory Coherence Island0 Coherence Island1 Miss/Replace Lite Local Cache Coherent Interconnect AXI DMC Lite AXI Lite Local Cache Coherent Interconnect Lite AXI Chip2 Chip Chip2 Chip AXI DMC 8-core Chiplet 8-core Chiplet Multi-level Global Interconnect Re otely a hea le ow e s a he 14

R Coherence Island0 Miss/Replace Coherence Island1 Read DMC Lite Lite AXI Chip2 Chip Lite Write Local Cache Coherent Interconnect AXI Local Cache Coherent Interconnect Lite AXI Chip2 Chip AXI DMC 8-core Chiplet 8-core Chiplet Multi-level Global Interconnect reads from (or writes to) on Coherence Island0 and writes to (or reads from) on Coherence Island1 Accesses can also be uncacheable locally or cacheable remotely (dashed lines) 15

NUMA-aware linux Coherence Island0 Coherence Island1 Lite Miss/Replace Lite Local Cache Coherent Interconnect AXI DMC Lite AXI Chip2 Chip Local Cache Coherent Interconnect Lite AXI Chip2 Chip AXI DMC 8-core Chiplet 8-core Chiplet Multi-level Global Interconnect Borrow unused remote memory instead of page faulting Fast shared memory and MPI communication NUMA-aware memory allocator and garbage collector 16

Initial Testing Environment using A9-based boards ZedBoard 0 ZedBoard 1 FPGA Cache FPGA A9 core M (GP) INT MBOX INT INTC Cache A9 core M (GP) S (ACP) INT MBOX INT INTC S (ACP) AXI Interconnect (64bit @ 100MHz) S AXI Interconnect (64bit @ 100MHz) S Chip2Chip M Master/Slave Chip2Chip M Master/Slave FMC FMC FMC to FMC cable (6.4Gb/s) Can we interconnect more A9 processors? (see next slides) 17

FMC Fan-Out Daughtercard v.1 2MicroZed boards (40LVDS per board) MZ0 4 MicroZed boards ( LVDS per board) mechanical support MZ0 MZ2 FMC HPC MZ1 MZ1 mechanical support 80 pairs mechanical support MZ3 FMC HPC mechanical support 80 pairs Top and bottom s are mainly used for mechanical support. Two connectivity modes: support for 1 to 4 MicroZed boards PCB design and fabrication done Testing done Version 2 in progress 18

Pictures of FMC Fan-Out Daugthercard v.1 8 s 2 MicroZeds FMC 4 MicroZeds 19

FMC Fan-out v.2 with 10GE and PCIe PCIe 4GTX 4GTX SFP+ SFP+ SFP+ FMC HPC SFP+ 80 pairs, 8GTX Support for: Four 10Gb SFP+ 2.5 PCIe socket

Initial Prototype using FMC Fan-Out v.2 4x10Gb not connected (no room for daughtercard) MicroZed 6 FMC Fan-Out v.2 4 GTX MicroZed 5 MicroZed 7 80 pairs, 8 GTX FMC HPC FMC HPC FMC HPC 80 pairs, 8 GTX MicroZed 3 v.2 MicroZed 2 MicroZed 0 FMC Fan-Out MicroZed 1 4x10Gb MicroZed 4 SSD Virtex 7 Specialist Node Hitech Global (HTG-V7-PCIE-585) 8 MicroZed boards 8 10Gb SFP+ ports 1-2 PCIe x4 SSD 21

Picture of testing environment using FMC Fan-Out v.1 8 MicroZeds (A9+1GB) Shared 10 GigE Hitech Global board (central router) 22

Thank you! Questions? Iakovos Mavroidis jacob@ics.forth.gr FORTH-ICS 23