Foundation for Research and Technology Hellas (FORTH)
Institute of Computer Science (ICS)

NUMA-like architecture for Microservers

Iakovos Mavroidis (jacob@ics.forth.gr)
FORTH-ICS, Greece
MPSoC '14, July 8, Margaux, France

Outline
- Characteristics and requirements of today's Datacenters
- Importance of Energy Efficiency
- Energy Proportionality
- Small form-factor Microservers
- Energy Efficient Architecture
- Intel or ARM?
- EUROSERVER approach
- EUROSERVER Architecture
- Unimem Architecture
- Testing Environment
- FMC Fan-Out Daughtercard

Why is Size/Power/Energy Efficiency Important?
- Utility costs
  - Management cost: improve size
  - Power and cooling costs: improve PUE (Power Usage Effectiveness)
- Electricity growth
  - 56% increase '05-'10 (US increase 36%)
  - 19% increase '11-'12
  - 7% increase in '13
  - 1.1%-1.5% of global electricity in '10 (US 1.7-2.2%)
- Google report
  - PUE=1.16 in '10
  - PUE=1.14 in '11
- Environmental friendliness programs
  - Energy Star (US), TopRunner (Japan), FOE (Switzerland)

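PUE is the ratio of total facility energy to the energy delivered to IT equipment, so a value of 1.0 would mean no cooling or power-distribution overhead. A minimal sketch of that arithmetic follows; the function names and the 116/100 kWh sample reading are illustrative assumptions, not data from the talk.

```python
def pue(total_facility_kwh: float, it_equipment_kwh: float) -> float:
    """Power Usage Effectiveness: total facility energy / IT equipment energy."""
    if it_equipment_kwh <= 0:
        raise ValueError("IT equipment energy must be positive")
    return total_facility_kwh / it_equipment_kwh


def overhead_share(pue_value: float) -> float:
    """Fraction of facility energy that does not reach the IT equipment."""
    if pue_value < 1.0:
        raise ValueError("PUE cannot be below 1.0")
    return 1.0 - 1.0 / pue_value


# Hypothetical meter readings chosen to reproduce the PUE=1.16 figure.
print(f"PUE = {pue(116.0, 100.0):.2f}")
# Overhead implied by the two reported PUE values.
for reported in (1.16, 1.14):
    print(f"PUE {reported}: {overhead_share(reported):.1%} overhead")
```
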
Energy Proportionality in Datacenters
- Servers spend most of the time at 10-50% utilization
- Challenge: power is not proportional to utilization when the server is underutilized
- Two approaches:
  - Turn off hardware when not used
    - Dynamic Voltage Scaling (DVS)
    - Clock Gating
  - Keep CPU utilization high
    - Multiple Virtual Machines
    - Overprovisioning
    - QoS guarantees?

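The proportionality problem can be illustrated with a simple linear power model, in which a server draws a fixed idle power plus a part that grows with utilization. This is a sketch under stated assumptions: the linear model, the 100 W idle and 200 W peak figures, and the function names are all hypothetical and are not measurements from the talk.

```python
def power_watts(utilization: float, p_idle: float, p_peak: float) -> float:
    """Assumed linear model: idle power plus a utilization-dependent part."""
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be in [0, 1]")
    return p_idle + (p_peak - p_idle) * utilization


def relative_efficiency(utilization: float, p_idle: float, p_peak: float) -> float:
    """Work per watt relative to a perfectly energy-proportional server.

    A perfectly proportional server would draw utilization * p_peak.
    """
    return (utilization * p_peak) / power_watts(utilization, p_idle, p_peak)


# Hypothetical server: 100 W idle, 200 W at full load.
for u in (0.1, 0.3, 0.5, 1.0):
    eff = relative_efficiency(u, p_idle=100.0, p_peak=200.0)
    print(f"utilization {u:.0%}: {power_watts(u, 100.0, 200.0):.0f} W, "
          f"relative efficiency {eff:.2f}")
```

Under these assumptions the server is least efficient exactly in the 10-50% range where it spends most of its time, which is why both approaches on the slide aim either to cut idle power or to raise utilization.
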
Aligning energy use with workloads

Why many small cores?
- Scale-out applications require a large number of cores, no brute processors
- Smaller cores are more power-efficient for several workloads
  - static web page serving, entry dedicated hosting, and basic content delivery, among others
- Less power consumption (sub-10 W levels)
  - Lower running costs, lower PUE
- Energy proportionality
  - Easy to turn off idle cores (parts of the system)
- Easier maintenance and management
  - Small form factor allows tightly packed clusters and less physical space
- Easier, more efficient implementation?
  - CPU partitioning instead of sharing (no sharing overhead)
- But
  - No compute power for single-threaded applications
  - Hard to parallelize an application
  - Not so efficient for the HPC domain?

Energy-efficient architecture: Microservers
- Low-power components
  - CPU (ARM, Intel Atom)
  - Memory ()
  - Storage (NVM)
- Small form factor
  - Small CPUs
  - Fast interconnections (high-speed serial links)
  - High integration
- Microservers are still in their infancy

Intel or ARM in Microservers?
- Diversity of ARM ecosystem
  - Custom microservers using ARM-based SoCs
  - Hundreds of customers
- More than 50 variations of Intel Atom and Xeon
  - Xeon E3 suitable for webscale applications, online gaming, cloud
  - Atom C00 suitable for lightweight scale-out workloads
- Hard to compete with hundreds of chip-makers (Samsung Exynos, AMD Opteron A1100 with 8 A-57, APM's X-Gene, Google, Facebook, ...)
- However
  - Intel first released a 64-bit SoC with ECC (Atom Avoton)
  - Intel 3-D technology: smaller die area, less energy consumption
  - Most datacenter software runs on x86 (porting to ARM in progress)
  - Calxeda: ARM-based servers didn't have the software support or hardware needed to win enterprise customers

EUROSERVER Challenges and Approach
- Energy-efficient architecture
  - Use of highly-integrated, high-performance, energy-efficient components in a Microserver architecture
    - Many low-power
    - 3D - Interposer Technology
    - main memory
    - NVM memory for storage
  - Suitable from cloud data-centers to embedded applications
- Unimem Architecture (focus of this presentation)
  - Take advantage of fast communication
  - Scalable architecture
    - Many coherent islands
    - Global Address Space
- Facilitate maintenance and management
  - Small form factor
  - Energy proportionality

EUROSERVER Architecture
- Chiplet: Cores + L0 Coherent Interconnect (1 coherence island)
- Node: Chiplets + L1 Interconnect (Shared IO and Storage)
- μserver: Nodes + L2 Interconnect (Scale-out or HPC)
- System: Nodes + L3 Interconnect
[Figure: EuroServer System diagram. Compute Nodes (0, 1, 2, ...) each contain ARM 64b cores, memory, Node-SSD and Local-IO; an Interlink and Ethernet connect to other μservers. Credit: John Goodacre, ARM]
- Clustered Architecture: Coherence Islands communicating through a multi-level Interconnect
- Shared IOs
- Each Coherence Island has its own local independent global (coherent) address space (GAS L)

Unimem Architecture
[Figure: μserver0 and μserver1, each with Compute Nodes 0-3, an Interlink and Ethernet to other MicroServers. Credit: John Goodacre, ARM]
- Every memory page has a single owner (coherence island)
- A processor can access any page in the system through the page owner's coherent interconnect
- Every page can be cacheable either locally (single borrower) or remotely (owner) but not both

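The three Unimem rules (single owner per page, access from any island, cacheable at the borrower or at the owner but never both) can be modeled in a few lines. This is only an illustrative sketch: the `Page` class, the `CacheMode` names and the integer island IDs are hypothetical and do not describe the actual EUROSERVER hardware or software interfaces.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class CacheMode(Enum):
    AT_OWNER = "remotely cacheable (owner's cache)"
    AT_BORROWER = "locally cacheable (single borrower's cache)"


@dataclass
class Page:
    owner: int                            # coherence island that owns the page
    mode: CacheMode = CacheMode.AT_OWNER
    borrower: Optional[int] = None        # set only while the page is borrowed

    def borrow(self, island: int) -> None:
        """Hand caching rights to one remote island (the single borrower)."""
        if island == self.owner:
            raise ValueError("the owner does not borrow its own page")
        if self.mode is CacheMode.AT_BORROWER and self.borrower != island:
            raise RuntimeError("page already has a different borrower")
        self.mode, self.borrower = CacheMode.AT_BORROWER, island

    def release(self) -> None:
        """Return caching rights to the owner."""
        self.mode, self.borrower = CacheMode.AT_OWNER, None

    def may_cache(self, island: int) -> bool:
        """Exactly one island may cache the page at any time."""
        if self.mode is CacheMode.AT_BORROWER:
            return island == self.borrower
        return island == self.owner


page = Page(owner=0)
page.borrow(1)
print(page.may_cache(0), page.may_cache(1))
```

Keeping a single caching location per page is what lets the model avoid any coherence traffic between islands: an island that may not cache a page still reaches it, but only through the owner's interconnect.
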
EUROSERVER environment
[Figure: Coherence Island0 and Coherence Island1, each an 8-core Chiplet with Local Cache, Coherent Interconnect, Lite, AXI, DMC and Chip2Chip blocks, joined by the Multi-level Global Interconnect]
- Two coherence islands might belong to the same Compute Node (intralink communication) or not (intralink + interlink communication)

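The distinction between the two cases can be sketched as a small routing helper that reports which link levels a transfer between two coherence islands crosses. The `Island` tuple and the `links_used` function are hypothetical names introduced for illustration only.

```python
from typing import List, NamedTuple


class Island(NamedTuple):
    node: int     # Compute Node the chiplet sits on (assumed numbering)
    chiplet: int  # chiplet index within that node (assumed numbering)


def links_used(src: Island, dst: Island) -> List[str]:
    """Return the interconnect levels crossed between two coherence islands."""
    if src == dst:
        return []                          # same island: stays on the local interconnect
    if src.node == dst.node:
        return ["intralink"]               # same Compute Node
    return ["intralink", "interlink"]      # different Compute Nodes


print(links_used(Island(0, 0), Island(0, 1)))
print(links_used(Island(0, 0), Island(1, 0)))
```
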
Remote Page Borrowing
[Figure: Coherence Island0 and Coherence Island1, each an 8-core Chiplet with Local Cache, Coherent Interconnect, Lite, AXI, DMC and Chip2Chip blocks, joined by the Multi-level Global Interconnect; a Miss/Replace path is highlighted]
- Locally cacheable: initiator's cache

Shared Memory
[Figure: Coherence Island0 and Coherence Island1, each an 8-core Chiplet with Local Cache, Coherent Interconnect, Lite, AXI, DMC and Chip2Chip blocks, joined by the Multi-level Global Interconnect; a Miss/Replace path is highlighted]
- Remotely cacheable: owner's cache

R
[Figure: Coherence Island0 and Coherence Island1, each an 8-core Chiplet with Local Cache, Coherent Interconnect, Lite, AXI, DMC and Chip2Chip blocks, joined by the Multi-level Global Interconnect; Read, Write and Miss/Replace paths are highlighted]
- reads from (or writes to) on Coherence Island0 and writes to (or reads from) on Coherence Island1
- Accesses can also be uncacheable locally or cacheable remotely (dashed lines)

NUMA-aware Linux
[Figure: Coherence Island0 and Coherence Island1, each an 8-core Chiplet with Local Cache, Coherent Interconnect, Lite, AXI, DMC and Chip2Chip blocks, joined by the Multi-level Global Interconnect; a Miss/Replace path is highlighted]
- Borrow unused remote memory instead of page faulting
- Fast shared memory and MPI communication
- NUMA-aware memory allocator and garbage collector

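One way to picture "borrow unused remote memory instead of page faulting" is an allocator that tries local memory first and then falls back to a remote island with free pages. The sketch below is a toy model under stated assumptions: the `allocate_page` function, the per-island free-page counters and the "pick the island with the most free pages" policy are all hypothetical and are not taken from the talk or from the Linux kernel.

```python
from typing import Dict, Optional, Tuple


def allocate_page(free_pages: Dict[int, int], local: int) -> Optional[Tuple[int, str]]:
    """Allocate one page for island `local`.

    Returns (island providing the page, "local" or "borrowed"), or None when
    no island has free memory and the caller must fall back to paging.
    """
    if free_pages.get(local, 0) > 0:
        free_pages[local] -= 1
        return local, "local"
    # Assumed policy: borrow from the remote island with the most free pages.
    donors = [(count, island) for island, count in free_pages.items()
              if island != local and count > 0]
    if donors:
        _, donor = max(donors)
        free_pages[donor] -= 1
        return donor, "borrowed"
    return None


free = {0: 1, 1: 3, 2: 2}
print(allocate_page(free, local=0))  # served from local memory
print(allocate_page(free, local=0))  # local memory exhausted, page is borrowed
```
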
Initial Testing Environment using A9-based boards
[Figure: ZedBoard 0 and ZedBoard 1. On each board an A9 core with Cache, M (GP) and S (ACP) ports, MBOX and INTC connects through an AXI Interconnect (64bit @ 100MHz) to a Chip2Chip Master/Slave block in the FPGA; the boards are joined by an FMC to FMC cable (6.4Gb/s)]
- Can we interconnect more A9 processors? (see next slides)

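The 6.4Gb/s figure is consistent with the raw throughput of a 64-bit datapath clocked at 100MHz. A quick check, assuming one data beat per clock cycle and no protocol overhead (the function name is illustrative):

```python
def raw_bandwidth_gbps(width_bits: int, clock_mhz: float) -> float:
    """Raw datapath throughput in Gb/s, assuming one transfer per clock cycle."""
    return width_bits * clock_mhz * 1e6 / 1e9


# 64-bit AXI interconnect at 100 MHz, as labeled in the diagram.
print(f"{raw_bandwidth_gbps(64, 100):.1f} Gb/s")
```
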
FMC Fan-Out Daughtercard v.1
[Figure: two configurations on an FMC HPC connector with 80 pairs. 2 MicroZed boards (40 LVDS per board): MZ0, MZ1. 4 MicroZed boards ( LVDS per board): MZ0, MZ1, MZ2, MZ3. Remaining positions are marked "mechanical support"]
- Top and bottom s are mainly used for mechanical support.
- Two connectivity modes: support for 1 to 4 MicroZed boards
- PCB design and fabrication done
- Testing done
- Version 2 in progress

Pictures of FMC Fan-Out Daughtercard v.1
[Photos, labeled: 8 s; 2 MicroZeds; FMC; 4 MicroZeds]

FMC Fan-Out v.2 with 10GE and PCIe
[Figure: PCIe socket (4 GTX), four SFP+ cages (4 GTX), FMC HPC connector with 80 pairs, 8 GTX]
- Support for:
  - Four 10Gb SFP+
  - 2.5 PCIe socket

Initial Prototype using FMC Fan-Out v.2
[Figure: MicroZed 0-7 mounted on FMC Fan-Out v.2 cards, each card attached over FMC HPC (80 pairs, 8 GTX) to a Virtex 7 Specialist Node, Hitech Global (HTG-V7-PCIE-585); labels include 4 GTX, 4x10Gb, SSD, and "4x10Gb not connected (no room for daughtercard)"]
- 8 MicroZed boards
- 8 10Gb SFP+ ports
- 1-2 PCIe x4 SSD

Picture of testing environment using FMC Fan-Out v.1
[Photo, labeled: 8 MicroZeds (A9+1GB); Shared 10 GigE; Hitech Global board (central router)]

Thank you! Questions?
Iakovos Mavroidis
jacob@ics.forth.gr
FORTH-ICS