Memory Aggregation For KVM Hecatonchire Project




Memory Aggregation for KVM - Hecatonchire Project
Benoit Hudzia, Sr. Researcher, SAP Research Belfast
With contributions from Aidan Shribman, Roei Tell, Steve Walsh, and Peter Izsak
November 2012

Agenda
- Memory as a Utility
- Raw Performance
- First Use Case: Post-Copy
- Second Use Case: Memory Aggregation
- Lego Cloud
- Summary

Memory as a Utility How we Liquefied Memory Resources

The Idea: Turning memory into a distributed memory service
- Breaks memory out of the bounds of the physical box
- Transparent deployment, with performance at scale and reliability

High-Level Principle
Diagram: a memory-demanding process on the memory demander maps its virtual memory address space onto memory provided by memory sponsors A and B across the network.

How Does It Work (Simplified Version)
Diagram of the remote page-fault path between two physical nodes. On node A (the demander), a virtual-address miss in the MMU/TLB lands on a page-table entry that has been replaced with a custom swap entry pointing at the remote page; the coherency engine invalidates the PTE and hands a page request to the RDMA engine. The request crosses the network fabric to node B (the sponsor), whose coherency engine extracts the page, prepares it for RDMA transfer, and sends back an RDMA page response; node A then updates its MMU/page table with the new physical address.
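To make the flow above concrete, here is a minimal user-space sketch of the demander-side logic: a not-present page-table entry acts as the custom swap entry, and a fault pulls the page from the sponsor before the access retries. All names and data structures are illustrative stand-ins, not the actual Hecatonchire kernel code, and a plain memcpy stands in for the RDMA transfer.

```c
/* Minimal user-space sketch of Hecatonchire-style remote fault resolution.
 * All names and structures are illustrative; the real code lives in the
 * Linux MMU/KVM paths and uses RDMA verbs instead of memcpy. */
#include <stdio.h>
#include <string.h>
#include <stdint.h>

#define PAGE_SIZE 4096
#define NPAGES    4

/* A "custom swap entry": the PTE is marked not-present and remembers
 * which sponsor node holds the page. */
struct pte {
    int      present;      /* 1 = page is local                 */
    int      sponsor_id;   /* valid when !present               */
    uint8_t *frame;        /* local frame when present          */
};

static uint8_t local_frames[NPAGES][PAGE_SIZE];   /* demander's RAM */
static uint8_t sponsor_mem[NPAGES][PAGE_SIZE];    /* sponsor's RAM  */
static struct pte page_table[NPAGES];

/* Stand-in for the RDMA engine: the sponsor prepares the page and the
 * demander receives it into a local frame. */
static void rdma_page_request(int sponsor_id, int pfn, uint8_t *dest)
{
    (void)sponsor_id;                     /* one sponsor in this sketch */
    memcpy(dest, sponsor_mem[pfn], PAGE_SIZE);
}

/* Fault handler: only the faulting access waits; the PTE is rewritten
 * from "remote" to "present" once the page arrives. */
static uint8_t *resolve_fault(int pfn)
{
    struct pte *p = &page_table[pfn];
    if (!p->present) {
        rdma_page_request(p->sponsor_id, pfn, local_frames[pfn]);
        p->frame = local_frames[pfn];
        p->present = 1;
    }
    return p->frame;
}

int main(void)
{
    /* Initially every page lives on sponsor node 1. */
    for (int i = 0; i < NPAGES; i++) {
        page_table[i] = (struct pte){ .present = 0, .sponsor_id = 1 };
        memset(sponsor_mem[i], 'A' + i, PAGE_SIZE);
    }

    uint8_t *page = resolve_fault(2);     /* access faults, page is pulled */
    printf("first byte of page 2 after remote fault: %c\n", page[0]);
    return 0;
}
```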

Reducing the Effects of Network-Bound Page Faults
- Full Linux MMU integration (reduces the system-wide cost of a page fault): faults are handled transparently, pausing only the requesting thread
- Low-latency RDMA engine and page transfer protocol (reduces the latency/cost of each fault): implemented fully in kernel mode on OFED verbs; can use the fastest RDMA hardware available (InfiniBand, iWARP, RoCE); also tested with software RDMA (SoftiWARP and SoftRoCE), so no special hardware is required
- Demand pre-paging (prefetching) mechanism (reduces the number of faults): currently a simple fetch of the pages surrounding the one on which the fault occurred
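The prefetching in the last bullet can be sketched as a simple window around the faulting page. The radius and the clamping policy below are assumptions made for illustration, not the project's actual parameters.

```c
/* Sketch of the simple neighbourhood prefetch described above: on a fault
 * at page `pfn`, also request the pages around it (clamped to the shared
 * region). The window size and policy here are illustrative only. */
#include <stdio.h>

#define PREFETCH_RADIUS 2   /* assumed window; the real policy may differ */

/* Fill `out` with the page numbers to request and return how many. */
static int prefetch_window(long pfn, long region_pages, long *out)
{
    long lo = pfn - PREFETCH_RADIUS;
    long hi = pfn + PREFETCH_RADIUS;
    int n = 0;

    if (lo < 0) lo = 0;
    if (hi >= region_pages) hi = region_pages - 1;

    for (long p = lo; p <= hi; p++)
        out[n++] = p;        /* faulting page plus its neighbours */
    return n;
}

int main(void)
{
    long pages[2 * PREFETCH_RADIUS + 1];
    int n = prefetch_window(0, 1024, pages);   /* fault on the first page */

    printf("pages fetched for a fault on page 0:");
    for (int i = 0; i < n; i++)
        printf(" %ld", pages[i]);
    printf("\n");
    return 0;
}
```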

Transparent Solution
- Minimal modification of the kernel (simple, minimal intrusion): 4 hooks in the static kernel, with virtually no overhead for normal operation when enabled
- Paging and memory cgroup support (transparent tiered memory): pages are pushed back to their sponsor when paging occurs, or, if they are local, they can be swapped out normally
- KVM-specific support (virtualization friendly): shadow page tables (EPT/NPT) and KVM asynchronous page faults
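The tiered-memory behaviour in the second bullet boils down to a per-page reclaim decision. The following is a minimal sketch of that decision with illustrative names (not the actual kernel hooks): remotely sponsored pages are pushed back to their sponsor, purely local pages take the normal swap path.

```c
/* Sketch of the tiered reclaim decision described above. Names and
 * structures are illustrative, not the Hecatonchire kernel interfaces. */
#include <stdio.h>

enum reclaim_action { PUSH_TO_SPONSOR, SWAP_LOCALLY };

struct page_info {
    int has_sponsor;     /* page belongs to a remote memory sponsor */
    int sponsor_id;
};

/* Under memory pressure, decide where an evicted page should go. */
static enum reclaim_action reclaim(const struct page_info *pg)
{
    return pg->has_sponsor ? PUSH_TO_SPONSOR : SWAP_LOCALLY;
}

int main(void)
{
    struct page_info remote = { .has_sponsor = 1, .sponsor_id = 3 };
    struct page_info local  = { .has_sponsor = 0 };

    printf("remote page -> %s\n",
           reclaim(&remote) == PUSH_TO_SPONSOR ? "push to sponsor" : "swap locally");
    printf("local page  -> %s\n",
           reclaim(&local) == PUSH_TO_SPONSOR ? "push to sponsor" : "swap locally");
    return 0;
}
```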

Transparent Solution (cont.)
- Scalable active-active mode (distributed shared memory): shared-nothing with a distributed index; write-invalidate with a distributed index (end of this year)
- LibHeca library (ease of integration): a simple API for bootstrapping and synchronizing all participating nodes
- Also supported: KSM, huge pages, discontinuous shared memory regions, and multiple DSM/VM groups on the same physical node
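The slide only says that LibHeca offers a simple bootstrap-and-sync interface; the sketch below imagines what such a flow could look like. Every identifier in it (heca_init, heca_register, heca_sync, struct heca_region) is invented for illustration and does not reflect the real LibHeca API.

```c
/* Purely hypothetical sketch of bootstrapping a DSM group with a
 * LibHeca-style API. Every identifier below is invented for illustration
 * and is NOT the real LibHeca interface. */
#include <stdio.h>
#include <stdlib.h>

struct heca_region {           /* a shared memory region to register */
    void   *addr;
    size_t  len;
    int     owner_node;        /* which node sponsors these pages */
};

/* Invented stand-ins for library calls: init this node, register the
 * shared region, then barrier until all participating nodes are ready. */
static int heca_init(int node_id, int n_nodes)   { (void)node_id; (void)n_nodes; return 0; }
static int heca_register(struct heca_region *r)  { (void)r; return 0; }
static int heca_sync(void)                       { return 0; }

int main(void)
{
    size_t len = 1 << 20;                 /* 1 MiB shared region */
    void *buf = malloc(len);
    if (!buf)
        return 1;

    struct heca_region region = { .addr = buf, .len = len, .owner_node = 1 };

    if (heca_init(/*node_id=*/0, /*n_nodes=*/2) ||
        heca_register(&region) ||
        heca_sync()) {
        fprintf(stderr, "bootstrap failed\n");
        return 1;
    }
    printf("DSM group up; %zu bytes shared, sponsored by node %d\n",
           region.len, region.owner_node);
    free(buf);
    return 0;
}
```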

Raw Performance How fast can we move memory around?

Raw Bandwidth Usage
Hardware: 4-core i5-2500 CPU @ 3.30 GHz; SoftiWARP over 10 GbE; iWARP Chelsio T422 10 GbE; InfiniBand ConnectX-2 QDR 40 Gbps.
Chart (Gbit/s vs. 1-7 threads) of sequential, binary-split, and random walks over 1 GB of shared RAM for SoftiWARP, iWARP, and InfiniBand. Observations: no degradation under high load; the fast fabrics max out their bandwidth; software RDMA has significant overhead; in some runs there were not enough cores to saturate the link.

Hard Page Fault Resolution Performance

                        Resolution time,    One-way time over      Resolution time,
                        average (μs)        the wire, avg (μs)     best (μs)
SoftiWARP (10 GbE)      355                 150+                   74
iWARP (10 GbE)          48                  4-6                    28
InfiniBand (40 Gbps)    29                  2-4                    16

Average Compounded Page-Fault Resolution Time (with Prefetch)
Chart (microseconds vs. 1-8 threads) for iWARP 10 GbE and InfiniBand 40 Gbps under sequential, binary-split, and random-walk access patterns; the InfiniBand averages sit noticeably below the iWARP averages across the whole range.

Post-Copy Live Migration - First Use Case of the Technology

Post-Copy / Pre-Copy / Hybrid Comparison
Chart of downtime (seconds, 0-4 s) vs. VM RAM size (1, 4, 10, 14 GB) for pre-copy (forced after 60 s), post-copy, hybrid with a 3-second pre-copy phase, and hybrid with a 5-second pre-copy phase.
Host: Intel Core i5-2500 @ 3.30 GHz, 4 cores, 16 GB RAM. Network: 10 GbE, Chelsio T422-CR iWARP. Workload: App Mem Bench (~80% of the VM RAM), dirtying rate 1 GB/s (256k pages dirtied per second).

Post-Copy vs. Pre-Copy Under Load
Chart of degradation (%) over time (seconds) for post-copy and pre-copy at dirtying rates of 1, 5, 25, 50, and 100 GB/s.
Virtual machine: 10 GB RAM, 1 vCPU. Hardware: Intel Core i5-2500 @ 3.30 GHz, 4 cores, 16 GB RAM. Network: 10 GbE switch, Chelsio T422-CR NIC. Workload: App Mem Bench.

Post-Copy Migration of HANA DB

                            Baseline    Pre-Copy            Post-Copy
Downtime                    N/A         7.47 s              675 ms
Performance degradation     0%          Benchmark failed    5%

Virtual machine: 10 GB RAM, 4 vCPU. Application: HANA (in-memory database). Workload: SAP-H (TPC-H variant). Hardware: Intel Core i5-2500 @ 3.30 GHz, 4 cores, 16 GB RAM. Fabric: 10 GbE switch, Chelsio T422-CR iWARP NIC.

Memory Aggregation Second use case: Scaling out Memory

Scaling Out Virtual Machine Memory
Business problem: heavy swap usage slows execution time for data-intensive applications.
Hecatonchire/RRAIM solution: applications use memory mobility as a high-performance swap resource; completely transparent, no integration required; act on results sooner; high reliability built in; enables iteration or additional data to improve results.
Diagram: the application VM swaps to a memory cloud offering compression, deduplication, n-tier storage, and HRHA.

Redundant Array of Inexpensive RAM (RRAIM)
1. Each memory region is backed by two remote nodes; remote page faults and swap-outs are initiated simultaneously to all relevant nodes.
2. Failure of a node has no immediate effect on the computation node.
3. When a new remote node enters the cluster, it is synchronized in the background and becomes an additional mirror, restoring redundancy.
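A minimal sketch of the behaviour in points 1 and 2: swap-outs are replicated to both sponsors, so a fault can still be served after one of them fails. The structures and the sequential loops are illustrative; the real implementation issues the RDMA operations concurrently.

```c
/* Sketch of the RRAIM idea: each memory region is backed by two sponsor
 * nodes; swap-outs go to both, and a fault can be served by whichever
 * replica is still alive. All structures are illustrative. */
#include <stdio.h>
#include <string.h>
#include <stdint.h>

#define PAGE_SIZE 4096
#define REPLICAS  2

struct sponsor {
    int     alive;
    uint8_t store[PAGE_SIZE];      /* one page per sponsor in this sketch */
};

/* Swap-out: push the page to every replica ("simultaneously" in reality). */
static void rraim_swap_out(struct sponsor rep[REPLICAS], const uint8_t *page)
{
    for (int i = 0; i < REPLICAS; i++)
        if (rep[i].alive)
            memcpy(rep[i].store, page, PAGE_SIZE);
}

/* Remote fault: read from the first replica that is still alive. */
static int rraim_fault_in(struct sponsor rep[REPLICAS], uint8_t *page)
{
    for (int i = 0; i < REPLICAS; i++) {
        if (rep[i].alive) {
            memcpy(page, rep[i].store, PAGE_SIZE);
            return 0;
        }
    }
    return -1;                     /* all replicas lost */
}

int main(void)
{
    struct sponsor rep[REPLICAS] = { { .alive = 1 }, { .alive = 1 } };
    uint8_t page[PAGE_SIZE];

    memset(page, 0x42, PAGE_SIZE);
    rraim_swap_out(rep, page);     /* page now lives on both sponsors */

    rep[0].alive = 0;              /* sponsor failure: computation unaffected */
    memset(page, 0, PAGE_SIZE);

    if (rraim_fault_in(rep, page) == 0)
        printf("page recovered from surviving replica, byte 0 = 0x%02x\n", page[0]);
    return 0;
}
```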

Quicksort Benchmark with Memory Constraint
Charts: quicksort benchmark with 512 MB, 1 GB, and 2 GB datasets (memory constrained using cgroups). Overheads by memory ratio:

Memory ratio    DSM overhead    RRAIM overhead
3:4             2.08%           5.21%
1:2             2.62%           6.15%
1:3             3.35%           9.21%
1:4             4.15%           8.68%
1:5             4.71%           9.28%

Scaling Out HANA

Memory ratio    DSM overhead    RRAIM overhead
1:2             1%              0.887%
1:3             1.6%            1.548%
2:1:1           0.1%            1.5%
1:1:1           -               -

Virtual machine: 18 GB RAM, 4 vCPU. Application: HANA (in-memory database). Workload: SAP-H (TPC-H variant).
Hardware: memory host Intel Core i5-2500 @ 3.30 GHz, 4 cores, 16 GB RAM; compute host Intel Xeon X5650 @ 2.56 GHz, 8 cores, 96 GB RAM. Fabric: InfiniBand QDR 40 Gbps switch with Mellanox ConnectX-2.

Transitioning to a Memory Cloud (Ongoing Work, PoC Q1-Q2 2013)
Diagram: memory VMs (memory sponsors), compute VMs (memory demanders), and combination VMs (both sponsor and demander) spread across many physical nodes hosting a variety of VMs, backed by RRAIM and a memory cloud, with memory cloud management services provided through OpenStack.

Lego Cloud: Going Beyond Memory

Virtual Distributed Shared Memory System (Compute Cloud) - Future Work
Idea: compute aggregation - a virtual machine's compute and memory span multiple physical nodes.
Challenge: coherency protocol granularity (false sharing).
Hecatonchire value proposition: optimal price/performance by using commodity hardware; operational flexibility (node downtime without downing the cluster); seamless deployment within an existing cloud.
Diagram: guest VMs running across server #1 through server #n, each server contributing CPUs, memory, and I/O, connected by fast RDMA communication.
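The false-sharing challenge called out above can be demonstrated on a single host: two threads updating different variables that share a coherence unit force that unit to bounce between caches, just as two guests touching unrelated data on the same page would force the page to bounce between nodes in a vDSM. The sketch below shows the effect at cache-line granularity (the 64-byte line size is an assumption about the target CPU); built with `gcc -O2 -pthread`, the unpadded run is typically noticeably slower than the padded one.

```c
/* Demonstration of false sharing at cache-line granularity; in a vDSM the
 * coherence unit is a 4 KiB page and the penalty is a network round trip
 * instead of a cache-line bounce. The 64-byte line size is an assumption. */
#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define ITERS 50000000L

/* Two counters that share one cache line ... */
static struct { volatile long a, b; } same_line;

/* ... and two counters padded onto separate cache lines. */
static struct { volatile long a; char pad[64]; volatile long b; } padded;

static void *bump_same_a(void *p) { (void)p; for (long i = 0; i < ITERS; i++) same_line.a++; return NULL; }
static void *bump_same_b(void *p) { (void)p; for (long i = 0; i < ITERS; i++) same_line.b++; return NULL; }
static void *bump_pad_a(void *p)  { (void)p; for (long i = 0; i < ITERS; i++) padded.a++;    return NULL; }
static void *bump_pad_b(void *p)  { (void)p; for (long i = 0; i < ITERS; i++) padded.b++;    return NULL; }

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

/* Run two incrementing threads and report the elapsed wall-clock time. */
static void run(const char *name, void *(*fa)(void *), void *(*fb)(void *))
{
    pthread_t t1, t2;
    double t0 = now_sec();

    pthread_create(&t1, NULL, fa, NULL);
    pthread_create(&t2, NULL, fb, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("%-30s %.2f s\n", name, now_sec() - t0);
}

int main(void)   /* build with: gcc -O2 -pthread false_sharing.c */
{
    run("same unit (false sharing):", bump_same_a, bump_same_b);
    run("padded (no false sharing):", bump_pad_a, bump_pad_b);
    return 0;
}
```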

Disaggregation of Datacentre (and Cloud) Resources - Our Aim
Breaking out the functions of memory, compute, and I/O, and optimizing the delivery of each. Disaggregation provides three primary benefits:
- Better performance: each function is isolated, limiting the scope of what each box must do, and dedicated hardware and software can be leveraged to increase performance.
- Superior scalability: because functions are isolated from each other, one function can be altered without impacting the others.
- Improved economics: cost-effective deployment of resources through improved provisioning and consolidation of disparate equipment.

Summary

Hecatonchire Project
- Features: distributed shared memory; memory extension via memory servers; HA features
- Future: distributed workload execution
- Uses standard cloud interfaces
- Optimises cloud infrastructure
- Supports COTS hardware

Key Takeaways
- The Hecatonchire project aims at disaggregating datacentre resources
- The project currently delivers memory cloud capabilities
- Enhancements are to be released as open source under the GPLv2 and LGPL licenses by the end of November 2012
- Hosted on GitHub; see www.hecatonchire.com
- Developed by the SAP Research Technology Infrastructure (TI) Programme

Thank you Benoit Hudzia; Sr. Researcher; SAP Research Belfast benoit.hudzia@sap.com

Backup Slide

Instant Flash Cloning On-Demand (Prototype Planned Q4 2012)
Business problem: burst load / service usage that cannot be satisfied in time.
Existing solutions (Amazon, VMware, RightScale): start the VM from a disk image, which requires the full VM OS startup sequence.
Hecatonchire solution: go live after cloning only the VM state (MBs) and hot memory (<5%); use the post-copy live-migration scheme in the background; complete the background transfer and disconnect from the source.
Hecatonchire value proposition: just-in-time (sub-second) provisioning.

DRAM Latency Has Remained Constant
CPU clock speed and memory bandwidth increased steadily (at least until 2000), but memory latency remained roughly constant, so local memory has effectively gotten slower from the CPU's perspective.
Source: J. Karstens, In-Memory Technology at SAP, DKOM 2010

CPUs Stopped Getting Faster
Moore's law prevailed until around 2005, when core clock speed hit a practical limit of about 3.4 GHz. Since 2005 you do get more cores, but the single-threaded free lunch is over: effectively, arbitrary sequential algorithms have not gotten faster since.
Sources: http://www.intel.com/pressroom/kits/quickrefyr.htm; "The Free Lunch Is Over" by Herb Sutter

While Interconnect Link Speed Has Kept Growing
Source: Panda et al., Supercomputing 2009

Result: Remote Nodes Have Gotten Closer
Accessing DRAM on a remote host over InfiniBand interconnects is only about 20x slower than local DRAM, and remote DRAM performs far better than paging in from an SSD or HDD. Fast interconnects have become a commodity, moving out of the High Performance Computing (HPC) niche.
Source: HANA Performance Analysis, Chaim Bendelac, 2011

Post-Copy Live Migration Phases (pre-migration, reservation, stop-and-copy, post-copy, commit)
Five slides step through the same timeline between Host A and Host B: pre-migrate (guest live on A), reservation (resources reserved on B), stop-and-copy (the brief downtime while minimal VM state moves), page pushing in one round / post-copy (guest runs degraded on B while pages are pushed and pulled on fault), and commit (guest fully live on B). Together these phases make up the total migration time.