Lightweight Operating Systems for Scalable Native and Virtualized Supercomputing


Lightweight Operating Systems for Scalable Native and Virtualized Supercomputing. April 20, 2009, ORNL visit. Kevin Pedretti, Senior Member of Technical Staff, Scalable System Software, Dept. 1423, ktpedre@sandia.gov. Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.

Acknowledgments. Kitten Lightweight Kernel: Trammel Hudson, Mike Levenhagen, Kurt Ferreira; funding from the Sandia LDRD program. Palacios Virtual Machine Monitor: Peter Dinda & Jack Lange (Northwestern Univ.), Patrick Bridges (Univ. New Mexico). OS noise studies: Kurt Ferreira, Ron Brightwell. Quad-core Catamount: Sue Kelly, John VanDyke, Courtenay Vaughan.

Outline: Introduction; Kitten Lightweight Kernel; Palacios Virtual Machine Monitor; native vs. guest OS results on Cray XT; Conclusion.

Going on Four Decades of UNIX. An operating system is a collection of software and APIs; users care about the environment, not the implementation details. The LWK is about getting those details right for scalability.

Challenge: Exponentially Increasing Parallelism. [Figure: historical and projected growth in aggregate performance, core counts, and per-core performance; per-year growth-rate annotations (roughly 72%, 3%, and 8-9%) are garbled in the transcription.] Current point: 900 TF on 75K cores, 12 GF/core. Projection for 2019: 1 EF on 1.7M cores (green, 588 GF/core) or 28M cores (blue, 35 GF/core).
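As a sanity check, the per-core numbers follow directly from the aggregate figures on the slide:

\[
\frac{900\ \text{TF}}{75{,}000\ \text{cores}} = 12\ \text{GF/core}, \qquad
\frac{1\ \text{EF}}{1.7\times 10^{6}\ \text{cores}} \approx 588\ \text{GF/core}, \qquad
\frac{1\ \text{EF}}{2.8\times 10^{7}\ \text{cores}} \approx 36\ \text{GF/core}.
\]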

LWK Overview. Basic architecture [diagram]: applications, linked against libc.a and libmpi.a, run above a policy maker, the process control thread (PCT), and a policy enforcer/HAL, the quintessential kernel (QK), which sits directly on the privileged hardware; memory management maps each application's virtual memory pages onto contiguous pages of physical memory. Properties: POSIX-like environment; inverted resource management; very low OS noise/jitter; straightforward network stack (e.g., no pinning); simplicity leads to reliability.

Lightweight Kernel Timeline. 1991: Sandia/UNM OS (SUNMOS) on the nCUBE 2. 1991: Linux 0.02. 1993: SUNMOS ported to the Intel Paragon (1800 nodes). 1993: SUNMOS experience used to design Puma, the first implementation of the Portals communication architecture. 1994: Linux 1.0. 1995: Puma ported to ASCI Red (4700 nodes); renamed Cougar, productized by Intel. 1997: stripped-down Linux used on Cplant (2000 nodes); it was difficult to port Puma to the COTS Alpha servers; included the Portals API. 2002: Cougar ported to ASC Red Storm (13000 nodes); renamed Catamount, productized by Cray; host- and NIC-based Portals implementations. 2004: IBM develops an LWK (CNK) for BG/L/P (106000 nodes). 2005: IBM & ETI develop an LWK (C64) for Cyclops64 (160 cores/die).

We Know OS Noise Matters. [Figure: timeline of processes P0-P3 showing noise on one process delaying a collective operation.] The impact of noise increases with scale (basic probability). Multi-core increases the load on the OS. Idle-noise measurements distort reality: they don't ask the OS to do anything, and a micro-benchmark != a real application. See "The Case of the Missing Supercomputer Performance," Petrini et al.
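A back-of-the-envelope illustration of the "basic probability" point (the numbers below are illustrative, not from the slide): if each of N processes is independently delayed by noise during a collective with probability p, the whole collective stalls whenever any one of them is hit, so

\[
P(\text{collective delayed}) = 1 - (1 - p)^{N} \approx N p \quad \text{for small } p.
\]

For p = 0.1% per collective this is negligible on one node, but grows to roughly 9.5% at N = 100 and roughly 63% at N = 1000.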

Red Storm Noise Injection Experiments. Result: noise duration is more important than frequency, so the OS should break its work into many small, short pieces. This is the opposite of current efforts such as Linux dynticks; Cray CNL with a 10 Hz timer had to revert to 250 Hz due to OS noise duration issues. From Kurt Ferreira's Master's thesis.

Drivers for an LWK Compute Node OS. Practical advantages: low OS noise; performance tuned for scalability; determinism (inverted resource management); reliability. Research advantages: small and simple; freedom to innovate (see the Berkeley View); multi-core; virtualization. Focused on capability systems: the OS can't be separated from the node-level architecture, and it is much simpler to create an LWK than to adapt a mainstream OS.

Architecture and System Software are Tightly Coupled. The LWK's static, contiguous memory layout simplifies the network stack: there is no pinning/unpinning overhead, so the stack simply sends an address and length to the SeaStar NIC, as sketched below. [Figure: host-based network stack (generic Portals) benchmark results, with the LWK 8-31% better than Linux across the measured cases (28%, 31%, 31%, 21%, 8%). Testing performed April 2008 at Sandia, UNICOS 2.0.44.]
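A minimal sketch of why a static, contiguous layout removes the pinning step. This is not Kitten or Catamount source; the structure, field names, and the seastar_dma_send placeholder are illustrative assumptions. Because the mapping from a process's virtual region to physical memory is fixed and contiguous, translation is a single offset addition and the pages can never move, so the NIC can be handed a physical (address, length) pair directly.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative only: assumes the LWK backs each process's virtual address
 * space with one physically contiguous region at a fixed offset.          */
struct proc_mem {
    uintptr_t virt_base;   /* start of the process's virtual region */
    uintptr_t phys_base;   /* start of the backing physical region  */
    size_t    length;      /* size of the contiguous mapping        */
};

/* Translate a user buffer to a physical address the NIC can DMA from.
 * No pinning is needed: the mapping is static for the process lifetime,
 * so the physical pages cannot move or be swapped out.                 */
static inline uintptr_t virt_to_phys(const struct proc_mem *pm, const void *buf)
{
    return pm->phys_base + ((uintptr_t)buf - pm->virt_base);
}

/* Hypothetical send path: hand the NIC an (address, length) pair. */
void nic_post_send(const struct proc_mem *pm, const void *buf, size_t len)
{
    uintptr_t phys = virt_to_phys(pm, buf);
    /* seastar_dma_send(phys, len);   placeholder for the real NIC doorbell */
    (void)phys;
    (void)len;
}
```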

The TLB Gets in the Way of Algorithm Research: Unexpected Behavior Due to Small Pages. [Figure, dual-core Opteron: dashed lines = small pages, solid lines = large pages; open shapes = existing logarithmic algorithm (Gibson/Bruck), solid shapes = new constant-time algorithm (Slepoy, Thompson, Plimpton).] TLB misses increased with large pages, but the time to service a miss decreased dramatically (about 10x): the page table fits in L1 cache, versus 2 MB of page tables per GB with small pages.
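A quick check of the "2 MB per GB" figure, using standard x86-64 page-table sizes (8-byte PTEs; this arithmetic is not on the slide):

\[
\frac{1\ \text{GB}}{4\ \text{KB/page}} = 262{,}144\ \text{PTEs} \times 8\ \text{B} = 2\ \text{MB}, \qquad
\frac{1\ \text{GB}}{2\ \text{MB/page}} = 512\ \text{PTEs} \times 8\ \text{B} = 4\ \text{KB},
\]

so with 2 MB pages the leaf page tables covering a gigabyte shrink from 2 MB to 4 KB and fit easily in L1 cache, making each TLB miss far cheaper to service.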

Project Kitten. Creating a modern, open-source LWK platform. Multi-core is becoming an "MPP on a chip" and requires innovation. Leverage hardware virtualization for flexibility. Retain the scalability and determinism of Catamount. Better match user and vendor expectations. Available from http://software.sandia.gov/trac/kitten

Leverage Linux and Open Source. Repurpose basic functionality from the Linux kernel: hardware bootstrap and basic OS kernel primitives. Innovate in key areas: memory management (Catamount-like), network stack, SMARTMAP, and fully tickless operation with only short-duration OS work. Aim for a drop-in replacement for CNL. An open platform is more attractive to collaborators: collaborating with Northwestern Univ. and Univ. New Mexico on lightweight virtualization for HPC, http://v3vee.org/. Potential for wider impact.

Kitten Architecture

Current Status. Initial release (December 2008): single node, multi-core; available from http://software.sandia.gov/trac/kitten. Development trunk: support for glibc NPTL and GCC OpenMP via Linux-ABI-compatible clone(), futex(), etc. (see the example below); Palacios virtual machine monitor support (planning parallel Kitten and Palacios releases for May 1); kernel threads and local files for device drivers. Private development trees: Catamount user level for multi-node operation (yod, PCT, Catamount glibc port, libsysio, etc.); ported OpenFabrics Alliance IB stack.
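As an illustration of what the Linux-ABI-compatible clone()/futex() support enables, consider an ordinary GCC OpenMP program like the sketch below (standard OpenMP, nothing Kitten-specific): glibc's NPTL creates the worker threads with clone() and blocks/wakes them with futex(), so a kernel that provides those calls can run it unmodified.

```c
#include <omp.h>
#include <stdio.h>

int main(void)
{
    long sum = 0;

    /* libgomp spawns worker threads via clone() and parks/wakes them with
     * futex(); the kernel only needs to provide those Linux-ABI calls.    */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < 1000000; i++)
        sum += i;

    printf("threads=%d sum=%ld\n", omp_get_max_threads(), sum);
    return 0;
}
```

Built with gcc -fopenmp, the same binary behavior applies whether the underlying kernel is Linux or an ABI-compatible LWK.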

Virtualization Support. Kitten optionally links with Palacios, developed by Jack Lange and Peter Dinda at Northwestern. This allows user-level Kitten applications to launch unmodified guest ISO images or disk images; a standard PC environment is exposed to the guest, even on a Cray XT. Guests booted: Puppy Linux 3.0 (32-bit), Finnix 92.0 (64-bit), Compute Node Linux, Catamount. Lightweight virtualization: physically contiguous memory allocated to the guest; pass-through devices (memory + interrupts); low noise, with no timers or deferred work; space sharing rather than time sharing.

Motivations for Virtualization in HPC. Provide full-featured OS functionality in a lightweight kernel. Custom-tailor the OS to the application (ConfigOS, JeOS) and possibly augment the guest OS's capabilities. Improve resiliency: node migration, full-system checkpointing. Enhanced debug capabilities. Dynamic assignment of compute node roles: individual jobs determine the I/O-node-to-compute-node balance, with no rebooting required. Run-time system replacement: a capability runtime is a poor match for high-throughput serial workloads.

Palacios Architecture (credit: Jack Lange, Northwestern University). [Diagram: VM guest exits are dispatched through an MSR map, hypercall map, I/O port map, and VM memory map, with nested or shadow paging, into a device layer (PCI, NIC, keyboard, PIC, APIC, PIT, ATAPI, NVRAM) that supports pass-through I/O and IRQ delivery, all layered on the host OS (Kitten or GeekOS) and the hardware.]

Shadow vs. Nested Paging: No Clear Winner. Shadow paging: O(N) memory accesses per TLB miss; the CPU walks Palacios-managed page tables that shadow the guest-virtual to guest-physical tables the guest OS thinks it is using, and the MMU delivers page faults to Palacios. Nested paging: O(N^2) memory accesses per TLB miss; the CPU walks the guest-OS-managed guest-virtual to guest-physical tables together with Palacios-managed guest-physical to host-physical tables.
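A worked instance of the O(N) vs. O(N^2) claim for N-level page tables. This is the uncached two-dimensional-walk worst case, an assumption rather than a figure from the slide; real hardware caches paging structures, so actual counts are lower. A shadow-paging TLB miss walks one table of depth N, while a nested-paging miss must translate each of the guest's N table references through the N-level nested table and then translate the final data address:

\[
\text{shadow: } \approx N \text{ accesses}, \qquad
\text{nested: } \approx N(N+1) + N = N^2 + 2N \text{ accesses},
\]

so for N = 4 (x86-64) a shadow-paging miss touches about 4 page-table entries while a nested-paging miss touches about 24.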

Lines of Code in Kitten and Palacios

Kitten+Palacios on Cray XT. Kitten boots as a drop-in replacement for CNL: the Kitten kernel (vmlwk.bin) takes the place of vmlinux, the Kitten initial task (an ELF binary) takes the place of the initramfs, and kernel command-line args are passed via the parameters file. The guest OS ISO image is embedded in the Kitten initial task: Kitten boots, starts the user-level initial task, and the initial task boots the embedded guest OS. Both CNL and Catamount were ported to the standard PC environment that Palacios exposes. The SeaStar is direct-mapped through to the guest: its 2 MB device window is direct-mapped into guest physical memory, and SeaStar interrupts are delivered to Kitten, which forwards them to Palacios, which injects them into the guest.

Native vs. Guest CNL and Catamount Tests. Testing performed on the rsqual XT4 system at Sandia: a single cabinet of 48 2.2 GHz quad-core nodes, on which the developers have reboot capability. Benchmarks: Intel MPI Benchmarks (IMB, formerly Pallas); the HPCCG mini-application, a sparse CG solver, 100 x 100 x 100 problem, ~400 MB per node; and the CTH application, a shock-physics code important to Sandia, using a shaped-charge test problem (no AMR), weakly scaled.

IMB PingPong Latency: Nested Paging has the Lowest Overhead. [Figure: PingPong latencies; values of 35.0, 16.7, 13.1, and 7.0 us appear for Compute Node Linux and 11.6 and 4.8 us for Catamount.] Still investigating the cause of the poor performance of shadow paging on Catamount; it is likely due to overhead or a bug in emulating guest 2 MB pages for pass-through memory-mapped devices.

IMB PingPong Bandwidth: All Cases Converge to the Same Peak Bandwidth. For a 4 KB message on Compute Node Linux: native 285 MB/s, nested 123 MB/s, shadow 100 MB/s. For a 4 KB message on Catamount: native 381 MB/s, nested 134 MB/s, shadow 58 MB/s.

48-Node IMB Allreduce Latency: Nested Paging Wins; Most Cases Converge at Large Message Sizes. [Figures for Compute Node Linux and Catamount.]

16-byte IMB Allreduce Scaling: Native and Nested Paging Scale Similarly. [Figures for Compute Node Linux and Catamount.]

HPCCG Scaling: 5-6% Virtualization Overhead; Shadow Faster than Nested on Catamount. (Higher is better.) 48-node MFLOPS/node on Compute Node Linux: native 540, nested 507 (-6.1%), shadow 200. 48-node MFLOPS/node on Catamount: native 544, nested 495, shadow 516 (-5.1%). The poor performance of shadow paging on CNL is due to context switching and could be avoided by adding page-table caching to Palacios. Catamount does essentially no context switching, which benefits shadow paging (the O(N) vs. O(N^2) page-table walk issue discussed earlier).
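The quoted overheads follow directly from the per-node MFLOPS figures on the slide:

\[
\frac{540 - 507}{540} \approx 6.1\% \ \text{(CNL, nested)}, \qquad
\frac{544 - 516}{544} \approx 5.1\% \ \text{(Catamount, shadow)}.
\]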

CTH Scaling: <5% Virtualization Overhead; Nested Faster than Shadow on Catamount. (Lower is better.) 32-node runtime on Compute Node Linux: native 294 sec, nested 308 sec, shadow 628 sec. 32-node runtime on Catamount: native 281 sec, nested 294 sec, shadow 308 sec. The poor performance of shadow paging on CNL is due to context switching and could be avoided by adding page-table caching to Palacios.

Conclusion. The Kitten LWK is in active development: it runs on Cray XT and standard PC hardware, supports guest OSes when combined with Palacios, and is available now as open source. Virtualization experiments on the Cray XT indicate ~5% performance overhead for the CTH application. We would like to do larger-scale testing, and accelerated Portals may further reduce the overhead.