Using process address space on the GPU


Jérôme Glisse, September 2013

Outline

- Motivation
- Memory management
- Hardware solution
- Software solution
- On the graphics side

The middle man

    void vec_add(float *vr, float *va, float *vb, unsigned size)
    {
        gpu_bo gpu_a, gpu_b, gpu_r;

        gpu_a = gpu_alloc(sizeof(float) * size);
        gpu_b = gpu_alloc(sizeof(float) * size);
        gpu_r = gpu_alloc(sizeof(float) * size);
        gpu_upload(gpu_a, va, sizeof(float) * size);
        gpu_upload(gpu_b, vb, sizeof(float) * size);
        gpu_shader_execute(vec_add_shader, gpu_r, gpu_a, gpu_b);
        gpu_download(gpu_r, vr, sizeof(float) * size);
    }

Cut the middle man

    void vec_add(float *vr, float *va, float *vb, unsigned size)
    {
        gpu_shader_execute(vec_add_shader, vr, va, vb, size);
    }

Memory management

Inherently concurrent: the pagetable is updated concurrently by
- Userspace map/unmap.
- Memory reclaim.
- Deduplication (KSM).
- Migration (NUMA architectures).
- Compaction.
- ...

The more memory pressure, the more concurrent updates.

Synchronization

Pagetable update
- Concurrency implies no global serialization; the pagetable itself is the synchronization point.
- Save the pte (pagetable entry).
- Perform the job (swapping, migrating, ...).
- Check the pte is unchanged: if so, update it; otherwise back off.

The page backing an address might change at any time.
Pinning would defeat memory management.
Taking a reference defeats memory management too.

Notify me

Virtualization
- The host kernel has the global overview.
- The guest kernel has a local overview.
- Efficiency needs communication both ways.

Meet the mmu notifier API
- Brackets large pagetable updates with range start/end callbacks.
- Allows proper youngness accounting.
- ...
- No serialization: concurrent notifications on overlapping ranges.

Moving target

Catch me if you can
- The pagetable is a moving target.
- The page behind an address might change at any time.
- Playing catch-up.

Mirroring process address space: requirements
- Both CPU and GPU use the same page for the same address.
- Concurrent access by both (in most cases at least).
- No pinning.
- No references.
- ...

Road forks

- Hardware solution.
- Software solution.

Meet the new middle man, the IOMMU

IOMMU original motivations
- I/O protection and isolation (security).
- Easy remapping and scatter-gather.
- Middle man between device and system memory.
- Virtualization and device isolation.
- ...

IOMMU unique position
- Middle man.
- Close to the CPU (on the same die since the memory controller got merged).
- An IOMMU can be tied to a specific CPU.
- An IOMMU can understand the CPU pagetable format.
- An IOMMU can walk any process pagetable and relay the information.

IOMMU and the PCIe ATS/PASID case

PASID PCIe specification
- PASID = Process Address Space IDentifier.
- Unique ID associated with a process, and thus with a pagetable.
- Devices can tell the IOMMU which processes they are interested in.
- ...

ATS PCIe specification
- ATS = Address Translation Service.
- The IOMMU translates virtual addresses to physical addresses.
- Devices can cache IOMMU translations in their TLB.
- Devices must offer a TLB flush/invalidation mechanism.
- Carries over protection bits (read, write, execute, ...).
- ...

How devices make use of address translation

Device pagetable case
- The device manages its own pagetable; a flag bit in each entry tells whether the address should go through IOMMU address translation or not.
- Flexible: can mix VRAM and system RAM.
- Flag at any level of the pagetable (TLB cache optimization).
- Complex TLB cache and memory controller.

Device aperture case
- Addresses inside (or outside) the aperture use ATS/PASID.
- Addresses outside (or inside) go through the usual device memory controller.
- Not flexible; coarse granularity.
- Simpler memory controller and TLB cache.

The limits of the hardware solution

Partial solution
- One way only: the device uses the pagetable controlled by the CPU.
- Cannot temporarily copy data into VRAM.

IOMMUv2 Linux limitations
- The IOMMU uses an empty pagetable during a CPU pagetable update:
  - Adds latency.
  - Can be very frequent under memory pressure.
- Solvable by adding a flag to the CPU pagetable entry.

GPUs are VRAM junkies

Never fast enough, never big enough
- GPUs crave high bandwidth and low latency.
- VRAM: up to 200GB/s, vs 20GB/s for system memory.
- Up to 4 times lower latency with VRAM.
- Such bandwidth is unlikely to happen soon for system memory.

Software solution
- Use of VRAM requires changes to the Linux kernel.
- The kernel must know about VRAM and where things are.
- Not exclusive with the hardware solution.
- Software can handle VRAM objects only.

Challenges

Serialize
- Device pagetable updates cannot be done from the CPU.
- Serialization needed between CPU and device pagetable updates.
- Pagetable coherency (the same page backing the same address).
- Serialization might badly hurt performance.

Minimize
- Minimize changes needed to core mm codepaths.
- Avoid disruptive changes.
- Do not break Linux APIs (like memory cgroup).
- Do not add another point of failure for process or file corruption.
- ...

It is happening

- Partnership between NVIDIA and Red Hat.
- Working prototype.
- Should soon be sent as an RFC upstream.

Open and useful to others

- Generic, device-agnostic API.
- No assumption on what the device does.

Useful for any hardware with:
- A pagetable and support for true pagefaults.
- Support for read-only page entries.
- Preemptable workloads.
- Page size on the device can differ from the CPU's.

Best to have:
- Dirty accounting to avoid over-dirtying.
- Fast preemption.
- Fast device pagetable updates.
- ...

Graphics use cases

- Texture upload without memcpy.
- Next sparse texture extension: automatic load from disk.
- GL using the process address space for seamless compute shaders.
- Cache policy needs a custom API (UC, WC, ...).