Multi-GPU Load Balancing for Simulation and Rendering




Multi-GPU Load Balancing for Simulation and Rendering. Yong Cao, Computer Science Department, Virginia Tech, USA

In-situ Visualization and Visual Analytics: instant visualization of and interaction with running computing tasks. Applications: computational fluid dynamics, seismic propagation, molecular dynamics, network security analysis.

Generalized Execution Loop: in every iteration the simulation stage writes its results to memory and the rendering stage reads them back.

Generalized Execution Loop (abstract form): Task 1 writes data to memory and Task 2 reads it; the pair repeats every iteration.
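
To make the loop concrete, here is a minimal single-GPU sketch; simulate_step and render_frame are hypothetical stand-ins for the two tasks, not kernels from the talk.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernels standing in for the two tasks of the loop:
// the simulation writes the shared state, the renderer only reads it.
__global__ void simulate_step(float4* state, int n) { /* data write */ }
__global__ void render_frame(const float4* state, uchar4* image, int n) { /* data read */ }

void run_loop(float4* d_state, uchar4* d_image, int n, int num_steps) {
    const int threads = 256;
    const int blocks  = (n + threads - 1) / threads;
    for (int t = 0; t < num_steps; ++t) {
        simulate_step<<<blocks, threads>>>(d_state, n);          // Task 1: data write
        render_frame<<<blocks, threads>>>(d_state, d_image, n);  // Task 2: data read
        cudaDeviceSynchronize();  // finish frame t before starting frame t+1
    }
}
```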

Parallel Execution: Task Split. Problem: the task (context) switch. Both tasks are split across the processors, so every processor must switch between Task 1 and Task 2 in each time step. Disadvantages of the context switch: the overhead of another kernel launch, flushing of the cache lines, and no support for persistent threads.
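
A rough sketch of this task-split scheme, reusing the hypothetical simulate_step / render_frame kernels from the sketch above: every GPU owns a slice of the data and must launch both kernels in every time step, which is exactly the per-step task switch the slide criticizes (any inter-GPU data exchange between steps is omitted).

```cuda
#include <cuda_runtime.h>

// Task-split sketch: each GPU processes its own slice of the problem and has
// to switch from the simulation kernel to the rendering kernel every time
// step (extra kernel launches, no persistent threads, caches shared between
// unrelated kernels).  Inter-GPU exchange of boundary data is omitted.
void run_task_split(int num_gpus, float4** d_state, uchar4** d_image,
                    int n_per_gpu, int num_steps) {
    const int threads = 256;
    const int blocks  = (n_per_gpu + threads - 1) / threads;
    for (int t = 0; t < num_steps; ++t) {
        for (int g = 0; g < num_gpus; ++g) {
            cudaSetDevice(g);
            simulate_step<<<blocks, threads>>>(d_state[g], n_per_gpu);            // task 1
            render_frame<<<blocks, threads>>>(d_state[g], d_image[g], n_per_gpu); // task 2 (switch)
        }
        for (int g = 0; g < num_gpus; ++g) {  // end-of-step synchronization
            cudaSetDevice(g);
            cudaDeviceSynchronize();
        }
    }
}
```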

Parallel Execution: Pipelining. Task 1 runs on Processor 1 while Task 2 runs on Processor 2, one time step apart. Advantages: a simpler kernel on each processor, better shared-memory and cache usage, and persistent threads that enable distributed scheduling.

Parallel Execution: Pipelining. Problem: when one stage takes longer than the other, bubbles (idle slots) appear in the pipeline.
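
For contrast, a minimal sketch of the pipelined scheme with two GPUs, again reusing the hypothetical kernels from the first sketch; the cudaMemcpyPeer hand-off and the single-slot buffering are illustrative assumptions, not the talk's actual mechanism.

```cuda
#include <cuda_runtime.h>

// Pipelined sketch: GPU 0 runs only the simulation kernel, GPU 1 runs only
// the rendering kernel.  While GPU 1 renders frame t, GPU 0 already
// simulates frame t+1.  A bubble appears whenever one stage is much slower.
void run_pipelined(float4* d_state0,  // state buffer on GPU 0 (simulation)
                   float4* d_state1,  // copy of the state on GPU 1 (rendering)
                   uchar4* d_image1,  // image buffer on GPU 1
                   int n, int num_steps) {
    const int threads = 256;
    const int blocks  = (n + threads - 1) / threads;
    for (int t = 0; t < num_steps; ++t) {
        cudaSetDevice(0);
        simulate_step<<<blocks, threads>>>(d_state0, n);  // produce frame t
        cudaDeviceSynchronize();

        // Hand the finished frame to the rendering GPU.  cudaMemcpyPeer
        // serializes with outstanding work on both devices, so this copy
        // cannot overwrite d_state1 while an earlier frame is still rendering.
        cudaMemcpyPeer(d_state1, 1, d_state0, 0, n * sizeof(float4));

        cudaSetDevice(1);
        render_frame<<<blocks, threads>>>(d_state1, d_image1, n);
        // No cross-device sync here: the next simulate_step on GPU 0
        // overlaps with this render_frame on GPU 1.
    }
}
```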

Multi-GPU Pipeline Architecture: the multi-GPU array is split into simulation GPUs and rendering GPUs, decoupled by a FIFO data buffer. In each time step the simulation GPUs write a new frame into the buffer and the rendering GPUs read a frame out.
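
A bare-bones host-side sketch of such a FIFO frame buffer; the ring layout, the field names, and the omission of producer/consumer synchronization are all illustrative simplifications.

```cuda
#include <vector>

// Fixed-capacity FIFO of simulation frames: simulation GPUs push completed
// frames at the tail, rendering GPUs pop frames from the head.  Locking
// between producer and consumer threads is omitted for brevity.
struct FrameFifo {
    std::vector<float*> slots;  // device pointers to per-frame state buffers
    int head = 0, tail = 0, count = 0;

    explicit FrameFifo(int capacity) : slots(capacity, nullptr) {}

    bool push(float* frame) {               // called on behalf of a simulation GPU
        if (count == (int)slots.size()) return false;   // buffer full
        slots[tail] = frame;
        tail = (tail + 1) % (int)slots.size();
        ++count;
        return true;
    }

    bool pop(float*& frame) {               // called on behalf of a rendering GPU
        if (count == 0) return false;                    // buffer empty
        frame = slots[head];
        head = (head + 1) % (int)slots.size();
        --count;
        return true;
    }

    double fullness() const { return double(count) / slots.size(); }
};
```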

Adaptive Load Balancing: the FIFO data buffer drives the schedule. When the buffer is full, shift a GPU from simulation toward rendering; when the buffer is empty, shift a GPU from rendering toward simulation. The scheduling is adaptive and distributed.
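
That decision rule could look roughly as follows on the host side; the 0.25 / 0.75 occupancy thresholds and the one-GPU-per-decision granularity are assumptions, and the talk's actual scheduler is distributed rather than centralized like this sketch. rebalance would be invoked once per produced or consumed frame.

```cuda
#include <cstdio>

// Buffer-driven rebalancing sketch: num_sim GPUs feed the FIFO buffer and
// (num_gpus - num_sim) GPUs drain it.  A nearly full buffer means rendering
// is the bottleneck, so one GPU shifts toward rendering; a nearly empty
// buffer means simulation is the bottleneck, so one GPU shifts back.
struct Scheduler {
    int num_gpus;
    int num_sim;    // current number of simulation GPUs
    int capacity;   // FIFO buffer capacity in frames

    void rebalance(int frames_in_buffer) {
        double fullness = double(frames_in_buffer) / capacity;
        if (fullness > 0.75 && num_sim > 1) {
            --num_sim;                         // full buffer: shift toward rendering
        } else if (fullness < 0.25 && num_sim < num_gpus - 1) {
            ++num_sim;                         // empty buffer: shift toward simulation
        }
        printf("simulation GPUs: %d, rendering GPUs: %d\n",
               num_sim, num_gpus - num_sim);
    }
};
```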

Task Partition: intra-frame partition (several GPUs cooperate on the same time step) versus inter-frame partition (each GPU works on a different time step).

Task Partition for Visual Simulation: the simulation stage uses intra-frame partition, the rendering stage uses inter-frame partition, and the two groups of GPUs in the multi-GPU array communicate through the FIFO data buffer.

Problem: the Scheduling Algorithm. Performance model: n is the number of assigned GPUs and M_i is the number of GPUs assigned to simulation (the remaining n - M_i GPUs render). The schedule chooses M_i to optimize overall throughput.
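
The slide does not spell the objective out; one plausible formulation, stated here purely as an assumption consistent with the pipeline model, is to choose the split so that neither stage limits throughput:

```latex
\[
  M_t^{*} \;=\; \operatorname*{arg\,min}_{1 \le M < n}\;
  \max\!\left( \frac{T_{\mathrm{sim}}(t)}{M},\; \frac{T_{\mathrm{render}}(t)}{\,n - M\,} \right)
\]
```

where T_sim(t) and T_render(t) are the per-frame simulation and rendering times on a single GPU; balancing the two stage times removes the pipeline bubble shown earlier.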

Case Study Application: N-body simulation with ray-traced rendering. Performance model parameters: for simulation, the number of iterations (i) and the number of simulated bodies (p); for rendering, the number of samples used for super-sampling (s). Scheduling optimization: M_t = f(i_t, s_t, p_t).

Static Load-Balancing. Assumption: the performance parameters do NOT change at run-time, so M_t = f(i_t, s_t, p_t) reduces to M = f(i, s, p). Data-driven modeling approach: sample the three-dimensional (i, s, p) parameter space on a regular grid and use tri-linear interpolation to evaluate the model for new inputs.
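
A sketch of that table lookup; the dense table layout, the axis vectors, and the function names are illustrative, not taken from the talk.

```cuda
#include <vector>

// Data-driven model f(i, s, p) -> number of simulation GPUs, evaluated by
// tri-linear interpolation over a regular grid of measured samples.
struct PerfModel {
    std::vector<double> i_axis, s_axis, p_axis;  // grid coordinates per axis
    std::vector<double> table;                   // measured optimal split per (i,s,p) sample

    double& at(int ii, int si, int pi) {
        return table[(ii * s_axis.size() + si) * p_axis.size() + pi];
    }

    // Find the cell containing x on an axis and the fractional position in it.
    static void locate(const std::vector<double>& axis, double x, int& lo, double& frac) {
        lo = 0;
        while (lo + 2 < (int)axis.size() && axis[lo + 1] <= x) ++lo;
        frac = (x - axis[lo]) / (axis[lo + 1] - axis[lo]);
    }

    double eval(double i, double s, double p) {
        int i0, s0, p0; double fi, fs, fp;
        locate(i_axis, i, i0, fi);
        locate(s_axis, s, s0, fs);
        locate(p_axis, p, p0, fp);
        double v = 0.0;
        for (int di = 0; di < 2; ++di)          // blend the 8 surrounding samples
            for (int ds = 0; ds < 2; ++ds)
                for (int dp = 0; dp < 2; ++dp) {
                    double w = (di ? fi : 1 - fi) * (ds ? fs : 1 - fs) * (dp ? fp : 1 - fp);
                    v += w * at(i0 + di, s0 + ds, p0 + dp);
                }
        return v;  // round to the nearest integer GPU count before use
    }
};
```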

Static Load-Balancing: Results. [Plots: performance-parameter sampling and the resulting load balancing, for 16 samples at 80 iterations and for 4 samples at 80 iterations.]

Dynamic Load Balancing. Assumption: the performance parameters change at run-time, so an indirect load-balance indicator is needed. Candidate 1: the execution time of the previous time step; problem: the performance difference between two consecutive time steps can be dramatic. Candidate 2: the fullness of the buffer, F.

Dynamic Load Balancing: Results. [Plots: stability of the dynamic scheduling algorithm, with no parameter change (rebalancing only at the beginning) and with the parameters changing at the dotted line.]

Comparison: Dynamic vs. Static Scheduling. [Plots: performance speedup over static load-balancing for 2000 particles and for 4000 particles.]

Conclusion. Contributions: pipelining and dynamic load balancing. Remaining issues: fine-granularity load balancing (at the SM level), communication overhead, and programmability (a software framework or library).

Questions? Contact information: Yong Cao, Computer Science Department, Virginia Tech. Email: yongcao@vt.edu. Website: www.cs.vt.edu/~yongcao