VALAR: A BENCHMARK SUITE TO STUDY THE DYNAMIC BEHAVIOR OF HETEROGENEOUS SYSTEMS


VALAR: A BENCHMARK SUITE TO STUDY THE DYNAMIC BEHAVIOR OF HETEROGENEOUS SYSTEMS
Perhaad Mistry, Yash Ukidave, Dana Schaa, David Kaeli
Department of Electrical and Computer Engineering, Northeastern University, Boston, USA
GPGPU6 @ ASPLOS 2013, Houston, TX, 16th March 2013
GPGPU 6, March 2013

WHAT IS THIS TALK ABOUT?
A benchmark suite for heterogeneous computing, written in OpenCL, that allows us to study the interaction between compute devices in heterogeneous application environments.

TOPICS
- Goals of an alternative benchmark suite for heterogeneous computing
- Classifying heterogeneous applications based on their behavior and their mapping to compute devices
- Brief overview of Valar's benchmarks
- Evaluation methodology
- Example exploration studies
- Conclusions and future work

MOTIVATION
- Benchmarks for evaluating workload partitioning on CPU-GPU systems
- Most open source benchmark suites for heterogeneous systems do not utilize both the CPU and GPU device(s) for compute in OpenCL
- Allow a wide range of behavior(s) within the same application to evaluate data movement optimizations
- A benchmark suite with different behavior scenarios of heterogeneous applications
- To evaluate runtimes and schedulers targeting heterogeneous systems
- Fit somewhere between microbenchmarks and complete applications

APPLICATION CLASSIFICATION - IMPLEMENTATION
- Implementation classification covers the mapping of computation onto the compute devices present
- Mapping can be decided statically or dynamically
- Determined by the algorithm's development and mapping to the compute device
- Compute Pipeline: large stream of kernels and minimal IO
- Multidevice Execution: computation partitioned over multiple devices, with or without frequent communication

APPLICATION CLASSIFICATION - BEHAVIORAL
- Behavioral classification covers the algorithm's usage scenario
- Separates discussion of an application's implementation from discussion of its behavior
- Quality of Service behavior: application depends on error or data characteristics
- Multiple Independent behavior: small independent tasks continuously offloaded
- High B/W Input behavior: large data streams, high bandwidth GPU workloads

VALAR'S APPLICATIONS - PHYSICS SIMULATION
- Collision pipeline: a physics application where the combination of large and small particles defines workload behavior
- GPU performs the small-small collisions
- CPU performs the large-small and large-large collisions
- Behavioral space explored using:
  - Number of particles
  - Ratio of large and small particles
[Figure: GPU/CPU pipeline diagram. Labels: (Posn, Vel, Force) data exchanged between GPU and CPU; stages Build Grid, SS Collide, LS Collide, ForceLS, LL Collide, S Integrate, LL Integrate, with synchronization points.]
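
The device split described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the boolean `is_large` list, and the all-pairs enumeration are hypothetical (the slide's diagram includes a grid-building stage, which is not modeled here). It shows how the ratio of large to small particles shifts pair-wise collision work between the GPU and the CPU.

```python
from itertools import combinations


def partition_collision_pairs(is_large):
    """Split all particle pairs into GPU work (small-small) and CPU work
    (large-small and large-large), following the slide's device split."""
    gpu_pairs, cpu_pairs = [], []
    for i, j in combinations(range(len(is_large)), 2):
        if is_large[i] or is_large[j]:
            cpu_pairs.append((i, j))  # at least one large particle
        else:
            gpu_pairs.append((i, j))  # both particles are small
    return gpu_pairs, cpu_pairs


def work_split(num_particles, large_ratio):
    """Return (GPU pair count, CPU pair count) for a given particle mix.
    The first round(num_particles * large_ratio) particles are marked large."""
    num_large = round(num_particles * large_ratio)
    is_large = [k < num_large for k in range(num_particles)]
    gpu_pairs, cpu_pairs = partition_collision_pairs(is_large)
    return len(gpu_pairs), len(cpu_pairs)
```

For example, `work_split(10, 0.2)` and `work_split(10, 0.5)` can be compared to see the CPU share grow as the fraction of large particles increases.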

VALAR'S APPLICATIONS - FINITE IMPULSE RESPONSE (FIR) FILTER
- Adaptive FIR: a streaming DSP application used in audio filtering, speech recognition, and pulse detection
- Output signal generated by multiplying the input with a set of taps
- Adaptive FIR changes the weights of the filter taps on a separate command queue, based on signal characteristics
- Behavioral space explored using:
  - Filter block size and number of taps: compute intensity
  - Dispatch frequency: IO frequency and size
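
A minimal sketch of block-wise FIR filtering, assuming a direct-form filter in plain Python rather than the benchmark's OpenCL kernel. The names `fir_block`, `run_fir`, and the optional `update_taps` callback are hypothetical; the callback merely stands in for the adaptive step that the slide says runs on a separate command queue, and it must return the same number of taps.

```python
def fir_block(samples, taps, history):
    """Filter one block. `history` holds the last len(taps) - 1 input
    samples of the previous block so that blocks join seamlessly."""
    padded = history + samples
    n_hist = len(history)
    out = []
    for n in range(len(samples)):
        acc = 0.0
        for k, tap in enumerate(taps):
            acc += tap * padded[n_hist + n - k]  # y[n] = sum_k taps[k] * x[n - k]
        out.append(acc)
    # Keep the most recent n_hist inputs for the next block.
    return out, padded[len(padded) - n_hist:]


def run_fir(signal, taps, block_size, update_taps=None):
    """Process `signal` block by block; returns (output, number of dispatches).
    `update_taps(block, taps)` is a hypothetical hook for the adaptive step."""
    history = [0.0] * (len(taps) - 1)
    output, dispatches = [], 0
    for start in range(0, len(signal), block_size):
        block = signal[start:start + block_size]
        filtered, history = fir_block(block, taps, history)
        output.extend(filtered)
        dispatches += 1
        if update_taps is not None:
            taps = update_taps(block, taps)
    return output, dispatches
```

Larger blocks or more taps raise the work per dispatch, while a smaller `block_size` raises the number of dispatches for the same signal.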

VALAR'S APPLICATIONS - SEARCH
- Search application: simple application searching for a range of values in data
- GPU OpenCL kernel searches for a set of target data values in blocks of data
- Application hands off the resultant data to the CPU for a final reduction
- Behavioral space explored using:
  - Interval: communication frequency of results from GPU to CPU
  - Data pool size: size of GPU kernel
[Figure: CPU/GPU flow diagram. Labels: Initialize Data Range, Synchronization, Search Kernel, Initial Reduction, Synchronization, Final Reduction & Init new data range.]
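
The search-then-reduce flow can be sketched as follows. This is an illustrative Python model, not the benchmark's code: `search_block` stands in for the GPU kernel and its initial reduction, and `interval` is assumed here to mean the number of searched blocks between hand-offs to the CPU, which is one possible reading of the slide.

```python
def search_block(block, lo, hi):
    """Stand-in for the GPU search kernel plus initial reduction:
    count the values in `block` that fall inside [lo, hi]."""
    return sum(1 for v in block if lo <= v <= hi)


def run_search(data, block_size, interval, lo, hi):
    """Search `data` block by block and hand partial results to the
    'CPU' every `interval` blocks. Returns (total matches, hand-offs)."""
    total = 0
    pending = []   # partial results not yet handed to the CPU
    handoffs = 0
    for start in range(0, len(data), block_size):
        pending.append(search_block(data[start:start + block_size], lo, hi))
        if len(pending) == interval:
            total += sum(pending)  # final reduction on the CPU side
            pending.clear()
            handoffs += 1
    if pending:                    # flush any leftover partial results
        total += sum(pending)
        handoffs += 1
    return total, handoffs
```

A larger `interval` gives the same total with fewer hand-offs, which is the knob the slide uses to vary CPU-GPU communication frequency.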

VALAR'S APPLICATIONS - SPEEDED UP ROBUST FEATURES (SURF)
- SURF: feature detection application that summarizes an image into a number of interest points
- Applications in object recognition, tracking, and image stitching
- Behavioral space explored using:
  - Image size: host-device I/O size and compute intensity
  - Image color patterns: compute intensity

VALAR'S APPLICATIONS - TRAFFIC
- Traffic application: cellular automaton model (NS model) for road traffic flow, used to reproduce traffic jams
- Models traffic jams as an emergent phenomenon due to interaction between cars on the road
- Behavioral space explored using:
  - Number of cars and their distribution: compute intensity of kernels
  - Maximum velocity: affects number of kernel calls per timestep
- Simple OpenCL kernel called over multiple strides
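
For readers unfamiliar with this kind of model, here is a minimal single-lane sketch, assuming "NS model" refers to the Nagel-Schreckenberg cellular automaton on a circular road. The function name, the parameters (`road_length`, `v_max`, `p_slow`), and the sequential Python loop are illustrative assumptions and do not reflect how the benchmark's OpenCL kernel is organized.

```python
import random


def ns_step(positions, velocities, road_length, v_max, p_slow, rng):
    """One parallel update of a single-lane ring road.
    Rules per car: accelerate, brake to the gap ahead, randomly slow, move."""
    n = len(positions)
    order = sorted(range(n), key=lambda i: positions[i])
    new_vel = list(velocities)
    for idx, i in enumerate(order):
        ahead = order[(idx + 1) % n]
        # Number of empty cells between this car and the next one.
        gap = (positions[ahead] - positions[i] - 1) % road_length
        v = min(velocities[i] + 1, v_max)    # 1. accelerate
        v = min(v, gap)                      # 2. brake to avoid collision
        if v > 0 and rng.random() < p_slow:  # 3. random slowdown
            v -= 1
        new_vel[i] = v
    # 4. move all cars using the old positions (parallel update)
    new_pos = [(positions[i] + new_vel[i]) % road_length for i in range(n)]
    return new_pos, new_vel
```

Calling `ns_step` repeatedly with `rng = random.Random(seed)` and a nonzero `p_slow` lets jams form and dissolve without any explicit jam logic, which is the emergent behavior the slide refers to.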

PERFORMANCE ANALYSIS IN A HETEROGENEOUS HIERARCHY
- Categorization goal: reflect algorithm, data mapping and kernel optimization in benchmark selection
- Layers to study heterogeneous application performance:
  - AL0: Application input
  - AL1: OpenCL level behavior (host-device behavior induced by input arguments)
  - AL2: Compute device specific hardware counter statistics
- Abstraction layers (table on slide): AL0 Benchmark options; AL1 Host-device interaction; AL2 Device H/W performance counters (Southern Islands GPUs)
- Performance and behavior metrics (table on slide): input arguments and data to benchmarks; kernel execution frequency vs IO; kernel calls on CPU vs GPU; memory transaction frequency; memory transaction size; Vector ALU busy %; Scalar ALU busy %; Mem-Unit busy %; registers used; local memory used; throughput and time

PERFORMANCE ANALYSIS IN A HETEROGENEOUS HIERARCHY
- Categorization goal: reflect algorithm, data mapping and kernel optimization in benchmark selection
- Layers to study heterogeneous application performance:
  - AL0: Application input
  - AL1: OpenCL level behavior (host-device behavior induced by input arguments)
  - AL2: Compute device specific hardware counter statistics
- Measurement tools shown: argument tracking, OpenCL event based profiler, AMD APP Profiler

EXPERIMENTAL EVALUATION
- Kernel optimization studies are possible with Valar
- OpenCL kernels optimized while maintaining correctness on all OpenCL compliant platforms
- Experiments based on the host-device interaction can be used for the following architectural research:
  - Effects of data dependent kernels
  - Benefits of host-device IO optimizations like write combining
  - Kernel call and communication cost
  - Different OpenCL buffer management strategies

OPENCL KERNELS - DATA DEPENDENT KERNELS IN VALAR
- Vector ALU utilization and memory unit utilization on AMD Southern Islands GPUs
- Performance variation seen over the runtime of the application for representative input cases

INTERACTION RESULTS - FIR
- The effect of write combining on application throughput on fused and discrete devices
- Dispatch denotes the number of blocks combined in one kernel invocation
- Requires an application with enough flexibility in host-device IO and kernel
- Limited performance benefit seen for fused platforms and higher dispatch sizes

INTERACTION RESULTS - SEARCH
- Search: less coupled application; CPU-GPU communication is less frequent
- Effect of communication on application throughput in heterogeneous systems
- Comparing a midrange discrete GPU with an APU device
- APU system throughput is comparable for small communication intervals

INTERACTION RESULTS - SEARCH
CPU performance: discrete vs APU
- At high communication, CPU kernel performance on the APU reduces
- CPU kernel does gain from quad core HT vs quad core
GPU performance: discrete vs APU
- Improvement for less frequent communication, more work on GPU
- High BW of SI GPUs vs APU is decisive to throughput as communication reduces

INTERACTION RESULTS - PHYSICS
- Effect of CPU compute capacity on application throughput for a coupled application
- Application throughput for different particle distributions
- Throughput for APU and discrete in a similar range
- Time per step is affected by large particle counts

INTERACTION RESULTS - PHYSICS
- Effect of CPU compute capacity on application throughput for a coupled application
- Throughput for different large particle counts
- More large particles increase the amount of work on the CPU
- Substantial reduction in throughput
- Time per step is affected by large particle counts

CONCLUSIONS AND FUTURE WORK
Conclusions:
- Valar attempts to provide benchmarks that can generate a range of heterogeneous behavior for architectural research and application comparison
Future work (architectural research):
- Compare against discrete implementations and other programming models
- Evaluate power swishing on APUs and evaluate mobile low power SoCs
Future work (applications):
- Predator algorithm (TLD): coupled machine learning and feature detection
- More applications required, especially concurrent command queue usage
- Physics needs a CPU OpenCL command queue instead of a thread-pool
- Traffic needs a better algorithm, and the lane change model needs to be improved

THANK YOU! QUESTIONS? COMMENTS?
Perhaad Mistry
pmistry@ece.neu.edu
https://code.google.com/p/valar-bench/

INTERACTION RESULTS - SURF IMAGE COMPARE
- Preprocessing added on the CPU device at the beginning of the pipeline
- Comparison kernel calculates the difference between two gray-scale images
- Preprocessing result decides whether to launch the pipeline
- Higher threshold values improve performance because more frames are skipped
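
The gating idea can be sketched as below. This is a hedged illustration only: the slide does not say which difference metric is used or which two frames are compared, so the mean absolute pixel difference and the comparison against the last processed frame are assumptions, as are the function names.

```python
def mean_abs_diff(frame_a, frame_b):
    """Mean absolute pixel difference between two equally sized
    gray-scale frames given as lists of rows (assumed metric)."""
    total = 0
    count = 0
    for row_a, row_b in zip(frame_a, frame_b):
        for a, b in zip(row_a, row_b):
            total += abs(a - b)
            count += 1
    return total / count


def frames_to_process(frames, threshold):
    """Return indices of frames for which the full pipeline would run.
    A frame is skipped when it differs from the last processed frame
    by no more than `threshold`."""
    processed = [0]            # the first frame is always processed
    reference = frames[0]
    for idx in range(1, len(frames)):
        if mean_abs_diff(reference, frames[idx]) > threshold:
            processed.append(idx)
            reference = frames[idx]
    return processed
```

Raising `threshold` shortens the returned list, which corresponds to more skipped frames and less pipeline work.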

EXTRA STUFF

PERFORMANCE RESULTS - SURF ORIENTATION COMPARE
- Orientation comparison is useful if there is no camera rotation
- Test case for overhead, since the orientation step is < 10% of SURF computation
- Execution of the compute pipeline is interrupted to compare orientation vs. the previous frame
- Frequency of orientation comparison increased; "native" denotes no HAPTIC
- More degradation in average performance seen for small videos

VALAR'S APPLICATIONS - PHYSICS SIMULATION
- Collision detection pipeline
- The combination of large and small particles decides workload behavior
- GPU performs the small-small collisions
- CPU performs the large-small and large-large collisions
- Behavioral space explored using:
  - Number of particles
  - Ratio of large and small particles