Automated Software Testing of Memory Performance in Embedded GPUs. Sudipta Chattopadhyay, Petru Eles and Zebo Peng, Linköping University





State-of-the-art in Detecting Performance Loss. Program profiling: input, profiler, program hotspots. Program inputs that expose performance loss. Detecting performance loss.

Overall Context. Write efficient software. Programming abstractions (CUDA, OpenMPI). Tools and techniques. High-performance embedded platforms (GPGPUs, multi-cores).

Overview. Tools and techniques for efficient software: performance testing, performance debugging, refactoring. High-performance embedded platforms: embedded GPUs.

Embedded GPUs. [Figure: streaming multiprocessors with SIMD cores; caches; interconnect; DRAM.]

So, what is the problem? Automatically generate test scenarios. Expose performance bottlenecks. What is a performance bottleneck? What is a test scenario? Generation of test scenarios.

Performance Bottleneck. A longer delay does not necessarily mean a bottleneck: it may come from heavy but unavoidable computation.

Embedded GPUs. [Figure: streaming multiprocessors with SIMD cores; caches; interconnect; DRAM.] Interferences in cache and memory.

Performance bottleneck in embedded GPUs. The memory subsystem is several orders of magnitude slower than on-chip registers/memories. Thread group 1: computation, memory, computation, memory. Thread group 2: memory, computation, memory, computation.

Performance bottleneck in embedded GPUs. The memory subsystem is several orders of magnitude slower than on-chip registers/memories. Thread group 1: computation, memory, computation, memory. Thread group 2: computation, wait, memory, computation. Bottleneck due to DRAM bank conflict.

Performance bottleneck in embedded GPUs. The memory subsystem is several orders of magnitude slower than on-chip registers/memories. Thread group 1: computation, memory, computation, memory. Thread group 2: cache conflict, memory, computation, memory, computation. Bottleneck due to cache conflict.

State-of-the-art in Detecting Performance Loss. Program profiling: input, profiler, execution state. Program inputs that expose performance loss. Detecting performance loss.

Test Scenarios: input selection. [Figure: Thread 1, Thread 2 and Thread 3 branch (y/n) on y1!=1, y2!=1 and y3!=1; DRAM accesses.] Random testing: DRAM contention probability = 1/2^64. Symbolic (path-based): DRAM contention probability = 1/4.
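The two probabilities on this slide can be reproduced with a short calculation. This is only one reading of the example, not taken from the slide: it assumes that contention needs two threads to reach their DRAM access together, and that each of them gets there only when its 32-bit input equals one specific value. The names and parameters below are illustrative.

```python
from fractions import Fraction

# Assumptions (not stated on the slide): two threads must collide in DRAM,
# and each thread reaches its DRAM access for exactly one 32-bit input value.
INPUT_BITS = 32
CONTENDING_THREADS = 2

def random_testing_probability(bits=INPUT_BITS, threads=CONTENDING_THREADS):
    """Chance that uniformly random inputs drive every contending thread to DRAM."""
    per_thread = Fraction(1, 2 ** bits)
    return per_thread ** threads

def path_based_probability(branches=2, threads=CONTENDING_THREADS):
    """Chance when each thread's branch outcome is picked uniformly instead."""
    per_thread = Fraction(1, branches)
    return per_thread ** threads

print(random_testing_probability() == Fraction(1, 2 ** 64))  # True
print(path_based_probability())  # 1/4
```

Under these assumptions, choosing among paths rather than among raw input values removes the dependence on the input width, which is where the gap between 1/2^64 and 1/4 comes from.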

Test Scenarios: schedule selection. [Figure: Thread 1, Thread 2 and Thread 3 branch (y/n) on y1!=1, y2!=1 and y3!=1; DRAM accesses.] Random selection: DRAM contention probability = 1/2^n, where n = number of schedule points.

Test Scenarios. Input selection and thread schedule selection: potentially unbounded combinations.

Test Generation Approach. Two-step approach: (1) static analysis to summarize the memory footprint of individual threads; (2) directed test generation using the summary.

Test Generation Approach. A static analyzer takes Thread 1, Thread 2, ..., Thread n and produces Summary 1, Summary 2, ..., Summary n. Each summary covers: cache hits or cache misses (Ferdinand et al., 2000); the reuse of cache content; memory accesses for an uninterrupted execution.

Test Generation Approach. Concrete execution + symbolic state of all threads (captures the set of paths taken by each thread).

Test Generation Approach. Concrete execution + symbolic state of all threads (captures the set of paths taken by each thread). Diversion of path: more DRAM conflicts (from cache miss information); more cache conflicts (from memory access information). Purely static. Divert to a different set of paths.

Test Generation Approach. Control dependency graph with branch conditions x < 2, y < 2, z > 3 (reaching path to a potential DRAM access).

Test Generation Approach. Control dependency graph with branch conditions x < 2, y < 2, z > 3 and a DRAM access. Generate inputs satisfying (x >= 2 /\ z <= 3).
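As a rough illustration of this step, the sketch below searches for an input that satisfies the diverting constraint (x >= 2 /\ z <= 3). The slides name GKLEE and STP for constraint generation and solving; the brute-force search over a small integer domain used here is a stand-in chosen for self-containment, and the function names and the domain are assumptions.

```python
from itertools import product

def diverting_constraint(x, y, z):
    # Constraint from the slide: the negations of x < 2 and z > 3.
    # y does not appear in the constraint, so it is left free.
    return x >= 2 and z <= 3

def find_input(constraint, domain=range(0, 8)):
    """Return the first (x, y, z) in the assumed small domain that satisfies
    the constraint, or None when no such input exists."""
    for x, y, z in product(domain, repeat=3):
        if constraint(x, y, z):
            return {"x": x, "y": y, "z": z}
    return None

print(find_input(diverting_constraint))  # {'x': 2, 'y': 0, 'z': 0}
```

A real solver would handle symbolic constraints over full-width integers; the enumeration only shows what "generate inputs satisfying the constraint" means operationally.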

Test Generation Approach. On-the-fly, at each thread selection point: memory access information (static) and DRAM state (dynamic) indicate DRAM bank conflicts.

Test Generation Approach. [Figure: thread selection point and schedule choice; DRAM bank queues for Bank 0 and Bank 1 (dynamic DRAM state); DRAM accesses/cache misses from the summaries (m1, m2, m3, ma, mb, mx, my, mz); bank mapping.]
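One way to picture the schedule choice on this slide is a greedy rule: at a thread selection point, run the thread whose next predicted DRAM access maps to the bank with the longest pending queue. The sketch below is an assumption-laden illustration, not the authors' algorithm; the bank mapping, the interleaving granularity and all names are hypothetical.

```python
NUM_BANKS = 2      # the slide shows Bank 0 and Bank 1
ROW_BYTES = 64     # assumed interleaving granularity (hypothetical)

def bank_of(address):
    """Hypothetical bank mapping: consecutive ROW_BYTES chunks alternate banks."""
    return (address // ROW_BYTES) % NUM_BANKS

def choose_thread_for_bank_conflict(bank_queues, next_miss):
    """Pick the thread whose next predicted DRAM access hits the busiest bank.

    bank_queues: dict bank -> list of pending addresses (dynamic DRAM state)
    next_miss:   dict thread id -> address of its next cache miss (from the summary)
    """
    def pending(thread):
        return len(bank_queues.get(bank_of(next_miss[thread]), []))
    # Sorting first makes ties resolve to the smallest thread id.
    return max(sorted(next_miss), key=pending)

queues = {0: [0x000, 0x080, 0x100], 1: []}
misses = {"t1": 0x040, "t2": 0x180, "t3": 0x0C0}
print(choose_thread_for_bank_conflict(queues, misses))  # t2
```

Here only t2's next miss maps to Bank 0, which already has three pending requests, so scheduling t2 is the choice most likely to produce a bank conflict.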

Test Generation Approach. On-the-fly, at each thread selection point: cache reuse information (static) and cache state (dynamic) indicate cache conflicts.

Test Generation Approach. [Figure: thread selection point and schedule choice; cache sets Set 0 (unused) and Set 1 (reused), from summary; dynamic cache state; memory accesses from the summaries (ma, mb, mx, mz, mc, md); cache mapping.]
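The cache-side schedule choice can be sketched in the same greedy style: prefer the thread whose upcoming memory accesses map to cache sets whose content is marked as reused, since evicting that content causes a conflict. This is a minimal sketch under assumptions; the cache mapping, the line size and all names are hypothetical and not taken from the slides.

```python
NUM_SETS = 2       # the slide shows Set 0 and Set 1
LINE_BYTES = 32    # assumed cache line size (hypothetical)

def set_of(address):
    """Hypothetical cache mapping: line index modulo the number of sets."""
    return (address // LINE_BYTES) % NUM_SETS

def choose_thread_for_cache_conflict(reused_sets, next_accesses):
    """Pick the thread whose upcoming accesses touch the most reused sets.

    reused_sets:   sets whose current content will be reused (summary + cache state)
    next_accesses: dict thread id -> list of upcoming addresses (from the summary)
    """
    def conflicts(thread):
        return sum(1 for a in next_accesses[thread] if set_of(a) in reused_sets)
    # Sorting first makes ties resolve to the smallest thread id.
    return max(sorted(next_accesses), key=conflicts)

reused = {1}  # Set 1 is marked "reused", Set 0 "unused"
accesses = {"t1": [0x00, 0x40], "t2": [0x20, 0x60], "t3": [0x00, 0x20]}
print(choose_thread_for_cache_conflict(reused, accesses))  # t2
```

Both of t2's accesses map to Set 1, so it is the thread most likely to evict content that another thread still needs.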

Test Generation Approach: full picture. [Figure: static information (Summary 1, Summary 2, ..., Summary n of the GPGPU program); dynamic information (execution state: cache, DRAM); input selection (symbolic testing) and schedule selection; input; execute.]

Implementation. GPGPU-Sim: a cycle-accurate GPU simulator. LLVM: compiler infrastructure. GKLEE: for generating symbolic constraints along program paths. STP: theorem prover to solve symbolic constraints.

[Figure: number of performance bottlenecks w.r.t. time.]

[Figure: evaluation with CUDA kernels.]

Summary. Software testing to discover performance bottlenecks in embedded GPUs. Systematic exploration of test inputs and thread schedules. No false positives; however, the approach might miss bottlenecks. Usage in optimizations such as cache locking and memory layout modification (see the paper). Future perspective: diagnosis of root cause; automatic fixing of performance bottlenecks.

Thank you