Automated Software Testing of Memory Performance in Embedded GPUs
Sudipta Chattopadhyay, Petru Eles and Zebo Peng
Linköping University
State-of-the-art in Detecting Performance Loss
Input + Program → Profiler (profiling) → Program Hotspots
State-of-the-art in Detecting Performance Loss
Input + Program → Profiler (profiling) → Program Hotspots
Missing pieces: program inputs that expose performance loss, and detecting the performance loss itself
Overall Context
Programming abstractions (CUDA, OpenMPI) ↔ High-performance embedded platforms (GPGPUs, multi-cores)
Overall Context
Goal: write efficient software. Tools and techniques bridge programming abstractions (CUDA, OpenMPI) and high-performance embedded platforms (GPGPUs, multi-cores).
Overview
Tools and techniques for efficient software on high-performance embedded platforms: performance testing, performance debugging, refactoring
Overview
Tools and techniques for efficient software on high-performance embedded platforms: performance testing, performance debugging, refactoring
Focus of this talk: embedded GPUs
Embedded GPUs
Architecture: several streaming multiprocessors, each containing SIMD cores and a cache, connected through an interconnect to DRAM
So, what is the problem?
Automatically generate test scenarios that expose performance bottlenecks.
Key questions: What is a performance bottleneck? What is a test scenario? How are test scenarios generated?
Performance Bottleneck
A longer delay does not necessarily indicate a bottleneck; the delay may come from heavy but unavoidable computation.
Embedded GPUs
Architecture: streaming multiprocessors (SIMD cores and caches), interconnect, DRAM
Threads interfere with each other in the shared caches and the memory subsystem.
Performance Bottleneck in Embedded GPUs
The memory subsystem is several orders of magnitude slower than on-chip registers/memories.
Ideal case: thread group 1 and thread group 2 alternate computation and memory phases in opposite order, so each group's memory latency is hidden behind the other group's computation.
Performance Bottleneck in Embedded GPUs
The memory subsystem is several orders of magnitude slower than on-chip registers/memories.
When both thread groups access the same DRAM bank, thread group 2 must wait for thread group 1's memory phase to finish: a bottleneck due to a DRAM bank conflict.
Performance Bottleneck in Embedded GPUs
The memory subsystem is several orders of magnitude slower than on-chip registers/memories.
When thread group 2's memory accesses evict cache content that thread group 1 still reuses, cache hits turn into additional memory accesses: a bottleneck due to a cache conflict.
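To make the phase picture concrete, here is a minimal cost-model sketch. The model and all numbers are illustrative, not from the talk: without a conflict, one group's memory phase hides behind the other group's computation; with a conflict, the memory phases serialize and one group waits each round.

```python
def round_trip_cycles(compute, memory, rounds, conflict):
    """Idealized cycle count for two thread groups that each run
    `rounds` alternations of a computation phase and a memory phase.

    conflict=False: memory latency is fully hidden behind the other
    group's computation, so each round costs compute + memory.
    conflict=True: the two memory phases target the same DRAM bank
    (or evict each other's cache content) and serialize, so each
    round pays for both memory phases."""
    if not conflict:
        return rounds * (compute + memory)
    return rounds * (compute + 2 * memory)
```

With equal 10-cycle phases and 4 rounds, the conflicting interleaving is 50% slower than the overlapped one, which is exactly the kind of gap the test generator tries to expose.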
State-of-the-art in Detecting Performance Loss
Input + Program → Profiler (profiling) → Execution state
Still missing: program inputs that expose performance loss, and detecting the performance loss itself
Test Scenarios: Input Selection
Threads 1-3 each branch on a condition (y1 != 1, y2 != 1, y3 != 1); only one outcome of each branch leads to a DRAM access.
Random testing: DRAM contention probability = 1/2^64
Symbolic (path-based) testing: DRAM contention probability = 1/4
Test Scenarios: Schedule Selection
Random selection: DRAM contention probability = 1/2^n, where n = number of schedule points
Test Scenarios
A test scenario combines input selection and thread schedule selection; the number of combinations is potentially unbounded.
Test Generation Approach
A two-step approach: (1) static analysis summarizes the memory footprint of each individual thread; (2) directed test generation uses these summaries.
Test Generation Approach
Thread 1 … Thread n → Static Analyzer → Summary 1 … Summary n
Each summary captures: cache hits or misses (Ferdinand et al., 2000), the reuse of cache content, and the memory accesses of an uninterrupted execution.
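A per-thread summary of this kind might be represented as a plain record. The field names below are hypothetical and only illustrate the three pieces of information listed above; the paper's actual representation may differ.

```python
from dataclasses import dataclass

@dataclass
class ThreadSummary:
    """Per-thread memory footprint produced by the static analyzer
    (illustrative structure, not the tool's actual data type)."""
    hit_miss: dict   # access id -> "hit" | "miss" | "unknown",
                     # in the style of Ferdinand et al.'s cache analysis
    reused: set      # cache blocks whose content this thread reuses later
    accesses: list   # memory addresses of an uninterrupted execution
```

The dynamic test-generation phase only consults these summaries, so no further static analysis is needed while schedules and inputs are explored.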
Test Generation Approach
Concrete execution + symbolic state of all threads (captures the set of paths taken by each thread)
Test Generation Approach
Starting from the concrete execution and the symbolic state of all threads, divert execution to a different set of paths. The choice of diversion is purely static: toward more DRAM conflicts (from the cache miss information) or more cache conflicts (from the memory access information).
Test Generation Approach
Control dependency graph example: a potential DRAM access is control-dependent on the branches x < 2, y < 2 and z > 3 (the reaching path to the DRAM access).
Test Generation Approach
To divert execution to a different path, generate inputs satisfying the negated branch conditions (x >= 2 /\ z <= 3).
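The diversion step can be sketched as follows, with a toy brute-force search standing in for the actual constraint solver (the real tool hands the negated conditions to STP). The guard predicates mirror the branches on this slide; everything else is illustrative.

```python
from itertools import product

def divert_inputs(guards, domain):
    """Search for an input (x, y, z) that falsifies every branch guard
    currently leading to the DRAM access, i.e. satisfies the negated
    conditions. Toy brute-force stand-in for a theorem prover."""
    for x, y, z in product(domain, repeat=3):
        if all(not g(x, y, z) for g in guards):
            return (x, y, z)
    return None  # negated path condition is unsatisfiable in this domain

# Branch conditions from the control dependency graph on this slide:
guards = [lambda x, y, z: x < 2,   # negation: x >= 2
          lambda x, y, z: z > 3]   # negation: z <= 3
```

Any returned assignment satisfies x >= 2 /\ z <= 3 and therefore steers execution away from the previously observed path.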
Test Generation Approach
On-the-fly, at each thread selection point, static memory access information is combined with the dynamic DRAM state to detect DRAM bank conflicts.
Test Generation Approach
At a thread selection point, the DRAM accesses / cache misses recorded in each thread's summary (e.g. ma, mb and mx, my, mz) are mapped via the bank mapping onto the DRAM bank queues (e.g. m1, m2, m3 pending at bank 0), yielding the dynamic DRAM state.
Test Generation Approach
Schedule choice: select the thread whose pending accesses map to already occupied bank queues, maximizing DRAM bank conflicts.
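A conflict-maximizing schedule choice can be sketched like this. The bank mapping (address modulo number of banks), the greedy heuristic and all names are illustrative assumptions, not the tool's actual implementation.

```python
def pick_thread(pending, bank_queues, n_banks=2):
    """pending: thread id -> addresses that thread would access next
    (taken from its static summary).
    bank_queues: bank id -> number of accesses already queued there.
    Greedily pick the thread whose accesses queue behind the most
    already-pending accesses, i.e. create the most bank conflicts."""
    def conflicts(addrs):
        return sum(bank_queues.get(a % n_banks, 0) for a in addrs)
    return max(pending, key=lambda t: conflicts(pending[t]))
```

For example, with three accesses already queued at bank 0, a thread whose next addresses all map to bank 0 is preferred over one whose addresses map to the empty bank 1.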
Test Generation Approach
On-the-fly, at each thread selection point, static cache reuse information is combined with the dynamic cache state to detect cache conflicts.
Test Generation Approach
Memory accesses from each thread's summary (e.g. ma, mb; mx, mz; mc, md) are mapped via the cache mapping onto cache sets; the summary marks each set's content as reused or unused (here set 0 unused, set 1 reused), yielding the dynamic cache state.
Test Generation Approach
Schedule choice: select the thread whose accesses map to cache sets holding reused content, maximizing cache conflicts.
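The cache-side schedule choice is analogous: prefer the thread whose next accesses fall into sets whose content the summaries mark as reused, since evicting reused content turns future hits into misses. Again, the address-modulo set mapping and all names are illustrative assumptions.

```python
def pick_thread_cache(pending, reused_sets, n_sets=2):
    """pending: thread id -> addresses that thread would access next.
    reused_sets: cache sets whose current content will be reused later
    (from the static summaries).
    Pick the thread that evicts the most reused content, i.e. creates
    the most cache conflicts."""
    def evictions(addrs):
        return sum(1 for a in addrs if a % n_sets in reused_sets)
    return max(pending, key=lambda t: evictions(pending[t]))
```

Accesses landing in an unused set are harmless; only those hitting a reused set count toward the conflict score.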
Test Generation Approach: Full Picture
Static information: the GPGPU program is analyzed into summaries 1 … n.
Dynamic information: input selection (symbolic testing) and schedule selection drive execution, which updates the execution state (cache, DRAM) that feeds back into both selections.
Implementation
GPGPU-Sim: a cycle-accurate GPU simulator
LLVM: compiler infrastructure
GKLEE: generates symbolic constraints along a program path
STP: theorem prover that solves the symbolic constraints
Evaluation: number of performance bottlenecks detected w.r.t. time (plot)
Evaluation with CUDA kernels
Summary
Software testing to discover performance bottlenecks in embedded GPUs
Systematic exploration of test inputs and thread schedules
No false positives; however, bottlenecks might be missed
Usable in optimizations such as cache locking and memory layout modification (see the paper)
Future perspective: diagnosis of root causes, automatic fixing of performance bottlenecks
Thank you