Automated Software Testing of Memory Performance in Embedded GPUs. Sudipta Chattopadhyay, Petru Eles and Zebo Peng, Linköping University


1 Automated Software Testing of Memory Performance in Embedded GPUs. Sudipta Chattopadhyay, Petru Eles and Zebo Peng, Linköping University

2 State-of-the-art in Detecting Performance Loss. Input; Program profiling; Profiler; Program; Hotspots

3 State-of-the-art in Detecting Performance Loss. Input; Program profiling; Program inputs that expose performance loss; Detecting performance loss; Profiler; Program; Hotspots

4 Overall Context. Programming abstractions (CUDA, OpenMPI); High-performance Embedded Platforms (GPGPUs, Multi-cores)

5 Overall Context. Write efficient software; Programming abstractions (CUDA, OpenMPI); Tools and techniques; High-performance Embedded Platforms (GPGPUs, Multi-cores)

6 Overview. Tools and techniques for Efficient Software: Performance Testing, Performance Debugging, Refactoring. High-performance embedded platforms

7 Overview. Tools and techniques for Efficient Software: Performance Testing, Performance Debugging, Refactoring. High-performance embedded platforms: Embedded GPUs

8 Embedded GPUs. [Figure labels: SIMD cores, streaming multiprocessors, caches, interconnect, DRAM]

9 So, what is the problem? Automatically generate test scenarios. Expose performance bottlenecks. What is a performance bottleneck? What is a test scenario? Generation of test scenarios

10 Performance Bottleneck. Longer delay does not necessarily mean a bottleneck; heavy but unavoidable computation

11 Embedded GPUs. [Figure labels: SIMD cores, streaming multiprocessors, caches, interconnect, DRAM] Interferences in cache and memory

12 Performance bottleneck: Embedded GPUs. Memory subsystem is several orders of magnitude slower than on-chip registers/memories. Thread group 1: Computation, Memory, Computation, Memory. Thread group 2: Memory, Computation, Memory, Computation

13 Performance bottleneck: Embedded GPUs. Memory subsystem is several orders of magnitude slower than on-chip registers/memories. Thread group 1: Computation, Memory, Computation, Memory. Thread group 2: Computation, Wait, Memory, Computation. Bottleneck due to DRAM bank conflict

14 Performance bottleneck: Embedded GPUs. Memory subsystem is several orders of magnitude slower than on-chip registers/memories. Thread group 1: Computation, Memory, Computation, Memory. Thread group 2: cache conflict, Memory, Computation, Memory, Computation. Bottleneck due to cache conflict

15 State-of-the-art in Detecting Performance Loss. Input; Program profiling; Program inputs that expose performance loss; Detecting performance loss; Profiler; Program; Execution state

16 Test Scenarios. [Figure: Thread 1, Thread 2, Thread 3 branch on y1!=1, y2!=1, y3!=1 (y/n); some branches access DRAM] Input selection. Random testing: DRAM contention probability = 1/2^64. Symbolic (path-based): DRAM contention probability = 1/4
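
The two probabilities on this slide can be reproduced with a little arithmetic. The sketch below rests on assumptions the slide does not state: each thread reads one uniformly random 32-bit input, exactly one input value steers a thread onto its DRAM-accessing branch, and contention needs two threads on that branch at once. All names are illustrative.

```python
from fractions import Fraction

INPUT_BITS = 32      # assumption: each y_i is a 32-bit value
THREADS_NEEDED = 2   # assumption: two threads must reach DRAM together


def random_testing_probability(bits=INPUT_BITS, threads=THREADS_NEEDED):
    """Chance that random inputs drive `threads` threads onto the DRAM branch."""
    per_thread = Fraction(1, 2 ** bits)  # one input value out of 2^bits
    return per_thread ** threads


def path_based_probability(threads=THREADS_NEEDED, paths_per_thread=2):
    """Chance when a path (not a raw input value) is picked per thread."""
    per_thread = Fraction(1, paths_per_thread)
    return per_thread ** threads


print(random_testing_probability() == Fraction(1, 2 ** 64))  # True
print(path_based_probability())                              # 1/4
```

Under these assumptions, picking among paths rather than raw input values is what lifts the chance of hitting the contention scenario from 1/2^64 to 1/4.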

17 Test Scenarios. [Figure: Thread 1, Thread 2, Thread 3 branch on y1!=1, y2!=1, y3!=1 (y/n); some branches access DRAM] Schedule selection. Random selection: DRAM contention probability = 1/2^n, where n = number of schedule points
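
A small enumeration illustrates why random schedule selection degrades exponentially with the number of schedule points. This is a sketch under assumptions not given on the slide: two runnable threads at every schedule point, and exactly one sequence of choices that produces DRAM contention.

```python
from fractions import Fraction
from itertools import product


def contention_probability(schedule_points, threads=2):
    """Fraction of all schedules that equal the single contention-causing one."""
    # Assumption: only the schedule that picks thread 0 at every point
    # leads to DRAM contention.
    target = (0,) * schedule_points
    schedules = list(product(range(threads), repeat=schedule_points))
    hits = sum(1 for s in schedules if s == target)
    return Fraction(hits, len(schedules))


print(contention_probability(3))  # 1/8
```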

18 Test Scenarios. Input selection; Thread schedule selection; Potentially unbounded combinations

19 Test Generation Approach. Two-step approach: (1) static analysis to summarize the memory footprint of individual threads; (2) directed test generation using the summary

20 Test Generation Approach. Thread 1, Thread 2, ..., Thread n; Static Analyzer; Summary 1, Summary 2, ..., Summary n. Summary content: cache hits or cache misses (Ferdinand et al., 2000); the reuse of cache content; memory accesses for an uninterrupted execution

21 Test Generation Approach. Concrete execution + symbolic state of all threads (captures the set of paths taken by each thread)

22 Test Generation Approach. Diversion of path: more DRAM conflicts (from cache miss information); more cache conflicts (from memory access information); purely static. Concrete execution + symbolic state of all threads (captures the set of paths taken by each thread). Divert to a different set of paths

23 Test Generation Approach. Control Dependency Graph (reaching path to potential DRAM access). [Figure: branch conditions x < 2, y < 2, z > 3; DRAM]

24 Test Generation Approach. Control Dependency Graph. [Figure: branch conditions x < 2, y < 2, z > 3; DRAM] Generate inputs satisfying (x >= 2 /\ z <= 3)
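
To make the constraint on this slide concrete, the sketch below searches for an input that negates both branch conditions on the path to the DRAM access. It is only an illustration: the brute-force search over a small integer domain stands in for a real constraint solver (the implementation slide lists GKLEE and STP for that role), and the assumption that the DRAM access sits on the false side of both branches is read off the slide's formula.

```python
from itertools import product

# Branch conditions taken from the slide's control dependency graph.
CONDITIONS = {
    "x < 2": lambda v: v["x"] < 2,
    "z > 3": lambda v: v["z"] > 3,
}

# Assumption: the DRAM access is reached when both branches are False,
# which corresponds to the constraint (x >= 2 /\ z <= 3).
TARGET_OUTCOMES = {"x < 2": False, "z > 3": False}


def find_input(domain=range(0, 6)):
    """Brute-force search standing in for a constraint solver."""
    for x, y, z in product(domain, repeat=3):
        values = {"x": x, "y": y, "z": z}
        if all(CONDITIONS[c](values) == want
               for c, want in TARGET_OUTCOMES.items()):
            return values
    return None  # no satisfying input in the searched domain


print(find_input())  # {'x': 2, 'y': 0, 'z': 0}
```

The value of y is unconstrained here because the slide's formula does not mention it.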

25 Test Generation Approach. On-the-fly; Thread selection point; Memory access information (static); DRAM state (dynamic); DRAM bank conflicts

26 Test Generation Approach. [Figure: thread selection point; DRAM bank queue holding m3, m2, m1; Bank 0 and Bank 1; accesses ma, mb and mx, my, mz with their bank mapping; dynamic DRAM state; DRAM accesses/cache misses (from summary)]

27 Test Generation Approach. [Figure: thread selection point; schedule choice; DRAM bank queue holding m3, m2, m1; Bank 0 and Bank 1; accesses ma, mb and mx, my, mz with their bank mapping; dynamic DRAM state; DRAM accesses/cache misses (from summary)]
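
One way to picture the schedule choice on this slide is as a scoring step at each thread selection point: combine the dynamic DRAM bank queues with each thread's statically predicted misses and run the thread most likely to pile onto a busy bank. The sketch below is a guess at such a heuristic, not the authors' algorithm. The modulo bank mapping, the scoring formula, and all names are assumptions.

```python
from collections import Counter

NUM_BANKS = 2  # assumption: two banks, as drawn on the slide


def bank_of(address, num_banks=NUM_BANKS):
    """Hypothetical bank mapping: bank index = address modulo bank count."""
    return address % num_banks


def pick_thread_for_dram(bank_queues, pending_misses):
    """Pick the thread whose predicted misses hit the busiest banks.

    bank_queues: dict bank -> list of queued requests (dynamic DRAM state)
    pending_misses: dict thread -> list of addresses (from static summary)
    """
    def conflict_score(thread):
        hits = Counter(bank_of(a) for a in pending_misses[thread])
        return sum(len(bank_queues.get(bank, [])) * n
                   for bank, n in hits.items())

    # sorted() makes tie-breaking deterministic (first name wins).
    return max(sorted(pending_misses), key=conflict_score)


queues = {0: ["m1", "m2", "m3"], 1: []}
misses = {"T1": [0, 2], "T2": [1, 3, 5]}
print(pick_thread_for_dram(queues, misses))  # T1
```

In this example T1 is chosen because both of its predicted misses map to Bank 0, which already holds three queued requests.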

28 Test Generation Approach. On-the-fly; Thread selection point; Cache reuse information (static); Cache state (dynamic); Cache conflicts

29 Test Generation Approach. [Figure: thread selection point; Set 0 unused and Set 1 reused (from summary); accesses ma, mb, mx, mz, mc, md with their cache mapping; dynamic cache state; memory accesses (from summary)]

30 Test Generation Approach. [Figure: schedule choice; thread selection point; Set 0 unused and Set 1 reused (from summary); accesses ma, mb, mx, mz, mc, md with their cache mapping; dynamic cache state; memory accesses (from summary)]
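
The cache-directed schedule choice can be sketched in the same spirit: prefer the thread whose predicted accesses map to cache sets whose content is marked as reused, since evicting that content creates conflicts. This is an illustrative heuristic under assumptions, not the method from the paper. The modulo set mapping, the scoring rule, and all names are hypothetical.

```python
NUM_SETS = 2  # assumption: two cache sets, as drawn on the slide


def set_of(address, num_sets=NUM_SETS):
    """Hypothetical cache mapping: set index = address modulo set count."""
    return address % num_sets


def pick_thread_for_cache(reused_sets, pending_accesses):
    """Pick the thread whose predicted accesses land in reused sets most often.

    reused_sets: set indices whose content will be reused (from summary
                 plus the dynamic cache state)
    pending_accesses: dict thread -> list of addresses (from static summary)
    """
    def conflict_score(thread):
        return sum(1 for a in pending_accesses[thread]
                   if set_of(a) in reused_sets)

    # sorted() makes tie-breaking deterministic (first name wins).
    return max(sorted(pending_accesses), key=conflict_score)


accesses = {"T1": [0, 2, 4], "T2": [1, 3]}
print(pick_thread_for_cache({1}, accesses))  # T2
```

Here T2 wins because both of its accesses map to Set 1, the set marked as reused, while all of T1's accesses fall into the unused Set 0.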

31 Test Generation Approach: full picture. Static information: Summary 1, Summary 2, ..., Summary n. GPGPU program; Input Selection (symbolic testing); Schedule Selection; Input; Execute. Dynamic information: Execution state (cache, DRAM)

32 Implementation. GPGPU-Sim: a cycle-accurate GPU simulator. LLVM: compiler infrastructure. GKLEE: for generating symbolic constraints along a program path. STP: theorem prover to solve symbolic constraints

33 Number of performance bottlenecks w.r.t. time

34 Evaluation with CUDA kernels

35 Summary. Software testing to discover performance bottlenecks in embedded GPUs. Systematic exploration of test inputs and thread schedules. No false positives; however, the approach might miss bottlenecks. Usage in optimizations such as cache locking and memory layout modification (see the paper). Future perspective: diagnosis of root cause; automatic fixing of performance bottlenecks

36 Thank you
