VALAR: A BENCHMARK SUITE TO STUDY THE DYNAMIC BEHAVIOR OF HETEROGENEOUS SYSTEMS

Perhaad Mistry, Yash Ukidave, Dana Schaa, David Kaeli
Department of Electrical and Computer Engineering, Northeastern University, Boston, USA
GPGPU6 @ ASPLOS 2013, Houston, TX, 16th March 2013

WHAT IS THIS TALK ABOUT?
A benchmark suite for heterogeneous computing, written in OpenCL, that allows us to study the interaction between compute devices in heterogeneous application environments

TOPICS
Goals of an alternative benchmark suite for heterogeneous computing
Classifying heterogeneous applications based on their behavior and their mapping to compute devices
Brief overview of Valar's benchmarks
Evaluation methodology
Example exploration studies
Conclusions and future work

MOTIVATION
Benchmarks for evaluating workload partitioning on CPU-GPU systems
  Most open source benchmark suites for heterogeneous systems do not utilize both the CPU and GPU device(s) for compute in OpenCL
  Allow a wide range of behavior(s) within the same application to evaluate data movement optimizations
A benchmark suite with different behavior scenarios of heterogeneous applications
  To evaluate runtimes and schedulers targeting heterogeneous systems
  Fit somewhere between microbenchmarks and complete applications

APPLICATION CLASSIFICATION - IMPLEMENTATION
Implementation classification covers the mapping of computation onto the compute devices present
  Mapping could be decided statically or dynamically
  Determined by the algorithm's development and mapping to the compute device
Compute Pipeline: Large stream of kernels and minimum IO
Multidevice Execution: Computation partitioned over multiple devices with or without frequent communication

APPLICATION CLASSIFICATION - BEHAVIORAL
Behavioral classification covers the algorithm's usage scenario
  Separate discussion of implementation of application and its behavior
Quality of Service Behavior: Application depends on error or data characteristics
Multiple Independent Behavior: Small independent tasks continuously offloaded
High B/W Input Behavior: Large data streams, high bandwidth GPU workloads

VALAR'S APPLICATIONS - PHYSICS SIMULATION
Collision pipeline: A physics application where the combination of large and small particles defines workload behavior
  GPU performs the small-small collisions
  CPU performs the large-small and large-large collisions
Behavioral space explored using
  Number of particles
  Ratio of large and small particles
[Figure: GPU/CPU pipeline diagram. GPU holds (Posn, Vel, Force). Stages shown: Build Grid, Synchronization, SS Collide, LS Collide, ForceLS, LL Collide, Synchronization, S Integrate, LL Integrate]
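The device split described on this slide can be sketched in a few lines of Python. This is an illustration only, not Valar's implementation: it enumerates all pairs by brute force instead of using a grid, and the names `partition_pairs` and `work_split` are hypothetical. It shows how the ratio of large to small particles shifts pair work between the GPU and the CPU.

```python
from itertools import combinations


def partition_pairs(is_large):
    """Split particle index pairs by collision type (brute force, no grid).

    is_large[i] is True when particle i is a large particle.
    Returns (gpu_pairs, cpu_pairs), following the slide's mapping:
    small-small pairs go to the GPU, any pair with a large particle to the CPU.
    """
    gpu_pairs, cpu_pairs = [], []
    for i, j in combinations(range(len(is_large)), 2):
        if is_large[i] or is_large[j]:
            cpu_pairs.append((i, j))  # large-small or large-large
        else:
            gpu_pairs.append((i, j))  # small-small
    return gpu_pairs, cpu_pairs


def work_split(n_particles, n_large):
    """Count candidate pairs per device for a given particle mix."""
    n_small = n_particles - n_large
    small_small = n_small * (n_small - 1) // 2
    total = n_particles * (n_particles - 1) // 2
    return small_small, total - small_small
```

Sweeping `n_large` in `work_split` while holding `n_particles` fixed gives a rough feel for how the particle ratio moves candidate pairs from the GPU to the CPU.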

VALAR'S APPLICATIONS - FINITE IMPULSE RESPONSE (FIR) FILTER
Adaptive FIR: A streaming DSP application used in audio filtering, speech recognition, and pulse detection
  Output signal generated by multiplying input samples with a set of taps
  Adaptive FIR changes the weights of the filter taps on a separate command queue based on signal characteristics
Behavioral space explored using
  Filter block size and number of taps: Compute intensity
  Dispatch frequency: IO frequency and size
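A minimal Python sketch of block-wise FIR filtering, to make the block size and tap count parameters concrete. It is not the benchmark's OpenCL kernel. The function names and the optional `adapt` callback are hypothetical, and the slide does not say which adaptation rule Valar uses, so none is assumed here.

```python
def fir_block(block, taps, history):
    """Filter one block of samples.

    history holds the last len(taps) - 1 input samples of the previous block,
    so the output is the same as filtering the whole stream at once.
    """
    n_hist = len(taps) - 1
    padded = list(history) + list(block)
    out = []
    for n in range(len(block)):
        acc = 0.0
        for k, tap in enumerate(taps):
            acc += tap * padded[n_hist + n - k]  # y[n] = sum_k taps[k] * x[n - k]
        out.append(acc)
    new_history = padded[len(padded) - n_hist:]
    return out, new_history


def filter_stream(signal, taps, block_size, adapt=None):
    """Filter a stream block by block.

    adapt (hypothetical hook) may return new tap weights after each block;
    it must keep the number of taps unchanged.
    """
    history = [0.0] * (len(taps) - 1)
    output = []
    for start in range(0, len(signal), block_size):
        block = signal[start:start + block_size]
        out, history = fir_block(block, taps, history)
        output.extend(out)
        if adapt is not None:
            taps = adapt(taps, block, out)
    return output
```

A larger `block_size` means fewer, larger dispatches for the same stream, which is the trade-off the behavioral space on this slide explores.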

VALAR'S APPLICATIONS - SEARCH
Search application: Simple application searching for a range of values in data
  GPU OpenCL kernel searches for a set of target data values in blocks of data
  Application hands off the resultant data to the CPU for a final reduction
Behavioral space explored using
  Interval: Communication frequency of results from GPU to CPU
  Data pool size: Size of GPU kernel
[Figure: CPU/GPU pipeline diagram. Stages shown: Initialize Data Range, Synchronization, Search Kernel, Initial Reduction, Synchronization, Final Reduction & Init new data range]
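The role of the interval parameter can be sketched in plain Python. This is a sequential stand-in, not the OpenCL benchmark: `search_block` plays the part of the GPU kernel, the final reduction is assumed to be a simple match count, and all names are hypothetical.

```python
def search_block(block, lo, hi):
    """Stand-in for the GPU search kernel: values of block within [lo, hi]."""
    return [v for v in block if lo <= v <= hi]


def run_search(data, lo, hi, block_size, interval):
    """Search data block by block, handing results to the CPU side
    every `interval` blocks (and once more at the end if needed).

    Returns (total_hits, handoffs).
    """
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    pending = []     # matches accumulated on the "GPU" side
    total_hits = 0   # running result of the "CPU" final reduction
    handoffs = 0     # number of GPU-to-CPU communications
    for idx, block in enumerate(blocks, start=1):
        pending.extend(search_block(block, lo, hi))
        if idx % interval == 0 or idx == len(blocks):
            total_hits += len(pending)  # assumed reduction: count the matches
            pending.clear()
            handoffs += 1
    return total_hits, handoffs
```

The result is the same for any interval; only the number of handoffs changes, which is what makes the parameter useful for studying communication cost.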

VALAR'S APPLICATIONS - SPEEDED UP ROBUST FEATURES (SURF)
SURF: Feature detection application that summarizes an image into a number of interest points
  Applications in object recognition, tracking, image stitching
Behavioral space explored using
  Image size: Host-device I/O size and compute intensity
  Image color patterns: Compute intensity

VALAR'S APPLICATIONS - TRAFFIC
Traffic application: Cellular automaton model (NS model) for road traffic flow to reproduce traffic jams
  Models traffic jams as an emergent phenomenon due to interaction between cars on the road
Behavioral space explored using
  Number of cars and their distribution: Compute intensity of kernels
  Maximum velocity: Affects number of kernel calls per timestep
Simple OpenCL kernel called over multiple strides
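For readers unfamiliar with the cellular automaton, here is a small Python sketch of one update step, assuming "NS model" refers to the Nagel-Schreckenberg model on a single-lane circular road. It is a sequential illustration, not the benchmark's OpenCL kernel, and the function name and parameters are hypothetical.

```python
import random


def ns_step(positions, velocities, road_length, v_max, p_brake, rng):
    """One Nagel-Schreckenberg update on a circular single-lane road.

    positions are cell indices listed in driving order, so car i follows
    car (i + 1) % n. All cars are updated in parallel from the old state.
    """
    n = len(positions)
    new_velocities = []
    for i in range(n):
        leader = positions[(i + 1) % n]
        gap = (leader - positions[i] - 1) % road_length  # free cells ahead
        v = min(velocities[i] + 1, v_max)   # 1. accelerate
        v = min(v, gap)                     # 2. slow down to avoid the leader
        if v > 0 and rng.random() < p_brake:
            v -= 1                          # 3. random braking
        new_velocities.append(v)
    new_positions = [(p + v) % road_length  # 4. move
                     for p, v in zip(positions, new_velocities)]
    return new_positions, new_velocities
```

Usage: `ns_step([0, 3], [0, 0], 10, 5, 0.0, random.Random(0))` advances two cars by one timestep with random braking disabled. Raising `p_brake` or the car density is what makes jams emerge over repeated steps.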

PERFORMANCE ANALYSIS IN A HETEROGENEOUS HIERARCHY
Categorization goal: Reflect algorithm, data mapping and kernel optimization in benchmark selection
Layers to study heterogeneous application performance
  AL0 - Application input
  AL1 - OpenCL level behavior: Host-device behavior induced by input arguments
  AL2 - Compute device specific: Hardware counter statistics
[Table: Abstraction Layer vs. Performance and Behavior Metrics]
Abstraction layers: AL0 Benchmark Options; AL1 Host-Device Interaction; AL2 Device H/W Perf. Counters (Southern Islands GPUs)
Performance and behavior metrics, in the order shown: Input arguments and data to benchmarks; Kernel execution freq. vs IO; Kernel calls on CPU vs GPU; Memory Transaction Freq; Memory Transaction Size; Vector ALU Busy %; Scalar ALU Busy %; Mem-Unit Busy %; Registers Used; Local Memory Used; Throughput & time

PERFORMANCE ANALYSIS IN A HETEROGENEOUS HIERARCHY
Categorization goal: Reflect algorithm, data mapping and kernel optimization in benchmark selection
Layers to study heterogeneous application performance
  AL0 - Application input
  AL1 - OpenCL level behavior: Host-device behavior induced by input arguments
  AL2 - Compute device specific: Hardware counter statistics
Tools: Argument tracking; OpenCL event-based profiler; AMD APP Profiler

EXPERIMENTAL EVALUATION
Kernel optimization studies are possible with Valar
  OpenCL kernels optimized while maintaining correctness on all OpenCL-compliant platforms
Experiments based on host-device interaction can be used for the following architectural research
  Effects of data-dependent kernels
  Benefits of host-device IO optimizations like write combining
  Kernel call and communication cost
  Different OpenCL buffer management strategies

OPENCL KERNELS - DATA DEPENDENT KERNELS IN VALAR
Vector ALU utilization and memory unit utilization on AMD Southern Islands GPUs
Performance variation seen over the runtime of the application for representative input cases

INTERACTION RESULTS - FIR
The effect of write combining on application throughput for fused and discrete devices
  Dispatch denotes the number of blocks combined in one kernel invocation
  Requires an application with enough flexibility in host-device IO and kernel
Limited performance benefit seen for fused platforms and higher dispatch sizes

INTERACTION RESULTS - SEARCH
Search: less coupled application, CPU-GPU communication is less frequent
Effect of communication on application throughput in heterogeneous systems
  Comparing a midrange discrete GPU with an APU device
  APU system throughput comparable for small communication interval

INTERACTION RESULTS - SEARCH
CPU performance: discrete vs APU
  At high communication: CPU kernel performance on APU reduces
  CPU kernel does gain from quad-core HT vs quad-core
GPU performance: discrete vs APU
  Improvement for less frequent communication, more work on GPU
  High BW of SI GPUs vs APU decisive to throughput as communication reduces

INTERACTION RESULTS - PHYSICS
Effect of CPU compute capacity on application throughput for a coupled application
Application throughput for different particle distributions
  Throughput for APU and discrete in similar range
Time / step is affected by large particle counts

INTERACTION RESULTS - PHYSICS
Effect of CPU compute capacity on application throughput for a coupled application
Throughput for different large particle counts
  More large particles increase amount of work on CPU
  Substantial reduction in throughput
Time / step is affected by large particle counts

CONCLUSIONS AND FUTURE WORK
Conclusions
  Valar attempts to provide benchmarks that can generate a range of heterogeneous behavior for architectural research and application comparison
Future Work - Architectural Research
  Compare against discrete implementations and other programming models
  Evaluate power swishing on APUs and evaluate mobile low-power SoCs
Future Work - Applications
  Predator algorithm (TLD): coupled machine learning and feature detection
  More applications required, especially ones with concurrent command queue usage
  Physics needs a CPU OpenCL command queue instead of a thread pool
  Traffic needs a better algorithm, and the lane change model needs to be improved

THANK YOU! QUESTIONS? COMMENTS?
Perhaad Mistry
pmistry@ece.neu.edu
https://code.google.com/p/valar-bench/

INTERACTION RESULTS - SURF IMAGE COMPARE
Preprocessing added on CPU device at beginning of the pipeline
Comparison kernel calculates the difference between two gray-scale images
Preprocessing result determines whether to launch the pipeline
Higher threshold values improve performance due to more frames skipped
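The frame-skipping decision can be sketched in Python under stated assumptions: the difference metric is taken to be the mean absolute pixel difference, and each frame is compared against the last processed frame. The slide specifies neither choice, and the function names are hypothetical.

```python
def mean_abs_diff(frame_a, frame_b):
    """Mean absolute difference between two gray-scale frames (lists of rows)."""
    total = 0
    count = 0
    for row_a, row_b in zip(frame_a, frame_b):
        for a, b in zip(row_a, row_b):
            total += abs(a - b)
            count += 1
    return total / count if count else 0.0


def frames_to_process(frames, threshold):
    """Indices of frames for which the SURF pipeline would be launched.

    The first frame is always processed; a later frame is processed only
    when it differs from the last processed frame by more than threshold.
    """
    if not frames:
        return []
    selected = [0]
    reference = frames[0]
    for i in range(1, len(frames)):
        if mean_abs_diff(reference, frames[i]) > threshold:
            selected.append(i)
            reference = frames[i]
    return selected
```

With a higher threshold, fewer frames pass the check, so the pipeline is launched less often, consistent with the trend the slide reports.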

EXTRA STUFF

PERFORMANCE RESULTS - SURF ORIENTATION COMPARE
Orientation comparison useful if no camera rotation
Test case for overhead since orientation step is < 10% of SURF computation
Execution of compute pipeline interrupted to compare orientation vs. previous frame
Frequency of orientation comparison increased; native denotes no HAPTIC
More degradation in average performance seen for small videos

VALAR'S APPLICATIONS - PHYSICS SIMULATION
Collision Detection Pipeline
  The combination of large and small particles decides workload behavior
  GPU performs the small-small collisions
  CPU performs the large-small and large-large collisions
Behavioral space explored using
  Number of particles
  Ratio of large and small particles