A GPU-based Algorithm-specific Optimization for High-performance Background Subtraction. Chulian Zhang, Hamed Tabkhi, Gunar Schirner


Outline
- Motivation
- Background
- Related Work
- Approach
- General Optimization
- Algorithm-specific Optimization
- Shared Memory Optimization
- Quality Exploration
- Conclusion

Motivation
- Computer vision
  - Huge market: surveillance, ADAS, HCI, traffic monitoring
  - Vision algorithm properties: embarrassingly parallel, demand high performance
- Possible solutions for CV
  - CPU: low performance, high power
  - FPGA: high performance, low power, but hard to implement
  - GPU: high performance, relatively low power
    - Massively parallel cores, suitable for throughput-oriented applications
    - Easy to program

Background Subtraction
- Background subtraction: a primary vision kernel, frequently used in many markets
- Mixture of Gaussians (MoG) (update equations sketched below)
  - A Gaussian background model for each pixel: multiple Gaussians, each with a weight, mean, and standard deviation
  - Adaptive: operates on a single video frame and recursively updates the model
  - Embarrassingly parallel
- Advantages
  - Learning-based algorithm
  - Deals better with gradual variations
  [Figure: per-pixel Gaussian components, weight vs. intensity mean]
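For reference, a common way to write the per-pixel MoG model and its recursive update (not spelled out on the slide; the weight parameterization below matches the algorithm listings later in the deck, with M_k = 1 for the matched component and 0 otherwise, and ρ the mean/variance learning rate):

\[
P(x_t) \;=\; \sum_{k=1}^{K} w_k \,\mathcal{N}\!\left(x_t \mid \mu_k, \sigma_k^2\right)
\]
\[
w_k \leftarrow \alpha\, w_k + (1-\alpha)\, M_k, \qquad
\mu_k \leftarrow (1-\rho)\,\mu_k + \rho\, x_t, \qquad
\sigma_k^2 \leftarrow (1-\rho)\,\sigma_k^2 + \rho\,\left(x_t - \mu_k\right)^2
\]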

GPU Architecture
- Streaming multiprocessor (SM)
  - Many cores, single-instruction multiple-thread (SIMT) execution
  - Large register file
- Shared memory: on-chip, fast; used for communication and synchronization
- Global memory
- Observation
  - GPU architecture is fundamentally different from the CPU
  - An application optimized for a single CPU thread is not suitable for the GPU
  - Demand for algorithm-specific optimization: adjust vision algorithms to get maximum performance

Related Work
- Focus on general GPU optimizations
  - Memory coalescing [Kumar, 13] [Pham, 10] [Li, 12]
  - Overlapped execution [Kumar, 13] [Pham, 10]
  - Shared memory [Kumar, 13] [Pham, 10] [Li, 12]
  - No detailed performance analysis; limited performance speedup (20x)
- Algorithm optimizations [Azmat, 12]
  - Reduce costly operations
  - Quality loss

References:
[Kumar, 13] Real-time Moving Object Detection Algorithms on High-resolution Videos using GPUs
[Pham, 10] GPU Implementation of Extended Gaussian Mixture Model for Background Subtraction
[Poremba, 10] Accelerating Adaptive Background Subtraction with GPU and CBEA Architecture
[Li, 12] Three-level GPU Accelerated Gaussian Mixture Model for Background Subtraction (OpenCV)
[Azmat, 12] Accelerating Adaptive Background Modeling on Low-power Integrated GPUs

Experiment Setup
- Hardware platform: OS RHEL 6.2 (see table below)
- Algorithm setup: 3 Gaussians, double precision
- Baseline: CPU, single thread, -O3 optimization
- Input: HD frames (1080 x 1920)

                 CPU                       GPU
  Processor      Intel Xeon E5-2620        NVIDIA Tesla C2075
  Cores          6                         448
  Frequency      2.5 GHz                   1.15 GHz
  GFLOPS         120.3                     1030
  Cache          L2 (256K), L3 (15M)       L1 (16/48K), L2 (768K)
  Memory BW      12.8 GB/s                 144 GB/s

General Optimizations
- Memory coalescing (see the sketch after this slide)
  - Non-coalesced: locality within each thread; optimized for a single CPU thread
  - Coalesced: locality across different threads; with the GPU's massively parallel threads, requests will be coalesced
- Overlapped execution: overlap communication and computation
- Result
  [Charts: memory performance and speedup for A (base implementation), B (memory coalescing), C (overlapped execution)]
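To make the coalescing point concrete, here is a minimal CUDA sketch (not the paper's code; GaussianAoS, GaussianSoA, NUM_GAUSS and the kernel name are assumed) contrasting the CPU-friendly array-of-structures layout with a structure-of-arrays layout in which the threads of a warp read consecutive addresses:

    // Minimal coalescing sketch (illustrative names; not the paper's actual code).
    #define NUM_GAUSS 3

    // Array-of-structures: good locality for one CPU thread, but thread i and
    // thread i+1 read addresses sizeof(GaussianAoS) apart -> not coalesced on a GPU.
    struct GaussianAoS { double w[NUM_GAUSS], m[NUM_GAUSS], sd[NUM_GAUSS]; };

    // Structure-of-arrays: the same parameter of neighbouring pixels sits at
    // neighbouring addresses, so a warp's loads are coalesced.
    struct GaussianSoA { double *w, *m, *sd; };   // each NUM_GAUSS * numPixels long

    __global__ void mogUpdateSoA(GaussianSoA g, const unsigned char *frame,
                                 unsigned char *mask, int numPixels)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= numPixels) return;
        double x = frame[i];
        for (int k = 0; k < NUM_GAUSS; ++k) {
            double w = g.w[k * numPixels + i];    // coalesced across the warp
            // ... match test and model update elided ...
            (void)w;
        }
        mask[i] = (x > 127);                      // placeholder decision
    }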

Algorithm-specific Optimization
- General optimization is not enough: the GPU is not fully utilized
- Algorithm-specific optimization: tune the algorithm to better fit the underlying architecture
- Example
  - Vision algorithms have many branches, and branches reduce GPU utilization
  - Solution: remove branches, even at the cost of extra computation; overall performance will be better

Branch Reduction
- Gaussian background checking (CPU code)
  1. Sort the components
  2. Check from the highest-ranked component
  3. Stop when a match happens
- Problem: a new pixel has a higher chance to match a higher-ranked Gaussian, so this CPU-style optimization adds branches
- GPU code (sketched below): remove the sorting and check all components, preserving correctness
- Result: branch count reduced, branch efficiency increased
  [Charts: branch count, performance]
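A minimal sketch of the branch-reduction idea (the names and the 2.5σ match test are assumptions; the paper's actual match logic may differ): every component is evaluated instead of sorting and breaking out at the first match, so all threads of a warp run the same number of loop iterations:

    // Branch-reduced match check (illustrative sketch).
    __device__ int findMatch(const double *m, const double *sd, double x, int numGauss)
    {
        int matchIdx = -1;
        for (int k = 0; k < numGauss; ++k) {
            // Evaluate every component; no sort, no early exit, so there is no
            // data-dependent divergence inside the warp.
            int matched = fabs(x - m[k]) < 2.5 * sd[k];
            matchIdx = (matchIdx < 0 && matched) ? k : matchIdx;
        }
        return matchIdx;   // index of the first matching Gaussian, or -1
    }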

Source-level Predicated Execution
- Predicated instructions are generated by the GPU compiler, but for the Gaussian update the compiler cannot detect the opportunity
- Source-level predication: use a flag to decide the effect of each statement and remove the branch (Algorithms 4 and 5)

Algorithm 4: Non-predicated execution
    for k = 0 to numGau do
        if match then
            w[k]  = α·w[k] + (1 − α)
            tmp   = (1 − α) / w[k]
            m[k]  = f(tmp)
            sd[k] = g(tmp)
        else
            w[k]  = α·w[k]
        end if
    end for

Algorithm 5: Predicated execution
    for k = 0 to numGau do
        w[k]  = α·w[k] + match·(1 − α)
        tmp   = (1 − α) / w[k]
        m[k]  = (1 − match)·m[k] + match·f(tmp)
        sd[k] = (1 − match)·sd[k] + match·g(tmp)
    end for

- Result
  [Charts: branch count, performance, register usage]

Register Usage Reduction
- Solution: reduce the number of registers per thread
- Approach
  - Cascade arithmetic operations and remove intermediate variables, guided by changes to the source code (see the sketch below)
  - The compiler alone cannot find the minimum register count that yields the best performance
- Result
  [Charts: occupancy, performance]
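A contrived example of the cascading idea (the ρ expression is a simplified stand-in, not the paper's exact arithmetic): both functions compute the same mean update, but the second folds the temporaries into one expression so fewer intermediates must stay live in registers at the same time:

    // Register-usage reduction by cascading arithmetic (illustrative only).
    __device__ double updateMeanVerbose(double m, double sd, double x, double alpha)
    {
        double d    = x - m;                      // three named temporaries that the
        double dist = (d * d) / (sd * sd);        // compiler may keep live together
        double rho  = alpha * exp(-0.5 * dist);
        return (1.0 - rho) * m + rho * x;
    }

    __device__ double updateMeanCascaded(double m, double sd, double x, double alpha)
    {
        // Algebraically identical, written as one cascaded expression to guide the
        // compiler toward fewer simultaneously live registers.
        return m + alpha * exp(-0.5 * (x - m) * (x - m) / (sd * sd)) * (x - m);
    }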

Shared Memory Optimization
- Frame-based operation: process one frame at a time; read/write the Gaussian parameters from/to global memory
- Shared memory is small, so data reuse is needed
- Window-based operation (sketched below)
  - Split one frame into smaller windows; the window size is decided by the shared memory size
  - Process the same window across many frames (a frame group) to increase data reuse
  - Shift to the next frame group after finishing all windows
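A simplified device-side sketch of the window-based schedule (WIN_PIXELS, the kernel name, and the single weight array in shared memory are assumptions; the frame size is assumed to be a multiple of the window size, and the kernel is launched with WIN_PIXELS threads per block): one block owns one window, keeps its parameters in shared memory across the whole frame group, and writes them back once at the end:

    // Window-based processing sketch (illustrative; real code also keeps mean/sd).
    #define WIN_PIXELS 256      // pixels per window, bounded by shared-memory size
    #define NUM_GAUSS  3

    __global__ void mogWindowKernel(const unsigned char *frames, int framesInGroup,
                                    int framePixels, double *weights,
                                    unsigned char *masks)
    {
        __shared__ double sW[NUM_GAUSS][WIN_PIXELS];
        int p   = threadIdx.x;                      // pixel index inside the window
        int pix = blockIdx.x * WIN_PIXELS + p;      // pixel index inside the frame

        for (int k = 0; k < NUM_GAUSS; ++k)         // load parameters once per group
            sW[k][p] = weights[k * framePixels + pix];
        __syncthreads();

        for (int f = 0; f < framesInGroup; ++f) {   // reuse them across the group
            double x = frames[f * framePixels + pix];
            // ... match test and model update against sW[][] elided ...
            masks[f * framePixels + pix] = (x > 127);   // placeholder decision
        }

        __syncthreads();
        for (int k = 0; k < NUM_GAUSS; ++k)         // write back once per group
            weights[k * framePixels + pix] = sW[k][p];
    }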

Shared Memory Optimization (cont.)
- Positive: memory accesses go to shared memory
- Negative: the latency per frame is increased
- Optimal group size? 8 frames; occupancy decreases with increasing group size

Quality Exploration
- Quality assessment
  - Ground truth: CPU result with double precision
  - Multi-Scale Structural SIMilarity (MS-SSIM): quantifies the structural similarity between two images (definition below)
- Quality result
  - Background result is always 99%
  - Foreground drops slightly due to the algorithm tuning

                 A     B     C     D     E     F
  Background    99%   99%   99%   99%   99%   99%
  Foreground    99%   99%   96%   97%   97%   95%
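For reference, the single-scale SSIM index that MS-SSIM combines across several scales is defined (standard formulation, not restated on the slide) as:

\[
\mathrm{SSIM}(x, y) \;=\; \frac{\left(2\mu_x \mu_y + C_1\right)\left(2\sigma_{xy} + C_2\right)}
{\left(\mu_x^2 + \mu_y^2 + C_1\right)\left(\sigma_x^2 + \sigma_y^2 + C_2\right)}
\]

where \( \mu \), \( \sigma^2 \) and \( \sigma_{xy} \) are local means, variances and covariance, and \( C_1, C_2 \) are small stabilizing constants.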

Conclusion
- GPUs are a great fit for computer vision: massively parallel
- Existing vision algorithms are optimized for the CPU
- To get optimal performance from the GPU: understand the vision algorithms and tune them to fit the architecture
- Future work
  - Apply the same principles to other vision algorithms
  - Develop and evaluate optimizations for embedded GPUs
