Performance Error Correcting Codes

Similar documents
ultra fast SOM using CUDA

GPU System Architecture. Alan Gray EPCC The University of Edinburgh

Introduction to GPGPU. Tiziano Diamanti

Introduction to GP-GPUs. Advanced Computer Architectures, Cristina Silvano, Politecnico di Milano 1

Alberto Corrales-García, Rafael Rodríguez-Sánchez, José Luis Martínez, Gerardo Fernández-Escribano, José M. Claver and José Luis Sánchez

LBM BASED FLOW SIMULATION USING GPU COMPUTING PROCESSOR

Stream Processing on GPUs Using Distributed Multimedia Middleware

Hardware-Aware Analysis and. Presentation Date: Sep 15 th 2009 Chrissie C. Cui

Introduction to GPU hardware and to CUDA

HIGH PERFORMANCE CONSULTING COURSE OFFERINGS

Introduction to GPU Programming Languages

OpenCL Optimization. San Jose 10/2/2009 Peng Wang, NVIDIA

Graphics Cards and Graphics Processing Units. Ben Johnstone Russ Martin November 15, 2011

NVIDIA GeForce GTX 580 GPU Datasheet

Mixed Precision Iterative Refinement Methods Energy Efficiency on Hybrid Hardware Platforms

Lecture 3: Modern GPUs A Hardware Perspective Mohamed Zahran (aka Z) mzahran@cs.nyu.edu

Real-time Visual Tracker by Stream Processing

Introduction to GPU Architecture

GPU Computing with CUDA Lecture 2 - CUDA Memories. Christopher Cooper Boston University August, 2011 UTFSM, Valparaíso, Chile

GPU-BASED TUNING OF QUANTUM-INSPIRED GENETIC ALGORITHM FOR A COMBINATORIAL OPTIMIZATION PROBLEM

NVIDIA GeForce GTX 750 Ti

CUDA programming on NVIDIA GPUs

CFD Implementation with In-Socket FPGA Accelerators

Graphical Processing Units to Accelerate Orthorectification, Atmospheric Correction and Transformations for Big Data

GPU File System Encryption Kartik Kulkarni and Eugene Linkov

Clustering Billions of Data Points Using GPUs

GPU Parallel Computing Architecture and CUDA Programming Model

Introduction to Numerical General Purpose GPU Computing with NVIDIA CUDA. Part 1: Hardware design and programming model

Graphical Processing Units to Accelerate Orthorectification, Atmospheric Correction and Transformations for Big Data

Optimizing a 3D-FWT code in a cluster of CPUs+GPUs

MIMO detector algorithms and their implementations for LTE/LTE-A

Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU

Experiences on using GPU accelerators for data analysis in ROOT/RooFit

QCD as a Video Game?

Lecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip.

E6895 Advanced Big Data Analytics Lecture 14:! NVIDIA GPU Examples and GPU on ios devices

CORRIGENDUM TO TENDER FOR HIGH PERFORMANCE SERVER

The Evolution of Computer Graphics. SVP, Content & Technology, NVIDIA

GPGPU Computing. Yong Cao

Home Exam 3: Distributed Video Encoding using Dolphin PCI Express Networks. October 20 th 2015

GPUs for Scientific Computing

The search engine you can see. Connects people to information and services

DCT-JPEG Image Coding Based on GPU

High Performance Matrix Inversion with Several GPUs

GPU Architectures. A CPU Perspective. Data Parallelism: What is it, and how to exploit it? Workload characteristics

HIGH SPEED DECODING OF NON-BINARY IRREGULAR LDPC CODES USING GPUS

Parallel Programming Survey

Introduction to GPU Computing

IP Video Rendering Basics

Applications to Computational Financial and GPU Computing. May 16th. Dr. Daniel Egloff

ANALYSIS OF RSA ALGORITHM USING GPU PROGRAMMING

Accelerating Wavelet-Based Video Coding on Graphics Hardware

L20: GPU Architecture and Models

Lecture 1. Course Introduction

ST810 Advanced Computing

Programming and Scheduling Model for Supporting Heterogeneous Architectures in Linux

1. If we need to use each thread to calculate one output element of a vector addition, what would

OpenPOWER Outlook AXEL KOEHLER SR. SOLUTION ARCHITECT HPC

HPC with Multicore and GPUs

Optimizing Parallel Reduction in CUDA. Mark Harris NVIDIA Developer Technology

CUDA Basics. Murphy Stein New York University

NVIDIA CUDA Software and GPU Parallel Computing Architecture. David B. Kirk, Chief Scientist

GeoImaging Accelerator Pansharp Test Results

Non-Data Aided Carrier Offset Compensation for SDR Implementation

Parallel Computing with MATLAB

Programming models for heterogeneous computing. Manuel Ujaldón Nvidia CUDA Fellow and A/Prof. Computer Architecture Department University of Malaga

CUDA Optimization with NVIDIA Tools. Julien Demouth, NVIDIA

Scalable and High Performance Computing for Big Data Analytics in Understanding the Human Dynamics in the Mobile Age

Breaking the Interleaving Bottleneck in Communication Applications for Efficient SoC Implementations

Trends in High-Performance Computing for Power Grid Applications

Accelerating CFD using OpenFOAM with GPUs

~ Greetings from WSU CAPPLab ~

Speeding Up RSA Encryption Using GPU Parallelization

Intelligent Heuristic Construction with Active Learning

Introduction to Cloud Computing

Extreme Machine Learning with GPUs John Canny

GPU Hardware and Programming Models. Jeremy Appleyard, September 2015

10- High Performance Compu5ng

Post Processing Service

ACCELERATING SELECT WHERE AND SELECT JOIN QUERIES ON A GPU

GPU-based Decompression for Medical Imaging Applications

CUDA SKILLS. Yu-Hang Tang. June 23-26, 2015 CSRC, Beijing

SUBJECT: SOLIDWORKS HARDWARE RECOMMENDATIONS UPDATE

Towards Large-Scale Molecular Dynamics Simulations on Graphics Processors

Overview. Lecture 1: an introduction to CUDA. Hardware view. Hardware view. hardware view software view CUDA programming

September 25, Maya Gokhale Georgia Institute of Technology

The High Performance Internet of Things: using GVirtuS for gluing cloud computing and ubiquitous connected devices

Overview of HPC Resources at Vanderbilt

Implementation of Stereo Matching Using High Level Compiler for Parallel Computing Acceleration

FPGA-based Multithreading for In-Memory Hash Joins

Workshop on Parallel and Distributed Scientific and Engineering Computing, Shanghai, 25 May 2012

Parallel Image Processing with CUDA A case study with the Canny Edge Detection Filter

Turbomachinery CFD on many-core platforms experiences and strategies

Performance Characteristics of Large SMP Machines

An Adaptive Decoding Algorithm of LDPC Codes over the Binary Erasure Channel. Gou HOSOYA, Hideki YAGI, Toshiyasu MATSUSHIMA, and Shigeichi HIRASAWA

Hardware Acceleration for CST MICROWAVE STUDIO

HP ProLiant SL270s Gen8 Server. Evaluation Report

Unleashing the Performance Potential of GPUs for Atmospheric Dynamic Solvers

Purchase of High Performance Computing (HPC) Central Compute Resources by Northwestern Researchers

OpenCL Programming for the CUDA Architecture. Version 2.3

Transcription:

GPU Accelerated Decoding of High Performance Error Correcting Codes Andrew D. Copeland, Nicholas B. Chang, and dstephen Leung This work is sponsored by the Department of the Air Force under Air Force contract FA8721-5-C-2. Opinions, interpretations, conclusions and recommendations are those of the author and are not necessarily endorsed by the United States Government. HPEC_GPU_DECODE-1

Outline Introduction Wireless Communications Performance vs. Complexity Candidate Implementations Low Density Parity Check (LDPC) Codes NVIDIA GPU Architecture GPU Implementation Results Summary HPEC_GPU_DECODE-2

Wireless Communications Data Generic Communication System Channel Coding Modulation Multipath Noise Demodulation Decoding Interference...... Data HPEC_GPU_DECODE-3

Wireless Communications Data Generic Communication System Our Focus Channel Coding Modulation Multipath Noise Demodulation Decoding Interference...... Data HPEC_GPU_DECODE-4

Data Wireless Communications Generic Communication System Our Focus Channel Coding Modulation Multipath Noise Demodulation Decoding Interference... Simple Error Correcting Code 111 111 Coding 111 111... Data HPEC_GPU_DECODE-5

Data Wireless Communications Generic Communication System Our Focus Channel Coding Modulation Multipath Noise Demodulation Decoding Interference... Simple Error Correcting Code 111 111 Coding 111 111 High Performance Codes Allow for larger link distance and/or higher h data rate Superior performance comes with significant increase in complexity... Data HPEC_GPU_DECODE-6

Performance vs. Complexity Increased performance requires larger computational cost Off the shelf systems require 5dB more power to close link than custom system Best custom design only require.5db more power than theoretical minimum 4-State STTC 8-State STTC 16-State STTC 32-State t STTC 64-State STTC GF(2) LDPC BICM-IDID Commercial Interests Direct GF(256) LDPC Custom Designs HPEC_GPU_DECODE-7

Performance vs. Complexity Increased performance requires larger computational cost Off the shelf systems require 5dB more power to close link than custom system Best custom design only require.5db more power than theoretical minimum 4-State STTC 8-State STTC 16-State STTC 32-State t STTC 64-State STTC GF(2) LDPC BICM-IDID Commercial Interests Direct GF(256) LDPC Custom Designs Our Choice HPEC_GPU_DECODE-8

Candidate Implementations Half second burst of data Decode complexity 37 TFLOPS Memory transferred during decode 44 TB PC 9x FPGA ASIC 4x GPU Implementation Time (beyond development) Performance (decode time) Hardware Cost HPEC_GPU_DECODE-9

Candidate Implementations Half second burst of data Decode complexity 37 TFLOPS Memory transferred during decode 44 TB PC 9x FPGA ASIC 4x GPU Implementation - Time (beyond development) Performance (decode time) 9 days Hardware Cost $2, HPEC_GPU_DECODE-1

Candidate Implementations Half second burst of data Decode complexity 37 TFLOPS Memory transferred during decode 44 TB PC 9x FPGA ASIC 4x GPU Implementation - 1 person Time (beyond development) year (est.) Performance (decode time) 9 days 5.1 minutes (est.) Hardware Cost $2, $9, HPEC_GPU_DECODE-11

Candidate Implementations Half second burst of data Decode complexity 37 TFLOPS Memory transferred during decode 44 TB PC 9x FPGA ASIC 4x GPU Implementation - 1 person 1.5 person Time (beyond year (est.) years (est.) development) +1 year to Fab. Performance 9 days 5.1 minutes.5s (decode time) (est.) (if possible) Hardware Cost $2, $9, $1 million HPEC_GPU_DECODE-12

Candidate Implementations Half second burst of data Decode complexity 37 TFLOPS Memory transferred during decode 44 TB PC 9x FPGA ASIC 4x GPU Implementation - 1 person 1.5 person? Time (beyond year (est.) years (est.) development) +1 year to Fab. Performance 9 days 5.1 minutes.5s? (decode time) (est.) (if possible) Hardware Cost $2, $9, $1 million $5, (includes PC cost) HPEC_GPU_DECODE-13

Outline Introduction Low Density Parity Check (LDPC) Codes Decoding LDPC GF(256) Codes Algorithmic i Demands of LDPC Decoder NVIDIA GPU Architecture GPU Implementation Results Summary HPEC_GPU_DECODE-14

Low Density Parity Check (LDPC) Codes LDPC Codes Proposed by Gallager (1962) codes were binary Not used until mid 199s GF(256) Davey and McKay (1998) symbols in [, 1, 2,, 255] Parity check matrix defines graph Symbols connected to checks nodes Has error state feedback Syndrome = (no errors) Potential to re-transmit HPEC_GPU_DECODE-15 LDPC GF(256) Parity-Check Matrix N Checks N Symbols 52 174 5 4 213 97 119 78 218 17 129 174 172 214 78 216 6 39 114 49 139 191 181 152 134 97 22 238 11 178 S Y M B O L S Codeword Tanner Graph N Symbols N Checks c c1 c = n c 15 Syndrome C H E C K S

Decoding LDPC GF(256) Codes Solved using LDPC Sum-Product Decoding Algorithm A belief propagation p algorithm Iterative Algorithm Number of iterations depends on SNR Tanner Graph Symbol Update Check Update Hard Decision ML Probabilities decoded symbols S Y M B O L S Codeword C H E C K S Syndrome HPEC_GPU_DECODE-16

Algorithmic Demands of LDPC Decoder Computational Complexity Decoding of a entire sequence 37 TFLOPS Single frame 922 MFLOPS Check update 95% of total Memory Requirements Data moved between device memory and multiprocessors 44.2 TB Single codeword 1.1 GB Data moved within Walsh Transform and permutations (part of check update) 177 TB Single codeword 4.4 GB Decoder requires architectures with high computational and memory bandwidths HPEC_GPU_DECODE-17

Outline Introduction Low Density Parity Check (LDPC) Codes NVIDIA GPU Architecture Dispatching Parallel Jobs GPU Implementation Results Summary HPEC_GPU_DECODE-18

NVIDIA GPU Architecture GPU GPU contains N multiprocessors Each Performs Single Instruction Multiple Thread (SIMT) operations Constant and Texture caches speed up certain memory access patterns NVIDIA GTX 295 (2 GPUs) N = 3 multiprocessors M = 8 scalar processors Shared memory 16KB Device memory 896MB Up to 512 threads on each multiprocessor Capable of 93 GFLOPs Capable of 112 GB/sec HPEC_GPU_DECODE-19 NVIDIA CUDA Programming Guide Version 2.1

Dispatching Parallel Jobs Program decomposed into blocks of threads Thread blocks are executed on individual multiprocessors Threads within blocks coordinate through shared memory Threads optimized for coalesced memory access Grid Block 1 Block 2 Block 3 Block M Calling parallel jobs (kernels) within C function is simple Scheduling is transparent and managed by GPU and API grid.x = M; threads.x = N; kernel_call<<<grid, ll threads>>>(variables); Thread blocks execute independently of one another Threads allow faster computation and better use of memory bandwidth Threads work together using shared memory HPEC_GPU_DECODE-2

Outline Introduction Low Density Parity Check (LDPC) Codes NVIDIA GPU Architecture GPU Implementation Check Update Implementation Results Summary HPEC_GPU_DECODE-21

GPU Implementation of LDPC Decoder Quad GPU Workstation 2 NVIDIA GTX 295s 3.2GHz Intel Core i7 965 12GB of 133MHz DDR3 memory Costs $5 Software written using NVIDIA s Compute Unified Device Architecture (CUDA) Replaced Matlab functions with CUDA MEX functions Utilized static variables GPU Memory allocated first time function is called Constants used to decode many codewords only loaded once Codewords distributed across the 4 GPUs HPEC_GPU_DECODE-22

Check Update Implementation Permutation Permutation Walsh Transform Walsh Transform X... X... X Inverse Walsh Trans. Inverse Walsh Trans. Permutation Permutation Thread blocks assigned to each check independently of all other checks and are 256 element vectors Each thread is assigned to element of GF(256) Coalesced memory reads Threads coordinate using shared memory grid.x = num_checks; threads.x = 256; check_update<<<grid, threads>>>(q,r); S Y M B O L S C H E C K S HPEC_GPU_DECODE-23 Structure of algorithm exploitable on GPU

Outline Introduction Low Density Parity Check (LDPC) Codes NVIDIA GPU Architecture GPU Implementation Results Summary HPEC_GPU_DECODE-24

Candidate Implementations PC 9x FPGA ASIC 4x GPU Implementation Time (beyond development) Performance (time to decode half second of data) - 1 person year (est.) 1.5 person years (est.) +1 year to Fab. 9 days 5.1 minutes.5s (if possible)?? Hardware Cost $2, $9, $1 million $5, (includes PC cost) HPEC_GPU_DECODE-25

Candidate Implementations PC 9x FPGA ASIC 4x GPU Implementation Time (beyond development) Performance (time to decode half second of data) - 1 person year (est.) 1.5 person years (est.) +1 year to Fab. 9 days 5.1 minutes.5s (if possible) 6 person weeks 5.6 minutes Hardware Cost $2, $9, $1 million $5, (includes PC cost) HPEC_GPU_DECODE-26

Implementation Results Decoded test data in 5.6 minutes instead of 9 days 1 4 A 2336x speedup (Minutes) Dec code Time 1 3 1 2 1 1 Decode Time GPU 15 Iterations 15 iterations Matlab 1 -.5 5 5.5 1 15 1.5 2 25 2.5 3 35 3.5 4 SNR (db) HPEC_GPU_DECODE-27 GPU implementation supports analysis, testing, and demonstration activities

Implementation Results Decoded test data in 5.6 minutes instead of 9 days 1 4 A 2336x speedup Hard coded version runs fixed number of 1 3 iterations Likely used on FPGA 1 2 or ASIC versions (Minutes) Dec code Time 1 1 Decode Time GPU 15 Iterations 15 iterations Matlab Hard Coded 1 -.5 5 5.5 1 15 1.5 2 25 2.5 3 35 3.5 4 SNR (db) HPEC_GPU_DECODE-28 GPU implementation supports analysis, testing, and demonstration activities

Summary GPU architecture can provide significant performance improvement with limited hardware cost and implementation effort GPU implementation is a good fit for algorithms with parallel steps that require systems with large computational and memory bandwidths LDPC GF(256) decoding is a well-suited algorithm Implementation allows reasonable algorithm development cycle New algorithms can be tested and demonstrated in minutes instead of days HPEC_GPU_DECODE-29

Backup HPEC_GPU_DECODE-3

Experimental Setup Test System Parameters Data rate 1.3 Gb/s Half second burst 4, codewords (6 symbols) Parity check matrix 4 checks 6 symbols Corresponds to rate 1/3 code Performance numbers based on 15 iterations of decoding algorithm HPEC_GPU_DECODE-31