GPGPU for Real-Time Data Analytics: Introduction. Bingsheng He (Nanyang Technological University, Singapore), Huynh Phung Huynh (A*STAR Institute of High Performance Computing, Singapore), Rick Siow Mong Goh (A*STAR Institute of High Performance Computing, Singapore)

Outline: real-time data analytics; GPU architectures; technical challenges.

Real-time Data Analytics: applications; RTDA categorization: on-demand RTDA and continuous RTDA.

REAL-TIME DATA ANALYTICS. The world changes constantly: data evolves over time and analysis results are affected by data revisions, so real-time updates are vital. Example: meteorological data and real-time forecasting.

REAL-TIME DATA ANALYTICS. The world changes constantly: data evolves over time and analysis results are affected by data revisions, so real-time updates are vital. Example: traffic data and driving guidance.

REAL-TIME DATA ANALYTICS. The world changes constantly: data evolves over time and analysis results are affected by data revisions, so instant updates are vital. Example: stock data and prediction.

MORE APPLICATIONS. Earth science: earthquake and volcano monitoring. Physics: high-energy physics data analysis, astrophysics data analysis. Healthcare: patient monitoring. Security: real-time surveillance.

Data is All Around Us! Opportunities to make use of it, generating insights and bringing value to ourselves and others, are plentiful!

TWO MAJOR RTDA TYPES. On-demand analytics: reactive; waits for a user to issue a query, then delivers the analytics. Examples: typhoon prediction, report generation. Real-time requirement: low response time. Continuous analytics: proactive; alerts users with continuous updates in real time. Examples: driving guidance, traffic monitoring, stock monitoring. Real-time requirements: high freshness, well-defined time constraints, high velocity. (Slide figure: ad-hoc tasks feeding an analytics engine.)

Outline: real-time data analytics; GPU architectures (throughput computing principles, real implementation: GPUs); technical challenges.

Part 1: Throughput Computing. Three key principles behind GPU design, in comparison with CPU design: simplification, SIMD processing, and interleaved execution. Kayvon Fatahalian, Beyond Programmable Shading course, ACM SIGGRAPH 2010.

CPU-Style Cores. Adapted from Kayvon Fatahalian, Beyond Programmable Shading course, ACM SIGGRAPH 2010.

Idea 1: Simplification. Remove everything that makes a single instruction stream run fast: caches (about 50% of die area in typical CPUs!) and hard-wired logic for out-of-order execution, branch prediction, and memory prefetching. Adapted from Kayvon Fatahalian, Beyond Programmable Shading course, ACM SIGGRAPH 2010.

Consequence: Use Many Simple Cores in Parallel. Invest the saved transistors in more copies of the simple core; far more copies than a conventional multicore CPU could afford. Adapted from Kayvon Fatahalian, Beyond Programmable Shading course, ACM SIGGRAPH 2010.

Instruction Stream Sharing. Observation: data parallelism is pervasive (vector addition, sparse matrix-vector multiply, ...). Idea 2: SIMD processing, optimized for throughput on data-parallel workloads: amortize the cost and complexity of managing an instruction stream across many ALUs (ALUs are very cheap!). Adapted from Kayvon Fatahalian, Beyond Programmable Shading course, ACM SIGGRAPH 2010.

Instruction Stream Sharing. An example SIMD design: one instruction stream shared across eight ALUs. Adapted from Kayvon Fatahalian, Beyond Programmable Shading course, ACM SIGGRAPH 2010.
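
Vector addition, mentioned above as a canonical data-parallel workload, maps directly onto this model: one instruction stream, executed by many threads over different elements. A minimal CUDA sketch (not from the tutorial; kernel and variable names, sizes, and the use of managed memory are illustrative, and error checking is omitted):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread computes one element: the same instruction stream,
// applied to different data (SIMD/SIMT execution).
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *a, *b, *c;
    cudaMallocManaged(&a, bytes);
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    int threads = 256;                        // threads per block
    int blocks = (n + threads - 1) / threads; // enough blocks to cover n elements
    vecAdd<<<blocks, threads>>>(a, b, c, n);
    cudaDeviceSynchronize();

    printf("c[0] = %f (expected 3.0)\n", c[0]);
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```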

Improving Throughput. Stalls: delays due to dependencies in the instruction stream. Latency: accessing data from memory easily takes 1000+ cycles. Idea 3: interleave the processing of many work groups on a single core, switching to the instruction stream of another (non-stalled, i.e., ready) SIMD group whenever the currently active group stalls. Ideally, latency is fully hidden and throughput is maximized. Adapted from Kayvon Fatahalian, Beyond Programmable Shading course, ACM SIGGRAPH 2010.

Latency Hiding. Adapted from Kayvon Fatahalian, Beyond Programmable Shading course, ACM SIGGRAPH 2010.
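
To make the interleaving idea concrete, the following CUDA sketch (illustrative only, not from the tutorial) launches far more threads than the GPU has ALUs; while some warps wait on the long-latency global-memory load, the hardware scheduler issues instructions from other ready warps, hiding much of that latency:

```cuda
#include <cuda_runtime.h>

// Illustrative only: each thread does a long-latency global load followed
// by a little arithmetic. With many more warps resident than ALUs, the
// scheduler switches to a ready warp whenever the active one stalls.
__global__ void scaleAndOffset(const float *in, float *out, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = in[i];        // long-latency global memory access
        out[i] = s * v + 1.0f;  // other warps keep the ALUs busy meanwhile
    }
}

int main() {
    const int n = 1 << 24;                    // far more threads than cores
    float *in, *out;
    cudaMalloc(&in,  n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));
    cudaMemset(in, 0, n * sizeof(float));

    int threads = 256;
    int blocks = (n + threads - 1) / threads; // tens of thousands of blocks
    scaleAndOffset<<<blocks, threads>>>(in, out, 2.0f, n);
    cudaDeviceSynchronize();

    cudaFree(in); cudaFree(out);
    return 0;
}
```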

Part 2: Putting the three ideas into practice: a closer look at real GPUs, the NVIDIA GeForce GTX 680 and the AMD Radeon HD 7970. Adapted from Beyond Programmable Shading course, ACM SIGGRAPH 2010.

Two Latest Architectures. NVIDIA GeForce GTX 680 (Kepler): 1536 stream processors ("CUDA cores"), 192.2 GB/s memory bandwidth, 3.1 TFLOPS single precision, SPMD execution. AMD Radeon HD 7970: 2048 stream processors, 264 GB/s memory bandwidth, 3.79 TFLOPS single precision.

NVIDIA GeForce GTX 680. Four Graphics Processing Clusters (GPCs), each housing two streaming multiprocessors (SMX). Groups of 192 CUDA cores per SMX share an instruction stream. Up to 1536 individual contexts can be stored.
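
The figures quoted on these slides (number of multiprocessors, resident thread capacity, memory bandwidth) can be read back at runtime through the CUDA device-property API; a small sketch, not part of the tutorial:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0
    printf("Device: %s\n", prop.name);
    printf("Multiprocessors (SMs): %d\n", prop.multiProcessorCount);
    printf("Max resident threads per SM: %d\n", prop.maxThreadsPerMultiProcessor);
    printf("Memory clock: %d kHz, bus width: %d bits\n",
           prop.memoryClockRate, prop.memoryBusWidth);
    // Theoretical peak bandwidth (GB/s) = 2 * memClock(kHz) * busWidth(bytes) / 1e6
    double bw = 2.0 * prop.memoryClockRate * (prop.memoryBusWidth / 8.0) / 1.0e6;
    printf("Theoretical peak bandwidth: ~%.1f GB/s\n", bw);
    return 0;
}
```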

AMD Radeon HD 7970. 32 Graphics Core Next (GCN) compute units. Up to 2048 individual contexts can be stored.

Outline: real-time data analytics; GPU architectures; technical challenges.

RTDA Challenges. On-demand analytics: low response time. Continuous analytics: high freshness, well-defined time constraints, high velocity. GPU optimization challenges: CPU-GPU data movement and its optimization; GPU memory hierarchy optimizations; multi-GPU system scalability; data processing frameworks for GPUs. One common mitigation for the data-movement challenge is sketched below.
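
As one illustration of the CPU-GPU data-movement challenge listed above, a frequently used mitigation is pinned host memory combined with CUDA streams, so that transfers overlap with kernel execution. A minimal sketch (illustrative only; the kernel merely stands in for real analytics work, and error checking is omitted):

```cuda
#include <cuda_runtime.h>

__global__ void process(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;   // placeholder for real analytics work
}

int main() {
    const int n = 1 << 22, chunk = 1 << 20;
    float *h, *d;
    cudaMallocHost(&h, n * sizeof(float));   // pinned host memory enables async copies
    cudaMalloc(&d, n * sizeof(float));
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);

    // Pipeline: while one chunk is being processed, the next is being copied.
    for (int off = 0, k = 0; off < n; off += chunk, ++k) {
        cudaStream_t st = s[k % 2];
        cudaMemcpyAsync(d + off, h + off, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, st);
        process<<<chunk / 256, 256, 0, st>>>(d + off, chunk);
        cudaMemcpyAsync(h + off, d + off, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, st);
    }
    cudaDeviceSynchronize();

    cudaStreamDestroy(s[0]); cudaStreamDestroy(s[1]);
    cudaFree(d); cudaFreeHost(h);
    return 0;
}
```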

Thank you and Q&A. Feedback is welcome: Bingsheng He, bshe@ntu.edu.sg; Huynh Phung Huynh, huynhph@ihpc.a-star.edu.sg; Rick Siow Mong Goh, gohsm@ihpc.a-star.edu.sg. Tutorial site: http://www3.ntu.edu.sg/home/bshe/gpgputut.html

Acknowledgements: Andrei Hagiescu, Altera; Weng-Fai Wong, National University of Singapore.