GPGPU for Real-Time Data Analytics: Introduction

1 GPGPU for Real-Time Data Analytics: Introduction
Bingsheng He (1), Huynh Phung Huynh (2), Rick Siow Mong Goh (2)
(1) Nanyang Technological University, Singapore
(2) A*STAR Institute of High Performance Computing, Singapore

2 Outline
- Real-time data analytics
- GPU architectures
- Technical challenges

3 Real-time Data Analytics
- Real-time data analytics applications
- RTDA categorization
  - On-demand RTDA
  - Continuous RTDA

4 REAL-TIME DATA ANALYTICS
- Constantly changing world
- Data evolves over time
- Analysis results are affected by data revisions
- Real-time updates are vital
- Meteorological data: real-time forecast

5 REAL-TIME DATA ANALYTICS
- Constantly changing world
- Data evolves over time
- Analysis results are affected by data revisions
- Real-time updates are vital
- Traffic data: driving guidance

6 REAL-TIME DATA ANALYTICS
- Constantly changing world
- Data evolves over time
- Analysis results are affected by data revisions
- Instant updates are vital
- Stock data: prediction

7 MORE APPLICATIONS
- Earth science: earthquake and volcano monitoring
- Physics: high energy physics data analysis, astrophysics data analysis
- Healthcare: patient monitoring
- Security: real-time surveillance

8 Data is All Around Us!
Opportunities to make use of it, generating insights and bringing value to ourselves and others, are aplenty!

9 TWO MAJOR RTDA TYPES
- On-demand analytics
  - Reactive: waits for users to issue a query, then delivers the analytics
  - Examples: typhoon prediction, report generation
  - Real-time requirement: low response time
- Continuous analytics
  - Proactive: alerts users with continuous updates in real time
  - Examples: driving guidance, traffic monitoring, and stock monitoring
  - Real-time requirements: high freshness, well-defined time constraints, high velocity
Figure labels: ad-hoc tasks, analytics engine

10 Outline
- Real-time data analytics
- GPU architectures
  - Throughput computing principles
  - Real implementation: GPUs
- Technical challenges

11 Part 1: Throughput Computing
Three key principles behind GPU design, in comparison with CPU design:
- Simplification
- SIMD processing
- Interleaving executions
Kayvon Fatahalian. Beyond Programmable Shading Course, ACM SIGGRAPH

12 CPU-Style Cores
Adapted from Kayvon Fatahalian. Beyond Programmable Shading Course, ACM SIGGRAPH

13 Idea 1: Simplification
- Remove everything that only serves to make a single instruction stream run fast:
  - Caches (50% of die area in typical CPUs!)
  - Hard-wired logic: out-of-order execution, branch prediction, memory pre-fetching
Adapted from Kayvon Fatahalian. Beyond Programmable Shading Course, ACM SIGGRAPH

14 Consequence: Use Many Simple Cores in Parallel
- Invest the saved transistors in more copies of the simple core
- "More copies" means more than a conventional multicore CPU could afford
Adapted from Kayvon Fatahalian. Beyond Programmable Shading Course, ACM SIGGRAPH

17 Instruction Stream Sharing
- Observation: data parallelism is pervasive (vector addition, sparse matrix-vector multiply, ...)
- Idea 2: SIMD processing
  - Throughput optimized for data-parallel workloads
  - Amortise the cost and complexity of managing an instruction stream across many ALUs to reduce overhead (ALUs are very cheap!)
Adapted from Kayvon Fatahalian. Beyond Programmable Shading Course, ACM SIGGRAPH

18 Instruction Stream Sharing
- An example of SIMD design: one instruction stream shared across eight ALUs
Adapted from Kayvon Fatahalian. Beyond Programmable Shading Course, ACM SIGGRAPH
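
The SIMD idea on slides 17 and 18 can be sketched as a toy model in Python. This is only an illustration, not how any real GPU is programmed: the two-instruction mini instruction set, the register names, and the `run_simd` helper are all hypothetical. The lane count of eight follows the slide's example. The point is that each instruction is fetched and decoded once, and the cost is shared by all lanes.

```python
# Toy model of SIMD execution: one instruction stream drives eight ALU lanes.
# LANES = 8 mirrors the slide's example; the mini instruction set is hypothetical.
LANES = 8


def run_simd(program, registers):
    """Execute each instruction once, applying it to all lanes in lockstep.

    Returns the number of instruction fetches that were needed.
    """
    fetches = 0
    for op, dst, a, b in program:
        fetches += 1  # one fetch/decode, shared by all eight lanes
        if op == "add":
            registers[dst] = [x + y for x, y in zip(registers[a], registers[b])]
        elif op == "mul":
            registers[dst] = [x * y for x, y in zip(registers[a], registers[b])]
        else:
            raise ValueError(f"unknown op: {op}")
    return fetches


# Data-parallel computation z = (x + y) * x over eight elements.
regs = {
    "x": list(range(LANES)),
    "y": [10] * LANES,
    "t": [0] * LANES,
    "z": [0] * LANES,
}
program = [("add", "t", "x", "y"), ("mul", "z", "t", "x")]

simd_fetches = run_simd(program, regs)
scalar_fetches = len(program) * LANES  # a scalar core fetches once per element

print(regs["z"])
print(simd_fetches, scalar_fetches)
```

In this model the SIMD core needs 2 instruction fetches where a scalar core processing the same eight elements would need 16, which is the amortisation the slide refers to.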

19 Improving Throughput
- Stalls: delays due to dependencies in the instruction stream
- Latency: accessing data from memory easily takes cycles
- Idea 3: Interleave processing of many work groups on a single core
  - Switch to the instruction stream of another (non-stalled, i.e. ready) SIMD group when the currently active group stalls
  - Ideally, latency is fully hidden and throughput is maximized
Adapted from Kayvon Fatahalian. Beyond Programmable Shading Course, ACM SIGGRAPH

20 Latency Hiding
Adapted from Kayvon Fatahalian. Beyond Programmable Shading Course, ACM SIGGRAPH
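
Idea 3 (interleaving) can be illustrated with a small simulation. This is a rough sketch under stated assumptions, not a model of any real scheduler: the `utilization` function is hypothetical, and the figures of 10 compute cycles per group followed by a 90-cycle memory stall are made-up values chosen only to show the effect. Each cycle the core runs one ready group and switches to another ready group when the active one stalls.

```python
def utilization(num_groups, compute_cycles, mem_latency, total_cycles=10_000):
    """Fraction of cycles in which the core issues ALU work.

    Each group repeats: compute_cycles of ALU work, then a memory access
    that stalls it for mem_latency cycles. Every cycle the core runs one
    ready group, if there is one; otherwise the cycle is idle.
    """
    ready_at = [0] * num_groups                # cycle at which each group is ready
    remaining = [compute_cycles] * num_groups  # ALU cycles left before next stall
    busy = 0
    current = 0
    for cycle in range(total_cycles):
        # Stay on the current group while it is ready; otherwise switch
        # to the next ready group in round-robin order.
        chosen = None
        for offset in range(num_groups):
            g = (current + offset) % num_groups
            if ready_at[g] <= cycle:
                chosen = g
                break
        if chosen is None:
            continue  # every group is stalled: idle cycle
        current = chosen
        busy += 1
        remaining[chosen] -= 1
        if remaining[chosen] == 0:
            # Group issues its memory access and stalls.
            ready_at[chosen] = cycle + 1 + mem_latency
            remaining[chosen] = compute_cycles
    return busy / total_cycles


# Assumed workload: 10 compute cycles, then a 90-cycle memory stall.
for groups in (1, 5, 10):
    print(groups, utilization(groups, compute_cycles=10, mem_latency=90))
```

With these assumed numbers, a single group keeps the core busy about 10% of the time, five groups about 50%, and ten groups hide the latency completely. More generally, in this model roughly 1 + latency / compute groups are needed for full utilization.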

21 Part 2: Putting the three ideas into practice: A closer look at real GPUs
- NVIDIA GeForce GTX 680
- ATI Radeon HD
/29/10 Beyond Programmable Shading Course, ACM SIGGRAPH

22 Two Latest Architectures
- NVIDIA GeForce GTX 680 (Kepler)
  - 1536 stream processors ("CUDA cores")
  - GB/s, 3.1 TFLOPS (single precision)
  - SPMD execution
- AMD Radeon HD 7970
  - 2048 stream processors
  - 264 GB/s, 3.79 TFLOPS (single precision)

23 NVIDIA GeForce GTX 680
- Groups of 192 CUDA cores per SMX share an instruction stream
- Four Graphics Processing Clusters (GPCs), each of which houses two Streaming Multiprocessors (SMX)
- Up to 1536 individual contexts can be stored
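
The core counts and peak throughput figures on slides 22 and 23 can be cross-checked with a little arithmetic, sketched below in Python. The core clocks are not stated on the slides; the values used here (1.006 GHz for the GTX 680 and 0.925 GHz for the HD 7970) are assumed vendor reference clocks, and the factor of two assumes each core completes one fused multiply-add per cycle, counted as two floating-point operations. The `peak_tflops` helper is hypothetical.

```python
# Assumed core clocks (not stated on the slides): vendor reference clocks.
GTX_680_CLOCK_GHZ = 1.006
HD_7970_CLOCK_GHZ = 0.925

# Assumption: one fused multiply-add per core per cycle, counted as 2 FLOPs.
FLOPS_PER_CORE_PER_CYCLE = 2


def peak_tflops(cores, clock_ghz):
    """Peak single-precision throughput in TFLOPS under the assumptions above."""
    return cores * FLOPS_PER_CORE_PER_CYCLE * clock_ghz / 1000.0


# Slide 23: 4 GPCs x 2 SMX per GPC x 192 CUDA cores per SMX.
gtx_680_cores = 4 * 2 * 192
hd_7970_cores = 2048  # slide 22

gtx_680_peak = peak_tflops(gtx_680_cores, GTX_680_CLOCK_GHZ)
hd_7970_peak = peak_tflops(hd_7970_cores, HD_7970_CLOCK_GHZ)

print(gtx_680_cores)
print(round(gtx_680_peak, 2), round(hd_7970_peak, 2))
```

Under these assumptions the GTX 680 comes out at 1536 cores and about 3.09 TFLOPS, and the HD 7970 at about 3.79 TFLOPS, in line with the 3.1 and 3.79 TFLOPS quoted on slide 22.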

24 AMD Radeon HD
- Graphics Core Next (GCN) compute units
- Up to 2048 individual contexts can be stored

25 Outline
- Real-time data analytics
- GPU architectures
- Technical challenges

26 RTDA Challenges
- On-demand analytics: low response time
- Continuous analytics: high freshness, well-defined time constraints, high velocity
GPU Optimization Challenges
- CPU-GPU data movement and optimization
- GPU memory hierarchy optimizations
- Multi-GPU system scalability
- Data processing frameworks for GPU

27 Thank you and Q&A
Feedback is welcome: Bingsheng He, Huynh Phung Huynh, Rick Siow Mong Goh
Tutorial site:

28 Acknowledgement
- Andrei Hagiescu, Altera
- Weng-Fai Wong, National University of Singapore
