GPGPU for Real-Time Data Analytics: Introduction
Bingsheng He (1), Huynh Phung Huynh (2), Rick Siow Mong Goh (2)
(1) Nanyang Technological University, Singapore
(2) A*STAR Institute of High Performance Computing, Singapore
Outline
- Real-time data analytics
- GPU architectures
- Technical challenges
Real-time Data Analytics
- Real-time data analytics applications
- RTDA categorization
  - On-demand RTDA
  - Continuous RTDA
REAL-TIME DATA ANALYTICS
- Constantly changing world
  - Data evolves over time
  - Analysis results are affected by data revisions
  - Real-time updates are vital
- Examples:
  - Meteorological data -> real-time forecast
  - Traffic data -> driving guidance
  - Stock data -> prediction
MORE APPLICATIONS
- Earth science: earthquake and volcano monitoring
- Physics: high-energy physics data analysis, astrophysics data analysis
- Healthcare: patient monitoring
- Security: real-time surveillance
Data Is All Around Us!
Opportunities to make use of it, generating insights and bringing value to ourselves and others, are plentiful!
TWO MAJOR RTDA TYPES
- On-demand analytics
  - Reactive: waits for a user to issue a query, then delivers the analytics
  - Examples: typhoon prediction, report generation
  - Real-time requirement: low response time
- Continuous analytics
  - Proactive: alerts users with continuous updates in real time
  - Examples: driving guidance, traffic monitoring, stock monitoring
  - Real-time requirements: high freshness, well-defined time constraints, high velocity
Outline
- Real-time data analytics
- GPU architectures
  - Throughput computing principles
  - Real implementation: GPUs
- Technical challenges
Part 1: Throughput Computing
Three key principles behind GPU design, in comparison with CPU design:
- Simplification
- SIMD processing
- Interleaved execution
Kayvon Fatahalian. Beyond Programmable Shading Course, ACM SIGGRAPH 2010
CPU-Style Cores
Adapted from Kayvon Fatahalian. Beyond Programmable Shading Course, ACM SIGGRAPH 2010
Idea 1: Simplification
Remove everything that makes a single instruction stream run fast:
- Caches (50% of die area in typical CPUs!)
- Hard-wired logic: out-of-order execution, branch prediction, memory prefetching
Adapted from Kayvon Fatahalian. Beyond Programmable Shading Course, ACM SIGGRAPH 2010
Consequence: Use Many Simple Cores in Parallel
- Invest the saved transistors in more copies of the simple core
- More copies than a conventional multicore CPU could afford
Adapted from Kayvon Fatahalian. Beyond Programmable Shading Course, ACM SIGGRAPH 2010
Instruction Stream Sharing
Observation: data parallelism is pervasive (vector addition, sparse matrix-vector multiply, ...)
Idea 2: SIMD processing
- Throughput optimized for data-parallel workloads
- Amortise the cost/complexity of managing an instruction stream across many ALUs to reduce overhead (ALUs are very cheap!)
Adapted from Kayvon Fatahalian. Beyond Programmable Shading Course, ACM SIGGRAPH 2010
Instruction Stream Sharing
An example of a SIMD design: one instruction stream shared across eight ALUs
Adapted from Kayvon Fatahalian. Beyond Programmable Shading Course, ACM SIGGRAPH 2010
Improving Throughput
- Stalls: delays due to dependencies in the instruction stream
- Latency: accessing data from memory easily takes 1000+ cycles
Idea 3: Interleave the processing of many work groups on a single core
- Switch to the instruction stream of another (non-stalled, i.e. ready) SIMD group whenever the currently active group stalls
- Ideally, latency is fully hidden and throughput is maximized
Adapted from Kayvon Fatahalian. Beyond Programmable Shading Course, ACM SIGGRAPH 2010
Latency Hiding
Adapted from Kayvon Fatahalian. Beyond Programmable Shading Course, ACM SIGGRAPH 2010
Part 2: Putting the Three Ideas into Practice: A Closer Look at Real GPUs
- NVIDIA GeForce GTX 680
- AMD Radeon HD 7970
Two Latest Architectures
- NVIDIA GeForce GTX 680 (Kepler)
  - 1536 stream processors ("CUDA cores")
  - 192.2 GB/s memory bandwidth, 3.1 TFLOPS (single precision)
  - SPMD execution
- AMD Radeon HD 7970
  - 2048 stream processors
  - 264 GB/s memory bandwidth, 3.79 TFLOPS (single precision)
NVIDIA GeForce GTX 680
- Four Graphics Processing Clusters (GPCs), each housing two Streaming Multiprocessors (SMX)
- Groups of 192 CUDA cores per SMX share an instruction stream
- Up to 1536 individual contexts can be stored
AMD Radeon HD 7970
- 32 Graphics Core Next (GCN) compute units
- Up to 2048 individual contexts can be stored
Outline
- Real-time data analytics
- GPU architectures
- Technical challenges
RTDA Challenges
- On-demand analytics: low response time
- Continuous analytics: high freshness, well-defined time constraints, high velocity
GPU Optimization Challenges
- CPU-GPU data movement and optimization
- GPU memory-hierarchy optimizations
- Multi-GPU system scalability
- Data processing frameworks for GPUs
Thank You and Q&A
Feedback is welcome:
- Bingsheng He, bshe@ntu.edu.sg
- Huynh Phung Huynh, huynhph@ihpc.a-star.edu.sg
- Rick Siow Mong Goh, gohsm@ihpc.a-star.edu.sg
Tutorial site: http://www3.ntu.edu.sg/home/bshe/gpgputut.html
Acknowledgement
- Andrei Hagiescu, Altera
- Weng-Fai Wong, National University of Singapore