GPGPU for Real-Time Data Analytics: Introduction
Bingsheng He (1), Huynh Phung Huynh (2), Rick Siow Mong Goh (2)
(1) Nanyang Technological University, Singapore
(2) A*STAR Institute of High Performance Computing, Singapore
Outline
- Real-time data analytics
- GPU architectures
- Technical challenges
Real-time Data Analytics
- Real-time data analytics applications
- RTDA categorization
  - On-demand RTDA
  - Continuous RTDA
REAL-TIME DATA ANALYTICS
- Constantly changing world
  - Data evolves over time
  - Analysis results are affected by data revisions
  - Real-time updates are vital
- Examples:
  - Meteorological data -> real-time forecast
  - Traffic data -> driving guidance
  - Stock data -> prediction
MORE APPLICATIONS
- Earth science: earthquake and volcano monitoring
- Physics: high-energy physics data analysis, astrophysics data analysis
- Healthcare: patient monitoring
- Security: real-time surveillance
Data Is All Around Us!
Opportunities to make use of it, generating insights and bringing value to ourselves and others, are plentiful!
TWO MAJOR RTDA TYPES
- On-demand analytics
  - Reactive: waits for a user to issue a query, then delivers the analytics
  - Examples: typhoon prediction, report generation
  - Real-time requirement: low response time
- Continuous analytics
  - Proactive: alerts users with continuous updates in real time
  - Examples: driving guidance, traffic monitoring, stock monitoring
  - Real-time requirements: high freshness, well-defined time constraints, high velocity
Outline
- Real-time data analytics
- GPU architectures
  - Throughput computing principles
  - Real implementation: GPUs
- Technical challenges
Part 1: Throughput Computing
Three key principles behind GPU design, in comparison with CPU design:
- Simplification
- SIMD processing
- Interleaved execution
Kayvon Fatahalian. Beyond Programmable Shading Course, ACM SIGGRAPH 2010
CPU-Style Cores
Adapted from Kayvon Fatahalian. Beyond Programmable Shading Course, ACM SIGGRAPH 2010
Idea 1: Simplification
Remove everything that makes a single instruction stream run fast:
- Caches (50% of die area in typical CPUs!)
- Hard-wired logic: out-of-order execution, branch prediction, memory prefetching
Adapted from Kayvon Fatahalian. Beyond Programmable Shading Course, ACM SIGGRAPH 2010
Consequence: Use Many Simple Cores in Parallel
- Invest the saved transistors in more copies of the simple core
- More copies than a conventional multicore CPU could afford
Adapted from Kayvon Fatahalian. Beyond Programmable Shading Course, ACM SIGGRAPH 2010
Instruction Stream Sharing
Observation: data parallelism is pervasive (vector addition, sparse matrix-vector multiply, ...)
Idea 2: SIMD processing
- Throughput optimized for data-parallel workloads
- Amortise the cost/complexity of managing an instruction stream across many ALUs to reduce overhead (ALUs are very cheap!)
Adapted from Kayvon Fatahalian. Beyond Programmable Shading Course, ACM SIGGRAPH 2010
Instruction Stream Sharing
An example of a SIMD design: one instruction stream shared across eight ALUs
Adapted from Kayvon Fatahalian. Beyond Programmable Shading Course, ACM SIGGRAPH 2010
Improving Throughput
- Stalls: delays due to dependencies in the instruction stream
- Latency: accessing data from memory easily takes 1000+ cycles
Idea 3: Interleave the processing of many work groups on a single core
- Switch to the instruction stream of another (non-stalled, i.e. ready) SIMD group whenever the currently active group stalls
- Ideally, latency is fully hidden and throughput is maximized
Adapted from Kayvon Fatahalian. Beyond Programmable Shading Course, ACM SIGGRAPH 2010
Latency Hiding
Adapted from Kayvon Fatahalian. Beyond Programmable Shading Course, ACM SIGGRAPH 2010
Part 2: Putting the Three Ideas into Practice: A Closer Look at Real GPUs
- NVIDIA GeForce GTX 680
- AMD Radeon HD 7970
Two Latest Architectures
- NVIDIA GeForce GTX 680 (Kepler)
  - 1536 stream processors ("CUDA cores")
  - 192.2 GB/s memory bandwidth, 3.1 TFLOPS (single precision)
  - SPMD execution
- AMD Radeon HD 7970
  - 2048 stream processors
  - 264 GB/s memory bandwidth, 3.79 TFLOPS (single precision)
NVIDIA GeForce GTX 680
- Four Graphics Processing Clusters (GPCs), each housing two Streaming Multiprocessors (SMX)
- Groups of 192 CUDA cores per SMX share an instruction stream
- Up to 1536 individual contexts can be stored
AMD Radeon HD 7970
- 32 Graphics Core Next (GCN) compute units
- Up to 2048 individual contexts can be stored
Outline
- Real-time data analytics
- GPU architectures
- Technical challenges
RTDA Challenges
- On-demand analytics: low response time
- Continuous analytics: high freshness, well-defined time constraints, high velocity
GPU Optimization Challenges
- CPU-GPU data movement and optimization
- GPU memory-hierarchy optimizations
- Multi-GPU system scalability
- Data processing frameworks for GPUs
Thank You and Q&A
Feedback is welcome:
- Bingsheng He, bshe@ntu.edu.sg
- Huynh Phung Huynh, huynhph@ihpc.a-star.edu.sg
- Rick Siow Mong Goh, gohsm@ihpc.a-star.edu.sg
Tutorial site: http://www3.ntu.edu.sg/home/bshe/gpgputut.html
Acknowledgement
- Andrei Hagiescu, Altera
- Weng-Fai Wong, National University of Singapore