System Software for Big Data Computing. Cho-Li Wang The University of Hong Kong

Size: px

Start display at page:

Download "System Software for Big Data Computing. Cho-Li Wang The University of Hong Kong"

Jocelin Thornton
8 years ago
Views:

1 System Software for Big Data Computing Cho-Li Wang The University of Hong Kong

2 HKU High-Performance Computing Lab. Total # of cores: 3004 CPU GPU cores RAM Size: 8.34 TB Disk storage: 130 TB Peak computing power: TFlops GPU-Cluster (Nvidia M2050, Tianhe-1a ): 7.62 Tflops 31.45TFlops (X12 in 3.5 years) T 3.1T 20T CS Gideon-II & CC MDRP Clusters

34 TB Disk storage: 130 TB Peak computing power: 27.

3 Big Data: The "3Vs" Model High Volume (amount of data) High Velocity (speed of data in and out) High Variety (range of data types and sources) 2.5 x : 800,000 petabytes (would fill a stack of DVDs reaching from the earth to the moon and back) By 2020, that pile of DVDs would stretch half way to Mars.

5 x 10 18 2010: 800,000 petabytes (would fill a stack of DVDs reaching from

4 Our Research Heterogeneous Manycore Computing (CPUs+ GUPs) Big Data Computing on Future Manycore Chips Multi-granularity Computation Migration

5 JAPONICA : Java with Auto- Parallelization ON GraphIcs Coprocessing Architecture (1) Heterogeneous Manycore Computing (CPUs+ GUPs)

6 CPUs GPU Heterogeneous Manycore Architecture 6

7 New GPU & Coprocessors Vendor Model Launch Date Fab. (nm) #Accelerator Cores (Max.) GPU Clock (MHz) TDP (watts) Memory Bandwidth (GB/s) Programming Model Remarks Intel Sandy Bridge Ivy Bridge 2011Q Q HD graphics 3000 EUs (8 threads/eu) 16 HD graphics 4000 EUs (8 threads/eu) L3: 8MB Sys mem (DDR3) L3: 8MB Sys mem (DDR3) 21 OpenCL 25.6 Bandwidth is system DDR3 memory bandwidth Xeon Phi 2012H x86 cores (with a 512-bit vector unit) GB GDDR5 320 OpenMP#, OpenCL*, OpenACC% Less sensitive to branch divergent workloads AMD Brazos Q2 40 Trinity 2012Q Evergreen shader cores Northern Islands cores L2: 1MB Sys mem (DDR3) L2: 4MB Sys mem (DDR3) 21 OpenCL, C++AMP 25 APU Nvidia Fermi 2010Q1 40 Kepler (GK110) 2012Q Cuda cores (16 SMs) 2880 Cuda cores / L1: 48KB L2: 768KB 6GB 148 6GB GDDR CUDA, OpenCL, OpenACC 3X Perf/Watt, Dynamic Parallelism, HyperQ 7

threads/eu) 850 1350 650 1150 95 77 L3: 8MB Sys mem (DDR3) L3: 8MB Sys mem (DDR3) 21 OpenCL 25.

8 #1 in Top500 (11/2012): Oak Ridge National Lab. 18,688 AMD Opteron core CPUs (32GB DDR3). 18,688 Nvidia Tesla K20X GPUs Total RAM size: over 710 TB Total Storage: 10 PB. Peak Performance: 27 Petaflop/s o GPU: CPU = TF/s: TF/s = 9.3 : 1 Linpack: Petaflop/s Power Consumption: 8.2 MW Titan compute board: 4 AMD Opteron + 4 NVIDIA Tesla K20X GPUs NVIDIA Tesla K20X (Kepler GK110) GPU: 2688 CUDA cores 8

Peak Performance: 27 Petaflop/s o GPU: CPU = 1.311 TF/s: 0.141 TF/s = 9.3 : 1 Linpack: 17.

9 Design Challenge: GPU Can t Handle Dynamic Loops GPU = SIMD/Vector Data Dependency Issues (RAW, WAW) Solutions? Static loops for(i=0; i<n; i++) { C[i] = A[i] + B[i]; } Dynamic loops for(i=0; i<n; i++) { A[ w[i] ] = 3 * A[ r[i] ]; } 9 Non-deterministic data dependencies inhibit exploitation of inherent parallelism; only DO-ALL loops or embarrassingly parallel workload gets admitted to GPUs. 9

Static loops for(i=0; i<n; i++) { C[i] = A[i] + B[i]; } Dynamic loops for(i=0; i<n; i++) { A[ w[i]

10 Dynamic loops are common in scientific and engineering applications Source: Z. Shen, Z. Li, and P. Yew, "An Empirical Study on Array Subscripts and Data Dependencies" 10

11 GPU-TLS : Thread-level Speculation on GPU Incremental parallelization o sliding window style execution. Efficient dependency checking schemes Deferred update o Speculative updates are stored in the write buffer of each thread until the commit time. 3 phases of execution Phase I Phase II Phase III Speculative execution Dependency checking Commit intra-thread RAW valid inter-thread RAW in GPU true inter-thread RAW GPU: lock-step execution in the same warp (32 threads per warp). 11

12 JAPONICA : Profile-Guided Work Dispatching Dynamic Profiling Parallel High Scheduler Dependency density Medium Highly parallel Inter-iteration dependence: -- Read-After-Write (RAW) -- Write-After-Read (WAR) -- Write-After-Write (WAW) Low/None Massively parallel 8 high-speed x86 cores Multi-core CPU 64 x86 cores 2880 cores Many-core coprocessors 12

Read-After-Write (RAW) -- Write-After-Read (WAR) -- Write-After-Write (WAW) Low/None

13 JAPONICA : System Architecture Sequential Java Code with user annotation Code Translation JavaR Static Dep. Analysis DO-ALL Parallelizer CPU-Multi threads No dependence GPU-Many threads Uncertain CUDA kernels & CPU Multi-threads Task Sharing High DD : CPU single core Low DD : CPU+GPU-TLS 0 : CPU multithreads + GPU Dependency Density Analysis Intra-warp Dep. Check RAW Profiler (on GPU) GPU-TLS Inter-warp Dep. Check Speculator Task Scheduler : CPU-GPU Co-Scheduling Task Stealing CPU queue: low, high, 0 GPU queue: low, 0 Profiling Results WAW/WAR Privatization Program Dependence Graph (PDG) one loop CUDA kernels with GPU-TLS Privatization & CPU Single-thread Assign the tasks among CPU & GPU according to their dependency density (DD) CPU Communication GPU 13

multithreads + GPU Dependency Density Analysis Intra-warp Dep. Check RAW Profiler (on GPU) GPU-TLS Inter-warp Dep.

14 (2) Crocodiles: Cloud Runtime with Object Coherence On Dynamic tiles

15 General Purpose Manycore Tile-based architecture: Cores are connected through a 2D networkon-a-chip

GbE PCI-E Memory Controller 鳄鱼 @ HKU (01/2013-12/2015) Crocodiles: Cloud Runtime with Object Coherence On Dynamic tiles for future 1000-core

16 GbE PCI-E Memory Controller 鳄 HKU (01/ /2015) Crocodiles: Cloud Runtime with Object Coherence On Dynamic tiles for future 1000-core tiled processors PCI-E GbE Memory Controller RAM RAM RAM RAM ZONE 2 ZONE 1 ZONE 4 ZONE 3 Memory Controller GbE PCI-E DRAM Controller PCI-E GbE 16

1000-core tiled processors PCI-E GbE Memory Controller RAM RAM RAM RAM

17 Dynamic Zoning o o o o Multi-tenant Cloud Architecture Partition varies over time, mimic Data center on a Chip. Performance isolation On-demand scaling. Power efficiency (high flops/watt).

18 Design Challenge: Off-chip Memory Wall Problem DRAM performance (latency) improved slowly over the past 40 years. (a) Gap of DRAM Density & Speed (b) DRAM Latency Not Improved Memory density has doubled nearly every two years, while performance has improved slowly (e.g. still 100+ of core clock cycles per memory access)

(a) Gap of DRAM Density & Speed (b) DRAM Latency Not Improved Memory density has

19 Lock Contention in Multicore System Physical memory allocation performance sorted by function. As more cores are added more processing time is spent contending for locks. Lock Contention

20 Exim on Linux collapse Kernel CPU time (milliseconds/message)

21 Challenges and Potential Solutions Cache-aware design o o Data Locality/Working Set getting critical! Compiler or runtime techniques to improve data reuse Stop multitasking o o Context switching breaks data locality Time Sharing Space Sharing 马其顿方阵众核操作系统 : Next-generation Operating System for 1000-core processor 21

22 Thanks! For more information: C.L. Wang s webpage:

24 Multi-granularity Computation Migration Granularity Coarse WAVNet Desktop Cloud G-JavaMPI Fine Small JESSICA2 SOD Large System scale (Size of state) 24

WAVNet: Live VM Migration over WAN A P2P Cloud with Live VM Migration over WAN Virtualized LAN over the Internet High penetration via NAT hole punching Establish direct host-to-host connection Free

25 WAVNet: Live VM Migration over WAN A P2P Cloud with Live VM Migration over WAN Virtualized LAN over the Internet High penetration via NAT hole punching Establish direct host-to-host connection Free from proxies, able to traverse most NATs Key Members VM VM Zheming Xu, Sheng Di, Weida Zhang, Luwei Cheng, and Cho-Li Wang, WAVNet: Wide-Area Network Virtualization Technique for Virtual Private Cloud, 2011 International Conference on Parallel Processing (ICPP2011) 25

26 WAVNet: Experiments at Pacific Rim Areas 北京高能物理所 IHEP, Beijing 日本产业技术综合研究所 (AIST, Japan) SDSC, San Diego 深圳先进院 (SIAT) 中央研究院 (Sinica, Taiwan) 香港大学 (HKU) 静宜大学 (Providence University) 26 26

27 JESSICA2: Distributed Java Virtual Machine A Multithreaded Java Program Java Enabled Single System Image Computing Architecture Thread Migration Portable Java Frame JIT Compiler Mode JESSICA2 JVM JESSICA2 JVM JESSICA2 JVM JESSICA2 JVM JESSICA2 JVM JESSICA2 JVM Master Worker Worker Worker 27 27

28 History and Roadmap of JESSICA Project JESSICA V1.0 ( ) Execution mode: Interpreter Mode JVM kernel modification (Kaffe JVM) Global heap: built on top of TreadMarks (Lazy Release Consistency + homeless) JESSICA V2.0 ( ) Execution mode: JIT-Compiler Mode JVM kernel modification Lazy release consistency + migrating-home protocol JESSICA V3.0 (2008~2010) Built above JVM (via JVMTI) Support Large Object Space JESSICA v.4 (2010~) Japonica : Automatic loop parallization and speculative execution on GPU and multicore CPU TrC-DC : a software transactional memory system on cluster with distributed clocks (not discussed) King Tin LAM, Past Members Chenggang Zhang J1 and J2 received a total of 1107 source code downloads Kinson Chan Ricky Ma 28

29 Stack-on-Demand (SOD) Stack frame A Method Stack frame A Program Counter Local variables Method Area Stack frame A Program Counter Local variables Stack frame B Program Counter Heap Area Rebuilt Method Area Local variables Heap Area Object (Pre-)fetching Cloud node Mobile node objects

30 Elastic Execution Model via SOD (a) Remote Method Call (b) Mimic thread migration (c) Task Roaming : like a mobile agent roaming over the network or workflow With such flexible or composable execution paths, SOD enables agile and elastic exploitation of distributed resources (storage), a Big Data Solution! Lightweight, Portable, Adaptable 30

Stack-on-demand (SOD) Live migration Method Area Code Stacks Heap Method Area Code Stack segments JVM ios Partial Heap Small footprint Load balancer Thread migration (JESSICA2) Cloud service provider

31 Stack-on-demand (SOD) Live migration Method Area Code Stacks Heap Method Area Code Stack segments JVM ios Partial Heap Small footprint Load balancer Thread migration (JESSICA2) Cloud service provider duplicate VM instances for scaling JVM process Mobile client Multi-thread Java process Internet JVM comm. JVM trigger live migration guest OS Xen VM guest OS Xen VM Xen-aware host OS Overloaded Desktop PC Load balancer excloud : Integrated Solution for Multigranularity Migration Ricky K. K. Ma, King Tin Lam, Cho-Li Wang, "excloud: Transparent Runtime Support for Scaling Mobile Applications," 2011 IEEE International Conference on Cloud and Service Computing (CSC2011),. (Best Paper Award)

GPU System Architecture. Alan Gray EPCC The University of Edinburgh

GPU System Architecture. Alan Gray EPCC The University of Edinburgh GPU System Architecture EPCC The University of Edinburgh Outline Why do we want/need accelerators such as GPUs? GPU-CPU comparison Architectural reasons for GPU performance advantages GPU accelerated systems