System Software for Big Data Computing Cho-Li Wang The University of Hong Kong
HKU High-Performance Computing Lab. Total # of cores: 3,004 CPU + 5,376 GPU cores. RAM size: 8.34 TB. Disk storage: 130 TB. Peak computing power: 27.05 TFlops. GPU cluster (Nvidia M2050, Tianhe-1A): 7.62 TFlops. [Chart: aggregate peak performance of the CS Gideon-II & CC MDRP clusters grew from 2.6 TFlops (2007.7) to 3.1 TFlops (2009), 20 TFlops (2010), and 31.45 TFlops (2011.1), a 12x increase in 3.5 years.]
Big Data: The "3Vs" Model. High Volume (amount of data), High Velocity (speed of data in and out), High Variety (range of data types and sources). 2.5 x 10^18 bytes of data are generated every day. In 2010 the world stored about 800,000 petabytes (a stack of DVDs reaching from the earth to the moon and back); by 2020, that pile of DVDs would stretch halfway to Mars.
Our Research
- Heterogeneous Manycore Computing (CPUs + GPUs)
- Big Data Computing on Future Manycore Chips
- Multi-granularity Computation Migration
JAPONICA: Java with Auto-Parallelization ON GraphIcs Coprocessing Architecture. (1) Heterogeneous Manycore Computing (CPUs + GPUs)
Heterogeneous Manycore Architecture [Figure: CPUs and a GPU on one chip]
New GPUs & Coprocessors
- Intel Sandy Bridge: 2011Q1; 32 nm; 12 HD Graphics 3000 EUs (8 threads/EU); 850-1350 MHz; 95 W; L3: 8 MB + system DDR3; 21 GB/s (system DDR3 memory bandwidth); OpenCL.
- Intel Ivy Bridge: 2012Q2; 22 nm; 16 HD Graphics 4000 EUs (8 threads/EU); 650-1150 MHz; 77 W; L3: 8 MB + system DDR3; 25.6 GB/s (system DDR3 memory bandwidth); OpenCL.
- Intel Xeon Phi: 2012H2; 22 nm; 60 x86 cores (each with a 512-bit vector unit); 600-1100 MHz; 300 W; 8 GB GDDR5; 320 GB/s; OpenMP, OpenCL, OpenACC; less sensitive to branch-divergent workloads.
- AMD Brazos 2.0: 2012Q2; 40 nm; 80 Evergreen shader cores; 488-680 MHz; 18 W; L2: 1 MB + system DDR3; 21 GB/s; OpenCL, C++ AMP; APU.
- AMD Trinity: 2012Q2; 32 nm; 128-384 Northern Islands cores; 723-800 MHz; 17-100 W; L2: 4 MB + system DDR3; 25 GB/s; OpenCL, C++ AMP; APU.
- Nvidia Fermi: 2010Q1; 40 nm; 512 CUDA cores (16 SMs); 1300 MHz; 238 W; L1: 48 KB, L2: 768 KB, 6 GB; 148 GB/s; CUDA, OpenCL, OpenACC.
- Nvidia Kepler (GK110): 2012Q4; 28 nm; 2880 CUDA cores; 836/876 MHz; 300 W; 6 GB GDDR5; 288.5 GB/s; CUDA, OpenCL, OpenACC; 3x perf/watt, Dynamic Parallelism, Hyper-Q.
#1 in Top500 (11/2012): Titan @ Oak Ridge National Lab.
- 18,688 AMD Opteron 6274 16-core CPUs (32 GB DDR3 each)
- 18,688 Nvidia Tesla K20X GPUs
- Total RAM: over 710 TB; total storage: 10 PB
- Peak performance: 27 Petaflop/s (GPU : CPU = 1.311 TF/s : 0.141 TF/s = 9.3 : 1)
- Linpack: 17.59 Petaflop/s
- Power consumption: 8.2 MW
Titan compute board: 4 AMD Opterons + 4 Nvidia Tesla K20X GPUs. The Tesla K20X (Kepler GK110) GPU has 2688 CUDA cores.
Design Challenge: GPUs Can't Handle Dynamic Loops
A GPU is a SIMD/vector machine, so data-dependency issues (RAW, WAW) arise. Solutions?
Static loop:
for (i = 0; i < n; i++) { C[i] = A[i] + B[i]; }
Dynamic loop:
for (i = 0; i < n; i++) { A[w[i]] = 3 * A[r[i]]; }
Non-deterministic data dependencies inhibit exploitation of the inherent parallelism; only DO-ALL loops or embarrassingly parallel workloads get admitted to GPUs.
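The point can be made concrete: whether the dynamic loop is parallelizable is a property of the runtime index arrays, not of the source code. A minimal sketch (the class and helper name are mine, not part of JAPONICA):

```java
// Whether the dynamic loop  A[w[i]] = 3 * A[r[i]]  can run in parallel
// depends entirely on the runtime contents of w[] and r[]. This helper
// reports whether some later iteration reads a location that an earlier
// iteration writes, i.e. an inter-iteration Read-After-Write dependency.
public class DynamicLoopDeps {
    static boolean hasRawDependency(int[] w, int[] r) {
        for (int i = 0; i < w.length; i++)
            for (int j = i + 1; j < r.length; j++)
                if (w[i] == r[j]) return true;
        return false;
    }

    public static void main(String[] args) {
        // Disjoint read and write sets: a DO-ALL loop, safe for the GPU.
        System.out.println(hasRawDependency(
                new int[]{0, 1, 2, 3}, new int[]{4, 5, 6, 7})); // false
        // Iteration 1 reads A[0], which iteration 0 writes: must
        // serialize (or speculate).
        System.out.println(hasRawDependency(
                new int[]{0, 1, 2, 3}, new int[]{9, 0, 9, 9})); // true
    }
}
```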
Dynamic loops are common in scientific and engineering applications Source: Z. Shen, Z. Li, and P. Yew, "An Empirical Study on Array Subscripts and Data Dependencies" 10
GPU-TLS: Thread-Level Speculation on GPU
- Incremental parallelization: sliding-window style execution.
- Efficient dependency-checking schemes.
- Deferred update: speculative updates are stored in the write buffer of each thread until commit time.
- Three phases of execution: (I) speculative execution, (II) dependency checking, (III) commit.
Dependency checking distinguishes intra-thread RAW (valid), inter-thread RAW within a warp (handled by the GPU's lock-step execution, 32 threads per warp), and true inter-thread RAW (a violation).
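The deferred-update and commit phases above can be sketched roughly as follows. This is an illustrative CPU-side simulation under my own naming, not the actual GPU-TLS implementation:

```java
import java.util.*;

// Rough sketch of GPU-TLS-style deferred update (all names hypothetical).
// Each speculative thread records the addresses it reads and buffers its
// writes; threads then commit in loop order, and a thread is squashed
// when it read a location that a logically earlier thread wrote
// (a true inter-thread RAW violation).
public class SpeculativeCommit {
    static class SpecThread {
        final Set<Integer> readSet = new HashSet<>();
        final Map<Integer, Integer> writeBuffer = new LinkedHashMap<>();
    }

    // Phases II + III: check dependencies, then commit surviving threads.
    // Returns the indices of squashed threads (they must re-execute).
    static List<Integer> commit(SpecThread[] threads, int[] memory) {
        List<Integer> squashed = new ArrayList<>();
        Set<Integer> committedWrites = new HashSet<>();
        for (int t = 0; t < threads.length; t++) {
            boolean violation = false;
            for (int addr : threads[t].readSet) {
                if (committedWrites.contains(addr)) { violation = true; break; }
            }
            if (violation) { squashed.add(t); continue; }
            // Deferred update: writes reach memory only at commit time.
            for (Map.Entry<Integer, Integer> e : threads[t].writeBuffer.entrySet()) {
                memory[e.getKey()] = e.getValue();
                committedWrites.add(e.getKey());
            }
        }
        return squashed;
    }

    public static void main(String[] args) {
        SpecThread t0 = new SpecThread();
        t0.writeBuffer.put(5, 42);   // iteration 0 writes A[5]
        SpecThread t1 = new SpecThread();
        t1.readSet.add(5);           // iteration 1 reads A[5]: inter-thread RAW
        int[] memory = new int[10];
        System.out.println(commit(new SpecThread[]{t0, t1}, memory)); // [1]
    }
}
```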
JAPONICA: Profile-Guided Work Dispatching
Dynamic profiling measures each loop's inter-iteration dependences: Read-After-Write (RAW), Write-After-Read (WAR), and Write-After-Write (WAW). The scheduler dispatches work by dependency density:
- High: 8 high-speed x86 cores (multi-core CPU)
- Medium (highly parallel): 64 x86 cores
- Low/None (massively parallel): 2880-core many-core coprocessors
JAPONICA: System Architecture
Sequential Java code with user annotation goes through code translation (JavaR) and static dependence analysis, producing a Program Dependence Graph (PDG).
- No dependence: the DO-ALL parallelizer emits CPU multi-threads and GPU many-threads (CUDA kernels + CPU multi-threads, with task sharing).
- Uncertain dependence: the profiler (on GPU) performs dependency-density analysis per loop (intra-warp and inter-warp RAW checks; WAW/WAR removed by privatization), and the speculator emits either CUDA kernels with GPU-TLS and privatization, or a CPU single thread.
The task scheduler co-schedules CPU and GPU with task stealing (CPU queue: low, high, 0; GPU queue: low, 0), assigning tasks according to their dependency density (DD): high DD runs on a CPU single core; low DD on CPU + GPU-TLS; DD = 0 on CPU multithreads + GPU.
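The dispatching rule at the heart of this architecture can be sketched as follows; the numeric cutoff is an illustrative assumption, since the slides only distinguish high, low, and zero dependency density:

```java
// Sketch of the profile-guided dispatching rule: the profiler measures a
// loop's inter-iteration dependency density (DD) and the scheduler maps
// DD to an execution target. The 0.3 threshold is my assumption, not a
// figure from the JAPONICA system.
public class DdScheduler {
    enum Target { CPU_SINGLE_CORE, CPU_PLUS_GPU_TLS, CPU_MULTITHREAD_PLUS_GPU }

    static Target dispatch(double dd) {
        if (dd == 0.0) return Target.CPU_MULTITHREAD_PLUS_GPU; // DO-ALL loop
        if (dd < 0.3)  return Target.CPU_PLUS_GPU_TLS;         // worth speculating
        return Target.CPU_SINGLE_CORE;                         // too dependent
    }

    public static void main(String[] args) {
        System.out.println(dispatch(0.0)); // CPU_MULTITHREAD_PLUS_GPU
        System.out.println(dispatch(0.1)); // CPU_PLUS_GPU_TLS
        System.out.println(dispatch(0.8)); // CPU_SINGLE_CORE
    }
}
```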
(2) Crocodiles: Cloud Runtime with Object Coherence On Dynamic tiles
General-Purpose Manycore. Tile-based architecture: cores are connected through a 2D network-on-chip.
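A tile-based layout like this is easy to model: number the cores row-major on a grid and route messages dimension by dimension. A toy sketch (grid size and the XY routing policy are illustrative assumptions):

```java
// Toy model of a tile-based manycore: cores are numbered row-major on a
// grid of the given width, and messages are routed dimension-order (XY),
// so a message between two tiles takes |dx| + |dy| hops.
public class MeshNoc {
    final int width;

    MeshNoc(int width) { this.width = width; }

    int x(int core) { return core % width; }
    int y(int core) { return core / width; }

    // Manhattan hop count between two tiles under XY routing.
    int hops(int from, int to) {
        return Math.abs(x(from) - x(to)) + Math.abs(y(from) - y(to));
    }

    public static void main(String[] args) {
        MeshNoc noc = new MeshNoc(32);       // a 32-wide mesh (1024 tiles)
        System.out.println(noc.hops(0, 33)); // core 33 sits at (1,1): 2 hops
    }
}
```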
鳄鱼 ("Crocodile") @ HKU (01/2013-12/2015). Crocodiles: Cloud Runtime with Object Coherence On Dynamic tiles, for future 1000-core tiled processors. [Figure: a tiled chip partitioned into four zones, each with its own memory controller, RAM, GbE, and PCI-E interfaces.]
Dynamic Zoning
- Multi-tenant cloud architecture: partitions vary over time, mimicking a "data center on a chip".
- Performance isolation
- On-demand scaling
- Power efficiency (high flops/watt)
Design Challenge: Off-chip Memory Wall Problem
DRAM performance (latency) has improved slowly over the past 40 years: memory density has doubled nearly every two years, while latency has improved only slowly (a memory access still costs 100+ core clock cycles). [Figures: (a) gap between DRAM density and speed; (b) DRAM latency not improving.]
Lock Contention in Multicore Systems
[Figure: physical memory allocation performance, sorted by function.] As more cores are added, more processing time is spent contending for locks.
[Figure: Exim on Linux collapses as core count grows; y-axis: kernel CPU time (milliseconds/message).]
Challenges and Potential Solutions
- Cache-aware design: data locality and working sets become critical; use compiler or runtime techniques to improve data reuse.
- Stop multitasking: context switching breaks data locality; replace time sharing with space sharing.
马其顿方阵 ("Macedonian Phalanx") many-core operating system: a next-generation operating system for 1000-core processors.
Thanks! For more information: C.L. Wang's webpage: http://www.cs.hku.hk/~clwang/
http://i.cs.hku.hk/~clwang/recruit2012.htm
Multi-granularity Computation Migration
[Figure: migration techniques plotted by granularity (fine to coarse) against system scale / size of migrated state (small to large): SOD (fine-grained, small state), JESSICA2, G-JavaMPI, and the WAVNet Desktop Cloud (coarse-grained, large state).]
WAVNet: Live VM Migration over WAN
- A P2P cloud with live VM migration over WAN
- Virtualized LAN over the Internet
- High penetration via NAT hole punching: establishes direct host-to-host connections, free from proxies and able to traverse most NATs
Zheming Xu, Sheng Di, Weida Zhang, Luwei Cheng, and Cho-Li Wang, "WAVNet: Wide-Area Network Virtualization Technique for Virtual Private Cloud," 2011 International Conference on Parallel Processing (ICPP2011).
WAVNet: Experiments at Pacific Rim Sites
IHEP (Institute of High Energy Physics), Beijing; AIST, Japan; SDSC, San Diego; SIAT (Shenzhen Institutes of Advanced Technology); Academia Sinica, Taiwan; HKU (The University of Hong Kong); Providence University, Taiwan
JESSICA2: Distributed Java Virtual Machine
JESSICA2 (Java-Enabled Single-System-Image Computing Architecture) runs a multithreaded Java program across a cluster of JESSICA2 JVMs (one master, multiple workers), using thread migration, portable Java frames, and a JIT compiler mode.
History and Roadmap of the JESSICA Project
- JESSICA V1.0 (1996-1999): interpreter mode; JVM kernel modification (Kaffe JVM); global heap built on top of TreadMarks (lazy release consistency + homeless).
- JESSICA V2.0 (2000-2006): JIT-compiler mode; JVM kernel modification; lazy release consistency + migrating-home protocol.
- JESSICA V3.0 (2008-2010): built above the JVM (via JVMTI); supports a large object space.
- JESSICA v4 (2010-): Japonica, automatic loop parallelization and speculative execution on GPU and multicore CPU; TrC-DC, a software transactional memory system on clusters with distributed clocks (not discussed).
J1 and J2 received a total of 1,107 source-code downloads.
Past members: King Tin Lam, Chenggang Zhang, Kinson Chan, Ricky Ma.
Stack-on-Demand (SOD)
[Figure: stack frames (program counter + local variables), method area, and heap area on the mobile node; the topmost frame is shipped to a cloud node and rebuilt there, with heap objects (pre-)fetched on demand.]
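The migratable unit in such a design (program counter plus local variables per frame) can be approximated in plain Java serialization terms. This is a simplified illustration of the idea, not the SOD implementation; all names are hypothetical:

```java
import java.io.*;
import java.util.Map;

// Simplified sketch of a stack-on-demand migratable unit (names are
// hypothetical): one stack frame is its method id, program counter, and
// local variables. Such a frame can be flattened, shipped to a cloud
// node, and rebuilt there, while the heap objects it references are
// (pre-)fetched separately.
public class PortableFrame implements Serializable {
    final String methodId;
    final int programCounter;
    final Map<String, Object> locals;

    PortableFrame(String methodId, int programCounter, Map<String, Object> locals) {
        this.methodId = methodId;
        this.programCounter = programCounter;
        this.locals = locals;
    }

    // Flatten the frame so the destination node can rebuild it.
    byte[] toBytes() throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(this);
        }
        return bos.toByteArray();
    }

    // Rebuild a frame from its serialized form on the destination node.
    static PortableFrame fromBytes(byte[] bytes) throws IOException, ClassNotFoundException {
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return (PortableFrame) ois.readObject();
        }
    }
}
```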
Elastic Execution Model via SOD
(a) Remote method call; (b) mimic thread migration; (c) task roaming: like a mobile agent roaming over the network, or a workflow.
With such flexible and composable execution paths, SOD enables agile and elastic exploitation of distributed resources (storage): a Big Data solution! Lightweight, portable, adaptable.
excloud: Integrated Solution for Multi-granularity Migration
[Figure: stack-on-demand (SOD) ships stack segments and a partial heap to a small-footprint JVM on an iOS mobile client; thread migration (JESSICA2) moves threads among multi-thread Java processes across JVMs; live migration moves entire Xen VMs (guest OS on a Xen-aware host OS) from an overloaded desktop PC to a cloud service provider, which duplicates VM instances for scaling; a load balancer triggers the live migration.]
Ricky K. K. Ma, King Tin Lam, Cho-Li Wang, "excloud: Transparent Runtime Support for Scaling Mobile Applications," 2011 IEEE International Conference on Cloud and Service Computing (CSC2011). (Best Paper Award)