Big Data Analytics in the Age of Accelerators. Kunle Olukotun, Cadence Design Systems Professor, Stanford University.



Reconfigurable Computing for the Masses, Really? September 4, 2015.

Research Goals. Unleash the full power of modern computing platforms: heterogeneous parallelism. Make parallel application development practical for the masses (Joe/Jane the programmer). Parallelism is not for the average programmer: it is too difficult to find parallelism, to debug, and to get good performance. The goal: parallel applications without parallel programming.

Modern Big Data Analytics: predictive analytics and data science. Enable better decision making; data is only as useful as the decisions it enables. Deliver the capability to mine, search, and analyze this data in real time. This requires the full power of modern computing platforms.

Data Center Computing Platforms. Multicore: 10s of cores, multi-socket. GPU (Graphics Processing Unit): > 1 TFLOPS. Cluster: 1000s of nodes. Accelerators: reconfigurable computing on programmable logic.

Expert Parallel Programming. Each platform has its own expert programming model. GPU (> 1 TFLOPS): CUDA, OpenCL. Multicore/multi-socket (10s of cores): threads, OpenMP. Cluster (1000s of nodes): MPI (Message Passing Interface), MapReduce. Programmable logic (reconfigurable computing): Verilog, VHDL.

MapReduce vs. CUDA. "MapReduce: simplified data processing on large clusters," J. Dean and S. Ghemawat, Communications of the ACM, 2008 (cited by 14,764). "Scalable parallel programming with CUDA," J. Nickolls, I. Buck, M. Garland, K. Skadron, ACM Queue, 2008 (cited by 1,205).

The Big-Data Analytics Programming Challenge. A data analytics application spans data prep, data transformation, network analysis, and predictive analytics, while the hardware targets demand Pthreads/OpenMP (multicore), CUDA/OpenCL (GPU), MPI/MapReduce (cluster), and Verilog/VHDL (FPGA). High-performance domain-specific languages bridge the two.

Domain Specific Languages. A domain-specific language (DSL) is a programming language with restricted expressiveness for a particular domain: high-level, usually declarative, and deterministic.

High Performance DSLs for Data Analytics. Applications: data transformation, graph analysis, prediction, recommendation. Domain-specific languages: OptiWrangler (data extraction), OptiQL (query processing), OptiGraph (graph algorithms), OptiML (machine learning), each with its own DSL compiler. Heterogeneous hardware: multicore, GPU, FPGA, cluster.

OptiML: Overview. Provides a familiar (MATLAB-like) language and API for writing ML applications, e.g. val c = a * b, where a and b are Matrix[Double]. Implicitly parallel data structures: base types Vector[T], Matrix[T], Graph[V,E], Stream[T]; subtypes TrainingSet, IndexVector, Image, ... Implicitly parallel control structures: sum { }, (0::end) { }, gradient { }, untilConverged { }. Anonymous functions with restricted semantics can be passed as arguments to these control structures.

K-means Clustering in OptiML.

untilConverged(kMeans, tol) { kMeans =>
  // assign each sample to the closest mean
  val clusters = samples.groupRowsBy { sample =>
    // distances to current means
    kMeans.mapRows(mean => dist(sample, mean)).minIndex
  }
  // move each cluster centroid to the mean of the points assigned to it
  val newKMeans = clusters.map(e => e.sum / e.length)
  newKMeans
}

No explicit map-reduce, no key-value pairs. No distributed data structures (e.g. RDDs). No annotations for hardware design. Yields efficient multicore and GPU execution, an efficient cluster implementation, and efficient FPGA hardware.
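For readers without the OptiML toolchain, the same algorithm can be sketched with plain Scala collections. This is a minimal sequential analogue, not OptiML itself: groupRowsBy, mapRows, minIndex, and untilConverged are OptiML constructs, emulated here with standard-library calls and an explicit convergence loop.

```scala
object KMeansSketch {
  type Vec = Vector[Double]

  // squared Euclidean distance
  def dist(a: Vec, b: Vec): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // One k-means step: assign each sample to its nearest mean,
  // then move each mean to the centroid of its assigned samples.
  def step(samples: Seq[Vec], means: Seq[Vec]): Seq[Vec] = {
    val clusters = samples.groupBy { s =>
      means.indices.minBy(j => dist(s, means(j)))
    }
    means.indices.map { j =>
      clusters.get(j) match {
        case Some(pts) =>
          val sum = pts.reduce((u, v) => u.zip(v).map { case (x, y) => x + y })
          sum.map(_ / pts.length)
        case None => means(j) // empty cluster: keep the old mean
      }
    }
  }

  // untilConverged emulated as iteration to a tolerance
  def kmeans(samples: Seq[Vec], init: Seq[Vec], tol: Double = 1e-6): Seq[Vec] = {
    var means = init
    var moved = Double.MaxValue
    while (moved > tol) {
      val next = step(samples, means)
      moved = means.zip(next).map { case (a, b) => dist(a, b) }.sum
      means = next
    }
    means
  }
}
```

The point of the slide still holds in the sketch: the algorithm is written as grouping, mapping, and reducing over collections, with no key-value plumbing or distribution annotations.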

High Performance DSLs for Data Analytics with Delite. Applications: data transformation, graph analysis, prediction, recommendation. DSLs: OptiWrangler (data extraction), OptiQL (query processing), OptiGraph (graph algorithms), OptiML (machine learning). The Delite DSL framework underlies each DSL compiler. Heterogeneous hardware: multicore, GPU, FPGA, cluster.

Delite: A Framework for High Performance DSLs. Overall approach: generative programming for abstraction without regret. Embed compilers in Scala libraries; Scala does the syntax and type checking. Use metaprogramming with LMS (type-directed staging) to build an intermediate representation (IR) of the user program, then optimize the IR and map it to multiple targets. Goal: make embedded DSL compilers easier to develop than stand-alone DSLs; as easy as developing a library.
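LMS itself is beyond the scope of a slide, but the core idea — user code runs once and, instead of computing, assembles an IR that can be optimized as it is built — can be sketched in a few lines of plain Scala. The Exp node and constructor names below are illustrative, not the actual LMS API.

```scala
// A toy expression IR with staging-style smart constructors:
// "running" the user program builds IR nodes, and the constructors
// fold constants and simplify algebra as the graph is assembled.
sealed trait Exp
case class Const(v: Int) extends Exp
case class Sym(name: String) extends Exp
case class Plus(a: Exp, b: Exp) extends Exp
case class Times(a: Exp, b: Exp) extends Exp

object Stage {
  def plus(a: Exp, b: Exp): Exp = (a, b) match {
    case (Const(x), Const(y)) => Const(x + y)   // constant folding
    case (x, Const(0))        => x              // x + 0 = x
    case (Const(0), x)        => x
    case _                    => Plus(a, b)
  }
  def times(a: Exp, b: Exp): Exp = (a, b) match {
    case (Const(x), Const(y)) => Const(x * y)
    case (x, Const(1))        => x              // x * 1 = x
    case (_, Const(0))        => Const(0)       // x * 0 = 0
    case _                    => Times(a, b)
  }
}
```

For example, Stage.plus(Sym("x"), Stage.times(Const(2), Const(3))) yields Plus(Sym("x"), Const(6)): the multiply never appears in the IR. Delite's LMS-based IR is far richer (effects, parallel ops, multiple views), but the mechanism is the same in spirit.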

Delite Overview (K. J. Brown et al., "A heterogeneous parallel framework for domain-specific languages," PACT 2011). Key elements: DSLs — Opti{Wrangler, QL, ML, Graph} — embedded in Scala; IR created using type-directed staging. Each DSL's domain ops and domain data feed domain-specific analyses and transformations; shared parallel data and parallel patterns feed generic analyses and transformations; optimized code generators target Scala, C++, CUDA, OpenCL, MPI, and HDL. The result: domain-specific optimization, general parallelism and locality optimizations, and optimized mapping to hardware targets.

Parallel Patterns: Delite Ops. Parallel execution patterns — functional: Map, FlatMap, ZipWith, Reduce, Filter, GroupBy, Sort, Join, union, intersection; non-functional: Foreach, ForeachReduce, Sequential. The set of patterns can grow over time. Patterns provide high-level information about data access patterns and parallelism. The DSL author maps each domain operation to the appropriate pattern; Delite handles parallel optimization, code generation, and execution for all DSLs. Delite provides implementations of these patterns for multiple hardware targets — multicore, GPU, cluster, and FPGA — and the high-level information makes those implementations straightforward and efficient.

Parallel Patterns. Most data analytic computations can be expressed as parallel patterns on collections (e.g. sets, arrays, tables). Pattern and example anonymous function:

map:      in map { e => e + 1 }
zipWith:  inA zipWith(inB) { (eA,eB) => eA + eB }
foreach:  inA foreach { e => if (e > 0) inB(e) = true }
filter:   in filter { e => e > 0 }
reduce:   in reduce { (e1,e2) => e1 + e2 }
groupBy:  in groupBy { e => e.id }

Other patterns: sort, intersection, union.
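As a sanity check on the semantics, here is each pattern written against ordinary (sequential) Scala collections; Delite's versions carry the same meaning but are implicitly parallel. The arrays and values here are made up for illustration.

```scala
object PatternsSketch {
  val in  = Array(1, 2, -3)
  val inA = Array(1, 2, 3)
  val inB = Array(10, 20, 30)

  val mapped   = in.map(e => e + 1)                        // map
  val zipped   = inA.zip(inB).map { case (a, b) => a + b } // zipWith
  val filtered = in.filter(e => e > 0)                     // filter
  val reduced  = in.reduce((e1, e2) => e1 + e2)            // reduce
  val grouped  = in.groupBy(e => e > 0)                    // groupBy (keyed by sign here)

  // foreach: executed for effect only — it writes into a shared array,
  // which is why Delite classifies it as non-functional
  val flags = Array.fill(4)(false)
  in.foreach(e => if (e > 0) flags(e) = true)
}
```

Note the functional patterns (map, zipWith, filter, reduce, groupBy) produce new collections or values, while foreach mutates state; that distinction is what lets Delite reorder and fuse the functional ones freely.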

Parallel Patterns are Universal DSL Components. Patterns used by each DSL — OptiWrangler (data extraction): Map, ZipWith, Reduce, Filter, Sort, GroupBy. OptiQL (query processing): Map, Reduce, Filter, Sort, GroupBy, Intersection. OptiML (machine learning): Map, ZipWith, Reduce, Foreach, GroupBy, Sort. OptiGraph (graph analytics): Map, Reduce, Filter, GroupBy.

Parallel Pattern Language (PPL). A data-parallel language that supports parallel patterns. Example application: k-means.

val clusters = samples groupBy { sample =>
  val dists = kMeans map { mean =>
    mean.zip(sample){ (a,b) => sq(a - b) } reduce { (a,b) => a + b }
  }
  Range(0, dists.length) reduce { (i,j) => if (dists(i) < dists(j)) i else j }
}
val newKMeans = clusters map { e =>
  val sum = e reduce { (v1,v2) => v1.zip(v2){ (a,b) => a + b } }
  val count = e map { v => 1 } reduce { (a,b) => a + b }
  sum map { a => a / count }
}

Key Aspects of the Delite IR. Sea of nodes: a data-flow graph with explicit effects encoded as data dependencies. Parallel patterns (Delite ops): Sequential, Map, Reduce, Zip, Foreach, Filter, GroupBy, Sort, ForeachReduce, FlatMap — skeletons that DSL authors extend. Data structures are also in the IR: structs with restricted fields (scalars, arrays, structs), where field access and struct instantiation are explicit and construct IR nodes. Multiple views — generic, parallel, domain-specific — so the compiler can optimize at any level.

Mapping Nested Parallel Patterns to GPUs. Parallel patterns are often nested in applications: over 70% of the apps in the Rodinia benchmark suite contain kernels with nested parallelism. Efficiently mapping parallel patterns onto GPUs becomes significantly more challenging when patterns are nested — memory coalescing, divergence, dynamic allocations, and a large space of possible mappings.

Mapping Nested Ops to GPUs. Example: m = Matrix.rand(nR,nC); v = m.sumCols versus v = m.sumRows. [Figure: normalized execution time of 1D, thread-block/thread, warp-based, and multi-dim mappings for sumCols and sumRows on [64K,1K], [8K,8K], and [1K,64K] matrices; sumCols suffers from non-coalesced memory accesses, sumRows from limited parallelism.] HyoukJoong Lee et al., "Locality-Aware Mapping of Nested Parallel Patterns on GPUs," MICRO '14.

Nested Delite Ops on Rodinia Apps. [Figure: normalized execution time of manual, GPU 1D, and GPU 2D mappings on Hotspot, GE, BFS, and Pathfinder.] 2D mapping exposes more parallelism and enables coalesced memory accesses.

Heterogeneous Cluster Performance 4 node local cluster: 3.4 GB dataset

MSM Builder Using OptiML (with Vijay Pande). Markov State Models (MSMs) are a powerful means of modeling the structure and dynamics of molecular systems, like proteins. [Figure: productivity vs. performance quadrants, from x86 assembly (low productivity, high performance) and high-productivity/low-performance tools to the high-productivity, high-performance target.]

High Performance Data Analytics with Delite (stack recap): applications (data transformation, graph analysis, prediction, recommendation) → DSLs (OptiWrangler, OptiQL, OptiGraph, OptiML) → Delite DSL framework and compilers → heterogeneous hardware (multicore, GPU, FPGA, cluster).

FPGAs in the Datacenter? FPGA-based accelerators have drawn recent commercial interest from Baidu, Microsoft, and Intel. Key advantage: performance and performance/Watt. Key disadvantage: a lousy programming model. Verilog and VHDL yield high-quality designs but are a poor match for software developers. High-level synthesis (HLS) tools with a C interface yield medium/low-quality designs: building good accelerators still requires architectural knowledge, there is not enough information in the compiler IR to perform access-pattern and data-layout optimizations, and the tools cannot synthesize complex data paths with nested parallelism.

Hardware Design with HLS is Easy. Sum module: add 512 integers stored in external DRAM.

void sum(int* mem) {
  mem[512] = 0;
  for (int i = 0; i < 512; i++) {
    mem[512] += mem[i];
  }
}

27,236 clock cycles for the computation — two orders of magnitude too long!

High Quality Hardware Design with HLS is Still Difficult.

#define ChunkSize (sizeof(MPort)/sizeof(int))  /* width of the DRAM controller interface */
#define LoopCount (512/ChunkSize)

void sum(MPort* mem) {
  MPort buff[LoopCount];
  memcpy(buff, mem, sizeof(buff));   /* burst access */
  int sum = 0;                       /* use a local variable */
  for (int i = 0; i < LoopCount; i++) {
    #pragma PIPELINE                 /* special compiler directives */
    for (int j = 0; j < ChunkSize; j++) {
      #pragma UNROLL
      sum += (int)(buff[i] >> j*sizeof(int)*8);  /* reformatted code */
    }
  }
  mem[512] = sum;
}

302 clock cycles for the computation.

Make HLS Easier with Delite. Delite generates HLS code for each parallel pattern in the application; it currently targets Xilinx Vivado HLS, with optimizations for burst DRAM access. (Nithin George et al., "Hardware system synthesis from domain-specific languages," FPL 2014.)

// data is an array of 512 elements that has been declared and initialized
val result = data.sum

368 clock cycles for the computation. [Figure: generated Sum module with reduce unit, clock and control circuit, and DRAM controller.]

Optimized Approach to HW Generation. Key optimizations: parallel pattern tiling to maximize on-chip data reuse, and metapipelines to exploit nested parallelism. Generate MaxJ code and use Maxeler's MaxCompiler to generate the FPGA bitstream. Flow: Parallel Pattern IR → pattern transformations (fusion, pattern tiling, code motion) → Tiled Parallel Pattern IR → hardware generation (memory allocation, template selection, metapipeline analysis) → MaxJ HGL → bitstream generation → FPGA configuration.

Generalized Parallel Pattern Language (GPPL). Enables general pattern-matching rules for automatic tiling. Polyhedral modeling limits array accesses to affine functions of loop indices; pattern-matching rules can be run on any input program, even those with random and data-dependent accesses. Collections may be N-dimensional dense or 1-dimensional sparse.

Map: generates one element per loop index, aggregating results into a fixed-size output collection. Examples: x.map{e => 2*e}, x.zip(y){(a,b) => a + b}.
MultiFold: reduces one partial result per loop index into a subsection of a fixed-size accumulator. Example: mat.map{row => row.fold{(a,b) => a + b}}.
FlatMap: concatenates an arbitrary number of elements per loop index into a dynamically sized output collection. Example: data.filter{e => e > 0}.
GroupByFold: reduces an arbitrary number of partial results per loop index into buckets based on generated keys. Example: img.histogram.
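MultiFold is the least familiar of the four patterns, so here is a sequential Scala sketch of its contract. The name and exact signature are my simplification, not the GPPL definition: each loop index produces (offset, partial result), and partials landing at the same offset of a fixed-size accumulator are combined.

```scala
object MultiFoldSketch {
  // Sequential sketch of MultiFold: f maps each loop index to an
  // (offset, partial) pair; partials at the same accumulator offset
  // are merged with `combine`. The accumulator's size is fixed up front.
  def multiFold[A](n: Int, accum: Array[A])
                  (f: Int => (Int, A))(combine: (A, A) => A): Array[A] = {
    for (i <- 0 until n) {
      val (offset, partial) = f(i)
      accum(offset) = combine(accum(offset), partial)
    }
    accum
  }
}
```

A full reduction is then a MultiFold into a one-element accumulator, and a histogram is a MultiFold whose offset is the bucket index — which is exactly why fusion on these IRs (next slide) can turn a groupBy-plus-reduce into a single MultiFold.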

PPL Fusion of k-means. Fusion creates a MultiFold from the fused patterns:

val clusters = samples groupBy { sample =>
  val dists = kMeans map { mean =>
    mean.zip(sample){ (a,b) => sq(a - b) } reduce { (a,b) => a + b }
  }
  Range(0, dists.length) reduce { (i,j) => if (dists(i) < dists(j)) i else j }
}
val newKMeans = clusters map { e =>
  val sum = e reduce { (v1,v2) => v1.zip(v2){ (a,b) => a + b } }
  val count = e map { v => 1 } reduce { (a,b) => a + b }
  sum map { a => a / count }
}

Core of k-means using GPPL:

sums = multiFold(n){ i =>                  // for each point in a set of n points
  pt1 = points.slice(i, *)                 // get point pt1 from the points set
  minDistWithIndex = multiFold(k){ j =>    // for each centroid in a set of k centroids
    pt2 = centroids.slice(j, *)            // get centroid pt2 from the centroids set
    dist = distance(pt1, pt2)              // distance between point and centroid
    (0, (dist, j))
  }{ (a,b) => if (a._1 < b._1) a else b }  // keep the closer of the current (distance, index)
                                           // pair and the previously found closest
  minIndex = minDistWithIndex._2           // extract the index of the closest centroid
  (minIndex, pt1)                          // at the index of the closest centroid, add the
}{ (a,b) => map(d){ k => a(k) + b(k) } }   // point (dimension d) to the accumulator
                                           // (a non-affine access)

Parallel Pattern Tiling 1: Strip Mining. Chunk parallel patterns into nested patterns of known size; chunk predictable array accesses with copies. The copy becomes a local memory with hardware prefetching, and strip-mined patterns enable computation reordering.

Simple map (x: Array, size d): x.map{e => 2*e}, i.e. map(d){i => 2*x(i)}, strip mines to
  multiFold(d/b){ii => xTile = x.copy(b + ii); (i, map(b){i => 2*xTile(i)}) }

Sum through matrix rows (mat: Matrix, size m x n): mat.map{row => row.fold{(a,b) => a + b}}, i.e. multiFold(m,n){i,j => (i, mat(i,j))}{(a,b) => a + b}, strip mines to
  multiFold(m/b0,n/b1){ii,jj => matTile = mat.copy(b0+ii, b1+jj); (ii, multiFold(b0,b1){i,j => (i, matTile(i,j))}{(a,b) => a + b}) }{(a,b) => a + b}

Simple data filter (data: Array, size d): data.filter{e => e > 0}, i.e. flatMap(d){i => if (x(i) > 0) x(i) else []}, strip mines to
  flatMap(d/b){ii => xTile = x.copy(b + ii); flatMap(b){i => if (xTile(i) > 0) xTile(i) else []} }
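The key invariant — a strip-mined pattern computes exactly what the flat pattern did, tile by tile — can be checked with a small executable sketch. Plain Scala stands in for the pattern IR here; the tile copy models the generated local buffer, and the divisibility requirement is my simplification (the real transformation handles remainder tiles).

```scala
object StripMineSketch {
  // Strip-mined map: process x in tiles of size b, copying each tile
  // into a local buffer first (models on-chip memory with prefetching).
  def stripMinedMap(x: Array[Int], b: Int)(f: Int => Int): Array[Int] = {
    require(x.length % b == 0, "sketch assumes the tile size divides the length")
    val out = new Array[Int](x.length)
    for (ii <- 0 until x.length by b) {
      val xTile = x.slice(ii, ii + b)            // tile copy -> local memory
      for (i <- 0 until b) out(ii + i) = f(xTile(i))
    }
    out
  }
}
```

The result equals x.map(f) for any tile size, which is what licenses the compiler to pick b purely for on-chip capacity and bandwidth.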

Parallel Pattern Tiling 2: Pattern Interchange. Reorder nested patterns and split imperfectly nested patterns when the intermediate data created is statically known to fit on chip. Reordering improves locality and reuse of on-chip memory, reducing the number of main-memory reads and writes.

Example: matrix multiplication (x: Matrix, size m x p; y: Matrix, size p x n):
  x.mapRows{row => y.mapCols{col => row.zip(col){(a,b) => a*b}.sum }}

Strip mined:
  multiFold(m/b0,n/b1){ii,jj =>
    xTl = x.copy(b0+ii, b1+jj)
    ((ii,jj), map(b0,b1){i,j =>
      multiFold(p/b2){kk =>
        yTl = y.copy(b1+jj, b2+kk)
        (0, multiFold(b2){k => (0, xTl(i,j)*yTl(j,k))}{(a,b) => a + b})
      }{(a,b) => a + b} }) }

Interchanged:
  multiFold(m/b0,n/b1){ii,jj =>
    xTl = x.copy(b0+ii, b1+jj)
    ((ii,jj), multiFold(p/b2){kk =>
      yTl = y.copy(b1+jj, b2+kk)
      (0, map(b0,b1){i,j =>
        (0, multiFold(b2){k => (0, xTl(i,j)*yTl(j,k))}{(a,b) => a + b}) })
    }{(a,b) => map(b0,b1){i,j => a(i,j) + b(i,j)} }) }
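The correctness claim behind interchange — reordering the tile loops changes locality, never the result — can be illustrated with a plain-Scala blocked matrix multiply. This is a sketch of the idea, not the slide's IR: the tile sizes b0, b1, b2 play the same role as above, and the inner accumulation reuses an output block across the kk loop.

```scala
object TiledMatMulSketch {
  type Mat = Array[Array[Double]]

  // Reference: naive triple loop via tabulate
  def naive(x: Mat, y: Mat): Mat = {
    val (m, p, n) = (x.length, y.length, y(0).length)
    Array.tabulate(m, n)((i, j) =>
      (0 until p).map(k => x(i)(k) * y(k)(j)).sum)
  }

  // Tiled version: iterate over (b0 x b1 x b2) blocks, accumulating
  // partial products into the output block tile by tile over kk.
  def tiled(x: Mat, y: Mat, b0: Int, b1: Int, b2: Int): Mat = {
    val (m, p, n) = (x.length, y.length, y(0).length)
    val out = Array.fill(m, n)(0.0)
    for (ii <- 0 until m by b0; jj <- 0 until n by b1; kk <- 0 until p by b2)
      for (i <- ii until math.min(ii + b0, m);
           j <- jj until math.min(jj + b1, n);
           k <- kk until math.min(kk + b2, p))
        out(i)(j) += x(i)(k) * y(k)(j)
    out
  }
}
```

Any choice of block sizes yields the same matrix; the compiler's job is to pick the blocks so that xTl, yTl, and the output block all fit in on-chip memory at once.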

Hardware (MaxJ Code) Generation. Parallel patterns are mapped to a library of hardware templates; each template exploits one or more kinds of parallelism or memory access pattern. Templates are coded in MaxJ, a Java-based hardware generation language from Maxeler.

Hardware Templates.

Pipelined execution units: Vector (SIMD parallelism; Map over scalars), Reduction tree (parallel reduction of associative operations; MultiFold over scalars), Parallel FIFO (buffers ordered outputs of dynamic size; FlatMap over scalars), CAM (fully associative key-value store; GroupByFold over scalars).
Memories: Buffer (scratchpad memory; statically sized array), Double buffer (couples two stages in a metapipeline; metapipeline), Cache (tagged memory exploiting locality in random accesses; non-affine accesses), Tile memory (fetches tiles of data from off-chip memory; transformer-inserted array copy).
Controllers: Sequential (coordinates sequential execution; Sequential IR node), Parallel (coordinates parallel execution; independent IR nodes), Metapipeline (executes nested parallel patterns in a pipelined fashion; outer parallel pattern with multiple inner patterns).

Metapipelining. A metapipeline is a hierarchical pipeline — a pipeline of pipelines — that exploits nested parallelism. Stages can be other nested patterns or combinational logic. It does not require the iteration space to be known statically, nor complete unrolling of inner patterns. Intermediate data from each stage is stored in double buffers, so there is no need for lockstep execution.

Metapipeline Simple Example.

map(n) { r =>
  row = matrix.slice(r)
  diff = map(d) { i => row(i) - sub(i) }
  vprod = map(d,d) { (i,j) => diff(i) * diff(j) }
}

[Figure: the pattern maps to a 4-stage metapipeline — a tile memory controller loading row (Pipe1), a pipeline computing diff (Pipe2), a pipeline computing vprod (Pipe3), and a tile memory controller storing the result (Pipe4) — with successive iterations of r overlapping across the stages.]

Generated k-means Hardware. A high-quality hardware design, similar to Hussain et al. (Adaptive HW & Systems 2011), an FPGA implementation of the k-means algorithm for a bioinformatics application that implements a fixed number of clusters on a small input dataset. Here, tiling analysis automatically generates the buffers and tile-load units needed to handle arbitrarily sized data. The design parallelizes across centroids and vectorizes the point-distance calculations.

Impact of Tiling and Metapipelining. The base design uses burst access. Speedup with tiling alone: up to 15.5x. Speedup with tiling and metapipelining: up to 39.4x. Minimal (often negative!) impact on resource usage: tiled designs have fewer off-chip data loaders and storers.

Summary. In the age of heterogeneous architectures, power-limited computing means parallelism and accelerators — and we need parallelism and acceleration for the masses. DSLs let programmers operate at a high level of abstraction; we need one DSL per domain that targets all architectures, since semantic information enables the compiler to do coarse-grained, domain-specific optimization and translation. We also need a parallelism- and accelerator-friendly IR: a parallel pattern IR structures computation and data, allows aggressive parallelism and locality optimizations through transformations, and provides efficient mapping to heterogeneous architectures. DSL tools for FPGAs still need better performance prediction, more optimization, and shorter compile times (place and route).

Big Data Analytics in the Age of Accelerators: power, performance, productivity, and portability — accelerators (GPU, FPGA, ...), parallel patterns, and high-performance DSLs (OptiML, OptiQL, ...).

Collaborators & Funding. Faculty: Pat Hanrahan, Martin Odersky (EPFL), Chris Ré, Tiark Rompf (Purdue/EPFL). PhD students: Chris Aberger, Kevin Brown, Hassan Chafi, Zach DeVito, Chris De Sa, Nithin George (EPFL), David Koeplinger, HyoukJoong Lee, Victoria Popic, Raghu Prabhakar, Aleksander Prokopec (EPFL), Vojin Jovanovic (EPFL), Vera Salvisberg (EPFL), Arvind Sujeeth. Funding: PPL (Oracle Labs, Nvidia, Intel, AMD, Huawei, SAP), NSF, DARPA.

Extra Slides

Comparing Programming Models of Recent Systems for Data Analytics. Systems (in chronological order): MapReduce, DryadLINQ, Thrust, Scala Collections, Delite, Spark, Lime, PowerGraph, Dandelion. Programming-model features compared: rich data parallelism, nested programming, nested parallelism, multiple collections, random reads. Supported hardware compared: multicore, NUMA, clusters, GPU, FPGA. Requirement: an expressive programming model and support for all platforms. [Per-system feature/hardware checkmarks are not recoverable from the transcription.]

Distributed Heterogeneous Execution. Separate memory regions: NUMA, clusters, FPGAs. Partitioning analysis over multidimensional arrays: decide which data structures and parallel ops to partition across abstract memory regions. Nested pattern transformations: optimize patterns for distributed and heterogeneous architectures. [Figure: DSL application → Delite parallel patterns → partitioning & stencil analysis → scheduled patterns over local and partitioned data → nested pattern transformations → heterogeneous code generation & distributed runtime.]

OptiML on Heterogeneous Cluster 4 node local cluster: 3.4 GB dataset

Multi-socket NUMA Performance. [Figure: speedup vs. thread count (1, 12, 24, 48 threads on 4 sockets) for TPCHQ1 (30x), Gene (45x), GDA (10x), LogReg (10x), k-means (11x), Triangle (5x), and PageRank (10x) at 48 threads.]

Parallel Pattern Language. Implemented a data-parallel language that supports parallel patterns: structured computations (map, zipWith, foreach, filter, reduce, groupBy, ...) and data structures (scalars, arrays, structs). Example application: PageRank.

Graph.nodes map { n =>
  nbrsWeights = n.nbrs map { w => getPrevPageRank(w) / w.degree }
  sumWeights = nbrsWeights reduce { (a,b) => a + b }
  ((1 - damp) / numNodes) + damp * sumWeights
}
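The PageRank computation above can be sketched as an executable plain-Scala analogue. The graph encoding and helper names here are mine, not PPL's (whose Graph type provides nbrs and degree directly); the sketch assumes every node has at least one out-edge so the division by degree is safe.

```scala
object PageRankSketch {
  // out(n) = nodes that n links TO. The slide's n.nbrs are the nodes
  // linking INTO n, so we invert the out-edge map first.
  def pageRank(out: Map[Int, Seq[Int]], damp: Double, iters: Int): Map[Int, Double] = {
    val nodes = out.keys.toSeq
    val numNodes = nodes.length
    val in = nodes.map(n => n -> nodes.filter(m => out(m).contains(n))).toMap
    var pr = nodes.map(n => n -> 1.0 / numNodes).toMap
    for (_ <- 0 until iters) {
      pr = nodes.map { n =>
        // sum of previous ranks of in-neighbors, scaled by their out-degree
        val sumWeights = in(n).map(w => pr(w) / out(w).length).sum
        n -> ((1 - damp) / numNodes + damp * sumWeights)
      }.toMap
    }
    pr
  }
}
```

The per-node body is exactly the slide's map-then-reduce: a map over in-neighbors producing weights, a reduce summing them, and the damped combination.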