Extreme Machine Learning with GPUs John Canny

Similar documents

MLlib: Scalable Machine Learning on Spark

Spark: Cluster Computing with Working Sets

Big Data at Spotify. Anders Arpteg, Ph D Analytics Machine Learning, Spotify

Spark and the Big Data Library

Parallel & Distributed Optimization. Based on Mark Schmidt s slides

COMP 598 Applied Machine Learning Lecture 21: Parallelization methods for large-scale machine learning! Big Data by the numbers

Unified Big Data Processing with Apache Spark. Matei

Mammoth Scale Machine Learning!

Hadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015

Spark. Fast, Interactive, Language- Integrated Cluster Computing

Scalable Machine Learning - or what to do with all that Big Data infrastructure

Probabilistic Models for Big Data. Alex Davies and Roger Frigola University of Cambridge 13th February 2014

Fast Analytics on Big Data with H20

Spark in Action. Fast Big Data Analytics using Scala. Matei Zaharia. project.org. University of California, Berkeley UC BERKELEY

MapReduce and Distributed Data Analysis. Sergei Vassilvitskii Google Research

HIGH PERFORMANCE BIG DATA ANALYTICS

Introduction to Big Data! with Apache Spark" UC#BERKELEY#

CS294-1 Behavioral Data Mining. Prof: John Canny GSI: Kenghao Chang

HPC ABDS: The Case for an Integrating Apache Big Data Stack

Moving From Hadoop to Spark

Accelerating Simulation & Analysis with Hybrid GPU Parallelization and Cloud Computing

Deep Learning For Text Processing

DEEP LEARNING WITH GPUS

Parallel Data Mining. Team 2 Flash Coders Team Research Investigation Presentation 2. Foundations of Parallel Computing Oct 2014

Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU

Apache Spark : Fast and Easy Data Processing Sujee Maniyam Elephant Scale LLC sujee@elephantscale.com

An Introduction to Data Mining

Fast and Expressive Big Data Analytics with Python. Matei Zaharia UC BERKELEY

CSE-E5430 Scalable Cloud Computing Lecture 2

CSE-E5430 Scalable Cloud Computing Lecture 11

How Companies are! Using Spark

Introduction to GPGPU. Tiziano Diamanti

A survey on platforms for big data analytics

Maximizing Hadoop Performance with Hardware Compression

Bayesian Machine Learning (ML): Modeling And Inference in Big Data. Zhuhua Cai Google, Rice University

Bringing Big Data Modelling into the Hands of Domain Experts

Beyond Hadoop with Apache Spark and BDAS

The Internet of Things and Big Data: Intro

HPC with Multicore and GPUs

Outline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging

Data-Intensive Programming. Timo Aaltonen Department of Pervasive Computing

Deep Learning and GPUs. Julie Bernauer

Monitis Project Proposals for AUA. September 2014, Yerevan, Armenia

R and Hadoop: Architectural Options. Bill Jacobs VP Product Marketing & Field CTO, Revolution

Map-Reduce for Machine Learning on Multicore

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM

Maximizing Hadoop Performance and Storage Capacity with AltraHD TM

Sense Making in an IOT World: Sensor Data Analysis with Deep Learning

How to use Big Data in Industry 4.0 implementations. LAURI ILISON, PhD Head of Big Data and Machine Learning

Spark ΕΡΓΑΣΤΗΡΙΟ 10. Prepared by George Nikolaides 4/19/2015 1

Spark and Shark. High- Speed In- Memory Analytics over Hadoop and Hive Data

Scalable Cloud Computing Solutions for Next Generation Sequencing Data

Next Generation GPU Architecture Code-named Fermi

Rethinking SIMD Vectorization for In-Memory Databases

Predict Influencers in the Social Network

Big-data Analytics: Challenges and Opportunities

GPU Hardware and Programming Models. Jeremy Appleyard, September 2015

Spark: Making Big Data Interactive & Real-Time

Hadoop Ecosystem B Y R A H I M A.

Machine Learning over Big Data

BIG DATA What it is and how to use?

Conjugating data mood and tenses: Simple past, infinite present, fast continuous, simpler imperative, conditional future perfect

Hadoop2, Spark Big Data, real time, machine learning & use cases. Cédric Carbone Twitter

APPM4720/5720: Fast algorithms for big data. Gunnar Martinsson The University of Colorado at Boulder

PARALLEL JAVASCRIPT. Norm Rubin (NVIDIA) Jin Wang (Georgia School of Technology)

Hadoop Architecture. Part 1

Lecture 3: Modern GPUs A Hardware Perspective Mohamed Zahran (aka Z) mzahran@cs.nyu.edu

Presto/Blockus: Towards Scalable R Data Analysis

How To Create A Data Visualization With Apache Spark And Zeppelin

Scalable Data Analysis in R. Lee E. Edlefsen Chief Scientist UserR! 2011

Parallel Computing: Strategies and Implications. Dori Exterman CTO IncrediBuild.

MapReduce on GPUs. Amit Sabne, Ahmad Mujahid Mohammed Razip, Kun Xu

Big Data Analytics with Small Footprint: Squaring the Cloud

Bayesian networks - Time-series models - Apache Spark & Scala

Surfing the Data Tsunami: A New Paradigm for Big Data Processing and Analytics

Data Analytics at NERSC. Joaquin Correa NERSC Data and Analytics Services

Deep Learning GPU-Based Hardware Platform

Architectures for Big Data Analytics A database perspective

Parallel Computing for Data Science

Graphical Processing Units to Accelerate Orthorectification, Atmospheric Correction and Transformations for Big Data

Applying Deep Learning to Car Data Logging (CDL) and Driver Assessor (DA) October 22-Oct-15

Big Data and Apache Hadoop s MapReduce

Large scale processing using Hadoop. Ján Vaňo

Performance monitoring at CERN openlab. July 20 th 2012 Andrzej Nowak, CERN openlab

Graphics Cards and Graphics Processing Units. Ben Johnstone Russ Martin November 15, 2011

Lecture 10 - Functional programming: Hadoop and MapReduce

Intro to Map/Reduce a.k.a. Hadoop

Big Data and Scripting Systems build on top of Hadoop

Datacenter Operating Systems

The Stratosphere Big Data Analytics Platform

The Flink Big Data Analytics Platform. Marton Balassi, Gyula Fora" {mbalassi,

Assessing the Performance of OpenMP Programs on the Intel Xeon Phi

A tour on big data classification: Learning algorithms, Feature selection, and Imbalanced Classes

Machine Learning. Hands-On for Developers and Technical Professionals

Large-Scale Test Mining

Distributed Computing and Big Data: Hadoop and MapReduce

Benchmarking Hadoop & HBase on Violin

Big Data Analytics Hadoop and Spark

Transcription:

Extreme Machine Learning with GPUs John Canny Computer Science Division University of California, Berkeley GTC, March, 2014

Big Data Event and text data: Microsoft Yahoo Ebay Quantcast MOOC logs Social Media Health Data Later: Images, Video Recommendation System Sentiment Analysis and Social Network Analysis

Big Data Workflow Digging Around in Data Hypothesize Model Large Scale Exploitation Evaluate Interpret

Top-10 Big Data Algorithms 1. Regression (logistic, linear) + Naïve Bayes 2. Support Vector Machines 3. Greedy Clustering (k-means) 4. Topic Models (Latent Dirichlet Allocation) 5. Collaborative Filtering (Sparse Matrix Factorization) 6. Random Forests 7. Hidden-Markov Models 8. Spectral Clustering 9. Factorization Machines (Regression with Interactions) 10. Multi-layer neural networks 11. Natural Language Parsing

Machine Learning for Big Data Classical: Batch model update in memory features samples DATA Incremental-update Methods Stochastic Gradient Descent (SGD) Gibbs Sampling (GS) Large Datasets: Mini-batch model updates Spark: UC Berkeley HaLoop: U. Washington Mahout DATA M+ M DATA M+ M DATA M+ M Deep Learning BIDMat/BIDMach: (this talk) Downpour SGD: (Google) Hogwild: U. Wisc.-Madison Torch7: (NYU, NEC) Convnet, RNNLib, Visual-RBM: Toronto Theano: Montreal

GPUs at a glance Intel CPU NVIDIA GPU Memory Controller ALU ALU ALU ALU Core Core Core Core L3 Cache L2 Cache

Vive La Difference! Intel CPU NVIDIA GPU Memory Controller ALU ALU ALU ALU Core Core Core Core L3 Cache L2 Cache Hardware transcendentals (power series) 4kB registers: 4 MB register file (!)

A datapoint: NLP Parsing (Canny, Hall, Klein, EMNLP 2013) Natural language parsing with the state-of-the-art Berkeley grammar (1100 symbols, 1.7 million rules) End-to-End Throughput (4 GPUs): 2-2.4 Teraflops (1-1.2 B rules/sec) CPU throughput is about 5 Mflops. i.e. we achieved a 0.5 million-fold speedup on rule evaluation.

Memory Performance Intel 8 core Sandy Bridge CPU NVIDIA GK110 GPU 4kB registers: 5 TB/s 4 MB register file (!) 40 TB/s 1 MB Constant Mem 13 TB/s 512K L1 Cache 1 TB/s 1 MB Shared Mem 1 TB/s 2 MB L2 Cache 8 MB L3 Cache 500 GB/s 1.5 MB L2 Cache 500 GB/s 10s GB Main Memory 20 GB/s 150 GB/s 4 GB Main Memory

Hi Speed CPU kernels Intel 8 core Sandy Bridge CPU NVIDIA GK110 GPU 4kB registers: 5 TB/s 4 MB register file (!) 40 TB/s 1 MB Constant Mem 13 TB/s 512K L1 Cache 1 TB/s 1 MB Shared Mem 1 TB/s 2 MB L2 Cache 8 MB L3 Cache 500 GB/s 1.5 MB L2 Cache 500 GB/s 10s GB Main Memory 20 GB/s 150 GB/s 4 GB Main Memory

A Strategy for Speed on GPUs Intel 8 core Sandy Bridge CPU NVIDIA GK110 GPU 4kB registers: 5 TB/s 4 MB register file (!) 40 TB/s 1 MB Constant Mem 13 TB/s 512K L1 Cache 1 TB/s 1 MB Shared Mem 1 TB/s 2 MB L2 Cache 500 GB/s 8 MB L3 Cache 1.5 MB L2 Cache 500 GB/s 10s GB Main Memory 20 GB/s 150 GB/s 4 GB Main Memory

Using Register and Constant Memory Our goal is to use registers to hold symbols values, and constant memory to hold rule weights. i.e. we commit to compiling the grammar into code, like this (actual GPU code): float L001 = left[1][tid]; float R031 = right[31][tid]; float P001 = L001 * R031 * 1.338202e-001f; P001 += L021 * R019 * 8.32642e-003f;... atomicadd(&parent[1][tid], P001);

Using Register and Constant Memory But: Each GPU core has only 63 (or 255 in Titan) registers. We have 1132 x 3 = 3396 symbols, a less-than-perfect fit. Therefore we use blocking, similar to the approach used in fast CPU matrix kernels, partly inspired by: Usually not worth trying to cache block like you would on CPU GTC 2012 Performance Analysis and Optimization i.e. we cluster the symbols into small subsets which fit into register storage, trying at the same time to balance the number of rules in each block.

A Strategy for Speed on GPUs Intel 8 core Sandy Bridge CPU NVIDIA GK110 GPU 4kB registers: 5 TB/s 4 MB register file (!) 40 TB/s 1 MB Constant Mem 13 TB/s 512K L1 Cache 1 TB/s 1 MB Shared Mem 1 TB/s 2 MB L2 Cache 500 GB/s 8 MB L3 Cache 1.5 MB L2 Cache 500 GB/s 10s GB Main Memory 20 GB/s 150 GB/s 4 GB Main Memory

Blocking Align the (1132) symbols for P, L, R along the axes of a cube. We want small subcubes whose sides are roughly 50 values, that will fit in GPU register memory. 8 blocks on one GPU L P R Blocks that run as separate kernels (function calls) on a GPU.

Serendipity The compiler s version float tmp = L021 * R019; P001 += tmp * 8.32642e-003f; P002 += tmp * 4.31572e-005f; P005 += tmp * 2.81231e-002f; Compiles each rule update line into a single atomic multiply-add instruction, which runs in one cycle. i.e. with 1.7 million rules, the compiled GPU code has about 1.7 million instructions. It runs at about 2 cycles/rule or 1 teraflop per GPU. This is as fast as dense matrix multiply on the GTX-680.

Back to our Top-10 list 1. Regression (logistic, linear) + Naïve Bayes 2. Support Vector Machines 3. Greedy Clustering (k-means) 4. Topic Models (Latent Dirichlet Allocation) 5. Collaborative Filtering (Sparse Matrix Factorization) 6. Random Forests 7. Hidden-Markov Models 8. Spectral Clustering 9. Factorization Machines (Regression with Interactions) 10. Multi-layer neural networks

BIDMat/BIDMach architecture People BIDMach Algorithms Matrix Layer BIDMat Hardware (GPU + CPU) Network

A GPU-enabled Matrix Tool Written in the beautiful Scala language: Interpreter, w/ excellent performance Natural syntax +,-,*,,, etc and high-level expressivity CPU and GPU backend (generics) Hardware acceleration many custom GPU kernels Easy threading (Actors) Java VM + Java codebase runs on Hadoop, Spark Good text processing, integrated XML interpreter Inspired by Matlab, R, SciPy

A modular learning API Zhao+Canny SIAM DM 13, KDD 13, BIGLearn 13 Model DataSource (Memory) CPU host code Optimizer Regularizer GPU 1 thread 1 DataSource (JBOD disks) DataSource HDFS over network data blocks Learner Compressed disk streaming at ~ 1.5GB/s 40-100 Hadoop nodes Mixins Model Optimizer Regularizer Mixins : : 4 GPUs: 80 Gflops to 3 Teraflops typical GPU 2 thread 2

BIDMach sample code Latent Dirichlet Allocation Model: def estep(sdata:mat, user:mat):unit = { for (i <- 0 until opts.uiter) { val preds = SDDMM(modelmat, user, sdata) val unew = user (mm * (sdata / preds)) + opts.alpha user <-- exppsi(unew) } }

BIDMach Every Learner can: Run Sparse or Dense input matrices Run on GPU or CPU Run on single or multiple GPUs Use in-memory or disk data sources (matrix caching) Run on single or multiple network nodes*

BIDMach Performance Performance dominated by a few kernels: Dense-dense MM sgemm (for dense input data) Sparse-dense MM and filtered MM (for sparse inputs) Almost all learners achieve end-to-end performance of: 20-40 Gflops (for sparse input data) 1-3 Tflops (for dense input data) Tested K-means, LDA, ALS, on Mahout, Scikit-Learn, Vowpal Wabbit, Mlbase, with MKL acceleration if possible. Speedups 100x to several 1000x.

Benchmarks Variational Latent Dirichlet Allocation (N hosts x N cores x N GPUs) i.e. 10x improvement for the single-node implementation vs. 64-node cluster, or 500x in per-node throughput. Avg end-to-end throughput with 4 GPUs is 80 Gflops.

Benchmarks Variational Latent Dirichlet Allocation (256 dims) LDA convergence on 1 Terabyte of Twitter data We have run this algorithm up to 10 TB, ~10 16 floating point operations, on a single PC with GTX-680s. This is the largest calculation on commodity hardware that we know of.

MapReduce Version Variational Latent Dirichlet Allocation (256 dims) But you can do this on a big MapReduce Cluster, right? No-one has Probably not The common MapReduce implementations (Hadoop, Spark, Powergraph*) don t scale. i.e. The communication time stops decreasing and starts increasing past a certain point, on this example about 20 machines.

Kylix: A Scalable, Sparse Allreduce (Forthcoming paper) Total communication across all layers a small constant larger than the top layer, which is close to optimal. Communication volume across layers has a characteristic Kylix shape.

Learner Output 1.00%, ll=-4.985, gf=71.878, secs=116.9, GB=10.04, MB/s=85.87, GPUmem=0.57, 0.57, 0.57, 0.57 2.00%, ll=-4.852, gf=67.469, secs=254.9, GB=20.54, MB/s=80.56, GPUmem=0.57, 0.57, 0.57, 0.57 3.00%, ll=-4.824, gf=68.385, secs=379.8, GB=31.00, MB/s=81.62, GPUmem=0.57, 0.57, 0.57, 0.57 4.00%, ll=-4.803, gf=68.469, secs=517.2, GB=42.27, MB/s=81.73, GPUmem=0.57, 0.57, 0.57, 0.57 5.00%, ll=-4.787, gf=69.333, secs=639.4, GB=52.91, MB/s=82.74, GPUmem=0.57, 0.57, 0.57, 0.57 6.00%, ll=-4.784, gf=69.589, secs=768.7, GB=63.84, MB/s=83.04, GPUmem=0.57, 0.57, 0.57, 0.57 7.00%, ll=-4.784, gf=70.226, secs=892.2, GB=74.77, MB/s=83.80, GPUmem=0.57, 0.57, 0.57, 0.57 8.00%, ll=-4.762, gf=70.415, secs=1023.6, GB=86.00, MB/s=84.02, GPUmem=0.57, 0.57, 0.57, 0.57 9.00%, ll=-4.765, gf=70.492, secs=1135.5, GB=95.50, MB/s=84.10, GPUmem=0.57, 0.57, 0.57, 0.57 10.00%, ll=-4.761, gf=70.488, secs=1260.1, GB=105.97, MB/s=84.10, GPUmem=0.57, 0.57, 0.57, 0.57 11.00%, ll=-4.762, gf=70.346, secs=1373.9, GB=115.29, MB/s=83.92, GPUmem=0.57, 0.57, 0.57, 0.57 12.00%, ll=-4.758, gf=70.087, secs=1496.1, GB=125.09, MB/s=83.61, GPUmem=0.57, 0.57, 0.57, 0.57 13.00%, ll=-4.760, gf=69.812, secs=1621.2, GB=135.01, MB/s=83.28, GPUmem=0.57, 0.57, 0.57, 0.57 14.00%, ll=-4.756, gf=69.549, secs=1752.5, GB=145.40, MB/s=82.97, GPUmem=0.57, 0.57, 0.57, 0.57 15.00%, ll=-4.753, gf=69.229, secs=1890.2, GB=156.12, MB/s=82.59, GPUmem=0.57, 0.57, 0.57, 0.57 16.00%, ll=-4.748, gf=68.930, secs=2016.9, GB=165.87, MB/s=82.24, GPUmem=0.57, 0.57, 0.57, 0.57 17.00%, ll=-4.752, gf=68.697, secs=2136.9, GB=175.16, MB/s=81.97, GPUmem=0.57, 0.57, 0.57, 0.57 18.00%, ll=-4.749, gf=68.411, secs=2275.6, GB=185.74, MB/s=81.62, GPUmem=0.57, 0.57, 0.57, 0.57 19.00%, ll=-4.759, gf=68.125, secs=2426.5, GB=197.24, MB/s=81.29, GPUmem=0.57, 0.57, 0.57, 0.57 20.00%, ll=-4.751, gf=67.889, secs=2573.0, GB=208.40, MB/s=80.99, GPUmem=0.57, 0.57, 0.57, 0.57 21.00%, ll=-4.740, gf=67.661, secs=2718.3, GB=219.43, MB/s=80.72, GPUmem=0.57, 0.57, 0.57, 0.57 22.00%, ll=-4.760, gf=67.407, secs=2855.3, GB=229.62, MB/s=80.42, GPUmem=0.57, 0.57, 0.57, 0.57 23.00%, ll=-4.760, gf=67.179, secs=2986.0, GB=239.29, MB/s=80.14, GPUmem=0.57, 0.57, 0.57, 0.57 24.00%, ll=-4.755, gf=66.968, secs=3132.1, GB=250.21, MB/s=79.89, GPUmem=0.57, 0.57, 0.57, 0.57 25.00%, ll=-4.756, gf=66.776, secs=3266.1, GB=260.16, MB/s=79.66, GPUmem=0.57, 0.57, 0.57, 0.57

Benchmarks Alternating Least Squares: (synthetic Netflix Data) i.e. order of magnitude speedup for single-node vs. 64- node cluster, or 1000x speedup in per-node throughput. Uses the SDDMM matrix primitive and interleaved conjugate gradient updates (KDD 2013 paper). About 80 Gflops end-to-end throughput w/ 4 GPUs.

Benchmarks Logistic Regression: (100GB Twitter) i.e. single-node implementation takes 2x time, and has 50x the per-node throughput. But for multi-model regression (many different targets), BIDMach achieves 50x the throughput (one node vs 100), and 5000x the per-node throughput.

Benchmarks Pagerank Iteration (using Sparse Allreduce) i.e. for in-memory data, single-node performance is comparable with a 64-node cluster (about 40x faster in pernode throughput)

Toward Interactive Machine Learning People Interactive ML Algorithms Matrix Layer Hardware (GPU + CPU) Network

Gibbs Sampling The most general method for inference on probabilistic graphical models: Simple to specify and implement Flexible (grouping, ordering) Unbiased Allows estimation of arbitrary statistics But: Slow!! Hard to do parameter optimization

EM and Cooled Gibbs Sampling EM: Separate parameters from other latent variables: joint is P(X,Θ), maximize P(Θ) and compute expected log likelihood. Standard Gibbs: blocked sample from P(X Θ) and P(Θ X) Cooled Gibbs: sample from P(X 1,, X k Θ) and P(Θ X 1,, X k ) for independent groups X i The X i have the same conditional distribution as before. parameters now ~ P k (Θ), i.e. the parameter distribution cooled to T=1/k. The samples X i can often be computed very fast.

EM and Cooled Gibbs Sampling In the language of graphical models: Run independent simulations with tied parameters Θ Θ

EM and Cooled Gibbs Sampling What cooling does: Likelihood function in model parameter space (peaks are good models)

EM and Cooled Gibbs Sampling What cooling does: Likelihood function in model parameter space (peaks are good models)

Cooled Gibbs Sampling The fastest version of this sampler represents a collection of samples by its average. For some models, e.g. LDA, other factor models, the fastest sampler is also exact. The fast sampler gives a two order-of-magnitude speedup for inference on LDA models. We can use both samplers on general graphical models: Run the fast, cooled sampler to convergence. Run the exact cooled sampler for a few iterations.

Toward Interactive Modeling We can control the temperature of individual parameters in a model, and use this for human-supervised search. See Biye s poster.

Future Caffeinated BIDMach: Wrapping a DNN toolkit called CAFFE with a Java native API Genomics Module: Very fast, bit-level edit distance (2 Tcups) Sorting (the new hashing) Probabilistic alignment/assembly Cleaving, reversing, filtering,

Summary You can achieve order-of-magnitude speedups for general machine learning through roofline design (BIDMach). With GPU acceleration, you gain a further order of magnitude. You can scale the performance of GPU-accelerated ML, but not with current MapReduce frameworks. Exciting possibilities for fundamental improvements in ML through deep codesign (model compilation, cooled sampling).

Software 42 Code: github.com/biddata/bidmat github.com/biddata/bidmach BSD-style open source libs and dependencies, Amazon AMI for test-driving http://bid2.berkeley.edu/bid-data-project/overview/