Parallel & Distributed Optimization. Based on Mark Schmidt's slides


Motivation behind using parallel & distributed optimization
Performance: computational throughput increased exponentially over time (Moore's law), but only so many transistors can fit in a limited space (atomic sizes), so serial computation throughput has plateaued (Moore's law is coming to an end).
Space: large datasets cannot fit on a single machine.

Introduction
Parallel computing: one machine with multiple processors (quad-core CPU, GPU).
Distributed computing: multiple computers linked via a network.

Distributed optimization
Strategy: each machine handles a subset of the dataset.
Issues: link failures between machines; synchronization (waiting for the slowest machine when all machines depend on the shared parameter vector x).
Solutions: devise algorithms that limit communication, and decentralize the optimization so that no single machine has to hold the parameter vector x.

Straightforward distributed optimization
Run different algorithms/strategies on different machines/cores, and the first one that finishes wins.
For example, run Gauss-Southwell coordinate descent, randomized coordinate descent, gradient descent, and stochastic gradient descent on separate machines (and, say, randomized coordinate descent on your own computer); a sketch of this race follows below.
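As a rough illustration, here is a minimal Python sketch of this race on a toy least-squares problem: two solvers run in separate processes and the first result to arrive wins. The solver implementations, the toy data, and the tolerances are invented for the example, not taken from the slides.

```python
# "First one that finishes wins": run several solvers in parallel processes.
import concurrent.futures as cf

import numpy as np


def gradient_descent(A, b, tol=1e-6, max_iter=10_000):
    """Plain gradient descent on the least-squares objective 0.5*||Ax - b||^2."""
    x = np.zeros(A.shape[1])
    step = 1.0 / np.linalg.norm(A, 2) ** 2          # 1/L for this quadratic
    for _ in range(max_iter):
        grad = A.T @ (A @ x - b)
        if np.linalg.norm(grad) < tol:
            break
        x -= step * grad
    return "gradient_descent", x


def randomized_coordinate_descent(A, b, tol=1e-6, max_iter=20_000, seed=0):
    """Randomized coordinate descent on the same objective."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    col_norms = (A ** 2).sum(axis=0)
    for _ in range(max_iter):
        residual = A @ x - b
        if np.linalg.norm(A.T @ residual) < tol:
            break
        j = rng.integers(d)
        x[j] -= (A[:, j] @ residual) / col_norms[j]  # exact minimization along coordinate j
    return "randomized_cd", x


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    A, b = rng.normal(size=(200, 20)), rng.normal(size=200)
    with cf.ProcessPoolExecutor() as pool:
        futures = [pool.submit(f, A, b)
                   for f in (gradient_descent, randomized_coordinate_descent)]
        winner, x = next(cf.as_completed(futures)).result()  # first one that finishes wins
    print("winner:", winner)
```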

Parallel first-order methods
Synchronized deterministic gradient descent: the gradient over n samples is a sum that can be broken into separable components, so each of the m machines computes the partial gradient over its share of the samples, and the results are summed before each step.
These allow optimal linear speedups; you should always consider this first!
Issue: what if one of the computers is very slow? What if one of the links fails?
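A minimal sketch of this synchronized scheme on a toy least-squares problem (the shard count, step size, and data are invented): the n samples are split across m workers, each worker returns the gradient over its shard, and the master sums the partial gradients before taking the step.

```python
# Synchronized data-parallel gradient descent: full gradient = sum of shard gradients.
from multiprocessing import Pool

import numpy as np


def partial_gradient(args):
    """Gradient of 0.5*||A_k x - b_k||^2 over one shard of the data."""
    A_k, b_k, x = args
    return A_k.T @ (A_k @ x - b_k)


def synchronized_gd(A, b, m=4, step=None, iters=500):
    shards = list(zip(np.array_split(A, m), np.array_split(b, m)))
    x = np.zeros(A.shape[1])
    step = step or 1.0 / np.linalg.norm(A, 2) ** 2
    with Pool(m) as pool:
        for _ in range(iters):
            # Synchronization point: wait for all m partial gradients.
            # (A real implementation would keep each shard resident on its worker
            # instead of shipping it every iteration.)
            grads = pool.map(partial_gradient, [(A_k, b_k, x) for A_k, b_k in shards])
            x -= step * sum(grads)
    return x


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A, b = rng.normal(size=(400, 10)), rng.normal(size=400)
    x = synchronized_gd(A, b)
    print("gradient norm:", np.linalg.norm(A.T @ (A @ x - b)))
```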

Centralized gradient descent (Hogwild!)
A stochastic gradient method on shared memory: each worker processes its own mini-batch (mini-batch 1, mini-batch 2, mini-batch 3, ...) and updates the shared x asynchronously, without waiting for the others, which saves a lot of time.
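Here is a minimal sketch of the Hogwild!-style idea in Python: several processes share the parameter vector in shared memory and apply stochastic-gradient updates to it without any locking. The toy least-squares data, step size, and process count are assumptions made for the example.

```python
# Lock-free asynchronous SGD on a shared parameter vector (Hogwild!-style sketch).
import multiprocessing as mp

import numpy as np


def worker(shared_x, A, b, step, n_updates, seed):
    rng = np.random.default_rng(seed)
    x = np.frombuffer(shared_x)              # NumPy view onto the shared buffer, no copy
    n = A.shape[0]
    for _ in range(n_updates):
        i = rng.integers(n)                  # sample one row (a tiny "mini-batch")
        grad_i = A[i] * (A[i] @ x - b[i])    # stochastic gradient of 0.5*(a_i^T x - b_i)^2
        x -= step * grad_i                   # lock-free, asynchronous write to shared memory


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A, b = rng.normal(size=(1000, 20)), rng.normal(size=1000)
    shared_x = mp.Array("d", A.shape[1], lock=False)   # shared memory, no lock
    procs = [mp.Process(target=worker, args=(shared_x, A, b, 1e-3, 5000, s))
             for s in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    x = np.frombuffer(shared_x)
    print("gradient norm per sample:", np.linalg.norm(A.T @ (A @ x - b)) / len(b))
```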

Centralized coordinate descent
Sending and receiving the whole parameter vector x is expensive. Instead, use coordinate descent: each machine works on its own mini-batch and transmits one coordinate update at a time, and only the coordinates that have changed are communicated back.
Because the updates arrive asynchronously, this is effectively stochastic coordinate descent, so the step size needs to be decreased for convergence.
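The following single-process simulation sketches the message pattern: each worker owns a mini-batch, computes an update for one coordinate using only its own data, and the only thing that would cross the network is the pair (coordinate index, change). The data, the three-way split, and the 1/sqrt(t) step-size decay are assumptions for the example.

```python
# Simulated centralized asynchronous coordinate descent: updates arrive from
# workers one coordinate at a time; the step size decreases over time.
import numpy as np

rng = np.random.default_rng(0)
A, b = rng.normal(size=(300, 10)), rng.normal(size=300)
batches = np.array_split(np.arange(len(b)), 3)        # mini-batch 1, 2, 3
x = np.zeros(A.shape[1])                              # parameter vector held by the server

for t in range(1, 5001):
    w = rng.integers(len(batches))                    # an update arrives from some worker
    rows = batches[w]
    j = rng.integers(A.shape[1])                      # the worker picked one coordinate
    grad_j = A[rows, j] @ (A[rows] @ x - b[rows])     # partial derivative on its mini-batch only
    lip_j = (A[rows, j] ** 2).sum()                   # coordinate-wise Lipschitz constant on the batch
    delta = -grad_j / (lip_j * t ** 0.5)              # decreasing step size: the updates are stochastic
    x[j] += delta                                     # the message to the server is just (j, delta)

print("gradient norm per sample:", np.linalg.norm(A.T @ (A @ x - b)) / len(b))
```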

Decentralized coordinate descent for sparse datasets
Consider a least-squares problem with a sparse data matrix A, for example

A = [ 0 1 0 0
      1 0 0 0
      1 0 0 1
      0 0 1 0 ]

The coordinate-descent update rule does not seem separable at first sight, but because most entries of A are zero, most terms in each update are unnecessary: coordinates 1 & 4 only touch samples 2 & 3, coordinate 2 only touches sample 1, and coordinate 3 only touches sample 4.
Say you can only fit one sample in a machine. Then sample 2 and sample 3 go to the two machines responsible for coordinates 1 & 4, which run a mini centralized coordinate descent between themselves, while coordinates 2 and 3 are updated on their own machines with no communication at all.
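The decoupling can be seen directly on the 4x4 example. In the sketch below (with an invented right-hand side b and 0-based indices), the group owning coordinates 1 & 4 only ever reads samples 2 & 3, and the other two coordinates are solved completely independently.

```python
# Sparsity decouples the least-squares problem 0.5*||Ax - b||^2 into groups
# of coordinates and samples that never need to communicate with each other.
import numpy as np

A = np.array([[0., 1., 0., 0.],   # sample 1 uses coordinate 2
              [1., 0., 0., 0.],   # sample 2 uses coordinate 1
              [1., 0., 0., 1.],   # sample 3 uses coordinates 1 and 4
              [0., 0., 1., 0.]])  # sample 4 uses coordinate 3
b = np.array([1., 2., 3., 4.])    # invented right-hand side

# Each group pairs a block of coordinates with the only samples that touch it.
groups = [
    {"coords": [0, 3], "samples": [1, 2]},   # runs a "mini centralized coordinate descent"
    {"coords": [1],    "samples": [0]},      # no communication needed
    {"coords": [2],    "samples": [3]},      # no communication needed
]

x = np.zeros(4)
for _ in range(50):                               # each group could live on its own machine(s)
    for g in groups:
        rows = g["samples"]
        for j in g["coords"]:
            r = A[rows] @ x - b[rows]             # local residual: uses only this group's rows
            x[j] -= (A[rows, j] @ r) / (A[rows, j] @ A[rows, j])

print(np.round(x, 4))    # converges to the solution of Ax = b: [2, 1, 4, 1]
```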

Decentralized gradient descent
Distribute the data across machines; we may not want to update a centralized vector x at all. In decentralized gradient descent each machine has its own data samples and its own copy of the parameter vector. With the sparse A above, machine 1 holds sample 1, machine 2 holds sample 2, machine 3 holds sample 3, and machine 4 holds sample 4.
Update rule: each machine takes a gradient step using only its own sample and then exchanges its parameters with the neighbors whose samples share nonzero coordinates. Here machine 1 needs no communication, machine 2 communicates with machine 3, machine 3 communicates with machine 2, and machine 4 needs no communication.
This gives convergence similar to gradient descent with central communication, and the rate depends on the sparsity of the dataset.
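Below is a minimal sketch of one way such a scheme could look on the 4x4 example: every machine keeps its own copy of x, takes a gradient step using only its own sample, and averages its copy with its neighbors (machine 2 with machine 3; machines 1 and 4 with nobody). The mixing rule, step size, and iteration count are assumptions for the example, not the exact update rule from the slides.

```python
# Decentralized gradient descent sketch: local gradient steps plus averaging
# with the neighbors that share nonzero coordinates. Indices are 0-based.
import numpy as np

A = np.array([[0., 1., 0., 0.],
              [1., 0., 0., 0.],
              [1., 0., 0., 1.],
              [0., 0., 1., 0.]])
b = np.array([1., 2., 3., 4.])    # invented right-hand side

# neighbors[i]: machines whose samples share a nonzero coordinate with machine i (itself included)
neighbors = {0: [0], 1: [1, 2], 2: [1, 2], 3: [3]}

X = np.zeros((4, 4))                 # X[i] is machine i's own copy of the parameter vector
step = 0.2
for _ in range(500):
    new_X = np.empty_like(X)
    for i in range(4):
        grad_i = A[i] * (A[i] @ X[i] - b[i])       # gradient from machine i's sample only
        mix = X[neighbors[i]].mean(axis=0)         # communicate with a few neighbors (or none)
        new_X[i] = mix - step * grad_i
    X = new_X

# Each machine recovers the coordinates its own sample touches; compare with
# the centralized least-squares solution x = [2, 1, 4, 1].
print(np.round(X, 2))
```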

Summary
Using parallel and distributed systems is important for speeding up optimization for big data.
Synchronized deterministic gradient descent: optimization halts with a link failure or when a machine is slow at processing its data.
Centralized asynchronous gradient descent: communicating the vector x is costly.
Centralized asynchronous coordinate descent: centralization causes additional communication overhead.
Decentralized asynchronous coordinate descent: helpful for sparse datasets; no communication between machines.
Decentralized gradient descent: helpful for sparse datasets; machines only have to communicate with a few neighbors.