Big learning: challenges and opportunities



Big learning: challenges and opportunities
Francis Bach, SIERRA Project-team, INRIA - École Normale Supérieure
December 2013

Scientific context: omnipresent digital media
Big data: multimedia, sensors, indicators, social networks
At all levels: personal, professional, scientific, industrial
Too large and/or complex for manual processing
Computational challenges: dealing with large databases
Statistical challenges: what can be predicted from such databases, and how? Looking for hidden information
Opportunities (and threats)

Machine learning for big data
Large-scale machine learning: large p, large n, large k
p: dimension of each observation (input)
n: number of observations
k: number of tasks (dimension of outputs)
Examples: computer vision, bioinformatics, etc.

Object recognition

Learning for bioinformatics - Proteins
Crucial components of cell life
Predicting multiple functions and interactions
Massive data: up to 1 million for humans!
Complex data: amino-acid sequence, link with DNA, three-dimensional molecule

Search engines - advertising

Advertising - recommendation

Machine learning for big data
Large-scale machine learning: large p, large n, large k
p: dimension of each observation (input)
n: number of observations
k: number of tasks (dimension of outputs)
Examples: computer vision, bioinformatics, etc.
Two main challenges:
1. Computational: ideal running-time complexity = O(pn + kn)
2. Statistical: meaningful results

Big learning: challenges and opportunities - Outline
Scientific context: big data, need for supervised and unsupervised learning
Beyond stochastic gradient for supervised learning: few passes through the data; provable robustness and ease of use
Matrix factorization for unsupervised learning: looking for hidden information through dictionary learning; feature learning

Supervised machine learning
Data: n observations $(x_i, y_i) \in \mathcal{X} \times \mathcal{Y}$, $i = 1, \dots, n$
Prediction as a linear function $\theta^\top \Phi(x)$ of features $\Phi(x) \in \mathbb{R}^p$
(Regularized) empirical risk minimization: find $\hat\theta$ solution of
$\min_{\theta \in \mathbb{R}^p} \frac{1}{n} \sum_{i=1}^n \ell\big(y_i, \theta^\top \Phi(x_i)\big) + \mu \Omega(\theta)$
(convex data-fitting term + regularizer)
Applications to any data-oriented field: computer vision, bioinformatics, natural language processing, etc.

Supervised machine learning
Data: n observations $(x_i, y_i) \in \mathcal{X} \times \mathcal{Y}$, $i = 1, \dots, n$
Prediction as a linear function $\theta^\top \Phi(x)$ of features $\Phi(x) \in \mathbb{R}^p$
(Regularized) empirical risk minimization: find $\hat\theta$ solution of
$\min_{\theta \in \mathbb{R}^p} \frac{1}{n} \sum_{i=1}^n \ell\big(y_i, \theta^\top \Phi(x_i)\big) + \mu \Omega(\theta)$
(convex data-fitting term + regularizer)
Main practical challenges:
Designing/learning good features $\Phi(x)$
Efficiently solving the optimization problem
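
As a concrete illustration of the regularized empirical risk above, here is a minimal NumPy sketch with the logistic loss and a squared-ℓ2 regularizer; the choice of loss, the function names, and the random data are illustrative assumptions, not taken from the talk.

```python
import numpy as np

def objective(theta, Phi, y, mu):
    """Regularized empirical risk (1/n) sum_i l(y_i, theta' Phi(x_i)) + mu * Omega(theta),
    with l the logistic loss and Omega(theta) = 0.5 * ||theta||^2; labels y_i in {-1, +1}."""
    margins = y * (Phi @ theta)                       # y_i * theta' Phi(x_i), shape (n,)
    data_fit = np.mean(np.logaddexp(0.0, -margins))   # log(1 + exp(-margin)), stable form
    return data_fit + mu * 0.5 * np.dot(theta, theta)

def gradient(theta, Phi, y, mu):
    """Full gradient of the objective above (used by batch methods)."""
    margins = y * (Phi @ theta)
    coeff = -y / (1.0 + np.exp(margins))              # derivative of the logistic loss
    return Phi.T @ coeff / len(y) + mu * theta

# Tiny usage example on random data (purely illustrative).
rng = np.random.default_rng(0)
Phi = rng.standard_normal((100, 10))                  # n = 100 feature vectors Phi(x_i)
y = np.sign(rng.standard_normal(100))                 # labels in {-1, +1}
theta = np.zeros(10)
print(objective(theta, Phi, y, mu=0.1))
```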

Stochastic vs. deterministic methods
Minimizing $g(\theta) = \frac{1}{n} \sum_{i=1}^n f_i(\theta)$ with $f_i(\theta) = \ell\big(y_i, \theta^\top \Phi(x_i)\big) + \mu \Omega(\theta)$
Batch gradient descent: $\theta_t = \theta_{t-1} - \gamma_t g'(\theta_{t-1}) = \theta_{t-1} - \frac{\gamma_t}{n} \sum_{i=1}^n f_i'(\theta_{t-1})$
Linear (e.g., exponential) convergence rate in $O(e^{-\alpha t})$
Iteration complexity is linear in n (with line search)

Stochastic vs. deterministic methods
Minimizing $g(\theta) = \frac{1}{n} \sum_{i=1}^n f_i(\theta)$ with $f_i(\theta) = \ell\big(y_i, \theta^\top \Phi(x_i)\big) + \mu \Omega(\theta)$
Batch gradient descent: $\theta_t = \theta_{t-1} - \gamma_t g'(\theta_{t-1}) = \theta_{t-1} - \frac{\gamma_t}{n} \sum_{i=1}^n f_i'(\theta_{t-1})$
Linear (e.g., exponential) convergence rate in $O(e^{-\alpha t})$
Iteration complexity is linear in n (with line search)
Stochastic gradient descent: $\theta_t = \theta_{t-1} - \gamma_t f_{i(t)}'(\theta_{t-1})$
Sampling with replacement: i(t) is a random element of $\{1, \dots, n\}$
Convergence rate in O(1/t)
Iteration complexity is independent of n (step-size selection?)
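
To make the contrast between the two update rules explicit, here is a hedged NumPy sketch of both iterations; grad_full and grad_i are user-supplied oracles for $g'(\theta)$ and $f_i'(\theta)$, and the decaying step size for SGD is one common choice rather than a prescription from the slides.

```python
import numpy as np

def batch_gradient_descent(grad_full, theta0, gamma, n_iters):
    """theta_t = theta_{t-1} - gamma * g'(theta_{t-1}); every step touches all n points."""
    theta = theta0.copy()
    for _ in range(n_iters):
        theta = theta - gamma * grad_full(theta)
    return theta

def stochastic_gradient_descent(grad_i, n, theta0, n_iters, seed=0):
    """theta_t = theta_{t-1} - gamma_t * f'_{i(t)}(theta_{t-1}); one random point per step.
    gamma_t = 1/t is an illustrative decaying step size, not fixed by the slides."""
    rng = np.random.default_rng(seed)
    theta = theta0.copy()
    for t in range(1, n_iters + 1):
        i = rng.integers(n)                     # sampling with replacement from {0, ..., n-1}
        theta = theta - (1.0 / t) * grad_i(theta, i)
    return theta
```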

Stochastic vs. deterministic methods
Goal = best of both worlds: linear rate with O(1) iteration cost, robustness to step size
[Figure: log(excess cost) versus time for stochastic, deterministic, and hybrid methods]

Stochastic average gradient (Le Roux, Schmidt, and Bach, 2012)
Stochastic average gradient (SAG) iteration:
Keep in memory the gradients of all functions $f_i$, $i = 1, \dots, n$
Random selection $i(t) \in \{1, \dots, n\}$ with replacement
Iteration: $\theta_t = \theta_{t-1} - \frac{\gamma_t}{n} \sum_{i=1}^n y_i^t$ with $y_i^t = f_i'(\theta_{t-1})$ if $i = i(t)$, and $y_i^t = y_i^{t-1}$ otherwise
Stochastic version of the incremental average gradient method (Blatt et al., 2008)
Simple implementation
Extra memory requirement: same size as original data (or less)
Simple/robust constant step size
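
A minimal sketch of the SAG iteration above, assuming a user-supplied per-example gradient oracle grad_i(theta, i); the constant step size gamma is left as a parameter (the analysis below uses $\gamma = 1/(16L)$), and the storage layout is an illustrative choice.

```python
import numpy as np

def sag(grad_i, n, p, gamma, n_iters, seed=0):
    """Stochastic average gradient: store the most recent gradient of each f_i and step
    along the average of the stored gradients, with a constant step size gamma."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(p)
    y = np.zeros((n, p))               # y[i] = last gradient of f_i seen so far
    g_avg = np.zeros(p)                # (1/n) * sum_i y[i], maintained incrementally
    for _ in range(n_iters):
        i = rng.integers(n)            # random selection with replacement
        g_new = grad_i(theta, i)
        g_avg += (g_new - y[i]) / n    # O(p) update of the running average
        y[i] = g_new
        theta = theta - gamma * g_avg  # theta_t = theta_{t-1} - (gamma/n) * sum_i y_i^t
    return theta
```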

Stochastic average gradient - Convergence analysis
Assume each $f_i$ is L-smooth and $g = \frac{1}{n} \sum_{i=1}^n f_i$ is $\mu$-strongly convex
Constant step size $\gamma_t = \frac{1}{16L}$. If $\mu \geq \frac{2L}{n}$, there exists $C \in \mathbb{R}$ such that for all $t \geq 0$, $\mathbb{E}\big[g(\theta_t) - g(\theta_*)\big] \leq C \exp\big(-\frac{t}{8n}\big)$
Linear convergence rate with iteration cost independent of n
After each pass through the data, constant error reduction
Breaking two lower bounds

Spam dataset (n = 92,189, p = 823,470)

Large-scale supervised learning - convex optimization
Simplicity: few lines of code
Robustness: step size, adaptivity to problem difficulty
On-going work: single pass through the data (Bach and Moulines, 2013); distributed algorithms
Convexity as a solution to all problems?
Need good features $\Phi(x)$ for linear predictions $\theta^\top \Phi(x)$!

Unsupervised learning through matrix factorization
Given a data matrix $X = (x_1, \dots, x_n) \in \mathbb{R}^{n \times p}$
Principal component analysis: $x_i \approx D \alpha_i$
K-means: $x_i \approx d_k$
In both cases, a matrix factorization $X \approx DA$
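
To make the matrix-factorization reading of PCA and K-means concrete, here is a small NumPy sketch; it stores the atoms/centroids as the rows of D, so the factorization reads X ≈ A D rather than the X ≈ D A convention of the slide, and all names and defaults are illustrative assumptions.

```python
import numpy as np

def pca_factorization(X, k):
    """PCA as a factorization: X_centered ≈ A @ D, with the k rows of D the principal
    directions and the rows of A the codes alpha_i."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    D = Vt[:k]                          # (k, p) dictionary of principal directions
    A = Xc @ D.T                        # (n, k) codes
    return D, A

def kmeans_factorization(X, k, n_iters=50, seed=0):
    """K-means as X ≈ A @ D with A a 0/1 indicator matrix: each x_i ≈ its centroid."""
    rng = np.random.default_rng(seed)
    D = X[rng.choice(len(X), size=k, replace=False)].astype(float)   # initial centroids
    for _ in range(n_iters):
        dists = ((X[:, None, :] - D[None, :, :]) ** 2).sum(axis=2)   # (n, k) squared distances
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                D[j] = X[labels == j].mean(axis=0)
    A = np.eye(k)[labels]               # one-hot codes
    return D, A
```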

Learning dictionaries for uncovering hidden structure
Fact: many natural signals may be approximately represented as a superposition of few atoms from a dictionary $D = (d_1, \dots, d_k)$
Decomposition $x = \sum_{i=1}^k \alpha_i d_i = D\alpha$ with $\alpha \in \mathbb{R}^k$ sparse
Natural signals (sounds, images) (Olshausen and Field, 1997)
Decoding problem: given a dictionary D, find $\alpha$ through regularized convex optimization: $\min_{\alpha \in \mathbb{R}^k} \|x - D\alpha\|_2^2 + \lambda \|\alpha\|_1$
Dictionary learning problem: given n signals $x_1, \dots, x_n$, estimate both the dictionary D and the codes $\alpha_1, \dots, \alpha_n$: $\min_D \sum_{j=1}^n \min_{\alpha_j \in \mathbb{R}^k} \big\{ \|x_j - D\alpha_j\|_2^2 + \lambda \|\alpha_j\|_1 \big\}$
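
A minimal sketch of the decoding problem, assuming a fixed dictionary D with atoms as columns; it uses ISTA (proximal gradient with soft-thresholding), one standard solver for this ℓ1-regularized problem, rather than any specific algorithm from the talk.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1 (component-wise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def ista_decode(x, D, lam, n_iters=200):
    """Decoding min_alpha ||x - D alpha||_2^2 + lam * ||alpha||_1 by proximal gradient
    (ISTA); D holds the atoms as columns, shape (p, k)."""
    alpha = np.zeros(D.shape[1])
    L = 2.0 * np.linalg.norm(D, ord=2) ** 2            # Lipschitz constant of the smooth part
    for _ in range(n_iters):
        grad = 2.0 * D.T @ (D @ alpha - x)             # gradient of ||x - D alpha||_2^2
        alpha = soft_threshold(alpha - grad / L, lam / L)
    return alpha
```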

Challenges of dictionary learning
$\min_D \sum_{j=1}^n \min_{\alpha_j \in \mathbb{R}^k} \big\{ \|x_j - D\alpha_j\|_2^2 + \lambda \|\alpha_j\|_1 \big\}$
Algorithmic challenges: large number of signals, addressed by online learning (Mairal et al., 2009)
Theoretical challenges: identifiability/robustness (Jenatton et al., 2012)
Domain-specific challenges: going beyond plain sparsity with structured sparsity (Jenatton, Mairal, Obozinski, and Bach, 2011)
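
For the algorithmic challenge of many signals, here is a deliberately simplified alternating sketch (sparse-code one signal, then update D); it is an illustrative stand-in for, not an implementation of, the online dictionary learning algorithm of Mairal et al. (2009), and all step sizes and iteration counts are arbitrary assumptions.

```python
import numpy as np

def learn_dictionary(X, k, lam, n_epochs=5, step=0.01, n_ista=50, seed=0):
    """Alternating sketch of dictionary learning: sparse-code each signal with a few ISTA
    steps, then take a gradient step on D and renormalize its columns."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    D = rng.standard_normal((p, k))
    D /= np.linalg.norm(D, axis=0)
    for _ in range(n_epochs):
        for x in X[rng.permutation(n)]:
            # Decode: a few ISTA steps on min_a ||x - D a||_2^2 + lam * ||a||_1.
            alpha = np.zeros(k)
            L = 2.0 * np.linalg.norm(D, ord=2) ** 2
            for _ in range(n_ista):
                a = alpha - 2.0 * D.T @ (D @ alpha - x) / L
                alpha = np.sign(a) * np.maximum(np.abs(a) - lam / L, 0.0)
            # Dictionary update: gradient step on 0.5 * ||x - D alpha||_2^2, then renormalize.
            D -= step * np.outer(D @ alpha - x, alpha)
            D /= np.maximum(np.linalg.norm(D, axis=0), 1e-12)
    return D
```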

Applications - Digital Zooming

Digital Zooming (Couzinie-Devy et al., 2011)

Applications - Task-driven dictionaries inverse half-toning (Mairal et al., 2011)

Extensions - Task-driven dictionaries inverse half-toning (Mairal et al., 2011)

Big learning: challenges and opportunities - Conclusion
Scientific context: big data, need for supervised and unsupervised learning
Beyond stochastic gradient for supervised learning: few passes through the data; provable robustness and ease of use
Matrix factorization for unsupervised learning: looking for hidden information through dictionary learning; feature learning

References
F. Bach and E. Moulines. Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). Technical Report 00831977, HAL, 2013.
D. Blatt, A. O. Hero, and H. Gauchman. A convergent incremental gradient method with a constant step size. 18(1):29-51, 2008.
R. Jenatton, J. Mairal, G. Obozinski, and F. Bach. Proximal methods for hierarchical sparse coding. Journal of Machine Learning Research, 12:2297-2334, 2011.
N. Le Roux, M. Schmidt, and F. Bach. A stochastic gradient method with an exponential convergence rate for strongly-convex optimization with finite training sets. Technical Report, HAL, 2012.
J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online dictionary learning for sparse coding. In International Conference on Machine Learning (ICML), 2009.
B. A. Olshausen and D. J. Field. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37:3311-3325, 1997.