Machine learning challenges for big data Francis Bach SIERRA Project-team, INRIA - Ecole Normale Supérieure Joint work with R. Jenatton, J. Mairal, G. Obozinski, N. Le Roux, M. Schmidt - December 2012
Machine learning
Computer science and applied mathematics: modeling, prediction and control from training examples
- Theory: analysis of statistical performance
- Algorithms: numerical efficiency and stability
- Applications: computer vision, bioinformatics, neuro-imaging, text, audio
Context: machine learning for big data
Large-scale machine learning: large p, large n, large k
- p: dimension of each observation (input)
- k: number of tasks (dimension of outputs)
- n: number of observations
Examples: computer vision, bioinformatics, etc.
Two main challenges:
1. Computational: ideal running-time complexity O(pn + kn)
2. Statistical: meaningful results
Machine learning challenges for big data: recent work
1. Large-scale supervised learning: going beyond stochastic gradient descent (Le Roux, Schmidt, and Bach, 2012)
2. Unsupervised learning through dictionary learning: imposing structure for interpretability (Bach, Jenatton, Mairal, and Obozinski, 2011, 2012)
3. Interactions between convex and combinatorial optimization: submodular functions (Bach, 2011; Obozinski and Bach, 2012)
Supervised machine learning
- Data: n observations $(x_i, y_i) \in \mathcal{X} \times \mathcal{Y}$, $i = 1, \dots, n$
- Prediction as a linear function $\theta^\top \Phi(x)$ of features $\Phi(x) \in \mathbb{R}^p$
- (Regularized) empirical risk minimization: find $\hat{\theta}$ solution of
  $\min_{\theta \in \mathbb{R}^p} \frac{1}{n} \sum_{i=1}^n \ell\big(y_i, \theta^\top \Phi(x_i)\big) + \mu \Omega(\theta)$
  (convex data-fitting term + regularizer)
- Applications to any data-oriented field: computer vision, bioinformatics, natural language processing, etc.
- Main practical challenges: designing/learning good features $\Phi(x)$; efficiently solving the optimization problem
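As a concrete instance of the objective above, here is a minimal sketch (not from the slides) that evaluates the regularized empirical risk for the logistic loss $\ell(y, s) = \log(1 + e^{-ys})$ with the squared-$\ell_2$ regularizer $\Omega(\theta) = \|\theta\|_2^2 / 2$; the data and function names are illustrative:

```python
import numpy as np

def logistic_loss(y, score):
    # l(y, s) = log(1 + exp(-y s)) for labels y in {-1, +1}
    return np.log1p(np.exp(-y * score))

def regularized_risk(theta, X, y, mu):
    # (1/n) sum_i l(y_i, theta^T Phi(x_i)) + mu * ||theta||^2 / 2,
    # with Phi(x) = x for simplicity
    return logistic_loss(y, X @ theta).mean() + 0.5 * mu * theta @ theta

rng = np.random.default_rng(0)
n, p = 100, 5
X = rng.standard_normal((n, p))                     # features Phi(x_i)
y = np.sign(X @ rng.standard_normal(p) + 0.1 * rng.standard_normal(n))
print(regularized_risk(np.zeros(p), X, y, mu=0.1))  # log(2) ~ 0.693 at theta = 0
```

A handy sanity check: at $\theta = 0$ every score is zero, so the risk equals $\log 2$ for any dataset.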
Stochastic vs. deterministic methods
Minimizing $g(\theta) = \frac{1}{n} \sum_{i=1}^n f_i(\theta)$ with $f_i(\theta) = \ell\big(y_i, \theta^\top \Phi(x_i)\big) + \mu \Omega(\theta)$
- Batch gradient descent: $\theta_t = \theta_{t-1} - \gamma_t\, g'(\theta_{t-1}) = \theta_{t-1} - \frac{\gamma_t}{n} \sum_{i=1}^n f_i'(\theta_{t-1})$
  - Linear (e.g., exponential) convergence rate
  - Iteration complexity is linear in n
- Stochastic gradient descent: $\theta_t = \theta_{t-1} - \gamma_t\, f_{i(t)}'(\theta_{t-1})$
  - Sampling with replacement: i(t) random element of {1, ..., n}
  - Convergence rate in O(1/t)
  - Iteration complexity is independent of n
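The trade-off above can be sketched on least squares, where $f_i'(\theta) = (x_i^\top \theta - y_i)\, x_i$; the step sizes and iteration counts below are illustrative choices, not the slides' settings:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 10
X = rng.standard_normal((n, p))
theta_star = rng.standard_normal(p)
y = X @ theta_star                      # noiseless targets

def grad_full(theta):                   # g'(theta) = (1/n) sum_i f_i'(theta)
    return X.T @ (X @ theta - y) / n

def grad_one(theta, i):                 # f_i'(theta) for one sampled example
    return (X[i] @ theta - y[i]) * X[i]

# Batch gradient descent: n gradient evaluations per iteration, linear rate.
theta_b = np.zeros(p)
for t in range(200):
    theta_b -= 0.1 * grad_full(theta_b)

# Stochastic gradient descent: one gradient per iteration, decaying steps.
theta_s = np.zeros(p)
for t in range(200 * n):                # same total number of gradient evaluations
    i = rng.integers(n)                 # sampling with replacement
    theta_s -= 0.05 / (1 + 0.01 * t) * grad_one(theta_s, i)

print(np.linalg.norm(theta_b - theta_star), np.linalg.norm(theta_s - theta_star))
```

Per gradient evaluation, SGD makes much faster initial progress, while the batch method keeps contracting at its linear rate; this is exactly the crossover the next slide illustrates.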
Stochastic vs. deterministic methods
Goal = best of both worlds: linear rate with O(1) iteration cost
[Figure: log(excess cost) vs. time for the stochastic, deterministic, and hybrid methods.]
Stochastic average gradient (Le Roux, Schmidt, and Bach, 2012)
SAG iteration:
- Keep in memory the gradients of all functions $f_i$, $i = 1, \dots, n$
- Random selection $i(t) \in \{1, \dots, n\}$ with replacement
- Iteration: $\theta_t = \theta_{t-1} - \frac{\gamma_t}{n} \sum_{i=1}^n y_i^t$ with $y_i^t = f_i'(\theta_{t-1})$ if $i = i(t)$, and $y_i^t = y_i^{t-1}$ otherwise
- Stochastic version of the incremental average gradient method (Blatt et al., 2008)
- Simple implementation
- Extra memory requirement: same size as original data (or less)
Stochastic average gradient: convergence analysis
- Assume each $f_i$ is L-smooth and $g = \frac{1}{n} \sum_{i=1}^n f_i$ is µ-strongly convex
- Constant step size $\gamma_t = \frac{1}{2 n \mu}$. If $\frac{\mu}{L} \geq \frac{8}{n}$, then there exists $C \in \mathbb{R}$ such that for all $t \geq 0$,
  $\mathbb{E}\big[g(\theta_t) - g(\theta^*)\big] \leq C \Big(1 - \frac{1}{8n}\Big)^t$
- Linear convergence rate with iteration cost independent of n
- Linear convergence rate independent of the condition number
- After each pass through the data, constant error reduction
- Application to linear systems
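A minimal sketch of the SAG iteration on an $\ell_2$-regularized least-squares problem; for simplicity it uses a small constant step size rather than the $1/(2n\mu)$ of the theorem, and all names and constants are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 10
X = rng.standard_normal((n, p))
y = X @ rng.standard_normal(p) + 0.01 * rng.standard_normal(n)
mu = 0.01                                # strength of the l2 regularizer

def grad_i(theta, i):
    # f_i'(theta) for f_i(theta) = (x_i^T theta - y_i)^2 / 2 + (mu/2) ||theta||^2
    return (X[i] @ theta - y[i]) * X[i] + mu * theta

def obj(theta):
    # g(theta) = (1/n) sum_i f_i(theta)
    return 0.5 * np.mean((X @ theta - y) ** 2) + 0.5 * mu * theta @ theta

theta = np.zeros(p)
memory = np.zeros((n, p))                # stored gradients y_i^t: extra O(np) memory
g_avg = np.zeros(p)                      # running value of (1/n) sum_i y_i^t
step = 0.01                              # small constant step (illustrative choice)

for t in range(50 * n):                  # roughly 50 effective passes over the data
    i = rng.integers(n)                  # sampling with replacement
    g_new = grad_i(theta, i)
    g_avg += (g_new - memory[i]) / n     # refresh the average in O(p) time
    memory[i] = g_new
    theta = theta - step * g_avg         # theta_t = theta_{t-1} - (gamma/n) sum_i y_i^t

print(obj(theta))
```

The key point is the update of `g_avg`: only the freshly sampled gradient is recomputed, so each iteration costs O(p) like SGD, while the direction used is an average over all n functions, which is what enables the linear rate.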
Stochastic average gradient: simulation experiments
protein dataset (n = 145751, p = 74); dataset split in two (training/testing)
[Figure: training cost (objective minus optimum, log scale) and testing cost (test logistic loss) vs. effective passes, comparing steepest descent, AFG, L-BFGS, Pegasos, RDA, SAG with step 2/(L + nµ), and SAG with line search.]
Stochastic average gradient: simulation experiments
cover type dataset (n = 581012, p = 54); dataset split in two (training/testing)
[Figure: training cost (objective minus optimum, log scale) and testing cost (test logistic loss) vs. effective passes, comparing steepest descent, AFG, L-BFGS, Pegasos, RDA, SAG with step 2/(L + nµ), and SAG with line search.]
Learning dictionaries for uncovering hidden structure
- Fact: many natural signals may be approximately represented as a superposition of few atoms from a dictionary $D = (d_1, \dots, d_p)$
- Decomposition $x = \sum_{i=1}^p \alpha_i d_i$ with $\alpha \in \mathbb{R}^p$ sparse
- Natural signals (sounds, images) and others
- Decoding problem: given a dictionary D, find α through regularized convex optimization:
  $\min_{\alpha \in \mathbb{R}^p} \big\| x - \sum_{i=1}^p \alpha_i d_i \big\|_2^2 + \lambda \|\alpha\|_1$
- Dictionary learning problem: given n signals $x^1, \dots, x^n$, estimate both the dictionary D and the codes $\alpha^1, \dots, \alpha^n$:
  $\min_D \sum_{j=1}^n \min_{\alpha^j \in \mathbb{R}^p} \Big\{ \big\| x^j - \sum_{i=1}^p \alpha_i^j d_i \big\|_2^2 + \lambda \|\alpha^j\|_1 \Big\}$
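The decoding problem is a lasso and can be solved, for instance, by proximal gradient descent (ISTA), where each step is a gradient step on the quadratic term followed by soft-thresholding; this is one standard solver, not necessarily the one used in the cited work, and the synthetic data below are illustrative:

```python
import numpy as np

def soft_threshold(v, tau):
    # proximal operator of tau * ||.||_1 (entrywise soft-thresholding)
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def decode_ista(x, D, lam, n_iter=2000):
    # min_alpha ||x - D alpha||_2^2 + lam * ||alpha||_1 via proximal gradient
    L = 2.0 * np.linalg.norm(D, 2) ** 2  # Lipschitz constant of the quadratic part
    alpha = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * D.T @ (D @ alpha - x)
        alpha = soft_threshold(alpha - grad / L, lam / L)
    return alpha

rng = np.random.default_rng(0)
d, p = 20, 50                                # signal dimension d, dictionary size p
D = rng.standard_normal((d, p))
D /= np.linalg.norm(D, axis=0)               # unit-norm atoms d_1, ..., d_p
alpha_true = np.zeros(p)
alpha_true[[3, 17, 40]] = [1.0, -2.0, 1.5]   # sparse code: three active atoms
x = D @ alpha_true                           # noiseless superposition
alpha_hat = decode_ista(x, D, lam=0.02)
```

For dictionary learning, this decoding step would alternate with updates of D, each signal contributing one such lasso problem.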
Challenges of dictionary learning
$\min_D \sum_{j=1}^n \min_{\alpha^j \in \mathbb{R}^p} \Big\{ \big\| x^j - \sum_{i=1}^p \alpha_i^j d_i \big\|_2^2 + \lambda \|\alpha^j\|_1 \Big\}$
- Algorithmic challenges: large number of signals → online learning (Mairal et al., 2009)
- Theoretical challenges: identifiability/robustness (Jenatton et al., 2012)
- Domain-specific challenges: going beyond plain sparsity → structured sparsity (Jenatton, Mairal, Obozinski, and Bach, 2011)
Structured sparsity Sparsity-inducing behavior depends on corners of unit balls
Structured sparse PCA (Jenatton et al., 2009)
[Figure: raw data vs. sparse PCA dictionary elements.]
Unstructured sparse PCA: many zeros do not lead to better interpretability
Structured sparse PCA (Jenatton et al., 2009)
[Figure: raw data vs. structured sparse PCA dictionary elements.]
Structured sparse PCA: enforcing the selection of convex nonzero patterns gives robustness to occlusion in face identification
Conclusion and open problems
Machine learning for big data: having a large-scale hardware infrastructure is not enough
- Large-scale learning: between theory, algorithms and applications
  - Adaptivity to increased amounts of data with linear complexity
  - Robust algorithms with no hyperparameters
- Unsupervised learning: incorporating structural prior knowledge
- Semi-supervised learning
- Automatic learning of features for supervised learning
References
- F. Bach. Learning with Submodular Functions: A Convex Optimization Perspective. 2011. URL http://hal.inria.fr/hal-00645271/en.
- F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning, 4(1):1-106, 2011.
- F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Structured sparsity through convex optimization. Statistical Science, 2012. To appear.
- D. Blatt, A. O. Hero, and H. Gauchman. A convergent incremental gradient method with a constant step size. 18(1):29-51, 2008.
- R. Jenatton, G. Obozinski, and F. Bach. Structured sparse principal component analysis. Technical report, arXiv:0909.1440, 2009.
- R. Jenatton, J. Mairal, G. Obozinski, and F. Bach. Proximal methods for hierarchical sparse coding. Journal of Machine Learning Research, 12:2297-2334, 2011.
- N. Le Roux, M. Schmidt, and F. Bach. A stochastic gradient method with an exponential convergence rate for strongly-convex optimization with finite training sets. Technical report, HAL, 2012.
- J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online dictionary learning for sparse coding. In International Conference on Machine Learning (ICML), 2009.
- G. Obozinski and F. Bach. Convex relaxation of combinatorial penalties. Technical report, HAL, 2012.