Big Data - Lecture 1: Optimization reminders. S. Gadat, Toulouse, October 2014
Schedule
1. Introduction: Major issues; Examples; Mathematics
2. Convex functions: Example of convex functions; Gradient descent method
3. Constrained optimisation: Equality constraint; Inequality constraint; KKT Conditions
Introduction - Major issues in data science
Data science: extract knowledge from data for industrial or academic exploitation. This involves:
1. Signal processing (how to record and represent the data?)
2. Modelling (what is the problem, and what kind of mathematical model and answer?)
3. Statistics (how reliable are the estimation procedures?)
4. Machine learning (what kind of efficient optimization algorithm?)
5. Implementation (software needs)
6. Visualization (how can the resulting knowledge be represented?)
Taken as a whole, this sequence of questions is at the core of Artificial Intelligence, and may also be referred to as Computer Science problems. In this lecture, we will address some of these issues (those highlighted in red on the slides); practical examples will be provided each time.
Most of our motivation comes from the Big Data world, encountered in image processing, finance, genetics and many other fields where knowledge extraction is needed when facing many observations, each described by many variables. n: number of observations; p: number of variables per observation; p >> n >> O(1).
Introduction - Spam classification - Signal processing datasets
From a set of labelled messages (spam or not), build a classifier for automatic spam rejection. How to select the meaningful words? Automatic classification?
Introduction - Micro-array analysis - Biological datasets
Micro-array datasets are built from a huge number of gene expression profiles. Number of genes p: of order thousands. Number of samples n: of order hundreds. Diagnostic help: healthy or ill? How to select the meaningful genes? Automatic classification?
Introduction - Fraud detection - Industrial datasets
Set of individual electrical consumptions for clients in Medellin (Colombia). Each individual provides a monthly electrical consumption, and the electrical firm measures the whole consumption of important hubs (each formed by a set of clients). The goal is to detect possible fraud. Problems:
- Missing data: how to complete the table?
- Noise in the measurements: how does it degrade the fraud detection?
- Can we exhibit distinct monthly behaviours among the clients?
Introduction - Data completion & recommendation systems - Advertisement and e-business datasets
Recommendation problems: what kind of database? Reliable recommendations for clients? Online strategy?
Introduction - Credit scoring - Actuarial datasets
Build an indicator (Q score) from a dataset for the probability of interest in a financial product (Visa Premier credit card).
1. Define a model and a question.
2. Use a supervised classification algorithm to rank the best clients.
3. Use logistic regression to provide a score.
Introduction - What about maths?
Various mathematical fields we will talk about:
- Analysis: convex optimization, approximation theory
- Statistics: penalized procedures and their reliability
- Probabilistic methods: concentration inequalities, stochastic processes, stochastic approximations
Famous keywords: Lasso, Boosting, convex relaxation, supervised classification, Support Vector Machine, aggregation rules, gradient descent, stochastic gradient descent, sequential prediction, bandit games, minimax policies, matrix completion.
Schedule
1. Introduction: Major issues; Examples; Mathematics
2. Convex functions: Example of convex functions; Gradient descent method
3. Constrained optimisation: Equality constraint; Inequality constraint; KKT Conditions
Convex functions
We recall some background material that is necessary for a clear understanding of how some machine learning procedures work. We will cover some basic relationships between convexity, positive semidefiniteness, and local and global minimizers.
(Convex sets, convex functions) A set D is convex if for any (x1, x2) ∈ D² and all α ∈ [0, 1], x = αx1 + (1 − α)x2 ∈ D. A function f is convex if its domain D is convex and
f(αx1 + (1 − α)x2) ≤ αf(x1) + (1 − α)f(x2).
(Positive semidefinite (PSD) matrix) A p × p matrix H is PSD if for all p × 1 vectors z, we have z^t H z ≥ 0.
There is a strong link between PSD matrices and convex functions, given by the following proposition.
Proposition. A smooth C²(D) function f is convex if and only if ∇²f is PSD at every point of D.
The proof follows easily from a second-order Taylor expansion.
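The equivalence can be checked numerically on a quadratic function, whose Hessian is constant, so convexity reduces to one PSD test. This is an illustrative sketch; the matrix H below is an arbitrary choice, not an example from the lecture.

```python
import numpy as np

# Illustrative choice: for f(x) = x^t H x, the Hessian is 2H,
# so f is convex iff H is PSD.
H = np.array([[2.0, 1.0],
              [1.0, 2.0]])

def is_psd(M, tol=1e-10):
    # A symmetric matrix is PSD iff all its eigenvalues are >= 0.
    return bool(np.all(np.linalg.eigvalsh(M) >= -tol))

def f(x):
    return float(x @ H @ x)

# Spot-check the defining convexity inequality on random points.
rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=2), rng.normal(size=2)
for alpha in np.linspace(0.0, 1.0, 11):
    assert f(alpha * x1 + (1 - alpha) * x2) <= alpha * f(x1) + (1 - alpha) * f(x2) + 1e-9

print(is_psd(H))  # eigenvalues of H are 1 and 3, so this prints True
```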
Example of convex functions
- Exponential function: θ ∈ R ↦ exp(aθ), for any a ∈ R.
- Affine function: θ ∈ R^d ↦ a^t θ + b.
- Entropy function: θ ∈ R_+ ↦ θ log(θ).
- p-norm (p ≥ 1): θ ∈ R^d ↦ ‖θ‖_p := (Σ_{i=1}^d |θ_i|^p)^{1/p}.
- Quadratic form: θ ∈ R^d ↦ θ^t P θ + 2 q^t θ + r, where P is symmetric and positive semidefinite.
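A quick numerical spot-check of the convexity inequality for three of these examples in dimension 1 (the test points and the value a = −1.5 are arbitrary choices made here):

```python
import math

# Spot-check f(a*x + (1-a)*y) <= a*f(x) + (1-a)*f(y) for three listed examples.
funcs = {
    "exp(a*t), a=-1.5": lambda t: math.exp(-1.5 * t),
    "t*log(t) on t>0": lambda t: t * math.log(t),
    "|t|^p, p=3": lambda t: abs(t) ** 3,
}
x, y = 0.3, 2.0  # both positive, so the entropy example stays in its domain
for name, f in funcs.items():
    for k in range(11):
        alpha = k / 10
        lhs = f(alpha * x + (1 - alpha) * y)
        rhs = alpha * f(x) + (1 - alpha) * f(y)
        assert lhs <= rhs + 1e-12, name
print("all convexity checks passed")
```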
Why are convex functions useful?
External motivations:
- Many problems in machine learning amount to minimizing a convex criterion, and this minimization provides meaningful results for the initial statistical task.
- Many optimization problems admit a convex reformulation (SVM classification or regression, LASSO regression, ridge regression, permutation recovery, ...).
From a numerical point of view:
- Local minimizer = global minimizer. This is a powerful property, since descent methods generally rely on ∇f(x) (or something related to it), which is only local information on f.
- x is a local (hence global) minimizer of f if and only if 0 ∈ ∂f(x).
- Many fast algorithms for the optimization of convex functions exist, some of which are independent of the dimension d of the original space.
Why are convex functions powerful?
Two kinds of optimization problems:
- On the left: non-convex optimization problem; use Travelling-Salesman-type methods, with greedy exploration steps (simulated annealing, genetic algorithms).
- On the right: convex optimization problem; use local descent methods with gradients or subgradients.
(Subgradient, for nonsmooth functions) For any function f : R^d → R and any x ∈ R^d, the subgradient set ∂f(x) is the set of all vectors g such that, for all y,
f(x) − f(y) ≤ ⟨g, x − y⟩.
This set of subgradients may be empty. Fortunately, this is not the case for convex functions.
Proposition. f : R^d → R is convex if and only if ∂f(x) ≠ ∅ for every x ∈ R^d.
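A standard concrete example (not taken from the slides): f(t) = |t| is nonsmooth at 0, where the subgradient is the whole interval [−1, 1]; elsewhere it is the singleton {sign(t)}. The sketch below checks the subgradient inequality directly.

```python
# Subgradients of f(t) = |t|, returned as an interval (lo, hi).
def subgradient_interval(t):
    # For t != 0 the subgradient set is the singleton {sign(t)};
    # at t = 0 it is the whole interval [-1, 1].
    if t > 0:
        return (1.0, 1.0)
    if t < 0:
        return (-1.0, -1.0)
    return (-1.0, 1.0)

# Check the subgradient inequality f(x) - f(y) <= g * (x - y)
# for g at both ends (and the middle) of the returned interval.
for x in (-2.0, 0.0, 1.5):
    lo, hi = subgradient_interval(x)
    for g in (lo, 0.5 * (lo + hi), hi):
        for y in (-3.0, -0.1, 0.0, 0.7, 4.0):
            assert abs(x) - abs(y) <= g * (x - y) + 1e-12
print("subgradient inequality verified")
```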
Convexity and gradient: constrained case
In both constrained and unconstrained problems, descent methods are powerful with convex functions. In particular, consider a constrained problem over X ⊂ R^d. The most famous local descent method relies on
y_{t+1} = x_t − η g_t, with g_t ∈ ∂f(x_t),
x_{t+1} = Π_X(y_{t+1}),
where η > 0 is a fixed step-size parameter and Π_X denotes the projection onto X.
Theorem (Convergence of the projected gradient descent method, fixed step size). If f is convex over X with X ⊂ B(0, R) and ‖g‖ ≤ L for every subgradient g, then the choice η = R/(L√t) leads to
f( (1/t) Σ_{s=1}^t x_s ) − min_X f ≤ RL/√t.
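A minimal sketch of this projected subgradient iteration on an illustrative problem chosen here (the l1 objective, the center c, the radius R and the horizon T are all arbitrary assumptions, not data from the lecture): minimize ‖x − c‖₁ over the ball B(0, R).

```python
import numpy as np

# Illustrative problem data (all chosen here).
c = np.array([2.0, -2.0])
R = 1.0
T = 2000
L = np.sqrt(2.0)               # Lipschitz constant of x -> ||x - c||_1 in R^2
eta = R / (L * np.sqrt(T))     # fixed step size from the theorem

def subgrad(x):
    return np.sign(x - c)      # a subgradient of ||x - c||_1

def project(y):
    # Euclidean projection onto the ball B(0, R).
    n = np.linalg.norm(y)
    return y if n <= R else (R / n) * y

x = np.zeros(2)
avg = np.zeros(2)
for _ in range(T):
    x = project(x - eta * subgrad(x))
    avg += x
avg /= T

# Here the constrained minimizer is simply the projection of c onto B(0, R).
x_star = project(c)
f = lambda z: float(np.abs(z - c).sum())
assert f(avg) <= f(x_star) + R * L / np.sqrt(T) + 1e-6  # the theorem's bound
print(avg, f(avg))
```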
Convexity and gradient: smooth unconstrained case
Results can be seriously improved for smooth functions with bounded second derivatives. f is β-smooth if ∇f is β-Lipschitz:
‖∇f(x) − ∇f(y)‖ ≤ β‖x − y‖.
Standard gradient descent over R^d becomes
x_{t+1} = x_t − η ∇f(x_t).
Theorem (Convergence of the gradient descent method, β-smooth function). If f is a convex and β-smooth function, then the choice η = 1/β leads to
f( (1/t) Σ_{s=1}^t x_s ) − min f ≤ 2β ‖x_1 − x*‖² / t,
where x* is a minimizer of f.
Remark. Note that these two results do not depend on the dimension d of the state space. The last result can be extended to the constrained situation.
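The 1/β step can be tried on a simple quadratic (the matrix A and the starting point are arbitrary choices made here). Since min f = 0 at x* = 0, the O(1/t) scale of the theorem's bound is easy to sanity-check.

```python
import numpy as np

# f(x) = 0.5 * x^t A x is beta-smooth with beta = largest eigenvalue of A.
A = np.array([[3.0, 0.0],
              [0.0, 1.0]])
beta = float(np.linalg.eigvalsh(A).max())
eta = 1.0 / beta

f = lambda x: 0.5 * float(x @ A @ x)
grad = lambda x: A @ x

x1 = np.array([4.0, -2.0])
x = x1.copy()
vals = [f(x)]
for _ in range(200):
    x = x - eta * grad(x)
    vals.append(f(x))

# Function values decrease monotonically, and the best value is far below the
# 2 * beta * ||x1 - x*||^2 / t scale appearing in the theorem (x* = 0 here).
assert all(b <= a + 1e-12 for a, b in zip(vals, vals[1:]))
assert min(vals) <= 2 * beta * float(x1 @ x1) / len(vals)
```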
Schedule
1. Introduction: Major issues; Examples; Mathematics
2. Convex functions: Example of convex functions; Gradient descent method
3. Constrained optimisation: Equality constraint; Inequality constraint; KKT Conditions
Constrained optimisation
Elements of the problem:
- θ: unknown vector of R^d to be recovered
- J : R^d → R: function to be minimized
- f_i and g_j: differentiable functions defining a set of constraints
Formulation of the problem:
min_{θ ∈ R^d} J(θ) such that f_i(θ) = 0, i = 1, ..., n, and g_j(θ) ≤ 0, j = 1, ..., m.
Set of admissible vectors:
Ω := { θ ∈ R^d : f_i(θ) = 0 ∀i and g_j(θ) ≤ 0 ∀j }
Constrained optimisation: example
Typical situation: Ω is the circle of radius 2; the optimal solution is θ* = (−1, 1)^t with J(θ*) = 2.
Important restriction: we will restrict our study to convex functions J.
Constrained convex optimisation
A constrained optimization problem is convex if:
- J is a convex function,
- the f_i are affine functions,
- the g_j are convex functions.
Case of a unique equality constraint
min_θ J(θ) such that a^t θ − b = 0
- Descent direction h: ∇J(θ)^t h < 0.
- Admissible direction h: a^t(θ + h) − b = 0, i.e. a^t h = 0.
Optimality: θ* is optimal if there is no admissible descent direction starting from θ*. The only possible case is when ∇J(θ*) and a are linearly dependent: there exists λ ∈ R such that ∇J(θ*) + λa = 0.
In this situation:
∇J(θ) = (2θ_1 + θ_2 − 2, θ_1 + 2θ_2 + 2)^t and a = (1, 1)^t.
Hence, we look for θ such that ∇J(θ) is proportional to a. Computations lead to θ_1 = θ_2, and the optimal value is reached for θ_1 = 1/2 (with J(θ*) = 15/4).
Lagrangian function
min_θ J(θ) such that f(θ) := a^t θ − b = 0
We have seen the important role of the scalar value λ above.
(Lagrangian function) L(λ, θ) = J(θ) + λ f(θ), where λ is the Lagrange multiplier.
The optimal choice (θ*, λ*) corresponds to ∇_θ L(λ*, θ*) = 0 and ∇_λ L(λ*, θ*) = 0.
Argument: θ* is optimal if there is no admissible descent direction h. Hence ∇J and ∇f are linearly dependent. As a consequence, there exists λ* such that
∇_θ L(λ*, θ*) = ∇J(θ*) + λ* ∇f(θ*) = 0 (dual equation).
Since θ* must be admissible, we have
∇_λ L(λ*, θ*) = f(θ*) = 0 (primal equation).
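The two equations can be solved in closed form on a small illustrative problem (the objective J, the vector a and the scalar b below are choices made here, not the slide's example): minimize J(θ) = ½‖θ‖² subject to a^t θ = b, for which stationarity gives θ = −λa.

```python
import numpy as np

# Illustrative problem: minimize 0.5 * ||theta||^2 s.t. a^t theta - b = 0.
a = np.array([1.0, 1.0])
b = 3.0

# Dual equation: grad J(theta) + lam * a = theta + lam * a = 0  =>  theta = -lam * a.
# Primal equation: a^t theta = b  =>  -lam * (a^t a) = b.
lam = -b / float(a @ a)
theta = -lam * a

assert np.isclose(a @ theta, b)            # primal equation (admissibility)
assert np.allclose(theta + lam * a, 0.0)   # dual equation (stationarity)
print(theta)  # (1.5, 1.5): the point of the constraint line closest to the origin
```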
Case of a unique inequality constraint
min_θ J(θ) such that g(θ) ≤ 0
- Descent direction h: ∇J(θ)^t h < 0.
- Admissible direction h: ∇g(θ)^t h ≤ 0 guarantees that g(θ + αh) is decreasing in α.
Optimality: θ* is optimal if there is no admissible descent direction starting from θ*. The only possible case is when ∇J(θ*) and ∇g(θ*) are linearly dependent and opposite: there exists µ ≥ 0 such that ∇J(θ*) = −µ ∇g(θ*).
We can check that θ* = (−1, 1).
We consider the minimization problem:
min_θ J(θ) such that g_j(θ) ≤ 0, j = 1, ..., m, and f_i(θ) = 0, i = 1, ..., n.
(Lagrangian function) We associate to this problem the Lagrange multipliers (λ, µ) = (λ_1, ..., λ_n, µ_1, ..., µ_m) and define
L(θ, λ, µ) = J(θ) + Σ_{i=1}^n λ_i f_i(θ) + Σ_{j=1}^m µ_j g_j(θ).
θ: primal variables; (λ, µ): dual variables.
KKT Conditions
(KKT conditions) If J, f and g are smooth, we define the Karush-Kuhn-Tucker (KKT) conditions as
- Stationarity: ∇_θ L(λ, µ, θ) = 0.
- Primal admissibility: f(θ) = 0 and g(θ) ≤ 0.
- Dual admissibility: µ_j ≥ 0, j = 1, ..., m.
- Complementary slackness: µ_j g_j(θ) = 0, j = 1, ..., m.
Theorem. A convex minimization problem of J under convex constraints f and g has a solution θ* if and only if there exist λ* and µ* such that the KKT conditions hold.
Example: J(θ) = (1/2)‖θ‖²_2 such that θ_1 − 2θ_2 + 2 ≤ 0.
We get L(θ, µ) = ‖θ‖²_2/2 + µ(θ_1 − 2θ_2 + 2) with µ ≥ 0.
Stationarity: (θ_1 + µ, θ_2 − 2µ) = 0, so θ = (−µ, 2µ), i.e. θ_2 = −2θ_1 with θ_2 ≥ 0. Since µ > 0 at the optimum, the constraint is active, which gives µ = 2/5. We deduce θ* = (−2/5, 4/5).
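The example above can be verified mechanically by plugging the candidate pair (θ*, µ*) into each KKT condition:

```python
import numpy as np

# Slide's example: J(theta) = 0.5*||theta||^2, g(theta) = theta_1 - 2*theta_2 + 2.
mu = 2.0 / 5.0
theta = np.array([-mu, 2.0 * mu])     # from stationarity: theta = (-mu, 2*mu)

grad_J = theta                        # gradient of 0.5 * ||theta||^2
grad_g = np.array([1.0, -2.0])        # gradient of the constraint g

g_val = theta[0] - 2.0 * theta[1] + 2.0
assert np.allclose(grad_J + mu * grad_g, 0.0)  # stationarity
assert np.isclose(g_val, 0.0)                  # primal admissibility (constraint active)
assert mu >= 0.0                               # dual admissibility
assert np.isclose(mu * g_val, 0.0)             # complementary slackness
print(theta)
```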
Dual problems (1)
We introduce the dual function: L̄(λ, µ) = min_θ L(θ, λ, µ).
We have the following important result.
Theorem. Denote by p* = min { J(θ) : f(θ) = 0, g(θ) ≤ 0 } the optimal value of the constrained problem. Then L̄(λ, µ) ≤ p*.
Remark: the dual function L̄ is lower than p* for any (λ, µ) ∈ R^n × R^m_+. We aim to make this lower bound as close as possible to p*; hence the idea of maximizing the function L̄ with respect to (λ, µ).
(Dual problem) max_{λ ∈ R^n, µ ∈ R^m_+} L̄(λ, µ).
For fixed θ, L(θ, λ, µ) is an affine function of (λ, µ); hence L̄, as a pointwise minimum of affine functions, is concave. The dual problem is therefore a convex optimization problem, and it is almost unconstrained.
Dual problems (2)
- Dual problems are often easier than primal ones (most of the constraints disappear).
- Dual problems are equivalent to primal ones: maximization of the dual corresponds to minimization of the primal (not shown in this lecture).
- Dual solutions permit recovery of primal ones through the KKT conditions (Lagrange multipliers).
Example:
- Lagrangian: L(θ, µ) = (θ_1² + θ_2²)/2 + µ(θ_1 − 2θ_2 + 2).
- Dual function: L̄(µ) = min_θ L(θ, µ) = −(5/2)µ² + 2µ.
- Dual solution: max L̄(µ) such that µ ≥ 0 gives µ* = 2/5.
- Primal solution: KKT ⇒ θ* = (−µ*, 2µ*) = (−2/5, 4/5).
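The same example, solved through the dual as on the slide; the grid search is just one simple way to maximize the one-dimensional concave dual function.

```python
import numpy as np

# Dual function of the slide's example: L(mu) = -(5/2)*mu^2 + 2*mu, for mu >= 0.
dual = lambda mu: -2.5 * mu ** 2 + 2.0 * mu

# Maximize over mu >= 0 by grid search (step size 1e-4 on [0, 2]).
grid = np.linspace(0.0, 2.0, 20001)
mu_star = float(grid[np.argmax(dual(grid))])

# Recover the primal solution through the KKT stationarity relation theta = (-mu, 2*mu).
theta_star = np.array([-mu_star, 2.0 * mu_star])
primal_value = 0.5 * float(theta_star @ theta_star)

assert np.isclose(mu_star, 0.4)
assert np.allclose(theta_star, [-0.4, 0.8])
assert np.isclose(dual(mu_star), primal_value)  # dual and primal values coincide here
```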
Take home message
- Big Data problems arise in a large variety of fields.
- They are difficult for computational reasons (and also for statistical ones, see later).
- Many Big Data problems can be translated into the optimization of a convex problem.
- Efficient algorithms are available to optimize them, with guarantees independent of the dimension of the underlying space.
- Primal-dual formulations are important to overcome some constraints in the optimization.
- Numerical convex solvers are widely and freely distributed.