BIG DATA PROBLEMS AND LARGE-SCALE OPTIMIZATION: A DISTRIBUTED ALGORITHM FOR MATRIX FACTORIZATION

Ş. İlker Birbil (Sabancı University)
Ali Taylan Cemgil (1), Hazal Koptagel (1), Figen Öztoprak (2), Umut Şimşekli (1)
1: Boğaziçi University, 2: Bilgi University

Nottingham University, March 2015

LARGE-SCALE OPTIMIZATION AND MACHINE LEARNING

Outline:
- Introduction
- Exploiting the Structure
- Need for Parallel Algorithms

(F. Öztoprak)

DATA SCIENCE

GRADUATE COURSES

NONLINEAR OPTIMIZATION

Typically, a nonlinear programming (NLP) problem is defined as

    minimize    f(x),  x ∈ R^n
    subject to  c_i(x) = 0,  i ∈ E,        (1)
                c_i(x) ≤ 0,  i ∈ I,

where f: R^n → R is the objective function and c_i: R^n → R, i ∈ E ∪ I, are the constraint functions. These functions are continuous but not necessarily linear; at least one of them is nonlinear. Compactly, this covers optimization problems of the form min_{x ∈ X} f(x) over a feasible set X = {x ∈ R^n : g(x) ≤ 0} with g: R^n → R^m.
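As a concrete illustration of form (1), the following minimal sketch solves a toy NLP with SciPy's SLSQP solver; the objective and constraints are made-up examples, not taken from the talk.

```python
# Toy instance of (1): one equality and one inequality constraint.
# Note: SciPy's "ineq" convention is fun(x) >= 0, so c(x) <= 0 is
# passed as -c(x) >= 0.
import numpy as np
from scipy.optimize import minimize

f = lambda x: (x[0] - 1.0) ** 2 + (x[1] - 2.5) ** 2  # objective f(x)

constraints = [
    {"type": "eq",   "fun": lambda x: x[0] + x[1] - 3.0},  # c_1(x) = 0
    {"type": "ineq", "fun": lambda x: -(x[0] - 2.0)},      # c_2(x) = x_0 - 2 <= 0
]

res = minimize(f, x0=np.zeros(2), method="SLSQP", constraints=constraints)
print(res.x, res.fun)  # optimum lies on the line x_0 + x_1 = 3
```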

ROLE OF NONLINEAR OPTIMIZATION

[Figure: "Core NLP" at the center of a map connecting neighboring fields, methodological branches, and application areas]
- Neighboring fields: Operations Research, Computer Science, Applied Mathematics, Statistics, Machine Learning
- Branches: Convex Optimization, Global Optimization, Derivative-Free Optimization, Nonlinear Stochastic Programming, Large-Scale Optimization, Mixed-Integer NLP, PDE-Constrained Optimization
- Applications: Molecular Biology (protein folding), Engineering Design (machining), Finance (risk management), Machine Learning (image recovery), Production (chemical complex design), Health (cancer treatment)

OUR RESEARCH GROUP

- Three faculty members, four PhD students, three MSc students
- (Coupled) tensor and matrix factorization
- Distributed and parallel algorithms: Bayesian inference, nonlinear optimization

[Figure: distributed-memory architecture with three processors, each holding four cores and its own memory]

LINK PREDICTION VIA TENSOR FACTORIZATION

- X_1(i, j, k): whether user i visits location j and performs activity k
- X_2(i, m): frequency of user i visiting location m
- X_3(j, n): points of interest for location j

TENSOR FACTORIZATION

- Tensor: a multidimensional array X_{i,j,k,...}
- Tensor factorizations extend matrix factorizations to higher-order tensors
- They are used to extract the underlying factors in higher-order data sets

    X(i, j, k) ≈ Σ_r Z_1(i, r) Z_2(j, r) Z_3(k, r)

(Cemgil, Probabilistic Latent Tensor Factorisation)
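To make the model concrete, here is a small numpy sketch of the factorization above; the dimensions and the rank R are illustrative assumptions.

```python
# CP-style model from the slide: X(i, j, k) ≈ Σ_r Z1(i, r) Z2(j, r) Z3(k, r).
import numpy as np

I, J, K, R = 4, 5, 6, 3  # assumed sizes and rank
rng = np.random.default_rng(0)
Z1, Z2, Z3 = rng.random((I, R)), rng.random((J, R)), rng.random((K, R))

# einsum sums over the shared factor index r for every entry (i, j, k).
X_hat = np.einsum("ir,jr,kr->ijk", Z1, Z2, Z3)
print(X_hat.shape)  # (4, 5, 6)
```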

MATRIX FACTORIZATION

    X(i, j) ≈ Σ_r Z_1(i, r) Z_2(r, j),   i.e.,   X ≈ Z_1 Z_2

An inverse problem: estimate Z_1 and Z_2 given the data matrix X, assuming X ≈ Z_1 Z_2. In general, one minimizes a suitable error function D subject to constraints (e.g., nonnegativity) or with a regularizer R:

    (Z_1, Z_2) = argmin_{Z_1, Z_2} D(X ‖ Z_1 Z_2) + R(Z_1, Z_2).

The overall optimization problem is

    minimize    ‖X − Z_1 Z_2‖_F²
    subject to  Z_1, Z_2 ∈ Z,

where Z is the feasible region. When Z is the nonnegative orthant, we have the nonnegative matrix factorization problem.
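As a baseline for comparison (not the algorithm of this talk), here is a minimal sketch of the nonnegative case via alternating projected gradient steps; the fixed step size, iteration count, and random initialization are all illustrative assumptions.

```python
# min ||X - Z1 Z2||_F^2 s.t. Z1, Z2 >= 0, by alternating projected gradient.
import numpy as np

def nmf_pgd(X, r, steps=500, lr=1e-3, seed=0):
    rng = np.random.default_rng(seed)
    m, n = X.shape
    Z1, Z2 = rng.random((m, r)), rng.random((r, n))
    for _ in range(steps):
        E = Z1 @ Z2 - X                              # residual
        Z1 = np.maximum(Z1 - lr * (E @ Z2.T), 0.0)   # grad of 0.5*||.||^2, then project
        E = Z1 @ Z2 - X
        Z2 = np.maximum(Z2 - lr * (Z1.T @ E), 0.0)
    return Z1, Z2

X = np.abs(np.random.default_rng(1).random((20, 15)))
Z1, Z2 = nmf_pgd(X, r=4)
print(np.linalg.norm(X - Z1 @ Z2, "fro"))
```

A fixed step size keeps the sketch short; in practice a line search or adaptive step would be used.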

MOVIE RECOMMENDATION

    minimize    ‖X − Z_1 Z_2‖_F²
    subject to  Z_1 ≥ 0, Z_2 ≥ 0
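In recommendation data such as MovieLens, most entries of X are unobserved; a common variant (an assumption here, not stated on the slide) measures the Frobenius error only over observed entries via a binary mask.

```python
# 0.5 * || M ⊙ (X - Z1 Z2) ||_F^2, where M[i, j] = 1 iff rating (i, j) is observed.
import numpy as np

def masked_objective(X, M, Z1, Z2):
    R = M * (X - Z1 @ Z2)   # zero out unobserved entries
    return 0.5 * np.sum(R ** 2)
```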

DISTRIBUTED IMPLEMENTATION

Z_1 is partitioned into row blocks Z_1(1,:), Z_1(2,:), Z_1(3,:), Z_2 into column blocks Z_2(:,1), Z_2(:,2), Z_2(:,3), and X into the corresponding blocks X_ij.

- Time slot 1: perform X_12 ≈ Z_1(1,:) Z_2(:,2) on P1, X_23 ≈ Z_1(2,:) Z_2(:,3) on P2, and X_31 ≈ Z_1(3,:) Z_2(:,1) on P3, by employing IPA.
- Time slot 2: X_11 on P1, X_22 on P2, X_33 on P3.
- Time slot 3: X_13 on P1, X_21 on P2, X_32 on P3.
- Time slot 4: ...
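The schedule above can be generated mechanically: in each time slot, processor P_i takes block X_{i,j} with j shifted cyclically, so no two processors touch the same row block of Z_1 or column block of Z_2. The slot-to-offset order (1, 0, 2) below matches the figure and is otherwise an arbitrary choice.

```python
# Cyclic block schedule for an n x n block partition (n = 3 on the slide).
def schedule(n=3, offsets=(1, 0, 2)):
    for slot, s in enumerate(offsets, start=1):
        # processor i updates block (i, (i + s) mod n) in this slot
        yield slot, [(i, (i + s) % n) for i in range(n)]

for slot, blocks in schedule():
    # print 1-based indices to match the slide's notation
    print(f"time slot {slot}:", [(i + 1, j + 1) for i, j in blocks])
# time slot 1: [(1, 2), (2, 3), (3, 1)]
# time slot 2: [(1, 1), (2, 2), (3, 3)]
# time slot 3: [(1, 3), (2, 1), (3, 2)]
```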

REFORMULATION

    minimize    ‖X − Z_1 Z_2‖_F²
    subject to  Z_1, Z_2 ∈ Z

[Figure: X partitioned into blocks numbered 1-6; stacking the unknown factors Z_1 and Z_2 yields a single variable vector z]

GENERIC PROBLEM

    minimize    Σ_{i ∈ {1,...,m}} f_i(z)
    subject to  z ∈ ζ
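A sketch of how the Frobenius objective splits into the component functions f_i: each f_i scores one block of X against the matching row block of Z_1 and column block of Z_2, and when the blocks partition X, the sum of the f_i recovers ‖X − Z_1 Z_2‖_F². The block partition chosen here is an illustrative assumption.

```python
import numpy as np

def component_objectives(X, row_splits, col_splits):
    """Return one f_i per block; f_i(Z1, Z2) = ||X_block - Z1_rows Z2_cols||_F^2."""
    fs = []
    for rows in row_splits:
        for cols in col_splits:
            def f(Z1, Z2, rows=rows, cols=cols):  # bind the current block
                B = X[np.ix_(rows, cols)]
                return np.linalg.norm(B - Z1[rows] @ Z2[:, cols], "fro") ** 2
            fs.append(f)
    return fs
```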

DISTRIBUTED OPTIMIZATION

[Figure: the block schedule of the previous slide, shown alongside the generic problem with the blocks of X numbered 1-6]

- At each time slot k, we solve a subset S_k of the component functions f_i, i ∈ {1, 2, ..., m}.
- We make sure that each data block is visited after c passes (c = 3 in the figure).

INCREMENTAL QUASI-NEWTON ALGORITHM

- Unlike gradient-based methods, the proposed algorithm uses second-order information through a Hessian approximation (the L-BFGS quasi-Newton method).
- The proposed algorithm visits each subset of component functions in the same order (incremental and deterministic).
- We do not assume convexity of the objective function (so matrix factorization problems can be solved).

CORE STEP
Solve a quadratic approximation of the (partial) objective function:

    Q_k^t(z) = (z − z_k)^T ∇_{S_k} f(z_k) + ½ (z − z_k)^T H_t (z − z_k) + ½ β_t ‖z − z_k‖².
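For the unconstrained case ζ = R^n, setting ∇Q_k^t(z) = 0 gives the closed-form inner update z_{k+1} = z_k − (H_t + β_t I)^{-1} ∇_{S_k} f(z_k). A minimal sketch, with the Hessian approximation H_t left abstract:

```python
import numpy as np

def core_step(z, grad_Sk, H_t, beta_t):
    """One inner update: minimize the quadratic model Q over z (ζ = R^n)."""
    n = z.size
    return z - np.linalg.solve(H_t + beta_t * np.eye(n), grad_Sk)
```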

INCREMENTAL QUASI-NEWTON ALGORITHM (CONT'D)

    Q_k^t(z) = (z − z_k)^T ∇_{S_k} f(z_k) + ½ (z − z_k)^T H_t (z − z_k) + ½ β_t ‖z − z_k‖².

Algorithm 1: HAMSI
    input: y_0, β_1
    for t = 0, 1, 2, ... do
        z_1 = y_t
        Compute H_t
        for k = 1, 2, ..., c do
            Choose a subset S_k ⊆ {1, ..., m}
            Compute ∇_{S_k} f(z_k)
            z_{k+1} = argmin_{z ∈ ζ} Q_k^t(z)
        end
        y_{t+1} = z_{c+1}
        Set β_{t+1} ≥ β_t
    end
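A runnable sketch of Algorithm 1. For brevity, H_t is a scaled identity rather than an L-BFGS approximation, ζ = R^n, and the subsets S_k are singletons chosen cyclically; these and the step parameters are simplifying assumptions on top of the slide's pseudocode.

```python
import numpy as np

def hamsi(grads, z0, c, outer_iters=100, beta0=1.0, beta_growth=1.05, gamma=1.0):
    """grads[i](z) returns ∇f_i(z); len(grads) = m component functions."""
    m, beta = len(grads), beta0
    y = z0.copy()
    for t in range(outer_iters):
        z = y.copy()
        H = gamma * np.eye(z.size)              # "Compute H_t" (identity stand-in)
        for k in range(c):
            S_k = [(t * c + k) % m]             # deterministic, incremental subsets
            g = sum(grads[i](z) for i in S_k)   # partial gradient ∇_{S_k} f(z)
            z = z - np.linalg.solve(H + beta * np.eye(z.size), g)  # core step
        y = z
        beta *= beta_growth                     # enforce β_{t+1} >= β_t
    return y
```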

CONVERGENCE ANALYSIS (ζ = R^n)

ASSUMPTIONS
1. The Hessians of the component functions are uniformly bounded: ‖Σ_{i ∈ S_k} ∇² f_i(y_t)‖ ≤ L_t ≤ L for all S_k and y_t.
2. The smallest eigenvalue of (H_t + β_t I) is bounded away from zero: U_t := (H_t + β_t I)^{-1} satisfies ‖U_t‖ ≤ M_t for all t.
3. The gradient norms are uniformly bounded: ‖∇_{S_k} f(y_t)‖ ≤ C for all S_k and y_t.

CONVERGENCE ANALYSIS (CONT'D)

LEMMA. At each outer iteration t of Algorithm 1 and for k = 1, ..., c, we have

    δ_k = ‖∇_{S_k} f(z_k) − ∇_{S_k} f(y_t)‖ ≤ L_t M_t Σ_{j=1}^{k−1} (1 + L_t M_t)^{k−1−j} ‖∇_{S_j} f(y_t)‖.

THEOREM. Consider the iterates y_t produced by Algorithm 1. Then, all accumulation points of {y_t} are stationary points of the generic problem.

COROLLARY. Algorithm 1 solves the matrix factorization problem.

PRELIMINARY EXPERIMENTS - SETUP

- Linux cluster with 15 nodes
- Each node has 8 Intel Xeon 2.50 GHz cores and 16 GB RAM
- This setting allows up to 120 tasks to run in parallel
- The MovieLens (1M) dataset is used for our preliminary experiments

PRELIMINARY EXPERIMENTS

[FIGURE: Objective function values]

PRELIMINARY EXPERIMENTS (CONT'D)

[FIGURE: Root mean square error]


CONCLUDING REMARKS

SUMMARY
- A promising research path at the intersection of operations research and computer science
- A new distributed and parallel implementation for matrix factorization
- A generic analysis that could be used to show convergence of other algorithms

FUTURE RESEARCH
- Extensive computational study
- A stochastic version of the proposed algorithm
- Quasi-Newton-based Bayesian inference