# Bilinear Prediction Using Low-Rank Models

Save this PDF as:

Size: px
Start display at page:

## Transcription

1 Bilinear Prediction Using Low-Rank Models Inderjit S. Dhillon Dept of Computer Science UT Austin 26th International Conference on Algorithmic Learning Theory Banff, Canada Oct 6, 2015 Joint work with C-J. Hsieh, P. Jain, N. Natarajan, H. Yu and K. Zhong

2 Outline Multi-Target Prediction Features on Targets: Bilinear Prediction Inductive Matrix Completion 1 Algorithms 2 Positive-Unlabeled Matrix Completion 3 Recovery Guarantees Experimental Results

3 Sample Prediction Problems Predicting stock prices Predicting risk factors in healthcare

4 Regression Real-valued responses (target) t Predict response for given input data (features) a

5 Linear Regression Estimate target by a linear function of given data a, i.e. t ˆt = a T x.

6 Linear Regression: Least Squares Choose x that minimizes J x = 1 2 n (a T i x t i ) 2 i=1 Closed-form solution: x = (A T A) 1 A T t.

7 Prediction Problems: Classification Spam detection Character Recognition

8 Binary Classification Categorical responses (target) t Predict response for given input data (features) a Linear methods decision boundary is a linear surface or hyperplane

9 Linear Methods for Prediction Problems Regression: Ridge Regression: J x = 1 2 n i=1 (at i x t i ) 2 + λ x 2 2. Lasso: J x = 1 2 n i=1 (at i x t i ) 2 + λ x 1. Classification: Linear Support Vector Machines J x = 1 2 n max(0, 1 t i a T i x) + λ x 2 2. i=1 Logistic Regression J x = 1 2 n log(1 + exp( t i a T i x)) + λ x 2 2. i=1

10 Linear Prediction

11 Linear Prediction

12 Multi-Target Prediction

13 Modern Prediction Problems in Machine Learning Ad-word Recommendation

14 Modern Prediction Problems in Machine Learning Ad-word Recommendation geico auto insurance geico car insurance car insurance geico insurance need cheap auto insurance geico com car insurance coupon code

15 Modern Prediction Problems in Machine Learning Wikipedia Tag Recommendation Learning in computer vision Machine learning Learning Cybernetics.

16 Modern Prediction Problems in Machine Learning Predicting causal disease genes

17 Prediction with Multiple Targets In many domains, goal is to simultaneously predict multiple target variables Multi-target regression: targets are real-valued Multi-label classification:targets are binary

18 Prediction with Multiple Targets Applications Bid word recommendation Tag recommendation Disease-gene linkage prediction Medical diagnoses Ecological modeling...

19 Prediction with Multiple Targets Input data a i is associated with m targets, t i = (t (1) i, t (2) i,..., t (m) i )

20 Multi-target Linear Prediction Basic model: Treat targets independently Estimate regression coefficients x j for each target j

21 Multi-target Linear Prediction Assume targets t (j) are independent Linear predictive model: t i a T i X

22 Multi-target Linear Prediction Assume targets t (j) are independent Linear predictive model: t i a T i X Multi-target regression problem has a closed-form solution: V A Σ 1 A U A T = arg min X T AX 2 F where A = U A Σ A V T A is the thin SVD of A

23 Multi-target Linear Prediction Assume targets t (j) are independent Linear predictive model: t i a T i X Multi-target regression problem has a closed-form solution: V A Σ 1 A U A T = arg min X T AX 2 F where A = U A Σ A V T A is the thin SVD of A In multi-label classification: Binary Relevance (independent binary classifier for each label)

24 Multi-target Linear Prediction: Low-rank Model Exploit correlations between targets T, where T AX Reduced-Rank Regression [A.J. Izenman, 1974] model the coefficient matrix X as low-rank A. J. Izenman. Reduced-rank regression for the multivariate linear model. Journal of Multivariate Analysis 5.2 (1975):

25 Multi-target Linear Prediction: Low-rank Model X is rank-k Linear predictive model: t i a T i X

26 Multi-target Linear Prediction: Low-rank Model X is rank-k Linear predictive model: t i a T i X Low-rank multi-target regression problem has a closed-form solution: X = = min X :rank(x ) k { T AX 2 F V A Σ 1 A U A T k if A is full row rank, V A Σ 1 A M k otherwise, where A = U A Σ A VA T is the thin SVD of A, M = U A T, and T k, M k are the best rank-k approximations of T and M respectively.

27 Modern Challenges

28 Multi-target Prediction with Missing Values In many applications, several observations (targets) may be missing E.g. Recommending tags for images and wikipedia articles

29 Modern Prediction Problems in Machine Learning Ad-word Recommendation geico auto insurance geico car insurance car insurance geico insurance need cheap auto insurance geico com car insurance coupon code

30 Multi-target Prediction with Missing Values

31 Multi-target Prediction with Missing Values Low-rank model: t i = a T i X where X is low-rank

32 Canonical Correlation Analysis

33 Bilinear Prediction

34 Bilinear Prediction Augment multi-target prediction with features on targets as well Motivated by modern applications of machine learning bioinformatics, auto-tagging articles Need to model dyadic or pairwise interactions Move from linear models to bilinear models linear in input features as well as target features

35 Bilinear Prediction

36 Bilinear Prediction

37 Bilinear Prediction Bilinear predictive model: T ij a T i X b j

38 Bilinear Prediction Bilinear predictive model: T ij a T i X b j Corresponding regression problem has a closed-form solution: V A Σ 1 A U A TU BΣ 1 B V T B = arg min X T AXB 2 F where A = U A Σ A V A, B = U BΣ B V B are the thin SVDs of A and B

39 Bilinear Prediction: Low-rank Model X is rank-k Bilinear predictive model: T ij a T i X b j

40 Bilinear Prediction: Low-rank Model X is rank-k Bilinear predictive model: T ij a T i X b j Corresponding regression problem has a closed-form solution: X = = min T AXB 2 F X :rank(x ) k { V A Σ 1 A U A T ku B Σ 1 B V B T V A Σ 1 A M kσ 1 B V B T if A, B are full row rank, otherwise, where A = U A Σ A VA, B = U BΣ B VB are the thin SVDs of A and B, M = UA TU B, and T k, M k are the best rank-k approximations of T and M

41 Modern Challenges in Multi-Target Prediction Millions of targets Correlations among targets Missing values

42 Modern Challenges in Multi-Target Prediction Millions of targets (Scalable) Correlations among targets (Low-rank) Missing values (Inductive Matrix Completion)

43 Bilinear Prediction with Missing Values

44 Matrix Completion Missing Value Estimation Problem Matrix Completion: Recover a low-rank matrix from observed entries Matrix Completion: exact recovery requires O(kn log 2 (n)) samples, under the assumptions of: 1 Uniform sampling 2 Incoherence

45 Inductive Matrix Completion Inductive Matrix Completion: Bilinear low-rank prediction with missing values Degrees of freedom in X are O(kd) Can we get better sample complexity (than O(kn))?

46 Algorithm 1: Convex Relaxation 1 Nuclear-norm Minimization: min X s.t. a T i X b j = T ij, (i, j) Ω Computationally expensive Sample complexity for exact recovery: O(kd log d log n) Conditions for exact recovery: C1. Incoherence on A, B. C2. Incoherence on AU and BV,where X = U Σ V T is the SVD of the ground truth X C1 and C2 are satisfied with high probability when A, B are Gaussian

47 Algorithm 1: Convex Relaxation Theorem (Recovery Guarantees for Nuclear-norm Minimization) Let X = U Σ V T R d d be the SVD of X with rank k. Assume A, B are orthonormal matrices w.l.o.g., satisfying the incoherence conditions. Then if Ω is uniformly observed with Ω O(kd log d log n), the solution of nuclear-norm minimization problem is unique and equal to X with high probability. The incoherence conditions are C1. max a i 2 2 µd i [n] n, max j [n] C2. max U T a i 2 2 µ0k i [n] b j 2 2 µd n n, max j [n] V T b j 2 2 µ0k n

48 Algorithm 2: Alternating Least Squares Alternating Least Squares (ALS): min Y R d 1 k Z R d 2 k (i,j) Ω Non-convex optimization Alternately minimize w.r.t. Y and Z (a T i YZ T b j T ij ) 2

49 Algorithm 2: Alternating Least Squares Computational complexity of ALS. At h-th iteration, fixing Y h, solve the least squares problem for Z h+1 : (ã T i Zh+1b T j )b j ã T i = T ij b j ã T i (i,j) Ω (i,j) Ω where ã i = Yh T a i. Similarly solve for Y h when fixing Z h. 1 Closed form: O( Ω k 2 d (nnz(a) + nnz(b))/n + k 3 d 3 ). 2 Vanilla conjugate gradient: O( Ω k (nnz(a) + nnz(b))/n) per iteration. 3 Exploit the structure for conjugate gradient: (ã T i Z T b j )b j ã T i = B T DÃ (i,j) Ω where D is a sparse matrix with D ji = ã T i Z T b j, (i, j) Ω, and Ã = AY h. Only O((nnz(A) + nnz(b) + Ω ) k) per iteration.

50 Algorithm 2: Alternating Least Squares Theorem (Convergence Guarantees for ALS ) Let X be a rank-k matrix with condition number β, and T = AX B T. Assume A, B are orthogonal w.l.o.g. and satisfy the incoherence conditions. Then if Ω is uniformly sampled with Ω O(k 4 β 2 d log d), then after H iterations of ALS, Y H Z T H+1 X 2 ɛ, where H = O(log( X F /ɛ)). The incoherence conditions are: C1. max a i 2 2 µd i [n] C2. n, max j [n] max Yh T a i 2 2 µ0k i [n] b j 2 2 µd n n, max j [n] for all the Y h s and Z h s generated from ALS. Z T h b j 2 2 µ0k n,

51 Algorithm 2: Alternating Least Squares Proof sketch for ALS Consider the case when the rank k = 1: (a T i yz T b j T ij ) 2 min y R d 1,z R d 2 (i,j) Ω

52 Algorithm 2: Alternating Least Squares Proof sketch for rank-1 ALS min y R d 1,z R d 2 (i,j) Ω (a T i yz T b j T ij ) 2 (a) Let X = σ y z T be the thin SVD of X and assume A and B are orthogonal w.l.o.g. (b) In the absence of missing values, ALS = Power method. Ay h z T B T T 2 F z = 2B T (Bzy T h A T T T )Ay h = 2(z y h 2 B T T T Ay h ) z h+1 (A T TB) T y h ; normalize z h+1 y h+1 (A T TB)z h+1 ; normalize y h+1 Note that A T TB = A T AX B T B = X and the power method converges to the optimal.

53 Algorithm 2: Alternating Least Squares Proof sketch for rank-1 ALS min y R d 1,z R d 2 (i,j) Ω (a T i yz T b j T ij ) 2 (c) With missing values, ALS is a variant of power method with noise in each iteration z h+1 QR( X T y } {{ } h σ N 1 ((y T y h )N Ñ)z ) } {{ } power method noise term g where N = (i,j) Ω b ja T i y h y T h a ib T j, Ñ = (i,j) Ω b ja T i y h y T a i b T j. (d) Given C1 and C2, the noise term g = σ N 1 ((y T y h )N Ñ)z becomes smaller as the iterate gets close to the optimal: g (y T h 99 y ) 2

54 Algorithm 2: Alternating Least Squares Proof sketch for rank-1 ALS min y R d 1,z R d 2 (i,j) Ω (a T i yz T b j T ij ) 2 (e) Given C1 and C2, the first iterate y 0 is well initialized, i.e. y T 0 y 0.9, which guarantees the initial noise is small enough (f) The iterates can then be shown to linearly converge to the optimal: 1 (z T h+1z ) (1 (y T h z ) 2 ) 1 (y T h+1y ) (1 (zt h+1y ) 2 )

55 Algorithm 2: Alternating Least Squares Proof sketch for rank-1 ALS min y R d 1,z R d 2 (i,j) Ω (a T i yz T b j T ij ) 2 (e) Given C1 and C2, the first iterate y 0 is well initialized, i.e. y T 0 y 0.9, which guarantees the initial noise is small enough (f) The iterates can then be shown to linearly converge to the optimal: 1 (z T h+1z ) (1 (y T h z ) 2 ) 1 (y T h+1y ) (1 (zt h+1y ) 2 ) Similarly, the rank-k case can be proved.

56 Inductive Matrix Completion: Sample Complexity Sample complexity of Inductive Matrix Completion (IMC) and Matrix Completion (MC). methods IMC MC Nuclear-norm O(kd log n log d) kn log 2 n (Recht, 2011) ALS O(k 4 β 2 d log d) k 3 β 2 n log n (Hardt, 2014) where β is the condition number of X In most cases, n d Incoherence conditions on A, B are required Satisfied e.g. when A, B are Gaussian (no assumption on X needed) B. Recht. A simpler approach to matrix completion. The Journal of Machine Learning Research 12 : (2011). M. Hardt. Understanding alternating minimization for matrix completion. Foundations of Computer Science (FOCS), IEEE 55th Annual Symposium, pp (2014).

57 Inductive Matrix Completion: Sample Complexity Results All matrices are sampled from Gaussian random distribution. Left two figures: fix k = 5, n = 1000 and change d. Right two figures: fix k = 5, d = 50 and change n. Darkness of the shading is proportional to the number of failures (repeated 10 times) Number of Measurements Number of Measurements Number of Measurements Number of Measurements d d n n 0 Ω vs. d (ALS) Ω vs. d (Nuclear) Ω vs. n (ALS) Ω vs. n (Nuclear) Sample complexity is proportional to d while almost independent of n for both Nuclear-norm and ALS methods.

58 Positive-Unlabeled Learning

59 Modern Prediction Problems in Machine Learning Predicting causal disease genes

60 Bilinear Prediction: PU Learning In many applications, only positive labels are observed

61 PU Learning No observations of the negative class available

62 PU Inductive Matrix Completion Guarantees so far assume observations are sampled uniformly What can we say about the case when observations are all 1 s ( positives )? Typically, 99% entries are missing ( unlabeled )

63 PU Inductive Matrix Completion Inductive Matrix Completion: min X : X t (i,j) Ω (a T i X b j T ij ) 2 Commonly used PU strategy: Biased Matrix Completion min α (a T i X b j T ij ) 2 + (1 α) (a T i X b j 0) 2 X : X t (i,j) Ω (i,j) Ω Typically, α > 1 α (α 0.9). V. Sindhwani, S. S. Bucak, J. Hu, A. Mojsilovic. One-class matrix completion with low-density factorizations. ICDM, pp

64 PU Inductive Matrix Completion Inductive Matrix Completion: min X : X t (i,j) Ω (a T i X b j T ij ) 2 Commonly used PU strategy: Biased Matrix Completion min α (a T i X b j T ij ) 2 + (1 α) (a T i X b j 0) 2 X : X t (i,j) Ω (i,j) Ω Typically, α > 1 α (α 0.9). We can show guarantees for the biased formulation V. Sindhwani, S. S. Bucak, J. Hu, A. Mojsilovic. One-class matrix completion with low-density factorizations. ICDM, pp

65 PU Learning: Random Noise Model Can be formulated as learning with class-conditional noise N. Natarajan, I. S. Dhillon, P. Ravikumar, and A.Tewari. Learning with Noisy Labels. In Advances in Neural Information Processing Systems, pp

66 PU Inductive Matrix Completion A deterministic PU learning model T ij = { 1 if M ij > 0.5, 0 if M ij 0.5

67 PU Inductive Matrix Completion A deterministic PU learning model P( T ij = 0 T ij = 1) = ρ and P( T ij = 0 T ij = 0) = 1. We are given only T but not T or M Goal: Recover T given T (recovering M is not possible!)

68 Algorithm 1: Biased Inductive Matrix Completion X = min α (a T i X b j 1) 2 + (1 α) (a T i X b j 0) 2 X : X t (i,j) Ω (i,j) Ω Rationale: (a) Fix α = (1 + ρ)/2 and define T ij = I [ (A X B T ) ij > 0.5 ] (b) The above problem is equivalent to: X = min X : X t l α ((AXB T ) ij, T ij ) where l α (x, T ij ) = α T ij (x 1) 2 + (1 α)(1 T ij )x 2 (c) Minimizing l α loss is equivalent to minimizing the true error, i.e. 1 mn ij i,j l α ((AXB T ) ij, T ij ) = C 1 1 mn T T 2 F + C 2

69 Algorithm 1: Biased Inductive Matrix Completion Theorem (Error Bound for Biased IMC) Assume ground-truth X satisfies X t (where M = AXB T ). Define T ij = I [ (A X B T ) ij > 0.5 ], A = max i a i and B = max i b i. If α = 1+ρ 2, then with probability at least 1 δ, where η = 4(1 + 2ρ). ( 1 n 2 T T η log(2/δ) 2 F = O + η tab ) log2d n(1 ρ) (1 ρ)n 3/2 C-J. Hsieh, N. Natarajan, and I. S. Dhillon. PU Learning for Matrix Completion. In Proceedings of The 32nd International Conference on Machine Learning, pp (2015).

70 Experimental Results

71 Multi-target Prediction: Image Tag Recommendation NUS-Wide Image Dataset 161,780 training images 107,879 test images 1,134 features 1,000 tags

72 Multi-target Prediction: Image Tag Recommendation H. F. Yu, P. Jain, P. Kar, and I. S. Dhillon. Large-scale Multi-label Learning with Missing Labels. In Proceedings of The 31st International Conference on Machine Learning, pp (2014).

73 Multi-target Prediction: Image Tag Recommendation Low-rank Model with k = 50: time(s) AUC LEML(ALS) WSABIE 4, Low-rank Model with k = 100: time(s) AUC LEML(ALS) 1, WSABIE 6, H. F. Yu, P. Jain, P. Kar, and I. S. Dhillon. Large-scale Multi-label Learning with Missing Labels. In Proceedings of The 31st International Conference on Machine Learning, pp (2014).

74 Multi-target Prediction: Wikipedia Tag Recommendation Wikipedia Dataset 881,805 training wiki pages 10,000 test wiki pages 366,932 features 213,707 tags

75 Multi-target Prediction: Wikipedia Tag Recommendation Low-rank Model with k = 250: time(s) AUC LEML(ALS) 9, WSABIE 79, Low-rank Model with k = 500: time(s) AUC LEML(ALS) 18, WSABIE 139, H. F. Yu, P. Jain, P. Kar, and I. S. Dhillon. Large-scale Multi-label Learning with Missing Labels. In Proceedings of The 31st International Conference on Machine Learning, pp (2014).

76 PU Inductive Matrix Completion: Gene-Disease Prediction N. Natarajan, and I. S. Dhillon. Inductive matrix completion for predicting gene disease associations. Bioinformatics, 30(12), i60-i68 (2014).

77 PU Inductive Matrix Completion: Gene-Disease Prediction Catapult 2.5 Katz Matrix Completion on C P(hidden gene among genes looked at) IMC LEML Matrix Completion Precision(%) Number of genes looked at Recall(%) Predicting gene-disease associations in the OMIM data set ( N. Natarajan, and I. S. Dhillon. Inductive matrix completion for predicting gene disease associations. Bioinformatics, 30(12), i60-i68 (2014).

78 PU Inductive Matrix Completion: Gene-Disease Prediction Catapult Katz P(hidden gene among genes looked at) 0.1 Matrix Completion on C IMC LEML P(hidden gene among genes looked at) Number of genes looked at Number of genes looked at Predicting genes for diseases with no training associations. N. Natarajan, and I. S. Dhillon. Inductive matrix completion for predicting gene disease associations. Bioinformatics, 30(12), i60-i68 (2014).

79 Conclusions and Future Work Inductive Matrix Completion: Scales to millions of targets Captures correlations among targets Overcomes missing values Extension to PU learning Much work to do: Other structures: low-rank+sparse, low-rank+column-sparse (outliers)? Different loss functions? Handling time as one of the dimensions incorporating smoothness through graph regularization? Incorporating non-linearities? Efficient (parallel) implementations? Improved recovery guarantees?

80 References [1] P. Jain, and I. S. Dhillon. Provable inductive matrix completion. arxiv preprint arxiv: (2013). [2] K. Zhong, P. Jain, I. S. Dhillon. Efficient Matrix Sensing Using Rank-1 Gaussian Measurements. In Proceedings of The 26th Conference on Algorithmic Learning Theory (2015). [3] N. Natarajan, and I. S. Dhillon. Inductive matrix completion for predicting gene disease associations. Bioinformatics, 30(12), i60-i68 (2014). [4] H. F. Yu, P. Jain, P. Kar, and I. S. Dhillon. Large-scale Multi-label Learning with Missing Labels. In Proceedings of The 31st International Conference on Machine Learning, pp (2014). [5] C-J. Hsieh, N. Natarajan, and I. S. Dhillon. PU Learning for Matrix Completion. In Proceedings of The 32nd International Conference on Machine Learning, pp (2015).

### PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION Introduction In the previous chapter, we explored a class of regression models having particularly simple analytical

### Consistent Binary Classification with Generalized Performance Metrics

Consistent Binary Classification with Generalized Performance Metrics Nagarajan Natarajan Joint work with Oluwasanmi Koyejo, Pradeep Ravikumar and Inderjit Dhillon UT Austin Nov 4, 2014 Problem and Motivation

### Statistical Machine Learning

Statistical Machine Learning UoC Stats 37700, Winter quarter Lecture 4: classical linear and quadratic discriminants. 1 / 25 Linear separation For two classes in R d : simple idea: separate the classes

### DATA ANALYTICS USING R

DATA ANALYTICS USING R Duration: 90 Hours Intended audience and scope: The course is targeted at fresh engineers, practicing engineers and scientists who are interested in learning and understanding data

### Logistic Regression. Vibhav Gogate The University of Texas at Dallas. Some Slides from Carlos Guestrin, Luke Zettlemoyer and Dan Weld.

Logistic Regression Vibhav Gogate The University of Texas at Dallas Some Slides from Carlos Guestrin, Luke Zettlemoyer and Dan Weld. Generative vs. Discriminative Classifiers Want to Learn: h:x Y X features

### Simple and efficient online algorithms for real world applications

Simple and efficient online algorithms for real world applications Università degli Studi di Milano Milano, Italy Talk @ Centro de Visión por Computador Something about me PhD in Robotics at LIRA-Lab,

### CSE 494 CSE/CBS 598 (Fall 2007): Numerical Linear Algebra for Data Exploration Clustering Instructor: Jieping Ye

CSE 494 CSE/CBS 598 Fall 2007: Numerical Linear Algebra for Data Exploration Clustering Instructor: Jieping Ye 1 Introduction One important method for data compression and classification is to organize

### Statistical machine learning, high dimension and big data

Statistical machine learning, high dimension and big data S. Gaïffas 1 14 mars 2014 1 CMAP - Ecole Polytechnique Agenda for today Divide and Conquer principle for collaborative filtering Graphical modelling,

### Lectures notes on orthogonal matrices (with exercises) 92.222 - Linear Algebra II - Spring 2004 by D. Klain

Lectures notes on orthogonal matrices (with exercises) 92.222 - Linear Algebra II - Spring 2004 by D. Klain 1. Orthogonal matrices and orthonormal sets An n n real-valued matrix A is said to be an orthogonal

### Linear Threshold Units

Linear Threshold Units w x hx (... w n x n w We assume that each feature x j and each weight w j is a real number (we will relax this later) We will study three different algorithms for learning linear

### Cheng Soon Ong & Christfried Webers. Canberra February June 2016

c Cheng Soon Ong & Christfried Webers Research Group and College of Engineering and Computer Science Canberra February June (Many figures from C. M. Bishop, "Pattern Recognition and ") 1of 31 c Part I

### These slides follow closely the (English) course textbook Pattern Recognition and Machine Learning by Christopher Bishop

Music and Machine Learning (IFT6080 Winter 08) Prof. Douglas Eck, Université de Montréal These slides follow closely the (English) course textbook Pattern Recognition and Machine Learning by Christopher

### Big Data - Lecture 1 Optimization reminders

Big Data - Lecture 1 Optimization reminders S. Gadat Toulouse, Octobre 2014 Big Data - Lecture 1 Optimization reminders S. Gadat Toulouse, Octobre 2014 Schedule Introduction Major issues Examples Mathematics

### Lecture 6: Logistic Regression

Lecture 6: CS 194-10, Fall 2011 Laurent El Ghaoui EECS Department UC Berkeley September 13, 2011 Outline Outline Classification task Data : X = [x 1,..., x m]: a n m matrix of data points in R n. y { 1,

### Nimble Algorithms for Cloud Computing. Ravi Kannan, Santosh Vempala and David Woodruff

Nimble Algorithms for Cloud Computing Ravi Kannan, Santosh Vempala and David Woodruff Cloud computing Data is distributed arbitrarily on many servers Parallel algorithms: time Streaming algorithms: sublinear

### Lecture 3: Linear methods for classification

Lecture 3: Linear methods for classification Rafael A. Irizarry and Hector Corrada Bravo February, 2010 Today we describe four specific algorithms useful for classification problems: linear regression,

### APPM4720/5720: Fast algorithms for big data. Gunnar Martinsson The University of Colorado at Boulder

APPM4720/5720: Fast algorithms for big data Gunnar Martinsson The University of Colorado at Boulder Course objectives: The purpose of this course is to teach efficient algorithms for processing very large

### Probabilistic Linear Classification: Logistic Regression. Piyush Rai IIT Kanpur

Probabilistic Linear Classification: Logistic Regression Piyush Rai IIT Kanpur Probabilistic Machine Learning (CS772A) Jan 18, 2016 Probabilistic Machine Learning (CS772A) Probabilistic Linear Classification:

### NMR Measurement of T1-T2 Spectra with Partial Measurements using Compressive Sensing

NMR Measurement of T1-T2 Spectra with Partial Measurements using Compressive Sensing Alex Cloninger Norbert Wiener Center Department of Mathematics University of Maryland, College Park http://www.norbertwiener.umd.edu

### Logistic Regression. Jia Li. Department of Statistics The Pennsylvania State University. Logistic Regression

Logistic Regression Department of Statistics The Pennsylvania State University Email: jiali@stat.psu.edu Logistic Regression Preserve linear classification boundaries. By the Bayes rule: Ĝ(x) = arg max

### Semi-Supervised Support Vector Machines and Application to Spam Filtering

Semi-Supervised Support Vector Machines and Application to Spam Filtering Alexander Zien Empirical Inference Department, Bernhard Schölkopf Max Planck Institute for Biological Cybernetics ECML 2006 Discovery

### Noisy and Missing Data Regression: Distribution-Oblivious Support Recovery

: Distribution-Oblivious Support Recovery Yudong Chen Department of Electrical and Computer Engineering The University of Texas at Austin Austin, TX 7872 Constantine Caramanis Department of Electrical

### Randomized Robust Linear Regression for big data applications

Randomized Robust Linear Regression for big data applications Yannis Kopsinis 1 Dept. of Informatics & Telecommunications, UoA Thursday, Apr 16, 2015 In collaboration with S. Chouvardas, Harris Georgiou,

### Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.

Statistical Learning: Chapter 4 Classification 4.1 Introduction Supervised learning with a categorical (Qualitative) response Notation: - Feature vector X, - qualitative response Y, taking values in C

### Classification Problems

Classification Read Chapter 4 in the text by Bishop, except omit Sections 4.1.6, 4.1.7, 4.2.4, 4.3.3, 4.3.5, 4.3.6, 4.4, and 4.5. Also, review sections 1.5.1, 1.5.2, 1.5.3, and 1.5.4. Classification Problems

### LABEL PROPAGATION ON GRAPHS. SEMI-SUPERVISED LEARNING. ----Changsheng Liu 10-30-2014

LABEL PROPAGATION ON GRAPHS. SEMI-SUPERVISED LEARNING ----Changsheng Liu 10-30-2014 Agenda Semi Supervised Learning Topics in Semi Supervised Learning Label Propagation Local and global consistency Graph

### Lecture Topic: Low-Rank Approximations

Lecture Topic: Low-Rank Approximations Low-Rank Approximations We have seen principal component analysis. The extraction of the first principle eigenvalue could be seen as an approximation of the original

### Distributed Machine Learning and Big Data

Distributed Machine Learning and Big Data Sourangshu Bhattacharya Dept. of Computer Science and Engineering, IIT Kharagpur. http://cse.iitkgp.ac.in/~sourangshu/ August 21, 2015 Sourangshu Bhattacharya

### Introduction to Online Learning Theory

Introduction to Online Learning Theory Wojciech Kot lowski Institute of Computing Science, Poznań University of Technology IDSS, 04.06.2013 1 / 53 Outline 1 Example: Online (Stochastic) Gradient Descent

### Supervised Learning (Big Data Analytics)

Supervised Learning (Big Data Analytics) Vibhav Gogate Department of Computer Science The University of Texas at Dallas Practical advice Goal of Big Data Analytics Uncover patterns in Data. Can be used

### Social Media Mining. Data Mining Essentials

Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers

### Several Views of Support Vector Machines

Several Views of Support Vector Machines Ryan M. Rifkin Honda Research Institute USA, Inc. Human Intention Understanding Group 2007 Tikhonov Regularization We are considering algorithms of the form min

### Linear Classification. Volker Tresp Summer 2015

Linear Classification Volker Tresp Summer 2015 1 Classification Classification is the central task of pattern recognition Sensors supply information about an object: to which class do the object belong

### Introduction to Support Vector Machines. Colin Campbell, Bristol University

Introduction to Support Vector Machines Colin Campbell, Bristol University 1 Outline of talk. Part 1. An Introduction to SVMs 1.1. SVMs for binary classification. 1.2. Soft margins and multi-class classification.

Adaptive Online Gradient Descent Peter L Bartlett Division of Computer Science Department of Statistics UC Berkeley Berkeley, CA 94709 bartlett@csberkeleyedu Elad Hazan IBM Almaden Research Center 650

### Machine Learning and Pattern Recognition Logistic Regression

Machine Learning and Pattern Recognition Logistic Regression Course Lecturer:Amos J Storkey Institute for Adaptive and Neural Computation School of Informatics University of Edinburgh Crichton Street,

### Variance Reduction. Pricing American Options. Monte Carlo Option Pricing. Delta and Common Random Numbers

Variance Reduction The statistical efficiency of Monte Carlo simulation can be measured by the variance of its output If this variance can be lowered without changing the expected value, fewer replications

### Introduction to machine learning and pattern recognition Lecture 1 Coryn Bailer-Jones

Introduction to machine learning and pattern recognition Lecture 1 Coryn Bailer-Jones http://www.mpia.de/homes/calj/mlpr_mpia2008.html 1 1 What is machine learning? Data description and interpretation

### Predict Influencers in the Social Network

Predict Influencers in the Social Network Ruishan Liu, Yang Zhao and Liuyu Zhou Email: rliu2, yzhao2, lyzhou@stanford.edu Department of Electrical Engineering, Stanford University Abstract Given two persons

### Sparse modeling: some unifying theory and word-imaging

Sparse modeling: some unifying theory and word-imaging Bin Yu UC Berkeley Departments of Statistics, and EECS Based on joint work with: Sahand Negahban (UC Berkeley) Pradeep Ravikumar (UT Austin) Martin

### Introduction to Machine Learning Lecture 1. Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu

Introduction to Machine Learning Lecture 1 Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu Introduction Logistics Prerequisites: basics concepts needed in probability and statistics

### Data Mining - Evaluation of Classifiers

Data Mining - Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010

### SOLVING LINEAR SYSTEMS

SOLVING LINEAR SYSTEMS Linear systems Ax = b occur widely in applied mathematics They occur as direct formulations of real world problems; but more often, they occur as a part of the numerical analysis

### CS 688 Pattern Recognition Lecture 4. Linear Models for Classification

CS 688 Pattern Recognition Lecture 4 Linear Models for Classification Probabilistic generative models Probabilistic discriminative models 1 Generative Approach ( x ) p C k p( C k ) Ck p ( ) ( x Ck ) p(

### Large-Scale Similarity and Distance Metric Learning

Large-Scale Similarity and Distance Metric Learning Aurélien Bellet Télécom ParisTech Joint work with K. Liu, Y. Shi and F. Sha (USC), S. Clémençon and I. Colin (Télécom ParisTech) Séminaire Criteo March

### Linear Algebra Review Part 2: Ax=b

Linear Algebra Review Part 2: Ax=b Edwin Olson University of Michigan The Three-Day Plan Geometry of Linear Algebra Vectors, matrices, basic operations, lines, planes, homogeneous coordinates, transformations

### Modern Optimization Methods for Big Data Problems MATH11146 The University of Edinburgh

Modern Optimization Methods for Big Data Problems MATH11146 The University of Edinburgh Peter Richtárik Week 3 Randomized Coordinate Descent With Arbitrary Sampling January 27, 2016 1 / 30 The Problem

### Absolute Value Programming

Computational Optimization and Aplications,, 1 11 (2006) c 2006 Springer Verlag, Boston. Manufactured in The Netherlands. Absolute Value Programming O. L. MANGASARIAN olvi@cs.wisc.edu Computer Sciences

### STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.cs.toronto.edu/~rsalakhu/ Lecture 6 Three Approaches to Classification Construct

### Sketching as a Tool for Numerical Linear Algebra

Sketching as a Tool for Numerical Linear Algebra (Graph Sparsification) David P. Woodruff presented by Sepehr Assadi o(n) Big Data Reading Group University of Pennsylvania April, 2015 Sepehr Assadi (Penn)

### LINEAR SYSTEMS. Consider the following example of a linear system:

LINEAR SYSTEMS Consider the following example of a linear system: Its unique solution is x +2x 2 +3x 3 = 5 x + x 3 = 3 3x + x 2 +3x 3 = 3 x =, x 2 =0, x 3 = 2 In general we want to solve n equations in

### Sublinear Algorithms for Big Data. Part 4: Random Topics

Sublinear Algorithms for Big Data Part 4: Random Topics Qin Zhang 1-1 2-1 Topic 1: Compressive sensing Compressive sensing The model (Candes-Romberg-Tao 04; Donoho 04) Applicaitons Medical imaging reconstruction

### Link-based Analysis on Large Graphs. Presented by Weiren Yu Mar 01, 2011

Link-based Analysis on Large Graphs Presented by Weiren Yu Mar 01, 2011 Overview 1 Introduction 2 Problem Definition 3 Optimization Techniques 4 Experimental Results 2 1. Introduction Many applications

### Tensor Methods for Machine Learning, Computer Vision, and Computer Graphics

Tensor Methods for Machine Learning, Computer Vision, and Computer Graphics Part I: Factorizations and Statistical Modeling/Inference Amnon Shashua School of Computer Science & Eng. The Hebrew University

### DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS

DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS 1 AND ALGORITHMS Chiara Renso KDD-LAB ISTI- CNR, Pisa, Italy WHAT IS CLUSTER ANALYSIS? Finding groups of objects such that the objects in a group will be similar

### A network flow algorithm for reconstructing. binary images from discrete X-rays

A network flow algorithm for reconstructing binary images from discrete X-rays Kees Joost Batenburg Leiden University and CWI, The Netherlands kbatenbu@math.leidenuniv.nl Abstract We present a new algorithm

### December 4, 2013 MATH 171 BASIC LINEAR ALGEBRA B. KITCHENS

December 4, 2013 MATH 171 BASIC LINEAR ALGEBRA B KITCHENS The equation 1 Lines in two-dimensional space (1) 2x y = 3 describes a line in two-dimensional space The coefficients of x and y in the equation

### Bayesian Machine Learning (ML): Modeling And Inference in Big Data. Zhuhua Cai Google, Rice University caizhua@gmail.com

Bayesian Machine Learning (ML): Modeling And Inference in Big Data Zhuhua Cai Google Rice University caizhua@gmail.com 1 Syllabus Bayesian ML Concepts (Today) Bayesian ML on MapReduce (Next morning) Bayesian

### Foundations of Machine Learning On-Line Learning. Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu

Foundations of Machine Learning On-Line Learning Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu Motivation PAC learning: distribution fixed over time (training and test). IID assumption.

### Scalable Machine Learning - or what to do with all that Big Data infrastructure

- or what to do with all that Big Data infrastructure TU Berlin blog.mikiobraun.de Strata+Hadoop World London, 2015 1 Complex Data Analysis at Scale Click-through prediction Personalized Spam Detection

### Chapter 13 Introduction to Nonlinear Regression( 非 線 性 迴 歸 )

Chapter 13 Introduction to Nonlinear Regression( 非 線 性 迴 歸 ) and Neural Networks( 類 神 經 網 路 ) 許 湘 伶 Applied Linear Regression Models (Kutner, Nachtsheim, Neter, Li) hsuhl (NUK) LR Chap 10 1 / 35 13 Examples

### Two Topics in Parametric Integration Applied to Stochastic Simulation in Industrial Engineering

Two Topics in Parametric Integration Applied to Stochastic Simulation in Industrial Engineering Department of Industrial Engineering and Management Sciences Northwestern University September 15th, 2014

### Big Data Analytics: Optimization and Randomization

Big Data Analytics: Optimization and Randomization Tianbao Yang, Qihang Lin, Rong Jin Tutorial@SIGKDD 2015 Sydney, Australia Department of Computer Science, The University of Iowa, IA, USA Department of

### CSCI567 Machine Learning (Fall 2014)

CSCI567 Machine Learning (Fall 2014) Drs. Sha & Liu {feisha,yanliu.cs}@usc.edu September 22, 2014 Drs. Sha & Liu ({feisha,yanliu.cs}@usc.edu) CSCI567 Machine Learning (Fall 2014) September 22, 2014 1 /

### Christfried Webers. Canberra February June 2015

c Statistical Group and College of Engineering and Computer Science Canberra February June (Many figures from C. M. Bishop, "Pattern Recognition and ") 1of 829 c Part VIII Linear Classification 2 Logistic

### IN this paper we focus on the problem of large-scale, multiclass

IEEE TRANSACTIONS ON PATTERN RECOGNITION AND MACHINE INTELLIGENCE 1 Distance-Based Image Classification: Generalizing to new classes at near-zero cost Thomas Mensink, Member IEEE, Jakob Verbeek, Member,

### arxiv:1312.1254v1 [cs.na] 4 Dec 2013

PARALLEL MATRIX FACTORIZATION FOR LOW-RANK TENSOR COMPLETION YANGYANG XU, RURU HAO, WOTAO YIN, AND ZHIXUN SU arxiv:1312.1254v1 [cs.na] 4 Dec 2013 Abstract. Higher-order low-rank tensors naturally arise

### Pa8ern Recogni6on. and Machine Learning. Chapter 4: Linear Models for Classiﬁca6on

Pa8ern Recogni6on and Machine Learning Chapter 4: Linear Models for Classiﬁca6on Represen'ng the target values for classifica'on If there are only two classes, we typically use a single real valued output

### (Quasi-)Newton methods

(Quasi-)Newton methods 1 Introduction 1.1 Newton method Newton method is a method to find the zeros of a differentiable non-linear function g, x such that g(x) = 0, where g : R n R n. Given a starting

### Big learning: challenges and opportunities

Big learning: challenges and opportunities Francis Bach SIERRA Project-team, INRIA - Ecole Normale Supérieure December 2013 Omnipresent digital media Scientific context Big data Multimedia, sensors, indicators,

### Non-negative Matrix Factorization (NMF) in Semi-supervised Learning Reducing Dimension and Maintaining Meaning

Non-negative Matrix Factorization (NMF) in Semi-supervised Learning Reducing Dimension and Maintaining Meaning SAMSI 10 May 2013 Outline Introduction to NMF Applications Motivations NMF as a middle step

### Data Mining Clustering (2) Sheets are based on the those provided by Tan, Steinbach, and Kumar. Introduction to Data Mining

Data Mining Clustering (2) Toon Calders Sheets are based on the those provided by Tan, Steinbach, and Kumar. Introduction to Data Mining Outline Partitional Clustering Distance-based K-means, K-medoids,

### ANALYSIS, THEORY AND DESIGN OF LOGISTIC REGRESSION CLASSIFIERS USED FOR VERY LARGE SCALE DATA MINING

ANALYSIS, THEORY AND DESIGN OF LOGISTIC REGRESSION CLASSIFIERS USED FOR VERY LARGE SCALE DATA MINING BY OMID ROUHANI-KALLEH THESIS Submitted as partial fulfillment of the requirements for the degree of

### A Negative Result Concerning Explicit Matrices With The Restricted Isometry Property

A Negative Result Concerning Explicit Matrices With The Restricted Isometry Property Venkat Chandar March 1, 2008 Abstract In this note, we prove that matrices whose entries are all 0 or 1 cannot achieve

### INTRODUCTION TO NEURAL NETWORKS

INTRODUCTION TO NEURAL NETWORKS Pictures are taken from http://www.cs.cmu.edu/~tom/mlbook-chapter-slides.html http://research.microsoft.com/~cmbishop/prml/index.htm By Nobel Khandaker Neural Networks An

### Inner Product Spaces and Orthogonality

Inner Product Spaces and Orthogonality week 3-4 Fall 2006 Dot product of R n The inner product or dot product of R n is a function, defined by u, v a b + a 2 b 2 + + a n b n for u a, a 2,, a n T, v b,

### Learning Gaussian process models from big data. Alan Qi Purdue University Joint work with Z. Xu, F. Yan, B. Dai, and Y. Zhu

Learning Gaussian process models from big data Alan Qi Purdue University Joint work with Z. Xu, F. Yan, B. Dai, and Y. Zhu Machine learning seminar at University of Cambridge, July 4 2012 Data A lot of

### Classification by Pairwise Coupling

Classification by Pairwise Coupling TREVOR HASTIE * Stanford University and ROBERT TIBSHIRANI t University of Toronto Abstract We discuss a strategy for polychotomous classification that involves estimating

### MVA ENS Cachan. Lecture 2: Logistic regression & intro to MIL Iasonas Kokkinos Iasonas.kokkinos@ecp.fr

Machine Learning for Computer Vision 1 MVA ENS Cachan Lecture 2: Logistic regression & intro to MIL Iasonas Kokkinos Iasonas.kokkinos@ecp.fr Department of Applied Mathematics Ecole Centrale Paris Galen

### DATA ANALYSIS II. Matrix Algorithms

DATA ANALYSIS II Matrix Algorithms Similarity Matrix Given a dataset D = {x i }, i=1,..,n consisting of n points in R d, let A denote the n n symmetric similarity matrix between the points, given as where

### Classification of Cartan matrices

Chapter 7 Classification of Cartan matrices In this chapter we describe a classification of generalised Cartan matrices This classification can be compared as the rough classification of varieties in terms

### Class #6: Non-linear classification. ML4Bio 2012 February 17 th, 2012 Quaid Morris

Class #6: Non-linear classification ML4Bio 2012 February 17 th, 2012 Quaid Morris 1 Module #: Title of Module 2 Review Overview Linear separability Non-linear classification Linear Support Vector Machines

### Part II Redundant Dictionaries and Pursuit Algorithms

Aisenstadt Chair Course CRM September 2009 Part II Redundant Dictionaries and Pursuit Algorithms Stéphane Mallat Centre de Mathématiques Appliquées Ecole Polytechnique Sparsity in Redundant Dictionaries

### The Trimmed Iterative Closest Point Algorithm

Image and Pattern Analysis (IPAN) Group Computer and Automation Research Institute, HAS Budapest, Hungary The Trimmed Iterative Closest Point Algorithm Dmitry Chetverikov and Dmitry Stepanov http://visual.ipan.sztaki.hu

### We shall turn our attention to solving linear systems of equations. Ax = b

59 Linear Algebra We shall turn our attention to solving linear systems of equations Ax = b where A R m n, x R n, and b R m. We already saw examples of methods that required the solution of a linear system

### An Introduction to Machine Learning

An Introduction to Machine Learning L5: Novelty Detection and Regression Alexander J. Smola Statistical Machine Learning Program Canberra, ACT 0200 Australia Alex.Smola@nicta.com.au Tata Institute, Pune,

### Machine Learning for Data Science (CS4786) Lecture 1

Machine Learning for Data Science (CS4786) Lecture 1 Tu-Th 10:10 to 11:25 AM Hollister B14 Instructors : Lillian Lee and Karthik Sridharan ROUGH DETAILS ABOUT THE COURSE Diagnostic assignment 0 is out:

### GI01/M055 Supervised Learning Proximal Methods

GI01/M055 Supervised Learning Proximal Methods Massimiliano Pontil (based on notes by Luca Baldassarre) (UCL) Proximal Methods 1 / 20 Today s Plan Problem setting Convex analysis concepts Proximal operators

### LCs for Binary Classification

Linear Classifiers A linear classifier is a classifier such that classification is performed by a dot product beteen the to vectors representing the document and the category, respectively. Therefore it

### On Covariance Structure in Noisy, Big Data

On Covariance Structure in Noisy, Big Data Randy C. Paffenroth a, Ryan Nong a and Philip C. Du Toit a a Numerica Corporation, Loveland, CO, USA; ABSTRACT Herein we describe theory and algorithms for detecting

### C19 Machine Learning

C9 Machine Learning 8 Lectures Hilary Term 25 2 Tutorial Sheets A. Zisserman Overview: Supervised classification perceptron, support vector machine, loss functions, kernels, random forests, neural networks

### Regression Using Support Vector Machines: Basic Foundations

Regression Using Support Vector Machines: Basic Foundations Technical Report December 2004 Aly Farag and Refaat M Mohamed Computer Vision and Image Processing Laboratory Electrical and Computer Engineering

### Class Overview and General Introduction to Machine Learning

Class Overview and General Introduction to Machine Learning Piyush Rai www.cs.utah.edu/~piyush CS5350/6350: Machine Learning August 23, 2011 (CS5350/6350) Intro to ML August 23, 2011 1 / 25 Course Logistics

### Big Data Optimization: Randomized lock-free methods for minimizing partially separable convex functions

Big Data Optimization: Randomized lock-free methods for minimizing partially separable convex functions Peter Richtárik School of Mathematics The University of Edinburgh Joint work with Martin Takáč (Edinburgh)

### Lasso-based Spam Filtering with Chinese Emails

Journal of Computational Information Systems 8: 8 (2012) 3315 3322 Available at http://www.jofcis.com Lasso-based Spam Filtering with Chinese Emails Zunxiong LIU 1, Xianlong ZHANG 1,, Shujuan ZHENG 2 1

### Pattern Analysis. Logistic Regression. 12. Mai 2009. Joachim Hornegger. Chair of Pattern Recognition Erlangen University

Pattern Analysis Logistic Regression 12. Mai 2009 Joachim Hornegger Chair of Pattern Recognition Erlangen University Pattern Analysis 2 / 43 1 Logistic Regression Posteriors and the Logistic Function Decision

### MATH 304 Linear Algebra Lecture 18: Rank and nullity of a matrix.

MATH 304 Linear Algebra Lecture 18: Rank and nullity of a matrix. Nullspace Let A = (a ij ) be an m n matrix. Definition. The nullspace of the matrix A, denoted N(A), is the set of all n-dimensional column

### 10. Proximal point method

L. Vandenberghe EE236C Spring 2013-14) 10. Proximal point method proximal point method augmented Lagrangian method Moreau-Yosida smoothing 10-1 Proximal point method a conceptual algorithm for minimizing