Bilinear Prediction Using LowRank Models


 Jodie Chapman
 3 years ago
 Views:
Transcription
1 Bilinear Prediction Using LowRank Models Inderjit S. Dhillon Dept of Computer Science UT Austin 26th International Conference on Algorithmic Learning Theory Banff, Canada Oct 6, 2015 Joint work with CJ. Hsieh, P. Jain, N. Natarajan, H. Yu and K. Zhong
2 Outline MultiTarget Prediction Features on Targets: Bilinear Prediction Inductive Matrix Completion 1 Algorithms 2 PositiveUnlabeled Matrix Completion 3 Recovery Guarantees Experimental Results
3 Sample Prediction Problems Predicting stock prices Predicting risk factors in healthcare
4 Regression Realvalued responses (target) t Predict response for given input data (features) a
5 Linear Regression Estimate target by a linear function of given data a, i.e. t ˆt = a T x.
6 Linear Regression: Least Squares Choose x that minimizes J x = 1 2 n (a T i x t i ) 2 i=1 Closedform solution: x = (A T A) 1 A T t.
7 Prediction Problems: Classification Spam detection Character Recognition
8 Binary Classification Categorical responses (target) t Predict response for given input data (features) a Linear methods decision boundary is a linear surface or hyperplane
9 Linear Methods for Prediction Problems Regression: Ridge Regression: J x = 1 2 n i=1 (at i x t i ) 2 + λ x 2 2. Lasso: J x = 1 2 n i=1 (at i x t i ) 2 + λ x 1. Classification: Linear Support Vector Machines J x = 1 2 n max(0, 1 t i a T i x) + λ x 2 2. i=1 Logistic Regression J x = 1 2 n log(1 + exp( t i a T i x)) + λ x 2 2. i=1
10 Linear Prediction
11 Linear Prediction
12 MultiTarget Prediction
13 Modern Prediction Problems in Machine Learning Adword Recommendation
14 Modern Prediction Problems in Machine Learning Adword Recommendation geico auto insurance geico car insurance car insurance geico insurance need cheap auto insurance geico com car insurance coupon code
15 Modern Prediction Problems in Machine Learning Wikipedia Tag Recommendation Learning in computer vision Machine learning Learning Cybernetics.
16 Modern Prediction Problems in Machine Learning Predicting causal disease genes
17 Prediction with Multiple Targets In many domains, goal is to simultaneously predict multiple target variables Multitarget regression: targets are realvalued Multilabel classification:targets are binary
18 Prediction with Multiple Targets Applications Bid word recommendation Tag recommendation Diseasegene linkage prediction Medical diagnoses Ecological modeling...
19 Prediction with Multiple Targets Input data a i is associated with m targets, t i = (t (1) i, t (2) i,..., t (m) i )
20 Multitarget Linear Prediction Basic model: Treat targets independently Estimate regression coefficients x j for each target j
21 Multitarget Linear Prediction Assume targets t (j) are independent Linear predictive model: t i a T i X
22 Multitarget Linear Prediction Assume targets t (j) are independent Linear predictive model: t i a T i X Multitarget regression problem has a closedform solution: V A Σ 1 A U A T = arg min X T AX 2 F where A = U A Σ A V T A is the thin SVD of A
23 Multitarget Linear Prediction Assume targets t (j) are independent Linear predictive model: t i a T i X Multitarget regression problem has a closedform solution: V A Σ 1 A U A T = arg min X T AX 2 F where A = U A Σ A V T A is the thin SVD of A In multilabel classification: Binary Relevance (independent binary classifier for each label)
24 Multitarget Linear Prediction: Lowrank Model Exploit correlations between targets T, where T AX ReducedRank Regression [A.J. Izenman, 1974] model the coefficient matrix X as lowrank A. J. Izenman. Reducedrank regression for the multivariate linear model. Journal of Multivariate Analysis 5.2 (1975):
25 Multitarget Linear Prediction: Lowrank Model X is rankk Linear predictive model: t i a T i X
26 Multitarget Linear Prediction: Lowrank Model X is rankk Linear predictive model: t i a T i X Lowrank multitarget regression problem has a closedform solution: X = = min X :rank(x ) k { T AX 2 F V A Σ 1 A U A T k if A is full row rank, V A Σ 1 A M k otherwise, where A = U A Σ A VA T is the thin SVD of A, M = U A T, and T k, M k are the best rankk approximations of T and M respectively.
27 Modern Challenges
28 Multitarget Prediction with Missing Values In many applications, several observations (targets) may be missing E.g. Recommending tags for images and wikipedia articles
29 Modern Prediction Problems in Machine Learning Adword Recommendation geico auto insurance geico car insurance car insurance geico insurance need cheap auto insurance geico com car insurance coupon code
30 Multitarget Prediction with Missing Values
31 Multitarget Prediction with Missing Values Lowrank model: t i = a T i X where X is lowrank
32 Canonical Correlation Analysis
33 Bilinear Prediction
34 Bilinear Prediction Augment multitarget prediction with features on targets as well Motivated by modern applications of machine learning bioinformatics, autotagging articles Need to model dyadic or pairwise interactions Move from linear models to bilinear models linear in input features as well as target features
35 Bilinear Prediction
36 Bilinear Prediction
37 Bilinear Prediction Bilinear predictive model: T ij a T i X b j
38 Bilinear Prediction Bilinear predictive model: T ij a T i X b j Corresponding regression problem has a closedform solution: V A Σ 1 A U A TU BΣ 1 B V T B = arg min X T AXB 2 F where A = U A Σ A V A, B = U BΣ B V B are the thin SVDs of A and B
39 Bilinear Prediction: Lowrank Model X is rankk Bilinear predictive model: T ij a T i X b j
40 Bilinear Prediction: Lowrank Model X is rankk Bilinear predictive model: T ij a T i X b j Corresponding regression problem has a closedform solution: X = = min T AXB 2 F X :rank(x ) k { V A Σ 1 A U A T ku B Σ 1 B V B T V A Σ 1 A M kσ 1 B V B T if A, B are full row rank, otherwise, where A = U A Σ A VA, B = U BΣ B VB are the thin SVDs of A and B, M = UA TU B, and T k, M k are the best rankk approximations of T and M
41 Modern Challenges in MultiTarget Prediction Millions of targets Correlations among targets Missing values
42 Modern Challenges in MultiTarget Prediction Millions of targets (Scalable) Correlations among targets (Lowrank) Missing values (Inductive Matrix Completion)
43 Bilinear Prediction with Missing Values
44 Matrix Completion Missing Value Estimation Problem Matrix Completion: Recover a lowrank matrix from observed entries Matrix Completion: exact recovery requires O(kn log 2 (n)) samples, under the assumptions of: 1 Uniform sampling 2 Incoherence
45 Inductive Matrix Completion Inductive Matrix Completion: Bilinear lowrank prediction with missing values Degrees of freedom in X are O(kd) Can we get better sample complexity (than O(kn))?
46 Algorithm 1: Convex Relaxation 1 Nuclearnorm Minimization: min X s.t. a T i X b j = T ij, (i, j) Ω Computationally expensive Sample complexity for exact recovery: O(kd log d log n) Conditions for exact recovery: C1. Incoherence on A, B. C2. Incoherence on AU and BV,where X = U Σ V T is the SVD of the ground truth X C1 and C2 are satisfied with high probability when A, B are Gaussian
47 Algorithm 1: Convex Relaxation Theorem (Recovery Guarantees for Nuclearnorm Minimization) Let X = U Σ V T R d d be the SVD of X with rank k. Assume A, B are orthonormal matrices w.l.o.g., satisfying the incoherence conditions. Then if Ω is uniformly observed with Ω O(kd log d log n), the solution of nuclearnorm minimization problem is unique and equal to X with high probability. The incoherence conditions are C1. max a i 2 2 µd i [n] n, max j [n] C2. max U T a i 2 2 µ0k i [n] b j 2 2 µd n n, max j [n] V T b j 2 2 µ0k n
48 Algorithm 2: Alternating Least Squares Alternating Least Squares (ALS): min Y R d 1 k Z R d 2 k (i,j) Ω Nonconvex optimization Alternately minimize w.r.t. Y and Z (a T i YZ T b j T ij ) 2
49 Algorithm 2: Alternating Least Squares Computational complexity of ALS. At hth iteration, fixing Y h, solve the least squares problem for Z h+1 : (ã T i Zh+1b T j )b j ã T i = T ij b j ã T i (i,j) Ω (i,j) Ω where ã i = Yh T a i. Similarly solve for Y h when fixing Z h. 1 Closed form: O( Ω k 2 d (nnz(a) + nnz(b))/n + k 3 d 3 ). 2 Vanilla conjugate gradient: O( Ω k (nnz(a) + nnz(b))/n) per iteration. 3 Exploit the structure for conjugate gradient: (ã T i Z T b j )b j ã T i = B T DÃ (i,j) Ω where D is a sparse matrix with D ji = ã T i Z T b j, (i, j) Ω, and Ã = AY h. Only O((nnz(A) + nnz(b) + Ω ) k) per iteration.
50 Algorithm 2: Alternating Least Squares Theorem (Convergence Guarantees for ALS ) Let X be a rankk matrix with condition number β, and T = AX B T. Assume A, B are orthogonal w.l.o.g. and satisfy the incoherence conditions. Then if Ω is uniformly sampled with Ω O(k 4 β 2 d log d), then after H iterations of ALS, Y H Z T H+1 X 2 ɛ, where H = O(log( X F /ɛ)). The incoherence conditions are: C1. max a i 2 2 µd i [n] C2. n, max j [n] max Yh T a i 2 2 µ0k i [n] b j 2 2 µd n n, max j [n] for all the Y h s and Z h s generated from ALS. Z T h b j 2 2 µ0k n,
51 Algorithm 2: Alternating Least Squares Proof sketch for ALS Consider the case when the rank k = 1: (a T i yz T b j T ij ) 2 min y R d 1,z R d 2 (i,j) Ω
52 Algorithm 2: Alternating Least Squares Proof sketch for rank1 ALS min y R d 1,z R d 2 (i,j) Ω (a T i yz T b j T ij ) 2 (a) Let X = σ y z T be the thin SVD of X and assume A and B are orthogonal w.l.o.g. (b) In the absence of missing values, ALS = Power method. Ay h z T B T T 2 F z = 2B T (Bzy T h A T T T )Ay h = 2(z y h 2 B T T T Ay h ) z h+1 (A T TB) T y h ; normalize z h+1 y h+1 (A T TB)z h+1 ; normalize y h+1 Note that A T TB = A T AX B T B = X and the power method converges to the optimal.
53 Algorithm 2: Alternating Least Squares Proof sketch for rank1 ALS min y R d 1,z R d 2 (i,j) Ω (a T i yz T b j T ij ) 2 (c) With missing values, ALS is a variant of power method with noise in each iteration z h+1 QR( X T y } {{ } h σ N 1 ((y T y h )N Ñ)z ) } {{ } power method noise term g where N = (i,j) Ω b ja T i y h y T h a ib T j, Ñ = (i,j) Ω b ja T i y h y T a i b T j. (d) Given C1 and C2, the noise term g = σ N 1 ((y T y h )N Ñ)z becomes smaller as the iterate gets close to the optimal: g (y T h 99 y ) 2
54 Algorithm 2: Alternating Least Squares Proof sketch for rank1 ALS min y R d 1,z R d 2 (i,j) Ω (a T i yz T b j T ij ) 2 (e) Given C1 and C2, the first iterate y 0 is well initialized, i.e. y T 0 y 0.9, which guarantees the initial noise is small enough (f) The iterates can then be shown to linearly converge to the optimal: 1 (z T h+1z ) (1 (y T h z ) 2 ) 1 (y T h+1y ) (1 (zt h+1y ) 2 )
55 Algorithm 2: Alternating Least Squares Proof sketch for rank1 ALS min y R d 1,z R d 2 (i,j) Ω (a T i yz T b j T ij ) 2 (e) Given C1 and C2, the first iterate y 0 is well initialized, i.e. y T 0 y 0.9, which guarantees the initial noise is small enough (f) The iterates can then be shown to linearly converge to the optimal: 1 (z T h+1z ) (1 (y T h z ) 2 ) 1 (y T h+1y ) (1 (zt h+1y ) 2 ) Similarly, the rankk case can be proved.
56 Inductive Matrix Completion: Sample Complexity Sample complexity of Inductive Matrix Completion (IMC) and Matrix Completion (MC). methods IMC MC Nuclearnorm O(kd log n log d) kn log 2 n (Recht, 2011) ALS O(k 4 β 2 d log d) k 3 β 2 n log n (Hardt, 2014) where β is the condition number of X In most cases, n d Incoherence conditions on A, B are required Satisfied e.g. when A, B are Gaussian (no assumption on X needed) B. Recht. A simpler approach to matrix completion. The Journal of Machine Learning Research 12 : (2011). M. Hardt. Understanding alternating minimization for matrix completion. Foundations of Computer Science (FOCS), IEEE 55th Annual Symposium, pp (2014).
57 Inductive Matrix Completion: Sample Complexity Results All matrices are sampled from Gaussian random distribution. Left two figures: fix k = 5, n = 1000 and change d. Right two figures: fix k = 5, d = 50 and change n. Darkness of the shading is proportional to the number of failures (repeated 10 times) Number of Measurements Number of Measurements Number of Measurements Number of Measurements d d n n 0 Ω vs. d (ALS) Ω vs. d (Nuclear) Ω vs. n (ALS) Ω vs. n (Nuclear) Sample complexity is proportional to d while almost independent of n for both Nuclearnorm and ALS methods.
58 PositiveUnlabeled Learning
59 Modern Prediction Problems in Machine Learning Predicting causal disease genes
60 Bilinear Prediction: PU Learning In many applications, only positive labels are observed
61 PU Learning No observations of the negative class available
62 PU Inductive Matrix Completion Guarantees so far assume observations are sampled uniformly What can we say about the case when observations are all 1 s ( positives )? Typically, 99% entries are missing ( unlabeled )
63 PU Inductive Matrix Completion Inductive Matrix Completion: min X : X t (i,j) Ω (a T i X b j T ij ) 2 Commonly used PU strategy: Biased Matrix Completion min α (a T i X b j T ij ) 2 + (1 α) (a T i X b j 0) 2 X : X t (i,j) Ω (i,j) Ω Typically, α > 1 α (α 0.9). V. Sindhwani, S. S. Bucak, J. Hu, A. Mojsilovic. Oneclass matrix completion with lowdensity factorizations. ICDM, pp
64 PU Inductive Matrix Completion Inductive Matrix Completion: min X : X t (i,j) Ω (a T i X b j T ij ) 2 Commonly used PU strategy: Biased Matrix Completion min α (a T i X b j T ij ) 2 + (1 α) (a T i X b j 0) 2 X : X t (i,j) Ω (i,j) Ω Typically, α > 1 α (α 0.9). We can show guarantees for the biased formulation V. Sindhwani, S. S. Bucak, J. Hu, A. Mojsilovic. Oneclass matrix completion with lowdensity factorizations. ICDM, pp
65 PU Learning: Random Noise Model Can be formulated as learning with classconditional noise N. Natarajan, I. S. Dhillon, P. Ravikumar, and A.Tewari. Learning with Noisy Labels. In Advances in Neural Information Processing Systems, pp
66 PU Inductive Matrix Completion A deterministic PU learning model T ij = { 1 if M ij > 0.5, 0 if M ij 0.5
67 PU Inductive Matrix Completion A deterministic PU learning model P( T ij = 0 T ij = 1) = ρ and P( T ij = 0 T ij = 0) = 1. We are given only T but not T or M Goal: Recover T given T (recovering M is not possible!)
68 Algorithm 1: Biased Inductive Matrix Completion X = min α (a T i X b j 1) 2 + (1 α) (a T i X b j 0) 2 X : X t (i,j) Ω (i,j) Ω Rationale: (a) Fix α = (1 + ρ)/2 and define T ij = I [ (A X B T ) ij > 0.5 ] (b) The above problem is equivalent to: X = min X : X t l α ((AXB T ) ij, T ij ) where l α (x, T ij ) = α T ij (x 1) 2 + (1 α)(1 T ij )x 2 (c) Minimizing l α loss is equivalent to minimizing the true error, i.e. 1 mn ij i,j l α ((AXB T ) ij, T ij ) = C 1 1 mn T T 2 F + C 2
69 Algorithm 1: Biased Inductive Matrix Completion Theorem (Error Bound for Biased IMC) Assume groundtruth X satisfies X t (where M = AXB T ). Define T ij = I [ (A X B T ) ij > 0.5 ], A = max i a i and B = max i b i. If α = 1+ρ 2, then with probability at least 1 δ, where η = 4(1 + 2ρ). ( 1 n 2 T T η log(2/δ) 2 F = O + η tab ) log2d n(1 ρ) (1 ρ)n 3/2 CJ. Hsieh, N. Natarajan, and I. S. Dhillon. PU Learning for Matrix Completion. In Proceedings of The 32nd International Conference on Machine Learning, pp (2015).
70 Experimental Results
71 Multitarget Prediction: Image Tag Recommendation NUSWide Image Dataset 161,780 training images 107,879 test images 1,134 features 1,000 tags
72 Multitarget Prediction: Image Tag Recommendation H. F. Yu, P. Jain, P. Kar, and I. S. Dhillon. Largescale Multilabel Learning with Missing Labels. In Proceedings of The 31st International Conference on Machine Learning, pp (2014).
73 Multitarget Prediction: Image Tag Recommendation Lowrank Model with k = 50: time(s) AUC LEML(ALS) WSABIE 4, Lowrank Model with k = 100: time(s) AUC LEML(ALS) 1, WSABIE 6, H. F. Yu, P. Jain, P. Kar, and I. S. Dhillon. Largescale Multilabel Learning with Missing Labels. In Proceedings of The 31st International Conference on Machine Learning, pp (2014).
74 Multitarget Prediction: Wikipedia Tag Recommendation Wikipedia Dataset 881,805 training wiki pages 10,000 test wiki pages 366,932 features 213,707 tags
75 Multitarget Prediction: Wikipedia Tag Recommendation Lowrank Model with k = 250: time(s) AUC LEML(ALS) 9, WSABIE 79, Lowrank Model with k = 500: time(s) AUC LEML(ALS) 18, WSABIE 139, H. F. Yu, P. Jain, P. Kar, and I. S. Dhillon. Largescale Multilabel Learning with Missing Labels. In Proceedings of The 31st International Conference on Machine Learning, pp (2014).
76 PU Inductive Matrix Completion: GeneDisease Prediction N. Natarajan, and I. S. Dhillon. Inductive matrix completion for predicting gene disease associations. Bioinformatics, 30(12), i60i68 (2014).
77 PU Inductive Matrix Completion: GeneDisease Prediction Catapult 2.5 Katz Matrix Completion on C P(hidden gene among genes looked at) IMC LEML Matrix Completion Precision(%) Number of genes looked at Recall(%) Predicting genedisease associations in the OMIM data set ( N. Natarajan, and I. S. Dhillon. Inductive matrix completion for predicting gene disease associations. Bioinformatics, 30(12), i60i68 (2014).
78 PU Inductive Matrix Completion: GeneDisease Prediction Catapult Katz P(hidden gene among genes looked at) 0.1 Matrix Completion on C IMC LEML P(hidden gene among genes looked at) Number of genes looked at Number of genes looked at Predicting genes for diseases with no training associations. N. Natarajan, and I. S. Dhillon. Inductive matrix completion for predicting gene disease associations. Bioinformatics, 30(12), i60i68 (2014).
79 Conclusions and Future Work Inductive Matrix Completion: Scales to millions of targets Captures correlations among targets Overcomes missing values Extension to PU learning Much work to do: Other structures: lowrank+sparse, lowrank+columnsparse (outliers)? Different loss functions? Handling time as one of the dimensions incorporating smoothness through graph regularization? Incorporating nonlinearities? Efficient (parallel) implementations? Improved recovery guarantees?
80 References [1] P. Jain, and I. S. Dhillon. Provable inductive matrix completion. arxiv preprint arxiv: (2013). [2] K. Zhong, P. Jain, I. S. Dhillon. Efficient Matrix Sensing Using Rank1 Gaussian Measurements. In Proceedings of The 26th Conference on Algorithmic Learning Theory (2015). [3] N. Natarajan, and I. S. Dhillon. Inductive matrix completion for predicting gene disease associations. Bioinformatics, 30(12), i60i68 (2014). [4] H. F. Yu, P. Jain, P. Kar, and I. S. Dhillon. Largescale Multilabel Learning with Missing Labels. In Proceedings of The 31st International Conference on Machine Learning, pp (2014). [5] CJ. Hsieh, N. Natarajan, and I. S. Dhillon. PU Learning for Matrix Completion. In Proceedings of The 32nd International Conference on Machine Learning, pp (2015).
PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION
PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION Introduction In the previous chapter, we explored a class of regression models having particularly simple analytical
More informationConsistent Binary Classification with Generalized Performance Metrics
Consistent Binary Classification with Generalized Performance Metrics Nagarajan Natarajan Joint work with Oluwasanmi Koyejo, Pradeep Ravikumar and Inderjit Dhillon UT Austin Nov 4, 2014 Problem and Motivation
More informationStatistical Machine Learning
Statistical Machine Learning UoC Stats 37700, Winter quarter Lecture 4: classical linear and quadratic discriminants. 1 / 25 Linear separation For two classes in R d : simple idea: separate the classes
More informationDATA ANALYTICS USING R
DATA ANALYTICS USING R Duration: 90 Hours Intended audience and scope: The course is targeted at fresh engineers, practicing engineers and scientists who are interested in learning and understanding data
More informationLogistic Regression. Vibhav Gogate The University of Texas at Dallas. Some Slides from Carlos Guestrin, Luke Zettlemoyer and Dan Weld.
Logistic Regression Vibhav Gogate The University of Texas at Dallas Some Slides from Carlos Guestrin, Luke Zettlemoyer and Dan Weld. Generative vs. Discriminative Classifiers Want to Learn: h:x Y X features
More informationSimple and efficient online algorithms for real world applications
Simple and efficient online algorithms for real world applications Università degli Studi di Milano Milano, Italy Talk @ Centro de Visión por Computador Something about me PhD in Robotics at LIRALab,
More informationCSE 494 CSE/CBS 598 (Fall 2007): Numerical Linear Algebra for Data Exploration Clustering Instructor: Jieping Ye
CSE 494 CSE/CBS 598 Fall 2007: Numerical Linear Algebra for Data Exploration Clustering Instructor: Jieping Ye 1 Introduction One important method for data compression and classification is to organize
More informationStatistical machine learning, high dimension and big data
Statistical machine learning, high dimension and big data S. Gaïffas 1 14 mars 2014 1 CMAP  Ecole Polytechnique Agenda for today Divide and Conquer principle for collaborative filtering Graphical modelling,
More informationLectures notes on orthogonal matrices (with exercises) 92.222  Linear Algebra II  Spring 2004 by D. Klain
Lectures notes on orthogonal matrices (with exercises) 92.222  Linear Algebra II  Spring 2004 by D. Klain 1. Orthogonal matrices and orthonormal sets An n n realvalued matrix A is said to be an orthogonal
More informationLinear Threshold Units
Linear Threshold Units w x hx (... w n x n w We assume that each feature x j and each weight w j is a real number (we will relax this later) We will study three different algorithms for learning linear
More informationCheng Soon Ong & Christfried Webers. Canberra February June 2016
c Cheng Soon Ong & Christfried Webers Research Group and College of Engineering and Computer Science Canberra February June (Many figures from C. M. Bishop, "Pattern Recognition and ") 1of 31 c Part I
More informationThese slides follow closely the (English) course textbook Pattern Recognition and Machine Learning by Christopher Bishop
Music and Machine Learning (IFT6080 Winter 08) Prof. Douglas Eck, Université de Montréal These slides follow closely the (English) course textbook Pattern Recognition and Machine Learning by Christopher
More informationBig Data  Lecture 1 Optimization reminders
Big Data  Lecture 1 Optimization reminders S. Gadat Toulouse, Octobre 2014 Big Data  Lecture 1 Optimization reminders S. Gadat Toulouse, Octobre 2014 Schedule Introduction Major issues Examples Mathematics
More informationLecture 6: Logistic Regression
Lecture 6: CS 19410, Fall 2011 Laurent El Ghaoui EECS Department UC Berkeley September 13, 2011 Outline Outline Classification task Data : X = [x 1,..., x m]: a n m matrix of data points in R n. y { 1,
More informationNimble Algorithms for Cloud Computing. Ravi Kannan, Santosh Vempala and David Woodruff
Nimble Algorithms for Cloud Computing Ravi Kannan, Santosh Vempala and David Woodruff Cloud computing Data is distributed arbitrarily on many servers Parallel algorithms: time Streaming algorithms: sublinear
More informationLecture 3: Linear methods for classification
Lecture 3: Linear methods for classification Rafael A. Irizarry and Hector Corrada Bravo February, 2010 Today we describe four specific algorithms useful for classification problems: linear regression,
More informationAPPM4720/5720: Fast algorithms for big data. Gunnar Martinsson The University of Colorado at Boulder
APPM4720/5720: Fast algorithms for big data Gunnar Martinsson The University of Colorado at Boulder Course objectives: The purpose of this course is to teach efficient algorithms for processing very large
More informationProbabilistic Linear Classification: Logistic Regression. Piyush Rai IIT Kanpur
Probabilistic Linear Classification: Logistic Regression Piyush Rai IIT Kanpur Probabilistic Machine Learning (CS772A) Jan 18, 2016 Probabilistic Machine Learning (CS772A) Probabilistic Linear Classification:
More informationNMR Measurement of T1T2 Spectra with Partial Measurements using Compressive Sensing
NMR Measurement of T1T2 Spectra with Partial Measurements using Compressive Sensing Alex Cloninger Norbert Wiener Center Department of Mathematics University of Maryland, College Park http://www.norbertwiener.umd.edu
More informationLogistic Regression. Jia Li. Department of Statistics The Pennsylvania State University. Logistic Regression
Logistic Regression Department of Statistics The Pennsylvania State University Email: jiali@stat.psu.edu Logistic Regression Preserve linear classification boundaries. By the Bayes rule: Ĝ(x) = arg max
More informationSemiSupervised Support Vector Machines and Application to Spam Filtering
SemiSupervised Support Vector Machines and Application to Spam Filtering Alexander Zien Empirical Inference Department, Bernhard Schölkopf Max Planck Institute for Biological Cybernetics ECML 2006 Discovery
More informationNoisy and Missing Data Regression: DistributionOblivious Support Recovery
: DistributionOblivious Support Recovery Yudong Chen Department of Electrical and Computer Engineering The University of Texas at Austin Austin, TX 7872 Constantine Caramanis Department of Electrical
More informationRandomized Robust Linear Regression for big data applications
Randomized Robust Linear Regression for big data applications Yannis Kopsinis 1 Dept. of Informatics & Telecommunications, UoA Thursday, Apr 16, 2015 In collaboration with S. Chouvardas, Harris Georgiou,
More informationExample: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.
Statistical Learning: Chapter 4 Classification 4.1 Introduction Supervised learning with a categorical (Qualitative) response Notation:  Feature vector X,  qualitative response Y, taking values in C
More informationClassification Problems
Classification Read Chapter 4 in the text by Bishop, except omit Sections 4.1.6, 4.1.7, 4.2.4, 4.3.3, 4.3.5, 4.3.6, 4.4, and 4.5. Also, review sections 1.5.1, 1.5.2, 1.5.3, and 1.5.4. Classification Problems
More informationLABEL PROPAGATION ON GRAPHS. SEMISUPERVISED LEARNING. Changsheng Liu 10302014
LABEL PROPAGATION ON GRAPHS. SEMISUPERVISED LEARNING Changsheng Liu 10302014 Agenda Semi Supervised Learning Topics in Semi Supervised Learning Label Propagation Local and global consistency Graph
More informationLecture Topic: LowRank Approximations
Lecture Topic: LowRank Approximations LowRank Approximations We have seen principal component analysis. The extraction of the first principle eigenvalue could be seen as an approximation of the original
More informationDistributed Machine Learning and Big Data
Distributed Machine Learning and Big Data Sourangshu Bhattacharya Dept. of Computer Science and Engineering, IIT Kharagpur. http://cse.iitkgp.ac.in/~sourangshu/ August 21, 2015 Sourangshu Bhattacharya
More informationIntroduction to Online Learning Theory
Introduction to Online Learning Theory Wojciech Kot lowski Institute of Computing Science, Poznań University of Technology IDSS, 04.06.2013 1 / 53 Outline 1 Example: Online (Stochastic) Gradient Descent
More informationSupervised Learning (Big Data Analytics)
Supervised Learning (Big Data Analytics) Vibhav Gogate Department of Computer Science The University of Texas at Dallas Practical advice Goal of Big Data Analytics Uncover patterns in Data. Can be used
More informationSocial Media Mining. Data Mining Essentials
Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers
More informationSeveral Views of Support Vector Machines
Several Views of Support Vector Machines Ryan M. Rifkin Honda Research Institute USA, Inc. Human Intention Understanding Group 2007 Tikhonov Regularization We are considering algorithms of the form min
More informationLinear Classification. Volker Tresp Summer 2015
Linear Classification Volker Tresp Summer 2015 1 Classification Classification is the central task of pattern recognition Sensors supply information about an object: to which class do the object belong
More informationIntroduction to Support Vector Machines. Colin Campbell, Bristol University
Introduction to Support Vector Machines Colin Campbell, Bristol University 1 Outline of talk. Part 1. An Introduction to SVMs 1.1. SVMs for binary classification. 1.2. Soft margins and multiclass classification.
More informationAdaptive Online Gradient Descent
Adaptive Online Gradient Descent Peter L Bartlett Division of Computer Science Department of Statistics UC Berkeley Berkeley, CA 94709 bartlett@csberkeleyedu Elad Hazan IBM Almaden Research Center 650
More informationMachine Learning and Pattern Recognition Logistic Regression
Machine Learning and Pattern Recognition Logistic Regression Course Lecturer:Amos J Storkey Institute for Adaptive and Neural Computation School of Informatics University of Edinburgh Crichton Street,
More informationVariance Reduction. Pricing American Options. Monte Carlo Option Pricing. Delta and Common Random Numbers
Variance Reduction The statistical efficiency of Monte Carlo simulation can be measured by the variance of its output If this variance can be lowered without changing the expected value, fewer replications
More informationIntroduction to machine learning and pattern recognition Lecture 1 Coryn BailerJones
Introduction to machine learning and pattern recognition Lecture 1 Coryn BailerJones http://www.mpia.de/homes/calj/mlpr_mpia2008.html 1 1 What is machine learning? Data description and interpretation
More informationPredict Influencers in the Social Network
Predict Influencers in the Social Network Ruishan Liu, Yang Zhao and Liuyu Zhou Email: rliu2, yzhao2, lyzhou@stanford.edu Department of Electrical Engineering, Stanford University Abstract Given two persons
More informationSparse modeling: some unifying theory and wordimaging
Sparse modeling: some unifying theory and wordimaging Bin Yu UC Berkeley Departments of Statistics, and EECS Based on joint work with: Sahand Negahban (UC Berkeley) Pradeep Ravikumar (UT Austin) Martin
More informationIntroduction to Machine Learning Lecture 1. Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu
Introduction to Machine Learning Lecture 1 Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu Introduction Logistics Prerequisites: basics concepts needed in probability and statistics
More informationData Mining  Evaluation of Classifiers
Data Mining  Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010
More informationSOLVING LINEAR SYSTEMS
SOLVING LINEAR SYSTEMS Linear systems Ax = b occur widely in applied mathematics They occur as direct formulations of real world problems; but more often, they occur as a part of the numerical analysis
More informationCS 688 Pattern Recognition Lecture 4. Linear Models for Classification
CS 688 Pattern Recognition Lecture 4 Linear Models for Classification Probabilistic generative models Probabilistic discriminative models 1 Generative Approach ( x ) p C k p( C k ) Ck p ( ) ( x Ck ) p(
More informationLargeScale Similarity and Distance Metric Learning
LargeScale Similarity and Distance Metric Learning Aurélien Bellet Télécom ParisTech Joint work with K. Liu, Y. Shi and F. Sha (USC), S. Clémençon and I. Colin (Télécom ParisTech) Séminaire Criteo March
More informationLinear Algebra Review Part 2: Ax=b
Linear Algebra Review Part 2: Ax=b Edwin Olson University of Michigan The ThreeDay Plan Geometry of Linear Algebra Vectors, matrices, basic operations, lines, planes, homogeneous coordinates, transformations
More informationModern Optimization Methods for Big Data Problems MATH11146 The University of Edinburgh
Modern Optimization Methods for Big Data Problems MATH11146 The University of Edinburgh Peter Richtárik Week 3 Randomized Coordinate Descent With Arbitrary Sampling January 27, 2016 1 / 30 The Problem
More informationAbsolute Value Programming
Computational Optimization and Aplications,, 1 11 (2006) c 2006 Springer Verlag, Boston. Manufactured in The Netherlands. Absolute Value Programming O. L. MANGASARIAN olvi@cs.wisc.edu Computer Sciences
More informationSTA 4273H: Statistical Machine Learning
STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.cs.toronto.edu/~rsalakhu/ Lecture 6 Three Approaches to Classification Construct
More informationSketching as a Tool for Numerical Linear Algebra
Sketching as a Tool for Numerical Linear Algebra (Graph Sparsification) David P. Woodruff presented by Sepehr Assadi o(n) Big Data Reading Group University of Pennsylvania April, 2015 Sepehr Assadi (Penn)
More informationLINEAR SYSTEMS. Consider the following example of a linear system:
LINEAR SYSTEMS Consider the following example of a linear system: Its unique solution is x +2x 2 +3x 3 = 5 x + x 3 = 3 3x + x 2 +3x 3 = 3 x =, x 2 =0, x 3 = 2 In general we want to solve n equations in
More informationSublinear Algorithms for Big Data. Part 4: Random Topics
Sublinear Algorithms for Big Data Part 4: Random Topics Qin Zhang 11 21 Topic 1: Compressive sensing Compressive sensing The model (CandesRombergTao 04; Donoho 04) Applicaitons Medical imaging reconstruction
More informationLinkbased Analysis on Large Graphs. Presented by Weiren Yu Mar 01, 2011
Linkbased Analysis on Large Graphs Presented by Weiren Yu Mar 01, 2011 Overview 1 Introduction 2 Problem Definition 3 Optimization Techniques 4 Experimental Results 2 1. Introduction Many applications
More informationTensor Methods for Machine Learning, Computer Vision, and Computer Graphics
Tensor Methods for Machine Learning, Computer Vision, and Computer Graphics Part I: Factorizations and Statistical Modeling/Inference Amnon Shashua School of Computer Science & Eng. The Hebrew University
More informationDATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS
DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS 1 AND ALGORITHMS Chiara Renso KDDLAB ISTI CNR, Pisa, Italy WHAT IS CLUSTER ANALYSIS? Finding groups of objects such that the objects in a group will be similar
More informationA network flow algorithm for reconstructing. binary images from discrete Xrays
A network flow algorithm for reconstructing binary images from discrete Xrays Kees Joost Batenburg Leiden University and CWI, The Netherlands kbatenbu@math.leidenuniv.nl Abstract We present a new algorithm
More informationDecember 4, 2013 MATH 171 BASIC LINEAR ALGEBRA B. KITCHENS
December 4, 2013 MATH 171 BASIC LINEAR ALGEBRA B KITCHENS The equation 1 Lines in twodimensional space (1) 2x y = 3 describes a line in twodimensional space The coefficients of x and y in the equation
More informationBayesian Machine Learning (ML): Modeling And Inference in Big Data. Zhuhua Cai Google, Rice University caizhua@gmail.com
Bayesian Machine Learning (ML): Modeling And Inference in Big Data Zhuhua Cai Google Rice University caizhua@gmail.com 1 Syllabus Bayesian ML Concepts (Today) Bayesian ML on MapReduce (Next morning) Bayesian
More informationFoundations of Machine Learning OnLine Learning. Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu
Foundations of Machine Learning OnLine Learning Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu Motivation PAC learning: distribution fixed over time (training and test). IID assumption.
More informationScalable Machine Learning  or what to do with all that Big Data infrastructure
 or what to do with all that Big Data infrastructure TU Berlin blog.mikiobraun.de Strata+Hadoop World London, 2015 1 Complex Data Analysis at Scale Clickthrough prediction Personalized Spam Detection
More informationChapter 13 Introduction to Nonlinear Regression( 非 線 性 迴 歸 )
Chapter 13 Introduction to Nonlinear Regression( 非 線 性 迴 歸 ) and Neural Networks( 類 神 經 網 路 ) 許 湘 伶 Applied Linear Regression Models (Kutner, Nachtsheim, Neter, Li) hsuhl (NUK) LR Chap 10 1 / 35 13 Examples
More informationTwo Topics in Parametric Integration Applied to Stochastic Simulation in Industrial Engineering
Two Topics in Parametric Integration Applied to Stochastic Simulation in Industrial Engineering Department of Industrial Engineering and Management Sciences Northwestern University September 15th, 2014
More informationBig Data Analytics: Optimization and Randomization
Big Data Analytics: Optimization and Randomization Tianbao Yang, Qihang Lin, Rong Jin Tutorial@SIGKDD 2015 Sydney, Australia Department of Computer Science, The University of Iowa, IA, USA Department of
More informationCSCI567 Machine Learning (Fall 2014)
CSCI567 Machine Learning (Fall 2014) Drs. Sha & Liu {feisha,yanliu.cs}@usc.edu September 22, 2014 Drs. Sha & Liu ({feisha,yanliu.cs}@usc.edu) CSCI567 Machine Learning (Fall 2014) September 22, 2014 1 /
More informationChristfried Webers. Canberra February June 2015
c Statistical Group and College of Engineering and Computer Science Canberra February June (Many figures from C. M. Bishop, "Pattern Recognition and ") 1of 829 c Part VIII Linear Classification 2 Logistic
More informationIN this paper we focus on the problem of largescale, multiclass
IEEE TRANSACTIONS ON PATTERN RECOGNITION AND MACHINE INTELLIGENCE 1 DistanceBased Image Classification: Generalizing to new classes at nearzero cost Thomas Mensink, Member IEEE, Jakob Verbeek, Member,
More informationarxiv:1312.1254v1 [cs.na] 4 Dec 2013
PARALLEL MATRIX FACTORIZATION FOR LOWRANK TENSOR COMPLETION YANGYANG XU, RURU HAO, WOTAO YIN, AND ZHIXUN SU arxiv:1312.1254v1 [cs.na] 4 Dec 2013 Abstract. Higherorder lowrank tensors naturally arise
More informationPa8ern Recogni6on. and Machine Learning. Chapter 4: Linear Models for Classiﬁca6on
Pa8ern Recogni6on and Machine Learning Chapter 4: Linear Models for Classiﬁca6on Represen'ng the target values for classifica'on If there are only two classes, we typically use a single real valued output
More information(Quasi)Newton methods
(Quasi)Newton methods 1 Introduction 1.1 Newton method Newton method is a method to find the zeros of a differentiable nonlinear function g, x such that g(x) = 0, where g : R n R n. Given a starting
More informationBig learning: challenges and opportunities
Big learning: challenges and opportunities Francis Bach SIERRA Projectteam, INRIA  Ecole Normale Supérieure December 2013 Omnipresent digital media Scientific context Big data Multimedia, sensors, indicators,
More informationNonnegative Matrix Factorization (NMF) in Semisupervised Learning Reducing Dimension and Maintaining Meaning
Nonnegative Matrix Factorization (NMF) in Semisupervised Learning Reducing Dimension and Maintaining Meaning SAMSI 10 May 2013 Outline Introduction to NMF Applications Motivations NMF as a middle step
More informationData Mining Clustering (2) Sheets are based on the those provided by Tan, Steinbach, and Kumar. Introduction to Data Mining
Data Mining Clustering (2) Toon Calders Sheets are based on the those provided by Tan, Steinbach, and Kumar. Introduction to Data Mining Outline Partitional Clustering Distancebased Kmeans, Kmedoids,
More informationANALYSIS, THEORY AND DESIGN OF LOGISTIC REGRESSION CLASSIFIERS USED FOR VERY LARGE SCALE DATA MINING
ANALYSIS, THEORY AND DESIGN OF LOGISTIC REGRESSION CLASSIFIERS USED FOR VERY LARGE SCALE DATA MINING BY OMID ROUHANIKALLEH THESIS Submitted as partial fulfillment of the requirements for the degree of
More informationA Negative Result Concerning Explicit Matrices With The Restricted Isometry Property
A Negative Result Concerning Explicit Matrices With The Restricted Isometry Property Venkat Chandar March 1, 2008 Abstract In this note, we prove that matrices whose entries are all 0 or 1 cannot achieve
More informationINTRODUCTION TO NEURAL NETWORKS
INTRODUCTION TO NEURAL NETWORKS Pictures are taken from http://www.cs.cmu.edu/~tom/mlbookchapterslides.html http://research.microsoft.com/~cmbishop/prml/index.htm By Nobel Khandaker Neural Networks An
More informationInner Product Spaces and Orthogonality
Inner Product Spaces and Orthogonality week 34 Fall 2006 Dot product of R n The inner product or dot product of R n is a function, defined by u, v a b + a 2 b 2 + + a n b n for u a, a 2,, a n T, v b,
More informationLearning Gaussian process models from big data. Alan Qi Purdue University Joint work with Z. Xu, F. Yan, B. Dai, and Y. Zhu
Learning Gaussian process models from big data Alan Qi Purdue University Joint work with Z. Xu, F. Yan, B. Dai, and Y. Zhu Machine learning seminar at University of Cambridge, July 4 2012 Data A lot of
More informationClassification by Pairwise Coupling
Classification by Pairwise Coupling TREVOR HASTIE * Stanford University and ROBERT TIBSHIRANI t University of Toronto Abstract We discuss a strategy for polychotomous classification that involves estimating
More informationMVA ENS Cachan. Lecture 2: Logistic regression & intro to MIL Iasonas Kokkinos Iasonas.kokkinos@ecp.fr
Machine Learning for Computer Vision 1 MVA ENS Cachan Lecture 2: Logistic regression & intro to MIL Iasonas Kokkinos Iasonas.kokkinos@ecp.fr Department of Applied Mathematics Ecole Centrale Paris Galen
More informationDATA ANALYSIS II. Matrix Algorithms
DATA ANALYSIS II Matrix Algorithms Similarity Matrix Given a dataset D = {x i }, i=1,..,n consisting of n points in R d, let A denote the n n symmetric similarity matrix between the points, given as where
More informationClassification of Cartan matrices
Chapter 7 Classification of Cartan matrices In this chapter we describe a classification of generalised Cartan matrices This classification can be compared as the rough classification of varieties in terms
More informationClass #6: Nonlinear classification. ML4Bio 2012 February 17 th, 2012 Quaid Morris
Class #6: Nonlinear classification ML4Bio 2012 February 17 th, 2012 Quaid Morris 1 Module #: Title of Module 2 Review Overview Linear separability Nonlinear classification Linear Support Vector Machines
More informationPart II Redundant Dictionaries and Pursuit Algorithms
Aisenstadt Chair Course CRM September 2009 Part II Redundant Dictionaries and Pursuit Algorithms Stéphane Mallat Centre de Mathématiques Appliquées Ecole Polytechnique Sparsity in Redundant Dictionaries
More informationThe Trimmed Iterative Closest Point Algorithm
Image and Pattern Analysis (IPAN) Group Computer and Automation Research Institute, HAS Budapest, Hungary The Trimmed Iterative Closest Point Algorithm Dmitry Chetverikov and Dmitry Stepanov http://visual.ipan.sztaki.hu
More informationWe shall turn our attention to solving linear systems of equations. Ax = b
59 Linear Algebra We shall turn our attention to solving linear systems of equations Ax = b where A R m n, x R n, and b R m. We already saw examples of methods that required the solution of a linear system
More informationAn Introduction to Machine Learning
An Introduction to Machine Learning L5: Novelty Detection and Regression Alexander J. Smola Statistical Machine Learning Program Canberra, ACT 0200 Australia Alex.Smola@nicta.com.au Tata Institute, Pune,
More informationMachine Learning for Data Science (CS4786) Lecture 1
Machine Learning for Data Science (CS4786) Lecture 1 TuTh 10:10 to 11:25 AM Hollister B14 Instructors : Lillian Lee and Karthik Sridharan ROUGH DETAILS ABOUT THE COURSE Diagnostic assignment 0 is out:
More informationGI01/M055 Supervised Learning Proximal Methods
GI01/M055 Supervised Learning Proximal Methods Massimiliano Pontil (based on notes by Luca Baldassarre) (UCL) Proximal Methods 1 / 20 Today s Plan Problem setting Convex analysis concepts Proximal operators
More informationLCs for Binary Classification
Linear Classifiers A linear classifier is a classifier such that classification is performed by a dot product beteen the to vectors representing the document and the category, respectively. Therefore it
More informationOn Covariance Structure in Noisy, Big Data
On Covariance Structure in Noisy, Big Data Randy C. Paffenroth a, Ryan Nong a and Philip C. Du Toit a a Numerica Corporation, Loveland, CO, USA; ABSTRACT Herein we describe theory and algorithms for detecting
More informationC19 Machine Learning
C9 Machine Learning 8 Lectures Hilary Term 25 2 Tutorial Sheets A. Zisserman Overview: Supervised classification perceptron, support vector machine, loss functions, kernels, random forests, neural networks
More informationRegression Using Support Vector Machines: Basic Foundations
Regression Using Support Vector Machines: Basic Foundations Technical Report December 2004 Aly Farag and Refaat M Mohamed Computer Vision and Image Processing Laboratory Electrical and Computer Engineering
More informationClass Overview and General Introduction to Machine Learning
Class Overview and General Introduction to Machine Learning Piyush Rai www.cs.utah.edu/~piyush CS5350/6350: Machine Learning August 23, 2011 (CS5350/6350) Intro to ML August 23, 2011 1 / 25 Course Logistics
More informationBig Data Optimization: Randomized lockfree methods for minimizing partially separable convex functions
Big Data Optimization: Randomized lockfree methods for minimizing partially separable convex functions Peter Richtárik School of Mathematics The University of Edinburgh Joint work with Martin Takáč (Edinburgh)
More informationLassobased Spam Filtering with Chinese Emails
Journal of Computational Information Systems 8: 8 (2012) 3315 3322 Available at http://www.jofcis.com Lassobased Spam Filtering with Chinese Emails Zunxiong LIU 1, Xianlong ZHANG 1,, Shujuan ZHENG 2 1
More informationPattern Analysis. Logistic Regression. 12. Mai 2009. Joachim Hornegger. Chair of Pattern Recognition Erlangen University
Pattern Analysis Logistic Regression 12. Mai 2009 Joachim Hornegger Chair of Pattern Recognition Erlangen University Pattern Analysis 2 / 43 1 Logistic Regression Posteriors and the Logistic Function Decision
More informationMATH 304 Linear Algebra Lecture 18: Rank and nullity of a matrix.
MATH 304 Linear Algebra Lecture 18: Rank and nullity of a matrix. Nullspace Let A = (a ij ) be an m n matrix. Definition. The nullspace of the matrix A, denoted N(A), is the set of all ndimensional column
More information10. Proximal point method
L. Vandenberghe EE236C Spring 201314) 10. Proximal point method proximal point method augmented Lagrangian method MoreauYosida smoothing 101 Proximal point method a conceptual algorithm for minimizing
More informationThe Many Facets of Big Data
Department of Computer Science and Engineering Hong Kong University of Science and Technology Hong Kong ACPR 2013 Big Data 1 volume sample size is big feature dimensionality is big 2 variety multiple formats:
More informationEnsemble Methods. Knowledge Discovery and Data Mining 2 (VU) (707.004) Roman Kern. KTI, TU Graz 20150305
Ensemble Methods Knowledge Discovery and Data Mining 2 (VU) (707004) Roman Kern KTI, TU Graz 20150305 Roman Kern (KTI, TU Graz) Ensemble Methods 20150305 1 / 38 Outline 1 Introduction 2 Classification
More information