Bilinear Prediction Using Low-Rank Models
1 Bilinear Prediction Using Low-Rank Models. Inderjit S. Dhillon, Dept. of Computer Science, UT Austin. 26th International Conference on Algorithmic Learning Theory, Banff, Canada, Oct 6, 2015. Joint work with C-J. Hsieh, P. Jain, N. Natarajan, H. Yu and K. Zhong.
2 Outline: Multi-Target Prediction; Features on Targets: Bilinear Prediction; Inductive Matrix Completion (1. Algorithms, 2. Positive-Unlabeled Matrix Completion, 3. Recovery Guarantees); Experimental Results.
3 Sample Prediction Problems Predicting stock prices Predicting risk factors in healthcare
4 Regression: real-valued response (target) $t$; predict the response for given input data (features) $a$.
5 Linear Regression: estimate the target by a linear function of the given data $a$, i.e. $t \approx \hat{t} = a^T x$.
6 Linear Regression: Least Squares. Choose $x$ that minimizes $J(x) = \frac{1}{2} \sum_{i=1}^{n} (a_i^T x - t_i)^2$. Closed-form solution: $x^* = (A^T A)^{-1} A^T t$.
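As a concrete illustration of the closed form above, a minimal NumPy sketch (variable names are illustrative; `np.linalg.lstsq` is used rather than an explicit inverse for numerical stability):

```python
import numpy as np

def least_squares(A, t):
    """Ordinary least squares: minimize (1/2) * sum_i (a_i^T x - t_i)^2.

    Equivalent to the closed form x* = (A^T A)^{-1} A^T t when A has
    full column rank; lstsq also handles the rank-deficient case.
    """
    x, *_ = np.linalg.lstsq(A, t, rcond=None)
    return x

# toy usage
rng = np.random.default_rng(0)
A = rng.standard_normal((100, 5))
t = A @ rng.standard_normal(5) + 0.01 * rng.standard_normal(100)
x_hat = least_squares(A, t)
```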
7 Prediction Problems: Classification Spam detection Character Recognition
8 Binary Classification: categorical response (target) $t$; predict the response for given input data (features) $a$. Linear methods: the decision boundary is a linear surface or hyperplane.
9 Linear Methods for Prediction Problems.
Regression:
Ridge Regression: $J(x) = \frac{1}{2} \sum_{i=1}^{n} (a_i^T x - t_i)^2 + \lambda \|x\|_2^2$.
Lasso: $J(x) = \frac{1}{2} \sum_{i=1}^{n} (a_i^T x - t_i)^2 + \lambda \|x\|_1$.
Classification:
Linear Support Vector Machines: $J(x) = \frac{1}{2} \sum_{i=1}^{n} \max(0, 1 - t_i a_i^T x) + \lambda \|x\|_2^2$.
Logistic Regression: $J(x) = \frac{1}{2} \sum_{i=1}^{n} \log(1 + \exp(-t_i a_i^T x)) + \lambda \|x\|_2^2$.
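Of these four objectives, only ridge regression has a simple closed form; a small sketch in the same notation (an illustrative addition, not from the slides):

```python
import numpy as np

def ridge(A, t, lam):
    """Ridge regression closed form: x* = (A^T A + lam * I)^{-1} A^T t."""
    d = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ t)
```

The lasso, hinge, and logistic objectives have no closed form and are minimized iteratively (e.g. coordinate descent, sub-gradient, or Newton-type methods).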
10 Linear Prediction
11 Linear Prediction
12 Multi-Target Prediction
13 Modern Prediction Problems in Machine Learning Ad-word Recommendation
14 Modern Prediction Problems in Machine Learning Ad-word Recommendation geico auto insurance geico car insurance car insurance geico insurance need cheap auto insurance geico com car insurance coupon code
15 Modern Prediction Problems in Machine Learning: Wikipedia Tag Recommendation (example tags: Learning in computer vision, Machine learning, Learning, Cybernetics).
16 Modern Prediction Problems in Machine Learning Predicting causal disease genes
17 Prediction with Multiple Targets. In many domains, the goal is to simultaneously predict multiple target variables. Multi-target regression: targets are real-valued. Multi-label classification: targets are binary.
18 Prediction with Multiple Targets Applications Bid word recommendation Tag recommendation Disease-gene linkage prediction Medical diagnoses Ecological modeling...
19 Prediction with Multiple Targets. Input data $a_i$ is associated with $m$ targets, $t_i = (t_i^{(1)}, t_i^{(2)}, \ldots, t_i^{(m)})$.
20 Multi-target Linear Prediction. Basic model: treat targets independently; estimate regression coefficients $x_j$ for each target $j$.
21 Multi-target Linear Prediction. Assume targets $t^{(j)}$ are independent. Linear predictive model: $t_i \approx a_i^T X$.
22 Multi-target Linear Prediction. Assume targets $t^{(j)}$ are independent. Linear predictive model: $t_i \approx a_i^T X$. The multi-target regression problem has a closed-form solution: $V_A \Sigma_A^{-1} U_A^T T = \arg\min_X \|T - AX\|_F^2$, where $A = U_A \Sigma_A V_A^T$ is the thin SVD of $A$.
23 Multi-target Linear Prediction (continued). In multi-label classification: Binary Relevance (an independent binary classifier for each label).
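A minimal NumPy sketch of the closed form above (assuming $A$ has full column rank so the thin SVD has strictly positive singular values):

```python
import numpy as np

def multi_target_ls(A, T):
    """Independent multi-target least squares: X* = V_A Sigma_A^{-1} U_A^T T,
    the minimizer of ||T - A X||_F^2.  A: (n, d), T: (n, m), X: (d, m)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)   # thin SVD of A
    return Vt.T @ ((U.T @ T) / s[:, None])
```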
24 Multi-target Linear Prediction: Low-rank Model. Exploit correlations between targets $T$, where $T \approx AX$. Reduced-Rank Regression [A. J. Izenman, 1975]: model the coefficient matrix $X$ as low-rank. (A. J. Izenman. Reduced-rank regression for the multivariate linear model. Journal of Multivariate Analysis 5.2 (1975).)
25 Multi-target Linear Prediction: Low-rank Model. $X$ is rank-$k$. Linear predictive model: $t_i \approx a_i^T X$.
26 Multi-target Linear Prediction: Low-rank Model (continued). The low-rank multi-target regression problem has a closed-form solution:
$$X^* = \arg\min_{X:\, \mathrm{rank}(X) \le k} \|T - AX\|_F^2 = \begin{cases} V_A \Sigma_A^{-1} U_A^T T_k & \text{if } A \text{ is full row rank}, \\ V_A \Sigma_A^{-1} M_k & \text{otherwise}, \end{cases}$$
where $A = U_A \Sigma_A V_A^T$ is the thin SVD of $A$, $M = U_A^T T$, and $T_k$, $M_k$ are the best rank-$k$ approximations of $T$ and $M$ respectively.
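A sketch of the general branch of this closed form ($M_k$ computed via a second SVD; names illustrative):

```python
import numpy as np

def reduced_rank_regression(A, T, k):
    """Rank-k multi-target regression: with A = U_A S_A V_A^T (thin SVD) and
    M = U_A^T T, return X* = V_A S_A^{-1} M_k, where M_k is the best rank-k
    approximation of M."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    M = U.T @ T
    Um, sm, Vmt = np.linalg.svd(M, full_matrices=False)
    M_k = (Um[:, :k] * sm[:k]) @ Vmt[:k, :]           # best rank-k approx of M
    return Vt.T @ (M_k / s[:, None])
```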
27 Modern Challenges
28 Multi-target Prediction with Missing Values. In many applications, several observations (targets) may be missing, e.g. when recommending tags for images and Wikipedia articles.
29 Modern Prediction Problems in Machine Learning Ad-word Recommendation geico auto insurance geico car insurance car insurance geico insurance need cheap auto insurance geico com car insurance coupon code
30 Multi-target Prediction with Missing Values
31 Multi-target Prediction with Missing Values. Low-rank model: $t_i = a_i^T X$, where $X$ is low-rank.
32 Canonical Correlation Analysis
33 Bilinear Prediction
34 Bilinear Prediction. Augment multi-target prediction with features on the targets as well. Motivated by modern applications of machine learning: bioinformatics, auto-tagging articles. Need to model dyadic or pairwise interactions. Move from linear models to bilinear models: linear in the input features as well as the target features.
35 Bilinear Prediction
36 Bilinear Prediction
37 Bilinear Prediction. Bilinear predictive model: $T_{ij} \approx a_i^T X b_j$.
38 Bilinear Prediction (continued). The corresponding regression problem has a closed-form solution: $V_A \Sigma_A^{-1} U_A^T T U_B \Sigma_B^{-1} V_B^T = \arg\min_X \|T - AXB^T\|_F^2$, where $A = U_A \Sigma_A V_A^T$ and $B = U_B \Sigma_B V_B^T$ are the thin SVDs of $A$ and $B$.
39 Bilinear Prediction: Low-rank Model. $X$ is rank-$k$. Bilinear predictive model: $T_{ij} \approx a_i^T X b_j$.
40 Bilinear Prediction: Low-rank Model (continued). The corresponding regression problem has a closed-form solution:
$$X^* = \arg\min_{X:\, \mathrm{rank}(X) \le k} \|T - AXB^T\|_F^2 = \begin{cases} V_A \Sigma_A^{-1} U_A^T T_k U_B \Sigma_B^{-1} V_B^T & \text{if } A, B \text{ are full row rank}, \\ V_A \Sigma_A^{-1} M_k \Sigma_B^{-1} V_B^T & \text{otherwise}, \end{cases}$$
where $A = U_A \Sigma_A V_A^T$, $B = U_B \Sigma_B V_B^T$ are the thin SVDs of $A$ and $B$, $M = U_A^T T U_B$, and $T_k$, $M_k$ are the best rank-$k$ approximations of $T$ and $M$.
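The same recipe extends to the bilinear case; a sketch of the general branch of the closed form above (assuming thin SVDs with nonzero singular values):

```python
import numpy as np

def bilinear_reduced_rank(A, B, T, k):
    """argmin_{rank(X) <= k} ||T - A X B^T||_F^2 via the closed form:
    X* = V_A S_A^{-1} M_k S_B^{-1} V_B^T with M = U_A^T T U_B."""
    Ua, sa, Vat = np.linalg.svd(A, full_matrices=False)
    Ub, sb, Vbt = np.linalg.svd(B, full_matrices=False)
    M = Ua.T @ T @ Ub
    Um, sm, Vmt = np.linalg.svd(M, full_matrices=False)
    M_k = (Um[:, :k] * sm[:k]) @ Vmt[:k, :]           # best rank-k approx of M
    return Vat.T @ ((M_k / sa[:, None]) / sb[None, :]) @ Vbt
```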
41 Modern Challenges in Multi-Target Prediction Millions of targets Correlations among targets Missing values
42 Modern Challenges in Multi-Target Prediction Millions of targets (Scalable) Correlations among targets (Low-rank) Missing values (Inductive Matrix Completion)
43 Bilinear Prediction with Missing Values
44 Matrix Completion. Missing value estimation problem. Matrix Completion: recover a low-rank matrix from observed entries. Exact recovery requires $O(kn \log^2 n)$ samples, under the assumptions of: (1) uniform sampling, (2) incoherence.
45 Inductive Matrix Completion: bilinear low-rank prediction with missing values. Degrees of freedom in $X$ are $O(kd)$. Can we get better sample complexity (than $O(kn)$)?
46 Algorithm 1: Convex Relaxation. Nuclear-norm minimization:
$$\min_X \|X\|_* \quad \text{s.t.} \quad a_i^T X b_j = T_{ij}, \; (i, j) \in \Omega.$$
Computationally expensive. Sample complexity for exact recovery: $O(kd \log d \log n)$. Conditions for exact recovery: C1. Incoherence on $A$, $B$. C2. Incoherence on $AU^*$ and $BV^*$, where $X^* = U^* \Sigma^* V^{*T}$ is the SVD of the ground truth $X^*$. C1 and C2 are satisfied with high probability when $A$, $B$ are Gaussian.
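For small problems, the convex relaxation above can be written directly with an off-the-shelf solver; a sketch assuming the cvxpy package (an illustrative addition, not a solver used in the talk):

```python
import numpy as np
import cvxpy as cp

def nuclear_norm_imc(A, B, T, omega):
    """min ||X||_*  s.t.  a_i^T X b_j = T_ij for (i, j) in omega.

    A: (n1, d1), B: (n2, d2), omega: list of observed (i, j) index pairs.
    Practical only for small d1, d2; the equality constraints assume
    noiseless observations.
    """
    X = cp.Variable((A.shape[1], B.shape[1]))
    constraints = [A[i] @ X @ B[j] == T[i, j] for (i, j) in omega]
    problem = cp.Problem(cp.Minimize(cp.normNuc(X)), constraints)
    problem.solve()
    return X.value
```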
47 Algorithm 1: Convex Relaxation.
Theorem (Recovery Guarantees for Nuclear-norm Minimization). Let $X^* = U^* \Sigma^* V^{*T} \in \mathbb{R}^{d \times d}$ be the SVD of $X^*$ with rank $k$. Assume $A$, $B$ are orthonormal matrices w.l.o.g., satisfying the incoherence conditions. Then if $\Omega$ is uniformly observed with $|\Omega| \ge O(kd \log d \log n)$, the solution of the nuclear-norm minimization problem is unique and equal to $X^*$ with high probability.
The incoherence conditions are:
C1. $\max_{i \in [n]} \|a_i\|_2^2 \le \frac{\mu d}{n}$, $\max_{j \in [n]} \|b_j\|_2^2 \le \frac{\mu d}{n}$;
C2. $\max_{i \in [n]} \|U^{*T} a_i\|_2^2 \le \frac{\mu_0 k}{n}$, $\max_{j \in [n]} \|V^{*T} b_j\|_2^2 \le \frac{\mu_0 k}{n}$.
48 Algorithm 2: Alternating Least Squares (ALS):
$$\min_{Y \in \mathbb{R}^{d_1 \times k},\, Z \in \mathbb{R}^{d_2 \times k}} \sum_{(i,j) \in \Omega} (a_i^T Y Z^T b_j - T_{ij})^2.$$
Non-convex optimization; alternately minimize w.r.t. $Y$ and $Z$.
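A dense, unoptimized NumPy sketch of these alternating updates; the small ridge term `lam` and the random initialization are assumptions made for the sketch, not part of the analysis:

```python
import numpy as np

def als_imc(A, B, T, omega, k, iters=30, lam=1e-6):
    """ALS for min_{Y,Z} sum_{(i,j) in omega} (a_i^T Y Z^T b_j - T_ij)^2.

    A: (n1, d1), B: (n2, d2), T: (n1, n2) with entries used only on omega.
    Each update solves an exact least-squares problem in the free factor.
    """
    d1, d2 = A.shape[1], B.shape[1]
    rng = np.random.default_rng(0)
    Y = rng.standard_normal((d1, k))
    Z = rng.standard_normal((d2, k))
    t = np.array([T[i, j] for (i, j) in omega])

    def ls_update(rows, dim):
        G = np.stack(rows)                              # |omega| x (dim * k)
        w = np.linalg.solve(G.T @ G + lam * np.eye(dim * k), G.T @ t)
        return w.reshape(dim, k)

    for _ in range(iters):
        # a_i^T Y Z^T b_j = <outer(b_j, Y^T a_i), Z>: update Z with Y fixed
        Z = ls_update([np.outer(B[j], A[i] @ Y).ravel() for (i, j) in omega], d2)
        # a_i^T Y Z^T b_j = <outer(a_i, Z^T b_j), Y>: update Y with Z fixed
        Y = ls_update([np.outer(A[i], B[j] @ Z).ravel() for (i, j) in omega], d1)
    return Y, Z
```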
49 Algorithm 2: Alternating Least Squares. Computational complexity of ALS. At the $h$-th iteration, fixing $Y_h$, solve the least squares problem for $Z_{h+1}$:
$$\sum_{(i,j) \in \Omega} (\tilde{a}_i^T Z_{h+1}^T b_j)\, b_j \tilde{a}_i^T = \sum_{(i,j) \in \Omega} T_{ij}\, b_j \tilde{a}_i^T,$$
where $\tilde{a}_i = Y_h^T a_i$. Similarly, solve for $Y_h$ when fixing $Z_h$.
1. Closed form: $O(|\Omega| k^2 d\, (\mathrm{nnz}(A) + \mathrm{nnz}(B))/n + k^3 d^3)$.
2. Vanilla conjugate gradient: $O(|\Omega| k\, (\mathrm{nnz}(A) + \mathrm{nnz}(B))/n)$ per iteration.
3. Exploit the structure for conjugate gradient:
$$\sum_{(i,j) \in \Omega} (\tilde{a}_i^T Z^T b_j)\, b_j \tilde{a}_i^T = B^T D \tilde{A},$$
where $D$ is a sparse matrix with $D_{ji} = \tilde{a}_i^T Z^T b_j$ for $(i,j) \in \Omega$, and $\tilde{A} = A Y_h$. Only $O((\mathrm{nnz}(A) + \mathrm{nnz}(B) + |\Omega|)\, k)$ per iteration.
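A sketch of the structured computation in item 3, assuming SciPy for the sparse matrix $D$ (the index arrays `rows`/`cols` hold the observed positions; names illustrative):

```python
import numpy as np
from scipy.sparse import csr_matrix

def structured_lhs(A, B, Y_h, Z, rows, cols):
    """Computes sum_{(i,j) in omega} (a~_i^T Z^T b_j) b_j a~_i^T = B^T D A~,
    with a~_i = Y_h^T a_i, A~ = A Y_h, and D sparse, D[j, i] = a~_i^T Z^T b_j.
    Cost is O((nnz(A) + nnz(B) + |omega|) k) when A and B are sparse."""
    A_tilde = A @ Y_h                                   # n1 x k
    B_tilde = B @ Z                                     # n2 x k
    vals = np.einsum('ij,ij->i', A_tilde[rows], B_tilde[cols])
    D = csr_matrix((vals, (cols, rows)), shape=(B.shape[0], A.shape[0]))
    return B.T @ (D @ A_tilde)                          # d2 x k
```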
50 Algorithm 2: Alternating Least Squares.
Theorem (Convergence Guarantees for ALS). Let $X^*$ be a rank-$k$ matrix with condition number $\beta$, and $T = A X^* B^T$. Assume $A$, $B$ are orthogonal w.l.o.g. and satisfy the incoherence conditions. If $\Omega$ is uniformly sampled with $|\Omega| \ge O(k^4 \beta^2 d \log d)$, then after $H$ iterations of ALS, $\|Y_H Z_{H+1}^T - X^*\|_2 \le \epsilon$, where $H = O(\log(\|X^*\|_F / \epsilon))$.
The incoherence conditions are:
C1. $\max_{i \in [n]} \|a_i\|_2^2 \le \frac{\mu d}{n}$, $\max_{j \in [n]} \|b_j\|_2^2 \le \frac{\mu d}{n}$;
C2. $\max_{i \in [n]} \|Y_h^T a_i\|_2^2 \le \frac{\mu_0 k}{n}$, $\max_{j \in [n]} \|Z_h^T b_j\|_2^2 \le \frac{\mu_0 k}{n}$, for all the $Y_h$'s and $Z_h$'s generated by ALS.
51 Algorithm 2: Alternating Least Squares. Proof sketch for ALS. Consider the case when the rank $k = 1$:
$$\min_{y \in \mathbb{R}^{d_1},\, z \in \mathbb{R}^{d_2}} \sum_{(i,j) \in \Omega} (a_i^T y z^T b_j - T_{ij})^2.$$
52 Algorithm 2: Alternating Least Squares. Proof sketch for rank-1 ALS, $\min_{y, z} \sum_{(i,j) \in \Omega} (a_i^T y z^T b_j - T_{ij})^2$.
(a) Let $X^* = \sigma^* y^* z^{*T}$ be the thin SVD of $X^*$ and assume $A$ and $B$ are orthogonal w.l.o.g.
(b) In the absence of missing values, ALS = power method:
$$\nabla_z \|A y_h z^T B^T - T\|_F^2 = 2 B^T (B z y_h^T A^T - T^T) A y_h = 2 (z \|y_h\|^2 - B^T T^T A y_h),$$
so the updates are $z_{h+1} \propto (A^T T B)^T y_h$ (normalize $z_{h+1}$) and $y_{h+1} \propto (A^T T B)\, z_{h+1}$ (normalize $y_{h+1}$). Note that $A^T T B = A^T A X^* B^T B = X^*$, and the power method converges to the optimal.
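A tiny sketch of observation (b): with no missing entries, the rank-1 ALS updates reduce to power iteration on $A^T T B$ (illustrative code, not from the talk):

```python
import numpy as np

def rank1_als_no_missing(A, B, T, iters=100):
    """Rank-1 ALS with all entries observed, written as power iteration on
    S = A^T T B (which equals X* when T = A X* B^T and A, B are orthogonal)."""
    S = A.T @ T @ B
    rng = np.random.default_rng(0)
    y = rng.standard_normal(S.shape[0])
    y /= np.linalg.norm(y)
    for _ in range(iters):
        z = S.T @ y
        z /= np.linalg.norm(z)
        y = S @ z
        y /= np.linalg.norm(y)
    sigma = y @ S @ z
    return sigma, y, z            # X* is approximately sigma * outer(y, z)
```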
53 Algorithm 2: Alternating Least Squares. Proof sketch for rank-1 ALS (continued).
(c) With missing values, ALS is a variant of the power method with noise in each iteration:
$$z_{h+1} \leftarrow \mathrm{QR}\Big(\underbrace{X^{*T} y_h}_{\text{power method}} \;-\; \underbrace{\sigma^* N^{-1} \big((y^{*T} y_h) N - \tilde{N}\big) z^*}_{\text{noise term } g}\Big),$$
where $N = \sum_{(i,j) \in \Omega} b_j a_i^T y_h y_h^T a_i b_j^T$ and $\tilde{N} = \sum_{(i,j) \in \Omega} b_j a_i^T y_h y^{*T} a_i b_j^T$.
(d) Given C1 and C2, the noise term $g = \sigma^* N^{-1} ((y^{*T} y_h) N - \tilde{N}) z^*$ becomes smaller as the iterate gets closer to the optimum, with a bound of the form $\|g\| \le c\, (1 - (y_h^T y^*)^2)$ for a small constant $c$.
54 Algorithm 2: Alternating Least Squares. Proof sketch for rank-1 ALS (continued).
(e) Given C1 and C2, the first iterate $y_0$ is well initialized, i.e. $y_0^T y^* \ge 0.9$, which guarantees that the initial noise is small enough.
(f) The iterates can then be shown to converge linearly to the optimum:
$$1 - (z_{h+1}^T z^*)^2 \le c\, (1 - (y_h^T y^*)^2), \qquad 1 - (y_{h+1}^T y^*)^2 \le c\, (1 - (z_{h+1}^T z^*)^2),$$
for a contraction factor $c < 1$.
55 Similarly, the rank-$k$ case can be proved.
56 Inductive Matrix Completion: Sample Complexity. Sample complexity of Inductive Matrix Completion (IMC) and Matrix Completion (MC):

Method        | IMC                         | MC
Nuclear-norm  | $O(kd \log n \log d)$       | $O(kn \log^2 n)$ (Recht, 2011)
ALS           | $O(k^4 \beta^2 d \log d)$   | $O(k^3 \beta^2 n \log n)$ (Hardt, 2014)

where $\beta$ is the condition number of $X^*$. In most cases, $n \gg d$. Incoherence conditions on $A$, $B$ are required; they are satisfied e.g. when $A$, $B$ are Gaussian (no assumption on $X^*$ needed).
B. Recht. A simpler approach to matrix completion. Journal of Machine Learning Research 12 (2011).
M. Hardt. Understanding alternating minimization for matrix completion. FOCS, IEEE 55th Annual Symposium (2014).
57 Inductive Matrix Completion: Sample Complexity Results. All matrices are sampled from a Gaussian random distribution. Left two figures: fix $k = 5$, $n = 1000$ and vary $d$. Right two figures: fix $k = 5$, $d = 50$ and vary $n$. Darkness of the shading is proportional to the number of failures (repeated 10 times). [Figure: number of measurements $|\Omega|$ vs. $d$ (ALS), $|\Omega|$ vs. $d$ (Nuclear), $|\Omega|$ vs. $n$ (ALS), $|\Omega|$ vs. $n$ (Nuclear).] The sample complexity is proportional to $d$ while almost independent of $n$ for both the nuclear-norm and ALS methods.
58 Positive-Unlabeled Learning
59 Modern Prediction Problems in Machine Learning Predicting causal disease genes
60 Bilinear Prediction: PU Learning In many applications, only positive labels are observed
61 PU Learning No observations of the negative class available
62 PU Inductive Matrix Completion. The guarantees so far assume observations are sampled uniformly. What can we say about the case when the observations are all 1's ("positives")? Typically, 99% of the entries are missing ("unlabeled").
63 PU Inductive Matrix Completion. Inductive Matrix Completion:
$$\min_{X:\, \|X\|_* \le t} \sum_{(i,j) \in \Omega} (a_i^T X b_j - T_{ij})^2.$$
A commonly used PU strategy is Biased Matrix Completion:
$$\min_{X:\, \|X\|_* \le t} \; \alpha \sum_{(i,j) \in \Omega} (a_i^T X b_j - T_{ij})^2 + (1 - \alpha) \sum_{(i,j) \in \bar{\Omega}} (a_i^T X b_j - 0)^2,$$
with $\alpha > 1 - \alpha$ (typically $\alpha \approx 0.9$). (V. Sindhwani, S. S. Bucak, J. Hu, A. Mojsilovic. One-class matrix completion with low-density factorizations. ICDM.)
64 We can show guarantees for the biased formulation.
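A sketch of the biased objective as a function of $X$ (dense evaluation, only to make the weighting explicit; in practice $X = YZ^T$ is kept factored and the full prediction matrix is never formed):

```python
import numpy as np

def biased_pu_loss(A, B, X, omega, alpha=0.9):
    """alpha * sum_{(i,j) in omega} (a_i^T X b_j - 1)^2
       + (1 - alpha) * sum_{(i,j) not in omega} (a_i^T X b_j)^2,
    treating observed entries as 1 (positives) and the rest as 0 (unlabeled)."""
    P = A @ X @ B.T
    obs = np.array([P[i, j] for (i, j) in omega])
    pos_term = np.sum((obs - 1.0) ** 2)
    # sum over unobserved entries = sum over all entries minus observed ones
    neg_term = np.sum(P ** 2) - np.sum(obs ** 2)
    return alpha * pos_term + (1.0 - alpha) * neg_term
```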
65 PU Learning: Random Noise Model. Can be formulated as learning with class-conditional noise. (N. Natarajan, I. S. Dhillon, P. Ravikumar, and A. Tewari. Learning with Noisy Labels. In Advances in Neural Information Processing Systems.)
66 PU Inductive Matrix Completion. A deterministic PU learning model:
$$T_{ij} = \begin{cases} 1 & \text{if } M_{ij} > 0.5, \\ 0 & \text{if } M_{ij} \le 0.5. \end{cases}$$
67 PU Inductive Matrix Completion. A deterministic PU learning model: $P(\tilde{T}_{ij} = 0 \mid T_{ij} = 1) = \rho$ and $P(\tilde{T}_{ij} = 0 \mid T_{ij} = 0) = 1$. We are given only $\tilde{T}$, but not $T$ or $M$. Goal: recover $T$ given $\tilde{T}$ (recovering $M$ is not possible!).
68 Algorithm 1: Biased Inductive Matrix Completion.
$$\hat{X} = \arg\min_{X:\, \|X\|_* \le t} \; \alpha \sum_{(i,j) \in \Omega} (a_i^T X b_j - 1)^2 + (1 - \alpha) \sum_{(i,j) \in \bar{\Omega}} (a_i^T X b_j - 0)^2.$$
Rationale:
(a) Fix $\alpha = (1 + \rho)/2$ and define $\hat{T}_{ij} = I[(A \hat{X} B^T)_{ij} > 0.5]$.
(b) The above problem is equivalent to $\hat{X} = \arg\min_{X:\, \|X\|_* \le t} \sum_{i,j} \ell_\alpha((A X B^T)_{ij}, \tilde{T}_{ij})$, where $\ell_\alpha(x, \tilde{T}_{ij}) = \alpha \tilde{T}_{ij} (x - 1)^2 + (1 - \alpha)(1 - \tilde{T}_{ij}) x^2$.
(c) Minimizing the $\ell_\alpha$ loss is equivalent to minimizing the true error, i.e.
$$\frac{1}{mn} \sum_{i,j} \ell_\alpha((A X B^T)_{ij}, \tilde{T}_{ij}) = C_1 \frac{1}{mn} \|T - \hat{T}\|_F^2 + C_2.$$
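A small sketch of steps (a) and (b): the per-entry loss $\ell_\alpha$ and the thresholding that recovers the predicted labels (the value of rho below is purely illustrative):

```python
import numpy as np

def l_alpha(x, t_obs, alpha):
    """l_alpha(x, T~_ij) = alpha * T~_ij * (x - 1)^2 + (1 - alpha) * (1 - T~_ij) * x^2."""
    return alpha * t_obs * (x - 1.0) ** 2 + (1.0 - alpha) * (1.0 - t_obs) * x ** 2

def recover_labels(A, B, X_hat):
    """T^_ij = I[(A X^ B^T)_ij > 0.5]: threshold the fitted bilinear scores."""
    return (A @ X_hat @ B.T > 0.5).astype(int)

rho = 0.2                      # illustrative one-sided noise rate
alpha = (1.0 + rho) / 2.0      # the weighting prescribed in step (a)
```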
69 Algorithm 1: Biased Inductive Matrix Completion.
Theorem (Error Bound for Biased IMC). Assume the ground-truth $X^*$ satisfies $\|X^*\|_* \le t$ (where $M = A X^* B^T$). Define $\hat{T}_{ij} = I[(A \hat{X} B^T)_{ij} > 0.5]$, $a = \max_i \|a_i\|$ and $b = \max_i \|b_i\|$. If $\alpha = \frac{1 + \rho}{2}$, then with probability at least $1 - \delta$,
$$\frac{1}{n^2} \|T - \hat{T}\|_F^2 = O\left( \frac{\eta \sqrt{\log(2/\delta)}}{n (1 - \rho)} + \frac{\eta\, t\, a\, b\, \sqrt{\log 2d}}{(1 - \rho)\, n^{3/2}} \right),$$
where $\eta = 4(1 + 2\rho)$.
(C-J. Hsieh, N. Natarajan, and I. S. Dhillon. PU Learning for Matrix Completion. In Proceedings of the 32nd International Conference on Machine Learning (2015).)
70 Experimental Results
71 Multi-target Prediction: Image Tag Recommendation NUS-Wide Image Dataset 161,780 training images 107,879 test images 1,134 features 1,000 tags
72 Multi-target Prediction: Image Tag Recommendation. (H. F. Yu, P. Jain, P. Kar, and I. S. Dhillon. Large-scale Multi-label Learning with Missing Labels. In Proceedings of the 31st International Conference on Machine Learning (2014).)
73 Multi-target Prediction: Image Tag Recommendation. [Table: low-rank models with $k = 50$ and $k = 100$, comparing LEML (ALS) and WSABIE on training time (s), prec@1, prec@3, and AUC.] (H. F. Yu, P. Jain, P. Kar, and I. S. Dhillon. Large-scale Multi-label Learning with Missing Labels. In Proceedings of the 31st International Conference on Machine Learning (2014).)
74 Multi-target Prediction: Wikipedia Tag Recommendation Wikipedia Dataset 881,805 training wiki pages 10,000 test wiki pages 366,932 features 213,707 tags
75 Multi-target Prediction: Wikipedia Tag Recommendation. [Table: low-rank models with $k = 250$ and $k = 500$, comparing LEML (ALS) and WSABIE on training time (s), prec@1, prec@3, and AUC.] (H. F. Yu, P. Jain, P. Kar, and I. S. Dhillon. Large-scale Multi-label Learning with Missing Labels. In Proceedings of the 31st International Conference on Machine Learning (2014).)
76 PU Inductive Matrix Completion: Gene-Disease Prediction. (N. Natarajan and I. S. Dhillon. Inductive matrix completion for predicting gene-disease associations. Bioinformatics, 30(12), i60-i68 (2014).)
77 PU Inductive Matrix Completion: Gene-Disease Prediction. [Figure: predicting gene-disease associations in the OMIM data set; P(hidden gene among genes looked at) vs. number of genes looked at, and precision (%) vs. recall (%), comparing IMC, LEML, Catapult, Katz, and Matrix Completion on C.] (N. Natarajan and I. S. Dhillon. Inductive matrix completion for predicting gene-disease associations. Bioinformatics, 30(12), i60-i68 (2014).)
78 PU Inductive Matrix Completion: Gene-Disease Prediction. [Figure: predicting genes for diseases with no training associations; P(hidden gene among genes looked at) vs. number of genes looked at, comparing IMC, LEML, Catapult, Katz, and Matrix Completion on C.] (N. Natarajan and I. S. Dhillon. Inductive matrix completion for predicting gene-disease associations. Bioinformatics, 30(12), i60-i68 (2014).)
79 Conclusions and Future Work. Inductive Matrix Completion: scales to millions of targets; captures correlations among targets; overcomes missing values; extends to PU learning. Much work remains: other structures (low-rank + sparse, low-rank + column-sparse for outliers)? Different loss functions? Handling time as one of the dimensions, incorporating smoothness through graph regularization? Incorporating non-linearities? Efficient (parallel) implementations? Improved recovery guarantees?
80 References
[1] P. Jain and I. S. Dhillon. Provable inductive matrix completion. arXiv preprint (2013).
[2] K. Zhong, P. Jain, and I. S. Dhillon. Efficient Matrix Sensing Using Rank-1 Gaussian Measurements. In Proceedings of the 26th Conference on Algorithmic Learning Theory (2015).
[3] N. Natarajan and I. S. Dhillon. Inductive matrix completion for predicting gene-disease associations. Bioinformatics, 30(12), i60-i68 (2014).
[4] H. F. Yu, P. Jain, P. Kar, and I. S. Dhillon. Large-scale Multi-label Learning with Missing Labels. In Proceedings of the 31st International Conference on Machine Learning (2014).
[5] C-J. Hsieh, N. Natarajan, and I. S. Dhillon. PU Learning for Matrix Completion. In Proceedings of the 32nd International Conference on Machine Learning (2015).