Regularization, Ridge Regression
Machine Learning CSEP546 — Carlos Guestrin, University of Washington — January 13

The regression problem
- Instances: <x_j, t_j>
- Learn: a mapping from x to t(x)
- Hypothesis space: given basis functions h_1, ..., h_k, find coefficients w = {w_1, ..., w_k}
- Why is this called linear regression? Because the model is linear in the parameters w.
- Precisely, minimize the residual squared error:
  $\hat{w} = \arg\min_w \sum_{j=1}^N \Big( t(x_j) - \sum_{i=1}^k w_i h_i(x_j) \Big)^2$
The regression problem in matrix notation
- Stack the data as $t \approx Hw$: H is the N × k design matrix (N data points, k basis functions), w is the k × 1 weight vector, and t is the N × 1 vector of observations.
- Regression solution = simple matrix operations:
  $\hat{w} = (H^T H)^{-1} H^T t$
  where $H^T H$ is a k × k matrix (for k basis functions) and $H^T t$ is a k × 1 vector.
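To make the matrix solution concrete, here is a minimal Python sketch (not from the slides); the 1-D dataset and the polynomial basis are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)                    # N = 50 data points
t = np.sin(3 * x) + 0.1 * rng.standard_normal(50)  # noisy targets t(x)

# Design matrix H: N rows, one column per basis function h_i(x) = x^i
K = 5
H = np.vander(x, K + 1, increasing=True)           # columns 1, x, x^2, ..., x^K

# w_hat = (H^T H)^{-1} H^T t  -- solved via lstsq for numerical stability
w_hat, *_ = np.linalg.lstsq(H, t, rcond=None)
print(w_hat)
```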
Bias-Variance Tradeoff
- Choice of hypothesis class introduces learning bias.
- More complex class → less bias.
- More complex class → more variance.

[Figure: test set error as a function of model complexity]
Overfitting
- Overfitting: a learning algorithm overfits the training data if it outputs a solution w when there exists another solution w' such that:
  $error_{train}(w) < error_{train}(w')$ and $error_{true}(w') < error_{true}(w)$

Regularization in Linear Regression
- Overfitting usually leads to very large parameter choices, e.g. coefficients that blow up from values like 0.30 X to values like …,700,910.7 X − 8,585,638.4 X² + …
- Regularized or penalized regression aims to impose a complexity penalty by penalizing large weights — a "shrinkage" method.
Quadratic Penalty (regularization)
- What we thought we wanted to minimize: the residual squared error above.
- But the weights got too big — so penalize large weights.

Ridge Regression
- Ameliorates issues with overfitting. New objective:
  $\hat{w}_{ridge} = \arg\min_w \sum_{j=1}^N \Big( t(x_j) - \big(w_0 + \sum_{i=1}^k w_i h_i(x_j)\big) \Big)^2 + \lambda \sum_{i=1}^k w_i^2$
Ridge Regression in Matrix Notation
$\hat{w}_{ridge} = \arg\min_w \sum_{j=1}^N \Big( t(x_j) - \big(w_0 + \sum_{i=1}^k w_i h_i(x_j)\big) \Big)^2 + \lambda \sum_{i=1}^k w_i^2$
In matrix form, H is the N × (K+1) design matrix with columns h_0, ..., h_k (N data points, K+1 basis functions), w the weights, t the observations, and $I_{0+k}$ the (k+1) × (k+1) identity with a 0 in the first entry, so the intercept w_0 is not penalized.

Minimizing the Ridge Regression Objective
The first term is the MLE (least-squares) objective, the second the quadratic penalty:
$\hat{w}_{ridge} = \arg\min_w \sum_{j=1}^N \Big( t(x_j) - \big(w_0 + \sum_{i=1}^k w_i h_i(x_j)\big) \Big)^2 + \lambda \sum_{i=1}^k w_i^2 = (Hw - t)^T (Hw - t) + \lambda\, w^T I_{0+k}\, w$
Shrinkage Properties
$\hat{w}_{ridge} = (H^T H + \lambda I_{0+k})^{-1} H^T t$
- If orthonormal features/basis, $H^T H = I$, so each ridge weight is the least-squares weight shrunk by a constant factor: $\hat{w}_{ridge} = \hat{w}_{LS} / (1 + \lambda)$.

Ridge Regression: Effect of Regularization
$\hat{w}_{ridge} = \arg\min_w \sum_{j=1}^N \Big( t(x_j) - \big(w_0 + \sum_{i=1}^k w_i h_i(x_j)\big) \Big)^2 + \lambda \sum_{i=1}^k w_i^2$
- The solution is indexed by the regularization parameter λ: larger λ means more shrinkage, smaller λ less.
- As λ → 0, the solution approaches the unregularized least-squares fit; as λ → ∞, the penalized weights go to 0.
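A minimal sketch of the closed-form solution above, reusing the hypothetical H and t from the earlier sketch; $I_{0+k}$ is modeled as an identity with a zeroed first entry, assuming the first column of H is the constant basis h_0:

```python
import numpy as np

def ridge_fit(H, t, lam):
    d = H.shape[1]
    I0k = np.eye(d)
    I0k[0, 0] = 0.0                       # do not penalize the intercept w_0
    return np.linalg.solve(H.T @ H + lam * I0k, H.T @ t)

# lam -> 0 recovers least squares; large lam shrinks the weights toward 0.
```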
Ridge Coefficient Path
[Figure from the Kevin Murphy textbook]
Typical approach: select λ using cross-validation — more on this later in the quarter.

[Figure: error as a function of the regularization parameter for a fixed model complexity, from large λ at one end to λ = 0 at the other]
What you need to know
- Regularization penalizes complex models.
- Ridge regression: L2-penalized least-squares regression.
- The regularization parameter trades off model complexity with training error.

Cross-Validation — Machine Learning CSEP546, Carlos Guestrin, University of Washington, January 13
[Figure: test set error as a function of model complexity]

How… how… how???????
- How do we pick the regularization constant λ — and all the other constants in ML? One thing ML doesn't lack is constants to tune. :(
- We could use the test data, but…
(LOO) Leave-one-out cross validation
- Consider a validation set with 1 example: D = training data; D\j = training data with the j-th data point moved to the validation set.
- Learn classifier h_{D\j} on the dataset D\j.
- Estimate the true error as the squared error on predicting t(x_j): $\big( t(x_j) - h_{D \setminus j}(x_j) \big)^2$ — an unbiased estimate of $error_{true}(h_{D \setminus j})$!
- Seems like a really bad estimator — but wait!
- LOO cross validation: average over all data points j. For each data point you leave out, learn a new classifier h_{D\j}, and estimate the error as:
  $error_{LOO} = \frac{1}{N} \sum_{j=1}^N \big( t(x_j) - h_{D \setminus j}(x_j) \big)^2$

LOO cross validation is an (almost) unbiased estimate of the true error of h_D!
- When computing the LOOCV error, we only use N−1 data points, so it's not an estimate of the true error of learning with N data points.
- It is usually pessimistic: learning with less data typically gives a worse answer.
- LOO is almost unbiased! Great news — use the LOO error for model selection, e.g. picking λ.
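A minimal sketch of the LOO error for picking λ, building on the hypothetical ridge_fit, H, and t from the earlier sketches (the λ grid is also an assumption):

```python
import numpy as np

def loo_error(H, t, lam):
    # Average squared error over the N leave-one-out splits
    N = H.shape[0]
    err = 0.0
    for j in range(N):
        mask = np.arange(N) != j              # D\j: drop the j-th point
        w = ridge_fit(H[mask], t[mask], lam)
        err += (t[j] - H[j] @ w) ** 2         # squared error on held-out point
    return err / N

lams = [1e-3, 1e-2, 1e-1, 1.0, 10.0]
best_lam = min(lams, key=lambda lam: loo_error(H, t, lam))
```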
Using LOO to Pick λ
$error_{LOO} = \frac{1}{N} \sum_{j=1}^N \big( t(x_j) - h_{D \setminus j}(x_j) \big)^2$
[Figure: LOO error as a function of λ, from large λ at one end to λ = 0 at the other]

Using LOO error for model selection
[Figure: LOO error plotted across candidate models]
Computational cost of LOO
- Suppose you have 100,000 data points, and you implemented a great version of your learning algorithm that learns in only 1 second.
- Computing LOO will take about 1 day!!! And if you have to do it for each choice of basis functions, it will take foooooreeeever!!!
- Solution 1 (preferred, but not usually possible): find a cool trick to compute LOO (e.g., see homework).

Solution 2 to the complexity of computing LOO (more typical): use k-fold cross validation
- Randomly divide the training data into k equal parts D_1, ..., D_k.
- For each i: learn the classifier h_{D\D_i} using the data points not in D_i, and estimate its error on the validation set D_i:
  $error_{D_i} = \frac{k}{N} \sum_{x_j \in D_i} \big( t(x_j) - h_{D \setminus D_i}(x_j) \big)^2$
- The k-fold cross-validation error is the average over the data splits.
- Properties: much faster to compute than LOO; more (pessimistically) biased, since each model uses much less data — only N(k−1)/k points. Usually, k = 10. :)
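A minimal k-fold sketch under the same assumptions as the earlier sketches; each fold's error is weighted by k/N as on the slide, and the final score averages the k splits:

```python
import numpy as np

def kfold_error(H, t, lam, k=10, seed=0):
    N = H.shape[0]
    idx = np.random.default_rng(seed).permutation(N)
    folds = np.array_split(idx, k)                 # D_1, ..., D_k
    total = 0.0
    for D_i in folds:
        train = np.setdiff1d(idx, D_i)             # points not in D_i
        w = ridge_fit(H[train], t[train], lam)
        total += (k / N) * np.sum((t[D_i] - H[D_i] @ w) ** 2)
    return total / k                               # average over the k splits
```

Each model sees only N(k−1)/k points, which is why the estimate is (pessimistically) biased.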
ML Pipeline
[Figure: pipeline diagram, starting from the data]

What you need to know
- Never ever ever ever ever ever ever ever ever ever ever ever ever ever ever ever ever ever ever ever ever ever ever ever ever ever ever train on the test data.
- Use cross-validation to choose magic parameters such as λ.
- Leave-one-out is the best you can do, but it is sometimes too slow — in that case, use k-fold cross-validation.
Variable Selection; LASSO: Sparse Regression — Machine Learning CSEP546, Carlos Guestrin, University of Washington, January 13

Sparsity
- The vector w is sparse if many of its entries are zero.
- Very useful for many tasks, e.g.:
  - Efficiency: if size(w) = 100B, each prediction is expensive — if part of an online system, too slow. If w is sparse, the prediction computation depends only on the number of non-zeros (see the sketch below).
  - Interpretability: which dimensions are relevant for making a prediction? E.g., which parts of the brain are associated with particular words?
- But it is computationally intractable to perform all-subsets regression.

[Figure from Tom Mitchell: fMRI signatures for the words "Eat", "Push", "Run" — participant P1 and the mean of independently learned signatures over all nine participants; panels: pars opercularis (z=24mm), postcentral gyrus (z=30mm), superior temporal sulcus (posterior) (z=12mm)]
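A toy sketch of the efficiency point (not from the slides): storing w as (index, value) pairs makes the prediction cost proportional to the number of non-zeros, not to size(w):

```python
def sparse_predict(w_nonzeros, x):
    # w_nonzeros: list of (i, w_i) pairs for the non-zero entries of w
    return sum(w_i * x[i] for i, w_i in w_nonzeros)

# 3 non-zero weights -> 3 multiply-adds, regardless of the full dimension
w_nonzeros = [(7, 0.5), (42, -1.2), (99, 2.0)]
print(sparse_predict(w_nonzeros, x=list(range(100))))
```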
Simple greedy model selection algorithm
- Pick a dictionary of features, e.g. polynomials for linear regression.
- Greedy heuristic (sketched in code below):
  - Start from an empty (or simple) set of features: F_0 = ∅.
  - Run the learning algorithm for the current set of features F_t to obtain h_t.
  - Select the next best feature X_i*, e.g. the X_j that results in the lowest training error when learning with F_t + {X_j}.
  - F_{t+1} ← F_t + {X_i*}.
  - Recurse.

Greedy model selection
- Applicable in many settings:
  - Linear regression: selecting basis functions.
  - Naïve Bayes: selecting (independent) features P(X_i | Y).
  - Logistic regression: selecting features (basis functions).
  - Decision trees: selecting leaves to expand.
- Only a heuristic! But sometimes you can prove something cool about it, e.g. [Krause & Guestrin 05]: near-optimal in some settings that include Naïve Bayes.
- There are many more elaborate methods out there.
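A minimal sketch of the greedy heuristic for linear regression (illustrative names; H_all is a hypothetical matrix whose columns are the dictionary of candidate basis functions):

```python
import numpy as np

def training_error(H, t):
    w, *_ = np.linalg.lstsq(H, t, rcond=None)
    return np.sum((t - H @ w) ** 2)

def greedy_select(H_all, t, n_features):
    F = []                                    # F_0 = empty feature set
    candidates = list(range(H_all.shape[1]))
    for _ in range(n_features):
        # X_i* = the candidate giving the lowest training error with F_t + {X_j}
        best = min(candidates,
                   key=lambda j: training_error(H_all[:, F + [j]], t))
        F.append(best)                        # F_{t+1} <- F_t + {X_i*}
        candidates.remove(best)
    return F
```

The stopping rule (n_features here) is exactly the open question the next slide raises.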
When do we stop???
- Greedy heuristic: select the next best feature X_i* (e.g. the X_j with the lowest training error when learning with F_t + {X_j}); F_{t+1} ← F_t + {X_i*}; recurse.
- When do you stop??? When the training error is low enough? When the test set error is low enough?

Regularization in Linear Regression (recap)
- Overfitting usually leads to very large parameter choices, e.g. coefficients that blow up from values like 0.30 X to values like …,700,910.7 X − 8,585,638.4 X² + …
- Regularized or penalized regression imposes a complexity penalty by penalizing large weights — a shrinkage method.
Variable Selection by Regularization
- Ridge regression penalizes large weights. But what if we want to perform feature selection? E.g., which regions of the brain are important for word prediction?
- We can't simply choose the features with the largest coefficients in the ridge solution.
- Try a new penalty: penalize non-zero weights.
  - The regularization penalty $\lambda \sum_{i=1}^k |w_i|$ leads to sparse solutions.
  - Just like ridge regression, the solution is indexed by a continuous parameter λ.
- This simple approach has changed statistics, machine learning & electrical engineering.

LASSO Regression
- LASSO: least absolute shrinkage and selection operator. New objective:
  $\hat{w}_{LASSO} = \arg\min_w \sum_{j=1}^N \Big( t(x_j) - \big(w_0 + \sum_{i=1}^k w_i h_i(x_j)\big) \Big)^2 + \lambda \sum_{i=1}^k |w_i|$
Geometric Intuition for Sparsity
[Figure from Rob Tibshirani's slides: constraint regions for ridge regression (L2 ball) vs. lasso (L1 diamond) in the (β_1, β_2) / (w_1, w_2) plane, with the MLE ŵ and the squared-error contours around it]

Optimizing the LASSO Objective
LASSO solution:
$\hat{w}_{LASSO} = \arg\min_w \sum_{j=1}^N \Big( t(x_j) - \big(w_0 + \sum_{i=1}^k w_i h_i(x_j)\big) \Big)^2 + \lambda \sum_{i=1}^k |w_i|$
Coordinate Descent
- Given a function F, we want to find its minimum.
- Often it is hard to minimize over all coordinates at once, but easy over one coordinate at a time.
- Coordinate descent: repeatedly minimize F over a single coordinate, holding the others fixed. How do we pick the next coordinate? At random, or sequentially.
- A super useful approach for *many* problems; converges to the optimum in some cases, such as LASSO.

How do we find the minimum over each coordinate?
- Key step in coordinate descent: find the minimum over each coordinate. [Illustration from Wikipedia]
- Standard approach: set the (sub)derivative with respect to that coordinate to zero and solve.
Optimizing the LASSO Objective One Coordinate at a Time
$\sum_{j=1}^N \Big( t(x_j) - \big(w_0 + \sum_{i=1}^k w_i h_i(x_j)\big) \Big)^2 + \lambda \sum_{i=1}^k |w_i|$
Taking the derivative of the residual sum of squares with respect to $w_\ell$:
$\frac{\partial}{\partial w_\ell} RSS(w) = -2 \sum_{j=1}^N h_\ell(x_j) \Big( t(x_j) - \big(w_0 + \sum_{i=1}^k w_i h_i(x_j)\big) \Big)$
The penalty term $\lambda |w_\ell|$ is not differentiable at 0, which is what produces exact zeros.

Coordinate Descent for LASSO (aka Shooting Algorithm)
- Repeat until convergence:
  - Pick a coordinate ℓ (at random or sequentially).
  - Set:
    $\hat{w}_\ell = \begin{cases} (c_\ell + \lambda)/a_\ell & c_\ell < -\lambda \\ 0 & c_\ell \in [-\lambda, \lambda] \\ (c_\ell - \lambda)/a_\ell & c_\ell > \lambda \end{cases}$
  - Where:
    $a_\ell = 2 \sum_{j=1}^N h_\ell(x_j)^2$
    $c_\ell = 2 \sum_{j=1}^N h_\ell(x_j) \Big( t(x_j) - \big(w_0 + \sum_{i \neq \ell} w_i h_i(x_j)\big) \Big)$
- For convergence rates, see Shalev-Shwartz and Tewari 2009.
- Another common technique: LARS — least angle regression and shrinkage, Efron et al.
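A minimal sketch of the shooting algorithm using the update above; for brevity the intercept w_0 is folded into H as an all-ones column and penalized with the rest, a simplification relative to the slide:

```python
import numpy as np

def soft_update(c, a, lam):
    if c < -lam:
        return (c + lam) / a
    if c > lam:
        return (c - lam) / a
    return 0.0                                # c in [-lam, lam] -> exact zero

def lasso_shooting(H, t, lam, w0=None, n_sweeps=100):
    N, k = H.shape
    w = np.zeros(k) if w0 is None else w0.copy()
    a = 2 * np.sum(H ** 2, axis=0)            # a_l = 2 * sum_j h_l(x_j)^2
    for _ in range(n_sweeps):                 # "repeat until convergence"
        for l in range(k):                    # sweep coordinates sequentially
            r = t - H @ w + H[:, l] * w[l]    # residual with coordinate l removed
            c = 2 * (H[:, l] @ r)             # c_l from the slide
            w[l] = soft_update(c, a[l], lam)
    return w
```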
Soft Thresholding
$\hat{w}_\ell = \begin{cases} (c_\ell + \lambda)/a_\ell & c_\ell < -\lambda \\ 0 & c_\ell \in [-\lambda, \lambda] \\ (c_\ell - \lambda)/a_\ell & c_\ell > \lambda \end{cases}$
[Figure from the Kevin Murphy textbook: $\hat{w}_\ell$ as a function of $c_\ell$ — the soft-thresholding operator]

Recall: Ridge Coefficient Path
[Figure from the Kevin Murphy textbook]
Typical approach: select λ using cross-validation.
Now: LASSO Coefficient Path
[Figure from the Kevin Murphy textbook]

LASSO Example
Coefficient estimates on the prostate data — terms: Intercept, lcavol, lweight, age, lbph, svi, lcp, gleason, pgg45 — under least squares, ridge, and lasso. The numeric values did not survive transcription; the point of the example is that lasso sets several coefficients exactly to zero while least squares and ridge keep them all non-zero. From Rob Tibshirani's slides.
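Under the same assumptions as the earlier sketches, a coefficient path like the figures above can be traced by sweeping λ from strong to weak regularization, warm-starting each solve from the previous solution (the λ grid is an assumption):

```python
import numpy as np

lams = np.logspace(2, -3, 30)             # strong -> weak regularization
path, w = [], None
for lam in lams:
    w = lasso_shooting(H, t, lam, w0=w)   # warm start from previous solution
    path.append(w.copy())
# Each row of `path` is w(lambda); coefficients enter the model as lambda shrinks.
```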
Debiasing
[Figure from the Kevin Murphy textbook]

What you need to know
- Variable selection: find a sparse solution to the learning problem.
- L1 regularization is one way to do variable selection; it applies beyond regression. Hundreds of other approaches are out there.
- The LASSO objective is non-differentiable but convex → use the subgradient.
- There is no closed-form solution for the minimization → use coordinate descent.
- The shooting algorithm is a very simple approach for solving LASSO.