Designing a learning system



CS 2750 Machine Learning
Lecture: Designing a learning system
Milos Hauskrecht, milos@cs.pitt.edu
5329 Sennott Square, x4-8845
http://www.cs.pitt.edu/~milos/courses/cs2750/

Design of a learning system (first view): Data, Model selection, Learning, Application or Testing.

Design of a learning system
1. Data: D = { d1, d2, ..., dn }.
2. Model selection: select a model or a set of models (with parameters), e.g. y = ax + b + ε, ε ~ N(0, σ²). Select the error function to be optimized, e.g. (1/n) Σ_{i=1..n} (y_i − f(x_i))².
3. Learning: find the set of parameters optimizing the error function, i.e. the model and parameters with the smallest error.
4. Application (Evaluation): apply the learned model, e.g. predict the ys for new inputs x using the learned f(x).
Design cycle: Data, Model selection, Learning, Evaluation. The steps require some prior knowledge.
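The four steps above can be sketched end to end for the slide's example model y = ax + b + ε. The data here is synthetic and the least-squares fit stands in for the learning step; all names are illustrative, not from the lecture.

```python
# Minimal sketch of the four design steps for the model y = a*x + b + eps.
import numpy as np

# 1. Data: a small synthetic sample D = {(x_i, y_i)}.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 50)
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.5, size=x.shape)

# 2. Model selection: y = a*x + b, with mean squared error as the error function.
def mse(a, b):
    return np.mean((y - (a * x + b)) ** 2)

# 3. Learning: least squares finds the parameters minimizing the MSE directly.
a_hat, b_hat = np.polyfit(x, y, deg=1)

# 4. Application: predict y for new inputs with the learned f(x).
x_new = np.array([11.0, 12.0])
y_pred = a_hat * x_new + b_hat
```

The fitted (a_hat, b_hat) recover the generating parameters up to the noise level, and by construction no other linear model achieves a smaller MSE on this sample.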

Design cycle: Data. Data may need a lot of cleaning and preprocessing (conversions).
Cleaning: get rid of errors and noise; removal of redundancies.
Preprocessing: renaming, rescaling (normalization), discretization, abstraction, aggregation, new attributes.

Data preprocessing
Renaming (relabeling): categorical values mapped to numbers. Dangerous in conjunction with some learning methods: the numbers will impose an order that is not warranted.
Rescaling (normalization): continuous values transformed to some range, typically [-1, 1] or [0, 1].
Discretization (binning): continuous values mapped to a finite set of discrete values.
Abstraction: merge together categorical values.
Aggregation: summary or aggregation operations, such as minimum value, maximum value, etc.
New attributes: for example, obesity-factor = weight / height.

Data biases
Watch out for data biases: try to understand the data source. It is very easy to derive unexpected results when the data used for analysis and learning are biased (pre-selected). Results (conclusions) derived from pre-selected data do not hold in general!
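Three of the preprocessing operations above can be sketched on a few made-up measurements; the data, bin edges, and helper names are illustrative only.

```python
# A small sketch of rescaling, binning, and a derived attribute.
import numpy as np

heights = np.array([1.60, 1.75, 1.82, 1.68])   # metres (hypothetical)
weights = np.array([55.0, 70.0, 95.0, 60.0])   # kilograms (hypothetical)

# Rescaling (normalization): map continuous values linearly onto [0, 1].
def rescale01(v):
    return (v - v.min()) / (v.max() - v.min())

w01 = rescale01(weights)

# Discretization (binning): continuous values to a finite set of labels.
# Labels: 0 for < 60, 1 for [60, 80), 2 for >= 80.
bins = np.digitize(weights, [60.0, 80.0])

# New attribute, as on the slide: obesity-factor = weight / height.
obesity_factor = weights / heights
```

Note that renaming categories to such integer bin labels is exactly the case the slide warns about: a learner may read an order into the labels that the original categories do not have.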

Data biases
Example 1: A risks-in-pregnancy study, sponsored by DARPA at military hospitals, examined a large sample of pregnant women. Conclusion: the factor with the largest impact on reducing risks during pregnancy (statistically significant) is the woman being single. Single woman -> the smallest risk. What is wrong?
Example 2: Stock market trading (example by Andrew Lo). Data on the stock performances of companies traded on the stock market over the past 5 years. Investment goal: pick a stock to hold long term. Proposed strategy: invest in a company stock with an IPO corresponding to a Carmichael number. Evaluation result: excellent return over 5 years. Where does the magic come from?

Design cycle: Data. The size (dimensionality) of a sample can be enormous: x_i = (x_i^1, x_i^2, ..., x_i^d) with d very large. Example: document classification with 10,000 different words, where the inputs are the counts of occurrences of the different words. There are too many parameters to learn (not enough samples to justify the estimates of the parameters of the model).
Dimensionality reduction: replace the inputs with features.
- Extract relevant inputs (e.g. using a mutual-information measure).
- PCA: principal component analysis.
- Group (cluster) similar words (using a similarity measure) and replace them with the group label.
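The PCA option above can be sketched with an SVD of the centered data matrix; the word-count matrix here is random, purely for illustration, and the choice of k = 5 components is arbitrary.

```python
# Minimal PCA sketch: project high-dimensional inputs onto the top-k
# principal components to get a lower-dimensional feature representation.
import numpy as np

rng = np.random.default_rng(1)
# 100 "documents", 50 "word" counts each (made-up data).
X = rng.poisson(2.0, size=(100, 50)).astype(float)

def pca_project(X, k):
    Xc = X - X.mean(axis=0)                 # center each input dimension
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                    # coordinates in the top-k component space

Z = pca_project(X, k=5)                     # 100 x 5 reduced representation
```

The components come out ordered by the variance they capture, so the first coordinate of Z carries at least as much variance as the second, and so on.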

Design cycle: Model selection. What is the right model to learn? Prior knowledge helps a lot, but there is still a lot of guessing. Initial data analysis and visualization help: we can make a good guess about the form of the distribution and the shape of the function, and about independences and correlations. Beware of the overfitting problem: take into account the bias and variance of the error estimates.

Design cycle: Learning. Learning = optimization problem. Various criteria:
Mean squared error: w* = arg min_w Error(w), where Error(w) = (1/N) Σ_{i=1..N} (y_i − f(x_i, w))².
Maximum likelihood (ML) criterion: Θ* = arg max_Θ P(D | Θ), equivalently minimize Error(Θ) = − log P(D | Θ).
Maximum a posteriori probability (MAP): Θ* = arg max_Θ P(Θ | D), where P(Θ | D) = P(D | Θ) P(Θ) / P(D).
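The mean-squared-error and ML criteria above are closely related: under a Gaussian noise model, − log P(D | Θ) is an increasing affine function of the squared error, so both criteria pick the same parameters. A numeric sketch, with hypothetical data and a one-parameter model y = ax:

```python
# Sketch: for Gaussian noise with known sigma, minimizing -log P(D|theta)
# and minimizing the mean squared error select the same parameter.
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0.0, 1.0, 30)
y = 3.0 * x + rng.normal(0.0, 0.1, 30)
sigma = 0.1

def neg_log_lik(a):
    r = y - a * x
    return 0.5 * len(x) * np.log(2 * np.pi * sigma**2) + np.sum(r**2) / (2 * sigma**2)

def mse(a):
    return np.mean((y - a * x) ** 2)

grid = np.linspace(2.0, 4.0, 201)
a_ml = grid[np.argmin([neg_log_lik(a) for a in grid])]
a_mse = grid[np.argmin([mse(a) for a in grid])]
```

Both searches land on the same grid point, near the generating slope of 3.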

Learning = optimization problem. Optimization problems can be hard to solve; the right choice of a model and an error function makes a difference.
Parameter optimizations: gradient descent, conjugate gradient, Newton-Raphson, Levenberg-Marquardt. Some can be carried out on-line, on a sample-by-sample basis.
Combinatorial optimizations (over discrete spaces): hill-climbing, simulated annealing, genetic algorithms.
Parametric optimizations can sometimes be solved directly, but this depends on the error function and the model; an example is the squared-error criterion for linear regression. Very often the error function to be optimized is not that nice: Error(w) = f(w), with w = (w_0, w_1, ..., w_k), is a complex function of the weights (parameters). Goal: w* = arg min_w f(w). The typical solution is an iterative method, for example the gradient-descent method. Idea: move the weights (free parameters) gradually in the error-decreasing direction.

Gradient descent method. Descend to the minimum of the function using the gradient information. Change the parameter value of w according to the gradient: the new value of the parameter is
w ← w* − α ∂Error(w)/∂w evaluated at w*,
where α > 0 is a learning rate (it scales the gradient changes).

Gradient descent method. To get to the function minimum, repeat (iterate) the gradient-based update a few times, producing iterates w^(0), w^(1), w^(2), w^(3), ... Problems: local optima, saddle points, slow convergence. More complex optimization techniques use additional information (e.g. second derivatives).
On-line learning (optimization). The error function above looks at all data points at the same time, e.g. Error(w) = (1/n) Σ_{i=1..n} (y_i − f(x_i, w))². The on-line error separates the contribution from a single data point: Error_ONLINE(w) = (y_i − f(x_i, w))². Example: on-line gradient descent, where the derivatives at successive iterates w^(0), w^(1), w^(2), w^(3), ... are based on different data points. Advantages: 1. a simple learning algorithm; 2. no need to store the data (on-line data streams).
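Both updates above can be sketched for the squared-error criterion on a linear model. The data, step sizes, and iteration counts below are illustrative choices, not values from the lecture.

```python
# Gradient descent for Error(w) = mean squared error of y = a*x + b:
# batch form (gradient over all points) vs. on-line form (one point per update).
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(-1.0, 1.0, 200)
y = 1.5 * x + 0.5 + rng.normal(0.0, 0.1, 200)

def batch_gd(steps=500, alpha=0.1):
    a = b = 0.0
    for _ in range(steps):
        r = y - (a * x + b)            # residuals on ALL data points
        a += alpha * np.mean(r * x)    # w <- w - alpha * dError/dw
        b += alpha * np.mean(r)
    return a, b

def online_gd(epochs=5, alpha=0.05):
    a = b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(x, y):       # one sample per gradient step
            r = yi - (a * xi + b)
            a += alpha * r * xi
            b += alpha * r
    return a, b
```

The batch iterates converge smoothly to the least-squares solution; the on-line iterates jitter around it (each step uses a noisy single-sample gradient) but never need the whole data set in memory, which is the advantage for data streams.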

Design cycle: Evaluation (covered earlier). The simple holdout method divides the data into training and test data. Other, more complex methods are based on cross-validation and random sub-sampling.
What if we want to compare the predictive performance on a classification or a regression problem for two different learning methods? Solution: compare the error results on the test data set. A possible answer: the method with the better (smaller) testing error gives a better generalization error. Is this a good answer? How sure are we that the method with a better test score is truly better?

Evaluation. Problem: we cannot be 100% sure about the generalization errors. Solution: test the statistical significance of the result.
Central limit theorem: let random variables X_1, X_2, ..., X_n form a random sample from a distribution with mean µ and variance σ². Then, if the sample size n is large,
Σ_{i=1..n} X_i ≈ N(nµ, nσ²), or equivalently (1/n) Σ_{i=1..n} X_i ≈ N(µ, σ²/n).
[Figure: normal densities with mean µ = 0 and standard deviations σ = 1, 2, 4.]
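The CLT argument above can be sketched for the two-method comparison from the previous slide: treat the per-example error differences as the random sample, and use the normal approximation of their mean to judge significance. The error rates, test-set size, and 95% threshold below are assumed for illustration.

```python
# Sketch: is the observed test-error gap between two methods significant?
# Per-example 0/1 errors are simulated for two hypothetical classifiers.
import numpy as np

rng = np.random.default_rng(4)
n = 1000                                          # test-set size
err_a = rng.binomial(1, 0.10, n).astype(float)    # method A: 10% true error rate
err_b = rng.binomial(1, 0.18, n).astype(float)    # method B: 18% true error rate

d = err_a - err_b                                 # per-example error differences
mean_d = d.mean()
se = d.std(ddof=1) / np.sqrt(n)                   # CLT: mean_d ~ N(mu, sigma^2/n)
ci = (mean_d - 1.96 * se, mean_d + 1.96 * se)     # approximate 95% interval

# If the interval excludes 0, the gap is unlikely to be test-set noise.
significant = ci[1] < 0.0 or ci[0] > 0.0
```

With a large test set and an 8-point gap, the interval lies entirely below zero, so method A's smaller test error is statistically significant rather than an artifact of which examples landed in the test set.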