Lecture: Designing a learning system
Milos Hauskrecht
milos@cs.pitt.edu
539 Sennott Square, x4-8845
http://.cs.pitt.edu/~milos/courses/cs750/

Design of a learning system (first view): Data -> Model selection -> Learning -> Application or Testing
Design of a learning system.
1. Data: D = {d1, d2, .., dn}
2. Model selection: select a model or a set of models (with parameters), e.g. y = ax + b + ε, ε ~ N(0, σ²). Select the error function to be optimized, e.g. Error = (1/n) Σ_{i=1..n} (y_i − f(x_i))²
3. Learning: find the set of parameters optimizing the error function; the model and parameters with the smallest error.
4. Application (Evaluation): apply the learned model, e.g. predict ys for new inputs x using the learned f(x)

Design cycle: Data -> Feature selection -> Model selection -> Learning -> Evaluation. The steps require some prior knowledge.
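The four steps above can be sketched for a one-dimensional linear model y = ax + b. The data below are generated synthetically purely for illustration (plain Python, no external libraries):

```python
# A minimal sketch of the four design steps for a 1-D linear model
# y = a*x + b (synthetic data; illustrative, not from the lecture).
import random

random.seed(0)

# 1. Data: D = {(x_i, y_i)} drawn from a noisy line (assumed here).
xs = [i / 10 for i in range(50)]
data = [(x, 2.0 * x + 1.0 + random.gauss(0, 0.1)) for x in xs]

# 2. Model: f(x) = a*x + b; error function: mean squared error.
def mse(a, b, data):
    return sum((y - (a * x + b)) ** 2 for x, y in data) / len(data)

# 3. Learning: for this model the least-squares optimum has a closed form.
n = len(data)
sx = sum(x for x, _ in data)
sy = sum(y for _, y in data)
sxx = sum(x * x for x, _ in data)
sxy = sum(x * y for x, y in data)
a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
b = (sy - a * sx) / n

# 4. Application: predict y for a new input x.
y_new = a * 5.0 + b
```

With enough samples the learned parameters land close to the generating values a = 2, b = 1.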
Design cycle: Data.
Data may need a lot of:
- Cleaning
- Preprocessing (conversions)
Cleaning: get rid of errors and noise; removal of redundancies.
Preprocessing: renaming, rescaling (normalization), discretizations, abstraction, aggregation, new attributes.
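Two of the preprocessing steps listed above, rescaling to [0, 1] and discretization by equal-width binning, can be sketched as follows (the values and the number of bins are illustrative assumptions):

```python
# A small sketch of two preprocessing steps: rescaling (normalization
# to [0, 1]) and discretization (equal-width binning). The values and
# the bin count are illustrative assumptions.
values = [2.0, 4.0, 6.0, 8.0, 10.0]

# Rescaling: map continuous values to the range [0, 1].
lo, hi = min(values), max(values)
rescaled = [(v - lo) / (hi - lo) for v in values]

# Discretization: 4 equal-width bins over [lo, hi]; bin index in {0,..,3}.
n_bins = 4
width = (hi - lo) / n_bins
binned = [min(int((v - lo) / width), n_bins - 1) for v in values]
```

The `min(..., n_bins - 1)` clamp keeps the maximum value in the last bin instead of creating a spurious extra bin.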
Data preprocessing.
Renaming (relabeling): categorical values converted to numbers; dangerous in conjunction with some learning methods, because the numbers will impose an order that is not warranted.
Rescaling (normalization): continuous values transformed to some range, typically [-1, 1] or [0, 1].
Discretizations (binning): continuous values converted to a finite set of discrete values.
Abstraction: merge together categorical values.
Aggregation: summary or aggregation operations, such as minimum value, maximum value, etc.
New attributes: example: obesity-factor = weight/height.

Data biases.
Watch out for data biases:
Try to understand the data source.
It is very easy to derive unexpected results when the data used for analysis and learning are biased (pre-selected).
Results (conclusions) derived for pre-selected data do not hold in general!!!
Data biases.
Example 1: Risks in pregnancy study
- Sponsored by DARPA, conducted at military hospitals.
- Study of a large sample of pregnant women.
- Conclusion: the factor with the largest impact on reducing the risk during pregnancy (statistically significant) is the pregnant woman being single.
- Single woman -> the smallest risk.
What is wrong?

Example 2: Stock market trading (example by Andrew Lo)
- Data on stock performances of companies traded on the stock market over the past 5 years.
- Investment goal: pick a stock to hold long term.
- Proposed strategy: invest in a company stock with an IPO corresponding to a Carmichael number.
- Evaluation result: excellent return over 5 years.
- Where does the magic come from?
Design cycle: Feature selection.
The size (dimensionality) of a sample can be enormous: x_i = (x_i^1, x_i^2, .., x_i^d), with d very large.
Example: document classification
- 10,000 different words
- Inputs: counts of occurrences of different words
- Too many parameters to learn (not enough samples to justify the estimates of the parameters of the model)
Dimensionality reduction: replace inputs with features
- Extract relevant inputs (e.g. using the mutual information measure)
- PCA: principal component analysis
- Group (cluster) similar words (using a similarity measure) and replace them with the group label
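The last option, grouping similar words and replacing them with a group label, can be sketched on a toy document. The word groups below are made-up assumptions, standing in for the output of a clustering step:

```python
# A toy sketch of dimensionality reduction by grouping similar words:
# each word is replaced by its group label before counting, shrinking
# the input dimension. The groups here are illustrative assumptions
# (in practice they would come from clustering with a similarity measure).
groups = {"car": "vehicle", "truck": "vehicle", "bus": "vehicle",
          "cat": "animal", "dog": "animal"}

def count_features(words, groups):
    # Replace each word by its group label, then count occurrences.
    counts = {}
    for w in words:
        label = groups.get(w, w)
        counts[label] = counts.get(label, 0) + 1
    return counts

doc = ["car", "dog", "truck", "dog", "bus"]
features = count_features(doc, groups)   # 5 distinct words -> 2 features
```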
Design cycle: Model selection.
What is the right model to learn?
- Prior knowledge helps a lot, but there is still a lot of guessing.
- Initial data analysis and visualization: we can make a good guess about the form of the distribution and the shape of the function, and about independences and correlations.
- Overfitting problem: take into account the bias and variance of error estimates.
Design cycle: Learning.
Learning = optimization problem. Various criteria:
- Mean squared error: w* = arg min_w Error(w), with Error(w) = (1/N) Σ_{i=1..N} (y_i − f(x_i, w))²
- Maximum likelihood (ML) criterion: Θ* = arg max_Θ P(D | Θ), with Error(Θ) = − log P(D | Θ)
- Maximum posterior probability (MAP): Θ* = arg max_Θ P(Θ | D), with P(Θ | D) = P(D | Θ) P(Θ) / P(D)
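As a small illustration of the ML criterion, minimizing Error(Θ) = −log P(D | Θ) over the mean of a Gaussian with known σ recovers the sample mean. The data and the grid search below are illustrative assumptions:

```python
# A sketch of the maximum-likelihood criterion for one parameter:
# minimize Error(mu) = -log P(D | mu) over the mean of N(mu, 1).
# Data values and the grid search are illustrative assumptions.
import math

data = [1.2, 0.8, 1.0, 1.4, 0.6]

def neg_log_likelihood(mu, data, sigma=1.0):
    # -log P(D | mu) for i.i.d. N(mu, sigma^2) samples
    return sum(0.5 * math.log(2 * math.pi * sigma ** 2)
               + (x - mu) ** 2 / (2 * sigma ** 2) for x in data)

# Grid search over candidate means; the ML estimate minimizes the error.
candidates = [i / 100 for i in range(0, 201)]
mu_ml = min(candidates, key=lambda m: neg_log_likelihood(m, data))
```

For this model the minimizer coincides with the sample mean of the data, which is the well-known closed-form ML estimate.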
Learning = optimization problem.
Optimization problems can be hard to solve. The right choice of a model and an error function makes a difference.
Parametric optimizations:
- Gradient descent, Conjugate gradient
- Newton-Raphson
- Levenberg-Marquardt
Some can be carried out on-line, on a sample-by-sample basis.
Combinatorial optimizations (over discrete spaces):
- Hill-climbing
- Simulated annealing
- Genetic algorithms
Parametric optimizations can sometimes be solved directly, but this depends on the error function and the model. Example: the squared error criterion for linear regression.
Very often the error function to be optimized is not that nice: Error(w) = f(w) is a complex function of the weights (parameters) w = (w_0, w_1, ..., w_k).
Goal: w* = arg min_w f(w)
Typical solution: iterative methods. Example: the gradient descent method.
Idea: move the weights (free parameters) gradually in the error-decreasing direction.
Gradient descent method.
Descend to the minimum of the function using the gradient information.
Change the parameter value of w according to the gradient:
w* <- w* − α ∂Error(w)/∂w |_{w*}
where α > 0 is a learning rate (it scales the gradient changes).
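The update rule above can be sketched for a one-parameter model f(x) = wx with the squared error. The toy data and learning rate are assumptions for illustration:

```python
# A minimal sketch of the gradient-descent update w <- w - alpha * dE/dw
# for a one-parameter model f(x) = w*x with mean squared error.
# Toy data and learning rate are illustrative assumptions.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # generated by y = 2x

def grad(w, data):
    # derivative of (1/n) * sum (y - w*x)^2 with respect to w
    n = len(data)
    return sum(-2 * x * (y - w * x) for x, y in data) / n

w, alpha = 0.0, 0.05
for _ in range(200):
    w -= alpha * grad(w, data)   # move w in the error-decreasing direction
```

For this convex quadratic error the iterates converge to the unique minimum w = 2; on non-convex errors the same update can get stuck in local optima, as the slide notes.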
Gradient descent method.
To get to the function minimum, repeat (iterate) the gradient-based update a few times: w^(0), w^(1), w^(2), w^(3), ...
Problems: local optima, saddle points, slow convergence.
More complex optimization techniques use additional information (e.g. second derivatives).

On-line learning (optimization).
The error function looks at all data points at the same time, e.g.
Error(w) = (1/n) Σ_{i=1..n} (y_i − f(x_i, w))²
The on-line error separates the contribution of a single data point:
Error_ONLINE(w) = (y_i − f(x_i, w))²
Example: on-line gradient descent, with derivatives based on different data points: w^(0), w^(1), w^(2), w^(3), ...
Advantages: 1. simple learning algorithm 2. no need to store the data (on-line data streams)
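The on-line version can be sketched the same way: each update uses the gradient of the per-sample error Error_ONLINE only, so the data can be processed as a stream (toy data and learning rate are illustrative assumptions):

```python
# A sketch of on-line gradient descent: the gradient step uses one data
# point at a time, so the data can arrive as a stream and need not be
# stored. Toy model f(x) = w*x; data and learning rate are assumptions.
import random

random.seed(1)
stream = [(x, 2.0 * x) for x in
          [random.uniform(0, 1) for _ in range(500)]]

w, alpha = 0.0, 0.1
for x, y in stream:
    # gradient of the per-sample error (y - w*x)^2 w.r.t. w
    w -= alpha * (-2 * x * (y - w * x))
```

After the stream is consumed, w has drifted to the generating value 2 without any pass over stored data.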
Design cycle: Evaluation.
Data, feature selection, model selection, and learning were covered earlier.
Evaluation:
- Simple holdout method: divide the data into training and test data.
- Other, more complex methods based on cross-validation and random sub-sampling.
What if we want to compare the predictive performance on a classification or a regression problem for two different learning methods?
Solution: compare the error results on the test data set.
Possible answer: the method with the better (smaller) testing error gives a better generalization error.
Is this a good answer? How sure are we that the method with the better test score is truly better?
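The simple holdout method can be sketched as follows; the synthetic data, the 70/30 split, and the one-parameter model are illustrative assumptions:

```python
# A sketch of the simple holdout method: split the data into training
# and test sets, fit on the training part, and report the test error.
# The synthetic data, the 70/30 split, and the model f(x) = w*x are
# illustrative assumptions.
import random

random.seed(2)
xs = [random.uniform(0, 1) for _ in range(100)]
data = [(x, 3.0 * x + random.gauss(0, 0.2)) for x in xs]

random.shuffle(data)
split = int(0.7 * len(data))
train, test = data[:split], data[split:]

# Fit f(x) = w*x on the training data (least squares, closed form).
w = sum(x * y for x, y in train) / sum(x * x for x, _ in train)

# Generalization estimate: mean squared error on the held-out test set.
test_mse = sum((y - w * x) ** 2 for x, y in test) / len(test)
```

The test error, not the training error, is the quantity to compare across methods, since the training data were already used to pick the parameters.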
Evaluation.
Problem: we cannot be 100% sure about the generalization errors.
Solution: test the statistical significance of the result.
Central limit theorem: let random variables X_1, X_2, ..., X_n form a random sample from a distribution with mean µ and variance σ². If the sample size n is large, then
Σ_{i=1..n} X_i is approximately N(nµ, nσ²), or equivalently (1/n) Σ_{i=1..n} X_i is approximately N(µ, σ²/n)
[Figure: normal densities with µ = 0 and σ = 1, 2, 4]
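The theorem can be illustrated empirically: means of n uniform samples concentrate around µ with variance close to σ²/n. The sample sizes below are arbitrary illustrative choices:

```python
# A sketch illustrating the central limit theorem: sample means of n
# Uniform(0, 1) draws concentrate around mu with variance sigma^2 / n.
# The sample sizes are arbitrary illustrative choices.
import random
import statistics

random.seed(3)
n = 100                       # size of each sample
mu, var = 0.5, 1 / 12         # mean and variance of Uniform(0, 1)

# Draw many independent sample means of size n.
means = [statistics.mean(random.random() for _ in range(n))
         for _ in range(2000)]

# The sample means should be centered at mu with variance near var / n.
center = statistics.mean(means)
spread = statistics.variance(means)
```

The same reasoning underlies significance tests on test-set errors: the observed test error is itself a sample mean, so its uncertainty shrinks as σ/√n.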