COSC343: Artificial Intelligence

COSC343: Artificial Intelligence, Lecture 8: Non-linear regression and Optimisation. Lech Szymanski, Dept. of Computer Science, University of Otago.

In today's lecture:
Error surface
Iterative parameter search
Non-linear regression and multiple minima
Optimisation techniques: random walk, steepest gradient, simulated annealing

Recall: Least squares and linear regression. The least squares parameters that minimise $J = \frac{1}{2}\sum_i (y_i - \tilde y_i)^2$, where $y_i$ is the model output for input $\mathbf{x}_i$ and $\tilde y_i$ the corresponding target, can be found by solving $\frac{\partial J}{\partial \mathbf{w}} = \mathbf{0}$. Linear regression is consistent only when an appropriate choice of basis functions is made. For models that are linear in the parameters, $y_i = \mathbf{w}^T\mathbf{f}(\mathbf{x}_i)$, there is a closed-form solution: $\mathbf{w} = (FF^T)^{-1}F\tilde{\mathbf{y}}^T$. Matrix inversion is computationally intensive, and the matrix $FF^T$ is sometimes singular.
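
As an illustration, here is a minimal numpy sketch of this closed-form solution for the straight-line model used later in the lecture; the data values and array names are made up for the example:

```python
import numpy as np

# Hypothetical data generated from the line y = 2.1*x - 2.99 (no noise).
x = np.array([-1.0, 0.0, 1.0, 2.0, 3.0])
y_tilde = 2.1 * x - 2.99

# Design matrix F: one row per basis function, one column per sample.
# For the line y_i = w1*x_i + w2 the basis functions are f(x) = [x, 1].
F = np.vstack([x, np.ones_like(x)])          # shape (U, N), with U = 2 parameters

# Closed form w = (F F^T)^{-1} F y~^T, solved without forming an explicit inverse.
w = np.linalg.solve(F @ F.T, F @ y_tilde)
print(w)                                     # approximately [ 2.1, -2.99]
```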

Recall: Least squares and regression. The least squares parameters that minimise $J = \frac{1}{2}\sum_i (y_i - \tilde y_i)^2$, where $y_i$ is the model output for input $\mathbf{x}_i$, can be found by solving $\frac{\partial J}{\partial \mathbf{w}} = \mathbf{0}$. Can appropriate parameters be found without the closed-form equation?

Space of possible solutions. Let's think about the space of possible values that the parameters $\mathbf{w}$ can take, and the corresponding cost $J(\mathbf{w})$ for some hypothesis. The space of possible solutions is a U-dimensional state space (where U is the number of parameters in the model); a cost function maps each point in the state space to a real number; the evaluation of all points in the state space can be visualised as a state-space landscape (in U+1 dimensions).

An example: fitting a line. [Figure: training data with the fitted line $y = 2.1x - 2.99$ and the initial-guess line $y = -0.5x + 8.0$, plus the cost surface over $(w_1, w_2)$.] Model: $y_i = w_1 x_i + w_2$. Cost: $J = \frac{1}{2}\sum_i (y_i - \tilde y_i)^2$. Closed-form solution $\mathbf{w} = (FF^T)^{-1}F\tilde{\mathbf{y}}^T$ gives $w_1 = 2.1$, $w_2 = -2.99$. Initial guess (Epoch 0): $w_1 = -0.5$, $w_2 = 8.0$.

Parameter search:
1. Make an initial guess of the parameter values $\mathbf{w}_0$ (often random) and compute the cost $J(\mathbf{w}_0)$.
2. Update the parameters, $\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha\,\Delta\mathbf{w}$, where $\alpha$ is the learning rate parameter and $\Delta\mathbf{w}$ is the change in parameter values, and compute the new cost $J(\mathbf{w}_{t+1})$.
   a. Keep the update if $J(\mathbf{w}_{t+1}) < J(\mathbf{w}_t)$.
3. Go back to step 2.
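
A minimal Python sketch of this generic search loop, assuming a user-supplied cost function and proposal rule (both names are placeholders, not from the lecture):

```python
import numpy as np

def parameter_search(cost, w0, propose, alpha=0.1, n_epochs=1000):
    """Iterative parameter search: keep an update only if it lowers the cost."""
    w = np.asarray(w0, dtype=float)
    J = cost(w)                          # step 1: cost of the initial guess
    for _ in range(n_epochs):
        dw = propose(w)                  # step 2: proposed change in parameter values
        w_new = w + alpha * dw           # alpha is the learning rate parameter
        J_new = cost(w_new)
        if J_new < J:                    # step 2a: keep the update if the cost decreased
            w, J = w_new, J_new
    return w, J                          # step 3 is the loop itself
```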

Random walk. Pick the weight update at random, $\Delta\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \sigma^2 I)$; the parameter update is a random variable (in this case with a normal distribution). Keep the update if $J(\mathbf{w}_t + \alpha\,\Delta\mathbf{w}) < J(\mathbf{w}_t)$.
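
A sketch of the random walk applied to the line-fitting example; the proposal spread sigma, the learning rate, and the number of iterations are assumed values, not from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)

x = np.array([-1.0, 0.0, 1.0, 2.0, 3.0])
y_tilde = 2.1 * x - 2.99

def cost(w):
    # J = 1/2 * sum_i (y_i - y~_i)^2 for the line y_i = w1*x_i + w2
    return 0.5 * np.sum((w[0] * x + w[1] - y_tilde) ** 2)

w = np.array([-0.5, 8.0])                        # initial guess, as in the slide
J = cost(w)
alpha, sigma = 0.1, 1.0                          # assumed learning rate and proposal spread
for _ in range(5000):
    dw = rng.normal(0.0, sigma, size=w.shape)    # random, normally distributed proposal
    w_new = w + alpha * dw
    J_new = cost(w_new)
    if J_new < J:                                # keep the update only if the cost goes down
        w, J = w_new, J_new
print(w, J)                                      # w drifts towards [2.1, -2.99]
```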

An example: fitting a line with random walk

Learning rate parameter. The learning rate parameter $\alpha$ controls how big the update steps are when the parameters are updated. A small $\alpha$ results in smaller steps: a more ordered and direct path towards the minimum, but it takes longer to get there. A large $\alpha$ results in bigger steps: faster convergence, but a less direct path and more chance of overshooting the minimum.

An example: random walk with smaller α

An example: random walk with larger α

Steepest gradient descent. The weight update is the negative gradient of the cost function with respect to the parameters, $\Delta\mathbf{w} = -\frac{\partial J}{\partial \mathbf{w}}$, so $\mathbf{w}_{t+1} = \mathbf{w}_t - \alpha\,\frac{\partial J}{\partial \mathbf{w}}\big|_{\mathbf{w}_t}$. The gradient is the positive slope (going up) of the cost surface at $\mathbf{w}_t$. Also referred to as steepest gradient ascent, or hill climbing, if the objective function is being maximised.

An example: fitting a line with steepest gradient descent. Given $J = \frac{1}{2}\sum_i (y_i - \tilde y_i)^2$, the cost derivatives are $\frac{\partial J}{\partial w_j} = \sum_i (y_i - \tilde y_i)\,\frac{\partial y_i}{\partial w_j}$ (true for any model), where $j$ indexes the weights and $i$ indexes the training samples. Since $y_i = w_1 x_i + w_2$, the output derivatives are $\frac{\partial y_i}{\partial w_1} = x_i$ and $\frac{\partial y_i}{\partial w_2} = 1$. Therefore $\frac{\partial J}{\partial w_1} = \sum_i (y_i - \tilde y_i)\,x_i$ and $\frac{\partial J}{\partial w_2} = \sum_i (y_i - \tilde y_i)$.
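
A sketch of gradient descent using exactly these derivatives; the data, learning rate, and iteration count are assumed values for illustration:

```python
import numpy as np

x = np.array([-1.0, 0.0, 1.0, 2.0, 3.0])
y_tilde = 2.1 * x - 2.99                    # hypothetical noise-free targets

w = np.array([-0.5, 8.0])                   # initial guess (Epoch 0)
alpha = 0.01                                # assumed learning rate
for _ in range(2000):
    y = w[0] * x + w[1]                     # model output
    err = y - y_tilde                       # (y_i - y~_i)
    # dJ/dw1 = sum_i (y_i - y~_i) * x_i,   dJ/dw2 = sum_i (y_i - y~_i)
    grad = np.array([np.sum(err * x), np.sum(err)])
    w = w - alpha * grad                    # step against the gradient
print(w)                                    # approaches the closed-form result [2.1, -2.99]
```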

An example: fitting a line with steepest gradient descent

An example: non-linear regression. [Figure: training data and cost surface over $(w_1, w_2)$.] Model: $y_i = \sin(w_1 x_{i,1} + w_2 x_{i,2})$. Cost: $J = \frac{1}{2}\sum_i (y_i - \tilde y_i)^2$. Initial guess (Epoch 0): $w_1 = 2.5$, $w_2 = -1.5$, giving $y = \sin(2.5 x_1 - 1.5 x_2)$ and $J = 89.23$.
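
The same chain rule applies to this non-linear model, with $\frac{\partial y_i}{\partial w_j} = \cos(w_1 x_{i,1} + w_2 x_{i,2})\,x_{i,j}$. A sketch with synthetic data; the "true" parameters, input range, and learning rate below are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(0.0, 2.0, size=(100, 2))    # synthetic inputs (x1, x2)
w_true = np.array([1.2, 0.7])               # hypothetical "true" parameters
y_tilde = np.sin(X @ w_true)

w = np.array([2.5, -1.5])                   # initial guess, as in the slide
alpha = 0.005                               # assumed learning rate
for _ in range(5000):
    z = X @ w                               # w1*x1 + w2*x2 for every sample
    err = np.sin(z) - y_tilde               # (y_i - y~_i)
    grad = X.T @ (err * np.cos(z))          # chain rule: sum_i err_i * cos(z_i) * x_ij
    w = w - alpha * grad
print(w)   # may reach w_true, or a local minimum of this non-convex cost, depending on the start
```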

An example: non-linear regression

An example: non-linear regression

An example: non-linear regression

Local minima. Cost functions in non-linear optimisation are not necessarily convex: the cost surface may not correspond to one giant valley, but to a series of valleys separated by high-cost ridges. Global minimum: the state of the system that gives the lowest possible cost; the best solution for the chosen model. Local minimum: the minimum closest to the current state; may not be the best solution. The steepest gradient always points towards the local minimum, and as a result the outcome of training is highly dependent on the starting values of the parameters.

Local minima. [Figure: cost surface with the global minimum marked.]

Local minima. [Figure: cost surface with the global minimum and local minima marked.]

Simulated annealing. Pick the weight update at random, $\Delta\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \sigma^2 I)$; keep the update with probability $p = \min\!\left(1,\; e^{\Delta J / T}\right)$, where $\Delta J = J(\mathbf{w}_t) - J(\mathbf{w}_t + \alpha\,\Delta\mathbf{w})$ is the reduction in cost and $T$ is the temperature.
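
A sketch of simulated annealing on the line-fitting example; the starting temperature, cooling rate, proposal spread, and iteration count are assumed values, not from the lecture:

```python
import numpy as np

rng = np.random.default_rng(2)

x = np.array([-1.0, 0.0, 1.0, 2.0, 3.0])
y_tilde = 2.1 * x - 2.99

def cost(w):
    return 0.5 * np.sum((w[0] * x + w[1] - y_tilde) ** 2)

w = np.array([-0.5, 8.0])
J = cost(w)
T, cooling, alpha = 25.0, 0.999, 0.1               # assumed temperature schedule and step size
for _ in range(10000):
    w_new = w + alpha * rng.normal(size=w.shape)   # random, normally distributed proposal
    J_new = cost(w_new)
    dJ = J - J_new                                 # reduction in cost (positive if the move is better)
    # Always accept improvements; accept worse states with probability exp(dJ / T).
    if dJ > 0 or rng.random() < np.exp(dJ / T):
        w, J = w_new, J_new
    T *= cooling                                   # lower the temperature as training proceeds
print(w, J)
```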

Simulated annealing: probability of accepting the update. [Figure: acceptance probability as a function of the change in cost, at three different temperatures.] High temperature increases the chance of accepting an update that results in higher cost. Low temperature reduces the chance of accepting an update that results in higher cost. Start training with high temperature: more energy to jump out of the current minimum. End training with low temperature: no energy to jump out of the current minimum.

Simulated annealing: cooling schedule. [Figure: temperature versus epoch for different starting temperatures and different rates of cooling.]
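
The lecture does not fix a particular schedule; one common, assumed choice is geometric cooling, where the temperature is multiplied by a constant rate every epoch: $T_{t+1} = r\,T_t$ with $0 < r < 1$, so $T_t = T_0\,r^t$. The starting temperature $T_0$ and the rate of cooling $r$ are the two quantities varied in the figure.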

An example: non-linear regression with simulated annealing

Summary and reading:
Learning as parameter search over the cost surface.
Models that are non-linear in parameters tend to have multiple minima.
Random walk: slow, and can't guarantee where it goes.
Gradient descent: very fast, finds local minima.
Simulated annealing: theoretical promise of finding the global minimum (but slow and hard to get to work in practice).
Reading for this lecture: AIMA Chapter 18, Section 4.
Reading for next lecture: AIMA Chapter 18, Sections 7.1 and 7.2.