ON THE DEGREES OF FREEDOM IN RICHLY PARAMETERISED MODELS

Salvatore Ingrassia and Isabella Morlini

COMPSTAT 2004 Symposium, (c) Physica-Verlag/Springer 2004

Key words: Richly parameterised models, small data sets.
COMPSTAT 2004 section: Neural networks.

Abstract: Using richly parameterised models for small data sets can be justified from a theoretical point of view according to some results due to Bartlett [1], which show that the generalization performance of a multilayer perceptron (MLP) depends more on the L1 norm ||c||_1 of the weights between the hidden and the output layer than on the number of parameters in the model. In this paper we investigate the problem of measuring the generalization performance and the complexity of richly parameterised procedures and, drawing on linear model theory, we propose a different notion of degrees of freedom for neural networks and other projection tools. This notion is compatible with similar ideas long associated with smoother-based models (like projection pursuit regression) and can be interpreted using the projection theory of linear models and some geometrical properties of neural networks. The results in this study lead to corrections in some goodness-of-fit statistics such as AIC and BIC/SBC: the number of degrees of freedom in these indexes is set equal to the dimension p of the projection space intrinsically found by the mapping function. An empirical study is presented to illustrate the behavior of the L1 norm ||c||_1 and of the values of some model selection criteria as the value of p in an MLP varies.

1 Introduction

An important issue in statistical modeling concerns so-called indirect measures or virtual sensors, that is, the prediction of variables that are quite expensive to measure (e.g. the viscosity or the concentration of certain chemical species, or some mechanical features) using other variables such as the temperature or the pressure. This problem usually involves some difficulties: the available data set is small and the input-output relation to be estimated is non-linear. A third difficulty arises when there are many predictor variables but, since linearity cannot be assumed, it is quite difficult to reduce the dimensionality of the input by choosing a good subset of predictors or suitable underlying features. When we are provided with a data set of N pairs (x_n, y_n) of m-dimensional input vectors x_n and scalar target values y_n, and the size N of the available sample is small compared with the number of weights of the mapping function, the model is considered overparameterised. Overparameterised models can be justified from a theoretical point of view according to some results due to Bartlett [1], showing that the generalization performance of an MLP depends more on the size of the weights than on the number of weights and, in particular, on the L1 norm ||c||_1 of the weights between the hidden and the output layer.

Bartlett's results suggest a deeper look at the roles of the parameters in a neural network and in similar richly parameterised models. These roles are here interpreted by drawing on the projection theory of linear models and on some geometrical properties shared by neural networks and by statistical tools realizing similar mapping functions.

2 Geometrical properties of the sigmoidal functions

In this section we investigate some geometrical properties of a mapping function of the form:

    f_p(x) = \sum_{k=1}^{p} c_k \tau(a_k' x)                                (1)

Without loss of generality, we assume that the bias term is equal to zero and that the function \tau(\cdot) is sigmoidal and analytic, that is, it can be represented by a power series on some interval (-r, r), where r may be +\infty. The hyperbolic tangent \tau(z) = \tanh(z) and the logistic function \tau(z) = (1 + e^{-z})^{-1} are examples of analytic sigmoidal functions. We point out that the function f_p is a combination of certain transformations of the input data. In particular, f_p realizes: i) a non-linear projection from R^m to R^p given by the sigmoidal function \tau, that is x -> (\tau(a_1' x), ..., \tau(a_p' x)); ii) a linear transformation from R^p to R according to c_1, ..., c_p.

The results in this paper are based on the following theorem, see e.g. Rudin [9]:

Theorem 2.1. Let g be analytic and not identically zero in the interval (-r, r), with r > 0. Then the set of the zeroes of g in (-r, r) is at most countable.

Let x_1 = (x_{11}, ..., x_{1m}), ..., x_p = (x_{p1}, ..., x_{pm}) be p points of R^m, with p > m; evidently these points are linearly dependent since p > m. Let A = (a_{ij}) be a p x m matrix with values in some hypercube [-u, u]^{mp}, for some u > 0; the points Ax_1, ..., Ax_p are then linearly dependent because they are obtained by a linear transformation acting on x_1, ..., x_p. For u = 1/m the points \tau(Ax_1), ..., \tau(Ax_p), where

    \tau(Ax_i) = ( \tau(\sum_{j=1}^{m} a_{1j} x_{ij}), ..., \tau(\sum_{j=1}^{m} a_{pj} x_{ij}) )
               = ( \tau(a_1' x_i), ..., \tau(a_p' x_i) ),        i = 1, ..., p,

are linearly independent for almost all matrices A in [-u, u]^{mp}, according to the following theorem.

Theorem 2.2 [5]. Let x_1, ..., x_p be p distinct points in (-r, r)^m with x_h != 0 (h = 1, ..., p), and let A = (a_{ij}) in [-u, u]^{mp} be a p x m matrix, with u = 1/m. Let \tau be a sigmoidal analytic function on (-r, r), with r > 0. Then the points \tau(Ax_1), ..., \tau(Ax_p) in R^p are linearly independent for almost all matrices A = (a_{ij}) in [-u, u]^{mp}.

This result proves that, given N > m points x_1, ..., x_N in R^m, the transformed points \tau(Ax_1), ..., \tau(Ax_N) generate an over-space of dimension p > m if the matrix A satisfies suitable conditions. In particular, the largest over-space is attained when p = N, that is, when the hidden layer of an MLP has as many units as the number of points in the learning set. Moreover, it gives insight into why neural networks have been shown to work well in the presence of multicollinearity. On this topic De Veaux and Ungar [3] present a case study in which the temperature of a flow is measured by six different devices at various places in a production process: even though the inputs are highly correlated, a better prediction of the response is obtained using a weighted combination of all six predictors rather than choosing the single best measurement having the highest correlation with the response. The next result generalizes Theorem 2.2.

Theorem 2.3. Let L = {(x_1, y_1), ..., (x_N, y_N)} be a given learning set of N i.i.d. realizations of (X, Y) and let f_p = \sum_{k=1}^{p} c_k \tau(a_k' x). If p = N, then the learning error

    \hat{E}(f_p, L) = \sum_{(x_n, y_n) \in L} (y_n - f_p(x_n))^2

is zero for almost all matrices A in [-1/m, 1/m]^{mp}.

Proof. Theorem 2.2 implies that the points \tau(Ax_1), ..., \tau(Ax_N) are linearly independent for almost all matrices A in [-1/m, 1/m]^{mp}, for p <= N. In particular, if p = N the system:

    c_1 \tau(a_1' x_1) + ... + c_N \tau(a_N' x_1) = y_1
        ...                                                                  (2)
    c_1 \tau(a_1' x_N) + ... + c_N \tau(a_N' x_N) = y_N

has a unique solution.

The upper bound on p given above looks too large, but it refers to the worst case. In neural modelling, given a learning set L of N sample data, the right question seems not to be "what is the largest network we can train with L (if any)?" but rather "what is a suitable size, namely the dimension p of the space R^p, necessary for fitting the unknown input-output dependence \phi = E[Y|X]?". This dimension p depends on the geometry of the data, and this explains why neural models may be successfully applied as virtual sensors when the predictors exhibit a high degree of multicollinearity. As a matter of fact, the hidden units break the multicollinearity and exploit the contribution of each single predictor. This is the reason why the optimal size p of the hidden layer is often greater than the number m of predictors.
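
As a numerical illustration of the mapping (1) and of Theorems 2.2-2.3, the following sketch (not part of the original paper; the data, the sizes and the choice of tanh as sigmoid are assumptions made only for the example) draws a random matrix A in [-1/m, 1/m]^{mp}, builds the non-linear projection x -> \tau(Ax) and verifies that, with p = N hidden units, the transformed points are linearly independent and the interpolation system (2) is solved with zero learning error.

```python
# Minimal numerical sketch of equation (1) and Theorem 2.3 on synthetic data.
# All names and sizes are illustrative, not taken from the paper.
import numpy as np

rng = np.random.default_rng(0)
N, m = 20, 5                           # N sample points in R^m
X = rng.uniform(-1.0, 1.0, size=(N, m))
y = rng.normal(size=N)                 # arbitrary scalar targets

p = N                                  # hidden layer as large as the sample
u = 1.0 / m
A = rng.uniform(-u, u, size=(p, m))    # projection weights in [-1/m, 1/m]^{mp}

T = np.tanh(X @ A.T)                   # rows are tau(A x_n): projection R^m -> R^p

# For almost all A the rows of T are linearly independent (Theorem 2.2),
# so with p = N the interpolation system (2) has a unique solution c.
print("rank of T:", np.linalg.matrix_rank(T))       # expected: N
c = np.linalg.solve(T, y)              # output weights c_1, ..., c_N
f = T @ c                              # f_p(x_n) on every learning point
print("learning error:", np.sum((y - f) ** 2))      # ~0, as in Theorem 2.3
```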

3 On counting the degrees of freedom in linear projection models

Consider the standard regression model:

    y = X\beta + \varepsilon                                                 (3)

where y is a vector of N observations of the dependent variable measured about its mean, X is an N x m matrix whose (i, j)-th element is the value of the j-th predictor variable for the i-th observation, again measured about its mean, \beta is a vector of regression coefficients and \varepsilon is the vector of error terms satisfying the usual assumptions of independence and homoscedasticity. The values of the principal components (PCs) for each observation are given by Z = XA, where the (i, k)-th element of Z is the value of the k-th PC for the i-th observation and A is the m x m matrix whose k-th column is the k-th eigenvector of X'X. Because A is orthogonal, X\beta can be rewritten as XAA'\beta = Z\gamma, where \gamma = A'\beta. Equation (3) can therefore be written as:

    y = Z\gamma + \varepsilon                                                (4)

which simply replaces the predictor variables by their PCs in the regression model. Principal component regression can be defined as model (4) or as the reduced model:

    y = Z_p \gamma_p + \varepsilon_p                                         (5)

where \gamma_p is a vector of p elements which are a subset of \gamma, Z_p is an N x p matrix whose columns are the corresponding subset of columns of Z and \varepsilon_p is the appropriate error term, see Section 8.1 in Jolliffe [7]. From an algebraic point of view, computing the PCs of the original predictor variables is a way to overcome the problem of multicollinearity between them, since the PCs are orthogonal. From a geometrical point of view, if we compute the values of the PCs for each of the N observations, we project the N points onto an m-dimensional hyperplane for which the sum of squared perpendicular distances of the observations in the original space is minimized. The total number of estimated quantities in model (4), which results from defining the projection of the centered matrix X onto the space spanned by the m PCs and then estimating the m regression coefficients plus the bias term, is larger than the number m of variables involved in the model. In any case, model (5) is given p + 1 degrees of freedom, that is, the number of retained PCs plus one. The function given by the PC regression model can be rewritten as:

    f(x_n) = \sum_{k=0}^{m} \beta_k (a_k' x_n)                               (6)

where x_n is the m-dimensional vector of the original centered variables, the a_k (k = 1, ..., m) are the normalized eigenvectors of the matrix X'X and the \beta_k are the regression coefficients.
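
The two-stage construction of PC regression in models (4)-(5) can be written compactly; the following sketch (not from the paper: the data and the number p of retained components are placeholders) computes the PC scores from the eigenvectors of X'X and regresses y on the first p of them, with p + 1 degrees of freedom as in the text.

```python
# Principal component regression, models (4)-(5): project the centred X onto
# its principal components and regress y on the p retained scores.
# Illustrative sketch only; the data and p are assumptions for the example.
import numpy as np

rng = np.random.default_rng(1)
N, m, p = 50, 6, 3
X = rng.normal(size=(N, m))
X = X - X.mean(axis=0)                 # predictors measured about their means
y = rng.normal(size=N)
y = y - y.mean()                       # response measured about its mean

# Columns of A are the eigenvectors of X'X, ordered by decreasing eigenvalue.
eigvals, A = np.linalg.eigh(X.T @ X)
order = np.argsort(eigvals)[::-1]
A = A[:, order]

Z = X @ A                              # PC scores, model (4)
Z_p = Z[:, :p]                         # retained components, model (5)
gamma_p, *_ = np.linalg.lstsq(Z_p, y, rcond=None)

rss = np.sum((y - Z_p @ gamma_p) ** 2)
df = p + 1                             # retained PCs plus one (the intercept,
                                       # absorbed here by centring the data)
print("residual sum of squares:", rss, " degrees of freedom:", df)
```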

The function in (6) is formally identical to the mapping function realized by projection pursuit regression or by a network model with the identity function in the second layer. This analogy between PC regression and the mapping function realized by an MLP with p hidden nodes (m <= p <= N) and a linear transfer function in the second layer also arises in accordance with Theorem 2.2 in the previous section. As in PC regression, the first transformation from the input layer to the hidden layer is a geometrical one which projects the points into a new space of dimension p <= N. This transformation is non-linear and its optimality is not well established: the important point here is that in the new space the points are linearly independent (Theorem 2.2). A further difference between PC regression and the mapping function realized by the MLP, besides the type of geometrical projection (linear in the first case, non-linear in the second) and the maximum dimension of this space (equal to m for the PCs and lying in the interval [m, N] for the neural network), is that for PC regression the projection matrix and the regression parameters are estimated in a two-stage procedure, while for the network they are estimated simultaneously.

In neural network modeling the dimension of the model, i.e. the number p of neurons in the hidden layer, is often chosen according to model selection criteria listed in the summary reports of many statistical software packages for data mining. The linkage between the generalization error and the empirical error, used in these model selection criteria, has been approached in Ingrassia and Morlini [6] on the basis of the Vapnik-Chervonenkis theory [10]. The seminal work on model selection is rooted in the parametric statistics literature and is quite vast, but it must be noted that, although model selection techniques for parametric models have been widely used over the past 30 years, surprisingly little work has been done on the application of these techniques in a semi-parametric or non-parametric context. Such goodness-of-fit statistics are quite simple to compute and, even if the underlying theory does not hold for neural networks, the rule among users is to consider them as crude estimates of the generalization error and thus to apply these methods to very complex models, regardless of their parametric framework. These criteria, here denoted by \Pi, are an extension of maximum likelihood and have the following form:

    \Pi = \hat{E}(f_K) + C_K                                                 (7)

where the term \hat{E}(f_K) is the deviance of the model f_K and C_K is a complexity term representing a penalty which grows as the number K of degrees of freedom in the model increases: if the model f_K is too simple it will give a large value for the criterion because the residual training error is large, while a model f_K which is too complex will have a large value for the criterion because the complexity term is large. Typical indexes include the Akaike Information Criterion (AIC), the Schwarz Bayesian Information Criterion (BIC or SBC), the Final Prediction Error (FPE), the Generalized Cross-Validation error (GCV) and the well-established Unbiased Estimate of the Variance (UEV); for a review of these indexes we refer to Ingrassia and Morlini [6].

The classical statistical parametric viewpoint, in which the dimensionality of the model complexity is given by K = W, i.e. the number W of all parameters defining the mapping function, does not seem to apply to flexible non-parametric or semi-parametric models, in which the adaptive parameters are not on the same level and have different interpretations, as we have seen in the previous sections. That the degrees of freedom of a neural model should differ from W has been remarked by many authors, see e.g. Hodges and Sargent [4] and Ye [11]. In the present study we propose an easy correction to the model selection criteria. According to the analogy with PC regression, for models of the form

    f_p(x) = \sum_{k=1}^{p} c_k \tau(a_k' x + b_k) + c_0

we should consider K = p + 1 rather than K = W = p(m + 2) + 1, that is, the dimension of the projection space plus one. In this case both the FPE and the UEV are never negative (as may happen when K = W = p(m + 2) + 1).
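
A minimal sketch of the effect of this correction, using common textbook forms of the criteria (the paper does not reproduce its exact formulas, so the definitions below are assumptions made for the illustration): the same deviance is penalised with K = p + 1 and with K = W = p(m + 2) + 1.

```python
# Criteria of the form (7): deviance plus a complexity penalty growing with the
# degrees of freedom K. The formulas are common textbook versions (assumed, not
# quoted from the paper) and serve only to compare K = p + 1 with K = W.
import numpy as np

def criteria(rss, N, K):
    """AIC, BIC/SBC, FPE and GCV for residual sum of squares rss,
    sample size N and K degrees of freedom."""
    aic = N * np.log(rss / N) + 2 * K
    bic = N * np.log(rss / N) + K * np.log(N)
    fpe = (rss / N) * (N + K) / (N - K)
    gcv = (rss / N) / (1 - K / N) ** 2
    return aic, bic, fpe, gcv

N, m, p, rss = 50, 10, 7, 3.2          # illustrative values only
K_proposed = p + 1                     # dimension of the projection space plus one
K_classical = p * (m + 2) + 1          # total number of network weights W

print("K = p + 1:       ", criteria(rss, N, K_proposed))
print("K = p(m + 2) + 1:", criteria(rss, N, K_classical))
# With K_classical = 85 > N = 50 the factor (N - K) is negative, so FPE turns
# negative: the pathology that the correction K = p + 1 avoids.
```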

4 A case study

In this section we present some numerical results in order to investigate the behavior of Bartlett's constant ||c||_1 and its relation with the learning error \hat{E}(f_p, L) and the test error \hat{E}(f_p, T). We consider the polymer data set modeled by De Veaux et al. [2] by means of an MLP with 18 hidden units. This data set contains 61 observations on 10 predictors, concerning measurements of controlled variables in a polymer process plant, and a response concerning the output of the plant (data are from ftp.cis.upenn.edu in pub/ungar/chemdata). The data exhibit a quite large degree of multicollinearity, as shown by the variance inflation factors (VIF) reported for the predictors X_1, X_3, X_4, X_5, X_6, X_7, X_8 and X_9. In general, VIFs larger than 10 imply serious computational problems for many statistical tools [8]. As in De Veaux et al. [2], we use 50 observations for the learning set and 11 for the test set. Here, however, we consider 100 different samples with different observations in the training and test sets. We train networks with increasing numbers of hidden units, from p = 2 to p = 25. For each p we train the network 1000 times, varying the sample and the initial weights; we adopt either weight decay or early stopping as regularization technique. We then retain the 100 networks with the smallest test errors. The distribution of the learning error vs. the number p is plotted in Figure 1a) using boxplots. In Figure 1b) we plot the distribution of ||c||_1 vs. p using weight decay. Similar distributions are obtained with early stopping.
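
A sketch of one replication of this protocol follows (not the authors' code: scikit-learn's MLPRegressor and the chosen hyperparameters are assumptions standing in for the original implementation, and only the weight-decay variant is shown).

```python
# One replication of the case-study protocol: split the data, fit MLPs with an
# increasing number p of hidden units under weight decay, and record the
# learning error, the test error and Bartlett's constant ||c||_1 (the L1 norm
# of the hidden-to-output weights). Sketch only; estimator and settings are
# assumptions, not the authors' original implementation.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

def run_once(X, y, seed=0):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=11, random_state=seed)        # 50 learning, 11 test points
    results = []
    for p in range(2, 26):                             # p = 2, ..., 25 hidden units
        net = MLPRegressor(hidden_layer_sizes=(p,), activation="tanh",
                           alpha=1e-2,                 # weight-decay penalty
                           solver="lbfgs", max_iter=5000, random_state=seed)
        net.fit(X_train, y_train)
        learn_err = np.sum((y_train - net.predict(X_train)) ** 2)
        test_err = np.sum((y_test - net.predict(X_test)) ** 2)
        c_norm = np.abs(net.coefs_[1]).sum()           # ||c||_1
        results.append((p, learn_err, test_err, c_norm))
    return results
```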

[Figure 1: Polymer data: distribution of the learning error and of ||c||_1 vs. p.]

Figure 1b) shows that there is a quite large number of models with different architectures (i.e. a different number p of hidden units) having the same value of the constant ||c||_1 introduced in Bartlett [1]. Table 1 reports, for each group of the 100 best models, the mean values of the training error \hat{E}(f_p; L), the test error \hat{E}(f_p; T), the L1 norm ||c||_1 and the error-complexity measures AIC, BIC/SBC, GCV and FPE computed with K = p + 1.

[Table 1: Polymer data: summary statistics for different values of p. Columns: p, \hat{E}(f_p; L), \hat{E}(f_p; T), ||c||_1, K, BIC, AIC, FPE, GCV.]

The values of p leading to the smallest mean values are:

    AIC: 9, 11    BIC: 3, 5, 7    GCV: 5, 7    FPE: 7, 9, 11

Thus, on the basis of these statistics, models with p = 7, 9, 11 neurons in the hidden layer should be selected. In addition, we note that p = 10 is the value with the smallest absolute difference between the mean training and test errors. This study confirms that ||c||_1 better describes the complexity of a model of the form (1) than the total number of parameters and is a more suitable characterization of the mapping function.

References

[1] Bartlett P.L. (1998). The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network. IEEE Transactions on Information Theory, 44 (2).
[2] De Veaux R.D., Schumi J., Schweinsberg J., Ungar L.H. (1998). Prediction intervals for neural networks via nonlinear regression. Technometrics, 40 (4).
[3] De Veaux R.D., Ungar L.H. (1994). Multicollinearity: a tale of two nonparametric regressions. In Selecting Models from Data: AI and Statistics IV (Eds. P. Cheeseman & R.W. Oldford).
[4] Hodges J., Sargent D. (2001). Counting degrees of freedom in hierarchical and other richly parameterised models. Biometrika, 88.
[5] Ingrassia S. (1999). Geometrical aspects of discrimination by multilayer perceptrons. Journal of Multivariate Analysis, 68.
[6] Ingrassia S., Morlini I. (2002). Neural network modeling for small data sets. Submitted for publication.
[7] Jolliffe I.T. (1986). Principal Component Analysis. Springer-Verlag, New York.
[8] Morlini I. (2002). Facing multicollinearity in data mining. Atti della XLI Riunione Scientifica della Società Italiana di Statistica, Milano-Bicocca.
[9] Rudin W. (1966). Real and Complex Analysis. McGraw-Hill, New York.
[10] Vapnik V. (1998). Statistical Learning Theory. John Wiley & Sons, New York.
[11] Ye J. (1998). On measuring and correcting the effects of data mining and model selection. Journal of the American Statistical Association, 93 (441).

Address: S. Ingrassia, Dipartimento di Economia e Statistica, Università della Calabria, Arcavacata di Rende, Italy; I. Morlini, Dipartimento di Economia, Università di Parma, Parma, Italy.
E-mail: s.ingrassia@unical.it, isabella.morlini@unipr.it
