ON THE DEGREES OF FREEDOM IN RICHLY PARAMETERISED MODELS
COMPSTAT 2004 Symposium © Physica-Verlag/Springer 2004

Salvatore Ingrassia and Isabella Morlini

Key words: Richly parameterised models, small data sets.
COMPSTAT 2004 section: Neural networks.

Abstract: Using richly parameterised models for small data sets can be justified from a theoretical point of view according to some results due to Bartlett [1], which show that the generalization performance of a multi-layer perceptron (MLP) depends more on the L_1 norm ||c||_1 of the weights between the hidden and the output layer than on the number of parameters in the model. In this paper we investigate the problem of measuring the generalization performance and the complexity of richly parameterised procedures and, drawing on linear model theory, we propose a different notion of degrees of freedom for neural networks and other projection tools. This notion is compatible with similar ideas long associated with smoother-based models (like projection pursuit regression), and can be interpreted using the projection theory of linear models and some geometrical properties of neural networks. The results in this study lead to corrections in goodness-of-fit statistics like AIC and BIC/SBC: the number of degrees of freedom in these indexes is set equal to the dimension p of the projection space intrinsically found by the mapping function. An empirical study is presented in order to illustrate the behavior of the L_1 norm ||c||_1 and of the values of some model selection criteria as p varies in an MLP.

1 Introduction

An important issue in statistical modeling concerns so-called indirect measures or virtual sensors: the prediction of variables that are quite expensive to measure (e.g. the viscosity or the concentration of certain chemical species, or some mechanical features) using other variables such as the temperature or the pressure.
This problem usually involves some difficulties: the available data set is small and the input-output relation to be estimated is non-linear. A third difficulty arises when there are many predictor variables: since linearity cannot be assumed, it is quite difficult to reduce the dimensionality of the input by choosing a good subset of predictors or suitable underlying features. When we are provided with a data set of N pairs (x_n, y_n) of m-dimensional input vectors x_n and scalar target values y_n, and the size N of the available sample is small compared with the number of weights of the mapping function, the model is considered overparameterised. Overparameterised models can be justified from a theoretical point of view according to some results due to Bartlett [1], showing that the generalization performance of an MLP depends more on the size of the
weights than on the number of weights and, in particular, on the L_1 norm ||c||_1 of the weights between the hidden and the output layer. Bartlett's results suggest a deeper look at the roles of the parameters in a neural network and in similar richly parameterised models. The roles of these parameters are interpreted here by drawing on the projection theory of linear models and by means of some geometrical properties shared by neural networks and statistical tools realizing similar mapping functions.

2 Geometrical properties of the sigmoidal functions

In this section we investigate some geometrical properties of a mapping function of the form:

    f_p(x) = \sum_{k=1}^{p} c_k \tau(a_k' x).    (1)

Without loss of generality, we assume that the bias term is equal to zero and that the function τ(·) is sigmoidal and analytic, that is, it can be represented by a power series on some interval (−r, r), where r may be +∞. The hyperbolic tangent τ(z) = tanh(z) and the logistic function τ(z) = (1 + e^{−z})^{−1} are examples of analytic sigmoidal functions. We point out that the function f_p is a combination of certain transformations of the input data. In particular, f_p realizes: i) a non-linear projection from R^m to R^p given by the sigmoidal function τ, that is x ↦ (τ(a_1' x), ..., τ(a_p' x)); ii) a linear transformation from R^p to R according to c_1, ..., c_p. The results in this paper are based on the following theorem, see e.g. Rudin [9]:

Theorem 2.1. Let g be analytic and not identically zero in the interval (−r, r), with r > 0. Then the set of the zeroes of g in (−r, r) is at most countable.

Let x_1 = (x_{11}, ..., x_{1m}), ..., x_p = (x_{p1}, ..., x_{pm}) be p points of R^m, with p > m; evidently these points are linearly dependent since p > m. Let A = (a_{ij}) be a p × m matrix with values in some hypercube [−u, u]^{mp}, for some u > 0; the points Ax_1, ..., Ax_p are then also linearly dependent, because they are obtained by a linear transformation acting on x_1, ..., x_p.
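The two-step structure i)–ii) of the mapping f_p in (1) can be sketched numerically as follows. This is a minimal illustration, not the authors' code; the dimensions, seed and data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

m, p = 4, 6                      # input dimension m, projection dimension p
u = 1.0 / m                      # hypercube bound used later in the text
A = rng.uniform(-u, u, (p, m))   # rows a_1', ..., a_p' of the matrix A
c = rng.normal(size=p)           # output weights c_1, ..., c_p

def f_p(x, A, c, tau=np.tanh):
    """f_p(x) = sum_k c_k tau(a_k' x), as in Eq. (1)."""
    z = tau(A @ x)               # step i): non-linear projection R^m -> R^p
    return c @ z                 # step ii): linear transformation R^p -> R

x = rng.normal(size=m)
print(f_p(x, A, c))
```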
For u = 1/m, the points τ(Ax_1), ..., τ(Ax_p), where:

    τ(Ax_i) = ( τ(\sum_{j=1}^{m} a_{1j} x_{ij}), ..., τ(\sum_{j=1}^{m} a_{pj} x_{ij}) )
            = ( τ(a_1' x_i), ..., τ(a_p' x_i) ),    i = 1, ..., p,

are linearly independent for almost all matrices A ∈ [−u, u]^{mp}, according to the following theorem.
Theorem 2.2. [5] Let x_1, ..., x_p be p distinct points in (−r, r)^m with x_h ≠ 0 (h = 1, ..., p) and let A = (a_{ij}) ∈ [−u, u]^{mp} be a p × m matrix, with u = 1/m. Let τ be a sigmoidal analytic function on (−r, r), with r > 0. Then the points τ(Ax_1), ..., τ(Ax_p) ∈ R^p are linearly independent for almost all matrices A = (a_{ij}) ∈ [−u, u]^{mp}.

This result proves that, given N > m points x_1, ..., x_N ∈ R^m, the transformed points τ(Ax_1), ..., τ(Ax_N) generate an over-space of dimension p > m if the matrix A satisfies suitable conditions. In particular, the largest over-space is attained when p = N, that is, when the hidden layer of an MLP has as many units as the number of points in the learning set. Moreover, it gives insight into why neural networks have been proved to work well in the presence of multicollinearity. On this topic De Veaux & Ungar [3] present a case study in which the temperature of a flow is measured by six different devices at various places in a production process: even though the inputs are highly correlated, a better prediction of the response is obtained using a weighted combination of all six predictors rather than choosing the single best measurement having the highest correlation with the response. The next result follows from Theorem 2.2.

Theorem 2.3. Let L = {(x_1, y_1), ..., (x_N, y_N)} be a given learning set of N i.i.d. realizations of (X, Y) and let f_p = \sum_{k=1}^{p} c_k τ(a_k' x). If p = N, then the learning error

    Ê(f_p, L) = \sum_{(x_n, y_n) ∈ L} (y_n − f_p(x_n))^2

is zero for almost all matrices A ∈ [−1/m, 1/m]^{mp}.

Proof. Theorem 2.2 with p = N implies that the points τ(Ax_1), ..., τ(Ax_N) ∈ R^N are linearly independent for almost all matrices A ∈ [−1/m, 1/m]^{mN}. In particular, if p = N the system:

    c_1 τ(a_1' x_1) + ... + c_N τ(a_N' x_1) = y_1
    ...
    c_1 τ(a_1' x_N) + ... + c_N τ(a_N' x_N) = y_N    (2)

has a unique solution.

The upper bound on p given above looks too large, but it refers to the worst case.
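Theorem 2.3 can be checked numerically. The sketch below uses illustrative synthetic data (seed, sizes and targets are assumptions, not taken from the paper): with p = N hidden units and A drawn from [−1/m, 1/m]^{mN}, the matrix whose rows are τ(Ax_n) is non-singular for almost all A, so system (2) can be solved exactly and the learning error vanishes.

```python
import numpy as np

rng = np.random.default_rng(1)
N, m = 8, 3
X = rng.normal(size=(N, m))          # N learning points in R^m
y = rng.normal(size=N)               # arbitrary scalar targets

p = N                                # as many hidden units as data points
A = rng.uniform(-1/m, 1/m, (p, m))   # A in [-1/m, 1/m]^{mN}
T = np.tanh(X @ A.T)                 # N x N matrix, row n is tau(Ax_n)

c = np.linalg.solve(T, y)            # unique solution of system (2)
learning_error = np.sum((T @ c - y) ** 2)
print(learning_error)                # zero up to floating-point round-off
```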
In neural modelling, given a learning set L of N sample data, the right question seems not to be "what is the largest network we can train with L (if any)?" but rather "what is a suitable size, namely the dimension p of the space R^p, necessary for fitting the unknown input-output dependence φ = E[Y|X]?". This dimension p depends on the geometry of the data, and this explains why neural models may be successfully applied as virtual sensors when the predictors exhibit a high degree of multicollinearity. As a matter of fact, the hidden units break the multicollinearity and exploit the contribution of each single predictor. This is the reason why the optimal size p of the hidden layer is often greater than the number m of predictors.
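The claim that the hidden units break multicollinearity can be illustrated with a small sketch (illustrative data and seed, not from the paper): a perfectly collinear, rank-one design X still yields transformed points τ(Ax_n) of higher rank, since Theorem 2.2 only requires the points to be distinct and non-zero.

```python
import numpy as np

rng = np.random.default_rng(4)
N, m = 6, 3
t = rng.normal(size=(N, 1))
X = t @ rng.normal(size=(1, m))     # all N points lie on one line: rank 1

A = rng.uniform(-1/m, 1/m, (N, m))  # p = N rows, as in Theorem 2.2
T = np.tanh(X @ A.T)                # transformed points tau(Ax_n)

print(np.linalg.matrix_rank(X))     # 1: the inputs are perfectly collinear
print(np.linalg.matrix_rank(T))     # larger: the sigmoid restores independence
```

In exact arithmetic the rank of T is N for almost all A; numerically the smallest singular values can sit near machine precision, but the rank always exceeds that of X.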
3 On counting the degrees of freedom in linear projection models

Consider the standard regression model:

    y = Xβ + ε    (3)

where y is a vector of N observations of the dependent variable measured about its mean, X is an N × m matrix whose (i, j)th element is the value of the jth predictor variable for the ith observation, again measured about its mean, β is a vector of regression coefficients and ε is the vector of the error terms satisfying the usual assumptions of independence and homoscedasticity. The values of the principal components (PCs) for each observation are given by:

    Z = XA

where the (i, k)th element of Z is the value of the kth PC for the ith observation and A is the m × m matrix whose kth column is the kth eigenvector of X'X. Because A is orthogonal, Xβ can be rewritten as XAA'β = Zγ, where γ = A'β. Equation (3) can therefore be written as:

    y = Zγ + ε    (4)

which simply replaces the predictor variables by their PCs in the regression model. Principal component regression can be defined as the model (4) or as the reduced model:

    y = Z_p γ_p + ε_p    (5)

where γ_p is a vector of p elements which are a subset of γ, Z_p is an N × p matrix whose columns are the corresponding subset of columns of Z and ε_p is the appropriate error term; see Section 8.1 in Jolliffe [7]. From an algebraic point of view, computing the PCs of the original predictor variables is a way to overcome the problem of multicollinearity between them, since the PCs are orthogonal. From a geometrical point of view, if we compute the values of the PCs for each of the N observations, we project the N points onto an m-dimensional hyperplane for which the sum of squared perpendicular distances of the observations in the original space is minimized.
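Models (4)–(5) can be sketched compactly: compute Z = XA from the eigenvectors of X'X and regress y on the first p columns of Z. The data, sizes and seed below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
N, m, p = 30, 5, 2
X = rng.normal(size=(N, m))
X = X - X.mean(axis=0)                 # predictors measured about their means
y = X @ rng.normal(size=m) + 0.1 * rng.normal(size=N)
y = y - y.mean()                       # dependent variable about its mean

eigval, A = np.linalg.eigh(X.T @ X)    # columns of A: eigenvectors of X'X
A = A[:, np.argsort(eigval)[::-1]]     # order by decreasing eigenvalue

Z = X @ A                              # values of the PCs, model (4)
Z_p = Z[:, :p]                         # retain the first p PCs, model (5)
gamma_p, *_ = np.linalg.lstsq(Z_p, y, rcond=None)

print(gamma_p.shape)                   # (2,): the p elements of gamma_p
```

Note that the columns of Z are orthogonal, which is exactly what removes the multicollinearity discussed in the text.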
The total number of estimated quantities in model (4), which results from defining the projection of the centered matrix X onto the space spanned by the m PCs and then estimating the m regression coefficients plus the bias term, is larger than the number m of variables involved in the model. In any case, model (5) is given p + 1 degrees of freedom, that is, the number of retained PCs plus one. The function given by the PC regression model can be rewritten as:

    f(x_n) = β_0 + \sum_{k=1}^{p} β_k (a_k' x_n)    (6)
where x_n is the m-dimensional vector of the original centered variables, the a_k (k = 1, ..., p) are normalized eigenvectors of the matrix X'X and the β_k are the regression coefficients. The function in (6) is formally identical to the mapping function realized by projection pursuit regression or by a network model with the identity function in the second layer. This analogy between PC regression and the mapping function realized by an MLP with p hidden nodes (m ≤ p ≤ N) and a linear transfer function in the second layer also arises in accordance with Theorem 2.3 in the previous section. As in PC regression, the first transformation, from the input layer to the hidden layer, is a geometrical one which projects the points into a new space of dimension p ≤ N. This transformation is non-linear and its optimality is not well established: the important point here is that in the new space the points are linearly independent (cf. Theorem 2.2). A difference between PC regression and the mapping function realized by the MLP, besides the type of geometrical projection (linear in the first case, non-linear in the second) and the maximum dimension of this space (equal to m for the PCs and in the interval [m, N] for the neural network), is that the estimation of the projection matrix and of the regression parameters is carried out in a two-stage procedure for PC regression, while it is simultaneous for the network. In neural network modeling the dimension of the model, i.e. the number p of neurons in the hidden layer, is often chosen according to some model selection criteria listed in the summary reports of many statistical software packages for data mining. The linkage between the generalization error and the empirical error, used in these model selection criteria, has been approached in Ingrassia and Morlini [6] on the basis of the Vapnik-Chervonenkis theory [10].
The seminal work on model selection is based on the parametric statistics literature and is quite vast, but it must be noted that, although model selection techniques for parametric models have been widely used in the past 30 years, surprisingly little work has been done on the application of these techniques in a semi-parametric or non-parametric context. Such goodness-of-fit statistics are quite simple to compute and, even though the underlying theory does not hold for neural networks, the practice among users is to consider them as crude estimates of the generalization error and thus to apply these methods to very complex models, regardless of their parametric framework. These criteria, here denoted by Π, are an extension of the maximum likelihood approach and have the following form:

    Π = Ê(f_K) + C_K    (7)

where the term Ê(f_K) is the deviance of the model f_K and C_K is a complexity term representing a penalty which grows as the number K of degrees of freedom in the model increases: if the model f_K is too simple it will give a large value for the criterion because the residual training error is large, while a model f_K which is too complex will have a large value for the criterion because the complexity term is large. Typical indexes include the Akaike Information Criterion (AIC), the Schwarz Bayesian Information Criterion (BIC
or SBC), the Final Prediction Error (FPE), the Generalized Cross Validation error (GCV) and the well-established Unbiased Estimate of the Variance (UEV); for a review of these indexes we refer to Ingrassia and Morlini [6]. The classical parametric statistical viewpoint, in which the model complexity is given by K = W, i.e. the number W of all parameters defining the mapping function, does not seem to apply to flexible non-parametric or semi-parametric models, in which the adaptive parameters are not on the same level and have different interpretations, as we have seen in the previous sections. That the degrees of freedom of a neural model should differ from W has been remarked by many authors; see e.g. Hodges and Sargent [4] and Ye [11]. In the present study we propose an easy correction to the model selection criteria. According to the analogy with PC regression, for models of the form f_p(x) = \sum_{k=1}^{p} c_k τ(a_k' x + b_k) + c_0 we should take K = p + 1 rather than K = W = p(m + 2) + 1, that is, the dimension of the projection space plus one. In this case both the FPE and the UEV are never negative (as may happen when K = W = p(m + 2) + 1).

4 A case study

In this section we present some numerical results in order to investigate the behavior of Bartlett's constant ||c||_1 and its relation with the learning error Ê(f_p, L) and the test error Ê(f_p, T). We consider the polymer data set modeled by De Veaux et al. [2] by means of an MLP with 18 hidden units. This data set contains 61 observations with 10 predictors concerning measurements of controlled variables in a polymer process plant and a response concerning the output of the plant (data are from ftp.cis.upenn.edu in pub/ungar/chemdata).
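The proposed correction, K = p + 1 instead of K = W = p(m+2)+1, can be sketched as follows. The Gaussian-deviance formulas for AIC, BIC and FPE below are standard textbook versions assumed for illustration, not copied from the paper, and the numbers are illustrative.

```python
import numpy as np

def criteria(sse, N, p, m):
    """Model selection criteria of the form (7) with corrected df K = p + 1."""
    K = p + 1                          # corrected degrees of freedom
    W = p * (m + 2) + 1                # total number of network weights
    aic = N * np.log(sse / N) + 2 * K
    bic = N * np.log(sse / N) + K * np.log(N)
    fpe = (sse / N) * (N + K) / (N - K)   # positive whenever K < N
    return aic, bic, fpe, K, W

aic, bic, fpe, K, W = criteria(sse=2.5, N=50, p=10, m=10)
print(K, W)    # 11 121: with K = W = 121 > N = 50, the FPE would go negative
```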
The data exhibit a quite large degree of multicollinearity, as shown by the variance inflation factors (VIF) of the predictors X_1, X_3, X_4, X_5, X_6, X_7, X_8 and X_9. In general, VIFs larger than 10 imply serious computational problems for many statistical tools [8]. As in De Veaux et al. [2], we use 50 observations for the learning set and 11 for the test set. Here, however, we consider 100 different samples with different observations in the training and the test sets. We train networks with increasing numbers of hidden units, from p = 2 to p = 25. For each p we train the network 1000 times, varying the sample and the initial weights; we adopt either the weight decay or the early stopping regularization technique. We then retain the 100 networks with the smallest test errors. The distribution of the learning error vs. the number p is plotted in Figure 1 a) using boxplots. In Figure 1 b) we plot the distribution of ||c||_1 vs. p using weight decay. Similar distributions are
obtained with early stopping.

[Figure 1. Polymer data: distribution of the learning error and of ||c||_1 vs. p.]

Figure 1 b) shows that there is a quite large number of models with different architectures (i.e. different numbers p) having the same value of the constant ||c||_1 introduced in Bartlett [1]. Table 1 reports, for each group of 100 best models, the mean values of the training error Ê(f_p; L), the test error Ê(f_p; T), the L_1 norm ||c||_1 and the error-complexity measures AIC, BIC/SBC, GCV and FPE computed with K = p + 1. The values of p leading to the smallest mean values are:

    AIC: 9, 11    BIC: 3, 5, 7    GCV: 5, 7    FPE: 7, 9, 11

and thus, on the basis of these statistics, models with p = 7, 9, 11 neurons in the hidden layer should be selected. In addition, we note that p = 10 is the value with the smallest absolute difference between the means of the training and test errors. This study confirms that ||c||_1 describes the complexity of a model of the form (1) better than the total number of parameters does, and is a more suitable characterization of the mapping function.

References

[1] Bartlett P.L. (1998). The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network. IEEE Transactions on Information Theory, 44 (2).
[2] De Veaux R.D., Schumi J., Schweinsberg J., Ungar L.H. (1998). Prediction intervals for neural networks via nonlinear regression. Technometrics, 40 (4).
[3] De Veaux R.D., Ungar L.H. (1994). Multicollinearity: a tale of two non-parametric regressions. In Selecting Models from Data: AI and Statistics IV (Eds. P. Cheeseman & R.W. Oldford).
[4] Hodges J., Sargent D. (2001). Counting degrees of freedom in hierarchical and other richly parameterised models. Biometrika, 88.
[Table 1. Polymer data: summary statistics (Ê(f_p; L), Ê(f_p; T), ||c||_1, K, BIC, AIC, FPE, GCV) for different values of p.]

[5] Ingrassia S. (1999). Geometrical aspects of discrimination by multilayer perceptrons. Journal of Multivariate Analysis, 68.
[6] Ingrassia S., Morlini I. (2002). Neural network modeling for small data sets. Submitted for publication.
[7] Jolliffe I.T. (1986). Principal Component Analysis. Springer-Verlag, New York.
[8] Morlini I. (2002). Facing multicollinearity in data mining. Atti della XLI Riunione Scientifica della Società Italiana di Statistica, Milano-Bicocca.
[9] Rudin W. (1966). Real and Complex Analysis. McGraw-Hill, New York.
[10] Vapnik V. (1998). Statistical Learning Theory. John Wiley & Sons, New York.
[11] Ye J. (1998). On measuring and correcting the effects of data mining and model selection. Journal of the American Statistical Association, 93 (441).

Address: S. Ingrassia, Dipartimento di Economia e Statistica, Università della Calabria, Arcavacata di Rende, Italy; I. Morlini, Dipartimento di Economia, Università di Parma, Parma, Italy.
E-mail: s.ingrassia@unical.it, isabella.morlini@unipr.it
Data quality in Accounting Information Systems Comparing Several Data Mining Techniques Erjon Zoto Department of Statistics and Applied Informatics Faculty of Economy, University of Tirana Tirana, Albania
More informationMarketing Mix Modelling and Big Data P. M Cain
1) Introduction Marketing Mix Modelling and Big Data P. M Cain Big data is generally defined in terms of the volume and variety of structured and unstructured information. Whereas structured data is stored
More informationIFT3395/6390. Machine Learning from linear regression to Neural Networks. Machine Learning. Training Set. t (3.5, -2,..., 127, 0,...
IFT3395/6390 Historical perspective: back to 1957 (Prof. Pascal Vincent) (Rosenblatt, Perceptron ) Machine Learning from linear regression to Neural Networks Computer Science Artificial Intelligence Symbolic
More informationLinear Threshold Units
Linear Threshold Units w x hx (... w n x n w We assume that each feature x j and each weight w j is a real number (we will relax this later) We will study three different algorithms for learning linear
More informationSimple Linear Regression Inference
Simple Linear Regression Inference 1 Inference requirements The Normality assumption of the stochastic term e is needed for inference even if it is not a OLS requirement. Therefore we have: Interpretation
More informationCS 2750 Machine Learning. Lecture 1. Machine Learning. http://www.cs.pitt.edu/~milos/courses/cs2750/ CS 2750 Machine Learning.
Lecture Machine Learning Milos Hauskrecht milos@cs.pitt.edu 539 Sennott Square, x5 http://www.cs.pitt.edu/~milos/courses/cs75/ Administration Instructor: Milos Hauskrecht milos@cs.pitt.edu 539 Sennott
More informationComponent Ordering in Independent Component Analysis Based on Data Power
Component Ordering in Independent Component Analysis Based on Data Power Anne Hendrikse Raymond Veldhuis University of Twente University of Twente Fac. EEMCS, Signals and Systems Group Fac. EEMCS, Signals
More informationSystems of Linear Equations
Systems of Linear Equations Beifang Chen Systems of linear equations Linear systems A linear equation in variables x, x,, x n is an equation of the form a x + a x + + a n x n = b, where a, a,, a n and
More information5. Multiple regression
5. Multiple regression QBUS6840 Predictive Analytics https://www.otexts.org/fpp/5 QBUS6840 Predictive Analytics 5. Multiple regression 2/39 Outline Introduction to multiple linear regression Some useful
More informationISOMETRIES OF R n KEITH CONRAD
ISOMETRIES OF R n KEITH CONRAD 1. Introduction An isometry of R n is a function h: R n R n that preserves the distance between vectors: h(v) h(w) = v w for all v and w in R n, where (x 1,..., x n ) = x
More informationMultivariate Analysis (Slides 13)
Multivariate Analysis (Slides 13) The final topic we consider is Factor Analysis. A Factor Analysis is a mathematical approach for attempting to explain the correlation between a large set of variables
More informationIntroduction: Overview of Kernel Methods
Introduction: Overview of Kernel Methods Statistical Data Analysis with Positive Definite Kernels Kenji Fukumizu Institute of Statistical Mathematics, ROIS Department of Statistical Science, Graduate University
More informationData Mining mit der JMSL Numerical Library for Java Applications
Data Mining mit der JMSL Numerical Library for Java Applications Stefan Sineux 8. Java Forum Stuttgart 07.07.2005 Agenda Visual Numerics JMSL TM Numerical Library Neuronale Netze (Hintergrund) Demos Neuronale
More informationLinear Classification. Volker Tresp Summer 2015
Linear Classification Volker Tresp Summer 2015 1 Classification Classification is the central task of pattern recognition Sensors supply information about an object: to which class do the object belong
More information8. Linear least-squares
8. Linear least-squares EE13 (Fall 211-12) definition examples and applications solution of a least-squares problem, normal equations 8-1 Definition overdetermined linear equations if b range(a), cannot
More informationChapter 6. Orthogonality
6.3 Orthogonal Matrices 1 Chapter 6. Orthogonality 6.3 Orthogonal Matrices Definition 6.4. An n n matrix A is orthogonal if A T A = I. Note. We will see that the columns of an orthogonal matrix must be
More informationPredict Influencers in the Social Network
Predict Influencers in the Social Network Ruishan Liu, Yang Zhao and Liuyu Zhou Email: rliu2, yzhao2, lyzhou@stanford.edu Department of Electrical Engineering, Stanford University Abstract Given two persons
More informationUsing Mixtures-of-Distributions models to inform farm size selection decisions in representative farm modelling. Philip Kostov and Seamus McErlean
Using Mixtures-of-Distributions models to inform farm size selection decisions in representative farm modelling. by Philip Kostov and Seamus McErlean Working Paper, Agricultural and Food Economics, Queen
More informationExample: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.
Statistical Learning: Chapter 4 Classification 4.1 Introduction Supervised learning with a categorical (Qualitative) response Notation: - Feature vector X, - qualitative response Y, taking values in C
More informationLinear Models for Classification
Linear Models for Classification Sumeet Agarwal, EEL709 (Most figures from Bishop, PRML) Approaches to classification Discriminant function: Directly assigns each data point x to a particular class Ci
More informationData Mining Lab 5: Introduction to Neural Networks
Data Mining Lab 5: Introduction to Neural Networks 1 Introduction In this lab we are going to have a look at some very basic neural networks on a new data set which relates various covariates about cheese
More information160 CHAPTER 4. VECTOR SPACES
160 CHAPTER 4. VECTOR SPACES 4. Rank and Nullity In this section, we look at relationships between the row space, column space, null space of a matrix and its transpose. We will derive fundamental results
More informationRegression III: Advanced Methods
Lecture 16: Generalized Additive Models Regression III: Advanced Methods Bill Jacoby Michigan State University http://polisci.msu.edu/jacoby/icpsr/regress3 Goals of the Lecture Introduce Additive Models
More informationArtificial Neural Networks and Support Vector Machines. CS 486/686: Introduction to Artificial Intelligence
Artificial Neural Networks and Support Vector Machines CS 486/686: Introduction to Artificial Intelligence 1 Outline What is a Neural Network? - Perceptron learners - Multi-layer networks What is a Support
More informationMATRIX ALGEBRA AND SYSTEMS OF EQUATIONS
MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS Systems of Equations and Matrices Representation of a linear system The general system of m equations in n unknowns can be written a x + a 2 x 2 + + a n x n b a
More informationMultivariate normal distribution and testing for means (see MKB Ch 3)
Multivariate normal distribution and testing for means (see MKB Ch 3) Where are we going? 2 One-sample t-test (univariate).................................................. 3 Two-sample t-test (univariate).................................................
More informationIBM SPSS Neural Networks 22
IBM SPSS Neural Networks 22 Note Before using this information and the product it supports, read the information in Notices on page 21. Product Information This edition applies to version 22, release 0,
More informationModel Validation Techniques
Model Validation Techniques Kevin Mahoney, FCAS kmahoney@ travelers.com CAS RPM Seminar March 17, 2010 Uses of Statistical Models in P/C Insurance Examples of Applications Determine expected loss cost
More informationLinear Mixed-Effects Modeling in SPSS: An Introduction to the MIXED Procedure
Technical report Linear Mixed-Effects Modeling in SPSS: An Introduction to the MIXED Procedure Table of contents Introduction................................................................ 1 Data preparation
More informationVector and Matrix Norms
Chapter 1 Vector and Matrix Norms 11 Vector Spaces Let F be a field (such as the real numbers, R, or complex numbers, C) with elements called scalars A Vector Space, V, over the field F is a non-empty
More information2. Linear regression with multiple regressors
2. Linear regression with multiple regressors Aim of this section: Introduction of the multiple regression model OLS estimation in multiple regression Measures-of-fit in multiple regression Assumptions
More informationPenalized Logistic Regression and Classification of Microarray Data
Penalized Logistic Regression and Classification of Microarray Data Milan, May 2003 Anestis Antoniadis Laboratoire IMAG-LMC University Joseph Fourier Grenoble, France Penalized Logistic Regression andclassification
More informationMultivariate Statistical Inference and Applications
Multivariate Statistical Inference and Applications ALVIN C. RENCHER Department of Statistics Brigham Young University A Wiley-Interscience Publication JOHN WILEY & SONS, INC. New York Chichester Weinheim
More informationDepartment of Economics
Department of Economics On Testing for Diagonality of Large Dimensional Covariance Matrices George Kapetanios Working Paper No. 526 October 2004 ISSN 1473-0278 On Testing for Diagonality of Large Dimensional
More informationChapter 6: Multivariate Cointegration Analysis
Chapter 6: Multivariate Cointegration Analysis 1 Contents: Lehrstuhl für Department Empirische of Wirtschaftsforschung Empirical Research and und Econometrics Ökonometrie VI. Multivariate Cointegration
More informationMATHEMATICAL METHODS OF STATISTICS
MATHEMATICAL METHODS OF STATISTICS By HARALD CRAMER TROFESSOK IN THE UNIVERSITY OF STOCKHOLM Princeton PRINCETON UNIVERSITY PRESS 1946 TABLE OF CONTENTS. First Part. MATHEMATICAL INTRODUCTION. CHAPTERS
More informationIntroduction to Machine Learning and Data Mining. Prof. Dr. Igor Trajkovski trajkovski@nyus.edu.mk
Introduction to Machine Learning and Data Mining Prof. Dr. Igor Trakovski trakovski@nyus.edu.mk Neural Networks 2 Neural Networks Analogy to biological neural systems, the most robust learning systems
More informationOn the degrees of freedom in shrinkage estimation
On the degrees of freedom in shrinkage estimation Kengo Kato Graduate School of Economics, University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo, 113-0033, Japan kato ken@hkg.odn.ne.jp October, 2007 Abstract
More informationCONTROLLABILITY. Chapter 2. 2.1 Reachable Set and Controllability. Suppose we have a linear system described by the state equation
Chapter 2 CONTROLLABILITY 2 Reachable Set and Controllability Suppose we have a linear system described by the state equation ẋ Ax + Bu (2) x() x Consider the following problem For a given vector x in
More informationData Mining. Supervised Methods. Ciro Donalek donalek@astro.caltech.edu. Ay/Bi 199ab: Methods of Computa@onal Sciences hcp://esci101.blogspot.
Data Mining Supervised Methods Ciro Donalek donalek@astro.caltech.edu Supervised Methods Summary Ar@ficial Neural Networks Mul@layer Perceptron Support Vector Machines SoLwares Supervised Models: Supervised
More informationStatistics Graduate Courses
Statistics Graduate Courses STAT 7002--Topics in Statistics-Biological/Physical/Mathematics (cr.arr.).organized study of selected topics. Subjects and earnable credit may vary from semester to semester.
More informationMATH 304 Linear Algebra Lecture 9: Subspaces of vector spaces (continued). Span. Spanning set.
MATH 304 Linear Algebra Lecture 9: Subspaces of vector spaces (continued). Span. Spanning set. Vector space A vector space is a set V equipped with two operations, addition V V (x,y) x + y V and scalar
More informationStatistical Models in Data Mining
Statistical Models in Data Mining Sargur N. Srihari University at Buffalo The State University of New York Department of Computer Science and Engineering Department of Biostatistics 1 Srihari Flood of
More informationD-optimal plans in observational studies
D-optimal plans in observational studies Constanze Pumplün Stefan Rüping Katharina Morik Claus Weihs October 11, 2005 Abstract This paper investigates the use of Design of Experiments in observational
More informationFitting Subject-specific Curves to Grouped Longitudinal Data
Fitting Subject-specific Curves to Grouped Longitudinal Data Djeundje, Viani Heriot-Watt University, Department of Actuarial Mathematics & Statistics Edinburgh, EH14 4AS, UK E-mail: vad5@hw.ac.uk Currie,
More informationAdditional sources Compilation of sources: http://lrs.ed.uiuc.edu/tseportal/datacollectionmethodologies/jin-tselink/tselink.htm
Mgt 540 Research Methods Data Analysis 1 Additional sources Compilation of sources: http://lrs.ed.uiuc.edu/tseportal/datacollectionmethodologies/jin-tselink/tselink.htm http://web.utk.edu/~dap/random/order/start.htm
More information