ON THE DEGREES OF FREEDOM IN RICHLY PARAMETERISED MODELS

Salvatore Ingrassia and Isabella Morlini

COMPSTAT 2004 Symposium, (c) Physica-Verlag/Springer 2004

Key words: Richly parameterised models, small data sets.
COMPSTAT 2004 section: Neural networks.

Abstract: Using richly parameterised models for small data sets can be justified from a theoretical point of view according to some results due to Bartlett [1], which show that the generalization performance of a multilayer perceptron (MLP) depends more on the L1 norm ||c||_1 of the weights between the hidden and the output layer than on the number of parameters in the model. In this paper we investigate the problem of measuring the generalization performance and the complexity of richly parameterised procedures and, drawing on linear model theory, we propose a different notion of degrees of freedom for neural networks and other projection tools. This notion is compatible with similar ideas long associated with smoother-based models (like projection pursuit regression) and can be interpreted using the projection theory of linear models and some geometrical properties of neural networks. The results in this study lead to corrections in some goodness-of-fit statistics such as AIC and BIC/SBC: the number of degrees of freedom in these indexes is set equal to the dimension p of the projection space intrinsically found by the mapping function. An empirical study is presented to illustrate the behavior of the L1 norm ||c||_1 and of the values of some model selection criteria as the value of p in an MLP varies.

1 Introduction

An important issue in statistical modeling concerns so-called indirect measures or virtual sensors, that is, the prediction of variables that are quite expensive to measure (e.g. the viscosity or the concentration of certain chemical species, or some mechanical features) using other variables such as the temperature or the pressure. This problem usually involves some difficulties: the available data set is small and the input-output relation to be estimated is non-linear. A third difficulty arises when there are many predictor variables but, since linearity cannot be assumed, it is quite difficult to reduce the dimensionality of the input by choosing a good subset of predictors or suitable underlying features. When we are provided with a data set of N pairs (x_n, y_n) of m-dimensional input vectors x_n and scalar target values y_n, and the size N of the available sample is small compared with the number of weights of the mapping function, the model is considered overparameterised. Overparameterised models can be justified from a theoretical point of view according to some results due to Bartlett [1], showing that the generalization performance of an MLP depends more on the size of the weights than on the number of weights and, in particular, on the L1 norm ||c||_1 of the weights between the hidden and the output layer.

Bartlett's results suggest a deeper look at the roles of the parameters in a neural network and in similar richly parameterised models. These roles are here interpreted by drawing on the projection theory of linear models and on some geometrical properties shared by neural networks and by statistical tools realizing similar mapping functions.

2 Geometrical properties of the sigmoidal functions

In this section we investigate some geometrical properties of a mapping function of the form:

    f_p(x) = \sum_{k=1}^{p} c_k \tau(a_k' x)                                (1)

Without loss of generality, we assume that the bias term is equal to zero and that the function \tau(\cdot) is sigmoidal and analytic, that is, it can be represented by a power series on some interval (-r, r), where r may be +\infty. The hyperbolic tangent \tau(z) = \tanh(z) and the logistic function \tau(z) = (1 + e^{-z})^{-1} are examples of analytic sigmoidal functions. We point out that the function f_p is a combination of certain transformations of the input data. In particular, f_p realizes: i) a non-linear projection from R^m to R^p given by the sigmoidal function \tau, that is x -> (\tau(a_1' x), ..., \tau(a_p' x)); ii) a linear transformation from R^p to R according to c_1, ..., c_p.

The results in this paper are based on the following theorem, see e.g. Rudin [9]:

Theorem 2.1. Let g be analytic and not identically zero in the interval (-r, r), with r > 0. Then the set of the zeroes of g in (-r, r) is at most countable.

Let x_1 = (x_{11}, ..., x_{1m}), ..., x_p = (x_{p1}, ..., x_{pm}) be p points of R^m, with p > m; evidently these points are linearly dependent since p > m. Let A = (a_{ij}) be a p x m matrix with values in some hypercube [-u, u]^{mp}, for some u > 0; the points Ax_1, ..., Ax_p are then linearly dependent because they are obtained by a linear transformation acting on x_1, ..., x_p. For u = 1/m the points \tau(Ax_1), ..., \tau(Ax_p), where

    \tau(Ax_i) = ( \tau(\sum_{j=1}^{m} a_{1j} x_{ij}), ..., \tau(\sum_{j=1}^{m} a_{pj} x_{ij}) )
               = ( \tau(a_1' x_i), ..., \tau(a_p' x_i) ),        i = 1, ..., p,

are linearly independent for almost all matrices A in [-u, u]^{mp}, according to the following theorem.

Theorem 2.2 [5]. Let x_1, ..., x_p be p distinct points in (-r, r)^m with x_h != 0 (h = 1, ..., p), and let A = (a_{ij}) in [-u, u]^{mp} be a p x m matrix, with u = 1/m. Let \tau be a sigmoidal analytic function on (-r, r), with r > 0. Then the points \tau(Ax_1), ..., \tau(Ax_p) in R^p are linearly independent for almost all matrices A = (a_{ij}) in [-u, u]^{mp}.

This result proves that, given N > m points x_1, ..., x_N in R^m, the transformed points \tau(Ax_1), ..., \tau(Ax_N) generate an over-space of dimension p > m if the matrix A satisfies suitable conditions. In particular, the largest over-space is attained when p = N, that is, when the hidden layer of an MLP has as many units as the number of points in the learning set. Moreover, it gives insight into why neural networks have been shown to work well in the presence of multicollinearity. On this topic De Veaux and Ungar [3] present a case study in which the temperature of a flow is measured by six different devices at various places in a production process: even though the inputs are highly correlated, a better prediction of the response is obtained using a weighted combination of all six predictors rather than choosing the single best measurement having the highest correlation with the response. The next result generalizes Theorem 2.2.

Theorem 2.3. Let L = {(x_1, y_1), ..., (x_N, y_N)} be a given learning set of N i.i.d. realizations of (X, Y) and let f_p = \sum_{k=1}^{p} c_k \tau(a_k' x). If p = N, then the learning error

    \hat{E}(f_p, L) = \sum_{(x_n, y_n) \in L} (y_n - f_p(x_n))^2

is zero for almost all matrices A in [-1/m, 1/m]^{mp}.

Proof. Theorem 2.2 implies that the points \tau(Ax_1), ..., \tau(Ax_N) are linearly independent for almost all matrices A in [-1/m, 1/m]^{mp}, for p <= N. In particular, if p = N the system:

    c_1 \tau(a_1' x_1) + ... + c_N \tau(a_N' x_1) = y_1
        ...                                                                  (2)
    c_1 \tau(a_1' x_N) + ... + c_N \tau(a_N' x_N) = y_N

has a unique solution.

The upper bound on p given above looks too large, but it refers to the worst case. In neural modelling, given a learning set L of N sample data, the right question seems not to be "what is the largest network we can train with L (if any)?" but rather "what is a suitable size, namely the dimension p of the space R^p, necessary for fitting the unknown input-output dependence \phi = E[Y|X]?". This dimension p depends on the geometry of the data, and this explains why neural models may be successfully applied as virtual sensors when the predictors exhibit a high degree of multicollinearity. As a matter of fact, the hidden units break the multicollinearity and exploit the contribution of each single predictor. This is the reason why the optimal size p of the hidden layer is often greater than the number m of predictors.
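
As a numerical illustration of the mapping (1) and of Theorems 2.2-2.3, the following sketch (not part of the original paper; the data, the sizes and the choice of tanh as sigmoid are assumptions made only for the example) draws a random matrix A in [-1/m, 1/m]^{mp}, builds the non-linear projection x -> \tau(Ax) and verifies that, with p = N hidden units, the transformed points are linearly independent and the interpolation system (2) is solved with zero learning error.

```python
# Minimal numerical sketch of equation (1) and Theorem 2.3 on synthetic data.
# All names and sizes are illustrative, not taken from the paper.
import numpy as np

rng = np.random.default_rng(0)
N, m = 20, 5                           # N sample points in R^m
X = rng.uniform(-1.0, 1.0, size=(N, m))
y = rng.normal(size=N)                 # arbitrary scalar targets

p = N                                  # hidden layer as large as the sample
u = 1.0 / m
A = rng.uniform(-u, u, size=(p, m))    # projection weights in [-1/m, 1/m]^{mp}

T = np.tanh(X @ A.T)                   # rows are tau(A x_n): projection R^m -> R^p

# For almost all A the rows of T are linearly independent (Theorem 2.2),
# so with p = N the interpolation system (2) has a unique solution c.
print("rank of T:", np.linalg.matrix_rank(T))       # expected: N
c = np.linalg.solve(T, y)              # output weights c_1, ..., c_N
f = T @ c                              # f_p(x_n) on every learning point
print("learning error:", np.sum((y - f) ** 2))      # ~0, as in Theorem 2.3
```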

3 On counting the degrees of freedom in linear projection models

Consider the standard regression model:

    y = X\beta + \varepsilon                                                 (3)

where y is a vector of N observations of the dependent variable measured about its mean, X is an N x m matrix whose (i, j)-th element is the value of the j-th predictor variable for the i-th observation, again measured about its mean, \beta is a vector of regression coefficients and \varepsilon is the vector of error terms satisfying the usual assumptions of independence and homoscedasticity. The values of the principal components (PCs) for each observation are given by Z = XA, where the (i, k)-th element of Z is the value of the k-th PC for the i-th observation and A is the m x m matrix whose k-th column is the k-th eigenvector of X'X. Because A is orthogonal, X\beta can be rewritten as XAA'\beta = Z\gamma, where \gamma = A'\beta. Equation (3) can therefore be written as:

    y = Z\gamma + \varepsilon                                                (4)

which simply replaces the predictor variables by their PCs in the regression model. Principal component regression can be defined as model (4) or as the reduced model:

    y = Z_p \gamma_p + \varepsilon_p                                         (5)

where \gamma_p is a vector of p elements which are a subset of \gamma, Z_p is an N x p matrix whose columns are the corresponding subset of columns of Z and \varepsilon_p is the appropriate error term, see Section 8.1 in Jolliffe [7]. From an algebraic point of view, computing the PCs of the original predictor variables is a way to overcome the problem of multicollinearity between them, since the PCs are orthogonal. From a geometrical point of view, if we compute the values of the PCs for each of the N observations, we project the N points onto an m-dimensional hyperplane for which the sum of squared perpendicular distances of the observations in the original space is minimized. The total number of estimated quantities in model (4), which results from defining the projection of the centered matrix X onto the space spanned by the m PCs and then estimating the m regression coefficients plus the bias term, is larger than the number m of variables involved in the model. In any case, model (5) is given p + 1 degrees of freedom, that is, the number of retained PCs plus one. The function given by the PC regression model can be rewritten as:

    f(x_n) = \sum_{k=0}^{m} \beta_k (a_k' x_n)                               (6)

where x_n is the m-dimensional vector of the original centered variables, the a_k (k = 1, ..., m) are the normalized eigenvectors of the matrix X'X and the \beta_k are the regression coefficients.
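
The two-stage construction of PC regression in models (4)-(5) can be written compactly; the following sketch (not from the paper: the data and the number p of retained components are placeholders) computes the PC scores from the eigenvectors of X'X and regresses y on the first p of them, with p + 1 degrees of freedom as in the text.

```python
# Principal component regression, models (4)-(5): project the centred X onto
# its principal components and regress y on the p retained scores.
# Illustrative sketch only; the data and p are assumptions for the example.
import numpy as np

rng = np.random.default_rng(1)
N, m, p = 50, 6, 3
X = rng.normal(size=(N, m))
X = X - X.mean(axis=0)                 # predictors measured about their means
y = rng.normal(size=N)
y = y - y.mean()                       # response measured about its mean

# Columns of A are the eigenvectors of X'X, ordered by decreasing eigenvalue.
eigvals, A = np.linalg.eigh(X.T @ X)
order = np.argsort(eigvals)[::-1]
A = A[:, order]

Z = X @ A                              # PC scores, model (4)
Z_p = Z[:, :p]                         # retained components, model (5)
gamma_p, *_ = np.linalg.lstsq(Z_p, y, rcond=None)

rss = np.sum((y - Z_p @ gamma_p) ** 2)
df = p + 1                             # retained PCs plus one (the intercept,
                                       # absorbed here by centring the data)
print("residual sum of squares:", rss, " degrees of freedom:", df)
```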

The function in (6) is formally identical to the mapping function realized by projection pursuit regression or by a network model with the identity function in the second layer. This analogy between PC regression and the mapping function realized by an MLP with p hidden nodes (m <= p <= N) and a linear transfer function in the second layer also arises in accordance with Theorem 2.2 in the previous section. As in PC regression, the first transformation from the input layer to the hidden layer is a geometrical one which projects the points into a new space of dimension p <= N. This transformation is non-linear and its optimality is not well established: the important point here is that in the new space the points are linearly independent (Theorem 2.2). A further difference between PC regression and the mapping function realized by the MLP, besides the type of geometrical projection (linear in the first case, non-linear in the second) and the maximum dimension of this space (equal to m for the PCs and lying in the interval [m, N] for the neural network), is that for PC regression the projection matrix and the regression parameters are estimated in a two-stage procedure, while for the network they are estimated simultaneously.

In neural network modeling the dimension of the model, i.e. the number p of neurons in the hidden layer, is often chosen according to model selection criteria listed in the summary reports of many statistical software packages for data mining. The linkage between the generalization error and the empirical error, used in these model selection criteria, has been approached in Ingrassia and Morlini [6] on the basis of the Vapnik-Chervonenkis theory [10]. The seminal work on model selection is rooted in the parametric statistics literature and is quite vast, but it must be noted that, although model selection techniques for parametric models have been widely used over the past 30 years, surprisingly little work has been done on the application of these techniques in a semi-parametric or non-parametric context. Such goodness-of-fit statistics are quite simple to compute and, even if the underlying theory does not hold for neural networks, the rule among users is to consider them as crude estimates of the generalization error and thus to apply these methods to very complex models, regardless of their parametric framework. These criteria, here denoted by \Pi, are an extension of maximum likelihood and have the following form:

    \Pi = \hat{E}(f_K) + C_K                                                 (7)

where the term \hat{E}(f_K) is the deviance of the model f_K and C_K is a complexity term representing a penalty which grows as the number K of degrees of freedom in the model increases: if the model f_K is too simple it will give a large value for the criterion because the residual training error is large, while a model f_K which is too complex will have a large value for the criterion because the complexity term is large. Typical indexes include the Akaike Information Criterion (AIC), the Schwarz Bayesian Information Criterion (BIC or SBC), the Final Prediction Error (FPE), the Generalized Cross-Validation error (GCV) and the well-established Unbiased Estimate of the Variance (UEV); for a review of these indexes we refer to Ingrassia and Morlini [6].

The classical statistical parametric viewpoint, in which the dimensionality of the model complexity is given by K = W, i.e. the number W of all parameters defining the mapping function, does not seem to apply to flexible non-parametric or semi-parametric models, in which the adaptive parameters are not on the same level and have different interpretations, as we have seen in the previous sections. That the degrees of freedom of a neural model should differ from W has been remarked by many authors, see e.g. Hodges and Sargent [4] and Ye [11]. In the present study we propose an easy correction to the model selection criteria. According to the analogy with PC regression, for models of the form

    f_p(x) = \sum_{k=1}^{p} c_k \tau(a_k' x + b_k) + c_0

we should consider K = p + 1 rather than K = W = p(m + 2) + 1, that is, the dimension of the projection space plus one. In this case both the FPE and the UEV are never negative (as may happen when K = W = p(m + 2) + 1).
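
A minimal sketch of the effect of this correction, using common textbook forms of the criteria (the paper does not reproduce its exact formulas, so the definitions below are assumptions made for the illustration): the same deviance is penalised with K = p + 1 and with K = W = p(m + 2) + 1.

```python
# Criteria of the form (7): deviance plus a complexity penalty growing with the
# degrees of freedom K. The formulas are common textbook versions (assumed, not
# quoted from the paper) and serve only to compare K = p + 1 with K = W.
import numpy as np

def criteria(rss, N, K):
    """AIC, BIC/SBC, FPE and GCV for residual sum of squares rss,
    sample size N and K degrees of freedom."""
    aic = N * np.log(rss / N) + 2 * K
    bic = N * np.log(rss / N) + K * np.log(N)
    fpe = (rss / N) * (N + K) / (N - K)
    gcv = (rss / N) / (1 - K / N) ** 2
    return aic, bic, fpe, gcv

N, m, p, rss = 50, 10, 7, 3.2          # illustrative values only
K_proposed = p + 1                     # dimension of the projection space plus one
K_classical = p * (m + 2) + 1          # total number of network weights W

print("K = p + 1:       ", criteria(rss, N, K_proposed))
print("K = p(m + 2) + 1:", criteria(rss, N, K_classical))
# With K_classical = 85 > N = 50 the factor (N - K) is negative, so FPE turns
# negative: the pathology that the correction K = p + 1 avoids.
```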

4 A case study

In this section we present some numerical results in order to investigate the behavior of Bartlett's constant ||c||_1 and its relation with the learning error \hat{E}(f_p, L) and the test error \hat{E}(f_p, T). We consider the polymer data set modeled by De Veaux et al. [2] by means of an MLP with 18 hidden units. This data set contains 61 observations on 10 predictors, concerning measurements of controlled variables in a polymer process plant, and a response concerning the output of the plant (data are from ftp.cis.upenn.edu in pub/ungar/chemdata). The data exhibit a quite large degree of multicollinearity, as shown by the variance inflation factors (VIF) reported for the predictors X_1, X_3, X_4, X_5, X_6, X_7, X_8 and X_9. In general, VIFs larger than 10 imply serious computational problems for many statistical tools [8]. As in De Veaux et al. [2], we use 50 observations for the learning set and 11 for the test set. Here, however, we consider 100 different samples with different observations in the training and test sets. We train networks with increasing numbers of hidden units, from p = 2 to p = 25. For each p we train the network 1000 times, varying the sample and the initial weights; we adopt either weight decay or early stopping as regularization technique. We then retain the 100 networks with the smallest test errors. The distribution of the learning error vs. the number p is plotted in Figure 1a) using boxplots. In Figure 1b) we plot the distribution of ||c||_1 vs. p using weight decay. Similar distributions are obtained with early stopping.
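
A sketch of one replication of this protocol follows (not the authors' code: scikit-learn's MLPRegressor and the chosen hyperparameters are assumptions standing in for the original implementation, and only the weight-decay variant is shown).

```python
# One replication of the case-study protocol: split the data, fit MLPs with an
# increasing number p of hidden units under weight decay, and record the
# learning error, the test error and Bartlett's constant ||c||_1 (the L1 norm
# of the hidden-to-output weights). Sketch only; estimator and settings are
# assumptions, not the authors' original implementation.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

def run_once(X, y, seed=0):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=11, random_state=seed)        # 50 learning, 11 test points
    results = []
    for p in range(2, 26):                             # p = 2, ..., 25 hidden units
        net = MLPRegressor(hidden_layer_sizes=(p,), activation="tanh",
                           alpha=1e-2,                 # weight-decay penalty
                           solver="lbfgs", max_iter=5000, random_state=seed)
        net.fit(X_train, y_train)
        learn_err = np.sum((y_train - net.predict(X_train)) ** 2)
        test_err = np.sum((y_test - net.predict(X_test)) ** 2)
        c_norm = np.abs(net.coefs_[1]).sum()           # ||c||_1
        results.append((p, learn_err, test_err, c_norm))
    return results
```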

[Figure 1: Polymer data: distribution of the learning error and of ||c||_1 vs. p.]

Figure 1b) shows that there is a quite large number of models with different architectures (i.e. a different number p of hidden units) having the same value of the constant ||c||_1 introduced in Bartlett [1]. Table 1 reports, for each group of the 100 best models, the mean values of the training error \hat{E}(f_p; L), the test error \hat{E}(f_p; T), the L1 norm ||c||_1 and the error-complexity measures AIC, BIC/SBC, GCV and FPE computed with K = p + 1.

[Table 1: Polymer data: summary statistics for different values of p. Columns: p, \hat{E}(f_p; L), \hat{E}(f_p; T), ||c||_1, K, BIC, AIC, FPE, GCV.]

The values of p leading to the smallest mean values are:

    AIC: 9, 11    BIC: 3, 5, 7    GCV: 5, 7    FPE: 7, 9, 11

Thus, on the basis of these statistics, models with p = 7, 9, 11 neurons in the hidden layer should be selected. In addition, we note that p = 10 is the value with the smallest absolute difference between the mean training and test errors. This study confirms that ||c||_1 better describes the complexity of a model of the form (1) than the total number of parameters and is a more suitable characterization of the mapping function.

References

[1] Bartlett P.L. (1998). The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network. IEEE Transactions on Information Theory, 44 (2).
[2] De Veaux R.D., Schumi J., Schweinsberg J., Ungar L.H. (1998). Prediction intervals for neural networks via nonlinear regression. Technometrics, 40 (4).
[3] De Veaux R.D., Ungar L.H. (1994). Multicollinearity: a tale of two nonparametric regressions. In Selecting Models from Data: AI and Statistics IV (Eds. P. Cheeseman & R.W. Oldford).
[4] Hodges J., Sargent D. (2001). Counting degrees of freedom in hierarchical and other richly parameterised models. Biometrika, 88.
[5] Ingrassia S. (1999). Geometrical aspects of discrimination by multilayer perceptrons. Journal of Multivariate Analysis, 68.
[6] Ingrassia S., Morlini I. (2002). Neural network modeling for small data sets. Submitted for publication.
[7] Jolliffe I.T. (1986). Principal Component Analysis. Springer-Verlag, New York.
[8] Morlini I. (2002). Facing multicollinearity in data mining. Atti della XLI Riunione Scientifica della Società Italiana di Statistica, Milano-Bicocca.
[9] Rudin W. (1966). Real and Complex Analysis. McGraw-Hill, New York.
[10] Vapnik V. (1998). Statistical Learning Theory. John Wiley & Sons, New York.
[11] Ye J. (1998). On measuring and correcting the effects of data mining and model selection. Journal of the American Statistical Association, 93 (441).

Address: S. Ingrassia, Dipartimento di Economia e Statistica, Università della Calabria, Arcavacata di Rende, Italy; I. Morlini, Dipartimento di Economia, Università di Parma, Parma, Italy.
E-mail: s.ingrassia@unical.it, isabella.morlini@unipr.it
