Inverse covariance matrix regularization for minimum variance portfolio
Auberson Maxime
June 2015

Abstract

In the mean-variance optimization framework, the inverse of the covariance matrix of the assets considered (or the precision matrix), Σ⁻¹, is of primary importance. It is central in determining the optimized weights to allocate to each asset, especially when using the global minimum variance portfolio w_GMV. Unfortunately, in practice, the estimation of the covariance matrix and its inverse is associated with several serious problems, especially when using the sample estimates. The gains from the optimization are often more than offset by the errors in the estimation. One way to deal with this problem is to pull the off-diagonal elements of the precision matrix estimate towards zero, i.e. to shrink the precision matrix. Extreme coefficients are reduced, which improves the prediction accuracy of the estimation, as extreme coefficients are likely to be due to estimation errors. A way to achieve that shrinkage effect is to impose a penalty on the size of the off-diagonal elements in the estimation of Σ⁻¹. In this paper, I focus on two main types of regularization, the ℓ1-penalty and the ℓ2-penalty, which impose a penalty on the ℓ1 and the ℓ2 norm of the precision matrix coefficients, respectively. They result in different shrinkage profiles, and this is what I aim to study in this paper. As for the organization, the first part of the paper covers the theory, i.e. how to estimate the precision matrix according to these two methods and what the theoretical implications and characteristics of both estimators are. Then, in the second part, I implement both methods on several sets of financial data in order to study and compare their out-of-sample performance in an asset allocation context, using the minimum variance portfolio w_GMV. Moreover, to provide a better global view of the results and some type of benchmark, other well-known methods are considered, like the equally-weighted portfolio or the sample-based global minimum variance portfolio.

Contents

1 Introduction
2 The precision matrix
  2.1 Theory
  2.2 Estimation: problems and potential solutions
3 The precision matrix estimation
  3.1 The unpenalized version
  3.2 The L1-regularized estimation: the graphical lasso estimator
    3.2.1 Description
    3.2.2 The L1-penalty
    3.2.3 Graphical lasso: how it works
    3.2.4 Graphical lasso: the algorithm
  3.3 The L2-regularized estimation: the ridge precision estimator
    3.3.1 Description
    3.3.2 The L2-penalty
    3.3.3 The Alternative Type I ridge precision estimator
4 Analysis of the two estimators
  4.1 The graphical lasso - the shrinkage parameter λ_L
  4.2 The graphical lasso - the convergence condition t
  4.3 The ridge estimator - the shrinkage parameter λ_R
  4.4 The ridge estimator VS the graphical lasso - the timing
  4.5 The ridge estimator VS the graphical lasso - sparsity
5 Out-of-sample evaluation
  5.1 Setup
    5.1.1 The databases
    5.1.2 The different portfolio strategies considered
    5.1.3 The approach
    5.1.4 The choice of the shrinkage parameters λ_L and λ_R
  5.2 The adaptive ridge strategy - ADR
  5.3 Results
    5.3.1 The performance measures considered
    5.3.2 #1 dataset - the 96 portfolios based on size and BtM ratio
    5.3.3 #2 dataset - the 48 industry portfolios
    5.3.4 #3 dataset - the 133 individual stocks
    5.3.5 #4 dataset - combination 2 (96SBtMport + 133Indiv)
6 Conclusion

1 Introduction

How should portfolio managers optimally allocate wealth across all available assets? What are the criteria to consider in determining whether an additional asset should be included in a portfolio? How can we judge the usefulness of an asset at the portfolio level? The economist Harry Markowitz was the first to really bring answers to these questions, in his 1952 paper: basically, one should always maximize the portfolio expected return with respect to a given amount of portfolio risk, represented by its variance. Only these two statistical moments are considered in judging the performance of assets and portfolios. This is the mean-variance framework, which serves as the foundation for many theories that are still relevant today, such as the Capital Asset Pricing Model (CAPM).

At the portfolio level, when multiple assets are considered, not only the variance of each asset is important but also the covariances (or correlations) between the assets. Indeed, some assets in the portfolio may tend to move in an offsetting way, and therefore reduce the variance (or risk) of the overall portfolio. This is the notion of diversification, which means that it is actually possible to decrease the (non-systematic) risk of the portfolio by investing in additional assets. For the diversification effect to be efficient, the assets should, as much as possible, not move in the same way, i.e. be imperfectly correlated. Hence, in mean-variance portfolio optimization, the covariance matrix Σ of asset returns is crucial in determining the optimal allocations.

Unfortunately, in practice, Σ must be estimated, and the estimation process brings along serious problems. The true expected returns, variances and covariances are not available; they must be estimated from data. This means that errors can be made in the estimation, and this is usually what happens in practice. In some cases, the gains from portfolio optimization even disappear, offset completely by these estimation errors. So, even though the mean-variance optimization framework is intuitive, easy to understand and still used in practice, it is widely criticized in the literature. The results from the optimization can be sub-optimal and even outperformed by naive diversification strategies such as the 1/N strategy. DeMiguel, Garlappi and Uppal (2007) find that no optimizing model consistently delivers a better performance than the equally-weighted strategy, as the latter does not need any estimation procedure and has a very low turnover. The conditions needed for the sample-based mean-variance optimization framework to actually outperform the 1/N strategy out-of-sample are almost impossible to achieve in practice.

In an effort to decrease the estimation errors, it is possible to focus only on the risk, i.e. the variance, rather than on both the risk and the return. In that case, all the potential errors from estimating the expected returns of the assets are avoided. This is the basis for the risk budgeting or risk parity approaches: only the risk of the portfolio is controlled, so only the covariance matrix has to be estimated. However, problems still persist; portfolios based on the sample estimate of the covariance matrix tend to perform poorly out-of-sample. First, the optimization usually gives extreme weights, which are not desirable or maybe not even applicable in practice. Second, it is also hypersensitive to new data, resulting in a huge turnover over multiple trading periods.

The reason is that the number of observations used for the estimation (T) has to be very high relative to the number of assets considered (N) for the estimate to be trustworthy. In other words, due to its nature, the sample estimator needs to collect a lot of information from the data to be reliable. The closer the T/N ratio gets to one, the less reliable the results.

It is therefore not surprising that, using the sample covariance matrix, the gains from the optimization are completely offset by the costs, especially against the 1/N portfolio, which has almost no turnover costs. Actually, DeMiguel, Garlappi and Uppal (2007) find that the estimation window needed for the sample-based optimization strategies to consistently outperform the 1/N strategy is around 3000 months for 25 assets and around 6000 months for 50 assets. This is almost impossible to achieve in practice, and they therefore find that no optimizing model consistently outperforms the equally-weighted strategy. Related to that, another main drawback is that when the number of observations used for estimation (T) is lower than the number of assets considered (N), the sample covariance matrix is singular. It is then impossible to invert, and it cannot be used for the optimization process, as the inverse covariance matrix is computationally necessary.

The main objective of this paper is to find better ways than the sample-based method to estimate the covariance matrix and its inverse, in order to obtain more adequate results in an asset allocation context. The weights resulting from the optimization should be less extreme and more stable across the trading periods to avoid the turnover trap. To be more precise, I test two different methods to estimate the precision matrix, which are based on two different types of penalization: the graphical lasso algorithm from Friedman, Hastie and Tibshirani (2007) and the alternative Type I ridge precision estimator from Van Wieringen and Peeters (2014). The first one is based on the ℓ1-penalty, whereas the second is based on the ℓ2-penalty. I apply both methods to financial data in order to compare their performance. For the optimization, I only focus on risk, i.e. the covariance matrix, without considering the expected returns. I use the global minimum variance portfolio w_GMV throughout the paper for the evaluations, as it does not need any expected return estimation.

2 The precision matrix

2.1 Theory

The covariance matrix and especially its inverse are of critical importance for an efficient asset allocation in a mean-variance framework. The covariance matrix is intuitively useful to better understand the statistical relations in the data. However, computationally, its inverse is more relevant, as the covariance matrix has to be inverted to transform the return data into weights. When using a strategy which focuses only on the risk, like the global minimum variance portfolio, the precision matrix [1] is actually the only parameter to compute to find the optimized weights. It is the only element which depends on the data, other than the number of assets, as we can see in the GMV portfolio formula:

$$w_{GMV} = \frac{\Sigma^{-1}\mathbf{1}_N}{\mathbf{1}_N^{\top}\Sigma^{-1}\mathbf{1}_N}$$

Of course, when the inverse has to be estimated, as in practice, Σ⁻¹ must be replaced by its estimate Σ̂⁻¹.

[1] Throughout the paper, I use Σ⁻¹ or Θ equivalently to designate the precision matrix, depending on the situation.
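As an illustration, here is a minimal sketch of this formula in Python with NumPy (a language choice of mine, not prescribed by the paper; the function and variable names are my own): given any estimate of the precision matrix, the GMV weights follow from a single matrix-vector product and a normalization.

```python
import numpy as np

def gmv_weights(precision):
    """Global minimum variance weights: w = (Theta 1_N) / (1_N' Theta 1_N)."""
    ones = np.ones(precision.shape[0])
    w = precision @ ones
    return w / (ones @ w)

# toy example: any symmetric positive definite estimate of Sigma^{-1} will do
Sigma = np.array([[0.04, 0.01, 0.00],
                  [0.01, 0.09, 0.02],
                  [0.00, 0.02, 0.16]])
Theta_hat = np.linalg.inv(Sigma)
w = gmv_weights(Theta_hat)
print(w, w.sum())   # weights sum to one; no expected-return estimate is needed
```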

Even though the precision matrix in itself is complicated to interpret, Stevens (1995) revealed an interesting fact about it. He highlighted that the inverse of the covariance matrix gives us the hedging trades among the assets. More precisely, each row i of the precision matrix is a hedge portfolio for asset i. Given N assets, each row i of the precision matrix can be seen as a long position in the i-th asset and short positions in the N − 1 other assets. Basically, the long position in the i-th stock is then hedged by the N − 1 short positions. It is useful to write the precision matrix with these notations in order to better understand the principle:

$$\Sigma^{-1} =
\begin{pmatrix}
\dfrac{1}{\sigma_{11}(1-R_1^2)} & \dfrac{-\beta_{12}}{\sigma_{11}(1-R_1^2)} & \cdots & \dfrac{-\beta_{1N}}{\sigma_{11}(1-R_1^2)} \\
\dfrac{-\beta_{21}}{\sigma_{22}(1-R_2^2)} & \dfrac{1}{\sigma_{22}(1-R_2^2)} & \cdots & \dfrac{-\beta_{2N}}{\sigma_{22}(1-R_2^2)} \\
\vdots & \vdots & \ddots & \vdots \\
\dfrac{-\beta_{N1}}{\sigma_{NN}(1-R_N^2)} & \dfrac{-\beta_{N2}}{\sigma_{NN}(1-R_N^2)} & \cdots & \dfrac{1}{\sigma_{NN}(1-R_N^2)}
\end{pmatrix}$$

The inverse covariance matrix is expressed here in a multiple regression way. The returns of the i-th stock are regressed on the N − 1 other stocks, which yields the N − 1 coefficients β_ij, or the vector β_i. Of course, since the inverse matrix is symmetric, its (i, j) and (j, i) elements must be equal. Their signs are negative, as they represent short positions, according to the hedge portfolio view of the precision matrix. The vector β_i represents the part of the i-th stock return that can be explained by the regression, i.e. by the variations in the N − 1 other stocks. The elements are normalized by the part of asset i's variance that cannot be explained by the regression, i.e. the unhedgeable risk of asset i (R_i² being the R-squared of the i-th regression).
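To make this regression interpretation concrete, the following sketch (Python/NumPy, my own illustration rather than code from the paper) rebuilds one row of the inverse of a sample covariance matrix from the corresponding hedge regression and checks that the two coincide:

```python
import numpy as np

rng = np.random.default_rng(0)
T, N = 500, 5
X = rng.normal(size=(T, N))
X = X - X.mean(axis=0)                        # demeaned returns

S = X.T @ X / T                               # sample covariance matrix
Theta = np.linalg.inv(S)                      # its inverse, the precision matrix

i = 0                                         # rebuild row i from the hedge regression
others = [j for j in range(N) if j != i]
beta, *_ = np.linalg.lstsq(X[:, others], X[:, i], rcond=None)  # regress asset i on the others
resid = X[:, i] - X[:, others] @ beta
unhedgeable = resid @ resid / T               # sigma_ii (1 - R_i^2): the unexplained variance

row = np.zeros(N)
row[i] = 1.0 / unhedgeable                    # diagonal element
row[others] = -beta / unhedgeable             # off-diagonal elements, the short positions
print(np.allclose(row, Theta[i]))             # True: the regression reproduces row i of Sigma^{-1}
```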

2.2 Estimation: problems and potential solutions

Hence, if the precision matrix estimation is seen as a multiple linear regression problem, as Stevens suggested, another problem arises: multicollinearity in the data. Basically, if the independent variables in a multiple regression are highly correlated, the regression process can be disrupted and the estimated coefficient vector β discredited. Indeed, a regression works well when the independent variables are really, as the name suggests, independent of each other. They should have separate and independent impacts on the dependent variable. Therefore, when the independent variables are in some way dependent on each other, the results from the linear regression can be imprecise and unreliable. That can be especially problematic in large databases like groups of stocks, as correlations often exist among them. It can then generate large errors in the estimation process of the precision matrix if regular least squares estimation is used.

One way to limit these estimation errors and avoid the multicollinearity problem is to simplify the covariance structure of the data. Essentially, it can be beneficial in a model to select only some relevant parameters from the full set, rather than considering the whole set of parameters and increasing the errors, thereby decreasing the estimation errors. In the precision matrix context, this means that setting some redundant off-diagonal elements (the β_ij) to zero can reduce the amount of noise in the model due to estimation errors. It also makes the models easier to interpret, as they focus only on the strongest relations. This phenomenon is called subset selection, as we force some independence in the data in order to focus only on the most pertinent relations, i.e. the most pertinent subset of parameters. Another way to improve the estimation process is to reduce the most extreme coefficients, as they are likely to be due to estimation errors. This phenomenon is called shrinkage, as we shrink the coefficients towards zero so that they become more conservative and less subject to estimation errors.

To achieve subset selection and/or shrinkage, a type of penalization is required in the estimation process, to pull the most extreme coefficients of the precision matrix towards, or exactly to, zero. An important concept to understand is that, with these penalizations, the estimation is biased. This means that the penalized estimates do not asymptotically tend to the true parameter values; they are statistically incorrect. The model is statistically wrong, as it does not represent the exact correct relations due to the structure simplification. However, it allows us to reduce overfitting, i.e. reduce the variance and increase the prediction accuracy of the estimates. The least squares estimates of the coefficients have a very low bias, but they also have a high variance, which explains why they are subject to large estimation errors. They need a very large number of observations to really capture enough of the information in the data and achieve sufficient prediction power. Hence, there is actually a trade-off between the benefit (higher prediction accuracy) and the cost (bias) of the penalization. By shrinking, we accept some bias in exchange for a better prediction power of the estimation.

3 The precision matrix estimation

3.1 The unpenalized version

Let Y_i = (Y_1, ..., Y_T) be one of the N T-dimensional random variables drawn from a multivariate normal distribution N(µ, Σ); there are N variables and T observations. If Σ⁻¹ = Θ and S is the sample covariance matrix, the problem is to maximize the likelihood function:

$$\hat{\Theta} = \underset{\Theta}{\operatorname{argmax}}\ \left\{\ln(\det\Theta) - \operatorname{trace}(S\Theta)\right\}$$

If N < T, there are more observations than variables, and Θ̂_ML = Ŝ⁻¹, meaning that the inverse of the sample covariance matrix is the maximum likelihood estimator of the precision matrix. However, when N > T, as has already been mentioned, the sample covariance matrix is singular. It is therefore not invertible, and the precision matrix estimate is undefined. Moreover, even if N < T, as was also mentioned earlier, the sample covariance matrix and its inverse achieve low prediction power and produce unstable weights. The results of the optimization are therefore unreliable in practice. In order to avoid these problems, it is necessary to regularize the precision matrix estimation.

3.2 The L1-regularized estimation: the graphical lasso estimator

3.2.1 Description

The graphical lasso is an algorithm which shrinks the elements of the precision matrix towards zero compared to the maximum likelihood estimates. Due to the nature of its penalty, it provides sparsity, meaning that it shrinks some irrelevant precision matrix coefficients directly to zero. Therefore, aside from shrinkage, it also promotes subset selection. If β_ij (the precision matrix coefficient) is different from zero according to this method, it means that the j-th stock provides a sufficient contribution to the hedge of the i-th stock relative to the other N − 2 stocks. Otherwise, it is set to zero by the ℓ1-penalty. The advantage is that it limits the trades to the assets which are really relevant in a risk reduction context. Another nice property is that the graphical lasso keeps the precision matrix positive definite even if N > T.

The graphical lasso has been shown by Goto and Xu (2013) to bring substantial gains in terms of risk reduction in an asset allocation context, and my work is based on their paper. It is important to note that the sparsity of the precision matrix does not imply the sparsity of the covariance matrix. Even though the precision matrix is sparse, the covariance matrix often still has non-zero covariances.

3.2.2 The L1-penalty

Basically, the ℓ1-penalty imposes a penalty on the overall size of the regression coefficients, i.e. on the sum of the absolute values of the vector β. It is used in least squares regression and, written in Lagrangian form, it can be expressed as:

$$\hat{\beta}^{lasso} = \underset{\beta}{\operatorname{argmin}}\ \frac{1}{2}\sum_{i=1}^{N}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 + \lambda_L \sum_{j=1}^{p} |\beta_j|$$

with λ_L being the penalty parameter. The higher it is, the stronger the shrinkage. Due to the absolute value in the penalty, it achieves absolute shrinkage and sets some coefficients exactly to zero. An equivalent way to state the problem, which makes the size penalty clearer, is:

$$\hat{\beta}^{lasso} = \underset{\beta}{\operatorname{argmin}}\ \sum_{i=1}^{N}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le t$$

When applied to the precision matrix estimation and the likelihood function, according to Friedman, Hastie and Tibshirani (2007), it can be expressed in this way:

$$\hat{\Theta} = \underset{\Theta}{\operatorname{argmax}}\ \left\{\ln(\det\Theta) - \operatorname{trace}(S\Theta) - \lambda_L \|\Theta\|_1\right\}$$

with ‖Θ‖₁ being the ℓ1 norm, i.e. the sum of the absolute values of the elements of Σ⁻¹. In this context, the penalty is on the precision matrix coefficients. A larger value of λ_L promotes more sparsity, whereas λ_L = 0 gives the same solution as the unconstrained maximum likelihood problem. The crucial choice of the value of the regularization parameter will be discussed later, in the implementation.

3.2.3 Graphical lasso: how it works

Unfortunately, the computations for estimating the precision matrix with the lasso penalization are complicated and a closed-form solution does not exist. This is why it must be solved with the help of an algorithm. Several have been developed in recent years, but I use the graphical lasso algorithm that Friedman, Hastie and Tibshirani presented in their 2007 paper. It actually estimates the covariance matrix Σ rather than the precision matrix Σ⁻¹ (the explanation is below), but the latter can easily be retrieved from the former. The graphical lasso is a block coordinate descent algorithm, meaning that it estimates one row and column of the covariance matrix at a time rather than estimating the whole covariance matrix at once.

In substance, at each row of the matrix, the covariance coefficients for this row are estimated through the lasso ℓ1-penalized regression. Then, using these new coefficients, the algorithm updates the current covariance matrix estimate. This new covariance matrix estimate is used as the basis for the next optimization, at the next row. Therefore, it does not consist of N separate lasso problems, but rather of a single N-coupled lasso problem. This is what makes the graphical lasso relevant, as the use of the current estimate in each lasso problem shares the information between the problems in an appropriate fashion.

The graphical lasso algorithm is based on the sub-gradient of the likelihood function above. Using the fact that the derivative of ln(det Θ) is Θ⁻¹ = W (the covariance matrix), the sub-gradient equation is:

$$W - S - \lambda_L \Gamma = 0$$

with Γ = sign(Θ), a matrix of component-wise signs of Θ. Basically, the graphical lasso solves this sub-gradient equation for one row/column at a time, while holding the rest fixed. Intuitively, it regresses the variance coefficient w₂₂ on the other coefficients in order to find the covariance coefficients w₁₂, and w₁₂ᵀ by symmetry. For each i, the algorithm partitions the covariance matrix estimate W in this way:

$$W = \begin{pmatrix} W_{11} & w_{12} \\ w_{12}^{\top} & w_{22} \end{pmatrix}$$

with:
- W₁₁ being an (N−1)×(N−1) matrix, corresponding to the original matrix without row and column i;
- w₁₂ and w₁₂ᵀ being (N−1)×1 vectors, corresponding to row and column i, i.e. the covariances ij and ji by symmetry;
- w₂₂ being a scalar, corresponding to the diagonal element ii (the variance of i).

Θ is partitioned in the same way. Therefore, using the partitioning above, for each i the sub-gradient equation to solve becomes:

$$w_{12} - s_{12} - \lambda_L \gamma_{12} = 0$$

with γ_jk = sign(θ_jk) now that we are proceeding row by row, θ_jk being the element jk of the precision matrix Θ. Using w₁₂ = W₁₁β, the sub-gradient for each row/column can be rewritten as:

$$W_{11}\beta - s_{12} + \lambda_L v = 0$$

where v ∈ sign(β), since we know that the sign of θ₂₂ (a diagonal element of the precision matrix) is always positive. This corresponds to the sub-gradient of the ℓ1-regularized quadratic program:

$$\min_{\beta}\ \left\{\frac{1}{2}\beta^{\top}W_{11}\beta - \beta^{\top}s_{12} + \lambda_L \|\beta\|_1\right\}$$

for β an (N−1)×1 vector. The algorithm uses the partitioning and finds the vectors w₁₂ and w₁₂ᵀ (by symmetry), i.e. the shrunk covariances, through this ℓ1-regularized quadratic program of the variable i on the other variables j (i ≠ j). The sub-gradient is the link between the ℓ1-regularized quadratic program (hence, the graphical lasso algorithm) and the solution of the likelihood maximization shown at the beginning of the section.[2]

[2] For more details, see Friedman, Hastie and Tibshirani (2007); Banerjee et al. (2008); Mazumder and Hastie (2012).

3.2.4 Graphical lasso: the algorithm

The objective for each row i is to estimate the covariances w₁₂ (and w₂₁ or w₁₂ᵀ by symmetry) through the lasso regression. The diagonal coefficients of the covariance matrix must not be changed during the algorithm, as it only shrinks the covariances and not the variances. It cycles through the rows i = 1, 2, ..., N, 1, 2, ..., N, ..., and each time updates the current estimate of the covariance matrix W with the N − 1 coefficients (w₁₂ = W₁₁β̂) corresponding to the covariances of asset i with the other assets j. The algorithm continues until it decides it has converged. In my implementation, following Friedman, Hastie and Tibshirani (2007), the convergence condition is satisfied when the average absolute change in W is less than t · mean(|S_offdiag|), where S_offdiag denotes the off-diagonal elements of the sample covariance matrix and t is a fixed threshold. Once convergence is achieved for the covariance matrix, it is easy to convert it into the precision matrix. The stages of the algorithm can be summarized as follows:

1. Start with Ŵ₀ = S + λ_L·I_N, where I_N is the identity matrix of dimension N. For each i from 1 to N:
   (a) Rearrange row/column i in the matrix so that it is the last one, corresponding to the partitioning of the matrix above
   (b) Solve the ℓ1-regularized quadratic program above to find the vector of covariances ŵ₁₂, using as warm start the previous vector β̂ for this row
   (c) Update the row/column of covariances ŵ₁₂ in the current estimate of the covariance matrix Ŵ (transforming Ŵ₀ into Ŵ row by row)
   (d) Save the β̂ for this row/column in a matrix
   (e) Check the convergence condition:
       i. If it is satisfied, stop the algorithm
       ii. If not, start again at (a) with the current estimate Ŵ as Ŵ₀
2. Finally, once it has converged, sweep through all the rows and convert Ŵ into Θ̂ in the same way, using first θ̂₂₂ = 1/(ŵ₂₂ − β̂ᵀŵ₁₂) for the diagonal elements and then θ̂₁₂ = −β̂·θ̂₂₂ for the off-diagonal elements (using the matrix of β̂ saved through the algorithm)

It is important to note that after each i, the algorithm updates Ŵ and uses this updated Ŵ for the next iteration. Therefore, Ŵ is a minor of the sample covariance matrix only at the first iteration, as the algorithm updates (or shrinks) Ŵ each time. Hence, it shares the information between the problems in an appropriate block coordinate fashion, and this is why it amounts to the approximate solution of the penalized likelihood function. It is also important to note that if λ_L = 0, the algorithm does not penalize the coefficients and simply computes the sample inverse covariance matrix Ŝ⁻¹ using a regression at each stage.
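Rather than re-implementing the block coordinate descent above, a quick way to experiment with this estimator is scikit-learn's graphical lasso. The sketch below is my own illustration, not code from the paper: sklearn's alpha plays the role of λ_L, although its scaling and convergence criterion differ from the algorithm described in the text.

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 10))            # T = 120 observations of N = 10 return series

model = GraphicalLasso(alpha=0.05, max_iter=200).fit(X)
Theta_hat = model.precision_              # L1-regularized precision matrix estimate
Sigma_hat = model.covariance_             # corresponding covariance matrix estimate

# sparsity: share of off-diagonal entries shrunk exactly to zero
off_diag = ~np.eye(10, dtype=bool)
print((np.abs(Theta_hat[off_diag]) < 1e-8).mean())
```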

Of course, the smaller t is, the longer the algorithm takes to converge, the more the covariance matrix is shrunk, and the sparser the precision matrix estimate. Indeed, the average absolute change in W must be smaller to satisfy the convergence condition, and it only gets smaller as the algorithm cycles through the rows and shrinks the matrix. The sparsity is also a function of the penalty parameter λ_L: the higher it is, the more the covariance and precision matrices are shrunk through each optimization. If λ_L > |S_ij| for all i ≠ j, i.e. the penalty is higher than all the covariance coefficients, the result is a covariance matrix whose off-diagonal elements are all zero. The characteristics of the graphical lasso are studied in more detail later.

3.3 The L2-regularized estimation: the ridge precision estimator

3.3.1 Description

The ridge regression is another way to shrink and estimate the precision matrix. Whereas the graphical lasso uses the ℓ1-penalty, the ridge estimator uses the ℓ2-penalty. It is similar in that it also imposes a penalty on the coefficients, but the latter penalizes the sum of the squared coefficients (instead of the sum of their absolute values). Whereas the ℓ1-penalty shrinks the coefficients unevenly and sets some to zero, the ℓ2-penalty shrinks all the coefficients in a proportional way. Therefore, the resulting estimate is not sparse. In some situations, it can be better not to have a sparse solution, as the true model may not be sparse. Hence, the ridge estimation promotes shrinkage, but not subset selection. Like the lasso penalization, the ridge penalization also ensures that the covariance matrix is non-singular, i.e. invertible, and that the precision matrix exists. In my paper, I use the work done on the ridge estimation of the precision matrix by Van Wieringen and Peeters (2014), as they have shown that the alternative ridge estimators perform well, and even better than the corresponding graphical lasso estimators in terms of loss.

3.3.2 The L2-penalty

The ℓ2-penalty is also primarily used in regressions, and written in the Lagrangian form, it can be expressed as:

$$\hat{\beta}^{ridge} = \underset{\beta}{\operatorname{argmin}}\ \frac{1}{2}\sum_{i=1}^{N}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 + \lambda_R \sum_{j=1}^{p} \beta_j^2$$

or

$$\hat{\beta}^{ridge} = \underset{\beta}{\operatorname{argmin}}\ \sum_{i=1}^{N}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} \beta_j^2 \le t$$

with λ_R being the penalty parameter. We see that the only difference between this formula and the formula for the ℓ1-penalty is the penalty itself, i.e. Σ_j β_j² instead of Σ_j |β_j|. Like λ_L for the ℓ1-penalty, λ_R controls the strength of the shrinkage. Whereas there is no closed-form solution for the lasso penalty, there is a closed-form solution for the ridge penalty, which makes the computations much easier.
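To illustrate the closed-form solution in the regression setting, here is a small sketch (Python/NumPy, my own illustration; the data are centred so the intercept β₀ is dropped, and the penalty scaling follows the usual (X'X + λI)⁻¹X'y convention):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))                 # 200 observations, 10 centred regressors
y = X @ rng.normal(size=10) + rng.normal(size=200)

lam_R = 5.0                                     # ridge penalty lambda_R
p = X.shape[1]
beta_ridge = np.linalg.solve(X.T @ X + lam_R * np.eye(p), X.T @ y)   # closed-form ridge solution
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)                          # unpenalized benchmark

# ridge shrinks the coefficient vector but never sets entries exactly to zero
print(np.linalg.norm(beta_ridge) < np.linalg.norm(beta_ols), np.all(beta_ridge != 0))
```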

When applied to the precision matrix estimation and the likelihood function, according to Van Wieringen and Peeters (2014) and their alternative ridge precision estimator, the penalized likelihood function can be expressed in this way:

$$\underset{\Theta}{\operatorname{argmax}}\ \left\{\ln(\det\Theta) - \operatorname{trace}(S\Theta) - \frac{1}{2}\operatorname{trace}\big[(\Theta - T)^{\top}\Lambda(\Theta - T)\big]\right\}$$

where Λ is a positive definite symmetric matrix of penalty parameters and T is a positive definite symmetric target matrix. Essentially, this means that the penalty shrinks the precision matrix towards the target matrix, the strength of that shrinkage depending on Λ. Solving this results in the following generic penalized ML ridge estimator of the precision matrix:

$$\hat{\Theta}(\Lambda) = \left\{\Big[\Lambda + \tfrac{1}{4}(S - \Lambda T)^2\Big]^{1/2} + \tfrac{1}{2}(S - \Lambda T)\right\}^{-1}$$

Equivalently, the covariance matrix can be estimated in this way:

$$\hat{\Sigma}(\Lambda) = \Big[\Lambda + \tfrac{1}{4}(S - \Lambda T)^2\Big]^{1/2} + \tfrac{1}{2}(S - \Lambda T)$$

3.3.3 The Alternative Type I ridge precision estimator

The alternative Type I ridge precision estimator is a special case of the generic penalized ML ridge estimator above, with Λ = λ_R·I_N, λ_R being a scalar penalty parameter between zero and infinity. We can then write the Alternative Type I ridge precision estimator as:

$$\hat{\Theta}(\lambda_R) = \left\{\Big[\lambda_R I_N + \tfrac{1}{4}(S - \lambda_R T)^2\Big]^{1/2} + \tfrac{1}{2}(S - \lambda_R T)\right\}^{-1}$$

and the Alternative Type I ridge covariance matrix estimator:

$$\hat{\Sigma}(\lambda_R) = \Big[\lambda_R I_N + \tfrac{1}{4}(S - \lambda_R T)^2\Big]^{1/2} + \tfrac{1}{2}(S - \lambda_R T)$$

It is then also possible to shrink the covariance matrix towards a target covariance matrix C; one just has to specify T = C⁻¹. In order to better understand the estimator, it is necessary to state some of its main properties. First, for any λ_R, the precision matrix coefficients are never exactly equal to zero, as it is a proportional shrinkage method. In other words, it does not promote sparsity and does not achieve subset selection. Second, the closer the shrinkage parameter λ_R gets to zero, the more the precision matrix estimate looks like the inverse of the sample covariance matrix (with the advantage of always being positive definite). Finally, the closer the shrinkage parameter λ_R gets to infinity, the more the precision matrix estimate looks like the target matrix T. The choice of the shrinkage parameter is discussed in the out-of-sample evaluation section.
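Since the estimator is available in closed form, it is straightforward to implement. The sketch below (Python/NumPy, my own illustration; the matrix square root is taken via an eigendecomposition, and the target T = diag(1/diag(S)) anticipates the choice made in the next paragraph) computes the Alternative Type I estimates:

```python
import numpy as np

def ridge_precision_alt1(S, lam_R, T=None):
    """Alternative Type I ridge estimates of the precision and covariance matrices."""
    N = S.shape[0]
    if T is None:
        T = np.diag(1.0 / np.diag(S))            # target: zero off-diagonals, 1/diag(S) on the diagonal
    D = S - lam_R * T
    M = lam_R * np.eye(N) + 0.25 * D @ D         # lambda_R I_N + 1/4 (S - lambda_R T)^2
    vals, vecs = np.linalg.eigh(M)
    M_sqrt = (vecs * np.sqrt(vals)) @ vecs.T     # symmetric matrix square root
    Sigma_hat = M_sqrt + 0.5 * D
    Theta_hat = np.linalg.inv(Sigma_hat)
    return Theta_hat, Sigma_hat

# usage: works even when the sample covariance matrix is singular (N > T)
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 100))                   # 60 observations of 100 assets
S = np.cov(X, rowvar=False)
Theta_hat, Sigma_hat = ridge_precision_alt1(S, lam_R=1000.0)
print(np.all(np.linalg.eigvalsh(Theta_hat) > 0))  # the estimate is positive definite
```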

We see that this ridge estimator is much easier to compute than the graphical lasso, which is a significant advantage. It also offers much more flexibility, due to the multiple possible values for the shrinkage parameter and the possibility of shrinking towards a target matrix. I chose T such that diag[T] = 1/diag[S], as Van Wieringen and Peeters (2014) suggested. This way the estimator shrinks the off-diagonal elements of the precision matrix towards zero, as desired. Of course, multiple different choices would be possible, and this deserves to be studied in more detail. To point out a potential disadvantage of the estimator, it does not achieve subset selection, and this could perhaps be problematic in an asset allocation context, especially compared to the graphical lasso. This is what I aim to test in the second part of my paper with an application to financial data.

4 Analysis of the two estimators

In this section, the main characteristics of both estimators are studied. Of course, all the results are sample specific, meaning that, depending on the data and the shrinkage parameters used, the results can differ. However, the objective here is to capture the main relationships between the parameters. All the timing results are based on an Intel Core i5 3.40 GHz processor.

4.1 The graphical lasso - the shrinkage parameter λ_L

The effect of the shrinkage parameter for the graphical lasso can be seen in Table 1. The data are the same 10 variables, with t = 0.01 as convergence parameter. |Σ_L| and |Θ_L| denote the absolute values of the off-diagonal elements of the corresponding estimated matrices. As a general trend, the higher the shrinkage parameter, the more the precision matrix and the covariance matrix are shrunk. It also takes less time, and fewer iterations, to reach the convergence condition (one iteration meaning N optimizations, or one cycle through the matrix). Therefore, the algorithm converges faster with a higher shrinkage parameter. If the penalty parameter λ_L is very close to, or even higher than, the covariance matrix entries, it simply shrinks the entries directly to zero in one iteration, which is what we see with λ_L = 70 in Table 1.

λ_L     mean(|Σ_L|)   mean(|Θ_L|)   #Iterations   Seconds
0.75    7.0059        0.0017        31            51.1433
1       6.8888        0.0016        24            47.5406
1.5     5.5381        0.0014        20            42.2321
5       3.4553        0.00074       8             15.5955
20      2.3061        0.00025       3             4.6760
70      0             0             2             0.1859

Table 1: The effects of the shrinkage parameter λ_L

4.2 The graphical lasso - the convergence condition t

The convergence condition is also a parameter which has an effect on the shrinkage of the covariance and precision matrices. In Table 2, t varies while the other elements are kept fixed. There are also 10 variables, with a shrinkage parameter λ_L = 2. As anticipated, the smaller the convergence condition, the longer the algorithm takes to shrink the matrices (especially between t = 0.05 and t = 0.01).

However, there is not much difference in terms of shrinkage once t = 0.01 is passed. That means that the shrinkage parameter λ_L really determines the shrinkage; the convergence condition only determines the precision of the shrinkage.

t        mean(|Σ_L|)   mean(|Θ_L|)   #Iterations   Seconds
0.5      25.4904       0.0044        1             1.2367
0.1      14.4187       0.0024        5             5.5506
0.05     10.1973       0.0019        8             11.3330
0.01     4.8168        0.0012        16            33.4172
0.001    3.8124        0.0010        23            44.8588
0.0005   3.6017        0.00096       28            51.8451

Table 2: The effects of the convergence condition t

4.3 The ridge estimator - the shrinkage parameter λ_R

The shrinkage parameter of the ridge estimator is on a different scale than that of the graphical lasso, as it can go from zero to infinity. It is interesting to see the effect of different values of λ_R on the covariance and precision matrices. These results are also for the same 10 variables, and they show that it can make a lot of difference. The column mean(|Ŝ⁻¹ − Θ_R|) shows the difference between the sample inverse covariance matrix and the inverse covariance matrix estimated through the Alternative Type I ridge estimator. In agreement with the theory, when λ_R is close to zero, the precision matrix estimate looks a lot like the sample one. As λ_R gets higher and goes towards infinity, the precision matrix estimated through the ridge estimator becomes more and more different from the sample one and starts to look like the target matrix. In the asset allocation context, the lower the shrinkage parameter, the more extreme the weights given by the strategy. We can see that the shrinkage is proportional to the value of λ_R, in agreement with the theory. As for the time taken, we can see that there is a big difference between the graphical lasso and the ridge estimator. The timing is studied in more detail in the next section.

λ_R        mean(|Σ_R|)   mean(|Θ_R|)   mean(|Ŝ⁻¹ − Θ_R|)   Seconds
0.1        28.4792       0.0162        0.0002               0.0051
100        27.9286       0.0059        0.0166               0.0048
2000       23.4135       0.0025        0.0234               0.0046
20000      5.9460        0.0011        0.0250               0.0048
1000000    0.0733        0.00003       0.0255               0.0052
10000000   0.0073        0.000003      0.0255               0.0070

Table 3: The effects of the shrinkage parameter λ_R

4.4 The ridge estimator VS the graphical lasso - the timing

This section shows the differences in timing between the two estimators. Basically, the time needed to estimate the covariance and precision matrices is displayed as a function of the number of variables. The variables here are series of random numbers from a normal distribution with µ = 0 and σ = 4. For the number of variables N going from 5 to 100, N + 1 observations are generated for each variable. For the graphical lasso parameters, λ_L = 1 and t = 0.01, whereas for the ridge estimator λ_R = 1000. The element which really determines the speed of the graphical lasso estimator is the size of the penalty compared to the size of the covariances. As the covariances of these random normal observations are rather low compared to the covariances of stock return data, λ_L is set rather low to represent the shrinkage process more accurately. The estimation can actually take more or less time than what is shown here, depending on the data.

The results could not be represented together, as the ridge estimator would not even appear on the graphical lasso graph. The ridge estimator takes, at most, 0.007 seconds to estimate the covariance and precision matrices, as we see in Figure 3, whereas the graphical lasso can take up to almost 200 seconds, as we see in Figure 1. Especially for the graphical lasso, there seems to be an upward trend in the time taken when variables are added. This is not surprising, as more variables means more optimizations and more coefficients to estimate in each optimization. However, it is not always exactly the case, and the behaviour can be quite erratic. Moreover, in Figure 2, the time needed for the graphical lasso estimation is represented not only as a function of the number of variables but also as a function of the convergence condition, with t = 0.5, 0.1 and 0.01. We see that the smaller the convergence condition, the longer it takes for the same number of assets.

Figure 1: Time in seconds as a function of # of variables - graphical lasso

Figure 2: Time in seconds (y-axis) as a function of # of variables (x-axis) and convergence condition t (z-axis) - graphical lasso

Figure 3: Time in seconds as a function of # of variables - ridge estimator
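The timing experiment is easy to reproduce in spirit. The sketch below (Python, my own illustration) times the closed-form ridge estimator against scikit-learn's graphical lasso, used here as a stand-in for the in-text algorithm; to keep the sketch numerically stable, more observations are generated than the N + 1 used in the text, so the absolute numbers are not comparable to the figures:

```python
import time
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(1)

def ridge_precision_alt1(S, lam_R):
    # Alternative Type I ridge estimator with target T = diag(1/diag(S))
    N = S.shape[0]
    D = S - lam_R * np.diag(1.0 / np.diag(S))
    vals, vecs = np.linalg.eigh(lam_R * np.eye(N) + 0.25 * D @ D)
    return np.linalg.inv((vecs * np.sqrt(vals)) @ vecs.T + 0.5 * D)

for N in (5, 25, 50, 100):
    X = rng.normal(loc=0.0, scale=4.0, size=(5 * N, N))   # more observations than in the text, for stability
    S = np.cov(X, rowvar=False)

    t0 = time.perf_counter()
    ridge_precision_alt1(S, lam_R=1000.0)
    t_ridge = time.perf_counter() - t0

    t0 = time.perf_counter()
    GraphicalLasso(alpha=1.0, max_iter=200).fit(X)
    t_glasso = time.perf_counter() - t0

    print(f"N = {N:3d}   ridge: {t_ridge:.4f} s   graphical lasso: {t_glasso:.4f} s")
```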

4.5 The ridge estimator VS the graphical lasso - sparsity

In theory, the ridge estimator never sets coefficients to zero, as it only shrinks the coefficients proportionally. The graphical lasso, meanwhile, shrinks some coefficients more than others and can set some coefficients to zero. We can see the different effects of the two estimators on the sparsity of the precision matrices in the tables below. The sparsity is defined as the number of zero off-diagonal elements [3] (as the diagonal elements are not changed by the graphical lasso, they are always positive) over the total number of elements in the precision matrix estimate. The theory seems to be confirmed. Even though the mean of the off-diagonal elements of Θ_R can be smaller than that of Θ_L, there are still no zero elements with the ridge estimator, and therefore no sparsity. This demonstrates its proportional shrinkage, as opposed to the graphical lasso. Of course, as anticipated, the higher the shrinkage parameter, the higher the sparsity, reaching 100% for λ_L = 70, meaning that all the off-diagonal coefficients are equal to zero. Here, the convergence condition t for the graphical lasso is equal to 0.01.

λ_L     Sparsity Θ_L   mean(|Θ_L|)
0.75    0.49           0.0017
1       0.42           0.0016
1.5     0.69           0.0014
5       0.78           0.00074
20      0.80           0.00026
70      1              0

λ_R        Sparsity Θ_R   mean(|Θ_R|)
0.1        0              0.0162
100        0              0.0059
2000       0              0.0025
20000      0              0.0011
1000000    0              0.00003
10000000   0              0.000003

Figure 4: The sparsity of both precision matrix estimators

[3] The precision matrix elements are actually rounded to zero for both estimators here; if a coefficient is < 0.00000001, it is considered to be zero.

5 Out-of-sample evaluation

Now that the theoretical characteristics and the implications of both estimators have been reviewed, I come to the second part of the paper: the implementation of both methods on financial data to test their out-of-sample performance.

5.1 Setup

5.1.1 The databases

Datasets   Data description                                       N     T     T/N    Time period
#1         Portfolios based on size and BtM ratio (96SBtMport)    96    120   1.25   07/1969-12/2012
#2         Industry portfolios (48IndPort)                        48    120   2.5    07/1969-12/2012
#3         Individual stocks (133Indiv)                           133   120   0.9    07/1969-12/2012
#4         Combination 2 (96SBtMport + 133Indiv)                  114   120   1.05   07/1969-12/2012

Table 4: Description of the datasets considered

For the data, I follow the example of Goto and Xu (2013), as I chose four different databases of different sizes in order to cover several sample characteristics. The period for all the databases goes from 07/1969 to 12/2012, and all series are monthly return data. Of course, I also have data on the risk-free rate for the same period. The first dataset is composed of 96 portfolios [4] formed on size and book-to-market ratio, available on the Kenneth R. French website [5]. The second dataset is formed of 48 industry portfolios, also available courtesy of the Kenneth R. French website. The third one is a sample of 133 individual stocks chosen randomly. Finally, the last one is a combination of the 96 portfolios and the 133 individual stocks, totalling 114 assets (48 from the first dataset and 66 from the third). We can see that three of the four datasets contain return data on large and diversified portfolios, the exception being the #3 dataset (and, of course, half of the last dataset). For all datasets, the covariance matrices and their inverses are computed using a period of 120 months (T), or 10 years. Therefore, for the third dataset, it is impossible to use the sample covariance matrix for the optimization: as T/N < 1, it is not invertible and the precision matrix is undefined.

[4] There are actually 100 portfolios, but I removed four of them because of missing data.
[5] http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html

5.1.2 The different portfolio strategies considered

To provide some type of benchmark for the empirical evaluation, I also consider, in addition to the graphical lasso and the ridge portfolios, several popular methods. Before going into more detail, I think it is necessary to clarify one notion: except for the equally-weighted portfolio, which is free of any estimation, the methods I consider are all based on the Global Minimum Variance (GMV) portfolio formula:

$$w_{GMV} = \frac{\Sigma^{-1}\mathbf{1}_N}{\mathbf{1}_N^{\top}\Sigma^{-1}\mathbf{1}_N}$$

It only depends on the inverse covariance matrix Σ⁻¹, and the difference between the methods is the way this inverse covariance matrix is estimated (Σ̂⁻¹) or the way this formula is applied. But the point is that they are all based on it, as I only focus on the risk in my paper and not on the return of the strategies. Or, to be more precise, I only focus on risk minimization, as these are minimum variance portfolios.

1. The equally-weighted portfolio (1/N):

$$w_{EW} = \frac{1}{N}\mathbf{1}_N$$

It is the simplest possible strategy, as it is only an equal allocation across all assets in the portfolio. However, it has been shown to perform surprisingly well, as it is free of any estimation errors and has low turnover costs.

2. The sample-based minimum variance portfolio (S):

$$w_{S} = \frac{\hat{S}^{-1}\mathbf{1}_N}{\mathbf{1}_N^{\top}\hat{S}^{-1}\mathbf{1}_N}$$

It is the GMV portfolio, but based on the inverse of the sample covariance matrix, Ŝ⁻¹, with all its known disadvantages.

3. The Jagannathan and Ma (2003) portfolio (JM):

$$w_{JM} = \underset{w}{\operatorname{argmin}}\ w^{\top}\hat{S}w \quad \text{s.t.} \quad \sum_{i=1}^{N} w_i = 1 \ \text{and} \ w_i \ge 0 \ \text{for all } i = 1, \ldots, N$$

Basically, it is just the sample minimum variance portfolio, but with a no-short-sale constraint. It has been shown to perform well and to limit the sample covariance problems. Moreover, it does not depend on the inverse covariance matrix, which means that it can be computed for every T/N profile.

4. The Ledoit and Wolf portfolio (LW):

$$w_{LW} = \frac{\hat{\Sigma}_{LW}^{-1}\mathbf{1}_N}{\mathbf{1}_N^{\top}\hat{\Sigma}_{LW}^{-1}\mathbf{1}_N}$$

It is also the GMV portfolio, but with the precision matrix estimated with Ledoit and Wolf's shrinkage estimator Σ̂⁻¹_LW [6].

5. The graphical lasso portfolio (L):

$$w_{L} = \frac{\hat{\Sigma}_{L}^{-1}\mathbf{1}_N}{\mathbf{1}_N^{\top}\hat{\Sigma}_{L}^{-1}\mathbf{1}_N}$$

The GMV portfolio with the inverse covariance matrix estimated through the graphical lasso algorithm.

6. The ridge portfolio (R):

$$w_{R} = \frac{\hat{\Sigma}_{R}^{-1}\mathbf{1}_N}{\mathbf{1}_N^{\top}\hat{\Sigma}_{R}^{-1}\mathbf{1}_N}$$

The GMV portfolio with the inverse covariance matrix estimated through the Alternative Type I ridge precision estimator.

Of course, as mentioned earlier, for two of the datasets the sample minimum variance strategy is not applicable.

[6] The function is publicly available courtesy of Michael Wolf's page: http://www.econ.uzh.ch/faculty/wolf/publications.html

5.1.3 The approach

The approach I consider is out-of-sample, meaning that the period I use to estimate the matrices and optimize the weights is not the same period used to test the performance of the strategies. It is actually a rolling-window approach: for each month t, the covariance and precision matrices are estimated using the 120 preceding months (from t − 120 to t). The optimized portfolio weights ŵ_{i,t} to invest in each asset are computed according to the strategy, and these weights are held for one month, during which the strategy achieves a certain level of out-of-sample return R_{t+1} = Σ_{j=1}^{N} ŵ_{j,t} r_{j,t+1}. Then, the following month t + 1, the matrices are re-estimated using the 120 months from t − 119 to t + 1, which gives new optimized weights ŵ_{i,t+1}, which are again held until t + 2, to achieve a strategy return R_{t+2} = Σ_{j=1}^{N} ŵ_{j,t+1} r_{j,t+2}.

This process goes on over the whole sample period. Given that 120 initial months are needed to start trading, this results in 402 covariance and precision matrices and 402 vectors of N optimized weights, as there are 402 trading periods (denoted P in the rest of the paper) over the whole sample (the 522 months of data minus the 120 initial months). This process ensures the potential investability of the strategies and the pertinence of the backtesting.

5.1.4 The choice of the shrinkage parameters λ_L and λ_R

For the graphical lasso and the ridge strategies, values for λ_L and λ_R have to be set in order to apply the strategies. The optimal shrinkage parameter values are unknown, and a way to estimate them must be found. Different methods exist, like the Leave-One-Out Cross-Validation (LOOCV) score, but given that performance is the main objective in this paper, I rather use a performance-based method. In order to be able to compare both strategies on the same scale, the method should be the same for the two shrinkage parameters. I estimate the precision matrix with different shrinkage parameter values for both strategies during the first 10 years of each sample, and I use the shrinkage parameters which result in the best performance over this in-sample period. The shrinkage parameters are then kept the same over the whole period. The performance is judged from a mean point of view or, to be more precise, I choose the shrinkage parameter which maximizes the mean of the returns over the in-sample period. Of course, given that only one precision matrix is estimated for the whole in-sample period, the weights for each month during that period are the same for a given shrinkage parameter value. For the out-of-sample period, the weights change each month, and performance must not be judged in the same way.

Other measures than the mean could be used, like the variance or the Sharpe ratio, but they usually result in bad performance out-of-sample. Indeed, the lowest shrinkage parameter tends to give the least variable returns in-sample, whereas this is not the case out-of-sample. Therefore, using any measure based on the variance or standard deviation of the returns in-sample is not appropriate. A possible explanation is that, as the same period is used to estimate the matrices and to test the performance, the most extreme weights (i.e. the lowest shrinkage parameter) do indeed give the least variable returns in-sample. The mean, even though it is rather intuitive and simple, tends to give better results out-of-sample.

For the ridge strategy, I test 200 different values of λ_R between 150 and 30000 for all samples. Unfortunately, for the graphical lasso, due to the intensive computations, I cannot be as thorough, and I test 10 different values of λ_L per sample. The range of potential values depends on the sample; for smaller samples, the range covers lower values than for samples with more assets. For example, the sample with the individual stocks is the one for which the range is the highest. Indeed, this is probably where the strongest shrinkage must be achieved, as there are probably fewer correlations among individual stocks than among large and diversified portfolios. As for the convergence condition, I chose t equal to 0.1 for all the samples, due to the time taken by the graphical lasso algorithm.
Hence, it is important to note that a smaller t may achieve a different shrinkage, and maybe a different out-of-sample performance, than what is shown here. It is necessary to specify that the computations for the graphical lasso are intensive, and this is the reason why the shrinkage parameters are only estimated once per sample for both regularization strategies.

The shrinkage parameter could be re-estimated during the period, but this is more difficult to implement, and under the hypothesis that the optimal shrinkage parameter stays stable, it should not be a problem. For the ridge estimator, the computations are much easier and it takes less time to estimate the precision matrices. Moreover, there is an infinity of shrinkage parameter values to choose from, and it is more likely that the optimal shrinkage parameter changes during the period. Therefore, it makes sense in that case to re-estimate several times over the period, and this is what I do with the adaptive ridge strategy in the next section.

5.2 The adaptive ridge strategy - ADR

In order to fully exploit the advantages of the ridge estimator, I also include a strategy which re-estimates the optimal shrinkage parameter λ_R along the sample period. Indeed, it is likely that the same shrinkage parameter for 402 months (more than 30 years) may not really be optimal. It can give good results in some situations, while performing poorly in others. This can be especially problematic for the ridge strategy, which has an infinity of possible shrinkage parameter values, unlike the graphical lasso, whose optimal shrinkage parameter is more likely to be stable. Furthermore, the speed of estimation and the flexibility of the ridge estimator are characteristics that should be taken advantage of.

The most important feature is the re-estimation period, i.e. the frequency at which the shrinkage parameter is reset. There is a trade-off: it must not be too short, in order not to reset when it is not necessary, but it must be short enough for the strategy to adapt itself sufficiently to potential new economic and statistical conditions. Moreover, too many shrinkage parameter changes may not be desirable in terms of portfolio turnover, as they have a direct effect on the weights. To take that into account, I implement a flexible re-estimation period in the algorithm. The re-estimation is done in principle every 24 months but, depending on the value of λ_R found, this period can be shortened or extended. If the value found is too different from the last ones, meaning that the current context may be unstable, the period until the next re-estimation is reduced to 6 months (instead of 24). Hence, it allows the strategy to adapt itself more quickly to potentially erratic conditions. Conversely, if the shrinkage parameter found is very close to the previous one, the next re-estimation period is extended to 36 months, on the basis that the situation may be more stable and that the current λ_R may be well suited to the sample.

Another important characteristic is the range of shrinkage parameters allowed at each re-estimation. The minimum value for λ_R across all samples is fixed at 150, as in the regular ridge strategy. There is also a maximum value, set to 5 × 10¹⁵, because, for unexplained reasons, the estimation process is disrupted by too high shrinkage parameter values. Moreover, there is no loss from imposing this limit given that, beyond a certain level, it does not make any real difference in terms of shrinkage. At each re-estimation, the goal is to have a range wide enough to capture all potential optimal values while not being so wide that precision is lost. It is a function of the last estimated optimal shrinkage parameter, or to be more precise, λ_{R,i} ∈ [0.4·λ_{R,i−1}; 1.6·λ_{R,i−1}]. 200 equally spaced values are then chosen as potential candidates within this range.
However, when the last optimal shrinkage parameter value is at the top of the range, the potential range is increased for the next estimation, in order to capture a potential upward or downward trend. The result is a time series of estimated optimal shrinkage parameters for each sample.
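To fix ideas, here is a deliberately simplified sketch (Python/NumPy, my own illustration) of the rolling-window backtest of Section 5.1.3 combined with a periodic re-estimation of λ_R. It uses synthetic returns in place of the datasets, a fixed 24-month re-estimation schedule and a fixed grid of candidate values, rather than the flexible schedule and adaptive range described above:

```python
import numpy as np

def ridge_precision_alt1(S, lam_R):
    # Alternative Type I ridge estimator, target T = diag(1/diag(S))
    N = S.shape[0]
    D = S - lam_R * np.diag(1.0 / np.diag(S))
    vals, vecs = np.linalg.eigh(lam_R * np.eye(N) + 0.25 * D @ D)
    return np.linalg.inv((vecs * np.sqrt(vals)) @ vecs.T + 0.5 * D)

def gmv_weights(theta):
    ones = np.ones(theta.shape[0])
    w = theta @ ones
    return w / (ones @ w)

rng = np.random.default_rng(0)
R = rng.normal(0.008, 0.05, size=(522, 48))      # 522 months of synthetic returns on 48 assets

window, reestimate_every = 120, 24
grid = np.linspace(150, 30000, 50)               # candidate lambda_R values
lam_R = None
oos_returns = []

for t in range(window, R.shape[0]):
    past = R[t - window:t]                       # estimation window
    S = np.cov(past, rowvar=False)
    if lam_R is None or (t - window) % reestimate_every == 0:
        # pick the lambda_R whose fixed weights give the highest mean return over the window
        lam_R = max(grid, key=lambda l: (past @ gmv_weights(ridge_precision_alt1(S, l))).mean())
    w = gmv_weights(ridge_precision_alt1(S, lam_R))
    oos_returns.append(R[t] @ w)                 # hold the weights for the next month

oos_returns = np.array(oos_returns)
print(len(oos_returns), oos_returns.mean(), oos_returns.std())   # 402 out-of-sample trading periods
```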