# TWO $\ell_1$ BASED NONCONVEX METHODS FOR CONSTRUCTING SPARSE MEAN REVERTING PORTFOLIOS


## Transcription

X. LONG, K. SOLNA AND J. XIN

constraint in order to improve the profits during arbitrage opportunities, while in [9], the authors discuss parameter estimation and trading strategies in more detail. In our work, we take the approach set forth in [5] as a starting point and develop two optimization methods that use $\ell_1$ based norms to enforce the sparsity of the portfolio.

The first method uses the ratio of the $\ell_1$ and $\ell_2$ norms as a penalty function, assuming limited prior information on the assets. Such a penalty term arises in nonnegative matrix factorization (NMF), blind deconvolution, and sparse representation in coherent dictionaries [12], [14], [15], [8], [24]. For example, the nonnegative least squares (NNLS) problem under such a penalty takes the following form [8]:

$$\min_{x \ge 0} \ \|Xx - Y\|_2^2 + \gamma P(x)$$

where $P(x) = \|x\|_1 / \|x\|_2$, $X$ is an $m \times n$ matrix and $Y$ is an $m \times 1$ vector. For our problem, we formulate the following minimization:

$$\min_{x \neq 0} \ \frac{x^T A x}{x^T B x} + \gamma \frac{\|x\|_1}{\|x\|_2} \tag{1.1}$$

where $A$ and $B$ are positive definite matrices and $\gamma$ is a nonnegative tuning parameter. The first term of the objective function of (1.1) models predictability and the second term promotes sparsity. The details are in section 5. Due to the nonconvexity of (1.1), finding a global minimum is challenging. To deal with this aspect of global optimization, we incorporate a recent variant of simulated annealing, the so-called intermittent diffusion method with discontinuous diffusion coefficient [3]. The combination of local minimization of (1.1) and random search for a global minimum is, however, expensive for large portfolios.

The second method uses the $\ell_1$ norm and our partial knowledge of the collection of assets. We reformulate the problem as a quadratic program:

$$f(r) = \max_{x(i) = 1,\ \|x\|_1 \le m} \ x^T B x - r\, x^T A x$$

where $A$ and $B$ are both positive definite matrices, $r$ is a nonnegative number, and the choice of $(i, m)$ will be addressed later. The ratio minimization problem

$$\min_{x(i) = 1,\ \|x\|_1 \le m} \ \frac{x^T A x}{x^T B x}$$

will be shown to be equivalent to finding the positive root of $f(r)$. Nonconvexity still exists, and a special algorithm designed to minimize a difference of convex functions [10] can be applied to overcome this difficulty in combination with convex algorithms such as least angle regression [6]. The resulting algorithm is faster than the first method and can handle portfolios with hundreds of assets, but it requires prior knowledge.

The paper is organized as follows. In section 2, we restate the problem of sparse and fast mean reverting portfolios proposed in [5]. Section 3 discusses two proxies of the mean reversion coefficient. In section 4, we present the optimization problem for finding the desired portfolios and give a brief introduction to existing methods for solving it. In sections 5 and 6, we introduce two new approaches for constructing portfolios. We formulate new optimization problems and develop the algorithms to solve them.

Numerical experiments are presented in section 7. We perform in-sample and out-of-sample tests on both synthetic data and historical market data. The in-sample and out-of-sample performance curves for the mean reversion coefficients and the trading opportunities show similar trends as functions of cardinality or $\ell_1$ penalty level.

### 2. Sparse and fast mean reverting portfolio problem

We briefly review the formulation in [5]. Suppose that $S_{ti}$ is the value at time $t$ of the $i$th asset with $i = 1, 2, \dots, n$ and $t = 1, 2, \dots, l$. We want to form a portfolio $P_t$ of these assets with coefficients $x_i$, and assume it follows an Ornstein-Uhlenbeck process given by:

$$dP_t = \lambda(\mu - P_t)\,dt + \sigma\, dW_t \quad \text{with} \quad P_t = \sum_{i=1}^{n} x_i S_{ti}$$

where $\lambda > 0$, $\sigma > 0$ and $\mu$ are parameters and $W_t$ is a standard Brownian motion. The objective here is to maximize the mean reversion coefficient $\lambda$ of $P_t$ by adjusting the portfolio weights $x_i$ under the normalization $\sum_{i=1}^{n} x_i^2 = 1$. In addition, we want to limit the number of assets in the portfolio, i.e. we want to balance the number of nonzero $x_i$ against the mean reversion coefficient.

The author of [5] used predictability as a proxy of the mean reversion coefficient, discussed methods for parameter estimation and formulated an optimization problem with a sparsity constraint. Two algorithms were proposed to solve this problem: greedy search and semidefinite relaxation. However, the greedy search often fails to produce the optimal solution, and the semidefinite relaxation method is costly and unwieldy for large portfolios with hundreds of assets. The results were mainly tested on small data sets (8 or 14 assets). They show that the mean reversion coefficient increases as the cardinality increases.

### 3. Proxy of the mean reversion coefficients

In [4], the authors discuss three different criteria to measure how fast a portfolio is mean-reverting. They are predictability, the portmanteau statistic and the crossing statistic. In [17], the authors review some of the most widely applied hedging methods and the test statistics used to judge whether a portfolio can be used for hedging. We refer the readers to [4], [1], [17] for more details. In our paper, we mainly consider two proxies. One is predictability and the other is called the direct OU estimator.

#### 3.1. Predictability

The idea of predictability of a time series is first derived in [1]. They consider a stationary vector autoregressive (VAR) model:

$$S_t^T = S_{t-1}^T \beta + Z_t^T \tag{3.1}$$

where the column vector $S_t = (S_{t1}, \dots, S_{tn})^T$, $\beta \in \mathbb{R}^{n \times n}$, and $Z_t$ is a vector of i.i.d. Gaussian noise with zero mean and covariance matrix $\Sigma$, independent of $S_{t-1}$. In the univariate case,

$$E[S_t^2] = E[(S_{t-1}\beta)^2] + E[Z_t^2]$$

which can be rewritten as $\sigma_t^2 = \sigma_{t-1}^2 + \Sigma$, where $\sigma_t^2 = E[S_t^2]$ and $\sigma_{t-1}^2 = E[(S_{t-1}\beta)^2]$. Box & Tiao (1977) measure the predictability of a stationary series by:

$$\nu = \frac{\sigma_{t-1}^2}{\sigma_t^2}$$
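As a small illustration of the univariate predictability ratio, the sketch below evaluates $\nu$ for a scalar AR(1) series in closed form and compares it with an estimate from simulated data. The function names, the parameter values ($\beta = 0.9$, unit noise variance) and the use of uncentered second moments for a zero-mean series are assumptions made for this example; they are not taken from the paper.

```python
import random

def ar1_predictability(beta, noise_var):
    """Closed-form nu for the stationary scalar model S_t = beta * S_{t-1} + Z_t."""
    if abs(beta) >= 1:
        raise ValueError("stationarity requires |beta| < 1")
    var_s = noise_var / (1.0 - beta ** 2)   # sigma_t^2 = E[S_t^2]
    var_pred = beta ** 2 * var_s            # sigma_{t-1}^2 = E[(S_{t-1} beta)^2]
    return var_pred / var_s

def simulate_ar1(beta, noise_var, n_obs, seed=0):
    """Simulate a zero-mean AR(1) path (hypothetical helper)."""
    rng = random.Random(seed)
    s, path = 0.0, []
    for _ in range(n_obs):
        s = beta * s + rng.gauss(0.0, noise_var ** 0.5)
        path.append(s)
    return path

def empirical_predictability(path):
    """Least-squares beta_hat, then predicted variance over total variance."""
    prev, curr = path[:-1], path[1:]
    beta_hat = sum(p * c for p, c in zip(prev, curr)) / sum(p * p for p in prev)
    var_pred = sum((beta_hat * p) ** 2 for p in prev) / len(prev)
    var_curr = sum(c * c for c in curr) / len(curr)
    return var_pred / var_curr

nu_exact = ar1_predictability(0.9, 1.0)
nu_hat = empirical_predictability(simulate_ar1(0.9, 1.0, 20000))
```

For a scalar AR(1) series the ratio reduces to $\beta^2$, so a small $\nu$ means that little of the variance is explained by the previous value.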

d'Aspremont [5] proposes to use this measure of predictability as a proxy for the mean reversion parameter $\lambda$. In the multivariate case, consider a portfolio $P_t = S_t^T x$ with weights $x \in \mathbb{R}^n$. Multiplying both sides of (3.1) by $x$, we get

$$S_t^T x = S_{t-1}^T \beta x + Z_t^T x$$

and we can measure its predictability as:

$$\nu_1(x) = \frac{x^T \beta^T \Gamma \beta x}{x^T \Gamma x}$$

where $\Gamma$ is the covariance matrix of $S_t$. Minimizing the predictability $\nu_1(x)$ corresponds to maximizing $\lambda$.

#### 3.2. Direct OU estimator

The Ornstein-Uhlenbeck process plays an important role in constructing sparse and fast mean reverting portfolios. We will use the mean reversion coefficient $\lambda$ to test the performance of a portfolio. Therefore, we need an estimator for the mean reversion coefficient and will use this estimator as its proxy value. Consider an Ornstein-Uhlenbeck process:

$$dP_t = \lambda(\mu - P_t)\,dt + \sigma\, dW_t \tag{3.2}$$

We can estimate the parameters of an OU process by linear regression. We refer the readers to [13] and [25] for more details. By writing (3.2) in a discrete form, we have

$$P_t - P_{t-1} = \lambda\mu\,\Delta t - \lambda P_{t-1}\,\Delta t + \sigma\,\Delta W_t$$

Note that it can be written as:

$$P_t = \lambda\mu\,\Delta t + P_{t-1}(1 - \lambda\,\Delta t) + \sigma\,\Delta W_t$$

When $\mu = 0$, this has the form (3.1) in the case that $n = 1$. It is equivalent to the linear regression:

$$y = a + bx + \epsilon_t$$

Therefore, we can estimate all the parameters by regressing $P_t - P_{t-1}$ on $P_{t-1}$. Then we can recover $\hat\lambda$ as $-\hat b / \Delta t$, $\hat\mu$ as $\hat a / (\hat\lambda\,\Delta t)$ and $\hat\sigma$ as $\hat\sigma(\epsilon_t) / \sqrt{\Delta t}$. Maximizing the estimated mean reversion coefficient $\hat\lambda$ corresponds to minimizing the estimated slope $\hat b$. Note that

$$b = \frac{\mathrm{cov}(P_t - P_{t-1}, P_{t-1})}{\mathrm{var}(P_{t-1})} = \frac{\mathrm{cov}(P_t, P_{t-1})}{\mathrm{var}(P_{t-1})} - 1$$

Replacing $P_t$ by $S_t^T x$, we have:

$$b = \frac{\mathrm{cov}(S_t^T x, S_{t-1}^T x)}{\mathrm{var}(S_{t-1}^T x)} - 1$$

We define the direct OU estimator as:

$$\nu_2(x) = \frac{\mathrm{cov}(S_t^T x, S_{t-1}^T x)}{\mathrm{var}(S_{t-1}^T x)} \tag{3.3}$$
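The regression recipe for the OU parameters can be sketched as follows. This is a minimal example under stated assumptions: the path is generated with the same Euler discretisation as the discrete form of (3.2), ordinary least squares is used for the regression of $P_t - P_{t-1}$ on $P_{t-1}$, and the function names and parameter values are hypothetical choices for illustration.

```python
import math
import random

def simulate_ou(lam, mu, sigma, dt, n_obs, p0=0.0, seed=1):
    """Euler scheme: P_t - P_{t-1} = lam*(mu - P_{t-1})*dt + sigma*dW_t."""
    rng = random.Random(seed)
    path = [p0]
    for _ in range(n_obs - 1):
        dw = rng.gauss(0.0, math.sqrt(dt))
        path.append(path[-1] + lam * (mu - path[-1]) * dt + sigma * dw)
    return path

def estimate_ou(path, dt):
    """Regress y = P_t - P_{t-1} on x = P_{t-1}, then map (a, b) to (lam, mu, sigma)."""
    x = path[:-1]
    y = [b - a for a, b in zip(path[:-1], path[1:])]
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b_hat = sxy / sxx
    a_hat = mean_y - b_hat * mean_x
    resid = [yi - a_hat - b_hat * xi for xi, yi in zip(x, y)]
    resid_std = math.sqrt(sum(e * e for e in resid) / (n - 2))
    lam_hat = -b_hat / dt                    # lambda_hat = -b_hat / dt
    mu_hat = a_hat / (lam_hat * dt)          # mu_hat = a_hat / (lambda_hat * dt)
    sigma_hat = resid_std / math.sqrt(dt)    # sigma_hat = std(eps) / sqrt(dt)
    return lam_hat, mu_hat, sigma_hat

path = simulate_ou(lam=5.0, mu=1.0, sigma=0.5, dt=0.01, n_obs=50000)
lam_hat, mu_hat, sigma_hat = estimate_ou(path, dt=0.01)
```

A more negative slope gives a larger $\hat\lambda$, which is why minimizing $\hat b$ is used as the proxy for fast mean reversion.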

If $S_t$ is stationary, then we can rewrite (3.3) in the following form:

$$\nu_2(x) = \frac{x^T\left(\mathrm{cov}(S_t, S_{t-1}) + \mathrm{cov}(S_t, S_{t-1})^T\right)x}{2\,x^T\,\mathrm{var}(S_{t-1})\,x}$$

Minimizing the proxy $\nu_2(x)$ corresponds to maximizing $\lambda$. This proxy captures a similar property of a time series as the portmanteau statistic in [4] and the direct Dickey-Fuller statistic in [17]. One thing to note is that the proxy (3.3) works even if we do not assume that the asset prices $S_t$ are stationary. The only assumption needed is that the linear combination of the assets $S_t^T x$ is stationary. This phenomenon is called cointegration [7]. According to our numerical tests, the portfolios constructed under this proxy perform better in trading. Though the estimate of $\mathrm{cov}(S_t, S_{t-1}) + \mathrm{cov}(S_t, S_{t-1})^T$ is not guaranteed to be positive definite, our numerical tests show that in most cases it is for the market data. Even if this condition fails, we could use the estimate of $\mathrm{cov}(S_t, S_{t-1}) + \mathrm{cov}(S_t, S_{t-1})^T + \rho\,\mathrm{var}(S_{t-1})$ as the matrix in the quadratic form of the numerator, where $\rho$ is a tuning parameter. By choosing an appropriate $\rho$, the matrix will be positive definite, and the resulting problem is equivalent to the optimization problem with $\rho = 0$.

### 4. Sparse optimization problem and existing methods

In the previous section, we discussed two proxies of the mean reversion coefficient. Note that both of them can be written as

$$\frac{x^T A x}{x^T B x}$$

where $A$ and $B$ are $n \times n$ positive definite matrices, and we will make this assumption for the rest of the paper. The estimation procedures for $A$ and $B$ for the two proxies are discussed in Appendix A. If we do not penalize the cardinality of $x$, then minimizing these proxies is the same as a generalized eigenvalue problem. In order to obtain a sparse solution, d'Aspremont [5] proposed the following sparse optimization problem:

$$\min_x \ \frac{x^T A x}{x^T B x} \quad \text{s.t.} \quad \|x\|_0 \le k, \ \|x\|_2 = 1, \tag{4.1}$$

where the $\ell_0$ norm of a vector $x$ is the number of nonzero entries of $x$. This problem has been proved to be NP-hard [19]. When the dimension of the problem is large, we cannot expect to find the optimal solution. Several methods of solving (4.1) have been proposed in [5] and [9]. We give a brief summary here:

1. Exhaustive search method: it tests all $\frac{n!}{k!(n-k)!}$ possible combinations of assets and finds the smallest generalized eigenvalue and the corresponding eigenvectors. This method gives an optimal solution and works very well when $n$ is small. However, it will be extremely slow when $n$ is large.
2. Greedy search: denote by $I_k$ the support of the solution and set $I_k = \emptyset$ initially. Each time, we pick the asset from $\{1, 2, \dots, n\} \setminus I_k$ that yields the smallest objective among all choices. Then we add it to $I_k$ and repeat this procedure $k$ times. This method gives a sub-optimal solution.
3. Truncation method: first we solve the unconstrained problem and find $x_{opt}$. Then we find the index set $J_k$ of the largest $k$ components of $|x_{opt}| = (|x_1|, \dots, |x_n|)^T$. Next we solve the generalized eigenvalue problem on the set $J_k$ by taking the corresponding parts of the matrices $A$ and $B$. It gives a sub-optimal solution and is the fastest among all the listed methods, since it only requires solving the generalized eigenvalue problem twice.
4. Semidefinite relaxation method: this method reformulates the problem (4.1) as a semidefinite program. It provides sub-optimal solutions. The computational complexity is lower than that of the exhaustive search method but higher than that of the greedy search. For more details, we refer the readers to [5].

In the next two sections, we will present two new approaches for approximating the optimal solution to the problem (4.1).

### 5. Optimization problems based on the ratio of $\ell_1$ and $\ell_2$ norms

#### 5.1. Motivation

One popular method for handling a cardinality constraint is to use a norm penalization. Standard choices include $\ell_1$ and $\ell_p$ ($0 < p < 1$). In our case, we prefer using $\|x\|_1 / \|x\|_2$ since our problem (1.1) is scale invariant. A scale constraint is required for the $\|x\|_1$ and $\|x\|_p$ penalties, otherwise they cannot enforce sparsity: we could simply decrease the penalty by decreasing the scale of all the elements of $x$. For an analysis of the sparsity promoting properties of the ratio of $\ell_1$ and $\ell_2$ norms, we refer to [24]. In the following sections, we will present two formulations of the sparse mean reverting problem (4.1), discuss the algorithms to solve them and show two theorems on the ratio of $\ell_1$ and $\ell_2$ norms in enforcing sparsity.

#### 5.2. Formulation 1

We could reformulate the problem (4.1) in the following way:

$$\min_x \ \frac{x^T A x}{x^T B x} \quad \text{s.t.} \quad \frac{\|x\|_1}{\|x\|_2} \le m, \ x \neq 0. \tag{5.1}$$

Using a method similar to [5], we found that this problem can also be expressed as a semidefinite programming problem. By setting $X = xx^T$, the problem (5.1) is equivalent to

$$\min_X \ \frac{\mathrm{trace}(AX)}{\mathrm{trace}(BX)} \quad \text{s.t.} \quad \frac{\mathbf{1}^T |X| \mathbf{1}}{\mathrm{trace}(X)} \le m^2, \ \mathrm{rank}(X) = 1, \ X \succeq 0$$

where $|X|$ means we take the absolute value of each entry of $X$.
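Then after the change of variables

$$Y = \frac{X}{\mathrm{trace}(BX)}, \qquad z = \frac{1}{\mathrm{trace}(BX)}$$

and dropping the rank constraint, the previous problem can be written as a semidefinite programming problem:

$$\min_Y \ \mathrm{trace}(AY) \quad \text{s.t.} \quad \mathbf{1}^T |Y| \mathbf{1} \le m^2 z, \ \ \mathrm{trace}(Y) - z = 0, \ \ \mathrm{trace}(BY) = 1, \ \ Y \succeq 0 \tag{5.2}$$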

If we set $\mathrm{Card}(x) = k = m^2$, this is exactly the semidefinite relaxation in [5].

#### 5.3. Algorithm to solve (5.2)

Since we have relaxed the problem to a semidefinite program, we could use well-established packages like SeDuMi and YALMIP to solve it. The only difficulty is handling the constraint:

$$\mathbf{1}^T |Y| \mathbf{1} \le m^2 z$$

It corresponds to $2^{n^2}$ linear constraints. When the dimension of the problem is small, we could write out a set of linear constraints that covers all the possible sign patterns. This is impossible to implement when $n$ is large. A classical method to tackle this issue is to start with a smaller set of linear constraints and enlarge the set if the current solution fails the original constraint. Tibshirani proposed this method for solving LASSO type problems [23]. For illustration, we consider a simple example: $\|x\|_1 \le t$ where $x = (x_1, x_2)^T$. It is equivalent to the following 4 linear inequalities:

$$x_1 + x_2 \le t, \qquad x_1 - x_2 \le t, \qquad -x_1 + x_2 \le t, \qquad -x_1 - x_2 \le t$$

If we want to minimize some function $f(x)$ subject to $\|x\|_1 \le t$, we start with the problem of minimizing $f(x)$ subject to $x_1 + x_2 \le t$. We then get a minimizer $x^*$. If $\|x^*\|_1 \le t$, then it is also the minimizer under the constraint $\|x\|_1 \le t$. If not, then we add a new constraint: $\mathrm{sign}(x_1^*)\,x_1 + \mathrm{sign}(x_2^*)\,x_2 \le t$. Now we solve the optimization problem with two linear constraints. We repeat this process until the solution satisfies $\|x^*\|_1 \le t$. For our problem, we propose the following algorithm:

Algorithm 1. Solve (5.2)

- Input the parameters $A$, $B$ and $m$.
- Set an initial $X^0$ which corresponds to the solution without sparsity constraint; set the constraint set to $\{\text{Null}\}$.
- Calculate $Y^0 = X^0 / \mathrm{trace}(BX^0)$ and $z^0 = 1 / \mathrm{trace}(BX^0)$.
- While $\mathbf{1}^T |Y^k| \mathbf{1} > m^2 z^k$:
  - add $\sum_{i,j} \mathrm{sign}(Y^k_{ij})\, Y^{k+1}_{ij} \le m^2 z^{k+1}$ to the constraint set;
  - solve the problem (5.2) with the first $2^{n^2}$ constraints reduced to the current constraint set;
  - set $k = k + 1$.

Finally, we need to recover $x$ from $X$. We can express $X$ as:

$$X = \sum_{i=1}^{n} \lambda_i q_i q_i^T$$

where $\lambda_i$ are the eigenvalues with corresponding eigenvectors $q_i$, and $\lambda_i$ is a nonincreasing sequence. Therefore, we pick the first eigenvector $q_1$ and select the $k = m^2$ entries with the largest absolute values to recover $x$.
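The idea of growing the set of sign constraints can be sketched on the two-dimensional example $\|x\|_1 \le t$. In this illustration the inner optimization is replaced by a brute-force search over a grid, which is an assumption made to keep the example self-contained; it is not the semidefinite solver used for (5.2). The objective, the grid and all names are hypothetical.

```python
def l1_norm(x):
    return sum(abs(v) for v in x)

def dot(s, x):
    return sum(si * xi for si, xi in zip(s, x))

def sign(v):
    return 1 if v >= 0 else -1

def constraint_generation(f, grid, t, tol=1e-9):
    """Minimise f over grid points subject to ||x||_1 <= t, adding sign cuts lazily."""
    cuts = [(1,) * len(grid[0])]              # start with the single cut x_1 + ... + x_n <= t
    while True:
        feasible = [x for x in grid if all(dot(s, x) <= t + tol for s in cuts)]
        x_star = min(feasible, key=f)         # brute force stands in for a real solver
        if l1_norm(x_star) <= t + tol:
            return x_star, cuts               # x_star also satisfies the full l1 constraint
        # add the violated cut sign(x*) . x <= t and solve again
        cuts.append(tuple(sign(v) for v in x_star))

def f(x):
    # hypothetical objective: squared distance to the point (0.9, -0.6)
    return (x[0] - 0.9) ** 2 + (x[1] + 0.6) ** 2

grid = [(i / 20.0, j / 20.0) for i in range(-40, 41) for j in range(-40, 41)]
x_star, cuts = constraint_generation(f, grid, t=1.0)
```

In this toy run only two of the four sign constraints are ever generated, which is the point of the approach: the full list of sign patterns is never written out.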

#### 5.4. Formulation 2

We modify the problem (4.1) in the following way:

$$\min_{x \neq 0} \ \frac{x^T A x}{x^T B x} + \gamma \frac{\|x\|_1}{\|x\|_2} \tag{5.3}$$

where $\gamma$ is a tuning parameter. Compared with the problem (4.1), the advantage of the problem (5.3) is that it does not specify the cardinality beforehand. This is convenient from a practical point of view. In addition, we could consider the $\ell_1$ norm in the numerator as a way to quantify uniform transaction costs.

#### 5.5. Algorithm to solve (5.3)

The major difficulty comes from the nonconvexity of the problem. Most optimization algorithms will only provide a local minimizer. In our problem, the performance of local minimizers may vary a lot. Therefore, it is important to develop an algorithm to find the global minimizer. The intermittent diffusion (ID) algorithm proposed in [3] can be helpful. The main idea of this algorithm is to add intermittent, instead of continuously diminishing, random perturbations to the gradient flow generated by the objective, so that the trajectories can quickly escape from the trap of one minimizer and then approach others.

Algorithm 2. Solve (5.3)

- Input $A$, $B$ and $\gamma$. Set $\alpha$ as the scale for diffusion strength, $\kappa$ as the scale for diffusion time and $N$ as the total number of realizations.
- Set the initial state $x^0$ as the minimizer of $\frac{x^T A x}{x^T B x}$ with the constraint $\|x^0\|_2 = 1$, i.e. the generalized eigenvector associated with the smallest eigenvalue.
- Find a local minimizer $\hat x^0$ of problem (5.3) given $x^0$ and set $X_{opt} = \hat x^0$.
- For $i = 1$ to $N$:
  - generate two positive random numbers $d$, $s$ within $[0, 1]$ from the uniform distribution and let $\sigma := \alpha d$ and $T := \kappa s$;
  - solve the stochastic equation $dx(t, \omega) = -\nabla F(x(t, \omega))\,dt + \sigma\,dW(t)$, $x(0, \omega) = X_{opt}$ for $t \in [0, T]$, where $\nabla F$ is the gradient of the objective of (5.3), and record the final state $x_T := x(T, \omega)$;
  - find a local minimizer $\hat x^i$ of problem (5.3) by a line search algorithm with starting point $x_T$;
  - set $X_{opt} = \hat x^i$ if $f(\hat x^i) < f(X_{opt})$.

#### 5.6. Analysis of the ratio of $\ell_1$ and $\ell_2$ penalty

In this section, we prove some useful properties of the solutions to the problem (5.3). Theorem 5.1 indicates that an almost 1-sparse solution can always be recovered if the value of $\gamma$ is large enough. Theorem 5.2 shows that the ratio of $\ell_1$ and $\ell_2$ norms of the solution to the problem (5.3) does not increase as we increase the parameter $\gamma$. We define $f(x, \gamma)$, $L(x)$ and $P(x)$ as:

$$f(x, \gamma) = \frac{x^T A x}{x^T B x} + \gamma \frac{\|x\|_1}{\|x\|_2}, \qquad L(x) = \frac{x^T A x}{x^T B x}, \qquad P(x) = \frac{\|x\|_1}{\|x\|_2}$$

We denote the set of minimizers of the problem (5.3) by $x(\gamma)$ and we set their $\ell_2$ norm to 1. When $\gamma$ is fixed, the existence of the minima of $f(x, \gamma)$ follows from its continuity on the unit sphere and the compactness of the unit sphere. Since $x^T A x / x^T B x$ is scale invariant, any vector in the same direction yields the same value. If there are different directions that yield the same $f(x, \gamma)$, then we always prefer those with the smaller ratio of $\ell_1$ and $\ell_2$ norms. Therefore, the function $P(\cdot)$ takes a unique value on the set of optimizers $x(\gamma)$ (and so does $L(\cdot)$). Mathematically, it can be defined in the following way:

$$X(\gamma) = \left\{ x \in \mathbb{R}^n : \|x\|_2 = 1, \ f(x, \gamma) = \min_{z \neq 0,\ \|z\|_2 = 1} \left\{ \frac{z^T A z}{z^T B z} + \gamma \frac{\|z\|_1}{\|z\|_2} \right\} \right\}$$

$$x(\gamma) = \left\{ x \in X(\gamma) : P(x) = \min\{ P(y) : y \in X(\gamma) \} \right\}$$

Under this definition, $P(x(\gamma))$, $L(x(\gamma))$ and $f(x(\gamma), \gamma)$ are well defined, since $P(\cdot)$, $L(\cdot)$ and $f(\cdot, \gamma)$ are constant on $x(\gamma)$. In the following proofs, we will use the notations:

$$B(x, \delta) \equiv \{ x' : \|x' - x\|_2 < \delta \}$$

$$S_k \equiv \{ x \in \mathbb{R}^n : \|x\|_2 = 1, \ \|x\|_0 \le k \}$$

$$S^n \equiv \{ x \in \mathbb{R}^n : \|x\|_2 = 1 \}$$

$$d(U, V) = \min\{ \|x - y\|_2 : x \in U, \ y \in V \}, \quad U \text{ and } V \text{ compact sets in } \mathbb{R}^n$$

Theorem 5.1. Given that $A = \{a_{ij}\}$ and $B = \{b_{ij}\}$ are both $n \times n$ positive definite matrices, for any $\epsilon > 0$ there exists a number $\gamma(\epsilon)$ such that for any $\gamma \ge \gamma(\epsilon)$, $\|x - e\|_2 < \epsilon$, where $x(\gamma)$ is the set of optimal solutions of problem (5.3) given $\gamma$, $x \in x(\gamma)$, $e$ is a 1-sparse vector and $\|e\|_2 = 1$.

Proof. First, note that the problem (5.3) is scale invariant; therefore, we can constrain the problem to the sphere $\|x\|_2 = 1$. All the vectors in our proof have $\ell_2$ norm 1. Note that the 1-sparse minimizers of $L(x)$ on the sphere $\|x\|_2 = 1$ are the vectors $\pm e_i$ with all 0s except for a 1 in the $i$th coordinate. The index $i$ is determined by

$$i = \arg\min_i \left\{ \frac{a_{ii}}{b_{ii}} \right\}$$

Without loss of generality, we will assume that $i = 1$ and that this corresponds to a unique minimizer, i.e.

$$\frac{a_{11}}{b_{11}} < \frac{a_{ii}}{b_{ii}}, \quad \text{for all } i \neq 1$$

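Note that both $f(x, \gamma)$ and $L(x)$ are even functions of $x$. We can further restrict our region to $S_+ \equiv \{ x = (x_1, \dots, x_n)^T : x_1 \ge 0 \text{ and } \|x\|_2 = 1 \}$. Since $L(x)$ is continuous on $S_+$, there exist $\delta_i$ such that for any point $x$ in the region $\{ x : \|x \pm e_i\|_2 < \delta_i \} \cap S_+$, $L(x) \ge L(e_1) = \frac{a_{11}}{b_{11}}$, for $i = 2, 3, \dots, n$. Denote $D_i = \{ x : \|x \pm e_i\|_2 < \delta_i \}$ for $i = 2, 3, \dots, n$. Then for all points $x \in S_+ \cap (\cup_{i=2}^{n} D_i)$, $L(x) \ge L(e_1)$ and $\|x\|_1 / \|x\|_2 \ge 1$. Therefore, they cannot beat $e_1$ for any $\gamma$. For any $\epsilon > 0$, let $D_1 = \{ x : \|x - e_1\|_2 < \epsilon \}$; then $S_+(\epsilon) \equiv S_+ \setminus (\cup_{i=1}^{n} D_i)$ is a closed and bounded set. Therefore, $\gamma(\epsilon)$ must satisfy the following inequality:

$$f(e_1, \gamma(\epsilon)) \le f(x, \gamma(\epsilon)), \quad \forall x \in S_+(\epsilon).$$

This is equivalent to

$$\frac{a_{11}}{b_{11}} + \gamma(\epsilon) \le L(x) + \gamma(\epsilon) \frac{\|x\|_1}{\|x\|_2}, \quad \text{for any } x \in S_+(\epsilon).$$

And it implies:

$$\gamma(\epsilon) \ge \left( \frac{a_{11}}{b_{11}} - L(x) \right) \Big/ \left( \frac{\|x\|_1}{\|x\|_2} - 1 \right), \quad \text{for any } x \in S_+(\epsilon).$$

The last line holds since, for any $x \in S_+(\epsilon)$, $\|x\|_1 / \|x\|_2$ is guaranteed to be greater than 1. Note that the function $\left( \frac{a_{11}}{b_{11}} - L(x) \right) / \left( \frac{\|x\|_1}{\|x\|_2} - 1 \right)$ is well defined and continuous on $S_+(\epsilon)$. Therefore, we can set $\gamma(\epsilon)$ as:

$$\gamma(\epsilon) = \max_{x \in S_+(\epsilon)} \left( \frac{a_{11}}{b_{11}} - L(x) \right) \Big/ \left( \frac{\|x\|_1}{\|x\|_2} - 1 \right)$$

Theorem 5.2. Suppose $x(\gamma)$ is the set of minimizers of the problem (5.3). Then $P(x(\gamma))$ is nonincreasing in $\gamma$.

Proof. Take arbitrary $\gamma_1$ and $\gamma_2$ with $\gamma_1 < \gamma_2$. We want to show that $P(x(\gamma_1)) \ge P(x(\gamma_2))$. Since $x(\gamma_1)$ and $x(\gamma_2)$ are minimizers, we have

$$L(x(\gamma_1)) + \gamma_1 P(x(\gamma_1)) \le L(x(\gamma_2)) + \gamma_1 P(x(\gamma_2)) \tag{5.4a}$$

$$L(x(\gamma_2)) + \gamma_2 P(x(\gamma_2)) \le L(x(\gamma_1)) + \gamma_2 P(x(\gamma_1)) \tag{5.4b}$$

(5.4a) and (5.4b) imply

$$L(x(\gamma_1)) - L(x(\gamma_2)) \le \gamma_1 \left( P(x(\gamma_2)) - P(x(\gamma_1)) \right)$$

$$L(x(\gamma_1)) - L(x(\gamma_2)) \ge \gamma_2 \left( P(x(\gamma_2)) - P(x(\gamma_1)) \right)$$

Then we have:

$$\gamma_2 \left( P(x(\gamma_2)) - P(x(\gamma_1)) \right) \le L(x(\gamma_1)) - L(x(\gamma_2)) \le \gamma_1 \left( P(x(\gamma_2)) - P(x(\gamma_1)) \right)$$

This implies

$$(\gamma_2 - \gamma_1)\left( P(x(\gamma_2)) - P(x(\gamma_1)) \right) \le 0$$

Since $\gamma_1 < \gamma_2$ by assumption, $P(x(\gamma_1)) \ge P(x(\gamma_2))$.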
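The monotonicity stated in Theorem 5.2 can be checked numerically on a small example. The sketch below is only an illustration under stated assumptions: the $2 \times 2$ positive definite matrices are arbitrary choices, and the minimizer of $f(x, \gamma)$ is found by brute force over a grid on the upper half of the unit circle (sufficient because $f$ is even), with ties broken in favour of the smaller ratio $P$.

```python
import math

# Hypothetical 2 x 2 positive definite matrices chosen for illustration.
A = ((2.0, 0.3), (0.3, 1.0))
B = ((1.0, 0.2), (0.2, 1.5))

def quad(M, x):
    return sum(M[i][j] * x[i] * x[j] for i in range(2) for j in range(2))

def L(x):
    return quad(A, x) / quad(B, x)

def P(x):
    return (abs(x[0]) + abs(x[1])) / math.hypot(x[0], x[1])

N_GRID = 3600
POINTS = [(math.cos(math.pi * k / N_GRID), math.sin(math.pi * k / N_GRID))
          for k in range(N_GRID)]

def minimizer(gamma):
    """Brute-force minimiser of L + gamma * P on the grid; ties prefer smaller P."""
    return min(POINTS, key=lambda x: (L(x) + gamma * P(x), P(x)))

gammas = [0.0, 0.1, 0.2, 0.5, 1.0, 5.0]
ratios = [P(minimizer(g)) for g in gammas]
```

In this example $a_{22}/b_{22} < a_{11}/b_{11}$, so for the largest $\gamma$ the grid minimizer sits at $e_2$ and the ratio reaches its lower bound of 1, in line with Theorem 5.1.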

### 6. Optimization problems based on the $\ell_1$ norm and prior knowledge

#### 6.1. Motivation

In the previous section, we applied a stochastic algorithm to search for a global optimizer of the problem (5.3). The major drawback of the algorithm is that we cannot guarantee a global optimizer. Based on the theory of the intermittent diffusion (ID) algorithm, we could increase the probability of finding the global optimizer by increasing the number of realizations $N$. However, choosing $N$ is problem dependent, and it will be a difficult task when the dimension of the problem is large. We have to balance the efficiency and the performance of the solution. Therefore, an efficient global optimization algorithm is desired. In addition, we may prefer a simpler norm to enforce sparsity, because it will lead to simpler algorithms. We would also like to keep the number of tuning parameters as low as possible.

#### 6.2. Formulation

Note that the problem (4.1) is equivalent to:

$$\max_x \ \frac{x^T B x}{x^T A x} \quad \text{s.t.} \quad \|x\|_0 \le k, \ \|x\|_2 = 1$$

Now, we use the $\ell_1$ norm to enforce the sparsity and consider the following problem:

$$f(r) = \max_{x(i) = 1,\ \|x\|_1 \le m} \ x^T B x - r\, x^T A x \tag{6.1}$$

where $i$ is a predetermined fixed index. We could choose $i$ based on our prior knowledge or our investment need. For example, we could choose $i$ by selecting the asset which has the largest entry in absolute value in the solution of the unconstrained problem. The constraint $x(i) = 1$ enables using the $\ell_1$ norm to enforce the sparsity and also simplifies the problem to a quadratic program.

#### 6.3. Theory of recovering the global optimizer

We would like to show the following.

Theorem 6.1. Let

$$f(r) = \max_{x(i) = 1,\ \|x\|_1 \le m} \ x^T B x - r\, x^T A x$$

where $A$ and $B$ are both positive definite matrices. Then:

- a. For any given $R > 0$, (6.1) is continuous on $0 \le r \le R$;
- b. $f(r)$ is a nonincreasing function of $r$;
- c. There exists $r^*$ such that $f(r^*) = 0$;
- d. Suppose the optimizer at $r^*$ is $x^*$; then $x^*$ is the optimizer of the following problem:

$$\max_{x \neq 0} \ \frac{x^T B x}{x^T A x} \quad \text{s.t.} \quad x(i) = 1, \ \|x\|_1 \le m \tag{6.2}$$

Proof. a. For any given $R > 0$, since $f(x, r) = x^T B x - r\, x^T A x$ is continuous on the closed and bounded set $\{ (x, r) : x(i) = 1, \ \|x\|_1 \le m, \ 0 \le r \le R \}$, all the optimizers are attainable. Due to the uniform continuity, $f(r)$ is continuous.

b. Suppose $r_1 > r_2$. Then for any feasible $x$, we must have

$$(x^T B x - r_1 x^T A x) - (x^T B x - r_2 x^T A x) = (r_2 - r_1)\, x^T A x < 0$$

since $A$ is positive definite and $x \neq 0$. Therefore, $f$ is nonincreasing in $r$.

c. Notice that $f(0) > 0$. Since $A$ is positive definite, there must exist an $R > 0$ such that $B - RA$ is negative definite. Therefore, $f(R) < 0$. Due to the continuity of $f(r)$, there exists $r^*$ such that $f(r^*) = 0$.

d. Suppose there exists a feasible $x_1$ such that

$$r_1 := \frac{x_1^T B x_1}{x_1^T A x_1} > r^*$$

Then we must have:

$$0 = x_1^T B x_1 - r_1 x_1^T A x_1 = x_1^T B x_1 - r^* x_1^T A x_1 - (r_1 - r^*)\, x_1^T A x_1 \le x^{*T} B x^* - r^* x^{*T} A x^* - (r_1 - r^*)\, x_1^T A x_1 = -(r_1 - r^*)\, x_1^T A x_1 < 0$$

Therefore, the result is shown by contradiction.

This theorem tells us that if we can find a root of $f(r)$ and the corresponding maximizer, then we have found the global maximizer of problem (6.2). The conversion of the ratio minimization problem to a sequence of difference minimization problems has been proposed for solving the trace ratio optimization problem arising in machine learning and high dimensional data analysis [20]. Here in Theorem 6.1, we considered the additional $\ell_1$ constraint.

#### 6.4. Algorithm for finding $r^*$

We used a binary search algorithm.

Algorithm 3. Find $r^*$

- Input the matrices $B$ and $A$, a threshold $\epsilon > 0$ for stopping, initial $x_0$, $r_{min}$, $r_{max}$, $i$ and $m$.
- Set $r_0 = \frac{1}{2}(r_{min} + r_{max})$.
- While $|x_0^T B x_0 - r_0 x_0^T A x_0| > \epsilon$:
  - solve the quadratic program (6.1) with $r_0$ and obtain the maximizer $x_0$;
  - if $x_0^T B x_0 - r_0 x_0^T A x_0 > \epsilon$ then set $r_{min} = r_0$, else set $r_{max} = r_0$;
  - set $r_0 = \frac{1}{2}(r_{min} + r_{max})$.

#### 6.5. Algorithm for solving (6.1)

The difficult part is solving the quadratic program (6.1). It is a nonconvex optimization problem. Therefore, classical algorithms cannot guarantee a global optimizer. In addition, the computational cost is high when the dimension of the problem is large ($\ge 100$).
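The bisection on $r$ described in Algorithm 3 can be sketched on a two-asset toy problem. The sketch makes several assumptions for illustration only: the matrices are arbitrary positive definite examples, the fixed index is the first coordinate, and the quadratic program (6.1) is replaced by a brute-force search over a grid of feasible points $x = (1, u)$ with $|u| \le m - 1$.

```python
# Hypothetical 2 x 2 positive definite matrices chosen for illustration.
A = ((2.0, 0.3), (0.3, 1.0))
B = ((1.0, 0.2), (0.2, 1.5))

def quad(M, x):
    return sum(M[i][j] * x[i] * x[j] for i in range(2) for j in range(2))

def ratio(x):
    return quad(B, x) / quad(A, x)

def feasible_points(m, n_grid=2000):
    """Grid of x = (1, u) with x(1) = 1 fixed and |u| <= m - 1, so that ||x||_1 <= m."""
    half = m - 1.0
    return [(1.0, -half + 2.0 * half * k / n_grid) for k in range(n_grid + 1)]

def f(r, pts):
    """f(r) = max_x x^T B x - r x^T A x over the grid, with a maximiser."""
    x = max(pts, key=lambda p: quad(B, p) - r * quad(A, p))
    return quad(B, x) - r * quad(A, x), x

def find_root(pts, r_min, r_max, eps=1e-10, max_iter=200):
    """Bisection on r; requires f(r_min) > 0 > f(r_max)."""
    for _ in range(max_iter):
        r0 = 0.5 * (r_min + r_max)
        val, x = f(r0, pts)
        if abs(val) <= eps:
            break
        if val > 0:
            r_min = r0      # root lies to the right, since f is nonincreasing
        else:
            r_max = r0
    return r0, x

pts = feasible_points(m=2.0)
r_star, x_star = find_root(pts, r_min=0.0, r_max=10.0)
best_ratio = max(ratio(p) for p in pts)
```

On the grid, the root found by bisection agrees with the largest value of $x^T B x / x^T A x$ over the same feasible points, which is the content of part d of Theorem 6.1.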

One way to tackle this problem is to treat it as a difference of convex functions and use the DC (difference of convex functions) algorithm. DC programming is extensively studied in [10], [16], [21], [22]. The method aims to minimize a function $g - h$ where both $g$ and $h$ are convex functions on the whole space. In most of the literature on DC programming, an algorithm to find a local optimum is used. In practice, the local algorithm often yields a global minimum. In order to apply the DC algorithm, we first reformulate the problem (6.1) in the following way by plugging in the constraint $x(i) = 1$:

$$\min_x \ r\, x^T V x + c^T x - x^T U x \quad \text{s.t.} \quad \|x\|_1 \le m$$

where $U$ and $V$ are both positive definite matrices and $x \in \mathbb{R}^{n-1}$. In addition, by a standard method in the DC algorithm, we can change it to an unconstrained problem:

$$\min_x \ r\, x^T V x + c^T x - x^T U x + \chi_{\|x\|_1 \le m}(x) \tag{6.3}$$

where

$$\chi_{\|x\|_1 \le m}(x) = \begin{cases} \infty, & \|x\|_1 > m \\ 0, & \|x\|_1 \le m \end{cases}$$

Let

$$g(x) = r\, x^T V x + c^T x + \chi_{\|x\|_1 \le m}(x), \qquad h(x) = x^T U x$$

then the objective is the difference of $g$ and $h$. We can now use the following DC algorithm:

Algorithm 4. Solve (6.3) for a given $r$

- Choose $x_0$ in $\mathbb{R}^{n-1}$.
- Repeat until convergence:
  - set $y_k = 2Ux_k$;
  - find the optimizer $x_{k+1}$ of the convex program

$$\inf\{ r\, x^T V x + c^T x - x^T y_k + \chi_{\|x\|_1 \le m}(x), \ x \in \mathbb{R}^{n-1} \} \tag{6.4}$$

#### 6.6. Algorithms for solving (6.4)

The problem (6.4) can be considered as a quadratic program with an $\ell_1$ constraint:

$$\min \ r\, x^T V x + c^T x - x^T y_k \quad \text{s.t.} \quad \|x\|_1 \le m \tag{6.5}$$

It can also be rewritten as a least squares optimization with an $\ell_1$ constraint. This problem has been well studied in the literature. The major difficulty is handling the $\ell_1$ constraint. We present several methods for overcoming this difficulty in the appendix and test them on the U.S. swaps data. According to our numerical results in section 7.4.1, the least angle regression method is the most efficient choice.
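The DC iteration of Algorithm 4 can be illustrated in the scalar case, where the convex subproblem (6.4) has a closed-form solution. This is a minimal sketch under the assumption $n - 1 = 1$, so that $V$, $U$ and $c$ are scalars (written `v`, `u`, `c`) and the constraint is $|x| \le m$; the parameter values are hypothetical.

```python
def dc_objective(x, r, v, c, u):
    """g(x) - h(x) on the feasible set: r*v*x^2 + c*x - u*x^2."""
    return r * v * x * x + c * x - u * x * x

def dca_scalar(x0, r, v, c, u, m, tol=1e-12, max_iter=100):
    """Scalar DC algorithm: linearise h(x) = u*x^2 at x_k, then solve the convex step."""
    xs = [max(-m, min(m, x0))]
    for _ in range(max_iter):
        y = 2.0 * u * xs[-1]                       # y_k = 2 U x_k
        # argmin over |x| <= m of r*v*x^2 + (c - y)*x: clipped stationary point
        x_next = max(-m, min(m, (y - c) / (2.0 * r * v)))
        xs.append(x_next)
        if abs(xs[-1] - xs[-2]) <= tol:
            break
    return xs

path = dca_scalar(x0=0.1, r=1.0, v=1.0, c=0.5, u=2.0, m=1.0)
values = [dc_objective(x, 1.0, 1.0, 0.5, 2.0) for x in path]
```

In this toy run the objective value never increases along the iterates and the path ends at $x = -1$, the global minimizer on $[-1, 1]$. Starting instead from $x_0 = 0.5$ leads to $x = 1$, which is only a local solution, so the outcome of the local DC iteration can depend on the starting point.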

The idea behind using least angle regression is that we first reformulate (6.5) as a LASSO problem [23]. Notice that the objective of a LASSO problem is

$$\|Xx - Y\|_2^2 + \lambda \|x\|_1 = x^T X^T X x - 2 Y^T X x + Y^T Y + \lambda \|x\|_1$$

where $X$ is a matrix of predictor variables, $Y$ is an outcome vector and $\lambda$ is a tuning parameter. Therefore, by solving the following system for $X$ and $Y$, we retrieve a LASSO type problem from (6.5):

$$rV = X^T X, \qquad c - y_k = -2 X^T Y$$

The first equation can be solved by Cholesky decomposition, and then the second equation is easy to solve. Finally, we can apply the least angle regression algorithm (LARS) [6].

### 7. Numerical tests

Our procedures are implemented in Matlab R2011b. We used the optimization package YALMIP. Computations are performed on a Dell desktop with 8 GB RAM and a 3.4 GHz i7 CPU. We have used two historical data sets. One is the U.S. daily swaps data for maturities 1Y, 2Y, 3Y, 4Y, 5Y, 7Y, 10Y and 30Y from July 3, 2000 until July 15, [...]. The data are obtained from [...]. The total number of data is [...]. The data are in percent with two digits after the decimal point. The other one is the daily closing prices of S&P 500 companies. The data are collected from Yahoo finance. In order to obtain a large sample set, we only select those companies that have been in the list since July [...]. After this preselection procedure, we have 458 companies left. We used the stock prices of the first 100 companies according to the alphabetical order of the ticker symbols. The data size in our numerical test is [...].

In the following sections, we performed several tests:

1. In-sample and out-of-sample tests for the performance of the portfolios constructed under two mean reverting proxies, predictability and the direct OU estimator, on synthetic data. Details are in section 7.1;
2. In-sample and out-of-sample tests for the ratio of $\ell_1$ and $\ell_2$ norms approach on the U.S. swaps data. Details are in section 7.3;
3. In-sample and out-of-sample tests for the $\ell_1$ norm with prior knowledge approach on the U.S. swaps data. Details are in section [...] and 7.4.3;
4. In-sample tests for the $\ell_1$ norm with prior knowledge approach on the historical daily closing prices of 100 stocks. Details are in section 7.4.2;
5. In-sample and out-of-sample tests for the $\ell_1$ norm with prior knowledge approach on synthetic data with 100 assets. Details are in section 7.4.4.

#### 7.1. Comparison of two mean reverting proxies

We first wanted to compare the performance of portfolio selection via the two proxies that we discussed in section 3. We set the matrix $\beta$ and the noise covariance matrix $\Sigma$ in the VAR(1) model (3.1) to our estimates based on the U.S. swaps data. We generated a data set of size $350 \times 8$, so there are 350 observations for each asset and there are 8 assets in total. We used the first [...] samples as the training set and the rest as the test set. Next we estimated the parameters and solved for the optimal solutions by the exhaustive search method based on the training set. We repeated our simulation 1000 times and then compared the average of the estimated mean reversion coefficients of our sparse portfolios on both the training set and the test set.

16 16 X. LONG, K. SOLNA AND J. XIN Fig Trading Opportunities (Not all are marked in the plot.) After presetting proper β and Σ, we generate a data set of 10 assets for 400 observations based on the VAR(1) model (3.1). We use the last 50, 100, 200 and 400 observations as the training set and estimate A s and B s in (4.1) respectively. Next, we construct the portfolios under different cardinality constraints by the exhaustive search method by using those A s and B s. We then generate the test data sets. Each has 400 observations of 10 assets. They follow the same VAR(1) model. The starting values are the final observations of the training data. In total, we have 100 test data sets and we compare the average out-of-sample mean reversion coefficients of the portfolios constructed based on the training set. If we use the last 50 observations of the training set for estimation, then we will only test those portfolios on the first 50 observations of each test set, and so forth. Figure 7.3 shows the results. It illustrates impact of the size of training set on out-of-sample performance. We compare the in-sample mean reversion coefficients with the average out-of-sample mean reversion coefficients under different cardinality constraints. The portfolios are constructed by solving the problem (4.1) by the exhaustive search method. The top left uses 50 observations of the training set and 50 observations of the test set; the top right uses 100 observations of the training set and 100 observations of the test set; the bottom left uses 200 observations of the training set and 200 observations of the test set; the bottom right uses 400 observations of the training set and 400 observations of the test set. Notice that when the size of the training set increases, the out-of-sample performance becomes closer to the in-sample performance Tests of the ratio of l 1 and l 2 norms approach. We performed insample and out-of-sample tests on the U.S. 
swaps data and tried to solve the following problem in section 5:

$$\min_{x \neq 0} \; \frac{x^T A x}{x^T B x} + \gamma \frac{\|x\|_1}{\|x\|_2}$$

by using the intermittent diffusion algorithm (ID algorithm) [3]. After each iteration of the line search algorithm [2], we fix the $\ell_2$ norm to 1. In our tests of the ID algorithm, we let $\alpha = 20$, $\kappa = 20$ and $N = 20$. In addition, to satisfy the conditions of the ID algorithm so that a global minimizer exists in a bounded set, we also add a penalty function $p(x, \theta, \xi, \zeta)$ to $f(x)$:

$$p(x, \theta, \xi, \zeta) = \sum_i u(x_i, \theta, \xi, \zeta),$$

where

$$u(x_i, \theta, \xi, \zeta) := \begin{cases} \xi (x_i - \theta)^{\zeta}, & x_i > \theta, \\ 0, & |x_i| \le \theta, \\ \xi (-\theta - x_i)^{\zeta}, & x_i < -\theta. \end{cases}$$

We set $\theta = 10$, $\xi = 2$ and $\zeta = 100$ in our test.

For in-sample tests, we used the entire U.S. swaps data set to estimate the parameters $A$ and $B$ and calculated the solutions for different $\gamma$. We found the minimizers of problem (5.3) by Algorithm 2 in section 5 for values of $\gamma$ between 0 and 1.5 with a fixed step size. Then we calculated the estimated mean reversion coefficients and counted the trading opportunities over the whole period for all the minimizers.

Fig. Comparison of the average performance of sparse mean reverting portfolios solved under different proxies using the exhaustive search method on synthetic data from the VAR(1) model (3.1). Top left: average in-sample mean reversion coefficients versus cardinality. Top right: average out-of-sample mean reversion coefficients versus cardinality. Bottom: average out-of-sample trading opportunities versus cardinality.
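To make the penalized objective concrete, here is a minimal sketch of how it could be evaluated. It is not the authors' implementation and does not include the ID algorithm or the line search. The function names are hypothetical, and the penalty is written under the assumption that it vanishes for $|x_i| \le \theta$ and grows like $\xi(|x_i| - \theta)^{\zeta}$ outside that box; the matrices in the usage lines are toy values.

```python
import numpy as np

def ratio_objective(x, A, B, gamma):
    """x'Ax / x'Bx + gamma * ||x||_1 / ||x||_2 (both terms are scale invariant)."""
    x = np.asarray(x, dtype=float)
    predictability = (x @ A @ x) / (x @ B @ x)
    sparsity = np.linalg.norm(x, 1) / np.linalg.norm(x, 2)
    return predictability + gamma * sparsity

def box_penalty(x, theta=10.0, xi=2.0, zeta=100):
    """Assumed form: zero inside [-theta, theta], xi * (|x_i| - theta)**zeta outside."""
    x = np.asarray(x, dtype=float)
    excess = np.maximum(np.abs(x) - theta, 0.0)
    return float(np.sum(xi * excess ** zeta))

def normalize_l2(x):
    """Rescale x to unit l2 norm, as done after each line-search iteration."""
    x = np.asarray(x, dtype=float)
    return x / np.linalg.norm(x, 2)

# Toy positive definite matrices (hypothetical values).
A = np.array([[2.0, 0.5], [0.5, 1.0]])
B = np.eye(2)
x = np.array([1.0, -2.0])
value = ratio_objective(normalize_l2(x), A, B, gamma=0.5) + box_penalty(normalize_l2(x))
```

Because the objective is invariant under rescaling of $x$, fixing the $\ell_2$ norm to 1 does not change its value, and a unit-norm vector always lies inside the box when $\theta = 10$, so the penalty only matters for iterates that have not yet been renormalized.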

Fig. 7.3. The tests of the impact of the size of the training set on out-of-sample performance based on synthetic data. The portfolios are constructed by the exhaustive search method under different cardinality constraints. Top left: 50 observations of the training set and 50 observations of the test set. Top right: 100 observations of the training set and 100 observations of the test set. Bottom left: 200 observations of the training set and 200 observations of the test set. Bottom right: 400 observations of the training set and 400 observations of the test set.

The results are shown in Figure 7.4. It shows the estimated mean reversion coefficients and trading opportunities versus the values of $\gamma$ for this in-sample test on the entire swaps data set. For out-of-sample tests, we used every 100 consecutive observations to estimate the parameters and found the minimizers for values of $\gamma$ between 0 and 1.5 with a fixed step size. This time, we calculated the estimated mean reversion coefficients and counted the trading opportunities for both these 100 days and the next 100 trading days. The results are shown in Figure 7.5. It shows the average estimated mean reversion coefficients and trading opportunities versus the values of $\gamma$ from in-sample and out-of-sample tests.

From the numerical results, we see that as $\gamma$ increases, the mean reversion coefficients and the trading opportunities show a decreasing trend for both in-sample and out-of-sample tests. There are big jumps in Figure 7.4. After checking those portfolios, we found that we only recovered portfolios of cardinality 1, 5, 6, 7 and 8. We failed to recover portfolios of cardinality 2, 3 and 4. The reason could be that the algorithm got trapped in local minima. The curves in Figure 7.5 look smoother, since they show the average performance of portfolios on about 12 data sets. Normally, the in-sample performance is better than the out-of-sample performance. However, we notice that we can still maintain about 60% of the performance out of sample. When $\gamma$ is close to 1.5, the minimizer is a 1-sparse vector. This shows that the ratio of $\ell_1$ and $\ell_2$ norms indeed enforces extreme sparsity in our problem.
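A small numerical check helps explain why a large $\gamma$ pushes the minimizer toward a 1-sparse vector. For any nonzero $x \in \mathbb{R}^n$ the ratio $\|x\|_1 / \|x\|_2$ lies between 1 and $\sqrt{n}$, and a vector with $k$ equal nonzero entries attains $\sqrt{k}$. The helper name below is hypothetical and the snippet is only an illustration of this standard fact, not part of the paper's algorithm.

```python
import numpy as np

def l1_over_l2(x):
    """Ratio of the l1 norm to the l2 norm of a nonzero vector."""
    x = np.asarray(x, dtype=float)
    return np.linalg.norm(x, 1) / np.linalg.norm(x, 2)

n = 8
ratios = []
for k in range(1, n + 1):
    x = np.zeros(n)
    x[:k] = 1.0            # k equal nonzero weights
    ratios.append(l1_over_l2(x))

# ratios[k-1] equals sqrt(k): 1 for a 1-sparse vector, sqrt(8) for the dense one.
```

The penalty term is therefore smallest on 1-sparse vectors, and once $\gamma$ is large enough the gain in the first term from adding assets no longer offsets the growth of the ratio.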

Fig. 7.4. In-sample tests on the whole data set of the U.S. swaps. Estimated mean reversion coefficients/trading opportunities versus the values of $\gamma$ using the intermittent diffusion (ID) algorithm with the ratio of $\ell_1$ and $\ell_2$ norms penalty.

Fig. 7.5. In-sample and out-of-sample tests on the U.S. swaps. Average estimated mean reversion coefficients/trading opportunities versus the values of $\gamma$ using the intermittent diffusion (ID) algorithm with the ratio of $\ell_1$ and $\ell_2$ norms penalty.

Comments. The advantages of the ratio of $\ell_1$ and $\ell_2$ norms approach are:
1. It does not require any prior knowledge of the assets;
2. It does not predetermine the cardinality.

This approach also has its disadvantages:
1. It is difficult to recover portfolios of intermediate cardinality. We could encounter big jumps in the mean reversion coefficients and trading opportunities as we change the tuning parameter $\gamma$;
2. The algorithm is not very efficient. It takes more than 5 minutes to solve a problem of size 100 (100 different assets to choose from). Compared with the methods of the next section, it is relatively slow.

Tests of the methods in section 6.

In-sample tests on the U.S. swaps. In order to test the performance of our methods, we need a criterion to measure relative performance. We first find the global minimizers of the mean reversion coefficients under given cardinality constraints by the exhaustive search method [9]. They serve as the reference global optima. Table 7.1 shows the solutions to problem (4.1) obtained by the exhaustive search method based on the whole U.S. swaps data. The mean reversion coefficients, $\hat{\lambda}$, are estimated on the whole data set.

Table 7.1. Solutions to problem (4.1) obtained by the exhaustive search method on the entire U.S. swaps data, which are treated as benchmarks for different cardinalities.

Notice that the 4Y swap rate has the largest weight in absolute value for portfolios with cardinality 4 to 8. Therefore, setting the weight of the 4Y swap to 1 is a potentially good strategy. The key part of the algorithms in section 7 is how to solve the nonconvex quadratic program 6.1 given $r$ and $i$:

$$\max_{x(i) = 1,\ \|x\|_1 \le m} \; x^T B x - r\, x^T A x$$

We have tested the following four methods; the details of their implementation are presented in Appendix B:

A. Directly use the active set method [18] for the quadratic program and handle the $\ell_1$ constraint by using nonnegative variables;
B. Use the DC algorithm and Tibshirani's method, i.e. Algorithms 4 and 5 (option 2 in Appendix B);
C. Use the DC algorithm and the nonnegative variable method, i.e. Algorithm 4 and option 3 in Appendix B;
D. Use the DC algorithm and the least angle regression (LARS) algorithm, i.e. Algorithm 4 and option 4 in Appendix B.

We have tested different $\ell_1$ levels (different values of $m$) and constructed different sparse mean reverting portfolios. We estimated the mean reversion coefficients of these portfolios. Here we present one table of our portfolios constructed using the different methods. From Table 7.2, we find that the difference of convex functions (DC) algorithm does help improve the results. However, it increases the computational cost considerably. Generally, Tibshirani's method is not as efficient as the nonnegative variable method, which is also stated in [23]. Method D (the LARS algorithm in conjunction

with the DC algorithm) is the fastest method. Note that all the methods make the same choice of assets. Taking computational efficiency into account, we conclude that method D is the best choice.

To measure relative performance, we compare the results with the benchmark portfolios in Table 7.1. Methods A, B, C and D all pick the 1Y, 3Y, 4Y, 5Y and 30Y swaps, a portfolio of cardinality 5. The best portfolio of cardinality 5 uses the 2Y, 3Y, 4Y, 5Y and 7Y swaps. Note that the chosen 3Y, 4Y and 5Y swaps have the leading weights in the optimal portfolio of cardinality 5, so our methods select the main components of the optimal portfolio. We believe the difference between our portfolio and the optimal portfolio of the same cardinality is due to the difference between the $\ell_1$ norm and the cardinality. If we normalize the weights of the optimal portfolio such that its 4Y swap weight is 1, then its $\ell_1$ norm is much larger than our $\ell_1$ constraint.

Table 7.2. Solutions to problem 6.1 of methods A, B, C and D obtained under the $\ell_1$ constraint on the entire U.S. swaps data, with estimated mean reversion coefficients $\lambda$ and computational costs (time in seconds).

In-sample tests on 100 stocks. The efficiency of method D is better demonstrated on high-dimensional data. In the next test, we applied method D to a data set of 100 stocks. These stocks are the first 100 S&P stocks in alphabetical order of ticker symbol from our preselected list of S&P 500 stocks. Therefore, the size of the matrices $A$ and $B$ is $100 \times 100$. For a problem of this size, we are not able to use the exhaustive search method to get the optimal solution for the middle cardinalities. Therefore, in order to set up a criterion, we compare our results with the densest solution. This solution should give us the largest possible mean reversion coefficient based on the data set. For our data set, this largest possible mean reversion coefficient is marked by the horizontal line in the left plot of Figure 7.6. In addition, we set the weight of stock No. 43 (ticker symbol: AMAT) to 1, since it has the largest weight among all stocks in the densest solution.

Figures 7.6, 7.7 and 7.8 give the numerical results. The $\ell_1$ constraints, $m$, range from 3 to 16 with a step size of 1. Figure 7.6 shows the mean reversion coefficients of portfolios built under different values of $m$. The X-axis shows the level of $m$ and the Y-axis shows the mean reversion coefficients. Figure 7.6 also shows the cardinality of those portfolios, and Figure 7.7 shows the trading opportunities. Figure 7.8 displays the weights of the 100 stocks. The average computational cost of method D over all these tests, measured in seconds, is a remarkable improvement over the other computational methods to date in

terms of efficiency and speed.

Fig. 7.6. Mean reversion coefficients/cardinality versus different $\ell_1$ constraints using the method of least angle regression and the difference of convex functions algorithm on 100 stocks. The horizontal line in the left plot shows the largest possible mean reversion coefficient.

Fig. 7.7. Trading opportunities versus different $\ell_1$ constraints using the method of least angle regression and DCA on 100 stocks.

Out-of-sample tests on the U.S. swaps. We also performed out-of-sample tests, again on the U.S. swaps data. Each time, we used 100 days of data as the training data and built a sparse mean reverting portfolio. Then we estimated the mean reversion coefficients of the portfolio on these 100 days, the next 50 days and the next 100 days.

Fig. In-sample and out-of-sample tests on the U.S. swaps. Average mean reversion coefficients versus different $\ell_1$ constraints using the method of least angle regression and DCA.

Fig. In-sample and out-of-sample tests on the U.S. swaps. Average trading opportunities versus different $\ell_1$ constraints using the method of least angle regression and DCA.

equal weights on 100 assets. The other one is a random portfolio. We set the weight of the same $i$th asset to 1. For all the other assets, the weights are independent and identically distributed random variables that follow a uniform distribution on the interval $[-1, 1]$. These two portfolios can be considered as two benchmarks. We expect that the portfolios constructed by our algorithms will beat them in out-of-sample performance. We generate observations based on the same VAR(1) model in each trial. These are the test sets on which we evaluate our constructed portfolios and the two benchmark portfolios. We generate 100 different test sets and then compare the average of the estimated mean reversion coefficients and trading opportunities of these portfolios on both the training set and the test set. When we count the out-of-sample trading opportunities, we still use the mean and standard deviation of the training set, since we are not supposed to know the future mean or variance.
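The two benchmark portfolios and the use of training-set statistics out of sample can be sketched as follows. Several details are assumptions made only for illustration: the equal-weight benchmark is taken to be a vector of ones, and a "trading opportunity" is counted as each entry of the portfolio value into the region more than `k` training standard deviations away from the training mean. The function names and the threshold `k` are hypothetical and may differ from the definitions used in the paper.

```python
import numpy as np

def benchmark_portfolios(n_assets, i, seed=0):
    """Equal-weight benchmark (assumed all ones) and random benchmark with x[i] = 1."""
    rng = np.random.default_rng(seed)
    equal = np.ones(n_assets)
    random_w = rng.uniform(-1.0, 1.0, size=n_assets)  # i.i.d. uniform weights on [-1, 1)
    random_w[i] = 1.0                                 # pin the i-th asset's weight
    return equal, random_w

def count_opportunities(train_values, test_values, k=1.0):
    """Count entries of the test series beyond k training standard deviations.

    The mean and standard deviation come from the training set only, so no
    future information is used. The counting rule itself is an assumption.
    """
    mu = np.mean(train_values)
    sd = np.std(train_values, ddof=1)
    outside = np.abs((np.asarray(test_values) - mu) / sd) > k
    entries = outside[1:] & ~outside[:-1]             # False -> True transitions
    return int(entries.sum()) + int(outside[0])

equal, random_w = benchmark_portfolios(n_assets=100, i=42)
n_opps = count_opportunities(np.array([-1.0, 0.0, 1.0]),
                             np.array([0.0, 2.0, 2.0, 0.0, -3.0, 0.0]))
```

In practice `train_values` and `test_values` would be the portfolio value series `data @ weights` computed on the training and test sets respectively.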


Non-negative Matrix Factorization (NMF) in Semi-supervised Learning Reducing Dimension and Maintaining Meaning SAMSI 10 May 2013 Outline Introduction to NMF Applications Motivations NMF as a middle step

### Computational Optical Imaging - Optique Numerique. -- Deconvolution --

Computational Optical Imaging - Optique Numerique -- Deconvolution -- Winter 2014 Ivo Ihrke Deconvolution Ivo Ihrke Outline Deconvolution Theory example 1D deconvolution Fourier method Algebraic method

Quadratic Functions, Optimization, and Quadratic Forms Robert M. Freund February, 2004 2004 Massachusetts Institute of echnology. 1 2 1 Quadratic Optimization A quadratic optimization problem is an optimization

### LINEAR SYSTEMS. Consider the following example of a linear system:

LINEAR SYSTEMS Consider the following example of a linear system: Its unique solution is x +2x 2 +3x 3 = 5 x + x 3 = 3 3x + x 2 +3x 3 = 3 x =, x 2 =0, x 3 = 2 In general we want to solve n equations in

### 3 Does the Simplex Algorithm Work?

Does the Simplex Algorithm Work? In this section we carefully examine the simplex algorithm introduced in the previous chapter. Our goal is to either prove that it works, or to determine those circumstances

### Nonlinear Programming Methods.S2 Quadratic Programming

Nonlinear Programming Methods.S2 Quadratic Programming Operations Research Models and Methods Paul A. Jensen and Jonathan F. Bard A linearly constrained optimization problem with a quadratic objective

### PUTNAM TRAINING POLYNOMIALS. Exercises 1. Find a polynomial with integral coefficients whose zeros include 2 + 5.

PUTNAM TRAINING POLYNOMIALS (Last updated: November 17, 2015) Remark. This is a list of exercises on polynomials. Miguel A. Lerma Exercises 1. Find a polynomial with integral coefficients whose zeros include

### Introduction and message of the book

1 Introduction and message of the book 1.1 Why polynomial optimization? Consider the global optimization problem: P : for some feasible set f := inf x { f(x) : x K } (1.1) K := { x R n : g j (x) 0, j =

### The Behavior of Bonds and Interest Rates. An Impossible Bond Pricing Model. 780 w Interest Rate Models

780 w Interest Rate Models The Behavior of Bonds and Interest Rates Before discussing how a bond market-maker would delta-hedge, we first need to specify how bonds behave. Suppose we try to model a zero-coupon

### Section Notes 3. The Simplex Algorithm. Applied Math 121. Week of February 14, 2011

Section Notes 3 The Simplex Algorithm Applied Math 121 Week of February 14, 2011 Goals for the week understand how to get from an LP to a simplex tableau. be familiar with reduced costs, optimal solutions,

### Probability and Statistics

CHAPTER 2: RANDOM VARIABLES AND ASSOCIATED FUNCTIONS 2b - 0 Probability and Statistics Kristel Van Steen, PhD 2 Montefiore Institute - Systems and Modeling GIGA - Bioinformatics ULg kristel.vansteen@ulg.ac.be

### An Overview Of Software For Convex Optimization. Brian Borchers Department of Mathematics New Mexico Tech Socorro, NM 87801 borchers@nmt.

An Overview Of Software For Convex Optimization Brian Borchers Department of Mathematics New Mexico Tech Socorro, NM 87801 borchers@nmt.edu In fact, the great watershed in optimization isn t between linearity

### When to Refinance Mortgage Loans in a Stochastic Interest Rate Environment

When to Refinance Mortgage Loans in a Stochastic Interest Rate Environment Siwei Gan, Jin Zheng, Xiaoxia Feng, and Dejun Xie Abstract Refinancing refers to the replacement of an existing debt obligation

### ECEN 5682 Theory and Practice of Error Control Codes

ECEN 5682 Theory and Practice of Error Control Codes Convolutional Codes University of Colorado Spring 2007 Linear (n, k) block codes take k data symbols at a time and encode them into n code symbols.

### CONTROLLABILITY. Chapter 2. 2.1 Reachable Set and Controllability. Suppose we have a linear system described by the state equation

Chapter 2 CONTROLLABILITY 2 Reachable Set and Controllability Suppose we have a linear system described by the state equation ẋ Ax + Bu (2) x() x Consider the following problem For a given vector x in

### Current Accounts in Open Economies Obstfeld and Rogoff, Chapter 2

Current Accounts in Open Economies Obstfeld and Rogoff, Chapter 2 1 Consumption with many periods 1.1 Finite horizon of T Optimization problem maximize U t = u (c t ) + β (c t+1 ) + β 2 u (c t+2 ) +...

### Auxiliary Variables in Mixture Modeling: 3-Step Approaches Using Mplus

Auxiliary Variables in Mixture Modeling: 3-Step Approaches Using Mplus Tihomir Asparouhov and Bengt Muthén Mplus Web Notes: No. 15 Version 8, August 5, 2014 1 Abstract This paper discusses alternatives