Math 348: Numerical Methods with application in finance and economics
Boualem Khouider, University of Victoria
Lecture notes: Updated January 2012
Contents

1  Basics of numerical analysis
   Notion of rounding and truncation errors
   Bounding the truncation error
   Computer representation of numbers: floating-point arithmetic
     Representation of numbers in an arbitrary base
   Floating-point numbers
     The smallest and largest fl-pt numbers
     Distance between two successive floating-point numbers, the epsilon machine
     Precision and round-off errors
   Floating-point arithmetic and the danger of cancellation of significant digits
   Stability and conditioning
   Problems

2  Direct methods for linear systems
   Introduction
   Lower and upper triangular matrices
   Gauss elimination
     Operation count
     LU factorization
     Partial pivoting or row interchange
     Cholesky's LL^T factorization
     LU factorization in Matlab
   Vector and matrix norms
   Condition number and conditioning
   Problems

3  Nonlinear equations: F(X) = 0
   Introduction
   The bisection method
     Convergence of the bisection method
   Newton's method
     Local convergence of Newton's method
     Rate of convergence
     Advantages and disadvantages of Newton's method
   Secant method
     Convergence criterion
     Function fzero of Matlab
   Fixed-point iterations
     The method of fixed-point iterations
     Newton's method as a fixed-point iteration
     The chord method
     Application: the Black-Scholes formula
   Problems

4  Function approximation
   Curve fitting
     Least squares approximation in Matlab
   Interpolation polynomial
     Lagrange interpolation polynomial
     The interpolation polynomial of a known function
     Interpolation error
   Piecewise polynomial interpolation (introduction to splines)
     Splines in Matlab
   Theory of least squares approximation
     Notion of basis and generalization of linear least squares approximation
     General least squares in Matlab
   Problems

5  Numerical integration
   Newton-Cotes integration
     Integration error
     Order of accuracy
   Composite integration rules
     Error analysis for the composite integration rules
   Gauss integration
   Problems

6  Monte Carlo integration
   Introduction: probability distributions and random variables
     Definition and examples
     Mean, variance, and expectation
     Conditional probability and the notion of dependent and independent random variables
   Crude Monte Carlo integration
   Random number generators: pseudo-random numbers
     Linear congruential generators (LCG)
     Inverse transform method
     Acceptance-rejection method
     Polar approach for generating normal variates
   Controlling the sampling error and variance reduction techniques
   Problems

7  Optimization and multidimensional Newton's method
   Unconstrained optimization in one dimension
   Unconstrained multivariable smooth optimization and Newton's method for systems
   One-variable unconstrained optimization: golden section search method
   Multivariable unconstrained optimization
     Introduction to convex optimization
     The method of steepest descent
   Constrained optimization
     Equality constraints and the Lagrange multipliers method
     Penalty method
     Unconstrained optimization in Matlab: fminunc
     Inequality constraints and the barrier function method
   Problems

8  Finite difference methods for partial differential equations
   Introduction to partial differential equations
     Classification of PDEs
     Initial and boundary conditions
   Finite differencing
   Finite difference schemes for the heat equation
     Finite difference approximation of the second-order derivative
     Consistency, order of accuracy, and convergence
     Crank-Nicolson method
   Reading homework: application to the Black-Scholes equation
Preliminary recommendations and guidelines

To solve some questions in the problem sets, you need to use MATLAB. Below are a few hints and recommendations on how to use MATLAB. Normally, both the Windows and the UNIX versions of MATLAB are installed on many of the machines on campus; for example, check the computer lab downstairs in the Clearihue building. To access the Windows version, simply select Start -> (Productivity or Programs) -> MATLAB 6.1 (or 7.x, etc.). Enter your MATLAB commands once you obtain the MATLAB prompt >> in the Command Window. To exit from MATLAB, type either exit or quit. To access the UNIX version of MATLAB, see Chapter 1 of the UVic MATLAB Manual. To learn more about the software, read the MATLAB Primer (bdriver/21ds99/matlab-primer.html) in addition to the UVic Manual. You can also visit the official MATLAB website.

**********************************************************
FOR ALL ASSIGNMENTS IN WHICH MATLAB IS USED, HAND IN THE FOLLOWING: a printed copy of the MATLAB statements (including M-files) that you use to solve a problem, any output that is generated on your screen, and any graphs that are plotted. Either edit your saved files to eliminate any statements that are not part of your final solution, or cross them out, so that the marker can easily find your answers. CLEARLY IDENTIFY AND LABEL YOUR SOLUTIONS.

Using diary to save your work under MATLAB
The simplest way to save a copy of the MATLAB commands that you type into the Command Window, together with the MATLAB output that you generate, is to use the diary command (see help diary). Everything that appears on the MATLAB screen after entering the command

>> diary('filename')

will be saved in a file called filename, until you enter

>> diary off

Windows version of the diary command: enter (including the quotes)

>> diary('c:\students\filename')   % (check with your system administrator)

UNIX version of the diary command: enter (including the quotes)

>> diary('filename')

To print a file created using diary in the Windows version: select File -> Open. At the bottom of this window, under "Files of type", select "All files", and near the top of this window, under "Look in", select Local Disk (C:) -> Students. Then select (highlight) the file you want to open, and select Open. This will cause the contents of the file to appear in an Editor window. Then select File -> Print. Then close the Editor window.

To print a file created using diary in the UNIX version:
Select File -> Open and, under "Files of type", select "All files (*.*)". Then select the appropriate filename. This will cause the contents of the file to appear in an Editor window. Under File in this window, select Print to print the file, or choose the print icon. Then close the Editor window.

Note: if you don't specify a file name and enter the diary command without parentheses (>> diary), your work will be saved in a file named diary by default.

**********************************************************
Chapter 1

Basics of numerical analysis

1.1 Notion of rounding and truncation errors

Because computer resources (memory and disk space) are finite, two kinds of errors are introduced when real numbers and mathematical operations are used on a computer: truncation errors and rounding (or round-off) errors.

Truncation errors occur, for example, when the mathematical quantity under consideration involves an infinite number of arithmetic operations; it must then be approximated by a truncated expression with a finite number of terms before it is implemented on a computer. Numerical series and power series are the simplest examples. Let S = Σ_{n=1}^∞ 1/n². One way to estimate the value of S on a computer is to use its partial sums: for sufficiently large N, we have S ≈ S_N = Σ_{n=1}^N 1/n². The difference E = S − S_N = Σ_{n=N+1}^∞ 1/n² is called the truncation error. Similarly, an expression such as e^x can be estimated using its Taylor (or power) series:

e^x = Σ_{n=0}^∞ x^n/n! ≈ 1 + x + x²/2! + x³/3! + ... + x^N/N!,

for N sufficiently large. The truncation error in this case, E = Σ_{n=N+1}^∞ x^n/n!, is also known as the remainder in the theory of Taylor expansions.

Rounding or round-off errors occur because a computer uses a finite set of rational numbers to represent (approximately) the whole real line. The rational numbers that are exactly represented on a computer are known as floating-point (fl-pt) numbers; all other real numbers are rounded to the nearest floating-point number. Very large and very small numbers are treated as infinity and as zero, respectively. Floating-point numbers and round-off errors are discussed in detail below, but we first discuss truncation errors, in the simple case of Taylor expansions.
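The partial-sum idea is easy to try out. The following sketch is in Python rather than the Matlab used elsewhere in these notes, purely so it is self-contained; nothing in it comes from the notes themselves. It estimates S = Σ 1/n² (whose exact value is π²/6) and observes the truncation error:

```python
# Illustrative sketch (not from the notes): truncation error of the
# partial sums S_N of the series S = sum_{n>=1} 1/n^2 = pi^2/6.
import math

def partial_sum(N):
    """S_N = sum_{n=1}^{N} 1/n^2."""
    return sum(1.0 / n**2 for n in range(1, N + 1))

S = math.pi**2 / 6            # exact value of the series
for N in (10, 100, 1000):
    E = S - partial_sum(N)    # truncation error E = S - S_N
    print(N, E)               # E shrinks roughly like 1/N
```

By the integral test, E lies between 1/(N+1) and 1/N, so each extra correct digit costs a factor of ten more terms; this is exactly why a priori error bounds, the subject of the next section, matter more than brute force.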
1.2 Bounding the truncation error

In practical problems, truncation errors, once introduced, are impossible to estimate directly without further approximations. The numerical analyst often relies on sophisticated mathematical theories (when available) to find upper bounds for the truncation error, so as to verify a priori that the truncated quantity is an acceptable approximation of the original expression.

As an example, assume we want to approximate the integral

I = ∫₁ˣ ln(t) dt,  for all x ∈ [1, 1.5],

to within an error of at most 0.002, using a Taylor approximation about x₀ = 1. How many terms in the Taylor expansion do we need to keep? Recall Taylor's expansion:

f(x) = f(x₀) + (x − x₀) f'(x₀) + (x − x₀)²/2! f''(x₀) + ... + (x − x₀)ⁿ/n! f⁽ⁿ⁾(x₀) + R_n(x),

where

R_n(x) = (x − x₀)^{n+1}/(n+1)! f^{(n+1)}(ξ),  with ξ somewhere between x and x₀.

If the remainder is dropped, we get the Taylor approximation f(x) ≈ P_n(x), where

P_n(x) = f(x₀) + (x − x₀) f'(x₀) + (x − x₀)²/2! f''(x₀) + ... + (x − x₀)ⁿ/n! f⁽ⁿ⁾(x₀)

is known as the Taylor polynomial of f about x₀ of order (or degree) n. Notice that a polynomial involves only a finite number of arithmetic operations (additions and multiplications) and can therefore be evaluated directly on a computer! The truncation error in this case is given by the remainder, which satisfies R_n(x) = f(x) − P_n(x). Note that R_n(x₀) = 0, and recall that, for well-behaved functions, R_n(x) decreases to zero as n goes to infinity when x is sufficiently close to x₀; therefore, as n gets larger and larger, the Taylor polynomial P_n(x) gets closer and closer to f(x): the larger n, the better the approximation.

Back to our example of the integral I above. Let f(x) = ∫₁ˣ ln t dt. Then

f'(x) = ln(x),  f''(x) = 1/x,  f'''(x) = −1/x²,  f⁽⁴⁾(x) = 2/x³,  f⁽⁵⁾(x) = −6/x⁴, ...

A Taylor approximation of order 3, about x₀ = 1, yields

∫₁ˣ ln t dt = f(1) + (x−1) f'(1) + (x−1)²/2! f''(1) + (x−1)³/3! f'''(1) + R₃(x)
            = 0 + 0 + (x−1)²/2 − (x−1)³/6 + R₃(x),
where R₃(x) = (x−1)⁴/4! f⁽⁴⁾(ξ) = (x−1)⁴/(12 ξ³), with 1 ≤ ξ ≤ x ≤ 1.5. In order for this approximation to be acceptable, we need to verify whether the truncation error satisfies |R₃(x)| ≤ 0.002 for all x ∈ [1, 1.5]. We have

|R₃(x)| = (x−1)⁴/(12 ξ³),  x ∈ [1, 1.5], ξ ∈ [1, x],

and, taking x = 1.5 and ξ = 1,

max |R₃(x)| ≤ (0.5)⁴/12 ≈ 0.0052 > 0.002;

i.e., the 3rd-order Taylor approximation is not guaranteed to meet the given error tolerance of 0.002. We improve the Taylor approximation of f(x) by using an expansion of higher order. With n = 4, we get

P₄(x) = (x−1)²/2 − (x−1)³/6 + (x−1)⁴/12

and

R₄(x) = (x−1)⁵/5! f⁽⁵⁾(ξ) = −(x−1)⁵/(20 ξ⁴),  x ∈ [1, 1.5], ξ ∈ [1, x].

This satisfies

|R₄(x)| ≤ (0.5)⁵/20 ≈ 0.0016 < 0.002,  for all x ∈ [1, 1.5], ξ ∈ [1, x],

because (x−1)⁵ ≤ (0.5)⁵ and 1/ξ⁴ ≤ 1. We used the fact that max |f(x) g(x)| ≤ max |f(x)| · max |g(x)|. Thus, to within an error of 0.002, we have

∫₁ˣ ln t dt ≈ P₄(x) = (1/12)(x⁴ − 6x³ + 18x² − 22x + 9).

The last expression can be implemented on a computer to provide acceptable approximations of the given integral for any value of x ∈ [1, 1.5]. Below, we use the Matlab language to plot the function f(x) = ∫₁ˣ ln t dt = x ln x − x + 1 (solid line) and the Taylor polynomial P₄(x) (dashed line) on top of each other, for comparison, together with the error |f(x) − P₄(x)| for x ∈ [1, 1.5] (bottom panel). As expected, because of the relatively small error, the two curves on the top panel are almost indistinguishable, except for values of x close to 1.5. The error plot on the bottom panel provides extra evidence that the truncation error is within the tolerance level of 0.002; indeed, the error remains below the upper bound computed above. Also note that the error increases rapidly as x increases away from x₀ = 1. This is typical of Taylor approximations.
[Figure: top panel, x ln(x) − x + 1 (solid) and P_4(x) (dashed), nearly indistinguishable; bottom panel, the error |f(x) − P_4(x)| versus x, on a 10^-3 scale.]

Here are the Matlab commands used to produce these plots. (All statements preceded by a % are user comments that are not seen by Matlab.)

>> f = inline('x*log(x)-x+1') % predefines the function f(x)
>> figure % opens a new figure window
>> subplot(2,1,1) % divides the figure window into two subwindows
                  % and sets the pointer on the top panel.
                  % Use help subplot to learn more.
>> fplot(@(x)f(x),[1,1.5]) % creates a graphic for the function f(x)
>> hold on % so we can graph a new function on top of the previous one
>> p4 = inline('(x.^4-6*x.^3+18*x.^2-22*x+9)/12') % defines the polynomial p4
>> fplot(@(x)p4(x),[1,1.5],'r--') % graphs p4 on top of f(x)
                                  % using a red dashed line (thus the 'r--' option)
>> legend('x ln(x)-x+1','P_4(x)',0) % puts a legend box
>> subplot(2,1,2) % creates the second panel
>> fplot(@(x)abs(f(x)-p4(x)),[1,1.5]) % abs stands for absolute value
>> xlabel('x') % puts a label on the x axis
>> ylabel('Error') % puts a label on the y axis
>> print -depsc taylorapproximation.eps % saves a hard copy of the figure
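As a cross-check of the plot, the maximum error on [1, 1.5] can be computed directly. A short sketch, in Python for self-containedness (the function f and polynomial P_4 are the ones derived above; nothing else is assumed):

```python
# Cross-check: the truncation error |f(x) - P4(x)| on [1, 1.5]
# should stay below the 0.002 tolerance derived in the text.
import math

def f(x):
    """Closed form of the integral: f(x) = x ln x - x + 1."""
    return x * math.log(x) - x + 1

def p4(x):
    """Fourth-order Taylor polynomial of f about x0 = 1."""
    return (x**4 - 6 * x**3 + 18 * x**2 - 22 * x + 9) / 12.0

xs = [1 + 0.5 * i / 1000 for i in range(1001)]   # fine grid on [1, 1.5]
max_err = max(abs(f(x) - p4(x)) for x in xs)
print(max_err)   # stays below the 0.002 bound
```

The maximum occurs at the right endpoint x = 1.5, consistent with the remark that Taylor errors grow away from the expansion point.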
1.3 Computer representation of numbers: floating-point arithmetic

Representation of numbers in an arbitrary base

The number 3490 in the decimal base is interpreted as

3490 = 3 × 10³ + 4 × 10² + 9 × 10¹ + 0 × 10⁰.

The power of ten determines the order of magnitude, and the coefficient in front, which can be any digit from 0 to 9, gives the actual term. Similarly, we can convert any given number, given for example in base 10, to an arbitrary base b. The commonly used bases are b = 2 (binary: coefficients are 0 or 1), b = 8 (octal: coefficients are digits from 0 to 7), and b = 16 (hexadecimal: coefficients are in {0, 1, 2, ..., 9, A, B, C, D, E, F}). For example, the number three in the decimal base satisfies 3 = 2 + 1 = 1 × 2¹ + 1 × 2⁰ and thus is written as 11 in the binary base; i.e., we have (3)₁₀ = (11)₂, where the subscript refers to the base in which the number is represented.

Examples:
Verify that (21.5)₁₀ = 1 × 2⁴ + 0 × 2³ + 1 × 2² + 0 × 2¹ + 1 × 2⁰ + 1/2 = (10101.1)₂.
(12)₁₀ = (C)₁₆ = (14)₈ = (1100)₂.
Write (21.5)₁₀ in both the octal and hexadecimal bases.

General representation of integers in the binary base

Let N be an integer. Then there exists a finite sequence of coefficients b₀, b₁, ..., b_k such that

N = b_k 2^k + b_{k−1} 2^{k−1} + ... + b₁ 2 + b₀.

To compute the coefficients b₀, b₁, ..., b_k, we first note that

N/2 = b_k 2^{k−1} + b_{k−1} 2^{k−2} + ... + b₁ + b₀/2.

Let Q₀ be the integer such that N = 2Q₀ + b₀; i.e., b₀ is the remainder of the division of N by 2: b₀ = 1 or 0 depending on whether N is odd or even. Similarly, consider Q₁ such that Q₀ = 2Q₁ + b₁; i.e., b₁ is the remainder of the division of Q₀ by 2, which again is either 0 or 1. Iterating this process over and over yields an effective algorithm for computing the coefficients b₀, b₁, ..., b_k. The corresponding Matlab program is given next.
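The repeated-division scheme can also be written in a few lines of Python (shown here as a language-neutral sketch; the function name dec_to_binary is mine, and the Matlab version, DecToBinary, follows):

```python
# Sketch of the repeated-division algorithm: binary digits of a
# positive integer, least significant digit first.
def dec_to_binary(n):
    """Return the binary digits of positive integer n, least significant
    first, by repeatedly taking the remainder of division by 2."""
    digits = []
    while n > 0:
        n, r = divmod(n, 2)   # quotient and remainder of division by 2
        digits.append(r)
    return digits

print(dec_to_binary(13))   # [1, 0, 1, 1], i.e. (13)_10 = (1101)_2
```

Reading the output right to left gives the usual binary string, matching the hand computation below.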
%M-file DecToBinary.m
function b = DecToBinary(N)
%Matlab program converting a given decimal integer to binary.
%input: an integer N in decimal base
%output: a vector b -- a sequence of zeros and ones -- the binary
%representation of N, least significant digit first
%%Initialization
N0 = N;
i = 1;
%%while loop
while (N0 > 0)
    N1 = floor(N0/2);   % floor(X) returns the largest integer <= X
    b(i) = N0 - N1*2;   % store the remainder of the division of N0 by 2
    i = i + 1;
    N0 = N1;
end

Type this little program, save it as an M-file, and execute it for a few integers of your choice. For example, 3 and 13 should yield the following outputs:

>> DecToBinary(3)
1 1
>> DecToBinary(13)
1 0 1 1

Now execute the above algorithm by hand for these two examples (converting 3 and 13 from the decimal to the binary base) to confirm the Matlab results.

Exercise:
a) Find the binary representation of the decimal number 123.
b) Explain how you would adapt the algorithm above to convert integers from base 10 to base 8. Write the corresponding Matlab program and find the representation of the decimal number 123 in base 8.

Binary representation of real numbers

Let X ≥ 0 be a non-negative real number. Then X can be decomposed into its integer and fraction parts, X = N + R, 0 ≤ R < 1, with

N = Σ_{j=0}^k b_j 2^j  and  R = Σ_{j=1}^∞ d_j 2^{−j}.

Example:

R = (0.7)₁₀ = (0.1011 0011 0011 ...)₂  (the block 0011 repeats forever).

Note that 0.7 has an exact (finite) representation in base 10 but not in base 2; i.e., on a binary computer, the number 0.7 cannot be represented exactly.
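These fraction digits can be generated programmatically as well. Here is a sketch in Python of the digit-doubling idea (the hand computation follows next); the function name frac_to_binary is mine, and note that since 0.7 is itself stored inexactly in binary, only the first few digits produced this way are reliable:

```python
# Sketch of the digit-doubling idea: the first k binary fraction
# digits d_1..d_k of a real number r in [0, 1).
def frac_to_binary(r, k):
    """d = int(2*F) is the next digit; F is replaced by frac(2*F)."""
    digits = []
    f = r
    for _ in range(k):
        f *= 2
        d = int(f)     # integer part of 2*F is the next binary digit
        digits.append(d)
        f -= d         # keep only the fraction part
    return digits

print(frac_to_binary(0.7, 8))   # [1, 0, 1, 1, 0, 0, 1, 1]
```

The output reproduces the repeating pattern 0.1011 0011 ... claimed above.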
Here is how this sequence has been computed. Similarly to the above, note that

2R = d₁ + Σ_{j=1}^∞ d_{j+1} 2^{−j};

therefore d₁ is the integer part of 2R, and subsequently, if we set F₁ = 2R − d₁ (i.e., the fraction part of 2R), then d₂ is the integer part of 2F₁, and so on. Thus we have the following algorithm.

Set F₀ = R. Then, for k = 1, 2, ...:
  set d_k = integer part of 2F_{k−1}, namely
    d_k = 1 if 2F_{k−1} ≥ 1,  d_k = 0 if 2F_{k−1} < 1,
  and F_k = fraction part of 2F_{k−1}, i.e., F_k = 2F_{k−1} − d_k.
  Stop if F_k = 0 or when the desired precision (i.e., number of coefficients) is achieved.
End

For R = 0.7, we have
2R = 1.4   ⟹ d₁ = 1, F₁ = 0.4
2F₁ = 0.8  ⟹ d₂ = 0, F₂ = 0.8
2F₂ = 1.6  ⟹ d₃ = 1, F₃ = 0.6
2F₃ = 1.2  ⟹ d₄ = 1, F₄ = 0.2
2F₄ = 0.4  ⟹ d₅ = 0, F₅ = 0.4

But F₅ = F₁, so d₆ = d₂, d₇ = d₃, ...: the digits repeat periodically from d₂ on.

1.4 Floating-point numbers

In a representation system of numbers with a given base b, we call a floating-point number any rational number X that can be written exactly in the form

X = ±0.d₁d₂d₃...d_k × b^e = ±m b^e,   (1.1)

where m = 0.d₁d₂d₃...d_k is called the mantissa (or the fraction) and e is the exponent. Note that m is viewed as a fraction in a numeral system with base b. The d_j's are digits between 0 and b − 1 if b ≤ 10, and digits from 0 to 9 plus a sequence of letters if b ≥ 11. k is an integer that limits the size of the mantissa, while e satisfies E_m ≤ e ≤ E_M, where E_m < 0 and E_M > 0 are, respectively, the smallest and largest exponents, which limit the magnitude of the smallest and largest numbers represented by the given floating-point system.

Example: In the binary system adopted by most of today's computers, the following two sets of parameters are used, commonly known as the single (or short) and double (or long) precision representations (taking into account the hidden bit).
Mode                               b    k     E_m      E_M
Single/short precision (32 bits)   2    24    −125     128
Double/long precision (64 bits)    2    53    −1021    1024

Normalized floating-point numbers

Whenever possible, a floating-point number is normalized so that d₁ ≠ 0 (for normalized floating-point numbers in base 2, we have d₁ = 1). A floating-point number that can be normalized is called normal; otherwise it is called subnormal. The normalized representation of a normal floating-point number is unique.

The smallest and largest fl-pt numbers

The smallest positive normal floating-point number, X_m, is reached when e = E_m, d₁ = 1, and d_j = 0 for j = 2, ..., k, which yields

X_m = 0.1 × b^{E_m} = b^{E_m − 1}.

In Matlab, this number is known as the smallest real number and is denoted by realmin. Given that Matlab is by default configured to use double precision, we have realmin = 2^{−1022} ≈ 2.2251 × 10^{−308}.

Exercise: Type >> realmin in the Matlab command window and compare the output to the number above.

Characterization of subnormal numbers: a non-negative floating-point number X is subnormal if and only if X < b^{E_m − 1}. Therefore, by convention (of the IEEE: Institute of Electrical and Electronics Engineers), subnormal numbers take the form

X = 0.d₂d₃...d_k × b^{E_m − 1}.   (1.2)

Note that the size of the new mantissa is reduced by one because the original number was shifted to the right by one bit. The number zero is treated as a special number and is represented by 0 = 0.00...0 × b^{E_m}.

The smallest (subnormal) positive fl-pt number (let us call it X_min, to distinguish it from X_m, realmin in Matlab, which is the smallest normal number) is achieved when d_j = 0 for j = 2, ..., k−1 and d_k = 1 in (1.2). This leads to

X_min = b^{−k+1} × b^{E_m − 1}.

In a 64-bit environment (such as Matlab), the smallest representable number is therefore X_min = 2^{−1074} ≈ 4.9407 × 10^{−324}, which is the smallest positive number before zero in Matlab.

Exercise: To compute the smallest subnormal number of Matlab, execute the following Matlab program.
>> x = eps
>> while (x > 0), x = x/2, end

Explain why this program yields the smallest positive number in Matlab (give a different explanation than the one above). Here eps is a predefined Matlab constant known as the epsilon machine; see the next subsection.

The largest representable (floating-point) number is achieved when e = E_M and d_j = b − 1 for j = 1, 2, ..., k. In the 64-bit mode this is equivalent to (1 − 2^{−53}) × 2^{1024} ≈ 1.7977 × 10^{308}. It is called realmax in Matlab.

Exercise: Go to Matlab and type >> realmax and >> 2^(1023). Compare the two numbers and explain. Execute the following Matlab commands:
Type >> realmax
Type >> realmax * 2
Type >> (realmax + 10*eps) - realmax
Observe the output of each command and explain the results.

Overflow and underflow

If a number exceeds the largest fl-pt number, it is said to overflow, and it is treated either as the largest fl-pt number or as infinity (Inf), depending on the rounding mode. Similarly, any number smaller in magnitude than the smallest fl-pt number is said to underflow, and it is treated either as zero or as the smallest fl-pt number.

Distance between two successive floating-point numbers, the epsilon machine

Normalized floating-point numbers with a common exponent t are found in the interval [b^{t−1}, b^t). Note that 0.1 × b^t = b^{t−1}. There are exactly (b − 1) b^{k−1} normalized floating-point numbers in the interval [b^{t−1}, b^t): b^{t−1} is the smallest and 0.(b−1)(b−1)...(b−1) × b^t is the largest. Floating-point numbers within the interval [b^{t−1}, b^t) are uniformly distributed, and the distance between two consecutive fl-pt numbers within this interval is

D = (b^t − b^{t−1}) / ((b − 1) b^{k−1}) = b^{t−k}.   (1.3)

The distance between 1 and the next floating-point number is called the epsilon machine, or machine precision, and is often denoted by ǫ. Since 1 = (0.1)_b × b¹, we get ǫ = b^{−k+1} (t = 1 in the above
expression). In double precision, the epsilon machine is given by ǫ = 2^{−52} ≈ 2.2204 × 10^{−16}, which is also predefined in Matlab. Try >> eps in the Matlab command window.

Exercise: Write a small Matlab program to find the Matlab epsilon, without using formula (1.3) above.

Precision and round-off errors

Definition 1 Let p* denote an approximation to a given number p. Then the quantity |p − p*| is called the absolute error and, if p ≠ 0, the quantity |p − p*|/|p| is called the relative error.

When using floating-point numbers, two modes of approximation are often used to find the nearest fl-pt number to a given real number: chopping and rounding. Chopping consists of simply chopping off all the digits that exceed the size of the mantissa in the normalized floating-point representation, while rounding consists of rounding to the nearest floating-point number, i.e., the one that minimizes the absolute error.

Example: Find the fl-pt approximation, and the associated absolute and relative errors, for x = 2/3 in a floating-point system with base 10 and precision k = 4, using both the chopping and rounding modes.

Solution:

Mode       fl-pt approximation    absolute error      relative error
chopping   0.6666                 ≈ 6.7 × 10⁻⁵        10⁻⁴
rounding   0.6667                 ≈ 3.3 × 10⁻⁵        0.5 × 10⁻⁴

Link between the relative error and precision

Fact: The relative error specifies the number of correct digits of a given approximation. To see how this works, consider the number π = 3.14159265... together with a sequence of approximations (successive truncations of π are used here for illustration):

Approximation of π    # of correct digits    relative error bounds
3.1                   2                      0.5 × 10⁻² < |π − p*|/π < 0.5 × 10⁻¹
3.14                  3                      0.5 × 10⁻³ < |π − p*|/π < 0.5 × 10⁻²
3.141                 4                      0.5 × 10⁻⁴ < |π − p*|/π < 0.5 × 10⁻³
3.1415                5                      0.5 × 10⁻⁵ < |π − p*|/π < 0.5 × 10⁻⁴
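The chopping and rounding of x = 2/3 to four decimal digits can be reproduced mechanically. A sketch in Python (the helper names chop and round_fl are mine, not Matlab built-ins, and this idealized decimal model is only an illustration):

```python
# Idealized base-10 floating-point model: keep k significant decimal
# digits by chopping (truncation) or by rounding to nearest.
import math

def chop(x, k):
    """Truncate x to k significant decimal digits."""
    e = math.floor(math.log10(abs(x))) + 1       # exponent, mantissa in [0.1, 1)
    m = math.trunc(x / 10**e * 10**k) / 10**k    # chop the mantissa to k digits
    return m * 10**e

def round_fl(x, k):
    """Round x to k significant decimal digits (nearest)."""
    e = math.floor(math.log10(abs(x))) + 1
    m = round(x / 10**e * 10**k) / 10**k
    return m * 10**e

x = 2 / 3
print(chop(x, 4), round_fl(x, 4))   # 0.6666 and 0.6667, as in the table
print(abs(x - chop(x, 4)) / x)      # relative chopping error, about 1e-4
```

The chopping error sits at the unit round-off ǫ = 10^{-3} per the formula b^{-(k-1)} with b = 10, k = 4, and rounding halves it, in line with the discussion above.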
If t ≥ 0 is an integer such that |p − p*|/|p| < 5 × 10⁻ᵗ, then p* is said to approximate p to t significant digits.

Definition 2 (Maximum relative error in fl-pt representation) First, let us assume the chopping mode is used. Let p > 0 be a real number and let t be the integer such that b^{t−1} ≤ p < b^t. Let p* be the fl-pt approximation of p. Then

|p − p*|/p ≤ b^{t−k}/p ≤ b^{t−k}/b^{t−1} = b^{−(k−1)} = ǫ,

the epsilon machine introduced above. The epsilon machine ǫ defines the unit round-off error for the chopping mode. We used the fact that |p − p*| is necessarily smaller than the distance b^{t−k} between two successive fl-pt numbers in [b^{t−1}, b^t), and that p ≥ b^{t−1}. If the rounding mode is used instead, then |p − p*| is smaller than half the distance between two successive fl-pt numbers, yielding a unit round-off error equal to b^{−(k−1)}/2 = ǫ/2 for the rounding mode.

1.5 Floating-point arithmetic and the danger of cancellation of significant digits

When arithmetic operations such as additions and multiplications are performed in a floating-point system, in the ideal situation the arithmetic operations are performed exactly, one at a time, and the result is rounded to the nearest fl-pt number after each operation. The fl-pt sum of three floating-point numbers x, y, z, for example, satisfies

fl(x + y + z) = fl(fl(x + y) + z).

Note that, accordingly, fl(x + y + z) ≠ fl(x + z + y) in general, as demonstrated by the following example. Assume, for instance, x = 0.5054, y = 0.1450 × 10⁻³, and z = −0.5054, in a 4-digit floating-point system with decimal base b = 10, using the rounding mode. We have

fl(x + y) = fl(0.5055450) = 0.5055,
fl(x + y + z) = fl(0.5055 + z) = fl(0.0001) = 0.1000 × 10⁻³.

The exact value is p = x + y + z = 0.0001450, and the relative error is

|p − p*|/p ≈ 0.31, or 31%.

But fl(x + z + y) = fl(fl(x + z) + y) = fl(0 + y) = 0.1450 × 10⁻³, which is exact here: the relative error is zero, much smaller than that found in the first computation. What happened in the first computation is known as dangerous cancellation of significant
digits. It occurs when we compute the difference between two fl-pt numbers that are too close to each other; this is indeed the case for the sum of fl(x+y) and z in the example above. Starting with the sum x + z is just one way to get around the problem. To minimize round-off errors, suspicious expressions should be rewritten in a mathematically equivalent form that is more stable, so as to avoid dangerous cancellation.

Examples:

1) fl(√(x²+1) − x) will be inaccurate if x is large and positive. Let us for example assume b = 10, k = 4, x = 65.43, and the rounding mode. Then

fl(x²) = fl(4281.0849) = 4281,
fl(x² + 1) = fl(4282) = 4282,
fl(√(x² + 1)) = fl(65.43699...) = 65.44,
fl(√(x² + 1) − x) = fl(65.44 − 65.43) = 0.01 = 0.1000 × 10⁻¹.

The exact value is √(x² + 1) − x = 0.0076413...; i.e., the fl-pt calculation has no significant digits. A better approximation is obtained when the given quantity is written in a mathematically equivalent expression that is more stable under fl-pt operations. Multiplying and dividing by the conjugate expression, we get

√(x² + 1) − x = (√(x² + 1) − x)(√(x² + 1) + x)/(√(x² + 1) + x) = ((x² + 1) − x²)/(√(x² + 1) + x) = 1/(√(x² + 1) + x).

Now, for x = 65.43,

fl(1/(√(x² + 1) + x)) = fl(1/130.9) = fl(0.0076394...) = 0.007639.

This last approximation has (technically) 4 significant digits, for a relative error of about 3 × 10⁻⁴.

2) To avoid dangerous cancellation when evaluating the expression x − sin(x) when x is near zero (because sin(x) ≈ x near zero), one can use the Taylor expansion of sin(x) near zero:

sin(x) = x − x³/3! + x⁵/5! − ... + (−1)ᵖ x^{2p+1}/(2p+1)! + R_p(x).

As for the Taylor series of e^{−5.5} given in Example 3 below, in any given fl-pt system and for sufficiently large p, the remainder R_p can be dropped, and the fl-pt value of x − sin(x) is then exactly equal to the fl-pt value of the corresponding Taylor polynomial, i.e.,

fl(x − sin(x)) = fl(x³/3! − x⁵/5! + x⁷/7! − ...),

for sufficiently large p. Therefore, the last expression should be used instead of x − sin(x) in fl-pt arithmetic when x is close to zero: it is more stable for fl-pt computations.

3) The expression 1 − sin(x) is likewise problematic for x near π/2. For this one, we do the following:
1 − sin(x) = (1 − sin(x))(1 + sin(x))/(1 + sin(x)) = (1 − sin²(x))/(1 + sin(x)) = cos²(x)/(1 + sin(x)).

The last expression should be used for fl-pt operations instead of the original when x ≈ π/2. Note that, conversely, the expression cos²(x)/(1 + sin(x)) should be avoided when x ≈ −π/2.
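The rewriting trick of Example 1 is visible even in double precision: push x high enough and the subtraction loses every digit, while the conjugate form does not. A sketch in Python (the value 10⁸ is my choice for the demonstration, not from the notes):

```python
# Cancellation demo: sqrt(x^2 + 1) - x for large x.
import math

x = 1.0e8
naive = math.sqrt(x * x + 1.0) - x             # subtracts nearly equal numbers
stable = 1.0 / (math.sqrt(x * x + 1.0) + x)    # conjugate form, no cancellation
print(naive, stable)   # naive collapses to 0.0; stable gives about 5e-09
```

Here x² + 1 rounds to x² in double precision, so the naive form returns exactly 0, whereas the conjugate form retains full accuracy, the same mechanism as in the 4-digit example above.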
1.6 Stability and Conditioning

The aim of numerical methods is to solve mathematical problems on a digital computer. To do so, we first need to design (or use) an algorithm, i.e., a numerical method, for the given problem; this often provides only an approximate solution in place of the full or exact solution. It is highly desirable to design an algorithm that leads to the most accurate approximation possible. Here we present some of the common problems that may arise to hinder this goal and restrict the accuracy of the numerical solution, and we provide some necessary conditions that guarantee a fair approximation. For this we need both a stable algorithm and a well-conditioned problem, whose precise definitions are given next.

Definition 3 A given mathematical problem is said to be ill-conditioned if small changes in the data produce large deviations in the result.

Data X                    → Solution S
Perturbed data X̂ = X + ǫ  → Ŝ

The problem is well-conditioned if: ǫ << 1 implies |S − Ŝ|/|S| << 1.

Definition 4 An algorithm is said to be stable if its approximate solution is close to the exact solution of the original problem with slightly perturbed data.

Exact data X              → approximate solution S_n (via the algorithm)
Perturbed data X̂ = X + ǫ  → exact solution Ŝ

The algorithm is stable if there exists a small perturbation ǫ such that |S_n − Ŝ|/|S_n| is small.

Example 1: Hilbert matrix.

H_ij = 1/(i + j − 1),  1 ≤ i, j ≤ n.

Consider the linear system HX = b with n = 3:

[ 1     1/2   1/3 ] [x]   [ 11/6  ]
[ 1/2   1/3   1/4 ] [y] = [ 13/12 ]
[ 1/3   1/4   1/5 ] [z]   [ 47/60 ]

The exact solution is X = (1, 1, 1)^T. Consider the perturbed problem obtained by rounding the entries of both the matrix H and the right-hand-side vector b to three significant digits. Let Ĥ and b̂ be the truncated matrix and truncated right-hand-side vector, respectively.
Ĥ = [ 1.00   0.500  0.333
      0.500  0.333  0.250
      0.333  0.250  0.200 ],    b̂ = (1.83, 1.08, 0.783)^T.

The (exact) solution of the perturbed problem Ĥ X̂ = b̂ is X̂ = (1.0895, 0.4879, 1.4910)^T. Let us compare the perturbed solution to the solution of the original problem. The absolute error, using the L¹ norm, is

‖X − X̂‖₁ = |x − x̂| + |y − ŷ| + |z − ẑ| ≈ 1.093,

and the corresponding relative error is

‖X − X̂‖₁ / ‖X‖₁ ≈ 1.093/3 ≈ 0.364 = 36.4%.

A small perturbation of the original problem (on the order of 1/1000) resulted in a deviation of 36% in the solution. The problem HX = b is therefore ill-conditioned. We will see later in the course that this is an issue with the Hilbert matrix itself: it is ill-conditioned.

Stability: Now assume that we use Gauss elimination with 3 digits to approximate the solution of the system HX = b. This is our algorithm. The question is whether this algorithm is stable or not. The approximate solution obtained by this algorithm (whose details are not shown here, for streamlining) is given by

X_n = (0.480, 1.88, 1.22)^T.

First note that X_n ≠ X̂ (do you know why?). In fact, the two solutions are very far apart from each other, another pitfall of ill-conditioning: ill-conditioned problems are very sensitive to round-off errors and thus are very tricky to handle in a fl-pt environment. The question is whether the solution X_n is close to the exact solution of a perturbed problem. In other words, can we find a perturbation matrix E and a perturbation vector e, for the (truncated) matrix Ĥ and truncated vector b̂, such that the solution of

(Ĥ + E) X_p = b̂ + e

is close to X_n? It is easy to check that, for a suitable small perturbation (E, e), the solution is

X_p = (0.4650, 1.800, 1.1700)^T,
which is indeed fairly close to X_n. Moreover, both the matrix and the right-hand-side vector of this last system are small perturbations (less than 1%) of the matrix Ĥ and vector b̂. Therefore, our algorithm is stable.

Example 2: Consider the problem of computing the quantity

w = 1000x / (x − y − z),   x = …, y = …, z = ….

The exact solution is given by w ≈ 255,251. To check whether the problem is ill-conditioned, we consider a small perturbation of the data:

x̂ = x + …,  ŷ = y,  ẑ = ….

The perturbed solution is

ŵ = 1000x̂ / (x̂ − ŷ − ẑ) ≈ 4.25 × 10⁵.

The perturbed solution has no significant digits in common with the original solution. Thus, the problem is ill-conditioned. (The culprit is the denominator: x − y − z is tiny relative to the data, so a small change in the data changes it, and hence w, drastically.)

Consider now the algorithm consisting of using fl-pt arithmetic to compute w, in base b = 10, with precision k = 4 and chopping mode. We have

w* = fl(1000x / (x − y − z)) = 319,000.

(The details of the fl-pt calculation of w* are left as an exercise.) Can we find a perturbation of the data such that the exact solution of the perturbed problem is close to w*? Let x̄ = x, ȳ = y, and z̄ = z + …, a small perturbation of the original data, chosen such that

1000x̄ / (x̄ − ȳ − z̄) = 319,000 = w*.

The algorithm is thus stable.

Example 3: Consider the approximation

e^x ≈ 1 + x + x²/2! + ... + xⁿ/n!,   x = −5.5.

Using n = 24, we get y ≈ e^{−5.5} = 0.0040868.... Consider the perturbed data x̂ = x + ǫ. Then

e^{x̂} = e^{x+ǫ} = e^x e^ǫ = e^x (1 + ǫ + ǫ²/2! + ...) ≈ e^x (1 + ǫ) ≈ (1 + ǫ)[1 + x + x²/2! + ... + xⁿ/n!] = ŷ.
Hence |y − ŷ| / y ≈ |ǫ|: the relative change in the solution is comparable to the relative perturbation of the data. The problem is well-conditioned.

Assume we use 5 digits (base 10) in rounding mode to evaluate the given Taylor approximation. Is this stable? Note that the terms (−5.5)^n / n! with n ≥ 25 add no further change (improvement) to the Taylor approximation in this fl-pt arithmetic. (In fact, 5.5^26/26! = … is rounded to zero in 5-digit precision when added to the sum of the first 25 terms, given by y_n = … .) This algorithm yields an approximate solution y_n = … for all n ≥ 25. This approximate solution has no significant digits:

|y − y_n| / y = 35.32%!

Can we find a small perturbation x̂ of the original data x so that ŷ ≡ e^{x̂} ≈ y_n? The answer is no. Otherwise, the solution ŷ would be close to y, because we know that this problem is well-conditioned: |ŷ − y| / y ≈ |ǫ|. If we suppose that, in addition, ŷ is close to y_n, we arrive at a contradiction. Suppose |ŷ − y_n| / y_n < δ. Then

|y − y_n| / y = |y − ŷ + ŷ − y_n| / y ≤ |y − ŷ| / y + |ŷ − y_n| / y ≤ ǫ + ( |ŷ − y_n| / y_n ) ( y_n / y ) ≤ ǫ + δ y_n / y = α.

α is small given that both ǫ and δ are small and that y_n / y is order one. This is a contradiction with the fact that |y − y_n| / y = 0.3532…, which is very large compared to the unit round-off. In fact, if we attempt to compute a perturbation ǫ that yields the solution y_n (exactly or approximately), we find

e^{−5.5+ǫ} = y_n  ⟹  −5.5 + ǫ = ln(y_n)  ⟹  ǫ = 5.5 + ln(y_n),

which is clearly a large perturbation of the data: |ǫ| / 5.5 = 0.08 = 8%. The algorithm is therefore unstable. Clearly, this is due to a cancellation of significant digits in the Taylor expansion. This can be dealt with, for instance, by changing the order of summation of the Taylor series.

1.7 Problems

1. Provide an algorithm and write a Matlab program to convert any given decimal number to the octal base.

2. (a) Determine the second-order (n = 2) Taylor polynomial approximation for f(x) = √(x+1) expanded about x_0 = 0. Include the remainder term.
(b) Use the polynomial approximation in (a) (without the remainder term) to approximate √… . Give all 10 significant digits. The exact value is approximately … . Use this value to compute the absolute error of your computed approximation.

(c) Determine a good upper bound for the truncation error of the Taylor polynomial approximation in (b) by bounding the remainder term. Note that the absolute error in (b) should be smaller than this upper bound.

(d) Determine a good upper bound for the truncation error of the order-2 Taylor polynomial approximation of f(x) = √(x+1) for all values of x such that −0.05 ≤ x ≤ 0.05. Use Matlab to plot the error |f(x) − P_2(x)| and verify that it stays below the computed upper bound.

3. Consider the evaluation of

f(x) = 1/(1 − x) − 1/(2 − x),  x ≠ 1 and x ≠ 2,

in floating-point arithmetic.

(a) Use 4-decimal-digit, idealized, chopping floating-point arithmetic to evaluate fl(f(2.001)). Compute the relative error.

(b) Repeat (a) for fl(f(−1234)).

(c) Based on your results in (a) and (b), and other calculations you may want to try, specify which of the following ranges of values of x give rise to inaccurate computation of f(x) in floating-point arithmetic. Explain why.
(i) x is close to 1. (ii) x is close to 2. (iii) x is close to 0. (iv) x is negative and large in magnitude. (v) x is positive and large in magnitude.

4. For each of the functions below, find another expression that is mathematically identical to the given function and that is more accurate when using floating-point arithmetic.
(i) f(x,y) = x − √(x² − y) when x is positive and x is much larger than y.
(ii) g(x) = sin x / (1 + cos x) when x is close to π.
(iii) h(x) = 1/(1 + x) − 1/x when x is very large in magnitude.
(iv) 1/(1 − x) − 1/(2 − x) when x is very large.

5. Use 4-decimal-digit, idealized, chopping arithmetic to evaluate

ŵ = fl( 1000x / (x − y − z) )

for x = … , y = … , z = … . Find the absolute and relative errors if the exact arithmetic value is w ≈ 255,251. How many correct digits does the fl-pt computation have?
6. Consider the Taylor polynomial approximation for e^x:

P_n(x) = 1 + x + x²/2! + ··· + x^n/n!,  n ≥ 1.

Assume we use this polynomial to approximate e^{−5.5} using floating-point arithmetic with b = 10 and K = 5 (decimal base and 5-digit precision) in rounding mode. Show that fl(P_n(−5.5)) = … for all n ≥ 25.
7. Let f(x) = ln(x), x > 0. Show that the problem of evaluating f(x) is ill-conditioned for values of x close to 1.

8. Consider f(x) = (1 + cos x) / (x − π)², x ≠ π (x is in radians).

(a) Evaluate fl(f(3.129)) using 4-decimal-digit, idealized, chopping floating-point arithmetic. Note fl(π) = 3.141. The exact value of f(3.129) is approximately … . Compute the relative error; the latter should be larger than 35%.

(b) Determine the fourth-order Taylor polynomial approximation for g(x) = 1 + cos x about x_0 = π, expressed in terms of powers (x − π)^k (without the remainder).

(c) Use the Taylor polynomial in (b) to derive an approximation for f(x) in terms of powers (x − π)^k. Use this approximation to deduce that the computation in (a) is unstable.

9. Using idealized, chopping floating-point arithmetic in base 10 and precision K = 4, the evaluation of fl( fl(w·x) − fl(y·z) ) for w = 3.456, x = 12.34, y = 23.45, z = … gives a result of … . Show that this computation is stable.

10. The quadratic formula states that the roots of ax² + bx + c = 0, when a ≠ 0, are

x_1 = ( −b + √(b² − 4ac) ) / (2a)  and  x_2 = ( −b − √(b² − 4ac) ) / (2a).

(a) Use four-digit rounding fl-arithmetic, in base 10, to find the fl-pt approximations x̂_1 and x̂_2 to the roots x_1, x_2 of x² − …x + 1 = 0. Find the associated relative errors if the exact roots are approximately x_1 = … and x_2 = … . Compare, and note which one is not accurate.

(b) Show that x_1 = −2c / ( b + √(b² − 4ac) ). Repeat (a) using this new expression for x_1 instead. Compare the relative errors and notice the difference. Explain what went wrong in the previous case.

(c) Similarly, show that x_2 = −2c / ( b − √(b² − 4ac) ). For which situation will this new expression for x_2 be more accurate in fl-pt arithmetic than the original one? Hint: Repeat (a) for x² + …x + 1 = 0 using both the new expression for x_2 and the original one.
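The cancellation phenomenon behind Example 3 (and several of the problems above) can be reproduced by simulating 5-digit rounding in Python. This is a sketch, not a solution key: the helper names, the 30-term cutoff, and the use of double precision underneath are my choices, not part of the notes.

```python
import math

def fl(v, digits=5):
    """Round v to `digits` significant decimal digits (idealized rounding mode)."""
    if v == 0.0:
        return 0.0
    e = math.floor(math.log10(abs(v)))
    return round(v, int(digits - 1 - e))

def taylor_exp(x, n=30, digits=5):
    """Sum 1 + x + x^2/2! + ... + x^n/n!, rounding every term and partial sum."""
    s, term = fl(1.0, digits), 1.0
    for k in range(1, n + 1):
        term = term * x / k                  # term computed in double precision
        s = fl(s + fl(term, digits), digits)  # then rounded to 5 digits
    return s

exact = math.exp(-5.5)
unstable = taylor_exp(-5.5)            # forward sum: catastrophic cancellation
stable = fl(1.0 / taylor_exp(5.5))     # all-positive sum for e^{5.5}, then invert

print(abs(unstable - exact) / exact)   # large: tens of percent
print(abs(stable - exact) / exact)     # small
```

Summing the all-positive series for e^{5.5} and inverting avoids the alternating signs altogether, which is one standard way of reorganizing the computation, in the spirit of the remedy mentioned at the end of Example 3.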
Chapter 2

Direct methods for linear systems

2.1 Introduction

In this chapter we are concerned with the solution of linear systems of the form AX = b, where A is a square matrix of dimension n, b is a vector in R^n, and X is the unknown solution vector, which is also in R^n. Let us illustrate this with an example.

Example: Consider the system of three equations in three unknowns

4x_1 − 9x_2 + 2x_3 = 2
2x_1 − 4x_2 + 4x_3 = 3
−x_1 + 2x_2 + 2x_3 = 1.

In matrix form this system reads

[ 4 −9 2 ] [x_1]   [2]
[ 2 −4 4 ] [x_2] = [3]
[−1  2 2 ] [x_3]   [1],

or simply AX = b, where

A = [4 −9 2; 2 −4 4; −1 2 2],  X = (x_1, x_2, x_3)^T,  b = (2, 3, 1)^T.

Recall from linear algebra courses that, for a square linear system AX = b, the following statements are equivalent.

The system AX = b has a unique solution for any right-hand side b.

The only solution of the system AX = 0 is the trivial solution X = 0.
The matrix A is non-singular, i.e., the matrix A has an inverse A^{−1} such that A^{−1}A = AA^{−1} = I, where I is the identity matrix.

The determinant of A is non-zero: det(A) ≠ 0.

Matlab, whose name is short for "matrix laboratory", is perhaps the best mathematical software for operating on and handling matrices. The following three commands are very useful.

>> det(A) %returns the determinant of a predefined matrix A
>> inv(A) % returns the matrix inverse of A
>> X = A\b % returns the solution X of AX=b

2.2 Lower and upper triangular matrices

A matrix A = (a_ij), 1 ≤ i,j ≤ n, is said to be upper triangular if all its entries below the diagonal are zero: a_ij = 0 for i > j. A is said to be lower triangular if a_ij = 0 for j > i. A matrix which is both lower and upper triangular is said to be diagonal: a_ij = 0 for i ≠ j.

Upper triangular:
A = [ a_11 a_12 … a_1n ; 0 a_22 … a_2n ; … ; 0 … 0 a_nn ]
Lower triangular:
A = [ a_11 0 … 0 ; a_21 a_22 … 0 ; … ; a_n1 a_n2 … a_nn ]
Diagonal:
A = diag(a_11, a_22, …, a_nn).

The good thing about triangular (upper or lower) matrices is that the solution of the linear system AX = b is straightforward and can be readily programmed on a computer with very little effort. The methods of choice are, respectively, backward and forward substitution.

First note that the determinant of a triangular matrix is given by the product of its diagonal elements:

if A is triangular, then det(A) = a_11 a_22 ··· a_nn = ∏_{i=1}^n a_ii.

Thus, a triangular matrix is non-singular if and only if all its diagonal entries are non-zero.

To illustrate, let us consider the following 3×3 upper triangular system:

x + 0.5y + 2z = 1
      3y +  z = 2
           5z = 3,

or

[1 0.5 2; 0 3 1; 0 0 5] [x; y; z] = [1; 2; 3].
The backward substitution method consists in solving the last equation, which involves z alone, then replacing z by its value in the second equation and solving for y, etc. More precisely, we have

5z = 3 ⟹ z = 3/5;
3y = 2 − z = 2 − 3/5 = 7/5 ⟹ y = 7/15;
x = 1 − 0.5y − 2z = 1 − 7/30 − 6/5 = −13/30;

or X = (−13/30, 7/15, 3/5)^T.

General procedure: forward and backward substitution

For an upper triangular system of arbitrary size,

[ a_11 a_12 … a_1n ; 0 a_22 … a_2n ; … ; 0 … 0 a_nn ] [x_1; x_2; …; x_n] = [b_1; b_2; …; b_n],

assume the triangular matrix is non-singular, i.e., a_ii ≠ 0, i = 1,…,n. Then we have the following backward substitution algorithm for the solution of AX = b:

x_n = b_n / a_nn, and for i = n−1, n−2, …, 1,

x_i = ( b_i − Σ_{j=i+1}^{n} a_ij x_j ) / a_ii.

Similarly, for a lower triangular system

[ a_11 0 … 0 ; a_21 a_22 … 0 ; … ; a_n1 a_n2 … a_nn ] [x_1; …; x_n] = [b_1; …; b_n],

the forward substitution method gives

x_1 = b_1 / a_11, and for i = 2, 3, …, n,

x_i = ( b_i − Σ_{j=1}^{i−1} a_ij x_j ) / a_ii.

As mentioned earlier, one advantage of the forward and backward substitution methods is that they can be easily implemented on a computer using an advanced programming language. The Matlab program for backward substitution applied to an upper triangular matrix is given below. We denote by U an arbitrary upper triangular matrix with entries u_ij, 1 ≤ i ≤ j ≤ n.

%%M-file: Usolve.m
function X = Usolve(U,b)
%input: upper triangular matrix U, right hand side vector b
%output: solution X
n = max(size(U)); %determines the dimension of the matrix U
X(n) = b(n)/U(n,n);
for k = n-1:-1:1 %Backward for loop
    X(k) = ( b(k) - sum( U(k,k+1:n).*X(k+1:n) ) )/U(k,k);
end
%%%%%%%%
%To run in the command window, follow this example
% >> U = [2, 3, 2, 2; 0, 1.5, 0, 3; 0, 0, 2.5, 0; 0, 0, 0, 5]
% >> b = [2; 0; 1; -1]
% >> X = Usolve(U,b)

Exercise: Write a Matlab program to solve the lower triangular system LX = b, where L is a lower triangular matrix with entries l_ij, 1 ≤ j ≤ i ≤ n, using the forward substitution method.

2.3 Gauss elimination

The Gauss elimination method is one of the most robust and universal (i.e., applicable to a wide range of problems) numerical methods for solving linear systems AX = b, where A is a non-singular square matrix. It consists of reducing the full system to an upper triangular one that can then be easily solved by backward substitution. Let us start with a simple example for illustration. Consider the three-by-three system:

4x_1 − 9x_2 + 2x_3 = 2
2x_1 − 4x_2 + 4x_3 = 3
−x_1 + 2x_2 + 2x_3 = 1.
In matrix form,

[4 −9 2; 2 −4 4; −1 2 2] [x_1; x_2; x_3] = [2; 3; 1].

Consider the augmented matrix, formed by the main matrix and the right-hand side vector:

[ 4 −9 2 | 2 ]
[ 2 −4 4 | 3 ]
[−1  2 2 | 1 ].

Subtract 1/2 times the first row from the second and −1/4 times the first row from the third, and substitute these differences for the second and third rows, respectively. In mathematical words, let E_j denote row j = 1, 2, 3 of the augmented matrix. We make the following row operations:

new E_2 = E_2 − (1/2) E_1  and  new E_3 = E_3 − (−1/4) E_1.

This yields

[ 4  −9    2  |  2  ]
[ 0  1/2   3  |  2  ]
[ 0 −1/4  5/2 | 3/2 ].

This makes two zeros appear below the diagonal in the first column of the matrix. To obtain an upper triangular system, we need to further eliminate the entry below the diagonal in the second column. For this, we multiply the second row by −1/2 and subtract it from the third row (new E_3 = E_3 + (1/2) E_2) to get

[ 4 −9  2  |  2  ]
[ 0 1/2 3  |  2  ]
[ 0  0  4  | 5/2 ].

This corresponds to the triangular system

[4 −9 2; 0 1/2 3; 0 0 4] [x_1; x_2; x_3] = [2; 2; 5/2],

whose solution is given by

x_3 = 5/8;  x_2 = (2 − 3·5/8)/(1/2) = 1/4;  x_1 = (2 + 9·1/4 − 2·5/8)/4 = 3/4,

or X = (3/4, 1/4, 5/8)^T.

General procedure: Gauss elimination

Consider a linear system AX = b. Let A = (a_ij), 1 ≤ i,j ≤ n, b = (b_i), 1 ≤ i ≤ n. Let a_ij^(0) = a_ij, 1 ≤ i,j ≤ n,
and a_{i,n+1}^(0) = b_i, 1 ≤ i ≤ n, be the entries of the augmented matrix [A; b]. The first step of the Gauss elimination procedure aims at eliminating the entries of the first column of A below the diagonal by changing the i-th row E_i^(0) into E_i^(0) − (a_i1^(0)/a_11^(0)) E_1^(0):

E_i^(1) = E_i^(0) − m_i1 E_1^(0),  where m_i1 = a_i1^(0) / a_11^(0),  2 ≤ i ≤ n.

The coefficients m_i1 are called multipliers, and the resulting augmented matrix A^(1), formed by the rows E_i^(1), 2 ≤ i ≤ n, as given above, together with E_1^(1) = E_1^(0), is such that all its first-column entries below the diagonal are zero:

A^(1) =
[ a_11^(0)  a_12^(0)  …  a_1n^(0)  a_{1,n+1}^(0) ]
[    0      a_22^(1)  …  a_2n^(1)  a_{2,n+1}^(1) ]
[    0      a_32^(1)  …  a_3n^(1)  a_{3,n+1}^(1) ]
[    …                                           ]
[    0      a_n2^(1)  …  a_nn^(1)  a_{n,n+1}^(1) ],

where a_ij^(1) = a_ij^(0) − m_i1 a_1j^(0), 2 ≤ i ≤ n, 2 ≤ j ≤ n+1.

We proceed to the next step, which eliminates the entries of the second column below the diagonal, and so on, until we obtain an upper triangular system:

A^(n−1) =
[ a_11^(0)  a_12^(0)  …  a_1n^(0)      a_{1,n+1}^(0)   ]
[           a_22^(1)  …  a_2n^(1)      a_{2,n+1}^(1)   ]
[                     …                                ]
[                        a_nn^(n−1)    a_{n,n+1}^(n−1) ].

We have, at each step k, 1 ≤ k ≤ n−1,

a_ij^(k) = a_ij^(k−1) − m_ik a_kj^(k−1),  m_ik = a_ik^(k−1) / a_kk^(k−1),  k+1 ≤ i ≤ n, k+1 ≤ j ≤ n+1.

This procedure, which reduces a full linear system to an upper triangular one, is known as the forward reduction step. A backward substitution is then applied to the triangular system. The resulting computer algorithm can be readily implemented in Matlab or any other advanced computer language. It consists of two natural main steps.

Algorithm: Gauss elimination
0) Initialization
for i = 1, n
  for j = 1, n
    a_ij^(0) = a_ij
  end
  a_{i,n+1}^(0) = b_i
end

1) Forward reduction (uses three nested for loops):
for k = 1, n−1
  for i = k+1, n
    m_ik = a_ik^(k−1) / a_kk^(k−1)
    for j = k+1, n+1
      a_ij^(k) = a_ij^(k−1) − m_ik a_kj^(k−1)    (2.1)
    end j-loop
  end i-loop
end k-loop

2) Backward substitution:
x_n = a_{n,n+1}^(n−1) / a_nn^(n−1)
for k = n−1, …, 1 (backward loop)
  x_k = ( a_{k,n+1}^(k−1) − Σ_{j=k+1}^{n} a_kj^(k−1) x_j ) / a_kk^(k−1)
end

2.3.1 Operation count

The Gauss elimination algorithm above solves the system AX = b in a finite number of operations, which can be counted as follows. The forward reduction step involves

Σ_{k=1}^{n−1} (n−k) = n(n−1)/2

divisions for the evaluation of the multipliers. Here we used the well-known formula for the sum of the first m natural numbers; see Problem 1. The formula for the sum of the first m squares is used below.
The forward reduction also involves

Σ_{k=1}^{n−1} (n+1−k)(n−k) = Σ_{k=1}^{n−1} [ (n−k)² + (n−k) ] = n(n−1)(2n−1)/6 + n(n−1)/2 = n(n+1)(n−1)/3

addition and multiplication pairs for the evaluation of the a_ij^(k)'s. The backward substitution step involves n divisions, Σ_{k=1}^{n−1} (n−k) = n(n−1)/2 multiplications, and n(n−1)/2 additions.

Assuming the multiplication and division operations are performed similarly on a digital computer, the Gauss elimination algorithm above involves

n(n−1)(2n−1)/6 + n(n−1) + n + n(n−1)/2 = n³/3 + n² − n/3

multiplications/divisions. Similarly, for additions and subtractions we get

n(n+1)(n−1)/3 + n(n−1)/2 = n³/3 + n²/2 − 5n/6

additions/subtractions. To leading order this amounts to O(n³) operations. Note that the backward substitution alone involves O(n²) operations, which is much less than the forward reduction step; i.e., the computer spends most of its time in the reduction step and much less in solving the triangular system.

2.3.2 LU factorization

Let L^(1) be the lower triangular matrix with ones on the diagonal, whose first-column entries below the diagonal are the negatives of the multipliers m_k1 from the forward reduction of the Gauss elimination of the matrix A, and which has zeros elsewhere:

L^(1) =
[   1              ]
[ −m_21  1         ]
[ −m_31  0  1      ]
[   …              ]
[ −m_n1  0  …  0 1 ].

We first note that the first step of the forward reduction is equivalent to multiplying the matrix A by L^(1); i.e., we have

A^(1) = L^(1) A.
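As an aside, the forward reduction and backward substitution of section 2.3 can be sketched in a few lines of plain Python, and an instrumented counter confirms the n³/3 + n² − n/3 multiplication/division count derived above. This is a sketch without pivoting, assuming non-zero pivots; the function and variable names are mine.

```python
def gauss_solve(A, b):
    """Solve AX = b by forward reduction + backward substitution (no pivoting).

    Returns (x, muldiv_count). Assumes all pivots a_kk are non-zero."""
    n = len(A)
    # augmented matrix a[i][j]; column j = n holds b
    a = [row[:] + [b[i]] for i, row in enumerate(A)]
    ops = 0
    for k in range(n - 1):                  # forward reduction
        for i in range(k + 1, n):
            m = a[i][k] / a[k][k]; ops += 1
            for j in range(k + 1, n + 1):
                a[i][j] -= m * a[k][j]; ops += 1
    x = [0.0] * n                           # backward substitution
    for k in range(n - 1, -1, -1):
        s = a[k][n] - sum(a[k][j] * x[j] for j in range(k + 1, n))
        ops += n - 1 - k                    # multiplications in the sum
        x[k] = s / a[k][k]; ops += 1
    return x, ops

A = [[4, -9, 2], [2, -4, 4], [-1, 2, 2]]    # the worked example above
x, ops = gauss_solve(A, [2, 3, 1])
print(x)                                    # (3/4, 1/4, 5/8)
n = 3
print(ops, n**3 // 3 + n**2 - n // 3)       # operation count matches the formula
```

For n = 3 the counter gives 17 = 27/3 + 9 − 1, in agreement with the operation count above.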
In general, if we set

L^(k) =
[ 1                         ]
[    …                      ]
[       1                   ]
[      −m_{k+1,k}  1        ]
[         …           …     ]
[      −m_{n,k}   0  …  1   ],

a lower triangular matrix with ones on the diagonal, the negatives of the multipliers of the k-th step of the forward reduction, −m_{j,k}, on its k-th column below the diagonal, and zeros elsewhere, then we can show that

A^(k) = L^(k) A^(k−1) = L^(k) L^(k−1) ··· L^(1) A,  k = 1, 2, …, n−1.   (2.2)

We can easily show that the product L^(n−1) L^(n−2) ··· L^(1) is a lower triangular matrix with ones on the diagonal, and that its inverse is the lower triangular matrix with ones on the diagonal and the multipliers m_ij, i > j, below the diagonal:

L ≡ ( L^(n−1) L^(n−2) ··· L^(1) )^{−1} = ( L^(1) )^{−1} ( L^(2) )^{−1} ··· ( L^(n−1) )^{−1} =
[ 1                          ]
[ m_21  1                    ]
[ m_31  m_32  1              ]
[  …                         ]
[ m_n1  m_n2  …  m_{n,n−1} 1 ].

It is easy to see that

( L^(k) )^{−1} =
[ 1                      ]
[    …                   ]
[      m_{k+1,k}  1      ]
[        …          …    ]
[      m_{n,k}  0  …  1  ];   (2.3)

i.e., ( L^(k) )^{−1} is obtained from L^(k) by flipping the signs of the below-diagonal entries. But the fact that the product ( L^(n−1) L^(n−2) ··· L^(1) )^{−1} has such a simple closed form is one of the small beauties of mathematics.

Exercise: (a) Show that the matrix ( L^(k) )^{−1} given in (2.3) is indeed the inverse of L^(k). (b) Compute the product ( L^(1) )^{−1} ( L^(2) )^{−1} ··· ( L^(n−1) )^{−1}. Hint: do it first for n = 3 and n = 4, then generalize.
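The closed form of the inverse product can also be checked numerically. The Python sketch below does this for n = 3 with arbitrary multiplier values (my choices): it builds L^(1), L^(2), forms the claimed inverse L, and verifies that L^(2) L^(1) L is the identity.

```python
def matmul(P, Q):
    n = len(P)
    return [[sum(P[i][k] * Q[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def L_step(n, k, mcol):
    """L^(k): identity with -m_ik in column k below the diagonal (0-based)."""
    E = [[float(i == j) for j in range(n)] for i in range(n)]
    for i, m in mcol.items():
        E[i][k] = -m
    return E

n = 3
L1 = L_step(n, 0, {1: 0.5, 2: -0.25})   # multipliers m_21, m_31
L2 = L_step(n, 1, {2: -0.5})            # multiplier  m_32
# the claimed inverse: unit lower triangular holding +m_ik
L = [[1.0, 0.0, 0.0], [0.5, 1.0, 0.0], [-0.25, -0.5, 1.0]]
I = matmul(matmul(L2, L1), L)
print(I)   # the identity, confirming (L2 L1)^(-1) = L
```

The same check works for any multiplier values, which is the content of part (b) of the exercise.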
Let U be the upper triangular matrix obtained by the forward reduction of A, i.e., U = A^(n−1):

U =
[ u_11 u_12 … u_1n ]
[      u_22 … u_2n ]
[           …      ]
[             u_nn ],

where u_ij = a_ij^(i−1), 1 ≤ i ≤ j ≤ n. Then, from the identity (2.2), we have the LU factorization of the matrix A:

A = LU,

or

A =
[ 1                          ]   [ a_11^(0) a_12^(0) … a_1n^(0)   ]
[ m_21  1                    ]   [          a_22^(1) … a_2n^(1)   ]
[ m_31  m_32  1              ] × [                   …            ]
[  …                         ]   [                                ]
[ m_n1  m_n2  …  m_{n,n−1} 1 ]   [                     a_nn^(n−1) ].

Example: Find the LU factorization of the matrix A of the previous example:

A = [4 −9 2; 2 −4 4; −1 2 2].

According to the forward reduction step performed above, we have

m_21 = a_21/a_11 = 1/2;  m_31 = a_31/a_11 = −1/4,

i.e.,

L^(1) = [1 0 0; −1/2 1 0; 1/4 0 1]  and  A^(1) = L^(1) A = [4 −9 2; 0 1/2 3; 0 −1/4 5/2].

Moreover,

m_32 = a_32^(1)/a_22^(1) = (−1/4)/(1/2) = −1/2  ⟹  L^(2) = [1 0 0; 0 1 0; 0 1/2 1].

Therefore,

U = A^(2) = L^(2) L^(1) A = [4 −9 2; 0 1/2 3; 0 0 4]  and  L = [1 0 0; 1/2 1 0; −1/4 −1/2 1].

Exercise: Compute the matrix product LU and check that we have indeed A = LU.

Useful properties of the LU factorization:
i) Given a factorization A = LU, the system AX = b is solved in two simple steps:

AX = b ⟺ (LU)X = b ⟺ { LY = b, UX = Y }.

First solve the lower triangular system LY = b for Y by forward substitution, then the upper triangular system UX = Y by backward substitution. This takes O(n²) operations, according to section 2.3.1.

ii) det(A) = det(L) det(U) = det(U) = ∏_{i=1}^n u_ii. We used the facts that the determinant of the product of two matrices is the product of their determinants and that the determinant of a triangular matrix is the product of its diagonal entries (note: det(L) = 1).

iii) To compute the inverse of A, it suffices to solve the n systems AX = e_j, 1 ≤ j ≤ n, where e_j = (0,…,0,1,0,…,0)^T is the canonical unit vector of R^n, with zero entries everywhere except for its j-th component, which is one. The n systems are solved in O(n³) operations by using the LU factorization for each system as in i): each backward and forward substitution step takes O(n²) operations, and multiplying this by the number of systems, n, we get O(n³). If we denote by X^(j) the solution of AX = e_j, then the inverse of A is given by

A^{−1} = [X^(1), X^(2), …, X^(n)].

2.3.3 Partial pivoting or row interchange

Except in some special cases, such as diagonally dominant and positive definite matrices, there is no guarantee that the forward reduction process will work and that the LU factorization exists, even if the matrix A is non-singular. In fact, the forward reduction algorithm described above breaks down if at some step k = 1,…,n−1 the denominator a_kk^(k−1), known as the pivot element, vanishes. The computations also become very unstable to round-off errors if, for some k, the pivot a_kk^(k−1) is much smaller in magnitude than other entries of A^(k−1) in the k-th column below the diagonal, a_ik^(k−1), i > k.
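This sensitivity to a tiny pivot can be seen on a classic 2×2 example (my choice, not from the notes). With ε = 10⁻¹⁷, the exact solution of εx₁ + x₂ = 1, x₁ + x₂ = 2 is very close to (1, 1), but elimination without row interchange in double precision returns x₁ = 0:

```python
def solve2(a, b, pivot):
    """Solve a 2x2 system by elimination, optionally swapping rows first.

    A hedged illustration: eps below is chosen to defeat double precision."""
    A = [row[:] for row in a]; r = b[:]
    if pivot and abs(A[1][0]) > abs(A[0][0]):
        A[0], A[1] = A[1], A[0]
        r[0], r[1] = r[1], r[0]
    m = A[1][0] / A[0][0]              # the multiplier: huge if the pivot is tiny
    A[1][1] -= m * A[0][1]
    r[1] -= m * r[0]
    x2 = r[1] / A[1][1]
    x1 = (r[0] - A[0][1] * x2) / A[0][0]
    return [x1, x2]

eps = 1e-17
A = [[eps, 1.0], [1.0, 1.0]]
b = [1.0, 2.0]
print(solve2(A, b, pivot=False))   # [0.0, 1.0]: x1 is completely wrong
print(solve2(A, b, pivot=True))    # [1.0, 1.0]: row interchange recovers it
```

Without the swap, the multiplier m ≈ 10¹⁷ swamps the entries 1 and 2 during the update, and the information in the second equation is lost.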
To circumvent this problem, we adopt the partial pivoting strategy, which consists of swapping the row E_k of the (augmented) matrix A^(k−1), obtained at the k-th step of the forward reduction algorithm, with a row E_i, i > k, such that |a_ik^(k−1)| > |a_kk^(k−1)|. This row-swapping procedure, known as row interchange, can be thought of as the multiplication of the matrix A^(k−1) by the permutation matrix P^(ik), obtained by interchanging rows i and k of the identity matrix; i.e., P^(ik) is the identity matrix except for the rows k and i, which are interchanged. If, for example, n = 4, then

I = [1 0 0 0; 0 1 0 0; 0 0 1 0; 0 0 0 1],
P^(23) = [1 0 0 0; 0 0 1 0; 0 1 0 0; 0 0 0 1],
P^(14) = [0 0 0 1; 0 1 0 0; 0 0 1 0; 1 0 0 0].

Note that no row interchange is required if |a_kk^(k−1)| ≥ |a_ik^(k−1)|, i > k, and the forward reduction proceeds normally, unless a_ik^(k−1) = 0 for all i ≥ k, which is only possible if A is singular (Exercise). In this case the forward reduction should be halted, and an error message must be returned in the computer code of the Gauss elimination method.
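A small sketch (names mine) confirms that left-multiplication by P^(ik) swaps rows i and k of a matrix, and that P^(ik) is its own inverse, i.e., a second swap undoes the first:

```python
def perm(n, i, k):
    """Permutation matrix P^(ik): identity with rows i and k interchanged (1-based)."""
    P = [[float(r == c) for c in range(n)] for r in range(n)]
    P[i - 1], P[k - 1] = P[k - 1], P[i - 1]
    return P

def matmul(P, Q):
    rows, inner, cols = len(P), len(Q), len(Q[0])
    return [[sum(P[r][t] * Q[t][c] for t in range(inner)) for c in range(cols)]
            for r in range(rows)]

P23 = perm(4, 2, 3)
A = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
PA = matmul(P23, A)
print(PA)               # rows 2 and 3 of A interchanged
print(matmul(P23, P23)) # P^(23) P^(23) = I
```

Since a permutation matrix is orthogonal, P^(ik) being its own inverse also illustrates the identity P^{−1} = P^T used later in this section.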
To maximize the performance (robustness, not necessarily efficiency) of the forward reduction procedure, it is customary to interchange the k-th row with the row holding the largest entry |a_ik^(k−1)|, i ≥ k. Let i_0 ≥ k be such that

|a_{i_0 k}^(k−1)| = max_{k ≤ i ≤ n} |a_ik^(k−1)|.

The strategy is then to swap E_{i_0} with E_k if a_{i_0 k}^(k−1) ≠ 0 and i_0 > k, to stop if a_{i_0 k}^(k−1) = 0, and to continue the reduction algorithm normally otherwise. More precisely, we have the following forward reduction algorithm with partial pivoting.
Algorithm: Gauss elimination with partial pivoting

0) Initialization
for i = 1, n
  for j = 1, n
    a_ij^(0) = a_ij
  end
  a_{i,n+1}^(0) = b_i
end

1) Forward reduction:
for k = 1, n−1

  Test for row interchange:
  [a, i_0] = max( |a_ik^(k−1)|, i ≥ k )
  % Here we adopt the Matlab notation where the command >> [a,i0] = max(X)
  % returns the largest entry of the vector X and its position i0 within the vector.

  if a ≠ 0 and i_0 > k, then make a row interchange:
    temp_j = a_kj^(k−1),  k ≤ j ≤ n+1
    a_kj^(k−1) = a_{i_0 j}^(k−1),  k ≤ j ≤ n+1    (2.4)
    a_{i_0 j}^(k−1) = temp_j,  k ≤ j ≤ n+1
  (Here temp is a temporary vector used to ease the row interchange.)
  if a = 0, stop and return an error message: the matrix A is singular.

  for i = k+1, n
    m_ik = a_ik^(k−1) / a_kk^(k−1)
    for j = k+1, n+1
      a_ij^(k) = a_ij^(k−1) − m_ik a_kj^(k−1)
    end j-loop
  end i-loop
end k-loop

Example: Use Gauss elimination with row interchange to solve

[ 0  1 −1 1 ] [x_1]   [1]
[ 1 −1 −1 2 ] [x_2]   [0]
[−1  1  1 0 ] [x_3] = [1]
[ 1  0  0 2 ] [x_4]   [0].
The augmented matrix has a_11 = 0; thus, to begin, a row interchange is required. Interchanging E_1 and E_2 yields

[ 1 −1 −1 2 | 0 ]
[ 0  1 −1 1 | 1 ]
[−1  1  1 0 | 1 ]
[ 1  0  0 2 | 0 ].

Forward reduction with E_3 = E_3 + E_1, E_4 = E_4 − E_1 yields

[ 1 −1 −1 2 | 0 ]
[ 0  1 −1 1 | 1 ]
[ 0  0  0 2 | 1 ]
[ 0  1  1 0 | 0 ].

An interchange E_3 ↔ E_4 yields

[ 1 −1 −1 2 | 0 ]
[ 0  1 −1 1 | 1 ]
[ 0  1  1 0 | 0 ]
[ 0  0  0 2 | 1 ].

Finally, we make E_3 = E_3 − E_2 to obtain the upper triangular system

[ 1 −1 −1  2 |  0 ]
[ 0  1 −1  1 |  1 ]
[ 0  0  2 −1 | −1 ]
[ 0  0  0  2 |  1 ],

and the solution is given by

x_4 = 1/2;  x_3 = (−1 + x_4)/2 = −1/4;  x_2 = 1 + x_3 − x_4 = 1/4;  x_1 = 0 + x_2 + x_3 − 2x_4 = −1,

or X = (−1, 1/4, −1/4, 1/2)^T. Now, let

U = [1 −1 −1 2; 0 1 −1 1; 0 0 2 −1; 0 0 0 2]
be the upper triangular matrix above, and let

P = [0 1 0 0; 1 0 0 0; 0 0 0 1; 0 0 1 0]

be the permutation matrix obtained from the identity matrix by interchanging the first and second rows (E_1 ↔ E_2) and the third and fourth rows (E_3 ↔ E_4). From the calculation above, we see that if Gauss elimination were performed on the matrix PA, i.e., the matrix A whose first and second rows and whose third and fourth rows were interchanged, respectively, we would obtain the LU factorization

PA = LU,  where  L = [1 0 0 0; 0 1 0 0; 1 1 1 0; −1 0 0 1]

is the lower triangular matrix holding the multipliers involved in the calculation of U, given above. By noting that a permutation matrix is orthogonal and satisfies P^{−1} = P^T, we have the following factorization of A:

A = P^{−1} LU = P^T LU = (P^T L) U.

Now we are ready to state the following theorem, without proof.

Theorem 1 (LU factorization) If A is a non-singular n×n matrix, then there exists a permutation matrix P such that PA has an LU factorization, where U is upper triangular and L is lower triangular: PA = LU.

When the matrix A admits an LU factorization without row interchange, the theorem above applies with P = I, the identity matrix; in this case the matrix A is simply said to have an LU factorization. As pointed out above, there are a few special classes of matrices for which an LU factorization is possible without row interchange. This is the case if, for example, the matrix A is symmetric positive definite or diagonally dominant. The proofs of these results and other properties of positive definite matrices can be found in any good textbook on numerical analysis (e.g., Burden and Faires). The definitions of a diagonally dominant and a positive definite matrix are given next.
Definition 5 An n×n matrix is said to be strictly diagonally dominant if its entries a_ij, 1 ≤ i,j ≤ n, satisfy

|a_ii| > Σ_{j=1, j≠i}^{n} |a_ij|,  for all i, 1 ≤ i ≤ n.

Example: The matrix

A = …

is strictly diagonally dominant, while

B = …

is not.

Definition 6 A symmetric matrix A is said to be positive definite if the quadratic form associated with A is positive definite:

X^T A X = Σ_{i=1}^{n} Σ_{j=1}^{n} a_ij x_i x_j > 0,  for all X = (x_1, x_2, …, x_n)^T ≠ (0, 0, …, 0)^T.

Example: For

A = [2 −1 0; −1 2 −1; 0 −1 2],

we have

X^T A X = 2x_1² − 2x_1x_2 + 2x_2² − 2x_2x_3 + 2x_3² = x_1² + (x_1 − x_2)² + (x_2 − x_3)² + x_3² > 0,

unless x_1 = x_2 = x_3 = 0.

It follows from the definitions above that diagonally dominant and positive definite matrices are non-singular, as stated in the theorem below.

Theorem 2 Both strictly diagonally dominant and positive definite matrices are non-singular.

To prove this result in the diagonally dominant case, it is easiest to proceed by contradiction. Assume that A is diagonally dominant and that AX = 0 has a non-zero solution X. Let X ≠ 0 be this solution and let k, 1 ≤ k ≤ n, be such that |x_k| = max_{1≤i≤n} |x_i| > 0. Then AX = 0 means Σ_{j=1}^n a_ij x_j = 0 for all i, 1 ≤ i ≤ n. Taking i = k, we get

a_kk x_k = − Σ_{j=1, j≠k}^{n} a_kj x_j  ⟹  |a_kk| |x_k| ≤ Σ_{j=1, j≠k}^{n} |a_kj| |x_j| ≤ |x_k| Σ_{j=1, j≠k}^{n} |a_kj|,

because |x_j| ≤ |x_k| for all j ≠ k.
Since |x_k| > 0, we arrive at

|a_kk| ≤ Σ_{j=1, j≠k}^{n} |a_kj|,

which contradicts the fact that A is strictly diagonally dominant.

With a different reasoning, we can also show that a positive definite matrix is non-singular. To do so, we first establish two (intermediate) properties of positive definite matrices.

The eigenvalues of a positive definite matrix are all positive. Indeed, let (λ, V) be an eigenvalue and eigenvector pair for the matrix A, i.e., such that AV = λV and V ≠ 0. We have

V^T A V = V^T (λV) = λ ‖V‖_2² > 0  ⟹  λ > 0.

The diagonal entries of a positive definite matrix are strictly positive: a_ii > 0, i = 1,…,n. This can be easily shown by taking X = e_j, j = 1,…,n, in the quadratic form X^T A X, where e_j is the canonical unit vector of R^n. The positivity of the a_ii's results from the fact that

e_j^T A e_j = a_jj,  j = 1,…,n.   (2.5)

Combining the two items above yields

det(A) = ∏_{j=1}^{n} λ_j > 0,

where λ_j, j = 1,…,n, are the eigenvalues of A (counting multiplicities); in particular, A is non-singular.

Perhaps the most important property of positive definite matrices is their factorization in the form A = LL^T, known as Cholesky's factorization (see below), where L is a lower triangular matrix (whose diagonal is not necessarily filled with ones).

2.3.4 Cholesky's LL^T factorization
45 Note that if we have a factorization A = LL T for A with L non singular, then necessarily A is symmetric positive definite. It is symmetric because A T = (LL T ) T = (L T ) T L T = LL T = A and it is positive definite because if X R n,x 0, then X T AX = X T LL T X = (X T L)(L T X) = (L T X) T (L T X) = n n ( l ij x i ) 2 > 0; theeuclideannormofthevectorl T X isstrictlypositivesincelisnonsingular(soisl T ), therefore L T X 0. To show that the factorization is necessary is beyond the scope of these notes (for a complete proof see Burden and Faires, for example). Nevertheless, we give below the algorithm which permits to compute the lower triangular matrix L for a given positive definite matrix. Let L = l ij,1 j i n. Then, L T = l ij = l ji,1 i j n and (LL T ) ij = n k=1 Moving column by column, we have for j = 1, and For j 2, we have a jj = i > j : a ij = mini,j l ik lkj = k=1 l 2 11 = a 11 = l 11 = a 11 l ik l jk = a ij. l 11 l i1 = a i1,i = 2,,n = l i1 = a i1 l 11. j=1 i=1 l ij = 0, if i < j ( j j 1 ljk 2 = j 1 1/2 ljk 2 +l2 jj = l jj = a jj ljk) 2. k=1 k=1 k=1 k=1 k=1 j j 1 l ik l jk = l ik l jk +l ij l jj = l ij = a ij j 1 k=1 l ikl jk. l jj Algorithm: Cholesky factorization 0) Read A = (a ij ),i,j = 1, n 1) Set l 11 = a 11 2) For i = 2,,n l i1 = a i1 /l 11 End 3) For j = 2,,n ( l jj = a jj j 1 1/2. k=1 jk) l2 { 4) For i = j +1,n l ij = a ij j 1 k=1 l ikl jk }/l jj End i loop End j loop 44
Important remark: Note that the algorithm above fails whenever the term inside the radical in 1) or 3) is negative, or if the denominator in 2) or 4) is zero. In fact, it can be shown that A is positive definite if and only if the above algorithm goes through without a glitch, i.e., l_jj is real and positive for all 1 ≤ j ≤ n. It is thus a good idea to include in a code of Cholesky's method a warning message saying that A is not positive definite whenever the square root of a negative number or a division by zero is encountered.

Example: Determine whether the following matrix is positive definite; if so, find its Cholesky factorization:

A = … .

Cholesky's algorithm above applied to A yields

L = … ,

which implies that A admits a Cholesky factorization A = LL^T, which in turn implies that A is positive definite.

2.3.5 LU factorization in Matlab

The LU factorization can be easily performed in Matlab on any non-singular square matrix A. The Matlab function for this purpose is simply called lu and can be invoked directly from the command window. It returns the three matrices L, U, P such that PA = LU. This is illustrated by the simple example below.

>> A = [1 2 -1; 1 2 3; 2 -1 4]
>> [L,U,P] = lu(A)
L = …
U = …
P = …

Exercise: Give your own interpretation of these Matlab results.

2.4 Vector and matrix norms

Definition 7 (Vector norms) Let X = (x_1, x_2, …, x_n)^T denote a vector in R^n. A vector norm is a function, denoted by double vertical bars ‖·‖, from R^n to [0, +∞) that satisfies the following conditions:
1) ‖X‖ = 0 ⟺ X = 0.
2) ‖cX‖ = |c| ‖X‖ for all scalars c ∈ R (|·| is the absolute value).
3) ‖X + Y‖ ≤ ‖X‖ + ‖Y‖ (triangle inequality).

The following are the most commonly used vector norms.

i) Euclidean or L² norm: ‖X‖_2 = ( Σ_{i=1}^{n} x_i² )^{1/2}.
ii) L¹ norm: ‖X‖_1 = Σ_{i=1}^{n} |x_i|.
iii) Max or L^∞ norm: ‖X‖_∞ = max_{1≤i≤n} |x_i|.
iv) L^p norm, for p > 1: ‖X‖_p = ( Σ_{i=1}^{n} |x_i|^p )^{1/p}.

We can show that all these examples satisfy properties 1), 2), and 3) of a vector norm. Here we do it for the case of the L¹ norm, as an example; the remaining three cases are left as an exercise. Proving the triangle inequality for the L² and L^p norms is a bit tricky, though (see Problems 2, 3,
and 4); they use the Cauchy–Schwarz and Hölder inequalities, respectively. For the 1-norm, we have

‖X‖_1 = 0 ⟺ Σ_{i=1}^{n} |x_i| = 0 ⟺ x_i = 0, i = 1,…,n ⟺ X = (0, 0, …, 0)^T,

the zero vector, and

‖cX‖_1 = Σ_{i=1}^{n} |c x_i| = |c| Σ_{i=1}^{n} |x_i| = |c| ‖X‖_1.

It remains to show that the triangle inequality 3) is also satisfied:

‖X + Y‖_1 = Σ_{i=1}^{n} |x_i + y_i| ≤ Σ_{i=1}^{n} ( |x_i| + |y_i| ) = Σ_{i=1}^{n} |x_i| + Σ_{i=1}^{n} |y_i| = ‖X‖_1 + ‖Y‖_1.

Example: Let X = (1, 0, 2, −1)^T, a vector in R^4. Then

‖X‖_2 = √(1 + 0 + 4 + 1) = √6;  ‖X‖_1 = 4;  ‖X‖_∞ = 2.

Vector norms in Matlab: In Matlab there is a predefined function named norm: >> norm(V,p) returns the L^p norm of a given vector V. If p is not specified, the default is the L² norm, i.e., >> norm(V) is equivalent to >> norm(V,2). Evidently, norm(V,inf) returns the max norm of V.

Definition 8 (Matrix norm) A matrix norm is a function, denoted by ‖·‖, defined on the space of square matrices and taking values in [0, +∞), such that
1) ‖A‖ = 0 ⟺ A = 0,
2) ‖cA‖ = |c| ‖A‖,
3) ‖A + B‖ ≤ ‖A‖ + ‖B‖,
4) ‖AB‖ ≤ ‖A‖ ‖B‖.

Note that properties 1), 2), 3) are those of a vector norm. For simplicity of exposition we denote both the vector and matrix norms by ‖·‖. When there is a risk of confusion, we use ‖·‖_v and ‖·‖_M to denote vector and matrix norms, respectively.

Definition 9 (Subordinate matrix norm) Given a vector norm ‖·‖_v, we can show that

‖A‖_M = max_{‖X‖_v = 1} ‖AX‖_v = sup_{X ≠ 0} ‖AX‖_v / ‖X‖_v

is a matrix norm, called the matrix norm subordinate to the given vector norm.
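Before moving on to subordinate norms, the vector norms of Definition 7 can be checked in a few lines of Python, including the triangle inequality for the example vector above (the second vector Y is an arbitrary choice of mine):

```python
def norm1(X):
    return sum(abs(x) for x in X)

def norm2(X):
    return sum(x * x for x in X) ** 0.5

def norminf(X):
    return max(abs(x) for x in X)

X = [1.0, 0.0, 2.0, -1.0]      # the example vector above
Y = [0.5, -2.0, 1.0, 3.0]      # a second, arbitrary vector

print(norm1(X))                # 4.0
print(norminf(X))              # 2.0
print(norm2(X))                # sqrt(6) ~ 2.4495

# triangle inequality, property 3), for all three norms:
Z = [x + y for x, y in zip(X, Y)]
for n in (norm1, norm2, norminf):
    print(n(Z) <= n(X) + n(Y))
```

The same loop can be reused with random vectors as an informal sanity check; a proof, of course, is what the exercise above asks for.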
As a consequence, we have the following inequality (known as the compatibility condition) for any given subordinate matrix norm:

‖AX‖_v ≤ ‖A‖_M ‖X‖_v.   (2.6)

Examples: The common subordinate norms are those associated with the L^∞, L¹, and L² vector norms. They are denoted respectively by ‖A‖_∞, ‖A‖_1, and ‖A‖_2, though there is a risk of confusing them with the vector norms they originate from. Interestingly, however, all three of these subordinate norms are known in closed form, and therefore there is no need to compute a maximum (see Problem 6). They are given as follows.

i) The L^∞ subordinate norm (maximum absolute row sum):

‖A‖_∞ ≡ max_{‖X‖_∞ = 1} ‖AX‖_∞ = max_{1≤i≤n} Σ_{j=1}^{n} |a_ij|.

ii) The L¹ subordinate norm (maximum absolute column sum):

‖A‖_1 ≡ max_{‖X‖_1 = 1} ‖AX‖_1 = max_{1≤j≤n} Σ_{i=1}^{n} |a_ij|.

iii) The L² subordinate norm:

‖A‖_2 ≡ max_{‖X‖_2 = 1} ‖AX‖_2 = ( ρ(AA^T) )^{1/2},

where ρ(AA^T) is the spectral radius of the matrix AA^T, given by ρ(B) = max{ |λ| : λ is an eigenvalue of B }.

It follows immediately that if A is symmetric, then ‖A‖_2 = ρ(A).

Exercise: (a) Show that ‖B‖_2 = ρ(B) if B is symmetric. (b) Show that ‖A‖_2² = ‖A^T A‖_2. (c) Deduce that ‖A‖_2 = √(ρ(A^T A)).

Solution: Here we show that the subordinate matrix norm ‖A‖_2 ≡ max_{‖X‖_2 = 1} ‖AX‖_2
satisfies ||A||_2 = (ρ(A^T A))^{1/2}. We do this in two steps. First we show that ||B||_2 = ρ(B) if B is symmetric; then we show that ||A||_2^2 = ||A^T A||_2 for any matrix A.

Step 1: ||B||_2 = ρ(B) if B is symmetric.

First, we show that ρ(A) ≤ ||A||_2 for any matrix A. Let (λ,V) be an eigenvalue-eigenvector pair of A (i.e., AV = λV) with ||V||_2 = 1. We have

|λ|^2 = |λ|^2 ||V||_2^2 = ||λV||_2^2 = ||AV||_2^2 ≤ ||A||_2^2.

Thus

ρ(A)^2 = (max{|µ|, µ is an eigenvalue of A})^2 ≤ ||A||_2^2.

In fact, this shows that the inequality ρ(A) ≤ ||A|| is valid for any given subordinate matrix norm ||.||.

It remains to prove that ||B||_2 ≤ ρ(B) for any symmetric matrix B. Recall that a fundamental theorem of linear algebra states that the eigenvalues of a symmetric matrix are real and that the associated eigenvectors form an orthonormal basis of R^n. Let λ_1, λ_2, ..., λ_n be the n real eigenvalues of B (counting multiplicities) and V_1, V_2, ..., V_n the associated orthonormal basis of eigenvectors. Then every vector X ∈ R^n can be uniquely written as a linear combination of the V_i's:

X = Σ_{i=1}^n α_i V_i.

We have

||BX||_2 = ||B Σ_{i=1}^n α_i V_i||_2 = ||Σ_{i=1}^n α_i BV_i||_2 = ||Σ_{i=1}^n α_i λ_i V_i||_2 = ( Σ_{i=1}^n λ_i^2 α_i^2 )^{1/2}
        ≤ max_{1≤i≤n} |λ_i| ( Σ_{i=1}^n α_i^2 )^{1/2} = ρ(B) ||X||_2,

i.e.,

max_{||X||_2 = 1} ||BX||_2 ≤ ρ(B),

which is what we wanted. We note that expanding ||Σ α_i λ_i V_i||_2 and ||X||_2 in terms of the coefficients α_i is valid because the vectors V_i, i = 1,...,n, are orthonormal.
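Step 1 is easy to check numerically. The following Python/NumPy sketch (an illustration of our own; the seeded random symmetric matrix is an arbitrary test case) verifies ||B||_2 = ρ(B):

```python
import numpy as np

# A random symmetric test matrix (arbitrary; seeded for reproducibility)
rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))
B = (M + M.T) / 2                              # symmetrize M

norm2 = np.linalg.norm(B, 2)                   # subordinate 2-norm of B
rho = np.max(np.abs(np.linalg.eigvalsh(B)))    # spectral radius of B

assert np.isclose(norm2, rho)                  # Step 1: ||B||_2 = rho(B)
```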
Step 2: ||A||_2^2 = ||A^T A||_2.

Here A = (a_ij) denotes an n × n matrix, and X = (x_i) and Y = (y_i) denote generic vectors in R^n. Let ⟨X,Y⟩ = Σ_{i=1}^n x_i y_i be the scalar (or dot) product in R^n. By definition of the scalar product of vectors, we have

(i) ||X||_2^2 = ⟨X,X⟩ = X^T X = Σ_{i=1}^n x_i^2,

(ii) ⟨AX,Y⟩ = Σ_{i=1}^n ( Σ_{j=1}^n a_ij x_j ) y_i = Σ_{j=1}^n x_j ( Σ_{i=1}^n a_ij y_i ) = ⟨X, A^T Y⟩,

(iii) |⟨X,Y⟩| ≤ ||X||_2 ||Y||_2.  (2.7)

The last inequality is known as the Cauchy-Schwarz inequality (see Problem 2). Using successively (i), (ii), (iii) above, and property (2.6) of a subordinate matrix norm, we have

||AX||_2^2 = ⟨AX,AX⟩ = ⟨X, A^T AX⟩ ≤ ||X||_2 ||A^T AX||_2 ≤ ||A^T A||_2 ||X||_2^2;

dividing both sides by ||X||_2^2 and taking the max over all X ≠ 0 yields

||A||_2^2 ≤ ||A^T A||_2.

It remains to show the inequality in the other direction, i.e., that ||A^T A||_2 ≤ ||A||_2^2. We will exploit Step 1. Namely, we use the fact that ||A^T A||_2 = ρ(A^T A), because A^T A is always symmetric. Let λ_m be the eigenvalue of A^T A such that ρ(A^T A) = |λ_m| and let V_m be the eigenvector of A^T A associated with λ_m: A^T A V_m = λ_m V_m. Assume that ||V_m||_2 = 1. We have

||A^T A||_2^2 = |λ_m|^2 = |λ_m|^2 ||V_m||_2^2 = ⟨λ_m V_m, λ_m V_m⟩ = ⟨A^T A V_m, λ_m V_m⟩ = λ_m ⟨AV_m, AV_m⟩ = λ_m ||AV_m||_2^2 ≤ λ_m ||A||_2^2 ||V_m||_2^2 = λ_m ||A||_2^2,

i.e.,

λ_m^2 ≤ λ_m ||A||_2^2,

which implies that λ_m ≤ ||A||_2^2 if λ_m ≠ 0. (But if λ_m = 0, this statement becomes trivial, because it implies that A = 0, thanks to the previously shown inequality ||A||_2^2 ≤ ||A^T A||_2.)
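Combining the two steps, ||A||_2^2 = ρ(A^T A) should hold for an arbitrary, not necessarily symmetric, matrix. A quick numerical confirmation (a Python/NumPy sketch of our own, with a seeded random test matrix):

```python
import numpy as np

# A random (non-symmetric) test matrix, seeded for reproducibility
rng = np.random.default_rng(1)
A = rng.standard_normal((5, 5))

norm2 = np.linalg.norm(A, 2)                          # subordinate 2-norm
rho_AtA = np.max(np.abs(np.linalg.eigvals(A.T @ A)))  # spectral radius of A^T A

# Steps 1 and 2 combined: ||A||_2^2 = ||A^T A||_2 = rho(A^T A)
assert np.isclose(norm2**2, rho_AtA)
```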
Conclusion: We showed that ||A||_2^2 = ||A^T A||_2 and that ||A^T A||_2 = ρ(A^T A), since this matrix is symmetric; combined, these two statements imply ||A||_2 = sqrt(ρ(A^T A)).

Frobenius norm

A well-known (and widely used) matrix norm which is not a subordinate norm is the Frobenius norm, given by

||A||_F = ( Σ_{i=1}^n Σ_{j=1}^n a_ij^2 )^{1/2},

which should not be confused with the L_2 norm.

Definition 10 A matrix norm ||.||_M is said to be compatible with a vector norm ||.||_v if ||AX||_v ≤ ||A||_M ||X||_v. It is easy to see, from the definition, that a subordinate matrix norm is compatible with the vector norm it originated from. We have the following important theorem, whose proof is left as an exercise.

Theorem 4 If ||.|| is a matrix norm compatible with a vector norm, then ρ(A) ≤ ||A||, for any square matrix A.

Matrix norm in matlab

The function norm works for matrices in the same way as for vectors: >>norm(A) returns the 2-norm of the matrix A, >>norm(A,p) returns the subordinate p-norm of A, and >>norm(A,'fro') returns the Frobenius norm of A.

2.5 Condition number and conditioning

As mentioned above, apart from its inefficiency for very large systems, the main danger with Gauss elimination is that in some situations round-off errors may amplify over the successive operations on the elements of A, so that Gauss elimination may produce a poor approximate solution. As we saw above, partial pivoting may help prevent some of this from happening, but there are cases where it cannot help at all. Here we perform a systematic error analysis for the solution of AX = b and arrive at an important metric on the matrix A, known as the condition number,
which indicates a priori whether a numerical solution of the system AX = b will be a good approximation or not. It provides an upper bound estimate for the numerical error.

Consider the system AX = b and suppose that we are using a certain numerical method to solve it, e.g., Gauss elimination. The idea is to see whether the problem AX = b is well conditioned. Consider the perturbed problem

A(X + δX) = b + δb.  (2.8)

(Here δX denotes a perturbation of the vector X, not δ times X: the original data b yields the solution X, and the perturbed data b + δb yields the solution X + δX!) By linearity, (2.8) becomes

AX + A(δX) = b + δb  ⟹  A(δX) = δb, or δX = A^{-1}(δb).

Now let ||.||_M be a matrix norm compatible with the vector norm ||.||_v. Then

||δX||_v ≤ ||A^{-1}||_M ||δb||_v,

and, using the fact that ||b||_v = ||AX||_v ≤ ||A||_M ||X||_v, we get

||δX||_v / (||A||_M ||X||_v) ≤ ||A^{-1}||_M ||δb||_v / ||b||_v.

Thus, we have the following upper bound on the relative error of the solution:

||δX||_v / ||X||_v ≤ ||A||_M ||A^{-1}||_M ||δb||_v / ||b||_v.

The quantity ||A||_M ||A^{-1}||_M determines whether the relative error of the perturbed solution can be significantly larger than the relative size of the perturbation δb. It is called the condition number and is denoted by

cond(A) = ||A||_M ||A^{-1}||_M;

it is obviously dependent on the choice of the matrix norm.

Similarly, we may consider a perturbation of the matrix A, which in practice can be interpreted as round-off errors in the entries of A:

(A + δA)(X + δX) = b.

We assume that the perturbation δA is small enough so that A + δA remains non-singular, i.e., the perturbed system has a unique solution X + δX. Expanding the perturbed system yields

AX + (δA)X + A(δX) + (δA)(δX) = b  ⟹  A(δX) = −(δA)(X + δX)  ⟹  δX = −A^{-1}(δA)(X + δX).

Thus

||δX||_v ≤ ||A^{-1}||_M ||δA||_M ||X + δX||_v  ⟹  ||δX||_v / ||X + δX||_v ≤ ||A^{-1}||_M ||A||_M (||δA||_M / ||A||_M) = cond(A) ||δA||_M / ||A||_M,

which yields the same conclusion as in the case where b is perturbed instead: the relative perturbation of the solution is bounded by the relative perturbation in the data times the condition number. Therefore, we make the following general statement:
The linear system AX = b is ill-conditioned, and its solution becomes very sensitive to round-off errors, if the condition number is large: cond(A) >> 1.

As a rule of thumb, as confirmed by the following example and by Problem 17 below, in a fl-pt system with a precision of k digits (in base 10), the ill-conditioning of a matrix A will be problematic (will induce large round-off errors) if cond(A) is on the order of 10^k. In the matlab environment (which uses double precision computer arithmetic), for example, we qualify as large a condition number on the order of 10^16 or higher.

Example: A peculiar academic example of an ill-conditioned matrix is the Hilbert matrix: H_ij = 1/(i+j−1), 1 ≤ i,j ≤ n. For n = 3 we have

H = [ 1    1/2  1/3
      1/2  1/3  1/4
      1/3  1/4  1/5 ].

Consider the linear system HX = b with b = (11/6, 13/12, 47/60)^T, whose (exact) solution is X = (1,1,1)^T. Imagine H and b are truncated to the first three significant digits to yield the system Ĥ X̂ = b̂ where

Ĥ = [ 1      0.500  0.333
      0.500  0.333  0.250
      0.333  0.250  0.200 ],  b̂ = (1.83, 1.08, 0.783)^T.

Then Ĥ X̂ = b̂ yields X̂ ≈ (1.0895, 0.4879, 1.4910)^T. The absolute error in the 1-norm is given by

||X − X̂||_1 ≈ 0.0895 + 0.5121 + 0.4910 ≈ 1.093,

and the relative error is

||X − X̂||_1 / ||X||_1 ≈ 1.093/3 ≈ 0.364 = 36.4%.

Similarly bad (if not worse) results will be obtained if we start with the original data but use Gauss elimination in 3-digit fl-pt arithmetic. See for instance Problem 18. This clearly suggests that the Hilbert matrix with n = 3 is ill-conditioned. The condition number of this matrix is readily computed in matlab: cond(H_3) ≈ 524, which is not particularly large in absolute terms, but for 3-digit precision it is. We will see in Problem 17 below that the condition number of the Hilbert matrix increases considerably with increasing n, and the matrix ultimately becomes severely ill-conditioned when n is large enough.

Important remarks: Ill-conditioning is not restricted to the Hilbert matrix and/or to large matrices. Many other matrices encountered in practical problems can be ill-conditioned.
For example, any symmetric matrix whose ratio of largest to smallest eigenvalue (in magnitude) is large is ill-conditioned (see Problem 18).

As stated above, the condition number depends on the choice of the norm. However, since all vector and matrix norms are equivalent, it can be shown that for all practical purposes this dependence on the choice of norm is not important, and any one of the norms listed here, for example, can be used to assess effectively whether a given matrix is ill-conditioned or not.

It is interesting to note that for the 2-norm we have

cond_2(A) ≡ ||A||_2 ||A^{-1}||_2 = ρ(AA^T)^{1/2} ρ((A^T)^{-1} A^{-1})^{1/2} = ( max{|λ|, λ eigenvalue of AA^T} / min{|λ|, λ eigenvalue of AA^T} )^{1/2},
since the eigenvalues of the inverse of a matrix are the inverses of the eigenvalues of the matrix: if λ is an eigenvalue of A, then 1/λ is an eigenvalue of A^{-1}. If in particular A is symmetric, then

cond_2(A) = |λ_max| / |λ_min|,

where λ_max and λ_min represent, respectively, the largest and smallest (in magnitude) eigenvalues of A.

The condition number in matlab: The matlab function cond(A) returns the condition number of the matrix A with respect to the L_2 norm. The condition number with a specific p-norm is obtained by cond(A,p), in the same way as the matrix norm works.

2.6 Problems

1. Use mathematical induction to show that

Σ_{j=1}^m j = m(m+1)/2,  Σ_{j=1}^m j^2 = m(m+1)(2m+1)/6,  and  Σ_{j=1}^m j^3 = m^2(m+1)^2/4.

2. Follow the steps below to prove the Cauchy-Schwarz and Hölder inequalities.

(a) Show that for any given two vectors X, Y ∈ R^n, we have the Cauchy-Schwarz inequality:

( Σ_{i=1}^n x_i y_i )^2 ≤ ( Σ_{i=1}^n x_i^2 )( Σ_{i=1}^n y_i^2 ).

Hint: Consider the discriminant of the quadratic form f(t) = ||X + tY||_2^2.

(b) Follow the steps below to show that for any given two vectors X, Y ∈ R^n, we have Hölder's inequality

Σ_{i=1}^n |x_i y_i| ≤ ( Σ_{i=1}^n |x_i|^p )^{1/p} ( Σ_{i=1}^n |y_i|^q )^{1/q},  1/p + 1/q = 1.

(This one is in fact more involved.)

i. Consider the function f(t) = t − t^p/p, t ≥ 0, p > 1. Prove that f(t) ≤ f(1) for all t ≥ 0.

ii. Deduce from 2(b)i that |x_i y_i| ≤ (1/p)|x_i|^p + (1/q)|y_i|^q for all p, q with 1/p + 1/q = 1. Hint: let t = |x_i| / |y_i|^{q−1} when y_i ≠ 0.

iii. Deduce that

Σ_{i=1}^n |x_i y_i| ≤ (1/p) a^p ||X||_p^p + (1/q) (1/a^q) ||Y||_q^q,  a > 0.

iv. Deduce Hölder's inequality. Hint: choose a in 2(b)iii so that the two terms on the right-hand side balance.
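Before attempting the proofs in Problem 2, it can be reassuring to check the two inequalities numerically. The following is a Python sketch of our own (random test vectors, and the arbitrary choice p = 3, q = 3/2 so that 1/p + 1/q = 1):

```python
import numpy as np

# Random test vectors (arbitrary, seeded for reproducibility)
rng = np.random.default_rng(2)
x = rng.standard_normal(10)
y = rng.standard_normal(10)

# Cauchy-Schwarz: (sum x_i y_i)^2 <= (sum x_i^2)(sum y_i^2)
assert (x @ y) ** 2 <= (x @ x) * (y @ y) + 1e-12

# Hölder with p = 3, q = 3/2 (so that 1/p + 1/q = 1)
p, q = 3.0, 1.5
lhs = np.sum(np.abs(x * y))
rhs = np.sum(np.abs(x) ** p) ** (1 / p) * np.sum(np.abs(y) ** q) ** (1 / q)
assert lhs <= rhs + 1e-12
```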
3. Use the two steps below to prove the triangle inequality for the L_p norm, p ≥ 1, also known as Minkowski's inequality.

(a) Show that for all x, y ∈ R and p > 1 we have

|x+y|^p ≤ |x+y|^{p−1}|x| + |x+y|^{p−1}|y|.

(b) Use Hölder's inequality and the result above to show that

||X+Y||_p^p ≤ ||X+Y||_p^{p−1} ||X||_p + ||X+Y||_p^{p−1} ||Y||_p,

and deduce Minkowski's inequality.

4. Show that the following expressions define vector norms on R^n.

(a) ||X||_∞ = max_{1≤i≤n} |x_i|.
(b) ||X||_2 = ( Σ_{i=1}^n x_i^2 )^{1/2}.
(c) ||X||_p = ( Σ_{i=1}^n |x_i|^p )^{1/p}, p > 1.

5. Let ||.|| be a matrix norm which is compatible with some vector norm. Show that ρ(A) ≤ ||A||.

6. Show the following identities for the given subordinate matrix norms:

i) The L_∞ subordinate norm

||A||_∞ ≡ max_{||X||_∞ = 1} ||AX||_∞ = max_{1≤i≤n} Σ_{j=1}^n |a_ij|

ii) The L_1 subordinate norm

||A||_1 ≡ max_{||X||_1 = 1} ||AX||_1 = max_{1≤j≤n} Σ_{i=1}^n |a_ij|

iii) The L_2 subordinate norm

||A||_2 ≡ max_{||X||_2 = 1} ||AX||_2 = ( ρ(AA^T) )^{1/2}

where ρ(AA^T) is the spectral radius of the matrix AA^T, given by ρ(B) = max{|λ|, λ is an eigenvalue of B}.

7. Use Gauss elimination with backward substitution to solve the following linear systems (keep all the correct digits shown on your calculator).

a. 4x_1 − x_2 + x_3 = 8,  2x_1 + 5x_2 + 2x_3 = 3,  x_1 + 2x_2 + 4x_3 = 0.
b. 4x_1 + x_2 + 2x_3 = 9,  2x_1 + 4x_2 − x_3 = 5,  x_1 + 2x_2 − 3x_3 = 9.

Check (the accuracy of) your results in matlab, using the backslash division: >>X=A\b.
8. Repeat 7, but keep only the first two digits after each operation. What is the relative error compared to the (exact) solution found in 7? To compute the relative error in matlab between two vectors X1 and X2, you can use the norm command: >>RelEr = norm(X1-X2)/norm(X2). What can you say about your approximation?

9. Use Gauss elimination to solve the following systems, if possible, and determine whether row interchanges are necessary.

a. x_1 − x_2 + 3x_3 = 2,  3x_1 − 3x_2 + x_3 = 1,  x_1 + x_2 = 0.
b. x_2 − 2x_3 = 4,  x_1 − x_2 + x_3 = 6,  x_1 − x_3 = 2.
c. 2x_1 = 3,  x − x_2 = 1,  3x − x_3 = 0,  2x_1 − 2x_2 + x_3 + x_4 = 0.
d. x_1 + x_2 + 4x_4 = 2,  2x_1 + x_2 − x_3 + x_4 = 1,  x_1 + 2x_2 + 3x_3 − x_4 = 4,  3x_1 − x_2 − x_3 + 2x_4 = .

10. Repeat a), b), c), and d) of Question 9 in matlab. Use the command lu (>>[L,U,P]=lu(A)) to compute the LU factorization of each matrix and to find out whether row interchanges were necessary. Verify, in matlab, that PA = LU.

11. Use the LU decomposition found with matlab above to solve system (d) again, by hand. You can proceed as follows. Solve LY = Pb, then solve UX = Y.

12. Repeat 11 in matlab. Use the function Usolve below to solve the system UX = Y, and a function Lsolve, of your own, to solve the lower-triangular system LY = Pb. Call the matlab editor window and create a new M-file. Type the Usolve function below, then save it as Usolve.m in your matlab working directory.

function X=Usolve(U,b);
%solves the upper-triangular system U*X=b
n = max(size(U)); %gets system size n
X(n) = b(n)/U(n,n);
for k=n-1:-1:1
X(k) = (b(k) - sum(U(k,k+1:n).*X(k+1:n)))/U(k,k);
end
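For readers working outside matlab, here is one possible Python/NumPy analogue of Usolve, together with a forward-substitution counterpart in the spirit of the Lsolve that Problem 12 asks for (function names and the small test system are our own; treat this as a solution sketch, not the notes' reference implementation):

```python
import numpy as np

def usolve(U, b):
    """Back substitution for an upper-triangular system U X = b."""
    n = len(b)
    X = np.zeros(n)
    for k in range(n - 1, -1, -1):
        X[k] = (b[k] - U[k, k + 1:] @ X[k + 1:]) / U[k, k]
    return X

def lsolve(L, b):
    """Forward substitution for a lower-triangular system L Y = b."""
    n = len(b)
    Y = np.zeros(n)
    for k in range(n):
        Y[k] = (b[k] - L[k, :k] @ Y[:k]) / L[k, k]
    return Y

U = np.array([[2.0, 1.0],
              [0.0, 3.0]])
b = np.array([4.0, 6.0])
X = usolve(U, b)                 # solves 2x + y = 4, 3y = 6
assert np.allclose(U @ X, b)     # X = [1, 2]
```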
To solve the system UY = Pb using the Usolve function, just type >>Y=Usolve(U,P*b) after the lu command. Hint: To program your own Lsolve function, solving a lower-triangular system, you can just edit the Usolve function; only a few changes are necessary.

13. LU factorization is useful when solving the system AX = b with several right-hand sides. To see this, use the TryLU.m M-file script given on page 158 of the book.

%TryLU.m
N=200; n=100;
A=rand(n,n);
tic
for i=1:N
b=rand(n,1);
X=A\b;
end
toc
tic
[L,U,P]=lu(A);
for i=1:N
b=rand(n,1);
X=U\(L\(P*b));
end
toc

Execute this matlab script and compare the two elapsed times, then conclude. Change N and n (use several different pairs of values) and observe that the gain in time is more considerable when N gets much larger than n. Can you explain why?

14. Follow the algorithm given in these notes to write a matlab program that finds the lower triangular matrix L corresponding to Cholesky's factorization of a positive definite matrix A.

15. Find the number of operations used by the Cholesky factorization of a positive definite matrix.

16. Explain how you can use Cholesky's factorization of a positive definite matrix A to solve the system AX = b. Illustrate your method by solving the following linear system

x
y  =
z

17. Condition number and ill-conditioning. Use the following few matlab commands to find out how big the condition number of the Hilbert matrix should be in order for the matrix to be ill-conditioned in the matlab environment. For n=5,10,15,20, etc., execute
>> H=hilb(n);
>> Xe=ones(n,1);
>> b=H*Xe;
>> X=H\b;
>> RelEr = norm(X-Xe)/norm(Xe)
>> CondNbr = cond(H)

Display your results in a suitable table, e.g. with columns

n    Error    Cond#

Interpret your results and conclude.

18. Consider the linear system (referred to as AX = b below)

x
y  =
z

(a) Show that the (exact) solution of this system is X = (1,1,1).
(b) Find the approximate solution X̂ of this system if Gauss elimination is used in fl-pt arithmetic with a five-digit rounding mode (after each operation). Answer: X̂ = (1.2001, , ).
(c) Find the residual vector R = b − AX̂, its max-norm ||R||_∞, and the relative error ||R||_∞ / ||b||_∞.
(d) Use matlab to estimate the condition number (preferably with respect to the max-norm) of the matrix A. Do you think this matrix is ill-conditioned?
(e) Find an upper bound for the relative error ||X − X̂||_∞ / ||X||_∞ (don't compute this relative error; that is the next question) using the condition number and the residual error found above.
(f) Compute the relative error ||X − X̂||_∞ / ||X||_∞. Verify that it is in fact smaller than the upper bound found above and, more importantly, that it is close to it. Deduce whether A is indeed ill-conditioned and explain why (compare with the number of significant digits used in the Gauss elimination calculation).
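The experiment of Problem 17 can also be sketched in Python/NumPy (an illustration of our own; hilbert here is a hand-rolled helper playing the role of matlab's hilb, not a library routine):

```python
import numpy as np

def hilbert(n):
    """Hand-rolled Hilbert matrix H_ij = 1/(i+j-1) (1-based indices)."""
    i, j = np.indices((n, n))
    return 1.0 / (i + j + 1)

for n in (3, 6, 9, 12):
    H = hilbert(n)
    Xe = np.ones(n)              # exact solution, as in Problem 17
    b = H @ Xe
    X = np.linalg.solve(H, b)
    rel_err = np.linalg.norm(X - Xe) / np.linalg.norm(Xe)
    print(f"n={n:2d}  cond={np.linalg.cond(H):9.2e}  RelEr={rel_err:.2e}")
```

The printed table shows the condition number growing rapidly with n, with the relative error of the computed solution growing along with it.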
Chapter 3

Nonlinear equations: F(X) = 0

3.1 Introduction

Let f(x) be a continuous function defined on the interval [a,b]. The purpose of this chapter is to compute a root (or roots) x* ∈ [a,b] of the equation f(x) = 0. Unfortunately, analytic solutions for such equations are limited to very particular classes of functions, such as polynomials of low degree. Here we will introduce numerical techniques that are able to provide an approximate solution (to the desired accuracy). But first we state the following important result from calculus, which ensures the existence of at least one solution x* such that f(x*) = 0.

Theorem 5 If f(x) is a continuous function on the interval [a,b] such that f(a)f(b) ≤ 0, then there exists c ∈ [a,b] such that f(c) = 0.

Proof: This is a direct consequence of the intermediate value theorem of calculus.

General approach: Generally speaking, most numerical methods for solving a non-linear equation f(x) = 0 consist of constructing a sequence of real numbers (x_n), n ≥ 0, which converges to x*: lim_{n→∞} x_n = x*. Given an initial guess x_0, the numerical method constructs the sequence iteratively, i.e., step by step, starting from x_0. The iterative process stops at some step n when the last iterate x_n is close enough to the solution x*, and returns x_n as the numerical estimate for x*. This process is best illustrated by the bisection method given below, which is the simplest numerical method for f(x) = 0.
3.2 The bisection method

Let f(x) be a continuous function on [a,b] such that f(a)f(b) < 0. Then the above theorem guarantees the existence of a solution x* ∈ [a,b] of the equation f(x) = 0. To construct a sequence x_n that converges to x* as n → ∞, the bisection method proceeds as follows.

Algorithm: bisection method

Let a_0 = a, b_0 = b and set x_0 = (a_0 + b_0)/2.

For n ≥ 0 do the following. If f(x_n) = 0, then stop and return x* = x_n; otherwise set

a_{n+1} = a_n, b_{n+1} = x_n,  if f(a_n)f(x_n) < 0,
a_{n+1} = x_n, b_{n+1} = b_n,  if f(b_n)f(x_n) < 0,

and let x_{n+1} = (a_{n+1} + b_{n+1})/2.

In words, given an interval [a,b] containing the solution x* of f(x) = 0, the bisection method starts with x_0 = (a+b)/2, the midpoint of the interval; then, depending on the sign of f(x_0), we locate the solution in either the left half or the right half of the interval, and the same process is repeated over and over, until convergence. See the sketch below.

[Sketch: the interval [a_n, b_n] with f(a_n)f(b_n) < 0 and midpoint x_n = (a_n + b_n)/2; here f(x_n)f(b_n) < 0, so the new interval is a_{n+1} = x_n, b_{n+1} = b_n.]

Note that this function has two roots, as shown by the graph. The one near the origin is a root of
even multiplicity, at which the graph of f does not cross the horizontal line y = 0, and therefore it cannot be captured by the bisection method.

Example: f(x) = x^3 + 4x − 10, x ∈ [1,2], is continuous and f(1)f(2) = (−5)(14) = −70 < 0. We have a_0 = 1, b_0 = 2, so x_0 = (1 + 2)/2 = 1.5 and f(1.5) = −0.625. Thus f(x_0)f(a) > 0 and f(x_0)f(b) < 0, so we let a_1 = x_0 = 1.5 and b_1 = b_0 = 2. Iterating further yields x_1 = 1.75 and f(x_1) = 2.3594 > 0, i.e., a_2 = 1.5, b_2 = 1.75, and x_2 = 1.625. Starting from an initial guess x_0 = 1.5, the middle of the interval, after two iterations we arrive at the conclusion that the solution x* is located in the smaller interval [1.5,1.75]. x_2 = 1.625 is thus our new estimate for the solution.

To accelerate the computation, the bisection algorithm can be programmed and run easily in matlab. A sample matlab session which does just this is given next (note that fa and fb must be initialized before the loop).

>> fx=inline('x.^3+4*x-10')
fx =
     Inline function:
     fx(x) = x.^3+4*x-10
>> a1=1;b1=2;
>> fa=fx(a1); fb=fx(b1);
>> a=a1;b=b1;
>> c=[];
>> for I=1:10
c1=(a1+b1)/2; fc=fx(c1); c=[c;c1];
if(fc==0) break, end
if(fc*fa<0) a2=a1;b2=c1;fb=fc;
else a2=c1;b2=b1;fa=fc; end
a1=a2;b1=b2;
a=[a;a1];b=[b;b1];
end
>> c
c =
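The same computation can be written outside matlab. The following Python translation is a sketch of our own (function name and interface are assumptions, not from the notes); it stops once the error bound for the midpoint drops below a prescribed tolerance and reports the number of iterations used:

```python
def bisection(f, a, b, eps=1e-3):
    """Bisection for f(x) = 0 on [a, b] with f(a) f(b) < 0.
    Stops once the midpoint error bound (b - a)/2 is <= eps;
    returns the midpoint and the number of iterations performed."""
    fa = f(a)
    n = 0
    while (b - a) / 2 > eps:
        x = (a + b) / 2
        fx = f(x)
        if fx == 0:
            return x, n
        if fa * fx < 0:
            b = x                # root in the left half [a, x]
        else:
            a, fa = x, fx        # root in the right half [x, b]
        n += 1
    return (a + b) / 2, n

f = lambda x: x**3 + 4*x - 10
x, n = bisection(f, 1.0, 2.0, eps=1e-3)   # n = 9 iterations here
```

With eps = 10^{-3} on [1,2], the loop performs 9 bisections, in agreement with the a priori iteration count discussed below.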
The vector c, representing the iterative sequence x_0, x_1, ..., x_9 obtained after 10 iterations, is displayed. We see that after 6 iterations the first three digits remain unchanged. Therefore we conclude, as a rule of thumb, that the last iterate x_9 ≈ 1.5576, our estimate for the solution x*, has at least three significant digits, perhaps more. In fact, if we iterate further, for example, we can show that the exact solution is approximately given by x* ≈ 1.55677. Thus the absolute error associated with our 10-iteration approximation is given by |x_9 − x*| ≈ 8.4 × 10^{−4}, while the relative error is |x_9 − x*|/x* ≈ 5.4 × 10^{−4} < 10^{−3}, confirming a formal 3-significant-digit accuracy, almost 4!

Convergence of the bisection method

What would happen if we keep iterating the bisection process over and over? Will the resulting infinite sequence (x_n, n ≥ 0) converge? Will it converge to x*? How fast does it converge to x*, in case it does converge? To answer these questions we start by noticing that at step n of the bisection algorithm we have

|x_n − x*| ≤ (b_n − a_n)/2,

since x_n is the midpoint of the interval [a_n, b_n] and, by construction, x* ∈ (a_n, b_n). Moreover, for n ≥ 1 we have either a_n = x_{n−1}, b_n = b_{n−1} or a_n = a_{n−1}, b_n = x_{n−1}. Therefore

b_n − a_n = (b_{n−1} − a_{n−1})/2 = (b_{n−2} − a_{n−2})/2^2,

and by mathematical induction we get

b_n − a_n = (b_0 − a_0)/2^n = (b − a)/2^n.

Thus, we have the following upper bound for the absolute error associated with this approximation:

|x_n − x*| ≤ (b − a)/2^{n+1},

and from calculus (the squeeze law for sequences) we conclude that x_n → x*, with a linear rate of convergence of 1/2, i.e., the error bound is simply divided by 2 at each iteration.

Back to our example. Getting back to our example, we can state the following. When the root of x^3 + 4x − 10 = 0 on the interval [1,2] is approximated by the iterate x_9 ≈ 1.5576, even without knowing what the exact solution x* is, we are certain that the absolute error satisfies:

|x_9 − x*| ≤ 1/2^10 ≈ 9.8 × 10^{−4},
which is clearly larger than the actual error, but is nevertheless a good estimate of the value of |x_9 − x*| found above.

A priori number of iterations

Another practical question to ask is whether we can determine, a priori, how many iterations of the bisection method we need in order to achieve a given accuracy ǫ, i.e., find the smallest n ≥ 0 such that |x_n − x*| ≤ ǫ. Again, given that |x_n − x*| ≤ (b − a)/2^{n+1}, it suffices to choose n such that (b − a)/2^{n+1} ≤ ǫ, which (for an interval of length b − a = 1, as in our example) is equivalent to

n + 1 ≥ ln(ǫ^{−1}) / ln 2.

Therefore the smallest n for which |x_n − x*| ≤ ǫ is guaranteed is given by

n = [ln(ǫ^{−1})/ln 2],      if ln(ǫ^{−1})/ln 2 is not an integer,
n = [ln(ǫ^{−1})/ln 2] − 1,  if ln(ǫ^{−1})/ln 2 is an integer,

where [x] denotes the integer part of the real number x. (For a general interval, replace ǫ^{−1} by (b − a)/ǫ.)

For our example, if ǫ = 10^{−3}, then the minimum number of iterations is

n = [3 ln 10 / ln 2] = [9.9658] = 9,

i.e., we need nine iterations to guarantee an approximate solution x_n that is accurate to within 10^{−3}, in agreement with the calculations above, where we found |x_9 − x*| ≤ 9.8 × 10^{−4}.

Advantages and disadvantages of the bisection method

One can argue that the minimum of 9 is quite a large number of iterations to yield the mediocre accuracy of only 10^{−3} in the previous example. In fact, the bisection method is one of the slowest methods for f(x) = 0. Together with the other pitfalls listed below, this means the bisection method is not always the best method to use. In the next section we will introduce faster and more powerful methods. In summary, we retain the following advantages and disadvantages of the bisection method.

Advantage: The main advantage of the bisection method is that in theory it always converges to a solution of f(x) = 0, provided f is continuous on the initial interval [a,b] and f(a)f(b) < 0.

Disadvantages:

As stated above, the convergence is slow; as we will see later, the convergence of the bisection method is only linear, whereas Newton's method, which is presented next, has quadratic convergence.
Perhaps the most serious disadvantage of the bisection method is that it cannot be applied to roots of even multiplicity, because the condition f(a)f(b) < 0 cannot be satisfied near x*. Take the example of f(x) = x^2, which has a double root at x = 0.
Even for simple roots of odd multiplicity it is not always trivial to find an initial interval [a,b] such that f(a)f(b) < 0, if f has many roots that are too close to each other.

The method becomes unstable if f(x) doesn't vary much near x* (i.e., the graph of f is flat): typically, an inaccurately evaluated sign of f(x_n) will direct the method to the wrong side of the interval.

3.3 Newton's method

Let f(x) be a continuous and differentiable function on [a,b]. We propose to find x* ∈ [a,b] such that f(x*) = 0. Given an initial guess x_0 ∈ [a,b] (x_0 is arbitrary), to construct a sequence x_n that will eventually converge to x* as n goes to ∞, we consider the first-order Taylor (tangent line, or linear) approximation of the function f(x), centered at x_0:

f(x) ≈ P_1(x) = f(x_0) + (x − x_0)f′(x_0).  (3.1)

Instead of solving f(x) = 0, since the tangent line is a good approximation of f(x) near x_0, there is good reason to believe that the root of the linear equation f(x_0) + (x_1 − x_0)f′(x_0) = 0 is close to x* (provided x_0 is close to x*). Let x_1 be this root, i.e., such that P_1(x_1) = 0. We have

f(x_0) + (x_1 − x_0)f′(x_0) = 0  ⟺  x_1 = x_0 − f(x_0)/f′(x_0).

This is shown graphically below, where x_1 is indicated as the intersection of the tangent ("slope") line of the function f(x) at the point (x_0, f(x_0)) with the x-axis. The process is then iterated to yield x_2 as the intersection of the tangent line at x_1 with the x-axis, and so on.

[Sketch, iterations 1 and 2: the slope line at (x_0, f(x_0)) crosses the x-axis at x_1, and the slope line at (x_1, f(x_1)) crosses it at x_2, approaching x*.]

i.e., x_2 = x_1 − f(x_1)/f′(x_1), and so on. This yields the following algorithm.

Algorithm: Newton's method

Given x_0 ∈ [a,b],
for n = 0, 1, 2, ... set

x_{n+1} = x_n − f(x_n)/f′(x_n).

This yields, in principle, an infinite sequence that eventually converges to the solution x* of f(x) = 0. This is at least what we expect, because the algorithm is based on the Taylor approximation, but some surprises may occur. As an example we consider again

x^3 + 4x − 10 = 0,  x ∈ [1,2].

With x_0 = 1, we compute x_1 = x_0 − f(x_0)/f′(x_0), then x_2 = x_1 − f(x_1)/f′(x_1), and so on. To continue the iterative process we write and execute a matlab program. Our matlab program computes the Newton iterates starting from x_0 = 1 and, as a convergence test, compares each iterate with the solution obtained with the built-in fzero function of matlab, to provide a sequence of errors monitoring the convergence of the Newton sequence. Ideally, we would choose a function f(x) whose root x* is known analytically, but this procedure is just as good.

%%Matlab program for Newton's method
>> f=inline('x.^3+4*x-10');
>> fp=inline('3*x.^2+4');
>> xstar = fzero(@(x)f(x),[1,2]) %here we call fzero of matlab,
% a built-in matlab function to solve f(x)=0;
%we use xstar, the solution obtained by fzero, as the exact solution
%%Newton's method begins here.
>> x0=1;
>> X=x0;
>> Error=abs(x0-xstar);
>> for I=1:10
x1=x0-f(x0)/fp(x0);
x0=x1;
X=[X;x0];
Error=[Error;abs(xstar-x0)];
end
>> [X,Error] %displays the sequence x_n and the absolute errors |x_n-x*|
ans =
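For readers without matlab, here is a hypothetical Python sketch of the same Newton iteration (function names and the stopping rule — two successive iterates closer than eps — are our own choices, anticipating the convergence criterion discussed later in this chapter):

```python
def newton(f, fp, x0, eps=1e-10, itmax=50):
    """Newton iterations x_{n+1} = x_n - f(x_n)/f'(x_n), stopped when
    two successive iterates differ by less than eps."""
    x = x0
    for n in range(itmax):
        step = f(x) / fp(x)
        x = x - step
        if abs(step) < eps:
            return x, n + 1
    return x, itmax

f  = lambda x: x**3 + 4*x - 10
fp = lambda x: 3*x**2 + 4
x, n = newton(f, fp, 1.0)
assert abs(f(x)) < 1e-12     # converges in a handful of iterations
```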
In contrast to the bisection method, which requires at least 9 iterations to achieve an accuracy within 10^{−3}, Newton's method yields at least 5 significant digits after just 3 iterations. This is a huge gain in terms of computational effort. However, Newton's method also has its drawbacks, as we will see below.

3.3.1 Local convergence of Newton's method

We start by stating the following local convergence theorem without proof. The proof is provided in the next subsection.

Theorem 6 (Local convergence of Newton's method) Assume that f(x) is continuously differentiable in some open interval containing the root x*, such that f(x*) = 0 and f′(x*) ≠ 0. Then, if the initial guess x_0 is close enough to x*, Newton's method converges to the solution x*.

Interestingly, the condition that x_0 be close to x* is crucial for the convergence of Newton's method, but the condition f′(x*) ≠ 0 is not: the latter can be violated and the method will still converge, though at a much slower rate. How far from or how close to x* the initial guess x_0 should be depends on the function itself. To illustrate, we sketch below a situation of divergence of Newton's method when the initial guess is chosen at a point beyond which the concavity of the function changes. In fact, we can show that Newton's method converges globally, i.e., for any given initial guess x_0, if f(x) is convex.

[Sketch: divergence of Newton's method — starting from x_0 beyond an inflection point, the tangent line sends x_1 further away from the root x*.]
3.3.2 Rate of convergence

Definition 11 A sequence (x_n), n ≥ 0, converges to its limit x* at a rate γ > 0 if there exists a constant c > 0 such that

|x_n − x*| ≤ c|x_{n−1} − x*|^γ.  (3.2)

If γ = 1, the convergence is said to be linear, and if γ = 2, the convergence is said to be quadratic. Note that for the case γ = 1, convergence is guaranteed only if 0 < c < 1.

Linear convergence of the bisection method:

We can show from the relation |x_n − x*| ≤ (b_n − a_n)/2 that the bisection method's rate of convergence is linear (γ = 1) with c = 1/2. The analytic proof is rather technical and is not conducted here, but we can demonstrate by numerical tests that on average we have

|x_n − x*| ≈ (1/2)|x_{n−1} − x*|,

as in the matlab output of the previous example.

Quadratic convergence of Newton's method

Now, we show that Newton's method converges quadratically. Assume that f(x) is continuous and twice differentiable, and that its derivatives are bounded in a neighborhood of x*. We have

x_n − x* = x_{n−1} − f(x_{n−1})/f′(x_{n−1}) − x* = x_{n−1} − x* + (f(x*) − f(x_{n−1}))/f′(x_{n−1}).

Note the last equality is true because f(x*) = 0; the zero term is added for convenience. Now we Taylor expand f(x*) about x_{n−1}:

f(x*) = f(x_{n−1}) + (x* − x_{n−1})f′(x_{n−1}) + (x* − x_{n−1})^2 f″(ξ)/2,

or

f(x*) − f(x_{n−1}) = (x* − x_{n−1})f′(x_{n−1}) + (x* − x_{n−1})^2 f″(ξ)/2.

This yields

x_n − x* = (x_{n−1} − x*) + (x* − x_{n−1}) + (x* − x_{n−1})^2 f″(ξ)/(2f′(x_{n−1})) = (x* − x_{n−1})^2 f″(ξ)/(2f′(x_{n−1})).

Thus,

|x_n − x*| ≤ (M/(2m)) |x_{n−1} − x*|^2,

where M = max|f″(x)| and m = min|f′(x)| in some neighborhood of x*. Therefore, Newton's method has quadratic convergence, with parameters γ = 2 and c = M/(2m). Note that when f′(x*) = 0 the constant c becomes arbitrarily large, because m → 0 (no matter how close we are to x*), and the quadratic convergence breaks down for Newton's method. Note also that this discussion provides, at the same time, a formal proof of the local convergence of Newton's method (Theorem 6).
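The constant c = M/(2m) ≈ |f″(x*)|/(2|f′(x*)|) can be observed numerically. The following Python sketch (our own experiment on the running example x^3 + 4x − 10 = 0) tracks the ratio |e_n|/|e_{n−1}|^2 along the first few Newton iterations:

```python
f   = lambda x: x**3 + 4*x - 10
fp  = lambda x: 3*x**2 + 4
fpp = lambda x: 6*x

# Machine-precision reference root from many Newton iterations
xs = 1.0
for _ in range(60):
    xs -= f(xs) / fp(xs)

# Track the ratio |e_n| / |e_{n-1}|^2 over the first few iterations;
# it should approach c = |f''(x*)| / (2 |f'(x*)|)
x, ratios = 1.0, []
e_prev = abs(x - xs)
for _ in range(4):
    x -= f(x) / fp(x)
    e = abs(x - xs)
    ratios.append(e / e_prev**2)
    e_prev = e

c_theory = abs(fpp(xs)) / (2 * abs(fp(xs)))   # about 0.414 here
```

The successive ratios settle near c_theory, a hands-on confirmation of the quadratic rate (3.2) with γ = 2.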
Remark: linear vs. quadratic convergence

Notice that, as opposed to linear convergence, where the error |x_n − x*| is only reduced by a constant factor c at each iteration, with quadratic convergence the error at the next iteration, |x_{n+1} − x*|, is on the order of c|x_n − x*|^2. Imagine that at iteration n the error is on the order of 10^{−2}, for example; then at the next step we get an error on the order of c × 10^{−4}, which is a huge improvement. Unless c is very large, a few iterations (4 or 5) of Newton's method are enough to achieve a good estimate for the solution of f(x) = 0. And even when c is large, a significant gain towards convergence is made at each iteration, provided the initial guess doesn't lead to divergence.

3.3.3 Advantages and disadvantages of Newton's method

As stated above, the main advantage of Newton's method is that, except for the case when f′(x*) = 0, the convergence is quadratic, and the main disadvantage is that convergence is only local; convergence is guaranteed only when the initial guess is close enough to the root x*. In general, we have the following list of advantages and disadvantages for Newton's method.

Advantages:

Quadratic convergence.

Easy to implement.

Requires only one initial guess, not an interval containing the root x*, as is the case for the bisection method.

Disadvantages:

Local convergence only.

Requires the computation of the derivative, which can be costly in some cases (especially for systems of equations, as we will see later in the course).

Division by zero may occur during the iterative process and cause a crash of the computation if f′(x) = 0 for some x in the neighborhood of x*; a crash will occur if f′(x_n) = 0 for some n.

The last two problems, related to the derivative, can be avoided by using variants of Newton's method that do not use the derivative f′(x_n) directly but rather some numerical approximation of it.
This is illustrated below through the secant method.

3.3.4 Secant method
The secant method is obtained from Newton's method by replacing the derivative $f'(x_n)$ by the slope $(f(x_n) - f(x_{n-1}))/(x_n - x_{n-1})$, where $x_n$ and $x_{n-1}$ are the last two iterates, to yield
$$x_{n+1} = x_n - \frac{(x_n - x_{n-1})\,f(x_n)}{f(x_n) - f(x_{n-1})}.$$
The drawback of this simplification is two-fold.
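Before turning to these drawbacks, the secant update above can be written down in a few lines. The sketch below is in Python for illustration (the notes use MATLAB); the tolerance and iteration cap are our own illustrative choices:

```python
# Minimal secant-method sketch:
# x_{n+1} = x_n - (x_n - x_{n-1}) f(x_n) / (f(x_n) - f(x_{n-1})).

def secant(f, x0, x1, tol=1e-12, max_iter=50):
    """Return an approximate root of f given two starting guesses x0, x1."""
    for _ in range(max_iter):
        f0, f1 = f(x0), f(x1)
        if f1 == f0:  # the secant is horizontal; cannot divide
            break
        x0, x1 = x1, x1 - (x1 - x0) * f1 / (f1 - f0)
        if abs(x1 - x0) < tol:
            break
    return x1

r = secant(lambda x: x * x - 2.0, 1.0, 2.0)
print(r)
```

Note that, unlike Newton's method, no derivative appears anywhere: only values of $f$ at the two most recent iterates are used.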
Firstly, the secant method requires two initial guesses rather than one: we need to provide values for both $x_0$ and $x_1$ in order to compute $x_2$ and get the method going. Secondly, the convergence is slower than that of the original Newton method, but still faster than the bisection method. It can be shown that the rate of convergence of the secant method is $\gamma_{sec} = (1+\sqrt5)/2 \approx 1.618$.

3.3.5 Convergence criterion
Let $(x_n)_{n\ge 0}$ be a convergent sequence, and propose to approximate its limit $x^* = \lim_{n\to\infty} x_n$ by the term $x_n$ for $n$ sufficiently large. However, in a typical numerical computation for $f(x) = 0$ the sequence $x_n$ is generated iteratively, and in order to limit the computational cost it is desirable to stop the iterations as early as possible. This is an easy task for the bisection method (because we already know that $|x_n - x^*| \le (b-a)/2^{n+1}$), but in general, especially for Newton's method, it is not obvious.

Because of its quadratic convergence, we can show that for Newton's method the error $|x_n - x^*|$ is small whenever the difference between two successive iterates, $|x_{n+1} - x_n|$, is small enough (see Problem 3). Also, for the bisection method it is straightforward to see that $|x_n - x_{n+1}| = (b_n - a_n)/4$ (make a sketch). Therefore, as a general statement for both methods we have
$$|x_n - x_{n+1}| < \epsilon \implies |x_n - x^*| < 2\epsilon.$$
(For the bisection method we can show that $|x_{n+1} - x^*| < \epsilon$, but for Newton's method this is the best we can do.) Thus a possible stopping criterion for the iterative process is given by $|x_n - x_{n+1}| < \epsilon$, where $\epsilon = \delta/2$ and $\delta$ is the desired tolerance.

In some situations this criterion may not be sufficient, in the sense that $x_n$ and $x_{n+1}$ can be close to each other, or even close enough to $x^*$, while the value of $|f(x_n)|$ is still large (not very close to $f(x^*)$, i.e., zero). To guard against this we can also stop the iteration on the condition that $|f(x_n)|$ is small.
Yet an even better criterion is to combine the two, yielding the following stopping criterion: stop the iterations and return the value of $x_n$ or $x_{n+1}$ if
$$|x_n - x_{n+1}| + |f(x_n)| < \epsilon.$$
The algorithm below uses Newton's method to solve $f(x) = 0$ with the stopping criterion above as a convergence test.

Algorithm: Newton's method with convergence test
Step 1: Initialization
  Enter initial guess $x_0$, tolerance $\epsilon_{tol} = \delta_{tol}/2$, and maximum number of iterations $n_{max}$.
  Initialize iteration counter: n = 0.
Step 2: Compute next iterate:
  $x_{n+1} = x_n - f(x_n)/f'(x_n)$
  Compute error: $e = |x_n - x_{n+1}|$
Step 3: Convergence test:
  If ($e < \epsilon_{tol}$) then
    Exit loop and return approximate solution $x_a = x_n$
  Else if ($n < n_{max}$)
    Set $n = n+1$, $x_n = x_{n+1}$
    Go to Step 2.
  Else
    Exit loop and return error message: "The maximum number of iterations was reached before convergence. Change the initial guess and/or increase the maximum number of iterations and try again."
  End

3.3.6 Function fzero of Matlab
To compute the root of a function, f(x) = 0, Matlab provides us with a nice, easy-to-use built-in function named fzero. Type >>help fzero in the Matlab command window to learn more about this function. In a nutshell, this is how fzero works. To use it you first need to predefine your function f(x), either within an M-file or using the inline function in the Matlab command window (as done above). To use the M-file option, invoke the Matlab editor (click on File in the top left corner of the Matlab window, then select New > M-file). Type your function in the body and save it under a name of your choice (myfunction), such as myfunction.m. In the Matlab command window you then just need to type
>>xstar = fzero('myfunction', x0)    %if you want to use a starting point x0
or
>>xstar = fzero('myfunction', [a,b]) %if you want to use a starting interval [a,b]
Note: use a starting point $x_0$ if you want to find a root $x^*$ near $x_0$, and a starting interval [a,b] if you want to find a root inside [a,b]. An important thing to know about fzero is that it is essentially based on the bisection method. So don't be surprised to hear that fzero cannot be used for double roots (or, in general, for roots of even multiplicity). In such situations you may want to try applying fzero to the derivative first, and then check whether the resulting root of $f'(x)$ is also a root of f(x). Also, when providing an initial interval, the condition f(a)f(b) < 0 must be satisfied.
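The Newton algorithm with the combined stopping test described above is short enough to sketch in full. The version below is in Python for illustration (the notes give MATLAB pseudocode); the test values are our own:

```python
# Newton's method with the combined stopping criterion from the notes:
# stop when |x_n - x_{n+1}| + |f(x_n)| < eps.

def newton(f, fp, x0, eps=1e-10, n_max=50):
    """Return (approximate root, number of iterations used)."""
    x = x0
    for n in range(1, n_max + 1):
        x_new = x - f(x) / fp(x)
        if abs(x - x_new) + abs(f(x)) < eps:
            return x_new, n
        x = x_new
    raise RuntimeError("maximum number of iterations reached before convergence")

root, iters = newton(lambda x: x * x - 2.0, lambda x: 2.0 * x, 1.0)
print(root, iters)
```

For $f(x) = x^2 - 2$ and $x_0 = 1$ this converges to $\sqrt2$ in a handful of iterations, as the quadratic convergence discussion predicts.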
As a fun exercise try the following (from the book, see also Pb. 3):
>>fzero(@(x) 1/x, 3)
and
>>fzero(@(x) x^2, 3)
Observe and interpret the results.

3.4 Fixed point iterations
Let g(x) be a continuous function defined on [a,b] that takes its values in [a,b], i.e., symbolically, g : [a,b] → [a,b].

Definition 12 A point $x^*$ in [a,b] is said to be a fixed point of g(x) if it satisfies $g(x^*) = x^*$.

Important remark: Note that a real number $x^*$ is a root of f(x) = 0 if and only if $x^*$ is a fixed point of g(x) = x − f(x). Therefore, to find the roots of f(x) = 0 it suffices to look for the fixed points of g(x) = x − f(x), or more generally of $g_\alpha(x) = x - \alpha f(x)$, where α is an arbitrary parameter whose value can be adjusted to accelerate the convergence, as we will see below.

3.4.1 The method of fixed-point iterations
An effective method, often used in practice, to find approximate values of a given fixed point $x^* = g(x^*)$ is the method of fixed-point iterations, which may be summarized as follows. Let $x_0$ be a given initial guess. For all $n \ge 0$, we set $x_{n+1} = g(x_n)$. Then, if the sequence $(x_n)$ converges, its limit satisfies
$$L \equiv \lim_{n\to\infty} x_{n+1} = \lim_{n\to\infty} g(x_n) = g(L),$$
i.e., the limit L is a fixed point of g. We can then use the notation $L = x^*$ and write $x_n \to x^* = g(x^*)$. The obvious question is, of course, whether the sequence $(x_n)$ defined above converges or not, i.e., under which conditions on the function g and/or the initial guess $x_0$ this sequence converges. The answer is provided below.
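Before stating the convergence theorem, a quick numerical experiment shows the iteration in action. The sketch below is in Python (the notes use MATLAB), and the example $g(x) = \cos x$ is a standard one of our choosing, not taken from the notes; it maps $[0,1]$ into itself and $|g'(x)| = |\sin x| \le \sin 1 < 1$ there:

```python
import math

# Fixed-point iteration x_{n+1} = g(x_n) for the contraction g(x) = cos(x).
# The iterates converge to the unique solution of x = cos(x).

def fixed_point(g, x0, n_iters=100):
    x = x0
    for _ in range(n_iters):
        x = g(x)
    return x

xstar = fixed_point(math.cos, 0.5)
print(xstar)  # approximately 0.739085
```

The computed limit satisfies $x^* = \cos(x^*)$ to machine precision, exactly the fixed-point property defined above.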
Theorem 7 (Convergence of the fixed-point iterations) The fixed-point iterations $x_{n+1} = g(x_n)$, $n \ge 0$, $x_0 \in [a,b]$, converge to a fixed point $x^* = g(x^*)$ if g is a contraction, i.e., if there exists a positive constant c such that the derivative of g(x) satisfies
$$|g'(x)| \le c < 1, \quad \forall x \in [a,b].$$

Proof: Because $x^* = g(x^*)$, we have
$$x_{n+1} - x^* = g(x_n) - g(x^*) = g'(\xi)(x_n - x^*) \quad \text{(by Taylor expansion)},$$
with $\xi \in [x_n, x^*]$ or $[x^*, x_n] \subset [a,b]$. Let $M = \max_{x\in[a,b]} |g'(x)|$. Then
$$|x_{n+1} - x^*| \le M |x_n - x^*| \implies |x_n - x^*| \le M |x_{n-1} - x^*| \le M^2 |x_{n-2} - x^*| \le \cdots,$$
and by induction we arrive at
$$|x_n - x^*| \le M^n |x_0 - x^*|.$$
But $|g'(x)| \le c$, $\forall x \in [a,b]$, implies $M \le c < 1$. Thus $M^n \to 0$ as $n \to \infty$, and by the squeeze law for sequences from calculus we conclude that $(x_n)$ converges to $x^*$.

A somewhat hidden result of the theorem above is that the condition $|g'(x)| \le c < 1$ guarantees both the existence and the uniqueness of a fixed point $x^*$ of g in [a,b]. The existence is somewhat beyond the scope of these notes, but the uniqueness is easy to prove; see Problem 6.

3.4.2 Newton's method as a fixed-point iteration
Newton's method can be viewed as the fixed-point iteration applied to $g(x) = x - f(x)/f'(x)$. In fact we have $x_{n+1} = x_n - f(x_n)/f'(x_n) = g(x_n)$. Therefore we can prove the local convergence of Newton's method by applying the convergence theorem for fixed-point iterations of this section. We have
$$g'(x) = 1 - \frac{f'(x)^2 - f(x)f''(x)}{f'(x)^2} = \frac{f(x)f''(x)}{f'(x)^2}.$$
We assume there exists $x^*$ such that $f(x^*) = 0$ and $f'(x^*) \ne 0$. Let $a = |f'(x^*)|$ and assume that $f$, $f'$ and $f''$ are continuous near $x^*$. Then for small $\epsilon > 0$ we have, within a small neighborhood of $x^*$, $|f(x)| < \epsilon$ and $|f'(x)| > a - \epsilon$. Let $M > 0$ be such that $|f''(x)| < M$ in this neighborhood of $x^*$. Then we have
$$|g'(x)| = \frac{|f(x)|\,|f''(x)|}{f'(x)^2} \le \frac{\epsilon M}{(a-\epsilon)^2} < 1, \quad \text{for } \epsilon \text{ sufficiently small.}$$

3.4.3 The chord method
The chord method consists of using the fixed-point iteration to compute the root of f(x) = 0 as the fixed point of $g(x) = x - \alpha f(x)$. The iterative process is thus: given an initial guess $x_0$, set
$$x_{n+1} = x_n - \alpha f(x_n), \quad n \ge 0.$$
The parameter α is chosen so that the iterative sequence converges to the root $x^*$. Note that when f is continuously differentiable we have $g'(x) = 1 - \alpha f'(x)$. If α is chosen so that $|1 - \alpha f'(x)| \le c < 1$ in a neighborhood of $x^*$ containing $x_0$, then the fixed-point iteration theorem guarantees (local) convergence of this algorithm. We get the fastest convergent chord method if we choose the α for which $\max |1 - \alpha f'(x)|$, in some neighborhood of $x^*$ containing $x_0$, is smallest.

Link between Newton's method and the chord method
Roughly speaking, Newton's method can be viewed as a chord method for which the constant $\alpha\,(= 1/f'(x_n))$ changes at every iteration so that the effective derivative $g'(x_n) = 0$, which effectively maximizes the convergence rate. Note however that, because α is not constant (it changes with n), Newton's method does not strictly qualify as a chord method.

3.5 Application: The Black-Scholes formula
The Black-Scholes formula for a European call option is given by
$$C = S_0 N(d_1) - K e^{-rT} N(d_2),$$
where C is the fair price or value of the option, $S_0$ is the price of the underlying asset at t = 0, K is the strike price (i.e., the price the option holder would pay if she/he decides to buy the asset) at the maturity or expiry time T, S(T) is the price of the asset at maturity, which is in fact the biggest unknown (it is a random variable), r is the risk-free or bank interest rate, and N(d) is the cumulative distribution function of the standard normal probability distribution (with mean zero and variance one),
$$N(y) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{y} e^{-x^2/2}\,dx,$$
and
$$d_1 = \frac{\ln(S_0/K) + (r + \sigma^2/2)T}{\sigma\sqrt T}, \qquad d_2 = d_1 - \sigma\sqrt T = \frac{\ln(S_0/K) + (r - \sigma^2/2)T}{\sigma\sqrt T},$$
while σ measures the uncertainty or variability in the market price, known as the volatility. If $S_0$, K, T, r and σ are known, then the option price C is easily obtained from this formula, provided we have the proper tools to evaluate the improper integral N(y) at $d_1, d_2$.
In Matlab this can easily be accomplished by calling the function normcdf, the cumulative normal distribution:
>>normcdf(y,m,mu)
returns an approximate value for
$$N(y) = \frac{1}{\mu\sqrt{2\pi}} \int_{-\infty}^{y} e^{-(x-m)^2/(2\mu^2)}\,dx.$$
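As an aside, the same evaluation can be done without MATLAB's normcdf, since the standard normal CDF can be written via the error function as $N(y) = \tfrac12(1 + \mathrm{erf}(y/\sqrt2))$. The sketch below uses Python's standard library for illustration; the parameter values are the ones from Problem 7(b), and this is a hedged sketch rather than the notes' implementation:

```python
import math

# Black-Scholes European call price, with N(y) built from erf:
# N(y) = (1 + erf(y / sqrt(2))) / 2.

def norm_cdf(y):
    return 0.5 * (1.0 + math.erf(y / math.sqrt(2.0)))

def black_scholes_call(S0, K, T, r, sigma):
    d1 = (math.log(S0 / K) + (r + 0.5 * sigma ** 2) * T) / (sigma * math.sqrt(T))
    d2 = d1 - sigma * math.sqrt(T)
    return S0 * norm_cdf(d1) - K * math.exp(-r * T) * norm_cdf(d2)

# Values from Problem 7(b): K = $54, T = 5 months, S0 = $50, sigma = 0.3, r = 0.07.
C = black_scholes_call(S0=50.0, K=54.0, T=5.0 / 12.0, r=0.07, sigma=0.3)
print(C)  # roughly 2.85 for these inputs
```

This is handy as an independent check of the blkscles M-file given in Problem 7.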
In our case $y = d_1, d_2$; m = 0; and µ = 1 (mu=1). In reality, however, the volatility is unknown at time t = 0, and the option holder, or someone who wants to buy the option, is interested in knowing a realistic price C of the option at time t = 0. For this matter we can proceed as follows. Given a target price $C^*$, what is the corresponding volatility $\sigma^*$? In other words, we want to know what the volatility should be in order to yield a certain price $C^*$. This problem can be solved by finding the root of the function
$$f(\sigma) = S_0 N(d_1(\sigma)) - K e^{-rT} N(d_2(\sigma)) - C^*$$
(note that the dependence of $d_1, d_2$ on σ is highlighted), so that $f(\sigma^*) = 0$ implies $C^* = S_0 N(d_1(\sigma^*)) - K e^{-rT} N(d_2(\sigma^*))$. Newton's method for this problem is as follows. Given an initial guess $\sigma_0$, for $n \ge 0$ set
$$\sigma_{n+1} = \sigma_n - \frac{f(\sigma_n)}{f'(\sigma_n)},$$
where the derivative $f'(\sigma)$ is given by
$$f'(\sigma) = S_0 N'(d_1(\sigma))\,d_1'(\sigma) - K e^{-rT} N'(d_2(\sigma))\,d_2'(\sigma).$$
Here we used the chain rule
$$\frac{d}{d\sigma} N(d(\sigma)) = N'(d(\sigma))\,d'(\sigma).$$
We have
$$N'(d) = \frac{1}{\sqrt{2\pi}} e^{-d^2/2},$$
and
$$d_1'(\sigma) = -\frac{\ln(S_0/K) + (r - \sigma^2/2)T}{\sigma^2\sqrt T} = -\frac{d_2(\sigma)}{\sigma}, \qquad d_2'(\sigma) = d_1'(\sigma) - \sqrt T = -\frac{d_1(\sigma)}{\sigma}.$$
(See Problem 7.)

3.6 Problems
1. (a) Using MATLAB, plot the graph of the function
f(x) = x^5 − 1.9x^4 − 16.4x^3 + 7.2x^2 + 86.4x + 75.6
for values of x ∈ [−10,10] and observe that all real zeros of f(x) clearly lie in this interval (and, in fact, are clearly in the interval [−6,6]). Why? This can be done by entering the following MATLAB command:
>>ezplot('x^5-1.9*x^4-16.4*x^3+7.2*x^2+86.4*x+75.6', [-10,10])
This will cause a graphics window to open containing the desired graph. Under File in the figure window, select Print in order to print it. Then close the graphics window.

(b) Use the MATLAB function fzero to compute one zero of f(x) in [−6,6]. (See help fzero.) This can be done by entering
>> fzero('x^5-1.9*x^4-16.4*x^3+7.2*x^2+86.4*x+75.6', [-6,6])
Note: the function fzero uses a combination of the bisection algorithm, the secant method and interpolation to compute zeros.

(c) Although it's not clear from the above graph, f(x) has three real zeros in [−6,0] and two real zeros in [0,6]. Use ezplot to obtain a graph of f(x) on [−2.3,−1], and print this graph. Use fzero again to compute a zero of f(x) on [−2.3,−1]:
>>fzero('x^5-1.9*x^4-16.4*x^3+7.2*x^2+86.4*x+75.6', [-2.3,-1])
Note: you can save some typing by copying and pasting, or by using the up-arrow key to re-execute a previous statement (and then modifying it before pressing the return key).

(d) It can be shown that f(x) has two real zeros between −2.1 and −1. Attempt to compute one of these zeros using fzero on the interval [−2.09,−1].

(e) Verify mathematically (using either MATLAB or your calculator) that f(x) has a zero at x = and explain why (using calculus and some computations in either MATLAB or on your calculator) this is a zero that cannot be computed using the bisection method (and thus by fzero). This explains the result in (d).

2. Now we propose to use Newton's method to solve for the missing root in the interval [−2.09,−1] in Problem 1.

(a) Create a MATLAB M-file for the function f(x) = x^5 − 1.9x^4 − 16.4x^3 + 7.2x^2 + 86.4x + 75.6 and another for its derivative f'(x); that is, two MATLAB functions which, when called in MATLAB, will return, respectively, the values f(x) and f'(x) for any given real number x. For example, to create an M-file for f(x), go to the main menu on the top bar of the MATLAB window, then
Select File > New > M-file
This opens a new text editor window.
Inside this window type
function fx=func(x)
fx=x^5-1.9*x^4-16.4*x^3+7.2*x^2+86.4*x+75.6;
Then save it under the name fx.m. To do this, use the menu in the text editor window. Create a similar MATLAB function (M-file) for f'(x) and save it, for example, under the name fpx.m.
(a) Use Newton's method to find the root of f(x) in the interval [−2.09,−1]. The MATLAB code for Newton's method, using a maximum of 20 iterations, an initial guess x0 = −1, and a (relative) tolerance Tol = 10^−6, is given below.
********************************************
Maxiter=20; x0=-1; Tol=1.e-06;
I=0; xn=x0;
while (I<Maxiter)
x1=x0-fx(x0)/fpx(x0);
if(abs((x0-x1)/x0)<Tol), break, end
x0=x1;
xn=[xn;x0];
I=I+1;
end
if(I>=Maxiter)
display(['Newton''s method failed to converge in ',num2str(Maxiter),' iterations'])
end
display(['Computed root: ', num2str(x0), '; Relative error: ', num2str(abs((x0-x1)/x0)), '; Number of iterations: ', num2str(I)])
n_xn=[(0:I)' xn]
*****************************************

(b) Repeat (a) using the secant method to find the root of f(x) in the interval [−2.09,−1]. This time write your own MATLAB code based on the iterative process
$$x_{n+1} = x_n - \frac{f(x_n)(x_n - x_{n-1})}{f(x_n) - f(x_{n-1})}.$$
Use the same M-file for the function f(x) as you used in (a). Take $x_0 = -1$ and $x_1 = x_0 - f(x_0)/f'(x_0)$, i.e., the first iterate of Newton's method found in (a), to get the secant method started. Use Tol = 10^−6 and Maxiter = 20 as in (a).

(c) Compare the sequences in (a) and (b) obtained respectively by the Newton and secant methods. Conclude, and justify their behavior using what you learned in class about the convergence rates, etc.

3. Let $c = \max |f''(x)| / (2\min |f'(x)|)$ in the neighborhood of the root $x^*$ of f(x). Let $(x_n)$ be the sequence of Newton iterates for f(x) = 0. Show that if $|x_n - x^*| < \frac{1}{2c}$ and $|x_{n+1} - x_n| < \epsilon$, then $|x_n - x^*| < 2\epsilon$.

4. Prove that if f is continuously differentiable to the third order (at least) and, in addition to the conditions $f(x^*) = 0$, $f'(x^*) \ne 0$, we also have $f''(x^*) = 0$, then Newton's method has a cubic rate of convergence, i.e., there exists $c > 0$ such that $|x_n - x^*| \le c\,|x_{n-1} - x^*|^3$.
3. Use the function fzero of MATLAB for the functions f(x) = x^2 and g(x) = 1/x with a starting point $x_0 = 3$. Type
>>fzero(@(x) 1/x,3)
and
>>fzero(@(x) x^2,3)
respectively. Notice the behavior of fzero in both cases and explain. Is there something wrong with what you see? Repeat the two experiments above using Newton's method instead. Notice the difference, then conclude.

5. Use Newton's method to find an approximation of $\sqrt2$. How many iterations are needed to find an approximation with 5 correct digits if the starting point is $x_0 = 1$? Hint: consider the function $f(x) = x^2 - 2$.

6. Let g be a continuous function from [a,b] to [a,b] such that $|g'(x)| < 1$ for all $x \in [a,b]$. Show that the fixed point of g, if it exists, is unique. Hint: proceed by contradiction. Assume that there are two distinct fixed points $x^*$ and $x^{**}$ for g in [a,b].

7. The Black-Scholes formula for a European call option is given by
$$C = S_0 N(d_1) - K e^{-rT} N(d_2),$$
where C is the fair price or the value of the call option, $S_0$ is the initial or current price of the underlying asset at time t = 0, and K is the strike or exercise price (i.e., the price that the option holder will pay if he/she decides to buy the underlying asset) at maturity time T > 0 (expressed in years). (The call option holder will exercise the option only if the asset price at maturity, S(T), is larger than the exercise price K, so that he/she will have a net gain of S(T) − K > 0.) r is the risk-free interest rate and N is the normal cumulative distribution function
$$N(d) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{d} e^{-x^2/2}\,dx,$$
with
$$d_1 = \frac{\log(S_0/K) + (r + \sigma^2/2)T}{\sigma\sqrt T} \quad\text{and}\quad d_2 = d_1 - \sigma\sqrt T.$$
The new parameter σ is called the volatility; it somehow measures the risk or randomness in the market price S(t). For a given initial price $S_0$, a strike price K, a maturity time T, and an interest rate r, the MATLAB script below returns the current fair price C of the call option.
function C=blkscles(S0,K,T,r,sigma)
%Black-Scholes formula for a European call option
d1 = (log(S0/K) + (r + sigma^2/2)*T)/(sigma*sqrt(T));
d2 = d1 - sigma*sqrt(T);
%Below we use the Matlab built-in normcdf function
%to compute the normal cumulative distribution function N(d)
C = S0*normcdf(d1,0,1) - K*exp(-r*T)*normcdf(d2,0,1);
%
%Use the Matlab documentation to learn more about normcdf
%

a) Type this MATLAB script into your MATLAB editor window and save it as an M-file under the name blkscles.m.

b) Use the M-file you saved in (a) to compute the call fair price C if K = $54, T = 5 months, $S_0$ = $50, σ = 0.3, r = 0.07.

c) Now assume that we want to find the volatility σ that yields a price $C^*$, given beforehand. This can be accomplished by solving the equation
$$f(\sigma) \equiv S_0 N(d_1(\sigma)) - K e^{-rT} N(d_2(\sigma)) - C^* = 0,$$
where both $d_1$ and $d_2$ are now considered as functions of σ, and we need to find the root $\sigma^*$ of our new function f. This can be done easily in MATLAB using fzero. Try
>> fzero(@(x) blkscles(S0,K,T,r,x) - Cstar, 1)
Assume the same values for $S_0$, K, T, r as in (b) above and use your result for C found there as the value of $C^*$, and see whether it yields the same volatility. Perturb $C^*$ slightly and observe whether the problem is well conditioned or not. Make sure to enter the values of S0, K, T, r and Cstar in the MATLAB command window before you call fzero.

d) Repeat (c) using Newton's method. Hint: to compute the derivative of f you need to use the chain rule, namely
$$\frac{d}{d\sigma} N(d(\sigma)) = N'(d(\sigma))\,d'(\sigma).$$

8. Consider the function f(x) = x^3 − 3x

a) Plot the graph of this function and observe that it has two roots, $r_1 < r_2$, within the interval [−2,4]. Check that they are genuine roots by evaluating f at $r_1$ and $r_2$.

b) Before performing the calculations, predict which outcome (i.e., say which root, if any) each of the fzero function of MATLAB and Newton's method will yield, respectively, if the starting point is
(i) $x_0 = 3$ (ii) $x_0 = 1$ (iii) $x_0 = 0$ (iv) $x_0 = -0.5$

c) Run both fzero and Newton's method and verify whether your predictions in (b) are correct.
9. Do these using a hand calculator. Find the first two iterates, $x_1, x_2$, using Newton's method for the given function f(x) and given initial guess $x_0$:
(i) $f(x) = x^2 - 6$, $x_0 = 1$; (ii) $f(x) = x^3 - \cos x$, $x_0 = 1$. What if $x_0 = 0$?

10. Use Newton's method to find solutions with an error tolerance $\epsilon = 10^{-4}$ for the following problems:
(i) $x^3 - 2x^2 - 5 = 0$, $x \in [1,4]$; (ii) $x - \cos x = 0$, $x \in [0,\pi/2]$; (iii) $x^3 + 3x^2 - 1 = 0$, $x \in [-3,-2]$; (iv) $x - \sin x = 0$, $x \in [0,\pi/2]$.

11. Use Newton's method to approximate, to within $10^{-4}$, the value of x that produces the point on the graph of $y = x^2$ that is closest to the point (1,0) of the xy-plane. Hint: minimize $(d(x))^2$, where d(x) is the Euclidean distance between (1,0) and an arbitrary point $(x, x^2)$ on the parabola.

12. Repeat 11 for f(x) = 1/x and the point (2,1).

13. Consider the function $f(x) = x^3 - 3x^2 + 4$, whose graph is given below.

a) According to this graph, f(x) has two roots, $r_1 < r_2$, within the interval [−2,4]. Using calculus, determine the nature (simple, repeated, even multiplicity, etc.) of each root.

b) Without computations, just from the graph of f(x) given below and your answer above, predict which outcome (i.e., say which root, if any) each of fzero of MATLAB and Newton's method will yield, respectively, if the starting point takes each one of the values in the first column of Table 1. [3] Circle the correct answer, in the corresponding row and column, in the table below.

[Figure: graph of $f(x) = x^3 - 3x^2 + 4$ on [−2,4].]

Table 1: Circle the outcome ($r_1$, $r_2$, or "failed to converge") for Newton's method and MATLAB's fzero, respectively, if used with the corresponding starting point in the first column.
Starting point x_0 | Newton's outcome               | fzero's outcome
3                  | r_1, r_2, failed to converge   | r_1, r_2, failed to converge
1                  | r_1, r_2, failed to converge   | r_1, r_2, failed to converge
0                  | r_1, r_2, failed to converge   | r_1, r_2, failed to converge
-0.5               | r_1, r_2, failed to converge   | r_1, r_2, failed to converge

c) Now assume that the bisection method is used with a starting interval [a,b]. Circle the correct answer from Table 2 below for each one of the starting interval limits a, b given in the first column.

Table 2:
Starting interval limits: a, b | Bisection's outcome
3, 4                           | r_1, r_2, failed to converge
1, 3                           | r_1, r_2, failed to converge
-2, 3                          | r_1, r_2, failed to converge
-2, 0                          | r_1, r_2, failed to converge
Chapter 4

Function approximation

4.1 Curve fitting
An important issue that commonly arises in many areas of applied science is that of curve fitting. Given a cloud of data points $(x_i, y_i)$, $i = 1,2,\ldots,N$, $N > 0$, in the xy-plane, we would like to find a function whose graph is as close as possible to the given data points. Such a function may then be used to compute estimates of derivatives, integrals and other important physical quantities from the given data. One simple way to solve this problem is to construct a polynomial of a certain degree m,
$$P_m(x) = a_0 + a_1 x + \cdots + a_m x^m,$$
such that the Euclidean distance between the data points and the graph of $P_m(x)$ is as small as possible. In mathematical terms, this is equivalent to finding the polynomial $P_m(x)$ of degree m that minimizes the following functional:
$$\min_{a_0,a_1,\ldots,a_m} \Phi(a_0,a_1,\ldots,a_m), \quad \text{where} \quad \Phi(a_0,a_1,\ldots,a_m) = \sum_{i=1}^{N} |y_i - P_m(x_i)|^2.$$
Note that the maximum degree m of the polynomial is fixed independently of the number of data points N, and that the minimization is taken with respect to the coefficients $a_0, a_1, \ldots, a_m$. The case m = 1 yields a linear fit curve and is illustrated below.

[Figure: data points in the xy-plane together with a linear fit curve.]

This curve fitting technique, based on the minimization of the Euclidean distance, is a well established procedure, used in science and engineering, known as the least square approximation. From now on, we assume that the abscissas $x_i$, $i = 1,\ldots,N$, are all distinct. In this case, we will see later in this chapter that if $m \le N-1$ then the least square polynomial exists and is unique, and when $N = m+1$ it matches the unique interpolation polynomial of degree m that passes through all the data points $(x_i,y_i)$, $i = 1,\ldots,m+1$, i.e., $P_m(x_i) = y_i$, $i = 1,\ldots,m+1$. When $m \ge N$, on the other hand, the least square problem is ill-posed, in the sense that the solution is not unique. This is easily seen in the case of one data point, N = 1, and m = 1: there are infinitely many first degree polynomials $P_1(x) = a_1 x + a_0$ such that $a_1 x_1 + a_0 = y_1$. If, for instance, $x_1 = 0$, then $a_0 = y_1$ and $a_1$ remains arbitrary.

4.1.1 Least square approximation in Matlab
In Matlab the least square approximation is handled by the function polyfit. It has the following syntax:
>>P=polyfit(X,Y,m)
where X and Y are two vectors of the same size, containing the data points, and m is the degree of the fitting polynomial $P_m(x)$, while the output P is a vector of dimension m+1 containing the coefficients of $P_m(x) = a_m x^m + a_{m-1}x^{m-1} + \cdots + a_1 x + a_0$, arranged in decreasing order of powers, i.e., $P(1) = a_m$, $P(2) = a_{m-1}$, ..., $P(m) = a_1$, $P(m+1) = a_0$. This is illustrated by the following example.

Matlab polyfit function:
>> X=[0 0.5 2 3.5];
>> Y=[0 3 2 4];
>> P=polyfit(X,Y,1)
P =
  8.000000e-01  1.050000e+00
>> P=polyfit(X,Y,2)
P =
  -6.060606e-02  1.012121e+00  9.818182e-01

[Figure: the data and the least square fits of degree N = 1, 2, 3.]
>> P=polyfit(X,Y,3)
P =
  1.142857e+00  -6.190476e+00  8.809524e+00  (constant term of order e-15)
>> plot(X,Y,'ro')
>> hold on
>> ezplot('8.000000e-01*x + 1.050000e+00',[0 4])
>> ezplot('-6.060606e-02*x^2 + 1.012121e+00*x + 9.818182e-01',[0 4])
>> ezplot('1.142857e+00*x^3 - 6.190476e+00*x^2 + 8.809524e+00*x',[0 4])
>> legend('data','N=1','N=2','N=3')

For all practical purposes, the e-15-sized constant term in the degree 3 polynomial above can be treated as zero. There are a few things that we can learn from this Matlab example. The graphic shows three different curve fits for the given data, consisting of the points (0,0), (0.5,3), (2,2), and (3.5,4), corresponding to three polynomials of degree one, two, and three. Note that the degree one (m=1) and degree two (m=2) polynomials are almost identical, while the third degree polynomial (m=3) has higher variations and happens to pass exactly through all four data points, as anticipated above. This is the interpolation polynomial discussed below.

4.2 Interpolation polynomial
In the previous Matlab example we saw that when m = 3 the least square approximation polynomial passes through all four data points, so that the Euclidean distance $\sum_{i=1}^N (P_m(x_i) - y_i)^2$ is zero, since $P_m(x_i) = y_i$, $i = 1,\ldots,N$. In fact, we have the following theorem.

Theorem 8 (Interpolation polynomial) Let $x_0, x_1, \ldots, x_m$ be m+1 distinct points and $y_0, y_1, \ldots, y_m$ be m+1 corresponding values. Then there exists a unique polynomial $P_m(x) = a_0 + a_1 x + \cdots + a_m x^m$ of degree m such that
$$P_m(x_i) = y_i, \quad i = 0,1,\ldots,m.$$
A polynomial $P_m(x)$ satisfying these conditions is known as the interpolation polynomial.

Proof (uniqueness): We proceed by contradiction. Assume there are two distinct interpolation polynomials $P_m(x)$ and $Q_m(x)$ of degree m (i.e., such that $P_m(x_i) = Q_m(x_i) = y_i$, $i = 0,1,\ldots,m$).
Let $R_m(x) = P_m(x) - Q_m(x)$. Then $R_m(x)$ is a polynomial of degree at most m such that $R_m(x_i) = 0$, $i = 0,1,\ldots,m$, i.e., $R_m(x)$ has m+1 distinct zeros. This is impossible, unless $R_m(x) \equiv 0$, since the degree of $R_m$ is at most m; and that is a contradiction with the assumption that $P_m(x)$ and $Q_m(x)$ are distinct.

Proof (existence): The existence of the interpolation polynomial is provided by construction, through the Lagrange polynomials given in the next section.

4.3 Lagrange interpolation polynomial
Let $x_0, x_1, \ldots, x_m$ be m+1 distinct points and $y_0, y_1, \ldots, y_m$ be m+1 corresponding values. We consider the following m+1 polynomials:
$$L_0(x) = \frac{(x-x_1)(x-x_2)\cdots(x-x_m)}{(x_0-x_1)(x_0-x_2)\cdots(x_0-x_m)},$$
$$L_1(x) = \frac{(x-x_0)(x-x_2)\cdots(x-x_m)}{(x_1-x_0)(x_1-x_2)\cdots(x_1-x_m)},$$
$$L_2(x) = \frac{(x-x_0)(x-x_1)(x-x_3)\cdots(x-x_m)}{(x_2-x_0)(x_2-x_1)(x_2-x_3)\cdots(x_2-x_m)},$$
$$\vdots$$
$$L_m(x) = \frac{(x-x_0)(x-x_1)\cdots(x-x_{m-1})}{(x_m-x_0)(x_m-x_1)\cdots(x_m-x_{m-1})},$$
or generically,
$$L_j(x) = \prod_{i=0,\,i\ne j}^{m} \frac{(x-x_i)}{(x_j-x_i)}. \qquad (4.1)$$
The polynomials $L_j(x)$, $j = 0,1,2,\ldots,m$, are known as the Lagrange polynomials.

Example: Let $x_0 = 0$, $x_1 = 0.5$, $x_2 = 2$, $x_3 = 3.5$. Then
$$L_0(x) = \frac{(x-0.5)(x-2)(x-3.5)}{(0-0.5)(0-2)(0-3.5)} = -\frac{2}{7}x^3 + \frac{12}{7}x^2 - \frac{39}{14}x + 1,$$
$$L_1(x) = \frac{(x-0)(x-2)(x-3.5)}{(0.5-0)(0.5-2)(0.5-3.5)} = \frac{4}{9}x^3 - \frac{22}{9}x^2 + \frac{28}{9}x,$$
$$L_2(x) = \frac{(x-0)(x-0.5)(x-3.5)}{(2-0)(2-0.5)(2-3.5)} = -\frac{2}{9}x^3 + \frac{8}{9}x^2 - \frac{7}{18}x,$$
$$L_3(x) = \frac{(x-0)(x-0.5)(x-2)}{(3.5-0)(3.5-0.5)(3.5-2)} = \frac{4}{63}x^3 - \frac{10}{63}x^2 + \frac{4}{63}x.$$
Some basic properties of Lagrange polynomials
The Lagrange polynomials $L_j(x)$ defined above satisfy the following important properties:
1. The degree of $L_j(x)$ is exactly m, for all $j = 0,1,\ldots,m$.
2. $L_j(x_i) = 0$ if $i \ne j$.
3. $L_j(x_j) = 1$ for all $j = 0,1,2,\ldots,m$.
Property 1 results directly from the fact that the coefficient of $x^m$ in $L_j$ is given by $\prod_{i=0,\,i\ne j}^{m} \frac{1}{x_j - x_i}$, which is obviously non-zero. Properties 2 and 3 are straightforward.

As a consequence of these three properties we have the following important result, which proves the existence part of Theorem 8.

Theorem 9 (Lagrange interpolation polynomial) Let $x_0, x_1, \ldots, x_m$ be m+1 distinct points and $y_0, y_1, \ldots, y_m$ be m+1 corresponding values. Then the interpolation polynomial of Theorem 8 is given by
$$P_m(x) = L_0(x)y_0 + L_1(x)y_1 + \cdots + L_m(x)y_m \equiv \sum_{j=0}^{m} y_j L_j(x).$$
This procedure is known as Lagrange polynomial interpolation.

Proof: It follows directly from Property 1 above that the degree of $P_m(x)$ is at most m. The condition $P_m(x_i) = y_i$ follows from Properties 2 and 3:
$$P_m(x_i) = L_0(x_i)y_0 + L_1(x_i)y_1 + \cdots + L_m(x_i)y_m = L_i(x_i)y_i = y_i,$$
since $L_j(x_i) = 0$ if $i \ne j$ and $L_i(x_i) = 1$.

Example: Let $x_0 = 0$, $x_1 = 0.5$, $x_2 = 2$, $x_3 = 3.5$ and $y_0 = 0$, $y_1 = 3$, $y_2 = 2$, $y_3 = 4$. Then the Lagrange interpolation polynomial is given by (using the results of the previous example)
$$P_3(x) = 0\cdot L_0(x) + 3\cdot L_1(x) + 2\cdot L_2(x) + 4\cdot L_3(x)$$
$$= 3\left(\frac{4}{9}x^3 - \frac{22}{9}x^2 + \frac{28}{9}x\right) + 2\left(-\frac{2}{9}x^3 + \frac{8}{9}x^2 - \frac{7}{18}x\right) + 4\left(\frac{4}{63}x^3 - \frac{10}{63}x^2 + \frac{4}{63}x\right)$$
$$= \frac{8}{7}x^3 - \frac{130}{21}x^2 + \frac{185}{21}x,$$
which is the polynomial found by the function polyfit of Matlab in the previous section when m = 3. This is of course not a surprise, since the interpolation polynomial is unique.
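The construction in Theorem 9 translates directly into code. The sketch below (in Python for illustration, while the notes use MATLAB's polyfit; the helper names are our own) rebuilds $P_3$ for the data of the example and checks the interpolation conditions:

```python
# Lagrange interpolation for the data (0,0), (0.5,3), (2,2), (3.5,4):
# P(x) = sum_j y_j L_j(x), with L_j(x) = prod_{i != j} (x - x_i)/(x_j - x_i).

xs = [0.0, 0.5, 2.0, 3.5]
ys = [0.0, 3.0, 2.0, 4.0]

def lagrange_basis(j, x):
    """Evaluate the j-th Lagrange basis polynomial at x."""
    val = 1.0
    for i, xi in enumerate(xs):
        if i != j:
            val *= (x - xi) / (xs[j] - xi)
    return val

def P(x):
    """Evaluate the interpolation polynomial at x."""
    return sum(ys[j] * lagrange_basis(j, x) for j in range(len(xs)))

print([P(xi) for xi in xs])  # reproduces ys, up to rounding
print(P(1.0))                # between the nodes: 8/7 - 130/21 + 185/21 = 79/21
```

Evaluating at $x = 1$ gives $79/21 \approx 3.7619$, in agreement with the closed form $\frac{8}{7}x^3 - \frac{130}{21}x^2 + \frac{185}{21}x$ obtained above.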
4.3.1 The interpolation polynomial of a known function
So far the interpolation polynomial has been defined as a convenient way to construct a function that passes through a given set of data points, which in some sense is the best polynomial fit for our data. However, in some situations we may want to find a polynomial approximation to a given function. Especially for those functions that are hard to integrate or differentiate, the polynomial approximation may turn out to be very convenient indeed, as we will see later in the course.

Definition 13 Let f be a smooth¹ function of the variable x. Let $x_0, x_1, \ldots, x_m$ be m+1 distinct points and $y_0 = f(x_0)$, $y_1 = f(x_1)$, ..., $y_m = f(x_m)$ be the m+1 corresponding values of f. Then the unique interpolation polynomial $P_m(x)$ associated with the data points $(x_i, y_i = f(x_i))$, $i = 0,1,\ldots,m$, which satisfies $P_m(x_i) = f(x_i)$, $i = 0,1,\ldots,m$, is known as the interpolation polynomial of f associated with the data points $x_0, x_1, \ldots, x_m$.

In many practical situations, the interpolation polynomial $P_m(x)$ of a given function f at some given interpolation points $x_0, x_1, \ldots, x_m$ is used to approximate values of the function f at points other than the interpolation points $x_j$, and we write $f(x) \approx P_m(x)$. Below we will quantify the truncation error $e(x) = f(x) - P_m(x)$ associated with this approximation.

Example: Let f(x) = 1/x and consider the three interpolation points $x_0 = 2$, $x_1 = 2.5$, $x_2 = 4$. Then
$$L_0(x) = x^2 - 6.5x + 10, \quad L_1(x) = -\frac{4}{3}x^2 + 8x - \frac{32}{3}, \quad L_2(x) = \frac{1}{3}x^2 - \frac{3}{2}x + \frac{5}{3},$$
and
$$\frac{1}{x} \approx P_2(x) = \frac{1}{20}x^2 - \frac{17}{40}x + \frac{23}{20}.$$
Both the function f(x) = 1/x and the interpolation polynomial $P_2(x)$, together with the associated three interpolation points, are shown in the plot below.

[Figure: f(x) = 1/x, $P_2(x) = x^2/20 - 17x/40 + 23/20$, and the three interpolation points.]

¹Here "smooth" is used vaguely to mean that f has as many derivatives as required.
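The quality of this approximation inside and outside the span of the nodes can be checked numerically. The sketch below is in Python for illustration (the notes use MATLAB); the evaluation points $x = 3$ and $x = 8$ are our own choices:

```python
# Quadratic interpolant of f(x) = 1/x at the nodes 2, 2.5, 4 (from the example),
# compared with f inside and outside the node span.

def P2(x):
    return x ** 2 / 20.0 - 17.0 * x / 40.0 + 23.0 / 20.0

f = lambda x: 1.0 / x

err_inside = abs(f(3.0) - P2(3.0))   # x = 3 lies between the nodes
err_outside = abs(f(8.0) - P2(8.0))  # x = 8 lies well outside them
print(err_inside, err_outside)
```

The error at $x = 3$ is below $10^{-2}$, while at $x = 8$ it exceeds $0.8$: exactly the "good inside, poor outside" behaviour visible in the plot.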
Note, from this example, that the approximation seems to be fairly good within the span of the interpolation points, but deteriorates quickly as we move away from those points. The explanation for this behaviour is given by the error analysis below.

4.4 Interpolation error
Let f(x) be a smooth function defined on the interval [a,b], and $x_0, x_1, \ldots, x_m$ m+1 distinct points in [a,b]. Let $P_m(x)$ be the interpolation polynomial of f associated with the points $x_0, x_1, \ldots, x_m$. The interpolation error is defined as the difference $E(x) = f(x) - P_m(x)$. As we can see from the example of the function f(x) = 1/x, the interpolation error is small within the range of the interpolation points and increases rapidly outside this range. Moreover, as is the case for Taylor polynomials, it is expected that as the number of interpolation points is increased, the interpolation error decreases. More precisely, we have the following theorem.

Theorem 10 Let $x_0 < x_1 < \cdots < x_m$ be m+1 distinct points in [a,b] and f(x) a smooth function on [a,b]; we assume that f has at least m+1 continuous derivatives. Let $P_m(x)$ be the interpolation polynomial of f(x) associated with $x_0 < x_1 < \cdots < x_m$. Then the interpolation error satisfies
$$E_m(x) \equiv f(x) - P_m(x) = \frac{f^{(m+1)}(\xi)}{(m+1)!}(x-x_0)(x-x_1)\cdots(x-x_m) = \frac{f^{(m+1)}(\xi)}{(m+1)!}\prod_{j=0}^{m}(x-x_j),$$
where ξ is in the range of $x, x_0, x_1, \ldots, x_m$.

We skip the proof of this theorem; it is found in the book of Burden and Faires, for example. Below we give some important remarks about it.

Remarks: First, notice the similarity between the interpolation error given in the theorem above and the remainder of the Taylor expansion,
$$R_n(x) = \frac{f^{(n+1)}(\xi)}{(n+1)!}(x-x_0)^{n+1}.$$
The theorem above suggests that the interpolation error $E_m(x)$ will go to zero when $m \to \infty$ (because of the (m+1)! in the denominator). However, this may not always be the case. In fact, if the combination of the remaining factors grows faster than (m+1)!, then the error will not converge to zero:
i) Let h = max(x_{j+1} − x_j). One can show that if x ∈ [x_0, x_m], then

∏_{j=0}^{m} |x − x_j| ≤ h^(m+1) · m!/4.

ii) If M_{m+1} = max_{x_0 ≤ x ≤ x_m} |f^(m+1)(x)|, then

|E_m(x)| ≤ M_{m+1} h^(m+1) / (4(m+1)).

Therefore |E_m(x)| → 0 when m → ∞ provided the derivatives of f do not grow too fast, i.e. provided M_{m+1} h^(m+1)/(4(m+1)) goes to zero. Situations where the interpolation error does not converge to zero can occur in practice, as illustrated below by the example of the Runge function.

Example: Runge's phenomenon. Consider the function f(x) = 1/(1+25x^2) on [−1,1], known as Runge's function. Here we use the function polyfit of Matlab to compute and draw its interpolation polynomials of degree five and 10, associated with the interpolation points x_0 = −1, x_1 = −0.6, x_2 = −0.2, x_3 = 0.2, x_4 = 0.6, x_5 = 1 and x_0 = −1, x_1 = −0.8, x_2 = −0.6, x_3 = −0.4, x_4 = −0.2, x_5 = 0, x_6 = 0.2, x_7 = 0.4, x_8 = 0.6, x_9 = 0.8, x_10 = 1, respectively.

The function polyval of Matlab: The program below uses the Matlab polyval function to evaluate the polynomials P_5(x), P_10(x) at the values of x contained in the vector xi. polyval takes a vector Pm of size m+1 containing the coefficients of a given polynomial P_m(x) of degree m (Pm = [a_m, a_{m−1}, ..., a_0] for P_m(x) = a_m x^m + a_{m−1} x^{m−1} + ... + a_0) and a vector X = (x_i), i = 1, ..., N, where N is an arbitrary number of points where the polynomial needs to be evaluated. It returns a vector of values Y = (y_i) of size N such that y_i = P_m(x_i), i = 1, ..., N.

>> X=-1:2/5:1; % defines our 6 interpolation pts for P_5(x)
>> Y=1./(1+25*X.^2);
>> P5=polyfit(X,Y,5);
>> X=-1:2/10:1; % defines the 11 interpolation pts for P_10(x)
>> Y=1./(1+25*X.^2);
>> P10=polyfit(X,Y,10);
>> xi=-2:.01:2; % grid points for plotting
>> plot(xi,1./(1+25*xi.^2),'linewidth',2)
>> hold on
>> p5=polyval(P5,xi);
>> plot(xi,p5,'-')
>> p10=polyval(P10,xi);
>> plot(xi,p10,'-')
>> legend('f(x)','P_5(x)','P_{10}(x)')
>> title('Runge''s example: f(x)=1/(1+25x^2)','fontsize',24)
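The divergence can also be checked numerically without plotting. The following Python sketch (our own illustration, independent of the Matlab program above) evaluates the equispaced interpolants of degree 5 and 10 on a fine grid and compares their maximum errors on [−1,1]:

```python
def runge(t):
    """Runge's function f(t) = 1/(1 + 25 t^2)."""
    return 1.0/(1.0 + 25.0*t*t)

def lagrange_eval(xs, x):
    """Evaluate the interpolation polynomial of runge() at the nodes xs."""
    total = 0.0
    for j, xj in enumerate(xs):
        Lj = 1.0
        for xi in xs:
            if xi != xj:
                Lj *= (x - xi)/(xj - xi)
        total += runge(xj)*Lj
    return total

def max_err(m, n=2001):
    """Max of |P_m(x) - f(x)| over a fine grid on [-1,1], equispaced nodes."""
    xs = [-1.0 + 2.0*j/m for j in range(m + 1)]
    grid = [-1.0 + 2.0*k/(n - 1) for k in range(n)]
    return max(abs(lagrange_eval(xs, x) - runge(x)) for x in grid)

e5, e10 = max_err(5), max_err(10)
```

Raising the degree makes things worse here: e10 comes out well above 1 (the error peaks near the ends of the interval), while e5 stays much smaller, which is exactly the Runge phenomenon discussed next.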
[Figure: Runge's example, f(x) = 1/(1+25x^2), together with P_5(x) and P_10(x).]

As we see from these plots, high order interpolation polynomials for the Runge function exhibit strong oscillations, which is suggestive of non-convergence. In fact, roughly speaking, the maximum of the nth order derivative |f^(n)(x)| grows exponentially with n (as powers of 25), which is certainly the cause of these oscillations. We have

f'(x) = −50x/(1+25x^2)^2, f''(x) = (3750x^2 − 50)/(1+25x^2)^3, etc.

This oscillatory behaviour of high order interpolation polynomials is not specific to Runge's function; it may arise for many other functions or sets of data. It is known as the Runge phenomenon because Runge was probably the first to discover it. It is somewhat related to a similar behaviour of Fourier series known as the Gibbs phenomenon. Such oscillations make it problematic to approximate a function with high accuracy using a single high order interpolation polynomial. There are several options to overcome this difficulty. The most attractive one is piecewise polynomial interpolation, which is introduced next.

4.5 Piecewise polynomial interpolation (Introduction to Splines)

Let x_0, x_1, ..., x_m be a set of data points and y_0, y_1, ..., y_m an associated set of values. Piecewise interpolation polynomials, or splines, consist in constructing an interpolation polynomial, P^i(x),² of low degree (three or less), joining each successive pair of points (x_i, y_i) and (x_{i+1}, y_{i+1}) on the interval [x_i, x_{i+1}]. The result is a piecewise polynomial function fitting the given data. Notice that the interpolation conditions y_i = P^i(x_i), y_{i+1} = P^i(x_{i+1}) imply that the approximating function is continuous at the interpolation points x_i, i = 0, ..., m.

² Note the index i refers to the interval [x_i, x_{i+1}] and is not the degree of the interpolation polynomial.
Example 1: Piecewise linear approximation

Let x_0, x_1, ..., x_m be a set of data points and y_0, y_1, ..., y_m an associated set of values. For i = 0, 1, ..., m−1, let

P_1^i(x) = a_i x + b_i, x ∈ [x_i, x_{i+1}],

be the interpolation polynomial of degree one associated with the data points (x_i, y_i) and (x_{i+1}, y_{i+1}). Lagrange's interpolation method applied to the data (x_i, y_i) and (x_{i+1}, y_{i+1}) yields

P_1^i(x) = (x − x_{i+1})/(x_i − x_{i+1}) · y_i + (x − x_i)/(x_{i+1} − x_i) · y_{i+1} = (y_{i+1} − y_i)/(x_{i+1} − x_i) · x + (x_{i+1} y_i − x_i y_{i+1})/(x_{i+1} − x_i).

The piecewise linear interpolation function for the data (x_i, y_i), i = 0, ..., m, is given by

P_1(x) = P_1^i(x), x ∈ [x_i, x_{i+1}], i = 0, 1, ..., m−1.

P_1(x) is continuous by construction, as mentioned above, but it is not differentiable at the interpolation points: the left and right derivatives are well defined but not necessarily equal. We have

dP_1(x_{i+1}^+)/dx = dP_1^{i+1}(x)/dx = a_{i+1} and dP_1(x_{i+1}^−)/dx = dP_1^i(x)/dx = a_i.

The graph of P_1(x) is a broken line, as illustrated below.

[Figure: piecewise linear interpolation through the points x_1, x_2, x_3, x_4.]

Example 2: Cubic splines

Let x_0, x_1, ..., x_m be a set of data points and y_0, y_1, ..., y_m an associated set of values. To obtain a smooth piecewise interpolation, we require that both the first and the second derivatives of P match at the interpolation points, so as to achieve an approximating function which is twice differentiable. The solution to this problem is provided by the cubic splines. We have

P(x) ≈ S(x) = S_i(x) = s_i0 + s_i1 (x − x_i) + s_i2 (x − x_i)^2 + s_i3 (x − x_i)^3, x ∈ [x_i, x_{i+1}], i = 0, 1, ..., m−1,

with the following conditions
(1) S_i(x_i) = y_i, S_i(x_{i+1}) = y_{i+1}, i = 0, 1, ..., m−1, which implies S_i(x_{i+1}) = S_{i+1}(x_{i+1}), i = 0, 1, ..., m−2 (continuity)

(2) S_i'(x_{i+1}) = S_{i+1}'(x_{i+1}), i = 0, 1, ..., m−2 (continuity of the first derivative)

(3) S_i''(x_{i+1}) = S_{i+1}''(x_{i+1}), i = 0, 1, ..., m−2 (continuity of the second derivative)

To construct the spline function S(x) for the given m+1 interpolation points we need to find the 4m coefficients s_i0, s_i1, s_i2, s_i3, i = 0, 1, ..., m−1, using the conditions (1), (2), (3) above. However, there are only 2m + (m−1) + (m−1) = 4m−2 conditions, yielding 4m−2 equations. Therefore we need two more equations to fully determine the coefficients s_ik. Three types of extra conditions/assumptions are commonly used in practice, depending on the problem at hand:

Natural cubic splines assume S_0''(x_0) = S_{m−1}''(x_m) = 0. These are also known as the free boundary conditions.

Clamped boundary conditions: S_0'(x_0) = α_0, S_{m−1}'(x_m) = β_0, where α_0, β_0 are given constants, usually equal to the derivative of the function to be approximated.

Not-a-knot conditions, i.e. S'''(x) is continuous at x_1 and x_{m−1}: S_0'''(x_1) = S_1'''(x_1) and S_{m−2}'''(x_{m−1}) = S_{m−1}'''(x_{m−1}).

Numerical example: Consider the data points x_0 = 0, x_1 = 1, x_2 = 2, y_0 = 0.5, y_1 = 2, y_2 = 1. Find the natural cubic spline associated with these data points. We have

S(x) = { S_0(x) = s_00 + s_01 x + s_02 x^2 + s_03 x^3, 0 ≤ x ≤ 1
       { S_1(x) = s_10 + s_11 (x−1) + s_12 (x−1)^2 + s_13 (x−1)^3, 1 ≤ x ≤ 2.

The interpolation conditions give

0.5 = S_0(0) = s_00, so s_00 = 1/2;
2 = S_0(1) = s_00 + s_01 + s_02 + s_03, or s_01 = 2 − s_00 − s_02 − s_03;
2 = S_1(1) = s_10, so s_10 = 2; and
1 = S_1(2) = s_10 + s_11 + s_12 + s_13, or s_11 = 1 − s_10 − s_12 − s_13.
We need to solve for s_02, s_03, s_12, s_13 only. The natural boundary conditions yield

0 = S_0''(0) = [2s_02 + 6s_03 x]_{x=0} = 2s_02, so s_02 = 0,

and

0 = S_1''(2) = [2s_12 + 6s_13 (x−1)]_{x=2} = 2s_12 + 6s_13, so s_12 = −3s_13.

Continuity of S'(x) at x = 1:

S_0'(1) = S_1'(1), i.e. s_01 + 2s_02 + 3s_03 = s_11,

and continuity of S''(x) at x = 1:

S_0''(1) = S_1''(1), i.e. 2s_02 + 6s_03 = 2s_12.

To summarize, we have s_00 = 1/2, s_10 = 2, s_02 = 0 and the following system of equations for s_01, s_03, s_11, s_12, s_13:

s_01 + s_03 = 1.5
s_11 + s_12 + s_13 = −1
2s_12 + 6s_13 = 0
s_01 + 3s_03 − s_11 = 0
6s_03 − 2s_12 = 0

i.e., in matrix form,

[ 1  1   0   0  0 ] [ s_01 ]   [ 1.5 ]
[ 0  0   1   1  1 ] [ s_03 ]   [ −1  ]
[ 0  0   0   2  6 ] [ s_11 ] = [  0  ]
[ 1  3  −1   0  0 ] [ s_12 ]   [  0  ]
[ 0  6   0  −2  0 ] [ s_13 ]   [  0  ]

whose solution is

s_01 = 17/8, s_03 = −5/8, s_11 = 1/4, s_12 = −15/8, s_13 = 5/8.

Therefore the natural cubic spline associated with the given data points is

S(x) = { 1/2 + (17/8)x − (5/8)x^3, 0 ≤ x ≤ 1
       { 2 + (1/4)(x−1) − (15/8)(x−1)^2 + (5/8)(x−1)^3, 1 ≤ x ≤ 2.

Exercise: Check that S(x) satisfies the conditions (1), (2), and (3) and the free boundary conditions.
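The exercise can be carried out mechanically. The short Python script below (an illustrative check of the worked example, not part of the course's Matlab material) encodes the two pieces together with their first two derivatives and asserts all of the required conditions:

```python
# The two pieces of the natural cubic spline found above
S0 = lambda x: 0.5 + (17/8)*x - (5/8)*x**3                          # on [0, 1]
S1 = lambda x: 2 + (1/4)*(x-1) - (15/8)*(x-1)**2 + (5/8)*(x-1)**3   # on [1, 2]

# First and second derivatives of each piece
dS0  = lambda x: 17/8 - (15/8)*x**2
dS1  = lambda x: 1/4 - (15/4)*(x-1) + (15/8)*(x-1)**2
d2S0 = lambda x: -(15/4)*x
d2S1 = lambda x: -15/4 + (15/4)*(x-1)

# Condition (1): interpolation of the data (0,0.5), (1,2), (2,1)
assert abs(S0(0) - 0.5) < 1e-12 and abs(S0(1) - 2) < 1e-12
assert abs(S1(1) - 2) < 1e-12 and abs(S1(2) - 1) < 1e-12
# Conditions (2) and (3): C^1 and C^2 continuity at x = 1
assert abs(dS0(1) - dS1(1)) < 1e-12
assert abs(d2S0(1) - d2S1(1)) < 1e-12
# Free (natural) boundary conditions: S''(0) = S''(2) = 0
assert abs(d2S0(0)) < 1e-12 and abs(d2S1(2)) < 1e-12
```

All assertions pass, confirming interpolation, C^1 and C^2 continuity at x = 1, and the natural boundary conditions.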
[Figure: the two pieces S_0(x) and S_1(x) of the natural cubic spline through (0, 0.5), (1, 2), (2, 1).]

4.6 Splines in Matlab

The function interp1 of Matlab provides piecewise interpolation functions of various kinds. For given data point vectors X and Y and a set of arbitrary points XI,

>> YI = interp1(X,Y,XI,'option');

returns the values YI = P(XI) using the piecewise interpolation specified in the variable option, which is a character variable. Two possible options are 'spline', which uses cubic spline interpolation with the not-a-knot condition (3), and 'linear', which uses piecewise linear interpolation. Use the help command for more options. When the option variable is omitted, the default is piecewise linear. As an example, execute the following Matlab program.
X=[0 1 2];
Y=[0.5 2 1];
XI=0:0.1:2;
YL = interp1(X,Y,XI,'linear');
YS = interp1(X,Y,XI,'spline');
figure
plot(XI,YL,'--','linewidth',2)
hold on
plot(XI,YS,'-','linewidth',2)
plot(X,Y,'o')
legend('Piecewise Linear','Cubic Splines')
print -depsc matlabspline.eps

[Figure: output of the program above, comparing the piecewise linear and cubic spline interpolants.]

Note that the cubic spline function found analytically in the previous example and the one computed here through Matlab are not exactly the same (the Matlab one has a strong hump to the right of the point (1,2)). This is not surprising, since S_0(x) and S_1(x) in the analytic example use the natural boundary conditions, while Matlab by default uses the not-a-knot boundary condition. (The option 'pchip', also accepted as 'cubic', provides a shape-preserving piecewise cubic Hermite interpolant instead.)

Remark: It can be shown that the natural cubic spline is the piecewise interpolation polynomial that minimizes the curvature of the interpolation curve passing through a given cloud of points (x_i, y_i); i.e., the natural cubic spline function S(x) satisfies

∫_{x_0}^{x_m} [S''(x)]^2 dx ≤ ∫_{x_0}^{x_m} [f''(x)]^2 dx

for all (twice continuously differentiable) functions f(x) such that f(x_i) = y_i.

A somewhat more natural Matlab function for spline interpolation is called spline. It works in the same way as interp1 except that it doesn't require the option 'spline'. Use the help command to learn more about spline interpolation in Matlab.

4.7 Theory of least square approximation

Let x_1, x_2, ..., x_N be N distinct points and let y_1, y_2, ..., y_N be N associated values. The linear least square approximation consists in finding the polynomial P_m(x) = a_0 + a_1 x + ... + a_m x^m of degree m, for a given m, such that the quantity

Φ(a_0, a_1, ..., a_m) ≡ Σ_{i=1}^{N} |P_m(x_i) − y_i|^2

is as small as possible.
Remark: Notice that when N = m+1, i.e. there are exactly m+1 data points, the minimum is given by the interpolation polynomial, which satisfies P_m(x_i) = y_i, so that Φ(a_0, a_1, ..., a_m) = 0.

Assume that m ≤ N−1. From the remark above, it would be desirable to find the polynomial P_m(x) that passes through all the data points, i.e. such that P_m(x_i) = y_i, i = 1, 2, ..., N. Clearly this yields the following linear system for the (unknown) coefficients a_0, a_1, ..., a_m:

a_0 + a_1 x_1 + a_2 x_1^2 + ... + a_m x_1^m = y_1
a_0 + a_1 x_2 + a_2 x_2^2 + ... + a_m x_2^m = y_2
...
a_0 + a_1 x_N + a_2 x_N^2 + ... + a_m x_N^m = y_N

of N equations and m+1 unknowns. The associated matrix

V[x_1, x_2, ..., x_N] =
[ 1  x_1  x_1^2  ...  x_1^m ]
[ 1  x_2  x_2^2  ...  x_2^m ]
[ ...                       ]
[ 1  x_N  x_N^2  ...  x_N^m ]

called the Vandermonde matrix, after its discoverer, is a square non-singular matrix if N = m+1 and the points x_1, x_2, ..., x_N are distinct. Otherwise it has rank k = min(N, m+1) (i.e. it has full rank) if the points x_1, x_2, ..., x_N are distinct. More precisely, we have the following:

The (Vandermonde) linear system has a unique solution if N = m+1, providing the interpolation polynomial.
The (Vandermonde) linear system is over-determined if N > m+1.
The (Vandermonde) linear system is under-determined if N < m+1.

In the last two cases the system may have an infinite number of solutions or no solution at all.

Example: Let x_1 = 0, x_2 = 1, x_3 = 2, y_1 = 2, y_2 = 0.5, y_3 = 1. Assume we want to find the linear polynomial P_1(x) = a_0 + a_1 x such that

a_0 + a_1 x_1 = y_1
a_0 + a_1 x_2 = y_2
a_0 + a_1 x_3 = y_3
or

a_0 + 0 = 2
a_0 + a_1 = 0.5
a_0 + 2a_1 = 1

The first two equations yield a_0 = 2, a_1 = −1.5, but the last equation is not satisfied: 2 + 2(−1.5) = −1 ≠ 1. Thus the over-determined system has no solution. In words, the first two equations provide the linear interpolant that passes through (x_1, y_1) and (x_2, y_2), but the third point is way off. This is in fact not the optimal solution. An optimal solution, in the least-squares sense, is obtained by minimizing

Φ(a_0, a_1) = |a_0 + a_1 x_1 − y_1|^2 + |a_0 + a_1 x_2 − y_2|^2 + |a_0 + a_1 x_3 − y_3|^2.

From multi-variable calculus, we know that for a quadratic non-negative function of the two variables a_0, a_1, the minimum of Φ is obtained when its gradient is zero:

∇Φ(a_0, a_1) = (∂Φ/∂a_0, ∂Φ/∂a_1)^T = (0, 0)^T,

known as the partial derivatives test. We have

∂Φ/∂a_0 = 2(a_0 + a_1 x_1 − y_1) + 2(a_0 + a_1 x_2 − y_2) + 2(a_0 + a_1 x_3 − y_3)

and

∂Φ/∂a_1 = 2x_1(a_0 + a_1 x_1 − y_1) + 2x_2(a_0 + a_1 x_2 − y_2) + 2x_3(a_0 + a_1 x_3 − y_3).

Therefore ∇Φ(a_0, a_1) = 0 is equivalent to the linear system

3a_0 + a_1(x_1 + x_2 + x_3) = y_1 + y_2 + y_3
a_0(x_1 + x_2 + x_3) + a_1(x_1^2 + x_2^2 + x_3^2) = y_1 x_1 + y_2 x_2 + y_3 x_3

or, in numbers,

3a_0 + 3a_1 = 3.5
3a_0 + 5a_1 = 2.5

which yields the solution a_0 = 10/6, a_1 = −1/2 and the least square linear polynomial (m = 1)

P_1(x) = 10/6 − x/2.

Least square polynomial approximation

For general N, m, we have the following procedure. Given N data points (x_i, y_i), i = 1, 2, ..., N, we aim to minimize

Φ(a_0, a_1, ..., a_m) = Σ_{i=1}^{N} |a_0 + a_1 x_i + a_2 x_i^2 + ... + a_m x_i^m − y_i|^2.
The partial derivatives test for minimization,

∂Φ/∂a_0 = 0, ∂Φ/∂a_1 = 0, ..., ∂Φ/∂a_m = 0,

yields

Σ_{i=1}^{N} (a_0 + a_1 x_i + ... + a_m x_i^m) = Σ_{i=1}^{N} y_i
Σ_{i=1}^{N} x_i (a_0 + a_1 x_i + ... + a_m x_i^m) = Σ_{i=1}^{N} y_i x_i
...
Σ_{i=1}^{N} x_i^m (a_0 + a_1 x_i + ... + a_m x_i^m) = Σ_{i=1}^{N} y_i x_i^m

For a given set of data points (x_i, y_i), i = 1, 2, ..., N, let us denote for convenience by <f,g> the discrete inner product

<f,g> = Σ_{i=1}^{N} f(x_i) g(x_i), so that <x^k, x^j> = Σ_{i=1}^{N} x_i^k x_i^j = Σ_{i=1}^{N} x_i^(k+j) and <x^k, y> = Σ_{i=1}^{N} x_i^k y_i.

The system above can then be written symbolically as

[ <1,1>    <1,x>    ...  <1,x^m>   ] [ a_0 ]   [ <1,y>   ]
[ <x,1>    <x,x>    ...  <x,x^m>   ] [ a_1 ]   [ <x,y>   ]
[ <x^2,1>  <x^2,x>  ...  <x^2,x^m> ] [ a_2 ] = [ <x^2,y> ]      (4.2)
[ ...                              ] [ ... ]   [ ...     ]
[ <x^m,1>  <x^m,x>  ...  <x^m,x^m> ] [ a_m ]   [ <x^m,y> ]

We can show that the matrix

A = [ <x^i, x^j> ], i, j = 0, 1, ..., m,

satisfies A = V^T V, where V is the Vandermonde matrix introduced above. The matrix A is positive definite if at least m+1 points among the data x_1, x_2, ..., x_N are distinct. Thus the system above has a unique solution a_0, a_1, ..., a_m if N ≥ m+1 (and at least m+1 points are distinct); i.e., there exists a unique least square polynomial P_m(x) = a_0 + a_1 x + ... + a_m x^m that minimizes the functional Φ. In other words, we have just proved the following theorem.

Theorem 11 Let (x_i, y_i) be N data points where at least m+1 points are distinct (N ≥ m+1). Then there exists a unique least square polynomial P_m(x) = a_0 + a_1 x + ... + a_m x^m, whose coefficients
99 a 0,a 1,,a m are given by the solution of the system (4.2) which minimizes the functional Φ(a 0,a 1,,a m ) = N a 0 +a 1 x i +a 2 x 2 i + +a m x m i y i 2. i=1 Remark: The system (4.2) is sometimes referred to as the normal equations and the least square approximation is also know as linear regression Notion of basis and generalization of linear least square approximation The least square polynomial, P m (x), constructed above can be viewed as a linear combination of the monomials 1,x,x 2,,x m and the least square approximation as an orthogonal projection of an arbitrary function f(x) that passes through the data points (x i,y i ),i = 1,2,,m, with respect to the discrete inner product.,.. In principle the monomials 1,x,,x m can be replaced by any given set of linearly independent functions. This is illustrated below by an example. Example: Let x 1,x 2,,x N and y 1,y 2,,y N be a set of data points and associated values, respectively. Assume that according to some theory related to the nature of the data, we know that the data maybe represented by a function of the form g(x) = a 1 x+a 2 e x but the constants a 1,a 2 are not known. Assume that the data (x i,y i ),i = 1,,N are obtained by some measurements that are not perfect. Therefore, our task is to determine the coefficients a 1,a 2 so that g(x) best fits the data without expecting it to passe exactly through all the data points. One way to achieve such approximation is via least square approximation that amounts to minimizing N Φ(a 1,a 2 ) = a 1 x i +a 2 e x i y i 2 As a simple numerical example, let us assume that N = 3 and Then We have and i=1 x 1 = 1,x 2 = 2,x 3 = 3,y 1 = 2,y 2 = 3,y 3 = 5. Φ(a 1,a 2 ) = a 1 x 1 +a 2 e x 1 y a 1 x 2 +a 2 e x 2 y a 1 x 3 +a 2 e x 3 y 3 2. Φ a 1 = x 1 (a 1 x 1 +a 2 e x 1 y 1 )+x 2 (a 1 x 2 +a 2 e x 2 y 2 )+x 3 (a 1 x 3 +a 2 e x 3 y 3 ) Φ a 2 = e x 1 (a 1 x 1 +a 2 e x 1 y 1 )+e x 2 (a 1 x 2 +a 2 e x 2 y 2 )+e x 3 (a 1 x 3 +a 2 e x 3 y 3 ). 
The minimum is obtained when

∂Φ/∂a_1 = ∂Φ/∂a_2 = 0,
which yields the system

(x_1^2 + x_2^2 + x_3^2) a_1 + (x_1 e^(x_1) + x_2 e^(x_2) + x_3 e^(x_3)) a_2 = x_1 y_1 + x_2 y_2 + x_3 y_3
(x_1 e^(x_1) + x_2 e^(x_2) + x_3 e^(x_3)) a_1 + (e^(2x_1) + e^(2x_2) + e^(2x_3)) a_2 = e^(x_1) y_1 + e^(x_2) y_2 + e^(x_3) y_3

or

[ 14                e + 2e^2 + 3e^3 ] ( a_1 )   (        23        )
[ e + 2e^2 + 3e^3   e^2 + e^4 + e^6 ] ( a_2 ) = ( 2e + 3e^2 + 5e^3 )

whose solution is a_1 ≈ 1.594, a_2 ≈ 0.0088, or

g(x) ≈ 1.594x + 0.0088 e^x.

Linear vs. non-linear least square approximation

The polynomial least square approximation and the other least square approximation seen in the example above, using the basis functions x and e^x, are both called linear least squares, because both the polynomial P_m(x) and the function g(x) = a_1 x + a_2 e^x depend linearly on the coefficients a_0, a_1, ..., a_m and a_1, a_2, respectively. A least square approximation is said to be non-linear if the approximating function depends non-linearly on the minimizing variables (e.g. the coefficients a_0, a_1, ..., a_m, etc.). Examples of non-linear least square approximations are given in the set of problems below; see problems 8 and 9.

4.8 General least squares in Matlab

In addition to the function polyfit, which can be used to find the least square polynomial approximation of degree m for a given set of N data points, Matlab can be used to solve almost any least square problem that involves the minimization of a non-negative function Φ(a_1, a_2, ..., a_n). It suffices to apply the function fsolve to the system of optimization equations ∇Φ = 0, which determines the minimizer (a_1*, a_2*, ..., a_n*) of the function Φ when it exists.

4.9 Problems

1. For each one of the data sets given in tables a) and b) below, use the polyfit function of Matlab to find the least square polynomial approximation of degree at most one, two, three, and four, respectively. Plot both the data (using symbols only, e.g. plot(xi,yi,'o')) and the 4 polynomials on the same graph. Use 4 different line styles for the different polynomials (e.g. >>plot(X,Y,'--') makes a dashed line plot, >>plot(X,Y,':') makes a dotted line, etc. Use >>help plot to find out more options.
You can also use the figure properties menu to change the lines and symbols for each plot, etc.) To plot the different polynomials obtained by polyfit, you can use either ezplot, which is not recommended for large degree polynomials, or the following more efficient method: first use polyval to evaluate your polynomial on a fairly dense discrete set of points, e.g. x = a:.1:b, where a can be set to the leftmost interpolation point and b to the rightmost one. Here is an example:
101 >>xi=[ ]; >>yi=[ ]; >>figure(1) >>plot(xi,yi, ro ) >>hold on >>p1=polyfit(xi,yi,1) >>X=1:.1:2; %creates 11 equally spaced points in [1,2] >>Y=polyval(p1,X); >>plot(x,y, k, linewidth,2)% here linewidth,2 makes a tick line curve % and k means black color >>legend( data, p_1(x) ) a) b) x i y i x i y i For each of the data sets in problem 1, compute the total absolute error E 1 = N P m (x i ) y i i=1 where P m is the corresponding least square polynomial. 3. Consider the interpolation points x 0 = 0,x 1 = 0.6,x 2 = 0.9. For each one of the functions ( i),ii),iii),iv) ) below do the following. (a) Find the interpolation polynomial of degree at most one, P 1 (x), using x 0,x 1, as interpolation points. (b) Find the interpolation polynomial of degree at most one, Q 1 (x), using x 0,x 2, as interpolation points. (c) Find the interpolation polynomial of degree at most two, P 2 (x), using all three points x 0,x 1,x 2, as interpolation points. (d) Evaluate f(x), P 1 (x),q 1 (x),p 2 (x) at x = 0.45, estimate the absolute errors: f(x) P 1 (x), f(x) Q 1 (x), f(x) P 2 (x) and compare. Explain. i) f(x) = cos(x), ii) f(x) = 1+x, (4.3) iii) f(x) = ln(1+x), iv) f(x) = tan(x). (4.4) Recall:The interpolation polynomial for a function f(x) associated with the interpolation points x 0,x 1,x 2,,x m (m+1 distinct points) is the polynomial of degree at most m, P m (x) that satisfies P m (x j ) = f(x j ),j = 0,1,,m. 100
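For problems 1 and 2, polyfit does all the work in Matlab; as a language-neutral illustration of what it computes in the degree-one case, here is the same least square fit solved via the normal equations in Python, using the small data set from the worked example of Section 4.7 (substitute the tables' data in practice):

```python
# Data from the worked example of Section 4.7: (0,2), (1,0.5), (2,1),
# for which the notes found P_1(x) = 10/6 - x/2.
xs, ys = [0.0, 1.0, 2.0], [2.0, 0.5, 1.0]
n = len(xs)

# Normal equations for P_1(x) = a0 + a1*x:
#   n*a0      + (sum x_i)*a1   = sum y_i
#   (sum x_i)*a0 + (sum x_i^2)*a1 = sum x_i*y_i
Sx  = sum(xs)
Sxx = sum(x*x for x in xs)
Sy  = sum(ys)
Sxy = sum(x*y for x, y in zip(xs, ys))
det = n*Sxx - Sx*Sx
a0 = (Sy*Sxx - Sx*Sxy)/det
a1 = (n*Sxy - Sx*Sy)/det

# Total absolute error E_1 as defined in problem 2
E1 = sum(abs(a0 + a1*x - y) for x, y in zip(xs, ys))
```

On this data the normal equations reproduce a0 = 10/6 and a1 = −1/2 exactly, with E_1 = 4/3.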
102 4. Here we assume that f(x) is a function, which is know only on a finite set of points, x 0,x 1,x 2,,x m. We propose to use polynomial interpolation using the given point values, f(x 0 ),f(x 1 ),,f(x m ) to approximate f(x) at some intermediate point x. Use Lagrange polynomials to construct the appropriate interpolation polynomial using the given data points. (a) Approximate f(0.43) if (b) Approximate f(0) if f(0) = 1,f(0.25) = ,f(0.5) = ,f(0.75) = f( 0.5) = ,f( 0.25) = ,f(0.25) = ,f(0.5) = (c) Approximate f(0.18) if f(0.1) = ,f(0.2) = ,f(0.3) = ,f(0.4) = (d) Approximate f(0.25) if f( 1) = ,f( 0.5) = ,f(0) = ,f(0.5) = Let P m (x) be the interpolation polynomial for f using the uniformly spaced points x 0 = 0,x 1 = 1/m,,x m = 1, i.e, x j = j/m,j = 0, m, for each one of the functions given below. f(x) = e 2x,f(x) = x 2 cos(x) 3x,f(x) = ln(e x +2). (a) Find an upper bound for the absolute error P m (x) f(x) for m = 1 and m = 2, for each function f above. (b) For each function f, find the smallest value of m, if it exists, such that the interpolation error satisfies f(x) P m (x) < 10 6 for all x [0,1]. 6. Consider the normal cumulative distribution, with mean µ and standard deviation, σ N(x) = 1 x ( ) (t µ) 2 σ exp 2π 2σ 2 dt 0 (a) Use the matlab built in function normcdf to estimate N j = N(x j ) at the points x j = 10 + j/2,j = 0,1,,40, i.e, 41 points uniformly spaced on the interval [ 10,10], for a mean µ = 0 and a standard deviation σ = 1. You can do this in matlab using the following commands. >> X=-10:.5:10 and >>N=normcdf(X,0,1). Execute >>plot(x,n, o- ) to plot the discrete vector N j as a function of x j. (b) Repeat the above question with only 11 points and make the new (11 pts) plot on top of the previous (41 pts) one (use the hold on command). Also make sure you use different line and symbol options to distinguish the curves. Try >>plot(x,n, rx-- ) this time. Notice the difference. Can you explain what is going on? 
Which type of interpolation does the plot function assume between two consecutive data points?
(c) Use the polyfit function of Matlab to compute the interpolation polynomial for N(x) associated with the 11 data points in 6b. Use the polyval function of Matlab to evaluate your interpolation polynomial at the equally spaced points >>xi=-10:.1:10. Let yi be the values of P_10(xi). Plot yi as a function of xi, on the same figure. What do you see? Can you explain why?

(d) Now use spline interpolation to obtain a new set of values ysi = S(xi), using the interp1 function of Matlab with the 11 data points in 6b and the method option 'spline'. Plot ysi as a function of xi. Compare to your previous plots, then conclude. Hint: Use the legend command to name the different curves before you save or print your figure.

7. For table b) of problem 1, find the coefficients a, b that minimize

Σ_{i=1}^{N} |a x_i + b e^(x_i) − y_i|^2.

Compute the absolute error

E_2 = Σ_{i=1}^{N} |a x_i + b e^(x_i) − y_i|.

8. Repeat 7 for a, b that minimize

Σ_{i=1}^{N} |b e^(a x_i) − y_i|^2.

Hint: Set up the optimization equations, which are non-linear in this case, then solve them either by hand or by using the fsolve function of Matlab. Compute the absolute error

E_3 = Σ_{i=1}^{N} |b e^(a x_i) − y_i|.

9. Repeat 7 for a, b that minimize

Σ_{i=1}^{N} |b x_i^a − y_i|^2.

Compute the absolute error

E_4 = Σ_{i=1}^{N} |b x_i^a − y_i|.

10. Compare the errors E_1, E_2, E_3, E_4 computed above and decide which one of the basis functions is the most suitable for the given data.
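Problems 7–10 all follow the same pattern: write down the normal equations for the chosen basis and solve them. As a sketch (in Python, for illustration; the problem statements expect Matlab), here is the linear case g(x) = a x + b e^x solved on the toy data of Section 4.7.1, reproducing a ≈ 1.594:

```python
from math import exp

# Toy data from Section 4.7.1; for problem 7, use table b) instead.
xs, ys = [1.0, 2.0, 3.0], [2.0, 3.0, 5.0]

# Normal equations for g(x) = a*x + b*e^x:
#   <x,x> a   + <x,e^x> b   = <x,y>
#   <x,e^x> a + <e^x,e^x> b = <e^x,y>
sxx = sum(x*x for x in xs)
sxe = sum(x*exp(x) for x in xs)
see = sum(exp(2*x) for x in xs)
sxy = sum(x*y for x, y in zip(xs, ys))
sey = sum(exp(x)*y for x, y in zip(xs, ys))

# Solve the 2x2 system by Cramer's rule
det = sxx*see - sxe*sxe
a = (sxy*see - sxe*sey)/det
b = (sxx*sey - sxe*sxy)/det

# Absolute error, as defined in problem 7
E2 = sum(abs(a*x + b*exp(x) - y) for x, y in zip(xs, ys))
```

The nonlinear cases of problems 8 and 9 lead to a similar pair of equations in a and b, except that those equations can no longer be solved by linear algebra alone, hence the fsolve hint.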
Chapter 5

Numerical Integration

Given a function f(x) on [a,b], we consider the integral

S = ∫_a^b f(x) dx.

Except for some particular cases of elementary functions, finding a closed form for such an integral is often impossible. Numerical integration, or quadrature, formulas provide an alternative: they approximate such integrals in the form of a finite sum,

S ≈ Σ_{i=0}^{m} a_i f(x_i),

where the a_i, i = 0, 1, ..., m, are predefined constants known as the integration coefficients and the x_i, i = 0, 1, ..., m, are m+1 distinct points in [a,b].

Example: Simple integration formulas

1) The rectangular rule consists in approximating the area under the graph of f(x) on [a,b] by that of the rectangle of width b−a and height f(a) or f(b):

∫_a^b f(x) dx ≈ (b−a) f(a) or ∫_a^b f(x) dx ≈ (b−a) f(b).

[Figure: the rectangle rule on [a,b].]
2) The trapezoidal rule approximates the integral with the trapezoid of base [a,b] and side lines f(a) and f(b):

∫_a^b f(x) dx ≈ (b−a) (f(a) + f(b))/2.

[Figure: the trapezoidal rule on [a,b].]

3) The mid-point rule uses a rectangle of base [a,b] and height f((a+b)/2):

∫_a^b f(x) dx ≈ (b−a) f((a+b)/2).

[Figure: the mid-point rule on [a,b].]

Example: Assume f(x) = ln(1+x), x ∈ [0,1], and consider the integral

S = ∫_0^1 ln(1+x) dx.

The rectangle rule yields

∫_0^1 ln(1+x) dx ≈ (1−0) f(0) = ln(1) = 0

or

∫_0^1 ln(1+x) dx ≈ (1−0) f(1) = ln(2) ≈ 0.693.

With the midpoint rule we have
∫_0^1 ln(1+x) dx ≈ f(0.5)(1−0) = ln(3/2) ≈ 0.405,

while the trapezoidal rule gives

∫_0^1 ln(1+x) dx ≈ (f(0) + f(1))/2 · (1−0) = ln(2)/2 ≈ 0.347.

The exact value of this integral is

∫_0^1 ln(1+x) dx = [(1+x)ln(1+x) − x]_0^1 = 2 ln(2) − 1 ≈ 0.386.

Comparing with this exact value, we can see that both the mid-point and the trapezoidal rules are reasonably good approximations, with relative errors of 0.019/0.386 = 4.92% and 0.0397/0.386 ≈ 10.3%, respectively, the mid-point rule being slightly better, while the rectangle rule has at best an error of 0.306/0.386 = 79.27%.

Remark: Note that the rectangle and the mid-point formulas are simply the integrals of the constant polynomials p_0(x) = f(a) or p_0(x) = f(b) and p_0(x) = f((a+b)/2), respectively, while the trapezoidal rule is the integral of the first degree polynomial

p_1(x) = (x−a)/(b−a) · (f(b) − f(a)) + f(a),

i.e. the linear interpolation polynomial associated with the two data points (a, f(a)) and (b, f(b)). In fact, the rectangle and mid-point rules likewise integrate degree zero interpolation polynomials associated with a single data point.

5.1 Newton-Cotes Integration

Consider ∫_a^b f(x) dx and let x_0 < x_1 < ... < x_m be m+1 distinct points in [a,b]. Recall the interpolation polynomial of f(x) associated with the points x_0, x_1, ..., x_m,

P_m(x) = L_0(x) f(x_0) + L_1(x) f(x_1) + ... + L_m(x) f(x_m),

where L_0(x), L_1(x), ..., L_m(x) are the Lagrange polynomials associated with x_0, x_1, ..., x_m:

L_j(x) = ∏_{i=0, i≠j}^{m} (x − x_i)/(x_j − x_i).

To approximate the integral ∫_a^b f(x) dx, we use the approximation f(x) ≈ P_m(x) in [a,b], so that

∫_a^b f(x) dx ≈ ∫_a^b P_m(x) dx = f(x_0) ∫_a^b L_0(x) dx + f(x_1) ∫_a^b L_1(x) dx + ... + f(x_m) ∫_a^b L_m(x) dx.
Then we obtain the following integration formula, known as the Newton-Cotes quadrature formula:

∫_a^b f(x) dx ≈ a_0 f(x_0) + a_1 f(x_1) + ... + a_m f(x_m) = Σ_{j=0}^{m} a_j f(x_j),      (5.1)

where a_j = ∫_a^b L_j(x) dx, j = 0, 1, ..., m.

Remark: If x_0 = a and x_m = b, the Newton-Cotes formula is said to be closed, and if a < x_0 and x_m < b, the Newton-Cotes formula is said to be open.

Some classical Newton-Cotes formulas

Rectangle rule (m = 0, x_0 = a): The single Lagrange polynomial associated with x_0 = a is L_0(x) = 1 and the interpolation polynomial of degree zero is P_0(x) = f(a). Therefore

∫_a^b f(x) dx ≈ ∫_a^b P_0(x) dx = f(a)(b−a).

Mid-point rule (m = 0, x_0 = (a+b)/2): We have L_0(x) = 1 and P_0(x) = f((a+b)/2), which yields

∫_a^b f(x) dx ≈ f((a+b)/2)(b−a).

Trapezoidal rule (m = 1, x_0 = a, x_1 = b): We have L_0(x) = (x−b)/(a−b), L_1(x) = (x−a)/(b−a), which yields

P_1(x) = (x−b)/(a−b) · f(a) + (x−a)/(b−a) · f(b) = (x−a)/(b−a) · (f(b) − f(a)) + f(a),

i.e.

∫_a^b f(x) dx ≈ ∫_a^b [(x−a)/(b−a) · (f(b) − f(a)) + f(a)] dx = (1/2)(b−a)(f(b) − f(a)) + f(a)(b−a),

so that

∫_a^b f(x) dx ≈ (f(b) + f(a))/2 · (b−a),

which is in fact the trapezoidal rule introduced above. Now we push the approximation a little further, to m = 2, to yield the celebrated Simpson's rule.
Simpson's rule (m = 2, x_0 = a, x_1 = (a+b)/2, x_2 = b): For convenience we let h = (b−a)/2. Then

L_0(x) = (x − x_1)(x − x_2)/((x_0 − x_1)(x_0 − x_2)) = (1/(2h^2))(x − a − h)(x − a − 2h),
L_1(x) = (x − x_0)(x − x_2)/((x_1 − x_0)(x_1 − x_2)) = −(1/h^2)(x − a)(x − a − 2h),
L_2(x) = (x − x_0)(x − x_1)/((x_2 − x_0)(x_2 − x_1)) = (1/(2h^2))(x − a)(x − a − h).

We have

∫_a^(a+2h) L_0(x) dx = h/3, ∫_a^(a+2h) L_1(x) dx = 4h/3, ∫_a^(a+2h) L_2(x) dx = h/3.

The details of these integrals are left as an exercise. Then

∫_a^b f(x) dx ≈ (h/3)[f(x_0) + 4f(x_1) + f(x_2)],

or

∫_a^b f(x) dx ≈ ((b−a)/6)[f(a) + 4f((a+b)/2) + f(b)].

This integration formula is known as Simpson's rule.

Remarks:

1) Note that according to the definition above, both the trapezoidal and Simpson's rules are closed Newton-Cotes quadratures, while the mid-point rule is open.

2) Simpson's rule can be viewed as a combination of the trapezoidal (1/3) and mid-point (2/3) rules. Indeed, we have

((b−a)/6)[f(a) + 4f((a+b)/2) + f(b)] = (1/3) · ((b−a)/2)[f(a) + f(b)] + (2/3) · (b−a) f((a+b)/2).

Numerical example:

1. f(x) = x^2 on [0,2]. The trapezoidal rule yields

∫_0^2 x^2 dx ≈ (2/2)(0 + 4) = 4,

while Simpson's rule gives

∫_0^2 x^2 dx ≈ (2/6)(0 + 4·1 + 4) = 8/3.
Let's compute the exact value of this integral:

∫_0^2 x^2 dx = [x^3/3]_0^2 = 8/3,

i.e. Simpson's rule yields the exact integral, while the trapezoidal rule has a relative error of 4/(8/3) − 1 = 3/2 − 1 = 50%. In fact, we will see later in this chapter that Simpson's rule is exact for all polynomials of degree 3 or less.

2. f(x) = x^4 on [0,2]. The exact value is ∫_0^2 x^4 dx = [x^5/5]_0^2 = 32/5. The trapezoidal rule yields

∫_0^2 x^4 dx ≈ (2/2)(0 + 16) = 16,

a relative error of 16/(32/5) − 1 = 150%, and Simpson's rule gives

∫_0^2 x^4 dx ≈ (2/6)[0 + 4·1 + 16] = 20/3,

a relative error of (20/3)/(32/5) − 1 = 4.17%.

3. f(x) = 1/(1+x) on [0,2]. The exact value is

∫_0^2 dx/(1+x) = [ln(1+x)]_0^2 = ln(3) ≈ 1.0986.

The trapezoidal rule gives

∫_0^2 dx/(1+x) ≈ (2/2)(1 + 1/3) = 4/3,

a relative error of (4/3)/ln(3) − 1 ≈ 21.4%, and Simpson's rule

∫_0^2 dx/(1+x) ≈ (2/6)[1 + 4·(1/2) + 1/3] = 10/9 ≈ 1.1111,

with a relative error of 10/(9 ln 3) − 1 = 1.14%.

Clearly, at least for these examples, Simpson's rule gives much better approximations. Even in the worst case, example 2, where the trapezoidal rule has an error of 150%, Simpson's rule has an error of only about 4%!

Exercise: Repeat the three examples above using the mid-point and the rectangle rules and compare the results.

5.2 Integration error

Assume f is C^∞ on [a,b], i.e. f has infinitely many derivatives. Recall that the interpolation polynomial P_m(x) of f for a given set of points a ≤ x_0 < x_1 < ... < x_m ≤ b satisfies

f(x) = P_m(x) + e_m(x),
where the interpolation error e_m(x) is given by

e_m(x) = f^(m+1)(ξ)/(m+1)! · ∏_{i=0}^{m} (x − x_i).

Thus we have

∫_a^b f(x) dx = ∫_a^b P_m(x) dx + ∫_a^b e_m(x) dx,

or

S = S_m + E_m,

where S = ∫_a^b f(x) dx is the exact (continuous) integral, S_m = Σ_{i=0}^{m} a_i f(x_i) is the numerical integral or quadrature formula, and

E_m = ∫_a^b [f(x) − P_m(x)] dx = ∫_a^b f^(m+1)(ξ)/(m+1)! · ∏_{i=0}^{m} (x − x_i) dx

is known as the integration error. Below we derive upper bounds on this integration error for the basic formulas seen above. A more general derivation of error bounds, for the general case of open and closed Newton-Cotes formulas, can be found in all good textbooks on numerical analysis; see for example Burden and Faires.

Rectangular rule (1 point formula, m = 0): x_0 = a, P_0(x) = f(a), and e_0(x) = f'(ξ)(x−a). This yields

E_R = ∫_a^b f'(ξ(x))(x−a) dx.

Recall the mean-value theorem for integrals.

Theorem 12 (Mean value theorem for integrals) Let f(x) and g(x) be two continuous functions on [a,b]. If g(x) ≥ 0 on [a,b], then there exists c ∈ [a,b] such that

∫_a^b f(x) g(x) dx = f(c) ∫_a^b g(x) dx.

Applying this theorem to the integral E_R above, with g(x) = (x−a), we get

E_R = f'(η) ∫_a^b (x−a) dx = (b−a)^2/2 · f'(η)

for some η ∈ [a,b]. Therefore we have the following error upper bound for the rectangle rule:

|E_R| ≤ (b−a)^2/2 · M_1, where M_1 = max_{x∈[a,b]} |f'(x)|.
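To see the bound in action, the following Python snippet (our own numerical illustration; the integrand e^x is an arbitrary smooth test function, not one from the notes) compares the actual rectangle-rule error with (b−a)^2 M_1 / 2:

```python
from math import exp, e

# Rectangle rule error vs. the bound |E_R| <= (b-a)^2/2 * M1
# for the sample integrand f(x) = e^x on [0, 1].
f = lambda x: exp(x)
a, b = 0.0, 1.0

rect = (b - a)*f(a)      # rectangle rule with height f(a)
exact = e - 1            # integral of e^x over [0, 1]
err = abs(rect - exact)  # actual error, = e - 2 here

M1 = e                   # max of |f'(x)| = e^x on [0, 1]
bound = (b - a)**2/2*M1  # the theoretical bound, = e/2
```

Here err ≈ 0.718 while the bound is ≈ 1.359; the bound holds but, as usual with worst-case estimates, it is not tight.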
Trapezoidal rule (two points, m = 1):

E_T = ∫_a^b (f(x) − P_1(x)) dx = ∫_a^b e_1(x) dx = ∫_a^b (1/2) f''(ξ(x))(x−a)(x−b) dx.

Applying the mean value theorem with g(x) = (x−a)(x−b), which keeps one sign on [a,b], yields

E_T = (1/2) f''(η) ∫_a^b (x−a)(x−b) dx = (1/2) f''(η) ∫_0^(b−a) t(t − (b−a)) dt,

where we made the change of variable t = x−a, i.e.

E_T = (1/2) f''(η) [t^3/3 − t^2 (b−a)/2]_0^(b−a) = −(b−a)^3/12 · f''(η),

so that

|E_T| ≤ (b−a)^3/12 · M_2, where M_2 = max_{x∈[a,b]} |f''(x)|.

Midpoint rule (1 point, m = 0):

E_M = ∫_a^b f'(ξ)(x − (a+b)/2) dx.

This one is a little tricky: because (x − (a+b)/2) changes sign inside the interval [a,b], we cannot use the mean-value theorem directly as done above. Instead, we let x_0 = (a+b)/2 and consider the Taylor expansion of f(x) about x_0,

f(x) = f(x_0) + f'(x_0)(x − x_0) + (1/2) f''(ξ(x))(x − x_0)^2, ξ(x) ∈ [x_0, x] or [x, x_0].

We have

E_M = ∫_a^b (f(x) − P_0(x)) dx = ∫_a^b (f(x) − f(x_0)) dx = ∫_a^b [f'(x_0)(x − x_0) + (1/2) f''(ξ)(x − x_0)^2] dx.

But

∫_a^b f'(x_0)(x − x_0) dx = f'(x_0) ∫_a^b (x − x_0) dx = 0, by symmetry.

Thus

E_M = ∫_a^b (1/2) f''(ξ(x))(x − x_0)^2 dx = (1/2) f''(η) ∫_a^b (x − x_0)^2 dx = (1/24) f''(η)(b−a)^3.

Therefore we have the upper bound

|E_M| ≤ (b−a)^3/24 · M_2 = h^3/3 · M_2,

where h = (b−a)/2.
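These two bounds can be verified on the chapter's running example S = ∫_0^1 ln(1+x) dx: there f''(x) = −1/(1+x)^2, so M_2 = 1 on [0,1]. A small Python check (illustration only, not part of the notes' Matlab material):

```python
from math import log

# Trapezoid and mid-point errors vs. their bounds for S = int_0^1 ln(1+x) dx.
f = lambda x: log(1 + x)
a, b = 0.0, 1.0
exact = 2*log(2) - 1          # the exact value computed earlier

trap = (b - a)*(f(a) + f(b))/2
mid  = (b - a)*f((a + b)/2)
err_T = abs(trap - exact)
err_M = abs(mid - exact)

M2 = 1.0                      # max |f''| = max 1/(1+x)^2 on [0,1]
bound_T = (b - a)**3/12*M2
bound_M = (b - a)**3/24*M2
```

Both errors sit below their bounds (err_T ≈ 0.040 < 0.083 and err_M ≈ 0.019 < 0.042), and the mid-point error is roughly half the trapezoid error, consistent with the factor-of-two gap between the two formulas.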
Simpson's rule (three points, m = 2):

    E_S = \int_a^b \big( f(x) - P_2(x) \big)\,dx = \int_a^b \frac{1}{6} f'''(\xi(x))(x-a)\left( x - \frac{a+b}{2} \right)(x-b)\,dx.

Again, because (x-a)(x-(a+b)/2)(x-b) changes sign within [a,b], the mean value theorem cannot be applied directly; instead we need to Taylor expand f(x) about (a+b)/2 to order 3, which yields

    E_S = -\frac{h^5}{90} f^{(4)}(\eta),  where h = (b-a)/2.

The details are left as an exercise for the students.

5.3 Order of accuracy

Definition 14 A numerical integration or quadrature formula

    \int_a^b f(x)\,dx \approx \sum_{j=1}^{N} a_j f(x_j),

where the x_j are N distinct points, is said to have a degree of accuracy (or precision) n \ge 0 if it is exact for all polynomials p_n(x) of degree \le n, i.e.,

    \int_a^b p_n(x)\,dx = \sum_{j=1}^{N} a_j p_n(x_j)

(note the equality sign = rather than an approximation). In other words, the integration error is zero: E = \int_a^b \big( p_n(x) - P_{N-1}(x) \big)\,dx = 0, where P_{N-1} is the interpolation polynomial of p_n(x) associated with the points x_1, x_2, \ldots, x_N of the quadrature formula.

Remark (Theorem): Because of the linearity of both the integral and the quadrature sum, one easily shows that a given quadrature formula \int_a^b f(x)\,dx \approx \sum_{j=1}^{N} a_j f(x_j) has accuracy n if and only if it is exact for all monomials x^k, k \le n, i.e.,

    \int_a^b x^k\,dx = \sum_{j=1}^{N} a_j x_j^k,  k = 0, 1, \ldots, n.

For Newton-Cotes formulas, we can show that the integration error is in general given by

    E_N = C (b-a)^{K+1} f^{(K)}(\eta),
where C is some constant, K is a non-negative integer (K \ge N - 1), and f^{(K)} is the K-th derivative of the integrand f, evaluated at some point \eta. We can therefore read the order of accuracy of a Newton-Cotes formula directly off the integer K appearing in its error formula. Because the k-th derivative of a polynomial p_n(x) of degree n vanishes whenever k \ge n + 1, i.e.,

    \frac{d^k p_n(x)}{dx^k} = 0,  k \ge n + 1,

the integration error of a Newton-Cotes formula applied to p_n(x) is necessarily zero whenever n \le K - 1. Therefore the precision of a Newton-Cotes formula with error E = C(b-a)^{K+1} f^{(K)}(\eta) is exactly K - 1.

Applying this principle to the quadrature formulas seen so far yields:

The rectangle rule has order of accuracy n = 0: it is exact for constants only.

The midpoint and trapezoidal rules have the same order of accuracy n = 1: they are exact for linear functions f(x) = ax + b.

Simpson's rule has order of accuracy n = 3, i.e., it is exact for all polynomials of degree \le 3.

Remark: Note that the midpoint rule uses only one point, i.e., a single function evaluation, yet has the same order of accuracy as the trapezoidal rule. In fact it is slightly better: as their respective error formulas show, roughly |E_M| = \frac{1}{2} |E_T|. Something similar applies to Simpson's rule: with only three points, it yields a degree of precision of 3.

One fundamental question, however, is: given a certain number of points x_1, x_2, \ldots, x_N, what is the maximum precision that can be achieved? The answer is 2N - 1, and it is attained by Gauss integration, which is introduced below in Section 5.5. Before that, we discuss composite integration rules, which are useful when we have a large number of integration points.

5.4 Composite integration rules

Note that the error of all the Newton-Cotes quadrature formulas seen so far (i.e., rectangle, midpoint, trapezoidal, Simpson) grows as the length of the integration interval [a,b] is increased.
Therefore, such methods should be avoided on large integration intervals: try Simpson's rule on the integral of f(x) = e^x over [0, 10], for example! Higher-order Newton-Cotes integration formulas (i.e., those based on high-order polynomial interpolation) should be avoided as well, because of the Runge phenomenon. The remedy to this problem is to use what is known
as composite integration formulas, which consist in subdividing the interval [a,b] into small subintervals, in the same manner as for piecewise polynomial interpolation, and applying a low-order Newton-Cotes quadrature on each of the small intervals.

Precisely, let a = x_0 < x_1 < \cdots < x_m = b be a subdivision of the interval [a,b] into m small subintervals [x_i, x_{i+1}], i = 0, 1, \ldots, m-1. Recall from calculus that

    \int_a^b f(x)\,dx = \int_{x_0}^{x_1} f(x)\,dx + \int_{x_1}^{x_2} f(x)\,dx + \cdots + \int_{x_{m-1}}^{x_m} f(x)\,dx.

We obtain a composite integration formula of a certain type by applying the corresponding Newton-Cotes quadrature to each sub-integral \int_{x_i}^{x_{i+1}} f(x)\,dx. This is done below for the rectangle, trapezoidal, midpoint, and Simpson's rules.

Composite rectangle rule

    \int_a^b f(x)\,dx \approx f(x_0)(x_1 - x_0) + f(x_1)(x_2 - x_1) + \cdots + f(x_{m-1})(x_m - x_{m-1}).

If the points x_i are uniformly distributed (i.e., equidistant) such that x_{i+1} - x_i = h, i = 0, 1, \ldots, m-1, then the composite rectangle rule simplifies to

    \int_a^b f(x)\,dx \approx h \big( f(x_0) + f(x_1) + \cdots + f(x_{m-1}) \big) = h \sum_{i=0}^{m-1} f(a + ih).

Composite trapezoidal rule

    \int_a^b f(x)\,dx \approx \frac{f(x_0)+f(x_1)}{2}(x_1 - x_0) + \frac{f(x_1)+f(x_2)}{2}(x_2 - x_1) + \cdots + \frac{f(x_{m-1})+f(x_m)}{2}(x_m - x_{m-1}).

If the points x_i are uniformly distributed (i.e., equidistant) such that x_{i+1} - x_i = h, then the composite trapezoidal rule simplifies to

    \int_a^b f(x)\,dx \approx h \left( \frac{f(x_0)+f(x_m)}{2} + f(x_1) + \cdots + f(x_{m-1}) \right) = h \left( \frac{f(a)+f(b)}{2} + \sum_{i=1}^{m-1} f(a + ih) \right).

Composite midpoint rule

    \int_a^b f(x)\,dx \approx f\!\left( \frac{x_0+x_1}{2} \right)(x_1 - x_0) + f\!\left( \frac{x_1+x_2}{2} \right)(x_2 - x_1) + \cdots + f\!\left( \frac{x_{m-1}+x_m}{2} \right)(x_m - x_{m-1}).

If the points x_i are uniformly distributed (i.e., equidistant) such that x_{i+1} - x_i = h, then the composite midpoint rule simplifies to

    \int_a^b f(x)\,dx \approx h \sum_{i=0}^{m-1} f(a + ih + h/2).
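For equidistant points, the composite rules above translate directly into code. A Python sketch (the notes use Matlab; the test integrand ∫_0^1 e^x dx is our own choice):

```python
import math

def comp_rectangle(f, a, b, m):
    """Composite rectangle rule with m equal subintervals of width h."""
    h = (b - a) / m
    return h * sum(f(a + i * h) for i in range(m))

def comp_trapezoid(f, a, b, m):
    """Composite trapezoidal rule with m equal subintervals."""
    h = (b - a) / m
    return h * (0.5 * (f(a) + f(b)) + sum(f(a + i * h) for i in range(1, m)))

def comp_midpoint(f, a, b, m):
    """Composite midpoint rule with m equal subintervals."""
    h = (b - a) / m
    return h * sum(f(a + i * h + h / 2.0) for i in range(m))

exact = math.e - 1.0  # ∫_0^1 e^x dx
err_rect = abs(comp_rectangle(math.exp, 0, 1, 100) - exact)
err_trap = abs(comp_trapezoid(math.exp, 0, 1, 100) - exact)
err_mid = abs(comp_midpoint(math.exp, 0, 1, 100) - exact)
```

With m = 100 the rectangle rule is still off in the third digit, while the trapezoidal and midpoint errors are of order 10^{-5}, with the midpoint error roughly half the trapezoidal one, consistent with the error formulas.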
Composite Simpson's rule

    \int_a^b f(x)\,dx \approx \frac{1}{6}\left( f(x_0) + 4 f\!\left( \frac{x_0+x_1}{2} \right) + f(x_1) \right)(x_1 - x_0) + \frac{1}{6}\left( f(x_1) + 4 f\!\left( \frac{x_1+x_2}{2} \right) + f(x_2) \right)(x_2 - x_1) + \cdots + \frac{1}{6}\left( f(x_{m-1}) + 4 f\!\left( \frac{x_{m-1}+x_m}{2} \right) + f(x_m) \right)(x_m - x_{m-1}).

If the points x_i are uniformly distributed (equidistant) such that x_{i+1} - x_i = h, i = 0, 1, \ldots, m-1, then the composite Simpson's rule simplifies to

    \int_a^b f(x)\,dx \approx \frac{h}{6}\big( f(a) + f(b) \big) + \frac{h}{3} \sum_{i=1}^{m-1} f(a + ih) + \frac{4h}{6} \sum_{i=0}^{m-1} f(a + ih + h/2).

Example: Consider the integral I = \int_0^4 e^x\,dx, whose exact value is I = e^4 - 1 \approx 53.5982. Using Simpson's rule on [0,4] yields the approximation

    I \approx \frac{4}{6}\left( e^0 + 4 e^2 + e^4 \right) \approx 56.7696.

The composite Simpson's rule with m = 2, i.e., the 2 subintervals [0,2] and [2,4], leads to

    I \approx \frac{1}{3}\left( e^0 + 4 e^1 + e^2 \right) + \frac{1}{3}\left( e^2 + 4 e^3 + e^4 \right) \approx 53.8638,

and with m = 4, the four subintervals [0,1], [1,2], [2,3], [3,4], we get

    I \approx \frac{1}{6}\left( e^0 + e^4 \right) + \frac{1}{3}\left( e^1 + e^2 + e^3 \right) + \frac{4}{6}\left( e^{1/2} + e^{3/2} + e^{5/2} + e^{7/2} \right) \approx 53.6162.

In the table below, we summarize the three approximations above and compute the associated absolute errors.

                        m = 1      m = 2      m = 4
    Approximate value   56.7696    53.8638    53.6162
    Absolute error      3.1714     0.2656     0.0181

The table suggests that the absolute error of the composite Simpson rule decreases as the number of subintervals m is increased, and that the composite Simpson rule will eventually converge to the exact integral I as m is increased to infinity, no matter how large the interval [a,b] is. In fact, this is confirmed by the error analysis done in the next subsection.

Exercise: Repeat this example for the composite trapezoidal and midpoint rules and make the same convergence observation.
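The worked example above is easy to reproduce programmatically. A Python sketch (the notes use Matlab; composite_simpson is our own helper name):

```python
import math

def composite_simpson(f, a, b, m):
    """Composite Simpson's rule: Simpson on each of m equal subintervals."""
    h = (b - a) / m
    total = 0.0
    for i in range(m):
        x0, x1 = a + i * h, a + (i + 1) * h
        total += (h / 6.0) * (f(x0) + 4.0 * f((x0 + x1) / 2.0) + f(x1))
    return total

exact = math.exp(4) - 1.0  # I = ∫_0^4 e^x dx ≈ 53.5982
approx = {m: composite_simpson(math.exp, 0, 4, m) for m in (1, 2, 4)}
errors = {m: abs(v - exact) for m, v in approx.items()}
```

Running this reproduces the table: roughly 56.77, 53.86, and 53.62, with the error shrinking each time m is doubled.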
5.4.1 Error analysis for the composite integration rules

Let us start with the composite trapezoidal rule. Let f be a smooth function defined on the interval [a,b]. Consider a uniform subdivision x_0 = a < x_1 < \cdots < x_m = b of [a,b] into m subintervals, and let h = x_{i+1} - x_i = (b-a)/m, i = 0, 1, \ldots, m-1. The analysis below also provides a formal proof that such composite rules converge to the exact integral as the size h of the subdivision goes to zero, or equivalently, as m goes to infinity. In general we have the following definition.

Definition 15 A numerical method is said to have an order of convergence k \ge 1 with respect to some parameter h (here the size of the subdivision) if the absolute error E between the exact solution and the approximation satisfies E = O(h^k), where O(h^k) is a generic notation in analysis for quantities that converge to zero as fast as h^k when h \to 0. Formally,

    \phi = O(g) \iff \lim_{g \to 0} \frac{\phi}{g} = C,

where C is some constant.

Applying the trapezoidal rule on each of the subintervals, and keeping the integration error as given in the previous section, yields

    \int_a^b f(x)\,dx = h \left( \frac{f(a)+f(b)}{2} + \sum_{i=1}^{m-1} f(a + ih) \right) - \frac{h^3}{12} \sum_{i=0}^{m-1} f''(\eta_i),

where \eta_i \in [x_i, x_{i+1}], i = 0, 1, \ldots, m-1. Since f''(x) is continuous on [a,b], we can apply the intermediate value theorem of calculus repeatedly to the sequence of values f''(\eta_i), i = 0, 1, \ldots, m-1. This guarantees the existence of \mu \in [a,b] such that

    f''(\mu) = \frac{1}{m} \sum_{i=0}^{m-1} f''(\eta_i).

Thus the composite trapezoidal rule, including the error term, becomes

    \int_a^b f(x)\,dx = h \left( \frac{f(x_0)+f(x_m)}{2} + \sum_{i=1}^{m-1} f(x_i) \right) - m \frac{h^3}{12} f''(\mu) = h \left( \frac{f(x_0)+f(x_m)}{2} + \sum_{i=1}^{m-1} f(x_i) \right) - (b-a) \frac{h^2}{12} f''(\mu),

where the last equality holds because mh = b - a. Therefore the global error of the composite trapezoidal rule is

    E_{TC} = -(b-a) f''(\mu) \frac{h^2}{12} = O(h^2),  \mu \in [a,b].
To summarize, through this discussion we have proved formally that the composite trapezoidal rule converges at a rate proportional to h^2, where h is the size of the subdivision. Thus the composite trapezoidal method is said to be of order two, or to have an order of convergence of two.
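The order-two claim can be verified numerically by halving h and computing the observed order log2(E(h)/E(h/2)). A Python sketch (the notes use Matlab; ∫_0^1 e^x dx is our own test integral):

```python
import math

def comp_trapezoid(f, a, b, m):
    """Composite trapezoidal rule with m equal subintervals."""
    h = (b - a) / m
    return h * (0.5 * (f(a) + f(b)) + sum(f(a + i * h) for i in range(1, m)))

exact = math.e - 1.0  # ∫_0^1 e^x dx
e1 = abs(comp_trapezoid(math.exp, 0, 1, 10) - exact)   # error at h = 0.1
e2 = abs(comp_trapezoid(math.exp, 0, 1, 20) - exact)   # error at h/2 = 0.05
observed_order = math.log2(e1 / e2)  # should be close to 2
```

Halving h divides the error by almost exactly 2^2 = 4, so the observed order comes out very close to 2.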
Similar calculations for the composite Simpson's rule yield

    \int_a^b f(x)\,dx = \frac{h}{6}\big( f(x_0) + f(x_m) \big) + \frac{h}{3} \sum_{i=1}^{m-1} f(x_i) + \frac{4h}{6} \sum_{i=0}^{m-1} f(x_i + h/2) - \frac{h^4}{2880}(b-a) f^{(4)}(\mu),  \mu \in [a,b],

i.e., the error of the composite Simpson's rule is given by

    E_{SC} = -\frac{h^4}{2880}(b-a) f^{(4)}(\mu) = O(h^4),

i.e., Simpson's method is of order four. The details of the calculations for the composite Simpson's rule are left as an exercise.

Remark: The error expressions above imply, in particular, that Simpson's rule converges much faster than the trapezoidal rule, in the sense that when the number of subintervals of the composite integration is doubled, or equivalently h is halved, the error of the composite trapezoidal rule is divided by 2^2 = 4, while the error of the composite Simpson's rule is divided by 2^4 = 16.

Remark: Notice that the order of convergence defined here is not the same as the order of accuracy (or degree of precision) defined in the previous section for the Newton-Cotes methods. In fact, as a general statement, the order of convergence k and the order of accuracy n satisfy the relationship k = n + 1.

Exercise: Find the expression of the error of the composite rectangle and midpoint rules and determine their orders of convergence.

Answer: The composite midpoint rule is second order (E_{MC} = O(h^2)), while the composite rectangle rule is first order (E_{RC} = O(h)).

5.5 Gauss integration

The Newton-Cotes integration formulas discussed in the previous sections are based on the integration of the interpolation polynomial of the function f at equally spaced interpolation points x_0, x_1, \ldots, x_m. As demonstrated by the few examples seen here, the order of accuracy of a given Newton-Cotes formula cannot exceed the number m + 1 of interpolation or integration points.
In fact, we can show that, except for the rectangle rule, a Newton-Cotes integration formula using m + 1 integration points x_0, x_1, \ldots, x_m has an order of accuracy or degree of precision

    n = m + 1 if m is even,  n = m if m is odd,

and an order of convergence

    k = m + 2 if m is even,  k = m + 1 if m is odd.

The result above is in fact a theorem, whose proof can be found in any good textbook on numerical analysis (e.g., Burden and Faires). Here we content ourselves with confirming this result on the three cases seen above.
The trapezoidal rule has m = 1 (2 points) and a degree of precision n = 1, because its error term involves the second derivative. The midpoint rule has m = 0 (one point) and its degree of precision is also n = 1, and Simpson's rule has m = 2 and a precision n = 3. Note that the rectangle rule is in fact an exception: it has a degree of precision n = m = 0 and an order of convergence k = m + 1 = 1, although m = 0 is even.

Gauss integration consists in finding an adequate choice of interpolation or integration points so that the order of accuracy is maximal. As a motivating example, consider the approximation

    \int_{-1}^{1} f(x)\,dx \approx c_1 f(x_1) + c_2 f(x_2),

where the constants c_1, c_2 and the integration points x_1, x_2 are to be determined so that the order of accuracy of this integration formula is maximized. Since we have four degrees of freedom, i.e., the four unknowns c_1, c_2, x_1, x_2, we can form four equations. Thus we can make this formula exact for polynomials of degree up to three, i.e., for the monomials 1, x, x^2, x^3. Hence we require

    \int_{-1}^{1} 1\,dx = c_1 + c_2,  \int_{-1}^{1} x\,dx = c_1 x_1 + c_2 x_2,  \int_{-1}^{1} x^2\,dx = c_1 x_1^2 + c_2 x_2^2,  \int_{-1}^{1} x^3\,dx = c_1 x_1^3 + c_2 x_2^3,

or

    c_1 + c_2 = 2,  c_1 x_1 + c_2 x_2 = 0,  c_1 x_1^2 + c_2 x_2^2 = \frac{2}{3},  c_1 x_1^3 + c_2 x_2^3 = 0,

whose solution is c_1 = c_2 = 1, x_1 = -\sqrt{3}/3, x_2 = \sqrt{3}/3. Therefore we have our first Gauss integration formula,

    \int_{-1}^{1} f(x)\,dx \approx f\!\left( -\frac{\sqrt{3}}{3} \right) + f\!\left( \frac{\sqrt{3}}{3} \right),    (5.2)

which uses only two integration points (m = 1) but has a degree of precision 3 = 2m + 1, or 2N - 1 where N is the total number of points. Notice that its Newton-Cotes counterpart with two integration points, the trapezoidal rule, has the modest degree of precision of one.

Definition 16 (Gauss integration) In general, a Gauss quadrature or integration formula consists in choosing the integration points x_1, x_2, \ldots, x_N so that the degree of precision is maximal, n = 2N - 1.
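As a numerical sanity check of Definition 14 and of the two-point formula (5.2), the degree of precision can be computed by brute force on monomials. A Python sketch (the notes use Matlab; degree_of_precision is our own helper name):

```python
def degree_of_precision(nodes, weights, kmax=8, tol=1e-10):
    """Largest n such that sum_j w_j x_j^k equals ∫_{-1}^{1} x^k dx for all k <= n."""
    n = -1
    for k in range(kmax + 1):
        exact = 2.0 / (k + 1) if k % 2 == 0 else 0.0  # ∫_{-1}^{1} x^k dx
        approx = sum(w * x ** k for x, w in zip(nodes, weights))
        if abs(approx - exact) > tol:
            break
        n = k
    return n

r = 3 ** 0.5 / 3.0  # the Gauss node sqrt(3)/3
prec_trap = degree_of_precision([-1.0, 1.0], [1.0, 1.0])             # trapezoidal rule
prec_simp = degree_of_precision([-1.0, 0.0, 1.0], [1/3, 4/3, 1/3])   # Simpson's rule
prec_gauss = degree_of_precision([-r, r], [1.0, 1.0])                # two-point Gauss (5.2)
```

This confirms the text: precision 1 for the trapezoidal rule, 3 for Simpson's rule, and 3 for the two-point Gauss formula, which achieves Simpson's precision with one fewer point.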
It can be shown that n = 2N - 1 is in fact the maximum degree of precision one can achieve with N integration points on a given interval [a,b]. On the interval [-1,1] the integration points happen to be the roots of the Legendre polynomial of degree N, and the resulting numerical formulas are known as the Gauss-Legendre quadrature.

Gauss-Legendre integration

Legendre polynomials P_N(x), N \ge 0, are defined (up to a multiplicative constant) by the following properties:

P_N(x) is a polynomial of degree N.

\int_{-1}^{1} P_N(x)\, x^k\,dx = 0 for all k \le N - 1, i.e., P_N(x) is orthogonal to all polynomials of degree \le N - 1.

The first few Legendre polynomials are

    P_0(x) = 1,  P_1(x) = x,  P_2(x) = \frac{3}{2}x^2 - \frac{1}{2},  P_3(x) = \frac{5}{2}x^3 - \frac{3}{2}x,  P_4(x) = \frac{35}{8}x^4 - \frac{15}{4}x^2 + \frac{3}{8},  \ldots

Notice that the integration points x_1 = -\sqrt{3}/3, x_2 = \sqrt{3}/3 found above for the two-point integration formula (5.2) are precisely the roots of the Legendre polynomial of degree two, P_2(x) = \frac{3}{2}x^2 - \frac{1}{2}. Below, we derive the polynomial P_3(x) as a pedagogical example.

Example: Find a polynomial p_3(x) = x^3 + a_2 x^2 + a_1 x + a_0 such that

    \int_{-1}^{1} p_3(x)\, x^k\,dx = 0,  k = 0, 1, 2.

We have

    \int_{-1}^{1} p_3(x)\,dx = \frac{2}{3} a_2 + 2 a_0 = 0,  \int_{-1}^{1} p_3(x)\, x\,dx = \frac{2}{5} + \frac{2}{3} a_1 = 0,  \int_{-1}^{1} p_3(x)\, x^2\,dx = \frac{2}{5} a_2 + \frac{2}{3} a_0 = 0,

which yields a_2 = a_0 = 0 and a_1 = -\frac{3}{5}, i.e., p_3(x) = x^3 - \frac{3}{5}x, and multiplying by 5/2 yields the Legendre polynomial P_3(x) = \frac{5}{2}x^3 - \frac{3}{2}x given above.

Now, in principle, we can derive the Gauss-Legendre integration formula of any order, as illustrated below for the case N = 3, as an example. The roots of the Legendre polynomial of degree 3 are

    x_1 = -\sqrt{\frac{3}{5}},  x_2 = 0,  x_3 = \sqrt{\frac{3}{5}}.
Using the fact that the integration formula should be exact for the monomials 1, x, x^2 (in fact it will be exact for all polynomials of degree n \le 2N - 1 = 5), we have

    2 = c_1 + c_2 + c_3,  0 = -\sqrt{\frac{3}{5}}\, c_1 + \sqrt{\frac{3}{5}}\, c_3,  \frac{2}{3} = \frac{3}{5}(c_1 + c_3),

which yields c_1 = c_3 = \frac{5}{9} \approx 0.5556 and c_2 = 2 - c_1 - c_3 = \frac{8}{9} \approx 0.8889. Thus, the Gauss quadrature with three points is given by

    \int_{-1}^{1} f(x)\,dx \approx \frac{5}{9} f\!\left( -\sqrt{\frac{3}{5}} \right) + \frac{8}{9} f(0) + \frac{5}{9} f\!\left( \sqrt{\frac{3}{5}} \right).

Many textbooks provide extended tables of the roots of the Legendre polynomials and the associated integration constants c_j of the corresponding Gauss-Legendre formulas. The entries for the two cases derived above are:

Gauss-Legendre Integration Table

    Degree of Legendre polynomial, N    Roots x_j                          Coefficients c_j
    2                                   \pm 0.5773502692 (= \pm\sqrt{3}/3)   1, 1
    3                                   0, \pm 0.7745966692 (= \pm\sqrt{3/5})   8/9, 5/9, 5/9

We finish this section by illustrating how the Gauss-Legendre integration formula can be generalized to approximate an integral on an arbitrary interval [a,b]. Given an interval [a,b], we introduce the change of variable

    x = \frac{(b-a)t + (b+a)}{2}

to transform an integral on [a,b] into an integral on [-1,1]:

    \int_a^b f(x)\,dx = \int_{-1}^{1} f\!\left( \frac{(b-a)t + (b+a)}{2} \right) \frac{b-a}{2}\,dt.
Thus the generalized Gauss-Legendre integration formula is

    \int_a^b f(x)\,dx \approx \frac{b-a}{2} \sum_{j=1}^{N} c_j f\!\left( \frac{(b-a)x_j + (b+a)}{2} \right),    (5.3)

where the integration points x_j are the roots of the Legendre polynomial of degree N and the c_j are the associated coefficients of the standard Gauss-Legendre integration formula, as given for example in the table above.

Example: Compute an approximate value of \int_1^{1.5} e^{-x^2}\,dx using Gauss-Legendre quadrature with both 2 and 3 points.

Answer: We use formula (5.3) with a = 1, b = 3/2, and the x_j and c_j given by the table. With N = 2 we have:

    \int_1^{1.5} e^{-x^2}\,dx \approx \frac{1}{4}\left[ e^{-(5 - 1/\sqrt{3})^2/16} + e^{-(5 + 1/\sqrt{3})^2/16} \right] \approx 0.10940.

With N = 3 we have:

    \int_1^{1.5} e^{-x^2}\,dx \approx \frac{1}{4}\left[ \frac{5}{9} e^{-(5 - \sqrt{3/5})^2/16} + \frac{8}{9} e^{-25/16} + \frac{5}{9} e^{-(5 + \sqrt{3/5})^2/16} \right] \approx 0.10936.

5.6 Problems

1. Approximate the following integrals using the trapezoidal rule.
d. 1 a /4 1/4 2 dx b. x 4 2x x 2 dx e π/4 0 x 2 lnxdx c. xsinxdx f. π/ x 2 e x dx (5.4) e 3x sin2xdx. (5.5)

2. Determine the exact values of the integrals in 1., then find the absolute error associated with each approximation.

3. Find an upper bound for the error of each of the approximations in 1. using the error formulas in the notes. Verify that the absolute error found in 2. is within this upper bound.

4. Repeat 1, 2, 3 using Simpson's rule.

5. The trapezoidal rule applied to \int_0^2 f(x)\,dx gives the value 4, and Simpson's rule gives the value 2. What is f(1)?
122 6. The trapezoidal rule applied to 2 0 f(x)dx gives the value 5, and the midpoint rule gives the value 4. What value does Simpson s rule return? 7. Show that Simpson s rule can be written as a linear (convex) combination of the trapezoidal and midpoint rules. 8. Find the degree of precision (or the order) of the quadrature formula 1 1 f(x)dx f( 3/3)+f( 3/3). 9. Let h = (b a)/3,x 0 = a,x 1 = a+h,x 2 = b. Find the degree of precision of the quadrature formula b a f(x)dx 9 4 hf(x 1)+ 3 4 hf(x 2). 10. The quadrature formula 1 1 f(x)dx = c 0f( 1)+c 1 f(0)+c 2 f(1) is exact for all polynomials of degree 2. Find the coefficients c 0,c 1,c Find the values of x 0,x 1 and c 1 so that the formula 1 has the highest degree of precision. 0 f(x)dx 1 2 f(x 0)+c 1 f(x 1 ) 12. Use the composite trapezoidal rule with the indicated number n of subintervals to approximate the integrals below. Write a short matlab program to computerize the calculations, using the function sum or a for loop. a xlnxdx,n = 4; b. x 3 e x 2 dx n = 8; c. 2 0 x 2 dx,n = 6 (5.6) Repeat the previous exercise using the composite Simpson s rule. 14. Integrals in Matlab: The matlab functions trapz and quad use respectively the trapezoidal and Simpson composite rules to compute the integral of a given function, with the following major difference. trapz acts directly on a vector of values Y = f(x) where X is a set of integration nodes while quad uses a function, to be integrated on an interval [a, b], defined in an M-file that can be evaluated at any point x in [a,b]. quad chooses for itself the integration nodes and their number to approximate the integral within a given tolerance. The default tolerance in the latest matlab version is Use the help command to learn more about these two commands. a) Use the function trapz of matlab to approximate the following integrals with the given number of sub-intervals (n+1 is the number of points). d. a. 2 1 π 0 2 xlnxdx,n = 4; b. x 2 cosxdx,n = 6 e x 3 e x dx,n = 8 c. 
e 2x sin3xdx,n = 8; f x 2 dx,n = 10 (5.7) +4 x x 2 dx,n = 8; (5.8)
123 b) Evaluate the integrals in (a) using quad of matlab using both the default tolerance and a user-defined tolerance tol=10 2. c) Compare your results in (a) and (b) then conclude. d) Double the number of subintervals n and repeat a),b), and c). 15. Determine the number of integration subinterval n (or nodes n + 1) required to approximate to within 10 6 using the x+4 dx a. the composite trapezoidal rule b. the composite midpoint rule c. the composite Simpson rule 16. Consider the normal probability distribution function, P(x) = 1 σ 2π x e y2 /(2σ 2) dy, with mean µ = 0 and standard deviation σ. P(x) expresses the probability that a randomly chosen number according to this distribution is less than x. Note that consistently with the fact that the probability that a randomly choosing value is less than + is one we have 1 σ 2π + e y2 /(2σ 2) dy = 1. The probability that a randomly chosen number is in a given interval [a,b] is given by Prob(a x b) = 1 σ 2π b a e y2 /(2σ 2) dy. For σ = 0.3use quad of matlab to approximate to within 10 5 the probability that a randomly chosen value is in a.[ σ, σ] b.[ 2σ, 2σ] c.[ 3σ, 3σ] 17. A car laps a race track in 84 seconds. The speed of the car at each 6-second interval is determined using a radar gun and is from the beginning of the lap, in feet/second, by the entries in the following table Time Speed How long is the track? Hint: use trapz of matlab. 18. Approximate the following integrals using Gauss-Legendre quadrature with n = 2 and n = 3 and compare your results to the exact values of the integrals a. x 2 e x 2 dx b. 1 0 x 2 dx c. 4 π/4 0 x 2 sinxdx. 122
124 19. Determine the constants a, b, c, d that will produce a quadrature formula 1 1 that has a degree of precision Use the orthogonality constraints f(x)dx = af( 1)+bf(1)+cf ( 1)+df (1) 1 1 x k P n (x)dx = 0, k = 0,1, n 1 to determine the first few Legendre polynomials, P 1 (x),p 2 (x),p 3 (x),p 4 (x). Assume P 0 (x) = (Hard.) Let x 1,x 2,,x n be the n zeros of the Legendre polynomial of degree n, which we denote by P n (x). a) Show that x 1,x 2,,x n are real, distinct, and are contained in the interval ( 1,1). b) Let c 1,c 2,,c n be n constants chosen so that the Gauss-quadrature formula, 1 1 f(x)dx n c j f(x j ) is exact for all polynomials of degree n 1, i.e, the c j s are obtained by solving the n-by-system n 1 { 2 c j x k j = x k dx = k+1 if k is even,k = 0,1,,n 1 0 if k is odd j=1 1 Show that the numerical integration rule is therefore exact for all polynomial of degree 2n 1. j=1 Hints: For a). All three statements can be shown by contradiction. -Real. Assume that P n (x) has two complex conjugate roots: x 1 = α+iβ,x 2 = α iβ. Then P n (x) = Q(x)(x x 1 )(x x 2 ) = Q(x)(x 2 +bx+c) with Q(x) a polynomial of degree n 2. Use the orthogonality condition of P n (x) and Q(x) to deduce that P n (x) = 0. Which is a contradiction of course. -Disticnt. Assume P n (x) has a repeated root x 0. Then P n (x) = Q(x)(x x 0 ) 2. The same reasoning as above applies. -All roots are in ( 1,1). Assume that a root x 0 1. Then P n (x) = Q(x)(x x 0 ). Then the orthogonality of P n (x) and Q(x) together with the fact that P n (x)q(x) = Q(x) 2 (x x 0 ) 0 for x [ 1,1] lead to a contradiction. For b) we have. A polynomial p 2n 1 (x) of degree n can be factorized with respect to the Legendre polynomial P n (x): p 2n 1 (x) = q(x)p n (x)+r(x) where the remainder r(x) and the quotient q(x) are both polynomials of degree n 1. Why? 123
Chapter 6

Monte Carlo Integration

6.1 Introduction: Probability distributions and random variables

6.1.1 Definition and examples

A random variable X is, by definition, a mapping that takes its values in a discrete set or a continuous interval of real numbers, according to a given probability distribution P. A random variable with values in a discrete set of real numbers is called a discrete random variable, and a random variable that takes values in an interval is called a continuous random variable.

Examples of discrete random variables

a) Perhaps the simplest example of a discrete random variable is that of tossing a fair coin, which results in heads or tails with a 50/50% chance. We can thus associate a random variable X that takes the value X = 0 if the outcome is heads and X = 1 if it is tails, with the probability distribution

    P(\{X = 0\}) = \frac{1}{2},  P(\{X = 1\}) = \frac{1}{2}.

b) A similar example is that of throwing a die. The associated random variable takes its values in the discrete set {1, 2, 3, 4, 5, 6} with the probabilities

    P(\{X = 1\}) = p_1,  P(\{X = 2\}) = p_2,  \ldots,  P(\{X = 6\}) = p_6,

where 0 \le p_1, p_2, \ldots, p_6 \le 1 and p_1 + p_2 + \cdots + p_6 = 1. If the die is fair (unloaded), then p_1 = p_2 = \cdots = p_6 = 1/6.

Examples of continuous random variables

Concrete examples of continuous random variables are ubiquitous in nature and human life. Often a continuous random variable represents an idealized approximation of a discrete random variable taking values in a large discrete set. Common examples include the price of an asset in the
stock market, or the exact dimensions of a manufactured object: there are always some deviations from the target dimensions due to imperfections in the devices used to make the product, etc. The probability distribution of a continuous random variable is given in terms of the probability that X lies in a given interval [a,b]: P(\{a \le X \le b\}). For all practical purposes, the point-wise probability P(\{X = x\}) of a continuous random variable (which is zero in theory, P(\{X = x\}) = 0) is meaningless.

a) Uniform random variable on [0,1]: If [a,b] \subset [0,1], then P(\{a \le X \le b\}) = b - a, the length of the interval [a,b]. If (a,b) \cap (0,1) = \emptyset, then P(\{a \le X \le b\}) = 0. Note that, accordingly, we have in general P(\{a \le X \le b\}) = |(a,b) \cap (0,1)| (the length of the intersection) and P(\{0 \le X \le 1\}) = 1.

Very often, the probability distribution of a continuous random variable is given in the form of an integral

    P(\{a \le X \le b\}) = \int_a^b f(x)\,dx,

where f(x) is a non-negative real-valued function with the important property

    \int_{-\infty}^{+\infty} f(x)\,dx = 1.

f(x) is known as the probability density function (pdf for short). For the example of a uniform random variable on [0,1] we have

    f(x) = 1 if 0 \le x \le 1,  f(x) = 0 otherwise.

Exercise: Verify that indeed P(\{a \le X \le b\}) = \int_a^b f(x)\,dx for the example of a uniform random variable on [0,1].

The function F(x) = \int_{-\infty}^{x} f(t)\,dt, which is the probability that X \le x, is known as the cumulative distribution function (CDF).

b) Gaussian or normally distributed random variable: A Gaussian random variable takes its values in (-\infty, +\infty) according to the Gaussian probability density function

    f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{(x-\mu)^2}{2\sigma^2} \right),

where \mu and \sigma > 0 are two real parameters known as the mean and standard deviation of the Gaussian distribution. Note that

    P(\{a \le X \le b\}) = \frac{1}{\sqrt{2\pi}\,\sigma} \int_a^b \exp\!\left( -\frac{(x-\mu)^2}{2\sigma^2} \right) dx
is not known in closed form, but it is easily evaluated using quadrature-based algorithms that are already implemented in many of the available software packages. For example, in Matlab one can use the function normcdf, which evaluates the cumulative distribution function

    F(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \int_{-\infty}^{x} \exp\!\left( -\frac{(t-\mu)^2}{2\sigma^2} \right) dt.

The Gaussian distribution is widely known as the normal distribution. The reason for this is that it is ubiquitous in nature: many naturally generated random variables are Gaussian (as reflected by the central limit theorem given below). To verify that the Gaussian pdf satisfies \int_{-\infty}^{+\infty} f(x)\,dx = 1, the following important result from multivariable calculus is used (combined with a clever change of variables):

    \left[ \int_{-\infty}^{+\infty} e^{-x^2}\,dx \right]^2 = \int_{-\infty}^{+\infty} e^{-x^2}\,dx \int_{-\infty}^{+\infty} e^{-y^2}\,dy = \iint_{\mathbb{R}^2} e^{-(x^2 + y^2)}\,dx\,dy = \int_0^{2\pi}\!\!\int_0^{+\infty} r e^{-r^2}\,dr\,d\theta = \pi.

c) Exponentially distributed random variable: The pdf of an exponential random variable is given by

    f(x) = \lambda e^{-\lambda x} if x \ge 0,  f(x) = 0 otherwise,

where \lambda > 0 is a real parameter.

Exercise: Show that \int_{-\infty}^{+\infty} f(x)\,dx = 1.

d) Log-normal distribution: A positive random variable Y > 0 is said to be log-normal if X = \ln Y is a normal random variable.

Other common and important examples of random variables or probability distributions include the binomial and Poisson distributions, which are discrete, the \chi (chi) distribution, the \Gamma (gamma) distribution, etc.

Notation: In many textbooks and research papers the uniform distribution on an interval [\alpha, \beta] is denoted by U(\alpha, \beta); U(0,1) denotes the standard uniform distribution on (0,1). The normal or Gaussian distribution is denoted by N(\mu, \sigma); N(0,1) denotes the Gaussian distribution with mean zero and standard deviation one. The PDFs and CDFs of the standard uniform U(0,1), standard normal N(0,1), and standard exponential (\lambda = 1) distributions are plotted in Figure 6.1.
[Figure 6.1 consists of six panels: the PDF and CDF of the uniform distribution on (0,1), the PDF and CDF of the normal distribution N(0,1), and the PDF and CDF of the exponential distribution with lambda = 1.]

Figure 6.1: The PDFs and CDFs of the standard uniform, standard normal, and standard exponential distributions.
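As noted above, the normal CDF has no closed form but is easy to evaluate numerically; in Python, the standard-library error function plays the role that normcdf plays in Matlab (a sketch; normal_cdf is our own helper name):

```python
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    """CDF of N(mu, sigma), evaluated via the error function:
    F(x) = 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

p0 = normal_cdf(0.0)   # 0.5 by symmetry
p1 = normal_cdf(1.0)   # ≈ 0.8413, the familiar "one standard deviation" value
```

This identity between the Gaussian CDF and erf follows from the change of variable t = (y - mu)/(sigma*sqrt(2)) in the CDF integral.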
6.1.2 Mean, variance, and expectation

Let X denote a continuous random variable with pdf f(x). The mean or expectation of X is given by

    E[X] = \int_{-\infty}^{+\infty} x f(x)\,dx.

The mean or expectation is often equivalently denoted by an overbar or angle brackets: E[X] = \bar{X} = \langle X \rangle. For any given real-valued function g we can define the expectation of g(X) as

    E[g(X)] = \int_{-\infty}^{+\infty} g(x) f(x)\,dx.

In a call option, for example, if X = S(T) is the market (selling) price of the underlying asset at the strike time T, distributed according to its pdf f(x), and K is the strike price, then the profit is given by \pi = \max(X - K, 0), and the expected profit is

    E[\pi] = \int_{-\infty}^{+\infty} \max(x - K, 0) f(x)\,dx = \int_K^{+\infty} x f(x)\,dx - K\,P(\{X \ge K\}).

The variance of X is given by

    Var[X] = E[(X - E[X])^2] = E[X^2] - E[X]^2,

and the standard deviation is given by \sigma = \sqrt{Var[X]}.

Using simple integration rules, we can easily show the following.

For U(0,1), the uniform distribution on [0,1], we have

    E[X] = \int_0^1 x\,dx = \frac{1}{2}  and  Var[X] = \int_0^1 x^2\,dx - \frac{1}{4} = \frac{1}{3} - \frac{1}{4} = \frac{1}{12}.

For N(\mu, \sigma), the Gaussian distribution with parameters \mu, \sigma, we have

    E[X] = \frac{1}{\sqrt{2\pi}\,\sigma} \int_{-\infty}^{+\infty} x \exp\!\left( -\frac{(x-\mu)^2}{2\sigma^2} \right) dx = \mu

and

    Var[X] = \frac{1}{\sqrt{2\pi}\,\sigma} \int_{-\infty}^{+\infty} (x-\mu)^2 \exp\!\left( -\frac{(x-\mu)^2}{2\sigma^2} \right) dx = \sigma^2.

For the exponential distribution with parameter \lambda > 0, we have

    E[X] = \lambda \int_0^{+\infty} x e^{-\lambda x}\,dx = \frac{1}{\lambda}  and  Var[X] = \lambda \int_0^{+\infty} x^2 e^{-\lambda x}\,dx - \frac{1}{\lambda^2} = \frac{2}{\lambda^2} - \frac{1}{\lambda^2} = \frac{1}{\lambda^2}.
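The mean and variance computations above can be checked by sampling. A Python sketch for the exponential case (lambda = 2 is our own illustrative choice; random.expovariate samples the exponential distribution):

```python
import random

random.seed(42)  # reproducibility
lam = 2.0
n = 200_000
xs = [random.expovariate(lam) for _ in range(n)]

# Sample mean and (population-style) sample variance
mean = sum(xs) / n                            # should approach E[X] = 1/λ = 0.5
var = sum((x - mean) ** 2 for x in xs) / n    # should approach Var[X] = 1/λ² = 0.25
```

With 200,000 samples the estimates agree with 1/λ and 1/λ² to roughly two or three decimal places, anticipating the law-of-large-numbers discussion below.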
6.1.3 Conditional probability and the notion of dependent and independent random variables

Let X, Y be two random variables with their respective probability density functions f_X, f_Y. The probability P(\{a \le X \le b, c \le Y \le d\}) that both a \le X \le b and c \le Y \le d at the same time is known as the joint probability distribution. The joint cumulative distribution function is the two-variable function

    F(x,y) = P(\{X \le x, Y \le y\}).

The joint probability density function of X, Y is given by

    f(x,y) = \frac{\partial^2 F(x,y)}{\partial x\, \partial y},  so that  F(x,y) = \int_{-\infty}^{y}\!\!\int_{-\infty}^{x} f(t,s)\,dt\,ds.

We have the following compatibility conditions:

    F_X(x) = \int_{-\infty}^{x} f_X(t)\,dt = \int_{-\infty}^{x}\!\!\int_{-\infty}^{+\infty} f(t,s)\,ds\,dt  and  F_Y(y) = \int_{-\infty}^{y} f_Y(s)\,ds = \int_{-\infty}^{y}\!\!\int_{-\infty}^{+\infty} f(t,s)\,dt\,ds.    (6.1)

The two random variables X, Y are said to be independent if

    P(\{a \le X \le b, c \le Y \le d\}) = P(\{a \le X \le b\}) \cdot P(\{c \le Y \le d\}).

If X, Y are independent then their joint probability density function satisfies

    f(x,y) = f_X(x) f_Y(y).

The probability that a \le X \le b given that c \le Y \le d, denoted by P(\{a \le X \le b \mid c \le Y \le d\}), is known as the conditional probability of the random variable X given that Y was realized. The following relationship holds:

    P(\{a \le X \le b, c \le Y \le d\}) = P(\{a \le X \le b \mid c \le Y \le d\}) \cdot P(\{c \le Y \le d\}).

If X, Y are independent then the conditional probability satisfies P(\{a \le X \le b \mid c \le Y \le d\}) = P(\{a \le X \le b\}).

The covariance between the two random variables X, Y is defined as

    Cov(X,Y) = E[XY] - E[X] E[Y],
where

    E[XY] = \int_{-\infty}^{+\infty}\!\!\int_{-\infty}^{+\infty} x y\, f(x,y)\,dx\,dy,  E[X] = \int_{-\infty}^{+\infty}\!\!\int_{-\infty}^{+\infty} x\, f(x,y)\,dx\,dy = \int_{-\infty}^{+\infty} x f_X(x)\,dx,  E[Y] = \int_{-\infty}^{+\infty}\!\!\int_{-\infty}^{+\infty} y\, f(x,y)\,dx\,dy = \int_{-\infty}^{+\infty} y f_Y(y)\,dy,

where we used the compatibility condition (6.1). It is easy to verify that

    Cov(X,X) = Var[X],  Cov(Y,Y) = Var[Y],

and that if X, Y are independent then Cov(X,Y) = 0. Note that Cov(X,Y) = 0 does not necessarily mean that X, Y are independent. If Cov(X,Y) > 0, then X, Y are said to be positively correlated, and if Cov(X,Y) < 0, they are said to be negatively correlated. We have

    E[X + Y] = E[X] + E[Y],  E[cX] = c\,E[X],

and

    Var[X + Y] = Var[X] + Var[Y] + 2\,Cov(X,Y),  Var[cX] = c^2\,Var[X].

Law of large numbers

Let X_1, X_2, \ldots, X_n, n \ge 2, be a sequence of independent and identically distributed (i.i.d.) random variables, each having mean \mu and standard deviation \sigma: E[X_j] = \mu, Var[X_j] = \sigma^2 for all j. Define the average random variable

    \bar{X} = \frac{1}{n}(X_1 + X_2 + \cdots + X_n) = \frac{1}{n} \sum_{j=1}^{n} X_j.

Then the sample mean equals the population mean in expectation,

    E[\bar{X}] = \frac{1}{n} \sum_{j=1}^{n} E[X_j] = \frac{1}{n}\, n \mu = \mu,

and Cov(X_j, X_k) = 0 for j \ne k (because the random variables are independent) implies

    Var[\bar{X}] = \frac{1}{n^2} \sum_{j=1}^{n} Var[X_j] = \frac{\sigma^2}{n}.

As a result, the sample mean converges, as n \to +\infty, to the population mean \mu in the probability or weak sense, i.e.,

    \lim_{n \to +\infty} \frac{1}{n} \sum_{j=1}^{n} X_j = \mu  with probability one,
or more precisely we have (Chebyshev's inequality¹)

    \forall \epsilon > 0,  P(\{ |\bar{X} - \mu| \ge \epsilon \}) \le \frac{Var[\bar{X}]}{\epsilon^2} = \frac{\sigma^2}{n \epsilon^2}.

Therefore, in principle, the expectation E[X] = \mu can be estimated by the sample average \frac{1}{n} \sum_{j=1}^{n} X_j if n is large enough.

Central limit theorem

Let X_1, X_2, \ldots, X_n, n \ge 2, be a sequence of i.i.d. random variables with mean \mu and standard deviation \sigma. The sequence of random variables \frac{X_1 + X_2 + \cdots + X_n - n\mu}{\sigma \sqrt{n}} converges in distribution to a normally distributed random variable with mean zero and variance one, i.e., the limit

    Y = \lim_{n \to +\infty} \frac{X_1 + X_2 + \cdots + X_n - n\mu}{\sigma \sqrt{n}}

has the normal distribution with mean zero and standard deviation one: Y \sim N(0,1).

6.2 Crude Monte Carlo integration

Consider the integral I = \int_0^1 g(x)\,dx. This integral can be thought of as the expectation E[g(U)], where U is a random variable uniformly distributed on [0,1] (U \sim U(0,1)). According to the law of large numbers above, the expected value I = E[g(U)] can be estimated by a sample mean. Let U_1, U_2, \ldots, U_n be a sequence of random numbers generated according to the uniform distribution U(0,1). Then

    I \approx \hat{I}_n = \frac{1}{n} \sum_{j=1}^{n} g(U_j).

The law of large numbers guarantees that, with probability one,

    \lim_{n \to \infty} \hat{I}_n = I.

This is the basis of Monte Carlo integration. However, the sampling of genuinely random numbers is not possible on a digital computer. Instead, most programming languages and computing environments, such as Matlab, have one or more built-in functions that can generate sequences of pseudo-random numbers. The function rand() of Matlab can be used to generate sequences of pseudo-random numbers that are U(0,1), while randn samples the standard normal distribution N(0,1) (the Gaussian distribution of mean zero and variance one).

¹ For the proof see the appendix at the end of these notes.
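The estimator Î_n can be sketched directly in Python before turning to the Matlab examples (random.random plays the role of rand; the test integral ∫_0^1 e^x dx is the one used in Example 1 below):

```python
import random, math

random.seed(0)  # reproducibility; plays the role of rand('state', 0) in Matlab

def mc_uniform(g, n):
    """Crude Monte Carlo estimate of ∫_0^1 g(x) dx = E[g(U)], U ~ U(0,1)."""
    return sum(g(random.random()) for _ in range(n)) / n

exact = math.e - 1.0                     # ∫_0^1 e^x dx ≈ 1.7183
est_10 = mc_uniform(math.exp, 10)        # crude: only 10 samples
est_100000 = mc_uniform(math.exp, 100_000)
```

By Chebyshev's inequality the statistical error decays only like 1/sqrt(n), so the 10-sample estimate is rough while the 100,000-sample one is accurate to about two decimal places.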
Example 1: Consider I = ∫_0^1 e^x dx = e − 1 ≈ 1.7183. To use Monte Carlo integration, we view this integral as the expectation

    I = E[e^U] ≈ (1/n) Σ_{j=1}^{n} e^{U_j},   with U_j ~ U(0,1),

and use the function rand() of Matlab to generate a sequence of pseudo-random numbers U(0,1). Recall that rand(N,M) returns an N×M matrix of random numbers, all uniformly distributed in [0,1].

>> rand('state',0)  % this (re)initializes the function rand
>> I = mean(exp(rand(1,10)))
ans =
>> I = mean(exp(rand(1,10)))
ans =
>> I = mean(exp(rand(1, )))
ans =
>> I = mean(exp(rand(1, )))
ans =

Increasing the number of samples clearly improves the estimated integral.

Example 2: Here we show how the Monte Carlo method can be used to compute an improper integral. As an example, we consider

    I = ∫_{−∞}^{+∞} cos(x) e^{−x²/2} dx.

To use the Monte Carlo method we first rewrite the integral as

    I = ∫_{−∞}^{+∞} √(2π) cos(x) · (1/√(2π)) e^{−x²/2} dx.

Recognizing the normal probability density function f(x) = (1/√(2π)) e^{−x²/2}, we view the given integral as √(2π) times the expectation E[cos(X)], where X is an N(0,1) random variable. Therefore

    I ≈ (√(2π)/n) Σ_{j=1}^{n} cos(X_j),

where the X_j's are random numbers sampled (drawn) from the standard normal distribution. This is easily accomplished in Matlab using the randn() function. The Matlab lines below show how this is implemented. For the sake of comparison, we use the deterministic quad function (an adaptive Simpson rule) on a finite interval [−A, A] with A = 10 and A = 100.
>> quad('cos(x).*exp(-x.^2/2)',-10,+10)
ans =
>> quad('cos(x).*exp(-x.^2/2)',-100,+100)
ans =
>> sqrt(2*pi)*mean(cos(randn(1,100)))
ans =
>> sqrt(2*pi)*mean(cos(randn(1,100)))
ans =
>> sqrt(2*pi)*mean(cos(randn(1, )))
ans =
>> sqrt(2*pi)*mean(cos(randn(1, )))
ans =
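The same normal-expectation trick can be checked against the known closed form ∫ cos(x) e^{−x²/2} dx = √(2π) e^{−1/2} ≈ 1.5203 (a standard fact: E[cos(X)] = e^{−1/2} for X ~ N(0,1)). A minimal Python sketch of the computation just shown (the notes use Matlab's randn; here random.gauss plays the same role):

```python
import math
import random

# I = ∫ cos(x) e^{-x^2/2} dx = sqrt(2π) E[cos(X)], X ~ N(0,1);
# the exact value is sqrt(2π) e^{-1/2} ≈ 1.5203.
random.seed(1)
n = 200_000
total = sum(math.cos(random.gauss(0.0, 1.0)) for _ in range(n))
estimate = math.sqrt(2 * math.pi) * total / n
exact = math.sqrt(2 * math.pi) * math.exp(-0.5)
print(estimate, exact)  # the two values agree to a couple of decimals
```

Having the exact value available makes the slow O(1/√n) convergence of the estimator easy to observe by varying n.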
Conclusion

The two examples above show that the crude Monte Carlo method, as used here to estimate one-dimensional integrals, is not at all competitive with the deterministic quadrature rules such as the trapezoidal, Simpson, Gauss, etc. rules. The convergence is very slow and requires a large number of samples. However, the one major advantage of Monte Carlo integration lies in multi-dimensional integrals, because the number of integration points (i.e. samples) required to achieve a given accuracy is independent of the dimension of the integral, unlike the deterministic quadrature rules, for which the number of integration points typically increases exponentially with the dimension: if n points are needed to achieve an accuracy ε in one dimension, then n^d points are needed in R^d (see the book, page 223, for more discussion).

We finish this section with a finance example (taken from the book) dealing with the pricing of a vanilla European option. The option price is the discounted expected value of the payoff at maturity,

    f = e^{−rT} E[f_T],

where T is the maturity date, r is the risk-free (i.e. bank) interest rate, and f_T is the payoff at maturity. For a call option we have, according to the Black-Scholes model,

    f_T = max(0, S(0) e^{(r − σ²/2)T + σ√T ξ} − K),

where S(0) is the initial price of the underlying asset, K is the strike price, σ is the volatility, and ξ is a standard normal random variable, ξ ~ N(0,1). The following Matlab code uses Monte Carlo to compute the discounted payoff f (Book, page 224).
%M-file: BlsMC1.m
function Price = BlsMC1(S0,K,r,T,sigma,NRepl)
nuT = (r - 0.5*sigma^2)*T;
siT = sigma * sqrt(T);
DiscPayoff = exp(-r*T)*max(0, S0*exp(nuT+siT*randn(NRepl,1))-K);
Price = mean(DiscPayoff);
%%%%%end of M-file

>> S0 = 50;
>> K = 60;
>> r = 0.05;
>> T = 1;
>> sigma = 0.2;
>> randn('state',0)
>> BlsMC1(S0,K,r,T,sigma,1000)
ans =
>> BlsMC1(S0,K,r,T,sigma,1000)
ans =
>> BlsMC1(S0,K,r,T,sigma,1000)
ans =
>> BlsMC1(S0,K,r,T,sigma, )
ans =
>> BlsMC1(S0,K,r,T,sigma, )
ans =
>> BlsMC1(S0,K,r,T,sigma,1000)
ans =
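For this particular payoff the Black-Scholes closed-form price is also available, which gives a useful sanity check on the Monte Carlo estimate. Below is an illustrative Python sketch of both (a translation of BlsMC1 plus the standard closed-form call price; function names are our own, not from the notes):

```python
import math
import random

def norm_cdf(x):
    # Standard normal CDF expressed via the error function.
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def bs_call(S0, K, r, T, sigma):
    # Black-Scholes closed-form price of a European call:
    # C = S0 N(d1) - K e^{-rT} N(d2).
    d1 = (math.log(S0 / K) + (r + 0.5 * sigma**2) * T) / (sigma * math.sqrt(T))
    d2 = d1 - sigma * math.sqrt(T)
    return S0 * norm_cdf(d1) - K * math.exp(-r * T) * norm_cdf(d2)

def mc_call(S0, K, r, T, sigma, n_repl, rng=random):
    # Monte Carlo price: average of discounted payoffs
    # e^{-rT} max(0, S_T - K), S_T = S0 exp((r - sigma^2/2)T + sigma sqrt(T) xi).
    nu_t = (r - 0.5 * sigma**2) * T
    si_t = sigma * math.sqrt(T)
    disc = math.exp(-r * T)
    total = 0.0
    for _ in range(n_repl):
        s_T = S0 * math.exp(nu_t + si_t * rng.gauss(0.0, 1.0))
        total += disc * max(0.0, s_T - K)
    return total / n_repl

random.seed(0)
exact = bs_call(50, 60, 0.05, 1.0, 0.2)    # about 1.62 for these parameters
approx = mc_call(50, 60, 0.05, 1.0, 0.2, 200_000)
print(exact, approx)
```

With a few hundred thousand replications the two prices typically agree to within a few cents, consistent with the σ/√N sampling error.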
6.3 Random number generators: pseudo-random numbers

Generating pseudo-random variates from a given probability distribution starts by generating pseudo-random numbers from the uniform distribution U(0,1). The uniformly distributed pseudo-random numbers are then converted to the desired distribution by various transformation methods. Here we discuss the two main such methods: the inverse transform method and the acceptance-rejection method. Also, pseudo-variates for distributions with different parameters can be generated from the standard distributions by a simple change of variables.

Pseudo-random numbers uniformly distributed on an arbitrary interval (α, β) can be generated from uniform variates on (0,1):

    U ~ U(0,1)  =>  X = (β − α)U + α ~ U(α,β).

Pseudo-normal variates with mean µ and variance σ² are generated from the standard normal distribution according to

    X ~ N(0,1)  =>  Y = µ + σX ~ N(µ,σ²).

Pseudo-random exponential variates with a parameter λ > 0 are generated from the standard exponential variates with λ = 1:

    X ~ EXP(1)  =>  Y = X/λ ~ EXP(λ).

Log-normal variates are obtained from normal variates. How?

Linear congruential generators (LCG)

The most popular and simplest textbook examples of pseudo-random number generators are the linear congruential generators (LCG). They generate a sequence of integers of the form

    Z_{i+1} = (a Z_i + c) mod m,

where a, c, m are judiciously chosen integer parameters: a is called the multiplier, c the shift, and m the modulus. Recall that the operation r = x mod m returns the integer remainder 0 ≤ r ≤ m − 1 of the division x/m. A sequence of pseudo-random variates uniformly distributed on (0,1) is then given by

    U_i = Z_i/m,  i = 1, 2, ...

To generate the sequence Z_i, an initial value Z_0 must be provided; Z_0 is known as the seed. In fact, for a given seed the whole sequence Z_i, i ≥ 0, can be predicted, i.e. the numbers U_i, i ≥ 0, are far from being independent random numbers. This is why they are called pseudo-random numbers.
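The LCG recurrence above takes only a few lines to implement. Here is an illustrative Python sketch (the notes give a Matlab version; the toy parameters a = 5, c = 3, m = 16, seed = 7 are the ones used in the example that follows):

```python
def lcg(a, c, m, seed, n):
    """Generate n integers Z_{i+1} = (a*Z_i + c) mod m and the
    corresponding uniform variates U_i = Z_i / m."""
    z, zseq, useq = seed, [], []
    for _ in range(n):
        z = (a * z + c) % m
        zseq.append(z)
        useq.append(z / m)
    return useq, zseq

useq, zseq = lcg(5, 3, 16, 7, 20)
print(zseq)  # [6, 1, 8, 11, 10, 5, 12, 15, 14, 9, 0, 3, 2, 13, 4, 7, 6, 1, 8, 11]
```

Note how the integer sequence visits all 16 residues 0, ..., 15 before repeating: for these parameters the generator achieves the maximal period m.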
When called with the same seed, the LCG algorithm will always generate the same sequence. To simulate a better resemblance to randomness, some programmers use a seed that changes with the machine clock time, e.g. seed = 100*clock, but this can turn out to be a very
bad choice, especially if we want to obtain the same result twice, for comparing or validating the numerical results.

The Matlab code below (taken from the book) generates N U(0,1) pseudo-random variates using an LCG generator with given parameters a, c, m and a given seed.

%M-file: LCG.m
function [Useq,Zseq] = LCG(a,c,m,seed,N)
Zseq = zeros(N,1);
Useq = zeros(N,1);
for i=1:N
    seed = mod(a*seed + c, m);
    Zseq(i) = seed;
    Useq(i) = seed/m;
end

Assume for example we set a = 5, c = 3, m = 16, seed = 7, N = 20. Then the Matlab code above generates the two sequences

    {Z_i, i = 1,...,20} = 6, 1, 8, 11, 10, 5, 12, 15, 14, 9, 0, 3, 2, 13, 4, 7, 6, 1, 8, 11

and

    {U_i, i = 1,...,20} = 0.3750, 0.0625, 0.5000, 0.6875, 0.6250, 0.3125, 0.7500, 0.9375, 0.8750, 0.5625, 0, 0.1875, 0.1250, 0.8125, 0.2500, 0.4375, 0.3750, 0.0625, 0.5000, 0.6875.

Two main conclusions can be drawn from these two sequences. The U sequence seems to jump back and forth within the interval [0,1) in an almost random fashion, uniformly covering the interval. The Z sequence, however, passes through all 16 numbers 0, 1, ..., 15, comes back to the number 6 after exactly 16 iterations, and then repeats itself. Thus the sequence Z is periodic with period 16. In fact, all LCG generators are periodic: because they operate within a finite set of numbers, the sequence repeats itself as soon as it hits the same number twice. The period of the sequence Z satisfies T ≤ m. However, LCG sequences do not always achieve the maximal period T = m of the previous example. Repeating the experiment with a = 11, c = 5, m = 16 and seed Z_0 = 3 yields the sequence

    Z = 6, 7, 2, 11, 14, 15, 10, 3, 6, 7, 2, 11, 14, 15, 10, 3, 6, 7, 2, 11,

which has period T = 8, half the maximum possible period.

LCGs used in practice have a very large modulus m, and the parameters a and c are chosen to achieve the maximal period. This ensures that the uniform variates U_i = Z_i/m cover the interval (0,1) uniformly with an apparent random behaviour, having samples that look independent. This
is not an easy task, however. An easy way to obtain a sequence with the maximal period, for any given m, is to set a = c = 1 and use the seed Z_0 = 0. This yields the sequence

    U_i = i/m,  i = 1, 2, ..., m − 1.

In fact, this is a very bad choice: the generated variates do not look independent at all. Luckily, the random number generators found in commercial software have already been tested to comply with minimal standards, i.e. to have the statistical properties required to mimic the desired randomness, and you don't have to worry about designing a random generator yourself each time you want to use Monte Carlo to solve a numerical problem. The function rand installed in the latest Matlab version (7.1.x) is in fact based on an algorithm that is far more sophisticated than LCGs, and its details are beyond the scope of these notes.

Exercise: Read Example 4.5 of the book (page 228) and discuss the lattice structure of the pseudo-uniform numbers generated by the LCG method.

Inverse transform method

The inverse transform method is based on the following basic and general statement. Given a random variable X with probability density f_X and cumulative distribution function F_X(x), the random variable

    U = F_X(X) = ∫_{−∞}^{X} f_X(x) dx

is uniformly distributed on (0,1). To prove this it suffices to establish that

    i)   P({U ≤ y}) = 1 if y ≥ 1,
    ii)  P({U ≤ y}) = 0 if y ≤ 0,
    iii) P({a ≤ U ≤ b}) = b − a if (a,b) ⊂ (0,1).

The first two statements (i), (ii) are trivial, given the fact that 0 ≤ ∫_{−∞}^{X} f_X(x) dx ≤ 1. For (iii), we have

    P({a ≤ U ≤ b}) = P({a ≤ F_X(X) ≤ b}) = P({F_X^{−1}(a) ≤ X ≤ F_X^{−1}(b)})
                   = ∫_{F_X^{−1}(a)}^{F_X^{−1}(b)} f_X(x) dx = F_X(F_X^{−1}(b)) − F_X(F_X^{−1}(a)) = b − a.

Conversely, if U ~ U(0,1), then we have

    P({X ≤ x}) = P({F^{−1}(U) ≤ x}) = P({U ≤ F(x)}) = F(x).

This is represented schematically in Figure 6.2. If we are able to invert F easily², then the inverse transform method can be implemented to generate pseudo-random variates drawn from the distribution f_X as follows.
² Which is of course guaranteed when F_X is absolutely continuous and strictly increasing on the support of f_X.
1. Draw a uniform pseudo-random variate U ~ U(0,1).
2. Set X = F^{−1}(U).

Example: As a typical easy example, we consider the exponential distribution X ~ EXP(λ). We have F(x) = 1 − e^{−λx}. For a given uniform variate, the inverse transform method yields

    X = F^{−1}(U) = −(1/λ) ln(1 − U).

Given that the probability of drawing U and 1 − U from the uniform distribution is the same, in practice exponential variates are usually generated by drawing a uniform variate U ~ U(0,1) and returning X = −ln(U)/λ. This is what is implemented in the Statistics toolbox of Matlab. As an example, we use the above algorithm to compute the integral

    I = ∫_0^{+∞} 2x e^{−2x} dx = 1/2,

which is the expectation of the exponential random variable X ~ EXP(2).

>> u = rand(1, );
>> x = -log(u)/2;
>> mean(x)
ans =
>> u = rand(1, );
>> x = -log(u)/2;
>> mean(x)
ans =
>> u = rand(1, );
>> x = -log(u)/2;
>> mean(x)
ans =

Acceptance-rejection method

As mentioned above, the inverse transform method is only feasible when the cdf F(x) can be inverted easily. Although we can always resort to numerical root-finding techniques such as Newton's method to invert F(x), this is not always a good idea, because it can be very costly, especially when we need to generate a large number of variates, e.g. to perform a Monte Carlo integration.
Figure 6.2: Schematic of the inverse method: P({a ≤ U ≤ b}) = P({F_X^{−1}(a) ≤ X ≤ F_X^{−1}(b)}).
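The exponential-sampling recipe above (X = −ln(U)/λ) can be sketched in a few lines. An illustrative Python version (the notes use Matlab's rand; 1 − U is used inside the logarithm to avoid log(0), which is equivalent in distribution):

```python
import math
import random

def exp_variate(lam, rng=random):
    # Inverse transform for the exponential distribution:
    # F(x) = 1 - e^{-lam*x}  =>  X = -ln(1 - U)/lam,
    # and since 1 - U is also U(0,1), X = -ln(U)/lam works equally well.
    return -math.log(1.0 - rng.random()) / lam

random.seed(2)
n = 200_000
mean_est = sum(exp_variate(2.0) for _ in range(n)) / n
print(mean_est)  # estimates E[X] = 1/lam = 0.5
```

The sample mean of the generated variates reproduces the integral ∫_0^∞ 2x e^{−2x} dx = 1/2 computed in the example above.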
A better approach, when the inverse of F(x) is not known in closed form or is expensive to evaluate, is the acceptance-rejection method discussed here. To simulate a random variable X obeying a given probability distribution with pdf f(x), the acceptance-rejection method starts by finding a function t(x) such that

    f(x) ≤ t(x),  ∀x,   and   K = ∫_{−∞}^{+∞} t(x) dx < +∞,

whose normalization is easy to sample from. Once such a function t(x) is found, and noting that g(y) = t(y)/K (so that ∫ g(y) dy = 1) is a probability density function, the acceptance-rejection method consists of the three main steps listed below.

1. Use the inverse transform method to generate a pseudo-random number Y distributed according to g(y).
2. Draw a uniform variate U from U(0,1), independent of Y.
3. If U ≤ f(Y)/t(Y), then return X = Y (accept); otherwise go to step 1 (reject).

Recall that for a given (fixed) Y, the probability that a uniformly distributed random number U satisfies U ≤ f(Y)/t(Y) is P({U ≤ f(Y)/t(Y)}) = f(Y)/t(Y). Therefore, the closer this ratio is to one, the better the chances that the random number Y is accepted and the procedure terminates. Points Y where this ratio is close to 1 are likely to be accepted, while those with a small f(Y)/t(Y) are unlikely to be accepted. To gain efficiency, it is thus important to choose a function t(x) as close as possible to f(x). Also, it can be shown that the average number of iterations (acceptance-rejection trials) needed to terminate the procedure with an accepted value X is K = ∫_{−∞}^{+∞} t(x) dx.

If the support of f(x) is bounded, i.e. f(x) = 0 outside a bounded interval [α,β], then a natural choice for g(x) is simply the uniform distribution on [α,β]: choose t(x) = constant = max_{[α,β]} f(x) for α ≤ x ≤ β and t(x) = 0 otherwise. As an example we consider below the beta distribution on [0,1],

    f(x) = x^{α₁−1}(1−x)^{α₂−1} / B(α₁,α₂),  x ∈ [0,1],

where

    B(α₁,α₂) = ∫_0^1 x^{α₁−1}(1−x)^{α₂−1} dx.

For α₁ = α₂ = 3, we have

    f(x) = 30(x² − 2x³ + x⁴),  x ∈ [0,1].
The corresponding cdf is a fifth-order polynomial that is not easy to invert, so the inverse transform method would be hard to apply. We thus use the acceptance-rejection method. Let t(x) = max_{[0,1]} f(x) = f(1/2) = 30/16. Using the uniform distribution as the reference density g(x), we have the following algorithm.
1. Draw two independent uniform random variates U₁, U₂ from U(0,1).
2. If U₂ ≤ 16(U₁² − 2U₁³ + U₁⁴), accept X = U₁; otherwise, reject and go back to step 1.

Exercise: As an exercise, you can try to verify statistically that the average number of iterations needed to generate one random number using this algorithm is K = 30/16 = 1.875. Also, you can estimate the expectation E[X] = ∫_0^1 x f(x) dx.

Polar approach for generating normal variates

The inverse and acceptance-rejection methods are universal: in theory they can be used for an arbitrary distribution. However, they both have their limitations. As already mentioned, it is not always easy to find the inverse of the cdf for the inverse method, and for the acceptance-rejection method, except in the case of a bounded support, finding the dominating function t(x) is not always trivial. In some situations it is beneficial to design a sampling method for a specific distribution; the designed method may or may not be applicable to other distributions. Such methods are called ad hoc methods. Here we describe such a method for the normal distribution, called the polar method because it is based on polar coordinates.

Consider the probability density function of the standard normal distribution in two dimensions,

    f(x,y) = (1/2π) e^{−(x²+y²)/2}.

This can be thought of as the joint pdf of two independent standard Gaussian random variables X, Y with respective pdf's

    f_X(x) = (1/√(2π)) e^{−x²/2},   f_Y(y) = (1/√(2π)) e^{−y²/2}.

Let D = R² = X² + Y² and Θ = arctan(Y/X) be two functions of the random variables X and Y. We view D, Θ as two new random variables in polar coordinates (r = √(x²+y²), θ = arctan(y/x)). Let's compute the joint cumulative probability function of D, Θ. First note that D ≥ 0 and Θ ∈ [0,2π]. Therefore it suffices to find the probability P({0 ≤ D ≤ d, 0 ≤ Θ ≤ θ}) for all (d,θ) ∈ [0,+∞) × [0,2π].
We have

    P({0 ≤ D ≤ d, 0 ≤ Θ ≤ θ}) = ∫∫_{{x²+y² ≤ d, 0 ≤ arctan(y/x) ≤ θ}} (1/2π) e^{−(x²+y²)/2} dx dy.

We express this integral in polar coordinates:

    P({0 ≤ D ≤ d, 0 ≤ Θ ≤ θ}) = (1/2π) ∫_0^θ ∫_0^{√d} e^{−r²/2} r dr dα = ( (1/2π) ∫_0^θ dα ) ( ∫_0^{√d} e^{−r²/2} r dr ).

The θ-integral can readily be viewed as the cdf of the uniform distribution for Θ on the interval [0,2π],

    P({0 ≤ Θ ≤ θ}) = (1/2π) ∫_0^θ dα = θ/2π,
while the remaining r-integral is further transformed by the change of variables s = r², yielding

    P({0 ≤ D ≤ d}) = (1/2) ∫_0^d e^{−s/2} ds = 1 − e^{−d/2},

which is simply the exponential distribution (with λ = 1/2) for D. Thus the two random variables D, Θ are independent,

    P({0 ≤ D ≤ d, 0 ≤ Θ ≤ θ}) = P({0 ≤ Θ ≤ θ}) P({0 ≤ D ≤ d}),

and have much simpler distributions than X, Y: (D,Θ) can be easily sampled by the two methods described above, the inverse transform for D and (stretched) uniform variates for Θ. X, Y are then obtained by converting back to Cartesian coordinates. We obtain the following Box-Muller algorithm.

    1. Generate two independent uniform variates U₁, U₂ ~ U(0,1).
    2. Set Θ = 2πU₁ and D = −2 ln(U₂).
    3. Return X = √D cos Θ and Y = √D sin Θ.

Exercise: read Example 4.8, page 237, from the book.

6.4 Controlling the sampling error and variance reduction techniques

Assume we use Monte Carlo integration to estimate the mean µ of a given probability distribution. We do this by building the sample mean

    X̄(n) = (1/n) Σ_{j=1}^{n} X_j,

where the X_j are independent samples drawn from the given distribution. Statistically speaking, as mentioned earlier, X̄(n), regarded as a random variable, is an unbiased estimate of µ, since E[X̄(n)] = E[X_j] = µ for all j. However, we may wonder how good this estimate is as an actual approximation of µ. We can quantify this by computing the expectation of the squared error, which in some sense is a measure of the total error:

    E[(X̄(n) − µ)²] = Var[X̄(n)] = σ²/n,

where σ is the standard deviation of the distribution. We see clearly from this identity that as the number n of replications (or samples) increases, the estimate is improved, since its variance
decreases. In practice, however, both µ and σ are unknown; therefore we may also rely on Monte Carlo to estimate σ:

    σ² ≈ S²(n) = (1/n) Σ_{j=1}^{n} (X_j − X̄(n))²,

which has a certain unknown error |σ − S(n)|. If this error is small, then we can use the approximation σ ≈ S(n) to estimate the number n of samples required to get a good estimate of µ using the sample mean X̄(n).

Confidence interval: To control the sampling error, in practice we rely on what is called a confidence interval: a real number δ > 0 is specified such that X̄(n) ∈ [µ − δ, µ + δ] with probability (1 − α). According to the law of large numbers (combined with the central limit theorem), the confidence interval is such that

    δ = z_α √(S²(n)/n),

where z_α, the (1 − α/2) quantile of the standard normal distribution, satisfies

    1 − α/2 = ∫_{−∞}^{z_α} (1/√(2π)) e^{−x²/2} dx,

i.e. z_α can be computed by inverting the cumulative probability function of the standard normal distribution. In fact, according to the central limit theorem, Z = (X̄(n) − µ)/(σ/√n) is (approximately) a standard normal random variable. Accordingly, we have

    1 − α = P({−z_α ≤ Z ≤ z_α}) = P({µ − z_α σ/√n ≤ X̄(n) ≤ µ + z_α σ/√n}).

The relation δ = z_α √(S²(n)/n) suggests that the absolute error |X̄(n) − µ| converges to zero on the order of 1/√n as n → +∞. This is in fact confirmed by the numerical example of a European vanilla call option provided below.

Assume we want to control the absolute error in such a way that |X̄(n) − µ| ≤ ε with a given tolerance ε > 0. The confidence interval states that

    P({X̄(n) − δ < µ < X̄(n) + δ}) ≈ (1 − α),   δ = z_α √(S²(n)/n)

(the ≈ here is because S²(n) is only an approximation of σ², and because the standard normal distribution for z_α is itself an approximation based on the central limit theorem).
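The quantities just defined are easy to compute directly. Here is an illustrative Python sketch (the notes use Matlab's normfit; z = 1.96 is the familiar 0.975 quantile for a 95% interval, and S²(n) uses the 1/n normalization as in the text):

```python
import math
import random

def confidence_interval(samples, z=1.96):
    """Return the sample mean and the 95% half-width
    delta = z * sqrt(S^2(n)/n), with S^2(n) = (1/n) sum (X_j - mean)^2."""
    n = len(samples)
    mean = sum(samples) / n
    s2 = sum((x - mean) ** 2 for x in samples) / n
    delta = z * math.sqrt(s2 / n)
    return mean, delta

# Example: estimate the integral of e^x over [0,1] (exact value e - 1).
random.seed(3)
samples = [math.exp(random.random()) for _ in range(100_000)]
mean, delta = confidence_interval(samples)
print(mean - delta, mean + delta)  # interval that should bracket e - 1
```

Quadrupling n halves δ, which is the O(1/√n) convergence discussed above.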
Therefore, to obtain an estimate of µ within a tolerance ε, with probability (1 − α) (α > 0 small), all we need to do is sample until

    δ = z_α √(S²(n)/n) < ε,

so that |X̄(n) − µ| < δ implies |X̄(n) − µ| < ε.

Example: As an example, we consider the pricing of a European vanilla call option. The Matlab code below extends the Monte Carlo integration code used before to compute the expected payoff of a European vanilla call option, to include the confidence interval (CI). It uses the Matlab function normfit, which for a given vector of samples X returns estimates of the mean µ̂ and the standard deviation σ̂, and the limits CI = (CI₁, CI₂) of the 95% confidence interval, i.e. with probability 1 − α = 0.95 the true mean µ satisfies CI₁ ≤ µ ≤ CI₂.

function [Price, CI] = BlsMC2(S0,K,r,T,sigma,NRepl)
% Returns estimated mean (Price) and confidence interval (CI)
nuT = (r - 0.5*sigma^2)*T;
siT = sigma * sqrt(T);
DiscPayoff = exp(-r*T)*max(0, S0*exp(nuT+siT*randn(NRepl,1))-K);
[Price, VarPrice, CI] = normfit(DiscPayoff);

Note that normfit returns three outputs: the expected payoff (Price), the estimated standard deviation (named VarPrice here), and the limits of the confidence interval (CI), which is a 2-by-1 vector. Next, the BlsMC2 function is called a few times with two different numbers of replications, NRepl = 100,000 and NRepl = 1,000,000. The CI vector is then used to compute some sort of relative error: the relative length of the confidence interval, (CI₂ − CI₁)/µ̂.

>> randn('state',0)  % initializes the randn function,
                     % which generates standard normal variates
>> S0 = 50; K = 55; r = 0.05; T = 5/12; sigma = 0.2;
>> [CallMC,CI] = BlsMC2(S0,K,r,T,sigma,100000)
CallMC =
CI =
>> (CI(2)-CI(1))/CallMC
ans =
>> [CallMC,CI] = BlsMC2(S0,K,r,T,sigma,100000)
CallMC =
CI =
>> (CI(2)-CI(1))/CallMC
ans =
>> [CallMC,CI] = BlsMC2(S0,K,r,T,sigma,100000)
CallMC =
CI =
>> (CI(2)-CI(1))/CallMC
ans =
>> [CallMC,CI] = BlsMC2(S0,K,r,T,sigma,1000000)
CallMC =
CI =
>> (CI(2)-CI(1))/CallMC
ans =
>> [CallMC,CI] = BlsMC2(S0,K,r,T,sigma,1000000)
CallMC =
CI =
>> (CI(2)-CI(1))/CallMC
ans =
>> [CallMC,CI] = BlsMC2(S0,K,r,T,sigma,1000000)
CallMC =
CI =
>> (CI(2)-CI(1))/CallMC
ans =

The important thing to learn from this example is that the length of the confidence interval depends on the number of replications. When the number of replications is increased from 100,000 to 1,000,000, the relative length of the confidence interval drops from about δ₁ = to δ₂ = . Note that the ratio δ₁/δ₂ ≈ √10 = √(1,000,000/100,000) confirms the theoretical
prediction that crude Monte Carlo has a rate of convergence on the order O(1/√N), where N is the number of replications. This is rather slow: doubling the number of replications reduces the error by a factor of only 1/√2.

One way of improving the error, without taking astronomically large numbers of samples, is to use what is called a variance reduction technique, which permits reducing the sample variance (represented by S²(n) above) and thus the absolute-error bound δ. Many such techniques have been developed and used by statisticians; a few of them are listed in the book. They each have their own strengths and weaknesses, but conceptually they are all similar. The idea is that instead of estimating µ = E[X] using only one set of independent samples X_j, we use at least two different sets of samples, Y_j, Z_j, associated with two dependent random variables Y, Z that are negatively correlated, Cov(Y,Z) < 0, such that

    µ = E[X] = E[Y + Z]

but

    Var[Y + Z] = Var[Y] + Var[Z] + 2 Cov(Y,Z) < Var[X].

Thus, clearly, sampling µ from the sum Y + Z will yield a better approximation, with a smaller absolute error, since Y + Z has a smaller variance than the original random variable X. As an example, we discuss next one of the simplest variance reduction techniques, namely the antithetic sampling method.

Antithetic sampling

Antithetic sampling consists in generating two sets of samples X¹_j, X²_j, j = 1,2,...,n, such that µ = E[X¹_j] = E[X²_j], σ² = Var[X¹_j] = Var[X²_j], and such that X¹_j, X²_k are independent if j ≠ k, while X¹_j, X²_j are dependent and satisfy Cov(X¹_j, X²_j) < 0. The idea then is to set Y_j = X¹_j/2 and Z_j = X²_j/2. Note that the variables Y_j + Z_j = (X¹_j + X²_j)/2, j = 1,2,...,n, are i.i.d. random variables. Let

    X̄(n) = (1/n) Σ_{j=1}^{n} (X¹_j + X²_j)/2.

We have

    E[X̄(n)] = E[(X¹_j + X²_j)/2] = (E[X¹_j] + E[X²_j])/2 = µ,

i.e. X̄(n) provides an unbiased estimate of µ.
The sample variance is given by

    Var[X̄(n)] = (1/n²) Σ_{j=1}^{n} Var[(X¹_j + X²_j)/2]
              = (1/n²) Σ_{j=1}^{n} ( Var[X¹_j/2] + Var[X²_j/2] + 2 Cov(X¹_j/2, X²_j/2) )
              = (1/(4n²)) ( n Var[X¹] + n Var[X²] + 2 Σ_{j=1}^{n} Cov(X¹_j, X²_j) )
              = (σ²/(2n)) ( 1 + (1/n) Σ_{j=1}^{n} ρ(X¹_j, X²_j) ),                      (6.2)

where ρ(X¹_j, X²_j) = Cov(X¹_j, X²_j)/√(Var[X¹_j] Var[X²_j]) is the normalized correlation.
149 Let ρ = ρ(xj 1,X2 j ), j = 1,,n, with ρ > 0, then Var[ X(n)] = σ2 σ2 (1 ρ) < 2n 2n Thus, we get a reduced sample variance provided 1 ρ(xj 1,X2 j ) < 0. Note that by using the rectangular inequality 2 ab < a 2 +b 2 it is easy to show that 1 ρ(xj 1,X2 j ) 1. For a given X 1 j, the sequence X2 j can be chosen by exploiting possible symmetries of the given distribution. The cases of the uniform and normal distributions are discussed next. For the standard uniform distribution for example we can set Xj 1 = U j U(0,1) and Xj 2 = 1 U j. It is easy to see in this case that Xj 1 and X2 j are negatively correlated. Cov(X 1 j,x 2 j) = E[X 1 jx 2 j] E[X 1 j]e[x 2 j] = E[U(1 U)] E[U]E[(1 U)] since U and 1 U have the same (uniform) distribution we have E[U] = E[(1 U)] = 1/2 and Thus E[U(1 U)] = 1 0 u(1 u)du = = 1 6 Cov(X 1 j,x 2 j) = = 1 12 < 0 For the standard normal distribution Y N(0,1), we can set X j 1 = Y j and X j 2 = Y j. Clearly X j 1 and Xj 2 are identically distributed since by symmetry we have y e t2 /2 dt = + i.e P({Y y}) = P({ Y y}) = P({Y y}). But y e t2 /2 dt Cov(X 1 j,x 2 j) = E[Y( Y)] E[Y]E[ Y] = E[Y 2 ]+E[Y] 2 = Var[Y] = 1 < 0 Example We illustrate this by the simple example of evaluating Using only 100 samples, we have I = 1 0 e x dx = e >> rand( state,0) % >> X = exp(rand(100,1)); >> [I, dummy, CI] = normfit(x); >> I I = 148
>> (CI(2)-CI(1))/I
ans =

%% Using antithetic sampling
>> U = rand(100,1);
>> X = (exp(U) + exp(1-U))/2;
>> [I, dummy, CI] = normfit(X);
>> I
I =
>> (CI(2)-CI(1))/I
ans =

From this example we see clearly that antithetic sampling improves the confidence interval by a factor of about 10. However, antithetic sampling also has its limitations. As illustrated by Example 4.12, page 245, in the book, when antithetic sampling is applied to estimate an integral as an expectation,

    I = ∫_a^b g(x) dx = E_f[h(X)] = ∫_a^b h(x) f(x) dx,

where f is a pdf and h is such that h(x) = g(x)/f(x), one requirement is that the function h be monotonic, in order to preserve the negative correlation between Y and Z, so that

    Cov(Y,Z) < 0  =>  Cov(h(Y), h(Z)) < 0.

Otherwise we need to construct Y, Z such that Cov(h(Y), h(Z)) < 0, which can of course be difficult. In the example above, the choice Y = e^U and Z = e^{1−U} worked fine because Y and Z are in fact negatively correlated: with µ = ∫_0^1 e^x dx = e − 1, we have

    Cov(Y,Z) = ∫_0^1 (e^x − µ)(e^{1−x} − µ) dx = e − µ² ≈ −0.2342 < 0.

Appendix: Proof of Chebyshev's Inequality

Chebyshev's inequality states that, for a random variable X of mean µ and variance σ², we have

    ∀ε > 0,  P({|X − µ| > ε}) ≤ σ²/ε².
To prove this inequality we use the more general Markov inequality, stating that if Y ≥ 0 is a non-negative random variable, then for all d > 0 we have

    P({Y ≥ d}) ≤ (1/d) E[Y].

Proof of Markov's inequality: Let

    Z = d if Y ≥ d,   Z = 0 otherwise.

We have 0 ≤ Z ≤ Y, hence E[Y] ≥ E[Z] = d P({Y ≥ d}).

Proof of Chebyshev's inequality: Let Y = (X − µ)² and d = ε². We have E[Y] = σ². Thus, applying Markov's inequality yields

    P({(X − µ)² > ε²}) ≤ σ²/ε².

Problems

1. Consider a congruential random number generator x_{k+1} = (a x_k + b) mod M with b = 0, M = 8192 and seed x_0 = 1.
   (a) What is the period of this generator if a = 2? Hint: recall that the period is the smallest positive integer T such that x_{T+k} = x_k for some k, and note that 8192 = 2¹³.
   (b) What is the period if a = 125?
   (c) (hard) What is the longest possible period, given these values of b, M, and x_0?

2. Monte Carlo integration requires random sampling of a region in d-dimensional space. If the sequence of random numbers one is using exhibits any serial correlation, then the sampling may systematically miss a portion of the region, and may possibly even be confined to a subregion of lower dimension, which would obviously make the estimate of the integral erroneous. One way to detect such serial correlation is to plot pairs of consecutive pseudo-random numbers in the plane, so that any nonrandom pattern becomes apparent visually.
   (a) Use the congruential random number generator x_{k+1} = (a x_k + b) mod M, with a = 125, b = 0, M = 8192 and seed x_0 = 1, to generate a sequence of random integers, and convert each to a floating-point number f_k on [0,1) by taking f_k = x_k/M. Plot 100 pairs of consecutive members of the sequence, i.e. let X = [f_1, f_3, ..., f_199] and Y = [f_2, f_4, ..., f_200], then plot the vector Y as a function of X. In Matlab this can be done as follows.
%let f be the vector containing the generated sequence of pseudo-random numbers
%set
>> X = f(1:2:199);
>> Y = f(2:2:200);
>> plot(X,Y,'o')

   If you don't notice an obvious pattern, then increase the number of pairs from 100 to 1000, for example.
   (b) Repeat the experiment in (a), but this time use the function rand of Matlab, instead of f_k, to generate the sequence of pseudo-random numbers on [0,1). Attempt with both 100 and 1000 pairs.

3. A sequence of random numbers distributed uniformly on [0,1) should have mean and variance

    µ = ∫_0^1 x dx = 1/2,   σ² = ∫_0^1 (x − µ)² dx = 1/12.

   Check both the congruential random number generator of the previous exercise (2) and the function rand of Matlab to see how close they come to these values for a sequence of length N. Do both of them pass this test?

4. In this exercise we test three different approaches to generating pseudo-normal variates with mean zero and variance one: one based on the central limit theorem, the polar method discussed in the notes, and the function randn of Matlab.
   (a) Method based on the central limit theorem. Recall that according to the central limit theorem, if x_k, k = 1,2,...,n, are n random numbers from a distribution with mean µ and variance σ², then as n increases,

    y = (Σ_{k=1}^{n} x_k − nµ)/(σ√n)

   approaches a normal distribution with mean zero and variance one. If the x_k's are uniformly distributed on [0,1], then µ = 1/2 and σ² = 1/12. If in addition we choose n = 12, then we obtain a simple formula:

    y = Σ_{k=1}^{12} x_k − 6

   is approximately normally distributed with mean zero and variance one. Write a short Matlab code to generate normal variates according to this formula, by generating 12 uniform variates using the function rand of Matlab, summing them together, and subtracting 6, e.g.

    >> y = sum(rand(1,12)) - 6;

   will generate one pseudo-random number which is approximately N(0,1). Use this as a building block to write a Matlab code that generates sequences of normally distributed random numbers of arbitrary size.
   (b) The polar method. Let x₁, x₂ be two independent uniformly distributed (pseudo-)random numbers on [0,1). Then

    y₁ = sin(2πx₁)√(−2 log(x₂))   and   y₂ = cos(2πx₁)√(−2 log(x₂))

   are normally distributed with mean zero and variance one. Write a Matlab code to generate a sequence of normally distributed random numbers of arbitrary size according to this algorithm.
   (c) Generate 1000 normally distributed N(0,1) numbers according to each of the algorithms above and according to the Matlab function randn, and save the three sequences as three different vectors, which you may call Ncentral, Npolar, Nrandn, respectively. Then use the Matlab function hist to bin each of the three random vectors in bins of size 1. Normalize the bin counts by the total number of samples (1000). Use the bar command of Matlab to plot the histogram, and plot the normal density f(x) = e^{−x²/2}/√(2π) on top of each histogram. Follow the simple Matlab instructions below, for the function randn of Matlab, as a guideline example.

>> N = 1000;
>> Nrandn = randn(N,1);
>> x = -3:1:3;
>> nh = hist(Nrandn,x);  % counts the number of random numbers
                         % in each subinterval centered at x
>> nhnormalized = nh/N;
>> figure
>> bar(x, nhnormalized)
>> hold on
>> ezplot('exp(-x^2/2)/sqrt(2*pi)',[-4,4]);

   (d) Check the validity of each of the three methods by comparing the numerical values of the unit-binned histograms (nhnormalized in the Matlab code above) to the exact normal distribution:

    h_i = ∫_{i−0.5}^{i+0.5} (e^{−x²/2}/√(2π)) dx,  i = ..., −2, −1, 0, 1, 2, ...
        = ..., 0.005977, 0.060598, 0.241730, 0.382925, 0.241730, 0.060598, 0.005977, ...

5. Consider the function

    h(x) = 0 if x < 0;   2x if 0 ≤ x ≤ 0.5;   2 − 2x if 0.5 ≤ x ≤ 1;   0 if x > 1.

   Use Monte Carlo integration to estimate the integral I = ∫_0^1 h(x) dx, using the uniform random number generator rand of Matlab. Use the function normfit to estimate both the integral and the associated confidence interval for a number of samples N = 100. Use the Matlab M-file function below to evaluate h(x).
154 function hX = hfunction(X)
N = max(size(X));
hX = zeros(N,1);
Int1 = find(X >= 0 & X <= 0.5);
Int2 = find(X >= 0.5 & X <= 1);
hX(Int1) = 2*X(Int1);
hX(Int2) = 2 - 2*X(Int2);
Repeat the experiment using antithetic sampling instead (i.e. ∫_0^1 h(x)dx = E[(h(U) + h(1−U))/2]). Compare the lengths of the confidence intervals in the cases with and without antithetic sampling. Does antithetic sampling reduce variance in this case? Why? (Provide an analytic reason.) Hint: compare Var[h(U)] and Var[(h(U) + h(1−U))/2].
6. Consider the beta distribution
f(x) = x^{α1−1}(1−x)^{α2−1}/B(α1,α2), x ∈ [0,1],
where
B(α1,α2) = ∫_0^1 x^{α1−1}(1−x)^{α2−1} dx,
with α1 = α2 = 3.
(a) Write a matlab code based on the acceptance-rejection method to sample this distribution.
(b) Use your code to estimate E[X] if X is a random variable beta-distributed on [0,1] with α1 = α2 = 3.
(c) Estimate the average number of iterations it takes for the acceptance-rejection algorithm to terminate and return an accepted random number. (The number of iterations before one random number is accepted is the number of rejections that occurred plus one, each time the algorithm is called.)
Answer:
%%%accept_reject.m
%%%%%%%%%%%%%%%%%%%%%%%
function [Ex,ANT]=accept_reject(N)
%%Ex = expectation, ANT = average number of trials per accepted draw
rand('state',0)
X=zeros(N,1);
NT=X;
for I=1:N
  nr=1; nt=0;
  while(nr==1)
    nt=nt+1;
    U1=rand(1); U2=rand(1); 153
155
    if(U2<=16*(U1^2-2*U1^3+U1^4))
      X(I)=U1;
      nr=0;
    end
  end
  NT(I)=nt;
end
ANT=mean(NT);
Ex=mean(X);
%%%%%%%%%%%%%%%%%%%%%
>> [E,N]=accept_reject(100)
E =
N =
>> [E,N]=accept_reject(1000)
E =
N =
>> [E,N]=accept_reject(10000)
E =
N =
>> [E,N]=accept_reject(100000) 154
156 E = N =
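The same acceptance-rejection scheme can be sketched in any language; here is a hypothetical Python port of accept_reject.m above. Since the Beta(3,3) density f(x) = 30x²(1−x)² is symmetric about 1/2 and its maximum is c = f(1/2) = 1.875, the estimate of E[X] should approach 0.5 and the average number of trials per accepted draw should approach c = 1.875:

```python
import random

def accept_reject(n, rng=random.random):
    """Sample the Beta(3,3) density f(x) = 30 x^2 (1-x)^2 on [0,1]
    by acceptance-rejection and record the trials per accepted draw."""
    xs, trials = [], []
    for _ in range(n):
        t = 0
        while True:
            t += 1
            u1, u2 = rng(), rng()
            # Accept u1 with probability f(u1)/c where c = max f = 1.875,
            # i.e. when u2 <= 16*(u1^2 - 2*u1^3 + u1^4) = f(u1)/c.
            if u2 <= 16.0 * (u1**2 - 2.0 * u1**3 + u1**4):
                xs.append(u1)
                break
        trials.append(t)
    return sum(xs) / n, sum(trials) / n

random.seed(1)
ex, ant = accept_reject(100000)
print(round(ex, 3), round(ant, 3))  # ~0.5 and ~1.875
```

The average number of trials equals the envelope constant c because each trial accepts with probability 1/c.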
157 Chapter 7
Optimization and multidimensional Newton's method
Optimization problems are widespread in finance and economics. They range from the maximization of the return of a portfolio to the minimization of the cost of an enterprise. Optimization problems are in general given in the form
Find min F(x1,x2,…,xn)
Such that G_i(x1,x2,…,xn) = 0, i = 1,2,…,m,   (7.1)
and H_j(x1,x2,…,xn) ≤ 0, j = 1,2,…,p,
where F, the G_i's, and the H_j's are multivariable functions of x1,x2,…,xn on some domain D of R^n. F(x1,x2,…,xn) is often called the objective or cost function, while G_i(x1,…,xn), i = 1,2,…,m and H_j(x1,…,xn), j = 1,2,…,p define supplemental constraints; the G_i's define equality constraints and the H_j's define inequality constraints. Both kinds of constraints can be present at the same time, or only one of them. When there are no constraints, i.e., when (7.1) reduces to min F(x1,…,xn), the minimization problem is said to be unconstrained.
An optimization problem can also occur as a maximization problem, e.g., Find max F(x1,…,xn). However, since max F(x) = −min(−F(x)), a maximization problem can be regarded as the minimization problem for the function F̃(x) = −F(x). Therefore it suffices to treat minimization problems.
The usual procedure for solving a minimization problem such as (7.1) is to find a minimizer point (x1*,x2*,…,xn*) in the domain of F, which satisfies any constraints associated with the problem, such that
F(x1*,x2*,…,xn*) ≤ F(x1,x2,…,xn) 156
158 for all (x1,x2,…,xn) in the domain of F satisfying the constraints, if any. The part of the domain of F that satisfies the eventual constraints of the optimization problem is often called the feasible domain.
The goal of this chapter is to compute minimizers for optimization problems using numerical methods. To begin we start with the easiest case of one-dimensional unconstrained problems.
7.1 Unconstrained optimization in one dimension
Suppose we want to find the minimum of a function F(x) for x ∈ [a,b], i.e.,
Find min_{x∈[a,b]} F(x).
According to the remark above this is equivalent, in practice, to
Find x* ∈ [a,b] such that F(x*) ≤ F(x), ∀x ∈ [a,b],
when such x* does exist.
Let us assume that F is a smooth function, with at least two continuous derivatives on [a,b]. Recall from calculus that in this case the minimizer x* is either an end-point (x* = a or x* = b) or a local minimum. In the case of a local minimum we necessarily have F'(x*) = 0. Therefore finding the minimum of F(x) amounts to finding, in the first place, the zeros of the derivative F'(x). Then one needs to compare the values of F at all those zeros of F'(x), together with the end-point values F(a) and F(b), to find the global minimum.
To solve F'(x) = 0, we can use, for instance, Newton's method. Recall that Newton's method for f(x) = 0 leads to the iterative process:
Given an initial guess x0, x_{n+1} = x_n − f(x_n)/f'(x_n), n ≥ 0,
which transforms to
Given an initial guess x0, x_{n+1} = x_n − F'(x_n)/F''(x_n), n ≥ 0,
for the minimization problem.
Example: Find the shortest Euclidean distance between a point (a,b) in the xy-plane and a curve y = g(x), where g is a smooth function. The Euclidean distance between (a,b) and an arbitrary point (x, y = g(x)) on the graph is
d(x) = √((x−a)² + (b−y)²) = √((x−a)² + (b−g(x))²).
Note that the presence of the square root makes the Euclidean distance a non-smooth function near any point where d(x) = 0, i.e., when (a,b) lies on the curve itself. To avoid such unpleasant behaviour, it is often better to consider the minimization of the squared distance
F(x) = d²(x) = (x−a)² + (g(x)−b)², 157
159 which is clearly a smooth function on the domain of g(x), provided g(x) is smooth. We have
F'(x) = 2(x−a) + 2(g(x)−b)g'(x)
and
F''(x) = 2 + 2(g(x)−b)g''(x) + 2(g'(x))².
Thus, applying Newton's method to min F(x) yields the iterative process
x0 given, x_{n+1} = x_n − F'(x_n)/F''(x_n) = x_n − [(x_n−a) + (g(x_n)−b)g'(x_n)] / [1 + (g(x_n)−b)g''(x_n) + (g'(x_n))²].
Exercise: Write a matlab program to find an approximate value of the shortest distance between the origin and the graph of y = cos(x²+1). Run your program and provide a reasonable answer to this problem.
7.2 Unconstrained multivariable smooth optimization and Newton's method for systems
As for one-dimensional optimization problems, to find the minimum of an unconstrained optimization problem
min F(x1,x2,…,xn),
multivariable calculus suggests that, in the case of a smooth objective function F(x1,…,xn), the optimization problem leads to finding the zeros of the gradient of F:
∇F(x1,…,xn) = (∂F/∂x1, ∂F/∂x2, …, ∂F/∂xn)^T.
Finding zeros of such gradients can be done by generalizing Newton's method to systems of nonlinear equations. Therefore it is meaningful to first introduce Newton's method for systems of nonlinear equations. Hence, we consider the problem of finding the point (x1*,x2*,…,xn*) in R^n solution of
f1(x1,x2,…,xn) = 0
f2(x1,x2,…,xn) = 0
⋮
fn(x1,x2,…,xn) = 0,   (7.2)
where f1,f2,…,fn are n functions of the n variables x1,x2,…,xn, which can be thought of as the partial derivatives of the function F introduced above. For simplicity in exposition we introduce 158
160 the vector notation
x = (x1, x2, …, xn)^T and f(x) = (f1(x), f2(x), …, fn(x))^T.
Thus the non-linear system (7.2) becomes simply f(x) = 0.
Let Df denote the Jacobian-derivative matrix of the vector function f, i.e., the n-by-n matrix whose (i,j) entry is ∂f_i(x)/∂x_j:
Df(x) = [ ∂f1/∂x1  ∂f1/∂x2  …  ∂f1/∂xn
          ∂f2/∂x1  ∂f2/∂x2  …  ∂f2/∂xn
             ⋮        ⋮             ⋮
          ∂fn/∂x1  ∂fn/∂x2  …  ∂fn/∂xn ].
Then Newton's method generalizes for the system f(x) = 0 as follows. Let x_0 = (x1^0, x2^0, …, xn^0)^T be a given initial guess in R^n; then
x_{n+1} = x_n − [Df(x_n)]^{−1} f(x_n),   (7.3)
where [Df(x_n)]^{−1} is the inverse of the Jacobian matrix introduced above. However, in practice, one does not need to invert the matrix Df in order to get the iterative process going. In fact, inversion should be avoided, as the calculation can be very unstable. Instead, at each iteration we solve the linear system
[Df(x_n)](x_{n+1} − x_n) = −f(x_n),
which is obtained from (7.3) by multiplying each side of the equality by the matrix [Df(x_n)] and moving the term [Df(x_n)]x_n to the left. Note that the solution of the linear system is the difference (x_{n+1} − x_n). So we have two main steps at each iteration: first solve the system AX = b with A = [Df(x_n)] and b = −f(x_n), then set x_{n+1} = x_n + X.
In matlab language this can be written as follows. Assume that the function f and its Jacobian derivative are coded separately in their own M-files fct.m and Dfct.m, which can be called directly for any given x and return the values f(x) and Df(x), respectively. For a given initial guess X0 = (x1,…,xn) and a convergence tolerance Tol, we have: 159
161 X0 = [x1, x2, ..., xn]';
A = Dfct(X0);
b = -fct(X0);
X = A\b;
X0 = X0 + X;
if (abs(X) < Tol)
  return
end
(These statements form one Newton iteration; in practice they are repeated in a loop until convergence.)
Derivation of the multidimensional Newton's Method
The multidimensional Newton's method can be derived through a first order Taylor expansion, just like in the one-dimensional case. The first-order Taylor expansion of the multidimensional and multivariable function f(x) reads
f(x) = f(x0) + [Df(x0)](x − x0) + …
Here f is from R^n to R^n, [Df(x0)] is the n-by-n Jacobian matrix, and [Df(x0)](x − x0) is a matrix-vector multiplication. Approximating the equation f(x) = 0 by
f(x0) + [Df(x0)](x − x0) = 0
yields
x = x0 − [Df(x0)]^{−1} f(x0),
which is the first step of the Newton's method given above, with x1 = x.
Example from micro-economics: Cournot equilibrium
Consider two firms 1 and 2 manufacturing the same product. Let q1 be the quantity produced by firm 1 and q2 the quantity produced by firm 2, in some given time interval. Let
C_i(q_i) = (1/2) c_i q_i², i = 1,2,
be the cost functions of the two firms, respectively, where c1, c2 are two constants, and let
P(q) = q^{−1/m}
be the unit price of the product, with m > 0 and where q = q1 + q2 is the total market supply. Note that P is a non-increasing function of q. The profit π_i of firm i is given by
π_i(q1,q2) = P(q1+q2) q_i − C_i(q_i), i = 1,2.
The Cournot equilibrium is defined as the state where both profits π_i, i = 1,2 are maximized with respect to the firm's own quantity q_i, i.e., we need to find quantities q1*, q2* produced by the two firms so that
π1(q1*,q2*) = max_{q1} π1(q1,q2*) and π2(q1*,q2*) = max_{q2} π2(q1*,q2),
which yields
∂π1(q1*,q2*)/∂q1 = 0; ∂π2(q1*,q2*)/∂q2 = 0. 160
162 After the appropriate substitutions and computation of the derivatives, and setting α = −1/m and q = q1 + q2, we get
{ α q^{α−1} q1 + q^α − c1 q1 = 0
{ α q^{α−1} q2 + q^α − c2 q2 = 0.
We can now apply Newton's method to this non-linear system to compute the Cournot equilibrium. To simplify the notation, let f1(q1,q2) = ∂π1(q1,q2)/∂q1 and f2(q1,q2) = ∂π2(q1,q2)/∂q2. Then, with f = (f1,f2)^T, we have
Df(q1,q2) = [ 2α q^{α−1} + α(α−1) q^{α−2} q1 − c1        α q^{α−1} + α(α−1) q^{α−2} q1
               α q^{α−1} + α(α−1) q^{α−2} q2       2α q^{α−1} + α(α−1) q^{α−2} q2 − c2 ].
(Note that, by the chain rule, the off-diagonal entry ∂f1/∂q2 involves q1 and ∂f2/∂q1 involves q2, since the factor q_i in f_i is held fixed when differentiating with respect to the other variable.)
This is the matrix that should be used for A to solve a 2-by-2 linear system at each iteration of the Newton's method.
Remark: Notice that the Cournot equilibrium is a very special optimization problem: each firm has control only over its own production. If, on the other hand, the two firms were actually two plants owned by one single person, who controls the production at both plants, then it is reasonable to maximize the total profit
π(q1,q2) = P(q1+q2)(q1+q2) − (1/2)(c1 q1² + c2 q2²) = (q1+q2)^{1−1/m} − (1/2)(c1 q1² + c2 q2²)
with respect to the two variables (q1,q2), which then yields the non-linear system
∂π(q1,q2)/∂q1 = 0, ∂π(q1,q2)/∂q2 = 0.
This is in fact a completely different problem from the one corresponding to the Cournot equilibrium.
Matlab implementation of the Cournot equilibrium: Assume m = 2, c1 = 1, c2 = 2. The matlab functions below, fct.m, Dfct.m, and Newton.m, define respectively the non-linear system f(q1,q2) = 0, the Jacobian matrix Df(q1,q2), and the Newton's algorithm-main driver, while pi1.m and pi2.m are the two profits corresponding to the two firms (they are all saved in separate M-files).
%%%%fct.m function y = fct(q1,q2) alpha=-1/2;c1=1;c2=2; y=zeros(2,1); y(1)=alpha*(q1+q2).^(alpha-1).*q1 + (q1+q2).^(alpha) -c1*q1; y(2)=alpha*(q1+q2).^(alpha-1).*q2 + (q1+q2).^(alpha) -c2*q2; 161
163 %%%%Dfct.m
function Dy = Dfct(q1,q2)
alpha=-1/2; c1=1; c2=2;
q=q1+q2;
Dy(1,1)=2*alpha*q.^(alpha-1)+alpha*(alpha-1)*q.^(alpha-2).*q1 - c1;
Dy(1,2)=  alpha*q.^(alpha-1)+alpha*(alpha-1)*q.^(alpha-2).*q1;
Dy(2,1)=  alpha*q.^(alpha-1)+alpha*(alpha-1)*q.^(alpha-2).*q2;
Dy(2,2)=2*alpha*q.^(alpha-1)+alpha*(alpha-1)*q.^(alpha-2).*q2 - c2;
%%%%Newton.m
q1=1; q2=1;
Maxiter=100;
iter=1;
Tol=1.e-9;
while(iter<Maxiter)
  A=Dfct(q1,q2);
  b=-fct(q1,q2);
  X=A\b;
  q1=q1+X(1);
  q2=q2+X(2);  %or simply update the vector [q1; q2] by X
  if(abs(X)<Tol)
    display(['Newton''s method converged after ',num2str(iter),' Iterations'])
    display(['Maximizer: q1=',num2str(q1),' q2=',num2str(q2)])
    display(['Profit for firm 1: ',num2str(pi1(q1,q2))])
    display(['Profit for firm 2: ',num2str(pi2(q1,q2))])
    return
  end
  iter=iter+1;
end
display(['Newton''s method failed to converge after ',num2str(iter),' Iterations'])
function p=pi1(q1,q2)
alpha=-1/2; c1=1; c2=2; 162
164 p=(q1+q2)^(alpha).*q1 - c1*q1.^2/2;
function p=pi2(q1,q2)
alpha=-1/2; c1=1; c2=2;
p=(q1+q2)^(alpha).*q2 - c2*q2.^2/2;
(in Matlab Command Window)
>> Newton
Newton's method converged after 10 Iterations
Maximizer: q1= q2=
Profit for firm 1:
Profit for firm 2:
Using an initial-guess point q1 = q2 = 1 and a tolerance Tol = 10^{−9}, Newton's method converged in just 10 iterations, with the approximate maximizer (q1*, q2*) displayed above. Note that the profits π1(q1,q2) and π2(q1,q2) are zero at q1 = 0 and q2 = 0, respectively, converge to −∞ when q1 and q2 tend to +∞, respectively, and are both concave functions. Therefore, using calculus, we know that π1, π2 must attain their maxima at some q1 > 0 and q2 > 0, respectively. Also, from calculus, the computed zeros of f are a priori only local maxima or minima. But using the fact that h(x) = π1(x,q2) and g(x) = π2(q1,x), for fixed q2, q1 respectively, are concave functions, we know for sure that q1*, q2* are in fact global maxima. The concavity follows from the fact that
h''(x) = ∂f1(x,q2)/∂q1 = 2α(x+q2)^{α−1} + α(α−1)(x+q2)^{α−2} x − c
≤ 2α(x+q2)^{α−1} + α(α−1)(x+q2)^{α−2} (x+q2) − c
= 2α(x+q2)^{α−1} + α(α−1)(x+q2)^{α−1} − c
= (α² + α)(x+q2)^{α−1} − c < 0.
The last inequality follows because −1 < α < 0 (α = −1/m, m > 1), and therefore α² + α < 0.
The respective maximized profits for each firm are therefore given by π1(q1*,q2*) and π2(q1*,q2*). Notice that firm 1, with the smaller production cost (c1 = 1 v.s. c2 = 2), produces a larger quantity (q1* > q2*) and has a larger profit.
7.3 One variable unconstrained optimization: Golden section search method
Consider the one-dimensional optimization problem without constraints. 163
165 Find min_{x∈[a,b]} f(x).
We assume that f(x) has a unique minimum (local and global) in [a,b]. Such a function f is often said to be unimodal. See the graph below.
(figure: a unimodal function vs. a function that is not unimodal, with minimizer x*)
Important Remark: Let x* be the unique (global and local) minimum of f(x) in [a,b]. Then, necessarily, f is non-increasing to the left of x* and non-decreasing to the right of x*.
Let x1, x2 be in [a,b] and assume that x1 < x2. Then, according to this remark, we have the following three cases:
i) x1 < x2 ≤ x* ⟹ f(x1) ≥ f(x2)
ii) x* ≤ x1 < x2 ⟹ f(x1) ≤ f(x2)
iii) x1 ≤ x* ≤ x2 ⟹ x* ∈ [a,x2] ∩ [x1,b] (i.e. x* is in both intervals [a,x2] and [x1,b]).
So it is safe to state that for a unimodal function f(x), for any given x1, x2 ∈ [a,b], x1 < x2, as it 164
166 is illustrated on the graphic below, we have
If f(x1) ≥ f(x2), then x* ∈ [x1,b]
If f(x1) ≤ f(x2), then x* ∈ [a,x2]   (7.4)
This reduces the uncertainty about x* from being anywhere in the interval [a,b] to being in a smaller interval, either [a,x2] or [x1,b]. The goal of the golden section search method is to devise an iterative process that effectively reduces the length of the interval containing x* at each iteration, in a manner somewhat similar to the bisection method for f(x) = 0.
(figure: the two cases f(x1) > f(x2) and f(x1) < f(x2) of the golden section search method)
Note also that x2 ∈ [x1,b] and x1 ∈ [a,x2]. The next important step is to choose a new point x3 in the new interval, [x1,b] or [a,x2]. Assume that x1, x2 were chosen so that the intervals [a,x2] and [x1,b] have the same size. We introduce a new variable τ that measures the relative common length of the intervals [a,x2] and [x1,b]:
x2 − a = τ(b−a) and b − x1 = τ(b−a), 0 < τ < 1,
so that at each iteration the interval containing x* is reduced by the factor τ. Recall that for the bisection method the factor is 1/2; here we will find a somewhat optimal value for τ which keeps the computation to a minimum and assures a symmetric dichotomy between x1 and x2 at each step. First we rewrite the above equalities as
x1 = a + (1−τ)(b−a) and x2 = a + τ(b−a).   (7.5)
Note that the second formula in (7.5) follows directly from the previous equations, while the first one uses some intermediate steps: x1 = b − τ(b−a) = a + (b−a) − τ(b−a) = a + (1−τ)(b−a).
Now, assume that x1, x2 are chosen according to (7.5) and that, by the test in (7.4), x* ∈ [a,x2]. Let's denote this interval by [a1,b1] (with a1 = a, b1 = x2). 165
167 Recall that x1 = a + (1−τ)(b−a) and accordingly choose
x3 = a1 + (1−τ)(b1−a1).
Assume that in addition we have
x1 = a1 + τ(b1−a1),
so that the new points x1' and x2' are such that
x1' = x3 = a1 + (1−τ)(b1−a1), x2' = x1 = a1 + τ(b1−a1).
See the graphic below.
(figure: the points a ≤ x3 < x1 < x2 ≤ b and the new interval [a1,b1] = [a,x2] with interior points x1' = x3 and x2' = x1)
The fundamental question is whether such a constant τ exists. It should then satisfy
x1 = a1 + τ(b1−a1) and x1 = a + (1−τ)(b−a).
Equating the right hand sides of the two equations and simplifying, using the fact that a1 = a and (b1−a1) = (x2−a) = τ(b−a), we get
τ² = 1−τ, or τ² + τ − 1 = 0,
which is a quadratic equation whose solutions are
τ± = (−1 ± √5)/2.
Discarding the negative solution yields τ = (√5−1)/2, known as the golden ratio. Note that x3 < x1 < x2 follows from the fact that τ = (√5−1)/2 > 1−τ = (3−√5)/2. The case when x* ∈ [x1,b], with x2' = x4 = a1 + τ(b1−a1), a1 = x1, b1 = b and x1' = x2 = a1 + (1−τ)(b1−a1), yields the same conclusion: τ = (√5−1)/2. Therefore we have the following algorithm.
Algorithm: Golden section search method
Assume f is a unimodal function on [a,b]. Let τ = (√5−1)/2 be the golden ratio. Let Tol be a fixed tolerance. 166
168 1. Set x 1 = a+(1 τ)(b a), x 2 = a+τ(b a) 2. Let f 1 = f(x 1 ),f 2 = f(x 2 ) 3. While ((b a) > Tol) 4. If (f 1 > f 2 ) Set a = x 1,x 1 = x 2,f 1 = f 2,x 2 = a+τ(b a), f 2 = f(x 2 ) Else Set b = x 2,x 2 = x 1,f 2 = f 1,x 1 = a+(1 τ)(b a), f 1 = f(x 1 ) 5. End if 6. End while. Example Write a matlab code using the golden section method to minimize f(x) = 0.5 xe x2 on [0,2]. Use a tolerance tol = The matlab code below was saved as an M-file goldensearch.m then called in the matlab command window. %%%%goldensearch.m a=0;b=2; tol=1e-3; tau=(sqrt(5)-1)/2; x1=a + (1-tau)*(b-a); x2=a + tau*(b-a); f1=0.5- x1*exp(-x1^2); f2=0.5- x2*exp(-x2^2); aseq=[a];bseq=[b]; f1seq=f1;f2seq=f2; while((b-a)>tol) if(f1>f2) a=x1; x1=x2; x2=a + tau*(b-a); f1=f2; f2=0.5- x2*exp(-x2^2); else b=x2; x2=x1; f2=f1; x1=a + (1-tau)*(b-a); 167
169 f1=0.5- x1*exp(-x1^2);
end
aseq=[aseq;a]; bseq=[bseq;b];
f1seq=[f1seq;f1]; f2seq=[f2seq;f2];
end
format short e
display(['     a        f(x1)        b        f(x2)'])
display([aseq f1seq bseq f2seq]);
Command window.
>> goldensearch
a f(x1) b f(x2)
Note that after about 12 to 15 iterations the f values at a and b become almost the same and change very little afterwards, which suggests a minimum value at the point x* of approximately f(x*) ≈ 0.0711. The exact answer is x* = 1/√2, f(1/√2) = 0.5 − e^{−1/2}/√2 ≈ 0.0711.
Exercise: Use Newton's method for the example above and compare the results and the method performance with the golden section search results.
Answer: We have f'(x) = (2x²−1)exp(−x²) and f''(x) = (6x−4x³)exp(−x²). This yields the iterative process 168
170 x_{n+1} = x_n − (2x_n²−1)/(6x_n−4x_n³);
with x0 fixed to x0 = 1 we obtain the sequence below.
%Newton's method: newton.m
a=0; b=2;
x0=(a+b)/2;
tol=1e-3;
error=1;
xn=[x0];
while(abs(error)>tol)
  x1=x0 - (2*x0^2-1)/(6*x0-4*x0^3);
  xn=[xn;x1];
  error=abs(x0-x1);
  x0=x1;
end
display([xn (0.5-xn.*exp(-xn.^2))])
>> newton
ans =
  1.0000e+00  1.3212e-01
  5.0000e-01  1.1060e-01
  7.0000e-01  7.1162e-02
  7.0707e-01  7.1118e-02
  7.0711e-01  7.1118e-02
As expected, Newton's method converges much faster than the golden section search method. However, we need to keep in mind that Newton's method works only when the function f is smooth, since it requires the first and second order derivatives, while no derivative calculation is required for the golden section search method, which can be used in cases where the derivatives do not exist or cannot be computed explicitly. For practical problems, when we don't have information about the unimodality of f, we can still use the golden section search method, though with several different starting points, to find all of its local minima and eventually compare their values. 169
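The golden section loop above is language-agnostic; here is a compact Python version of the same algorithm (an illustrative sketch, applied to the same f(x) = 0.5 − x e^{−x²} on [0,2]):

```python
import math

def golden_section(f, a, b, tol=1e-6):
    tau = (math.sqrt(5.0) - 1.0) / 2.0         # golden ratio, ~0.618
    x1, x2 = a + (1 - tau) * (b - a), a + tau * (b - a)
    f1, f2 = f(x1), f(x2)
    while b - a > tol:
        if f1 > f2:                            # minimum lies in [x1, b]
            a, x1, f1 = x1, x2, f2
            x2 = a + tau * (b - a)
            f2 = f(x2)
        else:                                  # minimum lies in [a, x2]
            b, x2, f2 = x2, x1, f1
            x1 = a + (1 - tau) * (b - a)
            f1 = f(x1)
    return 0.5 * (a + b)

f = lambda x: 0.5 - x * math.exp(-x * x)
xstar = golden_section(f, 0.0, 2.0)
print(round(xstar, 6))  # ~0.707107 = 1/sqrt(2)
```

Only one new function evaluation is needed per iteration, which is exactly the saving the golden ratio was chosen to achieve.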
171 7.4 Multivariable unconstrained optimization
Introduction to convex optimization
Let's start with a few definitions and results from convex optimization, as a generalization of the notion of unimodal functions.
Definition 17 A subset S of R^n is said to be convex if it contains all straight segments of R^n that start and end within S, i.e.,
x, y ∈ S ⟹ αx + (1−α)y ∈ S, ∀α ∈ [0,1].
The three domains of R² below are examples of convex, non-convex, and convex domains, respectively.
(figure: three planar domains, respectively convex, non-convex, and convex)
Definition 18 A function F(x) from R^n to R is said to be convex if
F(αx + (1−α)y) ≤ αF(x) + (1−α)F(y), ∀α ∈ [0,1].
If the above inequality is strict then F is said to be strictly convex.
Here are a few results about convex functions.
If F is convex on an open convex set S of R^n, then F is continuous in S.
The sub-level sets of a convex function are convex, i.e., {x ∈ S : F(x) ≤ c} is convex. 170
172 (figure: level curves of a convex function; the sub-level set {F ≤ c} enclosed by the level curve F = c is convex)
If F is convex on an open convex set S ⊂ R^n, then the epigraph
Epi(F) = {(x,y) ∈ S × R : y ≥ F(x)}
is convex.
(figure: the epigraph of a convex function is the volume enclosed above its graph; the epigraph of a convex function is convex) 171
173 If x* is a local minimum of a convex function, then x* is a global minimum, i.e.,
F(x*) ≤ F(x) ∀x with ‖x − x*‖ < δ ⟹ F(x*) ≤ F(x) ∀x ∈ S.
If F(x) is strictly convex and x* is a local minimum of F, then x* is the unique (global) minimum of F.
If S is a closed, bounded, and convex set of R^n and F is a strictly convex function on S, then F has a global minimum in S. If x* is in the interior of S then x* is unique. (Recall that a set S is said to be closed if it contains its boundary.)
Note that the existence and uniqueness result for minima of convex functions, from the last bullet, is restricted to bounded domains. For unbounded domains we need an extra property of the function F, namely the notion of coercivity, which is given next.
Definition 19 A function F defined on an unbounded set of R^n is said to be coercive if
lim_{‖x‖→+∞} F(x) = +∞,
where ‖·‖ is a norm on R^n; the Euclidean norm, for example.
Examples of coercive and non-coercive functions
a) F(x) = x² is coercive on R.
b) F(x,y) = x² + y² − 2xy = (x−y)² is not coercive on R², since it vanishes along the whole line y = x.
c) F(x) = x³ is not coercive on R (it tends to −∞ as x → −∞).
d) F(x,y) = x² − y² is not coercive on R² but it is coercive on the strip S = {(x,y), −1 ≤ y ≤ 1}.
e) The linear function F(x,y) = ax + by + c is not coercive on R².
f) F(x,y) = e^{−x²−y²} is not coercive on any unbounded domain of R².
Main Result for unbounded domains: If F is strictly convex and coercive on a convex and unbounded set S, then F has a unique global minimum in S.
Case of smooth functions: When F is differentiable, i.e., its gradient vector ∇F(x) is defined, we have the following results.
i) A differentiable function F is convex if it satisfies
F(x) ≥ F(x0) + ∇F(x0)·(x − x0), ∀x, x0 ∈ R^n. 172
174 ii) If F is differentiable and convex, then x* is a minimum ⟺ ∇F(x*) = 0.
iii) A twice differentiable function F is strictly convex/convex if its Hessian matrix HF(x) is positive definite/positive semi-definite. The Hessian matrix is in essence the second derivative of the multivariable function, the n-by-n symmetric matrix whose (i,j) entry is ∂²F(x)/∂x_i∂x_j:
HF(x) = [ ∂²F/∂x1²    ∂²F/∂x1∂x2  …  ∂²F/∂x1∂xn
          ∂²F/∂x2∂x1  ∂²F/∂x2²    …  ∂²F/∂x2∂xn
             ⋮            ⋮               ⋮
          ∂²F/∂xn∂x1  ∂²F/∂xn∂x2  …  ∂²F/∂xn²  ],
which is in fact the matrix-derivative of the gradient of F, i.e., HF(x) = [D∇F(x)].
Some remarks on the results i), ii), iii):
The "if" part of the equivalence statement in ii) follows from i) when applied to x0 = x*: if ∇F(x*) = 0, then
F(x) ≥ F(x*) + ∇F(x*)·(x − x*) = F(x*).
The result i) is an expression of the fact that the graph of a convex function lies above its tangent hyperplane (tangent line in one dimension) at every point x0, i.e., the convexity of the epigraph.
The statement iii) is a direct consequence of the Taylor expansion
F(x) = F(x0) + ∇F(x0)·(x − x0) + (1/2)(x − x0)^T HF(x0)(x − x0) + …
combined with i). It is the analogue of the fact that, in one dimension, a function f(x) is convex if its second derivative is non-negative everywhere, f''(x) ≥ 0.
Note that, since F is concave if −F is convex, it follows that F is concave if its Hessian is negative definite.
In general we have the following results for a function F defined on R^n. Let x* be a critical point, i.e., ∇F(x*) = 0. Then:
x* is a local minimum if HF(x*) is positive definite;
x* is a local maximum if HF(x*) is negative definite;
x* is a saddle point if HF(x*) is non-singular and neither positive definite nor negative definite; 173
175 Inconclusive if HF(x*) is singular.
Also, we know that a symmetric matrix is positive definite if its leading minor determinants are all positive. For two-variable functions, the Hessian of F is given by
HF(x*) = [ A  B
           B  C ],
where
A = ∂²F(x*)/∂x1², B = ∂²F(x*)/∂x1∂x2, C = ∂²F(x*)/∂x2²,
and positive definiteness is equivalent to
A > 0 and AC − B² > 0,
which are the two leading minor determinants. Therefore we obtain the second derivative test of second year calculus:
x* is a local minimum if AC − B² > 0 and A > 0;
x* is a local maximum if AC − B² > 0 and A < 0;
x* is a saddle point if AC − B² < 0;
Inconclusive if AC − B² = 0, i.e., if the Hessian is singular.
Now we are ready to formulate the method of steepest descent for multivariable functions, which is the analogue of the golden section search method in the sense that it requires minimal regularity (smoothness) of F, unlike Newton's method.
The method of steepest descent
Let F(x1,x2,…,xn) be a function defined on R^n and denote by x = (x1,x2,…,xn) a generic point of R^n. Consider the minimization problem
min_{x∈R^n} F(x).
To derive an algorithm to solve this optimization problem, we begin by picking a starting point x0 ∈ R^n and a search direction s0 ∈ R^n, ‖s0‖ = 1 a unit vector (‖·‖ is the Euclidean norm), and consider the one-variable function
h(α) = F(x0 + αs0), α ∈ R.
Now, find α0 such that h(α0) = min_{α∈R} h(α). Assuming that F is differentiable, by the chain rule we have
h'(α) = ∇F(x0 + αs0)·s0. 174
176 Note that h(0) = F(x0). To build an iterative process x1 = x0 + αs0 that will eventually converge to the minimizer of F, we require that h'(0) < 0, so that F(x0 + αs0) < F(x0) for small α > 0, i.e., when moving in the direction of s0. This also guarantees that the minimum α0 of h is reached to the right of the origin α = 0. Such an s0 is called a descent direction. We have
h'(0) = ∇F(x0)·s0 = ‖∇F(x0)‖ cos(θ),
where θ is the angle between ∇F(x0) and the unit vector s0; therefore the best choice is
s0 = −∇F(x0)/‖∇F(x0)‖,
the unit vector in the direction directly opposite to ∇F(x0) (which yields θ = π and cos(θ) = −1), which maximizes the rate of descent: h'(0) = −‖∇F(x0)‖. Recall that ‖∇F(x0)‖ gives the maximum rate of increase of F at x0, attained in the direction of ∇F(x0), while the maximum rate of decrease is −‖∇F(x0)‖, attained in the direction directly opposite to ∇F(x0).
Thus the method of steepest descent is obtained by fixing
s0 = −∇F(x0)/‖∇F(x0)‖.
Let α0 be the minimum of h(α), h(α0) = min_{α>0} h(α), and set
x1 = x0 + α0 s0.
If we now let s1 = −∇F(x1)/‖∇F(x1)‖ and α1 be the minimizer of h(α) = F(x1 + αs1), then we obtain a second iterate x2 = x1 + α1 s1, etc. Thus we have the algorithm:
Method of steepest descent:
1. Let x0 be an initial guess
2. For k = 0,1,2,…
Set s_k = −∇F(x_k)/‖∇F(x_k)‖
Set h(α) = F(x_k + αs_k) and find α_k such that h(α_k) = min_{α>0} h(α) 175
177 Set x_{k+1} = x_k + α_k s_k
3. Convergence test: if ‖x_k − x_{k+1}‖ = α_k < ǫ, stop
4. End for loop
Example: Use the steepest descent method to minimize F(x,y) = (1/2)x² + (5/2)y². Clearly this is a convex and coercive function on R² that has a unique minimum, reached at (0,0). This is then a good simple example to test the steepest descent method we just derived. We have
∇F(x,y) = (x, 5y)^T.
Let (x0,y0) = (5,1) be our starting point. Then
∇F(x0,y0) = (5, 5)^T and ‖∇F(x0,y0)‖ = √(25+25) = 5√2,
so that
s0 = −∇F(x0,y0)/‖∇F(x0,y0)‖ = (−1/√2, −1/√2)^T
and
h(α) = (1/2)(5 − α/√2)² + (5/2)(1 − α/√2)²,
h'(α) = −(1/√2)(5 − α/√2) − (5/√2)(1 − α/√2) = −(1/√2)(10 − 6α/√2) = 0 ⟹ α = 10√2/6,
i.e., α0 = 5√2/3 and (x1,y1) = (5 − 5/3, 1 − 5/3) = (10/3, −2/3), and so on. The first few iterates are given in the table below.
k    x_k      y_k      F(x_k,y_k)   ∇F(x_k,y_k)^T
0    5.000    1.000    15.000       (5.000, 5.000)
1    3.333   −0.667     6.667       (3.333, −3.333)
2    2.222    0.444     2.963       (2.222, 2.222)
3    1.481   −0.296     1.317       (1.481, −1.481)
⋮
9    0.130   −0.026     0.010       (0.130, −0.130) 176
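For this quadratic objective the exact line search has a closed form, so the whole iteration can be checked with a few lines of code. The sketch below is an illustrative Python port; for F(x) = ½xᵀAx one finds, by setting h'(α) = 0 with g = ∇F(x) = Ax and s = −g/‖g‖, the step α = ‖g‖³/(gᵀAg):

```python
def steepest_descent(x0, y0, iters=9):
    # F(x,y) = x^2/2 + 5y^2/2, grad F = (x, 5y), A = diag(1, 5).
    x, y = x0, y0
    for _ in range(iters):
        gx, gy = x, 5.0 * y                  # gradient
        g2 = gx * gx + gy * gy               # |g|^2
        gAg = gx * gx + 5.0 * gy * gy        # g^T A g
        alpha = g2 ** 1.5 / gAg              # exact line-search step
        gnorm = g2 ** 0.5
        x, y = x - alpha * gx / gnorm, y - alpha * gy / gnorm
    return x, y

x9, y9 = steepest_descent(5.0, 1.0)
print(round(x9, 3), round(y9, 3))  # ~0.130 and ~-0.026
```

Each iteration shrinks both coordinates by the factor 2/3 while flipping the sign of y, i.e. x_k = 5(2/3)^k and y_k = (−1)^k (2/3)^k, which is the zigzag discussed next.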
178 Note that after nine iterations the iterate (x9 = 0.130, y9 = −0.026), with F(x9,y9) = 0.010, provides only a crude estimate of the minimizer and of the minimum value, achieved at (0,0). We see that the steepest descent method seems to converge, but very slowly in fact. The level curves of the function F(x,y), with a few successive iterates (x_k,y_k) laid on top (small circles), are graphed below. Note the zigzag-like behaviour of the steepest descent method, due to the fact that at any given point the gradient of F, which is always perpendicular to the level curves, is not necessarily directed towards the minimum point at the origin.
(figure: level curves of F in the xy-plane with the zigzagging steepest descent iterates)
Comparing with Newton's method
Using Newton's method instead, for the example above, yields
(x_{k+1}, y_{k+1}) = (x_k, y_k) − [HF(x_k,y_k)]^{−1} ∇F(x_k,y_k),
HF(x,y) = [ 1  0
            0  5 ].
Starting with x0 = 5, y0 = 1 as above, we have
HF(x0,y0) X = −∇F(x0,y0), i.e.,
[ 1  0 ] (x)   (−5)
[ 0  5 ] (y) = (−5)
⟹ x = −5, y = −1 177
179 ⟹ x1 = x0 + x = 5 − 5 = 0, y1 = y0 + y = 1 − 1 = 0, or (x1,y1) = (0,0), which is the exact solution! Newton's method converged in just one iteration to the exact solution. Is this expected? The answer is yes. This is in fact expected because, by design, Newton's method for f(x) = 0 is based on the first order Taylor approximation (the next iterate is the zero of the first order Taylor approximation), but here
f(x) = ∇F(x,y) = (x, 5y)^T
is a linear function. Thus, its first order Taylor approximation is itself. Therefore, we conclude that Newton's method for optimization problems relies in fact on the exact minimization of a quadratic polynomial approximation. Therefore, when the objective function is a quadratic function (such as F(x,y) = x²/2 + 5y²/2 of our example), no matter what the initial guess is, Newton's method will converge to the exact solution in just one iteration. However, we have to keep in mind that, unlike Newton's method, the steepest descent method does not require the computation of the Hessian matrix (2nd order derivative) of F; therefore it could in principle be applied to any convex function, regardless of whether it is smooth or not.
7.5 Constrained optimization
Equality constraints and Lagrange multipliers method
Here we look at the case of constrained optimization. First, we consider the case of equality constraints and revisit the method of Lagrange multipliers:
{ min_{x∈R^n} F(x)   (7.6)
{ G_i(x) = 0, i = 1,2,…,m.
F(x) is the objective or cost function and G1, G2, …, Gm are the constraint functions. We have m constraints in total. Typically m ≤ n but it can be larger. We introduce the function
L(x, λ1, λ2, …, λm) = F(x) + λ1 G1(x) + λ2 G2(x) + … + λm Gm(x)
from R^n × R^m to R. The new variables λ1, λ2, …, λm are known as the Lagrange multipliers and L is called the Lagrange function.
The theory of Lagrange multipliers states that the minimization of the constrained problem (7.6) amounts to finding a critical or stationary point of the Lagrange function L with respect to the variables x, λ1, λ2, …, λm on the extended space R^n × R^m:
∇L(x, λ1, λ2, …, λm) ≡ [∂L/∂x1, ∂L/∂x2, …, ∂L/∂xn, ∂L/∂λ1, ∂L/∂λ2, …, ∂L/∂λm]^T = 0.
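As a concrete illustration (a hypothetical example, not taken from the notes): minimize F(x,y) = x² + y² subject to G(x,y) = x + y − 1 = 0. The Lagrange function is L = x² + y² + λ(x + y − 1), and the stationarity condition ∇L = 0 is the linear system 2x + λ = 0, 2y + λ = 0, x + y = 1, which a small Gaussian-elimination solve handles directly:

```python
def solve(A, b):
    # Gaussian elimination with partial pivoting (small dense systems).
    n = len(b)
    A = [row[:] for row in A]
    b = b[:]
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(A[i][k]))
        A[k], A[p] = A[p], A[k]
        b[k], b[p] = b[p], b[k]
        for i in range(k + 1, n):
            m = A[i][k] / A[k][k]
            for j in range(k, n):
                A[i][j] -= m * A[k][j]
            b[i] -= m * b[k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (b[i] - sum(A[i][j] * x[j] for j in range(i + 1, n))) / A[i][i]
    return x

# grad L = (2x + lam, 2y + lam, x + y - 1) = 0 as a linear system:
x, y, lam = solve([[2.0, 0.0, 1.0],
                   [0.0, 2.0, 1.0],
                   [1.0, 1.0, 0.0]],
                  [0.0, 0.0, 1.0])
print(x, y, lam)  # 0.5 0.5 -1.0
```

The stationary point (1/2, 1/2) with multiplier λ = −1 is the constrained minimizer of F. Because L is quadratic in this example, this 3-by-3 system is exactly what a single Newton step below solves.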
Let us apply Newton's method to this problem. If we introduce the vector notations

G(x) = (G_1(x), G_2(x), …, G_m(x))^T  and  Λ = (λ_1, λ_2, …, λ_m)^T,

then the gradient of L can be written in the reduced and compact form

∇L(x, Λ) = [ ∇F(x) + [DG(x)]^T Λ ; G(x) ],

where

DG(x) = [ ∂G_1/∂x_1 ∂G_1/∂x_2 … ∂G_1/∂x_n ; ∂G_2/∂x_1 ∂G_2/∂x_2 … ∂G_2/∂x_n ; … ; ∂G_m/∂x_1 ∂G_m/∂x_2 … ∂G_m/∂x_n ]

is the Jacobian matrix-derivative of the vector function G. Note that DG(x) is an m×n matrix, and the quantity [DG(x)]^T Λ is understood as a matrix-vector multiplication from R^m to R^n, i.e. an n×m matrix multiplying a vector of R^m. The Hessian of L is given by

HL(x, Λ) = [ B(x,Λ) [DG(x)]^T ; DG(x) O ],

where B(x,Λ) is the n×n matrix given by

B(x,Λ) = HF(x) + Σ_{i=1}^m λ_i HG_i(x),

and O is the m×m zero matrix, i.e. filled with zeros. Therefore, Newton's method applied to ∇L(x,Λ) = 0 can be summarized as follows.

1. Given an initial guess (x_0, Λ_0) ∈ R^{n+m}.

2. For k = 0, 1, 2, …, solve the linear system

[ B_k [D_k]^T ; D_k O ] [ X_k ; Δ_k ] = −[ W_k ; G_k ],  i.e.  B_k X_k + [D_k]^T Δ_k = −W_k  and  D_k X_k = −G_k,

where

G_k = G(x_k), W_k = ∇F(x_k) + [DG(x_k)]^T Λ_k, B_k = B(x_k, Λ_k), D_k = DG(x_k).

3. Update:

x_{k+1} = x_k + X_k,  Λ_{k+1} = Λ_k + Δ_k.
4. If ||X_k|| + ||Δ_k|| < ε, then stop.

We clearly see from this discussion that the major inconvenience of the Lagrange multipliers method is that it introduces many new variables. For high-dimensional problems with many constraints this may become overwhelming. The penalty method introduced below addresses this issue.

Penalty method

The penalty method is a variant of the Lagrange multipliers method in the sense that it also modifies the problem to deal with the various constraints, but instead of introducing m new variables, as many as the number of constraints, it uses only one extra quantity, which is in fact a parameter rather than a variable. Consider the minimization problem with m constraints as in (7.6):

min_{x∈R^n} F(x)  subject to  G_i(x) = 0, i = 1, 2, …, m.   (7.7)

Let ρ > 0 be a real parameter and introduce the penalty function

Φ_ρ(x) = F(x) + (1/2) ρ Σ_{i=1}^m (G_i(x))² = F(x) + (1/2) ρ ||G(x)||²,   (7.8)

and let x*_ρ be the minimizer of the so-called penalty problem

min_{x∈R^n} Φ_ρ(x),   (7.9)

which we assume to exist for every fixed ρ > 0. Then it can be shown that, under suitable conditions, the limit of x*_ρ as ρ → +∞ exists and satisfies

lim_{ρ→+∞} x*_ρ = x*,

where x* is the minimizer of (7.7). A purely intuitive explanation of why this method works is that when ρ is very large, the minimum of Φ_ρ must be attained at a point where ||G|| is small, to compensate for the large value of ρ (practically zero if ρ = +∞), i.e. at a point where all the constraints G_i = 0 are satisfied. Therefore, this provides an attractive numerical procedure that can be used to approximate the solution of the minimization problem (7.7): it suffices to find the minimum of the penalty problem (7.9) for a large enough ρ and set x* ≈ x*_ρ. This method is conceptually simple, but it has one serious drawback: when ρ is very large (which may be required in some situations for better accuracy), the Hessian matrix HΦ_ρ(x) can become very ill-conditioned and Newton's method cannot be used effectively. To overcome this
inconvenience, the penalty method is implemented iteratively: starting with a moderately small value, ρ is increased gradually, and at each step the solution of the previous step is used as an initial guess. This is summarized next.

Algorithm: penalty method with a gradually increasing value of ρ.

Given an initial guess x_0, pick a moderately small value ρ_0 > 0 and solve the minimization problem

min_{x∈R^n} F(x) + (1/2) ρ_0 ||G(x)||².

Let x*_{ρ_0} be the solution to this problem.

Set x_0 = x*_{ρ_0} as a new initial guess, choose a new ρ-value ρ_1 > ρ_0 (e.g. ρ_1 = 10 ρ_0), and solve the problem

min_{x∈R^n} F(x) + (1/2) ρ_1 ||G(x)||².

Let x*_{ρ_1} be the solution to this problem.

Set x_0 = x*_{ρ_1}, choose ρ_2 > ρ_1, and continue until convergence is reached. A rough way to estimate the convergence of the sequence x*_{ρ_0}, x*_{ρ_1}, x*_{ρ_2}, … is to test whether ||x*_{ρ_{k+1}} − x*_{ρ_k}|| < ε.

Using the fsolve function of Matlab for optimization problems

Important remark about fsolve: fsolve is a Matlab function which is normally used to solve systems of nonlinear equations f(x) = 0. But when the component functions of f are all non-negative, it provides the point where f is as close as possible to zero, i.e. the minimum. Likewise, it will provide the maximum if the components of f are all non-positive. You cannot apply fsolve to problems for which the objective function changes sign. In such a case we can always find a lower bound c ≤ F(x) and minimize G(x) = F(x) − c instead.

Example: use the penalty method to solve

min (1/2) x² + (5/2) y²  subject to  y − x + 1 = 0.

Let Φ_ρ(x,y) = (1/2) x² + (5/2) y² + (1/2) ρ (y − x + 1)². For a given value of ρ we use the fsolve function of Matlab to find the minimum (x*_ρ, y*_ρ). We define an M-file for the function Φ_ρ, then find its minimum using fsolve for increasing values of ρ.
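For readers following along without Matlab, the same continuation-in-ρ loop can be sketched in Python/NumPy (a hypothetical sketch, not the code of the notes). It exploits the fact that for this particular example Φ_ρ is quadratic, so each inner minimization ∇Φ_ρ = 0 reduces to a 2×2 linear solve; for a general objective one would call an unconstrained minimizer at each step instead:

```python
import numpy as np

def penalty_minimizer(rho):
    # For Phi_rho(x, y) = x^2/2 + 5y^2/2 + (rho/2)(y - x + 1)^2 the gradient
    # is linear, so grad Phi_rho = 0 is the 2x2 linear system below:
    #   (1 + rho) x - rho y = rho
    #   -rho x + (5 + rho) y = -rho
    A = np.array([[1.0 + rho, -rho], [-rho, 5.0 + rho]])
    b = np.array([rho, -rho])
    return np.linalg.solve(A, b)

sol = None
for rho in [1.0, 10.0, 100.0, 1000.0, 10000.0]:   # gradually increase rho
    sol = penalty_minimizer(rho)
    print(rho, sol)
# sol approaches the exact constrained minimizer (5/6, -1/6) as rho grows
```

Because each inner solve here is exact, no ill-conditioning shows up; in the general nonlinear case the gradual increase of ρ, with warm starts, is what keeps the inner iterations well behaved.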
183 %%phirho.m function z = phirho(x) rho = 1; z = x(1).^2/2 + 5*x(2).^2/2 +0.5*rho*(x(2)-x(1)+1).^2; >>[xs,fun]=fsolve(@phirho,[5 1]) xs = fun = We fix ρ = 1 (inside the M-file) and call fsolve with an initial guess x 0 = 5,y 0 = 1. The outputs are the minimizer point x = ,y = and the value of the Φ ρ at this minimum point, fun= Then we set ρ = 10 and call fsolve again but with x 0 = ,y 0 = This process is repeated with ρ increased to 100, 1000, etc. and the results are given in the table below ρ x ρ yρ Φ ρ (x ρ,yρ) Note a nice converging pattern in both the (x ρ,y ρ) and Φ ρ values. Remark: When fsolve is called with the initial guess x 0 = 5,y 0 = 1 and a large value of ρ we get very poor results, as it is demonstrated by the numerical examples given below. With ρ = 1000, we have >>[xs,fun]=fsolve(@phirho,[5 1]) xs = fun =
and with ρ = 10000, we have

>>[xs,fun]=fsolve(@phirho,[5 1])
xs =
fun =

This clearly shows a divergence pattern, likely due to the ill-conditioning anticipated above.

Lagrange multipliers

Using Lagrange multipliers, the exact solution to this optimization problem can be calculated. We have

L(x,y,λ) = x²/2 + 5y²/2 + λ(y − x + 1),

and setting ∇L(x,y,λ) = 0 yields the system

∂L/∂x = x − λ = 0,  ∂L/∂y = 5y + λ = 0,  ∂L/∂λ = y − x + 1 = 0,

so that x = λ and y = −λ/5. Substituting into the constraint gives −λ/5 − λ + 1 = 0, i.e. λ = 5/6, and the solution

x* = 5/6 ≈ 0.8333,  y* = −1/6 ≈ −0.1667,  F(x*, y*) = 5/12 ≈ 0.4167,

which is clearly the solution toward which the penalty method, with gradually increasing ρ, seems to converge!

Unconstrained optimization in Matlab: fminunc

As stressed above, fsolve is not the most suitable function to use for optimization problems. Two functions designed for optimization are implemented in Matlab and can be used directly, one for unconstrained problems and one for constrained problems. They are named fminunc and fmincon, respectively.

Usage:
185 fminunc can be used in two different ways, either in the same way as fsolve, i.e, we just need to provide the objective function and an initial guess or the objective function and its gradient together with an initial guess. Both ways are illustrated below for the example above. a. fminunc without a gradient: %%phirho.m function z = phirho(x) rho = 1000; z = x(1).^2/2 + 5*x(2).^2/2 +0.5*rho*(x(2)-x(1)+1).^2; >>[xs,fun]=fminunc(@phirho,[5 1]) xs = fun = Note that the correct solution is obtained for the penalized example used above with ρ = 1000 even with an initial guess of (5,1) unlike the fsolve function. b. fminunc with a gradient. The gradient is supplied in the same M-file as the objective function. It is important to note that before calling fminunc the option GradObj need to be set ON, using the optimset command. Also note that the parameter ρ is not fixed inside the M-file as in the previous examples but instead it is passed as a variable in the function phirho_grad(x,rho). It s value is fixed in the matlab command window, before calling fminunc. %%%phirho_grad.m function [z,nablaz] = phirho_grad(x,rho) z = x(1).^2/2 + 5*x(2).^2/2 +0.5*rho*(x(2)-x(1)+1).^2; %function nablaz = [ x(1) - (x(2)-x(1)+1)*rho 5*x(2) + (x(2)-x(1)+1)*rho]; %gradient >>options = optimset( GradObj, on ); % indicates gradient is provided >> rho=1000; %sets the desired value or rho >> [xs,fun]=fminunc(@(x) phirho_grad(x,rho),[5; 1],options) Optimization terminated: first-order optimality less than OPTIONS.TolFun, 184
and no negative/zero curvature detected in trust region model.
xs =
fun =

Constrained optimization in Matlab: fmincon

fmincon is much more involved and can be used with all kinds of constraints: linear, nonlinear, equalities, inequalities, or combinations of those. To learn more about it, I encourage you to use the help command in Matlab. Here we illustrate its usage for the equality-constrained problem of the example above, namely,

min_{(x,y)∈R²} (1/2) x² + (5/2) y²  subject to  y − x + 1 = 0.

First we create two M-files where the cost function and the constraint are stored, separately, then we run fmincon with a couple of options turned off.

%%%myfun.m : M-file for my objective function
function z = myfun(x)
z = x(1).^2/2 + 5*x(2).^2/2;

%%%mycon.m : M-file for the constraint
function [c,ceq] = mycon(x)
ceq = x(2) - x(1) + 1; %the equality constraint
c = []; % means no inequality constraint

>>[xs,fun] = fmincon(@(x) myfun(x),[5;1],[],[],[],[],[],[],@(x) mycon(x))
Optimization terminated: first-order optimality measure less than options.TolFun and maximum constraint violation is less than options.TolCon.
xs =
fun =

The sequence of [] in the arguments of fmincon indicates that the corresponding constraints are not present in our problem. The first two correspond to a linear inequality constraint A x ≤ b, the third and fourth to a linear equality constraint B x = d, and the fifth and sixth to lower and upper bounds XL ≤ x ≤ XU. If all these constraints were present, as in

min F(x)  subject to  A x ≤ b,  B x = d,  XL ≤ x ≤ XU,  G(x) ≤ 0,  H(x) = 0,

we would use

>>[xs,fun] = fmincon(@(x) myfun(x),[5;1],A,b,B,d,XL,XU,@(x) mycon(x))

instead. (Note that A and B are matrices while b, d, XL, and XU are vectors.) Any missing constraint is replaced by the symbol []. Recall that [] symbolizes an empty matrix in Matlab.

Remark: Note that the nonlinear constraints need to be implemented in an M-file, just like in the example above, while the linear ones are entered directly on the command line. In principle, the linear constraints can also be coded inside an M-file just like the nonlinear ones; in fact, in our example the constraint is linear, but we chose to treat it as a nonlinear one. Matlab makes this distinction between linear and nonlinear constraints for a good reason: linear constraints are treated differently, using what is known as linear programming, which leads to very efficient algorithms.

Inequality constraints and the barrier function method

Consider the minimization problem with inequality constraints

min_{x∈R^n} F(x)  subject to  G_i(x) ≤ 0, i = 1, 2, …, m,   (7.10)

where F and G_i, i = 1, 2, …, m, are functions from R^n to R. The barrier function method consists in introducing an auxiliary function Φ_μ(x), depending on some parameter μ, whose minimum over the whole space R^n (i.e. without constraints) converges to
the minimum of the original problem when μ → 0. Thus, it is somewhat similar to the penalty method. Two types of barrier functions are introduced here, but many others can be constructed as well: the inverse barrier function, given by

Φ_μ(x) = F(x) − μ Σ_{i=1}^m 1/G_i(x),

and the logarithmic barrier function,

Φ_μ(x) = F(x) − μ Σ_{i=1}^m log(−G_i(x)).

We start with an initial guess x_0 in the interior of the feasible domain, i.e. such that G_i(x_0) < 0, i = 1, 2, …, m. Let x*_μ be the solution of the unconstrained problem

min_{x∈R^n} Φ_μ(x).

The claim is that under some suitable conditions

lim_{μ→0} x*_μ = x*,

where x* is the minimizer of the original constrained problem (7.10).

Remarks: To see intuitively why this claim is true, assume that the minimum x*_μ is obtained through an iterative search scheme of some sort. If at some iteration k the iterate x_k approaches the boundary of the feasible domain, i.e. G_i(x_k) → 0 for some i, then necessarily Φ_μ(x_k) → +∞, which would in principle inhibit the minimum-searching algorithm from crossing G_i(x) = 0. Therefore the minimizer of the function Φ_μ(x) is necessarily reached within the region where G_i(x) ≤ 0 for all i = 1, 2, …, m. The role of the μ-part of the barrier function Φ_μ during the minimization process is to inhibit the iterates from crossing G_i(x) = 0, thus the name barrier function. Also, as μ gets smaller and smaller, the weight of the G_i's in the barrier function Φ_μ becomes weaker and weaker, except when the iterate gets too close to the boundary G_i = 0. Therefore, in the limit μ → 0, minimizing the barrier function is equivalent to solving the constrained problem (7.10).

7.6 Problems

1. Determine whether each of the following functions is coercive on R².
(a) f(x,y) = x + y + 2, (b) f(x,y) = x² + y² + 2, (c) f(x,y) = x² − 2xy + y², (d) f(x,y) = x⁴ − 2xy + y⁴.

2. Determine whether each of the following functions is convex, strictly convex, or non-convex: (a) f(x) = x², (b) f(x) = x³, (c) f(x) = e^x, (d) f(x) = |x|.

3. Let f(x) be a smooth function of x ∈ R^n (at least twice differentiable). A critical or stationary point of f(x) is by definition a point where ∇f(x) = 0, where ∇f(x) = (∂f/∂x_1, ∂f/∂x_2, …, ∂f/∂x_n) is the gradient of f. Recall the Taylor approximation

f(x) ≈ f(x_0) + ∇f(x_0)·(x − x_0) + (1/2)(x − x_0)^T Hf(x_0 + θ(x − x_0))(x − x_0),

where

Hf = [ ∂²f/∂x_i ∂x_j ]_{i,j = 1, …, n}

is the Hessian matrix of f and 0 ≤ θ ≤ 1. Recall that for X = (x_1, x_2, …, x_n) ∈ R^n and A = (a_ij) an n×n matrix, X^T A X = Σ_{i,j=1}^n a_ij x_i x_j. The matrix A is said to be symmetric positive definite if A is symmetric, i.e. a_ij = a_ji, and X ≠ 0 implies X^T A X > 0. A is said to be negative definite if X ≠ 0 implies X^T A X < 0.

(a) Show that the Hessian matrix of a given function f is symmetric if its partial derivatives of second order are continuous (use known results from calculus).

(b) Show that if Hf(x*) is positive definite, then x* is a local minimum; if Hf(x*) is negative definite, then x* is a local maximum; and if X^T Hf(x*) X > 0 for some values of X and X^T Hf(x*) X < 0 for some other values of X, then x* is a saddle point (i.e. a maximum in some directions and a minimum in other directions).

(c) Show that a symmetric matrix A is positive definite if and only if all its eigenvalues are positive, and that it is negative definite if its eigenvalues are all negative. Recall that one fundamental result of linear algebra states that all the eigenvalues of a symmetric (real) matrix are real and that the associated eigenvectors form an orthogonal basis of R^n.

(d) Assume that n = 2.
Use (a), (b), (c) to show that a critical point x* of a function f(x,y) from R² to R is

a local minimum if (f_xy)² − f_xx f_yy < 0 and f_xx > 0,
a local maximum if (f_xy)² − f_xx f_yy < 0 and f_xx < 0,
a saddle point if (f_xy)² − f_xx f_yy > 0.

Hint: show that the roots x_1, x_2 of the quadratic equation x² − Sx + P = 0 satisfy S = x_1 + x_2 and P = x_1 x_2, and that x_1, x_2 are both positive if both P and S are positive, both negative if P > 0 and S < 0. If the product P is negative, then x_1, x_2 have opposite signs, etc.
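As a numerical aside to part (c) above, the eigenvalue test for definiteness is easy to try on a computer. The following Python/NumPy check uses a hypothetical symmetric matrix (not one from the notes):

```python
import numpy as np

# Hypothetical symmetric matrix: test definiteness via its eigenvalues
A = np.array([[2.0, -1.0],
              [-1.0, 2.0]])
eigs = np.linalg.eigvalsh(A)   # eigenvalues of a symmetric matrix are real
print(eigs)                    # both positive => A is positive definite

# cross-check against the quadratic form X^T A X on random nonzero vectors
rng = np.random.default_rng(0)
vals = [v @ A @ v for v in rng.standard_normal((100, 2))]
print(min(vals) > 0)           # positive definiteness: X^T A X > 0 for X != 0
```

The eigenvalues here are 1 and 3, both positive, and accordingly the quadratic form stays positive on every sampled vector.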
190 4. Determine the critical points of each of the following functions and characterize each as a minimum, maximum, or saddle point. Also determine whether each function has a global minimum or maximum on R 2. Hint: use (d) of the previous problem. (a) f(x,y) = x 2 4xy +y 2, (b)f(x,y) = x 4 4xy +y 4, (c)f(x,y) = 2x 3 3x 2 6xy(x y 1) (d)f(x,y) = (x y) 4 +x 2 y 2 2x+2y Suppose that the real valued function f(x) is unimodal on the interval [a,b] and x 1,x 2 are points in the interval such that x 1 < x 2 and f(x 1 ) < f(x 2 ). (a) What is the shortest interval in which you know that the minimum must lie? (b) How would your answer change if we happened to have f(x 1 ) = f(x 2 )? 6. Suppose that the real valued function is unimodal on the interval [a,b] and x 1,x 2 are points in the interval such that a < x 1 < x 2 < b. If f(x 1 ) = and f(x 2 ) = 3.576, then which of the following statements is valid. (a) The minimum of f must lie in the subinterval [x 1,b]. (b) The minimum of f must lie in the subinterval [a,x 2 ]. (c) One can t tell which of these two subintervals the minimum must lie in without knowing the values off(a) and f(b) 7. An alternative to Newton s method for optimization of one-variable functions, which is in some sense the analog of the secant method for f(x) = 0, is known as the successive parabolic interpolation. Assume f(x) has a unique minimum in [a,b]. Let x 1,x 2,x 3 be three distinct points in [a, b], then the main algorithm for the successive parabolic interpolation consists of the 4 steps (a),(b), (c) and (d) below (see also the attached figure). (a) Let P 2 (x) be the quadratic interpolation polynomial of f(x) associated with the three points x 1,x 2,x 3. Starting with the approximation f(x) P 2 (x), x [a,b] (b) Find the minimum x 4 of P 2 (x) in [a,b]. (c) Find the largest value among P 2 (x 1 ),P 2 (x 2 ),P 2 (x 3 ), discard it and replace it with the new point x 4, i.e, the minimum of P 2 (x). 
If P(x 1 ) = max(p 2 (x 1 ),P 2 (x 2 ),P 2 (x 3 )), then x new 1 = x 4 go back to step (a). Else-if P(x 2 ) = max(p 2 (x 1 ),P 2 (x 2 ), 2 P(x 3 )), then x new 2 = x 4 go back to step (a) Else go back to step (a). x new 3 = x 4 189
(d) Stop when the maximum distance between the three points, d = max(|x_1 − x_2|, |x_2 − x_3|, |x_1 − x_3|), is less than some tolerance value.

It can be shown that this method converges super-linearly, i.e. with a rate of convergence strictly between linear and quadratic.

Question: show that either x_4 = x_2 + p/q, where

p = ±[ (x_2 − x_1)² (f(x_2) − f(x_3)) − (x_2 − x_3)² (f(x_2) − f(x_1)) ],
q = (x_2 − x_1)(f(x_2) − f(x_3)) − (x_2 − x_3)(f(x_2) − f(x_1)),

or x_4 is one of the end points a, b.

Figure: successive parabolic interpolation method for optimization (the parabola P_2(x) interpolates f(x) at x_1, x_2, x_3, and its minimum is x_4).

8. Show that, similar to the successive parabolic interpolation described above, Newton's method for optimization can be conceived as minimizing a quadratic polynomial approximation of the objective function at each iteration. What is this quadratic polynomial?

9. The steepest descent method for minimizing a function of several variables is usually slow but reliable. However, it can sometimes fail, and it can also sometimes converge rapidly. Under which conditions would each of these two types of behaviour occur?

10. Consider the function f from R² to R defined by f(x,y) = (1/2)(x² − y)² + (1/2)(1 − x)².

(a) At what point does f attain its minimum?
(b) Perform one iteration of Newton's method for minimizing f using the starting point (x_0, y_0) = (2, 2).

11. Let f : R^n → R be given by f(x) = (1/2) x^T A x − x^T b + c, where A is a symmetric positive definite matrix, b a vector in R^n, and c ∈ R a scalar.
192 (a) Show that and that f(x) = Ax b Hf(x) = A (b) Deduce that Newton s method for minimizing f will converge in exactly one iteration from any starting point x 0. (c) If the steepest descent is used for this function, what happens if the starting value x 0 is such that x 0 x is an eigenvector of A, where x is the minimum of A. 12. Consider the constrained problem subject to min x,y x2 +y 2 g(x,y) = x+y 1 = 0 (a) Find the solution (x,y ) of this problem using Lagrange multipliers. (b) Write down the penalty function for this constrained problem and find the associated minimum (x ρ,y ρ) for a fixed value of ρ. (c) Show that the solution of the penalized problem (x ρ,y ρ) converges to (x,y ) when ρ +. Hint: you don t need to use numerical methods here. 13. Consider the function f : R 2 R defined by f(x,y) = 2x 3 3x 2 6xy(x y 1) (a) Determine all the critical points of f analytically (i.e, using calculus) (b) Classify each critical point as minimum, maximum, or saddle point(again using calculus) (c) Verify your results graphically by creating a contour plot and a 3D surface plot of f, over the square region 2 x,y 2, in matlab by following the matlab instructions below >> f=inline( 2*x.^3-3*x.^2-6*x.*y.* (x-y-1) ) >> x=-2:.1:2;y=x; >>size(x) ans = 1 41 >> F=zeros(41,41); >> for I=1:41 F(I,:) = f(x(i),y); end >>figure >>mesh(x,y,f )%plots the graph of f(x,y) as 3D mesh surface 191
193 >>xlabel( x ) >>ylabel( y ) >>figure >> contour(x,y,f,[-2:0.1:2])%plots the level curves f=c, for all values % of c =-2:.1:2; 41 contours in total. >> colorbar %shows the color scale of the contours >>xlabel( x ) >>ylabel( y ) (d) Use the function fminunc 1 of matlab to minimize this function using various values for the initial guess. Try all the critical points of f identified above and some other points that are close or away from them. Interpret the results. Follow the instructions below to accomplish this. (a) First you need to create an M-file for your objective function, e.g. function Y = func(x) Y=2*x(1).^3-3*x(1).^2-6*x(1)*x(2)* (x(1)-x(2)-1); (b) Call the fminunc with the following starting points x0 = [0,0],x0 = [ 1.1,1],x0 = [1.1,0] and other points of your choice by typing >> xs = fminunc(@func,x0). Run matlab in the format-long mode to display sufficient digits to appreciate your approximation, by typing >>format long before calling fminunc. (e) Repeat (d) for g = f and interpret the new results. 1 minimization for unconstrained optimization, for constrained problems there is fmincon. Use the matlab help to learn more on these commands. 192
Chapter 8

Finite difference methods for partial differential equations

8.1 Introduction to partial differential equations

Partial differential equations (or PDEs for short) are in some sense an extension of the notion of differential equations to multi-variable functions. For simplicity, we consider functions of two variables x, t, where t > 0 is time and x ∈ R, usually called the spatial variable, may have several interpretations depending on the application. Let u = u(x,t) be a generic function of x, t. Then a PDE in u(x,t) takes the general form

F(u, u_t, u_x, u_xt, u_tt, u_xx, …) = 0,

where F is some multivariable function linking the unknown function u and its partial derivatives. The highest-order derivative in the equation F = 0 defines the order of the PDE. Here we will limit ourselves to second-order PDEs, i.e. the dots in the F-equation will be ignored in what follows. Most of the well-known examples of PDEs come from the physical sciences. Here are a few examples.

The heat equation:

u_t = k u_xx,  t ≥ 0, x ∈ [0, L],

where u is the temperature in a rod of length L, which spreads or diffuses within the rod as time grows, and k > 0 is the heat-diffusivity coefficient, which depends on the material.

The wave equation:

u_tt = c² u_xx,  t > 0, x ∈ [0, L].

Here u(x,t) is the height, at any given time t > 0 and position x, of a vibrating string of length L, measured from its equilibrium (rest) position, and c > 0 is the speed at which waves travel within the material.
The Poisson equation (or Laplace equation, when f = 0):

u_xx + u_yy = f(x,y).

Here u(x,y) is a function of the two variables x, y, which usually represents a potential of some sort, and f is an external force or an imposed flux of some sort. Laplace's equation can also be regarded as a steady-state (time-independent) solution of the heat equation in 2D, in the presence of a constant heat source or sink f(x,y).

The Black-Scholes equation:

∂f/∂t + (1/2) σ² S² ∂²f/∂S² + r S ∂f/∂S − r f = 0,

which is the highlight example in finance. Here f is the price of a derivative security, t is time, S is the varying price of the underlying asset (which replaces the spatial variable x), r is the risk-free interest rate, and σ is the market volatility.

8.1.1 Classification of PDEs

PDEs can be classified by their order or linearity. For instance, the PDEs above are all linear and of second order. An example of a first-order PDE is the famous one-way wave equation

u_t + c u_x = 0,

whose solution is

u(x,t) = u_0(x − ct),

where u(x,0) = u_0(x) is the initial condition (the wave profile at t = 0, assumed given) and c is the speed of propagation of the wave. To see that u(x,t) = u_0(x − ct), it suffices to make use of the chain rule:

u_t(x,t) = (−c) u_0'(x − ct)  and  c u_x(x,t) = c u_0'(x − ct),

therefore u_t(x,t) + c u_x(x,t) = c(−u_0' + u_0') = 0. Note that the solution formula u(x,t) = u_0(x − ct) implies that the initial profile u_0(x) of the wave remains unchanged as time grows: it is simply translated either to the right (when c > 0) or to the left (when c < 0) with speed c.

Classification of second-order linear PDEs

A second-order linear PDE in two variables (x,t) has the generic form

a(x,t) u_xx + b(x,t) u_xt + c(x,t) u_tt + d(x,t) u_x + e(x,t) u_t + f(x,t) u + g(x,t) = 0.   (8.1)

The functions a(x,t), b(x,t), …, f(x,t) are given and called the coefficients; they can be either variable, i.e. depending on x, t, or constant. The term g(x,t) is often referred to as an external forcing or source/sink term.
The PDE (8.1) is said to be hyperbolic if the coefficients of the second-order derivatives satisfy

Δ = b(x,t)² − 4 a(x,t) c(x,t) > 0.

It is said to be parabolic if b(x,t)² − 4 a(x,t) c(x,t) = 0, and elliptic if b(x,t)² − 4 a(x,t) c(x,t) < 0. Note that this classification is related to that of quadric surfaces from multivariable calculus.

Examples

The heat equation is parabolic: u_t = k u_xx, or −k u_xx + u_t = 0, i.e. a = −k, b = 0, c = 0, so Δ = 0.

The wave equation is hyperbolic: u_tt = c² u_xx, or −c² u_xx + u_tt = 0, i.e. a = −c², b = 0, c = 1, so Δ = 4c² > 0.

The Laplace (or Poisson) equation is elliptic: u_xx + u_yy = 0, i.e. a = 1, b = 0, c = 1, so Δ = −4 < 0.

The Black-Scholes equation is parabolic:

∂f/∂t + (1/2) σ² S² ∂²f/∂S² + r S ∂f/∂S − r f = 0,

with S playing the role of the spatial variable, i.e. a = (1/2) σ² S², b = 0, c = 0, so Δ = 0.

Important result from PDE theory

A linear PDE of second order (i.e. at least one of the coefficients a, b, c is non-zero) can always be transformed, via an appropriate change of variables, into the (generalized) heat equation, the (generalized) wave equation, or the (generalized) Poisson equation, according to whether it is parabolic, hyperbolic, or elliptic, respectively. (Here generalized means that the coefficients a, b, c from (8.1) are as in the heat, wave, or Poisson equation, while the rest are arbitrary.) We will see below that, as a parabolic PDE, the Black-Scholes equation for instance can be regarded as a generalized heat equation.
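The discriminant test above is mechanical enough to put in a few lines of code. The Python sketch below (a hypothetical illustration; the sample parameter values k, c, σ, S are assumptions, not from the notes) classifies each of the four examples:

```python
def classify(a, b, c):
    # classify a*u_xx + b*u_xt + c*u_tt + (lower order) = 0
    # by the discriminant b^2 - 4ac of the second-order coefficients
    disc = b * b - 4.0 * a * c
    if disc > 0:
        return "hyperbolic"
    if disc == 0:
        return "parabolic"
    return "elliptic"

k, c_wave, sigma, S = 0.5, 2.0, 0.3, 100.0            # assumed sample values
print(classify(-k, 0.0, 0.0))                          # heat: parabolic
print(classify(-c_wave**2, 0.0, 1.0))                  # wave: hyperbolic
print(classify(1.0, 0.0, 1.0))                         # Laplace: elliptic
print(classify(0.5 * sigma**2 * S**2, 0.0, 0.0))       # Black-Scholes: parabolic
```

Since b = 0 in all four examples, the sign of Δ = −4ac is decided entirely by the product of the two second-order coefficients.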
8.1.2 Initial and boundary conditions

For a second-order time-dependent PDE, we typically require initial and boundary conditions to be supplemented to the equation, in order to have a well-posed problem, i.e. a problem that has a unique and well-behaved solution. The type and number of conditions required depend on the equation itself. For the heat equation on an interval (0, L) we can have an initial condition and boundary conditions of Dirichlet type, where the solution u is fixed at the end points:

u_t = k u_xx,  0 ≤ x ≤ L, t ≥ 0,   (8.2)
u(x,0) = u_0(x)  (initial condition),   (8.3)
u(0,t) = α(t), u(L,t) = β(t), t ≥ 0  (Dirichlet boundary conditions),   (8.4)

or an initial condition plus boundary conditions of Neumann type, where the derivative of u (i.e. the flux of heat) is fixed at the end points:

u_t = k u_xx,  0 ≤ x ≤ L, t ≥ 0,   (8.5)
u(x,0) = u_0(x)  (initial condition),   (8.6)
u_x(0,t) = α(t), u_x(L,t) = β(t), t ≥ 0  (Neumann boundary conditions).   (8.7)

For the wave equation, on the other hand, we need two initial conditions, since we have a second-order derivative in time:

u_tt = c² u_xx,  0 ≤ x ≤ L, t ≥ 0,   (8.8)
u(x,0) = u_0(x), u_t(x,0) = u_1(x)  (two initial conditions),   (8.9)
plus Dirichlet or Neumann boundary conditions.   (8.10)

Case of the Black-Scholes equation

In more general settings the heat equation is also known as the diffusion equation, and is often given in the form

u_t = (D(x) u_x)_x,

where D(x) ≥ 0 is the diffusion coefficient, which may vary in space. For the heat equation in a uniform medium, D(x) = k is constant. To relate the Black-Scholes equation to the diffusion equation, we first note that the Black-Scholes equation is not subject to an initial condition like the heat equation above, but rather to a terminal condition f(S,T) = f_T(S), given as a function of the market price S. The terminal condition depends on the underlying option. For a call option, the holder makes a profit when the price S is larger than the strike price K, and

f_T(S) = max(S − K, 0),
and for a put option the holder makes a profit when K > S:

f_T(S) = max(K − S, 0).

The price f(S,0) at time t = 0 is not known. It is determined by solving the Black-Scholes equation backward in time. To see the Black-Scholes equation as an initial value problem with time going forward, we can reverse the flow of time by introducing the change of variables

τ = T − t,

and let g(S,τ) = f(S,t). Then

∂g(S,τ)/∂τ = (∂f(S,t)/∂t)(∂t/∂τ) = −∂f(S,t)/∂t.

Therefore, the Black-Scholes equation becomes

g_τ − (1/2) σ² S² g_SS − r S g_S + r g = 0,  or  g_τ − r S g_S + r g = (1/2) σ² S² g_SS.

Multiplying this equation through by e^{rτ}, the zeroth-order term can be collected into the τ-derivative:

∂(e^{rτ} g)/∂τ − r S ∂(e^{rτ} g)/∂S = (1/2) σ² S² ∂²(e^{rτ} g)/∂S².

If we introduce

h(S,τ) = e^{rτ} g,

which can be thought of as the compounded asset price, we arrive at the equation

h_τ − r S h_S = (1/2) σ² S² h_SS,

or

h_τ − (rS − σ²S) h_S = ∂/∂S ( (1/2) σ² S² ∂h/∂S ),

which can be regarded as a generalized heat equation: it has a diffusion term on the right-hand side (with the right sign, +, for the diffusion coefficient (1/2)σ²S², just like the heat equation) and, in addition, a drift term (rS − σ²S) h_S that can be regarded as a one-way wave term. In conclusion, the Black-Scholes equation can be regarded as a composition, or superposition, of the heat equation and the one-way wave equation. Therefore, we consider here numerical methods for both the one-way wave equation and the heat equation.
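The two terminal conditions are simple enough to code directly. The following Python sketch (a hypothetical illustration; the strike K = 100 and the sample prices are assumptions, not values from the notes) evaluates both payoffs and checks the identity call − put = S − K, which holds at expiry for every S:

```python
def call_payoff(S, K):
    # terminal condition for a European call: f_T(S) = max(S - K, 0)
    return max(S - K, 0.0)

def put_payoff(S, K):
    # terminal condition for a European put: f_T(S) = max(K - S, 0)
    return max(K - S, 0.0)

K = 100.0                                  # strike price (assumed sample value)
for S in [80.0, 100.0, 120.0]:
    c, p = call_payoff(S, K), put_payoff(S, K)
    # at expiry, call minus put equals S - K for any S
    print(S, c, p, c - p == S - K)
```

These functions supply f(S, T); a numerical scheme for the transformed equation then marches forward in τ = T − t to recover the option price at t = 0.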
8.2 Finite differencing

What is normally called finite differencing is a way to approximate derivatives. Suppose we are given a function f(x) of the variable x and we want to approximate its derivative at some given point x_0. There can be many reasons for this. An obvious one is when the derivative of the function itself is not known in closed form. This can happen when f(x) is given through a black-box numerical software package, is inferred from data such as a market price, or is the unknown solution of a PDE, as we will see below. It can also occur that the function f is known only at a certain number of discrete points x_1, x_2, x_3, …, x_n.

Let h > 0 be fixed. According to the Taylor expansion we have

f(x_0 + h) = f(x_0) + h f'(x_0) + (1/2) h² f''(ξ).

If h is small enough, then the remainder (1/2) h² f''(ξ) can be dropped and we obtain the approximation

f(x_0 + h) ≈ f(x_0) + h f'(x_0),  i.e.  f'(x_0) ≈ (f(x_0 + h) − f(x_0)) / h.

Recall that this is just the basic formula used in first-year calculus to define the derivative:

f'(x_0) = lim_{h→0} (f(x_0 + h) − f(x_0)) / h.

Note that for a function defined at a certain number of points by f_i = f(x_i), i = 1, 2, …, n, h can be thought of as the distance between successive points. Thus the forward differencing formula reads

f'_i ≈ (f_{i+1} − f_i) / (x_{i+1} − x_i).

Similarly, if we consider the Taylor approximation of f(x_0 − h), namely f(x_0 − h) ≈ f(x_0) − h f'(x_0), we arrive at the backward differencing formula

f'_i ≈ (f_i − f_{i−1}) / (x_i − x_{i−1}).

Note that when the points are equally spaced, x_{i+1} − x_i = h for all i, the forward and backward formulas become

f'_i ≈ (f_{i+1} − f_i) / h  and  f'_i ≈ (f_i − f_{i−1}) / h,
respectively. Now, consider the second-order Taylor expansions of f(x_0 + h) and f(x_0 − h), respectively:

f(x_0 + h) = f(x_0) + h f'(x_0) + (1/2) h² f''(x_0) + (1/6) h³ f'''(ξ),
f(x_0 − h) = f(x_0) − h f'(x_0) + (1/2) h² f''(x_0) − (1/6) h³ f'''(η).

Combining the two expressions yields

f(x_0 + h) − f(x_0 − h) = 2h f'(x_0) + (1/6) h³ (f'''(ξ) + f'''(η)),

thus we get the centered difference approximation

f'(x_0) ≈ (f(x_0 + h) − f(x_0 − h)) / (2h),

or f'_i ≈ (f_{i+1} − f_{i−1}) / (x_{i+1} − x_{i−1}) when the x_i's are equidistant.

Example: let f(x) = e^x. Here we use the three different finite differencing formulas introduced above to estimate the derivative f'(x_0) for x_0 = 1 and compare the numerical values to the exact derivative f'(1) = e, for different values of the grid size h.

h | Backward | Abs. Error | Forward | Abs. Error | Centered | Abs. Error

where the backward formula is (f(x_0) − f(x_0 − h)) / h with associated absolute error |(f(x_0) − f(x_0 − h)) / h − f'(x_0)|, the forward formula is (f(x_0 + h) − f(x_0)) / h with error |(f(x_0 + h) − f(x_0)) / h − f'(x_0)|, and the centered formula is (f(x_0 + h) − f(x_0 − h)) / (2h) with error |(f(x_0 + h) − f(x_0 − h)) / (2h) − f'(x_0)|.

From this table we see that as h decreases, the absolute error decreases for all three formulas. However, while the error for the forward and backward formulas is roughly divided by two each time h is divided by two, implying a first-order or linear convergence, the error of the centered formula is divided by roughly four each time h is divided by two, which suggests a second-order convergence. In fact, this is confirmed below by the truncation error analysis.

Truncation errors: From the Taylor approximation

f(x_0 + h) − f(x_0) = h f'(x_0) + (1/2) h² f''(ξ)
we have, for the forward finite-differencing formula,

    \left| \frac{f(x_0+h) - f(x_0)}{h} - f'(x_0) \right| = \frac{1}{2} h |f''(\xi)| \le \frac{h}{2} \max_x |f''(x)| = O(h),

and similarly, from

    f(x_0) - f(x_0 - h) = h f'(x_0) - \frac{1}{2} h^2 f''(\xi),

we get

    \left| \frac{f(x_0) - f(x_0-h)}{h} - f'(x_0) \right| \le \frac{h}{2} \max_x |f''(x)| = O(h),

i.e., both the forward and backward formulas are first-order accurate: the error is O(h). For the centered formula, however, we have

    \left| \frac{f(x_0+h) - f(x_0-h)}{2h} - f'(x_0) \right| = \frac{h^2}{12} \left| f'''(\xi) + f'''(\eta) \right| \le \frac{h^2}{6} \max_x |f'''(x)| = O(h^2),

which shows a second-order approximation.

Finite-difference schemes for the one-way wave equation

Consider the one-way wave equation

    u_t + c u_x = 0,  x \in [a,b],  t \in [0,T],

with the initial condition u(x,0) = u_0(x), plus boundary conditions to be specified. Assume c > 0 is constant.

Discretization: Let n, m be two positive integers. Consider a set of discrete times t_k = k\Delta t, k = 0, 1, 2, ..., n, and a set of discrete spatial points x_j = a + j\Delta x, j = 0, 1, 2, ..., m, where \Delta t = T/n and \Delta x = (b-a)/m are called the time step and spatial mesh size, respectively. Next, we use finite differences to approximate the partial derivatives u_t and u_x in the one-way wave equation to obtain a difference scheme. We use forward differencing in time combined with either forward, backward, or centered differencing in space, then compare and analyze the performance of each one of the resulting schemes. Denote

    u_j^k \equiv u(x_j, t_k).

Consider the forward difference formula in time,

    u_t(x_j, t_k) = \frac{u_j^{k+1} - u_j^k}{\Delta t} - \frac{\Delta t}{2} u_{tt}(x_j, \tau),
combined with forward differencing in space,

    u_x(x_j, t_k) = \frac{u_{j+1}^k - u_j^k}{\Delta x} - \frac{\Delta x}{2} u_{xx}(\xi, t_k).

When applied to the one-way wave equation, this combination yields

    u_j^{k+1} = u_j^k - \frac{c \Delta t}{\Delta x} \left( u_{j+1}^k - u_j^k \right) + O(\Delta t^2) + O(\Delta t \Delta x).

For sufficiently small time step and spatial grid size, the O(\Delta t^2) and O(\Delta t \Delta x) terms can be dropped to yield the downwind or downstream difference scheme

    w_j^{k+1} = w_j^k - \frac{c \Delta t}{\Delta x} \left( w_{j+1}^k - w_j^k \right).

The name downwind or downstream comes from the fact that the derivative is taken downwind with respect to the direction in which the wave is moving (to the right for c > 0), i.e., ahead of the wave. The main idea is to set w_j^0 = u_j^0 = u_0(x_j) for all j and iterate the difference scheme forward in time to obtain successive approximations u_j^k \approx w_j^k, for all j and k \ge 1.

If backward differencing in space is used instead, we obtain the upstream or upwind scheme

    w_j^{k+1} = w_j^k - \frac{c \Delta t}{\Delta x} \left( w_j^k - w_{j-1}^k \right),

and centered differencing yields the centered scheme

    w_j^{k+1} = w_j^k - \frac{c \Delta t}{2 \Delta x} \left( w_{j+1}^k - w_{j-1}^k \right).

Important remark: Note that the truncation error is first order in both time and space for both the upstream and downstream schemes, while it is first order in time and second order in space for the centered scheme. However, it turns out that only the upstream scheme is useful. Both the downstream and centered schemes can lead to catastrophic numerical instabilities and thus must be avoided in practice. Also note that, by symmetry, when c < 0 the situation is reversed and it is the forward differencing in space which yields the stable upstream scheme

    w_j^{k+1} = w_j^k - \frac{c \Delta t}{\Delta x} \left( w_{j+1}^k - w_j^k \right).

Finally, we implement each one of these schemes in Matlab for the following example.
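Before turning to the Matlab example, the instability claims above are easy to observe numerically. The following sketch (written in Python for self-containment rather than the Matlab used in these notes; the grid size and step count are chosen purely for illustration) advances the upstream and downstream schemes on a periodic grid with c > 0 and µ = cΔt/Δx = 0.5. The upstream solution stays bounded by the initial amplitude, while the downstream solution grows steadily.

```python
import math

def step(w, mu, scheme):
    """One time step of a first-order advection scheme on a periodic grid.

    w      : list of values w_j, j = 0..m-1 (periodic, so w_m = w_0)
    mu     : Courant number c*dt/dx, with c > 0 assumed
    scheme : 'upstream' uses w_j - w_{j-1}, 'downstream' uses w_{j+1} - w_j
    """
    m = len(w)
    if scheme == 'upstream':
        # w[j-1] with j = 0 wraps to w[-1], Python's periodic indexing
        return [w[j] - mu * (w[j] - w[j - 1]) for j in range(m)]
    else:  # downstream
        return [w[j] - mu * (w[(j + 1) % m] - w[j]) for j in range(m)]

m, mu, nsteps = 50, 0.5, 500
w_up = [math.sin(2 * math.pi * j / m) for j in range(m)]
w_down = list(w_up)
for _ in range(nsteps):
    w_up = step(w_up, mu, 'upstream')
    w_down = step(w_down, mu, 'downstream')

print(max(abs(v) for v in w_up))    # bounded: never exceeds the initial amplitude
print(max(abs(v) for v in w_down))  # grows without bound as nsteps increases
```

For 0 ≤ µ ≤ 1 the upstream update is a convex combination of w_j and w_{j-1}, so its maximum norm cannot increase, which is exactly the conditional stability established by the von Neumann analysis later in this chapter.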
Example:

    u_t + u_x = 0,  (c = 1),  on [0,1] \times [0,T],
    u(x,0) = \sin(2\pi x),
    u(1,t) = u(0,t)  \forall t   (periodic boundary conditions).

The exact solution of this PDE is u(x,t) = \sin(2\pi(x - t)). The numerical algorithms to iterate the downstream, upstream, and centered difference schemes, respectively, are given next.
Downstream

  Enter grid size m, number of time steps n, and termination time T.
  Set \Delta x = 1/m, \Delta t = T/n, x_j = j\Delta x, j = 0,1,...,m, t_k = k\Delta t, k = 0,1,...,n.
  Set \mu = c\Delta t/\Delta x.
  Initialization: w_j^0 = \sin(2\pi x_j), j = 0,1,2,...,m.
  For k = 0,1,...,n-1
      For j = 0,1,...,m-1
          w_j^{k+1} = w_j^k - \mu ( w_{j+1}^k - w_j^k )
      end for j-loop
      Boundary conditions: w_m^{k+1} = w_0^{k+1}
  End for k-loop.

Upstream

  Enter grid size m, number of time steps n, and termination time T.
  Set \Delta x = 1/m, \Delta t = T/n, x_j = j\Delta x, j = 0,1,...,m, t_k = k\Delta t, k = 0,1,...,n.
  Set \mu = c\Delta t/\Delta x.
  Initialization: w_j^0 = \sin(2\pi x_j), j = 0,1,2,...,m.
  For k = 0,1,...,n-1
      For j = 1,2,...,m
          w_j^{k+1} = w_j^k - \mu ( w_j^k - w_{j-1}^k )
      end for j-loop
      Boundary conditions: w_0^{k+1} = w_m^{k+1}
  End for k-loop.
Centered

  Enter grid size m, number of time steps n, and termination time T.
  Set \Delta x = 1/m, \Delta t = T/n, x_j = j\Delta x, j = 0,1,...,m, t_k = k\Delta t, k = 0,1,...,n.
  Set \mu = c\Delta t/\Delta x.
  Initialization: w_j^0 = \sin(2\pi x_j), j = 0,1,2,...,m,m+1.
  For k = 0,1,...,n-1
      For j = 1,2,...,m
          w_j^{k+1} = w_j^k - \frac{\mu}{2} ( w_{j+1}^k - w_{j-1}^k )
      end for j-loop
      Boundary conditions: w_0^{k+1} = w_m^{k+1}, w_{m+1}^{k+1} = w_1^{k+1}
  End for k-loop.

Here is the corresponding Matlab code.

%advection speed
c=1;
%Grid
m=100; dx = 1/m; dt = dx/2/c; n = 350; T = n*dt;
mu = c*dt/dx;
%initial condition
w = zeros(m+2,1);
% note: because Matlab doesn't allow nonpositive indices,
% w(1) here stands for w(-1,k) and w(2) stands for w(0,k),
% ..., w(m+1) is w(m-1,k), and w(m+2) is w(m,k)
w(2:m+1) = sin((0:m-1)*2*pi/m);
%
%periodic boundary conditions at t=0
w(1) = w(m+1); w(m+2) = w(2);
for k = 0 : n-1
    %% to switch from one scheme to another, comment and uncomment the respective lines
    % w(2:m+1) = w(2:m+1) - mu*(w(3:m+2) - w(2:m+1));   % forward-downstream scheme
    w(2:m+1) = w(2:m+1) - mu*(w(2:m+1) - w(1:m));       % backward-upstream scheme
    % w(2:m+1) = w(2:m+1) - (mu/2)*(w(3:m+2) - w(1:m)); % centered scheme
    % periodic boundary conditions at t = t_{k+1}
    w(1) = w(m+1); w(m+2) = w(2);
end
figure
x = 0:dx:1;
%subplot(2,2,1)
plot(x,sin(2*pi*x)) % initial solution
hold on
plot(x,sin(2*pi*(x-T)),'linewidth',2) % exact solution at time T
%legend('initial solution','exact sol at t=T')
%subplot(2,2,2)
plot(x,w(2:m+2),'--','linewidth',2) % numerical solution at time T
legend('initial solution','exact sol at t=T','numerical sol. at t=T')
%title('forward (downstream) scheme','fontsize',14)
title('backward (upstream) scheme','fontsize',14)
%title('centred scheme','fontsize',14)

In the figures below, we plot the numerical solutions associated with each one of the three schemes (dashed) on top of the exact solution (solid) for the one-way wave equation, at both times T = 0.25 and T = 1.75. Already at time T = 0.25 we see that while the upstream and centered schemes seem to yield accurate approximations, the forward-downstream scheme suffers from a numerical instability, the numerical solution being very noisy. Moreover, the error deteriorates greatly as time grows, as should be expected. At T = 1.75, only the upstream scheme yields a reasonable, although rather inaccurate, solution. Both the downstream (not shown) and the centered schemes suffer from
numerical instabilities: the numerical solution oscillates and grows.

[Figures: numerical solution (dashed) vs. exact solution (solid). At time T = 0.25: downstream, upstream, and centered schemes. At time T = 1.75: upstream and centered schemes.]

8.3 Finite difference schemes for the heat equation

Finite difference approximation of the second-order derivative

To discretize the heat equation we need to approximate both the time derivative u_t and the second-order derivative u_xx. Let f be a smooth function and consider the third-order Taylor expansion of f near some fixed point x_0. For h > 0, we have

    f(x_0 + h) = f(x_0) + h f'(x_0) + \frac{1}{2} h^2 f''(x_0) + \frac{1}{6} h^3 f'''(x_0) + \frac{1}{24} h^4 f^{(4)}(\xi),

    f(x_0 - h) = f(x_0) - h f'(x_0) + \frac{1}{2} h^2 f''(x_0) - \frac{1}{6} h^3 f'''(x_0) + \frac{1}{24} h^4 f^{(4)}(\eta).

Summing up yields

    f(x_0 + h) + f(x_0 - h) = 2 f(x_0) + h^2 f''(x_0) + \frac{1}{24} h^4 \left( f^{(4)}(\xi) + f^{(4)}(\eta) \right).
Therefore,

    \frac{f(x_0+h) - 2 f(x_0) + f(x_0-h)}{h^2} = f''(x_0) + \frac{h^2}{24} \left( f^{(4)}(\xi) + f^{(4)}(\eta) \right) = f''(x_0) + O(h^2),

or

    f''(x_0) = \frac{f(x_0+h) - 2 f(x_0) + f(x_0-h)}{h^2} + O(h^2),

which, when the O(h^2) error term is dropped, yields the centered difference formula for the second derivative.

Finite difference scheme for the heat equation

Consider the heat equation with Dirichlet boundary conditions,

    u_t = \kappa u_{xx},  a \le x \le b,  0 \le t \le T,
    u(x,0) = u_0(x),  u(a,t) = \alpha(t),  u(b,t) = \beta(t),

and consider a discretization of time and space. Using the notation

    t_k = k\Delta t, k = 0,1,2,...,n,  \Delta t = T/n,
    x_j = a + j\Delta x, j = 0,1,2,...,m,  \Delta x = (b-a)/m,
    u_j^k = u(x_j, t_k),

and the approximations

    (u_t)_j^k = \frac{u_j^{k+1} - u_j^k}{\Delta t} + O(\Delta t)

and

    (u_{xx})_j^k = \frac{u_{j+1}^k - 2 u_j^k + u_{j-1}^k}{\Delta x^2} + O(\Delta x^2),

the heat equation becomes

    \frac{u_j^{k+1} - u_j^k}{\Delta t} = \kappa \frac{u_{j+1}^k - 2 u_j^k + u_{j-1}^k}{\Delta x^2} + O(\Delta t) + O(\Delta x^2).

If the O(\Delta t) and O(\Delta x^2) error terms are dropped, we obtain the forward-in-time centered-in-space (FTCS) difference scheme for the heat equation: with \mu = \kappa \Delta t / \Delta x^2,

    Initial condition: w_j^0 = u(x_j, 0) = u_0(x_j), j = 0,1,...,m,

    w_j^{k+1} = w_j^k + \mu \left( w_{j+1}^k - 2 w_j^k + w_{j-1}^k \right),  k \ge 0,  j = 1,...,m-1.   (8.11)
Boundary conditions: w_0^k = \alpha(t_k), w_m^k = \beta(t_k).

Example:

    u_t = 2 u_{xx},  0 \le x \le 1,
    u(x,0) = \sin(\pi x),
    u(0,t) = u(1,t) = 0.

Note the exact solution for this problem is

    u(x,t) = e^{-2\pi^2 t} \sin(\pi x)

(by the method of separation of variables). Here we use Matlab to solve this equation numerically using the FTCS scheme (8.11) and compare the results with this exact solution.

%Grid
m=20; dx = 1/m;
mu=0.5; kappa=2;
dt = dx^2*mu/kappa; n = 40; T = n*dt;
mu = kappa*dt/(dx^2);
%initial condition
w = zeros(m+1,1);
% w(j+1) here stands for w_j, j = 0,1,...,m
w(2:m) = sin((1:m-1)*pi/m);
%
%Dirichlet boundary conditions at t=0
w(1) = 0; w(m+1) = 0;
for k = 0 : n-1
    w(2:m) = w(2:m) + mu*(w(3:m+1) - 2*w(2:m) + w(1:m-1)); % forward in time, centered in space
    % Dirichlet boundary conditions at t = t_{k+1}
    w(1) = 0; w(m+1) = 0;
end
figure
x = 0:dx:1;
%subplot(2,2,1)
plot(x,sin(pi*x)) % initial solution
hold on
plot(x,exp(-kappa*pi^2*T)*sin(pi*x),'linewidth',2) % exact solution at time T
%legend('initial solution','exact sol at t=T')
%subplot(2,2,2)
plot(x,w(1:m+1),'o--','linewidth',2) % numerical solution at time T
legend('initial solution','exact sol at t=T','numerical sol. at t=T')
title(['Heat eqn: forward in time centered in space, ',...
    '\mu = ',num2str(floor(mu*100)/100)],'fontsize',14)

An interesting practical question is how to pick the time step \Delta t and the spatial grid size \Delta x. One can argue from the derivation above that the smaller these quantities are, the better the approximation will be, and this turns out to be true in practice, provided a stability condition is satisfied. However, because of finite computing resources, the size of these parameters can be limited. As we can see from the scheme, the ratio \mu is an important non-dimensional parameter, on which the behaviour of the numerical scheme may depend greatly. Here we fix n = 40, m = 20, \Delta x = 1/m, \Delta t = \mu (\Delta x)^2 / \kappa and integrate the numerical scheme to T = n\Delta t. In the top panel of the figure below, we plot the initial condition u(x,0) at t = 0, the exact solution u(x,T) at t = T with \mu = 0.4 (thick), and the numerical solution w_j^n at t = T (circles). As expected, the numerical solution lies accurately on top of the exact solution. In the bottom panel, however, we show the numerical solution at t = T with \mu = 1 instead. As we can see, the calculation is nowhere close to the truth: it displays rapid and large oscillations. In fact this is a manifestation of a numerical instability; the algorithm becomes unstable.

Notion of stability

A numerical scheme w^{k+1} = F(w^k), w = (w_1, w_2, ..., w_m), is said to be stable if there exists a constant C > 0 such that \|w^k\| \le C for all k \ge 0, i.e., the numerical solution doesn't grow without bound. According to the results above, the FTCS scheme for the heat equation is stable for \mu = 0.4 and unstable for \mu = 1.
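The dependence on µ is easy to reproduce in code. The sketch below (a Python translation of the idea, not part of the original Matlab notes; it uses κ = 1 rather than the example's κ = 2, and the grid and step counts are illustrative) runs the FTCS scheme (8.11) on u_t = κu_xx with u_0 = sin(πx) and homogeneous Dirichlet conditions, and returns the maximum error against the exact solution e^{-κπ²T} sin(πx).

```python
import math

def ftcs(mu, kappa=1.0, m=20, nsteps=200):
    """Run the FTCS scheme for u_t = kappa*u_xx on [0,1] with u(0,t)=u(1,t)=0
    and u(x,0) = sin(pi*x); dt is chosen so that kappa*dt/dx^2 = mu.
    Returns the maximum error against the exact solution at t = nsteps*dt."""
    dx = 1.0 / m
    dt = mu * dx * dx / kappa
    w = [math.sin(math.pi * j * dx) for j in range(m + 1)]
    for _ in range(nsteps):
        # scheme (8.11) at interior points; boundaries pinned to zero
        w = ([0.0]
             + [w[j] + mu * (w[j + 1] - 2 * w[j] + w[j - 1]) for j in range(1, m)]
             + [0.0])
    T = nsteps * dt
    exact = [math.exp(-kappa * math.pi ** 2 * T) * math.sin(math.pi * j * dx)
             for j in range(m + 1)]
    return max(abs(a - b) for a, b in zip(w, exact))

print(ftcs(mu=0.4))  # small error: stable and accurate
print(ftcs(mu=1.0))  # enormous error: round-off excites the unstable modes
```

With µ = 1 the initial data contains only the smooth, stable mode, but floating-point round-off injects a tiny amount of every mode, and the high-wavenumber ones are amplified at each step until they dominate the solution.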
In fact, for a linear scheme like FTCS we can determine analytically for which values of µ it is stable and for which values it is unstable. 209
[Figure: FTCS for the heat equation, showing the initial solution, the exact solution at t = T, and the numerical solution at t = T; top panel µ = 0.4 (accurate), bottom panel µ = 1 (unstable, oscillatory).]
The FTCS scheme (8.11), when rewritten as

    w_j^{k+1} = \mu w_{j-1}^k + (1 - 2\mu) w_j^k + \mu w_{j+1}^k,

can be seen as a linear iterative process w^{k+1} = A w^k where A is the tri-diagonal matrix

    A = \begin{pmatrix} 1-2\mu & \mu & & \\ \mu & 1-2\mu & \mu & \\ & \ddots & \ddots & \ddots \\ & & \mu & 1-2\mu \end{pmatrix}.

As for fixed-point iterations, we can show that this iterative process remains bounded if \|A\| \le 1 for some matrix norm, or equivalently if its spectral radius (the absolute value of the largest eigenvalue) satisfies \rho(A) \le 1, i.e., all the eigenvalues of A are at most one in magnitude. Since this method can be a little tedious, here we use an alternative route, namely von Neumann analysis. Assume

    w_j^k = \rho^k e^{i l x_j},

where i is the imaginary unit, i^2 = -1, j = 0,1,2,...,m, and l is the Fourier wavenumber of the mode. Then the FTCS scheme (8.11) leads to

    \rho^{k+1} e^{i l x_j} = \mu \rho^k e^{i l x_{j-1}} + (1 - 2\mu) \rho^k e^{i l x_j} + \mu \rho^k e^{i l x_{j+1}}.

Using the fact that x_{j\pm 1} = x_j \pm \Delta x, we get

    \rho = \mu \left( e^{-i l \Delta x} + e^{i l \Delta x} \right) + (1 - 2\mu) = 2\mu \cos(l\Delta x) + 1 - 2\mu = 1 - 2\mu (1 - \cos(l\Delta x)) = 1 - 4\mu \sin^2(l\Delta x / 2).

Thus, the solution w_j^k = \rho^k e^{i l x_j} will grow without bound if and only if |\rho| > 1. The condition for stability, uniformly over all wavenumbers l, is therefore given by

    |\rho| \le 1  \iff  -1 \le 1 - 4\mu \sin^2(l\Delta x/2) \le 1  \iff  0 \le 4\mu \sin^2(l\Delta x/2) \le 2  \iff  0 \le \mu \le \frac{1}{2},

which is consistent with the results in the plots above: the scheme is stable for 0 \le \mu \le 0.5 (thus in particular for \mu = 0.4) and unstable for \mu > 0.5 (thus in particular for \mu = 1). The FTCS scheme for the heat equation (8.11) is said to be CONDITIONALLY STABLE: stable if 0 \le \mu = \kappa \Delta t / \Delta x^2 \le 0.5, unstable otherwise. Below, we derive a scheme that is
unconditionally stable, i.e., stable for all values of \Delta t and \Delta x, by using, instead, backward differencing in time.

Backward in time, centered in space: BTCS

If the time derivative u_t is approximated instead by the backward formula

    u_t(x_j, t_{k+1}) \approx \frac{u_j^{k+1} - u_j^k}{\Delta t},

then we arrive at the finite difference scheme

    w_j^{k+1} = w_j^k + \mu \left( w_{j+1}^{k+1} - 2 w_j^{k+1} + w_{j-1}^{k+1} \right),  k \ge 0,  j = 1,2,...,n-1,   (8.12)

with the boundary and initial conditions as above.

Implicit vs. explicit

Note that while, in the forward scheme (8.11), the values of w at time step t_{k+1} are given explicitly in terms of the values of w at previous time steps (here t_k only, but a scheme could involve t_{k-1}, t_{k-2}, ... and still be called explicit), w_j^{k+1} in (8.12) solves a non-trivial system of (linear) equations, i.e., w_j^{k+1} is given only implicitly. Thus the FTCS scheme is called explicit while the BTCS scheme is called implicit.

Solving the implicit scheme: To recover the values of w^{k+1} in terms of w^k, at the previous time step, from the BTCS scheme (8.12), we need to solve the corresponding (linear) system of equations.

Case of Dirichlet boundary conditions:

    w_j^{k+1} = w_j^k + \mu \left( w_{j+1}^{k+1} - 2 w_j^{k+1} + w_{j-1}^{k+1} \right),  j = 1,...,n-1,
    w_0^{k+1} = \alpha^{k+1},  w_n^{k+1} = \beta^{k+1}.

Bringing the k+1 terms to the left-hand side yields

    -\mu w_{j-1}^{k+1} + (1 + 2\mu) w_j^{k+1} - \mu w_{j+1}^{k+1} = w_j^k,  for j = 2,...,n-2,
    (1 + 2\mu) w_1^{k+1} - \mu w_2^{k+1} = w_1^k + \mu \alpha^{k+1},  for j = 1,
    -\mu w_{n-2}^{k+1} + (1 + 2\mu) w_{n-1}^{k+1} = w_{n-1}^k + \mu \beta^{k+1},  for j = n-1.

In matrix form,

    \begin{pmatrix} 1+2\mu & -\mu & & \\ -\mu & 1+2\mu & -\mu & \\ & \ddots & \ddots & \ddots \\ & & -\mu & 1+2\mu \end{pmatrix} \begin{pmatrix} w_1^{k+1} \\ w_2^{k+1} \\ \vdots \\ w_{n-1}^{k+1} \end{pmatrix} = \begin{pmatrix} w_1^k + \mu\alpha^{k+1} \\ w_2^k \\ \vdots \\ w_{n-1}^k + \mu\beta^{k+1} \end{pmatrix},
i.e., at each time step we need to solve the linear system AX = b, where

    A = \begin{pmatrix} 1+2\mu & -\mu & & & \\ -\mu & 1+2\mu & -\mu & & \\ & \ddots & \ddots & \ddots & \\ & & -\mu & 1+2\mu & -\mu \\ & & & -\mu & 1+2\mu \end{pmatrix},  b = \begin{pmatrix} w_1^k + \mu\alpha^{k+1} \\ w_2^k \\ \vdots \\ w_{n-2}^k \\ w_{n-1}^k + \mu\beta^{k+1} \end{pmatrix}.

The dimension of the system is (n-1) \times (n-1).

Note that the matrix A is sparse and tri-diagonal: it has many zeros. Gauss elimination requires only one row operation at a time and can be carried out very efficiently. However, for such matrices there are other numerical methods, namely iterative methods, that can be very competitive when the number of grid points is increased, especially for 2D problems.

Stability of the implicit scheme

Repeating the von Neumann analysis for the BTCS scheme (8.12) yields (the details are left as an exercise for the student)

    \rho = 1 + 2\mu\rho \left( \cos(l\Delta x) - 1 \right),  or  \rho \left( 1 + 2\mu (1 - \cos(l\Delta x)) \right) = 1,

so that

    \rho = \frac{1}{1 + 4\mu \sin^2(l\Delta x/2)} \le 1,  for all \mu \ge 0,

i.e., the implicit (BTCS) scheme is stable for all values of the time step and spatial grid size, provided \kappa \ge 0. The BTCS scheme is thus said to be unconditionally stable.

Exercise: Consider the one-way wave equation

    u_t + c u_x = 0

and a discretization of size \Delta t > 0 and \Delta x > 0. Use von Neumann's method illustrated above, assuming

    w_j^k = \rho^k e^{i l j \Delta x},  i^2 = -1,  j = 0,1,2,...,m,  l \in \mathbb{R} a wavenumber,  \mu = c\Delta t/\Delta x,

to show that:

- if c > 0, then the backward (upstream) scheme

    w_j^{k+1} = w_j^k - \mu ( w_j^k - w_{j-1}^k )

  is conditionally stable (under the condition 0 \le \mu \le 1), while the forward (downstream) scheme

    w_j^{k+1} = w_j^k - \mu ( w_{j+1}^k - w_j^k )

  and the centered scheme

    w_j^{k+1} = w_j^k - \frac{\mu}{2} ( w_{j+1}^k - w_{j-1}^k )

  are both unstable for all values of \mu \neq 0;
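Because A is tri-diagonal, the system AX = b can be solved in O(n) operations with Gauss elimination specialized to tri-diagonal matrices, often called the Thomas algorithm. A minimal sketch in Python (illustrative only; the Matlab code later in this section simply uses the backslash operator instead):

```python
def thomas(a, b, c, d):
    """Solve a tridiagonal system by forward elimination and back substitution.
    a: sub-diagonal (length n-1), b: main diagonal (length n),
    c: super-diagonal (length n-1), d: right-hand side (length n)."""
    n = len(b)
    cp, dp = [0.0] * n, [0.0] * n
    cp[0], dp[0] = c[0] / b[0], d[0] / b[0]
    for i in range(1, n):
        denom = b[i] - a[i - 1] * cp[i - 1]
        cp[i] = c[i] / denom if i < n - 1 else 0.0
        dp[i] = (d[i] - a[i - 1] * dp[i - 1]) / denom
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x

# One BTCS step: solve (1+2*mu)*w_j - mu*(w_{j-1} + w_{j+1}) = b_j
mu, n = 2.0, 6
b_diag = [1 + 2 * mu] * (n - 1)    # main diagonal of A
off = [-mu] * (n - 2)              # sub- and super-diagonals of A
rhs = [1.0] * (n - 1)              # stands in for w^k with BC terms folded in
w_new = thomas(off, b_diag, off, rhs)
```

The diagonal dominance of A (|1 + 2µ| > 2µ) guarantees the elimination never divides by zero, so no pivoting is needed for this system.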
- if c < 0, then the forward scheme (which is now upstream)

    w_j^{k+1} = w_j^k - \mu ( w_{j+1}^k - w_j^k )

  is conditionally stable (under the condition -1 \le \mu \le 0), while the backward (downstream) scheme

    w_j^{k+1} = w_j^k - \mu ( w_j^k - w_{j-1}^k )

  and the centered scheme

    w_j^{k+1} = w_j^k - \frac{\mu}{2} ( w_{j+1}^k - w_{j-1}^k )

  are both unstable for all values of \mu \neq 0.

Other boundary conditions

Note that the matrix system above depends strongly on the boundary conditions. If boundary conditions other than Dirichlet are used, both the matrix A and the right-hand side vector b change significantly. Here we give the matrix and right-hand side vector corresponding to periodic and Neumann boundary conditions. The details are left as an exercise for the student.

Periodic boundary conditions:

    u_t = \kappa u_{xx},  x \in [0,L],  u(x,0) = u_0(x),  u(x,t) = u(x+L,t)

translates to

    w_j^{k+1} = w_j^k + \mu \left( w_{j+1}^{k+1} - 2 w_j^{k+1} + w_{j-1}^{k+1} \right),  j = 0,...,n-1,
    w_n^{k+1} = w_0^{k+1}  and  w_{-1}^{k+1} = w_{n-1}^{k+1},

which yields the matrix system AX = b for w_j^{k+1}, j = 0,1,2,...,n-1, with

    A = \begin{pmatrix} 1+2\mu & -\mu & & & -\mu \\ -\mu & 1+2\mu & -\mu & & \\ & \ddots & \ddots & \ddots & \\ & & -\mu & 1+2\mu & -\mu \\ -\mu & & & -\mu & 1+2\mu \end{pmatrix},  b = \begin{pmatrix} w_0^k \\ w_1^k \\ \vdots \\ w_{n-1}^k \end{pmatrix}.

Note the dimension of the system is n \times n.

Neumann boundary conditions:

    u_t = \kappa u_{xx},  x \in [0,L],  u(x,0) = u_0(x),  u_x(0,t) = \alpha(t),  u_x(L,t) = \beta(t).
Writing the boundary conditions in terms of centered finite differences,

    u_x(0,t_k) \approx \frac{u_1^k - u_{-1}^k}{2\Delta x}  and  u_x(L,t_k) \approx \frac{u_{n+1}^k - u_{n-1}^k}{2\Delta x},

where values u_{-1}^k, u_{n+1}^k at fictitious or ghost grid points, x_{n+1} = x_n + \Delta x and x_{-1} = x_0 - \Delta x, outside the domain were introduced. This leads to

    u_{-1}^k \approx u_1^k - 2\Delta x\, \alpha^k  and  u_{n+1}^k \approx u_{n-1}^k + 2\Delta x\, \beta^k,

which yields

    w_j^{k+1} = w_j^k + \mu \left( w_{j+1}^{k+1} - 2 w_j^{k+1} + w_{j-1}^{k+1} \right),  j = 0,...,n,
    w_{-1}^{k+1} = w_1^{k+1} - 2\Delta x\, \alpha^{k+1}  and  w_{n+1}^{k+1} = w_{n-1}^{k+1} + 2\Delta x\, \beta^{k+1}.

Thus, we obtain the system AX = b, where

    A = \begin{pmatrix} 1+2\mu & -2\mu & & & \\ -\mu & 1+2\mu & -\mu & & \\ & \ddots & \ddots & \ddots & \\ & & -\mu & 1+2\mu & -\mu \\ & & & -2\mu & 1+2\mu \end{pmatrix}  and  b = \begin{pmatrix} w_0^k - 2\mu\Delta x\, \alpha^{k+1} \\ w_1^k \\ \vdots \\ w_{n-1}^k \\ w_n^k + 2\mu\Delta x\, \beta^{k+1} \end{pmatrix}.

Now A has dimension (n+1) \times (n+1). (The details of this last derivation are left as an exercise for the student.)

Numerical implementation: Here we use Matlab to solve the heat equation with Dirichlet boundary conditions of the example seen above, using the implicit BTCS scheme (8.12) instead. The Matlab code and the results (plot) are given below. Note that an alternative way to write the matrix A is A = I - \mu B, where B is the tri-diagonal matrix with -2's on the main diagonal and 1's on the upper and lower diagonals. This is used below to write the Matlab code.

%%% backward in time centered in space scheme for the heat equation
%%% (implicit scheme)
%Grid
m=200; dx = 1/m;
dt = dx; n = 20; T = n*dt;
K=2;
mu = K*dt/(dx^2);
%initial condition
w = zeros(m+1,1);
% w(j+1) here stands for w_j, j = 0,1,...,m
w(2:m) = sin((1:m-1)*pi/m);
%
%Dirichlet boundary conditions at t=0
w(1) = 0; w(m+1) = 0;
%matrix
B = zeros(m-1,m-1);
for I=1:m-1
    B(I,I) = -2;
end
for I=1:m-2
    B(I,I+1) = 1;
    B(I+1,I) = 1;
end
A = eye(m-1) - mu*B;
X0 = w(2:m);
Fk = zeros(m-1,1);
for k = 0 : n-1
    Fk(1)=0; Fk(m-1)=0;
    X1 = A\(X0+Fk);
    X0 = X1;
end
w(2:m) = X0;
figure
x = 0:dx:1;
%subplot(2,2,1)
plot(x,sin(pi*x)) % initial solution
hold on
plot(x,exp(-K*pi^2*T)*sin(pi*x),'linewidth',2) % exact solution at time T
%legend('initial solution','exact sol at t=T')
%subplot(2,2,2)
plot(x,w(1:m+1),'o--','linewidth',2) % numerical solution at time T
legend('initial solution','exact sol at t=T','numerical sol. at t=T')
title(['Heat eqn: backward in time centered in space, ',...
    '\mu = ',num2str(floor(mu*100)/100)],'fontsize',14)

Results with \mu = 1 and \mu = 400: to achieve a value \mu = 400 we used a small grid size \Delta x instead of increasing \Delta t, which would have led to (stable but) inaccurate results. Interestingly, even the case with \mu = 400 yields good and accurate results, which is consistent with the unconditional stability of the implicit scheme.

8.4 Consistency, order of accuracy, and convergence

A numerical scheme F_n(w^{k+1}, w^k, w^{k-1}, ...) = 0 for a given PDE F(u, u_t, u_x, u_{xx}, u_{tt}, u_{tx}) = 0 in the variables (x,t), where F_n is the numerical approximation of F for a given time step \Delta t and spatial grid size \Delta x, is said to be consistent of order (p,q), p,q > 0, if the difference between the numerical scheme and the PDE satisfies

    F(u, u_t, u_x, u_{xx}, u_{tt}, u_{tx}) - F_n(w^{k+1}, w^k, w^{k-1}, ...) = O(\Delta x^p) + O(\Delta t^q),

which tends to zero when \Delta x + \Delta t \to 0. The error, denoted by

    \tau_h = F(u, u_t, u_x, u_{xx}, u_{tt}, u_{tx}) - F_n(w^{k+1}, w^k, w^{k-1}, ...),

is called the truncation error. If the truncation error tends to zero when \Delta x + \Delta t \to 0, regardless of the order of convergence, the numerical scheme is said to be consistent.

Examples: The FTCS (8.11) and BTCS schemes for the heat equation are consistent of order (2,1). For the FTCS scheme we have

    \left( u_t(x_j,t_k) - \kappa u_{xx}(x_j,t_k) \right) - \left( \frac{u_j^{k+1} - u_j^k}{\Delta t} - \kappa \frac{u_{j+1}^k - 2 u_j^k + u_{j-1}^k}{\Delta x^2} \right) = O(\Delta x^2) + O(\Delta t),

according to the forward and centered differencing formulas for the time derivative and the second-order spatial derivative. Do the same for the BTCS scheme.

Exercise:
[Figure: top panel, "Heat eqn: forward in time centered in space, µ = 1"; bottom panel, "Heat eqn: backward in time centered in space, µ = 400"; each showing the initial solution, the exact solution at t = T, and the numerical solution at t = T.]
Show that the upstream and downstream schemes for the one-way wave equation are consistent of order (1,1), while the centered scheme for the one-way wave equation is consistent of order (2,1).

Remark: Convergence of the truncation error to zero does not imply that the numerical solution converges to the exact solution. For the FTCS scheme for the heat equation, for example, we have \tau_h = O(\Delta x^2) + O(\Delta t). In principle \tau_h \to 0 when \Delta t, \Delta x \to 0 if the scheme is consistent, and one might therefore expect the numerical solution w_j^k to converge to the exact solution u_j^k. However, this is not always true in practice; as you may have guessed, we also need stability of the scheme. We have the following celebrated theorem due to Peter Lax (a recent winner of the Abel Prize, the equivalent of the Nobel Prize, for his work on numerical solutions of PDEs).

Lax equivalence theorem: A numerical scheme converges, i.e., the numerical solution converges to the analytic solution of the original PDE, if and only if the scheme is both stable and consistent. Schematically,

    Convergence \iff Consistency + Stability.

8.5 Crank-Nicolson method

If we combine the explicit and implicit schemes (8.11) and (8.12), respectively, we obtain the so-called Crank-Nicolson scheme, which is both unconditionally stable and second-order accurate in both time and space. Precisely, the Crank-Nicolson scheme for the heat equation is obtained by taking the arithmetic average of the schemes (8.11) and (8.12), to yield

    w_j^{k+1} = w_j^k + \frac{\mu}{2} \left( w_{j+1}^k - 2 w_j^k + w_{j-1}^k + w_{j+1}^{k+1} - 2 w_j^{k+1} + w_{j-1}^{k+1} \right),   (8.13)

with \mu = \kappa \Delta t / \Delta x^2.

Exercise: Use von Neumann analysis to show that the Crank-Nicolson scheme is unconditionally stable.

Proving that the Crank-Nicolson scheme is second-order accurate in both time and space, i.e., that its truncation error satisfies \tau_h = O(\Delta t^2) + O(\Delta x^2), is a little tricky.
But this can be seen intuitively by interpreting the time-derivative finite difference (u^{k+1} - u^k)/\Delta t as a centered (second-order) approximation of u_t at the half time t_{k+1/2} = t_k + \Delta t/2.
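The unconditional stability asserted above can be sanity-checked numerically. Applying the von Neumann ansatz to (8.13), one finds the amplification factor ρ = (1 − 2µs)/(1 + 2µs) with s = sin²(lΔx/2) ≥ 0 (working the algebra out is the exercise; the sketch below, in Python and not part of the notes, only scans this formula over modes and µ values).

```python
import math

def rho_cn(mu, theta):
    """Amplification factor of the Crank-Nicolson scheme for u_t = kappa*u_xx,
    with mu = kappa*dt/dx^2 and theta = l*dx the phase angle of the mode."""
    s = math.sin(theta / 2) ** 2
    return (1 - 2 * mu * s) / (1 + 2 * mu * s)

# |rho| <= 1 for every mu >= 0 and every mode: unconditional stability
worst = max(abs(rho_cn(mu, k * math.pi / 50))
            for mu in (0.1, 0.5, 1.0, 10.0, 400.0)
            for k in range(51))
print(worst)  # prints 1.0, attained only by the constant mode theta = 0
```

The same scan applied to the FTCS factor 1 − 4µs would immediately expose values below −1 as soon as µ exceeds 1/2, recovering the conditional stability result of the previous section.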
Assuming Dirichlet BCs, the C-N scheme can be implemented in the same fashion as the implicit scheme, i.e., by solving a linear system at each iteration. First, the scheme (8.13) is rewritten as

    w_j^{k+1} - \frac{\mu}{2} \left( w_{j+1}^{k+1} - 2 w_j^{k+1} + w_{j-1}^{k+1} \right) = w_j^k + \frac{\mu}{2} \left( w_{j+1}^k - 2 w_j^k + w_{j-1}^k \right),

which in matrix form becomes

    \left( I + \frac{\mu}{2} B \right) W^{k+1} = \left( I - \frac{\mu}{2} B \right) W^k + BC,

where

    B = \begin{pmatrix} 2 & -1 & & \\ -1 & 2 & -1 & \\ & \ddots & \ddots & \ddots \\ & & -1 & 2 \end{pmatrix}  and  BC = \frac{\mu}{2} \begin{pmatrix} \alpha^k + \alpha^{k+1} \\ 0 \\ \vdots \\ 0 \\ \beta^k + \beta^{k+1} \end{pmatrix}.

Reading homework: Application to the Black-Scholes equation. Read chapter 9 from the book and write a short report.
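For completeness, the whole Crank-Nicolson iteration can be sketched in a few lines. The Python sketch below (illustrative parameters, not part of the notes) solves u_t = κu_xx with u_0 = sin(πx) and homogeneous Dirichlet data, so the BC vector vanishes; a tri-diagonal solve is performed at each step, and the result is checked against the exact solution e^{-κπ²T} sin(πx). The error stays small even for a ratio µ far above the explicit scheme's limit of 0.5.

```python
import math

def tri_solve(a, b, c, d):
    """Solve a tridiagonal system: a sub-, b main, c super-diagonal, d rhs."""
    n = len(b)
    cp, dp = [0.0] * n, [0.0] * n
    cp[0], dp[0] = c[0] / b[0], d[0] / b[0]
    for i in range(1, n):
        den = b[i] - a[i - 1] * cp[i - 1]
        cp[i] = (c[i] / den) if i < n - 1 else 0.0
        dp[i] = (d[i] - a[i - 1] * dp[i - 1]) / den
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x

def crank_nicolson(m=20, mu=5.0, nsteps=40, kappa=1.0):
    """Crank-Nicolson for u_t = kappa*u_xx on [0,1], u(0,t)=u(1,t)=0,
    u(x,0)=sin(pi*x).  Returns the max error against the exact solution."""
    dx = 1.0 / m
    dt = mu * dx * dx / kappa
    w = [math.sin(math.pi * j * dx) for j in range(1, m)]  # interior points
    n = m - 1
    # left-hand matrix (I + mu/2 * B): diagonal 1+mu, off-diagonals -mu/2
    a = [-mu / 2] * (n - 1)
    b = [1 + mu] * n
    c = [-mu / 2] * (n - 1)
    for _ in range(nsteps):
        # right-hand side (I - mu/2 * B) w^k, with zero boundary values
        rhs = [(1 - mu) * w[j]
               + (mu / 2) * ((w[j - 1] if j > 0 else 0.0)
                             + (w[j + 1] if j < n - 1 else 0.0))
               for j in range(n)]
        w = tri_solve(a, b, c, rhs)
    T = nsteps * dt
    return max(abs(w[j - 1] - math.exp(-kappa * math.pi ** 2 * T)
                   * math.sin(math.pi * j * dx)) for j in range(1, m))

err = crank_nicolson()
print(err)  # small despite mu = 5 >> 0.5
```

The comparison with the BTCS code above is instructive: the only change is that half of the diffusion operator acts on the known time level, which is what raises the temporal accuracy from first to second order.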
(Refer Slide Time: 01:11-01:27)
Digital Signal Processing Prof. S. C. Dutta Roy Department of Electrical Engineering Indian Institute of Technology, Delhi Lecture - 6 Digital systems (contd.); inverse systems, stability, FIR and IIR,
7 Gaussian Elimination and LU Factorization
7 Gaussian Elimination and LU Factorization In this final section on matrix factorization methods for solving Ax = b we want to take a closer look at Gaussian elimination (probably the best known method
Algebra 2 Chapter 1 Vocabulary. identity - A statement that equates two equivalent expressions.
Chapter 1 Vocabulary identity - A statement that equates two equivalent expressions. verbal model- A word equation that represents a real-life problem. algebraic expression - An expression with variables.
South Carolina College- and Career-Ready (SCCCR) Pre-Calculus
South Carolina College- and Career-Ready (SCCCR) Pre-Calculus Key Concepts Arithmetic with Polynomials and Rational Expressions PC.AAPR.2 PC.AAPR.3 PC.AAPR.4 PC.AAPR.5 PC.AAPR.6 PC.AAPR.7 Standards Know
Linear Algebra Notes for Marsden and Tromba Vector Calculus
Linear Algebra Notes for Marsden and Tromba Vector Calculus n-dimensional Euclidean Space and Matrices Definition of n space As was learned in Math b, a point in Euclidean three space can be thought of
6 EXTENDING ALGEBRA. 6.0 Introduction. 6.1 The cubic equation. Objectives
6 EXTENDING ALGEBRA Chapter 6 Extending Algebra Objectives After studying this chapter you should understand techniques whereby equations of cubic degree and higher can be solved; be able to factorise
Lecture 3: Finding integer solutions to systems of linear equations
Lecture 3: Finding integer solutions to systems of linear equations Algorithmic Number Theory (Fall 2014) Rutgers University Swastik Kopparty Scribe: Abhishek Bhrushundi 1 Overview The goal of this lecture
COLLEGE ALGEBRA. Paul Dawkins
COLLEGE ALGEBRA Paul Dawkins Table of Contents Preface... iii Outline... iv Preliminaries... Introduction... Integer Exponents... Rational Exponents... 9 Real Exponents...5 Radicals...6 Polynomials...5
1 Review of Least Squares Solutions to Overdetermined Systems
cs4: introduction to numerical analysis /9/0 Lecture 7: Rectangular Systems and Numerical Integration Instructor: Professor Amos Ron Scribes: Mark Cowlishaw, Nathanael Fillmore Review of Least Squares
FX 115 MS Training guide. FX 115 MS Calculator. Applicable activities. Quick Reference Guide (inside the calculator cover)
Tools FX 115 MS Calculator Handouts Other materials Applicable activities Quick Reference Guide (inside the calculator cover) Key Points/ Overview Advanced scientific calculator Two line display VPAM to
Thnkwell s Homeschool Precalculus Course Lesson Plan: 36 weeks
Thnkwell s Homeschool Precalculus Course Lesson Plan: 36 weeks Welcome to Thinkwell s Homeschool Precalculus! We re thrilled that you ve decided to make us part of your homeschool curriculum. This lesson
December 4, 2013 MATH 171 BASIC LINEAR ALGEBRA B. KITCHENS
December 4, 2013 MATH 171 BASIC LINEAR ALGEBRA B KITCHENS The equation 1 Lines in two-dimensional space (1) 2x y = 3 describes a line in two-dimensional space The coefficients of x and y in the equation
MTH 437/537 Introduction to Numerical Analysis I Fall 2015
MTH 437/537 Introduction to Numerical Analysis I Fall 2015 Times, places Class: 437/537 MWF 10:00am 10:50pm Math 150 Lab: 437A Tu 8:00 8:50am Math 250 Instructor John Ringland. Office: Math Bldg 206. Phone:
CS3220 Lecture Notes: QR factorization and orthogonal transformations
CS3220 Lecture Notes: QR factorization and orthogonal transformations Steve Marschner Cornell University 11 March 2009 In this lecture I ll talk about orthogonal matrices and their properties, discuss
Some Lecture Notes and In-Class Examples for Pre-Calculus:
Some Lecture Notes and In-Class Examples for Pre-Calculus: Section.7 Definition of a Quadratic Inequality A quadratic inequality is any inequality that can be put in one of the forms ax + bx + c < 0 ax
15.062 Data Mining: Algorithms and Applications Matrix Math Review
.6 Data Mining: Algorithms and Applications Matrix Math Review The purpose of this document is to give a brief review of selected linear algebra concepts that will be useful for the course and to develop
Find all of the real numbers x that satisfy the algebraic equation:
Appendix C: Factoring Algebraic Expressions Factoring algebraic equations is the reverse of expanding algebraic expressions discussed in Appendix B. Factoring algebraic equations can be a great help when
AP Physics 1 and 2 Lab Investigations
AP Physics 1 and 2 Lab Investigations Student Guide to Data Analysis New York, NY. College Board, Advanced Placement, Advanced Placement Program, AP, AP Central, and the acorn logo are registered trademarks
13. Write the decimal approximation of 9,000,001 9,000,000, rounded to three significant
æ If 3 + 4 = x, then x = 2 gold bar is a rectangular solid measuring 2 3 4 It is melted down, and three equal cubes are constructed from this gold What is the length of a side of each cube? 3 What is the
1 VECTOR SPACES AND SUBSPACES
1 VECTOR SPACES AND SUBSPACES What is a vector? Many are familiar with the concept of a vector as: Something which has magnitude and direction. an ordered pair or triple. a description for quantities such
1 Determinants and the Solvability of Linear Systems
1 Determinants and the Solvability of Linear Systems In the last section we learned how to use Gaussian elimination to solve linear systems of n equations in n unknowns The section completely side-stepped
Approximating functions by Taylor Polynomials.
Chapter 4 Approximating functions by Taylor Polynomials. 4.1 Linear Approximations We have already seen how to approximate a function using its tangent line. This was the key idea in Euler s method. If
Continued Fractions and the Euclidean Algorithm
Continued Fractions and the Euclidean Algorithm Lecture notes prepared for MATH 326, Spring 997 Department of Mathematics and Statistics University at Albany William F Hammond Table of Contents Introduction
FIRST YEAR CALCULUS. Chapter 7 CONTINUITY. It is a parabola, and we can draw this parabola without lifting our pencil from the paper.
FIRST YEAR CALCULUS WWLCHENW L c WWWL W L Chen, 1982, 2008. 2006. This chapter originates from material used by the author at Imperial College, University of London, between 1981 and 1990. It It is is
Application. Outline. 3-1 Polynomial Functions 3-2 Finding Rational Zeros of. Polynomial. 3-3 Approximating Real Zeros of.
Polynomial and Rational Functions Outline 3-1 Polynomial Functions 3-2 Finding Rational Zeros of Polynomials 3-3 Approximating Real Zeros of Polynomials 3-4 Rational Functions Chapter 3 Group Activity:
7. LU factorization. factor-solve method. LU factorization. solving Ax = b with A nonsingular. the inverse of a nonsingular matrix
7. LU factorization EE103 (Fall 2011-12) factor-solve method LU factorization solving Ax = b with A nonsingular the inverse of a nonsingular matrix LU factorization algorithm effect of rounding error sparse
Section 1.1. Introduction to R n
The Calculus of Functions of Several Variables Section. Introduction to R n Calculus is the study of functional relationships and how related quantities change with each other. In your first exposure to
LAYOUT OF THE KEYBOARD
Dr. Charles Hofmann, LaSalle [email protected] Dr. Roseanne Hofmann, MCCC [email protected] ------------------------------------------------------------------------------------------------- DISPLAY CONTRAST
CHAPTER 1 Splines and B-splines an Introduction
CHAPTER 1 Splines and B-splines an Introduction In this first chapter, we consider the following fundamental problem: Given a set of points in the plane, determine a smooth curve that approximates the
3.2 The Factor Theorem and The Remainder Theorem
3. The Factor Theorem and The Remainder Theorem 57 3. The Factor Theorem and The Remainder Theorem Suppose we wish to find the zeros of f(x) = x 3 + 4x 5x 4. Setting f(x) = 0 results in the polynomial
MATH 21. College Algebra 1 Lecture Notes
MATH 21 College Algebra 1 Lecture Notes MATH 21 3.6 Factoring Review College Algebra 1 Factoring and Foiling 1. (a + b) 2 = a 2 + 2ab + b 2. 2. (a b) 2 = a 2 2ab + b 2. 3. (a + b)(a b) = a 2 b 2. 4. (a
1 Lecture: Integration of rational functions by decomposition
Lecture: Integration of rational functions by decomposition into partial fractions Recognize and integrate basic rational functions, except when the denominator is a power of an irreducible quadratic.
Review of Fundamental Mathematics
Review of Fundamental Mathematics As explained in the Preface and in Chapter 1 of your textbook, managerial economics applies microeconomic theory to business decision making. The decision-making tools
ECE 0142 Computer Organization. Lecture 3 Floating Point Representations
ECE 0142 Computer Organization Lecture 3 Floating Point Representations 1 Floating-point arithmetic We often incur floating-point programming. Floating point greatly simplifies working with large (e.g.,
CHAPTER 8 FACTOR EXTRACTION BY MATRIX FACTORING TECHNIQUES. From Exploratory Factor Analysis Ledyard R Tucker and Robert C.
CHAPTER 8 FACTOR EXTRACTION BY MATRIX FACTORING TECHNIQUES From Exploratory Factor Analysis Ledyard R Tucker and Robert C MacCallum 1997 180 CHAPTER 8 FACTOR EXTRACTION BY MATRIX FACTORING TECHNIQUES In
Algebra I Vocabulary Cards
Algebra I Vocabulary Cards Table of Contents Expressions and Operations Natural Numbers Whole Numbers Integers Rational Numbers Irrational Numbers Real Numbers Absolute Value Order of Operations Expression
The degree of a polynomial function is equal to the highest exponent found on the independent variables.
DETAILED SOLUTIONS AND CONCEPTS - POLYNOMIAL FUNCTIONS Prepared by Ingrid Stewart, Ph.D., College of Southern Nevada Please Send Questions and Comments to [email protected]. Thank you! PLEASE NOTE
Engineering Problem Solving and Excel. EGN 1006 Introduction to Engineering
Engineering Problem Solving and Excel EGN 1006 Introduction to Engineering Mathematical Solution Procedures Commonly Used in Engineering Analysis Data Analysis Techniques (Statistics) Curve Fitting techniques
9.2 Summation Notation
9. Summation Notation 66 9. Summation Notation In the previous section, we introduced sequences and now we shall present notation and theorems concerning the sum of terms of a sequence. We begin with a
Binary Number System. 16. Binary Numbers. Base 10 digits: 0 1 2 3 4 5 6 7 8 9. Base 2 digits: 0 1
Binary Number System 1 Base 10 digits: 0 1 2 3 4 5 6 7 8 9 Base 2 digits: 0 1 Recall that in base 10, the digits of a number are just coefficients of powers of the base (10): 417 = 4 * 10 2 + 1 * 10 1
Vocabulary Words and Definitions for Algebra
Name: Period: Vocabulary Words and s for Algebra Absolute Value Additive Inverse Algebraic Expression Ascending Order Associative Property Axis of Symmetry Base Binomial Coefficient Combine Like Terms
Basics of Polynomial Theory
3 Basics of Polynomial Theory 3.1 Polynomial Equations In geodesy and geoinformatics, most observations are related to unknowns parameters through equations of algebraic (polynomial) type. In cases where
Principles of Scientific Computing. David Bindel and Jonathan Goodman
Principles of Scientific Computing David Bindel and Jonathan Goodman last revised February 2009, last printed March 6, 2009 2 Preface i ii PREFACE This book grew out of a one semester first course in Scientific
Using row reduction to calculate the inverse and the determinant of a square matrix
Using row reduction to calculate the inverse and the determinant of a square matrix Notes for MATH 0290 Honors by Prof. Anna Vainchtein 1 Inverse of a square matrix An n n square matrix A is called invertible
Linear Threshold Units
Linear Threshold Units w x hx (... w n x n w We assume that each feature x j and each weight w j is a real number (we will relax this later) We will study three different algorithms for learning linear
1 Sets and Set Notation.
LINEAR ALGEBRA MATH 27.6 SPRING 23 (COHEN) LECTURE NOTES Sets and Set Notation. Definition (Naive Definition of a Set). A set is any collection of objects, called the elements of that set. We will most
Solving Systems of Linear Equations
LECTURE 5 Solving Systems of Linear Equations Recall that we introduced the notion of matrices as a way of standardizing the expression of systems of linear equations In today s lecture I shall show how
(Quasi-)Newton methods
(Quasi-)Newton methods 1 Introduction 1.1 Newton method Newton method is a method to find the zeros of a differentiable non-linear function g, x such that g(x) = 0, where g : R n R n. Given a starting
INTERPOLATION. Interpolation is a process of finding a formula (often a polynomial) whose graph will pass through a given set of points (x, y).
INTERPOLATION Interpolation is a process of finding a formula (often a polynomial) whose graph will pass through a given set of points (x, y). As an example, consider defining and x 0 =0, x 1 = π 4, x
the points are called control points approximating curve
Chapter 4 Spline Curves A spline curve is a mathematical representation for which it is easy to build an interface that will allow a user to design and control the shape of complex curves and surfaces.
Integrals of Rational Functions
Integrals of Rational Functions Scott R. Fulton Overview A rational function has the form where p and q are polynomials. For example, r(x) = p(x) q(x) f(x) = x2 3 x 4 + 3, g(t) = t6 + 4t 2 3, 7t 5 + 3t
SECTION 2.5: FINDING ZEROS OF POLYNOMIAL FUNCTIONS
SECTION 2.5: FINDING ZEROS OF POLYNOMIAL FUNCTIONS Assume f ( x) is a nonconstant polynomial with real coefficients written in standard form. PART A: TECHNIQUES WE HAVE ALREADY SEEN Refer to: Notes 1.31
Zeros of Polynomial Functions
Zeros of Polynomial Functions Objectives: 1.Use the Fundamental Theorem of Algebra to determine the number of zeros of polynomial functions 2.Find rational zeros of polynomial functions 3.Find conjugate
1 if 1 x 0 1 if 0 x 1
Chapter 3 Continuity In this chapter we begin by defining the fundamental notion of continuity for real valued functions of a single real variable. When trying to decide whether a given function is or
MATH BOOK OF PROBLEMS SERIES. New from Pearson Custom Publishing!
MATH BOOK OF PROBLEMS SERIES New from Pearson Custom Publishing! The Math Book of Problems Series is a database of math problems for the following courses: Pre-algebra Algebra Pre-calculus Calculus Statistics
CHAPTER SIX IRREDUCIBILITY AND FACTORIZATION 1. BASIC DIVISIBILITY THEORY
January 10, 2010 CHAPTER SIX IRREDUCIBILITY AND FACTORIZATION 1. BASIC DIVISIBILITY THEORY The set of polynomials over a field F is a ring, whose structure shares with the ring of integers many characteristics.
CORRELATED TO THE SOUTH CAROLINA COLLEGE AND CAREER-READY FOUNDATIONS IN ALGEBRA
We Can Early Learning Curriculum PreK Grades 8 12 INSIDE ALGEBRA, GRADES 8 12 CORRELATED TO THE SOUTH CAROLINA COLLEGE AND CAREER-READY FOUNDATIONS IN ALGEBRA April 2016 www.voyagersopris.com Mathematical
PURSUITS IN MATHEMATICS often produce elementary functions as solutions that need to be
Fast Approximation of the Tangent, Hyperbolic Tangent, Exponential and Logarithmic Functions 2007 Ron Doerfler http://www.myreckonings.com June 27, 2007 Abstract There are some of us who enjoy using our
Mean value theorem, Taylors Theorem, Maxima and Minima.
MA 001 Preparatory Mathematics I. Complex numbers as ordered pairs. Argand s diagram. Triangle inequality. De Moivre s Theorem. Algebra: Quadratic equations and express-ions. Permutations and Combinations.
The Steepest Descent Algorithm for Unconstrained Optimization and a Bisection Line-search Method
The Steepest Descent Algorithm for Unconstrained Optimization and a Bisection Line-search Method Robert M. Freund February, 004 004 Massachusetts Institute of Technology. 1 1 The Algorithm The problem
Zeros of Polynomial Functions
Zeros of Polynomial Functions The Rational Zero Theorem If f (x) = a n x n + a n-1 x n-1 + + a 1 x + a 0 has integer coefficients and p/q (where p/q is reduced) is a rational zero, then p is a factor of
Numerical Analysis. Gordon K. Smyth in. Encyclopedia of Biostatistics (ISBN 0471 975761) Edited by. Peter Armitage and Theodore Colton
Numerical Analysis Gordon K. Smyth in Encyclopedia of Biostatistics (ISBN 0471 975761) Edited by Peter Armitage and Theodore Colton John Wiley & Sons, Ltd, Chichester, 1998 Numerical Analysis Numerical
Information Theory and Coding Prof. S. N. Merchant Department of Electrical Engineering Indian Institute of Technology, Bombay
Information Theory and Coding Prof. S. N. Merchant Department of Electrical Engineering Indian Institute of Technology, Bombay Lecture - 17 Shannon-Fano-Elias Coding and Introduction to Arithmetic Coding
