Sales forecasting # 1
1 Sales forecasting # 1 Arthur Charpentier arthur.charpentier@univ-rennes1.fr 1
2 Agenda Qualitative and quantitative methods, a very general introduction Series decomposition Short versus long term forecasting Regression techniques Regression and econometric methods Box & Jenkins ARIMA time series method Forecasting with ARIMA series Practical issues: forecasting with MSExcel 2
3 Some references Major reference for this short course: Pindyck, R.S. & Rubinfeld, D.L. (1997). Econometric Models and Economic Forecasts. McGraw-Hill. A forecast is a quantitative estimate about the likelihood of future events, developed on the basis of past and current information. 3
4 Forecasting challenges? "With over 50 foreign cars already on sale here, the Japanese auto industry isn't likely to carve out a big slice of the U.S. market." - Business Week, 1958. "I think there is a world market for maybe five computers." - Thomas J. Watson, 1943, Chairman of the Board of IBM. "640K ought to be enough for anybody." - Bill Gates, 1981. "Stocks have reached what looks like a permanently high plateau." - Irving Fisher, Professor of Economics, Yale University, October 16,
5 Challenge: use MSExcel (only) to build a forecast model MSExcel is not a statistical software. Dedicated packages can be used instead, e.g. SAS, Gauss, RATS, EViews, S-Plus, or more recently R (a free statistical software). 5
6 Macro versus micro? Macroeconomic Forecasting is related to the prediction of aggregate economic behavior, e.g. GDP, Unemployment, Interest Rates, Exports, Imports, Government Spending, etc. It is a very difficult exercise, which appears frequently in the media. 6
7 Figure 1: Economic growth forecasts (from American Express, University of North Carolina, Goldman Sachs, PNC Financial, Kudlow & Co.), Wall Street Journal, Sept. 12, 2002, for Q4 2002 and later quarters. 7
8 Macro versus micro? Microeconomic Forecasting is related to the prediction of firm sales, industry sales, product sales, prices, costs... It is usually more accurate, and more directly applicable for business managers... The problem is that human behavior is not always rational: there is always unpredictable uncertainty. 8
9 Short versus long term? Figure 2: Forecasting a time series, with different models. 9
10 Short versus long term? Figure 3: Forecasting a time series, with different models. 10
11 Short versus long term? Figure 4: Forecasting financial time series (level of the Nasdaq index, and daily log returns). 11
12 Series decomposition Decomposition assumes that the data consist of data = pattern + error, where the pattern is made of trend, cycle, and seasonality. The general representation is X_t = f(S_t, D_t, C_t, ε_t), where X_t denotes the time series value at time t, S_t the seasonal component at time t (seasonal effect), D_t the trend component at time t (secular trend), C_t the cycle component at time t (cyclical variation), and ε_t the error component at time t (random fluctuations). 12
13 Series decomposition The secular trends are long-run trends that cause changes in an economic data series; three different patterns can be distinguished: linear trend, Ŷ_t = α + βt; constant rate of growth trend, Ŷ_t = Y_0(1 + γ)^t; declining rate of growth trend, Ŷ_t = exp(α − β/t). For the linear trend, adjustments can be made, for instance by introducing breaks. For the constant rate of growth trend, note that log Ŷ_t = log Y_0 + t log(1 + γ), which is a linear model in the logarithm of the series. 13
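As a small illustration of the point above (a Python sketch with made-up numbers, rather than the Excel workflow used in the course): a constant-rate-of-growth trend can be fitted by running ordinary least squares on the logarithm of the series, and the growth rate γ is recovered from the slope.

```python
import math

# Simple OLS for one regressor: returns (intercept, slope).
def ols(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b = sxy / sxx
    return ybar - b * xbar, b

# Noise-free illustrative series growing 5% per period from Y_0 = 100.
t = list(range(1, 21))
y = [100 * 1.05 ** ti for ti in t]

# Constant-rate-of-growth trend: regress log(Y_t) on t.
a, b = ols(t, [math.log(yi) for yi in y])
gamma_hat = math.exp(b) - 1   # recovered growth rate, ~0.05
y0_hat = math.exp(a)          # recovered starting level, ~100
```

Because the example series is noise-free, the fitted slope equals log(1 + γ) exactly, so exponentiating recovers the growth rate and starting level.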
14 Series decomposition For those two models, standard regression techniques can be used. For the declining rate of growth trend, log Ŷ_t = α − β/t, which is sometimes called a semilog regression model. The cyclical variations are major expansions and contractions in an economic series that are usually longer than a year in duration. The seasonal effects cause variation during a year that tends to be more or less consistent from year to year. From an econometric point of view, a seasonal effect is captured using dummy variables. E.g. for quarterly data, Ŷ_t = α + βt + γ_1 1_{1,t} + γ_2 1_{2,t} + γ_3 1_{3,t} + γ_4 1_{4,t}, where 1_{i,t} is an indicator series, equal to 1 when t is in the i-th quarter, and 0 if not. The random fluctuations cannot be predicted. 14
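A hedged sketch of the seasonal-dummy regression just described, with illustrative numbers (not from the course). Note one practical detail: with an intercept, only three of the four quarterly dummies can be included, otherwise the columns are perfectly collinear (the "dummy variable trap"); here the fourth-quarter effect is absorbed into the intercept.

```python
def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting (small systems)."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(M[r][i]))
        M[i], M[p] = M[p], M[i]
        piv = M[i][i]
        M[i] = [v / piv for v in M[i]]
        for r in range(n):
            if r != i:
                f = M[r][i]
                M[r] = [a - f * c for a, c in zip(M[r], M[i])]
    return [M[i][n] for i in range(n)]

season = {1: 5.0, 2: -3.0, 3: 1.0, 4: 0.0}     # illustrative effects, Q4 = baseline
ts = list(range(1, 25))                        # 6 years of quarterly data
y = [10.0 + 2.0 * t + season[(t - 1) % 4 + 1] for t in ts]

# Design matrix: intercept, trend t, and dummies for quarters 1-3.
X = [[1.0, t] + [1.0 if (t - 1) % 4 + 1 == q else 0.0 for q in (1, 2, 3)]
     for t in ts]

# Normal equations X'X beta = X'y, solved directly.
k = len(X[0])
XtX = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
alpha, beta, g1, g2, g3 = solve(XtX, Xty)
```

On this noise-free series the regression recovers the intercept, the trend slope, and the three seasonal effects exactly.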
15 Figure 5: Standard time series model, X t. 15
16 Figure 6: Standard time series model, the linear trend component. 16
17 Figure 7: Removing the linear trend component, X_t − D_t. 17
18 Figure 8: Standard time series model, detecting the cycle on X_t − D_t. 18
19 Figure 9: Standard time series model, X t. 19
20 Figure 10: Removing the linear trend and seasonal component, X_t − D_t − S_t. 20
21 Exogenous versus endogenous variables The model X_t = f(S_t, D_t, C_t, ε_t, Z_t) can also contain exogenous variables Z_t, so that S_t, the seasonal component, can be predicted, i.e. S_{T+1}, S_{T+2}, ..., S_{T+h}; D_t, the trend component, can be predicted, i.e. D_{T+1}, D_{T+2}, ..., D_{T+h}; C_t, the cycle component, can be predicted, i.e. C_{T+1}, C_{T+2}, ..., C_{T+h}; Z_t, the exogenous variables, can be predicted, i.e. Z_{T+1}, Z_{T+2}, ..., Z_{T+h}; but ε_t, the error component, cannot be predicted. 21
22 Exogenous versus endogenous variables As in classical regression models: try to find a model Y_i = X_i'β + ε_i with the highest predictive value. A classical idea in econometrics: compare Ŷ_i and Y_i, which should be as close as possible. E.g. minimize Σ_{i=1}^n (Y_i − Ŷ_i)², the sum of squared errors, which can be related to the R², the MSE, or the RMSE. When dealing with time series, it is possible to add an endogenous component. Endogenous variables are those that the model seeks to explain via the solution of the system of equations. The general model is then X_t = f(S_t, D_t, C_t, ε_t, Z_t, X_{t−1}, X_{t−2}, ..., Z_{t−1}, ..., ε_{t−1}, ...). 22
23 Comparing forecast models In order to evaluate the accuracy - or reliability - of forecasting models, the R² has been seen as a good measure in regression analysis, but the standard is the root mean square error (RMSE), i.e. RMSE = √[(1/n) Σ_{i=1}^n (Y_i − Ŷ_i)²], which is a good measure of the goodness of fit. The smaller the value of the RMSE, the greater the accuracy of a forecasting model. 23
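The RMSE formula above can be sketched directly (a Python illustration with toy numbers; the course does the same computation in Excel). Note the square root, which puts the measure back in the units of the series.

```python
import math

def rmse(y, yhat):
    """Root mean square error: sqrt of the average squared forecast error."""
    n = len(y)
    return math.sqrt(sum((yi - fi) ** 2 for yi, fi in zip(y, yhat)) / n)

# Toy example: two forecasts of the same series; the closer one has the smaller RMSE.
y    = [10.0, 12.0, 14.0, 16.0]
good = [10.5, 11.5, 14.5, 15.5]   # errors of +/- 0.5
bad  = [12.0, 10.0, 16.0, 14.0]   # errors of +/- 2.0
```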
24 Figure 11: Estimation period, ex-ante and ex-post forecasting periods. 24
25 Regression model Consider the following regression model, Y_i = X_i'β + ε_i. Call: lm(formula = weight ~ groupctl + grouptrt - 1) Residuals: Min 1Q Median 3Q Max Coefficients: Estimate Std. Error t value Pr(>|t|) groupctl e-15 *** grouptrt e-14 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: on 18 degrees of freedom Multiple R-Squared: , Adjusted R-squared: F-statistic: on 2 and 18 DF, p-value: < 2.2e-16 25
26 Least squares estimation Parameters are estimated using ordinary least squares, i.e. β̂ = (X'X)^{-1}X'Y, with E(β̂) = β. Figure 12: Least squares regression of distance on speed (cars data), Y = a + bX. 26
27 Least squares estimation Parameters are estimated using ordinary least squares, i.e. β̂ = (X'X)^{-1}X'Y, with E(β̂) = β. Figure 13: Least squares regression of speed on distance (cars data), X = c + dY. 27
28 Least squares estimation Assuming ε ∼ N(0, σ²), then V(β̂) = σ²(X'X)^{-1}. The variance of residuals σ² can be estimated using ε̂'ε̂/(n − k − 1). It is possible to test H_0: β_i = 0: the statistic β̂_i / (σ̂ √[(X'X)^{-1}]_{i,i}) has a Student t distribution under H_0, with n − k − 1 degrees of freedom. The p-value is the probability, under H_0, of observing a test statistic at least as extreme as the one obtained. The confidence interval for β_i can be obtained easily as [β̂_i − t_{n−k}(1 − α/2) σ̂ √[(X'X)^{-1}]_{i,i} ; β̂_i + t_{n−k}(1 − α/2) σ̂ √[(X'X)^{-1}]_{i,i}], where t_{n−k}(1 − α/2) stands for the (1 − α/2) quantile of the t distribution with n − k degrees of freedom. 28
29 Least squares estimation Figure: pairwise scatterplots of Area, Endemics, and Elevation. 29
30 Least squares estimation The R² is the squared correlation coefficient between the series {Y_1, ..., Y_n} and {Ŷ_1, ..., Ŷ_n}, where Ŷ_i = X_i'β̂. It can be interpreted as the ratio of the variance explained by the regression to the total variance. The adjusted R², denoted R̄², is defined as R̄² = 1 − [(n − 1)/(n − k − 1)](1 − R²). Assume that residuals are N(0, σ²); then Y ∼ N(Xβ, σ²I), and thus it is possible to use maximum likelihood techniques: log L(β, σ | X, Y) = −(n/2) log(2π) − (n/2) log(σ²) − (Y − Xβ)'(Y − Xβ)/(2σ²). The Akaike criterion (AIC) and Schwarz criterion (SBC) can be used to choose a model: AIC = −2 log L + 2k and SBC = −2 log L + k log n. 30
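The Gaussian log-likelihood and the AIC/SBC formulas above can be sketched as follows (a Python illustration on made-up data; k counts the estimated regression parameters). A model fitting x well should beat an intercept-only model on both criteria.

```python
import math

def gaussian_loglik(resid):
    """Maximized Gaussian log-likelihood, with sigma^2 = RSS/n (the MLE)."""
    n = len(resid)
    s2 = sum(e * e for e in resid) / n
    return -n / 2 * math.log(2 * math.pi) - n / 2 * math.log(s2) - n / 2

def aic(loglik, k):
    return -2 * loglik + 2 * k

def sbc(loglik, k, n):
    return -2 * loglik + k * math.log(n)

# Fixed toy data: y depends on x plus small deterministic perturbations.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
e = [0.1, -0.2, 0.15, -0.05, 0.1, -0.1, 0.05, -0.05]
y = [2.0 + 0.5 * xi + ei for xi, ei in zip(x, e)]
n = len(y)

# Model 1: regress y on x by OLS (2 parameters).
xbar, ybar = sum(x) / n, sum(y) / n
b = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
     / sum((xi - xbar) ** 2 for xi in x))
a = ybar - b * xbar
r1 = [yi - (a + b * xi) for xi, yi in zip(x, y)]
aic1, sbc1 = aic(gaussian_loglik(r1), 2), sbc(gaussian_loglik(r1), 2, n)

# Model 2: intercept only (1 parameter).
r2 = [yi - ybar for yi in y]
aic2, sbc2 = aic(gaussian_loglik(r2), 1), sbc(gaussian_loglik(r2), 1, n)
```

The model with the smaller criterion value is preferred; here model 1 wins under both AIC and SBC.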
31 Least squares estimation Fisher's statistic can be used to test globally the significance of the regression, i.e. H_0: β = 0, defined as F = [(n − k)/(k − 1)] · R²/(1 − R²). Additional tests can be run, e.g. to test normality of residuals, such as the Jarque-Bera statistic, defined as JB = (n/6) sk² + (n/24)[κ − 3]², where sk denotes the empirical skewness, and κ the empirical kurtosis. Under the assumption H_0 of normality, JB ∼ χ²(2). 31
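A minimal sketch of the Jarque-Bera statistic (in Python, on a tiny hand-picked sample): skewness and kurtosis are computed from centered sample moments, then combined as in the formula above.

```python
def jarque_bera(x):
    """JB = n/6 * sk^2 + n/24 * (kappa - 3)^2; chi^2(2) under normality."""
    n = len(x)
    m = sum(x) / n
    m2 = sum((xi - m) ** 2 for xi in x) / n
    m3 = sum((xi - m) ** 3 for xi in x) / n
    m4 = sum((xi - m) ** 4 for xi in x) / n
    sk = m3 / m2 ** 1.5      # empirical skewness (0 for a symmetric sample)
    kappa = m4 / m2 ** 2     # empirical kurtosis (3 for a normal distribution)
    return n / 6 * sk ** 2 + n / 24 * (kappa - 3) ** 2, sk, kappa

# Symmetric toy sample: skewness is exactly 0, kurtosis is 1.7.
jb, sk, kappa = jarque_bera([-2.0, -1.0, 0.0, 1.0, 2.0])
```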
32 Residuals in linear regression Figure: diagnostic plots for lm(Y ~ X1 + X2): residuals versus fitted values, and normal Q-Q plot of the standardized residuals. 32
33 Prediction in the linear model Given a new observation x_0, the predicted response is x_0'β̂. Note that the associated variance is Var(x_0'β̂) = σ² x_0'(X'X)^{-1}x_0. Since the future observation will be x_0'β + ε (where ε is unknown, and yields additional uncertainty), the confidence interval for this predicted value can be computed as [x_0'β̂ − t_{n−k}(1 − α/2) σ̂ √(1 + x_0'(X'X)^{-1}x_0) ; x_0'β̂ + t_{n−k}(1 − α/2) σ̂ √(1 + x_0'(X'X)^{-1}x_0)], where again t_{n−k}(1 − α/2) stands for the (1 − α/2) quantile of the t distribution with n − k degrees of freedom. Remark: this is rather different from the confidence interval for the mean response, given x_0, which is [x_0'β̂ − t_{n−k}(1 − α/2) σ̂ √(x_0'(X'X)^{-1}x_0) ; x_0'β̂ + t_{n−k}(1 − α/2) σ̂ √(x_0'(X'X)^{-1}x_0)]. 33
34 Prediction in the linear model Figure: confidence and prediction bands for the regression of distance on speed (cars data). 34
35 Regression, basics on statistical regression techniques Remark: statistical uncertainty and parameter uncertainty. Consider i.i.d. observations X_1, ..., X_n from a N(µ, σ²) distribution, where µ is unknown and should be estimated. Step 1: in case σ is known. The natural estimate of the unknown µ is µ̂ = (1/n) Σ_{i=1}^n X_i, and the 95% confidence interval is [µ̂ + u_{2.5%} σ/√n ; µ̂ + u_{97.5%} σ/√n], where u_{2.5%} ≈ −1.96 and u_{97.5%} ≈ +1.96. Both are quantiles of the N(0, 1) distribution. 35
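The known-σ confidence interval above can be sketched with the standard library (a Python illustration on a toy sample with an assumed, hypothetical σ); `statistics.NormalDist` supplies the N(0,1) quantile directly.

```python
from statistics import NormalDist

# 95% confidence interval for mu when sigma is known:
# muhat +/- u_{97.5%} * sigma / sqrt(n), with u_{97.5%} ~ 1.96.
u = NormalDist().inv_cdf(0.975)   # upper 2.5% quantile of N(0, 1)

x = [9.8, 10.2, 10.1, 9.9, 10.0, 10.3, 9.7, 10.0]   # toy sample
sigma = 0.2                                          # assumed known (illustrative)
n = len(x)
muhat = sum(x) / n
half = u * sigma / n ** 0.5
ci = (muhat - half, muhat + half)
```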
36 Regression, basics on statistical regression techniques Step 2: in case σ is unknown. The natural estimate of the unknown µ is still µ̂ = (1/n) Σ_{i=1}^n X_i, and the 95% confidence interval is [µ̂ + t_{2.5%} σ̂/√n ; µ̂ + t_{97.5%} σ̂/√n]. The following table gives values of t_{2.5%} and t_{97.5%} for different values of n. 36
37 Table 1: Quantiles of the t distribution for different values of n. This information is embodied in the form of a model - a single-equation structural model, a multi-equation model, or a time series model. By extrapolating the models beyond the period over which they are estimated, we get forecasts about future events. 37
38 Regression model for time series Consider the following regression model, Y_t = α + βX_t + ε_t, where ε_t ∼ N(0, σ²). Step 1: in case α and β are known. Given a known value X_{T+1}, and if α and β are known, then Ŷ_{T+1} = E(Y_{T+1}) = α + βX_{T+1}. This yields a forecast error, ε̂_{T+1} = Ŷ_{T+1} − Y_{T+1}. This error has two properties: the forecast is unbiased, E(ε̂_{T+1}) = 0; and the forecast error variance is constant, V(ε̂_{T+1}) = E(ε̂²_{T+1}) = σ². 38
39 Regression model for time series Step 2: in case α and β are unknown. The best forecast for Y_{T+1} is then determined from a simple two-stage procedure: estimate the parameters of the linear equation using ordinary least squares; set Ŷ_{T+1} = α̂ + β̂X_{T+1}. Thus, the forecast error is ε̂_{T+1} = Ŷ_{T+1} − Y_{T+1} = (α̂ − α) + (β̂ − β)X_{T+1} − ε_{T+1}. There are thus two sources of error: the additive error term ε_{T+1}, and the random nature of statistical estimation. 39
40 Figure 14: Forecasting techniques, problem of uncertainty related to parameter estimation. 40
41 Regression model for time series Consider the following regression model. Goal of ordinary least squares: minimize Σ_{i=1}^n (Y_i − Ŷ_i)², where Ŷ = α̂ + β̂X. Then β̂ = [n Σ X_i Y_i − Σ X_i Σ Y_i] / [n Σ X_i² − (Σ X_i)²] and α̂ = (Σ Y_i)/n − β̂ (Σ X_i)/n = Ȳ − β̂X̄. The least squares slope can also be written β̂ = Σ(X_i − X̄)(Y_i − Ȳ) / Σ(X_i − X̄)². The forecast error variance is V(ε̂_{T+1}) = V(α̂) + 2X_{T+1} cov(α̂, β̂) + X²_{T+1} V(β̂) + σ². 41
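The two closed-form expressions for the slope above are algebraically identical, which is easy to confirm numerically (a Python sketch on arbitrary toy data):

```python
# Two textbook expressions for the OLS slope, computed on the same toy data:
# raw-sum form:      (n*Sum(xy) - Sum(x)*Sum(y)) / (n*Sum(x^2) - Sum(x)^2)
# centered form:     Sum((x - xbar)(y - ybar)) / Sum((x - xbar)^2)
x = [1.0, 2.0, 4.0, 5.0, 7.0]
y = [2.1, 2.9, 5.2, 6.1, 7.8]
n = len(x)

sx, sy = sum(x), sum(y)
sxy = sum(xi * yi for xi, yi in zip(x, y))
sxx = sum(xi * xi for xi in x)
b_raw = (n * sxy - sx * sy) / (n * sxx - sx ** 2)

xbar, ybar = sx / n, sy / n
b_centered = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
              / sum((xi - xbar) ** 2 for xi in x))

a = ybar - b_centered * xbar    # intercept: alphahat = ybar - betahat * xbar
```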
42 Regression model for time series Under the assumptions of the linear model: there exists a linear relationship between X and Y, Y = α + βX; the X_i's are nonrandom variables; the errors have zero expected value, E(ε) = 0; the errors have constant variance, V(ε) = σ²; the errors are independent; the errors are normally distributed. 42
43 Regression model and Gauss-Markov theorem Under the first 5 assumptions, the estimators α̂ and β̂ are the best (most efficient) linear unbiased estimators of α and β, in the sense that they have minimum variance among all linear unbiased estimators (i.e. BLUE, best linear unbiased estimators). The two estimators are further asymptotically normal: √n(β̂ − β) → N(0, nσ²/Σ(X_i − X̄)²) and √n(α̂ − α) → N(0, σ² Σ X_i²/Σ(X_i − X̄)²). The variances of α̂ and β̂ can be estimated as V̂(β̂) = σ̂²/Σ(X_i − X̄)² and V̂(α̂) = σ̂² Σ X_i²/(n Σ(X_i − X̄)²), while the covariance is cov(α̂, β̂) = −X̄ σ̂²/Σ(X_i − X̄)². 43
44 Regression model and Gauss-Markov theorem Thus, if σ̂ denotes the estimated standard deviation of ε_{T+1}, the standard deviation s of ε̂_{T+1} can be estimated via ŝ² = σ̂² (1 + 1/T + (X_{T+1} − X̄)²/Σ(X_i − X̄)²) > σ̂². 44
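The forecast standard error formula above can be sketched end to end (a Python illustration on fixed toy data; the deterministic "noise" values are made up). By construction ŝ exceeds σ̂, and it grows the further X_{T+1} lies from the sample mean X̄.

```python
import math

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
e = [0.2, -0.1, 0.05, -0.15, 0.1, -0.1]          # fixed illustrative disturbances
y = [1.0 + 0.8 * xi + ei for xi, ei in zip(x, e)]
n = len(x)

# OLS fit.
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
a = ybar - b * xbar

# Residual variance estimate, sigmahat^2 = RSS / (n - 2).
resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
sigma2 = sum(r * r for r in resid) / (n - 2)

# Forecast standard error at an out-of-sample regressor value.
x_next = 8.0
s2 = sigma2 * (1 + 1 / n + (x_next - xbar) ** 2 / sxx)
s = math.sqrt(s2)
y_next_hat = a + b * x_next
```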
45 RMSE (root mean square error) and Theil's inequality Recall the root mean square error (RMSE), i.e. RMSE = √[(1/n) Σ_{i=1}^n (Y_i − Ŷ_i)²]. Another useful statistic is Theil's inequality coefficient, defined as U = √[(1/T) Σ_{i=1}^T (Ŷ_i − Y_i)²] / (√[(1/T) Σ_{i=1}^T Ŷ_i²] + √[(1/T) Σ_{i=1}^T Y_i²]). From this normalization, U always falls between 0 and 1. U = 0 is a perfect fit, while U = 1 means that the predictive performance is as bad as it could possibly be. 45
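Theil's U as defined above can be sketched as follows (Python, toy numbers). A perfect forecast gives U = 0, and forecasting the exact opposite of the series, Ŷ_i = −Y_i, gives the worst-case value U = 1.

```python
import math

def theil_u(y, yhat):
    """Theil inequality coefficient: 0 = perfect forecast, 1 = worst case."""
    n = len(y)
    num = math.sqrt(sum((f - a) ** 2 for f, a in zip(yhat, y)) / n)
    den = (math.sqrt(sum(f * f for f in yhat) / n)
           + math.sqrt(sum(a * a for a in y) / n))
    return num / den

y    = [3.0, 4.0, 5.0, 6.0]
good = [3.1, 3.9, 5.2, 5.8]           # a decent forecast: small U
worst = [-a for a in y]               # sign-flipped forecast: U = 1
```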
46 Step 3: assume that α, β and X_{T+1} are unknown, but that X̂_{T+1} = X_{T+1} + u_{T+1}, where u_{T+1} ∼ N(0, σ_u²), and the two errors are uncorrelated. Here, the forecast error is ε̂_{T+1} = Ŷ_{T+1} − Y_{T+1} = (α̂ − α) + (β̂ − β)X_{T+1} + β̂u_{T+1} − ε_{T+1}. It can be proved (easily) that E(ε̂_{T+1}) = 0. But its variance is slightly more complicated to derive: V(ε̂_{T+1}) = V(α̂) + 2X_{T+1} cov(α̂, β̂) + (X²_{T+1} + σ_u²) V(β̂) + σ² + β²σ_u². And therefore, the forecast error variance is then s² = σ̂² (1 + 1/T + [(X_{T+1} − X̄)² + σ_u²]/Σ(X_i − X̄)²) + β̂² σ_u² > σ̂², which, again, increases the forecast error. 46
47 To go further, the multiple regression model In the multiple regression model, Y = Xβ + ε, in which Y = (Y_1, Y_2, ..., Y_n)', X is the n × k matrix with rows (X_{1,i}, X_{2,i}, ..., X_{k,i}), β = (β_1, β_2, ..., β_k)' and ε = (ε_1, ε_2, ..., ε_n)', there exists a linear relationship between X_1, ..., X_k and Y, Y = α + β_1 X_1 + ... + β_k X_k; the X_i's are nonrandom variables, and moreover, there is no exact linear relationship between two or more independent variables; the errors have zero expected value, E(ε) = 0; the errors have constant variance, V(ε) = σ²; the errors are independent; 47
48 the errors are normally distributed. The new assumption here is that there is no exact linear relationship between two or more independent variables. If such a relationship exists, the variables are perfectly collinear, i.e. there is perfect collinearity. From a statistical point of view, multicollinearity occurs when two variables are closely related. This might occur e.g. between two series {X_2, X_3, ..., X_T} and {X_1, X_2, ..., X_{T−1}} with strong autocorrelation. 48
49 To go further, forecasting with serially correlated errors In the previous model, errors were homoscedastic. A more general model is obtained when errors are heteroscedastic, i.e. have non-constant variance; the Goldfeld-Quandt test can be performed. An alternative is to assume serial correlation; the Cochrane-Orcutt or Hildreth-Lu procedures can be performed. Consider the following regression model, with −1 < ρ < +1 and η_t ∼ N(0, σ²): Y_t = α + βX_t + ε_t, where ε_t = ρε_{t−1} + η_t. Step 1: assume that α, β and ρ are known. Then Ŷ_{T+1} = α + βX_{T+1} + ε̂_{T+1} = α + βX_{T+1} + ρε_T, assuming that ε̂_{T+1} = ρε_T. Recursively, ε̂_{T+2} = ρ ε̂_{T+1} = ρ² ε_T, 49
50 ε̂_{T+3} = ρ ε̂_{T+2} = ρ³ ε_T, and more generally ε̂_{T+h} = ρ ε̂_{T+h−1} = ρ^h ε_T. Since |ρ| < 1, ρ^h approaches 0 as h gets arbitrarily large. Hence, the information provided by serial correlation becomes less and less useful. Ŷ_{T+1} = α(1 − ρ) + βX_{T+1} + ρ(Y_T − βX_T). Since Y_T = α + βX_T + ε_T, then Ŷ_{T+1} = α + βX_{T+1} + ρε_T. Thus, the forecast error is ε̂_{T+1} = Ŷ_{T+1} − Y_{T+1} = ρε_T − ε_{T+1}. 50
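The recursion ε̂_{T+h} = ρ^h ε_T above is easy to sketch (Python, with illustrative parameter values): the serial-correlation correction shrinks geometrically, so long-horizon forecasts revert to α + βX.

```python
# Multi-step forecasts with known AR(1) errors: Yhat_{T+h} = alpha + beta*X_{T+h} + rho^h * eps_T.
# All parameter values are illustrative, not from the course.
alpha, beta, rho = 2.0, 0.5, 0.5
eps_T = 1.6                       # last observed residual
x_future = [10.0, 10.0, 10.0, 10.0]

forecasts = []
corr = eps_T
for xh in x_future:
    corr *= rho                   # epshat_{T+h} = rho^h * eps_T
    forecasts.append(alpha + beta * xh + corr)
```

With α + βX = 7.0, the corrections are 0.8, 0.4, 0.2, 0.1, so the forecasts decay toward 7.0.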
51 To go further, using lag models We have mentioned earlier that when dealing with time series, it is possible not only to consider the linear regression of Y_t on X_t, but also to consider lagged variates: either X_{t−1}, X_{t−2}, X_{t−3}, ... etc., or Y_{t−1}, Y_{t−2}, Y_{t−3}, ... etc. First, we will focus on adding lagged exogenous explanatory variables, i.e. models such as Y_t = α + β_0 X_t + β_1 X_{t−1} + β_2 X_{t−2} + ... + β_h X_{t−h} + ... + ε_t. Remark: in a very general setting X_t can be a random vector in R^k. 51
52 To go further, a geometric lag model Assume that the weights of the lagged explanatory variables are all positive and decline geometrically with time, Y_t = α + β(X_t + ωX_{t−1} + ω²X_{t−2} + ω³X_{t−3} + ... + ω^h X_{t−h} + ...) + ε_t, with 0 < ω < 1. Note that Y_{t−1} = α + β(X_{t−1} + ωX_{t−2} + ω²X_{t−3} + ... + ω^h X_{t−h−1} + ...) + ε_{t−1}, so that Y_t − ωY_{t−1} = α(1 − ω) + βX_t + η_t, where η_t = ε_t − ωε_{t−1}. Rewriting, Y_t = α(1 − ω) + ωY_{t−1} + βX_t + η_t. 52
53 To go further, a geometric lag model This would be called a single-equation autoregressive model, with a single lagged dependent variable. The presence of a lagged dependent variable in the model causes ordinary least-squares parameter estimates to be biased, although they remain consistent. 53
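The algebra of the Koyck transformation above can be checked numerically (a Python sketch in the noise-free case, with made-up parameter values and X assumed zero before the sample): the infinite distributed-lag form and the autoregressive form generate exactly the same series.

```python
# Verify the Koyck transformation numerically (noise-free case).
alpha, beta, omega = 1.0, 2.0, 0.6
x = [0.0, 1.0, 0.5, 2.0, 1.5, 0.0, 3.0, 1.0]   # X_t, with X = 0 before t = 0

# Distributed-lag form: Y_t = alpha + beta * sum_j omega^j * X_{t-j}.
y_lag = []
for t in range(len(x)):
    s = sum(omega ** j * x[t - j] for j in range(t + 1))
    y_lag.append(alpha + beta * s)

# Autoregressive (Koyck) form: Y_t = alpha*(1 - omega) + omega*Y_{t-1} + beta*X_t.
y_ar = []
prev = alpha                     # steady-state level when X = 0 before the sample
for xt in x:
    prev = alpha * (1 - omega) + omega * prev + beta * xt
    y_ar.append(prev)
```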
54 Estimation of parameters In classical linear econometrics, Y = Xβ + ε, with ε ∼ N(0, σ²). Then β̂ = (X'X)^{-1}X'Y is the ordinary least squares (OLS) estimator, and is also the maximum likelihood (ML) estimator. The maximum likelihood estimator is consistent and asymptotically efficient, and its (asymptotic) variance can be determined. It can be obtained using optimization techniques. Remark: it is also possible to use the generalized method of moments (GMM). 54
55 To go further, modeling a qualitative variable In some cases, the variable of interest is not a continuous variable (such as a price on R), but a binary variable. Consider the following regression model, Y_i = α + βX_i + ε_i, with Y_i ∈ {0, 1}, where the ε_i are independent random variables with zero mean. Then E(Y_i) = α + βX_i. Note that Y_i then has a Bernoulli (binomial) distribution. Classical models are either the probit or the logit model. The idea is that there exists a continuous latent unobservable Y*_i such that Y_i = 1 if Y*_i > t_i, and Y_i = 0 if Y*_i ≤ t_i, with Y*_i = α + βX_i + ε_i, which is now a classical regression model. Equivalently, it means that Y_i has a Bernoulli (binomial) distribution B(p_i), 55
56 where p_i = F(α + βX_i), and F is a cumulative distribution function. If F is the cumulative distribution function of the N(0,1) distribution, i.e. F(x) = (1/√(2π)) ∫_{−∞}^x exp(−z²/2) dz, we obtain the probit model; if F is the cumulative distribution function of the logistic distribution, F(x) = 1/(1 + exp(−x)), we obtain the logit model. Those models can be extended to the so-called ordered probit model, where Y can denote e.g. a rating (AAA, BB+, B-, ... etc.). Maximum likelihood techniques can be used. 56
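The two link functions just defined can be sketched with the standard library (Python; the standard normal CDF is expressed through the error function, and the α, β values below are hypothetical). Both map any real index α + βx_i into a probability p_i in (0, 1).

```python
import math

def probit_cdf(z):
    """Standard normal CDF, via the error function: 0.5 * (1 + erf(z / sqrt(2)))."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def logit_cdf(z):
    """Logistic CDF: 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

# p_i = F(alpha + beta * x_i) for hypothetical parameter values.
alpha, beta = -1.0, 0.5
xs = (0.0, 2.0, 4.0)
p_probit = [probit_cdf(alpha + beta * xi) for xi in xs]
p_logit = [logit_cdf(alpha + beta * xi) for xi in xs]
```

Both CDFs equal 0.5 at zero and are increasing, so larger X_i values yield larger success probabilities under either model.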
57 Modeling the random component The unpredictable random component is the key element when forecasting. Most of the uncertainty comes from this random component ε_t. The lower its variance, the smaller the uncertainty on forecasts. The general theoretical framework related to the randomness of time series is weak stationarity. 57
More informationForecasting Using Eviews 2.0: An Overview
Forecasting Using Eviews 2.0: An Overview Some Preliminaries In what follows it will be useful to distinguish between ex post and ex ante forecasting. In terms of time series modeling, both predict values
More informationANNUITY LAPSE RATE MODELING: TOBIT OR NOT TOBIT? 1. INTRODUCTION
ANNUITY LAPSE RATE MODELING: TOBIT OR NOT TOBIT? SAMUEL H. COX AND YIJIA LIN ABSTRACT. We devise an approach, using tobit models for modeling annuity lapse rates. The approach is based on data provided
More informationSYSTEMS OF REGRESSION EQUATIONS
SYSTEMS OF REGRESSION EQUATIONS 1. MULTIPLE EQUATIONS y nt = x nt n + u nt, n = 1,...,N, t = 1,...,T, x nt is 1 k, and n is k 1. This is a version of the standard regression model where the observations
More informationCorrelation and Simple Linear Regression
Correlation and Simple Linear Regression We are often interested in studying the relationship among variables to determine whether they are associated with one another. When we think that changes in a
More informationPlease follow the directions once you locate the Stata software in your computer. Room 114 (Business Lab) has computers with Stata software
STATA Tutorial Professor Erdinç Please follow the directions once you locate the Stata software in your computer. Room 114 (Business Lab) has computers with Stata software 1.Wald Test Wald Test is used
More informationEDUCATION AND VOCABULARY MULTIPLE REGRESSION IN ACTION
EDUCATION AND VOCABULARY MULTIPLE REGRESSION IN ACTION EDUCATION AND VOCABULARY 5-10 hours of input weekly is enough to pick up a new language (Schiff & Myers, 1988). Dutch children spend 5.5 hours/day
More informationThe VAR models discussed so fare are appropriate for modeling I(0) data, like asset returns or growth rates of macroeconomic time series.
Cointegration The VAR models discussed so fare are appropriate for modeling I(0) data, like asset returns or growth rates of macroeconomic time series. Economic theory, however, often implies equilibrium
More informationUnivariate Time Series Analysis; ARIMA Models
Econometrics 2 Spring 25 Univariate Time Series Analysis; ARIMA Models Heino Bohn Nielsen of4 Outline of the Lecture () Introduction to univariate time series analysis. (2) Stationarity. (3) Characterizing
More informationVI. Introduction to Logistic Regression
VI. Introduction to Logistic Regression We turn our attention now to the topic of modeling a categorical outcome as a function of (possibly) several factors. The framework of generalized linear models
More informationThe Loss in Efficiency from Using Grouped Data to Estimate Coefficients of Group Level Variables. Kathleen M. Lang* Boston College.
The Loss in Efficiency from Using Grouped Data to Estimate Coefficients of Group Level Variables Kathleen M. Lang* Boston College and Peter Gottschalk Boston College Abstract We derive the efficiency loss
More informationUnivariate Time Series Analysis; ARIMA Models
Econometrics 2 Fall 25 Univariate Time Series Analysis; ARIMA Models Heino Bohn Nielsen of4 Univariate Time Series Analysis We consider a single time series, y,y 2,..., y T. We want to construct simple
More informationTIME SERIES ANALYSIS
TIME SERIES ANALYSIS Ramasubramanian V. I.A.S.R.I., Library Avenue, New Delhi- 110 012 ram_stat@yahoo.co.in 1. Introduction A Time Series (TS) is a sequence of observations ordered in time. Mostly these
More informationOverview Classes. 12-3 Logistic regression (5) 19-3 Building and applying logistic regression (6) 26-3 Generalizations of logistic regression (7)
Overview Classes 12-3 Logistic regression (5) 19-3 Building and applying logistic regression (6) 26-3 Generalizations of logistic regression (7) 2-4 Loglinear models (8) 5-4 15-17 hrs; 5B02 Building and
More informationDepartment of Mathematics, Indian Institute of Technology, Kharagpur Assignment 2-3, Probability and Statistics, March 2015. Due:-March 25, 2015.
Department of Mathematics, Indian Institute of Technology, Kharagpur Assignment -3, Probability and Statistics, March 05. Due:-March 5, 05.. Show that the function 0 for x < x+ F (x) = 4 for x < for x
More informationCh.3 Demand Forecasting.
Part 3 : Acquisition & Production Support. Ch.3 Demand Forecasting. Edited by Dr. Seung Hyun Lee (Ph.D., CPL) IEMS Research Center, E-mail : lkangsan@iems.co.kr Demand Forecasting. Definition. An estimate
More informationStatistics 104: Section 6!
Page 1 Statistics 104: Section 6! TF: Deirdre (say: Dear-dra) Bloome Email: dbloome@fas.harvard.edu Section Times Thursday 2pm-3pm in SC 109, Thursday 5pm-6pm in SC 705 Office Hours: Thursday 6pm-7pm SC
More informationLogistic Regression. Jia Li. Department of Statistics The Pennsylvania State University. Logistic Regression
Logistic Regression Department of Statistics The Pennsylvania State University Email: jiali@stat.psu.edu Logistic Regression Preserve linear classification boundaries. By the Bayes rule: Ĝ(x) = arg max
More informationECON 142 SKETCH OF SOLUTIONS FOR APPLIED EXERCISE #2
University of California, Berkeley Prof. Ken Chay Department of Economics Fall Semester, 005 ECON 14 SKETCH OF SOLUTIONS FOR APPLIED EXERCISE # Question 1: a. Below are the scatter plots of hourly wages
More informationWe extended the additive model in two variables to the interaction model by adding a third term to the equation.
Quadratic Models We extended the additive model in two variables to the interaction model by adding a third term to the equation. Similarly, we can extend the linear model in one variable to the quadratic
More informationGLM I An Introduction to Generalized Linear Models
GLM I An Introduction to Generalized Linear Models CAS Ratemaking and Product Management Seminar March 2009 Presented by: Tanya D. Havlicek, Actuarial Assistant 0 ANTITRUST Notice The Casualty Actuarial
More informationLecture 3: Linear methods for classification
Lecture 3: Linear methods for classification Rafael A. Irizarry and Hector Corrada Bravo February, 2010 Today we describe four specific algorithms useful for classification problems: linear regression,
More informationConcepts in Investments Risks and Returns (Relevant to PBE Paper II Management Accounting and Finance)
Concepts in Investments Risks and Returns (Relevant to PBE Paper II Management Accounting and Finance) Mr. Eric Y.W. Leung, CUHK Business School, The Chinese University of Hong Kong In PBE Paper II, students
More informationTesting for Granger causality between stock prices and economic growth
MPRA Munich Personal RePEc Archive Testing for Granger causality between stock prices and economic growth Pasquale Foresti 2006 Online at http://mpra.ub.uni-muenchen.de/2962/ MPRA Paper No. 2962, posted
More informationChapter 10: Basic Linear Unobserved Effects Panel Data. Models:
Chapter 10: Basic Linear Unobserved Effects Panel Data Models: Microeconomic Econometrics I Spring 2010 10.1 Motivation: The Omitted Variables Problem We are interested in the partial effects of the observable
More information2. What is the general linear model to be used to model linear trend? (Write out the model) = + + + or
Simple and Multiple Regression Analysis Example: Explore the relationships among Month, Adv.$ and Sales $: 1. Prepare a scatter plot of these data. The scatter plots for Adv.$ versus Sales, and Month versus
More informationForecasting methods applied to engineering management
Forecasting methods applied to engineering management Áron Szász-Gábor Abstract. This paper presents arguments for the usefulness of a simple forecasting application package for sustaining operational
More informationLecture 15. Endogeneity & Instrumental Variable Estimation
Lecture 15. Endogeneity & Instrumental Variable Estimation Saw that measurement error (on right hand side) means that OLS will be biased (biased toward zero) Potential solution to endogeneity instrumental
More informationInstitute of Actuaries of India Subject CT3 Probability and Mathematical Statistics
Institute of Actuaries of India Subject CT3 Probability and Mathematical Statistics For 2015 Examinations Aim The aim of the Probability and Mathematical Statistics subject is to provide a grounding in
More informationA Primer on Forecasting Business Performance
A Primer on Forecasting Business Performance There are two common approaches to forecasting: qualitative and quantitative. Qualitative forecasting methods are important when historical data is not available.
More informationLinear Classification. Volker Tresp Summer 2015
Linear Classification Volker Tresp Summer 2015 1 Classification Classification is the central task of pattern recognition Sensors supply information about an object: to which class do the object belong
More informationUnivariate Regression
Univariate Regression Correlation and Regression The regression line summarizes the linear relationship between 2 variables Correlation coefficient, r, measures strength of relationship: the closer r is
More informationSAS Software to Fit the Generalized Linear Model
SAS Software to Fit the Generalized Linear Model Gordon Johnston, SAS Institute Inc., Cary, NC Abstract In recent years, the class of generalized linear models has gained popularity as a statistical modeling
More informationTime Series Analysis
Time Series Analysis Identifying possible ARIMA models Andrés M. Alonso Carolina García-Martos Universidad Carlos III de Madrid Universidad Politécnica de Madrid June July, 2012 Alonso and García-Martos
More informationTIME SERIES ANALYSIS
TIME SERIES ANALYSIS L.M. BHAR AND V.K.SHARMA Indian Agricultural Statistics Research Institute Library Avenue, New Delhi-0 02 lmb@iasri.res.in. Introduction Time series (TS) data refers to observations
More informationTime Series Analysis and Forecasting
Time Series Analysis and Forecasting Math 667 Al Nosedal Department of Mathematics Indiana University of Pennsylvania Time Series Analysis and Forecasting p. 1/11 Introduction Many decision-making applications
More information**BEGINNING OF EXAMINATION** The annual number of claims for an insured has probability function: , 0 < q < 1.
**BEGINNING OF EXAMINATION** 1. You are given: (i) The annual number of claims for an insured has probability function: 3 p x q q x x ( ) = ( 1 ) 3 x, x = 0,1,, 3 (ii) The prior density is π ( q) = q,
More informationThe Use of Event Studies in Finance and Economics. Fall 2001. Gerald P. Dwyer, Jr.
The Use of Event Studies in Finance and Economics University of Rome at Tor Vergata Fall 2001 Gerald P. Dwyer, Jr. Any views are the author s and not necessarily those of the Federal Reserve Bank of Atlanta
More informationAugust 2012 EXAMINATIONS Solution Part I
August 01 EXAMINATIONS Solution Part I (1) In a random sample of 600 eligible voters, the probability that less than 38% will be in favour of this policy is closest to (B) () In a large random sample,
More informationPart II. Multiple Linear Regression
Part II Multiple Linear Regression 86 Chapter 7 Multiple Regression A multiple linear regression model is a linear model that describes how a y-variable relates to two or more xvariables (or transformations
More informationPITFALLS IN TIME SERIES ANALYSIS. Cliff Hurvich Stern School, NYU
PITFALLS IN TIME SERIES ANALYSIS Cliff Hurvich Stern School, NYU The t -Test If x 1,..., x n are independent and identically distributed with mean 0, and n is not too small, then t = x 0 s n has a standard
More informationChapter 6: Multivariate Cointegration Analysis
Chapter 6: Multivariate Cointegration Analysis 1 Contents: Lehrstuhl für Department Empirische of Wirtschaftsforschung Empirical Research and und Econometrics Ökonometrie VI. Multivariate Cointegration
More informationTime Series Analysis
Time Series Analysis Forecasting with ARIMA models Andrés M. Alonso Carolina García-Martos Universidad Carlos III de Madrid Universidad Politécnica de Madrid June July, 2012 Alonso and García-Martos (UC3M-UPM)
More informationChapter 13 Introduction to Linear Regression and Correlation Analysis
Chapter 3 Student Lecture Notes 3- Chapter 3 Introduction to Linear Regression and Correlation Analsis Fall 2006 Fundamentals of Business Statistics Chapter Goals To understand the methods for displaing
More informationChapter 1. Vector autoregressions. 1.1 VARs and the identi cation problem
Chapter Vector autoregressions We begin by taking a look at the data of macroeconomics. A way to summarize the dynamics of macroeconomic data is to make use of vector autoregressions. VAR models have become
More informationNon-Stationary Time Series andunitroottests
Econometrics 2 Fall 2005 Non-Stationary Time Series andunitroottests Heino Bohn Nielsen 1of25 Introduction Many economic time series are trending. Important to distinguish between two important cases:
More informationMaster s Theory Exam Spring 2006
Spring 2006 This exam contains 7 questions. You should attempt them all. Each question is divided into parts to help lead you through the material. You should attempt to complete as much of each problem
More informationGeneralized Linear Models
Generalized Linear Models We have previously worked with regression models where the response variable is quantitative and normally distributed. Now we turn our attention to two types of models where the
More informationGenerating Random Numbers Variance Reduction Quasi-Monte Carlo. Simulation Methods. Leonid Kogan. MIT, Sloan. 15.450, Fall 2010
Simulation Methods Leonid Kogan MIT, Sloan 15.450, Fall 2010 c Leonid Kogan ( MIT, Sloan ) Simulation Methods 15.450, Fall 2010 1 / 35 Outline 1 Generating Random Numbers 2 Variance Reduction 3 Quasi-Monte
More informationRecent Developments of Statistical Application in. Finance. Ruey S. Tsay. Graduate School of Business. The University of Chicago
Recent Developments of Statistical Application in Finance Ruey S. Tsay Graduate School of Business The University of Chicago Guanghua Conference, June 2004 Summary Focus on two parts: Applications in Finance:
More information2. Descriptive statistics in EViews
2. Descriptive statistics in EViews Features of EViews: Data processing (importing, editing, handling, exporting data) Basic statistical tools (descriptive statistics, inference, graphical tools) Regression
More information