Forecast Model for Box-office Revenue of Motion Pictures

Jae-Mook Lee, Tae-Hyung Pyo

December 7, 2009

1 Introduction

The main objective of this paper is to develop an econometric model to forecast the box-office revenue of motion pictures. Considering the importance of forecasting demand for new products, marketing researchers have developed various demand forecasting models. However, these models forecast future demand based on either several months of initial sales data after a new product is introduced or survey data on customer purchase intention. In contrast to these models, we forecast future demand for new products by analyzing the historical sales patterns of similar products. Even though we apply our model to revenue forecasting for motion pictures, it can easily be applied to forecast future demand in several other industries, such as books, music albums, videos and pharmaceutical products.
2 Development of Forecast Model for Box-Office Revenue

Econometric models for forecasting box-office revenue before a film's release can be divided into two categories. The first is the more traditional method, which estimates the total revenue of each film directly. This model can be written as follows:

$$Q_i = \beta_0 + \beta_1 X_{1i} + \dots + \beta_K X_{Ki} + \epsilon_i \qquad (1)$$

where $Q_i$ is the revenue of film $i$, $X_{Ki}$ is the value of the $K$th independent variable for film $i$, the $\beta$ are the parameters to be estimated, and $\epsilon_i$ is the error term. Equation (1) is a typical regression model, and it often uses total box-office sales or revenue as the dependent variable $Q_i$. We apply Equation (1) to data on previous films whose box-office revenue is known and estimate the parameters $\beta$. We then substitute $X$, the independent variables of the film to be forecast, into the estimated equation to obtain a forecast of that film's box-office revenue.

The second method forecasts the weekly box-office revenue (sales) pattern of each film and sums the weekly forecasts from the first to the last week of the run to estimate total revenue. The first step is to estimate the box-office patterns from existing data on the weekly revenue of each film and to summarize those patterns in a small number of parameters.
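Before turning to the second method in detail, the direct approach of Equation (1) can be sketched in R as follows. This is only an illustration under stated assumptions: the data are simulated, the log of sales is used as the dependent variable, and the names `films`, `new_film` and the coefficient values are hypothetical rather than taken from the data analyzed in this paper.

```r
# Illustrative sketch of Equation (1): regress (log) total sales on film
# characteristics, then predict for a new film. All data are simulated.
set.seed(1)
n <- 100
films <- data.frame(
  tv_fre = rpois(n, 20),   # hypothetical number of TV advertisements
  revi   = runif(n, 1, 5)  # hypothetical review score
)
# Simulated dependent variable with known coefficients plus noise
films$lnsales <- 8 + 0.02 * films$tv_fre + 0.6 * films$revi +
  rnorm(n, sd = 0.5)

# Estimate the beta parameters from previously released films
fit <- lm(lnsales ~ tv_fre + revi, data = films)

# Substitute the characteristics X of the film to be forecast
new_film <- data.frame(tv_fre = 30, revi = 4)
pred_ln  <- predict(fit, newdata = new_film)

print(coef(fit))
print(pred_ln)
```

With real data, `films` would hold one row per previously released film and `new_film` the characteristics known before release.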
Based on the patterns observed in weekly sales across films, this study suggests the following model:

$$\log Q_{it} = \gamma_{0i} + \gamma_{1i} t + \epsilon_{it} \qquad (2)$$

where $Q_{it}$ is the box-office revenue of film $i$ in period $t$ after release, $t$ is the data collection interval, here the number of weeks elapsed since release, $\gamma_{0i}$ and $\gamma_{1i}$ are the parameters to be estimated, and $\epsilon_{it}$ is the error term. As in previous marketing research, Equation (2) assumes that the weekly box-office revenue of a film follows an exponential decay, i.e. it declines gradually after release. In Equation (2), $\gamma_{0i}$ summarizes the information on sales in the first week after release, while $\gamma_{1i}$ summarizes the rate at which the audience decays after release.

To apply Equation (2), we need data on the weekly box-office revenue of each film from its release to the end of its run. Moreover, the $\gamma$ parameters are estimated separately for each film, so estimating this regression amounts to summarizing each film's weekly revenue data in two parameters, $\gamma_{0i}$ and $\gamma_{1i}$. First, estimate $\gamma_{0i}$ and $\gamma_{1i}$ for each film. Then fit regression models that use each of these parameters as the dependent variable and the film characteristics of Equation (1) as independent variables. In other words, the dependent variable in Equation (1) is replaced by $\gamma_{0i}$ and by $\gamma_{1i}$ in turn.
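The two-step procedure can be sketched in R as follows. This is a minimal illustration on simulated data, not the analysis reported in this paper: the single characteristic `x`, the sample sizes, and the true parameter values are all assumptions made for the example. Weeks are counted from 0, so that the intercept corresponds to the opening week.

```r
# Illustrative two-step sketch of Equation (2). All data are simulated.
set.seed(2)
n_films <- 50
n_weeks <- 8
x <- runif(n_films, 1, 5)  # hypothetical film characteristic
g0_true <- 8 + 0.5 * x + rnorm(n_films, sd = 0.3)      # log opening level
g1_true <- -0.8 + 0.1 * x + rnorm(n_films, sd = 0.05)  # weekly decay rate

# Long-format weekly data: log Q_it = g0_i + g1_i * t + error
weekly <- expand.grid(week = 0:(n_weeks - 1), film = seq_len(n_films))
weekly$lnsales <- g0_true[weekly$film] + g1_true[weekly$film] * weekly$week +
  rnorm(nrow(weekly), sd = 0.1)

# Step 1: one regression per film, keeping the intercept and the slope
est <- t(sapply(split(weekly, weekly$film),
                function(d) coef(lm(lnsales ~ week, data = d))))
colnames(est) <- c("g0", "g1")
stage1 <- data.frame(est, x = x)

# Step 2: regress each estimated parameter on the film characteristics
fit_g0 <- lm(g0 ~ x, data = stage1)
fit_g1 <- lm(g1 ~ x, data = stage1)

# Forecast for a new film: predict its two parameters, then sum the
# weekly revenues implied by Equation (2)
new_film <- data.frame(x = 4)
g0_hat <- predict(fit_g0, new_film)
g1_hat <- predict(fit_g1, new_film)
weeks  <- 0:(n_weeks - 1)
weekly_forecast <- exp(g0_hat + g1_hat * weeks)
total_forecast  <- sum(weekly_forecast)

print(weekly_forecast)
print(total_forecast)
```

The forecast horizon (`n_weeks`) is fixed here for simplicity; in practice the length of a film's run is itself uncertain.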
After estimating the second-stage regressions, substitute $X$, the independent variables of the film to be forecast, into the equations to obtain $\gamma_{0i}$ and $\gamma_{1i}$ for that film. Substituting these two parameters into Equation (2) then yields a forecast of the whole trajectory of box-office revenue from release to the end of the run. The total revenue of the film is obtained by summing the weekly box-office revenues over that period. We also applied Bayesian regression to these two parameters using the R2OpenBUGS package and compared the two methods in terms of MSE (mean squared error) and MAE (mean absolute error).

3 The Data

The data cover 266 films released nationwide in the year 2000 in Seoul, Korea. Total sales of each movie can be replaced with the total number of tickets sold, since ticket prices are the same for all movies. Advertising expenditure data were obtained from the movie advertising agency Dave in Seoul. Dave monitored four major media (TV, radio, newspaper, and magazine) and estimated expenditure according to the frequency and time of each movie's advertisements. Other data were collected from the Korean Film Archive (http://www.koreafilm.or.kr).
Table 1: Summary statistics of variables

Variable    Attribute
lnsales     Natural log of the total number of admissions for each movie
tv_fre      Number of TV advertisements
news_fre    Number of newspaper advertisements
dir_mean    Average admissions for the director's previous movies
star_mean   Average admissions for the main actor's previous movies
ACAD        Genre: dummy variable for action and adventure movies
SF          Genre: dummy variable for science fiction movies
season      Dummy variable for the high-demand season for movies
revi        Review of the movie

4 Estimation and Comparison of Models

4.1 How to Forecast the Weekly Revenue of Movies

As described above, this approach forecasts movie revenue in two steps. The first step is to extract sales patterns from the weekly data of each film by applying Equation (2) to each film's sales data. In other words, we estimate 266 regression equations, one for each film in the estimation sample. The frequency distributions of the resulting 266 values of $\gamma_{0i}$ and $\gamma_{1i}$ are shown in Figure 1. By the nature of Equation (2), $\gamma_{0i}$ must be positive, because it summarizes the information on sales in the first week after release. The 266
estimated values of $\gamma_{0i}$ were all positive. On the other hand, $\gamma_{1i}$ summarizes the slope of weekly sales revenue after release. For the exponential decay assumed in Equation (2) to be meaningful, $\gamma_{1i}$ must be negative. All 266 estimated values of $\gamma_{1i}$ were negative, which suggests that the proposed model is reasonably valid.

Figure 1: Frequency of intercepts ($\gamma_{0i}$) and slopes ($\gamma_{1i}$). [Two histograms: frequency against estimates of intercept, and frequency against estimates of slope.]

The next step is to model the relationship between the two parameters ($\gamma_{0i}$ and $\gamma_{1i}$) that summarize weekly revenue and the attribute variables of each film. The results of the regression with $\gamma_{0i}$ as the dependent variable are listed in Table 2, and those with $\gamma_{1i}$ as the dependent variable are listed in Table 3.
First, Table 2 shows that the adjusted $R^2$ is 0.621 and that only 3 of the 8 variables, tv_fre, news_fre, and revi, are significant at p = 0.05. It is notable that newspaper advertising had the most significant impact on revenue in the first week after release. TV advertising (tv_fre) and reviews (revi) also had significant effects on first-week revenue.

Table 2: Relation between the intercept ($\gamma_{0i}$) and variables

Variable      OLS                     Bayesian
Intercept     6.639 (0.24)**          10.066 (0.090)**
tv_fre        0.018 (0.005)**         0.018 (0.005)**
news_fre      0.043 (0.005)**         0.043 (0.005)**
dir_mean      7.38e-07 (7.560e-07)    6.96e-07 (7.6e-07)
ACAD          -0.179 (0.213)          -0.181 (0.215)
SF            0.133 (0.363)           0.144 (0.374)
revi          0.650 (0.073)**         0.648 (0.072)**
season        0.350 (0.238)           0.343 (0.246)
star          2.75e-07 (6.60e-07)     2.73e-07 (6.70e-07)
Adjusted R^2  0.621
**significant at p = 0.05

Table 2 also shows the result of applying Bayesian regression to the 8 independent variables. The estimates converged quickly. The Gelman-Rubin diagnostics and the 80% intervals for the estimates are summarized in Figure 2 of the appendix. The results are almost identical to those of the OLS model, except that the Bayesian regression gave a larger intercept. This difference is expected, because the BUGS model in Appendix B centers each independent variable at its sample mean. The significant coefficients and their estimates are not much different.

Likewise, Table 3 reports the regression with $\gamma_{1i}$ as the dependent variable. The adjusted $R^2$ is 0.367, and 4 variables, tv_fre, news_fre, SF, and revi, are significant at p = 0.05. Here, it is noteworthy that SF is significant for the slope and negatively related to it, which implies that admissions for SF movies drop more quickly than for other genres. Moreover, TV and newspaper advertising and positive reviews from professional film reviewers also helped keep a film running over a longer period.

Table 3: Relation between the slope ($\gamma_{1i}$) and variables

Variable      OLS                     Bayesian
Intercept     -7.080 (0.328)**        -4.430 (0.125)**
tv_fre        0.013 (0.006)**         0.013 (0.006)**
news_fre      0.040 (0.007)**         0.040 (0.007)**
dir_mean      5.2e-07 (1.05e-06)      4.6e-07 (1.1e-06)
ACAD          -0.196 (0.297)          -0.200 (0.300)
SF            -1.266 (0.506)**        -1.251 (0.522)**
revi          0.453 (0.102)**         0.449 (0.101)**
season        0.646 (0.332)           0.636 (0.344)
star          1.8e-07 (9.2e-07)       1.8e-07 (9.4e-07)
Adjusted R^2  0.367
**significant at p = 0.05

Table 3 also shows the results of applying Bayesian regression analysis to the 8
independent variables. As shown in Figure 3 of the appendix, the model shows no sign of divergence. As with the estimates for first-week sales (the intercept), the parameter estimates are almost identical except for the intercepts. We next compare the accuracy of the two methods using MSE and MAE.

4.2 Comparison of Two Forecast Models in the Level of Precision

A formal measure is required to determine which of the two models is superior in forecast precision. Typical measures of model fit are AIC and BIC for classical regression and DIC for Bayesian regression. However, AIC and BIC cannot be obtained from the Bayesian regression, and DIC cannot be calculated from the classical regression. Therefore, mean squared error (MSE) and mean absolute error (MAE) were used to evaluate the forecast accuracy of each model:

$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| \mathrm{Actual}_i - \mathrm{Predicted}_i \right| \qquad (3)$$

$$\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} \left( \mathrm{Actual}_i - \mathrm{Predicted}_i \right)^2 \qquad (4)$$

In these equations, $N$ is the sample size, which is 266 here. The results of evaluating forecast accuracy are reported in Table 4.
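As a small worked illustration of Equations (3) and (4), the following R sketch computes both measures for four made-up observations. The helper names `mean_abs_error` and `mean_sq_error` and the numbers are hypothetical and serve only to show the arithmetic.

```r
# Illustrative sketch of Equations (3) and (4); the numbers are made up.
mean_abs_error <- function(actual, predicted) mean(abs(actual - predicted))
mean_sq_error  <- function(actual, predicted) mean((actual - predicted)^2)

actual    <- c(10, 12, 9, 11)
predicted <- c(11, 10, 9, 12)
# Forecast errors are -1, 2, 0, -1

mae_value <- mean_abs_error(actual, predicted)  # (1 + 2 + 0 + 1) / 4 = 1
mse_value <- mean_sq_error(actual, predicted)   # (1 + 4 + 0 + 1) / 4 = 1.5

print(mae_value)
print(mse_value)
```

Note that taking `var()` of the forecast errors centers them at their mean and divides by N - 1, so it matches Equation (4) only approximately.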
Table 4: Comparison between the forecasting models

Parameter                   Measure   OLS        Bayesian
Intercept ($\gamma_{0i}$)   MSE       2.048069   2.048113
Intercept ($\gamma_{0i}$)   MAE       1.144300   1.144390
Slope ($\gamma_{1i}$)       MSE       3.986199   3.986285
Slope ($\gamma_{1i}$)       MAE       1.604464   1.605046

As the table shows, OLS is slightly better than the Bayesian regression in both mean absolute error and mean squared error, but the difference is negligible. In short, the two methods yield essentially the same accuracy in forecasting the patterns of weekly box-office revenue. Furthermore, the parameter estimates are almost identical, with the exception of the intercepts, for both first-week sales and the decay of sales revenue over time.

5 Conclusions

We forecast the box-office revenue of a motion picture from characteristics such as genre, reviews by movie critics, the star power of the main actor or actress, the director, advertising expenditure and so on. Our results show that advertising (TV and newspaper) and positive reviews have a positive effect on both opening-week revenue and the life cycle of a movie. The analysis also shows that SF movies are at a disadvantage relative to other genres, because their weekly sales decay more quickly. We tried Bayesian regression in addition to classical analysis. The two
methods have equivalent accuracy in estimating total admissions in the first week and the sales patterns of movies over time. For first-week revenue, the forecast fits fairly well (adjusted $R^2$ = 0.621). The forecast of sales over time, however, is less precise (adjusted $R^2$ = 0.367). This may be because we did not account for competition effects after a movie opens: a movie's weekly sales would drop quickly if a movie of the same genre or a blockbuster opens while it is on screen. Introducing measures of competition into the model should improve forecast accuracy.
APPENDIX

A R Code and R2OpenBUGS code

setwd( "C:/Users/THPyo/Documents/1Class/Compt in Stat/Project/analysis" )

# Print the first k rows of a data set
view <- function( dat, k ){
  message <- paste( "First", k, "rows" )
  krows <- dat[1:k,]
  cat( message, "\n", "\n" )
  print( krows )
}

## data import
weeksales <- read.csv( "C:/Users/THPyo/Documents/1Class/Compt in Stat/Project/data/weeksales.csv", sep = "," )
colnames(weeksales) <- c( "movieid", 1:(ncol(weeksales)-1) )
#view(weeksales, 10)
ad <- read.csv( "C:/Users/THPyo/Documents/1Class/Compt in Stat/Project/data/ad.csv", sep = ",", header = TRUE )
#view(ad, 10)
director <- read.csv( "C:/Users/THPyo/Documents/1Class/Compt in Stat/Project/data/director.csv", sep = ",", header = TRUE )
#view(director, 10)
genre <- read.csv( "C:/Users/THPyo/Documents/1Class/Compt in Stat/Project/data/genre.csv", sep = ",", header = TRUE )
#view(genre, 10)
review <- read.csv( "C:/Users/THPyo/Documents/1Class/Compt in Stat/Project/data/review.csv", sep = ",", header = TRUE )
#view(review, 10)
season <- read.csv( "C:/Users/THPyo/Documents/1Class/Compt in Stat/Project/data/season.csv", sep = ",", header = TRUE )
#view(season, 10)
star <- read.csv( "C:/Users/THPyo/Documents/1Class/Compt in Stat/Project/data/star.csv", sep = ",", header = TRUE )
#view(star, 10)

### Data Transformation ####
# data rearrangement: wide weekly sales -> long format (movieid, Week, sales)
dat.func <- function( x ) {
  dat <- rep( 0, ncol( x )*nrow( x )*3 )
  dim(dat) <- c( ncol( x ), 3, nrow( x ) )
  comb <- NULL
  a <- c( 0:( ncol( x )-1 ) ) ## change data interval
  for ( j in 1:nrow( x ) ) {
    for ( i in 1:( sum( !is.na( x[j,] ) ) ) ) {
      dat[i,1,j] <- j
      dat[i,2,j] <- a[i]
      # a missing weekly value is set to 1, so that its log is 0
      if ( is.na( x[j, i+1] ) ) {
        dat[i,3,j] <- 1
      } else {
        dat[i,3,j] <- x[j, i+1]
      }
      comb <- rbind( comb, dat[i,,j] )
    }
  }
  comb <- data.frame( comb )
  colnames( comb ) <- c( "movieid", "Week", "sales" ) ##change for new variable names
  print( comb ) # prints the rearranged data and returns it
}
rearranged <- dat.func( weeksales )

# log transformation
lnsales <- log( rearranged$sales )
newdata <- cbind( rearranged[,c(1,2)], lnsales )
#view(newdata, 30)

### regression for each obs
reg.out <- by( newdata, newdata$movieid, function( x ) lm( lnsales ~ Week, data = x ) )
out <- sapply( reg.out, coef )
# one row per film: intercept and slope
out <- matrix( out, ncol = nrow(out), byrow = TRUE )
out <- cbind( weeksales$movieid, out )
colnames(out) <- c( "movieid", "int", "coef" )
estimate <- data.frame( out )
#view(estimate, 10)

# data combine
revi <- ( review$rev1+review$rev2+review$rev12 )/3
movie <- data.frame( estimate$movieid, ad$tv_fre, ad$news_fre, director$dir_mean,
                     genre$acad, genre$sf, revi, season$dum_sea, star$star1_mean,
                     estimate$int, estimate$coef )
# lower-case "acad" and "sf" so that movie$acad and movie$sf below resolve
colnames(movie) <- c( "ID", "tv_fre", "news_fre", "dir_mean", "acad", "sf",
                      "revi", "season", "star", "int", "coef" )
#view(movie, 10)

#### regression analysis
summary( reg.int <- lm( movie$int ~ movie$tv_fre + movie$news_fre + movie$dir_mean +
                        movie$acad + movie$sf + movie$revi + movie$season + movie$star ) )
summary( reg.coef <- lm( movie$coef ~ movie$tv_fre + movie$news_fre + movie$dir_mean +
                         movie$acad + movie$sf + movie$revi + movie$season + movie$star ) )

#### Run Bayesian Regression
setwd( "C:/Users/THPyo/Documents/1Class/Compt in Stat/Project/analysis" )
# package names are case-sensitive; bugs() with a program argument is the R2WinBUGS interface
library(R2WinBUGS)
N <- nrow( movie )
v <- ncol( movie )-3
# sample means, used to center the independent variables in the BUGS model
mtv <- mean( movie$tv_fre )
mnews <- mean( movie$news_fre )
mdir_mean <- mean( movie$dir_mean )
macad <- mean( movie$acad )
msf <- mean( movie$sf )
mrevi <- mean( movie$revi )
mseason <- mean( movie$season )
mstar <- mean( movie$star )
tv_fre <- movie$tv_fre
news_fre <- movie$news_fre
dir_mean <- movie$dir_mean
acad <- movie$acad
sf <- movie$sf
revi <- movie$revi
season <- movie$season
star <- movie$star
int <- movie$int
coef <- movie$coef

## Bayesian Reg on Intercept (R2OpenBugs)
# names in the data list must match the R objects and the BUGS model (acad, sf)
intercept <- list ( "N", "v", "mtv", "mnews", "mdir_mean", "macad", "msf", "mrevi",
                    "mseason", "mstar", "tv_fre", "news_fre", "dir_mean", "acad", "sf",
                    "revi", "season", "star", "int" )
inits <- function() {
  list ( a = rnorm( 1,0,100 ), b = rnorm( v,0,100 ), prec = rgamma( 1, 0.001, 0.001 ) )
} # or prec
parameters <- c( "a", "b", "mean", "prec" ) # c("a", "b", "prec", "mean")
intercept.sim <- bugs ( intercept, inits, parameters, "inter_r.txt", n.chains=3,
                        n.iter=1000, debug = TRUE, program = "openbugs" )
# delete program = "openbugs" to use WinBUGS
print ( intercept.sim, digits=10 )
plot ( intercept.sim )

## Bayesian Reg on Coefficients (R2OpenBugs)
# names in the data list must match the R objects and the BUGS model (acad, sf)
coefficient <- list ( "N", "v", "mtv", "mnews", "mdir_mean", "macad", "msf", "mrevi",
                      "mseason", "mstar", "tv_fre", "news_fre", "dir_mean", "acad", "sf",
                      "revi", "season", "star", "coef" )
coefficient.sim <- bugs ( coefficient, inits, parameters, "coef_r.txt", n.chains=3,
                          n.iter=1000, debug = TRUE, program = "openbugs" )
print ( coefficient.sim, digits=8 )
plot ( coefficient.sim )

#### Model Comparison Using MSE
# calculating MSE for Intercepts (as the variance of the forecast errors)
pred.intercept.reg <- predict( reg.int )
pred.int <- cbind( movie$int, pred.intercept.reg, intercept.sim$mean$mean )
colnames( pred.int ) <- c( "orig", "reg", "sim" )
Int.MSE.reg <- var( pred.int[,1]-pred.int[,2] )
Int.MSE.sim <- var( pred.int[,1]-pred.int[,3] )
compare.int <- c( Int.MSE.reg, Int.MSE.sim )
names( compare.int ) <- c( "MSE for Reg on Intercept", "MSE for Sim on Intercept" )
compare.int

# calculating MSE for Coefficients (as the variance of the forecast errors)
pred.coef.reg <- predict( reg.coef )
pred.coef <- cbind( movie$coef, pred.coef.reg, coefficient.sim$mean$mean )
colnames( pred.coef ) <- c( "orig", "reg", "sim" )
coef.mse.reg <- var( pred.coef[,1]-pred.coef[,2] )
coef.mse.sim <- var( pred.coef[,1]-pred.coef[,3] )
compare.coef <- c( coef.mse.reg, coef.mse.sim )
names( compare.coef ) <- c( "MSE for Reg on Coefficient", "MSE for Sim on Coefficient" )
compare.coef

# calculating MAE for Intercepts
library(MSBVAR) # provides mae(); package names are case-sensitive
Int.MAE.reg <- mae( pred.int[,1], pred.int[,2] )
Int.MAE.sim <- mae( pred.int[,1], pred.int[,3] )
MAE.compare.int <- c( Int.MAE.reg, Int.MAE.sim )
names( MAE.compare.int ) <- c( "MAE for Reg on Intercept", "MAE for Sim on Intercept" )
MAE.compare.int
# calculating MAE for Coefficients
coef.mae.reg <- mae( pred.coef[,1], pred.coef[,2] )
coef.mae.sim <- mae( pred.coef[,1], pred.coef[,3] )
MAE.compare.coef <- c( coef.mae.reg, coef.mae.sim )
names( MAE.compare.coef ) <- c( "MAE for Reg on Coefficient", "MAE for Sim on Coefficient" )
MAE.compare.coef
################### END OF CODE ###########################

B OpenBUGS Code for The Opening Week

model {
  for (i in 1:N) {
    int[i] ~ dnorm(mean[i], prec)
    mean[i] <- a + b[1]*(tv_fre[i]-mtv) + b[2]*(news_fre[i]-mnews)
                 + b[3]*(dir_mean[i]-mdir_mean) + b[4]*(acad[i]-macad)
                 + b[5]*(sf[i]-msf) + b[6]*(revi[i]-mrevi)
                 + b[7]*(season[i]-mseason) + b[8]*(star[i]-mstar)
  }
  for (i in 1:v) {
    b[i] ~ dnorm(0, 1.0E-6)
  }
  a ~ dnorm(0, 1.0E-6)
  prec ~ dgamma(0.001, 0.001)
}

C Graphs from R2OpenBUGS
Figure 2: R2OpenBUGS Output for the Intercept ($\gamma_{0i}$). Bugs model at "inter_r.txt", fit using OpenBUGS, 3 chains, each with 1000 iterations (first 500 discarded). [Plot panels: 80% interval for each chain and R-hat for a, b[1] to b[8], mean (array truncated for lack of space) and prec; medians and 80% intervals for a, b, mean, prec and deviance.]
Figure 3: R2OpenBUGS Output for the slopes ($\gamma_{1i}$). Bugs model at "coef_r.txt", fit using OpenBUGS, 3 chains, each with 1000 iterations (first 500 discarded). [Plot panels: 80% interval for each chain and R-hat for a, b[1] to b[8], mean (array truncated for lack of space) and prec; medians and 80% intervals for a, b, mean, prec and deviance.]