Business Intelligence from Social Media

Business Intelligence Analytics

A Study from the VAST Box Office Challenge

Yafeng Lu, Feng Wang, and Ross Maciejewski
Arizona State University

IEEE Computer Graphics and Applications, September/October 2014. Published by the IEEE Computer Society. 0272-1716/14/$31.00 © 2014 IEEE

This visual-analytics toolkit extracts data from Twitter and Bitly to predict movie revenue and ratings. Its interactivity provides benefits that a purely statistical approach can't. The approach is generalizable to other domains involving social media data, such as sales forecasting and advertisement analysis.

Social media presents a promising, albeit challenging, source of data for business intelligence. Customers voluntarily discuss products and companies, giving a real-time pulse of brand sentiment and adoption. Unfortunately, such data is noisy and unstructured, making it difficult to extract real-time intelligence. So, using such data can be time-consuming and cost-prohibitive for businesses.

One promising direction is to apply visual analytics (VA). Recently, the VA community has begun focusing on extracting knowledge from unstructured social media data. Studies have ranged from geotemporal anomaly detection [2,3] to topic extraction [4] and customer sentiment analysis [5]. The development of tools for such analyses now lets users explore this rich information source and mine it for business intelligence.

One key area for business intelligence is revenue prediction. In particular, owing to the abundance of social media discussions on movies, movie revenue prediction has drawn much attention from both the movie industry and academia. Prediction methods have employed movie metadata, social media data, and Google search volumes (for some examples, see the Related Work sidebar). Such methods have demonstrated the benefits of extracting business intelligence from social media for predicting movie revenue. However, they've relied solely on automated extraction and knowledge prediction.

We've developed a VA toolkit for predicting opening-weekend revenue and viewer-rating scores of upcoming movies. It consists of a Web-deployable series of linked visualization views that combine data mining with statistical techniques. To demonstrate our toolkit's effectiveness, we report on the results of the 2013 Visual Analytics Science and Technology (VAST) Box Office Challenge (www.boxofficevast.org/vast-welcome.html). These results also let us explore the hypothesis that VA can help users develop better movie revenue predictions, compared to a purely statistical solution.

Such a VA approach for social media analysis and forecasting is directly applicable to a wide range of business intelligence problems. Understanding how information spreads, as well as the underlying sentiment of the messages being spread, can give analysts critical insight into the general pulse of their brand or product. Developing a set of quick-look visualization tools for an overview of such social media data, and linking these tools to the models that business analysts generate for deploying new products, advertising campaigns, and sales forecasts, can be crucial. Our toolkit can also be used to explore other business-related social media data (for example, to see how well an ad campaign did and the pattern of information spreading). Such exploration can help adjust business decisions.

Tools for Movie Predictions

Our toolkit lets users quickly extract, visualize, and clean information from social media sources.

To create predictions, it integrates visual analytics with linear regression, temporal modeling, and sentiment analysis.

Tweet Mining

For tweet mining, we focused on structured data from the Internet Movie Database (IMDb) (for example, the genre, budget, and review rating) and unstructured data from social media (for example, movie-related tweets and blog posts). Whereas extracting structured data is relatively straightforward, unstructured data requires much preprocessing and manipulation. We collected tweets during the two weeks before the release date, on the basis of the hashtag provided by a movie's official Twitter account.

We wanted tools that can extract a variety of metrics from IMDb and Twitter. Table 1 summarizes the metrics we found most useful. Several of them require data mining and cleaning. To facilitate this, we developed tools to present the volume of tweets at various levels of temporal aggregation (see Figure 1a), let users remove unrelated tweets from the aggregate metrics, and let users extract and manually adjust a tweet's sentiment (see Figures 1b through 1d).

Table 1. Metrics we found useful.
OW: The three-day opening-weekend revenue.
Budget: The approximate movie budget (in US$ millions) according to the Internet Movie Database (IMDb).
Genre: The movie's genre according to IMDb.
TUser: The number of unique users who tweeted about a movie.
TBD: The average daily number of tweets during the two weeks before the movie's release.
TSS: Tweet sentiment score, a summation of each word's sentiment polarity as calculated with SentiWordNet [6].
MSS: Movie sentiment score, a derivation of a movie's overall sentiment.
MSP: Movie star power, a summation of the Twitter followers of the three highest-billed movie stars (as listed by IMDb).

To approximate the popular sentiment of a movie, we process each tweet using SentiWordNet, a dictionary-based classifier [6]. First, we assign each word in the tweet a score from -1 to 1, with -1 being the most negative sentiment and 1 being the most positive sentiment. Next, we assign each tweet a sentiment score (TSS) by summing the sentiment scores of all the words in the tweet and scaling the range from -0.5 to 0.5. Finally, we calculate the movie sentiment score (MSS):

\[ \mathrm{MSS} = \frac{\text{Positive Score}}{\text{Positive Score} + \text{Negative Score}}, \]

where Positive Score is the sum of the TSSs of all tweets for a given movie with TSS > 0 and Negative Score is the absolute value of the sum of the TSSs of all tweets for a given movie with TSS < 0.

Our toolkit visualizes the extracted TSSs for the users. Figures 1b through 1d show the bubble plot, the sentiment river, and the sentiment wordle. The sentiment wordle visualizes the 200 most frequently mentioned words. Both the bubble plot and wordle enable interactive searching and filtering by keywords and users. Users can remove irrelevant tweets from the tweet count and modify mismatched sentiment.

Figure 1. Tweet trend and sentiment views for the movie Despicable Me 2. (a) Line charts and bar graphs showing the number of tweets per day and the predictions. (b) A tweet bubble plot in which blue represents positive sentiment and red represents negative sentiment. A bubble's size represents how many times a tweet has been retweeted; the x-axis is time, and the y-axis is how many followers the person who submitted the tweet has. (c) A sentiment river view that aggregates sentiment over four-hour intervals. Positive sentiment is red; negative sentiment is blue. Users can select an area on the river to see the ratio of positive to negative sentiment. (d) A sentiment wordle in which a word's size represents how many times it was used in a tweet and in which its color represents sentiment. Users can click on a word to view the tweets containing it.

The primary use we found for the views in Figure 1 was data cleaning. The primary lesson learned was that visualization tools are a necessity for data cleaning, owing to the noisiness of social media data and the problems inherent in matching sentiment with a sentiment dictionary. (For example, phrases such as "I want to see this movie so bad" are marked as negative because of the word "bad," and words such as "Despicable" are marked as negative even though they're merely references to a movie title.) The wordle provides a quick way to assess the sentiment of popular words. However, to fully explore a tweet's context, users must hover over the bubble plot or open a tweet list view through the search bar.

Our implementation of the toolkit (which we describe later) demonstrated that these views were more effective for cleaning and overview than for model analysis. The need for tools to extract the correct metrics for regression modeling is a major hurdle for using social media data for business intelligence. The bubble plot and wordle plot helped us deal with the challenges of sentiment analysis and cleaning the noise from social media data.

Related Work in Predicting Movie Revenue (sidebar)

An early study by Jeffrey Simonoff and Ilana Sparrow predicted movie revenue with a logged response regression model using metadata features (for example, the time of year, genre, and Motion Picture Association of America rating) as categorical regressors [1]. Wenbin Zhang and Steven Skiena enhanced regression models based on metadata features by using variables extracted from news sources [2]. Mahesh Joshi and his colleagues explored the relationship between film critic reviews and movie revenue [3]. Sitaram Asur and Bernardo Huberman found that the rate of tweets per day explained nearly 80 percent of the variance in movie revenue prediction [4]. Finally, a recent Google white paper claimed 94 percent accuracy in movie revenue prediction, using the volume of Internet trailer searches for a given movie title [5].

References
1. J.S. Simonoff and I.R. Sparrow, "Predicting Movie Grosses: Winners and Losers, Blockbusters and Sleepers," Chance, vol. 13, no. 3, 2000, pp. 15-24.
2. W. Zhang and S. Skiena, "Improving Movie Gross Prediction through News Analysis," Proc. IEEE/WIC/ACM Int'l Joint Conf. Web Intelligence and Intelligent Agent Technology, 2009, pp. 301-304.
3. M. Joshi et al., "Movie Reviews and Revenues: An Experiment in Text Regression," Human Language Technologies: The 2010 Ann. Conf. North Am. Chapter of the Assoc. for Computational Linguistics, 2010, pp. 293-296.
4. S. Asur and B.A. Huberman, "Predicting the Future with Social Media," Proc. IEEE/WIC/ACM Int'l Conf. Web Intelligence and Intelligent Agent Technology, 2010, pp. 492-499.
5. R. Panaligan and A. Chen, Quantifying Movie Magic with Google Search, white paper, Google, 2013.

Figure 2. Our interactive Bitly classification widget. In the center are the unclassified links, which the user can click and classify, as seen in the floating window. The upper left is a plot of review scores by click counts, with a line for the average review score.

Bitly Mining

Here, we explored long-form text by extracting Bitly links containing movie keywords.
These links typically consisted of review articles or news reports about the movies (or, in many cases, unrelated news; for example, when the movie The Heat was released, the Miami Heat basketball team had just won the National Basketball Association championship).

We developed an interactive tool for extracting prescreening review scores embedded in Bitly links (see Figure 2). Initially, each Bitly link is unclassified and represented in a pixel matrix (the color saturation corresponds to how many times a link was clicked). When users click on an unclassified square, a pop-up box appears with a brief bit of text from the article. Users can follow the link to scan the article for review scores and manually assign a score to an article or classify it as news or unrelated. For analysis, the tool provides a plot of review scores from articles versus how many times an article was accessed (see the upper-left graph in Figure 2). The predicted review score is an average of the extracted review scores normalized into one scale.

This tool allows for quick data filtering and extraction. For example, users can easily separate reviews of the Star Trek video game from reviews of the Star Trek movie, which would be difficult to automatically encode. Furthermore, the pixel matrix's color coding can serve as a metric for classifying only those articles with a substantial number of views.

As with tweet mining, we learned here that extracting information from Bitly can be difficult to fully automate. As in the Star Trek example, multiple products related to a movie might be released and reviewed at the same time. Furthermore, review scores might vary, from "two thumbs up" to "4 out of 5 stars" to "6 out of 10." With the user in the loop, these scores can be mapped to the user's own base system (in the case of our contest entry, our metric was "x out of 10").
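
The score normalization described above can be sketched in a few lines of Python. This is a minimal illustration, not the toolkit's implementation: the function names, the accepted text patterns, and especially the mapping of verbal verdicts such as "two thumbs up" to a number are assumptions made for the example, since the article leaves that mapping to the analyst.

```python
import re
from statistics import mean

# Hypothetical mapping for verbal verdicts. The article leaves this choice
# to the analyst, so these values are illustrative assumptions only.
VERDICT_SCORES = {"two thumbs up": 10.0, "two thumbs down": 0.0}

# Matches forms such as "4 out of 5 stars", "6 out of 10", or "7.5/10".
SCORE_PATTERN = re.compile(
    r"(\d+(?:\.\d+)?)\s*(?:out of|/)\s*(\d+(?:\.\d+)?)(?:\s*stars?)?"
)


def to_ten_point(raw):
    """Map one manually extracted review score onto an 'x out of 10' scale."""
    text = raw.strip().lower()
    if text in VERDICT_SCORES:
        return VERDICT_SCORES[text]
    match = SCORE_PATTERN.fullmatch(text)
    if match is None:
        raise ValueError(f"unrecognized review score: {raw!r}")
    value, scale = float(match.group(1)), float(match.group(2))
    if scale <= 0 or value > scale:
        raise ValueError(f"inconsistent review score: {raw!r}")
    return 10.0 * value / scale


def predicted_review_score(raw_scores):
    """Average the normalized scores, as the predicted review score is described."""
    return mean(to_ten_point(score) for score in raw_scores)
```

For instance, `predicted_review_score(["4 out of 5 stars", "6 out of 10", "two thumbs up"])` averages 8, 6, and 10 under these assumed mappings. Keeping the verdict table separate from the numeric pattern makes the analyst's subjective choices explicit and easy to change.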

Regression Modeling

Once we completed data cleaning and variable extraction, we used the social media metrics to develop a model to predict movie revenue and review scores. Traditional variables used in movie revenue prediction models include structured variables (for example, the Motion Picture Association of America [MPAA] rating and movie budget) and derived measures (for example, movie stars' popularity and popular sentiment regarding the movie).

Figure 3. The weekend prediction view for newly released movies and the prediction adjustment widget. This view shows the weekend when Despicable Me 2 and The Lone Ranger were released. (a) A bar graph showing the actual value, submitted prediction, and model prediction. (b) A stacked bar graph showing the predicted weekend revenue overlaid with the upcoming movie's regression model prediction. (c) The prediction adjustment widget, for modifying the total weekend revenue prediction. (The predicted values for the new movies remain proportional.) (d) The adjustment widget, for changing individual predictions. The gray box represents the total weekend revenue.

On the basis of our initial literature search, we used multiple linear regression for an initial prediction range for the opening-weekend movie revenue (OW). (For a brief introduction to multiple linear-regression modeling, see the related sidebar.) We explored a variety of variables that could be mined from the contest (see Table 1). After initial model fitting and evaluation using R [7], we found our best fit to be

\[ \mathrm{OW} = \beta_0 + \beta_1\,\mathrm{TBD} + \beta_2\,\mathrm{Budget} + \varepsilon, \]

where each \(\beta\) is a coefficient parameter and \(\varepsilon\) is the error term. We updated the model weekly as new movies entered the dataset. We fit the parameters using movie data beginning in January 2013. Our first prediction, for the 17 May weekend, used data from 39 movies for training. Our weekly models reported an adjusted \(R^2\) of approximately 0.60, with p < 0.5. Our final parameters were \(\beta_0 \approx 4.9 \times 10^{3}\), \(\beta_1 \approx 4{,}462\), and \(\beta_2 \approx 2.3 \times 10^{5}\). Unfortunately, this model doesn't fit the data overly well, and predictions have a large variance.
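
To make the model form concrete, the following Python sketch fits a two-regressor model of the same shape (intercept, TBD, Budget) by ordinary least squares and reports the adjusted R-squared. It is an illustration only: the authors fit their model in R, the helper names here are hypothetical, and the numbers in the usage example are made up rather than taken from the contest data.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system: regressors are collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * p for x, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def fit_ols(regressor_rows, response):
    """Fit y = b0 + b1*x1 + ... + bk*xk; return (coefficients, R^2, adjusted R^2)."""
    design = [[1.0, *row] for row in regressor_rows]  # leading 1 is the intercept
    k = len(design[0])
    # Normal equations: (X^T X) b = X^T y
    xtx = [[sum(x[i] * x[j] for x in design) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * y for x, y in zip(design, response)) for i in range(k)]
    beta = solve_linear_system(xtx, xty)

    fitted = [sum(b * v for b, v in zip(beta, x)) for x in design]
    mean_y = sum(response) / len(response)
    ss_res = sum((y - f) ** 2 for y, f in zip(response, fitted))
    ss_tot = sum((y - mean_y) ** 2 for y in response)
    r2 = 1.0 - ss_res / ss_tot
    n, p = len(response), k - 1  # p regressors, excluding the intercept
    adjusted_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)
    return beta, r2, adjusted_r2


# Made-up rows of (average daily tweets, budget in US$ millions) and
# made-up opening-weekend revenues; not data from the contest.
rows = [(1.0, 2.0), (2.0, 1.0), (3.0, 5.0), (4.0, 3.0), (5.0, 8.0)]
revenue = [2.0 + 3.0 * tbd + 0.5 * budget for tbd, budget in rows]
coefficients, r2, adjusted_r2 = fit_ols(rows, revenue)
```

Because the toy response is generated exactly from the coefficients 2, 3, and 0.5, the fit recovers them and R-squared is essentially 1; with real tweet and budget data the fit would be far noisier, as the adjusted R-squared of about 0.60 reported above indicates.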
For comparison, a linear-regression model using Google search volumes explained more than 90 percent of the variance on movie revenue performance [8]. Also, models by Sitaram Asur and Bernardo Huberman produced an adjusted \(R^2\) of over 90 percent with the number of theaters as a regressor [9]. However, we hypothesized that a VA toolkit could partly help users overcome poor data (due partly to noise in social media data and partly to the closed-world nature of the contest).

To facilitate better model prediction, we created a simple bar graph view (see Figure 3a). For past movies, it shows the model prediction, its 95 percent confidence interval error range, the submitted prediction, and the actual movie revenue. For new movies, it shows only the model prediction and submitted prediction. This view was critical in our analysis. The primary view of the data consists of an overview of the tweets per day and the predictions for the selected movies (see Figure 1a).

Linear-Regression Model Construction and Evaluation (sidebar)

Regression analysis is one of the most common methods of pattern detection and multifactor analysis. With a proper regression model, analysts can better describe, interpret, and predict data.

The Linear-Regression Model

A k-variable linear-regression model has this basic form:

\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon, \]

where \(y\) is the response; each \(\beta_i\) is an unknown parameter; \(x_i\), \(i = 1, 2, \ldots, k\), are the regressors; and \(\varepsilon\) is the error term. The goal is to define a relationship between the response and regressors by solving for the linear coefficients that best map the regressors to the response. The linear-regression model is most often written as a matrix, such that

\[ Y = X\beta + \varepsilon, \quad Y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}, \quad X = \begin{bmatrix} 1 & x_{11} & \cdots & x_{1k} \\ 1 & x_{21} & \cdots & x_{2k} \\ \vdots & \vdots & & \vdots \\ 1 & x_{n1} & \cdots & x_{nk} \end{bmatrix}, \quad \beta = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{bmatrix}. \]

For multiple regression models, you can use higher-order terms to model the response (for example, second-order variables are of the form \(x_i^2\) and \(x_i x_j\)). However, for the research described in the main article, we focused on the simple linear-regression model.

Parameter Estimation

To solve for \(\beta_i\), the ordinary least squares (OLS) solution is most often employed. This assumes normality for the data. However, if this assumption isn't valid, a maximum-likelihood estimation would be employed (which is equivalent to OLS under the assumption of normality). For OLS, we wish to minimize

\[ S(\beta) = \sum_{i=1}^{n} \varepsilon_i^2 = \varepsilon^T \varepsilon = (y - X\beta)^T (y - X\beta), \]

where \(S\) indicates the least-squares function, by satisfying

\[ \left. \frac{\partial S}{\partial \beta} \right|_{\hat{\beta}} = -2X^T y + 2X^T X \hat{\beta} = 0, \]

where \(\partial\) indicates a partial derivative. The solution takes the form \(\hat{\beta} = (X^T X)^{-1} X^T Y\), and the prediction function is \(\hat{Y} = HY\), where \(H = X (X^T X)^{-1} X^T\). In first-order multiple linear regression, the predicted response is a linear combination of observations.

Model Selection

In a multiple-variable dataset with a single response variable, analysts traditionally face a large set of potential linear-regression models consisting of various regressors and orders. For example, in movie revenue prediction, the response could be related to the number of tweets per day, the number of theaters the movie is released in, or any combination of variables. To decide which model to use in prediction, analysts typically consider four principles. Don't violate the scientific principle, if one exists, behind the dataset. Maintain a sense of parsimony to keep the order of the model and the number of regressors as low as possible. Keep an eye on extrapolation: regression fits data in a given regressor space, and there's no guarantee that the same model applies to other data outside this space. Always check evaluation plots more than the statistics; residual plots and normal plots help show outliers and lack of fit.

To verify a model's efficacy, analysts typically rely on a variety of statistical graphics to determine the critical variables in the model, that is, those that explain the most variation with the simplest form [2]. Evaluation of a model's effective fit usually involves three statistics. The p-value shows a regression model's significance, where p < 0.05 indicates the model is significant at the 95 percent confidence level. \(R^2\) and the adjusted \(R^2\) generally describe the percentage of variance explained by a given model. The adjusted \(R^2\) takes into consideration the degrees of freedom and should be used in multiple regression to compensate for the increased variance when adding regressors. A model is typically selected when its p-value is small, its \(R^2\) or adjusted \(R^2\) is high, and it has a relatively simple form with reasonable residual distributions.

References
1. D.C. Montgomery, E.A. Peck, and G.G. Vining, Introduction to Linear Regression Analysis, John Wiley & Sons, 2012.
2. T. Mühlbacher and H. Piringer, "A Partition-Based Framework for Building and Validating Regression Models," IEEE Trans. Visualization and Computer Graphics, vol. 19, no. 12, 2013, pp. 1962-1971.

Temporal Modeling

The regression model provides one point for analysis; we wanted to also provide a big-picture overview. For any given weekend, there's likely a maximum amount of money available in the market. To approximate the total available money, we employed a simple moving-average model. Limitations here included access to data (historical weekend revenues weren't available, and after a movie opened, further weekend revenues were no longer reported in the contest). To compensate for this, we approximated subsequent weekend revenues for movies, assuming that movies would run for three weeks following their opening weekend and that each weekend their revenue would decrease by 50 percent. So, for any given weekend, we approximated the revenue as

\[ \mathrm{WeekendRevenue}(t) = \sum_{i} \mathrm{OW}_i(t) + \sum_{i,\,j=1}^{j=3} 0.5^{j}\,\mathrm{OW}_i(t-j), \]

where \(t\) is the current weekend and \(i\) is the index to a movie that exists at \(t\). Then, for the weekend revenue prediction, we used a moving average:

\[ \mathrm{WeekendRevenue}(t+1) = \frac{1}{3} \sum_{j=0}^{j=2} \mathrm{WeekendRevenue}(t-j). \]

Finally, we approximated the available revenue for new movies as

\[ \mathrm{NewMovieRevenue}(t+1) = \mathrm{WeekendRevenue}(t+1) - \sum_{i,\,j=1}^{j=3} 0.5^{j}\,\mathrm{OW}_i(t+1-j). \quad (1) \]

Although this prediction is crude, it gives users a valuable bound in which to explore the revenue predictions.

Our toolkit provides two views of the results from the weekend revenue prediction and the linear-regression model. The first view combines a linked bar graph with stacked bars (see Figure 3b). The graph's primary portion consists of gray bars indicating the predicted total weekend revenue for the new movies. The short dark-gray line indicates the actual weekend revenue for each calendar week shown on the x-axis. The stacked bar graph appears only for the analyzed weekend; the colors are the same as in the prediction bar graph.

The second view (see Figures 3c and 3d) lets users interactively adjust predictions while visualizing the bounds of the total weekend revenue prediction. A gray rectangle's area is scaled linearly to the total weekend revenue prediction. Colored rectangles are superimposed onto the gray rectangle; each colored rectangle's area represents the linear-regression prediction for a movie released on that weekend. If the sum of the individual predictions is equal to the total prediction, the colored rectangles will fit exactly into the gray rectangle. The colors are the same as in the bar graph; modifying a bar's size in any view modifies the size across all views.

Users can perform three types of prediction adjustments. They can change the total weekend revenue prediction, but the ratio between the movies will remain consistent. They can change an individual movie revenue prediction, but the total weekend revenue prediction will remain consistent. They can arbitrarily change each movie's revenue prediction and ignore the total weekend revenue. By implementing and integrating multiple comparison methods, we could quickly bound our analysis.
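
The temporal model and the first adjustment type described above can be sketched as follows. This is a minimal Python illustration under stated assumptions: `openings` is a hypothetical list in which entry `t` holds the opening-weekend revenues (in US$ millions) of the movies opening on weekend `t`, all function names are invented for the example, and the revenues in the usage lines are made up.

```python
DECAY = 0.5     # each weekend a movie is assumed to earn half of the previous one
RUN_WEEKS = 3   # movies are assumed to run three weekends after opening


def carryover(openings, t):
    """Decayed revenue at weekend t from movies that opened on weekends t-1..t-3."""
    return sum(
        DECAY ** j * sum(openings[t - j])
        for j in range(1, RUN_WEEKS + 1)
        if t - j >= 0
    )


def weekend_revenue(openings, t):
    """Approximate total revenue at weekend t: new openings plus decayed carryover."""
    return sum(openings[t]) + carryover(openings, t)


def forecast_next_weekend(openings, t, window=3):
    """Moving average of the approximated revenue over weekends t, t-1, ..., t-window+1."""
    if t < window - 1:
        raise ValueError("not enough history for the moving average")
    return sum(weekend_revenue(openings, t - j) for j in range(window)) / window


def new_movie_revenue(openings, t):
    """Revenue available to movies opening at t+1, in the manner of Equation 1."""
    return forecast_next_weekend(openings, t) - carryover(openings, t + 1)


def rescale_to_total(predictions, total):
    """Change the total prediction while keeping the ratio between movies fixed."""
    current = sum(predictions.values())
    return {movie: value * total / current for movie, value in predictions.items()}


# Made-up opening revenues for four consecutive weekends (US$ millions).
openings = [[80.0], [40.0, 20.0], [100.0], [60.0]]
available = new_movie_revenue(openings, t=3)
adjusted = rescale_to_total({"Movie A": 76.0, "Movie B": 85.0}, total=available)
```

With these toy numbers, the approximated revenue at the last weekend is 135, the three-weekend moving average is about 128.3, and roughly 65.8 remains for new movies once the carryover of 62.5 is subtracted. `rescale_to_total` then shrinks the two individual predictions so that they fit that bound, which corresponds to the colored rectangles filling the gray rectangle exactly.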
Although flexible, these bounds provided an early estimate of the total weekend revenue with which to compare the predictions of our linear-regression models. Although our temporal predictions were of low quality, the combination of predictions and bounding of the problem space provided critical information for comparison and analysis.

Overall, adding multiple models predicting similar information can help guide users to a better ground truth. Like the Delphi method, which solicits predictions from multiple experts and uses them to come to a common conclusion [10], our toolkit lets users solicit predictions from multiple models to aid their analysis. Users can employ this bounded adjustment widget for other hierarchical predictions that have both individual and total predictions, such as subtopic trend prediction in a time period.

Similarity Visualization

The similarity widget lets users quickly find and compare the accuracy of predictions on the basis of various similarity criteria. They can determine whether the given prediction model typically underestimates, overestimates, or is relatively accurate regarding movies they deem similar. So, they can further refine their final prediction for both revenue and review scores. We've defined eight similarity criteria; Table 2 shows them and their distance measurements. In all similarity matches, our toolkit shows the top five most similar movies.

These views let users directly compare tweet trends and sentiment words between movies deemed similar in a category. Figure 4 contains snapshots from the Despicable Me 2 similarity page, showing line charts using the MPAA criterion, a wordle using the sentiment wordle criterion, and a theme river using the sentiment river criterion. Although all the variables used in our similarity metrics could also be used in the linear-regression model, the modeling results indicated that these variables weren't significant in altering the model.
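
As a rough illustration of how a top-five similarity match could work, the Python sketch below ranks candidate movies by two of the tweet-based criteria. It rests on assumptions: the tweets-by-day distance is read as a sum of absolute daily differences, the trend distance as the same sum after dividing each series by its own peak, and all function names and data are hypothetical rather than the toolkit's own.

```python
def tweets_by_day_distance(tbd_v, tbd_s):
    """Sum of absolute differences between two movies' daily tweet counts."""
    if len(tbd_v) != len(tbd_s):
        raise ValueError("both movies need the same number of days")
    return sum(abs(a - b) for a, b in zip(tbd_v, tbd_s))


def trend_distance(tbd_v, tbd_s):
    """Same comparison after scaling each series by its own maximum."""
    if len(tbd_v) != len(tbd_s):
        raise ValueError("both movies need the same number of days")
    peak_v, peak_s = max(tbd_v), max(tbd_s)
    if peak_v == 0 or peak_s == 0:
        raise ValueError("a series with no tweets has no trend to compare")
    return sum(abs(a / peak_v - b / peak_s) for a, b in zip(tbd_v, tbd_s))


def most_similar(target, candidates, distance, top_n=5):
    """Return the titles of the top_n candidates closest to the target series."""
    ranked = sorted(candidates, key=lambda title: distance(target, candidates[title]))
    return ranked[:top_n]


# Made-up two-day series, shortened from the two weeks the toolkit collects.
candidates = {"A": [10, 21], "B": [50, 60], "C": [10, 20]}
closest = most_similar([10, 20], candidates, tweets_by_day_distance, top_n=2)
```

Passing the distance function as a parameter mirrors the widget's idea of switching criteria: the ranking logic stays the same while the notion of "similar" changes.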
However, by providing users with insight into these secondary variables, coupled with the weekend modeling, our toolkit lets them further refine predictions. For example, users might compare the absolute difference between the tweets of two movies or inspect the trend of the tweets through line chart comparison using the tweet-changing-trend criterion. Users can also quickly compare the selected movies to recently released movies with the same MPAA rating or genre. In addition, they can compare the popularity of the movies' stars, which is based on how many Twitter followers the stars have.

Implementing the Toolkit

In the VAST 2013 Box Office Challenge, we used our toolkit to predict 23 movies over three months. Here, we give an example based on the July 4th holiday in the US, when Despicable Me 2 and The Lone Ranger were released.

Table 2. Calculations of similarity criteria. (v and s are the two movies being compared; card is the cardinality.)
Number of tweets by day: \( \mathrm{Dis}(v,s) = \sum_{i=1}^{14} \lvert \mathrm{TBD}_i(v) - \mathrm{TBD}_i(s) \rvert \)
Tweet changing trend: \( \mathrm{Dis}(v,s) = \sum_{i=1}^{14} \left\lvert \frac{\mathrm{TBD}_i(v)}{\max\{\mathrm{TBD}_j(v),\ j = 1, 2, \ldots, 14\}} - \frac{\mathrm{TBD}_i(s)}{\max\{\mathrm{TBD}_j(s),\ j = 1, 2, \ldots, 14\}} \right\rvert \)
Sentiment river: \( \mathrm{Dis}(v,s) = \sum_{i=1}^{14} \left\lvert \frac{\mathrm{MSS}_i(v)}{\max\{\mathrm{MSS}_j(v),\ j = 1, 2, \ldots, 14\}} - \frac{\mathrm{MSS}_i(s)}{\max\{\mathrm{MSS}_j(s),\ j = 1, 2, \ldots, 14\}} \right\rvert \)
MSS: \( \mathrm{Dis}(v,s) = \lvert \mathrm{MSS}(v) - \mathrm{MSS}(s) \rvert \)
MPAA: The same Motion Picture Association of America rating and close release dates.
Genre: \( \mathrm{Dis}(v,s) = \frac{2\,\mathrm{card}(\mathrm{Genre}(v) \cap \mathrm{Genre}(s))}{\mathrm{card}(\mathrm{Genre}(v)) + \mathrm{card}(\mathrm{Genre}(s))} \)
MSP: \( \mathrm{Dis}(v,s) = \lvert \mathrm{MSP}(v) - \mathrm{MSP}(s) \rvert \)
Sentiment wordle: \( \mathrm{Dis}(v,s) = \frac{\mathrm{card}(\mathrm{SWordle}(v) \cap \mathrm{SWordle}(s))}{\mathrm{card}(\mathrm{SWordle}(v))} \)

Figure 4. User-defined similarity views cropped to show the most similar movies. On the top in the middle are graphs using the MPAA criterion. On the top right are graphs of the actual opening-weekend revenue, our final prediction, and the prediction range. The circled star shows the review score. On the bottom left is a wordle using the sentiment wordle criterion; on the bottom right is a theme river using the sentiment river criterion. (For an explanation of these criteria, see Table 2.)

Table 3. Competitors' performance in the 2013 VAST Box Office Challenge. The average error for revenue predictions is in millions of dollars. Each row lists, first for the revenue predictions and then for the viewer-rating predictions, the number of predictions, the average error, the standard deviation, and the MRAE (mean relative absolute error).
Our team (VADER): 23.23 9.46 0.467 23 0.487 0.460 0.075
Team Prolix: 23 6.466 5.95 0.424 20 0.820 0.640 0.29
Uni Konstanz Boxoffice: 4 7.056 5.743 3.929 2 0.905.59 0.095
CinemAviz: 2 7.29 7.677.970 2 0.738 0.559 0.4
Team Turboknopf: 8 2.900 5.606 0.685 8 0.54 0.426 0.079
elvertoncf UFMG: 3 2.677 9.806 3.009 3.323 0.328 0.259
Philipp Omentisch: 5 30.657 38.028 0.678 5 0.500 0.324 0.07
CDE IIIT: 2 60.600 62.084 0.537 2 0 0 0

Predicting Review Scores

To predict IMDb review scores, we first entered the Bitly view for each movie. We manually extracted review scores from Bitly users who had attended a prescreening of the movie (see Figure 2). For Despicable Me 2, the analysts manually classified the most-clicked Bitly reviews; the average value of the extracted review scores was 7.8. Once we recorded the selected movie's average value, we used the similarity view to compare it to other movies. The movie review score appeared as a star highlighting the review value in the corner of the bar graphs (see Figure 4).

Typically, we compared across genre, movie rating, and sentiment to determine whether we felt the average value extracted from Bitly links was a reasonable prediction. We compared Despicable Me 2 to Monsters University because both were animated sequels. Monsters University's IMDb rating was 7.8, giving us confidence that our predicted value of 7.8 for Despicable Me 2 was reasonable. We then performed this process for The Lone Ranger, which received a predicted rating of 6.4. The actual IMDb ratings were 7.9 for Despicable Me 2 and 6.8 for The Lone Ranger.

Predicting Revenue

Predicting revenue for the July 4th weekend was challenging for two reasons. First, the data stream from the contest was broken, providing only six days' worth of tweets. Second, the predictions were for a five-day weekend instead of the typical three-day weekend. Using the available data, we obtained rough estimates of US$76M (±$3M) for Despicable Me 2 and $85M (±$3M) for The Lone Ranger. For the three-day weekend, the New Movie Revenue (see Equation 1) estimated that $124M was available for the two movies.
A quick look at Figure 3 shows that our regression predictions were well outside the bounds of the time series model prediction. Given the misalignment between the two models, we explored the similarity views to determine the movies most similar to Despicable Me 2 and The Lone Ranger, on the basis of the predicted review scores and various other metrics. We compared Despicable Me 2 to a variety of animated movies; the predicted $73M was actually low compared to animated movies such as Monsters University. Next, we explored various similarity views for The Lone Ranger. It was likely similar to World War Z, which had a weekend revenue of $66M. However, World War Z's viewer rating was 7.4, much higher than the predicted 6.4 for The Lone Ranger.

We determined that Despicable Me 2 should perform similarly to Monsters University, and we predicted a three-day revenue of $85M. On the basis of our temporal prediction, this left only $39M for The Lone Ranger. Although the regression estimate for The Lone Ranger was far higher, the other evidence suggested that it was likely to underperform. Finally, we took our three-day prediction values and linearly scaled them, resulting in a five-day prediction of $116.5M for Despicable Me 2 and $55.45M for The Lone Ranger. The actual three-day revenue was $83.5M for Despicable Me 2 and $29M for The Lone Ranger. The actual five-day revenue was $143M for Despicable Me 2 and $48.7M for The Lone Ranger.

VAST Challenge Results

Eight teams from various research institutes participated in the 2013 VAST Box Office Challenge. Our team was Team VADER (Visual Analytics and Data Exploration Research Lab; http://vader.lab.asu.edu). Here, we compare our performance with that of our VAST competitors and four professional movie prediction websites.

Comparison with Peer Teams

Table 3 summarizes each team's performance.
For the revenue predictions, we report the average error (in terms of millions of dollars), the standard deviation of the average error, and the mean relative absolute error (MRAE), which is the percentage of bias deviating from the real value:

\[ \mathrm{MRAE} = \frac{1}{N} \sum_{i=1}^{N} \frac{\lvert \mathrm{Prediction}_i - \mathrm{Real\ Value}_i \rvert}{\mathrm{Real\ Value}_i}. \]

We report similar values for predicting the IMDb rating (which ranged from 1 to 10). For these statistics, smaller values indicate more accurate predictions. The data in Table 3 was provided to all challenge participants after the contest closed.

Regarding the average error and standard deviation for revenue predictions, our team reported the lowest values. Regarding the MRAE for revenue predictions and viewer-rating predictions, our results were slightly worse than Team Prolix's and similar to those of Philipp Omentisch, CDE IIIT, and Team Turboknopf. However, Team Prolix's average error and standard deviation were much larger than ours, indicating more inconsistent predictions.

Regarding the average error and MRAE for viewer-rating predictions, our team had the lowest values of all teams that submitted more than five predictions. CDE IIIT submitted two perfect predictions; however, it submitted only those two predictions, making it difficult to determine whether its methods would produce consistent results. Regarding the average error and standard deviation for viewer-rating predictions, our team performed similarly to Team Turboknopf, but with a slightly lower average error and a slightly higher standard deviation.

Comparison with Professional Predictions

In this comparison, we used our predictions for only 21 of the 23 movies. Two of the 23, The Bling Ring and The To Do List, were limited-release movies that opened in only five and 591 theaters, respectively. Most expert prediction sites don't provide predictions for limited-release movies.

For each prediction, we followed the same general process we described in the section "Implementing the Toolkit." As we stated before, the underlying linear-regression model used in our toolkit was significant, with an adjusted \(R^2\) of approximately 0.60. Figure 5 compares our MRAE with that of the four websites for the opening-weekend revenue.
We clearly outperformed the experts on the weekend when Epic, The Hangover Part III, and Fast & Furious 6 were released. On the weekend when we had the largest error (for After Earth), we relied heavily on the analytical component, with no interaction. Figure 6 plots the MRAE for the review scores. Approximately half of our predictions were within a 5 percent error of the real review score. The four websites had no published review score predictions.

The predictions with our toolkit were a dramatic improvement over using just our model without interaction (see the first two rows of Table 4). This strongly indicates that our hypothesis (that VA will help users develop better predictions than a purely statistical solution will) is valid. However, we don't wish to overstate our claims. The contest provided only a single data point for exploring how one group of analysts in a closed-world setting could use a VA toolkit for improved prediction. The need exists for further controlled studies in which a group of analysts performs similar model predictions both with a VA platform and with only a given regression model.

Figure 5. The mean relative absolute error (MRAE) of weekend revenue predictions, by movie, for our predictions and for boxoffice.com, filmgo.net, hsx.com, and boxofficemojo.com. We clearly outperformed the experts for three movies (Epic, The Hangover Part III, and Fast & Furious 6). Where we had the largest error (After Earth), we relied heavily on the analytical component, with no interaction.

Figure 6. The MRAE of our viewer-rating predictions, by movie. Sixteen out of 21 predictions had an error below 10 percent, and 11 had an error below 5 percent.
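The error statistics used in this comparison can be computed in a few lines. The following is a minimal sketch, assuming that "average error" means the mean absolute error and that the standard deviation is the sample standard deviation of those absolute errors; the article doesn't specify either choice. The function name error_summary and the revenue figures are hypothetical and aren't contest data.

```python
from statistics import mean, stdev


def error_summary(predicted, actual):
    """Return (average absolute error, sample std of absolute errors, MRAE).

    Assumes every real value is positive, as opening-weekend revenue is.
    """
    if len(predicted) != len(actual) or len(actual) < 2:
        raise ValueError("need two equal-length sequences with at least two values")
    abs_errors = [abs(p - a) for p, a in zip(predicted, actual)]
    # MRAE: each absolute error is scaled by the real value before averaging.
    relative_errors = [err / a for err, a in zip(abs_errors, actual)]
    return mean(abs_errors), stdev(abs_errors), mean(relative_errors)


# Hypothetical opening-weekend revenue in millions of dollars (not contest data).
predicted = [50.0, 30.0, 12.0]
actual = [40.0, 30.0, 16.0]

avg_error, std_error, mrae = error_summary(predicted, actual)
print(f"average error: {avg_error:.3f}")
print(f"standard deviation: {std_error:.3f}")
print(f"MRAE: {mrae:.3f}")
```

Because each error is divided by the real value, the MRAE in this sketch weighs the $4 million miss on the smallest release as heavily as the $10 million miss on the largest one, which is why the MRAE can rank prediction sources differently than the average error does.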

Table 4. Comparing our toolkit with professional predictions.

Prediction source       No. of predictions   Average error   Standard deviation   Average MRAE
VADER, interactive      21                   2.729           9.425                0.285
VADER, no interaction   21                   23.05           22.0                 0.50
boxoffice.com           2                    8.538           7.466                0.19
filmgo.net              6                    2.750           7.409                0.297
hsx.com                 20                   9.060           7.397                0.205
boxofficemojo.com       4                    9.864           7.527                0.224

Table 4 shows that our average error and average MRAE were slightly lower than those of filmgo.net. This indicates that our approach enabled our group of novice analysts to be competitive with experts. The significance of this relies on three major assumptions:

- The professional prediction websites had more experience in movie revenue prediction than our team.
- The professional prediction websites had access to more data than our team was allowed in the closed-world contest.
- Access to more data can enable better predictive models.8,9,11,12

First, it seems reasonable that a professional prediction website would have much more experience than a computer science team that had never previously attempted to predict movie revenue. Second, there's no restriction on what data a professional website's predictions can use. For example, boxoffice.com uses Facebook tracking and Twitter tracking, and hsx.com uses the Hollywood Stock index. Third, it's clear that using more data (specifically, the number of theaters a movie is released in) will produce a better prediction model (a larger R²).

From these assumptions, it becomes clear that (in this instance) a VA toolkit can enable individuals who are knowledgeable about data analysis to quickly understand information being presented to them in new domains and make predictions that are in line with expert predictions. Our MRAE (0.285) was slightly lower than that of filmgo.net (0.297) but approximately 50 percent worse than that of boxoffice.com (0.19).
However, if we remove the After Earth and Now You See Me weekend (during which we relied heavily on the model and little on the interactive visuals), our MRAE drops to 0.239, which puts us near boxofficemojo.com (0.224). Other errors can be attributed to disrupted Twitter and Bitly data feeds. These interruptions were pronounced for The Heat, White House Down, Monsters University, and World War Z. However, even with those interruptions, our predictive analysis was still quite robust, with only The Heat obtaining a significantly worse prediction than the professional sites.

The Challenges Ahead

Overall, applying VA for social media analysis has proven relatively effective. However, four main challenges exist in applying this to all domains of business intelligence.

First, social media data is extremely noisy. Movie predictions work well because you can track an ad campaign's effectiveness by following the specific hashtags promoted by a brand. As the analysis gets farther afield from Twitter (for example, when trying to mine Bitly data), choosing effective keywords becomes difficult.

Second, owing to the ever-changing stream of social media sources and users, any automated system for data collection and prediction will likely eventually be steered off course. So, it's critical to keep the human in the loop. However, as is evidenced by the issues in sentiment analysis, data cleaning shouldn't overburden analysts. The sentiment analysis and cleaning employed in our research place an overly large burden on the user. A more effective solution could be a system for sentiment model training that has users label a subset of tweets.

Third, it's imperative to link highly curated small datasets with this big data.
Although social media data can serve as a proxy for many signals, we find that linking multiple data sources with varying reliability levels (for instance, the total weekend revenue for all movies and regression modeling) can enhance a system's predictive abilities. For example, running focus groups and linking their data with results from social media could enhance the analysis of a proposed new product release.

Finally, this research demonstrates the need for interactive tools to mine social media data. From the examples of movie revenue prediction, it's clear that such data contains a wealth of information. However, extracting knowledge from this data and effectively communicating it remain a challenge. The need clearly exists for effective data-cleaning tools to improve the filtering of unrelated social media signals and to improve the results of challenging analytical tasks (such as sentiment analysis). Our results demonstrate that using VA
tools can significantly affect knowledge discovery for business intelligence. Although our results represent only a single data point, we feel this is significant in that the contest provisions let us directly compare analysts using a VA toolkit to experts in a particular modeling domain. We recognize that this is a far cry from definitively validating our hypothesis that the use of VA will enable users to develop better box-office predictions than a purely statistical solution would.

This research points to the need for better methods for evaluating the impact of VA used for complex problems such as prediction. A variety of factors and variables must be addressed and controlled, including the level of expertise and the types of visualizations provided. Using our toolkit, we've been collecting streaming movie data in a manner similar to the VAST Box Office Challenge and plan to run a variety of controlled experiments. Of primary interest is exploring levels of expertise and VA's impact on predictions. We feel that the results we reported here are an important starting point for such explorations.

Acknowledgments

This research was supported partly by the US Department of Homeland Security's VACCINE (Visual Analytics for Command, Control, and Interoperability Environments) Center under award 2009-ST-061-CI0001. We thank the 2013 Visual Analytics Science and Technology Box Office Challenge organizers and participants for their help in data collection, evaluation, and discussions.

References

1. T. Schreck and D. Keim, "Visual Analysis of Social Media Data," Computer, vol. 46, no. 5, 2013, pp. 68-75.
2. H. Bosch et al., "Scatterblogs2: Real-Time Monitoring of Microblog Messages through User-Guided Filtering," IEEE Trans. Visualization and Computer Graphics, vol. 19, no. 12, 2013, pp. 2022-2031.
3. J. Chae et al., "Spatiotemporal Social Media Analytics for Abnormal Event Detection and Examination Using Seasonal-Trend Decomposition," Proc. 2012 IEEE Conf. Visual Analytics Science and Technology (VAST 12), 2012, pp. 143-152.
4. X. Wang et al., "I-SI: Scalable Architecture for Analyzing Latent Topical Level Information from Social Media Data," Computer Graphics Forum, vol. 31, no. 3, part 4, 2012, pp. 1275-1284.
5. M.C. Hao et al., "Visual Sentiment Analysis of Customer Feedback Streams Using Geo-temporal Term Associations," Information Visualization, vol. 12, nos. 3-4, 2013, pp. 273-290.
6. S. Baccianella, A. Esuli, and F. Sebastiani, "SentiWordNet 3.0: An Enhanced Lexical Resource for Sentiment Analysis and Opinion Mining," Proc. Int'l Conf. Language Resources and Evaluation, 2010, pp. 2200-2204.
7. R Development Core Team, R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, 2008.
8. R. Panaligan and A. Chen, "Quantifying Movie Magic with Google Search," white paper, Google, 2013.
9. S. Asur and B.A. Huberman, "Predicting the Future with Social Media," Proc. IEEE/WIC/ACM Int'l Conf. Web Intelligence and Intelligent Agent Technology, 2010, pp. 492-499.
10. G. Rowe and G. Wright, "The Delphi Technique as a Forecasting Tool: Issues and Analysis," Int'l J. Forecasting, vol. 15, no. 4, 1999, pp. 353-375.
11. W. Zhang and S. Skiena, "Improving Movie Gross Prediction through News Analysis," Proc. IEEE/WIC/ACM Int'l Joint Conf. Web Intelligence and Intelligent Agent Technology, 2009, pp. 301-304.
12. M. Joshi et al., "Movie Reviews and Revenues: An Experiment in Text Regression," Human Language Technologies: The 2010 Ann. Conf. North Am. Chapter of the Assoc. for Computational Linguistics, 2010, pp. 293-296.

Yafeng Lu is a PhD student working for Ross Maciejewski in Arizona State University's School of Computing, Informatics, and Decision Systems Engineering. Her research interests are data analysis and visualization. Lu received her master's in computer science and theory from Northeastern University, China. Contact her at lyafeng@asu.edu.

Feng Wang is a PhD student working for Ross Maciejewski in Arizona State University's School of Computing, Informatics, and Decision Systems Engineering. His research interests include data visualization and data mining. He received his master's in computer science from the University of Science and Technology of China. Contact him at fwang49@asu.edu.

Ross Maciejewski is an assistant professor in Arizona State University's School of Computing, Informatics, and Decision Systems Engineering. His research interests are geographical visualization and visual analytics focusing on public health, social media, criminal incident reports, and dietary analysis. He received his PhD in computer engineering from Purdue University. Contact him at rmacieje@asu.edu.

Selected CS articles and columns are also available for free at http://computingnow.computer.org.