Business Intelligence from Social Media


Business Intelligence Analytics

Business Intelligence from Social Media: A Study from the VAST Box Office Challenge

Yafeng Lu, Feng Wang, and Ross Maciejewski, Arizona State University

IEEE Computer Graphics and Applications, September/October 2014

This visual-analytics toolkit extracts data from Twitter and Bitly to predict movie revenue and ratings. Its interactivity provides benefits that a purely statistical approach can't. The approach is generalizable to other domains involving social media data, such as sales forecasting and advertisement analysis.

Social media presents a promising, albeit challenging, source of data for business intelligence. Customers voluntarily discuss products and companies, giving a real-time pulse of brand sentiment and adoption. Unfortunately, such data is noisy and unstructured, making it difficult to extract real-time intelligence. As a result, using such data can be time-consuming and cost-prohibitive for businesses.

One promising direction is to apply visual analytics (VA). Recently, the VA community has begun focusing on extracting knowledge from unstructured social media data. Studies have ranged from geotemporal anomaly detection [2, 3] to topic extraction [4] to customer sentiment analysis [5]. The development of tools for such analyses now lets users explore this rich information source and mine it for business intelligence.

One key area for business intelligence is revenue prediction. In particular, owing to the abundance of social media discussions on movies, movie revenue prediction has drawn much attention from both the movie industry and academia. Prediction methods have employed movie metadata, social media data, and Google search volumes (for some examples, see the Related Work sidebar). Such methods have demonstrated the benefits of extracting business intelligence from social media for predicting movie revenue. However, they've relied solely on automated extraction and knowledge prediction.
We've developed a VA toolkit for predicting opening-weekend revenue and viewer-rating scores of upcoming movies. It consists of a Web-deployable series of linked visualization views that combine data mining with statistical techniques. To demonstrate our toolkit's effectiveness, we report on the results of the 2013 Visual Analytics Science and Technology (VAST) Box Office Challenge. These results also let us explore the hypothesis that VA can help users develop better movie revenue predictions than a purely statistical solution can.

Such a VA approach for social media analysis and forecasting is directly applicable to a wide range of business intelligence problems. Understanding how information spreads, as well as the underlying sentiment of the messages being spread, can give analysts critical insight into the general pulse of their brand or product. It can be crucial to develop a set of quick-look visualization tools that give an overview of such social media data, and to link these tools to the models that business analysts generate for deploying new products, advertising campaigns, and sales forecasts. Our toolkit can also be used to explore other business-related social media data, for example, to see how well an ad campaign did and how information about it spread. Such exploration can help adjust business decisions.

Tools for Movie Predictions

Our toolkit lets users quickly extract, visualize, and clean information from social media sources.

Published by the IEEE Computer Society

To create predictions, it integrates visual analytics with linear regression, temporal modeling, and sentiment analysis.

Tweet Mining

For tweet mining, we focused on structured data from the Internet Movie Database (IMDb) (for example, the genre, budget, and review rating) and unstructured data from social media (for example, movie-related tweets and blog posts). Whereas extracting structured data is relatively straightforward, unstructured data requires much preprocessing and manipulation. We collected tweets during the two weeks before the release date, on the basis of the hashtag provided by a movie's official Twitter account.

We wanted tools that can extract a variety of metrics from IMDb and Twitter. Table 1 summarizes the metrics we found most useful. Several of them require data mining and cleaning. To facilitate this, we developed tools to present the volume of tweets at various levels of temporal aggregation (see Figure 1a), let users remove unrelated tweets from the aggregate metrics, and let users extract and manually adjust a tweet's sentiment (see Figures 1b through 1d).

To approximate the popular sentiment of a movie, we process each tweet using SentiWordNet, a dictionary-based classifier [6]. First, we assign each word in the tweet a score from -1 to 1, with -1 being the most negative sentiment and 1 being the most positive sentiment. Next, we assign each tweet a sentiment score (TSS) by summing the sentiment scores of all the words in the tweet and scaling the result to the range -0.5 to 0.5. Finally, we calculate the movie sentiment score (MSS):

$$\mathrm{MSS} = \frac{\text{Positive Score}}{\text{Positive Score} + \text{Negative Score}},$$

where Positive Score is the sum of the TSSs of all tweets for a given movie with TSS > 0 and Negative Score is the absolute value of the sum of the TSSs of all tweets for a given movie with TSS < 0.

Our toolkit visualizes the extracted TSSs for the users. Figures 1b through 1d show the bubble plot,

Related Work in Predicting Movie Revenue

An early study by Jeffrey Simonoff and Ilana Sparrow predicted movie revenue with a logged response regression model using metadata features (for example, the time of year, genre, and Motion Picture Association of America rating) as categorical regressors [1]. Wenbin Zhang and Steven Skiena enhanced regression models based on metadata features by using variables extracted from news sources [2]. Mahesh Joshi and his colleagues explored the relationship between film critic reviews and movie revenue [3]. Sitaram Asur and Bernardo Huberman found that the rate of tweets per day explained nearly 80 percent of the variance in movie revenue prediction [4]. Finally, a recent Google white paper claimed 94 percent accuracy in movie revenue prediction, using the volume of Internet trailer searches for a given movie title [5].

References
1. J.S. Simonoff and I.R. Sparrow, "Predicting Movie Grosses: Winners and Losers, Blockbusters and Sleepers," Chance, vol. 13, no. 3, 2000.
2. W. Zhang and S. Skiena, "Improving Movie Gross Prediction through News Analysis," Proc. IEEE/WIC/ACM Int'l Joint Conf. Web Intelligence and Intelligent Agent Technology, 2009.
3. M. Joshi et al., "Movie Reviews and Revenues: An Experiment in Text Regression," Human Language Technologies: The 2010 Ann. Conf. North Am. Chapter of the Assoc. for Computational Linguistics, 2010.
4. S. Asur and B.A. Huberman, "Predicting the Future with Social Media," Proc. IEEE/WIC/ACM Int'l Conf. Web Intelligence and Intelligent Agent Technology, 2010.
5. R. Panaligan and A. Chen, "Quantifying Movie Magic with Google Search," white paper, Google, 2013.

Table 1. Metrics we found useful.

Metric    Description
OW        The three-day opening-weekend revenue.
Budget    The approximate movie budget (in US$ millions) according to the Internet Movie Database (IMDb).
Genre     The movie's genre according to IMDb.
TUser     The number of unique users who tweeted about a movie.
TBD       The average daily number of tweets during the two weeks before the movie's release.
TSS       Tweet sentiment score: a summation of each word's sentiment polarity as calculated with SentiWordNet [6].
MSS       Movie sentiment score: a derivation of a movie's overall sentiment.
MSP       Movie star power: a summation of the Twitter followers of the three highest-billed movie stars (as listed by IMDb).
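The MSS calculation described in the Tweet Mining text can be sketched in a few lines of Python. This is a minimal illustration, not the toolkit's code: it assumes the per-tweet TSS values have already been computed and scaled, the function name `movie_sentiment_score` is hypothetical, and the neutral fallback of 0.5 for a movie with no polar tweets is an assumption the article does not specify.

```python
from typing import Iterable


def movie_sentiment_score(tweet_scores: Iterable[float]) -> float:
    """Return MSS = positive / (positive + negative) from per-tweet TSS values."""
    scores = list(tweet_scores)
    # Positive Score: sum of the TSSs of all tweets with TSS > 0.
    positive = sum(s for s in scores if s > 0)
    # Negative Score: absolute value of the sum of the TSSs with TSS < 0.
    negative = abs(sum(s for s in scores if s < 0))
    total = positive + negative
    if total == 0:
        # Assumption of this sketch: treat "no polar tweets" as neutral.
        return 0.5
    return positive / total


# Hypothetical TSS values for five tweets about one movie.
example_tss = [0.4, 0.1, -0.25, 0.0, 0.25]
example_mss = movie_sentiment_score(example_tss)
```

Because the score is a ratio of summed magnitudes, a single strongly negative tweet weighs as much as several mildly negative ones, which is one reason the manual sentiment corrections described in the article can noticeably change the MSS.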

Figure 1. Tweet trend and sentiment views for the movie Despicable Me 2. (a) Line charts and bar graphs showing the number of tweets per day and the predictions. (b) A tweet bubble plot in which blue represents positive sentiment and red represents negative sentiment. A bubble's size represents how many times a tweet has been retweeted; the x-axis is time, and the y-axis is how many followers the person who submitted the tweet has. (c) A sentiment river view that aggregates sentiment over four-hour intervals. Positive sentiment is red; negative sentiment is blue. Users can select an area on the river to see the ratio of positive to negative sentiment. (d) A sentiment wordle in which a word's size represents how many times it was used in a tweet and in which its color represents sentiment. Users can click on a word to view the tweets containing it.

the sentiment river, and the sentiment wordle. The sentiment wordle visualizes the 200 most frequently mentioned words. Both the bubble plot and wordle enable interactive searching and filtering by keywords and users. Users can remove irrelevant tweets from the tweet count and modify mismatched sentiment.

The primary use we found for the views in Figure 1 was data cleaning. The primary lesson learned was that visualization tools are a necessity for data cleaning, owing to the noisiness of social media data and the problems inherent in sentiment matching using a sentiment dictionary. (For example, phrases such as "I want to see this movie so bad" are marked as negative because of the word "bad," and words such as "Despicable" are marked as negative even though they're merely references to a movie title.) The wordle provides a quick way to assess the sentiment of popular words. However, to fully explore a tweet's context, users must hover over the bubble plot or open a tweet list view through the search bar.

Our implementation of the toolkit (which we describe later) demonstrated that these views were more effective for cleaning and overview than for model analysis. The need for tools to extract the correct metrics for regression modeling is a major hurdle for using social media data for business intelligence. The bubble plot and wordle helped us deal with the challenges of sentiment analysis and of cleaning the noise from social media data.

Figure 2. Our interactive Bitly classification widget. In the center are the unclassified links, which the user can click and classify, as seen in the floating window. The upper left is a plot of review scores by click counts, with a line for the average review score.

Bitly Mining

Here, we explored long-form text by extracting Bitly links containing movie keywords.
These links typically consisted of review articles or news reports about the movies (or, in many cases, unrelated news; for example, when the movie The Heat was released, the Miami Heat basketball team had just won the National Basketball Association championship). We developed an interactive tool for extracting prescreening review scores embedded in Bitly links (see Figure 2). Initially, each Bitly link is unclassified and represented in a pixel matrix (the color saturation corresponds to how many times a link was clicked). When users click on an unclassified square, a pop-up box appears with a brief excerpt from the article. Users can follow the link to scan the article for review scores and manually assign a score to an article or classify it as news or unrelated. For analysis, the tool provides a plot of review scores from articles versus how many times an article was accessed (see the upper-left graph in Figure 2). The predicted review score is an average of the extracted review scores, normalized to one scale.

This tool allows for quick data filtering and extraction. For example, users can easily separate reviews of the Star Trek video game from reviews of the Star Trek movie, which would be difficult to automatically encode. Furthermore, the pixel matrix's color coding can serve as a metric for classifying only those articles with a substantial number of views.

As with tweet mining, we learned here that extracting information from Bitly can be difficult to fully automate. As in the Star Trek example, multiple products related to a movie might be released and reviewed at the same time. Furthermore, review scores might vary in form, from "two thumbs up" to "4 out of 5 stars" to "6 out of 10." With the user in the loop, these scores can be mapped to the user's own base system (in the case of our contest entry, our metric was "x out of 10").
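The normalization step described above (mapping scores such as "4 out of 5 stars" onto an "x out of 10" scale and averaging them) might look like the following Python sketch. The function names and the sample reviews are hypothetical. Verbal ratings such as "two thumbs up" have no numeric scale, so this sketch assumes the analyst has already converted them to a number by hand, as the user-in-the-loop workflow implies.

```python
from typing import List, Tuple


def to_ten_point(score: float, scale_max: float) -> float:
    """Map a numeric review score on a 0..scale_max scale to 'x out of 10'."""
    if scale_max <= 0:
        raise ValueError("scale_max must be positive")
    if not 0 <= score <= scale_max:
        raise ValueError("score must lie within its scale")
    return 10.0 * score / scale_max


def predicted_review_score(reviews: List[Tuple[float, float]]) -> float:
    """Average the normalized scores of manually classified review links."""
    if not reviews:
        raise ValueError("at least one classified review is required")
    normalized = [to_ten_point(score, scale_max) for score, scale_max in reviews]
    return sum(normalized) / len(normalized)


# Hypothetical extracted scores: 4 of 5 stars, 6 of 10, and 8 of 10.
sample_reviews = [(4, 5), (6, 10), (8, 10)]
sample_prediction = predicted_review_score(sample_reviews)
```

A plain mean treats every classified article equally; weighting by click count would be a possible variation, but the article describes only an average.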
Regression Modeling

Once we completed data cleaning and variable extraction, we used the social media metrics to develop a model to predict movie revenue and review scores. Traditional variables used in movie revenue prediction models include structured variables (for example, the Motion Picture Association of America [MPAA] rating and movie budget) and derived measures (for example, movie stars' popularity and popular sentiment regarding the movie). On the basis of our initial literature search, we used multiple linear regression to obtain an initial prediction range for the opening-weekend movie revenue (OW). (For a brief introduction to multiple linear-regression modeling, see the related sidebar.) We explored a variety of variables that could be mined from the contest data (see Table 1). After initial model fitting and evaluation using R [7], we found our best fit to be

$$OW = \beta_0 + \beta_1\,\mathrm{TBD} + \beta_2\,\mathrm{Budget} + \varepsilon,$$

where each $\beta$ is a coefficient parameter and $\varepsilon$ is the error term. We updated the model weekly as new movies entered the dataset. We fit the parameters using movie data beginning in January 2013. Our first prediction, for the 17 May weekend, used data from 39 movies for training. Our weekly models reported an adjusted $R^2$ of approximately 0.60, with p < 0.5. Our final parameters were $\beta_0$ …, $\beta_1$ … 4,462, and $\beta_2$ ….

Unfortunately, this model doesn't fit the data particularly well, and its predictions have a large variance. For comparison, a linear-regression model using Google search volumes explained more than 90 percent of the variance in movie revenue performance [8]. Also, models by Sitaram Asur and Bernardo Huberman produced an adjusted $R^2$ of over 90 percent with the number of theaters as a regressor [9]. However, we hypothesized that a VA toolkit could partly help users overcome poor data (due partly to noise in social media data and partly to the closed-world nature of the contest).

To facilitate better model prediction, we created a simple bar graph view (see Figure 3a). For past movies, it shows the model prediction, its 95 percent confidence interval error range, the submitted prediction, and the actual movie revenue. For new movies, it shows only the model prediction and submitted prediction. This view was critical in our analysis. The primary view of the data consists of an overview of the tweets per day and the predictions for the selected movies (see Figure 1a).

Figure 3. The weekend prediction view for newly released movies and the prediction adjustment widget. This view shows the weekend when Despicable Me 2 and The Lone Ranger were released. (a) A bar graph showing the actual value, submitted prediction, and model prediction. (b) A stacked bar graph showing the predicted weekend revenue overlaid with the upcoming movie's regression model prediction. (c) The prediction adjustment widget, for modifying the total weekend revenue prediction. (The predicted values for the new movies remain proportional.) (d) The adjustment widget, for changing individual predictions. The gray box represents the total weekend revenue.

Temporal Modeling

The regression model provides one point for analysis; we also wanted to provide a big-picture overview. For any given weekend, there's likely a maximum amount of money available in the market. To approximate the total available money, we employed a simple moving-average model. Limitations here included access to data (historical weekend revenues weren't available, and after a movie opened, further weekend revenues were no longer reported in the contest). To compensate for this, we approximated subsequent weekend revenues for movies, assuming that movies would run for three weeks following their opening weekend and that each weekend their revenue would decrease by 50 percent. So, for any given weekend, we approximated the revenue as

$$\mathrm{WeekendRevenue}(t) = \sum_{i} OW_i(t) + \sum_{i}\sum_{j=1}^{3} 0.5^{j}\, OW_i(t-j),$$

where $t$ is the current weekend and $i$ is the index of a movie that exists at $t$. Then, for the weekend revenue prediction, we used a moving average:

$$\mathrm{WeekendRevenue}(t+1) = \frac{1}{3}\sum_{j=0}^{2} \mathrm{WeekendRevenue}(t-j).$$

Finally, we approximated the available revenue for new movies as

Linear-Regression Model Construction and Evaluation

Regression analysis is one of the most common methods of pattern detection and multifactor analysis. With a proper regression model, analysts can better describe, interpret, and predict data.

The Linear-Regression Model

A k-variable linear-regression model has this basic form:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon,$$

where $y$ is the response; each $\beta_i$ is an unknown parameter; $x_i$, $i = 1, 2, \ldots, k$, are the regressors; and $\varepsilon$ is the error term. The goal is to define a relationship between the response and regressors by solving for the linear coefficients that best map the regressors to the response. The linear-regression model is most often written in matrix form, such that

$$Y = X\beta + \varepsilon, \quad
Y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}, \quad
X = \begin{bmatrix} 1 & x_{11} & \cdots & x_{1k} \\ 1 & x_{21} & \cdots & x_{2k} \\ \vdots & \vdots & & \vdots \\ 1 & x_{n1} & \cdots & x_{nk} \end{bmatrix}, \quad
\beta = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{bmatrix}.$$

For multiple regression models, you can use higher-order terms to model the response (for example, second-order variables are of the form $x_i^2$ and $x_i x_j$). However, for the research described in the main article, we focused on the simple linear-regression model.

Parameter Estimation

To solve for $\beta_i$, the ordinary least squares (OLS) solution is most often employed. This assumes normality for the data. However, if this assumption isn't valid, a maximum-likelihood estimation would be employed (which is equivalent to OLS under the assumption of normality). For OLS, we wish to minimize

$$S(\beta) = \sum_{i=1}^{n} \varepsilon_i^2 = \varepsilon^T \varepsilon = (Y - X\beta)^T (Y - X\beta),$$

where $S$ indicates the least-squares function, by satisfying

$$\left.\frac{\partial S}{\partial \beta}\right|_{\hat{\beta}} = -2X^T Y + 2X^T X\hat{\beta} = 0,$$

where $\partial$ indicates a partial derivative. The solution takes the form $\hat{\beta} = (X^T X)^{-1} X^T Y$, and the prediction function is $\hat{Y} = HY$, where $H = X(X^T X)^{-1} X^T$. In first-order multiple linear regression, the predicted response is a linear combination of observations.

Model Selection

In a multiple-variable dataset with a single response variable, analysts traditionally face a large set of potential linear-regression models consisting of various regressors and orders. For example, in movie revenue prediction, the response could be related to the number of tweets per day, the number of theaters the movie is released in, or any combination of variables. To decide which model to use in prediction, analysts typically consider four principles:

- Don't violate the scientific principle, if one exists, behind the dataset.
- Maintain a sense of parsimony to keep the order of the model and the number of regressors as low as possible.
- Keep an eye on extrapolation. Regression fits data in a given regressor space; there's no guarantee that the same model applies to other data outside this space.
- Always check evaluation plots more than the statistics. Residual plots and normal plots help show outliers and lack of fit.

To verify a model's efficacy, analysts typically rely on a variety of statistical graphics to determine the critical variables in the model: those that explain the most variation with the simplest form [2].

Evaluation of a model's effective fit usually involves three statistics. The p-value shows a regression model's significance, where p < 0.05 indicates the model is significant at the 95 percent confidence level. $R^2$ and the adjusted $R^2$ generally describe the percentage of variance explained by a given model. The adjusted $R^2$ takes into consideration the degrees of freedom and should be used in multiple regression to compensate for the apparent gain in explained variance that comes from simply adding regressors. A model is typically selected when its p-value is small, its $R^2$ or adjusted $R^2$ is high, and it has a relatively simple form with reasonable residual distributions.

References
1. D.C. Montgomery, E.A. Peck, and G.G. Vining, Introduction to Linear Regression Analysis, John Wiley & Sons.
2. T. Mühlbacher and H. Piringer, "A Partition-Based Framework for Building and Validating Regression Models," IEEE Trans. Visualization and Computer Graphics, vol. 19, no. 12, 2013.
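The OLS solution and the adjusted $R^2$ from the sidebar can be illustrated with a small, dependency-free Python sketch. It solves the normal equations $X^T X\hat{\beta} = X^T Y$ by Gaussian elimination rather than forming the inverse explicitly. The authors report fitting their model in R; this is not their code, all function names are hypothetical, and the five data rows are synthetic values generated from known coefficients so the fit can be checked (tweets per day are expressed in thousands purely to keep the numbers well scaled).

```python
from typing import List, Sequence


def solve_linear_system(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular; regressors may be collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_ols(rows: Sequence[Sequence[float]], y: Sequence[float]) -> List[float]:
    """Return [b0, b1, ..., bk] solving the normal equations X'X b = X'y."""
    design = [[1.0, *row] for row in rows]  # leading 1 gives the intercept
    p = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(p)]
    return solve_linear_system(xtx, xty)


def predict(beta: Sequence[float], row: Sequence[float]) -> float:
    return beta[0] + sum(b * v for b, v in zip(beta[1:], row))


def adjusted_r2(beta: Sequence[float], rows, y: Sequence[float]) -> float:
    """Adjusted R^2 = 1 - (SS_res / (n - k - 1)) / (SS_tot / (n - 1))."""
    n, k = len(y), len(beta) - 1
    mean_y = sum(y) / n
    ss_res = sum((yi - predict(beta, r)) ** 2 for r, yi in zip(rows, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - (ss_res / (n - k - 1)) / (ss_tot / (n - 1))


# Synthetic data: (tweets per day in thousands, budget in $M) -> OW in $M,
# generated from OW = 5 + 2 * TBD + 0.3 * Budget with no noise.
synthetic_rows = [(1.0, 50.0), (5.0, 150.0), (2.0, 30.0), (8.0, 200.0), (3.0, 80.0)]
synthetic_ow = [22.0, 60.0, 18.0, 81.0, 35.0]
synthetic_beta = fit_ols(synthetic_rows, synthetic_ow)
```

With noise-free synthetic data the fit recovers the generating coefficients and the adjusted $R^2$ is essentially 1; real contest data would of course behave very differently, as the adjusted $R^2$ of roughly 0.60 reported in the main article shows.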

$$\mathrm{NewMovieRevenue}(t+1) = \mathrm{WeekendRevenue}(t+1) - \sum_{i}\sum_{j=1}^{3} 0.5^{j}\, OW_i(t+1-j). \tag{1}$$

Although this prediction is crude, it gives users a valuable bound in which to explore the revenue predictions.

Our toolkit provides two views of the results from the weekend revenue prediction and the linear-regression model. The first view combines a linked bar graph with stacked bars (see Figure 3b). The graph's primary portion consists of gray bars indicating the predicted total weekend revenue for the new movies. The short dark-gray line indicates the actual weekend revenue for each calendar week shown on the x-axis. The stacked bar graph appears only for the analyzed weekend; the colors are the same as in the prediction bar graph.

The second view (see Figures 3c and 3d) lets users interactively adjust predictions while visualizing the bounds of the total weekend revenue prediction. A gray rectangle's area is scaled linearly to the total weekend revenue prediction. Colored rectangles are superimposed onto the gray rectangle; each colored rectangle's area represents the linear-regression prediction for a movie released on that weekend. If the sum of the individual predictions is equal to the total prediction, the colored rectangles will fit exactly into the gray rectangle. The colors are the same as in the bar graph; modifying a bar's size in any view modifies the size across all views.

Users can perform three types of prediction adjustments:

- They can change the total weekend revenue prediction, but the ratio between the movies will remain consistent.
- They can change an individual movie revenue prediction, but the total weekend revenue prediction will remain consistent.
- They can arbitrarily change each movie's revenue prediction and ignore the total weekend revenue.

By implementing and integrating multiple comparison methods, we could quickly bound our analysis.
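As a rough illustration of the temporal model, the Python sketch below implements the decayed weekend revenue, the three-weekend moving average, and the available revenue for new movies (Equation 1), plus the first adjustment type (rescaling individual predictions to a new total while keeping their ratio). The data layout (a dictionary from weekend index to a list of opening-weekend revenues), the function names, and all revenue figures are assumptions made for this example, not values from the contest.

```python
from typing import Dict, List

DECAY = 0.5      # each weekend a movie earns 50 percent of the previous one
RUN_WEEKS = 3    # movies are assumed to run three weekends after opening

Openings = Dict[int, List[float]]  # weekend index -> opening revenues ($M)


def holdover_revenue(openings: Openings, t: int) -> float:
    """Revenue at weekend t from movies that opened 1..RUN_WEEKS weekends earlier."""
    return sum(
        DECAY ** j * ow
        for j in range(1, RUN_WEEKS + 1)
        for ow in openings.get(t - j, [])
    )


def weekend_revenue(openings: Openings, t: int) -> float:
    """Approximate total revenue at weekend t: new openings plus holdovers."""
    return sum(openings.get(t, [])) + holdover_revenue(openings, t)


def predicted_weekend_revenue(openings: Openings, t: int) -> float:
    """Moving-average forecast for weekend t + 1 from weekends t, t-1, t-2."""
    return sum(weekend_revenue(openings, t - j) for j in range(3)) / 3


def new_movie_revenue(openings: Openings, t: int) -> float:
    """Equation 1: forecast for t + 1 minus the holdover revenue at t + 1."""
    return predicted_weekend_revenue(openings, t) - holdover_revenue(openings, t + 1)


def rescale_to_total(predictions: Dict[str, float], total: float) -> Dict[str, float]:
    """Change the total while keeping the ratio between movies consistent."""
    current = sum(predictions.values())
    return {name: value * total / current for name, value in predictions.items()}


# Hypothetical opening-weekend revenues for four past weekends.
history: Openings = {0: [80.0], 1: [40.0], 2: [60.0, 20.0], 3: [100.0]}
available = new_movie_revenue(history, t=3)
adjusted = rescale_to_total({"Movie A": 60.0, "Movie B": 30.0}, available)
```

In this toy history the forecast for the next weekend is $120M, of which $75M is attributed to holdovers, leaving $45M for new releases; the two regression predictions are then scaled down to fit that bound.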
Although flexible, these bounds provided an early estimate of the total weekend revenue with which to compare the predictions of our linear-regression models. Although our temporal predictions were of low quality, the combination of predictions and bounding of the problem space provided critical information for comparison and analysis.

Overall, adding multiple models predicting similar information can help guide users to a better ground truth. Like the Delphi method, which solicits predictions from multiple experts and uses them to come to a common conclusion [10], our toolkit lets users solicit predictions from multiple models to aid their analysis. Users can employ this bounded adjustment widget for other hierarchical predictions that have both individual and total predictions, such as subtopic trend prediction in a time period.

Similarity Visualization

The similarity widget lets users quickly find and compare predictions' accuracy on the basis of various similarity criteria. They can determine whether the given prediction model typically underestimates, overestimates, or is relatively accurate regarding movies they deem similar. So, they can further refine their final prediction for both revenue and review scores.

We've defined eight similarity criteria; Table 2 shows them and their distance measurements. In all similarity matches, our toolkit shows the top five most similar movies. These views let users directly compare tweet trends and sentiment words between movies deemed similar in a category. Figure 4 contains snapshots from the Despicable Me 2 similarity page, showing line charts using the MPAA criterion, a wordle using the sentiment wordle criterion, and a theme river using the sentiment river criterion.

Although all the variables used in our similarity metrics could also be used in the linear-regression model, the modeling results indicated that these variables weren't significant in altering the model.
However, by providing users with insight into these secondary variables, coupled with the weekend modeling, our toolkit lets them further refine predictions. For example, users might compare the absolute difference between the tweet counts of two movies or inspect the trend of the tweets through line chart comparison using the tweet-changing-trend criterion. Users can also quickly compare the selected movies to recently released movies with the same MPAA rating or genre. In addition, they can compare the popularity of the movies' stars, which is based on how many Twitter followers the stars have.

Implementing the Toolkit

In the VAST 2013 Box Office Challenge, we used our toolkit to predict 23 movies over three months. Here, we give an example based on the July 4th holiday in the US, when Despicable Me 2 and The Lone Ranger were released.
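Two of the distance measurements behind the similarity views (number of tweets by day and tweet changing trend), together with the top-five ranking, can be sketched as follows. This is an illustrative reading of Table 2, not the toolkit's implementation: the function names are hypothetical, the daily tweet series are invented, and the series are shortened to three days for readability where the article uses the 14 days before release.

```python
from typing import Callable, Dict, List, Sequence

Series = Sequence[float]


def tweets_by_day_distance(v: Series, s: Series) -> float:
    """Sum of absolute day-by-day differences in tweet counts."""
    if len(v) != len(s):
        raise ValueError("series must cover the same number of days")
    return sum(abs(a - b) for a, b in zip(v, s))


def trend_distance(v: Series, s: Series) -> float:
    """Same comparison after scaling each series by its own maximum."""
    if len(v) != len(s):
        raise ValueError("series must cover the same number of days")
    max_v, max_s = max(v), max(s)
    if max_v == 0 or max_s == 0:
        raise ValueError("a series with no tweets cannot be normalized")
    return sum(abs(a / max_v - b / max_s) for a, b in zip(v, s))


def most_similar(
    target: Series,
    candidates: Dict[str, Series],
    distance: Callable[[Series, Series], float],
    top: int = 5,
) -> List[str]:
    """Return the names of the `top` candidates closest to the target."""
    ranked = sorted(candidates, key=lambda name: distance(target, candidates[name]))
    return ranked[:top]


# Invented daily tweet counts (three days instead of fourteen).
target_movie = [10.0, 20.0, 40.0]
past_movies = {
    "Movie A": [5.0, 10.0, 20.0],   # same shape, half the volume
    "Movie B": [40.0, 20.0, 10.0],  # opposite trend
    "Movie C": [10.0, 20.0, 41.0],  # nearly identical volume
}
closest_by_volume = most_similar(target_movie, past_movies, tweets_by_day_distance, top=2)
closest_by_trend = most_similar(target_movie, past_movies, trend_distance, top=1)
```

The example shows why both criteria are useful: "Movie A" is far from the target in raw volume but identical in trend once each series is scaled by its own maximum.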

Table 2. Calculations of similarity criteria.*

Similarity criteria: Distance measurement

Number of tweets by day: $\mathrm{Dis}(v,s) = \sum_{i=1}^{14} \left| TBD_i(v) - TBD_i(s) \right|$

Tweet changing trend: $\mathrm{Dis}(v,s) = \sum_{i=1}^{14} \left| \dfrac{TBD_i(v)}{\max_{j=1,2,\ldots,14} TBD_j(v)} - \dfrac{TBD_i(s)}{\max_{j=1,2,\ldots,14} TBD_j(s)} \right|$

Sentiment river: $\mathrm{Dis}(v,s) = \sum_{i=1}^{14} \left| \dfrac{MSS_i(v)}{\max_{j=1,2,\ldots,14} MSS_j(v)} - \dfrac{MSS_i(s)}{\max_{j=1,2,\ldots,14} MSS_j(s)} \right|$

MSS: $\mathrm{Dis}(v,s) = \left| MSS(v) - MSS(s) \right|$

MPAA: The same Motion Picture Association of America rating and close release dates

Genre: $\mathrm{Dis}(v,s) = 1 - \dfrac{2\,\mathrm{card}\left(\mathrm{Genre}(v) \cap \mathrm{Genre}(s)\right)}{\mathrm{card}\left(\mathrm{Genre}(v)\right) + \mathrm{card}\left(\mathrm{Genre}(s)\right)}$

MSP: $\mathrm{Dis}(v,s) = \left| MSP(v) - MSP(s) \right|$

Sentiment wordle: $\mathrm{Dis}(v,s) = 1 - \dfrac{\mathrm{card}\left(\mathrm{SWordle}(v) \cap \mathrm{SWordle}(s)\right)}{\mathrm{card}\left(\mathrm{SWordle}(v)\right)}$

*v and s are the two movies being compared; card is the cardinality.

Figure 4. User-defined similarity views cropped to show the most similar movies. On the top in the middle are graphs using the MPAA criterion. On the top right are graphs of the actual opening-weekend revenue, our final prediction, and the prediction range. The circled star shows the review score. On the bottom left is a wordle using the sentiment wordle criterion; on the bottom right is a theme river using the sentiment river criterion. (For an explanation of these criteria, see Table 2.)

Predicting Review Scores

To predict IMDb review scores, we first entered the Bitly view for each movie. We manually extracted review scores from Bitly users who had attended a prescreening of the movie (see Figure 2). For Despicable Me 2, the analysts manually classified the most-clicked Bitly reviews; the average value of the extracted review scores was 7.8. Once we recorded the selected movie's average value, we used the similarity view to compare it to other movies. The movie review score appeared as a star highlighting the review value in the corner of the bar graphs (see Figure 4). Typically, we compared across genre, movie rating, and sentiment to determine whether we felt the average value extracted from Bitly links was a reasonable prediction. We compared Despicable Me 2 to Monsters University because both were animated sequels. Monsters University's IMDb rating was 7.8, giving us confidence that our predicted value of 7.8 for Despicable Me 2 was reasonable. We then performed this process for The Lone Ranger, which received a predicted rating of 6.4. The actual IMDb ratings were 7.9 for Despicable Me 2 and 6.8 for The Lone Ranger.

Table 3. Competitors' performance in the 2013 VAST Box Office Challenge. The average error is in millions of dollars.

Column groups: Revenue predictions; Viewer-rating predictions
Columns in each group: No. of predictions, Average error, Standard deviation, MRAE*
Teams: Our team (VADER), Team Prolix, Uni Konstanz Boxoffice, CinemAviz, Team Turboknopf, elvertoncf UFMG, Philipp Omentisch, CDE IIIT

*Mean relative absolute error.

Predicting Revenue

Predicting revenue for the July 4th weekend was challenging for two reasons. First, the data stream from the contest was broken, providing only six days' worth of tweets. Second, the predictions were for a five-day weekend instead of the typical three-day weekend. Using the available data, we obtained rough estimates of US$76M (±$3M) for Despicable Me 2 and $85M (±$3M) for The Lone Ranger. For the three-day weekend, the New Movie Revenue (see Equation 1) estimated that $124M was available for the two movies. A quick look at Figure 3 shows that our regression predictions were well outside the bounds of the time series model prediction.
Given the misalignment between the two models, we explored the similarity views to determine the movies most similar to Despicable Me 2 and The Lone Ranger, on the basis of the predicted review scores and various other metrics. We compared Despicable Me 2 to a variety of animated movies; the predicted $73M was actually low compared to animated movies such as Monsters University. Next, we explored various similarity views for The Lone Ranger. It was likely similar to World War Z, which had a weekend revenue of $66M. However, World War Z's viewer rating was 7.4, much higher than the predicted 6.4 for The Lone Ranger.

We determined that Despicable Me 2 should perform similarly to Monsters University, and we predicted a three-day revenue of $85M. On the basis of our temporal prediction, this left only $39M for The Lone Ranger. Given the other evidence, The Lone Ranger seemed likely to underperform, so we accepted this value. Finally, we took our three-day prediction values and linearly scaled them, resulting in a five-day prediction of $6.5M for Despicable Me 2 and $55.45M for The Lone Ranger. The actual three-day revenue was $83.5M for Despicable Me 2 and $29M for The Lone Ranger. The actual five-day revenue was $143M for Despicable Me 2 and $48.7M for The Lone Ranger.

VAST Challenge Results

Eight teams from various research institutes participated in the 2013 VAST Box Office Challenge. Our team was Team VADER (Visual Analytics and Data Exploration Research Lab; asu.edu). Here, we compare our performance with that of our VAST competitors and four professional movie prediction websites.

Comparison with Peer Teams

Table 3 summarizes each team's performance. For the revenue predictions, we report the average error (in millions of dollars), the standard deviation of the error, and the mean relative absolute error (MRAE), which is the average deviation of a prediction from the real value, relative to the real value:

$$\mathrm{MRAE} = \frac{1}{N}\sum_{i=1}^{N} \frac{\left|\mathrm{Prediction}_i - \mathrm{Real\ Value}_i\right|}{\mathrm{Real\ Value}_i}.$$

We report similar values for predicting the IMDb rating (which ranged from 1 to 10). For these statistics, smaller values indicate more accurate predictions. The data in Table 3 was provided to all challenge participants after the contest closed.

Regarding the average error and standard deviation for revenue predictions, our team reported the lowest values. Regarding the MRAE for revenue predictions and viewer-rating predictions, our results were slightly worse than Team Prolix's and similar to those of Philipp Omentisch, CDE IIIT, and Team Turboknopf. However, Team Prolix's average error and standard deviation were much larger than ours, indicating more inconsistent predictions.

Regarding the average error and MRAE for viewer-rating predictions, our team had the lowest values of all teams that submitted more than five predictions. CDE IIIT submitted two perfect predictions; however, it submitted only those two predictions, making it difficult to determine whether its methods would produce consistent results. Regarding the average error and standard deviation for viewer-rating predictions, our team performed similarly to Team Turboknopf, but with a slightly lower average error and a slightly higher standard deviation.

Comparison with Professional Predictions

In this comparison, we used our predictions for only 21 of the 23 movies. Two of the 23, The Bling Ring and The To Do List, were limited-release movies that opened in only five and 59 theaters, respectively. Most expert prediction sites don't provide predictions for limited-release movies.

For each prediction, we followed the same general process we described in the section "Implementing the Toolkit." As we stated before, the underlying linear-regression model used in our toolkit was significant, with an adjusted $R^2$ of approximately 0.60. Figure 5 compares our MRAE with that of the four websites for the opening-weekend revenue.
We clearly outperformed the experts on the weekend when Epic, The Hangover Part III, and Fast & Furious 6 were released. On the weekend when we had the largest error (for After Earth), we relied heavily on the analytical component, with no interaction.

Figure 5. The mean relative absolute error (MRAE) of weekend revenue predictions, plotted per movie for our prediction, boxoffice.com, filmgo.net, hsx.com, and boxofficemojo.com. We clearly outperformed the experts for three movies (Epic, The Hangover Part III, and Fast & Furious 6). Where we had the largest error (After Earth), we relied heavily on the analytical component, with no interaction.

Figure 6 plots the MRAE for the review scores. Approximately half of our predictions were within a 5 percent error of the real review score. The four websites had no published review score predictions.

Figure 6. The MRAE of our viewer-rating predictions, plotted per movie. Sixteen of the 21 predictions had an error below 10 percent, and approximately half had an error below 5 percent.

The predictions with our toolkit were a dramatic improvement over using just our model without interaction (see the first two rows of Table 4). This strongly indicates that our hypothesis (that VA will help users develop better predictions than a purely statistical solution will) is valid. However, we don't wish to overstate our claims. The contest provided only a single data point for exploring how one group of analysts in a closed-world setting could use a VA toolkit for improved prediction. The need exists for further controlled studies in which a group of analysts performs similar model predictions both with a VA platform and with only a given regression model.
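As a rough sketch of how results from such a controlled study might be compared, the snippet below pairs each task's relative error under the two conditions (regression model only versus model plus VA) and runs an exact one-sided sign test. The sign test is our choice for illustration, not a method from the article, and the function name and error values are hypothetical.

```python
from math import comb
from statistics import mean


def paired_comparison(model_only, with_va):
    """Compare per-task relative errors from the same analysts under two conditions."""
    if len(model_only) != len(with_va):
        raise ValueError("each task needs one error per condition")
    # Positive difference means the VA-assisted prediction had the smaller error.
    diffs = [m - v for m, v in zip(model_only, with_va)]
    wins = sum(d > 0 for d in diffs)
    losses = sum(d < 0 for d in diffs)
    n = wins + losses  # ties carry no information for a sign test
    # Exact one-sided sign test: chance of at least `wins` heads in n fair coin flips.
    p_value = sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n if n else 1.0
    return {
        "mean_reduction": mean(diffs),
        "wins": wins,
        "losses": losses,
        "p_value": p_value,
    }


# Hypothetical relative errors for five prediction tasks (not contest data).
model_only = [0.40, 0.35, 0.20, 0.50, 0.30]
with_va = [0.25, 0.30, 0.22, 0.28, 0.18]
result = paired_comparison(model_only, with_va)
print(result)
```

With only five tasks, four wins out of five still leaves a p-value near 0.19. This illustrates why a single contest entry cannot settle the hypothesis on its own.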

Table 4. Comparing our toolkit with professional predictions. Columns: prediction source, number of predictions, average error, standard deviation, and average MRAE. Rows: VADER (interactive), VADER (no interaction), boxoffice.com, filmgo.net, hsx.com, and boxofficemojo.com.

Table 4 shows that our average error and average MRAE were slightly lower than those of filmgo.net. This indicates that our approach enabled our group of novice analysts to be competitive with experts. The significance of this relies on three major assumptions: (1) the professional prediction websites had more experience in movie revenue prediction than our team; (2) the professional prediction websites had access to more data than our team was allowed in the closed-world contest; and (3) access to more data can enable better predictive models.8,9,11,12

First, it seems reasonable that a professional prediction website would have much more experience than a computer science team that had never previously attempted to predict movie revenue. Second, there's no restriction on what data a professional website's predictions can use. For example, boxoffice.com uses Facebook tracking and Twitter tracking, and hsx.com uses the Hollywood Stock index. Third, it's clear that using more data (specifically, the number of theaters a movie is released in) will produce a better prediction model (a larger R²).

From these assumptions, it becomes clear that (in this instance) a VA toolkit can enable individuals who are knowledgeable about data analysis to quickly understand information being presented to them in new domains and make predictions that are in line with expert predictions. Our MRAE (0.285) was slightly lower than that of filmgo.net (0.297) but approximately 50 percent worse than that of boxoffice.com (0.19).
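The relative comparisons in the preceding sentence can be checked with a short calculation. This sketch assumes that the boxoffice.com MRAE is 0.19, which is the value the stated 50 percent gap implies. The helper name `relative_gap` is hypothetical.

```python
def relative_gap(ours: float, theirs: float) -> float:
    """Fraction by which `ours` exceeds `theirs`; positive means a larger (worse) error."""
    return (ours - theirs) / theirs


OUR_MRAE = 0.285  # reported in the text
site_mrae = {
    "filmgo.net": 0.297,    # reported in the text
    "boxoffice.com": 0.19,  # assumption: value implied by the 50 percent gap
}

for site, value in site_mrae.items():
    print(f"{site}: {relative_gap(OUR_MRAE, value):+.1%}")
```

Under these assumptions, the toolkit's MRAE comes out about 4 percent lower than filmgo.net's and about 50 percent higher than boxoffice.com's.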
However, if we remove the After Earth and Now You See Me weekend (during which we relied heavily on the model and little on the interactive visuals), our MRAE drops to 0.239, which puts us near boxofficemojo.com (0.224). Other errors can be traced to disrupted Twitter and Bitly data feeds. These interruptions were pronounced for The Heat, White House Down, Monsters University, and World War Z. However, even with those interruptions, our predictive analysis was still quite robust, with only The Heat obtaining a significantly worse prediction than the professional sites.

The Challenges Ahead

Overall, applying VA for social media analysis has proven relatively effective. However, four main challenges exist in applying this to all domains of business intelligence.

First, social media data is extremely noisy. Movie predictions work well because you can track ad campaigns' effectiveness by following the specific hashtags promoted by a brand. As the analysis gets farther afield from Twitter (for example, when trying to mine Bitly data), choosing effective keywords becomes difficult.

Second, owing to the ever-changing stream of social media sources and users, any automated system for data collection and prediction will likely eventually be steered off course. So, it's critical to link the human into the loop. However, as is evidenced by the issues in sentiment analysis, data cleaning shouldn't overburden analysts. The sentiment analysis and cleaning employed in our research place an overly large burden on the user. A more effective solution could be a system for sentiment model training that has users label a subset of tweets.

Third, it's imperative to link highly curated small datasets with this big data.
Although social media data can serve as a proxy for many signals, we find that linking multiple data sources with varying reliability levels (for instance, the total weekend revenue for all movies and regression modeling) can enhance a system's predictive abilities. For example, doing focus groups and linking their data with results from social media could enhance the analysis of a proposed new product release.

Finally, this research demonstrates the need for interactive tools to mine social media data. From the examples of movie revenue prediction, it's clear that such data contains a wealth of information. However, extracting knowledge from this data and effectively communicating it remain a challenge. The need clearly exists for effective data-cleaning tools to improve the filtering of unrelated social media signals and for improving the results of challenging analytical tasks (such as sentiment analysis).

Our results demonstrate that using VA tools can significantly affect knowledge discovery for business intelligence. Although our results demonstrate only a single data point, we feel this is significant in that the contest provisions let us directly compare analysts using a VA toolkit to experts in a particular modeling domain. We recognize that this is a far cry from definitively validating our hypothesis that the use of VA will enable users to develop better box-office predictions than a purely statistical solution would. This research points to the need for better methods for evaluating the impact of VA used for complex problems such as prediction. A variety of factors and variables must be addressed and controlled, including the level of expertise and the types of visualizations provided.

Using our toolkit, we've been collecting streaming movie data in a manner similar to the VAST Box Office Challenge and plan to run a variety of controlled experiments. Of primary interest is exploring levels of expertise and VA's impact on predictions. We feel that the results we reported here are an important starting point for such explorations.

Acknowledgments

This research was supported partly by the US Department of Homeland Security's VACCINE (Visual Analytics for Command, Control, and Interoperability Environments) Center under award 2009-ST-061-CI0001. We thank the 2013 Visual Analytics Science and Technology Box Office Challenge organizers and participants for their help in data collection, evaluation, and discussions.

References

1. T. Schreck and D. Keim, "Visual Analysis of Social Media Data," Computer, vol. 46, no. 5, 2013.
2. H. Bosch et al., "Scatterblogs2: Real-Time Monitoring of Microblog Messages through User-Guided Filtering," IEEE Trans. Visualization and Computer Graphics, vol. 19, no. 12, 2013.
3. J. Chae et al., "Spatiotemporal Social Media Analytics for Abnormal Event Detection and Examination Using Seasonal-Trend Decomposition," Proc. 2012 IEEE Conf. Visual Analytics Science and Technology (VAST 12), 2012.
4. X. Wang et al., "I-SI: Scalable Architecture for Analyzing Latent Topical Level Information from Social Media Data," Computer Graphics Forum, vol. 31, no. 3, part 4, 2012.
5. M.C. Hao et al., "Visual Sentiment Analysis of Customer Feedback Streams Using Geo-temporal Term Associations," Information Visualization, vol. 12, nos. 3-4, 2013.
6. S. Baccianella, A. Esuli, and F. Sebastiani, "SentiWordNet 3.0: An Enhanced Lexical Resource for Sentiment Analysis and Opinion Mining," Proc. Int'l Conf. Language Resources and Evaluation, 2010.
7. R Development Core Team, R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing.
8. R. Panaligan and A. Chen, "Quantifying Movie Magic with Google Search," white paper, Google.
9. S. Asur and B.A. Huberman, "Predicting the Future with Social Media," Proc. IEEE/WIC/ACM Int'l Conf. Web Intelligence and Intelligent Agent Technology, 2010.
10. G. Rowe and G. Wright, "The Delphi Technique as a Forecasting Tool: Issues and Analysis," Int'l J. Forecasting, vol. 15, no. 4, 1999.
11. W. Zhang and S. Skiena, "Improving Movie Gross Prediction through News Analysis," Proc. IEEE/WIC/ACM Int'l Joint Conf. Web Intelligence and Intelligent Agent Technology, 2009.
12. M. Joshi et al., "Movie Reviews and Revenues: An Experiment in Text Regression," Human Language Technologies: The 2010 Ann. Conf. North Am. Chapter of the Assoc. for Computational Linguistics, 2010.

Yafeng Lu is a PhD student working for Ross Maciejewski in Arizona State University's School of Computing, Informatics, and Decision Systems Engineering. Her research interests are data analysis and visualization. Lu received her master's in computer science and theory from Northeastern University, China. Contact her at [email protected].

Feng Wang is a PhD student working for Ross Maciejewski in Arizona State University's School of Computing, Informatics, and Decision Systems Engineering. His research interests include data visualization and data mining. He received his master's in computer science from the University of Science and Technology of China. Contact him at [email protected].

Ross Maciejewski is an assistant professor in Arizona State University's School of Computing, Informatics, and Decision Systems Engineering. His research interests are geographical visualization and visual analytics focusing on public health, social media, criminal incident reports, and dietary analysis. He received his PhD in computer engineering from Purdue University. Contact him at [email protected].


More information

International Journal of Computer Trends and Technology (IJCTT) volume 4 Issue 8 August 2013

International Journal of Computer Trends and Technology (IJCTT) volume 4 Issue 8 August 2013 A Short-Term Traffic Prediction On A Distributed Network Using Multiple Regression Equation Ms.Sharmi.S 1 Research Scholar, MS University,Thirunelvelli Dr.M.Punithavalli Director, SREC,Coimbatore. Abstract:

More information

Ten Mistakes to Avoid

Ten Mistakes to Avoid EXCLUSIVELY FOR TDWI PREMIUM MEMBERS TDWI RESEARCH SECOND QUARTER 2014 Ten Mistakes to Avoid In Big Data Analytics Projects By Fern Halper tdwi.org Ten Mistakes to Avoid In Big Data Analytics Projects

More information

How to Get More Value from Your Survey Data

How to Get More Value from Your Survey Data Technical report How to Get More Value from Your Survey Data Discover four advanced analysis techniques that make survey research more effective Table of contents Introduction..............................................................2

More information

Visualizing the Top 400 Universities

Visualizing the Top 400 Universities Int'l Conf. e-learning, e-bus., EIS, and e-gov. EEE'15 81 Visualizing the Top 400 Universities Salwa Aljehane 1, Reem Alshahrani 1, and Maha Thafar 1 [email protected], [email protected], [email protected]

More information

Biomarker Discovery and Data Visualization Tool for Ovarian Cancer Screening

Biomarker Discovery and Data Visualization Tool for Ovarian Cancer Screening , pp.169-178 http://dx.doi.org/10.14257/ijbsbt.2014.6.2.17 Biomarker Discovery and Data Visualization Tool for Ovarian Cancer Screening Ki-Seok Cheong 2,3, Hye-Jeong Song 1,3, Chan-Young Park 1,3, Jong-Dae

More information

DEMYSTIFYING BIG DATA. What it is, what it isn t, and what it can do for you.

DEMYSTIFYING BIG DATA. What it is, what it isn t, and what it can do for you. DEMYSTIFYING BIG DATA What it is, what it isn t, and what it can do for you. JAMES LUCK BIO James Luck is a Data Scientist with AT&T Consulting. He has 25+ years of experience in data analytics, in addition

More information

Regularized Logistic Regression for Mind Reading with Parallel Validation

Regularized Logistic Regression for Mind Reading with Parallel Validation Regularized Logistic Regression for Mind Reading with Parallel Validation Heikki Huttunen, Jukka-Pekka Kauppi, Jussi Tohka Tampere University of Technology Department of Signal Processing Tampere, Finland

More information

5 Tips For Setting Measurable. Social Media Goals. 5 Tips for Measurable social media goals

5 Tips For Setting Measurable. Social Media Goals. 5 Tips for Measurable social media goals 5 Tips For Setting Measurable Social Media Goals 1 introduction Five practical tips for setting measurable social media goals Social media participation has become a must for businesses today. A survey

More information

CORRALLING THE WILD, WILD WEST OF SOCIAL MEDIA INTELLIGENCE

CORRALLING THE WILD, WILD WEST OF SOCIAL MEDIA INTELLIGENCE CORRALLING THE WILD, WILD WEST OF SOCIAL MEDIA INTELLIGENCE Michael Diederich, Microsoft CMG Research & Insights Introduction The rise of social media platforms like Facebook and Twitter has created new

More information

Social Market Analytics, Inc.

Social Market Analytics, Inc. S-Factors : Definition, Use, and Significance Social Market Analytics, Inc. Harness the Power of Social Media Intelligence January 2014 P a g e 2 Introduction Social Market Analytics, Inc., (SMA) produces

More information

Big Data: Rethinking Text Visualization

Big Data: Rethinking Text Visualization Big Data: Rethinking Text Visualization Dr. Anton Heijs [email protected] Treparel April 8, 2013 Abstract In this white paper we discuss text visualization approaches and how these are important

More information

The Scientific Data Mining Process

The Scientific Data Mining Process Chapter 4 The Scientific Data Mining Process When I use a word, Humpty Dumpty said, in rather a scornful tone, it means just what I choose it to mean neither more nor less. Lewis Carroll [87, p. 214] In

More information

The Future of Business Analytics is Now! 2013 IBM Corporation

The Future of Business Analytics is Now! 2013 IBM Corporation The Future of Business Analytics is Now! 1 The pressures on organizations are at a point where analytics has evolved from a business initiative to a BUSINESS IMPERATIVE More organization are using analytics

More information

Purchase Conversions and Attribution Modeling in Online Advertising: An Empirical Investigation

Purchase Conversions and Attribution Modeling in Online Advertising: An Empirical Investigation Purchase Conversions and Attribution Modeling in Online Advertising: An Empirical Investigation Author: TAHIR NISAR - Email: [email protected] University: SOUTHAMPTON UNIVERSITY BUSINESS SCHOOL Track:

More information

A Comparative Study on Sentiment Classification and Ranking on Product Reviews

A Comparative Study on Sentiment Classification and Ranking on Product Reviews A Comparative Study on Sentiment Classification and Ranking on Product Reviews C.EMELDA Research Scholar, PG and Research Department of Computer Science, Nehru Memorial College, Putthanampatti, Bharathidasan

More information

IBM Social Media Analytics

IBM Social Media Analytics IBM Social Media Analytics Analyze social media data to better understand your customers and markets Highlights Understand consumer sentiment and optimize marketing campaigns. Improve the customer experience

More information

Improving the Performance of Data Mining Models with Data Preparation Using SAS Enterprise Miner Ricardo Galante, SAS Institute Brasil, São Paulo, SP

Improving the Performance of Data Mining Models with Data Preparation Using SAS Enterprise Miner Ricardo Galante, SAS Institute Brasil, São Paulo, SP Improving the Performance of Data Mining Models with Data Preparation Using SAS Enterprise Miner Ricardo Galante, SAS Institute Brasil, São Paulo, SP ABSTRACT In data mining modelling, data preparation

More information

STATISTICA. Financial Institutions. Case Study: Credit Scoring. and

STATISTICA. Financial Institutions. Case Study: Credit Scoring. and Financial Institutions and STATISTICA Case Study: Credit Scoring STATISTICA Solutions for Business Intelligence, Data Mining, Quality Control, and Web-based Analytics Table of Contents INTRODUCTION: WHAT

More information

Visibility optimization for data visualization: A Survey of Issues and Techniques

Visibility optimization for data visualization: A Survey of Issues and Techniques Visibility optimization for data visualization: A Survey of Issues and Techniques Ch Harika, Dr.Supreethi K.P Student, M.Tech, Assistant Professor College of Engineering, Jawaharlal Nehru Technological

More information

How To Check For Differences In The One Way Anova

How To Check For Differences In The One Way Anova MINITAB ASSISTANT WHITE PAPER This paper explains the research conducted by Minitab statisticians to develop the methods and data checks used in the Assistant in Minitab 17 Statistical Software. One-Way

More information

20 A Visualization Framework For Discovering Prepaid Mobile Subscriber Usage Patterns

20 A Visualization Framework For Discovering Prepaid Mobile Subscriber Usage Patterns 20 A Visualization Framework For Discovering Prepaid Mobile Subscriber Usage Patterns John Aogon and Patrick J. Ogao Telecommunications operators in developing countries are faced with a problem of knowing

More information

1. What is the critical value for this 95% confidence interval? CV = z.025 = invnorm(0.025) = 1.96

1. What is the critical value for this 95% confidence interval? CV = z.025 = invnorm(0.025) = 1.96 1 Final Review 2 Review 2.1 CI 1-propZint Scenario 1 A TV manufacturer claims in its warranty brochure that in the past not more than 10 percent of its TV sets needed any repair during the first two years

More information

CoolaData Predictive Analytics

CoolaData Predictive Analytics CoolaData Predictive Analytics 9 3 6 About CoolaData CoolaData empowers online companies to become proactive and predictive without having to develop, store, manage or monitor data themselves. It is an

More information

the beginner s guide to SOCIAL MEDIA METRICS

the beginner s guide to SOCIAL MEDIA METRICS the beginner s guide to SOCIAL MEDIA METRICS INTRO Social media can be an incredibly important business tool. Tracking the right social metrics around your industry, company, products, competition and

More information

Social Media Implementations

Social Media Implementations SEM Experience Analytics Social Media Implementations SEM Experience Analytics delivers real sentiment, meaning and trends within social media for many of the world s leading consumer brand companies.

More information

See how social media listening and engagement can help your business

See how social media listening and engagement can help your business See how social media listening and engagement can help your business In a socially connected world, engagement with your customers can happen anywhere or anytime. Microsoft Social Engagement puts powerful

More information