Twitter keyword volume, current spending, and weekday spending norms predict consumer spending


2012 IEEE 12th International Conference on Data Mining Workshops

Justin Stewart, Homer Strong, Jeffrey Parker, Mark A. Bedau
Reed College, Portland, Oregon, USA
Lucky Sort, Inc., Portland, Oregon, USA
Current address: Department of Economics, University of Pittsburgh, Pittsburgh, PA, USA

Abstract: We examine whether aggregate daily Twitter keyword volumes over eight months, from November 2011 to June 2012, can be used to predict aggregate daily consumer spending as reported by Gallup. We also examine whether Twitter keyword volume improves predictive ability over prediction based solely on current spending, weekday spending norms, and spending history. We divide spending and Twitter data into (i) in-sample data, used to identify which Twitter words are highly correlated with spending and to estimate model coefficients, and (ii) out-of-sample data, used to measure model forecast success. Our methods are very general and include n-grams (e.g., pairs of words, like "going shopping"). We note that the historical spending data exhibit a weekday pattern of high spending on two days and low spending over the rest of the week. Spending history also shows some striking deviations from weekday norms, such as Black Friday (the day after the American Thanksgiving holiday) and Boxing Day (the day after Christmas), both historically large shopping days. We build models on combinations of Twitter keyword volume (T), current spending (S), and weekday spending norms (D), and compare four measures of model forecast success: the correlation between actual and forecast daily spending changes, the percentage of correctly forecast directions of daily spending change, the correlation between actual and forecast deviations from weekday spending norms, and the percentage of correctly forecast deviations from weekday norms. We test model forecasts over the period April to June 2012.
Our results show that Twitter keyword volume, current spending, and weekday spending norms all have significant value for predicting consumer spending three days in advance, but none demonstrates a significant predictive advantage over the others.

Index Terms: social media; Twitter; forecast; consumer spending.

I. INTRODUCTION

We examine whether the aggregate daily volume of keywords in Twitter can be used to forecast consumer spending. The exploding volume of on-line social media opens the door to new ways to forecast economic variables, and Twitter has become the "E. coli" model organism of computational social science [1]-[5]. Twitter has been used to forecast daily consumer confidence as reported by Gallup [3] and financial variables like the S&P 500, DJIA, and VIX [2], [6], [7], as well as movie box office receipts [4] and Amazon book sales [8]. We examine whether Twitter can be used to forecast something much broader and more diffuse: the total aggregation of all consumer spending. Furthermore, whereas traditional economic forecasts of consumer spending are monthly [9], [10], we use daily Twitter and spending data to forecast daily consumer spending values.

We compare models that are linear combinations of three parameters: current consumer spending (S), weekday average spending (D), and current Twitter keyword volume (T). We evaluate how well the models can predict daily spending changes and daily deviations from weekday spending norms. We choose Twitter keywords by looking for words whose Twitter frequencies are highly correlated with consumer spending. We also use historical spending and Twitter data to estimate model coefficients.
We mitigate the well-known problems with estimating model terms and parameters from data [11] in three ways: by providing a clear and compelling motivation for choosing the terms in our models; by separating in-sample and out-of-sample data, generating model terms and coefficients only in-sample and testing model forecast success only out-of-sample; and by emphasizing the statistical significance of our results.

The causal link between sentiment and economic activity can seem obscure and dubious when financial indexes are predicted by sentiment analysis applied to Twitter data [3], [6], [7], [12]. A plausible causal connection between consumer spending and Twitter keyword volume is that tweeters are a reasonable approximation of consumer spenders, and they tend to tweet about whatever they plan to do or are doing. So, if tweeters plan to spend more or are spending more, this will be reflected in an increased volume of tweets about spending. As we explain below, we observed that Twitter volume about spending peaks three days before peaks in consumer spending as reported by Gallup. This might enable models based on today's Twitter keyword volume to predict the spending levels that Gallup will report in three days.

II. CONSUMER SPENDING AND TWITTER DATA

We test each model's ability to predict daily consumer spending changes and deviations from weekday norms, derived from daily consumer spending values reported by Gallup. Gallup describes this data as "the average dollar amount Americans report spending or charging on a daily basis, not counting the purchase of a home, motor vehicle, or normal household bills." Fig. 1 shows consumer spending ($S_t$) reported by Gallup, daily changes in spending ($\Delta S_t$), weekday statistical spending norms ($D_t$), and daily deviations from weekday norms ($S_t - D_t$) for November 2011 to March 2012. This is the in-sample data used in all experiments reported here. Fig. 2 is a blow-up of the data from November alone.

The Gallup data is reported as a three-day moving average of consumer spending. To obtain the daily consumer spending time series $S_t$ used in this study, we had to reconstruct daily spending values. To do so, we identified a 4-day interval in the month preceding the first day of the sample that exhibited almost zero variation. Assuming that two of the four reported values were the true spending values for those days, we were able to infer the third value used to construct the moving average. We then decomposed the moving-averaged series into its component daily parts. We handle missing values as Gallup does, by dropping those days from the data set.

All of the data used in this study falls within the period 1 November 2011 through 30 June 2012. All experiments segregate data into non-overlapping in-sample and out-of-sample segments. Only in-sample data influenced Twitter keyword choice and model coefficient estimation, and only out-of-sample data tested model spending forecasts.

Fig. 3 is a scatter plot of daily consumer spending on each day of the week. One can see a slight but evident difference between the scatter plots on the top and bottom. The top shows the first five months of data (November to March) and the bottom shows all eight months of data (November to June). The similarity between top and bottom shows that five months of data provides a good but imperfect approximation of the typical pattern in spending across the days of the week.
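The reconstruction of daily values from Gallup's three-day moving average, described earlier in this section, can be illustrated with a minimal Python sketch. It assumes a trailing window, i.e. that the reported average at day t is $(S_t + S_{t-1} + S_{t-2})/3$, and that the first two daily values are already known; the paper does not state the window alignment, and the function name `reconstruct_daily` is hypothetical.

```python
def reconstruct_daily(moving_avg, first, second):
    """Recover daily values from a trailing 3-day moving average.

    Assumption (not stated in the paper): moving_avg[t] equals
    (s[t] + s[t-1] + s[t-2]) / 3 for t >= 2. Entries 0 and 1 are ignored,
    because the two seed values `first` and `second` stand in for them.
    """
    daily = [first, second]
    for t in range(2, len(moving_avg)):
        # Solve the averaging identity for the one unknown daily value.
        daily.append(3 * moving_avg[t] - daily[t - 1] - daily[t - 2])
    return daily


# Toy round trip: average a made-up series, then undo the averaging.
true_daily = [60.0, 62.0, 70.0, 55.0, 90.0, 58.0]
avg = [None, None] + [
    sum(true_daily[t - 2:t + 1]) / 3 for t in range(2, len(true_daily))
]
recovered = reconstruct_daily(avg, true_daily[0], true_daily[1])
```

Because each recovered value depends on the two before it, any error in the seed values propagates through the whole series, which is presumably why a stretch with almost zero variation is a sensible place to seed the recursion.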
There is a noticeable pattern in mean spending across the week: four days of low spending, followed by a day of average spending, followed by two days of high spending (starting on the second day depicted in the figure). We use the weekday spending pattern ($D_t$) in a number of ways. First, we ask how well weekday spending norms alone predict consumer spending (Model D), and we ask how much predictions improve with the addition of terms for recent spending history (Model DS) and Twitter keyword volume (Model DST). Second, we examine deviations from weekday spending norms ($S_t - D_t$), and evaluate how well models of recent spending history (Model DS) and Twitter keyword volume (Model DST) predict deviations from this norm ($D_t$). Third, the highest spending value on the chart (which is Black Friday) is reported by Gallup on Sunday. This indicates that Gallup's daily spending values report spending that actually occurred two days before. So, forecasting what Gallup reports three days in advance is equivalent to forecasting one day before money changes hands and spending actually occurs. Finally, we use $D_t$ to produce our null model H, which simply samples the data points scattered in Fig. 3. This string of spending forecasts is guaranteed to share much of the statistical character of the actual spending values.

Fig. 1. Actual consumer spending ($S_t$), daily change in consumer spending ($\Delta S_t = S_t - S_{t-1}$), average weekday spending ($D_t$), and deviation from weekday norms ($S_t - D_t$) for November 2011 to April 2012. Twitter keywords and model coefficients for our experiments are based on the in-sample data in this figure.

Fig. 2. A blow-up of November from Fig. 1. Black Friday is the big spike in spending near the end of November. Note the gaps in the spending record (and, hence, gaps in the records of spending change and weekday deviation), due to missing data from Gallup on those days.
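The three derived series plotted in Fig. 1 follow directly from their definitions. The Python sketch below is only an illustration on made-up numbers: the function names are hypothetical, and it assumes the input days are consecutive, whereas the paper drops days with missing Gallup values.

```python
from collections import defaultdict
from datetime import date, timedelta


def weekday_norms(dates, spending):
    """Mean spending for each weekday (0 = Monday ... 6 = Sunday): D_t."""
    totals, counts = defaultdict(float), defaultdict(int)
    for d, s in zip(dates, spending):
        totals[d.weekday()] += s
        counts[d.weekday()] += 1
    return {wd: totals[wd] / counts[wd] for wd in totals}


def daily_changes(spending):
    """Delta S_t = S_t - S_{t-1}, assuming consecutive days."""
    return [b - a for a, b in zip(spending, spending[1:])]


def weekday_deviations(dates, spending, norms):
    """S_t - D_t: deviation of each day from its weekday norm."""
    return [s - norms[d.weekday()] for d, s in zip(dates, spending)]


# Two made-up weeks starting on Monday, 7 November 2011.
dates = [date(2011, 11, 7) + timedelta(days=i) for i in range(14)]
spending = [50, 52, 48, 51, 60, 80, 85,
            54, 50, 50, 49, 64, 90, 75]
norms = weekday_norms(dates, spending)
changes = daily_changes(spending)
devs = weekday_deviations(dates, spending, norms)
```

In the paper the norms are computed from in-sample days only and then applied to out-of-sample days; the toy example computes and applies them on the same two weeks purely for brevity.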
Our social media data source is Twitter, and our raw Twitter data is the daily word frequencies in a Twitter "spritzer" (which represents 1% of the total Twitter volume). Twitter word frequencies (T) appear as a term in some of the models we study here. We chose the words that compose T by examining Twitter feed data (tweets) during periods of high consumption. An initial set of words was found by ranking all the words used in tweets occurring around the Black Friday holiday sales by their frequency and selecting those that signaled general consumption behavior (e.g., "buy", "shopping", and "store"). This list of words was then reduced by computing the correlation between the lagged Twitter word frequency time series and the consumer spending index, and keeping the words exhibiting high correlations. Word frequencies were lagged by three or two days before being correlated with spending, so that a high correlation could be used to forecast future consumer spending. To maximize the statistical power of our words, we correlated consumer spending with the lagged frequency of thousands of randomly sampled words in Twitter and kept only those words with correlations in the highest 2.% tail of the distribution of correlations. Table I lists the resulting words, their lagged correlations with consumer spending, and their lag length.

We also computed the correlation of every possible pair of words in the sample to find highly related words. A subset of words that are mutually highly correlated with one another forms a cluster. We found two distinct clusters of related words: the first was defined by high correlation with spending three days in advance, and the other by high correlation two days in advance. Fig. 4 displays a heat map revealing the two distinct clusters in the sample words. The top-right cluster consists of words with three-day lags, and the bottom-left cluster consists of words with two-day lags. Our hypothesis is that the two clusters represent two different signals that can be used to forecast consumer spending. Proper definition of the clusters will robustly capture the consumer spending signal and reduce its measurement noise.

TABLE I
WORD FREQUENCIES CORRELATED WITH CONSUMER SPENDING

Word            Correlation  Lag      Word      Correlation  Lag
shopping                     days     fun                    days
store           .39          3 days   clubbing               days
wal mart                     days     couch                  days
going shopping               days     bar       .34          2 days
shop                         days     beer                   days
buy                          days     bought                 days

Fig. 3. Scatter plot of Gallup's daily consumer spending values, $S_t$, for each day of the week (Monday through Sunday). The large blue dots show the mean spending on each day ($D_t$). The red line shows the mean spending over the entire week. Above: Data from only the five months November to March. Below: Data from the eight months November to June.

III. MODELS OF CONSUMER SPENDING

We compare the forecasting ability of a series of models of consumer spending. The models are driven by different sources of data. We use these models to predict daily changes and deviations from weekday norms in consumer spending three days in advance. Then we gauge a model's relative forecasting success by comparing correlations and percentage match of direction between model forecasts and actual consumer spending.

A. Model parameters

The models are driven by different combinations of the following data: current consumer spending, spending averaged over each day of the week, and current Twitter keyword frequencies:

$S_t$: the consumer spending as reported by Gallup at t,
$D_t$: the weekday average spending norm at t,
$T_t$: the Twitter keyword volume at t.
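The keyword screening step of Section II, in which a word's frequency today is correlated with spending reported some days later, can be sketched in Python as follows. This is an illustration on made-up numbers, not the authors' code: the function names are hypothetical, and the choice between a two-day and a three-day lag is made here simply by taking the larger correlation.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def lagged_correlation(word_freq, spending, lag):
    """Correlate word_freq[t] with spending[t + lag] (requires 0 < lag < len)."""
    n = len(spending) - lag
    return pearson(word_freq[:n], spending[lag:])


# Made-up series in which spending echoes the word frequency 3 days later.
freq = [1, 3, 2, 5, 4, 6, 2, 7, 3, 8, 5, 9]
spending = [70, 70, 70] + [50 + 10 * f for f in freq[:-3]]

r3 = lagged_correlation(freq, spending, 3)
r2 = lagged_correlation(freq, spending, 2)
best_lag = max((2, 3), key=lambda k: lagged_correlation(freq, spending, k))
```

Repeating the same calculation for thousands of randomly sampled words would give the reference distribution of correlations against which the candidate keywords are compared.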

Fig. 4. Heat map depicting the magnitudes of the correlations between lagged word frequencies and consumer spending. Each box indicates the absolute correlation between one word pair, and brighter colors represent larger correlations. Words are clustered along the axes so that closely related words are adjacent. Two distinct clusters emerge: the large positive correlations among words with 3-day and 2-day lags.

Our raw Twitter data is parsed into a two-dimensional Twitter volume matrix, $W$, where $W_{tw}$ refers to the volume (number of occurrences, or tokens) of word type $w$ at time $t$. We let $|t|$ be the total number of different times $t$ (rows), and $|w|$ be the total number of word types $w$ (columns) in $W$. Summing each column of $W$ over all times yields a vector whose entry for word type $w$ is the total Twitter volume of that word type, $V_w$:

$$V_w = \sum_{t=1}^{|t|} W_{tw}. \tag{1}$$

Summing each row of $W$ over all word types yields a vector whose entry for time $t$ is the total Twitter volume at that time, $V_t$:

$$V_t = \sum_{w=1}^{|w|} W_{tw}. \tag{2}$$

Finally, the total absolute Twitter volume is:

$$V = \sum_{t=1}^{|t|} \sum_{w=1}^{|w|} W_{tw}. \tag{3}$$

These three volumes can then be used to estimate relative Twitter word frequencies. The probability that a randomly sampled word token in the entire Twitter volume is of type $w$ can be estimated to be $V_w / V$, and the probability that the token occurred at time $t$ can be estimated to be $V_t / V$.

We choose our set $K$ of keywords to be those candidate words that most highly correlate with spending and maximize our models' predictive power on in-sample training data. Some of our keywords are bi-grams, i.e., sequences of two words, like "wal mart" or "going shopping". We treat bi-grams as single words, and normalize bi-grams by dividing by the total number of bi-grams.
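As a small worked illustration of the three volumes and the two relative frequencies, the Python sketch below uses a made-up 2 x 3 volume matrix, with rows as times and columns as word types. The function name `marginal_volumes` is hypothetical.

```python
def marginal_volumes(W):
    """Return (V_t, V_w, V) for a volume matrix W given as a list of rows.

    V_t[t] is the total volume at time t (a row sum), V_w[w] is the total
    volume of word type w (a column sum), and V is the grand total.
    """
    V_t = [sum(row) for row in W]
    V_w = [sum(col) for col in zip(*W)]
    V = sum(V_t)
    return V_t, V_w, V


# Made-up counts: 2 days (rows) by 3 word types (columns).
W = [[2, 1, 7],
     [4, 3, 3]]
V_t, V_w, V = marginal_volumes(W)

# Estimated probability that a random token is of type w, or occurred at t.
p_word = [v / V for v in V_w]
p_time = [v / V for v in V_t]
```

Both sets of estimated probabilities sum to one, since each is a marginal of the same table of counts.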
We define the Twitter index, $T^w_t$, for each word type $w \in K$ to be:

$$T^w_t = \frac{W_{tw}}{V_t V_w}, \tag{4}$$

which gives the volume of $w$ at $t$ compared to (divided by) the total volume at $t$ and the total volume of $w$. A word's Twitter index indicates the degree to which the word's volume is higher or lower than expected from the background volumes at $t$ relative to other times and of $w$ relative to other words. The total Twitter index that we ultimately used in our models of consumer spending, $T_t$, is simply the mean of the Twitter indices of all keywords $w$ in $K$:

$$T_t = \frac{1}{|K|} \sum_{w \in K} T^w_t. \tag{5}$$

The influence of each keyword on the index is weighted by its relative importance, undistorted by fluctuations in the number of keywords.

B. Predicting consumer spending

We compare models on their ability to forecast consumer spending. The coefficients in the models ($\alpha$, $\beta$, $\gamma$, $\delta$) are estimated by an autoregressive distributed lag model from the history of consumer spending in the in-sample data, and the coefficients are estimated separately for each model. Here we study predictions that are three days into the future, but our methods generalize to predictions of other ranges.

The models we study here are constructed by regressing consumer spending history on weekday spending norms and current spending levels. The models explicitly forecast consumer spending, from which we calculate forecasts of daily changes in spending and deviations from weekday spending norms. Models built by regressing on daily spending changes or deviations from weekday spending norms produced results akin to the models studied here.

Model H predicts consumer spending by randomly resampling the history of actual spending values (from the training data). The model uses the actual historical sequence of spending values ($S_t$) in the entire training set, and then forecasts successive values by sampling with replacement from this history. If we let spending.history refer to the distribution of points scattered in Fig. 3, then Model H generates n successive predictions of consumer spending by sampling n times with replacement from spending.history, as in the following R command:

$S^H_{t+3}$ = sample(spending.history, n, replace = TRUE). (6)
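For readers who do not use R, the Twitter index of Eqs. (4) and (5) and the Model H resampling of Eq. (6) can be sketched in Python. This is a minimal illustration on made-up counts: the function names, the `seed` parameter, and the toy data are all hypothetical, and `random.choices` stands in for R's `sample(..., replace = TRUE)`.

```python
import random


def twitter_index(W, keyword_cols):
    """Mean over keywords of W[t][w] / (V_t[t] * V_w[w]), one value per time t."""
    V_t = [sum(row) for row in W]
    V_w = [sum(col) for col in zip(*W)]
    index = []
    for t, row in enumerate(W):
        per_word = [row[w] / (V_t[t] * V_w[w]) for w in keyword_cols]
        index.append(sum(per_word) / len(per_word))
    return index


def model_h_forecast(spending_history, n, seed=None):
    """Null model H: draw n forecasts with replacement from past spending."""
    rng = random.Random(seed)
    return rng.choices(spending_history, k=n)


# Made-up counts: 2 days by 3 word types; columns 0 and 1 are the keywords.
W = [[2, 1, 7],
     [4, 3, 3]]
T = twitter_index(W, keyword_cols=[0, 1])

history = [55.0, 61.0, 48.0, 90.0, 72.0]
forecasts = model_h_forecast(history, n=8, seed=1)
```

Because Model H ignores the order of past values, its forecasts reproduce the overall scatter of spending but carry no information about any particular day, which is what makes it usable as a baseline.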

Model H guarantees that the scatter of future spending values will resemble the scatter seen in Fig. 3. We assess a model's forecasting ability by asking how much better it predicts spending than Model H.

Model D guarantees that the consumer spending level matches the statistical average value for the day of the week, $D_t$. The model calculates the average spending on each day of the week in the in-sample data, and then forecasts that consumer spending on every out-of-sample day will equal that weekday's average spending. The forecast for consumer spending three days in the future is given by an equation of the form:

$$S^{D}_{t+3} = \alpha + \beta D_{t+3} + \epsilon. \tag{7}$$

Note that $D_t$ is known for all times $t$ covered in our experiments.

Model DS builds on Model D and assumes that spending is determined by the combination of recent spending history and average weekday spending. So, the spending forecast of Model DS for three days in the future is:

$$S^{DS}_{t+3} = \alpha + \beta D_{t+3} + \gamma S_t + \epsilon. \tag{8}$$

Model ST forecasts spending using only current spending and Twitter volume:

$$S^{ST}_{t+3} = \alpha + \gamma S_t + \delta T_t + \epsilon. \tag{9}$$

Model DT forecasts spending using only weekday norms and current Twitter volume:

$$S^{DT}_{t+3} = \alpha + \beta D_{t+3} + \delta T_t + \epsilon. \tag{10}$$

Model DST builds on Model DS by forecasting spending from today's Twitter index in addition to today's spending and weekday norms:

$$S^{DST}_{t+3} = \alpha + \beta D_{t+3} + \gamma S_t + \delta T_t + \epsilon. \tag{11}$$

Fig. 5 shows April to June daily changes in consumer spending (top) and deviations from weekday norms (bottom), together with the predictions of Models D, DS and DST. Note the comparable predictions of Models DS and DST. (Model D predicts no deviation from weekday norms, of course.)

C. Measuring model forecast success

We ask how well a model forecasts daily changes and deviations from weekday norms in consumer spending. We choose the Twitter keywords and estimate the coefficients ($\alpha$, $\beta$, ...) by training models on in-sample data about consumer spending history and Twitter keyword frequencies. We then have our models predict consumer spending three days in advance of time $t$, $S_{t+3}$, using only information available at $t$, such as $S_t$ and $T_t$. Each model predicts consumer spending on each day in the out-of-sample consumer spending data. The model's three-day forecast is then compared with the actual consumer spending reported by Gallup to see how well predicted and actual spending match.

Economic data like consumer spending is known to be autocorrelated, so we measure how well models forecast two detrended data streams calculated from consumer spending: the daily spending change ($\Delta S_t = S_t - S_{t-1}$) and the daily deviation from weekday norms ($S_t - D_t$). We use two measures of the success of a model $M$ in predicting each data stream: the correlation between actual and predicted values, $(S_t, S^M_t)$, and the percentage of pairs with the same sign (i.e., pairs of values that move up or down together). These four measures of model forecast success constitute a model's success profile.

Fig. 5. Above: Actual consumer spending changes ($\Delta S$) and predictions by models D, DS, DT, ST, and DST for out-of-sample data April to July 2012. Below: The same for deviations from weekday norms ($S_t - D_t$). Note that Model D predicts no deviations from weekday norms.

IV. RESULTS OF PREDICTING CONSUMER SPENDING

We calculate success profiles for each model on the two tasks of forecasting daily spending changes ($\Delta S_t$) and daily deviations from weekday spending norms ($S_t - D_t$). We first look for the ability to predict spending significantly better than Model H. We test whether a model's correlation scores are significantly better than zero (Model H) by comparing measured values with the standard error ($1/\sqrt{n}$). To test whether a model is improved by the addition of one or more further terms, we check whether F statistics for the pair of models exceed the 5% significance threshold.¹ We do not quantify the statistical significance of percentage match scores.

Table II shows the success profiles for model forecasts three days into the future. Model terms and coefficients were based on in-sample data from November to March, and model predictions were evaluated on out-of-sample data from April to June. D(1) collects norms over only the five in-sample months from November to March. Models DS, DT, and DST, and weekday norm deviations, were based on D(1). Success results for Model H are reported with standard error bounds obtained by repeated resampling. The standard error of the other correlation measurements (n = 88) is close to 0.1. Note that the percentage of ups and downs in $\Delta S_t$ and $S_t - D_t$ is not exactly 50%.

As expected, Model H displays zero ability to forecast future spending. This makes Model H perfect as a "no success" baseline against which to measure the profiles of better models. Also note that Model D fails to forecast major deviations from weekday norms. The most significant positive finding from the table is that all of the models are significantly better than Model H at predicting spending.
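The four-part success profile and the $1/\sqrt{n}$ standard error can be sketched in Python as follows. Several details here are assumptions rather than the authors' stated procedure: forecast changes are taken as differences between successive forecasts, a value of exactly zero is treated as "not up" when comparing signs, and all names and numbers are illustrative.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def sign_match(x, y):
    """Fraction of pairs on the same side of zero (zero counts as not up)."""
    same = sum(1 for a, b in zip(x, y) if (a > 0) == (b > 0))
    return same / len(x)


def diffs(series):
    """Successive differences, assuming consecutive days."""
    return [b - a for a, b in zip(series, series[1:])]


def success_profile(actual, forecast, norms):
    """Correlation and sign match for daily changes and weekday deviations."""
    d_act, d_fc = diffs(actual), diffs(forecast)
    dev_act = [s - d for s, d in zip(actual, norms)]
    dev_fc = [s - d for s, d in zip(forecast, norms)]
    return {
        "cor_change": pearson(d_act, d_fc),
        "pct_change": sign_match(d_act, d_fc),
        "cor_deviation": pearson(dev_act, dev_fc),
        "pct_deviation": sign_match(dev_act, dev_fc),
    }


def standard_error(n):
    """Approximate standard error of a correlation near zero: 1 / sqrt(n)."""
    return 1 / math.sqrt(n)


# Made-up five-day example: actual spending, a forecast, and weekday norms.
actual = [60, 70, 65, 90, 80]
forecast = [62, 68, 66, 85, 83]
norms = [64, 66, 66, 84, 82]
profile = success_profile(actual, forecast, norms)
```

With n = 88 out-of-sample days, `standard_error(88)` is roughly 0.107, in line with the value of about 0.1 quoted in the text.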
Since the standard error for the correlation results in Table II is 0.1, the correlation profiles provide solid evidence that models of Twitter keyword volume in combination with either current spending or weekday norms have significant success at predicting consumer spending.

The most significant negative finding from the table is that none of the models demonstrates a significantly better forecasting ability than any other model. The F statistics for nested model comparisons range from .8 to 1.19, and none is above the 5% significance threshold. In sum, Models D, DS, and DST all significantly predict spending changes, but the differences in their success profiles are statistically too weak to reject the null hypothesis of no difference in model success.

We completed other tests as well, applying the same basic methodology to variations of the in-sample and out-of-sample periods. For example, because November is known to include significant deviations from weekday norms (most notably Black Friday), we applied the above methods to forecast November spending. However, the results were consistent with those presented above.

¹ We compute F statistics for out-of-sample forecasts by noting that the ratio of the residual sum of squares for the reduced and full models should have approximately an F(n, n) distribution, where n is the number of out-of-sample observations.

TABLE II
SUCCESS PROFILES OF MODEL FORECASTS OF APRIL-JUNE
Forecasting April - June (n = 88) from November - March (n = 13)

Model    Cor ΔS_t   % ΔS_t    Cor S_t - D_t   % S_t - D_t
H (x1)   . ±        % ± 6%    .9 ± .9         49% ± %
D(1)     .14        42%       NA              NA
DS       .39        44%       .4              %
DT       .13        44%       .3              3%
ST       .3         6%        .29             8%
DST      .38        46%       .42             6%

V. CONCLUSIONS

Earlier work demonstrated strong correlations between Twitter and economic variables in narrowly focused economic contexts [4], [8], but it is much more difficult to predict aggregate consumer spending on a daily resolution.
To identify whether social media information significantly improves model forecasts, we compared forecasts of models based on social media data with forecasts of models based only on recent spending history. Our results verified the significant three-day forecasting power of models based on Twitter volume and current spending (Model ST) and models based on Twitter volume and weekday norms (Model DT). But similar forecasting ability was also demonstrated by models based only on weekday norms and current spending (Model DS) or on weekday norms alone (Model D). The statistical resolution provided by our data detected no significant difference between models that do and do not depend on Twitter volume. So, Twitter keyword volume helps predict consumer spending, but not demonstrably better than current spending and weekday spending norms alone.

Future work could adjust certain aspects of the methodology presented above. One possibility is to develop the n-gram selection process further. To this end, it could be beneficial to include keyword co-occurrences (i.e., n-grams that are often contained in the same tweet as the objective keyword) in the models as well.

The difficulty of determining whether Twitter data improves model forecasting ability in the present work is partly due to our small sample size (the small number of days in the in-sample data). We are cautiously optimistic that this limitation can be surmounted with the accumulation and analysis of big data in the social sciences [13]-[15].

ACKNOWLEDGMENT

We thank Albyn Jones, Noah Pepper, and Norman Packard for helpful advice, and a Ruby grant from Reed College for support.

REFERENCES

[1] J. Bollen, H. Mao, and S. Counts, "Computational economic and finance gauges: Polls, search, and Twitter," Meeting of the National Bureau of Economic Research - Behavioral Finance Meeting, Stanford, CA, November.
[2] X. Zhang, H. Furhres, and P. Gloor, "Predicting the stock market through Twitter: 'I hope it is not as bad as I fear'," Collaborative Innovation Networks (COINs), Savannah, GA, 2010.
[3] B. O'Connor, R. Balasubramanyan, B. R. Routledge, and N. A. Smith, "From tweets to polls: Linking text sentiment to public opinion time series," in Proceedings of the International AAAI Conference on Weblogs and Social Media, 2010.
[4] S. Asur and B. A. Huberman, "Predicting the future with social media," arXiv:1003.5699v1, March 2010.
[5] J. Weng, E.-P. Lim, Q. He, and C. W.-K. Leung, "What do people want in microblogs? Measuring interestingness of hashtags in Twitter," in 2010 IEEE 10th International Conference on Data Mining, December 2010.
[6] J. Bollen, H. Mao, and X. Zeng, "Twitter mood predicts the stock market," Journal of Computational Science, vol. 2, no. 1, pp. 1-8, March 2011.
[7] E. Gilbert and K. Karahalios, "Widespread worry and the stock market," in Fourth International AAAI Conference on Weblogs and Social Media, Washington, DC, 2010.
[8] D. Gruhl, R. Guha, R. Kumar, J. Novak, and A. Tomkins, "The predictive power of online chatter," in Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining. New York, NY: ACM Press, 2005.
[9] J. Slacalek, "Forecasting consumption," working paper, German Institute for Economic Research, 2004.
[10] J. A. Wilcox, "Forecasting components of consumption with components of consumer sentiment," Business Economics, vol. 42, no. 4, 2007.
[11] M. C. Lovell, "Data mining," The Review of Economics and Statistics, vol. 65, no. 1, pp. 1-12, 1983.
[12] A. Pak and P. Paroubek, "Twitter as a corpus for sentiment analysis and opinion mining," in Proceedings of the Seventh Conference on International Language Resources and Evaluation, European Language Resources Association (ELRA), Valletta, Malta, May 2010.
[13] D. Lazer, A. Pentland, L. Adamic, S. Aral, A.-L. Barabasi, D. Brewer, N. Christakis, N. Contractor, J. Fowler, M. Gutmann, T. Jebara, G. King, M. Macy, D. Roy, and M. V. Alstyne, "Computational social science," Science, vol. 323, no. 5915, 2009.
[14] J. Giles, "Making the links," Nature, vol. 488, 2012.
[15] D. J. Watts, "A twenty-first century science," Nature, vol. 445, p. 489, 2007.