(Big) Data Analytics: From Word Counts to Population Opinions
Mark Keane
Insight@University College Dublin
October 2014 ~ RSS ~ Edinburgh
September 2014/EPIC 2
Outline
What's New About (Big) Data Analytics
3 Sample Cases:
» Google Queries Predicting Epidemics
» Networks of Influence
» Financial Opinions in a Stockmarket Bubble
Take Home Messages
What's New?
Four Vs of Big Data
What's New?
The suggestion of a brave new world of (new) data analysis that can:
» handle vast amounts of data effortlessly
» with instant press-of-a-button answers
» from vast server farms of (almost free) computation
But there are significant issues
And there is a lot that is old (familiar)
What's Old?
Good old-fashioned data analysis
» Many statistical ideas are very familiar
» Many research problems are familiar
» Proper collection of data is important
» Proper treatment of data is critical
What's Really New?
An approach tipping-point with:
Very large data sets
» from 100s to 1,000,000,000s of data points
Unusual types of data
» video, text, thumbs-up, unstructured data
Non-standard data sources
» social media (FB, Tweets), news, phones
Data that is not conventionally measured
» the sensing devices are doing other things
In this New Big-Data World!
Who we know says a lot about who we are
» Facebook friends, LinkedIn network, tweet followers
What we write says a lot about what we think
» text in books, news, blogs, social media and so on
Where we are located says a lot about us
» location-based sensing, GPS, IP addresses
What we do says a lot about our decisions/interests
» what we buy, websites visited, YouTube videos watched, news re-tweeted, items shared and so on
Three Sample Cases
Finding Flu Outbreaks
Case 1: Predicting Flu from Searches
Google Flu Trends (GFT): aggregates search data, counting influenza keywords
US Centers for Disease Control and Prevention (CDC): tracks influenza-like illnesses (ILIs) in outpatient data
From 2003 to 2009, GFT showed high correlations with ILI stats (ILINet), until the 2009 influenza virus A (H1N1) pandemic [pH1N1]
Cook, S., Conrad, C., Fowlkes, A. L., & Mohebbi, M. H. (2011). Assessing Google Flu Trends performance in the United States during the 2009 influenza virus A (H1N1) pandemic. PLoS ONE, 6, e23610.
Good Correlations (Initially)
Cook, S., Conrad, C., Fowlkes, A. L., & Mohebbi, M. H. (2011). Assessing Google Flu Trends performance in the United States during the 2009 influenza virus A (H1N1) pandemic. PLoS ONE, 6, e23610.
Hang on a sec
In 2009, Google modified the model with new search terms
The Message
What we do says a lot about our concerns
» if I think I have flu, I look it up on Google
Here, people's illness is being defined by their search behaviour and keywords
Population behaviour can be predicted (in locations) by aggregating these searches
But proper treatment of data is critical (keywords, normalising)
» a model of what leads a user to use a certain search term
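The point about normalising can be made concrete with a small sketch. It assumes weekly counts of flu-keyword queries, weekly totals of all queries, and a weekly ILI rate; the function names and all numbers below are made up for illustration and are not GFT or CDC data. It compares the Pearson correlation of raw counts against ILI with that of the normalised keyword share.

```python
from math import sqrt


def normalise(keyword_counts, total_queries):
    """Return the share of all queries per week that matched flu keywords."""
    return [k / t for k, t in zip(keyword_counts, total_queries)]


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical weekly figures (illustration only, not real search or CDC data).
flu_keyword_counts = [120, 150, 310, 640, 900, 700]
total_queries = [10000, 10000, 20000, 40000, 60000, 50000]
ili_rate = [1.1, 1.6, 1.5, 1.7, 1.4, 1.5]  # assumed % of outpatient visits

share = normalise(flu_keyword_counts, total_queries)
r_raw = pearson_r(flu_keyword_counts, ili_rate)   # driven by overall search growth
r_norm = pearson_r(share, ili_rate)               # keyword share tracks ILI better

print(f"raw counts vs ILI:       r = {r_raw:.2f}")
print(f"normalised share vs ILI: r = {r_norm:.2f}")
```

In this toy series the raw counts mostly follow growth in total search volume, so dividing by the weekly total is what lets the keyword signal line up with the ILI rate.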
Networks of Influence
Case 2: Showing Networks of Influence
Tracking news on social networks
» terrorists release YouTube videos
» politicians comment on Facebook
» celebs tweet intimacies
Who you comment on, what you comment on, and where can reveal networks of influence
Storyful is using an Insight system to curate the lists of sources and propose new ones by analysing social networks
Curated Lists of Sources (Large)
Greene, D., Sheridan, G., Smyth, B., & Cunningham, P. (2012). Aggregating content and network information to curate Twitter user lists. In Proc. 4th ACM RecSys Workshop on Recommender Systems & The Social Web.
Automated Recommendation
Greene, D., Sheridan, G., Smyth, B., & Cunningham, P. (2012). Aggregating content and network information to curate Twitter user lists. In Proc. 4th ACM RecSys Workshop on Recommender Systems & The Social Web.
Networks in Syrian Conflict
Network of Syria-related Twitter accounts active during late 2013
O'Callaghan, D., Prucha, N., Greene, D., Conway, M., Carthy, J., & Cunningham, P. (2014). Online Social Media in the Syria Conflict: Encompassing the Extremes and the In-Betweens. arXiv preprint arXiv:1401.7535.
European Parliament Networks
Data analysed for 584 MEPs on Twitter during July-Sept 2014.
Cross, J. P., & Greene, D. (2014). Tracking information flows in the Council of the European Union: A social network analysis. Under review.
Political Groupings
Data analysed for 584 MEPs on Twitter during July-Sept 2014. Cross & Greene (2014)
The Outlier Party
Data analysed for 584 MEPs on Twitter during July-Sept 2014. Cross & Greene (2014)
The Message
Who we know says a lot about who we are
» Facebook friends, LinkedIn network, tweet followers
I can be defined by the people I know/like/respect/follow (homophily)
My behaviour can be predicted by assuming that like people act alike
But accuracy of those relationships is critical
» may not generalise from one domain to another
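The homophily assumption can be sketched as a simple neighbour vote: guess an unlabelled account's group from the most common group among the accounts it follows. This is only an illustrative baseline, not the method used in the cited studies; the account names, group labels and the function `predict_from_neighbours` are all hypothetical.

```python
from collections import Counter


def predict_from_neighbours(follows, known_labels, user):
    """Guess a user's label as the most common label among accounts they follow.

    Returns None when none of the followed accounts has a known label.
    """
    labels = [known_labels[n] for n in follows.get(user, ()) if n in known_labels]
    if not labels:
        return None
    return Counter(labels).most_common(1)[0][0]


# Hypothetical follow graph and labels (illustration only).
follows = {
    "alice": ["bob", "carol", "dave"],
    "erin": ["dave"],
    "frank": [],
}
known_labels = {"bob": "group_a", "carol": "group_a", "dave": "group_b"}

for user in follows:
    print(user, "->", predict_from_neighbours(follows, known_labels, user))
```

The caveat on the slide applies directly: if the follow links are noisy, or if following means something different in another domain, the vote inherits those errors.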
Tracking Bubble Behaviour
Case 3: Tracking Herding & Market Bubbles
Word frequencies reveal power laws (Zipf's Law)
A bubble would show in herd-like use of language
Power laws change systematically with herding
Sentiment of phrases should also be trackable
Gerow, A., & Keane, M. T. (2011, July). Mining the web for the voice of the herd to track stock market bubbles. IJCAI-2011. AAAI Press.
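Zipf's Law says that a word's frequency is roughly proportional to 1/rank^s, so log(frequency) falls roughly linearly with log(rank). A minimal sketch of estimating the exponent s is given below, assuming a plain least-squares fit on the log-log values. That fit is a crude estimator chosen for brevity and is not claimed to be the procedure used in Gerow & Keane (2011); the function names are hypothetical.

```python
import math
import re
from collections import Counter


def rank_frequency(text):
    """Return word frequencies sorted from most to least frequent."""
    words = re.findall(r"[a-z']+", text.lower())
    return sorted(Counter(words).values(), reverse=True)


def zipf_exponent(frequencies):
    """Estimate s in frequency ~ rank**(-s) by least squares on log-log values."""
    if len(frequencies) < 2:
        raise ValueError("need at least two ranked frequencies")
    xs = [math.log(rank) for rank in range(1, len(frequencies) + 1)]
    ys = [math.log(f) for f in frequencies]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return -slope  # the fitted slope is negative; report the exponent as positive


# Sanity check on an idealised Zipf distribution (synthetic, not corpus data).
ideal = [1000 / rank for rank in range(1, 51)]
print(round(zipf_exponent(ideal), 3))
```

Tracking how the fitted exponent shifts from one time window of text to the next is one way to look for the systematic change the slide describes.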
Zipf's Law & Moby Dick
Agreement between Commentators
Agreeing to Agree in Power Laws of Words
Analysing Text in News
17,713 finance articles (FT, NYT, BBC)
4 years (Jan 2006-Jan 2010), including the 2007 crash
10,418,266 words; we extract nouns and verbs
Correlations for verb distributions show:
» DJIA (r = .79), FTSE-100 (r = .78), NIKKEI-225 (r = .73)
NB: prediction is another matter
Gerow, A., & Keane, M. T. (2011, July). Mining the web for the voice of the herd to track stock market bubbles. IJCAI-2011. AAAI Press.
The Message
What we write says a lot about what we think
» text in books, news, blogs, social media and so on
Here, agreement in a population is being captured by carefully treated word frequencies
Population beliefs can be tracked by a distributional analysis of changes in words
But proper treatment of words is critical (stop-words, syntax)
» sentiment analysis had to be based on human judgements
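As a rough illustration of why word treatment matters, the sketch below removes stop-words before counting and then measures how much of the remaining text is taken up by the top k words. The tiny stop-word list, the two sentences and the `top_k_share` measure are assumptions made for this example; they are not the treatment or the measure from the cited work, and no part-of-speech tagging (nouns/verbs) is attempted here.

```python
import re
from collections import Counter

# Deliberately tiny, hypothetical stop-word list; real lists are much longer.
STOP_WORDS = frozenset({"the", "a", "an", "of", "and", "to", "in", "is", "are", "on"})


def content_word_counts(text, stop_words=STOP_WORDS):
    """Count lower-cased words, skipping stop-words."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in stop_words)


def top_k_share(counts, k):
    """Fraction of all counted words accounted for by the k most frequent."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(c for _, c in counts.most_common(k)) / total


# Made-up sentences standing in for two periods of commentary.
calm = "markets rise and banks lend while traders hedge and funds diversify"
herd = "buy buy buy the markets rise and traders buy and funds buy"

calm_counts = content_word_counts(calm)
herd_counts = content_word_counts(herd)
print(top_k_share(calm_counts, 1), top_k_share(herd_counts, 1))
```

Without the stop-word filter, words such as "and" and "the" would dominate both counts and mask the difference between the two texts.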
Some Conclusions
In this New Big-Data World!
Who we know
» Facebook friends, LinkedIn network, tweet followers
What we write
» text in books, news, blogs, social media and so on
Where we are located
» location-based sensing, GPS, IP addresses
What we do
» what we buy, websites visited, YouTube videos watched, news re-tweeted, items shared and so on
NOW ROUTINELY AVAILABLE AT A SMARTPHONE NEAR YOU
Promises and Caveats
Data analytics bears promise in tracking and predicting:
» population actions, beliefs, opinions, illnesses
» changes in those actions, beliefs, opinions, illnesses
Challenges are in finding:
» the right treatment of the data: selection/collation of data is still critical, as is combining multiple data sources
» the right analytic methods: which, if any, are appropriate
» the right interpretations: old-fashioned exclusion of variables/interpretation
The End