Big Data as a data source for official statistics
Piet Daas, Marco Puts, Bart Buelens and Paul van den Hurk
Statistics Netherlands
Overview
- Data sources and statistics
- More & more data becomes available
- Effect on statistics production
- How we study Big Data: 2 examples
  - Traffic loop detection data
  - Social media messages
Introduction
- Statistics Netherlands produced about 5,000 official publications and tables in 2012
- For this we need DATA
Data sources for official statistics
- Primary data: our own surveys
- Secondary data: data from others
  - Administrative sources
  - New data sources
Statistics Netherlands law
- Statistics Netherlands aims to reduce the administrative burden for companies and the public as much as possible
- By (re-)using existing administrative registrations of both government and government-funded organizations
- And by studying potential new sources of information
Data, data everywhere!
Statistics Netherlands and Data
- Data is generated in increasing amounts and at increasing frequencies:
  - From data scarcity (sample surveys) to data abundance (administrative & Big Data)
  - Ever-increasing amounts of data need to be checked, processed and analyzed
  - More sources of information become available
  - Opportunities to produce statistics faster ("real-time statistics")
- Need for new methods and tools:
  1. Methods to quickly uncover information from the massive amounts of data available, such as visualisation methods, data, text and stream mining techniques ("making Big Data small"), and high-performance computing
  2. Methods capable of integrating the information into the statistical process, e.g. linking at massive scale, macro/meso-integration, and estimation methods suited to large datasets
2 Big Data case studies
Research findings on the study of Big Data sources from a statistics point of view
1. Traffic loop detection data
   - 80 million records/day, 90 days studied so far
   - Number of vehicles detected each minute
2. Dutch social media messages
   - 1 to 2 million public messages/day, up to 2 billion records studied
   - Content and sentiment
1. Traffic loop detection data
- Traffic loops: every minute (24/7) the number of passing vehicles is counted by >10,000 road sensors & cameras in the Netherlands
- Counts are available for all vehicles together and per length class
- Interesting source for producing traffic and transport statistics (and more)
- Huge amounts of data, about 100 million records a day
- Locations (map)
Number of detected vehicles on a single day
- By all loops: total = ~295 million
Traffic loop detection activity (only first 10 min.)
Correct for missing data
- Corrected data (for blocks of 5 min)
- Before: total = ~295 million
- After: total = ~330 million (+12%)
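The slide does not say how the correction was computed. One simple possibility, given here only as an illustration, is to group the minute counts of a single loop into 5-minute blocks and scale the observed sum of each block up to the full block length. The function name, the use of `None` for a missing minute and the choice to leave fully empty blocks at zero are all assumptions, not the authors' method.

```python
from typing import List, Optional


def corrected_block_totals(counts: List[Optional[int]], block: int = 5) -> List[float]:
    """Estimate per-block vehicle totals from minute counts that contain gaps.

    counts: one entry per minute; None marks a minute without a report.
    Assumption (not from the slides): a missing minute is taken to carry
    the mean of the observed minutes in the same block.
    """
    totals = []
    for start in range(0, len(counts), block):
        chunk = counts[start:start + block]
        observed = [c for c in chunk if c is not None]
        if not observed:
            totals.append(0.0)  # nothing in this block to extrapolate from
            continue
        # Scale the observed sum up to the full length of the block.
        totals.append(sum(observed) * len(chunk) / len(observed))
    return totals


# Hypothetical minute counts for one loop (10 minutes, 3 of them missing).
minute_counts = [12, None, 15, 14, None, 9, 10, 11, None, 10]
raw_total = sum(c for c in minute_counts if c is not None)
corrected_total = sum(corrected_block_totals(minute_counts))
print(raw_total, round(corrected_total, 1))
```

A block-wise correction of this kind always raises or preserves the total, which is the direction of the change reported on the slide (from ~295 to ~330 million).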
Total vehicles during the day (snapshots)
For different vehicle lengths
- 1 category: Total
- 3 categories: Total; <= 5.6 m; > 5.6 & <= 12.2 m; > 12.2 m
- 5 categories: Total; > 1.85 & <= 2.4 m; > 2.4 & <= 5.6 m; > 5.6 & <= 11.5 m; > 11.5 & <= 12.2 m; > 12.2 m
- Small vehicles: <= 5.6 m
- Medium-sized vehicles: > 5.6 m & <= 12.2 m
- Large vehicles: > 12.2 m
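The three-class split above can be written as a small lookup. This is a minimal sketch: the thresholds are the ones on the slide, while the function name and the sample lengths are made up for illustration.

```python
from collections import Counter


def length_class(length_m: float) -> str:
    """Map a vehicle length in metres to the three classes on the slide."""
    if length_m <= 5.6:
        return "small"
    if length_m <= 12.2:
        return "medium"
    return "large"


# Hypothetical vehicle lengths in metres, for illustration only.
lengths = [4.2, 4.5, 5.6, 7.0, 12.2, 16.5, 4.1, 3.9]
class_counts = Counter(length_class(x) for x in lengths)
print(class_counts)
```

The boundaries are inclusive on the upper side, so a 5.6 m vehicle counts as small and a 12.2 m vehicle as medium.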
Small vehicles: ~75% of total
Small & medium vehicles
Small, medium & large vehicles
Volatile behaviour at the micro-level
2. Social media messages
- The Dutch are very active on social media platforms
- More and more people have a smartphone!
- Almost always carried along and nearly always switched on
- Possible information source for:
  - Which topics are current: number of messages and sentiment about them
- Can be used as a measuring instrument for:
- Map by Eric Fischer (via Fast Company)
2. Social media messages
- The Dutch are very active on social media platforms
- Potential information source for:
  - Topics discussed and sentiment about these topics (quickly available!)
  - And probably more?
- Investigate it to obtain an answer on potential use:
  - 2a. Content: Dutch Twitter messages collected for study, selection of 12 million
  - 2b. Sentiment: sentiment in Dutch social media messages, all ~2 billion
Social media: Dutch Twitter topics
- Figure: distribution of 12 million messages over topics (labelled shares: 3%, 7%, 3%, 10%, 7%, 3%, 5%, 46%)
Sentiment in social media
- Access to the Coosto database
  - > 2 billion publicly available messages
  - Twitter, Facebook, Hyves, web forums, blogs etc.
- Sentiment of each message: positive, negative or neutral
- Interesting finding: the so-called "Mood of the nation" can be determined and compared to the Consumer confidence of Statistics Netherlands
Consumer confidence, survey data
- (pos - neg) as % of total
- Sentiment towards the economic climate
- ~1000 respondents/month
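The balance statistic "(pos - neg) as % of total" is short enough to write out. The sketch below applies it to survey answers and, as an assumption made here for comparability, to per-message sentiment labels as well. The slides do not state that the social media series was computed this way, and all counts and labels in the example are invented.

```python
from collections import Counter
from typing import Iterable


def balance(pos: int, neg: int, total: int) -> float:
    """Return (pos - neg) as a percentage of total."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * (pos - neg) / total


def balance_from_labels(labels: Iterable[str]) -> float:
    """Balance for a sequence of 'positive' / 'negative' / 'neutral' labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return balance(counts["positive"], counts["negative"], total)


# Hypothetical month of survey answers: 1000 respondents.
survey_balance = balance(pos=250, neg=400, total=1000)

# Hypothetical message labels for the same month.
message_balance = balance_from_labels(
    ["positive", "negative", "neutral", "positive"]
)
print(survey_balance, message_balance)
```

Neutral answers or messages only enter through the denominator, so a month with many neutral items pulls the balance towards zero.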
Final remarks: Big Data and statistics
- Preparing Big Data for statistics is time consuming
  - Exploration phase takes a lot of time
  - Try to reduce the amount of data without losing information ("making Big Data small", noise reduction)
  - Risk: garbage in, garbage statistics out
- Traditional approach does not suffice
  - Big Data sources are definitely not large sample surveys or admin data
  - Often a selective but large part of the population is included
  - Events are registered, not units!
  - Be careful with traditional statistical analysis (everything is significant!)
- More need for:
  - Visualisation methods (to rapidly gain insight)
  - Methods & models specific to large datasets (fast and robust)
  - Learn from computational statistics & (try to) use dedicated hardware
- Beware of privacy issues!
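The warning that "everything is significant" can be made concrete with a small calculation. The sketch below keeps a negligible difference between two group means fixed (0.01 standard deviations, an arbitrary value chosen for illustration) and only increases the number of records per group. It uses a plain two-sample z-test and treats the observed difference as exactly equal to that fixed value. It is not taken from the presentation.

```python
import math


def two_sided_p(effect: float, sigma: float, n_per_group: int) -> float:
    """Two-sided p-value of a two-sample z-test.

    Treats the observed difference in means as exactly `effect`, with a
    known standard deviation `sigma` in both groups.
    """
    standard_error = sigma * math.sqrt(2.0 / n_per_group)
    z = effect / standard_error
    # Two-sided tail probability of the standard normal distribution.
    return math.erfc(abs(z) / math.sqrt(2.0))


for n in (1_000, 100_000, 10_000_000):
    print(f"n per group = {n:>10,}: p = {two_sided_p(0.01, 1.0, n):.3g}")
```

With the effect held constant, the p-value drops from well above 0.05 to far below it purely because the number of records grows, so at Big Data scale effect sizes say more than significance tests.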
The future of Statistics Netherlands?