Big Data and Official Statistics: Experiences at Statistics Netherlands




Big Data and Official Statistics: Experiences at Statistics Netherlands. Peter Struijs. Poznań, Poland, 10 September 2015

Outline Big Data and official statistics Experiences at Statistics Netherlands with: - use of road sensor data - use of mobile phone location data - use of public social media messages Issues and solutions Strategic, policy and organisational challenges Cooperation and collaboration 2

What is Big Data? 3


Data sources and approaches - Surveys / questionnaires - Administrative data sources - Sampling theory Where does Big Data fit in? New methods may be needed, e.g. modeling for nowcasting and other methods not based on sampling theory 5

Potential Opportunities New statistics More detailed statistics More timely statistics Nowcasts and early indicators Quality improvement Response burden reduction Cost reduction and higher efficiency 6

Examples of possible Big Data sources Road sensor data Mobile phone location data Public social media messages Websites Google Trends Satellite information Etc 7

Data scientists involved in the research shown Piet Daas (photo) May Offermans Marco Puts Martijn Tennekes 8

Statistics based on road sensor data Aim: Statistics on traffic intensities Characteristics of the data source Research on the usability of the data Process of using the data for statistics Issues when using traffic loop data 9

Road sensor data Source: National Data Warehouse for Traffic Information (NDW) There are 20,000 traffic loops on Dutch motorways and 40,000 on provincial roads Each minute (24/7) the number of passing vehicles and their average speed are recorded Three different length classes are distinguished No identification of vehicles Around 230 million records a day used Locations 10

The main roads 11

A special dike 12

Road sensors in the dike 13

Minute data of one sensor for 196 days 14

Researching the data Cross-correlation between sensor pairs - Used to validate metadata Trajectory speed vs. point speed - Average speed is 98 km/h
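The cross-correlation check between sensor pairs can be illustrated with a minimal sketch. It assumes two equally long lists of per-minute vehicle counts from two sensors on the same road; the function names `pearson` and `best_lag` and the toy counts are hypothetical and are not taken from the Statistics Netherlands research. The lag at which the correlation peaks could then be compared with the sensor positions recorded in the metadata.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equally long sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)

def best_lag(upstream, downstream, max_lag):
    """Return (lag, r) maximising corr(upstream[t], downstream[t + lag])."""
    best = (0, -1.0)
    for lag in range(max_lag + 1):
        a = upstream[: len(upstream) - lag]
        b = downstream[lag:]
        r = pearson(a, b)
        if r > best[1]:
            best = (lag, r)
    return best

# Toy per-minute counts (assumed data): the downstream sensor sees the
# same traffic pattern two minutes later.
upstream = [5, 9, 14, 20, 12, 7, 4, 6, 11, 18, 15, 8]
downstream = [3, 4] + upstream[:-2]
lag, r = best_lag(upstream, downstream, max_lag=5)
```

A pair of sensors that the metadata places next to each other but that shows no clear correlation peak at any plausible lag would be a candidate for a metadata error.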

Small, medium-sized & large vehicles 22

Sensors in a road segment 17

Process of making traffic intensities statistics Select sensors on Dutch highways Preprocessing - Remove non-informative variables - Remove bad records - Exclude bad sensors - Quality indicators for daily data per sensor Processing - Reduce dimensions on same road and region - Obtain number of vehicles for each road and region - For each road and region, calculate monthly traffic intensity - Use of R-Hadoop Validation and publication 18
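The preprocessing and processing steps listed above can be sketched roughly as follows. This is an illustration under stated assumptions, not the production process (which the slide says uses R-Hadoop): the record layout, the rule that a missing or negative count is a "bad record", the `max_bad_share` cut-off for bad sensors, and the choice to average over the sensors in a road segment are all hypothetical.

```python
from collections import defaultdict

def monthly_intensity(records, max_bad_share=0.5):
    """Aggregate sensor counts into one figure per (road, region, month).

    records: iterable of dicts with keys sensor, road, region, month, count.
    Assumed rules: a record is bad when count is None or negative; a sensor
    is excluded when its share of bad records exceeds max_bad_share.
    """
    records = list(records)

    # Quality indicator per sensor: share of bad records.
    total = defaultdict(int)
    bad = defaultdict(int)
    for rec in records:
        total[rec["sensor"]] += 1
        if rec["count"] is None or rec["count"] < 0:
            bad[rec["sensor"]] += 1
    good_sensors = {s for s in total if bad[s] / total[s] <= max_bad_share}

    # Sum valid counts per sensor, skipping bad records and bad sensors.
    per_sensor = defaultdict(int)
    for rec in records:
        if rec["sensor"] not in good_sensors:
            continue
        if rec["count"] is None or rec["count"] < 0:
            continue
        key = (rec["road"], rec["region"], rec["month"], rec["sensor"])
        per_sensor[key] += rec["count"]

    # Average over sensors on the same road and region so that segments
    # with many sensors are not counted several times.
    grouped = defaultdict(list)
    for (road, region, month, _sensor), vehicles in per_sensor.items():
        grouped[(road, region, month)].append(vehicles)
    return {key: sum(v) / len(v) for key, v in grouped.items()}

# Toy records (assumed values); sensor s3 delivers only bad records.
toy = [
    {"sensor": "s1", "road": "A13", "region": "West", "month": "2014-03", "count": 30},
    {"sensor": "s1", "road": "A13", "region": "West", "month": "2014-03", "count": 50},
    {"sensor": "s2", "road": "A13", "region": "West", "month": "2014-03", "count": 40},
    {"sensor": "s2", "road": "A13", "region": "West", "month": "2014-03", "count": None},
    {"sensor": "s3", "road": "A13", "region": "West", "month": "2014-03", "count": None},
    {"sensor": "s3", "road": "A13", "region": "West", "month": "2014-03", "count": -1},
]
result = monthly_intensity(toy)
```

In practice the missing minutes of a retained sensor would probably need to be imputed rather than simply skipped; the sketch leaves that out to stay short.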

Data options Historical database - Request data via web interface - Minute data for all highways (48 variables, Jan 2010 to April 2014: around 2.5 TB) Data stream - Every minute, all data for all active sensors - Continuously collected 19

Road sensor data: Issues and non-issues Non-issues: Privacy Data acquisition Issues: Methodology - Selectivity - Quality Infrastructural needs Other issues - Skills needed - Transition from research to regular statistics 20

References: statistical use of road sensor data Publication of statistical results (in Dutch): http://www.cbs.nl/nl-nl/menu/themas/verkeervervoer/publicaties/artikelen/archief/2015/a13-drukste-rijksweg.htm Explanation in English: http://www.cbs.nl/nr/rdonlyres/25ce3592-a756-42b7-babf-C3E4C4E9375B/0/a13busiestnationalmotorwayinthenetherlands.pdf Research reference: Puts, M., Tennekes, M. and Daas, P. (2014) Using Road Sensor Data for Official Statistics: Towards a Big Data Methodology. Paper for Strata + Hadoop World, Barcelona, Spain. 21

Statistics based on mobile phone location data Why use mobile phone data for official statistics? Characteristics of the data source Research on the usability of the data Issues when using mobile phone location data Solutions to the issues 22

Possible uses of mobile phone data Daytime population statistics Mobility statistics Tourism statistics Other uses 23

Mobile phone activity as a data source Nearly every person in the Netherlands has a mobile phone - Usually on them - Almost always switched on - Many people are very active during the day There is a grid of antennas with good coverage Data of a single mobile company was used - Hourly aggregates per area - Threshold of 15 events 24
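The combination of hourly aggregates per area and a threshold of 15 events can be sketched as below. The sketch assumes that each event is reduced to an (area, hour) pair before it leaves the provider and that "threshold of 15" means cells with fewer than 15 events are suppressed; both the interpretation and the names `hourly_aggregates`, `area_A` and `area_B` are assumptions for illustration.

```python
from collections import Counter

def hourly_aggregates(events, threshold=15):
    """Count events per (area, hour) and drop cells with fewer than `threshold` events."""
    counts = Counter((area, hour) for area, hour in events)
    return {cell: n for cell, n in counts.items() if n >= threshold}

# Toy events (assumed data): 20 and 15 events survive, 14 are suppressed.
events = [("area_A", 9)] * 20 + [("area_B", 9)] * 14 + [("area_A", 10)] * 15
published = hourly_aggregates(events)
```

Because only the aggregated cells are handed over, no individual phone can be followed in the resulting data set, which is the point of the agreement with the data provider.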

Daytime population based on mobile phone data

Issues when using mobile phone data Privacy Data acquisition Methodology - Representativeness - Selectivity - Quality Other issues - Infrastructure - Skills needed 26

Solutions Agreement with data provider to provide only aggregates and apply a threshold Data provider performed analysis of mobile phone ownership characteristics A large number of analyses were made, with the regular population registration data as a reference A number of assumptions had to be made 27

References: statistical use of mobile phone data Research references: Daas, P.J.H., Puts, M.J., Buelens, B. and van den Hurk, P.A.M. (2015) Big Data as a Source for Official Statistics. Journal of Official Statistics 31(2), pp. 249-262. Daas, P. and Burger, J. (2014) Profiling big data sources to assess their selectivity. Paper for the 2015 New Techniques and Technologies for Statistics conference, Brussels, Belgium. 28

Mobile phone data versus road sensor data 29

Statistics based on social media data Why use social media data for official statistics? Characteristics of the data source Research on the usability of the data Issues when using social media data 30

Possible uses of social media data Sentiment indicators - e.g. consumer confidence index Social indicators - e.g. social coherence indices Other uses 31

Social media The Dutch are very active on social media! - Around 60% according to a survey - Almost always carried and almost always switched on More and more people have a smartphone! Possible source of information for: - Which topics are current: number of messages and sentiment about them - To be used as a measuring instrument for: Map by Eric Fischer (via Fast Company) 32

The data All social media messages: - that are written in Dutch - and are public These messages are systematically and instantly collected by the Dutch firm Coosto Dataset of more than 3.5 billion messages: - covering June 2010 until the present - 3 to 4 million new messages added per day 33

Research question Can we replicate the consumer confidence index by only using social media data, while reducing production time? 34

Sentiment determination Bag of words approach - list of Dutch words with their associated sentiment - added social media specific words (FAIL, LOL, OMG, etc.) Use overall score to determine sentiment - is either positive, negative or neutral Average sentiment per period (day / week / month) - (#positive - #negative) / #total * 100% 35
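A minimal sketch of the bag-of-words scoring and the average-sentiment formula above follows. The tiny English lexicon, the polarity given to each word (including LOL and FAIL) and the function names are hypothetical stand-ins; the actual word list is Dutch and much larger.

```python
import re

# Tiny illustrative lexicon (assumed); +1 = positive word, -1 = negative word.
LEXICON = {"good": 1, "great": 1, "happy": 1, "lol": 1,
           "bad": -1, "fail": -1, "sad": -1}

def classify(message):
    """Return 'positive', 'negative' or 'neutral' from the summed word scores."""
    words = re.findall(r"[a-z]+", message.lower())
    score = sum(LEXICON.get(w, 0) for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

def average_sentiment(messages):
    """(#positive - #negative) / #total * 100 for one period."""
    labels = [classify(m) for m in messages]
    if not labels:
        return 0.0
    pos = labels.count("positive")
    neg = labels.count("negative")
    return (pos - neg) / len(labels) * 100

# Toy messages for one day (assumed data).
day = ["Great day, LOL", "Epic FAIL again", "Train at nine", "So happy today"]
index = average_sentiment(day)
```

Here two messages are positive, one is negative and one is neutral, so the daily figure is (2 - 1) / 4 * 100 = 25. Neutral messages count in the denominator, which pulls the index towards zero on days with a lot of non-opinionated traffic.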

Sentiment per platform (~10%) (~80%)

Build a model Idea: Fitting characteristics derived from social media messages to consumer confidence Success: If correlation can be found that is high and remains high, that is, has predictive power 37
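One simple way to fit sentiment to consumer confidence is ordinary least squares on a single sentiment series, sketched below. The linear form, the function name `fit_line` and all numbers are assumptions made for illustration; the slides do not state which model was used.

```python
from math import sqrt

def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x; also returns Pearson r."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)
    return intercept, slope, r

# Made-up monthly values, for illustration only.
sentiment = [10.0, 12.0, 9.0, 15.0, 18.0, 14.0]
confidence = [-30.0, -25.0, -33.0, -20.0, -14.0, -22.0]
intercept, slope, r = fit_line(sentiment, confidence)

# Model-based figure for a new month with sentiment 16.0.
predicted = intercept + slope * 16.0
```

To test whether the correlation "remains high", the fit would be estimated on an early period and `r` recomputed on later months that were not used in the estimation.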

Figure 1. Development of daily, weekly and monthly aggregates of social media sentiment from June 2010 until November 2013, in green, red and black, respectively. The inset shows the development of consumer confidence over the same period. 38

Results High correlation achieved (0.9) Changes in consumer confidence precede changes in sentiment by one week Short processing time, so time-to-market may be reduced Sentiment index can be produced on a weekly basis To be considered: - Use model-based figures as early indicators - Reduce sampling of consumer confidence index 39

General sentiment indicator (draft version) 40

Issues when using social media data Lesser issues: Privacy Data acquisition Main issues: Methodology - Selectivity - Meaning of the data - Validity of methods used Other issues - Skills needed 41

Questions on the validity of methods used Is it acceptable, under certain conditions, to base official statistics on correlations? If so, what are the conditions? What to do if there is a shock? 42

Reference: statistical use of social media data Research reference: Daas, P.J.H. and Puts, M.J.H. (2014) Social Media Sentiment and Consumer Confidence. European Central Bank Statistics Paper Series No. 5, Frankfurt, Germany. 43

Big Data Characteristics Definition: Volume Velocity Variety Data characteristics: Unstructured data Selectivity Population dynamics Event data Organic data Distributed data Data use: Other ways of processing Fundamentally new applications 44

Overview of Issues Getting access to the data Usability of the data - Meaning of the data, stability of the source, reproducibility Methodological issues - Selectivity, representativeness, unknown population, quality and validity Privacy, confidentiality and reputation IT infrastructure and security Knowledge and skills Transition from research to production Strategic challenges 45

Possible responses to the issues Invest in good relations with the data provider Invest in methodological research and play with the data to get a grip on quality Use only aggregate data if possible Explore alternatives to population-based estimation methods Keep an open mindset Take the strategic challenges seriously 46

Strategic aspects Others start producing statistics - there may be quality issues - but they are extremely rapid - and there is obviously demand Need for good, impartial information (benchmark information) will remain - without a monopoly for NSIs There is a need for validation of information produced by others 47

Billion Prices Project MIT 48


The Roadmap Approach Awareness that Big Data is a strategic issue Position paper for Board of Directors Roadmap Big Data External validation of the Roadmap Roadmap updated twice a year for Board of Directors Roadmap monitor Deputy Director General responsible at strategic level Coordination group for Big Data 50

The Scope of the Roadmap Identification of outputs to be based on Big Data For each output, definition of time target and ownership Identification by owner of conditions to be fulfilled Commitment by supporting services for fulfilling the conditions (IT, data collection, methodological support, ...) Supporting programmes 51

The Roadmap Projects Focus projects Road sensor data for traffic intensities statistics Mobile phone data for daytime population statistics Other projects Internet data for price statistics Financial transactions data for statistics Social media data for detecting trends in social cohesion Internet data for encoding enterprise purchases and sales 52

Supporting Programmes Big Data features in: Innovation programme Methodological research programme 53

Cooperation and Collaboration on Big Data Statistics Netherlands works together with: Other NSIs UN, UNECE, EU, World Bank ESSnet on Big Data (to be confirmed) Government organisations Universities and research organisations Data providers IT providers Big Data firms Research consortia (e.g. H2020) 54

UNECE Big Data Activities Classification of Big Data sources Big Data project in 2014, with three Task Teams: - Partnerships - Privacy - Quality Sandbox in 2014, 2015 and possibly beyond Big Data survey, together with UNSD Results: http://www1.unece.org/stat/platform/display/bigdata/2014+project 55

UN Big Data Activities Global Working Group on Big Data for Official Statistics with eight Task Teams: - Mobile phone data - Satellite imagery - Social media data - Access / partnerships - Advocacy / communication - Big Data and SDGs - Training / skills / capacity building - Cross-cutting issues UNSD survey on Big Data for official statistics 56

Draft Big Data Access Principles (UN) Social responsibility Level playing field Equal treatment Confidentiality and security Transparency Respect for business interest Proportionality 57

Conclusion: The Way Forward Get to know Big Data Use Big Data for efficiency and response burden reduction Use Big Data for early indicators Use Big Data for filling gaps and new demands Use new professional methods where needed Create the right environment Don't do it alone! 58

General references Glasson, M., Trepanier, J., Patruno, V., Daas, P., Skaliotis, M. and Khan, A. (2013) What does "Big Data" mean for Official Statistics? Paper for the High-Level Group for the Modernization of Statistical Production and Services, March 10. Struijs, P., Braaksma, B. and Daas, P. (2014) Official Statistics and Big Data. Big Data & Society, April-June, pp. 1-6. Struijs, P. and Daas, P.J.H. (2013) Big Data, Big Impact? Paper for the Seminar on Statistical Data Collection, Geneva, Switzerland. Struijs, P. and Daas, P. (2014) Quality Approaches to Big Data in Official Statistics. Paper for the European Conference on Quality in Official Statistics 2014, Vienna, Austria. 59

The Future 60

Questions? Thank you for your attention! p.struijs@cbs.nl 61