Big data, the future of statistics



Similar documents
Big CBS. Experiences at Statistics Netherlands. Dr. Piet J.H. Daas Methodologist, Big Data research coördinator. Statistics Netherlands

Big Data andofficial Statistics Experiences at Statistics Netherlands

Big Data. Case studies in Official Statistics. Martijn Tennekes. Special thanks to Piet Daas, Marco Puts, May Offermans, Alex Priem, Edwin de Jonge

Visualization and Big Data in Official Statistics

Big Data as a Data Source for Official Statistics: experiences at Statistics Netherlands

Data Visualization in Official Statistics

Social Network Analysis

Big Data and Official Statistics

Big Data (and official statistics) *

Unlocking the Full Potential of Big Data

Peter Rakers De Verstoring van Big Data

STAR WARS AND THE ART OF DATA SCIENCE

The state of DIY. Mix Express DIY event Maarssen 14 mei 2014

Big Data, Official Statistics and Social Science Research: Emerging Data Challenges

Doing Multidisciplinary Research in Data Science

THE EMOTIONAL VALUE OF PAID FOR MAGAZINES. Intomart GfK 2013 Emotionele Waarde Betaald vs. Gratis Tijdschrift April

How to use Big Data in Industry 4.0 implementations. LAURI ILISON, PhD Head of Big Data and Machine Learning

Big Data and Open Data

MLg. Big Data and Its Implication to Research Methodologies and Funding. Cornelia Caragea TARDIS November 7, Machine Learning Group

STATISTICS PAPER SERIES

The future of Big Data A United Hitachi View

WHAT DOES BIG DATA MEAN FOR OFFICIAL STATISTICS?

Top Line Findings. George Washington University and Cision 2009 Social Media & Online Usage Study

Real Time Analytics for Big Data. NtiSh Nati

International Journal of Advancements in Research & Technology, Volume 3, Issue 5, May ISSN BIG DATA: A New Technology

Innovation of tourism statistics through the use of new big data sources.

Advanced Metering Infrastructure

Sentiment Analysis on Big Data

THE STATE OF Social Media Analytics. How Leading Marketers Are Using Social Media Analytics

Dutch Mortgage Market Pricing On the NMa report. Marco Haan University of Groningen November 18, 2011

Big Data and Analytics: Challenges and Opportunities

Big Data to trade bonds/fx & Python demo on FX intraday vol

CORRALLING THE WILD, WILD WEST OF SOCIAL MEDIA INTELLIGENCE

Sentiment Analysis. D. Skrepetos 1. University of Waterloo. NLP Presenation, 06/17/2015

AUTOMOTIVE REPORT 2015 ADOBE DIGITAL INDEX

Statistics for BIG data

Professional Diploma in Digital Marketing

Big Data for Social Good. Nuria Oliver, PhD Scientific Director User, Data and Media Intelligence Telefonica Research

IP-NBM. Copyright Capgemini All Rights Reserved

Scaling Big Data Mining Infrastructure: The Smart Protection Network Experience

Safe Harbor Statement

Big Data. What is Big Data? Over the past years. Big Data. Big Data: Introduction and Applications

Buurten van gemeente Groningen

Utilizing big data to bring about innovative offerings and new revenue streams DATA-DERIVED GROWTH

CAP4773/CIS6930 Projects in Data Science, Fall 2014 [Review] Overview of Data Science

IBM Predictive Analytics Solutions

COMP9321 Web Application Engineering

Data Mining. Toon Calders TU Eindhoven

What is Data Science? Data, Databases, and the Extraction of Knowledge Renée November 2014

Multichannel Customer Listening and Social Media Analytics

SOCIAL MEDIA AND HEALTHCARE HYPE OR FUTURE?

Capgemini Big Data Analytics Sandbox for Financial Services

Text and data analytics for social network mining

Predictive Analytics: Extracts from Red Olive foundational course

See how social media listening and engagement can help your business

Making, Moving and Shaking a Community of Young Global Citizens Resultaten Nulmeting GET IT DONE

Matt Erickson Economist American Farm Bureau Federation March 5, 2014

Management Summary 3. Role within the organization 4. For what purposes do you use social media? 5. What social channels does the company use?

Of all the data in recorded human history, 90 percent has been created in the last two years. - Mark van Rijmenam, Think Bigger, 2014

Social Intelligence Report Adobe Digital Index Q2 2015

Big Data Analytics: 14 November 2013

Miguel Ortiz, Sr. Systems Engineer. Globanet

Digital Marketing Strategy

Your guide to using new media

Outline. What is Big data and where they come from? How we deal with Big data?

Maximize Social Media Effectiveness with Data Science. An Insurance Industry White Paper from Saama Technologies, Inc.

Starten met ARLearn Het leren van toekomst

What is Data Science? Girl Develop It! Meetup Renée M. P. Teate, March 2015

Using Big Data & Analytics to Increase PR & Marketing Brand Awareness

The Big Deal about Big Data. Mike Skinner, CPA CISA CITP HORNE LLP

HMG Corporate Development Team.

Web review of yoast.com

Anomaly Detection and Predictive Maintenance

Healthcare data analytics. Da-Wei Wang Institute of Information Science

UNIVERSITY OF INFINITE AMBITIONS. MASTER OF SCIENCE COMPUTER SCIENCE DATA SCIENCE AND SMART SERVICES

De tarieven van Proximus Niet meer gecommercialiseerde Bizz packs

SMARTPHONES & BIG DATA. Daniel Nelson Head of Enterprise Development, daniel.nelson@braintreepayments.

Selectivity of Big data

Big Data in Retail Big Data Analytics Central to Customer Acquisition and Retention Strategies in Retail

Transcription:

Big data, the future of statistics Experiences from Statistics Netherlands Dr. Piet J.H. Daas Senior-Methodologist, Big Data research coordinator and Marco Puts, Martijn Tennekes, Alex Priem, Edwin de Jonge. Statistics Netherlands 9 Sept, Stockholm

Overview Big Data One of 8 research themes at our office Which skills do you need? Examples of our work Lessons learned (so far) 2

Data, data everywhere! X

Two kinds of data Statistics Netherlands Primary data Secondary data 4 Our own questionnaires Data of others - Administrative sources - Big Data

Big Data research at Stat. Netherlands Explorative, data driven Case studies: Road sensors, Mobile phone data, Social media There is now Big Data methodology yet (we are working on it) Combination of IT, methodology and Content (Data Science) Important topics for official statistics Accessing Big Data in a structural way Selectivity ( representativity ) Checking and editing large amounts of data Reducing data size (without information loss) 5

Big Data skills We need new skills At the border of IT and methodology Data Scientists (a group) High Performance Analytics Available in a number of research areas outside traditional statistics research Computer sciences, artificial intelligence Machine learning (Statistical learning) Ability to extract information from new sources Such as: texts and pictures/video s 6

Case studie resultaten

Example 1: Roads sensors Road sensor (traffic loop) data Each minute (24/7) the number of passing vehicles is counted in around 20.000 loops in the Netherlands Total and in different length classes Nice data source for transport and traffic statistics (and more) A lot of data, around 230 million records a day 8 Locations

High ways in the Netherlands 9

Coverage by road sensors 10

Road sensor data All vehicles All vehicles Large vehicles All vehicles 11

Minute data of 1 sensor for 196 days 12

Afsluitdijk (IJsselmeer dam) 13

Afsluitdijk (IJsselmeer dam) (2) Cross correlation between sensor pairs -Used to validate metadata Trajectory speed vs. point speed - Average speed is 98 Km/h

Process overview Big Data processing a 1 Big Data processing Select + Transform 2 Frame 4 Editing 3 b 5 c Estimation 6

Data in process Select + Transform Editing Estimation Raw data 80 TB 105 billion records 105 billion data points 2010-2014 Transformed data 70 GB 13 million records 15 million data points Micro data 500 MB 13 million records 13 million data points Traffic index figures 6 KB 16

Number of vehicles Estimations for the whole country Time (years) 17 This is our first Big Data based official publication!! Findings: tiny.cc/6tn02x Description: tiny.cc/skn02x

Example 2: Mobile phones 18 Mobile phone activity as a data source Nearly every person in the Netherlands has a mobile phone Usually on them and almost always switched on! Many people are very active during the day Can data of mobile phones be used for statistics? Travel behaviour (of active phones) Day time population (of active phones) Tourism (new phones that register to network) Data of a single mobile company was used Hourly aggregates per area (only when > 15 events) Especially important for roaming data (foreign visitors)

Travel behaviour Mobility of active users - Anonymized data - During a fourteen day period Based on: - Mobile phone use - Location of phone mast What is observed: - Large Dutch cities - Hardly any activity in North-East and South-West of the country 19

Day time population Hourly changes of mobile phone activity 7 & 8 May 2013 Per area distinguished Only data for areas with > 15 events per hour 20

Daytime population results Almere: commuter town? Dutch population totals Foreigners at Schiphol Airport 21

Tourism: Roaming during European league final Hardly any Low Medium High Very high 22

Example 3: Social media Dutch are very active on social media! Around 70% according to a surveyna altijd bij zich en staat vrijwel altijd aan Steeds meer mensen hebben een smartphone! Mogelijke informatiebron voor: Welke onderwerpen zijn actueel: Aantal berichten en sentiment hierover Als meetinstrument te gebruiken voor:. 23 Map by Eric Fischer (via Fast Company)

Dutch Twitter topics (3%) (7%) (3%) (3%) (10%) (7%) (5%) (46%) 24 12 million messages 38

Sentiment in Dutch social media About the data Dutch firm that continuously collects ALL public social media messages written in Dutch Dataset of more than 3.5 billion messages! Covering June 2010 till the present Between 3-4 million new messages are added per day About sentiment determination Bag of words approach List of Dutch words with their associated sentiment Added social media specific words ( FAIL, LOL, OMG etc.) Use overall score to determine sentiment Is either positive, negative or neutral Average sentiment per period (day / week / month) 25 (#positive - #negative)/#total * 100%

Daily, weekly, monthly sentiment 26

Sentiment per platform (~10%) (~80%)

Platform specific results Table 1. Social media messages properties for various platforms and their correlation with consumer confidence Correlation coefficient of Social media platform Number of social Number of messages as monthly sentiment index and media messages 1 percentage of total (%) consumer confidence ( r ) 2 All platforms combined 3,153,002,327 100 0.75 0.78 Facebook 334,854,088 10.6 0.81* 0.85* Twitter 2,526,481,479 80.1 0.68 0.70 Hyves 45,182,025 1.4 0.50 0.58 News sites 56,027,686 1.8 0.37 0.26 Blogs 48,600,987 1.5 0.25 0.22 Google+ 644,039 0.02-0.04-0.09 Linkedin 565,811 0.02-0.23-0.25 Youtube 5,661,274 0.2-0.37-0.41 Forums 134,98,938 4.3-0.45-0.49 1 period covered June 2010 untill November 2013 2 confirmed by visual inspecting scatterplots and additional checks (see text) *cointegrated 28

Schematic overview Previous Vorige maand month Current Maand month Day Dag 1-7 Dag Day 8-14 Dag Day 15-21 Dag Day 22-28 Consumer Confidence 29 Publication date (~20 th ) Sentiment Social media sentiment

Results of comparing various periods Consumer Confidence Facebook Facebook Facebook + Twitter * Twitter 0.81* 0.84* 0.86* 0.85* 0.87* 0.89* 0.82 0.85 0.87 0.82* 0.85* 0.89* 0.79* 0.82* 0.84* 0.79 0.83 0.84 0.82* 0.86* 0.89* 0.79* 0.83* 0.87* 0.75* 0.80* 0.81* 30 *cointegration LOOCV results

Overall findings Correlation and cointegration 1 st week of Consumer confidence usually has 70% response Best correlation and cointegration with 2 nd week of the month Highest correlation 0.93* (all Facebook * specific word filtered Twitter) Granger causality Changes in Consumer confidence precede changes in Social media sentiment For all combinations shown! Only tried linear models (so far) Prediction Slightly better than random chance Best for the 4 th week of month 31

Sentiment indicator voor NL (beta-versie) Gebaseerd op het gemiddelde sentiment van publieke NL-talige Facebook en Twitter berichten 32

Lessons learned The most important ones: 1) There are many types of Big Data 2) Know how to access and analyse large amounts of data 3) Find ways to deal with noisy and unstructured data 4) Mind-set for Big Data Mind-set for survey data 5) Need to beyond correlation 6) Need people with the right skills, knowledge and mind-set 7) Solve privacy and security issues 8) Data management & costs 33 We are getting more and more grip on these topics & we published our first Big Data based official statistics!!

The Future The future of statistics looks BIG 34

@pietdaas Thank you for your attention!