Big Data @ CBS. Experiences at Statistics Netherlands. Dr. Piet J.H. Daas Methodologist, Big Data research coördinator. Statistics Netherlands



Similar documents
Big data, the future of statistics

Big Data andofficial Statistics Experiences at Statistics Netherlands

Big Data. Case studies in Official Statistics. Martijn Tennekes. Special thanks to Piet Daas, Marco Puts, May Offermans, Alex Priem, Edwin de Jonge

Visualization and Big Data in Official Statistics

Big Data as a Data Source for Official Statistics: experiences at Statistics Netherlands

Social Network Analysis

Big Data (and official statistics) *

Search Engines are #1 Way to be Found

International Journal of Advancements in Research & Technology, Volume 3, Issue 5, May ISSN BIG DATA: A New Technology

Social Media and Digital Marketing Analytics ( INFO-UB ) Professor Anindya Ghose Monday Friday 6-9:10 pm from 7/15/13 to 7/30/13

see, say, feel, do Social Media Metrics that Matter

STATISTICS PAPER SERIES

Multichannel Customer Listening and Social Media Analytics

Unlocking the Full Potential of Big Data

ACT Enrollment Planners Conference

Real Time Analytics for Big Data. NtiSh Nati

BIG DATA TECHNOLOGY. Hadoop Ecosystem

Extracting Information from Social Networks

THE POWER OF INFLUENCE TAKING THE LUCK OUT OF WORD OF MOUTH

1 The total values reported in the tables and

Location-Based Social Media Intelligence

BY Maeve Duggan NUMBERS, FACTS AND TRENDS SHAPING THE WORLD FOR RELEASE AUGUST 19, 2015 FOR FURTHER INFORMATION ON THIS REPORT:

Earn Money Sharing YouTube Videos

C O M M U N I C A T I O N S I N C. Marketing. Presented by Claudia Guerrero

SOUTH EAST ASIA AND OCEANIA ERICSSON MOBILITY REPORT JUNE

1 Which of the following questions can be answered using the goal flow report?

Freelance Dietitians Factsheets. Building a Professional Website. then. What makes you different?

Data Mining. Toon Calders TU Eindhoven

Big Data and Open Data

Linear Models in STATA and ANOVA

CivicScience Insight Report

WHAT DOES BIG DATA MEAN FOR OFFICIAL STATISTICS?

Sentiment Analysis. D. Skrepetos 1. University of Waterloo. NLP Presenation, 06/17/2015

Digital Marketing Boot Camp 4 Days Residential Program Dubai Singapore Thailand

IMPACT OF TRUST, PRIVACY AND SECURITY IN FACEBOOK INFORMATION SHARING

Capgemini Big Data Analytics Sandbox for Financial Services

Big Data and Official Statistics

Professional Diploma in Digital Marketing

16 /20 AL DIGIT P R I N T D I G I T A L S O C I A L E V E N T S GREENSPUN MEDIA 2016

Lead Generation Lessons From 4,000 Businesses. study based on real data from 4,000 businesses worldwide

Section 3 Part 1. Relationships between two numerical variables

Big Data, Official Statistics and Social Science Research: Emerging Data Challenges

Buurten van gemeente Groningen

The state of DIY. Mix Express DIY event Maarssen 14 mei 2014

How we design & build websites. Menu 2 Menu 3 Menu 4 Menu 5. Home Menu 1. Item 1 Item 2 Item 3 Item 4. bob. design & marketing

Anomaly Detection and Predictive Maintenance

30+ football sites 10m+ unique visitors The first football-only ad network

State of the Web Address: Navigating the Ever-Changing Web

Your guide to using new media

Advanced Metering Infrastructure

Predictive Analytics: Extracts from Red Olive foundational course

Build Your Online Social Network & Your Business. 87% of homebuyers used the Internet to research their options

STAR WARS AND THE ART OF DATA SCIENCE

Affiliate Program Features & Benefits

Introduction to e-marketing

AUTOMOTIVE REPORT 2015 ADOBE DIGITAL INDEX

RECOMMENDATIONS HOW TO ATTRACT CLIENTS TO ROBOFOREX

A Guide to Social Media Marketing for Contractors

Employer Brand Analytics

Social Media Monitoring, Planning and Delivery

Big Data for Social Good. Nuria Oliver, PhD Scientific Director User, Data and Media Intelligence Telefonica Research

The Balanced Marketing Ecosystem: Using Traditional & Digital Marketing to Maximize Your Acquisition of New Customers

Chapter 10. Key Ideas Correlation, Correlation Coefficient (r),

Statistiek II. John Nerbonne. October 1, Dept of Information Science

Event Marketing: Best-in-Class Companies Integrate Events with Multi-Channel Marketing Strategy

BUSINESS ADMINISTRATION

5/30/2012 PERFORMANCE MANAGEMENT GOING AGILE. Nicolle Strauss Director, People Services

2014 Advertising Kit

Correlation Coefficient The correlation coefficient is a summary statistic that describes the linear relationship between two numerical variables 2

Why Your Business Needs a Website: Ten Reasons. Contact Us: Info@intensiveonlinemarketers.com

THE SME S GUIDE TO COST-EFFECTIVE WEBSITE MARKETING

Evaluate Digital Digital Marketing Strategy

Checklist: Are you ready for ecommerce?

Top 101 Tasks You can Outsource to your Virtual Assistant to Bring Your Business Up!

Creating Careers. Together.

Google Analytics 101

Elements of a Successful Marketing Plan Part 1. Inbound Marketing Strategies

Level 3 Certificate ITQ Social Media

Game Changing Trends and Technologies for Video Surveillance

Social Media Optimization: Is SMO the New SEO?

SOCIAL MEDIA AND THE CUSTOMER EXPERIENCE. View The Webinar. Presented by: Jeff Hodson Aspect. Hosted by: Sally Hurley VIPdesk

Teacher s notes and key

Transcription:

Big Data @ CBS Experiences at Statistics Netherlands Dr. Piet J.H. Daas Methodologist, Big Data research coördinator Statistics Netherlands April 20, Enschede

Overview Big Data Research theme at Statistics Netherlands Examples of our work Lessons learned Important topics in 2015 2

Data, data everywhere! X

Two types of data CBS Primairy data Secondary data 4 Our own surveys Data from others - Administrative sources - Big Data

Big Data research at Stat. Netherlands Exploratory, data driven Cases: Road sensor data, Mobile phone data, Social media There is no standard Big Data methodology (yet) Combination of IT, methodology and content (Data Science) Important issues for official statistics Getting access Selectivity (representativity) Data editing at large scale Data reduction (dimension reduction) 5

Results of case studies

Example 1: Roads sensors Road sensor (traffic loop) data Each minute (24/7) the number of passing vehicles is counted in around 20.000 highway loops in the Netherlands Total and in different length classes 7 Nice data source for transport and traffic statistics (and more) A lot of data, around 230 million records a day Our current dataset has 56 billion records! Locations

Dutch highways 8

Dutch highways with road sensors 9

Road sensor data All vehicles All vehicles Large vehicles (> 12m) All vehicles 10

Minute data of 1 sensor for 196 days 11

Afsluitdijk (IJsselmeer dam) 12

Afsluitdijk (IJsselmeer dam) (2) Cross correlation between sensor pairs -Used to validate metadata Trajectory speed vs. point speed - Average speed is 98 Km/h

Overall process Transform & Select Frame Clean Aggregate 14

Reducing Big Data 15

Traffic index (prototype) NUTS 3 area based traffic indexes for the main roads per month / week

Example 2: Mobile phones 17 Mobile phone activity as a data source Nearly every person in the Netherlands has a mobile phone Usually on them and almost always switched on! Many people are very active during the day Can data of mobile phones be used for statistics? Travel behaviour (of active phones) Day time population (of active phones) Tourism (new phones that register to network) Data of a single mobile company was used Hourly aggregates per area (only when > 15 events) Especially important for roaming data (foreign visitors)

Travel behaviour (of mobile phones) Mobility of very active active mobile phone users during a 14 day period data of a single mob. comp. Based on: Mobile phone activity Location based on phone masts 18 Observed: Includes major cities But the North and South east of the country are much less represented

Day time population Hourly changes of mobile phone activity 7 & 8 May 2013 Per area distinguished Only data for areas with > 15 events per hour 19

Tourism: Roaming during European league final Hardly any Low Medium High Very high 20

Example 3: Social media Dutch are very active on social media! Around 60% according to a surveyna altijd bij zich en staat vrijwel altijd aan Steeds meer mensen hebben een smartphone! Mogelijke informatiebron voor: Welke onderwerpen zijn actueel: Aantal berichten en sentiment hierover Als meetinstrument te gebruiken voor:. 21 Map by Eric Fischer (via Fast Company)

Dutch Twitter topics (3%) (7%) (3%) (10%) (7%) (3%) (5%) (46%) 12 million messages 38

Sentiment in Dutch social media About the data Dutch firm that continuously collects ALL public social media messages written in Dutch Dataset of more than 3.5 billion messages! Covering June 2010 till the present Between 3 4 million new messages are added per day About sentiment determination Bag of words approach List of Dutch words with their associated sentiment Added social media specific words ( FAIL, LOL, OMG etc.) Use overall score to determine sentiment Is either positive, negative or neutral Average sentiment per period (day / week / month) 23 (#positive #negative)/#total * 100%

Daily, weekly, monthly sentiment 24

Sentiment per platform (~10%) (~80%)

Platform specific results Table 1. Social media messages properties for various platforms and their correlation with consumer confidence Correlation coefficient of Social media platform Number of social Number of messages as monthly sentiment index and media messages 1 percentage of total (%) consumer confidence ( r ) 2 All platforms combined 3,153,002,327 100 0.75 0.78 Facebook 334,854,088 10.6 0.81* 0.85* Twitter 2,526,481,479 80.1 0.68 0.70 Hyves 45,182,025 1.4 0.50 0.58 News sites 56,027,686 1.8 0.37 0.26 Blogs 48,600,987 1.5 0.25 0.22 Google+ 644,039 0.02-0.04-0.09 Linkedin 565,811 0.02-0.23-0.25 Youtube 5,661,274 0.2-0.37-0.41 Forums 134,98,938 4.3-0.45-0.49 1 period covered June 2010 untill November 2013 2 confirmed by visual inspecting scatterplots and additional checks (see text) *cointegrated 26

Schematic overview Previous Vorige maand month Current Maand month Day Dag1-7 Dag Day 8-14 Dag Day 15-21 Dag Day 22-28 Consumer Confidence 27 Publication date (~20 th ) Sentiment Social media sentiment

Results of comparing various periods Consumer Confidence Facebook Facebook Facebook + Twitter * Twitter 0.81* 0.84* 0.86* 0.85* 0.87* 0.89* 0.82 0.85 0.87 0.82* 0.85* 0.89* 0.79* 0.82* 0.84* 0.79 0.83 0.84 0.82* 0.86* 0.89* 0.79* 0.83* 0.87* 0.75* 0.80* 0.81* 28 *cointegration LOOCV results

Overall findings Correlation and cointegration 1 st week of Consumer confidence usually has 70% response Best correlation and cointegration with 2 nd week of the month Highest correlation 0.93* (all Facebook * specific word filtered Twitter) Granger causality Changes in Consumer confidence precede changes in Social media sentiment For all combinations shown! Only tried linear models so far Prediction Slightly better than random chance Best for the 4 th week of month 29

Sentiments indicator for NL (beta-version) 30 Displays average sentiment of Dutch public Facebook and Twitter messages

Lessons learned (so far) The most important ones: 1) There are many types of Big Data 2) Know how to access and analyse large amounts of data 3) Find ways to deal with noisy and unstructured data 4) Mind set for Big Data Mind set for survey data 5) Need to beyond correlation 6) Need people with the right skills, knowledge and mind set 7) Solve privacy and security issues 8) Data management & costs 31 We are getting more and more grip on these topics

Important research topics for 2015 1) Profiling Extracting features from units in Big Data sources Use this to study i) population changes and ii) input for ways to correct for selectivity For persons: gender, age, income, education, origin, urbanicity, and household composition For companies: number of employees (size class), turnover, type of economic activity, legal form 2) Data editing at large scale Combination of methods/algorithms and IT 3) Data reduction (dimension reduction) Reduce size of Big Data while retaining similar information content (not sampling!)

The Future The future of statistics looks BIG 33

@pietdaas Thank you for your attention!