Unlocking the Full Potential of Big Data

Unlocking the Full Potential of Big Data. Lilli Japec, Frauke Kreuter. JOS anniversary, June 2015. facebook.com/statisticssweden, @SCB_nyheter

The report is available at https://www.aapor.org

Task Force Members:
Lilli Japec, Co-Chair, Statistics Sweden
Frauke Kreuter, Co-Chair, JPSM at the U. of Maryland, U. of Mannheim & IAB
Marcus Berg, Stockholm University
Paul Biemer, RTI International
Paul Decker, Mathematica Policy Research
Cliff Lampe, School of Information at the University of Michigan
Julia Lane, American Institutes for Research
Cathy O'Neil, Johnson Research Labs
Abe Usher, HumanGeo Group

AAPOR (American Association for Public Opinion Research)
A professional organization dedicated to advancing the study of public opinion, broadly defined to include attitudes, norms, values, and behaviors.
Promotes best practices and transparency.
Works to educate its members, as well as policy makers, the media, and the public at large, to help them make better use of surveys and survey findings, and to inform them about new developments in the field.
Other task force reports are available at https://www.aapor.org

Outline of our presentations:
What is Big Data?
Paradigm shift
Big Data activities in different organizations
Skills required
Big Data process and data quality

UNTIL RECENTLY: three main data sources

Survey Data, Administrative Data, Experiments

NOW

US Aggregated Inflation Series, Monthly Rate, PriceStats Index vs. Official CPI. Accessed January 18, 2015 from the PriceStats website.

Number of vehicles detected in the Netherlands on December 1, 2011 created by Statistics Netherlands (Daas et al. 2013). The vehicle size is shown in different colors; black is small size, red is medium size and green is large size.

Social media sentiment (daily, weekly and monthly) in the Netherlands, June 2010 - November 2013. The development of consumer confidence for the same period is shown in the insert (Daas and Puts 2014).

Big Data http://www.rosebt.com/blog/data-veracity

Hope that found/organic data:
Can replace or augment expensive data collections
More (= better) data for decision making
Information available in (nearly) real time

New paradigm
New business model: federal agencies no longer major players
New analytical model: outliers; fine-grained analysis; new units of analysis
New sets of skills: computer scientists; citizen scientists
Different cost structure
Source: Julia Lane

Eurostat Big Data Action Plan and Roadmap
Pilots exploring the potential of selected big data sources.
The project will also include activities on: methodological frameworks, quality frameworks, metadata frameworks, IT infrastructures, communication, legal frameworks, ethical frameworks, skills and training, and experience sharing.

UNECE and Big Data
The Sandbox provides a computing environment to load Big Data sets and tools.
Consumer price indices: experimenting with the computation of price indexes.
Mobile telephone data: statistics on tourism and daily commuting.
Smart meters: statistics on power consumption using data collected from smart meter readings.
Traffic loops: traffic statistics using data from traffic loops.
Social media: using Twitter data to analyze sentiment and tourism flows.
Job portals: computing statistics on job vacancies.
Web scraping: tested methods for automatically collecting data from web sources.

UNECE Big Data Inventory

Statistics Netherlands: Roadmap BIG DATA
Two focus projects:
the use of traffic loop data for transportation statistics;
the use of mobile phone data for daytime population and tourism statistics.
Six other projects:
the use of internet data for price statistics;
investigating the use of bank and credit card transactions;
the use of social media data for detecting trends in social cohesion;
the use of internet data for encoding enterprise purchases and sales;
investigating the use of smartcards of public transport for statistics;
the use of internet data for statistics about job vacancies.
Source: Pieter Vlag, Statistics Netherlands

Examples from Statistics Sweden
Scanner data to improve the Household Budget Survey.
Job vacancy statistics by scraping the web.
Evaluating the use of AIS (Automatic Identification System) data: cooperation between Statistics Sweden and the agency for Transport Analysis (Trafa), with research funding from the Swedish Innovation Agency (Vinnova).

One day of data. Source: Moström and Justesen, Statistics Sweden

What tasks are required to get there? SKILLS

We have to do this jointly
Data Output/Access. Example: map visualization / privacy
Data Analysis. Example: Hadoop MapReduce; High Frequency Data
Data Curation/Storage. Example: Hadoop Distributed File System
Data Generating Process. Examples: geolocated social media + survey + administrative data
Research Questions. Examples: Behavior of interest (migration/political participation/job searches)

Source: Abe Usher

Big words
What is big data?
What is the Hadoop Distributed File System (HDFS)?
What is Hadoop MapReduce (MR)?
How do you link surveys with big data?
Source: Abe Usher
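The MapReduce question above can be made concrete with a toy example. What follows is a minimal sketch in plain Python, not Hadoop itself: it only imitates the three steps of the programming model (map, shuffle, reduce) on a word count. The function names and the sample records are illustrative assumptions, not part of the presentation.

```python
from collections import defaultdict


def map_phase(record):
    """Map step: emit a (word, 1) pair for every word in one input record."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: collapse the values of one key into a single count."""
    return key, sum(values)


def map_reduce(records):
    """Run map, shuffle and reduce in sequence on an in-memory list."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


# Hypothetical input records, for illustration only.
records = ["big data big hopes", "data quality matters"]
counts = map_reduce(records)
print(counts)
```

In a real Hadoop job the map and reduce steps run in parallel on many machines and the framework performs the shuffle; here everything runs in one process so the data flow is easy to follow.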

Computer scientist: data preparation; MapReduce algorithms; Python/R programming; Hadoop ecosystem
System administrator: storage systems (MySQL, HBase, Spark); cloud computing (Amazon Web Services (AWS), Google Compute Engine); Hadoop ecosystem
Source: Abe Usher

What do we know about the data generating process? RESEARCH

Veracity: Who? What? Why? Who is missing? Who is counted repeatedly? What is not said / measured? ... and why?

But (at least) one more V http://www.rosebt.com/blog/data-veracity

Terrorist Detector Errors in Big Data: An Illustration
Suppose 1 in 1,000,000 people are terrorists.
The Big Data Terrorist Detector is 99.9% accurate.
The detector says your friend, Jack, is a terrorist.
What are the odds that Jack is really a terrorist?
Source: Paul Biemer

Terrorist Detector Errors in Big Data: An Illustration (continued)
Answer: about 1 in 1,000, i.e., 99.9% of the terrorist detections will be false!
Source: Paul Biemer
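The answer in this illustration follows from Bayes' rule, and the arithmetic can be checked in a few lines. This is a minimal sketch under one assumption the slide does not state: that the quoted 99.9% accuracy applies to both kinds of error, so the detector flags 99.9% of terrorists and clears 99.9% of non-terrorists. The function and variable names are illustrative.

```python
from fractions import Fraction


def posterior_given_flag(prevalence, sensitivity, specificity):
    """P(person is a terrorist | detector flags them), by Bayes' rule."""
    true_positives = prevalence * sensitivity
    false_positives = (1 - prevalence) * (1 - specificity)
    return true_positives / (true_positives + false_positives)


prevalence = Fraction(1, 1_000_000)  # 1 in 1,000,000 people are terrorists
accuracy = Fraction(999, 1000)       # assumption: same rate for both error types

p = posterior_given_flag(prevalence, accuracy, accuracy)
print(f"P(terrorist | flagged) = {float(p):.6f}")
print(f"that is 1 in {round(1 / p):,}")
print(f"share of detections that are false: {float(1 - p):.2%}")
```

Under this assumption the exact figure is 1 in 1,002, which agrees with the rounded answer on the slide. Because real terrorists are so rare, nearly all flagged people come from the small share of innocent people the detector gets wrong.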

Big Data Process Map
Generate: Source 1, Source 2, ..., Source K
ETL: Extract; Transform (Cleanse); Load (Store)
Analyze: Filter/Reduction (Sampling); Computation/Analysis (Visualization)
Source: Paul Biemer

Big Data Process Map
Generation: Source 1, Source 2, ..., Source K
Errors include: low signal/noise ratio; lost signals; failure to capture; non-random (or nonrepresentative) sources; metadata that are lacking, absent, or erroneous.
ETL: Extract; Transform (Cleanse); Load (Store)
Analyze: Filter/Reduction (Sampling); Computation/Analysis (Visualization)
Source: Paul Biemer

Big Data Process Map
Generation: Source 1, Source 2, ..., Source K
ETL: Extract; Transform (Cleanse); Load (Store)
Errors include: specification error (including errors in metadata), matching error, coding error, editing error, data munging errors, and data integration errors.
Analyze: Filter/Reduction (Sampling); Computation/Analysis (Visualization)
Source: Paul Biemer

Big Data Process Map
Generation: Source 1, Source 2, ..., Source K
ETL: Extract; Transform (Cleanse); Load (Store)
Analyze: Filter/Reduction (Sampling); Computation/Analysis (Visualization)
Filter/Reduction: Data are filtered, sampled or otherwise reduced. This may involve further transformations of the data.
Errors include: sampling errors, selectivity errors (or lack of representativity), modeling errors.
Source: Paul Biemer

Big Data Process Map
Generation: Source 1, Source 2, ..., Source K
ETL: Extract; Transform (Cleanse); Load (Store)
Analyze: Filter/Reduction (Sampling); Computation/Analysis (Visualization)
Computation/Analysis errors include: modeling errors, inadequate or erroneous adjustments for representativity, computation and algorithmic errors.
Source: Paul Biemer
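The stages of the process map can be sketched as a small pipeline to show where records get changed or lost along the way. This is a toy illustration only: the two sources, their records, and every function name are hypothetical, and the cleaning rules (drop unparseable values, keep the first record per id) are assumptions chosen to keep the example short.

```python
from statistics import mean

# Hypothetical raw records from two sources; values arrive as text.
SOURCE_1 = [{"id": "a", "value": "10"}, {"id": "b", "value": "n/a"}]
SOURCE_2 = [{"id": "a", "value": "10"}, {"id": "c", "value": "30"}]


def extract(sources):
    """Extract: read every record and tag it with its source name."""
    for name, records in sources.items():
        for record in records:
            yield {**record, "source": name}


def transform(records):
    """Transform (cleanse): parse values, silently dropping bad ones."""
    for record in records:
        try:
            value = float(record["value"])
        except ValueError:
            continue  # a cleansing decision that can introduce editing error
        yield {**record, "value": value}


def load(records):
    """Load (store): keep one record per id, so duplicates count once."""
    store = {}
    for record in records:
        store.setdefault(record["id"], record)
    return store


def filter_reduce(store, keep_every=1):
    """Filter/reduction: keep every k-th id (k=1 keeps everything)."""
    ids = sorted(store)
    return [store[i] for i in ids[::keep_every]]


def analyze(records):
    """Computation/analysis: a simple mean of the stored values."""
    return mean(r["value"] for r in records)


store = load(transform(extract({"source_1": SOURCE_1, "source_2": SOURCE_2})))
result = analyze(filter_reduce(store))
print(sorted(store), result)
```

Of the four raw records only two reach the analysis: one is dropped in the transform step and one is a duplicate across sources. Each of those choices is a point where the errors listed on the slides can enter.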

POTENTIAL

We have to do this jointly
Data Output/Access. Example: map visualization / privacy. Fields: Psychology, Law, Math & Comp, Business
Data Analysis. Example: Hadoop MapReduce; High Frequency Data. Fields: Economics, Social Sciences, Business, Math & Comp
Data Curation/Storage. Example: Hadoop Distributed File System. Fields: Math & Computer Science, Applied Statistics
Data Generating Process. Examples: geolocated social media + survey + administrative data. Fields: Social Science & Psychology, Humanities, Econ, Business
Research Questions. Examples: Behavior of interest (migration/political participation/job searches). Fields: Any field

... and think about the legal framework