1 Unlocking the Full Potential of Big Data Lilli Japec, Frauke Kreuter JOS anniversary June 2015
2 The report is available at https://www.aapor.org
3 Task Force Members: Lilli Japec, Co-Chair, Statistics Sweden; Frauke Kreuter, Co-Chair, JPSM at the U. of Maryland, U. of Mannheim & IAB; Marcus Berg, Stockholm University; Paul Biemer, RTI International; Paul Decker, Mathematica Policy Research; Cliff Lampe, School of Information at the University of Michigan; Julia Lane, American Institutes for Research; Cathy O'Neil, Johnson Research Labs; Abe Usher, HumanGeo Group
4 AAPOR (American Association for Public Opinion Research): a professional organization dedicated to advancing the study of public opinion, broadly defined to include attitudes, norms, values, and behaviors; promotes best practices and transparency; works to educate its members as well as policy makers, the media, and the public at large, to help them make better use of surveys and survey findings and to inform them about new developments in the field. Other task force reports are available at https://www.aapor.org
5 Outline of our presentations: (1) What is Big Data? (2) Paradigm shift (3) Big Data activities in different organizations (4) Skills required (5) Big Data process and data quality
6 three main data sources UNTIL RECENTLY
7 Survey Data Administrative Data Experiments
9 US Aggregated Inflation Series, Monthly Rate, PriceStats Index vs. Official CPI. Accessed January 18, 2015 from the PriceStats website.
10 Number of vehicles detected in the Netherlands on December 1, 2011 created by Statistics Netherlands (Daas et al. 2013). The vehicle size is shown in different colors; black is small size, red is medium size and green is large size.
11 Social media sentiment (daily, weekly and monthly) in the Netherlands, June to November. The development of consumer confidence for the same period is shown in the inset (Daas and Puts 2014).
12 Big Data
13 Hope that found/organic data: can replace or augment expensive data collections; provide more (= better) data for decision making; make information available in (nearly) real time
14 New paradigm. New business model: federal agencies no longer major players. New analytical model: outliers, fine-grained analysis, new units of analysis. New sets of skills: computer scientists, citizen scientists. Different cost structure. Source: Julia Lane
15 Eurostat Big Data Action Plan and Roadmap: pilots exploring the potential of selected big data sources. The project will also include activities on methodological frameworks, quality frameworks, metadata frameworks, IT infrastructures, communication, legal frameworks, ethical frameworks, skills and training, and experience sharing.
16 UNECE and Big Data: The Sandbox provides a computing environment to load Big Data sets and tools. Consumer price indices: experimenting with the computation of price indexes. Mobile telephone data: statistics on tourism and daily commuting. Smart meters: statistics on power consumption using data collected from smart meter readings. Traffic loops: traffic statistics using data from traffic loops. Social media: using Twitter data to analyze sentiment and tourism flows. Job portals: computing statistics on job vacancies. Web scraping: testing methods for automatically collecting data from web sources.
17 UNECE Big Data Inventory
18 Statistics Netherlands: Roadmap Big Data. Two focus projects: the use of traffic loop data for transportation statistics; the use of mobile phone data for daytime population and tourism statistics. Six other projects: the use of internet data for price statistics; investigating the use of bank and credit card transactions; the use of social media data for detecting trends in social cohesion; the use of internet data for encoding enterprise purchases and sales; investigating the use of public transport smartcards for statistics; and the use of internet data for statistics about job vacancies. Source: Pieter Vlag, Statistics Netherlands
19 Examples from Statistics Sweden: scanner data to improve the Household Budget Survey; job vacancy statistics by scraping the web; evaluating the use of AIS (Automatic Identification System) data. Cooperation between Statistics Sweden and the agency for Transport Analysis (Trafa). Research funding from the Swedish Innovation Agency (Vinnova).
20 Source: Moström and Justesen, Statistics Sweden One day data
21 What tasks are required to get there? SKILLS
22 We have to do this jointly. Data Output/Access (example: map visualization / privacy); Data Analysis (example: Hadoop MapReduce; high-frequency data); Data Curation/Storage (example: Hadoop Distributed File System); Data Generating Process (examples: geolocated social media + survey + administrative data); Research Questions (examples: behavior of interest, such as migration, political participation, job searches)
23 Source: Abe Usher
24 Big words: What is big data? What is the Hadoop Distributed File System (HDFS)? What is Hadoop MapReduce (MR)? How do you link surveys with big data? Source: Abe Usher
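To make the MapReduce idea on this slide concrete, here is a minimal sketch in plain Python. It does not use Hadoop or any Hadoop API; it only imitates the three conceptual steps (map, shuffle, reduce) on a single machine, and the two input records are hypothetical.

```python
from collections import defaultdict


def map_phase(record):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values that share one key into a single count."""
    return key, sum(values)


def map_reduce(records):
    """Run map, shuffle and reduce in sequence on an iterable of records."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


# Hypothetical input records, standing in for e.g. social media posts.
posts = ["big data big hopes", "data quality matters"]
counts = map_reduce(posts)
print(counts)
```

In a real cluster the map and reduce functions would run in parallel on different nodes, with the framework handling the shuffle; the sketch only shows how the work is split.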
25 Computer scientist: data preparation, MapReduce algorithms, Python/R programming, Hadoop ecosystem. System administrator: storage systems (MySQL, HBase, Spark); cloud computing (Amazon Web Services (AWS), Google Compute Engine); Hadoop ecosystem. Source: Abe Usher
26 What do we know about the data generating process? RESEARCH
27 Veracity: Who? What? Why? Who is missing? Who is counted repeatedly? What is not said / measured? ...and why?
28 But (at least) one more V
29 Errors in Big Data: An Illustration (Terrorist Detector). Suppose 1 in 1,000,000 people are terrorists. The Big Data Terrorist Detector is 99.9% accurate. The detector says your friend, Jack, is a terrorist. What are the odds that Jack is really a terrorist? Source: Paul Biemer
30 Errors in Big Data: An Illustration (Terrorist Detector). Suppose 1 in 1,000,000 people are terrorists. The Big Data Terrorist Detector is 99.9% accurate. The detector says your friend, Jack, is a terrorist. What are the odds that Jack is really a terrorist? Answer: about 1 in 1,000, i.e., 99.9% of the terrorist detections will be false! Source: Paul Biemer
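The answer on this slide follows from Bayes' rule. The sketch below reproduces the arithmetic under one assumption that the slide does not spell out: that "99.9% accurate" means the detector flags 99.9% of true terrorists and clears 99.9% of non-terrorists (sensitivity and specificity both 0.999). The function and variable names are illustrative.

```python
def posterior_probability(prevalence, sensitivity, specificity):
    """Return P(condition | flagged) using Bayes' rule."""
    true_positives = sensitivity * prevalence
    false_positives = (1.0 - specificity) * (1.0 - prevalence)
    return true_positives / (true_positives + false_positives)


prevalence = 1 / 1_000_000  # from the slide: 1 in 1,000,000 people
accuracy = 0.999            # assumption: applies to both error types

p_terrorist = posterior_probability(prevalence, accuracy, accuracy)
share_false = 1.0 - p_terrorist

print(f"P(terrorist | flagged) = {p_terrorist:.6f}")
print(f"Share of detections that are false = {share_false:.4%}")
```

Because true terrorists are so rare, the false positives among the remaining 999,999 people per million far outnumber the true detections, which is why almost every flag is wrong even with a highly accurate detector.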
31 Big Data Process Map. Generate: Source 1, Source 2, ..., Source K. ETL: Extract, Transform (Cleanse), Load (Store). Analyze: Filter/Reduction (Sampling), Computation/Analysis (Visualization). Source: Paul Biemer
32 Big Data Process Map. Generation: Source 1, Source 2, ..., Source K. Errors include: low signal/noise ratio; lost signals; failure to capture; non-random (or nonrepresentative) sources; metadata that are lacking, absent, or erroneous. ETL: Extract, Transform (Cleanse), Load (Store). Analyze: Filter/Reduction (Sampling), Computation/Analysis (Visualization). Source: Paul Biemer
33 Big Data Process Map. Generation: Source 1, Source 2, ..., Source K. ETL: Extract, Transform (Cleanse), Load (Store). Errors include: specification error (including errors in metadata), matching error, coding error, editing error, data munging errors, and data integration errors. Analyze: Filter/Reduction (Sampling), Computation/Analysis (Visualization). Source: Paul Biemer
34 Big Data Process Map. Generation: Source 1, Source 2, ..., Source K. ETL: Extract, Transform (Cleanse), Load (Store). Analyze: Filter/Reduction (Sampling), Computation/Analysis (Visualization). Filter/Reduction: data are filtered, sampled or otherwise reduced. This may involve further transformations of the data. Errors include: sampling errors, selectivity errors (or lack of representativity), modeling errors. Source: Paul Biemer
35 Big Data Process Map. Generation: Source 1, Source 2, ..., Source K. ETL: Extract, Transform (Cleanse), Load (Store). Analyze: Filter/Reduction (Sampling), Computation/Analysis (Visualization). Errors include: modeling errors, inadequate or erroneous adjustments for representativity, computation and algorithmic errors. Source: Paul Biemer
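The selectivity errors (lack of representativity) named in the process map can be illustrated with a small simulation. This is a sketch under invented assumptions, not an analysis from the presentation: the population, the inclusion rule and all numbers are hypothetical. It compares a large "found" data set, in which units with higher values are more likely to appear, with a much smaller simple random sample.

```python
import random

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population of 100,000 units with some measured value.
population = [random.gauss(50, 10) for _ in range(100_000)]
true_mean = sum(population) / len(population)


def inclusion_probability(value):
    """Assumed selection mechanism: higher values are more likely captured."""
    return min(1.0, max(0.0, (value - 20) / 60))


# Large but selective "big data" source.
big_source = [v for v in population if random.random() < inclusion_probability(v)]
big_mean = sum(big_source) / len(big_source)

# Small simple random sample, as in a traditional survey.
survey_sample = random.sample(population, 1_000)
survey_mean = sum(survey_sample) / len(survey_sample)

print(f"True mean:          {true_mean:.2f}")
print(f"Selective source:   {big_mean:.2f}  (n = {len(big_source):,})")
print(f"Random sample:      {survey_mean:.2f}  (n = {len(survey_sample):,})")
```

Under these assumptions the selective source is many times larger than the random sample yet lands further from the true mean, because its error is bias rather than sampling variance and does not shrink as the data set grows.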
37 We have to do this jointly. Data Output/Access (example: map visualization / privacy): Psychology, Law, Math & Computer Science, Business. Data Analysis (example: Hadoop MapReduce; high-frequency data): Economics, Social Sciences, Business, Math & Computer Science. Data Curation/Storage (example: Hadoop Distributed File System): Math & Computer Science, Applied Statistics. Data Generating Process (examples: geolocated social media + survey + administrative data): Social Science & Psychology, Humanities, Economics, Business. Research Questions (examples: behavior of interest, such as migration, political participation, job searches): any field.