
Task Force Members:
Lilli Japec, Co-Chair, Statistics Sweden
Frauke Kreuter, Co-Chair, JPSM at the U. of Maryland, U. of Mannheim & IAB
Marcus Berg, Stockholm University
Paul Biemer, RTI International
Paul Decker, Mathematica Policy Research
Cliff Lampe, School of Information at the University of Michigan
Julia Lane, American Institutes for Research
Cathy O'Neil, Johnson Research Labs
Abe Usher, HumanGeo Group

Acknowledgement: We are grateful for comments, feedback and editorial help from Eran Ben-Porath, Jason McMillan, and the AAPOR council members.

The report has four objectives:
1. to educate the AAPOR membership about Big Data (Section 3)
2. to describe the Big Data potential (Section 4 and Section 7)
3. to describe the Big Data challenges (Sections 5 and 6)
4. to discuss possible solutions and research needs (Section 8)

Big Data AAPOR Task Force Source: Frauke Kreuter

until recently three main data sources

Survey Data Administrative Data Experiments Source: Frauke Kreuter

now

US Aggregated Inflation Series, Monthly Rate, PriceStats Index vs. Official CPI. Accessed January 18, 2015 from the PriceStats website.

Number of vehicles detected in the Netherlands on December 1, 2011 created by Statistics Netherlands (Daas et al. 2013). The vehicle size is shown in different colors; black is small size, red is medium size and green is large size.

Social media sentiment (daily, weekly and monthly) in the Netherlands, June 2010 - November 2013. The development of consumer confidence for the same period is shown in the insert (Daas and Puts 2014).

Big Data http://www.rosebt.com/blog/dataveracity

Hope that found/organic data
Can replace or augment expensive data collections
More (= better) data for decision making
Information available in (nearly) real time
Source: Frauke Kreuter

But (at least) one more V http://www.rosebt.com/blog/dataveracity

fkreuter@umd.edu Thank You!

CHANGE IN PARADIGM AND RISKS INVOLVED Julia Lane New York University American Institutes for Research University of Strasbourg

Big Data definition
"Big Data is an imprecise description of a rich and complicated set of characteristics, practices, techniques, ethics, and outcomes all associated with data." (AAPOR)
No canonical definition
By characteristics: Volume, Velocity, Variety (and Variability and Veracity)
By source: found vs. made
By use: professionals vs. citizen science
By reach: datafication
By paradigm: Fourth paradigm
Source: Julia Lane

Motivation
New business model
Federal agencies no longer major players
New analytical model
Outliers
Fine-grained analysis
New units of analysis
New sets of skills
Computer scientists
Citizen scientists
Different cost structure
Source: Julia Lane

New Frameworks Source: Ian Foster, University of Chicago

New kinds of analysis Source: Jason Owen Smith and UMETRICS data

Access for Research Source: Julia Lane

Value in other fields Source: Julia Lane

Source: Julia Lane

Core Questions
What is the legal framework?
What is the practical framework?
What is the statistical framework?
Source: Julia Lane

Legal Framework
Current legal structure inadequate
"The recording, aggregation, and organization of information into a form that can be used for data mining, here dubbed datafication, has distinct privacy implications that often go unrecognized by current law" (Strandburg)
Assessment of harm from privacy inadequate
Privacy and big data are incompatible
Anonymity not possible
Informed consent not possible
Source: Julia Lane

Informed consent (Nissenbaum) Source: Julia Lane

Statistical Framework
Importance of valid inference
The role of statisticians/access
Inadequate current statistical disclosure limitation
Diminished role of federal statistical agencies
Limitations of survey
New analytical framework
Mathematically rigorous theory of privacy
Measurement of privacy loss
Differential privacy
Source: Julia Lane
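The slide names differential privacy without showing a mechanism. As a minimal sketch of how privacy loss can be measured and controlled, the Laplace mechanism below adds noise scaled to 1/epsilon to a count. The function names, the count of 1,000, and the epsilon values are illustrative assumptions, not material from the talk.

```python
import math
import random


def laplace_noise(scale, rng):
    """Draw one sample from a Laplace(0, scale) distribution by inverse CDF."""
    u = rng.random() - 0.5  # uniform on [-0.5, 0.5)
    # Guard against log(0) at the extreme u = -0.5.
    magnitude = max(1.0 - 2.0 * abs(u), 1e-300)
    return -scale * math.copysign(1.0, u) * math.log(magnitude)


def private_count(true_count, epsilon, rng):
    """Release a count with epsilon-differential privacy.

    A counting query changes by at most 1 when one person is added or
    removed (sensitivity 1), so the Laplace scale is 1 / epsilon.
    """
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Hypothetical example: the same count released under two privacy budgets.
rng = random.Random(0)
for epsilon in (1.0, 0.1):
    print(epsilon, round(private_count(1000, epsilon, rng), 1))
```

A smaller epsilon means a stronger privacy guarantee and a noisier release, which is the trade-off between privacy loss and data utility that the slide alludes to.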

Some suggestions Source: Julia Lane

And a reminder of why Source: Julia Lane

Comments and questions Julia.lane@nyu.edu

Skills Required to Integrate Big Data into Public Opinion Research Abe Usher Chief Technology Officer, HumanGeo

Outline
Big data demystified
Four layers of big data
Skills required
Easter eggs
Source: Abe Usher

Courtesy of Google Trends: http://goo.gl/4h8ttd Big Data Today

Courtesy of Google Trends: http://goo.gl/qhiqcn Big Data & Public Opinion vs Pop Culture

Big data: De-mystified
What is big data?
What is the Hadoop Distributed File System (HDFS)?
What is Hadoop MapReduce (MR)?
Source: Abe Usher

20th Century model of analysis
How can you identify a legitimate hip-hop artist (versus someone who just gets up and rhymes)?
"Game knows game, baby." Tracy Marrow (aka Ice-T)
http://www.npr.org/2005/08/30/4824690/original-gangster-rapper-and-actor-ice-t
[paraphrased] If you have expert knowledge, then you are capable of answering complex questions by interpreting domain-specific information.
Source: Abe Usher

New model of analysis
Peter Gibbons hatches a plot to write a computer virus that grabs fractions of a penny from a corporate retirement account. (Office Space, http://goo.gl/rdg1u)
Takeaway point: Little bits of value (information) provide deep insights in the aggregate
Source: Abe Usher

New model of analysis Takeaway point: Hadoop simplifies the creation of massive counting machines Source: Abe Usher
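To make the "counting machine" idea concrete, here is a minimal single-process sketch of the map, shuffle, and reduce steps in plain Python. It does not use Hadoop itself; the function names and the two sample posts are assumptions made for illustration.

```python
from collections import defaultdict


def map_phase(record):
    """Emit a (key, 1) pair for every word in one input record."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse all values for one key into a single count."""
    return key, sum(values)


def count_words(records):
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


# Hypothetical input records, e.g. short social media posts.
posts = ["big data big hopes", "data about data"]
print(count_words(posts))  # {'big': 2, 'data': 3, 'hopes': 1, 'about': 1}
```

In a real cluster the map and reduce functions run in parallel on many machines and the framework handles the shuffle; the analyst mostly writes the two small functions.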

Big Data: Layers
Data Output. Example: map visualization
Data Analysis. Example: Hadoop MapReduce
Data Storage. Example: Hadoop Distributed File System
Data Source(s). Examples: geolocated social media (proxy variable for behavior of interest)
Source: Abe Usher

Four roles related to big data: each provides different skills
Source: Abe Usher

Skills by role
Computer scientist: data preparation; MapReduce algorithms; Python/R programming; Hadoop ecosystem
System administrator: storage systems (MySQL, HBase, Spark); cloud computing (Amazon Web Services (AWS), Google Compute Engine); Hadoop ecosystem
Source: Abe Usher

Big data enables new insights into human behavior Geolocated social media activity in Washington DC during a 15 minute time period generated by MR. TweetMap Source: Abe Usher

Contact information Abe Usher abe@thehumangeo.com http://www.thehumangeo.com/

Easter Eggs
Google "do a barrel roll"
Google "gravity"
Google search in Klingon: www.google.com/?hl=kn
Source: Abe Usher

Big Data Veracity: Error Sources and Inferential Risks Paul Biemer RTI International and University of North Carolina Source: Paul Biemer

Errors in Big Data: An Illustration
Suppose 1 in 1,000,000 people are terrorists.
The Big Data Terrorist Detector is 99.9% accurate.
The detector says your friend Jack is a terrorist. What are the odds that Jack is really a terrorist?
Answer: about 1 in 1,000, i.e., 99.9% of the terrorist detections will be false!
Source: Paul Biemer
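The answer follows from Bayes' rule. The sketch below reproduces the arithmetic, assuming that "99.9% accurate" means the detector is right 99.9% of the time for terrorists and non-terrorists alike (that reading, and the function name, are assumptions).

```python
def posterior_probability(prior, sensitivity, specificity):
    """P(condition | positive flag) by Bayes' rule."""
    true_positive = sensitivity * prior
    false_positive = (1.0 - specificity) * (1.0 - prior)
    return true_positive / (true_positive + false_positive)


# Figures from the slide: base rate 1 in 1,000,000, detector 99.9% accurate.
p = posterior_probability(prior=1e-6, sensitivity=0.999, specificity=0.999)
print(f"P(terrorist | flagged) = {p:.6f}")  # roughly 0.001, about 1 in 1,000
print(f"Share of false detections = {1 - p:.2%}")
```

Because true terrorists are so rare, the 0.1% of innocent people who are wrongly flagged vastly outnumber the correctly flagged cases.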

Some questions regarding Big Data veracity
What constitutes a Big Data error?
What are the sources and causes of the errors?
Do the error distributions vary by source?
Are the errors systematic or variable or both?
[Figure: Systematic error in Google Flu Trends data]
How do the errors affect data analyses such as classifications, correlations, and regressions?
How can analysts mitigate these effects?
Source: Paul Biemer

Total Error Framework for Traditional Data Sets
Typical file structure: rows are population units (Record #); columns are variables or features V1, V2, ..., VK
total error = row error + column error + cell error
Source: Paul Biemer

Possible Column and Cell Errors
Typical file structure: rows are population units (Record #); columns are variables or features V1, V2, ..., VK
Misspecified variables = specification error
Variable values in error = content error
Variable values missing = missing data
Source: Paul Biemer

Possible Row Errors
Typical file structure: rows are population units (Record #); columns are variables or features V1, V2, ..., VK
Missing records = undercoverage error
Non-population records = overcoverage error
Duplicated records = duplication error
Source: Paul Biemer
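The three row errors can be illustrated by comparing the record identifiers in a data file with a population frame. This is only a sketch: the function name and the small ID lists are hypothetical, and in practice a complete frame is often unavailable for Big Data sources.

```python
from collections import Counter


def row_errors(frame_ids, file_ids):
    """Classify row-level errors in a data file against a population frame."""
    frame = set(frame_ids)
    counts = Counter(file_ids)
    seen = set(counts)
    return {
        # Population units with no record in the file.
        "undercoverage": sorted(frame - seen),
        # Records that do not belong to the population.
        "overcoverage": sorted(seen - frame),
        # Identifiers that appear more than once in the file.
        "duplication": sorted(i for i, n in counts.items() if n > 1),
    }


# Hypothetical frame of five units and a file with one of each problem.
print(row_errors(frame_ids=[1, 2, 3, 4, 5], file_ids=[1, 2, 2, 4, 9]))
# {'undercoverage': [3, 5], 'overcoverage': [9], 'duplication': [2]}
```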

Shortcomings of the Traditional Framework for Big Data
Big Data files are often not rectangular: hierarchically structured or unstructured
Data may be distributed across many databases: sometimes federated, but often not
Data sources may be quite heterogeneous: includes texts, sensors, transactions, and images
Errors generated by the Map/Reduce process may not lend themselves to column-row representations.
Source: Paul Biemer

Big Data Process Map
Generate: Source 1, Source 2, ..., Source K
ETL: Extract; Transform (Cleanse); Load (Store)
Analyze: Filter/Reduction (Sampling); Computation/Analysis (Visualization)

Generation errors include: low signal/noise ratio; lost signals; failure to capture; non-random (or nonrepresentative) sources; metadata that are lacking, absent, or erroneous.
ETL errors include: specification error (including errors in metadata), matching error, coding error, editing error, data munging errors, and data integration errors.
Filter/Reduction (Sampling): Data are filtered, sampled or otherwise reduced. This may involve further transformations of the data. Errors include: sampling errors, selectivity errors (or lack of representativity), modeling errors.
Computation/Analysis (Visualization) errors include: modeling errors, inadequate or erroneous adjustments for representativity, computation and algorithmic errors.
Source: Paul Biemer

Implications for Data Analysis
Study of rare groups is problematic
Biased correlational analysis
Biased regression analysis
Coincidental correlations
[Illustration: "Stork Die-off Linked to Human Birth Decline"]
Noise accumulation: inability to identify correlates
Incidental endogeneity: Cov(error, covariates) ≠ 0
These latter three issues are a concern even if the data could be regarded as error-free. Data errors can considerably exacerbate these problems. Current research is aimed at investigating these errors.
Source: Paul Biemer www.rti.org
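Coincidental correlations arise even in error-free data once enough variables are examined. The small simulation below is a sketch under assumed settings (20 observations, standard normal noise, a fixed seed, hypothetical function names): it correlates one outcome with a growing number of unrelated variables and reports the largest absolute correlation found.

```python
import math
import random


def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def max_spurious_correlation(n_obs, n_vars, seed=0):
    """Largest |r| between an outcome and n_vars unrelated noise variables."""
    rng = random.Random(seed)
    y = [rng.gauss(0, 1) for _ in range(n_obs)]
    best = 0.0
    for _ in range(n_vars):
        # Each x is independent of y by construction.
        x = [rng.gauss(0, 1) for _ in range(n_obs)]
        best = max(best, abs(pearson(x, y)))
    return best


for p in (10, 100, 1000):
    print(p, round(max_spurious_correlation(n_obs=20, n_vars=p), 2))
```

Because the seed is fixed, each larger run contains the smaller run's variables, so the reported maximum can only grow as more unrelated variables are screened.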

Recommendations
1. Surveys and Big Data are complementary data sources, not competing data sources. There are differences between the approaches, but this should be seen as an advantage rather than a disadvantage.
2. AAPOR should develop standards for the use of Big Data in survey research when more knowledge has been accumulated.
3. AAPOR should start working with the private sector and other professional organizations to educate its members on Big Data.
4. AAPOR should inform the public of the risks and benefits of Big Data.
5. AAPOR should help remove the barrier associated with different uses of terminology.
6. AAPOR should take a leading role in working with federal agencies in developing a necessary infrastructure for the use of Big Data in survey research.