2 Task Force Members: Lilli Japec, Co-Chair, Statistics Sweden Frauke Kreuter, Co-Chair, JPSM at the U. of Maryland, U. of Mannheim & IAB Marcus Berg, Stockholm University Paul Biemer, RTI International Paul Decker, Mathematica Policy Research Cliff Lampe, School of Information at the University of Michigan Julia Lane, American Institutes for Research Cathy O Neil, Johnson Research Labs Abe Usher, HumanGeo Group Acknowledgement: We are grateful for comments, feedback and editorial help from Eran Ben-Porath, Jason McMillan, and the AAPOR council members.
3 The report has four objectives: 1. to educate the AAPOR membership about Big Data (Section 3) 2. to describe the Big Data potential (Section 4 and Section 7) 3. to describe the Big Data challenges (Section 5 and 6) 4. to discuss possible solutions and research needs (Section 8)
4 Big Data AAPOR Task Force Source: Frauke Kreuter
5 until recently three main data sources
6 Survey Data Administrative Data Experiments Source: Frauke Kreuter
8 US Aggregated Inflation Series, Monthly Rate, PriceStats Index vs. Official CPI. Accessed January 18, 2015 from the PriceStats website.
9 Number of vehicles detected in the Netherlands on December 1, 2011 created by Statistics Netherlands (Daas et al. 2013). The vehicle size is shown in different colors; black is small size, red is medium size and green is large size.
10 Social media sentiment (daily, weekly and monthly) in the Netherlands, June November The development of consumer confidence for the same period is shown in the insert (Daas and Puts 2014).
11 Big Data
12 Hope that found/organic data Can replace or augment expensive data collections More (= better) data for decision making Information available in (nearly) real time Source: Frauke Kreuter
13 But (at least) one more V
14 Thank You!
15 CHANGE IN PARADIGM AND RISKS INVOLVED Julia Lane New York University American Institutes for Research University of Strasbourg
16 Big Data definition Big Data is an imprecise description of a rich and complicated set of characteristics, practices, techniques, ethics, and outcomes all associated with data. (AAPOR) No canonical definition By characteristics: Volume Velocity Variety (and Variability and Veracity) By source: found vs. made By use: professionals vs. citizen science By reach: datafication By paradigm: Fourth paradigm Source: Julia Lane
17 Motivation New business model Federal agencies no longer major players New analytical model Outliers Finegrained analysis New units of analysis New sets of skills Computer scientists Citizen scientists Different cost structure Source: Julia Lane
18 New Frameworks Source: Ian Foster, University of Chicago
19 New kinds of analysis Source: Jason Owen Smith and UMETRICS data
20 Access for Research Source: Julia Lane
21 Value in other fields Source: Julia Lane
22 Source: Julia Lane
23 Core Questions What is the legal framework? What is the practical framework? What is the statistical framework? Source: Julia Lane
24 Legal Framework Current legal structure inadequate The recording, aggregation,and organization of information into a form that can be used for data mining, here dubbed datafication, has distinct privacy implications that often go unrecognized by current law (Strandburg) Assessment of harm from privacy inadequate Privacy and big data are incompatible Anonymity not possible Informed consent not possible Source: Julia Lane
25 Informed consent (Nissenbaum) Source: Julia Lane
26 Statistical Framework Importance of valid inference The role of statisticians/access Inadequate current statistical disclosure limitation Diminished role of federal statistical agencies Limitations of survey New analytical framework: Mathematically rigorous theory of privacy Measurement of privacy loss Differential privacy Source: Julia Lane
27 Some suggestions Source: Julia Lane
28 And a reminder of why Source: Julia Lane
29 Comments and questions
30 Skills Required to Integrate Big Data into Public Opinion Research Abe Usher Chief Technology Officer, HumanGeo
31 Outline Big data demystified Four layers of big data Skills required Easter eggs Source: Abe Usher
32 Courtesy of Google Trends: Big Data Today
33 Courtesy of Google Trends: Big Data & Public Opinion vs Pop Culture
34 Big data: De-mystified What is big data? What is Hadoop File System? (HDFS) What is Hadoop MapReduce? (MR) Source: Abe Usher 34
35 20 th Century model of analysis How can you identify a legitimate hip-hop artist (versus someone who just gets up and rhymes)? Tracy Morrow (aka Ice T ) Source: Abe Usher 35
36 20 th Century model of analysis How can you identify a legitimate hip-hop artist (versus someone who just gets up and rhymes)? Game knows game, baby. Tracy Morrow (aka Ice T ) Source: Abe Usher 36
37 20 th Century model of analysis How can you identify a legitimate hip-hop artist (versus someone who just gets up and rhymes)? Tracy Morrow (aka Ice T ) If you have expert knowledge, then you are capable of answering complex questions by interpreting domain specific information. [paraphrased] Source: Abe Usher 37
38 New model of analysis Peter Gibbons hatches a plot to write a computer virus that grab fractions of a penny from a corporate retirement account. Office Space Takeaway point: Little bits of value (information) provide deep insights in the aggregate Source: Abe Usher 38
39 New model of analysis Takeaway point: Hadoop simplifies the creation of massive counting machines Source: Abe Usher 39
40 Big Data: Layers Source: Abe Usher 40
41 Big Data: Layers Data Output Example: map visualization Data Analysis Example: Hadoop MapReduce Data Storage Example: Hadoop Distributed File System Data Source(s) Examples: geolocated social media (Proxy variable for behavior of interest) Source: Abe Usher 41
42 Source: Abe Usher Four roles related to big data: each provide different skills
43 Skills by role Computer scientist Data preparation MapReduce algorithms Python/R programming Hadoop ecosystem System Administrator Storage systems (MySQL, Hbase, Spark) Cloud computing: Amazon Web Services (AWS) Google Compute Engine Hadoop ecosystem Source: Abe Usher 43
44 Big data enables new insights into human behavior Geolocated social media activity in Washington DC during a 15 minute time period generated by MR. TweetMap Source: Abe Usher
45 Contact information Abe Usher 45
46 Easter Eggs Google do a barrel roll Google gravity Google search in Klingon Source: Abe Usher 46
47 Big Data Veracity: Error Sources and Inferential Risks Paul Biemer RTI International and University of North Carolina Source: Paul Biemer
48 Errors in Big Data: An Illustration Suppose 1 in 1,000,000 people are terrorists The Big Data Terrorist Detector is 99.9 accurate The detector says your friend, Jack is a terrorist. What are the odds that Jack is really a terrorist? Terrorist Detector Terrorist Detector Source: Paul Biemer 48
49 Errors in Big Data: An Illustration Suppose 1 in 1,000,000 people are terrorists The Big Data Terrorist Detector is 99.9 accurate The detector says your friend, Jack is a terrorist. What are the odds that Jack is really a terrorist? Answer: 1 in 1000 i.e., 99.9% of the terrorist detections will be false! Terrorist Detector Terrorist Detector Source: Paul Biemer 49
50 Some questions regarding Big Data veracity What constitutes a Big Data error? What are the sources and causes of the errors? Do the error distributions vary by source? Are the errors systematic or variable or both? Systematic error in Google Flu Trends data Source: Paul Biemer 50
51 Some questions regarding Big Data veracity What constitutes a Big Data error? What are the sources and causes of the errors? Do the error distributions vary by source? Are the errors systematic or variable or both? How do the errors affect data analysis such as Classifications Correlations Regressions How can analysts mitigate these effects? Source: Paul Biemer 51
52 Total Error Framework for Traditional Data Sets Typical File Structure Record # V 1 V 2 V K variables or features Population units Source: Paul Biemer
53 Total Error Framework for Traditional Data Sets Typical File Structure Record # V 1 V 2 V K variables or features Population units total error = row error + column error + cell error Source: Paul Biemer
54 Possible Column and Cell Errors Typical File Structure Record # V 1 V 2 V K variables or features Population units Source: Paul Biemer Misspecified variables = specification error Variable values in error = content error Variable values missing = missing data?
55 Possible Row Errors Typical File Structure Record # V 1 V 2 V K variables or features Population units Missing records = undercoverage error Non-population records = overcoverage Duplicated records = duplication error Source: Paul Biemer
56 Shortcomings of the Traditional Framework for Big Data Big Data files are often not rectangular hierarchically structure or unstructured Data may be distributed across many data bases Sometimes federated, but often not Data sources may be quite heterogeneous Includes texts, sensors, transactions, and images Errors generated by Map/Reduce process may not lend themselves to column-row representations. Source: Paul Biemer 56
57 Big Data Process Map Generate Source 1 ETL Extract Analyze Filter/Reduction (Sampling) Source 2 Source K Transform (Cleanse) Load (Store) Computation/ Analysis (Visualization) Source: Paul Biemer 57
58 Big Data Process Map Generation Source 1 Source 2 Source K ETL Errors include: Extract low signal/noise ratio; lost signals; failure to capture; non-random (or nonrepresentative) sources; metadata that are lacking, absent, or erroneous. Transform (Cleanse) Load (Store) Analyze Filter/Reduction (Sampling) Computation/ Analysis (Visualization) Source: Paul Biemer 58
59 Big Data Process Map Generation Source 1 Source 2 Source K ETL Extract Transform (Cleanse) Load (Store) Analyze Errors include: specification error (including, errors in meta-data), matching error, Filter/Reduction coding error, editing error, data (Sampling) munging errors, and data integration errors.. Computation/ Analysis (Visualization) Source: Paul Biemer 59
60 Big Data Process Map Generation Source 1 Data are filtered, sampled or otherwise Errors reduced. include: ETL This sampling may errors, involve selectivity further errors (or lack transformations of representativity), Extract of the modeling data. errors Analyze Filter/Reduction (Sampling) Source 2 Source K Transform (Cleanse) Load (Store) Computation/ Analysis (Visualization) Source: Paul Biemer 60
61 Big Data Process Map Generation Source 1 ETL Extract Analyze Filter/Reduction (Sampling) Source 2 Source K Errors include: Transform modeling errors, inadequate or (Cleanse) erroneous adjustments for representativity, computation and algorithmic errors. Load (Store) Computation/ Analysis (Visualization) Source: Paul Biemer 61
62 Implications for Data Analysis Study of rare groups is problematic Stork Die-off Linked to Human Birth Decline Biased correlational analysis Biased regression analysis Coincidental correlations Source: Paul Biemer 62
63 Implications for Data Analysis Study of rare groups is problematic Biased correlational analysis Biased regression analysis Coincidental correlations Noise accumulation inability to identify correlates Incidental endogeneity Cov(error, covariates) Source: 63 Paul Biemer
64 Implications for Data Analysis Study of rare groups is problematic Biased correlational analysis Biased regression analysis Coincidental correlations Noise accumulation inability to identify correlates Incidental endogeneity Cov(error, covariates) These latter three issues are a concern even if the data could be regarded as error-free. Data errors can considerably exacerbate these problems. Current research is aimed at investigating these errors. Source: Paul Biemer 64
65 Recommendations 1. Surveys and Big Data are complementary data sources not competing data sources. There are differences between the approaches, but this should be seen as an advantage rather than a disadvantage. 2. AAPOR should develop standards for the use of Big Data in survey research when more knowledge has been accumulated.
66 3. AAPOR should start working with the private sector and other professional organizations to educate its members on Big Data 4. AAPOR should inform the public of the risks and benefits of Big Data.
67 5. AAPOR should help remove the barrier associated with different uses of terminology. 6. AAPOR should take a leading role in working with federal agencies in developing a necessary infrastructure for the use of Big Data in survey research.
Unlocking the Full Potential of Big Data Lilli Japec, Frauke Kreuter JOS anniversary June 2015 facebook.com/statisticssweden @SCB_nyheter The report is available at https://www.aapor.org Task Force Members:
AAPOR Report on Big Data AAPOR Big Data Task Force February 12, 2015 Prepared for AAPOR Council by the Task Force, with Task Force members including: Lilli Japec, Co-Chair, Statistics Sweden Frauke Kreuter,
Brian Kleiner, Alexandra Stam and Nicolas Pekari Big data for the social sciences Lausanne, April 2015 FORS Working Papers 2015-2 FORS Working Paper series The FORS Working Paper series presents findings
Chapter 1 Grasping the Fundamentals of Big Data In This Chapter Looking at a history of data management Understanding why big data matters to business Applying big data to business effectiveness Defining
THE LYTX ADVANTAGE: USING PREDICTIVE ANALYTICS TO DELIVER DRIVER SAFETY BRYON COOK VP, ANALYTICS www.lytx.com 1 CONTENTS Introduction... 3 Predictive Analytics Overview... 3 Drivers in Use of Predictive
Eindhoven University of Technology Department of Mathematics and Computer Science Master s thesis Towards a Big Data Reference Architecture 13th October 2013 Author: Supervisor: Assessment committee: Markus
A Suggested Framework for the Quality of Big Data Deliverables of the UNECE Big Data Quality Task Team December, 2014 Contents 1. Executive Summary... 3 2. Background... 5 3. Introduction... 7 4. Principles...
32 Big Data: present and future Big Data: present and future Mircea Răducu TRIFU, Mihaela Laura IVAN University of Economic Studies, Bucharest, Romania firstname.lastname@example.org, email@example.com
Business Analytics Big Data Next-Generation Analytics the way we see it Table of contents Executive summary 1 Introduction: What is big data and why is it different? 3 The business opportunity 7 Traditional
Big Data Analytics: towards a European research agenda Editors: Mirco Nanni1, Costantino Thanos1, Fosca Giannotti1 and Andreas Rauber2 Executive Summary Big data are blossoming together with the hope to
Linked data Connecting and exploiting big data Whilst big data may represent a step forward in business intelligence and analytics, Fujitsu sees particular additional value in linking and exploiting big
Learning Objectives Dr. Chen, Data Base Management Chapter 6: Big Data and Analytics Jason C. H. Chen, Ph.D. Professor of MIS School of Business Administration Gonzaga University Spokane, WA 99258 firstname.lastname@example.org
June 2013 BIG DATA FOR DEVELOPMENT: A PRIMER Harnessing Big Data For Real-Time Awareness WHAT IS BIG DATA? Big Data is an umbrella term referring to the large amounts of digital data continually generated
Customer Cloud Architecture for Big Data and Analytics Executive Overview Using analytics reveals patterns, trends and associations in data that help an organization understand the behavior of the people
For Big Data Analytics There s No Such Thing as Too Big The Compelling Economics and Technology of Big Data Computing March 2012 By: 4syth.com Emerging big data thought leaders Forsyth Communications 2012.
The Use of Social Media for Research and Analysis: A Feasibility Study December 2014 DWP ad hoc research report no. 13 A report of research carried out by the Oxford Internet Institute on behalf of the
Big data and positive social change in the developing world: A white paper for practitioners and researchers Rockefeller Foundation Bellagio Centre conference, May 2014 Please cite as: Bellagio Big Data
NESSI White Paper, December 2012 Big Data A New World of Opportunities Contents 1. Executive Summary... 3 2. Introduction... 4 2.1. Political context... 4 2.2. Research and Big Data... 5 2.3. Purpose of
Enhancing Teaching and Learning Through Educational Data Mining and Learning Analytics: An Issue Brief Enhancing Teaching and Learning Through Educational Data Mining and Learning Analytics: An Issue Brief
BIG DATA IN ACTION FOR DEVELOPMENT This volume is the result of a collaboration of World Bank staff (Andrea Coppola and Oscar Calvo- Gonzalez) and SecondMuse associates (Elizabeth Sabet, Natalia Arjomand,
THE BIGGER THE BETTER? Big Data in Financial Services T H E B I G G E R T H E B E T T E R? BIG DATA IN FINANCIAL SERVICES 4 CONTENTS ABOUT THIS REPORT 5 GENERAL VIEWPOINT 6 BIG DATA PILLARS 8 Pillar 1:
PREPUBLICATION DRAFT Subject to Further Editorial Correction Frontiers in Massive Data Analysis Committee on the Analysis of Massive Data Committee on Applied and Theoretical Statistics Board on Mathematical
Database Systems Journal vol. III, no. 4/2012 3 Perspectives on Big Data and Big Data Analytics Elena Geanina ULARU, Florina Camelia PUICAN, Anca APOSTU, Manole VELICANU 1 Phd. Student, Institute of Doctoral
Climate Surveys: Useful Tools to Help Colleges and Universities in Their Efforts to Reduce and Prevent Sexual Assault Why are we releasing information about climate surveys? Sexual assault is a significant
BIG DATA PROJECT SUCCESS A META ANALYSIS Koronios, Andy, University of South Australia, Adelaide, Australia, email@example.com Gao, Jing, University of South Australia, Adelaide, Australia, firstname.lastname@example.org