The Social Science Data Revolution

Similar documents
Exploring Big Data in Social Networks

Machine Learning using MapReduce

Statistics for BIG data

FOR RELEASE: MONDAY, SEPTEMBER 10 AT 4 PM

Big Data a threat or a chance?

Model Selection. Introduction. Model Selection

Using Data Mining for Mobile Communication Clustering and Characterization

OBAMA IS FIRST AS WORST PRESIDENT SINCE WWII, QUINNIPIAC UNIVERSITY NATIONAL POLL FINDS; MORE VOTERS SAY ROMNEY WOULD HAVE BEEN BETTER

CLINTON OR CUOMO THUMP GOP IN 2016 NEW YORK PRES RACE, QUINNIPIAC UNIVERSITY POLL FINDS; ADOPTED DAUGHTER RUNS BETTER THAN NATIVE SON

Social Media Mining. Data Mining Essentials

Inside the Obama Analytics Cave Andrew Claster, Deputy Chief Analytics Officer Obama for America W INNING K N OWLEDGE T M

Data Mining with R. Text Mining. Hugh Murrell

Collaborations between Official Statistics and Academia in the Era of Big Data

Social Media and Digital Marketing Analytics ( INFO-UB ) Professor Anindya Ghose Monday Friday 6-9:10 pm from 7/15/13 to 7/30/13

Big Data how it changes the way you treat data

BIG DATA FUNDAMENTALS

How To Cluster

THE PRESIDENCY OF GEORGE W. BUSH January 11-15, 2009

CAP4773/CIS6930 Projects in Data Science, Fall 2014 [Review] Overview of Data Science

Technology has transformed the way in which science is conducted. Almost

Breen Elementary School

Connecting library content using data mining and text analytics on structured and unstructured data

How To Change Medicine

International Journal of Advanced Engineering Research and Applications (IJAERA) ISSN: Vol. 1, Issue 6, October Big Data and Hadoop

Data Mining Clustering (2) Sheets are based on the those provided by Tan, Steinbach, and Kumar. Introduction to Data Mining

Statistical Challenges with Big Data in Management Science

Big Data. Fast Forward. Putting data to productive use

Analysis One Code Desc. Transaction Amount. Fiscal Period

Big Data, Official Statistics and Social Science Research: Emerging Data Challenges

Chapter 7. Using Hadoop Cluster and MapReduce

Big Data Analytics in Health Care

Social Intelligence Report ADOBE DIGITAL INDEX Q4 2013

Search and Data Mining: Techniques. Applications Anya Yarygina Boris Novikov

Social Media Marketing Measurement A research project to understand new possibilities

CIS 4930/6930 Spring 2014 Introduction to Data Science Data Intensive Computing. University of Florida, CISE Department Prof.

Big Data and Economics, Big Data and Economies. Susan Athey, Stanford University Disclosure: The author consults for Microsoft.

AP / AOL / PEW RESEARCH CENTER SURVEY OF CELLULAR PHONE USERS March 8-28, 2006 Total N=1,503 / Landline RDD Sample N=752 / Cell Phone RDD Sample N=751

Clustering. Adrian Groza. Department of Computer Science Technical University of Cluj-Napoca

Text Mining Approach for Big Data Analysis Using Clustering and Classification Methodologies

The Big Picture on Big Data. Princeton Section 307 Dinner Meeting December 11, 2013 Richard Herczeg

Using Your Own Marketing Score to Measure Your Impact in Social Media A Market Smart 360 Whitepaper

Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA

The Scientific Data Mining Process

Newsweek Poll Psychology of Voter Anger Princeton Survey Research Associates International. Final Topline Results (10/1/10)

Danny Wang, Ph.D. Vice President of Business Strategy and Risk Management Republic Bank

What Will Drive Growth in the Washington Area Economy Going Forward?

We Must Prepare Ph.D. Students for the Complicated Art of Teaching - Commentary - The Chronicle of Higher Education

Online & Offline Correlation Conclusions I - Emerging Markets

Pretty Only TABLE 2 POSITIVE RATINGS: TRENDS SINCE 9/11/01: SUMMARY. April Nov Feb. 2005

i. Definition ii. Primary Activities iii. Support Activities iv. Information Systems role in value chain analysis

Unprecedented Exposure January 2014

DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS

Social Business Intelligence For Retail Industry

Advisors: Using Marketing to Build Your Pipeline. Presenter: Barbara Kotlyar Sr. Marketing Manager ByAllAccounts Managing Director, Bridge Marketing

A U T H O R S : G a n e s h S r i n i v a s a n a n d S a n d e e p W a g h Social Media Analytics

2014 Midterm Elections: Voter Dissatisfaction with the President and Washington October 23-27, 2014

Data Mining Project Report. Document Clustering. Meryem Uzun-Per

Conquering the Astronomical Data Flood through Machine

Data Mining and Neural Networks in Stata

Let the data speak to you. Look Who s Peeking at Your Paycheck. Big Data. What is Big Data? The Artemis project: Saving preemies using Big Data

Domain Name Abuse Detection. Liming Wang

Social Media Implementations

Guide: Social Media Metrics in Government

Example application (1) Telecommunication. Lecture 1: Data Mining Overview and Process. Example application (2) Health

Keywords Big Data; OODBMS; RDBMS; hadoop; EDM; learning analytics, data abundance.

adjust reports The Undead App Store A 2014 retrospective report on App Store performance

Predictive Analytics Certificate Program

Distances, Clustering, and Classification. Heatmaps

Role of Social Networking in Marketing using Data Mining

Pallas Ludens. We inject human intelligence precisely where automation fails. Jonas Andrulis, Daniel Kondermann

The Fed, Interest Rates, and Presidential Elections

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015

BIG Data Analytics Move to Competitive Advantage

Lead Generation Lessons From 4,000 Businesses. A study based on real data from 4,000 businesses

Sentiment Analysis on Big Data

Analytics on Big Data

Lead Generation in Emerging Markets

Introduction to Machine Learning Using Python. Vikram Kamath

Web analytics: Data Collected via the Internet

National and Transnational Security Implications of Big Data in the Life Sciences

Democratic Process and Social Media: A Study of Us Presidential Election 2012

Transcription:

The Social Science Data Revolution Gary King Department of Government Harvard University (Horizons in Political Science talk, Government Department, Harvard University, 3/30/11) Gary King (Harvard, Gov Dept) The Data Revolution 1 / 24

The Changing Evidence Base of Social Science Research The Last 50 Years: Survey research Aggregate government statistics In depth studies of individual places, people, or events The Next 50 Years: Spectacular increases in new data sources, due to... Much more of the above improved, expanded, and applied Shrinking computers & the growing Internet: data everywhere The replication movement: academic data sharing (e.g., Dataverse) Governments encouraging data collection, distribution, experimentation (e.g., GovData) Advances in statistical methods, informatics, & software The march of quantification: through academia, professions, government, & commerce (SuperCrunchers, The Numerati, MoneyBall) Gary King (Harvard, Gov Dept) The Data Revolution 2 / 24

Examples of what s now possible Opinions of activists: A few thousand interviews millions of political opinions in social media posts (1B tweets/week) Exercise: A survey: How many times did you exercise last week? 500K people carrying cell phones with accelerometers Social contacts: A survey: Please tell me your 5 best friends continuous record of phone calls, emails, text messages, bluetooth, social media connections, electronic address books Economic development in developing countries: Dubious or nonexistent governmental statistics satellite images of human-generated light at night, or networks of roads and other infrastructure Expert-vs-Statistician contests: Whenever enough information is quantified (& a right answer exists), stats wins every time Many, many more... Gary King (Harvard, Gov Dept) The Data Revolution 3 / 24

How to make progress in the new data-rich world? 1 Computer-assisted methods: Traditional quantitative-only or qualitative-only approaches are infeasible 2 Large-scale, interdisciplinary, collaborative research 3 New statistical methods & engineering required 4 Better theory: to respond to massive new evidence, privacy challenges, data-driven science Bigger changes in the practice of social science then ever before Gary King (Harvard, Gov Dept) The Data Revolution 4 / 24

Two Examples of Automated Text Analysis Gary King (Harvard, Gov Dept) The Data Revolution 5 / 24

Example 1: How to Read a Billion Blog Posts (& Classify Deaths without Physicians) Daniel Hopkins and Gary King. Extracting Systematic Social Science Meaning from Text AJPS, commercialized via: Gary King and Ying Lu. Verbal Autopsy Methods with Multiple Causes of Death, Statistical Science used by (among others): Gary King (Harvard, Gov Dept) The Data Revolution 6 / 24

Data and Quantities of Interest Input Data: Large set of text documents (e.g., all English language blog posts) Categories (posts about US candidates): extremely negative, negative, neutral, positive, extremely positive, no opinion, not a blog A small training set of documents hand-coded into the categories Quantities of interest Computer science: individual document classifications (spam filters, Google searches) Social Science: proportion in each category (proportion of email which is spam; proportion extremely negative comment about Pres Bush) Estimation Can get the 2nd by counting the 1st (if 1st is accurate) High classification accuracy unbiased category proportions 70% classification accuracy is high disaster for category proportions New methodology: unbiased category proportions, even when classification accuracy is low Gary King (Harvard, Gov Dept) The Data Revolution 7 / 24

Out-of-sample Comparison: 60 Seconds vs. 8.7 Days Affect in Blogs Estimated P(D) 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 Actual P(D) Gary King (Harvard, Gov Dept) The Data Revolution 8 / 24

Reactions to John Kerry s Botched Joke You know, education if you make the most of it... you can do well. If you don t, you get stuck in Iraq. Affect Towards John Kerry Proportion 0.0 0.1 0.2 0.3 0.4 0.5 0.6 1 1 2 0 2 Sept Oct Nov Dec Jan Feb Mar 2006 2007 Gary King (Harvard, Gov Dept) The Data Revolution 9 / 24

Example 2: Computer-Assisted Reading Justin Grimmer and Gary King. 2011. General-Purpose Clustering and Conceptualization Proceedings of the National Academy of Sciences. Conceptualization through Classification: one of the most central and generic of all our conceptual exercises.... Without classification, there could be no advanced conceptualization, reasoning, language, data analysis or,for that matter, social science research. (Bailey, 1994). Cluster Analysis: simultaneously (1) invents categories and (2) assigns documents to categories Gary King (Harvard, Gov Dept) The Data Revolution 10 / 24

What s Hard about Clustering? (Why Johnny Can t Classify) Goal: Computer-assisted conceptualization & clustering Bell(n) = number of ways of partitioning n objects Bell(2) = 2 (AB, A B) Bell(3) = 5 (ABC, AB C, A BC, AC B, A B C) Bell(5) = 52 Bell(100) 10 28 Number of elementary particles in the universe Now imagine choosing the optimal classification scheme by hand! Available compromises pursue different goals: Standard Approach: Fully automated methods no method works well in general; impossible to know which to apply! Our Approach: Computer-assisted methods You, not some computer algorithm, decides what s important, but with help Gary King (Harvard, Gov Dept) The Data Revolution 11 / 24

Switch from Fully Automated to Computer Assisted Computer-Assisted Clustering Easy in theory: list all clusterings; choose the best Impossible in practice: Too hard for us mere humans! An organized list will make the search possible Insight: Many clusterings are perceptually identical E.g.,: consider two clusterings that differ only because one document (of 10,000) moves from category 5 to 6 Question: How to organize clusterings so humans can understand? Gary King (Harvard, Gov Dept) The Data Revolution 12 / 24

How to Zoom Out while Reading You choose one (or more) clustering, based on insight, discovery, useful information,... Ford Obama Clustering 1 Space of Clusterings mixvmf Clustering 2 Nixon kmedoids stand.euc affprop info.costs Carter Johnson Carter Eisenhower Truman kmeans correlation rock hclust correlation single hclust pearson single affprop maximum Ford Eisenhower Truman Roosevelt Johnson Kennedy Clinton Roosevelt ``Reagan Republicans'' ``Other Presidents '' Bush kmeans binary hclust maximum single hclust canberra binary hclust centroid correlation pearson hclust median centroid correlation median hclust correlation pearson mcquitty average hclust kendall single hclust euclidean centroid kmeans hclust kendall canberra mcquitty hclust canberra binary median affprop biclust_spectral manhattan cosine mspec_max hclust canberra binary single mspec_canb hclust canberra average kmeans pearson hclust binary average divisive stand.euc mspec_cos som hclust manhattan centroid hclust kmedoids spearman maximum manhattan kendall manhattan centroid median single hclust euclidean median hclust correlation pearson complete hclust hclust manhattan spearman kendall maximum euclidean average median complete mcquitty average single affprop hclust euclidean manhattan hclust mcquitty average euclidean average hclust spearman single divisive euclidean kmedoids euclidean hclust spearman average mspec_mink mspec_euc hclust binary complete mcquitty divisive manhattan mspec_man hclust euclidean complete mcquitty hclust kendall complete hclust correlation hclust canberra ward complete clust_convex hclust spearman kendall mcquitty hclust euclidean dismea ward hclust binary ward hclust canberra ward hclust spearman complete hclust manhattan complete spec_canb hclust kendall ward mixvmfva spec_cos hclust manhattan ward spec_euc spec_man kmeans euclidean kmeans manhattan hclust pearson ward hclust spearman ward kmeans spearman spec_max hclust maximum ward kmeans maximum spec_mink Nixon ``Roosevelt To Carter'' Kennedy Bush Obama `` Reagan To Obama '' Reagan HWBush kmeans canberra Clinton HWBush mult_dirproc Reagan Gary King (Harvard, Gov Dept) The Data Revolution 13 / 24

Evaluation: More Informative Discoveries Found 2 scholars analyzing lots of textual data for their work Created 6 clusterings: 2 clusterings selected with our method (biased against us) 2 clusterings from each of 2 other methods (varying tuning parameters) Created info packet on each clustering (for each cluster: exemplar document, automated content summary) Asked for ( 6 2) =15 pairwise comparisons User chooses only care about the one clustering that wins Both cases a Condorcet winner: Immigration : Our Method 1 vmf 1 vmf 2 Our Method 2 K-Means 1 K-Means 2 Genetic testing : Our Method 1 {Our Method 2, K-Means 1, K-means 2} Dir Proc. 1 Dir Proc. 2 Gary King (Harvard, Gov Dept) The Data Revolution 14 / 24

Evaluation: What Do Members of Congress Do? Gary King (Harvard, Gov Dept) The Data Revolution 15 / 24

Example Discovery mult_dirproc hclust canberra ward kmeans correlation sot_cor divisive stand.euc mixvmf hclust correlation mixvmfva mcquitty hclust binary complete hclust pearson single affprop cosine hclust pearson median hclust correlation single hclust pearson mcquitty mec hclust pearson average hclust correlation complete hclust correlation median hclust binary single hclust correlation average hclust pearson complete hclust binary average kmeans pearson hclust pearson correlation centroid centroid rock som hclust binary median hclust canberra single hclust binary mcquitty biclust_spectral hclust spearman complete spec_cos spec_man kmeans hclust canberra kendall median spec_mink spec_max spec_euc affprop maximum hclust canberra average mspec_mink mspec_man spec_canb kmeans spearman kmeans manhattan mspec_max mspec_canb mspec_cos mspec_euc kmeans hclust binary canberra centroid hclust hclust hclust spearman kendall spearman kendall median single single average hclust spearman hclust median centroid kendall mcquitty mcquitty hclust canberra centroid hclust kendall complete hclust hclust hclust euclidean kmedoids single affprop manhattan manhattan hclust hclust manhattan euclidean manhattan median centroid divisive median single average hclust manhattan hclust maximum hclust euclidean single hclust manhattan euclidean centroid mcquitty average clust_convex hclust correlation ward hclust euclidean kmedoids mcquitty euclidean hclust pearson kmedoids wardstand.euc hclust maximum divisive hclust euclidean median centroid maximum affprop average euclidean hclust canberra mcquitty hclust hclust euclidean maximum complete complete hclust manhattan hclust maximum complete mcquitty dist_minkowski dist_ebinary dist_fbinary dist_binary dist_canb dist_max dist_cos dismea hclust manhattan ward affprop info.costs kmeans hclust euclidean euclidean ward hclust canberra complete sot_euc hclust binary ward hclust hclust kendall spearman ward ward hclust maximum ward kmeans binary kmeans maximum Gary King (Harvard, Gov Dept) The Data Revolution 16 / 24

In Sample Illustration of Partisan Taunting Taunting ruins deliberation - Senator Lautenberg Blasts Republicans as Chicken Hawks [Government Oversight] Sen. Lautenberg on Senate Floor 4/29/04 Gary King (Harvard, Gov Dept) The Data Revolution 17 / 24

Out of Sample Confirmation of Partisan Taunting - Discovered using 200 press releases; 1 senator. Gary King (Harvard, Gov Dept) The Data Revolution 18 / 24

Out of Sample Confirmation of Partisan Taunting - Discovered using 200 press releases; 1 senator. - Confirmed using 64,033 press releases; 301 senator-years. - Apply supervised learning method: measure proportion of press releases a senator taunts other party Frequency 10 20 30 0.1 0.2 0.3 0.4 0.5 Prop. of Press Releases Taunting Gary King (Harvard, Gov Dept) The Data Revolution 19 / 24

Normative Implications of Taunting Partisan taunting: Very common Makes deliberation less likely Occurs more often in homogeneously partisan districts (i.e., when preaching to the choir) Incompatibility of the principles of representative democracy To get reflection: Homogeneous (noncompetitive) districts To get deliberation (no taunting): Heterogeneous (competitive) districts you can t have both! Gary King (Harvard, Gov Dept) The Data Revolution 20 / 24

Some New Data Types 1 Unstructured text: emails (1 LOC every 10 minutes), speeches, government reports, blogs, social media updates, web pages, newspapers, scholarly literature 2 Commercial activity: credit cards, sales data, and real estate transactions, product RFIDs 3 Geographic location: cell phones, Fastlane or EZPass transponders, garage cameras 4 Health information: digital medical records, hospital admittances, google/ms health, and accelerometers and other devices being included in cell phones 5 Biological sciences: effectively becoming social sciences as genomics, proteomics, metabolomics, and brain imaging produce huge numbers of person-level variables. 6 Satellite imagery: increasing in scope, resolution, and availability. 7 Electoral activity: ballot images, precinct-level results, individual-level registration, primary participation, and campaign contributions Gary King (Harvard, Gov Dept) The Data Revolution 21 / 24

Some More New Data Examples 8 Social media: facebook, twitter, social bookmarking, blog comments, product reviews, virtual worlds, game behavior, crowd sourcing 9 Web surfing artifacts: clicks, searches, and advertising clickthroughs. (Google collects 1 petabyte/72 minutes on human behavior!) 10 Multiplayer web games and virtual worlds: Billions of highly controlled experiments on human behavior 11 Government bureaucracies: moving from paper to electronic data bases, increasing availability 12 Governmental policies: requiring more data collection, such e.g., No Child Left Behind Act ; allowing randomized policy experiments; Obama pushing data distribution 13 Scholarly data: the replication movement in academia, led in part by political science, is massively increasing data sharing Gary King (Harvard, Gov Dept) The Data Revolution 22 / 24

Enormous Emerging Opportunities for Social Scientists For the first time: technologies, policies, data, and methods are making it feasible to attack some of the most vexing problems that afflict human society A massive change from studying problems to understanding and solving problems Opportunities require a change in our job descriptions, with new: 1 Computer-assisted methods: Traditional quantitative-only or qualitative-only approaches are infeasible 2 Large-scale, interdisciplinary, collaborative research 3 New statistical methods & engineering required 4 Better theory: to respond to massive new evidence, privacy challenges, data-driven science And then there s you & me: In most legislatures, courts, academic departments,..., change comes from replacement not conversion Will we wait to be replaced? or put in the effort to convert and learn how to use the new information? Gary King (Harvard, Gov Dept) The Data Revolution 23 / 24

For more information http://gking.harvard.edu Gary King (Harvard, Gov Dept) The Data Revolution 24 / 24