The Social Science Data Revolution Gary King Department of Government Harvard University (Horizons in Political Science talk, Government Department, Harvard University, 3/30/11) Gary King (Harvard, Gov Dept) The Data Revolution 1 / 24
The Changing Evidence Base of Social Science Research The Last 50 Years: Survey research Aggregate government statistics In depth studies of individual places, people, or events The Next 50 Years: Spectacular increases in new data sources, due to... Much more of the above improved, expanded, and applied Shrinking computers & the growing Internet: data everywhere The replication movement: academic data sharing (e.g., Dataverse) Governments encouraging data collection, distribution, experimentation (e.g., GovData) Advances in statistical methods, informatics, & software The march of quantification: through academia, professions, government, & commerce (SuperCrunchers, The Numerati, MoneyBall) Gary King (Harvard, Gov Dept) The Data Revolution 2 / 24
Examples of what s now possible Opinions of activists: A few thousand interviews millions of political opinions in social media posts (1B tweets/week) Exercise: A survey: How many times did you exercise last week? 500K people carrying cell phones with accelerometers Social contacts: A survey: Please tell me your 5 best friends continuous record of phone calls, emails, text messages, bluetooth, social media connections, electronic address books Economic development in developing countries: Dubious or nonexistent governmental statistics satellite images of human-generated light at night, or networks of roads and other infrastructure Expert-vs-Statistician contests: Whenever enough information is quantified (& a right answer exists), stats wins every time Many, many more... Gary King (Harvard, Gov Dept) The Data Revolution 3 / 24
How to make progress in the new data-rich world? 1 Computer-assisted methods: Traditional quantitative-only or qualitative-only approaches are infeasible 2 Large-scale, interdisciplinary, collaborative research 3 New statistical methods & engineering required 4 Better theory: to respond to massive new evidence, privacy challenges, data-driven science Bigger changes in the practice of social science then ever before Gary King (Harvard, Gov Dept) The Data Revolution 4 / 24
Two Examples of Automated Text Analysis Gary King (Harvard, Gov Dept) The Data Revolution 5 / 24
Example 1: How to Read a Billion Blog Posts (& Classify Deaths without Physicians) Daniel Hopkins and Gary King. Extracting Systematic Social Science Meaning from Text AJPS, commercialized via: Gary King and Ying Lu. Verbal Autopsy Methods with Multiple Causes of Death, Statistical Science used by (among others): Gary King (Harvard, Gov Dept) The Data Revolution 6 / 24
Data and Quantities of Interest Input Data: Large set of text documents (e.g., all English language blog posts) Categories (posts about US candidates): extremely negative, negative, neutral, positive, extremely positive, no opinion, not a blog A small training set of documents hand-coded into the categories Quantities of interest Computer science: individual document classifications (spam filters, Google searches) Social Science: proportion in each category (proportion of email which is spam; proportion extremely negative comment about Pres Bush) Estimation Can get the 2nd by counting the 1st (if 1st is accurate) High classification accuracy unbiased category proportions 70% classification accuracy is high disaster for category proportions New methodology: unbiased category proportions, even when classification accuracy is low Gary King (Harvard, Gov Dept) The Data Revolution 7 / 24
Out-of-sample Comparison: 60 Seconds vs. 8.7 Days Affect in Blogs Estimated P(D) 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 Actual P(D) Gary King (Harvard, Gov Dept) The Data Revolution 8 / 24
Reactions to John Kerry s Botched Joke You know, education if you make the most of it... you can do well. If you don t, you get stuck in Iraq. Affect Towards John Kerry Proportion 0.0 0.1 0.2 0.3 0.4 0.5 0.6 1 1 2 0 2 Sept Oct Nov Dec Jan Feb Mar 2006 2007 Gary King (Harvard, Gov Dept) The Data Revolution 9 / 24
Example 2: Computer-Assisted Reading Justin Grimmer and Gary King. 2011. General-Purpose Clustering and Conceptualization Proceedings of the National Academy of Sciences. Conceptualization through Classification: one of the most central and generic of all our conceptual exercises.... Without classification, there could be no advanced conceptualization, reasoning, language, data analysis or,for that matter, social science research. (Bailey, 1994). Cluster Analysis: simultaneously (1) invents categories and (2) assigns documents to categories Gary King (Harvard, Gov Dept) The Data Revolution 10 / 24
What s Hard about Clustering? (Why Johnny Can t Classify) Goal: Computer-assisted conceptualization & clustering Bell(n) = number of ways of partitioning n objects Bell(2) = 2 (AB, A B) Bell(3) = 5 (ABC, AB C, A BC, AC B, A B C) Bell(5) = 52 Bell(100) 10 28 Number of elementary particles in the universe Now imagine choosing the optimal classification scheme by hand! Available compromises pursue different goals: Standard Approach: Fully automated methods no method works well in general; impossible to know which to apply! Our Approach: Computer-assisted methods You, not some computer algorithm, decides what s important, but with help Gary King (Harvard, Gov Dept) The Data Revolution 11 / 24
Switch from Fully Automated to Computer Assisted Computer-Assisted Clustering Easy in theory: list all clusterings; choose the best Impossible in practice: Too hard for us mere humans! An organized list will make the search possible Insight: Many clusterings are perceptually identical E.g.,: consider two clusterings that differ only because one document (of 10,000) moves from category 5 to 6 Question: How to organize clusterings so humans can understand? Gary King (Harvard, Gov Dept) The Data Revolution 12 / 24
How to Zoom Out while Reading You choose one (or more) clustering, based on insight, discovery, useful information,... Ford Obama Clustering 1 Space of Clusterings mixvmf Clustering 2 Nixon kmedoids stand.euc affprop info.costs Carter Johnson Carter Eisenhower Truman kmeans correlation rock hclust correlation single hclust pearson single affprop maximum Ford Eisenhower Truman Roosevelt Johnson Kennedy Clinton Roosevelt ``Reagan Republicans'' ``Other Presidents '' Bush kmeans binary hclust maximum single hclust canberra binary hclust centroid correlation pearson hclust median centroid correlation median hclust correlation pearson mcquitty average hclust kendall single hclust euclidean centroid kmeans hclust kendall canberra mcquitty hclust canberra binary median affprop biclust_spectral manhattan cosine mspec_max hclust canberra binary single mspec_canb hclust canberra average kmeans pearson hclust binary average divisive stand.euc mspec_cos som hclust manhattan centroid hclust kmedoids spearman maximum manhattan kendall manhattan centroid median single hclust euclidean median hclust correlation pearson complete hclust hclust manhattan spearman kendall maximum euclidean average median complete mcquitty average single affprop hclust euclidean manhattan hclust mcquitty average euclidean average hclust spearman single divisive euclidean kmedoids euclidean hclust spearman average mspec_mink mspec_euc hclust binary complete mcquitty divisive manhattan mspec_man hclust euclidean complete mcquitty hclust kendall complete hclust correlation hclust canberra ward complete clust_convex hclust spearman kendall mcquitty hclust euclidean dismea ward hclust binary ward hclust canberra ward hclust spearman complete hclust manhattan complete spec_canb hclust kendall ward mixvmfva spec_cos hclust manhattan ward spec_euc spec_man kmeans euclidean kmeans manhattan hclust pearson ward hclust spearman ward kmeans spearman spec_max hclust maximum ward kmeans maximum spec_mink Nixon ``Roosevelt To Carter'' Kennedy Bush Obama `` Reagan To Obama '' Reagan HWBush kmeans canberra Clinton HWBush mult_dirproc Reagan Gary King (Harvard, Gov Dept) The Data Revolution 13 / 24
Evaluation: More Informative Discoveries Found 2 scholars analyzing lots of textual data for their work Created 6 clusterings: 2 clusterings selected with our method (biased against us) 2 clusterings from each of 2 other methods (varying tuning parameters) Created info packet on each clustering (for each cluster: exemplar document, automated content summary) Asked for ( 6 2) =15 pairwise comparisons User chooses only care about the one clustering that wins Both cases a Condorcet winner: Immigration : Our Method 1 vmf 1 vmf 2 Our Method 2 K-Means 1 K-Means 2 Genetic testing : Our Method 1 {Our Method 2, K-Means 1, K-means 2} Dir Proc. 1 Dir Proc. 2 Gary King (Harvard, Gov Dept) The Data Revolution 14 / 24
Evaluation: What Do Members of Congress Do? Gary King (Harvard, Gov Dept) The Data Revolution 15 / 24
Example Discovery mult_dirproc hclust canberra ward kmeans correlation sot_cor divisive stand.euc mixvmf hclust correlation mixvmfva mcquitty hclust binary complete hclust pearson single affprop cosine hclust pearson median hclust correlation single hclust pearson mcquitty mec hclust pearson average hclust correlation complete hclust correlation median hclust binary single hclust correlation average hclust pearson complete hclust binary average kmeans pearson hclust pearson correlation centroid centroid rock som hclust binary median hclust canberra single hclust binary mcquitty biclust_spectral hclust spearman complete spec_cos spec_man kmeans hclust canberra kendall median spec_mink spec_max spec_euc affprop maximum hclust canberra average mspec_mink mspec_man spec_canb kmeans spearman kmeans manhattan mspec_max mspec_canb mspec_cos mspec_euc kmeans hclust binary canberra centroid hclust hclust hclust spearman kendall spearman kendall median single single average hclust spearman hclust median centroid kendall mcquitty mcquitty hclust canberra centroid hclust kendall complete hclust hclust hclust euclidean kmedoids single affprop manhattan manhattan hclust hclust manhattan euclidean manhattan median centroid divisive median single average hclust manhattan hclust maximum hclust euclidean single hclust manhattan euclidean centroid mcquitty average clust_convex hclust correlation ward hclust euclidean kmedoids mcquitty euclidean hclust pearson kmedoids wardstand.euc hclust maximum divisive hclust euclidean median centroid maximum affprop average euclidean hclust canberra mcquitty hclust hclust euclidean maximum complete complete hclust manhattan hclust maximum complete mcquitty dist_minkowski dist_ebinary dist_fbinary dist_binary dist_canb dist_max dist_cos dismea hclust manhattan ward affprop info.costs kmeans hclust euclidean euclidean ward hclust canberra complete sot_euc hclust binary ward hclust hclust kendall spearman ward ward hclust maximum ward kmeans binary kmeans maximum Gary King (Harvard, Gov Dept) The Data Revolution 16 / 24
In Sample Illustration of Partisan Taunting Taunting ruins deliberation - Senator Lautenberg Blasts Republicans as Chicken Hawks [Government Oversight] Sen. Lautenberg on Senate Floor 4/29/04 Gary King (Harvard, Gov Dept) The Data Revolution 17 / 24
Out of Sample Confirmation of Partisan Taunting - Discovered using 200 press releases; 1 senator. Gary King (Harvard, Gov Dept) The Data Revolution 18 / 24
Out of Sample Confirmation of Partisan Taunting - Discovered using 200 press releases; 1 senator. - Confirmed using 64,033 press releases; 301 senator-years. - Apply supervised learning method: measure proportion of press releases a senator taunts other party Frequency 10 20 30 0.1 0.2 0.3 0.4 0.5 Prop. of Press Releases Taunting Gary King (Harvard, Gov Dept) The Data Revolution 19 / 24
Normative Implications of Taunting Partisan taunting: Very common Makes deliberation less likely Occurs more often in homogeneously partisan districts (i.e., when preaching to the choir) Incompatibility of the principles of representative democracy To get reflection: Homogeneous (noncompetitive) districts To get deliberation (no taunting): Heterogeneous (competitive) districts you can t have both! Gary King (Harvard, Gov Dept) The Data Revolution 20 / 24
Some New Data Types 1 Unstructured text: emails (1 LOC every 10 minutes), speeches, government reports, blogs, social media updates, web pages, newspapers, scholarly literature 2 Commercial activity: credit cards, sales data, and real estate transactions, product RFIDs 3 Geographic location: cell phones, Fastlane or EZPass transponders, garage cameras 4 Health information: digital medical records, hospital admittances, google/ms health, and accelerometers and other devices being included in cell phones 5 Biological sciences: effectively becoming social sciences as genomics, proteomics, metabolomics, and brain imaging produce huge numbers of person-level variables. 6 Satellite imagery: increasing in scope, resolution, and availability. 7 Electoral activity: ballot images, precinct-level results, individual-level registration, primary participation, and campaign contributions Gary King (Harvard, Gov Dept) The Data Revolution 21 / 24
Some More New Data Examples 8 Social media: facebook, twitter, social bookmarking, blog comments, product reviews, virtual worlds, game behavior, crowd sourcing 9 Web surfing artifacts: clicks, searches, and advertising clickthroughs. (Google collects 1 petabyte/72 minutes on human behavior!) 10 Multiplayer web games and virtual worlds: Billions of highly controlled experiments on human behavior 11 Government bureaucracies: moving from paper to electronic data bases, increasing availability 12 Governmental policies: requiring more data collection, such e.g., No Child Left Behind Act ; allowing randomized policy experiments; Obama pushing data distribution 13 Scholarly data: the replication movement in academia, led in part by political science, is massively increasing data sharing Gary King (Harvard, Gov Dept) The Data Revolution 22 / 24
Enormous Emerging Opportunities for Social Scientists For the first time: technologies, policies, data, and methods are making it feasible to attack some of the most vexing problems that afflict human society A massive change from studying problems to understanding and solving problems Opportunities require a change in our job descriptions, with new: 1 Computer-assisted methods: Traditional quantitative-only or qualitative-only approaches are infeasible 2 Large-scale, interdisciplinary, collaborative research 3 New statistical methods & engineering required 4 Better theory: to respond to massive new evidence, privacy challenges, data-driven science And then there s you & me: In most legislatures, courts, academic departments,..., change comes from replacement not conversion Will we wait to be replaced? or put in the effort to convert and learn how to use the new information? Gary King (Harvard, Gov Dept) The Data Revolution 23 / 24
For more information http://gking.harvard.edu Gary King (Harvard, Gov Dept) The Data Revolution 24 / 24