New Frontiers of Automated Content Analysis in the Social Sciences

Symposium on the New Frontiers of Automated Content Analysis in the Social Sciences
University of Zurich
July 1-3, 2015
www.aca-zurich-2015.org

Abstract

Automated Content Analysis (ACA) is one of the key fields of methodological innovation in the social sciences, not least because there is a growing need to analyze the increasing number of digitally available text collections. Our goal is to bring together computational linguists and social scientists in order to improve the dialogue between the two research communities and to exploit mutual benefits for the advancement of ACA in the social sciences. More precisely, our program pairs social scientists and computational linguists in thematically coherent sessions on event analysis, trend identification, text classification, text scaling and sentiment detection. This setup should enable social scientists to gain insights into the sophisticated methodological instruments of computational linguistics in order to enhance their analyses. Computational linguists, in turn, have the opportunity to apply their concepts and instruments to the vast array of research questions debated in the social sciences.

The conference is jointly organized by the Swiss National Center of Competence in Research (NCCR) Democracy, the Stein Rokkan Chair of the European University Institute, and the Department of Political Science at the University of Zurich.

Organisers

Prof. Gerold Schneider and Dr. des. Bruno Wueest (NCCR Democracy)
Prof. Hanspeter Kriesi (European University Institute)
Prof. Silja Häusermann (University of Zurich)

Speakers and topics

A: Social Sciences
B: Computational linguistics

Introductory keynote (July 1, 17:00-18:00, UZH main building, KOL-H-317)
B. Kathleen McKeown

Session 1 Extracting Complex Relational Data
Chair: Jasmine Lorenzini
A. Alexander Hanna and Pamela Oliver: Automated Coding of Protest Event Data
B. Peter Makarov and Klaus Rothenhäusler: Towards Automated Protest Event Analysis

Session 2 Retrieving Events from Large-Scale Data
Chair: Bruno Wueest
A. Wouter van Atteveldt: Using grammatical clauses for social and semantic network analysis
B. Vasileios Lampos: Extracting Interesting Concepts from Large-Scale Textual Data

Session 3 Trend Identification
Chair: Swen Hutter
A. Bruno Wueest: Taking care of time dependency and theoretical mismatch in topic models of political attention
B. Michael Amsler and Gerold Schneider: Data-Driven and Linguistically Motivated Trend Identification

Session 4 Enhancing Text Classification
Chair: Silja Häusermann
A. Nils Weidmann and Mihai Croicu: Improving the Selection of News Reports for Event Coding Using Ensemble Classification
B. Jordan Boyd-Graber: Interactive Topic Modeling for Labeling and Making Sense of Large Corpora

Session 5 Data-driven vs. Annotation-driven Text Mining
Chair: Thomas Kurer
A. Martin Wettstein and Werner Wirth: Semi-automated content analysis of news texts
B. Andrew Salway: Some possibilities and limits of data-driven content analysis

Session 6 Actor-level Sentiment
Chair: Gerold Schneider
A. Martin Haselmayer and Marcelo Jenny: Dictionary-based Sentiment Analysis with Crowdcoding
B. Jochen Leidner: A Critical Analysis of Sentiment Analysis

Session 7 Text Scaling / Document-level sentiment
Chair: Hanspeter Kriesi
A. Will Lowe: Scaling things we can count
B. Ralf Steinberger: Observing trends in multilingual media analysis

Closing keynote (July 3, 16:30-17:30, UZH main building, KOL-H-317)
A. Justin Grimmer

Outline

As much as ACA is on the verge of becoming a standard tool for social scientists, scholars still dispute its promises and pitfalls. Hence, existing approaches to analyzing unstructured text data, which were mainly developed in computational linguistics, need to be amended and adapted to the specific requirements of social scientific studies. To achieve this, we bring together computational linguists and social scientists who share an interest in the analysis of large-scale text data. The conference is structured into seven thematic sessions, which are accompanied by an introductory and a closing keynote speech.

Prof. Kathleen McKeown will give the introductory talk from the perspective of computational linguistics. Her presentation will focus on the potential of computational linguistics for the social sciences. While possible applications seem abundant, there may well be major challenges for the integration of computational linguistic approaches into social scientific research frameworks. The closing talk will be given by Prof. Justin Grimmer. His talk will sum up the most important findings of the conference and give an outlook on the most likely advancements in the social scientific application of ACA in the near future.

Session 1 Extracting Complex Relational Data

Events such as the eruption of political protest or hostilities in armed conflicts are the unit of enquiry of many social scientific analyses. Obviously, the conceptual and operational specifications of what constitutes an event vary significantly. However, what all event analyses have in common is that a combination of several individual indicators is necessary to specify an event. On the most basic level, events are usually defined by the relation of an action, a date, and a location.
When working with large-scale text data, this relation-mining task of linking the individual indicators to an event is far from trivial, especially since further indicators such as the goals of the action and the actors involved are frequently added. Hence, one of the major challenges of automated event analyses is to generate models that allow one to extract events defined as compounds of single indicators. We have invited two research teams (Hanna and Oliver; Makarov and Rothenhäusler), who will report on their progress in dealing with this challenge. Both teams are in the process of creating a system for the automated recognition of political protest events, the former dealing with protest events in the US and the latter in Europe.

Session 2 Retrieving Events from Large-Scale Data

The output that social scientists need from event analyses is information on the actual occurrence of events, and not only the number of mentions of these events in the data. The insights on the recognition of events in session 1 thus have to be enriched with approaches to aggregating the single event instances found in the data. Indeed, if event data is retrieved largely through automated procedures, two challenging problems for the retrieval of events arise. First, an aggregative model needs to be able to distinguish between reports belonging to the same event and reports covering different events. Second, there most certainly is bias in how frequently the data source contains information on particular events. The most pressing issue here is how these biases can be assessed and controlled for. This session includes two presentations (van Atteveldt; Lampos) approaching such questions from different perspectives.

Session 3 Trend Identification

This session deals with models to explore corpora in which documents have a sequential order. Agenda research, i.e. the study of attention to political topics over time, is a prominent research area in the social sciences where such corpora are used. Serial correlation in these corpora can be both a curse and a blessing. On the one hand, time-specific dynamics in textual data can be directly used to identify trends. On the other hand, the general evolution of language over time needs to be taken into account in studies measuring time-invariant concepts such as topic categories. While this may complicate tracking topics, short-term linguistic changes, particularly the introduction of new terms and multi-word units, can equally serve as a useful instrument. The presenters in this session (Gilardi, Wueest and Giovanoli; Amsler and Schneider) will thus propose and evaluate approaches that deal with trends in different ways.

Session 4 Enhancing Text Classification

This session will present and discuss new approaches to classifying textual content. Classification is one of the most frequent tasks in content analysis, including in the social sciences. An important issue in this area of research is the frequent mismatch between the researcher's theoretical expectations and the results of unsupervised text classifications. While inductively generated text classifications are statistically sound, they often deviate considerably from the researcher's conception of the structure of the data. Supervised classifications, in contrast, may suffer from poor predictive robustness if the classes strongly confound the statistical properties of the data. The first presentation by Boyd-Graber and Hu discusses a specific model that reconciles the potential conflict between theoretical expectations and statistical predictions. The second presentation by Weidmann and Croicu presents an application and extensive evaluation of a supervised classification on a large newswire corpus.

Session 5 Data-driven vs. Annotation-driven Text Mining

The participants of this session (Wettstein and Wirth; Salway) are invited to engage with a fundamental question of content analysis in the social sciences, namely whether we should approach text mining in a deductive or rather an inductive way. Most social scientists expect manual approaches to the quantification of content to remain indispensable for some tasks, at least in the near future. The question thus arises whether and how computational models can support human-generated data collections. An opposing perspective argues for a largely data-driven content analysis. The idea here is to automatically augment representations of text content until results come close to the concepts social scientists want to explore. We expect that the comparison of these two perspectives will lead to a particularly fruitful exchange on the possibilities and constraints of automated content analysis.

Session 6 Actor-level Sentiment

The identification of tonality in language is essential for many social scientific research questions, above all for analyses of political rhetoric and discourse. For many such applications, however, sentiment measures are only valuable if they can be attributed to political actors. In most cases, this involves the detection of sentiment at the level of statements and a model relating this sentiment to the speakers communicating them. Among the pressing questions for this session are thus a) how tonality can be measured at the level of single statements such as sentences and speech acts and b) how this tonality can be related to speakers and addressees so that information on the intensity of political conflict can be generated. Presenters in this session are Haselmayer and Jenny as well as (tba).

Session 7 Text Scaling / Document-level sentiment

Research in the social sciences has already brought forward an impressive array of approaches that aim to locate text on latent scales such as ideological dimensions or document-level sentiment. These efforts have developed largely independently from similar advances in computational linguistics, which means the potential for an interdisciplinary exchange seems especially large in this area. The presentation by Lowe will present the most recent advancements in this field from the social scientific perspective. Ralf Steinberger will complement the session by showing how computational linguists generalize such approaches to the study of trends over time, across different languages, and in different media.