New Frontiers of Automated Content Analysis in the Social Sciences




Symposium on the New Frontiers of Automated Content Analysis in the Social Sciences
University of Zurich, July 1-3, 2015
www.aca-zurich-2015.org

Abstract

Automated Content Analysis (ACA) is one of the key fields of methodological innovation in the social sciences, not least because of the growing need to analyze an increasing number of digitally available text collections. Our goal is to bring together computational linguists and social scientists in order to improve the dialogue between the two research communities and to exploit mutual benefits for the advancement of ACA in the social sciences. More precisely, our program pairs social scientists and computational linguists in thematically coherent sessions on event analysis, trend identification, text classification, text scaling and sentiment detection. This setup should give social scientists insight into the sophisticated methodological instruments of computational linguistics, which they can use to enhance their analyses. Computational linguists, in turn, have the opportunity to apply their concepts and instruments to the vast array of research questions debated in the social sciences.

The conference is jointly organized by the Swiss National Center of Competence in Research (NCCR) Democracy, the Stein Rokkan Chair of the European University Institute, and the Department of Political Science at the University of Zurich.

Organisers

Prof. Gerold Schneider and Dr. des. Bruno Wueest (NCCR Democracy)
Prof. Hanspeter Kriesi (European University Institute)
Prof. Silja Häusermann (University of Zurich)

Speakers and topics

A: Social sciences
B: Computational linguistics

Introductory keynote (July 1, 17:00-18:00, UZH main building, KOL-H-317)
B. Kathleen McKeown

Session 1 Extracting Complex Relational Data
Chair: Jasmine Lorenzini
A. Alexander Hanna and Pamela Oliver: Automated Coding of Protest Event Data
B. Peter Makarov and Klaus Rothenhäusler: Towards Automated Protest Event Analysis

Session 2 Retrieving Events from Large-Scale Data
Chair: Bruno Wueest
A. Wouter van Atteveldt: Using grammatical clauses for social and semantic network analysis
B. Vasileios Lampos: Extracting Interesting Concepts from Large-Scale Textual Data

Session 3 Trend Identification
Chair: Swen Hutter
A. Bruno Wueest: Taking care of time dependency and theoretical mismatch in topic models of political attention
B. Michael Amsler and Gerold Schneider: Data-Driven and Linguistically Motivated Trend Identification

Session 4 Enhancing Text Classification
Chair: Silja Häusermann
A. Nils Weidmann and Mihai Croicu: Improving the Selection of News Reports for Event Coding Using Ensemble Classification
B. Jordan Boyd-Graber: Interactive Topic Modeling for Labeling and Making Sense of Large Corpora

Session 5 Data-driven vs. Annotation-driven Text Mining
Chair: Thomas Kurer
A. Martin Wettstein and Werner Wirth: Semi-automated content analysis of news texts
B. Andrew Salway: Some possibilities and limits of data-driven content analysis

Session 6 Actor-level Sentiment
Chair: Gerold Schneider
A. Martin Haselmayer and Marcelo Jenny: Dictionary-based Sentiment Analysis with Crowdcoding
B. Jochen Leidner: A Critical Analysis of Sentiment Analysis

Session 7 Text Scaling / Document-level sentiment
Chair: Hanspeter Kriesi
A. Will Lowe: Scaling things we can count
B. Ralf Steinberger: Observing trends in multilingual media analysis

Closing keynote (July 3, 16:30-17:30, UZH main building, KOL-H-317)
A. Justin Grimmer

Outline

Although ACA is on the verge of becoming a standard tool for social scientists, scholars still dispute its promises and pitfalls. Existing approaches to analyzing unstructured text data, developed mainly in computational linguistics, therefore need to be amended and adapted to the specific requirements of social scientific studies. To achieve this, we bring together computational linguists and social scientists who share an interest in the analysis of large-scale text data.

The conference is structured into seven thematic sessions, framed by an introductory and a closing keynote speech. Prof. Kathleen McKeown will give the introductory talk from the perspective of computational linguistics. Her presentation will focus on the potential of computational linguistics for the social sciences. While possible applications seem abundant, integrating computational linguistic approaches into social scientific research frameworks may well pose major challenges. The closing talk will be given by Prof. Justin Grimmer. He will sum up the most important findings of the conference and give an outlook on the most likely advancements in the social scientific application of ACA in the near future.

Session 1 Extracting Complex Relational Data

Events such as the eruption of political protest or hostilities in armed conflicts are the unit of enquiry of many social scientific analyses. Obviously, the conceptual and operational specifications of what constitutes an event vary significantly. What all event analyses have in common, however, is that a combination of several individual indicators is necessary to specify an event. On the most basic level, events are usually defined by the relation of an action, a date, and a location.
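The basic definition above, an action tied to a date and a location, can be pictured as a small record type. The following Python sketch is only an illustration under our own assumptions: the class name `Event`, its field names and the example values are hypothetical and are not taken from any of the systems presented at the symposium.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Event:
    """Hypothetical minimal event record: an action tied to a date and a place."""
    action: str
    day: date
    location: str
    # Further indicators are optional in this sketch.
    actor: Optional[str] = None
    goal: Optional[str] = None

    def key(self) -> tuple:
        """The three basic indicators that jointly identify the event."""
        return (self.action, self.day, self.location)


# Invented example values, for illustration only.
march = Event("demonstration", date(2015, 7, 1), "Zurich", actor="students")
same_march = Event("demonstration", date(2015, 7, 1), "Zurich")
strike = Event("strike", date(2015, 7, 1), "Zurich")
```

In this sketch, `march` and `same_march` share the same `key()` although only one of them names an actor, whereas `strike` differs in its action and therefore yields a different key.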
When working with large-scale text data, this relation mining task of linking the single indicators to an event is far from trivial, especially since further indicators such as the goals of the action and the actors involved are frequently added. Hence, one of the major challenges of automated event analysis is to build models that can extract events defined as compounds of single indicators. We have invited two research teams (Hanna and Oliver; Makarov and Rothenhäusler), who will report on their progress in dealing with this challenge. Both teams are in the process of creating a system for the automated recognition of political protest events, the former dealing with protest events in the US and the latter in Europe.

Session 2 Retrieving Events from Large-Scale Data

The output that social scientists need from event analyses is information on the actual occurrence of events, and not only the number of mentions of these events in the data. The insights on the recognition of events in session 1 thus have to be enriched with approaches to aggregating the single event instances found in the data. Indeed, if event data is retrieved largely through automated procedures, two challenging problems arise. First, an aggregative model needs to be able to distinguish between reports belonging to the same event and reports covering different events. Second, there is almost certainly bias in how frequently the data source contains information on particular events. The most pressing issue here is how these biases can be assessed and controlled for. This session includes two presentations (van Atteveldt; Lampos) approaching such questions from different perspectives.

Session 3 Trend Identification

This session deals with models for exploring corpora in which documents have a sequential order. Agenda research, i.e. the study of attention to political topics over time, is a prominent research area in the social sciences where such corpora are used. Serial correlation in these corpora can be both a curse and a blessing. On the one hand, time-specific dynamics in textual data can be used directly to identify trends. On the other hand, the general evolution of language over time needs to be taken into account in studies measuring time-invariant concepts such as topic categories. Although this may complicate the tracking of topics, short-term linguistic changes, particularly the introduction of new terms and multi-word units, can equally serve as a useful instrument.
The presenters in this session (Gilardi, Wueest and Giovanoli; Amsler and Schneider) will thus propose and evaluate approaches that deal with trends in different ways.
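One simple way to picture how newly introduced terms can signal a trend is to compare relative term frequencies between an earlier and a later time slice of a corpus. The Python sketch below is only an illustration under our own assumptions: the function names, the whitespace tokenization, the smoothing constant and the toy texts are hypothetical, and it is not the approach of any presenter in this session.

```python
from collections import Counter


def relative_freq(docs):
    """Relative frequency of each lowercased token across a list of texts."""
    counts = Counter(tok for doc in docs for tok in doc.lower().split())
    total = sum(counts.values())
    return {tok: n / total for tok, n in counts.items()}


def rising_terms(earlier_docs, later_docs, smoothing=1e-6):
    """Rank terms of the later slice by later/earlier relative frequency."""
    before = relative_freq(earlier_docs)
    after = relative_freq(later_docs)
    # The smoothing constant (an assumption) avoids division by zero
    # for terms that do not occur in the earlier slice at all.
    ratios = {
        tok: freq / (before.get(tok, 0.0) + smoothing)
        for tok, freq in after.items()
    }
    return sorted(ratios, key=ratios.get, reverse=True)


# Toy corpus, invented for illustration only.
earlier = ["tax reform debate", "tax policy debate"]
later = ["migration debate", "migration quota debate"]
ranked = rising_terms(earlier, later)
```

Terms that are absent from the earlier slice rise to the top of `ranked`, while a term that is common in both slices, such as "debate" in the toy corpus, ends up at the bottom. A realistic application would also need proper tokenization, multi-word units and a treatment of serial correlation, all of which this sketch leaves out.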

Session 4 Enhancing Text Classification

This session will present and discuss new approaches to classifying textual content. Classification is one of the most frequent tasks in content analysis, in the social sciences as well. An important issue in this area of research is the frequent mismatch between the researcher's theoretical expectations and the results of unsupervised text classifications. While inductively generated text classifications are statistically sound, they often deviate considerably from the researchers' conception of the structure of the data. Supervised classifications, in contrast, may suffer from poor predictive robustness if the classes strongly confound the statistical properties of the data. The first presentation by Boyd-Graber and Hu discusses a specific model that reconciles the potential conflict between theoretical expectations and statistical predictions. The second presentation by Weidmann and Croicu presents an application and extensive evaluation of a supervised classification on a large newswire corpus.

Session 5 Data-driven vs. Annotation-driven Text Mining

The participants of this session (Wettstein and Wirth; Salway) are invited to engage with a fundamental question of content analysis in the social sciences, namely whether we should approach text mining in a deductive or rather an inductive way. Most social scientists expect manual approaches to the quantification of content to remain indispensable for some tasks, at least in the near future. The question thus arises whether and how computational models can support human-generated data collections. An opposite perspective argues for a largely data-driven content analysis. The idea here is to automatically augment representations of text content until results come close to the concepts social scientists want to explore. We expect that the comparison of these two perspectives will lead to a particularly fruitful exchange on the possibilities and constraints of automated content analysis.
Session 6 Actor-level Sentiment

The identification of tonality in language is essential for many social scientific research questions, above all for analyses of political rhetoric and discourse. For many such applications, however, sentiment measures are only valuable if they can be attributed to political actors. In most cases, this involves the detection of sentiment at the level of statements and a model relating this sentiment to the speakers communicating them. Among the pressing questions for this session are thus a) how tonality can be measured at the level of single statements such as sentences and speech acts and b) how this tonality can be related to speakers and addressees so that information on the intensity of political conflict can be generated. Presenters in this session are Haselmayer and Jenny as well as (tba).

Session 7 Text Scaling / Document-level sentiment

Research in the social sciences has already brought forward an impressive array of approaches that aim to locate text on latent scales such as ideological dimensions or document-level sentiment. These efforts have developed largely independently of similar advances in computational linguistics, which means the potential for interdisciplinary exchange seems especially large in this area. The presentation by Lowe will cover the most recent advancements in this field from the social scientific perspective. Ralf Steinberger will complement the session by showing how computational linguists generalize such approaches to the study of trends over time, across different languages, and in different media.
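To make the idea of locating a text on a scale concrete, the following Python sketch assigns a document the mean score of those of its words that appear in a scored word list. It is only an illustration under our own assumptions: the function name `document_score`, the whitespace tokenization and the toy word scores are hypothetical, and the sketch is not the method of any presenter in this session.

```python
def document_score(text, word_scores):
    """Mean score of the tokens that appear in word_scores; None if none do."""
    tokens = text.lower().split()
    scored = [word_scores[tok] for tok in tokens if tok in word_scores]
    if not scored:
        return None
    return sum(scored) / len(scored)


# Invented scores on a -1 (negative) to +1 (positive) scale, illustration only.
toy_scores = {"crisis": -1.0, "failure": -1.0, "success": 1.0, "growth": 1.0}

negative_doc = document_score("The crisis was a failure", toy_scores)
mixed_doc = document_score("Growth after the crisis", toy_scores)
unscored_doc = document_score("Nothing relevant here", toy_scores)
```

With the toy scores, the first text lands at the negative end of the scale, the second in the middle, and the third receives no position because it contains no scored word. The same averaging scheme could in principle be used with scores on an ideological dimension instead of sentiment.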