Sentiment Analysis and Opinion Mining in Collections of Qualitative Data

Sergej Zerr, Nam Khanh Tran, Kerstin Bischoff, and Claudia Niederée
Leibniz Universität Hannover / Forschungszentrum L3S, Hannover, Germany
zerr@l3s.de, NTran@L3S.de, bischoff@l3s.de, niederee@l3s.de

Abstract. In the social sciences, a tremendous body of data is collected by observing or interviewing people. Such qualitative data forms a valuable source for later secondary research. One major challenge, though, is preserving the privacy of the interviewees even after long periods of archival storage. Modern sentiment analysis techniques could help to judge the sensitivity of particular textual content and help the data provider to keep sensitive data away from unauthorized eyes, thus reducing manual processing of large collections of primary material. Besides, mining opinions enables enhanced data access, e.g., by finding negative attitudes about a topic. In this paper we describe properties of qualitative social science data with respect to sentiment analysis. We compare it to datasets used in the literature, identify the main challenges, and provide directions for solving them. By discussing how to exploit state-of-the-art techniques for the (secondary) exploration of archived qualitative data, we hope to foster interdisciplinary dialogue.

Keywords: digital humanities, qualitative data, sentiment analysis

1 Introduction

The Sociological Research Institute (SOFI) in Göttingen (Germany) carried out a number of studies observing the working situation in the German automobile and shipyard industries after the rapid economic growth in post-World War II Germany, the so-called German economic miracle. Findings of these studies had a significant impact on the working situation in German industry. Intelligent access to this data would turn the data collection into a valuable source for secondary research, e.g., for longitudinal (meta-)analysis or historical investigations.
Within the scope of the project "Gute Arbeit" ("Good Work") we are developing tools that enable rich exploratory access to this data for secondary research. Reusing such sources is a challenging and time-consuming task, e.g., regarding the selection of an appropriate subset or capturing context. In addition, behind each document there is a particular person whose privacy needs to be respected by the data provider and the secondary analyst, and technically preserved by the data provider. First, modern sentiment analysis (or opinion mining) techniques could help to judge the sensitivity of a particular document, paragraph, or even sentence and help the data provider to keep extremely sensitive data away from unauthorized eyes. For example, a highly negative statement about one's own employer may be problematic when it can somehow be traced back to the interviewee, in particular once the interviewee has climbed up the hierarchy in the very same enterprise. Moreover, since the company is usually also assured non-disclosure, overly critical statements may be especially harmful (besides, of course, confidential information). Second, for the secondary researcher those techniques could help to automatically find passages with interesting points of view on a particular subject and reduce manual processing of large collections of primary material. For example, our project is interested in how people's concepts of "good work" evolved over the last decades. However, our literature analysis revealed that, due to the specificity of qualitative data, a straightforward application of state-of-the-art sentiment analysis tools is not always feasible, even after modification.

2 Data

Our corpus consists of qualitative data, in German, from studies of the Sociological Research Institute (SOFI) in Göttingen. The data consists of a variety of (case) studies, typically including worker interviews and observation at the workplace. It was collected within about 50 projects during a period of over 40 years, starting in the 1960s (e.g., the Volkswagen and German dockyard studies). One of the latest studies contains 41 interviews with individuals and groups of the vehicle manufacturing company Auto 5000, which was set up inside the Volkswagen complex in Wolfsburg, Germany, in 2001. This lower-cost model company was set up with the aim of keeping manufacturing jobs in Germany instead of moving production to other areas of Europe. Interviews cover, for example, the employment history of the formerly unemployed workers and engineers as well as topics like shift work, team work, or relations between regular Volkswagen and Auto 5000 employees.
For comparison and illustration, we also use an English dataset, namely the case study on "Changing Organizational Forms and the Re-shaping of Work" [1]. Each case (e.g., airlines, a ceramics manufacturer, hotel services) includes transcriptions or summaries of in-depth face-to-face interviews conducted in England and Scotland between 1999 and 2002. Participants were managers and employees at all levels, sometimes also union representatives. The examples below are taken from these interviews.

3 Related Work and Challenges

When dealing with qualitative interview data, we face the general challenges of sentiment analysis (see, e.g., [2]) but also find some peculiarities. For example, it is typically assumed that the subject (e.g., a YouTube video or a market item) is known and that the sentiment can be estimated quite well using simple vocabulary-based techniques. In our dataset, however, indirect sentiment expressions dominate, and the vocabulary is less explicit and considerably less aggressive than in the Web materials widely used in the literature. Instead, employees are dependent on their company and thus tend to express criticism rather subtly, deliberately decide not to mention certain problematic topics, or use reported speech. Often the sentiment can only be estimated after careful analysis of which aspects of the subject are highlighted, rather than from the adjectives used to describe them.

Fig. 1 summarizes the pattern structures we plan to detect in our dataset. The Object is in our case an interviewee who expresses specific opinions about a number of Subjects (also called opinion targets). A subject could be a person, a specific item such as a particular instrument, an abstract concept, or an event. Each subject receives an opinion expression, which can be either positive or negative (presented as +/-). In this section we identify and discuss some of the challenges to be faced when extracting the patterns described above.

Fig. 1: The structure of opinions

Detection of Subjective Expressions: User-generated content is the major data source in the literature on opinion mining. One property of such data is that a particular Web user is often hidden behind a virtual identity and behaves more freely than she would in real life. Generally, Web users are rarely concerned about careful selection of words and expressions. High precision in positive/negative sentiment analysis on such datasets is achieved not least due to explicit emotional adjectives (for example, "ugly", "idiot" vs. "perfect", "favorite" [3]). In our studies the interviews were recorded face-to-face and the sentiment is often obscured. In the following, we use example sentences extracted from our English dataset described in Section 2. There are a number of seemingly neutral expressions that actually carry a hidden positive or negative sentiment: "The text (company rules) says it should be achievable but again the reality, the experience from some people has been otherwise." Sometimes an expression only appears subjective with respect to its vocabulary without actually being so (here the term "good" does not carry sentiment value): "We are here to give them a service, clean their aircraft. It's got to have a good standard and quality of clean."

Subject Identification: State-of-the-art approaches typically assume that the document contains opinions on one main subject expressed by the author of the document (e.g., a product review or a YouTube video). In our case the subject(s) first have to be detected. For example, the interviewee in one document can express opinions about multiple subjects such as colleagues, boss, company, family, or government. Moreover, a subject may be complex and have different aspects. The author of [4] addressed the problem of target detection for French telephone surveys and forum entries by developing a grammar using linguistic patterns such as "target, state verb, adjective" (e.g., "My boss is great"). User opinions on events and the impact of opinions in the social Web over time were considered in [5]. Similarly, in our project we are interested in event descriptions and in analyzing the development of opinions over time.

Context Dependency: The expression "It was cold" would have completely different polarity in the contexts of skiing weather and restaurant food [6]. Similarly, in different cases the same terms may also differ with respect to their degree of sentiment. The latter was considered in [7].

Indirect Sentiment: A vocabulary with positive/negative examples alone would not be sufficient for judging opinions. Sometimes the opinion depends less on the terms used and more on the subject attributes being highlighted. Although direct and indirect attributes are distinguished in the literature [8], the impact of highlighting and omitting particular attributes has not yet been considered. In order to stay polite, people often speak mainly about positive aspects (e.g., of the work) even if they are less important.
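To make the limits of vocabulary-based techniques concrete, the following is a minimal sketch of a lexicon scorer with an optional context-specific lexicon. It is not the method of [6] or of any cited work; the lexicon entries, the numeric polarity values, and the names GENERAL_LEXICON, CONTEXT_LEXICA, and lexicon_score are hypothetical and chosen for illustration only.

```python
import re
from typing import Dict, Optional

# Hypothetical toy lexica; entries and weights are illustrative assumptions.
GENERAL_LEXICON: Dict[str, float] = {
    "good": 1.0, "great": 1.0, "perfect": 1.0,
    "ugly": -1.0, "idiot": -1.0,
}
# Assumed context-specific polarities for the "It was cold" example.
CONTEXT_LEXICA: Dict[str, Dict[str, float]] = {
    "skiing": {"cold": 0.5},
    "restaurant": {"cold": -1.0},
}


def lexicon_score(text: str, context: Optional[str] = None) -> float:
    """Sum the polarity of every known token; unknown tokens count as 0."""
    lexicon = dict(GENERAL_LEXICON)
    if context is not None:
        # Context entries override or extend the general lexicon.
        lexicon.update(CONTEXT_LEXICA.get(context, {}))
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum((lexicon.get(tok, 0.0) for tok in tokens), 0.0)


indirect = ("The text says it should be achievable but again the reality, "
            "the experience from some people has been otherwise.")
neutral = ("We are here to give them a service, clean their aircraft. "
           "It's got to have a good standard and quality of clean.")

print(lexicon_score("It was cold", context="skiing"))
print(lexicon_score("It was cold", context="restaurant"))
print(lexicon_score(indirect))  # hidden criticism is not detected
print(lexicon_score(neutral))   # "good" is counted although it is neutral here
```

With these toy entries, the same sentence receives opposite signs in the two contexts, the indirect criticism scores zero, and the neutral use of "good" is counted as positive. This mirrors the challenges named above and suggests why a plain lexicon is unlikely to suffice for interview data.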
Opinion Order: Although the expressions "The work is hard, but the salary is high" and "The salary is high, but the work is hard" share the same terms and the same topic, they are quite different in terms of sentiment.

4 Approach

Step 1 - Rich Annotation Editor: In contrast to most datasets used in the literature, our dataset lacks any definite features, like favorite assignments or (dis-)likes in Web 2.0, that could be used directly for estimating the degree of sensitivity of a document. This makes manual annotation of the dataset a necessity. To capture as many important properties as possible, we are developing an annotation editor for gathering high-quality gold standard data. The annotator can read the source text in the left panel of the editor (Fig. 2 (1)). Selecting a piece of text and pressing "new topic/concept" (2) creates a new selection section. Here four buttons are present and become active as soon as the annotator selects some text in the left panel. Clicking on "instance" (3) adds the current text selection as a new instance of the active subject (e.g., "My Chef" and "My Boss") and its particular aspect (4). Clicking on the positive or negative button (5) adds the selection as support for the corresponding sentiment. The corpora will be annotated by the social scientists who collected and worked with the data. The manual assessment of sentiment will serve as a gold standard/ground truth, which will be used as a training corpus for deriving models for automatic identification.

Fig. 2: Annotation Editor

Step 2 - NLP Analysis: First, the annotated set will be manually analyzed by the social scientists, and a set of formal rules describing sentimental/neutral expressions will be defined. We will continue the analysis using NLP tools and extract a set of further feature candidates such as part of speech, parse tree structure, and typical idioms. Finally, we will conduct classification experiments and plot precision-recall curves to evaluate the feature selection. We are especially interested in finding out to what degree we can automatically answer questions like "What is the sentiment value of a particular document?" and "Are there sensitive documents in the given set?" Aggregation of polarity over different aspects and the granularity at the subject level are particularly interesting issues.

Step 3 - Tools Development: Finally, our goal is to implement a toolbox for estimating sentiment in qualitative data, apply it to our dataset, and open parts of the archive to secondary research. Further analysis could provide insights into the average situation at the workplace with respect to sentiment expressions at given points in time and make it comparable to other times and workplaces.
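As a rough illustration of the opinion structure in Fig. 1 and of the aggregation question raised in Step 2, the sketch below models one annotation record and averages polarity per subject. The record layout, the field names, and the choice of an unweighted mean are assumptions made for this example; they are not the format of the annotation editor or a proposed aggregation method.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class OpinionAnnotation:
    """One annotated opinion; all field names are hypothetical."""
    holder: str    # the interviewee (the "Object" in Fig. 1)
    subject: str   # opinion target, e.g. "boss" or "work"
    aspect: str    # aspect of the subject, e.g. "salary"
    polarity: int  # +1 for positive, -1 for negative
    evidence: str  # text span selected by the annotator


def mean_polarity_by_subject(
    annotations: Iterable[OpinionAnnotation],
) -> Dict[str, float]:
    """Average the +/-1 polarities of all annotations per subject."""
    by_subject: Dict[str, List[int]] = defaultdict(list)
    for ann in annotations:
        if ann.polarity not in (-1, 1):
            raise ValueError(f"polarity must be +1 or -1, got {ann.polarity}")
        by_subject[ann.subject].append(ann.polarity)
    return {subj: sum(vals) / len(vals) for subj, vals in by_subject.items()}


example = [
    OpinionAnnotation("interviewee_01", "work", "workload", -1,
                      "The work is hard"),
    OpinionAnnotation("interviewee_01", "work", "salary", +1,
                      "the salary is high"),
    OpinionAnnotation("interviewee_01", "boss", "general", +1,
                      "My boss is great"),
]
print(mean_polarity_by_subject(example))
```

An unweighted mean treats all aspects as equally important and ignores the order in which opinions are stated, so the two "work" annotations cancel out regardless of which clause comes last. A weighting by aspect importance or clause position would be one possible refinement, which is exactly the kind of question the annotated gold standard is meant to help answer.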

5 Conclusion

In this paper we describe directions for tackling the problem of sentiment analysis within corpora of qualitative research data. The challenge is first to detect subjective expressions given the absence of explicit, clearly sentimental vocabulary. In the next step the corresponding subjects need to be identified. Finally, the relations between the sentiment degree of expressed opinions and the sensitivity of the documents need to be analyzed. We plan to develop and evaluate corresponding tools as well as to apply them to an existing set of qualitative interviews within the German project "Gute Arbeit".

Acknowledgments

The work was supported by the project "Gute Arbeit nach dem Boom" (Re-SozIT), funded by the German Federal Ministry of Education and Research (BMBF) under grant 01UG1249C within the eHumanities line of funding, as well as by the European project ARCOMEM (GA270239).

References

1. Marchington, M., Rubery, J., Willmott, H.: Changing organizational forms and the re-shaping of work: Case study interviews, 1999-2002 [computer file] (2004)
2. Pang, B., Lee, L.: Opinion mining and sentiment analysis. Found. Trends Inf. Retr. 2(1-2) (January 2008) 1-135
3. Siersdorfer, S., Chelaru, S., Nejdl, W., San Pedro, J.: How useful are your comments?: analyzing and predicting YouTube comments and comment ratings. In: Proceedings of the 19th International Conference on World Wide Web. WWW '10, New York, NY, USA, ACM (2010) 891-900
4. Goujon, B.: Text mining for opinion target detection. In: Intelligence and Security Informatics Conference (EISIC), 2011 European. (2011) 322-326
5. Maynard, D., Bontcheva, K., Rout, D.: Challenges in developing opinion mining tools for social media. In: @NLP can u tag #usergeneratedcontent?! Workshop at LREC 2012, Istanbul, Turkey
6. Krestel, R., Siersdorfer, S.: Generating contextualized sentiment lexica based on latent topics and user ratings. In: Proceedings of the 24th ACM Conference on Hypertext and Social Media. HT '13, New York, NY, USA, ACM (2013) 129-138
7. Stylios, G., Tsolis, D., Christodoulakis, D.: Mining and estimating users' opinion strength in forum texts regarding governmental decisions. In Iliadis, L., Maglogiannis, I., Papadopoulos, H., Karatzas, K., Sioutas, S., eds.: Artificial Intelligence Applications and Innovations. Volume 382 of IFIP Advances in Information and Communication Technology. Springer Berlin Heidelberg (2012) 451-459
8. Xiao, R.: Corpus creation. In: Handbook of Natural Language Processing. (2010) 385-403