Emotional Links between Structure and Content in Online Communications



Similar documents
Designing Ranking Systems for Consumer Reviews: The Impact of Review Subjectivity on Product Sales and Review Quality

End-to-End Sentiment Analysis of Twitter Data

How To Learn From The Revolution

An Analysis of MOOC Discussion Forum Interactions from the Most Active Users

Using Text and Data Mining Techniques to extract Stock Market Sentiment from Live News Streams

Florida International University - University of Miami TRECVID 2014

Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization. Learning Goals. GENOME 560, Spring 2012

CIRGIRDISCO at RepLab2014 Reputation Dimension Task: Using Wikipedia Graph Structure for Classifying the Reputation Dimension of a Tweet

Monitoring Web Browsing Habits of User Using Web Log Analysis and Role-Based Web Accessing Control. Phudinan Singkhamfu, Parinya Suwanasrikham

A Framework of User-Driven Data Analytics in the Cloud for Course Management

The First Online 3D Epigraphic Library: The University of Florida Digital Epigraphy and Archaeology Project

How To Find Out What Political Sentiment Is On Twitter

A Survey on Product Aspect Ranking

Bagged Ensemble Classifiers for Sentiment Classification of Movie Reviews

Technical Efficiency Accounting for Environmental Influence in the Japanese Gas Market

Winning the Kaggle Algorithmic Trading Challenge with the Composition of Many Models and Feature Engineering

Bootstrapping Big Data

Dong-Joo Kang* Dong-Kyun Kang** Balho H. Kim***

Nonparametric statistics and model selection

Data quality in Accounting Information Systems

Web Document Clustering

Semantic Sentiment Analysis of Twitter

DATA ANALYSIS. QEM Network HBCU-UP Fundamentals of Education Research Workshop Gerunda B. Hughes, Ph.D. Howard University

Curriculum Map Statistics and Probability Honors (348) Saugus High School Saugus Public Schools

NILC USP: A Hybrid System for Sentiment Analysis in Twitter Messages

AUTO CLAIM FRAUD DETECTION USING MULTI CLASSIFIER SYSTEM

South Carolina College- and Career-Ready (SCCCR) Probability and Statistics

Sentiment Analysis. D. Skrepetos 1. University of Waterloo. NLP Presenation, 06/17/2015

Affine-structure models and the pricing of energy commodity derivatives

CENG 734 Advanced Topics in Bioinformatics

Predicting stocks returns correlations based on unstructured data sources

ENHANCED CONFIDENCE INTERPRETATIONS OF GP BASED ENSEMBLE MODELING RESULTS

Private Record Linkage with Bloom Filters

Dissecting the Learning Behaviors in Hacker Forums

SINAI at WEPS-3: Online Reputation Management

Online Ensembles for Financial Trading

Chapter 8. Final Results on Dutch Senseval-2 Test Data

Glossary of Terms Ability Accommodation Adjusted validity/reliability coefficient Alternate forms Analysis of work Assessment Battery Bias

Descriptive Statistics

Nine Common Types of Data Mining Techniques Used in Predictive Analytics

RRSS - Rating Reviews Support System purpose built for movies recommendation

Multivariate Analysis of Ecological Data

Self Organizing Maps for Visualization of Categories

CS 229, Autumn 2011 Modeling the Stock Market Using Twitter Sentiment Analysis

market research p&a international Market Plaza Esomar: Global and Local Partner in Market Research

Reputation Management System

SENSITIVITY ANALYSIS AND INFERENCE. Lecture 12

Big Data Analytics of Multi-Relationship Online Social Network Based on Multi-Subnet Composited Complex Network

A Conceptual Approach to Data Visualization for User Interface Design of Smart Grid Operation Tools

Creating Usable Customer Intelligence from Social Media Data:

Do All Birds Tweet the Same? Characterizing Twitter Around the World

Comparison of K-means and Backpropagation Data Mining Algorithms

1. Introduction. Keywords: content adaptation, social media, engagement, automation

Introducing diversity among the models of multi-label classification ensemble

Understanding and Supporting Intersubjective Meaning Making in Socio-Technical Systems: A Cognitive Psychology Perspective

Text Analysis for Big Data. Magnus Sahlgren

Sentiment Analysis on Big Data

Integrating Field Research, Problem Solving, and Design with Education for Enhanced Realization

MapReduce Approach to Collective Classification for Networks

A Study of Web Log Analysis Using Clustering Techniques

Data Mining - Evaluation of Classifiers

Learning is a very general term denoting the way in which agents:

Simulating the Structural Evolution of Software

DATA MINING TECHNIQUES SUPPORT TO KNOWLEGDE OF BUSINESS INTELLIGENT SYSTEM

Brunel Business School Doctoral Symposium March 28th & 29th 2011

Confidence Intervals for One Standard Deviation Using Standard Deviation

Selective Naive Bayes Regressor with Variable Construction for Predictive Web Analytics

COURSE RECOMMENDER SYSTEM IN E-LEARNING

How To Perform An Ensemble Analysis

Robust Sentiment Detection on Twitter from Biased and Noisy Data

Teaching Statistics with Fathom

An Anomaly-Based Method for DDoS Attacks Detection using RBF Neural Networks

Tweets Miner for Stock Market Analysis

Overview of Water Utility Benchmarking Methodologies: From Indicators to Incentives

Sentiment analysis for news articles

VCU-TSA at Semeval-2016 Task 4: Sentiment Analysis in Twitter

Chapter 7. One-way ANOVA

JamiQ Social Media Monitoring Software

Understanding the popularity of reporters and assignees in the Github

The Italian Hate Map:

Representation of Electronic Mail Filtering Profiles: A User Study

Microblog Sentiment Analysis with Emoticon Space Model

CSE 598 Project Report: Comparison of Sentiment Aggregation Techniques

Designing Programming Exercises with Computer Assisted Instruction *

Statistical issues in the analysis of microarray data

Improving the Performance of Data Mining Models with Data Preparation Using SAS Enterprise Miner Ricardo Galante, SAS Institute Brasil, São Paulo, SP

Tutorial for proteome data analysis using the Perseus software platform

Fairfield Public Schools

Data Mining Yelp Data - Predicting rating stars from review text

The multilayer sentiment analysis model based on Random forest Wei Liu1, Jie Zhang2

Forecasting Trade Direction and Size of Future Contracts Using Deep Belief Network

Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) ( ) Roman Kern. KTI, TU Graz

Text Mining Approach for Big Data Analysis Using Clustering and Classification Methodologies

Sentiment analysis on tweets in a financial domain

A GENERAL TAXONOMY FOR VISUALIZATION OF PREDICTIVE SOCIAL MEDIA ANALYTICS

Sentiment Analysis and Time Series with Twitter Introduction

Quantitative Methods for Finance

Blog Post Extraction Using Title Finding

Overview of iclef 2008: search log analysis for Multilingual Image Retrieval

Spam Detection Using Customized SimHash Function

Transcription:

Emotional Links between Structure and Content in Online Communications Rafael E. Banchs 1 and Andreas Kaltenbrunner 2 1 Human Language Technology, Institute for Infocomm Research, Singapore rembanchs@i2r.a-star.edu.sg 2 Social Media and Interaction, Barcelona Media Innovation Centre, Barcelona andreas.kaltenbrunner@barcelonamedia.org Abstract. This papers aims at studying the existence of links between the structure of online communications and the contents they are composed of. The study is conducted over two datasets of similar online discussion platforms in different languages: English and Spanish. As a result of our analysis, it is concluded that there are significant trends in the variation patterns observed over the emotional load of user generated contents that are associated to the different types of communication structures existing in the datasets. Moreover, the observed trends are quite similar for both of the studied languages, suggesting that such kind of emotional links between structure and content in online communications are language independent in nature. Keywords: online communications, structure, user generated content, emotions. 1 Introduction Online communications has been extensively studied along the short history of the World Wide Web [15]. While several studies have paid attention to the structure of such communications [7],[10], other studies have focused on the analysis of the contents [6],[19]. Recently, a more comprehensive approach to the online communication problem has pointed out the necessity of taking both, structure and contents, into account, as well as exploring the possible relationships between these two dimensions of online communications [11],[18]. In this work, we focus our attention on exploring the possible existence of links of emotional nature between the structure and the contents of online communications. We based our assumption on the fact, described by [12], that the width (maximum number of responses at any level) and depth (largest chain of responses) in a given online discussion tree strongly depend on the specific forum category the discussion belongs to. Their empirical results, which are theoretically supported on deliberation theory [13,14], suggest that the level of engagement of participants in an online discussion is determinant to the evolution and final structure of the resulting discussion. Although the level of engagement depends on a complex combination of several factors, emotions can be assumed to be a key player among these factors. In this work, we explore the possible relationship between emotions, as can be measured A. Jaafar et al. (Eds.): AIRS 2014, LNCS 8870, pp. 340 347, 2014. Springer International Publishing Switzerland 2014

Emotional Links between Structure and Content in Online Communications 341 from the user generated content contained in the discussion, and the structure or topology of the corresponding discussion tread. The rest of the paper is structured as follows. Section 2 gives a brief overview over previous research on online communication analysis and provides a detailed description on the two specific works on which this work is based. Section 3 describes both the datasets and the experimental analysis, as well as discusses the most relevant results derived from the analysis. Finally, section 4 presents the main conclusions and the proposed future research. 2 Related Work In this section we first present a brief overview over previous research on online communication, followed by a detailed description of the works on which our research is based. 2.1 Research on Online Communications As already mentioned, online communication has been an important topic of research for at least the last ten years or so. Pioneering research in this area can be traced back to work on computer mediated communication in distance learning applications [4]. More recently, the Web 2.0 phenomenon has called the attention of researches on the social network aspects of online communication [7],[10],[16,17],[23]; as well as on user generated content analysis [6],[8],[19],[22]. Another important body of research has been developed around the growing phenomenon of opinionated content analysis. This kind of research focuses more on the analysis of the subjective nature of user generated contents and aims at determining the specific sentiments and the polarity of the opinions conveyed in them [2],[3],[20]. 2.2 Previous Related Research This work, which aims at studying the existence of emotional links between the structure of online communications and the linguistic contents they are composed of, is mainly founded on two previous studies. The first one [12] provides an empirical evidence about the possible dependence between the structural variables of a given online discussion and its topic. In that work, the authors operationalized the structural variables of depth and width for a discussion tree as follows: the depth corresponds to the maximum number of layers through which the discussion unfolds and the width corresponds to the maximum number of comments at any level of the tree. They considered the depth and width of a discussion tree to be proxies to the deliberation theory variables of argumentation and representation, respectively. According to this, a given online discussion tree can be classified into one of four different types of communication depending on its location into the depth-width space, in a similar way political communications are categorized into the argumentation-representation space [1]. This notion is illustrated in Fig. 1.

342 R.E. Banchs and A. Kaltenbrunner Fig. 1. The depth-width space and the four online communication type regions As a result of their study, [12] demonstrated that the location of a given online dis- cussion tree in the depth-width space was highly dependent on the specific category or topic being discussed. In such a way, those discussions belonging to political and other similar ideological categories tend to fall in the type-1 region, while discussions be- The second previous related work [5] provides the empirical means for the direct estimation of the emotional load of a given document or segment of text. In that work longing to games, books and other similar categories tend to fall in the type-3 region. the authors developed a human annotated lexicon denominated ANEW (Affective Norms for English Language Words). The ANEW lexicon consists of about a thousand English frequently used words that have been individually scored along three different emotional dimensions: va- of emotional load in a numerical scale that ranges from 1 (minimum degree) to 9 lence, arousal and dominance. Each of these three variables captures a different aspect (maximum degree). While valence measures happiness, arousal measures anger, and dominance measures the feeling of assurance or confidence. A similar lexicon has been also developed for the case of Spanish [21], whichh al- lows us to conduct the experimental analysis over Spanish online discussions too. 3 Empirical Analysis In this section we describe in detail the empirical analysis that we conducted to study the differences in emotional trends between discussions of type-1 and type-3. First, we describe the datasets used for the analysis; afterwards we provide the details of the experimental analysis and present the experimental results. 3.1 Online Discussion Datasets The empirical data collected and used for our analysis was extracted from the online discussion forum Slashdot(www.slashdot.com) and its Spanish version Barrapunto(www.barrapunto.com). These two sites are based on the same software

Emotional Links between Structure and Content in Online Communications 343 platform, which provides a good source of experimental data as it allows for reliably reconstructing the discussion trees from the crawled html files [10]. The existence of these two (Spanish and English) online discussion forums, which are based on the same platform, also allows for the comparison of the experimental analyses in the two different languages. However, it is important to mention that the Spanish site has much lesss activity than the English one and, as a consequence, the volume of available data for the former case is much more restrictive. A total of 10,012 discussion trees, compressing a little bit more than 2 million comments, were collected for the case of English, while a total of 6,252 discussion trees were collected for the case of Spanish. In both cases, the original collections were filtered by retaining only those discussion trees including at least 100 occur- rences of words contained in the emotional lexicons. This should guarantee that relia- function centered on the median values of depth and width for each of the data collec- tions. An adjustment factor was used to ensure that about 40% 1 of the discussions were selected as either type-1 or type-3. ble estimates of the emotional variables can be computed for each discussion tree. Finally, type-1 and type-3 communications were identified by using a hyperbolic The implemented hyperbolic function can be described by the following equation: (d d med ) (w w med ) = f; where d and w refers to depth and width, d med and w med refers to their corresponding median values in the considered data collection, and f is the adjustment factor. The depth and width median values observed in the case of Slashdot (English dataset) were d med = 8.00 and w med = 4.07; and the required adjust- ment factor for selecting 40% of the discussions as type-1 or type-3 was f = 0.70 In the case of Barrapunto (Spanish dataset), the parameters were as follows: d med = 7..00, w med = 3.18 and f = 0.10. Fig. 2, illustrates the identification process for the case of the English dataset, and Table 1 summarizes the main characteristics of both online discussion datasets. Fig. 2. Cross-plot of English discussions from Slashdot in the depth-width space and selected subsets of discussions within type-1 and type-3 categories 1 Several experiments were conducted for different proportions by varying the adjustment factor. Similar results weree always observed in those cases in which 60% or less of the dis- and for proportions over 80%, most of the observed results were not statistically significant any more. In this work, we report those results corresponding to the 40% cussions were selected. For proportions over 60%, the observed effects tended to fade away; case.

344 R.E. Banchs and A. Kaltenbrunner Table 1. Main characteristics of English and Spanish online discussion datasets Discussion Trees English Spanish Originally collected 10012 6252 Filtered collection 9426 2744 Filtered type-1 1939 640 Filtered type-3 1831 556 3.2 Experimental Results For evaluating the possible relations between structure and content, we studied the differences on the emotional variables between the two considered communications types (type-1 and type-3).more specifically, the four basic statistical moments (mean, variance, skewness and kurtosis) were computed at the discussion level for each of the three emotional dimensions (valence, arousal and dominance) according to the specific words occurring in the discussion which have been assigned emotional scores in the lexicons. Average values for each moment were finally computed over the subsets of type-1 discussions (TYPE-1) and type-3 discussions (TYPE-3). The statistical significance of the observed differences was assessed by means of bootstrapping [9], a re-sampling method that allows for hypothesis testing. In our case, we were interested in estimating the degree of confidence with which the null hypothesis that both averaged emotional scores (for type-1 and type-3 discussions) belong to the same distribution can be rejected. In the implemented procedure, new discussion subsets were artificially generated from the original ones by using sampling with replacement. A total of 10,000 simulations were conducted in each case for confidence estimation purposes. Table 2 and 3 summarize the obtained results for the English and Spanish datasets, respectively. In the tables, the average values for each computed moment over both types of discussions (TYPE-1) and (TYPE-3) are reported, along with the percentage of times each boot-strap simulation produced larger average values for either type-1 (T1>T3) or type-3 (T1<T3) discussions. Table 2. Average values for the statistical moments of emotional variables estimated over both type-1 and type-3 English discussions and percentage of times each bootstrap simulation produced larger averages for type-1 (T1>T3) or type-3 (T1<T3) discussions Average for TYPE-1 TYPE-3 T1<T3 T1>T3 MEAN valence 6.04 6.17 100.00 0.00 MEAN arousal 5.17 5.20 100.00 0.00 MEAN dominance 5.44 5.48 100.00 0.00 VAR valence 1.69 1.57 0.00 100.00 VAR arousal 0.92 0.90 0.00 100.00 VAR dominance 0.93 0.87 0.00 100.00 SKW valence -0.96-1.08 0.00 100.00 SKW arousal 0.12 0.05 0.00 100.00 SKW dominance -0.60-0.61 40.02 59.98

Emotional Links between Structure and Content in Online Communications 345 Table 2. (Continued) KURT valence 3.21 3.75 100.00 0.00 KURT arousal 2.82 2.87 99.98 0.02 KURT dominance 3.40 3.66 100.00 0.00 Table 3. Average values for the statistical moments of emotional variables estimated over both type-1 and type-3 Spanish discussions and percentage of times each bootstrap simulation produced larger averages for type-1 (T1>T3) or type-3 (T1<T3) discussions Average for TYPE-1 TYPE-3 T1<T3 T1>T3 MEAN valence 5.63 5.67 97.90 2.10 MEAN arousal 5.46 5.48 98.14 1.86 MEAN dominance 5.11 5.10 44.08 55.92 VAR valence 1.99 1.95 0.01 99.99 VAR arousal 0.95 0.93 0.04 99.96 VAR dominance 1.03 1.02 7.63 92.37 SKW valence -0.63-0.67 0.06 99.94 SKW arousal -0.25-0.26 34.14 65.86 SKW dominance -0.27-0.27 49.74 50.26 KURT valence 2.29 2.45 100.00 0.00 KURT arousal 2.47 2.50 96.01 3.99 KURT dominance 2.79 2.76 19.11 80.89 Several interesting observations can be drawn from Table 2 and 3. First of all, notice how for the case of the English dataset, with the exception of the skewness of dominance, all observed average differences can be regarded as statistically signifycant, while for the case of Spanish only three differences happen to be significant: the variance of valence and arousal and the kurtosis of the valence. This lack of statistically significant results in most of the Spanish cases can be explained by two factors: the less quantity of empirical data available for Spanish, and the richer morphology of Spanish with respect to English, which makes emotional estimation from a lemmabased lexicon a much more unreliable and noisier task. Nevertheless, it is interesting to observe that the general trends for both languages are quite similar. From all observed differences in Table 2 and 3, the most important one in absolute terms is the drop in the kurtosis of the valence when we move from type-3 to type-1 discussions. This can be interpreted as an increment in the emotional polarization of the terms used in the discussion, which is also consistent with the observed increment in the variance of the valence. Notice that both results are statistically significant for the two languages. 4 Conclusions and Future Work In this work, we focused our attention on exploring the possible existence of links of emotional nature between the structure and the contents of online communications. More specifically, the average variations of statistical moments for three different

346 R.E. Banchs and A. Kaltenbrunner emotional variables were studied for two types of online discussion: type-1 (wide and deep trees) and type-3 (narrow and shallow trees). We conducted the analysis over empirical data in both English and Spanish and observed similar and statistically significant trends for the cases of the kurtosis and the variance of the valence, being the drop in the kurtosis of the valence the most important one in absolute terms. This suggests an increment in the emotional polarization of the terms used in the discussion along this specific dimension, which measures the degree of happiness. This shows that the emotional load of the contents does actually affects the structure of the studied online communications. As future research in this area we intend to extend our work to other languages, as well as to improve the emotional estimation in Spanish by morphologically enriching the current lemma-based lexicon. Acknowledgements. The authors would like to thank their respective institutions: Institute for Infocomm Research and Barcelona Media Innovation Centre, for their support and permission for publishing this work. References 1. Ackerman, B., Fishkin, J.: Deliberation Day. The Journal of Political Philosophy 10(2), 129 152 (2002) 2. Agarwal, A., Xie, B., Vovsha, I., Rambow, O., Passonneau, R.: Sentiment Analysis of Twitter Data. In: Proceedings of the Workshop on Language in Social Media, pp. 30 38. ACM, New York (2011) 3. Workshop on Computational Approaches to Subjectivity and Sentiment Analysis (WASSA), http://gplsi.dlsi.ua.es/congre-sos/wassa2011/ 4. Berge, Z., Collins, M.: Computer Mediated Communication and the Online Classroom: Distance Learning. Hampton Press, Cresskill (1995) 5. Bradley, M., Lang, P.: Affective Norms for English Words (ANEW): Stimuli, Instruction manual and Effective Ratings. Technical report C-1, The Center for Research in Psychophysiology, University of Florida (1999) 6. Cassell, J., Tversky, D.: The Language of Online Intercultural Community Formation. Journal of Computer-Mediated Communication 10, 16 33 (2005) 7. Applying Social Network Analysis to Online Communications Networks. Leadership Learning Community, http://leadershiplearning.org/blog/nataliaca/2012-01-30/ applying-social-network-analysis-online-communicationsnetworks 8. Workshop on Content Analysis for the Web 2.0 (CAW2.0), http://caw2.barcelonamedia.org/ 9. Davison, A., Hinkley, D.: Bootstrap Methods and their Applications. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press (1997) 10. Gomez, V., Kaltenbrunner, A., Lopez, V.: Statistical Analysis of the Social Network and Discussion Threads in Slashdot. In: 17th International Conference on World Wide Web, pp. 645 654. ACM, New York (2008)

Emotional Links between Structure and Content in Online Communications 347 11. Gonzales, A., Hancock, J., Pennebaker, J.: Language Style Matching as a Predictor of Social Dynamics in Small Groups. Communication Research 37(1), 3 19 (2010) 12. Gonzalez-Bailon, S., Kaltenbrunner, A., Banchs, R.: The Structure of Political Discussion Networks: A Model for the Analysis of Online Deliberation. Journal of Information Technology, 1 14 (2010) 13. Habermas, J.: The theory of communicative action, I. Polity, Cambridge (1984) 14. Habermas, J.: The theory of communicative action, II. Polity, Cambridge (1987) 15. Herring, S.: Automating Analysis of Social Media Communication: Insights from CMDA. In: Proceedings of the Workshop on Language in Social Media, p. 1. ACM, New York (2011) 16. Kaltenbrunner, A., Bondia, E., Banchs, R.: Analyzing and Ranking the Spanish Speaking MySpaceC by their Contributions in Forums. In: 18th International World Wide Web Conference (2009) 17. Laniado, D., Tasso, R., Volkovich, Y., Kaltenbrunner, A.: When the WikipediansTalk: Network and Tree Structure of the Wikipedia Discussion Pages. In: 5th International AAAI Conference on Weblogs and Social Media (2011) 18. Workshop on Language in Social Media (LSM 2001), http://research.microsoft.com/en-us/events/lsm2011/ default.aspx 19. Nguyen, D., Rose, C.: Language Use as a Reflection of Socialization in Online Communities. In: Proceedings of the Workshop on Language in Social Media, pp. 76 85. ACM, New York (2011) 20. Pang, B., Lee, L.: Opinion Mining and Sentiment Analysis. Foundations and Trends in Information Retrieval 2(1-2), 1 135 (2008) 21. Redondo, J., Fraga, I., Padron, I., Comesana, M.: The Spanish adaptation of ANEW (affective norms for English words). Journal of Behavior Research Methods 39(3), 600 605 (2007) 22. Rossler, P.: Content Analysis in Online Communication: AChallenge for Traditional Methodology. In: Hogrefe, Huber (eds.) Online Social Science, Cambridge (2002) 23. Szell, M., Lambiotte, R., Thurner, S.: Multirelational Organization of Large-Scale Social Networks in an On-line World. PNAS 107(31), 13636 13641 (2010)