What We Can Learn from Looking at Profanity




Gustavo Laboreiro and Eugénio Oliveira
LIACC, Universidade do Porto, Faculdade de Engenharia
{gustavo.laboreiro,eco}@fe.up.pt

Abstract. Profanity is a common occurrence in online text. Recent studies found swear words in over 7% of English tweets and in 9% of Yahoo! Buzz messages. However, efforts in recognizing, understanding and dealing with profanity do not share resources, namely their datasets, which imposes duplication of effort and non-comparable results. We present a freely available dataset of 2500 messages from a popular Portuguese sports website. About 20% of the messages contained profanity; we annotated 726 swear words, 510 of which had been obfuscated by their authors. We also identified the most frequent profanities, and what methods, and combinations of methods, people used to disguise their cursing.

1 Introduction and Related Work

In the context of this work we define profanity (curse, swear or taboo words) as words used with offensive or vulgar intent. Although swearing can be studied in the context of multiple disciplines, from the computational perspective it is commonly associated with the automatic identification of abusive comments. Most often the intent of profanity identification lies in censoring these words or posts, but profanity is also tightly related to sentiment analysis and opinion mining tasks [1], since it can adequately express certain emotions [2-4], mostly negative ones. Its use seems to depend on several factors, such as gender, age and social class [5, 4].

How common is cursing online? Most pages of 16-year-olds on MySpace, and about 15% of pages of middle-aged people, contained strong swearing [5]. In Yahoo! Buzz, 9.28% of comments showed profanity [6]. Out of 51 million tweets in English, at least 7.73% of messages contained cursing [4]; swear words represented 1.15% of all words seen, as frequent as first-person plural pronouns (we, us, our) [7].
While profanity is a common occurrence online, correct spelling is not. Curse words are not always written the same way, a consequence of their use and spread being more oral than written. Graphical diversity is further increased by accidental misspellings and intentional obfuscation. Sood, Antin and Churchill found that 76% of the top profane words were not written correctly, and describe why this variability is a hurdle for list-based profanity detection systems [6].

To study how users obfuscate their texts, we required a dataset annotated for the study of profanity. As we were unaware of any, we proceeded with the tiresome process of creating our own. Our goals were: i) the messages should relate to a swearing-prone subject; ii) the dataset needs to be of adequate size; iii) the annotation needs to address individual words; and iv) the dataset should be distributable, to avoid duplication of effort and promote result comparison.

We annotated 2500 comments from a large sports news website in Portugal, and made them available, with extra information, at http://labs.sapo.pt/2014/05/obfuscation-dataset/. We will next elaborate on the nature of our data and on the annotation process. Then we will present a number of methods used for obfuscation and look at their presence in the dataset. Finally, we present our conclusions and how we expect to follow up on our work.

2 Description of the Dataset

Our dataset was based on 2 years of text messages published on SAPO Desporto (http://desporto.sapo.pt/), a sports news website with a strong emphasis on soccer, a sport known to be important for social identity in several countries [8]. We randomly selected 2500 messages, written in Portuguese.

The website checks all posts against a small list of forbidden words (the blacklist), and rejects any message that contains one. Users can choose not to use those taboo words, or they can attempt to bypass the filter in some way. Many took up the challenge, and the filter did not end cursing; it just pushed it into disguise. Hence, this data is appropriate for the study of obfuscation.

The blacklist contained 252 entries, including the most common profanities in the Portuguese language. Of interest to us were 44 curse words, 85 curse word variants (plural, diminutive, etc.), 30 curse words written in a graphical variant (e.g. no diacritical sign, "k" replacing "c", ...), 41 curse word variants written with graphical variants (e.g. "put@s"), and 10 foreign curse words and variants. The remainder of the list contained entries used for spam control. This list is distributed with our dataset.
To find out how users used their creativity to overcome the obstacle of censorship, we had to analyse the messages. Three annotators relied on their own sensibility as to what constituted profanity. Once a word was considered swearing, it was tagged in the entire corpus. If misspelled, we added its canonical form to the annotation. In the end we identified 521 messages with profanities (1 in every 5) and 726 individual instances of profanity use (we ignored graphical duplicates in the same message), of which 510 were obfuscated.

We can summarise our profanity dictionary in 40 different base profanities, totalling 103 when counting variants. Despite the possibility of profanity variants being used as a kind of obfuscation (e.g., employing "shitful" instead of "shit"), we decided to consider them as distinct profanities. Of the 103 profanities we identified, 29 were present in the blacklist, and they represented half of the cursing instances that we found. Therefore, SAPO targeted the most frequent swearing terms, but failed in the face of obfuscation. Let us take a look at the methods that were used to bypass the filter.
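The kind of exact-match blacklist filtering described above can be sketched in a few lines; the word list and messages below are our own illustrations, not entries from the actual SAPO Desporto blacklist. The sketch shows why even trivial obfuscation defeats such a filter:

```python
import re

# Illustrative blacklist entries only (not the real SAPO list).
BLACKLIST = {"puta", "foda", "merda"}

def is_rejected(message: str) -> bool:
    """Reject a message iff any token exactly matches a blacklisted word."""
    tokens = re.findall(r"\w+", message.lower())
    return any(tok in BLACKLIST for tok in tokens)

# An exact match is caught...
assert is_rejected("que grande puta")
# ...but simple obfuscations (symbol-as-letter, inserted punctuation) pass.
assert not is_rejected("que grande put@")
assert not is_rejected("que grande p.u.t.a")
```

Since the filter only ever sees the literal token, any of the disguises catalogued in the next section is enough to slip past it.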

3 Obfuscation Methods

We were able to identify a total of 17 different ways in which the words we found deviated from their canonical spelling. They are described below, next to the symbols we assigned to represent them.

Ac (Accent removed): A diacritical mark is removed from the word. For example, "cabrão" becomes "cabrao", or "piço" becomes "pico".
C (Characters removed): Letters and/or symbols are removed from the word.
+Ac (Accent added): A diacritical mark is added where it is not supposed to exist. For example, we see "cócó" instead of "cocó". This alteration seldom had any phonetic impact on the words.
+L (Letters added): Extra letters are added to the word, but not as a repetition of the preceding letter, as in "pandeleiro" instead of "paneleiro".
+N (Number added): A seemingly random number is added to the word.
+P (Punctuation added): A punctuation sign (such as ".", "," or "-") is inserted into the word. These characters were chosen because they are easy to distinguish from letters. Two examples: "f-o-d-a-m" and "me-r.da".
+S (Symbols added): A symbol not from our punctuation set is inserted into the word. One example is "fu der", meaning "foder".
+Sp (Spaces added): A space is employed to break the word into two or more segments. E.g., "co rnos", "p u t a".
=Ac (Change accent): One letter with an accent is replaced by the same letter bearing a different accent. We saw "cù" instead of "cú" many times.
=L (Letters substituted): One letter is replaced by another letter. Usually this change does not alter the pronunciation of the word.
=N (Number as a letter): A number takes the place of a letter. Often the digit somewhat resembles the letter. As an example, "foda" becomes "f0da".
=P (Punctuation as a letter): One of the characters of our punctuation set is used as a placeholder for one or more letters. For example, "p..." for "puta".
=S (Symbol as a letter): A symbol from outside our punctuation set is used as a letter. A common occurrence was "@" instead of "a", as in "put@".
Ag (Word aggregation): Two words are combined into just one. For example, "americuzinho", combining "Américo" and "cuzinho" ("cu" and "co" sounding similar in this case).
Cp (Complex alteration): Forms of obfuscation that are too complex to be described with the other methods. A common occurrence is "fdp", the Portuguese initials corresponding to "son of a bitch" ("sob").
Ph (Phonetic-driven substitution): The words sound similar, but differ beyond a simple letter substitution. E.g., "fodassse" instead of "foda-se".
Pun (Pun): The author relies on the context to explain the obfuscation.
R (Repetition): One letter is repeated, as in "merddddddddda".

These operations were selected to provide a descriptive view, rather than the smallest set of operations that could transform a word from its canonical representation. We focus on the way the reader perceives the obfuscation method, since multiple combinations can lead to the same result (e.g. substitution vs. insertion and removal). We will see that authors tend to choose methods that are easy to understand.

4 Analysis of the Dataset

In this work our main concern was method choice. The number of times each method is used in each word depends strongly on word length, and provides little insight on how to reverse it. Also, some methods are more prone to overuse (e.g. Repetition and Punctuation added) than others (e.g. Letters substituted or Accent removed). Thus, if the author uses two letter substitutions in the same word, we count it as one.

We divided our analysis into two types of obfuscation: those that maintain word length and those that alter word length. We then look at how many operations were combined to obfuscate each word, as an indicator of complexity. In general, the length of a word provides an additional clue that helps the reader recognize it, even when disguised.

We found 261 obfuscations keeping word length (out of 510). Many obfuscation methods cannot be used to achieve this. Tables 1a and 1b show the absolute frequency of the methods we saw. Letter substitution was the most popular choice, which can be explained by "c" and "k" usually being phonetically similar and 1/3 of the curse words starting with a "c". Obfuscation through only one method (Table 1a) is achieved mostly by substitutions (=L, =N and =S) or accent manipulation (=Ac and Ac). When two methods are used (Table 1b), the clear preference is the combination of =L and Ac, mostly by writing "cú" (ass) as "ku".

When word length no longer constrains their efforts, authors show different preferences. In Tables 1c and 1d we can see the method choice distribution across the remaining 189 obfuscation instances. If no other method is used, Table 1c shows that Repetition (R) is the preferred choice, possibly because it calls attention to the word itself and makes the modification obvious.
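Several of the methods catalogued in Section 3 are mechanical enough that they can be undone before a dictionary lookup. The following is our own illustrative sketch, not a procedure from the paper; the symbol-to-letter map is an assumption covering only the commonest stand-ins:

```python
import re
import unicodedata

# Assumed map for =S / =N stand-ins; only a few common cases are covered.
SYMBOL_AS_LETTER = {"@": "a", "0": "o", "3": "e", "1": "i"}

def normalize(word: str) -> str:
    """Undo some obfuscations: =S/=N, +P/+S/+Sp, R, and accent changes."""
    # =S / =N: map symbol and digit stand-ins back to letters.
    word = "".join(SYMBOL_AS_LETTER.get(ch, ch) for ch in word.lower())
    # +P / +S / +Sp: drop inserted punctuation, symbols and spaces.
    word = re.sub(r"[^a-zà-ÿ]", "", word)
    # R: collapse letter repetitions (merddddddddda -> merda).
    word = re.sub(r"(.)\1+", r"\1", word)
    # Ac / +Ac / =Ac: compare accent-insensitively by stripping diacritics.
    word = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in word if not unicodedata.combining(ch))

assert normalize("me-r.da") == "merda"      # +P undone
assert normalize("merddddddddda") == "merda"  # R undone
assert normalize("put@") == "puta"          # =S undone
assert normalize("f0da") == "foda"          # =N undone
```

Note the sketch is lossy: collapsing repetitions also destroys legitimate double letters, and methods such as Ag, Cp, Ph and Pun resist this kind of character-level reversal, which is why the frequency analysis above matters for detection.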
We can claim the same characteristics are shared by the insertion of easy-to-ignore noise (+P and +Sp). Making puns was the fourth most popular method, something that is difficult to address automatically. When two methods are used there is no clear predominance, as shown in Table 1d. The use of symbols as letters (=S) and the repetition of letters (R) are seen more frequently than the other methods, even though they are not often combined with each other, which we found curious. We also accounted for the rare concurrent use of three obfuscation methods. We saw the combination =N Ph R three times (e.g., "f000daseeeeeee" for "fodase"), while Ac =L +Sp were seen together once ("ka brao" instead of "cabrão").

5 Conclusion and Future Work

We hope that our work, while modest in size and scope, is a good first step towards greater research cooperation and validation of profanity identification.

Table 1: How often each method was employed in our dataset. Tables (a) and (b) cover obfuscations maintaining word length; (c) and (d) cover those changing word length.

(a) Single method, word length maintained:

Method  Count
Ac      19
+Ac     7
=Ac     22
=L      102
=N      36
=P      3
=S      22
=Sp     1
Ph      2
Pun     7

(b) Two methods combined, word length maintained: rows Ac, +Ac, =Ac, =N, Ph and Pun against columns =L, =N, =S and R; the most frequent pair was =L with Ac (27 instances); the placement of the remaining counts (1, 1, 1, 1, 5, 1, 1, 2) is not recoverable from this transcription.

(c) Single method, word length changed:

Method  Count
C       8
+L      4
+P      38
+S      3
+Sp     21
Ag      10
Cp      4
Ph      4
Pun     20
R       77

(d) Two methods combined, word length changed: rows Ac, C, +Ac, +L, +N, +P, +S, +Sp, =Ac, =L, =N, =P, =S, Ag and R against columns +P, +S, +Sp, =L, =P, =S, Ag, Cp, Ph, Pun and R; cell-level counts are not recoverable from this transcription.

We identified the most common swear words used in our corpus and, given that many were blacklisted, we inferred that the filter had no significant impact on the vocabulary of the users, as many circumvented it. We also surveyed a set of frequent obfuscation techniques we believe relevant when dealing with cursing. We concluded that both graphical appearance and pronunciation are important when obfuscating profanity, but not necessarily both at the same time. Knowing that written noise can derive from personal choice [9], it could be interesting to study whether this preference extends to obfuscation decisions. By providing annotation at a finer granularity (word level instead of message level), we believe that new techniques for word de-obfuscation can be enabled. This could be achieved by adapting the Levenshtein distance (new operations or statistics-derived costs), or through machine learning.

Acknowledgements

This project was funded by the UT Austin Portugal International Collaboratory for Emerging Technologies, project UTA-Est/MAI/0006/2009, and SAPO Labs UP. We also thank Luís Sarmento, Francisca Teixeira and Tó Jó for their help.
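The future-work idea of adapting the Levenshtein distance with obfuscation-aware operation costs could be sketched as below. The cost values and the cheap-substitution pairs are arbitrary illustrations of ours, not statistics derived from the dataset:

```python
def obfuscation_distance(obfuscated: str, canonical: str) -> float:
    """Edit distance in which obfuscation-typical edits are made cheap."""
    # Assumed cheap pairs: symbol/number-as-letter (=S, =N) and k-for-c (=L).
    CHEAP_SUBS = {("@", "a"), ("0", "o"), ("3", "e"), ("k", "c")}
    NOISE = set(".,-_* ")  # characters often inserted as noise (+P, +S, +Sp)

    def del_cost(i: int) -> float:
        # Deleting inserted noise, or a repeated letter (R), is cheap.
        c = obfuscated[i - 1]
        return 0.1 if (c in NOISE or (i > 1 and obfuscated[i - 2] == c)) else 1.0

    m, n = len(obfuscated), len(canonical)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + del_cost(i)
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + 1.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if obfuscated[i - 1] == canonical[j - 1]:
                sub = 0.0
            elif (obfuscated[i - 1], canonical[j - 1]) in CHEAP_SUBS:
                sub = 0.2
            else:
                sub = 1.0
            d[i][j] = min(d[i - 1][j] + del_cost(i),   # delete from obfuscated
                          d[i][j - 1] + 1.0,           # insert missing letter
                          d[i - 1][j - 1] + sub)       # substitute or match
    return d[m][n]

# Obfuscated forms end up far closer to their canonical word than to others.
assert obfuscation_distance("f0da", "foda") < 0.5      # =N is cheap
assert obfuscation_distance("merddda", "merda") < 0.5  # R is cheap
assert obfuscation_distance("gato", "foda") > 1.0      # unrelated word stays far
```

With statistics-derived costs (e.g. learned from the annotated pairs in the dataset), thresholding such a distance against a blacklist would be one concrete route to the word-level de-obfuscation envisaged above.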

References

1. Constant, N., Davis, C., Potts, C., Schwarz, F.: The pragmatics of expressive content: Evidence from large corpora. Sprache und Datenverarbeitung: International Journal for Language Data Processing (33) (2009) 5-21
2. Jay, T., Janschewitz, K.: Filling the emotional gap in linguistic theory: Commentary on Potts' expressive dimension. (33) (2007) 215-221
3. Jay, T.: The utility and ubiquity of taboo words. 4(2) (2009) 153-161
4. Wang, W., Chen, L., Thirunarayan, K., Sheth, A.P.: Cursing in English on Twitter. In: Proceedings of the 17th ACM Conference on Computer Supported Cooperative Work & Social Computing. CSCW '14 (February 2014)
5. Thelwall, M.: Fk yea I swear: cursing and gender in MySpace. Corpora 3(1) (2008) 83-107
6. Sood, S.O., Antin, J., Churchill, E.: Profanity use in online communities. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. CHI '12, New York, NY, USA, ACM (2012) 1481-1490
7. Mehl, M.R., Pennebaker, J.W.: The Sounds of Social Life: A Psychometric Analysis of Students' Daily Social Environments and Natural Conversations. Journal of Personality and Social Psychology 84(4) (2003) 857-870
8. Crisp, R.J., Heuston, S., Farr, M.J., Turner, R.N.: Seeing Red or Feeling Blue: Differentiated Intergroup Emotions and Ingroup Identification in Soccer Fans
9. Sousa-Silva, R., Laboreiro, G., Sarmento, L., Grant, T., Oliveira, E., Maia, B.: 'twazn me!!! ;(' Automatic Authorship Analysis of Micro-Blogging Messages. In: Muñoz, R., Montoyo, A., Métais, E. (eds.): Proceedings of the 16th International Conference on Applications of Natural Language to Information Systems, NLDB 2011. LNCS 6716, Springer (June 2011) 161-168