What We Can Learn from Looking at Profanity

Gustavo Laboreiro and Eugénio Oliveira
LIACC, Universidade do Porto, Faculdade de Engenharia
{gustavo.laboreiro,eco}@fe.up.pt

Abstract. Profanity is a common occurrence in online text. Recent studies found swear words in over 7% of English tweets and 9% of Yahoo! Buzz messages. However, efforts in recognizing, understanding and dealing with profanity do not share resources, namely their datasets, which leads to duplicated effort and non-comparable results. We present a freely available dataset of 2500 messages from a popular Portuguese sports website. About 20% of the messages contained profanity; in them we annotated 726 swear words, 510 of which were obfuscated by their authors. We also identified the most frequent profanities, and the methods, and combinations of methods, that people used to disguise their cursing.

1 Introduction and Related Work

In the context of this work we define profanity (also curse, swear or taboo words) as words used with offensive or vulgar intent. Although swearing can be studied from the standpoint of multiple disciplines, from the computational perspective it is most commonly associated with the automatic identification of abusive comments. Most often the intent of profanity identification lies in censoring these words or posts, but profanity is also tightly related to sentiment analysis and opinion mining tasks [1], since it can adequately express certain emotions [2-4], mostly negative ones. Its use seems to depend on several factors, such as gender, age and social class [4, 5].

How common is cursing on-line? Most MySpace pages of 16-year-olds, and about 15% of the pages of middle-aged users, contained strong swearing [5]. 9.28% of comments on Yahoo! Buzz showed profanity [6]. Out of 51 million English tweets, at least 7.73% contained cursing [4], with swear words representing 1.15% of all words, making them as frequent as first-person plural pronouns (we, us, our) [7].

While profanity is a common occurrence on-line, correct spelling is not. Curse words are not always written in the same way, a consequence of their use and spread being more oral than written. Their graphical diversity is further increased by accidental misspellings and intentional obfuscation. Sood, Antin and Churchill found that 76% of the top profane words were not written correctly, and describe why this variability is a hurdle for list-based profanity detection systems [6].

To study how users obfuscate their texts, we required a dataset annotated for the study of profanity. As we were unaware of any, we proceeded with the tiresome process of creating our own. Our goals were: i) the messages should relate to a swearing-prone subject; ii) the dataset needs to be of adequate size; iii) the annotation needs to address individual words; and iv) the dataset should be distributable, to avoid duplication of effort and promote the comparison of results.
We annotated 2500 comments from a large sports news website in Portugal, and made the dataset available, with extra information, at http://labs.sapo.pt/2014/05/obfuscation-dataset/.

We will next elaborate on the nature of our data and on the annotation process. Then we will present a number of methods used for obfuscation and look at their presence in the dataset. Finally, we present our conclusions and how we expect to follow up on this work.

2 Description of the Dataset

Our dataset was based on 2 years of text messages published on SAPO Desporto (http://desporto.sapo.pt/), a sports news website with a strong emphasis on soccer, a sport known to be important to social identity in several countries [8]. We randomly selected 2500 messages written in Portuguese.

The website checks all posts against a small list of forbidden words (the blacklist) and rejects any message that contains one. Users can choose not to use those taboo words, or they can attempt to bypass the filter in some way. Many took up the challenge, and the filter did not end cursing; it just pushed it into disguise. Hence, this data is appropriate for the study of obfuscation.

The blacklist contained 252 entries, including the most common profanities in the Portuguese language. Of interest to us were 44 curse words, 85 curse word variants (plural, diminutive, etc.), 30 curse words written in a graphical variant (e.g. without diacritical signs, or with k replacing c), 41 curse word variants written with graphical variants (e.g. put@s), and 10 foreign curse words and their variants. The remainder of the list contained entries used for spam control. This list is distributed with our dataset.

To find out how users employed their creativity to overcome the obstacle of censorship, we had to analyse the messages. Three annotators relied on their own sensibility as to what constituted profanity. Once a word was considered swearing, it was tagged throughout the entire corpus; if misspelled, we added its canonical form to the annotation. In the end we identified 521 messages containing profanity (1 in every 5), and 726 individual instances of profanity use (ignoring graphical duplicates within the same message), of which 510 were obfuscated. Our profanity dictionary can be summarised as 40 different base profanities, totalling 103 when counting variants. Although profanity variants can themselves be used as a kind of obfuscation (e.g. employing shitful instead of shit), we decided to treat them as distinct profanities.

Of the 103 profanities we identified, 29 were present in the blacklist, and those represented half of the cursing instances that we found. Therefore, SAPO targeted the most frequent swearing terms, but failed in the face of obfuscation. Let us take a look at the methods that were used to bypass the filter.
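Before doing so, it helps to make concrete how little an exact-match filter demands of a determined user. The sketch below is our own illustration, not SAPO's actual implementation; the entries and function names are assumptions (the real blacklist had 252 entries, including spelling variants).

```python
import re

# Illustrative sketch of an exact-match word-list filter; the entries
# below are hypothetical stand-ins, not taken from the real blacklist.
BLACKLIST = {"merda", "cabrão", "cabrao", "foda-se"}

def is_rejected(message: str) -> bool:
    """Reject a message if any of its tokens matches a blacklisted entry."""
    tokens = re.findall(r"[\w@-]+", message.lower())
    return any(token in BLACKLIST for token in tokens)

assert is_rejected("que merda de jogo")          # caught: exact match
assert not is_rejected("que m.e.r.d.a de jogo")  # missed: inserted punctuation
assert not is_rejected("que m3rda de jogo")      # missed: number as a letter
```

Every method catalogued in the next section leaves the word readable to a human while producing a token outside any such finite list, which is why the filter merely pushed cursing into disguise.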
3 Obfuscation Methods

We were able to identify a total of 17 different ways in which the words we found deviated from their canonical spelling. They are described below, next to the symbols we assigned to represent them.

Ac (Accent removed). A diacritical mark is removed from the word. For example, cabrão becomes cabrao, or piço becomes pico.
C (Characters removed). Letters and/or symbols are removed from the word.
+Ac (Accent added). A diacritical mark is added where it is not supposed to exist. For example, we see cócó instead of cocó. This alteration seldom had any phonetic impact on the words.
+L (Letters added). Extra letters are added to the word, but not as a repetition of the preceding letter, as in pandeleiro instead of paneleiro.
+N (Number added). A seemingly random number is added to the word.
+P (Punctuation added). A punctuation sign (such as ".", "," or "-") is inserted into the word. These characters are chosen because they are easy to distinguish from letters. Two examples: f-o-d-a-m and me-r.da.
+S (Symbols added). A symbol not from our punctuation set is inserted in the word. One example is fu der, meaning foder.
+Sp (Spaces added). A space is employed to break the word into two or more segments. E.g. co rnos, p u t a.
=Ac (Change accent). One letter with an accent is replaced by the same letter bearing a different accent. We saw cù instead of cú many times.
=L (Letters substituted). One letter is replaced by another letter. Usually this change does not alter the pronunciation of the word.
=N (Number as a letter). A number takes the place of a letter; often the digit somewhat resembles the letter. As an example, foda becomes f0da.
=P (Punctuation as a letter). One of the characters of our punctuation set is used as a placeholder for one or more letters. For example, p... for puta.
=S (Symbol as a letter). A symbol from outside our punctuation set is used as a letter. A common occurrence was @ instead of a, as in put@.
Ag (Word aggregation). Two words are combined into just one. For example, americuzinho combines Américo and cuzinho (cu and co sounding similar in this case).
Cp (Complex alteration). Forms of obfuscation too complex to be described by the other methods. A common occurrence is fdp, the initials of the Portuguese equivalent of "son of a bitch" ("sob").
Ph (Phonetic-driven substitution). The words sound similar, but differ beyond a simple letter substitution. E.g. fodassse instead of foda-se.
Pun (Pun). The author relies on the context to explain the obfuscation.
R (Repetition). One letter is repeated, as in merddddddddda.

These operations were selected to provide a descriptive view, rather than the smallest set of operations that could transform a word from its canonical representation. We focus on the way the reader perceives the obfuscation method, since multiple combinations can lead to the same result (e.g. substitution vs. insertion and removal). We will see that authors tend to choose methods that are easy to understand.
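Several of the methods above are mechanically reversible. The following minimal sketch, ours and not part of the original study, normalises a few of the simpler ones before a dictionary lookup; the substitution table is illustrative rather than derived from the dataset.

```python
import re

# Hypothetical de-obfuscation rules for three families of methods.
LEET = str.maketrans({"0": "o", "1": "i", "3": "e", "4": "a", "@": "a", "$": "s"})

def normalise(token: str) -> str:
    token = token.lower().translate(LEET)       # =N, =S: numbers/symbols as letters
    token = re.sub(r"[.,\-_\s]+", "", token)    # +P, +Sp: drop inserted noise
    token = re.sub(r"(.)\1{2,}", r"\1", token)  # R: collapse runs of 3+ characters
    return token

for disguised in ("f-o-d-a-m", "merddddddddda", "f0da", "put@"):
    print(disguised, "->", normalise(disguised))
# f-o-d-a-m -> fodam, merddddddddda -> merda, f0da -> foda, put@ -> puta
```

Collapsing only runs of three or more characters keeps legitimate Portuguese digraphs such as rr and ss intact, while context-dependent methods such as Ag, Cp and Pun resist rules of this kind, which is consistent with the analysis in the next section.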
4 Analysis of the Dataset

In this work our main concern was method choice. The number of times each method is used within a word depends strongly on word length, and provides little insight into how to reverse the obfuscation. Also, some methods are more prone to overuse (e.g. Repetition and Punctuation added) than others (e.g. Letters substituted or Accent removed). Thus, if an author uses two letter substitutions in the same word, we count the method only once.

We divided our analysis into two types of obfuscation: those that maintain word length and those that alter it. We then look at how many operations were combined to obfuscate each word, as an indicator of complexity. In general, the length of a word provides an additional clue that helps the reader recognise it, even when disguised.

We found 261 obfuscations (out of 510) that keep word length; many obfuscation methods cannot be used to achieve this. Tables 1a and 1b show the absolute frequencies of the methods we saw. Letter substitution was the most popular choice, which can be explained by c and k usually being phonetically similar, and by 1/3 of the curse words starting with a c. Obfuscation through a single method (Table 1a) is achieved mostly by substitutions (=L, =N and =S) or by accent manipulation (=Ac and Ac). When two methods are used (Table 1b), the clear preference lies in the combination of =L and Ac, mostly by writing cú (ass) as ku.

When word length no longer constrains their efforts, authors show different preferences. Tables 1c and 1d show the method choice distribution across the remaining 249 obfuscation instances. When only one method is used, Table 1c shows that Repetition (R) is the preferred choice, possibly because it calls attention to the word itself and makes the modification obvious. The same characteristics can be claimed for the insertion of easy-to-ignore noise (+P and +Sp). Making puns was the fourth most popular method, something that is difficult to address automatically. When two methods are used there is no clear predominance, as shown in Table 1d. The use of symbols as letters (=S) and the repetition of letters (R) are seen more frequently than the other methods, even though the two are seldom combined with each other, which we found curious.

We also accounted for the rare concurrent use of three obfuscation methods. We saw the combination =N, Ph and R three times (e.g. f000daseeeeeee for foda-se), while Ac, =L and +Sp were seen together once (ka brao instead of cabrão).
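To make the tallying in this section concrete, the sketch below counts each method once per word and splits the totals by whether word length was preserved. The record format and the example annotations are our own illustrative assumptions, not the dataset's actual storage format.

```python
from collections import Counter

# Hypothetical annotation records: (obfuscated form, canonical form,
# methods used), with method symbols as defined in Section 3.
ANNOTATIONS = [
    ("ku", "cú", {"=L", "Ac"}),
    ("merddddda", "merda", {"R"}),
    ("f-o-d-a-m", "fodam", {"+P"}),
    ("f000daseeeeee", "foda-se", {"=N", "Ph", "R"}),
]

same_len, diff_len = Counter(), Counter()
for obfuscated, canonical, methods in ANNOTATIONS:
    # Each method counts once per word, however often it was applied.
    bucket = same_len if len(obfuscated) == len(canonical) else diff_len
    for method in methods:
        bucket[method] += 1

print("length maintained:", dict(same_len))
print("length altered:   ", dict(diff_len))
```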
Table 1: How often each method was employed in our dataset. Maintaining word length: (a) and (b); changing word length: (c) and (d).

(a) Single method:

    Method  Count
    Ac        19
    +Ac        7
    =Ac       22
    =L       102
    =N        36
    =P         3
    =S        22
    =Sp        1
    Ph         2
    Pun        7

(b) Two methods combined (counts per row method, distributed over the column methods =L, =N, =S and R):

    Ac    27, 1
    +Ac   1
    =Ac   1
    =N    1, 5
    Ph    1, 1
    Pun   2

(c) Single method:

    Method  Count
    C          8
    +L         4
    +P        38
    +S         3
    +Sp       21
    Ag        10
    Cp         4
    Ph         4
    Pun       20
    R         77

(d) Two methods combined (counts per row method, distributed over the column methods +P, +S, +Sp, =L, =P, =S, Ag, Cp, Ph, Pun and R):

    Ac    1, 5
    C     1, 1, 1
    +Ac   3
    +L    1, 1, 1
    +N    1
    +P    1, 6
    +S    2, 1, 1, 1
    +Sp   1, 1, 6
    =Ac   1
    =L    3
    =N    1, 4
    =P    3
    =S    1, 1, 2
    Ag    1
    R     1, 2

5 Conclusion and Future Work

We hope that our work, while modest in size and scope, is a good first step towards greater research cooperation and the validation of profanity identification. We identified the most common swear words used in our corpus and, given that many of them were blacklisted, we inferred that the filter had no significant impact on the vocabulary of the users, as many circumvented it. We also surveyed a set of frequent obfuscation techniques that we believe are relevant when dealing with cursing. We concluded that both graphical appearance and pronunciation are important when obfuscating profanity, though not necessarily both at the same time. Knowing that written noise can derive from personal choice [9], it could be interesting to study whether this preference extends to obfuscation decisions.

By providing annotation at a finer granularity (word level instead of message level), we believe that new techniques for word de-obfuscation can be enabled. This could be achieved by adapting the Levenshtein distance (with new operations or statistics-derived costs), or through machine learning.
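As a sketch of the first suggestion, the fragment below implements a Levenshtein distance with a lowered substitution cost for perceptually close character pairs. The pairs and weights are toy assumptions of ours; statistics-derived costs would instead be estimated from annotated pairs such as the ones in this dataset.

```python
# Illustrative cost table: perceptually close substitutions are cheap.
CHEAP_SUBS = {("0", "o"), ("k", "c"), ("@", "a"), ("3", "e")}

def sub_cost(a: str, b: str) -> float:
    if a == b:
        return 0.0
    if (a, b) in CHEAP_SUBS or (b, a) in CHEAP_SUBS:
        return 0.2  # visually/phonetically close characters barely count
    return 1.0

def weighted_levenshtein(s: str, t: str) -> float:
    # Standard dynamic programme over a (len(s)+1) x (len(t)+1) grid,
    # with the substitution cost supplied by sub_cost().
    prev = [float(j) for j in range(len(t) + 1)]
    for i, a in enumerate(s, 1):
        curr = [float(i)]
        for j, b in enumerate(t, 1):
            curr.append(min(prev[j] + 1.0,                  # deletion
                            curr[j - 1] + 1.0,              # insertion
                            prev[j - 1] + sub_cost(a, b)))  # substitution
        prev = curr
    return prev[-1]

print(weighted_levenshtein("ku", "cu"))      # 0.2
print(weighted_levenshtein("f0da", "foda"))  # 0.2
```

Under such costs, disguised forms rank close to their canonical spellings in a dictionary lookup, while unrelated words remain distant.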
Acknowledgements

This project was funded by the UT Austin Portugal International Collaboratory for Emerging Technologies, project UTA-Est/MAI/0006/2009, and by SAPO Labs UP. We also thank Luís Sarmento, Francisca Teixeira and Tó Jó for their help.

References

1. Constant, N., Davis, C., Potts, C., Schwarz, F.: The pragmatics of expressive content: Evidence from large corpora. Sprache und Datenverarbeitung: International Journal for Language Data Processing 33 (2009) 5-21
2. Jay, T., Janschewitz, K.: Filling the emotional gap in linguistic theory: Commentary on Potts' expressive dimension. Theoretical Linguistics 33(2) (2007) 215-221
3. Jay, T.: The utility and ubiquity of taboo words. Perspectives on Psychological Science 4(2) (2009) 153-161
4. Wang, W., Chen, L., Thirunarayan, K., Sheth, A.P.: Cursing in English on Twitter. In: Proceedings of the 17th ACM Conference on Computer Supported Cooperative Work & Social Computing, CSCW '14 (February 2014)
5. Thelwall, M.: Fk yea I swear: cursing and gender in MySpace. Corpora 3(1) (2008) 83-107
6. Sood, S.O., Antin, J., Churchill, E.: Profanity use in online communities. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI '12, New York, NY, USA, ACM (2012) 1481-1490
7. Mehl, M.R., Pennebaker, J.W.: The sounds of social life: A psychometric analysis of students' daily social environments and natural conversations. Journal of Personality and Social Psychology 84(4) (2003) 857-870
8. Crisp, R.J., Heuston, S., Farr, M.J., Turner, R.N.: Seeing red or feeling blue: Differentiated intergroup emotions and ingroup identification in soccer fans. Group Processes & Intergroup Relations 10(1) (2007) 9-26
9. Sousa-Silva, R., Laboreiro, G., Sarmento, L., Grant, T., Oliveira, E., Maia, B.: twazn me!!! ;( Automatic authorship analysis of micro-blogging messages. In: Muñoz, R., Montoyo, A., Métais, E. (eds.): Proceedings of the 16th International Conference on Applications of Natural Language to Information Systems, NLDB 2011. LNCS 6716, Springer (June 2011) 161-168