What We Can Learn from Looking at Profanity

Gustavo Laboreiro and Eugénio Oliveira
LIACC, Universidade do Porto, Faculdade de Engenharia

Abstract. Profanity is a common occurrence in online text. Recent studies found swear words in over 7% of English tweets and 9% of Yahoo! Buzz messages. However, efforts to recognize, understand and deal with profanity do not share resources, namely their datasets, which leads to duplicated effort and non-comparable results. We present here a freely available dataset of 2500 messages from a popular Portuguese sports website. About 20% of the messages contained profanity; we annotated 726 swear words, 510 of which were obfuscated by the authors. We also identified the most frequent profanities, and the methods, and combinations of methods, that people used to disguise their cursing.

1 Introduction and Related Work

In the context of this work we define profanity, curse, swear or taboo words as words used with offensive or vulgar intentions. Although swearing can be studied in the context of multiple disciplines, from the computational perspective it is commonly associated with the automatic identification of abusive comments. Most often, the aim of profanity identification is to censor these words or posts, but profanity is also closely related to sentiment analysis and opinion mining tasks [1], since it can adequately express certain emotions [2-4], mostly negative. Its use seems to depend on several factors, such as gender, age and social class [5, 4].

How common is cursing on-line? Most pages of 16-year-olds on MySpace, and about 15% of pages of middle-aged people, contained strong swearing [5]. 9.28% of comments on Yahoo! Buzz showed profanity [6]. Out of 51 million tweets in English, at least 7.73% of messages contained cursing [4], with swear words representing 1.15% of all words seen, about as frequent as first-person plural pronouns (we, us, our) [7].
While profanity is a common occurrence on-line, correct spelling is not. Curse words are not always written in the same way, a consequence of their use and spread being more oral than written. Graphical diversity is further increased by accidental misspellings and intentional obfuscation. Sood, Antin and Churchill found that 76% of the top profane words were not written correctly, and describe why this variability is a hurdle for list-based profanity detection systems [6].

To study how users obfuscate their texts, we required a dataset annotated for profanity study. As we were unaware of any, we proceeded with the tiresome process of creating our own. Our goals were: i) the messages should relate to a swearing-prone subject; ii) the dataset needs to be of adequate size; iii) the annotation needs to address individual words; and iv) the dataset should be distributable, to avoid the duplication of effort and promote result comparison. We annotated 2500 comments from a large sports news website in Portugal, and made them available, with extra information, at 05/obfuscation-dataset/.

We will next elaborate on the nature of our data, and on the annotation process. Then we will present a number of methods used for obfuscation, and look at their presence in the dataset. Finally, we present our conclusions and how we expect to follow up on our work.

2 Description of the Dataset

Our dataset was based on 2 years of text messages published on SAPO Desporto (http://desporto.sapo.pt/), a sports news website with a strong emphasis on soccer, a sport known to be important for social identity in several countries [8]. We randomly selected 2500 messages, written in Portuguese.

The website checks all posts against a small list of forbidden words (the blacklist), and rejects any message that contains one. Users can choose not to use those taboo words, or they can attempt to bypass the filter in some way. Many took up the challenge, and the filter did not end cursing; it just pushed it into disguise. Hence, this data is appropriate for the study of obfuscation.

The blacklist contained 252 entries, including the most common profanities in the Portuguese language. Of interest to us were 44 curse words, 85 curse word variants (plural, diminutive, etc.), 30 curse words written in a graphical variant (e.g. no diacritical sign, k replacing c, ...), 41 curse word variants written with graphical variants, and 10 foreign curse words and variants. The remainder of the list contained entries used for spam control. This list is distributed with our dataset.

In order to find out how users applied their creativity to overcome the obstacle of censorship, we had to analyse the messages.
Three annotators used their own judgement of what constituted profanity. Once a word was considered swearing, it was tagged in the entire corpus. If it was misspelled, we added its canonical form to the annotation. In the end we identified 521 messages with profanities (1 in every 5), and 726 individual instances of profanity use (we ignored graphical duplicates in the same message), of which 510 were obfuscated.

We can summarise our profanity dictionary in 40 different base profanities, totalling 103 when counting variants. Despite the possibility of profanity variants being used as a kind of obfuscation (e.g., employing "shitful" instead of "shit"), we decided to consider them as distinct profanities.

Of the 103 profanities we identified, 29 were present in the blacklist, and these represented half of the cursing instances that we found. Therefore, SAPO targeted the most frequent swearing terms, but the filter failed in the face of obfuscation. Let us take a look at the methods that were used to bypass the filter.
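The blacklist check described in this section can be illustrated with a minimal sketch. It assumes that the filter lower-cases a message, splits it into words, and rejects the message when any word matches a blacklist entry exactly; the actual SAPO implementation is not described in this paper, and the function name `is_rejected` and the one-entry blacklist are hypothetical. The example spellings are obfuscations reported in this paper.

```python
import re

# Hypothetical one-entry blacklist; the real list had 252 entries.
BLACKLIST = {"merda"}


def is_rejected(message, blacklist=BLACKLIST):
    """Return True if any word of the message is an exact blacklist entry."""
    # Assumption: a word is a maximal run of letters, digits or underscores
    # (accented letters included, since str patterns are Unicode-aware).
    words = re.findall(r"\w+", message.lower())
    return any(word in blacklist for word in words)


if __name__ == "__main__":
    print(is_rejected("que merda de jogo"))          # canonical spelling
    print(is_rejected("que me-r.da de jogo"))        # punctuation added
    print(is_rejected("que merddddddddda de jogo"))  # letter repetition
```

Under these assumptions only the canonical spelling is rejected: added punctuation splits the word into harmless fragments, and a repeated letter produces a string that no longer equals any list entry.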

3 Obfuscation Methods

We were able to identify a total of 17 different ways in which the words we found deviated from their canonical spelling. They are described below, next to the symbols we assigned to represent them.

-Ac  Accent removed. A diacritical mark is removed from the word. For example, cabrão becomes cabrao, or piço becomes pico.
-C   Characters removed. Letters and/or symbols are removed from the word.
+Ac  Accent added. A diacritical mark is added where it is not supposed to exist. For example, we see cócó instead of cocó. This alteration seldom had any phonetic impact on the words.
+L   Letters added. Extra letters that are not a repetition of the preceding letter are added to the word, as in pandeleiro instead of paneleiro.
+N   Number added. A seemingly random number is added to the word.
+P   Punctuation added. A punctuation sign (such as ".", "," or "-") is inserted into the word. These characters are chosen because they are easy to distinguish from letters. Two examples: f-o-d-a-m and me-r.da.
+S   Symbols added. A symbol not from our punctuation set is inserted in a word. One example is fu der, meaning foder.
+Sp  Spaces added. A space is employed to break the word into two or more segments. E.g., co rnos, p u t a.
=Ac  Change accent. One letter with an accent is replaced by the same letter bearing a different accent. We saw cù instead of cú many times.
=L   Letters substituted. One letter is replaced by one other letter. Usually this change does not alter the pronunciation of the word.
=N   Number as a letter. A number takes the place of a letter. Often the digit resembles the letter somewhat. As an example, foda becomes f0da.
=P   Punctuation as a letter. One of the characters of our punctuation set is used as a placeholder for one or more letters. For example, p... for puta.
=S   Symbol as a letter. A symbol from outside our punctuation set is used as a letter. A common occurrence is a symbol standing in for the letter a.
Ag   Word aggregation. Two words are combined into just one. For example, americuzinho, combining Américo and cuzinho ("cu" and "co" sounding similar in this case).
Cp   Complex alteration. Forms of obfuscation that are too complex to be described with the other methods. A common occurrence is fdp, the initials of the Portuguese for "son of a bitch" ("sob").
Ph   Phonetic-driven substitution. The words sound similar, but differ beyond a simple letter substitution. E.g., fodassse instead of foda-se.
Pun  Pun. The author relies on the context to explain the obfuscation.
R    Repetition. One letter is repeated, as in merddddddddda.

These operations were selected to provide a descriptive view, rather than to provide the smallest set of operations that could transform a word from its canonical representation. We focus on the way the reader perceives the obfuscation method, since multiple combinations can lead to the same result (e.g. substitution vs. insertion and removal). We will see that authors tend to choose methods that are easy to understand.

4 Analysis of the dataset

In this work our main concern was method choice. The number of times each method is used on each word strongly depends on word length, and provides little insight into how to reverse it. Also, some methods are more prone to overuse (e.g. Repetition and Punctuation added) than others (e.g. Letters substituted or Accent removed). Thus, if the author uses two letter substitutions in the same word, we count it as one.

We divided our analysis into two types of obfuscation: those that maintain word length and those that alter word length. We then look at how many operations were combined to obfuscate each word, as an indicator of complexity.

In general, the length of a word provides an additional clue that helps the reader recognize it, even when disguised. We found 261 obfuscations keeping word length (out of 510). Many obfuscation methods cannot be used if word length is to be preserved. Tables 1a and 1b show the absolute frequency of the methods we saw. Letter substitution was the most popular choice, which can be explained by c and k usually being phonetically similar, and 1/3 of the curse words starting with a c. Obfuscation through only one method (Table 1a) is achieved mostly by substitutions (=L, =N and =S) or accent manipulation (=Ac and -Ac). When two methods are used (Table 1b), the clear preference lies in the combination of =L and -Ac, mostly by writing cú (ass) as ku.

When word length no longer constrains their efforts, authors show different preferences. In Tables 1c and 1d we can see the method choice distribution across the remaining 189 obfuscation instances. If no other method is used, Table 1c shows that Repetition (R) is the preferred choice, possibly because it calls attention to the word itself, and makes the modification obvious.
The same characteristics can be claimed for the insertion of easy-to-ignore noise (+P and +Sp). Making puns was the fourth most popular method, something that is difficult to address automatically. When two methods are used, there is no clear predominance, as shown in Table 1d. The use of symbols as letters (=S) and repeating letters (R) are seen more frequently than the other methods, even if they are not often combined with each other, which we found curious.

We also accounted for the rare concurrent use of three obfuscation methods. We saw the combination =N Ph R three times (e.g., f000daseeeeeee for fodase), while -Ac =L +Sp were seen together once (ka brao instead of cabrão).

5 Conclusion and Future Work

We hope that our work, while modest in size and scope, is a good first step towards greater research cooperation and validation of profanity identification.

Table 1: How often each method was employed in our dataset. Maintaining word length: (a) and (b); changing word length: (c) and (d).

(a) One method, word length maintained

Method  Count
-Ac     19
+Ac     7
=Ac     22
=L      102
=N      36
=P      3
=S      22
=Sp     1
Ph      2
Pun     7

(b) Two methods, word length maintained (cell alignment not recoverable; headers and row entries are given in their transcribed order)

Columns: =L, =N, =S, R
Rows: Ac; Ac 1; =Ac 1; =N 1 5; Ph 1 1; Pun 2

(c) One method, word length changed

Method  Count
-C      8
+L      4
+P      38
+S      3
+Sp     21
Ag      10
Cp      4
Ph      4
Pun     20
R       77

(d) Two methods, word length changed (cell alignment not recoverable; headers and row entries are given in their transcribed order)

Columns: +P, +S, +Sp, =L, =P, =S, Ag, Cp, Ph, Pun, R
Rows: Ac 1 5; C; Ac 3; +L; N 1; +P; S; Sp; =Ac 1; =L 3; =N 1 4; =P 3; =S; Ag 1; R 1 2

We identified the most common swear words used in our corpus and, given that many were blacklisted, we inferred that the filter had no significant impact on the vocabulary of the users, as many circumvented it. We also surveyed a set of frequent obfuscation techniques that we believe to be relevant when dealing with cursing. We concluded that both graphical appearance and pronunciation are important when obfuscating profanity, but not necessarily both at the same time.

Knowing that written noise can derive from personal choice [9], it could be interesting to study whether this preference extends to obfuscation decisions. By providing an annotation at a finer granularity (word level instead of message level), we believe that new techniques for word de-obfuscation can be enabled. This could be achieved by adapting the Levenshtein distance (new operations or statistics-derived costs), or through machine learning.

Acknowledgements

This project was funded by the UT Austin Portugal International Collaboratory for Emerging Technologies, project UTA-Est/MAI/0006/2009, and SAPO Labs UP. Also thanks to Luís Sarmento, Francisca Teixeira and Tó Jó for their help.
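The adaptation of the Levenshtein distance suggested in the conclusion can be sketched as follows. This is only an illustration under stated assumptions, not a method evaluated in this work: the cost values, the look-alike pairs in `LOOKALIKE`, the noise characters in `NOISE`, and the function names are all hypothetical, and the costs are hand-picked rather than derived from statistics. The idea is that edits matching common obfuscation methods (a digit standing in for a letter, added punctuation or spaces, repeated letters) are made cheaper than arbitrary edits.

```python
# Hypothetical (observed, canonical) character pairs that are cheap to swap.
LOOKALIKE = {("0", "o"), ("1", "i"), ("3", "e"), ("k", "c"), ("@", "a")}
# Hypothetical set of characters treated as easy-to-ignore noise.
NOISE = set(".,-_ ")

CHEAP = 0.1      # assumed cost of undoing noise or repetition
SIMILAR = 0.2    # assumed cost of a look-alike substitution
FULL = 1.0       # cost of any other edit


def deletion_cost(word, i):
    """Cost of dropping word[i] from the observed spelling."""
    ch = word[i]
    if ch in NOISE:
        return CHEAP                      # added punctuation or space
    if i > 0 and word[i - 1] == ch:
        return CHEAP                      # repetition of the previous letter
    return FULL


def substitution_cost(seen, canon):
    """Cost of reading the observed character as the canonical one."""
    if seen == canon:
        return 0.0
    return SIMILAR if (seen, canon) in LOOKALIKE else FULL


def weighted_distance(observed, canonical):
    """Weighted edit distance from an observed spelling to a canonical word."""
    m, n = len(observed), len(canonical)
    # dist[i][j] is the cost of turning observed[:i] into canonical[:j].
    dist = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dist[i][0] = dist[i - 1][0] + deletion_cost(observed, i - 1)
    for j in range(1, n + 1):
        dist[0][j] = dist[0][j - 1] + FULL
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dist[i][j] = min(
                dist[i - 1][j] + deletion_cost(observed, i - 1),
                dist[i][j - 1] + FULL,  # observed lacks a canonical letter
                dist[i - 1][j - 1]
                + substitution_cost(observed[i - 1], canonical[j - 1]),
            )
    return dist[m][n]


if __name__ == "__main__":
    for spelling in ("f0da", "f-o-d-a", "fooooda", "fado"):
        print(spelling, round(weighted_distance(spelling, "foda"), 2))
```

With these assumed costs, the obfuscated spellings stay much closer to the canonical word than an unrelated word of similar length does, which is the property a de-obfuscation step would rely on. Costs estimated from an annotated corpus would replace the hand-picked constants.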

References

1. Constant, N., Davis, C., Potts, C., Schwarz, F.: The pragmatics of expressive content: Evidence from large corpora. Sprache und Datenverarbeitung: International Journal for Language Data Processing (33) (2009)
2. Jay, T., Janschewitz, K.: Filling the emotional gap in linguistic theory: Commentary on Potts' expressive dimension. (33) (2007)
3. Jay, T.: The utility and ubiquity of taboo words. 4(2) (2009)
4. Wang, W., Chen, L., Thirunarayan, K., Sheth, A.P.: Cursing in English on Twitter. In: Proceedings of the 17th ACM Conference on Computer Supported Cooperative Work & Social Computing. CSCW '14 (February 2014)
5. Thelwall, M.: Fk yea I swear: cursing and gender in MySpace. Corpora 3(1) (2008)
6. Sood, S.O., Antin, J., Churchill, E.: Profanity use in online communities. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. CHI '12, New York, NY, USA, ACM (2012)
7. Mehl, M.R., Pennebaker, J.W.: The Sounds of Social Life: A Psychometric Analysis of Students' Daily Social Environments and Natural Conversations. Journal of Personality and Social Psychology 84(4) (2003)
8. Crisp, R.J., Heuston, S., Farr, M.J., Turner, R.N.: Seeing Red or Feeling Blue: Differentiated Intergroup Emotions and Ingroup Identification in Soccer Fans
9. Sousa-Silva, R., Laboreiro, G., Sarmento, L., Grant, T., Oliveira, E., Maia, B.: twazn me!!! ;( Automatic Authorship Analysis of Micro-Blogging Messages. In Muñoz, R., Montoyo, A., Métais, E., eds.: Proceedings of the 16th International Conference on Applications of Natural Language to Information Systems, NLDB. Number LNCS 6716 in Lecture Notes in Computer Science, Springer (Jun 2011)


More information

Understanding the popularity of reporters and assignees in the Github

Understanding the popularity of reporters and assignees in the Github Understanding the popularity of reporters and assignees in the Github Joicy Xavier, Autran Macedo, Marcelo de A. Maia Computer Science Department Federal University of Uberlândia Uberlândia, Minas Gerais,

More information

Approaches to Arabic Name Transliteration and Matching in the DataFlux Quality Knowledge Base

Approaches to Arabic Name Transliteration and Matching in the DataFlux Quality Knowledge Base 32 Approaches to Arabic Name Transliteration and Matching in the DataFlux Quality Knowledge Base Brant N. Kay Brian C. Rineer SAS Institute Inc. SAS Institute Inc. 100 SAS Campus Drive 100 SAS Campus Drive

More information

Guidelines for Improved Search Engine Rankings

Guidelines for Improved Search Engine Rankings Guidelines for Improved Search Engine Rankings Search Engine Optimization Guide for Content Authors The purpose of this guide is to provide a process that can be used to create URMC Web site pages that

More information

Social Media Resources

Social Media Resources Social Media Resources Policy Option 1 This policy applies to the social networking activity of all employees, contractors, business partners or other parties with a material interest in [COMPANY], and

More information

Strategies for Effective Tweeting: A Statistical Review

Strategies for Effective Tweeting: A Statistical Review Strategies for Effective Tweeting: A Statistical Review DATA REPORT Introduction 3 Methodology 4 Weekends Are Good for Relaxing and Tweeting 5 Best Days to Tweet By Industry 6 When Followers Are Busy Give

More information

INFO 2950 Intro to Data Science. Lecture 17: Power Laws and Big Data

INFO 2950 Intro to Data Science. Lecture 17: Power Laws and Big Data INFO 2950 Intro to Data Science Lecture 17: Power Laws and Big Data Paul Ginsparg Cornell University, Ithaca, NY 29 Oct 2013 1/25 Power Laws in log-log space y = cx k (k=1/2,1,2) log 10 y = k log 10 x

More information

Sentiment Analysis for Movie Reviews

Sentiment Analysis for Movie Reviews Sentiment Analysis for Movie Reviews Ankit Goyal, a3goyal@ucsd.edu Amey Parulekar, aparulek@ucsd.edu Introduction: Movie reviews are an important way to gauge the performance of a movie. While providing

More information

Classification of Virtual Investing-Related Community Postings

Classification of Virtual Investing-Related Community Postings Classification of Virtual Investing-Related Community Postings Balaji Rajagopalan * Oakland University rajagopa@oakland.edu Matthew Wimble Oakland University mwwimble@oakland.edu Prabhudev Konana * University

More information

Sentiment analysis of Twitter microblogging posts. Jasmina Smailović Jožef Stefan Institute Department of Knowledge Technologies

Sentiment analysis of Twitter microblogging posts. Jasmina Smailović Jožef Stefan Institute Department of Knowledge Technologies Sentiment analysis of Twitter microblogging posts Jasmina Smailović Jožef Stefan Institute Department of Knowledge Technologies Introduction Popularity of microblogging services Twitter microblogging posts

More information

A first-use tutorial for the VIP Analysis software

A first-use tutorial for the VIP Analysis software A first-use tutorial for the VIP Analysis software Luis C. Dias INESC Coimbra & School of Economics - University of Coimbra Av Dias da Silva, 165, 3004-512 Coimbra, Portugal LMCDias@fe.uc.pt The software

More information

AP PSYCHOLOGY 2013 SCORING GUIDELINES

AP PSYCHOLOGY 2013 SCORING GUIDELINES AP PSYCHOLOGY 2013 SCORING GUIDELINES Question 2 General Considerations 1. Answers must be presented in sentences, and sentences must be cogent enough for the student s meaning to come through. Spelling

More information

Content Filters A WORD TO THE WISE WHITE PAPER BY LAURA ATKINS, CO- FOUNDER

Content Filters A WORD TO THE WISE WHITE PAPER BY LAURA ATKINS, CO- FOUNDER Content Filters A WORD TO THE WISE WHITE PAPER BY LAURA ATKINS, CO- FOUNDER CONTENT FILTERS 2 Introduction Content- based filters are a key method for many ISPs and corporations to filter incoming email..

More information

Folksonomies versus Automatic Keyword Extraction: An Empirical Study

Folksonomies versus Automatic Keyword Extraction: An Empirical Study Folksonomies versus Automatic Keyword Extraction: An Empirical Study Hend S. Al-Khalifa and Hugh C. Davis Learning Technology Research Group, ECS, University of Southampton, Southampton, SO17 1BJ, UK {hsak04r/hcd}@ecs.soton.ac.uk

More information

Sentiment Analysis Tool using Machine Learning Algorithms

Sentiment Analysis Tool using Machine Learning Algorithms Sentiment Analysis Tool using Machine Learning Algorithms I.Hemalatha 1, Dr. G. P Saradhi Varma 2, Dr. A.Govardhan 3 1 Research Scholar JNT University Kakinada, Kakinada, A.P., INDIA 2 Professor & Head,

More information

Introduction to Social Media

Introduction to Social Media Introduction to Social Media Today s Discussion Overview of Web 2.0 and social media tools How EPA and other agencies are using these tools Agency and governmentwide policies governing use of tools Case

More information

Importance of Online Product Reviews from a Consumer s Perspective

Importance of Online Product Reviews from a Consumer s Perspective Advances in Economics and Business 1(1): 1-5, 2013 DOI: 10.13189/aeb.2013.010101 http://www.hrpub.org Importance of Online Product Reviews from a Consumer s Perspective Georg Lackermair 1,2, Daniel Kailer

More information

Social networking guidelines and information

Social networking guidelines and information Social networking guidelines and information Introduction Social media is an emerging and changing landscape. The digital marketing communications team and corporate communications have a social media

More information

Reputation Management System

Reputation Management System Reputation Management System Mihai Damaschin Matthijs Dorst Maria Gerontini Cihat Imamoglu Caroline Queva May, 2012 A brief introduction to TEX and L A TEX Abstract Chapter 1 Introduction Word-of-mouth

More information

Cross-lingual Synonymy Overlap

Cross-lingual Synonymy Overlap Cross-lingual Synonymy Overlap Anca Dinu 1, Liviu P. Dinu 2, Ana Sabina Uban 2 1 Faculty of Foreign Languages and Literatures, University of Bucharest 2 Faculty of Mathematics and Computer Science, University

More information

French Language and Culture. Curriculum Framework 2011 2012

French Language and Culture. Curriculum Framework 2011 2012 AP French Language and Culture Curriculum Framework 2011 2012 Contents (click on a topic to jump to that page) Introduction... 3 Structure of the Curriculum Framework...4 Learning Objectives and Achievement

More information

PGR Computing Programming Skills

PGR Computing Programming Skills PGR Computing Programming Skills Dr. I. Hawke 2008 1 Introduction The purpose of computing is to do something faster, more efficiently and more reliably than you could as a human do it. One obvious point

More information

Local Culture in Global English:

Local Culture in Global English: Local Culture in Global English: a case study of Kultur in Sprache / Sprachwissenschaft in Kulturwissenschaften Josef Schmied Chair English Language & Linguistics Chemnitz University of Technology www.tu-chemnitz.de/phil/english/linguist

More information

Doctoral Consortium 2013 Dept. Lenguajes y Sistemas Informáticos UNED

Doctoral Consortium 2013 Dept. Lenguajes y Sistemas Informáticos UNED Doctoral Consortium 2013 Dept. Lenguajes y Sistemas Informáticos UNED 17 19 June 2013 Monday 17 June Salón de Actos, Facultad de Psicología, UNED 15.00-16.30: Invited talk Eneko Agirre (Euskal Herriko

More information

Salesforce ExactTarget Marketing Cloud Radian6 Mobile User Guide

Salesforce ExactTarget Marketing Cloud Radian6 Mobile User Guide Salesforce ExactTarget Marketing Cloud Radian6 Mobile User Guide 7/14/2014 Table of Contents Get Started Download the Radian6 Mobile App Log In to Radian6 Mobile Set up a Quick Search Navigate the Quick

More information

Why is Internal Audit so Hard?

Why is Internal Audit so Hard? Why is Internal Audit so Hard? 2 2014 Why is Internal Audit so Hard? 3 2014 Why is Internal Audit so Hard? Waste Abuse Fraud 4 2014 Waves of Change 1 st Wave Personal Computers Electronic Spreadsheets

More information

Numerical Summarization of Data OPRE 6301

Numerical Summarization of Data OPRE 6301 Numerical Summarization of Data OPRE 6301 Motivation... In the previous session, we used graphical techniques to describe data. For example: While this histogram provides useful insight, other interesting

More information

Investigating Clinical Care Pathways Correlated with Outcomes

Investigating Clinical Care Pathways Correlated with Outcomes Investigating Clinical Care Pathways Correlated with Outcomes Geetika T. Lakshmanan, Szabolcs Rozsnyai, Fei Wang IBM T. J. Watson Research Center, NY, USA August 2013 Outline Care Pathways Typical Challenges

More information

Micro blogs Oriented Word Segmentation System

Micro blogs Oriented Word Segmentation System Micro blogs Oriented Word Segmentation System Yijia Liu, Meishan Zhang, Wanxiang Che, Ting Liu, Yihe Deng Research Center for Social Computing and Information Retrieval Harbin Institute of Technology,

More information

Blogging- A Powerful Public Relation Marketing tool, A study of public. awareness with reference to Nagpur City.

Blogging- A Powerful Public Relation Marketing tool, A study of public. awareness with reference to Nagpur City. Blogging- A Powerful Public Relation Marketing tool, A study of public awareness with reference to Nagpur City. Ms. Minal Kashyap minal_kashyap@yahoo.co.in Ms. Kamal Satija kamalsatija.s@gmail.com Abstract-

More information

Search Engine Optimisation (SEO)

Search Engine Optimisation (SEO) WEB DESIGN DIGITAL MARKETING BRANDING ADVERTISING Keyword Research Definitely number one on the list; your entire search engine optimisation programme will revolve around your chosen Keywords. Which search

More information

Quality Assurance at NEMT, Inc.

Quality Assurance at NEMT, Inc. Quality Assurance at NEMT, Inc. Quality Assurance Policy NEMT prides itself on the excellence of quality within every level of the company. We strongly believe in the benefits of continued education and

More information

Social Media ROI. First Priority for a Social Media Strategy: A Brand Audit Using a Social Media Monitoring Tool. Whitepaper

Social Media ROI. First Priority for a Social Media Strategy: A Brand Audit Using a Social Media Monitoring Tool. Whitepaper Whitepaper LET S TALK: Social Media ROI With Connie Bensen First Priority for a Social Media Strategy: A Brand Audit Using a Social Media Monitoring Tool 4th in the Social Media ROI Series Executive Summary:

More information

Local Culture in Global English:

Local Culture in Global English: Local Culture in Global English: a case study of Kultur in Sprache / Sprachwissenschaft in Kulturwissenschaften Josef Schmied Chair English Language & Linguistics Chemnitz University of Technology www.tu-chemnitz.de

More information

Adaptive Filtering of SPAM

Adaptive Filtering of SPAM Adaptive Filtering of SPAM L. Pelletier, J. Almhana, V. Choulakian GRETI, University of Moncton Moncton, N.B.,Canada E1A 3E9 {elp6880, almhanaj, choulav}@umoncton.ca Abstract In this paper, we present

More information

THE BACHELOR S DEGREE IN SPANISH

THE BACHELOR S DEGREE IN SPANISH Academic regulations for THE BACHELOR S DEGREE IN SPANISH THE FACULTY OF HUMANITIES THE UNIVERSITY OF AARHUS 2007 1 Framework conditions Heading Title Prepared by Effective date Prescribed points Text

More information

DECISION TREE ANALYSIS: PREDICTION OF SERIOUS TRAFFIC OFFENDING

DECISION TREE ANALYSIS: PREDICTION OF SERIOUS TRAFFIC OFFENDING DECISION TREE ANALYSIS: PREDICTION OF SERIOUS TRAFFIC OFFENDING ABSTRACT The objective was to predict whether an offender would commit a traffic offence involving death, using decision tree analysis. Four

More information

Microblog Sentiment Analysis with Emoticon Space Model

Microblog Sentiment Analysis with Emoticon Space Model Microblog Sentiment Analysis with Emoticon Space Model Fei Jiang, Yiqun Liu, Huanbo Luan, Min Zhang, and Shaoping Ma State Key Laboratory of Intelligent Technology and Systems, Tsinghua National Laboratory

More information

Quality Assurance at NEMT, Inc.

Quality Assurance at NEMT, Inc. Quality Assurance at NEMT, Inc. Quality Assurance Policy NEMT prides itself on the excellence of quality within every level of the company. We strongly believe in the benefits of continued education and

More information

Language and Literacy

Language and Literacy Language and Literacy In the sections below is a summary of the alignment of the preschool learning foundations with (a) the infant/toddler learning and development foundations, (b) the common core state

More information

News media analysis at Lab SAPO UPorto. Jorge Teixeira

News media analysis at Lab SAPO UPorto. Jorge Teixeira News media analysis at Lab SAPO UPorto Jorge Teixeira Past deliverables and visualization prototypes Twitómetro Twitteuro Mundo Visto Daqui interativo (MVDi) On-going work Mundo Numa Rede Sapo Notícias

More information

Contact Recommendations from Aggegrated On-Line Activity

Contact Recommendations from Aggegrated On-Line Activity Contact Recommendations from Aggegrated On-Line Activity Abigail Gertner, Justin Richer, and Thomas Bartee The MITRE Corporation 202 Burlington Road, Bedford, MA 01730 {gertner,jricher,tbartee}@mitre.org

More information

Guide to Digital Marketing for Business-To-Business (B2B)!

Guide to Digital Marketing for Business-To-Business (B2B)! o2markit consulting ltd Guide to Digital Marketing for Business-To-Business (B2B) A brief guide aimed at small business or start-up executives who are not Marketing professionals but who need to understand

More information

Twitter Sentiment Analysis of Movie Reviews using Machine Learning Techniques.

Twitter Sentiment Analysis of Movie Reviews using Machine Learning Techniques. Twitter Sentiment Analysis of Movie Reviews using Machine Learning Techniques. Akshay Amolik, Niketan Jivane, Mahavir Bhandari, Dr.M.Venkatesan School of Computer Science and Engineering, VIT University,

More information

INBOX. How to make sure more emails reach your subscribers

INBOX. How to make sure more emails reach your subscribers INBOX How to make sure more emails reach your subscribers White Paper 2011 Contents 1. Email and delivery challenge 2 2. Delivery or deliverability? 3 3. Getting email delivered 3 4. Getting into inboxes

More information

Twitter Stock Bot. John Matthew Fong The University of Texas at Austin jmfong@cs.utexas.edu

Twitter Stock Bot. John Matthew Fong The University of Texas at Austin jmfong@cs.utexas.edu Twitter Stock Bot John Matthew Fong The University of Texas at Austin jmfong@cs.utexas.edu Hassaan Markhiani The University of Texas at Austin hassaan@cs.utexas.edu Abstract The stock market is influenced

More information

Data Deduplication in Slovak Corpora

Data Deduplication in Slovak Corpora Ľ. Štúr Institute of Linguistics, Slovak Academy of Sciences, Bratislava, Slovakia Abstract. Our paper describes our experience in deduplication of a Slovak corpus. Two methods of deduplication a plain

More information

Enhancing the relativity between Content, Title and Meta Tags Based on Term Frequency in Lexical and Semantic Aspects

Enhancing the relativity between Content, Title and Meta Tags Based on Term Frequency in Lexical and Semantic Aspects Enhancing the relativity between Content, Title and Meta Tags Based on Term Frequency in Lexical and Semantic Aspects Mohammad Farahmand, Abu Bakar MD Sultan, Masrah Azrifah Azmi Murad, Fatimah Sidi me@shahroozfarahmand.com

More information