Identification of Author Personality Traits using Stylistic Features

Similar documents
Author Gender Identification of English Novels

Mining Online Diaries for Blogger Identification

Author Identification for Turkish Texts

Blogs and Twitter Feeds: A Stylometric Environmental Impact Study

VCU-TSA at Semeval-2016 Task 4: Sentiment Analysis in Twitter

Web Document Clustering

Effects of Age and Gender on Blogging

University of Glasgow Terrier Team / Project Abacá at RepLab 2014: Reputation Dimensions Task

Language Lessons. Secondary Child

MODULE 15 Diagram the organizational structure of your company.

Sentiment Analysis of Movie Reviews and Twitter Statuses. Introduction

Semi-Supervised Learning for Blog Classification

Livingston Public Schools Scope and Sequence K 6 Grammar and Mechanics

A Writer s Reference, Seventh Edition Diana Hacker Nancy Sommers

Robust Sentiment Detection on Twitter from Biased and Noisy Data

Predicting Age and Gender in Online Social Networks

10th Grade Language. Goal ISAT% Objective Description (with content limits) Vocabulary Words

A Unified Data Mining Solution for Authorship Analysis in Anonymous Textual Communications

Identifying Thesis and Conclusion Statements in Student Essays to Scaffold Peer Review

Can We Hide in the Web? Large Scale Simultaneous Age and Gender Author Profiling in Social Media

Virginia English Standards of Learning Grade 8

Feature Selection for Electronic Negotiation Texts

Translate once, translate twice, translate thrice and attribute: Identifying authors and machine translation tools in translated text

Evaluation of Authorship Attribution Software on a Chat Bot Corpus

Clustering Connectionist and Statistical Language Processing

C o p yr i g ht 2015, S A S I nstitute Inc. A l l r i g hts r eser v ed. INTRODUCTION TO SAS TEXT MINER

ONLINE ENGLISH LANGUAGE RESOURCES

Subordinating Ideas Using Phrases It All Started with Sputnik

Interpreting areading Scaled Scores for Instruction

CIRGIRDISCO at RepLab2014 Reputation Dimension Task: Using Wikipedia Graph Structure for Classifying the Reputation Dimension of a Tweet

Grade 4 Writing Assessment. Eligible Texas Essential Knowledge and Skills

Interest Rate Prediction using Sentiment Analysis of News Information

Morphology. Morphology is the study of word formation, of the structure of words. 1. some words can be divided into parts which still have meaning

Certificate Programs

Automated Content Analysis of Discussion Transcripts

Efficient Techniques for Improved Data Classification and POS Tagging by Monitoring Extraction, Pruning and Updating of Unknown Foreign Words

Uncovering Plagiarism, Authorship, and Social Software Misuse

Summarizing Online Forum Discussions Can Dialog Acts of Individual Messages Help?

Writing Common Core KEY WORDS

Grammar Presentation: The Sentence

12 FIRST QUARTER. Class Assignments

Sentiment analysis: towards a tool for analysing real-time students feedback

Cohesive writing 1. Conjunction: linking words What is cohesive writing?

Université de Montpellier 2 Hugo Alatrista-Salas : hugo.alatrista-salas@teledetection.fr

Albert Pye and Ravensmere Schools Grammar Curriculum

Understanding Clauses and How to Connect Them to Avoid Fragments, Comma Splices, and Fused Sentences A Grammar Help Handout by Abbie Potter Henry

Unsupervised Anomaly Detection

Strand: Reading Literature Topics Standard I can statements Vocabulary Key Ideas and Details

LANGUAGE! 4 th Edition, Levels A C, correlated to the South Carolina College and Career Readiness Standards, Grades 3 5

Testing Data-Driven Learning Algorithms for PoS Tagging of Icelandic

Sentiment analysis on news articles using Natural Language Processing and Machine Learning Approach.

Student Guide for Usage of Criterion

The multilayer sentiment analysis model based on Random forest Wei Liu1, Jie Zhang2

Forecasting stock markets with Twitter

An Introduction to WEKA. As presented by PACE

Projektgruppe. Categorization of text documents via classification

Cross-lingual Synonymy Overlap

Ask your teacher about any which you aren t sure of, especially any differences.

ANALYSIS OF LEXICO-SYNTACTIC PATTERNS FOR ANTONYM PAIR EXTRACTION FROM A TURKISH CORPUS

Customer Intentions Analysis of Twitter Based on Semantic Patterns

UNIVERSITÀ DEGLI STUDI DELL AQUILA CENTRO LINGUISTICO DI ATENEO

The eyes of the beholder: Gender prediction using images posted in Online Social Networks

Compound Sentences and Coordination

Crowdfunding Support Tools: Predicting Success & Failure

English for Academic Skills Independence [EASI]

Numerical Algorithms for Predicting Sports Results

Grammar Rules: Parts of Speech Words are classed into eight categories according to their uses in a sentence.

Extension of Decision Tree Algorithm for Stream Data Mining Using Real Data

MIRACLE at VideoCLEF 2008: Classification of Multilingual Speech Transcripts

ESL 005 Advanced Grammar and Paragraph Writing

Editing and Proofreading. University Learning Centre Writing Help Ron Cooley, Professor of English

Keywords social media, internet, data, sentiment analysis, opinion mining, business

Clauses and Phrases. For Proper Sentence Structure

Modeling Suspicious Detection Using Enhanced Feature Selection

Sentiment analysis on tweets in a financial domain

EFFICIENTLY PROVIDE SENTIMENT ANALYSIS DATA SETS USING EXPRESSIONS SUPPORT METHOD

Identifying Focus, Techniques and Domain of Scientific Papers

Using an Emotion-based Model and Sentiment Analysis Techniques to Classify Polarity for Reputation

Introduction Predictive Analytics Tools: Weka

Ling 201 Syntax 1. Jirka Hana April 10, 2006

YouTube Comment Classifier. mandaku2, rethina2, vraghvn2

1. Define and Know (D) 2. Recognize (R) 3. Apply automatically (A) Objectives What Students Need to Know. Standards (ACT Scoring Range) Resources

Speech and Network Marketing Model - A Review

Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms

NILC USP: A Hybrid System for Sentiment Analysis in Twitter Messages

Building a Question Classifier for a TREC-Style Question Answering System

How the Computer Translates. Svetlana Sokolova President and CEO of PROMT, PhD.

Keywords Data mining, Classification Algorithm, Decision tree, J48, Random forest, Random tree, LMT, WEKA 3.7. Fig.1. Data mining techniques.

CS 229, Autumn 2011 Modeling the Stock Market Using Twitter Sentiment Analysis

BOOSTING - A METHOD FOR IMPROVING THE ACCURACY OF PREDICTIVE MODEL

Data Mining Practical Machine Learning Tools and Techniques

II. RELATED WORK. Sentiment Mining

Web opinion mining: How to extract opinions from blogs?

Multisensory Grammar Online

Why language is hard. And what Linguistics has to say about it. Natalia Silveira Participation code: eagles

INTERNATIONAL UNIVERSITY OF JAPAN Public Management and Policy Analysis Program Graduate School of International Relations

Introducing diversity among the models of multi-label classification ensemble

Sentiment Analysis: a case study. Giuseppe Castellucci castellucci@ing.uniroma2.it

Social Media Mining. Data Mining Essentials

BUSINESS COMMUNICATION. Competency: Grammar Task: Use a verb that correctly agrees with the subject of a sentence.

Transcription:

Identification of Author Personality Traits using Stylistic Features Notebook for PAN at CLEF 2015 Ifrah Pervaz, Iqra Ameer, Abdul Sittar, Rao Muhammad Adeel Nawab COMSATS Instiute of Inforamtion Technology, Lahore ifrahpervaz23@gmail.com, iqraameer133@gmail.com, abdulsittar72@gmail.com, adeelnawab@ciitlahore.edu.pk Abstract. Author profiling is the task of determining the age, gender or type of the author's personality by studying their sociolect aspect, that is, how the language is shared by people. This paper presents the COMSATS Institute of Information Technology, Lahore entry for the PAN 2015 competition on Author Profiling task. Our proposed system is based on stylometry features. We implemented 29 different stylistic features, many of which are language independent. Since the training data was available in multiple languages, one of our main objectives was to explore which language independent features are most effective. The problem of author profiling was casted as a supervised document classification task. Results showed that features (Percentage of Question Sentences, Average Sentence Length, Percentage of Punctuations, Percentage of Comma and Percentage of Full stops) were most effective multilingual features. 1 Introduction Personality in Encyclopedia of Psychology is defined as individual differences in characteristic patterns of thinking, feeling and behaving. [1]These differences can be reflected through one s speech, writing, images etc. Authorship analysis deals to find out the methods to identify these differences and use them to know profile of author such as gender, age, native language, education, profession or personality type. So author profiling can simply be defined as: given the set of texts, you need to identify age, gender, profession, education, native language and similar personality traits. 1

Authorship analysis has attracted much attention in recent years due to the rapid increase of electronic text and the need for expert systems able to handle this information. Like from a marketing viewpoint, companies may be interested in knowing about the trends regarding their products, on the basis of the analysis of blogs and online product reviews, what types of people like or dislike their products, what is required by the customer and which customer category/ class they can attract more. Similarly from a forensic viewpoint, determining the linguistic profile of a person i.e. who wrote a "suspicious text" may provide valuable background information. In this paper we explore how different stylistic features help in conveying about the personality of writer. Moreover, we also tried to find out how the use of different stylistic features affect multilingual results. We used the datasets provided by PAN organizers, and applied different machine learning practices for predicting writer s traits. The rest of this paper is organized as follows: Section 2 describes related work. Section 3 presents the approach used to identify personality traits of author. Section 4 describes the results obtained on training data. Section 5 presents the results obtained on test data and section 6 concludes the paper. 2 Related work With the advancement in web technology, there is an abundant increase in use of electronic media. Along with useful information there is also heap of anonymous data available so the need to identify who is who has become important. A lot of establishments have been done in this field so far, Pennebaker et al. [2] joined dialect utilization with identity qualities, examining how the variety of semantic attributes in a content can give data in regards to the gender and age of its author. Argamon et al. [3] analyzed formal written texts extracted from the British National Corpus [4] combining function words with part-of-speech features. Koppel studied the problem of automatically determining an author s gender by proposing combinations of simple lexical and syntactic features. Holmes and Meyerhoff [5], Burger and Henderson [6] have also investigated obtaining age and gender information from formal texts. Seifeddine and Maher [6], focus is on author s discussions, they combine content based approach and statistic approach. They show a hemi- strategy that [4] www.natcorp.ox.ac.uk

merges the analysis of information in writings with a machine learning technique. To get a higher organization of these statistics, they depended on the utilization of the Decision table calculation". 3 Our approach 3.1 Stylistic Features The approaches commonly used for the automatic identification of an author s personality traits from text can be categorized into three broad categories: (1) Stylometry based approaches (which aim to identify an author s traits from his writing style), (2) Content based approaches (which identify author traits using features extracted from the content of the document) and (3) Topic based approaches (which try to predict an author s profile based on the topics used in the document). The system presented in this PAN Author Profiling Competition is based on stylometry. Table 3.1 shows the list of 29 stylistic features used for the development of our proposed author profiling detection system. As can be noted that these features aim to extract different stylistic information from a single document, which can be helpful in identifying the age, gender, personality type i.e. stable, open, extroverted, agreeable and conscientious.in this year s training data, the tweets are available for four different languages: (1) English, (2) Dutch, (3) Spanish and (4) Italian. This study aims to identify some language independent stylistic features which are probable to perform on multiple languages as well. As can be noted from Table 3.1 that features numbered 1, 2, 3, 4, 5, 6, 7, 8, 9, 13, 20, 21, 23 and 24 are language independent and can be used for extracting stylistic information from document in any of the four languages i.e. English, Dutch, Spanish and Italian, while the remaining ones can only be used for English language.

Languages No. Feature English Dutch Spanish Italian 1. Percentage of Question Sentences Yes Yes Yes Yes 2. Percentage of Short Sentences Yes Yes Yes Yes 3. Percentage of Long Sentences Yes Yes Yes Yes 4. Average Sentence Length Yes Yes Yes Yes 5. Average Word Length Yes Yes Yes Yes 6. Percentage of Words with Six and More Letters Yes Yes Yes Yes 7. Percentage of Words with Two and Three Letters Yes Yes Yes Yes 8. Percentage of Semicolons Yes Yes Yes Yes 9. Percentage of Punctuations Yes Yes Yes Yes 10. Percentage of Pronouns Yes ---- ---- ---- 11. Percentage of Prepositions Yes ---- ---- ---- 12. Percentage of Coordinating Conjunctions Yes ---- ---- ---- 13. Percentage of Comma Yes Yes Yes Yes 14. Percentage of Articles Yes ---- ---- ---- 15. Percentage of Words with One Syllable Yes ---- ---- ---- 16. Percentage of Words with Three Plus Syllables Yes ---- ---- ---- 17. Average Syllables per Word Yes ---- ---- ---- 18. Percentage of Adjectives Yes ---- ---- ---- 19. Percentage of Adverbs Yes ---- ---- ---- 20. Percentage of Capitals Yes Yes Yes Yes 21. Percentage of Colons Yes Yes Yes Yes 22. Percentage of Determiners Yes ---- ---- ---- 23. Percentage of Digits. Yes Yes Yes Yes 24. Percentage of Full stop Yes Yes Yes Yes 25. Percentage of Interjections Yes ---- ---- ---- 26. Percentage of Modals Yes ---- ---- ---- 27. Percentage of Nouns Yes ---- ---- ---- 28. Percentage of Personal Pronouns Yes ---- ---- ---- 29. Percentage of Verbs Yes ---- ---- ---- Table3. 1: List of all the stylistic features that are used in the development of the author profiling detection system. 3.2 PAN 2015 Author Profiling Training Datasets Our proposed system was trained using the PAN 2015 Author Profiling training data, which consists of Twitter tweets in four different languages: (1) English, (2) Dutch, (3) Spanish and (4) Italian. In the training dataset there are 152, 34, 100 and 38 author profiles for English, Dutch, Spanish and Italian languages respectively. In the PAN 2015 Author Profiling training dataset, there are two classes for gender (male and female), four classes for age (18-24, 25-34, 35-49 and 50-xx) for the remaining personality traits open, stable, agreeable, extroverted and conscientious there are two classes: (yes or no). 3.3 Evaluation Methodology The task of identifying an author s profile from text was treated as a supervised machine learning problem. N-fold-cross-validation was used. Due to difference in the sizes of training data for different languages, we used 5-fold cross

validation for English language corpus, 4-fold cross validation for Dutch and 3-fold cross validation for the Italian and Spanish corpora respectively. We applied a range of machine learning algorithms on the training data including Navies Bayes, Support Vector Machine, Random Forest, J48 and Logistics. The WEKA s (give reference) implementation of these algorithms was used. The scores generated using the stylistic features (see Table 3.1) were used as input features to the machine learning algorithms. We used WEKA s attribute selection approach for selecting best features from the complete feature set. For that purpose we explored CFsSubSetEval, Filtered Attribute, SVM Attribute and Classifier Subset as evaluators, and Best first and ranker as search methods. 3.4 Evaluation measure As recommended by PAN 2015 organizers, the performance of the proposed system for age and gender personality traits was measured using accuracy, whereas for other personality traits (stable, open, extroverted, agreeable and conscientious) average Root Mean Squared Error was used. 4 Results on Training Data We carried out three sets of experiments: (1) performance on individual features, (2) performance on combined features (mean all 29 features are used as input) and (3) performance on selected features. The best performance was obtained using selected features, therefore, we are only reporting the results for them. Table 4.1 demonstrates the accuracy for age and gender trait and average root mean square error for personality types, for tweets in all four languages. Best features for age and gender traits in English, gender, stable, agreeable and extroverted traits in Dutch, age, gender and extroverted traits in Spanish and gender and extroverted traits in Italian are obtained by using ranker search method while remaining are obtained through best-first search method, the features chosen by these methods are shown in table 4.2

Language Gender Age Stable Open Agreeable Extroverted Conscientious English 0.75 0.65 0.45 0.14 0.43 0.41 0.42 Dutch 0.64 --- 0.36 0.00 0.39 0.20 0.47 Spanish 0.73 0.53 0.51 0.37 0.47 0.00 0.34 Italian 0.73 --- 0.14 0.16 0.33 0.43 0.29 Table 4.1 shows the results for training data on all four language, for age and gender accuracy scores are reported, while for other personality traits average root mean square error scores are presented. Traits English Dutch Spanish Italian Age Ranked in order: ---- Ranked in order: ---- 28,7,19,11,12,10,29,21,16, 4,13,9,1,6,21,7,24 14,6,17,9,18,27,22,24,23,1,23,8,5,20,3,2 5,26,4,13,1,5,25,20,8,3,2 Ranked in order Ranked in order Ranked in order Ranked in order Gender 23,10,22,20,18,9,1,5,28,11 21,9,8,13,6,1,20,2 1,9,23,7,8,11,4,21 24,1,4,8,20,5,9,1,21,14,25,24,12,27,13,7,6, 3,4,14,7,5,3,2,5,6,14,13,3,2 3,23,6,7,21,3,2 26,8,29,15,16,17,4,19,3,2 Open 1 1 1 1 Ranked in order Stable 6,27,29 6,8,5,7,1,23,21,20 24 1,13,24,9,4,3,22 Ranked in order Agreeable 1 13,20,1,6,13,4,8,2 24 1 4,5,7,9,21,3,2 Ranked in order Ranked in order Ranked in order Extroverted 21,27 21,24,5,8,6,1,13,9 24,23,21,20,13,9, 8,21,9,7,6,24,20,,4,20,7,23,3,2 8,7,6,5,4,3,2,1 13,5,23,1,4,3,2 Conscientious 7,21 4,6,7,21,23,24 1 1 Table 4.2 shows the features selected for training data on all for language 5 Results on Test Data Table 5.1 shows the accuracy for age and gender trait and average root mean square error for personality types, for test data in all four languages i.e. (1) English (2) Dutch (3) Spanish (4) Italian. For evaluation of test data we train our modal using all features as discussed in table 3.1., and applying chi squared evaluator and ranker search method. It can be concluded from table 5.1 that overall high accuracy for age and gender is achieved for English language, while for other traits i.e. open, stable, agreeable, extroverted and conscientious overall best results are achieved for Dutch language.

Language Gender Age Stable Open Agreeable Extroverted Conscientious English 0.6901 0.7183 0.3172 0.2149 0.2154 0.2131 0.1959 Dutch 0.5938 -------- 0.1686 0.0866 0.1436 0.1677 0.1425 Spanish 0.6932 0.5341 0.2806 0.2145 0.1430 0.2786 0.1410 Italian 0.5833 -------- 0.2303 0.2449 0.1462 0.1067 0.1333 Table 5.1: shows the results for test data on all four language, for age and gender accuracy scores are reported, while for other personality traits average root mean square error scores are presented. 6 Conclusion In this paper we have discussed our participation in the Author Profiling task. We have covered the role of stylistic features in identification of author personality traits. For that purpose we figured out 29 features and perform different experiments on these, like comparing accuracy by using all features, then checking accuracy for single feature and finally using subsets of them and come to conclusion that best results are achieved by feature selection techniques. References [1] Alan E. Kazdin, PhD, Editor-in-Chief Encyclopedia of Psychology: 8 Volume Set [2] James W. Pennebaker, Mathias R. Mehl, and Kate G. Niederhoffer. Psychological aspects of natural language use: Our words, our selves. Annual review of psychology, 54(1):547 577, 2003. [3] Shlomo Argamon, Moshe Koppel, Jonathan Fine, and Anat Rachel Shimoni, Gender, genre, and writing style in formal written texts. TEXT, 23:321 346, 2003. [5] Shlomo Argamon, Moshe Koppel, James W. Pennebaker, and Jonathan Schler. Automatically profiling the author of an anonymous text. Commun. ACM, 52(2):119 123, February 2009. [6] John D. Burger, John Henderson, George Kim, and Guido Zarrella. Discriminating gender on twitter. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP 11, pages 1301 1309, Stroudsburg, PA, USA, 2011. Association for Computational Linguistics. [7]Janet Holmes and Miriam Meyerhoff. The Handbook of Language and Gender. Blackwell Handbooks in Linguistics. Wiley, 2003. ISBN 9780631225027.

[8]Seifeddine Mechti, Maher Jaoua, Lamia Hadrich Belguith, and Rim Faiz1,Machine learning for classifying authors of anonymous tweets, blogs, reviews and social media Notebook for PAN at CLEF 2014. [9] Jason Brownlee Feature Selection to Improve Accuracy and Decrease Training Time, March 12, 2014 in Uncategorized [10] Ian H. Witten, Eibe Frank and Mark A. Hall, Data Mining Practical Machine Learning Tools and Techniques [11] Francisco Rangel, Paolo Rosso, Moshe Koppel, Efstathios Stamatatos, and Giacomo Inches. Overview of the Author Profiling Task at PAN 2013. In Pamela Forner, Roberto Navigli, and Dan Tufis, editors, Working Notes Papers of the CLEF 2013 Evaluation Labs, September 2013. ISBN 978-88-904810-3- 1.