SENTIMENT ANALYSIS IN TURKISH A THESIS SUBMITTED TO THE GRADUATE SCHOOL OF NATURAL AND APPLIED SCIENCES OF MIDDLE EAST TECHNICAL UNIVERSITY


SENTIMENT ANALYSIS IN TURKISH

A THESIS SUBMITTED TO THE GRADUATE SCHOOL OF NATURAL AND APPLIED SCIENCES OF MIDDLE EAST TECHNICAL UNIVERSITY

BY UMUT EROĞUL

IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE IN COMPUTER ENGINEERING

JUNE 2009

Approval of the thesis:

SENTIMENT ANALYSIS IN TURKISH

submitted by UMUT EROĞUL in partial fulfillment of the requirements for the degree of Master of Science in Computer Engineering Department, Middle East Technical University by,

Prof. Dr. Canan Özgen
Dean, Graduate School of Natural and Applied Sciences

Prof. Dr. Müslim Bozyiğit
Head of Department, Computer Engineering

Dr. Meltem Turhan Yöndem
Supervisor, Computer Engineering Dept.

Examining Committee Members:

Prof. Dr. Göktürk Üçoluk
Computer Engineering Dept., METU

Dr. Meltem Turhan Yöndem
Computer Engineering Dept., METU

Asst. Prof. Dr. Pınar Şenkul
Computer Engineering Dept., METU

Dr. Onur Tolga Şehitoğlu
Computer Engineering Dept., METU

Güven Fidan, M.Sc.
AGMLAB

Date:

I hereby declare that all information in this document has been obtained and presented in accordance with academic rules and ethical conduct. I also declare that, as required by these rules and conduct, I have fully cited and referenced all material and results that are not original to this work.

Name, Last Name: UMUT EROĞUL

Signature:

ABSTRACT

SENTIMENT ANALYSIS IN TURKISH

Eroğul, Umut
M.S., Department of Computer Engineering
Supervisor: Dr. Meltem Turhan Yöndem

June 2009, 55 pages

Sentiment analysis is the automatic classification of a text, trying to determine the attitude of the writer with respect to a specific topic. The attitude may be their judgment or evaluation, their feelings, or the intended emotional communication. The recent increase in the use of review sites and blogs has made a great amount of subjective data available. Nowadays, it is nearly impossible to process all the relevant data manually, and as a consequence, the importance given to the automatic classification of unformatted data has increased. To date, research on sentiment analysis has focused mostly on the English language. In this thesis, two Turkish datasets tagged with sentiment information are introduced, and existing methods for English are applied to these datasets. This thesis also suggests new methods for Turkish sentiment analysis.

Keywords: Natural Language Processing, Machine Learning, Text Mining, Sentiment Analysis, Turkish

ÖZ

TÜRKÇE METİNLERDE DÜŞÜNCE ÇÖZÜMLEME

Eroğul, Umut
Yüksek Lisans, Bilgisayar Mühendisliği Bölümü
Tez Yöneticisi: Dr. Meltem Turhan Yöndem

Haziran 2009, 55 sayfa

Sentiment analiz (düşünce çözümleme) bir dökümanın bilgisayar tarafından incelenip, o dökümanın konu hakkında (konudan bağımsız olarak) olumlu veya olumsuz görüş belirttiğini saptamaya çalışır. Çıkartılmaya çalışılan görüş yazarın konu hakkındaki kararı ya da değerlendirmesi, yazarken hissettiği ruh hali, ya da belirtmek istediği etki olabilir. Günümüzde internet kullanımının artması sonucu daha da yaygınlaşan tartışma grupları ve görüş - eleştiri sitelerinde yer alan fikir belirten yazıların artması, bu verilerin elle işlenmesini imkansıza yakın hale getirmiş ve otomatik sınıflandırmanın önemini daha da arttırmıştır. Günümüze kadar düşünce çözümleme araştırmaları, genelde İngilizce üzerine yoğunlaşmıştır. Bu tez kapsamında düşünce çözümlemede kullanılabilecek iki yeni veri seti oluşturulmuş, ve daha önceden İngilizce'de yapılmış çalışmalarda denenen yöntemlerin Türkçe'de gösterdikleri başarılar incelenmiş ve Türkçe'ye özel yeni yöntemler önerilmiştir.

Anahtar Kelimeler: Doğal Dil İşleme, Makine Öğrenimi, Metin Madenciliği, Düşünce Çözümleme, Türkçe

To my family.

ACKNOWLEDGMENTS

I would like to thank my supervisor Dr. Meltem Turhan Yöndem for her guidance over the last two years. I would like to thank Dr. Onur Tolga Şehitoğlu and Güven Fidan for their help and ideas. I would like to thank my brother Can Eroğul for his advice and encouragement throughout my education. I would like to thank Atıl İşçen for his cooperation over the last four years. I would like to thank my aunt Dicle Eroğul and Ruken Çakıcı for their corrections and suggestions on this thesis. Finally, I would like to thank my parents and loved ones for their support.

TABLE OF CONTENTS

ABSTRACT
ÖZ
DEDICATION
ACKNOWLEDGMENTS
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES

CHAPTERS

1 INTRODUCTION
    Motivation
    What is Sentiment?
    Importance of Turkish from a Linguistic Perspective
    Outline of the Thesis

2 LITERATURE SURVEY

3 BACKGROUND
    Machine Learning
        Support Vector Machines
        Automated Data Gathering
    Natural Language Processing
        Part of Speech
        Natural Language Processing Tools
        Natural Language Toolkit
        Zemberek
    3.3 Combining ML and NLP
        N-gram model
        Feature Selection
        Threshold
        Negation

4 METHODS
    Data Description
        English Subjectivity Data
        Polarity
        Sentence Polarity v
        Turkish Polarity Data
        Filtering Process
        Turkish Polarity Scale Data
        Major Differences Between Turkish and English Data
    Methods Used
        Spellcheck Methods
        Polarity Spanning
        Using Roots of Words
        Part of Speech
        Presence vs Frequency
        Multiclass data

5 RESULTS AND DISCUSSIONS
    Equivalence in English Data
    Threshold Experiments
    Baseline
    Using Roots of Words
    Part-of-Speech
    N-gram tests
    Extracted Features
    Turkish Polarity Scale Data

6 CONCLUSION AND FUTURE WORK
    Conclusion
    Future Work

REFERENCES

APPENDICES

A EXAMPLE DATA
    A.1 Polarity 2.0 Data Examples
        A.1.1 Positive Movie Review from Polarity
        A.1.2 Negative Movie Review from Polarity
    A.2 Sentence Polarity Data Examples
        A.2.1 Positive Movie Review from Sentence Polarity Dataset v
        A.2.2 Negative Movie Review from Sentence Polarity Dataset v
    A.3 Turkish Polarity Data Examples
        A.3.1 Turkish Positive Movie Review
        A.3.2 Turkish Negative Movie Review
        A.3.3 Turkish Neutral Movie Review
    A.4 Analysis of Extracted Features

LIST OF TABLES

Table 1.1 Two Example Sentences in Turkish with Translations
Table 3.1 An Example Sentence Translation
Table 4.1 General Statistics on Polarity Data
Table 4.2 General Statistics on Sentence Polarity Dataset v
Table 4.3 Distribution of Reviews from beyazperde on Rating
Table 4.4 Distribution of Reviews at Turkish Polarity Scale Data on Rating
Table 5.1 Precision, Recall and F
Table 5.2 General test results for acquiring baseline
Table 5.3 Results using roots of words
Table 5.4 Detailed feature information using roots of words
Table 5.5 Results using part-of-speech data
Table 5.6 Detailed feature information using part-of-speech data
Table 5.7 Results in ngram
Table 5.8 Detailed feature information in N-gram tests
Table 5.9 Most powerful 20 word phrases for Turkish
Table 5.10 Most powerful 20 word phrases for English
Table 5.11 Results in one-vs-all training
Table 5.12 Results with linear regression
Table A.1 Most powerful 40 word phrases for Turkish
Table A.2 Most powerful 40 word phrases for English

LIST OF FIGURES

Figure 3.1 SVM Hyperplane
Figure 3.2 An example POS tagging by Zemberek
Figure 3.3 An example morphological analysis by Zemberek
Figure 3.4 N-gram sets of an example sentence
Figure 4.1 A screenshot from rottentomatoes.com
Figure 4.2 A screenshot from beyazperde.mynet.com
Figure 4.3 A screenshot from
Figure 4.4 A screenshot from beyazperde.mynet.com
Figure 4.5 The steps taken on an example sentence
Figure 4.6 An example of the first spellchecking method with Zemberek
Figure 4.7 The suggestions of the Zemberek library for the word uyumamalıydım.
Figure 4.8 An example of the polarity spanning
Figure 4.9 An example of words with the same root
Figure 4.10 Analyzing a sentence by Zemberek
Figure 5.1 Threshold vs Performance
Figure 5.2 Threshold vs Feature Number
Figure 5.3 Turkish Polarity Scale Data Histogram
Figure 5.4 Linear Regression Histogram

CHAPTER 1

INTRODUCTION

1.1 Motivation

What other people think has always been an important piece of information during decision making, even before the Web: asking friends who they plan to vote for, requesting letters of recommendation from colleagues, checking Consumer Reports regarding dishwasher brands. People search for, are affected by, and post online ratings and reviews. Recent studies [8] [16] show that 60% of US residents have done online research on a product at least once, and 15% do so on a typical day. Moreover, 73%-87% of US readers of online reviews of services reported that the reviews had a significant influence on their purchase. However, 58% of US internet users report that online information was missing, impossible to find, confusing, and/or overwhelming.

Terabytes of new user-generated content appear on the Web every day. Many of these blog posts, tweets, articles, podcasts and videos present valuable opinions about a product, a company, a movie, etc. The quantity of information available is immense, which makes it impossible to track manually; thus tools are needed to analyze these comments automatically. This vast amount of information is searchable with words or word phrases, but in order to produce better results, search engines should understand the meaning and semantics hidden in the text. One of the approaches to this problem is sentiment analysis, whose main concern is the extraction of feelings in a given text.

Sadly, sentiment analysis studies have mostly concentrated on English so far, and to the best of our knowledge, there is no study made in Turkish. We wanted to see the effects of the methods applied to English on Turkish data. Unfortunately, there was no data available for sentiment analysis in Turkish. So in this thesis, we introduce a new dataset tagged with sentiment information, and perform experiments on this data.

The aim of this thesis is to initiate the study of sentiment analysis in Turkish by creating a Turkish dataset tagged with sentiment information and conducting the initial experiments on this data, using the methods that have already been applied in English sentiment analysis. This thesis also aims to find new methods that would increase the performance using language-dependent features, and to analyze the effect of linguistic features on Turkish sentiment analysis.

1.2 What is Sentiment?

Sentiment analysis aims to determine the attitude of a speaker or a writer with respect to some topic. The attitude may be the judgment or evaluation, the emotional state of the author when writing, or the emotional effect the author wishes to have on the reader. Computers can perform automated sentiment analysis of digital texts, using elements from machine learning such as latent semantic analysis, support vector machines, bag of words and Semantic Orientation.

There are two main approaches in sentiment analysis: statistical and linguistic. The statistical approach relies heavily on mathematical and statistical comparison of the occurrences and number of negative or positive statements in the text, whereas the linguistic approach tries to build a set of rules and compare the analyzed text with them.

As Devitt et al. (2007) [11] stated, sentiment analysis in computational linguistics has focused on examining what textual features (lexical, syntactic, punctuation, etc.) contribute to the affective content of text and how these features can be detected automatically to derive a sentiment metric for a word, sentence or whole text.

As Pang and Lee (2004) [26] stated, sentiment analysis seeks to identify the viewpoint(s) underlying a text span; an example application is classifying a movie review as positive or negative. It has proven useful for companies, recommender systems, and editorial sites to create summaries of people's experiences and opinions that consist of subjective expressions extracted from reviews.

Sentiment analysis is tightly coupled with Opinion Mining, which is the term used by commercial applications that apply sentiment analysis methods. Formally, Opinion Mining is about automatically getting to know what people think about an item (brand, product, anything) from their explicit opinions in blogs, comments on blog posts and news stories, social networks, and elsewhere on the Web.

1.3 Importance of Turkish from a Linguistic Perspective

Turkish is a highly inflected language with free word order. It is an agglutinating language, which means that words are formed by linear concatenation of suffixes. In an agglutinating language like Turkish, a single word can be a sentence with tense, agreement, polarity, and voice, as in Table 1.1, and translate into a relatively longer English sentence. The morphological structure of words bears clues about part-of-speech tags, modality, tense, person and number agreement, case, voice and so on [6].

Two example sentences in Turkish are given with their translations in Table 1.1. We think they are good examples of the importance and use of suffixes in Turkish. We can see that a lot of meaning can be given to a word by extending it with appropriate suffixes.

Table 1.1: Two Example Sentences in Turkish with Translations

Turkish word        Morphological gloss                   English translation
Gidemeyebilirdim    go-abil-neg-possib-aor-past-p1sg      I might not have been able to go
Oturtmalısın        sit-caus-oblig-p2sg                   (You) must make (someone) sit.
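To make the role of suffixes concrete, here is a minimal sketch that treats the two gloss strings from Table 1.1 as data and checks whether a word carries a negation suffix. The function names and the spelled-out meanings of the gloss abbreviations are assumptions made for illustration; they are not taken from the thesis or from any morphological analyzer.

```python
# Gloss strings are copied from Table 1.1. The expansions of the
# abbreviations below are assumed, conventional readings of the tags.
GLOSS_TAGS = {
    "abil": "ability",
    "neg": "negation",
    "possib": "possibility",
    "aor": "aorist",
    "past": "past tense",
    "caus": "causative",
    "oblig": "obligation",
    "p1sg": "first person singular",
    "p2sg": "second person singular",
}


def parse_gloss(gloss):
    """Split a hyphen-separated gloss into (root, list of suffix tags)."""
    root, *tags = gloss.split("-")
    return root, tags


def is_negated(gloss):
    """Return True if the gloss contains the negation tag."""
    _, tags = parse_gloss(gloss)
    return "neg" in tags


examples = {
    "Gidemeyebilirdim": "go-abil-neg-possib-aor-past-p1sg",
    "Oturtmalısın": "sit-caus-oblig-p2sg",
}

for word, gloss in examples.items():
    root, tags = parse_gloss(gloss)
    meanings = [GLOSS_TAGS[tag] for tag in tags]
    print(word, "| root:", root, "| suffixes:", meanings,
          "| negated:", is_negated(gloss))
```

A single Turkish word can therefore carry a polarity cue (the negation suffix) that in English would appear as a separate word, which is one reason morphological analysis matters for sentiment features.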

1.4 Outline of the Thesis

The organization of the thesis is as follows: In Chapter 2, a survey of related work on sentiment analysis is given. Chapter 3 gives detailed background information about the fields and methods used in this thesis. In Chapter 4, the data used for the experiments and the methods used are described in detail. Chapter 5 gives the experimental results and the discussions about the results. Finally, Chapter 6 summarizes the thesis and gives the conclusions. Possible ideas for improving the applied methods are given in the future work section.

CHAPTER 2

LITERATURE SURVEY

Polarity classification, also known as opinion mining, is a subfield of sentiment analysis in which the task is to classify an opinionated document as either positive or negative according to the sentiment expressed by the author. It has gained a lot of interest in the past years.

Early studies on sentiment-based categorization started with the use of models inspired by cognitive linguistics [15] [30], or the manual construction of discriminant-word lexicons [9]. Das et al. introduced real-time sentiment extraction, trying to automatically label each finance message in online stock message boards as a buy, sell or neutral recommendation. However, their method requires creating a discriminant-word lexicon by manual selection and tagging.

One of the first major studies in sentiment analysis with a supervised learning approach is "Thumbs up? Sentiment Classification using Machine Learning Techniques" by Pang et al. [28], conducted in 2002, in which they established a dataset from movie reviews collected from the well-known Internet Movie Database. They automatically classified 1400 reviews (700 positive, 700 negative) in which the author clearly expressed his opinion with phrases such as "9 out of 10" or "4 out of 5". They cleaned the reviews of such remarks, and performed various tests on them. Their main objective was to determine the sentiment automatically. They tried several features (part-of-speech information, position information) for the bag-of-words approach, where the document is represented as an unordered collection of words that serves as an input to machine learning methods, which will be explained in Section 3.3. They tried three machine learning methods (Naive Bayes, Maximum Entropy, Support Vector Machines). Their results show that using only unigrams and bigrams in the bag-of-words approach (see Section 3.3) generates the best results (82.9%) with support vector machines (see Section 3.1).

Dave et al. [10] also took the supervised learning approach and experimented with multiple datasets of product reviews on amazon.com. They examined an extensive array of more sophisticated features based on linguistic knowledge or heuristics, coupled with various feature-selection and smoothing techniques. In conclusion, they state the general problems of taking online data, some of which are: users not understanding the rating system clearly and giving low ratings to things they like, or giving only negative comments and then concluding with a final sentence stating that the overall performance is satisfactory. They also state that product reviews are on average too short and that positive reviews are dominant on online sites.

Subjective - Objective Classification

After analyzing the results, researchers concluded that distinguishing between objective and subjective sentences could improve the performance of sentiment analysis. These studies [26] [32] trained an additional classifier to separate subjective from objective sentences, so that only subjective sentences are passed to the sentiment classifier. One of the first datasets created for the classification of subjectivity is the subjectivity dataset v1.0 [5], created by Bo Pang and Lillian Lee and introduced in Pang and Lee (2004) [26], which includes 5000 subjective movie review snippets and 5000 movie plot sentences that are assumed to be objective. In the experiments, they used a two-layer classification algorithm: the first layer classified sentences as subjective or objective, and the sentences classified as subjective were then processed by the sentiment classifier, which increased the overall result by 4% (to 86.9%).
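The two-layer idea described above can be sketched as follows. This is only an illustration of the control flow, under the assumption that both layers are available as plain functions; the keyword-based stand-ins and word lists are hypothetical placeholders, not the classifiers or features used by Pang and Lee.

```python
# Hypothetical word lists used only to make the sketch runnable.
POSITIVE_WORDS = {"great", "loved"}
NEGATIVE_WORDS = {"boring", "awful"}
OPINION_WORDS = POSITIVE_WORDS | NEGATIVE_WORDS


def toy_is_subjective(sentence):
    """Stand-in for the first layer: subjective vs objective sentence."""
    return any(word in OPINION_WORDS for word in sentence.lower().split())


def toy_classify_polarity(text):
    """Stand-in for the second layer: positive vs negative text."""
    words = text.lower().split()
    positives = sum(word in POSITIVE_WORDS for word in words)
    negatives = sum(word in NEGATIVE_WORDS for word in words)
    return "positive" if positives >= negatives else "negative"


def two_layer_classify(sentences, is_subjective, classify_polarity):
    """Drop objective sentences, then classify what remains."""
    subjective = [s for s in sentences if is_subjective(s)]
    if not subjective:
        return None  # nothing opinionated to classify
    return classify_polarity(" ".join(subjective))


review = [
    "the film is set in istanbul",
    "i loved the cast",
    "the plot was great",
    "the ending felt boring",
]
print(two_layer_classify(review, toy_is_subjective, toy_classify_polarity))
```

In a real system each stand-in would be replaced by a trained classifier; the point of the structure is that plot descriptions and other objective sentences never reach the polarity model.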
For subjectivity classification, they used the subjectivity scores of neighboring sentences and graph algorithms, which are outside the scope of this thesis.

Pang et al. (2005) [27] aimed to extend the problem by trying to predict the rating given to a movie (1-5) from its review, rather than just assigning a positive or negative label. This line of research has focused on improving multi-class labeling techniques such as one-vs-all, one-vs-one and regression. Since there was no existing data for sentiment scale classification, they created a new dataset for this purpose. They achieved 70% performance on average with 3-class classification, and 50% performance on 4-class classification. The best results were achieved in one-vs-all by support vector machine models. Also in this study, a positive-negative classifier was trained and the percentage of positive sentences in each review of a rating group was analyzed; it can be said that the percentage of positive sentences in a review is directly proportional to the rating it received.

Automatic Data Generation

Some researchers tackle the need for annotated data in supervised learning by deriving labels automatically, for example from the effect of user comments on the sale price in the Amazon marketplace [14], or by relating the stock prices of a given company to the news published about the company [11]. In the former study, the authors trained on the sales made on the marketplace within the last year, and tried to guess which seller would sell the same item first. They determined a set of key phrases important for the overall rating, such as delivery, packaging and service, for each merchant, and created the feature vector under the assumption that each of these attributes is expressed by a noun, noun phrase, verb, or verb phrase in the feedback postings. They assigned a dollar value to each noun phrase representing the effect of the phrase on the sale price, which reached 87% when guessing which merchant would sell the item.

Using the natural language processing tools available for English on other languages is also an interesting research area for researchers [23] working on different languages. A previously translated corpus and a lexicon are used for sentiment classification in Romanian. This work shows that cross-lingual transfer performance depends mainly on the amount of multilingual data available for training.

Domain Transfer

One interesting problem arising from the domain-specific nature of sentiment analysis in the context of supervised learning is how to adapt the classifier to a new domain. In domain transfer, the classifier is trained in one domain and tested on another domain [31]. The main purpose is to find the properties of domain-independent classifiers [2]. Blitzer et al. (2007) [4] showed that domain transfer in product reviews gathered from Amazon.com performed successfully. Models trained on reviews of DVD products performed nearly as well (79.5%) as models trained on reviews of books (80.5%) when tested on the book reviews.

Feature Selection in Sentiment Analysis

The selection of the feature set is a major part of the machine learning process. Minimizing the feature set without losing critical information is another problem that the machine learning and sentiment analysis communities tackle. Eliminating features that are subsumed by other features [29] could reduce the feature space by 98%, while increasing the performance by two percent. The importance of the feature set for machine learning is described in Section 3.3. Various other types of features, like pattern extraction for value phrases, Turney values, and Osgood semantic differentiation with WordNet values, and feature selection methods like subset selection and feature subsumption have been analyzed [24] [3] [21] [13].

The information loss created by the bag-of-words approach is also a problem that the sentiment analysis community tackles. Keeping the document in a tree structure and applying the sentiment results to neighboring sentences is shown [22] to have a small positive effect on the performance, but even using N-grams with bag-of-words at the sentence level is not able to prevent information loss. After the features are known, Airoldi et al. (2006) [1] converted the feature set to a graph and used the Markov blanket method to merge features, in order to refine the dataset. Wilson et al. (2004) represented the data in a tree structure, and used features representing the relations in the tree, with boosting and rule-based methods.

The Use of Neutral Data

Most of the research in the sentiment analysis domain is focused on positive-negative classification because neutral documents decrease the performance drastically. Koppel et al. (2006) [19] show that using neutral data in training improves the performance. In this study, rather than training one classifier for positive-negative classification, the authors trained two classifiers, one for positive-neutral and one for neutral-negative classification. The interesting result is that, if the results of these two classifiers are analyzed correctly, the overall performance on positive-negative classification is increased by using neutral data in the training.
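As an illustration of how the outputs of a positive-neutral and a neutral-negative classifier might be combined, the sketch below uses one simple agreement rule. The rule and the function name are assumptions chosen for illustration; the survey above does not specify the exact combination used in [19], and this should not be read as that method.

```python
def combine_pairwise_labels(pos_vs_neutral, neutral_vs_neg):
    """Combine two binary decisions into one of three labels.

    pos_vs_neutral is "positive" or "neutral".
    neutral_vs_neg is "neutral" or "negative".
    Assumed rule: a polar label is accepted only when the other
    classifier does not contradict it; anything else is neutral.
    """
    if pos_vs_neutral == "positive" and neutral_vs_neg == "neutral":
        return "positive"
    if pos_vs_neutral == "neutral" and neutral_vs_neg == "negative":
        return "negative"
    # Both say neutral, or they disagree (positive and negative).
    return "neutral"


for first in ("positive", "neutral"):
    for second in ("neutral", "negative"):
        print(first, second, "->", combine_pairwise_labels(first, second))
```

Under this assumed rule a document is labeled polar only when exactly one classifier moves away from neutral, so conflicting evidence falls back to the neutral class instead of forcing a positive or negative decision.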

CHAPTER 3

BACKGROUND

In this chapter, background information is given for the topics relevant to the thesis study. The major related topics are Natural Language Processing (NLP) and Machine Learning (ML). NLP is used for finding the roots of words via morphological analysis, and for determining the part of speech. The main machine learning tool used in this work is Support Vector Machines (SVM).

3.1 Machine Learning

Machine learning (ML) is the subfield of artificial intelligence that is concerned with the design and development of algorithms that allow computers to improve their performance over time, based on data. ML draws on data mining and statistics, and uses computational and statistical methods to extract a variety of information automatically from raw data. The use of machine learning in natural language processing has risen sharply in recent years, owing to its high performance in various studies. In this thesis, we use Support Vector Machines as the machine learning technique, which has been shown to be effective in previous text mining studies.

3.1.1 Support Vector Machines

In general, Support Vector Machines (SVMs) take a feature vector as input, which expresses the relevant data as a set of real values. Representing the data as a set of real values can be a hard task depending on the data and the domain. A common approach in text mining is the bag-of-words approach, which will be explained in Section 3.3.

Given data points that each belong to one of two classes, the goal of an SVM is to decide which class a new data point will be in. A data point is viewed as a p-dimensional vector (a feature vector, i.e. a list of p numbers), and an SVM tries to find whether it can separate such points with a (p - 1)-dimensional hyperplane, which is called a linear classifier. There may be many hyperplanes that classify the data; however, SVMs try to find the hyperplane with the maximum separation between the two classes. That is to say, the distance from the hyperplane to the nearest data point of each class is maximized.

Formally, the set of data points can be represented with Equation 3.1a, and a hyperplane can be written as the set of points x satisfying Equation 3.1b, where · denotes the dot product. The vector w is a normal vector; it is perpendicular to the hyperplane. The SVM selects w and b to maximize the margin, that is, the distance between the parallel hyperplanes that are as far apart as possible while still separating the data. This is illustrated in Figure 3.1.

\[ \mathcal{D} = \{ (\mathbf{x}_i, c_i) \mid \mathbf{x}_i \in \mathbb{R}^p,\ c_i \in \{-1, 1\} \}_{i=1}^{n} \tag{3.1a} \]

\[ \mathbf{w} \cdot \mathbf{x} - b = 0 \tag{3.1b} \]

After receiving the input, machine learning techniques try to find a relation between the given features and the outcome. The complexity of these relations can vary from linear (e.g. feature #n has a 0.15 positive effect on the outcome) to quadratic or higher.

We used Joachims' SVMperf package [18] for training and testing because of its speed and performance on large datasets. SVMperf is an implementation of the Support Vector Machines formulation for optimizing multivariate performance measures described in [18].

3.1.2 Automated Data Gathering

Machine learning methods depend strongly on the quality and quantity of the data.
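Returning briefly to the hyperplane of Equation 3.1b, the following sketch shows how a fixed pair (w, b) assigns a class to a point and how wide the corresponding margin is. It reads Equation 3.1b as w · x - b = 0 and assumes the usual normalization in which the two margin hyperplanes are w · x - b = 1 and w · x - b = -1; the values of w, b and the data points are made up for illustration, and no training is performed (training was done with SVMperf in this work).

```python
import math


def decision_value(w, x, b):
    """Signed score w . x - b of point x relative to the hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) - b


def classify(w, x, b):
    """Return +1 or -1 depending on the side of the hyperplane."""
    return 1 if decision_value(w, x, b) >= 0 else -1


def margin_width(w):
    """Distance between the hyperplanes w . x - b = 1 and w . x - b = -1."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))


# Hypothetical separating hyperplane and labeled points (x, c).
w = [1.0, 1.0]
b = 3.0
data = [
    ([1.0, 1.0], -1),
    ([0.0, 1.0], -1),
    ([3.0, 2.0], 1),
    ([2.0, 2.0], 1),
]

for x, c in data:
    print(x, "label:", c, "predicted:", classify(w, x, b),
          "score:", decision_value(w, x, b))
print("margin width:", margin_width(w))
```

Because the margin width equals 2 divided by the norm of w, maximizing the margin amounts to finding the smallest w that still separates the two classes, which is the optimization an SVM solver carries out.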
Figure 3.1: SVM Hyperplane

The need of supervised learning methods for large amounts of data is usually met by automatically collecting tagged data from the Internet. There are a lot of web sites that have ratings associated with the polarity of a given text. For instance, stars associated with Internet Movie DB comments hint at polarity. For a user who gave a movie 8 stars out of 10, it is likely that the sentiments captured in the text were generally positive and moderately forcefully held. It is not surprising that a 5/10 review has the title "A huge disappointment for fans of this memorable series" and that a 10/10 is coupled with "I just LOVED IT!". Similarly, a hotel guest who chose a "Fair" rating in a satisfaction survey is likely to have posted more complaints than praise in free-text response fields.

Most of the datasets used in this field are created by gathering all the reviews and their ratings from web sites. If the site does not give rating information for a review, automated programs try to extract the rating information from phrases in the review like "* OUT OF 10" or "*/".

3.2 Natural Language Processing

Natural Language Processing (NLP) is an area of computer science where the computer tries to process naturally occurring human language. NLP researchers therefore try to understand the underlying patterns used in human language and extract these patterns for the use of com-