Research on Sentiment Classification of Chinese Micro Blog Based on Machine Learning

School of Economics and Management, Shenyang Ligong University, Shenyang, 110159, China
E-mail: 8e8@163.com

Abstract

This paper presents an empirical study of sentiment classification of micro blog using three machine learning algorithms, three feature selection algorithms and three feature weighting algorithms. The experimental results show that SVM and Naïve Bayes each have advantages under different feature weighting algorithms, and that the Information Gain (IG) feature selection algorithm is clearly more effective than the other methods. Considering the three factors together, the most effective configuration for sentiment classification of micro blog is SVM with IG feature selection and TF-IDF (Term Frequency-Inverse Document Frequency) feature weighting. The paper also compares the generality of classification models between micro blog comments and ordinary comments in the film domain; the results show that the performance of sentiment classification depends on the style of the reviews.

Key words: Micro Blog, Sentiment Classification, Machine Learning, Feature Selection, Feature Item Weighting

1. Introduction

The rise of the Internet, and especially the spread of Web 2.0 applications in recent years, has made it convenient for netizens to comment on all kinds of products and hot issues. Comments on products are valuable to both businesses and consumers, while comments on hot issues help the government understand what netizens think about specific topics. As an emerging technology, sentiment classification has already received much research attention [1-3]. Sentiment classification divides sentiment into positive and negative, and current research mainly applies two kinds of methods: methods based on machine learning [1, 3] and methods based on semantics [4-5]. The former treats sentiment analysis as a classification problem: a classification model is trained on a labeled training set with a machine learning algorithm and then used for subsequent sentiment classification. The latter builds a sentiment lexicon by dividing sentiment words into positive and negative ones, and then determines the sentiment tendency of a sentence by comparing the numbers of positive and negative sentiment words it contains. Many research results [1-2] show that the performance of machine learning methods is better than that of semantic methods.

As an application that has developed rapidly in recent years, micro blog is receiving considerable attention from researchers. Compared with traditional reviews, micro blog has the following five features:

(1) Length: A micro blog post is limited to 140 characters, with an average length of 40 characters according to statistics on the collected corpus, which differs greatly from traditional comments. This is exactly why netizens' opinions are easier to understand in micro blog.

(2) Easy data access: It is relatively easy to obtain data, since most current micro blog platforms provide APIs through which large amounts of data can be collected conveniently.
(3) Specific language style: Since netizens can post information from mobile phones, desktop clients, plug-ins and so on, the sources of micro blog information are diverse, and emerging words or spelling mistakes occur more often in micro blogs than in traditional blogs and product reviews.

(4) Information diversity: The information in micro blogs comes from different fields, as netizens comment both on products and on current hot issues, so information from many fields can be obtained from micro blogs. Moreover, most current micro blog platforms provide keyword search, through which relevant information can be retrieved by keywords of the relevant field.

(5) Instantaneity: With various ways of posting, netizens can publish their ideas on micro blogs whenever and wherever they like; micro blogs are therefore more timely than traditional comments and are a more suitable information source for applications with strict time requirements.

In view of the above features, research on sentiment classification based on micro blog comments is meaningful. So far there is relatively little relevant research at home or abroad, and only some scholars abroad have conducted sentiment classification research on micro blog [6-7]. Given the current lack of research on Chinese micro blog, literature [8] proposed a semantics-based method that calculates a sentiment index for each tweet by defining an attitude dictionary, a weighting dictionary, a negation dictionary, a degree dictionary and a conjunction dictionary, with data coming from Fanfou, a Chinese micro blog. However, there is currently no research on sentiment classification of Chinese micro blog with machine learning methods. To fill this gap, this paper conducts an empirical study of Chinese micro blog using three machine learning algorithms, three feature selection algorithms and three feature weighting algorithms, and compares the generality of classification models between micro blog comments and ordinary comments.

2. Relevant Knowledge

2.1 Machine Learning Methods

2.1.1 SVM

SVM is a machine learning algorithm based on the structural risk minimization principle [9] and a prediction tool with high generalization ability, which has been widely used in fields such as text classification and face recognition. In text classification, SVM has turned out to be very effective and more robust than traditional methods [10]. An SVM on linearly separable samples is called a linear SVM. Since most text data is linearly separable, this paper only considers the linear SVM. LIBLINEAR, an SVM algorithm proposed by Rong-En Fan [11] for large-scale linear text classification that is very effective for high-dimensional sparse data, is used for training and testing the classification model.

2.1.2 Naïve Bayes

Naïve Bayes is a frequently used text classification method that predicts the probable class of a sample of unknown category with Bayes' theorem and selects the most probable category as the label for the sample. Despite its simple model, it is widely applied in text classification [12]. For text classification there are mainly two Bayesian models: the multinomial model and the multi-variate Bernoulli model. Since a large number of scholars have carried out text classification research with the multinomial model [2, 13-14], the multinomial Bayesian classification algorithm is also adopted in the experiments of this paper. The multinomial Bayesian model estimates the probability of term w_t in category c_j from term occurrence counts through Formula (1):

P(w_t \mid c_j) = \frac{\sum_{i=1}^{N_j} n_{it}}{\sum_{s=1}^{|W|} \sum_{i=1}^{N_j} n_{is}}    (1)

where n_{it} is the number of occurrences of term t in document i, N_j is the size of the training set of category c_j, and |W| is the size of the dictionary. The posterior probability is calculated through Formula (2):

P(c_j \mid d_i) = \frac{P(c_j)\, P(d_i \mid c_j)}{P(d_i)}    (2)
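To make the multinomial model concrete, the following minimal Python sketch estimates the class-conditional term probabilities of Formula (1) and scores documents with the posterior of Formula (2). It is an illustration only, not the WEKA implementation used in the paper; the toy corpus, the label names and the add-one smoothing (used so unseen terms do not zero out the posterior) are assumptions.

```python
# Minimal multinomial Naive Bayes sketch following Formulas (1) and (2).
# Toy corpus, label names and add-one smoothing are illustrative assumptions.
from collections import Counter, defaultdict
import math

def train_multinomial_nb(docs, labels):
    """docs: list of token lists; labels: list of class names."""
    vocab = {t for d in docs for t in d}
    class_docs = defaultdict(list)
    for d, c in zip(docs, labels):
        class_docs[c].append(d)
    priors, cond = {}, {}
    for c, ds in class_docs.items():
        priors[c] = len(ds) / len(docs)                 # P(c_j)
        counts = Counter(t for d in ds for t in d)      # sum_i n_it for each term t
        total = sum(counts.values())                    # sum_s sum_i n_is
        # Formula (1), with add-one smoothing in numerator and denominator
        cond[c] = {t: (counts[t] + 1) / (total + len(vocab)) for t in vocab}
    return priors, cond, vocab

def classify(doc, priors, cond, vocab):
    # Formula (2): argmax_j P(c_j) * P(d_i | c_j); P(d_i) is constant across classes.
    scores = {}
    for c in priors:
        score = math.log(priors[c])
        for t in doc:
            if t in vocab:
                score += math.log(cond[c][t])
        scores[c] = score
    return max(scores, key=scores.get)

docs = [["电影", "精彩"], ["剧情", "糟糕"], ["演员", "出色"], ["情节", "无聊"]]
labels = ["pos", "neg", "pos", "neg"]
priors, cond, vocab = train_multinomial_nb(docs, labels)
print(classify(["演员", "精彩"], priors, cond, vocab))   # expected: pos
```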

2.1.3 N-gram Linguistic Model

Text classification with an n-gram linguistic model is a newer approach in natural language processing [15]. Different from the traditional vector space model, the n-gram linguistic model treats a document as a sequence of symbols (words or characters), so the pattern in which symbols occur is itself a kind of language binding that can be used for text classification. For a character string s = c_1 c_2 \ldots c_{n-1} c_n, the n-gram linguistic model assumes that the probability of occurrence of the n-th character is related only to the preceding n-1 characters, namely:

P(c_n \mid s) = p(c_n \mid c_1 c_2 \ldots c_{n-1})    (3)

2.2 Feature Selection Methods

2.2.1 Information Gain

Information Gain (IG) is a feature selection method often used in text classification [16]. The discriminative ability of a feature t is measured by how much keeping feature t improves classification compared with removing it. The IG formula is as follows:

IG(t) = -\sum_{i=1}^{|C|} P(c_i)\lg P(c_i) + P(t)\sum_{i=1}^{|C|} P(c_i \mid t)\lg P(c_i \mid t) + P(\bar{t})\sum_{i=1}^{|C|} P(c_i \mid \bar{t})\lg P(c_i \mid \bar{t})    (4)

where P(c_i) is the probability of category c_i, P(t) is the probability of occurrence of feature t, and P(\bar{t}) is the probability of absence of feature t.

2.2.2 CHI Statistics

The CHI statistic selects features by measuring the dependency between features and categories: a higher CHI value indicates stronger dependence between a feature and a category, and a lower value implies that the feature and the category are relatively independent. The CHI value is computed as follows:

CHI(t, c) = \frac{N (N_{11} N_{00} - N_{10} N_{01})^2}{(N_{11}+N_{01})(N_{11}+N_{10})(N_{10}+N_{00})(N_{01}+N_{00})}    (5)

CHI(t) = \max_i CHI(t, c_i)    (6)

where N is the total number of documents in the training set; N_{11} is the number of documents containing feature t and belonging to category c_i; N_{10} is the number of documents containing feature t but not belonging to category c_i; N_{01} is the number of documents not containing feature t but belonging to category c_i; and N_{00} is the number of documents neither containing feature t nor belonging to category c_i.
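As an illustration of Formulas (5) and (6), the short Python sketch below computes the CHI value of a candidate feature from the four document counts and takes the maximum over categories. The inputs (tokenized documents and their labels) are placeholders; this is not the feature selection code used in the paper's experiments.

```python
# Sketch of the CHI statistic of Formulas (5) and (6): score a candidate feature
# from the four document counts and keep the maximum over categories.
# The inputs (tokenized documents and their labels) are placeholders.
def chi_square(docs, labels, term, category):
    N = len(docs)
    n11 = sum(1 for d, c in zip(docs, labels) if term in d and c == category)
    n10 = sum(1 for d, c in zip(docs, labels) if term in d and c != category)
    n01 = sum(1 for d, c in zip(docs, labels) if term not in d and c == category)
    n00 = N - n11 - n10 - n01
    numerator = N * (n11 * n00 - n10 * n01) ** 2
    denominator = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
    return numerator / denominator if denominator else 0.0

def chi_score(docs, labels, term):
    # Formula (6): keep the maximum CHI value over all categories
    return max(chi_square(docs, labels, term, c) for c in set(labels))
```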

2.2.3 Document Frequency

Document Frequency (DF) is the simplest feature selection method: it works by setting thresholds on document frequency. The document frequency of a feature is the number of documents containing that feature. The DF method assumes that features with too high or too low document frequency contribute little to text classification and can be deleted. Though simple, DF performs well in both Chinese and English text classification [17-18].

3. Dataset Collection

3.1 Micro Blog Dataset (dataset A)

Since there is no common micro blog dataset in China, data were collected with a Web crawler from Sina Micro blog, which organizes micro blogs by subject. To avoid limiting the experimental results to a single field, the crawled data mainly cover four subjects: the H1N1 influenza vaccine, the Wangjialing mine disaster, film reviews and spring outing activities. Three group members first labeled the sentiment of the corpus independently, and the majority label among the three annotations was then taken for each comment, yielding 2134 comments in total, of which 1002 are positive and 1132 are negative.

3.2 Micro Blog Film Reviews and Ordinary Film Reviews (dataset B)

Film reviews were collected from Sina Micro blog and Douban respectively to test the generality of sentiment classification models between micro blog reviews and traditional reviews. In total 4000 film reviews were collected from Sina Micro blog, with 2000 positive and 2000 negative reviews, labeled in the same way as dataset A. For the 1000 reviews from Douban, which uses a 1-5 star rating system, four- or five-star reviews were labeled as positive and one- or two-star reviews as negative, and reviews without a rating were discarded, giving 500 positive and 500 negative reviews. According to statistics on the collected film reviews, the average length of micro blog reviews is 40 characters, while that of ordinary reviews is 1155 characters.

4. Experiments

4.1 Experiment Design

First, every review is segmented into Chinese words with ICTCLAS, and a vector space model is built with the feature weighting algorithm required by the experiment; features are then selected with the corresponding feature selection method, and finally classification models are trained with the three machine learning algorithms. The SVM and Naïve Bayes experiments are conducted in the WEKA environment (http://www.cs.waikato.ac.nz/ml/weka/), and the n-gram linguistic model experiment with LingPipe (http://alias-i.com/lingpipe/index.html). The experiments use 10-fold cross validation, with F-SCORE as the performance evaluation index. The F-SCORE is defined in Formula (7):

F = \frac{2 \times Recall \times Precision}{Recall + Precision}    (7)

where Recall and Precision denote the recall and precision of the algorithm.
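The sketch below mirrors this pipeline in Python under stated substitutions: jieba stands in for ICTCLAS segmentation, scikit-learn for WEKA and LIBLINEAR, and mutual information for IG feature selection; the input file name, its one-comment-per-line "label&lt;TAB&gt;text" format and the 2000-feature cut-off are assumptions for illustration only.

```python
# Sketch of the Section 4.1 pipeline: Chinese segmentation, TF-IDF weighting,
# feature selection, linear SVM, 10-fold cross-validation with F-SCORE.
# jieba stands in for ICTCLAS, scikit-learn for WEKA/LIBLINEAR, and mutual
# information for IG; file name, format and feature count are assumptions.
import jieba
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

def load_corpus(path):
    """Read one comment per line: 'pos' or 'neg', a tab, then the text."""
    texts, labels = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            label, text = line.rstrip("\n").split("\t", 1)
            texts.append(" ".join(jieba.cut(text)))      # word segmentation
            labels.append(1 if label == "pos" else 0)
    return texts, labels

texts, labels = load_corpus("weibo_comments.tsv")
pipeline = make_pipeline(
    TfidfVectorizer(token_pattern=r"(?u)\S+"),           # TF-IDF; keep 1-char words
    SelectKBest(mutual_info_classif, k=2000),            # keep about 2000 features
    LinearSVC(),                                         # linear SVM
)
scores = cross_val_score(pipeline, texts, labels, cv=10, scoring="f1")
print("10-fold F-SCORE: %.2f" % (scores.mean() * 100))
```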

4.2 Experimental Results and Analysis

4.2.1 Performance Comparison of Different Feature Weightings

This experiment compares the following three feature weighting algorithms:

(1) Boolean weighting (Presence): if the feature occurs in the document, the weight is 1, otherwise 0.

(2) Term Frequency weighting (TF): the weight of a feature is the number of times it occurs in the document.

(3) TF-IDF (Term Frequency-Inverse Document Frequency) weighting: this scheme also takes into account the number of documents containing the feature, on the assumption that the more documents contain the feature, the weaker its discriminative power. It is computed as:

W(t, d) = tf(t, d) \times \lg\left(\frac{N}{n_t}\right)    (8)

where N is the number of documents in the whole training set and n_t is the number of documents containing term t.

Most current research adopts a specific feature representation directly [2, 14]. Literature [1] compared Presence and TF in English sentiment analysis and found that Presence performs better; literature [20] compared Presence and TF in sentiment analysis of Chinese news and also found that Presence performs better. However, there is no comparative research of this kind for micro blog, so this paper compares the performance of the three weighting algorithms experimentally. In this experiment, IG is used as the feature selection algorithm and SVM and Naïve Bayes as the classification algorithms. The performance comparison of the three weighting algorithms is shown in Figure 1 and Figure 2.

Figure 1. Performance comparison of the three weighting algorithms with SVM
Figure 2. Performance comparison of the three weighting algorithms with Naïve Bayes

The figures show that the weighting algorithms have different strengths for different machine learning methods. As can be seen from Figure 1, with the SVM classification algorithm TF-IDF performs best, while Presence and TF perform similarly. As can be seen from Figure 2, with the Naïve Bayes algorithm Presence performs best and TF performs similarly to Presence, but TF-IDF performs less well, with its performance dropping noticeably at 3000-4000 features. Taking both classification algorithm and weighting into account, SVM with TF-IDF performs best at 2000 features, with an F-SCORE of 87.07, and Naïve Bayes with Presence performs best at 3000 features, also with an F-SCORE of 87.07. Therefore, under IG feature selection, the best combination is SVM with TF-IDF.
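A rough way to reproduce this comparison is sketched below: the three weighting schemes are crossed with SVM and Naïve Bayes under the same feature selection, reusing the `texts` and `labels` from the loader in the previous sketch. It approximates, rather than reproduces, the paper's WEKA setup.

```python
# Sketch of the Section 4.2.1 grid: Presence, TF and TF-IDF weighting crossed
# with SVM and Naive Bayes under the same feature selection. `texts`/`labels`
# come from the previous sketch; everything else is an illustrative
# approximation of the paper's WEKA setup, not a reproduction.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

weightings = {
    "Presence": CountVectorizer(binary=True),   # 1 if the feature occurs, else 0
    "TF": CountVectorizer(),                     # raw term frequency
    "TF-IDF": TfidfVectorizer(),                 # frequency discounted by document frequency
}
classifiers = {"SVM": LinearSVC(), "Naive Bayes": MultinomialNB()}

for wname, vectorizer in weightings.items():
    for cname, clf in classifiers.items():
        pipe = make_pipeline(vectorizer, SelectKBest(mutual_info_classif, k=2000), clf)
        f1 = cross_val_score(pipe, texts, labels, cv=10, scoring="f1").mean()
        print(f"{cname} + {wname}: F-SCORE {f1 * 100:.2f}")
```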

4.2.2 Comparison of Different Feature Selection Methods

This experiment compares the performance of different feature selection methods, adopting SVM as the classification algorithm and TF-IDF as the weighting algorithm. The experimental results are shown in Figure 3.

Figure 3. Comparison of feature selection methods

As can be seen from Figure 3, IG has an obvious advantage over CHI statistics and DF: IG performs best at 2000 features, with the F-SCORE reaching 87.07, while CHI statistics and DF perform similarly, with CHI statistics being less stable. Above 2500 features the performance of all three methods is basically steady.

4.2.3 Comparison of Three Machine Learning Algorithms

This experiment compares the performance of the three machine learning algorithms. As the experiment in Section 4.2.1 shows that the performance of SVM and Naïve Bayes depends on the weighting algorithm, this experiment compares their performance under the three weighting algorithms. Since the n-gram linguistic model involves no term weighting, the experiment varies n from 2 to 8 and reports the best result. The experimental results are shown in Table 1.

Table 1. Performance comparison of the three machine learning algorithms (F-SCORE)

    Classification algorithm | Presence | TF    | TF-IDF
    SVM                      | 85.10    | 84.54 | 87.07
    Naïve Bayes              | 87.07    | 86.41 | 84.91
    N-gram                   | 82.32    | --    | --

As can be seen from Table 1, the n-gram model performs worst of the three. The other two methods depend on the weighting algorithm: SVM performs better with TF-IDF, while Naïve Bayes performs better with Presence. From the above experiments, it can be concluded that the best configuration is TF-IDF as the weighting algorithm, SVM as the classification algorithm and IG as the feature selection algorithm. The following experiments are all conducted with this configuration unless otherwise stated.
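The character-level part of this comparison can be approximated as sketched below: LingPipe's character language-model classifier is replaced by character n-gram counts fed to multinomial Naïve Bayes, scanning n from 2 to 8 and keeping the best score. The classifier substitution and the reuse of `texts` and `labels` from the earlier sketch are assumptions, not the paper's setup.

```python
# Rough stand-in for the n-gram part of this experiment: LingPipe's character
# language-model classifier is approximated with character n-gram counts fed to
# multinomial Naive Bayes, scanning n from 2 to 8 and keeping the best score.
# The classifier substitution and the reuse of `texts`/`labels` are assumptions.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

best_n, best_f1 = None, 0.0
for n in range(2, 9):
    pipe = make_pipeline(
        CountVectorizer(analyzer="char", ngram_range=(n, n)),   # character n-grams
        MultinomialNB(),
    )
    f1 = cross_val_score(pipe, texts, labels, cv=10, scoring="f1").mean()
    if f1 > best_f1:
        best_n, best_f1 = n, f1
print(f"best n = {best_n}, F-SCORE = {best_f1 * 100:.2f}")
```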

4.2.4 Comparison between Micro Blog and Ordinary Reviews

Literature [2] shows that a sentiment classifier depends heavily on the field or subject. Given the different characteristics of micro blog and ordinary reviews, it is worth investigating whether a classifier can handle reviews of two different styles within the same field. The purpose of this experiment is to analyze, through a comparative study of sentiment classification on the two styles of reviews, whether a sentiment classifier for reviews in the same field depends on the style of the reviews.

First, the micro blog reviews in dataset B are divided into a training set of 3000 reviews and a test set of 1000 reviews, and the Douban reviews into a training set of 700 reviews and a test set of 300 reviews. Next, the two training sets are used to train the corresponding classification models, which are then evaluated on both test sets. The classification performance is compared in Table 2.

Table 2. Generality comparison between micro blog and ordinary review models (F-SCORE)

    Training set            | Test set            | SVM   | Bayes | N-gram
    Micro blog training set | Ordinary test set   | 75.77 | 78.50 | 75.23
    Micro blog training set | Micro blog test set | 86.68 | 84.32 | 85.12
    Ordinary training set   | Micro blog test set | 63.00 | 73.25 | 70.45
    Ordinary training set   | Ordinary test set   | 76.08 | 79.89 | 79.61

It can be seen from Table 2 that, compared with testing on reviews of the same style, models trained on one style generalize relatively poorly to the other. A likely reason is that the two kinds of reviews express emotions in different ways: micro blog reviews tend to express emotions directly and contain more sentiment terms per sentence, while in ordinary reviews sentiment terms are mixed into statements of fact.
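A sketch of this cross-style test is given below: a model is trained on each style's training set and evaluated on both test sets, reusing the loader and pipeline components from the earlier sketches. The file names and the positional 3000/1000 and 700/300 splits are illustrative assumptions that only follow the sizes reported above.

```python
# Sketch of the cross-style test in Section 4.2.4: fit a model on each style's
# training set and score it on both test sets. The loader and pipeline pieces
# come from the earlier sketches; file names and positional splits are
# illustrative assumptions matching the reported set sizes.
from sklearn.metrics import f1_score

weibo_texts, weibo_labels = load_corpus("weibo_film_reviews.tsv")     # 4000 reviews
douban_texts, douban_labels = load_corpus("douban_film_reviews.tsv")  # 1000 reviews

splits = {
    "Micro blog": (weibo_texts[:3000], weibo_labels[:3000],
                   weibo_texts[3000:], weibo_labels[3000:]),
    "Ordinary": (douban_texts[:700], douban_labels[:700],
                 douban_texts[700:], douban_labels[700:]),
}
for train_name, (X_train, y_train, _, _) in splits.items():
    model = make_pipeline(TfidfVectorizer(token_pattern=r"(?u)\S+"),
                          SelectKBest(mutual_info_classif, k=2000),
                          LinearSVC())
    model.fit(X_train, y_train)
    for test_name, (_, _, X_test, y_test) in splits.items():
        f1 = f1_score(y_test, model.predict(X_test))
        print(f"train on {train_name}, test on {test_name}: F-SCORE {f1 * 100:.2f}")
```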

5. Conclusion

This paper presents an empirical study of sentiment classification of micro blog. The experiments show that all three machine learning methods are effective for sentiment analysis, and that the best configuration is TF-IDF as the weighting algorithm, SVM as the classification algorithm and IG as the feature selection algorithm. The generality of sentiment classification models between micro blog reviews and ordinary film reviews was then studied; the experimental data show that models generalize relatively poorly across the two styles of reviews, so building a sentiment classification algorithm that applies to reviews of different styles is worth further study. This is a preliminary study of applying machine learning algorithms to the sentiment analysis of micro blog, and further work is needed, such as comparing machine-learning-based and semantics-based methods, and studying the feasibility of applying micro blog sentiment analysis in specific fields, for example tracking the evolution of public sentiment on emergencies in the biomedical field.

6. References

[1] Tan Songbo, Zhang Jin. An empirical study of sentiment analysis for Chinese documents. Expert Systems with Applications, pp. 2622-2629, 2008.
[2] Mullen T, Collier N. Sentiment analysis using support vector machines with diverse information sources. Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Barcelona, Spain, pp. 412-418, 2004.
[3] Hatzivassiloglou V, McKeown K. Predicting the semantic orientation of adjectives. Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 174-181, 1997.
[4] Jansen B J, Zhang Mimi. Micro blog power: tweets as electronic word of mouth. Journal of the American Society for Information Science and Technology, pp. 2169-2188, 2009.
[5] Shen Yang, Li Shuchen. Emotion mining research on micro-blog. 2009 1st IEEE Symposium on Web Society, pp. 71-75, 2009.
[6] Fan Rong-En, Chang Kai-Wei. LIBLINEAR: a library for large linear classification. Journal of Machine Learning Research, pp. 1871-1874, 2008.
[7] Ye Qiang, Zhang Ziqiong, Law R. Sentiment classification of online reviews to travel destinations by supervised machine learning approaches. Expert Systems with Applications, pp. 6527-6535, 2009.
[8] Carpenter B. Scaling high-order character language models to gigabytes. Proceedings of the 2005 Association for Computational Linguistics Software Workshop, pp. 1-14, 2005.
[9] Hui Cheng, Yun Liu, Juan Li, Jiang Zhu, Junjun Cheng. Content-based Micro Blog User Preference Analysis. JCIT, Vol. 7, No. 1, pp. 282-289, 2012.
[10] Pei Yin, Hongwei Wang, Wei Wang. Extracting Features for Sentiment Classification: in the Perspective of Statistical Natural Language Processing. AISS, Vol. 4, No. 15, pp. 33-41, 2012.
[11] Neda Ale Ebrahim, Mohammad Fathian, Mohammad Reza Gholamian. Sentiment Classification of Online Product Reviews Using Product Features. IJIPM, Vol. 3, No. 3, pp. 30-35, 2012.