
Positive or negative? Using blogs to assess vehicles features

Silvio S. Ribeiro Jr., Zilton Junior, Wagner Meira Jr., Gisele L. Pappa
Departamento de Ciência da Computação, Universidade Federal de Minas Gerais (UFMG)
CEP 31270-901, Belo Horizonte, MG, Brasil
{silviojr, zilton, meira, glpappa}@dcc.ufmg.br

Abstract. Social media has become a valuable source of information about what consumers think of products. In this work, we focus on analyzing opinions on individual product features expressed in reviews and blog comments. We describe an adaptation of a lexicon-based approach to the problem and propose a new approach based on supervised learning algorithms. We focus on vehicles, and present as a key finding the generalization performance of the models across different datasets from the same domain. Our results show that it is possible to achieve better precision and recall than traditional natural language processing approaches by using supervised learning algorithms that require much less human effort.

1. Introduction

Information about the reputation of companies and products has never been so available. A quick search on the Web regarding a product will produce many results about its characteristics, advantages, drawbacks and, more specifically, what people who have bought the product think about it. Most of this information is generated by ordinary users in social networks, blogs, micro-blogs, or online stores, and it is easily accessible and useful to the final consumer.

Given the amount of information available, many techniques have been proposed to extract useful information from all this content coming from different sources. In particular, many of these methods were developed to deal with data from Twitter¹. Significant research considering the content produced by this micro-blog has shown that it has a high degree of correlation with the real world. The applications already developed from Twitter data range from epidemic prediction [Gomide et al. 2011] to a better understanding of politics [Tumasjan et al. 2010] and natural disasters [Sakaki et al. 2010]. Blogs are another useful source of information. They usually carry more complete and structured information than that available in general-purpose social networks, as they are usually written and read by experts on a topic.

Among the techniques developed to extract information from different online media, those focusing on automatic sentiment analysis have received special attention [Wilson et al. 2005, Pang et al. 2002]. The task of automatic sentiment analysis can be defined as follows: given a text (tweet, comment, blog post, etc.), one wants to automatically classify its content as expressing a good or bad opinion towards a specific entity. This work focuses on automatic sentiment analysis for blog posts and comments. More specifically, we focus on a specific domain: vehicles.

¹ www.twitter.com

Suppose a company or a buyer wants to know what has been said about a new car just launched on the market. A set of blogs discussing the subject is known, but each post in a blog is followed by hundreds of comments, and it is difficult to summarize all this information. In particular, the user is interested in how the car performs, whether it is economical, whether the trunk is big enough for their needs, etc. Most of these answers can be obtained from the Web, based on other users' experiences or on blogs of specialists.

This paper proposes an approach for product feature-based sentiment analysis, where we are not interested in the overall opinion of the users, but rather in what they think about specific features/parts of the product, given that these parts are already known. The paper proposes a new approach for sentiment analysis based on learning algorithms, which uses content published in reviews to classify opinions expressed in blog comments. This strategy is particularly interesting because it does not rely on language-specific resources, as most feature-based sentiment analysis methods do. Besides, the method has another interesting characteristic: its training and test sets are obtained from different blogs about the same domain, and the classifier needs to be general enough to perform well on both datasets. Furthermore, a Portuguese version of the opinion-lexicon expansion strategy described in [Qiu et al. 2011] was implemented, and a variation of the lexicon-based algorithm of [Hu and Liu 2004] was used and compared to the learning approach. The latter produced significantly better results than the lexicon-based approach, supporting the claim that learning algorithms may achieve better results in sentiment classification without sophisticated linguistic resources.

The remainder of this paper is organized as follows. Section 2 describes related work, while Section 3 details the construction of the datasets. Section 4 explains how the proposed methods work, and Section 5 describes the experimental results. Finally, Section 6 draws conclusions and discusses future work.

2. Related Work

Many papers about automatic sentiment analysis have been published in recent years. Most of them focus on determining the sentiment present in a text (e.g., reviews) according to two main orientations: positive or negative. There are two widely used categories of opinion analysis strategies in the literature: lexicon-based and classification-based. Lexicon-based strategies use a list of positive and negative terms (an opinion lexicon) to compute the polarity of a document [Turney 2002] or of the sentences of a document [Wilson et al. 2005]. Creating an opinion lexicon to support these systems is a challenge, as it depends on many linguistic and corpus resources [Kamps et al. 2004, Esuli and Sebastiani 2005, Esuli and Sebastiani 2006]. Classification-based strategies have been used to determine the overall sentiment of a document by extracting a set of features from the target text and, given the real sentiment associated with the document, using a classification algorithm to learn from these data [Pang et al. 2002]. Both strategies have also been combined to perform sentiment analysis in political and movie review blogs [Melville et al. 2009]. This paper proposes and contrasts a representative method for each of the aforementioned approaches to perform a task that can be classified as product review.

Product review is not a new subject in the sentiment analysis field.

[Turney 2002] uses an unsupervised learning technique to classify movie reviews as recommended or not according to the average semantic orientation of the phrases in the review. The semantic orientation is calculated based on a phrase's mutual information with the words "poor" and "excellent". [Pang et al. 2002], in turn, determine the overall sentiment present in movie reviews using prior-knowledge-free supervised machine learning techniques. While the work of [Pang et al. 2002] is based on the sentiment of the whole review, [Wilson et al. 2005] determine the contextual polarity of sentiment expressions through a phrase-level sentiment analysis combining machine learning classification and a prior-polarity subjectivity lexicon.

There are two core tasks in identifying the opinion about a product's features: identifying the features themselves and determining the opinion orientation towards them in each sentence. [Yi et al. 2003] extract specific features and their associated sentiment using a sentiment lexicon and a sentiment pattern database. Our work does not deal with feature extraction: the product features to be analyzed are given as input to the system. For product-feature opinion analysis, [Nasukawa and Yi 2003] present an approach to extract sentiments associated with polarities for specific subjects from a document using manually defined sentiment expressions and a sentiment lexicon. Their system yields high precision, but low recall. [Liu et al. 2005], in turn, propose a technique based on language pattern mining to extract product features from the pros and cons of a particular type of review. A prototype called Opinion Observer was implemented to enable a user to compare consumers' opinions about competing products. [Hu and Liu 2004] determine the opinion by counting the numbers of positive and negative adjectives; the most frequent polarity determines the overall orientation of the sentence. To solve the feature-extraction problem and create the domain-dependent opinion lexicon required by most sentiment analysis tasks, [Qiu et al. 2011] created a technique called double propagation. The approach propagates information between opinion words and product features to expand both the opinion lexicon and the feature set.

Our method assigns phrase-level polarities to different features of a given product. Previous works have aimed to perform this task, but using NLP techniques that rely on linguistic resources such as opinion lexicons and handcrafted linguistic patterns. We demonstrate here that it is possible to achieve high accuracy on feature-level sentiment analysis by using well-known machine learning classifiers, with no handcrafted sentiment expressions or sentiment lexicons.

3. Vehicle Users' Sentiment Dataset Construction

This section describes the datasets used in this paper. We detail them before the methods because doing so makes some of the methods' decisions easier to understand. Furthermore, this work differs from others in the strategy used to learn: datasets from different sources in the same domain are used. Note, however, that the proposed methods are not domain dependent.

Two datasets were built for product feature-driven sentiment analysis, namely the reviews dataset (REV) and the comments dataset (COM). The reviews dataset was created using a website specialized in vehicles called Carrosnaweb².

² http://www.carrosnaweb.com.br/

Carrosnaweb was chosen because it presents an interesting structure for obtaining labeled data about cars with no labeling cost. In Figure 1, we show one of the sections of the website, called Users Opinion (Opinião do Dono on the original site). There is one page for each vehicle, in a total of 729 vehicles, with a summary of 15 different features, which vary from stability to brakes. In the example, we show the opinions for the Fiat Uno G2. Each car owner is asked to rate a set of features of the car with star ratings from 1 to 5. Besides, there are four free-text fields, where users list the pros, cons, failures, and other comments about the car.

Figure 1. Example of two reviews made by car owners. The one on the left is positive (overall evaluation: 9.27 stars), while the one on the right is negative (overall evaluation: 5.73 stars).

Table 1 shows the free text for the positive and negative evaluations shown in Figure 1. Observe that the size of the comments varies significantly, but in general what appears in the pros is in favor of the car and its features, while the opposite is true for the cons. Here we assume this observation is always true, although there are, for instance, a few cases of irony, which are more difficult to handle and will be treated in future work (e.g., "Cons: It consumes 1 litre/13 km at 140 km/h with the air conditioner on - Really, really bad... lol... lol...").

Table 1. Examples of reviews found in REV

Pros extracted from the positive evaluation shown above: "The steering and suspensions are soft. Handling is great, comfortable. It has good consumption, 12 km/l in the city using gasoline, and excellent brakes and height, rear-view mirror. The stability is good even when I abuse it, but I did not run it using alcohol to try its performance."

Pros extracted from the negative evaluation shown above: "It is a beautiful car."

Cons extracted from the positive evaluation shown above: "The back visibility is bad but the big rear-view mirrors can help you a lot. They should have kept the Fire's engine because the Evo's is slow, just average for a 1.0. The internal space is just average."

Cons extracted from the negative evaluation shown above: "Consumption and stability."

The second dataset, COM, was created from 88,208 comments extracted from 19 blogs about cars. This dataset contains more diverse linguistic structures, and most of its content does not explicitly relate to the features of the car. Unlike in the first dataset, most of its statements may be neutral. We do not deal with neutral statements for now, and the dataset considers only statements that express sentiment.

As we are working with sentence classification, for each review and/or comment in our datasets we extract its sentences. The sentences are identified using the simplest possible approach, i.e., cutting sentences where one of the three most common punctuation signs appears: the period, the exclamation mark, or the question mark. Having the text broken into sentences, the method identifies the sentences that have at least one explicit reference to one of the vehicle features of interest. These features are defined by the user according to his/her interest or, in our case, with the help of a handcrafted source. Finally, pronouns, articles, prepositions, and conjunctions are discarded as stopwords.
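As a minimal sketch of this preprocessing (not the authors' code), the fragment below splits a text at the three punctuation marks, keeps the sentences that explicitly mention a feature of interest, and discards stopwords. The Portuguese feature and stopword lists are illustrative stand-ins for the handcrafted resources mentioned above.

import re

# Illustrative stand-ins for the handcrafted feature list and the
# pronoun/article/preposition/conjunction stopword list (assumptions).
FEATURES = {"motor", "acabamento", "desempenho", "freio", "cambio"}
STOPWORDS = {"o", "a", "os", "as", "um", "uma", "de", "em", "e", "ou", "que", "ele", "ela"}

def extract_feature_sentences(text):
    """Split at period, exclamation and question marks, keep sentences
    that explicitly mention at least one feature, and drop stopwords."""
    kept = []
    for sentence in re.split(r"[.!?]+", text):
        tokens = [t for t in re.findall(r"\w+", sentence.lower()) if t not in STOPWORDS]
        mentioned = FEATURES.intersection(tokens)
        if mentioned:
            kept.append((tokens, mentioned))
    return kept

# Example: one comment yields one feature-related sentence (about "motor").
print(extract_feature_sentences("O motor é ótimo! Comprei o carro em 2010."))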

Figure 2. Frequency of the number of words per sentence: (a) COM, (b) REV.

A few characteristics of the datasets are summarized in Table 2, which lists the total numbers of posts and sentences in the original datasets, followed by those of interest (which refer to a considered feature).

Table 2. Characteristics of the review and comments datasets

Dataset   #documents   #sentences   #feature-related sentences   avg. words/sentence
REV       24,802       48,121       19,689                       28
COM       87,940       247,257      45,357                       31

Figure 3 complements the table. For example, in Figure 3(a) we observe that, in REV, slightly more than 15% of the sentences reference more than two features, while in COM this fraction is much smaller. Comments tend to have longer sentences than reviews, as seen in Figure 2. Such differences may be relevant for classifiers due to differences between text genres.

Figure 3. Frequency of the number of features per sentence: (a) COM, (b) REV.

In Table 3, we observe the class distribution according to each feature for the reviews dataset. We used this dataset to train different classifiers, and cross-validation was used to estimate generalization. We also randomly selected and labeled 200 examples from the COM dataset for evaluation purposes, covering three features: performance (desempenho), engine (motor), and workmanship (acabamento). We use this labeled set as a test set for the created classifiers and will refer to it as TEST from now on.

Table 3. Number of positive and negative sentences in the REV dataset

Feature           #positive   #negative
suspension        168         50
instruments       83          218
interior design   159         204
brakes            335         143
transmission      268         343
style             40          596
cost              164         592
performance       137         804
trunk             460         660
stability         254         1371
workmanship       826         1092
consume           1527        1635
engine            1524        2238

4. Two Methods for Product-Feature Review

This section describes the two methods created for product feature-driven sentiment analysis in blog comments. The first is a naive alternative coming from the natural language processing field, in which a lexicon for sentiment analysis is created from a data source, adapted here to Portuguese [Qiu et al. 2011]. The second method is based on machine learning classification and extracts a set of language-based characteristics to train a classifier, which learns to distinguish good from bad opinions. We start by describing the feature extraction step, used by both methods, and then describe each method in detail.

4.1. Feature Extraction

For both approaches presented in this paper, a feature extraction process is performed using a method based on grammatical dependency trees. The dependency tree is generated from a parse tree. A parse tree represents the syntactic structure of a sentence according to the grammar. We used Freeling [Padró et al. 2010] to generate parse trees. A dependency tree is a representation that denotes grammatical relations between words in a sentence [Culotta and Sorensen 2004]. For example, subjects are dependent on their verbs, and adjectives are dependent on the nouns they modify. A set of rules is used to transform a parse tree into a dependency tree. We generated the dependency trees using DepPattern³. In a dependency tree, every node represents a word, and the edge between a parent and a child node specifies the grammatical relationship between the two words, as shown in Figure 4. This representation is useful for extracting words that are grammatically related to the features we are analyzing in a sentence.

Figure 4. A dependency tree generated by DepPattern.

Figure 4 illustrates the dependency tree of the sentence "O acabamento interno é lindo e o câmbio automático dá um charme ao veículo", which reads in English as "The internal workmanship is beautiful and the automatic transmission gives the vehicle a charm". This sentence is composed of two clauses, connected by "e" (and). Acabamento (workmanship) is the subject and also a noun, and is related to lindo (beautiful) by the verb ser (to be). Automático (automatic) is also related to câmbio (transmission), but they are directly connected. In both cases, the verbs are connected to the car features because they are the subjects of the clauses.

³ http://gramatica.usc.es/pln/tools/deppattern.html
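The paper's pipeline builds these trees with Freeling and DepPattern; as a minimal illustration (not the authors' code), the sketch below assumes the tree is already available as (head, dependent) edges with a POS tag per word, and collects the words grammatically related to a feature, including adjectives reached through a verb, which is how the JJS and GROUP feature sets of Section 4.3 can be assembled.

# A minimal sketch of collecting words grammatically related to a car
# feature. The tree for "O acabamento interno é lindo" is hand-coded
# here as (head, dependent) edges plus a word -> POS-tag map.
POS = {"acabamento": "NOUN", "interno": "ADJ", "é": "VERB", "lindo": "ADJ"}
EDGES = [("é", "acabamento"), ("é", "lindo"), ("acabamento", "interno")]

def related_words(feature, edges, pos):
    """Words directly linked to the feature, plus adjectives reached
    through a verb that governs the feature (e.g. via the copula)."""
    related = set()
    for head, dep in edges:
        if dep == feature:                 # feature is a dependent
            related.add(head)
            if pos[head] == "VERB":        # hop over the governing verb
                related.update(d for h, d in edges
                               if h == head and pos[d] == "ADJ")
        elif head == feature:              # feature governs the word
            related.add(dep)
    return related

print(related_words("acabamento", EDGES, POS))  # {'é', 'interno', 'lindo'}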

4.2. Lexicon-based approach

Our lexicon-based approach is adapted from [Hu and Liu 2004] and can be divided into three main steps. First, we identify the adjectives related to the feature in each sentence. Second, we check the orientation of these adjectives in the lexicon. Finally, we count the numbers of positive and negative adjectives, and the most frequent polarity determines the overall orientation toward the feature. If there are equal numbers of positive and negative opinion adjectives, the orientation is given by the opinion adjective that is closest to the feature in the sentence. For example, in the sentence "The internal workmanship is beautiful and the automatic transmission gives the vehicle a charm", three adjectives (internal, beautiful, and automatic) would be identified and classified using a lexicon. The lexicon gives the orientation of each adjective, i.e., whether it is positive or negative. Three different approaches to generating a lexicon were used here:

Feature-based propagation (FBP): based on [Qiu et al. 2011]. Using the REV dataset, we extracted all the adjectives related to each car feature. All the adjectives found in the pros sections are considered candidates to be positive, and all the adjectives found in the cons sections are candidates to be negative. The intersection of those two sets is removed for being dubious. In this manner, each feature has its own initial seed of sentiment words. Using the dependency tree as indicated in [Qiu et al. 2011], more adjectives are extracted from the COM dataset for each car feature. We assign the polarity of a new adjective according to its co-occurrence with an already known adjective. If it is not possible to predict the orientation of an adjective, it is disregarded. Table 6 presents the number of adjectives found per feature. Note that the propagation produced just a minor expansion for this dataset.

Simple propagation (SP): similar to the above, except that it does not consider that each feature has its own set of opinion words. The initial seed is formed by all the adjectives found in the pros and cons of REV. Again, the adjectives that appear in both pros and cons are not considered, for being dubious. The co-occurrence of adjectives in sentences of the COM dataset indicates the possible orientation of the newly found adjectives. To favor precision, we do not consider adjectives that co-occur with both positive and negative words of our seed. Our initial seed contained 392 positive and 242 negative adjectives. By the end of the propagation, we had 432 positive and 263 negative adjectives in dictionary form.

General Opinion Lexicon (GOL): created from many sources by [Souza et al. 2012] as a general opinion lexicon (the two above focus on adjectives in the context of vehicles). It contains 4,268 positive and 4,580 negative adjectives in dictionary form.

These lexicons may classify the same adjective with different orientations. If an adjective is not present in the lexicon, it is ignored. In the case of the sentence above, assume that the lexicon used classified beautiful as positive and did not contain the others. In this case, the sentence has one positive adjective related to workmanship, and hence its orientation is also positive. Since transmission is not linked to a classified adjective, no orientation is assigned to it, which reduces the recall. A minimal sketch of this decision rule is shown below.
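The sketch is not the authors' code: the toy lexicon stands in for FBP, SP, or GOL, and the tokens come from the worked example above.

# Minimal sketch of the lexicon-based decision rule. The toy LEXICON
# stands in for FBP, SP or GOL; +1 marks positive, -1 negative.
LEXICON = {"beautiful": +1, "great": +1, "slow": -1, "weak": -1}

def classify_feature(tokens, feature, related_adjectives):
    """Majority vote over the related adjectives found in the lexicon;
    ties are broken by the known adjective closest to the feature."""
    known = [a for a in related_adjectives if a in LEXICON]
    if not known:
        return None                     # no known adjective: abstain (hurts recall)
    score = sum(LEXICON[a] for a in known)
    if score == 0:                      # tie: closest known adjective decides
        fpos = tokens.index(feature)
        closest = min(known, key=lambda a: abs(tokens.index(a) - fpos))
        score = LEXICON[closest]
    return "positive" if score > 0 else "negative"

tokens = "the internal workmanship is beautiful".split()
print(classify_feature(tokens, "workmanship", {"internal", "beautiful"}))  # positive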

4.3. Learning-based approach

The learning-based approach is also based on three main phases: (i) feature extraction, (ii) training, and (iii) testing. The first decision when using a classifier is which features should be used to describe the data. The most intuitive choice is to use the whole sentence after the preprocessing step; however, other alternatives are discussed here. We propose four sets of features, all of them based on the dependency tree described earlier, and each of them based on a grammatical class:

JJS: the adjectives directly linked, or indirectly linked through a verb, to the product feature of interest;
NNS: the nouns directly linked to the product feature of interest;
VBS: the verbs directly linked to the product feature of interest;
GROUP: the adjectives, nouns, and verbs linked to the product feature of interest.

These proposed feature sets are compared to the use of all words of the complete original sentence, from now on referred to as ORIG, and to its variation using bigrams (ORIG2Gram). Having extracted the set of features from the original sentences, we train a classifier to generate a classification model for each car feature. Here we report the experiments for two classifiers: Naive Bayes and Support Vector Machine (SVM). Our choice was based on the results obtained in previous work, such as [Pang et al. 2002], and on the fact that the two build models in very different ways. Note that both classifiers work with supervised learning and hence need labeled data. For the vehicles dataset, the training phase uses the REV dataset, whose sentences had an orientation automatically attributed to them due to the structure of the Carrosnaweb site.

The Naive Bayes classifier is a probabilistic classifier based on Bayes' theorem. It assumes that features are independent given the class and, despite its simplicity, it performs well, especially in text classification [Pang et al. 2002]. The SVM classifier, in turn, is a large-margin classifier, and the basic idea behind the training procedure is to find a hyperplane that not only separates the document vectors of one class from those of the other, but for which the separation is as large as possible [Pang et al. 2002].

Having trained a set of models on REV, we then predict the orientation of the sentences of the COM dataset, which is unlabeled. For evaluation purposes, the results reported below take into account the small set of labeled examples named TEST, but the idea of the model is to label the new data without previously knowing the class.
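As an illustration of the training and prediction phases, the sketch below uses scikit-learn; the paper does not name an implementation, so the library choice and the toy data are ours.

# A sketch of the training/prediction phases with scikit-learn. Each
# car feature gets its own classifier, trained on REV sentences
# represented by one of the extracted word sets (here, JJS-style
# adjective lists joined into strings).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

def train_feature_classifier(train_texts, train_labels, model="nb"):
    vectorizer = CountVectorizer()        # use ngram_range=(1, 2) for ORIG2Gram
    X = vectorizer.fit_transform(train_texts)
    clf = MultinomialNB() if model == "nb" else LinearSVC()
    clf.fit(X, train_labels)
    return vectorizer, clf

# Toy REV-style training data for the "workmanship" feature (illustrative).
texts = ["lindo caprichado", "bonito", "fraco tosco", "ruim fraco"]
labels = ["positive", "positive", "negative", "negative"]
vec, clf = train_feature_classifier(texts, labels)
# Predict the orientation of unlabeled COM sentences with the REV-trained model.
print(clf.predict(vec.transform(["acabamento fraco"])))  # ['negative']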

5. Experiments and Results

This section is divided into two parts: the lexicon-based and the classification-based approaches. For the classification-based approach, we performed experiments using the REV dataset together with a 5-fold cross-validation process, in order to assess its generality on data coming from the same source. Recall that we generate one classifier for each of the features described in Table 3. In a second step, we applied the models and lexicons produced in the first step to the COM and TEST datasets.

During the first set of experiments, we tried many different combinations of preprocessing steps. Among them, we tested the stemmed forms of the words as well as their dictionary forms. The stemmed forms produced systematically inferior results to their dictionary-form counterparts; since there was not enough space for both results, we show only those related to the dictionary-form features. A trigram variation of ORIG was also tested, and performed worse than the ones reported here.

Tables 4 and 5 report the values of precision, recall, and F-measure for each class.

Table 4. Precision (P), Recall (R), and F-measure (F) for the SVM classifier

                 ORIG               JJS                GROUP              ORIG2Gram
Feature          P     R     F      P     R     F      P     R     F      P     R     F
suspension       0.60  0.77  0.67   0.72  0.78  0.72   0.78  0.79  0.73   0.64  0.77  0.68
instruments      0.71  0.74  0.67   0.80  0.78  0.74   0.78  0.77  0.73   0.70  0.74  0.65
interior         0.66  0.64  0.62   0.75  0.62  0.60   0.63  0.61  0.58   0.64  0.60  0.51
brakes           0.62  0.71  0.60   0.80  0.77  0.72   0.77  0.76  0.72   0.77  0.72  0.61
transmission     0.65  0.63  0.60   0.73  0.64  0.62   0.68  0.66  0.65   0.59  0.58  0.46
style            0.88  0.94  0.91   0.89  0.94  0.91   0.92  0.94  0.92   0.88  0.94  0.91
cost             0.78  0.80  0.74   0.79  0.80  0.73   0.75  0.79  0.74   0.81  0.79  0.71
performance      0.80  0.86  0.79   0.82  0.86  0.81   0.86  0.87  0.84   0.76  0.85  0.79
trunk            0.71  0.71  0.70   0.78  0.71  0.70   0.75  0.73  0.71   0.70  0.66  0.61
stability        0.84  0.85  0.80   0.86  0.87  0.83   0.86  0.88  0.85   0.83  0.85  0.79
workmanship      0.71  0.70  0.68   0.71  0.69  0.67   0.74  0.73  0.73   0.73  0.61  0.50
consume          0.65  0.65  0.65   0.75  0.67  0.65   0.71  0.71  0.70   0.71  0.57  0.48
engine           0.67  0.67  0.65   0.72  0.67  0.61   0.70  0.70  0.68   0.67  0.64  0.55

Table 5. Precision (P), Recall (R), and F-measure (F) for the Naive Bayes classifier

                 ORIG               JJS                GROUP              ORIG2Gram
Feature          P     R     F      P     R     F      P     R     F      P     R     F
suspension       0.89  0.87  0.87   0.81  0.81  0.78   0.83  0.83  0.82   0.89  0.85  0.86
instruments      0.89  0.88  0.88   0.81  0.79  0.75   0.82  0.81  0.81   0.88  0.87  0.88
interior         0.86  0.85  0.85   0.77  0.66  0.65   0.74  0.73  0.73   0.86  0.86  0.86
brakes           0.86  0.85  0.85   0.81  0.79  0.76   0.81  0.80  0.80   0.85  0.81  0.82
transmission     0.89  0.89  0.89   0.76  0.66  0.64   0.81  0.80  0.80   0.89  0.89  0.89
style            0.91  0.92  0.92   0.90  0.94  0.91   0.92  0.93  0.92   0.93  0.89  0.90
cost             0.90  0.89  0.89   0.80  0.81  0.76   0.84  0.85  0.83   0.88  0.87  0.88
performance      0.90  0.89  0.90   0.86  0.87  0.85   0.83  0.86  0.83   0.90  0.88  0.89
trunk            0.92  0.92  0.92   0.79  0.72  0.72   0.87  0.87  0.87   0.93  0.93  0.93
stability        0.93  0.93  0.93   0.86  0.87  0.85   0.89  0.90  0.89   0.93  0.93  0.93
workmanship      0.91  0.91  0.91   0.72  0.69  0.67   0.84  0.84  0.84   0.92  0.92  0.92
consume          0.89  0.89  0.89   0.76  0.69  0.67   0.80  0.80  0.80   0.89  0.89  0.89
engine           0.88  0.88  0.88   0.70  0.68  0.64   0.78  0.78  0.77   0.89  0.89  0.89

The results show that, although SVM has shown good performance for overall sentiment analysis [Pang et al. 2002], Naive Bayes performed significantly better for feature-based analysis, presenting good precision and recall even for the features with unbalanced class distributions and few examples. The GROUP feature set is clearly the best choice for SVM, while ORIG and its bigram variation (ORIG2Gram) are better for Naive Bayes. We compared the aforementioned results to the other feature-set variants using a two-tailed t-test, marking each result as a non-significant variation, a significant negative variation, or a significant positive variation.
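As a sketch of this evaluation protocol, assuming scikit-learn and scipy (the paper does not name its tooling), the fragment below computes per-fold macro precision/recall/F-measure under 5-fold cross-validation and shows how two feature-set variants could be compared with a paired two-tailed t-test; the data is a toy stand-in for one feature's REV sentences.

import numpy as np
from scipy.stats import ttest_rel
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def cv_scores(texts, labels, ngram_range=(1, 1)):
    """5-fold CV returning per-fold macro precision/recall/F-measure."""
    pipe = make_pipeline(CountVectorizer(ngram_range=ngram_range), MultinomialNB())
    return cross_validate(pipe, texts, labels, cv=5,
                          scoring=("precision_macro", "recall_macro", "f1_macro"))

# Toy stand-in for one feature's REV sentences (the real data is Table 3's).
texts = ["bom otimo", "lindo", "excelente bom", "otimo lindo", "bom",
         "ruim fraco", "pessimo", "fraco ruim", "lento fraco", "ruim"]
labels = ["pos"] * 5 + ["neg"] * 5
scores = cv_scores(texts, labels)
print(np.mean(scores["test_f1_macro"]))

# Comparing two feature-set variants fold by fold, as in the paper's
# two-tailed t-test (hypothetical score arrays f1_orig, f1_group):
# t, p = ttest_rel(f1_orig, f1_group)   # significant if p < 0.05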

In the second part of the experiments, we applied the models trained on REV to the COM dataset. The results presented here represent a small sample of the dataset, but intend to give an idea of how the method performed in two very different styles of text: comments and reviews.

From Tables 7 and 8, we can see that both classifiers' performances are significantly worse than the ones obtained on the REV dataset. For both classifiers, adjectives proved to be the best choice of features. One possible reason is that, even with the text genre shift, most adjectives still discriminate between the two classes. The recall is low, as in most sentiment analysis approaches, since not all opinion sentences contain adjectives.

Table 7. Precision (P) and Recall (R) for the SVM classifier

                 ORIG           JJS            GROUP          ORIG2Gram
Feature          P     R        P     R        P     R        P     R
workmanship      0.72  0.72     0.83  0.74     0.73  0.72     0.66  0.66
performance      0.81  0.59     0.80  0.66     0.80  0.66     0.69  0.64
engine           0.70  0.70     0.84  0.53     0.72  0.59     0.69  0.68

Table 8. Precision (P) and Recall (R) for the Naive Bayes classifier

                 ORIG           JJS            GROUP          ORIG2Gram
Feature          P     R        P     R        P     R        P     R
workmanship      0.69  0.69     0.82  0.74     0.71  0.71     0.61  0.61
performance      0.76  0.64     0.89  0.64     0.89  0.58     0.73  0.65
engine           0.73  0.72     0.79  0.57     0.71  0.64     0.74  0.72

Next, we present the results obtained with the lexicon approach. Recall that the sentences in the REV dataset were used as seeds to the method, which used an expansion process over the COM dataset. Table 6 shows the number of adjectives found before and after expansion.

Table 6. Expansion of the opinion lexicons per car feature

                    Before propagation      After propagation
                    Positive   Negative     Positive   Negative
workmanship         55         43           62         47
performance         39         20           39         23
engine              93         43           119        56
All features (SP)   392        242          432        263

These sets were used to classify each sentence in the TEST dataset. The precision and recall results are shown in Table 9.

Table 9. Precision (P) and Recall (R) for the lexicon-based classifier

                 GOL            SP             FBP
Feature          P     R        P     R        P     R
workmanship      0.65  0.31     0.64  0.38     0.82  0.25
performance      0.61  0.23     0.69  0.23     1.00  0.02
engine           0.58  0.32     0.67  0.34     0.75  0.17

The general opinion lexicon produced absolute values of precision and recall lower than the SP lexicon, mainly because SP is more specific to the domain (cars) than the GOL lexicon. However, those differences are not as significant as when we compare both results to the FBP lexicon results. Although FBP presents a higher precision, its recall remains even lower than that of the previous lexicons. The main reason is the feature specialization in this lexicon, which leads to a higher precision; nevertheless, the number of opinion words extracted is not enough to cover all opinions. When compared to the results generated by the classification approach, the results produced by the lexicons are significantly worse than those produced by the classification-based models. The good results for the JJS set show that the classifier learned the discriminant adjectives better than the lexicon-based approach.

6. Conclusions and Future Work

This paper compared two approaches for product-feature review, one based on lexicons and another based on classification. Despite not using sophisticated linguistic resources and patterns, the machine learning classifiers performed considerably better than the lexicon-based approaches. Those approaches obtained good results in other works, but they require extensive work to produce a domain-specific opinion lexicon and sentiment pattern databases, along with a lot of handcrafted work. We suppose better results could have been achieved if we had made more effort to assemble those resources. Nevertheless, the machine learning classifiers showed good precision and recall on the TEST dataset with less effort.

One of the interesting things tested was how methods based on terms generalize across datasets from the same domain. More sophisticated domain-transfer learning techniques will be applied in future work to enhance the classifiers' performance on blog comments, since the two datasets (REV and COM) belong to different text genres. Another direction for future work is to deal with non-polar sentences, since they might be as important to identify as the polar ones. Tests with datasets coming from other domains will also be performed.

References

Culotta, A. and Sorensen, J. (2004). Dependency tree kernels for relation extraction. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, pages 423-429.

Esuli, A. and Sebastiani, F. (2005). Determining the semantic orientation of terms through gloss classification. In Proceedings of the Conference on Information and Knowledge Management, pages 617-624.

Esuli, A. and Sebastiani, F. (2006). Determining term subjectivity and term orientation for opinion mining. In Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics, pages 193-200.

Gomide, J., Veloso, A., Meira Jr., W., Benevenuto, F., Almeida, V., Ferraz, F., and Teixeira, M. (2011). Dengue surveillance based on a computational model of spatio-temporal locality of Twitter. In Proceedings of the 3rd International Conference on Web Science, pages 1-8.

Hu, M. and Liu, B. (2004). Mining and summarizing customer reviews. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 168-177.

Kamps, J., Marx, M., Mokken, R. J., and de Rijke, M. (2004). Using WordNet to measure semantic orientation of adjectives. In Proceedings of the 4th International Conference on Language Resources and Evaluation, pages 1115-1118.

Liu, B., Hu, M., and Cheng, J. (2005). Opinion Observer: analyzing and comparing opinions on the web. In Proceedings of the 14th International Conference on World Wide Web, pages 342-351.

Melville, P., Gryc, W., and Lawrence, R. D. (2009). Sentiment analysis of blogs by combining lexical knowledge with text classification. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1275-1284.

Nasukawa, T. and Yi, J. (2003). Sentiment analysis: capturing favorability using natural language processing. In Proceedings of the 2nd International Conference on Knowledge Capture, pages 70-77.

Padró, L., Reese, S., Agirre, E., and Soroa, A. (2010). Semantic services in FreeLing 2.1: WordNet and UKB. In Bhattacharyya, P., Fellbaum, C., and Vossen, P., editors, Principles, Construction, and Application of Multilingual Wordnets, pages 99-105.

Pang, B., Lee, L., and Vaithyanathan, S. (2002). Thumbs up?: Sentiment classification using machine learning techniques. In Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing - Volume 10, pages 79-86.

Qiu, G., Liu, B., Bu, J., and Chen, C. (2011). Opinion word expansion and target extraction through double propagation. Computational Linguistics, 37(1):9-27.

Sakaki, T., Okazaki, M., and Matsuo, Y. (2010). Earthquake shakes Twitter users: real-time event detection by social sensors. In Proceedings of the 19th International Conference on World Wide Web, pages 851-860.

Souza, M., Vieira, R., Busetti, D., Chishman, R., and Alves, I. M. (2012). Construction of a Portuguese opinion lexicon from multiple resources. In Proceedings of the 8th Brazilian Symposium in Information and Human Language Technology, pages 59-66.

Tumasjan, A., Sprenger, T. O., Sandner, P. G., and Welpe, I. M. (2010). Predicting elections with Twitter: What 140 characters reveal about political sentiment. In Proceedings of the International Conference on Weblogs and Social Media, pages 178-185.

Turney, P. D. (2002). Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 417-424.

Wilson, T., Wiebe, J., and Hoffmann, P. (2005). Recognizing contextual polarity in phrase-level sentiment analysis. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, pages 347-354.

Yi, J., Nasukawa, T., Bunescu, R., and Niblack, W. (2003). Sentiment Analyzer: Extracting sentiments about a given topic using natural language processing techniques. In Proceedings of the 3rd IEEE International Conference on Data Mining, pages 427-435.