Proceedings of the Tenty-Fourth AAAI Conference on Artificia Inteigence (AAAI-10) Sentiment Anaysis ith Goba Topics and Loca Dependency Fangtao Li, Minie Huang, Xiaoyan Zhu State Key Laboratory of Inteigent Technoogy and Systems Tsinghua Nationa Laboratory for Information Science and Technoogy Department of Computer Science and Technoogy, Tsinghua University, Beijing 100084, China fangtao06@gmai.com; {aihuang, xy-dcs}@tsinghua.edu.cn Abstract With the deveopment of Web 2.0, sentiment anaysis has no become a popuar research probem to tacke. Recenty, topic modes have been introduced for the simutaneous anaysis for topics and the sentiment in a document. These studies, hich jointy mode topic and sentiment, take the advantage of the reationship beteen topics and sentiment, and are shon to be superior to traditiona sentiment anaysis toos. Hoever, most of them make the assumption that, given the parameters, the sentiments of the ords in the document are a independent. In our observation, in contrast, sentiments are expressed in a coherent ay. The oca conjunctive ords, such as and or but, are often indicative of sentiment transitions. In this paper, e propose a major departure from the previous approaches by making to inked contributions. First, e assume that the sentiments are reated to the topic in the document, and put forard a joint sentiment and topic mode, i.e. Sentiment-. Second, e observe that sentiments are dependent on oca context. Thus, e further extend the Sentiment- mode to Dependency- Sentiment- mode by reaxing the sentiment independent assumption in Sentiment-. The sentiments of ords are vieed as a Markov chain in Dependency- Sentiment-. Through experiments, e sho that expoiting the sentiment dependency is ceary advantageous, and that the Dependency-Sentiment- is an effective approach for sentiment anaysis. 1. Introduction With the deveopment of Web 2.0, peope can more easiy express their vies and opinions on the Web. They can post revies at E-Commerce eb sites, and express their opinions on amost everything in forums, bogs or other discussion groups. The opinion information they eave behind is important for both onine industries and customers. By coecting the opinion information, companies can decide on their strategies for marketing and products improvement. Customers can make a better decision hen purchasing products or services. Hence, in Copyright 2010, Association for the Advancement of Artificia Inteigence (.aaai.org). A rights reserved. recent years, sentiment anaysis has become a popuar topic for many research communities, incuding artificia inteigence. In this paper, e focus on the task of sentiment cassification, hich cassifies the sentiment orientation of an opinioned document (i.e. product revie), as thumbs up (positive) or thumbs don (negative) (Pang and Lee, 2002; Turney, 2002; Biter et a, 2007). Here, e refer sentiment cassification as the document-eve sentiment cassification, here sentiment is determined based on the overa sentiment orientation of an entire document. Most existing approaches (Pang and Lee, 2002; Cui et a, 2006; Biter et a, 2007) adopt supervised earning modes, in hich they are trained on annotated corpora of manuay abeed documents. Severa unsupervised earning approach have aso been proposed (Turney, 2002; Liu 2010), here many are based on given sentiment exicons. The sentiment poarities are dependent on topics or domains. The same ord may have different sentiment poarities in different domains. For instance, though the adjective compex in the sentence, The movie is compex and great!, may have positive orientation in a movie revie, it coud aso have negative orientation in a sentence, It is hard to use such a compex camera in a eectronic revie. Therefore, it is more suitabe to anaye the topic and sentiment simutaneousy. Recenty, various joint sentiment and topic modes (Mei et a, 2007; Titov and McDonad, 2008; Lin and He, 2009) are proposed based on topic mode (Bei et a, 2003; Hofmann, 1999). These methods vie the text as a mixture of goba topics, and they anaye the sentiment in the more detaied topic or domain eve. Hoever, they a assume that, given the parameters, the sentiments of the ords in a document are independent, hich is knon as the bag of ords assumption for topic mode. We present an extension of the topic mode approach to sentiment anaysis that is aso a significant departure from the previous approach. We observe that the sentiment orientation of each ord is dependent on the oca context. The conjunctions pay important roes on sentiment anaysis (Ding and Liu, 2007). If the ords or phrases are connected by conjunction and, these ords i beong to the same sentiment orientation. If the ords or phrases 1371
x x Figure 1. Figure 2. Sentiment- Figure 3. Dependency-Sentiment- are connected by conjunction but, these ords or phrases i beong to different sentiment orientations. Meanhie, from the inguistic vies, the sentiment expressions are coherent. If these are no but or other adversative conjunctions in one sentence, the sentiment orientation i not change. Therefore, the assumption of sentiment independence for different ords in a document, as assumed by many previous approaches e revieed above, is not appropriate for sentiment anaysis. In this paper, e propose a joint sentiment and topic mode, caed Sentiment-, by adding a sentiment ayer to the Latent Dirichet Aocation (). Furthermore, to capture the dependency beteen the sentiments in the document, e propose a nove Dependency-Sentiment- mode by considering the inter-dependency of sentiments through a Markov chain. A major advantage is that the proposed Dependency-Sentiment- not ony anayes the goba topic and sentiment in a unified ay, but aso empoys the oca dependency among sentiments. To our best knoedge, this is the first ork to incorporate sentiment dependency into joint sentiment and topic modeing. We conduct experiments to sho that the proposed Dependency-Sentiment- approach is an effective unsupervised method to sentiment cassification The rest of the paper is organied as foos: In Section 2, e present our Sentiment- and Dependency- Sentiment-. The mode priors are described in Section 3. In Section 4, e introduce and discuss the experiment resuts. Section 5 introduces the reated ork. The concusion and future ork are presented in Section 6. 2. Dependency Sentiment Topic Mode In this section, e i introduce our to modes for sentiment cassification: Sentiment- and Dependency- Sentiment- 2.1 Sentiment- Sentiment cassification aims to cassify each opinioned document as positive or negative, based on its overa poarity. As described in previous section, the sentiment of a ord is dependent on the domain or topic. It is appropriate to consider the sentiment and topic simutaneousy. Moreover, besides the overa sentiment poarity of the document, peope may be interested in the sub topics expressed in the document. Taking a movie revie as an exampe, in addition to knoing the overa sentiment of the movie, e may interested in the music, character, or other sub-topics about this movie. In order to mode these observations, e propose a joint sentiment and topic mode, Sentiment-, hich is an expansion of Latent Dirichet Aocation (). Figure 1 shos the mode. It contains three ayers: document ayer, topic ayer and ord ayer. Sentiment-, as shon in Figure 2, is a four-ayer topic mode. We add a sentiment ayer beteen the topic ayer and the ord ayer. The sentiment ayer is associated ith topic ayer, and ords are associated ith both sentiment abes and topics. With this additiona sentiment ayer, the topic and sentiment are considered in a united ay. Sentiment- can not ony cassify the overa sentiment poarity for the document, but aso can cacuate the poarity for each topic. Assume that e have a corpus ith a coection of documents denoted by each document in the corpus is a sequence of ords denoted by, and each ord in the document is an item from a vocabuary index ith distinct terms denoted by. Aso, et be the tota number of topics, and be the number of distinct sentiment abes. In this paper, the sentiment cassification task ony considers to categories: positive or negative. So e set equa to 2, hich means that is a binary variabe. If, the corresponding ord contains more positive expression than negative, otherise,. The procedure of generating a ord in document d has three stages. Firsty, one chooses a topic from the document specific topic distribution. Fooing that, one chooses a sentiment abe from the sentiment distribution, here is chosen conditioned on the topic abe. Finay, the ord is chosen from the distribution defined by both topic and sentiment abe. 1372
The forma definition of the generative process of Sentiment- mode is as foos: 1. For each document, choose a distribution from 2. For each topic, under document, choose a distribution from 3. For each ord in document 3.1. Choose a topic from 3.2. Choose a sentiment abe from 3.3. Choose a ord from the distribution over ords defined by the topic and sentiment abe, 2.2 Dependency-Sentiment- Sentiment- is a joint sentiment and topic mode. Hoever, simiar to previous methods, this mode aso makes an independent assumption among sentiment abes. In fact, sentiments are expressed in a coherent ay. Based on the above observations, e propose a nove mode, caed Dependency-Sentiment- mode, hich drops sentiment ayer independence assumption in Sentiment-. As shon in Figure 3, the sentiments of the ords in a document form a Markov chain, here the sentiment of a ord is dependent on its previous one. The transition probabiity is reated ith, and transition variabe. The transition variabe determine here the corresponding sentiment abe comes from. If, a ne sentiment abe is dran from ith the corresponding. If, the sentiment abe of ord is identica to the previous one. If, the sentiment abe of ord is opposite to, hich shos that the sentiment changes from one poarity to the other. We hope to mode that transition variabe, hen to ords are connected by and or other reated conjunctions;, hen to ords are connected by but or other adversative conjunctions. We i sho ho to add conjunction prior to Dependency-Sentiment- in next section. The definition of the generative process is as foos: 1. For each document, choose a distribution from choose a distribution from 2. For each topic, under document, choose a distribution from 3. For each ord in document 3.1 Choose a topic from 3.2 Choose a decision vaue from 3.3 Choose a sentiment abe as the fooing rues: a) If, b) If, c) If, from 3.4 Choose a ord from the distribution over ords defined by the topic and sentiment abe, 2.3 Inference In this section, e describe the inference agorithms for Sentiment- and Dependency-Sentiment-. The Gibbs Samping is adopted here for the inference task. In order to perform Gibbs samping ith Sentiment-, e need to compute the conditiona probabiity,, here and are vectors of assignments of topics and sentiments for a the ords in the coection except for the considered ord at position in document. Due to space imit, e sho the samping formuas ithout derivation: Where is the number of times ords assigned to topic in document m. is the tota number of ords in document m. is the number of times ords assigned to topic and sentiment in document. is the number of times ord appeared in topic and sentiment. is the number of times ords assigned to topic and sentiment. the subscript denotes a quantity except for the data in position. For Dependency-Sentiment-, e need to consider the defined Markov property hen computing the conditiona probabiity. We formuate the Dependency- Sentiment- as a specia type of HMM. The ord ayer is considered as the observation, and the combination of sentiment ayer and transition variabe ayer, in condition of topic ayer, is considered as hidden variabes. Due to space imit, e just sho you the samping formuas ithout detaied derivations. When and, ega component, hich satisfies our defined Markov property, is samped as foos: Where is the number of transition variabes assigned to in document ; is the indicator function. Other formuas can be acquired in the simiar ay. With the above sampes, the mode parameters can be estimated as: (1) (2) 1373
(3) We anaye the subtopic in the document by, the sentiment poarity of subtopic by, the overa sentiment poarity of entire document is cacuated as foos 3. Defining the Prior Knoedge The prior knoedge i guide the Sentiment- and Dependency-Sentiment- modes ith hat the sentiments and transitions shoud ook ike in the data set. Tabe 1. Sentiment Lexicons Description Lexicon Neg. Pos. Description Name Sie Sie 1 HoNet 2700 2009 Engish transation of positive/negative Chinese ords 2 SentiWordNet 4800 2290 Words ith a positive or negative score above 0.6 3 MPQA 4152 2304 MPQA subjectivity exicon 4 Intersection 411 308 Words appeared in a 1, 2, and 3 5 Union 9361 5305 Words appeared in 1 or 2, or 3 3.1 Sentiment Prior The sentiment prior knoedge is obtained from sentiment exicons. We use severa types of sentiment exicons to anaye the impact of sentiment prior. Tabe 1 shos the exicons e use in our experiments. HoNet is a knoedge database in Chinese, and provides an onine Engish transation ord ist ith positive and negative tags. SentiWordNet (Esui and Sebastiani, 2006)) and MPQA (Wison et a. 2005) are aso popuar exica resources for sentiment anaysis. We aso take the union and intersection, respectivey, of the above three exicons. 3.2 Transition Prior Transition variabe prior knoedge is mainy obtained by anaying the conjunctive ords. We coect a number of conjunctions, and divide them into to types: one is caed coordinating conjunctions, such as and, hich maintain the expressed sentiment orientation; the other is adversative conjunctions, such as but, hich change the expressed sentiment orientation. 3.3 Incorporating the Prior Knoedge In the experiments, the prior knoedge is utiied during the initiaiation and iteration steps for Gibbs Samping. For sentiment prior knoedge, the initiaiation starts by (4) comparing each ord token in corpus against the ords in sentiment exicons. If there is a match, the corresponding sentiment abe is assigned to the ord token. Otherise, a sentiment abe is randomy samped for a ord token. We aso add the pseudo count for corresponding sentiment in iteration step. The utiiation of transition prior is simiar as the sentiment prior. 4. Experiments 4.1 Experiment Setup We use the Amaon product revie data set from (Biter et a, 2007). The data set contains four types of products: books, DVDs (mainy about movie), eectronics and kitchen appiances. There are totay 4000 positive and 4000 negative revies. The data set is processed as foos: the punctuation and other non-aphabet ords are removed first; then e stem a the ords; finay, the stopords are aso removed. The fina performance is measured in terms of accuracy. Since our proposed modes are unsupervised, e design sentiment exicons based method as our baseine. Given a revie, e first count the number of positive ords and the number of negative ords based on the sentiment exicons, hich are isted in Tabe 1. If the number of positive ords is arger than the number of negative ords, the revie is considered as positive, otherise as negative. We aso compare our resuts ith the resuts acquired by supervised method presented in (Denecke,2009). 4.2 Experiment Resuts Tabe 2. Experiment Resuts ith different exicons Lexicon Sentimentbased Method Sentiment prior Dependency- Sentiment- HoNet 56.3% 62.1% 66.5% SentiWordNet 57.2% 60.6% 64.2% MPQA 60.2% 65.3% 69.0% Intersection 57.1% 59.7% 65.1% Union 59.7% 62.3% 67.3% Supervised Cassification Best Accuracy: 70.7% (Denecke 2009) Tabe 2 shos the experimenta resuts ith different sentiment exicons. The performance of Sentiment- is better than the exicon based methods. The exicon based methods achieve the accuracy around 56%~60%. The accuracy for the Sentiment- is around 60%~65%. This is because the sentiment exicons are designed for genera appication, it maybe not suitabe for the product domain. With joint sentiment and topic anaysis, our Sentiment- can detect the topic and sentiment in a unified ay. Hence, it can improve the sentiment cassification accuracy. 1374
Dependency-Sentiment- is more poerfu than Sentiment-. Dependency-Sentiment- can not ony anaye the topic and sentiment simutaneousy, but aso consider the oca dependency among sentiment abes. By incorporating the sentiment dependency, the accuracy is improved by 3%~5%, hich shos the effectiveness of the sentiment dependency. The best resut is achieved ith our proposed Dependency-Sentiment-. The sentiment exicons ith arger number of entries may be potentiay better than the smaer exicons. The union of exicons achieves a better resut than the intersection of exicons. But it is not aays the case. MPQA achieves the best resut among a exicons. This may be because MPQA exicon is suitabe for this product revie set, hich shos that the domain-specific exicons are important for sentiment anaysis. Our mode is competitive as an unsupervised method. The best resut achieved by Dependency-Sentiment- is 69.0%, hich is comparabe to 70.7% achieved by supervised method (Denecke, 2009) in the same data set. Moreover, the resut 70.7% is achieved based on 10-fod cross vaidation in a test set comprising of ony 800 revies. Our resuts are achieved on the hoe product revie set ith 8000 revies. 0.75 0.65 0.55 Sentiment- Dependency-Sentiment- 1 50 100 Figure 4. Accuracy ith different topic numbers We aso anaye the infuence of topic numbers. We use MPQA as the sentiment prior. Figure 4 shos the experiment resuts. When the number of topic is set to 1, the mode is degraded into the traditiona unigram mode for topic ayer, hich ignores the topic-sentiment correation. The overa performance increases, hen mutipe topics are considered, hich aso proves that it is appropriate to anaye the topic and sentiment in a unified ay. 4.3 Exampes of Seected Topics A topic is mutinomia distribution over ords conditioned on topics and sentiments. The top ords (most probabe ords) for each distribution coud approximatey refect the meaning of the topic. Tabe 3 shos the seected exampes of goba topics extracted ith our modes. The top to ros are topics extracted ith Sentiment-. The bottom to ros are topics extracted ith Dependency-Sentiment-. Each ro shos the top 20 ords for corresponding topic. Even the topics are a about DVD-movie ith positive (Pos) and negative (Neg) poarity. We can find that the topics extracted by Dependency-Sentiment- are more cear and informative. Tabe 3. Extracted Topic Exampes: DVD-movie Sentiment- Dependency- Sentiment- Pos Neg Pos Neg star, come, don, just, mage, movie, set, kid, isn, pick, sound, ant, dvd, feature, aaken, maybe, setting, good, dome, aren movie, fim, scene, character, atch, end, pot, actor, fee, reay, bad, act, think, come, story, try, ar, guy, make, pay movie, best, great, dvd, john, action, favorite, good, ant, coection, orth, incude, dance, actor, star, better, enjoy, fim, documentary, story movie, fim, orst, bad, scene, fight, adrian, just, buy, maybe, act, dead, air, actor, say, didn, music, suck, atch, man Our proposed modes can cassify the overa sentiment of entire revie as formua (4). It aso can extract the topics expressed in the revies as formua (1), and identify the sentiment poarity of these topics as formua (3). For exampe, an eectronic product revie is abeed as negative in the data set. Our proposed mode can correcty cassify it as negative. Besides, our mode can find that the revie mainy expresses to topics (the probabiity for other topics is beo 0.1). One maybe about usage ; the other may be about sound (from the top ords for corresponding topic). The mode aso indentifies that the first topic is expressed more positive and the second topic is expressed more negative in this revie. 5. Reated Work Sentiment anaysis is the computationa study of opinions, sentiments and emotions expressed in text. One of the most studied tasks is sentiment cassification, see the tutorias (Liu, 2010; Pang and Lee, 2009). Topic mode (Bei et a, 2003; Hofmann, 1999) is a type of unsupervised method. It vies each document as mixture of topics. Currenty, these are various topic modes on sentiment anaysis. We ca this type of mode as joint sentiment and topic mode. Topic-Sentiment Mode (TSM) (Mei et a, 2007) jointy modes the mixture of topics and sentiment predictions. Hoever, TSM is essentiay from probabiistic Latent Semantic Indexing (plsi) (Hofmann, 1999), thus suffers from the probems of inference on ne document and over fitting data. Muti-Grain Latent Dirichet Aocation mode (MG-) (Titov and McDonad, 2008) is argued to be more appropriate to buid topics that are representative of ratabe aspects of objects from onine user revies, by 1375
aoing terms being generated from either a goba topic or a oca topic. But MG- is sti purey topic based ithout considering the associations beteen topics and sentiments. Joint Sentiment/Topic (JST) mode (Lin and He, 2009) detects sentiment and topic simutaneousy from text. It is a four-ayer topic mode, hich is simiar to our Sentiment-. Hoever, JST puts the sentiment ayer corresponding to document ayer, hich is hard to anaye the sentiment of sub topics. Sentiment- puts sentiment ayer corresponding to topic ayer, hich is easy to anaye the sentiment for both document and document sub topics. Moreover, a of previous joint sentiment and topic modes empoy bag of ords assumption, hich assume that, given the parameters, sentiments of ords in the document are independent. Our proposed Dependency- Sentiment- drops this assumption. We vie sentiments of a ords in the document as a Markov chain. We aso add the sentiment exicon and conjunction ords as the mode prior. Athough there are some studies on topic mode beyond bag-of-ords (Griffiths et a, 2005; Waach, 2006; Gruber et a, 2007; Jiang, 2009), as far as e kno, this is the first ork to mode the sentiment dependency in the joint sentiment and topic methods. 6. Concusions and Future Work In this paper, e investigate the sentiment dependency in joint sentiment and topic anaysis. A nove mode, caed Dependency-Sentiment-, is proposed ith extension of our joint sentiment and topic mode, Sentiment-. Dependency-Sentiment- mode reaxes the sentiment independent assumption. It not ony anayes the goba topic and sentiment in a unified ay, but aso empoys the oca dependency among sentiments. To the best of our knoedge, this is the first ork to mode sentiment dependency for topic mode. The experiment resuts demonstrate the effectiveness of our modes. In the future ork, e i expand our joint sentiment and topic modes for more compicated tasks in sentiment anaysis, such as rating cassification or sentiment summariation. Acknoedgement This ork as party supported by the Chinese Natura Science Foundation under grant No.60973104 and No. 60803075, and carried out ith the aid of a grant from the Internationa Deveopment Research Center, Ottaa, Canada. We aso thank the anonymous revieers, Qixia Jiang, Zhiei Li, Jing Jiang, Yunbo Cao and Qiang Yang for their hepfu comments. References D. M. Bei, A. Y. Ng, and M. I. Jordan. 2003. Latent dirichet aocation. J. Mach. Learn. Res., 3:993 1022. John Biter, Mark Drede, Fernando Pereira. 2007. Biographies, Boyood, Boom-boxes and Benders: Domain Adaptation for Sentiment Cassification. Association for Computationa Linguistics (ACL). Hang Cui, Vibhu Mitta and Mayur Datar. 2006. Comparative Experiments on Sentiment Cassification for Onine Product Revies. In Proceedings of the Tenty- First Nationa Conference on Artificia Inteigence (AAAI-06). Kerstin Denecke. 2009. Are SentiWordNet Scores Suited for Muti-Domain Sentiment Cassification? In 4th Internationa Conference on Digita Information Management, ICDIM. X. Ding and B. Liu. 2007. The utiity of inguistic rues in opinion mining. In Proceedings of the 30th Annua internationa ACM SIGIR Conference. T.L. Griffiths, M. Steyvers, D.M. Bei, and J.B. Tenenbaum. 2005. Integrating Topics and Syntax. In: Advances in Neura Information Processing Systems, 17 (Sau, L.K et a., eds), 537-544. Amit Gruber, Micha Rosen-Zvi and Yair Weiss. 2007. "Hidden Topic Markov Modes," in Artificia Inteigence and Statistics (AISTATS), San Juan, Puerto Rico. Thomas Hofmann. 1999. Probabiistic Latent Semantic Indexing, Proceedings of the Tenty-Second Annua Internationa SIGIR Conference (SIGIR-99). Jing Jiang. 2009. "Modeing syntactic structures of topics ith a nested HMM-" In Proceedings of ICDM. Bing Liu. 2010. Sentiment Anaysis and Subjectivity. To appear in Handbook of natura Language Processing, Second Edition. C. Lin, and Y. He. 2009. Joint Sentiment/Topic Mode for Sentiment Anaysis, The 18th ACM Conference on Information and Knoedge Management (CIKM). Q. Mei, X. Ling, M. Wondra, H. Su, and C. Zhai. 2007. Topic sentiment mixture: modeing facets and opinions in ebogs. In Proceedings of the 16th internationa conference on Word Wide Web, pages 171 180. B. Pang, L. Lee, and S. Vaithyanathan. 2002. Thumbs up?: sentiment cassification using machine earning techniques. In Proceedings of the conference on Empirica methods in natura anguage processing, pages 79 86. Bo Pang and Liian Lee. 2008. Opinion mining and sentiment anaysis. Foundations and Trends in Information Retrieva 2(1-2), pp. 1 135. I. Titov and R. McDonad. 2008. Modeing onine revies ith muti-grain topic modes. In WWW 08: Proceeding of the 17th internationa conference on Word Wide Web, pages 111 120, Ne York, NY, USA. P. D. Turney. 2002. Thumbs up or thumbs don?: semantic orientation appied to unsupervised cassification of revies. In Proceedings of the 40th Annua Meeting on Association for Computationa Linguistics. H. M. Waach. 2006. Topic modeing: beyond bag-ofords. In ICML: Proceedings of the 23rd internationa conference on Machine earning, pages 977 984. Theresa Wison, Janyce Wiebe, and Pau Hoffmann. 2005. Recogniing Contextua Poarity in Phrase-Leve Sentiment Anaysis. Proc. of HLT-EMNLP-2005. 1376