pp.152-156
http://dx.doi.org/10.14257/astl.2016.135.38

Comparison of Domain-Specific Lexicon Construction Methods for Sentiment Analysis

Myeong So Kim 1, Jong Woo Kim 2,3 and Cui Jing 4

1 Department of Mathematics, College of Natural Science, Hanyang University
2 School of Business, Hanyang University
4 Department of Business Administration, Graduate School, Hanyang University
222 Wangsimni-ro, Seongdong-gu, Seoul 133-791, Korea
detrment@naver.com, kjw@hanyang.ac.kr, choj127@naver.com
3 Corresponding Author

Abstract. Compared with general sentiment dictionaries, domain-specific sentiment dictionaries can improve the prediction accuracy of sentiment analysis. In this study, we suggest sentiment lexicon construction methods based on supervised learning. Four alternative supervised learning methods, namely SO-PMI (Semantic Oriented-Pointwise Mutual Information), conditional probability of polarity, conditional probability of words, and a simple frequency-based method, are compared using a movie review data set from IMDB (Internet Movie Data Base). The experimental results show that the sentiment dictionary constructed by the conditional probability of polarity method provides the best performance in predicting reviewers' ratings among the four methods.

Keywords: Sentiment Analysis, Sentimental Lexicon, Semantic Oriented-Pointwise Mutual Information

1 Introduction

Opinion mining, or sentiment analysis, has great potential because it can be used to predict people's emotions or attitudes toward products, brands, and public issues based on open texts in online communities, SNS, blogs, and Internet news sites. A common sentiment analysis approach is based on sentiment dictionaries, which include positive or negative terms and their polarities. In this approach, the quality and fitness of the sentiment dictionaries are critical to the performance of sentiment analysis. Even though general-purpose sentiment dictionaries such as SentiWordNet 1) exist, sentiment analysis using such dictionaries may not provide acceptable performance, because the polarities of words can differ depending on context.
For example, the word 'spooky' is usually used in negative sentences, but if it is found in a review of a horror movie, it can be a positive expression indicating that the movie realizes a horror atmosphere well. Similarly, the word 'predictable' is usually used in positive sentences, but if it is found in a review of a movie, it might have the negative meaning that the story of the movie was trite and obvious. For this reason, domain-specific sentiment dictionaries can provide better sentiment analysis performance.

In this study, we propose sentiment dictionary construction methods based on a supervised learning approach to improve the performance of sentiment analysis. To verify the proposed approach on a real-world data set, more than 80,000 movie reviews in IMDB (Internet Movie Data Base) are used as an experimental data set.

The paper is organized as follows. In the next section, related works are presented. In Section 3, four methods to construct domain-specific sentiment dictionaries are presented. Section 4 includes the experimental results. Section 5 briefly describes concluding remarks and further research issues.

1) http://sentiwordnet.isti.cnr.it/

ISSN: 2287-1233 ASTL
Copyright 2016 SERSC

2 Related works

Many works have sought better ways to evaluate the polarity of documents. This research can be classified into two categories: sentence structure-based and lexicon-based approaches.

2.1 Research Using Sentence Structure

Representative studies of the sentence structure-based approach identify conditional, comparative, and sarcastic sentences [4, 7, 9]. There is also an attempt to find the semantic orientation of implicit phrases or idioms [11].

2.2 Research Using Sentimental Lexicon

Some studies use a polarity evaluation system that considers synonyms and antonyms [3]. Another method considers syntactic words including unigrams, bigrams, and trigrams [5]. Several supervised machine learning methods have been used to classify the sentiments of IMDB reviews [2, 10]. Common ways to build lexicons use the conditional probability of words or SO-PMI [1, 6, 8].

3 Domain-specific Lexicon Building

In this research, we compare the performance of two domain-specific lexicon building approaches, classified by the method of scoring polarity values: (1) SO-PMI and (2) the term frequency-based approach. Within the term frequency-based approach, we test three polarity measures.
PMI (Pointwise Mutual Information) is a measure of the similarity of two terms, based on how often they occur together (see formula 1). SO-PMI (Semantic Orientation from Pointwise Mutual Information) is a method to determine the polarities of terms based on seed terms (see formula 2).
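The two measures described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: it assumes document-level co-occurrence probabilities, a base-2 logarithm, and a contribution of 0 for term pairs that never co-occur; the function names `pmi` and `so_pmi` and the toy corpus are hypothetical.

```python
import math
from typing import Sequence, Set


def pmi(x: str, y: str, docs: Sequence[Set[str]]) -> float:
    """Document-level PMI(x, y) = log2(P(x, y) / (P(x) * P(y)))."""
    n_docs = len(docs)
    if n_docs == 0:
        raise ValueError("docs must not be empty")
    p_x = sum(1 for d in docs if x in d) / n_docs
    p_y = sum(1 for d in docs if y in d) / n_docs
    p_xy = sum(1 for d in docs if x in d and y in d) / n_docs
    # Assumption: pairs that never co-occur contribute 0 instead of log(0).
    if p_xy == 0.0:
        return 0.0
    return math.log2(p_xy / (p_x * p_y))


def so_pmi(
    term: str,
    pos_seeds: Sequence[str],
    neg_seeds: Sequence[str],
    docs: Sequence[Set[str]],
) -> float:
    """Sum of PMI with positive seeds minus sum of PMI with negative seeds."""
    pos = sum(pmi(term, pw, docs) for pw in pos_seeds)
    neg = sum(pmi(term, nw, docs) for nw in neg_seeds)
    return pos - neg


# Toy corpus: each document is the set of distinct terms it contains.
docs = [
    {"great", "good"},
    {"great", "good", "fun"},
    {"bad", "boring"},
    {"bad", "dull"},
]

print(so_pmi("great", ["good"], ["bad"], docs))   # 1.0
print(so_pmi("boring", ["good"], ["bad"], docs))  # -1.0
```

A positive score suggests the term tends to appear with positive seed terms, and a negative score the opposite. In practice the result depends heavily on the choice of seed terms and on how zero co-occurrence counts are handled.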
$$\mathrm{PMI}(x, y) = \log \frac{P(x, y)}{P(x)\,P(y)} \tag{1}$$

$$\mathrm{SO\text{-}PMI}(x) = \sum_{i=1}^{n} \mathrm{PMI}(x, pw_i) - \sum_{i=1}^{n} \mathrm{PMI}(x, nw_i) \tag{2}$$

In formula (1), $P(x)$ is the probability that term $x$ exists in a document, and $P(x, y)$ is the probability that terms $x$ and $y$ exist in the same document. In formula (2), $pw_i$ is a positive seed term and $nw_i$ is a negative seed term.

The three possible polarity measures to determine the polarities of terms based on term frequency are (1) CPoP (Conditional Probability of Polarity), (2) CPoW (Conditional Probability of Word), and (3) SF (Simple Frequency).

$$\mathrm{CPoP}(x) = \frac{N_{pos}(x) - N_{neg}(x)}{N_{pos}(x) + N_{neg}(x)} \tag{3}$$

$$\mathrm{CPoW}(x) = \frac{N_{pos}(x)}{N_{pos}} - \frac{N_{neg}(x)}{N_{neg}} \tag{4}$$

$$\mathrm{SF}(x) = N_{pos}(x) - N_{neg}(x) \tag{5}$$

In the above formulas, $N_{pos}$ is the number of positive documents, and $N_{pos}(x)$ is the number of positive documents which include term $x$; $N_{neg}$ and $N_{neg}(x)$ are defined in the same way for negative documents.

After sorting terms by polarity score in descending order, we classify them into five categories: strong positive, positive, neutral, negative, and strong negative. Using six different lexicons (the four supervised learning methods plus the original adjective-only and original-all lexicons), the 10-scale ratings given by reviewers are predicted using the Naïve Bayesian classifier in the SentR package in R. The performance measure is MAE (Mean Absolute Error) (see formula 6).

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| r_i - \hat{r}_i \right| \tag{6}$$

In the above formula, $r_i$ is the actual rating in review $i$, $\hat{r}_i$ is the predicted rating, and $n$ is the number of reviews.

Fig. 1. Six lexicons and research procedures
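To make the term frequency-based measures and the evaluation metric concrete, the following Python sketch computes all three scores and the MAE on toy data. It is a hedged illustration rather than the authors' code. It assumes CPoP(x) = (N_pos(x) - N_neg(x)) / (N_pos(x) + N_neg(x)), CPoW(x) = N_pos(x)/N_pos - N_neg(x)/N_neg, and SF(x) = N_pos(x) - N_neg(x), where N_pos(x) counts positive documents containing x. The function names and the example documents are hypothetical.

```python
from collections import Counter
from typing import Dict, Sequence, Set


def term_frequency_scores(
    pos_docs: Sequence[Set[str]], neg_docs: Sequence[Set[str]]
) -> Dict[str, Dict[str, float]]:
    """Return CPoP, CPoW and SF for every term seen in the labeled documents."""
    n_pos, n_neg = len(pos_docs), len(neg_docs)
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative document")
    # Documents are sets, so each term is counted at most once per document.
    pos_df = Counter(term for doc in pos_docs for term in doc)
    neg_df = Counter(term for doc in neg_docs for term in doc)
    scores: Dict[str, Dict[str, float]] = {}
    for term in set(pos_df) | set(neg_df):
        p, n = pos_df[term], neg_df[term]
        scores[term] = {
            "CPoP": (p - n) / (p + n),       # assumed form of formula (3)
            "CPoW": p / n_pos - n / n_neg,   # assumed form of formula (4)
            "SF": float(p - n),              # assumed form of formula (5)
        }
    return scores


def mean_absolute_error(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """MAE = (1/n) * sum(|r_i - r_hat_i|) over n reviews."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(r - r_hat) for r, r_hat in zip(actual, predicted)) / len(actual)


# Toy labeled reviews, each reduced to its set of distinct terms.
pos_docs = [{"spooky", "great"}, {"spooky"}, {"great"}]
neg_docs = [{"predictable", "spooky"}, {"predictable"}]

scores = term_frequency_scores(pos_docs, neg_docs)
print(scores["spooky"])
print(scores["predictable"])
print(mean_absolute_error([8, 3, 10], [7, 5, 10]))  # 1.0
```

In this toy corpus 'spooky' receives positive scores under all three measures, while 'predictable' is scored negative, mirroring the movie-review examples in the introduction. CPoP and CPoW are normalized, whereas SF grows with corpus size, so SF scores are not directly comparable across data sets of different sizes.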
4 Experimental Results

The experimental results using a movie review data set from IMDB (Internet Movie Data Base) are shown in Table 1, together with pairwise t-test results. The statistical significance of the difference between each method and the original adjectives-only lexicon is tested using pairwise t-tests. Lexicons built with the supervised learning approach show better performance than the original adjectives-only lexicon. In particular, in the Action and Horror genres, lexicons built with the supervised learning approach provide significantly better performance than both the original adjectives-only and original-all lexicons. The reason may be that some terms in these two genres carry domain-specific rather than general sentiments. Also, the three lexicons using term frequency-based methods (CPoP, CPoW, and SF) show better accuracy than the one using SO-PMI. However, there is no single best-performing lexicon; the three term frequency-based lexicons (CPoP, CPoW, and SF) show similar accuracy.

Table 1. MAEs per Lexicon and Genre

| Genre | Original Adj. | SO-PMI | CPoP | CPoW | Simple Freq. | Original All | No. of Data |
|---|---|---|---|---|---|---|---|
| Action | 2.26 | 2.19 *** | 2.17 *** | 2.17 *** | 2.17 *** | 2.30 * (0.013) | 13,894 |
| Animation | 2.22 | 2.21 (0.443) | 2.15 * (0.045) | 2.14 * (0.033) | 2.15 (0.05) | 2.12 ** (0.056) | 2,870 |
| Comedy | 2.42 | 2.41 (0.296) | 2.36 ** (0.006) | 2.37 ** (0.010) | 2.36 ** (0.004) | 2.43 (0.343) | 9,126 |
| Drama | 2.50 | 2.42 *** | 2.37 *** | 2.33 *** | 2.32 *** | 2.37 *** | 9,318 |
| Horror | 2.42 | 2.33 * (0.027) | 2.24 *** | 2.35 (0.051) | 2.34 * (0.039) | 2.51 ** (0.010) | 2,449 |
| Sci-Fi | 2.60 | 2.49 *** | 2.47 *** | 2.45 *** | 2.47 *** | 2.56 (0.117) | 4,426 |
| Movie All | 2.43 | 2.36 *** | 2.33 *** | 2.35 *** | 2.34 *** | 2.40 ** (0.002) | 42,083 |

***: p<0.001, **: p<0.01, *: p<0.05; p-values are shown in parentheses.

5 Concluding Remarks

In this research, we compare four alternative lexicon construction methods based on the supervised learning approach. In an experiment using a movie review data set from IMDB (Internet Movie Data Base), lexicons using supervised learning show better performance than general-purpose lexicons.
In particular, lexicons using term frequency-based methods show better accuracy than the lexicon using SO-PMI (Semantic Oriented-Pointwise Mutual Information), in addition to advantages in simplicity and stability. However, there is no single best lexicon; the three term frequency-based lexicons show similar accuracy.
Acknowledgments. This work was supported by the project "Valuation and Socio-economic Validity Analysis of Nuclear Power Plant In Low Carbon Energy Development Era" of the Korea Institute of Energy Technology Evaluation and Planning (KETEP), granted financial resources from the Ministry of Trade, Industry & Energy, Republic of Korea (No. 20131520000040).

References

1. Song, S. I., Lee, D. J., Lee, S. G.: Identifying Sentiment Polarity of Korean Vocabulary Using PMI. In: Proceedings of Korean Computer Congress 2010, pp. 260--265. (2010)
2. Pang, B., Lee, L., Vaithyanathan, S.: Thumbs up? Sentiment Classification Using Machine Learning Techniques. In: Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing, pp. 79--86. (2002)
3. Hu, M., Liu, B.: Mining and Summarizing Customer Reviews. In: Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 168--177. (2004)
4. Jindal, N., Liu, B.: Mining Comparative Sentences and Relations. In: Proceedings of the 21st National Conference on Artificial Intelligence, pp. 1331--1336. (2006)
5. Dave, K., Lawrence, S., Pennock, D. M.: Mining the Peanut Gallery: Opinion Extraction and Semantic Classification of Product Reviews. In: Proceedings of the 12th International Conference on World Wide Web, pp. 519--528. (2003)
6. Baroni, M., Vegnaduzzo, S.: Identifying Subjective Adjectives through Web-based Mutual Information. In: Proceedings of KONVENS, pp. 17--24. (2004)
7. Narayanan, R., Liu, B., Choudhary, A.: Sentiment Analysis of Conditional Sentences. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 180--189. (2009)
8. Turney, P., Littman, M.: Measuring Praise and Criticism: Inference of Semantic Orientation from Association. In: Proceedings of ACL-02, 40th Annual Meeting of the Association for Computational Linguistics, pp. 417--424. (2002)
9. Tsur, O., Davidov, D., Rappoport, A.: Semi-Supervised Recognition of Sarcastic Sentences in Online Product Reviews. In: Proceedings of the International AAAI Conference on Weblogs and Social Media, pp. 162--169. (2010)
10. Jian, Z., Chen, X., Wang, H. S.: Sentiment Classification Using the Theory of ANNs. The Journal of China Universities of Posts and Telecommunications, 17, 58--62 (2010)
11. Ding, X., Liu, B., Yu, P. S.: A Holistic Lexicon-based Approach to Opinion Mining. In: Proceedings of the 2008 International Conference on Web Search and Data Mining, pp. 231--240. (2008)