2012 International Conference on Industrial and Intelligent Information (ICIII 2012)
IPCSIT vol.3 (2012) © (2012) IACSIT Press, Singapore

Extracting Similar and Opposite News Websites Based on Sentiment Analysis

Jiawei Zhang 1, Yukiko Kawai 1 and Tadahiko Kumamoto 2
1 Kyoto Sangyo University
2 Chiba Institute of Technology

Abstract. With the spread of online news websites, people can browse and retrieve news articles more easily. However, for a contentious news topic, different news websites may have different sentiment tendencies, and these tendencies may vary over time. To capture this feature, we construct a sentiment dictionary and develop a system that can extract the sentiments of news articles, visually present the sentiment variation over time inside a news website, and compare the sentiment correlation between news websites. In particular, the system adopts three-dimensional sentiments, which are more suitable for the analysis of news articles than the conventional positive-negative sentiments. The experimental evaluation shows that the extracted sentiments differ from human judgments by about one level on a seven-level scale, and the observation results show that the sentiment comparison is effective.

Keywords: sentiment analysis, news analysis, correlation analysis.

1. Introduction

Recently, an increasing number of portal news websites, such as Google News and Yahoo! News, have been designed to collect and integrate similar news articles from various news websites. These portal websites provide news browsing, keyword search, and various personalized services. People can thus browse and retrieve news articles more easily.

For some domains, such as politics and economy, contentious issues continuously arise in news articles. For a contentious news topic, different news websites may have similar or opposite sentiment tendencies. Moreover, one news website may consistently maintain the same sentiment, whereas another may show varying sentiments over time. Extracting and presenting this kind of background knowledge helps news readers obtain impartial information.

We construct a sentiment dictionary that stores words and their sentiment values.
Based on our previous research [1], three dimensions (Happy⇔Sad, Glad⇔Angry, and Peaceful⇔Strained), which have been shown to be suitable for news articles, are adopted. Using the constructed sentiment dictionary, we develop a system that can detect and visualize the sentiments of news articles and news websites. The system provides the following functions:

1. Given a news article (target article), the system can extract its topic and its sentiment.
2. The system can identify the news website (target website) of the target article, and present the sentiment variation over time inside the target website related to the topic.
3. The system can calculate the sentiment correlation between the target website and other websites, and consequently extract the websites whose sentiment tendencies are similar or dissimilar to those of the target website.
Table 1. A sample of sentiment dictionary
Table 2. Original sentiment words

The rest of this paper is structured as follows. Section 2 describes the construction of the sentiment dictionary. Section 3 and Section 4 describe the offline processing and the online processing of the system, respectively. Section 5 evaluates the accuracy of sentiment extraction and shows the prototype of the system. Section 6 reviews related work. Finally, Section 7 concludes the paper and discusses future work.

2. Sentiment dictionary construction

We construct the sentiment dictionary, in which each entry indicates the correspondence between a target word and its sentiment values on the three dimensions. A sample of the sentiment dictionary is shown in Table 1. A sentiment value $s(w)$ of a word $w$ on each dimension is a value between 0 and 1. Values close to 1 mean that the word's sentiment is close to Happy, Glad, or Peaceful, while values close to 0 mean that the word's sentiment is close to Sad, Angry, or Strained. For example, the sentiment value of the word "prize" on Happy⇔Sad is 0.862, which means that the word "prize" conveys a Happy sentiment. The sentiment value of the word "deception" on Glad⇔Angry is 0.075, which means that "deception" conveys an Angry sentiment.

For each of the three dimensions, we set two opposite sets ($OW_L$ and $OW_R$) of original sentiment words (Table 2). The basic idea of the sentiment dictionary construction is that a word expressing a left sentiment on a dimension often occurs with the dimension's $OW_L$, but rarely occurs with its $OW_R$. For example, the word "prize", expressing the sentiment Happy, often occurs with the words "Happy", "Enjoy", etc., but rarely occurs with the words "Sad", "Grieve", etc. We compare the co-occurrence of each target word with the two sets of original sentiment words for each dimension by analyzing the news articles published by a Japanese newspaper, YOMIURI ONLINE, during 2002-2006.

First, for each dimension, we extract the set $S$ of news articles including one or more original sentiment words in $OW_L$ or $OW_R$. Then, for each news article, we count the numbers of the words that are included in $OW_L$ and in $OW_R$.
The news articles in which more words are included in $OW_L$ than in $OW_R$ constitute the set $S_L$. Conversely, the news articles in which more words are included in $OW_R$ than in $OW_L$ constitute the set $S_R$. $N_L$ and $N_R$ represent the numbers of news articles in $S_L$ and $S_R$, respectively. For each word $w$ occurring in the set $S$, we count the number of news articles including $w$ in $S_L$ and denote it $N_L(w)$. Similarly, we count the number of news articles including $w$ in $S_R$ and denote it $N_R(w)$. The conditional probabilities are

$$P_L(w) = \frac{N_L(w)}{N_L}, \qquad P_R(w) = \frac{N_R(w)}{N_R}.$$

A sentiment value $s(w)$ of a word $w$ is calculated as follows:

$$s(w) = \frac{P_L(w) \cdot weight_L}{P_L(w) \cdot weight_L + P_R(w) \cdot weight_R}, \qquad weight_L = \log_{10} N_L, \quad weight_R = \log_{10} N_R.$$
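The dictionary construction for a single dimension can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the function name `build_sentiment_dictionary`, the pre-tokenized input, and the handling of ties are assumptions, and the weights are taken to be $weight_L = \log_{10} N_L$ and $weight_R = \log_{10} N_R$.

```python
import math
from collections import Counter


def build_sentiment_dictionary(articles, ow_left, ow_right):
    """Return {word: s(w)} for one dimension.

    articles : list of token lists (tokenization is assumed to be done already)
    ow_left  : set of original sentiment words for the left pole (e.g. Happy)
    ow_right : set of original sentiment words for the right pole (e.g. Sad)
    """
    left_docs, right_docs = [], []
    for tokens in articles:
        n_left = sum(1 for t in tokens if t in ow_left)
        n_right = sum(1 for t in tokens if t in ow_right)
        if n_left > n_right:
            left_docs.append(set(tokens))    # article belongs to S_L
        elif n_right > n_left:
            right_docs.append(set(tokens))   # article belongs to S_R
        # Assumption: ties (including articles without any original
        # sentiment word) belong to neither S_L nor S_R.

    n_l, n_r = len(left_docs), len(right_docs)
    if n_l == 0 or n_r == 0:
        raise ValueError("both S_L and S_R must be non-empty")

    # N_L(w), N_R(w): number of articles in S_L / S_R that contain w.
    count_left = Counter(w for doc in left_docs for w in doc)
    count_right = Counter(w for doc in right_docs for w in doc)

    weight_l, weight_r = math.log10(n_l), math.log10(n_r)

    dictionary = {}
    for w in set(count_left) | set(count_right):
        p_l = count_left[w] / n_l
        p_r = count_right[w] / n_r
        denominator = p_l * weight_l + p_r * weight_r
        if denominator > 0:                  # skip words with an undefined value
            dictionary[w] = p_l * weight_l / denominator
    return dictionary
```

With ten articles on each side, both weights equal 1, so a word that appears in nine of the ten left-side articles and one of the ten right-side articles receives a value of about 0.9, close to the left pole.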
3. System's offline processing

We implement our own news crawler for collecting news articles. News articles are crawled every day from 25 specified news websites (15 newspapers published in Japan and 10 Japanese-language versions of newspapers published in other countries). Then, the articles are morphologically analyzed to extract proper nouns, general nouns, adjectives, and verbs. The tf-idf value of each extracted word in a news article is calculated. The sentiment value of a news article is calculated by looking up the sentiment values of the words extracted from it in the sentiment dictionary and averaging them. A news article thus obtains a sentiment value ranging from 0 to 1. For comprehensibility and symmetry, the calculation value is further converted to a value ranging from -3 to 3 by the formula

$$\text{conversion value} = 6 \times \text{calculation value} - 3.$$

When the calculation values are 1, 0.5, and 0, the corresponding conversion values are 3, 0, and -3. The conversion values 3, 2, 1, 0, -1, -2, -3 on a dimension, e.g., Happy⇔Sad, correspond to "Happy", "Relatively happy", "A little happy", "Neutral", "A little sad", "Relatively sad", and "Sad", respectively.

The above processing is done offline. As a result, the collected news articles, the tf-idf values of the words extracted from the news articles, and the sentiment values of the news articles are stored in a database.

4. System's online processing

4.1. Extracting the topic and the sentiment of the target article

Given a news article, the system first extracts keywords and sub-keywords representing the article's topic. The keywords are the top five words with the highest tf-idf values extracted from the target article. The sub-keywords are the top five words with the highest sums of tf-idf values in the related articles that include any of the five keywords. Both the five keywords and the five sub-keywords are presented to the user. The user selects the words that represent the topic he or she is concerned about.
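The per-article steps described so far (averaging dictionary values, converting to the -3 to 3 scale, and selecting keywords and sub-keywords) can be sketched as follows. All function names are hypothetical, and several details are assumptions not stated in the text: words missing from the dictionary are skipped, every occurrence in the word list counts in the average, tf-idf values are given as precomputed `{word: value}` mappings, and the keywords themselves are excluded from the sub-keywords.

```python
from collections import defaultdict


def article_sentiment(words, dictionary):
    """Average the dictionary values of an article's words (range 0 to 1)."""
    values = [dictionary[w] for w in words if w in dictionary]
    if not values:
        return None  # no word of the article is in the dictionary
    return sum(values) / len(values)


def to_conversion_value(calculation_value):
    """Map a calculation value in [0, 1] to a conversion value in [-3, 3]."""
    return 6.0 * calculation_value - 3.0


def top_keywords(tfidf, k=5):
    """Return the k words with the highest tf-idf values.

    Ties are broken alphabetically (an assumption).
    """
    ranked = sorted(tfidf.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:k]]


def sub_keywords(keywords, related_tfidf, k=5):
    """Return the k words with the highest summed tf-idf values over the
    articles that contain at least one of the keywords."""
    keyword_set = set(keywords)
    totals = defaultdict(float)
    for tfidf in related_tfidf:
        if any(kw in tfidf for kw in keyword_set):
            for word, value in tfidf.items():
                if word not in keyword_set:  # assumption: keywords excluded
                    totals[word] += value
    return top_keywords(totals, k)
```

For instance, `to_conversion_value` maps the calculation values 1, 0.5, and 0 to 3, 0, and -3, matching the correspondence given in Section 3.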
The selected words are later used to retrieve past articles for analyzing the sentiment tendencies of news websites related to the topic of concern. The sentiment value of the target article is also calculated by using the sentiment dictionary and converted to a conversion value. The conversion values of the target article on the three dimensions are also presented to the user.

4.2. Presenting sentiment variation inside the target website

The news website of the target article is identified by analyzing the URL of the article. Figure 1 is an example of the sentiment variation over time inside the target website. The news articles in the target website that include the user-selected words are traced back at regular intervals $t_i$ (e.g., one day or two days). At each interval $t_i$, the articles in which the tf-idf values of the user-selected words are larger than a threshold $\tau_0$ are extracted. Their sentiment values are calculated, converted, and averaged to give the sentiment value $s(t_i)$ of the target website at the interval $t_i$. The solid horizontal line represents the mean of the sentiment values, and the dotted horizontal lines represent the standard deviation of the sentiment values. By browsing the sentiment variation inside the target website, users can perceive whether the sentiment of the current article (the red point) is consistent with the past sentiments of the target website.

4.3. Presenting sentiment correlation between websites

Another function of the system is to show the correlation of sentiment tendencies between the target website and its counterpart websites (Figure 2). Let $s_X(t_i)$ and $s_Y(t_i)$ be the sentiment values of two websites $X$ and $Y$ at the interval $t_i$, respectively. We calculate their correlation coefficient $\rho(X, Y)$ as follows:

$$\rho(X, Y) = \frac{\sum_{i=1}^{n} \left(s_X(t_i) - \bar{s}_X\right)\left(s_Y(t_i) - \bar{s}_Y\right)}{\sqrt{\sum_{i=1}^{n} \left(s_X(t_i) - \bar{s}_X\right)^2}\,\sqrt{\sum_{i=1}^{n} \left(s_Y(t_i) - \bar{s}_Y\right)^2}}, \qquad \bar{s}_X = \frac{1}{n}\sum_{i=1}^{n} s_X(t_i), \quad \bar{s}_Y = \frac{1}{n}\sum_{i=1}^{n} s_Y(t_i),$$

where $n$ is the number of intervals.
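As a rough illustration, the correlation coefficient between two websites' sentiment series can be computed as below. The function name `sentiment_correlation` is hypothetical, and the sketch assumes that both websites have a sentiment value at every interval, which the text does not guarantee.

```python
import math


def sentiment_correlation(s_x, s_y):
    """Correlation coefficient between two equally long sentiment series.

    s_x[i] and s_y[i] are the sentiment values of websites X and Y
    at the same interval t_i.
    """
    if len(s_x) != len(s_y) or len(s_x) < 2:
        raise ValueError("series must have the same length (at least 2)")
    n = len(s_x)
    mean_x = sum(s_x) / n
    mean_y = sum(s_y) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(s_x, s_y))
    spread_x = sum((x - mean_x) ** 2 for x in s_x)
    spread_y = sum((y - mean_y) ** 2 for y in s_y)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return covariance / math.sqrt(spread_x * spread_y)
```

Two series that rise and fall together give a value near 1, and two series that move in opposite directions give a value near -1.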
Figure 1. Comparison inside a website
Figure 2. Comparison between websites

Figure 2 shows an extreme example, in which $\rho(A, B)$ is 1 (direct correlation) and $\rho(A, C)$ is -1 (inverse correlation). Based on the calculated correlation coefficients, the system extracts sentiment-similar websites by selecting those whose $\rho$ with the target website is larger than a threshold $\tau_1$ (e.g., 0.5), and sentiment-dissimilar websites by selecting those whose $\rho$ with the target website is smaller than another threshold $\tau_2$ (e.g., -0.5).

5. Evaluation and observation

5.1. Evaluation on sentiment extraction accuracy

We specify 10 news domains (Society, Sports, Economy, Synthesis, Politics, Overseas, Life, Entertainment, Culture, and Science) and pick 10 news articles from each domain. As a result, 100 news articles are selected for evaluating the sentiment extraction accuracy. 5 testees are asked to read each of the 100 news articles and evaluate how intensely they feel the sentiments on the three dimensions. Each testee evaluates the sentiment intensity by giving an integer from -3 to 3. For example, for the dimension Happy⇔Sad, 3, 2, 1, 0, -1, -2, -3 represent "Happy", "Relatively happy", "A little happy", "Neutral", "A little sad", "Relatively sad", and "Sad", respectively. The evaluation values from the 5 testees for each article and each dimension are averaged as the mean value of the article's sentiment on that dimension.

For each of the 100 news articles, the conversion value of the sentiment on each dimension is also calculated by using the sentiment dictionary and the conversion formula. We compare the conversion values (the computer's output) of the 100 articles with the mean values (the testees' evaluation). The average errors between the conversion values and the mean values on Happy⇔Sad, Glad⇔Angry, and Peaceful⇔Strained are 0.748, 0.746, and .28, respectively. Considering that the sentiment values have seven levels ranging from -3 to 3, an error of about one level on each dimension indicates that the extracted sentiments are reasonably close to human judgments.

5.2. Observation on sentiment tendencies

We implement a prototype that extracts the sentiment of a news article, the sentiment variation inside a website, and the correlation of sentiment tendencies between different websites. As an example, suppose that a user is browsing a news article reporting a draft tax increase for the revival after the earthquake in Japan. The system first extracts the keywords and the sub-keywords representing the topic, as well as the sentiment of the news article (Figure 3). After the user selects the words of concern (e.g., "tax increase" and "revival"), the system identifies the website of the article as NIKKEI and analyzes the sentiment variation related to "tax increase for the revival" inside the website NIKKEI (Figure 4). The overall sentiment tendency about the topic in this website is relatively Sad, and the sentiment of the current target article (the red point) stays within the sentiment variation range of the website.

Figure 5 shows that, for the target website NIKKEI, the system detects the sentiment-similar website Asahi and the sentiment-dissimilar website Mainichi on Happy⇔Sad related to the topic "tax increase for the revival". The graph shows the comparison results of sentiment correlation between those websites. The green line represents the sentiment tendency of the target website NIKKEI, the blue line represents the sentiment tendency of its sentiment-similar website Asahi, and the red line represents the sentiment tendency of its sentiment-dissimilar website Mainichi. From the graph, the user can clearly observe that Asahi has tendencies similar to those of NIKKEI, while Mainichi has opposite tendencies.

Figure 3. Snapshot of topic and sentiment extraction
Figure 4. Snapshot of sentiment variation inside a website
Figure 5. Snapshot of sentiment correlation between websites

6. Related work
Sentiment analysis [2, 3] is increasingly important in many research areas. Turney [4] proposed a method for classifying reviews into two categories, recommended and not recommended, based on mutual information. Pang et al. [5] extracted only the subjective portions of movie reviews and classified them as "thumbs up" or "thumbs down" by applying text-categorization techniques. However, these methods only consider positive-negative sentiment. Unlike these methods, our system captures more detailed sentiments on three dimensions suitable for news articles.

Several other sentiment models also exist. Plutchik [6] designed a four-dimension model: Joy⇔Sadness, Acceptance⇔Disgust, Anticipation⇔Surprise, and Fear⇔Anger. Russell [7] proposed a two-dimensional space in which the horizontal dimension was pleasure-displeasure and the vertical dimension was arousal-sleep. The remaining four variables (excitement, depression, contentment, and distress) were combinations of these two and did not form independent dimensions. We adopt three-dimensional sentiments for news articles.

Park et al. [8] proposed an aspect-level news browsing system that aimed to mitigate news bias. Unlike their work, which only deals with recent news articles, our system also analyzes past articles to extract the sentiment tendencies of websites over time.

7. Conclusion and future work

We described a system for extracting a news article's sentiment, finding the sentiment variation inside a news website, and comparing sentiment tendencies between different websites. Our implementation enables users to obtain visual comparison results.

We plan to construct a model for evaluating the credibility of news articles based on the sentiment comparison results described in this paper. One idea is that if news websites with different sentiment tendencies come to hold the same sentiment, the information may be credible. Developing a method for calculating credibility scores of news articles is one of our future challenges.

8. Acknowledgements

This work was supported in part by SCOPE (Ministry of Internal Affairs and Communications, Japan) and by the MEXT Grant-in-Aid for Young Scientists (B) (#270020, Representative: Yukiko Kawai).

9. References

[1] T. Kumamoto, Design of Impression Scales for Assessing Impressions of News Articles, In SNSMW 2010, pp. 285-295, 2010.
[2] B. Pang and L. Lee, Opinion Mining and Sentiment Analysis, Foundations and Trends in Information Retrieval, Vol. 2, Nos. 1-2, pp. 1-135, 2007.
[3] A. Wright, Our Sentiments, Exactly, Communications of the ACM, Vol. 52, No. 4, pp. 14-15, 2009.
[4] P. D. Turney, Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification of Reviews, In ACL 2002, pp. 417-424, 2002.
[5] B. Pang and L. Lee, A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts, In ACL 2004, pp. 271-278, 2004.
[6] R. Plutchik, The Emotions, Univ Pr of Amer, 1991.
[7] J. A. Russell, A Circumplex Model of Affect, Journal of Personality and Social Psychology, Vol. 39, No. 6, pp. 1161-1178, 1980.
[8] S. Park, S. Lee and J. Song, Aspect-level News Browsing: Understanding News Events from Multiple Viewpoints, In IUI 2010, pp. 41-50, 2010.