An IG-RS-SVM classifier for analyzing reviews of E-commerce product

International Conference on Information Technology and Management Innovation (ICITMI 2015)
2015. The authors - Published by Atlantis Press

Jaju Ye a, Hua Ren b and Hangxia Zhou c *
College of Information Engineering, China Jiliang University, Hangzhou 310018, China
a @qq.com, b @qq.com, c zhx@cjlu.edu.cn
* Corresponding author

Keywords: e-commerce; feature selection; ensemble learning; support vector machine

Abstract. Analyzing reviews of E-commerce products is a text classification task, which belongs to supervised learning. Because of the huge number of words, the high-dimensional feature space is a serious problem in text classification. To solve it, a new algorithm, IG-RS-SVM, is proposed. Information Gain (IG) is a feature selection algorithm that can reduce the dimension of the feature space. Random subspace, an ensemble learning algorithm, divides the feature space into smaller ones, each submitted to a base classifier such as a Support Vector Machine (SVM). Experiments show that the IG-RS-SVM algorithm can effectively improve text classification accuracy.

Introduction

Analyzing reviews of E-commerce products is a form of text sentiment analysis. By collecting consumers' reviews after they purchase E-commerce products and analyzing their emotions, moods and attitudes, we can help other consumers decide whether to buy and help quality supervision departments find quality problems as soon as possible, which facilitates supervision and spot checks. Currently, there are two ways to analyze the reviews of E-commerce products [1]. One is based on emotional knowledge, and the other is based on data mining. Using natural language and some existing dictionaries, the first method makes a decision on the comments directly. This method not only needs a huge emotional dictionary, but also cannot judge the emotional tendency accurately because of the complexity of Chinese. The second method uses data mining algorithms for text classification.

Text Classification

Original text is unstructured data that a computer cannot understand, so it must be converted to structured data. Text segmentation is a major step in the pretreatment. It transforms text information into structured data and deletes a large amount of redundant content (including punctuation marks, stop words, repeated content and so on). English text segmentation is relatively simple because it only needs to split on spaces and punctuation. Chinese text, however, needs a dedicated segmentation algorithm. For example, the SCWS Chinese segmentation system [2] and the ICTCLAS Chinese segmentation system are often used at present.

The text after pretreatment is still not fully structured, so it needs a mathematical model to represent it. The most commonly used model of text feature representation is the vector space model (VSM). In that model, a text is represented as a vector in a space spanned by a set of orthogonal vectors [3].

A data mining algorithm is then used to classify the structured data. Common data mining methods for text classification are the Bayes classifier, support vector machine (SVM), decision tree and so on. Ensemble learning, which uses some simple classification algorithms to obtain a number of different learning machines and then combines them into an integrated learning machine, can effectively improve classification performance. Ensemble learning algorithms are widely used in image processing, biomedicine, control engineering and other related fields. There is some research on text classification using ensemble learning algorithms. Literature [4] used a Bagging algorithm with attribute selection. This algorithm can only evaluate the contribution to the classification of a part of the attributes, but cannot evaluate the contribution of a single attribute. In order to improve the accuracy of text classification in high dimensions, literature [5] put forward the RS-SVM algorithm, but because it did not consider the dimension problem itself when choosing the feature subspaces, it was unable to filter out features that were redundant or made no contribution. Aiming at the disadvantages of the above algorithms, and considering both single-feature contributions to text classification and dimensionality reduction in high dimensions, this paper puts forward the IG-RS-SVM algorithm.

IG-RS-SVM Algorithm

Information Gain. The VSM usually has a high dimension, which can reach tens of thousands or even more, and most of the features are redundant or irrelevant. Redundant features may cause a decline in classifier performance and affect the efficiency of data mining by analysts. Feature selection is a good way to reduce the dimension of the VSM, which improves classification accuracy and reduces computational complexity. Common methods of feature selection are document frequency (DF), information gain (IG), mutual information (MI), chi-square (CHI) and so on [6]. IG was shown to be one of the better methods in a comparison of five feature selection algorithms [7]. The amount by which the entropy of the class decreases after observing a certain feature reflects the additional information about the class that the feature provides [8], and is called Information Gain. The formula is as follows:

$$
IG(W) = H(C) - H(C|W) = -\sum_{i} P(C_i)\log P(C_i) + P(\omega)\sum_{i} P(C_i|\omega)\log P(C_i|\omega) + P(\bar{\omega})\sum_{i} P(C_i|\bar{\omega})\log P(C_i|\bar{\omega}) \tag{1}
$$

C represents the text category. W represents the text feature, $W \in \{\omega, \bar{\omega}\}$. $P(C_i)$ represents the probability that the text belongs to $C_i$. $P(\omega)$ represents the probability that W appears and $P(\bar{\omega})$ represents the probability that W does not appear. $P(C_i|\omega)$ represents the probability that the text belongs to $C_i$ when W appears, and $P(C_i|\bar{\omega})$ represents the probability that the text belongs to $C_i$ when W does not appear.

Random Subspace. However, the dimensionality of the feature space can still be a few thousand even after feature selection. Fig. 1 shows the structure of Random Subspace. In Random Subspace, after the original feature space is divided into feature subspaces, each subset is submitted to a base classifier in the ensemble [9]. The results of the base classifiers are then combined, and the final result is obtained by a majority vote.

Fig. 1. The structure of random subspace (Original Text Feature Dataset -> Feature Subset 1 ... Feature Subset n -> Base Classifier 1 ... Base Classifier n -> Combiner)
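As a rough illustration of Eq. (1), the following Python sketch computes the information gain of a single term, treating "the term occurs in the document" as the binary feature W. The input format (one token set per document), the function names and the use of base-2 logarithms are assumptions made for this example; they are not taken from the paper.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy H(C) of a list of class labels, in bits."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(docs, labels, term):
    """IG(W) = H(C) - H(C|W) for the binary feature 'term occurs in the document'.

    docs   -- list of token sets, one per document (hypothetical input format)
    labels -- list of class labels, aligned with docs
    """
    with_term = [c for d, c in zip(docs, labels) if term in d]
    without_term = [c for d, c in zip(docs, labels) if term not in d]
    p_w = len(with_term) / len(docs)          # P(omega): the term appears
    p_not_w = len(without_term) / len(docs)   # P(omega-bar): the term is absent
    conditional = p_w * entropy(with_term) + p_not_w * entropy(without_term)
    return entropy(labels) - conditional


# Toy corpus: "great" separates the classes, "phone" carries no class information.
docs = [{"great", "phone"}, {"great", "price"}, {"bad", "phone"}, {"bad", "price"}]
labels = ["pos", "pos", "neg", "neg"]
ig_great = information_gain(docs, labels, "great")
ig_phone = information_gain(docs, labels, "phone")
```

On this toy corpus the perfectly discriminating term receives the full class entropy as its gain, while the uninformative term receives zero, which is the kind of feature the IG filter would discard.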

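The Random Subspace structure described above can be sketched in a few lines of Python. To keep the example self-contained, a nearest-centroid classifier stands in for the SVM base learner used in the paper; that substitution, the class and function names, and the default values of `n_classifiers` and `rate` are all assumptions of this sketch.

```python
import random
from collections import Counter


class CentroidClassifier:
    """Stand-in base learner (nearest class centroid); the paper uses SVM here."""

    def fit(self, X, y):
        sums, counts = {}, Counter(y)
        for row, label in zip(X, y):
            acc = sums.setdefault(label, [0.0] * len(row))
            for k, value in enumerate(row):
                acc[k] += value
        self.centroids = {c: [s / counts[c] for s in acc] for c, acc in sums.items()}
        return self

    def predict_one(self, row):
        def sq_dist(center):
            return sum((a - b) ** 2 for a, b in zip(row, center))
        return min(self.centroids, key=lambda c: sq_dist(self.centroids[c]))


def fit_random_subspace(X, y, n_classifiers=5, rate=0.7, seed=0):
    """Train one base classifier per randomly drawn feature subset."""
    rng = random.Random(seed)
    n_features = len(X[0])
    k = max(1, round(rate * n_features))   # subset size from the subspace rate
    ensemble = []
    for _ in range(n_classifiers):
        subset = sorted(rng.sample(range(n_features), k))
        projected = [[row[j] for j in subset] for row in X]
        ensemble.append((subset, CentroidClassifier().fit(projected, y)))
    return ensemble


def predict_majority(ensemble, row):
    """Combine the base classifiers' outputs by majority vote."""
    votes = [clf.predict_one([row[j] for j in subset]) for subset, clf in ensemble]
    return Counter(votes).most_common(1)[0][0]


# Toy data: features 0 and 3 carry the class signal, features 1 and 2 do not.
X = [[1, 1, 0, 0], [1, 0, 1, 0], [0, 0, 1, 1], [0, 1, 0, 1]]
y = ["pos", "pos", "neg", "neg"]
ensemble = fit_random_subspace(X, y, n_classifiers=5, rate=0.5)
prediction = predict_majority(ensemble, [1, 1, 1, 0])
```

Each base classifier sees only its own projection of the data, so the combiner has to remember which feature indices belong to which classifier; storing the `(subset, classifier)` pairs together is one simple way to do that.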
Support Vector Machines. The Support Vector Machine (SVM) is a machine learning method proposed by Vapnik et al. [10]. It has become a hotspot of machine learning because of its excellent learning performance on linear, nonlinear and high-dimensional pattern recognition, and it can also be applied to function fitting and other machine learning problems [11-12]. The principle of SVM is to find a hyperplane that maximizes the distance between the hyperplane and the closest points of the two classes. SVM deals with nonlinear problems by introducing a feature transform that turns the nonlinear problem in the original space into a linear problem in a new space, such as $(x_i \cdot x_j) \rightarrow (\varphi(x_i) \cdot \varphi(x_j))$, which can be written as a kernel function $K(x_i, x_j) = (\varphi(x_i) \cdot \varphi(x_j))$. So the final decision function is:

$$
f(x) = \mathrm{sgn}\Big\{ \sum_{j} \lambda_j y_j K(x_j, x) + b \Big\} \tag{2}
$$

Using different kernel functions gives different forms of the nonlinear support vector machine. Three types of kernel function are in common use:

Linear Kernel: $K(x, x_i) = (x \cdot x_i)$;
Polynomial Kernel: $K(x, x_i) = [(x \cdot x_i) + 1]^q$;
RBF Kernel: $K(x, x_i) = \exp\left(-\dfrac{|x - x_i|^2}{\sigma^2}\right)$.

IG-RS-SVM Algorithm. A text after pretreatment can be expressed as $d = \{t_1, t_2, t_3, \ldots, t_n\}$ (t represents a feature, n represents the number of features) together with the category c to which it belongs. So the text dataset can be represented by $D = \{(d_1, c_1), (d_2, c_2), (d_3, c_3), \ldots, (d_m, c_m)\}$, where m is the number of texts in the dataset. The IG-RS-SVM algorithm is described as follows:

Input: text dataset $D = \{(d_1, c_1), (d_2, c_2), (d_3, c_3), \ldots, (d_m, c_m)\}$ and m.
Output: classification result $F(d)$, $F(d) \in C$
Step 1: transforming the text dataset into a VSM;
Step 2: calculating the information gain of each feature in the VSM, then putting the features into a feature set;
Step 3: sorting the feature set and deleting the features whose value is 0;
Step 4: rebuilding a new VSM according to the new feature set;
Step 5: choosing the number of SVM classifiers;
Step 6: for each SVM classifier, randomly generating feature subspace samples from the new feature set;
Step 7: classifying the subspace samples with the SVM classifier;
Step 8: combining the results of the SVM classifiers and outputting the result by a majority vote or other combination.

IG-RS-SVM Algorithm

Precision, Recall and F-measure are commonly used as evaluation indicators in the field of text classification. They are computed as follows:

$$
\text{Precision} = \frac{TP}{TP + FP}, \quad \text{Recall} = \frac{TP}{TP + FN}, \quad \text{F-Measure} = \frac{2 \times \text{Recall} \times \text{Precision}}{\text{Recall} + \text{Precision}} \tag{3}
$$

For binary classification, TP refers to the true positives, where the predicted result and the actual result are both true; FP refers to the false positives, where the predicted result is true but the actual result is false; FN refers to the false negatives, where the predicted result is false but the actual result is true. Precision is the ratio of correctly predicted positive samples to all samples predicted as positive. Recall is the ratio of correctly predicted positive samples to all actual positive samples. F-measure is the harmonic mean of Precision and Recall.
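The three indicators in Eq. (3) can be computed directly from the TP, FP and FN counts. The sketch below is a minimal Python version; the function name, the `positive` label argument and the convention of returning 0.0 when a denominator is zero are assumptions of this example.

```python
def precision_recall_f(actual, predicted, positive="pos"):
    """Return (Precision, Recall, F-measure) for one positive class, as in Eq. (3)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if p == positive and a == positive)
    fp = sum(1 for a, p in pairs if p == positive and a != positive)
    fn = sum(1 for a, p in pairs if p != positive and a == positive)
    # Guard against empty denominators (assumed convention: report 0.0).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * recall * precision / (recall + precision)
                 if recall + precision else 0.0)
    return precision, recall, f_measure


# Toy labels: 2 true positives, 1 false positive, 1 false negative.
actual    = ["pos", "pos", "pos", "neg", "neg", "neg"]
predicted = ["pos", "pos", "neg", "pos", "neg", "neg"]
precision, recall, f_measure = precision_recall_f(actual, predicted)
```

With two true positives, one false positive and one false negative, Precision and Recall both come out at 2/3, and so does their harmonic mean.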

Precision measures the ability of the classifier to reject irrelevant information. Recall measures the ability of the classifier to identify relevant information. F-measure measures the combined performance of Precision and Recall.

Experiment Results Analysis. In order to verify the effectiveness of the IG-RS-SVM algorithm in e-commerce product review analysis, this article selects the classic MovieReviews data set, which includes 1000 positive evaluations and 1000 negative evaluations. The experiment uses the cross-validation method: the dataset is divided into 10 portions, nine of which are taken as training data and the other one as test data. Finally, the experiment reports the average value over all runs. After text pretreatment, we get a feature dataset including 65 features; then we use the IG algorithm to keep the features whose values are greater than 0. This new feature dataset includes 3 features, which means the dimension of the feature space fell sharply. Four classifiers are used for classification of the two feature datasets in the experiments, and the experimental results are shown in Table 1.

Table 1. Classification results with different algorithms

Classifier    Correct/%    SD    Precision/%    Recall/%    F-measure/%
NB
SVM
RS-SVM
IG-NB
IG-SVM
IG-RS-SVM

Through the analysis of Table 1, we can obtain the following results:
(1) Without the IG algorithm and the RS algorithm, SVM got a better result than NB;
(2) Without the IG algorithm, SVM with the RS algorithm got a better result;
(3) Without the RS algorithm, using the IG algorithm brings a certain improvement to both algorithms;
(4) With the IG algorithm and the RS algorithm, and taking the standard deviation into account, SVM got the best result.

The ROC Area of these classifiers lies in the range [0, 1] and is usually more than 0.5. The closer the value is to 1, the better the performance of the classifier. If the value is equal to 0.5, the classifier is completely ineffective.

Fig. 2. ROC graph

Fig. 2 shows the ROC graph, where the ROC Areas of RS-SVM and IG-RS-SVM are the highest. The value of RS-SVM is 0.95 and the value of IG-RS-SVM is [value missing in source]. It seems that the IG-RS-SVM classifier is the best among them.

Parameter Analysis. This experiment adopts three commonly used SVM kernels: the Linear Kernel, the Polynomial Kernel and the RBF Kernel. The Random Subspace Rate, which is the ratio of the feature subset to the full feature set, is also an important parameter of this algorithm. This experiment uses proportions of 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8 and 0.9 to evaluate the classification results under different proportions.
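For reference, the three kernels compared in this experiment and the decision function of Eq. (2) can be written as plain Python functions. This is only a sketch: the RBF form with sigma squared in the denominator follows the kernel definition given with Eq. (2), while the default values of `q` and `sigma`, the hand-picked support vectors and the choice to map a score of exactly zero to +1 are assumptions of this example, not trained or reported values.

```python
import math


def dot(x, z):
    """Inner product (x . z) of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, z))


def linear_kernel(x, z):
    return dot(x, z)


def polynomial_kernel(x, z, q=2):
    return (dot(x, z) + 1) ** q


def rbf_kernel(x, z, sigma=1.0):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / sigma ** 2)


def decision(x, support_vectors, labels, lambdas, b, kernel):
    """Eq. (2): sign of the kernel expansion over the support vectors."""
    score = sum(lam * y * kernel(sv, x)
                for sv, y, lam in zip(support_vectors, labels, lambdas)) + b
    return 1 if score >= 0 else -1


# Hand-picked (not trained) support vectors, for illustration only.
svs = [[1, 0], [-1, 0]]
ys = [1, -1]
lams = [1.0, 1.0]
label_right = decision([2, 0], svs, ys, lams, 0.0, linear_kernel)
label_left = decision([-2, 0], svs, ys, lams, 0.0, linear_kernel)
```

Because the kernel is passed in as a function, the same `decision` routine works unchanged for all three kernels, which mirrors how the experiment varies the kernel while keeping the rest of the classifier fixed.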

Fig. 3. Results of different ratios and kernels

Fig. 3 shows that the classification result of the Linear Kernel is the best, the Polynomial Kernel is second, and the RBF Kernel is the worst. When the ratio is 0.7, the Linear Kernel and the Polynomial Kernel get their best results, while the classification results of the RBF Kernel increase with the Random Subspace Rate. In summary, an SVM classifier using the linear kernel is recommended in practical applications.

Conclusions

Because RS-SVM gives unsatisfactory classification results in a high-dimensional feature space, this paper introduces the IG feature selection algorithm to reduce the dimension of the feature space used by the Random Subspace algorithm. The experimental results show that, compared with the other classification algorithms, IG-RS-SVM improves classification accuracy and stability. Considering the influence of the SVM kernel and the Random Subspace Rate on the classification results, the experiments show that the SVM linear kernel with a Random Subspace Rate of 0.7 obtains good results.

Reviews of E-commerce product quality not only provide great reference value to consumers, but also contribute to E-commerce product quality monitoring by quality supervision departments. Through the analysis of reviews of E-commerce products, governments can issue alerts to avoid problems caused by product quality, and provide a guarantee for the healthy development of E-commerce.

Acknowledgements

We are grateful to Dr. Songguo Lu (National Risk Monitoring Center for Quality of E-commerce Products, Hangzhou, China) for providing the data for analysis, and Dr. Ju Feng (Hangzhou Institute of Calibration and Testing for Quality and Technical Supervision, Hangzhou, China) for improving the data mining algorithm. This work was supported by the General Administration of Quality Supervision, Inspection and Quarantine of the People's Republic of China (No. 2030K00-2) and by the National Natural Science Foundation (No. ).

References

[1] B. Pang, L. Lee, Opinion mining and sentiment analysis, Foundations and Trends in Information Retrieval, 2008, 2(1-2).
[2] X. Fang, S. Wang, S. Cao, A Chinese Search Approach Based on SCWS, Proceedings of the 9th International Symposium on Linear Drives for Industry Applications, Berlin Heidelberg: Springer, 2014.
[3] Q.L. Guo, Y.M. Li, Q. Tang, The similarity computing of documents based on VSM, Application Research of Computers, 2008. (In Chinese)

[4] B. Robert, G. Ricardo, Q. Francis, Attribute bagging: improving accuracy of classifier ensembles by using random feature subsets, Pattern Recognition: The Journal of the Pattern Recognition Society, 2003, 36(6).
[5] G. Wang, S.L. Yang, Study of Sentiment Analysis of Product Reviews in Internet Based on RS-SVM, Computer Science, 2013, 40(A). (In Chinese)
[6] H.T. Ng, W.B. Goh, K.L. Low, Feature selection, perceptron learning, and a usability case study for text categorization, ACM SIGIR Forum, 1997, 31(SI).
[7] Y. Yang and O.J. Pedersen, A comparative study on feature selection in text categorization, Proceedings of the 14th International Conference on Machine Learning, 1997.
[8] C.J. Shang, D. Barnes, Combining support vector machines and information gain ranking for classification of Mars McMurdo panorama images, IEEE International Conference on Image Processing, 2010.
[9] M.J. Gangeh, M.S. Kamel, R.P.W. Duin, Random Subspace Method in Text Categorization, International Conference on Pattern Recognition, 2010.
[10] V. Vapnik, The nature of statistical learning theory, Springer Science & Business Media.
[11] Q. Lu, C. Cui, H.X. Zhou, Application of a kind of modified SVM multi-class classification algorithm in wireless sensor networks, Journal of China Jiliang University, 2013 (3). (In Chinese)
[12] S.L. Wang, Intrusion detection system for WSNs based on SVM, Transducer and Microsystem Technologies, 2012, 31(7). (In Chinese)