Usng Supervsed Clusterng Technque to Classfy Receved Messages n 137 Call Center of Tehran Cty Councl Mahdyeh Haghr 1*, Hamd Hassanpour 2 (1) Informaton Technology engneerng/e-commerce, Shraz Unversty (2) Department of Computer Eng. & IT, Shahrood Unversty of Technology,Shahrood,Iran Mahdyehaghr.66@gmal.com; h_hassanpour@yahoo.com Receved: 2011/05/20; Accepted: 2011/06/14 Pages: 15-24 Abstract Supervsed clusterng s a data mnng technque that assgns a set of data to predefned classes by analyzng dataset attrbutes. It s consdered as an mportant technque for nformaton retreval, management, and mnng n nformaton systems. Snce customer satsfacton s the man goal of organzatons n modern socety, to meet the requrements, 137 call center of Tehran cty councl s plannng to reduce the watng tme of customers, forwardng ther messages to the approprate servce manager to ncrease servce qualty. The cty councl s currently usng a manual approach dspatchng textual request messages. Snce ths process s very smlar to supervsed clusterng concept, n ths study, we appled the Naïve Bayes algorthm, whch s one of the most common algorthms n the supervsed clusterng arena to classfy receved messages regardng ther subjects. The performance results of the proposed technque ndcate ts hgh effcency n clusterng the receved messages wth 98% accuracy. Keywords: supervsed clusterng, classfcaton, Naïve Bayes algorthm, data mnng 1. Introducton Nowadays, due to the growng development of communcaton and nformaton technologes, our socety needs to take advantages of human and socal resources as much as possble. In ths regard, for nstance, reducng the watng tme of customers, ncreasng servce qualty and smplfyng the communcatons are needed n socal communcatons n ths age. Because of the growng development of IT-based communcaton tools and easy access to telecommuncatons systems such as phones, mobles, computers and the nternet, and due to engagement n dong routne tasks, usng smple and avalable technologes has a specal mportance. Therefore, Tehran muncpalty on a new plan made a system by usng nformaton technology scence and telecommuncaton tools. So, because of the drect vew and ctzens' actve collaboraton and people's nvolvement n managng ther own envronment, cvl tasks can be done quckly and properly. 137 call center of Tehran s founded as an nfrastructure n order to facltate communcatons of ctzens. The receved messages are classfed accordng to ther content nto predefned subject categores n the system by human operators and referred to the proper unt to be surveyed more. 15
Usng Supervsed Clusterng Technque to M. Haghr, H. Hassanpour Because of the mplementaton process n 137 call center that s naturally based on specfc and predefned choces, ths paper presents a model to classfy the receved messages automatcally. The proposed model uses a well known technque for textual document clusterng as an mportant step n ndexng, retreval, management and mnng of data n nformaton systems [1]. Clusterng methods are classfed nto two groups: supervsed and unsupervsed clusterng. Supervsed clusterng also known as classfcaton [2], s a data mnng technque used for extractng model from data on dscrete values [3]. Classfcaton evaluates the features of a data set, and then assgns them labels from a predefned set. Ths s the most common capablty of data mnng. Data classfcaton s a two-phase process [3]. In the frst phase, that s called learnng step, classfcaton system s made by a predefned set of data labels or concepts. In ths phase, the classfcaton algorthm makes the classfer by analyzng or learnng from tranng data set. In the second phase - generalzaton step [4], traned model s used to classfy unlabeled data. A test set s used to estmate the classfer performance. The man goal of generalzaton phase s to decrease nose mpact on the classfcaton results and ncrease classfer accuracy. Classfer accuracy on a gven test set s the percentage of tuples that have been correctly classfed. If classfer accuracy s admtted, t can be used to classfy tuples that ther class labels are unknown. Naïve Bayes, Roccho, K-Nearest Neghbors (K-NN), Regresson, Decson Trees, Neural Networks, Support Vector Machnes (SVM) and Decson Rule, can be referred to as the most mportant methods for text classfcatons problem [5]. Unsupervsed clusterng or smply clusterng [2], s n fact an unsupervsed learnng method. The goal s to dvde and gather a set of unstructured objects nto approprate clusters or groups so that smlar objects appear n the same cluster, and dfferent objects n the separate clusters [6,7]. The remander sectons of ths paper are as follows. Secton 2 explans our approach n detal. Followng, Secton 3 ntroduces our dataset and some related challenges related to Persan language of our dataset. Secton 4 s devoted to expermental results, and fnally secton 5 offers conclusons and suggests future work. 2. Introducng the proposed algorthm The classfcaton algorthm s used to classfy receved messages from ctzens n the 137 call center of Tehran cty councl based on Naïve Bayes algorthm. To know more about the mechansm of ths knd of algorthm, we wll explan t n great detals n ths secton. Assume we have m classes. Naïve Bayes classfcaton algorthm assgns X record to a class, say, whch has the maxmum posteror probablty: PC ( X ) > PC ( X ) for 1 j m, j (1) Wth ths assumpton, j maxmzes the posteror probablty. P( X C) PC ( ) P( C X) (2) P( X) 16
Journal of Advances n Computer Research (Vol. 2, No. 3, August 2011) 15-24 Snce the value of P(X) s the same for all classes, to maxmze (2),, t s enough to maxmze. Addtonally f that s named pror probablty s equal for all classes, t s only requred for to be maxmzed. If the pror probablty of classes s dfferent, t s possble to compute from (3): PC ( ) PC, D D (3) where s the number of records of class n the dataset, and s the total number of records n the data set. The amount of computed by (4): P( X C ) =Π P( X C ) (4) n k= 1 k In ths relaton, s the amount of attrbute of for record and s the total number of all attrbutes. If attrbutes are dscrete, then: P( X k C) P( X C ) = C, D (5) Snce our calculaton for classfyng a record, s based on the percentage of occurrng keywords of specal class (weght of the keyword for a subject), formula (6) s used to compute the weght of each keyword for each subject: ( ) P X C thefrequencyofeachkeywordforsubjectc = (6) thefrequencyofeachkeywordforallsubjects a. Offered archtecture for classfcaton Offered archtecture n ths paper can classfy the receved messages from ctzens to predefned subjects. Ths algorthm s presented n fgure (1), weghts keywords of each message for every subject and then calculates the weght of the subjects for each message by summng the weghts of the keywords n each message. Fnally, t assgns to the message the subject wth maxmum weght. 17
Usng Supervsed Clusterng Technque to M. Haghr, H. Hassanpour Algorthm SAMANE (Input Message: Strng, Output: Classfcaton of Message) 1- Select a Message; 2- Gve-Token(Message, Token-Set); 3- Fnd New Tokens n Relaton by Tokens n Token-Set and Insert Token-Set and Unte wth Related Token; 4- For Each Token n Token-Set Do Fnd All Message n Database n Whch It Occurs and Insert It n Message-Set; 4-1- For Each Message n Message-Set Do Fnd All Themes (Subjects) Related to Message; Themes-Frequency=Frequency of All Themes for Each Token; 4-2- For Each Theme n Theme-Set Do Weght (Token, Theme) = Frequency of Themes (Subjects)/ Themes-Frequency; 5- For Theme=1 to Number of Themes Do For Token=1 to Number of Tokens Do Sum (Theme) = Sum(Theme)+Weght(Token, Theme); 6- Max=Sum (1); For Theme=2 to Number of Themes Do If Sum(Theme)>Max Then Max=Sum(Theme); 7- Return Class=Subject(Max);. Fgure 1. offered classfcaton algorthm Table (1) represents a sample of the amount of the keywords relaton wth predefned subjects for each message. Ths assumed message has keywords that have been already exsted n subjects of all predefned subjects. Table 1. The relaton between keywords and defned subjects Subject Word Subject m Subject 2 Subject 1 Keyword 1 15% 40% 27% Keyword 2 20% 25% 10% Keyword n 5% 30% 36% Ths algorthm estmates the weght of each keyword n exstng subjects and calculates the weght of each message for each probable subject by computng the total assgned weghts of the keywords n each subject. The way of calculatng the weght of each keyword n dfferent subjects has been shown n (6). 18
Journal of Advances n Computer Research (Vol. 2, No. 3, August 2011) 15-24 The message classfcaton steps are as followng: 1. Select a message from receved messages; 2. Fnd the keywords exstng n the message; 3. Fnd synonyms and words relatng to the keywords and ntegrate them; 4. Make Token-Set (set of the keywords and the synonyms relatng to the message); 5. Send Token-Set to the weght calculaton unt for each subject; 6. Calculate the weghts for each subject for message; 7. Assgn the message to the subject wth the maxmum weght; 8. Regster nformaton so that t can be used for searchng and retrevng. b. The components of the offered archtecture Fgure 2. offered archtecture for classfcaton. Tranng data set Tranng data are sets of nformaton that transfer a knd of knowledge to the learner agent as a presupposed one. Actually, basc dfference between two methods of clusterng results from the dfference between ther learnng methods. Tranng nformaton gven to the system has been dscussed n the followng. - Class Labels The 137 call center of Tehran cty councl has classfed the receved messages from ctzens nto 44 specfc classes and then refers them to the relevant unt. Snce we plan to classfy automatc data based on the current process, and also we want to provde the same condton to compare the two methods, 44 classes wth the same labels are ncluded n the offered system. Moreover a class called "other cases" has been consdered so that a receved message out of the knowledge extent of the system can be classfed. - Tranng messages Each of the 44 classes n the manual system contans some related messages. These messages have been chosen and dscussed by a group of human expertse assgned by Tehran muncpalty and nclude all types of probable problems n every subject. These messages have been known as tranng messages n our system and ntroduced to the system by ther class labels. 19
Usng Supervsed Clusterng Technque to M. Haghr, H. Hassanpour - Keywords Preprocessng explores smple data n order to change them to approprate format for tranng [8]. One of the preprocessng phases s to extract keywords. Each document has one or more keywords or key phrases descrbng ts content, where these keywords or key phrases belong to a fnte set called controlled dctonary [5]. If there s a set of tranng documents wth dstnctve keywords for each of them, the process of extractng keywords wll be a supervsed learnng, otherwse wll be an unsupervsed learnng [8]. Due to the nature of our data tranng, we always face the short messages. Our target keywords are also exactly techncal n the urban management and belong to the predefned classes. So dentfyng and defnng these few techncal words whch many of them have been already known ncreases the accuracy and effcency of the system, and requres less tme regardng to dentfyng all the useless words such as addtonal words, conjunctons, pronouns, adverbs, plural suffxes, verbs and, etc. In ths paper we proft from supervsed learnng for keyword extracton. The offered system has been desgned n a way that t can dstngush defned keywords from all the suffxes, prefxes, punctuatons such as ".", ",", ";", "?", etc. and also mult syllabus words. So the keywords have been ntroduced to the system n ther smplest faces.. Test data set The data whch have been used to test the offered system, are the sent messages from the ctzens n the frst fve months n 2011n the dstrcts 1,2, 5 and 7 of the muncpalty of the regon 1 n Tehran. These 356 messages that have been already classfed manually are n 20 classes out of 44 tranng classes. In ths paper, t s supposed that ctzens' messages should be sent n Persan. c. Overcomng Persan language challenges In [9], some of the Persan language challenges n computer especally n the nternet and result of web searchng, have been mentoned. These challenges are as followng: ; «میتواند «and «میتواند «as such «می «unclung 1. Varety n usng clung or ; «درختها «and «درختها «as such «ها «unclung 2. Varety n usng clung or ; «پاکسازي «and «پاكسازي «as 3. Usng some prefxes or suffxes such ; «مسي له «and «مسا له «or «مسو ول «and «مسي ول «as n varous types such «همزه «Usng 4. ; «خانه مسکونی «and «خانە مسکونی «as such ة««usng 5. Usng or not ; «موسا «and «موسی «as such ا««by n Arabc words ended ي««usng 6. Varety n ; «اطاق «and «اتاق «as 7. Varety n spellng some words such 8. Usng European words n natve language or Persan translatons such as «ATM» and ; «خودپرداز «9. Usng or not usng rregular plurals for some words; 10. Transformng Latn words to Persan alphabets wth the orgnal pronuncaton such as ; «سورس «and «source» ; «قران «and «قرآن «as nterchangeably such آ««and ا««Usng 11. 12. Usng or not usng phonetc symbols for words. 20
Journal of Advances n Computer Research (Vol. 2, No. 3, August 2011) 15-24 The only dfference between our data and what found n computer, and the nternet s that our data are messages sent by ther own cell phones or mobles. Due to the present abltes of mobles n our country, cases such as 5 and 12 can't be set forth for dscusson. On the other hand, ctzens are supposed to send ther messages n Persan, so cases 8 and 10 can be gnored, too. Because of the system capacty to extract keywords from all of possble suffxes and prefxes, cases 1 and 2 have been corrected. Our offered system has a complete controlled dctonary that keywords belong to t. Ths dctonary n addton to synonymous words and phrases of keywords, conssts of dfferent spellng of them so that the system can respond to the cases 3, 6, 7 and 9 correctly. But for cases 4 and 11, an ablty s consdered n the system so that all the wrtng styles of vowels such as «,«و««ا and «ي» can be changed to the natve wrtng styles and then t wll survey them. 3. Evaluaton the offered system performance To estmate algorthm performance, three scales Precson(P), Recall(R) and Accuracy(A) were surveyed[10]. Four parameters TP, FN, FP and TN[11] that are defned n table (3), are needed to calculate these scales. Table 3. parameters of effcency scales Text Assgnng to class Not assgnng to class Belongng to class True Postve(tp) False Negatve(fn) Belongng to non class False Postve(fp) True Negatve(tn) Accuracy s the common scale used to evaluate accuracy of machne learnng algorthms. Ths scale s calculated by (7)[12]: tp+ tn A = tp + tn + fp + fn (7) Precson and Recall are scales used to estmate effcency of the text classfcaton algorthm. Precson scale calculates the accuracy of a searchng. Ths scale s calculated by (8)[12]: P tp = tp + fp (8) Recall scale, unlke the precson scale, calculates the rate of perfecton n a searchng [12]: R tp = tp + fn (9) 21
Usng Supervsed Clusterng Technque to M. Haghr, H. Hassanpour Table (4) presents the effcency scales of the system n all target classes. The average of ths asset value s used for fnal evaluaton. Table 4. Performance metrcs related to some target classes Class name Accuracy Precson Recall water 0.977 1 0.91 bus 0.991 0.75 0.86 asphalt 0.983 1 0.84 Socal damages 0.997 0.8 1 snow 0.985 0.83 0.88 Parks and green lands 0.985 0.66 0.79 traffc 1 1 1 Changng usage 0.997 1 0.88 Removng and ptchng 0.958 0.91 0.8 anmals 0.992 0.73 1 trees 0.983 0.89 0.94 trashes 0.966 0.71 0.77 buldng 0.983 0.95 0.9 Blockng path 0.98 1 0.53 washng 1 1 1 dredge 0.977 1 0.75 reparng 0.989 0.89 0.89 trouble 0.952 0.64 0.43 Ptchng safety marks 0.989 0.67 0.67 cleanng 0.98 1 0.61 Average 0.984 0.87 0.82 The results from the surveys showed that the system has totally made acceptable errors n the followng condtons: 1. The nearness of the message content to two or more groups wth the hgh usage and meanng resemblance; 2. The presence of dfferent usage and meanng keywords n the message; 3. Introducng the problem n a statement, wthout gvng any soluton or requestng to solve the problem. The frst error resultng from the hgh meanng resemblance and usage occurs because the rght range of the subject has not been pad attenton. For words such as park that has dfferent meanngs n dfferent sentences, t s necessary to extract not only the defned keywords but also survey the other parts of the sentence so that the target meanng n each sentence can be specfed through the relevant words wth each possble meanng. The thrd error occurs because the range of some groups s too wde. Although the ctzens can help survey ther message better and faster by requestng the muncpalty n ther own messages, at least by makng ther own requests besdes the ntroduced problems. Therefore, to correct the possble errors and ncrease the effcency of the system as much as possble, the followng actons can be taken. 1. Reformng the exstng groups by mergng or splttng the groups of less accuracy; 22
Journal of Advances n Computer Research (Vol. 2, No. 3, August 2011) 15-24 2. Usng a method besdes usng a dctonary to ndex latent semantc n the sentences made up of the dfferent applcable meanng keywords; 3. Approprate teachng and nformng the ctzens to ask ther own requests n the message. However, despte the ntroduced problems the rate of offered system precson s 87%, and ts Recall s 82%. Of course, we should notce that these acceptable percentages have consderably mproved and makes the system effcency ncrease by overcomng the presented dffcultes. The accuracy measure calculated s over 98%. Ths shows one of the most mportant advantages of the Naïve Bayes algorthm n comparson wth other algorthms that s ts persstence aganst the rrelevant attrbutes and keywords. 4. Concluson In ths paper, a method based on one of the most common supervsed algorthms, Naïve Bayes, was presented to classfy the ctzens' thematc messages n the 137 call center of Tehran cty councl. Ths algorthm uses 44 exstng classes n the manual system as tranng classes, 605 reference messages as ts tranng messages and 356 receved messages from Tehran ctzens n dstrct 4 of Tehran muncpalty of regon 1 for fve months n a year, as ts test data. Applyng ths method and evaluatng ts results proved ths algorthm wth accuracy up to over 98%, s a rght choce for our target. The rate of precson 87% and recallng 82% n classfyng the messages also suggests the comparatve success of the offered system n classfyng the target test data. Moreover, more study on the obtaned results showed that the effcency of the system can be ncreased by reformng the target groups used n the 137 call center and subsequently n the system and addng more convenence to the capacty of ndexng ts latent semantc, and teachng and nformng the ctzens correctly n presentng ther problems and requests. Our suggested future work conssts of usng other methods besde the current algorthm to ncrease accuracy. 5. References [1] Azzag, H., Gunot, C. and Venturn, G., "Data and text mnng wth herarchcal clusterng ants" Sprnger, Swarm Intellgence n Data Mnng. Vol. 34,: 153-189, 2006. [2] Eck, C., F., Zedat, N. and Zhao, Z., "Supervsed Clusterng- Algorthms and Benefts" 16th IEEE Internatonal Conference on Tools wth Artfcal Intellgence (ICTAI'04): 774-776, 2004. [3] Mazhar, N., Iman, M., Joudak, M. and Ghelchpour, A., " An overvew of classfcaton and ts algorthms" 3 th Data Mnng Conference (IDMC'09): Tehran, 2009. [4] Fazaeljavan, M. and Sadredn, M., H., " Investgatng Classfcaton and Assocaton Rule Mnng Technques for Intruson Detecton" 3 th Data Mnng Conference (IDMC'09): Tehran, 2009. [5] Sebastan, F., "Machne learnng n automated text categorzaton" Journal of ACM Computng Surveys., Vol. 34, 2002. [6] Fnley, T. and Joachms, T., "Supervsed clusterng wth support vector machnes" ICML '05 Proceedngs of the 22nd nternatonal conference on Machne learnng, 2005. [7] Berkhn, P., "Survey Of Clusterng Data Mnng Technques" Sprnger. Vol. 12: 1-56, 2002. [8] Ahmad, M., H., Monadjem, A., H. and Ayat, S., "Persan Text Classfcaton Usng Assocaton Rules" 4 th Data Mnng Conference (IDMC'10): Tehran, 2010. 23
Usng Supervsed Clusterng Technque to M. Haghr, H. Hassanpour [9] محسنیکبیر نینا میناییبیدگلی بهروز و کاشفی امید "مطالعه و بررسی اثر پیشوند و پسوند در شباهت معنایی جملات زبان فارسی با هدف کاربردي در سیستمهاي بازیابی اطلاعات" سومین کنفرانس دادهکاوي ایران تهران: دانشگاه علم و صنعت ایران 1388. [10] Wllams, N., Zander, S. and Armtage, G., "A prelmnary performance comparson of fve machne learnng algorthms for practcal IP traffc flow classfcaton" ACM SIGCOMM Computer Communcaton Revew. Vol. 36, October 2006. [11] Nguyen, T., T., T. and Armtage, G., "A survey of technques for nternet traffc classfcaton usng machne learnng" IEEE, Communcatons Surveys & Tutorals. Vol. 10: 56-76, 2008. [12] Fard, D., Harb, N. and Rahman, M., Z., "Combnng Nave Bayes and Decson Tree for Adaptve Intruson Detecton" Internatonal Journal of Network Securty & Its Applcatons: 12-25, 2010. 24