pp. 29-33
http://dx.doi.org/10.14257/astl.2014.76.08

A Spam Message Filtering Method: focus on run time

Sin-Eon Kim 1, Jung-Tae Jo 2, Sang-Hyun Choi 3

1 Department of Information Security Management
2 Department of Business Data Conversion
3 Professor, Department of Management Information Systems, BK21+ BSO Team
Chungbuk National University, 52 Naesudong-ro, Heungdeok-gu, Chungbuk 361-763, Korea
trebrone@gmail.com, hlla007@gmail.com, choi@cbnu.ac.kr

Abstract. In this paper, we propose a light and quick algorithm with which SMS filtering can be performed independently within mobile devices. Once introduced, this algorithm can be a solution to the memory and resource limitations that have not yet been solved.

Keywords: Mobile phone, SMS spam, spam filtering, Data Mining

1 Introduction

As SMS spam messages have drastically increased, typical filtering methods are no longer effective when processed within mobile phones. For efficient spam filtering, techniques to remove unnecessary data are needed. These data-reducing techniques include data filtering, feature selection, data clustering, etc. The main idea is to select important features using the relative magnitude of feature values. We compare the performance of our method with that of standard methods: Naive Bayes, J48 Decision Tree, and Logistic. In this paper, we propose a new feature selection method based on the average ratio of each class relative to the total data, and we compare the proposed method with the other methods.

2 Related Work

Existing research includes statistics-based methods, such as Bayesian-based classifiers, logistic regression and decision tree methods. There are still few studies about SMS spam filtering methods available in research journals, while research on email spam classifiers is continuously increasing. We present the most relevant work related to this topic. Gómez Hidalgo et al. (2006) evaluated several Bayesian-based classifiers to detect mobile phone spam.
ISSN: 2287-1233 ASTL Copyright 2014 SERSC
In this work, the authors proposed the first two well-known SMS spam datasets: the Spanish (199 spam and 1,157 ham) and English (82 spam and 1,119 ham) test databases. They tested on them a number of message
representation techniques and machine learning algorithms, in terms of effectiveness. The results indicate that Bayesian filtering techniques can be effectively employed to classify SMS spam [1].

2.1 SMS Spam Collection v.1 Data Set

The SMS Spam Collection v.1 is a set of tagged SMS messages that have been collected for SMS spam research. It contains one set of 5,574 SMS messages in English, each tagged as being ham or spam. The data contains one message per line. Each line consists of two columns: one with the label (ham or spam) and the other with the raw text.

Table 1. Type of feature.

Message    Amount    %
Ham        4,827     86.60
Spam         747     13.40
Total      5,574    100.00

As shown in Table 1, the data set consists of 86.6% ham messages and 13.4% spam messages. Table 2 shows some examples of ham and spam messages [8].

2.2 Data Mining algorithms

Typical methods to detect spam messages include Bayesian classifiers, logistic regression, decision trees and so on. Bayesian classification provides a useful perspective for understanding and evaluating many learning algorithms [7]. A decision tree is a flowchart-like structure in which each internal node represents a test on an attribute, each branch represents an outcome of the test and each leaf node represents a class label (the decision taken after computing all attributes). A path from root to leaf represents a classification rule [2]. Logistic regression is a type of probabilistic statistical classification model. It is also used to predict a binary response from a binary predictor, and for predicting the outcome of a categorical dependent variable based on one or more predictor variables (features) [3].

3 Experimental Study

We explained above that SMS spam data is rapidly increasing. In order to detect spam messages, filtering algorithms or feature selection methods have to run more efficiently. The above three methods use complex calculations to do this. For this
reason, these methods are inefficient for dealing with large-scale data. In this paper, we propose a simple and efficient feature selection method.

3.1 Proposed Method

This study proposes a VR (Value Ratio) measure aimed at making filtering methods light and quick, so that SMS filtering can be performed independently within mobile devices. First, the messages are divided by class (Spam and Ham), and the appearance frequencies of words in the SMS messages are evaluated. Then the appearance frequencies of each word are aggregated and divided by the number of messages to calculate an average. The formulas are as below.

$$W_j^{s} = \frac{1}{k} \sum_{i \in \text{spam}} w_{ij} \tag{1}$$

$$W_j^{h} = \frac{1}{k} \sum_{i \in \text{ham}} w_{ij} \tag{2}$$

Here, i and j represent the row and column respectively, so that $w_{ij}$ is the appearance frequency of the jth word in the ith message, and the total number of messages is k. The result of calculating VR by using the calculated $W_j^{s}$ and $W_j^{h}$ values is as below.

$$VR(j) = \frac{W_j^{s}}{W_j^{h}} \tag{3}$$

VR(j) represents the relative ratio of the average frequency of the jth keyword in spam messages to that in ham messages. The larger the value of VR(j), the more frequently the word is used in spam messages.

As shown in Figure 2, when the algorithms were executed using the VR attribute selection technique, run time varied considerably. Thus, the technique is expected to be suitable for executing algorithms independently in the mobile environment, which has many limitations in terms of storage space, memory, and processing capability.
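The computation in Eqs. (1)-(3) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the whitespace tokenizer, the tab-separated `label<TAB>text` input layout, and the small `smoothing` constant (added only to avoid division by zero for words that never occur in ham) are all hypothetical choices that the paper does not specify.

```python
from collections import Counter


def parse_lines(lines):
    """Split 'label<TAB>text' lines into (label, text) pairs (assumed layout)."""
    pairs = []
    for line in lines:
        label, _, text = line.rstrip("\n").partition("\t")
        if text:
            pairs.append((label, text))
    return pairs


def tokenize(text):
    """Lowercase and split on whitespace (an assumed, deliberately simple tokenizer)."""
    return text.lower().split()


def class_averages(texts, k):
    """Sum each word's frequency over one class and divide by k (Eqs. 1 and 2)."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return {word: freq / k for word, freq in counts.items()}


def value_ratio(labeled_messages, smoothing=1e-6):
    """Return VR(j) = W_j^s / W_j^h for every word j (Eq. 3).

    `smoothing` is an assumption, not part of the paper.
    """
    k = len(labeled_messages)  # total number of messages
    spam = [text for label, text in labeled_messages if label == "spam"]
    ham = [text for label, text in labeled_messages if label == "ham"]
    w_spam = class_averages(spam, k)
    w_ham = class_averages(ham, k)
    vocabulary = set(w_spam) | set(w_ham)
    return {
        word: w_spam.get(word, 0.0) / (w_ham.get(word, 0.0) + smoothing)
        for word in vocabulary
    }


def top_features(vr, n):
    """Pick the n words with the largest VR, breaking ties alphabetically."""
    return sorted(vr, key=lambda word: (-vr[word], word))[:n]


# Toy data for illustration only; these are not messages from the data set.
sample = parse_lines([
    "spam\twin free prize now",
    "spam\tfree entry win cash",
    "ham\tare you free now",
    "ham\tsee you at home now",
])
vr = value_ratio(sample)
print(top_features(vr, 3))
```

On this toy input, `win` ranks first because it occurs twice in spam and never in ham, while a word such as `you`, which occurs only in ham, gets a VR of zero. Keeping only the highest-ranked words is one way the measure could be used to shrink the feature set before a classifier is run.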
Figure 2. The result of algorithm

4 Future work

In the future, a program should be built with the method proposed in this study, and its efficiency should be demonstrated through a comparative analysis of the time it takes when run independently on actual mobile phones. Because spam messages continuously increase, data should be added constantly for a precise analysis. Additionally, the proposed method should not be limited to spam filtering but applied to various fields to extract useful information, so that research on data-reducing techniques for efficient analysis in massive data environments can be conducted.

Acknowledgements. This research was supported by the MSIP (Ministry of Science, ICT and Future Planning), Korea, under the ITRC (Information Technology Research Center) support program (NIPA-2014-H0301-14-1022) supervised by the NIPA (National IT Industry Promotion Agency). This research was also supported by the MSIP (Ministry of Science, ICT and Future Planning), Korea, under the "SW master's course of a hiring contract" support program (NIPA-2013-HB301-13-1008) supervised by the NIPA (National IT Industry Promotion Agency).

References

1. Gómez Hidalgo, J. M., Bringas, G. C., Sánz, E. P., & García, F. C. (2006). Content based SMS spam filtering. In Proceedings of the 2006 ACM symposium on Document engineering, 107-114.
2. http://en.wikipedia.org/wiki/decision_tree
3. http://en.wikipedia.org/wiki/logistic_regression
4. http://www.cs.waikato.ac.nz/~ml/weka/
5. Liu, H., Setiono, R., Motoda, H., Zhao, Z. (2010). Feature Selection: An Ever Evolving Frontier in Data Mining, JMLR: Workshop and Conference Proceedings, 4-13.
7. Mukherjee, S., Sharma, N. (2012). Intrusion Detection using Naive Bayes Classifier with Feature Reduction, Procedia Technology, 119-128.
8. SMS Spam Collection v.1 (2012). http://archive.ics.uci.edu/ml/index.html