Vol. 7, No. 3, May, 2013

A Keyword Filters Method for Spam via Maximum Independent Sets

HaiLong Wang 1, FanJun Meng 1, HaiPeng Jia 2, JinHong Cheng 3 and Jiong Xie 3
1 Inner Mongolia Normal University
2 Air Defence Forces Academy
3 Inner Mongolia Electric Power Information and Communication Center
lzjtuwhl@163.com

Abstract

In order to evade keyword filtering, spammers insert comments into e-mails, such as unusual symbols like #, to divide keywords. In this paper, a keyword filtering method for spam via maximum independent sets is presented. The main contents include: (1) building a matching relation matrix algorithm to improve the performance of finding maximal independent sets; (2) developing a judgment criterion according to the matching relation matrix algorithm; (3) designing a behavior recognition technology, which can detect and reject an e-mail while it is being received. As shown by experiments and the analysis of examples, the space and time complexity of this algorithm is much smaller than O(mn). The operating efficiency is also satisfactory, and the method is able to achieve complete filtering of targeted unusual symbols during keyword filtering.

Keywords: Maximum independent sets; semi-diagonal line; keyword filters; matching relation matrix

1. Introduction

E-mail has become an important method of network communication, as it is widely used among Internet users and is regarded as one of the most commonly used network applications. However, with the development of the Internet, junk e-mails (spam) bother most people: they not only bring discontent to users, but also cause web security issues and economic losses. In recent years, a large number of dedicated servers have generated and sent out spam through e-mail. According to statistics from the anti-spam center of the Internet Society of China, more than 1 servers per month have been blacklisted by authoritative foreign anti-spam organizations since 2005. With the development of the network, spam has become an increasingly serious global security problem.

More and more researchers are paying attention to this field. The negative effects of spam, which bring great economic losses and result in the blockage of large amounts of data and information, have become a worldwide problem. Numerous experts and scholars have put forward targeted prevention methods that ease the problem to some extent. Data mining is a relatively popular technique for filtering content and theme keywords, which detects spam keywords through keyword classification and statistical algorithms. The Bayes filter is an effective method. Its characteristics are adaptation and self-learning, and it has the advantage of high detection accuracy [1]. Other widely used detection approaches include detection based on memorial information, detection based on the description of event features, and filtering based on spam feature analysis and regular expression matches.

2. Related Filtering Technology and Analysis

In Anti-Spam Solutions and Security, Dr. Neal Krawetz sorts anti-spam techniques into four main categories: Filter, Reverse lookup, Challenges, and Cryptography. All of these solutions can ease the spam problem, but each has its limitations. Filtering is a relatively simple and widely used technique in spam detection. It is commonly used in mail receiving management systems for judging and eliminating spam. Currently, most mail servers apply anti-spam plug-ins, detection gateways and client-side spam filtering, based on a variety of filtering techniques such as the keyword filtering algorithm, black and white lists, rule-based filtering and the Bayesian filtering algorithm [2].

2.1 Keyword filtering algorithm

Mail content filtering uses one keyword or multiple keywords as the general basis for judgment. The keyword hit rate is used to confirm whether a message is spam: if the hit rate is larger than a set threshold, the message is considered spam. The keywords can also be phrases and short sentences. Mail header information is the original record of the mail delivery process and is also very important data for judging whether a message is spam. Spammers use various tools to send spam with forged sender, subject and content, but some common information remains in the message header, such as the IP address, host name and X-identification [3]. By filtering on this information, one can find the spam among several mails that were sent from the same address but carry different sender and recipient addresses and subjects. A keyword filtering algorithm generally creates a table of spam-associated keywords to judge and process spam. If a certain keyword appears in a large number of spam e-mails, it can be put on the filtered list. The defects of this algorithm are that keyword selection has a great impact on filtering capacity, the selection procedure costs a lot of resources, and efficiency is relatively low in the keyword selection stage.
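The hit-rate test described above can be sketched as follows. This is a minimal illustration, not the filter used in the paper: the definition of the hit rate as the fraction of table keywords found in the text, the `threshold` value, and the `SPAM_KEYWORDS` table are all assumptions made for the example.

```python
def keyword_hit_rate(text, keywords):
    """Fraction of the keywords in the table that occur in the text.

    Assumption: the paper does not define the hit rate precisely; this
    sketch uses (matched keywords) / (keywords in the table).
    """
    if not keywords:
        return 0.0
    lowered = text.lower()
    hits = sum(1 for kw in keywords if kw.lower() in lowered)
    return hits / len(keywords)


def is_spam(text, keywords, threshold=0.5):
    """Flag the message when the hit rate exceeds the threshold (hypothetical value)."""
    return keyword_hit_rate(text, keywords) > threshold


# Hypothetical keyword table, for illustration only.
SPAM_KEYWORDS = ["free", "winner", "discount"]

print(is_spam("You are a WINNER: free discount inside", SPAM_KEYWORDS))
print(is_spam("Meeting notes attached", SPAM_KEYWORDS))
# Exact substring matching no longer fires once symbols split the keywords.
print(is_spam("You are a W#INNER: f-r-e-e disc*ount", SPAM_KEYWORDS))
```

The last call shows why exact matching is fragile: every keyword is still readable to a person, but none of them occurs as a contiguous substring.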
In addition, word splitting and word combination can easily evade this filtering.

2.2 White and Black List

The blacklist contains the basic information of confirmed spammers: the e-mail address or IP address of the mail sender. If the sender of a mail matches the address of a known spammer, the message is judged to be spam and is rejected. The disadvantage of this method is that a spammer can bypass blacklist detection by using a different IP address, or by using a forged or changed sender address [4]. In addition, an administrator who relies on manual processing is unable to update the blacklist in a timely and effective way.

The whitelist contains trusted e-mail addresses or IP addresses. If the sender information of a mail matches the data in the whitelist, the mail is considered normal and is released. The disadvantage of this approach is that if the user wishes to receive e-mail from a certain address, a rule allowing mail from this address must be set in advance. If a sender changes his or her mail address, the whitelist must be updated in time; otherwise, the mail server will reject the mail.

2.3 Rule-based Filtering

Rule-based filtering technology defines filtering expressions or rules, mainly by selecting certain characteristics of keywords to describe the feature values of spam. The fatal defect of this filtering method is that it requires managers to maintain a relatively large rule library, and in order to keep the system effective and up to date, managers need to organize new rules regularly [5].

2.4 The static content filtering technology

Static content filtering is only useful for spam that follows known rules. For example, marketing advertising is spam that always follows such a rule: the advertising mail includes a subject tagged ADV. If users do not want to receive spam of this type, they can simply set a filter to reject mail whose subject carries the ADV tag. However, more evasive spam has appeared in which, for example, the word "free" is transformed into variants such as "Free... fee" or "Free - fee". This has become another hot problem in the filtering field. If we scan the mail and reject every mail that includes data of these types, some normal mail will be deleted. Therefore, keyword filtering technology based on message content leads to a high false-positive rate in practice. This technology can be used in highly controlled environments [6].

2.5 Bayesian Algorithm

A Bayesian filter calculates the probability that an e-mail containing certain message content (a TOKEN string) is spam, and it is trained on manually identified spam and legitimate mail. Its results are therefore more effective than those of average content filters. A Bayesian filter is a score-based filter that creates the spam feature table automatically. The algorithm first analyses the feature values in a large volume of spam and legitimate mail, and then calculates the probability given the multiple features contained in a mail. The principle of the algorithm is to check the keywords in the spam set and the legitimate mail set, count each feature value as a TOKEN string, and then build a hash table for the TOKEN strings of the spam set and another for those of the legitimate mail set according to the number of occurrences of each extracted TOKEN string, called the word frequency. These tables store the mapping from TOKEN string to word frequency.
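The two hash tables described above can be sketched with Python dictionaries. This is only an illustration under stated assumptions: the `tokenize` rule (lower-cased runs of letters, digits and apostrophes) and the sample mails are hypothetical, since the paper does not specify how TOKEN strings are extracted.

```python
import re
from collections import Counter


def tokenize(text):
    """Split a mail body into TOKEN strings (assumed rule: lower-cased word runs)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_frequency_table(mails):
    """Hash table mapping each TOKEN string to its word frequency in a mail set."""
    table = Counter()
    for mail in mails:
        table.update(tokenize(mail))
    return table


# Hypothetical training sets, for illustration only.
spam_mails = ["Free offer, free prize", "Claim your free prize now"]
legit_mails = ["Project meeting at noon", "Lunch is free tomorrow"]

spam_table = build_frequency_table(spam_mails)
legit_table = build_frequency_table(legit_mails)

print(spam_table["free"], legit_table["free"])
print(spam_table["prize"], legit_table["prize"])
```

`Counter` is a dictionary subclass, so a lookup for a token that never occurred returns 0 instead of raising an error, which keeps the later probability step simple.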
We compute the probability that a TOKEN string exists in the hash tables by $P = WF / L$, where $WF$ is the word frequency of a certain TOKEN string and $L$ is the length of the corresponding hash table. $P$ indicates the probability that a newly received e-mail is spam when the mail content contains a certain TOKEN string in the hash tables established by the system [7]. Finally, we obtain the spam decision score for the overall mail from the composite probability formula:

$$P(A \mid t_1, t_2, \ldots, t_n) = \frac{P_1 P_2 \cdots P_n}{P_1 P_2 \cdots P_n + (1 - P_1)(1 - P_2) \cdots (1 - P_n)} \tag{1}$$

where $P_i = P(A \mid t_i)$ $(1 \le i \le n)$ denotes the probability that a mail is spam when it contains the TOKEN string $t_i$. If the result is greater than a specified threshold, we mark the mail as spam; otherwise, the mail is legitimate.

What we can observe from the Bayesian algorithm is that spammers may escape Bayesian filtering by randomly inserting a word or sentence. Because these filters use static, passive detection technology, many of them are most effective only within a very short period. In order to keep spam detection effective and up to date, managers must update the filter rules constantly [8].

Currently, the main anti-spam systems commonly use keyword filtering technology based on complete matching, intercepting samples, analyzing characteristics, generating rules, issuing rules and content filtering. In order to avoid the filtering of special keywords, spammers often insert a large number of comments into the e-mail to split certain keywords (such as 法轮功) and obfuscate the mail content by special methods, for example by inserting symbols into keywords, as in 法#轮功. Sometimes they convert promotional content into a Zip package and send it as an attachment to evade the filter.

In this paper, we design and implement an adapted algorithm which can effectively solve these problems, including the splitting and combining of keywords and the insertion of special symbols. Traditional filtering technology always checks an e-mail after all of its content has been downloaded to the local disk, which degrades performance. We therefore also design a behavior recognition technology, which can detect and reject an e-mail while it is being received. In this way, we do not need to wait until all the content has been fully received from the remote node, and can block the e-mail directly at the very beginning of the transfer. The entire filter rule set is built in the initial period of establishing the SMTP connection.

3. The Keywords Filtering Algorithm Based on the Maximal Independent Set

3.1 Matching relation matrix of string

Given any two strings S and T, their maximum matching problem is equivalent to finding the maximal independent set of their matching relation matrix. Define $S = a_1 a_2 \cdots a_m$ and $T = b_1 b_2 \cdots b_n$, and write $\langle n \rangle = \{1, 2, \ldots, n\}$ and $\langle m \rangle = \{1, 2, \ldots, m\}$. Then

$$M(S, T) = \{(i, j) \mid i \in \langle m \rangle,\ j \in \langle n \rangle,\ a_i = b_j\}$$

is called the matching relation set of S and T [5]. Here we generally assume that $n \ge m$ [9]. The matching relation matrix C can be defined as follows:

$$C = \begin{pmatrix} C_{11} & C_{12} & \cdots & C_{1n} \\ C_{21} & C_{22} & \cdots & C_{2n} \\ \vdots & \vdots & & \vdots \\ C_{m1} & C_{m2} & \cdots & C_{mn} \end{pmatrix} \tag{2}$$

where

$$C_{ij} = \begin{cases} 1, & \text{if } a_i = b_j \\ 0, & \text{if } a_i \ne b_j \end{cases}$$

We only discuss the situation in which the weight $C_{ij} = 1$ in this paper.

Definition. In an independent set, each node has at most one corresponding successor node at its bottom right, located in a different row and a different column. Such a chain of nodes is called a quasi diagonal.
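The matching relation set and matrix defined above can be built directly from the two strings. The sketch below is illustrative only: the function names and the sample strings are assumptions, and the set uses the 1-based indices of the definition while the nested list is 0-based as usual in Python.

```python
def matching_relation_matrix(s, t):
    """m x n matrix C with C[i][j] = 1 if s[i] == t[j], else 0 (0-based lists)."""
    return [[1 if a == b else 0 for b in t] for a in s]


def matching_relation_set(s, t):
    """M(S, T): all pairs (i, j), 1-based, with a_i = b_j."""
    return {
        (i + 1, j + 1)
        for i, a in enumerate(s)
        for j, b in enumerate(t)
        if a == b
    }


# Hypothetical strings: keyword S (m = 3) and target T (n = 6).
S = "abc"
T = "a#b#ca"

for row in matching_relation_matrix(S, T):
    print(row)
print(sorted(matching_relation_set(S, T)))
```

In this example the entries (1,1), (2,3), (3,5) each lie to the lower right of the previous one, so they form a quasi diagonal of length 3, while (1,6) starts a second, shorter chain.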

The set of nodes whose value is 1 ($C_{ij} = 1$) along a quasi diagonal of the matching relation matrix is called an independent set. The keyword matching problem can be transformed into solving for the maximal independent set of the matching relation matrix: search all independent sets in a given matching relation matrix, and the longest set is the answer. In particular, we can search the points in the matching relation matrix whose value is 1 to determine whether the strings match completely. However, this idea makes the problem more complicated. We find instead that finding the maximal independent sets can be regarded as searching for a road in the matching relation matrix along which every value is 1. Each node in the independent set has only one corresponding node at its bottom right, located in a different row and a different column (the quasi diagonal), so we search toward the bottom right along the quasi diagonal [10]. We introduce a new algorithm for finding the independent sets, described in detail in Section 3.2. The output result and the original input strings satisfy the following relationship:

$$\begin{array}{c|cccc} & b_1 & b_2 & \cdots & b_n \\ \hline a_1 & C_{11} & C_{12} & \cdots & C_{1n} \\ a_2 & C_{21} & C_{22} & \cdots & C_{2n} \\ \vdots & \vdots & \vdots & & \vdots \\ a_m & C_{m1} & C_{m2} & \cdots & C_{mn} \end{array} \tag{3}$$

3.2 The algorithm of maximal independent set of matching relation matrix

In order to search toward the bottom right along the quasi diagonal, we propose an improved algorithm for the maximal independent set of the matching relation matrix. Assume the strings α and β are independent sets obtained during the search, with lengths |α| and |β|. The procedure of the algorithm is as follows:

Step 1. When searching column j, if |α| ≥ |β| and the abscissa $i_1$ of the last character of α satisfies $i_1 < i_2$ (where $i_2$ is the abscissa of the last character of β), then stop the operation on β.

Step 2. When searching row i, if |α| ≥ |β| and the ordinate $j_1$ of the last character of α satisfies $j_1 < j_2$ (where $j_2$ is the ordinate of the last character of β), then stop the operation on β.

Step 3. When the search of the matching relation matrix is completed, if the length of α is equal to that of the original string, then the keyword is found [11].

This algorithm may generate multiple results β, because we only compare the length of α with the length of the original string, so we can obtain multiple independent sets in the end. The search procedure is presented as pseudocode in the following [12]:

    search
      d[] = 0; col = 0; η[] = 0; k = 0;
      for j ← col, …, n do
        for i ← k, …, m do
          if equal(s[i], t[j]) { d[]++; col++; k++; continue; }
          if (j < n && k >= m) k = 0;
      End of For
      η[] = d[] / m;
    End                                                        (4)

Here d[] is the array that stores the length of each independent set during the calculation, η[] is the matching accuracy of each independent set, and col is the next matching start position in the target string T[j]. This use of col largely reduces the matching time complexity. k denotes the current search position in the original string. When the first pass over the original string is finished and the target string has not yet been fully scanned, the search starts again from the first character of the original string; the search process stops only when it has found all the matching strings [13].

3.3 Judgement Criterion

The matching accuracy is

$$\eta = \frac{1}{N} \sum_{i=1}^{m} \sum_{j=1}^{n} C_{ij}$$

where N is the length of the original keyword string and the double sum, taken over the nodes of the quasi diagonal, is the length of the quasi diagonal ($C_{ij}$ only equals 0 or 1). If η < 1, the string α and the target string under detection do not match exactly, and the system outputs the result that the mail is secure. The case η > 1 is meaningless. If η = 1, the string α and the target string match exactly: the keyword hidden in the string β has been found, so the system gives a warning and continues with the next steps [14].

3.4 Complexity Analysis

We also adopt two techniques in our algorithm. Firstly, the matching relation matrix is created dynamically, so we do not need any additional space to store the matrix table. Secondly, in the string matching procedure we do not need to compare all the elements of the original string with those of the target string; we only search along the bottom right of the quasi diagonal [15]. The details of the search can be found in the definition of col in the pseudocode above. Both the space and the time complexity of this algorithm are far less than O(mn).

4. Example Analysis

Assume that the word "法轮功" is a keyword in the list of filtering keywords, and that we use the algorithm designed in this paper to inspect the received mail subject 法#轮%功法*律. The processing steps are as follows. First, the Chinese matching relation matrix is built, as shown in Figure 1.

Figure 1. Matching relation matrix (1) (row labels: 法 # 轮 # 功 法 % 律)

When searching along the bottom right of the quasi diagonal in matching relation matrix 1, the entry at the row labeled 1 and the column labeled 1 is 1, so we first get the string α = 法. There is no relevant match in the second line, so we go to the third line and obtain α = 法轮; continuing in the same way, we get α = 法轮功. When the program reaches the seventh line, the first column is 1, but this node does not follow the last node of the string α, so a new string β is created. At this point we cannot determine whether the length of β will equal that of the keyword, so the search continues with β = 法. When the whole search procedure has finished, we obtain the final string α [8]. We can then use the matching accuracy to determine whether the keyword is present and the mail is spam, and carry out the corresponding treatment [16].

To illustrate the effectiveness of the proposed algorithm, the character matching process via the relation matrix is shown below [17]:

Figure 2. Matching relation matrix (2)

First of all, we get a temporary string φ = B (the first node on φ). To its right we look for a road. Row 4 contains two 1s; according to the rules of the algorithm we select the upper one, that is, the 1 in the first row instead of the 1 in the second row. Meanwhile, the first row also contains two 1s; by Theorem 3, we select the left one, i.e. the one in the fourth column. In this case, φ = BO. However, when the algorithm reaches the fourth row, φ = BOOK, and the K in the sixth column of the third line lies to the left of the last node K of the road φ, so a new road Ψ must be created, because we cannot yet determine whether φ or Ψ will be longer in the future. When the algorithm reaches the sixth row, φ = BOOK, Ψ = NEW and |φ| = |Ψ| = 3; we chain the S onto the road φ and obtain the longest lower-right road φ = BOOKS, with |φ| = 5. Thus, it is possible to calculate the degree of matching of the two strings [18].
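The first walk-through above (keyword 法轮功 against the subject 法#轮%功法*律) can be reproduced with a short sketch of the quasi-diagonal scan and the matching accuracy η. This is one possible reading of pseudocode (4), not the authors' implementation: it is a greedy left-to-right scan that computes each matrix entry on the fly instead of storing the matrix, and the function names are hypothetical.

```python
def quasi_diagonal_scan(keyword, target):
    """Greedy walk toward the lower right of the matching relation matrix.

    Returns the length of every chain found: a completed chain (alpha) has
    the keyword length, a trailing partial chain (beta) is shorter.
    Assumption: a finished chain restarts the keyword position at 0, as the
    reset of k in pseudocode (4) suggests.
    """
    m = len(keyword)
    if m == 0:
        return []
    lengths = []
    k = 0  # next keyword position to match
    for ch in target:  # the target position only ever moves forward
        if ch == keyword[k]:  # matrix entry is 1, computed on the fly
            k += 1
            if k == m:  # whole keyword matched along one quasi diagonal
                lengths.append(m)
                k = 0
    if k > 0:
        lengths.append(k)  # unfinished chain left at the end of the target
    return lengths


def matching_accuracy(keyword, target):
    """eta = (length of the longest quasi diagonal) / (keyword length)."""
    if not keyword:
        raise ValueError("keyword must not be empty")
    return max(quasi_diagonal_scan(keyword, target), default=0) / len(keyword)


print(quasi_diagonal_scan("法轮功", "法#轮%功法*律"))
print(matching_accuracy("法轮功", "法#轮%功法*律"))
print(matching_accuracy("法轮功", "法*律"))
```

The scan visits each target character once and keeps only a counter, so under these assumptions it needs O(n) time and constant extra space. Note that this sketch skips any non-matching character, not only inserted symbols, so a practical filter would presumably restrict which characters may be skipped.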

5. Conclusions

In this paper, we proposed a keyword filtering method for spam that is related to the approximate matching of Chinese characters. The basic idea of our design is based on the concept of edit-distance approximate string matching. The edit distance is defined as the minimum number of edits needed to transform one string into another. By calculating the edit distance matrix, we can obtain the best match. By simulating the search in parallel, the running time of the classical algorithm can be reduced; this method is especially good for short strings [19]. In the bit comparison process, we apply the parallel principle to the matching function: a number of different values are packed into one computer word of length w, and these words can be processed in a single operation, where traditional methods need several operations to complete the same function. Judging whether a location in the text string cannot match the pattern string may be easier than judging whether it matches. If the filtering algorithm cannot rule out a region, it is combined with a non-filtering text search algorithm, and we can still ultimately achieve fast string matching [20].

This paper presents a set of mail keyword filtering methods based on finding the maximal independent set. We design and implement an adapted algorithm which can effectively solve problems including the splitting and combining of keywords and the insertion of special symbols. We also design a behavior recognition technology, which can detect and reject an e-mail while it is being received. The experimental results show that both the space and the time complexity are far less than O(mn), and the efficiency is also satisfactory. However, if spammers change the way they produce spam and keep sending it, we will always be in a passive position. In the future, we will further study anti-spam technology in order to move from passive reaction to active detection of spam.
Acknowledgments

This research was supported by the Scientific Research Project of Higher Education of Inner Mongolia Autonomous Region, China (NJZY1352).

References

[1] Z. Li and H. Shen, SOAP: A Social network Aided Personalized and effective spam filter to clean your e-mail box, INFOCOM, 2011 Proceedings IEEE, (2011).
[2] M. T. B. Aun, B. -M. Goi and V. T. H. Kim, Cloud enabled spam filtering services: Challenges and opportunities, Sustainable Utilization and Development in Engineering and Technology (STUDENT), 2011 IEEE Conference on, (2011).
[3] Q. Luo, B. Liu, J. Yan and Z. He, Design and Implement a Rule-Based Spam Filtering System Using Neural Network, Computational and Information Sciences (ICCIS), 2011 Int'l Conference on, (2011).
[4] P. Graham, Better Bayesian Filtering, (2003) January.
[5] C. Dwork and M. Naor, Pricing via processing or combatting junk mail, in Proceedings of the 12th Annual International Cryptology Conference on Advances in Cryptology, Springer-Verlag, (1993).
[6] A. Li and H. Liu, Utilizing improved Bayesian algorithm to identify blog comment spam, Robotics and Applications (ISRA), 2012 IEEE Symposium, (2012).
[7] A. C. Yao, The Complexity of Pattern Matching for a Random String, SIAM Journal on Computing, vol. 8, no. 3, (1979).
[8] J. A. Bondy and U. S. R. Murty, Graph Theory with Applications, The Macmillan Press Ltd, London and Basingstoke, (1976).
[9] S. Ghemawat, H. Gobioff and S. Leung, The Google File System, SIGOPS Oper. Syst. Rev., vol. 37, no. 5, (2003).
[10] R. Drewes, An artificial neural network spam classifier, (2002) August.
[11] H. Yuan and D. Wang, The New Approach of Marking Activity-Loops Based on the String Reachable Matrix, Communications and Mobile Computing, 2009, CMC '09, WRI International Conference, (2009).
[12] E. Irshad, W. Noshairwan, M. Shafiq, S. Khurram, A. Irshad and M. Usman, Performance Evaluation Analysis of Group Mobility in Mobile Ad hoc Networks, International Journal of Future Generation Communication and Networking (IJFGCN), Syst. Rev., vol. 3, no. 3, (2010).
[13] J. -C. Lin and T. C. Huang, An efficient fault-containing self-stabilizing algorithm for finding a maximal independent set, Parallel and Distributed Systems, IEEE Transactions, (2003).
[14] G. Vesztergombi, G. Odor, F. Rohrbach and G. Varga, Scalable matrix multiplication algorithm for IRAM architecture machine, Parallel and Distributed Processing, 1998, PDP '98, Proceedings of the Sixth Euromicro Workshop, (1998).
[15] C. -H. Lin, J. -C. Liu, C. -T. Kuo, M. -C. Chou and T. -C. Yang, Safeguard Intranet Using Embedded and Distributed Firewall System, International Journal of Future Generation Communication and Networking (IJFGCN), Syst. Rev., vol. 2, no. 1, (2009).
[16] J. Xie, S. Yin, X. Ruan, Z. Ding, Y. Tian, J. Majors, A. Manzanares and X. Qin, Improving MapReduce performance through data placement in heterogeneous Hadoop clusters, in: Parallel Distributed Processing, Workshops and Phd Forum (IPDPSW), 2010 IEEE International Symposium on, (2010) April.
[17] S. -C. Kim and J. -M. Chung, Message Complexity Analysis of Mobile Ad Hoc Network Address Autoconfiguration Protocols, Mobile Computing, IEEE Transactions, (2008).
[18] S. V. Viraktamath and G. V. Attimarad, Impact of Quantization Matrix on the Performance of JPEG, International Journal of Future Generation Communication and Networking (IJFGCN), Syst. Rev., vol. 4, no. 3, (2011).
[19] M. Lacoste, Architecting Adaptable Security Infrastructures for Pervasive Networks through Components, International Journal of Future Generation Communication and Networking (IJFGCN), Syst. Rev., vol. 3, no. 1, (2010).
[20] F. Marozzo, D. Talia and P. Trunfio, A peer-to-peer framework for supporting MapReduce applications in dynamic cloud environments, in Nick Antonopoulos and Lee Gillam, editors, Cloud Computing, Computer Communications and Networks, Springer London, (2010).

Authors

Hailong Wang received the BS in computer science from North Jiaotong University, China, in 1998, and the MS in computer science from Lanzhou Jiaotong University, China, in 2007. Currently, he is an assistant professor in the Computer & Information Engineering College at Inner Mongolia Normal University, China. His research interests include embedded systems and multi-core processors, as well as fault tolerance and real-time databases.

Jiong Xie received the BS and MS degrees in computer science from BUAA (Beijing University of Aeronautics and Astronautics), China, in 2004 and 2008. He is currently working toward the PhD degree at the Department of Computer Science and Software Engineering, Auburn University. His research interests include scheduling techniques and parallel algorithms for clusters, as well as multi-core processors and software techniques for I/O-intensive applications.
