Online Spam Filter for Duplicate or Near Duplicate Message Content Detection Scheme

1 Rahul Verma, 2 Joydip Dhar
ABV-Indian Institute of Information Technology and Management, Gwalior-474015, India
1, E-mail: rahulvermaiiitm@gmail.com *2, E-mail: jdhar@iiitm.ac.in

Abstract
Electronic mail (e-mail) spam is a widespread problem today for both users and service providers. Spammers not only send large volumes of email but also use spamming techniques to obtain users' confidential information, and they craft messages that push users into reading them. In this work we propose an online spam filter scheme for detecting duplicate or near-duplicate message content. The scheme uses a strong signature technique to detect duplicate messages and reports a matching percentage, which helps the user judge whether an email is a slightly modified copy of earlier spam. In addition, the scheme uses an efficient and fast hashing technique and can easily be deployed in any service provider's mailing system to detect near-duplicate spam emails before large numbers of them are delivered to recipients. It does not violate email policies such as privacy or secrecy, because it does not store any email message content.

Keywords: Spam filter, bloom filter, block partition, similarity measure, false positive, false negative, white-list, black-list.

1. Introduction
Emails that are unwanted by a particular receiver are called spam, although the same email may be useful to another receiver; what counts as spam depends on the user's needs. Spammers try to draw users' attention to their emails: sometimes to advertise products, sometimes to steal personal information, and sometimes to harass recipients by spreading malicious links, adware, fake attachments, vulgarity or pornography. Ultimately these emails distract the receiver. About 100 billion emails are sent per day from different email addresses around the world.
In 2010 an estimated 88% of this worldwide traffic was spam [1]. Email service providers use different spam detection techniques, such as content-based filtering, various forms of whitelisting and blacklisting, distributed honeypots, and collaborative data-sharing techniques, to tackle spam based on its senders [2-7]. There are many spam detection and filtering techniques, but none of them is a permanent and effective solution, because spammers keep adopting newer techniques for sending unwanted email, such as (i) slight modification of the message format, (ii) message-volume-based attacks, and (iii) sending messages from different sender addresses [8, 17]. Spam is sent in bulk without the receiver's permission. These emails are annoying: if we return to the internet after a week or a month, our mailbox may be full of spam, and the user struggles to judge which messages are really important. Most spam messages share the same format and come from the same sender, so we have to detect repetitive near-duplicate messages. As email use continues to grow, the spam problem calls for new methods and algorithms. Motivated by this, we propose an efficient scheme that can effectively identify the repeated transfer of slightly modified or duplicate emails, which is useful for spam identification.

Most spammers use different user names to send spam from the same or different domain names, for example job portals, matrimonial sites, lottery winnings, calls for papers and advertising agencies. They send nearly the same message to different users. Often they send spam with small changes to the message, such as reordering paragraphs, changing some words, or inserting or deleting one or more lines, while the theme of the message stays almost the same [16].
Based on these observations, our proposed technique audits email servers and detects when a high number of unsolicited emails with duplicate or nearly similar content is being transferred.
Journal of Convergence Information Technology (JCIT), Volume 9, Number 4, July 2014
An email server uses two
types of protocols: (i) the Simple Mail Transfer Protocol (SMTP) for sending and (ii) the Post Office Protocol (POP) for retrieving email from the domain. The email server therefore works in a store-and-forward fashion, which makes it possible to audit all message transmissions passing through the server. In the proposed architecture we use a hash-based algorithm that works as an online detection scheme: it detects whether the contents of transferred messages are duplicates or near duplicates. Two messages can be treated as the same if their contents match with high probability; if we break similar messages into small sets of blocks, then the blocks should also match in sequence with high probability.

Keyword and content-based spam filters do not solve the problems of limited network bandwidth and storage, because all emails are processed at the recipient side. Machine learning systems are limited by training: until the system is well trained it will not work like an expert system, and no system can be trained into a complete expert system [9, 15, 16]. IP- and URL-based systems suffer from fake and hacked accounts: as soon as spammers come under suspicion, they change their email address and create a new email account. Challenge/response and content-based systems cost time, money and resources during verification. Blacklisting servers are mostly run by non-profit organizations, so after some time they may stop updating their blacklist databases. None of these spam detection and filtering techniques solves the network bandwidth and storage limitations. The polite sender spam filter [10] also has shortcomings: it has no method to check for malicious URL links, viruses or worms, or to detect duplicate or near-duplicate email messages when deciding whether a received email is spam, even though most spam emails have the same message content and the same format.

2.
Problem formulation
An email message consists of a string S, which we break into small substrings of successive characters. Two email messages are the same if their successive chunks, or blocks, are the same. So we first break a message string into small blocks $M = \{b_1, b_2, \ldots, b_m\}$, where $b_1, b_2, \ldots, b_m$ are successive blocks of the string. To measure the similarity s between two given messages $M_1$ and $M_2$ we use the Jaccard similarity metric

$$s(M_1, M_2) = \frac{|M_1 \cap M_2|}{|M_1 \cup M_2|} \qquad (1)$$

Using this metric, we say that two messages $M_1$ and $M_2$ are similar or partially similar ($M_1 \sim M_2$) if $s(M_1, M_2) \geq s_0$, where $s_0$ is the similarity threshold.

The value of the similarity metric depends on how the message is chunked into blocks, so breaking the message string into small successive blocks is the challenging part. Fixed-size (static) chunking is problematic: if a single character is inserted, every subsequent block boundary shifts, and little or no similarity is found between the messages. To avoid this we can use a rolling hash computation: if the hash value matches a certain predefined pattern, we declare a chunk boundary at that position. The rolling hash was devised to do this computation efficiently. It uses a sliding window that scans over the data bytes and provides a hash value at each point. The hash value at position i can be cheaply computed from the hash at position i-1. In other words, $H(X_{i,n}) = f\big(H(X_{i-1,n}),\, x_i,\, x_{i-n}\big)$, where n is the window size, $X_{i,n}$ represents the window contents at byte position i, $x_i$ is the byte entering the window and $x_{i-n}$ is the byte leaving it. In mathematical terms this is a recurrence relation. Rolling hashes have been used in contexts such as Rabin-Karp substring search and rsync. Today they are used extensively for chunk splitting in data deduplication.

Consider i email messages $M_1, M_2, \ldots, M_i$ received at the email server.
For the $(i+1)$-th received message $M_{i+1}$, we have to determine how many of the already received messages are similar to $M_{i+1}$. With a store-and-process approach this is easy to solve. The technique used in this paper, however, is efficient in both space and computation, especially when the number of messages is very large. We use a counting bloom filter to solve this problem, as summarized in the next section.
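To make the formulation concrete, the following is a minimal Python sketch of content-defined chunking with a polynomial rolling hash, followed by the Jaccard similarity of Eq. (1) on the resulting block sets. The window size, base, modulus and boundary mask are illustrative assumptions, not values taken from this paper, and the function names `chunk` and `jaccard` are hypothetical.

```python
from typing import List, Set

WINDOW = 8             # sliding-window size n in bytes (assumed value)
BASE = 257             # polynomial base (assumed value)
MOD = (1 << 31) - 1    # modulus for the hash (assumed value)
MASK = 0x0F            # boundary when the low 4 bits of the hash are zero (assumed)


def chunk(data: bytes) -> List[bytes]:
    """Split data wherever the rolling hash of the last WINDOW bytes matches the pattern."""
    top = pow(BASE, WINDOW - 1, MOD)  # weight of the byte that leaves the window
    chunks: List[bytes] = []
    start, h = 0, 0
    for i, byte in enumerate(data):
        if i >= WINDOW:
            # Remove the contribution of the outgoing byte x_{i-n}.
            h = (h - data[i - WINDOW] * top) % MOD
        # Shift the window and add the incoming byte x_i.
        h = (h * BASE + byte) % MOD
        # Only test for a boundary once the window is full.
        if i + 1 >= WINDOW and (h & MASK) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing bytes form the last block
    return chunks


def jaccard(m1: Set[bytes], m2: Set[bytes]) -> float:
    """Jaccard similarity |M1 & M2| / |M1 | M2| of two block sets."""
    if not m1 and not m2:
        return 1.0  # two empty messages are treated as identical
    return len(m1 & m2) / len(m1 | m2)


original = b"Congratulations! You have won a lottery prize. Click the link to claim it today."
modified = b"Hi, " + original  # a few bytes inserted at the front
similarity = jaccard(set(chunk(original)), set(chunk(modified)))
print(f"Jaccard similarity after insertion: {similarity:.2f}")
```

Because a boundary depends only on the bytes inside the window, every block of the original message after its first boundary reappears unchanged in the modified message; only the blocks near the inserted bytes differ.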
3. Detection prototype
This research focuses on a spam detection technique. We want to estimate the percentage of duplicate content in a received email in order to decide whether it is spam and, if so, with what probability. For this purpose we use a counting bloom filter. Our proposed solution is described in this section.

A. Counting bloom filter
The counting bloom filter used here differs somewhat from the standard bloom filter: it is likewise a space-saving probabilistic data structure, but we use it with time slots $T_1, T_2, \ldots, T_j$. We use the counting bloom filter to check a given set of message strings against the previously stored bloom filter bit array [9][12]. In a standard bloom filter all initial values are set to zero, and K hash functions are applied to each of the messages 1, 2, ..., m, as shown in Fig. 1.

Figure 1. Structure of a standard bloom filter

Initially every index of the bloom filter holds zero. When a new string is inserted into the bloom filter, it is first hashed with all K hash functions, and the bits at the resulting positions are changed from 0 to 1. If a bit is already 1, a collision has occurred and the value is left unchanged. When a message string is checked for membership in the set, it is first hashed with the same K hash functions; this check takes constant time. If all the corresponding bits in the bloom filter are set to 1, we can say that the string matches previously stored message content with some probability. Such a match may be wrong; the probability of this error is the false positive rate, which arises from collisions during insertion. If any of the bits is zero, the given string does not match any previously stored message string, so there are no false negatives.
With a zero false negative rate, the bloom filter is very efficient provided the false positive probability is kept very small. Our scheme departs from the standard bloom filter in that we use a predefined time period for received similar messages, which is why we call it a counting bloom filter. Each time a new message is received, previously inserted old messages are erased from the bloom filter to minimize collisions and the false positive rate. At most 50 messages remain in the bloom filter at any one time, and the index range is set to 0 to 50000.

B. Implementation
Spam messages have some typical characteristics: they are mostly machine generated, are received within a short period of time, share the same format, and have nearly similar message contents. The implementation scheme is shown as a process flow diagram in Fig. 2. Suppose N messages are sent to the receiving end; we first split each message into blocks of lines $L_i$. Our online spam filter scheme works as follows:

Initialize the bloom filter:
First, each message M is split into blocks $M_1, M_2, \ldots, M_i$. Each message block $M_i$ is then hashed with the K hash functions and inserted into the bloom filter $B_1, B_2, \ldots, B_i$ corresponding to the time slot $T_1, T_2, \ldots, T_j$. Initially all the values in the bloom filter are set to zero. When a value is inserted into the bloom filter, the corresponding bit is flipped to 1; if it is already set to 1, a collision occurs and the bit value is not changed. Note that here we use $i \gg j$.

Figure 2. Process flow diagram

Insertion process: Upon receiving a new mail at the receiver side, the scheme first follows the same steps as above and then calculates the hash values of each block under the K hash functions. Once all the bits in the bloom filter at the calculated hash positions are 1, we can say that the inserted message block $M_j$ is stored in the bloom filter. Fig. 3 illustrates how a particular email is stored.

Figure 3. Insertion process into the bloom filter
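The initialization and insertion steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the text does not specify the K hash functions, so the sketch derives them by salting SHA-256 with the function number; it uses a plain bit array without time slots; and the class and function names are hypothetical. The sizes m = 50000 and k = 10 follow the values given in this paper. The helper `false_positive_rate` evaluates the standard estimate $(1 - e^{-kn/m})^k$ for n inserted blocks.

```python
import hashlib
import math
from typing import Iterator


class BloomFilter:
    """Bit-array bloom filter with k salted hash functions (illustrative sketch)."""

    def __init__(self, m: int = 50000, k: int = 10) -> None:
        self.m = m                  # number of index positions
        self.k = k                  # number of hash functions
        self.bits = bytearray(m)    # one byte per position, all zero initially

    def _positions(self, block: str) -> Iterator[int]:
        # Assumed construction: hash function j is SHA-256 salted with j.
        for j in range(self.k):
            digest = hashlib.sha256(f"{j}:{block}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def insert(self, block: str) -> None:
        for pos in self._positions(block):
            self.bits[pos] = 1      # a bit that is already 1 stays 1 (collision)

    def contains(self, block: str) -> bool:
        # True only if every one of the k positions is set.
        return all(self.bits[pos] for pos in self._positions(block))


def false_positive_rate(m: int, k: int, n: int) -> float:
    """Standard estimate of the false positive rate after inserting n blocks."""
    return (1.0 - math.exp(-k * n / m)) ** k


bf = BloomFilter()
bf.insert("You have won a lottery prize")
print(bf.contains("You have won a lottery prize"))   # inserted blocks are always found
print(false_positive_rate(50000, 10, 500))           # 500 blocks is an assumed load
```

Only bit positions are kept, never the message text, which is consistent with the claim that the scheme does not store email content.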
Verification process: In the verification process we first calculate the hash values of the given string under all K hash functions and then check the corresponding bits in the stored bloom filter. If all the corresponding bits are 1, we can say that the given string is part of the stored message content (see Fig. 4). When the time period $T_j$ expires, the scheme creates a fresh bloom filter $B_{j+1}$ with a fresh time slot $T_{j+1}$ and discards all the old bloom filters and time slots.

Figure 4. Verification of the given string

C. Block partition
Block partition, that is, splitting an email message into several chunks, looks trivial but raises several problems. The simple block partition method cannot capture the repetition of message blocks because of its fixed sequence of blocks. If the message is divided into fixed-length blocks, a problem arises when a new element is inserted: the new element shifts the sequence of blocks and no duplicate data is found, as shown in Fig. 5.

Figure 5. Static Chunking [13]

In variable-block or dynamic chunking, the block boundaries are tied to the content rather than to fixed offsets, so inserting elements has no effect on the remaining blocks, and all duplicate blocks are detected during the scanning process, as shown in Fig. 6.
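The difference between the two chunking styles can be reproduced with a small Python sketch. It is only an illustration: splitting at a delimiter character stands in here for content-defined boundaries in general, and the function names and the sample text are assumptions.

```python
from typing import List


def static_chunks(text: str, size: int) -> List[str]:
    """Cut text into consecutive blocks of `size` characters (the last may be shorter)."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def delimiter_chunks(text: str, delimiter: str = " ") -> List[str]:
    """Cut text wherever the delimiter occurs, so boundaries follow the content."""
    return [part for part in text.split(delimiter) if part]


original = "win a free prize now"
modified = "X" + original  # one character inserted at the front

# Fixed-size blocks: every boundary shifts by one, so no block survives.
shared_static = set(static_chunks(original, 4)) & set(static_chunks(modified, 4))
# Content-defined blocks: only the block containing the insertion changes.
shared_dynamic = set(delimiter_chunks(original)) & set(delimiter_chunks(modified))

print(len(shared_static), len(shared_dynamic))  # prints: 0 4
```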
Figure 6. Dynamic Chunking [13]

After looking at various chunking techniques we considered the n-gram hashing technique [13]. The sliding-window rolling hash scans the data bytes and provides a hash value at each point. It is a complex and time-consuming process, as shown in Fig. 7.

Figure 7. Rolling hash [13]

Because of these problems we chose to break message contents into blocks of lines $L_i$.

D. Experiments
We set the similarity threshold $s_0$ to 66%: if a message has only three lines and two of them are detected as matching, the matching percentage is P = 2/3, which is about 66.7% and therefore above 66%, so we can say that the message may be spam. We use k = 10 hash functions in our scheme, and the bloom filter size should be larger than 50000 index values for detecting 10 similar messages. No block contains more than 100 words; if a line has more than 100 words, the excess goes into the next block. If the matching percentage P crosses the similarity threshold $s_0$, the scheme generates an alert so that the receiver can judge how probable it is that the new mail is spam, as shown in Fig. 8.
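The line-based partition and the threshold rule described above can be sketched in Python as follows. This is a simplified illustration: a plain Python set stands in for the bloom filter lookup, whitespace inside a line is normalized, and the function names are hypothetical. The 100-word limit and the 66% threshold are the values stated in the text.

```python
from typing import List, Set

MAX_WORDS = 100     # maximum number of words per block, as stated in the text
THRESHOLD = 66.0    # similarity threshold s0 in percent, as stated in the text


def line_blocks(message: str, max_words: int = MAX_WORDS) -> List[str]:
    """Split a message into line blocks; lines over max_words spill into further blocks."""
    blocks: List[str] = []
    for line in message.splitlines():
        words = line.split()  # blank lines produce no block
        for start in range(0, len(words), max_words):
            blocks.append(" ".join(words[start:start + max_words]))
    return blocks


def matching_percentage(message: str, seen: Set[str]) -> float:
    """Percentage of the message's blocks that were already seen."""
    blocks = line_blocks(message)
    if not blocks:
        return 0.0
    matched = sum(1 for block in blocks if block in seen)
    return 100.0 * matched / len(blocks)


def is_suspect(message: str, seen: Set[str], threshold: float = THRESHOLD) -> bool:
    """Raise an alert when the matching percentage crosses the threshold."""
    return matching_percentage(message, seen) > threshold


# Blocks of an earlier message (a set stands in for the bloom filter here).
seen_blocks = set(line_blocks("Dear user\nYou won a prize\nClick here"))
new_mail = "Dear friend\nYou won a prize\nClick here"

# Two of the three lines match, i.e. about 66.7%, which is above the 66% threshold.
print(matching_percentage(new_mail, seen_blocks), is_suspect(new_mail, seen_blocks))
```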
Figure 8. Generated alert during the verification process.

4. Conclusion
This paper presents a fast and efficient online spam detection technique for identifying duplicate or near-duplicate email message content over the Internet. Our proposed method uses a standard bloom filter to detect occurrences of message content in a very efficient and space-saving way. Our prototype test cases show that we can verify 100% duplicate content to judge which mail is spam. In our scheme the bloom filter size should be larger than 50000 index values for detecting 10 similar messages. No block contains more than 100 words; if a line has more than 100 words, the excess goes into the next block. The proposed scheme achieved high performance and accuracy in comparison with available spam detection techniques.

5. Acknowledgement
This work is part of a master's thesis supported by the Atal Bihari Vajpayee Indian Institute of Information Technology and Management, Gwalior, India.

6. References
[1] B. Plumer, "The economics of internet spam", http://www.washingtonpost.com/blogs/wonkblog/wp/2012/08/10/the-economics-of-spam/, 2012.
[2] Ortega, F. Javier, et al. "Combining textual content and hyperlinks in web spam detection." Natural Language Processing and Information Systems. Springer Berlin Heidelberg, pp.266-269, 2011.
[3] Johnson, Peter C., et al. "Nymble: Anonymous IP-address blocking." Privacy Enhancing Technologies. Springer Berlin Heidelberg, 2007.
[4] Zhuge, Jianwei, et al. "Collecting autonomous spreading malware using high-interaction honeypots." Information and Communications Security. Springer Berlin Heidelberg, pp.438-451, 2007.
[5] Das, Swagata. Instant Messaging Spam Detection in Long Term Evolution Networks. Diss. Concordia University, 2013.
[6] Özgür, Levent, Tunga Güngör, and Fikret Gürgen. "Spam mail detection using artificial neural network and Bayesian filter." Intelligent Data Engineering and Automated Learning - IDEAL 2004. Springer Berlin Heidelberg, pp.505-510, 2004.
[7] Almeida, Tiago A., and Akebo Yamakami. "Content-based spam filtering." Neural Networks (IJCNN), The 2010 International Joint Conference on. IEEE, 2010.
[8] Schryen, Guido. "The impact that placing email addresses on the Internet has on the receipt of spam: An empirical analysis." Computers & Security 26.5, pp.361-372, 2007.
[9] Panigrahi, Prabin Kumar. "A Comparative Study of Supervised Machine Learning Techniques for Spam E-mail Filtering." Computational Intelligence and Communication Networks (CICN), Fourth International Conference on. IEEE, 2012.
[10] Vorakulpipat, Chalee, Vasaka Visoottiviseth, and Siwaruk Siwamogsatham. "Polite sender: A resource-saving spam email countermeasure based on sender responsibilities and recipient justifications." Computers & Security 31.3, pp.286-298, 2012.
[11] Gu, Jianhua, and Xingshe Zhou. "A Dynamic Structure of Counting Bloom Filter." Proceedings of the 2011 2nd International Congress on Computer Applications and Computational Science. Springer Berlin Heidelberg, 2012.
[12] Coskun, Baris, and Paul Giura. "Mitigating SMS spam by online detection of repetitive near-duplicate messages." Communications (ICC), 2012 IEEE International Conference on. IEEE, 2012.
[13] M. Ramblings, https://moinakg.wordpress.com/tag/rolling-hash/, 2013.
[14] Kanaris, Ioannis, Konstantinos Kanaris, and Efstathios Stamatatos. "Spam detection using character n-grams." Advances in Artificial Intelligence. Springer Berlin Heidelberg, pp.95-104, 2006.
[15] Hsu, Wei-Chih, and Tsan-Ying Yu. "E-mail Spam Filtering Based on Support Vector Machines with Taguchi Method for Parameter Selection." Journal of Convergence Information Technology 5.8 (2010).
[16] Wang, Jinlong, Ke Gao, and Huy Quan Vu. "SpamCooling: a parallel heterogeneous ensemble spam filtering system based on active learning techniques." Journal of Convergence Information Technology 5.4 (2010).
[17] Zhao, Zheng-dong, et al. "A fuzzy adaptive multi-population parallel genetic algorithm for spam filtering." Journal of Convergence Information Technology 6.2 (2011).