On the Efficiency of Collecting and Reducing Spam Samples

Pin-Ren Chiou, Po-Ching Lin
Department of Computer Science and Information Engineering
National Chung Cheng University
Chiayi, Taiwan

Abstract

Collecting spam samples from the Internet is useful for observing the campaigns of spamming botnets and testing spam-filtering products. The common methods of spam collection from the Internet include setting up trap addresses, operating a spam-filtering mail gateway and deploying an open relay sinkhole. In this work, we empirically evaluate the three methods with respect to their efficiency of collection and the variety of the collected spam samples. We find that an open relay sinkhole can collect the largest number of spam samples among the three methods, but its samples are likely to be duplicate or highly similar. We therefore design a novel two-level cache mechanism, which can efficiently reduce nearly 99% of the spam samples sent to the sinkhole, and greatly reduce the storage space and the volume of spam samples needed for further analysis.

1 Introduction

Spam filtering is a common practice on virtually all mail services, but spamming techniques have also been evolving to evade the filtering. Thus, the developers of spam-filtering techniques evaluate or compare various techniques with one or several corpora of mail samples [1, 2]. The evaluation results give the developers a good clue for improving the filtering accuracy based on real-world samples. Anti-spam researchers can also study the latest spam campaigns from the samples. Several corpora of mail samples are publicly available, e.g., on the website of Cybersecurity Data Mining Competition ( index.php/data.html). The existing corpora, however, are outdated, and cannot reflect the latest spam campaigns. Considering the fast evolution of spamming techniques, an efficient method for collecting ongoing spam messages is required.
In this work, we empirically evaluate three common methods for spam collection (i.e., trap addresses, a spam-filtering mail gateway and an open relay sinkhole) with respect to their collection efficiency and the variety of the spam samples. Note that collecting normal mail samples depends heavily on the willingness of contributors, and is outside the scope of this work. We find that an open relay sinkhole can collect the largest number of spam samples among the three methods, but the samples in the collection are likely to be duplicate or highly similar. The excessive number of samples wastes storage space and analysis time. We therefore design a novel two-level cache mechanism to efficiently reduce duplicate or highly similar spam samples. The first level comprises a novel structure of hash tables that efficiently identifies and filters out duplicate or highly similar spam samples arriving in a burst, while the second reduces more obfuscated samples based on features derived by parsing the samples. The two levels together can reduce nearly 99% of the collected spam samples, which are found to be duplicate or highly similar. The reduction saves the storage space and the volume of spam samples for further analysis.

The remainder of this paper is organized as follows. Section 2 reviews the methods of spam collection and the techniques to identify similar documents. Section 3 presents the deployment of the three methods of spam collection and the design of the two-level cache mechanism. Section 4 presents the collection results and the efficiency of reducing spam samples. Section 5 concludes this work.

2 Related Work

A corpus of spam samples can be collected in multiple ways: (1) A set of trap addresses, known as spamtraps, can be exposed to spammers to lure their spam messages. A well-known example of this approach is Project Honey Pot (www.projecthoneypot.org), which distributes a large number of trap addresses to the websites of volunteers. (2) An open relay sinkhole, which is a mail transfer agent (MTA) that allows forwarding from any client to any destination, can be deployed for spammers to relay spam messages [3]. (3) A mail gateway equipped with spam-filtering functions can offer the spam messages it detects. (4) A spamming bot program can be executed in a controlled environment to deliver spam based on the bot master's instructions. The spam messages can be collected as samples.

A number of techniques have been developed to identify similar documents such as web pages, files and mail messages [4]. Two common techniques are Broder's shingling algorithm [5] and Charikar's simhash [6], which generate a fingerprint to represent a document, and then compare the similarity between the fingerprints. Henzinger compared the two algorithms in detail [7]. Prior studies [8] and tools (e.g., spamsum, junkcode/spamsum) assumed that spam messages in the same campaign are likely to be similar, and used fingerprinting for spam detection.

Despite the existing techniques, a great challenge for reducing spam samples is to identify similar samples in a huge corpus on the order of millions of spam samples or more, while new spam messages keep arriving. Thus, an efficient online method is required to (1) identify whether a newly arriving spam message is highly similar to an existing one in a huge corpus, and (2) dynamically update the corpus. While the work in [4] can meet the former requirement, the latter is essential because the arrival rate of spam messages can be high on an open relay sinkhole [3].

3 Methods of Spam Collection and Reduction

We evaluate three spam collection methods in this work: trap addresses, a spam-filtering mail gateway and an open relay sinkhole. Spam collection from bots is left to future work because our current resources are insufficient.

3.1 Spam Collection

The three methods of spam collection in the comparison are described as follows.

Trap Addresses

Spammers can collect e-mail addresses as spamming targets by crawling public web pages for address-like patterns using harvesting tools (e.g., Extractor from extractorpro.com).
Since the trap addresses are not created for normal use, the mail delivered to them is presumed to be spam. Consequently, we first applied for two sub-domains under the domain name of our campus, i.e., ccu.edu.tw, and then created fake addresses in the sub-domains by following the naming conventions of the campus to make these addresses look real. We embedded the trap addresses on two more websites with high PageRanks of 3 and 5, besides ours, so that they would be harvested in a short time, under the assumption that addresses are exposed rapidly on a website with a high PageRank.

Spam-filtering Mail Gateway

We implemented a spam-filtering mail gateway at a senior high school. The MX record of the domain was modified to redirect the SMTP traffic to a Postfix daemon on the gateway. Figure 1 presents the system architecture. We integrated Amavisd-new and SpamAssassin with Postfix to filter incoming messages. Before the spam collection, we had tuned the spam-filtering algorithm to avoid collecting normal mail messages by minimizing the false-positive rate. We also built our own real-time blocking list (RBL) to enhance the accuracy of the spam filter on the mail gateway [9].

Figure 1: The system architecture of the spam-filtering mail gateway [9].

Open Relay Sinkhole

We built an open relay sinkhole for spam collection based on the method in [10]. Spammers used to scan the Internet for open relays, especially servers running on the SMTP port. Based on the statistics of spam activities on the Botlab website [11], we rented a virtual private server (VPS) in the U.S., which ranks second among countries in spam activities. Figure 2 presents the system architecture of the open relay sinkhole. The open relay sinkhole includes Postfix on the VPS and two multi-threaded Perl programs, pcollector and rcollector, to cope with a large number of spam deliveries in a short time. The latter two programs are described as follows:

pcollector runs on the backend server, and analyzes the mail messages forwarded from Postfix. Each newly arriving spam message is searched in its entirety for the patterns of test messages we have identified (discussed below). If it is a test message, it is sent back to rcollector to be forwarded to its destination; otherwise, the spam message is forwarded to the two-level cache, which identifies duplicate or highly similar messages so that only one copy is kept on the local storage.

rcollector runs on the VPS. It is responsible for forwarding the test mail messages passed from pcollector to their destinations.

Figure 2: The system architecture of the open relay sinkhole.

Like the observation in [3], we also found that the spammers deliver a small portion of test messages through the open relay sinkhole, and check whether the test messages are forwarded successfully. Once a test message is delayed, the spammers keep resending the message in a short time (about once every 10 minutes in an hour). If the test messages remain delayed, the spammers slow down the retrying rate and stop the spam delivery. When the open relay starts to deliver the test messages again, more new test messages are sent to the open relay for verification. The spamming traffic resumes once the open relay passes the test message checking. The subject or the body of a test message may contain the IP address of the open relay with various keywords, such as test, test123, BC, SM, testuseropen Relay, in the prefix or postfix. We used the keywords and the IP address of the open relay as the patterns to look for test messages. The test message bodies are mostly empty, and the sender addresses are also mostly similar.

3.2 Methods of Fingerprinting

The spam samples can be summarized with the fingerprints generated from a hash function to rapidly identify the similarity among them.
We use Charikar's simhash [6] to generate the fingerprint for each spam sample because a simhash fingerprint can be as short as 64 bits and still achieve good precision [4]. For each spam sample, the subject and the body are first decoded to restore the original content if they are encoded in BASE64 or the like, and the simhash function then reads the decoded sample in units of tokens. A token can be an English word, an ASCII string separated by punctuation or space, or a multi-byte character (e.g., a Chinese character). The mail header except the subject is not involved in the fingerprint calculation because it contains variable yet irrelevant information, such as recipients and processing records inserted by the MTA, which would increase undesired disparity among the samples.

To generate the f-bit fingerprint fp for each sample, the components of an f-tuple vector v are first initialized to 0 (f = 64 in this work). Each token in the spam content is hashed in sequence into a 64-bit value. For each hash value, if the i-th bit is 1, the i-th component of v is incremented by a weight (1 by default); otherwise, the i-th component of v is decremented by the weight. Finally, the i-th bit of fp is 1 if the i-th component of v is positive; otherwise, the i-th bit of fp is 0. We determine the similarity between two samples according to the Hamming distance of their fingerprints.

3.3 Two-level Cache for Spam Reduction

It is essential to efficiently judge whether a newly arriving spam message is duplicate or highly similar to an existing one in a huge spam corpus. The work in [4] formulated this issue as a Hamming distance problem: identify whether an existing fingerprint in a collection of simhash fingerprints differs from the fingerprint of a given document in at most k bits.
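The fingerprinting procedure of Section 3.2 and the k-bit test stated above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the paper does not name the 64-bit token hash, so a truncated MD5 digest is assumed here, and the tokenizer (`TOKEN_RE`) is a simplified guess at the token definition. The names `tokenize`, `hash64`, `simhash`, `hamming` and `is_near_duplicate` are hypothetical.

```python
import hashlib
import re

F = 64  # fingerprint width in bits (f = 64 in the paper)

# Assumption: a token is a run of ASCII letters/digits or one non-ASCII character.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+|[^\x00-\x7f]")


def tokenize(text):
    """Split decoded mail content into tokens."""
    return TOKEN_RE.findall(text)


def hash64(token):
    """Assumed token hash: the first 8 bytes of MD5, read as a 64-bit integer."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def simhash(text, weight=1):
    """Compute the F-bit simhash fingerprint of the given text."""
    v = [0] * F
    for token in tokenize(text):
        h = hash64(token)
        for i in range(F):
            # A 1 bit votes the component up, a 0 bit votes it down.
            if (h >> i) & 1:
                v[i] += weight
            else:
                v[i] -= weight
    fp = 0
    for i in range(F):
        if v[i] > 0:  # bit i is set only if component i is positive
            fp |= 1 << i
    return fp


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(fp_a, fp_b, k=3):
    """True if the fingerprints differ in at most k bits."""
    return hamming(fp_a, fp_b) <= k
```

A linear scan with `is_near_duplicate` over a whole corpus costs one comparison per stored fingerprint, which is what the cache structure described next is designed to avoid.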
That work presented an efficient algorithm using multiple sorted tables of fingerprints from a static set of documents, but the algorithm is inapplicable to this work because spam messages keep arriving at a fast pace (e.g., at an open relay sinkhole), so the assumption of a static set of documents does not hold.

We design a two-level cache mechanism, including an L1 cache and an L2 cache, to address the above issue. The former filters out duplicate or highly similar spam messages arriving in a burst, and the latter employs three features from the spam content to identify more similar spam messages. The details are described as follows.

L1 cache

Figure 3 illustrates the data structure of the L1 cache. For each newly arriving spam message, we calculate the simhash fingerprint fp from the concatenation of the subject and the body. Each fingerprint in the cache is divided into k + 1 segments. If an existing fingerprint in the cache differs from fp in at most k bits, then at least one of its k + 1 segments must be identical to the corresponding segment of fp, since k differing bits can fall in at most k of the segments. We set k to 3, which is a good balance between precision and recall [4]. Each fingerprint in the cache is duplicated in k + 1 hash tables, and its i-th segment is located in the prefix of the i-th hash table. The i-th segment of a new fingerprint fp is looked up in the prefix of the i-th hash table, for i = 1 ... k + 1. According to the above observation, if a fingerprint in the cache differs from fp in at most k bits, one of the lookups will result in a hit. That is, the i-th segment of fp is identical to that of an existing fingerprint, for some i in 1 ... k + 1. fp is then compared with the fingerprints in the hit entry to verify the similarity. If a similar fingerprint is found in the cache, the new spam message is considered duplicate or highly similar, and is dropped right away; otherwise, fp is inserted into the cache, and the new spam message is stored.

Figure 3: The L1 cache (not including the circular queues) for identifying similar simhash fingerprints.

Each hash table has 1,024 entries in the memory for efficient queries. A circular queue is maintained for each entry in the hash tables to store the fingerprints with the same hash value, and the oldest fingerprint in a circular queue is overwritten by the latest one if the queue is full. Thus, only k + 1 lookups are required for each new fingerprint, and at most c verifications are required for each hit, where c is the length of a circular queue (c = 8 in this work).

L2 cache

We deliberately restrict the size of the L1 cache and calculate a single fingerprint from the concatenation of the subject and the body, but this design trades accuracy for efficiency. First, a new fingerprint may be similar to one that was once in the L1 cache but has expired. Second, spammers often obfuscate spam messages to increase the disparity among the messages. Thus, we parse the spam messages left after the L1 cache operation to extract three features from each of them, i.e., mail subject, mail body, and URLs, and analyze the features for further filtering. The last feature is skipped if a spam message contains no URLs. The features are described as follows.

1. mail subject and mail body: Both may be encoded using the scheme proposed in RFC 2047. For example, the mail subject may be represented in a format like =?big-5?b?encoded-text?=. We decode the encoded section and normalize it to the UTF-8 character set before fingerprint calculation to ensure that the fingerprints are consistent across different encodings.

2. URLs: We search for URLs in the mail body, but skip known clean URLs such as org and schemas.microsoft.com, which are included in the spam messages because spammers compose the spam content based on the schemas defined by organizations such as W3C (see doctype.asp). If multiple URLs are present, we choose only the first as the representative URL for generating the fingerprints.

The similarity comparisons on the L2 cache are performed offline on the spam samples regularly (e.g., daily) to further reduce the volume of spam samples. The following two methods are considered for implementing the L2 cache, and their efficiency is evaluated in Section 4.2.

Method 1: The fingerprints are calculated from the aforementioned three features separately, and stored in three separate caches. If no URLs are found in a spam message, only two caches (for the mail subject and the body) are queried. The hash tables in each cache have the same number of entries as those in the L1 cache, and each cache operates like the L1 cache to identify an existing fingerprint that differs from the queried fingerprint of a specific feature in at most k bits (k = 0 for the fingerprints of URLs, and k = 3 otherwise). If more than half of the queries result in a hit, the spam message is discarded.

Method 2: Like Method 1, we build three caches to store the fingerprints of the three features, but the fingerprints do not expire in the caches. Considering the large number of fingerprints due to the volume of spam messages, we simplify the caches to save memory space by keeping the fingerprints in one hash table per cache, rather than in a complicated data structure like that in the L1 cache. The hash table uses linked lists to handle hash collisions, and can be dynamically expanded to accommodate more fingerprints and reduce the chances of hash collisions. Thus, a fingerprint has to be identical to an existing one in the hash table to encounter a hit. A spam message is discarded if more than half of the total queries result in a hit in the hash tables.
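The L1 cache lookup and insertion described in Section 3.3 can be sketched as follows, as one possible reading of the design rather than the authors' code. The paper does not state how a segment is mapped to one of the 1,024 entries, so this sketch assumes a simple modulo on the 16-bit segment value. The class name `L1Cache` and the method `seen_similar` are hypothetical, and the fingerprint width is assumed to be divisible by k + 1.

```python
from collections import deque


class L1Cache:
    """k + 1 hash tables of circular queues, indexed by fingerprint segment."""

    def __init__(self, f=64, k=3, entries=1024, c=8):
        self.k = k
        self.entries = entries
        self.seg_bits = f // (k + 1)          # 16 bits per segment for f=64, k=3
        self.seg_mask = (1 << self.seg_bits) - 1
        # tables[i][slot] is a circular queue; deque(maxlen=c) drops the
        # oldest fingerprint automatically when a full queue is appended to.
        self.tables = [
            [deque(maxlen=c) for _ in range(entries)] for _ in range(k + 1)
        ]

    def _slot(self, fp, i):
        """Entry of table i that holds fingerprints sharing fp's i-th segment."""
        segment = (fp >> (i * self.seg_bits)) & self.seg_mask
        return segment % self.entries  # assumed mapping, not from the paper

    def seen_similar(self, fp):
        """Return True if a cached fingerprint is within k bits of fp.

        On a miss, fp is inserted into all k + 1 tables and False is returned.
        """
        slots = [self._slot(fp, i) for i in range(self.k + 1)]
        for i, slot in enumerate(slots):
            for cached in self.tables[i][slot]:
                # Verification step: at most c comparisons per table entry.
                if bin(cached ^ fp).count("1") <= self.k:
                    return True
        for i, slot in enumerate(slots):
            self.tables[i][slot].append(fp)
        return False
```

A caller would drop the message when `seen_similar(fp)` returns True and store it otherwise. Because each queue keeps only the c most recent fingerprints, an older similar fingerprint may already have been overwritten, which is the expiry effect that motivates the L2 cache.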

4 Evaluation

In this section, we first compare the volume and the variety of the spam samples collected by the three collection methods, and then study the efficiency of the two-level cache mechanism.

4.1 Comparison of Collection Methods

We deployed the three methods of spam collection described in Section 3.1. The periods of the three collections were different because the methods differ in deployment complexity (e.g., requesting permissions from the authorities, purchasing equipment, configuration and implementation). According to Table 1, the average numbers of spam messages collected per day by the three methods are 9, 379 and 1,388,738, respectively. Thus, an open relay sinkhole can collect the largest number of spam messages in a short time, while the other two methods need distributed deployment on a large scale to collect a large volume of spam samples efficiently.

The pairwise Hamming distances between the fingerprints of the spam samples in each collection are calculated to evaluate the variety of the spam samples collected by the three methods. The cumulative distribution function (CDF) of the pairwise Hamming distances is presented in Figure 4. For simplicity, the sets of spam samples from trap addresses, the spam-filtering mail gateway and the open relay sinkhole are denoted as Collection A, Collection B and Collection C, respectively. Because of the huge number of samples in Collection C, we randomly selected around 200 thousand spam samples over the period of the collection to save the analysis time. According to Figure 4, nearly 75% of the pairs of fingerprints differ in at most 33 bits in Collection A and Collection B, while the difference is at most 15 bits in Collection C, meaning that the samples in Collection C are more similar to each other. Thus, the variety of the samples in Collection C is the lowest among the three methods.
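The CDF of pairwise Hamming distances used in this comparison can be computed along the following lines. This is an illustrative sketch, not the authors' analysis script; the function name `pairwise_hamming_cdf` is hypothetical, and the input is assumed to be a list of integer fingerprints.

```python
from itertools import combinations


def pairwise_hamming_cdf(fingerprints, f=64):
    """Return a list cdf where cdf[d] is the fraction of fingerprint pairs
    whose Hamming distance is at most d, for d = 0 .. f."""
    counts = [0] * (f + 1)
    for a, b in combinations(fingerprints, 2):
        counts[bin(a ^ b).count("1")] += 1
    total = sum(counts)
    if total == 0:  # fewer than two fingerprints: no pairs to compare
        return [0.0] * (f + 1)
    cdf = []
    running = 0
    for n in counts:
        running += n
        cdf.append(running / total)
    return cdf


# Three 4-bit fingerprints give the distances 1, 2 and 1.
example = pairwise_hamming_cdf([0b0000, 0b0001, 0b0011], f=4)
print(example)
```

The number of pairs grows quadratically with the number of samples (roughly 2 × 10^10 pairs for 200 thousand samples), which is consistent with the decision to sample Collection C rather than compare all of its messages.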
4.2 Analysis of Cache Efficiency

We select 7,066,226 spam samples collected in the first week from Collection C as the input dataset for evaluating the efficiency of the L1 cache. Note that we do not apply the L1 cache to Collection A and Collection B because the spam messages in the two collections do not arrive in bursts, so the mechanism would be less effective. The result indicates that 6,824,156 spam messages, which amount to 96.57% of the evaluated samples, were found similar and dropped by the L1 cache.

The spam samples left after the L1 cache, as well as those in Collection A and Collection B, were read one by one for evaluating the efficiency of the L2 cache. According to Table 2, Method 1 and Method 2 of the L2 cache can filter out more than 85% of the samples in Collection A and Collection B, and more than 60% of the evaluated samples in Collection C. The results mean that separating the features for fingerprint calculation can effectively reduce more similar yet obfuscated spam samples that cannot be identified by the L1 cache. The two caches together can reduce 98.66% of the spam samples in the evaluated dataset from Collection C (96.57% by the L1 cache, and 60.91% of the remainder by Method 1 of the L2 cache).

We also use the aforementioned samples from Collection C for evaluating the efficiency of the cache mechanism in Method 1 with different hash table sizes (i.e., the number of entries in the hash tables), besides the default size of 1,024 entries mentioned in Section 3.3. Method 2 is not involved in the evaluation because the hash tables in this method are dynamically expanded in its operation. Table 3 summarizes the numbers of hits with different hash table sizes in Method 1. The results indicate that a larger hash table size helps to detect more duplicate or highly similar samples in the L2 cache.

Table 3: The numbers of hits with different hash table sizes in Method 1.
Hash table size    Number of hits
1,024              147,445 (60.91%)
2,                 ,432 (66.28%)
4,                 ,311 (70.77%)
8,                 ,450 (72.07%)

Figure 4: The CDF of the pairwise Hamming distances between the fingerprints of the spam samples.

Table 1: Spam message count in the three methods of spam collection.

Method                         Spam message count    Period
Trap addresses                 3,                    /04/ /03/31
Spam-filtering mail gateway    138,                  /01/ /12/31
Open relay sinkhole            56,938,               /07/ /08/24

Table 2: The numbers of hits for the three collections in the L2 cache.

Event                        Count in Collection A    Count in Collection B    Count in Collection C
Number of spam samples       3,                       ,                        242,070
Number of hits [Method 1]    2,856 (87.13%)           120,213 (86.65%)         147,445 (60.91%)
Number of hits [Method 2]    2,837 (86.55%)           118,817 (85.69%)         159,393 (65.85%)

5 Conclusion and Future Work

We evaluate three common methods of collecting spam samples, and present a novel two-level cache mechanism to efficiently reduce duplicate or highly similar spam samples in this work. We find that an open relay sinkhole can collect the largest number of spam samples (up to nearly 57 million samples over a period of six weeks) among the three methods, but the variety of its samples is also the lowest. The two-level cache mechanism can reduce nearly 99% of the spam samples sent to the open relay sinkhole as duplicate or highly similar, and more than 85% in the other two collection methods. The reduction greatly saves the storage space and the volume of spam samples for further analysis. This work will be useful to those who want to collect a large corpus of spam samples for various kinds of analysis and filtering. As future work, our next step is to deploy the collection methods in more than one spot, and analyze the variety and types of the spam samples collected from different spots.

References

[1] G. V. Cormack and T. R. Lynam, On-line Supervised Spam Filter Evaluation. ACM Transactions on Information Systems, 25(3), pp. 1-31, July.
[2] L. Zhang, J. Zhu and T. Yao, An Evaluation of Statistical Spam Filtering Techniques. ACM Transactions on Asian Language Information Processing, 3(4), Dec.
[3] A. Pathak, F. Qian, Y. C. Hu, Z. M. Mao and S. Ranjan, Botnet Spam Campaigns Can Be Long Lasting: Evidence, Implications, and Analysis. In Proceedings of the 11th International Joint Conference on Measurement and Modeling of Computer Systems, Aug.
[4] G. S. Manku, A. Jain and A. D. Sarma, Detecting Near-Duplicates for Web Crawling. In Proceedings of International World Wide Web (WWW) Conference, May.
[5] A. Broder, S. Glassman, M. Manasse and G. Zweig, Syntactic Clustering of the Web. In Proceedings of International World Wide Web (WWW) Conference, Apr.
[6] M. S. Charikar, Similarity Estimation Techniques from Rounding Algorithms. In Proceedings of 34th Annual ACM Symposium on Theory of Computing, May.
[7] M. Henzinger, Finding Near-Duplicate Web Pages: a Large-Scale Evaluation of Algorithms. In Proceedings of Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Aug.
[8] A. Kolcz, A. Chowdhury and J. Alspector, Improved Robustness of Signature-based Near-Replica Detection via Lexicon Randomization. In Proceedings of ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Aug.
[9] Pin-Ren Chiou, Po-Ching Lin and Chung-Ta Li, Blocking Spam Sessions with Greylisting and Block Listing based on Client Behavior. In Proceedings of International Conference on Advanced Communication Technology (ICACT), Jan.
[10] A. Pathak, Y. C. Hu and Z. M. Mao, Peeking into Spammer Behavior from a Unique Vantage Point. In Proceedings of USENIX LEET.
[11] J. P. John, A. Moshchuk, S. D. Gribble and A. Krishnamurthy, Studying Spamming Botnets Using Botlab. In Proceedings of the 6th USENIX Symposium on Networked Systems Design and Implementation (NSDI), Apr.
[12] A. Ramachandran, N. Feamster and S. Vempala, Filtering Spam with Behavioral Blacklisting. In Proceedings of the 14th ACM Conference on Computer and Communications Security (CCS), Oct.
[13] H. Gao, J. Hu, C. Wilson, Z. Li, Y. Chen and B. Y. Zhao, Detecting and Characterizing Social Spam Campaigns. In Proceedings of Internet Measurement Conference (IMC), pp. 35-47, 2010.