2-4 Clustering and Feature Selection Methods for Analyzing Spam Based Attacks

Size: px
Start display at page:

Download "2-4 Clustering and Feature Selection Methods for Analyzing Spam Based Attacks"

Transcription

1 2-4 Clustering and Feature Selection Methods for Analyzing Spam Based Attacks In recent years, the number of spam s has been dramatically increasing and spam is recognized as a serious internet threat. Most recent spam s are being sent by bots which often operate with others in the form of a botnet, and skillful spammers try to conceal their activities from spam analyzers and spam detection technology. In addition, most spam messages contain URLs that lure spam receivers to malicious Web servers for the purpose of carrying out various cyber attacks such as malware infection, phishing attacks, etc. In order to cope with spam based attacks, there have been many efforts made towards the clustering of spam s based on similarities between them. The spam clusters obtained from the clustering of spam s can be used to identify the infrastructure of spam sending systems and malicious Web servers, and how they are grouped and correlate with each other. Therefore, we propose an optimized spam clustering method, called O-means, based on the K-means clustering method, which is one of the most widely used clustering methods. By examining three weeks of spam gathered in our SMTP server, we observed that the accuracy of the O-means clustering method is about 87% which is superior to the previous clustering methods. In addition, we define new 12 statistical features to compare similarities between spam s, and we propose a feature selection method to identify a set of optimized features which makes the O-means clustering method more effective. With our method, we identified 4 significant features which yielded a clustering accuracy of 86.33% with low time complexity. Keywords Spam, Clustering, Feature selection, Botnet, Malicious URLs 1 Introduction In recent years, the number of spam s has been dramatically increasing and spam is recognized as a serious internet threat. Most recent spam s are being sent by bots which often operate with others in the form of a botnet, and skillful spammers try to conceal their activities from spam analyzers and spam detection technology. For example, botnet owners control their bots carefully so as to send an extremely small amount of spam from each bot, so that they are able to avoid traditional bot detection technology which checks a quantity of traffic sent by remote systems. In fact, most recent bots are used for sending only 1 2 spam messages on average and in our experiments, we confirmed that each spam sending system sent 1.9 s on average. Furthermore, the active time of bots and the effective lifetime of a single spam message is highly short. By some estimates, 75% of bots are active for just 2 minutes or less and the lifetime of 65% spam messages is 2 hours or less. On the other hand, most spam messages also contain URLs that lure spam receivers to malicious Web servers for the purpose of carrying out various cyber attacks such as malware infection, phishing attacks, etc. Thus, we also need to analyze Web pages linked to URLs in order to determine whether a Web 35

2 page is malicious or not. In many cases, however, it is very difficult to analyze all Web pages in real-time, because more than 90% of all today is considered spam, and is abused for various purposes. To make matters worse, real malicious Web servers appear beyond 4 or more Web pages from the initial one linked to URLs directly. Therefore, we also need to reduce the analysis time of spam based attacks. In order to cope with spam based attacks, there have been many efforts made towards the clustering of spam s based on similarities between them. From the clustering of spam s, we are able to categorize spam into clusters based on shared similarities. By analyzing each spam cluster, we are able to identify the infrastructure of the systems used for sending spam s and how they are grouped with each other, and identify the correlation between spam sending systems and malicious Web servers, because in spam based attacks, attackers have to prepare both of them for carrying out their attacks successfully. In addition, if we use cluster information in the classification of spam, it can be used to minimize the time needed for analyzing Web pages. In other words, if an belongs to a certain cluster, then we do not have to follow its URLs and analyze their Web pages anymore. Therefore, it could be said that the clustering of spam s is essential to analyze spam based attacks effectively, and also it is very important to improve the accuracy of the spam clustering as much as possible so as to analyze spam based attacks more accurately. We present an optimized spam clustering method, called O-means, based on the K-means clustering method, which is one of the most widely used clustering methods. The K-means clustering method partitions a set of data into k clusters through the following steps. Initialization: Randomly choose k instances from data set and make them initial cluster centers. Assignment: Assign each instance to the closest center. Updating: Replace every cluster s center with the mean of its members. Iteration: Repeat Assignment and Updating until there is no change for each cluster, or other convergence criterion is met. The popularity of the K-means clustering method is largely due to its low time complexity, simplicity and fast convergence. However, it has been well known that its clustering results heavily depend on the chosen k initial centers, and it is very difficult to predefine the proper number of clusters, i.e., k. The O-means clustering method improves its performance by overcoming the shortcomings of the K-means clustering method. By examining three weeks of spam gathered in our SMTP server, we observed that the accuracy of the O-means clustering method is about 87% which is superior to the previous clustering methods. In addition, we define new 12 statistical features to compare similarities between spam s, and we propose a feature selection method to identify a set of optimized features which makes the O-means clustering method more effective. With our method, we identified 4 significant features which yielded a clustering accuracy of 86.33% with low time complexity. The rest of the paper is organized as follows. In Chapter 2, we give brief description for the previous spam clustering methods. In Chapter 3, we present the proposed clustering method, and experimental results and discussion are given in Chapters 4 and 5, respectively. Finally, we present concluding remarks and suggestions for future study in Chapter 6. 2 Related work Zhuang et al. developed new techniques to map botnet membership using traces of spam s. Their clustering method is based on the assumption that spam s with similar content are often sent from the same spammer or attacker, because these messages share a common economic interest. They identified hundreds of botnets by grouping similar spam messages and related spam campaigns. Li et al. investigated the clustering structures of spammers based on spam traffic collected 36 Journal of the National Institute of Information and Communications Technology Vol. 58 Nos. 3/4 2011

3 at a domain mail server they constructed. In this approach, they used URLs in spam s as the criterion for the clustering, and they observed that the relationship among the spammers has demonstrated highly clustering structures. In, Xie et al. focused on characterizing spamming botnets. To this end, they applied polymorphic URLs which have the same domain name to grouping spam s. They identified 7,721 botnet-based spam campaigns together with 340,050 unique botnet host IP addresses from a three-month sample of s from Hotmail, their system, i.e., AutoRE. However, there is a fatal weakness in that the three criteria, i.e., content, URL and domain name, are easily influenced by changes in spam messages and trends. In fact, spammers periodically change the contents of s and domain names, and the active period of URLs is extremely short; 1 day: 45%, 2 days: 20%, 3 days: 7%. Also, in our experiments, we confirmed that most URLs used in spam s are unique. As a result, it could be said that these three criteria are not suitable for maintaining and improving the quality of clusters. In, we carried out an experiment to examine the clustering of spam s based on IP addresses resolved from URLs. In other words, we regarded two s as the same cluster, called IP cluster, if their IP address sets resolved from URLs are completely identical. Although we demonstrated that its performance is better than that of the domain name and URL based clustering methods, there is a limitation in that the IP clusters contain lots of unrelated s which were sent from different controlling entities, and whose URLs are connected to different types of Web pages. This is because there are many Web servers which are hosting a lot of Web sites on the same IP address, or many Web servers compromised by spammers or attackers are serving their Web sites. In this paper, we focus on dividing them belonging to the same IP cluster into the individual clusters. 3 Proposed method 3.1 Overall procedure Figure 1 shows the overall procedure for the O-means clustering method proposed in this paper, and it is composed of the following 6 main parts. Header based clusterer: constructs HD clusters, denoted by HD_C 1, HD_C 2,, HD_C h, from given spam s based on similarities between the herder as described in Section 3.2 (1), and inserts them into O-means clusterer (2). Fig.1 Overall procedure of the O-means clustering method 37

4 IP based clusterer: resolves IP addresses from URLs within spam s, generates IP clusters, denoted by IP_C 1, IP_C 2,, IP_C i, from spam s using the resolved IP addresses as described in Section 3.3 (3), and inserts them into the O-means clusterer (4). O-means clusterer: generates OM clusters, denoted by OM_C 1, OM_C 2,, OM_C o, from spam s based on the K-means clustering method as described in Section 3.4 (5), and inserts them into analyzer & evaluator (6). Crawler: accesses Web sites linked to URLs within spam s, downloads their HTML content (7), and inserts them into document based clusterer (8). Document based clusterer: generates document clusters, denoted by DC_C 1, DC_C 2,, DC_C d, according to similarities found in Web document as described in Section 3.5 (9). Analyzer & evaluator: estimates the performance of the O-means clustering method as well as the 12 statistical features proposed in this paper, and analyzes spam sending systems and the URLs destinations, i.e., Web servers, using the results from the O-means clusterer and the document based clusterer (6, 10). 3.2 Header based clusterer In order to identify the relationship between spam sending systems, we leverage the headers such as From, To which indicate who the sender is and the receiver is, respectively, because the characteristics of the headers depend on client programs used for sending s. In other words, since there are a lot of types of headers of which some headers are essential, and others are optional, it can be said that if the headers of two s are not the same, then they are sent by different client programs. In addition, in many cases, spammer and attackers use their individual spam sending programs, we can distinguish them by revealing their characteristics found in the headers. In our header based clusterer, we focus on three criteria, i.e., the types of headers, their number and order, in order to classify spam s into the HD clusters in which spam s with the same headers become members of the same HD cluster. Figure 2 shows its clustering process. During the clustering process, it first picks the headers from each (1) and extracts their types excluding overlapping (2). After that, it regards two s as the same HD cluster if their number and order are completely identical (3). 3.3 IP based clusterer Figure 3 shows the process of IP based clustering proposed in. During the clustering process, it first picks unique URLs from the original URLs obtained from each (1, 2). After that, it resolves IP address(es) from the unique URLs(3), and regards two s as the same IP cluster if the IP address set resolved from the unique URLs is completely identical(4). 3.4 O-means clusterer Figure 4 shows the clustering process of the O-means clustering method proposed in this paper. During the clustering process, we first extract the 12 statistical features from Fig.2 Clustering process of the header based clusterer Fig.3 Process of IP based clustering 38 Journal of the National Institute of Information and Communications Technology Vol. 58 Nos. 3/4 2011

5 Table 1 Description of the 12 statistical features No. Feature Name Value 1 Size of s Number of lines 8 3 Number of unique URLs 1 4 Average length of unique URLs 57 5 Average length of domain names 17 6 Average length of path 10 7 Average length of query 22 8 Average number of key-value pairs 2 9 Average length of keys Average length of values 4 11 Average number of dots(.) in domain names 1 12 Number of global top 100 URLs 0 Fig.4 Clustering process of the O-means clustering method each as described in Section (1) and normalize their values according to the normalization method described in Section (2). We then create k initial centers using the IP clusters and the HD clusters as described in Section (3, 4, 5, 6). Finally, we construct the OM clusters, i.e., OM_C 1, OM_C 2,, OM_C o, from spam s using the k initial centers and the K-means clustering method (6, 7) Feature extraction Our feature extraction is based on the assumption that the spammers make spam s, especially URLs, under a certain rule or pattern embedded in their sending programs, even though the contents of s, URLs and domain names are frequently changed. Thus, there is a possibility that we are able to distinguish each spammer by characterizing his/her rule from spam s and URLs. To this end, we define the 12 statistical features as shown in Table 1 and describe them using the example of an as shown in Fig. 5. In Figure 5, the contains 7 lines representing the headers and only one URL Fig.5 Example of an in its body part. From the , we first compute the size of s (i.e., 310) in bytes and the number of lines (i.e., 8). After that, we pick the unique URL from the and divide it into 3 parts: domain name, path and query. Also, since the query part can contain multiple key-value pairs, we partition it into each key-value pair. By using those parts, we compute the values of 9 features (i.e., No. 3 No. 11) associated to the unique URL as shown in Table 1. Finally, we count the number of URLs that are most popular global top 100 Web sites provided from Alexa.com. The reason why we use this feature is that spammers may try to evade or confuse the URL based spam detection mechanism by crafting spam s which contain legitimate URLs, especially the popular URLs such as Google.com, Yahoo.com, etc. Therefore, we may be able to reveal the 39

6 characteristic of spammers by inspecting the using frequency of popular URLs Normalization After we extracted the 12 statistical features from spam s, we need to normalize their values, because each feature has a different scale. Our normalization method is basically based on. Given a set of data instances (i.e., spam s) which have the above 12 statistical values, we calculate the normalized values of each instance as follows. normalized_instnace[r] = original_instance[r] average[r] standard_deviation[r] subject to : r (1 r 12) where [r] is the rth feature, average[r] and standard_deviation[r] are the average and the standard deviation of the 12 features obtained from all data instances, respectively. This normalization method means that for every feature value of each data instance, how far it is away from the average of the corresponding feature with respect to its standard deviation Creating k initial centers (1) Motivation and strategy After we obtained the above 12 normalized values from spam s, we create k initial centers to be used in the next clustering phase, i.e., the K-means clustering method. In order to get the best performance out of the K-means clustering method, we need to choose the k initial centers as representatives of the actual k clusters within given spam s. In the case of spam s, each cluster can be defined as a group of spam s which are sent from the same controlling entity, e.g., a botnet, and whose URLs are connected to the same Web page. In other words, if two s are sent from different botnets, they should be members of different clusters, even if they share the same URL linked to the same Web page. In order to reflect those clusters to the k initial centers, we leverage the IP clusters and the HD clusters. Our strategy is that we first merge the IP clusters into DIP (Duplicate IP) clusters, denoted by DIP_C 1, DIP_C 2,, DIP_C m, whose members (i.e., spam s) share at least one IP address resolved from their unique URLs. This means that the members of each DIP cluster have a high possibility of having a close relationship to each other from viewpoint of URL destinations. However, the IP clusters contain lots of unrelated s which were sent from different controlling entities, whose URLs are connected to different types of Web pages, even if they are hosting on the same Web server with the same IP address. Thus, we divide the DIP clusters into k groups based on the HD clusters whose members (i.e., spam s) were sent from the same type of spam sending systems in terms of the types of the headers, their number and order. As a result, it can be expected that if distinct spammers sent spam s linked to the same Web page, then our method is able to discover each of them as a group. In addition, it is obvious that if a spammer is related to several different types of Web pages which are being hosted on different Web servers (i.e., the IP clusters) they can also be divided as different groups in our method. (2) Making DIP clusters Figure 6 shows the merging process of making the DIP clusters, and it is carried out by the following 7 steps. 1 : feed all the IP clusters into a set of inter- Fig.6 Merging process of making the DIP clusters 40 Journal of the National Institute of Information and Communications Technology Vol. 58 Nos. 3/4 2011

7 mediate IP clusters. 2 : select the largest IP cluster, IP_C largest, from the set of intermediate IP clusters. 3 : create a new DIP cluster which contains only IP_C largest as its initial member. 4 : select all IP clusters excluding IP_C largest from the set of intermediate IP clusters. 5 : classify all IP clusters into two parts: if an IP cluster shares at least one identical IP address with IP_C largest, it becomes a member of the new DIP cluster, otherwise it returns to the set of intermediate IP clusters. 6 : add the new DIP cluster to the set of DIP clusters. 7 : repeat 2 7 unless the set of intermediate IP clusters is empty. Otherwise the merging process is terminated. As a result, we can obtain the set of the DIP clusters, DIP_C 1, DIP_C 2,, DIP_C m. (3) Creating k groups Figure 7 shows the process of creating the k initial centers, and it is composed of the following 5 steps. 1 : feed all the DIP clusters into a set of intermediate DIP clusters. 2 : select a DIP cluster from the set of intermediate DIP clusters. 3 : partition its members, i.e., spam s, into groups in which spam s within the same group belong to the same HD cluster. 4 : add the new groups to the set of k groups. 5 : return to the step 2 if the set of intermediate DIP clusters is not empty, otherwise the creation process is terminated. From the above creation process, k groups, i.e., g 1, g 2,, g k are created. For the k groups, it computes the mean of their members, and regards them as the k initial centers. After we create the k initial centers, we apply the K-means clustering method to spam s. As the clustering result of the K-means clustering method, we are able to obtain the OM clusters, OM_C 1, OM_C 2,, OM_C o. Note that we computed the distance between two s using the Euclidean distance in our experiment. Given a pair of objects, e.g., a = {a 1, a 2,, a d } and b = {b 1, b 2,, b d }, which are vectors in real d-dimensional space, R d, then the Euclidean distance between a and b, d(a, b), is as follows. 3.5 Document based clusterer In order to evaluate our clustering method, we clustered spam s into the document clusters, DC_C 1, DC_C 2,, DC_C d, according to similarities found in Web pages linked to URLs. In order to examine the similarity between Web pages, we applied text shingling to the corresponding Web pages. Figure 8 shows an example of text shingling where two Web documents, e.g., A and B, contain 50 (A 1, A 2,, A 50 ) and 40 (B 1, B 2,, B 40 ) words, the size of a shingle is 5. Since the size of a shingle is 5, we construct 46 shingles (e.g., A 1 A 2 A 3 A 4 A 5, A 2 A 3 A 4 A 5 A 6 ) and 36 shingles (e.g., B 1 B 2 B 3 B 4 B 5, B 2 B 3 B 4 B 5 B 6 ) from Fig.7 Creation process of the k initial centers Fig.8 Example of text shingling 41

8 A and B, respectively, then similarity between A and B can be calculated as follows. the number of the same shingles the number of unique shingles in both A and B In this example, if A and B share 32 shingles with each other and the threshold to determine whether they are the same cluster or not is 50%, then we regard them as members of the same DC cluster, because similarity between A and B is 32 / ( ) = 64% which is larger than 50%. 4 Experimental results 4.1 Double bounce s In double bounce s, they have no valid addresses associated with spam senders and receivers in their header. Figure 9 shows the overall process of double bounce s. Assuming that there are three valid users whose addresses are a@user.com, b@user.com and c@user.com. If a spammer sends a double bounce in that its returnpath address (i.e., forged@address.com) and recipient address (i.e., unknown@address. com) do not exist, to the target SMTP server (1), then the user unknown error message is exchanged between two SMPT servers (2, 3). From this situation, it could be said that the spammer intentionally forged his/her returnpath address to conceal his/her activities and randomly generated a recipient address which does not exist in the real world. In the normal case, however, an has at least one valid return-path address in its header, even if a sender mistyped the recipient address to his/her . In this context, double bounce s can be regarded as pure spam, and therefore we use double bounce s for our analysis data, i.e., spam s. 4.2 Description of experimental data We collected 596,526 double bounce s that arrived at our SMTP server for three weeks (Jan. 25th Feb. 20th, 2010). Figure 10 shows their overall properties. Among all s, we observed that 526,544 s contained one or more URLs in their body, and the total number of URLs were 1,405,950 of which downloadable URLs and unique URLs numbered 1,296,120 and 1,048,158, respectively. Also, we found that 275,117 unique IP addresses were used for sending all of spam this means each unique IP address was associated with only about 1.9 s on average, while the total number of unique IP addresses connected to URL destinations was only 1,643. Figure 11 shows the national distribution of spam sending systems and URL destinations. From Figure 11, we can see that 10% and 8% of spam s was sent from America and Brazil, respectively, but in total, our data found that spam was sent from a total of 204 countries. On the other hand, in the case of URL destinations (i.e., Web servers), 56% of them were located in China, but an additional 35% of Web servers were located in Korea (19%) and America (16%). These results show that the geographical distribution of spam send- Fig.10 Overall properties of experimental data Fig.9 Overall process of double bounce s Fig.11 National distribution of spam sending systems and URL destinations 42 Journal of the National Institute of Information and Communications Technology Vol. 58 Nos. 3/4 2011

9 ing systems andweb servers are quite different from each other. In other words, spam sending systems are widely distributed all around the world, but Web servers are concentrated in only three countries. Fig.12 Similarity distribution of 1,739 web documents 4.3 Results of document based clustering In order to evaluate the performance of the O-means clustering method, we clustered double bounce s into the DC clusters as described in Section 3.5. During the clustering process, we set the size of a shingle to 5 and the threshold to 50%. This means that a shingle consists of 5 adjacent words, and if similarity between Web documents is larger than 50%, then they become members of the same DC cluster. In our investigation, we observed that there were 1,739 distinct Web documents whose hash values were different from each other among all Web documents crawled from 1,296,120 URLs. We first grouped 1,739 Web documents under the above two conditions and found that there were 772 groups, i.e., there were 772 Web documents which contain distinct content from each other. Also, we observed that among 1,739 Web documents, 656 Web documents had no similar Web documents to the others, while 1,083 Web documents did. In our further investigation, we discovered that among the 772 groups, the most largest group has 156 similar Web documents as its members. Figure 12 shows similarity distribution of the 1,739 Web documents when they are assigned to one of the 772 groups. From Figure 12, it can be easily seen that similarities of most Web documents are close to either 0 or 100. This means that the threshold value (i.e., 50%) does not affect the performance of the document based clusterer. We then grouped double bounce s according to the membership with the 772 Web documents. In other words, if two double bounce s share at least one URL in many cases, it is not the same which is linked to one of the 772 Web documents, then they become members of the same DC cluster. As a result, from experimental data, we obtained 772 DC clusters in which double bounce s within the same DC cluster are similar to each other in terms of Web pages connected to URLs they contain. 4.4 Clustering results of the O-means clustering method With respect to 526,544 s with URLs, we constructed the HD clusters and the IP clusters as described in Sections 3.2 and 3.3, respectively. The number of the HD clusters and the IP clusters were 1,062 and 1,204, respectively. Also, we merged the 1,204 IP clusters into the DIP clusters as described in Section 3.4.3, and obtained 900 DIP clusters. Using the 900 DIP clusters and the 1,062 HD clusters, we created 3,528 groups based on the creating process described in Section 3.4.3, and consequently computed 3,528 initial centers with the mean of their members. We then fed the 3,528 initial centers into the K-means clustering method which created the OM clusters, and consequently obtained 2,049 OM clusters. In order to estimate the accuracy of the 2,049 OM clusters, we measured their accuracy using the DC clusters obtained from Section 4.3. In this estimation, we first drew up a list of the DC clusters that contained at least one from an OM cluster, and then we selected the certain DC cluster that shared the largest number of s with the OM cluster. Finally, we calculated the accuracy of the OM cluster as follows. 43

10 the largest number of s the total number of s in the OM cluster From our evaluation, we observed that the average accuracy of the 2,049 OM clusters is 86.63% which is about 6% higher than that (i.e., 80.98%) of the IP based clusterer. Figure 13 shows the accuracy of the IP based clusterer and the O-means clustering method with respect to the top 100 clusters which are responsible about 98% and 86% of double bounce s used in our experiment, respectively in terms of their size, i.e., the number of members. Specifically, we can observe that the O-means clustering method is superior to the IP based clusterer in the 10 clusters among the top 12 clusters. The accuracy of the top 10 OM clusters is shown in Table 2. Table 2 shows statistical information of spam sending systems and URL destinations for the top 10 OM clusters. From Table 2, we can see that each OM cluster has a lot of distinct spam sending systems, i.e., 4,730 27,476, which represent the size of botnets or systems connected closely with each other and that they are distributed in many different countries, i.e., While in the case of URL destinations, we can see that the number of unique IP addresses is only 2 5 and that they are located in 1 3 countries. In fact, in our investigation, we observed that they are distributed in one of three countries, i.e., China, Korea and America. In Table 2, the most important thing to note is that each OM cluster, especially the top 8 OM clusters, has many distinct URLs and domain names. This means that spammers Fig.13 Performance comparison between the IP based clustering method and the O-means clustering method with respect to top 100 clusters Table 2 Statistics of the top 10 OM clusters OM Cluster ID (Top 10 Clusters) Accuracy 99.03% 98.84% 98.95% 98.98% 99.08% 99.05% 87.99% 92.12% 94.74% 99.76% # of s 31,790 22,857 20,230 17,817 9,850 9,409 9,369 9,013 8,998 6,772 # of unique source IP addresses 27,476 20,176 18,244 16,260 9,229 8,901 7,681 7,368 5,731 4,730 # of unique source countries # of entire URLs 95,370 68,619 60,688 53,447 29,548 28,225 56,214 54,073 8,998 6,772 # of unique URLs 94,805 68,198 60,294 53,151 29,532 28,068 56,008 53, # of unique domain names 94,805 68,198 60,294 53,151 29,532 28,068 56,008 53, # of unique destination IP addresses # of unique destination countries Journal of the National Institute of Information and Communications Technology Vol. 58 Nos. 3/4 2011

11 or attackers frequently change their URLs and domain names. Therefore, in the case of the existing domain name and URL based clustering methods, it could be said that the clustering accuracy is extremely low, because most URLs are unique as well as they have unique domain names. 4.5 Feature selection method In this section, we present our feature selection method based on heuristic analysis of spam s. In our method, we first selected the top 10 clusters from the original 772 clusters obtained from the document based clustering described in Section 3.5. Table 3 shows the the statistical information of the top 10 clusters (see Section 4.5.1). Using the top 10 clusters, we evaluated 12 features heuristically and identified their degrees of contribution for improving the clustering accuracy of spam s (see Section 4.5.2) Top 10 clusters Using the document based clustering described in Section 3.5, we obtained 772 clusters and observed that there are 10 large clusters which are responsible for about 90% of spam s with URLs arriving at our SMTP server. Table 3 shows their statistical information. From Table 3, we can see that each cluster has a lot of distinct spam sending systems, i.e., 3, ,149, which represent the size of botnets or systems connected closely with each other and that they are distributed in many different countries, i.e., While in the case of URL destinations, we can see that the number of unique IP addresses is only and that they are located in 2 29 countries. Also, we can observe that there are two patterns of URLs and domain names: in the case of clusters 1, 3 and 5, spammers or attackers frequently change their URLs and domain names, while in the case of the others, they almost always used the same URLs and domain names Selecting significant features Although we defined the 12 statistical features to calculate similarity between spam s and showed that the clustering accuracy is superior to the existing research, we need to consider that the relationships among the 12 statistical features are not independent from each other, and thereby there may exist several redundant features which do not contribute to the improving of the clustering accuracy. Furthermore, considering we have to deal with a large amount of spam s and analyze them effectively, it is needed to extract significant features from the original 12 features, so that we are able to reduce the analysis time of spam s while maintaining the high clustering accuracy. In order to discover a set of optimized features, it is best to investigate all the combinations of the 12 features: namely, the number of Table 3 Statistics of top 10 clusters ID Top 10 Clusters A 286,024 30,205 22,696 21,061 17,393 10,741 9,807 7,649 5,601 5,265 B 169,149 19,102 16,881 10,480 4,003 4,828 6,101 5,241 4,958 3,061 C D 951,565 30,209 77,794 21,061 17,397 18,363 9,806 7,649 5,919 5,254 E 891, , ,790 6, F 890, , ,757 6, G H A: Num. of s, B: Num. of unique source IP addresses, C: Num. of unique source countries, D: Num. of entire URLs, E: Num. of unique URLs, F: Num. of unique domain names, G: Num. of unique destination IP addresses, H: Num. of unique destination countries 45

12 all cases to be estimated is Σ 12 i = 1 12 C i but doing so is extremely time-consuming. Thus, we carried out an alternative experiment, which is based on the following heuristic analysis of the top 10 clusters. 1. In each cluster, the number of spam s that have the same value is counted with respect to all of the 12 features. 2. The number of spam s is scaled to [0, 1] according to the size of each cluster, because the size of the top 10 clusters is different from each other. 3. Made 2-dimensional graphs for each feature where the horizontal axis indicates the values of each feature and the vertical axis indicates the ratio of the number of spam s within each cluster as shown in Fig Using the results in Fig. 14, we first selected three significant features: Size of s (14(a)), Number of lines (14(b)), Length of URLs (14(d)), because most clusters can be distinguished from each other by using their values. In other words, since spam s in each cluster have different value distribution in these three features, if we use the three features to calculate similarity between spam s it is possible to accurately partition them into clusters where members within the same cluster have similar values with each other. 5. Second, we excluded Number of global URLs (14(l)) from the list of significant features, because the top 10 clusters have (a) Size of s (b) Number of lines (c) Number of URLs (d) Length of URLs (e) Length of domain names (f) Number of dots (g) Number of key-value pairs (h) Length of keys (i) Length of values (j) Length of queries (k) Length of paths (l) Number of global URLs Fig.14 Value distribution of 12 statistical features with respect to top 10 clusters 46 Journal of the National Institute of Information and Communications Technology Vol. 58 Nos. 3/4 2011

13 similar values in this feature. 6. Third, we did not regard Length of domain names (14(e)) as a significant feature, because it is unable to help distinguish each cluster at all: there is no cluster whose value distribution is independent from others. 7. Fourth, we also excluded 6 features, i.e., Number of URLs (14(c)), Number of key-value pairs (14(g)), Length of keys (14(h)), Length of values (14(i)), Length of queries (14(j)), Length of paths (14(k)) from the list of significant features, because they helped distinguish clusters 7, 8, 1, 3 and 5 from the others, but it is obvious that those clusters can also be obtained by using the three significant features in step Finally, since it is impossible to distinguish cluster 4 from cluster 5 using the three significant features in step 4, we added Number of dots (14(f)) to the list of significant features as it was able to separate them. As a result, we selected four significant features, i.e., Size of s, Number of lines, Length of URLs, and Number of dots. Note that our method is a supervised feature selection, since it uses the label information of top 10 clusters Performance evaluation In order to demonstrate the effectiveness of the four significant features, we examined their clustering accuracy and time complexity. As a clustering algorithm, we used an optimized spam clustering method, called O-means based on the K-means clustering method, which is one of the most widely used clustering methods. The evaluation results are shown in Table 4. From Table 4, we can see that almost the same clustering accuracy (i.e., 86.33%) was yielded by using only these four significant features. In addition, we can see that execution time was drastically reduced as a result of using only these four significant features, enabling us to analyze spam s more effectively. This measurement was performed on a machine running an Intel Core 2 Duo 2.8GHz CPU with 3 GB of RAM, and our Table 4 Comparison of the clustering accuracy and time complexity Clustering Accuracy Execution Time (sec.) program was written in the Perl programming language and Mysql. 5 Discussion 4 significant features All features 86.33% 86.63% 6,124 28,772 In order to evaluate the performance of the O-means clustering method, we constructed the document clusters (i.e., DC_C 1, DC_C 2,, DC_C d ) using text shingle technique as shown in Section 3.5. Although there are a lot of techniques (e.g., Levenshtein distance) for estimating similarities in text documents, in our method we used the text shingling technique to compare the similarity of Web pages downloaded from URLs, because it was developed to measure and compare the similarity of Web pages, and its effectiveness was verified in many approaches. Furthermore, our experimental results shown in Fig. 12 also demonstrate that 1,739 Web documents were well classified according to similarities among them by using this text shingling technique. In our experiments, we evaluated the O-means clustering method using 3 weeks of double bounce s that arrived at our SMTP server. In order to evaluate the Omeans clustering method more accurately, it is better to analyze a longer period of double bounce s and Web pages linked to URLs. However, there is a practical problem which makes this difficult: Web crawling is an intrusive process that might let spammers believe certain groups of users are more vulnerable to spam s and thus send more spam to them in the future ; the time complexity for computing similarities among all Web documents downloaded from URLs increases exponentially. 47

14 Similar to our experiments, therefore, most of existing approaches evaluated their systems using a short period of spam s. In fact, in, they used only one month, one week and 9 days of spam s, respectively. Nevertheless, they succeeded in identifying of a number of botnets and spam clusters. As one of the reasons for such a success, it could be considered that the active time of recent bots and the effective lifetime of a single spam message is extremely short. In this context, it could be concluded that our experiments were carried out in a reasonable manner. 6 Conclusion In this paper, we have proposed an optimized spam clustering method, the O-means clustering method, based on the K-means clustering method, which is one of the most widely used clustering methods. The O-means clustering method improves its performance by overcoming the shortcomings of the K-means clustering method: its clustering result depends on the chosen k initial centers, and it is very difficult to predefine the proper number of clusters, i.e., k. By examining three weeks of spam gathered in our SMTP server, we observed that the accuracy of the O-means clustering method is about 87% which is superior to the previous clustering methods. In addition, we have defined new 12 statistical features to compare similarities between spam s, and we have proposed a feature selection method to identify a set of optimized features which makes the Omeans clustering method more effective. With our method, we identified 4 significant features which yielded a clustering accuracy of 86.33% with low time complexity. Although we only focused on the improvement of the clustering accuracy in this paper, the spam clusters obtained from the O-means clustering method can be utilized for analyzing spam based attacks more accurately. Therefore, we need to carry out more practical analysis for spam based attacks in our future work, so that we are able to identify the infrastructure of spam sending systems and malicious Web servers, and how they are grouped and correlate with each other. Further, it is worth to investigate how much the spam clusters can be contributed to the time reduction of analyzing Web pages lined to URLs. We also need to consider the merging of the OM clusters generated by the O-means clustering method, because in the creating process of the k initial centers, we regarded two spam s as different clusters, if they were sent from different controlling entities, even though their URLs are linked to the same Web page. By merging the OM clusters according to their similarity, we are able to identify the relationship between the OM clusters; it could be expected that we are able to identify a group of the OM clusters whose s share URL(s) linked to the same Web page. It is also a challenging task to devise additional statistical features and to evaluate them with respect to all their combinations, so that we are able to obtain an optimized set of statistical features for the clustering of spam s. Finally, we will evaluate the O-means clustering method with a larger data set of spam obtained from various domain sources in order to analyze spam based attacks more effectively. References 1 Xie, Y., Yu, F., Achan, K., Panigrahy, R., Hulten, G., and Osipkov, I., Spamming botnets: signatures and characteristics, ACM SIGCOMM Computer Communication Review, Vol. 38, No. 4, October Ramachandran, A. and Feamster, N., Understanding the network-level behavior of spammers, Proceedings of the 2006 conference on Applications, technologies, architectures, and protocols for computer communications, Vol. 36, No. 4, September 11 15, Pisa, Italy, Journal of the National Institute of Information and Communications Technology Vol. 58 Nos. 3/4 2011

15 3 Anderson, D. S., Fleizach, C., Savage, S., Voelker, and G. M., Spamscatter: Characterizing Internet Scam Hosting Infrastructure, Proceedings of the USENIX Security Symposium, Boston, Jennings, R., The global economic impact of spam, 2005 report, Technical report, Ferris Research, Kawakoya, Y., Akiyama, M., Aoki, K., Itoh M., and Takakura, H., Investigation of Spam Mail DrivenWeb- Based Passive Attack, IEICE Technical Report, ICSS2009-5(2009-5), pp , Jungsuk Song, Daisuke Inoue, Masashi Eto, Suzuki Mio, Satoshi Hayashi, and Koji Nakao, A Methodology for Analyzing Overall Flow of Spam-based Attacks, 16th International Conference on Neural Information Processing(ICONIP 2009), LNCS 5864, pp , Bangkok, Thailand, 1 5 December Li, F. and Hsieh, M.H., An Empirical Study of Clustering Behavior of Spammers and Group-based Anti-Spam Strategies, Proceedings of 3rd Conference on and Anti-Spam (CEAS), pp , Mountain View, CA, Zhuang, L., Dunagan, J., Simon, D. R., Wang, H. J., and Tygar, J. D., Characterizing botnets from spam records, Proceedings of the 1st Usenix Workshop on Large-Scale Exploits and Emergent Threats, No. 2, pp. 1 9, San Francisco, Jungsuk Song, Daisuke Inoue, Masashi Eto, Hyung Chan Kim, and Koji Nakao, Preliminary Investigation for Analyzing Network Incidents Caused by Spam, The Symposium on Cryptography and Information Security (SCIS 2010), Takamatsu, Japan, January, Jungsuk Song, Daisuke Inoue, Masashi Eto, Hyung Chan Kim, and Koji Nakao, An Empirical Study of Spam : Analyzing Spam Sending Systems and Malicious Web Servers, 10th Annual International Symposium on Applications and the Internet (SAINT 2010), Seoul, Korea, July MCQUEEN, J, Some methods for classification and analysis of multivariate observations, Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, pp , John, J. P., Moshchuk, A., Gribble, S. D., and Krishnamurthy, A., Studying spamming botnets using Botlab, Proceedings of the 6th USENIX symposium on Networked systems design and implementation, pp , Boston, April 22 24, L. Portnoy, E. Eskin, and S. Stolfo, Intrusion Detection with Unlabeled Data Using Clustering, Proceedings of ACM CSS Workshop on Data Mining Applied to Security, D. Fetterly, M. Manasse, M. Najork, and J. L.Wiener, A large-scale study of the evolution of web pages, Softw. Pract. Exper., 34(2), Huan Liu and Lei Yu, Toward integrating feature selection algorithms for classification and clustering, IEEE Transactions on Knowledge and Data Engineering, Vol. 17, pp , A. Broder, On the Resemblance and Containment of Documents, Proceedings of the Compression and Complexity of Sequences (SEQUENCES 97), pp , June 11 13, N. SHIVAKUMAR and H. GARCIA-MOLINA, Finding near-replicas of documents and servers on the web, Proceedings of the First International Workshop on the Web and Databases (WebDB 98), pp , March (Accepted June 15, 2011) 49

16 , Ph.D. Researcher, Cybersecurity Laboratory, Network Security Research Institute Network Security, Spam Analysis, IPv6 Security 50 Journal of the National Institute of Information and Communications Technology Vol. 58 Nos. 3/4 2011

Extending Black Domain Name List by Using Co-occurrence Relation between DNS queries

Extending Black Domain Name List by Using Co-occurrence Relation between DNS queries Extending Black Domain Name List by Using Co-occurrence Relation between DNS queries Kazumichi Sato 1 keisuke Ishibashi 1 Tsuyoshi Toyono 2 Nobuhisa Miyake 1 1 NTT Information Sharing Platform Laboratories,

More information

An Efficient Methodology for Detecting Spam Using Spot System

An Efficient Methodology for Detecting Spam Using Spot System Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 3, Issue. 1, January 2014,

More information

Fang Yu, Kannan Achan, Rina Panigrahy, Microsoft Research Silicon Valley Windows Live (Hotmail) mail, Windows Live Safety Platform MSR-SVC

Fang Yu, Kannan Achan, Rina Panigrahy, Microsoft Research Silicon Valley Windows Live (Hotmail) mail, Windows Live Safety Platform MSR-SVC Yinglian Xie Fang Yu, Kannan Achan, Rina Panigrahy, Geoff Hulten, Ivan Osipkov Microsoft Research Silicon Valley Windows Live (Hotmail) mail, Windows Live Safety Platform Botnet attacks are prevalent DDoS,

More information

Spamming Botnets: Signatures and Characteristics

Spamming Botnets: Signatures and Characteristics Spamming Botnets: Signatures and Characteristics Yinglian Xie, Fang Yu, Kannan Achan, Rina Panigrahy, Geoff Hulten +,IvanOsipkov + Microsoft Research, Silicon Valley + Microsoft Corporation {yxie,fangyu,kachan,rina,ghulten,ivano}@microsoft.com

More information

Spamming Botnets: Signatures and Characteristics

Spamming Botnets: Signatures and Characteristics Spamming Botnets: Signatures and Characteristics Yinglian Xie, Fang Yu, Kannan Achan, Rina Panigrahy, Geoff Hulten, Ivan Osipkov Presented by Hee Dong Jung Adapted from slides by Hongyu Gao and Yinglian

More information

Botnet Detection Based on Degree Distributions of Node Using Data Mining Scheme

Botnet Detection Based on Degree Distributions of Node Using Data Mining Scheme Botnet Detection Based on Degree Distributions of Node Using Data Mining Scheme Chunyong Yin 1,2, Yang Lei 1, Jin Wang 1 1 School of Computer & Software, Nanjing University of Information Science &Technology,

More information

A Novel Distributed Denial of Service (DDoS) Attacks Discriminating Detection in Flash Crowds

A Novel Distributed Denial of Service (DDoS) Attacks Discriminating Detection in Flash Crowds International Journal of Research Studies in Science, Engineering and Technology Volume 1, Issue 9, December 2014, PP 139-143 ISSN 2349-4751 (Print) & ISSN 2349-476X (Online) A Novel Distributed Denial

More information

Detecting Spamming Activities by Network Monitoring with Bloom Filters

Detecting Spamming Activities by Network Monitoring with Bloom Filters Detecting Spamming Activities by Network Monitoring with Bloom Filters Ping-Hai Lin, Po-Ching Lin, Pin-Ren Chiou, Chien-Tsung Liu Department of Computer Science and Information Engineering National Chung

More information

How To Create A Spam Authentication Protocol Called Occam

How To Create A Spam Authentication Protocol Called Occam Slicing Spam with Occam s Razor Chris Fleizach cfleizac@cs.ucsd.edu Geoffrey M. Voelker voelker@cs.ucsd.edu Stefan Savage savage@cs.ucsd.edu ABSTRACT To evade blacklisting, the vast majority of spam email

More information

How To Measure Spam On Multiple Platforms

How To Measure Spam On Multiple Platforms Observing Common Spam in Tweets and Email Cristian Lumezanu NEC Laboratories America Princeton, NJ lume@nec-labs.com Nick Feamster University of Maryland College Park, MD feamster@cs.umd.edu ABSTRACT Spam

More information

A Review of Anomaly Detection Techniques in Network Intrusion Detection System

A Review of Anomaly Detection Techniques in Network Intrusion Detection System A Review of Anomaly Detection Techniques in Network Intrusion Detection System Dr.D.V.S.S.Subrahmanyam Professor, Dept. of CSE, Sreyas Institute of Engineering & Technology, Hyderabad, India ABSTRACT:In

More information

Bisecting K-Means for Clustering Web Log data

Bisecting K-Means for Clustering Web Log data Bisecting K-Means for Clustering Web Log data Ruchika R. Patil Department of Computer Technology YCCE Nagpur, India Amreen Khan Department of Computer Technology YCCE Nagpur, India ABSTRACT Web usage mining

More information

Anti-Malware Technologies

Anti-Malware Technologies : Trend of Network Security Technologies Anti-Malware Technologies Mitsutaka Itoh, Takeo Hariu, Naoto Tanimoto, Makoto Iwamura, Takeshi Yagi, Yuhei Kawakoya, Kazufumi Aoki, Mitsuaki Akiyama, and Shinta

More information

LASTLINE WHITEPAPER. Using Passive DNS Analysis to Automatically Detect Malicious Domains

LASTLINE WHITEPAPER. Using Passive DNS Analysis to Automatically Detect Malicious Domains LASTLINE WHITEPAPER Using Passive DNS Analysis to Automatically Detect Malicious Domains Abstract The domain name service (DNS) plays an important role in the operation of the Internet, providing a two-way

More information

SPAMMING BOTNETS: SIGNATURES AND CHARACTERISTICS

SPAMMING BOTNETS: SIGNATURES AND CHARACTERISTICS SPAMMING BOTNETS: SIGNATURES AND CHARACTERISTICS INTRODUCTION BOTNETS IN SPAMMING WHAT IS AUTORE? FACING CHALLENGES? WE CAN SOLVE THEM METHODS TO DEAL WITH THAT CHALLENGES Extract URL string, source server

More information

Email Spam Detection A Machine Learning Approach

Email Spam Detection A Machine Learning Approach Email Spam Detection A Machine Learning Approach Ge Song, Lauren Steimle ABSTRACT Machine learning is a branch of artificial intelligence concerned with the creation and study of systems that can learn

More information

Characterizing Botnets from Email Spam Records

Characterizing Botnets from Email Spam Records Characterizing Botnets from Email Spam Records Li Zhuang UC Berkeley John Dunagan Daniel R. Simon Helen J. Wang Ivan Osipkov Geoff Hulten Microsoft Research J. D. Tygar UC Berkeley Abstract We develop

More information

A COMBINED METHOD FOR DETECTING SPAM

A COMBINED METHOD FOR DETECTING SPAM International Journal of Computer Networks & Communications (IJCNC), Vol.1, No.2, July 29 A COMBINED METHOD FOR DETECTING SPAM MACHINES ON A TARGET NETWORK Tala Tafazzoli and Seyed Hadi Sadjadi,, Faculty

More information

Recurrent Patterns Detection Technology. White Paper

Recurrent Patterns Detection Technology. White Paper SeCure your Network Recurrent Patterns Detection Technology White Paper January, 2007 Powered by RPD Technology Network Based Protection against Email-Borne Threats Spam, Phishing and email-borne Malware

More information

Clustering Spam Campaigns with Fuzzy Hashing

Clustering Spam Campaigns with Fuzzy Hashing Clustering Spam Campaigns with Fuzzy Hashing Jianxing Chen 1, Romain Fontugne 2,3, Akira Kato 4, Kensuke Fukuda 2 1 Saarland University 2 National Institute of Informatics 3 Japanese-French Laboratory

More information

Next Generation IPS and Reputation Services

Next Generation IPS and Reputation Services Next Generation IPS and Reputation Services Richard Stiennon Chief Research Analyst IT-Harvest 2011 IT-Harvest 1 IPS and Reputation Services REPUTATION IS REQUIRED FOR EFFECTIVE IPS Reputation has become

More information

Email Spam Detection Using Customized SimHash Function

Email Spam Detection Using Customized SimHash Function International Journal of Research Studies in Computer Science and Engineering (IJRSCSE) Volume 1, Issue 8, December 2014, PP 35-40 ISSN 2349-4840 (Print) & ISSN 2349-4859 (Online) www.arcjournals.org Email

More information

LASTLINE WHITEPAPER. Large-Scale Detection of Malicious Web Pages

LASTLINE WHITEPAPER. Large-Scale Detection of Malicious Web Pages LASTLINE WHITEPAPER Large-Scale Detection of Malicious Web Pages Abstract Malicious web pages that host drive-by-download exploits have become a popular means for compromising hosts on the Internet and,

More information

Clustering as an add-on for firewalls

Clustering as an add-on for firewalls Clustering as an add-on for firewalls C. Caruso & D. Malerba Dipartimento di Informatica, University of Bari, Italy. Abstract The necessary spread of the access points to network services makes them vulnerable

More information

Symantec Cyber Threat Analysis Program Program Overview. Symantec Cyber Threat Analysis Program Team

Symantec Cyber Threat Analysis Program Program Overview. Symantec Cyber Threat Analysis Program Team Symantec Cyber Threat Analysis Program Symantec Cyber Threat Analysis Program Team White Paper: Symantec Security Intelligence Services Symantec Cyber Threat Analysis Program Contents Overview...............................................................................................

More information

EVILSEED: A Guided Approach to Finding Malicious Web Pages

EVILSEED: A Guided Approach to Finding Malicious Web Pages + EVILSEED: A Guided Approach to Finding Malicious Web Pages Presented by: Alaa Hassan Supervised by: Dr. Tom Chothia + Outline Introduction Introducing EVILSEED. EVILSEED Architecture. Effectiveness of

More information

On the Efficiency of Collecting and Reducing Spam Samples

On the Efficiency of Collecting and Reducing Spam Samples On the Efficiency of Collecting and Reducing Spam Samples Pin-Ren Chiou, Po-Ching Lin Department of Computer Science and Information Engineering National Chung Cheng University Chiayi, Taiwan, 62102 {cpj101m,pclin}@cs.ccu.edu.tw

More information

WE KNOW IT BEFORE YOU DO: PREDICTING MALICIOUS DOMAINS Wei Xu, Kyle Sanders & Yanxin Zhang Palo Alto Networks, Inc., USA

WE KNOW IT BEFORE YOU DO: PREDICTING MALICIOUS DOMAINS Wei Xu, Kyle Sanders & Yanxin Zhang Palo Alto Networks, Inc., USA WE KNOW IT BEFORE YOU DO: PREDICTING MALICIOUS DOMAINS Wei Xu, Kyle Sanders & Yanxin Zhang Palo Alto Networks, Inc., USA Email {wei.xu, ksanders, yzhang}@ paloaltonetworks.com ABSTRACT Malicious domains

More information

Implementation of Botcatch for Identifying Bot Infected Hosts

Implementation of Botcatch for Identifying Bot Infected Hosts Implementation of Botcatch for Identifying Bot Infected Hosts GRADUATE PROJECT REPORT Submitted to the Faculty of The School of Engineering & Computing Sciences Texas A&M University-Corpus Christi Corpus

More information

Efficient Filter Construction for Access Control in Firewalls

Efficient Filter Construction for Access Control in Firewalls Efficient Filter Construction for Access Control in Firewalls Gopinath C.B Vinoda A.M Department of Computer science and Engineering Department of Master of Computer Applications, Government Engineering

More information

HYBRID INTRUSION DETECTION FOR CLUSTER BASED WIRELESS SENSOR NETWORK

HYBRID INTRUSION DETECTION FOR CLUSTER BASED WIRELESS SENSOR NETWORK HYBRID INTRUSION DETECTION FOR CLUSTER BASED WIRELESS SENSOR NETWORK 1 K.RANJITH SINGH 1 Dept. of Computer Science, Periyar University, TamilNadu, India 2 T.HEMA 2 Dept. of Computer Science, Periyar University,

More information

A Visualization Technique for Monitoring of Network Flow Data

A Visualization Technique for Monitoring of Network Flow Data A Visualization Technique for Monitoring of Network Flow Data Manami KIKUCHI Ochanomizu University Graduate School of Humanitics and Sciences Otsuka 2-1-1, Bunkyo-ku, Tokyo, JAPAPN manami@itolab.is.ocha.ac.jp

More information

A Campaign-based Characterization of Spamming Strategies

A Campaign-based Characterization of Spamming Strategies A Campaign-based Characterization of Spamming Strategies Pedro H. Calais, Douglas E. V. Pires Dorgival Olavo Guedes, Wagner Meira Jr. Computer Science Department Federal University of Minas Gerais Belo

More information

Impact of Feature Selection on the Performance of Wireless Intrusion Detection Systems

Impact of Feature Selection on the Performance of Wireless Intrusion Detection Systems 2009 International Conference on Computer Engineering and Applications IPCSIT vol.2 (2011) (2011) IACSIT Press, Singapore Impact of Feature Selection on the Performance of ireless Intrusion Detection Systems

More information

Commtouch RPD Technology. Network Based Protection Against Email-Borne Threats

Commtouch RPD Technology. Network Based Protection Against Email-Borne Threats Network Based Protection Against Email-Borne Threats Fighting Spam, Phishing and Malware Spam, phishing and email-borne malware such as viruses and worms are most often released in large quantities in

More information

LASTLINE WHITEPAPER. The Holy Grail: Automatically Identifying Command and Control Connections from Bot Traffic

LASTLINE WHITEPAPER. The Holy Grail: Automatically Identifying Command and Control Connections from Bot Traffic LASTLINE WHITEPAPER The Holy Grail: Automatically Identifying Command and Control Connections from Bot Traffic Abstract A distinguishing characteristic of bots is their ability to establish a command and

More information

Botnet Detection by Abnormal IRC Traffic Analysis

Botnet Detection by Abnormal IRC Traffic Analysis Botnet Detection by Abnormal IRC Traffic Analysis Gu-Hsin Lai 1, Chia-Mei Chen 1, and Ray-Yu Tzeng 2, Chi-Sung Laih 2, Christos Faloutsos 3 1 National Sun Yat-Sen University Kaohsiung 804, Taiwan 2 National

More information

Machine Learning using MapReduce

Machine Learning using MapReduce Machine Learning using MapReduce What is Machine Learning Machine learning is a subfield of artificial intelligence concerned with techniques that allow computers to improve their outputs based on previous

More information

An Overview of Spam Blocking Techniques

An Overview of Spam Blocking Techniques An Overview of Spam Blocking Techniques Recent analyst estimates indicate that over 60 percent of the world s email is unsolicited email, or spam. Spam is no longer just a simple annoyance. Spam has now

More information

2-5 DAEDALUS: Practical Alert System Based on Large-scale Darknet Monitoring for Protecting Live Networks

2-5 DAEDALUS: Practical Alert System Based on Large-scale Darknet Monitoring for Protecting Live Networks 2-5 DAEDALUS: Practical Alert System Based on Large-scale Darknet Monitoring for Protecting Live Networks A darknet is a set of globally announced unused IP addresses and using it is a good way to monitor

More information

An In-depth Analysis of Spam and Spammers

An In-depth Analysis of Spam and Spammers An In-depth Analysis of Spam and Spammers Dhinaharan Nagamalai Wireilla Net Solutions Inc,Chennai,India Beatrice Cynthia Dhinakaran, Jae Kwang Lee Department of Computer Engineering, Hannam University,

More information

SIP Service Providers and The Spam Problem

SIP Service Providers and The Spam Problem SIP Service Providers and The Spam Problem Y. Rebahi, D. Sisalem Fraunhofer Institut Fokus Kaiserin-Augusta-Allee 1 10589 Berlin, Germany {rebahi, sisalem}@fokus.fraunhofer.de Abstract The Session Initiation

More information

Intrusion Detection via Machine Learning for SCADA System Protection

Intrusion Detection via Machine Learning for SCADA System Protection Intrusion Detection via Machine Learning for SCADA System Protection S.L.P. Yasakethu Department of Computing, University of Surrey, Guildford, GU2 7XH, UK. s.l.yasakethu@surrey.ac.uk J. Jiang Department

More information

Design of Standard VoIP Spam Report Format Supporting Various Spam Report Methods

Design of Standard VoIP Spam Report Format Supporting Various Spam Report Methods 보안공학연구논문지 (Journal of Security Engineering), 제 10권 제 1호 2013년 2월 Design of Standard VoIP Spam Report Format Supporting Various Spam Report Methods Ji-Yeon Kim 1), Hyung-Jong Kim 2) Abstract VoIP (Voice

More information

Echidna: Efficient Clustering of Hierarchical Data for Network Traffic Analysis

Echidna: Efficient Clustering of Hierarchical Data for Network Traffic Analysis Echidna: Efficient Clustering of Hierarchical Data for Network Traffic Analysis Abdun Mahmood, Christopher Leckie, Parampalli Udaya Department of Computer Science and Software Engineering University of

More information

A Phased Framework for Countering VoIP SPAM

A Phased Framework for Countering VoIP SPAM International Journal of Advanced Science and Technology 21 A Phased Framework for Countering VoIP SPAM Jongil Jeong 1, Taijin Lee 1, Seokung Yoon 1, Hyuncheol Jeong 1, Yoojae Won 1, Myuhngjoo Kim 2 1

More information

Dual Mechanism to Detect DDOS Attack Priyanka Dembla, Chander Diwaker 2 1 Research Scholar, 2 Assistant Professor

Dual Mechanism to Detect DDOS Attack Priyanka Dembla, Chander Diwaker 2 1 Research Scholar, 2 Assistant Professor International Association of Scientific Innovation and Research (IASIR) (An Association Unifying the Sciences, Engineering, and Applied Research) International Journal of Engineering, Business and Enterprise

More information

CLASSIFYING NETWORK TRAFFIC IN THE BIG DATA ERA

CLASSIFYING NETWORK TRAFFIC IN THE BIG DATA ERA CLASSIFYING NETWORK TRAFFIC IN THE BIG DATA ERA Professor Yang Xiang Network Security and Computing Laboratory (NSCLab) School of Information Technology Deakin University, Melbourne, Australia http://anss.org.au/nsclab

More information

escan Anti-Spam White Paper

escan Anti-Spam White Paper escan Anti-Spam White Paper Document Version (esnas 14.0.0.1) Creation Date: 19 th Feb, 2013 Preface The purpose of this document is to discuss issues and problems associated with spam email, describe

More information

ANALYZING OF THE EVOLUTION OF WEB PAGES BY USING A DOMAIN BASED WEB CRAWLER

ANALYZING OF THE EVOLUTION OF WEB PAGES BY USING A DOMAIN BASED WEB CRAWLER - 151 - Journal of the Technical University Sofia, branch Plovdiv Fundamental Sciences and Applications, Vol. 16, 2011 International Conference Engineering, Technologies and Systems TechSys 2011 BULGARIA

More information

The Role of Country-based email Filtering In Spam Reduction

The Role of Country-based email Filtering In Spam Reduction The Role of Country-based email Filtering In Spam Reduction Using Country-of-Origin Technique to Counter Spam email By Updated November 1, 2007 Executive Summary The relocation of spamming mail servers,

More information

Detecting Spam Bots in Online Social Networking Sites: A Machine Learning Approach

Detecting Spam Bots in Online Social Networking Sites: A Machine Learning Approach Detecting Spam Bots in Online Social Networking Sites: A Machine Learning Approach Alex Hai Wang College of Information Sciences and Technology, The Pennsylvania State University, Dunmore, PA 18512, USA

More information

Using Data Mining for Mobile Communication Clustering and Characterization

Using Data Mining for Mobile Communication Clustering and Characterization Using Data Mining for Mobile Communication Clustering and Characterization A. Bascacov *, C. Cernazanu ** and M. Marcu ** * Lasting Software, Timisoara, Romania ** Politehnica University of Timisoara/Computer

More information

How To Detect Denial Of Service Attack On A Network With A Network Traffic Characterization Scheme

How To Detect Denial Of Service Attack On A Network With A Network Traffic Characterization Scheme Efficient Detection for DOS Attacks by Multivariate Correlation Analysis and Trace Back Method for Prevention Thivya. T 1, Karthika.M 2 Student, Department of computer science and engineering, Dhanalakshmi

More information

We Know It Before You Do: Predicting Malicious Domains

We Know It Before You Do: Predicting Malicious Domains We Know It Before You Do: Predicting Malicious Domains Abstract Malicious domains play an important role in many attack schemes. From distributing malware to hosting command and control (C&C) servers and

More information

Data Mining in Web Search Engine Optimization and User Assisted Rank Results

Data Mining in Web Search Engine Optimization and User Assisted Rank Results Data Mining in Web Search Engine Optimization and User Assisted Rank Results Minky Jindal Institute of Technology and Management Gurgaon 122017, Haryana, India Nisha kharb Institute of Technology and Management

More information

Malware Detection in Android by Network Traffic Analysis

Malware Detection in Android by Network Traffic Analysis Malware Detection in Android by Network Traffic Analysis Mehedee Zaman, Tazrian Siddiqui, Mohammad Rakib Amin and Md. Shohrab Hossain Department of Computer Science and Engineering, Bangladesh University

More information

WEB APPLICATION FIREWALL

WEB APPLICATION FIREWALL WEB APPLICATION FIREWALL CS499 : B.Tech Project Final Report by Namit Gupta (Y3188) Abakash Saikia (Y3349) under the supervision of Dr. Dheeraj Sanghi submitted to Department of Computer Science and Engineering

More information

Index Terms Domain name, Firewall, Packet, Phishing, URL.

Index Terms Domain name, Firewall, Packet, Phishing, URL. BDD for Implementation of Packet Filter Firewall and Detecting Phishing Websites Naresh Shende Vidyalankar Institute of Technology Prof. S. K. Shinde Lokmanya Tilak College of Engineering Abstract Packet

More information

A Software Framework for Spam Campaign Detection and Analysis

A Software Framework for Spam Campaign Detection and Analysis A Software Framework for Spam Campaign Detection and Analysis Prepared by: Dr. Mourad Debbabi Mr. Francis Fortin Son Dinh Taher Azab Scientific Authority: Rodney Howes DRDC Centre for Security Science

More information

Trend Micro Hosted Email Security Stop Spam. Save Time.

Trend Micro Hosted Email Security Stop Spam. Save Time. Trend Micro Hosted Email Security Stop Spam. Save Time. How Hosted Email Security Inbound Filtering Adds Value to Your Existing Environment A Trend Micro White Paper l March 2010 1 Table of Contents Introduction...3

More information

Adaptive Discriminating Detection for DDoS Attacks from Flash Crowds Using Flow. Feedback

Adaptive Discriminating Detection for DDoS Attacks from Flash Crowds Using Flow. Feedback Adaptive Discriminating Detection for DDoS Attacks from Flash Crowds Using Flow Correlation Coeff icient with Collective Feedback N.V.Poorrnima 1, K.ChandraPrabha 2, B.G.Geetha 3 Department of Computer

More information

CYBER SCIENCE 2015 AN ANALYSIS OF NETWORK TRAFFIC CLASSIFICATION FOR BOTNET DETECTION

CYBER SCIENCE 2015 AN ANALYSIS OF NETWORK TRAFFIC CLASSIFICATION FOR BOTNET DETECTION CYBER SCIENCE 2015 AN ANALYSIS OF NETWORK TRAFFIC CLASSIFICATION FOR BOTNET DETECTION MATIJA STEVANOVIC PhD Student JENS MYRUP PEDERSEN Associate Professor Department of Electronic Systems Aalborg University,

More information

Web Forensic Evidence of SQL Injection Analysis

Web Forensic Evidence of SQL Injection Analysis International Journal of Science and Engineering Vol.5 No.1(2015):157-162 157 Web Forensic Evidence of SQL Injection Analysis 針 對 SQL Injection 攻 擊 鑑 識 之 分 析 Chinyang Henry Tseng 1 National Taipei University

More information

CHAPTER 1 INTRODUCTION

CHAPTER 1 INTRODUCTION 21 CHAPTER 1 INTRODUCTION 1.1 PREAMBLE Wireless ad-hoc network is an autonomous system of wireless nodes connected by wireless links. Wireless ad-hoc network provides a communication over the shared wireless

More information

Large-Scale Data Sets Clustering Based on MapReduce and Hadoop

Large-Scale Data Sets Clustering Based on MapReduce and Hadoop Journal of Computational Information Systems 7: 16 (2011) 5956-5963 Available at http://www.jofcis.com Large-Scale Data Sets Clustering Based on MapReduce and Hadoop Ping ZHOU, Jingsheng LEI, Wenjun YE

More information

KEITH LEHNERT AND ERIC FRIEDRICH

KEITH LEHNERT AND ERIC FRIEDRICH MACHINE LEARNING CLASSIFICATION OF MALICIOUS NETWORK TRAFFIC KEITH LEHNERT AND ERIC FRIEDRICH 1. Introduction 1.1. Intrusion Detection Systems. In our society, information systems are everywhere. They

More information

How To Filter Spam Image From A Picture By Color Or Color

How To Filter Spam Image From A Picture By Color Or Color Image Content-Based Email Spam Image Filtering Jianyi Wang and Kazuki Katagishi Abstract With the population of Internet around the world, email has become one of the main methods of communication among

More information

SPAM FILTER Service Data Sheet

SPAM FILTER Service Data Sheet Content 1 Spam detection problem 1.1 What is spam? 1.2 How is spam detected? 2 Infomail 3 EveryCloud Spam Filter features 3.1 Cloud architecture 3.2 Incoming email traffic protection 3.2.1 Mail traffic

More information

A Study on the Live Forensic Techniques for Anomaly Detection in User Terminals

A Study on the Live Forensic Techniques for Anomaly Detection in User Terminals A Study on the Live Forensic Techniques for Anomaly Detection in User Terminals Ae Chan Kim 1, Won Hyung Park 2 and Dong Hoon Lee 3 1 Dept. of Financial Security, Graduate School of Information Security,

More information

EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set

EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set Amhmed A. Bhih School of Electrical and Electronic Engineering Princy Johnson School of Electrical and Electronic Engineering Martin

More information

Denial of Service Attack Detection Using Multivariate Correlation Information and Support Vector Machine Classification

Denial of Service Attack Detection Using Multivariate Correlation Information and Support Vector Machine Classification International Journal of Computer Sciences and Engineering Open Access Research Paper Volume-4, Issue-3 E-ISSN: 2347-2693 Denial of Service Attack Detection Using Multivariate Correlation Information and

More information

COAT : Collaborative Outgoing Anti-Spam Technique

COAT : Collaborative Outgoing Anti-Spam Technique COAT : Collaborative Outgoing Anti-Spam Technique Adnan Ahmad and Brian Whitworth Institute of Information and Mathematical Sciences Massey University, Auckland, New Zealand [Aahmad, B.Whitworth]@massey.ac.nz

More information

A SYSTEM FOR DENIAL OF SERVICE ATTACK DETECTION BASED ON MULTIVARIATE CORRELATION ANALYSIS

A SYSTEM FOR DENIAL OF SERVICE ATTACK DETECTION BASED ON MULTIVARIATE CORRELATION ANALYSIS Journal homepage: www.mjret.in ISSN:2348-6953 A SYSTEM FOR DENIAL OF SERVICE ATTACK DETECTION BASED ON MULTIVARIATE CORRELATION ANALYSIS P.V.Sawant 1, M.P.Sable 2, P.V.Kore 3, S.R.Bhosale 4 Department

More information

Protecting the Infrastructure: Symantec Web Gateway

Protecting the Infrastructure: Symantec Web Gateway Protecting the Infrastructure: Symantec Web Gateway 1 Why Symantec for Web Security? Flexibility and Choice Best in class hosted service, appliance, and virtual appliance (upcoming) deployment options

More information

SEO Techniques for various Applications - A Comparative Analyses and Evaluation

SEO Techniques for various Applications - A Comparative Analyses and Evaluation IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661,p-ISSN: 2278-8727 PP 20-24 www.iosrjournals.org SEO Techniques for various Applications - A Comparative Analyses and Evaluation Sandhya

More information

Towards Online Spam Filtering in Social Networks

Towards Online Spam Filtering in Social Networks Towards Online Spam Filtering in Social Networks Hongyu Gao Northwestern University Evanston, IL, USA hygao@u.northwestern.edu Diana Palsetia Northwestern University Evanston, IL, USA palsetia@u.northwestern.edu

More information

MINIMIZING THE TIME OF SPAM MAIL DETECTION BY RELOCATING FILTERING SYSTEM TO THE SENDER MAIL SERVER

MINIMIZING THE TIME OF SPAM MAIL DETECTION BY RELOCATING FILTERING SYSTEM TO THE SENDER MAIL SERVER MINIMIZING THE TIME OF SPAM MAIL DETECTION BY RELOCATING FILTERING SYSTEM TO THE SENDER MAIL SERVER Alireza Nemaney Pour 1, Raheleh Kholghi 2 and Soheil Behnam Roudsari 2 1 Dept. of Software Technology

More information

An Efficient Way of Denial of Service Attack Detection Based on Triangle Map Generation

An Efficient Way of Denial of Service Attack Detection Based on Triangle Map Generation An Efficient Way of Denial of Service Attack Detection Based on Triangle Map Generation Shanofer. S Master of Engineering, Department of Computer Science and Engineering, Veerammal Engineering College,

More information

Evaluation of Anti-spam Method Combining Bayesian Filtering and Strong Challenge and Response

Evaluation of Anti-spam Method Combining Bayesian Filtering and Strong Challenge and Response Evaluation of Anti-spam Method Combining Bayesian Filtering and Strong Challenge and Response Abstract Manabu IWANAGA, Toshihiro TABATA, and Kouichi SAKURAI Kyushu University Graduate School of Information

More information

Social Media Mining. Data Mining Essentials

Social Media Mining. Data Mining Essentials Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers

More information

Network Based Intrusion Detection Using Honey pot Deception

Network Based Intrusion Detection Using Honey pot Deception Network Based Intrusion Detection Using Honey pot Deception Dr.K.V.Kulhalli, S.R.Khot Department of Electronics and Communication Engineering D.Y.Patil College of Engg.& technology, Kolhapur,Maharashtra,India.

More information

Towards Proactive Spam Filtering (Extended Abstract)

Towards Proactive Spam Filtering (Extended Abstract) Towards Proactive Spam Filtering (Extended Abstract) Jan Göbel Thorsten Holz Philipp Trinius {goebel holz trinius}@informatik.uni-mannheim.de Laboratory for Dependable Distributed Systems University of

More information

Clustering Technique in Data Mining for Text Documents

Clustering Technique in Data Mining for Text Documents Clustering Technique in Data Mining for Text Documents Ms.J.Sathya Priya Assistant Professor Dept Of Information Technology. Velammal Engineering College. Chennai. Ms.S.Priyadharshini Assistant Professor

More information

A Content based Spam Filtering Using Optical Back Propagation Technique

A Content based Spam Filtering Using Optical Back Propagation Technique A Content based Spam Filtering Using Optical Back Propagation Technique Sarab M. Hameed 1, Noor Alhuda J. Mohammed 2 Department of Computer Science, College of Science, University of Baghdad - Iraq ABSTRACT

More information

Load Distribution in Large Scale Network Monitoring Infrastructures

Load Distribution in Large Scale Network Monitoring Infrastructures Load Distribution in Large Scale Network Monitoring Infrastructures Josep Sanjuàs-Cuxart, Pere Barlet-Ros, Gianluca Iannaccone, and Josep Solé-Pareta Universitat Politècnica de Catalunya (UPC) {jsanjuas,pbarlet,pareta}@ac.upc.edu

More information

Comprehensive Malware Detection with SecurityCenter Continuous View and Nessus. February 3, 2015 (Revision 4)

Comprehensive Malware Detection with SecurityCenter Continuous View and Nessus. February 3, 2015 (Revision 4) Comprehensive Malware Detection with SecurityCenter Continuous View and Nessus February 3, 2015 (Revision 4) Table of Contents Overview... 3 Malware, Botnet Detection, and Anti-Virus Auditing... 3 Malware

More information

Intercept Anti-Spam Quick Start Guide

Intercept Anti-Spam Quick Start Guide Intercept Anti-Spam Quick Start Guide Software Version: 6.5.2 Date: 5/24/07 PREFACE...3 PRODUCT DOCUMENTATION...3 CONVENTIONS...3 CONTACTING TECHNICAL SUPPORT...4 COPYRIGHT INFORMATION...4 OVERVIEW...5

More information

The What, Why, and How of Email Authentication

The What, Why, and How of Email Authentication The What, Why, and How of Email Authentication by Ellen Siegel: Director of Technology and Standards, Constant Contact There has been much discussion lately in the media, in blogs, and at trade conferences

More information

An Empirical Analysis of Malware Blacklists

An Empirical Analysis of Malware Blacklists An Empirical Analysis of Malware Blacklists Marc Kührer and Thorsten Holz Chair for Systems Security Ruhr-University Bochum, Germany Abstract Besides all the advantages and reliefs the Internet brought

More information

Dynamics of Online Scam Hosting Infrastructure

Dynamics of Online Scam Hosting Infrastructure Dynamics of Online Scam Hosting Infrastructure Maria Konte, Nick Feamster, and Jaeyeon Jung 2 Georgia Institute of Technology 2 Intel Research {mkonte,feamster}@cc.gatech.edu, jaeyeon.jung@intel.com Abstract.

More information

Port evolution: a software to find the shady IP profiles in Netflow. Or how to reduce Netflow records efficiently.

Port evolution: a software to find the shady IP profiles in Netflow. Or how to reduce Netflow records efficiently. TLP:WHITE - Port Evolution Port evolution: a software to find the shady IP profiles in Netflow. Or how to reduce Netflow records efficiently. Gerard Wagener 41, avenue de la Gare L-1611 Luxembourg Grand-Duchy

More information

Phishing Activity Trends Report June, 2006

Phishing Activity Trends Report June, 2006 Phishing Activity Trends Report, 26 Phishing is a form of online identity theft that employs both social engineering and technical subterfuge to steal consumers' personal identity data and financial account

More information

Spamming Chains: A New Way of Understanding Spammer Behavior

Spamming Chains: A New Way of Understanding Spammer Behavior Spamming Chains: A New Way of Understanding Spammer Behavior Pedro H. Calais Guerra Federal University of Minas Gerais (UFMG) pcalais@dcc.ufmg.br Cristine Hoepers Brazilian Network Information Center (NIC.br)

More information

Ipswitch IMail Server with Integrated Technology

Ipswitch IMail Server with Integrated Technology Ipswitch IMail Server with Integrated Technology As spammers grow in their cleverness, their means of inundating your life with spam continues to grow very ingeniously. The majority of spam messages these

More information

A STUDY REGARDING INTER DOMAIN LINKED DOCUMENTS SIMILARITY AND THEIR CONSEQUENT BOUNCE RATE

A STUDY REGARDING INTER DOMAIN LINKED DOCUMENTS SIMILARITY AND THEIR CONSEQUENT BOUNCE RATE STUDIA UNIV. BABEŞ BOLYAI, INFORMATICA, Volume LIX, Number 1, 2014 A STUDY REGARDING INTER DOMAIN LINKED DOCUMENTS SIMILARITY AND THEIR CONSEQUENT BOUNCE RATE DIANA HALIŢĂ AND DARIUS BUFNEA Abstract. Then

More information

Monitoring Large Flows in Network

Monitoring Large Flows in Network Monitoring Large Flows in Network Jing Li, Chengchen Hu, Bin Liu Department of Computer Science and Technology, Tsinghua University Beijing, P. R. China, 100084 { l-j02, hucc03 }@mails.tsinghua.edu.cn,

More information

Report on Cyber Security Alerts Processed by CERT-RO in 2014

Report on Cyber Security Alerts Processed by CERT-RO in 2014 Section III - Cyber-Attacks Evolution and Cybercrime Trends Report on Cyber Security Alerts Processed by CERT-RO in 2014 Romanian National Computer Security Incident Response Team office@cert-ro.eu The

More information

Application of Data Mining based Malicious Code Detection Techniques for Detecting new Spyware

Application of Data Mining based Malicious Code Detection Techniques for Detecting new Spyware Application of Data Mining based Malicious Code Detection Techniques for Detecting new Spyware Cumhur Doruk Bozagac Bilkent University, Computer Science and Engineering Department, 06532 Ankara, Turkey

More information