Detecting spam machines, a Netflow-data based approach


Gert Vliek
February 24, 2009

Chair for Design and Analysis of Communication Systems (DACS)
Faculty of Electrical Engineering, Mathematics and Computer Science (EEMCS)
University of Twente, The Netherlands

Supervisors: Dr. ir. A. Pras, A. Sperotto M.Sc., Dr. R. Sadre


Abstract

Spam is a problem that practically every user encounters. More than 75% of all messages are likely to be spam and this level is still rising. This makes spam prevention a very relevant topic. Most research on spam prevention has focussed on the receiving side: spam filters in clients and receiving servers. This is changing as spam gets more interest from the research community. In recent years, network-level behavior has emerged as another research direction. This thesis focusses on detecting spam machines via Netflow data. Because Netflow only provides information about the communications between hosts and not about the contents of those communications, this is not an easy task.

The aim of this work is to inspect the feasibility of detecting spam machines via Netflow. To reach this goal, a large repository of Netflow data has been studied to find behavior that differentiates spam machines from normal servers. With a few simple assumptions a high number of IPs with suspicious behavior could be found. The behavior displayed by those suspicious IPs has been used to propose a number of criteria for detecting spamming machines. These criteria have been combined to implement an algorithm to detect spam machines via Netflow data.

To validate this algorithm, DNS blacklists and SpamAssassin log files were used with the Netflow data of the University of Twente. The first tests with DNS blacklist validation show that when IPs are picked randomly, around 95% of them are listed in a blacklist. Closer inspection shows that this is because of the high number of IPs with only a single or a few SMTP connections directed at the monitored network. These are probably bots, sending only a low number of spam messages per domain to avoid detection. Because of the lack of data (only Netflow data within the monitored network is available) those IPs cannot be analyzed with Netflow. This is why the focus of this work lies on machines with a higher number (at least 100) of outgoing SMTP connections.

When validating the algorithm itself a few surprising observations were made. Among those, a high percentage of idle time for an IP seems to be by far the most effective criterion. With the help of the validation results, the algorithm has been optimized. The end result is that 99% of the detected machines are positively validated. This result was obtained with Netflow data captured over multiple time spans. Based on those results, we conclude that it is possible to detect spammers using only Netflow data. This has been a feasibility study and there are some open issues. Those have been left for future work.


Preface

This thesis is the end result of my final Master project, executed at the Design and Analysis of Communication Systems (DACS) chair of the University of Twente. At the end of 2007 I was searching for a final Master project which I could do part-time, because I also wanted to work 3 days a week at the company Soltria. After some orientation I quickly came to the conclusion that DACS, and specifically the Netflow assignment, was the best match to my interests. I had already developed an interest in security related topics and first wanted to search for botnets in Netflow. I quickly concluded that this topic was too broad as a starting point. To limit the scope I decided it was better to focus on one of the activities bots will display. Two fellow students were already studying DoS attacks and port scans with Netflow. I decided to focus my research on spam machines, which seemed a real challenge because with Netflow there is not a lot of information to work with. The research took some directions that I did not expect beforehand (for example the amount of work put into building our common Netflow database), but I think I can be content with some nice results.

I have been working on this project only at the end of each week, because from Monday to Wednesday I worked at the company Soltria. This meant that this work took a little longer than expected and that I had a high workload, but I learned a lot during this period, both at work and at the university. There even was some overlap; for example, for both I worked a lot with MySQL. I have enjoyed this period and have seen it as a nice way to get used to a business environment while still being a student.

I would like to thank both Aiko Pras and Remco Lupker for giving me the opportunity to combine work and my master thesis. I also thank my supervisors: Aiko Pras for always being very supportive (and driving me to buy new Apple hardware ;-), Anna Sperotto for helping me structure my research and Ramin Sadre for his help with building our common database. I also thank the people that provided the data and help needed to enable this research, in particular ICTS, SURFnet and Geant for providing the Netflow data, Peter Peters for access to the SpamAssassin log files and Quarantainenet for their insight. And of course the other people in the DACS lab, especially Daan van der Sanden for relaying his first experiences with Netflow, Joris Kinable, with whom I did a lot of work for our common database, and Martijn van Eenennaam and Stephan Roolvink; I enjoyed being the bringer of the end of the week ;-). And last but not least my parents, for always supporting my choices and giving me the opportunities they did.

Gert Vliek
January 2009


Contents

Abstract
Preface
1 Introduction
1.1 Problem description
1.2 Research questions
1.2.1 Main research question
1.2.2 Subquestions
1.3 Approach
1.4 Outline of the thesis
2 Netflow and spam
2.1 Netflow
2.1.1 Flow-based network monitoring with Netflow
2.1.2 Netflow versions
2.2 Spam
2.2.1 The SMTP protocol
2.2.2 Network level spamming techniques
2.2.3 Trends in (anti-)spamming techniques
3 Experimental first steps
3.1 Available data sources
3.2 Differentiating suspicious machines from normal machines
3.2.1 Manually extracting data from the Utwente repository
3.2.2 Efficiently automating data extraction
3.3 Conclusions drawn after the first experimental steps
4 Proposed spam-machine detection algorithm
4.1 Proposed criteria
4.1.1 Acceptance criteria
4.1.2 Ordering criteria
4.2 Complete algorithm implementation
5 Validation
5.1 Validation methods
5.1.1 DNS blacklists
5.1.2 University of Twente SpamAssassin log files
5.2 Validation results
5.3 Validation summary
6 Conclusions
6.1 Summary of results
6.2 Future work
A Summarized proposed algorithm
B First validation results (without probabilities)
C Optimized algorithm after validation results
D Results of the optimized algorithm

List of Figures

- Approach (Figure 1.1)
- All connections to port 25 captured by the Utwente Netflow dataset
- One of the university mail servers
- One of the suspicious mail machines
- SQL processing for the Utwente repository
- Distribution plot of the distinct outgoing SMTP destination IPs per machine (for IPs)
- Distribution plot of idle time
- Distribution plot of the standard deviation of the outgoing traffic per machine
- Distribution plot of five times the standard deviation of the outgoing traffic per machine
- Analysis process
- Spamassassin hitcounts per machine (university mail servers)
- Number of outgoing SMTP connections per IP


1 Introduction

This Chapter starts by explaining the context of this research in the problem description. Then the main research question and the subquestions of this work are stated. Next, the approach taken to answer the research questions is described. The Chapter closes with an outline of this thesis.

1.1 Problem description

Spam is a problem that practically every user encounters. More than 75% of all messages are likely to be spam and this level is still rising [1]. This makes spam prevention a very relevant topic. Most research on spam prevention has focussed on the receiving side: spam filters in clients and receiving servers. A new approach in combatting spam can be the use of Netflow data. Netflow is gaining popularity and can be used for research purposes; spam detection is one such possibility. Because Netflow only provides information about the communications between hosts and not about the contents of those communications, this is not an easy task. By studying a large repository of Netflow data, behavior that differentiates spam machines from normal servers is searched for. With the results, an algorithm for detecting spam machines via Netflow will be implemented. The main aim of this research is to inspect the feasibility of detecting spam machines via Netflow data.

1.2 Research questions

The main research question addresses the feasibility of detecting spam machines via Netflow data. If this is possible, Netflow can be used to address the spam problem at its source: the spamming machines. It could be used as a cheaper solution than scanning all e-mails, or as another factor in current spam detection mechanisms. To answer the main research question a few subquestions are posed. This Section states and explains the research questions.

1.2.1 Main research question

The main research question is:

Is it possible to detect spamming machines via Netflow data gathered at internet routers with a low false positive rate?

Netflow data is highly aggregated; only summarized flow-level data is available. Because of this, it will be difficult to decide whether a certain machine is sending spam messages. With traditional anti-spam measures like Bayesian filters, the content of e-mails is scanned. This is obviously not possible with Netflow. A method to differentiate the flow-level behavior of spam machines from normal traffic has to be devised. Because of the high aggregation of data, the main research question is whether it is possible at all to accomplish this. Because spam detection has in the past proved to be a difficult process, another important question is how reliable the resulting detection algorithm will be. So this research aims to inspect the feasibility of detecting spam machines using Netflow with a low false positive rate.

1.2.2 Subquestions

To answer the main research question a few subquestions are posed. They have been chosen to get a picture of the currently known facts about spammers, to get a sense of the possibilities with Netflow and to develop an algorithm to detect spam machines, which will have to be validated. Those goals each aim to answer a part of the main research question. The subquestions are:

1. How do spammers operate? To differentiate spam behavior from normal behavior, it is important to know how spammers behave on the network layer. The aim of this research question is to obtain an overview of the methods currently known in literature.

2. What are trends in (anti-)spamming techniques? This question will be answered by looking at the currently known trends in spamming techniques and anti-spam research. The aim of this question is to get an insight into the current state of the art of spam-related research relevant to the main research question.

3. How does spamming differ from normal traffic on the level of Netflow data? To detect spam machines via Netflow, the first question is how spamming differs from normal traffic. The first steps in implementing a spam detection mechanism will be to explore the possibilities with Netflow data.

4. Which algorithms can be used to detect spamming machines via Netflow data? After the first criteria to select spamming machines via Netflow are known, a detection algorithm should be developed. Together with the previous research question, this represents the bulk of this research.

5. How can the results be validated? When a detection mechanism is implemented, it is important to validate that it really works. Methods to make sure that the identified spam machines are indeed spamming will have to be found.

6. What are the probabilities for false positives and which cases cause them? When a detection mechanism is implemented, it is important to identify the possible causes of false positives.

1.3 Approach

As little existing literature is available on the specific topic of detecting spam via Netflow data, the literature study will be minimal. The literature study focusses on answering research subquestions 1 and 2. The approach taken to answer subquestions 3-6 is displayed in Figure 1.1. First a Netflow data capture is done. It was chosen to first observe the data to get a sense of what can be detected with Netflow data. As little information about this is found in literature, this should be the first step to get a sense of normal and abnormal behavior. Later on an algorithm can be developed based on those first observations. This is in contrast to Prasanna Desikan et al. [2] and Anirudh Ramachandran et al. [21], where first a theory is devised and later verified. Because this is a feasibility study, it is chosen to first explore the behavior that can be found in Netflow, instead of devising an algorithm beforehand and hoping that suspicious machines can be found with it in the Netflow data.

[Figure 1.1: Approach. Steps shown: Netflow data capture; Observe behavior (Chapter 3); Algorithm implementation (Chapter 4); New data capture; Validation (Chapter 5); Feedback.]

After this first exploration, an algorithm to detect spamming machines using Netflow data is implemented based on what has been observed in the Netflow data. While fine-tuning the detection algorithm, it is important to have a means to validate the results. Validation will be needed to decide whether the detection algorithm really works. Another use of validation is as a tool for fine-tuning. If changing some parameters leads to far fewer positively validated results, those changes are probably degrading the detection algorithm instead of improving it, while changes that lead to more positively validated results have probably improved it. With those goals in mind, two methods for validation of this research were chosen: DNS blacklists and SpamAssassin.

So the validation is used to improve the detection algorithm in an iterative way. After the first data capture is studied and the first version of the algorithm is implemented, the algorithm is validated using a new data capture (a recent dataset is necessary for validation, as explained in Chapter 5). Those results are then iteratively used to improve the algorithm.

1.4 Outline of the thesis

Chapter 2 provides an introduction to Netflow and SMTP and answers subquestions 1 and 2 with a literature study. Figure 1.1 illustrates the set-up of Chapters 3-5.

Chapter 3 focusses on subquestion 3. Also, feasibility in terms of processing time is explored. The Netflow data repositories result in huge database tables. To keep the processing time feasible, efficient ways to process data will have to be used.

After this first exploration, the focus will switch to research subquestion 4: an algorithm to detect spamming machines using Netflow data is implemented, using the experience of the previous exploration step of this research. This is done in an iterative way; the algorithm will be fine-tuned based on the obtained results. Also, validation mechanisms (research subquestion 5) will be used to optimize the mechanism (getting a high percentage of results validated). Those steps are discussed in Chapter 4.

Chapter 5 focusses on subquestions 5 and 6. The validation methods are explained and the validation results are discussed.

Finally, Chapter 6 concludes this research by giving an overview of the results and answering the main research question (Is it possible to detect spamming machines via Netflow data gathered at internet routers with a low false positive rate?). Also, the practical applications of this research and recommendations for further research will be discussed.

2 Netflow and spam

This Chapter starts with an explanation of Netflow. After that, the spam problem in relation to Netflow is discussed. To do this, first the SMTP protocol and its behavior in Netflow are explained in Section 2.2.1. Then the currently known spamming techniques at the network level are discussed in Section 2.2.2, followed by an overview of the current directions of spammers and spam research in Section 2.2.3.

2.1 Netflow

The main data source for this research is Netflow. Netflow represents a relatively lightweight network monitoring tool. If it is possible to use it to detect spamming machines, it could be used to clean up networks or blacklist IPs. It could be used as a cheaper solution than scanning all e-mails for spam, or as another factor in current spam detection mechanisms. This Section explains Netflow, gives a few remarks to keep in mind while using Netflow and discusses the current versions of Netflow.

2.1.1 Flow-based network monitoring with Netflow

Network monitoring can be performed by packet level inspection with tools like TCPDUMP [7]. However, for large packet switching facilities the processing time and disk space requirements become infeasible. Here SNMP [8] is a popular data source for network monitoring. SNMP however aggregates information on a very high level (e.g. network interface throughput or device uptime), so a lot of information is lost. Cisco's Netflow can be positioned between those two extremes (recording every packet or aggregating high level data). It records flow level data; for every flow (TCP, UDP etc.) aggregated data is recorded. Cisco defines a flow as follows [5]:

A flow is identified as a unidirectional stream of packets between a given source and destination - both defined by a network-layer IP address and transport-layer source and destination port numbers. Specifically, a flow is identified as the combination of the following seven key fields:

- Source IP address
- Destination IP address
- Source port number
- Destination port number
- Layer 3 protocol type
- ToS byte
- Input logical interface (ifindex)

Netflow operates by creating cache entries. This means that for every connection between hosts an entry is made in this cache. For each combination of key fields, the stored information includes octets sent, packets sent, start time, end time and TCP flags. The information included depends on the Netflow version and the Netflow set-up. The information is accessible by using the Command Line Interface (CLI) for an immediate view of the Netflow cache, or by exporting Netflow data packets to a Netflow Collector (used in this research) [6].

Flows are exported from the cache when one of the following events occurs:

- A TCP flow is ended by a FIN or RST flag
- A flow has been inactive for a specified time (the inactive timer)
- A flow has been active for a specified time (the active timer)
- The Netflow cache is full

The inactive and active timers can split flows, which means that a specific flow is exported several times. The result is several exported Netflow records which should actually be one and the same flow. This means that those records will have to be merged again if the total aggregated flow data is to be retrieved for the flows that were split up.

Another thing to keep in mind when using Netflow data is that flows are defined as a unidirectional stream of packets. This means that for a successful connection between two hosts, Netflow will export not one but two flows, one for each direction.

2.1.2 Netflow versions

There are several versions of Netflow; Netflow 5 and increasingly Netflow 9 are the most popular versions. Netflow versions 1, 5, 6, 7 and 8 export packets over UDP. This means that packets can be lost, because UDP does not guarantee packet delivery. Lost packets can be detected by checking the sequence numbers in the Netflow packet header. Netflow 9 is designed to be transport protocol independent, but UDP is still commonly used.

Netflow export packets prior to version 9 consist of a header followed by at most 24 (version 1) or 30 (other versions) records for the exported flows. Netflow version 9 is designed to be easily extendible. It makes use of templates that can be defined by the user. Netflow 9 has been submitted to the IETF as an RFC [9].
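
The merging of split flow records described in Section 2.1.1 can be sketched as follows. This is a minimal illustration and not the implementation used in this research: the names FlowKey, FlowRecord, merge_split_flows and the max_gap parameter are hypothetical, and the rule "same key and (nearly) adjacent in time" is an assumed heuristic that can also merge two genuinely separate connections that reuse the same ports.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


# Hypothetical record layout: the field names are illustrative and do not
# correspond to a specific Netflow export format or database schema.
@dataclass(frozen=True)
class FlowKey:
    """The seven key fields that identify a unidirectional flow."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: int   # layer 3 protocol type, e.g. 6 for TCP
    tos: int        # ToS byte
    input_if: int   # input logical interface (ifindex)


@dataclass
class FlowRecord:
    key: FlowKey
    start: float    # start time in seconds
    end: float      # end time in seconds
    packets: int
    octets: int
    tcp_flags: int  # cumulative OR of the TCP flags seen in the flow


def merge_split_flows(records: Iterable[FlowRecord],
                      max_gap: float = 1.0) -> List[FlowRecord]:
    """Merge records with the same key that follow each other within max_gap.

    max_gap is an assumed tolerance (in seconds) between the end of one
    exported record and the start of the next record of the same flow.
    """
    by_key: Dict[FlowKey, List[FlowRecord]] = {}
    for rec in sorted(records, key=lambda r: r.start):
        group = by_key.setdefault(rec.key, [])
        last = group[-1] if group else None
        if last is not None and rec.start - last.end <= max_gap:
            # Continuation of a split flow: extend the merged record.
            last.end = max(last.end, rec.end)
            last.packets += rec.packets
            last.octets += rec.octets
            last.tcp_flags |= rec.tcp_flags
        else:
            # New flow: store a copy so the input records stay untouched.
            group.append(FlowRecord(rec.key, rec.start, rec.end,
                                    rec.packets, rec.octets, rec.tcp_flags))
    return [rec for group in by_key.values() for rec in group]
```

Because flows are unidirectional, the two directions of one connection have different keys (source and destination are swapped) and are therefore never merged with each other by this sketch; pairing them would be a separate step.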

2.2 Spam

This section gives an overview of the results of a literature study on the current state of the art of spam and anti-spam techniques, preceded by a short explanation of the SMTP protocol in the context of Netflow.

2.2.1 The SMTP protocol

To understand what can be observed by Netflow (discussed in Section 2.1) it is important to know how SMTP (Simple Mail Transfer Protocol) works. SMTP is the protocol that is used to send e-mail. End users, however, often make use of protocols like POP3, IMAP or a web-based mail client to retrieve e-mail. This research focusses on the SMTP protocol, because this is the protocol used to send e-mail (or spam). The SMTP protocol is defined in RFC 821 [14].

The SMTP RFC specifies a few different transport protocols on which it can run. In practice however, SMTP uses the TCP protocol on service port 25. This means that for Netflow only port 25 TCP traffic has to be captured to obtain all SMTP traffic. Because Netflow stores flows uni-directionally [6], flows with source port 25 as well as flows with destination port 25 should be captured.

To give an idea of how the SMTP protocol works, the following (slightly modified) example was taken from RFC 821:

S: MAIL FROM:<[email protected]>
R: 250 OK
S: RCPT TO:<[email protected]>
R: 250 OK
S: RCPT TO:<[email protected]>
R: 250 OK
S: DATA
R: 354 Start mail input; end with <CRLF>.<CRLF>
S: Blah blah blah...
S: ...etc. etc. etc.
S: <CRLF>.<CRLF>
R: 250 OK

Here an e-mail is being sent from [email protected] to [email protected] and [email protected]. SMTP is a simple text protocol; after connecting to port 25 via telnet, those commands can be executed to send e-mail. In the example, sender lines are indicated with S and responses from the receiving side are indicated with R. First the originating address is given and after that two destination addresses. Then the data is entered, ended by a line containing only a period (.). More detailed information about SMTP can be read in RFC 821 [14].
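
The capture rule stated earlier in this section (TCP traffic with source or destination port 25) can be sketched as a small filter. This is only an illustration under assumptions: the Flow tuple and the smtp_role function are hypothetical names, and the direction labels assume that the side using port 25 is the mail server.

```python
from typing import NamedTuple, Optional

SMTP_PORT = 25
TCP = 6  # IANA protocol number for TCP


class Flow(NamedTuple):
    """Hypothetical minimal view of a unidirectional flow record."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: int


def smtp_role(flow: Flow) -> Optional[str]:
    """Classify a flow as one direction of an SMTP connection, or None."""
    if flow.protocol != TCP:
        return None
    if flow.dst_port == SMTP_PORT:
        # Packets from the sending machine towards the mail server.
        return "client_to_server"
    if flow.src_port == SMTP_PORT:
        # Responses from the mail server back to the sending machine.
        return "server_to_client"
    return None
```

For example, smtp_role(Flow("10.0.0.1", "192.0.2.5", 40000, 25, 6)) yields "client_to_server", while the matching reply flow with the ports swapped yields "server_to_client". Counting outgoing SMTP connections per machine would then only look at the first kind.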
There are also SMTP extensions for including attachments like pictures or movies; those are outside the scope of this research (Netflow will just see more data being transferred).

For Netflow, the example above is seen as two uni-directional TCP flows. However, those flows do not necessarily represent the sending of a single message! In the example, an e-mail is being sent to two different destinations in one single TCP connection. It is also possible to send more than one message to more than one destination address within one connection. The content of this SMTP message exchange (SMTP is an application layer protocol) is of course not captured by Netflow. The only information available with Netflow is the information summarized in Section 2.1.

2.2.2 Network level spamming techniques

Literature study reveals the most common methods spammers use to send their spam [10] [11] [12] [13]. Abhinav Pathak et al. [11] suggest dividing spammers into LVS (Low-Volume Spammers) and HVS (High-Volume Spammers) for analysis purposes. This distinction will be used in the remainder of this research.

Direct spammers use dedicated hosts from spam-friendly providers. This method is becoming less effective because of DNSBLs (DNS Blacklists). DNSBLs list known spamming IPs; mail servers can check whether a mail sending machine is listed via a simple DNS lookup. If it is listed, the legitimate mail server can decide not to deliver mail sent by this machine, as it probably is spam.

Open relays or proxies are used to send spam anonymously. An open relay is an MTA (Mail Transfer Agent, an SMTP server) that can be used without authentication. This allows such an open relay to be used to relay e-mail. Spammers can use those machines to send spam on their behalf. Because of the obvious possibility of misuse and the actual exploitation by spammers, open relays are becoming rare. They are still being scanned for and used. They are attractive to spammers because of the anonymity. Also, open relays can be an alternative when the spammers' own IP addresses are blacklisted: sending spam via an open relay circumvents the blacklisting. An open relay is used by Abhinav Pathak et al. [11] as a data collection point to research spam behavior.
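
The "simple DNS lookup" used by DNSBLs can be illustrated with a short sketch. By the usual DNSBL convention, the octets of the IPv4 address are reversed and prepended to the zone of the blacklist; if the resulting name resolves, the address is listed. The function names and the zone dnsbl.example.org are hypothetical placeholders, not one of the blacklists used in this research.

```python
import ipaddress
import socket


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the DNS name to query for an IPv4 address in a DNSBL zone."""
    # IPv4Address validates the input and raises ValueError on bad input.
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone


def is_listed(ip: str, zone: str) -> bool:
    """Return True if the DNSBL zone has an address record for this IP.

    Assumption of this sketch: any lookup failure (including network
    problems) is treated as 'not listed'.
    """
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except OSError:
        return False
```

For the address 192.0.2.99 and the placeholder zone, the name that is looked up is 99.2.0.192.dnsbl.example.org. A real deployment would have to respect the usage policy of the blacklist operator, especially when many addresses are checked.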
Because mass spamming mechanisms present obviously suspicious behavior and blacklisting is an effective defense against direct spammers, botnets are becoming more popular. A botnet consists of a high number of bots (machines infected with malicious software) commanded by a controller through some kind of C&C (Command & Control) infrastructure. In the past the C&C was mostly IRC, but because this is traceable the C&C infrastructure is changing to other (often encrypted) protocols. Botnets in themselves are, just like spam, a security problem. A lot of research is focussed on botnets (e.g. [12], [25] and [26]). Li Zhuang et al. [12] take the unique approach of finding botnets using spam records.

The methods described in the paragraphs above (direct spamming, open relays or proxies and botnets, as classified by Anirudh Ramachandran et al. [13]) imply what can be seen at the network level of the OSI stack [20] (where Netflow operates). Direct spammers and open relays or proxies probably generate a large amount of outgoing SMTP traffic, while botnets will do the same, but distributed over a large number of infected hosts to avoid detection [13].

Of course there are other techniques to consider, but most of them will not immediately influence what can be seen by Netflow because they are implemented at higher layers. Examples are image spam, false SMTP headers and dictionary attacks (to guess usernames).

A supplemental technique that is upcoming is BGP (Border Gateway Protocol) hijacking. This is observed by, among others, Anirudh Ramachandran et al. [13] and Nick Feamster et al. [21]. BGP is used by ASs (Autonomous Systems) to advertise the prefixes they can deliver traffic to. This protocol can be misused to hijack IP address prefixes. This is commonly done via short-lived routing announcements. The surprising thing (at first sight) is that spammers that use this technique use short IP prefixes (resulting in larger blocks of IP addresses!). However, the distribution of IP addresses sending spam suggests that this is done so that mail relays can hop between a large number of IP addresses, avoiding getting blacklisted. Also, route announcements for shorter IP prefixes are less likely to be blocked than announcements for longer IP prefixes [13] [21].

2.2.3 Trends in (anti-)spamming techniques

The main trend concerning the sending of spam that is identified in research is the increasing use of botnets instead of direct spamming [11] [13]. Botnets are becoming very complicated, making them a research topic of their own [10]. BGP hijacking is a supplemental technique, described in Section 2.2.2. As mentioned by Anirudh Ramachandran et al. [13] and Nick Feamster et al. [21], this is becoming a popular method to avoid blacklisting. Because botnet sizes are difficult to estimate and research repositories cover only part of the total picture (a single domain, a single mail server etc.), it is difficult to get exact numbers about the global spam problem.

Most research on spam has traditionally been focussed on the receiving side, in the form of mail filters. Now that the spam problem gets more attention from the research community, the network-level behavior of spammers is also studied, for example by Anirudh Ramachandran et al. [13]. Their research uses a spam honeypot in the form of a spam sinkhole. This is just a mail server with no legitimate e-mail addresses, so that every received e-mail is a spam message. They registered a domain with a DNS Mail Exchange (MX) record in it. The result is still a limited viewpoint on the spam problem, because only spam on a single server on a single domain is recorded.

To get a broader view on the spam problem, Abhinav Pathak et al. [11] propose to set up an open relay sinkhole. The idea here is to set up an open relay in such a way that it can be detected by spammers, but will not really relay mail.
This way information about the sources and the different destinations of spam can be obtained. A few interesting statistics were found:

- 75% of the connecting hosts were already blacklisted in some of the 5 DNSBLs used. For the spammers using open relays, DNSBLs seem to do their job well! [11]
- 25% of the hosts connected only once, more than 75% connected fewer than 10 times and only 0.9% made more than 100 connections; those hosts were responsible for 59% of the total amount of spam. 0.1% made more than 1000 connections, accounting for 43% of spam. The authors conclude that more than 75% of the machines are part of a botnet. Interesting to note is that according to this research, non-botnet spam is still responsible for 59% of the spam. [11]
- Spam campaigns (the sending of one specific spam message to a lot of different recipients) are typically done within an hour for HVS, while for LVS it takes on average 2-3 days to complete a campaign. [11]

Another interesting approach is taken by Nick Feamster et al. [21]. They propose to identify spammers not based on IP address or content filtering, but based on behavioral analysis (much like this research does). As a data source they used the e-mail logs of an organization that manages over 115 domains, to get a multiple-domain overview of spam. To classify spammers, they propose to cluster IPs with similar behavior. Their clustering algorithm takes as input the vector (n, d, t), where n is the number of IP addresses that send e-mail to any of d domains within one of t time windows. The general idea behind this clustering approach is that the bots of a botnet will display similar behavior: bots will probably send a small number of messages to a large number of servers to avoid detection (each domain sees only a small number of messages, avoiding the decision to blacklist the bots). First the algorithm is trained with an initial set of known IPs. After training, new hosts displaying behavior similar to the clustered spamming machines will be classified as spammers. Their validation shows that a small percentage of spam messages from IPs that were not yet blacklisted was detected by the proposed algorithm (called SpamTracker).

Another approach that uses behavior analysis on the network level to detect spamming machines is used by Prasanna Desikan et al. [2]. Their data source is Netflow. This paper describes the only spam research with Netflow that could be found during the literature study of this research. Just like Nick Feamster et al. [21], they see their research as complementary to the spam reduction techniques currently in use. Their proposal is to construct link graphs, in particular bipartite cores. The HITS algorithm eliminates normal traffic (servers which send and receive mail with other servers sending and receiving mail). Then the algorithm identifies machines sending to a high number of other machines that are not in the set of normal traffic machines. The paper also acknowledges the importance of temporal behavior. Their first attempt to model temporal behavior is focussed on single nodes and a few properties that are measured over time. They evaluated their algorithms with Netflow data for a 10 minute and a 3 hour time period on the border network of the University of Minnesota. One machine known to be sending spam was detected.
A completely different approach was taken by Christian Kreibich et al. [10]. They infiltrated the well known Storm botnet by installing 16 instances of Storm bots in virtual machines. To successfully analyze the botnet, the different forms of Storm communication traffic were reverse engineered. Traffic was collected from late December 2007 until early February. Some interesting findings were made about both the botnet itself and its spam mechanisms. Among those findings, it is interesting to note that the overall average spamming rate was 152 spam messages per minute per bot and that the estimated address list size ranges from 100 million to 800 million recipients. Also, a lot of insight was obtained into the mechanisms used to generate spam messages and how they were being sent (the use of dictionaries, campaigns, address harvesting etc.).

3 Experimental first steps

This chapter describes the first experimental steps to detect spamming machines with the Netflow data. To answer the third research question (How does spamming differ from normal traffic on the level of Netflow data?) it was decided to first observe the suspicious behavior that can be found in Netflow data before theorizing about spam detection mechanisms. This is in contrast to Prasanna Desikan et al. [2] and Anirudh Ramachandran et al. [21], where a theory is devised first and verified with the data afterwards.

First the available data sources are stated. Then the first manual inspection of the Netflow data is described and the relevant results and ideas for detecting spamming machines are discussed. Then, to cope with the amount of data, a mechanism for further inspection is discussed. This chapter closes with a short conclusion on this first exploration of the Netflow data.

3.1 Available data sources

For this research three Netflow sources were available:

- University network: This encompasses (almost) all traffic for the University of Twente (the Netherlands). This will be referred to as the Utwente dataset in the remainder of this work.
- SURFnet: This repository contains the traffic for SURFnet, a research network covering the Netherlands.
- GEANT: This repository contains the traffic for GEANT, a research network covering Europe, which also contains the SURFnet traffic if it is not routed via another network.

A database repository was made for two days captured in July 2008 for all three networks. The focus of this research lies on the Utwente dataset, because this network is the most familiar one and a live Netflow data stream is available for it, which could come in handy later on. However, the analyses of this chapter have also been done on the SURFnet and GEANT repositories to see whether similar behavior was found.

3.2 Differentiating suspicious machines from normal machines

The focus of this research is to find spam machines using Netflow data. The Netflow data is put in a MySQL database. Because those repositories are huge in terms of disk space and records (for the Utwente repository the records occupy 49.64 GB for data and 86.91 GB for the index), the next challenge is to use this amount of data in a feasible way. The following approach was taken:

1. Get some initial experience with the available data (what kind of queries are common, what information can be distilled, how long queries take, etc.). This is described in the section "Manually extracting data from the Utwente repository".
2. Try to automate data processing for this specific assignment and create smaller data tables to operate upon, to speed up query processing. This is described in the section "Efficiently automating data extraction".

After those experimental first steps the goal is to devise an algorithm to detect spamming machines, answering the research question "Which algorithms can be used to detect spamming machines via Netflow data?". This will be done in the next chapter.

Manually extracting data from the Utwente repository

In this section the available Netflow data is analyzed to get some experience with how the Netflow data can be used to detect spamming machines. One of the first ideas was to generate a plot of all outgoing connections to port 25 over time; the result can be seen in Figure 3.1.

Figure 3.1: All connections to port 25 captured by the Utwente Netflow dataset

The choice was made to aggregate all traffic in 5-minute time slices. In this plot and all later plots the data points represent 5-minute time spans. So at the start of this plot around

7500 outgoing SMTP connections were made per 5 minutes. The value of 5 minutes was chosen for practical reasons: the precision is good enough while the number of data points stays limited (and with it the processing time, which will be especially important when automating the detection process later on). It is also important to note that an outgoing connection does not correspond to a single mail message. A single SMTP session can be used to send multiple mail messages to multiple users [14]. Also, some outgoing connections to port 25 could be port scans.

Figure 3.1 shows a very irregular plot. There are a few peaks that could be due to spam, but it is difficult to draw any conclusions from this plot. Further research into the connections made during the two highest peaks did not clarify anything.

Because Figure 3.1 does not tell us a lot, it was decided to make a plot of normal SMTP server behavior. The (5 load-balanced) university mail servers were chosen. They all showed the same kind of plot; one of them is displayed in Figure 3.2.

Figure 3.2: One of the university mail servers

As can be seen, this official mail server shows a far more stable plot than the one of all outgoing connections for the whole university domain (Figure 3.1). There are three peaks, but overall there is a stable baseline. The peaks have not been investigated further. They could be explained by mailing list mailings but could also be misuse of the mail server.

Now that we have a picture of outgoing SMTP traffic for a legitimate mail server, the next step is to find and plot the outgoing SMTP traffic of an illegitimate (spamming) mail machine. Finding a legitimate mail server is easy (just pick one of the official university mail servers), but finding an illegitimate one is a little more difficult. To find an illegitimate mail server, a way to identify one is needed.
For this, a simple basic assumption was made to distinguish those machines from legitimate ones: spam machines will send a lot of SMTP traffic, but will receive little or no incoming SMTP traffic on port 25.

Of course, this is only an assumption at this point. There are a few cases in which this assumption will not hold. One such case is a system that is set up to send mass mailings such as a newsletter. This results in a lot of mail being sent, while the machine is probably not configured to receive mail.

To obtain the machines for which the assumption holds, the Utwente repository was queried to group the traffic by source IP over the entire Utwente dataset. For each IP the following metrics were calculated:

- The number of distinct destination servers. These are the distinct destination servers for the outgoing connections with destination port 25.
- The number of outgoing connections. These are all the outgoing connections with destination port 25.
- The number of distinct incoming machines. These are all the distinct IPs that connect to the current machine with destination port 25.
- The number of incoming connections. These are all the incoming connections with destination port 25.

Note that Netflow records the two directions of a connection as separate flows, one in each direction. Also, for the assumption only the numbers of incoming and outgoing SMTP connections are needed. The numbers of distinct source and destination IPs were added because they might shed some light on the behavior of legitimate versus illegitimate machines. Finally, as the aim is to find obvious spam machines, the resulting list is ordered by the number of outgoing connections to port 25 (descending), so that the mass spammers end up on top. The top 15 of that list is shown in Table 3.1.

Table 3.1: Top 15 IPs with highest number of outgoing connections (columns: Ip, dist out, out, dist in, in)

The IPs are left out due

to privacy concerns. The other columns are the number of distinct destinations, outgoing connections, distinct incoming machines and incoming connections respectively, as explained above. The machines matching the assumption that spam machines send a lot but do not receive a lot of traffic are in bold.

For the third bold machine a time plot of the number of outgoing connections with destination port 25 was made, shown in Figure 3.3. In addition, the following lines were plotted:

- Incoming SMTP connections.
- Total response for outgoing connections. This is the number of reverse Netflow flows for the outgoing connections. If no reverse flows are found, no successful connection was made. SMTP is based on TCP, and because of TCP's three-way handshake at least some packets should flow in the other direction if the destination port is open. If there are far fewer responses than outgoing connections, the machine is probably performing a port scan.
- Total response for incoming connections. Same as the total response for outgoing connections, but now for incoming connections.
- Outgoing SMTP connections to different IPs. This is the number of distinct destination servers.

Figure 3.3: One of the suspicious mail machines

As can be seen in the figure, most of the time this machine is not sending anything, but there are some periods with peaks of over 3000 outgoing connections to port 25. Also, most outgoing connections get a response (data flowing in the direction of the originator, represented by the line "Total response for outgoing connections"). This means that this is

probably not a port scan. In that case most connection attempts would probably not get a response. The connections go to a high number of distinct destinations. But because the line representing the distinct destination servers is a bit lower than the line representing the number of outgoing connections, the machine makes more than one connection to a lot of distinct destinations. The reasons for this are not clear: SMTP allows all mail for a destination to be delivered in a single connection, so separate connections for each mail are not required. It could also be that the Netflow timers break the connections into multiple flows.

In summary, this machine displays very suspicious behavior: it has a lot of outgoing traffic to a lot of distinct destinations while receiving only 2 incoming connections. The traffic is also highly irregular: the plot shows long periods in which no SMTP traffic is observed, with sudden very high peaks of outgoing traffic. It is very likely that the first spam machine has been detected! The same behavior was found for most other IPs that have a lot of outgoing connections relative to their incoming connections.

Efficiently automating data extraction

As can be seen in the previous section, a few steps have to be performed to detect suspicious machines. Doing this manually consumes a lot of time and is not very efficient in terms of machine resource usage. After experimenting with tools like flow-tools [3] it was quickly decided that a database is more flexible for research purposes. Because of the large dataset in the database (see the introduction of this chapter) queries take a lot of time to complete. Just going through the whole dataset with a simple query takes around 1.5 hours on average. Doing this for each suspicious machine quickly becomes infeasible. The process should be automated in a more efficient way.
Figure 3.4: SQL processing for the Utwente repository

The first obvious step to improve the performance is to avoid going through all the non-relevant data in the repository. Non-relevant data is every flow that does not have port 25 (SMTP) in the source or destination port field. So the first step is to create a new data table containing only the SMTP flows. This reduces the number of flows considerably, as SMTP represents only about 1.2% of all flows.
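The reduction step can be illustrated outside the database as well. The following is a minimal Python sketch, not the MySQL implementation used in this work; the dictionary layout of a flow record and the field names `src_port` and `dst_port` are assumptions made for the example.

```python
SMTP_PORT = 25

def smtp_flows(flows):
    """Keep only the flows that have port 25 as source or destination port.

    `flows` is a list of dicts with the (assumed) keys "src_port" and
    "dst_port". Flows with source port 25 are the reverse direction of an
    SMTP connection and are kept as well.
    """
    return [f for f in flows if SMTP_PORT in (f["src_port"], f["dst_port"])]

# Made-up example records, for illustration only.
flows = [
    {"src_port": 51000, "dst_port": 25},     # outgoing SMTP connection
    {"src_port": 25, "dst_port": 51000},     # reverse flow of that connection
    {"src_port": 40000, "dst_port": 80},     # web traffic, dropped
    {"src_port": 443, "dst_port": 52000},    # web traffic, dropped
]
smtp = smtp_flows(flows)
smtp_share = len(smtp) / len(flows)
```

All later steps only need to scan the reduced set, which is why this filter comes first.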

The next step is aggregating a few statistics per IP (number of outgoing connections, distinct destination servers, etc.) using the SMTP traffic table. From this traffic table a new table is made with only those IPs satisfying the assumption that spam machines generate a lot of outgoing traffic but do not receive a lot of traffic. The following criterion was chosen:

\[ \frac{\text{incoming SMTP connections}}{\text{outgoing SMTP connections}} < 0.005 \]

The result is that only the machines that were put in bold in the list of IPs from Table 3.1 are selected. So only the IPs with a high number of outgoing SMTP connections and little or no incoming SMTP connections are obtained. The value of 0.005 means that the number of incoming connections may be at most 0.5 percent of the number of outgoing connections. The main idea is to allow only a very small number of incoming connections compared to the outgoing connections. This behavior has been observed a lot in the Netflow data. Another choice could be to select only the IPs with no incoming connections at all (but with a high number of outgoing connections). IPs with a low number of incoming connections would then not be listed, while a lot of such machines were found to behave suspiciously. So the choice was made to use the ratio instead.

The result set is ordered by the number of outgoing connections to port 25. From this set the top 100 machines are used to generate time plot data, which can be used as input for a charting tool like gnuplot [15]. The result is a set of 100 suspicious IPs with time plots.

The described steps are implemented as MySQL stored procedures, using cursors and indexes to reach a high efficiency. The total runtime is around 2 hours. Because of the data reductions, this is only a little longer than generating a plot for a single machine!
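The aggregation and selection can be sketched as follows. This is a minimal Python illustration of the idea, not the stored procedures used in this work; the tuple layout `(source IP, destination IP, destination port)` and all addresses in the example are assumptions.

```python
from collections import defaultdict

SMTP_PORT = 25
MAX_IN_OUT_RATIO = 0.005  # at most 0.5% incoming relative to outgoing

def aggregate_per_ip(flows):
    """Compute the four per-IP metrics from (src_ip, dst_ip, dst_port) tuples."""
    stats = defaultdict(
        lambda: {"dist_out": set(), "out": 0, "dist_in": set(), "in": 0}
    )
    for src, dst, dst_port in flows:
        if dst_port != SMTP_PORT:
            continue  # reverse flows (source port 25) are not new connections
        stats[src]["dist_out"].add(dst)
        stats[src]["out"] += 1
        stats[dst]["dist_in"].add(src)
        stats[dst]["in"] += 1
    return stats

def suspicious_ips(stats, max_ratio=MAX_IN_OUT_RATIO):
    """Select IPs below the incoming/outgoing ratio, most outgoing first."""
    selected = [
        (ip, s["out"])
        for ip, s in stats.items()
        if s["out"] > 0 and s["in"] / s["out"] < max_ratio
    ]
    selected.sort(key=lambda item: item[1], reverse=True)
    return [ip for ip, _ in selected]

# Made-up traffic: 10.0.0.1 sends a lot and receives almost nothing,
# 10.0.0.2 behaves like a mail server that both sends and receives.
flows = [("10.0.0.1", f"192.0.2.{i % 200}", 25) for i in range(400)]
flows += [("198.51.100.7", "10.0.0.1", 25)]
flows += [("10.0.0.2", "203.0.113.5", 25)] * 100
flows += [("203.0.113.5", "10.0.0.2", 25)] * 100

stats = aggregate_per_ip(flows)
suspicious = suspicious_ips(stats)
```

In this example the host with a single outgoing connection (198.51.100.7) also passes the ratio test. Ordering by the number of outgoing connections pushes such hosts to the bottom of the list, so taking only the top of the list keeps the heavy senders.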
The resulting process is illustrated in Figure 3.4.

3.3 Conclusions drawn after the first experimental steps

In the previous section an automated data extraction scheme based on simple rules was developed. The end result is a list of hosts that make a lot of outgoing connections to port 25 but receive no (or relatively few) connections to port 25, ordered by the number of outgoing connections (descending). In addition, data points have been generated as input for plots. Those plots should be inspected to see what behavior a suspicious host displays over time. In the Utwente dataset long periods without any traffic were observed, with sudden peaks in which a lot of outgoing connections to port 25 were made. The same behavior could be found in the SURFnet and GEANT data.

Further investigation of the suspicious IPs is required to decide how to proceed with a spam detection scheme based on Netflow. The most important issue is to keep the false positives to a minimum. A false positive with this mechanism can for example be an SMTP server set up for legitimate mass mailings. Such a server will send a lot of traffic, will not receive a lot and will have sudden peaks. The question is whether that kind of SMTP server is used a lot in practice.

The first experimental steps described in this chapter at least show a promising outlook on detecting spam machines. The next challenge is to decide upon an algorithm that does so.


4 Proposed spam-machine detection algorithm

This chapter describes several criteria to detect spamming machines (based on the experimental results described in the previous chapter) and combines them in a detection algorithm. From now on, the focus will lie on the Utwente Netflow data.

4.1 Proposed criteria

This section proposes the criteria for detecting spam machines via Netflow. The criteria are divided into two categories. This is done for two reasons. Firstly, because of the large number of IPs to be analyzed, it helps to make a first selection on which the more extensive analysis is performed. This way only machines displaying obviously suspicious behavior are analyzed in more detail, which reduces the processing time. Secondly, processing only the machines that are suspicious according to the first criteria gives a smaller result set, excluding machines that are not suspicious. The remaining criteria can then be used to order the already suspicious machines by calculating a probability. If this were done for all machines with SMTP traffic, without a first elimination round, the result would be an unnecessarily large result set and an unnecessarily high processing time. So we have the following two categories of criteria:

1. Acceptance criteria: These are the criteria used to select suspicious machines. If criteria of this type do not hold, the IPs are ignored in further analysis.
2. Ordering criteria: These are the criteria used to order the machines selected by the acceptance criteria. The goal is to order those machines so that the most suspicious machines are ranked on top.

The ordering criteria result in a ranking. It is also possible to extend the acceptance criteria: some machines will pass the acceptance criteria with ease, others only barely. This can be used to classify machines as more or less suspicious.

The criteria are based on the observations in the Netflow data, not on literature study.
As far as is currently known they have not been used in combination with Netflow so far. Some of them, however, can be compared to criteria used in research with other data sources; for example, Anirudh Ramachandran et al. [21] use the number of messages per domain. With Netflow this could be translated to the ratio between outgoing SMTP connections and distinct destination IPs.

Acceptance criteria

The following acceptance criteria are defined:

1. The ratio between incoming and outgoing SMTP connections

This is the first assumption that was made to detect suspicious machines. This criterion was already explained in Chapter 3. The goal is to select only machines with a lot of outgoing SMTP traffic but little or no incoming SMTP traffic. As described in Chapter 3, a lot of suspicious behavior was found with this criterion. The criterion can be summarized with the following formula:

\[ \frac{\text{incoming SMTP connections}}{\text{outgoing SMTP connections}} < 0.005 \tag{4.1} \]

2. The number of distinct destinations

While experimenting with criterion 1, some false positives were found with 1 or 2 distinct destination IPs. These turned out to be logging mechanisms that send log messages via e-mail. Also, while validating with DNS blacklists and SpamAssassin blacklists, more machines were positively validated among those with a high number of distinct destination IPs. Certainly for botnets, this can be explained by the fact that bots send a low number of messages to a high number of different domains, to avoid getting detected or blacklisted [11] [21]. As the university has 5 load-balanced SMTP servers, for the Utwente dataset it was chosen to require more than 5 distinct destinations, to at least exclude most students, who will only use the university SMTP servers to send e-mail. So for the Utwente dataset this criterion can be summarized as follows:

\[ \text{distinct destinations} > 5 \tag{4.2} \]

3. Number of outgoing connections

If only a low number of outgoing SMTP connections is found in Netflow for an IP, it is not very useful to analyze that IP because:

- The aim is to find spam machines, which will have a high number of outgoing connections, as observed in literature.
- Legitimate users will also make a low number of outgoing SMTP connections to send their e-mail; those should certainly not be selected as suspicious.
- Doing an analysis on a few connections is not very useful. It is probably impossible to distinguish legitimate from illegitimate traffic with the limited amount of information available in Netflow (remember that we do not have the content of those messages; only the fact that an SMTP connection was made is known).

So another criterion is to accept only IPs with more than x outgoing connections. The value of x should depend on the goals of the analysis: if the aim is to find mass spammers (HVS) the value of x should be set very high; if the aim is to find bots, the value of x should probably be set lower. For now the value of 30 is chosen. This should at least exclude most normal users (30 connections in 2 days) and still include the bots with enough data to analyze. So for the time being this criterion can be summarized as:

\[ \text{number of outgoing connections} > 30 \tag{4.3} \]
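The three acceptance criteria (4.1) to (4.3) can be combined into a single check. The following Python sketch only illustrates the logic; the class `SmtpStats` and the function `is_accepted` are hypothetical names introduced for this example, and the example hosts are made up.

```python
from dataclasses import dataclass

MAX_IN_OUT_RATIO = 0.005       # equation (4.1)
MIN_DISTINCT_DESTINATIONS = 5  # equation (4.2), value for the Utwente dataset
MIN_OUTGOING = 30              # equation (4.3)

@dataclass
class SmtpStats:
    """Aggregated SMTP statistics of one IP (hypothetical record layout)."""
    outgoing: int
    incoming: int
    distinct_destinations: int

def is_accepted(s: SmtpStats) -> bool:
    """Return True if all three acceptance criteria hold for this IP."""
    if s.outgoing <= MIN_OUTGOING:
        return False  # (4.3); this also guards the division below
    if s.distinct_destinations <= MIN_DISTINCT_DESTINATIONS:
        return False  # (4.2)
    return s.incoming / s.outgoing < MAX_IN_OUT_RATIO  # (4.1)

bursty_sender = SmtpStats(outgoing=3000, incoming=2, distinct_destinations=2500)
logging_host = SmtpStats(outgoing=5000, incoming=0, distinct_destinations=2)
student_pc = SmtpStats(outgoing=12, incoming=0, distinct_destinations=5)
mail_server = SmtpStats(outgoing=1000, incoming=900, distinct_destinations=400)
```

Checking the cheap count-based criteria first means the ratio is only computed for IPs with more than 30 outgoing connections, so no division by zero can occur.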

Ordering criteria

The ordering criteria will be used in the complete spam detection algorithm to order the results. To accomplish this, each ordering criterion results in a value between 0 and 1. Those values are represented by the variables a through e, which will be referred to when presenting the complete algorithm in Section 4.2. The higher the value, the more likely it is that a machine is sending spam. So the ordering criteria lead to a ranking.

1. Incoming connections

Acceptance criterion 1 already states that suspicious machines have a low number of incoming SMTP connections compared to their outgoing SMTP connections. This ordering criterion looks at the absolute number of incoming SMTP connections. Only a small number of machines have incoming connections (which has also been observed in the Utwente dataset). For spam machines it is probably not very logical to accept incoming mail, as the goal is to send out mail, not to receive it. Legitimate mail servers, on the other hand, will of course want to accept incoming mail. The conclusion of this reasoning is that machines with incoming SMTP connections are less suspicious than machines without them (of course this reasoning will have to be validated experimentally). This is why this criterion is defined as follows:

\[ a = \begin{cases} 1 & \text{if incoming SMTP connections} = 0 \\ 0 & \text{otherwise} \end{cases} \tag{4.4} \]

Of course, this is exactly the behavior that most people generate if they use a mail client program such as Outlook or Thunderbird set up to send mail via SMTP. They will not receive mail via SMTP because they did not install a mail server; they use protocols like POP3 or IMAP to receive mail from the mail server provided to them by, for example, their internet provider. The machines of those people are obviously not spamming merely because they show this behavior.
Note that this criterion is evaluated on the result set of the combined acceptance criteria, which means that machines that comply with this criterion also have a high number of outgoing SMTP connections to several distinct destination IPs. This will most probably exclude those non-malicious IPs.

2. Number of distinct destinations

Acceptance criterion 2 states that suspicious machines should have at least several destination servers. This ordering criterion tries to make this a bit more concrete. To do this, a plot was made of the number of distinct destination IPs per source IP. For this plot, the following points were taken into consideration:

- Of course, this plot has to be made on the result set of machines that comply with the combined acceptance criteria, as the ordering takes place on this set of IPs.
- Additionally, as the goal is to analyze a 7-day (full week) period at a time, the plot has to span 7 days.
- Also, the previous data capture of the Utwente dataset was, at the time of the analysis, one year in the past; behavior could have changed due to network changes or changes in spam behavior.

Because of those points it was decided to do a new full-week data capture for this plot and the other plots required for the criteria analysis. This capture was started on 2 August 2008 and ran for 7 days.

Figure 4.1: Distribution plot of the distinct outgoing SMTP destination IPs per machine

The resulting plot is shown in Figure 4.1. The plot displays the CDF of the number of distinct destination SMTP servers per source IP. The number of analyzed IPs for this plot was limited to keep processing all the data feasible. The IPs with the most outgoing traffic (of the set of IPs complying with all the acceptance criteria, of course) were chosen. As can be seen, only a small part of the selected IPs (about 1%) has more than ten destinations. Among the suspicious IPs these are assumed to be the most suspicious ones, certainly in combination with the acceptance criteria (only IPs that comply with the acceptance criteria are used for the ordering criteria). Also keep in mind that botnet machines will, according to the literature study (Chapter 2), send spam to a high number of IPs to avoid detection. A few remarks can be made:

- There are 5 load-balanced university mail servers. This means that connecting to around 5 mail servers is not that suspicious for the average PC.
- This does not mean, however, that the IPs that sent to fewer than 10 IPs are not suspicious in combination with the selection criteria and the other ordering criteria.
- Also keep in mind that the selected machines are the machines with the most outgoing connections.
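How a threshold can be read from such a distribution plot is sketched below in Python. The helper names `empirical_cdf` and `share_above` are hypothetical, and the list of destination counts is made up for illustration; it does not reproduce the measured distribution of Figure 4.1.

```python
def empirical_cdf(values):
    """Map each observed value to the fraction of observations <= that value."""
    ordered = sorted(values)
    n = len(ordered)
    cdf = {}
    for rank, value in enumerate(ordered, start=1):
        cdf[value] = rank / n  # for ties, the highest rank is the one kept
    return cdf

def share_above(values, threshold):
    """Fraction of observations strictly larger than the threshold."""
    return sum(1 for v in values if v > threshold) / len(values)

# Made-up numbers of distinct destination IPs for ten source IPs.
distinct_destinations = [1, 1, 2, 3, 5, 5, 6, 8, 9, 40]

cdf = empirical_cdf(distinct_destinations)
fraction_up_to_five = cdf[5]
fraction_above_ten = share_above(distinct_destinations, 10)
```

The same computation, applied to the number of distinct destinations per accepted IP, gives the share of machines above a candidate threshold such as 10.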

To conclude, the small portion of machines that send to more than 10 different IPs are suspicious; the machines that send to 10 or fewer different IPs are less suspicious.

\[ b = \begin{cases} 1 & \text{if distinct destination servers} > 10 \\ 0 & \text{otherwise} \end{cases} \tag{4.5} \]

3. Idle time

As we observed during the first experimental steps, for example in Figure 3.3, some suspicious IPs have a lot of idle time. In this plot, as in most other plots of suspicious machines, all traffic is generated in short time periods with sudden peaks. This raises the question whether idle time can be used to identify suspicious behavior. To answer it, a plot was made of the idle time for the same data capture (and thus the same machines) as described for ordering criterion 2. The result is shown in Figure 4.2.

Figure 4.2: Distribution plot of idle time

As can be seen in the plot, there is indeed a high number of IPs with a lot of idle time. Note that idle time is defined as the percentage of idle 5-minute time frames in 7 days. This follows from the plotting function described in the section on automating data extraction. This percentage will also be high if a low minimum number of outgoing SMTP connections is used in acceptance criterion 3. For this criterion it is assumed that this is not the case. Of course this also depends on the router(s) of the Netflow dataset that is used; for a router of a large network this will probably be less likely than for a router of a small network. Because the top machines that are selected are the ones with the most outgoing traffic, it is suspicious if traffic is only sent in a short time period. A normal mail server will have a kind of baseline (as seen in the plots of the official mail servers,

e.g. Figure 3.2). Because of this, the higher the percentage of idle time, the more suspicious the machine:

\[ c = \text{percentage of idle time} \tag{4.6} \]

Like all ordering criteria, c is expressed as a value between 0 and 1.

4. Standard deviation

As shown in Figure 3.3, compared to Figure 3.2, suspicious machines have an irregular pattern. A simplistic way to detect this is to take the standard deviation of the 5-minute plot points in the outgoing SMTP connections plot. The same dataset as for the previous two ordering criteria is used to plot this standard deviation in Figure 4.3.

Figure 4.3: Distribution plot of the standard deviation of the outgoing traffic per machine

As can be seen, for most machines the standard deviation is below 1. This is not suspicious. The idea here is to find machines that have a large standard deviation, which means that the points in the plot are far from the mean µ, in contrast to a small standard deviation, which means that most points are clustered closely around the mean µ. Closely clustered points represent a baseline such as a normal mail server should have. Because of this, it was decided that values of σ larger than 1 are suspicious, which holds for around 10% of the machines.

\[ d = \begin{cases} 1 & \text{if } \sigma > 1 \\ 0 & \text{otherwise} \end{cases} \tag{4.7} \]

5. Peak behavior

In the previous criterion (4) the standard deviation was used to detect whether there is a nice baseline, like a normal server should have. This does not yet detect sudden peaks, as encountered in the previous analyses (e.g. Figure 3.3). A complicated

peak detection mechanism could be used, but because of the high number of IPs that have to be analyzed (currently manually limited to a fixed number of machines, which takes around three days to complete) this would become infeasible due to the amount of processing time required. So a simplistic approach was chosen. To detect sudden peaks, the number of data points (5-minute time spans) larger than five times the standard deviation plus the mean value is counted. This should detect peaks that are relevant compared to the average µ. The value of 5 was chosen based on some experimentation, such that this mechanism is not too sensitive but still sensitive enough to detect relatively high peaks. A distribution plot (same dataset as the previous criteria plots) of this measure is shown in Figure 4.4.

Figure 4.4: Distribution plot of five times the standard deviation of the outgoing traffic per machine

Based on this plot and the reasoning above, machines with more than 10 peaks were chosen to be suspicious.

\[ e = \begin{cases} 1 & \text{if } \#(\text{data points} > 5\sigma + \mu) > 10 \\ 0 & \text{otherwise} \end{cases} \tag{4.8} \]

4.2 Complete algorithm implementation

Section 4.1 describes the criteria chosen to detect spamming machines. This section concludes this description by summarizing the complete algorithm in the context of the developed implementation. To summarize, there are basically two stages of the spam machine detection algorithm:

1. Select only those machines for which all acceptance criteria hold (equations 4.1 to 4.3).
2. For those machines calculate a confidence probability. This confidence probability indicates how likely it is that a machine is spamming. It is based on the ordering criteria (equations 4.4 to 4.8). The weighted average of those criteria is calculated as follows (the weights are equal because we do not yet know how well each individual criterion works):

\[ \Pr\{\text{spam machine confidence}\} = \tfrac{1}{5}\,(a + b + c + d + e) \tag{4.9} \]

The result set is ordered by the resulting probability (descending). An overview of the complete algorithm is provided in Appendix A.

Figure 4.5: Analysis process

To implement this algorithm with actual Netflow data a few things have to be taken into consideration:

- Firstly, DNS blacklists will be used for validation. DNS blacklists are very time sensitive, which means that it is not practical to use an old data capture.
- As already stated in Chapter 3, a Netflow data capture generates a large amount of data. Processing this data takes a large amount of processing time, so the analysis should be done efficiently.
- It should be easy to run the resulting implementation multiple times, to experiment with the parameters and to observe the results that can be found over a longer time period.

Because of those points, it was chosen to extend the already efficient mechanism described earlier and to let it operate on a live feed of the University Netflow collector. Everything is automated, so that it is easy to run a new analysis on a new data capture period.

While experimenting with the first versions of this implementation it immediately became clear that doing a time analysis (the analysis of the 5-minute time-span data points) on all machines generating SMTP traffic was far too expensive in terms of required processing time. Because the time analysis is not needed for the acceptance criteria but only for the ordering criteria, it was decided to add a criterion to the acceptance criteria to limit the time required for a complete run: the top machines of the result set of the other acceptance criteria (which is ordered by the number of outgoing connections) are picked for analysis. This criterion was chosen because the suspicious IPs generating the most traffic are the most interesting ones.

The resulting algorithm run on the Netflow data is as follows:

1. Obtain only the SMTP traffic from the Netflow data.
2. Group the aggregate information by IP and order it by the number of outgoing SMTP connections (descending).
3. Select only those IPs where:
   (a) the ratio between incoming and outgoing SMTP connections is less than 0.005%,
   (b) the number of different destinations is > 5,
   (c) the number of outgoing connections is > 30,
   (d) and of those IPs, get the top IPs.
4. Add the necessary information via time plots to the aggregate table (use a cursor in SQL to speed this up).
5. Calculate a spam machine confidence probability based on a weighted average of the ordering criteria (Formula 4.9).
6. Order the now smaller aggregation result set by the calculated probability (descending).

The complete analysis set-up is displayed in Figure 4.5. The right block shows the steps described above (the algorithm itself). The left blocks describe a few additional steps. The first two steps process and import SpamAssassin data (used for validation), which is described in Chapter 5. Also, DNS blacklists are queried for the top 100 most suspicious IPs. This is also described in Chapter 5. Gnuplot [15] is used to generate plots of the 5-minute time-span data points for the result set of IPs. The last step combines the results (statistics, probabilities, validation) for the top 100 most suspicious IPs.
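The selection in step 3 of the run above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the MySQL implementation used in this work: the `HostStats` record, the function names and the `top_n` parameter are hypothetical, and the number of top IPs is left to the caller because it is not fixed here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class HostStats:
    """Per-IP SMTP aggregate (hypothetical record, not the thesis schema)."""
    ip: str
    incoming: int               # incoming SMTP connections
    outgoing: int               # outgoing SMTP connections
    distinct_destinations: int  # distinct SMTP destinations


# Thresholds taken from step 3 of the run described in the text.
MAX_IN_OUT_RATIO = 0.005 / 100   # "less than 0.005%"
MIN_DISTINCT_DESTINATIONS = 5    # strictly more than 5
MIN_OUTGOING = 30                # strictly more than 30


def passes_acceptance(host: HostStats) -> bool:
    """Return True when criteria (a), (b) and (c) all hold for this host."""
    if host.outgoing <= MIN_OUTGOING:
        return False
    if host.distinct_destinations <= MIN_DISTINCT_DESTINATIONS:
        return False
    return host.incoming / host.outgoing < MAX_IN_OUT_RATIO


def select_candidates(hosts: List[HostStats], top_n: int) -> List[HostStats]:
    """Apply (a)-(c), order by outgoing connections, keep the top_n hosts (d)."""
    accepted = [h for h in hosts if passes_acceptance(h)]
    accepted.sort(key=lambda h: h.outgoing, reverse=True)
    return accepted[:top_n]


if __name__ == "__main__":
    sample = [
        HostStats("192.0.2.1", 0, 500, 40),
        HostStats("192.0.2.2", 400, 450, 30),   # too many incoming connections
        HostStats("192.0.2.3", 0, 700, 100),
    ]
    for host in select_candidates(sample, top_n=2):
        print(host.ip, host.outgoing)
```

Filtering before the time analysis mirrors the design choice in the text: the expensive 5-minute analysis only has to run on the hosts that survive this cheap aggregate check.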


5 Validation

Identifying spam is not an exact science and has many caveats. None of the methods currently in use results in zero false positives and zero false negatives. This certainly holds for the algorithm presented in this work, and also for the validation methods. Nevertheless, an important part of this research is to find ways to validate the results. Because only Netflow data is available, it is difficult to make hard statements about strange behavior found in the repositories: only flow-level data is observed, so the mail bodies themselves cannot be inspected to identify spam messages.

This chapter discusses possibilities to validate the conclusions drawn in the previous steps of this research. Two methods were chosen: DNS blacklists and SpamAssassin log files. Both methods are discussed first, after which the validation results are presented.

5.1 Validation methods

This section first explains DNS blacklists and lists the ones that were used for this research. The disadvantages of using DNS blacklists as a means of validation are also discussed. The second part of this section explains SpamAssassin and the limitations of using SpamAssassin log files as a means of validation.

DNS blacklists

A very simple technique to prevent the delivery of spam is to use a blacklist. A blacklist is a list of IPs known to send spam. Such blacklists are commonly served over DNS: with a simple DNS lookup at the machine providing the blacklist, an IP can be checked for an entry in the blacklist.

The reliability of blacklists varies a lot. Some blacklists block half the internet and others are rarely updated. This makes it necessary to carefully consider which DNS blacklists are suited for validation. Al Iverson [17] and Jeff Makey [19] both benchmark most of the popular DNS blacklists regularly and generate reports from which reliable DNS blacklists can be chosen.
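The DNS lookup mentioned above follows a common convention: the octets of the IPv4 address are reversed and prepended to the blacklist zone, and the resulting name is resolved. The sketch below illustrates that convention only; the function names are hypothetical and this is not the tooling used in this work. Treating any successful answer as a listing is a simplification, since individual blacklists may use specific return codes.

```python
import ipaddress
import socket
from typing import Callable, Iterable


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the DNS name to look up for an IPv4 address in a blacklist zone."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")  # validates the address
    return ".".join(reversed(octets)) + "." + zone


def is_listed(ip: str, zone: str) -> bool:
    """Return True if the blacklist zone returns an answer for this IP.

    Simplification: a failed resolution (including a plain "no such name")
    is treated as "not listed".
    """
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
    except socket.gaierror:
        return False
    return True


def count_listings(ip: str, zones: Iterable[str],
                   lookup: Callable[[str, str], bool] = is_listed) -> int:
    """Count how many of the given blacklist zones list the IP."""
    return sum(1 for zone in zones if lookup(ip, zone))


if __name__ == "__main__":
    # 192.0.2.99 is a documentation address, used here only to show the name.
    print(dnsbl_query_name("192.0.2.99", "zen.spamhaus.org"))
```

Counting listings rather than returning a single yes/no fits a validation against many blacklists, where an IP listed by several of them is a stronger indication than a single hit. The `lookup` parameter makes it possible to test the counting logic without network access.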
The set of reliable blacklists changes over time, so it is a good idea to reconsider the chosen blacklists regularly based on the most recent findings.

For validating this work DNS blacklists are an easy match, because both DNS blacklists and the Netflow detection algorithm aim to list suspicious IPs. It was chosen to do a conservative and an optimistic validation. The conservative validation is done via 5 blacklists found to be reliable according to Al Iverson [17]. The following blacklists were chosen:

- zen.spamhaus.org
- bl.spamcop.net
- safe.dnsbl.sorbs.net
- psbl.surriel.com
- dnsbl.njabl.org

For the optimistic validation the top 25 blacklists from Jeff Makey [19] were chosen. For this list the probability of false positives is large, but if an IP is listed by several of those blacklists it is still possible to identify probable spamming machines.

One problem with DNS blacklists is that they are updated frequently: new IPs are added and false positives are removed. With the rise of botnet spam those updates become even more frequent, because new bots are added every day. Anirudh Ramachandran et al. [16] did research on the responsiveness and completeness of DNS blacklisting for the spamming machines of the Bobax botnet. The results show that DNS blacklists are slow to respond in the case of bots sending spam and that only a small percentage of those bots is listed. This means that bots sending spam will be difficult to validate using DNS blacklists.

University of Twente SpamAssassin log files

SpamAssassin is a very popular rule-based open-source spam filter. It can use various techniques, which in the end result in a score. If the score is above an adjustable threshold, a message is labeled as spam. To determine the score, points can be added (in case of a DNS blacklist listing, known spam signatures, etc.) and subtracted (for example when a sender is whitelisted) [18].

The University of Twente uses SpamAssassin on its main university mail servers, which are set up as 5 load-balanced mail servers. Figure 3.2 shows the load of one of those servers over time. A plot of the hitcount per machine is shown in Figure 5.1. The hitcount is defined as the number of messages sent by a certain machine that are labeled as spam by SpamAssassin.
As can be seen, most machines send fewer than ten pieces of spam to the university mail servers, and about half of those send only a single piece of spam. The machines that send fewer than 10 pieces of spam in this 7-day period will be impossible to detect via Netflow, as there is not enough data to base an analysis on.

Figure 5.1: SpamAssassin hitcounts per machine (university mail servers)

There are some limitations when using the university SpamAssassin log files. Not all spam on the university network passes through SpamAssassin, because not all SMTP traffic goes through the university mail servers. This means that only a subset of the Netflow SMTP traffic passes through SpamAssassin. Also, SpamAssassin results are known to contain (a small number of) false positives.

Another thing to keep in mind is that SpamAssassin detects spam on the level of individual mail messages, so every single e-mail is scanned. With Netflow, spam is detected on the machine level, meaning that spam machines are detected (machines dedicated to sending spam or infected by a bot that is sending spam). Because of this difference it is not possible to conclude that a machine is a spam machine based on SpamAssassin log files alone. An official mail machine, for example, will also send some spam and thus will have some SpamAssassin hits, but this does not make it a spam machine, because the bulk of its e-mail is not spam. However, if an IP is found to be suspicious by the Netflow spam detection algorithm and a large percentage of the messages it sent to the university mail servers is labeled as spam, it is reasonable to assume that the IP is correctly detected as a spam machine.

5.2 Validation results

The first validation was done on the first experimental steps, which are described in Chapter 3. This immediately poses a problem: the data was not recent, and DNS blacklists are, as explained, time sensitive. This is why the results of this first validation are not discussed further in this chapter. The results are, however, available in Appendix B.

After the first experimental steps in Chapter 3, the criteria for the detection mechanism are proposed in Chapter 4. Of course those should be validated. Before doing so, it is interesting to get a sense of what can be expected. So first, 2 sets of 100 random IPs with outgoing SMTP connections were picked from the live Utwente data. Those were validated with the set of 25 blacklists and the set of 5 conservative blacklists listed in Section 5.1. The number of IPs for which validation was negative is shown in Table 5.1.

Run Optimistic Conservative 1 1/100 5/ /100 11/100

Table 5.1: Validation fail rate for the random test run

Figure 5.2: Number of outgoing SMTP connections per IP

So picking random IPs already seems to be a good way to get a high percentage of IPs validated! Note, however, that the goal itself is not to get a high percentage of IPs validated, but to reliably detect spam machines. Randomly picking IPs may yield a high validated percentage, but it does not exclude legitimate mail-sending machines. Still, validation with DNS blacklists becomes difficult when such a high number of machines is already positively validated by random picking. There is not a lot of room for improvement left, so how can it be decided that the algorithm really works? And is it really the case that more than 90% of all IPs with outgoing SMTP connections are spamming?

Figure 5.2 gives some more insight. Of the IPs with outgoing SMTP connections, almost 40% has only one outgoing connection in 7 days' time and about 80% has fewer than 10 outgoing SMTP connections. So only 10% has more than 10 outgoing SMTP connections. Also note that this plot of all e-mail observed in the Utwente dataset shows roughly the same picture as the plot of SpamAssassin hitcounts on the University mail servers (Figure 5.1). Two notes can be made based on those plots:

1. Because about 80% of the IPs with outgoing SMTP connections have fewer than 10 outgoing SMTP connections, those IPs do not have enough connections to base an analysis on. This is already solved by acceptance criterion 3 (Section 4.1.1) of the algorithm. Now the impact of this criterion is known: only a small percentage of the mail-sending IPs is left by it.

2. As stated in the literature study in Section 2.2.3, botnet spam is on the rise. The percentages shown in Figures 5.1 and 5.2 closely resemble those of Abhinav Pathak et al. [11]. As is shown by, among others, Abhinav Pathak et al. [11], bots send a low volume of spam to any single host to avoid detection (LVS, Low Volume Spammers). This could explain what is observed in the plots: a large number of IPs that have only a low number of outgoing connections. These are most probably IPs outside the Utwente domain that send a low volume of spam messages to our domain to avoid detection. This is effective against the proposed Netflow spam detection algorithm with the Utwente data, because probably only a small percentage of all outgoing SMTP traffic of such an IP is known if the IP is outside the Utwente domain. If the Netflow data of the main router(s) of such an IP were available, probably the same low number of outgoing connections per destination, but to a high number of distinct destinations, would be observed, making detection possible again.

To conclude, bots outside the domain of the router(s) generating the Netflow data are very difficult (possibly impossible) to detect with only the Netflow data itself when they have only a low number of outgoing SMTP connections directed at that domain. For bots that are in our domain, probably a high number of distinct SMTP destinations will be observed, which is taken into account by ordering criterion 2 (Section 4.1.2). Because of those observations, another promising criterion to classify spammers seems to be the following ratio:

\[
\frac{\text{Number of outgoing connections}}{\text{Number of distinct destinations}} \tag{5.1}
\]

This has been left as future work. Also, Anirudh Ramachandran et al. [21] have very similar ideas as an approach to detect spamming machines. They used log files as a data source for their research; it would be interesting to also try Netflow data with this approach.

After the first observations mentioned above, a parameter analysis was done on a fresh Netflow data capture of 7 days on the University routers. To avoid the low-volume spammers, acceptance criterion 3 was set to at least 150 outgoing SMTP connections.
The number of unvalidated machines per criterion, and for the combined criteria, is shown in Table 5.2. The results for each row are described below:

1. The first row displays the number of machines that did not validate positively when the top 100 IPs with the highest number of outgoing SMTP connections are selected. As can be seen, 56 and 57 machines did not validate with the optimistic and the conservative DNS blacklist validation, respectively. Note that this is a much higher number of unvalidated machines than with randomly picking IPs! Also note that because the top 100 IPs with the highest number of outgoing SMTP connections are selected as the base, acceptance criterion 3 (minimum number of outgoing connections) is not validated separately, as this set already satisfies that criterion.

2. The second row adds acceptance criterion 1, the ratio between incoming and outgoing SMTP connections. The optimistic validation gives a significant decrease in the number of unvalidated IPs (as expected), while the conservative validation has two more unvalidated IPs, which is a bit strange. No explanation could be found for this. However, a lot of legitimate mailing machines (such as the University mail servers) disappeared from the result set, which is definitely a good thing. Also, with this criterion a high number of IPs with suspicious behavior was found in the first experimental steps, described in Chapter 3. Altogether we conclude that this criterion is certainly a very important part of the algorithm: the optimistic validation and closer analysis both show this, and the conservative result is probably just a fluke.

Table 5.2: Parameter analysis results. The second column displays the results for individual parameters (optimistic and conservative), the third column displays the results for sets of combined parameters (optimistic and conservative).

3. The third row adds acceptance criterion 2, the number of distinct destinations. In this analysis the minimum is set to at least 5 distinct destinations (as the university has 5 load-balanced mail servers). Note that this criterion is added to the base without acceptance criterion 1. The optimistic validation shows a slightly lower number of unvalidated machines, while the conservative validation shows a much higher number of unvalidated machines. This criterion in itself does not seem to be a very good acceptance criterion. However, the idea is to combine it with acceptance criterion 1 to get the IPs with a high number of outgoing SMTP connections to several destinations and a low number of incoming SMTP connections (or none at all).

4. This row combines the criteria from the rows above. This time, both the optimistic and the conservative validation show a significant improvement compared to the base and to the two individual criterion results. The combination really seems to do its work of selecting suspicious machines well. However, there is still a lot of room for improvement (28 and 48 unvalidated machines!).

5. This row shows the validation results for the first order criterion. Because this criterion simply states that suspicious machines have no incoming connections, it has been validated as if it were an acceptance criterion. The results show fewer unvalidated machines, so the assumption that machines without incoming connections are more suspicious seems to be correct.

6. When adding order criterion 1 to the algorithm the results are slightly better, as can be expected.

7. Order criterion 2, which states that there should be at least 10 distinct destinations, is, just like order criterion 1, added as if it were an acceptance criterion. The results are very similar to those of acceptance criterion 2.

8. When adding order criterion 2 the total result improves again, with again slightly fewer unvalidated machines. Certainly when considering the improvement in the conservative validation, order criterion 2 seems to be a valid one.

9. This row shows the most surprising result. Order criterion 3 states that the higher the percentage of idle time (with respect to outgoing SMTP connections), the more suspicious an IP is. Because this is the first (and only) order criterion whose result is not binary (1 or 0) but a percentage, the validation was done by ordering the IPs that pass the acceptance criteria by the percentage of idle time (descending). The results show only 4 unvalidated machines for both the optimistic and the conservative validation! Apparently, ordering the machines that match the acceptance criteria by idle time is a large improvement compared to ordering them by the number of outgoing connections!

10. Combining all criteria so far gives a slight increase in unvalidated results. Note, however, that this is the first validation where the order criteria are averaged to order the result set from the acceptance criteria. This result already indicates that better results can be obtained by just ordering by idle time instead of by the combination of the first 3 order criteria.

11. Order criterion 4 states that the standard deviation of the 5-minute time-span plot points of the number of outgoing SMTP connections should be larger than 1 for suspicious machines. This criterion also seems to be a very good measure to detect suspicious machines, with an optimistic number of unvalidated machines of only 9 and a conservative number of only [value missing].

12. Adding order criterion 4 to the algorithm results in a significant improvement. The combined result is the best so far (excluding ordering by order criterion 3 or 4 alone, which both give better results).

13. Order criterion 5 is a simple peak detection mechanism. As can be seen in row 13, there are fewer unvalidated IPs than in the base validation. This criterion seems to eliminate unvalidated machines, although the results are not as good as those of order criteria 3 and 4.

14. Adding order criterion 5 to the algorithm significantly reduces the number of unvalidated IPs. This is the last step of the algorithm and it gives the best result for the combined criteria: the algorithm indeed reduces the number of unvalidated IPs significantly compared to the base validation. However, there are two individual criterion analyses that give better results.

15. The best validation results were obtained by ordering the IPs from the result set of the acceptance criteria by idle time. After closer inspection of the result set of the complete algorithm ("Combined 6" in the table), it was found that almost all unvalidated machines had less than 80% of idle time. Thus it might make sense to make this an acceptance criterion, as this seems to eliminate a high number of false positives. The following criterion is defined:

\[
\frac{\text{Idle time frames}}{\text{All time frames}} > 0.8 \tag{5.2}
\]

The result of adding this as an acceptance criterion was only 1 unvalidated IP. So a 99% validation rate is reached by this version of the algorithm.
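To get a feel for what criterion (5.2) demands, the threshold can be worked out for a single capture. The step below assumes a 7-day capture divided into 5-minute time frames, as used in the parameter analysis; the symbols \(N_{\text{all}}\) and \(N_{\text{idle}}\) are introduced here only for illustration.

\[
N_{\text{all}} = 7 \times 24 \times 12 = 2016, \qquad
N_{\text{idle}} > 0.8 \times 2016 = 1612.8 \;\Longrightarrow\; N_{\text{idle}} \ge 1613
\]

Under these assumptions a machine passes only if it shows outgoing SMTP connections in at most 403 of the 2016 frames, which is roughly 33.6 hours of activity spread over the week.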

5.3 Validation summary

Validation is possible with DNS blacklists and SpamAssassin. DNS blacklists seem to be the best match for the detection algorithm, as they also aim to identify IPs as spammers, whereas SpamAssassin log files are only useful when spam is being sent to the University mail servers (of which the log files are available).

The validation shows that, after some fine-tuning, the algorithm can reach a 99% validation rate for the top 100 suspicious IPs. Also, some criteria proved significantly more effective than others. Idle time performed so well (only 4 false positives) that it was moved from an order criterion to an acceptance criterion. Appendix C shows the resulting algorithm, while Appendix D shows the results obtained with this algorithm. The first validation attempt, with the algorithm as it was in Chapter 3, is described in Appendix B. As can be seen, the complete algorithm performs much better.

While those results look very promising, keep in mind that this work aims to inspect the feasibility of detecting spam machines via behavioral analysis of Netflow data. A few open issues remain for future work. For example, only the 100 most suspicious IPs were validated; what will happen when more IPs are added to this list? And can the algorithm be implemented with a feasible amount of processing time for practical applications? For practical applications of this research, those and other questions also have to be answered (see also Section 6.2, Future work).

6 Conclusions

After developing and validating the algorithm, a 99% validation rate was reached. This chapter summarizes the research done for this thesis and possible directions for future work. Section 6.1 summarizes the results and answers the research questions. Section 6.2 identifies possibilities for future work.

6.1 Summary of results

The main research question of this thesis is:

Is it possible to detect spamming machines via Netflow data gathered at internet routers with a low false positive rate?

To answer this research question, the focus of this work was on the Netflow data itself, without any other data source. With a few simple assumptions a high number of IPs with suspicious behavior could be found. The observed behavior was used to propose several criteria for detecting spamming machines. Those criteria were then combined in an algorithm that orders its resulting set of suspicious IPs by a calculated probability.

At this point, it is difficult to say whether those results really represent spamming machines. After all, only flow-level data has been used and the contents of the e-mails are not known, so nothing can be said with any certainty. To validate the results, SpamAssassin log files and DNS blacklists were used, with the focus on DNS blacklists as they are the best match for the proposed detection algorithm (DNS blacklists also try to identify IPs as spamming machines). Every criterion was validated individually. A few surprises were found, especially the fact that the percentage of idle time results in a high validation rate. After some modifications to the algorithm, 99 out of the 100 most suspicious IPs could repeatedly be positively validated.

Because of those results the main research question can be answered positively: with a behavioral analysis executed on Netflow data gathered at internet routers, it is possible to detect spamming machines with a low false positive rate. This work, however, is a first attempt to accomplish this.
There are still a lot of open questions and possibilities for improvement. See Section 6.2 (Future work) for more information.

One important discovery that might be considered a problem with Netflow data is the high amount of LVS (Low Volume Spam) that is directed at the observed domain (in this case, the University of Twente domain). If only a single spam message is sent by an IP outside the observed domain, it is difficult if not impossible to identify it as spam via Netflow. If the machine is located in the domain itself, probably a lot of those messages will be observed, going to a high number of distinct domains. So the lack of data hinders detection via Netflow. This problem could probably be partly solved by using Netflow data from multiple domains.

Below the subquestions are answered:

1. How do spammers operate?
Literature study shows that spam behavior on the network level can be classified as HVS (High Volume Spam) or LVS (Low Volume Spam). HVS is sent by dedicated spammers or open relays/proxies. LVS is sent by bots that are part of a botnet. Also, BGP hijacking is a supplemental technique on the rise [11] [12] [13].

2. What are trends in (anti-)spamming techniques?
The main trend identified in literature is the use of botnets to deliver spam. Also, the use of BGP hijacking (to avoid blacklists) has recently been observed more often [13] [21]. On the anti-spam front, spam is getting more attention from the research community. In the past, anti-spam measures were mostly implemented at the receiving side. Now more research is being done on the network-level behavior of spam. The simplest approach is to set up a spam honeypot in the form of a spam sinkhole. This gives a limited overview of spammer behavior, as only spam on a single domain is observed [13]. An open relay sinkhole provides a broader picture: the sources and destinations across multiple domains, seen from a single measuring point (the open relay sinkhole) [11]. Another common approach to get data on spammer behavior is to analyze log files of one or several mail servers [21]. The most direct approach found in literature was an infiltration of a botnet sending spam [10]. Only a single paper could be found about detecting spammers via Netflow.
Its validation, however, showed only one correctly detected spammer, and only a limited time span of Netflow data was used [2].

3. How does spamming differ from normal traffic on the level of Netflow data?
For the official mail servers that were plotted, a relatively stable baseline with a few peaks was observed (for the number of outgoing SMTP connections over time). To find suspicious machines, it was assumed that a spammer has a high number of outgoing SMTP connections compared to a very low number of (or no) incoming SMTP connections. Those machines mostly showed suspicious behavior: long periods without any outgoing SMTP connections interrupted by sudden peaks with a high number of outgoing SMTP connections, very irregular behavior (no clear baseline like the official mail servers have) and often no incoming SMTP connections at all.

4. Which algorithms can be used to detect spamming machines via Netflow data?
Based on the suspicious behavior observed in the Netflow data, several criteria were combined in a detection algorithm. First a few acceptance criteria are used to pick only the most suspicious machines, which are then ordered by 5 ordering criteria. The main ideas for detecting spam machines that have been incorporated in the algorithm are:

- The ratio between incoming and outgoing SMTP connections
- The number of distinct destination servers
- The number of outgoing connections
- The percentage of idle time
- The standard deviation of the number of outgoing SMTP connections over time
- The number of peaks

The combination of those criteria and the specific parameters used are important for correctly identifying suspicious behavior.

5. How can the results be validated?
To validate the results obtained by the algorithm, they can be checked against SpamAssassin log files or DNS blacklists. SpamAssassin flags individual messages as spam, while a DNS blacklist flags IPs as spamming hosts, which makes DNS blacklists a better match for testing the detection algorithm.

6. What are the probabilities for false positives and which cases cause them?
From the result set of the algorithm, the 100 most suspicious IPs are validated. After optimizing the algorithm, 99 of them could be positively validated (with multiple data captures of the University of Twente Netflow data). The detection algorithm, however, only yields probable spammers, as with Netflow the content of messages is not known. Only a behavioral analysis on the network level is possible. The biggest risk of false positives will probably be hosts set up for sending mailing list messages. Those IPs were, however, not identified. It is difficult to do so, because an unvalidated IP could just as well be a legitimate host as a spam machine (which makes it interesting to combine this algorithm with sources other than Netflow alone).

6.2 Future work

This research is a first investigation into the possibilities of detecting spam with only Netflow data. The results look promising, but there are still outstanding questions and directions for further research. In particular, the following open issues were identified:

- This work focussed on detecting spamming machines. As observed for example by Li Zhuang et al. [12], botnets and spam are closely related, certainly now that botnets are becoming the popular choice for delivering spam [11] [13]. If it is possible to detect machines infected by a bot, this delivery method for spam can be combatted. Bots can also be used for, e.g., DoS (Denial of Service) attacks or port scans; possibilities to detect this behavior with Netflow data are investigated by Joris Kinable [22], Daan van der Sanden [23] and Stephan Roolvink [24]. It would be interesting to combine (and/or extend) this research to see whether hosts displaying multiple suspicious activities can be found via Netflow data.

- This work focussed on the network of the University of Twente. As mentioned in Section 3.1, Netflow data for the SURFnet and Geant networks are also available. Due to time constraints this was not utilized in this research as well as it could be. There are some open issues:
  - How does sampling influence the detection of spamming machines? Sampling is used for both the SURFnet and the Geant data.
  - The University network is within the SURFnet network, which is in turn within the Geant network. It would be interesting to see a) whether the same spamming machines can be found in the different networks and b) how (or if) the observed spamming behavior changes for the different network levels.

- This work took a bottom-up approach: the Netflow data was investigated for suspicious behavior, and this behavior was then used to propose a detection algorithm. Some insight into what can be expected on the flow level has been gained. What has not yet been done is a top-down approach, focussing on a specific class of spammers and seeing what can be found in Netflow (based on the proposed criteria). Examples are:
  - IPs within the network of the Netflow router(s) versus IPs outside that network. For IPs within the network all flows are available (unless not all traffic is routed through the Netflow router(s), of course). This means that bots that send only a small number of spam messages per domain, but to a large number of distinct domains, can be detected in this case (which is not possible if the bot is outside the network and has only a low number of SMTP connections to the network, as observed in large numbers in Section 5.2).
  - IPs that fall within different ranges of the parameters of the proposed criteria, for example the number of distinct destinations or the number of outgoing SMTP connections.

- As Pin Ren et al. [4] suggest, malicious behavior will often show the same flow characteristics. Maybe it is possible to detect the behavior of a specific botnet or to train the algorithm with known spamming behavior.

- Formula 4.9 results in a percentage used to rank the suspicious machines. At the moment, this is the average value of the ordering criteria. Can this be improved by choosing different weights?

- How can behavior analysis with Netflow data be used in practice to combat spam? As observed in this work, it is possible to detect spam machines via Netflow, but is this mechanism reliable enough to be used in practice? And if so, how can it be used? Should behavioral analysis via Netflow be combined with other methods (DNS blacklists, log files, etc.)? Maybe a good way to evaluate the practical application is to deploy the algorithm as a plugin in, for example, SpamAssassin.

A Summarized proposed algorithm

This appendix gives an overview of the complete algorithm as proposed in Chapter 4. As explained in that chapter, first the most suspicious machines are chosen via the acceptance criteria. The result set is then ordered by a weighted average of the ordering criteria (descending). For the purpose of this work, the algorithm has been implemented mainly with MySQL stored procedures, as described in Section 4.2.

Acceptance criteria (the ratio threshold is the one stated in Section 4.2):

\[
\frac{\text{incoming SMTP connections}}{\text{outgoing SMTP connections}} < 0.005\%
\]
\[
\text{distinct destinations} > 5
\]
\[
\text{number of outgoing connections} > 30
\]

Ordering criteria:

\[
\Pr\{\text{spam machine confidence}\} = \frac{1}{5}\,(a + b + c + d + e)
\]
\[
a = \begin{cases} 1 & \text{if incoming SMTP connections} = 0 \\ 0 & \text{otherwise} \end{cases}
\]
\[
b = \begin{cases} 1 & \text{if distinct destination servers} > 10 \\ 0 & \text{otherwise} \end{cases}
\]
\[
c = \text{percentage of idle time}
\]
\[
d = \begin{cases} 1 & \text{if } \sigma > 1 \\ 0 & \text{otherwise} \end{cases}
\]
\[
e = \begin{cases} 1 & \text{if } \#\{\text{datapoints} > 5\sigma + \mu\} > 10 \\ 0 & \text{otherwise} \end{cases}
\]
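The ordering criteria above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the MySQL stored procedures used in this work: the function names are hypothetical, idle time is taken as the fraction of 5-minute data points with zero outgoing SMTP connections, and the population standard deviation is used for σ because the appendix does not say which estimator was applied.

```python
from statistics import mean, pstdev
from typing import Sequence, Tuple


def ordering_flags(incoming: int, distinct_destinations: int,
                   counts: Sequence[int]) -> Tuple[float, float, float, float, float]:
    """Return (a, b, c, d, e) for one host.

    counts holds the number of outgoing SMTP connections per 5-minute
    data point (assumed non-empty).
    """
    mu = mean(counts)
    sigma = pstdev(counts)  # assumption: population standard deviation

    a = 1.0 if incoming == 0 else 0.0                     # no incoming SMTP
    b = 1.0 if distinct_destinations > 10 else 0.0        # many destinations
    c = sum(1 for n in counts if n == 0) / len(counts)    # idle-time fraction
    d = 1.0 if sigma > 1 else 0.0                         # irregular traffic
    peaks = sum(1 for n in counts if n > 5 * sigma + mu)  # points above 5σ + µ
    e = 1.0 if peaks > 10 else 0.0                        # more than 10 peaks
    return (a, b, c, d, e)


def confidence(flags: Sequence[float]) -> float:
    """Equal-weight average of the ordering criteria (Formula 4.9)."""
    return sum(flags) / len(flags)


if __name__ == "__main__":
    # Hypothetical host: idle 95% of the time, five bursts of 100 connections.
    series = [0] * 95 + [100] * 5
    flags = ordering_flags(incoming=0, distinct_destinations=20, counts=series)
    print(flags, round(confidence(flags), 2))
```

Keeping the five values separate from the averaging step makes it easy to experiment with other weights, or with ordering by a single criterion such as idle time.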


B First validation results (without probabilities)

This Appendix shows the first validation results. This validation was done after the SpamAssassin log files of the university mail servers were obtained for a period of 6 days. At this stage the complete algorithm had not yet been developed; only the research described in Chapter 3 had been done. The algorithm at that point simply ordered all IPs by the number of outgoing connections and then accepted only those IPs that comply with acceptance criterion 1. This results in a list of the IPs with the most outgoing SMTP connections, but with only a relatively small number of incoming connections or no incoming connections at all. Validation was done with 25 DNS blacklists, SpamAssassin logfiles and the spam folder of an account for that period.

The columns display the following information:

- SA: The number of SpamAssassin hits on the University mail servers.
- P: The number of hits in the spam folder of a University account.
- BL25: The number of the 25 blacklists that list the IP as suspicious.
- MX: Whether the IP has an MX record.
- UT: The number of connections to the university mail servers.
- D.Out: The number of distinct destinations.
- Out: The number of outgoing SMTP connections.
- D.In: The number of distinct machines connecting to the machine on port 25.
- In: The number of incoming SMTP connections.

Hostnames and IPs are not shown due to privacy concerns.

[Table: first validation results for 100 IPs. Columns: SA, P, BL25, MX, UT, D.Out, Out, D.In, In. Row values are not preserved; surviving MX entries: paulinas.com.br, hapeqshipping.com, orientnet.com.br, megasys.com.br, reticulum.nl.]

To conclude, the total validation rate statistics:

- 74/100 had hits in the 25 DNS blacklists.
- 31/100 had hits in the SpamAssassin logfiles.
- 14/100 could not be validated via DNS blacklists or SpamAssassin.

So 86/100 are validated via DNS blacklists or SpamAssassin.
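For context, a DNS blacklist check of the kind counted in the BL25 column is commonly done as an ordinary DNS lookup: the IPv4 octets are reversed, the blacklist zone is appended, and an answer means the IP is listed. The sketch below illustrates that general convention in Python. It is not the validation code used in this work, and the zone name in the usage note is a placeholder, not one of the 25 blacklists.

```python
from __future__ import annotations

import ipaddress
import socket


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the DNS name to query: reversed IPv4 octets followed by the zone."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")  # raises ValueError if invalid
    return ".".join(reversed(octets)) + "." + zone


def is_listed(ip: str, zone: str) -> bool:
    """True if the blacklist zone returns an address record for the IP.

    Any resolution failure (including a temporary DNS error) counts as
    "not listed" in this simplified sketch.
    """
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except socket.gaierror:
        return False


def blacklist_hits(ip: str, zones: list[str]) -> int:
    """Number of blacklist zones that list the IP (the idea behind BL25)."""
    return sum(is_listed(ip, zone) for zone in zones)
```

For example, `dnsbl_query_name("192.0.2.10", "dnsbl.example.org")` builds the name `10.2.0.192.dnsbl.example.org`. Counting hits over a list of zones with `blacklist_hits` requires network access and a list of real blacklist zones.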

C Optimized algorithm after validation results

This Appendix gives an overview of the complete algorithm, much like Appendix A, but now the one with the highest validation rate. As described in Chapter 5, the validation resulted in a few surprising observations. After making some changes in the algorithm, several runs resulted in 99 out of the top 100 suspicious IPs positively validated.

Acceptance criteria:

1. incoming SMTP connections / outgoing SMTP connections < [threshold value missing]
2. distinct destinations > 5
3. number of outgoing connections > 200
4. percentage of idle time > 80%

Ordering criteria:

$$\Pr\{\text{spam machine confidence}\} = \frac{1}{5}\,(a + b + c + d + e)$$

$$a = \begin{cases} 1 & \text{if incoming SMTP connections} = 0 \\ 0 & \text{otherwise} \end{cases}$$

$$b = \begin{cases} 1 & \text{if distinct destination servers} > 10 \\ 0 & \text{otherwise} \end{cases}$$

$$c = \text{percentage of idle time}$$

$$d = \begin{cases} 1 & \text{if } \sigma > 1 \\ 0 & \text{otherwise} \end{cases}$$

$$e = \begin{cases} 1 & \text{if } (\text{datapoints} > (5\sigma + \mu)) > 50 \\ 0 & \text{otherwise} \end{cases}$$
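The criteria c, d and e above depend on a time series per IP. The following Python sketch shows one way they could be derived, under assumptions that this Appendix does not spell out: a "datapoint" is taken to be the number of SMTP connections in one fixed time interval, "idle time" is taken to be the share of intervals with zero connections, and σ is computed as the population standard deviation. The function name and the `outlier_limit` parameter are illustrative.

```python
from __future__ import annotations

from statistics import mean, pstdev


def series_criteria(counts: list[int], outlier_limit: int = 50) -> tuple[float, int, int]:
    """Return (c, d, e) for one IP from its non-empty per-interval connection counts."""
    mu = mean(counts)
    sigma = pstdev(counts)  # population standard deviation (an assumption)

    # c: share of intervals without any connection (assumed meaning of idle time)
    c = sum(1 for n in counts if n == 0) / len(counts)

    # d: 1 if the standard deviation exceeds 1
    d = 1 if sigma > 1 else 0

    # e: 1 if more than outlier_limit datapoints lie above 5*sigma + mu
    outliers = sum(1 for n in counts if n > 5 * sigma + mu)
    e = 1 if outliers > outlier_limit else 0

    return c, d, e
```

With the default `outlier_limit=50` the function follows the optimized threshold for e; passing `outlier_limit=10` gives the variant from Appendix A. A mostly idle series with many short, intense bursts scores high on all three criteria, while a steadily active server scores low.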


D Results of the optimized algorithm

The table in this Appendix shows the behavior found in the resultset of the optimized algorithm, which is described in Appendix C. This table shows the top 100 most suspicious IPs.

The columns display the following information:

- A - E: Individual criteria results.
- Prob: Probability based on the ordering criteria (weighted average).
- SA: The number of SpamAssassin hits on the University mail servers.
- BL25: The number of the 25 optimistic blacklists that list the IP as suspicious.
- BL5: The number of the 5 conservative blacklists that list the IP as suspicious.
- Out: The number of outgoing SMTP connections.
- In: The number of incoming SMTP connections.

Hostnames and IPs are not shown due to privacy concerns.

[Table: results for the top 100 most suspicious IPs. Columns: A, B, C, D, E, Prob, SA, BL25, BL5, Out, In. Row values are not preserved.]


Bibliography

[1] Bogdan Hoanca, Technology and Society Magazine, volume 25, Spring 2006, How good are our weapons in the spam wars?
[2] Prasanna Desikan and Jaideep Srivastava, Analyzing network traffic to detect spamming machines
[3] Yiming Gong, August 2004, Detecting worms and abnormal activities with Netflow (part 1 and 2),
[4] Pin Ren et al., Computer Graphics and Applications, IEEE, volume 26, March/April 2006, IDGraphs: Intrusion detection and analysis using stream compositing
[5] Cisco Systems, January 2007, Netflow Services Solutions Guide
[6] Cisco Systems, October 2007, Introduction to Cisco IOS Netflow - A Technical Overview
[7] TCPDUMP,
[8] William Stallings, 1999, SNMP, SNMPv2, SNMPv3 and RMON 1 and 2
[9] Network Working Group, Cisco Systems, October 2004, RFC 3954, Cisco Systems Netflow Services Export Version 9
[10] Christian Kreibich et al., LEET '08, April 2008, On the Spam Campaign Trail
[11] Abhinav Pathak et al., LEET '08, April 2008, Peeking into Spammer Behavior from a Unique Vantage Point
[12] Li Zhuang et al., LEET '08, April 2008, Characterizing Botnets from Spam Records
[13] Anirudh Ramachandran et al., NANOG37, September 2006, Understanding the Network-Level Behavior of Spammers
[14] Jonathan B. Postel, Simple Mail Transfer Protocol (RFC 821)
[15] The GNUPlot plotting tool,
[16] Anirudh Ramachandran, David Dagon and Nick Feamster, CEAS 2006, Can DNS-Based Blacklists Keep Up with Bots?
[17] Al Iverson, and
[18] Apache SpamAssassin Project,
[19] Jeff Makey, Blacklists Compared, jeff/spam/blacklists Compared.html
[20] ISO, 1994, Information technology - Open Systems Interconnection - Basic Reference Model: The Basic Model, second edition
[21] Anirudh Ramachandran, Nick Feamster and Santosh Vempala, Proc. ACM Conference on Computer and Communications Security (CCS), 2007, Filtering Spam with Behavioral Blacklisting
[22] Joris Kinable, June 2008, Detection of network scan attacks using flow data
[23] Daan van der Sanden, July 2008, Detecting UDP attacks in high speed networks using packet symmetry with only flow data
[24] Stephan Roolvink, December 2008, Detecting attacks involving DNS servers, A Netflow data based approach
[25] Reinier Schoof and Ralph Koning, February 2007, Detecting peer-to-peer botnets
[26] Pekka Pietikäinen and Lari Huttunen, 18th Annual FIRST Conference, 2006, Behavioral Study of Bot Obedience using Causal Relationship Analysis