Detecting spam machines, a Netflow-data based approach


Gert Vliek
February 24, 2009
Chair for Design and Analysis of Communication Systems (DACS)
Faculty of Electrical Engineering, Mathematics and Computer Science (EEMCS)
University of Twente, The Netherlands
Supervisors: Dr. ir. A. Pras, A. Sperotto M.Sc., Dr. R. Sadre


Abstract

Spam is a problem that practically every user encounters. More than 75% of all messages are likely to be spam and this level is still rising. This makes spam prevention a very relevant topic. Most research on spam prevention has focused on the receiving side: spam filters in clients and receiving servers. This is changing as spam gets more interest from the research community. In recent years, network-level behavior has emerged as another research direction. This thesis focuses on detecting spam machines via Netflow data. Because Netflow only provides information about the communications between hosts and not about the contents of those communications, this is not an easy task.

The aim of this work is to inspect the feasibility of detecting spam machines via Netflow. To reach this goal, a large repository of Netflow data has been studied to find behavior that differentiates spam machines from normal servers. With a few simple assumptions, a high number of IPs with suspicious behavior could be found. The behavior displayed by those suspicious IPs has been used to propose a number of criteria for detecting spamming machines. These criteria have been combined to implement an algorithm to detect spam machines via Netflow data.

To validate this algorithm, DNS blacklists and SpamAssassin log files were used with the Netflow data of the University of Twente. The first tests with DNS blacklist validation show that when IPs are picked at random, around 95% of them are listed in a blacklist. Closer inspection shows that this is because of the high number of IPs with only a single or a few SMTP connections directed at the monitored network. These are probably bots, sending only a low number of spam messages per domain to avoid detection. Because of the lack of data (only Netflow data within the monitored network is available), those IPs cannot be analyzed with Netflow. This is why the focus of this work lies on machines with a higher number (at least 100) of outgoing SMTP connections.

When validating the algorithm itself, a few surprising observations were made. Among those, a high percentage of idle time for an IP seems to be by far the most effective criterion. With the help of validation results, the algorithm has been optimized. The end result is a validation rate of 99% positively validated machines. This result was obtained with Netflow data captured over multiple time spans. Based on those results, we conclude that it is possible to detect spammers via only Netflow data. This has been a feasibility study and there are some open issues. Those have been left for future work.


Preface

This thesis is the end result of my final Master project, executed at the Design and Analysis of Communication Systems (DACS) chair of the University of Twente. At the end of 2007 I was searching for a final Master project which I could do part-time, because I also wanted to work three days a week at the company Soltria. After some orientation I quickly came to the conclusion that DACS, and specifically the Netflow assignment, was the best match for my interests. I had already developed an interest in security-related topics and first wanted to search for botnets in Netflow. I quickly concluded that this topic was too broad as a starting point. To limit the scope I decided it was better to focus on one of the activities bots display. Two fellow students were already studying DoS attacks and port scans with Netflow. I decided to focus my research on spam machines, which seemed a real challenge because with Netflow there is not a lot of information to work with. The research took some directions that I did not expect beforehand (for example the amount of work put into building our common Netflow database), but I think I can be content with some nice results.

I have been working on this project only at the end of each week, because from Monday to Wednesday I worked at the company Soltria. This meant that this work took a little longer than expected and that I had a high workload, but I learned a lot during this period, both at work and at the university. There even was some overlap; for example, for both I worked a lot with MySQL. I have enjoyed this period and have seen it as a nice way to get used to a business environment while still being a student.

I would like to thank both Aiko Pras and Remco Lupker for giving me the opportunity to combine work and my Master thesis. I also thank my supervisors: Aiko Pras for always being very supportive (and driving me to buy new Apple hardware ;-), Anna Sperotto for helping me structure my research and Ramin Sadre for his help with building our common database. I also thank the people that provided the data and help needed to enable this research, in particular ICTS, SURFnet and GEANT for providing the Netflow data, Peter Peters for access to the SpamAssassin log files and Quarantainenet for their insight. And of course the other people in the DACS lab, especially Daan van der Sanden for relaying his first experiences with Netflow, Joris Kinable, with whom I did a lot of work for our common database, and Martijn van Eenennaam and Stephan Roolvink; I enjoyed being the bringer of the end of the week ;-). And last but not least my parents, for always supporting my choices and giving me the opportunities they did.

Gert Vliek
January 2009


Contents

Abstract
Preface
1 Introduction
    1.1 Problem description
    1.2 Research questions
        1.2.1 Main research question
        1.2.2 Subquestions
    1.3 Approach
    1.4 Outline of the thesis
2 Netflow and spam
    2.1 Netflow
        2.1.1 Flow-based network monitoring with Netflow
        2.1.2 Netflow versions
    2.2 Spam
        2.2.1 The SMTP protocol
        2.2.2 Network level spamming techniques
        2.2.3 Trends in (anti-)spamming techniques
3 Experimental first steps
    3.1 Available data sources
    3.2 Differentiating suspicious machines from normal machines
        3.2.1 Manually extracting data from the Utwente repository
        3.2.2 Efficiently automating data extraction
    3.3 Conclusions drawn after the first experimental steps
4 Proposed spam-machine detection algorithm
    4.1 Proposed criteria
        4.1.1 Acceptance criteria
        4.1.2 Ordering criteria
    4.2 Complete algorithm implementation
5 Validation
    5.1 Validation methods
        5.1.1 DNS blacklists
        5.1.2 University of Twente SpamAssassin log files
    5.2 Validation results
    5.3 Validation summary
6 Conclusions
    6.1 Summary of results
    6.2 Future work
A Summarized proposed algorithm
B First validation results (without probabilities)
C Optimized algorithm after validation results
D Results of the optimized algorithm

List of Figures

1.1 Approach
3.1 All connections to port 25 captured by the Utwente Netflow dataset
One of the university mail servers
One of the suspicious mail machines
SQL processing for the Utwente repository
Distribution plot of the distinct outgoing SMTP destination IPs per machine (for IPs)
Distribution plot of idle time
Distribution plot of the standard deviation of the outgoing traffic per machine
Distribution plot of five times the standard deviation of the outgoing traffic per machine
Analysis process
SpamAssassin hit counts per machine (university mail servers)
Number of outgoing SMTP connections per IP


1 Introduction

This Chapter starts by explaining the context of this research in the problem description. Then the main research question and the subquestions of this work are stated. Next, the approach taken to answer the research questions is described. The Chapter closes with an outline of this thesis.

1.1 Problem description

Spam is a problem that practically every user encounters. More than 75% of all messages are likely to be spam and this level is still rising [1]. This makes spam prevention a very relevant topic. Most research on spam prevention has focused on the receiving side: spam filters in clients and receiving servers. A new approach to combating spam can be the use of Netflow data. Netflow is gaining popularity and can be used for research purposes; spam detection is one such possibility. Because Netflow only provides information about the communications between hosts and not about the contents of those communications, this is not an easy task. This research studies a large repository of Netflow data in search of behavior that differentiates spam machines from normal servers. With the results, an algorithm for detecting spam machines via Netflow will be implemented. The main aim of this research is to inspect the feasibility of detecting spam machines via Netflow data.

1.2 Research questions

The main research question addresses the feasibility of detecting spam machines via Netflow data. If this is possible, Netflow can be used to address the spam problem at its source: the spamming machines. It could be used as a cheaper solution than scanning all e-mails, or as another factor in current spam detection mechanisms. To answer the main research question a few subquestions are posed. This Section states and explains the research questions.

1.2.1 Main research question

The main research question is:

Is it possible to detect spamming machines via Netflow data gathered at internet routers with a low false positive rate?

Netflow data is highly aggregated; only summarized flow-level data is available. Because of this, it will be difficult to decide whether a certain machine is sending spam messages. With traditional anti-spam measures like Bayesian filters, the content of e-mails is scanned. This is obviously not possible with Netflow. A method to differentiate the flow-level behavior of spam machines from normal traffic has to be devised. Because of the high aggregation of data, the main research question is whether it is possible at all to accomplish this. Because spam detection has in the past proved to be a difficult process, another important question is how reliable the resulting detection algorithm will be. So this research aims to inspect the feasibility of detecting spam machines using Netflow with a low false positive rate.

1.2.2 Subquestions

To answer the main research question a few subquestions are posed. They have been chosen to get a picture of the currently known facts about spammers, to get a sense of the possibilities with Netflow and to develop an algorithm to detect spam machines, which will have to be validated. Those goals each answer a part of the main research question. The subquestions are:

1. How do spammers operate? To differentiate spam behavior from normal behavior, it is important to know how spammers behave on the network layer. The aim of this research question is to obtain an overview of the methods currently known in literature.

2. What are trends in (anti-)spamming techniques? This question will be answered by looking at the currently known trends in spamming techniques and anti-spam research. The aim of this question is to get an insight into the current state of the art of spam-related research relevant to the main research question.

3. How does spamming differ from normal traffic on the level of Netflow data? To detect spam machines via Netflow, the first question is how spamming differs from normal traffic. The first step in implementing a spam detection mechanism will be to explore the possibilities with Netflow data.

4. Which algorithms can be used to detect spamming machines via Netflow data? After the first criteria to select spamming machines via Netflow are known, a detection algorithm should be developed. Together with the previous research question, this represents the bulk of this research.

5. How can the results be validated? When a detection mechanism is implemented, it is important to validate that it really works. Methods to make sure that the identified spam machines are indeed spamming will have to be found.

6. What are the probabilities for false positives and which cases cause them? When a detection mechanism is implemented, it is important to identify the possible causes of false positives.

1.3 Approach

As not a lot of existing literature is available on the specific topic of detecting spam via Netflow data, the literature study will be minimal. The literature study focuses on answering research subquestions 1 and 2. The approach taken to answer subquestions 3-6 is displayed in Figure 1.1. First a Netflow data capture is done. It was chosen to first observe the data to get a sense of what can be detected with Netflow data. As not a lot of information about this is found in literature, this should be the first step to get a sense of normal and abnormal behavior. Later on an algorithm can be developed based on those first observations. This is in contrast to Prasanna Desikan et al. [2] and Anirudh Ramachandran et al. [21], where first a theory is devised and later verified. Because this is a feasibility study, it is chosen to first explore behavior that can be found in Netflow instead of devising an algorithm beforehand and hoping that suspicious machines can be found with it.

[Figure 1.1: Approach. Elements shown: Netflow data capture; Observe behavior (Chapter 3); Algorithm implementation (Chapter 4); New data capture; Validation (Chapter 5); Feedback.]

After this first exploration, an algorithm to detect spamming machines using Netflow data is implemented based on what has been observed in the Netflow data. While fine-tuning the detection algorithm, it is important to have a means to validate the results. Validation will be needed to decide whether the detection algorithm really works. Another use of validation is as a tool for fine-tuning. If changing some parameters results in far fewer positively validated results, those changes are probably degrading the detection algorithm instead of improving it, while more positively validated results probably indicate an improvement. With those goals in mind, two methods for validation were chosen: DNS blacklists and SpamAssassin. So the validation is used to improve the detection algorithm in an iterative way. After the first data capture is studied and the first version of the algorithm is implemented, the algorithm is validated using a new data capture (a recent dataset is necessary for validation, as explained in Chapter 5). Those results are then iteratively used to improve the algorithm.

1.4 Outline of the thesis

Chapter 2 provides an introduction to Netflow and SMTP and answers subquestions 1 and 2 with a literature study. Figure 1.1 illustrates the set-up of Chapters 3-5. Chapter 3 focuses on subquestion 3. Feasibility in terms of processing time is also explored: the Netflow data repositories result in huge database tables, so efficient ways to process data will have to be used to keep the processing time feasible. After this first exploration, the focus switches to research subquestion 4: an algorithm to detect spamming machines using Netflow data is implemented, using the experience of the previous exploration step. This is done in an iterative way; the algorithm is fine-tuned based on the obtained results. Validation mechanisms (research subquestion 5) are also used to optimize the mechanism (getting a high percentage of results validated). Those steps are discussed in Chapter 4. Chapter 5 focuses on subquestions 5 and 6. The validation methods are explained and validation results are discussed. Finally, Chapter 6 concludes this research by giving an overview of the results and answering the main research question (Is it possible to detect spamming machines via Netflow data gathered at internet routers with a low false positive rate?). The practical applications of this research and recommendations for further research are also discussed.

2 Netflow and spam

This Chapter starts with an explanation of Netflow. After that, the spam problem in relation to Netflow is discussed. To do this, first the SMTP protocol and its behavior in Netflow are explained in Section 2.2.1. Then the currently known spamming techniques at the network level are discussed in Section 2.2.2, followed by an overview of the current directions of spammers and spam research in Section 2.2.3.

2.1 Netflow

The main data source for this research is Netflow. Netflow is a relatively lightweight network monitoring tool. If it is possible to use it to detect spamming machines, it could be used to clean up networks or blacklist IPs. It could be used as a cheaper solution than scanning all e-mails for spam, or as another factor in current spam detection mechanisms. This Section explains Netflow, gives a few remarks to keep in mind while using Netflow and discusses the current versions of Netflow.

2.1.1 Flow-based network monitoring with Netflow

Network monitoring can be performed by packet-level inspection with tools like TCPDUMP [7]. However, for large packet switching facilities the processing time and disk space requirements become infeasible. Here SNMP [8] is a popular data source for network monitoring. SNMP, however, aggregates information on a very high level (e.g. network interface throughput or device uptime), so a lot of information is lost. Cisco's Netflow can be positioned between those two extremes (recording every packet or aggregating high-level data). It records flow-level data; for every flow (TCP, UDP, etc.) aggregated data is recorded. Cisco defines a flow as follows [5]:

"A flow is identified as a unidirectional stream of packets between a given source and destination - both defined by a network-layer IP address and transport-layer source and destination port numbers. Specifically, a flow is identified as the combination of the following seven key fields:"

- Source IP address
- Destination IP address
- Source port number
- Destination port number
- Layer 3 protocol type
- ToS byte
- Input logical interface (ifindex)

Netflow operates by creating cache entries. This means that for every connection between hosts an entry is made in this cache. For each combination of key fields, the stored information includes octets sent, packets sent, start time, end time and TCP flags. The information included depends on the Netflow version and Netflow set-up. The information is accessible by using the Command Line Interface (CLI) for an immediate view of the Netflow cache, or by exporting Netflow data packets to a Netflow Collector (used in this research) [6]. Flows are exported from the cache when one of the following events occurs:

- A TCP flow is ended by a FIN or RST flag
- A flow has been inactive for a specified time (the inactive timer)
- A flow has been active for a specified time (the active timer)
- The Netflow cache is full

The inactive and active timers can split flows, which means that a specific flow is exported several times. The result is several exported Netflow records which should actually be one and the same flow. Those records have to be merged again if the total aggregated flow data is to be retrieved for flows that were split up.

Another thing to keep in mind when using Netflow data is that flows are defined as a unidirectional stream of packets. This means that for a successful connection between two hosts, Netflow will export not one but two flows, one for each direction.

2.1.2 Netflow versions

There are several versions of Netflow; Netflow 5 and increasingly Netflow 9 are the most popular versions. Netflow versions 1, 5, 6, 7 and 8 export packets over UDP. This means that packets can be lost, because UDP does not guarantee packet delivery. Lost packets can be detected by checking the sequence numbers in the Netflow packet header. Netflow 9 is designed to be transport protocol independent, but UDP is still commonly used. Netflow export packets prior to version 9 consist of a header followed by at most 24 (version 1) or 30 (other versions) records for the exported flows. Netflow version 9 is designed to be easily extensible. It makes use of templates that can be defined by the user. Netflow 9 has been submitted to the IETF as an RFC [9].
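
The remark in Section 2.1.1 about flows being split by the active and inactive timers can be illustrated with a small sketch. The Python fragment below is only an illustration under stated assumptions, not the processing used in this research: the record layout (`FlowRecord`), the reduced five-field key (ToS byte and input interface are left out) and the `max_gap` threshold are hypothetical choices. It merges consecutive records that share a key into one flow; the two directions of a connection remain two separate flows.

```python
from dataclasses import dataclass, replace
from typing import Dict, Iterable, List, Tuple


@dataclass
class FlowRecord:
    # Hypothetical record layout; field names are illustrative only.
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: int
    start: float   # start time in seconds
    end: float     # end time in seconds
    packets: int
    octets: int


FlowKey = Tuple[str, str, int, int, int]


def merge_split_flows(records: Iterable[FlowRecord],
                      max_gap: float = 1.0) -> List[FlowRecord]:
    """Merge records with the same key that follow each other closely.

    max_gap is an assumed heuristic: a record starting at most max_gap
    seconds after the previous record with the same key ended is treated
    as a continuation of that flow.
    """
    merged: List[FlowRecord] = []
    last_by_key: Dict[FlowKey, FlowRecord] = {}
    for rec in sorted(records, key=lambda r: r.start):
        key = (rec.src_ip, rec.dst_ip, rec.src_port, rec.dst_port, rec.protocol)
        prev = last_by_key.get(key)
        if prev is not None and rec.start - prev.end <= max_gap:
            # Continuation of a split flow: extend it and add the counters.
            prev.end = max(prev.end, rec.end)
            prev.packets += rec.packets
            prev.octets += rec.octets
        else:
            # New flow: work on a copy so the input records stay untouched.
            current = replace(rec)
            merged.append(current)
            last_by_key[key] = current
    return merged
```

A suitable value for `max_gap` depends on the timer settings of the exporting router, which are not specified here.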

2.2 Spam

This Section gives an overview of the results of a literature study on the current state of the art of spam and anti-spam techniques, preceded by a short explanation of the SMTP protocol in the context of Netflow.

2.2.1 The SMTP protocol

To understand what can be observed by Netflow (discussed in Section 2.1), it is important to know how SMTP (Simple Mail Transfer Protocol) works. SMTP is the protocol that is used to send e-mail. End users, however, often make use of protocols like POP3, IMAP or a web-based mail client to retrieve e-mail. This research focuses on the SMTP protocol, because this is the protocol used to send e-mail (or spam). The SMTP protocol is defined in RFC 821 [14]. The SMTP RFC specifies a few different transport protocols on which it can run. In practice, however, SMTP uses the TCP protocol on service port 25. For Netflow this means that only TCP traffic on port 25 has to be captured to obtain all SMTP traffic. Because Netflow stores flows unidirectionally [6], flows with source port 25 as well as flows with destination port 25 should be captured. To give an idea of how the SMTP protocol works, the following (slightly modified) example was taken from RFC 821:

    S: MAIL FROM:<Smith@Alpha.ARPA>
    R: 250 OK
    S: RCPT TO:<Jones@Beta.ARPA>
    R: 250 OK
    S: RCPT TO:<Brown@Beta.ARPA>
    R: 250 OK
    S: DATA
    R: 354 Start mail input; end with <CRLF>.<CRLF>
    S: Blah blah blah...
    S: ...etc. etc. etc.
    S: <CRLF>.<CRLF>
    R: 250 OK

Here an e-mail is being sent from Smith@Alpha.ARPA to Jones@Beta.ARPA and Brown@Beta.ARPA. SMTP is a simple text protocol; after connecting to port 25 via telnet, those commands can be executed to send e-mail. In the example, sender lines are indicated with S and responses from the receiving side are indicated with R. First the originating address is given and after that two destination addresses. Then the data is entered, ended by a line containing only a period (.). More detailed information about SMTP can be read in RFC 821 [14].
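
As a minimal sketch of the capture rule stated above (TCP traffic with source or destination port 25), the Python fragment below classifies a flow by direction and counts the flows per source IP that are directed at port 25. The tuple layout and the function names are assumptions made for this illustration; they are not taken from the database used in this research.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple

SMTP_PORT = 25
TCP = 6  # IANA protocol number for TCP

# Assumed flow layout: (src_ip, dst_ip, protocol, src_port, dst_port)
Flow = Tuple[str, str, int, int, int]


def smtp_direction(protocol: int, src_port: int, dst_port: int) -> Optional[str]:
    """Return 'to_server', 'from_server', or None for non-SMTP flows."""
    if protocol != TCP:
        return None
    if dst_port == SMTP_PORT:
        return "to_server"    # client (sending machine) towards mail server
    if src_port == SMTP_PORT:
        return "from_server"  # mail server answering the client
    return None


def outgoing_smtp_per_source(flows: Iterable[Flow]) -> Counter:
    """Count flows directed at port 25 per source IP.

    This counts exported flows, not TCP connections: a connection that was
    split by the Netflow timers is counted once per exported record.
    """
    counts: Counter = Counter()
    for src_ip, _dst_ip, protocol, src_port, dst_port in flows:
        if smtp_direction(protocol, src_port, dst_port) == "to_server":
            counts[src_ip] += 1
    return counts
```

Because each connection shows up as two unidirectional flows, only the "to_server" direction is counted here; the reverse flows carry the responses of the receiving server.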
There are also SMTP extensions for including attachments like pictures or movies; those are outside the scope of this research (Netflow will just see more data being transferred).

For Netflow, the example above is seen as two unidirectional TCP flows. However, those flows do not necessarily represent the sending of a single message! In the example, an e-mail is being sent to two different destinations in one single TCP connection. It is also possible to send more than one message to more than one destination address. The content of this SMTP message exchange (SMTP is an application layer protocol) is of course not captured by Netflow. The only information available with Netflow is the information summarized in Section 2.1.

2.2.2 Network level spamming techniques

Literature study reveals the most common methods spammers use to send their spam [10] [11] [12] [13]. Abhinav Pathak et al. [11] suggest dividing those methods into LVS (Low-Volume Spammers) and HVS (High-Volume Spammers) for analysis purposes. This distinction will be used in the remainder of this research.

Direct spammers use dedicated hosts from spam-friendly providers. This method is becoming less effective because of DNSBLs (DNS Blacklists). DNSBLs list known spamming IPs; mail servers can check whether a mail sending machine is listed via a simple DNS lookup. If it is listed, the legitimate mail server can decide not to deliver mail sent by this machine, as it probably is spam.

Open relays or proxies are used to send spam anonymously. An open relay is an MTA (Mail Transfer Agent, an SMTP server) that can be used without authentication. This allows such an open relay to be used to relay e-mail. Spammers can use those machines to send spam on their behalf. Because of the obvious possibility of misuse and actual exploitation by spammers, open relays are becoming rare. They are still being scanned for and used. They are attractive to spammers because of the anonymity. Open relays can also be an alternative when the spammers' own IP addresses are blacklisted: sending spam via an open relay circumvents the blacklisting. An open relay is used by Abhinav Pathak et al. [11] as a data collection point to research spam behavior.

Because mass spamming mechanisms present obviously suspicious behavior and blacklisting is an effective defense against direct spammers, botnets are becoming more popular. A botnet consists of a high number of bots (machines infected with malicious software) commanded by a controller with some kind of C&C (Command & Control) infrastructure. In the past the C&C channel was mostly IRC, but because this is traceable the C&C infrastructure is changing to other (often encrypted) protocols. Botnets themselves are, just like spam, a security problem. A lot of research is focused on botnets (e.g. [12], [25] and [26]). Li Zhuang et al. [12] take the unique approach of finding botnets using spam records.

The methods described in the paragraphs above (direct spamming, open relays or proxies and botnets, as classified by Anirudh Ramachandran et al. [13]) have implications for what can be seen at the network level of the OSI stack [20] (where Netflow operates). Direct spammers and open relays or proxies probably generate a large amount of outgoing SMTP traffic, while botnets will do the same, but distributed over a large number of infected hosts to avoid detection [13]. Of course there are other techniques to consider, but most of them will not immediately influence what can be seen by Netflow because they are implemented at higher layers. Examples are image spam, false SMTP headers and dictionary attacks (to guess usernames).

A supplemental technique that is on the rise is BGP (Border Gateway Protocol) hijacking. This is observed by, among others, Anirudh Ramachandran et al. [13] and Nick Feamster et al. [21]. BGP is used by ASes (Autonomous Systems) to advertise the prefixes they can deliver traffic to. This protocol can be misused to hijack IP address prefixes. This is commonly done via short-lived routing announcements. The surprising thing (at first sight) is that spammers that use this technique use short IP prefixes (resulting in larger blocks of IP addresses!). However, the distribution of IP addresses sending spam suggests that this is done so that mail relays can hop between a large number of IP addresses, avoiding getting blacklisted. Also, route announcements for shorter IP prefixes are less likely to be blocked than announcements for longer IP prefixes. [13] [21]

2.2.3 Trends in (anti-)spamming techniques

The main trend in the sending of spam that is identified in research is the increasing use of botnets instead of direct spamming [11] [13]. Botnets are becoming very complicated, making them a research topic on their own [10]. BGP hijacking is a supplemental technique, described in Section 2.2.2. As mentioned by Anirudh Ramachandran et al. [13] and Nick Feamster et al. [21], this is becoming a popular method to avoid blacklisting. Because botnet sizes are difficult to estimate and research repositories cover only part of the total picture (a single domain, a single mail server, etc.), it is difficult to get exact numbers about the global spam problem.

Most research on spam has traditionally focused on the receiving side in the form of mail filters. Now that the spam problem gets more attention from the research community, network-level behavior of spammers is also studied, for example by Anirudh Ramachandran et al. [13]. This research uses a spam honeypot in the form of a spam sinkhole. This is just a mail server with no legitimate addresses, so that every received e-mail is a spam message. They registered a domain with a DNS Mail Exchange (MX) record in it. The result is still a limited viewpoint on the spam problem, because only spam on a single server on a single domain is recorded.

To get a broader view on the spam problem, Abhinav Pathak et al. [11] propose to set up an open relay sinkhole. The idea here is to set up an open relay in such a way that it can be detected by spammers, but will not really relay mail.
This way information about the sources and the different destinations of spam can be obtained. A few interesting statistics were found:

- 75% of the connecting hosts were already blacklisted in some of the 5 DNSBLs used. For the spammers using open relays, DNSBLs seem to do their job well! [11]
- 25% of the hosts connected only once, more than 75% connected fewer than 10 times and only 0.9% made more than 100 connections; those hosts were responsible for 59% of the total amount of spam. 0.1% made more than 1000 connections, accounting for 43% of spam. The authors conclude that more than 75% of the machines are part of a botnet. Interesting to note is that according to this research, non-botnet spam is still responsible for 59% of the spam. [11]
- Spam campaigns (the sending of one specific spam message to a lot of different recipients) are typically completed within an hour for HVS, while for LVS it takes on average 2-3 days to complete a campaign. [11]

Another interesting approach is taken by Nick Feamster et al. [21]. They propose to identify spammers not based on IP address or content filtering, but based on behavioral analysis (much like this research does). As a data source they used the logs of an organization that manages over 115 domains, to get a multiple-domain overview of spam. To classify spammers, they propose to cluster IPs with similar behavior. Their clustering algorithm takes as input the vector (n, d, t), where n is the number of IP addresses that send to any d domains within one of t time windows. The general idea behind this clustering approach is that the bots of a botnet will display similar behavior: bots will probably send a small number of messages to a large number of servers to avoid detection (each domain sees only a small number of messages, which avoids the decision to blacklist the bots). First the algorithm is trained with an initial set of known IPs. After training, new hosts displaying behavior similar to the clustered spamming machines will be classified as spammers. Their validation shows that a small percentage of spam messages not yet blacklisted were detected by the proposed algorithm (called SpamTracker).

Another approach that uses behavior analysis on the network level to detect spamming machines is used by Prasanna Desikan et al. [2]. Their data source is Netflow. This paper describes the only spam research with Netflow that could be found during the literature study of this research. Just like Nick Feamster et al. [21], they see their research as a complementary approach to spam reduction compared to the techniques currently in use. Their proposal is to construct link graphs, in particular bipartite cores. The HITS algorithm eliminates normal traffic (servers which send and receive mail with other servers sending and receiving mail). Then the algorithm identifies machines sending to a high number of other machines that are not in the set of normal traffic machines. The paper also acknowledges the importance of temporal behavior. Their first attempt to model temporal behavior is focused on single nodes and a few properties that are measured over time. They evaluated their algorithms with Netflow data for a 10 minute and a 3 hour time period on the border network of the University of Minnesota. One machine known to be sending spam was detected.
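
The second step described for Desikan et al. [2] (finding machines that send to a high number of machines outside the set of normal mail servers) can be approximated with a much simpler sketch. The Python fragment below is not the HITS-based method of that paper; it is a crude stand-in, assuming that a host is "normal" when it both sends and receives SMTP traffic, and that `min_destinations` is a freely chosen threshold.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def smtp_fanout(edges: Iterable[Tuple[str, str]],
                min_destinations: int = 3) -> Dict[str, int]:
    """Return senders with many distinct destinations outside the normal set.

    edges: (sender_ip, receiver_ip) pairs taken from flows to port 25.
    """
    destinations: Dict[str, Set[str]] = defaultdict(set)
    receivers: Set[str] = set()
    for src, dst in edges:
        destinations[src].add(dst)
        receivers.add(dst)

    # Assumption for this sketch: hosts that both send and receive mail
    # are treated as normal mail servers.
    normal = {ip for ip in destinations if ip in receivers}

    suspicious: Dict[str, int] = {}
    for src, dsts in destinations.items():
        if src in normal:
            continue
        outside = dsts - normal  # destinations that are not normal servers
        if len(outside) >= min_destinations:
            suspicious[src] = len(outside)
    return suspicious
```

With such a rule, a machine that only sends, and sends to many hosts that never send mail themselves, stands out; whether this is a useful criterion on real Netflow data is not claimed here.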
A completely different approach was taken by Christian Kreibich et al. [10]. They infiltrated the well-known Storm botnet by installing 16 instances of Storm bots in virtual machines. To successfully analyze the botnet, the different forms of Storm communication traffic were reverse engineered. Traffic was collected from late December 2007 until early February. Some interesting findings were made about both the botnet itself and its spam mechanisms. Interesting to note among those findings is that the overall average spamming rate was 152 spam messages per minute per bot and that the estimated address list size ranges from 100 million to 800 million recipients. A lot of insight was also obtained into the mechanisms used to generate spam messages and how they were being sent (the use of dictionaries, campaigns, address harvesting, etc.).
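
Returning to the DNSBL check mentioned in Section 2.2.2: such a check is commonly done by reversing the octets of the IPv4 address, appending the name of the blacklist zone and looking up the resulting name in DNS. The Python sketch below illustrates this convention. The zone name dnsbl.example.org is a placeholder, not a real blacklist, and the sketch is not the validation code of this research.

```python
import ipaddress
import socket


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the DNS name used to look up an IPv4 address in a DNSBL zone."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone


def is_listed(ip: str, zone: str) -> bool:
    """Return True if the lookup resolves, which conventionally means 'listed'.

    Any resolution failure (including a DNS outage) is reported as
    'not listed' here, which is a simplification.
    """
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except socket.gaierror:
        return False


# Example: name that would be queried for 192.0.2.99 in a placeholder zone.
query = dnsbl_query_name("192.0.2.99", "dnsbl.example.org")
```

Here `query` is the name 99.2.0.192.dnsbl.example.org; only `is_listed` performs an actual DNS lookup.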

3 Experimental first steps

This chapter describes the first experimental steps to detect spamming machines with the Netflow data. To answer the third research question (How does spamming differ from normal traffic on the level of Netflow data?), the choice was made to first observe the suspicious behavior that can be found in Netflow data before theorizing about spam detection mechanisms. This is in contrast to Prasanna Desikan et al. [2] and Anirudh Ramachandran et al. [21], where first a theory is devised and later verified using the data. First the available data sources are stated. Then the first manual inspection of the Netflow data is described and the relevant results and ideas for detecting spamming machines are discussed. Then, to cope with the amount of data, a mechanism for further inspection is discussed. This chapter closes with a short conclusion on this first exploration of the Netflow data.

3.1 Available data sources

For this research three Netflow sources were available:

University network: This encompasses (almost) all traffic for the University of Twente (the Netherlands). This will be referred to as the Utwente dataset in the remainder of this work.

SURFnet: This repository contains the traffic for SURFnet, a research network covering the Netherlands.

GEANT: This repository contains the traffic for GEANT, a research network covering Europe, which also contains the SURFnet traffic if it is not routed via another network.

A database repository was made for two days captured in July 2008 for all three networks. The focus of this research lies on the Utwente dataset, because this network is the most familiar one and a live Netflow data stream is available for it, which could come in handy later on. However, the analyses of this chapter have also been done on the SURFnet and GEANT repositories to see whether similar behavior was found.

3.2 Differentiating suspicious machines from normal machines

The focus of this research is to find spam machines using Netflow data. The Netflow data is put in a MySQL database. Because those repositories are huge in terms of disk space and records (the records occupy 49.64 GB for data and 86.91 GB for the index in the Utwente repository), the next challenge is to use this huge amount of data in a feasible way. The following approach was taken:

1. Get some initial experience with the available data (what kind of queries are common, what information can be distilled, how long queries take etc.). This is described in the section on manually extracting data from the Utwente repository.

2. Try to automate data processing for this specific assignment and create smaller data tables to operate upon to speed up query processing times. This is described in the section on efficiently automating data extraction.

After those experimental first steps the goal is to devise an algorithm to detect spamming machines, answering the research question "Which algorithms can be used to detect spamming machines via Netflow data?". This will be done in the next chapter.

Manually extracting data from the Utwente repository

In this section the available Netflow data is analyzed to get some experience as to how the Netflow data can be used to detect spamming machines. One of the first ideas was to generate a plot of all outgoing connections to port 25 over time; the result can be seen in Figure 3.1.

Figure 3.1: All connections to port 25 captured by the Utwente Netflow dataset

The choice was made to aggregate all traffic in 5-minute time slices. In this plot and all later plots the data points represent 5-minute time spans. So at the start of this plot around

7500 outgoing SMTP connections were made per 5 minutes. The value of 5 minutes has been chosen for practical reasons: the precision is good enough while the number of data points is limited (and thus the processing time, which will be especially important when automating the detection process later on). It is also important to note that an outgoing connection does not correspond to a single mail message. A single SMTP session can be used to send multiple mail messages to multiple users [14]. Also, some outgoing connections to port 25 could be port scans.

Figure 3.1 shows a very irregular plot. There are a few peaks that could be due to spam, but it is difficult to draw any conclusions based on this plot. Further research into the connections made during the two highest peaks did not clarify anything. Because Figure 3.1 does not tell us a lot, it was decided to make a plot of normal SMTP server behavior. The (5 load-balanced) university mail servers were chosen. They all showed the same kind of plot; one of them is displayed in Figure 3.2.

Figure 3.2: One of the university mail servers

As can be seen, this official mail server shows a far more stable plot than the one of all outgoing connections for the whole university domain (Figure 3.1). There are three peaks, but overall there is a stable baseline. The peaks have not been investigated further. They could be explained by mailing list mailings but could also be misuse of the mail server.

Now that we have a picture of outgoing SMTP traffic for a legitimate mail server, the next step is to find and plot the outgoing SMTP traffic of an illegitimate (spamming) mail machine. Finding a legitimate mail server is easy (just pick one of the official university mail servers), but finding an illegitimate one is more difficult. To find an illegitimate mail server, a way to identify one is needed. For this a simple basic assumption was made to distinguish those machines from legitimate ones: spam machines will send a lot of SMTP traffic, but will receive little or no incoming SMTP traffic on port 25.
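The 5-minute aggregation used for the plots in this section can be sketched as follows. This is a minimal illustration and not the procedure actually used on the repository: the record layout (a start time in epoch seconds and a destination port per flow) and the function name connections_per_slice are assumptions made for the example.

```python
from collections import Counter

SLICE_SECONDS = 5 * 60  # 5-minute time slices, as chosen for the plots


def connections_per_slice(flows, port=25):
    """Count flows towards `port` per 5-minute time slice.

    `flows` is an iterable of (start_time, dst_port) tuples, with
    start_time in epoch seconds. This layout is hypothetical.
    """
    counts = Counter()
    for start_time, dst_port in flows:
        if dst_port == port:
            # Round the start time down to the beginning of its slice.
            slice_start = start_time - (start_time % SLICE_SECONDS)
            counts[slice_start] += 1
    return dict(sorted(counts.items()))


example = [(0, 25), (10, 25), (299, 25), (300, 25), (301, 80), (900, 25)]
print(connections_per_slice(example))
```

Each key in the result is the start of a slice and each value is one data point in a plot such as Figure 3.1. Slices without any traffic are simply absent and would have to be filled with zeros before plotting.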

Of course, this is only an assumption at this point. There are a few cases in which this assumption will not hold true. One such case is a system that is set up to send mass mailings like a newsletter. This will result in a lot of mail being sent, but the machine will probably not be configured to receive mail. To obtain the machines for which the assumption holds true, the Utwente repository was queried to group the traffic by source IP over the entire Utwente dataset. For each IP the following metrics were calculated:

The number of distinct destination servers. These are the distinct destination servers for the outgoing connections with destination port 25.

The number of outgoing connections. These are all the outgoing connections with destination port 25.

The number of distinct incoming machines. These are all the distinct IPs that connect to the current machine with destination port 25.

The number of incoming connections. These are all the incoming connections with destination port 25.

Note that Netflow records the two directions of a connection as two different flows, one in each direction. Also, for the assumption, only the number of incoming and outgoing SMTP connections is needed. The number of distinct source and destination IPs was added because it might also shed some light on the behavior of legitimate versus illegitimate machines. Finally, as the aim is to find obvious spam machines, the resulting list is ordered by the number of outgoing connections to port 25 (descending), so that the mass spammers should be detected. The resulting top 15 of that list is shown in Table 3.1.

Table 3.1: Top 15 IPs with highest number of outgoing connections (columns: Ip, dist out, out, dist in, in)

The IPs are left out due

to privacy concerns. The other columns are the number of distinct destinations, outgoing connections, distinct incoming machines and incoming connections respectively, as explained above. The machines matching the assumption that spam machines will send a lot but will not receive a lot of traffic are in bold. For the third bold machine a plot of the number of outgoing connections with destination port 25 against time was made, shown in Figure 3.3. In addition to this, the following lines were plotted:

Incoming SMTP connections.

Total response for outgoing connections. The number of reverse Netflow flows for the outgoing connections. If no reverse flows are found, no successful connection was made. SMTP is based on TCP, and because of TCP's three-way handshake at least some packets should flow in the other direction if the destination port is open. If there are far fewer responses than outgoing connections, the machine is probably performing a port scan.

Total response for incoming connections. Same as total response for outgoing connections, but now for incoming connections.

Outgoing SMTP connections to different IPs. This is the number of distinct destination servers.

Figure 3.3: One of the suspicious mail machines

As can be seen in the figure, most of the time this machine is not sending anything, but there are some periods with peaks of over 3000 outgoing connections to port 25. Also, most outgoing connections get a response (data flowing in the direction of the originator, represented by the line "Total response for outgoing connections"). This means that this is

probably not a port scan. In that case most connection attempts would probably not get a response. The connections are to a high number of distinct destinations. But because the line representing the different destination servers is a bit lower than the line representing the number of outgoing connections, the machine makes more than one connection to a lot of distinct destinations. The reasons for this are not clear: SMTP allows all mail for a destination to be delivered in a single connection, so separate connections for each mail are not required. It could also be the case that the Netflow timers break the connections into multiple flows.

In summary, this machine displays very suspicious behavior: it has a lot of outgoing traffic to a lot of distinct destinations while only receiving 2 incoming connections. Also, the traffic is highly irregular: the plot shows long periods in which no SMTP traffic is observed, with sudden very high peaks of outgoing traffic. It is very likely that the first spam machine has been detected! The same behavior was found for most other IPs that have a lot of outgoing connections relative to their incoming connections.

Efficiently automating data extraction

As can be seen in the previous section, a few steps have to be performed to detect suspicious machines. Doing this manually consumes a lot of time and is not very efficient in terms of machine resource usage. After experimenting with tools like flow-tools [3] it was quickly decided that a database is more flexible for research purposes. Because of the large dataset in the database (see the introduction of this chapter) queries take a lot of time to complete. Just going through the whole dataset with a simple query takes on average around 1.5 hours. When doing this for each suspicious machine this quickly becomes infeasible. The process should be automated in a more efficient way.

Figure 3.4: SQL processing for the Utwente repository

The first obvious step to improve the performance is to avoid going through all the non-relevant data in the repository. Non-relevant data is all the flows that do not have port 25 (SMTP) in the source or destination port field. So the first step is to create a new data table containing only those flows that do. This reduced the number of flows considerably: SMTP represents only about 1.2% of all flows.
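The port-scan check used during the manual inspection (comparing a machine's outgoing port-25 flows with their reverse flows) can be sketched as follows. This is a simplified illustration under stated assumptions: the Flow record with the fields src, sport, dst and dport, and the function response_ratio, are hypothetical names, and the sketch ignores timestamps, so a reverse flow is matched purely on addresses and ports.

```python
from collections import namedtuple

# Hypothetical flow record: one Netflow flow in one direction.
Flow = namedtuple("Flow", "src sport dst dport")


def response_ratio(flows, host, port=25):
    """Fraction of outgoing flows from `host` to `port` that were answered.

    Netflow stores both directions of a connection as separate flows,
    so an answered connection shows up as a second flow with source and
    destination swapped. Returns None if the host has no outgoing flows.
    """
    flows = list(flows)
    seen = {(f.src, f.sport, f.dst, f.dport) for f in flows}
    outgoing = [f for f in flows if f.src == host and f.dport == port]
    if not outgoing:
        return None
    # A flow counts as answered if its mirrored 4-tuple was also recorded.
    answered = sum((f.dst, f.dport, f.src, f.sport) in seen for f in outgoing)
    return answered / len(outgoing)
```

A ratio close to 1 corresponds to the situation in Figure 3.3, where most outgoing connections get a response. A ratio close to 0 would instead suggest a port scan.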

The next step is aggregating a few statistics per IP (number of outgoing connections, distinct destination servers etc.) using the SMTP traffic table. From this traffic table a new table is made with only those IPs satisfying the assumption that spam machines generate a lot of outgoing traffic but will not receive a lot of traffic. The following criterion was chosen:

\[ \frac{\text{incoming SMTP connections}}{\text{outgoing SMTP connections}} < 0.005 \]

The result is that only the machines that were put in bold in the list of IPs from Table 3.1 are selected. So only the IPs with a high number of outgoing SMTP connections and little or no incoming SMTP connections are obtained. The value of 0.005 means that the number of incoming connections may be at most 0.5 percent of the number of outgoing connections. The main idea is to allow only a very small number of incoming connections compared to the outgoing connections. This behavior has been observed a lot in the Netflow data. Another choice could be to select only the IPs with no incoming connections at all (but with a high number of outgoing connections). Then IPs with only a low number of incoming connections would not be listed, while a lot of machines having a low number of incoming connections were found to have suspicious behavior. So the choice is made to use the ratio instead.

The result set is ordered by the number of outgoing connections to port 25. From this set the top 100 machines are used to generate time plot data which can be used as input for a charting tool like gnuplot [15]. The result is a set of 100 suspicious IPs with time plots. The described steps are implemented as MySQL stored procedures using cursors and indexes to reach a high efficiency. The total runtime is around 2 hours. Because of the data reductions, this is only a little longer than generating a plot for a single machine! The resulting process is illustrated in Figure 3.4.

3.3 Conclusions drawn after the first experimental steps

In the previous section an automated data extraction scheme based on simple rules has been developed. The end result is a list of hosts that have a lot of outgoing connections to port 25 but receive no (or relatively few) connections to port 25, ordered by the number of outgoing connections (descending). Also, data points have been generated as input for generating plots. Those plots should be inspected to see what behavior a suspicious host displays over time. In the Utwente dataset, long periods in which no traffic was being sent were detected, with sudden peaks in which a lot of outgoing connections to port 25 were made. The same behavior could be found in the SURFnet and GEANT data.

Further investigation of the suspicious IPs is required to decide how to proceed with a spam detection scheme based on Netflow. The most important issue is to keep the false positives to a minimum. A false positive with this mechanism can, for example, be an SMTP server set up for legitimate mass mailings. In this set-up the SMTP server will send a lot of traffic, will not receive a lot and will have sudden peaks. The question is whether this kind of SMTP server is used a lot in practice. The first experimental steps described in this chapter at least show a promising outlook on detecting spam machines. The next challenge is to decide upon an algorithm that does so.
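The selection steps described above (per-IP statistics on the SMTP traffic, the incoming/outgoing ratio criterion, ordering by outgoing connections and keeping the top machines) can be sketched in a few lines. This is an illustrative sketch only; the work itself uses MySQL stored procedures, and the names smtp_stats, suspicious_hosts and MAX_IN_OUT_RATIO, as well as the (src_ip, dst_ip, dst_port) record layout, are assumptions. The threshold of 0.005 corresponds to the 0.5 percent mentioned in the text.

```python
from collections import defaultdict

MAX_IN_OUT_RATIO = 0.005  # at most 0.5 percent incoming relative to outgoing


def smtp_stats(flows):
    """Aggregate per-IP SMTP statistics from (src_ip, dst_ip, dst_port) flows."""
    stats = defaultdict(lambda: {"out": 0, "in": 0, "dst": set(), "src": set()})
    for src, dst, dport in flows:
        if dport != 25:
            continue  # only flows towards port 25 count as SMTP connections
        stats[src]["out"] += 1
        stats[src]["dst"].add(dst)
        stats[dst]["in"] += 1
        stats[dst]["src"].add(src)
    return stats


def suspicious_hosts(flows, top_n=100):
    """Return (ip, dist_out, out, dist_in, in) rows that satisfy the criterion,
    ordered by the number of outgoing connections (descending)."""
    rows = []
    for ip, s in smtp_stats(flows).items():
        if s["out"] > 0 and s["in"] / s["out"] < MAX_IN_OUT_RATIO:
            rows.append((ip, len(s["dst"]), s["out"], len(s["src"]), s["in"]))
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows[:top_n]
```

The columns of each returned row follow the layout of Table 3.1. Note that this sketch also accepts hosts with very few outgoing connections and no incoming ones. In the described process these end up far down the ordered list and are cut off by the top 100 selection.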


4 Proposed spam-machine detection algorithm

This chapter describes several criteria to detect spamming machines (based on the experimental results described in the previous chapter) and combines them in a detection algorithm. From now on, the focus will lie on the Utwente Netflow data.

4.1 Proposed criteria

This section proposes the criteria for detecting spam machines via Netflow. The criteria are divided into two categories. This is done for two reasons. Firstly, because of the large number of IPs to be analyzed, it helps if a first selection is made on which more extensive analysis will be performed. This way only machines displaying obviously suspicious behavior will be analyzed in more detail, reducing the processing time. Secondly, only processing the machines that are suspicious according to the first criteria gives a smaller result set, excluding machines that are not suspicious. The remaining criteria can then be used to order the already suspicious machines by calculating a probability. If this were done for all the machines which have SMTP traffic, without a first elimination round, an unnecessarily large result set and an unnecessarily high processing time would be the result. So we have the following two categories of criteria:

1. Acceptance criteria: These are the criteria used to select suspicious machines. If criteria of this type do not hold true, the IPs are ignored for further analysis.

2. Ordering criteria: These are the criteria used to order the machines selected by the acceptance criteria. The goal is to order those machines so that the most suspicious machines are ranked on top.

The ordering criteria will result in a ranking. It is possible to extend the acceptance criteria: some machines will pass the acceptance criteria with ease, some only barely. This can be used to classify machines as being more suspicious or less suspicious.

The criteria are based on the observations in the Netflow data, not on literature study. As far as currently known they have not been used in combination with Netflow so far. Some of them, however, could be compared to criteria used in research with other data sources, e.g. Anirudh Ramachandran et al. [21] use the number of messages per domain. With Netflow this could be translated to the ratio between outgoing SMTP connections and distinct destination IPs.
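The translation mentioned above, from messages per domain to a Netflow-level quantity, amounts to a simple ratio. The following minimal sketch computes it from the two per-IP counts introduced in the previous chapter. The function name connections_per_destination is hypothetical, and the sketch does not claim how this value is weighted in the ordering criteria.

```python
def connections_per_destination(outgoing_connections, distinct_destinations):
    """Average number of outgoing SMTP connections per distinct destination IP.

    Returns None when the host did not contact any destination, because
    the ratio is undefined in that case.
    """
    if distinct_destinations == 0:
        return None
    return outgoing_connections / distinct_destinations
```

For example, a host with 3000 outgoing connections to 1500 distinct destinations averages 2 connections per destination, whereas a mail server that repeatedly talks to a small set of peers would show a much higher value.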