Detect and Notify Abnormal SMTP Traffic and Email Spam over Aggregate Network

JOURNAL OF INFORMATION SCIENCE AND ENGINEERING 21, 571-578 (2005)

Short Paper

Department of Computer Science and Information Engineering
National Central University
Chungli, 320 Taiwan
E-mail: center7@cc.ncu.edu.tw

Because all traffic between the public Internet and a customer's desktop must pass through the ISP's access network, this work uses the transportation traffic logs gathered from a backbone router to develop an SMTP flooding detection system (SFDS), so that most spam can be detected and stopped at the network from which it fans out. The system has been deployed on a TANet (Taiwan Academic Network) backbone node to help network users identify abnormal SMTP sources whose email requests suddenly increase. The results indicate that a high proportion of the notified spam could be detected in advance.

Keywords: SMTP flooding detection, spam, anomaly notification, Rwhois, IP route MIB

1. INTRODUCTION

Exploiting the simplicity of SMTP, sophisticated intrusion tools, and self-propagating email viruses, spammers have successfully marshaled a range of innocent machines as cover for distributing massive amounts of spam to network users worldwide. Once a system is infected by a mail virus, it not only attempts to copy itself to exploitable destination systems but also starts its own SMTP engine to deliver spam messages at a very high rate that easily falls outside the historical norm [1]. The sustained flood of in-your-face spam often fills intermediate systems, networks, and disks with unwanted messages [2]. Spamming behavior itself can therefore make the measurement task easier, provided the detection feature is narrowed down according to its flooding characteristics.
Using the transportation traffic logs exported from an access router, this work develops the SMTP flooding detection system (SFDS) to add an extra level of spam filtering on the upstream access network of an Internet service provider's (ISP's) backbone network.

Received February 3, 2004; revised August 23 & November 23, 2004; accepted December 27, 2004.
Communicated by Chu-Sing Yang.

1.1 Related Work

As spam often has characteristic styles, phrases, and disclaimers, several studies have taken advantage of these characteristics to detect spam by checking the mail content received by SMTP servers. In [3], the authors employed data mining to search the mail log files on the server node and discover consistent, useful patterns of the email system for recognizing anomalies and known intrusions. Bayesian anti-spam filters have been widely deployed on mail servers to learn the terms usually found in spam and to assess and filter incoming messages, mitigating the overwhelming spam traffic [4]. Because Bayesian anti-spam filters are deployed at SMTP server or client sites, however, the precious mail resources of intermediate systems are still heavily consumed in transmitting the massive spam and the resulting reject messages. In addition, the complex content matching of a statistical spam filter makes the filter itself a target of email bombing attacks. A sophisticated SMTP flooding program can severely overwhelm the victim server and visibly degrade its performance and availability. This situation motivates us to measure and detect SMTP flooding on the upstream access network.

1.2 Transportation Traffic Log

Rapid growth in Internet deployment and usage has created a massive increase in demand for measurement technology that provides the information required to record network utilization and helps network administrators solve network problems. Packet-based traffic traces captured by tcpdump and flow-based NetFlow data aggregated by routers are the most common resources network operators use to measure transportation traffic. Tcpdump uses the libpcap library to capture TCP/IP packets and presents them in textual format. In addition to the layer 4 traffic log, tcpdump can also echo packet information up to the payload content and write its standard output to a disk file [5].
Typically, network administrators adopt such collection tools to obtain traffic traces and gain a better understanding of traffic characteristics [6-8]. An additional source of transportation logs is the router itself, which uses many IP packet header fields to forward each packet that passes through it [9, 10]. Each NetFlow record keeps packet and byte counters that are updated for every packet matching the 5-tuple flow identifier: source IP, source port, destination IP, destination port, and transport protocol. Each bidirectional TCP session is typically expressed as two NetFlow records, one representing the traffic from client to server and the other the traffic from server to client. This work uses flow-tools [11], shareware developed at Ohio State University (OSU), to collect the NetFlow data exported from one TANet backbone router and support the subsequent SMTP traffic measurement.

The remainder of the paper is organized as follows. Section 2 presents the features of SMTP flooding, illustrates the feature-based traffic aggregation and multi-threshold anomaly detection mechanisms, and then analyzes the relation between the notified spam and the detected anomalies. Section 3 describes the construction of the Rwhois directory service and the automatic anomaly notification. Finally, Section 4 draws conclusions.

SPAM, ABNORMAL SMTP TRAFFIC MEASUREMENT

2. SMTP FLOODING DETECTION

A sudden increase in mail flows from a single mail client does not happen by accident. Most such increases are caused by a compromised system that has been exploited to distribute massive numbers of spam messages. This work therefore employs feature-based traffic aggregation and multi-threshold anomaly detection programs to measure and detect anomalies according to the features of SMTP flooding.

2.1 Feature of SMTP Flooding

Fig. 1 shows the typical SMTP transaction model. Because a spamming source generates and sends excessive messages to destination servers, the SMTP requests originating from the flooding source rise sharply and converge on the destination SMTP port, whereas the SMTP flows returned from the servers to the spamming source diverge across the non-well-known port numbers 1024-65535.

[Figure: a spamming source system sending to the SMTP service port of SMTP servers (1) through (4).]
Fig. 1. SMTP flooding model.

Therefore, this work constructs a feature set consisting of the source IP, destination IP, and destination SMTP port attributes as the virtual flow identifier for the subsequent SMTP traffic aggregation and anomaly detection, rather than relying on a single application port or a single address attribute as many systems now do.

2.2 Feature-Based Traffic Aggregation

This work uses the 3-tuple flow attributes as the key to aggregate the SMTP traffic sent from a mail source to a destination server. The program first reads the NetFlow data from the dedicated disk files for each one-hour interval, accumulates the traffic, and stores the results in the corresponding traffic variables. We use flow_h[vr_flow_i], pkt_h[vr_flow_i], and byte_h[vr_flow_i] to denote the numbers of SMTP flows, packets, and bytes carried through the virtual flow. The traffic variables are then summarized and sorted to highlight the abnormal SMTP traffic. Fig. 2 shows the aggregated top-n SMTP traffic over the subject network on Nov 2nd 2003. Several traffic variables, such as the mean packet size and the numbers of flows, packets, and bytes, are kept so that network administrators can easily identify spam senders. Take the traffic transmitted from the host with address 163.25.154.253 as an example: the system successfully sent thousands of emails to several servers in a single hour. The sudden increase in SMTP traffic from this single source, located on a high school campus, clearly differs from the traffic of the other SMTP clients.

(a) Monitoring abnormal SMTP traffic. (b) Monitoring the detected spam traffic.
Fig. 2. Monitoring SMTP flooding traffic.
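The hourly aggregation of Section 2.2 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' program: the FlowRecord type, its field names, and the functions aggregate_hour and top_n are hypothetical, and the records are assumed to have already been parsed from one hourly NetFlow file into memory. The key is the 3-tuple (source IP, destination IP, destination port), and each NetFlow record is counted as one flow.

```python
from collections import defaultdict
from dataclasses import dataclass

SMTP_PORT = 25  # well-known SMTP service port


@dataclass(frozen=True)
class FlowRecord:
    """Hypothetical parsed NetFlow record (field names are assumptions)."""
    src_ip: str
    dst_ip: str
    dst_port: int
    packets: int
    octets: int


def aggregate_hour(records):
    """Accumulate flows, packets, and bytes per virtual flow for one hour.

    The virtual flow key is (src_ip, dst_ip, dst_port); only traffic
    destined to the SMTP port is kept.
    """
    totals = defaultdict(lambda: {"flows": 0, "pkts": 0, "bytes": 0})
    for rec in records:
        if rec.dst_port != SMTP_PORT:
            continue  # skips non-SMTP traffic and server-to-client replies
        key = (rec.src_ip, rec.dst_ip, rec.dst_port)
        totals[key]["flows"] += 1
        totals[key]["pkts"] += rec.packets
        totals[key]["bytes"] += rec.octets
    return dict(totals)


def top_n(totals, n=10):
    """Return the n virtual flows with the most flows, largest first."""
    ranked = sorted(totals.items(), key=lambda kv: kv[1]["flows"], reverse=True)
    return ranked[:n]
```

Ranking by flow count is one possible ordering; the paper does not state which traffic variable its top-n list is sorted on, so the sort key here is an assumption.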

2.3 Multi-Thresholds Anomaly Detection

Spammers deliberately use malformed and varying flow rates to hide their spamming behavior from detection. A multi-threshold anomaly detection program was designed to aggregate the traffic of interest and filter out flooding traffic without raising too many false alarms. The program reads the aggregated traffic variables of the top-n SMTP sessions and accumulates the flooding duration, mean flow rate, and mean packet size of each SMTP source to help determine abnormal SMTP flooding. We use flow_count[source_i], pkt_count[source_i], byte_count[source_i], pkt_size[source_i], duration[source_i], and flow_rate[source_i] to denote the numbers of SMTP flows, packets, and bytes, the mean packet size, the flooding duration, and the mean flow rate of each SMTP source system. First, the numbers of packets and bytes sent from each email source to any mail server are aggregated according to formulas (1) and (2), and the mean packet size is evaluated accordingly (formula (3)). Then the flow count, flooding duration, and mean flow rate of each email source are computed according to formulas (4) through (6). Finally, the program compares the aggregated traffic variables with the flow rate, flooding duration, and mean packet size thresholds to identify suddenly increased SMTP traffic (formula (7)).
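\[ \mathit{byte\_count}[\mathit{source}_i] = \sum_{i=0}^{C} \mathit{byte}_h[\mathit{vr\_flow}_i] \tag{1} \]

\[ \mathit{pkt\_count}[\mathit{source}_i] = \sum_{i=0}^{C} \mathit{pkt}_h[\mathit{vr\_flow}_i] \tag{2} \]

\[ \mathit{pkt\_size}[\mathit{source}_i] = \frac{\mathit{byte\_count}[\mathit{source}_i]}{\mathit{pkt\_count}[\mathit{source}_i]} \tag{3} \]

\[ \mathit{flow\_count}[\mathit{source}_i] = \sum_{i=0}^{C} \mathit{flow}_h[\mathit{vr\_flow}_i] \tag{4} \]

\[ \mathit{duration}[\mathit{source}_i] = \mathit{duration}[\mathit{source}_i] + 1, \quad \text{for each hour} \tag{5} \]

\[ \mathit{flow\_rate}[\mathit{source}_i] = \frac{\mathit{flow\_count}[\mathit{source}_i]}{\mathit{duration}[\mathit{source}_i]} \tag{6} \]

\[ \mathit{smtp\_flooding}[\mathit{source}_i] = 1, \quad \text{if } (\mathit{pkt\_size}[\mathit{source}_i] > \mathit{threshold\_pkt\_size}) \text{ and } (\mathit{flow\_rate}[\mathit{source}_i] > \mathit{threshold\_flow\_rate}) \text{ and } (\mathit{duration}[\mathit{source}_i] > \mathit{threshold\_duration}) \tag{7} \]

In addition to the IP header (20 bytes in length) and the TCP header (20 bytes in length), the mail segment also involves the Received, Date, From, Message-ID, and To lines [13].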
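Formulas (1) through (7) can be sketched in Python as follows. This is a hedged illustration rather than the authors' implementation: the function names new_state, update_sources, and is_flooding and the dictionary layout are hypothetical, and formula (5) is interpreted here as counting one hour for every hourly interval in which the source sent SMTP traffic, which is an assumption. The three thresholds are passed as parameters.

```python
from collections import defaultdict


def new_state():
    """Create empty per-source counters (hypothetical layout)."""
    return defaultdict(lambda: {"flow_count": 0, "pkt_count": 0,
                                "byte_count": 0, "duration": 0})


def update_sources(state, hourly_totals):
    """Fold one hour of per-virtual-flow totals into per-source counters.

    hourly_totals maps (src_ip, dst_ip, dst_port) to a dict with the
    keys "flows", "pkts", and "bytes" for that hour.
    """
    seen = set()
    for (src_ip, _dst_ip, _dst_port), t in hourly_totals.items():
        s = state[src_ip]
        s["byte_count"] += t["bytes"]   # formula (1)
        s["pkt_count"] += t["pkts"]     # formula (2)
        s["flow_count"] += t["flows"]   # formula (4)
        seen.add(src_ip)
    for src_ip in seen:
        # formula (5): assumed to mean one increment per active hour
        state[src_ip]["duration"] += 1
    return state


def is_flooding(s, thr_pkt_size, thr_flow_rate, thr_duration):
    """Apply formulas (3), (6), and (7) to one source's counters."""
    if s["pkt_count"] == 0 or s["duration"] == 0:
        return False
    pkt_size = s["byte_count"] / s["pkt_count"]   # formula (3)
    flow_rate = s["flow_count"] / s["duration"]   # formula (6)
    return (pkt_size > thr_pkt_size
            and flow_rate > thr_flow_rate
            and s["duration"] > thr_duration)     # formula (7)
```

All three conditions must hold at once, so a source that sends many small reject or retry packets, or that bursts for only an hour or two, is not flagged.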

Given this header overhead, the system takes 100 bytes per packet as a reasonable mean packet size threshold for distinguishing successfully transmitted email from rejected and retransmitted undeliverable messages. Flooding traffic with a mean flow rate above 180 flows per hour that has continued for more than 5 hours is detected and shown in Fig. 2 (b), while the traffic of rejected and retransmitted undeliverable messages, whose mean packet size is much less than 100 bytes per packet, is displayed on the lower part of the monitoring page. The single flooding source with IP address 163.25.154.253, however, can be identified easily because its huge number of SMTP flows lies outside its historical norm.

2.4 Link between Detected Anomalies and Reported Spam

To cope with the rapid growth of spam traffic, Internet users are encouraged to report abuse events to designated service sites, such as spamcop.net, and to the responsible administrator, such as abuse@domain. In this work, a Perl script parses the anomalies detected by the SFDS system and the list of abuse sources notified by spamcop.net to analyze the relation between the detected anomalous sources and the reported spam. Table 1 lists the link between the spamming senders reported to abuse@domain and the SMTP flooding sources detected on the subject network over the past several months. The result shows that a high proportion of the reported spam could be picked out from the detected flooding sources. Given this strong link between SMTP flooding sources and the notified spam, an automatic anomaly notification program was developed so that the responsible owners can be notified to fix the compromised systems as soon as possible.

Table 1. Link between detected anomalies and the reported spam.
Month      Reported spamming host number   Spamming host number detected by the SFDS system
Oct-2003   8                               6 of 8      75 %
Nov-2003   8                               7 of 8      88 %
Dec-2003   9                               7 of 9      78 %
Jan-2004   12                              12 of 12    100 %
Feb-2004   8                               6 of 8      75 %
Mar-2004   13                              11 of 13    85 %
Apr-2004   20                              18 of 20    90 %
May-2004   10                              10 of 13    78 %

3. ANOMALY NOTIFICATION

Security is a combination of technical, administrative, and physical controls [13]. Once anomalous SMTP flooding is detected, the responsible administrator should be notified to verify the problem and correct the incident. In this work, the IP route information of the subject network was rebuilt to provide a contact information database, and the Net::Rwhois.pm and Mail::Sendmail.pm Perl modules were integrated to perform the notification.

3.1 IP Route MIB and Rwhois Directory Service

Typically, contact information consisting of the organization name, contact administrator, email address, telephone number, IP sub-network addresses, and the IP address of the campus network's routing interface is kept so that a technically competent person can be reached and suspected or identified threats can be quickly addressed and resolved by the responsible technician. The system first pulls the iproute Simple Network Management Protocol (SNMP) Management Information Base (MIB) tree [14-16] from the aggregate router to construct the IP routing data. The routing information is then related to the contact technician data to construct the Rwhois database; Rwhois is a shareware package [17].

3.2 Automatic Anomaly Notification

The program first takes the source IP attribute of each detected anomalous record and retrieves the responsible administrator's information through the Rwhois.pm module. Based on the contact data, the program uses the Sendmail.pm module to send the aggregated flooding records to the responsible administrator. The responsible administrator or technician is thus informed and can resolve the problem, rather than the anomaly merely being monitored.

4. CONCLUSIONS

Using NetFlow data gathered from the aggregate network, this work developed the SFDS system to identify SMTP flooding through feature-based traffic aggregation and multi-threshold anomaly detection. An automatic anomaly notification program was also designed to notify the responsible administrator of the detected flooding traffic.
The analysis indicates that a high proportion of the reported spam could be detected by the SFDS system in advance. In the near future, the authors plan to integrate data mining and statistical techniques into the system so that sophisticated spamming with a much lower flow rate can also be detected.

REFERENCES

1. E. Levy, The making of a spam zombie army: dissecting the sobig worms, IEEE Security and Privacy Magazine, Vol. 1, 2003, pp. 58-59.
2. D. Harris, Drowning in sewage-spam, Asia Pacific Regional Internet Conference on Operational Technologies (APRICOT), 2004.
3. R. Grier and S. Schappel, Junk email filter using naïve Bayesian classification, http://www.ryangrier.com/news/archives/bayesianemailfilter.pdf.

4. H. Han, X. L. Lu, J. Lu, C. Bo, and R. L. Yong, Data mining aided signature discovery in network-based intrusion detection system, ACM SIGOPS Operating Systems Review, Vol. 36, 2002, pp. 7-13.
5. C. Williamson, Internet traffic measurement, IEEE Internet Computing, 2001, pp. 70-74.
6. D. Luca and S. P. A. Finsiel, Effective traffic measurement using ntop, IEEE Communications Magazine, 2000, pp. 138-143.
7. H. Wang, D. Zhang, and K. G. Shin, Detecting SYN flooding attacks, The 21st Annual Joint Conference of IEEE Computer and Communications Societies (INFOCOM 2002), Vol. 3, 2002, pp. 1530-1539.
8. M. Roesch, Snort lightweight intrusion detection for networks, in Proceedings of 13th Systems Administration Conference, 1999, pp. 229-235.
9. Cisco IOS Netflow Technology, http://www.cisco.com/warp/public/cc/pd/iosw/prodlit/iosnf_ds.htm.
10. G. Huston, Measuring IP network performance, The Internet Protocol Journal, 2003, pp. 2-9.
11. M. Fullmer, The OSU flow-tools package and cisco netflow logs, in Proceedings of 14th Systems Administration Conference (LISA 2000), 2000, pp. 291-303.
12. B. Costales and E. Allman, Sendmail, O'Reilly and Associates, Inc., 2003.
13. C. P. Pfleeger and S. L. Pfleeger, Security in Computing, Pearson Education, Inc., Prentice Hall, 2003, pp. 491-551.
14. Request for Comments: 1213, Management Information Base for Network Management of TCP/IP-based internets: MIB-II.
15. Request for Comments: 1354, IP Forwarding Table MIB.
16. C. Huitema, Routing in the Internet, Prentice Hall, 1995, pp. 27-64.
17. Request for Comments 2167, Referral Whois (RWhois) Protocol.

Su-Chiu Yang (楊素秋) received the M.S. degree in Electronics Engineering from National Taiwan Institute of Technology, Taiwan in 1980, and the Ph.D. degree in Computer Science and Information Engineering from National Central University, Taiwan in 2004. She currently works as a senior programmer at the Computer Center in National Central University.
Her research interests are in the areas of computer networks, network management, and intrusion detection.

Li-Ming Tseng (曾黎明) received the B.S. degree in Engineering Science and the M.S. and Ph.D. degrees in Electrical Engineering from National Cheng Kung University, Taiwan, in 1970, 1974, and 1980, respectively. Dr. Tseng is currently a professor in the Department of Computer Science and Information Engineering at National Central University, Taiwan. His research interests are in the areas of distributed systems, computer networks, operating systems, and resource discovery.