Analysis of Spam Filter Methods on SMTP Servers
Category: Trends in Anti-Spam Development

Author: André Tschentscher
Address: Fachhochschule Erfurt - University of Applied Sciences, Applied Computer Science
Professor: Prof. Dr. Gunar Schorcht

Introduction

During the past years, the private and commercial use of email has gained importance. Before the turn of the century, letters played a big role as a communication medium, but email is now well established. Its advantages cannot be ignored: emails can be delivered almost instantly, at close to no cost, to any number of recipients. A Cisco study reported by Absolit Blog (2009) showed that approximately 220 billion emails are sent every day. While the E-Postbrief by Deutsche Post is not susceptible to advertising because of its cost and traceability, over 80% of email traffic can be considered spam. These unwanted emails usually advertise medications, health products, and various other goods and services, or contain pornographic content. This makes the need to sort out spam apparent. Only effective filter methods can ensure that important emails are not lost in the increasing volume of spam.

A case study was conducted that shows what role spam plays in global email traffic and how it can be reduced. Based on the results, a proposal is made for blocking the rising flood of unwanted emails. Another goal of the presented research is an automatic analyzer that generates statistics showing the amount of spam in comparison to the total amount of email over any period of time. The reason why an email was marked as spam is also to be recorded, and the collected data is to be presented graphically.

State of the Art

Different methods for classifying emails as ham (wanted) or spam (unwanted) are presented first. Later, a system that uses a combination of these filters to separate wanted from unwanted emails is presented. There are two general methods to filter emails. The first is preselection done by the mail server, which uses SMTP (see IETFS (2001)). The second is time-consuming content filtering done by a software suite such as SpamAssassin.
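The division of labour between the two methods can be illustrated with a short sketch. It is not taken from the system studied in this paper: the names Envelope, unknown_recipient and filter_mail, as well as the single example rule, are assumptions chosen for illustration. The point is only that inexpensive checks on the SMTP envelope run first, so that the expensive content filter sees just the mails that survive them.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class Envelope:
    """SMTP envelope data that is known before the message body is inspected."""
    helo: str
    sender: str
    recipient: str


# A cheap check returns a rejection reason, or None if the mail may proceed.
Check = Callable[[Envelope], Optional[str]]


def unknown_recipient(known: set) -> Check:
    """Hypothetical stand-in for one of the server-side preselection rules."""
    def check(env: Envelope) -> Optional[str]:
        if env.recipient not in known:
            return "Recipient address rejected: user unknown"
        return None
    return check


def filter_mail(env: Envelope, body: str,
                cheap_checks: Iterable[Check],
                content_filter: Callable[[str], bool]) -> str:
    # Stage 1: preselection on envelope data only (fast).
    for check in cheap_checks:
        reason = check(env)
        if reason is not None:
            return "rejected: " + reason
    # Stage 2: content filtering, reached only by mails that passed stage 1.
    return "spam" if content_filter(body) else "ham"
```

In this sketch the content filter is any callable that returns True for spam. In a real installation that role would be played by a suite such as SpamAssassin.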
The filter rules of different mail servers may have different names, but they all work in basically the same way. Preselection is based on the following simple rules (compare Heinlein, P. (2008) and Postfix (2010)).

Helo command rejected: host not found, invalid name, need fully-qualified domain name

The name of the sending server is checked for validity. If the server provides an invalid name in the HELO command, the email is discarded. Most of the time this indicates spam, but sometimes the sending mail server is simply not configured correctly. As a result, emails from computers that do not belong to a valid domain are rejected. This also implies that emails from systems without a static IP address are rejected.

Sender address rejected: domain not found, malformed DNS server reply, need fully-qualified address (SPF, described in SPF (2010))

One starting point for spam detection is the Sender Policy Framework (SPF), which is supposed to make it difficult to forge sender addresses. This is achieved by publishing additional SPF resource records for individual domains in the DNS. These records specify which systems within the domain are allowed to send emails. When an email is checked, the SPF record of the sender's domain is looked up. A matching entry for the sending system indicates that the entries in HELO and MAIL FROM of the received email are correct; otherwise the email is discarded.

Recipient address rejected: user unknown in local recipient table

Senders of spam try to guess email addresses by using special dictionaries, since the names of the users on a mail server are unknown to them. If the mail account does not exist on the server, the email is discarded.

Relay access denied

This filter comes into effect when someone tries to send an email without proper authentication. A mail server that is not configured as an open relay helps to minimize misuse.

The time- and resource-consuming content filtering is applied to all mails that passed the preselection step. Based on JISC Techwatch (2004) and MX Logic (2003), four different kinds of content filters are described here.

Challenge-response filter

Almost all mail servers today are configured to send an email again after a short amount of time if the first delivery attempt failed.
A mail server can therefore temporarily reject emails from an unknown source and accept them only when they are sent again. Since senders of spam often do not retry, this can be used to filter out a wave of spam. The corresponding method is called graylisting. The challenge-response filter follows a similar approach. It answers emails from new senders with a test to verify that the sender is a natural person. Similar authentication methods in the form of CAPTCHAs, as seen in CAPTCHA (2010), are used in forums and newsgroups. Email senders with a commercial interest lack either the time or the resources to complete this verification for every mailbox an email was sent to.
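The graylisting idea mentioned above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the configuration of any particular mail server: the class name Greylist, the 300 second minimum delay, and the use of the triplet of client IP address, sender and recipient as the lookup key are all chosen for the example. The first attempt for an unknown triplet is deferred, and a retry is accepted only after the minimum delay has passed.

```python
class Greylist:
    """Minimal in-memory graylisting sketch (assumed parameters, no persistence)."""

    def __init__(self, min_delay: float = 300.0):
        # Minimum number of seconds a sender must wait before a retry is accepted.
        self.min_delay = min_delay
        # Maps (client_ip, sender, recipient) to the time of the first attempt.
        self.first_seen = {}

    def check(self, client_ip: str, sender: str, recipient: str, now: float) -> str:
        key = (client_ip, sender, recipient)
        # Remember the first time this triplet was seen.
        first = self.first_seen.setdefault(key, now)
        if now - first >= self.min_delay:
            return "accept"   # a retry after the waiting period
        return "defer"        # would be a temporary (4xx) SMTP rejection
```

A correctly configured mail server retries after a deferral and is eventually accepted, whereas a spam sender that never retries is filtered out without any content analysis.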

There are a few problems to consider. The filter will also respond to spam that was sent using a stolen or fake email address, meaning that a vast number of authentication requests are sent, which creates a high server load. Problems can also arise with automatically sent emails (e.g. registration emails), since these mailers are not capable of verifying themselves. Because of this, such emails from previously unknown senders can no longer be received. It can also happen that natural persons are unable to complete the authentication process because the challenge email itself is discarded as spam. In addition, a challenge-response system usually requires entering a code, which can present people with certain disabilities with a problem they cannot solve. From these problems it can be concluded that the challenge-response approach is not the best solution to confront spam.

Rule based spam filters (see Visolve (2009))

When people talk about spam filters, they usually mean rule based filters. These originally compared the contents of an email with a set of words or phrases and blocked a message if a certain number of hits was reached. The past years have shown that the filter results for advertising emails are insufficient. Emails are wrongly marked as spam (false positives), or they are marked as ham even though they are actually spam (false negatives). These kinds of filters can easily be bypassed by replacing characters with similar looking ones (e.g. "we have çheap repliça watçhes"). Even though rule based spam filters are continuously improved and achieve better results by using regular expressions, spam mailers find ways around these improvements. It can be observed that spammers use rule based filters themselves to increase their odds. The number of false positives can also rise as the filters become more and more complex. Today's filters recognize the rough context of a message and assign a score based on its words and grammar.
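The weakness of simple word rules, and one possible countermeasure, can be shown with a small sketch. The rules and weights below are invented for the example and are not taken from SpamAssassin or from the system studied here. The normalization step (Unicode decomposition followed by removal of combining marks) is likewise only an assumed countermeasure, and it handles accented look-alikes but not every kind of character substitution.

```python
import re
import unicodedata

# Hypothetical rules: (pattern, weight). Not taken from any real rule set.
RULES = [
    (re.compile(r"\bcheap\b", re.IGNORECASE), 1.5),
    (re.compile(r"\breplica\b", re.IGNORECASE), 2.0),
    (re.compile(r"\bwatches\b", re.IGNORECASE), 1.0),
]


def score(text: str) -> float:
    """Sum the weights of all rules that match the text."""
    return sum(weight for pattern, weight in RULES if pattern.search(text))


def normalize(text: str) -> str:
    """Strip accents: decompose characters, then drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


plain = "we have cheap replica watches"
disguised = "we have \u00e7heap repli\u00e7a wat\u00e7hes"  # the example from the text

print(score(plain))                 # all three rules match
print(score(disguised))             # no rule matches the disguised words
print(score(normalize(disguised)))  # matches again after normalization
```

The disguised message receives no hits at all until the look-alike characters are mapped back to plain letters, which illustrates why such rule sets have to be extended continuously.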
Once the score assigned by such a filter reaches a certain threshold, the email is marked as spam. Even though this system can block the larger part of spam mails, it will never make the right decision in 100% of cases. Especially in the business sector, a dedicated junk folder in the mailbox is therefore highly advisable.

Address and URL based blacklists

Every email sent by a mail server contains information in its header about the servers that took part in delivering it. Since this information is only added and never deleted, it can be used to analyze the message. Multiple blacklists are freely available online. These lists contain the IP addresses of known spammers and are updated regularly. Mail filter programs can usually use them to decide how probable it is that a received email is spam. This is usually a safe and reliable way of recognizing spam. The lists can be replaced, managed, and configured easily. The problem lies in how they are generated: users and providers report spammers, which means that millions of spam mails may already have been sent by the time a list is updated. It can also happen that IP addresses of hacked accounts or mail relay servers are blocked. Address ranges of internet service providers can also be affected even when only one person is actually sending spam. Although this should not happen very often, it can be devastating for freelancers and small companies.

Bayesian spam filter

The Bayesian filter is a relatively new method to recognize spam. The analysis uses mathematical algorithms to identify the content. Bayesian filters improve with the amount of received mail classified by the user (see Viatel (2004)). The quality of the spam mails provided to the filter and its implementation heavily influence its efficiency. Although the underlying method goes back to the mathematician Thomas Bayes in the 18th century, an implementation optimized for email traffic is necessary for an optimally working spam filter. Since the Bayesian spam filter learns from examples, its effectiveness depends heavily on the provided input. During the learning phase, the user has to correctly classify ham and spam. Afterwards, the filter system computes signatures for the mails in order to compare them later with incoming mails. As with rule based filters, spam mailers can also trick this method. To do so, however, one needs a strong technical background and has to run many test series and scenarios.

Case Study

For the following analysis, a Linux based system of a company from Ilmenau is used. It consists of the email server Postfix, the virus scanner ClamAV (see ClamAV (2010)), and the spam filter SpamAssassin (see Spamassassin (2010)). These services are controlled by the Amavis service (see Amavis (2010)). The analysis is performed by a script written in Perl. It is triggered by a daemon and reads the services' log messages to store them in a database. A CGI script is then used to visualize the collected data and provide a detailed report over a chosen period of time. The analyzer itself is split into two parts. The first module processes the contents of the log files, and the second one visualizes the stored data.
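The first module can be pictured roughly as follows. The original analyzer is a Perl script that is not reproduced in this paper, so the Python sketch below is only an illustration. The sample log lines are made up in the style of Postfix reject messages, and the regular expression, the table name rejects and the function names are assumptions.

```python
import re
import sqlite3

# Captures the rejection reason up to the first ':' or ';' after the
# bracketed address, e.g. "Helo command rejected" or "Relay access denied".
REJECT_RE = re.compile(
    r"reject: \w+ from \S+: \d{3} (?:\d\.\d\.\d+ )?<[^>]*>: (?P<reason>[^;:]+)"
)

# Made-up sample lines for illustration only.
SAMPLE_LOG = [
    "Nov 29 10:15:01 mail postfix/smtpd[1234]: NOQUEUE: reject: RCPT from "
    "unknown[192.0.2.7]: 450 4.7.1 <localhost>: Helo command rejected: "
    "Host not found; from=<a@example.com> to=<b@example.org> proto=ESMTP",
    "Nov 29 10:15:09 mail postfix/smtpd[1234]: NOQUEUE: reject: RCPT from "
    "mx.example.net[198.51.100.3]: 554 5.7.1 <c@example.com>: "
    "Relay access denied; from=<d@example.net> to=<c@example.com> proto=ESMTP",
    "Nov 29 10:15:30 mail postfix/smtpd[1234]: NOQUEUE: reject: RCPT from "
    "mx.example.net[198.51.100.3]: 550 5.1.1 <nobody@example.org>: "
    "Recipient address rejected: User unknown in local recipient table; "
    "from=<d@example.net> to=<nobody@example.org> proto=ESMTP",
    "Nov 29 10:16:12 mail postfix/qmgr[987]: 4F2A01C3: removed",
]


def load_rejects(lines, conn):
    """Store the reason of every reject line; return the number of rows added."""
    conn.execute("CREATE TABLE IF NOT EXISTS rejects (reason TEXT NOT NULL)")
    rows = []
    for line in lines:
        match = REJECT_RE.search(line)
        if match:
            rows.append((match.group("reason").strip(),))
    conn.executemany("INSERT INTO rejects (reason) VALUES (?)", rows)
    conn.commit()
    return len(rows)


def counts_by_reason(conn):
    """Return a mapping from rejection reason to number of occurrences."""
    cur = conn.execute(
        "SELECT reason, COUNT(*) FROM rejects GROUP BY reason ORDER BY reason"
    )
    return dict(cur.fetchall())
```

Calling load_rejects(SAMPLE_LOG, sqlite3.connect(":memory:")) stores the three reject lines and skips the unrelated queue manager line. A reporting component can then read the aggregated counts without touching the log files again.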
Because the computation is done by an independent component, the report can be viewed almost instantly. The report shows the results in different sections. At the top, the user can choose the period of time for which to view the results of the analysis. The input is quoted to prevent SQL injection.
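How user-supplied dates can be kept from altering a query is sketched below. Note the swapped-in technique: instead of quoting the input manually, the sketch passes the values as bound parameters, which is a common way to reach the same goal. The table mails, its columns received and verdict, and the function spam_share are hypothetical and do not describe the schema of the analyzer.

```python
import sqlite3


def spam_share(conn: sqlite3.Connection, start: str, end: str) -> float:
    """Fraction of mails marked as spam between two ISO dates (inclusive)."""
    cur = conn.execute(
        "SELECT COUNT(*), COALESCE(SUM(verdict = 'spam'), 0) "
        "FROM mails WHERE received BETWEEN ? AND ?",
        (start, end),  # bound parameters: never interpolated into the SQL text
    )
    total, spam = cur.fetchone()
    return spam / total if total else 0.0


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE mails (received TEXT, verdict TEXT)")
    conn.executemany(
        "INSERT INTO mails VALUES (?, ?)",
        [("2010-11-02", "spam"), ("2010-11-10", "ham"), ("2010-11-15", "spam")],
    )
    print(spam_share(conn, "2010-11-01", "2010-11-30"))
    # A hostile value is compared as an ordinary string and changes nothing.
    print(spam_share(conn, "x'; DROP TABLE mails; --", "2010-11-30"))
```

Because the values never become part of the SQL text, a date field containing quotes or SQL keywords cannot change the structure of the statement.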

Figure 1: Start of Report

The next part shows the number of mails that were sorted out by Postfix before the content filter was applied. In addition, information provided by SpamAssassin about the mails that passed the preselection step is displayed. The pie chart displays how the different entries are weighted. Finally, the content filtering rules with the most hits are displayed in a table.

Results

Looking at a report for any period of time, it is apparent that simple Postfix rules can already filter a substantial amount of spam. About 10.0% of the emails pass the different filter mechanisms, and 18.1% are detected as spam by the content filter and placed in a special folder. The assumptions made in the beginning about the amount of spam (over 80% of email traffic) are thus confirmed by the analysis.

Figure 2: Blocked Spam by Postfix

Figure 3: Percentage of each Test

It can also be shown that the number of spam emails increases every year by 300-400%. The excerpt in Figure 4 shows that there are mainly two relevant test groups for detecting spam. The first group consists of blacklists (e.g. the URIBL blacklist, which can be found at URIBL (2010)). They always need to be kept up to date to ensure an optimal filter result. The second group consists of various technical tests (e.g. the Message-ID is not valid according to RFC 2822, see IETF (2001)). Since botnets traditionally try to send vast amounts of spam, it is possible to filter those emails efficiently after they have passed the preselection. In conclusion, spam filter systems can detect a great amount of spam before it reaches the mailbox of the account holder. Using preselection gives a better overview of the kind of spam being received and saves working time.

Figure 4: Rules used by Spamassassin

Conclusion

During the project it was shown that it is not possible to detect every spam email. To further improve the detection rate, the use of signatures should be considered. They identify a sender through the use of certificates. In the future, signatures could play a key role in fighting spam. There could be a central mail authentication service that provides every user with a mail certificate after validating his or her personal information. The private key of the certificate would always remain with the user (e.g. saved on the personal ID card). If the user wants to send an email, the signature needs to be provided. Emails without a valid signature would not be sent. If the number of spam emails sent by a certain user increases, the corresponding certificate would be invalidated. Each mail server would hold a certificate revocation list to easily discard spam emails. If a blocked user wants to receive a new certificate, a fee depending on the extent of the offense would be charged. A user who notices that his or her account was compromised could have it blocked and have a new certificate issued. Applying for multiple email addresses would be allowed in special cases (e.g. multiple or different employers).

Acknowledgements

I am very grateful to André Lochotzke for his help in translating this paper. I also want to thank Prof. Dr. Schorcht of the University of Applied Sciences Erfurt for his support and for providing the data to carry out the research.

References

Absolit Blog (2009) 220 Milliarden E-Mails pro Tag 1980 [Online]. Available from: http://www.absolit-blog.de/studien/220-milliarden-e-mails-pro-tag.html [Accessed: 21 October 2010]
Amavis (2010) A Mail Virus Scanner [Online]. Available from: http://www.amavis.org/ [Accessed: 21 October 2010]
CAPTCHA (2010) The official CAPTCHA Site [Online]. Available from: http://www.captcha.net/ [Accessed: 09 September 2010]
ClamAV (2010) Clam Antivirus [Online]. Available from: http://www.clamav.net/lang/en/ [Accessed: 07 November 2010]
Heinlein, P. (2008). Das Postfix Buch. Sichere Mailserver mit Linux. 3rd Edition. Open Source Press.
IETF (2001) Internet Message Format [Online]. Available from: http://www.ietf.org/rfc/rfc2822.txt [Accessed: 22 September 2010]
IETFS (2001) Simple Mail Transfer Protocol [Online]. Available from: http://www.ietf.org/rfc/rfc2821.txt [Accessed: 22 September 2010]
JISC Techwatch (2004) Spam on the Internet [Online]. Available from: http://www.jisc.ac.uk/uploaded_documents/acf11a8.pdf [Accessed: 05 September 2010]
McDonald, A. (2005). Spam Assassin. Leitfaden zur Konfiguration, Integration und Einsatz. 1st Edition. Addison Wesley Verlag.
MX Logic (2003) Spam Classification Techniques [Online]. Available from: http://avega.ca/avega/wp-content/uploads/2010/01/spam-classification-whitepaper2.pdf.1.30.042.pdf [Accessed: 18 November 2010]
Postfix (2010) The Postfix MTA [Online]. Available from: http://www.postfix.org/ [Accessed: 27 November 2010]
Spamassassin (2010) Spamassassin, the #1 powerful open source filter [Online]. Available from: http://spamassassin.apache.org/ [Accessed: 09 September 2010]
SPF (2010) The Sender Policy Framework [Online]. Available from: http://www.openspf.org/ [Accessed: 13 November 2010]
URIBL (2010) Realtime URI Blacklist [Online]. Available from: http://www.uribl.com/mirrors.shtml [Accessed: 29 November 2010]
Viatel (2004) Spam: Now a cooperative concern [Online]. Available from: http://www.viatel.com/uplds/viatel-097750_spam.pdf [Accessed: 10 September 2010]
Visolve (2009) Taking Control of your emails [Online]. Available from: http://www.visolve.com/vimail/vimailfilter_whitepaper.pdf [Accessed: 16 September 2010]