Detecting spam using social networking concepts
Honours Project COMP4905
Carleton University
Terrence Chiu 100605339
Supervised by Dr. Tony White
School of Computer Science
Summer 2007
Abstract

This paper deals with email and spam filtering using the concepts of social networks. Based on the paper Leveraging Social Networks to Fight Spam [Boykin and Roychowdhury, 2005], we use the distinct properties associated with social networks, such as the clustering effect, to determine whether an email is spam. Throughout this project we look at why the clustering effect is helpful and at the disadvantages of using this method for spam filtering.
Acknowledgments

I would like to thank Dr. Tony White for the idea behind this project and for his support. Thanks also to Microsoft for publishing tutorials and examples on how to build an Outlook plug-in.
Table of Contents

1. Introduction
2. Background Information
   2.1 Email
   2.2 Spam
   2.3 Spam Filters
   2.4 Social Networking
3. Spam filter using Social Networking Concepts
4. Project Design
   4.1 Client and Server
   4.2 Building the Social Network
5. Results
   5.1 Disadvantages
6. Conclusion
References
Appendix A: CD-Rom Contents
Appendix B: Test Cases
Appendix C: Deployment Instructions
List of Figures

Figure 2.1 Sample Social Network
Figure 3.1 Tracing Trust
Figure 3.2 Depiction of a Social Network
Figure 3.3 Spammer vs Legitimate User
Figure 3.4 Spam, No clustering Coefficient
Figure 3.5 Clustering Coefficient Greater than 0
Figure 4.1 Client and Server Project Design
1. Introduction

Email has become a very popular communication tool in the world today. Most people use it every day to communicate with friends, family and co-workers. Others use it to make money through advertisements or to damage computers by sending viruses. This is known as spamming: sending large amounts of unsolicited email to random people every day. Today, an average of 90 billion spam emails are sent out every day. Over the years the amount of spam has increased dramatically, mainly because spam costs little or nothing to send. Users are becoming increasingly frustrated at having to go through all their emails every day just to get to the legitimate ones. Companies have therefore long been building programs to counter spam, known as spam filters, that filter out unwanted emails. Current spam filters rely on filtering out specific email addresses or keywords. In this paper we look at a way to filter out spam that is based solely on trust and that involves the concepts of social networking.

Social networks are networks of intertwined relationships between different people. They are becoming more and more popular in society today, especially on the World Wide Web. Examples of popular social networking sites include Facebook.com and myspace.com. Through these sites people can add friends and family to their list, those friends and family members can add their own in turn, and so on, creating a network of relationships.
2. Background Information

2.1 Email

Email is a system developed in the 1960s as a way of sending and receiving electronic messages over a computer network. An email is composed of two parts, a header and a body. The header consists of many fields that describe the email, such as its subject and date. The header fields we are interested in are:

From: the email address of the sender of the message
To: the destination email address of the message
CC: stands for carbon copy; any email address in this field receives a copy of the email being sent

The body of an email contains the main message to be sent, which can contain text, images, links, different types of files, etc.

2.2 Spam

Since the inception of email, users have been using it as a way to send unsolicited messages. This process is known as spamming, which is the abuse of electronic messaging systems to indiscriminately send unsolicited bulk messages [Wikipedia.org]. Spamming is a low-cost operation that can be used to send out viruses, commit fraud, or advertise products. Because it is so inexpensive, millions of spam messages are generally sent out each day; they are a major annoyance and cost the economy billions of dollars a year.
2.3 Spam Filters

In an attempt to combat spam, companies have created spam filters to filter out unwanted emails. Current spam filtering techniques include searching for specific keywords in the messages or blocking known addresses using a blacklist. These types of filters usually catch a lot of the spam, but the lists of keywords or blacklisted addresses need to be constantly updated. This is a big disadvantage, since spammers can easily create a new email address and continue sending out spam, or purposely misspell words that are usually filtered out. Using the concepts of social networking we can build a spam filter that needs little maintenance and updating.

2.4 Social Networking

Thanks to spam, users have become less trusting of email. With social networking we can bring back the trust. Social networking is a term used to describe a social structure made up of relationships between different groups of people or individuals. When people interact with another person they form a relationship with them, and with this relationship comes a certain level of trust. As a person forms more and more relationships with others, their social network grows. Those people can in turn introduce people from their own social networks, expanding each other's social networks. For example, if a person A is friends with B and C, and A introduces B to C, we get a small social network as shown below.
Figure 2.1 Sample Social Network

Social networks are known as small-world networks. A network is called a small world if it has the following traits [Ebel, Davidsen and Bornholdt, 2002]:

1. Clustering
2. A small average shortest path between two nodes, scaling logarithmically with network size

Clustering is the property, seen in Figure 2.1, that if A is linked to B and C, then there is a good chance that C is linked to B. The second trait is important when building a spam filter, because it states that although most nodes in a small-world network are not neighbors of one another, any node can be reached from another through a small number of intermediate nodes. This trait is useful when defining trust.
3. Spam filter using Social Networking Concepts

Using the concepts of social networking, we can show that an effective spam filter can be built. This spam filter filters out spam based on whether we can trust the sender, so before we start building it we need to define what trust is.

3.1 Defining Trust

When defining trust, we look at how trust works in the real world. In the real world, people have relationships with family members and friends, and with each relationship comes a certain level of trust. We do not trust people we do not know, but their level of trust goes up if they are associated with someone we know. This gives trust in the social networking world a transitive property. If we go back to the definition of small-world networks, we can show that a person we do not know personally can still be trusted because that person can be traced back to someone we know. As shown below, if A trusts B and has never met C, and C is a trusted friend of B, chances are good that A can trust C.

Figure 3.1 Tracing Trust

When we define trust in our social networking spam filter, we need to look not at how well the user trusts a person but at how well other people trust that person. When we look at a social network visually we see a picture of a bunch of nodes connected together by
links, but in reality a social network as a whole is a collection of small social networks that can be connected together.

Figure 3.2 Depiction of a Social Network

In Figure 3.2 we have 3 small-world networks. Within each group, not every node has a relationship with every other node, but most nodes can be reached by traversing other nodes. If a user from Group 3 were to email a user from Group 1, we can say that the user in Group 3 can be trusted, mainly because other users in Group 3 trust him. This works because when a spammer sends out spam, he usually uses a list of random people that he does not know. Since the list of people the spammer uses is random, the probability that any two people on the list know each other is relatively low, as seen in Figure 3.3.
Figure 3.3 Spammer vs Legitimate User

As seen in the legitimate user's case, when he sends out an email there is a better chance that the people on his list know each other, which follows from the properties of a small-world network. With trust defined, this leads us to the clustering coefficient.

3.2 Clustering Coefficient

A social network's most distinctive property is the tendency to cluster [Boykin and Roychowdhury, 2005]. The tendency to cluster means that if a node A in a graph has a relationship with node B, and node B has a relationship with node C, then there is a good probability that node A has a relationship with node C. Using this distinctive property we can determine whether a specific person belongs to a social network, and therefore whether an email is trusted or spam.
To determine whether a specific email is spam, we calculate its clustering coefficient. To do this we focus on the sender's node and the graph it is connected to. The email in question is treated as a node in a graph, and each email node has a degree k, that is, each node has k neighbors. If the node has a degree of less than 2, then it can be assumed that the email is spam. We can assume this because a node with a degree of less than 2 does not belong to any social network yet and therefore has no one who can vouch for the sender as a trustworthy user. This is a very big disadvantage, since a legitimate user can still be considered a spammer if the user is just entering the network. If a node has a degree of 2 or greater, then each of the k neighbors has the potential of being connected to each of the others.

In our social network we use a directed graph, where a link from user A to user B is different from a link from user B to user A. The reason for choosing a directed graph over an undirected graph is trust. Trust is not symmetric, that is, a person can trust someone without that person having to feel the same way in return. Therefore the maximum number of possible connections in a directed graph is k(k-1). In an undirected graph the maximum number of possible connections is k(k-1)/2, since a link from A to B is the same as a link from B to A. With this information we can now define the clustering coefficient as [wikipedia.org, 2007]

\[ C = \frac{|\{e_{jk}\}|}{k(k-1)} \]

Equation 1 Clustering coefficient

The clustering coefficient is the ratio of the number of actual links between the node's neighbors, |{e_jk}|, to the number of possible links, k(k-1). With the clustering coefficient, we can now
distinguish between spam and non-spam emails. In a social network, if none of the sender's neighbors are connected to each other, then the sender cannot be in any social network.

Figure 3.4 Spam, No clustering Coefficient

Figure 3.4 shows sender A trying to send emails to B and C. A has two neighbors, giving two possible connections between them, and since the actual number of links is 0, we get a clustering coefficient of 0. Therefore if A sends an email out to B and C, it will be detected as spam.

Figure 3.5 Clustering Coefficient Greater than 0

Figure 3.5 shows that sender A is part of a social network, so his email is trusted as legitimate. If A tries to send an email to C, we check the number of
neighbors he has, which is 3, giving us 6 possible connections, of which 2 actually exist. We then get a clustering coefficient of 2/6 = 1/3. From the above examples we can conclude that a clustering coefficient of 0 tells us that the email received is spam, whereas a clustering coefficient greater than 0 tells us that the email received is not spam.
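The calculation described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's C# or Java code: the names `neighbors`, `clustering_coefficient` and `is_spam` are hypothetical, a node's neighbors are assumed to be every address it has a link to or from, and the placement of the two links in the second example graph is invented so that the counts match the 3 neighbors, 6 possible and 2 actual connections discussed above.

```python
from typing import Dict, Set

# Directed graph: graph[a] is the set of addresses that a has a link to.
Graph = Dict[str, Set[str]]


def neighbors(graph: Graph, node: str) -> Set[str]:
    """Return every address linked to or from `node` (assumed neighbor rule)."""
    outgoing = graph.get(node, set())
    incoming = {src for src, dsts in graph.items() if node in dsts}
    return (outgoing | incoming) - {node}


def clustering_coefficient(graph: Graph, node: str) -> float:
    """Ratio of actual directed links among neighbors to the k(k-1) possible."""
    nbrs = neighbors(graph, node)
    k = len(nbrs)
    if k < 2:
        # Degree below 2: no pair of neighbors exists, so nothing can cluster.
        return 0.0
    actual = sum(len(graph.get(j, set()) & (nbrs - {j})) for j in nbrs)
    return actual / (k * (k - 1))


def is_spam(graph: Graph, sender: str) -> bool:
    """A coefficient of 0 means spam; anything greater means legitimate."""
    return clustering_coefficient(graph, sender) == 0.0


# Two neighbors with no link between them, as described for Figure 3.4.
no_cluster: Graph = {"A": {"B", "C"}}
# Three neighbors with two links among them (hypothetical link placement).
clustered: Graph = {"A": {"B", "C", "D"}, "B": {"C"}, "C": {"B"}}

print(is_spam(no_cluster, "A"), clustering_coefficient(no_cluster, "A"))
print(is_spam(clustered, "A"), clustering_coefficient(clustered, "A"))
```

Storing the graph as a dictionary of sets keeps links directed, so a link from B to C and a link from C to B are counted separately, which is what makes k(k-1) the right denominator.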
4. Project Design

4.1 Client and Server

The spam filter built here is a plug-in developed in C# for Microsoft Outlook. In order for the plug-in to check whether an incoming email is spam, the social network has to be connected through other users as well. This means the social networking data needs to be kept on, and updated by, a central server program. The client (plug-in) updates the server whenever an email is sent out or received, and since the server holds all the social network data, the server determines whether the email is spam.

Figure 4.1 Client and Server Project Design

4.2 Building the Social Network

In order to detect spam using social networking concepts, we first have to build the social network. The network is built first from the email addresses in the user's contact list, adding a relationship for each and every contact. Subsequent email addresses are then added later, using the fields found in the header, whenever the user composes an email and sends it out. This
can be done because, by initiating a conversation with another person, we are showing a level of trust in that person. So email addresses in the To and CC fields are added as relationships of the user.
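The two building steps described in Section 4.2 can be sketched as follows. This is a minimal Python illustration rather than the project's plug-in or server code: `add_contacts` and `add_outgoing_mail` are hypothetical names, the graph is assumed to be a dictionary mapping each sender to the set of addresses it has a link to, and addresses are lower-cased on the assumption that they should compare case-insensitively. Header parsing uses Python's standard `email` package.

```python
from email import message_from_string
from email.utils import getaddresses, parseaddr
from typing import Dict, Iterable, Set

# Directed graph: graph[a] is the set of addresses that a has a link to.
Graph = Dict[str, Set[str]]


def add_contacts(graph: Graph, user: str, contacts: Iterable[str]) -> None:
    """Seed the user's relationships from the contact list."""
    user = user.lower()
    links = graph.setdefault(user, set())
    links.update(c.lower() for c in contacts if c.lower() != user)


def add_outgoing_mail(graph: Graph, raw_message: str) -> None:
    """Add every To and CC address of an outgoing email as a relationship."""
    msg = message_from_string(raw_message)
    _, sender = parseaddr(msg.get("From", ""))
    sender = sender.lower()
    if not sender:
        return  # Without a sender there is no node to attach links to.
    recipients = getaddresses(msg.get_all("To", []) + msg.get_all("Cc", []))
    links = graph.setdefault(sender, set())
    for _, address in recipients:
        address = address.lower()
        if address and address != sender:
            links.add(address)


network: Graph = {}
add_contacts(network, "spam@spam.ca", ["friend@spam.ca"])
add_outgoing_mail(network, "From: spam@spam.ca\nTo: spam1@spam.ca\n\nhello\n")
print(network)
```

The second call mirrors Test Case 3 in Appendix B, where an outgoing email from spam@spam.ca to spam1@spam.ca adds spam1@spam.ca as a neighbor of spam@spam.ca.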
5. Results

Using the clustering coefficient we were able to show that any email that comes from a user who does not belong to a social network is classified as spam. Like all spam filters, though, this method of filtering has its disadvantages.

5.1 Disadvantages

The first disadvantage, as mentioned earlier, is the inability to distinguish between spam and legitimate email if the user is just entering the social network. If the user is legitimate and has a degree of one or less, meaning he has one or no neighbors in his network, the user will be treated as a spammer.

This method works mainly because the probability that a spammer's social network will cluster is close to 0. Even so, there can be instances where a spammer happens to send an email to two people who are friends. In that case the spammer's social network is suddenly clustering.

Another disadvantage is that a spammer can easily fake a social network by creating one with fellow spammers, thus forming a spammers' social network. Since they form a social network by themselves, any emails from them will be treated as legitimate.
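The newcomer and spammer-network cases can be worked through with the clustering coefficient from Section 3.2, that is, the number of actual links between a sender's neighbors divided by the k(k-1) possible links. The accounts N, S, X and Y below are hypothetical and chosen only for illustration. A legitimate newcomer N who has so far exchanged mail with one person has degree 1 and is rejected before any coefficient is computed:

\[ k_N = 1 < 2 \quad\Rightarrow\quad \text{N is treated as spam} \]

A spammer S who colludes with two accounts X and Y, which also link to each other in both directions, has degree 2, and both of the possible links between the neighbors exist:

\[ C_S = \frac{|\{e_{XY},\, e_{YX}\}|}{k_S(k_S-1)} = \frac{2}{2 \cdot 1} = 1 > 0 \quad\Rightarrow\quad \text{S is treated as legitimate} \]

Under these assumptions two cooperating accounts are enough to pass the check, while a genuine new user fails it.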
6. Conclusion

Social networks are becoming increasingly popular in technology today. Their uses range from interacting on the internet through sites such as facebook.com or myspace.com to serving as the basis of a trust system. Social networks are a good way of using trust to filter spam, but the approach is not as effective as it should be because of its disadvantages. When using social networking concepts to build a spam filter, it is best to build the filter on top of existing spam filters, so that it can be used as the last line of defense. Possible future work not discussed in this paper includes using these concepts to help build a system that develops blacklists of the email addresses of known spammers. Given a big enough social network, such a system could become quite effective.
References

1. Boykin, P. Oscar, Roychowdhury, Vwani P. (2005) Leveraging Social Networks to Fight Spam, IEEE Computer Society.
2. Ebel, Holger, Davidsen, Jorn, Bornholdt, Stefan. (2003) Dynamics of Social Networks, Wiley Periodicals Inc.
3. Wikipedia.org (2007) Wikipedia, Wikimedia Foundation.
Appendix A: CD-Rom Contents

Spam Filter: written in C# for Outlook 2003
Server: written in Java 1.5.0
Honours Project.pdf: electronic report
Appendix B: Test Cases

Test Case 1: Receiving New Email
Input: New mail, From: spam@spam.ca
Expected Result: Email is passed into check for cluster
Actual Result: Email is passed into check for cluster
Result: Pass

Test Case 2: Clustering Coefficient
2a Input: User is spam
   Expected Result: Algorithm returns true; meaning spam
   Actual Result: Algorithm returns true; meaning spam
   Result: Pass
2b Input: User is not spam
   Expected Result: Algorithm returns false; meaning legit
   Actual Result: Algorithm returns false; meaning legit
   Result: Pass

Test Case 3: Outgoing Mail
Input: Outgoing mail, From: spam@spam.ca, To: spam1@spam.ca
Expected Result: Add spam1@spam.ca as a neighbor to spam@spam.ca
Actual Result: Add spam1@spam.ca as a neighbor to spam@spam.ca
Result: Pass
Appendix C: Deployment Instructions

The following software is required to run the project:
1) .NET Framework 2.0
2) Microsoft Office 2003 SP1 (or later) with Outlook
3) Microsoft Office 2003 Primary Interop Assemblies (O2003PIA.exe)
4) Visual Studio Tools for Office Runtime (vstor.exe)
Java Runtime 1.5

To run the server, go to Server\bin, edit run.bat so that JDK Home points to your Java home directory, and execute run.bat.

5) Setting the CAS Policy
You need to set the CAS policies in order for the plug-in to run. In Windows XP, go to:
1. Control Panel
2. Administrative Tools
3. Microsoft .NET Framework
4. My Computer
5. Runtime Security Policy
6. User
7. Code Group
8. Click on All Group and add New
9. Name the Code Group
10. Set Condition All Code
11. Full Trust is on

Install the spam filter located in \Spam Filter\Spam FilterSetup\Debug\setup.exe

When you open Outlook, the spam filter should be running. Under Tools > Spam Filter, you need to set the IP of the server; the default is localhost.