Design and Implementation of a Secure Social Network System

Ryan Layfield, Bhavani Thuraisingham, Latifur Khan, Murat Kantarcioglu, Jyothsna Rachapalli
The University of Texas at Dallas

Abstract

Context-based anomaly tracking represents a new approach to enhancing the security of communication streams. By creating a system that develops an understanding of normal and abnormal behavior based on communication history, it is possible to detect fluctuations in an evolving social network. Although more research is necessary to overcome current obstacles, the combination of social network analysis and anomaly detection techniques yields a promising set of applications for enhancing communication security. In this paper we describe a system for context-based anomaly detection and then describe experiments for a message surveillance application.

I. INTRODUCTION

Social networks are essentially networks formed by individuals, groups, and organizations. Social network analysis is about analyzing the behaviors of individuals, groups, and organizations and determining their behavior patterns. Social network analysis is becoming an important tool for counterterrorism applications. For example, with social network analysis one can perhaps determine whether individuals, groups, or organizations are involved in terrorist activities. An example of a social network is illustrated in Figure 1.

Monitoring a continuous stream of data in the interest of security is not a trivial problem. In order to properly classify a single message as normal or suspicious, one must parse the contents, determine the origin, identify the recipients, and determine how prior communication traffic affects the context. Whether or not a message is suspect, it can theoretically affect the semantic meaning of future communications. This implies that some method of storing prior messages is necessary for a perfect detection system.
While tools exist for message classification for a variety of purposes, there are no known systems which establish a localized context for each node in the interest of security. While individuals being monitored may share a similar context in a common environment, there are several scenarios in which the same message passed by two different users does not have the same meaning. Hence, there is a need for a system that personalizes context data to properly ascertain security threats.

One of the biggest challenges in automated message surveillance is the recognition of messages containing suspicious content. A classic approach to this problem is constructing a set of keywords (e.g., bomb, nuclear). In the event that a communiqué contains one or more of these words, the message is flagged as suspicious for further review. However, there are two drawbacks to this particular approach. First, it is reasonable to assume that such relatively static keywords will not always be present in messages that would otherwise warrant suspicion. Second, there is little guarantee that a sufficiently intelligent individual will not recognize that such surveillance is in place and use substitute words in place of known keywords.

David Skillicorn of Queen's University has suggested a different approach in his work on the Enron dataset [SKIL05]. In this work, he outlines a method for using singular value decomposition (SVD) to recognize trends in such topics as email and social networks. We believe that this work can be expanded upon by extrapolating the techniques he used and applying them to a real-time message monitoring system. We wish to create algorithms, designed to handle streaming data, that make use of the techniques outlined in Skillicorn's work. Using the Enron dataset, we will apply our existing threat identification techniques together with singular value decomposition to discover correlations in message content.
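The keyword approach described above, and its second drawback, can be illustrated with a minimal sketch. The watch list, the function name flag_message, and the sample sentences are assumptions made for illustration; the paper names only "bomb" and "nuclear" as example keywords and does not give an implementation.

```python
import re

# Hypothetical watch list; the paper gives only these two words as examples.
KEYWORDS = {"bomb", "nuclear"}


def flag_message(body, keywords=KEYWORDS):
    """Return the sorted watch-list words that appear in the message body."""
    words = set(re.findall(r"[a-z']+", body.lower()))
    return sorted(words & keywords)


plain = "The bomb is ready for delivery."
disguised = "The corn is ready for delivery."  # substitute word defeats the filter

print(flag_message(plain))
print(flag_message(disguised))
```

The first message is flagged and the second is not, even though only one word changed. This is the gap that the context-based and SVD-based techniques are meant to close.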
Through word frequency analysis, we can rate the similarity of two separate data streams that may or may not deal with the same topic. Such an algorithm would allow us to recognize subjects, trends, and even conversations.

The organization of this paper is as follows. In Section 2 we discuss the design of our social network system and context-based anomaly tracking. In Section 3 we discuss social network analysis for message detection. We provide some background information on the techniques utilized and discuss our system. We also discuss security and privacy considerations. The paper is concluded in the final section.

IEEE ISI 2009, June 8-11, 2009, Richardson, TX, USA

Figure 1. A Typical Social Network

II. INTEGRATING SOCIAL NETWORK ANALYSIS WITH CONTEXT-BASED ANOMALY TRACKING

A. Background

Since its inception, email has become an increasingly popular form of communication. According to a study performed by research group IDC in 2002, email traffic will increase from 31 billion messages a day to 60 billion by 2006 [MINI05]. As of the second quarter of 2005, there are roughly 900 million known users of the internet (Global Reach). Assuming that mail traffic increases linearly, each user receives an average of 60 emails a day. Manually sifting through the email traffic of a group of a hundred people would require an individual to read 6,000 emails a day. Clearly, automated methods are necessary to deal with such increasing volumes of data.

The combination of text analysis and link mining concepts is not itself a new avenue of research. The work of Ben-Dov et al. demonstrates the ability to enhance link mining of news sites by using available tools to semantically comprehend the contents of a document. One experiment performed by the group successfully discovered correlations between two individuals based simply on their presence within the same sentence. Successful examples can also be found in the field of semantic web analysis [HORR03].

Email anomaly detection is not a new concept. One existing topic of interest is the identification and filtering of spam. Using a set of desirable message attributes, a spam removal system is responsible for removing all unwanted email from a user's inbox. This frequently includes advertisements, fraudulent topics, general bulk mail, and any other messages that do not appear relevant. Ideally, this results in a set of messages consisting only of what the user desires [GOLD92-61]. While not always providing enhanced security directly, spam filtering represents a well-defined area to build from.

The monitoring of email and other point-to-point contact services can be used to build a relationship-oriented web. Instead of simply looking at each message as an isolated event, such a web allows the complex relationships between individuals to be mapped and further analyzed. Formally, this approach is known as constructing a social network. Within a social network, each individual represents a node. One or more messages passed from one node to another form a basic unidirectional link. Over time, as more messages are passed, it becomes possible to determine common lines of communication among people. By weighting links based on message frequency and whether or not replies are given in a timely manner, the strength of social ties between individuals can be realized.

B. Our Approach Overview

The system we propose is an active monitoring agent that resides at a major message communication hub. Each message that passes through the hub is deconstructed to acquire basic information, such as source and destination addresses. These in turn are used in the construction of an evolving graph of communication patterns.

Figure 2. Original System Architecture

The fundamental properties of this design can be found in how the elements within a monitored group are represented. First, each individual that uses the hub is kept as a user node. As in social networking, each node represents an endpoint of communication. Basic contact information is kept at the node to identify when future messages are arriving at or departing from the node itself. In the

case of email, the email address is all the contact information necessary to uniquely identify the user. It is assumed that these identifiers do not change over time.

Each message passed represents a conversational link between two users. The direction of the link is determined by the source and destinations of the message. In the event a message is passed back, the link automatically becomes bi-directional. The strength of a link is dependent on the number of messages passed in either direction. The attributes of the message itself are stored within the link. This allows the system to retrieve historical data between two nodes without needing to go to each node, find the relayed messages between them, etc. This is counter-intuitive to how messages are normally stored within most message services, but it is necessary to form context. Since a single message can be sent to multiple parties, a single instance of the message is often shared by multiple links.

In the event the message passed has unusual properties, the anomalous characteristics are noted and recorded within a unique token. The recorded attributes of a message ideally include unusual keywords, communication pattern deviations, and any other clues that may be necessary to identify future messages with similar traits. Other attributes of the token include an atomic identifier and a pointer to the originating token, if any. The token itself is considered part of the established context, and it is stored within node endpoints.

The reason a node is responsible for storing tokens, rather than a link, can be found in the fact that messages convey information that the user records for future reference when communicating with other users. These tokens represent unusual information that can propagate through a network. When new messages are passed, a check is performed to determine whether or not tokens exist at the originating node which match the attributes of the message itself. If a match is found, a child token is created and passed along with the message. This child contains a link back to the original, creating a semantic trail that can be traced through a network.

Such a trace can be useful in several security scenarios. For example, consider an intelligence agency concerned that there have been leaks of information within the organization to the media. A security manager using this system could begin by flagging certain keywords found only in a top-secret report recently given to suspects. In the event that messages sent from these suspects begin to use these keywords, a context token is created and passed along to each individual that received the information. In turn, should any of the recipients disseminate the classified information to other recipients, additional child tokens will be created. Ultimately, a localized web of suspicious parties is created and tracked for future investigations.

When the system is deployed, it is of note that it is the responsibility of an agent observing the results to take action. Ideally, the agent will be a human responsible for the security of the group being monitored. The system itself is only an observational tool. This approach was chosen to maximize potential uses for the system. For example, the responses chosen by an intelligence organization would vary widely from those of an internet service provider.

Analysis

In this section we discuss the strengths and weaknesses of our approach. Future directions will be discussed in Section 5.

Currently, the only characteristic of a message that the system tracks is a fixed set of keywords within the body of an email. While marginally effective in generating reasonable result data, the technique is far from sufficient. The future plans section of this document describes the techniques which will eventually be implemented.

--Strengths

In theory, the deployment of this system offers a great many benefits. First, all analysis is performed in real-time. This means that, once deployed, the system is actively monitoring the available text stream for any and all communication activity. In the event that a malicious situation is identified, the observer of the system can either respond immediately or await further messages to decide whether a security issue exists.

Second, the system indirectly models the complex social interactions of individuals. Hence, as messages are passed, it is possible to identify groups of people with malicious intent and how they collaborate. This is especially crucial to the recognition of social sub-networks, in which normal keyword testing could be insufficient for identifying individuals with malicious intent.

For example, consider the deployment of this system in the interest of catching a group of criminals involved in smuggling stolen works of art across international borders. Assume that they are using a message passing network to remain in constant contact, along with a multitude of innocent people. Using keywords involving the stolen works, simple text filtering could create a number of false positives from people simply discussing the crimes mentioned in the news. By overlaying detected keyword uses with social network graphs, we could detect a group of individuals using these words among themselves. Once the group is properly identified, the entire set of connected individuals could be captured and questioned.

Extending this scenario, should any of these individuals be held responsible, the system has already generated a set of conversations shared among the guilty parties. These exchanges could easily translate into an evidence exhibit to be used during prosecution. While such an exhibit could certainly be built with existing text mining tools, the convenience offered by

the availability of this data is an invaluable tool in situations where time is a factor.

--Weaknesses

Unfortunately, as promising as such a system may be, its proper operation depends on a number of factors. First, the system requires a roughly omniscient view of communication among individuals. For example, it assumes that users of an email server will not use any other server to communicate, nor any other form of communication that falls outside the bounds of what can be observed. Given that groups of individuals will likely communicate in person at some point, one or more semantic gaps could be created. Such gaps would prohibit token passing among nodes, as well as create inaccuracies within the perceived social network, reducing the overall effectiveness of the system.

Second, there are serious ethical implications for a system with such far-reaching observational capability. Regardless of whether or not individuals are engaging in suspicious activity, social models are being created for future reference. Essentially, the data generated can be used to identify how close two individuals are, what they have been talking about, the common points of contact among them, etc. If an individual uses the monitored text stream exclusively for communication, a fairly accurate model of their relationships can be generated.

To fully understand how such data can be used against an individual, consider an employer with access to a system that has been observing an individual applying for a position. During the evaluation process, the employer could analyze the social network around the applicant and determine the people they are closest to. These individuals could then be contacted and asked a series of questions about the applicant, their habits, prior employment history, etc.
While the employer would benefit greatly from having such data, the potential employee would undoubtedly feel their private life had been violated.

Another weakness of the system is the lack of training methods to teach it when certain messages are false positives or false negatives. Although it is assumed that the observing agent can distinguish between results, it is much more convenient to filter out the noise and focus on the issues that require attention. Additionally, given the system's use of tokens, a large number of false contexts could be created that would cause multiple complications for the entire social network. In theory, the impact of false tokens could be eliminated by giving an agent the option to delete specific tokens, but this is only a temporary solution.

Regardless of how effective the system is, the ultimate weakness that it faces is how much data must be stored. Traditionally, in most message passing networks, messages are stored at the user's terminal, removing the burden from the server. However, in order for the system to properly determine previous context, all messages passed must be stored in an archive after processing. Coupled with the storage of links among users and the presence of tokens, it is possible that the data requirements of the system could multiply exponentially as more users join a network and average traffic flow increases.

As of the creation of this document, there are no known systems implemented with these characteristics. Presumably, such systems may fall under classified government security methods. The fabled Echelon system, for example, is rumored to have similar capabilities. However, the lack of documentation to support this claim leads us to believe that this system simply represents a relatively unexplored area of research.
Prototype Implementation

I Design of the System

The objective of this system is to combine social network analysis with text anomaly detection to enhance detection of unusual and undesirable activity. By combining these techniques and applying them to a continuous stream of messages, we believe it is possible to build a more secure communication system by identifying unusual behavior in normal channels of contact. Ultimately, this system would ideally be deployed on either a corporate or public network, provided that adequate authority exists.

There are two primary parts necessary for the successful operation of this system: the organizational analysis system and the real-time results viewer. The former is responsible for building an understanding of the system from the text stream being observed. The latter translates the output of the former into a graphical representation viewable by a human security agent. This separation is necessary due to the intense amount of processing required by each part to perform its responsibilities in real-time.

The architecture of the system is largely decentralized and heavily object-oriented. The major aspects of functionality and purpose are encapsulated in appropriately named objects in the interest of keeping data and state information categorized appropriately. However, the system is currently tightly coupled, as many objects are heavily dependent on the functionality of others, often in both directions.

Organizational Analysis System

This subsystem is responsible for the actual processing required to parse, analyze, and derive information from a text stream. Hence, it is broken down into three main pieces: the Delivery Agent, Detection Agent, and Mailroom Agent, each responsible for one of these three tasks, respectively. Basic information is represented as a series of user nodes and

conversational links, while tokens represent anomalous behavior.

The Delivery Agent object represents the entry point for messages into the system. In the current design, it is a passive agent that, for each iteration, reads in another message from the stream. It then parses that message for its origin, destination, content, etc. Ultimately, a message object is created to embody this email. No tokens are created at this point, as the agent does not keep track of context internally. This message is then passed to the Mailroom Agent. Neither messages nor emails are kept on record here.

When a message first arrives, the Mailroom Agent queries the Detection Agent to determine if any suspicious activity is present. While most of the functionality of this agent is not fully implemented, it is responsible for maintaining data that determines the frequency of words, which words are unusual, and whether activity is normal or abnormal. Tokens are generated and attached to the message if necessary.

The Mailroom Agent itself is responsible for keeping track of users in the system and delivering message objects to recipients. It maintains a virtual address book of identified users based on the endpoint communication identifier presented in the message itself. A copy of the message is also given to the User Node identified as the origin.

The actual data of the social network and the propagation of messages is kept in the form of User Nodes and the conversational links between them. A User Node keeps track of the address of the user and represents an endpoint of communication. It also keeps a local list of tokens, which serve as the context for identifying future suspicious behavior. Nodes are created automatically as new [...]

A Conversational Link represents a link between two nodes. Each link object keeps track of the messages between two nodes. The frequency and volume of activity represent the weight of this link. A higher number represents a stronger bond between two individuals.
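The node and link bookkeeping described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class and method names are hypothetical, and treating a weight as a plain message count is an assumption (the paper says only that frequency and volume of activity determine it).

```python
class ConversationalLink:
    """One object shared by both endpoint nodes; stores their message history."""

    def __init__(self, a, b):
        self.endpoints = frozenset((a, b))
        self.messages = []    # (sender, message_id) pairs
        self.senders = set()  # which endpoints have sent over this link

    @property
    def weight(self):
        # Assumption: weight is simply the number of messages exchanged.
        return len(self.messages)

    @property
    def bidirectional(self):
        return len(self.senders) == 2


class UserNode:
    def __init__(self, address):
        self.address = address
        self.links = {}   # peer address -> shared ConversationalLink
        self.tokens = []  # local context (anomaly tokens)
        self.weight = 0   # traffic seen at this node


class Mailroom:
    """Address book that creates nodes on first sight and updates links."""

    def __init__(self):
        self.nodes = {}

    def node(self, address):
        if address not in self.nodes:
            self.nodes[address] = UserNode(address)
        return self.nodes[address]

    def deliver(self, message_id, sender, recipients):
        src = self.node(sender)
        src.weight += 1
        for rcpt in recipients:
            if rcpt == sender:
                continue  # a node never links to itself
            dst = self.node(rcpt)
            dst.weight += 1
            link = src.links.get(rcpt)
            if link is None:
                link = ConversationalLink(sender, rcpt)
                src.links[rcpt] = link
                dst.links[sender] = link  # same object on both sides
            link.messages.append((sender, message_id))
            link.senders.add(sender)


room = Mailroom()
room.deliver("m1", "alice@example.com", ["bob@example.com", "carol@example.com"])
room.deliver("m2", "bob@example.com", ["alice@example.com"])
link = room.nodes["alice@example.com"].links["bob@example.com"]
print(link.weight, link.bidirectional)
```

Because both endpoints hold the same link object, the history between two users can be read from either side without searching each node's mailbox.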
For space reasons and consistency, link objects are automatically shared between two nodes, reducing overhead and increasing the efficiency of the system.

When a message arrives at a User Node, the node is conditionally updated. If the message originated from the node, a link is either created or updated between it and the receiving nodes. If the node is a recipient of the message, it is assumed that the links have already been taken care of. Regardless, the tokens present in the message are integrated with any existing tokens at the node. The weight of the node is also updated to reflect traffic.

The token object is the key to the way the system keeps track of security. Each token embodies the unusual characteristics of a message. If a message exhibits behavior not previously recorded, a new token is created. If an existing token reflects the behavior adequately, a child of that token is created. Since each token is aware of its parents and origin, a series of links can be established in the form of a web that traces the propagation of undesirable activity. This information is ultimately used to determine which messages should be reviewed and when activity warrants investigation.

Suspicious token propagation is passed to an Action Log object. Each time a node passes an unusual message, it is recorded in the log file for future review. Each entry contains the origin, destination, and identification of the message in question, as well as the token involved. The results are then fed into the viewer application.

Real-Time Results Viewer

The viewer application, while not actually performing any security-related processing, plays a crucial role in translating raw data into human-readable results. Essentially, it accepts input from the analysis system, rebuilds a minimal representation of noted users and links, and displays them graphically.
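The parent/child token mechanism described above can be sketched as follows. All names here are hypothetical. Matching a token against a message by overlapping traits is an assumption, since the paper does not specify how an existing token is judged to reflect the behavior adequately.

```python
import itertools

_next_id = itertools.count(1)


class Token:
    """Records unusual traits of a message plus a pointer to its parent token."""

    def __init__(self, traits, holder, parent=None):
        self.id = next(_next_id)
        self.traits = frozenset(traits)
        self.holder = holder  # address of the node storing this token
        self.parent = parent

    @property
    def origin(self):
        tok = self
        while tok.parent is not None:
            tok = tok.parent
        return tok

    def trail(self):
        """Holders from the original token down to this one."""
        path = []
        tok = self
        while tok is not None:
            path.append(tok.holder)
            tok = tok.parent
        return path[::-1]


def propagate(store, sender, recipient, traits):
    """store maps address -> tokens held there; returns the recipient's new token."""
    traits = frozenset(traits)
    if not traits:
        return None  # nothing unusual about this message
    match = next((t for t in store.get(sender, []) if t.traits & traits), None)
    if match is None:
        # Behavior not recorded before: create an original token at the sender.
        match = Token(traits, sender)
        store.setdefault(sender, []).append(match)
    child = Token(match.traits, recipient, parent=match)
    store.setdefault(recipient, []).append(child)
    return child


store = {}
propagate(store, "alice", "bob", {"report-x"})
leak = propagate(store, "bob", "reporter", {"report-x"})
print(leak.trail())
```

Following the parent pointers from any child back to its origin reproduces the semantic trail that a security agent would review.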
The interface ultimately permits human security agents to determine whether or not the results warrant further investigation.

The data arrives in the form of a series of lines of text. Each line carries a command along with the parameters (as comma-separated values) necessary to carry it out. There are six basic commands that the system recognizes, as shown in Figure 3.

Notice that no coordinate data is included with the commands. This is because the analysis system does not actually create a graphical representation of the data. As nodes arrive, they are initially given random coordinates. However, a separate thread runs independently of the basic viewing program. This thread is responsible for arranging the nodes into a much more aesthetically pleasing and human-readable format by enforcing a spring layout. The resulting nodes are rendered based on this changing data.

The spring layout is a relatively straightforward way to arrange nodes. Each node represents a ball of mass within an environment of inverse gravity. This is done to ensure nodes do not overlap with each other and obscure crucial data. Essentially, if a node is within a threshold distance of another, they mutually repulse each other with a force proportional to their respective sizes. However, if this alone were enforced, the nodes would simply space themselves out, and the result would likely be a tangled mess of nodes and their links. Therefore, we use the links between nodes as simulated springs, drawing the nodes together. The strength of the spring is determined by the weight of the link: stronger links bring nodes closer together, while weaker links attempt to keep nodes at a minimum distance.
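One iteration of the spring layout described above might look like the following sketch. The function name, the constants, the use of the product of node sizes for repulsion, and a spring stiffness proportional to link weight are all assumptions chosen for illustration; the paper does not give the actual force formulas.

```python
import math


def layout_step(pos, sizes, links, threshold=50.0, rest=30.0,
                k_repel=1.0, k_spring=0.01, step=1.0):
    """Return new positions after one iteration.

    pos:   {node: (x, y)}
    sizes: {node: size}          used to scale repulsion
    links: {(a, b): weight}      heavier links act as stiffer springs
    """
    force = {n: [0.0, 0.0] for n in pos}

    def direction(a, b):
        dx = pos[b][0] - pos[a][0]
        dy = pos[b][1] - pos[a][1]
        dist = math.hypot(dx, dy) or 1e-9  # avoid division by zero
        return dx / dist, dy / dist, dist

    # Repulsion: only between nodes closer than the threshold distance.
    nodes = list(pos)
    for i, a in enumerate(nodes):
        for b in nodes[i + 1:]:
            ux, uy, dist = direction(a, b)
            if dist < threshold:
                push = k_repel * sizes[a] * sizes[b]
                force[a][0] -= push * ux
                force[a][1] -= push * uy
                force[b][0] += push * ux
                force[b][1] += push * uy

    # Attraction: each link pulls its endpoints toward the rest length.
    for (a, b), weight in links.items():
        ux, uy, dist = direction(a, b)
        pull = k_spring * weight * (dist - rest)
        force[a][0] += pull * ux
        force[a][1] += pull * uy
        force[b][0] -= pull * ux
        force[b][1] -= pull * uy

    return {n: (pos[n][0] + step * force[n][0],
                pos[n][1] + step * force[n][1]) for n in pos}


pos = {"a": (0.0, 0.0), "b": (200.0, 0.0), "c": (0.0, 10.0)}
sizes = {"a": 1.0, "b": 1.0, "c": 1.0}
links = {("a", "b"): 5}
new = layout_step(pos, sizes, links)
```

In this example the linked pair a and b moves closer together, while the unlinked node c, which sits within the threshold distance of a, is pushed away from it. Repeating the step in a background thread gives the gradually settling layout the viewer renders.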

The combination of these forces results in a graph that attempts to properly represent the distance between individuals and their associates based on the strength of relationships. Neighbors are arranged in such a fashion as to ensure that the distance between two nodes gives a rough visual approximation of the strength of a relationship. This holds true roughly between immediate neighbors.

It is of note that the other reason to place this processing in a separate thread is its intense computational resource usage. If done in line with the viewer, this would force the program to wait between iterations to properly render the interface, potentially reducing usability. By placing such processing in a separate thread with lower priority than the interface, this problem is reduced.

Nodes within the view can be selected by clicking the mouse. The view itself can be dragged by holding down the left mouse button. Given the amount of data, it has also been practical to provide zoom capabilities. By using the page up and page down keys, the user can zoom in and out (respectively) to focus on certain relationships as well as observe the larger picture.

When a user clicks on a node, the node's address, weight, and links are displayed in the upper-right corner. The node itself is highlighted red. The display also shows the weight of each link, which is crucial for a user performing in-depth analysis of how two nodes relate.

Below the node information box on the right side of the interface is a depiction of the token storage tree. Each original token represents a clickable, expandable parent that has information on the token and any children underneath. Each child, a token holder, can be inspected to see who holds it. Clicking on either a token or a token holder in the list highlights all links and nodes involved with the token within the graph in bright red. The email associated with the token is displayed in the bottom text box.
This crucial feature allows a security agent to determine when and how suspicious activity occurred.

II INPUT/OUTPUT

Currently, the system acquires input through a fixed database of emails. This was drawn from the Enron dataset, released by the Federal Energy Regulatory Commission during its investigation of the company. It is maintained by researchers at Carnegie Mellon University. The text for a message is the original message, which conforms to the RFC 2822 standard. An example can be found in Figure 5.

The output of the system is ultimately a graph rendered by the viewer program. Data from the analysis system is completely human readable. Please see Figure 4 for an example graph.

Command: Description

addnode: A simple node addition procedure. Node information must include the address and initial weight. The address is assumed to be unique.

addlink: Creates a link between two nodes. A source and destination address must be specified, along with the weight of the link between them. As with the analysis program, links are considered shared memory between the nodes. A quick check is performed to ensure a node does not link to itself.

updnode: Updates the node's information. While the address remains constant, the weight of a node can change at any given moment from general traffic. The node must already be present.

updlink: Updates an existing link between nodes. The source and destination must be specified, along with the new weight.

addtoken: Adds a new token. Since tokens represent the suspicious activity of a system, this command requires an extensive amount of information, including the origin, identification number, embedded characteristics, and the message it was created from. This is stored as a top-level node in a token storage tree.

addtokenholder: Adds a new token holder. These represent the children of an original token, and include all of the same information plus the ID of the parent and origins. At the moment, this information is stored directly under the parent token in the token storage tree. In the future, this may become hierarchically stored data.

Figure 3. Command listing

Figure 4. The viewer program interface.

III Enhancements to the prototype

This system is still in the early stages of research and development. There are several areas that require additional improvement. However, there are three main aspects of the system which require the highest priority: flexible input

streaming, enhanced text processing, and real-time reinforcement of results.

The input of the system ultimately needs to come straight from a message captured by or given to an email server. This will require a parser that can sort through a message for the necessary fields, separate text from raw data (e.g., mail attachments), and determine how to proceed. To make this system applicable internationally, it will be ideal to take advantage of Java's Unicode support, but this will require heavy research in determining how the RFC standards dictate that multi-language email is exchanged.

Ultimately, to make this system more applicable to a variety of systems, the input methods must be made to accept text input from different sources. This includes other message passing networks such as instant messaging and chat rooms, as well as any data recognized by other systems (e.g., a speech-to-text converter monitoring a telephone conversation). The current level of abstraction needs additional work to make this possible.

Once we can accept a variety of input, it is also desirable to perform a variety of text-analysis techniques. Keywords must not be fixed; instead, they should be drawn from a dynamic list of word frequencies that change based on previous emails. The messages themselves should be subject to pattern matching techniques such as Singular Value Decomposition to ensure that even when unusual keywords are not used, the system can still operate effectively. Part of text analysis also includes enhancing the data tokens carry for use in context analysis.

Message-ID: < JavaMail.evans@thyme>
Date: Tue, 9 Oct :46: (PDT)
From: h..moore@enron.com
To: e..dickson@enron.com, legal <.taylor@enron.com>
Subject: ONEOK EOL Amendment
Cc: s..theriot@enron.com
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Bcc: s..theriot@enron.com
X-From: Moore, Janet H. </O=ENRON/OU=NA/CN=RECIPIENTS/CN=JHMOORE>
X-To: Dickson, Stacy E. </O=ENRON/OU=NA/CN=RECIPIENTS/CN=Sdickso>, Taylor, Mark E (Legal) </O=ENRON/OU=NA/CN=RECIPIENTS/CN=Mtaylo1>
X-cc: Theriot, Kim S. </O=ENRON/OU=NA/CN=RECIPIENTS/CN=Ktherio>
X-bcc:
X-Folder: \MTAYLO1 (Non-Privileged)\Taylor, Mark E (Legal)\Inbox
X-Origin: Taylor-M
X-FileName: MTAYLO1 (Non-Privileged).pst

Mark: I'm not sure whether this CP does financial trading. Regardless, I've included the highlighted language that I had circulated with the Idacorp draft; I think that you were mulling over whether that new language was acceptable. Please let us know your thoughts so that we can finalize the new rider. THANKS!

Stacy: Would you like to tackle ONEOK next? Also, please let me know your thoughts on the bold faced language.

Janet

Figure 5. Example RFC 2822 Input

The biggest improvement we can make is the integration of machine learning techniques to allow the system to be positively or negatively reinforced. This will require research on which algorithms are appropriate, as well as how they should be integrated without significant additional overhead. When properly implemented, such algorithms could enhance the system significantly. However, it is clear that analysis would be limited to word choice, as the structure of message body text can vary widely depending on the style chosen by authors.

IV Current Directions for the Research

Currently, the research on this topic is still in its early stages. There are several options to be explored that can significantly enhance the performance of this system. Once these improvements have been made, the next stage of development requires the use of data sets with known results. Finally, once the system has been adequately tested, it would ideally be deployed within a mock-up environment to observe real performance.

The existing method of message anomaly detection relies on a set of thirty fixed keywords. These have been selected based on their relevance to the current data source used for this project: the Enron database.
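The parsing step that the input enhancements call for (extracting the sender, recipients, subject, and plain-text body from an RFC 2822 message such as the one in Figure 5) can be sketched with Python's standard email package. This is an illustration under stated assumptions: the prototype described in the paper appears to be Java-based, the function name parse_message is hypothetical, and the sample message uses made-up example.com addresses rather than the Enron data.

```python
from email import message_from_string
from email.utils import getaddresses


def parse_message(raw):
    """Extract the fields the analysis system needs from raw RFC 2822 text."""
    msg = message_from_string(raw)

    senders = getaddresses(msg.get_all("From", []))
    sender = senders[0][1] if senders else ""

    # Merge To, Cc and Bcc, dropping duplicates (an address may appear in
    # more than one header, as the Cc/Bcc lines in Figure 5 show).
    recipients = []
    for field in ("To", "Cc", "Bcc"):
        for _name, addr in getaddresses(msg.get_all(field, [])):
            if addr and addr not in recipients:
                recipients.append(addr)

    # Keep only text; attachments and other raw data are skipped.
    if msg.is_multipart():
        parts = [p.get_payload() for p in msg.walk()
                 if p.get_content_type() == "text/plain"]
        body = "\n".join(parts)
    else:
        body = msg.get_payload()

    return {"from": sender, "to": recipients,
            "subject": msg.get("Subject", ""), "body": body}


# Hypothetical sample message, not taken from the Enron dataset.
RAW = (
    "From: alice@example.com\n"
    "To: bob@example.com, Carol <carol@example.com>\n"
    "Cc: dave@example.com\n"
    "Bcc: dave@example.com\n"
    "Subject: Amendment draft\n"
    "Content-Type: text/plain; charset=us-ascii\n"
    "\n"
    "Please send me your thoughts.\n"
)

parsed = parse_message(RAW)
print(parsed["from"], parsed["to"])
```

The resulting dictionary plays the role of the message object that the Delivery Agent hands to the Mailroom Agent. Decoding of non-ASCII transfer encodings is deliberately left out of this sketch.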
While far from ideal, this method has been sufficient in offering basic detection of e-mail concerning the demise of the company. In order to create a more flexible system that can operate successfully on a variety of datasets, one possible direction to explore would be a word frequency dictionary. By using an available compiled English text corpus, a set of words extracted from an e-mail can be given global word rankings in terms of frequency of use. This would create an adaptive word detection filter that can recognize when a traditionally unusual word becomes commonplace through widespread use, eliminating potential false positives. One problem with this direct approach is that there are a number of words that share the same meaning. For example,

the word "person" has approximately fifteen to twenty synonyms, depending on context. While a semantic engine is outside the scope of this system, it is straightforward in principle to use a simple thesaurus database and collapse multiple synonyms into a single common word. Such an addition would make word frequency data much more robust and accurate.

Since such a thesaurus dictionary may not be readily available, an alternative would be the integration of the classic Porter Stemming Algorithm. Published in 1980 by Martin Porter, the algorithm automatically removes the suffixes attached to words. One scenario involves the conjugation of a simple verb, "run", into its various forms: "runs", "running", "ran", etc. While the past tense "ran" is not properly identified, the rest of the variations are collapsed back into the word "run" itself. This algorithm alone could eliminate several redundancies in word frequency detection.

Even a perfect keyword recognition algorithm is not sufficient when individuals begin to encode the text of their messages. The use of encryption-breaking techniques is beyond the scope of this research. However, even assuming that data encryption is never used, there is another way to hide the true topic of a message: word swapping. Keyword analysis is virtually useless when an individual uses unrelated words in place of suspicious words. For example, using the word "corn" in place of "bomb" would create a new message that would appear to be innocuously concerned with food. One solution to this problem is the use of singular value decomposition (SVD). SVD is frequently used in data pattern matching, and research has shown that using it to analyze correlation among messages is highly effective [SKIL05]. In the word swapping scenario, a common pattern in word choice could potentially be detected, as long as the word choice deviates from established norms.
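Returning to the suffix-stripping idea described above, the following is a minimal sketch of how stemming could feed word frequency counts. It is a toy stand-in, not the Porter Stemming Algorithm itself: it handles only a plural "s" and the "ing" and "ed" suffixes, and the names `toy_stem` and `stem_frequencies` are hypothetical. As in the text, "runs" and "running" collapse to "run" while "ran" is left untouched.

```python
import re
from collections import Counter


def toy_stem(word: str) -> str:
    """Strip a few common suffixes. This is a simplified illustration,
    NOT the full Porter algorithm."""
    word = word.lower()
    for suffix in ("ing", "ed"):
        # Only strip when at least three letters remain ("sing" stays "sing").
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undouble a final consonant: "runn" -> "run".
            if stem[-1] == stem[-2] and stem[-1] not in "aeioulsz":
                stem = stem[:-1]
            return stem
    # Drop a plural "s" but keep words such as "pass".
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


def stem_frequencies(text: str) -> Counter:
    """Count how often each stem occurs in a block of text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(toy_stem(token) for token in tokens)


if __name__ == "__main__":
    print(stem_frequencies("He runs. She is running. They ran."))
```

A production system would more likely rely on a complete, tested stemmer; the sketch only shows where stemming would sit relative to the frequency count.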
The biggest hurdle in applying this research is adapting it to a real-time detection algorithm. We will discuss our approach to message surveillance in Section 3.

Given the widespread interest in spam filtering, there are undoubtedly a number of alternative anomaly detection techniques that can be used on text-based communication. Further research is necessary to determine what techniques are available, their effectiveness in detecting e-mail characteristics, and whether or not they would be a beneficial addition to the system. For example, the use of artificial intelligence to automatically identify keywords has been discussed in the work of Michael Pazzani [PAZZ00].

A second area of necessary research can be found in properly modeling temporal decay. Several questions must be answered: How do implied relationships decay over time? When does a token become invalid? How does time affect the weights of both nodes and links? What adjustments should be made to the lifespan of a token when it is used repeatedly? Clearly, much of this research may branch outside of traditional computer science areas.

The third area of further development can be found in analysis of the constructed social network. Several groups, such as the International Network for Social Network Analysis (INSNA), are dedicated to discussing methodologies for association data. As long as the system is building the network, it is logical that the machine representation of the communication network should be exploited for any potential gain in suspicious activity identification. One possible avenue of research is the identification of roles within a network. Within any given social setting, each individual often serves one or more roles in the communication infrastructure. Some may be hubs of information, always kept informed of situations and responsible for informing others. Others may act as brokers between two major parties [KREB05], serving as liaisons responsible for maintaining a channel of contact.
These roles impact how individuals interact, their purpose, and ultimately the way information disseminates through the network. Further research into identification of these roles can enhance anomaly detection by revealing when an individual deviates from the normal purpose he or she serves.

Groups of individuals, also known as social fields, often form automatically within social environments [PERI91-76]. Whether by a direct department assignment or simply acquaintance, each individual has a number of associates that he or she frequently communicates with. When a number of individuals share close mutual ties, a group often exists. Recognition of these groups represents a unique challenge in future research. Social groups represent a significant factor in knowledge distribution, and existing techniques for network analysis may prove useful. A promising method of analysis is graph clustering. By discovering correlation of nodes through analysis of shared links among tightly grouped users, groups can be discovered. The work of Krishnamurthy et al., although focusing on the field of internet topology, gives a number of insights on creating such clusters by discovering how two nodes are indirectly connected.

Yet another area in which further development is necessary is the construction of a control set to properly measure how effective this system is. The current Enron dataset, while useful, lacks enough information to be used in the formation of control data. Therefore, following the advice of a colleague, we believe it is necessary to create a new, smaller dataset that directly reflects a planned social network. The dataset will be populated with a number of automatically generated messages using common words, and several deliberate messages using unusual keywords will be inserted over time. The ideal results will be compared to

what the system generates, allowing the effectiveness to be measured. Additionally, an adequately constructed control set will be useful for determining how the system has changed once research delivers some of the previously mentioned improvements.

Given the claims that the proposed system could operate on virtually any type of text-based messaging system with static endpoints of communication, further research is necessary to determine how the system would accommodate different types of environments. For example, instant messaging networks, chat rooms, and even message boards represent other forms of communication this system must be prepared to monitor. Part of the research may need to include the varying social environments created by different mediums.

Finally, in the interest of proving the effectiveness of the system, it is essential that the system is integrated into an actual e-mail server. This is necessary to determine the validity of the assumptions made during construction, such as the uniqueness of the mail identification strings assigned by most servers. While the server would not be used for actual traffic, a test set would be fed through it to determine, overall, how effective this system would be in the interest of security.

As discussed in this section, there are many areas for research. We discuss one such area, automatic message detection. The system that we have designed and developed for this application is discussed in Section 3.

III. AUTOMATED MESSAGE DETECTION

A. Background on Automated Message Detection

The ultimate goal of our research is to construct a system that can determine the presence of anomalous or suspicious messages within the typical communication traffic discussed in Section 2 [LAYF05]. Our areas of research include social network analysis, text processing, and text filtering. This experiment will determine the effectiveness of singular value decomposition as a text processing technique in our system.
One particular approach we use in our work in message surveillance involves the recognition of social context. Given a message that was deemed suspicious by a detection technique, there is an increased probability that future messages bearing similar characteristics to the detected message warrant further suspicion. We track this through the generation and passing of tokens carrying these characteristics in a model derived from observed social interaction within the message-passing network. Currently, these tokens are triggered by the presence of two or more keywords from a list tailored to our dataset. By using the techniques described by David Skillicorn, we believe it is possible to enhance the token passing methods of our approach.

    Message-ID: < JavaMail.evans@thyme>
    Date: Wed, 13 Dec :01: (PST)
    From: rebecca.cantrell@enron.com
    To: phillip.allen@enron.com
    Subject: Re:
    Mime-Version: 1.0
    Content-Type: text/plain; charset=us-ascii
    Content-Transfer-Encoding: 7bit
    X-From: Rebecca W Cantrell
    X-To: Phillip K Allen
    X-cc:
    X-bcc:
    X-Folder: \Phillip_Allen_Dec2000\Notes Folders\All documents
    X-Origin: Allen-P
    X-FileName: pallen.nsf

    Phillip -- Is the value axis on Sheet 2 of the "socalprices" spread sheet supposed to be in $? If so, are they the right values (millions?) and where did they come from? I can't relate them to the Sheet 1 spread sheet.

Figure 6. An excerpt from the Enron dataset

The Enron dataset was originally made available by the Federal Energy Regulatory Commission during its investigation of the company. It was purchased by MIT for use in data analysis, and is currently available in both raw and compiled database forms from Carnegie Mellon University, courtesy of William Cohen. This dataset is useful to surveillance research as it is a freely available collection of e-mail that has been generated from a 'real world' scenario. After pruning and restructuring for consistency, the overall corpus is composed of 250,000 e-mails that span a five-year period.
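Messages in the header-plus-body layout shown in Figure 6 can be split into the fields the system needs (identifier, origin, recipients, date, body) with a standard parser. The sketch below uses Python's standard `email` package purely for illustration; the system described in this paper is not claimed to work this way. The sample message, its addresses, and the `parse_message` helper are hypothetical.

```python
from email import message_from_string
from email.utils import getaddresses, parsedate_to_datetime

# Hypothetical sample message; every value here is invented for illustration.
RAW = """\
Message-ID: <0001.example@mail.example.com>
Date: Mon, 1 Jan 2001 09:30:00 -0800
From: alice@example.com
To: bob@example.com, carol@example.com
Cc: dave@example.com
Subject: Re: prices
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii

Is the value axis on Sheet 2 supposed to be in dollars?
"""


def parse_message(raw: str) -> dict:
    """Extract the fields needed for social network construction."""
    msg = message_from_string(raw)
    # Collect every explicit recipient across the To, Cc, and Bcc headers.
    recipient_headers = (
        msg.get_all("To", []) + msg.get_all("Cc", []) + msg.get_all("Bcc", [])
    )
    return {
        "id": msg["Message-ID"],
        "sender": getaddresses([msg["From"]])[0][1],
        "recipients": [addr for _, addr in getaddresses(recipient_headers)],
        "sent": parsedate_to_datetime(msg["Date"]),
        "body": msg.get_payload(),
    }


if __name__ == "__main__":
    info = parse_message(RAW)
    print(info["sender"], "->", info["recipients"], "at", info["sent"])
```

Each (sender, recipient) pair extracted this way corresponds to one directed link in the communication network, and the identifier and date stamp allow messages to be ordered and de-duplicated.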
Each e-mail remains in its original raw ASCII-encoded text format, conforming to the RFC 2822 standard. The simplicity of this specification and the data it requires for proper mail transport assists with analysis in a number of ways. First, every e-mail must contain a unique identification number assigned by the originating server. Second, each message must have a well-defined origin and destination address, with the exception of any group or mass-mailing aliases. Third, every message has an explicit date stamp that reflects when it was originally sent. Finally, the body text of each e-mail is free of any escape sequences or special characters. With the exception of HTML-encoded e-mail, this greatly simplifies any preprocessing necessary for text parsing.

The content of the Enron dataset is its most useful aspect. We understand that, at some point in time, e-mails began

circulating that eventually led to the investigation of the entire corporation. This has been documented by countless news reports, committees, and even research groups. Coupled with the official announcement on January 9th, 2002 that the United States Department of Justice was beginning its official inquiry, we have definitive moments and individuals that can be scrutinized to determine the overall effectiveness of our experiment.

Singular value decomposition is an important tool in statistical analysis. Given an $m \times n$ matrix $A$ of real or complex numbers, we can decompose it into a series of component matrices: $U$, $\Sigma$, and $V^T$. $U$ represents the patterns among the objects. In this experiment, the objects are the messages. $V$ embodies the patterns among the attributes of each object; here, these are the patterns among the ranks of words within the messages. The matrix $\Sigma$ is a diagonal matrix conforming to the dimensions $n \times n$ that stores the singular values of $A$. Essentially, the most 'interesting' parts of the original matrix are made evident. [JIAW00]

Figure 7. Correlation of messages across the decomposition ($A = U \Sigma V^T$)

Once the SVD process has been carried out, the resulting matrices can be used for a variety of purposes. One particular way to use the results is the elimination of deviants from the patterns to filter out noise. We define noise among our messages as misspelled words, accidental insertion of punctuation, and anything else a user may accidentally insert within a message that cannot be removed automatically within preprocessing in a timely and effective manner. Other uses include object correlation, signal processing, and even text retrieval [WIKI05]. Most of the uses for SVD stem from the fact that the components, once modified, can be used to build a new matrix that contains only the most 'interesting' qualities of the original.

B. Design of Our System

Our existing system discussed in Section 2 has been outfitted with a new detection module based on the SVD method, as shown in Figure 8. The algorithm within the module requires a ranked word database, one or more messages that share the same token, and a set of messages that have originated from them representing recent traffic. A matrix is constructed based on the word content of the messages, which is decomposed according to the SVD method. A threshold is set for the noise, and the results are analyzed for word correlation. If a message in the recent traffic set has a strong correlation with one or more of the token-holding messages, the token is passed to all recipients of the message.

To determine the rank of a word, a running word rank database is kept. The 5,000 most common words within the Brown Corpus of Standard American English are used to seed the database. As words are extracted from messages, each word and the number of times it occurred within the message is either inserted into the database if the word is new or used to update the existing word rank.

Figure 8. Modified System Architecture

The matrix is constructed based on the number of messages involved and the words present. Each message represents a column, while each row represents a word. Thus, each cell represents the number of times a particular word occurs within a particular message. The columns are ordered from most common (left) to least common (right). Mathematically, this is represented as follows:

$$c_{ji} = \mathrm{count}(w_i, m_j), \qquad W = M_1 \cup M_2 \cup \dots \cup M_t, \qquad w_i \in W$$

where $W$ represents the set of all words that occur in the union of the word sets for each message $M$. The function count returns the number of times a particular word $w_i$ occurs within a message $m_j$. Note that $M_j$ is derived from the words in $m_j$; however, $m_j$ may contain the same word more than once while $M_j$ is a list that strictly represents all words only once.

Once the SVD technique has been used on the messages, the noise is removed through the use of a threshold of 0.25. After the singular value dimensions falling below this threshold are removed, the matrix is reconstructed from the components. The resulting matrix should reduce the amount of data that must be analyzed.

The correlation of any two messages is based on a total score. Each word that is found in both messages is given a rating according to the following equation:

$$\mathrm{score}(w_i) = \frac{\mathrm{count}(w_i, m_j) + \mathrm{count}(w_i, m_k)}{\mathrm{rank}(w_i)}$$

where $w_i$ is the word that occurs in messages $m_j$ and $m_k$. The count function returns the number of occurrences of the word within a supplied message. The rank function is a cumulative count of the occurrences of the word $w_i$ in all messages up to this point. This equation is designed to place emphasis on words that occur less frequently than others. It is assumed that rare words that occur in both messages are more likely to be good candidates for a topical match. Note that $\mathrm{score}(w_i)$ is naturally normalized to one: given that the rank of a word is adjusted before these messages are processed with this equation, the maximum score will be one. In order to determine if two messages match, the following equation is used:

$$\sum_{w_i \in W_j \cap W_k} \mathrm{score}(w_i) \geq \text{threshold}$$

where $W_j$ and $W_k$ are the sets of all words contained within $m_j$ and $m_k$, respectively, and the threshold is a predetermined value that sets the minimum total score for a match. In theory, this should be some value that is determined by statistical analysis of known correlating messages. For our experiment, we assume a fixed value for this threshold.

C. Enhancement to the Experiments

Once this experiment has been run successfully, an adequate infrastructure will exist to fully exploit the potential of singular value decomposition. After the noise has been eliminated, we could use the patterns to observe topical correlation between messages.
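The running word rank database and the message matching test from Section B can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the score of a shared word is taken to be the sum of its counts in the two messages divided by its cumulative rank (a reading chosen because it keeps every score at or below one, as the text requires), and the seed counts, sample messages, and threshold values are hypothetical. The SVD noise-removal step is omitted.

```python
import re
from collections import Counter


def tokenize(text: str) -> list:
    """Lower-case a message and split it into words."""
    return re.findall(r"[a-z']+", text.lower())


class WordRanks:
    """Running word rank database: cumulative occurrence counts."""

    def __init__(self, seed=None):
        # The paper seeds from a corpus word list; any mapping works here.
        self.rank = Counter(seed or {})

    def update(self, message: str) -> None:
        # New words are inserted, known words have their rank increased.
        self.rank.update(tokenize(message))


def score(word, counts_j, counts_k, ranks: WordRanks) -> float:
    """Assumed form: (count in m_j + count in m_k) / rank."""
    return (counts_j[word] + counts_k[word]) / ranks.rank[word]


def total_score(msg_j: str, msg_k: str, ranks: WordRanks) -> float:
    """Sum the scores of every word the two messages share."""
    counts_j, counts_k = Counter(tokenize(msg_j)), Counter(tokenize(msg_k))
    shared = counts_j.keys() & counts_k.keys()
    return sum(score(w, counts_j, counts_k, ranks) for w in shared)


def messages_match(msg_j, msg_k, ranks, threshold) -> bool:
    return total_score(msg_j, msg_k, ranks) >= threshold


if __name__ == "__main__":
    ranks = WordRanks(seed={"the": 1000, "meeting": 50})  # hypothetical seed
    a = "the shipment of corn arrives at the meeting"
    b = "bring the corn to the meeting"
    # Ranks must be updated with both messages before scoring them.
    ranks.update(a)
    ranks.update(b)
    print(total_score(a, b, ranks), messages_match(a, b, ranks, threshold=0.5))
```

In this toy run the rare shared word "corn" dominates the total, while the very common word "the" contributes almost nothing, which is the emphasis on infrequent words that the scoring equation is meant to provide.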
Given that our overall system includes social network analysis, it may become possible to identify suspicious messages as those that deviate from the patterns, without the use of suspicious keywords.

Another use for this approach is bridging potential gaps in our analysis caused by users who turn to another form of communication (e.g., the telephone). The pattern analysis techniques could allow us to determine when messages exchanged independently are intrinsically linked due to other social connections. Should there be a significant clustering among these conversations, it may even be possible to derive the presence of social groups. [5, 6]

There are a number of other methods that decompose a matrix into components. One such technique is Principal Component Analysis, which yields similar results to the SVD by attempting to capture the majority of variability in the data. If we construct our code in an abstract fashion, we will be able to apply different approaches to the same matrix to enhance the overall effectiveness of our system. [JIAW00]

Beyond text processing, we are currently investigating the use of matrix decomposition as applied to graph theory. Given a matrix composed of nodes in a graph and the values which represent the strength of the unidirectional links between them, other communication patterns could be discovered.

In the interest of furthering the applications of social network analysis, we are looking into tracking coalition information sharing and the game theory involved. Given the interest in assuring that vital information shared is never disclosed to the enemy, it is crucial to understand how parts of a coalition interact and whom they may be associated with. A social network can theoretically be constructed that represents the organizations and the ties between them.
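The node-by-node matrix of unidirectional link strengths mentioned above could be assembled from observed traffic before any decomposition is applied. The sketch below assumes, purely for illustration, that link strength is the number of messages sent from one node to another; the function name `link_strength_matrix` and the sample traffic are hypothetical.

```python
def link_strength_matrix(traffic):
    """Build a directed, weighted adjacency matrix.

    traffic: iterable of (sender, recipient) pairs, one per delivered message.
    Returns the sorted node list and a square matrix where entry [i][j]
    counts messages sent from node i to node j (an assumed strength measure).
    """
    pairs = list(traffic)
    nodes = sorted({node for pair in pairs for node in pair})
    index = {node: i for i, node in enumerate(nodes)}
    matrix = [[0] * len(nodes) for _ in nodes]
    for sender, recipient in pairs:
        matrix[index[sender]][index[recipient]] += 1
    return nodes, matrix


if __name__ == "__main__":
    nodes, matrix = link_strength_matrix(
        [("a", "b"), ("a", "b"), ("b", "c")]
    )
    for name, row in zip(nodes, matrix):
        print(name, row)
```

Because the links are unidirectional, the matrix is generally not symmetric, which is one reason a decomposition such as SVD, rather than a method restricted to symmetric matrices, is a natural candidate here.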
Applications of our existing techniques could automate the construction of such a model, potentially benefiting the use of game theory that deals with information exchange. Section 3 will be devoted to a discussion of the application of social networks to coalition data sharing.

D. Security and Privacy Considerations

Two of the applications of social network analysis that are relevant to counter-terrorism are automated message surveillance and information sharing across coalitions. In the case of automated message surveillance, the idea is to monitor the messages for suspicious words as well as determine the content of communication within groups. In the case of data sharing across coalitions, the idea is to analyze the behavior of an organization and determine how to extract information from that organization. In this section we discuss security and privacy considerations as relevant to both applications.

The idea behind social network analysis is to analyze a social network, say N. N describes the communication and activities of a group G. The analysis could be to determine the suspicious words used in messages, as we have discussed in Section 2.3, or to determine suspicious activities such as the travel patterns of the members of the group G. It is the goal of G to ensure the privacy of the activities of its members as well as to ensure that only authorized individuals may access the activities of the members of G. With respect to privacy, G may enforce privacy policies. For example, anyone who accesses information about G has to comply with the policies enforced by G. If A and B, who are members of G, have some association between them and if that