Design and Implementation of a Secure Social Network System


Ryan Layfield, Bhavani Thuraisingham, Latifur Khan, Murat Kantarcioglu, Jyothsna Rachapalli
The University of Texas at Dallas

Abstract

Context-based anomaly tracking represents a new approach to enhancing the security of communication streams. By creating a system that develops an understanding of normal and abnormal behavior based on communication history, it is possible to detect fluctuations in an evolving social network. Although more research is necessary to overcome current obstacles, the combination of social network analysis and anomaly detection techniques yields a promising set of applications for enhancing communication security. In this paper we describe a system for context-based anomaly detection and then describe experiments for a message surveillance application.

I. INTRODUCTION

Social networks are essentially networks formed by individuals, groups and organizations. Social network analysis is about analyzing the behaviors of individuals, groups and organizations and determining their behavior patterns. Social network analysis is becoming an important tool for counterterrorism applications. For example, with social network analysis one can perhaps determine whether individuals, groups or organizations are involved in terrorist activities. An example of a social network is illustrated in Figure 1.

Monitoring a continuous stream of data in the interest of security is not a trivial problem. In order to properly classify a single message as normal or suspicious, one must parse the contents, determine the origin, identify the recipients, and determine how prior communication traffic affects the context. Whether or not a message is suspect, it can theoretically affect the semantic meaning of future communications. This implies that some method of storing prior messages is necessary for a perfect detection system.
While tools exist for message classification with a variety of intents, there are no known systems that establish a localized context for each node in the interest of security. While individuals being monitored may share a similar context in a common environment, there are several scenarios in which the same message passed by two different users does not have the same meaning. Hence, there is a need for a system that personalizes context data to properly ascertain security threats.

One of the biggest challenges in automated message surveillance is the recognition of messages containing suspicious content. A classic approach to this problem is constructing a set of keywords (e.g., "bomb", "nuclear"). In the event that a communiqué contains one or more of these words, the message is flagged as suspicious for further review. However, there are two drawbacks to this approach. First, it is reasonable to assume that such relatively static keywords will not always be present in messages that would otherwise warrant suspicion. Second, there is little guarantee that a sufficiently intelligent individual will not recognize that such surveillance is in place and use substitute words in place of known keywords.

David Skillicorn of Queen's University has suggested a different approach in his work on the Enron dataset [SKIL05]. In his work, he outlines a method for using singular value decomposition (SVD) to recognize trends in such topics as email and social networks. We believe that this work can be expanded upon by extrapolating the techniques he used and applying them to a real-time message monitoring system. We wish to create algorithms that are designed to handle streaming data and that make use of the techniques outlined in Skillicorn's work. Using the Enron dataset, we will combine our existing threat identification techniques with singular value decomposition to discover correlations in message content.
Through word frequency analysis, we can rate the similarity of two separate data streams that may or may not deal with the same topic. Such an algorithm would allow us to recognize subjects, trends, and even conversations.

The organization of this paper is as follows. In section 2 we discuss the design of our social network system and context-based anomaly tracking. In section 3 we discuss social network analysis for message detection. We provide some background information on the techniques utilized and discuss our system. We also discuss security and privacy considerations. The paper is concluded in the final section.

IEEE ISI 2009, June 8-11, 2009, Richardson, TX, USA

One or more messages passed from one node to another form a basic unidirectional link. Over time, as more messages are passed, it becomes possible to determine common lines of communication among people. By weighting links based on message frequency and whether or not replies are given in a timely manner, the strength of social ties between individuals can be estimated.

Figure 1. A Typical Social Network

II. INTEGRATING SOCIAL NETWORK ANALYSIS WITH CONTEXT-BASED ANOMALY TRACKING

A. Background

Since its inception, email has become an increasingly popular form of communication. According to a study performed by research group IDC in 2002, email traffic was projected to increase from 31 billion messages a day to 60 billion by 2006 [MINI05]. As of the second quarter of 2005, there are roughly 900 million known users of the internet (Global Reach). Assuming that mail traffic increases linearly, each user receives an average of 60 emails a day. Manually sifting through the traffic of a group of a hundred people would require an individual to read 6,000 emails a day. Clearly, automated methods are necessary to deal with such increasing volumes of data.

The combination of text analysis and link mining concepts is not itself a new avenue of research. The work of Ben-Dov et al. demonstrates the ability to enhance link mining of news sites by using available tools to semantically comprehend the contents of a document. One experiment performed by the group successfully discovered correlations between two individuals based simply on their presence within the same sentence. Successful examples can also be found in the field of semantic web analysis [HORR03].

B. Our Approach

Overview

The system we propose is an active monitoring agent that resides at a major message communication hub. Each message that passes through the hub is deconstructed to acquire basic information, such as source and destination addresses. These in turn are used in the construction of an evolving graph of communication patterns.
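To make the evolving graph concrete, the following is a minimal Python sketch of how source and destination addresses could be folded into user nodes and weighted links. The class and method names (`Link`, `CommunicationGraph`, `observe`) are hypothetical and are not taken from the prototype. The sketch weights a link only by message count and leaves out the reply-timeliness factor mentioned in the text.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Link:
    """Communication link between two addresses, shared by both endpoints."""
    # sender address -> number of messages that sender passed over this link
    counts: dict = field(default_factory=lambda: defaultdict(int))

    @property
    def weight(self) -> int:
        # Strength depends on the messages passed in either direction.
        return sum(self.counts.values())

    @property
    def bidirectional(self) -> bool:
        # The link becomes bi-directional once both endpoints have sent a message.
        return len(self.counts) == 2


class CommunicationGraph:
    """Hypothetical container for user nodes and the links between them."""

    def __init__(self):
        self.nodes = set()
        self.links = {}  # frozenset({a, b}) -> Link

    def observe(self, source, destinations):
        """Update the graph from one message's source and destination addresses."""
        self.nodes.add(source)
        for dest in destinations:
            if dest == source:
                continue  # a node never links to itself
            self.nodes.add(dest)
            link = self.links.setdefault(frozenset((source, dest)), Link())
            link.counts[source] += 1

    def link(self, a, b):
        """Return the shared link between a and b, or None if they never communicated."""
        return self.links.get(frozenset((a, b)))
```

Keying each link by an unordered pair of addresses means both endpoints see the same object, so a reply strengthens the existing link instead of creating a second one.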
Email anomaly detection is not a new concept. One existing topic of interest is the identification and filtering of spam. Using a set of desirable message attributes, a spam removal system is responsible for removing all unwanted email from a user's inbox. This frequently includes advertisements, fraudulent topics, general bulk mail, and any other messages that do not appear relevant. Ideally, this results in a set of messages consisting only of what the user desires [GOLD92-61]. While not always providing enhanced security directly, spam filtering represents a well-defined area to build from.

The monitoring of email and other point-to-point contact services can be used to build a relationship-oriented web. Instead of simply looking at each message as an isolated event, such a web allows the complex relationships between individuals to be mapped and further analyzed. Formally, this approach is known as constructing a social network. Within a social network, each individual represents a node.

Figure 2. Original System Architecture

The fundamental properties of this design can be found in how the elements within a monitored group are represented. First, each individual that uses the hub is kept as a user node. As in social networking, each node represents an endpoint of communication. Basic contact information is kept at the node to identify when future messages are arriving at or departing from the node itself. In the

case of email, the address is all the contact information necessary to uniquely identify the user. It is assumed that these identifiers do not change over time.

Each message passed represents a conversational link between two users. The direction of the link is determined by the source and destinations of the message. In the event a message is passed back, the link automatically becomes bi-directional. The strength of a link is dependent on the number of messages passed in either direction. The attributes of the message itself are stored within the link. This allows the system to retrieve historical data between two nodes without needing to go to each node, find the relayed messages between them, etc. This runs counter to how messages are normally stored within most message services, but it is necessary to form context. Since a single message can be sent to multiple parties, a single instance of the message is often shared by multiple links.

created and passed along to each individual that received the information. In turn, should any of the recipients disseminate the classified information to other recipients, additional child tokens will be created. Ultimately, a localized web of suspicious parties is created and tracked for future investigations.

When the system is deployed, it is the responsibility of an agent observing the results to take action. Ideally, the agent will be a human responsible for the security of the group being monitored. The system itself is only an observational tool. This approach was chosen to maximize potential uses for the system. For example, the responses chosen by an intelligence organization would vary widely from those of an internet service provider.

Analysis

In this section we discuss the strengths and weaknesses of our approach. Future directions will be discussed in Section 5.

In the event the message passed has unusual properties, the anomalous characteristics are noted and recorded within a unique token.
Attributes of a message ideally include unusual keywords, communication pattern deviations, and any other clues that may be necessary to identify future messages with similar traits. Other attributes of the token include an atomic identifier and a pointer to the originating token, if any. The token itself is considered part of the established context, and it is stored within node endpoints.

Currently, the only characteristic of a message that the system tracks is a fixed set of keywords within the body of an email. While marginally effective in generating reasonable result data, the technique is far from sufficient. The future plans section of this document describes the techniques that will eventually be implemented.

--Strengths

In theory, the deployment of this system offers a great many benefits. First, all analysis is performed in real time. This means that, once deployed, the system is actively monitoring the available text stream for any and all communication activity. In the event that a malicious situation is identified, the observer of the system can either respond immediately or await further messages to decide whether a security issue exists.

Second, the system indirectly models the complex social interactions of individuals. Hence, as messages are passed, it is possible to identify groups of people with malicious intent and how they collaborate. This is especially crucial to the recognition of social sub-networks, in which normal keyword testing could be insufficient for identifying individuals with malicious intent.

The reason a node, rather than a link, is responsible for storing tokens is that messages convey information that the user records for future reference when communicating with other users. These tokens represent unusual information that can propagate through a network.
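The token mechanism can be sketched in a few lines of Python. This is one possible reading of the description, not the prototype's code: the names `Token`, `Node`, and `send` are hypothetical, the only tracked characteristic is a keyword set, and a message is assumed to match a held token when it shares at least one keyword with it.

```python
import itertools
from dataclasses import dataclass
from typing import Optional

_next_id = itertools.count(1)  # source of atomic token identifiers


@dataclass(frozen=True)
class Token:
    token_id: int
    keywords: frozenset               # unusual characteristics of the message
    holder: str                       # address of the node storing this token
    parent: Optional["Token"] = None  # pointer to the originating token, if any

    def origin(self) -> "Token":
        """Follow parent pointers back to the original token."""
        token = self
        while token.parent is not None:
            token = token.parent
        return token


class Node:
    def __init__(self, address):
        self.address = address
        self.tokens = []  # local context used to judge future messages


def send(sender, recipients, body, watch):
    """Pass a message and return any tokens created along the way."""
    words = set(body.lower().split())
    created = []
    # Does the message match a token already held at the originating node?
    parent = next((t for t in sender.tokens if t.keywords & words), None)
    if parent is None:
        hits = frozenset(watch & words)
        if not hits:
            return created  # nothing unusual, so no token is created
        parent = Token(next(_next_id), hits, sender.address)
        sender.tokens.append(parent)
        created.append(parent)
    for recipient in recipients:
        # Each recipient receives a child that links back to the token it came from.
        child = Token(next(_next_id), parent.keywords, recipient.address, parent)
        recipient.tokens.append(child)
        created.append(child)
    return created
```

Calling `origin()` on any token held by a node walks the semantic trail back to the node where the unusual content first appeared.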
When new messages are passed, a check is performed to determine whether or not tokens exist at the originating node that match the attributes of the message itself. If a match is found, a child token is created and passed along with the message. This child contains a link back to the original, creating a semantic trail that can be traced through a network.

Such a trace can be useful in several security scenarios. For example, consider an intelligence agency concerned that there have been leaks of information within the organization to the media. A security manager using this system could begin by flagging certain keywords found only in a top-secret report recently given to suspects. In the event that messages sent from these suspects begin to use these keywords, a context token is

For example, consider the deployment of this system in the interest of catching a group of criminals involved in smuggling stolen works of art across international borders. Assume that they are using a message passing network to remain in constant contact, along with a multitude of innocent people. Using keywords involving the stolen works, simple text filtering could create a number of false positives from people simply discussing the crimes mentioned in the news. By overlaying detected keyword uses with social network graphs, we could detect a group of individuals using these words among themselves. Once the group is properly identified, the entire set of connected individuals could be captured and questioned.

Extending this scenario, should any of these individuals be held responsible, the system has already generated a set of conversations shared among the guilty parties. These exchanges could easily translate into an evidence exhibit to be used during prosecution. While such an exhibit could certainly be built with existing text mining tools, the convenience offered by

the availability of this data is an invaluable tool in situations where time is a factor.

--Weaknesses

Unfortunately, as promising as such a system may be, its proper operation depends on a number of factors. First, the system requires a roughly omniscient view of communication among individuals. For example, it assumes that users of an email server will not use any other server to communicate, nor any other form of communication that falls outside the bounds of what can be observed. Given that groups of individuals will likely communicate in person at some point, one or more semantic gaps could be created. Such gaps would prohibit token passing among nodes, as well as create inaccuracies within the perceived social network, reducing the overall effectiveness of the system.

Second, there are serious ethical implications for a system with such far-reaching observational capability. Regardless of whether or not individuals are engaging in suspicious activity, social models are being created for future reference. Essentially, the data generated can be used to identify how close two individuals are, what they have been talking about, the common points of contact among them, etc. If an individual uses the monitored text stream exclusively for communication, a fairly accurate model of their relationships can be generated.

To fully understand how such data can be used against an individual, consider an employer with access to a system that has been observing an individual applying for a position. During the evaluation process, the employer could analyze the social net around the applicant and determine the people they are closest to. These individuals could then be contacted and asked a series of questions about the applicant, their habits, prior employment history, etc.
While the employer would benefit greatly from having such data, the potential employee would undoubtedly feel their private life had been violated.

Another weakness of the system is the lack of training methods to teach it when certain messages are false positives or false negatives. Although it is assumed that the observing agent can distinguish between results, it is much more convenient to filter out the noise in order to focus on issues that require more attention. Additionally, given the system's use of tokens, a large number of false contexts could be created that would cause complications for the entire social network. In theory, the impact of false tokens could be eliminated by giving an agent the option to delete specific tokens, but this is only a temporary solution.

Regardless of how effective the system is, the ultimate weakness it faces is how much data must be stored. Traditionally, in most message passing networks, messages are stored at the user's terminal, removing the burden from the server. However, in order for the system to properly determine previous context, all messages passed must be stored in an archive after processing. Coupled with the storage of links among users and the presence of tokens, it is possible that the data requirements of the system could multiply exponentially as more users join a network and average traffic flow increases.

As of the creation of this document, there are no known systems implemented with these characteristics. Presumably, such systems may fall under classified government security methods. The fabled Echelon system, for example, is rumored to have similar capabilities. However, the lack of documentation to support this claim leads us to believe that this system simply represents a relatively unexplored area of research.
Prototype Implementation

I Design of the System

The objective of this system is to combine social network analysis with text anomaly detection to enhance detection of unusual and undesirable activity. By combining these techniques and applying them to a continuous stream of messages, we believe it is possible to build a more secure communication system by identifying unusual behavior in normal channels of contact. Ultimately, this system would ideally be deployed on either a corporate or public network, provided that adequate authority exists.

There are two primary parts necessary for the successful operation of this system: the organizational analysis system and the real-time results viewer. The former is responsible for building an understanding of the system from the text stream being observed. The latter translates the output of the former into a graphical representation viewable by a human security agent. This separation is necessary due to the intense amount of processing each part requires to carry out its responsibilities in real time.

The architecture of the system is largely decentralized and heavily object-oriented. The major aspects of functionality and purpose are encapsulated in appropriately named objects in the interest of keeping data and state information categorized appropriately. However, the system is currently tightly coupled, as many objects are heavily dependent on the functionality of others, often in both directions.

Organizational Analysis System

This subsystem is responsible for the actual processing required to parse, analyze, and derive information from a text stream. Hence, it is broken down into three main pieces: the Delivery Agent, Detection Agent, and Mailroom Agent, each responsible for one of these three tasks, respectively. Basic information is represented as a series of user nodes and

conversational links, while tokens represent anomalous behavior.

The Delivery Agent object represents the entry point for messages into the system. In the current design, it is a passive agent that, on each iteration, reads in another message from the stream. It then parses that message for its origin, destination, content, etc. Ultimately, a message object is created to embody this email. No tokens are created at this point, as the agent does not keep track of context internally. This message is then passed to the Mailroom Agent. Neither messages nor emails are kept on record here.

When a message first arrives, the Mailroom Agent queries the Detection Agent to determine if any suspicious activity is present. While most of the functionality of this agent is not fully implemented, it is responsible for maintaining data that determines the frequency of words, which words are unusual, and whether activity is normal or abnormal. Tokens are generated and attached to the message if necessary.

The Mailroom Agent itself is responsible for keeping track of users in the system and delivering message objects to recipients. It maintains a virtual address book of identified users based on the endpoint communication identifier presented in the message itself. A copy of the message is also given to the User Node identified as the origin.

The actual data of the social network and the propagation of messages is kept in the form of User Nodes and the conversational links between them. A User Node keeps track of the address of the user and represents an endpoint of communication. It also keeps a local list of tokens, which serve as the context for identifying future suspicious behavior. Nodes are created automatically as new addresses are encountered.

A Conversational Link represents a link between two nodes. Each link object keeps track of the messages between two nodes. The frequency and volume of activity represent the weight of this link. A higher number represents a stronger bond between two individuals.
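As an illustration of the Delivery Agent and Mailroom Agent roles described in this subsection, the following Python sketch parses one raw message and hands it to an address book. It relies on the standard library `email` package. The names `parse_message` and `Mailroom` and the dictionary layout of the message object are assumptions made for this example, and the Detection Agent query is omitted.

```python
from email import message_from_string
from email.utils import getaddresses


def parse_message(raw):
    """Delivery Agent step: extract origin, destinations, and content from one message."""
    msg = message_from_string(raw)
    origin = getaddresses(msg.get_all("From", []))
    recipients = getaddresses(
        msg.get_all("To", []) + msg.get_all("Cc", []) + msg.get_all("Bcc", [])
    )
    return {
        "origin": origin[0][1] if origin else "",
        # De-duplicate addresses that appear in more than one header.
        "recipients": sorted({addr for _, addr in recipients if addr}),
        "subject": msg.get("Subject", ""),
        "body": msg.get_payload(),
    }


class Mailroom:
    """Mailroom Agent step: keep an address book and deliver message objects."""

    def __init__(self):
        self.address_book = {}  # address -> messages seen by that user node

    def deliver(self, message):
        # The origin receives a copy as well; nodes are created on first sight.
        for address in [message["origin"], *message["recipients"]]:
            self.address_book.setdefault(address, []).append(message)
```

The parser keeps no state of its own, which mirrors the statement that the Delivery Agent does not track context internally.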
For space reasons and consistency, link objects are automatically shared between two nodes, reducing overhead and increasing the efficiency of the system.

When a message arrives at a User Node, the node is conditionally updated. If the message originated from the node, a link is either created or updated between it and the receiving nodes. If the node is a recipient of the message, it is assumed that the links have already been taken care of. Regardless, the tokens present in the message are integrated with any existing tokens at the node. The weight of the node is also updated to reflect traffic.

The token object is the key to the way the system keeps track of security. Each token embodies the unusual characteristics of a message. If a message exhibits behavior not previously recorded, a new token is created. If an existing token reflects the behavior adequately, a child of that token is created. Since each token is aware of its parents and origin, a series of links can be established in the form of a web that traces the propagation of undesirable activity. This information is ultimately used to determine which messages should be reviewed and when activity warrants investigation.

Suspicious token propagation is passed to an Action Log object. Each time a node passes an unusual message, it is recorded in the log file for future review. Each entry contains the origin, destination, and identification of the message in question, together with the token involved. The results are then fed into the viewer application.

Real-Time Results Viewer

The viewer application, while not actually performing any security-related processing, plays a crucial role in translating raw data into human-readable results. Essentially, it accepts input from the analysis system, rebuilds a minimal representation of noted users and links, and displays them graphically.
The interface ultimately permits human security agents to determine whether or not the results warrant further investigation.

The data arrives in the form of a series of lines of text. Each line carries a command along with the parameters (as comma-separated values) necessary to carry it out. There are six basic commands that the system recognizes, as shown in Figure 3.

Notice that no coordinate data is included with the commands. This is because the analysis system does not actually create a graphical representation of the data. As nodes arrive, they are initially given random coordinates. However, there is a separate thread that runs independently of the basic viewing program. This thread is responsible for arranging the nodes into a much more aesthetically pleasing and human-readable format by enforcing a spring layout. The resulting nodes are rendered based on this changing data.

The spring layout is a relatively straightforward way to arrange nodes. Each node represents a ball of mass within an environment of inverse gravity. This is done to ensure nodes do not overlap with each other and obscure crucial data. Essentially, if a node is within a threshold distance of another, they mutually repulse each other with a force proportional to their respective sizes. However, if this alone were enforced, the nodes would simply space themselves out, leaving what would likely be a tangled mess of nodes and links. Therefore, we treat the links between nodes as simulated springs, drawing the nodes together. The strength of the spring is determined by the weight of the link: stronger links bring nodes closer together, while weaker links attempt to keep nodes at a minimum distance.
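A single iteration of such a layout can be sketched in Python as follows. The constants, the function name `layout_step`, and the exact force laws are assumptions chosen for illustration: repulsion is taken as proportional to the product of the two node sizes, and each link acts as a linear spring whose rest length shrinks as the link weight grows.

```python
import math


def layout_step(pos, sizes, links, threshold=2.0, repulse=0.05,
                stiffness=0.01, base_length=1.0):
    """Return new positions after one iteration.

    pos:   name -> (x, y)
    sizes: name -> node size
    links: (name_a, name_b) -> link weight
    """
    force = {name: [0.0, 0.0] for name in pos}

    def push(a, b, magnitude):
        # Positive magnitude pulls a and b together, negative pushes them apart.
        dx = pos[b][0] - pos[a][0]
        dy = pos[b][1] - pos[a][1]
        dist = math.hypot(dx, dy) or 1e-9  # avoid division by zero
        fx = magnitude * dx / dist
        fy = magnitude * dy / dist
        force[a][0] += fx
        force[a][1] += fy
        force[b][0] -= fx
        force[b][1] -= fy

    # Repulsion: only nodes within the threshold distance push each other away.
    names = list(pos)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if math.dist(pos[a], pos[b]) < threshold:
                push(a, b, -repulse * sizes[a] * sizes[b])

    # Springs: heavier links have a shorter rest length and draw nodes closer.
    for (a, b), weight in links.items():
        rest = base_length / max(weight, 1)
        push(a, b, stiffness * (math.dist(pos[a], pos[b]) - rest))

    return {n: (pos[n][0] + force[n][0], pos[n][1] + force[n][1]) for n in pos}
```

Running this step repeatedly in a background loop, and redrawing from the latest positions, corresponds to the separate layout thread described in the text.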

The combination of these forces results in a graph that attempts to represent the distance between individuals and their associates based on the strength of their relationships. Neighbors are arranged so that the distance between two nodes gives a rough visual approximation of the strength of a relationship. This holds true roughly between immediate neighbors.

The other reason to place this processing in a separate thread is its intense use of computational resources. If done in line with the viewer, this would force the program to wait between iterations to properly render the interface, potentially reducing usability. By placing such processing in a separate thread with lower priority than the interface, this problem is reduced.

Nodes within the view can be selected by clicking the mouse. The view itself can be dragged by holding down the left mouse button. Given the amount of data, it has also been practical to provide zoom capabilities. By using the page up and page down keys, the user can zoom in and out (respectively) to focus on certain relationships as well as observe the larger picture.

When a user clicks on a node, the node's address, weight, and links are displayed in the upper-right corner. The node itself is highlighted in red. This is a crucial function, as it also displays the weight of each link, which a user needs in order to perform in-depth analysis of how two nodes relate.

Below the node information box on the right side of the interface is a depiction of the token storage tree. Each original token is a clickable, expandable parent that has information on the token and any children underneath. Each child, a token holder, can be inspected to see who holds it. Clicking on either a token or a token holder in the list highlights all links and nodes involved with the token within the graph in bright red. The email associated with the token is displayed in the bottom text box.
This crucial feature allows a security agent to determine when and how suspicious activity occurred.

II INPUT/OUTPUT

Currently, the system acquires input through a fixed database of email. This was drawn from the Enron dataset, released by the Federal Energy Regulatory Commission during its investigation of the company. It is maintained by researchers at Carnegie Mellon University. The text for a message is the original email, which conforms to the RFC 2822 standard. An example can be found in Figure 5.

Figure 3. Command listing

Command: Description
addnode: A simple node addition procedure. Node information must include the address and initial weight. The address is assumed to be unique.
addlink: Creates a link between two nodes. A source and destination address must be specified, along with the weight of the link between them. As with the analysis program, links are considered shared memory between the nodes. A quick check is performed to ensure a node does not link to itself.
updnode: Updates the node's information. While the address remains constant, the weight of a node can change at any given moment from general traffic. The node must already be present.
updlink: Updates an existing link between nodes. The source and destination must be specified, along with the new weight.
addtoken: Adds a new token. Since tokens represent the suspicious activity of a system, this command requires an extensive amount of information, including the origin, identification number, embedded characteristics, and the message it was created from. This is stored as a top-level node in a token storage tree.
addtokenholder: Adds a new token holder. These represent the children of an original token, and include all of the same information plus the ID of the parent and origins. At the moment, this information is stored directly under the parent token in the token storage tree. In the future, this may become hierarchically stored data.

The output of the system is ultimately a graph rendered by the viewer program. Data from the analysis system is completely human readable. Please see Figure 4 for an example graph.

Figure 4. The viewer program interface.

III Enhancements to the prototype

This system is still in the early stages of research and development. There are several areas that require additional improvement. However, there are three main aspects of the system which require the highest priority: flexible input

streaming, enhanced text processing, and real-time reinforcement of results.

The input of the system ultimately needs to come straight from a message captured by or given to an email server. This will require a parser that can sort through a message for the necessary fields, separate text from raw data (e.g., mail attachments), and determine how to proceed. To make this system applicable internationally, it will be ideal to take advantage of Java's Unicode support, but this will require heavy research into how the RFC standards dictate that multi-language email is exchanged.

Ultimately, to make this system more applicable to a variety of systems, the input methods must be made to accept text input from different sources. This includes other message passing networks such as instant messaging and chat rooms, as well as any data recognized by other systems (e.g., a speech-to-text converter monitoring a telephone conversation). The current level of abstraction needs additional work to make this possible.

Once we can accept a variety of input, it is also desirable to perform a variety of text-analysis techniques. Keywords must not be fixed; instead, they should be drawn from a dynamic list of word frequencies that change based on previous emails. The messages themselves should be subject to pattern matching techniques such as Singular Value Decomposition to ensure that, even when unusual keywords are not used, the system can still operate effectively. Part of text analysis also includes enhancing the data tokens carry for use in context analysis.

    Message-ID: < JavaMail.evans@thyme>
    Date: Tue, 9 Oct :46: (PDT)
    From: h..moore@enron.com
    To: e..dickson@enron.com, legal <.taylor@enron.com>
    Subject: ONEOK EOL Amendment
    Cc: s..theriot@enron.com
    Mime-Version: 1.0
    Content-Type: text/plain; charset=us-ascii
    Content-Transfer-Encoding: 7bit
    Bcc: s..theriot@enron.com
    X-From: Moore, Janet H. </O=ENRON/OU=NA/CN=RECIPIENTS/CN=JHMOORE>
    X-To: Dickson, Stacy E. </O=ENRON/OU=NA/CN=RECIPIENTS/CN=Sdickso>, Taylor, Mark E (Legal) </O=ENRON/OU=NA/CN=RECIPIENTS/CN=Mtaylo1>
    X-cc: Theriot, Kim S. </O=ENRON/OU=NA/CN=RECIPIENTS/CN=Ktherio>
    X-bcc:
    X-Folder: \MTAYLO1 (Non-Privileged)\Taylor, Mark E (Legal)\Inbox
    X-Origin: Taylor-M
    X-FileName: MTAYLO1 (Non-Privileged).pst

    Mark: I'm not sure whether this CP does financial trading. Regardless, I've included the highlighted language that I had circulated with the Idacorp draft; I think that you were mulling over whether that new language was acceptable. Please let us know your thoughts so that we can finalize the new rider. THANKS!

    Stacy: Would you like to tackle ONEOK next? Also, please let me know your thoughts on the bold faced language.

    Janet

Figure 5. Example RFC 2822 Input

The biggest improvement we can make is the integration of machine learning techniques to allow the system to be positively or negatively reinforced. This will require research on which algorithms are appropriate, as well as how they should be integrated without significant additional overhead. When properly implemented, such algorithms could enhance the system significantly. However, it is clear that analysis would be limited to word choice, as the structure of message body text can vary widely depending on the style chosen by authors.

IV Current Directions for the Research

Currently, the research on this topic is still in its early stages. There are several options to be explored that can significantly enhance the performance of this system. Once these improvements have been made, the next stage of development requires the use of data sets with known results. Finally, once the system has been adequately tested, it would ideally be deployed within a mock-up environment to observe real performance.

The existing method of message anomaly detection relies on a set of thirty fixed keywords. These have been selected based on their relevance to the current data source used for this project: the Enron database.
While far from ideal, this method has been sufficient in offering basic detection of e-mails concerning the demise of the company.

In order to create a more flexible system that can operate successfully on a variety of datasets, one possible direction to explore would be a word frequency dictionary. By using an available compiled English text corpus, a set of words extracted from an e-mail can be given global word rankings in terms of frequency of use. This would create an adaptive word detection filter that can recognize when a traditionally unusual word becomes commonplace through widespread use, eliminating potential false positives.

One problem with this direct approach is that there are a number of words that share the same meaning. For example, the word "person" has approximately fifteen to twenty meanings, depending on context. While a semantic engine is outside the scope of this system, it is theoretically trivial to use a simple thesaurus database and collapse multiple synonyms into a single common word. Such an addition would make word frequency data much more robust and accurate.

Since such a thesaurus dictionary may not be readily available, an alternative would be the integration of the classic Porter Stemming Algorithm. Originally published in 1980, the algorithm was developed by Martin Porter to automatically remove the suffixes attached to words. One scenario involves the conjugation of a simple verb, "run", into its various tenses: "runs", "running", "ran", etc. While the past tense word "ran" is not properly identified, the rest of the variations are collapsed back into the word "run" itself. This algorithm alone could eliminate several redundancies in word frequency detection.

Even a perfect keyword recognition algorithm is not sufficient when individuals begin to encode the text of their message. The use of encryption-breaking techniques is beyond the scope of this research. However, even assuming that data encryption is never used, there is another way to hide the true topic of a message: word swapping. Keyword analysis is virtually useless when an individual uses unrelated words in place of suspicious words. For example, using the word "corn" in place of "bomb" would create a new message that would appear to be innocuously concerned with food.

One solution to this problem is the use of a singular value decomposition (SVD) matrix. SVD is frequently used in data pattern matching, and research has shown that using it to analyze correlation among messages is highly effective [SKIL05]. In the word swapping scenario, a common pattern in word choice could potentially be detected, as long as the word choice deviates from established norms.
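The word normalization ideas above (collapsing synonyms and stripping suffixes before counting word frequencies) can be sketched in Python as follows. This is an illustration under assumptions, not the system's implementation: `crude_stem` is a deliberately simplified suffix stripper and is not the Porter algorithm, and `SYNONYMS` is a hypothetical two-entry table standing in for a thesaurus database.

```python
import re
from collections import Counter

# Hypothetical synonym table; a real system would load a thesaurus database.
SYNONYMS = {"individual": "person", "human": "person"}


def crude_stem(word):
    """Strip a few common suffixes. A simplified stand-in, NOT Porter's algorithm."""
    if word.endswith("ss"):
        return word
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undo consonant doubling: "running" -> "runn" -> "run".
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeioulsz":
                stem = stem[:-1]
            return stem
    return word


def normalize(word):
    """Lower-case, stem, then collapse synonyms onto one common word."""
    stem = crude_stem(word.lower())
    return SYNONYMS.get(stem, stem)


def word_frequencies(bodies):
    """Count normalized words across a collection of message bodies."""
    counts = Counter()
    for body in bodies:
        counts.update(normalize(w) for w in re.findall(r"[A-Za-z]+", body))
    return counts


freq = word_frequencies(["He runs", "running ran"])
```

As in the example from the text, "runs" and "running" collapse to "run" while the irregular past tense "ran" is left as a separate entry.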
The biggest hurdle in applying SVD-based analysis is adapting it to a real-time detection algorithm. We will discuss our approach to message surveillance in Section 3.

Given the widespread interest in spam filtering, there are undoubtedly a number of alternative anomaly detection techniques that can be used on text-based communication. Further research is necessary to determine what techniques are available, their effectiveness in detecting characteristics, and whether or not they would be a beneficial addition to the system. For example, the use of artificial intelligence to automatically identify keywords has been discussed in the work of Michael Pazzani [PAZZ00].

A second area of necessary research lies in properly modeling temporal decay. Several questions must be answered: How do implied relationships decay over time? When does a token become invalid? How does time affect the weights of both nodes and links? What adjustments should be made to the lifespan of a token when it is used repeatedly? Clearly, much of this research may branch outside of traditional computer science areas.

The third area of further development lies in analysis of the constructed social network. Several groups, such as the International Network for Social Network Analysis (INSNA), are dedicated to discussing methodologies for association data. As long as the system is building the network, it is logical that the machine representation of the communication network should be exploited for any potential gain in suspicious activity identification.

One possible avenue of research is the identification of roles within a network. Within any given social setting, each individual often serves one or more roles in the communication infrastructure. Some may be hubs of information, always kept informed of situations and responsible for informing others. Others may act as brokers between two major parties [KREB05], each a liaison responsible for maintaining a channel of contact.
These roles impact how individuals interact, their purpose, and ultimately the way information disseminates through the network. Further research into identification of these roles can enhance anomaly detection by revealing when an individual deviates from the normal purpose they serve.

Groups of individuals, also known as social fields, often form automatically within social environments [PERI91-76]. Whether by a direct department assignment or simply acquaintance, each individual has a number of associates that he or she frequently communicates with. When a number of individuals share close mutual ties, a group often exists. Recognition of these groups represents a unique challenge in future research. Social groups represent a significant factor in knowledge distribution, and existing techniques for network analysis may prove useful.

A promising method of analysis is the field of graph clustering. By discovering correlation of nodes through analysis of shared links among tightly grouped users, groups can be discovered. The work of Kishnamurthy et al., although focusing on the field of internet topology, gives a number of insights on creating such clusters through discovery of how two nodes are indirectly connected.

Yet another area in which further development is necessary is the construction of a control set to properly measure how effective this system is. The current Enron dataset, while useful, lacks enough information to be used in the formation of control data. Therefore, following the advice of a colleague, we believe it is necessary to create a new, smaller dataset that directly reflects a planned social network. The dataset will be populated with a number of automatically generated messages using common words, and several deliberate messages using unusual keywords will be inserted over time. The ideal results will be compared to what the system generates, allowing the effectiveness to be measured. Additionally, an adequately constructed control set will be useful when research uncovers some of the previously mentioned improvements, to determine how the system has changed.

Given the claims that the proposed system could operate on virtually any type of text-based messaging system with static endpoints of communication, further research is necessary to determine how the system would accommodate different types of environments. For example, instant messaging networks, chat rooms, and even message boards represent other forms of communication this system must be prepared to monitor. Part of the research may need to include the varying social environments created by different mediums.

Finally, in the interest of proving the effectiveness of the system, it is essential that the system is integrated into an actual server. This is necessary to determine the validity of the assumptions made during construction, such as the uniqueness of the mail identification strings assigned by most servers. While the server would not be used for actual traffic, a test set would be fed through it to determine, overall, how effective this system would be in the interest of security.

As discussed in this section, there are many areas for research. We discuss one such area: automated message detection. The system that we have designed and developed for this application is discussed in Section 3.

III. AUTOMATED MESSAGE DETECTION

A. Background on Automated Message Detection

The ultimate goal of our research is to construct a system that can determine the presence of anomalous or suspicious messages within the typical communication traffic discussed in Section 2 [LAYF05]. Our areas of research include social network analysis, text processing, and text filtering. This experiment will determine the effectiveness of singular value decomposition applied as a text processing technique in our system.
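Any such experiment needs ground truth to measure against. The control set proposed under Current Directions (automatically generated messages built from common words, with a few deliberate messages carrying unusual keywords) could be produced along the following lines. This is a hedged sketch only: the word lists, the planned links, the message length of eight words, and the precision/recall scoring are all illustrative assumptions rather than details from the paper.

```python
import random

# All three lists are hypothetical placeholders.
COMMON_WORDS = ["meeting", "report", "lunch", "schedule", "thanks", "update"]
UNUSUAL_WORDS = ["zephyr", "quasar"]
PLANNED_LINKS = [("ann", "bob"), ("bob", "cat"), ("cat", "dan")]


def build_control_set(n_messages, n_planted, seed=0):
    """Generate messages over a planned network, planting unusual keywords in a few."""
    rng = random.Random(seed)
    planted_slots = set(rng.sample(range(n_messages), n_planted))
    messages = []
    for i in range(n_messages):
        sender, recipient = rng.choice(PLANNED_LINKS)
        words = rng.choices(COMMON_WORDS, k=8)
        planted = i in planted_slots
        if planted:
            # Overwrite one common word with a deliberately unusual one.
            words[rng.randrange(len(words))] = rng.choice(UNUSUAL_WORDS)
        messages.append({"id": i, "from": sender, "to": recipient,
                         "body": " ".join(words), "planted": planted})
    return messages


def evaluate(messages, flagged_ids):
    """Compare the detector's flagged ids with the planted ground truth."""
    truth = {m["id"] for m in messages if m["planted"]}
    flagged = set(flagged_ids)
    hits = len(truth & flagged)
    precision = hits / len(flagged) if flagged else 0.0
    recall = hits / len(truth) if truth else 0.0
    return precision, recall


control = build_control_set(n_messages=50, n_planted=5)
```

Fixing the random seed keeps the control set reproducible, so the same data can be replayed after each change to the detector.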
One particular approach we use in our work on message surveillance involves the recognition of social context. Given a message that was deemed suspicious by a detection technique, there is an increased probability that future messages bearing similar characteristics to the detected message warrant further suspicion. We track this through the generation and passing of tokens carrying these characteristics in a model derived from observed social interaction within the message-passing network. Currently, these tokens are triggered by the presence of two or more keywords from a list tailored to our dataset. By using the techniques described by David Skillicorn, we believe it is possible to enhance the token passing methods of our approach.

Message-ID: < JavaMail.evans@thyme>
Date: Wed, 13 Dec :01: (PST)
From: rebecca.cantrell@enron.com
To: phillip.allen@enron.com
Subject: Re:
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: Rebecca W Cantrell
X-To: Phillip K Allen
X-cc:
X-bcc:
X-Folder: \Phillip_Allen_Dec2000\Notes Folders\All documents
X-Origin: Allen-P
X-FileName: pallen.nsf

Phillip -- Is the value axis on Sheet 2 of the "socalprices" spread sheet supposed to be in $? If so, are they the right values (millions?) and where did they come from? I can't relate them to the Sheet 1 spread sheet.

Figure 6. An excerpt from the Enron dataset

The Enron dataset was originally made available by the Federal Energy Regulatory Commission during its investigation of the company. It was purchased by MIT for use in data analysis, and is currently available in both raw and compiled database forms from Carnegie Mellon University, courtesy of William Cohen. This dataset is useful to surveillance research as it is a freely available collection of e-mail that has been generated from a 'real world' scenario. After pruning and restructuring for consistency, the overall corpus is composed of 250,000 e-mails that span a five-year period.
Each e-mail remains in its original raw ASCII-encoded text format, conforming to the RFC 2822 standard. The simplicity of this specification and the data it requires for proper mail transport assist with analysis in a number of ways. First, every e-mail must contain a unique identification number assigned by the originating server. Second, each message must have a well-defined origin and destination address, with the exception of any group or mass-mailing aliases. Third, every message has an explicit date stamp that reflects when it was originally sent. Finally, the body text of each e-mail is free of any escape sequences or special characters. With the exception of HTML-encoded e-mail, this greatly simplifies any preprocessing necessary for text parsing.

The content of the Enron dataset is its most useful aspect. We understand that, at some point in time, e-mails began circulating that eventually led to the investigation of the entire corporation. This has been documented by countless news reports, committees, and even research groups. Coupled with the official announcement on January 9th, 2002 that the United States Department of Justice was beginning its official inquiry, we have definitive moments and individuals that can be scrutinized to determine the overall effectiveness of our experiment.

Singular value decomposition is an important tool in statistical analysis. Given an $m \times n$ matrix $A$ of real or complex numbers, we can decompose it into a series of component matrices: $U$, $\Sigma$, and $V^T$. $U$ represents the patterns among values contained among the objects. In this experiment, the objects are the messages. $V$ embodies the patterns among the attributes of each object. This will be the patterns among the ranks of words within the messages. The matrix $\Sigma$ is a diagonal matrix conforming to the dimensions $n \times n$ that stores the singular values of $A$. Essentially, the most 'interesting' parts of the original matrix are made evident [JIAW00].

$$A = U \Sigma V^T$$

Figure 7. Correlation of messages across the decomposition

Once the SVD process has been carried out, the resulting matrices can be used for a variety of purposes. One particular way to use the results is the elimination of deviants from the patterns to filter out noise. We define noise among our messages as misspelled words, accidental insertion of punctuation, and anything else a user may accidentally insert within their message that cannot be removed automatically within preprocessing in a timely and effective manner. Other uses include object correlation, signal processing, and even text retrieval [WIKI05]. Most of the uses for SVD stem from the fact that the components, once modified, can be used to build a new matrix that contains only the most 'interesting' qualities of the original.

B. Design of Our System

Our existing system discussed in Section 2 has been outfitted with a new detection module based on the SVD method, as shown in Figure 8. The algorithm within the module requires a ranked word database, one or more messages that share the same token, and a set of messages that have originated from them representing recent traffic. A matrix is constructed based on the word content of the messages, which is decomposed according to the SVD method. A threshold is set for the noise, and the results are analyzed for word correlation. If a message in the recent traffic set has a strong correlation with one or more of the token-holding messages, the token is passed to all recipients of the message.

Figure 8. Modified System Architecture

To determine the rank of a word, a running word rank database is kept. The 5,000 most common words within the Brown Corpus of Standard American English are used to seed the database. As words are extracted from messages, each word and the number of times it occurred within the message is either inserted into the database, if the word is new, or used to update the existing word rank.

The matrix is constructed based on the number of messages involved and the words present. Each message represents a column, while each row represents a word. Thus, each cell represents the number of times a particular word occurs within a particular message. The columns are ordered from most common (left) to least common (right). Mathematically, this is represented as follows:

$$c_{ji} = \mathrm{count}(w_i, m_j), \qquad W = M_1 \cup M_2 \cup \dots \cup M_t, \qquad w_i \in W$$

where $W$ represents the set of all words that occur in the union of the word sets for each message $M$. The function $\mathrm{count}$ returns the number of times a particular word $w_i$ occurs within a message $m_j$. Note that $M_j$ is derived from the words in $m_j$; however, $m_j$ may contain the same word more than once, while $M_j$ is a list that strictly represents all words only once.

Once the SVD technique has been used on the messages, the noise is removed through the use of a threshold of 0.25. After the singular value dimensions falling below this threshold are removed, the matrix is reconstructed from the components. The resulting matrix should reduce the amount of data that must be analyzed.

The correlation of any two messages is based on a total score. Each word that is found in both messages is given a rating according to the following equation:

$$\mathrm{score}(w_i) = \frac{\mathrm{count}(w_i, m_j) + \mathrm{count}(w_i, m_k)}{\mathrm{rank}(w_i)}$$

where $w_i$ is the word that occurs in messages $m_j$ and $m_k$. The $\mathrm{count}$ function returns the number of occurrences of the word within a supplied message. The $\mathrm{rank}$ function is a cumulative count of the occurrence of the word $w_i$ in all messages up to this point. This equation is designed to place emphasis on words that occur less frequently than others. It is assumed that rare words that occur in both messages are more likely to be good candidates for a topical match. Note that $\mathrm{score}(w_i)$ is naturally normalized to one: given that the rank of a word is adjusted before these messages are processed with this equation, the maximum score will be one.

In order to determine if two messages match, the following equation is used:

$$\sum_{w_i \in W_j \cap W_k} \mathrm{score}(w_i) \geq \tau$$

where $W_j$ and $W_k$ are the sets of all words contained within $m_j$ and $m_k$, respectively, and $\tau$ represents a predetermined threshold that determines the minimum total score. In theory, this should be some value that is determined by statistical analysis of known correlating messages. For our experiment, we assume $\tau$ = [value missing in the transcription].

C. Enhancement to the Experiments

Once this experiment has been run successfully, an adequate infrastructure will exist to fully exploit the potential of singular value decomposition. After the noise has been eliminated, we could use the patterns to observe topical correlation between messages.
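The matching and token-passing steps of Section III.B can be sketched in Python as follows. This is a minimal illustration under several assumptions: the per-word score is read as the sum of the two counts divided by the rank (the reading consistent with the stated maximum of one), the seed counts and the threshold of 1.0 are hypothetical, the message dictionaries are an invented representation, and the SVD noise-removal step is omitted because it would need a linear algebra library.

```python
from collections import Counter


def word_counts(body):
    """count(w, m): occurrences of each word in one message body."""
    return Counter(body.lower().split())


def count_matrix(bodies):
    """Rows are words, columns are messages; each cell is count(w_i, m_j)."""
    counts = [word_counts(b) for b in bodies]
    vocab = sorted(set().union(*counts))
    return vocab, [[c[w] for c in counts] for w in vocab]


def correlation(body_j, body_k, rank):
    """Sum of per-word scores over the words shared by both messages."""
    cj, ck = word_counts(body_j), word_counts(body_k)
    shared = cj.keys() & ck.keys()
    # Assumed reading of the score: (count_j + count_k) / rank.
    # The rank database must already include both messages (rank >= 1).
    return sum((cj[w] + ck[w]) / rank[w] for w in shared)


def pass_tokens(token_bodies, recent, rank, threshold):
    """Recipients of recent messages that match any token-holding message."""
    receivers = set()
    for msg in recent:
        if any(correlation(msg["body"], t, rank) >= threshold for t in token_bodies):
            receivers.update(msg["to"])
    return receivers


# Hypothetical seed counts standing in for the Brown Corpus seed.
rank = Counter({"the": 100, "meeting": 20})
token_bodies = ["move the corn tonight"]
recent = [
    {"body": "corn arrives tonight", "to": ["dan"]},
    {"body": "the meeting moved", "to": ["eve"]},
]
# The rank database is updated before scoring, as the text requires.
for body in token_bodies + [m["body"] for m in recent]:
    rank.update(word_counts(body))

receivers = pass_tokens(token_bodies, recent, rank, threshold=1.0)  # hypothetical threshold
```

In this toy run the rare shared words "corn" and "tonight" dominate the score, while the common word "the" contributes almost nothing, which is the emphasis on infrequent words that the score is designed to produce.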
Given that our overall system includes social network analysis, it may become possible to identify suspicious messages as those that deviate from the patterns, without the use of suspicious keywords.

Another use for this approach is bridging potential gaps in our analysis due to users that use another form of communication (e.g., the telephone). The pattern analysis techniques could allow us to potentially determine when messages exchanged independently are intrinsically linked due to other social connections. Should there be a significant clustering among these conversations, it may even be possible to derive the presence of social groups [5, 6].

There are a number of other methods that decompose a matrix into components. One such technique is Principal Component Analysis, a technique that yields similar results to the SVD by attempting to capture the majority of variability in the data. If we construct our code in an abstract fashion, we will be able to use different approaches to the same matrix to enhance the overall effectiveness of our system [JIAW00].

Beyond text processing, we are currently investigating the use of matrix decomposition as applied to graph theory. Given a matrix composed of nodes in a graph and the values which represent the strength of the unidirectional links between them, other communication patterns could be discovered.

In the interest of furthering the applications of social network analysis, we are looking into tracking coalition information sharing and the game theory involved. Given the interest in assuring that vital information shared is never disclosed to the enemy, it is crucial to understand how parts of a coalition interact and whom they may be associated with. A social network can theoretically be constructed that represents the organizations and the ties between them. Applications of our existing techniques could automate the construction of such a model, potentially benefiting the use of game theory that deals with information exchange. Section 3 will be devoted to a discussion of the application of social networks to coalition data sharing.

D. Security and Privacy Considerations

Two of the applications of social network analysis that are relevant to counter-terrorism are automated message surveillance and information sharing across coalitions. In the case of automated message surveillance, the idea is to monitor the messages for suspicious words as well as determine the content of communication within groups. In the case of data sharing across coalitions, the idea is to analyze the behavior of an organization and determine how to extract information from that organization. In this section we discuss security and privacy considerations as relevant to both applications.

The idea behind social network analysis is to analyze a social network, say N. N describes the communication and activities of a group G. The analysis could be to determine the suspicious words used in messages, as we have discussed in Section 2.3, or to determine suspicious activities such as the travel patterns of the members of the group G. It is the goal of G to ensure the privacy of the activities of its members as well as to ensure that only authorized individuals may access the activities of the members of G. With respect to privacy, G may enforce privacy policies. For example, anyone who accesses information about G has to comply with the policies enforced by G. If A and B who are members of G have some association between them and if that


More information

DRIVE-BY DOWNLOAD WHAT IS DRIVE-BY DOWNLOAD? A Typical Attack Scenario

DRIVE-BY DOWNLOAD WHAT IS DRIVE-BY DOWNLOAD? A Typical Attack Scenario DRIVE-BY DOWNLOAD WHAT IS DRIVE-BY DOWNLOAD? Drive-by Downloads are a common technique used by attackers to silently install malware on a victim s computer. Once a target website has been weaponized with

More information

Chapter 23. Database Security. Security Issues. Database Security

Chapter 23. Database Security. Security Issues. Database Security Chapter 23 Database Security Security Issues Legal and ethical issues Policy issues System-related issues The need to identify multiple security levels 2 Database Security A DBMS typically includes a database

More information

Scientific Graphing in Excel 2010

Scientific Graphing in Excel 2010 Scientific Graphing in Excel 2010 When you start Excel, you will see the screen below. Various parts of the display are labelled in red, with arrows, to define the terms used in the remainder of this overview.

More information

A network is a group of devices (Nodes) connected by media links. A node can be a computer, printer or any other device capable of sending and

A network is a group of devices (Nodes) connected by media links. A node can be a computer, printer or any other device capable of sending and NETWORK By Bhupendra Ratha, Lecturer School of Library and Information Science Devi Ahilya University, Indore Email: bhu261@gmail.com Network A network is a group of devices (Nodes) connected by media

More information

IFS-8000 V2.0 INFORMATION FUSION SYSTEM

IFS-8000 V2.0 INFORMATION FUSION SYSTEM IFS-8000 V2.0 INFORMATION FUSION SYSTEM IFS-8000 V2.0 Overview IFS-8000 v2.0 is a flexible, scalable and modular IT system to support the processes of aggregation of information from intercepts to intelligence

More information
