A Monograph on Data Mining Techniques for Email Forensics

Venkata Krishna Kota
Central Research Laboratory
Bharat Electronics Limited
Bangalore, India
venkatakrishnav@bel.co.in

Abstract

Email forensics occupies a key portion of today's digital forensics. There is a pressing need for tools that can forensically analyze large email collections. Data mining techniques play a vital role in analyzing large collections of data. This paper proposes various data mining techniques and an architecture to help email forensic examiners.

Keywords: Email Forensics; Data Mining; Information Retrieval; Ontology

I. INTRODUCTION

With the rapid development of email technology, the use of email for fraudulent activities is also growing at an accelerating pace. Forensic analysis of these emails can help prevent, investigate, or prove a crime. Email forensics can be defined as the use of specialized techniques for the collection, preservation, and analysis of emails with a view to presenting evidence in a court of law. Analyzing very large volumes of email is a challenge for forensic examiners, and there is a pressing need for tools that can analyze these emails automatically.

Digital evidence search is the heart of digital forensics. Information Retrieval (IR) is the process of retrieving the information, or the documents containing the information, that is relevant to a specified information need. IR systems play a vital role in such scenarios by retrieving the most relevant information for a given user query. In this paper we propose a system that retrieves the relevant emails from large email collections and presents them to forensic examiners in an easily understandable form.

Forensic examiners do not want to miss any piece of relevant information. Unlike traditional IR systems, forensic IR systems should therefore focus on high recall. The proposed system achieves high recall through ontology-driven query expansion.
High recall results in more matching emails, some or many of which may not be relevant to the investigator's information need. The information need also varies from case to case. For example, assume that a tender has been leaked in an organization. In handling this case, the investigator may give higher priority to emails sent between the tender creation date and the date the tender results were announced than to emails sent on other dates. Such requirements are best known to the investigator alone. The proposed system offers a customizable ranking facility so that investigators can express their interests to the system, which ranks the retrieved emails accordingly and presents the most relevant emails first.

Nowadays people often write short-form acronyms instead of full words. For example, an email user may write "tc" when the intention is to say "take care". Analyzing such acronyms is a challenge. The proposed system analyzes such acronyms with the help of an ontology.

Data mining techniques are applied for in-depth analysis. Link analysis, clustering, summarization, and other techniques are applied to identify interesting patterns. Visualization techniques are used to present the retrieved knowledge to the user in an easily understandable form with graphical support.

The system constructs an Email Forensics Ontology by capturing semantic relationships among the retrieved email conversations that are essential for email forensics. Through semantic analysis and inference, the ontology can directly answer some of the domain-specific questions of forensic investigators.

Section II reviews the research done in email forensics so far. Section III presents the design of the system. Section IV explains the implementation issues. Section V concludes the paper.

II. RELATED WORK

As noted above, the fraudulent use of email is growing along with email technology, and email forensics can help prevent, investigate, or prove a crime. Many researchers have done valuable work and presented solutions for email forensics. This section summarizes their contributions.

In [1], a tool is presented for indexing and analyzing the textual content of emails and for providing information retrieval functions that retrieve all emails containing information of interest for digital forensics. Traditional forensic search tools simply present the results without any grouping or adequate filtering, so the investigator has to spend a lot of time finding the documents related to the investigation among the search results. To address this, [2] proposes a new ranking method that ranks results by their relevance to the investigator's information need. When users provide narrow queries, information retrieval may fail to return some relevant documents. [1] addresses this problem through query expansion using the WordNet ontology, expanding the query with words that are semantically related to the words actually in the query.

ASE 2014 ISBN: 978-1-62561-000-3

The Enron data set is a large collection of emails, containing around 517,431 emails from 151 employees. It has been described as an ideal test bed for evaluating techniques for counterterrorism and fraud detection, and it has been used by many researchers [3]. We have chosen it as the data set for our experiment.

In [5], the authors present a data mining tool that visualizes a wide range of detailed analyses of email and email flows derived from large email collections in a variety of formats. In [6], tools are described for identifying the originating IP address and originating location of an email through email header analysis. [7] proposes an algorithm for email forensics that analyzes email information from network packets on the SMTP and HTTP protocols. In [8], the author proposes techniques for discovering the emails that belong to one conversation, capturing the conversation structure, and summarizing the conversation. The system described in [9] focuses on phishing scams; it analyzes emails using UNIX tools to gather additional forensic information, and it also generates forensic reports. The system in [10] applies classification methods to instant messages to determine their author based on user behavior. The present paper provides an architecture and a method for email forensics using data mining techniques.

III. DESIGN

The block diagram of the system is given in Figure 1. Emails from the corpus are parsed, tokenized, analyzed, and finally indexed one by one. Once the index is ready, users can query it and retrieve relevant emails. The user's query is analyzed semantically, then expanded and refined using the WordNet and chat acronym ontologies. The refined query is matched against the index to retrieve the matching emails.
Based on the user-provided ranking profile, which represents the investigator's interests, the matching emails are ranked according to their relevance to the case under investigation. These emails are presented to the user and also forwarded to further modules for in-depth analysis. Data mining techniques such as link analysis, email clustering, and user behavior modeling are performed on these relevant emails, and the retrieved knowledge is presented by the visualization module in an easily understandable form with graphical support.

The ontology construction module takes the retrieved relevant emails, extracts the interesting relationships from them, and constructs the Email Forensics Ontology. Through semantic analysis and inference, the ontology directly answers some domain-specific questions of interest to investigators.

FIG. 1. BLOCK DIAGRAM OF THE SYSTEM

IV. IMPLEMENTATION METHODOLOGY

The Enron email corpus [14] is a good choice of data set for this experiment. It has been described as a suitable test bed for digital forensics and has been widely used by researchers. To provide rapid searching, an index must first be constructed from the email collection. Indexing refers to processing the original data into a highly efficient cross-reference lookup in order to facilitate rapid searching [4, 12]. Each email in the corpus has to be parsed, analyzed, tokenized, and indexed. We suggest Lucene [13], an open-source Java search API, for this task. In this way an inverted index can be built. The index can be inspected using the Luke tool [20].
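To make the idea of an inverted index concrete, the following is a minimal Python sketch of the data structure only. It is not the Lucene API and not the system's implementation; the function names, the regular-expression tokenizer, and the three-message corpus are assumptions chosen for illustration.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_inverted_index(emails):
    """Map each token to the sorted list of ids of the emails containing it."""
    index = defaultdict(set)
    for email_id, body in emails.items():
        for token in tokenize(body):
            index[token].add(email_id)
    return {token: sorted(ids) for token, ids in index.items()}

def search(index, query):
    """Return the ids of emails containing every query token (Boolean AND)."""
    tokens = tokenize(query)
    if not tokens:
        return []
    result = set(index.get(tokens[0], []))
    for token in tokens[1:]:
        result &= set(index.get(token, []))
    return sorted(result)

# Hypothetical three-message corpus, for illustration only.
emails = {
    "m1": "East basis and index points for the tender",
    "m2": "Tender results will be announced on Friday",
    "m3": "Lunch on Friday?",
}
index = build_inverted_index(emails)
print(search(index, "tender friday"))
```

Looking up a term is a dictionary access rather than a scan of every email, which is what makes searching a large collection fast. A production index such as the one Lucene builds also stores positions, term frequencies, and per-field data, none of which this sketch attempts.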

The use of short-form acronyms in place of the actual words (for example, "tc" for "take care") makes email understanding a challenging task. To deal with this problem, we propose ontology-based acronym expansion at indexing time. By constructing an ontology for acronym expansion, we can use it to expand the acronyms within each email. The index is thus refined by indexing the full words instead of their acronyms.

Users query the system with keywords to get the relevant emails. Digital forensics generally aims at high recall, because investigators do not want to lose any information even if it is only slightly relevant. We use query expansion to achieve high recall: the user's query is expanded with words that are semantically similar to the words in the query. We have used the WordNet ontology for query expansion. Lucene provides a convenient method for query expansion.

The user's query is analyzed and refined through query expansion as explained above. The refined query is matched against the inverted index, and the emails containing the query terms are retrieved. Lucene offers a handy way to search for documents that contain the query keywords. Various search types are possible, such as Boolean search, field-based search, weighted search, wildcard search, and fuzzy search. Suggestions for misspelled queries are also feasible through Lucene.

Because of the high recall, the number of resulting emails will be large. Existing forensic search tools simply present the results without any grouping or adequate filtering, so a crime investigator has to spend a lot of time finding the documents related to an investigation among the search results. To solve this we propose a ranking methodology that the user can customize according to the interests of the investigation, which vary greatly from case to case.
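The acronym expansion and query expansion steps described earlier in this section can be sketched in a few lines of Python. The two dictionaries below are hypothetical stand-ins for the chat acronym ontology and for WordNet lookups; a real system would consult those resources instead, and the function names are assumptions made for this example.

```python
# Hypothetical acronym table standing in for the chat acronym ontology.
ACRONYMS = {
    "tc": "take care",
    "asap": "as soon as possible",
    "fyi": "for your information",
}

# Hypothetical synonym table standing in for WordNet lookups.
SYNONYMS = {
    "tender": ["bid", "offer"],
    "leak": ["disclose", "reveal"],
}

def expand_acronyms(text, acronyms=ACRONYMS):
    """Replace each known acronym with its full form before indexing.

    Uses plain whitespace splitting, so an acronym followed by punctuation
    (for example "tc.") is left untouched in this simplified version.
    """
    words = text.lower().split()
    return " ".join(acronyms.get(word, word) for word in words)

def expand_query(query, synonyms=SYNONYMS):
    """Return the query terms plus their related terms, without duplicates."""
    terms = []
    for word in query.lower().split():
        for term in [word] + synonyms.get(word, []):
            if term not in terms:
                terms.append(term)
    return terms

print(expand_acronyms("Please reply asap tc"))
print(expand_query("tender leak"))
```

Matching any of the expanded terms, rather than only the original ones, is what raises recall. The cost is lower precision, which is why the expanded result set then needs ranking.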
A ranking system is needed that orders the matching emails by relevance. In traditional IR systems the ranking scheme is normally static. For forensic analysis, apart from the textual content of the email body, the metadata associated with the email, such as sender and receiver information, the date and time the email was sent or received, and the origin location of the email, is much more important. Traditional IR systems rank documents based on the occurrence of query terms in the text and do not consider such metadata. A further issue is that the relevant metadata (email sent/received time, sender information, and so on) changes from case to case and is best known to the person investigating the case. A forensic ranking method should therefore take email metadata into account and should be flexible to change. The system gives the forensic examiner the flexibility to express these interests in the form of a ranking profile. The system calculates a relevance score for each matching email and presents the emails of most interest to the investigator at the top of the results list.

Metadata plays a key role in forensic analysis and can be extracted from the retrieved emails. Sample metadata of an Enron email is given below.

    Message-ID: <24578553.1075841215501.JavaMail.evans@thyme>
    Date: Tue, 5 Feb 2002 08:18:46 -0800 (PST)
    From: chance.rabon@enron.com
    To: jonathan.mckay@enron.com, f..brawner@enron.com
    Subject: east basis and index points
    Mime-Version: 1.0
    Content-Type: text/plain; charset=us-ascii
    Content-Transfer-Encoding: 7bit
    X-From: Rabon, Chance </O=ENRON/OU=NA/CN=RECIPIENTS/CN=CRABON>
    X-To: Mckay, Jonathan </O=ENRON/OU=NA/CN=RECIPIENTS/CN=Jmckay1>, Brawner, Sandra F. </O=ENRON/OU=NA/CN=RECIPIENTS/CN=Sbrawne>
    X-cc:
    X-bcc:
    X-Folder: \ExMerge - Mckay, Jonathan\Enron
    X-Origin: MCKAY-J
    X-FileName: jon mckay 7-11-02.PST

Further details can be derived from this metadata.
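As an illustration of metadata extraction and profile-based ranking, the sketch below parses a shortened copy of the sample headers with Python's standard `email` package and adds profile-driven boosts to a text relevance score. The paper does not specify a scoring formula, so the additive scheme, the profile keys (`date_window`, `date_boost`, `senders_of_interest`, `sender_boost`), and the weights are all assumptions made for this example.

```python
from datetime import datetime, timezone
from email import message_from_string
from email.utils import parsedate_to_datetime

# Shortened copy of the sample Enron headers; the body is a placeholder.
RAW = """\
Message-ID: <24578553.1075841215501.JavaMail.evans@thyme>
Date: Tue, 5 Feb 2002 08:18:46 -0800 (PST)
From: chance.rabon@enron.com
To: jonathan.mckay@enron.com, f..brawner@enron.com
Subject: east basis and index points

(body omitted)
"""

def extract_metadata(raw):
    """Parse the raw message and return the fields used for ranking."""
    msg = message_from_string(raw)
    # Simple comma split; adequate for bare addresses like those in the sample.
    recipients = [a.strip() for a in (msg["To"] or "").split(",") if a.strip()]
    return {
        "sender": msg["From"],
        "recipients": recipients,
        "date": parsedate_to_datetime(msg["Date"]),
        "subject": msg["Subject"],
    }

def score(meta, text_score, profile):
    """Add metadata boosts from the ranking profile to the text score."""
    total = text_score
    start, end = profile["date_window"]
    if start <= meta["date"] <= end:
        total += profile["date_boost"]
    if meta["sender"] in profile["senders_of_interest"]:
        total += profile["sender_boost"]
    return total

# Hypothetical ranking profile: the investigator cares about early
# February 2002 and about one particular sender.
profile = {
    "date_window": (
        datetime(2002, 2, 1, tzinfo=timezone.utc),
        datetime(2002, 2, 10, tzinfo=timezone.utc),
    ),
    "date_boost": 2.0,
    "senders_of_interest": {"chance.rabon@enron.com"},
    "sender_boost": 1.5,
}

meta = extract_metadata(RAW)
print(meta["sender"], meta["date"].isoformat())
print(score(meta, text_score=1.0, profile=profile))
```

Because the profile is plain data, the investigator can change the date window or the weights for each case without touching the retrieval code, which is the kind of flexibility the ranking profile is meant to provide.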
Details such as the originating location or originating IP address of an email can be extracted using tools such as EmailTrackerPro [18] and SmartWhoIs [19].

Analyzing the email body content is useful in many contexts. Author identification can be performed through analysis of the email text, and it plays a key role in spam detection. Text mining is very useful for this purpose. Email classification also plays an important role in email forensics. When a crime occurs, some emails may be primary emails that contain crime-related information and can serve as evidence, while many other emails merely discuss the crime or pass on news of it. The crime investigator may not be interested in the latter. Classifying emails into crime-related emails and news emails therefore helps the examiner considerably. Text mining tools such as NLTK [27] and GATE can be used for these purposes.

Social interactions are very important in forensic investigations. Link analysis can be performed on the retrieved emails to understand the email interactions [11]. A user interaction graph contains the email addresses found in the retrieved emails as nodes and each retrieved email as an edge between the corresponding address nodes. A single user interaction graph can thus be built from the resulting emails, depicting the whole set of user interactions at a glance. Using this graph, the communication links between any parties can be easily analyzed, and mediators between two parties can be easily identified. The graph also reveals groups of people who usually communicate among themselves. These details are important for forensic investigation.

Efficient visualization of the extracted knowledge is equally important. The visualization module presents the knowledge discovered by the above techniques using graphs and charts. The email user interaction graph can be rendered using the JUNG (Java Universal Network/Graph Framework) API [22]. Details such as email sending and receiving counts, and the frequency of sent and received emails over time and date ranges, can be plotted as bar charts and pie charts using the JFreeChart Java API [21].

An ontology is a specification of a conceptualization of a particular domain. Ontologies have been widely used in many domains to formally represent domain semantics, to provide automated reasoning, and to answer semantically rich domain-specific queries. We propose an ontology built from concepts and relationships related to email forensics. Ontologies capture the semantic relations among the entities of the domain; with their help we can infer new knowledge and answer domain-specific questions.

The ontology can be developed using the Protégé tool [15, 17], and OWL (Web Ontology Language) [16] can be used to represent it. Once the ontology is designed, it can be instantiated with the metadata extracted from the resulting emails using Protégé's DataMaster plug-in [26]. The consistency of the ontology can be checked using the Pellet reasoner. Rules specific to the forensic domain can be written in SWRL (Semantic Web Rule Language) [23]. With the help of the Jess inference engine [25], we can fire the domain-specific rules and infer new knowledge. The ontology is then expanded with this new knowledge.
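The user interaction graph from the link analysis step can be illustrated with a small Python sketch that answers whether two parties are connected and who mediates between them. It treats the graph as undirected and uses breadth-first search; both choices, along with the function names and the example.com addresses, are simplifying assumptions for this example rather than the system's actual method.

```python
from collections import defaultdict, deque

def build_interaction_graph(emails):
    """Build an undirected graph: nodes are addresses, edges are exchanged emails."""
    graph = defaultdict(set)
    for sender, recipients in emails:
        for recipient in recipients:
            graph[sender].add(recipient)
            graph[recipient].add(sender)
    return graph

def shortest_path(graph, source, target):
    """Breadth-first search; return the addresses on a shortest path, or None."""
    if source not in graph or target not in graph:
        return None
    parents = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbour in sorted(graph[node]):
            if neighbour not in parents:
                parents[neighbour] = node
                queue.append(neighbour)
    return None

# Hypothetical (sender, recipients) pairs taken from retrieved emails.
emails = [
    ("a@example.com", ["b@example.com"]),
    ("b@example.com", ["c@example.com"]),
    ("d@example.com", ["e@example.com"]),
]
graph = build_interaction_graph(emails)

path = shortest_path(graph, "a@example.com", "c@example.com")
print(path)        # the chain of contacts linking the two parties
print(path[1:-1])  # intermediate addresses, i.e. candidate mediators
print(shortest_path(graph, "a@example.com", "e@example.com"))  # None: not connected
```

The intermediate nodes on a path are candidate mediators between the two end parties, and addresses with no connecting path belong to separate communication groups.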
The ontology can then be queried with domain-specific queries using the SPARQL language [24], and the answers are obtained from the ontology. For example, the proposed system constructs the Email Forensics Ontology by capturing relationships such as "A sends email to B". Using these details, the ontology can infer who is in contact with whom, who is directly connected to whom, whether two people are connected, and other such details. Whenever the investigator wants answers to these queries, the ontology provides them.

V. CONCLUSION

The need for and challenges of email forensics, and some of the available solutions, have been reviewed. A system is proposed to assist email forensic examiners in retrieving the relevant emails from a large email corpus in less time and in presenting interesting hidden patterns in an easily understandable manner with advanced graphical support. A method using an ontology is proposed to answer some of the domain-specific questions of interest to the forensic examiner.

ACKNOWLEDGEMENT

I am most grateful to Dr. Ajit T. Kalghatgi, Director (R&D), Bharat Electronics Limited, for his most valuable suggestions.

REFERENCES

[1] Phan Thien Son, Ontology-Driven Text Mining for Digital Forensics, COMP6703 Project Report, Australian National University, 2007.
[2] Jooyoung Lee, Proposal for Efficient Searching and Presentation in Digital Forensics, The Third International Conference on Availability, Reliability and Security, 2008, pp. 1377-1381, doi:10.1109/ares.2008.192.
[3] Jitesh Shetty and Jafar Adibi, The Enron Email Dataset Database Schema and Brief Statistical Report.
[4] Erik Hatcher, Otis Gospodnetic and Michael McCandless, Lucene in Action, Second Edition, Manning Publications, 2009.
[5] Salvatore J. Stolfo and Shlomo Hershkop, Email Mining Toolkit Supporting Law Enforcement Forensic Analyses.
[6] Natarajan Meghanathan, Sumanth Reddy Allam and Loretta A. Moore, Tools and Techniques for Network Forensics, International Journal of Network Security & Its Applications (IJNSA), Vol. 1, No. 1, April 2009.
[7] Wang Wen Qi and Liu WeiGuang, The Research on Email Forensic Based Network, First International Conference on Information Science and Engineering (ICISE), 2009, pp. 1912-1915, ISBN: 978-1-4244-4909-5.
[8] Xiaodong Zhou, Discovering and Summarizing Email Conversations, Thesis Report, The University of British Columbia, 2008.
[9] Agarwal S, Bali J, Zhenhai Duan and Kermes L, The Design and Development of an Undercover Multipurpose Anti-Spoofing Kit (UnMASK), 23rd Annual Conference on Computer Security Applications, 2007, pp. 141-150, ISBN: 978-0-7695-3060-4.
[10] Angela Orebaugh and Jeremy Allnut, Classification of Instant Messaging Communications for Forensic Analysis, The International Journal of Forensic Computer Science, IJoFCS (2009) 1, pp. 22-28.
[11] Jiawei Han and Micheline Kamber, Data Mining Concepts and Techniques, Second Edition, ISBN: 978-81-312-0535-8.
[12] Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, ISBN: 978-0-521-86571-5, 2008.
[13] Lucene search library, available at: http://lucene.apache.org/nutch
[14] Enron email dataset, available at: http://www-2.cs.cmu.edu/~enron/
[15] Protégé Ontology Editing Tool, available at: http://protege.stanford.edu/
[16] OWL guide, available at: http://www.w3.org/tr/owl-guide/
[17] Protégé Wiki, available at: http://protegewiki.stanford.edu/wiki/main_page
[18] EmailTrackerPro, available at: http://www.emailtrackerpro.com/
[19] SmartWhoIs, available at: http://smartwhois.com/
[20] Luke, available at: http://www.getopt.org/luke/
[21] JFreeChart, available at: http://www.jfree.org/jfreechart/
[22] JUNG, available at: http://jung.sourceforge.net/
[23] SWRL, available at: http://www.w3.org/submission/swrl/
[24] SPARQL, available at: http://www.w3.org/tr/rdf-sparql-query/
[25] Jess, available at: http://herzberg.ca.sandia.gov/
[26] DataMaster, available at: http://protegewiki.stanford.edu/wiki/datamaster

[27] NLTK, available at: http://nltk.org/

AUTHOR

Venkata Krishna Kota received his B.Tech degree in Computer Science and Information Technology from Jawaharlal Nehru Technological University in 2005 and his M.E. degree in Computer Science from Anna University in 2008. He is working as Member (Research Staff) at the Central Research Laboratory (CRL), Bharat Electronics Limited (BEL), Bangalore. His research interests are Information Retrieval and Complex Event Processing.