How To Write An E-Mail Forensics Framework For Cyber Crimes

Digital Investigation 5 (2009) 124–137
Available at www.sciencedirect.com
Journal homepage: www.elsevier.com/locate/diin

Towards an integrated e-mail forensic analysis framework

Rachid Hadjidj, Mourad Debbabi*, Hakim Lounis, Farkhund Iqbal, Adam Szporer, Djamel Benredjem
Computer Security Laboratory, Concordia University, 1455 de Maisonneuve West, EV 7-642, Montreal, Quebec, Canada H3G 1M8

Article info
Article history: Received 29 October 2007; received in revised form 6 January 2009; accepted 14 January 2009
Keywords: Cyber crimes; E-mail forensics; E-mail social networks; Classification; Clustering; Statistical analysis

Abstract
Due to its simple and inherently vulnerable nature, e-mail communication is abused for numerous illegitimate purposes. E-mail spamming, phishing, drug trafficking, cyber bullying, racial vilification, child pornography, and sexual harassment are some common e-mail mediated cyber crimes. Presently, there is no adequate proactive mechanism for securing e-mail systems. In this context, forensic analysis plays a major role by examining suspected e-mail accounts to gather evidence to prosecute criminals in a court of law. To accomplish this task, a forensic investigator needs efficient automated tools and techniques to perform a multi-staged analysis of e-mail ensembles with a high degree of accuracy, and in a timely fashion. In this article, we present our e-mail forensic analysis software tool, developed by integrating existing state-of-the-art statistical and machine-learning techniques complemented with social networking techniques. In this framework we incorporate our two proposed authorship attribution approaches; one is presented for the first time in this article.
© 2009 Elsevier Ltd. All rights reserved.

1. Motivations and background
In the majority of e-mail mediated cyber crimes, the victimization tactics used vary from simple anonymity to identity theft and impersonation.
Due to two inherent limitations, e-mail communication is exposed to such illegitimate uses. First, there is no mechanism for message encryption at the sender end and/or an integrity check at the recipient end. Second, the widely used e-mail protocol, Simple Mail Transfer Protocol (SMTP), lacks a source authentication mechanism. In fact, the metadata in the header of an e-mail, containing information about the sender and the path along which the message has traveled, can easily be forged or anonymized. Installing antiviruses, filters, firewalls, and scanners is insufficient to secure e-mail communication (Teng et al., 2004). In this context, cyber forensic investigation (also called digital investigation) is employed to collect credible evidence by analyzing e-mail collections to prosecute criminals in a court of law.

The scope of e-mail analysis ranges from simple keyword searching to authorship attribution of anonymous e-mails. For instance, an investigator may want to get an overview of an e-mail collection by computing simple statistics such as the distribution of e-mails per sender/recipient domains. In some situations an investigator may try to narrow down the scope of investigation by selecting the (usually few) malicious e-mails from among regular ones. For this purpose, content-based clustering is usually applied to divide e-mails into different groups on the basis of their subject matter (Li et al., 2006). The subject matter could be the type of crime, such as pornography, hacking, or terrorism, in which the e-mails were instrumental (Kulkarni and Pedersen, 2005). E-mails can also be clustered on the basis of stylometric features to determine the writing styles of the different individuals represented in an e-mail collection (Holmes, 1998). Wei et al. (2008) have proposed a clustering algorithm for detecting relationships among different spam e-mails to identify relationships between spam campaigns. The features they used are extracted and derived from the headers and attachments of spam e-mails. An investigator may be interested in detecting similarity between certain e-mails in cases of plagiarism detection and authorship analysis (de Vel et al., 2001). E-mail social network analysis techniques are used to study the communication patterns of individuals at the e-mail account level, without analyzing the actual contents of e-mails (Stolfo et al., 2006).

* Corresponding author. Tel.: +1 514 848 2424; fax: +1 514 848 3171. E-mail address: debbabi@ciise.concordia.ca (M. Debbabi).
1742-2876/$ see front matter. doi:10.1016/j.diin.2009.01.004

The need exists to develop an integrated e-mail analysis tool that uses the above-mentioned techniques. This will help forensic experts to efficiently analyze e-mail collections (which are usually huge) within a limited time frame. E-mail Mining Toolkit (EMT) (Stolfo et al., 2006) is one such framework; it computes the behavior profiles of users based on their e-mail accounts. These profiles are then employed to detect anomalous behavior of those users. This toolkit is useful for generating reports by summarizing e-mail archives. However, the toolkit does not address the issues of authorship attribution and similarity detection, which are addressed separately by Abbasi and Chen (2008). Zheng et al. (2006), on the other hand, proposed a stylometry-based framework that is used for e-mail authorship identification only. As described in this paper, we have designed and implemented a comprehensive software toolkit called Integrated E-mail Forensic Analysis Framework (IEFAF) to fill this niche for investigators. The framework assists investigators in gathering clues and evidence during investigations in which e-mail communications are relevant.
Major functionalities of IEFAF include:
The ability to investigate e-mail archives and compute the required statistical distributions to give the investigator an overview of an entire e-mail collection. Using different visualization techniques, results are plotted for the purpose of clarity and understanding. The framework is compatible with a variety of data formats coming from different databases.
The capability of keyword searching by using SQL-like queries.
The development of data mining models to help classify e-mails into different categories, or cluster them according to some undiscovered relationships.
The detection of anomalous behaviors by matching the observed e-mail communication with the pre-recorded normal communication model of users. The usual communication patterns of a user within his/her cliques are collected through e-mail social network analysis techniques.
The performance of e-mail authorship analysis, on the basis of stylometric features, to help identify the most plausible authors of anonymous e-mails.
The capability to map selected IP addresses by applying our geographical localization technique to determine the physical location of a particular IP address.

Apart from developing the framework (IEFAF), we are proposing a new approach of mining style variation to address the authorship attribution problem. In traditional authorship attribution techniques, writing style features are extracted from the entire e-mail collection of a person, irrespective of to whom the e-mails are written. It is usually assumed that the stylometric features found in one's documents remain consistent and are not controlled (neither consciously nor unconsciously) by the writer. However, a substantial variation in the style of an individual can be seen in both the contents and the stylometric features, depending on the recipient and the context.
In this paper we are proposing techniques for capturing the style variation of a person across his/her entire e-mail communication. A detailed description of our proposed technique is given in Section 2.3. Moreover, in IEFAF we have incorporated a novel authorship attribution approach, published in DFRWS 2008 (Iqbal et al., 2008). This technique of mining write-prints is based on the concept of frequent itemsets (Agrawal et al., 1993), borrowed from the data mining domain. It helps to capture the combinations of features that occur frequently in a person's e-mails.

The rest of the paper is organized as follows: Section 2 describes our proposed approach, Section 3 elaborates on different modules of our framework, and Section 4 contains concluding remarks and future directions.

2. Proposed approach
The theoretical foundation of our framework is based on different well established techniques of statistical analysis, text mining (classification and clustering), and stylometric features analysis, together with behavioral modeling achieved by using social networking techniques. Stylometry is the statistical study of five different types of writing style features: lexical, syntactic, structural, domain-specific, and idiosyncratic (see Section 2.3.1). E-mail social network analysis is complemented by statistical analysis to develop more precise behavioral profiles of a user at the e-mail account level. Stylometric features analysis is applied to learn about a user's writing behavior at the content level. These two types of models, together with machine-learning techniques, are employed to analyze anonymous e-mails in the authorship attribution problem. In forensic investigation, it is imperative to localize individuals and their resources for collecting more concrete evidence. Therefore, we complement our framework with the capability of geographic localization.
A detailed description of how the aforesaid techniques are helpful in the context of e-mail forensic analysis, and how they are incorporated in our framework, is given in the next subsections.

2.1. E-mail statistical analysis
Statistical analysis of e-mail accounts by observing their communication patterns reveals a great deal of information. For instance, to view a brief summary of an e-mail corpus, simple statistics like the number of e-mails per sender, per recipient, per sender domain, per recipient domain, per class, and per cluster¹ are calculated (see Fig. 5). Moreover, computing similar statistics, including e-mailing frequency during different parts of the day, average e-mail size, and average attachment size (if any) of a user, helps reveal some non-trivial information. For instance, an e-mail user may send on average 20–30 e-mails to his/her co-workers during the day, which may drop to 5–10 e-mails at night. Similarly, the average mail size of a user may be 2–5 KB, with usually short attachments, if any. If the same e-mail account suddenly transmits hundreds of large e-mails with heavy attachments to certain unknown recipients, this reveals the possibility of suspicious behavior. This may help investigators to narrow down the investigation scope by shortlisting the suspects. More explicitly, e-mail accounts that show some kind of unusual behavior are selected for further investigation.

¹ Classes and clusters are determined by applying classification and clustering, respectively.

Determining the total number of users (senders/recipients) within an e-mail collection, finding all the recipients of each user, and determining whether an e-mail has been replied to or not, helps during investigation. Statistical distributions can be computed over a certain period of time and for a specific set of e-mails. Additional statistics can be computed dynamically by sending appropriate SQL queries to the database. A more advanced use of statistical distributions can help compute user profiles that can be used for authorship identification (Mendenhall, 1887; Farringdon, 2001). To compute statistics on an e-mail corpus, each e-mail is first loaded from its raw files, and relevant fields, such as the sender, recipient, subject, and message body, are extracted. Extracted information² is stored in database tables.

2.2. E-mail mining
Data mining is a mathematical process designed to explore large amounts of data by capturing consistent patterns and relationships between data objects.
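As a concrete illustration of the SQL-driven statistics described in Section 2.1, the following is a minimal Python sketch using the standard `sqlite3` module. The table name `emails`, its columns, and the sample rows are hypothetical; the paper does not describe its database schema or engine, so this only shows the kind of query involved.

```python
import sqlite3

# Hypothetical schema (an assumption): one row per e-mail.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE emails (sender TEXT, recipient TEXT, size_kb REAL, hour INTEGER)"
)
sample_rows = [
    ("alice@corp.example", "bob@corp.example", 3.0, 10),
    ("alice@corp.example", "carol@corp.example", 4.0, 14),
    ("alice@corp.example", "x@unknown.example", 250.0, 2),
    ("bob@corp.example", "alice@corp.example", 2.0, 11),
]
conn.executemany("INSERT INTO emails VALUES (?, ?, ?, ?)", sample_rows)


def emails_per_sender(conn):
    """Return (sender, e-mail count, average size in KB) for every sender."""
    query = (
        "SELECT sender, COUNT(*), AVG(size_kb) "
        "FROM emails GROUP BY sender ORDER BY sender"
    )
    return conn.execute(query).fetchall()


def emails_per_recipient_domain(conn):
    """Return (recipient domain, e-mail count), taking the text after '@'."""
    query = """
        SELECT substr(recipient, instr(recipient, '@') + 1) AS domain, COUNT(*)
        FROM emails GROUP BY domain ORDER BY domain
    """
    return conn.execute(query).fetchall()
```

An unusually large average size for one sender, as for the third sample row here, is the kind of signal the text suggests could flag an account for closer inspection.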
By employing mathematical models, the knowledge acquired from interesting patterns is applied to make predictions about unseen data. The application of data mining techniques to e-mail datasets has been very successful in cyber crime investigation. Several studies (Abbasi and Chen, 2008; de Vel, 2000) signify the importance of e-mail mining for resolving issues of identity theft and plagiarism in e-mail forensic investigation. In our framework, we have used classification to identify the topic and/or the author of e-mails. Clustering, on the other hand, is used to group e-mails on the basis of contents and stylometric features.

2.2.1. E-mail classification
In general, the process of classification starts with data cleaning, followed by feature extraction. The extracted feature vectors are divided into two groups, the training and testing sets. Each instance of the training data has a definite category, called the class label. The training set is given as input to a classification function (classifier) to develop a model. Common classifiers include decision trees (Quinlan, 1986), neural networks (Lippmann, 1987), and Support Vector Machines (SVM) (Joachims, 1998). The developed model is tested with the testing set by assuming that the class labels are not known. The validated model is then employed for classification of unseen data. Usually, the larger the training set, the better the accuracy of the model.

² In the current version of our framework, we do not handle e-mail attachments.

In the context of e-mail classification, the body and subject of an e-mail are converted to a vector of metrics called features. The feature set that we used in our experiments is described in Section 2.3.2.1. Usually, each e-mail (subject and body) is converted into a stream of characters. Using the Java tokenizer API, each character stream is converted into distinct tokens or words.
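The tokenization step can be sketched as follows. This is a Python stand-in written for illustration, not the Java tokenizer the framework uses; the regular expression and the decision to lowercase and to keep punctuation marks as separate tokens are assumptions.

```python
import re

# A word (optionally with internal apostrophes or hyphens) or one punctuation mark.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*|[^\sA-Za-z0-9]")


def tokenize(subject, body):
    """Join subject and body into one character stream and split it into tokens."""
    text = f"{subject} {body}".lower()
    return TOKEN_PATTERN.findall(text)
```

Punctuation is kept as tokens here so that later steps can either drop it (topic-based analysis) or count it (author-based analysis).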
Some of the words may appear in different forms (for instance, verb, noun, and adjective), or different tenses (such as present, past, and future). Such words are stemmed to their common root. For instance, finance, financial, and financing may be converted to finance. Porter2 is a common stemming algorithm used by the data mining community. Syntactic features, also known as style markers (punctuation and all-purpose short words called stop-words), are treated differently in different data mining applications. For example, they are dropped in topic-based classification and kept in author-based classification due to their significant discriminating capabilities in identifying authors based on their writings. In our experiments, we used more than 300 function words (as listed in Zheng et al., 2006). Certain word sequences, like United States of America and United Arab Emirates, often appear together, which may increase the feature dimensionality. Therefore, we developed a module to automatically scan those sequences and treat them as single tokens.

Using the vector space model representation, each e-mail $E_i$ is converted into an n-dimensional vector of features $E_i = \{f_1, \ldots, f_n\}$. Once all the e-mails are converted into feature vectors, normalization is applied to the columns as needed. The purpose of normalization is to limit all the values of a certain feature to a specific range and avoid overweighting some attributes over others. The selected columns are scanned for the maximum value, then all the cells are divided by that number.

In our framework, we apply classification for two purposes: first, to classify new e-mails on the basis of topic, and second, to identify the true author of an anonymous e-mail. A detailed description of the two types of application is given below.

2.2.2. Topic-based classification
Most spam filtering and scanning techniques are based on topic or content-based classification.
Analogously, in forensic analysis, e-mails are classified as malicious if their contents are matched to a particular cyber criminal taxonomy. In contrast to traditional keyword searching, which is inefficient and error prone, classification techniques are more precise and robust to noise and dimensionality. For instance, to identify e-mails (usually from a huge e-mail collection) that promote drug trafficking, one can perform a simple search with the word drug or other related keywords. However, the criminal community often uses special expressions and encrypted messages to communicate covertly with each other. Most of the culprits use different names and speech artifacts to hide information. Classifiers, on the other hand, are not limited to a few keywords and instead are trained on multidimensional data, and thus do not suffer from information hiding.
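The column normalization described in Section 2.2.1 (scan each selected column for its maximum, then divide every cell by it) can be sketched as below. This is a minimal Python illustration that assumes non-negative feature values such as counts and normalizes every column; the function name is hypothetical.

```python
def max_normalize(vectors):
    """Divide each column by its maximum so that all values fall in [0, 1].

    Assumes non-negative values; an all-zero column is left as zeros.
    """
    if not vectors:
        return []
    n_features = len(vectors[0])
    col_max = [max(row[j] for row in vectors) for j in range(n_features)]
    return [
        [row[j] / col_max[j] if col_max[j] else 0.0 for j in range(n_features)]
        for row in vectors
    ]
```

For example, the two vectors `[2, 0, 5]` and `[4, 0, 10]` become `[0.5, 0.0, 0.5]` and `[1.0, 0.0, 1.0]`.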

In our framework, topic classification is achieved using a classical text mining approach (Forsyth and Holmes, 1996). After the pre-treatment phase (discussed in Section 2.2.1), the given set of e-mails is divided into a training set (two-thirds of the e-mails) and a testing set (one-third of the e-mails). The class attribute may take either two values or multiple values, depending on the number of target groups/categories. The investigator, for instance, may want to classify an e-mail as malicious or non-malicious (normal), or he may wish to classify e-mails into more than two categories, such as pornography, spamming, and terrorism. The class label in this case is the crime type/group.

It is worth mentioning that in topic-based classification, the context-independent words, called stop words (function words and punctuation), are removed and only the content-specific features are retained. The frequency of each token is calculated. The resultant frequencies are normalized to a value between 0 and 1. As a result, each e-mail $E_i$ is represented as $\{f_1, \ldots, f_n\}$, where each feature $f_i$ is the normalized frequency of a word $w_i$. The next step is to apply a classification model to the set of feature vectors. For this purpose, we use a data mining software package called Weka.³ The feature vectors are converted into the Weka-compatible Attribute-Relation File Format (ARFF).

To evaluate our implementation, we performed experiments on the Enron e-mail corpus made available by MIT at http://www-2.cs.cmu.edu/~enron/. We considered a subset containing around 300 e-mail messages classified manually into two classes: those dealing with company business (official) and those that were personal. Each class contained 150 e-mails. A training set was constructed by randomly selecting 100 e-mails from each class, while the remaining 50 e-mails of each class were used for testing.
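The per-class random split used in this experiment can be sketched as follows. This is a minimal Python illustration; the function name, the seed parameter, and the placeholder message identifiers are assumptions, and the actual Enron messages are not reproduced.

```python
import random


def stratified_split(emails_by_class, n_train, seed):
    """Randomly pick n_train e-mails per class for training; the rest go to testing."""
    rng = random.Random(seed)  # seeded so that a given split can be reproduced
    train, test = [], []
    for label in sorted(emails_by_class):
        shuffled = list(emails_by_class[label])
        rng.shuffle(shuffled)
        train += [(email, label) for email in shuffled[:n_train]]
        test += [(email, label) for email in shuffled[n_train:]]
    return train, test


# Placeholder corpus: 150 message ids per class, mirroring the class sizes above.
corpus = {
    "official": [f"official-{i}" for i in range(150)],
    "personal": [f"personal-{i}" for i in range(150)],
}
train, test = stratified_split(corpus, n_train=100, seed=0)
```

Calling the function with different seeds yields different training and testing sets drawn from the same 300 messages.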
The same process was repeated 10 times to construct 10 different training and testing sets. We employed 3–5 different classifiers. The precision of the classifiers varied between 76% and 89%, with an average of about 81%. The classifier precision, computed as the percentage of true positives (e-mails correctly classified), is used to measure the model's accuracy.

³ Weka is available at http://www.cs.waikato.ac.nz/ml/weka.

2.2.3. Author-based classification
The second application of classification in our framework is to identify the author of an anonymous e-mail. The class label used for this purpose is the author or sender of an e-mail. This section is given here for the purpose of completeness, while the detailed description of authorship identification is given in Section 2.3.

2.2.4. E-mail clustering
Clustering is the process of grouping data into semantically similar sets to achieve simplification by modeling data by its clusters (Gunopulos et al., 1998). In the case of e-mail mining, we used clustering to group e-mails on the basis of discussion topic and authorship.

To cluster e-mails by discussion topic, e-mails are processed for feature extraction in the same way as discussed in Section 2.2.2. The only difference is that instead of computing the frequency of a word in each document, we compute the perceived importance of a word in all the documents. For this purpose, we employ the commonly used tf_idf function (Joachims, 1998):

$$\mathit{tf\_idf}_{j,i} = \mathit{tf}_{j,i} \cdot \mathit{idf}_{j}$$

where $\mathit{tf\_idf}_{j,i}$ is the perceived importance of a word $w_j$ in e-mail $E_i$, $\mathit{tf}_{j,i}$ is the frequency of word $w_j$ in e-mail $E_i$, and $\mathit{idf}_{j} = \log(N / \mathit{df}_j)$ is the inverse document frequency, with $N$ the total number of e-mails and $\mathit{df}_j$ the number of e-mails in which the word $w_j$ appears. We have used the three most commonly used clustering algorithms: Expectation Maximization (EM), K-Means, and bisecting K-Means.
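The tf_idf weighting defined above can be sketched in a few lines of Python. The text does not state the base of the logarithm, so the natural logarithm used here is an assumption, as are the function name and the representation of each e-mail as a list of tokens.

```python
import math
from collections import Counter


def tf_idf(documents):
    """documents: one token list per e-mail. Returns one {word: weight} dict per e-mail."""
    n_docs = len(documents)
    df = Counter()  # df[w] = number of e-mails in which word w appears
    for tokens in documents:
        df.update(set(tokens))
    weights = []
    for tokens in documents:
        tf = Counter(tokens)  # tf[w] = frequency of word w in this e-mail
        weights.append({w: tf[w] * math.log(n_docs / df[w]) for w in tf})
    return weights
```

A word that appears in every e-mail gets weight zero, since log(N/N) = 0, which is why the measure favors words that discriminate between e-mails.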
Once the clusters are obtained, each cluster is tagged with the most and the least frequent words/phrases found in it. Tagging clusters with the least frequent words helps in finding inter-cluster relationships. In addition to identifying the subject matter of a group of e-mails, clustering can also be employed to speed up query-based keyword searching. Instead of scanning each e-mail for a keyword, all the e-mails are first clustered and then each cluster is tagged with its most frequent words, which are then matched with the keyword in question. The matched clusters are retrieved in order of relevance to the search criterion (query contents).

Another application of clustering is to identify the most plausible author of an anonymous e-mail. In this case, the stylometric features are not discarded but are used to differentiate between the writings of different suspects. The rest of the preprocessing is analogous to that discussed in Section 2.2.1. Clustering is applied to anonymous e-mails, as well as e-mails with known authors. Resulting clusters are tagged with their most frequent senders. Since clustering is performed on the basis of writing style features, the e-mails within a cluster would ideally belong to one particular individual. If the anonymous e-mail appears in a cluster where a specific sender is the most frequent, that sender is declared to be the most probable author of the disputed anonymous e-mail, because that sender has the most e-mails similar to the disputed e-mail.

2.3. E-mail authorship attribution
Anonymity in e-mail communication is one of the main issues exploited by terrorists, pedophiles, and scammers. Falsifying the sender name, address, and the path along which an e-mail travels is generally termed spoofing and forging, which can be done even by a novice user.
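The cluster-based attribution rule of Section 2.2.4 (the most frequent sender in the anonymous e-mail's cluster is declared the most probable author) can be sketched as follows. This is a minimal Python illustration that assumes cluster assignments have already been computed by some clustering algorithm; the function and parameter names are hypothetical.

```python
from collections import Counter


def most_probable_author(cluster_ids, senders, anonymous_index):
    """cluster_ids[i] is the cluster of e-mail i; senders[i] is None when unknown.

    Returns the most frequent known sender in the anonymous e-mail's cluster,
    or None if that cluster holds no e-mail with a known sender.
    """
    target = cluster_ids[anonymous_index]
    known = [
        sender
        for cluster, sender in zip(cluster_ids, senders)
        if cluster == target and sender is not None
    ]
    if not known:
        return None
    return Counter(known).most_common(1)[0][0]
```

With clusters `[0, 0, 0, 1, 1, 0]` and senders `["S1", "S1", "S2", "S2", "S2", None]`, the anonymous e-mail at index 5 shares cluster 0 with two e-mails from S1 and one from S2, so S1 is returned.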
In this context, forensic analysis of e-mails, with special focus on authorship attribution, can help prosecute those who misuse e-mail (Teng et al., 2004).

Traditionally, fingerprints are used to uniquely identify individuals during criminal investigations within courts of law. Analogously, word-prints or write-prints constituted by the writing style features of an author can be used to discriminate his/her writings from those of others. The goal is to determine the likelihood that a specific individual is the author of an anonymous e-mail by examining his/her previously written e-mails.

The problem of authorship identification in the context of e-mail forensics is distinct from traditional authorship problems in two ways. First, by assumption, the true author is one of the suspects. Second, e-mails, though short, usually contain rich information, as an e-mail normally consists of a header, subject, body, and attachments.

More formally, a cyber forensic investigator attempts to determine the author of a disputed anonymous e-mail $a$, who has to be one of the suspects $\{S_1, \ldots, S_n\}$. The main issue here is to precisely identify the most plausible author from the suspects $\{S_1, \ldots, S_n\}$ and present the findings in a court of law.

In the current literature (Teng et al., 2004; Zheng et al., 2006), authorship identification is considered a text classification problem. The process starts by extracting the writing style features from the previously known e-mails of a person. Using these features, a classifier is trained; then, the developed model is applied to the anonymous e-mail to identify its conceivable author. The authorship attribution technique has been successful in resolving ownership disputes over literary and historic documents. However, due to the special characteristics of an e-mail dataset, its application to e-mail is more challenging. The commonly used features in the field of e-mail authorship analysis (Corney et al., 2002; Zheng et al., 2006) are lexical, syntactical, structural, content-specific, and idiosyncratic features (see Section 2.3.1).

In most of the previous studies, stylometric features are extracted from the entire ensemble of a suspect's e-mails, disregarding the topic, time, and recipients of the e-mails. However, the writing style of an individual varies from recipient to recipient and evolves with time and context (de Vel et al., 2001). This change may occur in both the contents and the style markers. For instance, e-mails that a person writes to his/her job colleagues are more formal than those he/she writes to family members and friends.
Coworkers at a financial company may talk more about meetings, promotion schemes, customer problems and solutions, salaries and bonuses, etc. E-mails exchanged among friends may discuss trips, visits, funny stories, and jokes. The writing style features, including the selection and distribution of function words and punctuation, may be different in different contexts. Moreover, a person may be more formal and careful in using structural features, such as greeting and farewell comments, in e-mails written to his/her boss. One may prefer to put complete signatures, including designation and contact information, in job communications. More importantly, malicious e-mails are mostly anonymous and will be devoid of such traceable information.

Though some research work (de Vel, 2000; de Vel et al., 2001) recognizes that some style variations exist with respect to different recipients, most studies choose to ignore such variations and focus on obtaining the so-called permanent writing traits of a suspect. However, with this approach, the contents and writing styles found in malicious e-mails may be overshadowed by regular e-mails because malicious e-mails are usually much fewer in number than regular e-mails. As a result, the classifier built from all the e-mails would capture the writing styles of the regular e-mails but may not be able to capture all the variations in the writing style of the same suspect. The classifier may be very accurate for classifying regular e-mails, but fail to accurately classify malicious e-mails, which, ironically, is the objective of building the classifier.

There is a need to investigate the impact of a suspect's style variation on authorship attribution. In this study, we are proposing a novel approach of mining style variations to precisely extract the more representative writing style features of a suspect (see Section 2.3.2 for details).
The major advantages of our proposed approach are:
Model representativeness: the different writing styles of a suspect are captured separately without intermingling them. The developed classification model is a reasonably true representative.
Increased accuracy: the developed model will be able to precisely match the disputed e-mail with the malicious behavior (as learnt from the malicious e-mails) of a potential suspect.
Generic application: our experimental results indicate that the proposed approach can be applied to increase the accuracy of authorship identification when the dataset contains e-mails written on diversified topics. It can be a first step in solving the authorship identification problem in a more generic and natural way.

2.3.1. Stylometric features
Writing styles are defined in terms of stylometric features. Writing patterns are usually the characteristics of word usage, word sequence, composition and layout, common spelling and grammatical mistakes, vocabulary richness, hyphenation, and punctuation. However, there is no feature set that is optimized and equally applicable in all domains. The commonly used features that are found in various authorship analysis studies (Baayen et al., 1996; Iqbal et al., 2008; Zheng et al., 2003) contain lexical, syntactical, structural, and content-specific attributes. Recently, Abbasi and Chen (2008) presented a more comprehensive list of stylistic features by including idiosyncratic characteristics of writing styles. A brief description of the relative discriminating capability of each of these features is given below.

Token-based Features are collected either in terms of characters or words. In terms of characters, for instance, frequency of letters, frequency of capital letters, total number of characters per token, and character count per sentence are the most relevant metrics.
These indicate the preference of an individual for certain special characters or symbols or the preferred representation of certain units. Word-based lexical features may include word length distribution, words per sentence, and vocabulary richness.
Syntactic Features: Baayen et al. (1996) were the first to discover that punctuation and function words are context-independent and thus can be applied to identify writers based on their written works.
Structural Features are used to measure the overall appearance and layout of a document. For instance, average paragraph length, number of paragraphs per document, and presence of greetings and their position within an e-mail are common structural features.
Content-specific Features are collections of certain keywords commonly found in a specific domain and may vary from context to context even for the same author. Zheng et al. (2003, 2006) used around 11 keywords (such as "obo" and "sexy") from the cyber crime taxonomy in authorship analysis experimentation.
Idiosyncratic Features: common spelling mistakes, such as writing "f" instead of "ph" (say, in "phishing"), and grammatical mistakes, such as sentences containing an incorrect form of a verb. The list of such characteristics varies from person to person and is difficult to control.

2.3.2. Proposed attribution approach
An investigator is provided with e-mails previously written by potential suspects. The available e-mails could be in different formats, written in different languages, and may contain images, video clips, and/or HTML/XML tags. Our framework supports most of the common e-mail formats. Presently, we are considering e-mails that are written in English only. In other words, we extract the textual part of the e-mail body, written in English, and drop all other parts of an e-mail message.

The proposed approach consists of two major steps: e-mail grouping or categorization, followed by classification. As shown in Fig. 1, first the entire e-mail collection $E_i$ of a suspect $S_i$, where $S_i \in \{S_1, \ldots, S_n\}$, is divided into distinct groups $\{S_iG_1, \ldots, S_iG_k\}$. We have used both header information and the e-mail body for grouping e-mails. For instance, grouping is performed on the basis of e-mail recipient, e-mail sender, e-mail time stamp, and combinations of them. In the case of the e-mail body, the well-known data mining technique called clustering is applied to detect similarity among e-mails based on contents. Clustering is performed on contents and stylometric features. Next, using sender-recipient, sender-time stamp, and cluster tag as class labels, a classifier is built, as depicted in Fig. 1.

[Fig. 1: Mining style variation of $S_i$. Stages shown: e-mails $E_1, \ldots, E_n$ of suspects $S_1, \ldots, S_n$; grouping/clustering into groups $S_iG_j$; features extraction into n-dimensional feature vectors with class labels (each class label represents a distinct writing style); training set and testing set; classification model generation and validation; matching of the anonymous e-mail with $S_jG_i$; conceivable author.]
The classifier thus built captures the isolated and distinct styles without being misled by the overlapping behavior of an author. The anonymous e-mail $a$ is parsed and its features are extracted. The extracted features are then applied to the developed classification model to identify the true author of $a$. In this way, the number of matching paths within the classifier increases, which raises the chance that the anonymous e-mail is attributed to its true author. Before describing our proposed approach in detail, we explain the stylometric features used in our experiments.

2.3.2.1. Stylometric features used. There are more than 1000 stylometric features (Abbasi and Chen, 2008) in common use. In our experiments, we used around 400 features, including lexical, syntactic, and structural features. Most of these features, including 303 function words, are listed and explained by Zheng et al. (2006) and de Vel et al. (2001).

2.3.3. Categorization phase: mining class labels

Grouping the e-mails of a suspect is done on the basis of the e-mail body as well as the e-mail header information. To perform the first type of grouping (body-based), we employ clustering, on the basis of either contents or writing style features. The second type of grouping (header-based) is straightforward and uses e-mail sender, e-mail recipient, and e-mail time stamp. At the end of the grouping phase, each e-mail of a group is tagged with the respective group label. These labels are later used as class labels during classification.

2.3.3.1. Categorization based on e-mail body. In this section we study how to capture style variations by applying clustering. There are two types of clustering: content-based and stylometry-based. Content-based clustering is used to determine the topic of discussion within a dataset (Li et al., 2006). Stylometry-based clustering, on the other hand, is used to identify the different writing styles contained within a data collection (Baayen et al., 1996).
The process of applying clustering is the same in both cases; the only difference lies in the preprocessing step. In content-based clustering, the usual preprocessing is performed: once each e-mail is converted into a bag-of-words, the style markers (function words and punctuation) are dropped, and the remaining tokens are processed as described in Section 2.2.2. In stylometry-based clustering, by contrast, the style markers are retained as syntactic features; the rest of the pre-treatment is analogous to that described in Section 2.2.2.

Once all the e-mails of each author are converted into feature vectors, clustering is applied. As discussed in Section 2.2.4, we use three common clustering algorithms: Expectation Maximization (EM), K-Means, and bisecting K-Means. Clustering is applied to the e-mails of each author independently. The resultant clusters of an author, for example $S_1$, are labeled $\{S_1C_1, S_1C_2, \ldots, S_1C_k\}$. Similarly, the e-mails of another author, $S_2$, are clustered separately, and the resultant clusters are labeled $\{S_2C_1, S_2C_2, \ldots, S_2C_k\}$. The cluster labels are used as class labels during the classification phase.
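The per-author clustering and labeling step can be sketched as follows. This is a minimal illustration in Python, not the framework's own implementation: the plain K-Means routine, its deterministic initialization from the first k vectors, the function names `kmeans` and `label_by_author`, and the toy two-dimensional feature vectors are all assumptions made for the example.

```python
import math
from collections import defaultdict

def kmeans(vectors, k, iterations=20):
    """Plain K-Means; returns one cluster index per input vector."""
    k = min(k, len(vectors))  # cannot have more clusters than points
    # Assumed deterministic start: the first k vectors are the centroids.
    centroids = [list(v) for v in vectors[:k]]
    assignment = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: nearest centroid by Euclidean distance.
        assignment = [
            min(range(k), key=lambda c: math.dist(v, centroids[c]))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment

def label_by_author(emails, k):
    """emails: list of (author, feature_vector); returns labels like 'S1C2'."""
    by_author = defaultdict(list)
    for idx, (author, vec) in enumerate(emails):
        by_author[author].append((idx, vec))
    labels = [None] * len(emails)
    # Each author's e-mails are clustered independently of the others.
    for author, items in by_author.items():
        clusters = kmeans([vec for _, vec in items], k)
        for (idx, _), c in zip(items, clusters):
            labels[idx] = f"{author}C{c + 1}"
    return labels

# Hypothetical feature vectors for two suspects, S1 and S2.
emails = [
    ("S1", [0.10, 0.90]), ("S1", [0.90, 0.10]), ("S1", [0.12, 0.88]),
    ("S2", [0.50, 0.50]), ("S2", [0.95, 0.05]), ("S2", [0.52, 0.47]),
]
labels = label_by_author(emails, k=2)
```

Because clustering runs separately for each author, the cluster index only has meaning together with the author prefix, which is why the two are combined into a single class label.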

2.3.3.2. Categorization based on e-mail header. In the traditional classification approach to authorship identification, the e-mail sender is used as the class label. In our study, however, we divide the e-mails of the same author into different groups. This division is based on e-mail recipient and e-mail time stamp, and is intended to differentiate the writing styles of the same user. The intuition behind using the time stamp for grouping is that, according to some researchers such as Stolfo et al. (2006), people behave differently at different times of day. People usually communicate with different categories of people at different times. For instance, most of the e-mails that a person writes during the daytime are exchanged with co-workers, whereas e-mails written in the evening may be exchanged with family members and friends. Likewise, very few of the e-mails exchanged at midnight are likely to be written to one's colleagues. For simplicity, we divide the 24-hour day into three time brackets: morning, evening, and night. The e-mails of a sender are therefore divided into three categories: e-mails sent in the morning are tagged SM, e-mails sent in the evening are tagged SE, and those sent at night are tagged SN, where S represents the sender. SM, SE, and SN are used as class labels during the classification phase.

2.3.4. Classification phase

Once the e-mails of all senders are divided into distinct groups, and the class label of each group is thus determined, the next phase is classification. This phase consists of features extraction, model generation, and model application (see Fig. 1). A brief description of each of these steps is given below.

2.3.4.1. Features extraction. Each e-mail body is converted into an n-dimensional vector of features. A feature could be a word frequency, a ratio of two quantities, or a boolean value.
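To make the three kinds of feature concrete, the following minimal Python sketch converts an e-mail body into a small feature vector. The five function words, the greeting list, the digit-ratio feature, and the name `extract_features` are illustrative assumptions; the actual feature set of around 400 features is much larger.

```python
import re

# Tiny illustrative subsets, chosen only for this example.
FUNCTION_WORDS = ("the", "of", "and", "to", "in")
GREETINGS = ("hi", "hello", "dear")

def extract_features(body):
    """Convert an e-mail body into a fixed-length feature vector."""
    tokens = re.findall(r"[a-z']+", body.lower())
    n_tokens = len(tokens) or 1  # avoid division by zero on empty bodies
    n_chars = len(body) or 1

    features = []
    # Word frequencies: relative frequency of each function word.
    for word in FUNCTION_WORDS:
        features.append(tokens.count(word) / n_tokens)
    # Ratios of two quantities: vocabulary richness and digit ratio.
    features.append(len(set(tokens)) / n_tokens)
    features.append(sum(ch.isdigit() for ch in body) / n_chars)
    # Boolean value: does the message open with a greeting?
    features.append(1.0 if tokens and tokens[0] in GREETINGS else 0.0)
    return features

vector = extract_features("Hi John, the report of the meeting is attached.")
```

Every e-mail yields a vector of the same length (here 8), so vectors from different e-mails can be compared directly by a clustering or classification algorithm.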
All the feature types that we used in our framework are described in Section 2.3.2.1. The features extraction process is elaborated in Section 2.2.1.

2.3.4.2. Model generation and validation. Prior to the application of classification algorithms, each e-mail group is divided into training and testing sets (see Section 2.2.1). At the end of the features extraction phase, we have two sets of feature vectors (training and testing) for each suspect. Using the training set, selected classifiers are employed to generate a model. Using the testing set, the generated model is validated prior to its actual use. The validity (effectiveness) of a model is measured by its ability to correctly classify the test e-mails.

2.3.4.3. Author identification. If the estimated error is below an acceptable threshold, the model is employed. The disputed anonymous e-mail is processed and converted into a feature vector in the same manner as the known e-mails. Using the developed classification model, the most plausible class label of the unseen e-mail is identified. The class label indicates the author of that e-mail.

2.3.5. Experimental evaluation

To evaluate our approach, we used e-mails from the Enron e-mail corpus. We considered a selection of 63 e-mails from 3 different senders. For each sender we selected 3 different recipients, with 7 e-mails sent to each of them. We constructed six different groups of training and testing sets. Each group is derived from the e-mail set by randomly selecting two-thirds (2/3) of the e-mails as the training set and taking the remaining e-mails as the testing set (see Section 2.2.2). In our experiments we used two common classifiers, SVM and C4.5 (decision tree); the Weka data mining software package has its own version of C4.5, called J48. To check the effect of class labels on the accuracy of the classifiers, we performed classification experiments with three kinds of class labels: sender, sender-recipient, and sender-cluster.
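The experimental setup can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the e-mails are synthetic stand-ins with the same shape as the selection described above (3 senders, 3 recipients each, 7 e-mails per pair), the cluster tags are made up, and the names `make_label` and `split_two_thirds` are hypothetical.

```python
import random

def make_label(email, scheme):
    """Build the class label of one e-mail under a given labeling scheme."""
    if scheme == "sender":
        return email["sender"]
    if scheme == "sender-recipient":
        return f'{email["sender"]}->{email["recipient"]}'
    if scheme == "sender-cluster":
        return f'{email["sender"]}C{email["cluster"]}'
    raise ValueError(f"unknown scheme: {scheme}")

def split_two_thirds(items, seed):
    """Randomly select 2/3 of the items for training; the rest are for testing."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = (2 * len(shuffled)) // 3
    return shuffled[:cut], shuffled[cut:]

# Synthetic stand-in: 3 senders x 3 recipients x 7 e-mails = 63 e-mails.
# The cluster tag (1 or 2) is invented here; in practice it comes from clustering.
emails = [
    {"sender": f"S{s}", "recipient": f"R{s}{r}", "cluster": (i % 2) + 1}
    for s in range(1, 4) for r in range(1, 4) for i in range(7)
]

# Six independent training/testing selections, one per seed.
selections = [split_two_thirds(emails, seed) for seed in range(6)]

# Number of distinct classes the classifier must separate under each scheme.
label_counts = {
    scheme: len({make_label(e, scheme) for e in emails})
    for scheme in ("sender", "sender-recipient", "sender-cluster")
}
```

With this synthetic data the sender scheme has 3 classes, the sender-recipient scheme 9, and the sender-cluster scheme 6, and each selection holds 42 training and 21 testing e-mails.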
Setting the class label as sender represents the traditional approach, while sender-recipient and sender-cluster represent our proposed technique. We ignored the time stamp because the initial results that we obtained with it were similar to those of the traditional approach. The reason could be that the dataset used was not representative. The same set of experiments was repeated for both classifiers (SVM and C4.5) on all six groups of training and testing sets. Experimental results are depicted in Fig. 2, where panel (a) shows the SVM results and panel (b) shows the C4.5 results.

Fig. 2. Accuracy vs. class labels for (a) SVM and (b) C4.5. [Bar charts: accuracy, from 0 to 1, for selections 1 to 6 and their average; series: Sender, Sender-Recipient, Sender-Cluster.]

Employing the SVM classifier, we obtained an average accuracy of 71% for the classical approach (classification by sender), and 69% and 83% (respectively, for sender-recipient and sender-cluster classes) for the proposed approach. Using C4.5, the results followed a similar trend, with average accuracies of 77%, 73%, and 83%, respectively, for sender, sender-recipient, and sender-cluster based classification. The accuracy obtained for classification by sender-cluster is encouraging: it shows a noticeable gain in accuracy (10% for SVM and 6% for C4.5) compared with the classical approach (classification by sender). This suggests the relevance of considering author style variation in authorship attribution.

On the other hand, the results of sender-recipient based classification show a slight decrease in accuracy (particularly for SVM) compared with the classical approach. This could be explained in two ways. One, considering each sender-recipient pair as a different class creates too many classes, which makes the task difficult for the classifier to handle. Classification studies indicate that SVM is more sensitive than decision trees to the number of classes, an observation supported by our experimental results (Fig. 2). Two, it is not true that the writing style of an author changes for each of his/her correspondents. The number of class labels in the sender-recipient approach can be reduced by merging recipients that belong to the same domain.
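Merging recipients by domain can be sketched as below. The addresses, the arrow-style label format, and the function names are hypothetical and serve only to show how the number of classes shrinks.

```python
def recipient_domain(address):
    """Return the lower-cased domain part of an e-mail address."""
    return address.rsplit("@", 1)[-1].strip().lower()

def sender_domain_label(sender, recipient):
    """Class label that merges all recipients sharing one domain."""
    return f"{sender}->{recipient_domain(recipient)}"

# Hypothetical (sender, recipient) pairs.
pairs = [
    ("S1", "alice@example.com"),
    ("S1", "bob@example.com"),
    ("S1", "carol@example.org"),
    ("S2", "alice@example.com"),
]

# One class per sender-recipient pair versus one class per sender-domain pair.
per_recipient = {f"{s}->{r}" for s, r in pairs}
per_domain = {sender_domain_label(s, r) for s, r in pairs}
```

Here the four sender-recipient classes collapse into three sender-domain classes, because the two `example.com` recipients of S1 now share one label.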
A deeper analysis of these results indicates that accuracy can be improved further, provided the e-mails cover diversified topics and are written to different groups of recipients.

2.4. E-mail social networks

Social network analysis is the study of communication links between people. E-mail social network analysis allows the modeling of e-mail flows and users' activities in order to analyze relationships and detect misuses that manifest as abnormal e-mail behaviors (Bhattacharyya et al., 2002). An explicit form of social network for an e-mail corpus can be depicted as a graph, where nodes are senders and receivers, and edges represent e-mail traffic. Other, less explicit forms of social networks can be inferred from measures such as authorship and content proximity. The structure of a person's social network reveals a great deal of information about his/her behavior and about the people (friends, colleagues, family members, etc.) with whom he/she interacts. For instance, one can learn how often a person maintains distinct relationships with groups of people, and for how long. One can also infer whether these people have close friends or regular interactions, whether these interactions can be distinguished based on roles (such as work, friendship, or family), and what type of views a particular group of people exchange.

During the course of an e-mail investigation, social networks can be used to discover interesting information about potential suspects: for instance, who their collaborators are, which of their e-mails are malicious, or when their periods of suspicious activity occur. Our framework provides information rendering and exploration capabilities for e-mail social networks. Social networks are labeled with simple statistics computed about e-mail users, e-mail domains, and e-mail flows. We use three types of graphs to depict social networks.
The first type, the e-mail temporal model, is the user network augmented with time information about e-mails, laid out to show how e-mail flow evolves over time, as shown in Fig. 3 (left side view). From this network it is easy to identify causality effects between e-mails, such as a situation in which an e-mail is received by a user, who in turn sends another e-mail at a later time. If both e-mails are classified as discussing the same topic (drugs, for instance), then by following the chain of these e-mails one can identify potential collaborators. In the second graph, called the user network, nodes represent e-mail users and edges represent e-mail traffic (see Fig. 6). The flow of e-mails between e-mail accounts can be filtered according to the different classes and clusters of e-mails computed during the last classification and clustering, and within a specific period of time. In the third graph, the domain network, nodes are e-mail domains (servers) and edges represent e-mail traffic. The flow between domains can also be filtered according to the different classes and clusters, and within a specific period of time.

Fig. 3. E-mail social network: temporal model (left), spring-mass model (right).

Fig. 4. Details editor.

E-mail social networks are visualized using several techniques. One of the most interesting is based on a spring-mass model, where nodes are considered as small masses with positive charges and links as springs connecting them. Since nodes have positive charges, they tend to push each other apart (gravitational forces are neglected), but those connected by springs tend to stay agglomerated. By adjusting the strength of the springs according to the intensity of e-mail flow, we can exhibit interesting structural patterns and community structures in a social network, as shown in Fig. 3 (right side view). Nodes are laid out in an iterative process in which the force on each node is computed as the result of the repulsion of all other nodes, the friction of the environment, and the action of all springs connected to it.
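One iteration of such a layout might look like the following Python sketch. It is not the framework's implementation: the inverse-square repulsion law, Hooke's law for the springs, friction proportional to velocity, the explicit velocity state, the constant-acceleration position update, and all parameter values and names (`layout_step`, `charge`, `rest_length`, and so on) are assumptions chosen for illustration.

```python
import math

def add_force(force, node, fx, fy):
    """Accumulate a 2-D force contribution on one node."""
    force[node][0] += fx
    force[node][1] += fy

def layout_step(pos, vel, springs, mass=1.0, dt=0.05,
                charge=1.0, friction=0.5, rest_length=1.0):
    """Advance positions and velocities by one time step, in place.

    pos, vel: dict node -> [x, y]; springs: dict (u, v) -> stiffness,
    where stiffness would grow with the intensity of e-mail flow.
    """
    nodes = list(pos)
    force = {n: [0.0, 0.0] for n in nodes}
    # Repulsion between every pair of nodes (assumed inverse-square law).
    for i, a in enumerate(nodes):
        for b in nodes[i + 1:]:
            dx, dy = pos[a][0] - pos[b][0], pos[a][1] - pos[b][1]
            d = math.hypot(dx, dy) or 1e-9
            f = charge / (d * d)
            add_force(force, a, f * dx / d, f * dy / d)
            add_force(force, b, -f * dx / d, -f * dy / d)
    # Springs pull connected nodes toward the rest length (Hooke's law).
    for (a, b), stiffness in springs.items():
        dx, dy = pos[b][0] - pos[a][0], pos[b][1] - pos[a][1]
        d = math.hypot(dx, dy) or 1e-9
        f = stiffness * (d - rest_length)
        add_force(force, a, f * dx / d, f * dy / d)
        add_force(force, b, -f * dx / d, -f * dy / d)
    for n in nodes:
        for axis in (0, 1):
            # Friction opposes the current velocity.
            acc = (force[n][axis] - friction * vel[n][axis]) / mass
            # Constant-acceleration update: x + v*T + (1/2)*(F/m)*T^2.
            pos[n][axis] += vel[n][axis] * dt + 0.5 * acc * dt * dt
            vel[n][axis] += acc * dt

def distance(pos, a, b):
    """Euclidean distance between two nodes."""
    return math.hypot(pos[a][0] - pos[b][0], pos[a][1] - pos[b][1])

# Two nodes joined by a stretched spring move closer together.
linked = {"a": [0.0, 0.0], "b": [3.0, 0.0]}
linked_vel = {n: [0.0, 0.0] for n in linked}
layout_step(linked, linked_vel, {("a", "b"): 1.0})

# Two unconnected nodes are pushed apart by repulsion alone.
free = {"a": [0.0, 0.0], "c": [1.0, 0.0]}
free_vel = {n: [0.0, 0.0] for n in free}
layout_step(free, free_vel, {})
```

Repeating `layout_step` until the positions stop changing noticeably gives the final layout; the friction term is what lets the system settle instead of oscillating indefinitely.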
The position of each node is recomputed according to the force applied to it and a pre-established time step, using a standard Newtonian equation of motion. For node $n_i$, let $F_i$ be the total force acting on $n_i$, $m_i$ the mass of $n_i$, $T$ the time step, $x_i^0$ the previous position of $n_i$, $x_i$ the current position of $n_i$, and $v_i = (x_i - x_i^0)$ the current speed of $n_i$. The new position of $n_i$ is then

$$x_T = x_i + v_i T + \frac{1}{2}\,\frac{F_i}{m_i}\,T^2.$$

Another interesting layout capability is the combination of social network rendering with the localization capability described in Section 2.5. Social networks are rendered directly on true maps to illustrate a geographical dimension. If the address of a user or a server is known, its corresponding node is displayed at its geographic location. Some statistical information computed on a social network is rendered graphically using graphical features of nodes and links: size, shape, color, thickness, etc. For instance, user and domain centrality values are shown by node sizes: the higher the centrality of a node, the bigger its size. The intensity of e-mail flow is reflected by the thickness of a link. User and e-mail classes or clusters are shown by node and link colors and shapes. Nodes associated with users can be replaced with their photos to provide a more intuitive representation.

Fig. 5. Statistics viewer.

To identify community structures in a social network, we use Newman's approach (Newman, 2003), which uses a metric $Q$ to evaluate the community structure within a social network. $Q$ evaluates the difference between the fraction of links that fall within communities and the expected value of the same quantity if the links fell at random, with no regard to the community structure. Therefore, the value of $Q$ approaches zero for a network having no structure, while it takes a higher value with increasing community structure. If $e_{i,j}$ is the fraction of links in the network between communities $c_i$ and $c_j$, and $a_i = \sum_j e_{i,j}$, then

$$Q = \sum_i \left( e_{i,i} - a_i^2 \right).$$

The algorithm is an iterative process in which $Q$ is optimized using a hierarchical agglomerative clustering approach, starting from the initial configuration in which each node is a community by itself. Communities are greedily merged so as to achieve the highest increase or the smallest decrease in $Q$.

2.5. E-mail geographic localization

In most investigations, localization of resources and individuals is imperative. An investigator needs to understand the geographical scope of his/her investigation. This helps him/her to correlate facts, identify potential suspects, and target locations for collecting clues and evidence. We add a geographic visualization capability, called the interactive map viewer, to our framework to view and explore geographic sites of relevance in an e-mail forensic investigation. This capability can also be used to localize information related to potential suspects, e-mail servers, and e-mail flow. An e-mail is rendered on the map as an arrow between the geographic locations of the sender and receiver e-mail accounts.
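Returning briefly to the metric $Q$ of Section 2.4, a small worked computation may help to fix ideas. The Python sketch below evaluates $Q$ for a given partition of an undirected, unweighted graph; it does not perform the greedy merging itself. The convention that a link between two communities contributes half of its weight to each side, the toy graph, and the function name `modularity` are assumptions of this example.

```python
def modularity(edges, community):
    """Q = sum_i (e_ii - a_i^2) for an undirected, unweighted graph.

    edges: list of (u, v) pairs; community: dict node -> community id.
    """
    m = len(edges)
    e_within = {}  # e_ii: fraction of links inside community i
    a = {}         # a_i: fraction of link ends attached to community i
    for u, v in edges:
        cu, cv = community[u], community[v]
        if cu == cv:
            e_within[cu] = e_within.get(cu, 0.0) + 1.0 / m
        # Each link end contributes half of the link to its community.
        a[cu] = a.get(cu, 0.0) + 0.5 / m
        a[cv] = a.get(cv, 0.0) + 0.5 / m
    return sum(e_within.get(c, 0.0) - a[c] ** 2 for c in a)

# Toy network: two triangles of users joined by a single link.
edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]

# Partition into the two triangles versus one single community.
q_split = modularity(edges, {1: "A", 2: "A", 3: "A", 4: "B", 5: "B", 6: "B"})
q_single = modularity(edges, {n: "A" for n in range(1, 7)})
```

For the two-triangle partition, each community has $e_{i,i} = 3/7$ and $a_i = 1/2$, so $Q = 2(3/7 - 1/4) = 5/14$, whereas placing all nodes in one community gives $Q = 0$, in line with the statement that $Q$ approaches zero when no structure is captured.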
If the physical addresses of the sender and receiver are known, an arrow between these two locations is drawn as well. Other information, such as the e-mail flow between e-mail users, can be rendered directly on the map by labeling the arrows that connect them. In this way, social networks are rendered directly on the map viewer.

Although it is not difficult to forge an e-mail header to hide an author's identity, not all e-mail users have the required skills to do so, or even think about doing so. An easier alternative is to acquire an e-mail account on a public e-mail server under a fake identity. To detect this kind of forgery, the localization of an e-mail server during an investigation can have a great impact on the results. For instance, localizing an e-mail server that hosts a suspicious account can trigger the decision to confiscate the account in order to collect further clues and evidence. If, after an examination of the contents of the confiscated account using the authorship identification capabilities, the identity of the account holder is not compatible with the suspect, this could suggest that the suspect is masquerading as a different user. If the server does not exist or does not host the user account, this would suggest that the user has forged the e-mail address. In both cases, authorship analysis could help in identifying the true identity of the user.

Fig. 6. Social networks viewer.

3. Our framework (IEFAF)

IEFAF is an integrated analysis platform in which a security analyst can perform a variety of tasks related to e-mail analysis. IEFAF is programmed in Java using several Java technologies, such as Java Swing, the JavaMail API, and JDBC. Swing is used to build the graphical interface and to render information in different visual formats (tree, list, picture, etc.). The JavaMail API is used to parse e-mails in several file formats and extract relevant information. JDBC allows us to connect to and navigate a JDBC-compliant database system and to store information. IEFAF is composed of five sub-modules that can be used separately or jointly to build and explore decision support models. These modules are:

- Inter-database browser
- Statistics explorer
- Data mining explorer
- Weka submodule
- E-mail explorer

We start by giving a brief description of the first four modules, followed by a detailed description of the e-mail explorer and the functionalities it provides to an e-mail investigator.

3.1. Inter-database browser

As its name suggests, the inter-database browser allows a user to browse several JDBC-compliant databases (Oracle, Sybase SQL server, etc.) through a single interface. The drill-down capability to navigate through different data tables and views is implemented using tree structures. The inter-database browser extracts relationship information from the metadata of a database, which is then used to automatically construct the associated physical entity relationship diagram. To allow the navigation to span several databases, the user can manually supply relationships between tables and views from different databases.
The inter-database browser uses these relationships to create connections between the entity relationship diagrams of connected databases and to display them as if they belonged to a single database. Some of the functionalities implemented in the database browser include:

- Dynamic creation of connections to JDBC-compliant databases.
- Data exploration in different tables and views.
- Creation of relationships between different databases to span across them using the drill-down capability.
- Ability to issue and persist SQL statements.
- Preparation of datasets to create data mining models using the Weka tool.

Fig. 7. Data mining viewer.

3.1.1. Creating ARFF files

The native data storage format of Weka is the Attribute-Relation File Format (ARFF). An ARFF file is simply a set of records, similar to a table in a database. Attribute names and types are specified at the top of the file, followed by the list of record instances. The attribute values of each record are separated by commas. In our framework, the data obtained from any internal or external source can be automatically converted into an ARFF file. The framework also provides the option of converting the results of an SQL query into an ARFF file.

3.2. Statistics explorer

The statistics explorer (depicted in Fig. 5) allows us to associate an SQL query with charts in two or three dimensions to gain deeper insight into the data. Charts are constructed using the ExpressChart API, which provides a chart viewer with interactive functionalities. These functionalities range from simple transformations (zooming, resizing, and rotating) to switching between views in two and three dimensions. Charts are grouped into categories and displayed in a tree structure. The user can dynamically create new categories and new charts, which can be accessed with a single mouse click.

3.3. Data mining explorer

The data mining explorer provides the capability to explore and query data mining models. Models are organized in a tree structure under three main categories: classifiers, clusters, and association rules. Data mining models can be dynamically constructed and integrated into the data mining explorer in an appropriate category.
Further categories can be added dynamically to customize the organization of models. Each data mining model is labeled with a description that explains its use and the kind of decision support capability it offers. To interrogate a model, the user is prompted to enter the information required to make the prediction.

3.4. Weka submodule

To complement the implemented functionalities, we have integrated the Weka software package into our framework. Weka includes methods for most of the standard data mining problems: regression, classification, clustering, association rule mining, and attribute selection. It provides extensive support for the whole process of data mining, including preparing data, constructing and evaluating learning algorithms, and visualizing the input data as well as the results of machine learning.

3.5. E-mail explorer

The e-mail explorer allows a multi-staged analysis of e-mails using social networking techniques, text mining, geographical rendering, and statistical analysis to gain an in-depth view of the underlying information. The e-mail explorer works with a database backend for fast and convenient analysis of e-mail data. E-mails are organized in several virtual folders. The contents of each folder can be viewed separately, or jointly by merging all folders. A folder viewer displays a list of its e-mails. Several folder viewers can be opened at the same time, with the possibility of moving e-mails between them. A folder viewer offers some classical functionalities, such as e-mail sorting and searching, and advanced functionalities related to e-mail data mining and social network analysis. The advanced functionalities are presented through the following five sub-modules:

- Details editor
- Map viewer
- Statistics viewer
- Social network viewer
- Data mining viewer

3.5.1. Details editor

The details editor (see Fig. 4) offers three sub-views of an e-mail's contents: text, HTML, and raw format. The text sub-view displays the textual contents of an e-mail, and the HTML sub-view renders an e-mail as a web page (provided it contains HTML tags). The raw sub-view shows the e-mail in its original format, along with its metadata (without any processing).

3.5.2. Map viewer

The map viewer displays descriptive information about e-mail flows directly on real geographic maps. For instance, when an e-mail is selected in the folder viewer, an arrow is drawn between the geographic locations of the sender and recipient e-mail domains. Moreover, if the physical addresses of the sender and receiver are known, an arrow between these two locations is drawn as well.
Other information, such as the flow of e-mails between e-mail participants, is rendered directly on the map. The joining arrows can be labeled with descriptive information (number of e-mails exchanged, the topic of conversation between the nodes, etc.).

3.5.3. Statistics viewer

We compute several statistics on e-mail accounts and e-mail traffic and display them using appropriate charts for easy and intuitive interpretation (see Fig. 5). Statistical models, as mentioned in Section 2.1, are created in the statistics explorer and inserted automatically into the statistics viewer. A statistical model is created by specifying an appropriate SQL query to extract relevant data from an e-mail database and associating it with an appropriate chart.

3.5.4. Social network viewer

In order to analyze and investigate the nature of e-mail communication between individuals and communities, we have implemented a submodule called the social network viewer. A typical output of this module is shown in Fig. 6. We compute a set of cliques and display them in the form of networks in a full-fledged graph editor. The user can explore these networks and transform them if needed to gain insight into the dynamics of e-mail traffic. The communication among different individuals or communities, and between an individual and a community, as well as the number of e-mails exchanged, can be seen within the viewer. Different views of e-mail traffic can be displayed. The intensity of e-mail flow between parties is indicated by the thickness of the communication link between them: thickness increases with e-mail traffic. Different coloring schemes are used to identify different e-mail classes and clusters.

3.5.5. Data mining viewer

The data mining viewer (shown in Fig. 7) enables us to build machine-learning models by employing several different machine-learning algorithms over different kinds of datasets. This helps a user evaluate different data mining techniques.
The functionalities of the data mining viewer can be split into two categories: classification and clustering. Classification allows us to build decision models on sets of e-mails that are already classified, whereas clustering is employed to identify hidden relationships and structures in an e-mail corpus.

4. Conclusion

As a result of growing e-mail misuse, investigators need efficient automated methods and tools for analyzing e-mails. In our work, we developed an e-mail analysis framework to assist investigators in gathering clues and evidence in investigations in which e-mail communication is relevant. The framework offers functionalities ranging from e-mail storing, editing, searching, and querying to more advanced functionalities such as authorship attribution and e-mail account localization. Extending traditional authorship identification techniques, we have proposed a new technique of mining style variation. This helps to capture the changes that occur in the style of a person with respect to different contexts and recipients. To obtain more credible results, the level of cohesion and harmony among the different analysis techniques needs to be increased. E-mail social networks need to be explored further; they are rich sources of knowledge about cyber criminal activities.

References

Abbasi A, Chen H. Writeprints: a stylometric approach to identity-level identification and similarity detection in cyberspace. ACM Transactions on Information Systems March 2008;26(2).

Agrawal R, Imielinski T, Swami A. Mining association rules between sets of items in large databases. ACM SIGMOD Record June 1993;22(2):207-16.

Baayen RH, Van Halteren H, Tweedie FJ. Outside the cave of shadows: using syntactic annotation to enhance authorship attribution. Literary and Linguistic Computing 1996;2:110-20.

Bhattacharyya M, Hershkop S, Eskin E, Stolfo SJ. MET: an experimental system for malicious email tracking. In: Proc. of the 2002 new security paradigms workshop (NSPW-2002), Virginia Beach, VA; 2002.

Corney M, de Vel O, Anderson A, Mohay G. Gender-preferential text mining of e-mail discourse. In: Proc. 18th annual computer security applications conference 2002; 2002. p. 21-7.

de Vel O. Mining e-mail authorship. In: Proc. workshop on text mining, ACM international conference on knowledge discovery and data mining (KDD); 2000.

de Vel O, Anderson A, Corney M, Mohay G. Mining e-mail content for author identification forensics. SIGMOD Record December 2001;30(4):55-64.

Farringdon JM. Analyzing for authorship: a guide to the Cusum technique. University of Wales Press; 2001.

Forsyth RS, Holmes DI. Feature-finding for text classification. Literary and Linguistic Computing 1996;11(4):163-74.

Gunopulos D, Agrawal R, Gehrke J, Raghavan P. Automatic subspace clustering of high dimensional data for data mining applications. In: Proc. of ACM SIGMOD conference, Seattle, WA; 1998.

Holmes DI. The evolution of stylometry in humanities. Literary and Linguistic Computing 1998;13(3):111-7.

Iqbal F, Hadjidj R, Fung BCM, Debbabi M. A novel approach of mining write-prints for authorship attribution in e-mail forensics. Digital Investigation 2008;5:42-51.

Joachims T. Text categorization with support vector machines: learning with many relevant features. In: Proc. 10th European conference on machine learning (ECML); 1998. p. 137-42.

Kulkarni A, Pedersen T. Name discrimination and e-mail clustering using unsupervised clustering and labelling of similar contexts. In: Proc. 2nd Indian international conference on artificial intelligence (IICAI-05); 2005. p. 703-22.

Li H, Shen D, Zhang B, Chen Z, Yang Q. Adding semantics to email clustering; 2006. p. 938-42.

Lippmann RP. An introduction to computing with neural networks. IEEE Acoustics Speech and Signal Processing Magazine 1987;4(2):4-22.

Mendenhall TC. The characteristic curves of composition. Science 1887;11(11):237-49.

Newman MEJ. The structure and function of complex networks. SIAM Review 2003;45:167-256.

Quinlan JR. Induction of decision trees. Machine Learning 1986;1(1):81-106.

Stolfo J, Creamer G, Hershkop S. A temporal based forensic analysis of electronic communication. In: Proc. of ACM international conference on digital government research; 2006.

Teng J, Ma J, Lai I, Li Y. E-mail authorship mining based on SVM for computer forensic. In: Proc. third international conference on machine learning and cybernetics, Shanghai; August 2004.

Wei C, Sprague A, Skjellum A, Warner G. Mining spam email to identify common origins for forensic application. New York, NY, USA: ACM; 2008. p. 1433-37.

Zheng R, Li J, Chen H, Huang Z. A framework for authorship identification of online messages: writing-style features and classification techniques. Journal of the American Society for Information Science and Technology February 2006;57(3):378-93.

Zheng R, Qin Y, Huang Z, Chen H. Authorship analysis in cybercrime investigation. In: Proc. 1st NSF/NIJ symposium. ISI Springer-Verlag; 2003. p. 59-73.