Digital Forensics - CS489
Sep 15, 2006
Topical Paper
Mayuri Shakamuri

Data Mining for Digital Forensics

Introduction

"Data mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner" (Hand, Mannila and Smyth 2001). Advancements in storage technology and digital data acquisition have contributed to the growth of huge databases. This is happening in many areas, from day-to-day tasks such as credit and usage records, telephone call details, and market transactions to more complex ones such as image processing, molecular databases, and medical records. The data is misused as well as used for legitimate purposes. As our dependency on these databases increases, the threat of disruption due to cyber attacks has become a pressing issue. It has also become important to extract information from these huge databases that might be of value to the owner of the database. Data mining, also called Knowledge Discovery in Databases (KDD), can play a big role in making it convenient and practical to explore very large databases.

Digital forensics is the application of the scientific method to digital media in order to establish factual information for judicial review. This process often involves investigating computer systems to determine whether they are or have been used for illegal or unauthorized activities (Wikipedia). With the growing sizes of databases, law enforcement and intelligence agencies face the challenge of analyzing large volumes of data involved in criminal and terrorist activities (Chen et al., 2003). Data mining is thus a suitable scientific method for digital forensics.

Data Mining Techniques

Data mining can be categorized into different types of tasks, which depend on the analyst's objectives in analyzing the data (Hand, Mannila and Smyth 2001).

1. Exploratory Data Analysis (EDA): In this technique the goal is to explore the data without a clear idea of what we are looking for. EDA techniques can be interactive and visual. Some applications of EDA techniques are:
   Coxcomb plots - In 1858, Florence Nightingale used them to display mortality rates at military hospitals in and near London.
   Network displays - Becker, Eick, and Wilks (1995) described a set of intricate spatial displays for visualization of time-varying long-distance telephone network patterns over 12,000 links (Hand, Mannila and Smyth 2001).

2. Descriptive Modeling: The goal of this technique is to describe all the data that is being explored. Some examples of descriptive models are:
   Density estimation - Used to model the probability distribution of the data.
   Cluster analysis and segmentation - Partition of the space into groups. Segmentation has been widely used in marketing to determine demographics. Clustering has been widely used in psychiatric research to determine taxonomies for psychiatric disorders.
   Dependency modeling - Models describing relationships between variables.

3. Predictive Modeling: In this technique a model is built that allows the value of one variable to be predicted from the known values of other variables. Classification and regression are the methods used in this kind of modeling. In classification the variable being predicted is categorical, whereas in regression the variable is quantitative. Some examples of this modeling are:
   SKICAT system - Used to classify stars from a 40-dimensional feature vector.
   AT&T - Used regression techniques to build models that estimate the probability that a phone number is located at a business or a residence.

4. Discovering Patterns and Rules: As the name suggests, the goal of this method is to find patterns in the data set, for example association rules, using algorithmic techniques. An example application is:
   Tracking fraudulent use of cellular telephones.

5. Retrieval by Content: The idea behind this method is to find patterns in the data set that are similar to a pattern the user supplies. This method is widely used on text and image data sets. Some examples are:
   PageRank - Used by Google to estimate the relative importance of Web pages.
   QBIC - Developed by IBM to search large image databases using content-based queries.
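To make the pattern and rule discovery task more concrete, the following is a minimal sketch of association rule mining restricted to pairs of items. The transactions, the item names, and the two thresholds are hypothetical and chosen only for illustration; they are not taken from any of the cited systems. Support is the fraction of transactions containing both items, and confidence is the fraction of transactions containing the left-hand item that also contain the right-hand item.

```python
from collections import Counter
from itertools import combinations


def mine_pair_rules(transactions, min_support=0.4, min_confidence=0.7):
    """Return rules (lhs, rhs, support, confidence) over single items.

    Only rules whose item pair meets min_support and whose confidence
    meets min_confidence are kept. Thresholds are illustrative defaults.
    """
    n = len(transactions)
    item_counts = Counter()
    pair_counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue
        # A frequent pair can yield a rule in either direction.
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return sorted(rules)


# Hypothetical call records: each set lists attributes observed for one call.
calls = [
    {"night_call", "intl_call", "short_duration"},
    {"night_call", "intl_call"},
    {"day_call", "local_call"},
    {"night_call", "intl_call", "new_cell_site"},
    {"day_call", "intl_call"},
]

for lhs, rhs, support, confidence in mine_pair_rules(calls):
    print(f"{lhs} -> {rhs}  support={support:.2f}  confidence={confidence:.2f}")
```

On this toy data the only frequent pair is night_call with intl_call, which gives one rule in each direction. A practical system would consider larger item sets and would need an algorithm that prunes the search space, which this sketch does not attempt.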
Applying Data Mining Techniques in Digital Forensics

Digital forensics professionals select data mining techniques based on the type of data set and the specific nature of the information needed. The data can be, for example, a huge collection of e-mails, images, or network traffic information. Appropriate data mining techniques include support vector machine learning algorithms, behavior-based anomaly detection, and heuristic-based anomaly detection.

1. Intrusion Detection Systems

Researchers at Columbia University have conceived an approach to intrusion detection systems (IDS) based on data mining of audit sources. Detection models are constructed automatically by cost-sensitive machine learning algorithms using given cost metrics. In a cost-sensitive IDS, normal and intrusion activities are analyzed, and this information is used to build effective misuse and anomaly detection models. Based on this, the system finds clusters of attack signatures and normal profiles and constructs a dynamically configurable group of models (Stolfo et al., 2001).

2. Image Mining

The amount of image traffic over the Internet is growing day by day, and illicit images are being transmitted at an alarming rate. Manually checking every image to identify which ones are of interest to digital forensics investigators and law enforcement officers is extremely time consuming and can be unproductive. There is an ever-increasing need for data mining tools that help investigators find the relevant images in less time. Researchers at Queensland University of Technology, together with the Defence Science and Technology Organisation in Australia, have used data mining techniques to design an Image Mining System. "The system can be trained by a hierarchical Support Vector Machine (SVM) to detect objects and scenes which are made up of components under spatial or non-spatial constraints" (Brown et al., 2005). This model allows forensics investigators to communicate with the system via a grammar.
"The grammar allows object description for training, searching, querying and relevance feedback" (Brown et al., 2005).

3. Criminal Network Analysis

In an NSF Digital Government Program funded project called COPLINK (Center: Information and Knowledge Management for Law Enforcement), researchers have applied data mining techniques to analyze data in the context of law enforcement. One application was to analyze and recognize previously unknown structural patterns in criminal networks in organized crimes such as narcotics trafficking, terrorism, gang-related crimes and other illegal activities. Social Network Analysis (SNA) was the data mining technique used for these kinds of networks. Their analysis involved four steps: network extraction, subgroup detection, interaction pattern discovery, and central member identification (Chen et al., 2003). For subgroup detection they used hierarchical clustering to detect subgroups based on relational strength in the criminal network. A social network analysis approach called block modeling was used to reveal patterns of between-group interactions. Detecting subgroups, interaction patterns, and the overall structure manually is a rather difficult task. They concluded that the subgroups and members found with this approach were correct representations of reality.

4. Mining E-mail Content

E-mail is one of the most commonly used applications on the Internet. There has been research on content analysis to perform various tasks such as spam detection and control and automated filing. For digital forensics and law enforcement purposes this may not be sufficient. As e-mail is accepted as legal evidence, there is a growing need for better tools that analyze the content and find patterns and other useful information for digital forensics professionals. Analyzing huge volumes of e-mail data manually can be extremely tedious and at times inefficient and unproductive. Data mining techniques can be applied to build tools that find valuable information and save critical time that an investigator can spend on other important forensics tasks. Besides the content of the e-mail, information such as who sent it and where it was sent from can be of great value. Here again data mining tools can be very useful, as they can integrate these various aspects into one model.
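As an illustration of the kind of non-content information mentioned above, the following minimal sketch pulls the sender, the recipients, and the relay addresses out of a raw message using Python's standard email package. The message, the addresses, and the field names of the returned dictionary are all hypothetical. Received headers can be forged, so in practice the extracted addresses are leads to be corroborated rather than established facts.

```python
import re
from email import message_from_string
from email.utils import getaddresses, parseaddr

# Hypothetical raw message using reserved example domains and addresses.
RAW_MESSAGE = """\
Received: from mail.example.org (mail.example.org [192.0.2.10])
    by mx.example.com with ESMTP; Fri, 15 Sep 2006 10:02:11 -0500
Received: from workstation (unknown [198.51.100.7])
    by mail.example.org with SMTP; Fri, 15 Sep 2006 10:02:09 -0500
From: Alice Example <alice@example.org>
To: Bob Example <bob@example.com>, carol@example.net
Subject: Quarterly figures
Date: Fri, 15 Sep 2006 10:02:08 -0500

Body text.
"""

# Matches a bracketed IPv4 address such as [192.0.2.10].
IP_PATTERN = re.compile(r"\[(\d{1,3}(?:\.\d{1,3}){3})\]")


def extract_metadata(raw):
    """Return sender, recipients, and relay IPs found in the headers."""
    msg = message_from_string(raw)
    sender = parseaddr(msg.get("From", ""))[1]
    recipient_fields = msg.get_all("To", []) + msg.get_all("Cc", [])
    recipients = [addr for _, addr in getaddresses(recipient_fields)]

    # Received headers are listed most recent hop first.
    hop_ips = []
    for hop in msg.get_all("Received", []):
        match = IP_PATTERN.search(hop)
        if match:
            hop_ips.append(match.group(1))

    return {
        "sender": sender,
        "recipients": recipients,
        "hop_ips": hop_ips,
        "date": msg.get("Date"),
    }


metadata = extract_metadata(RAW_MESSAGE)
print(metadata)
```

Records of this form, collected over many messages, are the sort of input that a mining tool could combine with content features in one model.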
Researchers at Columbia University, New York, have developed an E-mail Mining Toolkit (EMT) that helps law enforcement officers and digital forensics professionals analyze e-mails and present the results as evidence. EMT detects anomalous behavior patterns within an account and similar patterns across accounts; the latter is a means of detecting proxy accounts used by a person to hide their identity (Stolfo and Hershkop, 2005). Their work has shown that, with this data mining driven toolkit, new behavior models can be used in spam detection. In separate work on author identification, structural characteristics and linguistic patterns were derived from e-mails and combined with a Support Vector Machine learning algorithm to mine the e-mail content (Vel et al., 2001).
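The idea of comparing behavior patterns within and across accounts can be sketched as follows. This is not EMT's actual model; it is a simplified illustration under stated assumptions: an account's behavior is summarized only by the hours at which it sends messages, two profiles are compared with the total variation distance, and the sample hours and the threshold of 0.5 are hypothetical.

```python
from collections import Counter


def hourly_profile(send_hours):
    """Normalized 24-bin histogram of the hours (0-23) messages were sent."""
    if not send_hours:
        raise ValueError("at least one message is required")
    counts = Counter(send_hours)
    total = len(send_hours)
    return [counts.get(hour, 0) / total for hour in range(24)]


def profile_distance(p, q):
    """Total variation distance: 0.0 for identical profiles, 1.0 for disjoint."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


# Hypothetical sending hours for one account, before and after some date.
baseline_hours = [9, 9, 10, 10, 11, 14, 15, 16]
recent_hours = [2, 2, 3, 3, 3, 4, 9, 10]

ANOMALY_THRESHOLD = 0.5  # illustrative value, not taken from the cited work

distance = profile_distance(
    hourly_profile(baseline_hours), hourly_profile(recent_hours)
)
is_anomalous = distance > ANOMALY_THRESHOLD
print(f"distance={distance:.2f} anomalous={is_anomalous}")
```

The same distance can be read in the opposite direction: an unusually small distance between the profiles of two different accounts is a hint, though not proof, that one person operates both. A realistic model would use many more behavioral features than the sending hour.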
5. Modeling the Behavior of Serious Sexual Offenders

Data mining has been used in many business organizations as well as in the investigation of criminal activities. The capabilities of these techniques are encouraging and are extending to various other areas. Researchers at the University of Wolverhampton, along with the police department of Birmingham, in the UK, have applied data mining techniques to link crimes of a serious sexual nature (Adderley and Musgrove, 2001). They used Self Organizing Maps (SOM), a type of artificial neural network, for this analysis. The data was taken from the National Crime Faculty and the National Police Staff College, Bramshill, UK. A prototype based on behavioral patterns was developed that formed clusters and linked offenders to a particular cluster in a much shorter time than doing so manually. The commercial data mining package SPSS Clementine was used to speed up development of the model. The SOM technique was used to analyze sexual assault and rape offences held in a ViCLASS relational database within the National Crime Faculty at Bramshill (Adderley and Musgrove, 2001). This helped them determine which of the crimes were likely committed by the same offender(s). The analysts established that crimes in individual clusters exhibited strong similarities, with adjacent clusters that are based on a variable theme having similar traits (Adderley and Musgrove, 2001).

Conclusion

Several commercial data mining tools are used in various industrial sectors and businesses. Some of the major players in the data mining sector are Clementine, Darwin, CART Decision Tree Software, MARS Predictive Modeling Software, TreeNet Stochastic Gradient Boosting Software, LOGIT Software, RandomForests, and COGNOS, to name a few. Basis Technology is working on Multilingual Digital Forensics to leverage its analytical multilingual search techniques to enhance the field of digital forensics.
These commercially available data mining tools can be used for forensics, and research continues in the quest for the killer applications of data mining. Data mining techniques have considerable potential in the field of forensic science, where models and tools can be developed to help investigators, digital forensics professionals, and law enforcement officers find the data or clues they are searching for faster and more efficiently.
References:

1. Hand, D., Mannila, H., Smyth, P. (2001). Principles of Data Mining. Cambridge, MA: MIT Press.
2. Chen, H., Chung, W., Qin, Y., Chau, M., Xu, J. J., Wang, G., Zheng, R., Atabakhsh, H. (2003). Crime Data Mining: An Overview and Case Studies. ACM International Conference Proceeding Series; Vol. 130, 1-5.
3. Stolfo, S. J., Lee, W., Chan, P. K., Fan, W., Eskin, E. (2001). Data Mining-based Intrusion Detectors: An Overview of the Columbia IDS Project. ACM SIGMOD Record; Vol. 30, 5-14.
4. Brown, B., Pham, B., Vel, O. (2005). Design of a Digital Forensics Image Mining System. IIHMSP05, Melbourne.
5. Vel, O., Anderson, A., Corney, M., Mohay, G. (2001). Mining E-mail Content for Author Identification Forensics. ACM SIGMOD Record; Vol. 30, No. 4.
6. Stolfo, S. J., Hershkop, S. (2005). Email mining toolkit supporting law enforcement forensic analyses. ACM International Conference Proceeding Series; Vol. 89, 221-222.
7. Adderley, R., Musgrove, P. B. (2001). Data mining case study: Modeling the behavior of offenders who commit serious sexual assaults. Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining; 215-220.