Analyzing Huge Data Sets in Forensic Investigations Kasun De Zoysa Yasantha Hettiarachi Department of Communication and Media Technologies University of Colombo School of Computing Colombo, Sri Lanka
Centre for Digital Forensic ISIF Information Society Innovation Grant www.isif.asia
Our Role
Police, CID, Customs, Bribery and Corruption, Judicial Services, Victims
Number of Crimes Reported During Past 7 Years
[Figure: number of crimes (y-axis, 0 to 25) by year (x-axis, 2003 to 2009)]
Problems Faced
- Evidence is not collected in an acceptable manner
- Evidence is damaged by time and environmental factors
- Evidence is damaged (wiped or formatted) before collection
Why?
- Equipment is not available
- Software is not available
- Procedures and policies are not in place
- Lack of IT knowledge in the law enforcement sector
Some Existing Popular Forensic Investigation Tools
Tool: Description
Encase/FTK: Commercial products
Sleuthkit: Open source; widely used; provides tools for forensic activities; easy to understand and deploy
PyFlag: Not widely used; complex; difficult to deploy
PTK, Autopsy: Consume a lot of time during file analysis
Challenges of Developing a Forensic Toolkit for a Developing Country
- Limited resources: lack of high-end machines and of appropriate media to store evidence
- Procedures and policies: developing a forensic framework that strikes an acceptable balance between technology and law
- Poor IT literacy of police and legal officers: the toolkit must offer a user-friendly and useful service to the courts and judges
FIT4D
A software toolkit that makes use of the limited resources available in developing countries
http://score.ucsc.lk/fit4d/
Comparison Between PTK and FIT4D Features
Features compared (PTK vs. FIT4D):
1. Creating disk images
2. Searching/filtering the disk image
3. Analyzing and searching the disk image piecewise
4. Report generation
5. Graphics processing tools
6. Comparing file content within the image
7. Attaching legal documents such as court orders to the case
8. Evidence not stored in a central server
9. Dynamic timeline
10. Multiple investigators and case lock
Storage Capacity Grows Over Time
[Figure: storage capacity over time. Source: Wikipedia]
Analyzing huge data sets demands tremendous time and effort in forensic investigations.
There Are a Huge Number of Hard Disks
- Which one contains the email address perera@gmail.com?
- Which one belongs to Mr. G.H. Perera?
- Today, most forensic tools analyze a single drive at a time.
- These tools are not adequate for today's forensic challenges.
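To make the question on the previous slide concrete (which of many disks contains a given email address), here is a minimal Python sketch that scans several raw disk image files for a byte string. The function names `image_contains` and `find_images_with` are hypothetical and are not part of FIT4D or any existing toolkit. The sketch assumes the address is stored as plain, unencoded ASCII; it will miss compressed, encrypted, or differently encoded data.

```python
def image_contains(path, needle, chunk_size=1 << 20):
    """Return True if the byte string `needle` occurs in the file at `path`.

    The file is read in fixed-size chunks so that images larger than
    memory can be scanned. (Hypothetical helper, for illustration only.)
    """
    if not needle:
        raise ValueError("needle must not be empty")
    overlap = len(needle) - 1
    tail = b""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                return False
            window = tail + chunk
            if needle in window:
                return True
            # Keep the last len(needle) - 1 bytes so that a match split
            # across two consecutive chunks is still found.
            tail = window[-overlap:] if overlap else b""


def find_images_with(image_paths, needle):
    """Return the paths (as strings) of all images that contain `needle`."""
    return [str(p) for p in image_paths if image_contains(p, needle)]
```

For example, `find_images_with(paths, b"perera@gmail.com")` would return the subset of `paths` whose raw bytes contain that address. Every image is still read in full, one after another, which is exactly the cost that motivates pre-processing and indexing the data instead.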
Existing Tools Inefficient
- Most of the existing investigation tools cannot handle these huge data sets efficiently.
- E.g.: it takes nearly two to three hours to open a 6GB hard disk using a popular forensic toolkit such as FTK.
Data Mining: A Better Solution?
- Data mining is a good solution for handling massive volumes of data.
- Little research has focused on applying data mining techniques to digital forensics!
Proposed System: Data Mining for Forensic Investigations
- Our aim is to build a system that applies data mining techniques to the forensic analysis of data.
- It will provide some pre-categorization of data and intelligent analysis.
Advantages: Proposed System
- It will free the investigator from low-level and manual tasks.
- This will speed up the investigation process.
- It will improve the quality of the information associated with the data analysis.
- It will reduce the huge monetary cost associated with a digital investigation.
Proposed System Architecture
[Figure: layered architecture, listed from top to bottom]
- Evidence Correlation Engine
- Entity Extraction Engine, Clustering Engine, Association Rule Mining Engine
- Data Store
- Transform Data
- Data Selection and Cleaning
- Sleuthkit: Extract Disk Information
- Disk Images
Entity Extraction
- Extract information from unstructured documents into categories such as names of persons, organizations, and locations, expressions of time, quantities, monetary values, percentages, e-mail addresses, authorship, personal characteristics, etc.
- Open source software exists for named entity extraction, e.g. GATE and ANNIE.
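As a rough illustration of the idea, the sketch below pulls a few of the categories named above (e-mail addresses, percentages, monetary values) out of free text with regular expressions. This is a deliberately simple stand-in and not how GATE or ANNIE work; the patterns, the `PATTERNS` table, and the `extract_entities` function are assumptions made for this example. Names of persons, organizations, and locations need a real named entity recognizer.

```python
import re

# Illustrative patterns only; real-world data needs far more robust rules.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+"),
    "percentage": re.compile(r"\d+(?:\.\d+)?\s?%"),
    "money": re.compile(r"(?:Rs\.?|USD|\$)\s?\d+(?:,\d{3})*(?:\.\d+)?"),
}


def extract_entities(text):
    """Return a dict mapping each category to its sorted, de-duplicated matches."""
    return {
        label: sorted(set(pattern.findall(text)))
        for label, pattern in PATTERNS.items()
    }
```

Calling `extract_entities` on the text of every recovered document gives a small structured record per document, which is the kind of input the later mining and correlation steps can work with.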
Clustering and Categorizing Data
- Classify data according to the patterns found on the storage medium.
- E.g.: mine e-mail content and identify its authorship from a set of examples from known authors.
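The authorship example can be sketched as a nearest-profile classifier: build a character trigram frequency profile for each known author and assign an unknown e-mail to the author whose profile is most similar under cosine similarity. This is one simple technique chosen for illustration, not necessarily the one the proposed system would use, and the function names are hypothetical. Meaningful results would need far more text per author than a toy example provides.

```python
import math
from collections import Counter


def trigram_profile(text):
    """Count character trigrams after lowercasing and collapsing whitespace."""
    t = " ".join(text.lower().split())
    return Counter(t[i:i + 3] for i in range(len(t) - 2))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def attribute_author(known, unknown_text):
    """Return the author in `known` whose combined profile is closest.

    `known` maps an author name to a list of sample texts by that author.
    """
    target = trigram_profile(unknown_text)
    profiles = {
        author: sum((trigram_profile(t) for t in texts), Counter())
        for author, texts in known.items()
    }
    return max(profiles, key=lambda author: cosine(profiles[author], target))
```

Character trigrams are used here because they capture spelling habits and punctuation without any language-specific preprocessing.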
Association Rule Mining
- Find frequently occurring patterns in data sets and present them as rules.
- E.g.: this technique has been applied to network intrusion detection to derive association rules from users' interaction histories. The extracted rules can then be used to discover future network attacks.
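A minimal sketch of the idea, restricted to rules between pairs of items: count how often each pair of events occurs together (support) and how often the right-hand side appears given the left-hand side (confidence). The function name `pair_rules`, the thresholds, and the event names in the usage note are assumptions for illustration; full algorithms such as Apriori also handle larger item sets.

```python
from collections import Counter
from itertools import combinations


def pair_rules(transactions, min_support=0.5, min_confidence=0.8):
    """Return sorted (lhs, rhs, support, confidence) tuples for item pairs.

    support    = fraction of transactions containing both items
    confidence = (transactions with both items) / (transactions with lhs)
    """
    n = len(transactions)
    item_counts = Counter()
    pair_counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue
        # Each frequent pair can yield a rule in either direction.
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return sorted(rules)
```

Given session logs such as `[{"port_scan", "login_fail", "root_login"}, {"port_scan", "login_fail"}, ...]`, a returned tuple `("root_login", "login_fail", 0.5, 1.0)` would read as: every session with a root login also had a failed login, and the two occur together in half of all sessions.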
Correlation of Evidence
- The investigator has to browse and search for evidence, and finally correlate all of it to reach final conclusions.
- This "connecting the dots" operation is very complex.
- Data mining applies statistical and intelligent methods to find correlations between the pieces of information found in the evidence.
- E.g.: FACE is a framework for automatic evidence discovery and correlation from a variety of forensic targets. It has only been used for memory evidence correlation.
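In its simplest form, connecting the dots can be sketched as an inverted index: map every extracted entity to the evidence sources it appears in, then keep the entities seen in more than one source. This is only an assumed baseline for illustration, not the FACE approach or the proposed engine; the `correlate` function and the source names in the usage note are hypothetical.

```python
from collections import defaultdict


def correlate(entities_by_source):
    """Return entities that appear in two or more evidence sources.

    `entities_by_source` maps a source name (e.g. a disk image) to an
    iterable of entities extracted from it. The result maps each shared
    entity to the sorted list of sources that contain it.
    """
    sources_by_entity = defaultdict(set)
    for source, entities in entities_by_source.items():
        for entity in entities:
            sources_by_entity[entity].add(source)
    return {
        entity: sorted(sources)
        for entity, sources in sources_by_entity.items()
        if len(sources) > 1
    }
```

For instance, if `perera@gmail.com` was extracted from both `disk_01.img` and `disk_07.img`, `correlate` would report that address together with both image names, giving the investigator a starting link between the two drives.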
The Proposed Framework Will...
- Apply data mining and artificial intelligence concepts to facilitate digital forensics.
- Release investigators from the low-level tasks that they currently have to do.
- If applied properly, achieve 3 main goals:
1) Speed up the investigation process and reduce the time taken for a digital investigation.
2) Improve the quality of the information associated with the data analysis.
3) Reduce the huge monetary cost associated with a digital investigation.
Limitations
- Although data mining has been applied successfully in various domains, it has not been much used or tested within the domain of digital forensics.
- Data mining and AI techniques need huge data sets for training the system; otherwise, it will perform poorly.
- We believe that these limitations will not limit the potential of extending data mining research to digital forensics and digital investigations.
Conclusion
- We propose a digital forensic investigation framework that would free investigators from the low-level tasks that they currently have to do.
- This will speed up the investigation process and reduce the time taken for a digital investigation.
- It will improve the quality of the information associated with the data analysis.
- It will reduce the huge monetary cost associated with a digital investigation.
- We encourage other researchers and practitioners to assist us in improving awareness and skills in this area.
Thank you
Contact Kasun (kasun@ucsc.cmb.ac.lk) for more information about our projects.