Quick and Secure Clustering Labelling for Digital Forensic Analysis

A. Sudarsana Rao
Department of CSE, ASCET, Gudur, A.P., India

C. Rajendra
Department of CSE, ASCET, Gudur, A.P., India

ABSTRACT: In digital forensic analysis, seized digital devices can provide valuable information and evidence about the facts and individuals under investigation. In particular, algorithms for clustering documents can facilitate the discovery of new and useful knowledge from the documents under analysis. This paper applies document clustering algorithms to the digital forensic analysis of computers seized in police investigations. The approach is investigated by carrying out extensive experimentation with clustering algorithms (K-means, K-medoids, Single Link, Complete Link, Average Link, and CSPA) applied to datasets. The proposed work involves investigating automatic approaches for cluster labelling. Assigning labels to clusters may enable the expert examiner to identify the semantic content of each cluster more quickly, possibly even before examining its contents. Finally, the study of algorithms that induce overlapping partitions (e.g., Fuzzy C-Means and Expectation-Maximization for Gaussian Mixture Models) is worth investigating for seized computer devices in digital forensic investigation.

Index Terms: Clustering, forensic analysis, digital investigation.

INTRODUCTION: Digital evidence, defined as information and data of investigative value that are stored on, received, or transmitted by a digital device [1], [2], has lately become a crucial component of law enforcement investigations. The relevance of this kind of evidence, collected when electronic data and devices are seized, is established by digital forensic analysts, who increasingly have to deal with massive amounts of data that keep growing with the capacity of mass storage devices.
In a more practical and realistic scenario, domain experts (e.g., forensic examiners) are scarce and have limited time available for performing examinations. Thus, it is reasonable to assume that, after finding a relevant document, the examiner could prioritize the analysis of other documents belonging to the cluster of interest, because it is likely that these are also relevant to the investigation [3]. Such an approach, based on document clustering, can indeed improve the
analysis of seized computers, as will be discussed in more detail later. This paper therefore addresses police investigations through forensic data analysis. Clustering algorithms are typically used for exploratory data analysis, where there is little or no prior knowledge about the data. This is exactly the case in several applications of computer forensics, including the one addressed in this paper [3].

Clustering algorithms have been studied for decades, and the literature on the subject is huge. Therefore, we chose a set of six representative algorithms in order to show the potential of the proposed approach, namely: the partitional K-means [4] and K-medoids [5], the hierarchical Single/Complete/Average Link [6], and the cluster ensemble algorithm known as CSPA [7]. These algorithms were run with different combinations of their parameters, resulting in sixteen different algorithmic instantiations, as shown in Table I. Thus, as a contribution of our work, we compare their relative performances on the studied application domain using real-world investigation cases conducted by the Brazilian Federal Police Department. In order to make the comparative analysis of the algorithms more realistic, two relative validity indexes (Silhouette [5] and its simplified version [8]) have been used to estimate the number of clusters automatically from data. The following table gives the various algorithms and their parameters [3].

Table I: Algorithms and their parameters

Acronym   Algorithm      Attributes         Distance  Initialization  K estimate
Kms       K-means        Cont. (all)        Cosine    Random          Simp. Sil.
Kms100    K-means        100>TV             Cosine    Random          Simp. Sil.
Kms100*   K-means        100>TV             Cosine    [18]            Simp. Sil.
KmsT100*  K-means        100>TV             Cosine    [18]            Silhouette
KmsS      K-means        Cont. (all)        Cosine    Random          Rec. Sil.
Kms100S   K-means        100>TV             Cosine    Random          Rec. Sil.
Kmd100    K-medoids      100>TV             Cosine    Random          Silhouette
Kmd100*   K-medoids      100>TV             Cosine    [18]            Silhouette
KmdLev    K-medoids      Name               Lev.      Random          Silhouette
KmdLevS   K-medoids      Name               Lev.      Random          Rec. Sil.
AL100     Average Link   100>TV             Cosine    -               Silhouette
CL100     Complete Link  100>TV             Cosine    -               Silhouette
SL100     Single Link    100>TV             Cosine    -               Silhouette
NC        CSPA           Name, Cont. (all)  CSPA      Random          Simp. Sil.
NC100     CSPA           Name, 100>TV       CSPA      Random          Simp. Sil.
E100      CSPA           Cont. 100 random   CSPA      Random          Simp. Sil.

Note:
100>TV: the 100 attributes (words) that have the greatest variance over the documents
Cont. 100 random: 100 randomly chosen attributes from document content
Cont. (all): all features from document content
Lev.: Levenshtein distance
Simp. Sil.: Simplified Silhouette
Rec. Sil.: Recursive Silhouette
*: Initialization on distant objects
Name: file name

As the table shows, the algorithms differ in parameters such as the distance measure, which is either the cosine distance or the Levenshtein distance, a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (i.e., insertions, deletions, or substitutions) required to change one word into the other. The Levenshtein distance is applied in approximate string matching, where the objective is to find matches for short strings in many longer texts, in situations where a small number of differences is to be expected. The table also gives the initialization of each algorithm [1].

The remainder of this paper is organized as follows. Section II presents the literature review. Section III describes document clustering based techniques for forensic analysis. Section IV gives implementation details. Section V concludes the paper. Section VI gives references.

II LITERATURE REVIEW

The use of clustering has been reported by only a few studies in the computer forensics field [3]. Most of these studies describe the use of classic algorithms for clustering data, such as Expectation-Maximization (EM) for unsupervised learning of Gaussian Mixture Models, K-means, Fuzzy C-means (FCM), and Self-Organizing Maps (SOM). These algorithms have well-known properties and are widely used in practice [9].

Document Clustering for Forensic Analysis: An Approach for Improving Computer Inspection [3] uses various algorithms and pre-processing techniques to produce clustered data. In their conclusion, the authors show that their approach applies document clustering methods to the forensic analysis of computers seized in police investigations. They also report and discuss several practical results that can be very useful for researchers and practitioners of forensic computing.
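The Levenshtein distance that Table I lists for the file-name attribute (Lev.) can be computed with the standard dynamic-programming recurrence over the three edit operations. The Python below is a minimal sketch for illustration only, not the implementation used in the cited experiments; the function name `levenshtein` and the sample file names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # the distance is symmetric; keep the shorter string in the inner loop
    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


# Hypothetical file names that differ in a single character
print(levenshtein("invoice_2010.doc", "invoice_2011.doc"))
```

Because only two rows of the table are kept at a time, memory use grows with the length of the shorter string, which is convenient when many short file names have to be compared pairwise.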
More specifically, in their experiments the hierarchical algorithms known as Average Link and Complete Link presented the best results. Despite their usually high computational costs, these algorithms are particularly suitable for the studied application domain because the dendrograms that they provide offer summarized views of the documents being inspected, and are thus helpful tools for forensic examiners who analyze textual documents from seized computers [3].

Role of Document Clustering in Forensic Analysis: Computer forensic analysis involves examining a huge set of files. Among
all of those files, not all are relevant to the forensic examiner's interest. Analyzing files and documents that are of no interest is therefore a time-consuming task. The key approach is to apply document clustering [4] to this huge set of files and documents. As a result, document clustering provides different sets of clusters, among which the forensic examiner analyzes only the documents relevant to the investigation of the reported case. This helps to improve the speed of the forensic analysis process. It also allows the forensic examiner to analyze the files and documents by examining only representatives of the clusters. The document clustering process involves the following phases, as shown in Fig. 1:

1. Collection of Data: Collection of data involves processes such as acquiring the files and documents from the seized computer devices. Collecting such files and documents involves special techniques.
2. Pre-processing: As discussed earlier, pre-processing involves tokenization, stop word removal, stemming, and the construction of the weighted matrix.
3. Document Clustering: After pre-processing, document clustering is applied to form the set of clusters according to the specified clustering criteria.
4. Post-processing: The clustering results are prepared for applications such as forensic analysis, in which they are used for further analysis.
5. Forensic Analysis: As discussed under post-processing, the forensic analysis process uses the result of document clustering for further analysis. The result of document clustering speeds up the forensic process. Hence, this clearly specifies the role of document clustering in the process of forensic analysis.

[Figure: Collection of data -> Pre-processing -> Document Clustering -> Post-processing -> Forensic Analysis Process]
Fig. 1: Relation of Document Clustering and Forensic Analysis

III Document Clustering Based Techniques for Forensic Analysis: Fei, Eloff, Venter, and Oliver [10] propose Self Organizing Maps (SOM) to support decision making by forensic investigators. Self
Organizing Maps (SOM) are basically used to search for patterns in a data set. The files are clustered according to date and time of creation and type of file, so analysis becomes an easy task for the forensic investigator once the relevant pattern has been found. Hadjidj et al. [9] propose a forensic analysis tool based on a document clustering technique. It provides an automated tool for multi-staged analysis, which helps the forensic investigator gather evidence related to the crime for presentation in a court of law. This also highlights the role of document clustering in forensic investigation. Recently, Nassif and Hruschka [3] proposed an approach that applies document clustering algorithms to the forensic analysis of computer devices. They use relative validity indexes to estimate the number of clusters in an automated manner, which overcomes the limitations of previous techniques. Here the forensic examiner can analyze only the clusters of documents relevant to the reported case. The results [3] are shown in Fig. 3.

Cluster  Information
C1       3 black documents
C2       4 financial transactions
C3       2 maternity payments
C4       2 grocery lists
C5       1 foreign exchange transaction warning; 1 list of documents for registration information
C6       2 documents from foreign exchange operations
C7       1 registration form from a brokerage company; 1 contract template from the broker
C8       1 investment club status; 1 agreement for joining the club
C9       2 models for handling cash greater than R$100K
C10      8 receipts of foreign exchange insurance transactions
C11      2 warnings about foreign brokerage business hours
C12      3 label designs of a brokerage company
C13      1 notice about working hours; 1 check receipt
C14      2 daily reports from buying/selling exchanges
C15      2 sample documents from office application

Fig. 3: Document Clustering for Forensic Analysis

Fuzzy Methods for Forensic Data Analysis [11] describes a methodology and an automatic procedure for inferring accurate and easily understandable expert-system-like rules from forensic data. The methodology and the algorithms used were shown to be easily implementable in most data analysis environments. By discussing the
applicability of different fuzzy methods to improving the effectiveness and quality of the data analysis phase of crime investigation, the work shows how fuzzy set theory can be put into practice [11].

IV IMPLEMENTATION DETAILS:

IV.I Architectural Diagram of Proposed System: The proposed system is shown in Fig. 2.

[Figure: block diagram with the stages 1. Preprocessing (Fetch File Contents, Stop Word Removal, Stemming), 2. Preparing Cluster Vector (Top 100 Words, Numerical Sentence Vector), and 3. Forensic Analysis (Weighted Method Protocol, Classification Matrix, Accuracy); further labels: Document Folder, Document Cluster, User, Catalog the Vector, Master Searching Vector]
Fig. 2: Architectural diagram of proposed system

In our proposed system there are three important steps: 1) Pre-processing, 2) Preparing cluster vector, 3) Forensic analysis.

1) Pre-processing: The pre-processing step has three sub-steps: a) fetching the file contents, b) stop word removal, and c) stemming. The purpose of these sub-steps is to read the file content, remove stop words such as "a", "an", and "the", and then apply stemming, which strips suffixes such as "ing" and "ed" from words.

2) Preparing Cluster Vector: To prepare the cluster vector, the top 100 words are found in the file on which pre-processing has already been done. In addition, numerical sentences are extracted from the document, that is, sentences that contain a date or any other kind of number.

3) Forensic Analysis: This is the last step of the proposed method. As the architectural diagram above shows, forensic data analysis requires building a classification matrix with the help of the weighted method protocol. Finally, the accuracy of the result can be measured.

V CONCLUSIONS AND FUTURE WORK

From this survey of digital forensic analysis it can be concluded that clustering data is not an easy step.
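To make the pre-processing and cluster-vector steps of Section IV more concrete, the following Python is a minimal sketch of stop word removal, suffix stripping, top-word selection, and numerical sentence extraction. It rests on several assumptions that the paper does not state: the stop word list is a tiny hypothetical sample, `naive_stem` only strips "ing" and "ed" and is not a real stemmer, and the top words are ranked by raw frequency within one file. All function names are hypothetical.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop word list (assumption, not from the paper)
STOP_WORDS = {"a", "an", "the", "and", "or", "of", "in", "on", "to", "is", "are"}


def naive_stem(word):
    """Strip a trailing 'ing' or 'ed' if at least three letters remain."""
    for suffix in ("ing", "ed"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, tokenize on letters, drop stop words, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(token) for token in tokens if token not in STOP_WORDS]


def top_words(text, k=100):
    """Return the k most frequent pre-processed words (frequency ranking is an assumption)."""
    counts = Counter(preprocess(text))
    return [word for word, _ in counts.most_common(k)]


def numerical_sentences(text):
    """Return the sentences that contain at least one digit, e.g. a date or an amount."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [sentence for sentence in sentences if re.search(r"\d", sentence)]


sample = ("The bank transferred the funds on 12 March. "
          "The broker is transferring funds. Nothing else happened.")
print(top_words(sample, k=2))
print(numerical_sentences(sample))
```

A production system would more likely rely on an established stemmer and a complete stop word list; the sketch only shows how the three pre-processing sub-steps feed the word and sentence selection.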
There is a huge amount of data to be clustered in computer forensics. To address this problem, this paper presented an approach that applies document clustering methods to digital forensic
analysis of computers seized in police investigations. In addition, labelling the clusters produced by document clustering of forensic data will be useful for police investigations.

VI REFERENCES:

[1] U.S. Department of Justice, Electronic Crime Scene Investigation: A Guide for First Responders, I Edition, NCJ, 2008.
[2] J. Clerk Maxwell, A Treatise on Electricity and Magnetism, 3rd ed., vol. 2. Oxford: Clarendon, 1892, pp. 68-73.
[3] Luis Filipe da Cruz Nassif and Eduardo Raul Hruschka, Document Clustering for Forensic Analysis: An Approach for Improving Computer Inspection, IEEE Transactions on Information Forensics and Security, no. 1, January.
[4] A. K. Jain and R. C. Dubes, Algorithms for Clustering Data. Englewood Cliffs, NJ: Prentice-Hall.
[5] L. Kaufman and P. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis. Hoboken, NJ: Wiley-Interscience.
[6] R. Xu and D. C. Wunsch, II, Clustering. Hoboken, NJ: Wiley/IEEE Press.
[7] A. Strehl and J. Ghosh, Cluster ensembles: A knowledge reuse framework for combining multiple partitions, J. Mach. Learning Res., vol. 3.
[8] E. R. Hruschka, R. J. G. B. Campello, and L. N. de Castro, Evolving clusters in gene-expression data, Inf. Sci., vol. 176.
[9] R. Hadjidj, M. Debbabi, H. Lounis, F. Iqbal, A. Szporer, and D. Benredjem, Towards an integrated forensic analysis framework, Digital Investigation, Elsevier, vol. 5, no. 3-4.
[10] B. K. L. Fei, J. H. P. Eloff, H. S. Venter, and M. S. Oliver, Exploring forensic data with self-organizing maps, in Proc. IFIP Int. Conf. Digital Forensics, 2005.
[11] K. Stoffel, P. Cotofrei, and D. Han, Fuzzy methods for forensic data analysis, in Proc. IEEE Int. Conf. Soft Computing and Pattern Recognition, 2010.