Chapter 8 Hash functions in digital forensics


In this chapter we describe the role of hash functions in digital forensics. Hash functions are used for two main purposes. First, cryptographic hash functions ensure the authenticity and integrity of digital traces. Second, hash functions identify known objects (e.g., illicit files). Before we give details on these applications in IT forensics, we introduce the foundations of hash functions in Section 8.1. Section 8.2 then describes the use case authenticity and integrity of digital traces. Finally, Section 8.3 explains the use case data reduction by identification of known digital objects.

8.1 Cryptographic hash functions and approximate matching

In this section we first introduce the general idea of a hash function and then turn to two different concepts. The first concept is the cryptographic hash function, which originates from cryptography, where it is used in the context of the security goals authenticity, integrity, and non-repudiation. Cryptographic hash functions are useful to uniquely identify an input bit string by its hash value. The second concept is rather new. It deals with the identification of similar input bit strings and is called approximate matching.

A general hash function is simply a function which takes an arbitrarily large bit string as input and outputs a bit string of fixed size. If $n \in \mathbb{N}$ denotes the bit length of the output and if we denote as usual by $\{0,1\}^*$ the set of all bit strings, then a hash function $h$ is a mapping

$$h : \{0,1\}^* \to \{0,1\}^n. \qquad (8.1)$$

Typically the computation of a hash value is efficient, that is, fast in practice. These two properties are characteristic for a hash function and are thus used for its definition (see e.g. [16]).

Definition 8.1: Hash function
Let $n \in \mathbb{N}$ be given. A hash function is a function $h$ which satisfies the following two properties:
1. Compression: $h : \{0,1\}^* \to \{0,1\}^n$.
2. Ease of computation: For all input bit strings $bs \in \{0,1\}^*$ the computation of $h(bs)$ is fast in practice.
The output $h(bs)$ of the function is referred to as a hash value, fingerprint, signature or digest.

Example 8.1
We look at two simple hash functions.
1. We set $n = 1$. For $bs \in \{0,1\}^*$ we define $h(bs)$ as the least significant bit of $bs$, with the additional definition $h(\emptyset) := 0$ for the empty bit string $\emptyset$. For instance we have $h(10101) = h(11) = h(1) = 1$ and $h(1000) = h(10) = h(0) = 0$. Clearly this function satisfies both requirements from Definition 8.1.
2. We set $n = 2$. For $bs \in \{0,1\}^*$ we define $h(bs)$ as $bs \bmod 4$, where $bs$ is interpreted as a non-negative binary integer. Again, we set $h(\emptyset) := 0$ for the empty bit string $\emptyset$. For instance we have $h(10110) = h(10) = 10_2 = 2$ and $h(1000) = h(0) = 0$. Again this function satisfies both requirements from Definition 8.1.

Cryptographic hash functions

Hash functions are well-established in computer science for different purposes. Sample security applications of hash functions comprise storage of passwords (e.g., on Linux systems), electronic signatures (both MACs and asymmetric signatures), and whitelists/blacklists in digital forensics.

Depending on the application, we have to impose further requirements. For instance, in cryptography a hash value serves as a unique identifier for its input, e.g., in the context of a digital signature, where the hash value uniquely represents the input data. In theory each hash value possesses infinitely many preimages, that is, input bit strings which map to the given hash value. In practice, however, it is not possible to compute such a preimage, because the run time of the most efficient algorithm to find a preimage is too long. This property is called preimage resistance. Besides preimage resistance a cryptographic hash function satisfies two additional security requirements, which we list in Definition 8.2.

Definition 8.2: Cryptographic hash function
Let $h : \{0,1\}^* \to \{0,1\}^n$ be a hash function. $h$ is called a cryptographic hash function if it additionally satisfies the following security requirements:
1. Preimage resistance: Let a hash value $H \in \{0,1\}^n$ be given. Then it is infeasible in practice to find an input (i.e., a bit string $bs$) with $H = h(bs)$.
2. Second preimage resistance: Let a bit string $bs_1 \in \{0,1\}^*$ be given. Then it is infeasible in practice to find a second bit string $bs_2$ with $bs_1 \neq bs_2$ and $h(bs_1) = h(bs_2)$.
3. Collision resistance: It is infeasible in practice to find any two bit strings $bs_1, bs_2 \in \{0,1\}^*$ with $bs_1 \neq bs_2$ and $h(bs_1) = h(bs_2)$.

Clearly both hash functions from Example 8.1 are not cryptographic hash functions. For instance, consider the first function $h$ from Example 8.1. It is not preimage resistant, because given $b \in \{0,1\}$ we simply take $b$ as preimage and have $h(b) = b$, that is, finding preimages is trivial. The same holds for second preimage resistance and collision resistance: appending the digest bit to any bit string yields an input with the desired hash value.

As we will see in this chapter, the IT forensic community adopted the use of cryptographic hash functions for two main purposes: ensuring authenticity and integrity of a digital trace, and automatic file identification. In both cases resistance against finding (second) preimages is crucial, because the hash value of the input serves as a unique identifier for its preimage. If such an identifier is given and if we are able to find a preimage which differs from the actual input, both IT forensic use cases are corrupted.
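The two toy functions of Example 8.1 are small enough to write down directly. The following Python sketch is only an illustration: the function names are hypothetical, and bit strings are modelled as Python strings of the characters 0 and 1, which is an assumption of this sketch rather than part of the definitions. It reproduces the values quoted in Example 8.1 and shows why finding a preimage or a collision for the first function is trivial.

```python
def h_lsb(bits: str) -> str:
    """Example 8.1 (1): n = 1, the least significant bit; the empty string maps to 0."""
    return bits[-1] if bits else "0"


def h_mod4(bits: str) -> str:
    """Example 8.1 (2): n = 2, the input read as a binary integer modulo 4."""
    value = int(bits, 2) if bits else 0
    return format(value % 4, "02b")


def trivial_preimage(digest_bit: str) -> str:
    """For h_lsb the digest bit is its own preimage, so no search is needed."""
    return digest_bit


# Values quoted in Example 8.1.
assert h_lsb("10101") == h_lsb("11") == h_lsb("1") == "1"
assert h_lsb("1000") == h_lsb("10") == h_lsb("0") == "0"
assert h_mod4("10110") == h_mod4("10") == "10"   # 22 mod 4 = 2
assert h_mod4("1000") == h_mod4("0") == "00"     # 8 mod 4 = 0

# Preimage: given the digest "1", the bit string "1" maps to it.
assert h_lsb(trivial_preimage("1")) == "1"

# Collision: any two different strings ending in the same bit collide.
assert "0101" != "1111" and h_lsb("0101") == h_lsb("1111")
```

A real cryptographic hash function differs precisely in that no such shortcut for preimages or collisions is known.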

If $h$ is a hash function, then a necessary condition for $h$ to be a cryptographic hash function is that the bit length $n$ of its digest is sufficiently large. For preimage resistance and second preimage resistance we have to impose $n \geq 100$; for collision resistance $h$ has to satisfy $n \geq 200$, because a generic birthday attack finds a collision after roughly $2^{n/2}$ hash evaluations. We therefore recommend adopting the stronger requirement and only applying hash functions with $n \geq 200$. Sample cryptographic hash functions which are used in digital forensics are MD5 ($n = 128$), SHA-1 ($n = 160$) and hash functions from the SHA-2 family (e.g., SHA-256 with $n = 256$, [21]). Note that MD5 and SHA-1 do not meet the recommendation $n \geq 200$. For further details we refer to Table 8.1.

Name        MD5    SHA-1    SHA-256    SHA-512    RIPEMD-160
n (bits)    128    160      256        512        160

Table 8.1: Sample cryptographic hash functions.

One important implication of the security properties of a cryptographic hash function is the avalanche effect. If we change the input bit string, then every bit of the output is expected to change its value with probability 50%, i.e., we do not have any control over the output if the input changes. As a consequence, if only one single bit of the original input bit string $bs$ is changed to get a tampered bit string $bs'$, the two outputs $h(bs)$ and $h(bs')$ look very different. We demonstrate the avalanche effect on the basis of similar ASCII strings in Example 8.2.

Example 8.2: Avalanche effect
We demonstrate the avalanche effect by applying SHA-256 to a simple ASCII string: in the first string, Wolfgang claims to give Angela 1 million EUR, while the amount changes slightly to 1 billion EUR in the second string. However, the respective SHA-256 hash values look very different.

$ echo Dear Angela, I give you 1 million EUR. Wolfgang | sha256sum
cb10cfd3b6d47af94cd48c096c606ec8d2d836e80c7f87701ff450267efb
$ echo Dear Angela, I give you 1 billion EUR. Wolfgang | sha256sum
8dc377ef008781d dc7235aff7ac06e39a523eb7fda9ad547f6c4e -

The Linux command echo prints the given string (including a subsequent new line character) to standard output. The Linux implementation of SHA-256, sha256sum, takes this string as input. The number of output characters of sha256sum is 256/4 = 64, because each group of 4 bits of the hash value is printed as one hexadecimal digit.

The avalanche effect is desirable in the context of unique identifiers or integrity of a trace, because it is easy to distinguish different input bit strings by comparing their respective hash values. However, the avalanche effect prevents the detection of similar objects. It is important to keep this property in mind for the two use cases of cryptographic hash functions in IT forensics.

Bloom filter

This section introduces Bloom filters, which are an important concept for approximate matching. Bloom filters are commonly used to represent the elements of a finite set $S$. A Bloom filter is an array of $m$ bits, initially all set to zero. In order to insert an element $s \in S$ into the filter, $k$ independent hash functions $h_0, h_1, \ldots, h_{k-1}$ are needed, where each hash function outputs a value between $0$ and $m - 1$. The element $s$ is hashed by all $k$ hash functions, and the bits of the Bloom filter at the positions $h_0(s), h_1(s), \ldots, h_{k-1}(s)$ are set to one.
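The insertion step, together with the corresponding membership test, can be sketched in a few lines of Python. This is a minimal illustration and not the construction used by any particular forensic tool: the class name BloomFilter is hypothetical, and deriving the k hash functions from SHA-256 by prefixing a counter is an assumption of this sketch.

```python
import hashlib


class BloomFilter:
    """Array of m bits with k hash functions (illustrative sketch)."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.bits = [0] * m  # initially all bits are zero

    def _positions(self, item: bytes) -> list[int]:
        # Assumption of this sketch: h_i(item) = SHA-256(i || item) mod m.
        positions = []
        for i in range(self.k):
            digest = hashlib.sha256(i.to_bytes(4, "big") + item).digest()
            positions.append(int.from_bytes(digest, "big") % self.m)
        return positions

    def insert(self, item: bytes) -> None:
        # Set the bits at positions h_0(item), ..., h_{k-1}(item) to one.
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item: bytes) -> bool:
        # True only if all k positions are set; a zero bit rules the item out.
        return all(self.bits[pos] for pos in self._positions(item))


bloom = BloomFilter(m=1024, k=4)
for element in (b"alpha", b"beta", b"gamma"):
    bloom.insert(element)

print(bloom.might_contain(b"alpha"))  # True: inserted elements are always found
print(bloom.might_contain(b"delta"))  # usually False; True would be a false positive
```

Because insertion only ever sets bits, an inserted element can never be reported as absent; an element that was never inserted may, however, be reported as present if other elements happened to set all of its positions.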

To answer the question whether an element $s'$ is in $S$, we compute $h_0(s'), h_1(s'), \ldots, h_{k-1}(s')$ and check whether the bits at the corresponding positions in the Bloom filter are set to one. If this holds, $s'$ is assumed to be in $S$; however, we may be wrong, as the bits may have been set to one by different elements from $S$. Hence Bloom filters suffer from a non-trivial false positive rate. Otherwise, if at least one bit is zero, we know that $s' \notin S$. Consequently the false negative rate is equal to zero.

In case of uniformly distributed data, the probability that a certain bit is set to one by a single hash function during the insertion of an element is $1/m$, i.e., the probability that the bit is still zero is $1 - 1/m$. After inserting $n$ elements into the Bloom filter (here $n$ denotes the number of inserted elements, not the digest length), each with $k$ hash functions, the probability of a given bit position to be one is $1 - (1 - 1/m)^{kn}$. In order to have a false positive, all $k$ array positions need to be set to one. Hence the probability $p$ for a false positive is

$$p = \left(1 - \left(1 - \frac{1}{m}\right)^{kn}\right)^{k} \approx \left(1 - e^{-kn/m}\right)^{k}. \qquad (8.2)$$

Approximate matching: the concept

Often it is useful in computer science to identify similar digital objects. Prominent use cases are spam detection, malware analysis, network-based anomaly detection, biometrics, and digital forensics.

We first remark that although similarity has a natural meaning for us, a formal definition is still missing. The corresponding NIST special publication draft [22] only describes approximate matching in terms of use cases, terminology, and requirements. We therefore skip a definition, too.

The basic aim of approximate matching is to extend the yes/no outcome of a cryptographic hash function to a continuous one in the scope of automatic detection of a digital object.
As explained in the subsection on cryptographic hash functions, a cryptographic hash function yields a binary decision identical/differing for a comparison of two input bit strings: identical is encoded for instance as the integer 1, differing as the integer 0. The output of an approximate matching comparison, on the other hand, is a match score in the interval $[0,1]$, where 1 means a high level of similarity and 0 a low level.

The NIST draft [22] mentions two use case classes of similarity, each with two challenges. First, approximate matching aims at finding the resemblance of two objects. The two challenges within this class are object similarity detection (e.g., different versions of a document) and cross correlation, i.e. finding digital artefacts which share a common object (e.g., two files sharing an identical picture). Second, approximate matching should detect containment. Here [22] lists the two challenges fragment detection (e.g., identifying a cluster of a deleted blacklisted file or an IP packet transferring a fragment of a classified document) and embedded object detection, i.e. finding an indexed trace within a digital artefact (e.g., a picture embedded in another artefact).

The concept of approximate matching comprises two core functions: a similarity digest generation function and a similarity comparison function. In the terminology of [22] the first one is called the feature extraction function and the latter one is denoted as similarity function. We prefer our notation because it describes the goal of the respective function more clearly.

Given an input object, the similarity digest generation function identifies characteristic patterns within the given object. As usual these patterns are called features. The specification of an approximate matching algorithm therefore describes how to extract features from the given input. The set of all features is the output of the similarity digest generation function and is called the similarity digest.

The similarity comparison function takes as input two similarity digests and outputs a match score in $[0,1]$. The closer the match score is to 1, the more similar the corresponding two inputs of the similarity digest generation function are considered to be.

As usual with noisy input, the user of approximate matching has to define a threshold to decide about similarity. As a consequence approximate matching suffers from the well-known error rates: the false match rate (FMR) describes the proportion of dissimilar objects falsely declared to match the compared object. Conversely, the false non-match rate (FNMR) describes the proportion of similar objects falsely declared not to match the compared object.

Similarity may be considered on different layers of abstraction. The NIST draft [22] distinguishes three layers:

1. First, bytewise approximate matching takes a bit string as input for the similarity digest generation function without any high-layer interpretation of the string, that is, the features are extracted directly from the input bit string. Bytewise approximate matching is therefore a general approach and may be applied to any bit string. However, it assumes that similar artefacts which are of interest for the digital forensic investigator are represented by similar bit strings; otherwise it fails within this use case. Bytewise approximate matching is often referred to as fuzzy hashing or similarity hashing.
2. Second, semantic approximate matching takes the interpretation of the application data into account and simulates the human similarity perception procedure. For instance, semantic approximate matching in the scope of pictures extracts the features from the visual perception of the picture rather than from its low-layer representation. Semantic approximate matching is often referred to as perceptual hashing or robust hashing.
3. Third, syntactic approximate matching is based on standardised internal structures of an artefact. For instance, within network packets a syntactic approximate matching algorithm may work on fields like source/destination MAC/IP addresses, ports, and protocols.

As bytewise and semantic approximate matching are useful for data reduction, we give more insights into these approaches in the subsequent sections. Breitinger et al. [5] provide an in-depth overview, and we summarise and extend their key aspects in what follows.

Bytewise approximate matching

According to Breitinger et al. [5] there are seven bytewise approximate matching algorithms published by the digital forensic community. In this section we review the three main approaches of feature extraction which seem to be the most promising ones.

The first feature extraction approach is used by the well-known bytewise approximate matching algorithms ssdeep (due to Kornblum [14]) and mrsh-v2 (due to Breitinger and Baier [3]). The similarity digest generation function subdivides the input byte stream (denoted as $m$) into chunks $m_1, m_2, \ldots$ as depicted in Figure 8.1. The basic idea is that two digital artefacts are similar if they share a sufficient number of chunks.
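The chunk-based idea can be illustrated with a small Python sketch. It is deliberately simplified and is neither ssdeep nor mrsh-v2: the rule that ends a chunk (the sum of the last 7 bytes modulo 64 equals 63), the use of SHA-256 to represent a chunk, and the score (shared chunks divided by all distinct chunks) are assumptions chosen for readability, and all function names are hypothetical.

```python
import hashlib


def split_into_chunks(data: bytes, window: int = 7, modulus: int = 64) -> list[bytes]:
    """Cut the byte stream wherever the last `window` bytes satisfy a fixed condition."""
    chunks = []
    start = 0
    for end in range(window, len(data) + 1):
        # Assumed condition of this sketch: window sum is congruent to modulus - 1.
        if sum(data[end - window:end]) % modulus == modulus - 1:
            chunks.append(data[start:end])
            start = end
    if start < len(data):
        chunks.append(data[start:])  # remaining bytes form the final chunk
    return chunks


def chunk_digests(data: bytes) -> set[str]:
    """Represent every chunk by the SHA-256 hash of its content."""
    return {hashlib.sha256(c).hexdigest() for c in split_into_chunks(data)}


def similarity(a: bytes, b: bytes) -> float:
    """Match score in [0, 1]: shared chunks divided by all distinct chunks."""
    da, db = chunk_digests(a), chunk_digests(b)
    if not da and not db:
        return 1.0
    return len(da & db) / len(da | db)


original = bytes((i * 37 + 11) % 256 for i in range(4096))
modified = original[:2000] + b"X" + original[2000:]

print(len(split_into_chunks(original)))
print(similarity(original, original))
print(similarity(original, modified))
```

Because the cut positions depend only on the content of a short window, inserting a byte changes the chunk that contains it, while chunks that start after the next cut position are found again unchanged. This is what allows two similar inputs to share most of their chunks.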

Figure 8.1: Feature extraction of ssdeep and mrsh-v2

The end of a chunk $m_i$ (and thus the beginning of the subsequent chunk $m_{i+1}$) is called a trigger point. Such a trigger point is found if the final $r$ bytes before the trigger point meet a certain condition (typically $r = 7$, and these $r$ bytes determine an integer value which has to match a predefined value for triggering). Each chunk represents a feature of the input and the feature set is the sequence of chunks, i.e. the input byte stream is fully covered by the feature set.

To represent a feature, it is hashed by a hash function $h$ (e.g., $h$ is FNV, the Fowler/Noll/Vo hash, for ssdeep and $h$ is MD5 for mrsh-v2), and its hash value is represented either by a Base64 character (ssdeep) or by a Bloom filter (mrsh-v2). In case of ssdeep the similarity digest is a sequence of Base64 characters; in case of mrsh-v2 it is a sequence of Bloom filters. In Example 8.3 we compute the ssdeep similarity digest of the photo given in Figure 8.2.

Figure 8.2: Sample input hacker-siedlung.jpg of ssdeep

Example 8.3: Similarity digest computation of ssdeep
We compute the ssdeep similarity digest of the photo given in Figure 8.2.

$ ls -l hacker-siedlung.jpg
-rw baier baier :16 hacker-siedlung.jpg
$ ssdeep -l hacker-siedlung.jpg
ssdeep,2.13--blocksize:hash:hash,filename
1536:ZfICsORJt2PazD7Z2xqHmqL36uuXtrHTXkkknIKB+W2pDHviF4eYySb:\
ZfICNRf2CD7YwGqL36FXVTXQnIWgDvi2,"hacker-siedlung.jpg"

We first look at the file size reported by ls. Then we invoke ssdeep; its flag -l suppresses the listing of the whole path in the output of ssdeep. The output lists the block size and the two parts of the similarity digest, separated by colons, followed by the file name after a comma.

The block size determines when a trigger point is found. It aims at splitting the input byte stream into approximately 64 chunks. It is always of the form $3 \cdot 2^k$, where $k$ is the smallest value with $3 \cdot 2^k \cdot 64 \geq$ file size. In our example the quotient of the file size and $3 \cdot 64$ is 410.6, thus $k = 9$ and the block size is $1536 = 3 \cdot 2^9$.

After the first colon we get the first part of the ssdeep similarity digest, corresponding to the block size 1536. It consists of 55 Base64 characters, where the character Z represents the hash value $h(m_1)$, f the hash value $h(m_2)$, and b the hash value of the final chunk $h(m_{55})$. After the second colon we see the second part of the ssdeep similarity digest, corresponding to the doubled block size $2 \cdot 1536 = 3072$. We expect approximately half of the chunks.

The second feature selection strategy is to extract statistically improbable features. This strategy is implemented by sdhash of Roussev [24]. The basic idea is that uncommon patterns serve as the baseline for similarity. A statistically improbable feature within sdhash is a sequence of 64 bytes with a high Shannon entropy, that is, with a sufficiently large number of different bytes. The feature set of sdhash is the sequence of the statistically improbable features, which are represented by Bloom filters. There is a parallelised version available for use in large-scale investigations [25].

The third feature selection strategy is based on a majority vote of bit appearance with a subsequent run length encoding. This approach is used by mvhash-b due to Breitinger et al. [4]. The majority vote step replaces each byte of the input byte string by either a 0x00 byte or a 0xFF byte. The mapping depends on the neighbourhood of the respective byte: if the 0 bits predominate in its neighbourhood, the byte is mapped to 0x00, otherwise it is mapped to 0xFF. Then run length encoding is used, where each sequence of identical bytes is replaced by its length. The basic idea of similarity is that predominating regions of a certain bit are characteristic for digital objects. The integers of the run length encoding are then inserted into Bloom filters. The similarity digest of mvhash-b is therefore a sequence of Bloom filters.

Semantic approximate matching

As semantic approximate matching extracts perceptual features, it is bound to a certain area of application, for instance images, audio streams or videos. Again Breitinger et al. [5] present an overview of semantic approximate matching algorithms in the context of pictures. This branch dates back to the early 1990s, when content-based image retrieval was an emerging research topic.

There are different feature classes which are used for image approximate matching. Breitinger et al. [5] mention histograms, low-frequency coefficients (e.g., from the discrete cosine transform), block bitmaps and projection-based features.

To get an idea of image approximate matching, we briefly explain a block bitmap approach used by the robust hashing algorithm rhash due to Steinebach [29]. The similarity digest generation process of rhash is depicted in Figure 8.3. The bit length of the rhash value is fixed in advance. As usual we denote it by $n$.

Figure 8.3: Similarity digest generation of rhash [29]

In a first step, the input image is converted to greyscale and normalised (e.g., to a preset size and with respect to orientation). Then the normalised greyscale picture is subdivided into $n$ disjoint blocks which cover the image. For instance, if $n$ is a square number, then rhash subdivides the image into $\sqrt{n}$ equally sized rows and $\sqrt{n}$ equally sized columns. The sample in Figure 8.3 makes use of $n = 256 = 16^2$, that is, the input picture comprises 16 rows and 16 columns.

Next, for each block $i$ with $0 \leq i \leq n - 1$, rhash computes the mean of its pixel values. We denote the mean of the $i$-th block by $M_i$ and the median of the sequence $(M_i)_{0 \leq i \leq n-1}$ by $Md$. Finally, block $i$ contributes to the rhash similarity digest the bit $h_i$, where $h_i = 0$ if and only if $M_i < Md$. A sample rhash similarity digest is given on the right in Figure 8.3.

8.2 Authenticity and integrity of digital traces

In this section we look at the first use case of hash functions in digital forensics: ensuring authenticity and integrity of digital traces during the IT forensic process (e.g., during data acquisition). Remember that authenticity means that the origin of a digital trace is validated, while integrity describes the property that a digital trace did not change.

The use case authenticity and integrity of digital traces is relevant for both dead and live analysis. We will focus on dead analysis in what follows (i.e., the digital forensic expert makes use of his own software), but we keep in mind that traces which are acquired from a live system (e.g., main memory) must be protected by hash values, too.

From Section 8.1 we know that cryptographic hash functions ensure integrity and authenticity by design due to their preimage resistance and second preimage resistance (see Definition 8.2). For this reason the use case authenticity and integrity of digital traces assumes the usage of cryptographic hash functions.
An important issue is that we have to protect the hash values against tampering. There are two alternatives to achieve this goal. First, the classical analogue approach is to write down the hash values by hand in the narrative minutes (e.g., in the investigation notebook). The hash values are then protected by the assumption that it is impossible to forge the handwriting of the investigator. Second, the digital approach is to compute a digital signature over the hash values. This requires a private cryptographic key which is related to the investigator. In this case the hash values are protected by the assumption that it is impossible to forge a digital signature.

We now discuss the use case authenticity and integrity of digital traces by looking at the classical data acquisition process of a dead system. In summary, the paradigm is to first generate a master copy from the original device (because the original device must be touched as little as possible). Then the master copy is copied bitwise to get the working copy. Even if we only perform read-only commands on the working copy, we must prove later on that the working copy did not change during the investigation (and hence that any trace is directly extractable from the original device). The steps are as follows:

1. Compute hash value $h_1$ over the whole original volume.
2. Write hash value $h_1$ down in the physical logbook.
3. Make a 1-to-1 copy of the volume using dd. This is the master copy of the original device.
4. Compute hash value $h_2$ over the master copy.
5. Write hash value $h_2$ down in the physical logbook.
6. Compare $h_1$ and $h_2$: if both hash values match, the master copy is identical to the original device. Otherwise, the master copy has to be generated again.
7. Generate a 1-to-1 copy of the master copy using dd. This is the working copy.
8. Compute hash value $h_3$ over the working copy.
9. Write hash value $h_3$ down in the physical logbook.
10. Compare $h_2$ and $h_3$: if both hash values match, the working copy is identical to the master copy and thus to the original device, too. Otherwise, the working copy has to be generated again.
11. Perform the investigation read-only on the working copy and extract digital traces.
12. To finish the investigation and to prove integrity of the working copy, compute the hash value $h_4$ of the working copy after the investigation and check whether $h_1 = h_4$ holds. If yes, any digital trace is directly related to the original device; otherwise the investigator has to identify the step where he changed the working copy.

We show how to apply this process using the well-known cryptographic library openssl in Example 8.4.

Example 8.4: Acquire first partition of an HDD
In Linux, storage media are typically identified by a device (that is, a file in the directory /dev) starting with the two letters sd (historically for SCSI device) and a subsequent character to distinguish different devices. For instance, the first HDD is referred to as /dev/sda, an attached USB stick is then mapped to /dev/sdb, an external SSD is identified as /dev/sdc, and so on.

In our example we assume that our HDD is the device /dev/sda. Its first partition is then identified by a digit following the device name (e.g., /dev/sda1); an extended partition may be the device /dev/sda5. We apply the general acquisition process and compute the SHA-256 hash value of this partition.

In this example we make use of the openssl tool, because openssl is the most common implementation of cryptographic algorithms like hash functions, encryption or digital signatures. After invoking openssl we have to tell the tool which class of cryptographic algorithms we want to use. Cryptographic hash functions are selected by the digest command dgst. The remaining arguments are the chosen hash function (the flag -sha256) and the input bit string of the hash function (in our example the first partition of the HDD, /dev/sda1).
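The comparisons of the acquisition process can also be scripted. The following Python sketch uses the standard hashlib module instead of openssl; the function names and the image file names in the usage comment are hypothetical, and the sketch operates on ordinary files so that it can be tried without access to a block device.

```python
import hashlib


def sha256_of_file(path: str, block_size: int = 1 << 20) -> str:
    """Hash a file (or an image of a volume) block by block to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def copies_match(original: str, master: str, working: str) -> bool:
    """Hash comparisons of the process: h1 == h2 and h2 == h3."""
    h1 = sha256_of_file(original)
    h2 = sha256_of_file(master)
    h3 = sha256_of_file(working)
    return h1 == h2 and h2 == h3


def unchanged_after_investigation(recorded_h1: str, working: str) -> bool:
    """Final check: the hash h4 of the working copy must equal the recorded h1."""
    return sha256_of_file(working) == recorded_h1


# Usage with hypothetical image names (reading a device usually requires root):
#   copies_match("/dev/sda1", "mastercopy-sda1.dd", "workingcopy-sda1.dd")
```

Such a script does not replace the logbook entries or a digital signature: the computed hash values still have to be protected against tampering as described above.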

# openssl dgst -sha256 /dev/sda1
SHA256(/dev/sda1)= b9c028c604b5a1dfaf8acf0098e7f26de32fd47\
38c581d9b6cbc84c98b28f39b
# dd if=/dev/sda1 of=mastercopy-sda1.dd
# openssl dgst -sha256 mastercopy-sda1.dd
SHA256(mastercopy-sda1.dd)= b9c028c604b5a1dfaf8acf0098e7f26de32fd47\
38c581d9b6cbc84c98b28f39b

As both hash values match, we generate the working copy and check the respective hash values.

# dd if=mastercopy-sda1.dd of=workingcopy-sda1.dd
$ openssl dgst -sha256 workingcopy-sda1.dd
SHA256(workingcopy-sda1.dd)= b9c028c604b5a1dfaf8acf0098e7f26de32fd47\
38c581d9b6cbc84c98b28f39b

Again both hash values match, that is, the working copy is bitwise identical to the first partition of our HDD. We next investigate the working copy read-only. In the last step we check that the working copy did not change during the processing, which we prove by applying SHA-256 to the image of the working copy after our investigation.

$ openssl dgst -sha256 workingcopy-sda1.dd
SHA256(workingcopy-sda1.dd)= b9c028c604b5a1dfaf8acf0098e7f26de32fd47\
38c581d9b6cbc84c98b28f39b

The hash value of the working copy after the investigation matches the hash value of /dev/sda1, and thus any digital trace from the working copy is extractable from the partition, too.

If for some reason the final hash value does not match, the investigator has to carefully analyse his narrative minutes to find a step where he modified the working copy. An example of destroyed integrity is given in what follows:

$ openssl dgst -sha256 workingcopy-sda1.dd
SHA256(workingcopy-sda1.dd)= df69b585b1a1af40b1c71d4fe9792fd1e843f8a\
2fe0c5c3a39aa205e652aabe4

8.3 Identification of known digital objects

An important issue in contemporary investigations of computer crime is handling the huge amount of data. The reason is that, as of today, information is stored and distributed in a digital rather than an analogue way.
Low costs of storage devices and cheap unlimited access to the Internet support our ubiquitous use of digital devices. As a consequence, a digital forensic investigation typically confronts the IT forensic experts with terabytes of data stored on different sorts of physical or virtual devices: a classical personal computer, a laptop, a tablet PC, a smartphone, a mail provider, or a cloud service provider, to name only a few.

The terabytes of data can be seen as a big haystack, in which the actual evidence of some megabytes has to be found, that is, the investigator's task is to find the needle in the haystack. In this section we present concepts which automatically preprocess the terabytes of input data to support the investigator in proving or refuting a hypothesis. If we use the metaphor of finding the needle in the haystack, two concepts are obvious:

1. First, decreasing the haystack means to scale down the actual data which has to be inspected by the digital forensic expert. This concept is known as whitelisting or filtering out. Any object from the suspect's drive which is indexed by the whitelist is not considered for further inspection. We discuss whitelisting in the corresponding subsection of this section.
2. Second, increasing the needle means to find hints to suspicious data structures which actually support a certain hypothesis. These hints have to be confirmed manually by the investigator. This concept is known as blacklisting or filtering in. We discuss blacklisting separately.

For both concepts we need databases of irrelevant data (i.e. a whitelist) or incriminated files (i.e. a blacklist), respectively. The most common whitelist is the Reference Data Set (RDS) from the US-NIST National Software Reference Library (NSRL) [23]. The blacklist is case dependent (e.g., pictures of child abuse, classified documents).

The most common basic technology for indexing files is hash functions. The procedure is quite simple: for each object of the seized device (e.g., a file), calculate the corresponding digest and compare this fingerprint against a white- or blacklist, respectively. As of today cryptographic hash functions (e.g., SHA-1, SHA-256 [21]) are used. Cryptographic hash functions are very efficient and effective in detecting bitwise identical duplicates, but they fail in revealing similar objects.
However, investigators are typically interested in the automatic identification of similar objects, for instance to detect the correlation between a blacklisted picture of child abuse and its thumbnail, which was discovered on a seized device.

Whitelisting

A whitelist is an index of known-to-be-good objects, that is, of non-suspicious patterns. The concept of whitelisting is quite simple: any object from the suspect's drive (typically an object is simply a file) which is indexed by the whitelist is not considered for further inspection. Therefore whitelisting is also referred to as filtering out.

In order to handle a whitelist with respect to memory, a compressed representation of each whitelisted object is used. Additionally, as whitelisted objects are not considered for further investigation, the false match rate (FMR) must be 0. Otherwise it would be possible for an attacker to have relevant digital traces filtered out. Therefore whitelists are based on cryptographic hash functions.

The most common whitelist is the Reference Data Set (RDS) from the US-NIST National Software Reference Library (NSRL) [23]. The RDS indexes files. Its website states:

The RDS is a collection of digital signatures of known, traceable software applications. There are application hash values in the hash set which may be considered malicious, i.e. steganography tools and hacking scripts. There are no hash values of illicit data, i.e. child abuse images.
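The filtering-out step reduces to a set lookup on hash values. The following Python sketch is an illustration under stated assumptions: the function names are hypothetical, the whitelist is modelled as an in-memory set of upper-case hexadecimal SHA-1 strings, and files are modelled as a mapping from names to contents instead of being read from a seized device.

```python
import hashlib


def sha1_hex(content: bytes) -> str:
    """SHA-1 digest as an upper-case hexadecimal string (format assumed for the list)."""
    return hashlib.sha1(content).hexdigest().upper()


def filter_out(files: dict[str, bytes], whitelist: set[str]) -> list[str]:
    """Return the names of all files that are NOT indexed by the whitelist."""
    return sorted(
        name for name, content in files.items()
        if sha1_hex(content) not in whitelist
    )


# Hypothetical sample data: one known system file and one personal file.
seized_files = {
    "system.dll": b"content of a known operating system file",
    "notes.txt": b"content of a personal document",
}
whitelist = {sha1_hex(b"content of a known operating system file")}

print(filter_out(seized_files, whitelist))  # ['notes.txt']
```

Since the lookup is an exact comparison of cryptographic hash values, changing a single bit of a whitelisted file removes it from the match, which is the behaviour required for a false match rate of 0.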

Example 8.5

We list sample entries of the NSRL Reference Data Set.

    $ less NSRLFile.txt
    "SHA-1","MD5","CRC32","FileName","FileSize","ProductCode","OpSystemCode","SpecialCode"
    " EDD92C4E3D2E F849","392126E756571EBF112CB1C1CDEDF926","EBD105A0",\
    "I05002T2.PFB",98865,3095,"WIN",""
    " DA6391F7F5D2F7FCCF36CEBDA60C6EA02","0E53C14A3E48D94FF596A B492","AA6A7B16",\
    "00br2026.gif",2226,228,"WIN",""
    "000000A9E47BD385A0A3685AA12C2DB6FD727A20","176308F27DD52890F013A3FD80F92E51","D749B562",\
    "femvo523.wav",42748,4887,"macosx",""
    " AFA836117B1B572FAE4713F200567","9B3702B0E788C6D FE3C9786A","05E566DF",\
    "J JPG",32768,18266,"358",""
    " AFA836117B1B572FAE4713F200567","9B3702B0E788C6D FE3C9786A","05E566DF",\
    "J JPG",32768,2322,"WIN",""
    " AFA836117B1B572FAE4713F200567","9B3702B0E788C6D FE3C9786A","05E566DF",\
    "J JPG",32768,2575,"WIN",""
    " AFA836117B1B572FAE4713F200567","9B3702B0E788C6D FE3C9786A","05E566DF",\
    "J JPG",32768,2583,"WIN",""
    " AFA836117B1B572FAE4713F200567","9B3702B0E788C6D FE3C9786A","05E566DF",\
    "J JPG",32768,3271,"WIN",""
    " AFA836117B1B572FAE4713F200567","9B3702B0E788C6D FE3C9786A","05E566DF",\
    "J JPG",32768,3282,"UNK",""

We see that the image J JPG has a file size of 32768 bytes. It is listed six times, because the product code or the operating system code differs.

The RDS is updated four times a year. As of May 2015, the current release is RDS 2.48, which contains about 21 million unique files. Its size is about 6 GiB. As listed in Example 8.5, each entry of the RDS lists the SHA-1, MD5 and CRC32 checksums together with the file name and file size of the indexed file. The entries are ordered by the numerical value of their SHA-1 hashes. Hence it is easy to decide whether an input file is indexed by the RDS.

Although filtering out using the RDS is widespread, only few results are available about its effectiveness.
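Before turning to those results, the remark that the SHA-1 ordering makes lookups easy can be illustrated with a binary search. The following is a minimal sketch, assuming (hypothetically) that the SHA-1 column of the RDS has already been loaded into a Python list of upper-case hexadecimal strings in ascending order; the function name `is_indexed` and the sample list are ours. For fixed-length upper-case hexadecimal strings, lexicographic order coincides with numerical order, so plain string comparison suffices.

```python
from bisect import bisect_left


def is_indexed(sorted_digests, digest):
    """Binary search for `digest` in an ascending list of hex SHA-1 strings."""
    digest = digest.upper()
    position = bisect_left(sorted_digests, digest)
    return position < len(sorted_digests) and sorted_digests[position] == digest


# Hypothetical sample: one complete digest taken from Example 8.5, followed by
# the well-known SHA-1 digests of b"abc" and of the empty input as stand-ins.
sample = [
    "000000A9E47BD385A0A3685AA12C2DB6FD727A20",
    "A9993E364706816ABA3E25717850C26C9CD0D89D",
    "DA39A3EE5E6B4B0D3255BFEF95601890AFD80709",
]
```

A lookup such as `is_indexed(sample, digest)` needs a number of comparisons that grows only logarithmically with the number of entries, which matters for a list with about 21 million unique files.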
Back in 2008 Douglas White from NIST claimed in a presentation at the American Academy of Forensic Sciences (AAFS) that file-based data reduction leaves an average of 30% of disk space for human investigation. However, the RDS only indexes application hash values; it does not take any personal files into account. Therefore Baier and Dichtelmüller [2] performed a study on data reduction for different user profiles. The baseline of their research is the data reduction in terms of the number of files rather than disk space (because an investigator has to look at files rather than at a certain amount of memory).

The methodology of Baier and Dichtelmüller [2] is to model different kinds of user behaviour and their corresponding file generation characteristics. Their data reduction rates for different profiles are given in Table 8.2. Here $M_G$ denotes the number of generated files in the file system of the respective user profile and $M_{RDS}$ the number of those files that are also indexed by the RDS. The data reduction rate is the ratio of the number of indexed files to the number of all files, that is $R = M_{RDS} / M_G$. To be effective, $R$ should be as close as possible to 1.

For instance, the first row in Table 8.2 shows the result for a Windows XP operating system installation only, that is, without any user files. Even then, only 52.45% of the files in the file system are indexed by the RDS. Obviously, the reduction rate decreases if we add user files. For example, if we model a user who mainly uses his computer for playing games (i.e. the profile gamer), the

data reduction rate is below 15%. In this case the investigator has to inspect the remaining 85% of the files manually. These results are informally confirmed by practitioners, who are surprised by the high data reduction rates of Baier and Dichtelmüller [2] and report an expected data reduction rate of 5% for their cases.

To sum up, the haystack does not decrease significantly using the RDS. As applying the whitelist during preprocessing takes a lot of effort, our overall assessment is that whitelisting is not effective for automatically preprocessing bulk data.

Profile                 Nr. of files M_G    Indexed by RDS M_RDS    Data reduction rate R
XP, OS only             10,467              5,[...]                 52.45%
XP, standard software   22,801              9,[...]                 [...]%
XP gamer                126,684             18,[...]                [...]%
W7, OS only             56,233              18,[...]                [...]%
W7 standard software    77,601              23,[...]                [...]%
W7 universal            322,128             42,[...]                [...]%
Ubuntu                  [...],789           26,[...]                [...]%

Table 8.2: Data reduction rates for different user profiles using RDS [2]

Blacklisting

In contrast to a whitelist, a blacklist indexes known-to-be-bad objects, that is, suspicious patterns. If an object from the suspect's drive matches an element of the blacklist, the investigator gets a hint to a digital trace, which he inspects manually. Thus blacklisting is also called filtering in. Again, to keep its memory footprint manageable, a blacklist stores a compressed representation of each of its elements.

In this section we assess different aspects of cryptographic hash functions and approximate matching in the scope of blacklisting. The aspects and our assessment are summarised in Table 8.3. To illustrate our rating, we assign rating categories in descending order, starting with + for the best rating.
Property                      Cryptographic hash function   Bytewise approximate matching     Semantic approximate matching
Run-time efficiency           very fast + (factor 1)        fast - medium (factor 1.5 to 6)   slow (factor 20 to 500)
Compression                   short + (256 bits)            1% to 3% of input length          short (256 to 600 bits)
Object similarity detection   No                            Yes +                             Yes +
Cross correlation             No                            Yes +                             No
Fragment detection            No                            Yes +                             No
Embedded object detection     No                            Yes +                             No
Domain specific               No +                          No +                              Yes (e.g., only images)
Encoding dependency           Yes                           Yes                               No +
FMR / FNMR                    0% +                          Dependent                         Dependent
Indexing                      Yes +                         Inefficient                       Inefficient

Table 8.3: Assessment of hash functions with respect to blacklisting

We first turn to the aspect of efficiency, that is, run-time and memory efficiency. Our rating of run-time efficiency is based on the experiments of Breitinger et al. [5].

We assign a relative speed of 1 to cryptographic hash functions. Bytewise approximate matching is then slower by a factor of 1.5 up to considerably more; for example, mrsh-v2 has a speed comparable to SHA-1, while sdhash is much slower. However, bytewise approximate matching is typically much faster than semantic approximate matching, because the latter requires more complex computational steps.

With respect to compression, both cryptographic hash functions and semantic approximate matching perform well: the hash value is of small, fixed size. Bytewise approximate matching, on the other hand, outputs similarity digests of variable length proportional to the input size (with the exception of ssdeep). For instance, a 1 TiB input requires 10 GiB to 30 GiB for its bytewise approximate matching blacklist. This constitutes a key drawback of bytewise approximate matching.

We next assess the aspect of resemblance (see Section 8.1.3). Both bytewise and semantic approximate matching are able to decide about object similarity, which is not the case for cryptographic hash functions. With regard to cross correlation (i.e. finding digital artefacts that share a common object), only bytewise approximate matching succeeds. The same holds for the aspect of containment, i.e. fragment detection and embedded object detection: only bytewise approximate matching copes with containment.

The next aspect is dependency with respect to application area and representation, respectively. Both cryptographic hash functions and bytewise approximate matching consider the byte stream of an object, hence they are not bound to a specific application domain (e.g., image similarity, audio similarity). However, as semantic approximate matching extracts features to simulate human perception, it is bound to a certain application domain.
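The statement above that cryptographic hash functions work on the byte stream and cannot decide about object similarity can be made tangible with a small experiment. The sketch below is purely illustrative: the helper name `differing_bits` and the two sample inputs are hypothetical. It hashes two inputs that differ in a single byte with SHA-256 and counts the bit positions in which the digests differ.

```python
import hashlib


def differing_bits(a: bytes, b: bytes) -> int:
    """Count the bit positions in which the SHA-256 digests of a and b differ."""
    digest_a = hashlib.sha256(a).digest()
    digest_b = hashlib.sha256(b).digest()
    return sum(bin(x ^ y).count("1") for x, y in zip(digest_a, digest_b))


# Two hypothetical inputs that differ in exactly one byte ("dog" vs. "cog").
original = b"The quick brown fox jumps over the lazy dog"
modified = b"The quick brown fox jumps over the lazy cog"

print(hashlib.sha256(original).hexdigest())
print(hashlib.sha256(modified).hexdigest())
print(differing_bits(original, modified), "of 256 digest bits differ")
```

For a cryptographic hash function the two digests are expected to look unrelated although the inputs are almost identical, so comparing digests reveals nothing about the similarity of the inputs.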
If we examine encoding dependency, the situation is reversed: the byte-level algorithms depend on the actual encoding (e.g., an image encoded as JPG is considered different from the same image encoded as PNG by both cryptographic hash functions and bytewise approximate matching). Semantic approximate matching, on the other hand, considers the perceptual level and therefore does not depend on the encoding.

With respect to error rates, both the false non-match rate (FNMR) and the false match rate (FMR) are of interest. For convenience the FMR should be small; otherwise the investigator is burdened with manually checking erroneous hits. The FNMR, on the other hand, must be as close as possible to 0. Otherwise the blacklist fails to point to potential evidence, and the trace must be found in a different way. Cryptographic hash functions do not suffer from error rates due to their security requirements from the cryptographic domain (e.g., preimage resistance, collision resistance). However, as approximate matching processes noisy input, it suffers from both a non-trivial FMR and a non-trivial FNMR. It is therefore the operator's responsibility to prioritise the error rates.

Our final aspect concerns indexing, that is, a sorting algorithm for digests. As explained in the discussion of whitelisting, the RDS sorts cryptographic hash values by their numerical value. Hence indexing is easily possible for blacklists based on cryptographic hash functions. For approximate matching, first approaches towards indexing are available. As they suffer from run-time or memory inefficiency, we rate approximate matching rather negatively with respect to sorting.

8.4 Summary

In this chapter we described the two main use cases of hash functions in digital forensics. The use cases are authenticity and integrity of digital traces (ensured by applying cryptographic hash functions) and identification of known objects (e.g.,

illicit files). In the latter case we showed how whitelisting and blacklisting work and how these concepts aim to perform data reduction. Our conclusion is that whitelisting is not effective and that blacklisting may be performed with cryptographic hash functions or approximate matching.