Department Informatik. Privacy-Preserving Forensics. Technical Reports / ISSN Frederik Armknecht, Andreas Dewald

Department Informatik
Technical Reports / ISSN

Frederik Armknecht, Andreas Dewald
Privacy-Preserving Forensics
Technical Report CS, April 2015

Please cite as: Frederik Armknecht, Andreas Dewald, Privacy-Preserving Forensics, Friedrich-Alexander-Universität Erlangen-Nürnberg, Dept. of Computer Science, Technical Reports, CS, April 2015.

Friedrich-Alexander-Universität Erlangen-Nürnberg, Department Informatik, Martensstr, Erlangen, Germany

Privacy-Preserving Forensics

Frederik Armknecht, Andreas Dewald
IT Security Infrastructures, Dept. of Computer Science, University of Erlangen, Germany

Abstract

In many digital forensic investigations, email data needs to be analyzed. However, this poses a threat to the privacy of the individual whose emails are being examined and becomes a particular problem if the investigation clashes with privacy laws. This is commonly addressed by allowing the investigator to run keyword searches and to reveal only those emails that contain at least some of the keywords. While this could be realized with standard cryptographic techniques, further requirements call for novel solutions: (i) for investigation-tactical reasons the investigator should be able to keep the search terms secret and (ii) for efficiency reasons no regular interaction should be required between the investigator and the data owner. We close this gap by introducing a novel cryptographic scheme that allows entire email boxes to be encrypted before they are handed over for investigation. The key feature is that the investigator can non-interactively run keyword searches on the encrypted data and decrypt those emails (and only those) for which a configurable number of matches occurred. Our implementation as a plug-in for a standard forensic framework confirms the practical applicability of the approach.

ACKNOWLEDGMENTS

We want to thank Michael Gruhn for his cooperation and help in this project. This report presents the full version of the paper published at the DFRWS USA 2015 conference [1] and includes a more detailed security proof and further additional information.

I. INTRODUCTION

Digital forensics denotes the application of scientific methods from the field of computer science to support answering questions of law [2], [3], which usually includes the acquisition, analysis and presentation of digital evidence.
In such investigations, privacy is protected by individual domestic laws prohibiting unreasonable searches and seizures, such as the Fourth Amendment to the United States Constitution. Formally, privacy is a human right, which means, following the definition of the U.N. Universal Declaration of Human Rights, that "[n]o one shall be subjected to arbitrary interference with his privacy, [...] or correspondence [...] [and] [e]veryone has the right to the protection of the law against such interference or attacks" [4, Article 12]. In many nations, domestic laws are very stringent and, for investigations within a company, further require companies to have works councils and data protection commissioners to protect the privacy of employees and customers. The analysis of email data in particular is subject to controversial discussion, especially when the private use of the corporate email account is allowed. It is in this duality between forensics and privacy that special requirements for forensic tools arose [5].

A. Motivation

Whenever a company becomes a victim of accounting fraud or is involved in cartel-like agreements, this usually prompts large-scale investigations. In such investigations, mostly management documents and emails are reviewed, because communication data in particular often provides valuable pieces of evidence. However, it also contains very sensitive and private information. As stated above, the company may be legally obligated by domestic law to protect the privacy of the employees during the investigation as far as possible. Current methods applied to this end usually make use of filtering to prevent all emails from being read: the investigator is allowed to run keyword searches, and only those emails that contain at least some of the keywords are displayed. In this way, the investigator should find the emails that are relevant to the case, but as few unrelated emails as possible, to preserve privacy.
Conceptually, the simplest realization is that the data remains at the company and that the investigator sends the search queries to the company, which returns the search results. However, this comes with two problems. First, the company would learn all search queries, while for investigation-tactical reasons it may be necessary to keep them secret. Second, the need to constantly contact the company's IT puts a further burden on both sides. Therefore, common practice is that the investigator is granted access to the entire data and conducts the search. There, other privacy problems arise: because of the risk of overlooking the so-called smoking gun when not reading every email, investigators may be tempted to sidestep the filter mechanism. A prominent example of such deviation from the search order is the case United States v. Carey [6], even though in this case no filtering was actually used at all.

B. Contributions

Our contribution is a novel approach for privacy-preserving forensics, which allows for non-interactive threshold keyword search on encrypted emails (or, more generally, text). That is, a user holding the encrypted text can autonomously search the text for selected keywords. The search process reveals the content of an email if and only if it contains at least $t$ keywords that the investigator is searching for. Otherwise, the investigator learns nothing: neither the content of the email nor whether any of the selected keywords are contained. Our scheme allows for privacy-preserving forensics using the following process:

1) The company's IT extracts the requested email boxes.
2) The company (IT or data protection officer) encrypts the emails by applying our scheme.
3) The encrypted data is handed over to third-party investigators.
4) The investigators are able to run keyword searches on the encrypted data and can decrypt those (and only those) emails that contain at least $t$ of the keywords that they are looking for.

This way, the investigators are able to conduct their inquiry in the usual way, while not threatening privacy more than necessary. We want to emphasize that the investigator holds the entire data and can search within the data without further interaction with the company. In particular, the company learns nothing about the search queries executed by the investigators.
For the investigator, nothing changes and keyword search can be performed as usual, because the decryption is handled transparently in the background after a successful keyword search, as demonstrated later in Section VI. The only overhead introduced for the company is to determine the threshold $t$, i.e., to decide how many relevant keywords must be matched for decryption, and to encrypt the data. Besides choosing the threshold $t$, the company may further decide which keywords should be relevant for the case (i.e., whitelisting keywords for which the investigator might obtain a match) or, conversely, decide which words shall not allow the investigator to decrypt emails, such as pronouns, the company's name and address, salutations or other simple words (i.e., blacklisting keywords). Both approaches can be applied in our scheme; however, in the following we refer to the blacklisting approach, as we assume it to be more practical in most cases, because a blacklist can easily be adopted from case to case once it has been built. A whitelist depends strongly on the case, but there might be cases where it is worthwhile to apply the whitelist approach. After configuring the threshold $t$ and the white-/blacklist (once), the company can just invoke the encryption algorithm on the email data and hand it over for investigation.

The proposed scheme uses only established cryptographic building blocks: a symmetric key encryption scheme, a hash function, and a secret sharing scheme. For each of these, implementations that are believed to be secure exist, and the security of the overall scheme relies solely on the security of these building blocks. Together, these two facts allow for concrete realizations with high confidence in their security. Last but not least, we present an implementation of our scheme as a plug-in for the forensic framework Autopsy [7]. Experiments and performance measurements conducted on real email folders demonstrate the practical applicability of our approach.

C.
Outline

The remainder of the paper is structured as follows: We introduce our solution and give a high-level description of the scheme and algorithms in Section II. In Section III, we provide an in-depth explanation of the scheme, and Section VI shows our prototype implementation and evaluates the practical applicability. In Section VII, we give an overview of related work, and we conclude our paper in Section VIII.

II. THE PROPOSED SCHEME

This section provides a description of our scheme, focusing on the motivation behind the design, the overall structure and the workflow. For readability, we omit several details on the concrete realization of the components. These will be delivered in Section III.

Because an investigator should only be able to decrypt and read those emails that match his search terms, the investigator should not be able to learn anything about the encryption key of other emails when successfully decrypting one email. Thus, we use a cryptographic scheme that protects each email individually. In order to encrypt an entire mailbox, this scheme is applied to each email of the mailbox.

TABLE I
QUICK REFERENCE OF USED IDENTIFIERS.

Identifier   Meaning
P            Plaintext
C            Ciphertext
K            Secret key
W            Set of keywords
w            Single keyword
n            Number of keywords
t            Threshold
s            Share of the secret key
F            Mapping keywords to shares

The overall situation with respect to one email can be described as follows: A plaintext $P$ is given together with a list of keywords $W := \{w_1, \dots, w_n\}$ (for quick lookup, a complete list of the used identifiers is given in Table I). In our context, $P$ is the email that needs to be protected and $W$ denotes the set of words within this email. For security reasons, this list excludes trivial words like "the" and other general words that can be configured using the already mentioned blacklist.

The scheme provides two functionalities: a protection algorithm Protect and an extraction algorithm Extract. The task of the protection algorithm is to encrypt the given plaintext $P$ to a ciphertext $C$ to prevent an outsider from extracting any information about the plaintext. However, any user who knows at least $t \geq 1$ keywords from $W$ should be able to extract the original plaintext $P$ from the ciphertext $C$. To this end, the protection algorithm additionally outputs some helper data. The helper data enables a user who knows at least $t$ keywords to extract the plaintext $P$ from the ciphertext $C$ by applying the extraction algorithm. The scheme uses three established cryptographic primitives: a symmetric key encryption scheme, a hash function, and a secret sharing scheme. We explain these as needed in our description.

A. The Protection Mechanism

The protection algorithm (see Algorithm 1) has the task of encrypting an email in a way that allows for a keyword search on the encrypted data. In a nutshell, it can be divided into the following three steps:

a) Step P.1: Encrypt Message: To protect the content of $P$, it is encrypted to $C$ using an appropriate encryption scheme (line 2).
To this end, a secret key $K$ is used which is randomly sampled (line 1). If the encryption scheme is secure, the plaintext becomes accessible if and only if the secret key can be determined. The two remaining steps ensure that if a user knows $t$ keywords in $W$, he can reconstruct $K$ and thus decrypt the plaintext $P$ from $C$.

Algorithm 1 The Protection Algorithm Protect
Input: A plaintext $P$, a set of $n$ keywords $W$, and an integer threshold $t \geq 1$.
Output: A ciphertext $C$, a mapping $F$ (representing the helper data)
    {Step P.1: Encrypt Message}
 1: Sample a random key $K \in \{0,1\}^{l_K}$
 2: Encrypt the plaintext $P$ to ciphertext $C$ under the secret key: $C := \mathrm{Enc}_K(P)$
    {Step P.2: Split the Secret Key into Shares}
 3: Invoke the Split algorithm on $K$ with inputs $t$ (= threshold) and $n$ (= number of shares) to get $n$ shares of $K$: $(s_1, \dots, s_n) := \mathrm{Split}(K, n, t)$ where $s_1, \dots, s_n \in \{0,1\}^{l_K}$
    {Step P.3: Connect Keywords to Shares}
 4: Create a mapping $F$ such that $F(w_i) = s_i$ for all $i = 1, \dots, n$.
 5: return Ciphertext and mapping $(C, F)$

b) Step P.2: Split Secret Key into Shares: Recall that we want only users who know at least $t$ correct keywords from $W$ (i.e., keywords that occur in the email and that are not blacklisted) to be able to extract $P$. This threshold condition is integrated in this step, while in the subsequent (and last) step, the connection to the keywords in $W$ is established. More precisely, in this step a secret sharing scheme (as proposed by Shamir [8], for example) is used to re-encode the secret key $K$ into $n$ shares $s_1, \dots, s_n$ (line 3). This operation, referred to as Split, ensures that (i) the secret key $K$ can be recovered (using an algorithm Recover) from any $t$ pairwise distinct shares $s_{i_1}, \dots, s_{i_t}$ but (ii) if fewer than $t$ shares are available, no information about $K$ can be deduced at all. Observe that the number of shares is equal to the number $n$ of keywords in $W$, that is, the keywords within the email except those in the blacklist. So far, only standard cryptographic techniques have been used in a straightforward manner. It remains to chain the shares derived in this step to the keywords in $W$. This can be seen as one of the main novel contributions of the proposed scheme.

c) Step P.3: Connect Keywords to Shares: In this final step, a mapping $F$ is constructed which maps each keyword $w_i$ to one share $s_i$. That is, $F(w_i) = s_i$ for all $i = 1, \dots, n$. This ensures that a user who knows $t$ distinct keywords $w_{i_1}, \dots, w_{i_t} \in W$ can derive the corresponding shares $s_{i_1} = F(w_{i_1}), \dots, s_{i_t} = F(w_{i_t})$ using $F$. Given these, the secret key is recovered by $K := \mathrm{Recover}(s_{i_1}, \dots, s_{i_t})$, which allows decrypting $P := \mathrm{Dec}_K(C)$. The calculation of this mapping $F$ is more complicated and is explained in detail in Section III.

The output of the protection mechanism is the ciphertext $C = \mathrm{Enc}_K(P)$ and information that uniquely specifies the mapping $F$. Observe that $F$ actually represents the helper data mentioned before. For efficiency reasons, it is necessary that $F$ has a compact representation. In the next section, we explain how a forensic investigator can reconstruct the plaintext of an email $P$ from $(C, F)$ if and only if this email contains the $t$ keywords he is searching for.

B. The Extraction Mechanism

The task of the extraction mechanism (see Algorithm 2) is to allow an investigator to decrypt the encrypted email if this email contains the terms he is searching for. Formally speaking, the mechanism allows a user who has knowledge of $t$ correct keywords to extract the original plaintext $P$ from $(C, F)$. Like the protection mechanism, the extraction mechanism is divided into three steps. Each step is essentially responsible for reverting one of the steps of Protect. This needs to be done in reverse order (i.e., the first step E.1 reverts P.3, and so on). Assume now that the ciphertext $C$ and helper data $F$ are given and the user aims to recover the original plaintext using $t$ keywords $w_1, \dots, w_t$, which are the words he is searching for.
The three steps are the following:

Step E.1: Compute Possible Shares. From the keywords $w_1, \dots, w_t$, we compute $t$ possible shares by applying the mapping $F$ (line 2). That is, $s'_i := F(w_i)$ for $i = 1, \dots, t$.

Step E.2: Determine Possible Secret Key. In this step, the user aims to recover the secret key $K$ from the shares $s'_1, \dots, s'_t$. To this end, he invokes the corresponding algorithm Recover from the secret sharing scheme on $s'_1, \dots, s'_t$ and receives a value $K'$ (line 4). Observe that by assumption on Recover, it holds that only if those $t$ keywords really occurred in the email and were not blacklisted, the output of $F$ will be $t$ correct shares of the secret key. Otherwise (when there were fewer than $t$ or no matches in this email), $K'$ is a random value and the user does not learn which of the used search terms were correct, if any. More formally, it holds that $K' = K$ if and only if $\{s'_1, \dots, s'_t\} \subseteq \{s_1, \dots, s_n\}$.

Algorithm 2 The Extraction Algorithm Extract
Input: A ciphertext $C$, a mapping $F$, $t$ keywords $w_1, \dots, w_t$
Output: A plaintext $P$
    {Step E.1: Compute Possible Shares}
 1: for all $i = 1, \dots, t$ do
 2:     Apply the mapping $F$ to keyword $w_i$ to get a possible share, i.e., $s'_i := F(w_i)$.
 3: end for
    {Step E.2: Determine Possible Secret Key}
 4: Given the $t$ candidates for shares of $K$, the corresponding recovery algorithm of the underlying secret sharing scheme is applied to get a candidate for the secret key: $K' := \mathrm{Recover}(s'_1, \dots, s'_t)$.
    {Step E.3: Try to Decrypt Ciphertext}
 5: Apply the decryption algorithm to $C$ with the key candidate and get a plaintext candidate $P'$: $P' = \mathrm{Dec}_{K'}(C)$.
 6: if $P'$ is a valid plaintext then
 7:     return plaintext $P'$
 8: else
 9:     return $\bot$
10: end if

Step E.3: Decrypting the Encrypted Plaintext. In the final step, the user aims to extract $P$ from $C$. That is, he computes $P' = \mathrm{Dec}_{K'}(C)$. According to the common requirements on secure encryption algorithms, it holds that $P' = P$ if $K' = K$. Otherwise, $P'$ should be some meaningless data that is not obviously related to $P$.
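The Recover behaviour that Step E.2 relies on can be illustrated with a minimal Python sketch of Shamir-style sharing over $GF(2^{128})$, the construction that Section III-B details. Several things here are assumptions made for illustration and are not taken from the report: the reduction polynomial $x^{128} + x^7 + x^2 + x + 1$, the function names `gf_mul`, `gf_inv`, `poly_eval`, `split` and `recover`, the small integer x-coordinates standing in for hashed keywords, and the choice to pass $x_0$ explicitly to `recover`.

```python
import secrets

# Field elements are 128-bit integers. The reduction polynomial
# x^128 + x^7 + x^2 + x + 1 is an assumed representation of GF(2^128).
REDUCTION = (1 << 128) | 0x87


def gf_mul(a, b):
    """Carry-less product of a and b, reduced modulo REDUCTION."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a >> 128:
            a ^= REDUCTION
    return result


def gf_inv(a):
    """Multiplicative inverse via a^(2^128 - 2); a must be non-zero."""
    result, exponent = 1, (1 << 128) - 2
    while exponent:
        if exponent & 1:
            result = gf_mul(result, a)
        a = gf_mul(a, a)
        exponent >>= 1
    return result


def poly_eval(coeffs, x):
    """Horner evaluation; coeffs[0] is the constant term."""
    acc = 0
    for c in reversed(coeffs):
        acc = gf_mul(acc, x) ^ c
    return acc


def split(key, xs, t, x0):
    """Shares (x_i, p(x_i)) of a random polynomial of degree < t with p(x0) = key."""
    coeffs = [0] + [secrets.randbits(128) for _ in range(t - 1)]
    coeffs[0] = key ^ poly_eval(coeffs, x0)  # shifts the polynomial so p(x0) = key
    return [(x, poly_eval(coeffs, x)) for x in xs]


def recover(shares, x0):
    """Lagrange interpolation through the shares, evaluated at x0."""
    total = 0
    for i, (xi, yi) in enumerate(shares):
        num = den = 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = gf_mul(num, x0 ^ xj)  # subtraction is XOR in GF(2^128)
                den = gf_mul(den, xi ^ xj)
        total ^= gf_mul(yi, gf_mul(num, gf_inv(den)))
    return total


t, x0 = 3, 1
xs = [2, 3, 4, 5, 6]  # stand-ins for the keyword-derived x-coordinates
key = secrets.randbits(128)
shares = split(key, xs, t, x0)

assert recover(shares[:t], x0) == key  # any t correct shares give K
assert recover(shares[2:], x0) == key
wrong = shares[:2] + [(shares[2][0], shares[2][1] ^ 1)]
assert recover(wrong, x0) != key       # one incorrect share gives a different value
```

The last assertion mirrors the statement that $K' = K$ only when all $t$ candidate shares are correct: flipping a single bit of one y-coordinate already moves the recovered value away from the key.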
We assume an efficient algorithm that can reject false plaintexts (explained later) and outputs $\bot$ in such cases. Otherwise the final output is $P$. This means that in the last step the algorithm is able to determine whether $K'$ was valid and otherwise does not display the useless data to the investigator.

III. DETAILED DESCRIPTION OF THE COMPONENTS

In this section, we provide a generic description based on appropriate cryptographic primitives and also suggest a concrete choice of algorithms and parameters. In principle, the following three mechanisms need to be realized: (i) an encryption/decryption scheme that allows wrongly decrypted plaintexts to be identified efficiently (steps P.1, E.3); (ii) a secret sharing scheme (steps P.2, E.2); (iii) a method to create the mapping $F$ (steps P.3, E.1). In the course of this section, we explain these three building blocks in the given order.

A. Encryption/Decryption Scheme

Reasonable choices are encryption schemes that (i) have been thoroughly analyzed over a significant time period and (ii) are sufficiently efficient. A good candidate is AES [9], the current encryption standard. Although AES has been designed to encrypt 128-bit blocks only, it can encrypt larger chunks of data if used in an appropriate mode of operation like the Cipher Block Chaining (CBC) mode [10]. That is, a plaintext $P$ is divided into several 128-bit blocks $P = (p_1, \dots, p_l)$ with $p_i \in \{0,1\}^{128}$ for each $i$ (if the bit size of $P$ is not a multiple of 128, some padding is used at the end). With respect to the key size, the AES standard supports 128 bit, 192 bit, and 256 bit. As 128-bit security is still sufficient according to the current state of knowledge, we propose to use AES-128. That is, in the following the key length is $l_K = 128$ and Enc refers to AES-128 in CBC mode; other choices are possible.

Given a plaintext $P$, it is first transformed into a sequence $(p_1, \dots, p_l)$ of 128-bit blocks. To allow wrongly decrypted plaintexts to be identified (line 6), we prepend a 128-bit block $p_0$ which consists of a characteristic padding pad, e.g., the all-zero bitstring $\mathrm{pad} = 0 \dots 0$.
Then, given a 128-bit key $K \in \{0,1\}^{128}$, the sequence of 128-bit blocks $(p_0, p_1, \dots, p_l)$ is encrypted to $(c_0, \dots, c_l)$. More precisely, the CBC mode involves an initial value IV that is chosen randomly at the beginning. Then, the ciphertext blocks $c_i$ are computed as follows:

$$c_0 := \mathrm{Enc}_K(\mathrm{pad} \oplus \mathrm{IV})$$ (1)
$$c_i := \mathrm{Enc}_K(p_i \oplus c_{i-1}), \quad i = 1, \dots, l$$ (2)

The output of the encryption process is $(\mathrm{IV}, c_0, \dots, c_l)$. With respect to the decryption procedure, it holds that

$$\mathrm{pad} := \mathrm{Dec}_K(c_0) \oplus \mathrm{IV}$$ (3)
$$p_i := \mathrm{Dec}_K(c_i) \oplus c_{i-1}, \quad i = 1, \dots, l$$ (4)

Observe that even a legitimate user may be forced to retry several selections of keywords until he is able to access the plaintext. Hence, it is desirable that the decryption procedure can be aborted as soon as possible if a wrong key is tried. This is the reason for the introduction of the characteristic padding pad. Given the encrypted blocks $(c_0, \dots, c_l)$ and a key candidate $K'$, a user can find out whether $K' = K$ by checking (3). Only if this check succeeds is the decryption process continued. Otherwise, it is aborted and the next email in the data set is tested for a valid match of all $t$ search terms.

B. Secret Sharing Scheme

The concept of secret sharing has long been well known in cryptography. In our proposal, we adopt the secret sharing scheme (SSS) proposed by Shamir [8]. As its concrete form has an impact on the realization of the mapping construction, we recall its basic ideas here: Let $K \in \{0,1\}^{128}$ denote the secret key. In the following, we treat $K$ as an element of the finite field $GF(2^{128})$. Next, let the threshold value $t$ be given and an integer $n$ which represents the number of shares (equal to the number of different, non-blacklisted keywords in the email). We aim to derive $n$ shares $s_1, \dots, s_n$ such that any choice of $t$ shares allows $K$ to be reconstructed. To this end, let $n + 1$ pairwise different values $x_0, \dots, x_n \in GF(2^{128})$ be given.
More precisely, the values $x_1, \dots, x_n$ will be the result of computations explained in Section III-C. Then, $x_0$ is chosen randomly in $GF(2^{128}) \setminus \{x_1, \dots, x_n\}$. These values will later play the role of x-coordinates. Next, a polynomial $p(x)$ of degree $t - 1$ is chosen such that $p(x_0) = K$. Finally, the $n$ shares are computed by $s_i := (x_i, y_i) \in GF(2^{128}) \times GF(2^{128})$ with $y_i = p(x_i)$. We denote this overall procedure as

$$(s_1, \dots, s_n) := \mathrm{Split}(K, x_1, \dots, x_n, t)$$ (5)

Observe that each share represents one evaluation of the polynomial $p$ at a different point. It is well known that a polynomial of degree $t - 1$ can be interpolated from any $t$ evaluations. Hence, recovery is straightforward: Given $t$ distinct shares $s_{i_j} = (x_{i_j}, p(x_{i_j}))$, the first step is to interpolate the polynomial $p(x)$. Once this is accomplished, the secret is determined by evaluating $p(x)$ at $x_0$, that is, $K := p(x_0)$. However, if fewer shares are known or if at least one of them is incorrect (that is, it has the form $(x, y')$ with $y' \neq p(x)$), the derived value is independent of $K$. Hence, this realization fulfills the requirements mentioned above. This procedure is denoted by

$$K := \mathrm{Recover}(s_{i_1}, \dots, s_{i_t})$$ (6)

C. Creating the Mapping

It remains to discuss how a mapping $F$ can be constructed that maps the keywords $w_i \in \{0,1\}^*$ to the secret shares $s_i \in \{0,1\}^{256}$. We propose a mapping $F$ that is composed of two steps as follows: The first step uses a cryptographically secure hash function $H : \{0,1\}^* \to \{0,1\}^{256}$. For example, one might instantiate $H$ with SHA-256 [11] or the variant of SHA-3 [12] that produces 256-bit outputs. In addition, a random value $\mathrm{salt} \in \{0,1\}^*$ is chosen. Then, the first step is to compute the hash values $h_i$ of the concatenated bitstring $\mathrm{salt} \,\|\, w_i$, that is,

$$h_i = H(\mathrm{salt} \,\|\, w_i), \quad i = 1, \dots, n.$$ (7)

Observe that $h_i \in \{0,1\}^{256}$. In the following, we parse these values as $(x_i, z_i) \in GF(2^{128}) \times GF(2^{128})$. The values $(x_1, \dots, x_n)$ will be the x-coordinates as explained in Section III-B. This requires that they are pairwise distinct. As each $x_i$ is 128 bits long, it holds for any pair $(x_i, x_j)$ with $i \neq j$ that the probability of a collision is $2^{-128}$ if the hash function is secure. Due to the birthday paradox, the overall probability for a collision is roughly $n^2 / 2^{128}$. This probability can be neglected in practice, as $n$ is usually several orders of magnitude smaller than $2^{64}$. In the improbable event that two values collide, this step is repeated with another value for salt.

Next, we assume that the secret sharing scheme has been executed (see eq. (5)) based on the values $x_1, \dots, x_n$ determined by the results from (7). That is, we now have the situation that $n$ hash values $h_i = (x_i, z_i) \in GF(2^{128}) \times GF(2^{128})$ are given and likewise $n$ shares $s_i = (x_i, y_i) \in GF(2^{128}) \times GF(2^{128})$. We now complete the construction of $F$. First, a function $g$ is constructed which fulfills

$$g(x_i) = y_i \oplus z_i, \quad i = 1, \dots, n.$$
(8)

We elaborate below how $g$ can be efficiently realized. Such a function $g$ allows a bijective mapping to be realized:

$$G : GF(2^{128}) \times GF(2^{128}) \to GF(2^{128}) \times GF(2^{128}), \quad (x, y) \mapsto (x, g(x) \oplus y)$$

This mapping $G$ can be easily inverted and hence is bijective. Moreover, it maps a hash value $h_i = (x_i, z_i)$ to

$$G(h_i) = G(x_i, z_i) = (x_i, g(x_i) \oplus z_i)$$ (9)
$$\overset{(8)}{=} (x_i, (y_i \oplus z_i) \oplus z_i)$$ (10)
$$= (x_i, y_i) = s_i.$$ (11)

Now $F$ is defined as the composition of $H$ and $G$, that is,

$$F = G \circ H.$$ (12)

Algorithm 3 summarizes the steps of $F$. As we suggest the use of a known and widely analyzed hash function $H$, we can omit its specification in the output of Protect. In fact, the only information that is necessary for evaluating $F$ is the function $g(x)$. The realization that we propose below requires only $n$ field elements, which is optimal in the general case. As we suggest the use of $GF(2^{128})$, this translates to a representation using $128 \cdot n$ bits. The bijectivity of $G$ ensures that only correct hash values map to valid shares.

Algorithm 3 The Mapping F
Input: A keyword $w \in \{0,1\}^*$, a hash function $H : \{0,1\}^* \to \{0,1\}^{256}$, and a polynomial $g(x)$
Output: A possible share $s = (x, y)$
 1: Compute $h := H(\mathrm{salt} \,\|\, w)$ with $h \in \{0,1\}^{256}$
 2: Parse $h = (x, z) \in GF(2^{128}) \times GF(2^{128})$
 3: Compute $s := (x, z \oplus g(x))$
 4: return Possible share $s$

It remains to explain how $g : GF(2^{128}) \to GF(2^{128})$ can be efficiently realized. Observe that any mapping from a finite field to itself can be represented by a polynomial. Hence, an obvious approach would be to use the tuples $(x_i, y_i \oplus z_i)$ to interpolate a polynomial $p(x)$ of degree $|W|$ such that $p(x_i) = y_i \oplus z_i$ for $i = 1, \dots, n$. However, the asymptotic effort for interpolation is cubic in the degree (here $n$), and our implementations revealed that this approach is too slow even for the average case. Therefore, we derived another approach: The core idea is to split the input space (here: $GF(2^{128})$) into several distinct sets and to define an individual sub-function for each set.
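Before the partitioned variant is developed, Algorithm 3 together with the straightforward single-polynomial realization of $g$ can be sketched in Python. This is only an illustration under stated assumptions: SHA-256 is used for $H$ as the text suggests, but the reduction polynomial $x^{128} + x^7 + x^2 + x + 1$, the big-endian parsing of the hash into $(x, z)$, the helper names (`hash_to_xz`, `interpolate`, `poly_eval`, `F`), and the random values standing in for the share coordinates $y_i = p(x_i)$ are hypothetical choices, not taken from the report. The collision check on the $x_i$ is omitted.

```python
import hashlib
import secrets

# Assumed representation of GF(2^128): x^128 + x^7 + x^2 + x + 1.
REDUCTION = (1 << 128) | 0x87


def gf_mul(a, b):
    """Carry-less product of a and b, reduced modulo REDUCTION."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a >> 128:
            a ^= REDUCTION
    return result


def gf_inv(a):
    """Multiplicative inverse via a^(2^128 - 2); a must be non-zero."""
    result, exponent = 1, (1 << 128) - 2
    while exponent:
        if exponent & 1:
            result = gf_mul(result, a)
        a = gf_mul(a, a)
        exponent >>= 1
    return result


def hash_to_xz(salt, word):
    """h = SHA-256(salt || w), parsed as (x, z) with two 128-bit halves."""
    h = hashlib.sha256(salt + word.encode("utf-8")).digest()
    return int.from_bytes(h[:16], "big"), int.from_bytes(h[16:], "big")


def interpolate(points):
    """Coefficients (constant term first) of the polynomial through the points."""
    coeffs = [0] * len(points)
    for i, (xi, yi) in enumerate(points):
        basis, den = [1], 1
        for j, (xj, _) in enumerate(points):
            if i == j:
                continue
            # Multiply basis by (X + xj); addition and subtraction coincide here.
            shifted = [0] + basis
            scaled = [gf_mul(c, xj) for c in basis] + [0]
            basis = [a ^ b for a, b in zip(shifted, scaled)]
            den = gf_mul(den, xi ^ xj)
        scale = gf_mul(yi, gf_inv(den))
        for k, c in enumerate(basis):
            coeffs[k] ^= gf_mul(c, scale)
    return coeffs


def poly_eval(coeffs, x):
    """Horner evaluation; coeffs[0] is the constant term."""
    acc = 0
    for c in reversed(coeffs):
        acc = gf_mul(acc, x) ^ c
    return acc


def F(salt, g_coeffs, word):
    """Algorithm 3: hash, parse as (x, z), return the possible share (x, z XOR g(x))."""
    x, z = hash_to_xz(salt, word)
    return x, z ^ poly_eval(g_coeffs, x)


salt = secrets.token_bytes(16)
keywords = ["invoice", "cartel", "transfer"]         # hypothetical keywords
hashed = [hash_to_xz(salt, w) for w in keywords]
ys = [secrets.randbits(128) for _ in keywords]       # stand-ins for y_i = p(x_i)
g_coeffs = interpolate([(x, y ^ z) for (x, z), y in zip(hashed, ys)])

for w, (x, _), y in zip(keywords, hashed, ys):
    assert F(salt, g_coeffs, w) == (x, y)            # G(H(salt || w_i)) = s_i
```

In this sketch the helper data would consist of `salt` and the $n$ coefficients in `g_coeffs`. A word outside the keyword list still maps to some pair $(x, y)$, which is what makes a possible share indistinguishable from a correct one at this stage.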
Given an input $x$, the function $g$ is evaluated on $x$ by first identifying the subset of $GF(2^{128})$ that contains $x$ and then applying the corresponding sub-function to $x$. This approach requires deriving the corresponding sub-functions, which we accomplish by plain interpolation. However, as each set now contains significantly fewer values $x_i$ than $|W|$, the overall effort is significantly reduced. To see why, let $l$ denote the number of sets and assume for simplicity that each set contains $n/l$ values $x_i$. Then the overall effort to interpolate these $l$ sub-functions of degree $n/l$ each is $l \cdot (n/l)^3 = n^3 / l^2$. That is, compared to the straightforward approach of interpolating $g$ directly, our approach is faster by a factor of roughly $l^2$.

For the splitting of $GF(2^{128})$, i.e., the set of all 128-bit strings, we suggest using the $k$ least significant bits, where $k \geq 1$ is some positive integer (in our realization, $k$ ranges from 1 to 8). More precisely, for a value $x \in GF(2^{128})$, we denote by $x[1 \dots k]$ the $k$ least significant bits (LSBs). This gives a natural splitting of $GF(2^{128})$ into $2^k$ pairwise disjoint subsets. Two elements $x, x' \in GF(2^{128})$ belong to the same subset if and only if their $k$ LSBs are equal, that is, $x[1 \dots k] = x'[1 \dots k]$. Let $X := \{x_1, \dots, x_n\}$ denote the $n$ values determined by (7). For $I \in \{0,1\}^k$, we define

$$X_I := \{x \in X \mid x[1 \dots k] = I\}.$$ (13)

This yields a separation of $X$ into $2^k$ pairwise disjoint subsets: $X = \dot{\bigcup}_{I \in \{0,1\}^k} X_I$. For each set $X_I$, a corresponding sub-function $g_I$ is determined which fulfills

$$g_I(x_i) = y_i \oplus z_i \quad \text{for all } x_i \in X_I.$$ (14)

As already stated, this is accomplished by simple interpolation. That is, if $n_I := |X_I|$ denotes the size of $X_I$, the function $g_I$ can be realized by a polynomial of degree $n_I$, which will be close to $n / 2^k$ on average. This completes the construction of $g$ and hence of $G$ as well. The description of $g$ requires only the specification of the sub-functions $g_I$, that is,

$$g \,\hat{=}\, (g_{(0 \dots 00)}, g_{(0 \dots 01)}, \dots, g_{(1 \dots 11)})$$ (15)

where each index represents a $k$-bit string. As each $g_I$ is of degree $n_I$, it can be described by $n_I$ coefficients in $GF(2^{128})$. Because $\sum_{I \in \{0,1\}^k} n_I = n$, it follows that $g$ can be fully specified by $n$ field elements. The evaluation of $g$ is as explained above. Given an input $x \in GF(2^{128})$, first $I := x[1 \dots k]$ is determined, and then $g_I(x)$ is computed, which represents the final output. That is, $g(x) := g_I(x)$ where $I := x[1 \dots k]$.

IV. SECURITY ANALYSIS

In this section, we analyze the security of our scheme. To this end, we investigate whether the confidentiality of the protected message $P$ is ensured against an attacker who knows the ciphertext $C$ and the mapping $F$. In the following, we assume that the deployed cryptographic primitives are secure, i.e., the encryption scheme, the secret sharing scheme, and the hash function. For example, this means that an attacker cannot determine the plaintext $P$ from the ciphertext $C$ in reasonable time, and that guessing sufficiently many shares is impractical as well. As established realizations exist for all these cryptographic primitives which have not shown any weaknesses to date, this is a reasonable assumption. Moreover, if one deploys our scheme with a primitive which later turns out to be insecure, it is (conceptually) easy to replace it with another concrete scheme that is still considered to be secure.

Assume now that an attacker $A$ has access to a ciphertext $C$ and a specification of $F$ and aims to recover the plaintext $P$. As we assume the encryption scheme to be secure, the probability of a successful attack is negligible if $F$ is not taken into account. Hence, we focus on attackers who aim to exploit knowledge of $F$ somehow. Recall that the purpose of $F$ is to connect the keywords to the shares of $K$ and that $F$ is in particular independent of the plaintext. Thus, the knowledge of $F$ can only be helpful for recovering the secret key (which of course would eventually allow the plaintext to be revealed). Recall that $F$ is only indirectly related to $K$. The purpose of $F$ is to have a 1:1 mapping between the keywords in $W$ and the shares connected to the key. Moreover, as the secret sharing scheme is assumed to be secure, $A$ can only be successful if she can determine $t$ correct shares.
Hence, the security analysis boils down to the question of the effort needed to find $t$ correct shares while possibly using the knowledge of $F$. We now explain why knowledge of $F$ does not help in determining any shares. It holds for the used secret sharing scheme that for any chosen secret $K$, any combination $(x, y) \in \{0,1\}^{128} \times \{0,1\}^{128}$ can potentially be a share. As the mapping $G$ is bijective, there exists for each choice of $(x, y)$ a corresponding input $(x, z)$ such that $G((x, z)) = (x, y)$. Moreover, recall that according to the specification of the scheme, the input $(x, z)$ does not represent the keywords directly but their hash value $h$. If the hash function $H$ is secure, any $(x, z)$ can be a hash value with an infinite number of preimages. Hence, the corresponding preimage $(x, z)$ of a share candidate $(x, y)$ likewise does not tell whether $(x, y)$ can be a share. Summing up, neither the domain nor the range of $F$ or $G$ yields any information about the used shares.

What about the structure of $F$ and $G$, respectively? Recall that $G$ is specified by sub-functions $g_I$ of $g$, where the degree of $g_I$ gives an upper bound on the number of $x_i \in X_I$ (see (14)). As we suggest that each sub-function $g_I$ should have a degree of at least $t - 1$ (end of Section III), an attacker cannot decide for any $X_I$ whether it contains x-coordinates of the shares.

We have seen that the knowledge of $F$ cannot be exploited for recovering any of the shares. Moreover, guessing the shares directly would be practically impossible and in fact takes more effort than guessing the key directly. Thus, the only attacks that may potentially be relevant are so-called dictionary attacks. In a dictionary attack, the attacker $A$ has access to a dictionary of $N$ keywords $W'_1, \dots, W'_N$ and tries different combinations of $t$ keywords until she successfully recovers the secret key $K$ (and hence, eventually, the initial plaintext as well). In the remainder of this section, we analyze the effort for such an attack. To this end, we make the assumption that each keyword combination is equally likely in principle. That is, from an attacker's point of view, there exist no combinations of keywords that for whatever reason have a higher probability of being correct than others. Admittedly, this may not hold in practice. For example, the language of the initial plaintext and the context of this message may be taken into account for such an attack. But as this is hard to model for the general case, we consider it out of scope of this work. A further assumption we make is that the dictionary contains all keywords associated with $P$, that is, $W \subseteq \{W'_1, \dots, W'_N\}$. The reason is that a security analysis should evaluate security against attackers that are as powerful as possible. In the following, we call a selection of $t$ keywords $W'_1, \dots, W'_t$ good if it allows the plaintext to be recovered by applying the Extract algorithm, i.e.,

$$P := \mathrm{Extract}(C, F, W'_1, \dots, W'_t).$$ (16)

By construction of the scheme, any choice of $t$ keywords in $W$ is good. That is, the number of good choices is at least $\binom{n}{t}$.
Theoretically, there could be more good combinations. Let $p(x)$ be the polynomial that has been used in the secret sharing scheme for determining the shares $(x_i, y_i)$, $i = 1, \dots, n$. That is, it holds that $p(x_i) = y_i$ for each $i$. Assume now that by coincidence it happens for a keyword $W' \notin \mathcal{W}$ that $(x', y') := F(W')$ satisfies $p(x') = y'$ as well. Then any combination of $t$ keywords taken from $\mathcal{W} \cup \{W'\}$ would likewise be good. In particular, the number of good combinations would be at least $\binom{n+1}{t}$ in such a case.

While such effects can happen in principle, they are extremely unlikely. Due to the security of the involved hash function $H$ and the bijectivity of $G$, randomly chosen inputs $W' \in \{0,1\}^*$ to $F$ lead to uniformly distributed outputs $(x', y') \in GF(2^{128}) \times GF(2^{128})$. Hence, the probability of the event that $p(x') = y'$ is $2^{-128}$. If the dictionary contains $N - n$ keywords that do not belong to $\mathcal{W}$, the expected number of keywords $W'$ that by coincidence give a correct share is $\frac{N-n}{2^{128}} \ll 1$, as $N$ has to be $\ll 2^{128}$ for practical reasons.$^{2}$ In a similar manner, one can argue that the probability that a combination of $t$ keywords that are not all contained in $\mathcal{W}$ gives the correct plaintext is $2^{-128}$. Here, too, one cannot expect this to happen in practice.

Summing up, the number of good combinations is exactly $\binom{n}{t}$ with overwhelming probability. Hence, the dictionary attack translates to the following procedure: the attacker repeatedly selects at random one of the $\binom{N}{t}$ combinations $W_1, \dots, W_t$ and applies the Extract algorithm until she recovers the plaintext, i.e., until she hits by chance one of the $\binom{n}{t}$ good combinations.

Observe that an attacker is forced to execute the complete Extract algorithm. Omitting step 1 (applying $F$ to $W_1, \dots, W_t$ to get a combination of possible shares $(x_1, y_1), \dots, (x_t, y_t)$) would force the attacker to predict the hash values of the $W_i$, which is infeasible if the hash function is secure. As discussed above, any value $(x, y) \in GF(2^{128}) \times GF(2^{128})$ can be a valid share.
Even if all x-coordinates $x_1, \dots, x_t$ fell into the same subset $X_I$, this combination could be possible (from an attacker's point of view), as the corresponding subfunction $g_I$ has degree $t$. That is, there is no indication that $X_I$ may contain fewer than $t$ x-coordinates of the valid shares. The only event that may tell an attacker that a combination is invalid is that two x-coordinates coincide, i.e., $x_i = x_j$ for $i \neq j$. However, the probability of this is again $2^{-128}$ and hence can be ignored. Summing up, at the end of step 1 an attacker cannot yet exclude that the considered combination of keywords may be good.

Now step 2 gives a value $K' \in GF(2^{128})$ that may be the correct key. Whether this is the case cannot be decided at this stage, as the secret key does not contain any redundant information that would allow distinguishing it from random values. Hence, the attacker has to perform the third and last step as well to find out whether the considered combination of keywords was good. Altogether, this shows that the only possible attack is a plain dictionary attack. We evaluate the effort of a successful attack for our sample cases in Section VI-B7.

$^{2}$ Already $N = 2^{50}$ would imply a dictionary of several terabytes and is still smaller than $2^{128}$ by multiple orders of magnitude.
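To get a feeling for the numbers involved, the counting argument above can be turned into a few lines of Python. The sketch below is only an illustration under the assumptions stated in the text (all combinations equally likely, exactly $\binom{n}{t}$ good combinations) plus one further assumption of ours: the attacker draws combinations uniformly at random with replacement, so the expected number of Extract calls is $\binom{N}{t} / \binom{n}{t}$. The function name and the example parameters are hypothetical and are not taken from the evaluation in Section VI-B7.

```python
from fractions import Fraction
from math import comb, log2


def dictionary_attack_effort(N: int, n: int, t: int) -> dict:
    """Estimate the effort of a plain dictionary attack.

    N: dictionary size, n: number of keywords of the plaintext (all assumed
    to be in the dictionary), t: threshold. Assumes that every t-subset is
    equally likely and that exactly C(n, t) subsets are good.
    """
    if not (1 <= t <= n <= N):
        raise ValueError("need 1 <= t <= n <= N")
    total = comb(N, t)  # all keyword combinations the attacker may try
    good = comb(n, t)   # combinations that recover the plaintext
    return {
        "total": total,
        "good": good,
        "success_probability": Fraction(good, total),  # per single trial
        "expected_trials": Fraction(total, good),       # mean of a geometric distribution
        "log2_expected_trials": log2(total) - log2(good),
    }


if __name__ == "__main__":
    # Hypothetical example values, chosen only for illustration.
    effort = dictionary_attack_effort(N=100_000, n=50, t=3)
    print(f"about 2^{effort['log2_expected_trials']:.1f} Extract calls expected")
```

Exact fractions are used so that very small success probabilities do not underflow to zero for large dictionaries.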

V. SECURITY ANALYSIS

In this section, we analyze the security of our scheme against a malicious investigator. Before we provide a formal proof of the security of the scheme, we define the underlying notion of security.

A. Security Model

1) Preliminaries: To simplify the discussion, we express security asymptotically with respect to some security parameter $\lambda$. That is, we assume that all parameters have been chosen according to $\lambda$ and that all algorithms, including the attacker, have a runtime polynomial in $\lambda$. Moreover, the key length $l_K$ is chosen such that $2^{-l_K}$ is negligible in $\lambda$. Unless stated otherwise, the term negligible refers to a function that is negligible in $\lambda$. Rephrasing all notions and arguments for concrete security parameters is straightforward.

In the following, we assume some distribution $D$ on the set of possible plaintexts. Whenever we say that a plaintext is chosen, we mean that it is sampled according to $D$. The distribution $D$ may be known to an attacker. Moreover, we assume that the sizes of the plaintext and the ciphertext are known to the attacker. In general, we consider a set of public parameters that are known to all involved parties, including any possible attacker. This set contains at least the full specification of the scheme, the value $t$ chosen for the threshold, and possibly the distribution $D$, but may also comprise further information.

In our discussion, we also slightly change the notation for the scheme. As already explained, a description of $F$ can be given by specifying the underlying hash function $H$ and the function $g$. As $H$ is fixed, we say that Protect simply outputs $(C, g)$ instead of $(C, F)$ to better reflect the actual values that have been generated by Protect. Likewise, Extract gets as input $g$ instead of $F$.
2) Attacker model: An attacker $A$ always refers to a PPT (probabilistic polynomial-time Turing machine) that plays the role of the investigator and aims to gain knowledge about the plaintexts from given ciphertexts. More precisely, as each plaintext is individually encrypted using a different key, the security definition is restricted to one given ciphertext and the probability that the attacker can recover information about the underlying plaintext. In real-world scenarios, this information may help to guess the keywords of another plaintext. This effect is captured by the distribution $D$. That is, the distribution may change depending on the knowledge of the attacker and hence affects the security of the next ciphertext. This is also one of the main reasons why we aim for a security definition that is independent of $D$.

Recall that an investigator needs to be able to run keyword searches on the encrypted plaintexts. That is, for a given ciphertext, the investigator can execute the extraction algorithm Extract with some chosen keywords $w_1, \dots, w_t$ to possibly decrypt the ciphertext. Consequently, a malicious investigator has at least the same possibility, which makes, for example, dictionary attacks unavoidable (see also Sec. V-C). In such an attack, the attacker tries different (plausible) keywords until she manages to decrypt the ciphertext.

Informally, security means that no attacker can do significantly better. In other words, for any attacker $A$ with access to Extract, it does not significantly affect the probability of success whether $(C, g)$ is known as well or not. This means that revealing $C$ and $g$ to an adversary does not induce any further security risks. Observe that this represents the optimal level of security unless one restricts the capabilities of a (possibly honest) investigator. To capture this basic functionality, we introduce an oracle $E$ which is defined with respect to some fixed $(C, g)$ and simply executes Extract on $(C, g)$.
That is, the oracle $E$ accepts as input $t$ keywords $w_1, \dots, w_t$ and returns $\mathrm{Extract}(C, g, w_1, \dots, w_t)$. We express this oracle by $E_{(C,g)}$ but omit the index $(C, g)$ in most cases as it is clear from the context. In the following, we assume that any attacker has black-box access to $E_{(C,g)}$ for the given tuple $(C, g)$. We denote this by $A^{E}$.

Moreover, an attacker may get as additional input some information about $(C, g)$. This is likewise formalized by introducing an information oracle $I$. Similar to $E$, the oracle $I$ is defined for a fixed tuple $(C, g)$ which we often do not write down explicitly. Moreover, $I = I_\tau$ is defined with respect to a deterministic function $\tau$ which operates on inputs $(C, g)$. Upon request, $I_\tau$ returns $\tau(C, g)$. For the definition of security, only the two extreme cases are interesting: either $\tau$ simply outputs the complete tuple $(C, g)$ or it outputs nothing, i.e., outputs $\bot$. In the first case, the information oracle is denoted by $I_{id}$, while it is simply omitted in the second case. Observe that an attacker that does not learn the full input $(C, g)$ from $I$ can never be more powerful than an attacker who has the full knowledge. Hence, if we show that in both extreme cases the success probability is practically the same, it holds in particular for any attacker who gets only partial information on $(C, g)$.

Summing up, any attacker $A$ has access to two oracles: the extraction oracle $E$, representing the basic functionality, and an information oracle $I_\tau$ which simply returns $\tau(C, g)$ to $A$. We express this by $A^{E, I_\tau}$.
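The two oracles can be pictured as closures over a fixed tuple $(C, g)$. The following Python sketch is purely illustrative: the names `make_extraction_oracle`, `make_information_oracle` and `tau_id` are our own, the `extract` callable is a hypothetical stand-in for the Extract algorithm of the scheme, and `None` plays the role of $\bot$.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical signature of Extract: (C, g, keywords) -> plaintext or None.
ExtractFn = Callable[[object, object, Sequence[str]], Optional[bytes]]


def make_extraction_oracle(extract: ExtractFn, C, g, t: int):
    """Return E_(C,g): accepts t keywords and returns Extract(C, g, w_1..w_t)."""
    def E(keywords: Sequence[str]) -> Optional[bytes]:
        if len(keywords) != t:
            raise ValueError(f"expected exactly {t} keywords")
        return extract(C, g, keywords)  # None models the failure symbol
    return E


def make_information_oracle(C, g, tau: Optional[Callable] = None):
    """Return I_tau; tau=None models the case where the oracle is omitted."""
    def I():
        return None if tau is None else tau(C, g)
    return I


def tau_id(C, g) -> Tuple[object, object]:
    """The full-information case: reveal the complete tuple (C, g)."""
    return (C, g)
```

An attacker of type $A^{E, I_{id}}$ would receive both closures, whereas an attacker of type $A^{E}$ would only receive `E`.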

3) Security Definition: As already stated, the fact that a (malicious) investigator should be able to do keyword searches on the data inevitably allows for dictionary attacks. This cannot be prevented, independent of how the data is protected and no matter whether the scheme is interactive or not. Consequently, the best one can ask for is that using a protection scheme restricts an attacker to attacks that only use this functionality. That is, using the protection scheme and revealing $(C, g)$ provides no significant advantage to an adversary.

For a formal definition, we consider the following security game between a hypothetical challenger and an attacker $A^{E, I_\tau}$. The challenger samples some plaintext $P$, runs the protection algorithm to get $(C, g)$, and then runs the attacker, who gets black-box access to $E_{(C,g)}$ and $I_\tau$. The security game works as follows for an attacker $A^{E, I_\tau}$:

1) Choose a plaintext $P$ (according to the distribution $D$) and extract the set $\mathcal{W}$ of keywords from $P$.$^{3}$
2) Execute $\mathrm{Protect}(P, \mathcal{W}, t)$ and get $(C, g)$.
3) Run $A^{E_{(C,g)}, I_\tau(C,g)}$.

The attacker wins the game if she outputs $P$. We denote by $\mathrm{Adv}(A^{E,I})$ the success probability of $A$ to win the game, where the probability is taken with respect to the random coins of the challenger and the attacker.

An experienced reader might wonder why winning requires recovering the whole plaintext. In cryptography, usually a weaker winning condition is considered, e.g., predicting the output of a Boolean function on input $P$. The reason that we aim for full plaintext recovery is the following. As already elaborated, an unavoidable, basic attack is a dictionary attack where an attacker aims to decrypt the ciphertext by guessing correct keywords.$^{4}$ Our goal is to show that no further attacks are possible. As the dictionary attack aims to output the full plaintext, we have to adopt this as the winning condition. We are now ready to give the formal security definition.
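The three steps of the game can be written down as a small test harness. The sketch below is an assumption-laden illustration, not part of the scheme: all callables (`sample_plaintext`, `extract_keywords`, `protect`, `extract`, `attacker`) are hypothetical placeholders that a reader would have to supply, and the success probability is merely estimated empirically by repeating the game.

```python
def run_security_game(sample_plaintext, extract_keywords, protect, extract,
                      attacker, t: int, reveal: bool) -> bool:
    """Play one round of the game; return True iff the attacker outputs P."""
    P = sample_plaintext()       # step 1: P is drawn according to D ...
    W = extract_keywords(P)      # ... and its keyword set W is derived
    C, g = protect(P, W, t)      # step 2: (C, g) := Protect(P, W, t)

    def E(keywords):             # extraction oracle E_(C,g)
        return extract(C, g, keywords)

    info = (C, g) if reveal else None  # I_id versus no information oracle
    return attacker(E, info) == P      # step 3: the attacker wins iff she outputs P


def estimate_advantage(rounds: int, **game) -> float:
    """Empirical success rate over independent rounds (a rough estimate only)."""
    wins = sum(run_security_game(**game) for _ in range(rounds))
    return wins / rounds
```

Calling `estimate_advantage` once with `reveal=True` and once with `reveal=False` for the same attacker strategy gives a rough empirical feeling for the two quantities that the definition compares.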
As already motivated, one cannot avoid that an attacker, i.e., a malicious investigator, misuses access to $E$ to recover the plaintext. The only option would be to restrict the capabilities of an investigator, which would severely change the overall scenario. Hence, the best we can ask for is that knowing $(C, g)$ does not significantly increase the probability of success compared to an attacker who only makes keyword searches. Usually, one would formalize this by requiring that any attacker who has access to $(C, g)$ can be simulated by an attacker without this knowledge. However, one technical problem arises. While one can show that the knowledge of $(C, g)$ does not help to recover the plaintext, an attacker may use this knowledge in a way that makes it impossible to simulate her. One example is that after an attacker has recovered $P$, she selects several combinations of $t$ keywords from $P$ and checks whether $g$ always eventually yields the same key. This cannot be simulated by an algorithm that has access to $E$ only.

However, it is still possible to come up with a meaningful definition of security that can also be proven. The core idea is to show that for achieving the maximum success probability, an attacker does not need to know $(C, g)$. More precisely, the maximum success probability among the set of attackers that know $(C, g)$ is practically the same as for the set of attackers that do not know $(C, g)$. This is formalized as follows:

Definition 1. A scheme as defined in Section II is secure if it holds that

$$\max_{A^{E,I_{id}}} \left\{ \mathrm{Adv}(A^{E,I_{id}}) \right\} - \max_{A^{E}} \left\{ \mathrm{Adv}(A^{E}) \right\} \qquad (17)$$

is negligible in the security parameter $\lambda$. Here, $\max_{A^{E,I_{id}}}$ means the maximum over all attackers $A$ that have access to the indicated oracles, and so on.

$^{3}$ We assume that an attacker knows this procedure as well.
$^{4}$ We call keywords correct or good if they are contained in the plaintext and hence potentially allow decrypting the ciphertext.
Observe that the attackers considered in the second expression have no access to any information oracle and hence learn nothing about $(C, g)$. Thus, the definition expresses that the best an attacker can do with the knowledge of $(C, g)$ is practically the same as the best one can do without this knowledge.

B. Proof of Security

1) Main Theorem: In this section, we prove the security of the scheme described in Section II according to the security definition given in Definition 1. Recall that the proposed scheme deploys three standard cryptographic components: a hash function, an encryption scheme, and a secret sharing scheme. While the latter provides information-theoretic security, the first two can only be computationally secure. To abstract from any potential weaknesses of the concrete choices of the encryption function and the hash function, we follow the common approach and model these as an ideal cipher and a random oracle, respectively.

That is, the hash function $H$ is modeled by an oracle that accepts as input bit strings $x \in \{0,1\}^*$ of arbitrary length and outputs the corresponding hash value. The crucial point is that a user knows only those hash values he has asked for. That is, a user can only know $H(x)$ if he has explicitly asked for it. Otherwise, the value is completely undetermined. Observe that we assume that $E$ has access to $H$ as well.

Similarly, an ideal cipher $\Pi$ with a key size $l_K$ can be seen as a collection of $2^{l_K}$ permutations indexed by the set $\{0,1\}^{l_K}$, selected uniformly at random from the set of all such families of permutations. In concrete instantiations, the cipher will be used in an appropriate mode of operation, which may introduce weaknesses as well. Hence, we assume that the block size of the permutation fits the length of the plaintext, which is fixed and publicly known. A user can make forward and backward requests to the ideal cipher. In a forward request, it sends a key $K' \in \{0,1\}^{l_K}$ and a plaintext $P'$ and receives $\Pi_{K'}(P')$, while in a backward request, it sends a key $K'$ and a ciphertext $C'$ and gets $\Pi_{K'}^{-1}(C')$. Slightly deviating from the standard definition, we assume in addition that $\Pi$ also internally stores the triple $(K, P, C)$ that has been used to generate the challenge ciphertext. We emphasize that apart from internally storing this information, $\Pi$ represents an ideal cipher.

A final assumption we make is that decrypting a ciphertext under a different key, i.e., a key different from the one used to create the ciphertext, yields a valid plaintext only with negligible probability. This can be achieved, for example, by using a characteristic padding of a size linear in $\lambda$. We are now ready for the main theorem:

Theorem 1. In the random oracle and ideal cipher model, the scheme described in Section II is secure according to the security definition given in Definition 1.

2) Proof: The remainder of this section contains the proof of Theorem 1. We show this by a sequence of attacker types $A_i$ that differ in the oracles they can access. We show for any two subsequent types $(A_i, A_{i+1})$ that $\max_{A_i} \mathrm{Adv}(A_i) - \max_{A_{i+1}} \mathrm{Adv}(A_{i+1})$ is negligible.
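For readers less familiar with the two idealized models used in the proof, the following Python sketch shows how a random oracle and an ideal cipher with forward and backward queries are commonly simulated by lazy sampling, i.e., by drawing each answer at random on first use and answering consistently afterwards. The class and method names are our own, and the sketch leaves out the additional bookkeeping of the challenge triple $(K, P, C)$ assumed above.

```python
import secrets


class LazyRandomOracle:
    """Random oracle H: a fresh random answer per new input, then consistent."""

    def __init__(self, out_bytes: int = 32):
        self.out_bytes = out_bytes
        self._table = {}

    def query(self, x: bytes) -> bytes:
        if x not in self._table:
            self._table[x] = secrets.token_bytes(self.out_bytes)
        return self._table[x]


class LazyIdealCipher:
    """Ideal cipher: one lazily sampled random permutation per key."""

    def __init__(self, block_bytes: int = 16):
        self.block_bytes = block_bytes
        self._enc = {}  # key -> {plaintext: ciphertext}
        self._dec = {}  # key -> {ciphertext: plaintext}

    def _fresh(self, taken) -> bytes:
        # Rejection sampling keeps the mapping injective (a permutation).
        while True:
            cand = secrets.token_bytes(self.block_bytes)
            if cand not in taken:
                return cand

    def _check(self, block: bytes) -> None:
        if len(block) != self.block_bytes:
            raise ValueError("block has wrong length")

    def forward(self, key: bytes, p: bytes) -> bytes:
        self._check(p)
        enc, dec = self._enc.setdefault(key, {}), self._dec.setdefault(key, {})
        if p not in enc:
            c = self._fresh(dec)
            enc[p], dec[c] = c, p
        return enc[p]

    def backward(self, key: bytes, c: bytes) -> bytes:
        self._check(c)
        enc, dec = self._enc.setdefault(key, {}), self._dec.setdefault(key, {})
        if c not in dec:
            p = self._fresh(enc)
            enc[p], dec[c] = c, p
        return dec[c]
```

Because answers under different keys are sampled independently, queries under one key reveal nothing about the permutation belonging to another key.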
The first attacker type (expressed by $A_0$) has access to $E$, $I_{id}$, $H$, and $\Pi$, while the last type of attacker (expressed by $A_5$) can only access $E$.

a) Attacker Type $A_1$: Let $A_0^{E,I_{id},H,\Pi}$ be an arbitrary attacker with access to the four oracles $E$, $I_{id}$, $H$, and $\Pi$. We now construct from $A_0$ an attacker $A_1$ that has access to a restricted oracle $\Pi'$ instead of $\Pi$. This oracle $\Pi'$ works as follows. It accepts as inputs keys $K' \in \{0,1\}^{l_K}$. It first checks if $K' = K$. If this is the case, it outputs $P$. Otherwise, the output is $\bot$. In other words, $\Pi'$ tells whether the given key is equal to the key that has been used to create $C$.

The attacker $A_1$ invokes $I_{id}$ once at the beginning to learn $C$. From this point on, it uses $A_0$ in a black-box manner. All requests to $E$, $H$, and $I$ are simply forwarded. If $A_0$ makes a request to $\Pi$ involving a key $K'$, $A_1$ first forwards $K'$ to $\Pi'$. If the output is not $\bot$, $A_1$ learns the plaintext $P$, i.e., the decryption of $C$. If the request by $A_0$ was a forward query including $P$ or a backward query including $C$, $A_1$ returns $C$ and $P$, respectively. All other requests are answered by $A_1$ by consistent lazy sampling. The final output of $A_1$ is simply the final output of $A_0$.

Although $A_0$ has access to the ideal cipher $\Pi$, her goal is to decrypt $C = \Pi_K(P)$. By definition, all outputs of $\Pi$ under a different key $K' \neq K$ are information-theoretically independent of the queries using $K$ and hence do not provide any useful information to $A_0$. Moreover, the queries using $K$ are also perfectly simulated, as the only restriction $\Pi_K(P) = C$ is respected. Consequently, $A_1$ perfectly simulates an ideal cipher to $A_0$. Moreover, $A_1$ wins if $A_0$ wins and vice versa. It follows that

$$\mathrm{Adv}\left(A_0^{E,I_{id},H,\Pi}\right) = \mathrm{Adv}\left(A_1^{E,I_{id},H,\Pi'}\right). \qquad (18)$$

In particular, we have $\max_{A_0} \mathrm{Adv}(A_0) \leq \max_{A_1} \mathrm{Adv}(A_1)$. On the other hand, any attacker that only has access to the restricted oracle $\Pi'$ can be simulated by one that has access to the full ideal cipher $\Pi$. This implies that $\max_{A_0} \mathrm{Adv}(A_0) \geq \max_{A_1} \mathrm{Adv}(A_1)$.
In conclusion, it holds that$^{5}$

$$\max\left\{ \mathrm{Adv}\left(A_0^{E,I_{id},H,\Pi}\right) \right\} = \max\left\{ \mathrm{Adv}\left(A_1^{E,I_{id},H,\Pi'}\right) \right\}.$$

$^{5}$ We omit the index of max from now on for the sake of readability.

b) Attacker Type $A_2$: We now consider an attacker type that has access to a restricted information oracle $I_{2nd}$ instead of $I_{id}$. The oracle $I_{2nd}$ simply outputs the second entry of the tuple, that is, $I_{2nd}(C, g) = g$. Observe that such an attacker does not learn $C$ but knows its size according to the security model. Let $A_1 = A_1^{E,I_{id},H,\Pi'}$ be an arbitrary attacker of the previous type. We now construct an attacker $A_2 = A_2^{E,I_{2nd},H,\Pi'}$ of the current type.

The attacker $A_2$ serves as a simulator for $A_1$ as follows. At the beginning, $A_2$ invokes $I_{2nd}$ to learn $g$ and then runs $A_1$ in a black-box manner. All requests of $A_1$ and the answers are simply forwarded. The only exception is the request to $I_{id}$. In this case, $A_2$ simply samples some ciphertext $C'$ of the correct size and returns $(C', g)$ to $A_1$. Recall that $A_1$ has no access to the full ideal cipher $\Pi$ and does not see the parameters of $\Pi'$. In other words, $\Pi'$ at most reveals the plaintext but never the ciphertext used. Hence, the situation is indistinguishable from the case where $A_1$ receives the correct ciphertext $C$. It follows that

$$\mathrm{Adv}\left(A_1^{E,I_{id},H,\Pi'}\right) = \mathrm{Adv}\left(A_2^{E,I_{2nd},H,\Pi'}\right). \qquad (19)$$

Using the same arguments as above, it follows that

$$\max\left\{ \mathrm{Adv}\left(A_1^{E,I_{id},H,\Pi'}\right) \right\} = \max\left\{ \mathrm{Adv}\left(A_2^{E,I_{2nd},H,\Pi'}\right) \right\}.$$

c) Attacker Type $A_3$: The next attacker type differs from the previous one in that the attacker no longer has access to $\Pi'$. Again, let $A_2$ be any attacker of the previous type. We construct an attacker $A_3 = A_3^{E,I_{2nd},H}$. The attacker $A_3$ invokes $A_2$ and forwards all queries and answers with respect to the other oracles. However, to simulate $\Pi'$, she performs some additional actions.

Whenever $A_2$ makes a query $w$ to the random oracle $H$ and receives some response $h$, the attacker $A_3$ appends this tuple $(w, h)$ to an initially empty internal list $L_{RO}$. In addition, $A_3$ keeps a list $L_{Keys}$ which is empty at the beginning. $L_{Keys}$ will contain all keys that can be derived from the answers received from $H$. That is, as soon as $|L_{RO}| = t$, i.e., $L_{RO} = [(w_1, h_1), \dots, (w_t, h_t)]$, the attacker $A_3$ uses the mapping $F$ and the secret sharing scheme to compute the corresponding key $K'$ and inserts $(K', w_1, \dots, w_t)$ into $L_{Keys}$. Whenever a new tuple $(w_{l+1}, h_{l+1})$ is appended to $L_{RO}$, the attacker $A_3$ computes for each possible combination $(w_{i_1}, \dots, w_{i_{t-1}}, w_{l+1})$, $1 \leq i_1 < i_2 < \dots < i_{t-1} \leq l$, the corresponding $K'$ and inserts $(K', w_{i_1}, \dots, w_{i_{t-1}}, w_{l+1})$ into $L_{Keys}$, sorted by the first entry. Observe that $|L_{RO}|$, i.e., the number of queries to $H$, is polynomial in $\lambda$ and that $|L_{Keys}| = \binom{|L_{RO}|}{t} \in O(|L_{RO}|^t)$, hence polynomial in $\lambda$ as well. This shows that maintaining these lists can be done efficiently.

Now, if $A_2$ wants to make a request $K'$ to $\Pi'$, $A_3$ proceeds as follows. It first checks whether $K'$ is the first entry of one of the tuples stored in $L_{Keys}$. If this is not the case, $A_3$ simply outputs $\bot$.
Otherwise, that is, if $K'$ is contained in $L_{Keys}$, $A_3$ looks up the corresponding $H$-inputs $w_{i_1}, \dots, w_{i_t}$, invokes $E$ with these inputs, and forwards the result to $A_2$.

Now, $A_3$ perfectly simulates $\Pi'$ for $A_2$ except for the following two cases: (i) running Extract with wrong keywords, i.e., keywords which do not yield $K$, nonetheless results in a valid plaintext (and not in $\bot$), and (ii) $A_2$ managed to derive the correct key $K$ without using $t$ correct keywords. The probability of the first event is negligible by assumption. Next, consider a key candidate $K'$ that has not been constructed using $H$ and $F$. The probability that $K' = K$ is hence $2^{-l_K}$. To see why, recall that the only component that is connected to $K$ is $g$. As the candidate key cannot be constructed from the $H$ responses, either the key has been purely guessed or it has been derived from the secret sharing scheme where at least one share has been guessed. In any case, the probability of correctly yielding $K$ is $2^{-l_K}$, which is negligible in $\lambda$ by assumption. Hence, it follows that

$$\mathrm{Adv}\left(A_2^{E,I_{2nd},H,\Pi'}\right) - \mathrm{Adv}\left(A_3^{E,I_{2nd},H}\right) \leq \mathrm{negl}(\lambda). \qquad (20)$$

Consequently, it holds as well that

$$\max\left\{ \mathrm{Adv}\left(A_2^{E,I_{2nd},H,\Pi'}\right) \right\} - \max\left\{ \mathrm{Adv}\left(A_3^{E,I_{2nd},H}\right) \right\} \leq \mathrm{negl}(\lambda).$$

d) Attacker Type $A_4$: The next attacker type differs from the previous one by no longer having access to the information oracle $I_{2nd}$. Again, let $A_3 = A_3^{E,I_{2nd},H}$ denote an attacker of the previous type. We now construct an attacker $A_4 = A_4^{E,H}$ of the current type. At the beginning, $A_4$ randomly constructs a mapping $g'$ according to some randomly chosen set of keywords. Then, $A_4$ runs $A_3$ and in principle forwards all requests and answers except for the request to $I_{2nd}$. If $A_3$ makes a request to $I_{2nd}$, the attacker $A_4$ simply returns $g'$ to $A_3$. If $A_3$ produces some output $P$, then so does $A_4$. However, in addition it might happen that $A_4$ aborts the execution of $A_3$ before $A_3$ stops and produces an output on her own.
Before we explain this situation in detail, we first explain the reasoning behind it. Observe that by construction, the function $g$ maps the coordinate $x_i$ to $y_i \oplus z_i$, where $z_i$ represents the second part of the hash value of a correct keyword and $y_i$ the second part of the share. Obviously, even if one knows $g(x_i)$, any value for $y_i$ (or alternatively $z_i$) is possible. In particular, $g$ alone does not restrict the set of possible shares. Moreover, any collection of $t$ keywords can potentially be correct. Even if all of them have hash values that share the same x-coordinate subset, the collection may be correct, as each component function has degree at least $t$. Thus, $g$ and $g'$ show exactly the same structure. Moreover, as $H$ is preimage resistant, knowing $g$ does not help the attacker in choosing correct keywords. Hence, the success probability of finding a collection of $t$ correct keywords is not reduced if one replaces $g$ by $g'$.

If $A_3$ has a collection of $t$ correct keywords $w_1, \dots, w_t$ and runs the extraction procedure using $g'$, she will get a key $K'$ which is different from $K$ with high probability. But this does not matter. As $K$ is no longer used by any of the oracles, $A_3$ cannot notice the difference. Nonetheless, providing $g'$ instead of $g$ does not yield a perfect simulation. If an attacker runs the extraction algorithm using $g'$ twice with two different collections of correct keywords, the output will be two different keys with high probability. However, running $E$ twice with the same sets of keywords will yield the correct plaintext in both cases. This brings us to the situation where $A_4$ may stop the execution of $A_3$ and produce her own output.

Observe that the situation described above can only take place if $A_3$ has queried $H$ with at least $t$ correct keywords. Similar to the discussion of the previous attacker type, $A_4$ uses a list $L_{RO}$ where she collects all requests $w$ of $A_3$ to $H$. Moreover, $A_4$ runs $E$ for any combination of $t$ keywords in $L_{RO}$. More precisely, it runs $E$ for the first time when $L_{RO}$ contains $t$ entries. From this point on, whenever a new request to $H$ is made, $A_4$ runs $E$ on every combination of $t$ elements in $L_{RO}$ that contains the newly requested value. As already explained, the overall effort is polynomial in $\lambda$. Moreover, if $E$ returns the plaintext $P$ (instead of $\bot$) to one of these requests, $A_4$ halts and outputs $P$.

If the requests of $A_3$ to $H$ do not involve $t$ correct keywords, $A_3$ cannot observe that $g$ has been replaced and the simulation is perfect. Moreover, $A_4$ will not abort and hence both attackers have the same success probability. Otherwise, if $A_3$ requested $t$ correct keywords from $H$, the attacker $A_4$ will succeed for sure. It follows that $\mathrm{Adv}(A_3) \leq \mathrm{Adv}(A_4)$ and hence $\max\{\mathrm{Adv}(A_3)\} \leq \max\{\mathrm{Adv}(A_4)\}$. On the other hand, it trivially holds that $\max\{\mathrm{Adv}(A_3)\} \geq \max\{\mathrm{Adv}(A_4)\}$. Consequently:

$$\max\left\{ \mathrm{Adv}\left(A_3^{E,I_{2nd},H}\right) \right\} = \max\left\{ \mathrm{Adv}\left(A_4^{E,H}\right) \right\}.$$
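The incremental bookkeeping that both $A_3$ and $A_4$ perform on the list of $H$-queries can be sketched in Python as follows. This is an illustrative sketch only: the class name `QueryLog` is our own, and what is done with each new combination (deriving a key candidate via $F$ and the secret sharing scheme in the case of $A_3$, or calling $E$ in the case of $A_4$) is deliberately left to the caller.

```python
from itertools import combinations
from math import comb


class QueryLog:
    """Collects H-queries and enumerates every t-subset exactly once."""

    def __init__(self, t: int):
        if t < 1:
            raise ValueError("threshold t must be at least 1")
        self.t = t
        self.l_ro = []  # the list L_RO of (keyword, hash value) tuples

    def add(self, w, h):
        """Record a new query (w, h); return all new t-combinations containing it."""
        if any(w == seen for seen, _ in self.l_ro):
            return []  # repeated query: nothing new to enumerate
        # Combine the new entry with every (t-1)-subset of the older entries.
        new_combos = [older + ((w, h),)
                      for older in combinations(self.l_ro, self.t - 1)]
        self.l_ro.append((w, h))
        return new_combos


if __name__ == "__main__":
    log, total = QueryLog(t=3), 0
    for i in range(6):
        total += len(log.add(f"w{i}", i))
    # After m distinct queries, exactly C(m, t) combinations have been produced.
    assert total == comb(6, 3)
```

Since each $t$-subset is produced exactly once, namely when its most recent element is queried, the total work after $m$ distinct queries is $\binom{m}{t}$ combinations, which matches the polynomial bound used in the proof.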
e) Attacker Type $A_5$: We now come to the final attacker type. As opposed to the previous attacker type, the attacker no longer has access to the random oracle $H$. Again, let $A_4 = A_4^{E,H}$ be any attacker of the previous type. We now construct an attacker $A_5$ which does the following. It runs $A_4$ and forwards all requests to and answers from $E$. Requests to $H$ are answered by consistently sampling random responses. Observe that $A_5$ perfectly simulates a random oracle. Moreover, $A_4$ cannot distinguish the simulated random oracle from $H$, as $E$ only tells which tuples of keywords are correct but says nothing about the internal values. Obviously, $A_4$ and $A_5$ have the same success probability. Likewise, it follows that

$$\max\left\{ \mathrm{Adv}\left(A_4^{E,H}\right) \right\} = \max\left\{ \mathrm{Adv}\left(A_5^{E}\right) \right\}.$$

We have reached the final attacker type. Observe that we have shown for each two subsequent attacker types that the maximum success probabilities are either equal or differ by a negligible function only. This shows the claim of the theorem, namely

$$\max\left\{ \mathrm{Adv}\left(A_0^{E,I_{id},H,\Pi}\right) \right\} - \max\left\{ \mathrm{Adv}\left(A_5^{E}\right) \right\} \leq \mathrm{negl}(\lambda).$$

This concludes the proof.

C. Dictionary Attacks

We have seen that the only attacks that may be relevant are so-called dictionary attacks, formalized by attackers that only have access to the $E$ oracle. In a dictionary attack, the attacker $A$ has access to a dictionary of $N$ keywords $W_1, \dots, W_N$ and tries different combinations of $t$ keywords until she successfully recovers the secret key $K$ (and hence, eventually, the initial plaintext as well). In the remainder of this section, we analyze the effort for such an attack. To this end, we assume that each keyword combination is equally likely in principle. That is, from an attacker's point of view, there exist no combinations of keywords that for whatever reason have a higher probability of being correct than others. Admittedly, this may not hold in practice.
For example, the language of the initial plaintext and the context of the message may be taken into account for such an attack. But as this is hard to model for the general case, we consider it out of scope of this work. A further assumption we make is that the dictionary contains all keywords associated with $P$, that is, $\mathcal{W} \subseteq \{W_1, \dots, W_N\}$. The reason is that a security analysis should evaluate security against attackers that are as powerful as possible.

In the following, we call a selection of $t$ keywords $W_1, \dots, W_t$ good if it allows recovering the plaintext by applying the Extract algorithm, i.e.,

$$P := \mathrm{Extract}(C, F, W_1, \dots, W_t). \qquad (21)$$

By construction of the scheme, any choice of $t$ keywords in $\mathcal{W}$ is good. That is, the number of good choices is at least $\binom{n}{t}$. Theoretically, there could be more good combinations. Let $p(x)$ be the polynomial that has been used in the secret sharing scheme for determining the shares $(x_i, y_i)$, $i = 1, \dots, n$. That is, it holds that $p(x_i) = y_i$ for each $i$. Assume now that by coincidence it happens for a keyword $W' \notin \mathcal{W}$ that $(x', y') := F(W')$ satisfies $p(x') = y'$ as well. Then any combination of $t$ keywords taken from $\mathcal{W} \cup \{W'\}$ would likewise be good. In particular, the number of good combinations would be at least $\binom{n+1}{t}$ in such a case.

While such effects can happen in principle, they are extremely unlikely. Due to the security of the involved hash function $H$ and the bijectivity of $G$, randomly chosen inputs $W' \in \{0,1\}^*$ to $F$ lead to uniformly distributed outputs $(x', y') \in GF(2^{128}) \times GF(2^{128})$. Hence, the probability of the event that $p(x') = y'$ is $2^{-128}$. If the dictionary contains $N - n$ keywords that do not belong to $\mathcal{W}$, the expected number of keywords $W'$ that by coincidence give a correct share is $\frac{N-n}{2^{128}} \ll 1$, as $N$ has to be $\ll 2^{128}$ for practical reasons.$^{6}$ In a similar manner, one can argue that the probability that a combination of $t$ keywords that are not all contained in $\mathcal{W}$ gives the correct plaintext is $2^{-128}$. Here, too, one cannot expect this to happen in practice.

Summing up, the number of good combinations is exactly $\binom{n}{t}$ with overwhelming probability. Hence, the dictionary attack translates to the following procedure: the attacker repeatedly selects at random one of the $\binom{N}{t}$ combinations $W_1, \dots, W_t$ and applies the Extract algorithm until she recovers the plaintext, i.e., until she hits by chance one of the $\binom{n}{t}$ good combinations.

Observe that an attacker is forced to execute the complete Extract algorithm. Omitting step 1 (applying $F$ to $W_1, \dots, W_t$ to get a combination of possible shares $(x_1, y_1), \dots, (x_t, y_t)$) would force the attacker to predict the hash values of the $W_i$, which is infeasible if the hash function is secure. As discussed above, any value $(x, y) \in GF(2^{128}) \times GF(2^{128})$ can be a valid share. Even if all x-coordinates $x_1, \dots, x_t$ fell into the same subset $X_I$, this combination could be possible (from an attacker's point of view), as the corresponding subfunction $g_I$ has degree $t$.
That is, there is no indication that $X_I$ may contain fewer than $t$ x-coordinates of the valid shares. The only event that may tell an attacker that a combination is invalid is that two x-coordinates coincide, i.e., $x_i = x_j$ for $i \neq j$. However, the probability of this is again $2^{-128}$ and hence can be ignored. Summing up, at the end of step 1 an attacker cannot yet exclude that the considered combination of keywords may be good.

Now step 2 gives a value $K' \in GF(2^{128})$ that may be the correct key. Whether this is the case cannot be decided at this stage, as the secret key does not contain any redundant information that would allow distinguishing it from random values. Hence, the attacker has to perform the third and last step as well to find out whether the considered combination of keywords was good. Altogether, this shows that the only possible attack is a plain dictionary attack. We evaluate the effort of a successful attack for our sample cases in Section VI-B7.

$^{6}$ Already $N = 2^{50}$ would imply a dictionary of several terabytes and is still smaller than $2^{128}$ by multiple orders of magnitude.

VI. PRACTICAL IMPLEMENTATION

In this section, we present the prototype implementation of the encryption tool and, as an example, the integration of the search and decryption routines into the standard forensic framework Autopsy [7] (Version 3). We further evaluate the performance of the approach on sample cases.

A. Realization

We briefly sketch how our prototype implementation works. For brevity, we omit implementation details, as all the important algorithms are already explained in Sections II and III and the implementation is straightforward. The encryption software is implemented as a platform-independent Python script.
The command syntax is as follows:

    python ppef_gen.py <pst|mh|mbox|rmbox> <in> <out> <thr> <blist>

We support different inbox formats: the mbox format, which is used by Mozilla Thunderbird among others, the pst format used by Microsoft Outlook, the MH (RFC822) format, and the Maildir format used by many UNIX tools. For encryption, the threshold parameter t needs to be specified, and the blacklist can be provided as a simple text file.

Like the encryption tool, the decryption software is implemented as a platform-independent Python script. For better usability in real-world cases, we also integrated it into Autopsy by implementing an Autopsy plugin called PPEF (Privacy Preserving Forensics). To perform his analysis on the resulting encrypted ppef file, the investigator can simply add the file to a case in Autopsy as a logical file and, in step 3 of the import process, select the PPEF ingest module in order to handle the encrypted data, as shown in Fig. 1. After a ppef file has been imported successfully, it is added to the data sources tree. When the file is selected, Autopsy first shows the random-looking encrypted data in the hexdump, as can be seen in Fig. 2.

The investigator can then use the PPEF search form to run keyword searches on the encrypted data using our previously explained extract algorithm. The search bar is shown in Fig. 3. Note that the number of keywords that have to be entered is obtained from the ppef file automatically. When the PPEF Search button is clicked, the search algorithm runs transparently in the background, and successfully decrypted emails (i.e., emails that match the entered keywords) are automatically imported into Autopsy. The emails are parsed using Autopsy's standard mbox parser, so that they are added within the Messages tree and can be filtered, sorted, and viewed using the standard mechanisms of the tool, as also displayed in Fig. 3 (the contents of the emails have been sanitized for privacy reasons). After briefly demonstrating the practical realization, we next use those prototypes to evaluate the performance of our solution.

Fig. 1. Importing encrypted PPEF file to Autopsy.
Fig. 2. Performing a keyword search on the encrypted ppef file in Autopsy.
Fig. 3. Viewing the successfully decrypted matches.

B. Practical Evaluation

In order to evaluate the practical usage of our scheme using the developed prototype, we use a data set of different real-world email boxes. We evaluate the performance of the prototype for encryption, search, and decryption. All measurements were performed on one core of a laptop with an Intel(R) Core(TM) i5-3320M CPU at 2.60 GHz and 16 GB RAM.

1) The data set: Our data set consists of the following 5 mailboxes: the Apache httpd user mailing list from 2001 to today, consisting of emails (labeled Apache); one mailbox of 1590 work emails (Work); and three mailboxes of private persons with 511, 349, and 83 emails (A, B, C).

2) Blacklist: As an example blacklist, the 10,000 most common words of English, German, Dutch and French [13] are used. In a real case, the blacklist may of course be chosen to be less restrictive, but this is a generic choice used for our performance evaluation. The keyword threshold t was set to 3. In real cases, a more case-specific blacklist must be created.
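How a blacklist reduces an email to its keywords can be sketched in a few lines of Python. This is only an illustration, not the prototype's code: the function name `extract_keywords`, the lowercasing, and the letter-based tokenization are assumptions made here. The only properties taken from the text are that repeated words count once and that blacklisted words are dropped.

```python
import re


def extract_keywords(text: str, blacklist: set[str]) -> set[str]:
    """Return the distinct, non-blacklisted words of an email body.

    Hypothetical helper: tokenization and case folding are assumptions.
    """
    # Split into runs of letters and fold case so "Invoice" == "invoice".
    words = re.findall(r"[^\W\d_]+", text.lower())
    # A set removes multiple occurrences; the difference applies the blacklist.
    return set(words) - blacklist


if __name__ == "__main__":
    body = "The invoice, the INVOICE and the offshore account"
    print(sorted(extract_keywords(body, {"the", "and"})))
```

With a threshold of t = 3, an investigator would have to guess three of the remaining keywords of an email to decrypt it.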
This blacklist must cover words such as the company name, which can likely be found in every email, signatures, and generally terms often used in emails, as long as they are not especially relevant to the case. We focus on the encryption scheme itself and consider building an appropriate blacklist to be a separate problem that is highly dependent on the specific case and cannot be solved in general.

3) Data set details: The lexicographic details of the sample data can be seen in Table II. On average, emails

of the sample set are composed of around 220 words. Further, after disregarding multiple occurrences of the same word and applying the blacklist, the number of resulting keywords per email is reduced to an average of 35 keywords.

TABLE II. LEXICOGRAPHIC PROPERTIES OF SAMPLE MAILBOXES.
Columns: Apache, Work, A, B, C
Rows: Emails; Words (Max, Avg, Med, σ, Σ); Keywords (Max, Avg, Med, σ, Σ)

4) Encryption performance: Table III shows statistics about the time it takes to encrypt the sample emails. It also lists the total time it takes to encrypt each sample mailbox. The rate to encrypt the mailboxes is on average only around emails per second. This value is still acceptable because the encryption needs to be done only once during an investigation, since our approach is noninteractive.

TABLE III. ENCRYPTION PERFORMANCE OF PROTOTYPE.
Columns [s]: Apache, Work, A, B, C
Rows: Min, Max, Avg, Med, σ, Σ

5) Encryption storage overhead: Because the encryption of the emails alone does not add much to the size of the data, the main factor of storage overhead is the storage for the coefficients of our polynomials used to map hashes to shares. Their number is determined directly from the keywords, with a slight overhead introduced by padding the coefficients of the sub-functions to degree t. This introduces a total average overhead of bytes for the polynomial coefficients, plus a 16-byte salt per email. The overhead introduced by the AES encryption is between 33 and 48 bytes, consisting of 16 bytes each for the IV and for the padding prepended to the plaintext to determine correct decryption, and between 1 and 16 bytes of PKCS#7 padding to the AES-128 block size. Altogether, the size increase in kilobytes of the raw emails within the mailboxes compared to our ppef format can be seen in Table IV. This leads to an average storage overhead of 5.2 %.

TABLE IV. STORAGE OVERHEAD OF THE PPEF FORMAT.
Columns: Apache, Work, A, B, C
Size Raw [KB]: 376, , , , 486 6,676
Size PPEF [KB]: 418, , , , 821 6,806
Overhead: 11.2 %, 0.4 %, 3.0 %, 0.7 %, 1.9 %

6) Search and decryption performance: As the prototype always tries to decrypt each email, the search and decryption time does not depend on a keyword hit, and hence the times listed for search in Table V include the decryption time of successful keyword hits. Because the mailboxes are searched at an average rate of only around 98 emails per second, searches take a while but are still feasible.

TABLE V. SEARCH/DECRYPTION PERFORMANCE.
Columns [s]: Apache, Work, A, B, C
Rows: Min, Max, Avg, Med, σ, Σ

Besides timing measurements, the prototype was also evaluated with regard to functional correctness by performing sample keyword searches for keyword combinations included only in specific emails and for keyword combinations not included in any email at all. The keyword matches found corresponded perfectly with the expected results, i.e., only the emails actually containing the keywords were decrypted, or none at all, which also practically confirms the correctness of our scheme.

7) Attack performance: We now discuss the effort needed by an attacker to execute a dictionary attack on an entire email box, such that on average a fraction π of all emails is decrypted, where $0 \le \pi < 1$. The strongest possible attack on our scheme is a brute-force dictionary attack, as long as the underlying ciphers are secure. The effort of such an attack is expressed in the

number q of trials an attacker needs to make. To this end, we use the following sufficient condition to decrypt a single email with probability π:

$$ q \ge \frac{\log_2(1-\pi)}{\log_2\left(1 - \binom{n}{t} \big/ \binom{N}{t}\right)} \qquad (22) $$

This involves several highly context-dependent parameters, making a general statement impossible. But to give an intuition nonetheless, we estimate the attack effort in the cases considered in our evaluation.

TABLE VI. ATTACK TIME IN YEARS TO DECRYPT THE ENTIRE MAILBOX WITH A PROBABILITY OF π USING A WORDLIST WITH N WORDS.
Columns: π, N, Apache, Work, A, B, C

For each of the cases, Table VI displays the effort for different scenarios. We consider the case that each email contains the measured average number $n_{avg}$ of keywords of its mailbox. With respect to the dictionary size, we take as orientation the Second Edition of the 20-volume Oxford English Dictionary. It contains full entries for 171,476 words in current use [14]. So, we first choose N := 171,476. For the success rate π, we chose the value 0.99, as this ensures for an attacker with high probability that almost all emails are decrypted. Using these parameters, we can calculate the number of tries q an attacker has to perform for the respective mailboxes. For example, for the Apache mailbox, this results in q = queries to brute-force a single mail. We multiply this by the number of mails in the mailbox and, in order to give a better intuition, calculate the time needed for the attack based on the measured average query speed as presented in Section VI-B6. This leads to a time of years needed for this attack on the Apache mailbox, as shown in Table VI. Of course, an attacker might achieve some speedup by implementing the decryption software in C instead of Python, which we used for our prototype. The fastest attack in our setting is the one on mailbox C, as this is the smallest mailbox in our data set, and the attack would take 5, years.
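Condition (22) and the time estimate derived from it can be evaluated numerically. The following Python sketch is an illustration under stated assumptions, not the authors' evaluation code: the function names are hypothetical, a year is taken as 365 days, and the search rate from Section VI-B6 is interpreted as the number of trial decryptions per second. It reads (22) as q ≥ log(1 − π) / log(1 − p), where p = C(n, t) / C(N, t) is the chance that one guessed combination of t dictionary words is good for an email with n keywords.

```python
import math


def trials_needed(n: int, t: int, N: int, pi: float) -> float:
    """Right-hand side of condition (22); round up for a whole number of trials."""
    if not 0 <= pi < 1:
        raise ValueError("pi must satisfy 0 <= pi < 1")
    # Probability that one random t-word guess is a good combination.
    p = math.comb(n, t) / math.comb(N, t)
    if p == 0:
        return float("inf")  # fewer than t keywords: no guess can succeed
    if p >= 1:
        return 1.0  # every guess succeeds
    # log1p keeps precision when p is tiny; the ratio is independent of the base.
    return math.log1p(-pi) / math.log1p(-p)


def attack_years(q: float, num_emails: int, trials_per_second: float) -> float:
    """Time to run q trials against each email (assumption: 365-day year)."""
    seconds = q * num_emails / trials_per_second
    return seconds / (365 * 24 * 3600)


if __name__ == "__main__":
    # Example parameters from the text: t = 3, N = 171,476, pi = 0.99;
    # n = 35 is the overall average keyword count, used here for illustration.
    q = trials_needed(n=35, t=3, N=171_476, pi=0.99)
    print(f"q = {q:.3e} trials per email")
    print(f"{attack_years(q, num_emails=83, trials_per_second=98):.1f} years")
```

Because the per-mailbox averages of n are not reproduced here, the printed figures will not necessarily match Table VI; the sketch only shows how q and the attack time scale with n, t, N, and π.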
One might argue that not all words in current use are really used within a particular company, and that the attacker might have auxiliary information about which words are commonly used, for example from a sample set of already decrypted emails. We therefore also calculate the effort for a worse scenario, in which only 50 % of the words in current use from the Oxford English Dictionary are used and the attacker knows exactly which words belong to this 50 %, so he can run an optimized attack with N := 85,738. The effort needed for this attack is obviously lower than before, as shown in the second line of Table VI. From those figures, we can see that in this case the attack takes years for our smallest mailbox C and even longer for the other ones.

Next, we consider an attacker who wants to decrypt a random half of the corpus (i.e., he does not care which emails are decrypted and which are not; only the number of successfully decrypted emails is important). If the attacker instead wants to decrypt a specific half of all the mails (those that contain some important information), the attack effort is identical to that of the full attack from our first scenario, because in our scheme the attacker does not know which encrypted data belongs to which portion of the corpus prior to decryption. This means that the mails decrypted by the random-half attack might not reveal the information the attacker is looking for, but the attack may be sufficient and obviously takes far less time, as shown in the third line of Table VI for π = 0.5.

We further consider a very strong attacker who combines both previously mentioned attack scenarios: on the one hand, he has prior knowledge about the vocabulary used within the company, so he can reduce his wordlist by 50 %, and on the other hand, it is sufficient for him if only half of the emails can be decrypted (N := 85,738 and π = 0.5). The respective figures are shown in the fourth line of Table VI. Finally, we also want to evaluate what happens if

the attacker possesses even more prior knowledge about the words used within the emails. Therefore, we set the number N of words needed for a successful brute-force attack to , which corresponds to the vocabulary in daily speech of an educated person, and to words, being the number of words used in the vocabulary of an uneducated person. If we combine this with the restrictions before, we see that the worst case drops to 0.16 years (1.92 months) to decrypt half of the mails of mailbox C, which contains only 83 emails, with a probability of 50 %, if the attacker knows exactly which restricted vocabulary is used within the emails and none of those words has been blacklisted. For the larger mailboxes, it still takes more than 3 years even in this worst-case scenario. We consider this case to be rather unrealistic because, besides the high prior knowledge of the attacker, mailboxes in companies usually contain far more than 83 emails, and we see that even for mailbox B with only 349 emails the attack is rather costly (3.5 years). Thus, we feel that our solution provides good privacy protection even in very bad scenarios.

VII. RELATED WORK

There is some related work regarding the forensic analysis of emails, which describes the analysis of email data using data mining techniques [15], [16]. Others apply similar techniques to mail collections to identify common origins of spam [17]. However, those works have in common that they operate solely on the raw email data and do not consider privacy. Regarding privacy issues, there are some papers that argue broadly on different aspects of privacy protection in forensic investigations in general [18], [19], [20]. Agarwal et al. [21] give a very good overview of privacy expectations, especially with email in work environments. Hou et al. [22], [?] address an area related to our work when proposing a solution for privacy-preserving forensics for shared remote servers. However, the techniques are very different from those in our work.
First, they deploy public-key encryption schemes for encrypting the data, which incurs a higher computational overhead and larger storage requirements compared to our scheme, which uses secret-key schemes only. Second, the proposed solutions require frequent interaction between the investigator and the data holder, while our solution is noninteractive. Similarly, Olivier [23] introduces a system to preserve the privacy of employees when surfing the web while allowing the source of malicious requests to be tracked for forensics. To the best of our knowledge, there has been no solution to ensure privacy in forensic email investigations by cryptographic means so far.

Regarding this work's cryptographic part, to the best of our knowledge, no existing solution is adequate for what we achieve. We nevertheless review the directions that are most closely related. First, searchable encryption schemes, e.g., [24], [25], [26], encrypt files in such a way that they can be searched for certain keywords. More precisely, the scenario is that one party, the data owner, encrypts his data and outsources it to another party, the data user. The data owner can now create appropriate search tokens for keywords that allow the data user to identify the encrypted files that contain these keywords. The identified ciphertexts are given to the data owner, who can decrypt them. Such schemes differ from the proposed scheme in two important aspects. First, most schemes enable search for single keywords, while only few support conjunctive queries, e.g., see [27], [28], [29]. Second, they require interaction between the data owner and the data user, while in our scenario the investigator needs to be able to autonomously search and decrypt the encrypted files without requiring the data owner (e.g., the company) to do this.

VIII. CONCLUSION AND FUTURE WORK

In this section, we summarize our findings, discuss limitations, and point out possible future work.

A. Summary

In the course of this paper, we described a modified process of forensically analyzing email data in a corporate environment. This process protects the privacy of the email owners as far as possible while still supporting a forensic investigation. In order to achieve this goal, the modified process makes use of a novel cryptographic scheme that is introduced in this paper. Our scheme encrypts a given text in such a way that an investigator can reconstruct the encryption key, and hence decrypt the text, if (and only if) he guesses a specific number of words within this text. This threshold t of the number of words that have to be guessed can be chosen when applying our scheme. As a further configuration of the scheme, one can specify a list of words that shall not contribute to the key reconstruction. This kind of blacklist is meant to contain words that occur in almost every text, such as pronouns or salutations.

As a proof of concept, we implemented a prototype software that is able to encrypt email inboxes of different formats to a container file. To show the feasibility of the practical application of our approach, we further implemented a plugin for the well-known open source forensic framework Autopsy that enables the framework