VERIFIABLE SEARCHABLE SYMMETRIC ENCRYPTION BY ZACHARY A. KISSEL B.S. MERRIMACK COLLEGE (2005) M.S. NORTHEASTERN UNIVERSITY (2007) SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY COMPUTER SCIENCE UNIVERSITY OF MASSACHUSETTS LOWELL Signature of Author: Date: Signature of Dissertation Chair: Dr. Jie Wang Signatures of Other Dissertation Committee Members Committee Member Signature: Dr. Xinwen Fu Committee Member Signature: Dr. Tingjian Ge Committee Member Signature: Dr. Yan Luo
VERIFIABLE SEARCHABLE SYMMETRIC ENCRYPTION BY ZACHARY A. KISSEL ABSTRACT OF A DISSERTATION SUBMITTED TO THE FACULTY OF THE DEPARTMENT OF COMPUTER SCIENCE IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY COMPUTER SCIENCE UNIVERSITY OF MASSACHUSETTS LOWELL 2013 Dissertation Supervisor: Jie Wang, Ph.D. Professor and Chair, Department of Computer Science
Cloud storage has become increasingly prevalent in recent years. It provides a convenient platform for users to store data that can be accessed from anywhere at any time without the cost of maintaining a storage infrastructure. However, cloud storage is inherently insecure, hindering general acceptance of the paradigm shift. To make use of storage services provided by a cloud, users would need to place their trust, at least implicitly, in the provider. There have been a number of attempts to alleviate the need for this trust through cryptographic methods. An immediate approach would be to encrypt each file before uploading it to the cloud. This approach calls for a new searching mechanism over encrypted data stored in the cloud. This dissertation considers a solution to this problem using Symmetric Searchable Encryption (SSE). SSE allows users to offload search queries to the cloud. The cloud is then responsible for returning the encrypted files that match the (also encrypted) search queries. Most previous work focused on keyword search in the Honest-but-Curious (HBC) cloud model, while some more recent work has considered searching on phrases. Recently, a new cloud model was introduced that supersedes the HBC model. This new model, called Semi-Honest-but-Curious (SHBC), is less restrictive over the actions a cloud can take. In this dissertation, we present three systems that are secure under this new SHBC model. Two systems provide phrase search and the other provides hierarchical access control over keyword search.
Acknowledgements I would like to begin by thanking the person most responsible for the success of this dissertation, my advisor, Prof. Jie Wang. Prof. Wang provided me with the unique opportunity to look at the problems that interested me, providing encouragement and guidance as I progressed. I would also like to thank my committee members, Professors Xinwen Fu, Tingjian Ge, and Yan Luo. Together, they provided helpful comments that improved this work. In particular, the article that became Chapter 5 was in preparation at the time of the proposal; their comments about investigating access control over searching validated the need to submit that work. While completing the last year of my PhD studies, I was fortunate to have the opportunity to join the faculty at Merrimack College as a visiting professor. This appointment gave me a chance to branch out in all facets of academia. I am most indebted to the friendships and hallway conversations with Lisa Michaud, Vance Poteat, and Chris Stuetzle. In particular, I wish to thank Chris Stuetzle for early reviews of the material that would become Chapter 3. I would also like to thank Vance Poteat for serving as a mentor for my transition from industry to teaching this year, and for sparking my interest in networking and security many years ago. I would like to thank my parents, Dan and Deb, for their continued love, support, and encouragement over all these years, and specifically for demonstrating to me the most important lesson: with hard work there are no limits. To Wendy, thank you for sharing this journey with me. Thank you for all the encouragement and understanding for
all the hours and late nights it took to write this dissertation.
Contents List of Figures vii 1 Introduction 1 1.1 Applications of Searchable Encryption................. 2 1.2 Overview of Results............................ 3 1.3 Dissertation Structure.......................... 4 2 Background 5 2.1 Background on Probability........................ 5 2.2 Background on Cryptography...................... 7 2.2.1 Pseudo-Random Primitives................... 7 2.2.2 Symmetric Encryption...................... 9 2.2.3 Cryptographic Hash Functions.................. 11 2.3 Searchable Encryption Framework.................... 12 2.4 Index Data Structures.......................... 13 2.5 Models of Clouds and Security...................... 15 2.6 Previous Work.............................. 23 2.6.1 A First Solution.......................... 23 2.6.2 Early Indexed Approaches.................... 26 2.6.3 Improved SSE Constructions................... 29 2.6.4 Phrase Searching......................... 32 v
2.6.5 Non-HBC Systems........................ 34 3 Verifiable Phrase Search 36 3.1 Verifiable Encrypted Phrase Search................... 37 3.1.1 Verifiable Keyword Search.................... 37 3.1.2 Verified Phrase Searching.................... 38 3.1.3 Correctness............................ 40 3.2 Conclusion................................. 45 4 Verifiable Phrase Search in a Single Phase 46 4.1 Notations................................. 47 4.1.1 Notations............................. 47 4.2 Background................................ 47 4.2.1 Background on Next-Word Indexing.............. 47 4.2.2 Secure Linked Lists........................ 48 4.3 Basic Construction............................ 49 4.3.1 Constructing an Encrypted Next-Word Index......... 49 4.3.2 An SSE Construction....................... 51 4.3.3 Security and Efficiency...................... 53 4.4 Adding Verification............................ 59 4.4.1 Discussion of Security Guarantees................ 59 4.5 Conclusion................................. 60 5 Hierarchical Access Control 61 5.1 Model................................... 62 5.2 Key Regression.............................. 64 5.3 Construction of HAC-SSE and Security................ 65 5.3.1 Security Guarantees of HAC-SSE................ 69 5.4 Adding Revocation and Verification................... 72 vi
5.4.1 Security Guarantees....................... 74 5.5 Conclusion................................. 75 6 Conclusion 76 6.1 Results................................... 76 6.2 Future Work................................ 76 Bibliography 78 Biography 80 vii
List of Figures 2.1 A secure linked list on the set {D_1, D_3, D_5, D_6}............ 31 2.2 An example of a phase two table based index.............. 34 4.1 An example next-word index....................... 48 4.2 Example arrays A and N for = {w_1, w_2, w_3}. The arcs represent a logical connection............................. 51 5.1 An annotated trie for dictionaries 1 = {cat, dog} and 2 = {car, do}..................................... 66 5.2 Final trie based on Figure 5.1. The value P_h denotes the parent's hash value and l denotes the current node's level............... 67 5.3 Modification to the BuildIndex algorithm to add verification support to the trie................................... 73 5.4 The HVerify algorithm.......................... 73 5.5 The HRevokeUser algorithm....................... 74
Chapter 1 Introduction Imagine for the moment that Alice has a large collection of documents, D, that she wishes to store in a distributed storage environment owned by Bob. Bob has been known to be nosy, which means Alice must encrypt all the documents in her document collection before uploading them to Bob's distributed storage environment. Assume, now, that Alice wants to read the documents in D that contain a certain word or phrase. What does she do? Trivially, she could ask Bob to send her all the files, decrypt them locally, and then search for the documents that contain the information she is looking for. Retrieving all the files and then decrypting them, however, will incur a great cost in both communication and time. It would be far more efficient, for Alice, if Bob could perform the search and only send the documents that match her query. Alice's problem is known as the searchable encryption problem. Song, Wagner, and Perrig offered the first glimpse of a solution to Alice's problem [1]. They introduced Searchable Symmetric Encryption (SSE). This new SSE construction allows Alice to ask Bob to query the encrypted document collection for a specific word or phrase. Alice enables Bob to perform the search by providing Bob, at query time, with some special information known as a trapdoor. Bob then returns the results of the query to Alice. The guarantees that they provided are that
the queries remain unknown to Bob (query privacy) and any information beyond the number of results and size of the encrypted documents is unknown to Bob (query result privacy). Though not its original intention, we can adapt searchable encryption to cloud storage. We assume that a collection of encrypted documents, D, is stored in the cloud such that a search query can be executed over all the documents in the collection. The cloud is responsible for both executing the query and returning the results. We have the added security guarantee that the cloud should be unable to learn the nature of the query. If one uses only symmetric cryptography in the solution, the problem is called the Symmetric Searchable Encryption (SSE) problem. While there do exist asymmetric forms of searchable encryption [2], we will only consider the SSE problem, for it is more efficient than asymmetric solutions to the searchable encryption problem. 1.1 Applications of Searchable Encryption Searchable encryption over phrases can be used to support a large number of diverse applications. For example, in human resource management, one may want to look for a series of phrases that assess the performance of an employee. In medical record management, a doctor may want to retrieve all records in which certain ailment terms occur next to each other as a phrase. At an educational institution, an instructor may want to search for student information based on phrases related to course performance. All of these applications share the common need of querying for phrases that are not necessarily known in advance. In the case where we have access to a hierarchical access control mechanism on encrypted keyword search, we have even more applications. For example, a company can outsource its data to the cloud and different employees can have different access. For
example, only members of the finance department should be able to search for financial information and only members of the engineering department should be able to search for blueprint information. In the area of parental controls, envision a search engine where you do not have to forgo query privacy for filtering of explicit content. All the applications presented share common needs: confidentiality of data, query privacy, and query result privacy. Thus, they are perfect for the application of searchable encryption. 1.2 Overview of Results In this dissertation we provide efficient solutions to two problems in Symmetric Searchable Encryption. Both solutions exhibit the property of verifiability. By verifiability we mean that the client, in an SSE scheme, can detect if the cloud has returned incomplete or inaccurate results. Therefore, the cloud should not be able to fabricate results that are inconsistent with the truth about the document collection without being detected. This can be achieved by considering SSE solutions under the model developed by Chai and Gong in [3]. The model is called the Semi-Honest-but-Curious (SHBC) model. In this model, the cloud does the following: (1) honestly stores data; (2) honestly executes the search operations or a fraction of them; (3) returns a non-zero fraction of the query results honestly; and (4) tries to learn as much information as possible. If a solution has the property of verifiability over its returned results, we say that we have a solution to the Verifiable SSE problem. Our first result is structured around providing a verifiable phrase search mechanism. This result is based on the two-phase protocol presented in [4]. Given a phrase, p, the first phase finds all the documents in D that contain all the words in p. The second phase, using the results of the first, determines which documents in D contain all the words in p, ordered according to p.
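Over plaintext, the two phases just described can be sketched as follows. This is an illustrative sketch only: the toy corpus, whitespace tokenization, and function names are our own assumptions, and the actual protocol runs both phases over encrypted data.

```python
# Illustrative plaintext sketch of the two-phase phrase search idea.
# In the real protocol both phases operate on encrypted data via trapdoors;
# the corpus, tokenizer, and names below are assumptions for illustration.

def phase_one(docs, phrase_words):
    """Return ids of documents that contain every word of the phrase."""
    needed = set(phrase_words)
    return {doc_id for doc_id, text in docs.items()
            if needed <= set(text.split())}

def phase_two(docs, candidates, phrase_words):
    """Keep only candidates where the words appear consecutively, in order."""
    n = len(phrase_words)
    result = set()
    for doc_id in candidates:
        words = docs[doc_id].split()
        if any(words[i:i + n] == phrase_words
               for i in range(len(words) - n + 1)):
            result.add(doc_id)
    return result

docs = {1: "the quick brown fox", 2: "brown the quick dog",
        3: "a quick brown cat"}
phrase = ["quick", "brown"]
cand = phase_one(docs, phrase)        # all three contain both words
print(phase_two(docs, cand, phrase))  # only documents with the phrase in order
```

Note how phase one alone over-approximates the answer (document 2 contains both words but not the phrase); phase two prunes it using the word order of p.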
Our second result, improving on the first, presents a single-phase search protocol. This new single-phase protocol reduces both the communication complexity and the work that must be performed by the client to complete a successful search. Like our first result, the second result is also verifiable. In a second vein, we investigate an efficient verifiable searchable encryption scheme which provides access control over keywords that appear in a document collection. The most trivial form of access control is creating one group of users and allowing dynamic changes to the group. This problem has a good constructive solution provided by Curtmola et al. in [5]. We demonstrate a hierarchical access control mechanism where we divide the users into numbered groups such that if a user in group i has the ability to successfully search for a particular search term, then any user in any group j > i can also successfully search for the same search term. 1.3 Dissertation Structure The remainder of this dissertation is structured as follows. In Chapter 2 we discuss the cryptography, theory, and data structures needed to realize SSE. We conclude that chapter with a discussion of existing work on SSE. In Chapter 3 we present a verifiable phrase search SSE scheme. In Chapter 4 we improve the system of Chapter 3 by introducing a single-phase protocol. In Chapter 5 we present a hierarchical access control mechanism for SSE. We conclude in Chapter 6 by discussing future directions based on the results presented.
Chapter 2 Background Song, Wagner, and Perrig posed the question [1]: Given an encrypted document, how does one search for a word in that document? They created a system known as Searchable Symmetric Encryption (SSE) to answer just this question. In this chapter we present all the background information necessary to understand SSE. We start by reviewing a few details from probability and cryptography. We proceed to discuss two formal models of clouds and the existing security models for SSE. We conclude by discussing the existing work in the area. 2.1 Background on Probability In order to understand modern cryptography, one needs a firm grasp of probability theory. In this section we review the probability theory needed to understand Section 2.2. The ideas that must be understood are the notions of probability distributions, statistical distance, and computational indistinguishability. We begin by discussing the idea of negligible functions. In cryptography we do not require that the adversary always fail, but rather that the adversary succeeds only with some very small non-zero probability. Formally, we call this small non-zero probability negligible, denoted by negl. This is an asymptotic notion which we formally define in
Definition 2.1.1. Definition 2.1.1 (Negligible Function [6]). A function f(n) is called negligible if for every polynomial function poly(n) there exists an n_0 such that, for all n > n_0, we have f(n) < 1/poly(n). If the bound holds, we denote f(n) by negl(n). We are interested in making statements about probability distributions. Define a sample space S as the set of possible outcomes of some experiment and an event A as a subset of S. A probability distribution is defined as follows: Definition 2.1.2 (Probability Distribution [7]). A probability distribution Pr(·) on a sample space S is a mapping from events of S to real numbers satisfying the following axioms: 1. Pr(A) ≥ 0 for any event A. 2. Pr(S) = 1. 3. Pr(A ∪ B) = Pr(A) + Pr(B) for any two mutually exclusive events A and B. More generally, for any (finite or countably infinite) sequence of events A_1, A_2, ... that are pairwise mutually exclusive, Pr(∪_i A_i) = Σ_i Pr(A_i). The notation Pr(A) denotes the probability of event A. A random variable is a function X : S → R, where S is a sample space. Given Definition 2.1.2 and the notion of a random variable we can define the notion of a probability ensemble. A probability ensemble is a, possibly infinite, collection of probability distributions. Formally, we define them as follows: Definition 2.1.3 (Probability Ensemble [6]). Let I be a countable set. A probability ensemble indexed by I is a collection of random variables {X_i}_{i∈I}.
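To build intuition for the negligible functions of Definition 2.1.1, the following numeric sketch (not a proof) shows that the canonical negligible function f(n) = 2^(-n) eventually stays below 1/n^c for any fixed c. The helper name and the finite search horizon are illustrative assumptions.

```python
# Numeric illustration (not a proof) that f(n) = 2^(-n) is negligible: for any
# fixed polynomial bound 1/n^c there is an n0 past which 2^(-n) < 1/n^c holds.

def first_stable_n(c, horizon=200):
    """Smallest n0 <= horizon with 2^(-n) < 1/n^c for every n in [n0, horizon]."""
    ok_from = None
    for n in range(1, horizon + 1):
        if 2.0 ** -n < 1.0 / n ** c:
            if ok_from is None:
                ok_from = n
        else:
            ok_from = None  # the bound failed here; restart the stable run
    return ok_from

for c in (2, 5, 10):
    print(f"2^-n < 1/n^{c} for all tested n >= {first_stable_n(c)}")
```

The crossover point n_0 grows with c, but it always exists: this is exactly the order of quantifiers in the definition (for every polynomial there is some n_0, which may depend on the polynomial).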
Several cryptographic discussions rely on the notion of one probability distribution being computationally indistinguishable from another. What this means is that one cannot construct a probabilistic polynomial-time algorithm that can distinguish one distribution from another with more than a negligible probability. Given Definition 2.1.3 we define computational indistinguishability formally as follows: Definition 2.1.4 (Computational Indistinguishability [6]). Two probability ensembles X = {X_n}_{n∈N} and Y = {Y_n}_{n∈N} are computationally indistinguishable, denoted X ≈_c Y, if for every probabilistic polynomial-time distinguisher D there exists a negligible function negl(n) such that |Pr(D(1^n, X_n) = 1) − Pr(D(1^n, Y_n) = 1)| ≤ negl(n), where D(1^n, X_n) means to choose x according to distribution X_n, and then run D(1^n, x). 2.2 Background on Cryptography Searchable Symmetric Encryption is based on several cryptographic primitives. The necessary primitives are pseudo-random generators, pseudo-random functions, pseudo-random permutations, symmetric key encryption, and cryptographic hash functions. For discussions of these primitives please see, for example, [8, 6, 9]. 2.2.1 Pseudo-Random Primitives We first consider the pseudo-random generator (PRG). A pseudo-random generator is a function that, provided with an n-bit input, expands its input to a longer sequence in such a way that the distribution it generates is computationally indistinguishable from a truly random one. The precise definition appears in
Definition 2.2.1. Definition 2.2.1 (Pseudo-Random Generator [6]). Let l(·) be a polynomial and G a deterministic polynomial-time algorithm such that for any input s ∈ {0,1}^n, algorithm G outputs a string of length l(n). We say that G is a pseudo-random generator if the following two conditions hold: 1. For every n it holds that l(n) > n. 2. For any probabilistic polynomial-time distinguisher D, there exists a negligible function negl(n) such that |Pr(D(r) = 1) − Pr(D(G(s)) = 1)| ≤ negl(n), where r is chosen uniformly at random from {0,1}^{l(n)}, the seed, s, is chosen uniformly at random from {0,1}^n, and the probabilities are taken over the random coin tosses used by D and the choice of r and s. A stronger pseudo-random primitive comes in the form of a pseudo-random function (PRF). A pseudo-random function is a member of a family of functions such that the behavior of one function, drawn at random from the family, is computationally indistinguishable from that of a truly random function. A family of functions is a set of keyed functions F : {0,1}^k × {0,1}^n → {0,1}^l, where k, n, l > 1. If k = n = l then we have a pseudo-random permutation (PRP). Formally, a pseudo-random function is defined by Definition 2.2.2. Definition 2.2.2 (Pseudo-Random Function). A keyed function F : {0,1}^k × {0,1}^n → {0,1}^l is pseudo-random if for any probabilistic polynomial-time distinguisher D, given oracle access to F_K = F(K, ·), there exists a negligible function negl(n) such that |Pr(D^{F_K(·)}(1^n) = 1) − Pr(D^{f(·)}(1^n) = 1)| ≤ negl(n),
where K ∈_R {0,1}^k is chosen uniformly at random and f is chosen uniformly at random from all functions that map {0,1}^n to {0,1}^l. If we have a family of length-preserving functions in which each F(K, ·) is a permutation, then we get a PRP. We say a keyed function is length preserving if |F(K, x)| = |x| = |K|. Formally, this is given by Definition 2.2.3. Definition 2.2.3 (Pseudo-Random Permutation [6]). Let F : {0,1}^* × {0,1}^* → {0,1}^* be an efficient, length-preserving, keyed permutation. We say that F is a pseudo-random permutation if for any probabilistic polynomial-time distinguisher D, there exists a negligible function negl(n) such that |Pr(D^{F_K(·)}(1^n) = 1) − Pr(D^{f(·)}(1^n) = 1)| ≤ negl(n), where K ∈_R {0,1}^n is chosen uniformly at random and f is chosen uniformly at random from the set of permutations on {0,1}^n. Notationally, D^{f(·)}(·) means that D uses f as an oracle and D can query f a polynomial number of times. 2.2.2 Symmetric Encryption Given a set M known as the message space, a set C known as the ciphertext space, and a set K known as the key space, we define symmetric encryption as a tuple (G, E, D) of probabilistic polynomial-time algorithms. G : 1^λ → K: The key generation algorithm takes a security parameter, 1^λ, and selects a key k ∈ K. E : M × K → C: The encryption algorithm takes a message and a key as input and outputs a string of ciphertext.
D : C × K → M: The decryption algorithm takes a string of ciphertext and a key as input and outputs the plaintext if, and only if, the ciphertext was encrypted with the key. Otherwise, ⊥ is returned. There is one correctness guarantee, namely, D_k(E_k(m)) = m must hold for all keys k and messages m. Notationally, we will write the key used for encryption and decryption as a subscript of the function, not as an argument. The simplest practical security guarantee that a symmetric encryption scheme can exhibit is that of semantic security, meaning that an attacker is unable to learn anything about the plaintext except what is leaked by the ciphertext (e.g., the length of the message). In other words, the probability of finding the plaintext from the ciphertext is not much different from guessing the plaintext without the ciphertext. Formally, this can be defined as follows: Definition 2.2.4 (Semantic Security for Symmetric Encryption [6]). A symmetric encryption scheme (G, E, D) is semantically secure in the presence of an eavesdropper if for every probabilistic polynomial-time algorithm A there exists a probabilistic polynomial-time algorithm A′ such that, for all efficiently-sampleable distributions X = (X_1, ...) and all polynomial-time computable functions f and h, there exists a negligible function negl(n) such that |Pr(A(1^n, E_k(m), h(m)) = f(m)) − Pr(A′(1^n, h(m)) = f(m))| ≤ negl(n), where m is chosen according to distribution X_n, and the probabilities are taken over the choice of m and the key k, and any random coins used by A, A′, and the encryption process. This definition is based on the pioneering work of Goldwasser and Micali [10]. From Goldwasser and Micali's work, Bellare, Desai, Jokipii, and Rogaway [11] defined semantic security for symmetric encryption systems.
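The (G, E, D) interface can be sketched as follows, using HMAC-SHA256 as a stand-in PRF to derive a keystream. This is a simplified, unauthenticated illustration of modeling encryption with a pseudo-random primitive, not a production scheme (a deployed system would use a vetted authenticated cipher such as AES-GCM); in particular, without authentication, decrypting under a wrong key yields garbage rather than the ⊥ of the formal definition.

```python
import hmac
import hashlib
import os

# Toy (G, E, D) sketch: HMAC-SHA256 plays the role of the PRF F_k, and
# encryption XORs the message with the keystream F_k(nonce || counter).
# Illustration only; it omits authentication and ciphertext integrity.

def G(security_parameter=32):
    """Key generation: sample a uniformly random key of lambda bytes."""
    return os.urandom(security_parameter)

def _keystream(key, nonce, length):
    """Derive a keystream of the requested length from the PRF."""
    blocks = []
    for counter in range((length + 31) // 32):
        blocks.append(hmac.new(key, nonce + counter.to_bytes(8, "big"),
                               hashlib.sha256).digest())
    return b"".join(blocks)[:length]

def E(key, message):
    """Encrypt: pick a fresh nonce, XOR message with PRF-derived keystream."""
    nonce = os.urandom(16)
    stream = _keystream(key, nonce, len(message))
    return nonce + bytes(m ^ s for m, s in zip(message, stream))

def D(key, ciphertext):
    """Decrypt: regenerate the keystream from the nonce and XOR it off."""
    nonce, body = ciphertext[:16], ciphertext[16:]
    stream = _keystream(key, nonce, len(body))
    return bytes(c ^ s for c, s in zip(body, stream))

k = G()
ct = E(k, b"attack at dawn")
assert D(k, ct) == b"attack at dawn"  # the correctness guarantee D_k(E_k(m)) = m
```

The fresh random nonce per encryption is what makes E probabilistic, which is necessary for semantic security: encrypting the same message twice yields different ciphertexts.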
Using pseudo-random generators, pseudo-random functions, and pseudo-random permutations one can construct symmetric encryption schemes. One-time-pad-style encryption systems can be constructed from pseudo-random generators, and block ciphers can be constructed from pseudo-random permutations or pseudo-random functions. In particular, block ciphers can be constructed using the Luby-Rackoff construction [12]. In the remainder of this dissertation we will model a symmetric encryption system as one of the pseudo-random primitives exhibiting its properties. 2.2.3 Cryptographic Hash Functions We define a hash family H as a family of surjective functions h_s : {0,1}^n → {0,1}^m for m < n. We say that the hash function h_s ∈ H is collision resistant if it is hard to find distinct strings x_1, x_2 ∈ {0,1}^n that hash to the same value v ∈ {0,1}^m. We say that the hash function h_s is pre-image resistant if, given the value h_s(x), an attacker can recover x with only negligible probability. Lastly, we say that the hash function h_s is second pre-image resistant if, given a value x ∈ {0,1}^n, an attacker can find, with only negligible probability, an x′ ∈ {0,1}^n such that h_s(x) = h_s(x′). Cryptographic hash functions are a family of collision, pre-image, and second pre-image resistant hash functions which are used in many areas of cryptography. They consist of a pair of probabilistic polynomial-time functions (G, H). G is used to select, at random, a key s. This key is an index of the hash function in the family. The function h_s : {0,1}^* → {0,1}^{l(n)} is drawn from H according to s. The length of the range of h_s (i.e., l(n)) must be less than, or equal to, the length of the message being hashed. Cryptographic hash functions can be constructed from block ciphers using the Merkle-Damgård construction [13, 14]. Generally, the security of hash functions is modeled in two ways. The first is called the standard security model.
In the standard model, one only uses the three properties of cryptographic hash functions stated above. The second model, called the
random oracle model, treats a hash function as a random oracle. This random oracle responds with a random value for each query. However, if a query is repeated, the oracle will respond with the same value. This model was first proposed by Goldreich, Goldwasser, and Micali in 1985 [15]. 2.3 Searchable Encryption Framework We make use of the following notation for discussing the results of research into SSE. Let D = {D_1, D_2, ..., D_n} denote a collection of n encrypted documents in the cloud storage, Σ the alphabet over which characters of strings are drawn, and {w_1, w_2, ..., w_d} a dictionary of d words drawn from Σ. We associate with each document in collection D a number used as an index; this function is denoted by id : D → Z. Let D(w_i) denote the set of document identifiers for documents that contain the word w_i. We will use m_1 ∥ m_2 to denote the concatenation of messages m_1 and m_2. For the remainder of this dissertation we will define our SSE systems following the rigorous framework of Curtmola, Garay, Kamara, and Ostrovsky in [5]. Their model consists of a tuple of four algorithms (Keygen, BuildIndex, Trapdoor, Search). These algorithms are defined as follows: Keygen(1^λ): A probabilistic algorithm run by the owner to set up the scheme. It takes a security parameter, 1^λ, as input and returns a secret key K. BuildIndex(K, D): A probabilistic algorithm run by the owner to generate the indexes. It takes a key K and a document collection D as input and returns an index I. Trapdoor(K, w): An algorithm run by the owner to generate a trapdoor T_w, given a word w and a key K.
Search(I, T_w): An algorithm, run by the cloud, that searches for a keyword in the document collection. It takes an index I and a trapdoor T_w and returns the document identifiers of documents that contain word w. An index, I, is a data structure, or set of data structures, that tracks keywords and the documents that contain those keywords. We note that in some chapters of this dissertation we will, in some cases, assume that the model uses phrases p instead of words w. This causes small modifications to the inputs of both the Trapdoor and Search functions. There are two major forms of indexes used by SSE: the inverted index and the per-document index. The inverted index structure, borrowed from the field of information retrieval, is a single data structure used to associate each keyword with the set of documents in the document collection that contain that keyword [16]. The per-document index associates, with each document, a data structure that tracks the keywords stored in that document. 2.4 Index Data Structures In this section we will discuss two data structures that permeate the research. These data structures are used to construct both per-document and inverted indexes. Indexes are required to provide two operations, Search and Insert, with a third optional operation, Delete. The Search operation is used to determine if a search key occurs in the data structure. The Insert operation is used to add a new key, with its associated data, to the data structure. The Delete operation is used to remove a key, and its associated data, from the data structure. We present two index structures in this section: the trie and the Bloom Filter. Devised by Fredkin [17], a trie is an index structure which supports three main operations: Insert, Search, and Delete; all take a word w ∈ Σ* as input. A trie is a (|Σ ∪ {$}|)-ary tree, where each node of the tree is labeled with an element of Σ ∪ {$}.
Moreover, a root-to-leaf path through the tree denotes a word w ∈ Σ*, which is terminated by a special character $ ∉ Σ. The Insert operation appends a $ to the input w. Starting at the root node of the tree, we use w to create a path. The first time we reach a node that does not have a child corresponding to the current letter of w, we add a subpath as a child of the current node. Moreover, we label this subpath appropriately with the remaining letters of w, terminating the path with a $. We note that the insertion time in the trie is O(|w|). The Search operation uses input w as a path through the tree. The function first appends a $ to the path. If that path ends in a leaf, i.e., the path is a root-to-leaf path, the search is successful. Otherwise, the word does not exist in the dictionary. We note that the search time in the trie is Θ(|w|) in the worst case. The Delete operation uses input w as a path through the tree. This function will remove all nodes, in a bottom-up fashion, according to the path given by w. There is an exception: a node will not be removed if it has children that do not match the symbol indicated by the previous level in w. We note that the Delete time in the trie is Θ(|w|). In this dissertation we will denote a trie by T and a node of the trie by T_{i,j}, where i is the depth of the node and j the left-to-right placement of the node. We will denote access to values stored in a node of T by T_{i,j}[s], where s denotes the name of the field. Devised by B. H. Bloom [18], a Bloom Filter is an index structure which consists of a k-bit vector and three hash functions h_1, h_2, and h_3 with range {1, 2, ..., k}, and supports two operations: Insert and Search. The Insert operation inserts input v by setting positions h_1(v), h_2(v), and h_3(v) in the k-bit vector to 1. The Search operation determines if input v is in the filter. To do this it checks whether all locations h_1(v), h_2(v), and h_3(v) in the k-bit vector are 1.
If they are, then the value is likely to be in the filter. We say likely because it is possible that the Bloom Filter may give a false positive result. However, the Bloom Filter will never give a false negative. This false positive rate comes with the advantage of Θ(1) Insert and Search time. The traditional Bloom Filter does not support the Delete operation. The reason is this: if you simply set the hashed locations of the value v to zero, you may introduce false negatives. One way to provide the Delete operation is to use a Counting Bloom Filter [19]. Briefly, a Counting Bloom Filter makes every bucket of the Bloom Filter a k-bit value, for some k ∈ Z. The Insert operation works exactly the same as the Insert for the traditional Bloom Filter, except that 1 is added to the number found at each hashed location in a counting filter. The Search operation, for the counting filter, works similarly to the Bloom Filter, except that the search value is considered present in the filter if each hashed location holds a number greater than zero. The Delete operation works by subtracting one from each location that the value hashes to. The Delete operation should only be performed if the Search operation finds the value to be deleted. 2.5 Models of Clouds and Security In order to talk about the security of systems we must model the abilities of the attacker. In this case, the attacker is the cloud. There are two models prevalent in the research. The first is the Honest-but-Curious (HBC) model and the second is the Semi-Honest-but-Curious (SHBC) model proposed by Chai and Gong in [3]. The HBC model has been traditionally used in the literature to model servers in cryptographic protocols. This model can easily be described in terms of a cloud, which is what we will do here. The HBC model describes a cloud that interacts in the following way:
1. Honestly store the data.
2. Honestly follow the steps of the protocol.
3. Try to learn as much information as possible from the interaction.

The model has served well for many cloud-based protocols; however, it does not take into account the desire of the cloud to do as little work as possible. In [3] Chai and Gong captured a more liberal set of requirements, which they termed the Semi-Honest-but-Curious model. In the Semi-Honest-but-Curious model the cloud acts in the following way:

1. Honestly store the data.
2. Honestly execute the search operations, or a fraction of them.
3. Return a non-zero fraction of the query results honestly.
4. Try to learn as much information as possible from the interaction.

Once a cloud model is selected we must provide formal definitions of the security guarantees of a constructed SSE system. There is a potpourri of security guarantees for searchable encryption that have evolved in the literature over time. The five models we will review in this section are indistinguishability under chosen keyword attack (IND-CKA and IND2-CKA) due to Goh [20]; Privacy Preserving Search on Encrypted Data (PPSED) due to Chang and Mitzenmacher [21]; and non-adaptive and adaptive indistinguishability due to Curtmola, Garay, Kamara, and Ostrovsky [5]. We note that Song, Wagner, and Perrig did not define a set of security models beyond the normal cryptographic assumptions. Goh gave the first two models of security under the names IND-CKA and IND2-CKA. IND-CKA, semantic security against adaptive chosen keyword attacks, captures the notion that an adversary A cannot deduce, beyond what is known
from previous queries, the contents of a document from its index. A system is shown to exhibit this security via a game between a challenger C and an adversary A.

Definition 2.5.1 (IND-CKA Game [20]). Given a challenger C and an adversary A we define the IND-CKA game as a tuple of rounds: Setup, Queries, Challenge, and Response.

Setup: The challenger C creates a set S of q words and gives it to A. The adversary A chooses a number of subsets S* from S and returns S* to C. Once C receives S*, the challenger C uses BuildIndex to construct an index of S*, I, for each document in D. The challenger C concludes by sending all indexes with their associated subsets to A.

Queries: The adversary A is allowed to query C on a word w and receive the trapdoor T_w for w. With T_w, the adversary A can invoke Search on an index I to determine if w ∈ I.

Challenge: After making some trapdoor requests, the adversary A decides on a challenge by picking a non-empty subset V_0 ⊆ S, and generating another non-empty subset V_1 from S such that V_0 △ V_1 ≠ ∅, and the total length of the words in V_0 is equal to the total length of the words in V_1. Finally, the adversary A must not have queried C for the trapdoor of any word in V_0 △ V_1. The adversary A then gives V_0 and V_1 to C, who chooses b ∈_R {0, 1}, runs BuildIndex to obtain the index I_{V_b} for V_b, and returns I_{V_b} to A. The challenge for A is to determine b. After the challenge is issued, the adversary A is not allowed to query C for the trapdoors of any word w ∈ (V_0 △ V_1).
Response: The adversary A eventually outputs a bit b′, representing its guess for b. The advantage of A in winning this game is defined by Adv_A = |Pr(b = b′) − 1/2|. This probability is taken over all the internal coin tosses of A and C. We say that the adversary A (t, ε, q)-breaks the index if Adv_A ≥ ε after A takes at most t time to make q trapdoor queries to the challenger C. An index I is (t, ε, q)-IND-CKA secure if no adversary can (t, ε, q)-break it.

This game follows the semantic security game introduced by Goldwasser and Micali [10]. We point out that with this model there is no requirement that the trapdoors be secure. This deficiency was later corrected in [5]. It should be noted that Goh was not trying to create an SSE system, but a secure index. The notion of a secure index is closely related to, but separate from, SSE. Goh's IND2-CKA game (indistinguishability against chosen-keyword attacks) states that, given access to a set of indexes, the adversary is not able to learn any partial information about an encrypted document that cannot be learned from possessing the trapdoor. Moreover, possessing the trapdoor only provides knowledge of whether or not a keyword occurs in the index. Goh further showed that this property holds even if the adversary can trick the client into generating trapdoors. We note that IND2-CKA does not require that trapdoors be kept secure. To obtain the IND2-CKA security guarantee, Goh modified Definition 2.5.1 in the Challenge step as follows: select two non-empty subsets V_0, V_1 ⊆ S of possibly different size and word length such that V_0 △ V_1 ≠ ∅. Chang and Mitzenmacher proposed PPSED [21], Privacy Preserving Keyword Searches on Remote Encrypted Data, a security guarantee preserved between a server S and a user U.
Definition 2.5.2 (PPSED Security [21]). For k ∈ N, let C_k denote all the communications that server S receives from user U before the k-th round of the protocol, and let C*_k = {ζ, Q_0 = ε, Q_1, ..., Q_{k−1}}, where ζ is the set of encrypted documents and each Q_j is an n-bit string, for j ∈ {1, 2, ..., k − 1}, such that for i ∈ {1, 2, ..., n} we have Q_j[i] = 1 if and only if w_j is a keyword in document m_i. We say the system is secure if for any k ∈ N, any probabilistic polynomial-time (PPT) adversary A, any δ_k = {m_1, m_2, ..., m_n, w_0 = ε, w_1, ..., w_{k−1}}, and any function h, there exists a PPT algorithm A* and a negligible function negl such that

|Pr(A(C_k) = h(δ_k)) − Pr(A*(C*_k) = h(δ_k))| ≤ negl(k).

The definition of C*_k gives us the information about the interaction from U's point of view, and C_k gives us all the information that is obtained by watching the messages exchanged between U and S. The PPSED definition, at its core, says that everything that can be computed by A from C_k can also be computed by A* from C*_k. It turns out that all SSE systems trivially satisfy PPSED [5]. Curtmola, Garay, Kamara, and Ostrovsky [5] presented two precise definitions of security for searchable encryption, which corrected all of the deficiencies present in IND-CKA, IND2-CKA, and PPSED. The deficiencies are due to the fact that no previous definition considers how the queries are issued. It was not until the work of Curtmola et al. that a rigorous and exact security definition was given for SSE. In particular, Curtmola et al. provided formal definitions of non-adaptive and adaptive indistinguishability notions and showed reductions from these indistinguishability notions to corresponding forms of semantic security. Briefly, in non-adaptive security, privacy is only guaranteed when clients generate all queries at once. In the case of adaptive security, privacy is guaranteed even if the clients generate queries as a function of previous search outcomes.
Until this work, any system that was secure was secure in the non-adaptive sense at best. To understand the definitions of these security guarantees we first provide some necessary background. We qualify the security of an SSE system by what we are willing to leak about a client's communication with the cloud. In general, we call an SSE system secure if it leaks only the search pattern and the search outcomes. The search pattern describes the queries that were issued and how they may relate (e.g., if the same keyword appears in multiple queries). Following Curtmola et al., we define three sets (the history, the view, and the trace) over an interaction between the client and the cloud. The history is the plaintext for each query and the plaintext for the documents returned as a result of issuing the query. The view consists of the encrypted documents, the index, and the trapdoors. In other words, the view consists of everything the cloud can see. The trace is the information about the structure of the interaction. Specifically, the trace consists of the lengths of the documents, the search outcomes, and the search patterns. Given these three sets we can more formally describe the security guarantees. An SSE system is secure if any function of the history that can be computed from the view can also be computed from the trace. Curtmola et al. first considered security in the case of a non-adaptive adversary. A non-adaptive adversary must make search queries without seeing the outcome of previous search queries. In other words, the adversary must issue the queries in batch form. There exists a stronger adversary, namely an adaptive adversary, who can make search queries based on the outcomes of previous search queries. To prove that a system provides security in the presence of a non-adaptive adversary, we use the notion of computational indistinguishability. The proof is approached in this way so that we can capture all potential adversaries.
To do so, we will give the cloud two indexes: one is legitimate and one is fabricated. We will also provide the cloud with the information it would see if the queries were issued over the legitimate index (called the view). The goal of the cloud is to determine which index is legitimate and which index is fabricated. We introduce a proof tool, called a simulator, to construct the fabricated index from exactly the information we are willing to leak about a set of queries. We formalize these notions using the following formal definitions of history, view, and trace [5, 4]:

Definition 2.5.3. Let D be a collection of n documents over a dictionary Δ. A history H_q is an interaction between a client and a server over q queries, denoted by H_q = (D, p_1, p_2, ..., p_q), where each p_i is a phrase. An adversary's view of H_q under secret key K is defined by

V_K(H_q) = (id(D_1), ..., id(D_n), E(D_1), ..., E(D_n), I, T_1, ..., T_q),   (2.1)

where T_1, ..., T_q are a series of trapdoors and I is an index. The trace of H_q is the following sequence:

Tr(H_q) = (id(D_1), ..., id(D_n), |D_1|, ..., |D_n|, D(w_1), ..., D(w_q), π_q),   (2.2)

where π_q is the search pattern of the user and D(w_i) denotes the set of document identifiers that contain the word w_i.

We say a system is non-adaptively secure if, for any two adversarially constructed histories with equal length and trace, no probabilistic polynomial-time adversary can distinguish the view of one from the view of the other with probability non-negligibly better than 1/2. Formally, this notion is captured by Definition 2.5.4.

Definition 2.5.4 (Non-Adaptive Semantic Security for SSE [5]). An SSE scheme is non-adaptively semantically secure if for all q ∈ N and for all (non-uniform)
probabilistic polynomial-time adversaries A, there exists a (non-uniform) probabilistic polynomial-time algorithm (called the simulator) S such that for all traces Tr_q of length q, all polynomially sampleable distributions H_q over {H_q ∈ 2^{2^Δ} × Δ^q : Tr(H_q) = Tr_q} (where Δ is the dictionary), all functions f : {0, 1}^m → {0, 1}^{l(m)} (where m = |H_q| and l(m) = poly(m)), all polynomials p, and all sufficiently large k, we have

|Pr(A(V_K(H_q)) = f(H_q)) − Pr(S(Tr(H_q)) = f(H_q))| < negl(k),

where H_q ←_R H_q, K ← Keygen(1^k), and the probabilities are taken over H_q and the internal coin tosses of Keygen, A, S, and the underlying BuildIndex algorithm.

In other words, the intuitive security notion is that if the adversary is unable to learn anything more than what can be learned from the secured index and the encrypted queries, then we say the system is secure. Definition 2.5.4 captures the idea that the system is secure if the simulator can simulate some function of the history that the adversary cannot distinguish, with more than negligible probability, provided that the simulator is given access only to the trace of the history and the adversary is given access only to a view of the history. A stronger form of security can be obtained if we allow the adversary to issue queries based on the results of previous queries. In other words, the adversary can adapt to the results produced by the system. This is the strongest notion of security known for SSE. The stronger notion is called adaptive semantic security. Here the simulator is given access only to a partial trace of the history and the adversary is given access only to a partial view of the history. The partial history, denoted by H_q^t, is composed of the first t elements of the q-length history. The partial view, denoted by V_K^t, is composed of the first t elements of the q-length view. Formally we have,

Definition 2.5.5 (Adaptive Semantic Security for SSE [5]). An SSE scheme is secure
in the sense of adaptive semantic security if for all q ∈ N and for all (non-uniform) probabilistic polynomial-time adversaries A, there exists a (non-uniform) probabilistic polynomial-time algorithm (the simulator) S such that for all traces Tr_q of length q, all polynomially sampleable distributions H_q over {H_q ∈ 2^{2^Δ} × Δ^q : Tr(H_q) = Tr_q} (where Δ is the dictionary), all functions f : {0, 1}^m → {0, 1}^{l(m)} (where m = |H_q| and l(m) = poly(m)), all 0 ≤ t ≤ q, all polynomials p, and all sufficiently large k, we have

|Pr(A(V_K^t(H_q^t)) = f(H_q^t)) − Pr(S(Tr(H_q^t)) = f(H_q^t))| < negl(k),

where H_q ←_R H_q, K ← Keygen(1^k), and the probabilities are taken over H_q and the internal coin tosses of Keygen, A, S, and the underlying BuildIndex algorithm.

Definition 2.5.5 captures the idea that the system is secure if the simulator can simulate a view of the partial history that the adversary cannot distinguish, with more than negligible probability, from the actual view. The remainder of our work will concentrate on the notion of non-adaptive semantic security for SSE with an SHBC cloud. Non-adaptive semantic security is the strongest security guarantee we have been able to achieve with a cloud in the SHBC model.

2.6 Previous Work

2.6.1 A First Solution

Song, Wagner, and Perrig [1] first studied how to search for a keyword in encrypted data. They investigated an indexed approach and a non-indexed approach. They further distinguished between hidden searches and non-hidden searches. In a hidden search the query submitted to the cloud is constructed in such a way that the cloud is unable to ascertain the meaning of the query (i.e., query privacy). In a non-hidden
search the query is known to the cloud. All the systems they investigated suffer from scalability problems, for they require polynomially many keys. Moreover, their systems are unable to handle compressed files, as pointed out in their paper. Finally, their systems (i.e., Schemes I, II, and III) may also leak the position of a keyword in the text. Note that the problem of position leakage was fixed in later systems designed by others. The systems of Song et al. rely on the following assumptions:

1. A document d consists of a sequence of words.
2. There exists a family of pseudo-random functions F_{k_i} : {0, 1}^{n−m} → {0, 1}^m, for any n and m.
3. There exists a family of pseudo-random permutations E_{k_i} : {0, 1}^n → {0, 1}^n, for any n.
4. There exists a family of pseudo-random generators G with output contained in {0, 1}^m, for any m.

Using these functions and the document collection D, they presented three schemes that provide solutions with varying security guarantees. Their schemes were built from the foundational scheme, Scheme I. This scheme consists of two main operations, Encrypt and Search. The Encrypt operation encrypts a document in such a way that at a later time the cloud can run Search and obtain an answer to a keyword query.

Encrypt: For each word w_i in the document d, generate a pseudo-random value s_i using the pseudo-random generator G. Set T_i = ⟨s_i, F_{k_i}(s_i)⟩ (note that |T_i| = |w_i|). Write C_i = w_i ⊕ T_i to the file we will upload to the cloud.
Search: Given a keyword W and the keys {k_i : 1 ≤ i ≤ |d|}, the cloud looks at every word in the document d, computes T_i = C_i ⊕ W, and parses T_i as ⟨s, v⟩. If v = F_{k_i}(s) we have found the word; otherwise we continue going through the document.

We can clearly see that we require a polynomial number of keys to implement this system: one key is needed for each word in the document. Additionally, the cloud is aware of the search terms, which is not a desirable property. We would like a system that is more efficient and more secure. Song et al. constructed a second scheme, which seeks to improve upon the inefficiencies around keys present in their first scheme. To achieve this, they add the assumption that f_k : {0, 1}* → K is a pseudo-random function that maps arbitrary binary strings to a key space K. The Encrypt operation is modified from the first scheme such that the data owner chooses k_i = f_k(w_i), where k is a secret key that is never revealed. Encryption proceeds as in the first scheme. In order for the Search operation to succeed, we must reveal to the cloud the pair (w, k_w = f_k(w)). With this information, search continues as in the first scheme, but uses k_w instead of k_i. The second scheme proposed by Song et al. reduces the number of keys that the data owner must store, but still leaks the search term to the cloud. They resolved the search-term leak in their final scheme, building on the second scheme. In particular, they modified the Encrypt operation by pre-encrypting every word w_i in the document as x_i = E_{k′}(w_i). Next, the scheme splits x_i as ⟨L_i, R_i⟩, where |L_i| = n − m and |R_i| = m. We complete the changes by constructing the key for the pseudo-random function F as k_i = f_k(L_i). The rest of the Encrypt operation remains unaltered. It is necessary to modify the Search operation by sending to the cloud a tuple (x, k_x), where x = E_{k′}(w) and k_x = f_k(L). The Search operation then continues as it would in the second scheme.
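The Encrypt and Search operations of Scheme I can be sketched as follows. This is a toy illustration under stated assumptions: truncated HMAC-SHA-256 stands in for the pseudo-random function F, os.urandom stands in for the generator G, and the word length and split sizes (N, M) are illustrative choices, not values from the original scheme.

```python
import hashlib
import hmac
import os

N, M = 16, 8   # illustrative: words are n = 16 bytes, PRF output is m = 8 bytes

def F(key, s):
    # Pseudo-random function F_k, simulated with truncated HMAC-SHA-256.
    return hmac.new(key, s, hashlib.sha256).digest()[:M]

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def encrypt_word(w, k_i):
    """Scheme I Encrypt for one word: C_i = w_i XOR T_i with T_i = <s_i, F_{k_i}(s_i)>."""
    s_i = os.urandom(N - M)        # stand-in for the pseudo-random generator G
    t_i = s_i + F(k_i, s_i)        # |T_i| = |w_i|
    return xor(w, t_i)

def search_word(c_i, W, k_i):
    """Scheme I Search: XOR in the queried word, parse as <s, v>, check the PRF."""
    t_i = xor(c_i, W)              # T_i = C_i XOR W
    s, v = t_i[:N - M], t_i[N - M:]
    return v == F(k_i, s)
```

If c = encrypt_word(w, k), then search_word(c, w, k) succeeds, while a different keyword fails except with negligible probability; note how the cloud must be given both the word and the per-word key, which is exactly the leakage the later schemes address.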
2.6.2 Early Indexed Approaches

There has been work that seeks to improve Song et al.'s results using a keyword index. A keyword index is a data structure used for searching, which associates keywords with the documents that contain them. This data structure may be an inverted index or a per-document index. Goh [20] used Bloom Filters to maintain the index of keywords associated with each file. He considered conjunctive and disjunctive queries, as well as updates of document indexes. Goh's approach, however, used a weaker model of index construction, where the index size is based on the number of distinct words in a given document and therefore leaks information. Goh considered only the HBC adversarial model under the IND2-CKA security definition. Goh's index construction is based on the following assumptions:

1. There exists a pseudo-random function f : {0, 1}^n × {0, 1}^s → {0, 1}^s, and a key set K = {k_i ∈_R {0, 1}^s : 1 ≤ i ≤ r}, where s is a security parameter and r is the number of hash functions used by the Bloom Filter.
2. Every document d, in a collection of documents D, is assigned a unique identifier id(d) ∈ {0, 1}^n.
3. The index is constructed as a result of a process involving the generation of a code-word and a trapdoor.

As part of BuildIndex, a code-word is constructed for every word in the document. A code-word set is defined as C = {y_j = f(id(d), x_j) : 1 ≤ j ≤ r}. The values x_j are the results of computing the trapdoor set T_w = {x_j = f(w, k_j) : 1 ≤ j ≤ r} for a word w. Finally, every bit location y_j is set to one in the Bloom Filter. We complete BuildIndex by blinding the Bloom Filter. To achieve this blinding, let u be an upper bound on the number of tokens in d. One may assume there is one token for
every byte. Denote by v the number of unique words. Blind the index by inserting (u − v) · r 1's uniformly at random into the Bloom Filter. To define the Search operation over Goh's index we require a trapdoor, which the Trapdooroperation constructs.

Trapdoor: Given the key set {k_i ∈_R {0, 1}^s : 1 ≤ i ≤ r} and a word w, the data owner constructs the set T_w = {x_j = f(w, k_j) : 1 ≤ j ≤ r} for word w.

Search: Given the trapdoor T_w = (x_1, x_2, ..., x_r) and the Bloom Filter (index) for document id(d), compute the code-words {y_j = f(id(d), x_j) : 1 ≤ j ≤ r}. Test if the Bloom Filter contains 1's in all r locations denoted by y_1, ..., y_r. If so, output true. Otherwise, output false.

While Goh's construction is efficient and secure under the IND2-CKA security definition, the construction suffers from the possibility of false positives. In this sense the cloud may return identifiers for documents that do not contain the search term. Though the IND2-CKA model does not explicitly require it, the trapdoors for this construction are secure. Chang and Mitzenmacher [21] devised a separate but related index-based system. Like [1], they considered keywords only. Their system uses bit maps as an index; each bit in the map represents one possible word. The bit map is later obfuscated by a pseudo-random function G and uploaded to the cloud. The pseudo-random function G serves as the trapdoor information, which is provided by the client to the cloud when a query is made. Their system is secure only under the HBC model with the PPSED security definition. The PPSED definition guarantees the security of both indexes and trapdoors. However, it does not take into
account what can be learned from how the queries are executed. Unlike Goh's construction, Chang and Mitzenmacher's construction uses a deterministic search data structure. The index is mapped one-to-one with documents in the document collection. They further require the following assumptions: there exists a keyed pseudo-random permutation P_k : {0, 1}^d → {0, 1}^d, where k ∈ {0, 1}^t; a keyed pseudo-random function F_k : {0, 1}^d → {0, 1}^t, where k ∈ {0, 1}^t; and a keyed pseudo-random function G_k : {1, ..., n} → {0, 1}, where k ∈ {0, 1}^t. Finally, the system requires that there exists a symmetric key encryption system (G, E, D). The BuildIndex algorithm works as follows: Given a security parameter t, choose s, r ∈ {0, 1}^t uniformly at random. For each file j, prepare an index string I_j of size 2^d such that if document j contains the i-th word in the dictionary, w_i, then I_j[P_s(i)] = 1; otherwise, I_j[P_s(i)] = 0. Complete the construction by computing r_i = F_r(i) for i ∈ {0, ..., 2^d}, and for each file j, compute a 2^d-bit masked index string M_j. Set each entry of M_j by M_j[i] = I_j[i] ⊕ G_{r_i}(j). Once the index is constructed, send M_j for all j to the cloud; the owner keeps s, r, and the index-word pairs. To search on this index, the owner must generate a trapdoor and send that trapdoor to the cloud so the cloud can execute the Search operation. To construct a trapdoor for word w, the owner runs Trapdoor(w) to retrieve the corresponding index λ for word w and computes T_w = (p, f), where p = P_s(λ) and f = F_r(p). The cloud performs the Search operation, using the trapdoor T_w, over each document index j in the document collection as follows: compute I_j[p] = M_j[p] ⊕ G_f(j). If I_j[p] = 1, return the document with identifier j. The problem with the above scheme is that there is no guarantee of even a non-adaptive form of security. Non-adaptive security requires security in the case that all search queries are issued at the same time. Moreover, all SSE systems satisfy the PPSED property trivially.

2.6.3 Improved SSE Constructions

Multi-user searchable encryption has also been considered. It was first considered by Curtmola, Garay, Kamara, and Ostrovsky [5]. In fact, they demonstrated a generic construction that obtains a single-group access model from any given SSE construction and a broadcast encryption system. In the same work, Curtmola et al. described an adaptively secure SSE system and a non-adaptively secure SSE system. For the non-adaptively secure system, the indexing mechanism they chose is essentially an encrypted and permuted linked list. This linked list is a list of the document set D(w). The lists are stored collectively in an array. For their non-adaptively secure system the index also needs a lookup table, indexed by word w, used to find the start of D(w) in the linked list. The index used in the adaptively secure system uses a special lookup table in which each word is stored with its corresponding document identifier. The adaptively secure index structure is significantly more expensive in storage due to this property. The non-adaptively secure system of Curtmola et al. requires several cryptographic primitives. Let k and l be security parameters. Their system needs a semantically secure encryption system, one pseudo-random function, and two pseudo-random permutations. The semantically secure encryption system (G, E, D) has an
encryption function E : {0, 1}^l × {0, 1}^r → {0, 1}^r, where r is the block size. The pseudo-random function is f : {0, 1}^k × {0, 1}^p → {0, 1}^{l + lg(m)}, where m is the total size of the plaintext document collection in bytes and p is the size of a word in bits. The two pseudo-random permutations needed are π : {0, 1}^k × {0, 1}^p → {0, 1}^p and ψ : {0, 1}^k × {0, 1}^{lg(m)} → {0, 1}^{lg(m)}. The Keygen operation generates the key as a triple of random bit strings needed in the implementation of the system: s, y, z ∈_R {0, 1}^k. To construct the index needed for searching, Curtmola et al. present the following algorithm for BuildIndex.

1. Scan the document collection D, where each document is identified by an identifier, and build a dictionary that contains all the distinct words in D. Complete this step by initializing a counter c to 1.
2. For each word w, build the document set D(w), which is the set of all documents containing w.
3. For each w_i, build an encrypted, permuted (i.e., secure) linked list containing D(w_i) and store it in array A. Select κ_{i,0} ∈_R {0, 1}^l for each i.
4. For the j-th identifier in D(w_i):
(a) Select κ_{i,j} ∈_R {0, 1}^l and create N_{i,j} = ⟨id(D_{i,j}) ‖ κ_{i,j} ‖ ψ_s(c + 1)⟩, where id(D_{i,j}) is the j-th identifier in D(w_i).
(b) Compute E_{κ_{i,j−1}}(N_{i,j}) and store it in the array A at location ψ_s(c) (i.e., A[ψ_s(c)] = E_{κ_{i,j−1}}(N_{i,j})).
(c) Increase the counter c by 1.
5. To locate the start of the lists in array A, a lookup table T is constructed by the following process for each w_i:
(a) Let v = (addr(A(N_{i,1})) ‖ κ_{i,0}) ⊕ f_y(w_i), where addr(A(N_{i,1})) denotes the address (index) of N_{i,1} in A.
(b) Set location π_z(w_i) of T to v. In other words, T[π_z(w_i)] = v.
6. Set any empty locations in T to random bit strings of the correct size.

An example construction of an index on the document set {D_1, D_3, D_5, D_6} appears in Figure 2.1.

[Figure 2.1: A secure linked list on the set {D_1, D_3, D_5, D_6}.]

Once the index is constructed, both the Trapdoor and Search operations may be run. The result of running Trapdoor on word w is the ordered pair (π_z(w), f_y(w)). To search the index for a word w, we invoke Search. The Search operation, given a trapdoor T_w = (π_z(w), f_y(w)), will locate the start of the document set for word w using π_z(w) as an index into T. Then the Search operation walks the list, using the κ values it finds at each node to decrypt the subsequent nodes' values. The resulting document identifiers are then returned to the user. For the proof of non-adaptive security of this system, see [5]. The adaptive form is significantly simpler in all algorithms, especially the BuildIndex algorithm. The Keygen operation generates a key s ∈_R {0, 1}^k. The BuildIndex algorithm works as follows:

1. Scan the document collection D and build a dictionary that contains all the distinct words in D.
2. For each word w, build the document set D(w).
3. For each w_i, set T[π_s(w_i ‖ j)] = id(D_{i,j}).
4. Set all empty entries of the lookup table T to random binary strings of the correct length.

To construct a trapdoor for this index, the Trapdoor function outputs T_w = (T_{w,1}, T_{w,2}, ..., T_{w,max}) = (π_s(w ‖ 1), π_s(w ‖ 2), ..., π_s(w ‖ max)), where max is the size, in words, of the longest plaintext document in D. To search the index, given the trapdoor T_w, Search proceeds by using each entry in T_w to look up the associated entry in T. All of those entries are collected together and returned to the user. For the proof that this system is adaptively secure, please see [5].

2.6.4 Phrase Searching

In a separate direction, Tang, Gu, Ding, and Lu [4] presented the first phrase search over encrypted data. They solved the problem by presenting a two-phase protocol to handle the search over the encrypted data. In the first phase, the cloud retrieves the document identifiers for documents that contain all the words in the phrase provided by the client, and returns the identifiers to the client. This phase relies on a global index, namely, an index shared among all documents in the cloud. In the second phase, the client sends the phrase query and a list of document identifiers to the cloud. The cloud searches for an exact phrase match in each document's per-document index (phrase table) and returns to the client the actual encrypted documents that match the phrase. Their protocol uses the HBC model for the cloud and is non-adaptively secure based on the definition provided in [5]. The most interesting part of the construction is the phrase table. This per-document table allows the cloud to determine if the phrase occurs in
a specific document without learning what the phrase is. To construct the table we assume the existence of the following three keyed pseudo-random functions:

Ψ : {0, 1}^λ × {0, 1}* → {0, 1}^n
h : {0, 1}^λ × {0, 1}* → {0, 1}^u
f : {0, 1}^λ × {0, 1}* → {0, 1}^λ

The table has dimensions w_c × (d + 1), where w_c is the number of distinct words in the document and d is the highest frequency with which any word occurs in the document collection D. The phrase-matching lookup table is constructed using the following process. First, associate a random number r_i with the i-th word in the document. Store in the first column of the lookup table the value Ψ_z(w_i ‖ id(D)). In each remaining element of the lookup table we store h_s(r_{i−1}) ‖ r_i if the word w_{i−1} precedes w_i. It is required that the key s be distinct for each pair of adjacent words (i.e., s = f_k(w_{i−1} ‖ w_i ‖ id(D))). The first word in the document is handled in a special way by computing h_s(r′) ‖ r_1, where r′ is a random number. Once all of the relationships are placed in the table, all unfilled slots are filled with random numbers of the same size as the output of h_s concatenated with a random number. Finally, permute the contents of each row of the table (starting from the second element) and sort the rows on the first element of each row. An example of the construction is given in Figure 2.2 for a document D_j. To search this phrase lookup table, the cloud uses each Ψ_z(w_i ‖ id(d)) value in the order it appears. The Ψ_z(w_i ‖ id(d)) parts of the trapdoors are constructed by the client as part of conducting the phrase search. The cloud proceeds as follows: use binary search to find the row for Ψ_z(w_i ‖ id(d)), then search that row using the key f_k(w_{i−1} ‖ w_i ‖ id(d)). If there is a word in the phrase that is not found, the cloud returns false. If all pieces of the phrase are found, the cloud returns true.
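The phrase-table construction and lookup can be sketched as follows. This is an illustrative simplification, not Tang et al.'s exact construction: HMAC stands in for all three pseudo-random functions Ψ, h, and f; the random padding, per-row permutation, row sorting, and binary search are omitted; and the chaining of recovered r values during matching is one plausible reading of the adjacency check described above.

```python
import hashlib
import hmac
import os

def prf(key, msg):
    # Stand-in for the keyed pseudo-random functions Psi, h, and f.
    return hmac.new(key, msg, hashlib.sha256).digest()[:16]

def build_phrase_table(words, doc_id, z, k):
    """Per-document phrase table sketch: rows keyed by Psi_z(w_i || id(D)),
    cells of the form h_s(r_{i-1}) || r_i with s = f_k(w_{i-1} || w_i || id(D)).
    The first word uses a sentinel b"^" as its predecessor (an assumption)."""
    r = {i: os.urandom(16) for i in range(-1, len(words))}  # r[-1] plays r'
    table = {}
    for i, w in enumerate(words):
        s = prf(k, (words[i - 1] if i > 0 else b"^") + w + doc_id)
        cell = prf(s, r[i - 1]) + r[i]                      # h_s(r_{i-1}) || r_i
        table.setdefault(prf(z, w + doc_id), []).append(cell)
    return table

def phrase_match(table, phrase, doc_id, z, k):
    """Cloud-side check: walk the phrase, chaining the recovered r values."""
    candidates = [None]          # None: first phrase word, any predecessor accepted
    for j, w in enumerate(phrase):
        row = table.get(prf(z, w + doc_id))
        if row is None:
            return False         # word absent from this document
        s = prf(k, (phrase[j - 1] if j > 0 else b"^") + w + doc_id)
        nxt = []
        for cell in row:
            tag, r_i = cell[:16], cell[16:]
            for r_prev in candidates:
                if r_prev is None or tag == prf(s, r_prev):
                    nxt.append(r_i)   # adjacency verified; carry r_i forward
                    break
        if not nxt:
            return False
        candidates = nxt
    return True
```

The key point the sketch illustrates is that the cloud only ever sees pseudo-random row keys and chained tags, so it can confirm word adjacency without learning the phrase itself.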
[Figure 2.2: An example of a phase-two table-based index for document D_j, of dimensions w_c × (d + 1), with rows keyed by Ψ_z(w_i ‖ id(D_j)) and cells of the form h_s(r_{i−1}) ‖ r_i.]

Tang et al. were able to show that their system exhibits non-adaptive security. To show this, they gave a proof methodology based on simulation. Their methodology constructs a simulator S that consists of two sub-simulators, S_1 and S_2. They show a simulator S_1 that can be used to prove non-adaptive security of phase one. They then show that they can construct a simulator S_2 that proves non-adaptive security for the second phase. Finally, they unify the two simulators under a composition argument to demonstrate that the two phases taken together are also non-adaptively secure.

2.6.5 Non-HBC Systems

There have been other attempts to move away from the HBC model. In particular, Chai and Gong [3] introduced verifiable symmetric searchable encryption, on keywords, to allow the client to verify that the cloud has returned the correct list of document identifiers. They achieved this through the use of tries [17]. Their system offers two innovations: (1) it does not use per-document indexes, and (2) it is secure under the more realistic Semi-Honest-but-Curious adversarial model. Chai and Gong's trie
based approach to managing the shared keyword index substantially reduces the complexity of previous approaches, which resorted to managing per-document indexes. Using the trie based index, Chai and Gong were able to devise a verifiable SSE scheme that is non-adaptively secure against an SHBC cloud [3]. (Chai and Gong did not prove that their scheme is non-adaptively secure; we offer the proof in Chapter 3 as part of our proof of security.)
Chapter 3

Verifiable Phrase Search

In this chapter we present an efficient method to carry out verifiable phrase search over encrypted data in the setting of Semi-Honest-but-Curious (SHBC) cloud storage. In particular, we devise a two-phase protocol that verifies the result of a search for an encrypted phrase. We achieve this by incorporating, with modifications, a verifiable keyword-search technique and a phrase-search method based on work presented at ICC 2012 and ICDCS 2012. We use added verification tags to provide proof of the query results returned from the cloud. Our solution presents a two-phase search method based on indexed search with document identifiers, which differs from the standard approach but aligns with the work of [4]. In particular, phase one is used to retrieve potential document identifiers from a global index structure. Specifically, it returns document identifiers for documents that contain all the words in a query phrase. Phase two is used to retrieve phrase matches from specific documents using a per-file index structure. In other words, phase one provides potential candidates and phase two refines the search so that the cloud only sends to the client documents that contain exact phrase matches. Moreover, each phase can be independently verified for correctness by the client.
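The two-phase flow just described can be sketched end-to-end. Everything below is a plaintext stand-in for illustration only: ToyCloud, verify_keyword_proof, and verify_phrase_proof are hypothetical placeholders for the encrypted indexes, trapdoors, and proof checks developed in this chapter, not the dissertation's actual interfaces.

```python
class ToyCloud:
    """Toy honest cloud over plaintext documents.  In the real protocol the
    queries are trapdoors over encrypted indexes and the proofs are keyed-hash
    chains and verification tags; here both are trivial stand-ins."""
    def __init__(self, docs):
        self.docs = docs                      # doc_id -> list of words
    def keyword_search(self, word):
        ids = {d for d, ws in self.docs.items() if word in ws}
        return ids, ("kw-proof", word, frozenset(ids))
    def phrase_check(self, phrase, doc_id):
        ws = self.docs[doc_id]
        n = len(phrase)
        found = any(ws[i:i + n] == list(phrase) for i in range(len(ws)))
        return found, ("ph-proof", doc_id, found)

def verify_keyword_proof(word, ids, proof):
    return proof == ("kw-proof", word, frozenset(ids))

def verify_phrase_proof(phrase, doc_id, found, proof):
    return proof == ("ph-proof", doc_id, found)

def search_phrase(phrase, cloud):
    # Phase one: one verified keyword query per word; intersect the verified
    # result sets to obtain the candidate documents.
    candidates = None
    for word in phrase:
        ids, proof = cloud.keyword_search(word)
        if not verify_keyword_proof(word, ids, proof):
            raise RuntimeError("cloud misbehaved in phase one")
        candidates = ids if candidates is None else candidates & ids
    # Phase two: verified per-document phrase check on each candidate.
    matches = set()
    for doc_id in candidates:
        found, proof = cloud.phrase_check(phrase, doc_id)
        if not verify_phrase_proof(phrase, doc_id, found, proof):
            raise RuntimeError("cloud misbehaved in phase two")
        if found:
            matches.add(doc_id)
    return matches
```

The structure mirrors the protocol: phase one narrows the collection to candidates containing every word, and phase two confirms exact, contiguous phrase matches, with each phase verified before its results are used.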
Our main contribution is the construction of verifiable encrypted phrase search in the Semi-Honest-but-Curious (SHBC) model. Our work improves the recent encrypted phrase-search mechanism of Tang et al. [4] in the following two aspects: First, we provide an encrypted phrase search that is secure against the more powerful SHBC adversary. Second, we provide the ability to verify search results.

3.1 Verifiable Encrypted Phrase Search

We present a two-phase verifiable encrypted phrase search mechanism. We achieve this by augmenting the approach of Tang et al. [4] with the verifiable encrypted keyword search of Chai and Gong [3]. We will demonstrate verifiability for the second phase of the protocol of Tang et al. [4] by augmenting it with a verification mechanism. As there are two protocol phases, we will have two search indexes: one global and the other per document.

3.1.1 Verifiable Keyword Search

Chai and Gong achieved verifiable keyword search [3] using a trie [17] over an alphabet Σ as a global index. Each node has a value r_0 that holds the symbol in Σ of the given node. Chai and Gong further augmented each node with two fields: one field, r_1, stores a globally unique value for the node, and the other field, r_2, stores a bitmap of the children of the node. We note that in the case that a node is a leaf, the bitmap is instead a list of document identifiers. Chai and Gong used the following algorithm to construct their trie. The algorithm assumes the existence of a pseudo-random function (keyed hash function) g_k : {0,1}^* → {0,1}^n, a block cipher E_K that encrypts (n + η) bits of plaintext, and a function ord which, when given a node of the trie, returns the index of the associated character in the alphabet Σ. The node at level j at the left-to-right location q is denoted by T_{j,q}.
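A minimal sketch of the node layout and the r_1 chaining follows. The keyed hash g_k is modeled with HMAC-SHA256 (an assumption; the dissertation leaves g_k abstract), the leaf handling is simplified (r_2 directly stores the identifier list rather than the $-terminated encoding), and the bootstrap value 0 for the root's chain is an illustrative choice.

```python
import hashlib
import hmac

def g(key: bytes, data: bytes) -> bytes:
    # Keyed hash g_k, modeled with HMAC-SHA256 for illustration.
    return hmac.new(key, data, hashlib.sha256).digest()

class TrieNode:
    """One node of the keyword-index trie: r0 holds the node's symbol, r1 a
    globally unique keyed-hash value, and r2 the child bitmap, or, at a
    leaf (simplified here), the list of matching document identifiers."""
    def __init__(self, symbol: str):
        self.r0 = symbol
        self.r1 = b"\x00"
        self.r2 = None
        self.children = {}                   # symbol -> TrieNode

def insert(root: TrieNode, word: str, doc_ids, key: bytes) -> None:
    """Insert a word; each node's r1 chains its symbol, its depth, and the
    parent's r1, mirroring r1 = g_k(v || j || parent(v)[r1])."""
    node, parent_r1 = root, b"\x00"          # root's chain bootstrapped to 0
    for j, ch in enumerate(word, start=1):
        node = node.children.setdefault(ch, TrieNode(ch))
        node.r1 = g(key, ch.encode() + str(j).encode() + parent_r1)
        parent_r1 = node.r1
    node.r2 = list(doc_ids)                  # leaf: matching document identifiers
```

Because r_1 values are recomputed deterministically, words with a shared prefix share the same chain along the common nodes, which is what lets a client-side query reproduce the path hashes.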
The trie is constructed by first inserting every word in the document collection D into the trie. Every internal node v at level j sets r_1 = g_k(v ‖ j ‖ parent(v)[r_1]) and r_2 = E_K(r_1 ‖ b), where b is the child bitmap. Every leaf node sets r_1 = D ‖ g_k(D), where D is the list of documents that contain the word on the path from the root to the leaf; note that this stores the un-hashed version of D. The leaf node also sets the value r_2 to g_k($ ‖ j + 1 ‖ parent($)[r_1]). Each node of the trie has its children permuted and its associated symbol, stored in r_0, removed. Finally, the trie is sent to the cloud along with the encrypted document collection.

The client generates a privacy-preserving query π for the cloud to use in searching for the keyword in the index. The query π on a word w is constructed by setting π_i = g_k(w_i ‖ i ‖ π_{i-1}) for i ≥ 1, where w_i ∈ w is the i-th letter in word w. That is, π_i is the hash of the current character, its position in the word, and its parent's hash value. We bootstrap this by setting π_0 = 0. Observe that we are essentially building a chained hash along a root-to-leaf path that exists in the nodes of the trie. This query is then sent to the cloud. The cloud will search the index and send back a list of document identifiers associated with documents that contain the keyword (the document set), the hash of the document set, and a proof that the data returned is correct and complete. The proof is the series of r_2 values in the nodes of the trie. Once the results are returned to the client, the client may verify that the cloud has behaved honestly. This is done by checking the series of r_2 values against the query and results.

3.1.2 Verified Phrase Searching

Building on the ideas summarized in Sections 2.6.4 and 3.1.1, we now construct a two-phase verified searchable phrase encryption protocol.
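The chained-hash query construction of Section 3.1.1 can be sketched as follows, again modeling g_k with HMAC-SHA256 (an illustrative assumption, since the scheme only requires g_k to be a keyed pseudo-random function):

```python
import hashlib
import hmac

def g(key: bytes, data: bytes) -> bytes:
    # The keyed hash g_k, modeled with HMAC-SHA256 for illustration.
    return hmac.new(key, data, hashlib.sha256).digest()

def make_query(word: str, key: bytes):
    """Privacy-preserving query for the trie index: a chained hash along the
    root-to-leaf path, pi_i = g_k(w_i || i || pi_{i-1}) with pi_0 = 0."""
    pi = [b"\x00"]                           # bootstrap pi_0 = 0
    for i, ch in enumerate(word, start=1):
        pi.append(g(key, ch.encode() + str(i).encode() + pi[-1]))
    return pi[1:]                            # one hash per character of the word
```

Only the client, holding k, can extend the chain, while the cloud can match each π_i against the stored per-node values without learning the underlying characters.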
In the first phase, the cloud returns the identifiers of all documents that contain every word of the phrase submitted by the client, together with a proof of correctness. In the second phase,
the per-document indexes are queried for the exact phrase, and a true or false value is returned to the client for each document, along with a proof of correctness.

Phase one of our protocol obtains the document identifiers using the verified encrypted keyword search protocol of Chai and Gong [3]. We extend their idea to deal with conjunctions of keywords so that only identifiers for documents with all the keywords present are returned to the client. We then augment the work of Tang et al., discussed in Section 2.6.4, to provide verification of phrase matches.

For the problem of conjunctive keyword matching we use the strategy of [3, 20, 21] and send multiple query vectors to the cloud. The client then takes the document identifiers returned for each query vector, verifies them, and intersects them with the other resulting sets of document identifiers. The resulting intersection tells us on which document identifiers to perform a deeper phrase search. For verification we use the individual keyword verification algorithm of [3]. We note that the search time here is proportional to the length of the phrase, while the size of the index (trie) remains fixed by the size of the dictionary.

The second phase of our protocol is to match the desired phrase in a candidate document, using the same query and search mechanism of Tang et al. [4], with an added verification tag when building the position index matrix A. The verification tags will allow us to later verify that the cloud correctly answered our queries. These verification tags should be unforgeable by the cloud. We further add a verification routine that efficiently verifies the correctness of the cloud's decision. To enable these verifications we use the keyed hash function g_k. Verification tags are added to the first column of the location index for a given document. This means that each location in the first column, a_{i,0}, will be an ordered pair.
The first element of the pair is the value Ψ_{k_2}(w_1 ‖ id(D)) as specified in the algorithm given in Section 2.6.4. The second element of the ordered pair is
the verification tag. The verification tag consists of a sequence of word identifiers, mem. The mem part of the tag tracks the identifiers of the words that directly precede the word w_i in the document. The values stored in mem are then concatenated with a hash of id(D) ‖ w_i; in other words,

g_k(id(D) ‖ w_i) ‖ mem.

The resulting message is finally encrypted with a symmetric encryption algorithm for which only the data owner possesses the key. In other words, the final verification tag is s_k(g_k(id(D) ‖ w_i) ‖ mem). For example, if the word "walker" was preceded by the words "dog", "cat", and "speed" in the document with identifier 3, the verification tag would be:

s_k(g_k(3 ‖ "walker") ‖ id("dog") ‖ id("cat") ‖ id("speed"))

The Search procedure for the second phase, described in Section 2.6.4, is modified so that it builds a message that provides proof of matching along with its list of matching document identifiers. Once the cloud has returned the proof, we must verify that the proof is correct. We do this by decrypting the tags and making sure all of the keywords have been returned by computing the appropriate hashes. We then check each word's list of predecessors and make sure it is preceded by the appropriate word.

3.1.3 Correctness

We combine the work in [4] and [3] and must consider the security proof implications. We rely on the security proof structure described in Section 2.6.4, as we use this system unmodified. This provides security over phase one of the search. We draw attention to what is revealed as part of the transactions in phase one. The cloud learns all the document identifiers, the privacy-preserving queries, and the associated encrypted documents from the document collection. We term this the view of the
adversary after the execution of phase one. We also have the notion of a trace, which contains sets of document identifiers that correspond to a certain keyword. The trace also contains the document identifiers and the access pattern of the searches. This trace is the information we are willing to leak about the history of queries. Given the above information, the trace and the view, we can combine it with the proof overview in Section 2.6.4. In practice we fit the trace into their access pattern set. From this point forward the security for phase two follows their proof. We can prove the following theorem:

Theorem 3.1.1. The construction is non-adaptively secure.

Proof. We recall the definition of non-adaptive security from Definition 2.5.4 and proceed to show non-adaptive security of the first phase of the phrase search scheme. We describe a probabilistic polynomial-time phase-one simulator S_1 such that for any q ∈ N, any probabilistic polynomial-time adversary A_1, and any distribution L_q = {H_q | Tr(H_q) = Tr_q}, where Tr_q is some size-q trace, the simulator S_1 constructs a view V′_q that A_1 cannot distinguish from a genuine view V_K(H_q), given that S_1 is provided with Tr(H_q).

For q = 0, the simulator S_1 constructs a V′ that is indistinguishable from V_K(H_0) for any H_0 ∈_R L_0. In particular, S_1 generates V′ = {1, ..., n, e_1, ..., e_n, P_1, ..., P_n, I′}, where e_i ∈_R {0,1}^{|D_i|} for all 1 ≤ i ≤ n, P_i ∈_R {0,1}^{m+η} for all 1 ≤ i ≤ n, and I′ = T′. The simulator S_1 generates a complete (|Σ| + 1)-ary trie T′ of height max_w(|w|) and fills each node with a random number from {0,1}^z, where z is the size of the bit strings in the nodes of the genuine index T. The leaf nodes are filled with random numbers from 1 to n. We claim that V′ and V_K(H_0) are indistinguishable. In other words, we must show that A_1 cannot distinguish any element in V′ from
the corresponding element in V_K(H_0). This is true simply because the document identifiers in V′ and V_K(H_0) are computationally indistinguishable. It remains to argue that the indexes I′ = T′ and I = T are indistinguishable, the encrypted documents are indistinguishable, and the proofs are indistinguishable. To see that T and T′ are indistinguishable, we note that every node in T′ is a binary string in {0,1}^z, and the nodes of T are either a hash value from the set {0,1}^z or a random binary string from {0,1}^z. In either case, the nodes are indistinguishable. In the case of the encrypted documents, we observe that since (G, E, D) is a semantically secure encryption scheme, e_i is indistinguishable from the associated encryption in V_K(H_0). We can see that the proofs are computationally indistinguishable as they are the result of a block cipher (a pseudo-random permutation).

For q > 0, the simulator S_1 constructs V′ as V′_q = {1, ..., n, e_1, ..., e_n, P_1, ..., P_n, I′, T_1, ..., T_q}, where I′ = T′ is a complete (|Σ| + 1)-ary trie. Each value in the trie is drawn randomly from {0,1}^z. The trapdoors T_i are constructed as root-to-leaf paths through the trie T′. They are constructed in such a way that they have the correct length: briefly, to get the correct length for word w_i, the simulator inspects the search-pattern entries E_{j,i}; if E_{j,i} is the zero matrix, then it sets |w_i| = j − 1. The trapdoors T_i are also constructed in such a way that they lead to a leaf node that contains the appropriate document set. The encrypted documents, the document identifiers, and the proofs of correctness are still indistinguishable for the same reasons as in the case q = 0. The index is likewise indistinguishable. The trapdoors are indistinguishable, as they consist of a sequence of random values that are indistinguishable from the outputs of the function used to create genuine trapdoors. Finally, we note that they are indistinguishable in length as well.
This is because the search pattern of the trace provides the simulator with sufficient
information to construct a trapdoor of the correct size. We have thus described a polynomial-time simulator S_1 that can create, for any polynomial-time adversary A_1, a view indistinguishable from a genuine view. This completes the proof of non-adaptive security for phase one of the protocol.

The results of phase one are then used to retrieve the refined results for the queries. We define the trace, history, and view for phase two slightly differently. We start by declaring the history for phase two to consist of the collection of q phrase queries and the related document sets. We capture the search pattern that is evident from a trace of the interaction of phase two by defining a matrix T of size q × k, where k is the maximum length of a phrase. Each entry in matrix T is itself a matrix, represented by a matrix E of size q × k: if the (u, v) entry of E is one, the j-th word of phrase i is the same as the v-th word of phrase u. The trace of the second phase contains only the document sets and queries pertaining to the second phase.

We describe a probabilistic polynomial-time simulator S_2 such that for all q ∈ N, all probabilistic polynomial-time adversaries A_2, and all distributions L_q = {H_q | Tr(H_q) = Tr_q}, where Tr_q is some size-q trace, the simulator S_2 can construct a view V′_q that A_2 cannot distinguish from a genuine view V_K(H_q), given that S_2 is provided with Tr(H_q).

For q = 0, the simulator S_2 constructs a V′ that is indistinguishable from V_K(H_0) for any H_0 ∈_R L_0. In particular, S_2 generates V′ = {1, ..., n, e_1, ..., e_n, P_1, ..., P_n, I′}, where e_i ∈_R {0,1}^{|D_i|} for all 1 ≤ i ≤ n, P_i ∈_R {0,1}^{m+η} for all 1 ≤ i ≤ n, and I′ = A′. We construct the per-document table collection A′. For each i and j, in
every table, set entry a_{i,j} to a randomly chosen bit string of the correct length. By a hybrid argument, the entries of all of the per-document tables are computationally indistinguishable, as every entry in a real table is the result of a pseudo-random function concatenated with a value chosen uniformly at random. This implies that I′ is computationally indistinguishable from I. In the case of the encrypted documents, we observe that since (G, E, D) is a semantically secure encryption scheme, e_i is indistinguishable from the associated encryption in V_K(H_0). We can see that the proofs are computationally indistinguishable as they are the result of a block cipher (a pseudo-random permutation).

For q > 0, the simulator S_2 constructs V′ as follows: V′ = {1, ..., n, e_1, ..., e_n, P_1, ..., P_n, I′, T_1, ..., T_q}. From the trace, S_2 is able to determine the sets of documents corresponding to each phrase query T_i. These document collections and associated queries are used to appropriately construct each per-document index in A′. Using the search pattern from the trace together with the document collection and trapdoors, we can construct the correct entries for each entry of the matrix for which we have seen a query. Any other entries that we have not seen, but that may be valid, are simply replaced with random numbers. Because each entry in the matrix is the result of a pseudo-random function, the output is indistinguishable from true randomness. Since both S_1 and S_2 can create indistinguishable views, even under combination, the system is non-adaptively secure.

It remains to argue the correctness and unforgeability of the proofs of phrase-match decisions. The verification algorithm, PhraseVerify, walks over the proof and verifies that the decision made at each word w_i of the phrase w is correct. By the assumption of our model, we know that if the cloud returns results, it must also return
the associated proofs. Therefore, simply verifying the tags allows us to determine that the decision of the cloud is correct. The verification algorithm further relies on an unforgeability property of the tags. We gain unforgeability from the fact that the output of the pseudo-random function is indistinguishable from truly random output, which means that without the key, the cloud cannot forge a verification tag. As a final concern, we note that our verification tags cannot be fabricated by replaying a valid verification tag from another document index. This is because our verification tags are bound to the document identifier and the current word.

3.2 Conclusion

In this chapter we demonstrated a verifiable encrypted phrase search protocol for the SHBC model, which extends the encrypted phrase search protocol of Tang et al. [4]. Our protocol consists of two verifiable phases, each relying on verification proofs. The first phase uses the work of [3] and the second phase uses the method presented in this dissertation using verification tags. Because both phases are verifiable, we can detect whether the cloud has lied about the results it returns. Thus, the resulting protocol is secure in the presence of an SHBC cloud adversary.
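To make the verification tags of Section 3.1.2 concrete, the following sketch builds and checks a tag of the form s_k(g_k(id(D) ‖ w_i) ‖ mem). The keyed hash g_k is modeled with truncated HMAC-SHA256 and the cipher s_k with a simple SHA-256 keystream; both are illustrative stand-ins rather than the scheme's actual primitives, and the 4-byte word-identifier encoding is an assumption.

```python
import hashlib
import hmac

def g(key: bytes, data: bytes) -> bytes:
    # Keyed hash g_k, modeled with truncated HMAC-SHA256.
    return hmac.new(key, data, hashlib.sha256).digest()[:16]

def encrypt(key: bytes, data: bytes) -> bytes:
    # Stand-in for the symmetric cipher s_k: a SHA-256 counter-mode keystream.
    # A real deployment would use a standard authenticated cipher instead.
    out, ctr = b"", 0
    while len(out) < len(data):
        out += hashlib.sha256(key + ctr.to_bytes(4, "big")).digest()
        ctr += 1
    return bytes(a ^ b for a, b in zip(data, out))

decrypt = encrypt   # XOR keystream: encryption and decryption coincide

def make_tag(enc_key, hash_key, doc_id: int, word: str, predecessor_ids):
    """Verification tag s_k(g_k(id(D) || w_i) || mem), where mem lists the
    identifiers of words that directly precede w_i in the document."""
    mem = b"".join(p.to_bytes(4, "big") for p in predecessor_ids)
    return encrypt(enc_key, g(hash_key, str(doc_id).encode() + word.encode()) + mem)

def verify_tag(enc_key, hash_key, doc_id: int, word: str, tag):
    """Client-side check: decrypt, confirm the keyed hash, recover mem."""
    plain = decrypt(enc_key, tag)
    expect = g(hash_key, str(doc_id).encode() + word.encode())
    if plain[:16] != expect:
        return None                     # forged or replayed tag
    mem = plain[16:]
    return [int.from_bytes(mem[i:i + 4], "big") for i in range(0, len(mem), 4)]
```

Note how the keyed hash binds the tag to both the document identifier and the word, so replaying a tag under a different document identifier fails verification.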
Chapter 4

Verifiable Phrase Search in a Single Phase

In this chapter we investigate the problem of finding an efficient mechanism to search for phrases in a collection of encrypted documents. We present an efficient single-phase Searchable Phrase Encryption system that minimizes computation at the client side. We then show that the scheme is verifiably secure under the SHBC model. Finally, we show that our system is non-adaptively secure in the presence of the SHBC adversary.

Our main contribution, in this chapter, is constructing an efficient phrase search encryption system that performs searches in one phase with minimal computational burden on the user (the client). Our work improves on the searchable phrase encryption system of Tang, Gu, Ding, and Lu [4]. Their system operates in two distinct phases. In the first phase, the client submits an encrypted n-word phrase p = (w_1, w_2, ..., w_n) to the cloud. The cloud then returns, using a global index, the set D(w_i) of all the documents that contain the word w_i; there is one set returned for each i. The client takes these results and computes ∩_{i=1}^n D(w_i) to determine the documents that contain all the words in the phrase. In phase two, the
encrypted phrase p and the set ∩_{i=1}^n D(w_i) are submitted to the cloud. The cloud then checks the per-document index to determine whether p occurs in some d ∈ ∩_{i=1}^n D(w_i). The cloud collects all the documents that satisfy the constraint and returns them to the client.

As stated previously, our system reduces the number of phases to a single phase. We achieve this through the use of a single global index. Because of this reduction in phases, we are also able to reduce the work that is performed by the client. Specifically, the client no longer needs to compute ∩_{i=1}^n D(w_i). This reduction allows clients with less computational power to reap the benefits of encrypted phrase search. We achieve our security guarantees under the SHBC model, thus carrying our improvements over to the verifiable system of Kissel and Wang [22].

This chapter is organized as follows: We provide in Section 4.2 background information needed for constructing our one-phase system, which is presented in Section 4.3. We extend our system with verification in Section 4.4 and conclude the chapter in Section 4.5.

4.1 Notations

4.1.1 Notations

In addition to the notation presented in Chapter 2, we denote by D(p_i) the set of identifiers of documents that contain the phrase p_i, and by A[i] the i-th entry in an array A. We further denote by addr(A, x) the index of element x in array A.

4.2 Background

4.2.1 Background on Next-Word Indexing

Our phrase indexing strategy relies on the use of a data structure called a next-word index [23], which is an inverted index structure consisting of three components:
1. a vocabulary list of every word w_i in the dictionary;

2. a set of next-word lists, one per word w_i, consisting of each word w_j that directly follows w_i in some document in D;

3. a set of postings lists consisting of, for each pair of words (w_i, w_j): the number of documents that contain the pair, and, for each such document, its identifier, the number of occurrences, and the locations of the pair in that document (see Fig. 4.1).

Dictionary   Next Word   Postings List
w_1          w_2         1, (<9, 1, [4]>)
w_2          w_4         1, (<3, 2, [3, 7]>)
w_3          w_1         2, (<3, 2, [4, 8]>, <5, 3, [5, 7, 11]>)
w_4          w_3         1, (<3, 2, [4, 8]>)
             w_2         2, (<3, 2, [5, 9]>, <5, 3, [6, 9, 12]>)

Figure 4.1: An example next-word index.

For example, a postings-list entry of 1, (<9, 1, [4]>) means that the word pair occurs once in the entire document collection D, and in particular it occurs in document 9 at word 4.

4.2.2 Secure Linked Lists

Our results rely on the existence of the secure linked list structure [5]. This structure compacts several linked lists into one common array. A secure linked list prevents two encrypted members (also called encrypted nodes) of the same list from appearing next to one another in the array. Since each node in a secure linked list
is encrypted, one must decrypt a node to determine the location of the next node in the list. To construct a collection of secure linked lists we need an array A of size n · max_{i=1}^n L_i, where n is the number of lists and L_i is the length of linked list i. We further need a semantically secure symmetric encryption algorithm (G, E, D) and a pseudo-random permutation φ on the set of binary strings {0,1}^{log(n · max_{i=1}^n L_i)}. Given a global counter c, with initial value 1, the j-th node in list i, carrying data d, is inserted at location A[φ(c)] with the value E_{κ_{i,j-1}}(d ‖ κ_{i,j} ‖ φ(c + 1)). Each κ_{i,j} is the result of the key generation algorithm G. After the node is inserted, the counter c is increased by 1. To maintain the locations of the heads of the linked lists, we keep the pointers to the head nodes and the associated keys κ_{i,0} in a separate array. Once the set of linked lists is placed in array A, the remaining n · max_{i=1}^n L_i − Σ_{i=1}^n L_i entries are filled with random strings drawn from the set of binary strings of the correct length.

4.3 Basic Construction

4.3.1 Constructing an Encrypted Next-Word Index

In this section we introduce an encrypted next-word index. A next-word index consists of three components: a dictionary list, a set of next-word lists, and a set of postings lists. Our construction assumes the existence of the following three pseudo-random permutations:

ψ : {0,1}^k × {0,1}^p → {0,1}^p,
φ : {0,1}^k × {0,1}^{lg(m)} → {0,1}^{lg(m)},
σ : {0,1}^k × {0,1}^{log(m · max_ι |ρ_ι|)} → {0,1}^{log(m · max_ι |ρ_ι|)},
where m is the length of the longest next-word list for the document collection D, and ρ_ι denotes the ι-th postings list. We further assume the existence of a semantically secure symmetric encryption algorithm (G, E, D). Finally, we need the following two pseudo-random functions:

f : {0,1}^k × {0,1}^p → {0,1}^{k+log(m)},
ζ : {0,1}^k × {0,1}^p → {0,1}^{lg |∆|},

where ∆ denotes the dictionary. Using the next-word index and the secure linked list structure described in Section 4.2, we construct a secure index for phrase search using the following steps.

First, we create an array A of size |∆|, which will contain pointers to all the head elements of the next-word lists. For each word w_i that appears in D, we use the key generation algorithm G to generate a secret key κ_{i,0}. We store in A[ζ_x(w_i)] the value (κ_{i,0} ‖ φ_ω(s_i)) ⊕ f_y(w_i), where φ_ω(s_i) denotes the location of the start of the next-word list for word w_i, and κ_{i,0} is the encryption key used to encrypt the head node of that next-word list. Once completed, set each empty location in A to a random value drawn from {0,1}^{k+log(m)}.

Next, we create an array N of size m to store all of the next-word lists as encrypted linked lists. To populate N we initialize a counter c to 1 and insert each next-word list into N. Let n_{i,j} denote the j-th entry in next-word list i. We place at location N[φ_ω(c)] the value E_{κ_{i,j-1}}(n_{i,j}). The value n_{i,j} is defined to be

ψ_z(w_{i,j}) ‖ s_{i,0} ‖ σ_λ(t) ‖ κ_{i,j} ‖ φ_ω(c + 1),

where w_{i,j} is the j-th word in the next-word list for word w_i, κ_{i,j} and s_{i,0} are keys generated by the key generation algorithm G, and σ_λ(t) is the index of the start of the associated
postings list. After the insertion of each new entry into N, the counter c is increased by 1. If an element is the last element in its list, a special value falling outside the array dimensions takes the place of φ_ω(c + 1). Fig. 4.2 depicts an example construction of A and N on a dictionary ∆ = {w_1, w_2, w_3}.

Figure 4.2: Example arrays A and N for ∆ = {w_1, w_2, w_3}. Array A holds the entries (κ_{i,0} ‖ φ(s_i)) ⊕ f_y(w_i) for i = 1, 2, 3, and array N holds the encrypted next-word nodes E_{κ_{i,j-1}}(n_{i,j}) in permuted order; the arcs represent a logical connection.

The last part of the next-word index is the postings lists. We create an array P of all postings lists using the idea of the encrypted linked list. To archive all the postings lists in one array we track their positions using a counter t, initially set to 1. Each node in a postings list contains the identifier of a document d containing the word pair, the location l of the start of the pair in d, a new key s_{i,j} to decrypt the next node, and the location of the next node in the list. The contents of the node are

id(d) ‖ l ‖ s_{i,j} ‖ σ_λ(t + 1).

To pack these nodes into the array P we simply assign to P[σ_λ(t)] the value E_{s_{i,j-1}}(id(d) ‖ l ‖ s_{i,j} ‖ σ_λ(t + 1)).

4.3.2 An SSE Construction

Following the basic model for SSE due to Curtmola et al. [5], we define the following four probabilistic polynomial-time algorithms: Keygen, BuildIndex, Trapdoor, and Search.
Keygen is responsible for generating keys z, y, x, ω, and λ uniformly at random from the key space {0,1}^k.

BuildIndex builds an encrypted next-word index I = (A, N, P) using the process described in Section 4.3.1.

Trapdoor computes the trapdoor T_p, given the keys and a phrase p = (w_1, w_2, ..., w_n), as follows:

T_p = {(ζ_x(w_1), f_y(w_1), ψ_z(w_1)), (ζ_x(w_2), f_y(w_2), ψ_z(w_2)), ..., (ζ_x(w_n), f_y(w_n), ψ_z(w_n))}.

Search is executed by the cloud: given the trapdoor T_p, the cloud looks up the phrase using the index I = (A, N, P). The tuples that make up the trapdoor are paired such that tuple i is paired with tuple i − 1 for all i ≥ 2. Given a pair ((ζ_x(w_i), f_y(w_i), ψ_z(w_i)), (ζ_x(w_{i+1}), f_y(w_{i+1}), ψ_z(w_{i+1}))), the cloud begins by locating ζ_x(w_i) in A and uses f_y(w_i) to get to the start of the associated next-word list in N. The cloud walks the next-word list, performs the appropriate decryptions, and compares the word field against ψ_z(w_{i+1}). If an entry e is found, the cloud reads the appropriate postings list in P. If we are only looking for a two-word phrase, then we simply return the associated documents. If the phrase is longer than two words, we use ζ_x(w_{i+1}) to access the appropriate head-of-list pointer in A. For phrases longer than two words, we handle the postings lists differently, as we are looking for starting location numbers that are consecutive among the pairs of words within the same document.
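The secure linked list of Section 4.2.2, which underlies BuildIndex and Search, can be sketched as follows. This is a simplified model: the pseudo-random permutation φ is modeled by a random shuffle of slot indices, the semantically secure cipher (G, E, D) by a SHA-256 keystream, and the 8-byte data field and 16-byte keys are illustrative choices.

```python
import hashlib
import os
import random

def keystream_encrypt(key: bytes, data: bytes) -> bytes:
    # Stand-in for the cipher (G, E, D): a SHA-256 counter-mode keystream,
    # so encryption and decryption are the same XOR operation.
    out, ctr = b"", 0
    while len(out) < len(data):
        out += hashlib.sha256(key + ctr.to_bytes(4, "big")).digest()
        ctr += 1
    return bytes(a ^ b for a, b in zip(data, out))

def build_secure_lists(lists, total_slots):
    """Pack several linked lists into one array A: node j of list i is stored
    at A[phi(c)] as E_{k_{i,j-1}}(data || k_{i,j} || phi(c+1))."""
    assert total_slots >= sum(len(l) for l in lists)
    phi = list(range(total_slots))
    random.shuffle(phi)                      # models the pseudo-random permutation
    A = [None] * total_slots
    heads = []                               # (head slot, head key) per list
    c = 0
    for items in lists:
        key = os.urandom(16)
        heads.append((phi[c], key))
        for j, data in enumerate(items):
            next_key = os.urandom(16)
            last = j == len(items) - 1
            nxt = total_slots if last else phi[c + 1]   # out-of-range marks the tail
            node = data.ljust(8, b"\x00") + next_key + nxt.to_bytes(4, "big")
            A[phi[c]] = keystream_encrypt(key, node)
            key = next_key
            c += 1
    for i in range(total_slots):             # fill unused slots with randomness
        if A[i] is None:
            A[i] = os.urandom(28)
    return A, heads

def walk(A, head_slot, head_key, total_slots):
    """Decrypt along one list: each node reveals the key and slot of the next."""
    out, slot, key = [], head_slot, head_key
    while slot < total_slots:
        plain = keystream_encrypt(key, A[slot])   # XOR keystream: E == D
        out.append(plain[:8].rstrip(b"\x00"))
        key, slot = plain[8:24], int.from_bytes(plain[24:28], "big")
    return out
```

Without the head pointer and key, the packed array is a jumble of ciphertexts and random filler; with them, the walk recovers one list while revealing nothing about the others, which is the property the encrypted next-word and postings arrays rely on.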
4.3.3 Security and Efficiency

We first show that our system is non-adaptively secure. We then discuss the efficiency of our system.

Non-adaptive security. As mentioned in Section 2.5, to prove that our system provides security in the presence of a non-adaptive adversary, we must use the notion of computational indistinguishability. We approach our proof in this way so that we can capture all potential adversaries. To do so, we give the cloud two indexes: one legitimate and one fabricated. We also provide the cloud with the information it would see if the queries were issued over the legitimate index (called the view). The goal of the cloud is to determine which index is legitimate and which is fabricated. We introduce a proof tool, called a simulator, to construct the fabricated index from exactly the information we are willing to leak about a set of queries. We formalize these notions using the definitions of history, view, and trace from Definition 2.5.3.

For our index we define the search pattern by creating a matrix of size q × k, where every phrase is of length at most k. Each entry M_{i,j} in the matrix is a q × k binary matrix E constructed such that E_{u,v} = 1 if the j-th word of phrase i is the same as the v-th word of phrase u.

Theorem 4.3.1. The basic construction for the one-round encrypted phrase search is non-adaptively secure.

Proof. We describe a probabilistic polynomial-time simulator S such that for any q ∈ N, any probabilistic polynomial-time adversary A, and any distribution L_q = {H_q | Tr(H_q) = Tr_q}, where Tr_q is some size-q trace, the simulator S can construct a view V′_q such that A
cannot distinguish V′_q from a genuine view V_K(H_q), provided that S is given Tr(H_q).

For q = 0, the simulator S constructs a V′_0 that is indistinguishable from V_K(H_0) for any H_0 ∈_R L_0. In particular, S generates V′_0 = {1, ..., n, e_1, ..., e_n, I′}, where e_i ∈_R {0,1}^{|D_i|} for all 1 ≤ i ≤ n and I′ = (A′, N′, P′) is an appropriately generated index.

Generating A′: S allocates an array A′ of size |∆| and fills each entry with a random number from {0,1}^{k+lg(m)}.

Generating N′: For each of the m entries in N′, insert a random string drawn from {0,1}^r, where r denotes the output size of E.

Generating P′: For each of the m · max_ι |ρ_ι| entries in P′, select a random string from {0,1}^{lg(m · max_ι |ρ_ι|)}.

We claim that V′ and V_K(H_0) are indistinguishable. By a standard hybrid argument, we must show that A cannot distinguish any element in V′ from the corresponding element in V_K(H_0). This is true simply because the document identifiers in V′ and V_K(H_0) are computationally indistinguishable. It remains to argue that the genuine index I and the simulated index I′ are computationally indistinguishable, and that the encrypted documents are computationally indistinguishable. To see that I = (A, N, P) is computationally indistinguishable from I′ = (A′, N′, P′), we observe that A′ consists of random strings from {0,1}^{k+lg(m)} and A consists of
pseudo-random strings from {0,1}^{k+lg(m)} exclusive-ORed with the output of the pseudo-random function f. Therefore, A and A′ are computationally indistinguishable; otherwise f would not be a pseudo-random function. Since N′ has entries of the same size as the entries in N, and the entries in N are the result of a semantically secure encryption scheme, we can conclude that N and N′ are computationally indistinguishable. Likewise for P: every entry in P′ is the same size as an entry in P, and the entries of P are produced by a semantically secure encryption scheme. Therefore, the entries of P and P′ are computationally indistinguishable. In the case of the encrypted documents, we observe that since (G, E, D) is a semantically secure encryption scheme, e_i must be computationally indistinguishable from the associated encryption in V_K(H_0).

For q > 0, the simulator S constructs V′ as V′_q = {1, ..., n, e_1, ..., e_n, I′, T_1, ..., T_q}, where I′ = (A′, N′, P′). S begins by generating a q × k matrix M′ constructed from M. Each entry of M′ is a triple from {0,1}^{lg |∆|} × {0,1}^{k+lg(m)} × {0,1}^p. Every unfilled entry m′_{i,j} of M′ is filled by selecting a random triple (ẑ_{i,j}, f̂_{i,j}, p̂_{i,j}). For every value in E (recall that E is the element at M_{i,j}), set entry m′_{u,v} to (ẑ, f̂, p̂) if E_{u,v} = 1. At the completion of this process, S will have built all the trapdoors T_1, T_2, ..., T_q. The arrays A′, N′, and P′ are generated as follows.

Generating A′: S allocates an array A′ of size |∆|. For each entry in M′, set each location A′[ẑ_{i,j}] = (κ_{i,0} ‖ s_i) ⊕ f̂_{i,j}.
The value κ_{i,0} ∈R {0,1}^k and s_i ∈R {0,1}^lg(m) \ L. We then add s_i to L; the set L is used to maintain the starting locations of the next-word lists in N′. Fill all unfilled entries of A′ with random strings from {0,1}^(k+lg(m)).

Generating N′: S allocates an array N′ of size m. For each unique ẑ_{i,j} ∈ M′, collect the triples of next-words into a set n_l and:

1. Determine the associated s_i and κ_{i,0} from A′[ẑ_{i,j}].

2. Insert the first pair that occurs in n_l at location s_i. This first entry is given by N′[s_i] = E_{κ_{i,0}}(p̂_1 || s_{i,0} || t_i || κ_{i,j} || c′), where p̂_1 is the p̂ value in the first entry of n_l, s_{i,0} ∈R {0,1}^k, t_i is an empty location in P′ (add this to a special set L′), and c′ is chosen randomly from {0,1}^lg(m) \ L (c′ is then added to L).

3. Repeat the following for the remaining entries in n_l: using the previous value of c′ and κ_{i,j}, set N′[c′] = E_{κ_{i,j}}(p̂_r || s_{i,0} || κ_{i,j+1} || c′′), where p̂_r denotes the p̂ value from the r-th entry in n_l and c′′ ∈R {0,1}^lg(m) \ L. After the entry has been added, add c′′ to L and set c′ to c′′.

The entire process is then repeated for the next unique word. After S completes the above process, fill all empty locations in N′ with random binary strings of the correct length.
Generating P′: S creates an array of size m · max_ι{|ρ_ι|}. To populate the entries of P′, S makes use of A′, N′, and M′ together with D(p_1), D(p_2), ..., D(p_q). For each phrase p_i, look up the tuple (ẑ_{i,1}, f̂_{i,1}, p̂_{i,1}) in m′_{i,1}. Using ẑ_{i,1} and f̂_{i,1}, the array A′ can be used to locate the start of the next-word list. The next-word list in N′ is then decoded as in any invocation of the Search algorithm. The entry being sought is p̂_{i,2}. Once this entry is found, use the t_i value as the starting location for the postings list. Proceed to use s_{i,0} to fill the entry P′[t_i] = E_{s_{i,0}}(d || d_ctr + 1 || s_{i,2} || t′), where d_ctr denotes the current position counter for document d ∈ D (this counter is global and initially set to 0), s_{i,2} ∈R {0,1}^k, and t′ ∈R {0,1}^lg(m · max_ι{|ρ_ι|}) \ L′. S subsequently adds t′ to L′ and sets t_i to t′ to prepare for the next insertion. As in the construction of P by the algorithm BuildIndex, each subsequent entry j in the list is encrypted under key s_{i,j−1}. A postings-list entry is added for every matching document d ∈ D before proceeding to m′_{i,2}. Once all the words in phrase p_i have been added, S proceeds to phrase p_{i+1}, repeating the process for all words in p_{i+1}. For words that occur in multiple phrases, the process navigates to the end of the encrypted list before adding any additional elements. Lastly, fill any empty locations in P′ with random binary strings of the correct length.

We assert that for q > 0, S constructs a computationally indistinguishable view V′_q. The simulated trapdoors are computationally indistinguishable from the genuine trapdoors, as they are constructed from random binary strings drawn from the same sets as the outputs of the respective pseudo-random functions and permutations that make up the trapdoor tuples. The index I′ = (A′, N′, P′) is computationally indistinguishable from I, as all q trapdoor results are mimicked by I′ in a way that produces D(p_1), D(p_2), ..., D(p_q). Moreover, all the entries contained in the arrays A′, N′, and P′ are random binary strings of the same length as the corresponding entries of A, N, and P. Therefore, I′ is computationally indistinguishable from I; otherwise the underlying primitives would not be a semantically secure encryption scheme, a pseudo-random function, and a pseudo-random permutation, respectively. For the same reasons as in the case q = 0, the values e′_1, ..., e′_n are computationally indistinguishable from their related encrypted documents. Since all the parts of V′_q are computationally indistinguishable, by a hybrid argument V′_q is computationally indistinguishable from V_K(H_q). We have thus described a polynomial-time simulator S that can create a view indistinguishable from a genuine view for any polynomial-time adversary A. This completes the proof.

Efficiency

We inspect both storage efficiency and search query efficiency. The index I = (A, N, P) requires O(m + |P|) extra space over the space required to store the document collection D. The time to search for a phrase p in the system is calculated as follows. Each look-up in the array A takes constant time, because ζ_x gives us the index for direct access, and we make |p| − 1 such look-ups. In the worst case we must traverse m entries in the next-word lists stored in array N. Finally, we must traverse every element in each associated postings list; in the worst case, all lists are of maximal length. Thus, the time complexity of our system is bounded by O(m · (|p| − 1) · max_{i=1..q}{|ρ_i|}), where q is the total number of postings lists and |ρ_i| denotes the length of the i-th postings list.
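To make the three-array layout concrete, the following plaintext sketch (our own illustration; the cryptographic layers of the construction, namely the PRF-masked addresses, encrypted entries, and random padding, are stripped away, and all function and variable names are ours) shows the structure of A, N, and P and the traversal that drives the search-cost analysis above:

```python
# A: word -> head of its next-word list in N
# N: linked next-word nodes (next_word, postings slot in P, previous node)
# P: postings lists of (document id, position) pairs

def build_index(docs):
    A, N, P = {}, [], []
    slots = {}                      # (word, next_word) -> slot in P
    for doc_id, words in enumerate(docs):
        for pos in range(len(words) - 1):
            w, nxt = words[pos], words[pos + 1]
            if (w, nxt) not in slots:
                slots[(w, nxt)] = len(P)
                P.append([])
                N.append((nxt, slots[(w, nxt)], A.get(w)))  # prepend node
                A[w] = len(N) - 1
            P[slots[(w, nxt)]].append((doc_id, pos))
    return A, N, P

def search_phrase(index, phrase):
    """Return ids of documents containing the phrase (length >= 2)."""
    A, N, P = index
    result = None                   # set of (doc, phrase start position)
    for i in range(len(phrase) - 1):
        w, nxt = phrase[i], phrase[i + 1]
        node = A.get(w)             # constant-time head look-up
        while node is not None and N[node][0] != nxt:
            node = N[node][2]       # walk next-word list: at most m entries
        if node is None:
            return set()
        hits = {(d, p - i) for d, p in P[N[node][1]]}
        result = hits if result is None else result & hits
    return {d for d, _ in result} if result else set()
```

The nested traversal (|p| − 1 head look-ups, up to m next-word nodes each, then full postings lists) mirrors the O(m · (|p| − 1) · max|ρ_i|) bound.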
4.4 Adding Verification

We extend the construction in Section 4.3 to include verification of search results. By adding verification we can place our system in the SHBC model. To do this we must modify BuildIndex by augmenting our index with verification tags, or certificates. We also extend the model of SSE to contain a polynomial-time function Verify that checks that the cloud has returned the results correctly. We add the certificates to the array A, converting A from storing a single value (κ_{i,0} || φ_ω(s_i)) ⊕ f_y(w_i) to a pair whose second element is a verification tag v_i. Given an index function idx into {1, 2, ...}, we define a verification tag as the word in question, followed by a sequence of at most m next-word identifiers, followed by at most m · max_{i=1..q}{|ρ_i|} postings-list entries. These sequences are encrypted using a semantically secure encryption scheme with a key that must be drawn during the Keygen algorithm. To improve the storage efficiency of the verification tag, one can use methods for encoding the postings list as described in [24]. The Search algorithm is modified to return not only the set of documents that match the phrase query, but the set of verification tags as well. The Verify algorithm analyzes the certificates returned from the cloud. It verifies the results by checking all the next-word lists against the postings lists. If the Verify algorithm detects an inconsistency (either the phrase does not occur in a returned document, or a document that should have been returned was not), Verify returns ⊥.

4.4.1 Discussion of Security Guarantees

The main security concern with the verification tags is that a verification tag must not be forgeable. In order for the cloud to successfully forge a verification tag, it would need to be able to recover the encryption key used for the tags, since the
cloud cannot forge a valid encryption. Moreover, the cloud cannot substitute a different verification tag from the array A: the word and its associated next words are included in the verification tag, so the verification tags are bound to specific entries in A.

In a straightforward manner, one can extend the proof of Theorem 4.3.1 to handle verification tags. To do this, one must augment the trace, both genuine and simulator-constructed, to contain the tags. To see that this does not change the proof above, we note that the verification tags themselves are the result of a semantically secure symmetric encryption scheme and thus are indistinguishable from true randomness. The presence of the verification tags does not leak any additional information: since all the verification tags are the same size, any length-preserving semantically secure encryption algorithm will leak neither the relative sizes of the next-word lists nor those of the postings lists.

4.5 Conclusion

In this chapter we presented an encrypted phrase search scheme that operates in a single phase with minimal client computation, thus improving on the previously known results of [4]. We further improved our results by providing a verifiable form of our system under the SHBC model of Chai and Gong [3]. Finally, we argued the correctness and efficiency of our constructions.
Chapter 5

Hierarchical Access Control

In this chapter, we present an efficient method for providing group-level hierarchical access control over keywords in a multi-user searchable encryption scheme under the SHBC model. We achieve this using a shared global inverted index stored on the cloud and efficient key-regression techniques on the client side. Our method extends the multi-user searchable encryption model of Curtmola et al. [5] to multiple groups of users. Moreover, our method provides verifiability of search results. We further show that our system is non-adaptively secure.

Our main contribution is the construction of an efficient multi-user searchable encryption system with group-level hierarchical access control. We further demonstrate how to make group membership dynamic while still preserving security. We achieve our security guarantees under the SHBC model [3] (which includes the HBC model), where the user needs to verify that the search results are accurate and complete. To meet this requirement we demonstrate a verification method for our system. Moreover, our system is both space and time efficient for the client as well as the cloud provider, and we show that our system is non-adaptively secure. Our work extends existing searchable encryption work for one group of users [5] to multiple groups of users.
5.1 Model

Curtmola et al. [5] defined a multi-user SSE (M-SSE) system as follows:

Definition 5.1.1 (Multi-User Searchable Symmetric Encryption (M-SSE)). M-SSE is a collection of six polynomial-time algorithms MKeygen, MBuildIndex, AddUser, RevokeUser, MTrapdoor, and MSearch, where

MKeygen(1^k) is a probabilistic key generation algorithm run by the owner O. It takes a security parameter k as input and returns an owner secret key K_O.

MBuildIndex(K_O, D) is run by O to construct indexes. It takes K_O and D as inputs, and returns an index I.

AddUser(K_O, U) is run by O whenever O wishes to add a user to the group. It takes K_O and a user U as inputs, and returns U's secret key K_U.

RevokeUser(K_O, U) is run by O whenever O wishes to revoke a user from G. It takes K_O and a user U as inputs, and revokes the user's searching privileges.

MTrapdoor(K_U, w) is run by a user (including O) to generate a trapdoor for a given word w. It takes a user U's secret key K_U and a word w as inputs, and returns a trapdoor T_{U,w}.

MSearch(I_D, T_{U,w}) is run by the server S to search for the documents in D that contain word w. It takes the index I_D for collection D and the trapdoor T_{U,w} for word w as inputs, and returns D(w) if user U ∈ G and ⊥ if user U ∉ G.

We note that the group-controlled searches presented in [5] provide only a single dynamic group. They showed that their system is non-adaptively secure, and that
evicted users cannot perform a search, provided that they cannot collude with non-evicted users. We extend this model to multiple groups of users. Let G = {g_i | g_i ⊆ U} be an indexable set of groups of users from the set of users U. We associate with each group of users g_i a dictionary Δ_i of keywords allowed to be searched for. We further require that Δ_i contain all words in Δ_j for j < i. We define our extended model as follows:

Definition 5.1.2 (Hierarchical Access Controlled SSE (HAC-SSE)). Let O be the owner of a document collection D and G an indexable set of groups of users from the set of users U. HAC-SSE is a set of polynomial-time algorithms HKeygen, HBuildIndex, HAddUser, HRevokeUser, HTrapdoor, and HSearch, where

HKeygen(1^k, n) is the same as MKeygen(1^k).

HBuildIndex(K_O, D, G, {Δ_1, Δ_2, ..., Δ_|G|}) is run by O to construct indexes. It takes, as input, the owner's secret key K_O, a document collection D, the set of groups G, and the set of dictionaries {Δ_1, Δ_2, ..., Δ_|G|}. The function returns an index I that enforces the hierarchical access control.

HAddUser(K_O, U, g) is run by O whenever O wishes to add a user U to the group g ∈ G. It takes the owner's secret key K_O, a user, and the group as input. The function then returns the group key to the user.

HRevokeUser(K_O, U, g) (optional) is run by O whenever O wishes to revoke a user U from group g ∈ G. It takes the owner's secret key K_O, a user U, and a group g as inputs. The function then revokes the user's searching privileges.

HTrapdoor(K_U, w) is run by a user (including O) to generate a trapdoor for a given word. It takes a user U's secret key K_U and a word w as inputs, and returns a
trapdoor T_{U,w}.

HSearch(I_D, T_{U,w}) is run by the server S in order to search for the documents in D that contain word w. It takes the index I_D for collection D and the trapdoor T_{U,w} for word w in some dictionary Δ_g as inputs, and returns D(w) if user U belongs to the appropriate group g ∈ G and ⊥ otherwise.

For HAC-SSE to be secure, the following property must be satisfied:

Property 5.1.1. A user u ∈ g_i cannot successfully query for any word w ∈ Δ_j for any j > i with more than negligible probability.

5.2 Key Regression

Our system relies on a construct by Fu et al. called Key Regression [25], originally designed to allow a content owner to manage dynamic group membership in a Content Distribution Network (CDN). The idea is that a content owner encrypts a document, for a group of users, with a key K_i at time i. All users belonging to the access group are given a member state stm_i, which allows them to derive the key K_i. If at time j a member of the group is evicted, then all documents are re-encrypted with a new key K_j. All users remaining in the group are given a state stm_j which can be used to derive key K_j; they can now forget stm_i. This is possible because of the key-regression property that from state stm_j one can derive all previous states (and thus all previous keys). However, it is infeasible for a user possessing state stm_j to predict future states. We call (i, stm_1, stm_2, ..., stm_n) a publisher state. Formally, Key Regression [25] is defined as follows:

Definition 5.2.1 (Key Regression (KR)). A KR scheme consists of four polynomial-time algorithms setup, wind, unwind, and keyder, where
setup(1^λ, n) returns a publisher state stp, where 1^λ (in unary) is a security parameter and n is an integer representing the number of evictions the system should support. This algorithm may be probabilistic.

wind(stp) returns a tuple (stp′, stm_i), where stp′ is the new publisher state and stm_i is the new member state. This algorithm may be probabilistic, and may return (⊥, ⊥) if the number of winds has exceeded the n used in setup.

unwind(stm_j) returns the previous member state stm_i for i < j. If previous member states do not exist, the algorithm returns ⊥.

keyder(stm_i) returns a key K_i in some keyspace.

These algorithms can be based on the SHA-1 hash function, the AES symmetric cipher, the RSA public-key cipher, or a generic construction based on any forward-secure pseudo-random generator. For our purposes we will simply use Key Regression as a black box.

5.3 Construction of HAC-SSE and Security

Our system associates with each group g_i ∈ G a member state stm_i from a key regression system and a dictionary Δ_i. The dictionary Δ_i contains all the words in dictionary Δ_j for all j < i. Recall that in Key Regression, given a member state stm_i one can derive previous member states stm_j for all j < i. We will ensure that members of group g_i can search for any keyword in dictionary Δ_i, but not in dictionary Δ_k with k > i.

Our index structure I is based on the idea of a secure trie [3]. The root-to-leaf paths through our trie, unlike the trie in [3], are secured in such a way that the hierarchical access policy is enforced. We insert into the trie the words in the set of dictionaries
{Δ_1, Δ_2, ..., Δ_|G|}. Once this process is complete, we annotate the trie with the keys used to secure each letter along the root-to-leaf paths. We start by annotating every root-to-leaf path in Δ_1 with K_1. Note that in a trie, a word is a root-to-leaf path. Starting at i = 2, we iteratively apply the following process for each word w ∈ (Δ_i \ Δ_{i−1}): walk through the trie according to w and examine each annotation; if a node is un-annotated, annotate it with key K_i (see Figure 5.1 for a completely annotated trie).

Figure 5.1: An annotated trie for dictionaries Δ_1 = {cat, dog} and Δ_2 = {car, do}. [The paths c-a-t and d-o-g and their terminators are annotated with K_1; the branch r of "car" and the terminators introduced by "car" and "do" are annotated with K_2.]

We use the completely annotated trie, for each dictionary Δ_i, to generate a perfect hash table H_i with hash function h_i. Each entry in H_i contains a key schedule recording which key is used to secure each node along the root-to-leaf path for the hash of the corresponding word. Finally, the trie is walked and each node is secured, using a keyed hash function, according to its key annotation. The new value is the hash of the value the node represents together with the hash of the path to the node's predecessor. After a node is secured, the key annotation is removed (see Figure 5.2). Though we did not explicitly outline how, we note that every leaf node of the trie contains a list of document identifiers denoting the documents that contain the word specified by the given root-to-leaf path. The trie, which is the index I, is then sent to the cloud. When a user (call him Bob) in group g_j wants to search for a specific word w ∈ Δ_j, Bob looks up the appropriate entry H_j[h_j(w)] and determines the keys used to
encrypt the root-to-leaf path specified by w. Bob then encrypts each character of w according to this key schedule and submits the encrypted query to the cloud. Upon receiving a query, the cloud uses the encrypted path to trace through the trie. If the cloud reaches a leaf node in the trie, it returns the document identifiers to Bob; otherwise, it returns ⊥ to Bob.

Figure 5.2: Final trie based on Figure 5.1, with each node value of the form F_K(l || c || P_h). The value P_h denotes the parent's hash value and l denotes the current node's level.

Placing our system in the formal model of Section 5.1, we obtain the system below.

HKeygen(1^k, n): Set up a key regression system KR = (setup, wind, unwind, keyder). Set stp = setup(1^k, n), construct the hierarchical dictionaries Δ_1, Δ_2, ..., Δ_n, and return K_O = (stp, {Δ_1, Δ_2, ..., Δ_n}, KR).

HBuildIndex(K_O, D, G, {Δ_1, Δ_2, ..., Δ_|G|}):

1. Create a full |Σ|-ary tree.

2. Set (r_0, r_1, r_2) = (0, 0, 0) for every node, T_{0,0}[r_1] = 0, and q_0 = 0.

3. For each word w = (w_1, w_2, ...) in document D_i (where 1 ≤ i ≤ n):

(a) Find an l such that w ∈ Δ_l; if one does not exist, skip the word.

(b) For j = 1 to |w|:
i. Find a q_j ∈ [q_{j−1}|Σ| + 1, (1 + q_{j−1})|Σ|] such that T_{j,q_j}[r_1] = w_j. If such a q_j cannot be found, find a q_j such that T_{j,q_j} is empty and set T_{j,q_j}[r_1] = w_j and T_{j,q_j}[r_0] = l.

(c) Find an appropriate q_{j+1} ∈ [q_j|Σ| + 1, (1 + q_j)|Σ|] at level j + 1 such that T_{j+1,q_{j+1}} = $. If one cannot be found, find a q_{j+1} such that T_{j+1,q_{j+1}} is empty and set its r_1 value to $ and its r_2 value to l, where $ ∉ Σ. Since w ∈ D_i, add id(D_i) to the member list mem of node T_{j+1,q_{j+1}}.

4. From T, build the set of perfect hash tables {H_i | 1 ≤ i ≤ |G|}.

5. For each node T_{j,q_j} in T:

(a) If T_{j,q_j} is a leaf node, set mem = F_{keyder(stm_1)}(mem).

(b) If T_{j,q_j} is not a leaf node, set T_{j,q_j}[r_3] = F_{keyder(stm_{T_{j,q_j}[r_0]})}(j || T_{j,q_j}[r_1] || parent(T_{j,q_j})[r_3]).

6. Pad all remaining r_3 fields in nodes with bit strings of the same length as those of the other non-zero nodes of the trie, and clear all values in r_0 and r_1.

7. Return I = (T, {H_i | 1 ≤ i ≤ |G|}). The set {H_i | 1 ≤ i ≤ |G|} is given to the data owner and not the cloud.

HAddUser(K_O, U, g):

1. Parse K_O as stp.

2. Check that 1 ≤ g ≤ |G|; if not, return ⊥. Otherwise:

(a) Set newstp = stp.

(b) For i = 1 to g, set (newstp, stm) = wind(newstp).

3. Return K_U = (stm, H_g).
HTrapdoor(K_U, w = (w[1], w[2], ...)):

1. Parse K_U as (stm_g, H_g).

2. Set s = H_g[h_g(w)], curstm = stm_g, π[0] = 0, and w[|w| + 1] = $.

3. For j = 1 to |w| + 1:

(a) Set curstm = stm_g.

(b) For i = 1 to s[j] − 1, set curstm = unwind(curstm).

i. If curstm = ⊥, return ⊥.

(c) Set π[j] = F_{keyder(curstm)}(j || w[j] || π[j − 1]).

4. Return the privacy-preserving trapdoor T_{U,w} = π.

HSearch(I_D, T_{U,w}):

1. Parse T_{U,w} as π[1 ... n] and set q_0 = 0.

2. For j = 1 to n:

(a) Find a q_j ∈ [q_{j−1}|Σ| + 1, (1 + q_{j−1})|Σ|] such that T_{j,q_j}[r_3] = π[j]. If found, continue with the next j; otherwise return ⊥.

3. If T_{j,q_j} has no child, then return the document collection T_{j,q_j}[r_2].

5.3.1 Security Guarantees of HAC-SSE

We first note that each document is encrypted separately. This means that nothing is leaked about the encrypted documents except their sizes. The privacy-preserving queries are a collection of hashes of prefixes, with confidentiality protected by the underlying keyed hash function. The only information leaked by a query is the length of the query.
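The annotation and path-securing steps of Figures 5.1 and 5.2 can be sketched in simplified form. The code below is our own illustration: it collapses the perfect hash tables H_i into a per-prefix key schedule, omits the trie arrays and the r_0..r_4 fields, and uses HMAC-SHA256 as the keyed hash F. It shows how each root-to-leaf prefix is bound to the key of the lowest group that introduces it, and how the same keyed-hash computation serves both index construction and trapdoor generation:

```python
import hashlib
import hmac

def F(key, level, ch, parent):
    # Keyed hash of (level, character, parent hash), as in Figure 5.2.
    return hmac.new(key, bytes([level]) + ch.encode() + parent,
                    hashlib.sha256).digest()

def annotate(dicts):
    """Key schedule: each prefix belongs to the lowest group that adds it,
    mirroring the annotation pass of Figure 5.1 (dicts: {group: [words]})."""
    annot = {}
    for g in sorted(dicts):
        for word in dicts[g]:
            t = word + "$"                 # $ terminates a word
            for lvl in range(1, len(t) + 1):
                annot.setdefault(t[:lvl], g)
    return annot

def path_hashes(word, annot, keys):
    """Hash the root-to-leaf path for `word` under its key schedule; used
    both to build the secured trie and as the client's HTrapdoor."""
    t, parent, pi = word + "$", b"", []
    for lvl, ch in enumerate(t, start=1):
        parent = F(keys[annot[t[:lvl]]], lvl, ch, parent)
        pi.append(parent)
    return tuple(pi)
```

With the dictionaries of Figure 5.1, the shared prefix "do" stays under K_1 (group 1 owns "dog") while the terminator introduced by "do" and the branch "r" of "car" fall under K_2, so only a group-2 member holding both keys can complete those paths.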
Turning our attention to the confidentiality of the encrypted trie, we observe that each node contains a hashed value r_3. Although it is infeasible for an attacker to derive the original value from r_3, the attacker may try to learn statistical information about r_3 by considering all the nodes in T. We show that this is impossible using the following theorem on the uniqueness of hash values in the nodes of a trie [3].

Theorem 5.3.1 (Uniqueness of Node Values). Given a trie T of depth L with C ≤ (|Σ|^(L+1) − 1)/(|Σ| − 1) nodes, we have

Pr(T_{j,q}[r_3] = T_{ĵ,q̂}[r_3] for some (q, j) ≠ (q̂, ĵ)) ≤ 1 − ((2^z − 1)/2^z)^(C(C−1)/2).

Theorem 5.3.1 holds when the prefix-path values for the nodes are computed with the same keyed hash function. While our system has |G| possible keyed hash functions, the theorem still holds due to the construction of keyed cryptographic hash functions.

It is straightforward to see that the system presented satisfies Property 5.1.1, because no user, given a member state stm_i, can determine stm_j for any j > i with more than negligible probability. We can further prove that our system is non-adaptively secure, as stated in Theorem 5.3.2, by constructing search patterns appropriately.

Theorem 5.3.2. The construction presented in Section 5.3 is non-adaptively secure.

Proof. We recall the definition of non-adaptive security from [5] and proceed to prove the non-adaptive security of HAC-SSE. We describe a probabilistic polynomial-time simulator S such that for all q ∈ N, all probabilistic polynomial-time adversaries A, and all distributions L_q = {H_q | Tr(H_q) = Tr_q}, where Tr_q is some size-q trace, S can construct a view V′_q that A cannot distinguish from a genuine view V_K(H_q), given that S is provided with Tr(H_q).
For q = 0, the simulator S constructs a view V′_0 that is indistinguishable from V_K(H_0) for any H_0 ∈R L_0. In particular, S generates V′_0 = {1, ..., n, e′_1, ..., e′_n, I′}, where e′_i ∈R {0,1}^|D_i| for all 1 ≤ i ≤ n and I′ = T′. The simulator S generates a complete (|Σ| + 1)-ary trie T′ of height max_{w_i}{|w_i|}, and fills each node with a random number from {0,1}^z. The leaf nodes are filled with series of random numbers from 1 to n.

We claim that V′_0 and V_K(H_0) are indistinguishable. By a standard hybrid argument, we must show that A cannot distinguish any element in V′_0 from the corresponding element in V_K(H_0). This is true simply because the document identifiers in V′_0 and V_K(H_0) are computationally indistinguishable. It remains to argue that the indexes I = T and I′ = T′ are indistinguishable and that the encrypted documents are indistinguishable. To see that T and T′ are indistinguishable, we note that every node in T′ is a binary string in {0,1}^z, and the nodes of T are either a hash value in {0,1}^z or a random binary string from {0,1}^z. In either case, the nodes are indistinguishable. In the case of the encrypted documents, we observe that since (G, E, D) is a semantically secure encryption scheme, e′_i is indistinguishable from the associated encryption in V_K(H_0).

For q > 0, simulator S constructs the view V′_q = {1, ..., n, e′_1, ..., e′_n, I′, T_1, ..., T_q}, where I′ = T′ is a complete |Σ|-ary trie. Each value in the trie is drawn randomly from {0,1}^z. The trapdoors T_i are constructed as root-to-leaf paths through the trie, in such a way that they have the correct length. Briefly, to get the correct length for word w_i, the simulator inspects E_{j,i}; if it is zero, then set |w_i| = j − 1. The trapdoors T_i are also constructed so that they lead to a leaf node containing the appropriate document set.
The encrypted documents, as well as the document identifiers, are still indistinguishable for the same reasons as in the case q = 0. The index is likewise indistinguishable. The trapdoors are indistinguishable, as they consist of a sequence of random values that are indistinguishable from the output of the function used to create genuine trapdoors. Finally,
we note that they are indistinguishable in length as well, because the search pattern of the trace provides the simulator with sufficient information to construct a trapdoor of the correct size. We have thus described a polynomial-time simulator S that can create a view indistinguishable from a genuine view for any polynomial-time adversary A. This completes the proof.

In a straightforward manner, one can extend the proof to handle verification tags. To do this, one must augment the trace, both genuine and simulator-constructed, to contain the tags. To see that this does not change the proof above, we note that the verification tags themselves are the results of a semantically secure symmetric encryption system and thus are indistinguishable from random.

5.4 Adding Revocation and Verification

We extend our construction in Section 5.3 to include verification of search results and revocation of access. By adding verification we can place our system under the SHBC model. To do this we must modify HBuildIndex by augmenting the nodes of the trie in a fashion similar to the methods used in [3]. We also extend our model to contain a polynomial-time function HVerify to verify that the cloud has returned the results correctly. To add revocation of access, we make use of the revocation method of Curtmola et al. [5].

To add verification to our system, we add a set B of |G| bitmaps of size |Σ| to every node in the trie construction algorithm given in Section 5.3. We require that the bitmap b_i ∈ B represent all children reachable by a member of group i ∈ G. Under step 5 of the HBuildIndex algorithm, we add verification tags to each node according to the algorithm given in Figure 5.3. To verify the results returned by the cloud, we utilize the verification algorithm
1. For y = 1 to |G|:

(a) Set T_{i,j}[r_4] = T_{i,j}[r_4] || E_{keyder(stm_y)}(b_y || T_{i,j}[r_3]).

Figure 5.3: Modification to the HBuildIndex algorithm to add verification support to the trie.

given in Figure 5.4. This algorithm is a slightly modified version of Verify in [3].

HVerify(w = (w_1, w_2, ...), {proof_i | 1 ≤ i ≤ |w|}, π):

1. Determine the key schedule s from the hash table H_g.

2. If the cloud responded positively, walk the list of proofs: for t = 1 to r − 1,

(a) Parse proof_t as e_1 || e_2 || ... || e_|G|.

(b) Decrypt the g-th entry in proof_t to obtain b_g || σ, where b_g denotes the children accessible from w_t and σ is the prefix signature for the node.

(c) If w_{t+1} is not one of the children listed in bit vector b_g, or π[t] ≠ σ, return false.

3. Return true.

Figure 5.4: The HVerify algorithm.

To handle the need for revocations in our basic system, we again modify our construction. The idea is to use a traditional broadcast encryption system and a keyed pseudo-random permutation φ to manage group membership. Instead of sending trapdoor T_w to the cloud, we send φ(T_w). When the query arrives at the cloud, the cloud inverts the permutation (computes φ^(−1)(φ(T_w))) to recover the trapdoor T_w. To enable dynamic membership, this pseudo-random permutation is keyed by a value v. In [5] the value v is changed, via broadcast encryption, each time a user leaves the system. In our system we will make use of a second Key Regression system. Recall that Key Regression allows a content owner to share access to data with users, and that it allows a finite number of revocations of access. To add revocation to
our system as described in Section 5.3, we modify HKeygen, HSearch, and HTrapdoor, and define the routine HRevokeUser from the HAC-SSE model. We start by modifying HKeygen to take an additional parameter that describes the maximum number of revocations, ρ, that the system should permit. The content owner can then run setup to initialize a new Key Regression system. The HSearch function applies the inverse pseudo-random permutation φ^(−1)_{keyder(stm′)} to the trapdoor T_{U,w}, where stm′ denotes the current member state. We modify the HTrapdoor algorithm to return the trapdoor T_{U,w} with the pseudo-random permutation φ_{keyder(stm′)} applied. The revocation algorithm is designed to be run by the owner to revoke a user u from any group in the entire system. The algorithm appears in Figure 5.5.

HRevokeUser(K_O, u, g):

1. Parse K_O as (stp, stp_r).

2. Run (stm, stp) = wind(stp) to obtain the next key material.

3. Send to the cloud and all users, except u, the new member state stm.

Figure 5.5: The HRevokeUser algorithm.

We note here that our method for revoking access to searches relies on the trust that the cloud will not give away the key for the pseudo-random permutation.

5.4.1 Security Guarantees

We emphasize that revoked users are not able to issue successful queries after they are revoked. This is due to the assumption that a member state is not given to any user by the cloud. It directly implies that revoked users cannot generate a trapdoor, for they are unable to apply the correct pseudo-random permutation.

We now discuss the unforgeability of verification tags in the system. In order for the cloud to successfully forge a verification tag, it would need to be able to recover at least one encryption key, since the cloud cannot forge a valid encryption. Moreover,
the cloud cannot use a different verification tag from the tree, as the hash of the prefix of the query is included in the verification tag, thus binding the verification tags to specific nodes.

5.5 Conclusion

We presented a secure SSE scheme with hierarchical access control for multiple groups of users under the SHBC model. Our system can support both addition and eviction of group members. One application of our system is tagging documents with security terms (e.g., need-to-know, secret, top-secret), with efficient secure search that enforces the data-hiding property; this property amounts to obscuring which documents are tagged with the top-secret designation. Another application is providing filters over search-engine queries, thus providing a child lock on Internet searches while still maintaining query privacy. Yet another application is protecting patient records by allowing only certain classes of physicians to query for certain diseases or other medical tags.
Chapter 6

Conclusion

In this chapter we summarize the results presented in this dissertation and elaborate on future research directions.

6.1 Results

In this dissertation we solved three open problems. First, we gave an efficient verifiable encrypted phrase search scheme that is non-adaptively secure in the SHBC cloud model. Second, we provided a single-round encrypted phrase search algorithm. Third, we provided the first multi-user SSE system that provides efficient hierarchical access control over search keywords; the system supports dynamic membership via a key-regression primitive and is proven non-adaptively secure under the SHBC cloud model.

6.2 Future Work

In this section we investigate several directions that can extend the results discussed in this dissertation. We first consider directions along the lines of verifiable symmetric encrypted phrase search and finish with a discussion of new
potential applications of access control to searchable symmetric encryption.

The most basic question arising from this dissertation is how to solve these problems with adaptive security. Currently, the only known method for adaptive security deals with keyword search, due to Curtmola, Garay, Kamara, and Ostrovsky [5]. While that system is asymptotically linear in its use of storage, it has a large hidden constant. In future work, we seek to develop adaptively secure encrypted keyword and phrase search mechanisms that are verifiable. To date, there has been no work in this area.

In the area of access control for SSE, there are several additional issues to consider. Most importantly, we would like a system that allows hierarchical access control over phrases. It should also be determined whether non-hierarchical access control systems can be realized in SSE. For both sets of problems, adaptively secure versions of phrase encryption and of encrypted search with access control will prove influential.
Bibliography

[1] D. X. Song, D. Wagner, and A. Perrig, "Practical techniques for searches on encrypted data," in Proceedings of the 2000 IEEE Symposium on Security and Privacy, ser. SP '00. Washington, DC, USA: IEEE Computer Society, 2000, pp. 44–55.

[2] D. Boneh, G. D. Crescenzo, R. Ostrovsky, and G. Persiano, "Public key encryption with keyword search," in Advances in Cryptology – EUROCRYPT 2004, ser. Lecture Notes in Computer Science, C. Cachin and J. Camenisch, Eds. Springer Berlin/Heidelberg, 2004, vol. 3027, pp. 506–522.

[3] Q. Chai and G. Gong, "Verifiable symmetric searchable encryption for semi-honest-but-curious cloud servers," in IEEE International Conference on Communications, ICC '12, June 2012.

[4] Y. Tang, D. Gu, N. Ding, and H. Lu, "Phrase search over encrypted data with symmetric encryption scheme," in 2012 32nd International Conference on Distributed Computing Systems Workshops, 2012, pp. 471–480.

[5] R. Curtmola, J. Garay, S. Kamara, and R. Ostrovsky, "Searchable symmetric encryption: improved definitions and efficient constructions," in Proceedings of the 13th ACM Conference on Computer and Communications Security, ser. CCS '06. New York, NY, USA: ACM, 2006, pp. 79–88.

[6] J. Katz and Y. Lindell, Introduction to Modern Cryptography (Chapman & Hall/CRC Cryptography and Network Security Series). Chapman & Hall/CRC, 2007.

[7] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to Algorithms. MIT Press, 2009.

[8] O. Goldreich, Foundations of Cryptography: Volume 1. New York, NY, USA: Cambridge University Press, 2006.

[9] D. R. Stinson, Cryptography: Theory and Practice. CRC Press, 2006.

[10] S. Goldwasser and S. Micali, "Probabilistic encryption," Journal of Computer and System Sciences, vol. 28, no. 2, pp. 270–299, 1984.
[11] M. Bellare, A. Desai, E. Jokipii, and P. Rogaway, "A concrete security treatment of symmetric encryption," in Proceedings of the 38th Annual Symposium on Foundations of Computer Science. IEEE, 1997, pp. 394–403.

[12] M. Luby and C. Rackoff, "How to construct pseudorandom permutations from pseudorandom functions," SIAM Journal on Computing, vol. 17, no. 2, pp. 373–386, 1988.

[13] R. C. Merkle, "A certified digital signature," in Advances in Cryptology – CRYPTO '89 Proceedings. Springer, 1990, pp. 218–238.

[14] I. B. Damgård, "A design principle for hash functions," in Advances in Cryptology – CRYPTO '89 Proceedings. Springer, 1990, pp. 416–427.

[15] O. Goldreich, S. Goldwasser, and S. Micali, "On the cryptographic applications of random functions," in Advances in Cryptology. Springer, 1985, pp. 276–288.

[16] J. Zobel and A. Moffat, "Inverted files for text search engines," ACM Computing Surveys (CSUR), vol. 38, no. 2, p. 6, 2006.

[17] E. Fredkin, "Trie memory," Commun. ACM, vol. 3, no. 9, pp. 490–499, Sep. 1960.

[18] B. H. Bloom, "Space/time trade-offs in hash coding with allowable errors," Commun. ACM, vol. 13, no. 7, pp. 422–426, Jul. 1970.

[19] L. Fan, P. Cao, J. Almeida, and A. Z. Broder, "Summary cache: a scalable wide-area web cache sharing protocol," IEEE/ACM Transactions on Networking (TON), vol. 8, no. 3, pp. 281–293, 2000.

[20] E. Goh, "Secure indexes," Cryptology ePrint Archive, Report 2003/216, 2003, http://eprint.iacr.org/2003/216/.

[21] Y. Chang and M. Mitzenmacher, "Privacy preserving keyword searches on remote encrypted data," in Proceedings of the Third International Conference on Applied Cryptography and Network Security, ser. ACNS '05. Berlin, Heidelberg: Springer-Verlag, 2005, pp. 442–455.

[22] Z. A. Kissel and J. Wang, "Verifiable phrase search over encrypted data secure against a semi-honest-but-curious adversary," in IEEE International Conference on Distributed Computing Systems, CCRM Workshop, July 2013.

[23] H. E. Williams, J. Zobel, and P. Anderson, "What's next?
index structures for efficient phrase querying," in Proceedings of the Australasian Database Conference. Australian Computer Society, 1999, pp. 141–152.

[24] F. Scholer, H. E. Williams, J. Yiannis, and J. Zobel, "Compression of inverted indexes for fast query evaluation," in Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2002, pp. 222–229.
[25] K. Fu, S. Kamara, and T. Kohno, "Key regression: enabling efficient key distribution for secure distributed storage," in Network and Distributed System Security Symposium (NDSS '06), 2006.
Biography

Zachary Kissel received his Bachelor of Science degree in Computer Science, with a double major in Mathematics, in 2005 from Merrimack College, and his Master of Science degree in Computer Science in 2007 from Northeastern University. In September 2009, he began pursuing his doctoral degree in Computer Science at the University of Massachusetts Lowell. Concurrently with his master's and doctoral studies, he worked as a software engineer at Sun Microsystems and Oracle. In the fall of 2012, Zachary left industry and joined the Computer Science faculty at Merrimack College as a visiting professor.