Secure Deduplication of Encrypted Data without Additional Servers

Jian Liu (Aalto University), N. Asokan (Aalto University and University of Helsinki), Benny Pinkas (Bar Ilan University)

Abstract

Encrypting data on the client side before uploading it to cloud storage is essential for protecting users' privacy. However, client-side encryption is at odds with the standard practice of deduplication in cloud storage services. Reconciling client-side encryption with cross-user deduplication has been an active research topic. In this paper, we present the first secure cross-user deduplication scheme that supports client-side encryption without requiring any additional independent servers. We demonstrate that our scheme provides better security guarantees than previous work. We also demonstrate that its performance is reasonable via simulations using realistic datasets.

1. INTRODUCTION

Cloud storage is a service that enables people to store their data remotely with a storage provider. With a rapid growth in user base, storage providers tend to save storage costs via cross-user deduplication: if two clients want to upload the same file, the storage server detects the duplication and stores only a single copy. Deduplication achieves high storage savings [20] and is adopted by many cloud service providers. It is also adopted widely in backup systems for enterprise workstations.

Clients who care about privacy want to encrypt their data on the client side using semantically secure encryption mechanisms. However, naive application of encryption thwarts deduplication. Reconciling deduplication and encryption is an active research topic. The current solutions either depend on the aid of additional independent servers [3, 23, 25] or are only suitable for peer-to-peer (P2P) systems [13]. Reliance on independent servers is a strong assumption that is very difficult to meet in commercial contexts, and since most of the popular cloud storage services (e.g. Dropbox, Google Drive) are client-server systems, turning them into P2P systems is unrealistic.

In this paper, we present the first single-server scheme for secure cross-user deduplication with client-encrypted data. Our scheme allows two clients uploading the same file to the cloud storage to securely agree on the same file encryption key. We do this by building upon a well-known cryptographic primitive known as password authenticated key exchange (PAKE) [6]. A PAKE protocol allows two parties to agree on a session key if and only if they share a short secret ("password"), which might have low entropy. In our deduplication scheme, PAKE protocol runs are facilitated by the storage server. This avoids the need for any additional independent servers and thus can be adopted easily into popular cloud storage services. In terms of security, our scheme makes use of per-file rate limits (namely, limiting the number of queries allowed per file), rather than per-client rate limits (limiting the number of queries allowed for each client, as was used in [3]).
This change makes the scheme significantly more resistant to brute-force guessing attacks by colluding clients or by a malicious storage server that attempts to masquerade as multiple clients.

We make the following contributions:

- We present the first single-server scheme that simultaneously enables the server to do cross-user deduplication and enables clients to encrypt their data on the client side with semantically secure encryption keys (Section 5).
- We show that our scheme has better security guarantees than previous work (Section 6). As far as we know, our proposal is the first scheme that can prevent online brute-force guessing attacks by malicious clients or a malicious storage server without the aid of an identity server.
- We demonstrate, via simulations with realistic datasets, that our scheme provides its privacy benefits while retaining a utility level (in terms of deduplication effectiveness) on par with standard industry practice. Our use of per-file rate limits implies that an incoming file will not be checked against all existing files in storage. Surprisingly, our scheme still achieves high deduplication effectiveness (Section 7).
- We implement our scheme and show that it incurs minimal overhead (computation/communication) compared to traditional deduplication schemes (Section 8).

2. PRELIMINARIES

2.1 Deduplication

Deduplication strategies can be categorized according to the basic data units they handle. One category is file-level deduplication, which eliminates redundant files [12]. The other is block-level deduplication, in which files are segmented into blocks and duplicated blocks are eliminated [24]. In this paper we use the term file to refer to both files and blocks.

Deduplication strategies can also be categorized according to the side on which deduplication happens. In server-side deduplication, all files are uploaded to the storage server S, which then deletes the duplicates. Clients (Cs) are unaware of deduplication. This strategy saves storage but not bandwidth. In client-side deduplication, C checks for the existence of its files on S (by sending cryptographic hashes) before uploading them. Duplicates are not uploaded to S. This strategy saves both storage and bandwidth.

The effectiveness of deduplication is customarily expressed by the deduplication ratio, defined as the number of bytes input to a data deduplication process divided by the number of bytes output [14]. In the case of a cloud storage service, this is the ratio between the total size of files uploaded by all clients and the total storage space used by the service. The deduplication percentage (sometimes referred to as the "space reduction percentage") [14, 17] is 1 - 1/(deduplication ratio). A perfect deduplication scheme will detect all duplicates; hence its deduplication percentage will be close to 100%. The level of deduplication achievable depends on a number of factors. In common business settings, deduplication ratios in the range of 4:1 (75%) to 500:1 (99.8%) are typical. Wendt et al. [26] suggest that a figure in the range 10:1 (90%) to 20:1 (95%) is a realistic expectation.

Although deduplication benefits storage providers (and hence, indirectly, their users), it also constitutes a privacy threat for users. For example, a cloud storage server that supports client-side, cross-user deduplication can be exploited as an oracle that answers queries of the form "did anyone upload this file?". An adversary A can do so by uploading a file and observing whether deduplication takes place [17]. For a predictable file that is known to be drawn from a small space, A can construct all possible files from that space, upload them and observe which one causes deduplication, so that A can identify the target file. Harnik et al. [17] propose a randomized threshold approach to address such online brute-force guessing attacks. Specifically, for each file F, S keeps a random threshold t_F (t_F >= 2) and a counter c_F that indicates the number of Cs that have previously uploaded F. Client-side deduplication happens only if c_F >= t_F. Otherwise, S does server-side deduplication.

2.2 Hash Collisions

A hash function H: Phi -> {0,1}^n deterministically maps a binary string in Phi of arbitrary length to a binary string h of fixed length n, where h is called the hash value (or simply the hash) and n is called the hash length of H. A cryptographic hash function maps distinct elements to hash values in such a way that it is infeasible to invert the function or to find collisions (namely, the function is collision resistant). If a hash function is not collision resistant, the expected number of inputs from Phi that map to a given hash value is |Phi| / 2^n, on condition that the hash values are evenly distributed across the available range. A hash function with a long hash length has few collisions, whereas a hash function with a short hash length has many collisions. We will use such a short hash to protect clients' privacy. For a predictable file F, drawn from a small space Phi, if A knows H(F), she can easily perform an offline brute-force guessing attack. But if A only knows SH(F), where SH() is a short hash function, it is hard for A to guess the exact content of F due to the high collision rate of SH().
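As a concrete illustration, the sketch below derives a short hash by truncating a full cryptographic hash to n = 13 bits, which is how the prototype in Section 8.1 obtains SH() and the short-hash length used in our evaluation; the Python rendering itself is only a sketch.

```python
import hashlib

def full_hash(data: bytes) -> bytes:
    """Cryptographic hash H(F): SHA-256 of the file contents."""
    return hashlib.sha256(data).digest()

def short_hash(data: bytes, n_bits: int = 13) -> int:
    """Short hash SH(F): the top n_bits of H(F), here 13 bits.
    On average |Phi| / 2**n_bits distinct files share each value,
    so the value alone does not identify the file."""
    digest = full_hash(data)
    as_int = int.from_bytes(digest, "big")
    return as_int >> (len(digest) * 8 - n_bits)

if __name__ == "__main__":
    print(short_hash(b"example file contents"))   # a value in [0, 8191]
```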
2.3 Additively Homomorphic Encryption

A public key encryption scheme is additively homomorphic if, given two ciphertexts c_1 = Enc(pk, m_1; r_1) and c_2 = Enc(pk, m_2; r_2) (i.e., encryptions of the messages m_1, m_2 under the same public key pk but with different random values r_1, r_2), it is possible to efficiently compute Enc(pk, m_1 + m_2; r) with independent randomness r, even without knowledge of the corresponding private key. Examples of such encryption schemes are Paillier's encryption [21] and El Gamal encryption [15] where addition is done in the exponent. In this paper, we use E()/D() to denote symmetric encryption/decryption, and Enc()/Dec() to denote additively homomorphic encryption/decryption. We abuse notation and use Enc(pk, m) to denote the random variable induced by Enc(pk, m; r) where r is chosen uniformly at random. In addition, we use (+) and (-) to denote homomorphic addition and subtraction, respectively.
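The toy sketch below illustrates the additive property using El Gamal with the message in the exponent, one of the two examples mentioned above. The tiny group parameters and the brute-force decryption loop are illustrative assumptions only; a scheme such as Paillier would be used in practice when the plaintext is large.

```python
import random

# Toy additively homomorphic encryption (El Gamal "in the exponent").
# Insecure demo parameters: p = 2q + 1 with q prime, and g generates
# the order-q subgroup. Decryption recovers m by brute force, so this
# only works for small messages.
q, p, g = 11, 23, 4

def keygen():
    sk = random.randrange(1, q)
    return sk, pow(g, sk, p)                     # (private, public) key pair

def enc(pk, m):
    r = random.randrange(1, q)
    return (pow(g, r, p), (pow(g, m % q, p) * pow(pk, r, p)) % p)

def add(c1, c2):                                 # Enc(m1) (+) Enc(m2) = Enc(m1 + m2)
    return ((c1[0] * c2[0]) % p, (c1[1] * c2[1]) % p)

def dec(sk, c):
    gm = (c[1] * pow(pow(c[0], sk, p), p - 2, p)) % p
    for m in range(q):                           # brute-force discrete log (toy only)
        if pow(g, m, p) == gm:
            return m

if __name__ == "__main__":
    sk, pk = keygen()
    c = add(enc(pk, 3), enc(pk, 4))
    assert dec(sk, c) == 7                       # homomorphic addition
```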

2.4 Password Authenticated Key Exchange

Password-based protocols are commonly used for user authentication. However, such protocols are vulnerable to offline brute-force guessing attacks (also referred to as "dictionary attacks") since users tend to choose passwords with relatively low entropy that are hence guessable. Bellovin and Merritt [6] were the first to propose a password authenticated key exchange (PAKE) protocol, in which an attacker making a password guess cannot verify the guess without an online attempt to authenticate itself with that password. The protocol is based on using the password as a symmetric key to encrypt the messages of a standard key exchange protocol (e.g., Diffie-Hellman [11]), so that two parties with the same password will successfully generate a session key without revealing their passwords. Following this seminal work, many protocols were proposed to improve PAKE in several aspects, e.g., describing provably secure protocols [5, 7], weakening the assumptions (i.e., working in the standard model without random oracles) [16, 2], achieving a stronger proof model [10, 9] and improving the round efficiency [18, 5, 19].

We use PAKE as a building block of our deduplication protocol. The ideal functionality F_pake is shown in Figure 1. Alice and Bob hold passwords pw_a and pw_b respectively. F_pake outputs the same key to both of them if and only if pw_a = pw_b. Otherwise, it outputs two independent random keys to Alice and Bob.

Inputs: Alice's input is a password pw_a; Bob's input is a password pw_b.
Outputs: Alice's output is k_a; Bob's output is k_b, where if pw_a = pw_b then k_a = k_b, and if pw_a != pw_b then Bob (resp. Alice) cannot distinguish k_a (resp. k_b) from a random string of the same length.
Figure 1: The ideal functionality F_pake for password authenticated key exchange.

The chosen PAKE protocol should satisfy the following requirements:

Implicit key exchange: It must be implicit, meaning that at the end of the protocol each party outputs a key but neither party learns whether the passwords matched. (Many PAKE protocols were instead designed to be explicit, where the parties learn at the end of the protocol whether the exchange succeeded.)

Single round: It must be single-round so that it can be easily facilitated by the storage server.

Pseudo-random keys: It must guarantee that if the passwords are different, then neither party can distinguish the key output by the other party from a random key.

Concurrent executions: It must allow multiple PAKE instances to run in parallel. There are two common security notions for PAKE under concurrent executions. One notion is concurrent PAKE, defined by [5, 7]. The other, stronger notion is that of UC-secure PAKE [10], which guarantees security under composition with arbitrary protocols, and with arbitrary, unknown and possibly correlated password distributions. Protocols secure according to the notion of UC security (e.g. [10]) are currently much less efficient than protocols secure according to the notion of concurrent PAKE. We therefore use a concurrent PAKE protocol in our work.

Our deduplication scheme uses the SPAKE2 protocol of Abdalla and Pointcheval [1], which is described in Figure 2. This protocol is secure in the concurrent setting, in the random oracle model. It is very efficient, requiring each party to compute just three exponentiations and send just a single group element to the other party. Theorem 5.1 in [1] states that this protocol is secure assuming that the computational Diffie-Hellman problem is hard in the group used by the protocol.

3. PROBLEM STATEMENT

3.1 General Setting

Our generic setting for cloud storage systems consists of a storage server S and a set of clients (Cs) who store their files on S. Cs never communicate directly, but exchange messages with S, and S processes the messages and/or forwards them as needed. In addition, third parties (T Ps) can be introduced to the system to assist deduplication. For example, T Ps can be used for key generation [3], for keeping track of file popularities [25] and even for encryption/decryption [23]. But independent T Ps are expensive to deploy in practice and can be bottlenecks for both security and performance. Introducing a T P which is independent of S is often unrealistic in commercial settings.

We assume that parties communicate through secure channels, so that an adversary A cannot eavesdrop on or tamper with any data in the channel. We also assume that S can authenticate and authorize Cs, and that Cs can authenticate S. We introduce new notations as needed; a summary of notations appears in Appendix A.

3.2 Ideal Model

We define an ideal model that attempts to capture the ideal functionality F_dedup of deduplicating encrypted data. The model has three types of participants: the storage server S, an uploading client C attempting to upload a new file, and existing clients {C_i} who have already uploaded a file. The ideal functionality is shown in Figure 3.

Inputs: The input of the uploading client C is F. Each existing client C_i has inputs F_i and k_Fi. S's input is the set of existing encrypted files {E(k_Fi, F_i)}.
Outputs: C learns a key k_F with which to encrypt its file; if F is identical to an existing file F_i then k_F = k_Fi, otherwise k_F is random. Each C_i's output is empty. S learns the encryption of F under the key k_F; note that if there exists F_i = F then this encryption is identical to the encryption of F_i.
Figure 3: The ideal functionality F_dedup of deduplicating encrypted data.

A deduplication system for encrypted files is secure if it implements F_dedup.
In particular, no participant learns more information than is defined in the output of the functionality. Our system implements this functionality with the following relaxation: S learns a short hash of F (say, 13 bits long), and other clients that have previously uploaded files with this hash value might learn that a new file with this short hash is being uploaded. In a large-scale system this short hash agrees with many files and therefore leaks limited information about the uploaded file.

3.3 Design Goals

Threat Model: An adversary A might control any subset of the Cs of the system. It might also have control over S: it can gain access to the data stored at S, perform offline brute-force guessing attacks, and might even mount active attacks by deviating from the deduplication protocol and potentially masquerading as Cs of the system. A may also control any third party T P used by the deduplication scheme.

When client-side deduplication takes place, A, masquerading as a C uploading a file and wishing to identify whether a specific file was already uploaded by a different C, can apply an online brute-force attack as described by Harnik et al. [17]. The randomized approach of [17], described in Section 2.1, can mitigate that attack. However, even if the deduplication scheme is secure against offline brute-force guessing attacks, an A controlling both S and some Cs can still mount a similar online brute-force attack by running the deduplication protocol for every guess of a predictable file and checking whether deduplication occurs. This attack is online since, as will become clearer in Section 4, A has to run a protocol invocation with the uploading C for every guess of the content of the uploaded file.

Public information: A finite cyclic group G of prime order p generated by an element g. Public elements M_u in G associated with each user u. A hash function H modeled as a random oracle.
Secret information: User u has a password pw_u.
The protocol is run between Alice and Bob:
1. Each side performs the following computation. Alice chooses x uniformly at random from Z_p and computes X = g^x; she defines X* = X * (M_A)^{pw_A}. Bob chooses y uniformly at random from Z_p and computes Y = g^y; he defines Y* = Y * (M_B)^{pw_B}.
2. Alice sends X* to Bob. Bob sends Y* to Alice.
3. Each side computes the shared key. Alice computes K_A = (Y* / (M_B)^{pw_A})^x and then her output SK_A = H(A, B, X*, Y*, pw_A, K_A). Bob computes K_B = (X* / (M_A)^{pw_B})^y and then his output SK_B = H(A, B, X*, Y*, pw_B, K_B).
Figure 2: The SPAKE2 protocol of Abdalla and Pointcheval [1].
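A minimal sketch of the Figure 2 message flow follows, assuming a toy group and arbitrary demo values for M_A and M_B; in our protocol the "password" would be the cryptographic hash of the file, reduced into the group's exponent space. A real deployment would use a large standardized prime-order group.

```python
import hashlib, random

# Toy run of the SPAKE2 flow from Figure 2. Demo-only parameters:
# p = 2q + 1 is a tiny safe prime and g generates the order-q subgroup.
q, p, g = 11, 23, 4
M_A, M_B = 9, 13                  # fixed public group elements for the two roles

def inv(x):                       # modular inverse in Z_p*
    return pow(x, p - 2, p)

def msg(priv, M, pw):
    """One party's single message: X* = g^x * M^pw."""
    return (pow(g, priv, p) * pow(M, pw % q, p)) % p

def session_key(ida, idb, Xs, Ys, pw, priv, M_other, i_am_alice):
    """K = (peer_msg / M_other^pw)^priv, then SK = H(transcript, pw, K)."""
    peer = Ys if i_am_alice else Xs
    K = pow((peer * inv(pow(M_other, pw % q, p))) % p, priv, p)
    return hashlib.sha256(f"{ida}|{idb}|{Xs}|{Ys}|{pw}|{K}".encode()).hexdigest()

if __name__ == "__main__":
    pw_a = pw_b = 5                               # matching "passwords"
    x, y = random.randrange(1, q), random.randrange(1, q)
    Xs, Ys = msg(x, M_A, pw_a), msg(y, M_B, pw_b)
    sk_a = session_key("Alice", "Bob", Xs, Ys, pw_a, x, M_B, True)
    sk_b = session_key("Alice", "Bob", Xs, Ys, pw_b, y, M_A, False)
    assert sk_a == sk_b                           # equal passwords -> equal session keys
```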

Security Goals: With the above threat model in mind, we aim to propose a deduplication scheme that both enables S to do deduplication and enables Cs to encrypt their files with a semantically secure encryption scheme. Our security goals are as follows.

S1 Avoid relying on any T P.
S2 Prevent brute-force guessing attacks:
S2.1 online, by compromised Cs;
S2.2 offline, by a compromised passive S;
S2.3 online, by a compromised active S (masquerading as multiple Cs).

Functional goals: In addition, our protocol should also meet certain functional goals.

F1 Maximize deduplication effectiveness (exceed realistic expectations, as discussed in Section 2.1).
F2 Minimize computational and communication overhead (i.e., the computation/communication costs should be comparable to storage systems without deduplication).

4. OVERVIEW OF THE SOLUTION

We first motivate the salient design decisions in our deduplication protocol before describing the details in Section 5. When an uploader C wants to upload a file F to S, we need to address two problems: (a) determining whether S already has an encrypted copy of F in its storage and (b) if so, securely arranging to have the encryption key transferred to C from some C_i who uploaded the original encrypted copy of F.

In traditional client-side deduplication, when C wants to upload a file F to S, she first sends a cryptographic hash h of F to S so that it can check for the existence of the file in its storage. Naively adapting this approach to the case of encrypted storage is insecure because a compromised S can easily perform offline brute-force guessing attacks on h if F is predictable. Therefore, instead of h, we let C send a short hash sh = SH(F). Due to the high collision rate of the short hash function SH(), S cannot reliably guess the content of F offline.

Now, suppose that another client C_i previously uploaded E(k_Fi, F_i), using k_Fi as the symmetric file encryption key for F_i, and that sh = sh_i. Our protocol needs to determine whether this happened because F = F_i and, in that case, arrange to have k_Fi securely transferred from C_i to C. We do this by having C and C_i engage in an oblivious key sharing protocol which allows C to receive k_Fi iff F = F_i, and a random key otherwise. We say that C_i plays the role of a checker in this protocol.

The oblivious key sharing protocol could be implemented using generic solutions for secure two-party computation, such as Yao's protocol [28], which express the desired functionality as a boolean circuit and then securely compute it.
Protocols of this type have been demonstrated to be very efficient, even with security against malicious adversaries. In our setting the circuit representation is actually quite compact, but the problem with this approach is that the inputs of the parties are relatively long (say, 288 bits, comprising a full-length hash value and a key), and known protocols require an invocation of oblivious transfer, namely public-key operations, for each input bit. There are known solutions for oblivious transfer extension, which use a preprocessing step to reduce the online computation time of oblivious transfer. However, in our setting the secure computation is run between two clients that do not have any pre-existing relationship, and therefore preprocessing cannot be carried out before the protocol is run.

Our solution for an efficient oblivious key sharing protocol is based on having C and C_i run a PAKE instance with the hash values of their files, namely h and h_i, as their respective input passwords. The protocol results in C_i getting k_i and C getting k_i', which are equal if h = h_i and are independent otherwise. The next step of the protocol, which will be described in Section 5, uses these keys to deliver to C a key k_F which is equal to k_Fi iff k_i = k_i'. C uses this key to encrypt her uploaded file, and S can deduplicate that file if it is equal to the file uploaded by C_i.

Several additional issues need to be solved:

1. A compromised S can perform an online brute-force guessing attack in which it initiates many interactions with C/C_i to identify F/F_i. Each protocol interaction essentially enables a single guess about the content of the relevant file. Cs therefore use a per-file rate limiting strategy to prevent such attacks. Specifically, they set a bound on the maximum number of PAKE requests they will service as a checker or as an uploader for each file (see Section 5.1). Our experiments with real datasets in Section 7 show that this rate limiting does not affect the success of deduplication.

2. What if C_i is not online? If S has a large number of clients, it can find enough online checkers. If there are not enough online checkers, we can let the uploader run PAKE with the currently available checkers and with additional dummy checkers to hide the number of available checkers (see Section 5.2). Again, our experiments in Section 7 show that this does not affect the success of deduplication (since checkers are likely to be found for popular files).

3. How can S do client-side deduplication securely? S will know whether C and C_i have the same file from the results of the PAKE runs. If they do have the same file, S simply informs C that she need not upload her file. In order to prevent online brute-force guessing attacks by C when client-side deduplication takes place, we use the same randomized threshold strategy as [17] (see Section 5.3).

5. DEDUPLICATION PROTOCOL

In this section we describe our deduplication protocol in detail. The data structure maintained by S is shown in Figure 4. Clients who want to upload a file also upload the corresponding short hash sh. Since different files may have the same short hash, a hash sh is associated with the list of different encrypted files whose plaintext has this hash value. S also keeps track of the clients (C_1, C_2, ...) who have uploaded the same encrypted file (therefore, with overwhelming probability, these clients have the same plaintext file encrypted with the same key).

Figure 4: S's record structure (root -> short hashes sh_1, sh_2, ... -> encrypted files E(k_F1, F_1), E(k_F2, F_2), ... -> the clients C_1, C_2, ... who uploaded each file).

Figure 5 shows the basic secure deduplication protocol. Each C_i that uploads a file F_i stores at S a short hash sh_i of that file (Step 0). When a client C wishes to upload a file F, it sends the short hash of this file, sh, to S (Step 1). S identifies the clients {C_i} that have uploaded files with the same short hash (Step 2), and runs the following protocol with each of them (in the more refined version of the system, there is a bound on the number of these protocol invocations). Consider a specific C_i who uploaded a file F_i. It runs a PAKE protocol with C, where their inputs are the cryptographic hash values of F_i and F respectively, and their outputs are keys k_i and k_i' respectively (Step 3). We assume that the keys are sufficiently long that k_i can be divided into a left key and a right key, k_i = k_iL || k_iR, where k_iL is long enough that collisions are unlikely (namely, |k_iL| >> 2 log n, where n is the number of clients), and k_iR can be used for symmetric encryption. The same also holds for k_i'.

At this point, a naive solution would have been for C_i to send to C an encryption of k_Fi under k_i, namely E(k_i, k_Fi). However, this would enable a subtle attack by C to identify whether F was previously uploaded. Therefore, the protocol continues as follows. Each C_i sends to S the pair k_iL, (k_iR + k_Fi), where addition is done in a field whose size will be defined next. C sends to S the pair k_iL' and Enc(pk, k_iR' + r), as well as its public key pk, where Enc is an additively homomorphic encryption scheme and r is chosen randomly (Steps 4-5). After receiving these messages from all C_i's, S looks for an index i for which k_iL' = k_iL. This equality happens iff F_i = F (except with negligible probability). If S finds such a pair, it sends to C the value e = Enc(pk, k_iR + k_Fi) (-) Enc(pk, k_iR' + r) = Enc(pk, k_Fi - r). Otherwise it sends e = Enc(pk, r'), where r' is chosen randomly by S (Step 6). C calculates k_F = Dec(e) + r and sends E(k_F, F) to S (Step 7). Note that if F was already uploaded to the server, then E(k_F, F) is equal to the previously stored encrypted version of the file.

This basic protocol has several issues that are addressed in the following sections: it allows online brute-force guessing attacks (which we prevent using rate limiting in Section 5.1); it does not specify how checkers are selected by S if too many checkers are available (Section 5.2); and, if client-side deduplication is used, it leaks to the client whether F was already uploaded (this is solved in Section 5.3 using a randomized threshold approach).
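The sketch below walks through the arithmetic of Steps 4-7 with the toy additively homomorphic scheme from Section 2.3. The key values are small stand-ins; the real protocol uses full-length keys and an encryption scheme (such as Paillier) whose plaintext space accommodates them.

```python
import random

# Toy additively homomorphic scheme (El Gamal in the exponent); all values
# live in Z_q. Illustrative only.
q, p, g = 11, 23, 4

def keygen():
    sk = random.randrange(1, q); return sk, pow(g, sk, p)

def enc(pk, m):
    r = random.randrange(1, q)
    return (pow(g, r, p), (pow(g, m % q, p) * pow(pk, r, p)) % p)

def hom_sub(c1, c2):              # Enc(m1) (-) Enc(m2) = Enc(m1 - m2)
    return ((c1[0] * pow(c2[0], p - 2, p)) % p,
            (c1[1] * pow(c2[1], p - 2, p)) % p)

def dec(sk, c):
    gm = (c[1] * pow(pow(c[0], sk, p), p - 2, p)) % p
    return next(m for m in range(q) if pow(g, m, p) == gm)

# PAKE gave checker C_i the key k_i and uploader C the key k_i'; each is
# split into a left half (compared by S) and a right half (used below).
k_iR = 6                          # right half of the checker's PAKE key
k_iR_prime = 6                    # right half of the uploader's PAKE key (equal iff F = F_i)
k_Fi = 9                          # the file key that C_i wants to hand over

sk, pk = keygen()                 # uploader C's key pair; pk is sent to S
r = random.randrange(q)           # uploader's blinding value

msg_checker = (k_iR + k_Fi) % q                   # Step 4: C_i -> S (plaintext)
msg_uploader = enc(pk, (k_iR_prime + r) % q)      # Step 5: C -> S (encrypted)

# Step 6a: the left halves matched, so S computes
# e = Enc(k_iR + k_Fi) (-) Enc(k_iR' + r) = Enc(k_Fi - r).
e = hom_sub(enc(pk, msg_checker), msg_uploader)

# Step 7: C decrypts and unblinds; since the files matched, k_F equals k_Fi.
k_F = (dec(sk, e) + r) % q
assert k_F == k_Fi
```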
Theorem 1. The deduplication protocol in Figure 5 implements F_dedup in the malicious model if the PAKE protocol is secure against malicious adversaries, the additively homomorphic encryption is semantically secure and the hash function is modeled as a random oracle.

Proof. We assume there is a trusted party that implements F_dedup in the ideal model. We then construct a simulator and show that the execution of F_dedup in the ideal model is computationally indistinguishable from the execution of the deduplication protocol in the real model. For the purpose of the proof we assume that the PAKE protocol is implemented as an oracle to which the parties send their inputs and from which they receive their outputs. (Note that the PAKE functionality is assumed to be implemented using a fully secure protocol; in fact, it could be implemented using generic secure computation protocols. Therefore this assumption is justified.)

A corrupt uploader C: We first assume that S and the C_i's are honest and construct a simulator for the uploader C. The simulator operates as follows. The simulator records the calls that C makes to the hash function, which is modeled as a random oracle, and records tuples of the form (F_j, h_j, sh_j) of the inputs and outputs of these calls. The simulator first receives a value sh from C. The simulator now obtains C's inputs to the different PAKE invocations. If such an input is equal to an h_j value which was in the same tuple as sh, the simulator invokes F_dedup with F_j and receives a key k_F from F_dedup. Note that an honest C uses the same h value in all PAKE invocations. A corrupt C might use different h values. Also, since sh is a short hash, there might be several (F_j, h_j, sh_j) tuples with the same sh value and therefore several legitimate h values for C to use.

0. When each C_i uploaded its file F_i, it uploaded sh_i and E(k_Fi, F_i).
1. C attempts to upload a file F. It calculates the cryptographic hash h and a short hash sh of F, and sends sh to S.
2. S finds the clients {C_i} who uploaded files {F_i} whose short hash sh_i is equal to sh, and lets them run PAKE protocols with C (note a). C's input is h and C_i's input is h_i.
3. After the invocation of the PAKE protocols, each C_i gets a session key k_i and C gets a set of session keys {k_i'} corresponding to the different C_i's (note b).
4. Each C_i splits k_i into k_iL || k_iR and sends k_iL and (k_iR + k_Fi) to S.
5. C splits each k_i' into k_iL' || k_iR', and sends to S its public key pk, {k_iL'} and {Enc(pk, k_iR' + r)}, where r is chosen randomly.
6. After receiving these messages from all C_i's, S checks whether there is an index i for which k_iL' = k_iL.
(a) If this is the case, then it uses the homomorphic properties of the encryption to send to C the value e = Enc(pk, k_iR + k_Fi) (-) Enc(pk, k_iR' + r) = Enc(pk, k_Fi - r) (note c).
(b) Otherwise it sends e = Enc(pk, r'), where r' is chosen randomly by S.
7. C calculates k_F = Dec(e) + r, and sends E(k_F, F) to S.
8. If S already stores E(k_F, F), it deletes E(k_F, F) and allows C to access the stored file E(k_Fi, F_i). Otherwise, S stores E(k_F, F).

Note a: The PAKE is run via S; there is no direct interaction between C and each C_i. C sends its first message of the PAKE protocol in Step 1.
Note b: With overwhelming probability, k_i' = k_i iff F_i = F.
Note c: If more than a single k_iL' matches, then the server picks one of the matching values at random and computes the encryption using the corresponding k_iR values. This event does not happen if C is honest, but might happen if C uses different h values that agree with the same short hash sh in the PAKE protocols.

Figure 5: The deduplication protocol.

The simulator might therefore learn several k_F values. The simulator stores for each h value a key k_F, which is equal to the corresponding result from F_dedup, or is random if h was not in a tuple with sh. The simulator sends back to C a random key k_i' for each PAKE invocation. It splits this key as k_i' = k_iL' || k_iR', and stores the result together with the h value used in this PAKE invocation and the key k_F that corresponds to h. The simulator then receives from C a set of pairs {k_iL', Enc(pk, k_iR' + r)}. Let K be the set of received left-half values that agree with the left halves of the keys stored by the simulator, and let K' be a maximal subset of K in which each element corresponds to a different h value. If K' is not empty, the simulator chooses an element of K' at random and, using the homomorphic properties, sends back to C the encryption Enc(pk, (k_iR + k_F) - (k_iR' + r)), where k_iR is the right half of the key the simulator stored for that PAKE invocation and k_F is the key that corresponds to the associated h value. Otherwise it sends to C an encryption of a random value. (Note that if F is not identical to any of the files F_i, then the key k_F that C retrieves is random, even if the simulator found a matching left-half value.)

A corrupt C_i: In the ideal model C_i always provides an input to the functionality. We prove security with respect to the relaxed functionality described in Section 3.2, where C_i also learns whether the uploaded file has the same short hash as F_i. The simulator interacts with C_i in the real protocol and should extract C_i's input to the functionality in the ideal model, namely (F_i, k_Fi).
We can assume that C_i has previously sent E(k_Fi, F_i) to the server, and therefore the simulator only needs to find k_Fi and use it to compute F_i. The simulator first observes whether sh matches the short hash of C_i's file. If that is not the case, the simulator provides a random k_Fi to the ideal functionality. Otherwise, the simulator observes C_i's input h_i to the PAKE protocol, its output k_i, and the message (k_iL, k_iR + k_Fi) that C_i sends to the server. If k_iL is different from the left part of k_i, then the simulator provides a random value k_Fi to the functionality. Otherwise, it subtracts the right part of k_i, namely k_iR, from k_iR + k_Fi, and provides the result as the input to the functionality.

A corrupt server: In the protocol the server learns whether the k_iL value matched or not. This must happen in the simulation, too, but this information can be deduced from the output in the ideal model.

We now discuss several extensions to the basic protocol to account for the types of issues we alluded to in Section 4.

5.1 Rate Limiting

A compromised active S can mount online brute-force guessing attacks on either C or C_i. Specifically, if F_i is predictable, S can pretend to be an uploader and send PAKE requests to C_i in an attempt to guess F_i. S can also pretend to be a checker and send PAKE responses to C in order to guess F. Both uploaders and checkers should therefore limit the number of PAKE runs for each file in their respective roles. We use RL_c to denote the rate limit for checkers, i.e., each C_i will process at most RL_c PAKE requests for F_i and will ignore further requests. Similarly, RL_u is the rate limit for uploaders: a C will send at most RL_u PAKE requests when uploading F. This strategy both improves security (shown in Section 6.3) and reduces overhead (shown in Section 7).
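A minimal sketch of the per-file bookkeeping a client might keep to enforce these limits follows; the RL_u and RL_c values match the choices made in Section 7, while the data structure itself is an illustrative assumption.

```python
# Per-file rate limiting on the client side (sketch).
RL_U, RL_C = 30, 70

class RateLimiter:
    def __init__(self):
        self.counts = {}                          # (file_id, role) -> PAKE runs so far

    def allow(self, file_id: str, role: str) -> bool:
        """role is 'uploader' or 'checker'; returns False once the
        per-file budget for that role is exhausted."""
        limit = RL_U if role == "uploader" else RL_C
        used = self.counts.get((file_id, role), 0)
        if used >= limit:
            return False                          # ignore further PAKE requests
        self.counts[(file_id, role)] = used + 1
        return True

rl = RateLimiter()
assert all(rl.allow("file-A", "checker") for _ in range(RL_C))
assert not rl.allow("file-A", "checker")          # request RL_C + 1 is refused
```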

5.2 Checker Selection

On receiving an upload request, S selects matching files in descending order of popularity (in terms of the number of Cs who own the file) and then chooses checkers for each file in ascending order of engagement (in terms of the number of PAKE requests they have serviced so far as checkers for that file). However, the number of PAKE instances an uploader is asked to run for a given file may reveal information about the popularity of that file. To avoid this, S may need to participate in fake PAKE runs with C using random inputs.

More concretely, after receiving an upload request from C, S generates a random value r_1 in [1, RL_u - 1] and sets r_2 = RL_u - r_1. If no stored short hash matches sh, S immediately runs r_1 PAKE instances using random inputs. If S finds an sh_i that matches sh, and there are x files under sh_i of which y have owners whose clients are currently online, then:

- if y < r_1, S runs (r_1 - y) PAKE instances using random inputs; then, for each of the y files, it selects an online checker to run PAKE, and marks the other (x - y) files as unchecked;
- if r_1 <= y, S selects r_1 of the y files based on popularity and, for each of them, selects an online checker to run PAKE; it then marks the other (x - r_1) files as unchecked.

At the end of the above procedure, if no match has been found, S picks a random PAKE instance and asks C to use the corresponding key to encrypt the file she wants to upload. Assume there are z unchecked files after the above procedure:

- if z < r_2, S runs (r_2 - z) PAKE instances using random inputs at random time intervals and, whenever a checker for any of the z files comes online, sends a PAKE request to it;
- if r_2 <= z, S selects r_2 of the z unchecked files based on popularity and, whenever a checker for any of those r_2 files comes online, sends a PAKE request to it.

During the above procedure, whenever a match is found, S stops sending PAKE requests to other checkers. But it still uses random inputs to make sure C has run PAKE r_1 times before uploading and r_2 times after uploading.

5.3 Randomized Threshold

The protocol in Figure 5 is a server-side deduplication scheme. To save bandwidth, we use the randomized threshold approach of [17] to transform it into client-side deduplication. For each file F, S keeps a random threshold t_F (t_F >= 2) and a counter c_F that indicates the number of Cs that have previously uploaded F. In Step 6 of the deduplication protocol:

- In case 6a, if c_Fi < t_Fi, S tells C to upload E(k_F, F) and then deletes it locally. Otherwise, S just informs C that the file is duplicated and there is no need to upload it.
- In case 6b, if there are still unchecked files, S asks C to run further PAKE instances when checkers owning those files come online. If S eventually finds k_jL' = k_jL, S records C as an owner of E(k_Fj, F_j): if c_Fj >= t_Fj, S tells C to switch the key of F to k_Fj and then deletes E(k_F, F) from its local storage; otherwise, S treats F as another copy of F_j and keeps Enc(pk, k_Fj - r). If c_Fj later exceeds t_Fj, S sends Enc(pk, k_Fj - r) to C, tells her to switch the key of F, and then deletes E(k_F, F) from its local storage. If no match is ever found, S stores E(k_F, F) under sh and records C as an owner of E(k_F, F).
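The following sketch shows the threshold bookkeeping on S's side; the bound 2 <= t_F <= 10 follows the evaluation in Section 7, and the rest is an illustrative assumption about how S might implement it.

```python
import random

# Randomized-threshold bookkeeping from Section 5.3 / [17] (sketch).
class ThresholdTable:
    def __init__(self, t_max: int = 10):
        self.t_max = t_max
        self.t = {}          # file -> random threshold t_F
        self.c = {}          # file -> number of clients that uploaded F so far

    def record_upload(self, file_id) -> bool:
        """Register one more owner of file_id and return True iff the server
        may now reveal the duplicate (client-side deduplication)."""
        if file_id not in self.t:
            self.t[file_id] = random.randint(2, self.t_max)   # t_F >= 2
        self.c[file_id] = self.c.get(file_id, 0) + 1
        return self.c[file_id] >= self.t[file_id]

tbl = ThresholdTable()
first_revealing_upload = next(i for i in range(1, 12) if tbl.record_upload("F"))
print("client-side deduplication starts at upload", first_revealing_upload)
```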
6. SECURITY ANALYSIS

In this section, we demonstrate how our deduplication scheme satisfies the security requirements of Section 3.3. Our scheme does not rely on any T P, so it satisfies S1.

6.1 Compromised Clients Only

A can compromise one or more legitimate Cs and attempt online brute-force guessing attacks. From Section 5.2, we know that a C receives r_1 PAKE replies immediately after sending an upload request. Then C is either required to upload the encrypted file or is given access to an already existing encrypted file. If C was asked to upload the encrypted file, it still cannot learn how many other Cs, if any, have already uploaded the same file. If C was not required to upload the encrypted file, or if, after being asked to run a further r_2 PAKE instances, C was informed that the encryption key has changed, then C can only conclude that its file is sufficiently popular (i.e., has more owners than the specific random threshold). Therefore, our protocol satisfies S2.1.

6.2 Compromised Passive Server

Next, we consider the case where A has compromised S such that A can gain access to each message received during the protocol as well as each file stored at S. A can thus attempt offline brute-force guessing attacks on predictable files. But all sensitive data (i.e., files, hashes and keys) are encrypted using a semantically secure encryption scheme with keys known only to the participating Cs. The only thing that leaks information is the short hash, and we can set its length to make it leak as little information as required. As long as the range of the short hash function is sufficiently smaller than the plaintext space from which a file is drawn, A cannot succeed in brute-force guessing via a compromised passive S. Therefore, our scheme satisfies S2.2.

6.3 Compromised Active Server

Finally, we consider the case where A has compromised S and can make it masquerade as as many Cs as she wants (this is equivalent to A compromising S and the required number of legitimate Cs). Such an adversary can attempt to mount online brute-force guessing attacks on predictable files. As introduced in Section 5.1, we use a per-file rate limiting strategy to prevent such attacks. Recall that RL_u (RL_c) is the rate limit for uploaders (checkers), n is the length of the short hash, and Phi is the space from which a predictable file F is drawn. Then A can uniquely identify F only if

|Phi| <= 2^n * x * (RL_u + RL_c),   (1)

where x is the number of Cs who potentially own F.
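A small helper for evaluating Inequality (1) follows; the parameter values match Section 7 (n = 13, RL_u = 30, RL_c = 70), and the example file-space sizes are arbitrary illustrations.

```python
# Inequality (1): an active server (or colluding clients) can uniquely
# identify a predictable file F only if |Phi| <= 2^n * x * (RL_u + RL_c).
def brute_force_feasible(phi_size: int, n: int, x: int, rl_u: int, rl_c: int) -> bool:
    return phi_size <= (2 ** n) * x * (rl_u + rl_c)

n, rl_u, rl_c = 13, 30, 70
print(brute_force_feasible(2 ** 19, n, x=1, rl_u=rl_u, rl_c=rl_c))  # True: space small enough
print(brute_force_feasible(2 ** 30, n, x=1, rl_u=rl_u, rl_c=rl_c))  # False: space too large
```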

As a comparison, DupLESS [3] uses a per-client rate limiting strategy to prevent such online brute-force guessing attacks by a single C. Since the rate limit must be chosen so that it does not degrade usability for a C that legitimately needs to upload a large number of files within a brief time interval (such as when backing up a local file system), the authors of DupLESS choose a large bound on the total number of requests a single C can make during one week. That is to say, for a predictable file drawn from the file space considered there, a compromised C needs at least one week to uniquely identify it. But if A has compromised multiple Cs, this time lag decreases linearly. Recall that a compromised active S can masquerade as any number of Cs.

To prevent a compromised active S from masquerading as multiple Cs, the authors of [25] introduce another T P called an identity server. When Cs first join the system, the identity server is responsible for verifying their identity and issuing credentials to them. But an identity server is hard to deploy in the real world. As far as we know, our scheme is the first deduplication protocol that satisfies S2.3 without the aid of an identity server.

7. SIMULATION

Our approach incurs some extra computation and communication due to the PAKE runs whenever a C wants to upload a file. Furthermore, the use of rate limits and randomized thresholds can impact deduplication effectiveness. In this section, we use realistic simulations to study the effect of various parameter choices in our protocol on its computation/communication overhead and deduplication effectiveness.

Datasets. We consider two types of storage environments. The first consists predominantly of media files, such as audio and video files. We did not have access to a dataset from such an environment. Instead, we use a dataset comprising Android application prevalence data to represent an environment with media files. This is based on the assumption that the popularity of Android applications is likely to be similar to that of media files: both are created and published by a small number of authors (artists/developers), made available on online stores or other distribution sites, and acquired by consumers either for free or for a fee. We call this the media dataset. We use a publicly available dataset consisting of data collected from Android devices.
For each device, the dataset identifies the set of (anonymized) application identifiers found on that device. We treat each application identifier as a file and the presence of an app on a device as an upload request to add the corresponding file to the storage. In total, this dataset contains far more upload requests than distinct files.

The second type of storage environment is that found in enterprise backup systems. We use data gathered by the Debian Popularity Contest to approximate such an environment. The popularity-contest package on a Debian device regularly reports the list of packages installed on that device. The resulting data consists of a list of Debian packages along with the number of devices that reported each package. We took a snapshot of this data on Nov 27 and, from it, generated our enterprise dataset, which again contains far more upload requests (Debian package installations) than distinct files (unique Debian packages).

Figure 6 shows the file popularity distribution (i.e., the number of upload requests for each file) on a logarithmic scale for both datasets. We map each dataset to a stream of upload requests by generating the requests in random order. Specifically, for a file that has x copies, we generate x upload requests for it at random time intervals.

Figure 6: File popularity in both datasets.

Parameters. To facilitate comparison with DupLESS [3], we set |Phi| to the size of the predictable-file space considered there. We then set the length of the short hash n to 13. Inequality (1) in Section 6.3 then implies a lower bound of 100 on (RL_u + RL_c) for A to be able to mount a successful online brute-force guessing attack that uniquely identifies a predictable file held by only one C. We use these parameter values in our simulations.

We measure overhead as the average number of PAKE runs,

mu = (total number of PAKE runs) / (total number of upload requests).   (2)

(We do not count the fake PAKE runs by S described in Section 5.2, since we are interested in estimating the average number of real PAKE runs; in a full implementation, an uploader will always take part in r_1 PAKE runs, where r_1 is in [1, RL_u - 1].)

We measure deduplication effectiveness using the deduplication percentage introduced in Section 2.1. We assume that all files are of equal size, so that the deduplication percentage rho is

rho = 1 - (number of all files in storage) / (total number of upload requests).   (3)

First we assume that all Cs are online during the simulation and study the impact of rate limits. Once we choose specific rate limits, we evaluate how absent Cs and the use of randomized thresholds impact deduplication effectiveness.
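For illustration, the following heavily simplified simulation computes mu and rho for a synthetic workload. It assumes every client is online, ignores checker-side limits and randomized thresholds, and checks each upload against at most RL_u stored files that share its short hash, most popular first; it is not the simulator used for the reported results, and the workload is invented.

```python
import random
from collections import defaultdict

RL_U, SHORT_HASH_BITS = 30, 13

def short_hash(file_id: int) -> int:
    return hash(("sh", file_id)) % (2 ** SHORT_HASH_BITS)

def simulate(requests):
    stored = defaultdict(list)     # short hash -> distinct file ids present in storage
    copies = defaultdict(int)      # file id -> number of owners (popularity)
    pake_runs = stored_files = 0
    for f in requests:
        sh = short_hash(f)
        bucket = sorted(stored[sh], key=lambda g: -copies[g])[:RL_U]
        pake_runs += len(bucket)                 # one PAKE run per candidate checker
        copies[f] += 1                           # f gains one more owner either way
        if f in bucket:
            continue                             # deduplicated against an existing copy
        stored_files += 1                        # a new physical copy is stored
        if f not in stored[sh]:
            stored[sh].append(f)
    mu = pake_runs / len(requests)               # average number of PAKE runs, Eq. (2)
    rho = 1 - stored_files / len(requests)       # deduplication percentage, Eq. (3)
    return mu, rho

# Skewed synthetic workload: a few very popular files, many rare ones.
reqs = [random.randint(1, random.choice([10, 10_000])) for _ in range(50_000)]
random.shuffle(reqs)
print(simulate(reqs))
```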

Rate limiting. Having selected RL_u + RL_c to be 100, we now examine how the average number of PAKE runs and the deduplication effectiveness change when this total is divided in different ways between RL_u and RL_c. Figure 7 shows the average number of PAKE runs resulting from different values of RL_u (and hence RL_c) in both datasets. We also ran the simulation without any rate limits, which led to average numbers of PAKE runs in both the media and the enterprise dataset that are significantly larger than the results with rate limiting. Figure 8 shows the deduplication percentage resulting from different rate limit choices in both datasets. We see that setting RL_u (RL_c) to 30 (70) maximizes the deduplication percentage, to values (97.58% in the media dataset) close to perfect deduplication (97.59%) in both datasets. We conclude that the use of rate limits not only improves security, but can also reduce the overhead without negatively impacting deduplication effectiveness.

Figure 7: Average number of PAKE runs vs. rate limits.

Figure 8: Deduplication percentage vs. rate limits.

Randomized threshold and offline rate. The possibility of some Cs being offline may adversely impact deduplication effectiveness. For example, as we described in Sections 5.2 and 5.3, if S finds that the upload request from C does not match any online checker available at the time of the upload, it will ask C to upload its encrypted file. S will continue to ask C to run r_2 PAKE instances afterwards. If one of these finds a checker whose existing file F matches C's file but c_F never exceeds t_F, S can no longer deduplicate C's copy of F. Had the original uploader of F been online at the time C wanted to upload its file, S would have successfully conducted server-side deduplication of F. In other words, the possibility of clients being offline will impact deduplication effectiveness. To estimate this impact, we assign an offline rate to each C as its probability of being offline during one upload request. Using the chosen rate limit values (RL_u = 30 and RL_c = 70) and fixing the upper limit of the random threshold (i.e., 2 <= t_F <= 10), we measured the deduplication percentage while varying the offline rate. The results for both datasets are shown in Figure 9. They show that the deduplication percentage remains high even when the offline rate is as high as 90%.

Figure 9: Deduplication percentage vs. offline rates.

Evolution of deduplication effectiveness. The deduplication percentage continues to increase as more files are added to the storage. Figure 10 shows that the deduplication percentage achieved by our scheme meets the realistic expectation (95%) very early on, after receiving only an initial portion of the upload requests in the media and enterprise datasets. So our system satisfies F1. However, our use of rate limits may result in a somewhat slower increase compared to schemes with perfect deduplication.
Figure 11 shows how the difference between perfect deduplication and the deduplication percentage achieved by our scheme evolves as more files are added to the storage. Figure 13 (in Appendix C) shows the secondary deviation. The difference becomes stable as more files are uploaded to the storage. This phenomenon can be explained by Zipf's law [29]. Since uploaders set a rate limit on their requests and S always selects files in descending order of popularity, our system is similar to web proxy caching, which provides temporary storage for popular web pages. Breslau et al. observe that page request distributions follow Zipf's law: the frequency of a request for the m-th most popular page is (1/m^alpha) / (sum_{i=1}^{N} 1/i^alpha), where N is the size of the cache and alpha is the exponent characterising the distribution [8]. As a result, most of the requested pages can be found in the cache. In our case, most upload requests can find a matching file within the rate limit if the file has been uploaded already.

Figure 10: Deduplication percentage vs. number of upload requests.

Figure 11: Deviation from perfect deduplication vs. number of upload requests.

8. PERFORMANCE EVALUATION

8.1 Prototype implementation

Our prototype consists of two parts: (1) a server program which simulates S and (2) a client program which simulates C (performing file uploading/downloading, encryption/decryption, and assisting S in deduplication). We use Node.js for the implementation of both parts and Redis for the implementation of S's data structure. Furthermore, we use SHA-256 as the cryptographic hash function and AES with 256-bit keys as the symmetric encryption scheme, both of which are provided by the Crypto module in Node.js. We truncate the cryptographic hash to get the short hash. We use the GNU multiple precision arithmetic library to implement the public key operations.

8.2 Test setting and methodology

We run the server-side program on an Amazon EC2 instance (Intel Xeon with 1 CPU at 2.5 GHz) and the client-side program on an Intel Core i7 machine. We measure the running time using the Date module in JavaScript and measure the bandwidth usage using tcpdump. As the downloading phase in our protocol simply downloads an encrypted file, we only consider the uploading phase. We set the length of the short hash to 13 and RL_u to 30. We consider the worst case, in which the uploader runs PAKE with 30 checkers. We therefore simulate the uploading phase of our protocol as follows:

1. An uploader C sends the short hash of the file it wants to upload to the server S.
2. S sends requests to 30 checkers C_i and lets them run PAKE with C.
3. S waits to get the responses of all instances back from C and the {C_i}.
4. S chooses one instance and sends the result to C.
5. C uses the resulting key to encrypt its file with AES and uploads it to S.

We measure both the running time and the bandwidth usage during the whole procedure above. We compare the results to two baselines: (1) simply uploading a file without encryption and (2) uploading a file with AES encryption.
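An illustrative stand-in for the E(k_F, F) step measured here: the prototype uses the Node.js Crypto module with AES-256, while the Python sketch below assumes AES-256-GCM with a nonce derived from (k_F, F) so that identical files encrypted under the same key yield identical ciphertexts, as Step 8 of the protocol expects. The paper itself does not specify the mode or nonce handling.

```python
import hashlib, os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def encrypt_file(k_f: bytes, plaintext: bytes) -> bytes:
    # Deterministic nonce: same (key, file) -> same ciphertext (assumption).
    nonce = hashlib.sha256(k_f + plaintext).digest()[:12]
    return nonce + AESGCM(k_f).encrypt(nonce, plaintext, None)

def decrypt_file(k_f: bytes, blob: bytes) -> bytes:
    return AESGCM(k_f).decrypt(blob[:12], blob[12:], None)

k_f = os.urandom(32)                                  # 256-bit key k_F agreed via the protocol
blob = encrypt_file(k_f, b"file contents")
assert decrypt_file(k_f, blob) == b"file contents"
assert blob == encrypt_file(k_f, b"file contents")    # identical inputs, identical ciphertext
```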

As in [3], we repeat our experiment using files of size 2^{2i} KB for i in {0, 1, ..., 8}, which gives a file size range of 1 KB to 64 MB. For each file size, we upload the file 100 times and report the mean values. For files that are larger than the computer's buffer, we perform loading, encryption and uploading at the same time by piping the data stream. As a result, uploading encrypted files takes almost the same amount of time as uploading plain files.

8.3 Results

Figure 12 reports the uploading time and bandwidth usage of our protocol compared to the two baselines. For files that are smaller than 1 MB, the extra overhead introduced by our deduplication protocol is relatively high. For example, it takes 587 ms (2,436 bytes of bandwidth) to upload a 1 KB plain file and 594 ms (2,585 bytes) to upload a 1 KB encrypted file, while uploading the same file with our protocol takes noticeably longer and uses more bandwidth. However, the extra overhead introduced by our protocol is independent of the file size, and it becomes negligible when the file is large enough. So our system meets F2.

9. DISCUSSION

Incentives: In our scheme, Cs have to run several PAKE instances as both uploaders and checkers. This imposes a cost on each C. S is the direct beneficiary of the protocol. Cs may indirectly benefit from deduplication in that the ability to deduplicate effectively makes the storage system more efficient and can thus potentially lower the cost of using the storage incurred by each C. Nevertheless, it is desirable to have more direct incentive mechanisms to encourage Cs to take part in PAKE checks. For example, if a file uploaded by a C is found to be shared by other Cs, S could reward the Cs owning that file by giving them small increases in their respective storage quotas.

Deduplication effectiveness: We can improve deduplication effectiveness by introducing additional checks. For example, an uploader can indicate the (approximate) file size (which will be revealed anyway) so that S can limit the selection of checkers to those whose files are of a similar size. Similarly, S can keep track of similarities between Cs, based on the number of files they share, and use this information when selecting checkers by prioritizing checkers who are similar to the uploader. Nevertheless, as discussed in Section 7, our scheme surpasses what is considered a realistic level of deduplication from the very beginning. Whitehouse [27] reported that, when selecting a deduplication scheme, enterprise administrators rated considerations such as ease of deployment and use as more important than the deduplication ratio. Therefore, we argue that the small sacrifice in deduplication ratio is offset by the significant advantage of ensuring user privacy without having to use independent third-party servers.

Block-level deduplication: Our scheme can be applied to both file-level and block-level deduplication, but block-level deduplication will incur more overhead.

10. RELATED WORK

One technique proposed to enable deduplication of encrypted data is convergent encryption (CE) [12], which derives the file encryption key deterministically from the file contents. As a result, the same file will produce the same ciphertext, so that S can deduplicate ciphertexts. But if A compromises S, it can easily perform an offline brute-force guessing attack. Bellare et al.
recently proposed message-locked encryption (MLE), which uses a semantically secure encryption scheme but produces a tag in a deterministic way [4], so it still suffers from the same attack.

Puzio et al. propose a block-level deduplication scheme that uses a third party T P [23]. C first encrypts each block with CE and then sends the ciphertexts to the T P, who adds an additional layer of encryption to each block. During file retrieval, blocks are first decrypted by the T P and then sent back to C. If A compromises both S and some Cs, she can perform an online brute-force guessing attack by letting multiple Cs upload all possible files and monitoring S's storage.

Stanek et al. propose a scheme that only deduplicates popular files [25]. Cs encrypt their files with two layers of encryption: the inner layer is obtained through CE and the outer layer through a semantically secure threshold encryption scheme. S can decrypt the outer layer of a file iff the number of Cs who have uploaded this file reaches the threshold. They introduce a T P as an index server to keep track of the popularity of each file, but the index server has to know the content or hash of each file. In addition, they introduce another T P as an identity server to prevent online brute-force guessing attacks by multiple Cs. Both of these solutions [23, 25] are vulnerable to offline brute-force guessing attacks if A compromises the T Ps.

Bellare et al. propose DupLESS, which uses an additional secret in the key generation process of CE [3]. This secret is identical for all Cs and is provided by a T P. To prevent the T P from conducting offline brute-force guessing attacks, Cs run an oblivious PRF protocol with the T P to generate the key, which guarantees that Cs who input the same file will get the same encryption key without revealing their files. But if A compromises both S and the T P, S can get the secret from the T P and the scheme is reduced to normal CE.

Duan proposes a distributed architecture in which the file encryption key is generated by a threshold number of all Cs [13]. But this scheme is only suitable for peer-to-peer scenarios: a threshold number of Cs must be online and interact with one another to generate the key. Moreover, a malicious S can easily generate more than a threshold number of Cs.

In Table 1, we summarize the resilience of these schemes with respect to the design goals from Section 3.3.

Table 1: Resilience of deduplication schemes, comparing [12]/[4], [23], [25], [3], [13] and our work against a compromised C, a compromised passive S, a compromised active S colluding with Cs, compromised T Ps, and a compromised S together with the T Ps.

Acknowledgments

This work was supported in part by the Academy of Finland project Cloud Security Services (283135).

Acknowledgments

This work was supported in part by the Academy of Finland project Cloud Security Services (283135).

Figure 12: Time (left) and bandwidth usage (right) vs. file size, for our protocol, uploading encrypted files, and uploading plain files.

11. REFERENCES

[1] M. Abdalla and D. Pointcheval. Simple password-based encrypted key exchange protocols. In A. Menezes, editor, Topics in Cryptology - CT-RSA 2005, The Cryptographers' Track at the RSA Conference 2005, San Francisco, CA, USA, February 14-18, 2005, Proceedings, volume 3376 of Lecture Notes in Computer Science. Springer.
[2] B. Barak, R. Canetti, Y. Lindell, R. Pass, and T. Rabin. Secure computation without authentication. In V. Shoup, editor, Advances in Cryptology - CRYPTO 2005, volume 3621 of Lecture Notes in Computer Science. Springer Berlin Heidelberg.
[3] M. Bellare, S. Keelveedhi, and T. Ristenpart. DupLESS: Server-aided encryption for deduplicated storage. In Proceedings of the 22nd USENIX Conference on Security, SEC '13, Berkeley, CA, USA. USENIX Association.
[4] M. Bellare, S. Keelveedhi, and T. Ristenpart. Message-locked encryption and secure deduplication. In T. Johansson and P. Nguyen, editors, Advances in Cryptology - EUROCRYPT 2013, volume 7881 of Lecture Notes in Computer Science. Springer Berlin Heidelberg.
[5] M. Bellare, D. Pointcheval, and P. Rogaway. Authenticated key exchange secure against dictionary attacks. In Preneel [22].
[6] S. Bellovin and M. Merritt. Encrypted key exchange: password-based protocols secure against dictionary attacks. In Research in Security and Privacy, 1992 IEEE Computer Society Symposium on, pages 72-84, May 1992.
[7] V. Boyko, P. D. MacKenzie, and S. Patel. Provably secure password-authenticated key exchange using Diffie-Hellman. In Preneel [22].
[8] L. Breslau, P. Cao, L. Fan, G. Phillips, and S. Shenker. Web caching and Zipf-like distributions: evidence and implications. In INFOCOM '99, Eighteenth Annual Joint Conference of the IEEE Computer and Communications Societies, volume 1, Mar. 1999.
[9] R. Canetti, D. Dachman-Soled, V. Vaikuntanathan, and H. Wee. Efficient password authenticated key exchange via oblivious transfer. In M. Fischlin, J. Buchmann, and M. Manulis, editors, Public Key Cryptography - PKC 2012, volume 7293 of Lecture Notes in Computer Science. Springer Berlin Heidelberg.
[10] R. Canetti, S. Halevi, J. Katz, Y. Lindell, and P. D. MacKenzie. Universally composable password-based key exchange. In R. Cramer, editor, Advances in Cryptology - EUROCRYPT 2005, 24th Annual International Conference on the Theory and Applications of Cryptographic Techniques, Aarhus, Denmark, May 22-26, 2005, Proceedings, volume 3494 of Lecture Notes in Computer Science. Springer.
[11] W. Diffie and M. E. Hellman. New directions in cryptography. IEEE Transactions on Information Theory, 22(6).
[12] J. Douceur, A. Adya, W. Bolosky, P. Simon, and M. Theimer. Reclaiming space from duplicate files in a serverless distributed file system. In Proceedings of the 22nd International Conference on Distributed Computing Systems.
[13] Y. Duan. Distributed key generation for encrypted deduplication: Achieving the strongest privacy. In Proceedings of the 6th Edition of the ACM Workshop on Cloud Computing Security, CCSW '14, pages 57-68, New York, NY, USA. ACM.
[14] M. Dutch. Understanding data deduplication ratios. SNIA Data Management Forum. /l3pm284d8r1s.pdf.
[15] T. ElGamal. A public key cryptosystem and a signature scheme based on discrete logarithms. In G. Blakley and D. Chaum, editors, Advances in Cryptology, volume 196 of Lecture Notes in Computer Science. Springer Berlin Heidelberg.
[16] O. Goldreich and Y. Lindell. Session-key generation using human passwords only. In J. Kilian, editor, Advances in Cryptology - CRYPTO 2001, volume 2139 of Lecture Notes in Computer Science. Springer Berlin Heidelberg.
[17] D. Harnik, B. Pinkas, and A. Shulman-Peleg. Side channels in cloud services: Deduplication in cloud storage. IEEE Security & Privacy, 8(6):40-47, Nov.
[18] I. Jeong, J. Katz, and D. Lee. One-round protocols for two-party authenticated key exchange. In M. Jakobsson, M. Yung, and J. Zhou, editors, Applied Cryptography and Network Security, volume 3089 of Lecture Notes in Computer Science. Springer Berlin Heidelberg.
[19] J. Katz and V. Vaikuntanathan. Round-optimal password-based authenticated key exchange. Journal of Cryptology, 26(4).
[20] D. T. Meyer and W. J. Bolosky. A study of practical deduplication. In Proceedings of the 9th USENIX Conference on File and Storage Technologies, FAST '11, pages 1-1, Berkeley, CA, USA. USENIX Association.
[21] P. Paillier. Public-key cryptosystems based on composite degree residuosity classes. In J. Stern, editor, Advances in Cryptology - EUROCRYPT '99, volume 1592 of Lecture Notes in Computer Science. Springer Berlin Heidelberg.
[22] B. Preneel, editor. Advances in Cryptology - EUROCRYPT 2000, International Conference on the Theory and Application of Cryptographic Techniques, Bruges, Belgium, May 14-18, 2000, Proceedings, volume 1807 of Lecture Notes in Computer Science. Springer.
[23] P. Puzio, R. Molva, M. Önen, and S. Loureiro. ClouDedup: Secure deduplication with encrypted data for cloud storage. In Proceedings of the 2013 IEEE International Conference on Cloud Computing Technology and Science - Volume 01, CLOUDCOM '13, Washington, DC, USA. IEEE Computer Society.
[24] S. Quinlan and S. Dorward. Venti: A new approach to archival storage. In Proceedings of the 1st USENIX Conference on File and Storage Technologies, FAST '02, pages 7-7, Berkeley, CA, USA. USENIX Association.
[25] J. Stanek, A. Sorniotti, E. Androulaki, and L. Kencl. A secure data deduplication scheme for cloud storage. In N. Christin and R. Safavi-Naini, editors, Financial Cryptography and Data Security, Lecture Notes in Computer Science. Springer Berlin Heidelberg.
[26] J. M. Wendt. Getting Real About Deduplication Ratios. getting-real-about-deduplication.html.
[27] L. Whitehouse. Understanding data deduplication ratios in backup systems. TechTarget article, May. Understanding-data-deduplication-ratios-in-backup-systems.
[28] A. C.-C. Yao. How to generate and exchange secrets. In 27th Annual Symposium on Foundations of Computer Science, Oct. 1986.
[29] G. K. Zipf. Relative frequency as a determinant of phonetic change. Harvard Studies in Classical Philology, pages 1-95.

Appendix A: Notation Table

A table of notations is shown in Table 2.

Notation     Description
Entities
  C            Client
  S            Server
  A            Adversary
  TP           Third party
Cryptographic notations
  E()          Symmetric-key encryption
  D()          Symmetric-key decryption
  k            Symmetric encryption/decryption key
  Enc()        Additively homomorphic encryption
  Dec()        Additively homomorphic decryption
  ⊕            Additively homomorphic addition
  ⊖            Additively homomorphic subtraction
  H()          Cryptographic hash function
  h            Cryptographic hash
  SH()         Short hash function
  sh           Short hash
  PAKE         Password authenticated key exchange
  F_pake       Ideal functionality of PAKE
  F_dedup      Ideal functionality of the deduplication protocol
Parameters
  F            File
  Φ            File space for a predictable file
  n            Length of the short hash
  RL_u         Rate limit by uploaders
  RL_c         Rate limit by checkers
  t_F          Random threshold for a file
  c_F          Counter for a file

Table 2: Summary of notations

Appendix B: Secondary deviation from perfect deduplication

Figure 13: Secondary deviation from perfect deduplication vs. number of upload requests, for the media dataset and the enterprise dataset.
