Ensuring Data Storage Security in Cloud Computing

Cong Wang, Qian Wang, and Kui Ren
Department of ECE, Illinois Institute of Technology
Email: {cwang, qwang, kren}@ece.iit.edu

Wenjing Lou
Department of ECE, Worcester Polytechnic Institute
Email: wjlou@ece.wpi.edu

Abstract

Cloud Computing has been envisioned as the next-generation architecture of IT enterprise. In contrast to traditional solutions, where IT services are under proper physical, logical and personnel controls, Cloud Computing moves the application software and databases to large data centers, where the management of the data and services may not be fully trustworthy. This unique attribute, however, poses many new security challenges which have not been well understood. In this article, we focus on cloud data storage security, which has always been an important aspect of quality of service. To ensure the correctness of users' data in the cloud, we propose an effective and flexible distributed scheme with two salient features, in contrast to its predecessors. By utilizing the homomorphic token with distributed verification of erasure-coded data, our scheme achieves the integration of storage correctness insurance and data error localization, i.e., the identification of misbehaving server(s). Unlike most prior works, the new scheme further supports secure and efficient dynamic operations on data blocks, including: data update, delete and append. Extensive security and performance analysis shows that the proposed scheme is highly efficient and resilient against Byzantine failure, malicious data modification attack, and even server colluding attacks.

I. INTRODUCTION

Several trends are opening up the era of Cloud Computing, an Internet-based development and use of computer technology. Ever cheaper and more powerful processors, together with the software as a service (SaaS) computing architecture, are transforming data centers into pools of computing service on a huge scale. Increasing network bandwidth and reliable yet flexible network connections make it even possible for users to subscribe to high-quality services from data and software that reside solely on remote data centers. Moving data into the cloud offers great convenience to users, since they no longer have to deal with the complexities of direct hardware management. The pioneers among Cloud Computing vendors, Amazon Simple Storage Service (S3) and Amazon Elastic Compute Cloud (EC2) [1], are both well-known examples. While these Internet-based online services do provide huge amounts of storage space and customizable computing resources, this computing platform shift is also eliminating the responsibility of local machines for data maintenance. As a result, users are at the mercy of their cloud service providers for the availability and integrity of their data. The recent downtime of Amazon's S3 is such an example [2]. From the perspective of data security, which has always been an important aspect of quality of service, Cloud Computing inevitably poses new challenging security threats for a number of reasons. First, traditional cryptographic primitives for the purpose of data security protection cannot be directly adopted, due to the users' loss of control over their data under Cloud Computing. Therefore, verification of correct data storage in the cloud must be conducted without explicit knowledge of the whole data. Considering the various kinds of data each user stores in the cloud and the demand for long-term continuous assurance of data safety, the problem of verifying the correctness of data storage in the cloud becomes even more challenging. Second, Cloud Computing is not just a third-party data warehouse.
The data stored in the cloud may be frequently updated by the users, including insertion, deletion, modification, appending, reordering, etc. Ensuring storage correctness under dynamic data update is hence of paramount importance. However, this dynamic feature also makes traditional integrity insurance techniques futile and entails new solutions. Last but not least, the deployment of Cloud Computing is powered by data centers running in a simultaneous, cooperative and distributed manner. An individual user's data is redundantly stored in multiple physical locations to further reduce data integrity threats. Therefore, distributed protocols for storage correctness assurance will be of the utmost importance in achieving a robust and secure cloud data storage system in the real world. However, this important area remains to be fully explored in the literature. Recently, the importance of ensuring remote data integrity has been highlighted by the following research works [3]-[7]. These techniques, while useful for ensuring storage correctness without having users possess their data, cannot address all the security threats in cloud data storage, since they all focus on the single-server scenario and most of them do not consider dynamic data operations. As a complementary approach, researchers have also proposed distributed protocols [8]-[10] for ensuring storage correctness across multiple servers or peers. Again, none of these distributed schemes is aware of dynamic data operations. As a result, their applicability in cloud data storage can be drastically limited. In this paper, we propose an effective and flexible distributed scheme with explicit dynamic data support to ensure the correctness of users' data in the cloud. We rely on erasure-correcting code in the file distribution preparation to provide redundancy and guarantee data dependability. This construction drastically reduces the communication and storage overhead as compared to the traditional replication-based file
distribution techniques. By utilizing the homomorphic token with distributed verification of erasure-coded data, our scheme achieves storage correctness insurance as well as data error localization: whenever data corruption has been detected during the storage correctness verification, our scheme can almost guarantee the simultaneous localization of data errors, i.e., the identification of the misbehaving server(s). Our work is among the first few in this field to consider distributed data storage in Cloud Computing. Our contribution can be summarized in the following three aspects: 1) Compared to many of its predecessors, which only provide binary results about the storage state across the distributed servers, the challenge-response protocol in our work further provides the localization of data errors. 2) Unlike most prior works for ensuring remote data integrity, the new scheme supports secure and efficient dynamic operations on data blocks, including: update, delete and append. 3) Extensive security and performance analysis shows that the proposed scheme is highly efficient and resilient against Byzantine failure, malicious data modification attack, and even server colluding attacks. The rest of the paper is organized as follows. Section II introduces the system model, adversary model, our design goals and notation. We then provide the detailed description of our scheme in Sections III and IV. Section V gives the security analysis and performance evaluation, followed by Section VI, which overviews the related work. Finally, Section VII gives the concluding remarks of the whole paper.

II. PROBLEM STATEMENT

A. System Model

A representative network architecture for cloud data storage is illustrated in Figure 1. Three different network entities can be identified as follows:
- User: users, who have data to be stored in the cloud and rely on the cloud for data computation, consist of both individual consumers and organizations.
- Cloud Service Provider (CSP): a CSP, who has significant resources and expertise in building and managing distributed cloud storage servers, owns and operates live Cloud Computing systems.
- Third Party Auditor (TPA): an optional TPA, who has expertise and capabilities that users may not have, is trusted to assess and expose risks of cloud storage services on behalf of the users upon request.

[Fig. 1: Cloud data storage architecture. Users exchange data and security messages with the cloud storage servers through the cloud service provider; an optional third party auditor exchanges security messages on the users' behalf.]

In cloud data storage, a user stores his data through a CSP into a set of cloud servers, which run in a simultaneous, cooperative and distributed manner. Data redundancy can be employed with the technique of erasure-correcting code to further tolerate faults or server crashes as the user's data grows in size and importance. Thereafter, for application purposes, the user interacts with the cloud servers via the CSP to access or retrieve his data. In some cases, the user may need to perform block-level operations on his data. The most general forms of these operations we are considering are block update, delete, insert and append. As users no longer possess their data locally, it is of critical importance to assure users that their data are being correctly stored and maintained. That is, users should be equipped with security means so that they can obtain continuous correctness assurance of their stored data even without the existence of local copies. In cases where users do not necessarily have the time, feasibility or resources to monitor their data, they can delegate this task to an optional trusted TPA of their respective choice.
In our model, we assume that the point-to-point communication channels between each cloud server and the user are authenticated and reliable, which can be achieved in practice with little overhead. Note that we do not address the issue of data privacy in this paper, since in Cloud Computing data privacy is orthogonal to the problem we study here.

B. Adversary Model

Security threats faced by cloud data storage can come from two different sources. On the one hand, a CSP can be self-interested, untrusted and possibly malicious. Not only may it desire to move data that has not been or is rarely accessed to a lower tier of storage than agreed upon for monetary reasons, but it may also attempt to hide data loss incidents caused by management errors, Byzantine failures and so on. On the other hand, there may also exist an economically motivated adversary who has the capability to compromise a number of cloud data storage servers over different time intervals and is subsequently able to modify or delete users' data while remaining undetected by the CSP for a certain period. Specifically, we consider two types of adversary with different levels of capability in this paper:

Weak Adversary: The adversary is interested in corrupting the user's data files stored on individual servers. Once a server is compromised, the adversary can pollute the original data files by modifying them or introducing its own fraudulent data to prevent the original data from being retrieved by the user.

Strong Adversary: This is the worst-case scenario, in which we assume that the adversary can compromise all the storage servers, so that it can intentionally modify the data files as long as they remain internally consistent. In fact, this is equivalent to the case where all servers collude to hide a data loss or corruption incident.
C. Design Goals

To ensure the security and dependability of cloud data storage under the aforementioned adversary model, we aim to design efficient mechanisms for dynamic data verification and operation that achieve the following goals: (1) Storage correctness: to ensure users that their data are indeed stored appropriately and kept intact at all times in the cloud. (2) Fast localization of data error: to effectively locate the malfunctioning server when data corruption has been detected. (3) Dynamic data support: to maintain the same level of storage correctness assurance even if users modify, delete or append their data files in the cloud. (4) Dependability: to enhance data availability against Byzantine failures, malicious data modification and server colluding attacks, i.e., minimizing the effect brought by data errors or server failures. (5) Lightweight: to enable users to perform storage correctness checks with minimum overhead.

D. Notation and Preliminaries

F: the data file to be stored. We assume that F can be denoted as a matrix of m equal-sized data vectors, each consisting of l blocks. Data blocks are all well represented as elements in the Galois field GF(2^p) for p = 8 or 16.
A: the dispersal matrix used for Reed-Solomon coding.
G: the encoded file matrix, which includes a set of n = m + k vectors, each consisting of l blocks.
f_key(.): a pseudorandom function (PRF), defined as f : {0,1}* x key -> GF(2^p).
phi_key(.): a pseudorandom permutation (PRP), defined as phi : {0,1}^{log2(l)} x key -> {0,1}^{log2(l)}.
ver: a version number bound with the index of individual blocks, which records the number of times the block has been modified. Initially, ver is 0 for all data blocks.
s_{ij}^{ver}: the seed for the PRF, which depends on the file name, the block index i, the server position j, as well as the optional block version number ver.

III. ENSURING CLOUD DATA STORAGE

In a cloud data storage system, users store their data in the cloud and no longer possess the data locally. Thus, the correctness and availability of the data files being stored on the distributed cloud servers must be guaranteed. One of the key issues is to effectively detect any unauthorized data modification and corruption, possibly due to server compromise and/or random Byzantine failures. Besides, in the distributed case, when such inconsistencies are successfully detected, finding which server the data error lies in is also of great significance, since it can serve as the first step toward fast recovery from storage errors. To address these problems, our main scheme for ensuring cloud data storage is presented in this section. The first part of the section is devoted to a review of basic tools from coding theory that are needed in our scheme for file distribution across cloud servers. Then, the homomorphic token is introduced. The token computation function we are considering belongs to a family of universal hash functions [11], chosen to preserve the homomorphic properties, which can be perfectly integrated with the verification of erasure-coded data [8] [12]. Subsequently, it is shown how to derive a challenge-response protocol for verifying the storage correctness as well as identifying misbehaving servers. Finally, the procedure for file retrieval and error recovery based on erasure-correcting code is outlined.

A. File Distribution Preparation

It is well known that erasure-correcting codes may be used to tolerate multiple failures in distributed storage systems. In cloud data storage, we rely on this technique to disperse the data file F redundantly across a set of n = m + k distributed servers.
An (m + k, m) Reed-Solomon erasure-correcting code is used to create k redundancy parity vectors from m data vectors in such a way that the original m data vectors can be reconstructed from any m out of the m + k data and parity vectors. By placing each of the m + k vectors on a different server, the original data file can survive the failure of any k of the m + k servers without any data loss, with a space overhead of k/m. For support of efficient sequential I/O to the original file, our file layout is systematic, i.e., the unmodified m data file vectors together with the k parity vectors are distributed across m + k different servers. Let F = (F_1, F_2, ..., F_m) and F_i = (f_{1i}, f_{2i}, ..., f_{li})^T (i in {1,...,m}), where l <= 2^p - 1. Note that all these blocks are elements of GF(2^p). The systematic layout with parity vectors is achieved with the information dispersal matrix A, derived from an m x (m + k) Vandermonde matrix [13]:

\begin{pmatrix}
1 & 1 & \cdots & 1 & 1 & \cdots & 1 \\
\beta_1 & \beta_2 & \cdots & \beta_m & \beta_{m+1} & \cdots & \beta_n \\
\vdots & \vdots & & \vdots & \vdots & & \vdots \\
\beta_1^{m-1} & \beta_2^{m-1} & \cdots & \beta_m^{m-1} & \beta_{m+1}^{m-1} & \cdots & \beta_n^{m-1}
\end{pmatrix},

where the beta_j (j in {1,...,n}) are distinct elements randomly picked from GF(2^p). After a sequence of elementary row transformations, the desired matrix A can be written as

A = (I | P) = \begin{pmatrix}
1 & 0 & \cdots & 0 & p_{11} & p_{12} & \cdots & p_{1k} \\
0 & 1 & \cdots & 0 & p_{21} & p_{22} & \cdots & p_{2k} \\
\vdots & \vdots & & \vdots & \vdots & \vdots & & \vdots \\
0 & 0 & \cdots & 1 & p_{m1} & p_{m2} & \cdots & p_{mk}
\end{pmatrix},

where I is an m x m identity matrix and P is the secret parity generation matrix of size m x k. Note that A is derived from a Vandermonde matrix, so it has the property that any m out of its m + k columns form an invertible matrix. By multiplying F by A, the user obtains the encoded file:

G = F . A = (G^(1), G^(2), ..., G^(m), G^(m+1), ..., G^(n)) = (F_1, F_2, ..., F_m, G^(m+1), ..., G^(n)),

where G^(j) = (g_1^(j), g_2^(j), ..., g_l^(j))^T (j in {1,...,n}). As noted, the multiplication reproduces the original data file vectors of F, and the remaining part (G^(m+1), ..., G^(n)) consists of the k parity vectors generated from F.
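To make the encoding step concrete, the following is a minimal C sketch of parity generation under GF(2^8) with the systematic layout described above. The reduction polynomial (0x11D), the toy dimensions, and the numeric entries of the parity matrix P are our own illustrative assumptions; in the actual scheme, P is the secret matrix obtained by row-reducing the Vandermonde matrix over randomly chosen beta_j.

```c
/* Minimal sketch: systematic encoding G = F * A with A = (I | P).
 * Only the parity part F * P is computed, since the first m vectors
 * of G are the unmodified data vectors. Field polynomial 0x11D and
 * all sizes/values below are illustrative assumptions. */
#include <stdint.h>
#include <stdio.h>

#define M 4   /* data vectors   */
#define K 2   /* parity vectors */
#define L 8   /* blocks per vector */

/* carry-less multiply in GF(2^8), reduced by x^8+x^4+x^3+x^2+1 (0x11D) */
static uint8_t gf_mul(uint8_t a, uint8_t b) {
    uint8_t r = 0;
    while (b) {
        if (b & 1) r ^= a;
        b >>= 1;
        a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1D : 0));
    }
    return r;
}

/* parity[i][j] = sum_t data[i][t] * P[t][j]  (addition in GF(2^8) is XOR) */
static void gen_parity(const uint8_t data[L][M], const uint8_t P[M][K],
                       uint8_t parity[L][K]) {
    for (int i = 0; i < L; i++)
        for (int j = 0; j < K; j++) {
            uint8_t acc = 0;
            for (int t = 0; t < M; t++)
                acc ^= gf_mul(data[i][t], P[t][j]);
            parity[i][j] = acc;
        }
}

int main(void) {
    uint8_t data[L][M], parity[L][K];
    /* hypothetical secret parity matrix P */
    uint8_t P[M][K] = {{0x02,0x03},{0x01,0x05},{0x07,0x0B},{0x0D,0x11}};
    for (int i = 0; i < L; i++)
        for (int t = 0; t < M; t++)
            data[i][t] = (uint8_t)(i * M + t + 1); /* toy file blocks */
    gen_parity(data, P, parity);
    printf("parity row 0: %02x %02x\n", parity[0][0], parity[0][1]);
    return 0;
}
```

Because the layout is systematic, only the k parity vectors need to be computed; the m data vectors are stored on their servers unmodified.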
B. Challenge Token Pre-computation

In order to achieve assurance of data storage correctness and data error localization simultaneously, our scheme relies entirely on pre-computed verification tokens. The main idea is as follows: before file distribution, the user pre-computes a certain number of short verification tokens on each individual vector G^(j) (j in {1,...,n}), each token covering a random subset of data blocks. Later, when the user wants to make sure of the storage correctness for the data in the cloud, he challenges the cloud servers with a set of randomly generated block indices. Upon receiving the challenge, each cloud server computes a short signature over the specified blocks and returns it to the user. The values of these signatures should match the corresponding tokens pre-computed by the user. Meanwhile, as all servers operate over the same subset of indices, the requested response values for the integrity check must also form a valid codeword determined by the secret matrix P. Suppose the user wants to challenge the cloud servers t times to ensure the correctness of data storage. Then, he must pre-compute t verification tokens for each G^(j) (j in {1,...,n}), using a PRF f(.), a PRP phi(.), a challenge key k_chal and a master permutation key K_PRP. To generate the i-th token for server j, the user acts as follows:
1) Derive a random challenge value alpha_i of GF(2^p) by alpha_i = f_{k_chal}(i) and a permutation key k_prp^(i) based on K_PRP.
2) Compute the set of r randomly chosen indices {I_q in [1,...,l] | 1 <= q <= r}, where I_q = phi_{k_prp^(i)}(q).
3) Calculate the token as

v_i^(j) = sum_{q=1}^{r} alpha_i^q . G^(j)[I_q], where G^(j)[I_q] = g^(j)_{I_q}.

Note that v_i^(j), which is an element of GF(2^p) of small size, is the response the user expects to receive from server j when he challenges it on the specified data blocks. After token generation, the user has the choice of either keeping the pre-computed tokens locally or storing them in encrypted form on the cloud servers. In our case here, the user stores them locally, to obviate the need for encryption and to lower the bandwidth overhead during the dynamic data operations discussed shortly. The details of token generation are shown in Algorithm 1.

Algorithm 1 Token Pre-computation
1: procedure
2:   Choose parameters l, n and functions f, phi;
3:   Choose the number t of tokens;
4:   Choose the number r of indices per verification;
5:   Generate master key K_PRP and challenge key k_chal;
6:   for vector G^(j), j <- 1, n do
7:     for round i <- 1, t do
8:       Derive alpha_i = f_{k_chal}(i) and k_prp^(i) from K_PRP.
9:       Compute v_i^(j) = sum_{q=1}^{r} alpha_i^q . G^(j)[phi_{k_prp^(i)}(q)]
10:    end for
11:  end for
12:  Store all the v_i^(j) locally.
13: end procedure

Once all tokens are computed, the final step before file distribution is to blind each parity block g_i^(j) in (G^(m+1), ..., G^(n)) by

g_i^(j) <- g_i^(j) + f_{k_j}(s_{ij}), i in {1,...,l},

where k_j is the secret key for parity vector G^(j) (j in {m+1,...,n}). This is for the protection of the secret matrix P. We will discuss the necessity of using blinded parities in detail in Section V.
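The per-round computation in Algorithm 1 is simple enough to sketch directly. The fragment below computes t tokens over one encoded vector. The prf() and prp() routines are deliberately toy stand-ins for the keyed PRF f and PRP phi (any standard instantiation, e.g., an HMAC-based PRF, could be substituted), indices are 0-based rather than the paper's 1-based convention, and all sizes are illustrative.

```c
/* Sketch of Algorithm 1 for one vector G^(j): t tokens, each a single
 * GF(2^8) element v_i = sum_{q=1..r} alpha_i^q * G[phi_i(q)]. */
#include <stdint.h>
#include <stdio.h>

#define L 8  /* blocks per vector */
#define T 4  /* number of tokens  */
#define R 4  /* indices per token */

static uint8_t gf_mul(uint8_t a, uint8_t b) {          /* as before */
    uint8_t r = 0;
    while (b) { if (b & 1) r ^= a; b >>= 1;
        a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1D : 0)); }
    return r;
}

/* toy PRF stand-in: alpha_i = f_{k_chal}(i); never use this in practice */
static uint8_t prf(uint32_t key, uint32_t i) {
    uint32_t x = key ^ (i * 2654435761u);
    x ^= x >> 13; x *= 2246822519u; x ^= x >> 16;
    return (uint8_t)(x | 1);                /* avoid alpha = 0 */
}
/* toy PRP over [0, L): bijection on q because gcd(5, L) = 1 */
static uint32_t prp(uint32_t k, uint32_t q) { return (q * 5 + k) % L; }

int main(void) {
    uint8_t G[L] = {11,22,33,44,55,66,77,88};  /* one encoded vector */
    uint8_t v[T];
    uint32_t k_chal = 0xC0FFEE;                /* hypothetical key */
    for (uint32_t i = 0; i < T; i++) {
        uint8_t alpha = prf(k_chal, i), pow = 1, tok = 0;
        for (uint32_t q = 1; q <= R; q++) {
            pow = gf_mul(pow, alpha);          /* alpha^q */
            tok ^= gf_mul(pow, G[prp(i, q)]);  /* + alpha^q * G[I_q] */
        }
        v[i] = tok;                            /* store token locally */
    }
    for (int i = 0; i < T; i++) printf("v_%d = %02x\n", i, v[i]);
    return 0;
}
```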
After blinding the parity information, the user disperses all n encoded vectors G^(j) (j in {1,...,n}) across the cloud servers S_1, S_2, ..., S_n.

C. Correctness Verification and Error Localization

Error localization is a key prerequisite for eliminating errors in storage systems. However, many previous schemes do not explicitly consider the problem of data error localization, and thus only provide binary results for storage verification. Our scheme outperforms those by integrating correctness verification and error localization in our challenge-response protocol: the response values from the servers for each challenge not only determine the correctness of the distributed storage, but also contain information to locate potential data error(s). Specifically, the procedure of the i-th challenge-response for a cross-check over the n servers is as follows:
1) The user reveals alpha_i as well as the i-th permutation key k_prp^(i) to each server.
2) The server storing vector G^(j) aggregates the r rows specified by key k_prp^(i) into a linear combination

R_i^(j) = sum_{q=1}^{r} alpha_i^q . G^(j)[phi_{k_prp^(i)}(q)].
3) Upon receiving the R_i^(j) from all the servers, the user removes the blind values in R_i^(j) (j in {m+1,...,n}) by

R_i^(j) <- R_i^(j) - sum_{q=1}^{r} f_{k_j}(s_{I_q,j}) . alpha_i^q, where I_q = phi_{k_prp^(i)}(q).

4) Then the user verifies whether the received values remain a valid codeword determined by the secret matrix P:

(R_i^(1), ..., R_i^(m)) . P =? (R_i^(m+1), ..., R_i^(n)).

Because all the servers operate over the same subset of indices, the linear aggregation of these r specified rows (R_i^(1), ..., R_i^(n)) has to be a codeword in the encoded file matrix. If the above equation holds, the challenge is passed. Otherwise, it indicates that among the specified rows there exist file block corruptions. Once an inconsistency in the storage has been successfully detected, we can rely on the pre-computed verification tokens to further determine where the potential data error(s) lie. Note that each response R_i^(j) is computed in exactly the same way as token v_i^(j); thus the user can simply find which server is misbehaving by verifying the following n equations:

R_i^(j) =? v_i^(j), j in {1,...,n}.

Algorithm 2 gives the details of correctness verification and error localization.

Algorithm 2 Correctness Verification and Error Localization
1: procedure CHALLENGE(i)
2:   Recompute alpha_i = f_{k_chal}(i) and k_prp^(i) from K_PRP;
3:   Send {alpha_i, k_prp^(i)} to all the cloud servers;
4:   Receive from servers: {R_i^(j) = sum_{q=1}^{r} alpha_i^q . G^(j)[phi_{k_prp^(i)}(q)] | 1 <= j <= n}
5:   for (j <- m + 1, n) do
6:     R_i^(j) <- R_i^(j) - sum_{q=1}^{r} f_{k_j}(s_{I_q,j}) . alpha_i^q, I_q = phi_{k_prp^(i)}(q)
7:   end for
8:   if ((R_i^(1), ..., R_i^(m)) . P == (R_i^(m+1), ..., R_i^(n))) then
9:     Accept and ready for the next challenge.
10:  else
11:    for (j <- 1, n) do
12:      if (R_i^(j) != v_i^(j)) then
13:        return server j is misbehaving.
14:      end if
15:    end for
16:  end if
17: end procedure

D. File Retrieval and Error Recovery

Since our layout of the file matrix is systematic, the user can reconstruct the original file by downloading the data vectors from the first m servers, assuming they return the correct response values. Notice that our verification scheme is based on random spot-checking, so the storage correctness assurance is probabilistic. However, by choosing the system parameters (e.g., r, l, t) appropriately and conducting verification enough times, we can guarantee successful file retrieval with high probability. On the other hand, whenever data corruption is detected, the comparison of pre-computed tokens and received response values can guarantee the identification of the misbehaving server(s), again with high probability, as discussed shortly. Therefore, the user can always ask the servers to send back the blocks of the r rows specified in the challenge and regenerate the correct blocks by erasure correction, as shown in Algorithm 3, as long as at most k misbehaving servers are identified. The newly recovered blocks can then be redistributed to the misbehaving servers to maintain the correctness of storage.

Algorithm 3 Error Recovery
1: procedure
     % Assume the block corruptions have been detected among the specified r rows;
     % Assume s <= k servers have been identified as misbehaving
2:   Download the r rows of blocks from the servers;
3:   Treat the s servers as erasures and recover the blocks.
4:   Resend the recovered blocks to the corresponding servers.
5: end procedure
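The two checks behind Algorithm 2, i.e., the codeword cross-check (R^(1),...,R^(m)).P = (R^(m+1),...,R^(n)) followed by the per-server token comparison, admit a compact sketch. The code below assumes the responses have already been unblinded as in step 3, and reuses the toy field arithmetic and the hypothetical parity matrix from the earlier sketches.

```c
/* Sketch of the correctness check and error localization: aggregated
 * responses must form a codeword under P; on failure, each R^(j) is
 * matched against the locally stored token v^(j). */
#include <stdint.h>
#include <stdio.h>

#define M 4
#define K 2
#define N (M + K)

static uint8_t gf_mul(uint8_t a, uint8_t b) {          /* as before */
    uint8_t r = 0;
    while (b) { if (b & 1) r ^= a; b >>= 1;
        a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1D : 0)); }
    return r;
}

/* returns 0 if storage passes; otherwise prints suspect servers */
static int verify(const uint8_t R[N], const uint8_t v[N],
                  const uint8_t P[M][K]) {
    int ok = 1;
    for (int j = 0; j < K; j++) {          /* (R_1..R_m) * P, column j */
        uint8_t acc = 0;
        for (int t = 0; t < M; t++) acc ^= gf_mul(R[t], P[t][j]);
        if (acc != R[M + j]) ok = 0;       /* codeword relation broken */
    }
    if (ok) return 0;
    for (int j = 0; j < N; j++)            /* localize the error(s) */
        if (R[j] != v[j]) printf("server %d is misbehaving\n", j + 1);
    return -1;
}

int main(void) {
    uint8_t P[M][K] = {{0x02,0x03},{0x01,0x05},{0x07,0x0B},{0x0D,0x11}};
    uint8_t R[N] = {1, 2, 3, 4, 0, 0}, v[N];
    for (int j = 0; j < K; j++) {          /* make R a valid codeword */
        uint8_t acc = 0;
        for (int t = 0; t < M; t++) acc ^= gf_mul(R[t], P[t][j]);
        R[M + j] = acc;
    }
    for (int j = 0; j < N; j++) v[j] = R[j];   /* tokens match */
    printf("check: %d\n", verify(R, v, P));    /* passes, prints 0 */
    R[2] ^= 0xFF;                              /* corrupt server 3 */
    printf("check: %d\n", verify(R, v, P));    /* flags server 3 */
    return 0;
}
```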
IV. PROVIDING DYNAMIC DATA OPERATION SUPPORT

So far, we have assumed that F represents static or archived data. This model may fit some application scenarios, such as libraries and scientific datasets. However, in cloud data storage there are many potential scenarios where the data stored in the cloud is dynamic, such as electronic documents, photos, or log files. Therefore, it is crucial to consider the dynamic case, where a user may wish to perform various block-level operations of update, delete and append to modify the data file while maintaining the storage correctness assurance. The straightforward and trivial way to support these operations is for the user to download all the data from the cloud servers and recompute all the parity blocks as well as the verification tokens. This would clearly be highly inefficient. In this section, we show how our scheme can explicitly and efficiently handle dynamic data operations for cloud data storage.

A. Update Operation

In cloud data storage, sometimes the user may need to modify some data block(s) stored in the cloud, from the current value f_{ij} to a new one, f_{ij} + Delta f_{ij}. We refer to this operation as data update. Due to the linear property of the Reed-Solomon code, a user can perform the update operation and generate the updated parity blocks by using Delta f_{ij} only, without involving any other unchanged blocks. Specifically, the user can construct a general update matrix Delta F as

\Delta F = \begin{pmatrix}
\Delta f_{11} & \Delta f_{12} & \cdots & \Delta f_{1m} \\
\Delta f_{21} & \Delta f_{22} & \cdots & \Delta f_{2m} \\
\vdots & \vdots & & \vdots \\
\Delta f_{l1} & \Delta f_{l2} & \cdots & \Delta f_{lm}
\end{pmatrix} = (\Delta F_1, \Delta F_2, ..., \Delta F_m).

Note that we use zero elements in Delta F to denote the unchanged blocks. To maintain the corresponding parity vectors as well as remain consistent with the original file layout, the user can multiply Delta F by A and thus generate the update information for both the data vectors and the parity vectors:

\Delta F . A = (\Delta G^(1), ..., \Delta G^(m), \Delta G^(m+1), ..., \Delta G^(n)) = (\Delta F_1, ..., \Delta F_m, \Delta G^(m+1), ..., \Delta G^(n)),

where Delta G^(j) (j in {m+1,...,n}) denotes the update information for the parity vector G^(j). Because the data update operation inevitably affects some or all of the remaining verification tokens, after the preparation of the update information, the user has to amend those unused tokens for each vector G^(j) to maintain the same storage correctness assurance. In other words, for all the unused tokens, the user needs to exclude every occurrence of the old data block and replace it with the new one. Thanks to the homomorphic construction of our verification tokens, the user can perform this token update efficiently. To give more details, suppose a
block G^(j)[I_s], which is covered by a specific token v_i^(j), has been changed to G^(j)[I_s] + Delta G^(j)[I_s], where I_s = phi_{k_prp^(i)}(s). To maintain the usability of token v_i^(j), it is not hard to verify that the user can simply update it by

v_i^(j) <- v_i^(j) + alpha_i^s . Delta G^(j)[I_s],

without retrieving the other r - 1 blocks required in the pre-computation of v_i^(j). After the amendment of the affected tokens (in practice, it is possible that only a fraction of the tokens need amendment, since the updated blocks may not be covered by all the tokens), the user needs to blind the update information Delta g_i^(j) for each parity block in (Delta G^(m+1), ..., Delta G^(n)) to hide the secret matrix P, by

Delta g_i^(j) <- Delta g_i^(j) + f_{k_j}(s_{ij}^{ver}), i in {1,...,l}.

Here we use a new seed s_{ij}^{ver} for the PRF. The version number ver functions like a counter which helps the user keep track of the blinding information on the specific parity blocks. After blinding, the user sends the update information to the cloud servers, which perform the update operation as

G^(j) <- G^(j) + Delta G^(j), (j in {1,...,n}).

B. Delete Operation

Sometimes, after being stored in the cloud, certain data blocks may need to be deleted. The delete operation we are considering is a general one, in which the user replaces the data block with zero or some special reserved data symbol. From this point of view, the delete operation is actually a special case of the data update operation, where the original data blocks are replaced with zeros or some predetermined special blocks. Therefore, we can rely on the update procedure to support the delete operation, i.e., by setting Delta f_{ij} in Delta F to be -f_{ij}. Also, all the affected tokens have to be modified and the updated parity information has to be blinded using the same method specified for the update operation.

C. Append Operation

In some cases, the user may want to increase the size of his stored data by adding blocks at the end of the data file, which we refer to as data append. We anticipate that the most frequent append operation in cloud data storage is bulk append, in which the user needs to upload a large number of blocks (not a single block) at one time. Given the file matrix F illustrated in the file distribution preparation, appending blocks at the end of a data file is equivalent to concatenating corresponding rows at the bottom of the matrix layout for file F. In the beginning, there are only l rows in the file matrix. To simplify the presentation, we suppose the user wants to append m blocks at the end of file F, denoted as (f_{l+1,1}, f_{l+1,2}, ..., f_{l+1,m}) (we can always use zero-padding to make a row of m elements). With the secret matrix P, the user can directly calculate the appended blocks for each parity server as

(f_{l+1,1}, ..., f_{l+1,m}) . P = (g^(m+1)_{l+1}, ..., g^(n)_{l+1}).

To support the block append operation, we need a slight modification to our token pre-computation. Specifically, we require the user to anticipate the maximum size in blocks, denoted as l_max, for each of his data vectors. The idea of supporting block append, similar to that adopted in [7], relies on an initial budget of the maximum anticipated data size l_max in each encoded data vector, as well as a system parameter r_max = r . (l_max / l) for each pre-computed challenge-response token. The pre-computation of the i-th token on server j is modified as follows:

v_i = sum_{q=1}^{r_max} alpha_i^q . G^(j)[I_q],

where

G^(j)[I_q] = { G^(j)[phi_{k_prp^(i)}(q)]  if phi_{k_prp^(i)}(q) <= l;
               0                          if phi_{k_prp^(i)}(q) > l,

and the PRP phi now permutes over the enlarged index range up to l_max. This formula guarantees that, on average, there will be r indices falling into the range of the existing l blocks.
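A small sketch of this append-aware token rule follows, under the same toy field and PRP assumptions as before: indices drawn by the PRP beyond the current length contribute zero, and a later append at a covered position amends the stored token homomorphically. The positions and parameter values are illustrative only, and indices are 0-based; the full scheme would scan and amend every affected stored token, not just one.

```c
/* Sketch of the modified token rule for append support: the PRP ranges
 * over [0, L_MAX) and indices beyond the current length l are treated
 * as zero, so each token covers about r in-range indices on average. */
#include <stdint.h>
#include <stdio.h>

#define L_CUR 8     /* blocks currently stored        */
#define L_MAX 16    /* anticipated maximum l_max      */
#define R_MAX 8     /* r_max = r * (L_MAX / l), r = 4 */

static uint8_t gf_mul(uint8_t a, uint8_t b) {          /* as before */
    uint8_t r = 0;
    while (b) { if (b & 1) r ^= a; b >>= 1;
        a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1D : 0)); }
    return r;
}
/* toy PRP over [0, L_MAX); gcd(5, 16) = 1 so this is a bijection */
static uint32_t prp(uint32_t k, uint32_t q) { return (q * 5 + k) % L_MAX; }

int main(void) {
    uint8_t G[L_MAX] = {11,22,33,44,55,66,77,88};  /* rest is empty */
    uint8_t alpha = 0x53, pow = 1, tok = 0;        /* toy alpha_i   */
    uint32_t round = 3;                            /* the i-th token */
    for (uint32_t q = 1; q <= R_MAX; q++) {
        pow = gf_mul(pow, alpha);                  /* alpha^q */
        uint32_t idx = prp(round, q);
        if (idx < L_CUR)                           /* zero beyond l */
            tok ^= gf_mul(pow, G[idx]);
    }
    printf("token before append: %02x\n", tok);
    /* append block 0xAB at position 12, which this token covers (q = 5):
     * amend the token by adding alpha^q * G[I_q] */
    G[12] = 0xAB;
    pow = 1;
    for (uint32_t q = 1; q <= R_MAX; q++) {
        pow = gf_mul(pow, alpha);
        if (prp(round, q) == 12) tok ^= gf_mul(pow, G[12]);
    }
    printf("token after append:  %02x\n", tok);
    return 0;
}
```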
Because the cloud servers and the user have an agreement on the number of existing blocks in each vector G^(j), the servers will follow exactly the above procedure when re-computing the token values upon receiving the user's challenge request. Now, when the user is ready to append new blocks, i.e., both the file blocks and the corresponding parity blocks have been generated, the total length of each vector G^(j) increases and falls into the range [l, l_max]. Therefore, the user will update each affected token by adding alpha_i^s . G^(j)[I_s] to the old v_i whenever G^(j)[I_s] != 0 for I_s > l, where I_s = phi_{k_prp^(i)}(s). The parity blinding is similar to that introduced in the update operation and is thus omitted here.

D. Insert Operation

An insert operation on the data file refers to an append operation at a desired index position while maintaining the same data block structure for the whole data file, i.e., inserting a block F[j] corresponds to shifting all blocks starting with index j + 1 by one slot. An insert operation may affect many rows in the logical data file matrix F, and a substantial number of computations are required to renumber all the subsequent blocks as well as recompute the challenge-response tokens. Therefore, an efficient insert operation is difficult to support, and we leave it for future work.

V. SECURITY ANALYSIS AND PERFORMANCE EVALUATION

In this section, we analyze our proposed scheme in terms of security and efficiency. Our security analysis focuses on the adversary model defined in Section II. We also evaluate the efficiency of our scheme via implementation of both the file distribution preparation and the verification token pre-computation.

A. Security Strength Against Weak Adversary

1) Detection Probability against Data Modification: In our scheme, servers are required to operate on the specified rows in each correctness verification for the calculation of the requested token.
[Fig. 2: The detection probability P_d against data modification, shown as a function of l (the number of blocks on each cloud storage server) and r (the number of rows queried by the user, shown as a percentage of l) for three values of z (the number of rows modified by the adversary): left) z = 1% of l; middle) z = 5% of l; right) z = 10% of l. All graphs are plotted under p = 8, n_c = 10 and k = 5, and each graph has a different scale.]

We will show that this sampling strategy on selected rows, instead of all rows, can greatly reduce the computational overhead on the server while maintaining detection of data corruption with high probability. Suppose n_c servers are misbehaving due to possible compromise or Byzantine failure. In the following analysis, we do not limit the value of n_c, i.e., n_c <= n. Assume the adversary modifies the data blocks in z rows out of the l rows in the encoded file matrix. Let r be the number of different rows for which the user asks for a check in a challenge. Let X be a discrete random variable defined as the number of rows chosen by the user that match the rows modified by the adversary. We first analyze the matching probability that at least one of the rows picked by the user matches one of the rows modified by the adversary:

P_m^r = 1 - P{X = 0} = 1 - \prod_{i=0}^{r-1} (1 - \min\{z/(l-i), 1\}) >= 1 - ((l-z)/l)^r.

Note that if none of the specified r rows in the i-th verification process is deleted or modified, the adversary avoids detection. Next, we study the probability of a false negative result, i.e., that the data blocks in the specified r rows have been modified but the checking equation still holds. Consider the responses R_i^(1), ..., R_i^(n) returned from the data storage servers for the i-th challenge; each response value R_i^(j), calculated within GF(2^p), is based on r blocks on server j. The number of responses R^(m+1), ..., R^(n) from the parity servers is k = n - m. Thus, according to Proposition 2 of our previous work in [14], the false negative probability is

P_f^r = Pr_1 + Pr_2, where Pr_1 = ((1 + 2^{-p})^{n_c} - 1)/(2^{n_c} - 1) and Pr_2 = (1 - Pr_1) . (2^{-p})^k.

Based on the above discussion, it follows that the probability of data modification detection across all storage servers is P_d = P_m^r . (1 - P_f^r). Figure 2 plots P_d for different values of l, r, z while we set p = 8, n_c = 10 and k = 5 (see footnote 2). From the figure we can see that if a fixed fraction of the data file is corrupted, it suffices to challenge a small constant number of rows in order to achieve detection with high probability. For example, if z = 1% of l, every token only needs to cover 460 indices in order to achieve a detection probability of at least 99%.

2) Identification Probability for Misbehaving Servers: We have shown that if the adversary modifies data blocks on any of the data storage servers, our sampling checking scheme can successfully detect the attack with high probability. As long as the data modification is caught, the user will further determine which server is malfunctioning. This can be achieved by comparing the response values R_i^(j) with the pre-stored tokens v_i^(j), where j in {1,...,n}.
The probability of error localization, i.e., of identifying the misbehaving server(s), can be computed in a similar way. It is the product of the matching probability for the sampling check and the probability of the complementary event for the false negative result. Obviously, the matching probability is

\hat{P}_m^r = 1 - \prod_{i=0}^{r-1} (1 - \min\{\hat{z}/(l-i), 1\}), where \hat{z} <= z.

Next, we consider the false negative probability that R_i^(j) = v_i^(j) when at least one of the \hat{z} blocks is modified. According to Proposition 1 of [14], tokens calculated in GF(2^p) for two different data vectors collide with probability \hat{P}_f^r = 2^{-p}. Thus, the identification probability for misbehaving server(s) is \hat{P}_d = \hat{P}_m^r . (1 - \hat{P}_f^r). Following the analysis of the detection probability, if z = 1% of l and each token covers 460 indices, the identification probability for misbehaving servers is at least 99%.

Footnote 2: Note that n_c and k only affect the false negative probability P_f^r. In our scheme, however, since p = 8 almost solely determines the negligibility of P_f^r, the values of n_c and k have little effect on the plot of P_d.
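As a numeric sanity check of this analysis, the short program below evaluates P_m^r, the false-negative terms and P_d for the example setting above (z = 1% of l, r = 460, p = 8, n_c = 10, k = 5). The block count l = 5000 is an assumed value chosen from the figure's range, and the closed forms of Pr_1 and Pr_2 follow the reconstruction given above and should be checked against [14]; treat the output as illustrative.

```c
/* Numeric check of the detection-probability analysis:
 * P_m = 1 - prod_{i=0}^{r-1} (1 - min{z/(l-i), 1}),
 * P_f = Pr_1 + Pr_2, P_d = P_m * (1 - P_f). */
#include <math.h>
#include <stdio.h>

int main(void) {
    double l = 5000.0;            /* assumed number of rows per server */
    double z = 0.01 * l;          /* rows modified by the adversary    */
    double r = 460.0;             /* rows queried per challenge        */
    double p = 8.0, nc = 10.0, k = 5.0;

    double miss = 1.0;            /* P{X = 0}: no queried row matches  */
    for (int i = 0; i < (int)r; i++) {
        double hit = z / (l - i);
        miss *= 1.0 - (hit < 1.0 ? hit : 1.0);
    }
    double Pm = 1.0 - miss;

    /* false-negative terms as reconstructed from [14] */
    double Pr1 = (pow(1.0 + pow(2.0, -p), nc) - 1.0) / (pow(2.0, nc) - 1.0);
    double Pr2 = (1.0 - Pr1) * pow(2.0, -p * k);
    double Pf  = Pr1 + Pr2;

    /* expect P_m and P_d around 0.99 for these parameters */
    printf("P_m = %.6f, P_f = %.3e, P_d = %.6f\n", Pm, Pf, Pm * (1.0 - Pf));
    return 0;
}
```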
B. Security Strength Against Strong Adversary

In this section, we analyze the security strength of our scheme against the server colluding attack and explain why blinding the parity blocks can improve the security strength of our proposed scheme. Recall that in the file distribution preparation, the redundancy parity vectors are calculated by multiplying the file matrix F by P, where P is the secret parity generation matrix we later rely on for storage correctness assurance. If we dispersed all the generated vectors directly after token pre-computation, i.e., without blinding, malicious servers that collaborate could easily reconstruct the secret matrix P: they can pick blocks from the same rows among the data and parity vectors to establish a set of m . k linear equations and solve for the m . k entries of the parity generation matrix P. Once they have knowledge of P, those malicious servers can consequently modify any part of the data blocks and calculate the corresponding parity blocks, and vice versa, keeping their codeword relationship always consistent. Our storage correctness challenge scheme would therefore be undermined: even if the modified blocks were covered by the specified rows, the storage correctness check equation would always hold. To prevent colluding servers from recovering P and making up consistently related data and parity blocks, we utilize the technique of adding random perturbations to the encoded file matrix and hence hide the secret matrix P. We make use of a keyed pseudorandom function f_{k_j}(.) with key k_j and seed s_{ij}^{ver}, both of which were introduced previously. To maintain the systematic layout of the data file, we only blind the parity blocks with random perturbations. Our purpose is to add noise to the set of linear equations and make it computationally infeasible to solve for the correct secret matrix P. By blinding each parity block with a random perturbation, the malicious servers no longer have all the necessary information to build up the correct linear equation groups and therefore cannot derive the secret matrix P.

C. Performance Evaluation

1) File Distribution Preparation: We implemented the generation of parity vectors for our scheme under the field GF(2^8). Our experiment was conducted using C on a system with an Intel Core 2 processor running at 1.86 GHz, 2048 MB of RAM, and a 7200 RPM Western Digital 250 GB Serial ATA drive with an 8 MB buffer. We consider two sets of different parameters for the (m + k, m) Reed-Solomon encoding. Table I shows the average encoding cost over 10 trials for an 8 GB file. In the top row, the number of parity vectors is kept constant at 2; in the bottom row, the number of data vectors is fixed at 8 and the number of parity vectors increases. Note that as m increases, the length l of the data vectors on each server decreases, which results in fewer calls to the Reed-Solomon encoder; thus the cost in the top row decreases as more data vectors are involved. From Table I, it can be seen that the performance of our scheme is comparable to that of [10], even though our scheme supports dynamic data operation while [10] is for static data only.

set I          m = 4     m = 6     m = 8     m = 10
k = 2          567.45s   484.55s   437.22s   414.22s

set II         k = 1     k = 2     k = 3     k = 4
m = 8          358.90s   437.22s   584.55s   733.34s

TABLE I: The cost of parity generation in seconds for an 8 GB data file. For set I, the number of parity servers k is fixed; for set II, the number of data servers m is constant.

2) Challenge Token Pre-computation: Although in our scheme the number of verification tokens t is fixed a priori before file distribution, we can overcome this issue by choosing a sufficiently large t in practice.
For example, when t is selected to be 1825 or 3650, the data file can be verified every day for the next 5 or 10 years, respectively. Following the security analysis, we select a practical parameter r = 460 for our token pre-computation (see the previous subsections), i.e., each token covers 460 different indices. The other parameters follow the file distribution preparation. According to our implementation, the average token pre-computation cost is 51.97s per data vector for t = 1825 and 103.94s per data vector for t = 3650. This is faster than the hash-function-based token pre-computation scheme proposed in [7]. For a typical number of 8 servers, the total cost of token pre-computation is no more than 15 minutes. Note that since each token is only an element of the field GF(2^8), the extra storage for the pre-computed tokens is less than 1 MB and can thus be neglected.

VI. RELATED WORK

Juels et al. [3] described a formal proof of retrievability (POR) model for ensuring remote data integrity. Their scheme combines spot-checking and error-correcting codes to ensure both possession and retrievability of files on archive service systems. Shacham et al. [4] built on this model and constructed a random linear function based homomorphic authenticator which enables an unlimited number of queries and requires less communication overhead. Bowers et al. [5] proposed an improved framework for POR protocols that generalizes the work of both Juels and Shacham. Later, in subsequent work, Bowers et al. [10] extended the POR model to distributed systems. However, all these schemes focus on static data. The effectiveness of their schemes rests primarily on the preprocessing steps that the user conducts before outsourcing the data file F. Any change to the contents of F, even of a few bits, must propagate through the error-correcting code, thus introducing significant computation and communication complexity. Ateniese et al. [6] defined the provable data possession (PDP) model for ensuring possession of files on untrusted storage. Their scheme utilizes public key based homomorphic tags for auditing the data file, thus providing public verifiability. However, their scheme incurs a computation overhead that can be expensive for an entire file. In their subsequent work, Ateniese et al. [7] described a PDP scheme
that uses only symmetric key cryptography. This method has lower overhead than their previous scheme and allows for block updates, deletions and appends to the stored file, which is also supported in our work. However, their scheme focuses on the single-server scenario and does not address small data corruptions, leaving both the distributed scenario and the data error recovery issue unexplored. Curtmola et al. [15] aimed to ensure data possession of multiple replicas across a distributed storage system. They extended the PDP scheme to cover multiple replicas without encoding each replica separately, providing a guarantee that multiple copies of the data are actually maintained. In other related work, Lillibridge et al. [9] presented a P2P backup scheme in which blocks of a data file are dispersed across m + k peers using an (m + k, m)-erasure code. Peers can request random blocks from their backup peers and verify their integrity using separate keyed cryptographic hashes attached to each block. Their scheme can detect data loss from free-riding peers, but does not ensure that all data remains unchanged. Filho et al. [16] proposed to verify data integrity using an RSA-based hash to demonstrate uncheatable data possession in peer-to-peer file sharing networks. However, their proposal requires exponentiation over the entire data file, which is clearly impractical for the server whenever the file is large. Shah et al. [17] proposed allowing a TPA to keep online storage honest by first encrypting the data and then sending a number of pre-computed symmetric-keyed hashes over the encrypted data to the auditor. However, their scheme only works for encrypted files, and the auditors must maintain long-term state. Schwarz et al. [8] proposed to ensure file integrity across multiple distributed servers using erasure coding and block-level file integrity checks. However, their scheme only considers static data files and does not explicitly study the problem of data error localization, which we consider in this work.

VII. CONCLUSION

In this paper, we investigated the problem of data security in cloud data storage, which is essentially a distributed storage system. To ensure the correctness of users' data in cloud data storage, we proposed an effective and flexible distributed scheme with explicit dynamic data support, including block update, delete, and append. We rely on erasure-correcting code in the file distribution preparation to provide redundancy parity vectors and guarantee the data dependability. By utilizing the homomorphic token with distributed verification of erasure-coded data, our scheme achieves the integration of storage correctness insurance and data error localization, i.e., whenever data corruption has been detected during the storage correctness verification across the distributed servers, we can almost guarantee the simultaneous identification of the misbehaving server(s). Through detailed security and performance analysis, we show that our scheme is highly efficient and resilient to Byzantine failure, malicious data modification attack, and even server colluding attacks. We believe that data storage security in Cloud Computing, an area full of challenges and of paramount importance, is still in its infancy, and many research problems are yet to be identified. We envision several possible directions for future research in this area. The most promising one, we believe, is a model in which public verifiability is enforced. Public verifiability, supported in [6], [4] and [17], allows a TPA to audit the cloud data storage without demanding users' time, feasibility or resources. An interesting question in this model is whether we can construct a scheme that achieves both public verifiability and storage correctness assurance for dynamic data.
Besides, along with our research on dynamic cloud data storage, we also plan to investigate the problem of fine-grained data error localization.

ACKNOWLEDGEMENT

This work was supported in part by the US National Science Foundation under grants CNS-0831963, CNS-0626601, CNS-0716306, and CNS-0831628.

REFERENCES

[1] Amazon.com, "Amazon Web Services (AWS)," Online at http://aws.amazon.com, 2008.
[2] N. Gohring, "Amazon's S3 down for several hours," Online at http://www.pcworld.com/businesscenter/article/142549/amazons_s3_down_for_several_hours.html, 2008.
[3] A. Juels and B. S. Kaliski, Jr., "PORs: Proofs of Retrievability for Large Files," Proc. of CCS '07, pp. 584-597, 2007.
[4] H. Shacham and B. Waters, "Compact Proofs of Retrievability," Proc. of Asiacrypt '08, Dec. 2008.
[5] K. D. Bowers, A. Juels, and A. Oprea, "Proofs of Retrievability: Theory and Implementation," Cryptology ePrint Archive, Report 2008/175, 2008, http://eprint.iacr.org/.
[6] G. Ateniese, R. Burns, R. Curtmola, J. Herring, L. Kissner, Z. Peterson, and D. Song, "Provable Data Possession at Untrusted Stores," Proc. of CCS '07, pp. 598-609, 2007.
[7] G. Ateniese, R. Di Pietro, L. V. Mancini, and G. Tsudik, "Scalable and Efficient Provable Data Possession," Proc. of SecureComm '08, pp. 1-10, 2008.
[8] T. S. J. Schwarz and E. L. Miller, "Store, Forget, and Check: Using Algebraic Signatures to Check Remotely Administered Storage," Proc. of ICDCS '06, pp. 12-12, 2006.
[9] M. Lillibridge, S. Elnikety, A. Birrell, M. Burrows, and M. Isard, "A Cooperative Internet Backup Scheme," Proc. of the 2003 USENIX Annual Technical Conference (General Track), pp. 29-41, 2003.
[10] K. D. Bowers, A. Juels, and A. Oprea, "HAIL: A High-Availability and Integrity Layer for Cloud Storage," Cryptology ePrint Archive, Report 2008/489, 2008, http://eprint.iacr.org/.
[11] L. Carter and M. Wegman, "Universal Hash Functions," Journal of Computer and System Sciences, vol. 18, no. 2, pp. 143-154, 1979.
[12] J. Hendricks, G. Ganger, and M. Reiter, "Verifying Distributed Erasure-coded Data," Proc. 26th ACM Symposium on Principles of Distributed Computing, pp. 139-146, 2007.
[13] J. S. Plank and Y. Ding, "Note: Correction to the 1997 Tutorial on Reed-Solomon Coding," University of Tennessee, Tech. Rep. CS-03-504, 2003.
[14] Q. Wang, K. Ren, W. Lou, and Y. Zhang, "Dependable and Secure Sensor Data Storage with Dynamic Integrity Assurance," Proc. of IEEE INFOCOM, 2009.
[15] R. Curtmola, O. Khan, R. Burns, and G. Ateniese, "MR-PDP: Multiple-Replica Provable Data Possession," Proc. of ICDCS '08, pp. 411-420, 2008.
[16] D. L. G. Filho and P. S. L. M. Barreto, "Demonstrating Data Possession and Uncheatable Data Transfer," Cryptology ePrint Archive, Report 2006/150, 2006, http://eprint.iacr.org/.
[17] M. A. Shah, M. Baker, J. C. Mogul, and R. Swaminathan, "Auditing to Keep Online Storage Services Honest," Proc. 11th USENIX Workshop on Hot Topics in Operating Systems (HotOS '07), pp. 1-6, 2007.