COST EFFECTIVE PRIVACY PRESERVED INTERMEDIATE DATASETS FOR CLOUD DATA SERVICES

Chaya.Y.K (1), Bharathi V Kalmath (2)
(1) PG Student, Department of CSE, S.V.C.E, Bengaluru
(2) Asst. Professor, Department of CSE, S.V.C.E, Bengaluru

ABSTRACT
When a runtime application is processed in the cloud, a large volume of intermediate datasets is generated. Instead of regenerating these datasets, we store them in the cloud for future use. An adversary who analyzes these datasets can obtain sensitive information. Existing systems protect the privacy of intermediate datasets by encrypting all of them, but this is time consuming and not cost-effective. The proposed system uses a heuristic method to reduce the cost: by encrypting only a part of the datasets, it reduces cost while still satisfying the privacy requirements of data owners.

Keywords: Cloud computing, intermediate datasets, privacy, anonymization, encryption.

1. INTRODUCTION
Technically, cloud computing is regarded as an ingenious combination of various technologies, establishing a business model by offering IT services and using economies of scale. Participants in the business chain of cloud computing can benefit from this model. Cloud customers can save huge investments in IT infrastructure and concentrate on their own business. Therefore, many companies and organizations are migrating or building their business into the cloud. However, numerous customers hesitate to take advantage of the cloud due to security and privacy concerns. The privacy concerns caused by retaining intermediate datasets in the cloud are important but have received little attention. Storage and computation services are equivalent from an economic perspective because both are charged according to how much they are used.
Thus, cloud customers store valuable intermediate datasets while processing original datasets in order to curtail overall expenses by avoiding frequent recomputation of these datasets. Such scenarios are quite common because data users often reanalyze results, conduct new analyses on intermediate datasets, or share some intermediate results with others. However, storing intermediate data enlarges the attack surface, so that the privacy requirements of data holders are at risk of being violated. Usually, intermediate datasets in the cloud are accessed and processed by multiple parties, but rarely controlled by the data owners. This enables an adversary to collect intermediate datasets together and infer sensitive information from them, bringing considerable economic loss or damage to the social reputation of data owners. Yet little attention has been paid to this privacy issue. Data provenance is used to manage intermediate datasets. Provenance is commonly defined as the history of derivation of some objects and data, which contains information on how the data were generated.

2. EXISTING SYSTEM
Preserving the privacy of intermediate datasets is a challenging problem because adversaries may recover privacy-sensitive information by analyzing multiple datasets. To preserve the privacy of intermediate datasets, existing approaches encrypt all datasets in the cloud; anonymization is also used. Encrypting all datasets is a straightforward approach that is widely adopted in current research.

Demerits
- Encrypting all datasets is neither efficient nor cost-effective.
- Frequently encrypting and decrypting all datasets is time consuming and costly.
- The encrypted and decrypted datasets must be stored separately, which requires a large storage area.

3. PROPOSED SYSTEM
The proposed system shows that it is not necessary to encrypt all datasets. Its main points are:
- A practical heuristic algorithm is designed to identify which intermediate datasets need to be encrypted.
- It demonstrates that the privacy leakage requirement can be satisfied without encrypting all datasets.
- It significantly reduces the privacy-preserving cost compared with the existing approach.
- A tree structure is modeled from the generation relationships of intermediate datasets to analyze privacy propagation among them.
- A heuristic algorithm that reduces the privacy-preserving cost is applied under the privacy leakage constraints.
- Sensitive intermediate dataset tree/graph (SIT/SIG) structures are used.
Thus the proposed system offers a solution to the problems faced in the existing system.

Merits
- Avoids the existing approach of encrypting all datasets.
- Storage and encryption time are reduced.
- Provides an accurate value for the privacy-preserving cost.
- Anonymization is applied together with encryption of multiple datasets.

4. PROBLEM ANALYSIS
A. Loading Dataset
Initially, we obtain the username, password and user type from the user in order to authenticate the user category. If the user already has an account, the user is allowed to proceed; otherwise, the user must register a new account. We categorize users into two types:
1. Data Holder
2. Adversary
All of the following actions are performed on the data holder side, with the aim of reducing the amount of file encryption and decryption while still providing privacy for the data; the final output is a dataset with a minimized encryption/decryption cost. A user of type adversary is able to view the datasets they request. The existing system provides fully encrypted datasets; the proposed system instead provides datasets in which only the necessary parts are encrypted and anonymized. After the account has been authenticated, we allow the user to add a dataset to the cloud.
For this purpose we provide an option to create a new table from a dataset stored in a text file, if the user needs it. Once the file has been converted to a table, the dataset to be uploaded to the cloud is identified.

[Figure: processing flow. Original datasets -> Layer segmentation by compressed tree -> Identifying the privacy-preserving cost -> Heuristic algorithm for reduced privacy-preserving cost -> Anonymizing and encrypting the dataset -> Generating intermediate datasets]

B. Layer Segmentation
After the original dataset is uploaded to the cloud, intermediate datasets are generated. To reduce the privacy leakage cost of these datasets, we apply the SIT algorithm. The privacy leakage value of each dataset is identified, and these values must stay under the threshold value. In the compressed tree derived from the SIT, the categorization of the tree defines the layers and their threshold values; different layers have different threshold values. We construct the tree based on the generation relationships. By generating the threshold values for the various layers, at the end of this process we have identified the various levels in the dataset.

C. Minimum Privacy Preserving Cost
Usually, more than one feasible global encryption solution exists under the privacy leakage constraints (PLC), because there are many alternative solutions in each layer. Intermediate datasets differ in size and frequency of usage, so different solutions lead to different overall costs. The type value generated in the compressed tree helps to categorize the dataset. The privacy-preserving cost is calculated only for layer levels below the threshold value; fields above the threshold value are omitted. The values that remain after these restrictions are used to determine the privacy-preserving cost. The minimum such value obtained from the dataset under the privacy leakage threshold is said to be the minimum privacy-preserving cost.
These values are determined from the size of the dataset, the price charged for the transaction (per GB or MB), and the usage frequency of the dataset. They are computed iteratively for all records under the field. The dataset size remains the same, since no further elimination is performed in this module. Finally, we identify the minimum privacy-preserving cost.
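As a minimal sketch of how a cost built from size, price and frequency could be computed, the following Python snippet multiplies the three quantities for every dataset selected for encryption and sums the results. The class name, field names, units and the price value are assumptions made for illustration only; they are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class IntermediateDataset:
    # Hypothetical record describing one intermediate dataset.
    name: str
    size_gb: float    # size of the dataset, assumed to be in GB
    frequency: float  # how often the dataset is used in a billing period


def privacy_preserving_cost(encrypted, price_per_gb):
    """Sum size * price * frequency over the datasets chosen for encryption."""
    return sum(d.size_gb * price_per_gb * d.frequency for d in encrypted)


datasets = [
    IntermediateDataset("d1", size_gb=2.0, frequency=5),
    IntermediateDataset("d2", size_gb=0.5, frequency=20),
    IntermediateDataset("d3", size_gb=4.0, frequency=1),
]
price = 0.25  # hypothetical price per GB per access

# Encrypting everything versus encrypting only the first dataset.
cost_all = privacy_preserving_cost(datasets, price)
cost_partial = privacy_preserving_cost(datasets[:1], price)
print(cost_all, cost_partial)
```

Under these made-up numbers, encrypting only one dataset costs less than encrypting all three, which is the kind of saving the proposed approach aims for; whether a partial selection also satisfies the privacy leakage threshold has to be checked separately.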

Privacy leakage:

$$PL_s(d^*) = H(S,Q) - H^*(S,Q)$$

where

$$H(S,Q) = \log\left(|QI| \cdot |SD|\right), \qquad H^*(S,Q) = -\sum_{q \in QI,\ s \in SD} p(s,q) \cdot \log p(s,q)$$

Cost calculation:

$$C(\pi_i) = \sum_{d_k \in ED_i} S_k \cdot PR \cdot f_k, \qquad 1 \le i \le H, \quad \pi_i = \{ED_i, UD_i\}$$

D. Heuristic Algorithm
The state-search tree generated is different from the SIT itself, but has the same height. The goal of our algorithm is to find a near-optimal solution in a limited search space. Based on this, we design a heuristic algorithm. The basic idea is that the algorithm iteratively selects the state node with the highest heuristic value and then expands its child nodes until it reaches the goal state. The privacy-preserving solution and the corresponding cost are derived from the goal state. SORT and SELECT are simple functions, as their names signify. Each value identified in the minimum privacy-preserving cost step is thus passed through the heuristic algorithm to identify the datasets that need privacy protection. The algorithm can be used to find which intermediate datasets need to be encrypted and to obtain a near-optimal solution. The heuristic value is designed so that a dataset with small cost and large privacy leakage is selected for encryption.

The heuristic function for a state node $SN_i$ is $f(SN_i) = g(SN_i) + h(SN_i)$, where $g(SN_i)$ is the heuristic information from the starting state to the current state and $h(SN_i)$ is the estimated information from the current state to the goal state. We define $g(SN_i) = C_{cur} / (\varepsilon - \varepsilon_{i+1})$, where $C_{cur}$ is the privacy-preserving cost up to the current state, $\varepsilon$ is the threshold value and $\varepsilon_{i+1}$ is the threshold value for the next layer. The estimate is $h(SN_i) = (\varepsilon_{i+1} \cdot C_{des} \cdot BF_{AVG}) / PL_{AVG}$, where $C_{des}$ represents the total cost of the tree and $BF_{AVG}$ is the average branch factor of the SIT, computed as $BF_{AVG} = N_E / N_I$, where $N_E$ denotes the number of edges in the tree and $N_I$ the number of intermediate datasets in the tree. $PL_{AVG}$ is the average privacy leakage of all the intermediate datasets.
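The privacy leakage and heuristic value formulas above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function and parameter names are hypothetical, the natural logarithm is assumed because the text does not specify a base, and the joint probabilities p(s, q) are passed in as a flat list.

```python
import math


def privacy_leakage(joint_probs, num_qi, num_sd):
    """PL = H(S,Q) - H*(S,Q), with H = log(|QI| * |SD|).

    joint_probs holds the probabilities p(s, q); zero entries are skipped
    because they contribute nothing to the entropy. Natural log is assumed.
    """
    h_max = math.log(num_qi * num_sd)
    h_star = -sum(p * math.log(p) for p in joint_probs if p > 0)
    return h_max - h_star


def branch_factor(num_edges, num_datasets):
    """BF_AVG = N_E / N_I."""
    return num_edges / num_datasets


def heuristic_value(c_cur, eps, eps_next, c_des, bf_avg, pl_avg):
    """f(SN_i) = C_cur / (eps - eps_next) + (eps_next * C_des * BF_AVG) / PL_AVG."""
    if eps <= eps_next:
        raise ValueError("eps must be larger than the next layer's threshold")
    g = c_cur / (eps - eps_next)
    h = (eps_next * c_des * bf_avg) / pl_avg
    return g + h


# A uniform joint distribution over 2 x 2 values leaks nothing in this measure.
pl_uniform = privacy_leakage([0.25, 0.25, 0.25, 0.25], num_qi=2, num_sd=2)

# Made-up inputs: 6 edges, 4 intermediate datasets.
bf = branch_factor(6, 4)
f_value = heuristic_value(c_cur=10.0, eps=0.5, eps_next=0.25,
                          c_des=20.0, bf_avg=bf, pl_avg=0.5)
print(f_value)  # prints 55.0
```

With these inputs, g is 10 / 0.25 = 40 and h is (0.25 * 20 * 1.5) / 0.5 = 15, giving 55. In the algorithm described in the text, such values would be compared across candidate state nodes and the node with the highest value expanded first.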
Based on the above description, the heuristic value of a search node is calculated as follows:

$$f(SN_i) = \frac{C_{cur}}{\varepsilon - \varepsilon_{i+1}} + \frac{\varepsilon_{i+1} \cdot C_{des} \cdot BF_{AVG}}{PL_{AVG}}$$

The heuristic algorithm uses this value to select the state node with the highest heuristic value. Its child nodes are then retrieved to find which intermediate datasets need to be encrypted, and they are added one by one to the priority queue until the goal state is reached. Finally, the global privacy-preserving solution and the corresponding cost are derived.

E. Anonymization and Encryption
In this step we convert the dataset that is to be stored in the cloud storage space. We are able to analyze the dataset and find the strings to be encrypted, but applying the same analysis to integer values is a waste of time, because integer values cannot be categorized in this way. To solve this problem we use a technique called anonymization, which hides the data. Declaring certain ranges for the integer values and anonymizing the data accordingly reduces the cost of further encryption and conceals the true information. Applying both encryption and anonymization to a dataset thus reduces the privacy-preserving cost, as proposed earlier. Finally, the encrypted and anonymized dataset is stored in the cloud. We then compare the results of the existing and proposed approaches and produce a graph of the heuristic privacy-preserving cost. When an adversary logs in to view a dataset uploaded by the data holder, the anonymized and partially encrypted dataset is presented instead of a fully encrypted one. As a result, the adversary cannot view the original information, but can still see the information that is necessary to be
visible to the adversary user. After analyzing the dataset, we generate the intermediate dataset with the data that need to be encrypted and anonymized.

5. CONCLUSION
We have identified which datasets need to be encrypted, and a practical heuristic algorithm has been designed accordingly. The cost of preserving privacy is reduced with our approach. The problem of saving privacy-preserving cost is modeled as a constrained optimization problem, which is addressed by decomposing the privacy leakage constraints. Evaluation results on real-world datasets and larger extensive datasets demonstrate that the cost of preserving privacy in the cloud can be reduced significantly with our approach compared with existing ones in which all datasets are encrypted.

6. FUTURE WORKS
Building on this work, we plan to further investigate privacy-aware efficient scheduling of intermediate datasets in the cloud by taking privacy preservation as a metric together with other metrics such as storage and computation. Optimized balanced scheduling strategies are expected to be developed toward overall highly efficient privacy-aware dataset scheduling.

7. RESULT
In the existing system 80% of the datasets are encrypted, whereas in the proposed system 20% of the datasets are encrypted.