Hyper Elliptic Curve Encryption and Cost Minimization Approach in Moving Big Data to Cloud
Jerin Jose 1, Dr. Shine N Das 2
1 (Department of CSE, College of Engineering, Munnar) 2 (Department of CSE, College of Engineering, Munnar)

ABSTRACT
Cloud computing is a recent computational paradigm that can be used for big data processing. Huge volumes of unstructured, structured and semi-structured data are collectively called big data. MapReduce and Hadoop provide an affordable mechanism to handle and process data from multiple sources and to store big data in a distributed cloud. This paper presents a secure and cost-minimizing approach to moving and storing very large amounts of data in the cloud. Hyper elliptic curve cryptography is introduced in this paper to encrypt the huge volume of data arriving at the cloud. In addition to cryptography, a data download module is included. The paper thus covers both cost minimization in moving big data and the security of the big data.

Keywords- Big Data, Cloud Computing, Hyper Elliptic Curve Cryptography, Online Algorithm

1. INTRODUCTION
Cloud computing is, simply, a service over the Internet that stores gigantic amounts of data that our computers or a single server cannot hold, and that provides computing services over the Internet; that is, it supplies server resources such as storage, bandwidth and CPU to users. Its desirable features are on-demand supply of server resources and minimized management effort. A cloud platform is a collection of software, Internet infrastructure and interconnected hardware. The software and hardware services of cloud computing are available to enterprises, corporations, business markets and the public. The essential characteristics of cloud computing are on-demand self-service, rapid elasticity, broad network access, resource pooling and measured service.
Massive scale, geographic distribution, homogeneity, virtualization, low-cost software and resilient computing are some of the common features of cloud computing. Big data analysts have concentrated their work mostly on analyzing and processing big data. Before analysis, however, it is necessary to store the data somewhere, and since big data is extremely large in volume, the best place to store it is the cloud. We therefore have to move massive amounts of data from the sources to the cloud. The big data should be moved to the cloud in a cost-optimal manner, and it should also be kept secure. Some work has been done on moving big data to the cloud while considering cost minimization, but the data must also be secured, so a security system is mandatory. We therefore implement hyper elliptic curve cryptography, which encrypts the data arriving at the cloud.

2. RELATED WORKS
A series of recent works studies application migration to the cloud. The following are some of the related works on cloud computing and big data. Big Data is not just Hadoop [1] summarizes Hadoop as a cost-efficient platform with the ability to significantly lower the cost of certain workloads. Organizations may have particular pain around reducing the overall cost of their data warehouse. Certain groups of data may be seldom used and are possible candidates for offloading to a lower-cost platform. Certain operations, such as transformations, may be offloaded to a more cost-efficient platform. The primary area of value creation is cost savings. By pushing workloads and data sets onto a Hadoop platform, organizations are able to preserve their queries and take advantage of Hadoop's cost-effective processing capabilities. In one customer example, a financial services firm moved the processing of applications and reports from an operational data warehouse to Hadoop HBase; they were able to preserve their existing queries and reduce the operating cost of their data management platform.
A tunable workflow scheduling algorithm based on particle swarm optimization for cloud computing [2] explains that cloud computing provides a pool of virtualized computing resources and adopts a pay-per-use model. Schedulers for cloud computing decide how to allocate the tasks of a workflow to those virtualized computing resources. In that paper, a flexible particle swarm optimization (PSO) based scheduling algorithm that minimizes both total cost and makespan is presented. Experiments are conducted by varying the computation of
tasks, the number of particles, and the weight values of cost and makespan in the fitness function. The results show that the proposed algorithm achieves both low cost and low makespan; in addition, it is adjustable according to different QoS constraints.

Privacy-Aware Cloud Deployment Scenario Selection [6] presented a privacy-aware decision method for cloud deployment scenarios, built upon the ProPAn and PACTS methods. The first step of the method is the definition of the clouds used in concrete deployment scenarios and their cloud stakeholders. Then it has to be decided which domains shall be put into which of the defined clouds. Next, the defined clouds, the cloud stakeholders, and the relations between existing domains and the defined clouds are captured in domain knowledge diagrams. ProPAn's graph generation algorithms can be applied to these domain knowledge diagrams together with a given model of the functional requirements in problem frames notation. In the last step of the method, the resulting privacy threat graphs are analyzed to decide which deployment scenario best fits the privacy needs. To support the method, the authors extended the ProPAn tool with wizards that guide the user through the definition of the deployment scenarios and that automatically generate the corresponding domain knowledge diagrams. The proposed method scales well due to the modular way in which the relevant knowledge for the cloud deployment scenarios is integrated into the requirements model, and due to the provided tool support.

New Algorithms for Planning Bulk Transfer via Internet and Shipping Networks [9] is the first work to explore the problem of planning a group-based, deadline-oriented data transfer in a scenario where data can be sent both (1) over the Internet and (2) by shipping storage devices (e.g., external or hot-plug drives, or SSDs) via companies such as FedEx, UPS and USPS. The authors first formalize the problem and prove its NP-hardness.
Then they propose novel algorithms and use them to build a planning system called Pandora (People and Networks Moving Data Around). Pandora uses the new concepts of time-expanded networks and delta-time-expanded networks, combining them with integer programming techniques and optimizations for both shipping and Internet edges. An experimental evaluation using real data from FedEx and from PlanetLab indicates that the Pandora planner manages to satisfy deadlines and reduce costs significantly.

Budget-constrained bulk data transfer via internet and shipping networks [10] formulated and solved the problem of finding the fastest bulk data transfer plan under a strict budget constraint. The authors first characterized the solution space and observed that the optimal solution can be found by searching through solutions to the deadline-constrained minimum-cost problem. Based on these observations, they devised a two-step binary search method that finds an optimal solution. They then developed a bounded binary search method that makes use of bounding functions providing upper and lower bounds. The authors also presented two instances of bounding functions, based on variants of their data transfer networks, and proved that they do indeed provide bounds. Finally, they evaluated the proposed algorithms on realistic networks and found that the proposed techniques significantly reduce the time needed to compute solutions.

Scaling social media applications into geo-distributed clouds [8] exploits the social influences among users and proposes efficient proactive algorithms for dynamic, optimal scaling of a social media application in a geo-distributed cloud.
The key contribution of that paper is an online content migration and request distribution algorithm with the following features: (1) future demand prediction by characterizing social influences among the users in a simple but effective epidemic model; (2) one-shot optimal content migration and request distribution based on efficient optimization algorithms to address the predicted demand; and (3) a Δt-step look-ahead mechanism to adjust the one-shot optimization results towards the offline optimum. The paper also verifies the effectiveness of the algorithm using solid theoretical analysis, as well as large-scale experiments under dynamic, realistic settings on a home-built cloud platform.

3. METHODOLOGY
3.1 PROBLEM DEFINITION
This work focuses on providing security for big data arriving in the cloud from data centers. Current approaches concentrate on big data analysis and on the constraints of moving big data to a cloud system. The proposed method focuses on the encryption of data in the cloud and on the downloading of data from the cloud. The encryption method proposed here is Hyper Elliptic Curve Cryptography. The downloading module includes a clustering system to relieve the bottlenecks in downloading.

3.2 SYSTEM DESIGN
We consider a cloud consisting of K geo-distributed data centers in a set of regions K, where |K| = K. A cloud user (e.g., a global astronomical telescope application) continuously produces large volumes of data at a set D
of multiple geographic locations. The user connects to the data centers from different data generation locations via virtual private networks (VPNs), with G VPN gateways at the user side and K VPN gateways, each collocated with a data center. Let the set of VPN gateways at the user side be denoted by G, with |G| = G. An illustration of the system is in Fig. 1. A private network (the user's) interconnects the data generation locations and the VPN gateways at the user side. Such a model reflects typical connection approaches between users and public clouds, where dedicated, private network connections are established between a user's premises and the cloud for enhanced reliability and security, and for guaranteed inter-connection bandwidth. Inter-data center connections within a cloud are usually dedicated high-bandwidth lines. Within the user's private network, the data transmission bandwidth between a data generation location d ∈ D and a VPN gateway g ∈ G is large as well. The bandwidth U_gi on a VPN link (g, i) from user-side gateway g to data center i is limited, and constitutes the bottleneck in the system.

Fig. 1 An illustration of the system

3.2.1 PROBLEM FORMULATION
Assume the system executes in a time-slotted fashion with slot length τ. F_d(t) bytes of data are produced at location d in slot t, for uploading to the cloud. l_dg is the latency between data location d ∈ D and user-side gateway g ∈ G, p_gi is the delay along VPN link (g, i), and η_ik is the latency between data centers i and k. These delays, which can be obtained by a simple command such as ping, are dictated by the respective geographic distances. A cloud user needs to decide (i) via which VPN connections to upload its data to the cloud, and (ii) at which data center to aggregate the data for processing by a MapReduce-like framework, such that the monetary charges incurred, as well as the latency for the data to reach the aggregation point, are jointly minimized.
The total cost C to be minimized has four components: routing cost, migration cost, bandwidth cost, and aggregate storage and computing cost.

3.2.2 OFFLINE ALGORITHM
We propose a polynomial-time dynamic programming based algorithm to solve the offline optimal data migration problem, given absolute knowledge of data generation in the temporal domain. The derived offline optimal strategies serve as a benchmark for our online algorithms. The offline algorithm derives the theoretical minimum cost given complete knowledge of data generation in both the temporal and spatial domains.

3.2.3 ONLINE ALGORITHM
A straightforward algorithm solves the above optimization in each time slot, based on y(t−1) from the previous time slot. This can be far from optimal due to premature data migration. For example, assume data center k was selected at t−1, and migrating data from k to j is cost-optimal at t according to the one-shot optimization (e.g., because more data are generated in region j in t); the offline optimum may indicate keeping all data in k at t if the volume of data originating in k at t+1 surges. We next explore dependencies in the selection of the aggregation data center across consecutive time slots, and design a more judicious online algorithm accordingly. We divide the overall cost C(x(t), y(t)) incurred in t into two parts: (i) the migration cost C_t^MG(y(t), y(t−1)), related to decisions in t−1; and (ii) the non-migration cost C_t^-MG(x(t), y(t)), which relies only on current information at t:

C_t^-MG(x(t), y(t)) = C^BW(x(t)) + C^DC(y(t)) + C^RT(x(t)).   (1)

We design an online algorithm whose basic idea is to postpone data center switching, even if the one-shot optimum indicates a switch, until the cumulative non-migration cost (in C_t^-MG(x(t), y(t))) has significantly exceeded the potential data migration cost. At the beginning (t = 1), we solve the one-shot optimization and upload data via the derived optimal routes x(1) to the optimal aggregation data center indicated by y(1).
Let t̂ be the time of the most recent data center switch. In each following time slot t, we compute the overall non-migration cost in [t̂, t−1], Σ_{ν=t̂}^{t−1} C_ν^-MG(x(ν), y(ν)). The algorithm checks whether this cost is at least β2 times the migration cost C_t̂^MG(y(t̂), y(t̂−1)). If so, it solves the one-shot optimization to derive x(t) and y(t) without considering the migration cost, i.e., by minimizing C_t^-MG(x(t), y(t)) subject to an additional constraint that the potential migration cost, C_t^MG(y(t), y(t−1)), is no larger than β1 times the non-migration cost C_t^-MG(x(t),
y(t)) at time t (to make sure that the migration cost is not excessive). If a change of aggregation data center is indicated (y(t) ≠ y(t−1)), the algorithm accepts the new aggregation decision and migrates data accordingly. In all other cases, the aggregation data center remains unchanged from t−1, while optimal data routing paths are computed given this aggregation decision, for the upload of new data generated in t.

The Online Algorithm:
1: t = 1;
2: t̂ = 1; // time slot when the last change of aggregation data center happened
3: Compute the data routing decision x(1) and aggregation decision y(1) by minimizing C(x(1), y(1));
4: Compute C_1^MG(y(1), y(0)) and C_1^-MG(x(1), y(1));
5: while t ≤ T do
6:   if C_t̂^MG(y(t̂), y(t̂−1)) ≤ (1/β2) Σ_{ν=t̂}^{t−1} C_ν^-MG(x(ν), y(ν)) then
7:     Derive x(t) and y(t) by minimizing C_t^-MG(x(t), y(t)) subject to the constraint C_t^MG(y(t), y(t−1)) ≤ β1 C_t^-MG(x(t), y(t));
8:     if y(t) ≠ y(t−1) then
9:       Use the new aggregation data center indicated by y(t);
10:      t̂ = t;
11:  if t̂ < t then // do not switch to a new aggregation data center
12:    y(t) = y(t−1); compute the data routing decision x(t) if not already derived;
13:  t = t + 1;

3.2.4 HYPER ELLIPTIC CURVES
A hyper elliptic curve C of genus g defined over a field F_q of characteristic p is given by an equation of the form

y^2 + h(x)y = f(x),

where h(x) and f(x) are polynomials with coefficients in F_q, with deg h(x) ≤ g and deg f(x) = 2g + 1. An additional requirement is that C is not a singular curve. If h(x) = 0 and p > 2, this amounts to the requirement that f(x) be a square-free polynomial. In general, the condition is that there are no x and y in the algebraic closure of F_q that satisfy the curve equation together with the two partial derivative equations 2y + h(x) = 0 and h'(x)y − f'(x) = 0.

3.2.5 SCHEMES
Signature schemes, encryption schemes and key agreement schemes can all be based on elliptic and hyper elliptic curves.
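When h(x) = 0 and p > 2, the non-singularity condition above reduces to f(x) being square-free, which can be tested mechanically by checking that gcd(f, f') is a nonzero constant over F_p. A minimal Python sketch (the prime and the coefficient lists are toy, assumed values; polynomials are stored lowest degree first):

```python
# Non-singularity check for y^2 = f(x) over F_p (h = 0, p > 2): the curve is
# non-singular exactly when f is square-free, i.e. gcd(f, f') is constant.

def trim(a):
    """Drop leading zero coefficients (lists are lowest degree first)."""
    while len(a) > 1 and a[-1] == 0:
        a = a[:-1]
    return a

def deriv(f, p):
    """Formal derivative of f over F_p."""
    return trim([(i * c) % p for i, c in enumerate(f)][1:] or [0])

def poly_mod(a, b, p):
    """Remainder of a divided by b over F_p (b nonzero, p prime)."""
    a, b = trim([c % p for c in a]), trim([c % p for c in b])
    inv_lead = pow(b[-1], p - 2, p)  # inverse of b's leading coefficient
    while len(a) >= len(b) and a != [0]:
        shift = len(a) - len(b)
        coef = (a[-1] * inv_lead) % p
        a = trim([(c - coef * (b[i - shift] if 0 <= i - shift < len(b) else 0)) % p
                  for i, c in enumerate(a)])
    return a

def poly_gcd(a, b, p):
    """Euclidean algorithm for polynomials over F_p."""
    while trim(b) != [0]:
        a, b = b, poly_mod(a, b, p)
    return trim(a)

def is_square_free(f, p):
    return len(poly_gcd(f, deriv(f, p), p)) == 1

p = 7
f_good = [0, 1, 0, 0, 0, 1]   # x^5 + x: deg f = 2g + 1 = 5, so genus g = 2
f_bad  = [0, 0, 0, 1, 5, 1]   # x^3 (x - 1)^2 mod 7: repeated root at x = 1
print(is_square_free(f_good, p), is_square_free(f_bad, p))  # → True False
```

The second polynomial has a repeated root at x = 1, so the corresponding curve y^2 = f(x) would be singular and must be rejected as a system parameter.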
Diffie-Hellman Key Agreement Scheme: Two parties, Sender and Receiver, wish to agree on a common secret by communicating over a public channel. An eavesdropper, Interrupter, who can listen to all communication between Sender and Receiver, should not be able to obtain this common secret. First, we assume that the following system parameters are publicly known: a group G, and an element R ∈ G of large prime order r.

The steps that Sender performs are the following:
1. Choose a random integer a ∈ [1, r − 1].
2. Compute P = aR in the group G, and send it to Receiver.
3. Receive the element Q ∈ G from Receiver.
4. Compute S = aQ as the common secret.

The steps that Receiver performs are:
1. Choose a random integer b ∈ [1, r − 1].
2. Compute Q = bR in the group G, and send it to Sender.
3. Receive the element P ∈ G from Sender.
4. Compute S = bP as the common secret.

Note that both Sender and Receiver have computed the same value S, since S = a(bR) = (ab)R = b(aR). It is not known how Interrupter, knowing only P, Q and R, can compute S within reasonable time. If she could solve the discrete logarithm problem in G, she could calculate a from P and R, and then calculate S = aQ. The problem of computing S from P, Q and R is known as the Diffie-Hellman problem. The pair (a, P) is called Sender's key pair, consisting of her private key a and public key P. Likewise, Receiver's key pair is (b, Q), with private key b and public key Q. It is important to realize that the scheme described here should be used with additional authentication of the public keys. Otherwise an attacker (Interrupter) who is able to intercept and change the information sent can agree on keys separately with Sender and Receiver; this is known as a man-in-the-middle attack.

The (Hyper-) Elliptic Curve Integrated Encryption Scheme: This encryption scheme uses the Diffie-Hellman scheme to derive a secret key, and combines it with tools from symmetric-key cryptography to provide better provable security.
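The Diffie-Hellman exchange above can be sketched in a few lines. In the paper's setting, G is the Jacobian of a hyperelliptic curve; since implementing Jacobian arithmetic is beyond a short sketch, the illustration below uses a toy, assumed group: the multiplicative subgroup of F_23 of prime order r = 11 generated by R = 2, written multiplicatively so that the scalar multiple aR becomes the modular power pow(R, a, p):

```python
# Diffie-Hellman key agreement over a toy prime-order group (assumed parameters:
# subgroup of order r = 11 generated by R = 2 inside the multiplicative group of F_23).
import secrets

p, r, R = 23, 11, 2                # public: field prime, subgroup order, generator

# Sender's side
a = secrets.randbelow(r - 1) + 1   # random a in [1, r - 1]
P = pow(R, a, p)                   # P = aR, sent to Receiver

# Receiver's side
b = secrets.randbelow(r - 1) + 1   # random b in [1, r - 1]
Q = pow(R, b, p)                   # Q = bR, sent to Sender

# Both sides derive the same shared secret S = (ab)R
S_sender = pow(Q, a, p)            # a(bR)
S_receiver = pow(P, b, p)          # b(aR)
assert S_sender == S_receiver
```

Replacing pow by divisor scalar multiplication on the Jacobian leaves the exchange itself unchanged; only the group operation differs.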
It can be proved to be secure against adaptive chosen-ciphertext attacks. We again formulate the scheme for any group G and R ∈ G of large prime order r. The symmetric tools used in the scheme are:

A key derivation function. This is a function KD(P) that takes as input a key P, in our case an element of G, and outputs keying data of any required length.

A symmetric encryption scheme consisting of a function Enc_k that encrypts a message M to a ciphertext C = Enc_k(M) using a key k, and a function Dec_k that decrypts C to the message M = Dec_k(C).
A Message Authentication Code MAC_k. One can think of this as a keyed hash function. It is a function that takes as input a ciphertext C and a key k, and computes a string MAC_k(C) satisfying the following property: given a number of pairs (C_i, MAC_k(C_i)), it is computationally infeasible to determine a pair (C, MAC_k(C)) with C different from the C_i if one does not know k.

To encrypt a message M, Sender does the following:
1. Obtain Receiver's public key Q.
2. Choose a secret number a ∈ [1, r − 1].
3. Compute C1 = aR.
4. Compute C2 = aQ.
5. Compute two keys k1 and k2 from KD(C2), i.e., (k1, k2) = KD(C2).
6. Encrypt the message: C = Enc_k1(M).
7. Compute mac = MAC_k2(C).
8. Send (C1, C, mac).

To decrypt, Receiver does the following:
1. Obtain the encrypted message (C1, C, mac) from Sender.
2. Compute C2 = bC1.
3. Compute the keys k1 and k2 from KD(C2).
4. Check whether mac equals MAC_k2(C). If not, reject the message and stop.
5. Decrypt the message: M = Dec_k1(C).

The Digital Signature Algorithm (DSA) is the basis of the NIST digital signature standard. This algorithm can be adapted for elliptic and hyper elliptic curves. More generally, one can use it for any group G in which the DLP is difficult, provided that one has a computable map φ: G → Z with a large enough image and few inverses for each element in the image. The elliptic curve version, known as ECDSA, can be found in various standards. The hyper elliptic curve version seems not to have appeared much in the existing literature. In the hyper elliptic curve case, one can take for φ the following map. Let D = [u(x), v(x)] be a divisor in Mumford representation, and let u(x) = Σ_i u_i x^i with u_i ∈ F_q. Define φ(D) to be the integer whose binary expansion is the concatenation of the bit strings representing the u_i, i ∈ [0, deg(u(x)) − 1]. Assume the following system parameters are publicly known: a group G and a map φ: G → Z as above, an element R ∈ G of large prime order r, and a hash function H that maps messages m to 160-bit integers.
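The encryption and decryption steps of the integrated scheme can be sketched end to end. Everything below is an illustrative assumption, not the scheme's mandated primitives: a toy multiplicative subgroup of F_23 (order r = 11, generator R = 2) stands in for the Jacobian, SHA-256 serves as the key derivation function KD, a SHA-256 XOR keystream serves as Enc/Dec, and HMAC-SHA256 serves as the MAC:

```python
# Toy sketch of the integrated encryption scheme: DH key transport + symmetric
# encryption + MAC. All primitives and parameters are illustrative stand-ins.
import hashlib, hmac, secrets

p, r, R = 23, 11, 2                      # assumed toy group parameters

def KD(point):
    """Key derivation: hash the group element into (k1, k2), 16 bytes each."""
    digest = hashlib.sha256(str(point).encode()).digest()
    return digest[:16], digest[16:]

def xor_stream(key, data):
    """Toy symmetric cipher: XOR with a SHA-256 keystream (its own inverse)."""
    stream, counter = b"", 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + counter.to_bytes(4, "big")).digest()
        counter += 1
    return bytes(x ^ y for x, y in zip(data, stream))

# Receiver's key pair: private b, public Q = bR
b = secrets.randbelow(r - 1) + 1
Q = pow(R, b, p)

# --- Encryption (Sender) ---
a = secrets.randbelow(r - 1) + 1
C1 = pow(R, a, p)                        # C1 = aR
k1, k2 = KD(pow(Q, a, p))                # (k1, k2) = KD(aQ)
C = xor_stream(k1, b"move big data to cloud")
mac = hmac.new(k2, C, hashlib.sha256).digest()

# --- Decryption (Receiver) ---
k1d, k2d = KD(pow(C1, b, p))             # C2 = b C1 = (ab)R, same keys
assert hmac.compare_digest(mac, hmac.new(k2d, C, hashlib.sha256).digest())
M = xor_stream(k1d, C)
print(M)                                 # → b'move big data to cloud'
```

Swapping in real primitives (a hyperelliptic Jacobian for the group, a block cipher for Enc/Dec, a standard KDF) preserves exactly this structure.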
To create a key pair, Alice chooses a secret integer a ∈ Z and computes P = aR. The number a is Alice's secret key, and P is her public key. If Alice wants to sign a message m, she does the following:
1. Choose a random integer k ∈ [1, r − 1], and compute Q = kR.
2. Compute s ≡ k^(−1)(H(m) + aφ(Q)) mod r.
The signature is (m, Q, s). To verify this signature, a verifier Bob does the following:
1. Compute v1 ≡ s^(−1)H(m) mod r and v2 ≡ s^(−1)φ(Q) mod r.
2. Compute V = v1·R + v2·P.
3. Accept the signature if V = Q; otherwise, reject it.

The hyper elliptic curve scheme explained above is implemented at the cloud side before the data are stored in the cloud, giving more protection to the stored data. The data arriving at the cloud are divided into chunks, and the chunks are passed through the hyper elliptic curve encryption system and thereby converted into encrypted form. These encrypted files are stored in the cloud.

4. RESULTS
We compare the performance of our scheme with the previous work, which did not use any security measures for storing big data in the cloud. This paper employs encryption for the big data, which gives more advantage and efficiency to the system. The computation and communication overhead when encryption is applied to the entire file (n) and to randomly chosen blocks (c) is shown in TABLE 1. The overhead is small yet brings great benefit to the work.

Table 1. Comparison of Overheads
                              n = 100,000    c = 460
Computation Overhead          13.15 sec      0.21 sec
Communication Overhead        2.11 MB        30.37 KB

Signature generation time and the extra storage space for signatures are also evaluated against previous works that use other encryption methods; the results are shown in TABLE 2.

Table 2. Comparison of Signature Complexity
                                          [12]      [13]      This work
Signature Generation Time (ms)            149.08    142.72    20.28
Extra storage space on signatures (MB)    2         20        32.8

5. CONCLUSION
In this paper, we used an efficient security system for the big data in the cloud, so the data in the cloud are kept safe.
The encryption method used is the Hyper Elliptic Curve Cryptosystem, which uses the mathematical concepts of hyper elliptic curves to encrypt the data. This work is
done with the help of CentOS and the Hortonworks Sandbox. The vibrant features of Java can be used to turn the theory into reality. This paper also considered the download of data from the cloud after clustering it.

REFERENCES
[1] M. Armbrust, A. Fox, R. Griffith, A. D. Joseph, R. Katz, A. Konwinski, G. Lee, D. Patterson, A. Rabkin, I. Stoica, and M. Zaharia, Above the Clouds: A Berkeley View of Cloud Computing, EECS, University of California, Berkeley, Tech. Rep., 2009.
[2] S. Pandey, L. Wu, S. Guru, and R. Buyya, A Particle Swarm Optimization (PSO)-based Heuristic for Scheduling Workflow Applications in Cloud Computing Environment, in Proc. of IEEE AINA, 2010.
[3] E. E. Schadt, M. D. Linderman, J. Sorenson, L. Lee, and G. P. Nolan, Computational Solutions to Large-scale Data Management and Analysis, Nat. Rev. Genet., vol. 11, no. 9, pp. 647-657, Sep. 2010.
[4] R. J. Brunner, S. G. Djorgovski, T. A. Prince, and A. S. Szalay, Massive Datasets in Astronomy, in Handbook of Massive Data Sets, J. Abello, P. M. Pardalos, and M. G. C. Resende, Eds. Norwell, MA, USA: Kluwer Academic Publishers, 2002, pp. 931-979.
[5] M. Cardosa, C. Wang, A. Nangia, A. Chandra, and J. Weissman, Exploring MapReduce Efficiency with Highly-Distributed Data, in Proc. of ACM MapReduce, 2011.
[6] M. Hajjat, X. Sun, Y. E. Sung, D. Maltz, and S. Rao, Cloudward Bound: Planning for Beneficial Migration of Enterprise Applications to the Cloud, in Proc. of ACM SIGCOMM, August 2010.
[7] X. Cheng and J. Liu, Load-Balanced Migration of Social Media to Content Clouds, in Proc. of ACM NOSSDAV, June 2011.
[8] Y. Wu, C. Wu, B. Li, L. Zhang, Z. Li, and F. Lau, Scaling Social Media Applications into Geo-Distributed Clouds, in Proc. of IEEE INFOCOM, Mar. 2012.
[9] B. Cho and I. Gupta, New Algorithms for Planning Bulk Transfer via Internet and Shipping Networks, in Proc. of IEEE ICDCS, 2010.
[10] B. Cho and I. Gupta, Budget-Constrained Bulk Data Transfer via Internet and Shipping Networks, in Proc. of ACM ICAC, 2011.
[11] J.
Scholten and F. Vercauteren, An Introduction to Elliptic and Hyperelliptic Curve Cryptography and the NTRU Cryptosystem.
[12] B. Wang, B. Li, and H. Li, Oruta: Privacy-Preserving Public Auditing for Shared Data in the Cloud, in Proc. of IEEE Cloud, June 2012, pp. 295-302.
[13] B. Wang, B. Li, and H. Li, Knox: Privacy-Preserving Auditing for Shared Data with Large Groups in the Cloud, in Proc. of ACNS, 2012, pp. 507-525.