Optimization of Large Data in Cloud computing using Replication Methods




Vijaya-Kumar-C, Dr. G.A. Ramachandhra
Computer Science and Technology, Sri Krishnadevaraya University, Anantapuramu, Andhra Pradesh, India

Abstract: Cloud computing provides dynamically adaptable and frequently used virtual resources over the Web. Different data-intensive applications with high quality-of-service (QoS) requirements can be developed on it [1]-[3]. To reduce data corruption and improve the performance of cloud computing, we propose QoS-aware data replication (QADR) algorithms [4]. We proceed in two ways: the first algorithm applies the intuitive idea of high-QoS first-replication, performing data replication using partitions. To achieve the two minimum-cost objectives, the first algorithm transforms the QADR problem into the well-known minimum-cost maximum-flow (MCMF) problem. Because a cloud computing system has a large number of nodes, we compress the data using data-block storage to increase performance. We also propose node combination techniques to reduce the possibly large data replication time. Finally, simulation experiments are performed to demonstrate the effectiveness of the proposed algorithms in data replication and recovery. To overcome hardware failures, we adopt extensions of the partitioning and compression techniques. Whenever a hardware failure occurs, the QoS requirement of the application cannot be supported continuously. With a large number of nodes in the cloud computing system, it is difficult to ask that all nodes have the same performance and capacity in their CPUs, memory, and disks. For example, Google runs a realistically heterogeneous cloud, which provides different infrastructure along with resource types to satisfy users' requirements [6]. Under node heterogeneity, the data of a high-QoS application may be replicated on a low-performance node, and data replication run from a low-performance node suffers slow communication.
Fig. 1. Architecture of HDFS

Keywords: Cloud computing, data-intensive application, partitions, quality of service, data replication, node combination.

1. INTRODUCTION
A new generation of data-intensive applications that require high-speed, ultra-low-latency Internet infrastructure to ensure massive amounts of uninterrupted data ingestion, real-time analysis, and cost efficiency at scale is surpassing the capabilities of traditional virtualized public clouds [2]. Virtualization brings the processing overhead of its hypervisor layer and resource constraints resulting from shared hardware. The existing Internet provides us content in the form of videos, emails, and information served up in web pages. With cloud computing, the next generation of the Internet will allow us to "buy" IT services from a web portal [3]. The new generation of high-performance computing centers does not only provide traditional high-performance computing, nor is it only a high-performance equipment solution; the management of resources, users, and virtualization, as well as dynamic resource generation and recycling, must also be taken into account. Apache Hadoop is an emerging cloud computing platform dedicated to data-intensive applications [6]. A large-volume data application such as Google, Amazon, or Microsoft may consist of hundreds or thousands of server machines, each storing part of the file system's data. There is thus a huge number of components, and each component has a non-trivial probability of failure (Fig. 1). To solve the problem, a greedy algorithm has been proposed: if application 'n' has QoS requirements, it considers the next 'n-1' applications to perform replications. Using this algorithm we cannot achieve minimum cost and maximum results. To obtain optimal solutions for DSQS, we use stagecoach dynamic programming, so we can solve more complex problems and find optimal solutions more efficiently. Comparing HQADR and DRQS, DRQS reduces the disk size and also reduces the retrieval time.
The complexity of solving the problem depends on the number of nodes existing on the server. We can control web traffic by granting users permission to access only particular disks; with that access a user cannot reach all the servers, so we can increase the scalability and capability of our servers. When working on big data in a cloud computing system, some issues arise, such as the growth of data after replication and the creation and updating of the servers.

www.ijcsit.com 3034
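The high-QoS first-replication idea mentioned above can be sketched as a short greedy routine. This is a minimal illustration only, not the paper's implementation; the request tuples, node capacities, and per-node costs are hypothetical.

```python
# Hypothetical sketch of high-QoS first-replication: requests with the
# strictest QoS are served first, each taking the cheapest qualified node
# that still has replication space.

def greedy_replicate(requests, nodes):
    """requests: list of (app_id, qos_rank), lower rank = stricter QoS.
    nodes: list of dicts with replication 'capacity' and access 'cost'.
    Returns {app_id: node_index} for the requests that could be placed."""
    placement = {}
    # Serve the strictest-QoS applications first.
    for app_id, _qos in sorted(requests, key=lambda r: r[1]):
        # Among nodes that still have space, pick the cheapest qualified one.
        candidates = [i for i, n in enumerate(nodes) if n["capacity"] > 0]
        if not candidates:
            break  # no qualified node left; remaining requests go unserved
        best = min(candidates, key=lambda i: nodes[i]["cost"])
        nodes[best]["capacity"] -= 1
        placement[app_id] = best
    return placement

nodes = [{"capacity": 1, "cost": 5},   # fast node, room for one replica
         {"capacity": 2, "cost": 9}]   # slower node, room for two
print(greedy_replicate([("app-low", 2), ("app-high", 1)], nodes))
# → {'app-high': 0, 'app-low': 1}
```

As the introduction notes, such a greedy pass does not guarantee minimum total cost, which is what motivates the flow-based formulation.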

II. PRELIMINARIES
2.1 System Model
We refer to the architecture of the Hadoop Distributed File System (HDFS), which is designed for high fault tolerance on low-cost hardware. Using the HDFS architecture we developed the data replication algorithm, and we treat the Google File System as having the same properties [6]. In HDFS there is a single NameNode and a number of DataNodes; these DataNodes and the NameNode together form a rack. The NameNode maintains the mapping of data blocks to DataNodes. A DataNode consists of one or more data blocks, and it may be desirable to move a DataNode from one NameNode to another. When an application is executed on the DataNodes, it is processed from the data blocks; the application acts as a client and sends requests to the DataNodes, and the DataNodes forward the requests to the NameNode. For the network topology, all switches use the Spanning Tree Protocol (STP) for communication.

2.2 Related Work
To reduce failures in cloud computing we introduce the save-status technique, with which we can tolerate server failures. Hadoop file systems save the data blocks in NameNodes, or we can save the file system as the user needs. Whenever failures occur, the file system, data blocks, and other directories are restored using the save-status technique. Replication methods protect and store data blocks against disk failures [7]. In HDFS the data is stored in data blocks, and each data block is created as two replicas stored on different DataNodes; the data block on the original DataNode holds the original data. Using the Back method we can obtain data replica updates and data access in process, which can improve data updates to the users and reduce the response time. However, a three-replica replication strategy is used to reduce the data cost, and it applies a reliability management mechanism.
Based on this mechanism, to activate a replica we verify that the minimum number of replications is stored, so that we reduce storage space consumption. Subject to the storage constraint, we use distributed paging, which works on the dynamic allocation of copies of data blocks in the distributed network. We have to minimize the total communication cost of the sequence of write and read accesses that send requests from the rack. In this work we define the positions of the replicas of each data object so that users can access the data objects from the server. The DPQS problem is proved NP-complete; we can add and modify objects without increasing the size of the data objects, and so improve the QoS [8]. The proposed algorithm takes a long time to execute; to reduce the execution time we introduce the cover set. The cover set of a server S is the set of servers that has served requests from r, within S(r), when S(r) is requested from the QoS. Using node combination in a dynamic network, we find the shortest path of nodes; Nij(t) changes as a function of time. For dynamic replication D = (d0, dr1, dr2) with discrete time costs, a set of nodes N (|N| = n), node set N = {1, 2, 3, 4, ...}, and a set of name nodes S (|S| = m), we propose an integrated solution for dynamic data replication that, as requests arrive, chooses the locations of the replicas to be inserted or deleted, aiming at workload balancing. Creating replicas consumes storage and network resources, and the node workload increases or decreases depending on the creation of the replicas and their sizes. To maintain the creation of replicas efficiently in a data-oriented manner, we consider parameters that identify the total disk space and the space used for the creation of new replicas [9]. When a node receives a read request from the client, it determines the specified level and the number of replicas stored in the save checkpoint [11].
When each replica responds with the requested data, it is compared with the save checkpoint and the most recent version of the data (Figure 1).

3. DEFINING THE "FREQUENTLY USED" DATA REPLICAS
Many clients mainly focus on the QoS requirement for frequently accessed data sets and attach much importance to the frequently used data sets. An important data set needs more replicas to be created, so clients send a number of retrieval requests in fixed intervals; from these, the importance of the data sets of the system is defined as Ds(i), i = 1...n, where n is the number of data sets in the system. The total number of requests for data retrievals in fixed intervals is Do = Σ Ds(i). Do(i), similar to Daccess, is defined as the sum of the network communication latency and the disk access latency for retrieving a data block replica from node i to node rj; it includes the time to transmit a data replica from rj to node i and the time to write the data block replica into disk i. For a limited amount of disk space, the requesting nodes place their data block replicas at the same qualified nodes.

Time complexity: We analyze the time complexity of the DQOS algorithm, which can be divided into two parts: importance replica requests and frequent replica requests. When the QoS requirement of the requesting node is "high", the data replication request is both a frequently used replica request and an importance replica request in the data block.
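The quantities defined above can be written out directly. This is a minimal sketch; the request counts and latency components are made up for illustration, while Ds(i), Do, and the latency terms follow the definitions in the text.

```python
# Illustrative only: Ds(i) counts retrieval requests for data set i in a
# fixed interval; Do totals them. Fetching a replica from node rj to node i
# costs network latency + disk read latency + time to write it to disk i.

def total_requests(ds):
    """Do = sum of Ds(i) over all data sets."""
    return sum(ds)

def access_latency(net, disk_read, disk_write):
    """Network communication latency + disk access latency for the
    transfer, plus the time to write the replica into the local disk."""
    return net + disk_read + disk_write

ds = [120, 75, 30]                    # hypothetical Ds(1), Ds(2), Ds(3)
print(total_requests(ds))             # → 225
print(access_latency(4.0, 1.5, 2.0))  # → 7.5
```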

In the case of creating replicas, the save point is again used to determine the number of nodes required to respond with a server response before a duplicate replication is added. If the node is required for the save point, the user can create a dynamic replication; the replica node then comes online so that multiple users can access it easily at any time in the server response.

Let d[v] be the length of the current shortest path from the server to v, λ(s,v) the length of the shortest path, and π(v) the predecessor of v on the shortest path from s to v.

Relax(u, v, w):
  if d[v] > d[u] + w(u,v)
    d[v] = d[u] + w(u,v)
    π[v] = u

3.1 Data replication creation approach: Before creating a data replica we need to store the data sets in space that we have allocated based on the frequently used replicas and also the important replicas [11]. Our approach creates and stores the data replicas in data sets and finds how many replicas should be created for each data set.

Algorithm:
1. Sort the data files in descending order of size.
2. Define the data sets as i.
3. For every data set i, calculate the number of replications to be created using n(i) = [s[i] / size(i)].
4. Using the frequency data, separate the used data from the unused data.
5. Identify the used data as stored in space A.
6. Space A(i) = n(i) - n(j) * size(j).
7. Calculate the total used space, Space A = Σ A(i).
8. Select the first data set in the list in descending order of size(i).
9. Calculate the unused space using space A1 = space A - size(i).
10. Go to step 5 until the data sets are completed.

Lemma: The relaxation operation maintains the invariant that d[v] ≥ λ(s,v) for all v ∈ V.

Table 1 - Summary of notations
Set    Description
S      Set of all the storage nodes
|S|    Total number of nodes
Sd     Set of storage nodes that contain the data replica
Sq     Set of storage nodes that are qualified
Su     Set of nodes that are not qualified
Str    Total number of data replications
Ssm    Total number of active nodes
Sc     The node nearest to the data replica

Input: a set of requested nodes Sr.
Output: an optimal placement for the QoS data replicas.
1: Sd ← Ø
2: for each requested node i in Sc do
3:   find the qualified nodes for i and the nearest node
4:   Sd ← Sd ∪ Sq(ri)
5: end for
6: Sd and Sc form the network flow graph
7: use the data values on the edges of the network flow graph
8: apply the data values and use the optimal placement for the QoS data replicas and D_MCMF
9: Str ← Ø and Sqr ← Ø
10: for each requested replica in Sd do
11:   Ssm ← the number of active data nodes
12:   if Ssm < Sr then
13:     Str ← Str ∪ Sd
14:     S ← S - Suq (the unqualified nodes)
15:     Sr ← Sr ∪ Suq
16:   end if
17: end for
18: use the data values on the edges of the network flow graph
19: apply the optimal placement for the QoS data replicas.
Optimal replica placement algorithm.

3.2 Deactivated replication method
Data replication consumes much of the space on disk; whenever a data replication is not used, it falls into the unused disk space and can be deleted to free the storage space. We propose an algorithm that deletes the unused replications to reduce the data sets and increase the disk space. We define s(d) = 1 when a replication d is activated and 1 - s(d) when it is deactivated [13]. For 'n' stored data replications we can define Σ i=1..n {1 - s(d)} = n - Σ s(d) = 1, from which we get S1(d) = (1 - s(d)) / (n - 1).

Algorithm:
1. Sort the data files in a list in descending order of size.
2. Define the data sets as i, with data set size = iλ for λ = 1, 2, ..., n-1.
3. λ = 1.
4. If λ = n then the algorithm exits.
5. Calculate the number of deactivated replicas by p(iλ) = min([s(iλ) / size(iλ)], q(iλ) - 1).
6. Quality Q = s(iλ) - p(iλ) * size(iλ); the data sets are stored and shared into data blocks.
7. λ = λ + 1.
8. Go to step 4.
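The replica-count calculation of Section 3.1 can be sketched as follows, assuming a per-data-set space budget s[i]. The data-set names and sizes are hypothetical, and the leftover-space bookkeeping is a simplification of steps 5-10.

```python
# Hypothetical sketch of Section 3.1: data sets are taken in descending
# order of size, each receives floor(s[i] / size(i)) replicas, and the
# space not filled by whole replicas is accumulated as unused space.

def plan_replicas(datasets, budget_per_set):
    """datasets: {name: size}; budget_per_set: space allowance s[i] given
    to each data set. Returns ({name: replica count}, total unused space)."""
    counts, unused = {}, 0
    # step 1: take the data sets in descending order of size
    for name in sorted(datasets, key=datasets.get, reverse=True):
        size = datasets[name]
        counts[name] = budget_per_set // size        # step 3: n(i) = s[i]/size(i)
        unused += budget_per_set - counts[name] * size  # space left over
    return counts, unused

counts, unused = plan_replicas({"a": 40, "b": 25, "c": 12}, 100)
print(counts)   # → {'a': 2, 'b': 4, 'c': 8}
print(unused)   # → 24
```

Larger data sets thus get fewer replicas under the same budget, which matches the intent of sorting by size before allocating.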

Defining which data replications are deactivated: Let S be the set of storage nodes that represents the data set, with i, j the numbers of data replications that are deactivated; Ps(i-j), where (i-j) indicates combinations of the S and R data sets, denotes the data requests to the server [15]. The data replications stored at the nodes of the data set S - Ps(i-j) can be deleted from the data blocks and are updated automatically; this finally increases the disk space and improves the performance of the data blocks.

4. SIMULATION IMPLEMENTATION
In the simulation implementation, we use a tree structure for the network topology. Using the STP protocol we form a network topology that provides inter-rack and intra-rack node communication. In HDFS we implemented the rack formation and the node distribution; we consider 100 racks, each containing one switch, distributed in a unit square plane, on which we run the simulations. To organize the 100 racks, one rack is designated as the central rack; 4000 nodes are randomly deployed, the specified nodes are in active mode, and the remaining nodes are in deactivated mode. For the deactivated mode we consider the unqualified nodes of the network topology for the "frequently used" data. Whenever the nodes in the same rack include both active and deactivated nodes, we occupy locations only for the active nodes (Figure 2). We used parameter settings to generate the network topology and obtain the simulation results. In this network topology we store the maximum number of data block replicas, and each node has replication space available. For the experiment, we select a random number of data blocks in the intervals [0,50] and [51,100], and we use the same data block intervals for the QoS requirements of an application at node Sd. To analyze the data replicas we use a lower bound for short local-disk access times and an upper bound for long local-disk access times within the specified QoS intervals.
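The simulated topology described in this section can be generated in a few lines. The 100-rack, 4000-node, half-active numbers follow the text; the uniform rack placement, the round-robin assignment of nodes to racks, and the choice of which nodes are active are illustrative assumptions.

```python
import random

# Sketch of the simulated topology: 100 racks placed in a unit square,
# 4000 nodes spread round-robin across them, and 2000 nodes marked active
# (the "frequently used" replicas) with the rest deactivated.
random.seed(7)  # fixed seed so the layout is reproducible

racks = [(random.random(), random.random()) for _ in range(100)]  # (x, y)
nodes = [{"rack": i % 100, "active": i < 2000} for i in range(4000)]

active = sum(n["active"] for n in nodes)
print(len(racks), len(nodes), active)   # → 100 4000 2000
```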
In the simulation results, we randomly use 2000 nodes to run the big data applications instead of 4000 nodes. We treat the frequently used data replications as active nodes and make the remaining nodes deactivated. For this experiment we consider 500, 1000, 1500, and 2000 requested replica nodes. We take the disk access latency into account and apply the workloads in sorted order; whenever the rack space receives a request from one or more nodes, the nodes are distributed with their respective data replications [14]. For the write and read operations, we read a data block at node i from the particular disk, and node j transmits data blocks that are received according to the disk queue. We use the lower-bound and upper-bound parameters to minimize the requested data replications from the rack space. In the simulation experiments we used the [0,50] and [51,100] simulation runs and obtained mean result values of 50 (Figure 3). In the simulation, the total replication cost for different numbers of requested nodes, from 500 to 2000, is configured for the rack space with different types of disk access time. The figure shows that minimizing the total number of requested nodes also minimizes the total replication cost. We consider HQRS, the Hadoop replication algorithm, to replace the data blocks in the rack-space failure scenario. Our DRQS replication algorithm takes the failure scenario into account and also improves the performance of the rack space. The proposed algorithm reduces both the total replication cost and the total time.

CONCLUSIONS AND FUTURE WORK
We have investigated QoS data replication for big data applications in cloud computing. To solve the HQRS problem, we implemented the DRQS algorithm to improve the performance of QoS data replication in cloud computing. In future work we will implement the DRQS algorithm on a real cloud computing platform, and also decrease the storage in data centers and the energy consumption to save power, which is important for green cloud computing.
ACKNOWLEDGMENT
This research was supported by Sri Krishnadevaraya University, Anantapuramu, Andhra Pradesh.

REFERENCES
[1] http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-28.pdf
[2] M. Creeger, "Cloud Computing: An Overview," Queue, vol. 7, no. 5, pp. 2:3-2:4.

[3] S. Venugopal, J. Broberg, and I. Brandic, "Cloud Computing and Emerging IT Platforms: Vision, Hype, and Reality for Delivering Computing as the 5th Utility," Future Gener. Comput. Syst.
[4] B. Kemme, R. Jiménez-Peris, and M. Patiño-Martínez, "Database Replication," Synthesis Lectures on Data Management, Morgan & Claypool Publishers.
[5] S. Savinov and K. Daudjee, "Dynamic database replica provisioning through virtualization," in CloudDB.
[6] Apache Hadoop Project. http://hadoop.apache.org
[7] E. Pinheiro, W.-D. Weber, and L. A. Barroso, "Failure Trends in a Large Disk Drive Population."
[8] A. Gao and L. Diao, "Lazy Update Propagation for Data Replication in Cloud Computing."
[9] Amazon EC2. http://aws.amazon.com/ec2/
[10] X. Tang and J. Xu, "QoS-Aware Replica Placement for Content Distribution."
[11] R.-S. Chang and P.-H. Chen, "Complete and fragmented replica selection and retrieval in data Grids," Future Generation Computer Systems.
[12] K. Ranganathan and I. Foster, "Identifying dynamic replication strategies for a high performance data Grid," Grid Computing.
[13] M. Karlsson, C. Karamanolis, and M. Mahalingam, "A framework for evaluating replica placement algorithms," Technical Report, Hewlett-Packard Labs.
[14] A. Sodan, "Adaptive scheduling for QoS virtual machines under different resource availability: first experiences," 14th Workshop on Job Scheduling Strategies for Parallel Processing, LNCS 5798, Rome, Italy, pp. 259-279, 2009.
[15] D. G. Synodinos, "LHC Grid: Data storage and analysis for the largest scientific instrument on the planet." http://www.infoq.com/articles/lhc-grid

C. Vijaya Kumar is currently a Ph.D. student in the Department of Computer Science and Technology at Sri Krishnadevaraya University. He received the B.Sc. degree in computer science from Sri Krishnadevaraya University and the M.C.A. degree in computer science and information engineering from Sri Venkateswara University, Tirupathi, in 2008. His research interests include cloud computing and Hadoop.

Dr. G.A. Ramachandra is an associate professor at Sri Krishnadevaraya University. His research interests are data mining and computer networks.