A cost-effective mechanism for Cloud data reliability management based on proactive replica checking
12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid 2012)

Wenhao Li*, Yun Yang**,*, Jinjun Chen*,***, Dong Yuan*
* Faculty of Information and Communication Technologies, Swinburne University of Technology, Australia
** School of Computer Science, Anhui University, Hefei, Anhui, China
*** School of Systems, Management and Leadership, University of Technology Sydney, Australia
{wli, yyang, dyuan}@swin.edu.au, [email protected]

Abstract: In current Cloud computing environments, the management of data reliability has become a challenge. For data-intensive scientific applications, storing data in the Cloud with the typical 3-replica replication strategy for managing data reliability would incur huge storage cost. To address this issue, in this paper we present a novel cost-effective data reliability management mechanism named PRCR, which proactively checks the availability of replicas to maintain data reliability. Our simulation indicates that, compared with the typical 3-replica replication strategy, PRCR can reduce the storage space consumption by one-third to two-thirds, hence reducing the storage cost in the Cloud significantly.

Keywords: data replication, proactive replica checking, data reliability, cost-effective storage, Cloud computing

I. INTRODUCTION

The size of Cloud storage is expanding at a dramatic speed. It is estimated that the data stored in the Cloud will reach 0.8 ZB by 2015, while even more data are stored or processed temporarily along the way. Meanwhile, with the development of the Cloud computing paradigm, Cloud-based applications have put forward higher demands on Cloud storage. While the data reliability requirement must be met in the first place, data in the Cloud need to be stored with higher cost-effectiveness.

Data reliability is defined as the probability that the data are available in the system for a certain period. It is an important issue in data storage systems, as it indicates the ability to keep data persistent and fully accessible. Due to the accelerating growth of Cloud data, the management of data reliability in the Cloud has become a challenge. Currently, data replication is one of the most popular approaches for supporting data reliability. However, the storage space consumed by replicas incurs a huge cost, and it is believed that this cost is eventually passed on to the users.

In this paper, we present a novel cost-effective data reliability management mechanism based on proactive replica checking, named PRCR, for reducing the storage space consumption and hence the storage cost of data-intensive applications in the Cloud. The objective of PRCR is to reduce the number of replicas stored in the Cloud while meeting the data reliability requirement. Compared with existing data reliability solutions in the Cloud, PRCR has the following features:

- PRCR is able to provide data reliability assurance at the file level.
- PRCR is able to provide data reliability management in a more cost-effective way. By applying PRCR, a wide range of data reliability requirements can be assured with at most two replicas stored in the Cloud.

By applying benchmarks from Amazon Web Services, we evaluate the capacity of PRCR nodes and simulate the reliability management process using PRCR with the data generated by a data-intensive scientific application. The results are compared with the typical 3-replica replication strategy.
It is shown that PRCR can reduce one-third to two-thirds of the storage consumption currently incurred in the Cloud, and hence the storage cost is reduced accordingly.

The rest of the paper is organised as follows. The related works on data reliability, data replication and cost-effective data storage are addressed in Section II. The motivating example and the problem analysis are stated in Section III. The concepts and model of the PRCR mechanism and a high-level design for PRCR are presented in Section IV. The detailed design of PRCR is described in Section V. The evaluation of PRCR is illustrated in Section VI. Finally, conclusions and future work are summarised in Section VII.

II. RELATED WORKS

Data reliability is one of the key research issues in distributed computing. In order to ensure reliable processing and storage of data, many efforts have been made in different aspects [2, 6, 14], and some previous works have provided valuable references. In [5], a detailed survey on grid reliability identifies important issues and reviews the progress of data reliability research in grid environments. In [11, 15], analytical data reliability models are proposed and comprehensively studied. In [11], the data reliability of the system is measured by the data missing rate and file missing rate, and the issue of maximising data reliability with limited storage capacity is investigated. In [15], an analytical replication model is proposed for determining the optimal number of replica servers, catalogue servers, and catalogue sizes to guarantee a given overall data reliability.

Among the existing research on data reliability in distributed systems, data replication has always been considered an important approach. Several works on large-scale distributed storage systems have been proposed: for example, [10, 16] investigate data replication in a Grid environment and [8] investigates data replication in a
P2P system, etc. In Cloud computing, some related works have also been conducted. For example, in [17] Cloud storage and data replication management issues are investigated and optimised mainly from the database perspective. Additionally, data replication technologies are also widely used in current commercial Cloud systems for performance and reliability purposes. Some typical examples include Amazon Simple Storage Service (Amazon S3) [3], the Google File System (GFS) [7] and the Hadoop Distributed File System (HDFS) [4]. Although data replication has been widely used, it has the side effect of consuming storage resources and incurring significant additional cost. Amazon S3 has published its Reduced Redundancy Storage (RRS) solution to reduce the storage cost [3], in which only a lower level of data reliability is assured for a lower price.

Some of our previous works have made contributions to cost-effective data reliability management based on data replication. For example, in [12] we propose a cost-effective dynamic data replication strategy for data reliability in Cloud data centres, in which an incremental replication method is applied to reduce the average replica number while meeting the data reliability requirement. However, for long-term storage, this strategy still generates 3 replicas for the data, so its ability to reduce the storage cost is limited.

Our research in this paper is closely related to the aspects mentioned above. Compared with the Amazon S3 RRS service, PRCR provides a more flexible replication solution. It allows a file to be stored with only one replica when applicable, and a variety of data reliability requirements can be met without jeopardy when two replicas are stored. In addition, PRCR can be adapted to a wide range of Cloud systems. For example, it can be applied over GFS and HDFS for tuning the replication level and reducing the storage consumption. Compared with the data replication strategy in [12], PRCR stores no more than two replicas for each file in the Cloud. In addition, PRCR also incorporates the distinctive characteristics of Cloud storage environments. First, Cloud storage systems are formed by several large-scale data centres rather than many heterogeneous nodes. Second, data properties such as data and metadata locations cannot be derived by users because of virtualisation. These two characteristics have led to approaches for achieving data reliability in the Cloud that differ from those of other distributed systems.

III. MOTIVATING EXAMPLE AND PROBLEM ANALYSIS

A. Motivating Example

The Cloud offers excellent storage capacity which, from the user's perspective, is unlimited for storing all the data generated during the execution of applications. This feature of the Cloud is very desirable, especially for scientific applications with data-intensive characteristics. For example, a series of pulsar searching applications has been conducted by the Astrophysics Group at Swinburne University of Technology. Based on terabytes of observation data obtained from the Parkes Radio Telescope [1], many complex and time-consuming tasks have been carried out and large volumes of data are generated during their execution. Fig. 1 shows the dataflow of a pulsar searching workflow example [13].

There are several steps in the pulsar searching workflow. First, from the raw signal data received from the Parkes Telescope, 13 beam files are extracted and compressed. The beam files contain the pulsar signals, which are dispersed by the interstellar medium.
To counteract this effect, in the second step, the de-dispersion process conducts different dispersion trials and a minimum of 1200 de-dispersion files per beam are generated for storing the results. Next, for binary pulsar searching purposes, every de-dispersion file goes through an additional acceleration step, which generates 5-10 accelerated de-dispersion files for each de-dispersion file. Based on the accelerated de-dispersion files, the seeking algorithm can be applied to search for pulsar candidates. In each beam, a candidate list of pulsars is generated. Furthermore, by comparing the candidates generated from different beam files, the top 100 pulsar candidates are selected. These final pulsar candidates are folded into XML files with their feature signals and finally submitted to scientists as the final result.

The pulsar searching applications currently run on the Swinburne high-performance supercomputing facility. Due to the storage limitation, all the generated data are deleted after having been used once, and only the beam data which are extracted from the raw telescope data are stored. However, by applying the Cloud for storage of the pulsar searching data, the storage limitation for these applications can be completely eliminated, and much more generated data can be stored for reuse.

File name                           File size and quantity
Extracted & compressed beam files   2.1GB each, 13 beams in total
De-dispersion files                 8.7MB each, 1200 for each beam
Accelerated de-dispersion files     8.7MB each, 5-10 for each de-dispersion file
Seek result files                   16KB each, 1 for each de-dispersion file
Candidate lists                     1KB each, 1 for each beam
XML files                           25KB each, 100 in total

Figure 1. Dataflow of a pulsar searching workflow instance

When we were trying to store all the pulsar searching data in the Cloud, a problem emerged: the cost of hiring Cloud storage resources for these data could be huge. For example, in a typical pulsar searching workflow instance as shown in Fig. 1, a large number of files with a total size of over 230GB are generated.
According to the latest Amazon S3 storage prices, storing these data using the S3 standard storage service costs about US$12.65 per month (i.e., $0.055 per GB per month). In order to meet the needs of pulsar searching applications, we would probably need to store the data generated by several hundred such workflow instances, so thousands of dollars would be spent on data storage every month.

B. Problem Analysis

The cost of storing data is inevitable, but there is still room for reducing it. When we look into the pulsar searching example, there are two major factors that lead to the high storage cost.

First, current Cloud storage systems generally use data replication for data reliability. For example, storage systems such as Amazon S3 [3], the Google File System [7] and the Hadoop Distributed File System [4] all adopt similar multi-replica data replication strategies, which typically store three replicas by default for all data. With the typical 3-replica replication strategy, three replicas are generated at the beginning of storage and placed at three different locations. Such a replication strategy consumes a huge amount of storage space, and users eventually have to pay for it. Under the typical 3-replica replication strategy, storing one TB of data needs three TB of storage space, which incurs a 200% additional cost. For data-intensive applications such as the pulsar searching example, the extra money spent could be huge.

Second, according to their importance and storage duration, the data generated in a pulsar searching workflow instance can be divided into two major types. One type of data is critical and would be reused for a long time. For example, the extracted beam files and the XML files are the input and output of a pulsar searching workflow, and the de-dispersion files are frequently used generated data [19], based on which different seeking algorithms can be applied and some other generated data can be derived. These data record the current state of the universe, which is very important and can be used for long-term analysis. For this kind of data, high data reliability assurance and recovery ability are necessary. The other type of data is only used in the short term and lacks long-term value. For example, the accelerated de-dispersion files, seek result files and the candidate lists all belong to this type. For this type of data, relatively low reliability assurance can be applied and recovery ability is most likely unnecessary. However, under the typical 3-replica replication strategy, all these data are replicated the same number of times, which incurs unnecessary extra storage cost.

In order to reduce the storage cost of data-intensive scientific applications in the Cloud, both of the abovementioned major factors must be considered, and a new replication mechanism should be proposed to replace the typical 3-replica replication strategy. In the rest of the paper, we provide a feasible solution.

IV. PROACTIVE RELIABILITY MANAGEMENT MECHANISM

Instead of the typical 3-replica replication strategy, there is another way in which we can provide data reliability with fewer replicas. In this section we propose a novel cost-effective reliability management mechanism for Cloud data storage named PRCR (Proactive Replica Checking for Reliability) at the high level, with the detailed design presented in Section V.

A. Concepts and Reliability Management Model of PRCR

The idea of PRCR follows the principles described below.
First, the reliability of data in the Cloud follows the exponential distribution. As the basic components of the Cloud, data centres contain a large number of commodity storage facilities which are unified as Cloud storage units. The data in the Cloud are stored in these storage units, and each copy of the data is called a replica of the data. According to classical theories [9, 18], we assume that the reliability of a Cloud storage unit over period T follows the exponential distribution:

R(T) = e^{-λT}

where λ is the failure rate of a Cloud storage unit, which stands for the expected number of failures of a Cloud storage unit in a certain period. The replicas in Cloud storage units have the same reliability as the storage units, because data loss is primarily caused by the failure of Cloud storage units. For example, if a data centre experiences 100 disk failures out of 10,000 disks a year, the average failure rate is 0.01/year, and thus the reliability of each replica is about 99% over one year. In addition, the reliability assurance of data changes with the storage duration of the data. If the data are stored for more than one year, the reliability assurance of the data will decrease while the probability of data loss will increase.

Second, the storage duration for meeting the reliability requirement can be deduced. Assume that multiple replicas are stored in different storage units. According to the reliability equation of a Cloud storage unit (which is also the reliability equation of a single replica), the reliability of data with multiple replicas can be derived from equation (1):

X = 1 - (1 - e^{-λT_k})^k    (1)

In this equation, variable X is the data reliability requirement, variable k is the number of replicas and variable T_k is the storage duration of the data with k replicas. The right-hand side of this equation describes the probability that the data do not lose all their replicas during the storage duration T_k, when k replicas are stored in Cloud storage units with failure rate λ. From this equation, it can be seen that given the number of replicas and the failure rate of the storage units, the storage duration of the data that still meets the reliability requirement can be derived.

Third, the reliability of the data has the memory-less property. As the exponential distribution has the memory-less property, the reliability of data in the storage units also has the memory-less property, in which P(T > s + t | T > s) = P(T > t) for all s, t > 0. That is to say, the reliability of data from time s to s+t is equivalent to the reliability of data from time 0 to t. According to this property of data reliability, as long as we can guarantee that the data is available at this very moment, the reliability of data for any period from that moment can be calculated.
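To make the reliability model above concrete, the following minimal Java sketch (our own illustration, not code from the paper; class and method names are assumptions) evaluates the single-replica reliability e^{-λT} and the k-replica reliability of equation (1) for the example failure rate of 0.01/year.

```java
/**
 * Minimal illustration of the PRCR reliability model (equation (1)).
 * Assumptions (ours): failure rate lambda is per year, duration T is in years.
 */
public class ReliabilityModel {

    // Reliability of a single Cloud storage unit (one replica) over duration T: e^(-lambda*T).
    static double replicaReliability(double lambda, double T) {
        return Math.exp(-lambda * T);
    }

    // Reliability of data stored with k replicas over duration T: 1 - (1 - e^(-lambda*T))^k.
    static double dataReliability(double lambda, double T, int k) {
        return 1.0 - Math.pow(1.0 - replicaReliability(lambda, T), k);
    }

    public static void main(String[] args) {
        double lambda = 0.01; // e.g. 100 failures out of 10,000 disks a year
        System.out.printf("1 replica,  1 year:  %.6f%n", dataReliability(lambda, 1.0, 1)); // ~0.990050
        System.out.printf("2 replicas, 1 year:  %.6f%n", dataReliability(lambda, 1.0, 2)); // ~0.999901
        System.out.printf("2 replicas, 5 years: %.6f%n", dataReliability(lambda, 5.0, 2)); // ~0.997621
    }
}
```

As the sample output suggests, the assurance given by a fixed number of replicas decays as the storage duration grows, which is exactly why the scan interval in PRCR is tied to the storage duration.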
Based on the principles described above, the PRCR mechanism is proposed. In PRCR, the data in the Cloud are managed in different types according to their expected storage durations and reliability requirements: for data that are only for short-term storage and require low data reliability, one replica is stored in the Cloud; for data that are for long-term use and have a data reliability requirement higher than the reliability assurance of a single replica, two replicas are stored in the Cloud, where the management of the two replicas is based on proactive checking. In proactive checking, the availability of the data is checked before the reliability assurance drops below the reliability requirement. If one of the replicas is found to be unavailable, it can be recovered from the other replica. By applying PRCR, the reliability of all the data can be maintained to meet the reliability requirement. Compared with the typical 3-replica replication strategy, PRCR stores no more than two replicas for each file, so that the storage cost can be reduced significantly.

B. Overview of PRCR

PRCR runs on top of the virtual layer of the Cloud and manages data which are stored as files in different data centres. Fig. 2 shows the architecture of PRCR, which consists of two major parts: the PRCR node and the user interface.

Figure 2. PRCR architecture (user interface, PRCR node with replica management module and data table, and Cloud computing instances)

PRCR node: It is the core component of PRCR, which is responsible for the management of the replicas. As will be mentioned later, according to the number of files in the Cloud, PRCR may contain one or more PRCR nodes which are independent of each other. The PRCR node is composed of two major components: the data table and the replica management module.

Data table: It maintains the metadata of all data that each PRCR node manages. (The reliability of the data table itself is beyond the scope of this paper; a conventional primary-secondary backup mechanism may well serve the purpose.) For each file in the storage, four types of metadata attributes are maintained in the data table: file ID, time stamp, scan interval and replica addresses. The file ID is a unique identification of the file. The time stamp records the time when the last replica checking was conducted on the file. The scan interval is the time interval between two replica checking tasks of the file. Based on the time stamp and the scan interval, PRCR is able to determine the files that need to be checked, and according to the replica addresses, all replicas are able to be found. In the data table, each round of the scan is called a scan cycle. In each scan cycle, all of the metadata in the data table are sequentially scanned once.

Replica management module: It is the control component of the PRCR node, which is responsible for managing the metadata in the data table and cooperating with the Cloud computing instances (the virtual computing units in the Cloud) to process the checking tasks. In each scan cycle, the replica management module scans the metadata in the data table and determines whether each file needs to be checked. For files that need to be checked, the replica management module extracts their metadata from the data table and sends these metadata to the Cloud computing instances.
After the Cloud computing instances have finished the checking tasks, the replica management module receives the returned results and conducts further actions accordingly. In particular, if any replica is not available, the replica management module sends a replication request to the Cloud, which creates a new replica for the data, and updates the data table accordingly.

User interface: It is a very important component of PRCR, which is responsible for justifying reliability management requests and distributing accepted requests to different PRCR nodes. A reliability management request is a request for PRCR to manage a file, and it carries the metadata of the file. In addition to the four types of metadata attributes kept in the data table, a further metadata attribute called the expected storage duration is needed. It is an optional attribute of the file, which indicates the storage duration that the user expects. According to the expected storage duration, the user interface is able to justify whether the file needs to be managed by PRCR or not. In particular, if such management is unnecessary, the reliability management request is declined and the file is stored with only one replica in the Cloud.
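To illustrate the data table and the scan-cycle behaviour described above, the following minimal Java sketch (our own illustration, not code from the paper; all class and field names are assumptions) models one data-table entry and the check that the replica management module performs in each scan cycle.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;

/**
 * Illustrative model of one entry of the PRCR data table.
 * Field names are assumptions based on the four metadata attributes
 * described above: file ID, time stamp, scan interval, replica addresses.
 */
class DataTableEntry {
    final String fileId;                  // unique identification of the file
    Instant timeStamp;                    // time of the last replica checking
    final Duration scanInterval;          // time allowed between two replica checking tasks
    final List<String> replicaAddresses;  // addresses of the (at most two) replicas

    DataTableEntry(String fileId, Instant timeStamp, Duration scanInterval, List<String> replicaAddresses) {
        this.fileId = fileId;
        this.timeStamp = timeStamp;
        this.scanInterval = scanInterval;
        this.replicaAddresses = replicaAddresses;
    }

    // A file is due for proactive checking when the time since the last check
    // has reached its scan interval.
    boolean needsChecking(Instant now) {
        return !now.isBefore(timeStamp.plus(scanInterval));
    }
}
```

In each scan cycle, the replica management module would walk the table, evaluate needsChecking for every entry, and dispatch the due entries to Cloud computing instances.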
V. DETAILED DESIGN OF PRCR

In Section IV we presented the high-level design of PRCR. In this section, the detailed design is presented. First, we analyse the capacity of PRCR for managing files in the Cloud. Then, in order to maximise the capacity of PRCR, a metadata distribution algorithm is presented. Finally, by following the life cycle of a file, the working process of PRCR is described.

A. Capacity of PRCR

The capacity of PRCR stands for the maximum number of files that PRCR is able to manage. In order to manage the reliability of Cloud data, PRCR must have an excellent capacity so as to handle the huge number of files in the Cloud. Due to the limitation of the computing power of a single PRCR node and the different reliability requirements, PRCR usually contains several PRCR nodes, and the capacity of PRCR is the sum of the capacities of all PRCR nodes.

The capacity of PRCR is mainly determined by two parameters: the metadata scanning time and the scan cycle of the PRCR node. The metadata scanning time is the time taken to scan each metadata entry in the data table, and the scan cycle is the time in which all metadata in the data table of a PRCR node are sequentially scanned once. The capacity of PRCR can be presented with equation (2):

C = Σ_{i=1}^{N} SC_i / ST_i    (2)

In this equation, C is the capacity of PRCR, SC_i is the scan cycle of PRCR node i, ST_i is the metadata scanning time of PRCR node i and N is the number of PRCR nodes in PRCR. From this equation, it can be seen that with a shorter metadata scanning time and a longer scan cycle, more files can be managed by PRCR.
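As a rough, back-of-the-envelope reading of equation (2) (our own illustration, not a result from the paper), the following Java sketch estimates the capacity of a single PRCR node, assuming a scan cycle of one year and the 700 ns metadata scanning time benchmark reported later in Section VI.

```java
/**
 * Back-of-the-envelope capacity estimate for a single PRCR node (equation (2)).
 * Assumptions (ours): scan cycle of one year, metadata scanning time of 700 ns.
 */
public class CapacityEstimate {
    public static void main(String[] args) {
        double scanCycleNs = 365.0 * 24 * 3600 * 1e9; // one year expressed in nanoseconds
        double scanningTimeNs = 700.0;                // metadata scanning time per entry
        double capacity = scanCycleNs / scanningTimeNs;
        System.out.printf("Files manageable by one node: %.2e%n", capacity); // ~4.5e13
    }
}
```

Under these assumptions a single node could manage on the order of 10^13 files, which is consistent with the paper's claim that the capacity of PRCR nodes is very large.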
B. Metadata Distribution Algorithm

Due to the different reliability requirements and hence different scan cycles of the PRCR nodes, the capacity of PRCR may fluctuate significantly. By maximising the capacity of PRCR, the cost of running PRCR can be minimised, and thus the reliability management can be conducted cost-effectively. To address this issue, we design a metadata distribution algorithm for PRCR.

The metadata distribution algorithm is conducted within the user interface component of PRCR. It calculates the scan interval attribute and, if needed, distributes the reliability management request to one of the PRCR nodes. In order to maximise the capacity of PRCR, the algorithm follows the idea that each file should be managed by the most appropriate PRCR node. When a reliability management request is received, the metadata distribution algorithm first calculates the longest duration for which the file can be stored. If replication is needed, it then sends the request to the PRCR node whose scan cycle is smaller than but closest to this longest duration. The longest duration for which the file can be stored without proactive replica checking (the longest storage duration for short) indicates the time at which the reliability assurance drops to the level where the reliability requirement can no longer be met. The difference between the scan cycle of a PRCR node and the longest storage duration of a file indicates how long before the longest storage duration is reached the proactive replica checking is conducted. The reasons for sending the request to the PRCR node with the scan cycle smaller than but closest to the longest storage duration are: first, PRCR must conduct proactive replica checking before the longest storage duration of each file is reached; second, conducting proactive replica checking only when the longest storage duration of the file is about to be reached maximises the capacity of the PRCR service.

In addition, the design of the metadata distribution algorithm also considers the different types of data in the storage. As mentioned in Section III.B, there are two types of data generated by the scientific applications in the Cloud: short term and long term. The metadata distribution algorithm justifies the reliability management request by comparing the file's longest storage duration and its expected storage duration. If the expected storage duration is shorter than the longest storage duration with one replica, then no replication is needed, hence the reliability management request is declined. The longest storage duration of the file can be derived from equation (1) presented in Section IV. Specifically, for no more than two replicas stored in the Cloud, the equation can be transformed into equations (3) and (4) respectively:

When k=1:  T_1 = -ln(X) / λ    (3)

When k=2:  T_2 = ln(a) / λ, where a is the positive real root of Xa^2 - 2a + 1 = 0 and a = e^{λT_2}    (4)

In equations (3) and (4), T_1 and T_2 indicate the longest storage durations of data with one replica and two replicas respectively.

Algorithm: Metadata distribution algorithm
Input:  Expected storage duration e;   // -1 if not provided
        Failure rate λ;
        Reliability requirement X;
        PRCR node set S;               // contains all the N PRCR nodes
Output: scaninterval;                  // the longest storage duration without checking
        PRCR node node                 // the destination PRCR node

if (e ≠ -1) {
    T1 ← -ln(X) / λ                                // longest storage duration with one replica, equation (3)
    if (e < T1) return -1 }                        // reject the request if short-term storage
solve equation f(a) = Xa^2 - 2a + 1 = 0
a ← the positive real root of f(a)
T2 ← ln(a) / λ                                     // longest storage duration with two replicas, equation (4)
scaninterval ← T2
for (each i in S with scancycle(i) < scaninterval)
    diff(i) ← scaninterval - scancycle(i)          // calculate the scaninterval - scancycle values
for (each j in S with scancycle(j) < scaninterval) {
    if (scaninterval - scancycle(j) = min(diff))
        add j to nodes }                           // find the nodes with the smallest scaninterval - scancycle value
node ← random(nodes)                               // randomly return one of the nodes as the destination
return scaninterval, node

Figure 3. Pseudo code of the metadata distribution process

The pseudo code of the algorithm is shown in Fig. 3. When a reliability management request encapsulating the metadata arrives, the algorithm first computes the longest storage duration with one replica. By solving equation (3), T_1 is derived. If the expected storage duration attribute exists and is shorter than T_1, the reliability management request is rejected and the file is stored with only one replica in the Cloud. Otherwise, the reliability management request is accepted and the file is stored in the Cloud with two replicas. The longest storage duration with two replicas, T_2, is derived from equation (4). After T_2 is derived, it is stored as the scan interval attribute. Afterwards, the process calculates the scan interval minus scan cycle value for every PRCR node whose scan cycle is smaller than the scan interval of the file. Then, a randomly selected PRCR node among those with the smallest scan interval minus scan cycle value is chosen as the destination node. Finally, the scan interval attribute together with the destination PRCR node are returned as the result. In this process, we randomly pick one node from the node set to deal with the situation where two PRCR nodes have the same scan cycle. By applying the metadata distribution algorithm, PRCR can reach its maximum capacity, and the cost of managing files can be minimised.
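The following Java sketch (our own rendering of the pseudo code in Fig. 3, not the paper's implementation; the PrcrNode type and all names are assumptions) shows how the scan interval and destination node could be computed, using the closed-form solutions of equations (3) and (4).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Illustrative rendering of the metadata distribution algorithm (Fig. 3). */
public class MetadataDistribution {

    /** Hypothetical PRCR node descriptor; scanCycle is expressed in years. */
    record PrcrNode(String id, double scanCycle) {}

    record Result(double scanInterval, PrcrNode node) {}

    /**
     * @param e      expected storage duration in years, or -1 if not provided
     * @param lambda failure rate of a storage unit (per year)
     * @param x      data reliability requirement, e.g. 0.9999
     * @param nodes  all PRCR nodes (assumed to contain at least one suitable node)
     * @return null if one replica suffices (request declined), otherwise the scan interval and destination node
     */
    static Result distribute(double e, double lambda, double x, List<PrcrNode> nodes) {
        if (e != -1) {
            double t1 = -Math.log(x) / lambda;        // equation (3): longest duration with one replica
            if (e < t1) return null;                  // short-term storage: decline, keep one replica
        }
        double a = (1 + Math.sqrt(1 - x)) / x;        // positive root of x*a^2 - 2a + 1 = 0
        double t2 = Math.log(a) / lambda;             // equation (4): longest duration with two replicas
        double scanInterval = t2;

        // Pick the nodes whose scan cycle is smaller than but closest to the scan interval.
        List<PrcrNode> candidates = new ArrayList<>();
        double minDiff = Double.MAX_VALUE;
        for (PrcrNode n : nodes) {
            if (n.scanCycle() < scanInterval) {
                double diff = scanInterval - n.scanCycle();
                if (diff < minDiff) {
                    minDiff = diff;
                    candidates.clear();
                    candidates.add(n);
                } else if (diff == minDiff) {
                    candidates.add(n);                // nodes with identical scan cycles
                }
            }
        }
        PrcrNode node = candidates.get(new Random().nextInt(candidates.size()));
        return new Result(scanInterval, node);
    }
}
```

For example, with λ = 0.01/year and X = 0.9999, the sketch yields a scan interval of roughly one year, and the request is routed to the node whose scan cycle is just below that value.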
C. Working Process of PRCR

In Fig. 4, we illustrate the working process of PRCR in the Cloud by following the life cycle of a file.

Figure 4. Working process of PRCR (user interface, replica management module, data table and Cloud computing instances; steps 1-8 as described below)

1. The process starts with the submission of a reliability management request to PRCR. The metadata of the file is encapsulated in the request. According to the metadata, the user interface justifies whether the management of the file by PRCR is necessary.
2. According to the metadata distribution algorithm, if proactive replica checking is necessary, the request is distributed to the replica management module of the corresponding PRCR node. Otherwise, the reliability management request is declined (rejected), as the file only needs one replica stored in the Cloud to satisfy the reliability requirement.
3. The replica management module calls the Cloud storage service to perform the data replication. After the data replication is finished, the metadata are updated in the data table.
4. In each scan cycle of the PRCR node, the metadata are scanned sequentially. According to the metadata, PRCR determines whether each file needs to be checked or not.
5. During the scanning process, if a file needs to be checked, the replica management module extracts its metadata from the data table.
6. The replica management module sends the proactive replica checking task to one of the Cloud computing instances. The Cloud computing instance executes the task, which checks the availability of both replicas of the file (a sketch of such a checking task is given after this list).
7. After the checking task is completed, the Cloud computing instance returns the result. The replica management module receives the returned result and conducts further action accordingly, as follows.
8. If one of the replicas is not available, the replica management module sends a replica generation request to the Cloud. Using the metadata of the other replica, a new replica is generated and thus the data can be recovered. (In some extreme cases, both replicas may be lost; this is within the reliability range and hence does not jeopardise the reliability assurance in overall terms.)

Note: Steps 4 to 8 form a continuous loop until the file is deleted, i.e., the end of its life cycle.
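As an illustration of steps 6 to 8, the following Java sketch (our own, not code from the paper; the StorageClient interface and all method names are hypothetical) shows what a proactive replica checking task running on a Cloud computing instance might look like.

```java
import java.util.List;

/**
 * Illustrative proactive replica checking task (steps 6-8 of the working process).
 * StorageClient is a hypothetical abstraction of the Cloud storage service.
 */
public class ReplicaCheckTask {

    interface StorageClient {
        boolean exists(String replicaAddress);                   // check availability of a replica
        String copy(String sourceAddress, String targetBucket);  // create a new replica, return its address
    }

    /**
     * Checks both replicas of a file; if one is missing, recovers it from the other.
     * Returns the (possibly updated) replica addresses so the data table can be refreshed.
     */
    static List<String> checkAndRecover(StorageClient storage, String fileId,
                                        String replicaA, String replicaB, String recoveryBucket) {
        boolean aOk = storage.exists(replicaA);
        boolean bOk = storage.exists(replicaB);
        if (aOk && bOk) {
            return List.of(replicaA, replicaB);                  // both replicas available, nothing to do
        }
        if (aOk) {                                               // replica B lost: recover it from A
            return List.of(replicaA, storage.copy(replicaA, recoveryBucket));
        }
        if (bOk) {                                               // replica A lost: recover it from B
            return List.of(storage.copy(replicaB, recoveryBucket), replicaB);
        }
        // Extreme case: both replicas lost; within the overall reliability range as noted above.
        throw new IllegalStateException("Both replicas of " + fileId + " are unavailable");
    }
}
```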
VI. EVALUATION

A. Parameter Setting for PRCR Simulation

We have conducted several tests based on Amazon Web Services, including Amazon S3, Amazon EC2 and Amazon Beanstalk. Based on these tests, we have obtained benchmarking parameters from a real Cloud environment. The PRCR used for the experiments is implemented in Java. It is composed of one user interface and one PRCR node, and both of them run on a single EC2 instance. In the experiments, the metadata scanning process and the proactive replica checking tasks are conducted. There are mainly two parameters collected from the tests on the Amazon Web Services: the metadata scanning time and the proactive replica checking task processing time.

The metadata scanning time is a very important parameter. As mentioned in Section V.A, it is directly related to the capacity of each PRCR node. The proactive replica checking task processing time is the time taken by each proactive replica checking task. In the experiments, each proactive replica checking task accesses both replicas of the file and checks their availability. The proactive replica checking task processing time is an important parameter for the evaluation of PRCR, because the replica checking tasks occupy most of the resources of PRCR, and we need this parameter to compute the running cost of PRCR.

The tests are conducted with several different configurations. We hired four types of EC2 computing instances for the management of 3000 S3 objects (i.e., files) stored with standard storage and Reduced Redundancy Storage respectively. The results are shown in Table I. It can be seen that the metadata scanning time is at the level of hundreds of nanoseconds, and the checking task processing time of each file is at the level of tens of milliseconds. To our surprise, the results seem to indicate that the performance of PRCR does not follow the trend of the increasing performance of the Cloud computing instances. As shown in Table I, the performance of the Cloud computing instances increases from left to right, but the scanning time and proactive replica checking task processing time do not decrease accordingly. We speculate that the reason is that we have not optimised the code for multi-core processors, and that the Cloud virtual partition technology may have reduced the overall performance. According to the results shown in Table I, in the evaluation of PRCR we choose 700ns and 30ms as the benchmarks for the metadata scanning time and the proactive replica checking time respectively, and the micro EC2 instance (t1.micro) is chosen as the default Cloud computing instance.
TABLE I. DATA SCANNING TIME AND CHECKING TASK PROCESSING (CTP) TIME

                              t1.micro  m1.small  m1.large  m1.xlarge
Scanning time                 700ns     400ns     700ns     850ns
CTP time (standard)           27ms      27ms      30ms      27ms
CTP time (reduced redundancy) 25ms      24ms      37ms      23ms

B. Cost-effectiveness

In order to demonstrate the effectiveness of PRCR, we evaluate the capacity of PRCR nodes and simulate the reliability management process using PRCR for data generated by pulsar searching workflow instances. The results are compared with the widely used 3-replica replication strategy.

We calculate the capacity of PRCR nodes for storing files with data reliability requirements of 99% over a year, 99.99% over a year and higher, under different storage unit failure rates. Table II shows the relationship among the reliability requirement X, the failure rate of a single replica λ and the capacity of PRCR nodes. With different single-replica failure rates and reliability requirements, each PRCR node is able to manage a very large number of files.

TABLE II. PRCR CAPACITY (capacity of a PRCR node for different reliability requirements X and single-replica failure rates λ)

In a further step of the evaluation, we simulate the reliability management process using PRCR for data generated by pulsar searching workflow instances. In the simulation we assume that all the pulsar searching workflow instances are the same. The configuration is set as follows. The storage unit failure rate is 0.01 per year. Two types of Cloud storage services are provided: 2-replica storage with a much higher reliability assurance over a year, and 1-replica storage with a reliability assurance of 99% over a year, which is normally enough for short-term data. The data stored with the former are managed by PRCR.

In the simulation, we apply four different storage plans: the 1-replica plan, the 1+2 replica plan, the 2-replica plan and the 3-replica plan. Among these four storage plans, the 1+2 replica plan divides all the data into two categories, in which the data with higher reliability requirements are stored with 2 replicas and the rest are stored with 1 replica. As mentioned in Section III.B, in the pulsar searching workflow example, the extracted & compressed beam files, the XML files and the de-dispersion files should be stored for long-term use and have higher reliability requirements, so they are stored with 2-replica storage. The other files are stored with 1-replica storage.

Fig. 5 and Fig. 6 show the results of the simulation. Fig. 5 shows the average replica numbers and data sizes generated by the pulsar searching workflow instance. The results show that by applying the 2-replica storage plan, one-third of the generated data size can be saved compared with the 3-replica replication strategy, and the average replica number for each file is reduced accordingly. By applying the 1+2 replica storage plan, the combination of the two storage types further reduces the consumption of storage space. In our simulation, by applying the 1+2 replica storage plan to the pulsar searching workflow instance, each workflow instance has part of its files stored with 2 replicas, while the remaining (far more numerous) files are stored with 1 replica. The ratio between the volumes of the two types of files is about 4.8:1, and compared with the 2-replica plan, about 70GB of data is saved. With a large number of pulsar searching instances, the overall saving would be huge.
Figure 5. Average replica numbers and data sizes for the 1-replica, 1+2 replica, 2-replica and 3-replica storage plans

To compare the general cost of data storage using PRCR and the 3-replica replication strategy, we adopt the Amazon pricing model for Amazon EC2 instances and S3 storage. Assuming that Amazon S3 standard storage also uses the 3-replica replication strategy, Fig. 6 shows the cost of storing 1GB of data for one month in the Cloud. In this figure, the horizontal axis represents the number of pulsar searching workflow instances conducted. The results clearly indicate the cost-effectiveness of PRCR for the reliability management of different sizes of data. Using the micro EC2 instance, the cost of running PRCR with one PRCR node in the Cloud is $14.4 per month. When only a small amount of data is managed by PRCR, the cost of hiring EC2 instances for PRCR takes a significant proportion of the total data storage cost, and the cost-effectiveness of using the default 3-replica replication strategy in the Cloud is higher. With the increase of the data amount, the proportion of the cost for running PRCR becomes smaller, and thus the unit storage cost decreases rapidly. Considering that PRCR would normally be well loaded, the cost of running PRCR is small enough to be negligible. Therefore, the unit storage cost can be reduced to between one-third and two-thirds of the Amazon S3 storage cost.

Figure 6. Cost rates ($ per GB per month) for storing data under the 1-replica, 1+2 replica, 2-replica and 3-replica plans, versus the number of pulsar searching instances
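The trend behind Fig. 6 can be approximated with a very simple model (our own simplification, not the exact model used in the paper): the unit cost is the per-GB S3 price multiplied by the number of replicas, plus the fixed monthly cost of the PRCR node amortised over the managed data. A minimal Java sketch, assuming the $0.055 per GB per month S3 price and the $14.4 per month micro EC2 instance quoted above:

```java
/**
 * Simplified unit storage cost model (our own approximation of the comparison behind Fig. 6).
 * Assumptions: S3 price 0.055 $/GB/month, one PRCR node on a micro EC2 instance at 14.4 $/month,
 * and roughly 230GB of data generated per pulsar searching workflow instance.
 */
public class UnitStorageCost {

    static double costPerGbMonth(int replicas, double managedDataGb, boolean usesPrcr) {
        double s3PricePerGb = 0.055;
        double prcrNodePerMonth = usesPrcr ? 14.4 : 0.0;   // 3-replica default and 1-replica need no PRCR node
        return replicas * s3PricePerGb + prcrNodePerMonth / managedDataGb;
    }

    public static void main(String[] args) {
        double dataGb = 100 * 230.0;                        // e.g. 100 workflow instances
        System.out.printf("3-replica:         %.4f $/GB/month%n", costPerGbMonth(3, dataGb, false)); // 0.1650
        System.out.printf("2-replica (PRCR):  %.4f $/GB/month%n", costPerGbMonth(2, dataGb, true));  // ~0.1106
        System.out.printf("1-replica:         %.4f $/GB/month%n", costPerGbMonth(1, dataGb, false)); // 0.0550
    }
}
```

As the managed data volume grows, the amortised PRCR term shrinks towards zero, which matches the observation that the unit cost approaches the pure per-replica storage price.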
VII. CONCLUSIONS AND FUTURE WORK

In this paper, we present a novel cost-effective reliability management mechanism called PRCR. It applies an innovative proactive replica checking approach to check the availability of each file so that data reliability can be assured with no more than two replicas. Simulation of the reliability management process using PRCR for data generated by pulsar searching workflow instances has demonstrated that this mechanism can significantly reduce Cloud storage space consumption and hence Cloud storage cost. Our current work considers only the cost and reliability factors of data storage. Therefore, in this paper we did not discuss the implications of using PRCR on data access performance. In the future, we will incorporate the performance factor into our research.

ACKNOWLEDGMENT

The research work reported in this paper is partly supported by the Australian Research Council under Linkage Project LP and Discovery Project DP, as well as the State Administration of Foreign Experts Affairs of China Project P. Yun Yang is the corresponding author.

REFERENCES

[1] The Parkes radio telescope. Available:
[2] J. Abawajy, "Fault-tolerant scheduling policy for grid computing systems," in International Parallel and Distributed Processing Symposium.
[3] Amazon. (2011). Amazon simple storage service (Amazon S3). Available:
[4] D. Borthakur. (2007). The Hadoop distributed file system: Architecture and design. Available: hdfs_design.html
[5] C. Dabrowski, "Reliability in grid computing systems," Concurrency and Computation: Practice and Experience, vol. 21.
[6] R. Duan, R. Prodan, and T. Fahringer, "Data mining-based fault prediction and detection on the grid," in IEEE International Symposium on High Performance Distributed Computing.
[7] S. Ghemawat, H. Gobioff, and S. Leung, "The Google file system," in ACM Symposium on Operating Systems Principles.
[8] V. Gopalakrishnan, B. Silaghi, B. Bhattacharjee, and P. Keleher, "Adaptive replication in peer-to-peer systems," in International Conference on Distributed Computing Systems, 2004.
[9] H. Jin, X. H. Sun, and Z. Zheng, "Performance under failures of DAG-based parallel computing," in IEEE/ACM International Symposium on Cluster Computing and the Grid.
[10] H. Lamehamedi, B. Szymanski, Z. Shentu, and E. Deelman, "Data replication strategies in grid environments," in International Conference on Algorithms and Architectures for Parallel Processing.
[11] M. Lei, S. V. Vrbsky, and Z. Qi, "Online grid replication optimizers to improve system reliability," in IEEE International Parallel and Distributed Processing Symposium, pp. 1-8.
[12] W. Li, Y. Yang, and D. Yuan, "A novel cost-effective dynamic data replication strategy for reliability in cloud data centres," in International Conference on Cloud and Green Computing.
[13] X. Liu, D. Yuan, G. Zhang, W. Li, D. Cao, Q. He, J. Chen, and Y. Yang, The Design of Cloud Workflow Systems. Springer.
[14] D. Patterson, G. Gibson, and R. Katz, "A case for redundant arrays of inexpensive disks (RAID)," in ACM SIGMOD International Conference on the Management of Data.
[15] F. Schintke and R. Alexander, "Modelling replica availability in large data grids," Journal of Grid Computing, vol. 1.
[16] H. Stockinger, A. Samar, B. Allcock, I. Foster, K. Holtman, and B. Tierney, "File and object replication in data grids," Journal of Cluster Computing, vol. 5.
[17] H. Vo, C. Chen, and B. Ooi, "Towards elastic transactional cloud storage with range query support," Proceedings of the VLDB Endowment, vol. 3.
[18] J. W. Young, "A first order approximation to the optimal checkpoint interval," Communications of the ACM, vol. 17.
[19] D. Yuan, Y. Yang, X. Liu, G. Zhang, and J. Chen, "A data dependency based strategy for intermediate data storage in scientific cloud workflow systems," Concurrency and Computation: Practice and Experience, online.
ISSN:2320-0790. Keywords: HDFS, Replication, Map-Reduce I Introduction:
ISSN:2320-0790 Dynamic Data Replication for HPC Analytics Applications in Hadoop Ragupathi T 1, Sujaudeen N 2 1 PG Scholar, Department of CSE, SSN College of Engineering, Chennai, India 2 Assistant Professor,
Google File System. Web and scalability
Google File System Web and scalability The web: - How big is the Web right now? No one knows. - Number of pages that are crawled: o 100,000 pages in 1994 o 8 million pages in 2005 - Crawlable pages might
Large-Scale Data Sets Clustering Based on MapReduce and Hadoop
Journal of Computational Information Systems 7: 16 (2011) 5956-5963 Available at http://www.jofcis.com Large-Scale Data Sets Clustering Based on MapReduce and Hadoop Ping ZHOU, Jingsheng LEI, Wenjun YE
Fault Tolerance in Hadoop for Work Migration
1 Fault Tolerance in Hadoop for Work Migration Shivaraman Janakiraman Indiana University Bloomington ABSTRACT Hadoop is a framework that runs applications on large clusters which are built on numerous
How In-Memory Data Grids Can Analyze Fast-Changing Data in Real Time
SCALEOUT SOFTWARE How In-Memory Data Grids Can Analyze Fast-Changing Data in Real Time by Dr. William Bain and Dr. Mikhail Sobolev, ScaleOut Software, Inc. 2012 ScaleOut Software, Inc. 12/27/2012 T wenty-first
marlabs driving digital agility WHITEPAPER Big Data and Hadoop
marlabs driving digital agility WHITEPAPER Big Data and Hadoop Abstract This paper explains the significance of Hadoop, an emerging yet rapidly growing technology. The prime goal of this paper is to unveil
Amazon Elastic Compute Cloud Getting Started Guide. My experience
Amazon Elastic Compute Cloud Getting Started Guide My experience Prepare Cell Phone Credit Card Register & Activate Pricing(Singapore) Region Amazon EC2 running Linux(SUSE Linux Windows Windows with SQL
Data Semantics Aware Cloud for High Performance Analytics
Data Semantics Aware Cloud for High Performance Analytics Microsoft Future Cloud Workshop 2011 June 2nd 2011, Prof. Jun Wang, Computer Architecture and Storage System Laboratory (CASS) Acknowledgement
Data processing goes big
Test report: Integration Big Data Edition Data processing goes big Dr. Götz Güttich Integration is a powerful set of tools to access, transform, move and synchronize data. With more than 450 connectors,
Keywords: Cloudsim, MIPS, Gridlet, Virtual machine, Data center, Simulation, SaaS, PaaS, IaaS, VM. Introduction
Vol. 3 Issue 1, January-2014, pp: (1-5), Impact Factor: 1.252, Available online at: www.erpublications.com Performance evaluation of cloud application with constant data center configuration and variable
Research on Job Scheduling Algorithm in Hadoop
Journal of Computational Information Systems 7: 6 () 5769-5775 Available at http://www.jofcis.com Research on Job Scheduling Algorithm in Hadoop Yang XIA, Lei WANG, Qiang ZHAO, Gongxuan ZHANG School of
Distributed Framework for Data Mining As a Service on Private Cloud
RESEARCH ARTICLE OPEN ACCESS Distributed Framework for Data Mining As a Service on Private Cloud Shraddha Masih *, Sanjay Tanwani** *Research Scholar & Associate Professor, School of Computer Science &
Data Consistency on Private Cloud Storage System
Volume, Issue, May-June 202 ISS 2278-6856 Data Consistency on Private Cloud Storage System Yin yein Aye University of Computer Studies,Yangon [email protected] Abstract: Cloud computing paradigm
A Novel Cloud Based Elastic Framework for Big Data Preprocessing
School of Systems Engineering A Novel Cloud Based Elastic Framework for Big Data Preprocessing Omer Dawelbeit and Rachel McCrindle October 21, 2014 University of Reading 2008 www.reading.ac.uk Overview
Big Data and Hadoop with components like Flume, Pig, Hive and Jaql
Abstract- Today data is increasing in volume, variety and velocity. To manage this data, we have to use databases with massively parallel software running on tens, hundreds, or more than thousands of servers.
BIG DATA CHALLENGES AND PERSPECTIVES
BIG DATA CHALLENGES AND PERSPECTIVES Meenakshi Sharma 1, Keshav Kishore 2 1 Student of Master of Technology, 2 Head of Department, Department of Computer Science and Engineering, A P Goyal Shimla University,
Big Data and Natural Language: Extracting Insight From Text
An Oracle White Paper October 2012 Big Data and Natural Language: Extracting Insight From Text Table of Contents Executive Overview... 3 Introduction... 3 Oracle Big Data Appliance... 4 Synthesys... 5
THE HADOOP DISTRIBUTED FILE SYSTEM
THE HADOOP DISTRIBUTED FILE SYSTEM Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler Presented by Alexander Pokluda October 7, 2013 Outline Motivation and Overview of Hadoop Architecture,
Cyber Forensic for Hadoop based Cloud System
Cyber Forensic for Hadoop based Cloud System ChaeHo Cho 1, SungHo Chin 2 and * Kwang Sik Chung 3 1 Korea National Open University graduate school Dept. of Computer Science 2 LG Electronics CTO Division
Scaling Objectivity Database Performance with Panasas Scale-Out NAS Storage
White Paper Scaling Objectivity Database Performance with Panasas Scale-Out NAS Storage A Benchmark Report August 211 Background Objectivity/DB uses a powerful distributed processing architecture to manage
Hadoop Distributed File System. T-111.5550 Seminar On Multimedia 2009-11-11 Eero Kurkela
Hadoop Distributed File System T-111.5550 Seminar On Multimedia 2009-11-11 Eero Kurkela Agenda Introduction Flesh and bones of HDFS Architecture Accessing data Data replication strategy Fault tolerance
Manifest for Big Data Pig, Hive & Jaql
Manifest for Big Data Pig, Hive & Jaql Ajay Chotrani, Priyanka Punjabi, Prachi Ratnani, Rupali Hande Final Year Student, Dept. of Computer Engineering, V.E.S.I.T, Mumbai, India Faculty, Computer Engineering,
Keywords: Big Data, HDFS, Map Reduce, Hadoop
Volume 5, Issue 7, July 2015 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Configuration Tuning
