Adaptive Task Scheduling for Multi-Job MapReduce Environments

Jordà Polo, David de Nadal, David Carrera, Yolanda Becerra, Vicenç Beltran, Jordi Torres and Eduard Ayguadé
Barcelona Supercomputing Center (BSC), Technical University of Catalonia (UPC), Barcelona, Spain
{jordap, ddenadal, dcarrera, yolandab, vbeltran, torres,

Abstract: MapReduce is a data-driven programming model originally proposed by Google back in 2004, especially well suited for running highly distributed applications in very large data centers. Current trends in data center management comprise the extensive use of virtualization and workload consolidation strategies to save energy and cut costs. In such an environment, different applications belonging to different users share the same physical resources, and predicting and managing the performance of each workload is therefore a crucial task from the point of view of data center management. In this paper we address the problem of predicting and managing the performance of MapReduce applications according to a set of performance goals defined for each application. In our prototype system we introduce a new task scheduler for a MapReduce framework that allows the performance-driven management of MapReduce tasks. The result is a task scheduler that dynamically predicts the performance of several concurrent MapReduce jobs and adjusts at runtime the resources each job is allocated, in order to meet its performance goal without wasting physical resources.

Index Terms: MapReduce, Performance management, Task scheduling

I. INTRODUCTION

MapReduce is a framework originally designed by Google to exploit large clusters for parallel computations. It is based on an implicit parallel programming model that provides an easy and convenient way to express certain kinds of distributed/parallel problems. The framework is composed of an execution runtime and a distributed file system that take care of all the details involved in the distribution of tasks and data across nodes. The runtime and the distributed file system also handle all the issues related to fault tolerance and reliability, which are crucial in such a dynamic environment.

Each MapReduce application is run as a job that is submitted to the MapReduce runtime, and each job is split into a large number of Map and Reduce tasks before being started. The runtime is in charge of running the tasks of every job until they are completed, and the tasks are executed on any of the slave nodes that the MapReduce cluster comprises. In particular, the task scheduler is responsible for deciding which tasks are run at each moment in time, as well as which slave node will host each task's execution. Therefore, in a multi-job environment, the task scheduler is largely responsible for the performance delivered by each MapReduce application. Moreover, while MapReduce was at first primarily used for batch data processing, it is now also being used in shared, multi-user environments in which submitted jobs may have completely different priorities, from small, almost interactive executions to very long programs that take hours to complete. In these new scenarios, task scheduling becomes even more relevant, since it decides which tasks are run and thus the performance delivered by each application. In this paper we present an application-centric task scheduler for Hadoop (Apache's open source implementation of a MapReduce framework).
The proposed scheduler manages the performance of the in-execution MapReduce applications by dynamically adapting the resources each application is allocated. In particular, the scheduler is driven by high-level performance goals (such as job deadlines), and it leverages estimates of each job's completion time to decide at runtime on the best allocation of task execution slots across jobs. In our experiments we evaluate different job completion time estimation techniques, and we pick the best performing one to illustrate execution examples in which the effectiveness of our performance-driven dynamic scheduling for MapReduce applications is tested.

The rest of the paper is structured as follows. In Section II we introduce the technical background necessary to understand MapReduce applications as well as Hadoop, the open source MapReduce implementation we have used in our experiments. Section III describes the different job completion time estimation techniques we have evaluated, as well as the core of the performance management scheduler we propose. Section IV presents the architecture of the scheduler and its integration into Hadoop. Section V presents the experiments that support the contribution of our work, and Section VI presents an additional performance study. Finally, Section VII reviews related work and Section VIII draws our conclusions.

II. TECHNICAL BACKGROUND

A. MapReduce

MapReduce [6] is a programming model proposed by Google to facilitate the implementation of massively parallel applications that process large data sets. This programming model hides from the programmer all the complexity involved in managing the parallelization.

The programmer only has to implement the computational function that will execute on each node (the map() function) and the function that combines the intermediate results to generate the final result of the application (the reduce() function). Each map input is a key-value pair that identifies a piece of work, and the types of both key and value can be defined by the programmer. The output of each map is an intermediate result, also expressed as a key-value pair, which can be of a different type than the input pair and is likewise defined by the programmer. The input of the reduce function is composed of all the intermediate values identified by the same key, so that the reduce function can combine all of them to form the final result.

In the MapReduce model all nodes in the cluster execute the same function, but on a different chunk of the input data. The MapReduce runtime is responsible for distributing and balancing all the work across the nodes, dividing the input data into chunks, assigning a new chunk every time a node becomes idle, and collecting the partial results. There are many runtime implementations supporting this model, depending on the type of environment they target. In this paper we present a prototype that is implemented on top of Hadoop, a MapReduce framework presented in the following section. Both the Google implementation [6] and Hadoop [1] target execution on a distributed environment that uses a distributed file system to support data sharing (GFS [9] in the case of Google, and HDFS [3] in the case of Hadoop).

B. Hadoop

Hadoop is an open source MapReduce runtime implementation provided by the Apache Software Foundation. The mechanism it uses to implement data sharing is based on a distributed file system also provided by the Hadoop project, the Hadoop Distributed File System [3] (HDFS); data sharing is thus implemented through HDFS files. The file system follows a master/slave architecture: the master process (NameNode) manages the global name space and controls the operations on files, while each slave process (DataNode) implements the operations on the blocks stored on its local disk, following the NameNode's indications.

The runtime involves two different processes. The process that distributes work among nodes is named the JobTracker. This process uses the method configured by the programmer to partition the input data into splits, and thus to populate a local job queue. When a node in the system becomes idle, the JobTracker picks new work from its queue to feed it; the granularity of the splits therefore has a high influence on the balancing capability of the scheduler. Another consideration in map task scheduling is the location of the data blocks, as the JobTracker tries to minimize the number of remote block accesses. The process that controls the execution of the map tasks inside a node is named the TaskTracker. This process receives a split description, divides the split data into records (through the RecordReader component) and launches the processes that will execute the map tasks (Mappers). The programmer can also decide how many simultaneous map() functions to execute on a node, that is, how many Mappers the TaskTracker has to create. When a TaskTracker ends the processing of one split and is ready to receive a new one, it sends a notification to the JobTracker.
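As an illustration of the programming model described in Section II-A, the classic word-count example (also used in the experiments of Section V) can be sketched in a few lines. This is a toy stand-in, not Hadoop's Java API: map_fn, reduce_fn and run_job are names of our own, and run_job replaces the grouping that the real runtime performs during the shuffle phase.

    from collections import defaultdict

    def map_fn(key, value):
        # key: offset of the line within the split; value: the line itself.
        # Emits one intermediate (word, 1) pair per word occurrence.
        for word in value.split():
            yield (word, 1)

    def reduce_fn(key, values):
        # key: a word; values: every intermediate count emitted for it.
        yield (key, sum(values))

    def run_job(records):
        # Stand-in for the runtime: run the maps, group intermediate
        # pairs by key, then run the reduces.
        intermediate = defaultdict(list)
        for key, value in records:
            for k, v in map_fn(key, value):
                intermediate[k].append(v)
        return {k: v for key, vs in intermediate.items()
                     for k, v in reduce_fn(key, vs)}

    print(run_job(enumerate(["a rose is a rose", "is a rose"])))
    # {'a': 3, 'rose': 3, 'is': 2}
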
III. PERFORMANCE MANAGEMENT

In this section we present the techniques we have studied to predict the behaviour of jobs and their completion time, and we then describe how this information can be used by schedulers to provide performance-driven scheduling of multiple jobs.

A. Job performance estimation

In order to estimate the finish time of a given job, it is important to collect accurate data about tasks that have already finished, as well as about tasks that are currently running. Hadoop already provides a way for tasks to report their progress but, as discussed in Section IV, it doesn't always provide good enough data. Since our goal is to schedule all kinds of jobs, we have looked into alternative mechanisms that can be used to estimate a job's progress from the JobTracker. The estimators we have studied so far are the following.

Completed tasks (E_c). This is the first and most elementary estimator: it only takes into account the number of tasks completed since the job started (N), and this task completion rate is used to estimate the time it will take to finish the remaining tasks:

    E_c = (Elapsed job time / N) × Remaining tasks

Input size (E_s). This estimator is based on the input size of all the maps that have been completed, which is compared to the total input size to obtain the input progress; dividing the elapsed time by this progress yields the estimated total job time:

    E_s = Elapsed job time / ((Σ_{i=1..N} Input_i) / Total input)

Partial time (E_p). Unlike the previous estimators, this one collects measures not only from finished tasks, but also from currently running tasks (hence partial). It calculates the average task time from the already completed tasks, and it also accounts for how long the currently running tasks (N+1 to M) have been executing:

    Average = (Σ_{i=1..N} (Finish time_i − Start time_i)) / N
    Partial = Σ_{i=N+1..M} (Current time − Start time_i)
    E_p = (Elapsed job time × Desired tasks) / (N + Partial / Average)

where Desired tasks is the total number of tasks of the job, and the term Partial / Average credits each running task with the fraction of an average task it has already completed.
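The three estimators can be stated compactly in code. The sketch below assumes that per-task bookkeeping (start and finish times, input sizes) is available at the JobTracker; the function and variable names are ours, not Hadoop's:

    def estimate_completed_tasks(elapsed, n, remaining):
        # E_c: average time per completed task times the tasks left;
        # returns the estimated time needed to finish the remaining tasks.
        return (elapsed / n) * remaining

    def estimate_input_size(elapsed, input_done, input_total):
        # E_s: elapsed time divided by the fraction of the input already
        # processed; returns the estimated total job time.
        return elapsed / (input_done / input_total)

    def estimate_partial_time(elapsed, finished, running_starts, now, desired):
        # E_p: finished is a list of (start, finish) pairs, running_starts
        # the start times of still-running tasks; returns the estimated
        # total job time, crediting running tasks with partial progress.
        average = sum(f - s for s, f in finished) / len(finished)
        partial = sum(now - s for s in running_starts)
        return elapsed * desired / (len(finished) + partial / average)

    # Example: 4 of 40 tasks completed after 100 s of job time, so E_c
    # predicts (100 / 4) * 36 = 900 s more to finish the remaining tasks.
    print(estimate_completed_tasks(100.0, 4, 36))  # 900.0
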

B. Performance-driven co-scheduling

To prove that scheduling decisions can benefit from progress estimations, we first designed a very simple scheduler. The idea behind this equalizing scheduler is to complete all running jobs at the same time, assigning free task slots to the job with the most distant estimated completion time. Since estimators become more accurate as more tasks are completed, we expect this scheduler to come fairly close to the desired result.

In addition to this basic scheduler, we also worked on a more advanced QoS-oriented scheduler, the goal of which is to execute running jobs using as few resources as possible while trying to meet the user-provided deadline of each job. The deadline scheduler works by assigning available task slots to the job with the largest positive need of CPU, which roughly translates to the job that needs more resources to be completed on time. The need is defined as the difference between the estimated number of slots required to meet the deadline and the number of currently running tasks:

    Slots = Remaining tasks / (Remaining time until deadline / Average task time)
    Need = Slots − Running tasks

Note, however, that right after a job is submitted there is no data available, and it is not possible to estimate the required slots or the completion time until some tasks have actually run. Since we want to have an idea of the job's needs as soon as possible, jobs with no completed and no running tasks always take precedence over other jobs; if there is more than one such job, the oldest one comes first. Finally, there is another scenario that needs to be taken care of: jobs that are past their deadline. If it is not possible to meet a job's deadline, the scheduler tries to at least complete it as soon as possible, prioritizing it over any other kind of job, which in turn also helps to avoid job starvation.

IV. SYSTEM PROTOTYPE

Analyzing a job's progress to estimate its future behaviour is the basis of the schedulers described in this paper. Early in the first versions, however, we found scenarios that were not as promising as we had expected. As discussed in Section III-A, Hadoop provides a way for tasks to report their progress to the JobTracker. Although map tasks can use the RecordReader's getProgress() method, the scheduler cannot rely on it, since it is not always correctly implemented (or implemented at all, for that matter). Since our goal is to schedule any kind of job, we decided to implement alternative mechanisms that estimate a job's progress externally, that is, from the JobTracker or scheduler, without relying on the progress reported by each task.

It should also be noted that the current implementation of the schedulers proposed in the previous section is restricted to map tasks only. Estimation of maps is arguably easier than that of reduces: there is usually a large, known number of maps, while the number of reduces is smaller and their execution is more variable. As part of our future work, however, we intend to extend the initial implementation to support reduce tasks.

The deadline scheduler keeps information about each job, updating it regularly and using it to queue all jobs accordingly, prioritizing jobs with the highest need of CPU. The status of each job is one of the keys of the scheduler, since it helps to select the right job. All running jobs are classified into one of the following states:

NODATA: There is not enough data to estimate the job's requirements.
ADJUST: The scheduler is able to decide whether the job needs more or fewer slots.
UNDEAD: The job is past its deadline.
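Putting the need computation and the job states together, the slot-assignment rule can be sketched as follows. This is an illustrative reimplementation under the definitions above, not the actual scheduler code, and the Job record is an assumption of ours:

    from dataclasses import dataclass

    NODATA, ADJUST, UNDEAD = "NODATA", "ADJUST", "UNDEAD"

    @dataclass
    class Job:                     # hypothetical per-job bookkeeping
        state: str
        submit_time: float
        deadline: float            # absolute deadline, in seconds
        remaining_tasks: int
        running_tasks: int
        avg_task_time: float       # from completed tasks (Section III-A)

    def need(job, now):
        # Slots = Remaining tasks / (Remaining time until deadline / Average task time)
        # Need = Slots - Running tasks
        slots = job.remaining_tasks / ((job.deadline - now) / job.avg_task_time)
        return slots - job.running_tasks

    def pick_job(jobs, now):
        # Assign a freed slot: UNDEAD jobs first, then NODATA jobs with no
        # running tasks (oldest first), then the ADJUST job with the
        # largest need of CPU.
        undead = [j for j in jobs if j.state == UNDEAD]
        if undead:
            return undead[0]
        nodata = [j for j in jobs if j.state == NODATA and j.running_tasks == 0]
        if nodata:
            return min(nodata, key=lambda j: j.submit_time)
        adjust = [j for j in jobs if j.state == ADJUST]
        return max(adjust, key=lambda j: need(j, now)) if adjust else None

The equalizing scheduler of Section III-B replaces this rule with a single criterion: it returns the job with the most distant estimated completion time.
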
When jobs are submitted they are marked as NODATA, and they move to ADJUST as soon as their first task is completed, but obviously not the other way around. If a job demands more resources than are currently available, it may end up becoming an UNDEAD job. Eventually, when all its tasks are completed, the job is removed from the list of running jobs. In order to avoid starvation, UNDEAD jobs are always run first, followed by NODATA jobs with no running tasks, and finally ADJUST jobs.

V. EXPERIMENTS

In this section we present four different experiments carried out to explore the effectiveness of the different job completion time estimation techniques, as well as to evaluate the task scheduling techniques we have implemented in the core of the Hadoop JobTracker. To evaluate the chosen job estimation technique we use two different MapReduce applications: the first presents a regular task length distribution, while the second shows an irregular task length distribution. In particular, we used the sample WordCount application distributed with Hadoop as the job with regular task completion times, and a distributed simulator as the job with irregular task completion times.

A. Experimental environment

In our experiments, we used five 2-way Intel Xeon nodes with 2 GB of RAM as our experimental Hadoop cluster. One of the nodes was configured as the JobTracker, another one as the NameNode, and the three remaining nodes were used as DataNodes and TaskTrackers. All the nodes were connected using Gigabit Ethernet, and were running 32-bit Debian GNU/Linux 4.0 with kernel version 2.6. We used OpenJDK 1.6 to run the MapReduce applications, and we used Hadoop 0.21-dev. The simulator used in the experiments is the one presented in [5], extended to run a number of system configurations. Each simulator instance is run as a map task, and we leverage MapReduce to coordinate the distributed execution of parallel simulations.

B. Goodness of job completion estimation techniques

In this first experiment we consider the three different techniques for estimating job completion time presented in Section III-A. Results are presented in Figure 1. As can be observed, the Completed Tasks and Input Size estimators show almost the same behavior.

Fig. 1. Completion time for each estimator

Fig. 2. Task length distribution: WordCount

Both of them show a high variability, and it is only when the job gets close to its actual completion time that these methods show a low absolute error in the estimation. The Partial Time method offers more stable results over time, with a relative error below 5% for this experiment, and it presents a clear trend to be optimistic in its estimations; we are currently working on increasing the accuracy of this method. Notice that Partial Time is the approach that exhibits the best behavior, and therefore this is the method used for the rest of the experiments presented in this section.

Fig. 3. Partial time estimator: WordCount

C. Estimating job completion time

In a second experiment we evaluate the accuracy of the job completion time prediction technique for a WordCount application (regular task length distribution) and for the simulator (irregular task length distribution). Figures 2 and 3 show the results for the WordCount application. In particular, Figure 2 shows the task completion time distribution (and which of the three TaskTrackers ran each task) over time, that is, the actual time each Mapper needs to complete. Notice that task completion time is very regular for this application, and the small visible variability can be explained by the heterogeneity of the nodes in the Hadoop cluster. Figure 3 shows the absolute error of the predicted job completion time with respect to the actual completion time at each moment of the job execution. Negative values mean that the technique is optimistic, that is, that the predicted job completion time is earlier than the actual completion time observed when the job finishes. The estimation is continuously updated over time. As can be observed, for an application with a regular task completion time distribution the estimation is particularly stable, with some variations in the initial phases of the experiment while the predictor is calibrated, and it becomes more stable as more tasks are completed. Notice that throughout the experiment the relative error of the prediction stays below 5% of the actual job completion time, that is, about 90 s.

Figures 4 and 5 show the results for the simulator application. In particular, Figure 4 shows the task length distribution (and which of the three TaskTrackers ran each task) over time. Notice that in this case task completion time is very irregular, as the task time depends on the hardness of the simulator parameters. Figure 5 shows the absolute error of the predictions for the simulator application. As in the previous experiment, negative values mean that the technique is optimistic. The high variability of the task length reduces, in this case, the accuracy of the predictions. In periods in which the average task length is lower, such as the periods 0–4,000 s and 7,000–10,000 s, the prediction is updated to expect an earlier completion time, and in periods in which the average task length increases, the prediction is updated to expect a later completion time. Notice that the accuracy of the prediction increases over time, as the average task completion time becomes more accurate.
The prediction assumes that each task takes this average time to complete and thus, the larger the share of tasks already observed, the more accurate the prediction becomes. This can be observed in the fact that, although the absolute prediction error follows a pattern similar to that of the task length distribution over time, the prediction and the actual completion time converge as time gets closer to the actual job completion time.

D. Equalizing completion time for multiple jobs

In a third experiment we explore the capacity of the proposed Hadoop scheduler to equalize the completion time of two jobs.

Fig. 4. Task length distribution: Simulator

Fig. 5. Partial time estimator: Simulator

Fig. 6. Evolution of the estimated completion time

Fig. 7. Evolution of the job's percentage of completion

This scenario is taken as a particular case of a scheduling policy that can be implemented in the Hadoop scheduler. The experiment considers two instances of the simulator application that are submitted at different moments in time. The scheduler is then responsible for making both instances complete simultaneously, and it achieves this objective by allocating a different number of execution slots to each job based on its estimated completion time over time. In this case the simulator is configured to run fewer tests than in the previous sections, so the job has a shorter execution time and less variability in the task length.

Figure 6 shows the absolute error of the predictions for the simulator application. As in the previous experiments, negative values mean that the technique is optimistic. Figure 7 shows the percentage of each job's tasks already completed over time. As can be observed, both instances of the same job are started at different moments in time: the first instance starts at time 0, while the second instance starts shortly after time 200 s. Both job instances are identical, so they have to complete the same number of tasks. Looking at the progress of both jobs over time, it can be observed that when the second job starts, the first job stops making any progress, because the scheduler decides to allocate all the available execution slots to the second job. When the estimated completion times of both jobs are equalized, the scheduler starts balancing the allocation of execution slots across both jobs, so that they continue making similar progress until completion. Notice that this experiment is driven by the progress made by each job over time, but not by any explicit completion time goal.

E. Managing jobs with different deadlines

In the last experiment, two instances of the WordCount application are submitted, and the scheduler is provided with an explicit completion time goal for each job. The scheduler is therefore responsible for meeting the goal of both jobs by continuously monitoring the progress and the estimated completion time of each job, and by dynamically adjusting the allocation of execution slots to each one. Notice that in this experiment the deadlines are set to be achievable with the available resources.

Figure 8 shows the absolute error of the predictions. As in the previous experiments, negative values mean that the technique is optimistic. Figure 9 shows the percentage of each job's tasks already completed over time. The experiment shows two instances of the WordCount job being submitted to the Hadoop cluster. The first instance (Job1) is submitted at time 0 s, while the second instance (Job2) is submitted at time 400 s (J2 in the chart).

Fig. 8. Evolution of the estimated completion time

Fig. 9. Evolution of the job's percentage of completion

The deadline for the first job is set to 2,400 s after its submission (D1), that is, at time 2,400 s. The deadline for the second instance is set to 1,200 s after its submission (D2), that is, at time 1,600 s. Notice that the minimum completion time for one instance of this job, running alone in the Hadoop cluster with all the available resources allocated, is about 1,000 s. Therefore, we chose a tight deadline factor for Job2, which is 1.2x its minimum achievable completion time, while for Job1 it is 2.4x and thus more relaxed. Notice also that there is a delay of around 30 seconds between the job submission time and the point at which tasks start being executed; this time is explained by the cost of splitting the input data and distributing keys and values through HDFS.

When Job1 is submitted, at time 0, it is automatically put to run. At time J2, the second job is submitted. Although the scheduler needs to start running this second job immediately in order to collect the progress data that allows estimating its expected completion time, it is not able to do so until at least one task slot used by Job1 is freed. When Job2 starts running, the scheduler determines that it needs to allocate more resources to this job, and thus Job1 is slowed down in favour of Job2. This can be observed in Figure 9. The scheduler keeps adjusting the resource allocation of both jobs until their completion, with more precision as the accuracy of its estimations improves.

Finally, Figure 10 shows the allocation of task execution slots to each of the jobs. In this particular experiment we used 2 TaskTrackers, each configured to host a maximum of 2 tasks; thus, four is the maximum number of concurrent tasks that can be executed. Notice how Job2 gets more resources than Job1 from its submission onward, which is explained by its tighter deadline. The scheduler is configured to minimize the resources allocated to each job, using the minimum required to meet, when possible, the deadline of each job. It can be observed that once Job2 is completed, Job1 still has some margin to meet its goal, so the scheduler decides to allocate only 2 of the available slots to this job (with an occasional allocation of only 1 slot). The other two slots remain available to run other tasks, or even to introduce an opportunity to apply energy saving techniques.

Fig. 10. Distribution of running tasks

VI. PERFORMANCE STUDY

In this section we present two additional experiments related to the performance of MapReduce applications. In the first experiment we evaluate the impact of data locality, while in the second one we discuss some issues related to memory usage in Hadoop's MapReduce. Since our goal was to understand Hadoop's implementation, and especially how the initial steps perform and what their overhead is, this time we used a custom application specifically designed to do the bare minimum: only read its input. This job included its own input format to divide the input into as many splits as available TaskTrackers, with only one concurrent map task per node. Note that in the following sections we use next() to refer to the last method that actually reads the files from HDFS, and that it was configured to always read 256 MB at a time.
The map tasks stop right after reading, so there is no output, and consequently no reduce tasks either.

A. Experimental environment

In these experiments we used an environment similar to that of the previous ones. The JobTracker and the NameNode were run on the same Intel Xeon nodes with 2 GB of RAM, while the 2 slaves used as DataNodes and TaskTrackers were Intel Xeon nodes with 16 GB of RAM running 64-bit Debian GNU/Linux.
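The essence of this measurement can be reproduced outside Hadoop by timing fixed-size sequential reads. The sketch below is only an approximation of the methodology (the real job reads through HDFS and a custom RecordReader, and the file path is a placeholder), with the 256 MB chunk matching the amount read by each next() call:

    import time

    CHUNK = 256 * 1024 * 1024  # 256 MB, the amount each next() call reads

    def time_reads(path):
        # Time each sequential 256 MB read of a file, mimicking the
        # per-next() measurements reported in Tables I and II.
        timings = []
        with open(path, "rb") as f:
            while True:
                start = time.monotonic()
                if not f.read(CHUNK):
                    break
                timings.append(time.monotonic() - start)
        return timings
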

B. Remote data overhead

In order to evaluate the impact of data locality on Hadoop's performance during the map phase, we ran the application switching the two available slaves between the DataNode and TaskTracker roles. First, the input was copied to only one DataNode and read from the same node's TaskTracker; then, the read was repeated using the second TaskTracker, in order to read the data remotely. The input consisted of 8 files of exactly 64 MB, the same as HDFS's block size (dfs.block.size), so there was no overhead to join split files. Also, since there was only a single TaskTracker with one map slot reading 256 MB at a time, the task had to run 2 sequential next() operations to complete the total 512 MB. The results below show the averages after repeating each execution 5 times.

TABLE I. Average next() method time (ms) for each TaskTracker/DataNode placement (Slave 1, Slave 2)

TABLE II. Total elapsed time (s) for each TaskTracker/DataNode placement (Slave 1, Slave 2)

Fig. 11. Elapsed time to execute each next() call before detecting the hardware failure

Table I shows the time needed to execute each next() method under the different scenarios. As expected, it is significantly faster to read locally: approximately 1.5 seconds instead of 3 seconds. But while reading from remote HDFS nodes is slower than reading locally, remote reads are not too far behind, and in this case the difference can be attributed exclusively to the data transfer over the network. Table II shows the elapsed time to execute the entire job. Again, there is a difference between executions that use the same host as TaskTracker and DataNode and those that use different hosts. However, taking into account that each job calls the next() method twice, it can be observed that the 3 second difference is in fact solely caused by the differences in next() times.

C. Memory constraints

During some of the experiments we experienced a degradation of Hadoop's performance when using large inputs, and we found out later that it was caused by a hardware issue that had reduced the amount of memory in one of the slaves to only 3 GB. It is interesting, however, to see how such failures affect the whole process. In this experiment we ran a job repeatedly, gradually increasing its input size: first 8 files of 64 MB (512 MB), then 16 files (1 GB), and so on up to 88 files (5.5 GB). Again, the program did essentially nothing other than reading its input files. Also, note that the job was equally divided into 2 tasks independently of its input size, but within each task every step (or next() operation) always read the same amount of data: 256 MB. Taking the initial input of 512 MB, for instance, it is split into two tasks, each of which reads 256 MB only once; an input of 5.5 GB, on the other hand, is also split into two tasks, but this time each task proceeds to read 256 MB eleven times.

Fig. 12. Elapsed time to execute each next() call after fixing the hardware failure

Figure 11 shows the time it took to execute each next() call before detecting the hardware failure. One of the slaves was slower all along, but from 56 files (3.5 GB) onward, the time it took to do the same amount of work increased on both slaves. Even though only one of the machines, slave 1, failed, the failure also affected the second slave. This isn't that surprising considering that both were also working as DataNodes: the failing node was not only slower at running its own task, but also at serving data to the remote TaskTracker.
After fixing the failure, the times were much more stable, as shown in Figure 12.

VII. RELATED WORK

Process scheduling is a deeply explored topic for parallel applications, considering different types of applications, different scheduling goals and different platform architectures ([8], [7], [10]). However, there is little related work on scheduling for MapReduce applications. The initial scheduler shipped with the Hadoop distribution uses a very simple FIFO policy with five different application priorities. In addition, in order to isolate the performance of different jobs, the Hadoop project is working on a system for provisioning dedicated Hadoop clusters to applications [2]; however, this approach can result in resource underutilization. In [11] the authors propose a fair scheduling implementation to manage data-intensive and interactive MapReduce applications executed on very large clusters. The main concern of this scheduling policy is to give equal shares to each user and, since exploiting data locality is a must for this kind of application, to execute each task near the data it uses. However, this approach is not appropriate for long-running applications with different performance goals, and it does not dynamically adapt its scheduling decisions to changes in the progress of the applications.

VIII. CONCLUSIONS

In this paper we have presented a prototype of a task scheduler for MapReduce applications, implemented on top of Hadoop, Apache's open-source implementation of a MapReduce framework. The proposed scheduler is able to calculate the estimated completion time of each MapReduce job in the system. The estimator takes advantage of the facts that each MapReduce job is composed of a large number of tasks (mappers and reducers), that this number of tasks is calculated in advance during the job initialization phase (when the input data is split), and that the progress of the job can be observed at runtime. The proposed scheduler takes each submitted and not yet completed Hadoop job and monitors the average task length of its already completed tasks. This information is then used to predict the job completion time. Based on these estimations, the scheduler is able to dynamically adapt the number of task slots each job is allocated, thereby introducing runtime performance management into the Hadoop MapReduce framework.

This work is a first step toward the creation of an efficient MapReduce scheduler for Hadoop that will be able to manage the performance of MapReduce applications based on high-level policies. We plan to consider several high-level objectives, such as energy saving, workload collocation and data placement, in the management of the resources of the MapReduce cluster. We are also working on the integration of this scheduler into the EmotiveCloud [4] middleware.

REFERENCES

[1] Hadoop Map/Reduce Tutorial. tutorial.html.
[2] Hadoop On Demand user guide.html.
[3] HDFS Architecture design.html.
[4] Autonomic Systems and eBusiness Platforms research line, Barcelona Supercomputing Center (BSC). EMOTIVE Cloud.
[5] D. Carrera, M. Steinder, I. Whalley, J. Torres, and E. Ayguadé. Enabling resource sharing between transactional and batch workloads using dynamic application placement. In Middleware '08: Proceedings of the 9th ACM/IFIP/USENIX International Conference on Middleware, New York, NY, USA, 2008. Springer-Verlag New York, Inc.
[6] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters.
In OSDI '04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, December 2004.
[7] D. G. Feitelson. Job scheduling in multiprogrammed parallel systems. Technical Report RC 19790 (87657), IBM Research, August.
[8] D. G. Feitelson and L. Rudolph. Parallel job scheduling: Issues and approaches. In JSSPP, pages 1–18, 1995.
[9] S. Ghemawat, H. Gobioff, and S.-T. Leung. The Google file system. SIGOPS Oper. Syst. Rev., 37(5):29–43, 2003.
[10] C. Ernemann, V. Hamscher, and R. Yahyapour. Economic scheduling in grid computing. In Scheduling Strategies for Parallel Processing, Springer, 2002.
[11] M. Zaharia, D. Borthakur, J. Sen Sarma, K. Elmeleegy, S. Shenker, and I. Stoica. Job scheduling for multi-user MapReduce clusters. Technical Report UCB/EECS-2009-55, EECS Department, University of California, Berkeley, April 2009.

ACKNOWLEDGEMENTS

This work is partially supported by the Ministry of Science and Technology of Spain and the European Union (FEDER funds) under contract TIN , and by IBM through the 2008 IBM Faculty Award program.


More information

Guidelines for Selecting Hadoop Schedulers based on System Heterogeneity

Guidelines for Selecting Hadoop Schedulers based on System Heterogeneity Noname manuscript No. (will be inserted by the editor) Guidelines for Selecting Hadoop Schedulers based on System Heterogeneity Aysan Rasooli Douglas G. Down Received: date / Accepted: date Abstract Hadoop

More information

Big Data and Apache Hadoop s MapReduce

Big Data and Apache Hadoop s MapReduce Big Data and Apache Hadoop s MapReduce Michael Hahsler Computer Science and Engineering Southern Methodist University January 23, 2012 Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 1 / 23

More information

FP-Hadoop: Efficient Execution of Parallel Jobs Over Skewed Data

FP-Hadoop: Efficient Execution of Parallel Jobs Over Skewed Data FP-Hadoop: Efficient Execution of Parallel Jobs Over Skewed Data Miguel Liroz-Gistau, Reza Akbarinia, Patrick Valduriez To cite this version: Miguel Liroz-Gistau, Reza Akbarinia, Patrick Valduriez. FP-Hadoop:

More information

METHOD OF A MULTIMEDIA TRANSCODING FOR MULTIPLE MAPREDUCE JOBS IN CLOUD COMPUTING ENVIRONMENT

METHOD OF A MULTIMEDIA TRANSCODING FOR MULTIPLE MAPREDUCE JOBS IN CLOUD COMPUTING ENVIRONMENT METHOD OF A MULTIMEDIA TRANSCODING FOR MULTIPLE MAPREDUCE JOBS IN CLOUD COMPUTING ENVIRONMENT 1 SEUNGHO HAN, 2 MYOUNGJIN KIM, 3 YUN CUI, 4 SEUNGHYUN SEO, 5 SEUNGBUM SEO, 6 HANKU LEE 1,2,3,4,5 Department

More information

Introduction to Hadoop

Introduction to Hadoop 1 What is Hadoop? Introduction to Hadoop We are living in an era where large volumes of data are available and the problem is to extract meaning from the data avalanche. The goal of the software tools

More information

Finding Insights & Hadoop Cluster Performance Analysis over Census Dataset Using Big-Data Analytics

Finding Insights & Hadoop Cluster Performance Analysis over Census Dataset Using Big-Data Analytics Finding Insights & Hadoop Cluster Performance Analysis over Census Dataset Using Big-Data Analytics Dharmendra Agawane 1, Rohit Pawar 2, Pavankumar Purohit 3, Gangadhar Agre 4 Guide: Prof. P B Jawade 2

More information

Hadoop Distributed File System. T-111.5550 Seminar On Multimedia 2009-11-11 Eero Kurkela

Hadoop Distributed File System. T-111.5550 Seminar On Multimedia 2009-11-11 Eero Kurkela Hadoop Distributed File System T-111.5550 Seminar On Multimedia 2009-11-11 Eero Kurkela Agenda Introduction Flesh and bones of HDFS Architecture Accessing data Data replication strategy Fault tolerance

More information

Analysis of Information Management and Scheduling Technology in Hadoop

Analysis of Information Management and Scheduling Technology in Hadoop Analysis of Information Management and Scheduling Technology in Hadoop Ma Weihua, Zhang Hong, Li Qianmu, Xia Bin School of Computer Science and Technology Nanjing University of Science and Engineering

More information

MapReduce. MapReduce and SQL Injections. CS 3200 Final Lecture. Introduction. MapReduce. Programming Model. Example

MapReduce. MapReduce and SQL Injections. CS 3200 Final Lecture. Introduction. MapReduce. Programming Model. Example MapReduce MapReduce and SQL Injections CS 3200 Final Lecture Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI'04: Sixth Symposium on Operating System Design

More information

Analysis and Modeling of MapReduce s Performance on Hadoop YARN

Analysis and Modeling of MapReduce s Performance on Hadoop YARN Analysis and Modeling of MapReduce s Performance on Hadoop YARN Qiuyi Tang Dept. of Mathematics and Computer Science Denison University tang_j3@denison.edu Dr. Thomas C. Bressoud Dept. of Mathematics and

More information

From GWS to MapReduce: Google s Cloud Technology in the Early Days

From GWS to MapReduce: Google s Cloud Technology in the Early Days Large-Scale Distributed Systems From GWS to MapReduce: Google s Cloud Technology in the Early Days Part II: MapReduce in a Datacenter COMP6511A Spring 2014 HKUST Lin Gu lingu@ieee.org MapReduce/Hadoop

More information

Improving Job Scheduling in Hadoop

Improving Job Scheduling in Hadoop Improving Job Scheduling in Hadoop MapReduce Himangi G. Patel, Richard Sonaliya Computer Engineering, Silver Oak College of Engineering and Technology, Ahmedabad, Gujarat, India. Abstract Hadoop is a framework

More information

Energy-Saving Cloud Computing Platform Based On Micro-Embedded System

Energy-Saving Cloud Computing Platform Based On Micro-Embedded System Energy-Saving Cloud Computing Platform Based On Micro-Embedded System Wen-Hsu HSIEH *, San-Peng KAO **, Kuang-Hung TAN **, Jiann-Liang CHEN ** * Department of Computer and Communication, De Lin Institute

More information

Jeffrey D. Ullman slides. MapReduce for data intensive computing

Jeffrey D. Ullman slides. MapReduce for data intensive computing Jeffrey D. Ullman slides MapReduce for data intensive computing Single-node architecture CPU Machine Learning, Statistics Memory Classical Data Mining Disk Commodity Clusters Web data sets can be very

More information

A REAL TIME MEMORY SLOT UTILIZATION DESIGN FOR MAPREDUCE MEMORY CLUSTERS

A REAL TIME MEMORY SLOT UTILIZATION DESIGN FOR MAPREDUCE MEMORY CLUSTERS A REAL TIME MEMORY SLOT UTILIZATION DESIGN FOR MAPREDUCE MEMORY CLUSTERS Suma R 1, Vinay T R 2, Byre Gowda B K 3 1 Post graduate Student, CSE, SVCE, Bangalore 2 Assistant Professor, CSE, SVCE, Bangalore

More information

Big Data - Infrastructure Considerations

Big Data - Infrastructure Considerations April 2014, HAPPIEST MINDS TECHNOLOGIES Big Data - Infrastructure Considerations Author Anand Veeramani / Deepak Shivamurthy SHARING. MINDFUL. INTEGRITY. LEARNING. EXCELLENCE. SOCIAL RESPONSIBILITY. Copyright

More information

MapReduce, Hadoop and Amazon AWS

MapReduce, Hadoop and Amazon AWS MapReduce, Hadoop and Amazon AWS Yasser Ganjisaffar http://www.ics.uci.edu/~yganjisa February 2011 What is Hadoop? A software framework that supports data-intensive distributed applications. It enables

More information

<Insert Picture Here> Oracle and/or Hadoop And what you need to know

<Insert Picture Here> Oracle and/or Hadoop And what you need to know Oracle and/or Hadoop And what you need to know Jean-Pierre Dijcks Data Warehouse Product Management Agenda Business Context An overview of Hadoop and/or MapReduce Choices, choices,

More information

Distributed Framework for Data Mining As a Service on Private Cloud

Distributed Framework for Data Mining As a Service on Private Cloud RESEARCH ARTICLE OPEN ACCESS Distributed Framework for Data Mining As a Service on Private Cloud Shraddha Masih *, Sanjay Tanwani** *Research Scholar & Associate Professor, School of Computer Science &

More information

Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms

Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms Distributed File System 1 How do we get data to the workers? NAS Compute Nodes SAN 2 Distributed File System Don t move data to workers move workers to the data! Store data on the local disks of nodes

More information

MASSIVE DATA PROCESSING (THE GOOGLE WAY ) 27/04/2015. Fundamentals of Distributed Systems. Inside Google circa 2015

MASSIVE DATA PROCESSING (THE GOOGLE WAY ) 27/04/2015. Fundamentals of Distributed Systems. Inside Google circa 2015 7/04/05 Fundamentals of Distributed Systems CC5- PROCESAMIENTO MASIVO DE DATOS OTOÑO 05 Lecture 4: DFS & MapReduce I Aidan Hogan aidhog@gmail.com Inside Google circa 997/98 MASSIVE DATA PROCESSING (THE

More information

Survey on Job Schedulers in Hadoop Cluster

Survey on Job Schedulers in Hadoop Cluster IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661, p- ISSN: 2278-8727Volume 15, Issue 1 (Sep. - Oct. 2013), PP 46-50 Bincy P Andrews 1, Binu A 2 1 (Rajagiri School of Engineering and Technology,

More information

Performance Modeling of MapReduce Jobs in Heterogeneous Cloud Environments Zhuoyao Zhang University of Pennsylvania zhuoyao@seas.upenn.

Performance Modeling of MapReduce Jobs in Heterogeneous Cloud Environments Zhuoyao Zhang University of Pennsylvania zhuoyao@seas.upenn. Performance Modeling of MapReduce Jobs in Heterogeneous Cloud Environments Zhuoyao Zhang University of Pennsylvania zhuoyao@seas.upenn.edu Ludmila Cherkasova Hewlett-Packard Labs lucy.cherkasova@hp.com

More information

Hadoop. Scalable Distributed Computing. Claire Jaja, Julian Chan October 8, 2013

Hadoop. Scalable Distributed Computing. Claire Jaja, Julian Chan October 8, 2013 Hadoop Scalable Distributed Computing Claire Jaja, Julian Chan October 8, 2013 What is Hadoop? A general-purpose storage and data-analysis platform Open source Apache software, implemented in Java Enables

More information

Big Data and Hadoop. Sreedhar C, Dr. D. Kavitha, K. Asha Rani

Big Data and Hadoop. Sreedhar C, Dr. D. Kavitha, K. Asha Rani Big Data and Hadoop Sreedhar C, Dr. D. Kavitha, K. Asha Rani Abstract Big data has become a buzzword in the recent years. Big data is used to describe a massive volume of both structured and unstructured

More information

R.K.Uskenbayeva 1, А.А. Kuandykov 2, Zh.B.Kalpeyeva 3, D.K.Kozhamzharova 4, N.K.Mukhazhanov 5

R.K.Uskenbayeva 1, А.А. Kuandykov 2, Zh.B.Kalpeyeva 3, D.K.Kozhamzharova 4, N.K.Mukhazhanov 5 Distributed data processing in heterogeneous cloud environments R.K.Uskenbayeva 1, А.А. Kuandykov 2, Zh.B.Kalpeyeva 3, D.K.Kozhamzharova 4, N.K.Mukhazhanov 5 1 uskenbaevar@gmail.com, 2 abu.kuandykov@gmail.com,

More information

Reduction of Data at Namenode in HDFS using harballing Technique

Reduction of Data at Namenode in HDFS using harballing Technique Reduction of Data at Namenode in HDFS using harballing Technique Vaibhav Gopal Korat, Kumar Swamy Pamu vgkorat@gmail.com swamy.uncis@gmail.com Abstract HDFS stands for the Hadoop Distributed File System.

More information

MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012

MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012 MapReduce and Hadoop Aaron Birkland Cornell Center for Advanced Computing January 2012 Motivation Simple programming model for Big Data Distributed, parallel but hides this Established success at petabyte

More information

Heterogeneity-Aware Resource Allocation and Scheduling in the Cloud

Heterogeneity-Aware Resource Allocation and Scheduling in the Cloud Heterogeneity-Aware Resource Allocation and Scheduling in the Cloud Gunho Lee, Byung-Gon Chun, Randy H. Katz University of California, Berkeley, Yahoo! Research Abstract Data analytics are key applications

More information

Advanced Load Balancing Mechanism on Mixed Batch and Transactional Workloads

Advanced Load Balancing Mechanism on Mixed Batch and Transactional Workloads Advanced Load Balancing Mechanism on Mixed Batch and Transactional Workloads G. Suganthi (Member, IEEE), K. N. Vimal Shankar, Department of Computer Science and Engineering, V.S.B. Engineering College,

More information

Introduction to MapReduce and Hadoop

Introduction to MapReduce and Hadoop Introduction to MapReduce and Hadoop Jie Tao Karlsruhe Institute of Technology jie.tao@kit.edu Die Kooperation von Why Map/Reduce? Massive data Can not be stored on a single machine Takes too long to process

More information