Energy Efficient Scheduling of MapReduce Workloads on Heterogeneous Clusters

Nezih Yigitbasi, Delft University of Technology, the Netherlands

Kushal Datta, Nilesh Jain, and Theodore Willke, Intel Labs, Hillsboro, OR
{kushal.datta,nilesh.jain,theodore.l.willke}@intel.com

ABSTRACT

Energy efficiency has become the center of attention in emerging data center infrastructures as increasing energy costs continue to outgrow all other operating expenditures. In this work we investigate energy aware scheduling heuristics to increase the energy efficiency of MapReduce workloads on heterogeneous Hadoop clusters comprising both low power (wimpy) and high performance (brawny) nodes. We first make a case for heterogeneity by showing that low power Intel Atom processors and high performance Intel Sandy Bridge processors are more energy efficient for I/O bound workloads and CPU bound workloads, respectively. Then we present several energy efficient scheduling heuristics that exploit this heterogeneity and the real-time power measurements enabled by modern processor architectures. Through experiments on a 23-node heterogeneous Hadoop cluster we demonstrate up to 27% better energy efficiency with our heuristics compared with the default Hadoop scheduler.

Categories and Subject Descriptors

H.3.4 [Systems and Software]: Distributed systems; D.4.8 [Performance]: Measurements

1. INTRODUCTION

The power consumption of data centers is expected to reach unprecedented scales. The EPA estimates that US data centers will consume 100 billion kilowatt hours annually by 2011 at a cost of $7.4 billion per year [8]. Moreover, the annual energy cost has already surpassed the annualized capital cost of servers [4], and is expected to surpass all other operating costs in future deployments. Even worse, the energy problem is exacerbated by increasing processor and rack densities. Therefore, energy efficiency is becoming a first class citizen in many aspects of modern data centers, and industry and academia are seeking ways to improve data center energy efficiency. Researchers have been investigating diverse solutions to this problem, such as real-time power monitoring for better power management, operating the systems at optimal efficiency, smarter cooling techniques, and smarter workload placement techniques. This work slowed the growth of energy consumption significantly between 2005 and 2010 [12].

There is an increasing amount of research focused on data center energy efficiency. In general, researchers have addressed energy efficient techniques for consolidation and workload placement [19, 20, 21] and load distribution [15]. And with the increasing interest in MapReduce [7] for data intensive computing, they have endeavored to improve the energy efficiency of MapReduce clusters [16, 13, 11]. Recently the feasibility of deploying low power (wimpy) nodes in clusters was demonstrated [1, 6], and these studies have led to interesting discussions about the use of wimpy nodes in data centers [14, 9].

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. GCM 2011, December 12th, 2011, Lisbon, Portugal. Copyright 2011 ACM .../11/12...$10.00.
However, little work has been done on energy efficient scheduling in heterogeneous MapReduce clusters comprising both wimpy and brawny (high performance) nodes. In this work we fill this gap by investigating energy efficient scheduling techniques for heterogeneous MapReduce clusters. Toward this end we first make a case for heterogeneity by showing that wimpy Intel Atom processors and brawny Intel Sandy Bridge processors are more energy efficient for I/O bound workloads and CPU bound workloads, respectively. To characterize the energy efficiency of CPU bound workloads we use the performance per watt (perf/watt) metric, where performance is 1/completion time; for I/O bound workloads we use IOPS/watt as the energy efficiency metric, since the performance of I/O bound workloads is well characterized by I/O operations per second (IOPS).

After making a case for heterogeneity, we investigate several scheduling heuristics for energy efficient execution of MapReduce workloads that exploit the cluster heterogeneity and the real-time power measurements enabled by recent processor architectures. Our energy efficient scheduling heuristics take both performance and power into account when making scheduling decisions. Moreover, these heuristics address the nontrivial trade-off between the conflicting performance and power goals: although running a particular task on a wimpy node may result in lower power consumption, the job may take longer to complete, resulting in degraded performance and increased overall energy consumption.

With this work we make the following contributions. First, we demonstrate a case for heterogeneity by showing that, for I/O bound MapReduce workloads, Atom nodes are around 2.5x more energy efficient than the Sandy Bridge nodes. Second, we propose and evaluate three scheduling heuristics on a heterogeneous Hadoop cluster, and we show that our heuristics are able to provide up to 27% better energy efficiency compared with the default Hadoop scheduler.
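To make the two metrics concrete, here is a minimal sketch (the class and method names are ours, not from the paper). For a CPU bound job, perf/watt = (1/T)/P = 1/(T*P) = 1/energy, so ranking nodes by perf/watt is equivalent to ranking them by inverse energy; for an I/O bound job, IOPS/watt = (ops/T)/P = ops/joule, which matches the paper's later observation that IOPS/watt is the number of I/O operations performed per joule:

```java
// Hedged sketch of the paper's two energy efficiency metrics; names are ours.
public final class EnergyMetrics {
    private EnergyMetrics() {}

    /** Perf/watt for CPU bound jobs: (1/seconds)/watts = 1/joules. */
    public static double perfPerWatt(double completionSeconds, double avgWatts) {
        return 1.0 / (completionSeconds * avgWatts); // equivalently 1/energy
    }

    /** IOPS/watt for I/O bound jobs: (ops/second)/watts = ops/joule. */
    public static double iopsPerWatt(long ioOps, double completionSeconds, double avgWatts) {
        return (ioOps / completionSeconds) / avgWatts;
    }

    public static void main(String[] args) {
        // Illustrative numbers only: a 600 s job drawing 100 W on average
        // consumes 60 kJ, so perf/watt = 1/60000, roughly 1.67e-5.
        System.out.println(perfPerWatt(600, 100));
        System.out.println(iopsPerWatt(3_000_000, 600, 100));
    }
}
```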

CPU                                             Memory   Storage           CPU TDP [W]
Intel Core(TM) i7-2600 (Sandy Bridge), 3.4GHz   8GB      Intel X25-E SSD   95
Intel Atom D510, 1.66GHz                        4GB      Intel X25-E SSD   13

Table 1: Specifications of the platforms that we use in our experiments.

2. BACKGROUND: MAPREDUCE AND HADOOP

In this section we briefly describe MapReduce [7], which is a programming model for clusters, and Hadoop [2], which is an open source MapReduce implementation. MapReduce is typically used for processing large amounts of data on commodity clusters. The user specifies a map function that processes a key-value pair to produce a list of intermediate key-value pairs, and a reduce function to aggregate the output of the map function. Hadoop is a framework that implements the MapReduce programming model and simplifies cluster programming by taking care of automatic parallelization, load balancing, and fault tolerance.

A typical Hadoop cluster runs over the Hadoop Distributed File System (HDFS) and has a single job tracker (master) that is responsible for managing the task trackers (slaves) running on each node in the cluster. When a user submits a job consisting of map and reduce functions, Hadoop first splits the input data that resides in HDFS into splits. Then, Hadoop divides the job into several tasks depending on the size of the input data. The job tracker schedules these tasks in response to the heartbeats sent periodically by the task trackers. A single map task is run for every input split, producing a list of key-value pairs. Hadoop then partitions the map output based on the keys and runs a reduce task for each key, writing the final output to HDFS.

In this work we use the default scheduling policy of Hadoop as the baseline for our performance evaluation. The default Hadoop scheduler is first in first out (FIFO) with multiple priority levels. On receiving a heartbeat from a task tracker with information about the number of free map/reduce slots, the scheduler scans the job queue in priority order and determines which tasks to assign to this task tracker, considering data locality for map tasks to improve performance.

3. A CASE FOR HETEROGENEITY

In this section we demonstrate a case for heterogeneity through experiments with two different platforms: low power Atom processors (the wimpy processors) and high performance Sandy Bridge (SNB) processors (the brawny processors). Table 1 shows the specifications of these platforms. We show in this section that there is a scheduling opportunity that we can exploit, since for I/O bound workloads wimpy nodes provide up to 2.5x better energy efficiency compared with the brawny nodes.

As the workload we use three MapReduce applications from the HiBench [10] benchmark suite: word count, sort, and nutch. Word count and sort are micro benchmarks that are available in the Hadoop distribution, and nutch is the indexing system of the Apache Nutch search engine [3]. These applications are representative of real MapReduce workloads, as the computations they perform are common use cases of MapReduce, namely transforming data from one representation to another, extracting a small amount of data from a large data set, and large-scale indexing [10].

It is unfair to compare wimpy node and brawny node clusters of the same size: a wimpy node cluster will be larger than a brawny node cluster when the two clusters are built to provide the same level of performance, or when they are provisioned the same amount of power.
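The next paragraph sizes the two clusters by provisioning equal power according to processor TDP. As a minimal sketch of that equal-power rule, using the Table 1 values (the class and method names are ours):

```java
// Equal-power provisioning by TDP; a sketch with the Table 1 values.
public final class ClusterSizing {
    /** Number of nodes of a given TDP that fit within a power budget. */
    static int nodesForBudget(double budgetWatts, double tdpWatts) {
        return (int) Math.floor(budgetWatts / tdpWatts);
    }

    public static void main(String[] args) {
        double snbTdp = 95.0, atomTdp = 13.0;  // Table 1
        double budget = snbTdp;                // one SNB node's worth of power
        // 95 / 13 is about 7.3, so seven Atom nodes match one SNB node,
        // the roughly 1:7 TDP ratio used below.
        System.out.println(nodesForBudget(budget, atomTdp)); // prints 7
    }
}
```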
In this work we use the Thermal Design Power (TDP) of the two processors to determine the number of nodes to deploy in the clusters, so that in the end the same amount of power is provisioned to both clusters according to their TDP values. As the ratio of the TDPs of the two platforms is 1:7, we create two different Hadoop clusters for our experiments: an Atom cluster of seven Atom nodes and a SNB cluster (pseudo-cluster) of a single SNB node. With the single node SNB cluster we are neglecting the impact of workload partitioning and network performance, which are two important concerns for MapReduce workloads. However, even if we deploy more than one SNB node and consider these impacts, our results will still hold, as our evaluation with a single node SNB cluster provides an upper bound for the perf/watt: when we increase the cluster size the perf/watt will get worse, since the power consumption scales linearly while the performance does not.

Instead of using the holistic power measured with a power meter at the wall socket, we use a linear power model for two reasons. First, some of the hardware components on our experimental platforms, such as the SATA headers and the USB and HDMI ports, add noticeable overheads to the measured holistic power. Second, these overheads on the Atom and SNB nodes are significantly different. Therefore, to model the holistic power for the Atom and SNB nodes we first determine the power consumed by the processor package for a particular workload. Then, through micro benchmarks, we determine the power consumption of the other components on the platform, such as the storage devices, main memory, and the network interface. We model the holistic power of a single node as the sum of its package power and the power consumption of the storage devices, network interface, and main memory, taking into account an overhead of 5% for unanticipated power consumption and a power supply efficiency of 90% for the AC to DC conversion.

Since we are interested in both the performance and the power consumption of the system, we characterize the energy efficiency with the perf/watt metric. We use 1/energy for CPU bound workloads and IOPS/watt, which is the number of I/O operations performed per joule, for I/O bound workloads.

The cumulative distribution function (CDF) of the power consumption of the two platforms is shown in Figure 1. We use the CDFs to characterize the power consumption profile of both platforms. Each graph shows the power consumption of the processor package and the whole cluster for both platforms. For all workloads, the processor consumes roughly half of the holistic power, confirming previous studies [18]. Also note that the dynamic power ranges of the two platforms are significantly different: Atom has a narrow dynamic range from 9W to 13W, while SNB has a larger dynamic range from 5W to around 15W for the sort and nutch workloads. Since the Atom's package power consumption also includes the power consumed by other components, such as the chipset, memory, SATA, and gigabit Ethernet controller, the idle Atom package power (9W) is slightly greater than the idle SNB package power (5W).
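The linear power model just described combines the measured package power with micro-benchmarked per-component powers; a hedged sketch follows (the 5% overhead and 90% efficiency constants are from the text, while the component wattages in main are placeholders, not measured values):

```java
// Hedged sketch of the linear holistic power model from Section 3.
public final class PowerModel {
    static final double OVERHEAD = 1.05;        // 5% unanticipated consumption
    static final double PSU_EFFICIENCY = 0.90;  // 90% AC to DC conversion

    /** Estimated wall power from the measured package power and component terms. */
    static double holisticWatts(double packageW, double storageW,
                                double memoryW, double nicW) {
        double components = packageW + storageW + memoryW + nicW;
        return components * OVERHEAD / PSU_EFFICIENCY;
    }

    public static void main(String[] args) {
        // Placeholder component powers; only the package term is measured live,
        // the rest would come from the micro benchmarks described above.
        System.out.println(holisticWatts(9.0, 2.4, 2.0, 1.0));
    }
}
```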

[Figure 1: Cumulative distribution function (CDF) of the power consumption of the processor package (Atom/SNB Node) and the whole cluster (Atom/SNB Cl.) for both the Atom and the Sandy Bridge (SNB) platforms; panels (a) word count, (b) sort, and (c) nutch plot the CDF [%] against power consumption [Watts].]

The narrow dynamic range of Atom suggests that it consumes a similar amount of power whether it is idle or not, and it is possible to exploit the low power Atom platform for better energy efficiency if, in the end, the performance is not degraded significantly, which is the case for I/O bound workloads as we show later in this section. Note that although we set up the two clusters based on the ratio of the TDP values (1:7), our measurements reveal that the Atom cluster consumes more power than the SNB cluster: around 1.7x for word count, 2.5x for sort, and 2.5x for the nutch workload. This is because TDP is not a good estimate of the actual power consumption, since the actual consumption varies with several conditions such as the temperature and the workload dynamics. However, regardless of this poor estimate our main result still holds: the Atom cluster is more energy efficient for I/O bound workloads because of the significant increase in performance.

We then investigate the job completion time and the perf/watt and show the results in Figure 2. The three workloads have different characteristics: word count is a CPU bound workload with a median CPU utilization around 80% on Atom nodes and 60% on SNB nodes, sort is an I/O bound workload, and nutch is a balanced workload with a median CPU utilization of around 50% on both platforms. We see that for the CPU bound word count workload the job completion time on the Atom cluster is around 1.3x that of the SNB cluster, and together with the higher power consumption of the Atom cluster this yields 2x better energy efficiency for the SNB cluster. For the I/O bound sort workload the Atom cluster has significantly better job completion time (3.5x), yielding 2.5x better energy efficiency over the SNB cluster, confirming previous research [6]. This shows that there is a class of workloads that results in better energy efficiency when executed on the Atom nodes, providing a scheduling opportunity that we exploit with our scheduling heuristics in the next section. Finally, for the nutch workload the SNB cluster has 1.7x better energy efficiency than the Atom cluster, despite the Atom cluster having slightly better performance (1%), because the power consumption of the Atom cluster is larger than that of the SNB cluster.

[Figure 2: Job completion times (top) and the normalized energy efficiency (bottom) for the word count, sort, and nutch workloads on the SNB and Atom clusters.]

4. SCHEDULING HEURISTICS

In the previous section we showed that the SNB nodes are more energy efficient for CPU bound workloads and the Atom nodes are more energy efficient for I/O bound workloads, thus making a case for heterogeneity. Unfortunately, existing Hadoop schedulers (i.e., the FIFO scheduler, the fair scheduler, and the capacity scheduler) do not consider heterogeneity when making scheduling decisions.
Therefore, in this work we fill this gap by proposing several scheduling heuristics that exploit heterogeneity for better energy efficiency. Our scheduling heuristics determine which tasks should run on the wimpy nodes in the cluster, and they address the nontrivial trade-off between power and performance: although running a particular task on an Atom node may result in lower power consumption, it may also result in longer job completion times, degrading performance and increasing the total energy consumption significantly. Toward this end, our heuristics characterize the energy efficiency of the nodes in the cluster using the records/joule and IOPS/watt metrics and use these metrics to make scheduling decisions; we describe in Section 5.1 how our schedulers collect these metrics.

While designing our heuristics we made two assumptions. First, we assumed that the characteristics of the workload are not known a priori, making our problem an online scheduling problem. Second, we assumed that the cluster is shared by multiple users, which is the case for production Hadoop deployments [22]; therefore our heuristics consider fairness an important concern. We describe our scheduling heuristics in turn.

Default Scheduler (Default): We use the default Hadoop scheduler (Section 2) as the baseline in our performance evaluation.

Energy Efficient Scheduler (EESched): This greedy heuristic schedules a task to the most energy efficient node for that task type, either map or reduce. We define the most energy efficient node as the node with free slots and the maximum records per joule for map tasks, or the maximum IOPS/watt for reduce tasks. Since map tasks mostly involve computation and reduce tasks mostly involve I/O, we believe these metrics characterize the energy efficiency well. The intuition behind EESched is that tasks are scheduled to the most energy efficient node in the cluster until this node is no longer the most energy efficient (i.e., its energy efficiency metrics deviate from their best values), so the heuristic does its best to operate the nodes close to their most energy efficient operating points. When EESched receives a heartbeat from a node it first schedules the map tasks and then the reduce tasks. After determining the most energy efficient node for a task, the heuristic checks whether the current heartbeat was received from that node, and if not, the heuristic does not schedule any tasks to this node. Then, EESched sorts the runnable jobs in the queue by their number of running tasks for fairness, and determines the number of tasks to schedule using the number of free map/reduce slots of the node that sent the current heartbeat. The heuristic then traverses the job queue and schedules the map tasks while taking data locality into consideration; if no node-local map task is found for a node then it considers rack locality, and failing that, off-rack locality. Note that EESched is a greedy heuristic that makes locally optimal decisions, and these decisions may be far from the global optimum; we leave global optimization as future work.

Energy Efficient Scheduler with Locality (EESched+Locality): We modified EESched for better locality for the map tasks. A task tracker TT is assigned a task t only if TT contains an input split of t and TT is the most energy efficient node for t, where the most energy efficient node is defined as before. With this heuristic a map task is guaranteed to execute on a node that contains an input split for it, yielding better locality than the other heuristics. Our motivation for evaluating this heuristic is to investigate whether better locality results in better energy efficiency, as we expect better locality to yield better job completion times.

Run Reduce Phase on Wimpy Nodes (RoW): As shown in the previous section, wimpy nodes are more energy efficient for I/O bound workloads. The intuition behind this heuristic is therefore that running the whole reduce phase, which is mostly I/O bound, on the wimpy nodes may result in better energy efficiency for the reduce phase, and for the whole workload. Note that RoW only modifies the scheduling of the reduce tasks; it uses Default for scheduling the map tasks.
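A hedged sketch of EESched's per-heartbeat decision, as just described, follows; all type and field names are ours, and the real Hadoop scheduler plumbing (job queues, slot bookkeeping, locality traversal) is reduced to comments:

```java
import java.util.Comparator;
import java.util.List;

// Sketch of EESched's core decision, not the paper's actual implementation.
final class EESchedSketch {
    static final class Node {
        String name;
        int freeMapSlots, freeReduceSlots;
        double recordsPerJoule;  // derived from measured package power
        double iopsPerWatt;
    }
    static final class Job { int runningTasks; /* task lists omitted */ }

    /** Most energy efficient node with a free slot for the given task type. */
    static Node bestNode(List<Node> nodes, boolean mapTask) {
        return nodes.stream()
                .filter(n -> mapTask ? n.freeMapSlots > 0 : n.freeReduceSlots > 0)
                .max(Comparator.comparingDouble(
                        (Node n) -> mapTask ? n.recordsPerJoule : n.iopsPerWatt))
                .orElse(null);
    }

    /** Per heartbeat: schedule only if the sender is currently the best node. */
    static boolean shouldSchedule(Node sender, List<Node> nodes, boolean mapTask) {
        Node best = bestNode(nodes, mapTask);
        return best != null && best == sender;
    }

    /** Fairness: jobs with fewer running tasks are served first. */
    static void sortForFairness(List<Job> runnable) {
        runnable.sort(Comparator.comparingInt(j -> j.runningTasks));
        // Map assignment would then prefer node-local, then rack-local,
        // then off-rack input splits, as described above.
    }
}
```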
5. PERFORMANCE EVALUATION

5.1 Experimental Setup

We evaluate our heuristics on a heterogeneous Hadoop cluster comprising twenty Intel Atom nodes and three Intel SNB nodes (see Section 3 for the processor specifications). The ratio of the number of Atom nodes to the number of SNB nodes is roughly 7:1, as described in Section 3. All nodes reside on the same rack and are connected by a single gigabit Ethernet switch. Before performing the experiments we did our best to optimize the performance of our cluster by running several experiments and carefully tuning the configuration parameters.

We developed the necessary tools to measure the package power on the Atom and SNB nodes. During the experiments our measurement tools run in the background on all nodes and export the package power measurements to be used by the task trackers. We performed our experiments with Hadoop 0.20.0, and we made several modifications to implement the heuristics described in Section 4. With our modifications, the task trackers collect the package powers using our measurement infrastructure and derive the energy efficiency metrics, records/joule and IOPS/watt, at fixed time intervals. To determine these metrics the task trackers estimate the holistic power using the power model described in Section 3 and the measured package powers. Finally, the task trackers send these metrics to the job tracker at every heartbeat, where they are used for scheduling decisions.

Since production Hadoop clusters are shared (Section 4), we use a mix of workloads instead of a single application to assess the heuristics under realistic scenarios. The workload mix comprises 25 jobs, and the job inter-arrival time follows an exponential distribution with a mean of 14 s [22]. Each job in the mix is randomly picked from the word count, sort, and nutch applications described in Section 3, so in the end the workload contains roughly the same number of jobs for each application. Each job in the mix has 15 GB of input data to process; in total the workload comprises around 4900 map tasks and 800 reduce tasks, and it takes roughly 2.5 hours with the Default Hadoop scheduler to process the complete workload.
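A hedged sketch of how such a workload mix could be generated (the names and the fixed seed are ours, and the actual job submission plumbing is omitted); exponential inter-arrival gaps are drawn by inverse-CDF sampling:

```java
import java.util.Random;

// Sketch of the Section 5.1 workload mix: 25 jobs, exponential inter-arrival
// times with a 14 s mean, each job drawn uniformly from the three applications.
public final class WorkloadMix {
    public static void main(String[] args) {
        String[] apps = {"wordcount", "sort", "nutch"};
        Random rng = new Random(42);           // fixed seed for repeatability
        double meanGapSeconds = 14.0, t = 0.0;
        for (int i = 0; i < 25; i++) {
            // Inverse-CDF sample of an exponential inter-arrival gap.
            t += -meanGapSeconds * Math.log(1.0 - rng.nextDouble());
            String app = apps[rng.nextInt(apps.length)];
            System.out.printf("t=%8.1fs submit %s (15 GB input)%n", t, app);
        }
    }
}
```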

5.2 Results

[Figure 3: Workload completion time (top) and the normalized energy consumption (bottom) for the various scheduling heuristics (Default, EESched, EESched+Locality, and RoW).]

We assess the workload completion time and present the results in Figure 3 (top). First, we observe that all heuristics improve the workload completion time compared to Default: EESched by 30%, EESched+Locality by 15%, and RoW by 5%. These results suggest that scheduling tasks with the energy efficiency of the nodes in mind (EESched) helps reduce the total workload completion time. However, when we favor better locality in the scheduling decisions (EESched+Locality), the improvement in completion time over Default is smaller than with EESched. This shows that the nodes that contain the input data for a map task are not necessarily the best performing nodes, as HDFS replicates data randomly. For example, with EESched+Locality a compute intensive map task may run on a wimpy node, which may increase the completion time of the job noticeably. Finally, running the whole reduce phase on the wimpy nodes (RoW) reduces the completion time slightly, due to the improvements in the completion times of the individual reduce tasks; we have already shown that Atom nodes perform better than SNB nodes for I/O bound workloads (Section 3).

In Figure 3 (bottom) we evaluate the energy efficiency of our heuristics and show their energy consumption normalized to the Default scheduler. Considering energy efficiency metrics during scheduling and assigning tasks to the most energy efficient node (EESched) pays off and reduces the total energy consumed by the workload execution significantly (by 27%). The reason is that EESched schedules each type of task (map or reduce) to the most energy efficient node for that particular task type and does its best to operate the nodes close to their most energy efficient operating points. Similar to the results for the workload completion time (Figure 3 (top)), when we favor locality (EESched+Locality) the improvement in energy consumption over Default (around 15%) is smaller than that of EESched; the nodes that contain the input splits for a job are not necessarily the most energy efficient nodes, due to the way HDFS replicates data. Therefore, it is interesting future work to investigate energy efficient replication strategies and how to couple them with our energy efficient scheduling heuristics. Finally, a very simple modification to the scheduler (RoW) improves the energy efficiency by roughly 2%. The reason for the lower energy savings with RoW compared to EESched is that RoW only considers energy efficiency for the reduce tasks, while EESched considers energy efficiency for both map and reduce tasks.

6. RELATED WORK

There has been an increasing amount of research on energy efficiency in data centers. In general, several studies have proposed solutions that include turning machines off [16, 13]. However, this solution raises concerns about the availability of replicated data, and the problem of which machines to turn off is harder for heterogeneous clusters. Similarly, putting servers into low power states instead of turning them off has also been investigated in previous work [17]. Recently the feasibility of using wimpy nodes was shown [1, 6]. However, these studies led to interesting discussions [14, 9] arguing that other concerns, such as the additional software/hardware costs introduced by using a large number of wimpy nodes and the requirements of latency sensitive workloads, should also be taken into account. Closest to our work, several studies have investigated energy efficient workload placement.
In [21] the authors present eDryad, which learns the workload characteristics and places jobs with complementary resource requirements on the same node, in the end demonstrating lower energy consumption than the default Dryad scheduler. In [15] the authors address load distribution across several data centers; they propose several policies, including an optimization approach, for managing the energy consumption and cost of Internet services while satisfying the service level agreements. Similarly, in [19] the authors propose a bin packing heuristic for energy aware workload consolidation that maximizes the sum of the Euclidean distances of the current placement to the optimal point at each server. Finally, in [20] the authors investigate the problem of power aware workload placement on heterogeneous virtual clusters. They propose and evaluate a bin packing heuristic that considers both power and migration cost constraints, and they demonstrate the efficacy of their solution with both theoretical and experimental analyses.

There are also several studies on improving the energy efficiency of HDFS. GreenHDFS [11] partitions the servers into cold and hot zones based on the popularity of the files, and saves energy by putting the servers in the cold zone into a low power mode. Similarly, in [16, 13] the authors propose a covering set strategy that replicates data blocks on a subset of the servers and powers down the remaining servers to save energy, and another strategy that uses all the nodes in the cluster to execute the workload and then powers down the entire cluster. These studies have shown dramatic energy improvements at the file system level and are orthogonal to the scheduling heuristics that we propose in this work.

Our work differs from the previous work on energy efficient scheduling in two ways. First, to the best of our knowledge none of the previous work exploits, in the scheduler, both the heterogeneity of MapReduce clusters comprising wimpy and brawny nodes and real-time power measurements to improve energy efficiency. Second, the closest previous work either investigates offline scheduling with complete workload information [19] or assumes that the same set of jobs is executed repeatedly in the cluster and exploits this fact to determine the workload requirements at runtime [21]. Our assumptions are more realistic: we address the online version of the energy efficient scheduling problem, where the characteristics of the workload are not known a priori, and we evaluate our heuristics with a workload mix that emulates production MapReduce workloads, since we assume the cluster is shared by multiple users.

7. DISCUSSION AND FUTURE WORK

In this work we investigated whether energy aware scheduling strategies that exploit the heterogeneity of MapReduce clusters comprising wimpy nodes, such as the Intel Atom processors, and brawny nodes, such as the Intel SNB processors, can improve energy efficiency. Toward this end, in Section 3 we first characterized the performance and energy efficiency of various MapReduce workloads on both Atom and SNB nodes and showed that wimpy nodes are more energy efficient for I/O bound workloads. Then in Section 5 we showed that it is possible to exploit heterogeneity and real-time power measurements in the scheduler, demonstrating up to 27% improvements in energy efficiency over the default Hadoop scheduler with simple scheduling heuristics.

Although our work and several previous studies [1, 6] demonstrate the feasibility of using wimpy nodes to improve energy efficiency, there are also other concerns that should be taken into account when deploying wimpy nodes in a data center [9, 14]. These concerns include system administration, hardware and software development costs, fault tolerance for clusters comprising a large number of wimpy nodes, and wimpy nodes not being suitable for latency sensitive workloads, where the software infrastructure has to be carefully optimized for the wimpy nodes to guarantee the Service Level Agreements (SLAs).

Our work raises the following research questions, which we plan to address in future work.

Degree of heterogeneity in a cluster: How will our scheduling heuristics perform in a larger cluster where the ratio of wimpy to brawny nodes is considerably different from 7:1? What will be the effect on performance and energy efficiency if more than two types of platforms are deployed in the cluster? Moreover, given a fixed power or cost budget, what is the optimal number of nodes of each type to deploy in a cluster such that the overall energy efficiency is maximized?

Impact of different mixes of workloads: How will our heuristics perform with different workloads, such as mixes where all tasks are I/O or CPU bound, or other mixes in between?

Impact of replication: Our results show that the heuristic favoring data locality (EESched+Locality) was less energy efficient than the other heuristics, probably because Hadoop does not consider the energy efficiency of the nodes during replication. What will be the impact of an energy efficient replication strategy on the resulting performance and energy consumption? Recent studies [11, 16, 13] have already proposed solutions at the HDFS level that result in better energy efficiency.

Straggler tasks: How can the scheduler detect straggler tasks in a heterogeneous cluster and handle them while still meeting the SLAs? Recent work [23] has already investigated techniques for detecting straggler tasks in heterogeneous clusters.

SLAs: Data processing frameworks such as Bigtable [5] and HBase are already being used for real-time data processing. Therefore, a common use case includes latency requirements for the jobs submitted by different users. How can we design a scheduler that improves the overall energy efficiency while fulfilling these SLAs?

8. REFERENCES

[1] D. G. Andersen, J. Franklin, M. Kaminsky, A. Phanishayee, L. Tan, and V. Vasudevan. FAWN: a fast array of wimpy nodes. In SOSP, pages 1-14, 2009.
[2] Apache Hadoop Project.
[3] Apache Nutch Project.
[4] C. Belady. In the data center, power and cooling costs more than the IT equipment it supports, February 2007.
[5] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber. Bigtable: a distributed storage system for structured data. In OSDI, pages 205-218, 2006.
[6] B.-G. Chun, G. Iannaccone, G. Iannaccone, R. Katz, G. Lee, and L. Niccolini. An energy case for hybrid datacenters. SIGOPS Oper. Syst. Rev., 44:76-80, March 2010.
[7] J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. Commun. ACM, 51:107-113, January 2008.
[8] U.S. EPA. Report to Congress on server and data center energy efficiency, August 2007. U.S. Environmental Protection Agency, Tech. Rep.
[9] U. Hölzle. Brawny cores still beat wimpy cores, most of the time. research.google.com/pubs/archive/36448.pdf.
[10] S. Huang, J. Huang, J. Dai, T. Xie, and B. Huang. The HiBench benchmark suite: Characterization of the MapReduce-based data analysis. In Intl. Conference on Data Engineering Workshops, pages 41-51, March 2010.
[11] R. T. Kaushik and M. Bhandarkar. GreenHDFS: towards an energy-conserving, storage-efficient, hybrid Hadoop compute cluster. In HotPower, pages 1-9, 2010.
[12] J. Koomey. Growth in data center electricity use 2005 to 2010, August 2011.
[13] W. Lang and J. M. Patel. Energy management for MapReduce clusters. Proc. VLDB Endow., 3, September 2010.
[14] W. Lang, J. M. Patel, and S. Shankar. Wimpy node clusters: what about non-wimpy workloads? In Intl. Workshop on Data Management on New Hardware, pages 47-55, 2010.
[15] K. Le, R. Bianchini, M. Martonosi, and T. D. Nguyen. Cost- and energy-aware load distribution across data centers. In HotPower, 2009.
[16] J. Leverich and C. Kozyrakis. On the energy (in)efficiency of Hadoop clusters. SIGOPS Oper. Syst. Rev., 44:61-65, March 2010.
[17] D. Meisner, B. T. Gold, and T. F. Wenisch. PowerNap: eliminating server idle power. SIGPLAN Not., 44:205-216, March 2009.
[18] S. Pelley, D. Meisner, T. Wenisch, and J. VanGilder. Understanding and abstracting total data center power. In Workshop on Energy Efficient Design, 2009.
[19] S. Srikantaiah, A. Kansal, and F. Zhao. Energy aware consolidation for cloud computing. In HotPower, 2008.
[20] A. Verma, P. Ahuja, and A. Neogi. pMapper: power and migration cost aware application placement in virtualized systems. In Middleware, 2008.
[21] W. Xiong and A. Kansal. Energy efficient data intensive distributed computing. IEEE Data Eng. Bull., 34(1):24-33, 2011.
[22] M. Zaharia, D. Borthakur, J. Sen Sarma, K. Elmeleegy, S. Shenker, and I. Stoica. Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling. In EuroSys, 2010.
[23] M. Zaharia, A. Konwinski, A. D. Joseph, R. Katz, and I. Stoica. Improving MapReduce performance in heterogeneous environments. In OSDI, pages 29-42, 2008.

A Hybrid Scheduling Approach for Scalable Heterogeneous Hadoop Systems

A Hybrid Scheduling Approach for Scalable Heterogeneous Hadoop Systems A Hybrid Scheduling Approach for Scalable Heterogeneous Hadoop Systems Aysan Rasooli Department of Computing and Software McMaster University Hamilton, Canada Email: rasooa@mcmaster.ca Douglas G. Down

More information

Energy Efficient MapReduce

Energy Efficient MapReduce Energy Efficient MapReduce Motivation: Energy consumption is an important aspect of datacenters efficiency, the total power consumption in the united states has doubled from 2000 to 2005, representing

More information

Network-Aware Scheduling of MapReduce Framework on Distributed Clusters over High Speed Networks

Network-Aware Scheduling of MapReduce Framework on Distributed Clusters over High Speed Networks Network-Aware Scheduling of MapReduce Framework on Distributed Clusters over High Speed Networks Praveenkumar Kondikoppa, Chui-Hui Chiu, Cheng Cui, Lin Xue and Seung-Jong Park Department of Computer Science,

More information

CSE-E5430 Scalable Cloud Computing Lecture 2

CSE-E5430 Scalable Cloud Computing Lecture 2 CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 14.9-2015 1/36 Google MapReduce A scalable batch processing

More information

IMPROVED FAIR SCHEDULING ALGORITHM FOR TASKTRACKER IN HADOOP MAP-REDUCE

IMPROVED FAIR SCHEDULING ALGORITHM FOR TASKTRACKER IN HADOOP MAP-REDUCE IMPROVED FAIR SCHEDULING ALGORITHM FOR TASKTRACKER IN HADOOP MAP-REDUCE Mr. Santhosh S 1, Mr. Hemanth Kumar G 2 1 PG Scholor, 2 Asst. Professor, Dept. Of Computer Science & Engg, NMAMIT, (India) ABSTRACT

More information

A Study on Workload Imbalance Issues in Data Intensive Distributed Computing

A Study on Workload Imbalance Issues in Data Intensive Distributed Computing A Study on Workload Imbalance Issues in Data Intensive Distributed Computing Sven Groot 1, Kazuo Goda 1, and Masaru Kitsuregawa 1 University of Tokyo, 4-6-1 Komaba, Meguro-ku, Tokyo 153-8505, Japan Abstract.

More information

Research on Job Scheduling Algorithm in Hadoop

Research on Job Scheduling Algorithm in Hadoop Journal of Computational Information Systems 7: 6 () 5769-5775 Available at http://www.jofcis.com Research on Job Scheduling Algorithm in Hadoop Yang XIA, Lei WANG, Qiang ZHAO, Gongxuan ZHANG School of

More information

CURTAIL THE EXPENDITURE OF BIG DATA PROCESSING USING MIXED INTEGER NON-LINEAR PROGRAMMING

CURTAIL THE EXPENDITURE OF BIG DATA PROCESSING USING MIXED INTEGER NON-LINEAR PROGRAMMING Journal homepage: http://www.journalijar.com INTERNATIONAL JOURNAL OF ADVANCED RESEARCH RESEARCH ARTICLE CURTAIL THE EXPENDITURE OF BIG DATA PROCESSING USING MIXED INTEGER NON-LINEAR PROGRAMMING R.Kohila

More information

Matchmaking: A New MapReduce Scheduling Technique

Matchmaking: A New MapReduce Scheduling Technique Matchmaking: A New MapReduce Scheduling Technique Chen He Ying Lu David Swanson Department of Computer Science and Engineering University of Nebraska-Lincoln Lincoln, U.S. {che,ylu,dswanson}@cse.unl.edu

More information

Fault Tolerance in Hadoop for Work Migration

Fault Tolerance in Hadoop for Work Migration 1 Fault Tolerance in Hadoop for Work Migration Shivaraman Janakiraman Indiana University Bloomington ABSTRACT Hadoop is a framework that runs applications on large clusters which are built on numerous

More information

Guidelines for Selecting Hadoop Schedulers based on System Heterogeneity

Guidelines for Selecting Hadoop Schedulers based on System Heterogeneity Noname manuscript No. (will be inserted by the editor) Guidelines for Selecting Hadoop Schedulers based on System Heterogeneity Aysan Rasooli Douglas G. Down Received: date / Accepted: date Abstract Hadoop

More information

Survey on Scheduling Algorithm in MapReduce Framework

Survey on Scheduling Algorithm in MapReduce Framework Survey on Scheduling Algorithm in MapReduce Framework Pravin P. Nimbalkar 1, Devendra P.Gadekar 2 1,2 Department of Computer Engineering, JSPM s Imperial College of Engineering and Research, Pune, India

More information

Future Prospects of Scalable Cloud Computing

Future Prospects of Scalable Cloud Computing Future Prospects of Scalable Cloud Computing Keijo Heljanko Department of Information and Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 7.3-2012 1/17 Future Cloud Topics Beyond

More information

Analysis and Modeling of MapReduce s Performance on Hadoop YARN

Analysis and Modeling of MapReduce s Performance on Hadoop YARN Analysis and Modeling of MapReduce s Performance on Hadoop YARN Qiuyi Tang Dept. of Mathematics and Computer Science Denison University tang_j3@denison.edu Dr. Thomas C. Bressoud Dept. of Mathematics and

More information

Improving Job Scheduling in Hadoop

Improving Job Scheduling in Hadoop Improving Job Scheduling in Hadoop MapReduce Himangi G. Patel, Richard Sonaliya Computer Engineering, Silver Oak College of Engineering and Technology, Ahmedabad, Gujarat, India. Abstract Hadoop is a framework

More information

Towards a Resource Aware Scheduler in Hadoop

Towards a Resource Aware Scheduler in Hadoop Towards a Resource Aware Scheduler in Hadoop Mark Yong, Nitin Garegrat, Shiwali Mohan Computer Science and Engineering, University of Michigan, Ann Arbor December 21, 2009 Abstract Hadoop-MapReduce is

More information

Performance and Energy Efficiency of. Hadoop deployment models

Performance and Energy Efficiency of. Hadoop deployment models Performance and Energy Efficiency of Hadoop deployment models Contents Review: What is MapReduce Review: What is Hadoop Hadoop Deployment Models Metrics Experiment Results Summary MapReduce Introduced

More information

Keywords: Big Data, HDFS, Map Reduce, Hadoop

Keywords: Big Data, HDFS, Map Reduce, Hadoop Volume 5, Issue 7, July 2015 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Configuration Tuning

More information

FAWN - a Fast Array of Wimpy Nodes

FAWN - a Fast Array of Wimpy Nodes University of Warsaw January 12, 2011 Outline Introduction 1 Introduction 2 3 4 5 Key issues Introduction Growing CPU vs. I/O gap Contemporary systems must serve millions of users Electricity consumed

More information

Locality and Network-Aware Reduce Task Scheduling for Data-Intensive Applications

Locality and Network-Aware Reduce Task Scheduling for Data-Intensive Applications Locality and Network-Aware Reduce Task Scheduling for Data-Intensive Applications Engin Arslan University at Buffalo (SUNY) enginars@buffalo.edu Mrigank Shekhar Tevfik Kosar Intel Corporation University

More information

Chapter 7. Using Hadoop Cluster and MapReduce

Chapter 7. Using Hadoop Cluster and MapReduce Chapter 7 Using Hadoop Cluster and MapReduce Modeling and Prototyping of RMS for QoS Oriented Grid Page 152 7. Using Hadoop Cluster and MapReduce for Big Data Problems The size of the databases used in

More information

GraySort and MinuteSort at Yahoo on Hadoop 0.23

GraySort and MinuteSort at Yahoo on Hadoop 0.23 GraySort and at Yahoo on Hadoop.23 Thomas Graves Yahoo! May, 213 The Apache Hadoop[1] software library is an open source framework that allows for the distributed processing of large data sets across clusters

More information

Efficient Data Replication Scheme based on Hadoop Distributed File System

Efficient Data Replication Scheme based on Hadoop Distributed File System , pp. 177-186 http://dx.doi.org/10.14257/ijseia.2015.9.12.16 Efficient Data Replication Scheme based on Hadoop Distributed File System Jungha Lee 1, Jaehwa Chung 2 and Daewon Lee 3* 1 Division of Supercomputing,

More information

A Framework for Performance Analysis and Tuning in Hadoop Based Clusters

A Framework for Performance Analysis and Tuning in Hadoop Based Clusters A Framework for Performance Analysis and Tuning in Hadoop Based Clusters Garvit Bansal Anshul Gupta Utkarsh Pyne LNMIIT, Jaipur, India Email: [garvit.bansal anshul.gupta utkarsh.pyne] @lnmiit.ac.in Manish

More information

Big Data Analysis and Its Scheduling Policy Hadoop

Big Data Analysis and Its Scheduling Policy Hadoop IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661,p-ISSN: 2278-8727, Volume 17, Issue 1, Ver. IV (Jan Feb. 2015), PP 36-40 www.iosrjournals.org Big Data Analysis and Its Scheduling Policy

More information

Benchmarking Hadoop & HBase on Violin

Benchmarking Hadoop & HBase on Violin Technical White Paper Report Technical Report Benchmarking Hadoop & HBase on Violin Harnessing Big Data Analytics at the Speed of Memory Version 1.0 Abstract The purpose of benchmarking is to show advantages

More information

GraySort on Apache Spark by Databricks

GraySort on Apache Spark by Databricks GraySort on Apache Spark by Databricks Reynold Xin, Parviz Deyhim, Ali Ghodsi, Xiangrui Meng, Matei Zaharia Databricks Inc. Apache Spark Sorting in Spark Overview Sorting Within a Partition Range Partitioner

More information

INFO5011. Cloud Computing Semester 2, 2011 Lecture 11, Cloud Scheduling

INFO5011. Cloud Computing Semester 2, 2011 Lecture 11, Cloud Scheduling INFO5011 Cloud Computing Semester 2, 2011 Lecture 11, Cloud Scheduling COMMONWEALTH OF Copyright Regulations 1969 WARNING This material has been reproduced and communicated to you by or on behalf of the

More information

Scheduling Data Intensive Workloads through Virtualization on MapReduce based Clouds

Scheduling Data Intensive Workloads through Virtualization on MapReduce based Clouds ABSTRACT Scheduling Data Intensive Workloads through Virtualization on MapReduce based Clouds 1 B.Thirumala Rao, 2 L.S.S.Reddy Department of Computer Science and Engineering, Lakireddy Bali Reddy College

More information

Lifetime Management of Cache Memory using Hadoop Snehal Deshmukh 1 Computer, PGMCOE, Wagholi, Pune, India

Lifetime Management of Cache Memory using Hadoop Snehal Deshmukh 1 Computer, PGMCOE, Wagholi, Pune, India Volume 3, Issue 1, January 2015 International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online at: www.ijarcsms.com ISSN:

More information

CS2510 Computer Operating Systems

CS2510 Computer Operating Systems CS2510 Computer Operating Systems HADOOP Distributed File System Dr. Taieb Znati Computer Science Department University of Pittsburgh Outline HDF Design Issues HDFS Application Profile Block Abstraction

More information

CS2510 Computer Operating Systems

CS2510 Computer Operating Systems CS2510 Computer Operating Systems HADOOP Distributed File System Dr. Taieb Znati Computer Science Department University of Pittsburgh Outline HDF Design Issues HDFS Application Profile Block Abstraction

More information

Job Scheduling for MapReduce

Job Scheduling for MapReduce UC Berkeley Job Scheduling for MapReduce Matei Zaharia, Dhruba Borthakur *, Joydeep Sen Sarma *, Scott Shenker, Ion Stoica RAD Lab, * Facebook Inc 1 Motivation Hadoop was designed for large batch jobs

More information

Heterogeneous Workload Consolidation for Efficient Management of Data Centers in Cloud Computing

Heterogeneous Workload Consolidation for Efficient Management of Data Centers in Cloud Computing Heterogeneous Workload Consolidation for Efficient Management of Data Centers in Cloud Computing Deep Mann ME (Software Engineering) Computer Science and Engineering Department Thapar University Patiala-147004

More information

Affinity Aware VM Colocation Mechanism for Cloud

Affinity Aware VM Colocation Mechanism for Cloud Affinity Aware VM Colocation Mechanism for Cloud Nilesh Pachorkar 1* and Rajesh Ingle 2 Received: 24-December-2014; Revised: 12-January-2015; Accepted: 12-January-2015 2014 ACCENTS Abstract The most of

More information

Adaptive Task Scheduling for Multi Job MapReduce

Adaptive Task Scheduling for Multi Job MapReduce Adaptive Task Scheduling for MultiJob MapReduce Environments Jordà Polo, David de Nadal, David Carrera, Yolanda Becerra, Vicenç Beltran, Jordi Torres and Eduard Ayguadé Barcelona Supercomputing Center

More information

FP-Hadoop: Efficient Execution of Parallel Jobs Over Skewed Data

FP-Hadoop: Efficient Execution of Parallel Jobs Over Skewed Data FP-Hadoop: Efficient Execution of Parallel Jobs Over Skewed Data Miguel Liroz-Gistau, Reza Akbarinia, Patrick Valduriez To cite this version: Miguel Liroz-Gistau, Reza Akbarinia, Patrick Valduriez. FP-Hadoop:

More information

Apache Hadoop. Alexandru Costan

Apache Hadoop. Alexandru Costan 1 Apache Hadoop Alexandru Costan Big Data Landscape No one-size-fits-all solution: SQL, NoSQL, MapReduce, No standard, except Hadoop 2 Outline What is Hadoop? Who uses it? Architecture HDFS MapReduce Open

More information

Evaluating HDFS I/O Performance on Virtualized Systems

Evaluating HDFS I/O Performance on Virtualized Systems Evaluating HDFS I/O Performance on Virtualized Systems Xin Tang xtang@cs.wisc.edu University of Wisconsin-Madison Department of Computer Sciences Abstract Hadoop as a Service (HaaS) has received increasing

More information

Improving MapReduce Performance in Heterogeneous Environments

Improving MapReduce Performance in Heterogeneous Environments UC Berkeley Improving MapReduce Performance in Heterogeneous Environments Matei Zaharia, Andy Konwinski, Anthony Joseph, Randy Katz, Ion Stoica University of California at Berkeley Motivation 1. MapReduce

More information

Delay Scheduling. A Simple Technique for Achieving Locality and Fairness in Cluster Scheduling

Delay Scheduling. A Simple Technique for Achieving Locality and Fairness in Cluster Scheduling Delay Scheduling A Simple Technique for Achieving Locality and Fairness in Cluster Scheduling Matei Zaharia, Dhruba Borthakur *, Joydeep Sen Sarma *, Khaled Elmeleegy +, Scott Shenker, Ion Stoica UC Berkeley,

More information

The Performance Characteristics of MapReduce Applications on Scalable Clusters

The Performance Characteristics of MapReduce Applications on Scalable Clusters The Performance Characteristics of MapReduce Applications on Scalable Clusters Kenneth Wottrich Denison University Granville, OH 43023 wottri_k1@denison.edu ABSTRACT Many cluster owners and operators have

More information

DIABLO TECHNOLOGIES MEMORY CHANNEL STORAGE AND VMWARE VIRTUAL SAN : VDI ACCELERATION

DIABLO TECHNOLOGIES MEMORY CHANNEL STORAGE AND VMWARE VIRTUAL SAN : VDI ACCELERATION DIABLO TECHNOLOGIES MEMORY CHANNEL STORAGE AND VMWARE VIRTUAL SAN : VDI ACCELERATION A DIABLO WHITE PAPER AUGUST 2014 Ricky Trigalo Director of Business Development Virtualization, Diablo Technologies

More information

Hadoop: Embracing future hardware

Hadoop: Embracing future hardware Hadoop: Embracing future hardware Suresh Srinivas @suresh_m_s Page 1 About Me Architect & Founder at Hortonworks Long time Apache Hadoop committer and PMC member Designed and developed many key Hadoop

More information

Energy Aware Consolidation for Cloud Computing

Energy Aware Consolidation for Cloud Computing Abstract Energy Aware Consolidation for Cloud Computing Shekhar Srikantaiah Pennsylvania State University Consolidation of applications in cloud computing environments presents a significant opportunity

More information

Performance Modeling of MapReduce Jobs in Heterogeneous Cloud Environments Zhuoyao Zhang University of Pennsylvania zhuoyao@seas.upenn.

Performance Modeling of MapReduce Jobs in Heterogeneous Cloud Environments Zhuoyao Zhang University of Pennsylvania zhuoyao@seas.upenn. Performance Modeling of MapReduce Jobs in Heterogeneous Cloud Environments Zhuoyao Zhang University of Pennsylvania zhuoyao@seas.upenn.edu Ludmila Cherkasova Hewlett-Packard Labs lucy.cherkasova@hp.com

More information

Mesos: A Platform for Fine- Grained Resource Sharing in Data Centers (II)

Mesos: A Platform for Fine- Grained Resource Sharing in Data Centers (II) UC BERKELEY Mesos: A Platform for Fine- Grained Resource Sharing in Data Centers (II) Anthony D. Joseph LASER Summer School September 2013 My Talks at LASER 2013 1. AMP Lab introduction 2. The Datacenter

More information

The Improved Job Scheduling Algorithm of Hadoop Platform

The Improved Job Scheduling Algorithm of Hadoop Platform The Improved Job Scheduling Algorithm of Hadoop Platform Yingjie Guo a, Linzhi Wu b, Wei Yu c, Bin Wu d, Xiaotian Wang e a,b,c,d,e University of Chinese Academy of Sciences 100408, China b Email: wulinzhi1001@163.com

More information

Enhancing Dataset Processing in Hadoop YARN Performance for Big Data Applications

Enhancing Dataset Processing in Hadoop YARN Performance for Big Data Applications Enhancing Dataset Processing in Hadoop YARN Performance for Big Data Applications Ahmed Abdulhakim Al-Absi, Dae-Ki Kang and Myong-Jong Kim Abstract In Hadoop MapReduce distributed file system, as the input

More information

An Adaptive Scheduling Algorithm for Dynamic Heterogeneous Hadoop Systems

An Adaptive Scheduling Algorithm for Dynamic Heterogeneous Hadoop Systems An Adaptive Scheduling Algorithm for Dynamic Heterogeneous Hadoop Systems Aysan Rasooli, Douglas G. Down Department of Computing and Software McMaster University {rasooa, downd}@mcmaster.ca Abstract The

More information

Introduction to Big Data! with Apache Spark" UC#BERKELEY#

Introduction to Big Data! with Apache Spark UC#BERKELEY# Introduction to Big Data! with Apache Spark" UC#BERKELEY# This Lecture" The Big Data Problem" Hardware for Big Data" Distributing Work" Handling Failures and Slow Machines" Map Reduce and Complex Jobs"

More information

Energy aware RAID Configuration for Large Storage Systems

Energy aware RAID Configuration for Large Storage Systems Energy aware RAID Configuration for Large Storage Systems Norifumi Nishikawa norifumi@tkl.iis.u-tokyo.ac.jp Miyuki Nakano miyuki@tkl.iis.u-tokyo.ac.jp Masaru Kitsuregawa kitsure@tkl.iis.u-tokyo.ac.jp Abstract

More information

Analysis and Optimization of Massive Data Processing on High Performance Computing Architecture

Analysis and Optimization of Massive Data Processing on High Performance Computing Architecture Analysis and Optimization of Massive Data Processing on High Performance Computing Architecture He Huang, Shanshan Li, Xiaodong Yi, Feng Zhang, Xiangke Liao and Pan Dong School of Computer Science National

More information

Scalable Cloud Computing Solutions for Next Generation Sequencing Data

Scalable Cloud Computing Solutions for Next Generation Sequencing Data Scalable Cloud Computing Solutions for Next Generation Sequencing Data Matti Niemenmaa 1, Aleksi Kallio 2, André Schumacher 1, Petri Klemelä 2, Eija Korpelainen 2, and Keijo Heljanko 1 1 Department of

More information

Cisco UCS and Fusion- io take Big Data workloads to extreme performance in a small footprint: A case study with Oracle NoSQL database

Cisco UCS and Fusion- io take Big Data workloads to extreme performance in a small footprint: A case study with Oracle NoSQL database Cisco UCS and Fusion- io take Big Data workloads to extreme performance in a small footprint: A case study with Oracle NoSQL database Built up on Cisco s big data common platform architecture (CPA), a

More information

Deploying Flash- Accelerated Hadoop with InfiniFlash from SanDisk

Deploying Flash- Accelerated Hadoop with InfiniFlash from SanDisk WHITE PAPER Deploying Flash- Accelerated Hadoop with InfiniFlash from SanDisk 951 SanDisk Drive, Milpitas, CA 95035 2015 SanDisk Corporation. All rights reserved. www.sandisk.com Table of Contents Introduction

More information

Non-intrusive Slot Layering in Hadoop

Non-intrusive Slot Layering in Hadoop 213 13th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing Non-intrusive Layering in Hadoop Peng Lu, Young Choon Lee, Albert Y. Zomaya Center for Distributed and High Performance Computing,

More information

Scheduling using Optimization Decomposition in Wireless Network with Time Performance Analysis

Scheduling using Optimization Decomposition in Wireless Network with Time Performance Analysis Scheduling using Optimization Decomposition in Wireless Network with Time Performance Analysis Aparna.C 1, Kavitha.V.kakade 2 M.E Student, Department of Computer Science and Engineering, Sri Shakthi Institute

More information

A Case for Flash Memory SSD in Hadoop Applications
Big Data and Apache Hadoop's MapReduce
A Real Time Memory Slot Utilization Design for MapReduce Memory Clusters
Mobile Cloud Computing for Data-Intensive Applications
Energy-Aware Scheduling of MapReduce Jobs for Big Data Applications (IEEE Transactions on Parallel and Distributed Systems, vol. 25, 2014)
Task Scheduling in Hadoop
Cloud Based Dynamic Workload Management
Residual Traffic Based Task Scheduling in Hadoop
Survey on Improved AutoScaling in Hadoop into Cloud Environments
Energy-Saving Cloud Computing Platform Based on Micro-Embedded System
Study on Replica Management and High Availability in Hadoop Distributed File System (HDFS)
Shareability and Locality Aware Scheduling Algorithm in Hadoop for Mobile Cloud Computing
Accelerating Enterprise Applications and Reducing TCO with SanDisk ZetaScale Software
A Survey of Cloud Computing (Guanfeng, 2010)
Understanding Data Locality in VMware Virtual SAN
Can High-Performance Interconnects Benefit Memcached and Hadoop?
Dell Reference Configuration for Hortonworks Data Platform
The IntelliMagic White Paper: Storage Performance Analysis for an IBM Storwize V7000
Joint Optimization of Overlapping Phases in MapReduce
Energy Constrained Resource Scheduling for Cloud Environment
Evaluating Task Scheduling in Hadoop-based Cloud Systems
Exploiting Cloud Heterogeneity for Optimized Cost/Performance MapReduce Processing
High Performance Connectors for Load and Access of Data from Hadoop to Oracle Database (An Oracle White Paper, June 2012)
From GWS to MapReduce: Google's Cloud Technology in the Early Days
Maximizing Hadoop Performance and Storage Capacity with AltraHD™
Cloud Computing: The Cloud as a Pool of Shared Hardware and Software Resources
HiBench Introduction (Carson Wang)
Solving I/O Bottlenecks to Enable Superior Cloud Efficiency
Implementation Issues of A Cloud Computing Platform
Elasticsearch on Cisco Unified Computing System: Optimizing your UCS infrastructure for Elasticsearch's analytics software stack
Benchmarking Cloud Databases: Case Study on HBase, Hadoop and Cassandra Using YCSB
Hadoop Scheduler with Deadline Constraint
Cost Minimization of Running MapReduce Across Geographically Distributed Data Centers
A Performance Analysis of Distributed Indexing using Terrier
Federated Big Data for resource aggregation and load balancing with DIRAC (Procedia Computer Science, vol. 51, ICCS 2015)
A Study on Hadoop Architecture for Big Data Analytics
Distributed File System (N. Tonellotto, Complements of Distributed Enabling Platforms)
Oracle's Big Data Solutions (Roger Wullschleger)
Multi-Datacenter Replication: A Technical Overview & Use Cases (Basho)
Architecting for the Next Generation of Big Data: Hortonworks HDP 2.0 on Red Hat Enterprise Linux 6 with OpenJDK 7