Diagnosing Heterogeneous Hadoop Clusters
Diagnosing Heterogeneous Hadoop Clusters

Shekhar Gupta 1, Christian Fritz 1, Johan de Kleer 1, and Cees Witteveen 2
1 Palo Alto Research Center, Palo Alto, CA 94304, USA
{ShekharGupta, cfritz, dekleer}@parc.com
2 Delft University of Technology, Delft, The Netherlands
C.Witteveen@tudelft.nl

ABSTRACT
We present a data-driven approach for diagnosing performance issues in heterogeneous Hadoop clusters. Hadoop is a popular and extremely successful framework for horizontally scalable distributed computing over large data sets, based on the MapReduce framework. In its current implementation, Hadoop assumes a homogeneous cluster of compute nodes. This assumption manifests in Hadoop's scheduling algorithms, but is also crucial to existing approaches for diagnosing performance issues, which rely on the peer similarity between nodes. It is desirable to enable efficient use of Hadoop on heterogeneous clusters as well as on virtual/cloud infrastructure, both of which violate the peer-similarity assumption. To this end, we have implemented, and here present preliminary results of, an approach for automatically diagnosing the health of nodes in the cluster as well as the resource requirements of incoming MapReduce jobs. We show that the approach can be used to identify abnormally performing cluster nodes and to diagnose the kind of fault occurring on the node in terms of the system resource affected by the fault (e.g., CPU contention, disk I/O contention). We also describe our future plans for using this approach to increase the efficiency of Hadoop on heterogeneous and virtual clusters, with or without faults.

1 INTRODUCTION
Hadoop is a popular and extremely successful framework for horizontally scalable distributed processing of large data sets. It is an open-source implementation of the Google file system (Ghemawat et al., 2003), called HDFS in Hadoop, and of Google's MapReduce framework (Dean and Ghemawat, 2008). HDFS is able to store tremendously large files across several machines, and using MapReduce, such files can be processed in a distributed fashion, moving the computation to the data rather than the other way round. An increasing number of so-called big data applications, including social network analysis, genome sequencing, and fraud detection in financial transaction data, require horizontally scalable solutions and have demonstrated the limits of existing relational databases and SQL querying approaches. Even though the value of Hadoop is widely acknowledged, the technology itself is still in its infancy. One of the problems that remains unsolved in the general case is the diagnosis of faults and performance issues in the cluster. Another shortcoming is Hadoop's implicit assumption of a homogeneous cluster, i.e., the assumption that all servers in the cluster have the same hardware and software configuration. If this assumption is violated, performance can be extremely poor. Therefore, in practice, Hadoop requires a homogeneous cluster. Given the vast number of existing servers with varying specifications in many organizations, it would be desirable to enable running Hadoop on such heterogeneous clusters as well. In this paper we present work in progress towards solving these two problems, with particular emphasis on the diagnosis. We will describe in the future work section how we believe the same insights gained during diagnosis can be used to increase the overall productivity of Hadoop on heterogeneous clusters, even in the absence of faults.

In its current version, Hadoop possesses only rudimentary monitoring capabilities.
The primary mechanism for detecting faults is to ping each compute node at a regular frequency. A node is considered dead if it fails to reply to the ping within a certain time period. While this mechanism can be used to detect certain kinds of hard faults present in the cluster, performance issues that do not cause a node to go down but only slow it down, and that may be intermittent, remain undetected. In order to deal with hard faults that do not affect nodes at the level of the operating system, where pings are answered, but happen in user space (e.g., out-of-memory exceptions), Hadoop kills exceptionally long running tasks and reschedules them on a different node. This is problematic in particular on heterogeneous clusters, where the required time to complete a task can vary strongly between different machines.
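To make the ping-and-timeout mechanism concrete, the following minimal sketch marks a node dead once its last heartbeat is older than a fixed expiry interval. It is not Hadoop's actual implementation; the node names and the timeout value are illustrative assumptions.

```python
import time

HEARTBEAT_EXPIRY_SECONDS = 600  # illustrative timeout, not Hadoop's actual setting

class HeartbeatMonitor:
    """Tracks the last heartbeat time of each node and flags expired ones."""

    def __init__(self):
        self.last_heartbeat = {}  # node name -> timestamp of last ping reply

    def record_heartbeat(self, node, timestamp=None):
        self.last_heartbeat[node] = timestamp if timestamp is not None else time.time()

    def dead_nodes(self, now=None):
        now = now if now is not None else time.time()
        return [node for node, ts in self.last_heartbeat.items()
                if now - ts > HEARTBEAT_EXPIRY_SECONDS]

# Example: node2 stops replying and is eventually declared dead.
monitor = HeartbeatMonitor()
monitor.record_heartbeat("node1", timestamp=1000.0)
monitor.record_heartbeat("node2", timestamp=400.0)
print(monitor.dead_nodes(now=1100.0))  # ['node2'] -- no heartbeat for 700s > 600s
```

The point of the sketch is only that such a hard timeout catches nodes that stop responding entirely, but says nothing about nodes that still answer pings while running slowly.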
Recently, Tan et al. (2010b) proposed a diagnosis approach that uses a simple peer similarity model to identify faulty nodes in a Hadoop cluster. The idea underlying their approach is that the same task should take approximately the same amount of time on each node in the cluster. By building statistical models of task completion times over the nodes of the cluster, the authors are able to identify outliers with fairly high accuracy. However, the peer similarity model again makes the assumption that the cluster is homogeneous and is not applicable to heterogeneous clusters, as we will show in Section 2.2.

We propose a data-driven diagnosis approach for detecting performance problems that is applicable to heterogeneous Hadoop clusters. Our approach constructs a behavioral model of every node for different classes of jobs, with one class for each dimension in the configuration space of the nodes, e.g., CPU speed, disk I/O speed, amount of RAM. The intuitive role of these base jobs is that their completion time will largely depend on the availability of the respective resource. The behavioral model based on these jobs can then be used for diagnosis. The model is used to determine how long a certain new job should take on a given node that is to be diagnosed, given the class of the new job and the time it took to run on a different node. In this paper, we present preliminary results that only consider problems that cause CPU or disk I/O contention and hence lead to degraded performance of jobs that make use of these resources. In future work, we plan to integrate this diagnostic information into the Hadoop scheduler to increase the overall productivity of the cluster in the face of faults as well as heterogeneity among nodes. The diagnostic information will allow the scheduler to make better predictions of task durations on specific nodes and hence create tighter schedules.

In the next section we review Hadoop in more detail and summarize relevant related work. We then define the problem and describe our approach, including empirical results. Before we conclude, we outline our plans for future work.

2 BACKGROUND AND RELATED WORK

2.1 Hadoop
Hadoop is an open-source platform for distributed computing that currently is the de-facto standard for storing and analyzing very large amounts of data. Hadoop comprises a storage solution called HDFS and a framework for the distributed execution of computational tasks called MapReduce. Figure 1 depicts how Hadoop stores and processes data. A Hadoop cluster consists of one NameNode and many DataNodes (tens to thousands). When a data file is copied into the system, it is divided up into blocks of 64MB. Each block is stored on three or more DataNodes, depending on the replication policy of the deployed Hadoop cluster (Figure 1(a)). Once the data is loaded, computational jobs can be executed over this data. New jobs are submitted to the NameNode (Figure 1(b)). The NameNode will schedule map and reduce tasks onto the DataNodes. A map task processes one block and generates a result for this block, which gets written back to HDFS. Hadoop will schedule one map task for each block of the data, and it will do so, generally speaking, by selecting one of the three DataNodes that is storing a copy of that block, to avoid moving large amounts of data over the network. A reduce task takes all these intermediate results and combines them into one final result that forms the output of the computation.

Figure 1: HDFS stores files in blocks of 64MB and replicates these blocks on three cluster nodes each. For a new job, MapReduce then processes ("maps") each block locally first, and then reduces all these partial results in a central location on the cluster to generate the end result.

In order to take advantage of Hadoop, a programmer has to write his program in terms of a Map and a Reduce function. These functions are agnostic of the block structure and the distributed nature of the execution; they typically operate at the level of a record or a line of text: some small unit of which the data is comprised. For each such unit, the Map function typically only performs rather basic operations. The canonical example of a MapReduce program is WordCount, a program that counts the number of occurrences of each word in a large corpus of text. The Map function of WordCount tokenizes one block of the text and counts words locally, line by line. The Reduce function then takes these local counts and sums them up to get the global result. (For parsimony, we omit certain details in this example to the extent that they are irrelevant for this paper.)
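As an illustration of the programming model, here is a minimal, self-contained WordCount in the map/reduce style. It is plain Python rather than the Hadoop Java API, and the in-memory list of intermediate pairs stands in for the shuffle that Hadoop performs across the cluster.

```python
from collections import defaultdict

def map_phase(block):
    """Map: emit (word, 1) for every word in one block of text."""
    for line in block.splitlines():
        for word in line.split():
            yield word.lower(), 1

def reduce_phase(pairs):
    """Reduce: sum the counts emitted for each word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

# Two small "blocks" stand in for HDFS blocks processed by different map tasks.
blocks = ["the quick brown fox", "the lazy dog and the fox"]
intermediate = [pair for block in blocks for pair in map_phase(block)]
print(reduce_phase(intermediate))  # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, ...}
```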
Since generally the Reduce phase can only start once the Map phase has been completed, the overall job completion time is typically determined by the time it takes the slowest DataNode to finish its assigned map tasks. Hadoop, as of the current version 0.22, does not take any performance differences between the DataNodes into account during the scheduling phase, but assumes a homogeneous cluster, i.e., servers that are equally fast in terms of CPU, disk I/O, RAM, network bandwidth, etc., the key contributors to task completion time. While Hadoop is particularly good for certain kinds of text analytics, in practice various different kinds of jobs execute on a Hadoop cluster. For the purpose of this paper we characterize jobs in terms of the types of resources they make use of most heavily. Many jobs are CPU intense, while others read from and write to disk a lot, or need to move larger amounts of data over the network. More specifically, we will generally characterize the map tasks of a job in this way and ignore the reduce tasks for the most part.

2.2 Related Work
Recently, Tan et al. (2010b) presented Kahuna, a diagnosis approach that uses a simple peer similarity model to identify faulty nodes in a Hadoop cluster. Roughly, the idea underlying their approach is that the same task should take approximately the same amount of time on each node in the cluster. More precisely, the authors build histograms of the time each task of a job takes on each machine. A node is identified as faulty when its histogram deviates from those of the other nodes. The authors show that this approach can detect slowdowns caused by various kinds of issues, including CPU hogging, disk I/O hogging, and, to a limited degree, network packet loss.
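The following sketch illustrates the flavor of such a peer-similarity check. It is a simplification for illustration, not Kahuna's actual algorithm: each node's task durations are summarized by a mean, and a node is flagged when its mean deviates strongly from the mean of its peers.

```python
from statistics import mean, pstdev

def peer_similarity_outliers(durations_by_node, threshold=3.0):
    """Flag nodes whose mean task duration deviates strongly from the peer mean.

    durations_by_node: dict mapping node name -> list of task durations (seconds).
    A node is flagged when its mean lies more than `threshold` peer standard
    deviations away from the mean over all other nodes. (Illustrative only;
    Kahuna compares per-node histograms rather than means.)
    """
    node_means = {node: mean(ds) for node, ds in durations_by_node.items()}
    outliers = []
    for node, m in node_means.items():
        peers = [v for n, v in node_means.items() if n != node]
        peer_mean, peer_std = mean(peers), pstdev(peers)
        if peer_std > 0 and abs(m - peer_mean) > threshold * peer_std:
            outliers.append(node)
    return outliers

# On a homogeneous cluster the slow node stands out clearly ...
homogeneous = {"n1": [60, 62], "n2": [59, 61], "n3": [61, 60], "n4": [180, 175]}
print(peer_similarity_outliers(homogeneous))   # ['n4']
# ... but on a heterogeneous cluster, healthy-but-slower hardware looks the same.
```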
The authors also show that different workloads have different diagnostic power, in the sense that certain issues are not uncovered by certain jobs. This is consistent with our assumption of different job classes. The authors do not describe whether Kahuna is able to detect what kind of fault may have occurred on a machine. Kahuna assumes that the cluster is homogeneous, i.e., that tasks take roughly the same amount of time across machines. Figure 2 illustrates that, unsurprisingly, on heterogeneous clusters the same task can take significantly longer or shorter depending on which machine is being used. During the experiment, we made sure that all machines were functioning flawlessly. Diagnosis hence cannot be based on the assumption that the same task should take equally long to execute on every node.

Figure 2: Average execution time of mapping tasks of the WordCount job on heterogeneous nodes of the Hadoop cluster at PARC.

Many root cause analysis techniques use distributed monitoring tools that require active human intervention to locate the fault. Ganglia (Massie et al., 2004) is a well known distributed monitoring system, which is capable of handling large clusters and grids. X-Trace (Fonseca et al., 2007) and Pinpoint (Chen et al., 2002) are tracing techniques to identify faults in distributed systems. Tan et al. (2010a) developed a visualization tool to aid humans in debugging performance related issues of a Hadoop cluster. The tool uses the log analysis technique SALSA (Tan et al.), which uses the Hadoop log files and visualizes a state-machine based view of every node's behavior. Ganesha (Pan et al., 2008) is another diagnosis technique for Hadoop, which locates faults in MapReduce systems by exploiting OS-level metrics. Konwinski and Zaharia (2008) use X-Trace to instrument Hadoop systems to investigate Hadoop's behavior under different situations and tune its performance. Automated performance diagnosis in service-based cloud infrastructures is also possible via the identification of components (software/hardware) that are involved in a specific query response (Zhang et al., 2007). The violation of a specified Service Level Agreement (SLA), i.e., expected response time, for one or more queries implies problems in one or more components involved in processing these queries. The methodology is widely accepted for many distributed systems. To a degree Hadoop is using this approach as well, as described above, but since the processing time for MapReduce jobs depends on many factors, including the size of the data, only very crude limits can be used as cut-off. The determination of a reasonable cut-off is further hindered on heterogeneous clusters, where processing times can vary strongly between machines.
3 PROBLEM STATEMENT
Our primary goal in this work is to identify soft faults in heterogeneous Hadoop clusters, i.e., nodes that are slowed down, as well as to determine the cause of the degraded performance. Hadoop itself has a fault resilience component that continuously pings each DataNode to verify that it is still alive. If a DataNode goes down, or due to some other fault does not respond to the ping, the NameNode marks the DataNode as dead and does not assign any more tasks to it. This mechanism is able to detect many hard faults in the cluster, but is unable to identify problems responsible for slowdowns.
Goals
The long term goal of this project is to enable the Hadoop scheduler to generate efficient schedules even on heterogeneous clusters and when some nodes are suffering from soft faults. This means that the scheduler will need to schedule fewer tasks on slower nodes, and for this it has to take the characteristic resource profile of a job into account as well as the performance characteristics of the nodes. As a prerequisite of that, the focus of this paper is to determine the status of nodes in the cluster and to learn both the job profiles and the relative node performance characteristics. The latter, intuitively, corresponds to a node's hardware configuration. When diagnosing a node to have a potential soft fault, we also want to determine the kind of fault. This can be accomplished by comparing the degree by which jobs with different resource profiles are affected by the fault.

3.1 Effect of Faults
In this paper we consider soft faults in the form of resource contention on a cluster node. For most resources there are numerous ways in which they can be over-subscribed; however, these different root causes do not affect performance in distinguishable ways, so we will only be able to determine which resource is affected, but not what is causing its contention. For the purpose of validating our approach, and in order to create the empirical results presented in the next section, we simulated two types of faults: CPU hogging and disk I/O hogging. For CPU hogging, for instance, we ran additional programs on the node that used a lot of CPU time. As a result, tasks of CPU intensive jobs took much longer to complete, as shown in Figure 3 (please note the number of seconds on the x-axis). On the other hand, disk I/O intensive jobs such as the RandomWriter are not affected nearly as strongly, as can be seen in Figure 4. When the class of a job is known, this difference can be used to determine the likely type of soft fault occurring on a node, i.e., the resource that is over-subscribed.

Figure 3: Time distribution for WordCount (a) without and (b) with CPU hogging. CPU intensive MapReduce jobs such as WordCount are strongly affected in their completion time by over-subscription of the CPU (please note the number of seconds shown on the x-axis).

Figure 4: Time distribution of RandomWriter (a) without and (b) with CPU hogging. MapReduce jobs that are not as CPU intensive, such as RandomWriter, are much less strongly affected in their completion time by over-subscription of the CPU.

4 APPROACH
Our approach for determining performance problems of a heterogeneous Hadoop cluster intuitively consists of two steps. First we learn a model of every node in the cluster for each class of jobs. In the second step, these models are used to estimate how long a given new task should take on a given machine. Intuitively, if the new task takes much longer than predicted by the model, the node is likely to be suffering from a fault. By comparing the impact such a fault has on the completion times for tasks of different classes, i.e., different resource profiles, it is possible to diagnose the kind of fault, in the sense of determining which resource is affected (e.g., CPU, disk I/O).
4.1 Assumptions
Our approach makes a number of assumptions. For some of these assumptions we present empirical evidence; others can be further relaxed in future work.

1. The resource profile of a given task is known.
2. For each job there is one resource that is the bottleneck and dominates the resource requirements of all map tasks of the job. For instance, a CPU intensive job is only marginally affected by disk I/O contention.
3. The class-specific, relative complexity of a task is machine independent. This means that for two jobs A and B of the same class, if B takes a factor of X longer than A on one node, then this factor will also apply to the completion times for A and B on another node, even if the two nodes have different configurations. We will further elaborate on this and provide empirical evidence for this assumption in Section 4.4.
4. The completion times for tasks executing on the same node are independent of each other.
5. The distribution of completion times for the tasks of a specific job on a specific node follows a Gaussian distribution. This means two things: repeating the exact same task multiple times would lead to a Gaussian distribution, but also the execution times for individual tasks of the same job, i.e., processing distinct 64MB blocks of the same data set using the same algorithm, follow this distribution. Our use of a Gaussian distribution in this paper is a first, pragmatic choice. In principle, a distribution that does not extend into negative values would be a better choice for this application. Nevertheless, for the results presented in this paper, this misfit is fairly irrelevant.

Given these assumptions, our approach is captured by Figure 5: we can infer from the completion times of jobs J1 and J2 on machine M1 to the completion time of J2 on M2, once the completion time for J1 on M2 is known. This is possible based on Assumption 3.

Figure 5: The complexity coefficient α describes the relative hardness or complexity of a job compared to other jobs and is assumed to be machine independent. Likewise, β is a coefficient that captures the relative performance of one machine compared to another; it reflects the relative hardware configuration of the machines. Both of these coefficients are specific to a class of jobs, denoted C.

4.2 Classes
Our diagnosis approach relies on an understanding of the resource profile of a task. Every task will make use of system resources in different ways. Some jobs are more CPU intensive, others read from and write to disk more than others. In this paper, as stated in Assumption 2, we assume that for each job there is a single resource that forms the bottleneck. In this paper we only consider two classes: CPU intensive jobs and disk I/O intensive jobs. The assumption states that the intersection between these two classes is empty. In the rest of the paper, we will denote the class of jobs whose bottleneck is resource R as C_R. Hence, we will be talking about two classes, C_CPU and C_IO. In our approach, we run our diagnosis for every class and in the end combine these diagnoses to recommend the specific root cause for a slowdown. For example, if we observe that on a particular node only CPU intensive jobs take longer than predicted by the model, this suggests that the node is suffering from CPU over-subscription, which could be caused, e.g., by some background, operating system, or zombie process.
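Before the base models are defined formally below, the inference pattern of Figure 5 can be illustrated with a small numeric sketch (the durations are invented for illustration): from the completion times of J1 and J2 on M1 and of J1 on M2, the time of J2 on M2 is predicted by transferring the job's complexity coefficient.

```python
def predict_duration(t_j1_m1, t_j2_m1, t_j1_m2):
    """Predict the duration of job J2 on machine M2 (Assumption 3).

    alpha = t(J2, M1) / t(J1, M1) is the complexity of J2 relative to J1 and is
    assumed machine independent, so t(J2, M2) ~= alpha * t(J1, M2).
    """
    alpha = t_j2_m1 / t_j1_m1
    return alpha * t_j1_m2

# J2 is twice as "hard" as J1 on M1 (120s vs 60s); M2 runs J1 in 90s,
# so J2 on M2 is predicted to take about 180s.
print(predict_duration(t_j1_m1=60.0, t_j2_m1=120.0, t_j1_m2=90.0))  # 180.0
```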
Base Model of a Class
In our approach we assume that initially, e.g., during first installation of the cluster, a set of prototypical base jobs can be run on each machine. We assume that one such job for each class of jobs is run on each machine, where there is one class for each dimension in the configuration space, e.g., CPU speed, disk I/O speed, amount of RAM. This is discussed further in the next section. For the two classes we consider in this paper, C_CPU and C_IO, WordCount and RandomWriter are selected as base jobs, respectively. The model for another job of a class is determined relative to the base job model. Given a job J belonging to class C_R, we define α^R_J, the complexity coefficient of job J with respect to resource R, as the factor by which J takes longer to compute relative to the base job for C_R. Assumption 3 states that this factor is equal for every machine. That means that if, for instance, the original WordCount task on Node 1 takes 60 seconds, and another job that is also CPU intense takes 120 seconds on the same node, then we assume that the same factor of 2 would be observed on other nodes as well. Empirical justification and the procedure for estimating job complexity coefficients are described in Section 4.4.

4.3 Behavioral Model Construction
To learn the models we gather time samples of mapping tasks of every job from the log files generated by Hadoop. We collect the system logs produced by Hadoop's native logging feature on the NameNode. Subsequently, we parse these log files to collect timing samples for every node in the cluster. Each entry in the log can be treated as an event and is marked with a time stamp. An event is used to determine when a task is started and when it finishes on a specific node. The duration of a mapping task is computed by subtracting those time stamps. We denote the average duration of all mapping tasks belonging to job J on machine M by t_{J,M}.
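A minimal sketch of this extraction step is shown below. The log line format and field names are invented for illustration (the real Hadoop log format differs), but the principle is the same: pair up start and finish events per task, subtract the time stamps, and group the resulting durations by (job, node).

```python
import re
from collections import defaultdict

# Hypothetical log format: "<epoch seconds> <STARTED|FINISHED> job=<id> task=<id> node=<name>"
LOG_LINE = re.compile(
    r"(?P<ts>\d+(?:\.\d+)?) (?P<event>STARTED|FINISHED) "
    r"job=(?P<job>\S+) task=(?P<task>\S+) node=(?P<node>\S+)")

def task_durations(lines):
    """Return {(job, node): [task durations in seconds]} from start/finish events."""
    starts = {}                      # (job, task) -> start time
    durations = defaultdict(list)    # (job, node) -> list of durations
    for line in lines:
        m = LOG_LINE.match(line)
        if not m:
            continue
        key = (m["job"], m["task"])
        if m["event"] == "STARTED":
            starts[key] = float(m["ts"])
        elif key in starts:
            durations[(m["job"], m["node"])].append(float(m["ts"]) - starts.pop(key))
    return durations

log = [
    "100.0 STARTED job=wordcount task=m_0001 node=node3",
    "158.5 FINISHED job=wordcount task=m_0001 node=node3",
]
print(dict(task_durations(log)))  # {('wordcount', 'node3'): [58.5]}
```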
4.4 Model Learning
It is assumed that the mapping task durations of the same job are distributed normally on any specific machine. The model for each job class C_R on a given machine is hence described by a mean μ and a variance σ². For the base jobs of each class we can easily learn these parameters from the samples gathered during the initial execution of the base jobs. For each class C_R, we run the base job of the class on the Hadoop cluster and collect samples t_{J,M,i} for every machine M. Recall that Hadoop schedules the execution of the tasks belonging to a job onto the available DataNodes according to where the data to be processed is stored. Since the data is distributed roughly uniformly, this process produces a number of samples for each machine. We assume that during this initial learning, no faults occur on the cluster. For a job J, the mean and variance for machine M can be estimated using the standard maximum likelihood estimator over the samples collected for tasks belonging to the base job of C_R:

\mu_{J,M} = \frac{1}{N_{J,M}} \sum_i t_{J,M,i}    (1)

\sigma^2_{J,M} = \frac{1}{N_{J,M}} \sum_i (t_{J,M,i} - \mu_{J,M})^2    (2)

where N_{J,M} is the total number of samples collected for machine M and job J. In our experiments, we use WordCount as the base job for class C_CPU, and RandomWriter for C_IO. We introduce the shorthand J_CPU = WordCount and J_IO = RandomWriter and will in the rest of the paper hence refer to μ_{J_CPU,M} and σ²_{J_CPU,M} as the model parameters of the base job for C_CPU, and to μ_{J_IO,M} and σ²_{J_IO,M} respectively for C_IO.

Estimating Job Complexity
Assumption 3 states that the relative complexity of a job compared to its base job does not depend on the machine it is executed on. To verify this assumption we conducted an experiment with two new MapReduce jobs, J'_CPU and J'_IO, where J'_CPU belongs to class C_CPU and J'_IO belongs to C_IO. For every machine M in the Hadoop cluster, we computed α^{CPU}_{J'_CPU,M} and α^{IO}_{J'_IO,M} as:

\alpha^{CPU}_{J'_{CPU},M} = \frac{\mu_{J'_{CPU},M}}{\mu_{J_{CPU},M}}    (3)

\alpha^{IO}_{J'_{IO},M} = \frac{\mu_{J'_{IO},M}}{\mu_{J_{IO},M}}    (4)

Figure 6 shows the results. The figure shows that the complexity coefficient for each class is fairly similar across machines, providing empirical justification for Assumption 3.

Figure 6: The values of the complexity coefficient α for jobs belonging to C_CPU (WordCount) and C_IO (RandomWriter). Note that even though the nodes are heterogeneous, the relative complexity of a job is comparable between nodes.
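A compact sketch of Equations (1)-(4) over such duration samples might look as follows; the durations are invented for illustration, and wordcount plays the role of the base job J_CPU.

```python
from statistics import mean, pvariance

def base_model(samples):
    """Equations (1)-(2): maximum likelihood mean and variance of task durations."""
    return mean(samples), pvariance(samples)

def complexity_coefficient(new_job_samples, base_job_samples):
    """Equations (3)-(4): ratio of the new job's mean duration to the base job's."""
    return mean(new_job_samples) / mean(base_job_samples)

# Durations (seconds) of map tasks on two machines; invented for illustration.
durations = {
    ("wordcount", "node1"): [60, 62, 58],      # base job for C_CPU
    ("new_cpu_job", "node1"): [121, 119, 120],
    ("wordcount", "node7"): [90, 92, 88],      # slower hardware, fault free
    ("new_cpu_job", "node7"): [182, 178, 180],
}

mu, var = base_model(durations[("wordcount", "node1")])
print(round(mu, 1), round(var, 1))  # 60.0 2.7
# Assumption 3: alpha is roughly the same on both (heterogeneous) machines.
for node in ("node1", "node7"):
    alpha = complexity_coefficient(durations[("new_cpu_job", node)],
                                   durations[("wordcount", node)])
    print(node, round(alpha, 2))   # node1 2.0, node7 2.0
```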
4.5 Diagnosis
Given the duration of mapping tasks belonging to a previously unseen job J2 on a number of machines {M_i}_i, we want to determine whether any of the nodes is having a fault and what the nature of the fault is. We do this by estimating for each machine how long each task should take under normal circumstances, and then use this information to determine a likelihood for a new observation (task duration) to indicate abnormality of the node. The idea for predicting the duration is that we can estimate the model, i.e., the distribution of task durations, for a new task executing on a specific machine by scaling the base model of the class of the job for this machine by the complexity coefficient. The complexity coefficient in turn can be estimated from all other machines that have already run tasks of this job.

Predicting Job Completion Time
When diagnosing machine M_k for a job of class C_R, to get a better estimate of the complexity coefficient, every machine's α value for this job will be used except M_k's. Hence, α^R_{J2} can be estimated as

\alpha^R_{J2} = \frac{1}{N-1} \sum_{i \neq k} \alpha^R_{J2,M_i}    (5)

where N is the number of machines in the cluster that ran mapping tasks for this job. On node M_k, the model for job J2 can be inferred from M_k's model for jobs of the respective class. Given the estimated complexity coefficient α^R_{J2} for job J2 belonging to class C_R, the mean and variance for J2 on M_k can be estimated as:

\mu_{J2,M_k} = \alpha^R_{J2} \cdot \mu_{J_R,M_k}    (6)

\sigma_{J2,M_k} = \alpha^R_{J2} \cdot \sigma_{J_R,M_k}    (7)

Recall that the approximate model of a job on a machine, characterized by these two parameters, describes the (approximate) distribution of durations of tasks of this job on this node. Therefore, the approximate model can be used for diagnosis, by computing the relative likelihood of an observed duration x of a task, using the probability density function (PDF) of the underlying distribution, i.e., in this case, the PDF of the normal distribution:

f(x; \mu, \sigma) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}    (8)

If a machine is suffering from over-subscription of a resource that is made intense use of by the task, then this will slow down the task and make the observed task completion time (duration) less likely. Hence, we consider a node M_k to potentially have an over-subscription of resource R if, for a job J_R of class C_R, f(t_{J_R,M_k}; \mu_{J_R,M_k}, \sigma_{J_R,M_k}) is significantly less than the average relative likelihood for the durations observed for tasks of J_R on all other machines, that is, when

f(t_{J_R,M_k}; \mu_{J_R,M_k}, \sigma_{J_R,M_k}) \ll \frac{1}{N-1} \sum_{i \neq k} f(t_{J_R,M_i}; \mu_{J_R,M_i}, \sigma_{J_R,M_i})
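Putting Equations (5)-(8) together, a minimal diagnosis sketch for one suspect machine could look like this. The base models, complexity estimates, and observed averages are invented for illustration, and the "significantly less than" test is approximated by a simple slack factor; a real implementation would estimate these quantities from the log-derived samples above.

```python
import math

def normal_pdf(x, mu, sigma):
    """Equation (8): probability density of a Gaussian."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def diagnose(base_models, alphas, observed_avg, suspect, slack=0.1):
    """Flag `suspect` if the likelihood of its observed average duration is far
    below the average likelihood over the other machines (the criterion above).

    base_models: {machine: (mu, sigma)} for the base job of the class (Eq. 1-2).
    alphas:      {machine: alpha} complexity estimates of the new job (Eq. 3-4).
    observed_avg:{machine: observed average task duration of the new job}.
    """
    # Equation (5): pool the complexity coefficient from every machine but the suspect.
    others = [m for m in alphas if m != suspect]
    alpha = sum(alphas[m] for m in others) / len(others)

    def likelihood(machine):
        mu_b, sigma_b = base_models[machine]
        # Equations (6)-(7): scale the base model by the complexity coefficient.
        return normal_pdf(observed_avg[machine], alpha * mu_b, alpha * sigma_b)

    peer_avg = sum(likelihood(m) for m in others) / len(others)
    return likelihood(suspect) < slack * peer_avg   # "significantly less than"

base_models = {"node1": (60.0, 3.0), "node2": (75.0, 3.0), "node3": (62.0, 3.0)}
alphas = {"node1": 2.0, "node2": 2.1, "node3": 1.9}
observed = {"node1": 121.0, "node2": 151.0, "node3": 210.0}  # node3 is suspiciously slow
print(diagnose(base_models, alphas, observed, suspect="node3"))  # True
```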
4.6 Experimental Evaluation
To evaluate the effectiveness of our diagnosis approach we collected task durations from the log files of our 14-node Hadoop cluster for two different instances of the WordCount and RandomWriter jobs. The nodes are quite heterogeneous in their hardware configuration, since these machines were purchased at different times and for different original purposes. This heterogeneity is reflected in Figure 2. Following the described approach, we divided the experiments into two phases, a learning phase and a diagnosis phase. In the learning phase, we learned the model for the two base jobs, one for WordCount and one for RandomWriter, for each machine in the cluster. During this phase, we made sure that every node was functioning flawlessly. During the diagnosis phase, we then simulated two types of resource contention: disk I/O contention on Node 10 and CPU contention on Node 3. These faults were simulated by running additional programs on the nodes that made excessive use of these resources ("hogging"). For CPU hogging we over-subscribed each CPU core on the node by a factor of two, by running one extra program per core that ran an infinite loop with a basic arithmetic operation inside. The overall load per CPU core was hence around 2.0 while the Hadoop jobs were executing. For disk hogging, we ran a program that repeatedly wrote large files to the disks used by HDFS. With these fault simulations in place, we ran a different WordCount job and a different RandomWriter job. These jobs differed from the jobs used as base jobs in that they repeated certain sub-tasks multiple times. This was to ensure that these jobs have roughly the same resource profile as the base jobs but at the same time take noticeably longer. As can be seen in Figure 6, these jobs took roughly twice as long as their respective base jobs.
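The fault injection itself requires nothing Hadoop-specific. The following standalone sketch shows the style of hogger processes we mean; parameters such as the file size and path are illustrative assumptions, not the exact programs used in the experiment.

```python
import multiprocessing, os

def cpu_hog():
    """Busy loop with a basic arithmetic operation; one of these per core
    roughly doubles the load on an already fully used core."""
    x = 0
    while True:
        x = (x + 1) % 1000003

def disk_hog(path="/tmp/hog.bin", chunk_mb=64):
    """Repeatedly rewrite a large file to keep the disk busy (illustrative path)."""
    chunk = os.urandom(chunk_mb * 1024 * 1024)
    while True:
        with open(path, "wb") as f:
            f.write(chunk)
            f.flush()
            os.fsync(f.fileno())

if __name__ == "__main__":
    # CPU contention: start one hogger per core, as in the experiment description.
    for _ in range(multiprocessing.cpu_count()):
        multiprocessing.Process(target=cpu_hog, daemon=True).start()
    disk_hog()  # disk contention (run on the node whose disk I/O should degrade)
```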
Results
Figure 7 summarizes the results of our experiment. On the x-axis we show the node number. For each node, two relative likelihood values for the observed average task durations are shown: one for the new WordCount job and one for the new RandomWriter job. As can be seen, the relative likelihood for the observed task durations of the second WordCount job on Node 3 is significantly lower than the relative likelihoods for all other machines. Similarly, the relative likelihood for the observed task durations of the second RandomWriter job on Node 10 is significantly lower than the others. This suggests that it is indeed possible to use the presented approach to identify abnormally behaving nodes in a heterogeneous Hadoop cluster: any node whose average task duration has a relative likelihood that is significantly lower than the average likelihood over the other nodes is a candidate diagnosis. This is because, according to the model, it is unlikely to observe such average task durations on a node that is behaving normally. Applied to the results shown in the figure, this approach would correctly identify Nodes 10 and 3 as being abnormal. Furthermore, note that the performance of Node 3 on the RandomWriter job is not noticeably reduced by the presence of CPU contention, and that likewise the effect of disk I/O contention on Node 10 does not impact the completion time of the CPU intense WordCount job to a noticeable degree. This leads us to believe that this approach is indeed capable of determining the type of fault occurring on a node, at least to the level of detail of which resource is affected by the fault. Applied to our results, this approach would correctly determine that Node 10 is suffering from disk I/O contention, and Node 3 is experiencing CPU contention.

Figure 7: Results of the diagnosis for an instance of the WordCount and an instance of the RandomWriter job. On Node 10 we injected disk I/O contention, and on Node 3 we injected CPU contention. The y-axis shows the relative likelihood of the observed task durations.

5 FUTURE WORK
The work presented in this paper is work in progress and the results are preliminary. This section outlines our plans for future work in this area. The longer term objective of this project is to increase the productivity of Hadoop on heterogeneous and virtual clusters and in the face of faults, without human intervention. The key ideas for achieving this are as follows:

Simultaneous machine status and job characterization. We believe that it is possible to simultaneously learn about the status of nodes in the cluster and the resource profile of jobs, given a model of the nominal behavior of nodes on canonical test jobs. While in this paper we assumed that jobs belong to exactly one job class, in practice it is common for jobs to require a mix of resources.
This is something we believe can be learned as well.

Model transfer. As depicted in Figure 5, we believe that it is possible to predict the time it takes a node to process a certain task, given the node's measured performance on an earlier (set of) job(s) and task durations on other machines. Likewise, it is possible to infer from one machine to another, even when the machines are equipped differently, by considering their models trained on prototypical jobs. We already explained and exploited this idea in this paper.

Adaptive scheduling. Given these improved predictions of how long a task should take on a specific node, it is possible to increase cluster productivity, in particular for (re-)scheduling of tasks on heterogeneous servers. Again, the reduce phase can typically only start once all mapping tasks have been completed. Therefore, a good prediction of how long a certain task will take on a certain node is critical to job completion time: without good predictions it is not possible to generate a schedule that is near-optimal.

Pervasive diagnosis. The two key ideas behind pervasive diagnosis (Kuhn et al., 2008) are that performance measures of production jobs contain information about the health of the system, and that by directing the execution of jobs in a particular way, a system's state can be diagnosed without interrupting production, by deriving insights from these performance measures. Even in the absence of faults, a Hadoop cluster, like many systems, has some degree of hidden state, i.e., aspects that are relevant but not directly observable. But since this hidden state affects job and task completion times, it can be diagnosed over time by deriving insights from these performance measures. Furthermore, scheduling a task of a known class (e.g., a CPU intensive job) onto a specific node can be used to probe the node and reveal the status of a specific system resource. Applying these ideas, in future work we plan to extend the existing Hadoop scheduling algorithms to proactively reduce the uncertainty about (a) a node's health status in terms of its resources (CPU, RAM, disk I/O, network I/O, etc.), as well as the uncertainty about (b) the resource requirements of previously unseen jobs. Again, the purpose of reducing this uncertainty is to increase the resilience of the system as well as to reduce average job completion time via improved scheduling.

6 CONCLUSIONS
We have presented an approach for diagnosing performance issues in heterogeneous Hadoop clusters. The approach extends existing diagnosis approaches for Hadoop to clusters where the peer-similarity assumption does not hold, and it is further able to distinguish between different types of faults. While the approach is based on a number of assumptions, we have presented empirical evidence for some of these assumptions. Further validating and/or relaxing some of the other assumptions is part of future work. To empirically validate our diagnosis approach, we simulated soft faults in a Hadoop cluster consisting of 14 nodes, running two different MapReduce jobs with different resource profiles. The preliminary results presented in this paper suggest that the proposed approach is viable and able to achieve the intended goals of (a) identifying machines on which resource contention is occurring, and (b) determining the resource which is over-subscribed, by considering the relative impact of faults on the completion times of jobs with different resource requirements.

Acknowledgments
We thank Bob Price, Rui Zhang, and the anonymous reviewers for their thoughts and constructive criticism.
REFERENCES

(Chen et al., 2002) Mike Y. Chen, Emre Kiciman, Eugene Fratkin, Armando Fox, and Eric Brewer. Pinpoint: Problem determination in large, dynamic internet services. In Proc. 2002 Intl. Conf. on Dependable Systems and Networks, 2002.

(Dean and Ghemawat, 2008) Jeffrey Dean and Sanjay Ghemawat. MapReduce: simplified data processing on large clusters. Commun. ACM, 51(1):107-113, January 2008.

(Fonseca et al., 2007) Rodrigo Fonseca, George Porter, Randy H. Katz, Scott Shenker, and Ion Stoica. X-Trace: A pervasive network tracing framework. In NSDI, 2007.

(Ghemawat et al., 2003) Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The Google file system, 2003.

(Konwinski and Zaharia, 2008) Andy Konwinski and Matei Zaharia. Finding the elephant in the data center: Tracing Hadoop. Technical report, 2008.

(Kuhn et al., 2008) Lukas Kuhn, Bob Price, Johan de Kleer, Minh Binh Do, and Rong Zhou. Pervasive diagnosis: The integration of diagnostic goals into production plans. In Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence, AAAI 2008, 2008.

(Massie et al., 2004) M. L. Massie, B. N. Chun, and D. E. Culler. The Ganglia Distributed Monitoring System: Design, Implementation, and Experience. Parallel Computing, 30(7), July 2004.

(Pan et al., 2008) Xinghao Pan, Jiaqi Tan, Soila Kavulya, Rajeev Gandhi, and Priya Narasimhan. Ganesha: Black-box fault diagnosis for MapReduce systems. Technical report, 2008.

(Tan et al.) Jiaqi Tan, Xinghao Pan, Soila Kavulya, Rajeev Gandhi, and Priya Narasimhan. SALSA: Analyzing logs as state machines.

(Tan et al., 2010a) Jiaqi Tan, Soila Kavulya, Rajeev Gandhi, and Priya Narasimhan. Visual, log-based causal tracing for performance debugging of MapReduce systems. In ICDCS, 2010.

(Tan et al., 2010b) Jiaqi Tan, Xinghao Pan, Eugene Marinelli, Soila Kavulya, Rajeev Gandhi, and Priya Narasimhan. Kahuna: Problem diagnosis for MapReduce-based cloud computing environments. In IEEE/IFIP Network Operations and Management Symposium, 2010.

(Zhang et al., 2007) Rui Zhang, Steve Moyle, Steve McKeever, and Alan Bivens. Performance problem localization in self-healing, service-oriented systems using Bayesian networks. In Proceedings of the 2007 ACM Symposium on Applied Computing, SAC '07. ACM, 2007.
