HiTune: Dataflow-Based Performance Analysis for Big Data Cloud
Jinquan Dai, Jie Huang, Shengsheng Huang, Bo Huang, Yan Liu
Intel Asia-Pacific Research and Development Ltd
Shanghai, P.R.China, {jason.dai, jie.huang, shengsheng.huang, bo.huang,

Abstract

Although Big Data Cloud (e.g., MapReduce, Hadoop and Dryad) makes it easy to develop and run highly scalable applications, efficient provisioning and fine-tuning of these massively distributed systems remain a major challenge. In this paper, we describe a general approach to help address this challenge, based on distributed instrumentations and dataflow-driven performance analysis. Based on this approach, we have implemented HiTune, a scalable, lightweight and extensible performance analyzer for Hadoop. We report our experience on how HiTune helps users to efficiently conduct Hadoop performance analysis and tuning, demonstrating the benefits of dataflow-based analysis and the limitations of existing approaches (e.g., system statistics, Hadoop logs and metrics, and traditional profiling).

1. Introduction

There are dramatic differences between delivering software as a service in the cloud for millions to use, versus distributing software as bits for millions to run on their PCs. First and foremost, services must be highly scalable, storing and processing an enormous amount of data. For instance, in June 2010, Facebook reported 21PB raw storage capacity in their internal data warehouse, with 12TB compressed new data added every day and 800TB compressed data scanned daily [1]. This type of Big Data phenomenon has led to the emergence of several new cloud infrastructures (e.g., MapReduce [2], Hadoop [3], Dryad [4], Pig [5] and Hive [6]), characterized by the ability to scale to thousands of nodes, fault tolerance and relaxed consistency. In these systems, the users can develop their applications according to a dataflow graph (either implicitly dictated by the programming/query model or explicitly specified by the users).
Once an application is cast into the system, the cloud runtime is responsible for dynamically mapping the logical dataflow graph to the underlying cluster for distributed executions. With these Big Data cloud infrastructures, the users are required to exploit the inherent data parallelism exposed by the dataflow graph when developing the applications; on the other hand, they are abstracted away from the messy details of data partitioning, task distribution, load balancing, fault tolerance and node communications. Unfortunately, this abstraction makes it very difficult, if not impossible, for the users to understand the cloud runtime behaviors. Consequently, although Big Data Cloud makes it easy to develop and run highly scalable applications, efficient provisioning and fine-tuning of these massively distributed systems remain a major challenge.

To help address this challenge, we attempt to design tools that allow users to understand the runtime behaviors of Big Data Cloud, so that they can make educated decisions regarding how to improve the efficiency of these massively distributed systems, just as traditional performance analyzers do for a single execution of a single program. Unfortunately, performance analysis for Big Data Cloud is particularly challenging, because these applications can potentially comprise several thousands of programs running on thousands of machines, and the low level performance details are hidden from the users by the high level dataflow model.

In this paper, we describe a specific solution to this problem based on distributed instrumentations and dataflow-driven performance analysis, which correlates concurrent performance activities across different programs and machines, reconstructs the dataflow-based, distributed execution process of the Big Data application, and relates the low level performance activities to the high level dataflow model.
Based on this approach, we have implemented HiTune, a scalable, lightweight and extensible performance analyzer for Hadoop. We report our experience on how HiTune helps users to efficiently conduct Hadoop performance analysis and tuning, demonstrating the benefits of dataflow-based analysis and the limitations of existing approaches (e.g., system statistics, Hadoop logs and metrics, and traditional profiling). For instance, reconstructing the dataflow execution process of a Hadoop job allows users to understand the dynamic interactions between different tasks and stages (e.g., task scheduling and data shuffle; see sections 7.1 and 7.2). In addition, relating performance activities to the dataflow model allows users to conduct fine-grained,
dataflow-based hotspot breakdown (e.g., for identifying application hotspots and hardware problems; see sections 7.2 and 7.3).

The rest of the paper is organized as follows. In section 2, we introduce the motivations and objectives of our work. We give an overview of our approach in section 3, and present the dataflow-based performance analysis in section 4. In section 5, we describe the implementation of HiTune, a performance analyzer for Hadoop. We experimentally evaluate HiTune in section 6, and report our experience in section 7. We discuss the related work in section 8, and finally conclude the paper in section 9.

2. Problem Statement

In this section, we describe the motivations, challenges, goals and non-goals of our work.

2.1 Big Data Cloud

In Big Data Cloud, the input applications are presented to the users as directed acyclic dataflow graphs, where graph vertices represent processing stages and graph edges represent communication channels. All the data parallelisms of the computation and the data dependencies between processing stages are explicitly encoded in the dataflow graph. The users can develop their applications by simply supplying programs that run on the vertices to these systems; on the other hand, they are abstracted away from the low level details of the distributed executions of their applications. The cloud runtime is responsible for dynamically mapping the logical dataflow graph to the underlying cluster, including generating the optimized dataflow graph of execution plans, assigning the vertices and edges to physical resources, and scheduling and executing each vertex (usually using multiple instances and possibly multiple times due to failures).

For instance, the MapReduce model dictates a two-stage group-by-aggregation dataflow graph to the users, as shown in Figure 1. A MapReduce application has one input that can be trivially partitioned.
In the first stage a Map function, which specifies how the grouping is performed, is applied to each partition of input data. In the second stage a Reduce function, which performs the aggregation, is applied to each group produced by the first stage. The MapReduce framework is then responsible for mapping this logical dataflow graph to the physical resources. For instance, the Hadoop framework automatically executes the input MapReduce application using an internal dataflow graph of execution plan, as shown in Figure 2. The input data is divided into splits, and a distinct Map task is launched for each split. Inside each Map task, the map stage applies the Map function to the input data, and the spill stage stores the map output on local disks. In addition, a distinct Reduce task is launched for each partition of the map outputs. Inside each Reduce task, the copier and merge stages run in a pipelined fashion, fetching the relevant partition over the network and merging the fetched data respectively; after that, the sort and reduce stages merge the reduce inputs and apply the Reduce function respectively.

Figure 1. Dataflow graph of a MapReduce application (partitioned input data, parallel Map instances, Reduce, aggregated output)

Figure 2. Dataflow graph of the Hadoop execution plan (partitioned input data; Map tasks with map and spill stages; Reduce tasks with a shuffle phase of copier and merge stages, followed by sort and reduce stages; streaming and sequential dataflow edges; aggregated output)

Pig script:
    good_urls = FILTER urls BY pagerank > 0.2;
    groups = GROUP good_urls BY category;
    big_groups = FILTER groups BY COUNT(good_urls) > ;
    output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank);

Hive query:
    SELECT category, AVG(pagerank)
    FROM (SELECT category, pagerank, count(1) AS recordnum
          FROM urls WHERE pagerank > 0.2 GROUP BY category) big_groups
    WHERE big_groups.recordnum >

Figure 3. The Pig program [5] and Hive query example

In addition, the Pig and Hive systems allow the users to perform ad-hoc analysis of Big Data on top of Hadoop, using dataflow-style scripts and SQL-like queries respectively. For instance, Figure 3 shows the Pig program (an example from the original Pig paper [5]) and the Hive query for the same operation (i.e., finding, for each sufficiently large category, the average pagerank of high-pagerank urls in that category). In these two systems, the logical dataflow graph of the operation is implicitly dictated by the language or query model, and
is automatically compiled into the physical execution plan (another dataflow graph) that is executed on the underlying Hadoop system.

Unlike the aforementioned systems that restrict their applications' dataflow graphs, the Dryad system allows the users to specify an arbitrary directed acyclic graph to describe the application, as illustrated in Figure 4 (an example from the Dryad website [7]). The cloud runtime then refines the input dataflow graph and executes the optimized execution plan on the underlying cluster.

Figure 4. Dataflow graph of a Dryad application [7]

2.2 Motivations and Challenges

By exposing data parallelisms through the dataflow model and hiding the low level details of the underlying cluster, Big Data Cloud allows users to work at the appropriate level of abstraction, which makes it easy to develop and run highly scalable applications. Unfortunately, this abstraction makes it very difficult, if not impossible, for users to understand the cloud runtime behaviors. Consequently, it remains a big challenge to efficiently provision and tune these massively distributed systems, which entails requesting and allocating the optimal number of (physical or virtual) resources, and optimizing the system and applications for better resource utilization. As Big Data Cloud grows in pervasiveness and scale, addressing this challenge becomes critically important (for instance, tuning Hadoop jobs is considered a very difficult problem in the Hadoop community and requires a lot of effort to understand Hadoop internals [8]; in addition, the lack of tuning tools for Hadoop often forces users to resort to trial-and-error tuning [9]).
This motivates our work to design tools that allow users to understand the runtime behaviors of Big Data applications, so that they can make educated decisions regarding how to improve the efficiency of these massively distributed systems, just as traditional performance analyzers (e.g., gprof [10] and Intel VTune [11]) do for a single execution of a single program. Unfortunately, performance analysis for Big Data Cloud is particularly challenging due to its unique properties.

Massively distributed systems: Each Big Data application is a complex distributed application, which may comprise tens of thousands of processes and threads running on thousands of machines. Understanding system behaviors in this context would require correlating concurrent performance activities (e.g., CPU cycles, retired instructions, lock contentions, etc.) across many programs and machines with each other.

High level abstractions: Big Data Cloud allows users to work at an appropriately high level of abstraction, by hiding the messy details of parallelisms behind the dataflow model and dynamically instantiating the dataflow graph (including resource allocations, task scheduling, fault tolerance, etc.). Consequently, it is very difficult, if not impossible, for users to understand how the low level performance activities can be related to the high level abstraction (which they have used to develop and run their applications).

In this paper, we address these technical challenges through distributed instrumentations and dataflow-driven performance analysis. Our approach allows users to easily associate different low level performance activities with the high level dataflow model, and provides valuable insights into the runtime behaviors of Big Data Cloud and applications.

2.3 Goals and Non-Goals

Our goal is to design tools that help users to efficiently conduct performance analysis for Big Data Cloud.
In particular, we want our tools to be broadly applicable to many different applications and systems, and to be applicable even to production systems (because it is often impossible to reproduce the cloud behaviors given the scale of Big Data Cloud). Several concrete design goals result from these requirements.

Low overhead: It is critical that our tools have negligible (e.g., less than a few percent) performance impacts on the running applications.

No source code modifications: Our tools should not require any modifications to the cloud runtime, middleware, messages, or applications.

Scalability: Our tools need to handle applications that potentially comprise tens of thousands of processes/threads running on thousands of servers.

Extensibility: We would like our tools to be flexible enough so that they can be easily extended to support different cloud systems.

We also have several non-goals.

We are not developing tools that can replace the need for developers (e.g., by automatically
allocating the right amount of resources). Performance analysis for distributed systems is hard, and our goal is to make it easier for users, not to automate it.

Our tools are not meant to verify the correct system behaviors, or diagnose the cause of faulty behaviors.

3. Overview of Our Approach

Our approach relies on distributed instrumentations on each node in the cloud, and on aggregating all the instrumentation results for dataflow-based analysis. The performance analysis framework consists of three major components, namely tracker, aggregation engine and analysis engine, as illustrated in Figure 5.

Figure 5. Performance analysis framework

Timestamp | Type | Target | Value
Figure 6. Format of the sampling record

The tracker is a lightweight agent running on every node. Each tracker has several samplers, which inspect the runtime information of the programs and system running on the local node (either periodically or based on specific events), and send the sampling records to the aggregation engine. Each sampling record is of the format shown in Figure 6.

Timestamp is the sampling time for each record. Since the nodes in the cloud are usually in the same administrative domain and highly connected, it is easy to have all the nodes time-synchronized (e.g., in Hadoop all the slaves exchange heartbeat messages with the master periodically and the time synchronization information can be easily piggybacked); consequently the sampler can directly record its sampling time. Alternatively, the sampler can send the sampling record to the aggregation engine in real-time, which can then record the receiving time.

Type specifies the type of the sampling record (e.g., CPU cycles, disk bandwidth, log files, etc.).

Target specifies the source of the sampling record. It contains the name of the local node, as well as other sampler-specific information (e.g., CPUID, network interface name or log file name).

Value contains the detailed sampling information of this record (e.g., CPU load, network bandwidth utilization, or a line/record in the log file).

The aggregation engine is responsible for collecting the sampling information from all the trackers in a distributed fashion, and storing the sampling information in a separate monitoring cluster for analysis. Any distributed log collection tool (e.g., Chukwa [12], Scribe [13] and Flume [14]) can be used as the aggregation engine.

In addition, the analysis engine runs on the monitoring cluster, and is responsible for conducting the performance analysis and generating the analysis report, using the collected sampling information and a specification file describing the logical dataflow model of the specific Big Data cloud.

4. Dataflow-Based Performance Analysis

In order to help users to understand the runtime behaviors of Big Data Cloud, our framework presents the performance analysis results in the same dataflow model that is used in developing and running the applications. The key technical challenge is to reconstruct the high level, dataflow-based, distributed and dynamic execution process for each Big Data application, based on the low level sampling records collected across different programs and machines. We address this challenge by:

1) Running a task execution sampler on every node to collect the execution information of each task in the application.
2) Describing the high level dataflow model of Big Data Cloud in a specification file provided to the analysis engine.
3) Constructing the dataflow execution process for the application based on the dataflow specification and the program execution information.

4.1 Task Execution Sampler

To collect the program execution information, the task execution sampler instruments the cloud runtime and tasks running on the local node, and stores associated information into its sampling records at fixed time
intervals as follows.

The Target field of the sampling record needs to store the identifier (e.g., application name, task name, process ID and/or thread ID) of the program that is instrumented to collect this piece of sampling information.

The Value field of the sampling record must contain the execution position of the program (e.g., thread name, stack trace, basic-block ID and/or instruction address) at which the program is running when it is instrumented to collect this piece of sampling information.

In practice, the task execution sampler can be implemented using any traditional instrumentation tool (which runs on a single machine), such as Intel VTune.

4.2 Dataflow Specification

In order for the analysis engine to efficiently conduct the dataflow-based analysis, the dataflow specification needs to describe not only the dataflow graph (e.g., vertices and edges), but also the high level resource mappings of the dataflow model (e.g., physical implementations of vertices/edges, parallelisms between different phases/stages, and communication patterns between different stages). Consequently, the dataflow specification does require a priori knowledge of the cloud system. On the other hand, the users are not required to write the specification; instead, the dataflow specification is provided by the cloud system or the performance analyzer, and is either written by the developers of the cloud system (and/or the performance analyzer), or dynamically generated by the cloud runtime (e.g., by the query compiler).

The format of the dataflow specification is illustrated in Figure 7 and described in detail below.

The Input (Output) section contains a list of <inputid: storage location> (<outputid: storage location>), where the storage location specifies which storage system (e.g., HDFS or MySQL) is used to store the input (output) data.
The Vertices section contains a list of <vertexid: program location>, where the program location specifies the portion of program (e.g., the corresponding thread or function) running on this graph vertex (i.e., processing stage). It is required that each execution position collected by the task execution sampler can be mapped to a unique program location in the specification, so that the analysis engine can determine which vertex each task execution sampling record belongs to.

//dataflow graph
Input {
    //list of <inputid: storage location>
    In1: storage location
    ...
}
Output {
    //list of <outputid: storage location>
    Out1: storage location
    ...
}
Vertices {
    //list of <vertexid: program location>
    V1: program location
    ...
}
Edges {
    //list of <edgeid: inputid/vertexid -> vertexid/outputid>
    E1: In1 -> V1
    E2: V1 -> V2
    ...
}
//resource mapping
Vertex Mapping {
    //list of Task Pool
    Task Pool [(name)] <(cardinality)> {
        //ordered list of Phase
        Phase [(name)] {
            //list of Thread Pool or Thread Group Pool
            Thread Pool [(name)] <(cardinality)> {
                //ordered list of vertexid
                V1, V2, ...
            } //end of Thread Pool
            Thread Group Pool [(name)] <(cardinality, group size)> {
                //a single vertexid
                V3
            } //end of Thread Group Pool
            ...
        } //end of Phase
        Phase [(name)] { ... }
        ...
    } //end of Task Pool
    Task Pool [(name)] <(cardinality)> { ... }
    ...
}
Edge Mapping {
    //list of <edgeid: edge type, endpoint location>
    E1: edge type, endpoint location
    ...
}
Figure 7. Dataflow specification of Big Data Cloud

The Edges section contains a list of <edgeid: inputid/vertexid -> vertexid/outputid>, which defines all the graph edges (i.e., communication channels).

The Vertex Mapping section describes the high level resource mappings and parallelisms of the graph vertices. This section contains a list of Task Pool subsections; for each Task Pool subsection, the cloud runtime will launch several tasks (or processes) that can potentially run on different nodes in parallel.
The Task Pool subsection contains an ordered list of Phase subsections, and each task belonging to this task pool will sequentially execute these phases in the specified order. The Phase subsection contains a list of Thread Pool or Thread Group Pool subsections; for each of these subsections, the associated task will spawn several
threads or thread groups in parallel. The Thread Pool subsection contains an ordered list of vertexid, and each thread belonging to this thread pool will sequentially execute these vertices in the specified order. On the other hand, a number of threads (as determined by group size) in the thread group will run in concert with each other, executing the vertex specified in the Thread Group Pool subsection.

The cardinality of the Task Pool, Thread Pool or Thread Group Pool subsections determines the number of instances (i.e., processes, threads or thread groups) to be launched. It can take one of the following forms:

(1) N: exactly N instances will be launched.
(2) 1~N: up to N instances will be launched.
(3) 1~: the number of instances to be launched is dynamically determined by the cloud runtime.

The Edge Mapping section contains a list of <edgeid: edge type, endpoint location>. The edge type specifies the physical implementation of the edge, such as network connection, local file or memory buffer. The endpoint location specifies the communication patterns between the vertices, which can be intra-thread/intra-task/intra-node (i.e., data transfer exists only between vertex instances running in the same thread, the same task and the same node respectively), or unconstrained.

It is possible to extend the specification to support even more complex dataflow models and resource mappings (e.g., Process Group Pool); however, the current model is sufficient for all the Big Data cloud infrastructures that we have considered. For instance, the dataflow specification for our Hadoop cluster is shown in Figure 8.

4.3 Dataflow-Based Analysis

As described in the previous sections, the program execution information collected by task execution samplers is generic in nature, and the dataflow model of the specific Big Data cloud is defined in a specification file.
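As a rough illustration of what such a specification file carries, the structure described in section 4.2 can be modeled with a few Python data classes. This is only a sketch under stated assumptions: the class names, the field names, the `parse_cardinality` helper and the `problems` sanity check are invented here for illustration and are not HiTune's actual format or code. The values are taken from the Hadoop specification of Figure 8, and the Edge Mapping section is left out to keep the example short.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple, Union


def parse_cardinality(text: str) -> Tuple[int, Optional[int]]:
    """Return (min, max) instance counts; max is None if the runtime decides."""
    text = text.replace(" ", "")
    if "~" not in text:            # "N": exactly N instances
        n = int(text)
        return n, n
    low, high = text.split("~")    # "1~N" (up to N) or "1~" (open-ended)
    return int(low), (int(high) if high else None)


@dataclass
class ThreadPool:                  # each thread runs these vertices in order
    vertices: List[str]
    cardinality: str = "1"


@dataclass
class ThreadGroupPool:             # group_size threads cooperate on one vertex
    vertex: str
    cardinality: str = "1"
    group_size: int = 1


@dataclass
class Phase:                       # pools of one phase run in parallel
    pools: List[Union[ThreadPool, ThreadGroupPool]]
    name: str = ""


@dataclass
class TaskPool:                    # each task runs its phases sequentially
    name: str
    cardinality: str
    phases: List[Phase]


@dataclass
class DataflowSpec:
    inputs: Dict[str, str]               # inputid -> storage location
    outputs: Dict[str, str]              # outputid -> storage location
    vertices: Dict[str, str]             # vertexid -> program location
    edges: Dict[str, Tuple[str, str]]    # edgeid -> (source, destination)
    vertex_mapping: List[TaskPool]

    def mapped_vertices(self) -> List[str]:
        found: List[str] = []
        for task_pool in self.vertex_mapping:
            for phase in task_pool.phases:
                for pool in phase.pools:
                    if isinstance(pool, ThreadPool):
                        found.extend(pool.vertices)
                    else:
                        found.append(pool.vertex)
        return found

    def problems(self) -> List[str]:
        """Sanity checks assumed by this sketch (not taken from the paper)."""
        issues = []
        sources = set(self.inputs) | set(self.vertices)
        sinks = set(self.vertices) | set(self.outputs)
        for edge_id, (src, dst) in self.edges.items():
            if src not in sources or dst not in sinks:
                issues.append(f"{edge_id}: unknown endpoint")
        mapped = set(self.mapped_vertices())
        issues += [f"{v}: not mapped" for v in self.vertices if v not in mapped]
        return issues


HADOOP = DataflowSpec(
    inputs={"Input": "HDFS"},
    outputs={"Output": "HDFS"},
    vertices={"map": "MapTask.run", "spill": "SpillThread.run",
              "copier": "MapOutputCopier.run",
              "merge": "InMemFSMergeThread.run or LocalFSMerger.run",
              "sort": "ReduceCopier.createKVIterator#ReduceCopier.access",
              "reduce": "runNewReducer or runOldReducer"},
    edges={"E1": ("Input", "map"), "E2": ("map", "spill"),
           "E3": ("spill", "copier"), "E4": ("copier", "merge"),
           "E5": ("merge", "sort"), "E6": ("sort", "reduce"),
           "E7": ("reduce", "Output")},
    vertex_mapping=[
        TaskPool("Map", "1~",
                 [Phase([ThreadPool(["map"]), ThreadPool(["spill"])])]),
        TaskPool("Reduce", "1~", [
            Phase([ThreadGroupPool("copier", "1", 20),
                   ThreadGroupPool("merge", "1", 2)], name="shuffle"),
            Phase([ThreadPool(["sort", "reduce"])]),
        ]),
    ],
)
```

Keeping the cardinality as the original string and parsing it on demand mirrors the three forms listed in section 4.2; a different cloud system would, under the same assumptions, only need a different `DataflowSpec` value.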
Based on these data, the analysis engine can reconstruct the dataflow execution process for the Big Data applications, and associate different performance activities with the high level dataflow model. In this way, our framework can be easily applied to different cloud systems by simply changing the specification file. We defer the detailed description of the dataflow construction algorithm to section 5.3.

//Hadoop dataflow graph
Input {
    //list of <inputid: storage location>
    Input: HDFS
}
Output {
    //list of <outputid: storage location>
    Output: HDFS
}
Vertices {
    //list of <vertexid: program location>
    map: MapTask.run
    spill: SpillThread.run
    copier: MapOutputCopier.run
    merge: InMemFSMergeThread.run or LocalFSMerger.run
    sort: ReduceCopier.createKVIterator#ReduceCopier.access
    reduce: runNewReducer or runOldReducer
}
Edges {
    //list of <edgeid: inputid/vertexid -> vertexid/outputid>
    E1: Input -> map
    E2: map -> spill
    E3: spill -> copier
    E4: copier -> merge
    E5: merge -> sort
    E6: sort -> reduce
    E7: reduce -> Output
}
Vertex Mapping {
    //list of Task Pool
    Task Pool (Map) (1~) {
        //ordered list of Phase
        Phase {
            //list of Thread Pool or Thread Group Pool
            Thread Pool (1) {map}
            Thread Pool (1) {spill}
        }
    }
    Task Pool (Reduce) (1~) {
        Phase (shuffle) {
            Thread Group Pool (copy) (1, 20) {copier}
            Thread Group Pool (merge) (1, 2) {merge}
        }
        Phase {
            Thread Pool (1) {sort, reduce}
        }
    }
}
Edge Mapping {
    //list of <edgeid: edge type, endpoint location>
    E1: HDFS, unconstrained
    E2: memory buffer, intra-task
    E3: HTTP, unconstrained
    E4: memory buffer, intra-task
    E5: memory buffer or local file, intra-task
    E6: memory buffer or local file, intra-thread
    E7: HDFS, unconstrained
}
Figure 8. Hadoop dataflow specification

5. HiTune: A Dataflow-Based Hadoop Performance Analyzer

Based on our general performance analysis framework, we have implemented HiTune, a scalable, lightweight and extensible performance analyzer for Hadoop.
In this section, we describe the implementation of HiTune, and in particular, how it is carefully engineered to meet the design goals described in section 2.3.

5.1 Implementation of Tracker

In our current implementation, all the nodes in the
Hadoop cluster are time synchronized (e.g., using a time service), and a tracker runs on each node in the cluster. Currently, the tracker consists of three samplers (i.e., task execution sampler, system sampler and log file sampler), and each sampler directly stores the sampling time in the Timestamp field of its sampling record.

We have chosen to implement the task execution sampler using binary instrumentation techniques, so that it can instrument and collect the program execution information without any source code modifications. Specifically, the task execution sampler runs as a Java programming language agent [15]; whenever the Hadoop framework launches a JVM to be instrumented, it dynamically attaches to the JVM the sampler agent, which samples the Java thread stack trace and state for all the threads in the JVM at specified intervals (during the entire lifetime of the JVM).

For each sampling record generated by the task execution sampler, its Target field contains the node name, the task name and the Java thread ID; and its Value field contains the current execution position (i.e., the Java thread name and thread stack trace) as well as the Java thread state. That is, the Target field is specified using the identifier of the runtime program instance (i.e., the thread ID), which allows the analysis engine to construct the entire sampling trace of each thread; in addition, the execution position is specified using the static program names (i.e., the Java thread name and method names), which allows the dataflow model and resource mappings to be easily described in the specification file.

In addition, the system sampler simply reports the system statistics (e.g., CPU load, disk I/O, network bandwidth, etc.) periodically using the sysstat package, and the log file sampler reports the Hadoop log information (including Hadoop metrics and job history files) whenever there is new log information.
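The task execution sampler described above is a Java agent attached to each Hadoop JVM. As a loose analogue only, and not HiTune's implementation, the same idea of periodically capturing every thread's stack trace into Timestamp/Type/Target/Value records can be sketched in Python with the standard `sys._current_frames()` call. The function names `sample_threads` and `run_sampler`, the dictionary layout of a record and the in-memory `sink` list are assumptions made for this sketch; Python exposes no direct counterpart of the Java thread state, so that part of the Value field is omitted.

```python
import socket
import sys
import threading
import time
import traceback
from typing import Dict, List


def sample_threads(task_name: str) -> List[Dict]:
    """Take one stack-trace sample of every live thread in this process."""
    now = time.time()
    thread_names = {t.ident: t.name for t in threading.enumerate()}
    node = socket.gethostname()
    records = []
    for thread_id, frame in sys._current_frames().items():
        # Outermost call first, innermost (current position) last.
        stack = [f"{fs.name} ({fs.filename}:{fs.lineno})"
                 for fs in traceback.extract_stack(frame)]
        records.append({
            "timestamp": now,                        # sampling time
            "type": "task_execution",                # kind of record
            "target": (node, task_name, thread_id),  # who was sampled
            "value": {                               # execution position
                "thread_name": thread_names.get(thread_id, "unknown"),
                "stack": stack,
            },
        })
    return records


def run_sampler(task_name: str, interval_s: float, duration_s: float,
                sink: List[Dict]) -> None:
    """Sample at a fixed interval and append the records to `sink`.

    A real tracker would hand the records to an aggregation engine; the
    plain list used here is a stand-in assumed for this sketch.
    """
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        sink.extend(sample_threads(task_name))
        time.sleep(interval_s)
```

Calling `run_sampler("some_task", 0.02, 1.0, records)` from a helper thread would, under these assumptions, collect roughly one sample per thread every 20 milliseconds for one second; because the Target tuple carries the thread ID, the records of one thread can later be grouped into a single trace.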
Since the tracker (especially the task execution sampler) needs to instrument the Hadoop tasks running on each node, it is the major source of performance overheads in HiTune. We have carefully designed and implemented the tracker (e.g., the task execution sampler caches the stack traces in memory and batches the write-out to the aggregation engine), so that it has very low (less than 2% according to our measurement) performance impacts on Hadoop applications.

5.2 Implementation of Aggregation Engine

To ensure its scalability, the aggregation engine is implemented as a distributed data collection system, which can collect the sampling information from potentially thousands of nodes in the Hadoop cluster. In the current implementation, we have chosen to use Chukwa (a distributed log collection framework) as our aggregation engine. Every sampler in HiTune directly sends its sampling records to the Chukwa agent running on the local node, which in turn sends data to the Chukwa collectors over the network; the collector is responsible for storing the sampling data in a (separate) monitoring Hadoop/HDFS cluster.

5.3 Implementation of Analysis Engine

The sampling data for a Hadoop job can be potentially very large in size (e.g., about 100GB for TeraSort [16][17] in our cluster). We address the scalability challenge by first storing the sampling data in HDFS (as described in section 5.2), and then running the analysis engine as a Hadoop application on the monitoring Hadoop cluster in an offline fashion.

Based on the Target field (i.e., the node name, the task name and the Java thread ID) of every task execution sampling record, the analysis engine first constructs a sampling trace for each thread (i.e., the sequence of all task execution sampling records belonging to that thread, ordered by the record timestamps) in the Hadoop job.
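The trace construction step just described can be sketched as a group-by on the Target field followed by a sort on Timestamp. The sketch below is a single-process Python illustration, not the Hadoop application that HiTune actually runs. The helper names, the simplified stack frames (bare method names) and the sample records are hypothetical; only the program locations in `PROGRAM_LOCATIONS` are taken from Figure 8.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

# Subset of the vertexid -> program location table of Figure 8.
PROGRAM_LOCATIONS = {
    "map": {"MapTask.run"},
    "spill": {"SpillThread.run"},
    "copier": {"MapOutputCopier.run"},
    "merge": {"InMemFSMergeThread.run", "LocalFSMerger.run"},
}

Target = Tuple[str, str, int]  # (node name, task name, thread ID)


def build_thread_traces(records: List[dict]) -> Dict[Target, List[dict]]:
    """Group task execution records by thread and order them by time."""
    traces: Dict[Target, List[dict]] = defaultdict(list)
    for record in records:
        if record["type"] == "task_execution":
            traces[record["target"]].append(record)
    for trace in traces.values():
        trace.sort(key=lambda r: r["timestamp"])
    return dict(traces)


def vertex_of(record: dict) -> Optional[str]:
    """Map a record's execution position to a dataflow vertex, if any."""
    frames = set(record["value"]["stack"])
    for vertex, methods in PROGRAM_LOCATIONS.items():
        if frames & methods:
            return vertex
    return None


def make_record(ts: float, thread_id: int, stack: List[str]) -> dict:
    """Build a hypothetical sampling record for the demo below."""
    return {"timestamp": ts, "type": "task_execution",
            "target": ("node1", "task_m_0", thread_id),
            "value": {"stack": stack}}


# Hypothetical, deliberately unordered samples from two threads of one task,
# plus one system record that the trace construction must ignore.
RECORDS = [
    make_record(0.04, 1, ["Child.main", "MapTask.run", "UserMapper.map"]),
    make_record(0.02, 1, ["Child.main", "MapTask.run"]),
    make_record(0.04, 7, ["SpillThread.run"]),
    make_record(0.00, 1, ["Child.main"]),
    {"timestamp": 0.0, "type": "cpu", "target": ("node1", "cpu0"), "value": 0.4},
]

TRACES = build_thread_traces(RECORDS)
MAP_THREAD_LABELS = [vertex_of(r) for r in TRACES[("node1", "task_m_0", 1)]]
```

In this toy input, `MAP_THREAD_LABELS` evaluates to `[None, "map", "map"]`: the first sample of thread 1 precedes `MapTask.run`, and the later two fall inside the map vertex.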
The program location (used in the dataflow specification) can therefore be defined as a range of consecutive sampling records in one thread trace (or, in the case of thread group, multiple ranges each in a different thread). Each record range is identified by the starting and ending records, which are specified using their execution positions (i.e., partial stack traces). For instance, all the records between the first appearance and the last appearance of MapTask.run (or simply the MapTask.run method) constitute one instance of the map vertex. See Figure 8 for the detailed dataflow specification of our Hadoop cluster. Based on the Target and Timestamp fields of the two boundary records of corresponding program locations, the analysis engine then determines which machine each instance of a vertex runs on, when it starts and when it ends. Finally, it associates all the system and log file sampling records to each instance of the dataflow graph vertex (i.e., the processing stage), again using the Target and Timestamp information of the records. With the algorithm and dataflow specification described above, the analysis engine can easily reconstruct the
dataflow execution for the Hadoop job and associate different sampling records with the dataflow graph. In addition, the performance analysis algorithm is itself implemented as a Hadoop application, which processes the sampling records for each JVM simultaneously. Consequently, we can generate various analysis reports that provide valuable insights into the Hadoop runtime behaviors (presented in the same dataflow model used in developing and running the Hadoop job). For instance, a timeline-based execution chart for all task instances, similar to the pipeline space-time diagram [18], can be generated so that users can get a complete picture of the dataflow-based execution process of the Hadoop job. It is also possible to generate the hotspot breakdown (e.g., disk I/O vs. network transfer vs. computations) for each vertex in the dataflow, so that users can identify the possible bottlenecks in the Hadoop cluster. We show some analysis reports and how they are used to help our Hadoop performance analysis and tuning in section 7.

6. Evaluations

In this section, we experimentally evaluate the runtime overheads and scalability of HiTune, using three benchmarks (namely, Sort, WordCount and Nutch indexing) in the HiBench Hadoop benchmark suite [17], as shown in Table 1. The Hadoop cluster used in our experiment consists of one master (running JobTracker and NameNode) and up to 20 slaves (running TaskTracker and DataNode); the detailed server configurations are shown in Table 2. Every server has two GbE NICs, each of which is connected to a different gigabit Ethernet switch, forming two different networks; one network is used for the Hadoop jobs, and the other is used for administration and monitoring tasks (e.g., the Chukwa aggregation engine in HiTune).

Table 1. Hadoop benchmarks
Benchmark | Input Data
Sort | 60GB generated by RandomWriter example
WordCount | 60GB generated by RandomTextWriter example
Nutch indexing | 19GB data generated by crawling an in-house Wikipedia mirror

Table 2. Server configurations
Processor | Dual-socket quad-core Intel Xeon processor
Disk | 4 SATA 7200RPM HDDs
Memory | 24 GB ECC DRAM
Network | 2 Gigabit Ethernet NICs
OS | Redhat Enterprise Linux 5.4

We evaluate the runtime overheads of HiTune by measuring the Hadoop performance (speed and throughput) as well as the system resource (e.g., CPU and memory) utilizations of the Hadoop cluster. The speed is measured using the job running time, and the throughput is defined as the number of tasks completed per minute when the Hadoop cluster is at full utilization (by continuously submitting multiple jobs to the cluster). In addition, we evaluate the scalability of HiTune by analyzing whether, as the number of nodes in the Hadoop cluster grows, the runtime overheads increase and whether it becomes more complex for HiTune to conduct the dataflow-based performance analysis.

6.1 Runtime Overheads

As mentioned in section 5.1, the tracker (especially the task execution sampler) is the major source of runtime overheads in HiTune. This is because the task execution sampler needs to instrument the Hadoop tasks running on each node, while the aggregation and analysis of sampling data are performed on a separate monitoring cluster in an offline fashion. We first compare the instrumented Hadoop performance (measured when the tracker is running) and the baseline performance (measured when the tracker is completely turned off). In our experiment, when the tracker is running, the task execution sampler dumps the Java thread stack trace every 20 milliseconds, the system sampler reports the system statistics every 5 seconds, and the Hadoop cluster is configured to output its metrics to the log file every 10 seconds.
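To make the task execution samples concrete, the following minimal sketch takes one thread trace sampled at the 20 ms interval mentioned above and recovers a map vertex instance as the range between the first and last record whose stack contains MapTask.run (the rule described for the analysis engine). The `SampleRecord` layout, the `find_vertex_instance` helper, the machine name and the other frame names are hypothetical and chosen only for illustration; they are not HiTune's actual record format or API.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class SampleRecord:
    """One task execution sample (hypothetical layout, for illustration only)."""
    target: str          # machine the sample was taken on
    timestamp_ms: int    # wall clock time of the sample
    stack: List[str]     # partial stack trace, outermost frame first


def find_vertex_instance(trace: List[SampleRecord],
                         method: str) -> Optional[Tuple[str, int, int]]:
    """Return (target, start_ms, end_ms) for the record range between the
    first and last appearance of `method` in one thread trace.

    Assumes `trace` is already sorted by timestamp. Returns None when the
    method never appears in the trace.
    """
    hits = [i for i, rec in enumerate(trace) if method in rec.stack]
    if not hits:
        return None
    first, last = trace[hits[0]], trace[hits[-1]]
    return first.target, first.timestamp_ms, last.timestamp_ms


# A toy thread trace sampled every 20 ms (frame names are illustrative).
trace = [
    SampleRecord("slave-1", 0, ["Child.main"]),
    SampleRecord("slave-1", 20, ["Child.main", "MapTask.run"]),
    SampleRecord("slave-1", 40, ["Child.main", "MapTask.run", "MapRunner.run"]),
    SampleRecord("slave-1", 60, ["Child.main", "MapTask.run"]),
    SampleRecord("slave-1", 80, ["Child.main"]),
]

map_instance = find_vertex_instance(trace, "MapTask.run")
print(map_instance)  # ('slave-1', 20, 60)
```

Because the boundaries are taken from sampled records, the recovered start and end times are only accurate to within one sampling interval.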
Figures 9 and 10 show the ratio of the instrumented performance over the baseline performance for job running time (lower is better) and throughput (higher is better) respectively. It is clear that the overhead of running the tracker is very low in terms of performance: the instrumented job running time is longer than the baseline by less than 2%, and the instrumented throughput is lower than the baseline by less than 2%. In addition, we also compare the instrumented system resource utilizations (measured when the tracker is running) and the baseline utilizations (measured when only the system sampler is running, which is needed to report the system resource utilizations periodically). Since the sampling records are aggregated using a separate network, we only present the CPU and memory utilization results of the Hadoop cluster in this paper. Figures 11 and 12 show the ratio of the instrumented CPU and memory utilizations over the baseline utilizations respectively. It is clear that the overhead of running the tracker is also very low in terms of resource utilizations: both the instrumented CPU utilization and the instrumented memory utilization are higher than the baseline by less than 2%.
Figure 9. Ratio of instrumented job running time over baseline job running time (y-axis: ratio of completion time; x-axis: workloads Sort, WordCount and Nutch indexing; series: 5-, 10- and 20-slave clusters; data labels range from 100% to 102%)

Figure 10. Ratio of instrumented cluster throughput over baseline cluster throughput (y-axis: ratio of throughput; same workloads and cluster sizes; data labels range from 98% to 100%)

Figure 11. Ratio of instrumented cluster CPU utilization over baseline cluster CPU utilization (y-axis: ratio of CPU utilization; same workloads and cluster sizes; data labels range from 100% to 102%)

Figure 12. Ratio of instrumented cluster memory utilization over baseline cluster memory utilization (y-axis: ratio of memory utilization; same workloads and cluster sizes; data labels range from 100% to 102%)

In summary, HiTune is a very lightweight performance analyzer for Hadoop, with very low (less than 2%) runtime overheads in terms of speed, throughput and system resource utilizations. In addition, HiTune scales very well in terms of the runtime overheads, because it instruments each node in the cluster independently and consequently the runtime overheads remain the same even when there are more nodes in the cluster (as confirmed by the experimental results).

6.2 Complexity of Performance Analysis

Since the analysis engine needs to reconstruct the dataflow execution of a Hadoop job and associate the sampling records to each vertex instance in the dataflow, the complexity of analysis can be evaluated by comparing the sizes of sampling data and the numbers of vertex instances between different sized clusters.
Figure 13 shows the sampling data sizes for the 5-, 10- and 20-slave clusters. It is clear that the sampling data sizes remain about the same (or increase very slowly) for different sized clusters (e.g., the sampling data size increases by less than 18% even when the cluster size is increased by 4x). Intuitively, since HiTune samples each instance of the processing stages at fixed time intervals, the sampling data size is proportional to the sum of the running time of all vertex instances. As long as the underlying Hadoop framework scales well with the cluster sizes, the sum of the vertex instance running time will remain about the same (or increase very slowly), and so will the sampling data size. In practice, even with very large (1000s of nodes) clusters, a MapReduce job usually runs on about 100s of worker machines [19], and the Hadoop framework scales reasonably well with that number (100s) of machines.

Figure 13. Comparison of sampling data sizes (y-axis: sampling data size in GB; x-axis: workloads Sort, WordCount and Nutch indexing; series: 5-, 10- and 20-slave clusters)

In addition, assume M and R are the total numbers of the map and reduce tasks of a Hadoop job respectively. Since in the Hadoop dataflow model (as shown in Figure 8) each map task contains two stages and each reduce task contains four stages, the total number of vertex instances can be computed as 2*M+4*R. In practice, the number of map tasks is on average about 26x that of reduce tasks for each MapReduce job [20], and therefore the vertex instance count is about 2.15*M. Since the number of map tasks (M) of a Hadoop job is typically determined by its input data size (e.g., by the number of HDFS file blocks), the number of vertex instances will also remain about the same for different sized clusters in practice.
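The 2.15*M figure follows from substituting R = M/26 into 2*M+4*R, which gives (2 + 4/26)*M. The short sketch below checks that arithmetic; the job with 2600 map tasks and 100 reduce tasks is a made-up example that matches the 26:1 ratio, not a measurement from the paper.

```python
def vertex_instance_count(num_maps: int, num_reduces: int) -> int:
    """Vertex instances in the Hadoop dataflow model:
    two stages per map task and four stages per reduce task."""
    return 2 * num_maps + 4 * num_reduces


# Average ratio of map tasks to reduce tasks cited in the text.
MAP_TO_REDUCE_RATIO = 26

# Vertex instances per map task when R = M / 26.
per_map = 2 + 4 / MAP_TO_REDUCE_RATIO
print(round(per_map, 2))  # 2.15

# Hypothetical job with the same 26:1 ratio.
print(vertex_instance_count(2600, 100))  # 5600
```

Note that the count depends only on M and R, which is why it does not grow with the number of nodes in the cluster.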
In summary, the complexity for HiTune to conduct the dataflow-based performance analysis will remain about the same even when there are more nodes in the cluster (or, more precisely, the dataflow-based performance analysis in HiTune scales as well as Hadoop does with the cluster sizes), because the sampling data sizes and the vertex instance counts will remain about the same even when there are more nodes in the cluster.

In addition, we have implemented the analysis engine as a Hadoop application, so that the dataflow-based performance analysis can be parallelized using another monitoring Hadoop cluster. For instance, to process the 100GB sampling data generated when running TeraSort in our cluster, it takes about 16 minutes on a single-slave monitoring cluster, and about 5 minutes on a 4-slave monitoring cluster.

7. Experience

HiTune has been used intensively inside Intel for Hadoop performance analysis and tuning (e.g., see [17]). In this section, we share our experience on how we use HiTune to efficiently conduct performance analysis and tuning for Hadoop, demonstrating the benefits of dataflow-based analysis and the limitations of existing approaches (e.g., system statistics, Hadoop logs and metrics, and traditional profiling).

7.1 Tuning Hadoop Framework

One performance issue we encountered is extremely low system utilizations when sorting many small files (KB-sized files) using Hadoop: system statistics collected by the cluster monitoring tools (e.g., Ganglia [21]) show that the CPU, disk I/O and network bandwidth utilizations are all below 5%. That is, there are no obvious bottlenecks or hotspots in our cluster; consequently, traditional tools (e.g., system monitors and program profilers) fail to reveal the root cause. To address this performance issue, we used HiTune to reconstruct the dataflow execution process of this Hadoop job, as illustrated in Figure 14.
The x-axis represents the elapsed wall clock time, and each horizontal line in the chart represents a map or reduce task. Within each line, bootstrap represents the period before the task is launched, idle represents the period after the task is complete, map represents the period when the map task is running, and shuffle, sort and reduce represent the periods when (the instances of) the corresponding stages are running. As is obvious in the dataflow execution, there is little parallelism between the map tasks, or between the map tasks and the reduce tasks, in this job. Clearly, the task scheduler in Hadoop (Fair Scheduler [22] is used in our cluster) fails to launch all the tasks as soon as possible in this case. Once the problem was isolated, we quickly identified the root cause: by default, the Fair Scheduler in Hadoop only assigns one task to a slave at each heartbeat (i.e., the periodic keepalive message between the master and slaves), and it schedules map tasks first whenever possible. In our job, each map task processes a small file and completes very fast (faster than the heartbeat interval); consequently, each slave runs the map tasks sequentially, and the reduce tasks are scheduled only after all the map tasks are done. To fix this performance issue, we upgraded the cluster to Fair Scheduler 2.0 [23][24], which by default schedules multiple tasks (including reduce tasks) in each heartbeat; consequently the job runs about 6x faster (as shown in Figure 15) and the cluster utilization is greatly improved.

Figure 14. Dataflow execution for sorting many small files with Hadoop

Figure 15. Dataflow execution for sorting many small files with Fair Scheduler 2.0

7.2 Analyzing Application Hotspots

In the previous section, we demonstrated that the high level dataflow execution process of a Hadoop job helps users to understand the dynamic task scheduling and assignment of the Hadoop framework. In this section,
we show that the dataflow execution process helps users to identify the data shuffle gaps between map and reduce, and that relating the low level performance activities to the high level dataflow model allows users to conduct fine-grained, dataflow-based hotspot breakdown (so as to understand the hotspots of the massively distributed applications).

Figure 16 shows the dataflow execution, as well as the timeline-based CPU, disk and network bandwidth utilizations, of TeraSort [16][17] (sorting 10 billion 100-byte records). It has high CPU utilizations during the map tasks, because the map output data are compressed (using the default codec in Hadoop) to reduce the disk and network I/O. (Compressing the input or output of TeraSort is not allowed in the benchmark specs.)

Figure 16. TeraSort (using default compression codec)

However, the dataflow execution process of TeraSort shows that there is a large gap (about 15% of the total job running time) between the end of map tasks and the end of shuffle phases. According to the communication patterns specified in the Hadoop dataflow model (see Figure 2 and Figure 8), shuffle phases need to fetch the output from all the map tasks in the copier stages, and ideally should complete as soon as all the map tasks complete. Unfortunately, traditional tools or Hadoop logs fail to reveal the root cause of the large gap: during that period, none of the CPU, disk I/O and network bandwidth is bottlenecked, and the Shuffle Fetchers Busy Percent metric reported by the Hadoop framework is always 100%, yet increasing the number of copier threads does not improve the utilization or performance.

To address this issue, we used HiTune to conduct hotspot breakdown of the shuffle phases, which is possible because HiTune has associated all the low level sampling records with the high level dataflow execution of the Hadoop job. The dataflow-based hotspot breakdown (see Figure 17) shows that, in the shuffle stages, the copier threads are actually idle 80% of the time, waiting (in the ShuffleRamManager.reserve method) for the occupied memory buffers to be freed by the memory merge threads. (The idle vs. busy breakdown and the method hotspot are determined using the Java thread state and stack trace in the task execution sampling records respectively.) On the other hand, most of the busy time of the memory merge thread is due to the compression, which is the root cause of the large gap between map and shuffle. To fix this issue and reduce the compression hotspots, we changed the compression codec to LZO [25], which improves the TeraSort performance by more than 2x and completely eliminates the gap (see Figure 18).

Figure 17. Copier and Memory Merge threads breakdown (using default compression codec)

Figure 18. TeraSort (using LZO compression)

7.3 Diagnosing Hardware Problems

By examining Figure 18 in more detail, we also found that the reduce stage running time is significantly skewed among different reduce tasks: a small number of stages are much slower than the others, as shown in Figure 19.

Figure 19. Reduce tasks of TeraSort (using LZO compression)
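One way to localize this kind of skew is to group the task execution samples of the reduce stages by the TaskTracker they ran on and compare normalized busy and idle time. The sketch below is a rough illustration under several assumptions that are not taken from HiTune itself: samples are `(tasktracker, stage_id, thread_state)` tuples taken at a fixed interval (so sample counts stand in for time), the `breakdown_by_tracker` helper is hypothetical, and the set of Java thread states counted as idle is our own choice.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Java thread states treated as idle; this choice is an assumption of the sketch.
IDLE_STATES = {"WAITING", "TIMED_WAITING", "BLOCKED"}


def breakdown_by_tracker(
    samples: Iterable[Tuple[str, str, str]]
) -> Dict[str, Dict[str, float]]:
    """Average busy and idle samples per stage for each TaskTracker,
    normalized by the average stage length over all stages."""
    busy = defaultdict(int)   # (tracker, stage) -> busy sample count
    idle = defaultdict(int)   # (tracker, stage) -> idle sample count
    for tracker, stage, state in samples:
        if state in IDLE_STATES:
            idle[(tracker, stage)] += 1
        else:
            busy[(tracker, stage)] += 1

    stages = set(busy) | set(idle)
    if not stages:
        return {}
    overall_avg = sum(busy[k] + idle[k] for k in stages) / len(stages)

    stages_by_tracker = defaultdict(list)
    for key in stages:
        stages_by_tracker[key[0]].append(key)

    report = {}
    for tracker, keys in stages_by_tracker.items():
        n = len(keys)
        report[tracker] = {
            "busy": sum(busy[k] for k in keys) / n / overall_avg,
            "idle": sum(idle[k] for k in keys) / n / overall_avg,
        }
    return report


# Toy data: two reduce stages with equal busy time but different idle time.
samples = (
    [("tt-1", "r1", "RUNNABLE")] * 8 + [("tt-1", "r1", "WAITING")] * 2 +
    [("tt-2", "r2", "RUNNABLE")] * 8 + [("tt-2", "r2", "WAITING")] * 7
)
report = breakdown_by_tracker(samples)
for tracker in sorted(report):
    print(tracker, report[tracker])
```

In the toy data both TaskTrackers show the same normalized busy time, while the second one has a much larger idle share; that pattern points at something the stage is waiting on rather than at the computation itself.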
Based on the association of the low level sampling records and the high level dataflow model, we use HiTune to generate the normalized average running time and the idle vs. busy breakdown of the reduce stages (grouped by the TaskTrackers that the stages run on) in Figure 20. It is clear that reduce stages running on the 3rd and 7th TaskTrackers are much slower (about 20% and 14% slower than the average respectively). In addition, while all the reduce stages have about the same busy time, the reduce stages running on these two TaskTrackers have more idle time, waiting in the DFSOutputStream.writeChunk method (i.e., writing data to HDFS). Since the data replication factor in TeraSort is set to 1 (as required by the benchmark specs), the HDFS write operations in the reduce stage only write to the local disks. By examining the average write bandwidth of the disks on these two TaskTrackers, we finally identified the root cause of this problem: there is one disk on each of these two nodes that is much slower than other disks in the cluster (about 44% and 30% slower than the average respectively), which was later confirmed to have bad sectors through a very expensive fsck process.

Figure 20. Normalized average running time and busy vs. idle breakdown of reduce stages (y-axis: normalized reduce stage time; x-axis: TaskTrackers; series: Busy, Idle and Average Running Time)

7.4 Extending HiTune to Other Systems

Since the initial release of HiTune inside Intel, it has been extended by the users in different ways to meet their requirements. For instance, new samplers have been added so that processor microarchitecture events and power state behaviors of Hadoop jobs can be analyzed using the dataflow model. In addition, HiTune has also been applied to Hive (an open source data warehouse built on top of Hadoop), by extending the original Hadoop dataflow model to include additional phases and stages, as illustrated in Figure 21.
The map stage is divided into 5 smaller stages, namely Stage Init, Init, Active, Close and Stage Close; in addition, the reduce stage is divided into 4 smaller stages, namely Init, Active, Close and Stage Close. This is accomplished by providing to the analysis engine a new specification file that describes the dataflow model and resource mappings in Hive.

Figure 21. Extended dataflow model for aggregation query

Figure 22. Dataflow execution of the query

Figure 23. Map stage breakdown (Stage Init 9%, Init 4%, Active 68%, Close 0%, Stage Close 19%; Active is split into Input 15%, Operations 32% and Output 21%)

Figure 24. Reduce stage breakdown (Init 2.1%, Active 77.6%, Close 0.2%, Stage Close 20.2%; Active is split into Input 42.4%, Operations 32.0% and Output 3.2%)

Figure 22 shows the dataflow execution process for the aggregation query in Hive performance benchmarks [26][9]. In addition, Figures 23 and 24 show the dataflow-based breakdown of the map/reduce stages for the aggregation query (both the map and reduce active stages are further broken into 3 portions: Input, Operations and Output, based on the Java methods). As shown in Figures 23 and 24, the query spends only about 32% of its time performing the Operations; on the other hand, it spends about 68% of its time on the data input/output, as well as the initialization and cleanup of the Hadoop/Hive frameworks. Therefore, to optimize this query, it is more critical to reduce the size of intermediate results,
to improve the efficiency of data input/output, and to reduce the overheads of the Hadoop/Hive frameworks.

8. Related Work

There are several distributed system tracing tools (e.g., Magpie [27], X-Trace [28] and Dapper [29]), which associate and propagate the tracing metadata as the request passes through the system. With this type of path information, the tracing tools can easily construct an event graph capturing the causality of events across the system, which can then be queried for various analyses [30]. Unfortunately, these tools would require changes not only to source code but also to message schemas, and are usually restricted to a small portion of the system in practice (e.g., Dapper only instruments the threading, callback and RPC libraries in Google [29]). In contrast, our approach uses binary instrumentations to sample the tasks in a distributed and independent fashion at each node, and reconstructs the dataflow execution process of the application based on a priori knowledge of Big Data Cloud. Consequently, it requires no modifications to the system, and therefore can be applied more extensively to obtain richer information (e.g., the hottest function) than these tracing tools.

Our distributed instrumentations are similar to Google-Wide Profiling (GWP) [31], which samples across machines in multiple data centers for production applications. In addition, the current Hadoop framework can profile specific map/reduce tasks using traditional Java profilers (e.g., HPROF [32]), which however have very high overheads and are usually applied to a small number (2 or 3) of tasks. More importantly, both GWP and the existing profiling support in Hadoop focus on providing traditional performance analysis to the distributed systems (e.g., by allowing the users to directly query the low level sampling data).
In contrast, our key technical challenge is to reconstruct the high level dataflow execution of the application based on the low level sampling data, so that users can work on the same dataflow model used in developing and running their Big Data applications.

In the industry, traditional cluster monitoring tools (e.g., Ganglia [21], Nagios [33] and Cacti [34]) have been widely used to collect system statistics (e.g., CPU load) from all the machines in the cluster; in addition, several large-scale log collection systems (e.g., Chukwa [12], Scribe [13] and Flume [14]) have been recently developed to aggregate log data from a large number of servers. All of these tools focus on providing a distributed framework to collect statistics and logs, and are orthogonal to our work (e.g., we have actually used Chukwa as the aggregation engine in the current HiTune implementation).

Existing diagnostic tools for Hadoop and Dryad (e.g., Vaidya [35], Kahuna [36] and Artemis [37]) focus on mining the system logs to detect performance problems. For instance, it is possible to construct the task execution chart (as shown in section 7.1) using the Hadoop job history files. Compared to these tools, our approach (based on distributed instrumentation and dataflow-driven performance analysis) has many advantages. First, it can provide much more insight, such as dataflow-based hotspot breakdown (see sections 7.2 and 7.3), into the cloud runtime behaviors. More importantly, performance problems of massively distributed systems are very complex, and are often due to issues that the developers are completely unaware of (and therefore are not exposed by the existing code or logs). For instance, in section 7.2, the Hadoop framework shows that the shuffle fetchers are always busy, while the detailed breakdown provided by HiTune reveals that copiers are actually idle most of the time.
Finally, our approach is much more general, and consequently can be easily extended to support other systems such as Hive (see section 7.4).

9. Conclusions

In this paper, we propose a general approach of performance analysis for Big Data Cloud, based on distributed instrumentations and dataflow-driven performance analysis. Based on this approach, we have implemented HiTune (a Hadoop performance analyzer), which provides valuable insights into the Hadoop runtime behaviors with very low overhead, no source code changes, very good scalability and extensibility. We also report our experience on how to use HiTune to efficiently conduct performance analysis and tuning for Hadoop, demonstrating the benefits of dataflow-based analysis and the limitations of existing approaches.

References

[1] D. Borthakur, N. Jain, J. S. Sarma, R. Murthy, H. Liu. Data warehousing and analytics infrastructure at Facebook. The 36th ACM SIGMOD International Conference on Management of Data.
[2] J. Dean, S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. The 6th Symposium on Operating Systems Design and Implementation.
[3] Hadoop.
[4] M. Isard, M. Budiu, Y. Yu, A. Birrell, D. Fetterly. Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks. The 2nd European Conference on Computer Systems.
[5] C. Olston, B. Reed, U. Srivastava, R. Kumar, A. Tomkins. Pig latin: a not-so-foreign language for data processing. The 34th ACM SIGMOD International Conference on Management of Data.
[6] A. Thusoo, R. Murthy, J. S. Sarma, Z. Shao, N. Jain, P. Chakka, S. Anthony, H. Liu, N. Zhang. Hive - A Petabyte Scale Data Warehousing Using Hadoop. The 26th IEEE International Conference on Data Engineering.
[7] Dryad. /Dryad/
[8] Hadoop and HBase at RIPE NCC. cloudera.com/blog/2010/11/hadoop-and-hbase-atripe-ncc/
[9] A. Pavlo, E. Paulson, A. Rasin, D. J. Abadi, D. J. DeWitt, S. Madden, M. Stonebraker. A comparison of approaches to large-scale data analysis. The 35th SIGMOD International Conference on Management of Data.
[10] S. L. Graham, P. B. Kessler, M. K. McKusick. Gprof: A call graph execution profiler. The 1982 ACM SIGPLAN Symposium on Compiler Construction.
[11] Intel VTune Performance Analyzer. intel.com/en-us/intel-vtune/
[12] A. Rabkin, R. H. Katz. Chukwa: A system for reliable large-scale log collection. Technical Report UCB/EECS, UC Berkeley.
[13] Scribe.
[14] Flume.
[15] Java Instrumentation. javase/6/docs/api/java/lang/instrument/package-summary.html
[16] Sort benchmark.
[17] S. Huang, J. Huang, J. Dai, T. Xie, B. Huang. The HiBench Benchmark suite: Characterization of the MapReduce-Based Data Analysis. IEEE 26th International Conference on Data Engineering Workshops.
[18] J. L. Hennessy, D. A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann, 4th edition.
[19] Jeff Dean. Designs, Lessons and Advice from Building Large Distributed Systems. The 3rd ACM SIGOPS International Workshop on Large Scale Distributed Systems and Middleware.
[20] Jeffrey Dean. Experiences with MapReduce, an abstraction for large-scale computation. The 15th International Conference on Parallel Architectures and Compilation Techniques.
[21] Ganglia.
[22] A fair sharing job scheduler. org/jira/browse/hadoop-3746
[23] M. Zaharia, D. Borthakur, J. S. Sarma, K. Elmeleegy, S. Shenker, I. Stoica. Job Scheduling for Multi-User MapReduce Clusters. Technical Report UCB/EECS, UC Berkeley.
[24] Hadoop Fair Scheduler Design Document. /trunk/src/contrib/fairscheduler/designdoc/fair_scheduler_design_doc.pdf
[25] Hadoop LZO patch. hadoop-lzo
[26] Hive performance benchmark. org/jira/browse/hive-396
[27] P. Barham, R. Isaacs, R. Mortier, D. Narayanan. Magpie: online modelling and performance-aware systems. The 9th Conference on Hot Topics in Operating Systems.
[28] R. Fonseca, G. Porter, R. H. Katz, S. Shenker, I. Stoica. X-Trace: A pervasive network tracing framework. The 4th USENIX Symposium on Networked Systems Design & Implementation.
[29] B. H. Sigelman, L. A. Barroso, M. Burrows, P. Stephenson, M. Plakal, D. Beaver, S. Jaspan, C. Shanbhag. Dapper, a Large-Scale Distributed Systems Tracing Infrastructure. Google Research.
[30] B. Chun, K. Chen, G. Lee, R. Katz, S. Shenker. D3: Declarative Distributed Debugging. Technical Report UCB/EECS, UC Berkeley, 2008.
[31] G. Ren, E. Tune, T. Moseley, Y. Shi, S. Rus, R. Hundt. Google-Wide Profiling: A Continuous Profiling Infrastructure for Data Centers. IEEE Micro (2010).
[32] HPROF: a Heap/CPU Profiling Tool in J2SE gramming/hprof.html
[33] Nagios.
[34] Cacti.
[35] V. Bhat, S. Gogate, M. Bhandarkar. Hadoop Vaidya. Hadoop World.
[36] J. Tan, X. Pan, S. Kavulya, R. Gandhi, P. Narasimhan. Kahuna: Problem Diagnosis for MapReduce-Based Cloud Computing Environments. IEEE/IFIP Network Operations and Management Symposium (NOMS).
[37] G. Cretu, M. Budiu, M. Goldszmidt. Hunting for problems with Artemis. First USENIX Conference on Analysis of System Logs, 2008.
MapReduce and Hadoop Aaron Birkland Cornell Center for Advanced Computing January 2012 Motivation Simple programming model for Big Data Distributed, parallel but hides this Established success at petabyte
Windows Server Performance Monitoring
Spot server problems before they are noticed The system s really slow today! How often have you heard that? Finding the solution isn t so easy. The obvious questions to ask are why is it running slowly
MAGENTO HOSTING Progressive Server Performance Improvements
MAGENTO HOSTING Progressive Server Performance Improvements Simple Helix, LLC 4092 Memorial Parkway Ste 202 Huntsville, AL 35802 [email protected] 1.866.963.0424 www.simplehelix.com 2 Table of Contents
GraySort on Apache Spark by Databricks
GraySort on Apache Spark by Databricks Reynold Xin, Parviz Deyhim, Ali Ghodsi, Xiangrui Meng, Matei Zaharia Databricks Inc. Apache Spark Sorting in Spark Overview Sorting Within a Partition Range Partitioner
Architecting for the next generation of Big Data Hortonworks HDP 2.0 on Red Hat Enterprise Linux 6 with OpenJDK 7
Architecting for the next generation of Big Data Hortonworks HDP 2.0 on Red Hat Enterprise Linux 6 with OpenJDK 7 Yan Fisher Senior Principal Product Marketing Manager, Red Hat Rohit Bakhshi Product Manager,
Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware
Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Created by Doug Cutting and Mike Carafella in 2005. Cutting named the program after
GraySort and MinuteSort at Yahoo on Hadoop 0.23
GraySort and at Yahoo on Hadoop.23 Thomas Graves Yahoo! May, 213 The Apache Hadoop[1] software library is an open source framework that allows for the distributed processing of large data sets across clusters
Open source Google-style large scale data analysis with Hadoop
Open source Google-style large scale data analysis with Hadoop Ioannis Konstantinou Email: [email protected] Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory School of Electrical
Deploying Hadoop with Manager
Deploying Hadoop with Manager SUSE Big Data Made Easier Peter Linnell / Sales Engineer [email protected] Alejandro Bonilla / Sales Engineer [email protected] 2 Hadoop Core Components 3 Typical Hadoop Distribution
A Study on Workload Imbalance Issues in Data Intensive Distributed Computing
A Study on Workload Imbalance Issues in Data Intensive Distributed Computing Sven Groot 1, Kazuo Goda 1, and Masaru Kitsuregawa 1 University of Tokyo, 4-6-1 Komaba, Meguro-ku, Tokyo 153-8505, Japan Abstract.
Petabyte Scale Data at Facebook. Dhruba Borthakur, Engineer at Facebook, SIGMOD, New York, June 2013
Petabyte Scale Data at Facebook Dhruba Borthakur, Engineer at Facebook, SIGMOD, New York, June 2013 Agenda 1 Types of Data 2 Data Model and API for Facebook Graph Data 3 SLTP (Semi-OLTP) and Analytics
Parallel Processing of cluster by Map Reduce
Parallel Processing of cluster by Map Reduce Abstract Madhavi Vaidya, Department of Computer Science Vivekanand College, Chembur, Mumbai [email protected] MapReduce is a parallel programming model
Networking in the Hadoop Cluster
Hadoop and other distributed systems are increasingly the solution of choice for next generation data volumes. A high capacity, any to any, easily manageable networking layer is critical for peak Hadoop
Big Fast Data Hadoop acceleration with Flash. June 2013
Big Fast Data Hadoop acceleration with Flash June 2013 Agenda The Big Data Problem What is Hadoop Hadoop and Flash The Nytro Solution Test Results The Big Data Problem Big Data Output Facebook Traditional
Unstructured Data Accelerator (UDA) Author: Motti Beck, Mellanox Technologies Date: March 27, 2012
Unstructured Data Accelerator (UDA) Author: Motti Beck, Mellanox Technologies Date: March 27, 2012 1 Market Trends Big Data Growing technology deployments are creating an exponential increase in the volume
A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM
A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM Sneha D.Borkar 1, Prof.Chaitali S.Surtakar 2 Student of B.E., Information Technology, J.D.I.E.T, [email protected] Assistant Professor, Information
Hadoop: Embracing future hardware
Hadoop: Embracing future hardware Suresh Srinivas @suresh_m_s Page 1 About Me Architect & Founder at Hortonworks Long time Apache Hadoop committer and PMC member Designed and developed many key Hadoop
Big Data With Hadoop
With Saurabh Singh [email protected] The Ohio State University February 11, 2016 Overview 1 2 3 Requirements Ecosystem Resilient Distributed Datasets (RDDs) Example Code vs Mapreduce 4 5 Source: [Tutorials
Hadoop and its Usage at Facebook. Dhruba Borthakur [email protected], June 22 rd, 2009
Hadoop and its Usage at Facebook Dhruba Borthakur [email protected], June 22 rd, 2009 Who Am I? Hadoop Developer Core contributor since Hadoop s infancy Focussed on Hadoop Distributed File System Facebook
Hadoop at Yahoo! Owen O Malley Yahoo!, Grid Team [email protected]
Hadoop at Yahoo! Owen O Malley Yahoo!, Grid Team [email protected] Who Am I? Yahoo! Architect on Hadoop Map/Reduce Design, review, and implement features in Hadoop Working on Hadoop full time since Feb
Benchmarking Hadoop & HBase on Violin
Technical White Paper Report Technical Report Benchmarking Hadoop & HBase on Violin Harnessing Big Data Analytics at the Speed of Memory Version 1.0 Abstract The purpose of benchmarking is to show advantages
Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee [email protected] [email protected]
Hadoop Distributed File System Dhruba Borthakur Apache Hadoop Project Management Committee [email protected] [email protected] Hadoop, Why? Need to process huge datasets on large clusters of computers
Accelerating and Simplifying Apache
Accelerating and Simplifying Apache Hadoop with Panasas ActiveStor White paper NOvember 2012 1.888.PANASAS www.panasas.com Executive Overview The technology requirements for big data vary significantly
HADOOP PERFORMANCE TUNING
PERFORMANCE TUNING Abstract This paper explains tuning of Hadoop configuration parameters which directly affects Map-Reduce job performance under various conditions, to achieve maximum performance. The
Hadoop IST 734 SS CHUNG
Hadoop IST 734 SS CHUNG Introduction What is Big Data?? Bulk Amount Unstructured Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per day) If a regular machine need to
DATA MINING WITH HADOOP AND HIVE Introduction to Architecture
DATA MINING WITH HADOOP AND HIVE Introduction to Architecture Dr. Wlodek Zadrozny (Most slides come from Prof. Akella s class in 2014) 2015-2025. Reproduction or usage prohibited without permission of
Hadoop Ecosystem B Y R A H I M A.
Hadoop Ecosystem B Y R A H I M A. History of Hadoop Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open
Large scale processing using Hadoop. Ján Vaňo
Large scale processing using Hadoop Ján Vaňo What is Hadoop? Software platform that lets one easily write and run applications that process vast amounts of data Includes: MapReduce offline computing engine
CS54100: Database Systems
CS54100: Database Systems Cloud Databases: The Next Post- Relational World 18 April 2012 Prof. Chris Clifton Beyond RDBMS The Relational Model is too limiting! Simple data model doesn t capture semantics
Design and Evolution of the Apache Hadoop File System(HDFS)
Design and Evolution of the Apache Hadoop File System(HDFS) Dhruba Borthakur Engineer@Facebook Committer@Apache HDFS SDC, Sept 19 2011 Outline Introduction Yet another file-system, why? Goals of Hadoop
Hadoop and Map-Reduce. Swati Gore
Hadoop and Map-Reduce Swati Gore Contents Why Hadoop? Hadoop Overview Hadoop Architecture Working Description Fault Tolerance Limitations Why Map-Reduce not MPI Distributed sort Why Hadoop? Existing Data
Apache HBase. Crazy dances on the elephant back
Apache HBase Crazy dances on the elephant back Roman Nikitchenko, 16.10.2014 YARN 2 FIRST EVER DATA OS 10.000 nodes computer Recent technology changes are focused on higher scale. Better resource usage
Data Warehousing and Analytics Infrastructure at Facebook. Ashish Thusoo & Dhruba Borthakur athusoo,[email protected]
Data Warehousing and Analytics Infrastructure at Facebook Ashish Thusoo & Dhruba Borthakur athusoo,[email protected] Overview Challenges in a Fast Growing & Dynamic Environment Data Flow Architecture,
Duke University http://www.cs.duke.edu/starfish
Herodotos Herodotou, Harold Lim, Fei Dong, Shivnath Babu Duke University http://www.cs.duke.edu/starfish Practitioners of Big Data Analytics Google Yahoo! Facebook ebay Physicists Biologists Economists
MapReduce and Hadoop Distributed File System
MapReduce and Hadoop Distributed File System 1 B. RAMAMURTHY Contact: Dr. Bina Ramamurthy CSE Department University at Buffalo (SUNY) [email protected] http://www.cse.buffalo.edu/faculty/bina Partially
Cloud Storage. Parallels. Performance Benchmark Results. White Paper. www.parallels.com
Parallels Cloud Storage White Paper Performance Benchmark Results www.parallels.com Table of Contents Executive Summary... 3 Architecture Overview... 3 Key Features... 4 No Special Hardware Requirements...
Hadoop Architecture. Part 1
Hadoop Architecture Part 1 Node, Rack and Cluster: A node is simply a computer, typically non-enterprise, commodity hardware for nodes that contain data. Consider we have Node 1.Then we can add more nodes,
MapReduce and Hadoop Distributed File System V I J A Y R A O
MapReduce and Hadoop Distributed File System 1 V I J A Y R A O The Context: Big-data Man on the moon with 32KB (1969); my laptop had 2GB RAM (2009) Google collects 270PB data in a month (2007), 20000PB
GeoGrid Project and Experiences with Hadoop
GeoGrid Project and Experiences with Hadoop Gong Zhang and Ling Liu Distributed Data Intensive Systems Lab (DiSL) Center for Experimental Computer Systems Research (CERCS) Georgia Institute of Technology
NoSQL and Hadoop Technologies On Oracle Cloud
NoSQL and Hadoop Technologies On Oracle Cloud Vatika Sharma 1, Meenu Dave 2 1 M.Tech. Scholar, Department of CSE, Jagan Nath University, Jaipur, India 2 Assistant Professor, Department of CSE, Jagan Nath
Performance Management in Big Data Applica6ons. Michael Kopp, Technology Strategist @mikopp
Performance Management in Big Data Applica6ons Michael Kopp, Technology Strategist NoSQL: High Volume/Low Latency DBs Web Java Key Challenges 1) Even Distribu6on 2) Correct Schema and Access paperns 3)
CS2510 Computer Operating Systems
CS2510 Computer Operating Systems HADOOP Distributed File System Dr. Taieb Znati Computer Science Department University of Pittsburgh Outline HDF Design Issues HDFS Application Profile Block Abstraction
CS2510 Computer Operating Systems
CS2510 Computer Operating Systems HADOOP Distributed File System Dr. Taieb Znati Computer Science Department University of Pittsburgh Outline HDF Design Issues HDFS Application Profile Block Abstraction
Performance and Energy Efficiency of. Hadoop deployment models
Performance and Energy Efficiency of Hadoop deployment models Contents Review: What is MapReduce Review: What is Hadoop Hadoop Deployment Models Metrics Experiment Results Summary MapReduce Introduced
An Oracle White Paper November 2010. Leveraging Massively Parallel Processing in an Oracle Environment for Big Data Analytics
An Oracle White Paper November 2010 Leveraging Massively Parallel Processing in an Oracle Environment for Big Data Analytics 1 Introduction New applications such as web searches, recommendation engines,
Mobile Cloud Computing for Data-Intensive Applications
Mobile Cloud Computing for Data-Intensive Applications Senior Thesis Final Report Vincent Teo, [email protected] Advisor: Professor Priya Narasimhan, [email protected] Abstract The computational and storage
MapReduce with Apache Hadoop Analysing Big Data
MapReduce with Apache Hadoop Analysing Big Data April 2010 Gavin Heavyside [email protected] About Journey Dynamics Founded in 2006 to develop software technology to address the issues
Hadoop implementation of MapReduce computational model. Ján Vaňo
Hadoop implementation of MapReduce computational model Ján Vaňo What is MapReduce? A computational model published in a paper by Google in 2004 Based on distributed computation Complements Google s distributed
Performance Comparison of SQL based Big Data Analytics with Lustre and HDFS file systems
Performance Comparison of SQL based Big Data Analytics with Lustre and HDFS file systems Rekha Singhal and Gabriele Pacciucci * Other names and brands may be claimed as the property of others. Lustre File
A Novel Cloud Based Elastic Framework for Big Data Preprocessing
School of Systems Engineering A Novel Cloud Based Elastic Framework for Big Data Preprocessing Omer Dawelbeit and Rachel McCrindle October 21, 2014 University of Reading 2008 www.reading.ac.uk Overview
Introduction. Various user groups requiring Hadoop, each with its own diverse needs, include:
Introduction BIG DATA is a term that s been buzzing around a lot lately, and its use is a trend that s been increasing at a steady pace over the past few years. It s quite likely you ve also encountered
Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January 2015. Email: [email protected] Website: www.qburst.com
Lambda Architecture Near Real-Time Big Data Analytics Using Hadoop January 2015 Contents Overview... 3 Lambda Architecture: A Quick Introduction... 4 Batch Layer... 4 Serving Layer... 4 Speed Layer...
A Framework for Performance Analysis and Tuning in Hadoop Based Clusters
A Framework for Performance Analysis and Tuning in Hadoop Based Clusters Garvit Bansal Anshul Gupta Utkarsh Pyne LNMIIT, Jaipur, India Email: [garvit.bansal anshul.gupta utkarsh.pyne] @lnmiit.ac.in Manish
MASSIVE DATA PROCESSING (THE GOOGLE WAY ) 27/04/2015. Fundamentals of Distributed Systems. Inside Google circa 2015
7/04/05 Fundamentals of Distributed Systems CC5- PROCESAMIENTO MASIVO DE DATOS OTOÑO 05 Lecture 4: DFS & MapReduce I Aidan Hogan [email protected] Inside Google circa 997/98 MASSIVE DATA PROCESSING (THE
Weekly Report. Hadoop Introduction. submitted By Anurag Sharma. Department of Computer Science and Engineering. Indian Institute of Technology Bombay
Weekly Report Hadoop Introduction submitted By Anurag Sharma Department of Computer Science and Engineering Indian Institute of Technology Bombay Chapter 1 What is Hadoop? Apache Hadoop (High-availability
MapReduce Jeffrey Dean and Sanjay Ghemawat. Background context
MapReduce Jeffrey Dean and Sanjay Ghemawat Background context BIG DATA!! o Large-scale services generate huge volumes of data: logs, crawls, user databases, web site content, etc. o Very useful to be able
HADOOP ON ORACLE ZFS STORAGE A TECHNICAL OVERVIEW
HADOOP ON ORACLE ZFS STORAGE A TECHNICAL OVERVIEW 757 Maleta Lane, Suite 201 Castle Rock, CO 80108 Brett Weninger, Managing Director [email protected] Dave Smelker, Managing Principal [email protected]
Big Data Storage Options for Hadoop Sam Fineberg, HP Storage
Sam Fineberg, HP Storage SNIA Legal Notice The material contained in this tutorial is copyrighted by the SNIA unless otherwise noted. Member companies and individual members may use this material in presentations
Testing 3Vs (Volume, Variety and Velocity) of Big Data
Testing 3Vs (Volume, Variety and Velocity) of Big Data 1 A lot happens in the Digital World in 60 seconds 2 What is Big Data Big Data refers to data sets whose size is beyond the ability of commonly used
Introduction to DISC and Hadoop
Introduction to DISC and Hadoop Alice E. Fischer April 24, 2009 Alice E. Fischer DISC... 1/20 1 2 History Hadoop provides a three-layer paradigm Alice E. Fischer DISC... 2/20 Parallel Computing Past and
An Oracle White Paper June 2012. High Performance Connectors for Load and Access of Data from Hadoop to Oracle Database
An Oracle White Paper June 2012 High Performance Connectors for Load and Access of Data from Hadoop to Oracle Database Executive Overview... 1 Introduction... 1 Oracle Loader for Hadoop... 2 Oracle Direct
Extending Hadoop beyond MapReduce
Extending Hadoop beyond MapReduce Mahadev Konar Co-Founder @mahadevkonar (@hortonworks) Page 1 Bio Apache Hadoop since 2006 - committer and PMC member Developed and supported Map Reduce @Yahoo! - Core
THE HADOOP DISTRIBUTED FILE SYSTEM
THE HADOOP DISTRIBUTED FILE SYSTEM Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler Presented by Alexander Pokluda October 7, 2013 Outline Motivation and Overview of Hadoop Architecture,
Fault Tolerance in Hadoop for Work Migration
1 Fault Tolerance in Hadoop for Work Migration Shivaraman Janakiraman Indiana University Bloomington ABSTRACT Hadoop is a framework that runs applications on large clusters which are built on numerous
Outline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging
Outline High Performance Computing (HPC) Towards exascale computing: a brief history Challenges in the exascale era Big Data meets HPC Some facts about Big Data Technologies HPC and Big Data converging
The Comprehensive Performance Rating for Hadoop Clusters on Cloud Computing Platform
The Comprehensive Performance Rating for Hadoop Clusters on Cloud Computing Platform Fong-Hao Liu, Ya-Ruei Liou, Hsiang-Fu Lo, Ko-Chin Chang, and Wei-Tsong Lee Abstract Virtualization platform solutions
