Dell Apache Hadoop Performance Analysis
Dell Apache Hadoop Performance Analysis
Dell PowerEdge R720/R720XD Benchmarking Report

Nicholas Wakou, Hadoop/Big Data Benchmarking Engineer
Dell Revolutionary Cloud and Big Data Engineering

November 2013
Revisions

Date: November 2013. Description: Initial release.

THIS WHITE PAPER IS FOR INFORMATIONAL PURPOSES ONLY, AND MAY CONTAIN TYPOGRAPHICAL ERRORS AND TECHNICAL INACCURACIES. THE CONTENT IS PROVIDED AS IS, WITHOUT EXPRESS OR IMPLIED WARRANTIES OF ANY KIND.

© 2013 Dell Inc. All rights reserved. Reproduction of this material in any manner whatsoever without the express written permission of Dell Inc. is strictly forbidden. For more information, contact Dell. PRODUCT WARRANTIES APPLICABLE TO THE DELL PRODUCTS DESCRIBED IN THIS DOCUMENT MAY BE FOUND AT:

Performance of network reference architectures discussed in this document may vary with differing deployment conditions, network loads, and the like. Third-party products may be included in reference architectures for the convenience of the reader. Inclusion of such third-party products does not necessarily constitute Dell's recommendation of those products. Please consult your Dell representative for additional information.

Trademarks used in this text: Dell, the Dell logo, Dell Boomi, Dell Precision, OptiPlex, Latitude, PowerEdge, PowerVault, PowerConnect, OpenManage, EqualLogic, Compellent, KACE, FlexAddress, Force10 and Vostro are trademarks of Dell Inc. Other Dell trademarks may be used in this document. Cisco Nexus, Cisco MDS, Cisco NX-OS and other Cisco Catalyst marks are registered trademarks of Cisco Systems Inc. EMC VNX and EMC Unisphere are registered trademarks of EMC Corporation. Intel, Pentium, Xeon, Core and Celeron are registered trademarks of Intel Corporation in the U.S. and other countries. AMD is a registered trademark and AMD Opteron, AMD Phenom and AMD Sempron are trademarks of Advanced Micro Devices, Inc. Microsoft, Windows, Windows Server, Internet Explorer, MS-DOS, Windows Vista and Active Directory are either trademarks or registered trademarks of Microsoft Corporation in the United States and/or other countries. Red Hat and Red Hat Enterprise Linux are registered trademarks of Red Hat, Inc. in the United States and/or other countries. Novell and SUSE are registered trademarks of Novell Inc. in the United States and other countries. Oracle is a registered trademark of Oracle Corporation and/or its affiliates. Citrix, Xen, XenServer and XenMotion are either registered trademarks or trademarks of Citrix Systems, Inc. in the United States and/or other countries. VMware, Virtual SMP, vMotion, vCenter and vSphere are registered trademarks or trademarks of VMware, Inc. in the United States or other countries. IBM is a registered trademark of International Business Machines Corporation. Broadcom and NetXtreme are registered trademarks of Broadcom Corporation. QLogic is a registered trademark of QLogic Corporation. Other trademarks and trade names may be used in this document to refer to either the entities claiming the marks and/or names or their products and are the property of their respective owners. Dell disclaims proprietary interest in the marks and names of others.
Table of Contents

Revisions
Executive Summary
1 Introduction
1.1 Strategic Goals
1.2 Benchmark Testing
1.3 Functionality
1.4 Performance Characterization Tests
1.4.1 Characterization Workloads
1.5 Stress Tests
1.5.1 MapReduce Stress Testing
1.5.2 HDFS Stress Testing
1.6 Bottleneck Investigation
1.7 Performance Tuning
2 System under Test (SUT)
2.1 Hardware Configuration
2.2 Software Stack
2.3 Network Configuration
3 Run Time Environment
3.1 TestDFSIO
3.2 Terasort
3.3 K-Means
3.4 Flush Memory
4 Performance Results
4.1 MapReduce Performance
4.1.1 Performance Characterization
4.1.2 MapReduce Job: CPU Profile
4.1.3 MapReduce Networking
4.2 Bottleneck Investigation
4.2.1 Performance Tuning
4.2.2 MapReduce Resource Utilization
4.2.3 Analysis of a Teragen/Terasort Job
4.3 MapReduce Performance under a Complex Application: K-Means
4.3.1 K-Means Job Time
4.3.2 K-Means CPU Utilization
4.3.3 Throughput
4.3.4 MapReduce Resource Utilization under a Complex Application
4.3.5 Analysis of a K-Means Job
4.4 HDFS Performance
4.5 HDFS Write Performance
4.5.1 HDFS Read Performance
4.5.2 IO Bottleneck Investigation
4.5.3 HDFS Distributed Processing
4.5.4 Analysis of a TestDFSIO Job
5 Conclusion
6 Appendix A: Test Environment
6.1 Appendix A1: Test Suites (Apache Hadoop TestDFSIO, Intel HiBench Terasort, Intel HiBench K-Means)
6.2 Appendix A2: Test Methodology
6.3 Appendix A3: Installing Benchmarks
7 Appendix B: Hadoop Configuration Parameters
Executive Summary

The purpose of this document is to help you gain better insight when deploying and tuning Hadoop clusters by understanding the performance of the Reference Architecture (RA) using data points and workloads that are typical of a big data environment. This white paper discusses the HiBench 2.2 benchmarking tests that were conducted on the R720/R720XD Reference Architecture of the Dell Cloudera Apache Hadoop Solution, focusing on the performance of MapReduce and HDFS. The reference architecture discussed in this document improves performance by tuning OS parameters, the HDFS block size, and Hadoop configuration settings within CDH.
1 Introduction

This report is based on benchmarking tests that were carried out on the R720/R720xd Reference Architecture (RA) of the Dell Cloudera Apache Hadoop Solution. This performance review focused on the performance of the Hadoop core components, MapReduce and HDFS.

Note: The performance of ecosystem components (Hive, Impala, HBase, etc.) is not part of this review.

1.1 Strategic Goals

- Gain an in-depth understanding of the performance of the RA using data points and workloads that are typical of a big data environment.
- Obtain baseline performance data for the hardware platform.
- Assess the performance impact of some hardware components on the RA.
- Tune and optimize the performance of the cluster.
- Identify and clear bottlenecks.

1.2 Benchmark Testing

The benchmarking plan proposes three categories of benchmark tests, listed below. This benchmark review focuses on the engineering analysis tests that were used to obtain an in-depth understanding of the RA.

- Engineering Analysis
  o Functionality QA
  o Characterization
  o Stress testing
  o Bottleneck Investigation
  o Performance Tuning
- Business Recovery
  o Not performed in this iteration
- Comparative analysis for marketing purposes
  o Not performed in this iteration

1.3 Functionality

These benchmarking tests were performed after the QA tests in the release cycle. The hardware platforms and software stacks of the System under Test (SUT) were stable and ready for shipping.
1.4 Performance Characterization Tests

These are benchmark tests that were used to characterize the performance of the RA. Performance data from these tests is typically used for architectural designs and modifications, capacity analysis, and identification of bottlenecks. To get a good understanding of the cluster, these tests were modular, and the following hardware components were characterized (stressed):

- IO
- Network
- CPU

The goal of these tests was to:

- Record and analyze MapReduce and HDFS performance of the RA under varying loads.
- Analyze RA behavior under the Map, Shuffle, Reduce and Replication phases of MapReduce jobs.

1.4.1 Characterization Workloads

Standard, open-source workloads were used:

- Teragen/Terasort
  o Data generator
  o Primary metric: Latency (s)
  o Secondary metrics: CPU utilization (%), Network utilization (%), Network throughput (MB/s)
- TestDFSIO
  o Read/Write IO characterization tool
  o Primary metrics: Throughput (MB/s), Latency (s)
- K-Means
  o Machine-learning tool
  o Cluster analysis: partitions (n) samples into (k) clusters. The dimension (d) of each sample can be varied to obtain desired levels of complexity.
  o Primary metric: Wall clock time

1.5 Stress Tests

These are basically characterization tests performed under peak load conditions to analyze the behavior of the RA at full load and to identify possible bottlenecks.
1.5.1 MapReduce Stress Testing

The goal was to obtain 100% CPU utilization on the slave nodes of the cluster. A MapReduce job was submitted using Teragen/Terasort. The load (dataset size in MB) was increased until 100% CPU utilization was observed on the slave nodes. Performance at 100% CPU was sustained and monitored for long job durations.

1.5.2 HDFS Stress Testing

The goal was to obtain peak IO throughput (MB/s) while targeting the SAS limit of the disk controller on the slave nodes and/or 100% of the network utilization of the cluster. TestDFSIO was used to vary the file size of a dataset that was read from or written to the cluster until peak throughput was attained and sustained.

1.6 Bottleneck Investigation

Identifying bottlenecks was a prime goal of this review. Ideally, the SUT must run at full capacity (optimal utilization of available resources). When the SUT does not perform as expected, it is essential to identify the source of the problem (bottlenecks). Bottlenecks can be caused by hardware limitations, by inefficient software configurations, or both.

MapReduce jobs used in this review were CPU-intensive. The inability of a MapReduce job to fully maximize the CPU resources (attain 100% CPU utilization) on the slave nodes at full load is an indication of a bottleneck on the SUT. HDFS jobs used in this review were IO-intensive. Attaining the SAS throughput limit of the disk controller is a good indication of a bottleneck-free SUT.

1.7 Performance Tuning

It is imperative that bottlenecks are eliminated, or their impact mitigated, in order to fully utilize the resources of the SUT. In this review, software parameters (OS and Hadoop) were tuned in order to attain the desired CPU and IO profiles.
2 System under Test (SUT)

This benchmark review was undertaken on the Dell PowerEdge R720/R720xd hardware platform.

Figure 1: System Under Test
2.1 Hardware Configuration

Table 1: Hardware Configuration

Infrastructure nodes (Active and Secondary Name Node; Admin Node; HA Node; Edge Node):
- Platform: PowerEdge R720
- CPU: 2 x Xeon E5 (6-core)
- RAM (minimum): 96 GB
- LOM: 4 x 1GbE
- Disk: 6 x 600-GB 10K SAS 3.5-inch
- Storage controller: PERC H710
- RAID: RAID 10

Data nodes:
- Platform: PowerEdge R720xd
- CPU: 2 x Xeon E5 (6-core)
- RAM (minimum): 36 GB
- LOM: 4 x 1GbE
- Disk: 24 x 1-TB SATA 7.2K 2.5-inch
- Storage controller: PERC H710
- RAID: Single Drive RAID 0

2.2 Software Stack

Table 2: Software Stack

- Operating System: Red Hat Enterprise Linux 6.2
- Hadoop: Cloudera Distribution of Hadoop (CDH)
- Cluster management: Cloudera Manager
- Java: Sun Oracle Java version 6
2.3 Network Configuration

Figure 2: Network Configuration

Table 3: Server-Side Cabling (NICs to switch ports: LOM1, LOM2, LOM3, LOM4 and BMC for the Admin node, Name Nodes, Data Nodes and Edge Node; legend: Production LAN, Management LAN, Public LAN)

In order to segregate network traffic and enable dedicated network links, the Dell Cloudera solution configures three distinct VLANs:

- Production (802.1q tagged): Used by the Hadoop system to handle traffic between all nodes for HDFS operations, MapReduce jobs, and other Hadoop traffic. Network links are bonded in a team of 2 or more links.
- Management (VLAN 300, not tagged): Used for connecting to the BMC of each node. Additionally used for administrative functions such as Crowbar node installation, backups and other monitoring.
- Public (802.1q tagged): Used for connections to devices external to the Hadoop cluster. Bonded in a team of 2 links.
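For reference, a minimal RHEL 6 bonding setup for the production LAN might look like the sketch below. This is an illustration only: the interface names (bond0, em1), the bonding mode and the file contents are assumptions, not values taken from the RA deployment.

  # /etc/sysconfig/network-scripts/ifcfg-bond0  (hypothetical example)
  DEVICE=bond0
  BOOTPROTO=none
  ONBOOT=yes
  BONDING_OPTS="mode=balance-alb miimon=100"   # bonding mode is an assumption

  # /etc/sysconfig/network-scripts/ifcfg-em1  (one of the two bonded LOMs)
  DEVICE=em1
  MASTER=bond0
  SLAVE=yes
  ONBOOT=yes
  BOOTPROTO=none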
3 Run Time Environment

Performance tests were executed on the Master node from the command line or from scripts.

3.1 TestDFSIO

Executed from the command line; see Appendix A1 (Apache Hadoop TestDFSIO).

Before every performance run, remove previous test data by running the command:

  # sudo -u hdfs hadoop jar /usr/lib/hadoop-0.20-mapreduce/hadoop-test-*-mr1-cdh4.1.1.jar TestDFSIO -clean

Dataset size was used to characterize and stress the SUT. The following command-line options were used to vary the dataset size from 100GB to 5000GB:

- -nrFiles (number of files)
- -fileSize (size of each file)

Performance metrics provided at program completion:
- Throughput (MB/s)
- IO rate (MB/s)
- Execution time (s)
- Standard deviation

Secondary metrics:
- CPU utilization: Ganglia, Cloudera Manager host statistics
- Network utilization: Ganglia

3.2 Terasort

Executed from HiBench scripts as shown in Appendix A1 (HiBench: Terasort). Follow the instructions in section 3.4 (Flush Memory) to flush the cache before any performance run. Performance characterization and stress testing was done by varying the dataset size from 10GB to 10,000GB.

Modify ~/HiBench-2.1/terasort/conf/configure.sh to set the dataset size:
- # for prepare (total) - 1TB
- DATASIZE=<rows>  (teragen generates 100-byte rows, so 1TB corresponds to 10^10 rows)

Data generation (teragen): run ~/HiBench-2.1/terasort/bin/prepare.sh
Program execution (terasort): run ~/HiBench-2.1/terasort/bin/run.sh

Primary performance metric:
- Latency: job duration shown by the Cloudera Manager Activity Monitor

Secondary metrics:
- CPU utilization: Ganglia, Cloudera Manager host statistics
- Network utilization: Ganglia
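As an illustration of how the TestDFSIO characterization runs can be batched, a minimal sketch follows. It assumes the doall.sh helper from section 3.4 and the CDH jar location quoted above (located by a glob rather than a pinned version); the size list and result-file paths are hypothetical, not the exact harness used in this review.

  #!/bin/bash
  # Hypothetical TestDFSIO write-characterization sweep: nrFiles stays at 1000
  # while the per-file size grows, so the total dataset runs 100GB to 5TB.
  JAR=$(ls /usr/lib/hadoop-0.20-mapreduce/hadoop-test-*.jar)
  for FILESIZE in 100 500 1000 5000; do                       # MB per file
      sudo -u hdfs hadoop jar $JAR TestDFSIO -clean           # remove previous test data
      ./doall.sh "sync; echo 3 > /proc/sys/vm/drop_caches"    # flush caches (section 3.4)
      sudo -u hdfs hadoop jar $JAR TestDFSIO -write \
          -nrFiles 1000 -fileSize $FILESIZE \
          -resFile /tests/testdfsio/write_${FILESIZE}MB.txt
  done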
3.3 K-Means

Executed from HiBench scripts as shown in Appendix A1 (Intel HiBench: K-Means). Follow the instructions in section 3.4 (Flush Memory) to flush the cache before any performance run. Performance characterization and stress testing using a complex application was done by varying the dataset size from 0.3GB to 2,500GB.

Modify ~/HiBench-2.1/kmeans/conf/configure.sh as shown in Appendix A1 to define the dataset:
- Number of samples (n): 10^3, 10^4, ..., 10^10
- Dimension of each sample (d): 2, 4, 8, 16, 32, 64
- Number of clusters (k): 2, 4, 8, 16, 32, 64, 128
- Samples per input file: (number of samples / 5)

Data generation: run ~/HiBench-2.1/kmeans/bin/prepare.sh
Program execution: run ~/HiBench-2.1/kmeans/bin/run.sh

Primary performance metrics (obtained from hibench.report):
- Throughput (MB/s)
- Latency

Secondary metrics:
- CPU utilization: Ganglia, Cloudera Manager host statistics
- Network utilization: Ganglia

3.4 Flush Memory

Before any performance run, it is necessary to flush the cache memory on all slave nodes. A helper script runs the supplied command on every slave node (substitute the cluster's slave IP range for the placeholder):

1. doall.sh

  #!/bin/bash
  # Run the supplied command on every slave node in turn.
  for i in <slave-ip-range>; do
      echo -ne "$i: "
      ssh $i "$@"
  done

2. From the command line:

  # ./doall.sh free -m
  # ./doall.sh sync
  # ./doall.sh "echo 3 > /proc/sys/vm/drop_caches"
  # ./doall.sh free -m
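To give a concrete picture of how the K-Means sweep can be driven, a hypothetical script is sketched below. The sed edits and value lists are assumptions for illustration; the variable names (NUM_OF_CLUSTERS, DIMENSIONS) come from the configure.sh listing in Appendix A1.

  #!/bin/bash
  # Hypothetical K-Means characterization sweep over dimensions (d) and clusters (k).
  CONF=~/HiBench-2.1/kmeans/conf/configure.sh
  for D in 2 4 8 16 32 64; do
    for K in 2 4 8 16 32 64 128; do
      sed -i "s/^DIMENSIONS=.*/DIMENSIONS=$D/"            $CONF
      sed -i "s/^NUM_OF_CLUSTERS=.*/NUM_OF_CLUSTERS=$K/"  $CONF
      ./doall.sh "sync; echo 3 > /proc/sys/vm/drop_caches"   # flush caches first (3.4)
      ~/HiBench-2.1/kmeans/bin/prepare.sh                    # generate the dataset
      ~/HiBench-2.1/kmeans/bin/run.sh                        # run the benchmark
    done
  done
  # Results (throughput, latency) accumulate in the hibench.report file.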
4 Performance Results

4.1 MapReduce Performance

The performance of the MapReduce layer was analyzed by varying the dataset size. At each instance, the CPU characteristics, network utilization and performance metrics were obtained. Performance results were analyzed for evidence of bottlenecks. The SUT was tuned to improve performance.

4.1.1 Performance Characterization

Terasort was used to characterize MapReduce performance. The instructions in section 3.2 were followed to vary the size of the dataset from 10GB to 10,000GB. At each instance, performance data was collected and recorded.

Figure 3: Performance Characterization Chart (latency in seconds and CPU utilization in % vs. data size in GB)
4.1.2 MapReduce Job: CPU Profile

The CPU profile of the Map and Reduce phases of a 1TB sort job was captured.

Figure 4: MapReduce Job CPU Profile

4.1.3 MapReduce Networking

The nodes in a Hadoop cluster are interconnected through the network. Typically, one or more of the following phases of a MapReduce job transfers data over the network:

1. Writing data: This phase occurs when the initial data is either streamed or bulk-delivered to HDFS. Data blocks of the loaded files are replicated, transferring additional data over the network.
2. Workload execution: The MapReduce algorithm is run.
   a. Map phase: In the map phase of the algorithm, almost no traffic is sent over the network. The network is used at the beginning of the map phase only if an HDFS locality miss occurs (the data block is not locally available and has to be requested from another data node).
   b. Shuffle phase: This is the phase of workload execution in which traffic is sent over the network, the degree to which depends on the workload. Data is transferred over the network when the output of the mappers is shuffled to the reducers.
   c. Reduce phase: In this phase, almost no traffic is sent over the network because the reducers have all the data they need from the shuffle phase.
   d. Output replication: MapReduce output is stored as a file in HDFS. The network is used when the blocks of the result file have to be replicated by HDFS for redundancy.
3. Reading data: This phase occurs when the final data is read from HDFS for consumption by the end application, such as a website, indexing, or a SQL database.
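To correlate the phases above with actual wire traffic per node (beyond the aggregate Ganglia view in Figure 5 below), interface counters can be sampled while a job runs. A rough sketch, assuming the sysstat package is installed, the doall.sh helper from section 3.4, and a bond0 production interface:

  # Sample bonded-interface throughput on every slave node every 5s for ~10 minutes
  # while a sort job runs, then line the peaks up against the job's phase timeline.
  ./doall.sh "sar -n DEV 5 120 | grep bond0" > network_by_phase.log &
  hadoop jar /usr/lib/hadoop-0.20-mapreduce/hadoop-*examples*.jar terasort \
      /HiBench/Terasort/Input /HiBench/Terasort/Output
  wait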
Figure 5: Network Utilization by MapReduce Phases

4.2 Bottleneck Investigation

MapReduce jobs are CPU-intensive. In this review it was possible to stress the slave nodes to attain 100% CPU utilization, indicating the absence of MapReduce performance bottlenecks (particularly IO and network bottlenecks). Further analysis of the CPU profile showed very high CPU system time (> 30%). Typically, CPU system time should be < 15% and user time should be > 80%. Using the techniques described in section 4.2.1, the CPU system time was reduced to < 15%. There was evidence of memory swapping with dataset sizes > 3TB, indicating that memory is an issue at those sizes.
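A quick way to reproduce the user/system split discussed above (illustrated in Figure 6 below) is to sample the CPU and swap counters on all slave nodes mid-run, again using the doall.sh helper from section 3.4:

  # One-minute CPU profile per slave node: high %sys (>30%) with modest %usr
  # points at OS-level overhead rather than at the MapReduce workload itself.
  ./doall.sh "mpstat 60 1 | tail -1"
  # Swap check for large (>3TB) datasets: non-zero si/so columns indicate swapping.
  ./doall.sh "vmstat 5 3 | tail -2"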
Figure 6: Tuning CPU System Time

4.2.1 Performance Tuning

Typically, performance tuning is performed to fix bottlenecks or to mitigate their impact. For the dataset sizes considered in this review (10GB-10,000GB), CPU utilization was identified as the only bottleneck for MapReduce performance, indicating that the SUT was already optimized for the best CPU and memory performance. Further analysis of the CPU profile (see Figure 6) indicated that CPU system time was very high and had to be reduced to further improve performance.

Based on best practices adopted from Cloudera performance engineers, the impact of tuning several hardware and software parameters was investigated. In this review, the block size and the Hadoop configuration parameters provided the most significant boost to performance. These parameters were tuned as shown below, and the performance characterization tests were repeated with the dataset size of the MapReduce jobs varied from 100GB to 1000GB. See the results in Figure 7.

1. OS parameters: Use the doall.sh script shown in section 3.4 (Flush Memory) to apply these to all the slave nodes.
   a. Turn down swappiness:
      # ./doall.sh "echo 0 > /proc/sys/vm/swappiness"
   b. Turn off huge page defrag:
      # ./doall.sh "echo never > /sys/kernel/mm/redhat_transparent_hugepage/defrag"
2. Block size: The block size was increased from 128MB to 512MB using the configuration parameter dfs.blocksize=536870912 (512MB expressed in bytes).

3. Hadoop configuration parameters. The following parameters were tuned (cells left blank where the tuned value was not recorded):

Table 4: Tuning Hadoop Configuration Parameters

Parameter | Tuned Value | Default
mapred.map.tasks | 180 | 2
mapred.reduce.tasks | 64 | 1
mapred.tasktracker.map.tasks.maximum | 24 | 2
mapred.tasktracker.reduce.tasks.maximum | 8 | 2
MapReduce Child Java Maximum Heap Size | | NULL
Datanode Java Heap Size | | NULL
Task Tracker Java Heap Size | | NULL
Namenode Java Heap Size | |
Secondary Namenode Java Heap Size | |
io.sort.mb | |
io.sort.record.percent | |
io.sort.spill.percent | 0.98 |

The overall performance improvement due to tuning the CPU system time, the block size and the Hadoop configuration parameters was found to be 50.84%, based on the reduction in job duration times.

Table 5: Performance Boost by Tuning

Parameter | Performance boost
OS parameters | 38.10%
Block size | 12.20%
Hadoop configuration parameters | 0.54%

Figure 7: Comparing the Performance of a Tuned to a Non-tuned SUT (job duration in seconds vs. data size in GB)
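For experimentation, the same parameters can be overridden per job before being committed to the cluster-wide configuration. The sketch below is an assumed invocation (the jar is located by glob and the HDFS paths follow Appendix A1/B); 536870912 is 512MB in bytes, as above.

  # Hypothetical tuned terasort run: override block size and reduce-task count
  # with -D flags instead of editing the cluster configuration.
  hadoop jar /usr/lib/hadoop-0.20-mapreduce/hadoop-*examples*.jar terasort \
      -D dfs.blocksize=536870912 \
      -D mapred.reduce.tasks=64 \
      /HiBench/Terasort/Input /HiBench/Terasort/Output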
4.2.2 MapReduce Resource Utilization

Resource utilization by MapReduce jobs was observed for one day with Teragen/Terasort jobs running on the cluster. It was observed that the datanode servers are the workhorses of the Hadoop cluster; they fully utilized their memory, network and CPU resources. There was evidence of memory swapping on datanode servers when large datasets (> 3TB) were sorted using Terasort. The namenode and other infrastructure servers used these resources very lightly. This could have an impact on how these servers are scoped for small clusters.

Figure 8: Memory Used by a Datanode Server
Figure 9: Memory Used by a Namenode Server

Figure 10: CPU Utilization by a Datanode Server
Figure 11: CPU Utilization by a Namenode Server

4.2.3 Analysis of a Teragen/Terasort Job

The result of a single Teragen/Terasort job was captured as a data point for comparison with future benchmark reviews.

Table 6: Analysis of a Teragen/Terasort Job

Parameter | Teragen | Terasort
Number of rows | 10^10 (100-byte rows) | same dataset
Dataset size | 1TB | 1TB
Job duration | 1032s (17 mins 12s) | 725s (12 mins 5s)
Network utilization | 83.4% | 47.9%
CPU utilization, Map phase | 90% | 97%
CPU utilization, Shuffle/Reduce phase | 31% | 45%
4.3 MapReduce Performance under a Complex Application: K-Means

K-Means provides an application that can be configured to match the complexity of real-world use cases. Refer to Appendix A1 (Intel HiBench: K-Means) and section 3.3 (K-Means) for details on how K-Means was installed and run. It was observed that the performance characteristics were largely impacted by the sample size (n). Performance characterization tests were performed for small samples with n <= 10^3, and then repeated for large samples with n >= 10^10. In all cases and for each job, the sample size (n), the dimension (d) and the number of clusters (k) were varied. For each sample size (small or large) and number of dimensions, the cluster count was varied.

4.3.1 K-Means Job Time

It was observed that, for a small sample, run time grew exponentially in the product of the number of clusters and samples (where d = dimension, k = clusters, n = samples).

Figure 12: Small Sample Duration (run time in seconds vs. number of clusters, for d = 2, 4, 8, 16, 32, 64; number of samples = 10^3)

For large samples, job completion time is almost linearly proportional to the number of clusters, k.
Figure 13: Large Sample (10^10) Duration and CPU (run time in seconds and CPU utilization in % vs. number of clusters)

4.3.2 K-Means CPU Utilization

For small samples, 80% CPU was attained when d > 16 and k > 32.

Figure 14: K-Means Small Sample CPU Utilization (CPU utilization in % vs. number of clusters, for d = 2, 4, 8, 16, 32, 64)

For large samples, 80% CPU is attained even with k = 1. Refer to Figure 15 (CPU Profile, Large Sample Jobs).
Figure 15: CPU Profile, Large Sample Jobs (large samples, 10^10: CPU utilization ~95%, user time ~90%, system time ~5%)

4.3.3 Throughput

MapReduce throughput under K-Means, a complex application, was obtained and analyzed. K-Means throughput was provided by the HiBench report and was an indication of the rate at which data was analyzed. This was based on the amount of data and the time taken to compute the centroids (for each iteration) before convergence. For low-complexity (d < 8) small samples, throughput increases linearly with the number of clusters, k. For high-complexity (d > 16) small samples, throughput starts to drop for large numbers of clusters.

Figure 16: Small Sample Throughput (throughput in MB/s vs. number of clusters, for d = 2, 4, 8, 16, 32, 64)

For large samples, throughput drops linearly with the number of clusters.
Figure 17: Large Sample Throughput (throughput in MB/s vs. number of clusters)
4.3.4 MapReduce Resource Utilization under a Complex Application

Figure 18: Resource Utilization under K-Means
The figure above shows how the CPU and IO resources of the SUT were utilized with a large-sample K-Means job (d=2, k=2) running for about 1 hour. The charts show that CPU utilization was high (98%) throughout the 5 iterations. There was significant read I/O activity during the run, but write I/O activity kicked in after the 5 iterations had completed. Network traffic was noticeable after the iterations. Memory utilization was high (~80%) throughout the run.

4.3.5 Analysis of a K-Means Job

The result of a single K-Means job was captured as a data point for comparison with future benchmark reviews.

Table 7: Analysis of a K-Means Job

Parameter | Value
Sample size (n) | 10^10
Dimensions (d) | 32
Clusters (k) | 8
Samples per input file | 2 x 10^9 (n/5, per section 3.3)
Maximum number of iterations | 5
Input size | 2.4 TB
Total time | ~13,680 seconds (~3 hrs 48 mins)
Throughput | ~175 MB/s (derived: 2.4 TB over the total run time)
CPU | 100%
Network utilization | 84%

4.4 HDFS Performance

The TestDFSIO benchmark was used to analyze the performance of the HDFS layer. For instructions on how to run this benchmark, refer to Appendix A1 (Apache Hadoop TestDFSIO) and section 3.1 (TestDFSIO).

4.5 HDFS Write Performance

For this SUT, job execution times rise linearly with the size of the dataset up to 1000GB. For dataset sizes of 1000GB and bigger, job execution times rise much faster than linearly. This is mainly due to replication and the limitations of the network bandwidth. The default replication factor of 3 was maintained for all HDFS tests. As dataset sizes increase, the multiplier effect of the replication factor comes into play, requiring more data
to be transferred across the network. As the network becomes the bottleneck, data transfer is constrained, leading to increased job execution times.

Figure 19: HDFS Write Performance Chart (IO throughput in MB/s and job duration time in s vs. data size in GB; nrFiles = 1000)

4.5.1 HDFS Read Performance

TestDFSIO read jobs are processed locally within each data node, with no significant transfer of data across the network. Network traffic is therefore not as significant as that expected in write jobs. In addition to the underlying I/O hardware architecture (disks, controllers), the number of files to process and the number of available Hadoop map and reduce slots have a significant impact on read performance.

Figure 20: HDFS Read Performance Chart (IO throughput in MB/s and job duration time in s vs. data size in GB; nrFiles = 1000)
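The write-side sensitivity to replication described above can be probed directly. This is a what-if sketch, not a test performed in this review, and it assumes the TestDFSIO build accepts generic -D options; if it does not, dfs.replication can be lowered temporarily in hdfs-site.xml for the same experiment.

  JAR=$(ls /usr/lib/hadoop-0.20-mapreduce/hadoop-test-*.jar)
  # Baseline write with the default replication factor of 3 (network-bound at scale)
  sudo -u hdfs hadoop jar $JAR TestDFSIO -write -nrFiles 1000 -fileSize 1000
  # Replication 1: no replica traffic crosses the network, so throughput should
  # move toward the local disk/controller limits discussed in 4.5.2.
  sudo -u hdfs hadoop jar $JAR TestDFSIO -D dfs.replication=1 \
      -write -nrFiles 1000 -fileSize 1000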
4.5.2 IO Bottleneck Investigation

Three possible IO limits were considered.

SAS limit: An LSI white paper, "Switched SAS: Sharable, Scalable SAS Infrastructure", shows how to calculate the SAS limit of an 8-lane controller port with a SAS bandwidth of 6Gbps:

  6Gb/s x 8 lanes = 48Gb/s per x8 port
  48Gb/s (8b/10b encoding) = 4.8GB/s per port (per node)
  4.8GB/s per port x 88.33% (arbitration delays and additional framing) = 4320MB/s per port

PCI-E slot: The Dell R720 provides integrated PCI-E Gen-3 capable slots. Gen-3 is defined at 8 Gbps per lane (scrambling plus 128b/130b encoding instead of 8b/10b encoding), so, for example, a PCIe Gen-3 x8 link delivers an aggregate bandwidth of 8 GB/s.

Network: Each slave node has 2 x 1GbE bonded NIC interfaces. The full-duplex bandwidth (BW) per node is:

  BW = 1 Gb/s x 2 (interfaces) x 2 (full duplex) / 8 (bits) = 0.5 GB/s

Allowing for 20% transmission overhead, the nominal BW is expected to be ~400MB/s per node.

The IO limits per node are summarized in the following table.

Table 8: IO Bottlenecks

Component | Max Bandwidth
SAS Controller | 4.8 GB/s
PCI-E Gen-3 Slot | 8.0 GB/s
2 x 1GbE NIC Interfaces | 400 MB/s

It is clear that the network has the lowest bandwidth limit. Write IO performance is severely impacted by the network limit due to the requirement to transfer data across the network. Read IO performance is more dependent on the IO bandwidth limitations of the underlying IO components (SAS controller, PCI slots, disks, etc.). Since these limits are high for each node, the read performance of this SUT depended more on TestDFSIO parameters (number of files, -nrFiles) and Hadoop parameters (number of map slots).

4.5.3 HDFS Distributed Processing

The number of distributed partitions significantly impacts HDFS performance. The more partitions (distributed files), the better the performance, as shown in the chart below, which shows how write performance is impacted by the number of files (nrFiles).
Figure 21: HDFS Distributed Processing Chart (IO write throughput in MB/s and job duration time in s vs. nrFiles)
4.5.4 Analysis of a TestDFSIO Job

The results of read and write TestDFSIO jobs were captured as a data point that could be used for future reference (cells left blank where the figure was not recorded).

Table 9: Analysis of a TestDFSIO Job

Parameter | Write Performance | Read Performance
nrFiles | 1000 | 1000
fileSize | 1000MB | 1000MB
option | -write | -read
Throughput | 3310 MB/s | 8820 MB/s
IO Rate | 4000 MB/s |
Execution time (s) | |
Network utilization | 85.4% | 79.2%
CPU utilization, Map phase | 100% | 70%
CPU utilization, Shuffle/Reduce phase | 25% | 27%
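For reference, the write and read invocations behind Table 9 follow the pattern in Appendix A1; they are reconstructed below as an assumption (the jar is located by a glob rather than a pinned version), together with a result-parsing one-liner that assumes the standard "Throughput mb/sec:" line in the TestDFSIO result file.

  JAR=$(ls /usr/lib/hadoop-0.20-mapreduce/hadoop-test-*.jar)
  # Write test: 1000 files x 1000MB
  sudo -u hdfs hadoop jar $JAR TestDFSIO -write -nrFiles 1000 -fileSize 1000 \
      -resFile /tests/testdfsio/write_results.txt
  # Read test over the same files
  sudo -u hdfs hadoop jar $JAR TestDFSIO -read -nrFiles 1000 -fileSize 1000 \
      -resFile /tests/testdfsio/read_results.txt
  # Scale the reported per-task figure by the cluster's map slots (50 assumed here)
  awk -F': ' '/Throughput mb\/sec/ {print $2 * 50, "MB/s concurrent (50 slots)"}' \
      /tests/testdfsio/write_results.txt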
5 Conclusion

This R720/R720xd benchmarking review is the first in a series that is planned to be conducted with every major release of the Dell Cloudera Hadoop solution. Every attempt was made to perform all tests recommended in the Benchmarking Plan and Guide, but for various reasons a number of them could not be performed. The results obtained in this review will be used as baseline data for comparing the performance of subsequent RA revisions, configurations and performance optimizations.

The main achievements of this benchmarking review are:

- Performance characterization of the R720/720xd RA using CPU- and IO-intensive workloads.
- Stress testing of the RA: understanding the behavior of the RA as the load increases, up to the point where bottlenecks become evident.
- Bottleneck investigation. For the size of the Hadoop cluster under review, the main bottlenecks are the CPU and the network. Results from this review show when each bottleneck comes into play.
- Performance tuning. Software (OS and Hadoop) tuning techniques were employed to mitigate the impact of the CPU bottleneck. These tweaks provided a performance boost of 50.84% over a non-tuned system. This implies that a Terasort job will run at least 50% faster after the tweaks. These tuning tweaks should be incorporated into the solution.

Based on the performance results and analysis, some recommendations have been made and should be considered in order to improve the performance of the Dell Cloudera Hadoop Solution:

1. Performance tuning: use the techniques in section 4.2.1 (Performance Tuning) to improve the performance of the solution by over 50%.
   a. Apply the Hadoop configuration parameters.
   b. Apply the OS parameters.
   c. Increase the block size from the default 128MB to 512MB or more.
2. RA changes, subject to a cost/benefit analysis:
   a. More memory on the slave nodes (> 64GB).
   b. Less memory on infrastructure nodes.
   c. Fewer processing elements (CPUs/cores) on infrastructure nodes.
   d. More processing elements (CPUs/cores) on slave nodes.
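To make the OS part of recommendation 1 stick across reboots, the settings can be persisted on each slave node. A minimal sketch for RHEL 6, using the doall.sh helper from section 3.4; the file locations are standard for this OS release, but treat the snippet as an assumption to validate against the deployment guide.

  # Persist swappiness via sysctl.conf and apply it immediately
  ./doall.sh "echo 'vm.swappiness = 0' >> /etc/sysctl.conf && sysctl -p"
  # Transparent hugepage defrag has no sysctl knob on RHEL 6; re-apply it at boot
  ./doall.sh "echo 'echo never > /sys/kernel/mm/redhat_transparent_hugepage/defrag' >> /etc/rc.local"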
6 Appendix A: Test Environment

Open-source workloads were used to generate data and submit jobs to the Hadoop cluster.

6.1 Appendix A1: Test Suites

The high-level goal of this benchmarking review was to test the architectural components of Hadoop (MapReduce, HDFS) and how they interact with the underlying hardware infrastructure. Workloads were selected based on how well they could exercise the hardware components (IO, CPU and network) that have the biggest impact on MapReduce and HDFS:

Table 10: Test Suites

Benchmark | Distribution | Hadoop Component Stressed | Hardware Component Stressed
TestDFSIO | Apache Hadoop | HDFS | IO, Network
Teragen | HiBench 2.2 | HDFS | IO, Network
Terasort | HiBench 2.2 | MapReduce | CPU
K-Means | HiBench 2.2 | MapReduce | Application level, CPU

6.1.1 Apache Hadoop TestDFSIO

The TestDFSIO benchmark is a read and write test for HDFS. TestDFSIO is used to measure the performance of HDFS and stresses both the network and IO subsystems. The command reads and writes files in HDFS, which is useful in measuring system-wide performance and exposing network bottlenecks on the Hadoop cluster. A majority of HDFS workloads are more IO-bound than compute-bound, and hence TestDFSIO can provide an accurate initial picture of such scenarios. Nevertheless, because this test is run as a MapReduce job, the MapReduce stack of the cluster must be working correctly. In other words, this test cannot be used to benchmark HDFS in isolation from MapReduce.

The benchmark can be run for writing, using the -write switch, and with -read for the read test. The command line accepts a number of files and the size of each file in HDFS. The command used to generate and write 1000 files, each 1000MB, is:

  # sudo -u hdfs hadoop jar /usr/lib/hadoop-0.20-mapreduce/hadoop-test-*-mr1-cdh4.1.1.jar TestDFSIO -write -nrFiles 1000 -fileSize 1000 -resFile /tests/testdfsio/results.txt

TestDFSIO generates 1 map task per file, and splits are defined such that each map processes a single file. After every run, the command generates a log file indicating performance in terms of 4 metrics: throughput in MB/s, average IO rate in MB/s, IO rate standard deviation, and execution time. The most notable metrics are throughput and average IO rate, both of which are based on the file size read or written by the individual map task and the elapsed time taken to perform the task. For N map tasks, where task i processes size(i) bytes in time(i) seconds, the throughput and IO rate are defined as:

  Throughput(N) = (sum over i of size(i)) / (sum over i of time(i))
  Average IO rate(N) = (sum over i of rate(i)) / N, where rate(i) = size(i) / time(i)
If the cluster has 50 map slots and TestDFSIO creates 1000 files, the concurrent throughput can be calculated as:

  Concurrent Throughput = Reported Throughput x Number of Map Slots

For example, a reported per-task throughput of 66 MB/s across 50 concurrent map slots corresponds to 3300 MB/s cluster-wide. The IO rate can be calculated in a similar fashion. While measuring cluster performance using TestDFSIO may be considered sufficient, the HDFS replication factor (the value of dfs.replication) also plays an important role. A lower replication factor leads to higher throughput performance due to reduced background traffic.

6.1.2 Intel HiBench

HiBench is a benchmarking suite for Hadoop. It consists of a set of Hadoop programs, including both synthetic micro-benchmarks and real-world Hadoop applications. An overview of the benchmark can be obtained at github. This review used the following HiBench 2.1 micro-benchmarks:

- Terasort: CPU-intensive workload used to characterize the performance of, and stress-test, the MapReduce layer.
- K-Means: CPU-intensive workload used to characterize MapReduce performance on a SUT running complex Hadoop applications.

The HiBench suite is hierarchically organized, with each micro-benchmark having a similar directory structure. Each micro-benchmark has the following files with tunable parameters:

- ~/conf/configure.sh: sets the environment, data size, compression, and run-time Hadoop parameters
- ~/bin/prepare.sh: data generation; run-time Hadoop parameters
- ~/bin/run.sh: benchmark execution; run-time Hadoop parameters
6.1.3 HiBench: Terasort

Terasort is part of the Apache Hadoop distribution and is available on any cluster. This review used the package that was distributed with the HiBench suite. It is distributed as a 2-part package:

- Teragen is a map/reduce data generator. Given a dataset size, it divides the desired number of rows by the desired number of tasks and assigns ranges of rows to each map. The map uses the random number generator to jump to the correct value for the first row and generates the subsequent rows. Teragen is executed by the prepare.sh script.
- Terasort is a standard sort program that samples the input data generated by teragen and uses map/reduce to sort the data into a total order. Terasort is executed via run.sh.

Run-time scripts. These were the tunable parameters implemented in running Terasort:

~/HiBench/bin/hibench-config.sh

  # switch on/off compression: 0-off, 1-on
  export COMPRESS_GLOBAL=0
  export COMPRESS_CODEC_GLOBAL=org.apache.hadoop.io.compress.DefaultCodec

~/HiBench/terasort/conf/configure.sh

  # for prepare (total) - 1TB
  DATASIZE=<rows>    # teragen rows are 100 bytes; 1TB corresponds to 10^10 rows
  # Number of Map tasks
  NUM_MAPS=180
  # Number of Reduce tasks
  NUM_REDS=64

~/HiBench/terasort/bin/prepare.sh
  # Generate the terasort data
  hadoop jar /usr/lib/hadoop-0.20-mapreduce/hadoop-*mr1-cdh4.1.1-examples.jar teragen \
      -D mapred.map.tasks=$NUM_MAPS \
      $DATASIZE $INPUT_HDFS

~/HiBench/terasort/bin/run.sh

  # run bench
  hadoop jar $HADOOP_HOME/hadoop-*mr1-cdh4.1.2-examples.jar terasort \
      -D mapred.reduce.tasks=$NUM_REDS $INPUT_HDFS $OUTPUT_HDFS

  # post-running
  END_TIME=`timestamp`
  gen_report "TERASORT" ${START_TIME} ${END_TIME} ${SIZE} >> ${HIBENCH_REPORT}

6.1.4 Intel HiBench: K-Means

K-Means is a data mining, cluster analysis algorithm that aims to partition n observations (x_1, x_2, ..., x_n) into k sets (clusters) S = {S_1, S_2, ..., S_k}, where k <= n. Each observation belongs to the cluster with the nearest mean, i.e. the one with the most similar items:

1. k centroids are selected.
2. Each item in the sample is placed in the cluster with the least distance (nearest centroid).
3. For each group of points assigned to the same center, compute a new center by taking the centroid of the points.
4. Repeat until there is convergence.

This review used the K-Means benchmark from the Intel HiBench 2.1 suite.

Runtime scripts:

~/HiBench/bin/hibench-config.sh
  ###################### Global Paths ##################
  export HADOOP_HOME=/usr/lib/hadoop-0.20-mapreduce
  HADOOP_CONF_DIR=$HADOOP_HOME/conf
  HADOOP_EXAMPLES_JAR=$HADOOP_HOME/hadoop-examples*.jar

  if [ -z "$HIBENCH_HOME" ]; then
      export HIBENCH_HOME=/var/lib/hadoop-hdfs/hibench
  fi

  if [ -z "$HIBENCH_CONF" ]; then
      export HIBENCH_CONF=${HIBENCH_HOME}/conf
  fi

  if [ -f "${HIBENCH_CONF}/funcs.sh" ]; then
      source "${HIBENCH_CONF}/funcs.sh"
  fi

  if [ -z "$HIVE_HOME" ]; then
      export HIVE_HOME=/usr/lib/hive
  fi

  if [ -z "$MAHOUT_HOME" ]; then
      export MAHOUT_HOME=/usr/lib/mahout
  fi
  if [ -z "$DATATOOLS" ]; then
      export DATATOOLS=${HIBENCH_HOME}/common/autogen/dist/datatools.jar
  fi

~/HiBench/kmeans/conf/configure.sh

  # for prepare
  # Number of clusters (k)
  NUM_OF_CLUSTERS=128
  # Number of samples (n)
  NUM_OF_SAMPLES=100
  #SAMPLES_PER_INPUTFILE=
  SAMPLES_PER_INPUTFILE=    # set to NUM_OF_SAMPLES/5 (see section 3.3)
  # Number of dimensions (d)
  DIMENSIONS=4

  # for running
  MAX_ITERATION=5

~/HiBench/kmeans/bin/prepare.sh

  # generate data
  OPTION="-sampleDir ${INPUT_SAMPLE} -clusterDir ${INPUT_CLUSTER} -numClusters ${NUM_OF_CLUSTERS} -numSamples ${NUM_OF_SAMPLES} -samplesPerFile ${SAMPLES_PER_INPUTFILE} -sampleDimension ${DIMENSIONS}"
  export HADOOP_CLASSPATH=`mahout classpath | tail -1`
  hadoop jar /var/lib/hadoop-hdfs/hibench/common/autogen/dist/datatools.jar \
      org.apache.mahout.clustering.kmeans.GenKMeansDataset \
      -libjars $MAHOUT_HOME/mahout-examples-0.7-cdh4.1.2-job.jar ${OPTION}

~/HiBench/kmeans/bin/run.sh

  OPTION="-i ${INPUT_SAMPLE} -c ${INPUT_CLUSTER} -o ${OUTPUT_HDFS} -x ${MAX_ITERATION} -ow -cl -cd 0.5 -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure -xm mapreduce"
  START_TIME=`timestamp`
  #START_TIME=`date +%s`
  echo $MAHOUT_HOME

  # run bench
  mahout kmeans ${OPTION}

  # post-running
  END_TIME=`timestamp`
  echo $END_TIME
  gen_report "KMEANS" ${START_TIME} ${END_TIME} ${SIZE} >> ${HIBENCH_REPORT}
6.2 Appendix A2: Test Methodology

The hardware and network configurations were set up first; Crowbar was used to deploy the Hadoop cluster infrastructure, and Cloudera Manager deployed Hadoop.

6.2.1 Hadoop Infrastructure Deployment

The Hadoop cluster infrastructure was set up and configured by Crowbar. Please refer to the Dell Cloudera Solution Deployment Guide and the Dell Cloudera Solution Crowbar Administrator User Guide for details.

1. Follow the instructions to install the Crowbar Admin node.
2. Use a browser to connect to Crowbar.
3. Power up the servers that will be part of the Hadoop cluster.
4. Allow the servers to PXE boot from the Crowbar admin node.
5. Note that the nodes get discovered in Crowbar.
6. Create, edit, save and apply the Cloudera Manager barclamp.
7. Crowbar applies the BIOS and RAID configurations and installs the OS on all the nodes.
8. After completing the OS install, the nodes should transition to the Ready state in the Crowbar UI.
9. Note that all the Hadoop cluster nodes and the Cloudera Manager barclamp are in the Ready state (green LED) in the Crowbar UI.

6.2.2 Cloudera Manager Deployment

Instructions on how to deploy Cloudera Manager can be found in the Dell Cloudera Solution Crowbar Administrator User Guide.

- Follow the link available from the Cloudera Manager server to log in to Cloudera Manager.
- Provide a license in order to use the Enterprise Edition. This review used the Enterprise Edition of Cloudera Manager for its monitoring capabilities and tools.
- Install the Hadoop core services (HDFS, MapReduce, HUE and Oozie).
- Verify that these services are up and running with good health.

6.3 Appendix A3: Installing Benchmarks

The benchmarks specified in Appendix A1 (Test Suites) are installed as shown in this section.

6.3.1 TestDFSIO

TestDFSIO was used in this review and is available with the CDH distribution of Hadoop. The program can be run from the command line.

6.3.2 Intel HiBench

HiBench 2.1 was used to provide and manage the Teragen, Terasort and K-Means packages.

1. Download the latest version of HiBench 2.x (ZIP) from github.
2. For the HiBench 2.1 used in this review, the download is available from the same github project.
3. Download the zipball to the suggested directory on the Master node server, /home/hibench.
4. Unzip and extract the zipball.
5. Rename (mv) ~/hibench/hibench-HiBench-2.1-* to ~/hibench/HiBench-2.1 and set ownership:
   # chown -R hdfs:hdfs /home/hibench/HiBench-2.1/
6. Mahout packages are required for the implementation of K-Means. The default installation of Hadoop does not provide Mahout. The version of Mahout that is downloaded with HiBench 2.1 has compatibility issues with CDH4; the Cloudera version of Mahout has those issues settled. Download the Mahout package from the Cloudera repository.
7. Search for mahout 0.7.
8. Unzip and untar.
9. Modify the configuration and run scripts as shown in Appendix A1 (Test Suites).
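The steps above map to roughly the following command sequence; the archive names and the local download step are assumptions for illustration:

  # Hypothetical command sequence for the install steps above
  mkdir -p /home/hibench && cd /home/hibench
  unzip hibench-2.1.zip                         # the HiBench 2.1 zipball from github
  mv hibench-HiBench-2.1-* HiBench-2.1
  chown -R hdfs:hdfs /home/hibench/HiBench-2.1/
  # Cloudera build of Mahout for CDH4 compatibility
  tar -xzf mahout-0.7-cdh4.1.2.tar.gz           # archive name is an assumption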
44 7 Appendix B: Hadoop Configuration Parameters A complete listing of the hadoop configuration parameters for a 1TB Terasort job is shown in the following table. name value job.end.retry.interval mapred.job.tracker.retiredjobs.cache.size 1000 mapred.queue.default.acl-administer-jobs * dfs.image.transfer.bandwidthpersec 0 mapred.task.profile.reduces 0-2 mapreduce.jobtracker.staging.root.dir ${hadoop.tmp.dir}/mapred/staging mapred.job.reuse.jvm.num.tasks -1 dfs.block.access.token.lifetime 600 fs.abstractfilesystem.file.impl org.apache.hadoop.fs.local.localfs mapred.reduce.tasks.speculative.execution hadoop.ssl.keystores.factory.class FALSE org.apache.hadoop.security.ssl.filebasedkeystor esfactory mapred.job.name hadoop.http.authentication.kerberos.keyta b TeraSort ${user.home}/hadoop.keytab io.seqfile.sorter.recordlimit s3.blocksize dfs.namenode.num.checkpoints.retained 2 hadoop.relaxed.worker.version.check TRUE mapred.task.tracker.http.address :50060 dfs.namenode.delegation.token.renewinterval io.map.index.interval 128 s3.client-write-packet-size dfs.namenode.http-address dd4-ae e-31.dell.com:50070 ha.zookeeper.session-timeout.ms 5000 mapred.system.dir ${hadoop.tmp.dir}/mapred/system hadoop.hdfs.configuration.version 1 s3.replication 3 dfs.datanode.balance.bandwidthpersec Dell Apache Hadoop Performance Analysis
45 mapred.task.tracker.report.address :0 mapred.jobtracker.plugins org.apache.hadoop.thriftfs.thriftjobtrackerplugi n jobtracker.thrift.address dd4-ae e-31.dell.com:9290 mapreduce.reduce.shuffle.connect.timeou t dfs.journalnode.rpc-address :8485 hadoop.ssl.enabled FALSE mapreduce.job.counters.max 120 dfs.datanode.readahead.bytes ipc.client.connect.max.retries.on.timeouts 45 mapred.healthchecker.interval mapreduce.job.complete.cancel.delegatio TRUE n.tokens dfs.client.failover.max.attempts 15 dfs.namenode.checkpoint.dir file://${hadoop.tmp.dir}/dfs/namesecondary dfs.namenode.replication.work.multiplier.p 2 er.iteration fs.trash.interval 0 hadoop.jetty.logs.serve.aliases TRUE mapred.skip.map.auto.incr.proc.count TRUE hadoop.http.authentication.kerberos.princi HTTP/_HOST@LOCALHOST pal terasort.num-rows s3native.blocksize mapred.child.tmp./tmp mapred.tasktracker.taskmemorymanager monitoring-interval dfs.namenode.edits.dir ${dfs.namenode.name.dir} dfs.encrypt.data.transfer FALSE dfs.datanode.http.address :50075 io.sort.spill.percent 0.98 dfs.client.use.datanode.hostname FALSE mapred.job.shuffle.input.buffer.percent 0.7 hadoop.skip.worker.version.check FALSE hadoop.security.instrumentation.requires.a FALSE 45 Dell Apache Hadoop Performance Analysis
46 dmin mapred.skip.map.max.skip.records 0 mapreduce.reduce.shuffle.maxfetchfailure 10 s hadoop.security.authorization FALSE user.name hdfs dfs.client.failover.connection.retries.on.tim 0 eouts hadoop.security.group.mapping.ldap.searc (objectclass=group) h.filter.group dfs.namenode.safemode.extension mapred.task.profile.maps 0-2 dfs.datanode.sync.behind.writes FALSE dfs.https.server.keystore.resource ssl-server.xml mapred.local.dir ${hadoop.tmp.dir}/mapred/local hadoop.security.group.mapping.ldap.searc cn h.attr.group.name mapred.merge.recordsbeforeprogress mapred.job.tracker.http.address :50030 dfs.namenode.replication.min 1 mapred.compress.map.output TRUE mapred.userlog.retain.hours 24 s3native.bytes-per-checksum 512 tfile.fs.output.buffer.size mapred.tasktracker.reduce.tasks.maximum 8 fs.abstractfilesystem.hdfs.impl org.apache.hadoop.fs.hdfs dfs.namenode.safemode.min.datanodes 0 mapred.disk.healthchecker.interval dfs.client.https.need-auth FALSE dfs.client.https.keystore.resource ssl-client.xml dfs.namenode.max.objects 0 mapred.cluster.map.memory.mb -1 hadoop.ssl.client.conf ssl-client.xml dfs.namenode.safemode.threshold-pct 0.999f dfs.blocksize dfs.thrift.threads.max 20 mapreduce.job.submithost dd4-ae e-31.dell.com hue.kerberos.principal.shortname hue 46 Dell Apache Hadoop Performance Analysis
47 mapreduce.tasktracker.outofband.heartbe FALSE at io.native.lib.available TRUE dfs.client-write-packet-size mapred.jobtracker.restart.recover FALSE mapred.reduce.child.log.level INFO mapreduce.shuffle.ssl.address dfs.namenode.name.dir file://${hadoop.tmp.dir}/dfs/name dfs.ha.log-roll.period 120 dfs.client.failover.sleep.base.millis 500 dfs.datanode.directoryscan.threads 1 dfs.permissions.enabled TRUE dfs.support.append TRUE mapred.inmem.merge.threshold 1000 ipc.client.connection.maxidletime mapreduce.shuffle.ssl.enabled ${hadoop.ssl.enabled} dfs.namenode.invalidate.work.pct.per.iterat 0.32f ion dfs.blockreport.intervalmsec fs.s3.sleeptimeseconds 10 dfs.namenode.replication.considerload TRUE dfs.client.block.write.retries 3 hadoop.ssl.server.conf ssl-server.xml mapred.jobtracker.retirejob.interval dfs.namenode.name.dir.restore FALSE dfs.datanode.hdfs-blocksmetadata.enabled TRUE mapred.reduce.tasks 0 ha.zookeeper.parent-znode /hadoop-ha mapred.queue.names default io.seqfile.lazydecompress TRUE dfs.https.enable FALSE mapred.fairscheduler.preemption FALSE 47 Dell Apache Hadoop Performance Analysis
48 mapred.hosts.exclude /var/run/cloudera-scm-agent/process/705- mapreduce- JOBTRACKER/mapred_hosts_exclude.txt dfs.replication 3 ipc.client.tcpnodelay FALSE dfs.namenode.accesstime.precision mapred.output.format.class org.apache.hadoop.examples.terasort.teraoutpu tformat mapred.acls.enabled FALSE s3.stream-buffer-size 4096 mapred.tasktracker.dns.nameserver default mapred.submit.replication 3 io.compression.codecs org.apache.hadoop.io.compress.defaultcodec,o rg.apache.hadoop.io.compress.gzipcodec,org.a pache.hadoop.io.compress.bzip2codec,org.apa che.hadoop.io.compress.deflatecodec,org.apac he.hadoop.io.compress.snappycodec io.file.buffer.size mapred.map.tasks.speculative.execution FALSE dfs.namenode.checkpoint.txns mapred.map.child.log.level INFO kfs.replication 3 rpc.engine.org.apache.hadoop.hdfs.protoc org.apache.hadoop.ipc.protobufrpcengine olpb.clientnamenodeprotocolpb mapred.map.max.attempts 4 dfs.ha.tail-edits.period 60 kfs.stream-buffer-size 4096 mapred.job.shuffle.merge.percent 0.66 hadoop.security.authentication simple fs.s3.buffer.dir ${hadoop.tmp.dir}/s3 mapred.skip.reduce.auto.incr.proc.count mapred.job.tracker.jobhistory.lru.cache.siz e TRUE 5 48 Dell Apache Hadoop Performance Analysis
49 dfs.client.file-block-storagelocations.timeout 60 dfs.datanode.drop.cache.behind.writes FALSE tfile.fs.input.buffer.size dfs.block.access.token.enable FALSE dfs.journalnode.http-address :8480 mapreduce.job.acl-view-job mapred.job.queue.name default ftp.blocksize dfs.datanode.data.dir file://${hadoop.tmp.dir}/dfs/data mapred.job.tracker.persist.jobstatus.hours 0 dfs.https.port dfs.namenode.replication.interval 3 mapred.fairscheduler.assignmultiple TRUE mapreduce.tasktracker.cache.local.numbe rdirectories dfs.namenode.https-address dd4-ae e-31.dell.com:50470 dfs.ha.automatic-failover.enabled FALSE ipc.client.kill.max 10 mapred.healthchecker.script.timeout mapred.tasktracker.map.tasks.maximum 24 hadoop.proxyuser.oozie.hosts * dfs.client.failover.sleep.max.millis jobclient.completion.poll.interval 5000 mapred.job.tracker.persist.jobstatus.dir /jobtracker/jobsinfo mapreduce.shuffle.ssl.port dfs.default.chunk.view.size kfs.bytes-per-checksum 512 mapred.reduce.slowstart.completed.maps 0.8 hadoop.http.filter.initializers org.apache.hadoop.http.lib.staticuserwebfilter mapred.mapper.class org.apache.hadoop.examples.terasort.teragen$ SortGenMapper dfs.datanode.failed.volumes.tolerated 0 io.sort.mb Dell Apache Hadoop Performance Analysis
50 mapred.hosts /var/run/cloudera-scm-agent/process/705- mapreduce- JOBTRACKER/mapred_hosts_allow.txt hadoop.http.authentication.type simple dfs.datanode.data.dir.perm 700 ipc.server.listen.queue.size 128 file.stream-buffer-size 4096 dfs.namenode.fs-limits.max-directoryitems 0 io.mapfile.bloom.size ftp.replication 3 dfs.datanode.dns.nameserver default mapred.child.java.opts -Xmx dfs.replication.max 512 mapred.queue.default.state RUNNING map.sort.class org.apache.hadoop.util.quicksort dfs.stream-buffer-size 4096 hadoop.job.history.location file:////var/log/hadoop-0.20-mapreduce/history dfs.namenode.backup.address :50100 mapred.jobtracker.instrumentation org.apache.hadoop.mapred.jobtrackermetricsin st hadoop.util.hash.type murmur dfs.block.access.key.update.interval 600 dfs.datanode.use.datanode.hostname FALSE dfs.datanode.dns.interface default dfs.namenode.backup.http-address :50105 mapred.output.compression.type BLOCK dfs.thrift.timeout 60 mapred.skip.attempts.to.start.skipping 2 kfs.client-write-packet-size ha.zookeeper.acl world:anyone:rwcda 50 Dell Apache Hadoop Performance Analysis
51 mapreduce.job.dir hdfs://dd4-ae e- 31.dell.com:8020/user/hdfs/.staging/job_ _0001 io.map.index.skip 0 net.topology.node.switch.mapping.impl org.apache.hadoop.net.scriptbasedmapping mapred.cluster.max.map.memory.mb -1 fs.s3.maxretries 4 dfs.namenode.logging.level info s3native.client-write-packet-size mapred.task.tracker.task-controller org.apache.hadoop.mapred.defaulttaskcontroll er mapred.userlog.limit.kb 0 hadoop.http.staticuser.user dr.who mapred.input.format.class org.apache.hadoop.examples.terasort.teragen$ RangeInputFormat mapreduce.ifile.readahead.bytes hadoop.http.authentication.simple.anonym TRUE ous.allowed hadoop.fuse.timer.period 5 dfs.namenode.num.extra.edits.retained hadoop.rpc.socket.factory.class.default org.apache.hadoop.net.standardsocketfactory dfs.namenode.handler.count 10 fs.automatic.close TRUE mapreduce.job.submithostaddress dfs.datanode.directoryscan.interval mapred.map.tasks 180 mapred.local.dir.minspacekill 0 mapred.job.map.memory.mb -1 mapred.jobtracker.completeuserjobs.maxi 100 mum mapreduce.jobtracker.split.metainfo.maxsi ze 51 Dell Apache Hadoop Performance Analysis
52 mapred.cluster.max.reduce.memory.mb -1 mapred.cluster.reduce.memory.mb -1 s3native.replication 3 mapred.task.profile mapred.reduce.parallel.copies 10 dfs.heartbeat.interval 3 FALSE dfs.ha.fencing.ssh.connect-timeout local.cache.size net.topology.script.file.name dfs.client.file-block-storagelocations.num-threads 10 jobclient.progress.monitor.poll.interval 1000 dfs.bytes-per-checksum 512 ftp.stream-buffer-size 4096 mapred.fairscheduler.allow.undeclared.po TRUE ols hadoop.security.group.mapping.ldap.searc member h.attr.member dfs.blockreport.initialdelay 0 mapred.min.split.size 0 hadoop.http.authentication.token.validity dfs.namenode.delegation.token.maxlifetime mapred.output.compression.codec org.apache.hadoop.io.compress.defaultcodec /var/run/cloudera-scm-agent/process/705- mapreduce-jobtracker/topology.py io.sort.factor 64 kfs.blocksize mapred.task.timeout mapred.fairscheduler.poolnameproperty user.name dfs.namenode.secondary.http-address :50090 ipc.client.idlethreshold 4000 ipc.server.tcpnodelay FALSE ftp.bytes-per-checksum 512 mapred.output.dir hdfs://dd4-ae e- 31.dell.com:8020/HiBench/Terasort/Input 52 Dell Apache Hadoop Performance Analysis
group.name  hdfs
s3.bytes-per-checksum  512
mapred.heartbeats.in.second  100
fs.s3.block.size
dfs.client.failover.connection.retries  0
mapred.map.output.compression.codec  org.apache.hadoop.io.compress.SnappyCodec
hadoop.rpc.protection  authentication
mapred.task.cache.levels  2
mapred.tasktracker.dns.interface  default
hadoop.security.auth_to_local  DEFAULT
dfs.secondary.namenode.kerberos.internal.spnego.principal  ${dfs.web.authentication.kerberos.principal}
hadoop.proxyuser.hue.hosts  *
ftp.client-write-packet-size
mapred.output.key.class  org.apache.hadoop.io.Text
fs.defaultFS  hdfs://dd4-ae e-31.dell.com:8020
file.client-write-packet-size
mapred.job.reduce.memory.mb  -1
mapred.max.tracker.failures  4
fs.trash.checkpoint.interval  0
mapred.fairscheduler.allocation.file  fair-scheduler.xml
hadoop.http.authentication.signature.secret.file  ${user.home}/hadoop-http-auth-signature-secret
s3native.stream-buffer-size  4096
mapreduce.reduce.shuffle.read.timeout
mapred.tasktracker.tasks.sleeptime-before-sigkill  5000
dfs.namenode.checkpoint.edits.dir  ${dfs.namenode.checkpoint.dir}
fs.permissions.umask-mode  22
mapred.max.tracker.blacklists  4
hadoop.common.configuration.version
jobclient.output.filter  FAILED
hadoop.security.group.mapping.ldap.ssl  FALSE
mapreduce.ifile.readahead  TRUE
io.serializations  org.apache.hadoop.io.serializer.WritableSerialization,org.apache.hadoop.io.serializer.avro.AvroSpecificSerialization,org.apache.hadoop.io.serializer.avro.AvroReflectSerialization
fs.df.interval
io.seqfile.compress.blocksize
mapred.jobtracker.taskScheduler  org.apache.hadoop.mapred.JobQueueTaskScheduler
job.end.retry.attempts  0
ipc.client.connect.max.retries  10
hadoop.security.groups.cache.secs  300
dfs.namenode.delegation.key.update-interval
webinterface.private.actions  FALSE
mapred.tasktracker.indexcache.mb  10
hadoop.security.group.mapping.ldap.search.filter.user  (&(objectClass=user)(sAMAccountName={0}))
mapreduce.reduce.input.limit  -1
dfs.image.compress  FALSE
mapred.output.value.class  org.apache.hadoop.io.Text
tasktracker.http.threads  40
dfs.namenode.kerberos.internal.spnego.principal  ${dfs.web.authentication.kerberos.principal}
fs.s3n.block.size
mapred.job.tracker.handler.count  10
fs.ftp.host
keep.failed.task.files  FALSE
mapred.output.compress  FALSE
hadoop.security.group.mapping  org.apache.hadoop.security.ShellBasedUnixGroupsMapping
mapred.jobtracker.job.history.block.size
mapred.skip.reduce.max.skip.groups  0
dfs.datanode.address  :
dfs.datanode.https.address  :50475
file.replication  1
dfs.datanode.drop.cache.behind.reads  FALSE
hadoop.fuse.connection.timeout  300
mapred.jar  /user/hdfs/.staging/job_ _0001/job.jar
hadoop.work.around.non.threadsafe.getpwuid  FALSE
mapreduce.client.genericoptionsparser.used  TRUE
hadoop.tmp.dir  /tmp/hadoop-${user.name}
dfs.client.block.write.replace-datanode-on-failure.policy  DEFAULT
mapred.line.input.format.linespermap  1
hadoop.kerberos.kinit.command  kinit
dfs.webhdfs.enabled  FALSE
dfs.datanode.du.reserved  0
file.bytes-per-checksum  512
dfs.thrift.socket.timeout
mapred.local.dir.minspacestart  0
mapred.jobtracker.maxtasks.per.job  -1
dfs.client.block.write.replace-datanode-on-failure.enable  TRUE
dfs.thrift.threads.min  10
mapred.user.jobconf.limit
mapred.reduce.max.attempts  4
net.topology.script.number.args  100
dfs.namenode.decommission.interval  30
mapred.job.tracker  dd4-ae e-31.dell.com:8021
dfs.image.compression.codec  org.apache.hadoop.io.compress.DefaultCodec
dfs.namenode.support.allow.format  TRUE
hadoop.ssl.hostname.verifier  DEFAULT
mapred.tasktracker.instrumentation  org.apache.hadoop.mapred.TaskTrackerMetricsInst
io.mapfile.bloom.error.rate
dfs.permissions.superusergroup  supergroup
mapred.tasktracker.expiry.interval
hadoop.proxyuser.hue.groups  *
io.sort.record.percent
mapred.job.tracker.persist.jobstatus.active  FALSE
dfs.namenode.checkpoint.check.period  60
io.seqfile.local.dir  ${hadoop.tmp.dir}/io/local
tfile.io.chunk.size
file.blocksize
hadoop.proxyuser.oozie.groups  *
mapreduce.job.acl-modify-job
io.skip.checksum.errors  FALSE
dfs.namenode.edits.journal-plugin.qjournal  org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager
mapred.temp.dir  ${hadoop.tmp.dir}/mapred/temp
dfs.datanode.handler.count  10
dfs.namenode.decommission.nodes.per.interval  5
fs.ftp.host.port  21
dfs.namenode.checkpoint.period  3600
dfs.namenode.fs-limits.max-component-length  0
fs.AbstractFileSystem.viewfs.impl  org.apache.hadoop.fs.viewfs.ViewFs
dfs.datanode.ipc.address  :50020
mapred.working.dir  hdfs://dd4-ae e-31.dell.com:8020/user/hdfs
hadoop.ssl.require.client.cert  FALSE
dfs.datanode.max.transfer.threads  4096
mapred.job.reduce.input.buffer.percent  0
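Properties such as those listed above are supplied to the cluster through Hadoop's standard XML configuration files (core-site.xml, hdfs-site.xml and mapred-site.xml). As a minimal sketch of the convention, the map-output compression and TaskTracker HTTP thread settings from the listing would be expressed in mapred-site.xml as follows; the two name/value pairs are taken from the table, while the surrounding file layout is the generic Hadoop format rather than the exact file deployed on this cluster:

  <configuration>
    <!-- Compress intermediate map output with Snappy (value from the listing above) -->
    <property>
      <name>mapred.map.output.compression.codec</name>
      <value>org.apache.hadoop.io.compress.SnappyCodec</value>
    </property>
    <!-- Worker threads serving map output to reducers during the shuffle -->
    <property>
      <name>tasktracker.http.threads</name>
      <value>40</value>
    </property>
  </configuration>

On a running cluster, an individual effective value can be cross-checked with the getconf utility, for example: hdfs getconf -confKey dfs.datanode.max.transfer.threads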