A Benchmarking Case Study of Virtualized Hadoop Performance on VMware vSphere 5
Performance Study
Table of Contents

Executive Summary
Introduction
Why Virtualize?
Test Setup
Hadoop Benchmarks
    Pi
    TestDFSIO
    TeraSort
Benchmark Results
Analysis
    Pi
    TestDFSIO
    TeraSort
Conclusion
Appendix: Configuration
References
About the Author
Acknowledgements
Executive Summary

The performance of three Hadoop applications is reported for several virtual configurations on VMware vSphere 5 and compared to native configurations. A well-balanced seven-node AMAX ClusterMax system was used to show that the average performance difference between native and the simplest virtualized configurations is only 4%. Further, the flexibility enabled by virtualization to create multiple Hadoop nodes per host can be used to achieve performance significantly better than native.

Introduction

In recent years the amount of data stored worldwide has exploded, increasing by a factor of nine in the last five years [1]. Individual companies often have petabytes or more of data, and buried in this is business information that is critical to continued growth and success. However, the quantity of data is often far too large to store and analyze in traditional relational database systems, or the data are in unstructured forms unsuitable for structured schemas, or the hardware needed for conventional analysis is simply too costly. Even when an RDBMS is suitable for the actual analysis, the sheer volume of raw data can create issues for data preparation tasks like data integration and ETL. As the size and value of the stored data increase, the importance of reliable backups also increases and tolerance of hardware failures decreases. The potential value of insights that can be gained from a particular set of information may be very large, but these insights are effectively inaccessible if the IT costs to reach them are greater still.

Hadoop evolved as a distributed software platform for managing and transforming large quantities of data, and has grown to be one of the most popular tools to meet many of the above needs in a cost-effective manner. By abstracting away many of the high availability (HA) and distributed programming issues, Hadoop allows developers to focus on higher-level algorithms. Hadoop is designed to run on a large cluster of commodity servers and to scale to hundreds or thousands of nodes. Each disk, server, network link, and even rack within the cluster is assumed to be unreliable. This assumption allows the use of the least expensive cluster components consistent with delivering sufficient performance, including the use of unprotected local storage (JBODs). Hadoop's design and its ability to handle large amounts of data efficiently make it a natural fit as an integration, data transformation, and analytics platform. Hadoop use cases include:

- Customizing content for users: Creating a better user experience through targeted and relevant ads, personalized home pages, and good recommendations.
- Supply chain management: Examining all available historical data enables better decisions for stocking and managing goods. Among the many sectors in this category are retail, agriculture, hospitals, and energy.
- Fraud analysis: Analyzing transaction histories in the financial sector (for example, for credit cards and ATMs) to detect fraud.
- Bioinformatics: Applying genome analytics and DNA sequencing algorithms to large datasets.
- Beyond analytics: Transforming data from one form to another, including adding structure to unstructured data in log files before combining them with other structured data.
- Miscellaneous uses: Aggregating large sets of images from various sources and combining them into one (for example, satellite images), and moving large amounts of data from one location to another.
Hadoop comprises two basic components: a distributed file system (inspired by the Google File System [2]) and a computational framework (based on Google's MapReduce [3]). In the first component, data is stored in the Hadoop Distributed File System (HDFS). The file system namespace in the Apache open-source version of HDFS (and in the Cloudera distribution used here) is managed by a single NameNode, while a set of DataNodes do the work of storing and retrieving data. HDFS also manages the replication of data blocks. The exception to HA in Hadoop is the NameNode. While Hadoop provides mechanisms to protect the data, the NameNode is a single point of failure, so in production clusters it is advisable to isolate it and take other measures to enhance its availability (this was not done for this paper since the focus was on performance). There is also a Secondary NameNode, which keeps a
copy of the NameNode data to be used to restart the NameNode in case of failure, although this copy may not be current, so some data loss is still likely to occur.

The other basic component of Hadoop is MapReduce, which provides a computational framework for data processing. MapReduce programs are inherently parallel and thus well suited to a distributed environment. A single JobTracker schedules all the jobs on the cluster, as well as the individual tasks. Here, each benchmark test is a job and runs by itself on the cluster. A job is split into a set of tasks that execute on the worker nodes. A TaskTracker running on each worker node is responsible for starting tasks and reporting progress to the JobTracker. As the name implies, there are two phases in MapReduce processing: map and reduce. Both use key-value pairs defined by the user as input and output. This allows the output of one job to serve directly as input for another; Hadoop is often used in that way, including by several of the tests here. Both phases execute through a set of tasks. Note that the numbers of map and reduce tasks are independent, and that neither set of tasks is required to use all the nodes in the cluster. Figure 1 shows a simplified view of a typical dataflow with just one task per node.

Figure 1. Simplified MapReduce Dataflow with One Task Per Node
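To make the key-value dataflow concrete, the following is a minimal word-count sketch written against the old org.apache.hadoop.mapred API used by CDH3-era Hadoop. It is the classic illustrative example, not one of the benchmarks studied in this paper; the class and job names are arbitrary.

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class WordCount {
  // Map phase: input pairs are (byte offset, line of text);
  // output pairs are (word, 1).
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    public void map(LongWritable key, Text value,
        OutputCollector<Text, IntWritable> out, Reporter reporter)
        throws IOException {
      StringTokenizer tok = new StringTokenizer(value.toString());
      while (tok.hasMoreTokens()) {
        word.set(tok.nextToken());
        out.collect(word, ONE);
      }
    }
  }

  // Reduce phase: the shuffle delivers (word, [1, 1, ...]); the reducer
  // sums the counts and writes (word, total) to HDFS.
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
        OutputCollector<Text, IntWritable> out, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      out.collect(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(Map.class);
    conf.setReducerClass(Reduce.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}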
Hadoop splits the input data into suitably sized chunks for processing by the map tasks. These splits need to be large enough to minimize the overhead of managing files and tasks, but small enough that there is enough parallel work to do. The map task output is sorted by key and written not to HDFS, but to local disk (shown in gray). It is then sent to the appropriate reduce task. This is known as the shuffle phase and is typically an all-to-all operation that often stresses the bandwidth of the network interconnect. The reduce tasks do the final processing and the output is written to HDFS (shown in blue). If there is more than one reducer, the output is partitioned. A good reference that gives more details on MapReduce, HDFS, and other components is Hadoop: The Definitive Guide [4].

Since Hadoop jobs can be very large, with some requiring hours on large clusters to execute, the resource cost can be significant. The cost of a job is inversely related to the throughput Hadoop is able to deliver, and therefore performance is of utmost importance. CPU, memory, network throughput, and storage capacity and throughput all need to be sized appropriately to achieve a balanced system, and the software layer needs to be able to fully utilize all these resources. In Hadoop, this means finding enough parallel work to keep all the computational nodes busy, and scheduling tasks to run such that the data they need are nearly always local (often referred to as moving the computation to the data, rather than moving data to the computational resources). A small overhead or scheduling inefficiency in the virtualization, operating system, or application layers can add up to a large cumulative cost and therefore needs to be understood and quantified.

Why Virtualize?

Hadoop is a modern application with features such as consolidation of jobs and HA that overlap with capabilities enabled by virtualization. This leads some to believe there is no motivation for virtualizing Hadoop; however, there are a variety of reasons for doing so. Some of these are:

- Scheduling: Taking advantage of unused capacity in existing virtual infrastructures during periods of low usage (for example, overnight) to run batch jobs.
- Resource utilization: Co-locating Hadoop VMs and other kinds of VMs on the same hosts. This often allows better overall utilization by consolidating applications that use different kinds of resources.
- Storage models: Although Hadoop was developed with local storage in mind, it can just as easily use shared storage for all data, or a hybrid model in which temporary data is kept on local disk and HDFS is hosted on a SAN. With either of these configurations, the unused shared storage capacity and bandwidth within the virtual infrastructure can be given to Hadoop jobs.
- Datacenter efficiency: Virtualizing Hadoop can increase datacenter efficiency by increasing the types of workloads that can be run on a virtualized infrastructure.
- Deployment: Virtualization tools ranging from simple cloning to sophisticated products like VMware vCloud Director can speed up the deployment of Hadoop nodes.
- Performance: Virtualization enables the flexible configuration of hardware resources.

The last item above is the main focus of this paper. Many applications can efficiently use many nodes in a cluster (they scale out well), but lack good SMP scaling (they scale up poorly). Many Web servers, mail servers, and Java application servers fall into this category.
This limitation can usually be overcome in a virtual environment by running several smaller VMs per host. Whatever the reason for virtualizing Hadoop, it is important to understand the performance implications of such a decision to ensure the cost will be reasonable and that there will be sufficient resources. While some of the relevant issues were investigated for this report, many more were left for future work. Among these are the trade-offs of different kinds of storage, network tuning, larger scale-out, and more applications.
Test Setup

The hardware configuration of the test cluster is shown in Figure 2. Seven host servers of an AMAX ClusterMax system were connected to a single Mellanox 10GbE switch. Each host was equipped with two Intel Xeon X5650 2.66GHz 6-core processors, 12 internal 500GB 7,200 RPM SATA disks, and a 10GbE Mellanox adapter. The disks were connected to a PCIe 1.1 x4 storage controller. This controller has a theoretical throughput of 1 GB/s, but in practice delivers less. Power states were disabled in the BIOS to improve consistency of results. Intel TurboMode was enabled but it never engaged (probably because of the high utilization of all the cores). Intel Hyper-Threading (HT) Technology was either enabled or disabled in the BIOS depending on the test performed. Complete hardware details are given in the Appendix.

Figure 2. Cluster Hardware Configuration (seven hosts, each with 2X Intel Xeon X5650, 96GB memory, 12X 500GB SATA disks, and a Mellanox 10GbE adapter, connected to a Mellanox 10GbE switch)

Two of the 12 internal disks in each host were mirrored, divided into LUNs, and used to install the native operating system and the ESXi hypervisor. During the tests, the I/O to these disks consisted of just logs and job statistics. The other ten disks were configured as JBODs and used exclusively for Hadoop data. A single aligned partition was created on each disk and formatted with EXT4 (a shell sketch of this preparation appears at the end of this section). For the virtualized tests, these disks were passed through to the VMs using physical raw device mappings; this ensured both native and virtual tests used the same underlying file systems.

RHEL 6.1 x86_64 was used as the operating system (OS) in all tests. The native and virtual OS configurations were nearly the same, but there were some differences. The native tests used all the CPUs and memory of the machine. In the virtual tests, all the CPUs were split between the VMs (that is, CPUs were exactly committed) while a few gigabytes of memory were held back for the hypervisor and VM memory overhead (slightly undercommitted). The Mellanox MLNX_EN (version 1.5.6) driver was installed in the native OS, with enable_sys_tune turned on (this significantly lowers network latency). In ESXi, a development version of this driver was used; the latter is expected to be released soon. The virtual adapter was vmxnet3. Default networking parameters were used both in ESXi and in the guest OS. Sun Java 6 is recommended for Hadoop; version 1.6.0_25 was used here. Two kernel tunables were increased. These and other OS details are given in the Appendix.
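A minimal sketch of the per-disk preparation described above, assuming the ten data disks appear as /dev/sdc through /dev/sdl (hypothetical device names) and using the EXT4 options listed in the Appendix:

# Prepare each Hadoop data disk: one 64KB-aligned partition, formatted EXT4.
# Device names are hypothetical; adjust for the actual system.
for dev in /dev/sd{c..l}; do
    parted -s "$dev" mklabel msdos
    parted -s "$dev" mkpart primary 64KiB 100%
    # 4KB block size, one inode per 4MB, 2% reserved blocks
    mkfs.ext4 -b 4096 -i 4194304 -m 2 "${dev}1"
done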
The Cloudera CDH3u0 distribution of Apache Hadoop was installed on all machines. Replication was set to 2, as is common for small clusters (large clusters should use 3 or more). The HDFS block size was increased to 128MB from the 64MB default. Most of the machine memory was used for the Java heaps of the Hadoop tasks. In the native case, the maximum heap per task was set to 2048MB. Slightly less, 1900MB, was used for the virtual tests since the total VM memory allocated was smaller. A few other Hadoop parameters were changed from their defaults and are listed in the Appendix. Probably the most important Hadoop parameters are the numbers of map and reduce tasks. The number of each was different for each benchmark but was the same when comparing native configurations and the various virtual tests. This leaves open the possibility that the particular number of tasks used was closer to optimal on one platform (native or virtual) than another; this was not investigated here. However, it was observed that the performance of the heavy I/O tests was not very sensitive to the number of tasks.

All of the hosts/VMs were used as Hadoop worker nodes. Each worker node ran a DataNode, a TaskTracker, and a set of map and reduce tasks. The NameNode and JobTracker were run on the first worker node, and the Secondary NameNode was run on the second worker node. It is often advised to run these three management processes on machines separate from the worker nodes for reliability reasons; in large clusters it is necessary to do so for performance reasons. However, in small clusters these processes take very little CPU time, so dedicating a machine to them would reduce overall resource utilization and lower performance.

In the 2-VM/host tests, all the hardware resources of a host were evenly divided between the VMs. In addition, each VM was pinned to a NUMA node (which for these machines is a socket plus the associated memory) for better reproducibility of the results. The native, 1-VM, and 2-VM tests are described as homogeneous because the machines/VMs are all the same size. For the 4-VM/host tests a heterogeneous configuration was chosen in which two small and two large VMs were used on each host. The small VMs were assigned two data disks, five vCPUs, and 18400MB each, while the large VMs were assigned three data disks, seven vCPUs, and 27600MB. One of each was pinned to each NUMA node. The number of tasks per Hadoop node was then configured to be proportional either to the number of vCPUs (for computational tests) or to the number of disks (for I/O tests). Clearly, the use of heterogeneous configurations raises many issues, of which only a few will be discussed here.

As mentioned previously, all of the present tests were performed with replication=2. This means each original block is copied to one other worker node. However, when multiple VMs per host are used, there is the undesirable possibility, from a single-point-of-failure perspective, that the replica node chosen is another VM on the same host. Hadoop manages replica placement based on the network topology of a cluster. Since Hadoop does not discover the topology itself, the user needs to describe it in a hierarchical fashion as part of the input (the default is a flat topology). Hadoop uses this information both to minimize long-distance network transfers and to maximize availability, including tolerating rack failures. Virtualization adds another layer to the network topology that needs to be taken into account when deploying a Hadoop cluster: inter-VM communication on each host.
Equating the set of VMs running on a particular host to a unique rack ensures that at least the first replica of every block is sent to a VM on another host. From a benchmarking point of view, defining an appropriate topology also ensures that the multiple-VM cases do not make replicas local to their hosts and thereby (inappropriately) gain a performance advantage over the single-VM and native configurations.
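As a sketch of how such a topology can be supplied: the standard Hadoop mechanism is a user-provided executable named by the topology.script.file.name property, which maps worker addresses to rack paths. The VM naming pattern below is hypothetical for this cluster, and Hadoop may pass IP addresses rather than hostnames, in which case the mapping logic would need a lookup step.

#!/bin/sh
# topology.sh: map each worker to a "rack" named after its physical host,
# so Hadoop treats all VMs on one host as a single rack.
# Hostname pattern (host3-vm1, host3-vm2, ...) is hypothetical.
# Enabled in the Hadoop configuration with, for example:
#   topology.script.file.name=/etc/hadoop/conf/topology.sh
for node in "$@"; do
    # Strip the per-VM suffix so all VMs on one host share a rack
    rack=$(echo "$node" | sed 's/-vm[0-9]*$//')
    echo "/rack-${rack}"
done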
Hadoop Benchmarks

Several benchmarks were created from three of the example applications included with the Cloudera distribution: Pi, TestDFSIO, and TeraSort. Table 1 summarizes the maximum number of simultaneous map and reduce tasks across the cluster for the various benchmarks. These were the same for native and all the virtual configurations.

BENCHMARK            MAP (HT on / HT off)    REDUCE (HT on / HT off)
Pi                   168 / 84                1 / 1
TestDFSIO-write      140 / 140               1 / 1
TestDFSIO-read       140 / 140               1 / 1
TeraGen              280 / 280               0 / 0
TeraSort             280 / 280               70 / 70
TeraValidate         70 / 70                 1 / 1

Table 1. Maximum Total Number of Simultaneous Tasks per Job

Pi

Pi is a purely computational application that employs a Monte Carlo method to estimate the value of pi. It is very nearly "embarrassingly parallel": the map tasks are all independent, and the single reduce task gathers very little data from the map tasks. There is little network traffic or storage I/O. All the results reported here calculate 1.68 trillion samples. These are spread across 168 total map tasks (HT enabled) or 84 map tasks (HT disabled). For both homogeneous and heterogeneous cases, the number of map tasks per Hadoop node is equal to the number of CPUs or vCPUs. All the map tasks start at the same time and the job finishes when the last map task completes (plus a trivial amount of time for the reduce task). It is important that all map tasks run at the same speed and finish at about the same time for good performance; one slow map task can have a significant effect on the elapsed time of the job.

TestDFSIO

TestDFSIO is a storage throughput test that is split into two parts here: TestDFSIO-write writes about 1TB of data to HDFS, and TestDFSIO-read reads it back in. Because of the replication factor, the write test does twice as much I/O as the read test and generates substantial networking traffic. A total of 140 map tasks was found to be close to optimal for the HT-disabled case: 70 map tasks were not enough, and 210 gave close to the same performance as 140. The same number of tasks was used for the HT-enabled tests since additional CPU resources were not expected to help this I/O-dominated benchmark. The number of map tasks per worker node in the heterogeneous case was made proportional to the number of available disks (that is, four for the small VMs, six for the large).

TeraSort

TeraSort sorts a large number of 100-byte records. It does considerable computation, networking, and storage I/O, and is often considered to be representative of real Hadoop workloads. It is split into three parts: generation, sorting, and validation. TeraGen creates the data and is similar to TestDFSIO-write except that significant computation is involved in creating the random data. The map tasks write directly to HDFS, so there is no reduce phase. TeraSort does the actual sorting and writes sorted data to HDFS in a number of partitioned files. The application itself overrides the specified replication factor so only one copy is written. The philosophy is that if a data disk is lost, the user can always rerun the application, but input data needs replication since it may not be as easily recovered. TeraValidate reads all the sorted data to verify that it is in order. The map tasks do this for each file independently, and then the single reduce task checks that the last record of each file comes before the first record of the next file. (Example invocations of all three benchmarks are sketched below.)
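For reference, the following shell sketch shows how these jobs are typically launched with the example jars that ship with CDH3. The jar paths are assumptions, and the TestDFSIO per-file size is an assumption chosen only to be consistent with roughly 1TB over 140 files; the Pi sample counts follow from the figures above (168 maps of 10 billion samples each gives the 1.68 trillion total, with the estimate pi ~= 4 x points-inside / points-total).

# Pi: 168 map tasks x 10,000,000,000 samples each = 1.68 trillion samples
hadoop jar /usr/lib/hadoop/hadoop-examples.jar pi 168 10000000000

# TestDFSIO: write about 1TB as 140 files, then read it back
# (per-file size here is an assumption consistent with ~1TB over 140 maps)
hadoop jar /usr/lib/hadoop/hadoop-test.jar TestDFSIO -write -nrFiles 140 -fileSize 7300
hadoop jar /usr/lib/hadoop/hadoop-test.jar TestDFSIO -read  -nrFiles 140 -fileSize 7300

# TeraSort pipeline: 10 billion 100-byte rows = the "1TB" dataset.
# Task counts follow Table 1 (280 maps for TeraGen, 70 reduces for TeraSort).
hadoop jar /usr/lib/hadoop/hadoop-examples.jar teragen -Dmapred.map.tasks=280 10000000000 /tera/in
hadoop jar /usr/lib/hadoop/hadoop-examples.jar terasort -Dmapred.reduce.tasks=70 /tera/in /tera/out
hadoop jar /usr/lib/hadoop/hadoop-examples.jar teravalidate /tera/out /tera/report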
Results are reported for two sizes that often appear in published Hadoop tests: 10 billion records ("1TB") and 35 billion records ("3.5TB"). A total of 280 map tasks was used for TeraGen, 280 simultaneous (several thousand in all) map tasks and 70 reduce tasks for TeraSort, and 70 map tasks for TeraValidate.

Benchmark Results

Table 2 and Table 3 present elapsed time results with HT disabled and enabled, respectively.

BENCHMARK            NATIVE     1 VM       2 VMs
Pi
TestDFSIO-write
TestDFSIO-read
TeraGen 1TB
TeraSort 1TB         2,995      3,127      3,110
TeraValidate 1TB
TeraGen 3.5TB        2,328      2,504      2,030
TeraSort 3.5TB       13,460     14,863     12,494
TeraValidate 3.5TB   2,783      2,745      2,552

Table 2. Elapsed Time in Seconds with HT Disabled (lower is better; number of VMs shown is per host)

BENCHMARK            NATIVE     1 VM       2 VMs      4 VMs
Pi
TestDFSIO-write
TestDFSIO-read
TeraGen 1TB
TeraSort 1TB         2,629      3,001      2,792      2,450
TeraValidate 1TB
TeraGen 3.5TB        2,289      2,517      1,968      2,302
TeraSort 3.5TB       13,712     14,353     12,673     13,054
TeraValidate 3.5TB   2,460      2,563      2,324      3,629

Table 3. Elapsed Time in Seconds with HT Enabled (lower is better; number of VMs shown is per host)

Figure 3 and Figure 4 show the elapsed time results for the virtual cases normalized to the corresponding native cases with HT disabled and enabled, respectively.
Figure 3. Elapsed Time with HT Disabled Normalized to the Native Case (lower is better)

Figure 4. Elapsed Time with HT Enabled Normalized to the Native Case (lower is better)
Analysis

Pi

Pi is 4-10% faster for all of the virtual cases than for the corresponding native cases. This is an unexpected result, since pure CPU applications normally show a 1-5% reduction in performance when virtualized. There is a small contribution from map tasks not executing at the same rate (a function of the Linux and ESXi schedulers), which is reflected in how long the reduce task runs (the reduce task starts running when the first map task finishes). Differences in the schedulers may account for 1-2% of the differences in performance. Another Hadoop statistic shows the sum of the elapsed times of all the map tasks. This is up to 9% less for the virtual cases. Together with 100% CPU utilization in all cases, this means that each map task is running faster in a VM. Investigations are ongoing to better understand this behavior. Enabling HT reduces the elapsed time by 12% for the native case and 15-18% for the virtual cases.

TestDFSIO

TestDFSIO is a stress test for high storage throughput applications. The results show that in three out of the four 1-VM cases (write/read, HT disabled/enabled), virtual is 7-10% slower than the corresponding native cases. In the fourth 1-VM case and all the multi-VM cases, virtual is 4-13% faster. In this kind of test (sequential throughput, lightly loaded processors) it is expected that throughput would be strictly limited by hardware (in this case the storage controller), and that the variation among software platforms would be much less. A partial explanation for the observed behavior can be seen in Figure 5, which shows the total write throughput on one of the hosts with different numbers of VMs and HT enabled. The average throughput is lower for the 1-VM case not because the peak throughput is lower, but because the throughput is much noisier in time. The data shown were collected with esxtop at 20-second intervals (a batch-mode sketch follows below). With finer intervals, throughput actually drops to zero for a few seconds at a time. These drops are synchronized across all the nodes in the cluster, indicating they are associated with the application and not the underlying platform. With two VMs per host, the throughput is much more consistent, and even more so with four VMs. The native case exhibits noise comparable to the single-VM case. It is not understood why more VMs improve the throughput behavior. One possibility is that a single Hadoop node has difficulty scheduling I/O fairly to ten disks and that it can do a better job when fewer disks are presented to it. Although the throughput is substantially higher for four VMs than for two VMs for much of the duration of the test, the former finishes only 2% sooner. This is because the heterogeneity of the 4-VM case makes it difficult for all the tasks to finish at about the same time, as they do in the 2-VM case. By adjusting the number of tasks per node, the test writes a number of original files to each node in proportion to the number of disks it has. However, the replicas are spread evenly across all the nodes, destroying this proportionality. Heterogeneous Hadoop nodes have the potential to be useful, but the trade-offs will be difficult to understand. HT does not have a consistent effect on performance, which is to be expected for a workload with low CPU utilization.
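A minimal sketch of such a collection, using esxtop batch mode with the 20-second sampling described above (the sample count and output name are arbitrary):

# Batch mode: -b, 20-second delay between samples (-d), 720 samples (-n).
# The resulting CSV can then be filtered for disk throughput counters.
esxtop -b -d 20 -n 720 > testdfsio-write.csv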
Figure 5. TestDFSIO Write Throughput on One of the Hosts (HT enabled)

TeraSort

The three TeraSort benchmarks also show excellent performance overall when virtualized. There is a significant amount of noise in the execution times of all three tests on both native and virtual platforms, especially TeraValidate. This is partly due to the greater complexity of the applications, but mostly due to the "long pole" effect, where a single long-running task can have an outsized effect on the total elapsed time. Using a 3.5TB dataset reduces these effects, making the tests more repeatable.

TeraGen writes almost an identical amount of data to disk as TestDFSIO-write and has similar elapsed times for all cases. The CPU utilization is considerably higher (about 50% compared to 30%) but not high enough to affect the elapsed time. As with TestDFSIO-write, the write behavior of TeraGen is sensitive to the number of VMs. The 4-VM cases have the highest sustained throughput for most of the duration of the tests. They take longer to finish than the 2-VM cases because of the heterogeneity of the configuration (see the paragraph at the end of this section).

TeraSort is the most resource-intensive of the tests presented here: it combines high CPU utilization, high storage throughput, and moderate networking bandwidth (1-2 Gb/sec per host). Using a single VM per host, the virtual cases are 4-14% slower than the corresponding native cases. The average performance reduction of 8% is reasonable for I/O-heavy applications. There does not seem to be a pattern with respect to HT or the size of the dataset. With two VMs per host, the performance improves significantly compared to the native cases: it becomes 4-6% slower for the 1TB dataset and 7-8% faster for the 3.5TB dataset. HT improves performance by 4-12% for 1TB but doesn't have a significant effect for 3.5TB. Partitioning each host into four VMs greatly improves the 1TB case and slightly degrades the 3.5TB case compared to two VMs.
These observations and the lack of strong patterns can be explained by considering CPU utilization, shown in Figure 6 for the 3.5TB TeraSort test with HT enabled. The map tasks and shuffle phase run up to the cliff at about 8,000 seconds, and then the reduce tasks run at relatively low utilization. It's the long-pole effect of the relatively few and long-running reduce tasks that causes much of the elapsed time variation. For instance, the 1-VM case finished the map/shuffle phase in the same amount of time as the native case, and most of the reduce tasks earlier, but it ends up finishing the whole job later. Two VMs clearly improve execution speed at approximately the same CPU cost. Four VMs also improve the map/shuffle performance, but with a large decrease in CPU utilization. This is offset by reduced performance in the reduce phase. The reason for the decrease in utilization requires further investigation.

The large differences in performance make it worthwhile to investigate different virtual configurations to find the optimal one. The choice of performance metric can have as much of an impact on finding the optimal configuration as the performance based on any one metric. For example, in a job-consolidation environment (running multiple jobs under Hadoop, or other kinds of applications in other VMs simultaneously), the total integrated CPU utilization is often a better performance metric than job elapsed time, since it is the former that determines how many jobs can run in a given time period (a small sketch of computing such a metric is shown below). With such a metric, the performance advantage of four VMs over the native configuration would be much greater than the 5% indicated by the job elapsed time. The 1TB case leads to similar conclusions, although the metrics are very different. The map/shuffle phase runs at 96-99% CPU utilization for all the virtual cases and about 75% for the native case. The gains in efficiency with multiple VMs are reflected more in reductions in elapsed time; the 4-VM case finishes the map/shuffle phase (where most of the CPU utilization occurs) 24% faster than the native case (compared to 8% faster in the 3.5TB case).

Figure 6. 3.5TB TeraSort %CPU Utilization with HT Enabled on One of the Hosts (lower is better)

TeraValidate is mostly a sequential read test; however, the relatively high run-to-run variability makes it difficult to compare to TestDFSIO-read. The difference between HT enabled and disabled for the 1TB case does not appear to be significant. The long execution time of the 3.5TB case with four VMs is significant and is discussed below. Otherwise, the native and virtual cases are quite comparable.
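To illustrate the integrated CPU utilization metric mentioned above, a sketch only: assuming a two-column "time,percent" file of per-interval host %CPU samples (a hypothetical reduction of the esxtop batch output), the integrated cost is the sum of utilization fraction times interval length.

# cpu.csv: one host %CPU sample per 20-second interval (hypothetical input).
# Integrated cost = sum(utilization fraction x interval), in host-CPU-seconds.
awk -F, 'NR > 1 { cost += ($2 / 100) * 20 }
         END    { printf "integrated cost: %.0f host-CPU-seconds\n", cost }' cpu.csv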
One of the features of Hadoop is speculative execution. If one task is running slowly and another node has task slots available, Hadoop may speculatively start the same task on the second node. It will then keep the results from the task that finishes first and kill the laggard. This strategy is very effective if the reason for the slowness is some kind of system problem on a node. The extreme case is node failure, where Hadoop will simply restart the tasks on another node. However, if all the nodes are healthy, speculative execution may slow down the progress of the job by over-taxing resources. This is what happened for all three of the 3.5TB TeraSort tests with four VMs. The heterogeneity of this configuration leads to differences in task execution, which in turn leads to speculative execution. The effect was largest for TeraValidate: starting a second identical task is usually harmful because there is only one copy of the data that needs to be read. By disabling speculative execution, TeraGen and TeraValidate improved by 11% and 31% respectively, to become just 5% and 7% slower than the corresponding 2-VM cases. TeraSort improved only very slightly; this is probably due to the large number of relatively small tasks helping to keep the task slots full.

Conclusion

These results show that virtualizing Hadoop can be compelling for performance reasons alone. With a single VM per host, the average increase in elapsed time across all the benchmarks over the native configuration is 4%. This is a small price to pay for all the other advantages offered by virtualization. Running two or four smaller VMs on each two-socket machine results in average performance better than the corresponding native configuration; some cases were up to 14% faster. Using an integrated CPU utilization metric, which is more relevant in many environments, may yield even greater savings. The results presented here represent a small part of ongoing work surrounding virtualized Hadoop. Future papers will deal with alternative storage strategies, ease of use, larger clusters, higher-level applications, and many other topics.
Appendix: Configuration

Hardware
- Cluster: AMAX ClusterMax with seven 2-socket hosts
- Host CPU and memory: 2X Intel Xeon X5650 processors, 2.66 GHz, 12MB cache; 96 GB, 1333 MHz, 12 DIMMs
- Host storage controller: Intel RAID Controller SROMBSASMR, PCIe 1.1 x4 host interface, 128 MB cache, Write Back enabled
- Host hard disks: 12X Western Digital WD5003ABYX, SATA, 500GB, 7200 RPM; 2 mirrored disks divided into LUNs for native OS and ESXi root drives and VM storage; 10 disks configured as JBODs for Hadoop data
- Host file system: single Linux partition aligned on 64KB boundary per hard disk; partition formatted with EXT4 (4KB block size, 4MB per inode, 2% reserved blocks)
- Host network adapter: Mellanox ConnectX VPI (MT26418), PCIe 2.0 5GT/s, 10GbE
- Host BIOS settings: Intel Hyper-Threading Technology enabled or disabled as required; C-states disabled; Enhanced Intel SpeedStep Technology disabled
- Network switch: Mellanox Vantage 6048, 48 ports, 10GbE

Linux
- Distribution: RHEL 6.1 x86_64
- Kernel parameters: nofile=16384, nproc=4096
- Java: Sun Java 1.6.0_25

Native OS
- CPU, memory, and disks: same as Hardware
- Network driver: Mellanox MLNX_EN (version 1.5.6), enable_sys_tune=1

Hypervisor
- vSphere 5.0 RTM
- Development version of the Mellanox network driver
Virtual Machines
- VMware Tools: installed
- Virtual network adapter: vmxnet3
- Virtual SCSI controller: LSI Logic Parallel
- Disks: physical raw device mappings
- 1 VM per host: 92000MB, 24 vCPUs (HT enabled) or 12 vCPUs (HT disabled), 10 disks
- 2 VMs per host: 46000MB, 12 vCPUs (HT enabled) or 6 vCPUs (HT disabled), 5 disks; one VM pinned to each socket
- 4 VMs per host (HT enabled only): small VMs 18400MB, 5 vCPUs, 2 disks; large VMs 27600MB, 7 vCPUs, 3 disks; one small and one large VM pinned to each socket

Hadoop
- Distribution: Cloudera CDH3u0 (Hadoop 0.20.2)
- NameNode, JobTracker: node0
- Secondary NameNode: node1
- Workers: all machines/VMs
- Non-default parameters (see Table 1 for the numbers of tasks):
    dfs.datanode.max.xcievers=4096
    dfs.replication=2
    dfs.block.size=134217728
    io.file.buffer.size=
    mapred.child.java.opts="-Xmx2048m -Xmn512m" (native)
    mapred.child.java.opts="-Xmx1900m -Xmn512m" (virtual)
- Cluster topology: native and 1 VM per host, all nodes in the default rack; 2 and 4 VMs per host, each host is a unique rack
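As a sketch of how the non-default parameters above are typically set in CDH3, split across hdfs-site.xml and mapred-site.xml (the file split and XML framing are standard Hadoop practice; only the values listed above come from this study):

<?xml version="1.0"?>
<!-- hdfs-site.xml: non-default HDFS parameters used in these tests -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>          <!-- two copies of each block; small-cluster setting -->
  </property>
  <property>
    <name>dfs.block.size</name>
    <value>134217728</value>  <!-- 128MB blocks, up from the 64MB default -->
  </property>
  <property>
    <name>dfs.datanode.max.xcievers</name>
    <value>4096</value>       <!-- raise the DataNode transceiver thread limit -->
  </property>
</configuration>

<?xml version="1.0"?>
<!-- mapred-site.xml: per-task JVM heap (native value shown) -->
<configuration>
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx2048m -Xmn512m</value>
  </property>
</configuration>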
References

1. IDC study, sponsored by EMC, 2011.
2. Ghemawat, Sanjay, Howard Gobioff, and Shun-Tak Leung. "The Google File System." SOSP '03, ACM, October 2003.
3. Dean, Jeffrey, and Sanjay Ghemawat. "MapReduce: Simplified Data Processing on Large Clusters." OSDI '04, 2004.
4. White, Tom. Hadoop: The Definitive Guide. O'Reilly Media.

About the Author

Jeff Buell is a Staff Engineer in the Performance Engineering group at VMware. His work focuses on the performance of various virtualized applications, most recently in the HPC and Big Data areas. His findings can be found in white papers, blog articles, and VMworld presentations.

Acknowledgements

The work presented here was made possible through a collaboration with AMAX, who provided a ClusterMax system for this performance testing and who hosted and supported the cluster in their corporate datacenter. The author would like to thank Julia Shih, Justin Quon, Kris Tsao, Reeann Zhang, and Winston Tang for their support and hardware expertise. The author would also like to thank Mellanox for supplying the network hardware used for these tests, and in particular Yaron Haviv, Motti Beck, Dhiraj Sehgal, Matthew Finlay, and Todd Wilde for insights and help with networking. Many of the author's colleagues at VMware contributed valuable suggestions as well.