Apache Hadoop Storage Provisioning Using VMware vSphere Big Data Extensions TECHNICAL WHITE PAPER



Table of Contents

Apache Hadoop Deployment on VMware vSphere Using vSphere Big Data Extensions ... 3
Local Storage and Shared Storage ... 4
    Basic vSphere Storage Concepts ... 4
    Using Local and Shared Storage for Hadoop ... 4
Storage Provisioning by BDE ... 5
    Datastore Management ... 5
    Cluster Specification of Storage ... 5
    Disk Placement and Storage Allocation ... 6
Storage Management After Cluster Deployment ... 8
    Allocation of Unused Datastore Storage ... 8
    Storage Failure and Recovery ... 8
        Disk Replacement and Node Data Disk Recovery ... 8
        Disk Replacement and Node Recovery ... 9
        Recoverable Disk Failure ... 10
Storage Configuration for Hadoop Outside of BDE ... 10
    Data Disk Resizing ... 10
    Utilization of Additional Disks ... 11
Conclusion ... 12

Apache Hadoop Deployment on VMware vSphere Using vSphere Big Data Extensions

The Apache Hadoop software library is a framework that enables the distributed processing of large data sets across clusters of computers. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Hadoop is being used by enterprises across verticals for big data analytics, to help make better business decisions based on large data sets.

Serengeti is an open-source project initiated by VMware to automate the deployment and management of Hadoop clusters on virtualized environments such as VMware vSphere. Serengeti offers the following key benefits:

- Deploy a Hadoop cluster on vSphere in minutes via one command
- Employ a fully customizable configuration profile to specify compute, storage and network resources as well as node placement
- Provide better Hadoop manageability and usability, enabling fast and simple cluster scale-out and Hadoop tuning
- Enable separation of data and compute nodes without losing data locality
- Improve Hadoop cluster availability by leveraging VMware vSphere High Availability (vSphere HA), VMware vSphere Fault Tolerance (vSphere FT) and VMware vSphere vMotion
- Support multiple Hadoop distributions, including Apache Hadoop, Cloudera CDH, Pivotal HD, MapR, Hortonworks Data Platform (HDP) and Intel IDH

Through its sponsorship of Project Serengeti, VMware has been investing in making it easier for users to run big data and Hadoop workloads. VMware has introduced Big Data Extensions (BDE) as a commercially supported version of Project Serengeti designed for enterprises seeking VMware support. BDE enables customers to run clustered, scale-out Hadoop applications through vSphere, delivering all the benefits of virtualization to Hadoop users.

BDE provides increased agility through an easy-to-use interface, elastic scaling through the separation of compute and storage resources, and increased reliability and security by leveraging proven vSphere technology. VMware has built BDE to support all major Hadoop distributions and associated Hadoop projects such as Pig, Hive and HBase.

Serengeti automates the deployment of a Hadoop cluster, masking from the user the complex resource allocation and configuration tasks on a virtualized infrastructure. Among all resources, storage probably intrigues the Hadoop user the most, due to performance, capacity and data-locality considerations. This white paper examines how storage is allocated and configured for a Hadoop cluster deployed using Serengeti. It also offers recommendations on how to administer storage in certain scenarios where manual intervention is necessary.

Local Storage and Shared Storage

Basic vSphere Storage Concepts

VMware ESXi provides host-level storage virtualization, which logically abstracts the physical storage layer from virtual machines. An ESXi virtual machine uses one or more virtual disks to store its operating system (OS), program files and other data. Each virtual disk is a large physical file, or a set of files, that resides on a VMware vSphere VMFS datastore, a datastore based on some other technology such as Network File System (NFS) or VMware Virtual SAN, or a raw disk. To access virtual disks, a virtual machine uses virtual SCSI controllers. From the standpoint of the virtual machine, each virtual disk appears as if it were a SCSI drive connected to a SCSI controller. The underlying physical storage for the virtual disk, whether accessed through parallel SCSI, iSCSI, network or Fibre Channel adapters on the ESXi host, is transparent to the guest OS and to applications running on the virtual machine.

ESXi supports two types of storage: local storage and shared storage. Local storage maintains virtual machine files on internal or directly attached external disks that are managed exclusively by that single host, whereas shared storage maintains virtual machine files on disks or storage arrays shared among more than one host, such as those connected through an IP-based or Fibre Channel network. Datastores are logical containers that hide the specifics of each storage device and provide a uniform model for storing virtual machines. For more details about vSphere storage, refer to the vSphere documentation.

Using Local and Shared Storage for Hadoop

When deploying a Hadoop cluster on vSphere, users can choose either local or shared storage for each node. BDE reads the setting from the cluster configuration file and creates virtual disks for the nodes in the specified datastore(s) accordingly. Local and shared storage offer distinctive benefits.

Shared storage in a vSphere environment enables advanced capabilities such as vSphere HA, vSphere FT and vSphere vMotion in the vSphere cluster to protect Hadoop nodes. Shared storage is typically provided by network-attached storage (NAS) or storage area network (SAN) arrays, which not only offer high and scalable capacity but also add another layer of storage availability through RAID, hardware redundancy and multipathing. On the other hand, local storage offers better I/O performance by eliminating network overhead and latency. To improve performance and conserve bandwidth, Hadoop is specifically designed for data locality, so data is processed on the same machine that stores it. Data locality is preserved in a Hadoop deployment on vSphere when a slave node uses local storage: virtual disk I/Os from the slave node are directed to the local disks attached to the ESXi host without being transferred over the network.

vSphere virtualization makes it possible to deploy multiple slave nodes on a single ESXi host with none of them losing data locality. When two virtual machines are deployed on the same ESXi host, communication between them is transmitted on the virtual network that logically connects them within the host. This network traffic is handled in ESXi host memory and never leaves the host, which enables data and compute to be separated for a Hadoop cluster without compromising data locality. When separating data and compute nodes, users can set constraints to strictly associate compute nodes with data nodes. When a user specifies TEMPFS as the storage type for the compute nodes, BDE installs an NFS server on the associated data nodes, installs an NFS client on the compute nodes, and mounts data node disks on the compute nodes. BDE does not assign disks to compute nodes, and all temporary files generated during Hadoop MapReduce jobs are saved on the NFS disks. Using NFS storage for compute nodes increases the capacity of each compute node and returns storage resources when compute nodes stop.

VMware recommends the following best practices for configuring storage for a Hadoop cluster deployed on vSphere:

- Place the Hadoop master node (including NameNode and JobTracker) on shared storage to enable the vSphere HA, vSphere FT and VMware vSphere Distributed Resource Scheduler (vSphere DRS) features. These features prevent the master node from being the single point of failure (SPOF) in the Hadoop cluster.
- Place the Hadoop data nodes on local storage for locality and performance. Follow similar storage provisioning best practices (disk types, number of drives per node, no RAID, and so on) as for Hadoop deployment on physical infrastructure.
- If separating data and compute in the cluster, or deploying a compute-only cluster, place the compute nodes on local storage or use NFS in the form previously described.
- Place the Hadoop client nodes and other Hadoop ecosystem nodes on either local storage or shared storage.
- When using local storage, set the server RAID controller cache policy to write back instead of write through if a cache battery backup unit (BBU) module is installed. Initial I/Os to a disk formatted as either thin provision or thick provision result in disk zeroing on demand, leading to degraded performance until the entire disk has been zeroed. The write back cache mode helps eliminate this performance degradation. By default, BDE formats node data disks in the thick provision format, so initial Hadoop performance might not be optimal unless the write back mode is applied on the RAID controller.

Storage Provisioning by BDE

Datastore Management

BDE enables users to specify datastores to be selected for Hadoop deployment. In the BDE CLI, the command syntax is:

datastore add --name <storage pool name in BDE> --spec <datastore name in vSphere> --type <LOCAL|SHARED>

BDE defines the type of storage pool to be used for cluster deployment. A pool can contain one or many vSphere datastores. The datastore name can be specified using a wildcard character to include a set of datastores for cluster use. BDE currently does not check whether the datastore actually exists in VMware vCenter; use of a nonexistent datastore will cause cluster creation to fail. Two other commands, datastore delete and datastore list, are provided for deleting and listing BDE storage pools.

Cluster Specification of Storage

When deploying a customized Hadoop cluster, users can specify a set of attributes related to storage for each node group, including the number of nodes, the size of each node, and the storage type. For instance, the following specification instructs BDE to create four data nodes, each using 50GB of local storage:

{
  "name": "data",
  "roles": [
    "hadoop_datanode"
  ],
  "instanceNum": 4,
  "cpuNum": 2,
  "memCapacityMB": 2048,
  "storage": {
    "type": "LOCAL",
    "sizeGB": 50
  }
}

The cluster specification can also be used to place system and data disks on separate datastores. In this storage clause, data disks are placed on dsNames4Data datastores, and system disks are placed on dsNames4System datastores:

"storage": {
  "type": "LOCAL",
  "sizeGB": 50,
  "dsNames4Data": [ "DSLOCALSSD" ],
  "dsNames4System": [ "DSNDFS" ]
}

The cluster create command uses the --dsNames parameter to specify the list of BDE storage pools to be used for cluster creation. These storage pools must collectively meet the size and type requirements in the cluster specification; otherwise, cluster creation will fail.

Disk Placement and Storage Allocation

To illustrate how BDE creates virtual disks and allocates storage for a node according to the cluster specification, a simple example is provided here. The same placement and allocation policy and algorithm apply to both local and shared storage.

Suppose there is an ESXi cluster of four hosts, each with five locally attached 120GB disks. On each ESXi host, all five disks are formatted in VMFS to create datastores named localds<0-4>_esx<0-3>. All these datastores, 20 in total, have been added into a BDE local storage pool to be used for cluster creation. The ESXi hosts also share a SAN storage array, from which a 100GB LUN is created and formatted in VMFS as a datastore named sharedds. This datastore is added into a BDE shared storage pool for cluster use.

A Hadoop cluster with one master node, four data nodes, eight compute nodes and one client node must be created using the previously specified BDE local and shared storage pools. The data nodes will use local storage with 150GB each, the compute nodes will use local storage with 25GB each, and the master node and client node will use shared storage with 20GB each.

In this example, BDE by default will place one data node and two compute nodes per ESXi host and will place the master node and client node randomly on two of the ESXi hosts. Every Hadoop node deployed by BDE has three types of virtual disks: a fixed-size system disk, formatted in multiple ext3 partitions, to install the guest OS and applications; a swap disk of the same size as the virtual memory; and one or more data disks, each formatted in a single ext4 partition, to store application data. The number of data disks is dictated by the number of datastores specified for cluster creation on the host in the corresponding BDE storage pool. The size of each data disk equals the specified node size divided by the number of data disks. BDE creates either thin provision or thick provision virtual disks, depending on the node and disk types. In this example, every node is assigned 4GB of virtual memory, so the swap disk size is also roughly 4GB.

Applying this disk placement and storage allocation policy to the example, Table 1 shows how storage is configured on each of the cluster nodes.

NODE                          VIRTUAL DISK   USAGE    DATASTORE       SIZE    VDISK TYPE
Master                        /dev/sda       System   sharedds        20GB    Thin provision
                              /dev/sdb       Swap     sharedds        ~4GB    Thin provision
                              /dev/sdc       Data     sharedds        20GB    Thin provision
Client                        /dev/sda       System   sharedds        20GB    Thin provision
                              /dev/sdb       Swap     sharedds        ~4GB    Thin provision
                              /dev/sdc       Data     sharedds        20GB    Thin provision
Data (one per ESXi host)      /dev/sda       System   localds0_esx#   20GB    Thin provision
                              /dev/sdb       Swap     localds0_esx#   ~4GB    Thick provision
                              /dev/sdc       Data     localds0_esx#   30GB    Thick provision
                              /dev/sdd       Data     localds1_esx#   30GB    Thick provision
                              /dev/sde       Data     localds2_esx#   30GB    Thick provision
                              /dev/sdf       Data     localds3_esx#   30GB    Thick provision
                              /dev/sdg       Data     localds4_esx#   30GB    Thick provision
Compute (two per ESXi host)   /dev/sda       System   localds0_esx#   20GB    Thin provision
                              /dev/sdb       Swap     localds0_esx#   ~4GB    Thick provision
                              /dev/sdc       Data     localds0_esx#   5GB     Thick provision
                              /dev/sdd       Data     localds1_esx#   5GB     Thick provision
                              /dev/sde       Data     localds2_esx#   5GB     Thick provision
                              /dev/sdf       Data     localds3_esx#   5GB     Thick provision
                              /dev/sdg       Data     localds4_esx#   5GB     Thick provision

Table 1. Hadoop Cluster Node Storage Provisioning

As a result, the shared datastore is estimated to have 10GB of free space left. Of the five local disks on each ESXi host, the first disk has roughly 5GB of free space left, and each of the other four has about 80GB.
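The sizing arithmetic behind Table 1 can be sketched in a few lines. This is only a restatement of the policy described above (node size divided by the number of data datastores); the helper name is illustrative, not part of BDE:

```python
# Sizing policy from the paper: a node's data allocation is split evenly
# across the data datastores available on its ESXi host.
def data_disk_size_gb(node_size_gb, num_datastores):
    return node_size_gb // num_datastores

NUM_LOCAL_DS = 5  # localds0..localds4 on each ESXi host

data_disk = data_disk_size_gb(150, NUM_LOCAL_DS)     # data node: 150GB total
compute_disk = data_disk_size_gb(25, NUM_LOCAL_DS)   # compute node: 25GB total

# Free space on localds1..localds4 of one host: each 120GB datastore holds
# one data-node data disk plus two compute-node data disks.
free_other = 120 - (data_disk + 2 * compute_disk)

print(data_disk, compute_disk, free_other)  # 30 5 80
```

The figures match the table: 30GB data disks, 5GB compute disks, and about 80GB left on each of the last four local datastores (the first datastore also carries system and swap disks, hence its smaller remainder).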

Storage Management After Cluster Deployment

Allocation of Unused Datastore Storage

In a Hadoop cluster deployed by BDE, virtual disks of the cluster nodes cannot be resized through BDE to use available free space in the underlying vSphere datastores, nor can BDE create additional disks for the nodes. The unused datastore storage can be utilized in other ways:

- Scale out the existing Hadoop cluster to create more slave nodes
- Create another Hadoop cluster
- Allocate the storage to other applications

However, these methods (the last two in particular) of using available free space in the datastores will inevitably lead to disk contention with the existing Hadoop cluster when running concurrently, resulting in significant performance degradation. This is not a recommended practice unless applications and workloads can be scheduled to avoid contention. Therefore, it is very important to plan and size storage carefully at both the physical and virtual layers prior to cluster deployment, taking into consideration existing storage requirements as well as prospects for data growth.

Storage Failure and Recovery

Enterprise-class SAN and NAS storage rarely fails, due to the sophisticated set of high-availability capabilities built into the arrays. However, individual commercial off-the-shelf hard disk drives have a much higher probability of failure, particularly the lower-grade SATA drives that are often used for Hadoop deployment. When the underlying storage fails, the vSphere datastore becomes unavailable, resulting in loss of access to data among the virtual machines using the datastore. If a virtual machine's system disk resides in the datastore, the virtual machine is completely inaccessible. This section discusses the impact of a local disk failure on the Hadoop cluster using the disk and how to recover from the failure. There are a few different scenarios related to hard disk failure:

- The failed disk is used by Hadoop node(s) for data disks only. The disk is not recoverable and must be replaced.
- The failed disk is used by Hadoop node(s) for both system and data disks. The disk is not recoverable and must be replaced.
- The failed disk is recoverable, with its VMFS partition undamaged and content uncorrupted.

Disk Replacement and Node Data Disk Recovery

If the datastore created from the failed HDD contains Hadoop data disks only, each node using the datastore goes through the following sequence of states:

- The node loses one of its data disks. Consequently, the Hadoop Distributed File System (HDFS) blocks stored on the disk are missing from the node.
- The node remains up for a short period. BDE reports the node to be service ready. Hadoop reports the node to be alive in the cluster.
- Because a cluster deployed by BDE has its fault tolerance level set to zero by default, the node eventually stops the DataNode service due to the loss of the disk.
- BDE detects the loss of the node and reports it. Hadoop detects that the node is not in service and adds the node to the dead-nodes list.
- The cluster remains fully functional with the remaining nodes, although at a reduced capacity. There is no data loss because HDFS has replicas of the blocks elsewhere. Over time, HDFS will detect underreplicated blocks and replicate them automatically.
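The re-replication behavior described above can be illustrated with a toy model. This is not HDFS code; it is a hedged sketch assuming the default replication factor of 3 and a four-node cluster, showing why losing one node's replicas causes no data loss:

```python
import random

# Toy model of HDFS re-replication after a DataNode is declared dead.
# Assumes replication factor 3 and four slave nodes (illustrative values).
REPLICATION = 3
nodes = {f"node{i}" for i in range(4)}
# Each block starts with replicas on 3 distinct nodes.
blocks = {b: set(random.sample(sorted(nodes), REPLICATION)) for b in range(10)}

def fail_node(dead):
    # Replicas stored on the dead node are lost; the node leaves the cluster.
    nodes.discard(dead)
    for replicas in blocks.values():
        replicas.discard(dead)

def rereplicate():
    # The NameNode detects underreplicated blocks and copies them to live nodes.
    for replicas in blocks.values():
        candidates = sorted(nodes - replicas)
        while len(replicas) < REPLICATION and candidates:
            replicas.add(candidates.pop())

fail_node("node0")
rereplicate()
assert all(len(r) == REPLICATION for r in blocks.values())  # no data loss
```

Every block still has a surviving replica after the failure, so the cluster only runs at reduced capacity until re-replication completes.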

The node can resume service after removal of the inaccessible data disk in accordance with the procedure described in VMware knowledge base article 1009854. After power-on of the node, BDE reprovisions the node appropriately and updates the relevant Hadoop configuration files on the node to exclude the lost data disk. BDE then reports the node to be back in service, and Hadoop reports the node to be alive again. The cluster is fully functional with all nodes, although this particular node has one fewer data disk.

After a new physical disk has replaced the failed one, the following procedure can be used to make it available to the Hadoop cluster and to recover each of the affected nodes with a recreated data disk:

1. Create a VMFS datastore on the new disk, as detailed in the BDE User's Guide.
2. Power off the node.
3. Add a virtual disk to the node.
   a. Click Edit Settings.
   b. Click Add in the virtual machine Properties window.
   c. Select Hard Disk as the type of device to add.
   d. Select Create a new virtual disk.
   e. Specify the disk size to be exactly the same as the other data disks on the node, and choose Thick Provision Lazy Zeroed as the provisioning type.
   f. Select Specify a datastore or datastore cluster for the disk location, and browse to choose the datastore created in step 1.
   g. Place the disk on the same SCSI controller and target location as the previously removed disk.
   h. Make the disk Independent in the Persistent mode.
4. Power on the node.

BDE reprovisions the node appropriately and updates the relevant Hadoop configuration files on the node to include the newly provisioned data disk. BDE then reports the node to be back in service. It is recommended that an HDFS fsck be run after all affected nodes have been recovered. At this point, the Hadoop cluster is fully functional with all nodes, and each node has the same number of data disks as initially deployed by BDE. There should be no data loss throughout this entire failure and recovery process. Hadoop will not try to rebalance existing data blocks across the newly replaced data disks but will likely place blocks of newly created files on these data disks first.

Disk Replacement and Node Recovery

If a Hadoop node has both its system disk and data disks in the datastore created from the failed HDD, the cluster and node go into the following state:

- The node is completely dead due to the loss of its system disk. BDE reports the node to be down.
- The cluster loses the node and places it on the dead-nodes list.
- The cluster remains fully functional with the remaining nodes, although at a reduced capacity. There is no data loss because HDFS has replicas of the blocks elsewhere. Over time, HDFS will detect underreplicated blocks and replicate them automatically.

There is currently no way of recovering the node after a new physical disk has replaced the failed one. To preserve the cluster size, users can run the following command to scale out the slave node group by one:

cluster resize --name <cluster name> --nodegroup worker --instancenum <slave # + 1>

BDE now maintains a seemingly larger Hadoop cluster but with the same number of active slave nodes as before.

Recoverable Disk Failure

Some hard disk failures can be repaired physically or through utility software that corrects logical sector errors. In other cases, the disk itself might be fine but there is a problem with the cable or cable connection. In these cases, the hard disk can be reused after the problem has been corrected. Normally the disk retains its original VMFS partition and data, so the vSphere datastore is recovered when the disk is made available again. In turn, the virtual disk(s) created in the datastore become available to the Hadoop nodes. All of this happens automatically within minutes after the disk has been made ready on the ESXi host. No specific action must be taken in BDE or the Hadoop cluster; any node that uses the datastore is back in service after powering up. Depending on whether both system and data disks are impacted, during this failure and recovery process the affected nodes might go through the states described in the previous scenarios. Nevertheless, except for rare and extreme cases where nodes are so concentrated on the failed disk that Hadoop functionality and availability are severely undermined, the Hadoop cluster remains functional.

Storage Configuration for Hadoop Outside of BDE

Data Disk Resizing

During cluster deployment, BDE calculates disk sizes based on the cluster specification and the availability of datastores. After the cluster has been deployed, BDE manages node storage only in terms of maintaining the number of disks as provisioned and presenting them for Hadoop to use. It does not keep track of other aspects, including disk size. Therefore, even though BDE currently provides no function to enable disk resizing for the cluster, it does not prohibit the disks from being resized manually. System disks should not be resized. The following procedure can be used to resize a data disk:

1. Follow Hadoop best practices to move data blocks from the disk to other disks or nodes to maintain HDFS consistency.
2. Power off the node.
3. Resize the disk in vSphere.
   a. Click Edit Settings on the node.
   b. Select the disk to be resized, and enter the new disk size for Provisioned Size.
   c. Click OK to submit the change.
4. Power on the node.
5. Utilize the newly added disk space in one of two ways.
   Option 1:
   a. Install a partition management utility such as GParted in the node to expand the partition on the disk.
   b. Run the resize2fs command to expand the ext4 file system on the partition.
   Option 2:
   a. Run the fdisk command to remove the existing partition.
   b. Reboot the node. BDE will prepare the disk appropriately, including partition, file system and Hadoop directory structure creation.

Hadoop will recognize the new disk size automatically and start using it.

Utilization of Additional Disks

Disk resizing is one way of scaling up storage on a node. Another possible way is to add more data disks to the node, which is considerably more complicated than disk resizing. The following procedure can be used to manually add a data disk to a node:

1. Power off the node.
2. Follow the previously described procedure to add a new disk to the node.
   a. Place the disk in the selected datastore.
   b. Attach the disk to a SCSI controller and target location sequentially after the existing disks.
3. Power on the node.
4. Run the sfdisk command to create a single partition on the new disk.
5. Run mkfs to create an ext4 file system on the partition.
6. Create a mount point under /mnt and add the mount point to the /etc/fstab file.
7. Mount the file system at the mount point.
8. Create the following directory structure in the file system:

# mkdir hadoop
# chown -R hdfs:hadoop hadoop
# cd hadoop
# mkdir hdfs mapred
# cd hdfs
# mkdir data name secondary
# chown hdfs:hadoop data name secondary
# chmod 700 name secondary
# cd ../mapred
# mkdir local
# chown mapred:hadoop local

9. Edit the /usr/lib/hadoop-1.0.1/conf/hdfs-site.xml file to add the new HDFS name and data locations to the dfs.name.dir and dfs.data.dir properties respectively.
10. Edit the /usr/lib/hadoop-1.0.1/conf/mapred-site.xml file to add the new MapReduce local directory to the mapred.local.dir property.
11. Restart the hadoop-0.20-datanode and hadoop-0.20-tasktracker services on the node.

The new data disk is now ready for use by the Hadoop cluster. When BDE restarts the cluster, or when the node reboots, the new disk remains intact on the node. However, BDE will restore the hdfs-site.xml and mapred-site.xml Hadoop configuration files for the node based on the BDE cluster configuration database. As a result, the new data disk is not included in the configuration files for Hadoop to consume, because BDE does not detect the disk. To use the disk, steps 9 to 11 must be performed every time the node reboots or the cluster restarts.
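As an illustration of the hdfs-site.xml edit in step 9, the dfs.data.dir property might look like the following after adding the new disk. The mount points (/mnt/data1, /mnt/data2, /mnt/data5) are hypothetical; only the hadoop/hdfs/data subpath follows the directory structure created in step 8:

```xml
<!-- hdfs-site.xml: append the new disk's data directory to dfs.data.dir. -->
<!-- The existing entries stand in for whatever paths BDE originally configured. -->
<property>
  <name>dfs.data.dir</name>
  <value>/mnt/data1/hadoop/hdfs/data,/mnt/data2/hadoop/hdfs/data,/mnt/data5/hadoop/hdfs/data</value>
</property>
```

The mapred-site.xml edit in step 10 takes the same form, appending the new disk's hadoop/mapred/local directory to the comma-separated mapred.local.dir value.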

Conclusion

An Apache Hadoop cluster deployed on VMware vSphere can leverage the advanced vSphere HA, vSphere FT and vSphere vMotion features for enhanced availability by using shared storage, while preserving data locality by using local storage for data nodes. Virtualization enables data and compute separation without compromising data locality. Big Data Extensions simplifies Hadoop deployment on vSphere, accelerates deployment, and masks the complexity from the vSphere administrator.

VMware, Inc. 3401 Hillview Avenue Palo Alto CA 94304 USA Tel 877-486-9273 Fax 650-427-5001 www.vmware.com Copyright 2013 VMware, Inc. All rights reserved. This product is protected by U.S. and international copyright and intellectual property laws. VMware products are covered by one or more patents listed at http://www.vmware.com/go/patents. VMware is a registered trademark or trademark of VMware, Inc. in the United States and/or other jurisdictions. All other marks and names mentioned herein may be trademarks of their respective companies. Item No: VMW-WP-BIG-DATA-STOR-PROV-USLET-101 Docsouce: OIC-13VM005.03