RDMA for Apache Hadoop 2.x User Guide
0.9.8 User Guide

HIGH-PERFORMANCE BIG DATA TEAM
NETWORK-BASED COMPUTING LABORATORY
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
THE OHIO STATE UNIVERSITY

Copyright (c) Network-Based Computing Laboratory, headed by Dr. D. K. Panda. All rights reserved.

Last revised: September 28, 2015
Contents

1 Overview of the RDMA for Apache Hadoop 2.x Project
2 Features
3 Setup Instructions
  3.1 Prerequisites
  3.2 Download
  3.3 Installation steps
    3.3.1 Installing integrated RDMA for Apache Hadoop package
    3.3.2 Installing RDMA-based plugin for Apache Hadoop or HDP
  3.4 Basic Configuration
    3.4.1 HDFS with Heterogeneous Storage (HHH)
    3.4.2 HDFS in Memory (HHH-M)
    3.4.3 HDFS with Heterogeneous Storage and Lustre (HHH-L)
    3.4.4 MapReduce over Lustre with Local Disks
    3.4.5 MapReduce over Lustre without Local Disks
  3.5 Advanced Configuration
    3.5.1 Parallel Replication
    3.5.2 Placement Policy Selection
    3.5.3 Automatic Placement Policy Selection
    3.5.4 Threshold of RAM Disk Usage
    3.5.5 No. of In-Memory Replicas
    3.5.6 Disk-assisted Shuffle
4 Basic Usage Instructions
  4.1 Startup
    4.1.1 MapReduce over HDFS
    4.1.2 MapReduce over Lustre
  4.2 Basic Commands
  4.3 Shutdown
    4.3.1 MapReduce over HDFS
    4.3.2 MapReduce over Lustre
5 Running RDMA for Apache Hadoop with SLURM/PBS
  5.1 Usage
    5.1.1 Configure and Start a Hadoop Job
    5.1.2 Shutdown Hadoop Cluster
  5.2 Running MapReduce over HDFS with SLURM/PBS
    5.2.1 Startup
    5.2.2 Running Benchmarks
    5.2.3 Shutdown
  5.3 Running MapReduce over Lustre with SLURM/PBS
    5.3.1 Startup
    5.3.2 Running Benchmarks
    5.3.3 Shutdown
6 Benchmarks
  6.1 TestDFSIO
  6.2 Sort
  6.3 TeraSort
  6.4 OHB Micro-benchmarks
1 Overview of the RDMA for Apache Hadoop 2.x Project

RDMA for Apache Hadoop 2.x is a high-performance design of Hadoop over RDMA-enabled interconnects. This version of RDMA for Apache Hadoop 2.x is based on Apache Hadoop 2.7.1 and is compliant with Apache Hadoop and Hortonworks Data Platform (HDP) APIs and applications. This file is intended to guide users through the various steps involved in installing, configuring, and running RDMA for Apache Hadoop 2.x over InfiniBand.

Figure 1 presents a high-level architecture of RDMA for Apache Hadoop 2.x. In this package, many different modes have been included that can be enabled/disabled to obtain performance benefits for different kinds of applications in different Hadoop environments. This package can be configured to run MapReduce jobs on top of HDFS as well as Lustre. The different modes included in our package are as follows.

[Figure 1: RDMA-based Hadoop architecture and its different modes. The figure shows job schedulers (SLURM, PBS) on top of RDMA-enhanced MapReduce, RDMA-enhanced HDFS with in-memory (HHH-M), heterogeneous storage (HHH), and with-Lustre (HHH-L) options, RDMA-enhanced RPC, intermediate data directories on local disks (SSD, HDD) or Lustre, and Lustre as the parallel file system.]

HHH: Heterogeneous storage devices with hybrid replication schemes are supported in this mode of operation to provide better fault-tolerance as well as performance. This mode is enabled by default in the package.

HHH-M: A high-performance in-memory based setup has been introduced in this package that can be utilized to perform all I/O operations in-memory and obtain as much performance benefit as possible.

HHH-L: With parallel file systems integrated, HHH-L mode can take advantage of the Lustre available in the cluster.

MapReduce over Lustre, with/without local disks: Besides HDFS-based solutions, this package also provides support to run MapReduce jobs on top of Lustre alone. Here, two different modes are introduced: with local disks and without local disks.

Running with Slurm and PBS: Supports deploying RDMA for Apache Hadoop 2.x with Slurm and PBS in different running modes (HHH, HHH-M, HHH-L, and MapReduce over Lustre).

If there are any questions, comments or feedback regarding this software package, please post them to the rdma-hadoop-discuss mailing list ([email protected]).

2 Features

High-level features of RDMA for Apache Hadoop 2.x are listed below. New features and enhancements compared to the previous release are marked as (NEW).

- (NEW) Based on Apache Hadoop 2.7.1
- High performance design with native InfiniBand and RoCE support at the verbs level for HDFS, MapReduce, and RPC components
- (NEW) Compliant with Apache Hadoop and Hortonworks Data Platform (HDP) APIs and applications
- Plugin-based architecture supporting RDMA-based designs for HDFS (HHH, HHH-M, HHH-L), MapReduce, MapReduce over Lustre, and RPC, etc.
- (NEW) Plugin for Apache Hadoop distribution (tested with 2.7.1)
- (NEW) Plugin for Hortonworks Data Platform (HDP) (tested with )
- Supports deploying Hadoop with Slurm and PBS in different running modes (HHH, HHH-M, HHH-L, and MapReduce over Lustre)
- Easily configurable for different running modes (HHH, HHH-M, HHH-L, and MapReduce over Lustre) and different protocols (native InfiniBand, RoCE, and IPoIB)
- On-demand connection setup
- HDFS over native InfiniBand and RoCE
  - RDMA-based write
  - RDMA-based replication
  - Parallel replication support
  - Overlapping in different stages of write and replication
- Enhanced hybrid HDFS design with in-memory and heterogeneous storage (HHH)
  - Supports three modes of operations
    - HHH (default) with I/O operations over RAM disk, SSD, and HDD
    - HHH-M (in-memory) with I/O operations in-memory
    - HHH-L (Lustre-integrated) with I/O operations in local storage and Lustre
  - Policies to efficiently utilize heterogeneous storage devices (RAM Disk, SSD, HDD, and Lustre)
    - Greedy and Balanced policies support
    - Automatic policy selection based on available storage types
  - Hybrid replication (in-memory and persistent storage) for HHH default mode
  - Memory replication (in-memory only with lazy persistence) for HHH-M mode
  - Lustre-based fault-tolerance for HHH-L mode
    - No HDFS replication
    - Reduced local storage space usage
- MapReduce over native InfiniBand and RoCE
  - RDMA-based shuffle
  - Prefetching and caching of map output
  - In-memory merge
- Advanced optimization in overlapping
  - map, shuffle, and merge
  - shuffle, merge, and reduce
- Optional disk-assisted shuffle
- High performance design of MapReduce over Lustre
  - Supports two shuffle approaches
    - Lustre read based shuffle
    - RDMA based shuffle
  - Hybrid shuffle based on both shuffle approaches
    - Configurable distribution support
  - In-memory merge and overlapping of different phases
- RPC over native InfiniBand and RoCE
  - JVM-bypassed buffer management
  - RDMA or send/recv based adaptive communication
  - Intelligent buffer allocation and adjustment for serialization
- Tested with
  - Mellanox InfiniBand adapters (DDR, QDR, and FDR)
  - RoCE support with Mellanox adapters
  - Various multi-core platforms
  - RAM Disks, SSDs, HDDs, and Lustre

3 Setup Instructions

3.1 Prerequisites

Prior to the installation of RDMA for Apache Hadoop 2.x, please ensure that you have the latest version of the JDK installed on your system, and set the JAVA_HOME and PATH environment variables to point to the appropriate JDK installation. We recommend the use of JDK version 1.7 or later. In order to use the RDMA-based features provided with RDMA for Apache Hadoop 2.x, install the latest version of the OFED distribution, which can be obtained from the OpenFabrics website.

3.2 Download

Two kinds of tarballs are provided for download: the integrated package and the plugin package.

Download the most recent distribution tarball of the integrated RDMA for Apache Hadoop package (rdma-hadoop-2.x-0.9.8-bin.tar.gz) from hibd.cse.ohio-state.edu/download/hibd/.
Download the most recent distribution tarball of the RDMA for Apache Hadoop plugin package (rdma-hadoop-2.x-0.9.8-plugin.tar.gz) from hibd.cse.ohio-state.edu/download/hibd/.

3.3 Installation steps

RDMA for Apache Hadoop can be installed using the integrated distribution, or as a plugin applied to the existing Apache Hadoop source in use.

3.3.1 Installing integrated RDMA for Apache Hadoop package

The following steps can be used to install the integrated RDMA for Apache Hadoop package.

1. Unzip the integrated RDMA for Apache Hadoop distribution tarball using the following command:

   tar zxf rdma-hadoop-2.x-0.9.8-bin.tar.gz

2. Change directory to rdma-hadoop-2.x-0.9.8:

   cd rdma-hadoop-2.x-0.9.8

3.3.2 Installing RDMA-based plugin for Apache Hadoop or HDP

The following steps can be used to build and install the RDMA-based plugin for Apache Hadoop or HDP package.

1. Unzip the plugin tarball of the specific distribution using the following command:

   tar zxf rdma-hadoop-2.x-0.9.8-plugin.tar.gz

2. Change directory to the extracted plugin directory:

   cd rdma-hadoop-2.x-0.9.8-plugin

3. Run the install script install.sh. This script applies the appropriate patch and re-builds the source from the path specified. The installation can be launched using the following command:

   ./install.sh HADOOP_SRC_DIR DISTRIBUTION

   The installation script takes two arguments:

   HADOOP_SRC_DIR: This mandatory argument is the directory location containing the Hadoop source.

   DISTRIBUTION: This mandatory argument specifies the Hadoop source distribution. This is necessary to choose the appropriate patch to apply. The current options are apache or hdp.
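For example, assuming the Apache Hadoop 2.7.1 source tree has been extracted to /home/hadoop/hadoop-2.7.1-src (a hypothetical location), the plugin could be applied with:

   ./install.sh /home/hadoop/hadoop-2.7.1-src apache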
3.4 Basic Configuration

The configuration files can be found in the directory rdma-hadoop-2.x-0.9.8/etc/hadoop in the integrated RDMA for Apache Hadoop package. If the plugin-based Hadoop distribution is being used, the configuration files can be found at HADOOP_SOURCE/hadoop-dist/target/etc/hadoop, where HADOOP_SOURCE is the Hadoop source directory on which the plugin is being (or has been) applied.

3.4.1 HDFS with Heterogeneous Storage (HHH)

Steps to configure RDMA for Apache Hadoop 2.x include:

1. Configure the hadoop-env.sh file.

   export JAVA_HOME=/opt/java/

2. Configure the core-site.xml file. RDMA for Apache Hadoop 2.x supports three different modes: IB, RoCE, and TCP/IP.

   Configuration of the IB mode:

   <configuration>
     <property>
       <name>fs.defaultFS</name>
       <value>hdfs://node001:9000</value>
       <description>NameNode URI.</description>
     </property>
     <property>
       <name>hadoop.ib.enabled</name>
       <value>true</value>
       <description>Enable the RDMA feature over IB. Default value of hadoop.ib.enabled is true.</description>
     </property>
     <property>
       <name>hadoop.roce.enabled</name>
       <value>false</value>
       <description>Disable the RDMA feature over RoCE. Default value of hadoop.roce.enabled is false.</description>
     </property>
   </configuration>

   Configuration of the RoCE mode:

   <configuration>
     <property>
       <name>fs.defaultFS</name>
       <value>hdfs://node001:9000</value>
       <description>NameNode URI.</description>
     </property>
     <property>
       <name>hadoop.ib.enabled</name>
       <value>false</value>
       <description>Disable the RDMA feature over IB. Default value of hadoop.ib.enabled is true.</description>
     </property>
     <property>
       <name>hadoop.roce.enabled</name>
       <value>true</value>
       <description>Enable the RDMA feature over RoCE. Default value of hadoop.roce.enabled is false.</description>
     </property>
   </configuration>

   Configuration of the TCP/IP mode:

   <configuration>
     <property>
       <name>fs.defaultFS</name>
       <value>hdfs://node001:9000</value>
       <description>NameNode URI.</description>
     </property>
     <property>
       <name>hadoop.ib.enabled</name>
       <value>false</value>
       <description>Disable the RDMA feature over IB. Default value of hadoop.ib.enabled is true.</description>
     </property>
     <property>
       <name>hadoop.roce.enabled</name>
       <value>false</value>
       <description>Disable the RDMA feature over RoCE. Default value of hadoop.roce.enabled is false.</description>
     </property>
   </configuration>

   Note that we should not enable hadoop.ib.enabled and hadoop.roce.enabled at the same time. Also, for the IB and RoCE modes, the speculative execution of map and reduce tasks is disabled by default. Physical and virtual memory monitoring is also disabled for these modes. By default, org.apache.hadoop.mapred.homrshufflehandler is configured as the Shuffle Handler service and
   org.apache.hadoop.mapreduce.task.reduce.homrshuffle is configured as the default Shuffle plugin.

3. Configure the hdfs-site.xml file.

   <configuration>
     <property>
       <name>dfs.namenode.name.dir</name>
       <value>file:///home/hadoop/rdma-hadoop-2.x-0.9.8/name</value>
       <description>Path on the local filesystem where the NameNode stores the namespace and transaction logs persistently.</description>
     </property>
     <property>
       <name>dfs.datanode.data.dir</name>
       <value>[RAM_DISK]file:///data01, [SSD]file:///data02, [DISK]file:///data03</value>
       <description>Comma-separated list of paths of the local storage devices with corresponding types on a DataNode where it should store its blocks.</description>
     </property>
   </configuration>

4. Configure the yarn-site.xml file.

   <configuration>
     <property>
       <name>yarn.resourcemanager.address</name>
       <value>node001:8032</value>
       <description>ResourceManager host:port for clients to submit jobs.</description>
     </property>
     <property>
       <name>yarn.resourcemanager.scheduler.address</name>
       <value>node001:8030</value>
       <description>ResourceManager host:port for ApplicationMasters to talk to the Scheduler to obtain resources.</description>
     </property>
     <property>
       <name>yarn.resourcemanager.resource-tracker.address</name>
       <value>node001:8031</value>
       <description>ResourceManager host:port for NodeManagers.</description>
     </property>
     <property>
       <name>yarn.resourcemanager.admin.address</name>
       <value>node001:8033</value>
       <description>ResourceManager host:port for administrative commands.</description>
     </property>
     <property>
       <name>yarn.nodemanager.aux-services</name>
       <value>mapreduce_shuffle</value>
       <description>Shuffle service that needs to be set for MapReduce applications.</description>
     </property>
   </configuration>

5. Configure the mapred-site.xml file.

   <configuration>
     <property>
       <name>mapreduce.framework.name</name>
       <value>yarn</value>
       <description>Execution framework set to Hadoop YARN.</description>
     </property>
     <property>
       <name>yarn.app.mapreduce.am.command-opts</name>
       <value>-Xmx1024m -Dhadoop.conf.dir=${HADOOP_CONF_DIR}</value>
       <description>Java opts for the MR App Master processes. The following symbol, if present, will be interpolated: @taskid@ is replaced by the current TaskID. Any other occurrences will go unchanged. For example, to enable verbose gc logging to a file named for the taskid in /tmp and to set the heap maximum to be a gigabyte, pass a value of: -Xmx1024m -verbose:gc -Xloggc:/tmp/@taskid@.gc
       Usage of -Djava.library.path can cause programs to no longer function if hadoop native libraries are used. These values should instead be set as part of LD_LIBRARY_PATH in the map / reduce JVM env using the mapreduce.map.env and mapreduce.reduce.env config settings.</description>
     </property>
     <property>
       <name>yarn.app.mapreduce.am.env</name>
       <value>LD_LIBRARY_PATH=${HADOOP_HOME}/lib/native</value>
       <description>User added environment variables for the MR App Master processes. Example:
       1) A=foo  This will set the env variable A to foo
       2) B=$B:c This is inherit tasktracker's B env variable.</description>
     </property>
   </configuration>

6. Configure the slaves file. List all slave hostnames in this file, one per line.

   node002
   node003

We can also configure more specific items according to actual needs. For example, we can configure the item dfs.blocksize in hdfs-site.xml to change the HDFS block size. To get more detailed information, please visit the official Apache Hadoop documentation.

3.4.2 HDFS in Memory (HHH-M)

We can enable MapReduce over in-memory HDFS using the following configuration steps:

1. Configure the hdfs-site.xml file.

   <configuration>
     <property>
       <name>dfs.namenode.name.dir</name>
       <value>file:///home/hadoop/rdma-hadoop-2.x-0.9.8/name</value>
       <description>Path on the local filesystem where the NameNode stores the namespace and transaction logs persistently.</description>
     </property>
     <property>
       <name>dfs.datanode.data.dir</name>
       <value>[RAM_DISK]file:///data01, [SSD]file:///data02, [DISK]file:///data03</value>
       <description>Comma-separated list of paths of the local storage devices with corresponding types on a DataNode where it should store its blocks.</description>
     </property>
     <property>
       <name>dfs.rdma.hhh.mode</name>
       <value>in-memory</value>
       <description>Select In-Memory mode (HHH-M).</description>
     </property>
   </configuration>

2. Configure the yarn-site.xml file as specified in Section 3.4.1.

3. Configure the mapred-site.xml file as specified in Section 3.4.1.

4. Configure the core-site.xml file as specified in Section 3.4.1.

5. Configure the slaves file as specified in Section 3.4.1.

3.4.3 HDFS with Heterogeneous Storage and Lustre (HHH-L)

We can enable MapReduce over Lustre-integrated HDFS using the following configuration steps:

1. Configure the hdfs-site.xml file.

   <configuration>
     <property>
       <name>dfs.namenode.name.dir</name>
       <value>file:///home/hadoop/rdma-hadoop-2.x-0.9.8/name</value>
       <description>Path on the local filesystem where the NameNode stores the namespace and transaction logs persistently.</description>
     </property>
     <property>
       <name>dfs.datanode.data.dir</name>
       <value>[RAM_DISK]file:///data01, [SSD]file:///data02, [DISK]file:///data03</value>
       <description>Comma-separated list of paths of the local storage devices with corresponding types on a DataNode where it should store its blocks.</description>
     </property>
     <property>
       <name>dfs.rdma.hhh.mode</name>
       <value>lustre</value>
       <description>Select Lustre-Integrated mode (HHH-L).</description>
     </property>
     <property>
       <name>dfs.replication</name>
       <value>1</value>
       <description>Lustre provides fault-tolerance; use a replication factor of 1.</description>
     </property>
     <property>
       <name>dfs.rdma.lustre.path</name>
       <value>/scratch/hadoop-lustre/</value>
       <description>The path of a directory in Lustre where HDFS will store the blocks.</description>
     </property>
   </configuration>

2. Configure the yarn-site.xml file as specified in Section 3.4.1.

3. Configure the mapred-site.xml file as specified in Section 3.4.1.

4. Configure the core-site.xml file as specified in Section 3.4.1.

5. Configure the slaves file as specified in Section 3.4.1.

3.4.4 MapReduce over Lustre with Local Disks

In this mode, Hadoop MapReduce can be launched on top of the Lustre file system without using HDFS-level APIs. We can enable MapReduce over Lustre with two different kinds of Hadoop configuration settings. In the first setting, we use the local disks of each node to hold the intermediate data generated during job execution. To enable MapReduce over Lustre with this setting, we use the following configuration steps:

1. Create the data directory in the Lustre file system.

   mkdir <lustre-path-data-dir>

2. Configure Lustre file striping. It is recommended to use a stripe size equal to the file system block size (fs.local.block.size) in core-site.xml. For our package, we recommend a stripe size value of 256 MB.

   lfs setstripe -s 256M <lustre-path-data-dir>

3. Configure the mapred-site.xml file.

   <configuration>
     <property>
       <name>mapreduce.framework.name</name>
       <value>yarn</value>
       <description>Execution framework set to Hadoop YARN.</description>
     </property>
     <property>
       <name>yarn.app.mapreduce.am.command-opts</name>
       <value>-Xmx1024m -Dhadoop.conf.dir=$HADOOP_HOME/etc/hadoop</value>
       <description>Java opts for the MR App Master processes. The following symbol, if present, will be interpolated: @taskid@ is replaced by the current TaskID. Any other occurrences will go unchanged. For example, to enable verbose gc logging to a file named for the taskid in /tmp and to set the heap maximum to be a gigabyte, pass a value of: -Xmx1024m -verbose:gc -Xloggc:/tmp/@taskid@.gc
       Usage of -Djava.library.path can cause programs to no longer function if hadoop native libraries are used. These values should instead be set as part of LD_LIBRARY_PATH in the map / reduce JVM env using the mapreduce.map.env and mapreduce.reduce.env config settings.</description>
     </property>
     <property>
       <name>yarn.app.mapreduce.am.env</name>
       <value>LD_LIBRARY_PATH=${HADOOP_HOME}/lib/native</value>
       <description>User added environment variables for the MR App Master processes. Example:
       1) A=foo  This will set the env variable A to foo
       2) B=$B:c This is inherit tasktracker's B env variable.</description>
     </property>
     <property>
       <name>mapreduce.jobtracker.system.dir</name>
       <value><lustre-path-data-dir>/mapred/system</value>
       <description>The directory where MapReduce stores control files.</description>
     </property>
     <property>
       <name>mapreduce.jobtracker.staging.root.dir</name>
       <value><lustre-path-data-dir>/mapred/staging</value>
       <description>The root of the staging area for users' job files.</description>
     </property>
     <property>
       <name>yarn.app.mapreduce.am.staging-dir</name>
       <value><lustre-path-data-dir>/yarn/staging</value>
       <description>The staging dir used while submitting jobs.</description>
     </property>
     <property>
       <name>mapred.rdma.shuffle.lustre</name>
       <value>1</value>
       <description>This parameter enables MapReduce over Lustre with all enhanced shuffle algorithms. For MapReduce over Lustre with local disks, this parameter value should be 1. The default value of this parameter is -1, which enables RDMA-based shuffle for MapReduce over HDFS.</description>
     </property>
   </configuration>

4. Configure the core-site.xml file.

   <configuration>
     <property>
       <name>fs.defaultFS</name>
       <value>file://<lustre-path-data-dir>/namenode</value>
       <description>The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The URI's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class. The URI's authority is used to determine the host, port, etc. for a filesystem.</description>
     </property>
     <property>
       <name>fs.local.block.size</name>
       <value>268435456</value>
       <description>This value should be equal to the Lustre stripe size.</description>
     </property>
     <property>
       <name>hadoop.ib.enabled</name>
       <value>true</value>
       <description>Enable/disable the RDMA feature over IB. Default
       value of hadoop.ib.enabled is true.</description>
     </property>
     <property>
       <name>hadoop.roce.enabled</name>
       <value>false</value>
       <description>Enable/disable the RDMA feature over RoCE. Default value of hadoop.roce.enabled is false.</description>
     </property>
     <property>
       <name>hadoop.tmp.dir</name>
       <value>/tmp/hadoop_local</value>
       <description>A base for other temporary directories.</description>
     </property>
   </configuration>

5. Configure the yarn-site.xml file as specified in Section 3.4.1.

6. Configure the slaves file as specified in Section 3.4.1.

3.4.5 MapReduce over Lustre without Local Disks

For MapReduce over Lustre without local disks, Lustre should provide the intermediate data directories for MapReduce jobs. However, to do this, the Hadoop tmp directory (hadoop.tmp.dir) must have a separate path into Lustre for each NodeManager. This can be achieved by installing a separate Hadoop package on each node with a different configuration value for hadoop.tmp.dir. To enable MapReduce over Lustre with this setting, we use the following configuration steps:

1. Make the Lustre data directory and configure the stripe size as specified in Section 3.4.4.

2. Copy the Hadoop installation to a local directory on each of the slave nodes. Note that this step is a requirement for this mode.

3. Configure the mapred-site.xml file.

   <configuration>
     <property>
       <name>mapreduce.framework.name</name>
       <value>yarn</value>
       <description>Execution framework set to Hadoop YARN.</description>
     </property>
     <property>
       <name>yarn.app.mapreduce.am.command-opts</name>
       <value>-Xmx1024m -Dhadoop.conf.dir=$HADOOP_HOME/etc/hadoop</value>
       <description>Java opts for the MR App Master processes. The following symbol, if present, will be interpolated: @taskid@ is replaced by the current TaskID. Any other occurrences will go unchanged. For example, to enable verbose gc logging to a file named for the taskid in /tmp and to set the heap maximum to be a gigabyte, pass a value of: -Xmx1024m -verbose:gc -Xloggc:/tmp/@taskid@.gc
       Usage of -Djava.library.path can cause programs to no longer function if hadoop native libraries are used. These values should instead be set as part of LD_LIBRARY_PATH in the map / reduce JVM env using the mapreduce.map.env and mapreduce.reduce.env config settings.</description>
     </property>
     <property>
       <name>yarn.app.mapreduce.am.env</name>
       <value>LD_LIBRARY_PATH=${HADOOP_HOME}/lib/native</value>
       <description>User added environment variables for the MR App Master processes. Example:
       1) A=foo  This will set the env variable A to foo
       2) B=$B:c This is inherit tasktracker's B env variable.</description>
     </property>
     <property>
       <name>mapreduce.jobtracker.system.dir</name>
       <value><lustre-path-data-dir>/mapred/system</value>
       <description>The directory where MapReduce stores control files.</description>
     </property>
     <property>
       <name>mapreduce.jobtracker.staging.root.dir</name>
       <value><lustre-path-data-dir>/mapred/staging</value>
       <description>The root of the staging area for users' job files.</description>
     </property>
     <property>
       <name>yarn.app.mapreduce.am.staging-dir</name>
       <value><lustre-path-data-dir>/yarn/staging</value>
       <description>The staging dir used while submitting jobs.</description>
     </property>
     <property>
       <name>mapred.rdma.shuffle.lustre</name>
       <value>2</value>
       <description>This parameter enables MapReduce over Lustre with all enhanced shuffle algorithms. For MapReduce over Lustre without local disks, it can have a value of 0, 1, 2, or any fractional value between 0 and 1. A value of 1 indicates that the entire shuffle will go over RDMA; a value of 0 indicates that the Lustre file system read operation will take place instead of data shuffle. Any value in between (e.g. 0.70) indicates some shuffle going over RDMA (70%), while the rest is Lustre read (30%). A special value of 2 enables a hybrid shuffle algorithm for MapReduce over Lustre with Lustre as the intermediate data dir. The default value of this parameter is -1, which enables RDMA-based shuffle for MapReduce over HDFS.</description>
     </property>
   </configuration>

4. Configure the core-site.xml file for each NodeManager separately.

   <configuration>
     <property>
       <name>fs.defaultFS</name>
       <value>file://<lustre-path-data-dir>/namenode</value>
       <description>The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The URI's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class. The URI's authority is used to determine the host, port, etc. for a filesystem.</description>
     </property>
     <property>
       <name>fs.local.block.size</name>
       <value>268435456</value>
       <description>This value should be equal to the Lustre stripe size.</description>
     </property>
     <property>
       <name>hadoop.ib.enabled</name>
       <value>true</value>
       <description>Enable/disable the RDMA feature over IB. Default value of hadoop.ib.enabled is true.</description>
     </property>
     <property>
       <name>hadoop.roce.enabled</name>
       <value>false</value>
       <description>Enable/disable the RDMA feature over RoCE. Default value of hadoop.roce.enabled is false.</description>
     </property>
     <property>
       <name>hadoop.tmp.dir</name>
       <value><lustre-path-data-dir>/tmp_{$nodeid}</value>
       <description>A base for other temporary directories. For MapReduce over Lustre with Lustre providing the intermediate data storage, this parameter value should be different for each NodeManager. For example, the first NodeManager in the slaves file may have a value of <lustre-path-data-dir>/tmp_0, while the second NodeManager can be configured with <lustre-path-data-dir>/tmp_1, and so on. Any value can replace $nodeid here as long as the value is different for each node.</description>
     </property>
   </configuration>

5. Configure the yarn-site.xml file as specified in Section 3.4.1.

6. Configure the slaves file as specified in Section 3.4.1.

3.5 Advanced Configuration

Some advanced features in RDMA for Apache Hadoop 2.x can be manually enabled by users. Steps to configure these features in RDMA for Apache Hadoop 2.x are discussed in this section.

3.5.1 Parallel Replication

Enable parallel replication in HDFS by configuring the hdfs-site.xml file. By default, RDMA for Apache Hadoop 2.x will choose the pipeline replication mechanism. This parameter is applicable for the HHH and HHH-M modes as discussed in Section 3.4.1 and Section 3.4.2, respectively.
   <property>
     <name>dfs.replication.parallel.enabled</name>
     <value>true</value>
     <description>Enable the parallel replication feature in HDFS. Default value of dfs.replication.parallel.enabled is false.</description>
   </property>

3.5.2 Placement Policy Selection

Select a specific placement policy (Greedy or Balanced) in HDFS by configuring the hdfs-site.xml file. By default, RDMA for Apache Hadoop 2.x selects the policy automatically based on the storage types of the HDFS data directories. This parameter is applicable for the HHH and HHH-L modes as discussed in Section 3.4.1 and Section 3.4.3, respectively.

   <property>
     <name>dfs.rdma.placement.policy</name>
     <value>greedy/balanced</value>
     <description>Enable a specific data placement policy.</description>
   </property>

3.5.3 Automatic Placement Policy Selection

By default, RDMA for Apache Hadoop 2.x selects the policy automatically based on the storage types of the HDFS data directories. This parameter can be used if the user wants to disable automatic policy detection. This parameter is applicable for the HHH mode as discussed in Section 3.4.1.

   <property>
     <name>dfs.rdma.policy.autodetect</name>
     <value>false</value>
     <description>Disable automatic policy detection (default is true).</description>
   </property>

In order to use the storage policies of default HDFS, users should not use the dfs.rdma.placement.policy parameter discussed in Section 3.5.2 and should disable policy auto-detection.

3.5.4 Threshold of RAM Disk Usage

Select a threshold of RAM Disk usage in HDFS by configuring the hdfs-site.xml file. By default, RDMA for Apache Hadoop 2.x uses 70% of the RAM Disk when the RAM Disk is configured as an HDFS data directory. This parameter is applicable for the HHH, HHH-M, and HHH-L modes as discussed in Section 3.4.1, Section 3.4.2, and Section 3.4.3, respectively.
   <property>
     <name>dfs.rdma.memory.percentage</name>
     <value>0.5</value>
     <description>Select a threshold (default = 0.7) for RAM Disk usage.</description>
   </property>

3.5.5 No. of In-Memory Replicas

Select the number of in-memory replicas in HHH mode by configuring the hdfs-site.xml file. By default, RDMA for Apache Hadoop 2.x writes two replicas to RAM Disk and one to persistent storage (replication factor = 3). The number of in-memory replicas can be changed from one up to the replication factor (all in-memory). This parameter is applicable for the HHH mode as discussed in Section 3.4.1.

   <property>
     <name>dfs.rdma.memory.replica</name>
     <value>3</value>
     <description>Select the number of in-memory replicas (default = 2).</description>
   </property>

3.5.6 Disk-assisted Shuffle

Enable disk-assisted shuffle in MapReduce by configuring the mapred-site.xml file. By default, RDMA for Apache Hadoop 2.x assumes that disk-assisted shuffle is disabled. We encourage our users to enable this parameter if they feel that the local disk performance in their cluster is good. This parameter is applicable for all the modes in RDMA for Apache Hadoop 2.x.

   <property>
     <name>mapred.disk-assisted.shuffle.enabled</name>
     <value>true</value>
     <description>Enable disk-assisted shuffle in MapReduce. Default value of mapred.disk-assisted.shuffle.enabled is false.</description>
   </property>

4 Basic Usage Instructions

RDMA for Apache Hadoop 2.x has management operations similar to default Apache Hadoop. This section lists several of them for basic usage.
4.1 Startup

4.1.1 MapReduce over HDFS

To run MapReduce over HDFS with any of the modes (HHH/HHH-M/HHH-L), please follow these steps.

1. Use the following command to format the directory which stores the namespace and transaction logs for the NameNode.

   $ bin/hdfs namenode -format

2. Start HDFS with the following command:

   $ sbin/start-dfs.sh

3. Start YARN with the following command:

   $ sbin/start-yarn.sh

4.1.2 MapReduce over Lustre

To run MapReduce over Lustre, startup involves starting YARN only.

1. Start YARN with the following command:

   $ sbin/start-yarn.sh

4.2 Basic Commands

1. Use the following command to manage HDFS:

   $ bin/hdfs dfsadmin
   Usage: java DFSAdmin
   Note: Administrative commands can only be run as the HDFS superuser.
      [-report]
      [-safemode enter | leave | get | wait]
      [-allowSnapshot <snapshotDir>]
      [-disallowSnapshot <snapshotDir>]
      [-saveNamespace]
      [-rollEdits]
      [-restoreFailedStorage true|false|check]
      [-refreshNodes]
      [-finalizeUpgrade]
      [-rollingUpgrade [<query|prepare|finalize>]]
      [-metasave filename]
      [-refreshServiceAcl]
      [-refreshUserToGroupsMappings]
      [-refreshSuperUserGroupsConfiguration]
      [-refreshCallQueue]
      [-printTopology]
      [-refreshNamenodes datanodehost:port]
      [-deleteBlockPool datanode-host:port blockpoolId [force]]
      [-setQuota <quota> <dirname>...<dirname>]
      [-clrQuota <dirname>...<dirname>]
      [-setSpaceQuota <quota> <dirname>...<dirname>]
      [-clrSpaceQuota <dirname>...<dirname>]
      [-setBalancerBandwidth <bandwidth in bytes per second>]
      [-fetchImage <local directory>]
      [-shutdownDatanode <datanode_host:ipc_port> [upgrade]]
      [-getDatanodeInfo <datanode_host:ipc_port>]
      [-help [cmd]]

   Generic options supported are
   -conf <configuration file>     specify an application configuration file
   -D <property=value>            use value for given property
   -fs <local|namenode:port>      specify a namenode
   -jt <local|jobtracker:port>    specify a job tracker
   -files <comma separated list of files>    specify comma separated files to be copied to the map reduce cluster
   -libjars <comma separated list of jars>   specify comma separated jar files to include in the classpath.
   -archives <comma separated list of archives>    specify comma separated archives to be unarchived on the compute machines.

   For example, we often use the following command to show the status of HDFS:

   $ bin/hdfs dfsadmin -report

2. Use the following command to manage files in HDFS:

   $ bin/hdfs dfs
   Usage: hadoop fs [generic options]
      [-appendToFile <localsrc>... <dst>]
      [-cat [-ignoreCrc] <src>...]
      [-checksum <src>...]
      [-chgrp [-R] GROUP PATH...]
      [-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...]
      [-chown [-R] [OWNER][:[GROUP]] PATH...]
      [-copyFromLocal [-f] [-p] <localsrc>... <dst>]
      [-copyToLocal [-p] [-ignoreCrc] [-crc] <src>... <localdst>]
      [-count [-q] <path>...]
      [-cp [-f] [-p] <src>... <dst>]
      [-createSnapshot <snapshotDir> [<snapshotName>]]
      [-deleteSnapshot <snapshotDir> <snapshotName>]
      [-df [-h] [<path>...]]
      [-du [-s] [-h] <path>...]
      [-expunge]
      [-get [-p] [-ignoreCrc] [-crc] <src>... <localdst>]
      [-getfacl [-R] <path>]
      [-getmerge [-nl] <src> <localdst>]
      [-help [cmd...]]
      [-ls [-d] [-h] [-R] [<path>...]]
      [-mkdir [-p] <path>...]
      [-moveFromLocal <localsrc>... <dst>]
      [-moveToLocal <src> <localdst>]
      [-mv <src>... <dst>]
      [-put [-f] [-p] <localsrc>... <dst>]
      [-renameSnapshot <snapshotDir> <oldName> <newName>]
      [-rm [-f] [-r|-R] [-skipTrash] <src>...]
      [-rmdir [--ignore-fail-on-non-empty] <dir>...]
      [-setfacl [-R] [{-b|-k} {-m|-x <acl_spec>} <path>] | [--set <acl_spec> <path>]]
      [-setrep [-R] [-w] <rep> <path>...]
      [-stat [format] <path>...]
      [-tail [-f] <file>]
      [-test -[defsz] <path>]
      [-text [-ignoreCrc] <src>...]
      [-touchz <path>...]
      [-usage [cmd...]]

   Generic options supported are
   -conf <configuration file>     specify an application configuration file
   -D <property=value>            use value for given property
   -fs <local|namenode:port>      specify a namenode
   -jt <local|jobtracker:port>    specify a job tracker
   -files <comma separated list of files>    specify comma separated files to be copied to the map reduce cluster
   -libjars <comma separated list of jars>   specify comma separated jar files to include in the classpath.
   -archives <comma separated list of archives>    specify comma separated archives to be unarchived on the compute machines.

   For example, we can use the following command to list directory contents of HDFS:
   $ bin/hdfs dfs -ls /

3. Use the following command to interoperate with the MapReduce framework:

   $ bin/mapred job
   Usage: JobClient <command> <args>
      [-submit <job-file>]
      [-status <job-id>]
      [-counter <job-id> <group-name> <counter-name>]
      [-kill <job-id>]
      [-set-priority <job-id> <priority>]. Valid values for priorities are: VERY_HIGH HIGH NORMAL LOW VERY_LOW
      [-events <job-id> <from-event-#> <#-of-events>]
      [-history <jobOutputDir>]
      [-list [all]]
      [-list-active-trackers]
      [-list-blacklisted-trackers]
      [-list-attempt-ids <job-id> <task-type> <task-state>]
      [-kill-task <task-id>]
      [-fail-task <task-id>]

   Usage: CLI <command> <args>
      [-submit <job-file>]
      [-status <job-id>]
      [-counter <job-id> <group-name> <counter-name>]
      [-kill <job-id>]
      [-set-priority <job-id> <priority>]. Valid values for priorities are: VERY_HIGH HIGH NORMAL LOW VERY_LOW
      [-events <job-id> <from-event-#> <#-of-events>]
      [-history <jobHistoryFile>]
      [-list [all]]
      [-list-active-trackers]
      [-list-blacklisted-trackers]
      [-list-attempt-ids <job-id> <task-type> <task-state>]. Valid values for <task-type> are REDUCE MAP. Valid values for <task-state> are running, completed
      [-kill-task <task-attempt-id>]
      [-fail-task <task-attempt-id>]
      [-logs <job-id> <task-attempt-id>]

   Generic options supported are
   -conf <configuration file>     specify an application configuration file
   -D <property=value>            use value for given property
   -fs <local|namenode:port>      specify a namenode
   -jt <local|jobtracker:port>    specify a job tracker
   -files <comma separated list of files>    specify comma separated files to be copied to the map reduce cluster
   -libjars <comma separated list of jars>   specify comma separated jar files to include in the classpath.
   -archives <comma separated list of archives>    specify comma separated archives to be unarchived on the compute machines.

   For example, we can use the following command to list all active trackers of MapReduce:

   $ bin/mapred job -list-active-trackers

4.3 Shutdown

4.3.1 MapReduce over HDFS

1. Stop HDFS with the following command:

   $ sbin/stop-dfs.sh

2. Stop YARN with the following command:

   $ sbin/stop-yarn.sh

4.3.2 MapReduce over Lustre

1. Stop YARN with the following command:

   $ sbin/stop-yarn.sh

5 Running RDMA for Apache Hadoop with SLURM/PBS

To run RDMA for Apache Hadoop with SLURM/PBS, the scripts in the HADOOP_HOME/bin/slurm_pbs/ directory can be used. These scripts can be used in interactive mode or batch mode. In the interactive mode, the user must allocate interactive nodes and explicitly use the startup, run benchmark, and shutdown commands described in Sections 5.2 or 5.3 in their interactive session. In the batch mode, the user must create and launch a SLURM/PBS batch script with the startup, run benchmark, and shutdown commands described in Sections 5.2 or 5.3. Detailed steps for using these scripts are described in Sections 5.1, 5.2, and 5.3.

5.1 Usage

This section gives an overview of the SLURM/PBS scripts and their usage for startup and shutdown for running a Hadoop cluster.
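For the interactive mode mentioned above, compute nodes are first allocated through the resource manager before the scripts below are used. A minimal sketch is shown here; the partition/queue name, node count, processes per node, and wall time are hypothetical and should be adapted to the local cluster.

   With SLURM:
   $ salloc -N 4 -p <partition> -t 01:00:00

   With PBS:
   $ qsub -I -l nodes=4:ppn=16 -l walltime=01:00:00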
5.1.1 Configure and Start a Hadoop Job

For installing, configuring, and starting RDMA for Apache Hadoop with any particular mode of operation, hibd_install_configure_start.sh can be used. This script can configure and start Hadoop depending on the parameters provided by the user. Detailed parameter options available with this script are mentioned below:

$ ./hibd_install_configure_start.sh -?
Usage: hibd_install_configure_start.sh [options]
   -h <dir>   specify location of hadoop installation a.k.a. hadoop home
   -m <hhh|hhh-m|hhh-l|mrlustre-local|mrlustre-lustre>
              specify the mode of operation (default: hhh). For more information, visit hibd.cse.ohio-state.edu
   -c <dir>   specify the hadoop conf dir (default: ""). If the user provides this directory, then the conf files are chosen from this directory. Otherwise, the conf files are generated automatically, with/without user provided configuration with flag -u
   -j <dir>   specify jdk installation or JAVA_HOME (default: ""). If the user does not provide this, then the java installation is searched in the environment.
   -u <file>  specify a file containing all the configurations for hadoop installation (default: n/a). Each line of this file must be formatted as below:
              "<C|H|M|Y>\t<parameter_name>\t<parameter_value>"
              C = core-site.xml, H = hdfs-site.xml, M = mapred-site.xml, Y = yarn-site.xml
   -l <dir>   specify the Lustre path to use for hhh-l, mrlustre-local, and mrlustre-lustre modes (default: "")
   -r <dir>   specify the ram disk path to use for hhh and hhh-m modes (default: /dev/shm)
   -s         specify to start hadoop after installation and configuration
   -?         show this help message

5.1.2 Shutdown Hadoop Cluster

After running a benchmark with the script indicated in Section 5.1.1, stopping the Hadoop cluster with cleanup of all the directories can be achieved by using the script hibd_stop_cleanup.sh. Similar to the startup script, this cleanup script makes different parameters available to the user. Detailed parameter options available with this script are mentioned below:

$ ./hibd_stop_cleanup.sh -?
Usage: hibd_stop_cleanup.sh [options]
   -h <dir>   specify location of hadoop installation a.k.a. hadoop home
   -m <hhh|hhh-m|hhh-l|mrlustre-local|mrlustre-lustre>
              specify the mode of operation (default: hhh). For more information, visit hibd.cse.ohio-state.edu
   -c <dir>   specify the hadoop conf dir (default: "").
   -l <dir>   specify the Lustre path to use for hhh-l, mrlustre-local, and mrlustre-lustre modes (default: "")
   -r <dir>   specify the ram disk path to use for hhh and hhh-m modes (default: /dev/shm)
   -d         specify to delete logs and data after hadoop stops
   -?         show this help message

Details of the usage of the above-mentioned scripts can also be found in slurm-script.sh.

5.2 Running MapReduce over HDFS with SLURM/PBS

The user can run MapReduce over HDFS in one of three modes: HHH, HHH-M, or HHH-L. Based on the parameters supplied to the hibd_install_configure_start.sh script, the Hadoop cluster will start with the configuration settings of the requested mode of operation.
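For the batch mode, the startup, benchmark, and shutdown commands shown in the following subsections are typically combined into a single job script. A minimal SLURM sketch for the HHH mode is given below; the node count, wall time, script location, and benchmark arguments are hypothetical and should be adapted to the local cluster.

   #!/bin/bash
   #SBATCH -J rdma-hadoop-hhh
   #SBATCH -N 4
   #SBATCH -t 01:00:00

   # Start Hadoop in HHH mode (see Section 5.2.1)
   $HADOOP_HOME/bin/slurm_pbs/hibd_install_configure_start.sh -s -m hhh-default \
       -r /dev/shm -h $HADOOP_HOME -j $JAVA_HOME

   # Run a benchmark using the generated conf directory (see Section 5.2.2);
   # the conf_<job_id> directory name follows Section 5.2.2, with SLURM_JOB_ID assumed as the job id here
   $HADOOP_HOME/bin/hadoop --config ./conf_$SLURM_JOB_ID jar \
       $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
       randomwriter -Dmapreduce.randomwriter.mapsperhost=4 rand_in

   # Stop Hadoop and clean up logs and data (see Section 5.2.3)
   $HADOOP_HOME/bin/slurm_pbs/hibd_stop_cleanup.sh -d -h $HADOOP_HOME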
5.2.1 Startup

To start Hadoop in HHH mode, the following command can be used:

$ hibd_install_configure_start.sh -s -m hhh-default -r /dev/shm -h $HADOOP_HOME -j $JAVA_HOME

To start Hadoop in HHH-M mode, the following command can be used:

$ hibd_install_configure_start.sh -s -m hhh-m -r /dev/shm -h $HADOOP_HOME -j $JAVA_HOME

To start Hadoop in HHH-L mode, the following command can be used:

$ hibd_install_configure_start.sh -s -m hhh-l -r /dev/shm -h $HADOOP_HOME -j $JAVA_HOME -l <lustre_path>

5.2.2 Running Benchmarks

The user can launch a benchmark after a successful start of the Hadoop cluster. While running a benchmark, the user should provide the Hadoop config directory using the --config flag of the hadoop script. If the user does not have any pre-configured files, the default config directory will be created in the present working directory, named conf concatenated with the job id.

$ $HADOOP_HOME/bin/hadoop --config ./conf_<job_id> jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar randomwriter -Dmapreduce.randomwriter.mapsperhost=4 -Dmapreduce.randomwriter.bytespermap=<nbytes> rand_in

$ $HADOOP_HOME/bin/hadoop --config ./conf_<job_id> jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar sort rand_in rand_out

5.2.3 Shutdown

In order to stop Hadoop and clean up the corresponding logs, the following command can be used:

$ hibd_stop_cleanup.sh -d -l <lustre_path> -h $HADOOP_HOME

5.3 Running MapReduce over Lustre with SLURM/PBS

5.3.1 Startup

To start Hadoop in MapReduce over Lustre with local disks mode, the following command can be used:
$ hibd_install_configure_start.sh -s -m mrlustre-local -l <lustre_path> -j $JAVA_HOME -h $HADOOP_HOME

To start Hadoop in MapReduce over Lustre without local disks mode, the following command can be used:

$ hibd_install_configure_start.sh -s -m mrlustre-lustre -l <lustre_path> -j $JAVA_HOME -h $HADOOP_HOME

5.3.2 Running Benchmarks

Running a benchmark for MapReduce over Lustre with local disks follows the same guidelines as shown above. For benchmarks running on MapReduce over Lustre without local disks, the following commands should be used.

$ /tmp/hadoop_install_<job_id>/bin/hadoop jar /tmp/hadoop_install_<job_id>/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar randomwriter -Dmapreduce.randomwriter.mapsperhost=4 -Dmapreduce.randomwriter.bytespermap=<nbytes> file://<lustre_path>/hibd_data_<job_id>/rand_in

$ /tmp/hadoop_install_<job_id>/bin/hadoop jar /tmp/hadoop_install_<job_id>/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar sort file://<lustre_path>/hibd_data_<job_id>/rand_in file://<lustre_path>/hibd_data_<job_id>/rand_out

5.3.3 Shutdown

For stopping Hadoop and cleaning the used directories, the same command as shown above can be used.

6 Benchmarks

6.1 TestDFSIO

The TestDFSIO benchmark is used to measure the I/O performance of the underlying file system. It does this by using a MapReduce job to read or write files in parallel. Each file is read or written in a separate map task, and the benchmark reports the average read/write throughput per map. On a client node, the TestDFSIO write experiment can be run using the following command:

$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-*-tests.jar TestDFSIO -write -nrFiles <nfiles> -fileSize <fsize>
This command writes nfiles files of fsize MB each. To run the same job with MapReduce over Lustre, one additional config parameter must be added in mapred-site.xml.

   <property>
     <name>test.build.data</name>
     <value><lustre-path-data-dir>/benchmarks/TestDFSIO</value>
   </property>

After adding this config parameter, the same command as mentioned above can be used to launch the TestDFSIO experiment on top of Lustre.

6.2 Sort

The Sort benchmark uses the MapReduce framework to sort the input directory into the output directory. The inputs and outputs are sequence files where the keys and values are BytesWritable. Before running the Sort benchmark, we can use RandomWriter to generate the input data. RandomWriter writes random data to HDFS using the MapReduce framework. Each map takes a single file name as input and writes random BytesWritable keys and values to the HDFS sequence file. On a client node, the RandomWriter experiment can be run using the following command:

$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar randomwriter -Dmapreduce.randomwriter.bytespermap=<nbytes> -Dmapreduce.randomwriter.mapsperhost=<nmaps> <out-dir>

This command launches nmaps maps per node, and each map writes nbytes of random data to out-dir. On a client node, the Sort experiment can be run using the following command:

$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar sort -r <nreds> <in-dir> <out-dir>

This command launches nreds reduces to sort data from in-dir to out-dir. The input directory of Sort can be the output directory of RandomWriter. To run the same job with MapReduce over Lustre, the following commands can be used.

$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar randomwriter -Dmapreduce.randomwriter.bytespermap=<nbytes> -Dmapreduce.randomwriter.mapsperhost=<nmaps> file:///<lustre-path-data-dir>/<out-dir>

$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar sort -r <nreds> file:///<lustre-path-data-dir>/<in-dir> file:///<lustre-path-data-dir>/<out-dir>
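As a concrete illustration of the HDFS case above, the following sequence uses purely illustrative sizes: 4 maps per host with 256 MB per map (roughly 1 GB of random data per node), followed by a sort with 8 reduces.

$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar randomwriter -Dmapreduce.randomwriter.bytespermap=268435456 -Dmapreduce.randomwriter.mapsperhost=4 rand_in

$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar sort -r 8 rand_in rand_out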
6.3 TeraSort

TeraSort is probably the most well-known Hadoop benchmark. It is a benchmark that combines testing of the HDFS and MapReduce layers of a Hadoop cluster. The input data for TeraSort can be generated by the TeraGen tool, which writes the desired number of rows of data in the input directory. By default, the key and value size is fixed for this benchmark at 100 bytes. TeraSort takes the data from the input directory and sorts it to another directory. The output of TeraSort can be validated by the TeraValidate tool.

Before running the TeraSort benchmark, we can use TeraGen to generate the input data as follows:

$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar teragen <nrows> <out-dir>

This command writes nrows 100-byte rows to out-dir. On a client node, the TeraSort experiment can be run using the following command:

$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar terasort <in-dir> <out-dir>

This command sorts data from in-dir to out-dir. The input directory of TeraSort can be the output directory of TeraGen. To run the same job with MapReduce over Lustre, the following commands can be used.

$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar teragen <nrows> file:///<lustre-path-data-dir>/<out-dir>

$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar terasort file:///<lustre-path-data-dir>/<in-dir> file:///<lustre-path-data-dir>/<out-dir>

6.4 OHB Micro-benchmarks

The OHB Micro-benchmarks support standalone evaluations of the Hadoop Distributed File System (HDFS) and Memcached. These benchmarks help fine-tune each component while avoiding the impact of the others. OSU HiBD-Benchmarks (OHB-0.8) have HDFS benchmarks for Sequential Write Latency (SWL), Sequential Read Latency (SRL), Random Read Latency (RRL), Sequential Write Throughput (SWT), and Sequential Read Throughput (SRT). The source code can be downloaded from hibd.cse.ohio-state.edu/download/hibd/osu-hibd-benchmarks-0.8.tar.gz. The source can be compiled with the help of the Makefile provided. More details on building and running the OHB Micro-benchmarks are provided in the README.
A brief description of each benchmark is provided below:

Sequential Write Latency (SWL): This benchmark measures the latency of a sequential write to HDFS. The benchmark takes five parameters: file name (-filename), file size (-filesize), block size (-bsize), replication factor (-rep), and buffer size (-bufsize). The mandatory parameters are file name and size (in MB). The output of the benchmark is the time taken to write the file to HDFS. The buffer size indicates the size of the write buffer. The HDFS block size and replication factor can also be tuned through this benchmark. The benchmark also prints the important configuration parameters of HDFS.

Sequential Read Latency (SRL): This benchmark measures the latency of a sequential read from HDFS. The benchmark takes two parameters: file name and buffer size. The mandatory parameter is file name. The output of the benchmark is the time taken to read the file from HDFS. The buffer size indicates the size of the read buffer. The benchmark also prints the important configuration parameters of HDFS.

Random Read Latency (RRL): This benchmark measures the latency of random reads from HDFS. The benchmark takes four parameters: file name (-filename), file size (-filesize), skip size (-skipsize), and buffer size (-bufsize). The mandatory parameters are file name and file size. The benchmark first creates a file 2x the file (read) size and then randomly reads from it with a default skip size of 10. The output of the benchmark is the time taken to read the file from HDFS. The buffer size indicates the size of the read buffer. The benchmark also prints the important configuration parameters of HDFS.

Sequential Write Throughput (SWT): This benchmark measures the throughput of sequential writes to HDFS. The benchmark takes five parameters: file size (-filesize), block size (-bsize), replication factor (-rep), buffer size (-bufsize), and an output directory (-outdir) for the output files. The mandatory parameters are file size (in MB) and the output directory. The Linux xargs command is used to launch multiple concurrent writers. File size indicates the write size per writer. A hostfile contains the hostnames where the write processes are launched. The benchmark outputs the total write throughput in MBps. The buffer size indicates the size of the write buffer. The HDFS block size and replication factor can also be tuned through this benchmark.

Sequential Read Throughput (SRT): This benchmark measures the throughput of sequential reads from HDFS. The benchmark takes three parameters: file size (-filesize), buffer size (-bufsize), and an output directory (-outdir) for the output files. The mandatory parameters are file size (in MB) and the output directory. The Linux xargs command is used to launch multiple concurrent readers. File size indicates the read size per reader. A hostfile contains the hostnames where the read processes are launched. The benchmark outputs the total read throughput in MBps. The buffer size indicates the size of the read buffer.
Introduction to HDFS. Prasanth Kothuri, CERN
Prasanth Kothuri, CERN 2 What s HDFS HDFS is a distributed file system that is fault tolerant, scalable and extremely easy to expand. HDFS is the primary distributed storage for Hadoop applications. Hadoop
HADOOP MOCK TEST HADOOP MOCK TEST II
http://www.tutorialspoint.com HADOOP MOCK TEST Copyright tutorialspoint.com This section presents you various set of Mock Tests related to Hadoop Framework. You can download these sample mock tests at
Installing Hadoop. You need a *nix system (Linux, Mac OS X, ) with a working installation of Java 1.7, either OpenJDK or the Oracle JDK. See, e.g.
Big Data Computing Instructor: Prof. Irene Finocchi Master's Degree in Computer Science Academic Year 2013-2014, spring semester Installing Hadoop Emanuele Fusco ([email protected]) Prerequisites You
Data Analytics. CloudSuite1.0 Benchmark Suite Copyright (c) 2011, Parallel Systems Architecture Lab, EPFL. All rights reserved.
Data Analytics CloudSuite1.0 Benchmark Suite Copyright (c) 2011, Parallel Systems Architecture Lab, EPFL All rights reserved. The data analytics benchmark relies on using the Hadoop MapReduce framework
Apache Hadoop. Alexandru Costan
1 Apache Hadoop Alexandru Costan Big Data Landscape No one-size-fits-all solution: SQL, NoSQL, MapReduce, No standard, except Hadoop 2 Outline What is Hadoop? Who uses it? Architecture HDFS MapReduce Open
Deploying Apache Hadoop with Colfax and Mellanox VPI Solutions
WHITE PAPER March 2013 Deploying Apache Hadoop with Colfax and Mellanox VPI Solutions In collaboration with Colfax International Background...1 Hardware...1 Software Requirements...3 Installation...3 Scalling
Introduction to HDFS. Prasanth Kothuri, CERN
Prasanth Kothuri, CERN 2 What s HDFS HDFS is a distributed file system that is fault tolerant, scalable and extremely easy to expand. HDFS is the primary distributed storage for Hadoop applications. HDFS
Hadoop. Apache Hadoop is an open-source software framework for storage and large scale processing of data-sets on clusters of commodity hardware.
Hadoop Source Alessandro Rezzani, Big Data - Architettura, tecnologie e metodi per l utilizzo di grandi basi di dati, Apogeo Education, ottobre 2013 wikipedia Hadoop Apache Hadoop is an open-source software
Data Science Analytics & Research Centre
Data Science Analytics & Research Centre Data Science Analytics & Research Centre 1 Big Data Big Data Overview Characteristics Applications & Use Case HDFS Hadoop Distributed File System (HDFS) Overview
Installing Hadoop over Ceph, Using High Performance Networking
WHITE PAPER March 2014 Installing Hadoop over Ceph, Using High Performance Networking Contents Background...2 Hadoop...2 Hadoop Distributed File System (HDFS)...2 Ceph...2 Ceph File System (CephFS)...3
Apache Hadoop new way for the company to store and analyze big data
Apache Hadoop new way for the company to store and analyze big data Reyna Ulaque Software Engineer Agenda What is Big Data? What is Hadoop? Who uses Hadoop? Hadoop Architecture Hadoop Distributed File
MarkLogic Server. MarkLogic Connector for Hadoop Developer s Guide. MarkLogic 8 February, 2015
MarkLogic Connector for Hadoop Developer s Guide 1 MarkLogic 8 February, 2015 Last Revised: 8.0-3, June, 2015 Copyright 2015 MarkLogic Corporation. All rights reserved. Table of Contents Table of Contents
The Maui High Performance Computing Center Department of Defense Supercomputing Resource Center (MHPCC DSRC) Hadoop Implementation on Riptide - -
The Maui High Performance Computing Center Department of Defense Supercomputing Resource Center (MHPCC DSRC) Hadoop Implementation on Riptide - - Hadoop Implementation on Riptide 2 Table of Contents Executive
RDMA-based Plugin Design and Profiler for Apache and Enterprise Hadoop Distributed File system THESIS
RDMA-based Plugin Design and Profiler for Apache and Enterprise Hadoop Distributed File system THESIS Presented in Partial Fulfillment of the Requirements for the Degree Master of Science in the Graduate
HADOOP CLUSTER SETUP GUIDE:
HADOOP CLUSTER SETUP GUIDE: Passwordless SSH Sessions: Before we start our installation, we have to ensure that passwordless SSH Login is possible to any of the Linux machines of CS120. In order to do
Set JAVA PATH in Linux Environment. Edit.bashrc and add below 2 lines $vi.bashrc export JAVA_HOME=/usr/lib/jvm/java-7-oracle/
Download the Hadoop tar. Download the Java from Oracle - Unpack the Comparisons -- $tar -zxvf hadoop-2.6.0.tar.gz $tar -zxf jdk1.7.0_60.tar.gz Set JAVA PATH in Linux Environment. Edit.bashrc and add below
HDFS Snapshots and Beyond
HDFS Snapshots and Beyond Tsz-Wo (Nicholas) Sze Jing Zhao October 29, 2013 Page 1 About Us Tsz-Wo Nicholas Sze, Ph.D. Software Engineer at Hortonworks PMC Member at Apache Hadoop One of the most active
MapReduce. Tushar B. Kute, http://tusharkute.com
MapReduce Tushar B. Kute, http://tusharkute.com What is MapReduce? MapReduce is a framework using which we can write applications to process huge amounts of data, in parallel, on large clusters of commodity
Hadoop on the Gordon Data Intensive Cluster
Hadoop on the Gordon Data Intensive Cluster Amit Majumdar, Scientific Computing Applications Mahidhar Tatineni, HPC User Services San Diego Supercomputer Center University of California San Diego Dec 18,
HADOOP MOCK TEST HADOOP MOCK TEST
http://www.tutorialspoint.com HADOOP MOCK TEST Copyright tutorialspoint.com This section presents you various set of Mock Tests related to Hadoop Framework. You can download these sample mock tests at
5 HDFS - Hadoop Distributed System
5 HDFS - Hadoop Distributed System 5.1 Definition and Remarks HDFS is a file system designed for storing very large files with streaming data access patterns running on clusters of commoditive hardware.
Running Kmeans Mapreduce code on Amazon AWS
Running Kmeans Mapreduce code on Amazon AWS Pseudo Code Input: Dataset D, Number of clusters k Output: Data points with cluster memberships Step 1: for iteration = 1 to MaxIterations do Step 2: Mapper:
Hadoop: Embracing future hardware
Hadoop: Embracing future hardware Suresh Srinivas @suresh_m_s Page 1 About Me Architect & Founder at Hortonworks Long time Apache Hadoop committer and PMC member Designed and developed many key Hadoop
Enabling High performance Big Data platform with RDMA
Enabling High performance Big Data platform with RDMA Tong Liu HPC Advisory Council Oct 7 th, 2014 Shortcomings of Hadoop Administration tooling Performance Reliability SQL support Backup and recovery
CactoScale Guide User Guide. Athanasios Tsitsipas (UULM), Papazachos Zafeirios (QUB), Sakil Barbhuiya (QUB)
CactoScale Guide User Guide Athanasios Tsitsipas (UULM), Papazachos Zafeirios (QUB), Sakil Barbhuiya (QUB) Version History Version Date Change Author 0.1 12/10/2014 Initial version Athanasios Tsitsipas(UULM)
HADOOP PERFORMANCE TUNING
PERFORMANCE TUNING Abstract This paper explains tuning of Hadoop configuration parameters which directly affects Map-Reduce job performance under various conditions, to achieve maximum performance. The
Setting up Hadoop with MongoDB on Windows 7 64-bit
SGT WHITE PAPER Setting up Hadoop with MongoDB on Windows 7 64-bit HCCP Big Data Lab 2015 SGT, Inc. All Rights Reserved 7701 Greenbelt Road, Suite 400, Greenbelt, MD 20770 Tel: (301) 614-8600 Fax: (301)
Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee [email protected] June 3 rd, 2008
Hadoop Distributed File System Dhruba Borthakur Apache Hadoop Project Management Committee [email protected] June 3 rd, 2008 Who Am I? Hadoop Developer Core contributor since Hadoop s infancy Focussed
H2O on Hadoop. September 30, 2014. www.0xdata.com
H2O on Hadoop September 30, 2014 www.0xdata.com H2O on Hadoop Introduction H2O is the open source math & machine learning engine for big data that brings distribution and parallelism to powerful algorithms
THE HADOOP DISTRIBUTED FILE SYSTEM
THE HADOOP DISTRIBUTED FILE SYSTEM Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler Presented by Alexander Pokluda October 7, 2013 Outline Motivation and Overview of Hadoop Architecture,
Tableau Spark SQL Setup Instructions
Tableau Spark SQL Setup Instructions 1. Prerequisites 2. Configuring Hive 3. Configuring Spark & Hive 4. Starting the Spark Service and the Spark Thrift Server 5. Connecting Tableau to Spark SQL 5A. Install
Experiences with Lustre* and Hadoop*
Experiences with Lustre* and Hadoop* Gabriele Paciucci (Intel) June, 2014 Intel * Some Con fidential name Do Not Forward and brands may be claimed as the property of others. Agenda Overview Intel Enterprise
HADOOP. Installation and Deployment of a Single Node on a Linux System. Presented by: Liv Nguekap And Garrett Poppe
HADOOP Installation and Deployment of a Single Node on a Linux System Presented by: Liv Nguekap And Garrett Poppe Topics Create hadoopuser and group Edit sudoers Set up SSH Install JDK Install Hadoop Editting
Revolution R Enterprise 7 Hadoop Configuration Guide
Revolution R Enterprise 7 Hadoop Configuration Guide The correct bibliographic citation for this manual is as follows: Revolution Analytics, Inc. 2014. Revolution R Enterprise 7 Hadoop Configuration Guide.
Single Node Hadoop Cluster Setup
Single Node Hadoop Cluster Setup This document describes how to create Hadoop Single Node cluster in just 30 Minutes on Amazon EC2 cloud. You will learn following topics. Click Here to watch these steps
The Hadoop Distributed File System
The Hadoop Distributed File System The Hadoop Distributed File System, Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler, Yahoo, 2010 Agenda Topic 1: Introduction Topic 2: Architecture
Overview. Big Data in Apache Hadoop. - HDFS - MapReduce in Hadoop - YARN. https://hadoop.apache.org. Big Data Management and Analytics
Overview Big Data in Apache Hadoop - HDFS - MapReduce in Hadoop - YARN https://hadoop.apache.org 138 Apache Hadoop - Historical Background - 2003: Google publishes its cluster architecture & DFS (GFS)
Hadoop Installation. Sandeep Prasad
Hadoop Installation Sandeep Prasad 1 Introduction Hadoop is a system to manage large quantity of data. For this report hadoop- 1.0.3 (Released, May 2012) is used and tested on Ubuntu-12.04. The system
Hadoop Architecture. Part 1
Hadoop Architecture Part 1 Node, Rack and Cluster: A node is simply a computer, typically non-enterprise, commodity hardware for nodes that contain data. Consider we have Node 1.Then we can add more nodes,
Kognitio Technote Kognitio v8.x Hadoop Connector Setup
Kognitio Technote Kognitio v8.x Hadoop Connector Setup For External Release Kognitio Document No Authors Reviewed By Authorised By Document Version Stuart Watt Date Table Of Contents Document Control...
Introduction to MapReduce and Hadoop
Introduction to MapReduce and Hadoop Jie Tao Karlsruhe Institute of Technology [email protected] Die Kooperation von Why Map/Reduce? Massive data Can not be stored on a single machine Takes too long to process
Understanding Hadoop Performance on Lustre
Understanding Hadoop Performance on Lustre Stephen Skory, PhD Seagate Technology Collaborators Kelsie Betsch, Daniel Kaslovsky, Daniel Lingenfelter, Dimitar Vlassarev, and Zhenzhen Yan LUG Conference 15
HDFS Cluster Installation Automation for TupleWare
HDFS Cluster Installation Automation for TupleWare Xinyi Lu Department of Computer Science Brown University Providence, RI 02912 [email protected] March 26, 2014 Abstract TupleWare[1] is a C++ Framework
Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms
Distributed File System 1 How do we get data to the workers? NAS Compute Nodes SAN 2 Distributed File System Don t move data to workers move workers to the data! Store data on the local disks of nodes
Cray XC30 Hadoop Platform Jonathan (Bill) Sparks Howard Pritchard Martha Dumler
Cray XC30 Hadoop Platform Jonathan (Bill) Sparks Howard Pritchard Martha Dumler Safe Harbor Statement This presentation may contain forward-looking statements that are based on our current expectations.
Setup Hadoop On Ubuntu Linux. ---Multi-Node Cluster
Setup Hadoop On Ubuntu Linux ---Multi-Node Cluster We have installed the JDK and Hadoop for you. The JAVA_HOME is /usr/lib/jvm/java/jdk1.6.0_22 The Hadoop home is /home/user/hadoop-0.20.2 1. Network Edit
Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee [email protected] [email protected]
Hadoop Distributed File System Dhruba Borthakur Apache Hadoop Project Management Committee [email protected] [email protected] Hadoop, Why? Need to process huge datasets on large clusters of computers
Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA
Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA http://kzhang6.people.uic.edu/tutorial/amcis2014.html August 7, 2014 Schedule I. Introduction to big data
docs.hortonworks.com
docs.hortonworks.com Hortonworks Data Platform: Upgrading HDP Manually Copyright 2012-2015 Hortonworks, Inc. Some rights reserved. The Hortonworks Data Platform, powered by Apache Hadoop, is a massively
Hadoop Lab - Setting a 3 node Cluster. http://hadoop.apache.org/releases.html. Java - http://wiki.apache.org/hadoop/hadoopjavaversions
Hadoop Lab - Setting a 3 node Cluster Packages Hadoop Packages can be downloaded from: http://hadoop.apache.org/releases.html Java - http://wiki.apache.org/hadoop/hadoopjavaversions Note: I have tested
Hadoop* on Lustre* Liu Ying ([email protected]) High Performance Data Division, Intel Corporation
Hadoop* on Lustre* Liu Ying ([email protected]) High Performance Data Division, Intel Corporation Agenda Overview HAM and HAL Hadoop* Ecosystem with Lustre * Benchmark results Conclusion and future work
Installation Guide Setting Up and Testing Hadoop on Mac By Ryan Tabora, Think Big Analytics
Installation Guide Setting Up and Testing Hadoop on Mac By Ryan Tabora, Think Big Analytics www.thinkbiganalytics.com 520 San Antonio Rd, Suite 210 Mt. View, CA 94040 (650) 949-2350 Table of Contents OVERVIEW
Deploy Apache Hadoop with Emulex OneConnect OCe14000 Ethernet Network Adapters
CONNECT - Lab Guide Deploy Apache Hadoop with Emulex OneConnect OCe14000 Ethernet Network Adapters Hardware, software and configuration steps needed to deploy Apache Hadoop 2.4.1 with the Emulex family
Take An Internal Look at Hadoop. Hairong Kuang Grid Team, Yahoo! Inc [email protected]
Take An Internal Look at Hadoop Hairong Kuang Grid Team, Yahoo! Inc [email protected] What s Hadoop Framework for running applications on large clusters of commodity hardware Scale: petabytes of data
Distributed File Systems
Distributed File Systems Paul Krzyzanowski Rutgers University October 28, 2012 1 Introduction The classic network file systems we examined, NFS, CIFS, AFS, Coda, were designed as client-server applications.
Hadoop Streaming. Table of contents
Table of contents 1 Hadoop Streaming...3 2 How Streaming Works... 3 3 Streaming Command Options...4 3.1 Specifying a Java Class as the Mapper/Reducer... 5 3.2 Packaging Files With Job Submissions... 5
