User Manual: Using Hadoop with WS-PGRADE Workflows
December 9, 2014

1 About

This manual explains the configuration of a set of workflows that can be used to submit a Hadoop job through a WS-PGRADE portal. The workflow automatically creates a Hadoop cluster in an OpenStack cloud and executes the Hadoop job there. The user only needs to provide the job input files and two configuration files specifying the cluster and job parameters.

Two methods can be used for submitting a Hadoop job through a WS-PGRADE portal:

1. Single Node Method
The Single Node method uses a single-node workflow with a simple program to create a Hadoop cluster in an OpenStack infrastructure, execute jobs on the cluster and retrieve the results.

2. Three Node Method
The Three Node method works the same way as the Single Node method, but divides the task into three stages. The first stage creates the Hadoop cluster, the second executes the Hadoop job and the third destroys the cluster. Each stage can be considered a workflow node executing a particular task. The main idea behind dividing the complete process into three stages is to allow the user to deploy Hadoop before executing a job rather than at job execution time. In addition, it allows the Hadoop cluster to be reused, as the user can keep adding Execute nodes, one after the other. An added advantage is that the user can place these three nodes anywhere in the workflow.
2 Prerequisites

1. Access to a CloudBroker Platform
2. Access to a WS-PGRADE (gUSE) portal configured to submit jobs to the CloudBroker Platform
3. Access to an OpenStack cloud configured for job submission through the CloudBroker Platform
4. The Hadoop application pre-deployed in the CloudBroker Platform

3 Single Node Method

1. Log in to the WS-PGRADE portal, select the Import option under the Workflow tab and select Remote SHIWA Repository.
2. From the list of public bundles, find and import the bundle named Hadoop.
3. Select the Workflow tab and click the Configure button for the newly imported workflow.
4. In the Job Executable tab, find the deployed Hadoop application and configure the parameters as desired.
5. Download the configuration files from here.
6. Fill in the details in the job.cfg and cluster.cfg configuration files.
7. Copy your Hadoop job executable (jar file) into the same folder.
8. Copy your Hadoop job input files into a folder called input, pack this folder into a tar archive called input.tar and copy this archive into the same folder as before.
9. Copy your OpenStack credentials file into the same folder. Make sure that your password is hardcoded in the file.
10. Pack the configuration files, job executable, job input archive and OpenStack credentials file into a tar file called Data.tar (example commands are shown after this list).
11. In the Job I/O tab, scroll down to the Port 3 settings and upload the Data.tar file.
12. Remember to save and upload the new configuration.
13. You can now submit the workflow from the Workflow tab.
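The shell commands below sketch steps 8 and 10 on a Linux machine. The names job.cfg, cluster.cfg, input.tar and Data.tar are fixed by the workflow, while wordcount.jar and openstack-credentials are placeholder names for your own Hadoop job jar and OpenStack credentials file.

    # Step 8: pack the job input files (in the folder "input") into input.tar
    tar -cf input.tar input

    # Step 10: pack the configuration files, job executable, input archive and
    # OpenStack credentials into Data.tar
    # (wordcount.jar and openstack-credentials are placeholder names)
    tar -cf Data.tar job.cfg cluster.cfg wordcount.jar input.tar openstack-credentials

    # Optional: list the archive contents before uploading it to Port 3
    tar -tf Data.tar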
4 Three Node Method

[Figure 1: Three Node basic configuration]

1. Download the configuration files from here.
2. Fill in the details in the job.cfg and cluster.cfg configuration files.
3. Copy your Hadoop job executable (jar file) into the same folder.
4. Copy your Hadoop job input files into a folder called input, pack this folder into a tar archive called input.tar and copy this archive into the same folder as before.
5. Copy your OpenStack credentials file into the same folder. Make sure that your password is hardcoded in the file.
6. Log in to the WS-PGRADE portal and create a workflow according to your application (the configuration of each node is given below).
7. Place the Create Node before the first Execute Node and place the Destroy Node after the last Execute Node (see Figure 1).
8. Connect the output (channel) port of the Create Node to the input (channel) port of the first Execute Node.
9. Connect the output (channel) port of the first Execute Node to the input (channel) port of the next Execute Node, and repeat for every Execute Node.
10. Connect the output (channel) port of the last Execute Node to the input (channel) port of the Destroy Node.

4.1 Create Node

1. This node should have 3 input ports and 1 output port (ports 0-2 as input and port 3 as output).
2. Download the input scripts from here.
3. Configure the node as follows:
   (a) Job Executable
       i. Type: cloudbroker
       ii. Name: the name of the platform
       iii. Software: Hadoop 1.0
       iv. Executable: Hadoop 1.0 hadoop test.sh
       v. Fill in resource, region and instance type according to your requirements.
   (b) Job I/O
       i. Port 0 input file: hadoop.sh (for each port, set the internal file name to be the same as the input file name)
       ii. Port 1 input file: create.sh
       iii. Port 2 input file: Data.tar (pack the cluster.cfg and OpenStack credentials file into a tar file named Data.tar; see the example after this list)
       iv. Port 3 output (channel) file: job.id
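For the Create Node, the Data.tar uploaded on Port 2 only needs the cluster configuration and the OpenStack credentials. A minimal sketch, where openstack-credentials is a placeholder name for your own credentials file:

    # Data.tar for the Create Node (Port 2)
    tar -cf Data.tar cluster.cfg openstack-credentials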
4.2 Execute Node

1. This node should have 4 input ports and 2 output ports (ports 0-3 as input and ports 4-5 as output).
2. Download the input scripts from here.
3. Configure the node as follows:
   (a) Job Executable
       i. Type: cloudbroker
       ii. Name: the name of the platform
       iii. Software: Hadoop 1.0
       iv. Executable: Hadoop 1.0 hadoop test.sh
       v. Fill in resource, region and instance type the same as in the Create Node.
   (b) Job I/O
       i. Port 0 input (channel) file: job.id
       ii. Port 1 input file: hadoop.sh (for each port, set the internal file name to be the same as the input file name)
       iii. Port 2 input file: execute.sh
       iv. Port 3 input file: Data.tar (pack the configuration files, job executable, job input archive and OpenStack credentials file into a tar file named Data.tar, as in the Single Node method)
       v. Port 4 output (channel) file: job.id
       vi. Port 5 output file: output.tar.gz (the Hadoop job output folder as a compressed archive; see the note after this list)
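When the workflow completes, Port 5 of the Execute Node delivers the Hadoop job output folder as a gzip-compressed tar archive. A minimal sketch of unpacking it on a Linux machine (the name and layout of the extracted folder depend on your Hadoop job):

    # Unpack the Hadoop job output downloaded from Port 5
    tar -xzf output.tar.gz
    ls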
4.3 Destroy Node

1. This node should have 3 input ports.
2. Download the input script from here.
3. Configure the node as follows:
   (a) Job Executable
       i. Type: cloudbroker
       ii. Name: the name of the platform
       iii. Software: Hadoop 1.0
       iv. Executable: Hadoop 1.0 hadoop test.sh
       v. Fill in resource, region and instance type the same as in the Create Node.
   (b) Job I/O
       i. Port 0 input (channel) file: job.id
       ii. Port 1 input file: hadoop.sh (for each port, set the internal file name to be the same as the input file name)
       iii. Port 2 input file: Data.tar (pack the cluster.cfg and OpenStack credentials file into a tar file named Data.tar, as for the Create Node)