CONFIGURING ECLIPSE FOR AWS EMR DEVELOPMENT

In this post we share a tutorial for configuring the Eclipse IDE (Integrated Development Environment) for Amazon AWS EMR scripting and development. Once we started creating our own bootstrap scripts for EMR, we quickly realized that it gets cumbersome to use Notepad++, PuTTY, WinSCP, Command Prompt for the EMR CLI, and Git, all in different windows, and that it would be nice to have one integrated environment for all of it. Eclipse seemed perfect for this, as it has a plugin for almost anything one can think of. We developed a bootstrap script for launching a keep-alive MapR M3 cluster on AWS EMR using this environment and were quite happy with it.

All the steps are summarized in the index below. Feel free to jump to a specific topic that interests you or follow all the steps from the beginning. Please leave your comments, as we'd like to hear back from you!

Installing Amazon EMR Command Line Interface (CLI)
Setting up Eclipse IDE for AWS/Hadoop Development in Shell Scripts And Python
   Installing Oracle JDK
   Installing Eclipse
   Installing PyDev Plugin
   Installing AWS Toolkit
   Installing ShellEd
   Configuring SSH
   Configuring GIT
   Setting Up CMD Prompt Inside Eclipse
Developing EMR Bootstrap Script
   Launching EMR MapR Cluster
   Connecting To Master Node From Eclipse
   Running the Bootstrap Script
   Testing the Cluster
Installing Amazon EMR Command Line Interface (CLI)

This installation is done on a local Windows computer (not on AWS). The EMR CLI is written in Ruby and therefore requires Ruby to be installed as a prerequisite.

1. Install Ruby 1.8.7 on Windows
   a. Get the installation package from http://rubyforge.org/frs/download.php/76524/rubyinstaller-1.8.7-p371.exe
   b. Go through the installation procedure
   c. Check that Ruby and RubyGems are installed properly by running ruby -v and gem -v from the command prompt.
2. Get the EMR CLI from http://aws.amazon.com/developertools/2264
   a. Unzip the content of elastic-mapreduce-ruby.zip into C:\AWS\elastic-mapreduce-cli
   b. Create a credentials.json file under C:\AWS\elastic-mapreduce-cli

      {
        "access_id": "AKIAJGJKJSHKDJF6GUIOIEUR",
        "private_key": "dfsdfkjkdfsdfldfsdf99484nksdjnwr934",
        "key-pair": "key",
        "key-pair-file": "C:\\key.ppk",
        "log_uri": "s3n://mybucket/logs/",
        "region": "us-east-1"
      }

   c. Test the CLI by running elastic-mapreduce --version from the command prompt
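The credentials file is plain JSON, so a missing quote or an unescaped backslash in the Windows key path is easy to overlook. As a minimal sketch (not part of the EMR CLI), the following Python snippet checks that the text of a credentials.json parses and contains the six fields listed above. The names check_credentials and REQUIRED_KEYS are our own illustrative choices, and the sample values are made up.

```python
import json

# Field names taken from the credentials.json example in this tutorial.
REQUIRED_KEYS = ["access_id", "private_key", "key-pair",
                 "key-pair-file", "log_uri", "region"]


def check_credentials(text):
    """Return a list of problems found in the credentials JSON text (empty if none)."""
    try:
        data = json.loads(text)
    except ValueError as err:  # json.JSONDecodeError is a subclass of ValueError
        return ["not valid JSON: %s" % err]
    if not isinstance(data, dict):
        return ["top-level value must be a JSON object"]
    # Report every required field that is absent or empty.
    return ["missing or empty field: %s" % key
            for key in REQUIRED_KEYS if not data.get(key)]


# Illustration with made-up values: a single backslash in the key path
# is not a valid JSON escape, so this text is rejected.
broken = r'{"key-pair-file": "C:\key.ppk"}'
print(check_credentials(broken))
```

To check a real file, read it and pass its contents in, for example check_credentials(open(path).read()), where path points to your own credentials.json.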
Setting up Eclipse IDE for AWS/Hadoop Development in Shell Scripts And Python

Installing Oracle JDK

As stated in the Eclipse readme file, Oracle Java 7u9 is the best supported JDK for Eclipse. We tried installing version 8u5 and it worked fine. Download the JDK from http://download.oracle.com/otnpub/java/jdk/8u5-b13/jdk-8u5-windows-x64.exe and follow the installation procedure.

Installing Eclipse

1. Download the Eclipse JEE version from https://www.eclipse.org/downloads/download.php?file=/technology/epp/downloads/release/kepler/sr2/eclipse-jee-kepler-sr2-win32-x86_64.zip&mirror_id=1135
2. Expand the downloaded zip archive into a local folder (in my case it was C:\Users\Dmitri\eclipse)
3. Launch eclipse.exe
4. Select the workspace location
5. Click OK and Eclipse should launch successfully
Installing PyDev Plugin

1. Go to Help->Install New Software and add the PyDev repository at http://pydev.org/updates
2. Select PyDev for Eclipse and follow the installation procedure
Installing AWS Toolkit

1. Go to Help->Install New Software and add the AWS repository at http://aws.amazon.com/eclipse
2. Provide your AWS access keys and click Finish
3. Go to Window->Open Perspective->Other and select AWS Management
4. Click OK. Eclipse should switch to the AWS Management perspective.

Installing ShellEd

1. Go to Help->Install New Software and add the ShellEd repository at http://sourceforge.net/projects/shelled/files/shelled/update/
2. Click Next and follow the installation procedure

Configuring SSH

1. Go to Window->Preferences->General->Network Connections->SSH2
2. Click the Key Management tab.
3. Click the Load Existing Key button. Locate the AWS private key referenced in the credentials.json file when setting up the AWS EMR CLI.
4. Click the Save Private Key button, click OK twice to bypass the warnings, and save the id_rsa file in your .ssh directory. Eclipse should report that it has successfully saved the public and private keys.

Configuring GIT

1. Go to Window->Preferences->Team->Git->Configuration and change the user and email parameters
2. Create a new project by going to File->New->Project and select Shell Script Project
3. Click Next and give a name to the project
4. Click Finish
5. Right click on the project, go to New->File and create a new .sh file
6. Right click on the project and go to Team->Share Project in the context menu.
7. Select Git and click Next
8. On the Configure Git Repository screen click the Create button
9. Enter the name of the new Git repository
10. Click Finish twice. The project is now labeled NO-HEAD, which means that nothing has been committed yet.
11. Make the first commit: right click on the project name and go to Team->Commit in the context menu. Select the .sh file, provide a commit comment and click Commit.
12. The changes are now committed to the master branch.
13. Push the changes to GitHub: bring up the Git Repositories view by going to Window->Show View->Other->Git->Git Repositories.
14. In the Git Repositories window right click on Remotes and select Create Remote
15. Leave the Remote name as origin, keep the Configure push option selected and click OK
16. Create your GitHub repository if you haven't already done so. If you don't have a GitHub account, sign up for one.
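17. In GitHub, go to Account Settings->SSH Keys and add a new key. Paste the public key that belongs to the private key (*.ppk file) referenced in credentials.json when setting up the EMR CLI, for example the public key saved alongside id_rsa in the Configuring SSH section. Do not paste the private key itself.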
18. Go to the repository, select SSH clone and copy the URL to the clipboard
19. In Eclipse paste the copied URL, select ssh as the protocol, leave the Password blank and click Finish.
20. Click Save and Push
21. If there are non-fast-forward errors, refer to the discussion and troubleshooting tips at http://stackoverflow.com/questions/3598355/i-am-not-able-to-push-on-git

Setting Up CMD Prompt Inside Eclipse

1. In Eclipse go to Run->External Tools->External Tools Configurations.
2. Click on Program and then on the Create New Launch Configuration button (the one with the + sign on the left toolbar)
3. Specify CMD as the Name, C:\Windows\System32\cmd.exe as the Location and the folder where the EMR CLI is installed as the Working Directory.
4. Make sure that Allocate Console is checked on the Common tab
5. Click Run and you have a running Windows console inside Eclipse. If you need to work with another AWS CLI, change the Working Directory to point to its folder.
Developing EMR Bootstrap Script

Launching EMR MapR Cluster

1. Run the following command from CMD:
   C:\AWS\elastic-mapreduce-cli>ruby elastic-mapreduce --create --alive --instance-type m1.large --num-instances 3 --supported-product mapr --name "MapR M3 Cluster" --args "--edition,m3" -v
2. Check the instances in the EC2 Instances window.
3. In the CMD console run the following command to find out the public DNS of the master node:
   C:\AWS\elastic-mapreduce-cli>ruby elastic-mapreduce --describe [Job Flow ID]
4. Check if the MapR Control System (MCS) is running.
   a. Go to Security Groups and click on ElasticMapReduce-master
   b. Right click in the list of permissions and select Add Permission from the context menu. Enter port 8453. You can leave the default value for Network Mask or restrict it to a specific IP address or subnet.
   c. Open MCS in the browser at https://master-node-public-dns:8453. It's going to warn about the certificate, ask about applying licenses, and so on. Accept all of these.
   d. Locate the master node by hovering over the green squares in the Dashboard view and click on it. Check the running services.
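If the browser cannot open MCS, it helps to know whether port 8453 is reachable at all, which separates a security group problem from a certificate or browser problem. The following is a minimal sketch using only the Python standard library; the function name port_is_open is our own, and a successful TCP connection only shows that something is listening on the port, not that MCS itself is healthy.

```python
import socket


def port_is_open(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        # create_connection resolves the host name and connects;
        # the with-block closes the socket again right away.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS lookup failures.
        return False
```

For example, port_is_open("master-node-public-dns", 8453) with your own master node DNS name returning False right after the permission was added would suggest checking the Network Mask entered in step b.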
Connecting To Master Node From Eclipse

1. Go to Window->Open Perspective->Other->Remote System Explorer
2. Click the Define a connection to remote system button (the one with the + sign on the toolbar on the left)
3. In the Select Remote System Type window select SSH Only and click Next
4. Enter the master node public DNS and a connection name and click Finish
5. Right click on the EMR MapR Master Node in the Remote Systems window and choose Connect from the context menu. Change the User ID to hadoop, leave the password blank and click OK.
6. The Properties window should say "Some subsystems connected"
7. Right click on Ssh Terminals under the EMR MapR Master Node connection and select Launch Terminal from the context menu. The terminal should launch with the master node prompt.
8. Switch back to the AWS Management perspective
9. Open the Remote Systems view by going to Window->Show View->Other->Remote Systems->Remote Systems.
10. Open the Terminals view by going to Window->Show View->Other->Remote Systems->Terminals
11. Your environment should now show AWS Explorer, Project Explorer and Remote Systems on the left, the file you are working on at the top, and the EC2, Git, Windows CMD and Terminal (connected to the master node) views at the bottom.
12. Right click on Local Files under Local in the Remote Systems view and select New->Filter from the context menu. Enter the path to the working directory of your project.
13. After the filter is set up you should see the project folder and the *.sh file you are editing
14. Create the /opt/mapr/custom directory on the master node and change its owner to the hadoop user
    $ sudo mkdir /opt/mapr/custom
    $ sudo chown hadoop /opt/mapr/custom
15. Follow a similar procedure to set up a filter under Sftp Files that points to the /opt/mapr/custom folder on the master node.
16. Now you should be able to drag and drop the file from the Local to the Sftp folders.

Running the Bootstrap Script

The finished script is located in GitHub at https://github.com/dmitrisafine/aws-emrmapr/blob/master/aws-emr-mapr/emr-mapr-bootstrap.sh

1. Upload the script to S3.
2. Run the following command from your Eclipse CMD console:
   ruby elastic-mapreduce --create --alive --instance-type m1.large --num-instances 3 --supported-product mapr --name "MapR M3 Cluster" --args "--edition,m3" --bootstrap-action s3://mybucket/emr-mapr-bootstrap.sh
3. Note your Job Flow ID.

Testing the Cluster

1. Find out the public DNS of the master node by running
   C:\AWS\elastic-mapreduce-cli>ruby elastic-mapreduce --describe <Job Flow ID>
2. Go to MCS at https://master_node_public_dns:8453 and log in with hadoop/hadoop as the username and password.
3. Go to Hue at http://master_node_public_dns:8888 and log in with hadoop/mapr as the username and password. It should say "All OK. Configuration check passed."
4. Enjoy hadooping!!!
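Steps 1 to 3 above can be scripted once the output of the describe command is saved. The sketch below assumes that the output is JSON and that the master DNS name sits under JobFlows[0].Instances.MasterPublicDnsName; we have not verified that layout against every CLI version, so adjust the keys if your output differs. The function names and the sample text are made up for illustration.

```python
import json


def master_dns_from_describe(describe_output):
    """Extract the master node's public DNS name from describe output.

    Assumes the layout JobFlows[0].Instances.MasterPublicDnsName.
    Returns None if the field is absent (e.g. the cluster is still starting).
    """
    data = json.loads(describe_output)
    flows = data.get("JobFlows") or []
    if not flows:
        return None
    return (flows[0].get("Instances") or {}).get("MasterPublicDnsName")


def cluster_urls(master_dns):
    """Build the MCS and Hue URLs used in the testing steps above."""
    return {
        "mcs": "https://%s:8453" % master_dns,
        "hue": "http://%s:8888" % master_dns,
    }


# Made-up sample output, for illustration only.
sample = ('{"JobFlows": [{"Instances": '
          '{"MasterPublicDnsName": "ec2-0-0-0-0.compute-1.amazonaws.com"}}]}')
urls = cluster_urls(master_dns_from_describe(sample))
print(urls["mcs"])
print(urls["hue"])
```

Returning None rather than raising keeps the helper usable in a polling loop while the cluster is still starting and the DNS name has not been assigned yet.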