Hadoop Data Warehouse Manual

Ruben Vervaeke & Jonas Lesy

To start off, we'd like to advise you to read the thesis written about this project before applying any changes to the setup! The thesis can be found on the same Moodle page where this manual was found, or via this download link: https://www.theseus.fi/handle/10024/96600

Before making any changes, we advise you to first run a local virtual machine with Hadoop installed and to develop on that. This is a safer way to test, since the setup on the Hadoop cluster is not optimised for debugging. A virtual machine can also be found on the mentioned Moodle page. This machine has Hadoop and HBase already installed and should be used for development. It can be downloaded and then run with VMware Player.

Setting up the virtual machine and obtaining the program

This chapter describes how to work with the virtual machine and what needs to be done before development or testing can continue.

First, as mentioned above, download the virtual machine from the Moodle page. After unzipping, the machine can be opened in VMware Player. When the virtual machine runs for the first time, a screen with three options might pop up. If this happens, please click I Moved It. This adjusts the networking and related settings to the appropriate values.

Now a login screen should pop up which looks like the image below. The password for the Ruben Vervaeke account is karelia.

Before doing anything else, please make sure you have Internet access on this virtual machine, otherwise development will not be possible. This is because we used Maven in our project, and Maven pulls all necessary dependencies from the Internet when the application is built.

The first thing you should notice is that there are two files on the desktop. Please don't alter them. They can be moved, to a Documents folder for example, but make sure to remember where you put them.
The hadoopscript.sh script has to be run to start all Hadoop components and functionalities. This is done by opening the command shell and typing the following lines.

> su
Changes to the root user (the script has to be run as root). The command asks for the root password, which is also karelia, like the user account's password.

> cd Desktop/
Changes to the Desktop folder.

> sh hadoopscript.sh
Runs hadoopscript.sh (which is currently still in the Desktop folder).

Now the script is running and all you need to do is wait. After it is done, check the running Hadoop services by typing the jps command. The following results should pop up:

Here you can see the different services. All of them should be running. If any are not, please (re)start them manually. The commands for this can be found in the other file on the desktop.

The complete application is stored on a server with GitHub version control. Currently Ruben Vervaeke is the owner of the repository, so anyone who wants access to our application needs a GitHub account and has to request access. To do so, send an email with your GitHub username to ruben.vervaeke@hotmail.com, so that you can be added to the GitHub project. Once this is done, you can clone the repository in NetBeans. You can find a tutorial about this here: https://netbeans.org/kb/docs/ide/git.html
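The check of the jps output described above can also be scripted. The following is only a minimal sketch: the function name missing_services is made up for this example, and the daemon names in the usage comment are assumptions based on a typical Hadoop and HBase installation, not the exact list from this virtual machine. Replace them with the services you expect to see.

```shell
#!/bin/sh
# Print every expected daemon name that does NOT appear in jps-style
# output read from standard input. jps prints one "<pid> <ClassName>"
# pair per line, so the name is compared against the second column.
missing_services() {
    jps_output=$(cat)
    for name in "$@"; do
        # Exact match on the second column, so "NameNode" does not
        # accidentally match "SecondaryNameNode".
        if ! printf '%s\n' "$jps_output" |
            awk -v n="$name" '$2 == n { found = 1 } END { exit !found }'; then
            printf '%s\n' "$name"
        fi
    done
}

# Example (the daemon names are an assumption, adjust them as needed):
#   jps | missing_services NameNode DataNode HMaster
```

If the function prints nothing, every listed service was found. Any name it prints has to be (re)started manually.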

Creating a new resource

The implementation in the application is now slightly different from what is explained in the thesis. Adding a new resource has become much more flexible and easy. As mentioned in the thesis, new resources are added by defining them in the XML config files. In addition, a driverjar tag has been added, which defines the name of the MapReduce jar file that was created for the resource.

So we need a MapReduce application. For how to write one, refer to the Hadoop and HBase documentation. The only important requirement is that the mapper outputs its result to the HBase database. Therefore, before working with new domain objects, you need to create a schema for them via the HBase shell. All information about this can be found in the HBase documentation.

After this is done, the data can be stored on the file system. But again, if you defined new resources that did not yet have domain objects defined in the program, you will have to create a corresponding web service class for them. You can use the existing SensorService class as a reference on how to do this.
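As an illustration of the schema step, the sketch below builds a create statement for the HBase shell. The helper name hbase_create_statement, the table name sensor and the column family data are hypothetical and only chosen for this example. The actual tables and column families depend on the domain objects of your resource.

```shell
#!/bin/sh
# Build an HBase shell "create" statement for one table with a single
# column family. $1 is the table name, $2 the column family name.
hbase_create_statement() {
    printf "create '%s', '%s'\n" "$1" "$2"
}

# Example with hypothetical names, to be run on a machine where HBase
# is up (the HBase shell reads commands from standard input):
#   hbase_create_statement sensor data | hbase shell
```

Generating the statement separately makes it possible to review it before it is sent to the HBase shell.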

Connect to the setup with Cloudera

Once the project is ready to be deployed on the Cloudera setup, this chapter describes how to connect to that system. The image below shows the complete network setup. The bottom three machines are the ones the system runs on. These machines are virtual and run on a blade server at the Karelia University of Applied Sciences. They are protected by a gateway which requires a username and password to connect. This networking setup was configured by Tiainen Henri and Janne Puustinen, so full credit for this goes to them.

Connecting to the full setup requires the installation of X2Go Client (http://wiki.x2go.org/doku.php). After the software is completely installed, the first thing that needs to happen is connecting to the gateway. This is done by opening the X2Go Client software and clicking on the Session button followed by New Session. In this window the following details are filled in. And under Session type select the following option:

The first field is the host to connect to, which is the IP address shown on the networking setup image. It is followed by the default user to log in with, which is user. The last field is the SSH port, which is 22 in this case. After this is configured, the X2Go Client should have a session available as shown on the image below.

When you click on this session, the connection is initiated, but a password is required first. The password to connect to the gateway is "Password1!".

Now you're connected to the desktop of the gateway. If the X2Go Client software is not open on this gateway, you'll have to open it first to make the further connection. This is done by right-clicking on the desktop and selecting the option found in the image below.

Now the familiar program is opened and the three other machines can be seen. The software's screen should look like the image below.

Now you can connect to whichever machine you need with the same credentials as before: the default user user and the password "Password1!". The project can now be deployed on the Master machine if it is ready.

The terminal can be opened by clicking this button:

The browser (Mozilla Firefox in this case) can be opened by clicking this button.

Manage and configure the Cloudera Manager

To open the Cloudera Manager, take the following steps. On one of the virtual clients, open the Firefox browser and go to the following address: 192.168.72.11:7180. This will bring you to the following page:

Now you can log in with the following credentials:
- Username: admin
- Password: fibekarelia4

From there on, you can manage the different installed components and check for errors.

How to start Cloudera after restart

It's possible that the servers were down for some reason. In that case Cloudera will not restart automatically after the servers have finished starting up. You can start Cloudera using the following steps:

1. Start the Cloudera database using the following command as root user on the master node:
$ sudo service cloudera-scm-server-db start

2. Start the Cloudera Manager server using the following command as root user on the master node:
$ sudo service cloudera-scm-server start

3. Start the Cloudera Manager agents using the following command as root user on BOTH slave nodes:
$ sudo service cloudera-scm-agent start

After this you should be able to log in to the admin console and monitor the system's status.
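The three steps above could be wrapped in a small script so that the order is not forgotten. This is only a sketch: the function names cloudera_services_for and start_cloudera and the role names master and slave are assumptions made for this example. The service names themselves are the ones given in the steps above.

```shell
#!/bin/sh
# Print the Cloudera services to start on a node of the given role,
# one per line, in the order listed in this manual.
cloudera_services_for() {
    case "$1" in
        master) printf '%s\n' cloudera-scm-server-db cloudera-scm-server ;;
        slave)  printf '%s\n' cloudera-scm-agent ;;
        *)      echo "usage: cloudera_services_for master|slave" >&2
                return 1 ;;
    esac
}

# Start every service for the given role and stop at the first failure.
start_cloudera() {
    services=$(cloudera_services_for "$1") || return 1
    for svc in $services; do
        sudo service "$svc" start || return 1
    done
}

# Example: run "start_cloudera master" on the master node first,
# then "start_cloudera slave" on each of the two slave nodes.
```

Stopping at the first failure avoids starting the Cloudera Manager server when its database did not come up.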