Web Crawling and Data Mining with Apache Nutch Dr. Zakir Laliwala Abdulbasit Shaikh Chapter No. 3 "Integration of Apache Nutch with Apache Hadoop and Eclipse"
In this package, you will find:
A biography of the authors of the book
A preview chapter from the book, Chapter No. 3, "Integration of Apache Nutch with Apache Hadoop and Eclipse"
A synopsis of the book's content
Information on where to buy this book

About the Authors
Dr. Zakir Laliwala is an entrepreneur, an open source specialist, and a hands-on CTO at Attune Infocom. Attune Infocom provides enterprise open source solutions and services for SOA, BPM, ESB, Portal, cloud computing, and ECM. At Attune Infocom, he is responsible for product development and the delivery of solutions and services. He explores new enterprise open source technologies and defines architecture, roadmaps, and best practices. He has provided consultations and training to corporations around the world on various open source technologies such as Mule ESB, Activiti BPM, JBoss jBPM and Drools, Liferay Portal, Alfresco ECM, JBoss SOA, and cloud computing. He received a Ph.D. in Information and Communication Technology from Dhirubhai Ambani Institute of Information and Communication Technology (DA-IICT). He was an adjunct faculty at DA-IICT, and he taught Master's degree students at CEPT.
He has published many research papers on web services, SOA, grid computing, and the semantic web in IEEE publications, and has participated in ACM international conferences. He serves as a reviewer at various international conferences and journals. He has also published book chapters and written books on open source technologies. He was a co-author of the books Mule ESB Cookbook and Activiti5 Business Process Management Beginner's Guide, Packt Publishing.

Abdulbasit Shaikh has more than two years of experience in the IT industry. He completed his Master's degree at the Dhirubhai Ambani Institute of Information and Communication Technology (DA-IICT). He has extensive experience with open source technologies, having worked on a number of them, such as Apache Hadoop, Apache Solr, Apache ZooKeeper, Apache Mahout, Apache Nutch, and Liferay. He has provided training on Apache Nutch, Apache Hadoop, Apache Mahout, and AWS architecture. He is currently working on the OpenStack technology. He has also delivered projects and training on open source technologies. He has very good knowledge of cloud computing platforms such as AWS and Microsoft Azure, having successfully delivered many cloud computing projects. He is very enthusiastic and active when working on or delivering a project. Currently, he is working as a Java developer at Attune Infocom Pvt. Ltd. He is focused on open source technologies and is very interested in sharing his knowledge with the open source community.
Web Crawling and Data Mining with Apache Nutch
Apache Nutch is an open source web crawler used for crawling websites. It is extensible and scalable, and it provides parsing, indexing, and scoring filters for custom implementations. This book is designed to make you comfortable applying web crawling and data mining to your existing applications. It demonstrates real-world problems and provides solutions to them with appropriate use cases. The book walks through all the practical implementations hands-on, so readers can work through the examples on their own. It covers numerous practical implementations as well as different types of integrations.

What This Book Covers
Chapter 1, Getting Started with Apache Nutch, introduces Apache Nutch, including its installation, and guides you through crawling, parsing, and creating plugins with Apache Nutch. By the end of this chapter, you will be able to install Apache Nutch in your own environment, crawl and parse websites, and create a Nutch plugin.
Chapter 2, Deployment, Sharding, and AJAX Solr with Apache Nutch, covers the deployment of Apache Nutch on a particular server, that is, Apache Tomcat and Jetty. It also covers how sharding can be done with Apache Nutch using Apache Solr as a search tool. By the end of this chapter, you will be able to deploy Apache Solr on a server that contains the data crawled by Apache Nutch, perform sharding using Apache Nutch and Apache Solr, and integrate AJAX with your running Apache Solr instance.
Chapter 3, Integrating Apache Nutch with Apache Hadoop and Eclipse, covers the integration of Apache Nutch with Apache Hadoop, and also how we can integrate Apache Nutch with Eclipse.
By the end of this chapter, you will be able to set up Apache Nutch running on Apache Hadoop in your own environment, and also be able to perform crawling in Apache Nutch using Eclipse.
Chapter 4, Apache Nutch with Gora, Accumulo, and MySQL, covers the integration of Apache Nutch with Gora, Accumulo, and MySQL. By the end of this chapter, you will be able to integrate Apache Nutch with Apache Accumulo as well as with MySQL. After that, you can perform crawling using Apache Nutch on Apache Accumulo and also on MySQL, and get the results of your crawled pages on Accumulo as well as on MySQL. You can integrate Apache Solr too, as we have discussed before, and get your crawled pages indexed in Apache Solr.
Integration of Apache Nutch with Apache Hadoop and Eclipse
We discussed in Chapter 2, Deployment, Sharding, and AJAX Solr with Apache Nutch, how deployment takes place with Apache Solr and how we can apply sharding using Apache Solr. We also covered integrating AJAX with our running Apache Solr instance. In this chapter, we will see how we can integrate Apache Nutch with Apache Hadoop, and also how we can integrate Apache Nutch with Eclipse. Apache Hadoop is a framework for running applications in a cluster environment. Eclipse will be used as an Integrated Development Environment (IDE) for performing crawling operations with Apache Nutch. We will discuss this in detail in the coming sections. We will first start with the integration of Apache Nutch with Apache Hadoop, and then move on to the integration of Apache Nutch with Eclipse. So let's get started. In this chapter, we are going to cover the following topics:
Integrating Apache Nutch with Apache Hadoop
Introducing Apache Hadoop
Introducing Apache Nutch integration with Apache Hadoop
Installing Apache Hadoop and Apache Nutch
Setting up the deployment architecture of Apache Nutch
Configuring Apache Nutch with Eclipse
Introducing the Apache Nutch configuration with Eclipse
Installing and building Apache Nutch with Eclipse
Crawling in Eclipse
By the end of this chapter, you will be able to set up Apache Nutch on Apache Hadoop in your own environment. You will also be able to perform crawling in Apache Nutch by using Eclipse.

Integrating Apache Nutch with Apache Hadoop
In this section, we will see how Apache Nutch can be integrated with Apache Hadoop. We will start by introducing Apache Hadoop, then give a basic introduction to the integration of Apache Nutch with Apache Hadoop, and finally move on to the configuration of Apache Hadoop and Apache Nutch. We will also see how we can deploy Apache Nutch on multiple machines.

Introducing Apache Hadoop
Apache Hadoop is designed for running your application across many computers, where one computer is the master and the rest are slaves. In effect, it acts as a huge data warehouse. The master computer directs the slave computers, which perform the actual data processing. This is why Apache Hadoop is used for processing huge amounts of data: the work is divided among a number of slave computers, which is how Apache Hadoop achieves the highest throughput for any processing. So, as your data increases, you increase the number of slave computers. That is how Apache Hadoop works. Apache Nutch can be easily integrated with Apache Hadoop, and we can make crawling much faster than running Apache Nutch on a single machine. After integrating Apache Nutch with Apache Hadoop, we can perform crawling on the Apache Hadoop cluster environment, so the process will be much faster and we will get much higher throughput.
Installing Apache Hadoop and Apache Nutch
In this section, we will see how we can configure Apache Hadoop and Apache Nutch in our own environment. After installing these, you can perform the crawling operation in Apache Nutch, which will run on the Apache Hadoop cluster environment.

Downloading Apache Hadoop and Apache Nutch
Both Apache Nutch and Apache Hadoop can be downloaded from the Apache websites. You can check out the newest version of Nutch from source, or, as an alternative, pick up a stable release from the Apache Nutch website.

Setting up Apache Hadoop with the cluster
Setting up an Apache Hadoop cluster does not require purchasing huge hardware to run Apache Nutch and Apache Hadoop; Hadoop is designed to make maximum use of the hardware it has. We are going to use six computers to set up Apache Nutch with Apache Hadoop, each with Ubuntu 10.04 installed. The names of our computers are as follows:
reeshucluster01
reeshucluster02
reeshucluster03
reeshucluster04
reeshucluster05
reeshucluster06
We will use the reeshucluster01 computer as our master node. By master node, I mean that this machine will run the Hadoop services that coordinate with the slave nodes (the rest of the computers). In this case, all the remaining nodes, that is, reeshucluster02, reeshucluster03, and so on, will be the slave nodes. The master is also the machine on which we will perform our crawl. We have to set up Apache Hadoop on each of the above machines. The steps for setting up Apache Hadoop on them are described in the following sections.
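For the master and slaves to reach each other by name, each machine needs host entries for the whole cluster. The sketch below writes such a fragment; the 192.168.1.x addresses are assumptions, so substitute the real addresses of your own nodes:

```shell
# Hypothetical LAN addresses for the six machines named above;
# replace them with the real addresses of your own nodes.
cat > /tmp/hadoop-hosts <<'EOF'
192.168.1.101 reeshucluster01
192.168.1.102 reeshucluster02
192.168.1.103 reeshucluster03
192.168.1.104 reeshucluster04
192.168.1.105 reeshucluster05
192.168.1.106 reeshucluster06
EOF

# On each node, append the fragment to /etc/hosts (requires root):
# sudo sh -c 'cat /tmp/hadoop-hosts >> /etc/hosts'
```

With the entries in place, pinging reeshucluster02 from the master should resolve without DNS.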
Installing Java
Java is a programming language originally developed by Sun Microsystems. It is fast, secure, and reliable, and it is used everywhere: in laptops, data centers, game consoles, scientific supercomputers, cell phones, the Internet, and so on. My cluster configuration uses Java 6, which is why I have explained only Java 6 here. If your cluster has Java 7 installed, you need to adjust accordingly; you can refer to http://askubuntu.com/questions/56104/how-can-i-install-sun-oracles-proprietary-java-6-7-jre-or-jdk for installing Java 7. The steps for installing Java 6 are given as follows. You can always refer to http://www.oracle.com/technetwork/java/javase/install-linux-self-extracting-138783.html if you face any difficulty while installing:
1. Download the 32-bit or 64-bit Linux compressed binary file from http://www.oracle.com/technetwork/java/javase/downloads/index.html. It has a .bin file extension.
2. Give it permission to execute:
#chmod a+x jdk-6u<version>-linux-i586.bin
3. Execute the downloaded file, prepended by the path to it. For example, if the file is in the current directory, prepend it with ./ (necessary if . is not in the PATH environment variable):
#./jdk-6u<version>-linux-i586.bin
4. The binary code license is displayed, and you are prompted to agree to its terms.
5. During installation, it will ask you to register; press Enter. Firefox will open with the registration page. Registration is optional.
6. The JDK files are extracted into a directory called jdk1.6.0_<version> in the current directory, for example, ./jdk1.6.0_30.
7. Let's rename it:
#mv jdk1.6.0_30 java-6-oracle
8. Now, move the JDK 6 directory to /usr/lib/jvm:
#sudo mkdir /usr/lib/jvm
#sudo mv java-6-oracle /usr/lib/jvm
9. Delete the .bin file if you want to save disk space.
10. Finally, test the installation by using the following commands:
#java -version
#javac -version
The preceding commands should display the Oracle version installed, for example, 1.6.0_30. If they do, your JDK is successfully installed and your Java is running correctly.

Downloading Apache Hadoop
I have used Apache Hadoop 1.1.2 for this configuration. I have tried to cover everything correctly; still, you can always refer to http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/ for running Apache Hadoop on Ubuntu Linux as a single-node cluster, and to http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/ for running it as a multi-node cluster. The steps for downloading Apache Hadoop are given as follows:
1. Download the Apache Hadoop 1.1.2 distribution from http://www.apache.org/dyn/closer.cgi/hadoop/common/. Then extract it by going to the directory where your downloaded Apache Hadoop resides and firing the following command (if you're using Windows, you will have to use an archive program, such as WinZip or WinRAR, for extracting):
$ tar xzf hadoop-1.1.2.tar.gz
I have configured Apache Hadoop under a separate new user on my system, as that is good practice. I have created a new user and set permissions so that only this user can access Apache Hadoop. This is for security purposes, so that no other user, not even the root user, works with Apache Hadoop directly. You have to log in as this user to access Apache Hadoop.
2. The following command will add a new group called hadoop:
$ sudo addgroup hadoop
3. The following command will add a new user called hduser and add this user to the hadoop group:
$ sudo adduser --ingroup hadoop hduser

Configuring SSH
Apache Hadoop needs SSH, which stands for Secure Shell, to manage its nodes, that is, remote machines and your local machine. SSH is used to log into a remote or local system and perform the necessary operations on that particular machine. So, you need to configure it in your local environment as well as in your remote environment for running Apache Hadoop. The commands and steps for configuring SSH are given as follows:
1. The following command will be used for logging in as the hduser:
user@ubuntu:~$ su - hduser
It will ask for a password, if one is set. Enter the password and you will be logged in as that user.
2. The following command will be used for generating a key. This key will be used to provide authentication at the time of login. Make sure you are logged in as hduser before firing the following command:
hduser@ubuntu:~$ ssh-keygen -t rsa -P ""
3. You will get a result as follows:
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hduser/.ssh/id_rsa):
4. Press Enter and you will get a result as follows:
Created directory '/home/hduser/.ssh'.
Your identification has been saved in /home/hduser/.ssh/id_rsa.
Your public key has been saved in /home/hduser/.ssh/id_rsa.pub.
The key fingerprint is:
9b:82:ea:58:b4:e0:35:d7:ff:19:66:a6:ef:ae:0e:d2 hduser@ubuntu
The key's randomart image is:
The preceding screenshot shows the generated key image. The second command creates an RSA key pair with an empty password. Generally, using an empty password isn't recommended, but here it is needed so that Hadoop can log in to the nodes without prompting.
5. The following command will append the generated key to the authorized_keys file that you will find in $HOME/.ssh, where $HOME is your home directory (that is, /home/hduser). It is required for authentication at the time of SSH login:
hduser@ubuntu:~$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
6. The final step is to test the SSH setup by connecting to your local machine as hduser. The following command will be used for testing:
hduser@ubuntu:~$ ssh localhost
7. You will get an output as follows:
The authenticity of host 'localhost (::1)' can't be established.
RSA key fingerprint is d7:87:25:47:ae:02:00:eb:1d:75:4f:bb:44:f9:36:26.
Are you sure you want to continue connecting (yes/no)?
Just type yes and press Enter. You will get the following:
Warning: Permanently added 'localhost' (RSA) to the list of known hosts.
Linux ubuntu 2.6.32-22-generic #33-Ubuntu SMP Wed Apr 28 13:27:30 UTC 2010 i686 GNU/Linux
Ubuntu 10.04 LTS
hduser@ubuntu:~$
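On the full six-machine cluster, the same public key must also be authorized on every slave, or start-all.sh will prompt for passwords. A sketch using ssh-copy-id follows; it only prints the commands so it can run anywhere, and you would uncomment the real call on the cluster:

```shell
# Print (and optionally run) one ssh-copy-id per slave node so the
# master's public key lands in each slave's authorized_keys file.
slaves="reeshucluster02 reeshucluster03 reeshucluster04 reeshucluster05 reeshucluster06"
for host in $slaves; do
    echo "ssh-copy-id hduser@$host"
    # ssh-copy-id "hduser@$host"   # uncomment on the real cluster
done | tee /tmp/ssh-dist-commands
```

After distribution, `ssh reeshucluster02` from the master should log in without a password prompt.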
Disabling IPv6
Internet Protocol version 6 (IPv6) is the latest revision of the Internet Protocol (IP). It is the communication protocol that provides location addressing and identification for routers and computers on networks. Sometimes these addresses create problems when configuring Apache Hadoop, so it is better to disable IPv6. The configuration for disabling it is discussed next. Sometimes, using 0.0.0.0 in various networking-related Apache Hadoop configurations results in Apache Hadoop binding to the IPv6 address of the Ubuntu box. IPv6 is not required when you are not connected to an IPv6 network, so in that case you can disable it. To disable IPv6, open sysctl.conf (which you will find in /etc) and put the following configuration at the end of the file:
# disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
Reboot your machine for the change to take effect. To check whether IPv6 is enabled or not on your machine, the following command will be used:
$ cat /proc/sys/net/ipv6/conf/all/disable_ipv6
If it returns 0, IPv6 is enabled; if it returns 1, IPv6 is disabled, which is what we want. Another way of disabling IPv6 is by changing hadoop-env.sh, which you will find in <directory where your Apache Hadoop resides>/conf. Open hadoop-env.sh and put the following line into it:
export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true

Installing Apache Hadoop
To install Apache Hadoop, download it as discussed and extract it to your preferred directory. I have used the /usr/local directory, so I will use this directory in the installation steps; you should follow your own directory. The following commands will be used to change the owner of all the files to the hduser user and the hadoop group.
So, we will assign permissions to Apache Hadoop such that only hduser is able to access it and no other user. The following command will take you to the directory where Apache Hadoop resides:
$ cd /usr/local
The following command will extract Apache Hadoop:
$ sudo tar xzf hadoop-1.1.2.tar.gz
The following command will rename the extracted directory to hadoop:
$ sudo mv hadoop-1.1.2 hadoop
The following command will give hduser permission to access Apache Hadoop:
$ sudo chown -R hduser:hadoop hadoop
Open the .bashrc file by going to the home directory and typing the following command:
gedit ~/.bashrc
I have set HADOOP_HOME and JAVA_HOME to where my Apache Hadoop and Java reside; you need to set them to the locations where yours reside. This puts Apache Hadoop and Java on our path, which is required for this configuration. Put the following configuration at the end of the file:
export HADOOP_HOME=/usr/local/hadoop
export JAVA_HOME=/usr/lib/jvm/java-6-oracle
unalias fs &> /dev/null
alias fs="hadoop fs"
unalias hls &> /dev/null
alias hls="fs -ls"
export PATH=$PATH:$HADOOP_HOME/bin:$JAVA_HOME/bin

Required ownerships and permissions
In this section, we are going to cover the configuration of data files, the network ports on which Apache Hadoop listens, and so on. This setup will use the Hadoop Distributed File System (HDFS), even though there is only a single machine in our cluster. We need to create three nested directories called app, hadoop, and tmp, so the path will be /app/hadoop/tmp. Apache Hadoop will use the tmp directory for its operations. Hadoop's default configuration uses hadoop.tmp.dir, a property that you will find in core-site.xml (which resides in /usr/local/hadoop/conf), for both the local filesystem and HDFS. Therefore, you should not be surprised if, at some later point, you see Hadoop automatically creating the required directory on HDFS. The following command is used for creating this directory:
$ sudo mkdir -p /app/hadoop/tmp
The following command will allow the /app/hadoop/tmp directory to be accessed only by hduser:
$ sudo chown hduser:hadoop /app/hadoop/tmp
And if you want to tighten up security, change the permissions from 755 to 750 using the following command:
$ sudo chmod 750 /app/hadoop/tmp

The configuration required for HADOOP_HOME/conf/*
In this section, we will modify core-site.xml, hdfs-site.xml, and mapred-site.xml, which reside inside the conf directory that you will find in /usr/local/hadoop. This configuration is required for Apache Hadoop to perform its operations:
core-site.xml: The fs.default.name property is used by Apache Nutch to determine which filesystem it is attempting to use. Since we are working with the Apache Hadoop filesystem, we've pointed this to the Hadoop master, or name node. In this case it is hdfs://reeeshu:54311, which is the location of our master name node; you must put your own master node here accordingly. This provides information about our master node to Apache Hadoop. Put the following configuration into it:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://reeeshu:54311</value>
    <description>
      Where to find the Hadoop filesystem through the network.
      Note 54311 is not the default port. (This is slightly changed
      from previous versions, which didn't have "hdfs".)
    </description>
  </property>
</configuration>
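A stray character in one of these XML files will stop Hadoop from starting, so it is worth checking well-formedness after editing. The sketch below writes a minimal core-site.xml to a temporary path and validates it with Python's standard library; the file content mirrors the configuration above, trimmed to the essentials:

```shell
# Write a minimal core-site.xml and confirm it parses as XML before
# copying it into $HADOOP_HOME/conf.
cat > /tmp/core-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://reeeshu:54311</value>
  </property>
</configuration>
EOF

python3 -c "import xml.dom.minidom; xml.dom.minidom.parse('/tmp/core-site.xml')" \
  && echo "core-site.xml is well-formed"
```

The same one-liner works for hdfs-site.xml and mapred-site.xml; if the parse raises an error, fix the file before restarting Hadoop.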
hdfs-site.xml: This file tells Apache Hadoop how many replications to use, via the dfs.replication property. By default, the value is 1, which means there is no replication, that is, only a single machine in the cluster is used. In production it should usually be three or more; in fact, you should have at least that many operating nodes in your Hadoop cluster. The dfs.name.dir property sets the name node's storage directory, and the dfs.data.dir property sets the directory used by the data nodes. You must create these directories manually and give their proper paths. Put the following configuration into it:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/nutch/filesystem/name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/nutch/filesystem/data</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
The dfs.name.dir property is the directory used by the name node for storing and coordinating metadata about the data nodes. The dfs.data.dir property is the directory used by the data nodes for storing the actual filesystem data blocks; it is usually the same path on each node.
mapred-site.xml: The distributed filesystem has name nodes and data nodes. When a client wants to work with a file in the filesystem, it contacts the name node, which indicates which data node to contact to access the file. The name node is the organizer: it stores which blocks are on which computers and what must be replicated to other data nodes. The data nodes are simply the workhorses; they store the actual blocks, serve them up on request, and so on. If you are running a name node and a data node on the same PC, they will still communicate over sockets as if the data node were on a different PC. Put the following configuration into it:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>reeeshu:54310</value>
    <description>
      The host and port that the MapReduce job tracker runs at. If
      "local", then jobs are run in-process as a single map and reduce
      task. Note 54310 is not the default port.
    </description>
  </property>
  <property>
    <name>mapred.map.tasks</name>
    <value>2</value>
    <description>
      Define mapred.map.tasks to be the number of slave hosts.
    </description>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>2</value>
    <description>
      Define mapred.reduce.tasks to be the number of slave hosts.
    </description>
  </property>
  <property>
    <name>mapred.system.dir</name>
    <value>/nutch/filesystem/mapreduce/system</value>
    <description>
      Define the system directory. You have to manually create the
      directory and give its proper path in the value element.
    </description>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>/nutch/filesystem/mapreduce/local</value>
    <description>
      Define the local directory. You have to manually create the
      directory and give its proper path in the value element.
    </description>
  </property>
</configuration>
The mapred.system.dir property stores the name of the directory that the MapReduce tracker uses to store its data. This is only on the tracker, not on the MapReduce hosts.

Formatting the HDFS filesystem using the NameNode
HDFS is the storage layer used by Apache Hadoop, so it holds all the data related to Apache Hadoop. It has two components: the NameNode, which manages the filesystem's metadata, and the DataNode, which actually stores the data. It is highly configurable and well-suited for many installations; only for very large clusters does the configuration need to be tuned. The first step in getting Apache Hadoop started is formatting the Hadoop filesystem, which is implemented on top of the local filesystem of your cluster (which will include only your local machine if you have followed along). The HDFS data directory will be the directory that you specified in hdfs-site.xml with the dfs.data.dir property explained previously.
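The four /nutch/filesystem paths named in hdfs-site.xml and mapred-site.xml must exist and belong to hduser before the first format. The sketch below shows that step; the ROOT variable is ours, and the demo path just lets the commands run on any machine, so use ROOT=/nutch (with the chown) on the real cluster:

```shell
# Create the directories behind dfs.name.dir, dfs.data.dir,
# mapred.system.dir, and mapred.local.dir.
ROOT=/tmp/nutch-demo            # use ROOT=/nutch on the real cluster
mkdir -p "$ROOT/filesystem/name" \
         "$ROOT/filesystem/data" \
         "$ROOT/filesystem/mapreduce/system" \
         "$ROOT/filesystem/mapreduce/local"
# sudo chown -R hduser:hadoop "$ROOT"   # requires root on the real cluster
ls "$ROOT/filesystem"
```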
To format the filesystem, go to the directory where your Apache Hadoop resides using the terminal; in my case, it is /usr/local. Make sure that you are logged in as hduser before firing the following command:
hduser@ubuntu:~$ /usr/local/hadoop/bin/hadoop namenode -format
If all succeeds, you will get an output showing that your HDFS directory has been formatted successfully.
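Re-running namenode -format on a live cluster wipes the filesystem metadata, so a guard is prudent. The helper below (can_format is a hypothetical name, not part of Hadoop) only approves formatting while the name directory is still empty; the demo path lets it run anywhere, and on the real cluster you would point NAME_DIR at dfs.name.dir:

```shell
# Succeeds only while the name directory is empty, i.e. HDFS has
# never been formatted here. can_format is our own helper, not Hadoop's.
can_format() {
    [ -d "$1" ] && [ -z "$(ls -A "$1" 2>/dev/null)" ]
}

NAME_DIR=/tmp/demo-name-dir     # /nutch/filesystem/name on the real cluster
mkdir -p "$NAME_DIR"
if can_format "$NAME_DIR"; then
    echo "safe to run: hadoop namenode -format"
else
    echo "already formatted -- refusing to format again"
fi
```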
Starting your single-node cluster
Now we are done with the setup of the single-node Apache Hadoop cluster. It's time to start Apache Hadoop and check whether it is running properly. Run the following command to start your single-node cluster. Make sure you are logged in as hduser before firing it:
hduser@ubuntu:~$ /usr/local/hadoop/bin/start-all.sh
If all succeeds, you will get an output listing the components one by one as they start. Once started, you need to check whether all the components are running perfectly. For that, run the following command:
hduser@ubuntu:/usr/local/hadoop$ jps
If all succeeds, you will get an output showing the components running in Apache Hadoop. You can always refer to http://docs.hortonworks.com/hdpdocuments/HDP1/HDP-Win-1.3.0/bk_getting-started-guide/content/ch_hdp1_getting_started_chp2_1.html for any additional information:
JobTracker: This component keeps track of the jobs running in Apache Hadoop and divides each job into a number of tasks that are performed by TaskTrackers.
TaskTracker: This performs the tasks given by the JobTracker. Each TaskTracker has multiple tasks to perform, and once it completes a particular task, it informs the JobTracker. That is how the JobTracker learns that tasks are being performed in the desired manner.
NameNode: This keeps track of the directories created inside HDFS, and you can navigate those directories through it. The NameNode itself does not store file data; it manages the filesystem metadata and directs clients to the DataNodes, which hold the actual blocks.
SecondaryNameNode: This performs periodic checkpoints of the NameNode's metadata. It is not a hot standby, but its checkpoints can be used to recover the metadata if the NameNode fails.
DataNode: This is the component that actually stores the data blocks, as directed by the NameNode. So, the responsibility of the DataNode is to store the file data of Apache Hadoop.
jps: This is not an Apache Hadoop component. It's a command that has been part of Sun Java since v1.5.0.
Just check in your browser by hitting the URL http://localhost:50070, where 50070 is the port on which the NameNode's web interface runs.

Stopping your single-node cluster
If you want to stop your running cluster, fire the following command. Make sure you are logged in as hduser before firing it:
hduser@ubuntu:~$ /usr/local/hadoop/bin/stop-all.sh
If all succeeds, you will get an output listing the Apache Hadoop components as they are stopped. That's all for the installation of Apache Hadoop on a single machine. Now, we will move on to setting up the deployment architecture of Apache Nutch.
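The five-daemon checklist above lends itself to a script. The function below greps a jps listing for each expected daemon; it is fed a sample string here so it runs anywhere, and on the cluster you would pass it the real output, for example check_daemons "$(jps)":

```shell
# Verify that all five Hadoop daemons appear in a jps-style listing.
check_daemons() {
    for d in NameNode DataNode SecondaryNameNode JobTracker TaskTracker; do
        echo "$1" | grep -q "$d" || { echo "missing: $d"; return 1; }
    done
    echo "all daemons running"
}

# Sample jps output so the check can run without a live cluster.
sample="1234 NameNode
2345 DataNode
3456 SecondaryNameNode
4567 JobTracker
5678 TaskTracker
6789 Jps"
check_daemons "$sample"   # prints: all daemons running
```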
Setting up the deployment architecture of Apache Nutch
We have to set up Apache Nutch on each of the machines that we are using; in this case, a six-machine cluster. If there is a small number of machines in our cluster, we can set it up manually on each machine. But when there are many machines, say 100 machines in our cluster environment, we can't set it up on each machine manually. For that, we require a deployment tool such as Chef, or at least distributed SSH. You can refer to http://www.opscode.com/chef/ to get familiar with Chef, and to http://www.ibm.com/developerworks/aix/library/au-satdistadmin/ to get familiar with distributed SSH. I will just demonstrate running Apache Hadoop on Ubuntu as a single-node cluster; if you want to run it as a multi-node cluster, I have already provided the reference link, and you can configure it from there. Once we are done with deploying Apache Nutch to a single machine, we will run the start-all.sh script, which will start the services on the master node and the data nodes. This means that the script will start the Hadoop daemons on the master node, then log in to each slave node using SSH, as explained, and start the daemons there. The start-all.sh script expects Apache Nutch to be placed at the same location on each machine, and it also expects Apache Hadoop to store its data at the same file path on each machine. The start-all.sh script, which starts the daemons on the master and slave nodes, uses password-less login over SSH.

Installing Apache Nutch
Download Apache Nutch from http://nutch.apache.org/downloads.html and extract the contents of the Apache Nutch package to your preferred location. I picked the /usr/local directory for Apache Nutch.
You need to assign permissions on Apache Nutch so that only hduser can access it. This is for security purposes. The commands that will be used are given as follows:

The following command will take you to the local directory:

$ cd /usr/local

The following command will be used for extracting the contents of Apache Nutch:

$ sudo tar xzf apache-nutch-1.4-bin.tar.gz
The following command will rename the extracted directory (apache-nutch-1.4) to nutch:

$ sudo mv apache-nutch-1.4 nutch

The following command will assign ownership of the nutch directory to hduser so that it can be accessed only by hduser:

$ sudo chown -R hduser:hadoop nutch

Now, we need to modify the bashrc file. To open this file, go to your home directory from your terminal. Then, hit the following command:

gedit ~/.bashrc

Put the following configuration at the end of the file. It will set your classpath for Apache Nutch:

export NUTCH_HOME=/usr/local/nutch
export PATH=$PATH:$NUTCH_HOME/bin

Modify nutch-site.xml, which you will find in $NUTCH_HOME/conf, by inserting the following content:

<property>
<name>plugin.folders</name>
<value>/usr/local/nutch/build/plugins</value>
</property>

Search for the http.agent.name key in nutch-site.xml, and set its value to the crawler name. Copy hadoop-env.sh, core-site.xml, hdfs-site.xml, mapred-site.xml, masters, and slaves from $HADOOP_HOME/conf to $NUTCH_HOME/conf by hitting the following commands:

$ cd $HADOOP_HOME/conf
$ cp hadoop-env.sh core-site.xml hdfs-site.xml mapred-site.xml masters slaves $NUTCH_HOME/conf/

Copy the conf directory from $NUTCH_HOME to $NUTCH_HOME/runtime/local/conf by hitting the following commands:

$ cd $NUTCH_HOME
$ cp conf/* runtime/local/conf/
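A mis-copied configuration file here only surfaces much later as a confusing runtime error, so it is worth verifying the copy right away. The helper below is a hypothetical sketch (not part of Nutch or Hadoop) that simply confirms each expected file exists in a directory:

```shell
# Verify that every expected Hadoop conf file made it into a directory.
# Prints the first missing file and fails, or reports success.
check_conf() {
  dir=$1; shift
  for f in "$@"; do
    [ -e "$dir/$f" ] || { echo "missing: $dir/$f"; return 1; }
  done
  echo "all conf files present in $dir"
}

# Typical use after the copy steps above:
#   check_conf "$NUTCH_HOME/conf" hadoop-env.sh core-site.xml hdfs-site.xml \
#       mapred-site.xml masters slaves
```

Running it against both $NUTCH_HOME/conf and $NUTCH_HOME/runtime/local/conf catches a forgotten second copy step.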
Key points of the Apache Nutch installation

We need to rebuild Apache Nutch by using the ant command. Otherwise, it will fail with a fatal error because http.agent.name does not take effect even though we have edited the nutch-site.xml file. We also need to set the classpath in hadoop-env.sh, which you will find in $HADOOP_HOME/conf, by putting the following configuration into it:

export HADOOP_CLASSPATH=/usr/local/nutch/runtime/lib/*:/usr/local/nutch/runtime/deploy/apache-nutch-2.2.1.job

After this, go to the $NUTCH_HOME directory and type the following command to rebuild Apache Nutch:

/usr/local/nutch/$ ant

Once the rebuild is finished, copy nutch-2.2.1.job and nutch-2.2.1.jar to the deploy and local directories of Apache Nutch respectively, by typing the following commands:

$ cp $NUTCH_HOME/build/nutch-2.2.1.job $NUTCH_HOME/runtime/deploy
$ cp $NUTCH_HOME/build/nutch-2.2.1.jar $NUTCH_HOME/runtime/local/lib/

We have successfully set up Apache Nutch on the Apache Hadoop cluster. Now, we can start Apache Hadoop and perform Apache Nutch tasks on it.

Starting the cluster

We've configured Apache Hadoop. It's time to start up Apache Hadoop on a single node and check whether it's working properly or not. To start up all of the Hadoop components on the local machine (NameNode, DataNode, TaskTracker, JobTracker, and SecondaryNameNode), the following command will be used:

$HADOOP_HOME/bin/start-all.sh

If you want to stop all the components, the following command should be used:

$HADOOP_HOME/bin/stop-all.sh
Performing crawling on the Apache Hadoop cluster

We are going to perform crawling in Apache Nutch on the Apache Hadoop cluster. It will perform crawling on the Apache Hadoop cluster and give us the result. The steps for performing this job are given as follows:

1. Create the seed.txt file by hitting the following commands. This file will contain the list of URLs to be crawled on the Apache Hadoop cluster:

$ cd $NUTCH_HOME/runtime
$ mkdir urls
$ vi urls/seed.txt

2. The preceding command will create the seed.txt file and open it up. Put the following content into it:

http://nutch.apache.org
http://apache.org

The preceding URLs are used for crawling. You can enter any number of URLs, but only one URL per line.

3. Now we need to add the urls directory to HDFS, as Apache Hadoop will use this directory for crawling. The following command will be used for adding it. Make sure you are logged in as hduser before hitting the following command:

$HADOOP_HOME/bin/hadoop dfs -put /usr/local/nutch/runtime/local/urls urls

4. To check whether it was put correctly, type the following command. It will list all the directories that are inside the given directory:

$ bin/hadoop dfs -ls

5. Modify regex-urlfilter.txt, which sets the filter so that only the web pages under *.apache.org are crawled. So, any web URL that ends with apache.org will be crawled. Fire the following commands to perform this:

$ cd $NUTCH_HOME/runtime
$ vi conf/regex-urlfilter.txt

6. The preceding command will open up regex-urlfilter.txt. Replace the line +^http://([a-z0-9]*\.)*my.domain.name/ with +^http://([a-z0-9]*\.)*apache.org/.
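You can try the effect of that rule outside Nutch with grep -E, as a rough approximation: Nutch's leading + only marks the rule as an accept rule, so it is dropped here, and the dot in apache.org is escaped for strictness. This is an illustrative sketch, not part of Nutch itself:

```shell
# The accept pattern from regex-urlfilter.txt, minus the leading '+'.
pattern='^http://([a-z0-9]*\.)*apache\.org/'

# Succeeds if the URL would pass the filter.
keeps_url() {
  echo "$1" | grep -Eq "$pattern"
}

for url in http://nutch.apache.org/ http://apache.org/ http://example.com/; do
  if keeps_url "$url"; then echo "kept:    $url"; else echo "dropped: $url"; fi
done
```

Here http://nutch.apache.org/ and http://apache.org/ pass, while http://example.com/ is dropped, which matches the intent of crawling only *.apache.org.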
We have added our urls directory to the Hadoop distributed filesystem, and we have also edited our regex-urlfilter.txt. Now, it's time to start crawling. To begin Apache Nutch crawling, first copy your apache-nutch-2.2.1.job file (you will find it in $NUTCH_HOME/build) to $HADOOP_HOME. Then use the following command to perform crawling. Make sure that you are logged in as hduser before hitting the following commands:

The following command will take you to the HADOOP_HOME directory:

$ cd $HADOOP_HOME

The following command will actually perform crawling:

$ hadoop jar apache-nutch-${version}.job org.apache.nutch.crawl.Crawl urls -dir crawl -depth 3 -topN 5

hadoop: This is the command for running Hadoop.

jar: This is the command for running a JAR file.

apache-nutch-${version}.job: This is the Apache Nutch job file, which is used for crawling.

org.apache.nutch.crawl.Crawl: This is the class that drives the whole crawl.

urls: This is the directory that Apache Hadoop will use for fetching the URLs to be crawled. This is the directory that we put into HDFS.

-dir: This option defines the output directory.

crawl: This is the actual directory where Apache Hadoop keeps the output of crawling.

-depth: This option sets the number of iterations that Apache Nutch will take to crawl. You can give any integer value to it.

-topN: This option defines the number of top-scoring URLs to be fetched in each iteration. You can give any positive integer value to it.
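The pieces above can be assembled into a small wrapper so that depth and topN become parameters. This is a hypothetical sketch: the job file name apache-nutch-1.4.job and the function name are assumptions, and the function only prints the command so you can review it before launching a real job on the cluster:

```shell
# Name of the Nutch job file copied into $HADOOP_HOME (an assumption;
# substitute your actual version).
NUTCH_JOB=apache-nutch-1.4.job

# Build the crawl invocation from its parts: seed dir, output dir,
# number of iterations, and top-N URLs per iteration.
crawl_cmd() {
  seeds=$1 out=$2 depth=$3 topn=$4
  echo hadoop jar "$NUTCH_JOB" org.apache.nutch.crawl.Crawl \
    "$seeds" -dir "$out" -depth "$depth" -topN "$topn"
}

# Print the command for review; append `| sh` to actually run it.
crawl_cmd urls crawl 3 5
```

Printing first and piping to sh only when satisfied is a cheap safeguard against launching a long MapReduce job with a mistyped argument.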
So, there are many arguments that you can apply according to your needs. If all succeeds, you will get the following output:

The preceding screenshot shows that Apache Hadoop crawled URLs for you. It shows the last three lines of the output.

You can also keep track of your crawling from the browser by opening the JobTracker component of Apache Hadoop. It will show you the statistics of the running jobs, and also the number of tasks per job. Hit the following URL to check this:

http://localhost:50030/jobtracker.jsp

If all succeeds, you will get an output as follows:

The preceding screenshot shows you the number of jobs and their statuses for the crawling that you have performed. You can also check the detailed statistics of your tasks per job by opening the TaskTracker component of Apache Hadoop. Hit the following URL to check this:

http://localhost:50060/tasktracker.jsp
If all succeeds, you will get an output as follows:

The preceding screenshot shows the task tracker, which keeps track of the number of tasks for a particular job.

You can take a dump of the crawled URLs by copying the directory from HDFS that contains the data about all the crawled URLs. Copy that directory and paste it to your preferred location for backup.

So, now we have completed the integration of Apache Nutch with Apache Hadoop. We have successfully crawled URLs in Apache Nutch on the Apache Hadoop cluster. We will move on to Apache Nutch configuration with Eclipse in the next section.

Configuring Apache Nutch with Eclipse

In this section, we will see how we can integrate Apache Nutch with Eclipse. Just as we performed crawling in Apache Nutch using the command line, we can perform crawling using the Java API.
Introducing Apache Nutch configuration with Eclipse

Apache Nutch can be easily configured with Eclipse, and after that, we can perform crawling easily from Eclipse. So, we no longer need to perform crawling from the command line; we can use Eclipse for all the crawling operations that we were doing from the command line. Instructions are provided here for setting up a development environment for Apache Nutch with the Eclipse IDE. This is meant to be a comprehensive starting resource for configuring, building, crawling, and debugging Apache Nutch.

The prerequisites for Apache Nutch integration with Eclipse are given as follows:

Get the latest version of Eclipse from http://www.eclipse.org/downloads/packages/release/juno/r. All the required components are available from the Eclipse Marketplace. You can download the Marketplace client from this link: http://marketplace.eclipse.org/marketplace-client-intro

Once you've configured Eclipse, download the Subclipse plugin from http://subclipse.tigris.org/. If you face a problem with the 1.8.x release, try 1.6.x; this may resolve compatibility issues.

Download the IvyDE plugin for Eclipse from the following link: http://ant.apache.org/ivy/ivyde/download.cgi

Download the m2e plugin for Eclipse from the following link: http://marketplace.eclipse.org/content/maven-integration-eclipse

Installing and building Apache Nutch with Eclipse

Here, we will define the installation steps for configuring Apache Nutch with Eclipse. The steps for configuring Apache Nutch with Eclipse are given as follows:
1. Get the latest source code of Apache Nutch by using SVN, which is a Subversion repository. Go to your terminal and fire the following commands:

For Apache Nutch 1.x (trunk), the following commands will be used:

$ svn co https://svn.apache.org/repos/asf/nutch/trunk
$ cd trunk

For Apache Nutch 2.x, the following commands will be used:

$ svn co https://svn.apache.org/repos/asf/nutch/branches/2.x
$ cd 2.x

2. You need to decide which data store you are going to use for this integration. See the Apache Gora documentation at http://gora.apache.org/ for more information. Some of the choices of storage classes are given as follows:

org.apache.gora.hbase.store.HBaseStore
org.apache.gora.cassandra.store.CassandraStore
org.apache.gora.accumulo.store.AccumuloStore
org.apache.gora.avro.store.AvroStore
org.apache.gora.avro.store.DataFileAvroStore

3. Modify nutch-site.xml, which you will find in <Respected directory where your Apache Nutch resides>/conf, by putting the following configuration into it. I am taking HBase as the datastore here:

<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.hbase.store.HBaseStore</value>
<description>Default class for storing data</description>
</property>

4. Modify ivy.xml, which you will find in <Respected directory where your Apache Nutch resides>/ivy, by uncommenting the following line if you have taken HBase as the data store. Otherwise, you have to search for your datastore's dependency and uncomment it accordingly if it is commented out:

<dependency org="org.apache.gora" name="gora-hbase" rev="0.3" conf="*->default" />
5. Modify gora.properties, which you will find in <Respected directory where your Apache Nutch resides>/conf, by putting in the following line if you have taken HBase as your datastore. Otherwise, you have to find the corresponding line for your datastore and put it in accordingly:

gora.datastore.default=org.apache.gora.hbase.store.HBaseStore

6. Modify nutch-site.xml by putting http.agent.name (as explained before) and http.robots.agents (same as http.agent.name) into it. Also, set the plugin.folders property by putting the following configuration into it:

<property>
<name>plugin.folders</name>
<value>/home/tejas/Desktop/2.x/build/plugins</value>
</property>

The value of plugin.folders would be <Respected directory where your Apache Nutch resides>/build/plugins.

7. Run the following commands to build Apache Nutch and generate the Eclipse project files:

$ cd <Respected directory where your Apache Nutch resides>
$ ant eclipse

We have configured this successfully. Now, we shall move on to the Crawling in Eclipse section.

Crawling in Eclipse

In this section, we will see how we can import Apache Nutch into Eclipse and perform crawling. The steps are given as follows:

1. Open your Eclipse IDE.
2. In Eclipse, navigate to File | Import.
3. Select Existing Projects into Workspace as shown in the following screenshot:
4. In the next window, set the root directory to the place where you have checked out Apache Nutch 2.1.1, and then click on Finish.

5. You will now see a new project named 2.1.1 being added to the workspace. Wait for some time until Eclipse refreshes its SVN cache and builds its workspace. You will see the status in the bottom corner of the Eclipse window, as shown in the next screenshot:
6. In the Package Explorer, right-click on the project 2.1.1 and navigate to Build Path | Configure Build Path as shown below:

7. In the Order and Export tab, scroll down and choose 2.x/conf. Click on the Top button. Unfortunately, Eclipse will rebuild the workspace once more, but this time it won't take much time.

How to create an Eclipse launcher

Let's start with the inject operation. The steps for this are given as follows:

1. In the Package Explorer, right-click on the project and navigate to Run As | Run Configurations. Make a new configuration and name it inject, as shown below:

For Apache Nutch version 1.x: Set the main class as org.apache.nutch.crawl.Injector
For Apache Nutch version 2.x: Set the main class as org.apache.nutch.crawl.InjectorJob
2. In the Arguments tab, for program arguments, give the path of the input directory that holds the seed URLs. Set VM arguments to -Dhadoop.log.dir=logs -Dhadoop.log.file=hadoop.log, as shown in the following screenshot:
3. Click on Apply and then click on Run. If everything was done perfectly, you will see the inject operation progressing on the console, as shown in the following screenshot:

If you want to find out the Java class related to any command, just look inside the src/bin/nutch script; at the bottom, you will find a switch-case block with a case corresponding to each command. The important classes corresponding to the crawl cycle are given as follows:

Operation   Class in Nutch 1.x (that is, trunk)    Class in Nutch 2.x
inject      org.apache.nutch.crawl.Injector        org.apache.nutch.crawl.InjectorJob
generate    org.apache.nutch.crawl.Generator       org.apache.nutch.crawl.GeneratorJob
fetch       org.apache.nutch.fetcher.Fetcher       org.apache.nutch.fetcher.FetcherJob
parse       org.apache.nutch.parse.ParseSegment    org.apache.nutch.parse.ParserJob
updatedb    org.apache.nutch.crawl.CrawlDb         org.apache.nutch.crawl.DbUpdaterJob

In the same way, you can perform all of the jobs listed in the preceding table: take the respective Java class of a particular job and run it within Eclipse. That's how crawling is done in Apache Nutch using Eclipse, and with that, we have successfully integrated Apache Nutch with Eclipse. That's the end of this chapter. Let's go to the Summary section now and revise what you have learned from this chapter.
Summary

We started with the integration of Apache Nutch with Apache Hadoop. There, we covered an introduction to Apache Hadoop, what we mean by integrating Apache Nutch with Apache Hadoop, and what the benefits of that are. Then, we moved on to the configuration steps and configured Apache Nutch with Apache Hadoop successfully. We also performed a crawling job by installing Apache Nutch on a machine, and confirmed from the output that the Apache Hadoop cluster was running properly and performing the crawling job correctly. Then, we started with the integration of Apache Nutch with Eclipse. We had a short introduction to what integrating Apache Nutch with Eclipse means, and then looked at the configuration of Apache Nutch with Eclipse. We successfully integrated Apache Nutch with Eclipse and performed one InjectorJob example. I hope you have enjoyed reading this chapter. Now, let's see how we can integrate Apache Nutch with Gora, Accumulo, and MySQL in the next chapter.
Where to buy this book You can buy Web Crawling and Data Mining with Apache Nutch from the Packt Publishing website: Free shipping to the US, UK, Europe and selected Asian countries. For more information, please read our shipping policy. Alternatively, you can buy the book from Amazon, BN.com, Computer Manuals and most internet book retailers. www.packtpub.com