To configure HSearch you need to install Hadoop, HBase, ZooKeeper, HSearch and Tomcat.

1. Add the machines' IP addresses to /etc/hosts so that all the servers can be reached by name (see the example /etc/hosts entries after this list).

2. Allow these servers to communicate through the machine firewall:

$ iptables -A INPUT -s master -j ACCEPT
$ iptables -A INPUT -s slave -j ACCEPT

Also open the required ports in the EC2 instance's security group so the instances can connect to each other; in this setup all ports were opened for the corresponding master and slave IPs.

3. Set up SSH so that the two machines can communicate with each other.
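A minimal /etc/hosts layout for this two-node setup might look like the following (the private IP addresses are illustrative assumptions; substitute your own):

10.0.0.1    master
10.0.0.2    slave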
For step 3, an SSH keypair with an empty password is generated using the following commands:

$ ssh-keygen -t rsa
$ cd ~/.ssh
$ cat id_rsa.pub >> authorized_keys
$ chmod 600 authorized_keys

Do the above steps on both the master and the slave. Now, for the master to communicate with the slave, the master's public key must be added to the slave's authorized_keys:

$ cat id_rsa.pub

Copy the contents of the file and append it to the authorized_keys file of the slave machine. Now the two machines can communicate with each other using the shared public RSA key.

From the master machine:

$ ssh slave
$ exit

From the slave machine:

$ ssh master
$ exit
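Instead of copying the key by hand, the append can be done in one command over SSH (a convenience sketch; it assumes password authentication is still enabled on the slave at this point):

$ cat ~/.ssh/id_rsa.pub | ssh slave "cat >> ~/.ssh/authorized_keys"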
4. Set up Java

$ mkdir /usr/java
$ cd /usr/java
$ wget http://download.oracle.com/otn-pub/java/jdk/6u39-b04/jdk-6u39-linux-x64.bin?authparam=1360821815_696261924ee18abf65bf0d3b2106256a
$ mv jdk-6u39-linux-x64.bin\?authparam\=1360821815_696261924ee18abf65bf0d3b2106256a jdk-6u39-linux-x64.bin
$ ./jdk-6u39-linux-x64.bin
$ rm -rf jdk-6u39-linux-x64.bin

Test that Java is running:

$ cd /usr/java/jdk1.6.0_39/bin
$ java -version

Once Java is set up, it's a good time to install Hadoop.

5. Setup Hadoop

$ cd /mnt
$ wget http://archive.cloudera.com/cdh4/cdh/4/hadoop-2.0.0-cdh4.5.0.tar.gz
$ gzip -d hadoop-2.0.0-cdh4.5.0.tar.gz
$ tar -xf hadoop-2.0.0-cdh4.5.0.tar
$ mv hadoop-2.0.0-cdh4.5.0 hadoop
$ cd hadoop/etc/hadoop
$ echo "" > excludes
$ mkdir -p /mnt/data/namenode /mnt/data/namenode/dfsname /mnt/data/namenode/dfsnameedit /mnt/data/datanode /mnt/logs

Export variables at the end of the Hadoop configuration file (hadoop-env.sh):

$ echo "export JAVA_HOME=/usr/java/jdk1.6.0_39" >> hadoop-env.sh
$ echo "export HADOOP_LOG_DIR=/mnt/logs" >> hadoop-env.sh
$ echo "master" > masters; cat masters
$ echo "slave" > slaves; cat slaves

Change the following line of the log4j.properties file:

hadoop.log.dir=/mnt/logs
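At this point the unpacked distribution can be sanity-checked (the full path is used because PATH is only extended later in these steps); it should report version 2.0.0-cdh4.5.0:

$ /mnt/hadoop/bin/hadoop version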
Edit the core-site.xml file:

<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master:54310</value>
    <description>The name of the default file system. A URI whose
    scheme and authority determine the FileSystem implementation. The
    URI's scheme determines the config property (fs.SCHEME.impl)
    naming the FileSystem implementation class. The URI's authority is
    used to determine the host, port, etc. for a filesystem.
    </description>
  </property>
  <property>
    <name>io.file.buffer.size</name>
    <value>16384</value>
    <description>Read/write buffer size used in SequenceFiles, in
    bytes. It should be a multiple of the hardware page size (4096
    bytes on Intel processors). On this small cluster we start small
    at 16 KB (16384) for continuous streaming; a typical value for a
    250 to 2000 node cluster is 32768-131072.
    </description>
  </property>
  <property>
    <name>io.seqfile.compress.blocksize</name>
    <value>4096</value>
    <description>The minimum block size for compression in
    block-compressed SequenceFiles. We compress a minimum 4 KB block,
    as this allows us to read less.
    </description>
  </property>
</configuration>
Edit the hdfs-site.xml file:

<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <property>
    <name>hadoop.home</name>
    <value>/mnt/hadoop</value>
  </property>
  <property>
    <name>metadata.dir</name>
    <value>/mnt/data/namenode</value>
    <description>Where the NameNode metadata should be stored.
    </description>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/mnt/data/datanode</value>
    <description>The locations at which the data is stored,
    e.g. /data/1,/data/2,/data/3.
    </description>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.blocksize</name>
    <value>33554432</value>
  </property>
  <property>
    <name>dfs.namenode.handler.count</name>
    <value>40</value>
  </property>
  <property>
    <name>dfs.datanode.handler.count</name>
    <value>40</value>
  </property>
  <property>
    <name>dfs.hosts.exclude</name>
    <value>/mnt/hadoop/etc/hadoop/excludes</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>${metadata.dir}/dfsname</value>
  </property>
  <property>
    <name>dfs.namenode.edits.dir</name>
    <value>${metadata.dir}/dfsnameedit</value>
  </property>
</configuration>
Edit mapred-site.xml:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>mapreduce.jobtracker.address</name>
    <value></value>
  </property>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx400m</value>
  </property>
</configuration>

Edit yarn-site.xml:

<?xml version="1.0"?>
<configuration>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>master:54311</value>
  </property>
  <property>
    <name>yarn.nodemanager.local-dirs</name>
    <value>/mnt/data/1/mapred/local/</value>
    <final>true</final>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce.shuffle</value>
  </property>
</configuration>
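Note: on stock Hadoop 2.0.x the mapreduce.shuffle auxiliary service usually also needs its handler class declared. This property is not in the original guide and is an assumption based on the upstream Hadoop 2.0 defaults; if MapReduce jobs later fail during the shuffle phase, adding it to yarn-site.xml may help:

  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>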
After configuring all the files, set the path for Hadoop and Java:

$ cd ~
$ vi .bash_aliases

Add the following lines to it:

export JAVA_HOME=/usr/java/jdk1.6.0_39
export HADOOP_PREFIX=/mnt/hadoop
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_PREFIX/bin

Now exit the terminal for the changes to take effect and then log in again.

Perform the same activities on the slave machine, excluding the creation of the masters and slaves files in the /mnt/hadoop/etc/hadoop directory, i.e. skip the following commands:

$ echo "master" > masters; cat masters
$ echo "slave" > slaves; cat slaves

We don't need these commands on the slave, since the slave machine will only run the DataNode.

Now Hadoop is set up and good to go.
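Instead of logging out and back in, the file can also be sourced in the current shell. Either way, a quick sanity check confirms the environment and configuration are being picked up (hdfs getconf is a standard HDFS utility; the expected value follows from the core-site.xml above):

$ source ~/.bash_aliases
$ hdfs getconf -confKey fs.defaultFS
hdfs://master:54310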
Now we need to format the NameNode, and then we can start Hadoop. (If formatting more than once, also delete the DataNode directories, since there will otherwise be a namespaceID conflict when the DataNodes connect to the newly formatted NameNode.)

$ hdfs namenode -format

If the format succeeds, the output ends with a message that the storage directory has been successfully formatted.

Start Hadoop:

$ start-dfs.sh

To check that the NameNode and DataNode started successfully on the master and slave, use the following command on each machine:

$ jps
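For this configuration (the masters file places the SecondaryNameNode on master, and the slave runs only the DataNode), the jps output should look roughly as follows (PIDs are illustrative):

Master:
$ jps
2045 NameNode
2311 SecondaryNameNode
2560 Jps

Slave:
$ jps
1887 DataNode
2010 Jps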
Test Hadoop:

$ cat > /tmp/a.txt
$ hdfs dfs -copyFromLocal /tmp/a.txt /
$ hdfs dfs -ls /
$ hdfs dfs -cat /a.txt
$ hdfs dfs -copyToLocal /a.txt /tmp
$ hdfs dfs -rm /a.txt
$ hdfs dfs -ls /

Stop Hadoop:

$ stop-dfs.sh

6. Setup ZooKeeper:

$ cd /mnt
$ wget http://archive.cloudera.com/cdh4/cdh/4/zookeeper-3.4.5-cdh4.5.0.tar.gz
$ gzip -d zookeeper-3.4.5-cdh4.5.0.tar.gz
$ tar -xf zookeeper-3.4.5-cdh4.5.0.tar
$ mv zookeeper-3.4.5-cdh4.5.0 zookeeper
$ cd zookeeper
$ rm -rf docs ivy.xml ivysettings.xml src contrib dist-maven; ls
$ cd conf
$ cp zoo_sample.cfg zoo.cfg
$ vi zoo.cfg
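The guide does not show the zoo.cfg contents; a minimal single-server sketch would change the following lines (the dataDir location and quorum ports are assumptions, and note that HBase is configured below to manage ZooKeeper itself):

dataDir=/mnt/data/zookeeper
clientPort=2181
server.1=master:2888:3888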
7. Setup HBase:

$ cd /mnt
$ wget http://archive.cloudera.com/cdh4/cdh/4/hbase-0.94.6-cdh4.5.0.tar.gz
$ gzip -d hbase-0.94.6-cdh4.5.0.tar.gz
$ tar -xf hbase-0.94.6-cdh4.5.0.tar
$ mv hbase-0.94.6-cdh4.5.0 hbase
$ cd hbase
$ rm -rf docs hbase-0.94.6-cdh4.5.0-tests.jar pom.xml src
$ cd /mnt/hbase/conf
$ echo "slave" > regionservers; cat regionservers

Edit the log4j.properties file:

hbase.log.dir=/mnt/logs

Export variables at the end of the HBase configuration file (hbase-env.sh):

$ echo "export JAVA_HOME=/usr/java/jdk1.6.0_39" >> hbase-env.sh
$ echo "export HBASE_CLASSPATH=/mnt/hbase/conf" >> hbase-env.sh
$ echo "export HBASE_MANAGES_ZK=true" >> hbase-env.sh
$ echo "export HBASE_HEAPSIZE=2048" >> hbase-env.sh
$ echo "export HBASE_OPTS=\"-server -XX:+UseParallelGC -XX:ParallelGCThreads=4 -XX:+AggressiveHeap -XX:+HeapDumpOnOutOfMemoryError\"" >> hbase-env.sh
$ echo "export HBASE_LOG_DIR=/mnt/logs" >> hbase-env.sh
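The guide does not show hbase-site.xml; for a distributed setup on top of the HDFS configured above, a minimal sketch would be along these lines (these are standard HBase 0.94 property names, but the values here are assumptions tied to the hostnames and ports used earlier):

<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://master:54310/hbase</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>master</value>
  </property>
</configuration>

With HBASE_MANAGES_ZK=true above, HBase starts the ZooKeeper quorum itself on the hosts named in hbase.zookeeper.quorum.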
8. Setup Tomcat:

$ cd /mnt
$ wget http://apache.mesi.com.ar/tomcat/tomcat-7/v7.0.47/bin/apache-tomcat-7.0.47.tar.gz
$ gzip -d apache-tomcat-7.0.47.tar.gz
$ tar -xf apache-tomcat-7.0.47.tar
$ mv apache-tomcat-7.0.47 tomcat

9. Setup HSearch:

1. Download the war file from Amazon S3:
   For CDH 4.5 (Tested): http://hsearch.war.s3.amazonaws.com/hsearch.war.cdh4.5.tar.gz
   For HDP 1.3 (Tested): http://hsearch.war.s3.amazonaws.com/hsearch.war.hdp1.3.tar.gz
2. Unzip and extract the war file.
3. Deploy the war file to your server and start your server.
4. Copy the hsearch and lucene jar files to the hadoop and hbase lib folders.
5. Restart hadoop and hbase.
6. Now open: http://yourserverurl:port/hsearch/
7. Fill in the setup page and create a new project.
8. Import data from the HDFS file location, then index and search.

HSearch query syntax, e.g.:

((sex:male AND subject:{usb101,usb102}) OR (dayrange:[-10 : 2]))

The following query types are possible:

Exact Match = sex:male
Not Match = sex:!male
In Match = organs:{Adrenal, glands, Liver}
Range Match = reading:[-10 : 20]

Note: If a value contains a comma, it should be passed within double quotes in queries.
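For example, an illustrative In Match query built from the note above, assuming "Adrenal, glands" is intended as a single value containing a comma:

organs:{"Adrenal, glands",Liver}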