Install Hadoop on Ubuntu and run as standalone


Welcome. This document is a record of my Hadoop installation, kept for study purposes.

Version history

- (Dec) Initialize Hadoop study: install the basic environment, run the first word count program.
- (May) Expand Hadoop to master + slave (1 master + 2 slave nodes) on RHEL; run word count on a bigger file and look for improvement; study Hadoop-based projects (Hive, Pig, HBase); install HBase on Hadoop in fully distributed mode.

Install Hadoop on Ubuntu and run as standalone

Hadoop version:
OS: Ubuntu 32-bit
hduser@ubuntu:~$ cat /etc/issue
Ubuntu \n \l
Host: VMware Workstation on Windows 8

Please be advised that this is a draft and fits only my own circumstances.

References: developer list

The following things need to be verified:

1. Job tracker interface can't show -- Fixed
2. Where is the job tracker? The job tracker has been replaced by YARN (next-generation MapReduce). Use localhost:8088 instead of the old JobTracker port.

Web Interfaces

Once the Hadoop cluster is up and running, check the web UI of the components as described below:
- NameNode
- ResourceManager
- MapReduce JobHistory Server

Apache Hadoop NextGen MapReduce (YARN): MapReduce has undergone a complete overhaul in hadoop-0.23, and we now have what we call MapReduce 2.0 (MRv2), or YARN. So when checking with jps, there is no longer a JobTracker process.

Web UI: In MR1, the JobTracker web UI served detailed information about the state of the cluster and the jobs currently and recently running on it. It also contained the job history page, which served information from disk about older jobs.

The MR2 web UI provides the same information structured in the same way, but has been revamped with a new look and feel. The ResourceManager UI, which includes information about running applications and the state of the cluster, is now located by default on port 8088. The job history UI is now served by the JobHistoryServer on its own default port. Jobs can be searched and viewed there just as they could in MR1. Because the ResourceManager is meant to be agnostic to many of the concepts in MapReduce, it cannot host job information directly. Instead, it proxies to a web UI that can. If the job is running, this is the relevant MapReduce Application Master; if it has completed, this is the JobHistoryServer. In this sense, the user experience is similar to that of MR1, but the information is coming from different places.

Some concepts

What Is Apache Hadoop? The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing. The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.

The project includes these modules:
- Hadoop Common: The common utilities that support the other Hadoop modules.
- Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data.
- Hadoop YARN: A framework for job scheduling and cluster resource management.
- Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.

Other Hadoop-related projects at Apache include:
- Ambari: A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, including support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig and Sqoop. Ambari also provides a dashboard for viewing cluster health, such as heatmaps, and the ability to view MapReduce, Pig and Hive applications visually, along with features to diagnose their performance characteristics in a user-friendly manner.
- Avro: A data serialization system.
- Cassandra: A scalable multi-master database with no single points of failure.
- Chukwa: A data collection system for managing large distributed systems.
- HBase: A scalable, distributed database that supports structured data storage for large tables.
- Hive: A data warehouse infrastructure that provides data summarization and ad hoc querying.
- Mahout: A scalable machine learning and data mining library.
- Pig: A high-level data-flow language and execution framework for parallel computation.

- Spark: A fast and general compute engine for Hadoop data. Spark provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation.
- ZooKeeper: A high-performance coordination service for distributed applications.

Hadoop follows the idea of map and reduce jobs. It has a distributed file system (DFS) to support distribution.

Important ports
1. Job Tracker: 50030 (no longer there in Hadoop 2.x?)
2. HDFS communication: 9000
3. MapReduce communication: 9001

Management
1. HDFS web interface
2. MapReduce interface

Install Ubuntu

You may like to install it from a flash disk (on a new tower).

Install required packages

For example, the JRE/JDK:

$ sudo apt-get update
$ sudo apt-get install sun-java6-jdk

Add a user and user group

$ sudo addgroup hadoop

$ sudo adduser --ingroup hadoop hduser

This will add the user hduser and the group hadoop to your local machine.

Configuring SSH

su - hduser
hduser@ubuntu:~$ ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hduser/.ssh/id_rsa):
Created directory '/home/hduser/.ssh'.
Your identification has been saved in /home/hduser/.ssh/id_rsa.
Your public key has been saved in /home/hduser/.ssh/id_rsa.pub.
The key fingerprint is:
9b:82:ea:58:b4:e0:35:d7:ff:19:66:a6:ef:ae:0e:d2 hduser@ubuntu
The key's randomart image is:
[...snipp...]
hduser@ubuntu:~$

If successful, you can ssh to your host without entering credentials (the public key also needs to end up in authorized_keys; see the sketch at the end of this section).

Disable IPv6 (not proven whether this is necessary). Add the following (typically in /etc/sysctl.conf):

# disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1

You have to reboot your machine in order to make the changes take effect. You can check whether IPv6 is enabled on your machine with the following command (a value of 1 means it is disabled):

$ cat /proc/sys/net/ipv6/conf/all/disable_ipv6
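To complete the passwordless SSH login mentioned above, the generated public key has to be appended to the hduser account's authorized_keys file. A minimal sketch, assuming the default key paths used above:

hduser@ubuntu:~$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
hduser@ubuntu:~$ chmod 600 ~/.ssh/authorized_keys
hduser@ubuntu:~$ ssh localhost    # should log in without asking for a password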

Install Hadoop files

$ cd /usr/local
$ sudo tar xzf hadoop-<version>.tar.gz
$ sudo mv hadoop-<version> hadoop
$ sudo chown -R hduser:hadoop hadoop

You may also want to create a directory in which to place the DFS data files; in my case it is /hadoopfs. If necessary, grant ownership to hduser:hadoop.

Update environment variables

vim /home/hduser/.bashrc

export HADOOP_HOME=/usr/local/hadoop

unalias fs &> /dev/null
alias fs="hadoop fs"
unalias hls &> /dev/null
alias hls="fs -ls"

lzohead () {
    hadoop fs -cat $1 | lzop -dc | head | less
}

export PATH=$PATH:$HADOOP_HOME/bin
export JAVA_HOME=/usr/lib/jvm/default-java
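The new variables only take effect in new shells; to apply and verify them in the current session, something like the following works (a small sketch, assuming the paths above):

hduser@ubuntu:~$ source ~/.bashrc
hduser@ubuntu:~$ echo $HADOOP_HOME
/usr/local/hadoop
hduser@ubuntu:~$ hadoop version    # should print the installed Hadoop version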

Configure the site

Modify core-site.xml:

pwd
/usr/local/hadoop/etc/hadoop

Add the following lines:

<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/hadoopfs/tmp</value>
    <description>A base for other temporary directories.</description>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
    <description>The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The uri's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class. The uri's authority is used to determine the host, port, etc. for a filesystem.</description>
    <final>true</final>
  </property>
</configuration>

Configure DFS

Modify hdfs-site.xml. This file provides the settings for DFS.

hduser@ubuntu:/usr/local/hadoop/etc/hadoop$ cat hdfs-site.xml

<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
    <description>Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified at create time.</description>
  </property>
  <property>
    <name>dfs.permissions</name>

    <value>false</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/hadoopfs/dfs/name</value>
    <final>true</final>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/hadoopfs/dfs/data</value>
    <final>true</final>
  </property>
</configuration>

Configure MapReduce

cat mapred-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapreduce.jobtracker.system.dir</name>
    <value>file:/hadoopfs/dfs/system</value>
    <final>true</final>
  </property>
  <property>
    <name>mapreduce.jobtracker.address</name>
    <value>localhost:9001</value>
    <description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task.</description>
  </property>

  <property>
    <name>mapreduce.jobtracker.http.address</name>
    <value>localhost:50030</value>
  </property>
  <property>
    <name>mapreduce.cluster.local.dir</name>
    <value>file:/hadoopfs/dfs/local</value>
    <final>true</final>
  </property>
  <property>
    <name>mapreduce.cluster.temp.dir</name>
    <value>file:/hadoopfs/dfs/tmp</value>
    <description>No description</description>
    <final>true</final>
  </property>
</configuration>

yarn-site.xml

<?xml version="1.0"?>
<configuration>
<!-- Site specific YARN configuration properties -->

  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>localhost</value>
    <description>The host is the hostname of the ResourceManager and the port is the port on which the clients can talk to the Resource Manager.</description>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    <description>Shuffle service that needs to be set for MapReduce to run.</description>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce.shuffle</value>
  </property>
</configuration>

Initialize DFS

$ hdfs namenode -format

In some cases, if there is a problem with the datanode or namenode DFS, you need to re-format the DFS:

$ rm -rf /hadoopfs/dfs/*
$ rm -rf /hadoopfs/tmp/*

Then re-format the namenode DFS.

Start and stop the server(s)

This should be done one by one (namenode, datanode and the YARN resource manager), but you can use start-all.sh and stop-all.sh for a one-shot start/stop. The scripts live in /usr/local/hadoop/sbin.

$ hadoop-daemon.sh start namenode
$ hadoop-daemon.sh start datanode
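The two commands above only start HDFS daemons; the jps listing below also shows a ResourceManager, which belongs to YARN. A rough sketch of the remaining start/stop commands, assuming the standard Hadoop 2.x sbin scripts:

$ yarn-daemon.sh start resourcemanager    # YARN resource manager
$ yarn-daemon.sh start nodemanager        # YARN node manager
$ start-dfs.sh && start-yarn.sh           # or: bring up HDFS and YARN together
$ stop-yarn.sh && stop-dfs.sh             # shut everything down again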

Monitor the node(s)

General node information can be checked in a web browser at the default port. You can also use jps to see that the following instances are running:

hduser@ubuntu:/usr/local/hadoop/sbin$ jps
3135 ResourceManager
2678 DataNode
2428 NameNode
5044 Jps
2961 SecondaryNameNode

Run a test

Let's try the word count example. Don't run the randomwriter example casually unless you want to waste 10 GB.

The RandomWriter example writes 10 GB (by default) of random data per host to DFS using MapReduce. Instead, find a plain text file to use as input, for example:

$ wget
$ unzip 4300.zip

hduser@ubuntu:/usr/local/hadoop/share/hadoop/mapreduce$ hadoop dfs -mkdir /tmp
hduser@ubuntu:/usr/local/hadoop/share/hadoop/mapreduce$ hadoop dfs -copyFromLocal 4300.txt /tmp

Now ready to roll!

hduser@ubuntu:/usr/local/hadoop/share/hadoop/mapreduce$ hadoop jar hadoop-mapreduce-examples beta.jar wordcount /tmp/4300.txt /tmp/output3
14/01/28 20:04:56 WARN conf.configuration: session.id is deprecated. Instead, use dfs.metrics.session-id
14/01/28 20:04:56 INFO jvm.jvmmetrics: Initializing JVM Metrics with processname=jobtracker, sessionid=
14/01/28 20:04:56 INFO input.fileinputformat: Total input paths to process : 1
14/01/28 20:04:56 INFO mapreduce.jobsubmitter: number of splits:1
14/01/28 20:04:56 WARN conf.configuration: user.name is deprecated. Instead, use mapreduce.job.user.name
14/01/28 20:04:56 WARN conf.configuration: mapred.jar is deprecated. Instead, use mapreduce.job.jar
14/01/28 20:04:56 WARN conf.configuration: mapred.output.value.class is deprecated. Instead, use mapreduce.job.output.value.class
14/01/28 20:04:56 WARN conf.configuration: mapreduce.combine.class is deprecated. Instead, use mapreduce.job.combine.class
14/01/28 20:04:56 WARN conf.configuration: mapreduce.map.class is deprecated. Instead, use mapreduce.job.map.class
14/01/28 20:04:56 WARN conf.configuration: mapred.job.name is deprecated. Instead, use mapreduce.job.name
14/01/28 20:04:56 WARN conf.configuration: mapreduce.reduce.class is deprecated. Instead, use mapreduce.job.reduce.class

15 14/01/28 20:04:56 WARN conf.configuration: mapred.input.dir is deprecated. Instead, use mapreduce.input.fileinputformat.inputdir 14/01/28 20:04:56 WARN conf.configuration: mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir 14/01/28 20:04:56 WARN conf.configuration: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps 14/01/28 20:04:56 WARN conf.configuration: mapred.output.key.class is deprecated. Instead, use mapreduce.job.output.key.class 14/01/28 20:04:56 WARN conf.configuration: mapred.working.dir is deprecated. Instead, use mapreduce.job.working.dir 14/01/28 20:04:56 INFO mapreduce.jobsubmitter: Submitting tokens for job: job_local _ /01/28 20:04:56 WARN conf.configuration: file:/hadoopfs/tmp/mapred/staging/hduser /.staging/job_local _0001/job.xml:an attempt to override final parameter: mapreduce.local.dir; Ignoring. 14/01/28 20:04:56 WARN conf.configuration: file:/hadoopfs/tmp/mapred/staging/hduser /.staging/job_local _0001/job.xml:an attempt to override final parameter: dfs.namenode.name.dir; Ignoring. 14/01/28 20:04:56 WARN conf.configuration: file:/hadoopfs/tmp/mapred/staging/hduser /.staging/job_local _0001/job.xml:an attempt to override final parameter: mapreduce.job.endnotification.max.retry.interval; Ignoring. 14/01/28 20:04:56 WARN conf.configuration: file:/hadoopfs/tmp/mapred/staging/hduser /.staging/job_local _0001/job.xml:an attempt to override final parameter: dfs.datanode.data.dir; Ignoring. 14/01/28 20:04:56 WARN conf.configuration: file:/hadoopfs/tmp/mapred/staging/hduser /.staging/job_local _0001/job.xml:an attempt to override final parameter: mapreduce.job.endnotification.max.attempts; Ignoring. 14/01/28 20:04:56 WARN conf.configuration: file:/hadoopfs/tmp/mapred/staging/hduser /.staging/job_local _0001/job.xml:an attempt to override final parameter: fs.defaultfs; Ignoring. 14/01/28 20:04:56 WARN conf.configuration: file:/hadoopfs/tmp/mapred/staging/hduser /.staging/job_local _0001/job.xml:an attempt to override final parameter: mapreduce.system.dir; Ignoring. 14/01/28 20:04:56 WARN conf.configuration: file:/hadoopfs/tmp/mapred/local/localrunner/job_local _0001.xml:an attempt to override final parameter: mapreduce.local.dir; Ignoring.

16 14/01/28 20:04:56 WARN conf.configuration: file:/hadoopfs/tmp/mapred/local/localrunner/job_local _0001.xml:an attempt to override final parameter: dfs.namenode.name.dir; Ignoring. 14/01/28 20:04:56 WARN conf.configuration: file:/hadoopfs/tmp/mapred/local/localrunner/job_local _0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring. 14/01/28 20:04:56 WARN conf.configuration: file:/hadoopfs/tmp/mapred/local/localrunner/job_local _0001.xml:an attempt to override final parameter: dfs.datanode.data.dir; Ignoring. 14/01/28 20:04:56 WARN conf.configuration: file:/hadoopfs/tmp/mapred/local/localrunner/job_local _0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring. 14/01/28 20:04:56 WARN conf.configuration: file:/hadoopfs/tmp/mapred/local/localrunner/job_local _0001.xml:an attempt to override final parameter: fs.defaultfs; Ignoring. 14/01/28 20:04:56 WARN conf.configuration: file:/hadoopfs/tmp/mapred/local/localrunner/job_local _0001.xml:an attempt to override final parameter: mapreduce.system.dir; Ignoring. 14/01/28 20:04:56 INFO mapreduce.job: The url to track the job: 14/01/28 20:04:56 INFO mapreduce.job: Running job: job_local _ /01/28 20:04:56 INFO mapred.localjobrunner: OutputCommitter set in config null 14/01/28 20:04:56 INFO mapred.localjobrunner: OutputCommitter is org.apache.hadoop.mapreduce.lib.output.fileoutputcommitter 14/01/28 20:04:56 INFO mapred.localjobrunner: Waiting for map tasks 14/01/28 20:04:56 INFO mapred.localjobrunner: Starting task: attempt_local _0001_m_000000_0 14/01/28 20:04:56 INFO mapred.task: Using ResourceCalculatorProcessTree : [ ] 14/01/28 20:04:56 INFO mapred.maptask: Processing split: hdfs://localhost:9000/tmp/4300.txt: /01/28 20:04:57 INFO mapred.maptask: Map output collector class = org.apache.hadoop.mapred.maptask$mapoutputbuffer 14/01/28 20:04:57 INFO mapred.maptask: (EQUATOR) 0 kvi ( ) 14/01/28 20:04:57 INFO mapred.maptask: mapreduce.task.io.sort.mb: /01/28 20:04:57 INFO mapred.maptask: soft limit at

17 14/01/28 20:04:57 INFO mapred.maptask: bufstart = 0; bufvoid = /01/28 20:04:57 INFO mapred.maptask: kvstart = ; length = /01/28 20:04:57 INFO mapred.localjobrunner: 14/01/28 20:04:57 INFO mapred.maptask: Starting flush of map output 14/01/28 20:04:57 INFO mapred.maptask: Spilling map output 14/01/28 20:04:57 INFO mapred.maptask: bufstart = 0; bufend = ; bufvoid = /01/28 20:04:57 INFO mapred.maptask: kvstart = ( ); kvend = ( ); length = / /01/28 20:04:57 INFO mapreduce.job: Job job_local _0001 running in uber mode : false 14/01/28 20:04:57 INFO mapreduce.job: map 0% reduce 0% 14/01/28 20:04:58 INFO mapred.maptask: Finished spill 0 14/01/28 20:04:58 INFO mapred.task: Task:attempt_local _0001_m_000000_0 is done. And is in the process of committing 14/01/28 20:04:58 INFO mapred.localjobrunner: map 14/01/28 20:04:58 INFO mapred.task: Task 'attempt_local _0001_m_000000_0' done. 14/01/28 20:04:58 INFO mapred.localjobrunner: Finishing task: attempt_local _0001_m_000000_0 14/01/28 20:04:58 INFO mapred.localjobrunner: Map task executor complete. 14/01/28 20:04:58 INFO mapred.task: Using ResourceCalculatorProcessTree : [ ] 14/01/28 20:04:58 INFO mapred.merger: Merging 1 sorted segments 14/01/28 20:04:58 INFO mapred.merger: Down to the last merge-pass, with 1 segments left of total size: bytes 14/01/28 20:04:58 INFO mapred.localjobrunner: 14/01/28 20:04:58 WARN conf.configuration: mapred.skip.on is deprecated. Instead, use mapreduce.job.skiprecords 14/01/28 20:04:58 INFO mapreduce.job: map 100% reduce 0% 14/01/28 20:04:58 INFO mapred.task: Task:attempt_local _0001_r_000000_0 is done. And is in the process of committing 14/01/28 20:04:58 INFO mapred.localjobrunner:

18 14/01/28 20:04:58 INFO mapred.task: Task attempt_local _0001_r_000000_0 is allowed to commit now 14/01/28 20:04:58 INFO output.fileoutputcommitter: Saved output of task 'attempt_local _0001_r_000000_0' to hdfs://localhost:9000/tmp/output3/_temporary/0/task_local _0001_r_ /01/28 20:04:58 INFO mapred.localjobrunner: reduce > reduce 14/01/28 20:04:58 INFO mapred.task: Task 'attempt_local _0001_r_000000_0' done. 14/01/28 20:04:59 INFO mapreduce.job: map 100% reduce 100% 14/01/28 20:04:59 INFO mapreduce.job: Job job_local _0001 completed successfully 14/01/28 20:04:59 INFO mapreduce.job: Counters: 32 File System Counters FILE: Number of bytes read= FILE: Number of bytes written= FILE: Number of read operations=0 FILE: Number of large read operations=0 FILE: Number of write operations=0 HDFS: Number of bytes read= HDFS: Number of bytes written= HDFS: Number of read operations=13 HDFS: Number of large read operations=0 HDFS: Number of write operations=4 Map-Reduce Framework Map input records=33056 Map output records= Map output bytes= Map output materialized bytes= Input split bytes=99 Combine input records= Combine output records=50095

19 Reduce input groups=50095 Reduce shuffle bytes=0 Reduce input records=50095 Reduce output records=50095 Spilled Records= Shuffled Maps =0 Failed Shuffles=0 Merged Map outputs=0 GC time elapsed (ms)=37 CPU time spent (ms)=0 Physical memory (bytes) snapshot=0 Virtual memory (bytes) snapshot=0 Total committed heap usage (bytes)= File Input Format Counters Bytes Read= File Output Format Counters Bytes Written= see the result by hadoop dfs -cat /tmp/output3/part* Or from web console
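For example, a quick look at the output files and the top of the result might look like this (a sketch; the exact part-file name can vary):

hduser@ubuntu:~$ hadoop fs -ls /tmp/output3
hduser@ubuntu:~$ hadoop fs -cat /tmp/output3/part-r-00000 | head
hduser@ubuntu:~$ hadoop fs -getmerge /tmp/output3 ./wordcount.result   # or merge everything into one local file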

20 Remove a directory hduser@ubuntu:/usr/local/hadoop/share/hadoop/mapreduce$ hadoop dfs -rmr /tmp/output DEPRECATED: Use of this script to execute hdfs command is deprecated. Instead use the hdfs command for it. rmr: DEPRECATED: Please use 'rm -r' instead. 14/01/30 09:26:27 INFO fs.trashpolicydefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes. Deleted /tmp/output hduser@ubuntu:/usr/local/hadoop/share/hadoop/mapreduce$ hadoop dfs -rmr /tmp/output2 DEPRECATED: Use of this script to execute hdfs command is deprecated. Instead use the hdfs command for it.

rmr: DEPRECATED: Please use 'rm -r' instead.
14/01/30 09:26:56 INFO fs.trashpolicydefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.
Deleted /tmp/output2

Install Hadoop as Master + Slave on RHEL 6.5

General steps

1. Preparation:

   Item          | Master                        | Slave                            | Note
   Hostname      | master.hadoop.advol           | node.hadoop.advol                | edit /etc/hosts
   JDK           | /usr/java/jdk                 | as master                        | download the rpm from Oracle
   $HADOOP_HOME  | /data/hadoop/hadoop           | as master                        | can use a separate logical volume for HDFS
   User          | hadoop                        | hadoop                           | adduser hadoop; also grant ownership of $HADOOP_HOME to hadoop
   Hardware      | standalone hardware, RHEL 6.5 | RHEL 6.5 on VMware on my laptop  |

2. Do the same on the master node and the slave nodes to install the JDK and environment configuration.
3. Modify the hostnames so the machines can ping each other by hostname.
4. Modify the firewall (iptables).
5. Add the SSH public key from master to slave, as well as from slave to master (see the sketch after the hadoop-env.sh snippet below).
6. Keep the directory layout the same on all nodes.
7. Modify the slaves file in $HADOOP_HOME/etc.
8. Format the namenode again.
9. Run start-all from the master.

hadoop-env.sh: add JAVA_HOME

# The java implementation to use.
export JAVA_HOME=/usr/java/jdk1.7.0_55
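Steps 3 and 5 above (hostname resolution and two-way passwordless SSH) can be done roughly as follows. This is a sketch; the hostnames are the ones used in this document, while the IP addresses are placeholders:

[hadoop@master ~]$ cat /etc/hosts
192.168.1.10   master.hadoop.advol   # assumed address of the master
192.168.1.11   node.hadoop.advol     # assumed address of the slave

[hadoop@master ~]$ ssh-keygen -t rsa -P ""
[hadoop@master ~]$ ssh-copy-id hadoop@node.hadoop.advol     # master -> slave
[hadoop@node ~]$ ssh-copy-id hadoop@master.hadoop.advol     # slave -> master (repeat in the other direction)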

core-site.xml

<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/data/hadoop/hadoopfs/tmp</value>
    <description>A base for other temporary directories.</description>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://master.hadoop.advol:9000</value>
    <final>true</final>
  </property>
</configuration>

hdfs-site.xml

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/data/hadoop/hadoopfs/dfs/name</value>

    <final>true</final>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/data/hadoop/hadoopfs/dfs/data</value>
    <final>true</final>
  </property>
  <property>
    <name>dfs.datanode.max.xcievers</name>
    <value>4096</value>
  </property>
</configuration>

yarn-site.xml

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/data/hadoop/hadoopfs/dfs/name</value>
    <final>true</final>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>

    <value>file:/data/hadoop/hadoopfs/dfs/data</value>
    <final>true</final>
  </property>
  <property>
    <name>dfs.datanode.max.xcievers</name>
    <value>4096</value>
  </property>
</configuration>

mapred-site.xml

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapreduce.cluster.local.dir</name>
    <value>file:/data/hadoop/hadoopfs/dfs/local</value>
    <final>true</final>
  </property>
  <property>
    <name>mapreduce.cluster.temp.dir</name>
    <value>file:/data/hadoop/hadoopfs/dfs/tmp</value>
    <description>No description</description>
    <final>true</final>
  </property>
</configuration>
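For steps 7-9 of the general list above (the slaves file, reformatting the namenode and starting the cluster), a rough sketch looks like this; the slaves file location is the standard $HADOOP_HOME/etc/hadoop/slaves of Hadoop 2.x, and the hostnames are the ones used in this document:

[hadoop@master ~]$ cat $HADOOP_HOME/etc/hadoop/slaves
master.hadoop.advol
node.hadoop.advol

[hadoop@master ~]$ hdfs namenode -format
[hadoop@master ~]$ $HADOOP_HOME/sbin/start-all.sh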

After starting ($HADOOP_HOME/sbin/start-all.sh), use jps to check the Java processes. At the master:

sbin]$ jps
ResourceManager
NodeManager
Jps
SecondaryNameNode
NameNode
sbin]$

jps at the slave:

~]$ jps
3934 Jps
3480 DataNode
3583 NodeManager
~]$

If you have an issue with copyFromLocal (only one node accepts the data while the other raises an I/O error), check the firewall.

Check HDFS:

[hadoop@master hadoop]$ hdfs dfsadmin -report
Java HotSpot(TM) 64-Bit Server VM warning: You have loaded library /data/hadoop/hadoop-2.4.0/lib/native/libhadoop.so which might have disabled stack guard. The VM will try to fix the stack guard now. It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'.
14/05/09 17:38:44 WARN util.nativecodeloader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Configured Capacity: ( GB)
Present Capacity: ( GB)
DFS Remaining: ( GB)
DFS Used: (3.11 MB)
DFS Used%: 0.00%

Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks:

Datanodes available: 2 (2 total, 0 dead)

Live datanodes:
Name: :50010 (node.hadoop.advol)
Hostname: node.hadoop.advol
Decommission Status : Normal
Configured Capacity: (17.43 GB)
DFS Used: (28 KB)
Non DFS Used: (4.23 GB)
DFS Remaining: (13.20 GB)
DFS Used%: 0.00%
DFS Remaining%: 75.73%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: %
Cache Remaining%: 0.00%
Last contact: Fri May 09 17:38:43 EDT 2014

Name: :50010 (master.hadoop.advol)
Hostname: master.hadoop.advol
Decommission Status : Normal
Configured Capacity: ( GB)
DFS Used: (3.09 MB)

Non DFS Used: (15.76 GB)
DFS Remaining: ( GB)
DFS Used%: 0.00%
DFS Remaining%: 89.55%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: %
Cache Remaining%: 0.00%
Last contact: Fri May 09 17:38:43 EDT 2014
[hadoop@master hadoop]$

Run word count on 1 master + 2 slave nodes

You can also check the application/job tracker web UI on the master server (use the master hostname; localhost is not accepted).

[hadoop@master ~]$ hadoop jar /data/hadoop/hadoop /share/hadoop/mapreduce/hadoop-mapreduce-examples jar wordcount /tmp/4300.txt /tmp/output4
Java HotSpot(TM) 64-Bit Server VM warning: You have loaded library /data/hadoop/hadoop-2.4.0/lib/native/libhadoop.so which might have disabled stack guard. The VM will try to fix the stack guard now. It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'.
14/05/12 13:04:13 WARN util.nativecodeloader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/05/12 13:04:14 INFO client.rmproxy: Connecting to ResourceManager at master.hadoop.advol/ :
14/05/12 13:04:15 INFO input.fileinputformat: Total input paths to process : 1
14/05/12 13:04:15 INFO mapreduce.jobsubmitter: number of splits:1

28 14/05/12 13:04:15 INFO mapreduce.jobsubmitter: Submitting tokens for job: job_ _ /05/12 13:04:16 INFO impl.yarnclientimpl: Submitted application application_ _ /05/12 13:04:16 INFO mapreduce.job: The url to track the job: 14/05/12 13:04:16 INFO mapreduce.job: Running job: job_ _ /05/12 13:04:26 INFO mapreduce.job: Job job_ _0001 running in uber mode : false 14/05/12 13:04:26 INFO mapreduce.job: map 0% reduce 0% 14/05/12 13:04:35 INFO mapreduce.job: map 100% reduce 0% 14/05/12 13:04:41 INFO mapreduce.job: map 100% reduce 100% 14/05/12 13:04:41 INFO mapreduce.job: Job job_ _0001 completed successfully 14/05/12 13:04:42 INFO mapreduce.job: Counters: 49 File System Counters FILE: Number of bytes read= FILE: Number of bytes written= FILE: Number of read operations=0 FILE: Number of large read operations=0 FILE: Number of write operations=0 HDFS: Number of bytes read= HDFS: Number of bytes written= HDFS: Number of read operations=6 HDFS: Number of large read operations=0 HDFS: Number of write operations=2 Job Counters Launched map tasks=1 Launched reduce tasks=1 Data-local map tasks=1 Total time spent by all maps in occupied slots (ms)=6084

29 Total time spent by all reduces in occupied slots (ms)=3660 Total time spent by all map tasks (ms)=6084 Total time spent by all reduce tasks (ms)=3660 Total vcore-seconds taken by all map tasks=6084 Total vcore-seconds taken by all reduce tasks=3660 Total megabyte-seconds taken by all map tasks= Total megabyte-seconds taken by all reduce tasks= Map-Reduce Framework Map input records=33056 Map output records= Map output bytes= Map output materialized bytes= Input split bytes=109 Combine input records= Combine output records=50095 Reduce input groups=50095 Reduce shuffle bytes= Reduce input records=50095 Reduce output records=50095 Spilled Records= Shuffled Maps =1 Failed Shuffles=0 Merged Map outputs=1 GC time elapsed (ms)=286 CPU time spent (ms)=4990 Physical memory (bytes) snapshot= Virtual memory (bytes) snapshot= Total committed heap usage (bytes)= Shuffle Errors BAD_ID=0

30 CONNECTION=0 IO_ERROR=0 WRONG_LENGTH=0 WRONG_MAP=0 WRONG_REDUCE=0 File Input Format Counters Bytes Read= File Output Format Counters Bytes Written= ~]$ Run word count on master+slave (1 node) sbin]$ hadoop jar /data/hadoop/hadoop /share/hadoop/mapreduce/hadoop-mapreduce-examples jar wordcount /tmp/4300.txt /tmp/output5 Java HotSpot(TM) 64-Bit Server VM warning: You have loaded library /data/hadoop/hadoop-2.4.0/lib/native/libhadoop.so which might have disabled stack guard. The VM will try to fix the stack guard now. It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'. 14/05/12 13:29:24 WARN util.nativecodeloader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 14/05/12 13:29:25 INFO client.rmproxy: Connecting to ResourceManager at master.hadoop.advol/ : /05/12 13:29:26 INFO input.fileinputformat: Total input paths to process : 1 14/05/12 13:29:26 INFO mapreduce.jobsubmitter: number of splits:1 14/05/12 13:29:26 INFO mapreduce.jobsubmitter: Submitting tokens for job: job_ _ /05/12 13:29:27 INFO impl.yarnclientimpl: Submitted application application_ _0001

31 14/05/12 13:29:27 INFO mapreduce.job: The url to track the job: 14/05/12 13:29:27 INFO mapreduce.job: Running job: job_ _ /05/12 13:29:34 INFO mapreduce.job: Job job_ _0001 running in uber mode : false 14/05/12 13:29:34 INFO mapreduce.job: map 0% reduce 0% 14/05/12 13:29:41 INFO mapreduce.job: map 100% reduce 0% 14/05/12 13:29:47 INFO mapreduce.job: map 100% reduce 100% 14/05/12 13:29:48 INFO mapreduce.job: Job job_ _0001 completed successfully 14/05/12 13:29:48 INFO mapreduce.job: Counters: 49 File System Counters FILE: Number of bytes read= FILE: Number of bytes written= FILE: Number of read operations=0 FILE: Number of large read operations=0 FILE: Number of write operations=0 HDFS: Number of bytes read= HDFS: Number of bytes written= HDFS: Number of read operations=6 HDFS: Number of large read operations=0 HDFS: Number of write operations=2 Job Counters Launched map tasks=1 Launched reduce tasks=1 Data-local map tasks=1 Total time spent by all maps in occupied slots (ms)=4254 Total time spent by all reduces in occupied slots (ms)=3904 Total time spent by all map tasks (ms)=4254 Total time spent by all reduce tasks (ms)=3904 Total vcore-seconds taken by all map tasks=4254

32 Total vcore-seconds taken by all reduce tasks=3904 Total megabyte-seconds taken by all map tasks= Total megabyte-seconds taken by all reduce tasks= Map-Reduce Framework Map input records=33056 Map output records= Map output bytes= Map output materialized bytes= Input split bytes=109 Combine input records= Combine output records=50095 Reduce input groups=50095 Reduce shuffle bytes= Reduce input records=50095 Reduce output records=50095 Spilled Records= Shuffled Maps =1 Failed Shuffles=0 Merged Map outputs=1 GC time elapsed (ms)=119 CPU time spent (ms)=5290 Physical memory (bytes) snapshot= Virtual memory (bytes) snapshot= Total committed heap usage (bytes)= Shuffle Errors BAD_ID=0 CONNECTION=0 IO_ERROR=0 WRONG_LENGTH=0 WRONG_MAP=0

WRONG_REDUCE=0 File Input Format Counters Bytes Read= File Output Format Counters Bytes Written= sbin]$

Observation: I don't see any improvement in the results between running standalone and running master+slave.

Run word count on the Bible, standalone and cluster

Try to work on a bigger file:

[hadoop@master hadoop]$ wget
[hadoop@master hadoop]$ unzip AV_txt.zip
[hadoop@master hadoop]$ mv AV1611Bible.txt bible.txt
[hadoop@master hadoop]$ hdfs dfs -copyFromLocal bible.txt /tmp/bible.txt
[hadoop@master hadoop]$ hadoop jar /data/hadoop/hadoop /share/hadoop/mapreduce/hadoop-mapreduce-examples jar wordcount /tmp/bible.txt /tmp/bible/
[hadoop@master hadoop]$ hdfs dfs -getmerge /tmp/bible/ ./bible.result
[hadoop@master hadoop]$ cat bible.result | grep god
Gudgodah 1
Gudgodah; 1
[god] 1
[god]: 1
[gods 1
[gods], 2
ergodiwkthv 1
god 20
god, 19
god. 9
god: 4

god; 3
goddess 4
goddess. 1
godliness 4
godliness) 1
godliness, 4
godliness. 1
godliness: 2
godliness; 3
godly 16
godly, 1
godly-learned 1
gods 93
gods, 88
gods. 28
gods: 15
gods; 8
gods? 7
ungodliness 3
ungodliness. 1
ungodly 17
ungodly, 5
ungodly. 2
ungodly; 2
ungodly? 1

35 Run it on Master + 2 slave nodes [hadoop@master sbin]$./start-all.sh [hadoop@master sbin]$ jps 853 SecondaryNameNode 598 DataNode 1009 ResourceManager 466 NameNode 1452 Jps 1120 NodeManager [hadoop@master sbin]$ hadoop jar /data/hadoop/hadoop /share/hadoop/mapreduce/hadoop-mapreduce-examples jar wordcount /tmp/bible.txt /tmp/bible2/ Java HotSpot(TM) 64-Bit Server VM warning: You have loaded library /data/hadoop/hadoop-2.4.0/lib/native/libhadoop.so which might have disabled stack guard. The VM will try to fix the stack guard now. It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'. 14/05/12 14:01:50 WARN util.nativecodeloader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 14/05/12 14:01:51 INFO client.rmproxy: Connecting to ResourceManager at master.hadoop.advol/ : /05/12 14:01:52 INFO input.fileinputformat: Total input paths to process : 1 14/05/12 14:01:52 INFO mapreduce.jobsubmitter: number of splits:1 14/05/12 14:01:52 INFO mapreduce.jobsubmitter: Submitting tokens for job: job_ _ /05/12 14:01:52 INFO impl.yarnclientimpl: Submitted application application_ _ /05/12 14:01:52 INFO mapreduce.job: The url to track the job: 14/05/12 14:01:53 INFO mapreduce.job: Running job: job_ _ /05/12 14:02:00 INFO mapreduce.job: Job job_ _0001 running in uber mode : false

36 14/05/12 14:02:00 INFO mapreduce.job: map 0% reduce 0% 14/05/12 14:02:12 INFO mapreduce.job: map 100% reduce 0% 14/05/12 14:02:21 INFO mapreduce.job: map 100% reduce 100% 14/05/12 14:02:22 INFO mapreduce.job: Job job_ _0001 completed successfully 14/05/12 14:02:22 INFO mapreduce.job: Counters: 49 File System Counters FILE: Number of bytes read= FILE: Number of bytes written= FILE: Number of read operations=0 FILE: Number of large read operations=0 FILE: Number of write operations=0 HDFS: Number of bytes read= HDFS: Number of bytes written= HDFS: Number of read operations=6 HDFS: Number of large read operations=0 HDFS: Number of write operations=2 Job Counters Launched map tasks=1 Launched reduce tasks=1 Data-local map tasks=1 Total time spent by all maps in occupied slots (ms)=10832 Total time spent by all reduces in occupied slots (ms)=5080 Total time spent by all map tasks (ms)=10832 Total time spent by all reduce tasks (ms)=5080 Total vcore-seconds taken by all map tasks=10832 Total vcore-seconds taken by all reduce tasks=5080 Total megabyte-seconds taken by all map tasks= Total megabyte-seconds taken by all reduce tasks= Map-Reduce Framework

37 Map input records=33979 Map output records= Map output bytes= Map output materialized bytes= Input split bytes=110 Combine input records= Combine output records=32635 Reduce input groups=32635 Reduce shuffle bytes= Reduce input records=32635 Reduce output records=32635 Spilled Records=65270 Shuffled Maps =1 Failed Shuffles=0 Merged Map outputs=1 GC time elapsed (ms)=760 CPU time spent (ms)=7800 Physical memory (bytes) snapshot= Virtual memory (bytes) snapshot= Total committed heap usage (bytes)= Shuffle Errors BAD_ID=0 CONNECTION=0 IO_ERROR=0 WRONG_LENGTH=0 WRONG_MAP=0 WRONG_REDUCE=0 File Input Format Counters Bytes Read= File Output Format Counters

Bytes Written=
sbin]$ hdfs dfs -getmerge /tmp/bible/ ./bible2.result
Java HotSpot(TM) 64-Bit Server VM warning: You have loaded library /data/hadoop/hadoop-2.4.0/lib/native/libhadoop.so which might have disabled stack guard. The VM will try to fix the stack guard now. It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'.
14/05/12 14:07:39 WARN util.nativecodeloader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
sbin]$ cat bible2.result | grep god
Gudgodah 1
Gudgodah; 1
[god] 1
[god]: 1
[gods 1
[gods], 2
ergodiwkthv 1
god 20
god, 19
god. 9
god: 4
god; 3
goddess 4
goddess. 1
godliness 4
godliness) 1
godliness, 4
godliness. 1
godliness: 2
godliness; 3
godly 16

godly, 1
godly-learned 1
gods 93
gods, 88
gods. 28
gods: 15
gods; 8
gods? 7
ungodliness 3
ungodliness. 1
ungodly 17
ungodly, 5
ungodly. 2
ungodly; 2
ungodly? 1
[hadoop@master sbin]$

Run word count on my e-mails, master + 2 slave nodes

Try to run another, bigger file. The input is an export from my Outlook mailbox covering about 3 years:

[hadoop@master hadoop]$ hadoop jar /data/hadoop/hadoop /share/hadoop/mapreduce/hadoop-mapreduce-examples jar wordcount /tmp/my s.txt /tmp/my /
Java HotSpot(TM) 64-Bit Server VM warning: You have loaded library /data/hadoop/hadoop-2.4.0/lib/native/libhadoop.so which might have disabled stack guard. The VM will try to fix the stack guard now.

40 It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'. 14/05/12 14:54:45 WARN util.nativecodeloader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 14/05/12 14:54:46 INFO client.rmproxy: Connecting to ResourceManager at master.hadoop.advol/ : /05/12 14:54:46 INFO input.fileinputformat: Total input paths to process : 1 14/05/12 14:54:46 INFO mapreduce.jobsubmitter: number of splits:1 14/05/12 14:54:47 INFO mapreduce.jobsubmitter: Submitting tokens for job: job_ _ /05/12 14:54:47 INFO impl.yarnclientimpl: Submitted application application_ _ /05/12 14:54:47 INFO mapreduce.job: The url to track the job: 14/05/12 14:54:47 INFO mapreduce.job: Running job: job_ _ /05/12 14:54:54 INFO mapreduce.job: Job job_ _0002 running in uber mode : false 14/05/12 14:54:54 INFO mapreduce.job: map 0% reduce 0% 14/05/12 14:55:10 INFO mapreduce.job: map 21% reduce 0% 14/05/12 14:55:13 INFO mapreduce.job: map 24% reduce 0% 14/05/12 14:55:16 INFO mapreduce.job: map 43% reduce 0% 14/05/12 14:55:19 INFO mapreduce.job: map 47% reduce 0% 14/05/12 14:55:22 INFO mapreduce.job: map 62% reduce 0% 14/05/12 14:55:25 INFO mapreduce.job: map 67% reduce 0% 14/05/12 14:55:26 INFO mapreduce.job: map 100% reduce 0% 14/05/12 14:55:33 INFO mapreduce.job: map 100% reduce 100% 14/05/12 14:55:33 INFO mapreduce.job: Job job_ _0002 completed successfully 14/05/12 14:55:33 INFO mapreduce.job: Counters: 49 File System Counters FILE: Number of bytes read= FILE: Number of bytes written=

41 FILE: Number of read operations=0 FILE: Number of large read operations=0 FILE: Number of write operations=0 HDFS: Number of bytes read= HDFS: Number of bytes written= HDFS: Number of read operations=6 HDFS: Number of large read operations=0 HDFS: Number of write operations=2 Job Counters Launched map tasks=1 Launched reduce tasks=1 Data-local map tasks=1 Total time spent by all maps in occupied slots (ms)=29563 Total time spent by all reduces in occupied slots (ms)=4477 Total time spent by all map tasks (ms)=29563 Total time spent by all reduce tasks (ms)=4477 Total vcore-seconds taken by all map tasks=29563 Total vcore-seconds taken by all reduce tasks=4477 Total megabyte-seconds taken by all map tasks= Total megabyte-seconds taken by all reduce tasks= Map-Reduce Framework Map input records= Map output records= Map output bytes= Map output materialized bytes= Input split bytes=113 Combine input records= Combine output records= Reduce input groups= Reduce shuffle bytes=

42 Reduce input records= Reduce output records= Spilled Records= Shuffled Maps =1 Failed Shuffles=0 Merged Map outputs=1 GC time elapsed (ms)=604 CPU time spent (ms)=32020 Physical memory (bytes) snapshot= Virtual memory (bytes) snapshot= Total committed heap usage (bytes)= Shuffle Errors BAD_ID=0 CONNECTION=0 IO_ERROR=0 WRONG_LENGTH=0 WRONG_MAP=0 WRONG_REDUCE=0 File Input Format Counters Bytes Read= File Output Format Counters Bytes Written= hadoop]$

Start history server

The history server is not started by default.

sbin]$ ./mr-jobhistory-daemon.sh start historyserver
starting historyserver, logging to /data/hadoop/hadoop-2.4.0/logs/mapred-hadoop-historyserver-master.hadoop.advol.out
sbin]$

Install HBase

Understand what Hadoop, HBase, Hive and Pig are.

Differences between Hive and HBase:
1. Hive is a SQL-like language layer; it operates on the HDFS file system in a database-like way to simplify programming, and the underlying computation is MapReduce.
2. Hive is a row-oriented store.
3. Hive itself does not store or compute data; it relies entirely on HDFS and MapReduce, and tables in Hive are purely logical.
4. HBase is built for queries; by pooling the memory of all the machines in the cluster, it provides one huge in-memory hash table.
5. HBase is not a relational database; it is a column-oriented distributed database built on top of HDFS, and it does not support SQL.
6. HBase tables are physical, not logical; the huge in-memory hash table it provides is used, for example, by search engines to store indexes for fast lookups.

7. HBase is a column-oriented store.

Pig: Pig is a data-flow language used to process huge amounts of data quickly and easily. Pig has two parts: the Pig interface and Pig Latin. Pig can work with HDFS and HBase data very conveniently and, like Hive, it handles what it needs to do very efficiently; working with Pig queries directly can save a lot of labor and time. When you want to do some transformations on your data and don't want to write MapReduce jobs, you can use Pig.

Hive: Hive originated at Facebook. Hive plays the role of a data warehouse in Hadoop: it sits on top of the Hadoop cluster and provides a SQL-like interface for operating on the data stored on the cluster. You can use HiveQL to do selects, joins and so on. If you have data-warehouse needs, are good at writing SQL and don't want to write MapReduce jobs, you can use Hive.

HBase: HBase runs on top of HDFS as a column-oriented database. HDFS lacks random read/write operations, and HBase exists precisely to fill that gap. HBase is modeled on Google BigTable and stores data as key-value pairs. The goal of the project is to quickly locate and access the data you need among billions of rows on the cluster. HBase is a database, a NoSQL database, and like other databases it provides random read/write capability. Hadoop cannot satisfy real-time needs, but HBase can. If you need real-time access to some data, store it in HBase. You can use Hadoop as a static data warehouse and HBase as the data store for data that your operations will change.

Pig vs Hive: Hive is better suited to data-warehouse tasks; it is mainly used for static structures and work that requires frequent analysis. Hive's similarity to SQL makes it an ideal point of integration between Hadoop and other BI tools. Pig gives developers more flexibility over large data sets and allows concise scripts for transforming data flows to be embedded into larger applications. Pig is relatively lightweight compared with Hive; its main advantage is that it can dramatically cut the amount of code compared with using the Hadoop Java APIs directly. Because of this, Pig continues to attract a large number of software developers. Both Hive and Pig can be combined with HBase; they provide high-level language support for HBase, which makes statistical processing of data in HBase very simple.

Hive vs HBase: Hive is a batch system built on top of Hadoop to reduce the work of writing MapReduce jobs; HBase is a project meant to make up for Hadoop's lack of real-time operations. Imagine you are working with an RDBMS: for a full table scan, use Hive + Hadoop; for indexed access, use HBase + Hadoop. A Hive query is a set of MapReduce jobs and can take anywhere from 5 minutes to several hours or more, while HBase is very efficient, certainly much more efficient than Hive.
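To make the contrast concrete, here is a rough sketch of the two access styles; the table and column names are made up purely for illustration:

hive> SELECT sender, count(*) FROM mails GROUP BY sender;    -- HiveQL: a scan/aggregation compiled into MapReduce jobs

hbase(main):001:0> get 'mails', 'row-key-42', 'info:sender'   # HBase shell: a keyed lookup of a single row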

If you see:

org.apache.hadoop.ipc.RemoteException: Server IPC version 9 cannot communicate with client version 4

check the HBase version against the Hadoop version (the HBase build must match the Hadoop major version).

[hadoop@master download]$ wget /hbase hadoop2-bin.tar.gz
:54: /hbase hadoop2-bin.tar.gz
Resolving mirror.csclub.uwaterloo.ca
Connecting to mirror.csclub.uwaterloo.ca :80... connected.
HTTP request sent, awaiting response OK
Length: (78M) [application/x-gzip]
Saving to: "hbase hadoop2-bin.tar.gz"
100%[==================================================================>] 82,246, M/s in 3.9s
:54:22 (20.0 MB/s) - "hbase hadoop2-bin.tar.gz" saved [ / ]

[hadoop@master download]$ tar xfz hbase hadoop2-bin.tar.gz -C /data/hbase/

Download and modify the configuration:

[hadoop@master hbase hadoop2]$ cd conf
[hadoop@master conf]$ ls
hadoop-metrics2-hbase.properties  hbase-env.sh  hbase-site.xml  regionservers
hbase-env.cmd  hbase-policy.xml  log4j.properties
[hadoop@master conf]$ more hbase-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>

  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://master.hadoop.advol:9000/hbase</value>
  </property>
</configuration>

Create a table 'card' and add a record

[hadoop@master bin]$ ./hbase shell
hbase(main):019:0> create 'card','mycf'
0 row(s) in seconds
=> Hbase::Table - card
hbase(main):020:0> list 'card'
TABLE
card
1 row(s) in seconds
=> ["card"]
hbase(main):021:0> put 'card','rowid1','mycf:a','sword of Truth'
0 row(s) in seconds
hbase(main):022:0> list 'card'
TABLE
card
1 row(s) in seconds
=> ["card"]

hbase(main):024:0> scan 'card'
ROW                  COLUMN+CELL
 rowid1              column=mycf:a, timestamp= , value=sword of Truth
1 row(s) in seconds
hbase(main):025:0>

[hadoop@master bin]$ ./stop-hbase.sh
stopping hbase...
localhost: stopping zookeeper.
[hadoop@master bin]$ jps
8707 NameNode
9248 ResourceManager
9092 SecondaryNameNode
Jps
8837 DataNode
9359 NodeManager
5532 JobHistoryServer
[hadoop@master bin]$ ./start-hbase.sh
localhost: starting zookeeper, logging to /data/hbase/hbase hadoop2/bin/../logs/hbase-hadoop-zookeeper-master.hadoop.advol.out
starting master, logging to /data/hbase/hbase hadoop2/bin/../logs/hbase-hadoop-master-master.hadoop.advol.out
localhost: starting regionserver, logging to /data/hbase/hbase hadoop2/bin/../logs/hbase-hadoop-regionserver-master.hadoop.advol.out
[hadoop@master bin]$ ./hbase shell
:10:51,394 INFO [main] Configuration.deprecation: hadoop.native.lib is deprecated. Instead, use io.native.lib.available
HBase Shell; enter 'help<return>' for list of supported commands.
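A closing note on the HBase configuration above: only hbase-site.xml is shown, but hbase-env.sh (listed in the conf directory) usually needs JAVA_HOME as well for this kind of setup. A minimal sketch, assuming the JDK path used earlier in this document and that HBase manages its own ZooKeeper:

# in the HBase conf directory, hbase-env.sh
export JAVA_HOME=/usr/java/jdk1.7.0_55
export HBASE_MANAGES_ZK=true   # let HBase start/stop the bundled ZooKeeper (as seen in the start-hbase.sh output above)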


Centrify Server Suite 2015.1 For MapR 4.1 Hadoop With Multiple Clusters in Active Directory Centrify Server Suite 2015.1 For MapR 4.1 Hadoop With Multiple Clusters in Active Directory v1.1 2015 CENTRIFY CORPORATION. ALL RIGHTS RESERVED. 1 Contents General Information 3 Centrify Server Suite for

More information

!"#$%&' ( )%#*'+,'-#.//"0( !"#$"%&'()*$+()',!-+.'/', 4(5,67,!-+!"89,:*$;'0+$.<.,&0$'09,&)"/=+,!()<>'0, 3, Processing LARGE data sets

!#$%&' ( )%#*'+,'-#.//0( !#$%&'()*$+()',!-+.'/', 4(5,67,!-+!89,:*$;'0+$.<.,&0$'09,&)/=+,!()<>'0, 3, Processing LARGE data sets !"#$%&' ( Processing LARGE data sets )%#*'+,'-#.//"0( Framework for o! reliable o! scalable o! distributed computation of large data sets 4(5,67,!-+!"89,:*$;'0+$.

More information

HADOOP. Installation and Deployment of a Single Node on a Linux System. Presented by: Liv Nguekap And Garrett Poppe

HADOOP. Installation and Deployment of a Single Node on a Linux System. Presented by: Liv Nguekap And Garrett Poppe HADOOP Installation and Deployment of a Single Node on a Linux System Presented by: Liv Nguekap And Garrett Poppe Topics Create hadoopuser and group Edit sudoers Set up SSH Install JDK Install Hadoop Editting

More information

Cloudera Distributed Hadoop (CDH) Installation and Configuration on Virtual Box

Cloudera Distributed Hadoop (CDH) Installation and Configuration on Virtual Box Cloudera Distributed Hadoop (CDH) Installation and Configuration on Virtual Box By Kavya Mugadur W1014808 1 Table of contents 1.What is CDH? 2. Hadoop Basics 3. Ways to install CDH 4. Installation and

More information

Hadoop Training Hands On Exercise

Hadoop Training Hands On Exercise Hadoop Training Hands On Exercise 1. Getting started: Step 1: Download and Install the Vmware player - Download the VMware- player- 5.0.1-894247.zip and unzip it on your windows machine - Click the exe

More information

Distributed Filesystems

Distributed Filesystems Distributed Filesystems Amir H. Payberah Swedish Institute of Computer Science amir@sics.se April 8, 2014 Amir H. Payberah (SICS) Distributed Filesystems April 8, 2014 1 / 32 What is Filesystem? Controls

More information

The objective of this lab is to learn how to set up an environment for running distributed Hadoop applications.

The objective of this lab is to learn how to set up an environment for running distributed Hadoop applications. Lab 9: Hadoop Development The objective of this lab is to learn how to set up an environment for running distributed Hadoop applications. Introduction Hadoop can be run in one of three modes: Standalone

More information

Hadoop Ecosystem B Y R A H I M A.

Hadoop Ecosystem B Y R A H I M A. Hadoop Ecosystem B Y R A H I M A. History of Hadoop Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open

More information

Deploy Apache Hadoop with Emulex OneConnect OCe14000 Ethernet Network Adapters

Deploy Apache Hadoop with Emulex OneConnect OCe14000 Ethernet Network Adapters CONNECT - Lab Guide Deploy Apache Hadoop with Emulex OneConnect OCe14000 Ethernet Network Adapters Hardware, software and configuration steps needed to deploy Apache Hadoop 2.4.1 with the Emulex family

More information

Ankush Cluster Manager - Hadoop2 Technology User Guide

Ankush Cluster Manager - Hadoop2 Technology User Guide Ankush Cluster Manager - Hadoop2 Technology User Guide Ankush User Manual 1.5 Ankush User s Guide for Hadoop2, Version 1.5 This manual, and the accompanying software and other documentation, is protected

More information

Hadoop Distributed File System and Map Reduce Processing on Multi-Node Cluster

Hadoop Distributed File System and Map Reduce Processing on Multi-Node Cluster Hadoop Distributed File System and Map Reduce Processing on Multi-Node Cluster Dr. G. Venkata Rami Reddy 1, CH. V. V. N. Srikanth Kumar 2 1 Assistant Professor, Department of SE, School Of Information

More information

Hadoop Basics with InfoSphere BigInsights

Hadoop Basics with InfoSphere BigInsights An IBM Proof of Technology Hadoop Basics with InfoSphere BigInsights Unit 4: Hadoop Administration An IBM Proof of Technology Catalog Number Copyright IBM Corporation, 2013 US Government Users Restricted

More information

docs.hortonworks.com

docs.hortonworks.com docs.hortonworks.com Hortonworks Data Platform: Upgrading HDP Manually Copyright 2012-2015 Hortonworks, Inc. Some rights reserved. The Hortonworks Data Platform, powered by Apache Hadoop, is a massively

More information

Big Data Operations Guide for Cloudera Manager v5.x Hadoop

Big Data Operations Guide for Cloudera Manager v5.x Hadoop Big Data Operations Guide for Cloudera Manager v5.x Hadoop Logging into the Enterprise Cloudera Manager 1. On the server where you have installed 'Cloudera Manager', make sure that the server is running,

More information

HADOOP MOCK TEST HADOOP MOCK TEST II

HADOOP MOCK TEST HADOOP MOCK TEST II http://www.tutorialspoint.com HADOOP MOCK TEST Copyright tutorialspoint.com This section presents you various set of Mock Tests related to Hadoop Framework. You can download these sample mock tests at

More information

Pivotal HD Enterprise 1.0 Stack and Tool Reference Guide. Rev: A03

Pivotal HD Enterprise 1.0 Stack and Tool Reference Guide. Rev: A03 Pivotal HD Enterprise 1.0 Stack and Tool Reference Guide Rev: A03 Use of Open Source This product may be distributed with open source code, licensed to you in accordance with the applicable open source

More information

Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 14

Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 14 Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases Lecture 14 Big Data Management IV: Big-data Infrastructures (Background, IO, From NFS to HFDS) Chapter 14-15: Abideboul

More information

Hadoop 2.2.0 MultiNode Cluster Setup

Hadoop 2.2.0 MultiNode Cluster Setup Hadoop 2.2.0 MultiNode Cluster Setup Sunil Raiyani Jayam Modi June 7, 2014 Sunil Raiyani Jayam Modi Hadoop 2.2.0 MultiNode Cluster Setup June 7, 2014 1 / 14 Outline 4 Starting Daemons 1 Pre-Requisites

More information

Installing Hadoop. Hortonworks Hadoop. April 29, 2015. Mogulla, Deepak Reddy VERSION 1.0

Installing Hadoop. Hortonworks Hadoop. April 29, 2015. Mogulla, Deepak Reddy VERSION 1.0 April 29, 2015 Installing Hadoop Hortonworks Hadoop VERSION 1.0 Mogulla, Deepak Reddy Table of Contents Get Linux platform ready...2 Update Linux...2 Update/install Java:...2 Setup SSH Certificates...3

More information

TP1: Getting Started with Hadoop

TP1: Getting Started with Hadoop TP1: Getting Started with Hadoop Alexandru Costan MapReduce has emerged as a leading programming model for data-intensive computing. It was originally proposed by Google to simplify development of web

More information

Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 15

Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 15 Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases Lecture 15 Big Data Management V (Big-data Analytics / Map-Reduce) Chapter 16 and 19: Abideboul et. Al. Demetris

More information

Hands-on Exercises with Big Data

Hands-on Exercises with Big Data Hands-on Exercises with Big Data Lab Sheet 1: Getting Started with MapReduce and Hadoop The aim of this exercise is to learn how to begin creating MapReduce programs using the Hadoop Java framework. In

More information

Tableau Spark SQL Setup Instructions

Tableau Spark SQL Setup Instructions Tableau Spark SQL Setup Instructions 1. Prerequisites 2. Configuring Hive 3. Configuring Spark & Hive 4. Starting the Spark Service and the Spark Thrift Server 5. Connecting Tableau to Spark SQL 5A. Install

More information

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give

More information

Introduction to HDFS. Prasanth Kothuri, CERN

Introduction to HDFS. Prasanth Kothuri, CERN Prasanth Kothuri, CERN 2 What s HDFS HDFS is a distributed file system that is fault tolerant, scalable and extremely easy to expand. HDFS is the primary distributed storage for Hadoop applications. HDFS

More information

Pivotal HD Enterprise

Pivotal HD Enterprise PRODUCT DOCUMENTATION Pivotal HD Enterprise Version 1.1 Stack and Tool Reference Guide Rev: A01 2013 GoPivotal, Inc. Table of Contents 1 Pivotal HD 1.1 Stack - RPM Package 11 1.1 Overview 11 1.2 Accessing

More information

Using The Hortonworks Virtual Sandbox

Using The Hortonworks Virtual Sandbox Using The Hortonworks Virtual Sandbox Powered By Apache Hadoop This work by Hortonworks, Inc. is licensed under a Creative Commons Attribution- ShareAlike3.0 Unported License. Legal Notice Copyright 2012

More information

Big Data Management and NoSQL Databases

Big Data Management and NoSQL Databases NDBI040 Big Data Management and NoSQL Databases Lecture 3. Apache Hadoop Doc. RNDr. Irena Holubova, Ph.D. holubova@ksi.mff.cuni.cz http://www.ksi.mff.cuni.cz/~holubova/ndbi040/ Apache Hadoop Open-source

More information

COURSE CONTENT Big Data and Hadoop Training

COURSE CONTENT Big Data and Hadoop Training COURSE CONTENT Big Data and Hadoop Training 1. Meet Hadoop Data! Data Storage and Analysis Comparison with Other Systems RDBMS Grid Computing Volunteer Computing A Brief History of Hadoop Apache Hadoop

More information

Introduction to HDFS. Prasanth Kothuri, CERN

Introduction to HDFS. Prasanth Kothuri, CERN Prasanth Kothuri, CERN 2 What s HDFS HDFS is a distributed file system that is fault tolerant, scalable and extremely easy to expand. HDFS is the primary distributed storage for Hadoop applications. Hadoop

More information

Setting up Hadoop with MongoDB on Windows 7 64-bit

Setting up Hadoop with MongoDB on Windows 7 64-bit SGT WHITE PAPER Setting up Hadoop with MongoDB on Windows 7 64-bit HCCP Big Data Lab 2015 SGT, Inc. All Rights Reserved 7701 Greenbelt Road, Suite 400, Greenbelt, MD 20770 Tel: (301) 614-8600 Fax: (301)

More information

Qsoft Inc www.qsoft-inc.com

Qsoft Inc www.qsoft-inc.com Big Data & Hadoop Qsoft Inc www.qsoft-inc.com Course Topics 1 2 3 4 5 6 Week 1: Introduction to Big Data, Hadoop Architecture and HDFS Week 2: Setting up Hadoop Cluster Week 3: MapReduce Part 1 Week 4:

More information

Infomatics. Big-Data and Hadoop Developer Training with Oracle WDP

Infomatics. Big-Data and Hadoop Developer Training with Oracle WDP Big-Data and Hadoop Developer Training with Oracle WDP What is this course about? Big Data is a collection of large and complex data sets that cannot be processed using regular database management tools

More information

The Hadoop Eco System Shanghai Data Science Meetup

The Hadoop Eco System Shanghai Data Science Meetup The Hadoop Eco System Shanghai Data Science Meetup Karthik Rajasethupathy, Christian Kuka 03.11.2015 @Agora Space Overview What is this talk about? Giving an overview of the Hadoop Ecosystem and related

More information

Apache Hadoop new way for the company to store and analyze big data

Apache Hadoop new way for the company to store and analyze big data Apache Hadoop new way for the company to store and analyze big data Reyna Ulaque Software Engineer Agenda What is Big Data? What is Hadoop? Who uses Hadoop? Hadoop Architecture Hadoop Distributed File

More information

How to Install and Configure EBF15328 for MapR 4.0.1 or 4.0.2 with MapReduce v1

How to Install and Configure EBF15328 for MapR 4.0.1 or 4.0.2 with MapReduce v1 How to Install and Configure EBF15328 for MapR 4.0.1 or 4.0.2 with MapReduce v1 1993-2015 Informatica Corporation. No part of this document may be reproduced or transmitted in any form, by any means (electronic,

More information

Hadoop 只 支 援 用 Java 開 發 嘛? Is Hadoop only support Java? 總 不 能 全 部 都 重 新 設 計 吧? 如 何 與 舊 系 統 相 容? Can Hadoop work with existing software?

Hadoop 只 支 援 用 Java 開 發 嘛? Is Hadoop only support Java? 總 不 能 全 部 都 重 新 設 計 吧? 如 何 與 舊 系 統 相 容? Can Hadoop work with existing software? Hadoop 只 支 援 用 Java 開 發 嘛? Is Hadoop only support Java? 總 不 能 全 部 都 重 新 設 計 吧? 如 何 與 舊 系 統 相 容? Can Hadoop work with existing software? 可 以 跟 資 料 庫 結 合 嘛? Can Hadoop work with Databases? 開 發 者 們 有 聽 到

More information

Hadoop. http://hadoop.apache.org/ Sunday, November 25, 12

Hadoop. http://hadoop.apache.org/ Sunday, November 25, 12 Hadoop http://hadoop.apache.org/ What Is Apache Hadoop? The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using

More information

How To Use Hadoop

How To Use Hadoop Hadoop in Action Justin Quan March 15, 2011 Poll What s to come Overview of Hadoop for the uninitiated How does Hadoop work? How do I use Hadoop? How do I get started? Final Thoughts Key Take Aways Hadoop

More information

How To Scale Out Of A Nosql Database

How To Scale Out Of A Nosql Database Firebird meets NoSQL (Apache HBase) Case Study Firebird Conference 2011 Luxembourg 25.11.2011 26.11.2011 Thomas Steinmaurer DI +43 7236 3343 896 thomas.steinmaurer@scch.at www.scch.at Michael Zwick DI

More information

Pivotal HD Enterprise

Pivotal HD Enterprise PRODUCT DOCUMENTATION Pivotal HD Enterprise Version 1.1.1 Release Notes Rev: A02 2014 GoPivotal, Inc. Table of Contents 1 Welcome to Pivotal HD Enterprise 4 2 PHD Components 5 2.1 Core Apache Stack 5 2.2

More information

Hadoop 2.6 Configuration and More Examples

Hadoop 2.6 Configuration and More Examples Hadoop 2.6 Configuration and More Examples Big Data 2015 Apache Hadoop & YARN Apache Hadoop (1.X)! De facto Big Data open source platform Running for about 5 years in production at hundreds of companies

More information

HDFS Installation and Shell

HDFS Installation and Shell 2012 coreservlets.com and Dima May HDFS Installation and Shell Originals of slides and source code for examples: http://www.coreservlets.com/hadoop-tutorial/ Also see the customized Hadoop training courses

More information

Pro Apache Hadoop. Second Edition. Sameer Wadkar. Madhu Siddalingaiah

Pro Apache Hadoop. Second Edition. Sameer Wadkar. Madhu Siddalingaiah Pro Apache Hadoop Second Edition Sameer Wadkar Madhu Siddalingaiah Contents J About the Authors About the Technical Reviewer Acknowledgments Introduction xix xxi xxiii xxv Chapter 1: Motivation for Big

More information

International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February-2014 10 ISSN 2278-7763

International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February-2014 10 ISSN 2278-7763 International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February-2014 10 A Discussion on Testing Hadoop Applications Sevuga Perumal Chidambaram ABSTRACT The purpose of analysing

More information

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook Hadoop Ecosystem Overview CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook Agenda Introduce Hadoop projects to prepare you for your group work Intimate detail will be provided in future

More information

Deploying Cloudera CDH (Cloudera Distribution Including Apache Hadoop) with Emulex OneConnect OCe14000 Network Adapters

Deploying Cloudera CDH (Cloudera Distribution Including Apache Hadoop) with Emulex OneConnect OCe14000 Network Adapters Deploying Cloudera CDH (Cloudera Distribution Including Apache Hadoop) with Emulex OneConnect OCe14000 Network Adapters Table of Contents Introduction... Hardware requirements... Recommended Hadoop cluster

More information

RDMA for Apache Hadoop 0.9.9 User Guide

RDMA for Apache Hadoop 0.9.9 User Guide 0.9.9 User Guide HIGH-PERFORMANCE BIG DATA TEAM http://hibd.cse.ohio-state.edu NETWORK-BASED COMPUTING LABORATORY DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING THE OHIO STATE UNIVERSITY Copyright (c)

More information

Apache Hadoop. Alexandru Costan

Apache Hadoop. Alexandru Costan 1 Apache Hadoop Alexandru Costan Big Data Landscape No one-size-fits-all solution: SQL, NoSQL, MapReduce, No standard, except Hadoop 2 Outline What is Hadoop? Who uses it? Architecture HDFS MapReduce Open

More information

Integration Of Virtualization With Hadoop Tools

Integration Of Virtualization With Hadoop Tools Integration Of Virtualization With Hadoop Tools Aparna Raj K aparnaraj.k@iiitb.org Kamaldeep Kaur Kamaldeep.Kaur@iiitb.org Uddipan Dutta Uddipan.Dutta@iiitb.org V Venkat Sandeep Sandeep.VV@iiitb.org Technical

More information

Perforce Helix Threat Detection On-Premise Deployment Guide

Perforce Helix Threat Detection On-Premise Deployment Guide Perforce Helix Threat Detection On-Premise Deployment Guide Version 3 On-Premise Installation and Deployment 1. Prerequisites and Terminology Each server dedicated to the analytics server needs to be identified

More information

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org June 3 rd, 2008

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org June 3 rd, 2008 Hadoop Distributed File System Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org June 3 rd, 2008 Who Am I? Hadoop Developer Core contributor since Hadoop s infancy Focussed

More information

Programming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview

Programming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview Programming Hadoop 5-day, instructor-led BD-106 MapReduce Overview The Client Server Processing Pattern Distributed Computing Challenges MapReduce Defined Google's MapReduce The Map Phase of MapReduce

More information

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org dhruba@facebook.com

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org dhruba@facebook.com Hadoop Distributed File System Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org dhruba@facebook.com Hadoop, Why? Need to process huge datasets on large clusters of computers

More information

Spectrum Scale HDFS Transparency Guide

Spectrum Scale HDFS Transparency Guide Spectrum Scale Guide Spectrum Scale BDA 2016-1-5 Contents 1. Overview... 3 2. Supported Spectrum Scale storage mode... 4 2.1. Local Storage mode... 4 2.2. Shared Storage Mode... 4 3. Hadoop cluster planning...

More information

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE AGENDA Introduction to Big Data Introduction to Hadoop HDFS file system Map/Reduce framework Hadoop utilities Summary BIG DATA FACTS In what timeframe

More information

HADOOP MOCK TEST HADOOP MOCK TEST I

HADOOP MOCK TEST HADOOP MOCK TEST I http://www.tutorialspoint.com HADOOP MOCK TEST Copyright tutorialspoint.com This section presents you various set of Mock Tests related to Hadoop Framework. You can download these sample mock tests at

More information

Web Crawling and Data Mining with Apache Nutch Dr. Zakir Laliwala Abdulbasit Shaikh

Web Crawling and Data Mining with Apache Nutch Dr. Zakir Laliwala Abdulbasit Shaikh Web Crawling and Data Mining with Apache Nutch Dr. Zakir Laliwala Abdulbasit Shaikh Chapter No. 3 "Integration of Apache Nutch with Apache Hadoop and Eclipse" In this package, you will find: A Biography

More information

Big Data : Experiments with Apache Hadoop and JBoss Community projects

Big Data : Experiments with Apache Hadoop and JBoss Community projects Big Data : Experiments with Apache Hadoop and JBoss Community projects About the speaker Anil Saldhana is Lead Security Architect at JBoss. Founder of PicketBox and PicketLink. Interested in using Big

More information

Getting to know Apache Hadoop

Getting to know Apache Hadoop Getting to know Apache Hadoop Oana Denisa Balalau Télécom ParisTech October 13, 2015 1 / 32 Table of Contents 1 Apache Hadoop 2 The Hadoop Distributed File System(HDFS) 3 Application management in the

More information

Обработка больших данных: Map Reduce (Python) + Hadoop (Streaming) Максим Щербаков ВолгГТУ 8/10/2014

Обработка больших данных: Map Reduce (Python) + Hadoop (Streaming) Максим Щербаков ВолгГТУ 8/10/2014 Обработка больших данных: Map Reduce (Python) + Hadoop (Streaming) Максим Щербаков ВолгГТУ 8/10/2014 1 Содержание Бигдайта: распределенные вычисления и тренды MapReduce: концепция и примеры реализации

More information

Important Notice. (c) 2010-2016 Cloudera, Inc. All rights reserved.

Important Notice. (c) 2010-2016 Cloudera, Inc. All rights reserved. Cloudera QuickStart Important Notice (c) 2010-2016 Cloudera, Inc. All rights reserved. Cloudera, the Cloudera logo, Cloudera Impala, and any other product or service names or slogans contained in this

More information

Accelerating and Simplifying Apache

Accelerating and Simplifying Apache Accelerating and Simplifying Apache Hadoop with Panasas ActiveStor White paper NOvember 2012 1.888.PANASAS www.panasas.com Executive Overview The technology requirements for big data vary significantly

More information

Jeffrey D. Ullman slides. MapReduce for data intensive computing

Jeffrey D. Ullman slides. MapReduce for data intensive computing Jeffrey D. Ullman slides MapReduce for data intensive computing Single-node architecture CPU Machine Learning, Statistics Memory Classical Data Mining Disk Commodity Clusters Web data sets can be very

More information