Apache Hadoop
A new way to store and analyze big data
Reyna Ulaque, Software Engineer
Agenda
What is Big Data?
What is Hadoop?
Who uses Hadoop?
Hadoop Architecture
Hadoop Distributed File System
Hadoop MapReduce
Hadoop Ecosystem
Running Hadoop
What is Big Data?
Big data is a popular term used to describe the exponential growth and availability of data, both structured and unstructured.
When dealing with large datasets, organizations face difficulties in creating, manipulating, and managing their data. Basically, there are two problems with big data:
How to store and work with large volumes of data.
And, most importantly, how to interpret and analyze this data.
Hadoop appeared on the market as a solution to these problems, providing a way to store and process this data.
What is Hadoop?
Hadoop is open source software that provides a framework, written in Java, for the distributed processing of large data sets across clusters built with commodity hardware.
A cluster can flexibly grow from a few nodes to thousands of nodes.
Hadoop is a distributed system with a master/slave architecture, using the Hadoop Distributed File System (HDFS) for storage and MapReduce algorithms for computation.
Both MapReduce and the Hadoop Distributed File System are designed so that node failures are automatically handled by the framework.
Who uses Hadoop?
Hadoop Architecture
Hadoop Distributed File System
HDFS is a distributed file system designed to run on commodity hardware.
HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware.
HDFS provides high-throughput access to application data and is suitable for applications that have large data sets.
Files follow a write-once-read-many access model.
HDFS has a master/slave architecture: an HDFS cluster consists of a single NameNode and many DataNodes.
Hadoop Distributed File System
NameNode
A master server that manages the file system namespace and regulates access to files by clients.
Executes file system namespace operations such as opening, closing, and renaming files and directories, and determines the mapping of blocks to DataNodes.
Keeps the file system metadata in memory and tracks which file blocks each DataNode stores.
DataNodes
They are responsible for serving read and write requests from the file system's clients.
They also perform block creation, deletion, and replication upon instruction from the NameNode.
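The NameNode's in-memory mapping of blocks to DataNodes can be pictured with a small sketch. This is plain Java for illustration only, not the actual Hadoop code; the class and method names here are invented.

```java
import java.util.*;

// Toy model of the NameNode's in-memory metadata: which DataNodes
// hold a replica of each block. Illustration only; the real NameNode
// tracks far more (namespace, leases, replication state, ...).
public class NameNodeSketch {
    private final Map<String, List<String>> blockLocations = new HashMap<>();

    // A DataNode reports that it holds a replica of the given block.
    public void addReplica(String blockId, String dataNode) {
        blockLocations.computeIfAbsent(blockId, k -> new ArrayList<>()).add(dataNode);
    }

    // A client asks the NameNode which DataNodes it can read a block from.
    public List<String> locate(String blockId) {
        return blockLocations.getOrDefault(blockId, Collections.emptyList());
    }

    public static void main(String[] args) {
        NameNodeSketch nn = new NameNodeSketch();
        // HDFS's default replication factor is 3: each block lives on 3 DataNodes.
        nn.addReplica("blk_0001", "datanode1");
        nn.addReplica("blk_0001", "datanode2");
        nn.addReplica("blk_0001", "datanode3");
        System.out.println(nn.locate("blk_0001"));
    }
}
```

Note that the actual file data never passes through the NameNode; clients use these locations to read from and write to the DataNodes directly.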
HDFS Architecture
Hadoop MapReduce
A software framework for easily writing applications that process vast amounts of data (multi-terabyte datasets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
MapReduce divides workloads into multiple tasks that can be executed in parallel.
Typically, the compute nodes and the storage nodes are the same.
Hadoop MapReduce
The MapReduce framework consists of a single master JobTracker and many TaskTrackers, one per cluster node.
The master is responsible for scheduling a job's component tasks on the slaves, monitoring them, and re-executing failed tasks. The slaves execute the tasks as directed by the master.
Two main phases, Map and Reduce:
Map step: the master node divides a job into smaller tasks and distributes them to other nodes, which process them.
Reduce step: the master node collects all the responses and combines them to generate the output.
Hadoop MapReduce
A MapReduce job is converted into map and reduce tasks.
Developers need ONLY to implement the Map and Reduce classes.
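The map and reduce phases can be illustrated with a plain-Java word count. This is a self-contained simulation, not the Hadoop API; in a real job you would extend Hadoop's Mapper and Reducer classes and let the framework handle distribution.

```java
import java.util.*;

// Simulates the two phases of a WordCount job in plain Java.
// Map: each input line is turned into (word, 1) pairs.
// Reduce: pairs are grouped by word and their counts summed.
public class WordCountSim {

    // Map phase: emit (word, 1) for every word in a line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Reduce phase: sum the values emitted for each key.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            counts.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] lines = {"hadoop stores big data", "hadoop processes big data"};
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String line : lines) {
            // In a cluster, each split of the input runs on a different node.
            emitted.addAll(map(line));
        }
        System.out.println(reduce(emitted));
    }
}
```

In Hadoop itself, the grouping between the two phases (the shuffle and sort) is done by the framework, which is why developers only write the Map and Reduce classes.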
MapReduce Logical Architecture
Hadoop Ecosystem
Running Hadoop
Hadoop Cluster Installation
Supported Platforms
GNU/Linux is supported as a development and production platform.
Win32 is supported as a development platform; distributed operation has not been well tested on Win32.
Required Software
Java 1.6 or later, preferably from Sun, must be installed.
ssh must be installed and sshd must be running in order to use the Hadoop scripts that manage remote Hadoop daemons.
To get a Hadoop distribution, download a recent stable release from one of the Apache Download Mirrors: http://www.gtlib.gatech.edu/pub/apache/hadoop/core/
Hadoop Cluster Installation
Installing a Hadoop cluster typically involves unpacking the software on all the machines in the cluster.
The root of the distribution is referred to as HADOOP_HOME.
Configuration files:
Environment settings: conf/hadoop-env.sh
Master and slave configuration (master only): conf/masters and conf/slaves
Site-specific configuration (all machines): conf/core-site.xml, conf/hdfs-site.xml, and conf/mapred-site.xml
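As a sketch of the site-specific configuration, conf/core-site.xml is where every node is pointed at the NameNode. The hostname "master" and port 54310 below are illustrative values, not defaults you must use.

```xml
<!-- conf/core-site.xml (all machines): tells Hadoop where HDFS lives.
     "master" and 54310 are example values for this sketch. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://master:54310</value>
  </property>
</configuration>
```

conf/hdfs-site.xml (e.g. the replication factor) and conf/mapred-site.xml (e.g. the JobTracker address) follow the same property/name/value layout.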
Hadoop Cluster Installation
Format the HDFS filesystem via the NameNode:
/usr/local/hadoop/bin/hadoop namenode -format
Start your multi-node cluster (on the master server only):
HDFS daemons: bin/start-dfs.sh
MapReduce daemons: bin/start-mapred.sh
The corresponding Java processes should then be running on the master and slaves (they can be listed with the jps command).
Hadoop Cluster Installation
Stop the multi-node cluster (on the master server only):
MapReduce daemons: bin/stop-mapred.sh
HDFS daemons: bin/stop-dfs.sh
If there are any errors, examine the log files in the logs/ directory, e.g. hadoop-hduser-namenode-hadoopsrv2.log, hadoop-hduser-datanode-hadoopsrv2.log, hadoop-hduser-tasktracker-hadoopsrv2.log
Hadoop Web Interfaces
http://localhost:50070/ web UI of the NameNode daemon
http://localhost:50030/ web UI of the JobTracker daemon
http://localhost:50060/ web UI of the TaskTracker daemon
Hadoop Cluster Setup: Common Problems
Problem 1
When starting a Hadoop cluster, a warning message is raised: $HADOOP_HOME is deprecated.
To resolve this, replace the HADOOP_HOME variable with the HADOOP_PREFIX variable in the .bashrc file.
Problem 2
java.io.IOException: Incompatible namespaceIDs
This happens if you have formatted the NameNode twice: the new namespaceID is not replicated to the DataNodes (see /app/hadoop/tmp/dfs/name/current/VERSION).
Running a MapReduce Job
We will use the WordCount example job, which reads text files and counts how often words occur.
Download the example input data.
Copy the local example data to HDFS:
bin/hadoop dfs -copyFromLocal /tmp/input-text/ /user/hduser/input-text
bin/hadoop dfs -ls /user/hduser
Running a MapReduce Job
Run the MapReduce job:
bin/hadoop jar hadoop-examples-1.1.2.jar wordcount /user/hduser/input-text /user/hduser/results-output
bin/hadoop dfs -ls /user/hduser
Retrieve the job result from HDFS:
bin/hadoop dfs -cat /user/hduser/results-output/part-r-00000
mkdir /tmp/results-output
bin/hadoop dfs -getmerge /user/hduser/results-output /tmp/results-output
Running a MapReduce Job
Inspect the merged job result locally:
head /tmp/results-output/results-output
Q&A