Linux Clusters Institute: Turning an HPC Cluster into a Big Data Cluster
Fang (Cherry) Liu, PhD, fang.liu@oit.gatech.edu
A Partnership for an Advanced Computing Environment (PACE), OIT/ART, Georgia Tech
Targets for this session
Target audience: HPC system admins who want to support a Hadoop cluster
Points of interest:
- Big Data is common: challenges and tools
- Hadoop vs. HPC
- Hadoop core characteristics
- Hadoop ecosystem: core parts (HDFS and MapReduce), distributions, related projects
- Hadoop basic operations
- Configuring a Hadoop cluster on an HPC cluster: the PACE Hadoop cluster
- Hadoop advanced operations
Big Data is Common
The entire rendering of Avatar reportedly required over 1 petabyte of storage space, according to BBC's Clickbits, which is the equivalent of 500 hard drives of 2 TB each, or a 32-year-long MP3 file. The computing core, 34 racks, each with four chassis of 32 machines, adds up to some 40,000 processors and 104 terabytes of RAM.
http://thenextweb.com/2010/01/01/avatar-takes-1-petabyte-storage-space-equivalent-32-year-long-mp3/
Facebook's data warehouses grow by over half a petabyte every 24 hours (2012).
http://www.theregister.co.uk/2012/11/09/facebook_open_sources_corona/
An average of one hundred thousand MapReduce jobs were executed on Google's clusters every day, processing a total of more than twenty petabytes of data per day (2009).
http://dl.acm.org/citation.cfm?id=1327492&dl=acm&coll=dl&cfid=508148704&cftoken=44216082
Big Data: Three challenges
- Volume (storage): the size of the data
- Velocity (processing speed): the latency of data processing relative to the growing demand for interactivity
- Variety (structured vs. unstructured data): the diversity of sources, formats, quality, and structures
What are Tools to use?
Hadoop is open-source software for reliable, scalable, distributed computing:
- Created by Doug Cutting and Michael Cafarella (Cutting later continued the work at Yahoo); named after Cutting's son's toy elephant
- Written in Java
- Scales to thousands of machines with linear scalability
- Uses a simple programming model (MapReduce)
- Fault tolerant (HDFS)
Spark is a fast and general engine for large-scale data processing; it claims to run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk. It provides high-level tools including Spark SQL, MLlib for machine learning, GraphX, and Spark Streaming. It runs on Hadoop, Mesos, standalone, or in the cloud, and can access data sources including HDFS, Cassandra, HBase, Hive, and Amazon S3.
Hadoop vs. Conventional HPC
Hadoop:
- A cluster of machines; data is collocated with the compute nodes
- Moves computation to the data
- MapReduce operates at a higher level; the data flow is implicit
- Fault tolerance through data replication; it is easy to rerun failed Map or Reduce tasks
- Restricted to data-processing problems
HPC:
- A cluster of machines accessing a shared filesystem
- Moves data to the computation; network bandwidth is the bottleneck
- Message Passing Interface (MPI) explicitly handles the mechanics of the data flow
- Checkpointing and recovery must be managed explicitly
- Can solve more complex algorithms
Hadoop Core Characteristics
- Distribution: instead of building one big supercomputer, storage and processing are spread across a cluster of smaller machines that communicate and work together.
- Horizontal scalability: it is easy to extend a Hadoop cluster by simply adding new machines. Every new machine increases the total storage and processing power of the cluster.
- Fault tolerance: Hadoop continues to operate even when a few hardware or software components fail to work properly.
- Cost optimization: Hadoop runs on standard hardware; it does not require expensive servers.
- Programming abstraction: Hadoop takes care of the messy details of distributed computing. Using a high-level API, users can focus on implementing the business logic that solves their real-world problems.
- Data locality: don't move large datasets to where the application is running; run the application where the data already is.
Hadoop EcoSystem
[Layer diagram: workflow tools (Azkaban, Hue, Oozie) and analysis tools (Pig, Scalding, Mahout, Hive, Impala) sit on top of MapReduce and HBase, which in turn run on HDFS spread across the local disks of the cluster nodes.]
Hadoop EcoSystem - Core parts
HDFS: the Hadoop Distributed File System (HDFS) is a distributed file system designed to run on a commodity cluster of machines. HDFS is highly fault tolerant and is useful for processing large data sets.
MapReduce: MapReduce is a software framework for processing large data sets, at petabyte scale, on a cluster of commodity hardware. When MapReduce jobs are run, Hadoop splits the input and locates the nodes on the cluster that hold the splits. The actual tasks are then run on or close to the node where the data resides, so that the data is as close to the computation as possible. This avoids transferring huge amounts of data across the network, so the network does not become a bottleneck or get flooded.
Hadoop EcoSystem - Distributions
Apache: the purely open-source distribution of Hadoop, maintained by the community at the Apache Software Foundation.
Cloudera: Cloudera's distribution of Hadoop, built on top of Apache Hadoop. The distribution includes capabilities such as management, security, high availability, and integration with a wide variety of hardware and software solutions. Cloudera is the leading distributor of Hadoop.
Hortonworks: also builds on open-source Apache Hadoop with claims to enterprise readiness. It also claims to be the only distribution available for Windows servers.
MapR: a Hadoop distribution with some unique features, most notably the ability to mount the Hadoop cluster over NFS.
Amazon EMR: Amazon's hosted version of MapReduce, called Elastic MapReduce, is part of Amazon Web Services (AWS). EMR allows a Hadoop cluster to be deployed and MapReduce jobs to be run in the AWS cloud with just a few clicks.
Hadoop EcoSystem - Related Projects
Pig: a high-level language for analyzing large data sets that eases the development of MapReduce jobs. Hundreds of lines of MapReduce code can be replaced with just a few lines of Pig. At Yahoo, more than 60% of Hadoop usage is via Pig.
Hive: a data warehouse framework that supports querying of large data sets stored in Hadoop. Hive provides a high-level SQL-like language called HiveQL.
HBase: a distributed, scalable data store based on Hadoop. HBase is a distributed, versioned, column-oriented database modeled after Google's BigTable.
Mahout: a scalable machine-learning library. Mahout utilizes Hadoop to achieve massive scalability.
YARN: the next generation of MapReduce, a.k.a. MapReduce 2. The MapReduce framework was overhauled using YARN to overcome the scalability bottlenecks of the earlier version when run on a very large cluster (thousands of nodes).
Oozie: a workflow scheduler system that eases the creation and management of sequences of MapReduce jobs.
Flume: a distributed, reliable, and available service for collecting, aggregating, and moving log data to HDFS. It is typically useful in systems where log data needs to be moved to HDFS periodically for processing.
Hadoop Basics - HDFS
- HDFS is designed to store a very large amount of information (terabytes or petabytes). This requires spreading the data across a large number of machines. It also supports much larger file sizes than NFS.
- HDFS should store data reliably: if individual machines in the cluster malfunction, data should still be available.
- HDFS should provide fast, scalable access to this information: it should be possible to serve a larger number of clients by simply adding more machines to the cluster.
- HDFS should integrate well with Hadoop MapReduce, allowing data to be read and computed upon locally when possible.
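To see the spreading and replication in practice, HDFS's built-in file system checker reports which DataNodes hold each block of a file. A minimal check, assuming a file has already been uploaded (the path below is just an example):

hdfs fsck /user/userid/example.txt -files -blocks -locations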
Hadoop Basics - HDFS (cont.)
Hadoop Basics - MapReduce
Mapping and reducing tasks run on nodes where individual records of data are already present.
Example: Word Count
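The word-count example is usually shown with the Java MapReduce API; as a hedged sketch of the same data flow using only shell tools, Hadoop Streaming can run a mapper that emits (word, 1) pairs and a reducer that sums them. The streaming jar path and the input/output paths below are assumptions to adapt to your own install:

cat > mapper.sh <<'EOF'
#!/bin/bash
# map: emit "word<TAB>1" for every whitespace-separated word on stdin
awk '{for (i = 1; i <= NF; i++) printf "%s\t1\n", $i}'
EOF

cat > reducer.sh <<'EOF'
#!/bin/bash
# reduce: input arrives grouped and sorted by key (word); sum the counts per word
awk -F'\t' '{count[$1] += $2} END {for (w in count) printf "%s\t%d\n", w, count[w]}'
EOF

chmod +x mapper.sh reducer.sh

hadoop jar ${HADOOP_PREFIX}/share/hadoop/tools/lib/hadoop-streaming-2.6.0.jar \
    -files mapper.sh,reducer.sh \
    -mapper mapper.sh -reducer reducer.sh \
    -input /user/userid/example.txt -output /user/userid/streaming-wc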
Basic HDFS Commands
Create a directory in HDFS:
  hadoop fs -mkdir <paths>                           (e.g. hadoop fs -mkdir /user/hadoop/dir1)
List files:
  hadoop fs -ls <args>                               (e.g. hadoop fs -ls /user/hadoop/dir1)
Upload data from the local system to HDFS:
  hadoop fs -put <localsrc> ... <HDFS_dest_path>     (e.g. hadoop fs -put ~/foo.txt /user/hadoop/dir1/foo.txt)
Download a file from HDFS:
  hadoop fs -get <hdfs_src> <localdst>               (e.g. hadoop fs -get /user/hadoop/dir1/foo.txt /home/)
Check the space utilization of an HDFS directory:
  hadoop fs -du <URI>                                (e.g. hadoop fs -du /user/hadoop)
Get help:
  hadoop fs -help
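A short end-to-end session combining these commands, assuming a local file named data.csv and an HDFS home directory at /user/$USER (both are examples, not part of the original slides):

hadoop fs -mkdir -p /user/$USER/input
hadoop fs -put data.csv /user/$USER/input/
hadoop fs -ls /user/$USER/input
hadoop fs -du -h /user/$USER/input
hadoop fs -get /user/$USER/input/data.csv /tmp/data.copy.csv
diff data.csv /tmp/data.copy.csv && echo "round trip OK"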
Case Study: PACE Hadoop cluster
- Consists of 5 x 24-core Altus 2704 servers, each with 128 GB of RAM.
- Each node has a 3 TB local disk RAID (6 x 500 GB).
- All nodes are connected via a 40 Gb/s InfiniBand connection.
- One node serves as NameNode + JobTracker + DataNode and is named hadoop-nn1.
- The remaining four nodes serve as DataNodes, namely hadoop-dn1, hadoop-dn2, hadoop-dn3, and hadoop-dn4.
Check the cluster status and running jobs at:
- Overview (NameNode web UI): http://<NameNode fully qualified name>:50070
- JobTracker / ResourceManager: http://<NameNode fully qualified name>:8088
Installing Hadoop on an academic cluster
- Download a release from the official website http://hadoop.apache.org/; the most recent release is 2.7.0 (April 21, 2015).
- Put the binary distribution in an NFS location so that all participating nodes can access it.
- Add the local disk on each node to serve as the HDFS file system.
- Configure the nodes into name nodes and data nodes.
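As a rough sketch of those steps on a shared NFS path (the directory layout and release version here are illustrative, not PACE's actual layout):

cd /usr/local/packages                # NFS-mounted, visible to every node
wget http://archive.apache.org/dist/hadoop/common/hadoop-2.6.0/hadoop-2.6.0.tar.gz
tar -xzf hadoop-2.6.0.tar.gz
export HADOOP_PREFIX=/usr/local/packages/hadoop-2.6.0
# Each node's local disk (e.g. mounted at /dfs/hadoop) will hold the HDFS data;
# the exact directories are named in hdfs-site.xml on the next slide.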
Configuring HDFS
All configuration files are in ${HADOOP_PREFIX}/etc/hadoop (e.g. /usr/local/packages/hadoop/2.6/etc/hadoop).
In core-site.xml:
  Key              Value                        Example
  fs.defaultFS     protocol://servername:port   hdfs://127.0.0.1:9000
  hadoop.tmp.dir   pathname                     /dfs/hadoop/tmp
In hdfs-site.xml:
  Key              Value                        Example
  dfs.name.dir     pathname                     /dfs/hadoop/name
  dfs.data.dir     pathname                     /dfs/hadoop/data
  dfs.replication  number of replications       3
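A minimal sketch of the two files matching the tables above; the NameNode hostname and local paths are examples to replace with your own:

cd ${HADOOP_PREFIX}/etc/hadoop

cat > core-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property><name>fs.defaultFS</name><value>hdfs://hadoop-nn1:9000</value></property>
  <property><name>hadoop.tmp.dir</name><value>/dfs/hadoop/tmp</value></property>
</configuration>
EOF

cat > hdfs-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property><name>dfs.name.dir</name><value>/dfs/hadoop/name</value></property>
  <property><name>dfs.data.dir</name><value>/dfs/hadoop/data</value></property>
  <property><name>dfs.replication</name><value>3</value></property>
</configuration>
EOF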
Configuring YARN
In ${HADOOP_PREFIX}/etc/hadoop/yarn-site.xml:

<property>
  <name>yarn.resourcemanager.resource-tracker.address</name>
  <value><hostname>:8025</value>
</property>
<property>
  <name>yarn.resourcemanager.scheduler.address</name>
  <value><hostname>:8030</value>
</property>
<property>
  <name>yarn.resourcemanager.address</name>
  <value><hostname>:8040</value>
</property>
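Running MapReduce jobs on YARN also normally requires the shuffle auxiliary service on the NodeManagers; this property is not on the slide, so treat it as an assumption to check against your distribution's documentation:

<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>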
Configure Slaves file: ${HADOOP_HOME}/etc/hadoop/slaves
On hadoop-nn1 (NameNode, also a DataNode), the slaves file lists every DataNode:
  hadoop-nn1
  hadoop-dn1
  hadoop-dn2
  hadoop-dn3
  hadoop-dn4
On each DataNode, the slaves file lists only that node itself:
  hadoop-dn1: hadoop-dn1
  hadoop-dn2: hadoop-dn2
  hadoop-dn3: hadoop-dn3
  hadoop-dn4: hadoop-dn4
[Diagram: NameNode hadoop-nn1 plus DataNodes hadoop-dn1 through hadoop-dn4]
Configure environment
Add the following two lines to ${HADOOP_PREFIX}/etc/hadoop/hadoop-env.sh:
  export JAVA_HOME=/usr/local/packages/java/1.7.0
  export HADOOP_LOG_DIR=/nv/ap2/logs/hadoop
Add the following line to ${HADOOP_PREFIX}/etc/hadoop/yarn-env.sh:
  export YARN_LOG_DIR=/nv/ap2/logs/hadoop
Add the following line to ${HADOOP_PREFIX}/etc/hadoop/mapred-env.sh:
  export HADOOP_MAPRED_LOG_DIR=/nv/ap2/logs/hadoop
Starting HDFS
Create a user named hadoop, and start all Hadoop services as the hadoop user to avoid security issues.
Format the file system based on the above configuration:
  ${HADOOP_HOME}/bin/hdfs namenode -format <cluster_name>
Start HDFS:
  start-dfs.sh
Make sure it is running correctly:
  ps -ef | grep -i hadoop
Start YARN:
  start-yarn.sh
Make sure it is running correctly:
  ps -ef | grep -i resourcemanager
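The JDK's jps tool gives a quicker check than grepping ps; after both start scripts have run, the NameNode (which at PACE is also a DataNode) should show output roughly like the following:

jps
# typical daemon names (PIDs will differ; pure DataNodes show only DataNode and NodeManager):
#   NameNode
#   SecondaryNameNode
#   DataNode
#   ResourceManager
#   NodeManager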
User management on the Hadoop Cluster
Log in as the hadoop user:
  ssh hadoop@hadoop-nn1
Create a directory for the given user:
  hadoop fs -mkdir /user/userid
Set directory ownership to the given user:
  hadoop fs -chown -R userid:groupid /user/userid
Change the permissions to user-only:
  hadoop fs -chmod -R 700 /user/userid
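When several users need access, the same three commands can be scripted; a hedged sketch, with placeholder user names and each user's primary group assumed to match the user name:

for u in alice bob carol; do
    hadoop fs -mkdir -p /user/$u
    hadoop fs -chown -R $u:$u /user/$u
    hadoop fs -chmod -R 700 /user/$u
done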
User runs a job
Users log in to the Hadoop NameNode and submit jobs from there:
  ssh userid@hadoop-nn1
Load the Hadoop environment variables, such as Java, Python, HADOOP_PREFIX, HADOOP_YARN_HOME:
  module load hadoop/2.6.0
Upload an input file from the local file system to HDFS:
  hadoop fs -put example.txt /user/userid
Run wordcount on example.txt:
  hadoop jar /usr/local/packages/hadoop/2.6.0/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar wordcount /user/userid/example.txt /user/userid/testout
Check the result:
  hadoop fs -cat /user/userid/testout/part-* > output
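While the job runs, it can be watched from the command line; these are standard Hadoop/YARN commands, though "yarn logs" only works once log aggregation is enabled, which may not be the case on every cluster:

yarn application -list                      # running applications and their IDs
yarn logs -applicationId <application_id>   # aggregated logs once the job finishes
hadoop fs -ls /user/userid/testout          # _SUCCESS marker and part-* files when done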
Stopping HDFS
Stop the resource manager:
  stop-yarn.sh
Stop the name node and data nodes:
  stop-dfs.sh
Note: this can sometimes require a manual shutdown on individual data nodes.
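If a DataNode survives stop-dfs.sh, it can be stopped directly on that node with the per-daemon script shipped in ${HADOOP_PREFIX}/sbin; a sketch using the PACE node names as an example:

for node in hadoop-dn1 hadoop-dn2 hadoop-dn3 hadoop-dn4; do
    ssh hadoop@$node "${HADOOP_PREFIX}/sbin/hadoop-daemon.sh stop datanode"
done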
Hadoop Cluster Overview
Hadoop DataNodes
Hadoop Job Tracker History
References:
Yahoo Hadoop Tutorial: https://developer.yahoo.com/hadoop/tutorial/index.html
Introduction to Data Science, Bill Howe, University of Washington