
Linux Clusters Institute: Turning an HPC Cluster into a Big Data Cluster
Fang (Cherry) Liu, PhD fang.liu@oit.gatech.edu
A Partnership for an Advanced Computing Environment (PACE), OIT/ART, Georgia Tech

Targets for this session
Target audience: HPC system admins who want to support a Hadoop cluster.
Points of interest:
Big Data is common
Challenges and tools
Hadoop vs. HPC
Hadoop core characteristics
Hadoop ecosystem: core parts (HDFS and MapReduce), distributions, related projects
Hadoop basic operations
Configuring a Hadoop cluster on an HPC cluster
PACE Hadoop cluster
Hadoop advanced operations

Big Data is Common
The entire rendering of Avatar reportedly required over 1 petabyte of storage space according to BBC's Clickbits, the equivalent of 500 hard drives of 2 TB each, or a 32-year-long MP3 file. The computing core, 34 racks, each with four chassis of 32 machines, adds up to some 40,000 processors and 104 terabytes of RAM. http://thenextweb.com/2010/01/01/avatar-takes-1-petabyte-storage-space-equivalent-32-year-long-mp3/
Facebook's data warehouses grow by over half a petabyte every 24 hours (2012). http://www.theregister.co.uk/2012/11/09/facebook_open_sources_corona/
An average of one hundred thousand MapReduce jobs are executed on Google's clusters every day, processing a total of more than twenty petabytes of data per day (2009). http://dl.acm.org/citation.cfm?id=1327492&dl=acm&coll=dl&cfid=508148704&cftoken=44216082

Big Data: Three Challenges
Volume (storage): the size of the data.
Velocity (processing speed): the latency of data processing relative to the growing demand for interactivity.
Variety (structured vs. unstructured data): the diversity of sources, formats, quality, and structures.

What Tools to Use?
Hadoop is open-source software for reliable, scalable, distributed computing:
Created by Doug Cutting and Michael Cafarella (Cutting was then at Yahoo); named after Cutting's son's toy elephant
Written in Java
Scales to thousands of machines with linear scalability
Uses a simple programming model (MapReduce)
Fault tolerant (HDFS)
Spark is a fast and general engine for large-scale data processing; it claims to run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk. It provides high-level tools including Spark SQL, MLlib for machine learning, GraphX, and Spark Streaming. It runs on Hadoop, Mesos, standalone, or in the cloud, and can access data sources including HDFS, Cassandra, HBase, Hive, and Amazon S3.

Hadoop vs. Conventional HPC

Hadoop:
A cluster of machines that collocates the data with the compute nodes; moves computation to the data
MapReduce operates at a higher level; the data flow is implicit
Fault tolerance through data replication; it is easy to rerun failed Map or Reduce tasks
Restricted to data-processing problems

Conventional HPC:
A cluster of machines that accesses a shared filesystem; moves data to the computation, so network bandwidth is the bottleneck
The Message Passing Interface (MPI) explicitly handles the mechanics of the data flow
Needs to explicitly manage checkpointing and recovery
Can solve more complex algorithms

Hadoop Core Characteristics
Distribution: instead of building one big supercomputer, storage and processing are spread across a cluster of smaller machines that communicate and work together.
Horizontal scalability: it is easy to extend a Hadoop cluster by just adding new machines; every new machine increases the total storage and processing power of the cluster.
Fault tolerance: Hadoop continues to operate even when a few hardware or software components fail to work properly.
Cost optimization: Hadoop runs on standard hardware; it does not require expensive servers.
Programming abstraction: Hadoop takes care of all the messy details of distributed computing; using a high-level API, users can focus on implementing the business logic that solves their real-world problems.
Data locality: don't move large datasets to where the application is running; run the application where the data already is.

Hadoop Ecosystem
[Diagram: the ecosystem as a layered stack. Workflow tools (Azkaban, Hue, Oozie) and analysis tools (Pig, Scalding, Mahout, Hive, Impala) sit on top of MapReduce and HBase, which in turn sit on HDFS, spanning the cluster's local disks.]

Hadoop Ecosystem: Core Parts
HDFS: The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on a commodity cluster of machines. HDFS is highly fault tolerant and is useful for processing large data sets.
MapReduce: MapReduce is a software framework for processing large, petabyte-scale data sets on a cluster of commodity hardware. When a MapReduce job runs, Hadoop splits the input and locates the cluster nodes that hold each split. The tasks are then run at or close to the node where the data resides, so that the data is as close to the computation as possible. This avoids transferring huge amounts of data across the network, so the network does not become a bottleneck or get flooded.

Hadoop Ecosystem: Distributions
Apache: the purely open-source distribution of Hadoop, maintained by the community at the Apache Software Foundation.
Cloudera: Cloudera's distribution of Hadoop, built on top of Apache Hadoop. It adds capabilities such as management, security, high availability, and integration with a wide variety of hardware and software solutions. Cloudera is the leading distributor of Hadoop.
Hortonworks: also builds on open-source Apache Hadoop, with claims to enterprise readiness. It also claims to be the only distribution available for Windows servers.
MapR: a Hadoop distribution with some unique features, most notably the ability to mount the Hadoop cluster over NFS.
Amazon EMR: Amazon's hosted version of MapReduce, called Elastic MapReduce, part of Amazon Web Services (AWS). EMR allows a Hadoop cluster to be deployed and MapReduce jobs to be run in the AWS cloud with just a few clicks.

Hadoop Ecosystem: Related Projects
Pig: a high-level language for analyzing large data sets that eases development of MapReduce jobs; hundreds of lines of MapReduce code can be replaced with just a few lines of Pig. At Yahoo, more than 60% of Hadoop usage is through Pig.
Hive: a data warehouse framework that supports querying of large data sets stored in Hadoop. Hive provides a high-level SQL-like language called HiveQL.
HBase: a distributed, scalable data store built on Hadoop. HBase is a distributed, versioned, column-oriented database modeled after Google's BigTable.
Mahout: a scalable machine learning library. Mahout uses Hadoop to achieve massive scalability.
YARN: the next generation of MapReduce, a.k.a. MapReduce 2. The MapReduce framework was overhauled with YARN to overcome the scalability bottlenecks of the earlier version when run on very large clusters (thousands of nodes).
Oozie: a workflow scheduler system that eases the creation and management of sequences of MapReduce jobs.
Flume: a distributed, reliable, and available service for collecting, aggregating, and moving log data to HDFS. This is typically useful in systems where log data needs to be moved to HDFS periodically for processing.

Hadoop Basics: HDFS
HDFS is designed to store very large amounts of information (terabytes or petabytes). This requires spreading the data across a large number of machines. It also supports much larger file sizes than NFS.
HDFS should store data reliably: if individual machines in the cluster malfunction, data should still be available.
HDFS should provide fast, scalable access to this information: it should be possible to serve a larger number of clients by simply adding more machines to the cluster.
HDFS should integrate well with Hadoop MapReduce, allowing data to be read and computed upon locally when possible.

Hadoop Basics: HDFS (cont.)
[Diagram slide; image not included in the transcription.]

Hadoop Basics: MapReduce
Mapping and reducing tasks run on the nodes where the individual records of data are already present.

Example: Word Count
[Diagram slide; the word-count dataflow image did not survive transcription, but a minimal command-line sketch follows.]
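
As a substitute for the missing diagram, here is a minimal word-count sketch using Hadoop Streaming with plain shell tools. The jar path, version number, and HDFS paths are assumptions modeled on the cluster described later in this session; adjust them to your installation.

# The mapper emits one word per line (each word becomes a key); the
# framework sorts by key between map and reduce; the reducer counts
# runs of identical, now-adjacent words.
hadoop jar ${HADOOP_PREFIX}/share/hadoop/tools/lib/hadoop-streaming-2.6.0.jar \
    -input /user/userid/example.txt \
    -output /user/userid/wordcount-out \
    -mapper "tr -s '[:space:]' '\n'" \
    -reducer "uniq -c"

# Inspect the result (uniq -c prints each count before its word)
hadoop fs -cat /user/userid/wordcount-out/part-*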

Basic HDFS Commands
Create a directory in HDFS: hadoop fs -mkdir <paths> (e.g. hadoop fs -mkdir /user/hadoop/dir1)
List files: hadoop fs -ls <args> (e.g. hadoop fs -ls /user/hadoop/dir1)
Upload data from the local system to HDFS: hadoop fs -put <localsrc> ... <HDFS_dest_path> (e.g. hadoop fs -put ~/foo.txt /user/hadoop/dir1/foo.txt)
Download a file from HDFS: hadoop fs -get <hdfs_src> <localdst> (e.g. hadoop fs -get /user/hadoop/dir1/foo.txt /home/)
Check the space utilization of an HDFS directory: hadoop fs -du <URI> (e.g. hadoop fs -du /user/hadoop)
Get help: hadoop fs -help
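
Taken together, a typical session might look like the following sketch (the directory and file names are illustrative):

# Create a working directory, upload a file, inspect it, and pull it back
hadoop fs -mkdir /user/hadoop/dir1
hadoop fs -put ~/foo.txt /user/hadoop/dir1/foo.txt
hadoop fs -ls /user/hadoop/dir1
hadoop fs -du /user/hadoop
hadoop fs -get /user/hadoop/dir1/foo.txt /home/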

Case Study: PACE Hadoop Cluster
Consists of 5 x 24-core Altus 2704 servers, each with 128 GB of RAM.
Each node has a 3 TB local disk RAID (6 x 500 GB).
All nodes are connected via a 40 Gb/sec InfiniBand connection.
One node serves as NameNode + JobTracker + DataNode and is named hadoop-nn1.
The remaining four nodes serve as DataNodes: hadoop-dn1, hadoop-dn2, hadoop-dn3, hadoop-dn4.
Check the cluster status and running jobs at:
Overview: http://<NameNode fully qualified name>:50070
JobTracker: http://<NameNode fully qualified name>:8088

Installing Hadoop on an Academic Cluster
Download a release from the official website http://hadoop.apache.org/; the most recent release is 2.7.0, as of April 21, 2015.
Put the binary distribution in an NFS location so that all participating nodes can access it.
Add the local disk on each node to serve as the HDFS file system.
Configure the nodes into name nodes and data nodes. (The first two steps are sketched below.)
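
A minimal sketch of the download-and-unpack steps (the mirror URL and the NFS destination path are assumptions; substitute your site's values, and note this session's cluster actually runs 2.6.0):

# Fetch a binary release and unpack it onto an NFS path visible to all nodes
cd /usr/local/packages   # assumed NFS-mounted software area
wget http://archive.apache.org/dist/hadoop/common/hadoop-2.6.0/hadoop-2.6.0.tar.gz
tar -xzf hadoop-2.6.0.tar.gz
export HADOOP_PREFIX=/usr/local/packages/hadoop-2.6.0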

Configuring HDFS
All configuration files live in ${HADOOP_PREFIX}/etc/hadoop (e.g. /usr/local/packages/hadoop/2.6/etc/hadoop).

In core-site.xml:

Key              Value                       Example
fs.defaultFS     protocol://servername:port  hdfs://127.0.0.1:9000
hadoop.tmp.dir   pathname                    /dfs/hadoop/tmp

In hdfs-site.xml:

Key              Value                   Example
dfs.name.dir     pathname                /dfs/hadoop/name
dfs.data.dir     pathname                /dfs/hadoop/data
dfs.replication  number of replicas      3
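
As a sketch, the two files could be written as follows, using the example values from the tables above (illustrations, not site-specific recommendations):

cd ${HADOOP_PREFIX}/etc/hadoop

# Default filesystem URI and temporary directory
cat > core-site.xml <<'EOF'
<configuration>
  <property><name>fs.defaultFS</name><value>hdfs://127.0.0.1:9000</value></property>
  <property><name>hadoop.tmp.dir</name><value>/dfs/hadoop/tmp</value></property>
</configuration>
EOF

# NameNode metadata path, DataNode block path, and replication factor
cat > hdfs-site.xml <<'EOF'
<configuration>
  <property><name>dfs.name.dir</name><value>/dfs/hadoop/name</value></property>
  <property><name>dfs.data.dir</name><value>/dfs/hadoop/data</value></property>
  <property><name>dfs.replication</name><value>3</value></property>
</configuration>
EOF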

Configuring YARN
In ${HADOOP_PREFIX}/etc/hadoop/yarn-site.xml (note that the ResourceManager addresses are plain host:port pairs, not hdfs:// URIs):

<property>
  <name>yarn.resourcemanager.resource-tracker.address</name>
  <value><hostname>:8025</value>
</property>
<property>
  <name>yarn.resourcemanager.scheduler.address</name>
  <value><hostname>:8030</value>
</property>
<property>
  <name>yarn.resourcemanager.address</name>
  <value><hostname>:8040</value>
</property>

Configuring the slaves file
${HADOOP_HOME}/etc/hadoop/slaves on each host:
On hadoop-nn1: hadoop-nn1, hadoop-dn1, hadoop-dn2, hadoop-dn3, hadoop-dn4
On hadoop-dn1: hadoop-dn1
On hadoop-dn2: hadoop-dn2
On hadoop-dn3: hadoop-dn3
On hadoop-dn4: hadoop-dn4
[Diagram: hadoop-nn1 is the NameNode; hadoop-dn1 through hadoop-dn4 are the DataNodes.]

Configuring the environment
Add the following two lines to ${HADOOP_PREFIX}/etc/hadoop/hadoop-env.sh:
export JAVA_HOME=/usr/local/packages/java/1.7.0
export HADOOP_LOG_DIR=/nv/ap2/logs/hadoop
Add the following line to ${HADOOP_PREFIX}/etc/hadoop/yarn-env.sh:
export YARN_LOG_DIR=/nv/ap2/logs/hadoop
Add the following line to ${HADOOP_PREFIX}/etc/hadoop/mapred-env.sh:
export HADOOP_MAPRED_LOG_DIR=/nv/ap2/logs/hadoop

Starting HDFS
Create a user named hadoop, and start all Hadoop services as the hadoop user to avoid security issues.
Format the file system based on the above configuration: ${HADOOP_HOME}/bin/hdfs namenode -format <cluster_name>
Start HDFS: start-dfs.sh
Make sure it is running correctly: ps -ef | grep -i hadoop
Start YARN: start-yarn.sh
Make sure it is running correctly: ps -ef | grep -i resourcemanager
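
As an additional sanity check, a short sketch using two standard tools (jps ships with the JDK, dfsadmin with Hadoop):

# List the Java daemons; expect NameNode and DataNode after start-dfs.sh,
# plus ResourceManager and NodeManager after start-yarn.sh
jps

# Summarize HDFS capacity and list the live DataNodes
${HADOOP_HOME}/bin/hdfs dfsadmin -report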

User Management on the Hadoop Cluster
Log in as the hadoop user: ssh hadoop@hadoop-nn1
Create a directory for a given user: hadoop fs -mkdir /user/userid
Set the directory ownership to that user: hadoop fs -chown -R userid:groupid /user/userid
Change the permissions to user-only: hadoop fs -chmod -R 700 /user/userid
(A small provisioning loop is sketched below.)
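
For several users at once, the same three steps can be wrapped in a loop; this is a sketch, and the user names (used here as both user and group) are illustrative:

# Provision HDFS home directories for a list of users (run as the hadoop user)
for userid in alice bob carol; do
  hadoop fs -mkdir /user/${userid}
  hadoop fs -chown -R ${userid}:${userid} /user/${userid}
  hadoop fs -chmod -R 700 /user/${userid}
done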

A User Runs a Job
The user logs in to the Hadoop NameNode and submits the job from there: ssh userid@hadoop-nn1
Load the Hadoop environment variables (Java, Python, HADOOP_PREFIX, HADOOP_YARN_HOME, etc.): module load hadoop/2.6.0
Upload the input file from the local file system: hadoop fs -put example.txt /user/userid/
Run wordcount on example.txt: hadoop jar /usr/local/packages/hadoop/2.6.0/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar wordcount /user/userid/example.txt /user/userid/testout
Check the result: hadoop fs -cat /user/userid/testout/part-* > output

Stopping HDFS
Stop the resource manager: stop-yarn.sh
Stop the name node and data nodes: stop-dfs.sh
Note: sometimes a manual shutdown may be required on individual data nodes.

Hadoop Cluster Overview
[Screenshot slide; image not included in the transcription.]

Hadoop DataNodes
[Screenshot slide; image not included in the transcription.]

Hadoop JobTracker History
[Screenshot slide; image not included in the transcription.]

References
Yahoo Hadoop Tutorial: https://developer.yahoo.com/hadoop/tutorial/index.html
Introduction to Data Science, Bill Howe, University of Washington