Deploying Hadoop with SUSE Manager

Deploying Hadoop with SUSE Manager. Big Data Made Easier. Peter Linnell / Sales Engineer, plinnell@suse.com; Alejandro Bonilla / Sales Engineer, abonilla@suse.com

Hadoop Core Components

Typical Hadoop Distribution

How Hadoop Works at Its Core
[Diagram: a client issues metadata operations (file name, replica count, e.g. /home/foo/data, 3, ...) to the NameNode, and block read/write operations directly to DataNodes in Rack 1 and Rack 2; blocks are replicated between DataNodes across racks.]

NameNode
The NameNode (NN) stores all metadata: information about file locations in HDFS, information about file ownership and permissions, the names of the individual blocks, and the locations of the blocks. Metadata is stored on disk and read when the NameNode daemon starts.

NameNode (continued)
The on-disk metadata file is named fsimage; block locations are not stored in fsimage. Changes to the metadata are made in RAM and are also written to a log file on disk called edits. Each Hadoop cluster has a single NameNode, and the Secondary NameNode is not a fail-over NameNode, so the NameNode is a single point of failure (SPOF).
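
To make the NameNode's metadata role concrete, here is a small, hedged sketch using the HDFS Java client to ask the NameNode which DataNodes hold the blocks of a file. The file path reuses the example from the diagram slide and is otherwise hypothetical; the cluster configuration is assumed to be on the classpath.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlockLocations {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical file; matches the example path used on the diagram slide
        Path file = new Path("/home/foo/data");
        FileStatus status = fs.getFileStatus(file);

        // The block-to-DataNode mapping is answered from the NameNode's in-memory metadata
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()));
        }
    }
}
```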

Secondary NameNode (master)
The Secondary NameNode (2NN) is not a fail-over NameNode! It performs memory-intensive administrative functions for the NameNode: it periodically combines a prior file system snapshot and edit log into a new snapshot, and the new snapshot is transmitted back to the NameNode. In a large installation the Secondary NameNode should run on a separate machine, and it requires as much RAM as the NameNode.

DataNode
DataNode (slave)
JobTracker (master) / exactly one per cluster
TaskTracker (slave) / one or more per cluster

Running Jobs
A client submits a job to the JobTracker, which assigns a job ID. The client calculates the input splits for the job and adds the job code and configuration to HDFS. The JobTracker creates a Map task for each input split. TaskTrackers send periodic heartbeats to the JobTracker; these heartbeats also signal readiness to run tasks, and the JobTracker then assigns tasks to those TaskTrackers.

Running Jobs (continued)
The TaskTracker then forks a new JVM to run the task; this isolates the TaskTracker from bugs or faulty code. A single instance of task execution is called a task attempt. Status information is periodically sent back to the JobTracker. Each block is stored on multiple different nodes for redundancy; the default is three replicas.
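
As an illustration of the job-submission flow described on these two slides, below is a minimal word-count job in Java, assuming a Hadoop 2.x-era client library (on MRv1 clusters the driver talks to the JobTracker, on YARN to the ResourceManager). The input and output paths are supplied on the command line and are otherwise arbitrary.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // One map task is created per input split; each call handles one line of input
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // The reducer sums the counts emitted for each word
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    // Driver: this is the "client" from the slides; it packages the job and submits it
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```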

Anatomy of a File Write
1. The client connects to the NameNode.
2. The NameNode places an entry for the file in its metadata and returns the block name and a list of DataNodes to the client.
3. The client connects to the first DataNode and starts sending data.
4. As data is received by the first DataNode, it connects to the second DataNode and starts sending data.
5. The second DataNode similarly connects to the third.
6. Ack packets from the pipeline are sent back to the client.
7. The client reports to the NameNode when the block is written.
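
From the client's point of view the whole pipeline above is hidden behind the FileSystem API; a minimal, hypothetical write looks like this (path and contents are made up):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical target path; create() contacts the NameNode, which records
        // the new file in its metadata and hands back the DataNode pipeline
        Path target = new Path("/user/demo/example.txt");
        try (FSDataOutputStream out = fs.create(target, true /* overwrite */)) {
            // Data is streamed to the first DataNode, which forwards it down the pipeline
            out.writeBytes("hello, hdfs\n");
        }
        // Once the stream is closed, the completed block(s) are confirmed to the NameNode
        System.out.println("Wrote " + fs.getFileStatus(target).getLen() + " bytes");
    }
}
```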

Hadoop Core Operations Review
[Diagram: the same HDFS architecture picture as before, with the client's metadata operations going to the NameNode and block reads/writes going to the DataNodes across racks, with replication between racks.]

Expanding on Core Hadoop

Hive, HBase and Sqoop
Hive: a high-level abstraction on top of MapReduce that allows users to query data using HiveQL, a language very similar to standard SQL.
HBase: a distributed, sparse, column-oriented data store.
Sqoop: the Hadoop ingestion engine, and the basis of connectors for Teradata, Informatica, DB2 and many others.
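
To give a feel for HiveQL's SQL-like syntax, the sketch below runs a query through the HiveServer2 JDBC driver from Java. The host name, credentials, and the web_logs table are assumptions, and the hive-jdbc jar must be on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQlExample {
    public static void main(String[] args) throws Exception {
        // Register the HiveServer2 JDBC driver (auto-loaded on newer JVMs, explicit here)
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // HiveServer2 URL; hive.example.com and the default port 10000 are assumptions
        String url = "jdbc:hive2://hive.example.com:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement()) {
            // Plain HiveQL: looks like SQL but is compiled into MapReduce jobs
            ResultSet rs = stmt.executeQuery(
                "SELECT page, COUNT(*) AS hits FROM web_logs GROUP BY page ORDER BY hits DESC LIMIT 10");
            while (rs.next()) {
                System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
            }
        }
    }
}
```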

Oozie
A workflow scheduler system to manage Apache Hadoop jobs. Workflow jobs are Directed Acyclic Graphs (DAGs) of actions; coordinator jobs are recurrent workflow jobs triggered by time (frequency) and data availability. Oozie is integrated with the rest of the Hadoop stack and supports several types of Hadoop jobs out of the box (such as Java MapReduce, streaming MapReduce, Pig, Hive, Sqoop and DistCp), as well as system-specific jobs (such as Java programs and shell scripts).
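
Workflows are defined in an XML file stored in HDFS and are usually submitted with the oozie command-line tool; they can also be submitted programmatically. The sketch below uses the Oozie Java client with hypothetical host names and HDFS paths.

```java
import java.util.Properties;

import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

public class SubmitWorkflow {
    public static void main(String[] args) throws Exception {
        // URL of the Oozie server (hypothetical host, default port 11000)
        OozieClient oozie = new OozieClient("http://oozie.example.com:11000/oozie");

        // Job properties; the workflow definition (workflow.xml) already lives in HDFS
        Properties conf = oozie.createConfiguration();
        conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode.example.com:8020/user/demo/my-wf");
        conf.setProperty("nameNode", "hdfs://namenode.example.com:8020");
        conf.setProperty("jobTracker", "jobtracker.example.com:8021");

        // Submit and start the workflow, then poll its status
        String jobId = oozie.run(conf);
        System.out.println("Submitted workflow " + jobId);

        WorkflowJob job = oozie.getJobInfo(jobId);
        System.out.println("Status: " + job.getStatus());
    }
}
```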

Flume
A distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows, is robust and fault tolerant with tunable reliability mechanisms and many fail-over and recovery mechanisms, and uses a simple, extensible data model that allows for online analytic applications.

Mahout
The goal of the Apache Mahout project is to build scalable machine learning libraries. Mahout currently supports three main use cases: recommendation mining, which takes users' behavior and from that tries to find items users might like; clustering, which for example takes text documents and groups them into sets of topically related documents; and classification, which learns from existing categorized documents what documents of a specific category look like and can assign unlabeled documents to the (hopefully) correct category.
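
As a flavor of the recommendation-mining use case, here is a hedged sketch using the collaborative-filtering ("Taste") API that shipped with Mahout at the time of this deck; ratings.csv is a hypothetical "userID,itemID,preference" file.

```java
import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class RecommenderExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical ratings file: one "userID,itemID,preference" triple per line
        DataModel model = new FileDataModel(new File("ratings.csv"));

        // Recommendation mining: find users with similar behavior and suggest what they liked
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Top 3 recommendations for user 1
        List<RecommendedItem> items = recommender.recommend(1L, 3);
        for (RecommendedItem item : items) {
            System.out.println(item.getItemID() + " -> " + item.getValue());
        }
    }
}
```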

Whirr
A set of libraries for launching Hadoop instances on clouds. It offers a cloud-neutral way to run services, so you don't have to worry about the idiosyncrasies of each provider; a common service API, while the details of provisioning remain particular to each service; and smart defaults for services, so you can get a properly configured system running quickly while still being able to override settings as needed.

Giraph
An iterative graph processing system built for high scalability, currently used at Facebook to analyze the social graph formed by users and their connections.

Apache Pig
A platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs. The language layer currently consists of a textual language called Pig Latin, which has the following key properties: complex tasks comprised of multiple interrelated data transformations are explicitly encoded as data flow sequences, making them easy to write, understand, and maintain; and extensibility, so users can create their own functions to do special-purpose processing.
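
A short taste of Pig Latin, embedded here through the PigServer Java API to keep the examples in one language; the input path and field layout are made up:

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigExample {
    public static void main(String[] args) throws Exception {
        // Run against the cluster; use ExecType.LOCAL to test on a single machine
        PigServer pig = new PigServer(ExecType.MAPREDUCE);

        // Each statement is one step in the data-flow sequence described above
        pig.registerQuery("logs = LOAD '/user/demo/web_logs' USING PigStorage('\\t') "
                + "AS (user:chararray, page:chararray, bytes:long);");
        pig.registerQuery("by_page = GROUP logs BY page;");
        pig.registerQuery("traffic = FOREACH by_page GENERATE group AS page, SUM(logs.bytes) AS total;");

        // Materialize the result back into HDFS
        pig.store("traffic", "/user/demo/traffic_by_page");
    }
}
```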

Ambari
The project's goal is to develop software that simplifies Hadoop cluster management: provisioning, managing and monitoring a Hadoop cluster. Ambari leverages well-known technology like Ganglia and Nagios under the covers, and provides an intuitive, easy-to-use Hadoop management web UI backed by its RESTful APIs.
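
For example, the cluster inventory can be read straight from Ambari's REST API. The sketch below issues a plain HTTP GET from Java; the host, port 8080, and the default admin/admin credentials are assumptions about a stock installation.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class AmbariClustersExample {
    public static void main(String[] args) throws Exception {
        // Ambari server, port and credentials are assumptions about a default setup
        URL url = new URL("http://ambari.example.com:8080/api/v1/clusters");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        String auth = Base64.getEncoder()
                .encodeToString("admin:admin".getBytes(StandardCharsets.UTF_8));
        conn.setRequestProperty("Authorization", "Basic " + auth);
        conn.setRequestProperty("X-Requested-By", "ambari");

        // Print the JSON listing of registered clusters
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```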

HUE (Hadoop User Experience)
A graphical front end to Hadoop tools for launching, editing and monitoring jobs. It provides shortcuts to various command-line shells for working directly with components, and can be integrated with authentication services like Kerberos or Active Directory. More on this later.

ZooKeeper
An orchestration stack: a centralized service for maintaining configuration information, naming, providing distributed synchronization, and delivering group services.
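
The sketch below shows the kind of primitive ZooKeeper exposes: a small znode holding a piece of shared configuration that every node in the cluster can read. Host names, the znode path, and the stored value are hypothetical.

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkConfigExample {
    public static void main(String[] args) throws Exception {
        // Connect to the ensemble (hypothetical hosts); 30s session timeout, no-op watcher
        ZooKeeper zk = new ZooKeeper("zk1.example.com:2181,zk2.example.com:2181", 30000, event -> { });

        // Publish a configuration value as a persistent znode (top-level path, no parent needed)
        if (zk.exists("/demo-config", false) == null) {
            zk.create("/demo-config", "batch.size=128".getBytes(),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // Any node in the cluster can now read the same value
        byte[] value = zk.getData("/demo-config", false, null);
        System.out.println(new String(value));

        zk.close();
    }
}
```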

Bigtop
A packaging, QA testing and integration stack for Apache Hadoop components. It is made up of engineers from all the major Hadoop distributions (Cloudera, Hortonworks, Intel and WANdisco), along with SUSE and independent contributors. Bigtop is almost unique among Apache projects in that integrating other projects is its goal, and all major Hadoop distributions base their products on Bigtop.

Typical Hadoop Distribution Review

Web UI Ports for Users
Daemon                  Default Port   Configuration parameter
NameNode                50070          dfs.http.address
DataNode                50075          dfs.datanode.http.address
Secondary NameNode      50090          dfs.secondary.http.address
Backup/Checkpoint Node  50105          dfs.backup.http.address
JobTracker              50030          mapred.job.tracker.http.address
TaskTracker             50060          mapred.task.tracker.http.address
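
These addresses come from the HDFS and MapReduce configuration files, so a quick way to see what a node will actually use is to read the parameters back from the loaded configuration. The sketch below does that for a few of them; the fallback defaults mirror the table above, and the property names are the pre-YARN ones listed there.

```java
import org.apache.hadoop.conf.Configuration;

public class PrintWebUiAddresses {
    public static void main(String[] args) {
        // A plain Configuration only loads core-site.xml by default, so add the HDFS and
        // MapReduce site files explicitly (resolved from the classpath if present)
        Configuration conf = new Configuration();
        conf.addResource("hdfs-site.xml");
        conf.addResource("mapred-site.xml");

        // The second argument is the default reported when the property is not set
        System.out.println("NameNode UI:   " + conf.get("dfs.http.address", "0.0.0.0:50070"));
        System.out.println("DataNode UI:   " + conf.get("dfs.datanode.http.address", "0.0.0.0:50075"));
        System.out.println("JobTracker UI: " + conf.get("mapred.job.tracker.http.address", "0.0.0.0:50030"));
    }
}
```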

Why SUSE Manager?
SUSE Manager complements existing Hadoop cluster management tools, none of which handle the OS stack: auto-installation (provisioning), patch management, configuration management, system and remote management (groups), and monitoring.

SHOWTIME! Thank you.

Corporate Headquarters: Maxfeldstrasse 5, 90409 Nuremberg, Germany. +49 911 740 53 0 (Worldwide). www.suse.com. Join us on: www.opensuse.org

Unpublished Work of SUSE. All Rights Reserved. This work is an unpublished work and contains confidential, proprietary and trade secret information of SUSE. Access to this work is restricted to SUSE employees who have a need to know to perform tasks within the scope of their assignments. No part of this work may be practiced, performed, copied, distributed, revised, modified, translated, abridged, condensed, expanded, collected, or adapted without the prior written consent of SUSE. Any use or exploitation of this work without authorization could subject the perpetrator to criminal and civil liability. General Disclaimer This document is not to be construed as a promise by any participating company to develop, deliver, or market a product. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. SUSE makes no representations or warranties with respect to the contents of this document, and specifically disclaims any express or implied warranties of merchantability or fitness for any particular purpose. The development, release, and timing of features or functionality described for SUSE products remains at the sole discretion of SUSE. Further, SUSE reserves the right to revise this document and to make changes to its content, at any time, without obligation to notify any person or entity of such revisions or changes. All SUSE marks referenced in this presentation are trademarks or registered trademarks of Novell, Inc. in the United States and other countries. All third-party trademarks are the property of their respective owners.