Hadoop Introduction. Olivier Renault, Solution Engineer - Hortonworks




Hortonworks

A Brief History of Apache Hadoop
Timeline (2004-2013): Apache project established; Yahoo! begins to operate at scale; Hortonworks Data Platform
- 2005: Yahoo! creates a team under E14 to work on Hadoop - focus on INNOVATION
- 2008: the Yahoo! team extends its focus to operations to support multiple projects & growing clusters - focus on OPERATIONS
- 2011: Hortonworks created to focus on Enterprise Hadoop, starting with 24 key Hadoop engineers from Yahoo! - focus on STABILITY

Hortonworks Snapshot
- Headquarters: Palo Alto, CA
- Employees: 180+ and growing
- Investors: Benchmark, Index, Yahoo!
We develop, distribute and support the ONLY 100% open source Enterprise Hadoop distribution.
- Develop: we employ the core architects, builders and operators of Apache Hadoop and drive innovation within Apache Software Foundation projects
- Distribute: we distribute the only 100% open source Enterprise Hadoop distribution, the Hortonworks Data Platform, which we engineer, test & certify for enterprise usage; endorsed by strategic partners
- Support: we are uniquely positioned to deliver the highest quality of Hadoop support and enable the ecosystem to work better with Hadoop

Leadership that Starts at the Core
- Driving next generation Hadoop: YARN, MapReduce2, HDFS2, High Availability, Disaster Recovery
- 420k+ lines authored since 2006, more than twice the nearest contributor
- Deeply integrating with the ecosystem: enabling new deployment platforms (e.g. Windows & Azure, Linux & VMware HA) and creating deeply engineered solutions (e.g. the Teradata big data appliance)
- All Apache, NO holdbacks: 100% of code contributed to Apache

HDP: Enterprise Hadoop Distribution
Hortonworks Data Platform (HDP) - Enterprise Hadoop
- Operational Services: Ambari, Oozie - manage & operate at scale
- Data Services: Pig, Hive, HCatalog, HBase, WebHDFS, Flume, Sqoop - store, process and access data
- Hadoop Core: HDFS (distributed storage), MapReduce (distributed processing), YARN (in 2.0)
- Platform Services: enterprise readiness - HA, DR, snapshots, security
- Runs on: OS, Cloud, VM, Appliance
The ONLY 100% open source and complete distribution; enterprise grade, proven and tested at scale; ecosystem endorsed to ensure interoperability

Overview of Hadoop

In the Beginning
It all started when Google needed a way to:
- Do page ranking
- Determine which web sites to return for searches

Page Rank Solution - Simplified
Google engineers developed an internal solution and published a paper on it titled "MapReduce: Simplified Data Processing on Large Clusters". It described a process something like this:
1. Many map tasks look at links in parts of the data (e.g. links to sites A, C, F; links to sites B, D, E)
2. Mapped results are shuffled to reducers
3. Reducers compute the links into a result

Words to Websites - Simplified
From words, provide locations: this determines what to display for a search (page rank determines the order). For example, to find URLs with books on them:
Map input <url, keywords>:
- www.barnesandnoble.com: books calendars
- www.yahoo.com: sports finance email celebrity
- www.amazon.com: shoes books jeans
- www.google.com: finance email search
- www.microsoft.com: operating-system productivity system
Reduce output <keyword, urls>:
- books: www.barnesandnoble.com www.amazon.com
- email: www.google.com www.yahoo.com www.facebook.com
- finance: www.yahoo.com www.google.com
- groceries: www.walmart.com www.target.com
- jeans: www.target.com www.amazon.com

Data Model
MapReduce works on <key, value> pairs:
- Input (key, value): (www.barnesandnoble.com, "books calendars")
- Map produces intermediate (key, value): (books, www.barnesandnoble.com)
- Reduce produces output (key, value): (books, "www.barnesandnoble.com www.amazon.com")
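
As an illustration of this data model, here is a minimal sketch (not from the original deck) of what the map and reduce functions for the keyword-to-URL example might look like with the Hadoop Java API; the class names and the whitespace-splitting of keywords are assumptions.

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: (url, "kw1 kw2 ...") -> (kw1, url), (kw2, url), ...
public class KeywordIndexMapper extends Mapper<Text, Text, Text, Text> {
    private final Text keyword = new Text();

    @Override
    protected void map(Text url, Text keywords, Context context)
            throws IOException, InterruptedException {
        for (String kw : keywords.toString().split("\\s+")) {
            keyword.set(kw);
            context.write(keyword, url);   // intermediate pair: (keyword, url)
        }
    }
}

// Reduce: (keyword, [url1, url2, ...]) -> (keyword, "url1 url2 ...")
class KeywordIndexReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text keyword, Iterable<Text> urls, Context context)
            throws IOException, InterruptedException {
        StringBuilder sb = new StringBuilder();
        for (Text url : urls) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(url.toString());
        }
        context.write(keyword, new Text(sb.toString()));  // output pair: (keyword, url list)
    }
}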

Hadoop Basic Core Architecture (diagram)
MapReduce (Mapper -> Shuffle -> Reducer) running on top of the Hadoop Distributed File System (HDFS).

HDFS & MapReduce

Cluster Topology: HDFS & MapReduce (diagram)

Cluster Topology: Master Services and Slave Services (diagram)

Cluster Topology: HDFS & MapReduce (diagram)

HDP: Enterprise Hadoop Distribution

HDFS
A distributed file system designed to run on commodity hardware.
Key assumptions:
- Hardware failure is the norm
- Need streaming access to data sets; optimized for high throughput
- Data sets are large
- Append-only file system: write once, read many times
- Moving computation is cheaper than moving data

HDFS: Key Services
NameNode
- Master service; a single service across the cluster
- Manages the file system namespace and regulates access to files by clients
DataNode
- Slave service; runs on the slave nodes
- Manages block reads and writes for HDFS
- Sends heartbeats to the NameNode for instructions
- If heartbeats fail, the DataNode is removed from the cluster and the replicated blocks on other nodes take over
Secondary NameNode
- Merges the NameNode's file system image and edit logs
- NOT a failover NameNode!

HDFS: File Create Lifecycle (diagram)
The HDFS client asks the NameNode to create the file, writes each block to a pipeline of DataNodes spread across racks (RACK1, RACK2, RACK3), receives an ack for each block, and finally tells the NameNode the file is complete.
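
To make the client side of this lifecycle concrete, here is a hypothetical sketch (not from the deck) using Hadoop's Java FileSystem API; the file path is a placeholder and the NameNode address is assumed to come from a core-site.xml on the classpath.

import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCreateExample {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS (the NameNode address) from core-site.xml if present
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/tmp/hello.txt");          // hypothetical path
        try (FSDataOutputStream out = fs.create(file)) { // NameNode allocates blocks,
            out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8)); // data flows to DataNodes
        }                                                 // close() marks the file complete
        System.out.println("Wrote " + fs.getFileStatus(file).getLen() + " bytes");
    }
}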

Cluster Topology: HDFS & MapReduce (diagram)

HDP: Enterprise Hadoop Distribution

MapReduce
- A software framework for developing distributed applications that process vast amounts of data in parallel on large clusters
- A MapReduce job splits the input data set into independent chunks, which are processed by the map tasks in a completely parallel manner across a cluster of nodes

MapReduce: Key Services
JobTracker
- Master service
- Schedules a job's component tasks on the TaskTrackers
- Monitors task progress and reschedules failed tasks
TaskTracker
- Spawns the job's tasks and reports progress to the JobTracker
- Runs on the slave nodes, collocated with the DataNode service
Task (Map / Reduce)
- Spawned by the TaskTracker
- Executes a map or reduce task, encapsulating the business logic

MapReduce: Job Lifecycle (diagram)
Map phase: mapper tasks run on the DataNodes. Shuffle/sort: data is shuffled across the network and sorted. Reduce phase: reducer tasks run on the DataNodes and write the result.

The Key/Value Pairs of MapReduce
<K1, V1> -> Mapper -> <K2, V2> -> Shuffle/Sort -> <K2, (V2, V2, V2, V2)> -> Reducer -> <K3, V3>
- Map and Reduce operate on (key, value) pairs and output (key, value) pairs
- The user provides the map and reduce functions
- The input key and value types are determined by the InputFormat
- Common InputFormats: TextInputFormat, KeyValueTextInputFormat, SequenceFileInputFormat
- Common OutputFormats: TextOutputFormat, SequenceFileOutputFormat
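
Tying these pieces together, a hypothetical driver (not in the slides) might wire up the keyword-index mapper and reducer sketched earlier and select the InputFormat/OutputFormat classes named above; the input and output paths are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class KeywordIndexDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "keyword index");
        job.setJarByClass(KeywordIndexDriver.class);

        // Input lines of the form "url<TAB>keywords" become (url, keywords) pairs
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        // Mapper and reducer from the earlier sketch
        job.setMapperClass(KeywordIndexMapper.class);
        job.setReducerClass(KeywordIndexReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path("/data/sites"));        // placeholder paths
        FileOutputFormat.setOutputPath(job, new Path("/data/keyword-index"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}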

Pig & Hive

Pig and Hive in the HDP stack (diagram)

HDP: Enterprise Hadoop Distribution

Pig
- An engine for executing programs on top of Hadoop
- Provides a language, Pig Latin, to specify these programs

Why Use Pig?
Suppose you have user data in one file, website data in another, and you need to find the top 5 most visited sites by users aged 18-25.

In Map-Reduce: 170 lines of code, 4 hours to write.

In Pig Latin:
Users = load 'input/users' using PigStorage(',') as (name:chararray, age:int);
Fltrd = filter Users by age >= 18 and age <= 25;
Pages = load 'input/pages' using PigStorage(',') as (user:chararray, url:chararray);
Jnd = join Fltrd by name, Pages by user;
Grpd = group Jnd by url;
Smmd = foreach Grpd generate group, COUNT(Jnd) as clicks;
Srtd = order Smmd by clicks desc;
Top5 = limit Srtd 5;
store Top5 into 'output/top5sites' using PigStorage(',');
9 lines of code, 15 minutes to write.

Essence of Pig
- MapReduce is too low-level, SQL too high-level; Pig Latin is a language intended to sit between the two
- Provides standard relational transforms (join, sort, etc.)
- Schemas are optional: used when available, and can be defined at runtime
- User Defined Functions are first class citizens

Pig Elements
Pig Latin
- High-level scripting language
- Requires no metadata or schema
- Statements are translated into a series of MapReduce jobs
Grunt
- Interactive shell
Piggybank
- Shared repository for User Defined Functions (UDFs)

Pig and Hive in the HDP stack (diagram)

HDP: Enterprise Hadoop Distribution

Motivation for Hive
- Hadoop as an enterprise data warehouse
- Ad hoc query support
- Schema information
- A tool for end users
- Used extensively for analytics & business intelligence

HiveQL Features
- HiveQL is similar to other SQL dialects and is based on the SQL-92 specification
- Uses familiar relational database concepts (tables, rows, columns and schemas)
- Treats big data as tables
- Converts SQL queries into MapReduce jobs; the user does not need to know MapReduce
- Also supports plugging custom MapReduce scripts into queries

Performing Queries
SELECT supports:
- WHERE clause
- UNION ALL and DISTINCT
- GROUP BY and HAVING
- JOIN
- ORDER BY
- LIMIT clause (rows returned are chosen at random)
- Regular-expression column specification, e.g. SELECT `(ds|hr)?+.+` FROM sales;
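
As an illustrative sketch (not from the slides), a HiveQL query like the ones above could also be issued from Java over JDBC, assuming HiveServer2 is running and the Hive JDBC driver is on the classpath; the host, credentials and table are hypothetical.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // Assumes HiveServer2 on its usual port 10000; host and database are placeholders
        String url = "jdbc:hive2://hive.example.com:10000/default";
        try (Connection con = DriverManager.getConnection(url, "user", "");
             Statement stmt = con.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "SELECT url, COUNT(*) AS visits FROM page_views "
                 + "GROUP BY url ORDER BY visits DESC LIMIT 5")) {
            while (rs.next()) {
                // Hive compiles the query into MapReduce jobs behind the scenes
                System.out.println(rs.getString("url") + "\t" + rs.getLong("visits"));
            }
        }
    }
}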

Hive vs. Pig
Pig and Hive work well together and many businesses use both.
Hive is a good choice:
- when you want to query the data
- when you need an answer to specific questions
- if you are familiar with SQL
Pig is a good choice:
- for ETL (Extract -> Transform -> Load)
- for preparing data for easier analysis
- when you have a long series of steps to perform

Ambari: Streamlining Hadoop Operations

Ambari: Install, Manage, Monitor, Tune
- Simplify deployment and maintenance: wizard-based install, handles dependency checks, recommends service mappings
- Ensure a healthy cluster: monitoring, alerts, heat maps
- Optimize performance: root cause analysis for cluster tuning - fix problems BEFORE SLAs are breached
- Integrate with your operations: open APIs, standard web tech
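
As an illustrative sketch (not from the deck) of those open APIs, the snippet below queries Ambari's REST interface for the clusters it manages; the host, port and admin credentials are assumptions and will differ per installation.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class AmbariApiExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical Ambari server address and credentials
        URL url = new URL("http://ambari.example.com:8080/api/v1/clusters");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        String auth = Base64.getEncoder()
                .encodeToString("admin:admin".getBytes(StandardCharsets.UTF_8));
        conn.setRequestProperty("Authorization", "Basic " + auth);

        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);  // JSON listing of clusters managed by Ambari
            }
        }
    }
}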

Community-Driven, Enterprise Class
- Productizes a combined 100+ person-years of operational Hadoop experience
- Stability and scale: the ops & dev team that took Yahoo! from 1,000 to 45,000+ nodes
- Fast-paced, open source, community-driven innovation and integration
- Contributions from Red Hat, Teradata, HP, Microsoft (and more)

Demonstration