Hadoop Introduction Olivier Renault Solution Engineer - Hortonworks
Hortonworks
A Brief History of Apache Hadoop
Timeline (2004 to 2013), milestones shown: Apache project established; Yahoo! begins to operate at scale; Enterprise Hadoop; Hortonworks Data Platform
- 2005: Yahoo! creates a team under E14 to work on Hadoop. Focus on INNOVATION.
- 2008: Yahoo! team extends its focus to operations to support multiple projects and growing clusters. Focus on OPERATIONS.
- 2011: Hortonworks is created to focus on Enterprise Hadoop, starting with 24 key Hadoop engineers from Yahoo!. Focus on STABILITY.
Hortonworks Snapshot
- Headquarters: Palo Alto, CA
- Employees: 180+ and growing
- Investors: Benchmark, Index, Yahoo
We develop, distribute and support the ONLY 100% open source Enterprise Hadoop distribution.
Develop
- We employ the core architects, builders and operators of Apache Hadoop
- We drive innovation within Apache Software Foundation projects
Distribute
- We distribute the only 100% open source Enterprise Hadoop distribution: Hortonworks Data Platform
- We engineer, test & certify HDP for enterprise usage
Support
- We are uniquely positioned to deliver the highest quality of Hadoop support
- We enable the ecosystem to work better with Hadoop
Endorsed by strategic partners
Leadership that Starts at the Core
- Driving next-generation Hadoop: YARN, MapReduce2, HDFS2, High Availability, Disaster Recovery
- 420k+ lines authored since 2006, more than twice the nearest contributor
- Deeply integrating with the ecosystem: enabling new deployment platforms (e.g. Windows & Azure, Linux & VMware HA) and creating deeply engineered solutions (e.g. Teradata big data appliance)
- All Apache, NO holdbacks: 100% of code contributed to Apache
HDP: Enterprise Hadoop Distribution
Hortonworks Data Platform (HDP) stack:
- Operational services (manage & operate at scale): Ambari, Oozie
- Data services (store, process and access data): Flume, Sqoop, Pig, Hive, HCatalog, HBase, WebHDFS
- Hadoop core (distributed storage & processing): HDFS, MapReduce, YARN (in 2.0)
- Platform services (enterprise readiness): HA, DR, Snapshots, Security
- Runs on: OS, Cloud, VM, Appliance
Hortonworks Data Platform (HDP): Enterprise Hadoop
- The ONLY 100% open source and complete distribution
- Enterprise grade, proven and tested at scale
- Ecosystem endorsed to ensure interoperability
Overview of Hadoop
In the Beginning
It all started when Google needed a way to:
- Do page ranking
- Determine which web sites to provide for searches
Page Rank Solution - Simplified
Google engineers developed an internal solution and published a paper on it titled "MapReduce: Simplified Data Processing on Large Clusters". It described a process something like this:
1. Many tasks look at links in parts of the data
2. Mapped results are shuffled to Reducers
3. Reducers compute the links into a result
[Diagram: Map tasks feed Reduce tasks; one result holds links to sites A, C, F and the other links to sites B, D, E]
Words to Websites - Simplified
- From words, provide locations
- Provides what to display for a search
- Note: page rank determines the order
- For example, to find URLs with books on them
Map input <url, keyword>:
  www.barnesandnoble.com: books, calendars
  www.yahoo.com: sports, finance, email, celebrity
  www.amazon.com: shoes, books, jeans
  www.google.com: finance, email, search
  www.microsoft.com: operating-system, productivity, system
Reduce output <keyword, url>:
  books: www.barnesandnoble.com, www.amazon.com
  email: www.google.com, www.yahoo.com, www.facebook.com
  finance: www.yahoo.com, www.google.com
  groceries: www.walmart.com, www.target.com
  jeans: www.target.com, www.amazon.com
Data Model
MapReduce works on <key, value> pairs
- Input (key, value): (www.barnesandnoble.com, books calendars)
- Map emits intermediate (key, value): (books, www.barnesandnoble.com)
- Reduce emits output (key, value): (books, www.barnesandnoble.com www.amazon.com)
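The three stages of the data model can be sketched in plain Python. This is only an in-memory illustration of the idea, not Hadoop code: the function names (`map_page`, `shuffle`, `reduce_keyword`, `build_index`) are made up for this example, and the sample records are the ones from the slide.

```python
from collections import defaultdict

# Sample <url, keywords> input taken from the slide.
PAGES = [
    ("www.barnesandnoble.com", "books calendars"),
    ("www.amazon.com", "shoes books jeans"),
]


def map_page(url, keywords):
    """Map: emit one intermediate (keyword, url) pair per keyword."""
    for keyword in keywords.split():
        yield keyword, url


def shuffle(pairs):
    """Shuffle: group intermediate values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_keyword(keyword, urls):
    """Reduce: combine all URLs seen for a keyword into one output value."""
    return keyword, " ".join(urls)


def build_index(pages):
    """Run map, shuffle and reduce in sequence on an in-memory data set."""
    intermediate = [pair for url, kw in pages for pair in map_page(url, kw)]
    grouped = shuffle(intermediate)
    return dict(reduce_keyword(k, v) for k, v in sorted(grouped.items()))


if __name__ == "__main__":
    print(build_index(PAGES)["books"])
```

With the two sample pages, the value for `books` contains both URLs, matching the output pair on the slide. In a real cluster the map and reduce calls run as separate tasks on different nodes, and the shuffle moves data over the network.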
Hadoop Basic Core Architecture
- MapReduce: Mapper (Map) -> Shuffle -> Reducer (Reduce)
- Hadoop Distributed File System (HDFS)
HDFS & MapReduce Enterprise Apache Hadoop
Hortonworks Cluster Topology HDFS MapReduce
Cluster Topology
- Master Services
- Slave Services
HDFS
Distributed file system designed to run on commodity hardware.
Key assumptions
- Hardware failure is the norm
- Need streaming access to data sets; optimized for high throughput
- Data sets are large
- Append-only file system; write once, read many times
- Moving computation is cheaper than moving data
HDFS: Key Services
NameNode
- Master service
- Manages the file system namespace and regulates access to files by clients
- Single service across the cluster
DataNode
- Slave service; runs on slave nodes
- Manages block reads/writes for HDFS
- Pings the NameNode for instructions
- If the heartbeat fails, the DataNode is removed from the cluster and the replicas of its blocks take over
Secondary NameNode
- Merges the NameNode's file system image and edit logs
- Not a failover NameNode!
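The heartbeat behaviour described for the DataNode can be illustrated with a small Python sketch. It is a toy model under stated assumptions, not the HDFS implementation: the class `NameNodeSketch`, its method names, the 30-second timeout and the replication target of 3 are all chosen for this example.

```python
from dataclasses import dataclass, field


@dataclass
class NameNodeSketch:
    """Toy model of heartbeat tracking; all names here are hypothetical."""

    timeout: float  # seconds of silence before a DataNode counts as dead (assumed)
    last_seen: dict = field(default_factory=dict)  # datanode -> last heartbeat time
    block_map: dict = field(default_factory=dict)  # block id -> set of datanodes

    def heartbeat(self, node, now):
        """Record that a DataNode reported in at time `now`."""
        self.last_seen[node] = now

    def add_replica(self, block, node):
        """Record that `node` holds a replica of `block`."""
        self.block_map.setdefault(block, set()).add(node)

    def live_nodes(self, now):
        """DataNodes whose last heartbeat is within the timeout."""
        return {n for n, t in self.last_seen.items() if now - t <= self.timeout}

    def readable_from(self, block, now):
        """Live DataNodes that can still serve the block."""
        return sorted(self.block_map.get(block, set()) & self.live_nodes(now))

    def under_replicated(self, now, target=3):
        """Blocks with fewer live replicas than the target."""
        live = self.live_nodes(now)
        return sorted(
            b for b, holders in self.block_map.items() if len(holders & live) < target
        )


def demo():
    nn = NameNodeSketch(timeout=30.0)
    for node in ("dn1", "dn2", "dn3"):
        nn.heartbeat(node, now=0.0)
        nn.add_replica("B1", node)
    # dn3 goes silent; dn1 and dn2 keep reporting.
    nn.heartbeat("dn1", now=40.0)
    nn.heartbeat("dn2", now=40.0)
    return nn.readable_from("B1", now=40.0), nn.under_replicated(now=40.0)
```

In `demo()`, the block stays readable from the two surviving replicas after `dn3` stops sending heartbeats, and it is reported as under-replicated so that a new copy can be scheduled.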
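HDFS: File create lifecycle
[Diagram: an HDFS client creates a file made of blocks B1 and B2. Labeled steps run from 1 (Create) to 4 (Complete) and involve the NameNode; block replicas are placed on DataNodes across RACK1, RACK2 and RACK3, each write being answered with an ack.]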
MapReduce
- A software framework for developing distributed applications that process vast amounts of data in parallel on large clusters
- A MapReduce job splits the input data set into independent chunks, which are processed by the map tasks in a completely parallel manner across a cluster of nodes
MapReduce: Key Services
Job Tracker
- Master service
- Schedules a job's component tasks on the Task Trackers
- Monitors task progress
- Reschedules failed tasks
Task Tracker
- Spawns a job's tasks
- Reports progress to the Job Tracker
- Runs on slave nodes, collocated with the DataNode service
Task (Map / Reduce)
- Spawned by the Task Tracker
- Executes a Map or Reduce task, encapsulating the business logic
MapReduce: Job Lifecycle
- Map phase: a Mapper runs on each of DataNode 1, DataNode 2 and DataNode 3
- Shuffle/Sort: data is shuffled across the network and sorted
- Reduce phase: Reducers run on the DataNodes
Hortonworks Inc. 2012
The Key/Value Pairs of MapReduce
<K1, V1> -> Mapper -> <K2, V2> -> Shuffle/Sort -> <K2, (V2, V2, V2, V2)> -> Reducer -> <K3, V3>
- Map & Reduce operate on (key, value) pairs and output (key, value) pairs
- User provides the map and reduce functions
- Input key and value are determined by the InputFormat
- Common InputFormats: TextInputFormat, KeyValueTextInputFormat, SequenceFileInputFormat
- Common OutputFormats: TextOutputFormat, SequenceFileOutputFormat
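To make the key/value flow concrete, here is a minimal single-process sketch in Python. It is not the Hadoop API: `text_input_format`, `run_job`, `wc_mapper` and `wc_reducer` are names invented for this example. The one Hadoop fact it relies on is that TextInputFormat presents each line as a (byte offset, line text) pair.

```python
from collections import defaultdict


def text_input_format(text):
    """Mimic TextInputFormat: key = byte offset of the line, value = the line."""
    offset = 0
    for line in text.splitlines(keepends=True):
        yield offset, line.rstrip("\r\n")
        offset += len(line.encode("utf-8"))


def run_job(records, mapper, reducer):
    """Drive <K1, V1> -> map -> <K2, V2> -> shuffle/sort -> reduce -> <K3, V3>."""
    grouped = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in mapper(k1, v1):
            grouped[k2].append(v2)  # shuffle: collect every V2 under its K2
    output = []
    for k2 in sorted(grouped):  # sort: the reducer sees keys in sorted order
        output.extend(reducer(k2, grouped[k2]))
    return output


def wc_mapper(offset, line):
    """<K1, V1> = (offset, line) -> <K2, V2> = (word, 1); the offset is ignored."""
    for word in line.split():
        yield word, 1


def wc_reducer(word, counts):
    """<K2, (V2, V2, ...)> = (word, [1, 1, ...]) -> <K3, V3> = (word, total)."""
    yield word, sum(counts)


if __name__ == "__main__":
    print(run_job(text_input_format("a b\nb c\n"), wc_mapper, wc_reducer))
```

The user supplies only `wc_mapper` and `wc_reducer`; the input keys come from the input format and the grouping comes from the driver, which mirrors the division of labour listed on the slide.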
PIG, HIVE Enterprise Apache Hadoop
Hortonworks Pig Hive
Pig
- An engine for executing programs on top of Hadoop
- It provides a language, Pig Latin, to specify these programs
Why use Pig?
Suppose you have user data in one file, website data in another, and you need to find the top 5 most visited sites by users aged 18-25.
In Map-Reduce
170 lines of code, 4 hours to write
In Pig Latin
    Users = load 'input/users' using PigStorage(',') as (name:chararray, age:int);
    Fltrd = filter Users by age >= 18 and age <= 25;
    Pages = load 'input/pages' using PigStorage(',') as (user:chararray, url:chararray);
    Jnd = join Fltrd by name, Pages by user;
    Grpd = group Jnd by url;
    Smmd = foreach Grpd generate group, COUNT(Jnd) as clicks;
    Srtd = order Smmd by clicks desc;
    Top5 = limit Srtd 5;
    store Top5 into 'output/top5sites' using PigStorage(',');
9 lines of code, 15 minutes to write
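For readers new to Pig Latin, the data flow of the script above (filter, join, group, count, order, limit) can be mirrored in a few lines of plain Python on in-memory data. This is only an illustration of what each step computes: the function `top_sites` and the sample records are invented for this example, and the sketch assumes user names are unique.

```python
from collections import Counter


def top_sites(users, pages, n=5, min_age=18, max_age=25):
    """Return the n most visited URLs among users aged min_age..max_age."""
    # filter Users by age (assumes each name appears once in users)
    selected = {name for name, age in users if min_age <= age <= max_age}
    # join on name == user, then group by url and count the joined rows
    clicks = Counter(url for user, url in pages if user in selected)
    # order by clicks desc, then limit n
    return clicks.most_common(n)


if __name__ == "__main__":
    # Made-up sample data in the same shape as input/users and input/pages.
    users = [("alice", 20), ("bob", 30), ("carol", 25)]
    pages = [
        ("alice", "a.com"),
        ("alice", "b.com"),
        ("carol", "a.com"),
        ("bob", "c.com"),
    ]
    print(top_sites(users, pages))
```

The Python version holds everything in memory on one machine. The Pig script expresses the same steps declaratively so that Pig can translate them into MapReduce jobs that run across the cluster.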
Essence of Pig
- Map-Reduce is too low a level, SQL too high
- Pig Latin is a language intended to sit between the two
- Provides standard relational transforms (join, sort, etc.)
- Schemas are optional, used when available, and can be defined at runtime
- User Defined Functions are first-class citizens
Pig Elements
Pig Latin
- High-level scripting language
- Requires no metadata or schema
- Statements are translated into a series of MapReduce jobs
Grunt
- Interactive shell
Piggybank
- Shared repository for User Defined Functions (UDFs)
Motivation
Hadoop as Enterprise Data Warehouse
- Ad hoc query support
- Schema information
- Tool for end users
USED EXTENSIVELY FOR ANALYTICS & BUSINESS INTELLIGENCE
HiveQL Features
- HiveQL is similar to other SQL dialects
- Uses familiar relational database concepts (tables, rows, columns and schema)
- Based on the SQL-92 specification
- Treats Big Data as tables
- Converts SQL queries into MapReduce jobs; the user does not need to know MapReduce
- Also supports plugging custom MapReduce scripts into queries
Performing Queries
- SELECT
- WHERE clause
- UNION ALL and DISTINCT
- GROUP BY and HAVING
- JOIN
- ORDER BY
- LIMIT clause: rows returned are chosen at random
- Can use a REGEX column specification: SELECT `(ds|hr)?+.+` FROM sales;
Hive vs Pig
Pig and Hive work well together and many businesses use both.
Hive is a good choice:
- when you want to query the data
- when you need an answer to specific questions
- if you are familiar with SQL
Pig is a good choice:
- for ETL (Extract -> Transform -> Load)
- for preparing data for easier analysis
- when you have a long series of steps to perform
Ambari Streamlining Hadoop Operations
Ambari: Install, Manage, Monitor, Tune
- Simplify deployment and maintenance: wizard-based install, handles dependency checks, recommends service mappings
- Ensure a healthy cluster: monitor, alert, heat maps
- Optimize performance: root cause analysis for cluster tuning; fix problems BEFORE SLAs are breached
- Integrate with your operations: open APIs, standard web tech
Community-Driven, Enterprise Class
- Productizes over a combined 100 person-years of operational Hadoop experience
- Stability and scale: Ops & Dev team that took Yahoo! from 1,000 to 45,000+ nodes
- Fast-paced, open source, community-driven innovation and integration
- Contributions from Red Hat, Teradata, HP, Microsoft (and more)
Demonstration