The 4th China Cloud Computing Conference, May 25th, 2012
Apache Hadoop: Past, Present, and Future
Dr. Amr Awadallah, Founder and Chief Technical Officer, Cloudera
aaa@cloudera.com, Twitter: @awadallah
Hadoop Past
- Fastest sort of a TB: 62 seconds over 1,460 nodes.
- Sorted a PB in 16.25 hours over 3,658 nodes.
So What is Apache Hadoop?
A scalable, fault-tolerant distributed system for data storage and processing (open source under the Apache license).
Core Hadoop has two main systems:
- Hadoop Distributed File System (HDFS): self-healing, high-bandwidth clustered storage.
- MapReduce: distributed, fault-tolerant resource management and scheduling coupled with a scalable data programming abstraction.
Key business values:
- Flexibility: store any data, run any analysis.
- Scalability: start at 1 TB on 3 nodes and grow to petabytes on thousands of nodes.
- Economics: cost per TB at a fraction of traditional options.
Hadoop Present (CDH3)
- Build/Test: Apache Bigtop
- File System Mount: FUSE-NFS
- Web Console: HUE
- Data Mining: Apache Mahout
- BI Connectivity: ODBC/JDBC
- Job Workflow: Apache Oozie
- Metadata: Apache Hive
- Data Integration: Apache Flume, Apache Sqoop, Apache Whirr
- Analytical Languages: Apache Pig, Apache Hive
- Fast Read/Write Access: Apache HBase
- Coordination: Apache ZooKeeper
- Cloudera Manager Free Edition (Installation Wizard)
HDFS: Hadoop Distributed File System
A given file is broken down into blocks (default = 64 MB), then blocks are replicated across the cluster (default = 3 replicas).
Optimized for:
- Throughput
- Put/Get/Delete
- Appends
Block replication provides:
- Durability
- Availability
- Throughput
Block replicas are distributed across servers and racks.
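For illustration, a minimal sketch (not from the original slides) of writing a file through the HDFS client API with the default block size and replication factor described above; the NameNode address and file path are hypothetical.

// Sketch: write a file to HDFS with an explicit replication factor and block size.
// The NameNode host and the path /data/example.txt are placeholders.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Assumed NameNode address (MRv1/CDH3-era property name); adjust for your cluster.
    conf.set("fs.default.name", "hdfs://namenode:8020");

    FileSystem fs = FileSystem.get(conf);
    Path file = new Path("/data/example.txt");

    // create(path, overwrite, bufferSize, replication, blockSize):
    // 3 replicas and 64 MB blocks mirror the defaults on this slide.
    FSDataOutputStream out =
        fs.create(file, true, 4096, (short) 3, 64L * 1024 * 1024);
    out.writeUTF("hello hdfs");
    out.close();

    fs.close();
  }
}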
MapReduce: Resource Manager / Scheduler
A given job is broken down into tasks, then tasks are scheduled to run as close to the data as possible.
Three levels of data locality:
- Same server as the data (local disk)
- Same rack as the data (rack/leaf switch)
- Wherever there is a free slot (cross-rack)
Optimized for:
- Batch processing
- Failure recovery
The system detects laggard tasks and speculatively executes parallel tasks on the same slice of data.
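As a hedged sketch of the speculative-execution behavior mentioned above, the MRv1 JobConf API (era-appropriate for CDH3) exposes per-job toggles; the equivalent property names are noted in comments.

// Sketch: enabling speculative execution for map and reduce tasks on an MRv1 job.
import org.apache.hadoop.mapred.JobConf;

public class SpeculativeConfigExample {
  public static void main(String[] args) {
    JobConf conf = new JobConf();
    // Let the JobTracker launch backup attempts for laggard (slow) tasks.
    conf.setMapSpeculativeExecution(true);
    conf.setReduceSpeculativeExecution(true);
    // Equivalent MRv1 properties, if set in mapred-site.xml or via -D:
    //   mapred.map.tasks.speculative.execution
    //   mapred.reduce.tasks.speculative.execution
    System.out.println("map speculation: " + conf.getMapSpeculativeExecution());
  }
}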
MapReduce: Computational Framework
Unix pipeline analogy: cat *.txt | mapper.pl | sort | reducer.pl > out.txt
Data flow:
- Splits 1..N hold the input as (docid, text) records, e.g. "To Be Or Not To Be?"
- Maps 1..M emit (word, count) pairs, e.g. (Be, 12), (Be, 5), (Be, 7), (Be, 6)
- The shuffle sorts the intermediate pairs and routes them by key to the reducers
- Reduces 1..R emit (sorted word, sum of counts), e.g. (Be, 30), one output file per reducer
Signatures:
Map(in_key, in_value) => list of (out_key, intermediate_value)
Reduce(out_key, list of intermediate_values) => out_value(s)
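The canonical word-count job expresses these Map/Reduce signatures in Java against the org.apache.hadoop.mapreduce API; the input and output paths are supplied as hypothetical command-line arguments.

// Word count: map emits (word, 1), the framework shuffles/sorts by key,
// reduce sums the counts, mirroring the pipeline analogy above.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map(in_key, in_value) => list of (out_key, intermediate_value)
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);            // e.g. ("Be", 1)
      }
    }
  }

  // Reduce(out_key, list of intermediate_values) => out_value(s)
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum)); // e.g. ("Be", 30)
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // hypothetical input dir
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // hypothetical output dir
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Submitted with the standard hadoop jar command, the framework then handles split assignment, the shuffle, and speculative re-execution of laggard tasks without any application code.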
Hadoop High-Level Architecture
- Hadoop Client: contacts the Name Node for data, or the Job Tracker to submit jobs.
- Name Node: maintains the mapping of file names to blocks to Data Node slaves.
- Job Tracker: tracks resources and schedules jobs across Task Tracker slaves.
- Data Node: stores and serves blocks of data.
- Task Tracker: runs tasks (work units) within a job.
Data Nodes and Task Trackers share the same physical nodes.
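A hedged sketch of the client's view of the Name Node's file-to-block-to-Data-Node mapping, using the FileSystem block-location API; the file path is hypothetical and the cluster configuration is assumed to be on the classpath.

// Sketch: ask the NameNode which DataNodes hold each block of a file.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationsExample {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path file = new Path("/data/example.txt");        // placeholder path
    FileStatus status = fs.getFileStatus(file);

    // One BlockLocation per block, listing the DataNodes serving that block.
    BlockLocation[] blocks =
        fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      System.out.printf("offset=%d hosts=%s%n",
          block.getOffset(), String.join(",", block.getHosts()));
    }
    fs.close();
  }
}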
Changes for Better Availability/Scalability
- Name Node: Federation partitions out the namespace; High Availability via an active/standby pair.
- Job Tracker: each job gets its own Application Master, and the Resource Manager is decoupled from MapReduce.
- Hadoop Client: still contacts the Name Node for data or the Job Tracker to submit jobs.
- Data Node: stores and serves blocks of data; Task Tracker: runs tasks (work units) within a job; both share the same physical node.
MapReduce Next Gen
The main idea is to split up the JobTracker's functions:
- Cluster resource management (tracking and allocating nodes)
- Application life-cycle management (MapReduce scheduling and execution)
This enables:
- High availability
- Better scalability
- Efficient slot allocation
- Rolling upgrades
- Non-MapReduce applications
HBase versus HDFS
HDFS:
- Optimized for: large files, sequential access (high throughput), append only.
- Use for: fact tables that are mostly append-only and require sequential full-table scans.
HBase:
- Optimized for: small records, random access (low latency), atomic record updates.
- Use for: dimension tables that are updated frequently and require random low-latency lookups.
Not suitable for: low-latency interactive OLAP.
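A hedged sketch of HBase's random-access, atomic-record-update model using the classic HTable client API (era-appropriate for CDH3); the table name, column family, and row key are hypothetical.

// Sketch: atomic update and low-latency random read of a single record.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseRandomAccessExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "dim_customer");   // placeholder table name

    // Atomic update of one record (a dimension-table style write).
    Put put = new Put(Bytes.toBytes("customer#1001"));
    put.add(Bytes.toBytes("d"), Bytes.toBytes("city"), Bytes.toBytes("Beijing"));
    table.put(put);

    // Low-latency random read of the same record.
    Get get = new Get(Bytes.toBytes("customer#1001"));
    Result result = table.get(get);
    System.out.println(Bytes.toString(
        result.getValue(Bytes.toBytes("d"), Bytes.toBytes("city"))));

    table.close();
  }
}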
What Next: The Long-Term Vision
A fully functional Big Data operating system that allows you to:
1. Collect data (in real time).
2. Store data (very reliably and securely).
3. Process data (workload management).
4. Analyze data (metadata management).
5. Serve data (interactively, at low latency).
Hadoop in the Enterprise Data Stack
- Users and their tools: engineers (IDEs), data scientists (modeling tools), analysts (BI/analytics), business users (enterprise reporting), data architects (metadata/ETL tools), system operators (Cloudera Manager).
- Tool connectivity to Hadoop: ODBC, JDBC, NFS.
- Data exchange via Sqoop: enterprise data warehouse, relational databases, online serving systems.
- Data ingest via Flume: logs, files, web data.
- Online serving systems back the web/mobile applications that serve customers.
Three Deployment Options
Option 1: The Complete Solution (Cloudera Enterprise = Cloudera Manager + Cloudera Support, plus CDH)
- Fastest rollout; easiest to deploy, manage, and configure; fully supported; annual subscription.
Option 2: CDH Free Download (Cloudera Manager Free Edition plus CDH)
- No software cost; no support included; manages up to 50 nodes; limited functionality in Cloudera Manager.
Option 3: Raw Packages (CDH only)
- Self install and configure; no software cost; no support included; more management and installation complexity; longer rollout for most customers.
Solution components: the proprietary software + support layer (Cloudera Enterprise: Cloudera Manager and Cloudera Support; or Cloudera Manager Free Edition, up to 50 nodes) sits on top of the 100% open source CDH (Cloudera Distribution including Apache Hadoop: Apache Hadoop plus selected components), which is common to all three options.
Books