Apache Hadoop: The Platform for Big Data
Amr Awadallah, CTO & Founder, Cloudera, Inc.
aaa@cloudera.com, Twitter: @awadallah
The Problems with Current Data Systems
(diagram: Instrumentation → Collection (mostly append) → Storage-Only Grid (original raw data) → ETL Compute Grid → RDBMS (aggregated data) → BI Reports + Interactive Apps)
1. Moving data to compute doesn't scale.
2. Can't explore the original high-fidelity raw data.
3. Archiving = premature data death.
The Solution: A Combined Storage/Compute Layer
(diagram: Instrumentation → Collection (mostly append) → Hadoop: Storage + Compute Grid → RDBMS (aggregated data) → BI Reports + Interactive Apps)
1. Scalable throughput for ETL & aggregation (ETL offload).
2. Data exploration & advanced analytics.
3. Keep data alive forever (active archive).
So What is Apache Hadoop?
A scalable, fault-tolerant distributed system for data storage and processing (open source under the Apache license).
Core Hadoop has two main systems:
- Hadoop Distributed File System (HDFS): self-healing, high-bandwidth clustered storage.
- MapReduce: distributed, fault-tolerant resource management and scheduling, coupled with a scalable data programming abstraction.
Key business values:
- Flexibility: store any data, run any analysis.
- Scalability: start at 1TB/3 nodes, grow to petabytes/1000s of nodes.
- Economics: cost per TB at a fraction of traditional options.
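The MapReduce programming abstraction mentioned above can be illustrated with a minimal in-process sketch: map emits (key, value) pairs, the framework groups values by key (the shuffle), and reduce folds each group. This toy word count is illustrative only; real Hadoop distributes these phases across a cluster with fault tolerance.

```python
from collections import defaultdict

def map_phase(record):
    """Emit (word, 1) for every word in one input record."""
    for word in record.split():
        yield (word.lower(), 1)

def reduce_phase(key, values):
    """Fold all counts for one key into a total."""
    return (key, sum(values))

def run_wordcount(records):
    shuffled = defaultdict(list)
    for record in records:                      # map + shuffle
        for key, value in map_phase(record):
            shuffled[key].append(value)
    return dict(reduce_phase(k, vs) for k, vs in shuffled.items())

print(run_wordcount(["big data", "big analysis"]))
# {'big': 2, 'data': 1, 'analysis': 1}
```

The same map and reduce functions would run unchanged whether the input is two strings or petabytes spread over thousands of nodes, which is the point of the abstraction.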
The Key Benefit: Agility/Flexibility
Schema-on-Write (RDBMS):
- Schema must be created before any data can be loaded.
- An explicit load operation has to take place which transforms the data to the database's internal structure.
- New columns must be added explicitly before new data for those columns can be loaded.
- Pros: reads are fast; standards and governance.
Schema-on-Read (Hadoop):
- Data is simply copied to the file store; no transformation is needed.
- A SerDe (Serializer/Deserializer) is applied at read time to extract the required columns (late binding).
- New data can start flowing at any time and will appear retroactively once the SerDe is updated to parse it.
- Pros: loads are fast; flexibility and agility.
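The late-binding idea above can be sketched in a few lines: raw log lines are stored untouched, and a parser (playing the role of a SerDe) projects columns only at read time, so an updated parser exposes a new field retroactively. The log format and field names here are made up for illustration.

```python
# Raw bytes land in the store as-is; no load-time transformation.
raw_store = [
    "2012-06-01 login alice",
    "2012-06-02 login bob mobile",   # newer records carry an extra field
]

def serde_v1(line):
    """Original read-time schema: date, event, user."""
    date, event, user = line.split()[:3]
    return {"date": date, "event": event, "user": user}

def serde_v2(line):
    """Updated SerDe: adds a 'device' column without reloading anything."""
    parts = line.split()
    row = serde_v1(line)
    row["device"] = parts[3] if len(parts) > 3 else None
    return row

# Same stored bytes, two schemas: no reload or ALTER TABLE needed.
print([serde_v1(l)["user"] for l in raw_store])    # ['alice', 'bob']
print([serde_v2(l)["device"] for l in raw_store])  # [None, 'mobile']
```

Note how the old records simply yield `None` for the new column, while the new records expose it, which is what "appear retroactively" means in practice.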
Scalability: Scalable Software Development
Grows without requiring developers to re-architect their algorithms/applications. (AUTO SCALE)
Economics: Return on Byte
Return on Byte (ROB) = the value extracted from a byte divided by the cost of storing that byte.
If ROB is < 1, the data gets buried in the tape wasteland; thus we need more economical active storage.
(diagram contrasts high-ROB vs. low-ROB data)
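The ROB rule of thumb above amounts to a ratio and a threshold. A tiny worked example, with entirely made-up dollar figures (only the ratio-and-threshold logic comes from the slide):

```python
def return_on_byte(value_per_tb, storage_cost_per_tb):
    """ROB = value extracted per TB / cost of storing that TB."""
    return value_per_tb / storage_cost_per_tb

def placement(rob):
    """The slide's heuristic: ROB < 1 means the data dies on tape."""
    return "active storage" if rob >= 1 else "tape wasteland"

# Hypothetical datasets: the numbers are illustrative, not real pricing.
clickstream = return_on_byte(value_per_tb=500.0, storage_cost_per_tb=250.0)
cold_backup = return_on_byte(value_per_tb=50.0, storage_cost_per_tb=250.0)

print(placement(clickstream))  # active storage  (ROB = 2.0)
print(placement(cold_backup))  # tape wasteland  (ROB = 0.2)
```

Hadoop's pitch is that by cutting `storage_cost_per_tb`, it pushes ROB above 1 for data that would otherwise be archived or deleted.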
The Big Data Platform: CDH4 (June 2012)
- Build/Test: Apache Bigtop
- Data Integration: Apache Flume, Apache Sqoop
- Job Workflow: Apache Oozie
- Web Console: Hue
- Cloud Deployment: Apache Whirr
- Data Processing Lib: DataFu for Pig
- Low-Latency SQL: Impala
- Batch Processing Languages: Apache Pig, Apache Hive
- Hadoop Core Kernel: MapReduce, HDFS
- Connectivity: ODBC/JDBC/FUSE/HTTPS
- Data Mining Lib: Apache Mahout/DataFu
- Metadata: Apache Hive MetaStore
- Fast Read/Write Access: Apache HBase
- Coordination: Apache ZooKeeper
- Management: Cloudera Manager Free Edition (installation wizard)
© 2012 Cloudera, Inc. All Rights Reserved.
CDH4: Enterprise Standard for Hadoop
- Higher availability (no NameNode SPOF, HBase replication).
- Faster performance (100% faster for lookups).
- More scalability (no limit on the number of nodes).
- Better extensibility (YARN and co-processors).
- More granular security (HBase table/column).
- More usability (Mahout, Hue).
- Stronger integration (ODBC certification, REST API).
CDH4 in the Enterprise Data Stack
(diagram of the surrounding stack)
- Users: engineers, data scientists, analysts, business users, data architects, system operators.
- Tools on top: IDEs, modeling tools, BI/analytics, enterprise reporting, metadata/ETL tools, Cloudera Manager.
- Access: ODBC, JDBC, NFS, HTTP.
- Data movement: Sqoop to/from the enterprise data warehouse and relational databases; Flume from logs, files, and web data.
- Sources: customers, web/mobile applications, online serving systems.
HBase versus HDFS
HDFS:
- Optimized for: large files, sequential access (high throughput), append-only writes.
- Use for: fact tables that are mostly append-only and require sequential full-table scans.
HBase:
- Optimized for: small records, random access (low latency), atomic record updates.
- Use for: dimension tables which are updated frequently and require random low-latency lookups.
- Not suitable for: low-latency interactive OLAP.
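The fact-vs-dimension guidance above boils down to matching the workload's access pattern and update style to the storage layer. A hedged sketch of that decision rule as a helper function; the attribute names and values are illustrative, not anything from Hadoop itself:

```python
def choose_storage(access_pattern, update_style):
    """Return 'HDFS' or 'HBase' per the slide's fact-vs-dimension heuristic."""
    if access_pattern == "sequential" and update_style == "append-only":
        return "HDFS"    # large files, high-throughput full-table scans
    if access_pattern == "random" and update_style == "in-place":
        return "HBase"   # small records, low-latency atomic updates
    return "unclear"     # mixed workloads need case-by-case judgment

# Fact table: clickstream events, appended and scanned in bulk.
print(choose_storage("sequential", "append-only"))  # HDFS
# Dimension table: user profiles, updated and looked up by key.
print(choose_storage("random", "in-place"))         # HBase
```

A mixed workload (random reads over append-only data, say) falls outside the heuristic, which is why the slide frames these as typical cases rather than hard rules.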
MapReduce Next Gen
Main idea is to split up the JobTracker's functions:
- Cluster resource management (for tracking and allocating nodes).
- Application life-cycle management (for MapReduce scheduling and execution).
Enables:
- High availability
- Better scalability
- Efficient slot allocation
- Rolling upgrades
- Non-MapReduce apps
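The split described above can be sketched with two toy classes (this is not the real YARN API): a resource manager that only hands out cluster slots, and a per-application master that owns scheduling for its own job. Because the application master is pluggable, nothing here is MapReduce-specific, which is how non-MapReduce apps become possible.

```python
class ResourceManager:
    """Cluster resource management only: tracks and allocates slots."""
    def __init__(self, total_slots):
        self.free_slots = total_slots

    def allocate(self, requested):
        granted = min(requested, self.free_slots)
        self.free_slots -= granted
        return granted

class ApplicationMaster:
    """Application life-cycle management: one per job, any framework."""
    def __init__(self, rm, tasks):
        self.rm, self.tasks = rm, tasks

    def run(self):
        done = 0
        while done < self.tasks:
            slots = self.rm.allocate(self.tasks - done)
            done += slots                 # pretend each slot finishes one task
            self.rm.free_slots += slots   # release containers when finished
        return done

rm = ResourceManager(total_slots=4)
print(ApplicationMaster(rm, tasks=10).run())  # 10
```

With scheduling moved out of the central daemon, the resource manager stays small (easier HA and scaling), and several application masters can share one cluster.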
CDH5 Key Release Themes
- Low-latency SQL analytics (Impala).
- Stronger recoverability (snapshots).
- Multi-workload resource management.
- Expanded metadata management.
- More granular security/access control.
Cloudera Now Powered by Impala
(diagram: before Impala, the user interface reached only batch processing; with Impala, it gets batch processing plus real-time access)
Unified platform:
- Unified storage: supports HDFS and HBase, flexible file formats.
- Unified metastore.
- Unified security.
- Unified client interfaces: ODBC, SQL syntax, Hue Beeswax.
With Impala:
- Real-time SQL queries.
- Native distributed query engine, optimized for low latency.
Provides:
- Answers as fast as you can ask.
- Everyone can ask questions of all data.
- Big data storage and analytics together.
Impala Near-Term Features
Today:
- Nearly all of Hive's SQL, including INSERT, JOIN, and subqueries.
- Query results 4-35x faster than Hive for interactive queries.
- Same open Hive metadata model, making it easy to create & change schemas.
- Support for HDFS and HBase storage.
- HDFS file formats: TextFile, SequenceFile.
- HDFS compression: Snappy, GZIP.
- Low-latency scheduler (Sparrow).
- Common ODBC driver with Hive; separate CLI from Hive.
Next few months:
- Support for Avro, RCFile & LZO-compressed text.
- Trevni columnar format.
- JDBC driver.
- DDL.
Cloudera Manager 4.5
Patch & Update Management:
- Downtimeless rolling updates.
Automation:
- Templating for different hardware generations.
- Rolling restarts.
Expanded Monitoring:
- Diagnostics root-cause analysis.
- Expanded HBase monitoring by table, region, and column family.
- ZooKeeper monitoring.
- Impala monitoring.
Use Case Examples
- Retail: price optimization
- Media: content targeting
- Finance: fraud detection
- Manufacturing: diagnostics
- Info Services: satellite imagery
- Agriculture: seed optimization
- Power: smart consumption
Core Benefits of the Platform for Big Data
1. Flexibility: store any data, run any analysis; keeps pace with the rate of change of incoming data.
2. Scalability: proven growth to PBs/1,000s of nodes; no need to rewrite queries, automatically scales; keeps pace with the rate of growth of incoming data.
3. Economics: cost per TB at a fraction of other options; keep all of your data alive in an active archive; powering the "data beats algorithm" movement.