Apache Hadoop: The Platform for Big Data. Amr Awadallah, CTO, Founder, Cloudera, Inc. aaa@cloudera.com, Twitter: @awadallah

The Problems with Current Data Systems
[Diagram, top to bottom: BI Reports + Interactive Apps; RDBMS (aggregated data); ETL Compute Grid; Storage-Only Grid (original raw data); mostly-append Collection; Instrumentation]
1. Moving data to compute doesn't scale.
2. Can't explore the original high-fidelity raw data.
3. Archiving = premature data death.

The Solution: A Combined Storage/Compute Layer
[Diagram, top to bottom: BI Reports + Interactive Apps; RDBMS (aggregated data); Hadoop: Storage + Compute Grid; mostly-append Collection; Instrumentation]
1. Scalable throughput for ETL and aggregation (ETL offload).
2. Data exploration and advanced analytics.
3. Keep data alive forever (active archive).

So What is Apache Hadoop?
A scalable, fault-tolerant distributed system for data storage and processing (open source under the Apache license).
Core Hadoop has two main systems:
Hadoop Distributed File System (HDFS): self-healing, high-bandwidth clustered storage.
MapReduce: distributed, fault-tolerant resource management and scheduling coupled with a scalable data programming abstraction.
Key business values:
Flexibility: store any data, run any analysis.
Scalability: start at 1 TB/3 nodes, grow to petabytes/1000s of nodes.
Economics: cost per TB at a fraction of traditional options.
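
The MapReduce programming abstraction named above can be sketched in a few lines. This is a single-process illustration only, not Hadoop's API (which is Java and runs across many nodes); the function names `map_phase`, `shuffle`, `reduce_phase`, and `run_job` are made up for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key (done by the framework in Hadoop)."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse all values for one key into a single count."""
    return key, sum(values)


def run_job(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence in one process (hypothetical driver)."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(run_job(["big data big cluster", "data beats algorithms"]))
```

The developer writes only the map and reduce functions; in a real cluster the framework handles the shuffle, the scheduling, and recovery from node failures.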

The Key Benefit: Agility/Flexibility
Schema-on-Write (RDBMS):
Schema must be created before any data can be loaded.
An explicit load operation has to take place which transforms data to the DB internal structure.
New columns must be added explicitly before new data for such columns can be loaded into the database.
Pros: read is fast; standards/governance.
Schema-on-Read (Hadoop):
Data is simply copied to the file store; no transformation is needed.
A SerDe (Serializer/Deserializer) is applied during read time to extract the required columns (late binding).
New data can start flowing anytime and will appear retroactively once the SerDe is updated to parse it.
Pros: load is fast; flexibility/agility.
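
A minimal sketch of schema-on-read, assuming JSON lines as the raw format. The names `RAW_LINES`, `serde_v1`, `serde_v2`, and `read_all` are hypothetical and stand in for files in the file store and for two versions of a SerDe; this is not Hive's SerDe interface.

```python
import json
from typing import Callable, Dict, List, Optional

Row = Dict[str, Optional[str]]

# Raw records stored exactly as they arrived; nothing was transformed on load.
# The second field ("country") started flowing before any reader knew about it.
RAW_LINES: List[str] = [
    '{"user": "ann", "country": "EG"}',
    '{"user": "bob"}',
]


def serde_v1(raw_line: str) -> Row:
    """First deserializer: extracts only the 'user' column at read time."""
    record = json.loads(raw_line)
    return {"user": record.get("user")}


def serde_v2(raw_line: str) -> Row:
    """Updated deserializer: also extracts 'country' (missing values become None)."""
    record = json.loads(raw_line)
    return {"user": record.get("user"), "country": record.get("country")}


def read_all(store: List[str], serde: Callable[[str], Row]) -> List[Row]:
    """Apply the chosen deserializer to every raw line when reading (late binding)."""
    return [serde(line) for line in store]


if __name__ == "__main__":
    print(read_all(RAW_LINES, serde_v1))
    print(read_all(RAW_LINES, serde_v2))
```

Switching from `serde_v1` to `serde_v2` makes the `country` column appear for records that were stored earlier, without reloading anything. That is the retroactive behavior the slide describes.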

Scalability: Scalable Software Development
Grows without requiring developers to re-architect their algorithms/application.
[Figure: auto scale]

Economics: Return on Byte
Return on Byte (ROB) = value to be extracted from that byte divided by the cost of storing that byte.
If ROB is < 1 then it will be buried into tape wasteland, thus we need more economical active storage.
[Figure: high ROB versus low ROB]
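
The ROB ratio can be written out as a small calculation. The function names and the figures in the usage lines are hypothetical, chosen only to show how a lower storage cost moves the same data from below 1 to above 1.

```python
def return_on_byte(value: float, storage_cost: float) -> float:
    """ROB = value extracted from the data divided by the cost of storing it."""
    if storage_cost <= 0:
        raise ValueError("storage_cost must be positive")
    return value / storage_cost


def stays_in_active_storage(value: float, storage_cost: float) -> bool:
    """Per the slide, data with ROB < 1 gets pushed to tape; otherwise it stays active."""
    return return_on_byte(value, storage_cost) >= 1


if __name__ == "__main__":
    # Made-up figures in the same currency unit, for illustration only.
    value = 0.5
    print(return_on_byte(value, storage_cost=2.0))   # 0.25: archived to tape
    print(return_on_byte(value, storage_cost=0.25))  # 2.0: worth keeping active
```

The value of the data is the same in both calls; only the storage cost changes, which is the argument for more economical active storage.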

The Big Data Platform: CDH4 (June 2012)
Build/Test: Apache Bigtop
Data Integration: Apache Flume, Apache Sqoop
Job Workflow: Apache Oozie
Web Console: Hue
Cloud Deployment: Apache Whirr
Data Processing Lib: DataFu for Pig
Low-Latency SQL: Impala
Batch Processing Languages: Apache Pig, Apache Hive
Hadoop Core Kernel: MapReduce, HDFS
Connectivity: ODBC/JDBC/FUSE/HTTPS
Data Mining Lib: Apache Mahout/DataFu
Metadata: Apache Hive MetaStore
Fast Read/Write Access: Apache HBase
Coordination: Apache ZooKeeper
Cloudera Manager Free Edition (Installation Wizard)
2012 Cloudera, Inc. All Rights Reserved.

CDH4: Enterprise Standard for Hadoop
Higher Availability (no NameNode (NN) single point of failure, HBase Replication).
Faster Performance (100% faster for lookups).
More Scalability (no limit on number of nodes).
Better Extensibility (YARN and Co-processors).
More Granular Security (HBase Table/Column).
More Usability (Mahout, Hue).
Stronger Integration (ODBC cert, REST API).

CDH4 in the Enterprise Data Stack
[Diagram]
Users and their tools: Engineers (IDEs); Data Scientists (Modeling Tools); Analysts (BI/Analytics); Business Users (Enterprise Reporting); Data Architects (Metadata/ETL Tools); System Operators (Cloudera Manager).
Access interfaces: ODBC, JDBC, NFS, HTTP.
Sqoop links: Enterprise Data Warehouse; Online Serving Systems; Relational Databases.
Flume links: Logs; Files; Web Data.
Customers: Web/Mobile Applications.

HBase versus HDFS
HDFS:
Optimized for: large files; sequential access (high throughput); append only.
Use for: fact tables that are mostly append only and require sequential full table scans.
HBase:
Optimized for: small records; random access (low latency); atomic record updates.
Use for: dimension tables which are updated frequently and require random low-latency lookups.
Not suitable for: low-latency interactive OLAP.
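
The two access patterns can be contrasted with a toy in-memory sketch. The classes `AppendOnlyFactLog` and `KeyedDimensionTable` are hypothetical stand-ins for the workloads described above; they do not use or resemble the HDFS or HBase client APIs.

```python
from typing import Dict, List, Optional


class AppendOnlyFactLog:
    """HDFS-style workload: rows are only appended, and reads scan everything."""

    def __init__(self) -> None:
        self._rows: List[Dict[str, float]] = []

    def append(self, row: Dict[str, float]) -> None:
        self._rows.append(row)

    def scan_total(self, column: str) -> float:
        # Sequential full scan: touches every row, good throughput, no point lookup.
        return sum(row[column] for row in self._rows)


class KeyedDimensionTable:
    """HBase-style workload: small records read and updated in place by key."""

    def __init__(self) -> None:
        self._rows: Dict[str, Dict[str, str]] = {}

    def put(self, key: str, row: Dict[str, str]) -> None:
        # Overwrites the whole record for this key in one step.
        self._rows[key] = row

    def get(self, key: str) -> Optional[Dict[str, str]]:
        # Random lookup by key, without scanning other records.
        return self._rows.get(key)


if __name__ == "__main__":
    sales = AppendOnlyFactLog()
    sales.append({"amount": 10.0})
    sales.append({"amount": 5.0})
    print(sales.scan_total("amount"))

    customers = KeyedDimensionTable()
    customers.put("c1", {"city": "Cairo"})
    customers.put("c1", {"city": "Palo Alto"})  # frequent update of one record
    print(customers.get("c1"))
```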

MapReduce Next Gen
Main idea is to split up the JobTracker functions:
Cluster resource management (for tracking and allocating nodes).
Application life-cycle management (for MapReduce scheduling and execution).
Enables: High Availability; Better Scalability; Efficient Slot Allocation; Rolling Upgrades; Non-MapReduce Apps.
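
The split of responsibilities can be sketched as two small classes. This is an illustration of the separation only, under the assumption that capacity is counted in simple per-node slots; the class and method names are hypothetical and are not YARN's actual interfaces.

```python
from typing import Callable, Dict, List


class ResourceManager:
    """Cluster resource management only: tracks nodes and hands out capacity."""

    def __init__(self, slots_per_node: Dict[str, int]) -> None:
        self._free = dict(slots_per_node)

    def allocate(self, count: int) -> List[str]:
        """Grant up to `count` slots; returns the node name for each granted slot."""
        granted: List[str] = []
        for node in self._free:
            while self._free[node] > 0 and len(granted) < count:
                self._free[node] -= 1
                granted.append(node)
        return granted

    def release(self, nodes: List[str]) -> None:
        for node in nodes:
            self._free[node] += 1

    def free_total(self) -> int:
        return sum(self._free.values())


class ApplicationMaster:
    """Application life-cycle management for a single job: schedules and runs its tasks."""

    def __init__(self, tasks: List[Callable[[], int]], rm: ResourceManager) -> None:
        self._tasks = list(tasks)
        self._rm = rm

    def run(self) -> List[int]:
        results: List[int] = []
        pending = list(self._tasks)
        while pending:
            nodes = self._rm.allocate(len(pending))
            if not nodes:
                raise RuntimeError("cluster has no free capacity")
            batch, pending = pending[:len(nodes)], pending[len(nodes):]
            results.extend(task() for task in batch)  # run one wave of tasks
            self._rm.release(nodes)
        return results


if __name__ == "__main__":
    rm = ResourceManager({"node1": 2, "node2": 1})
    job = ApplicationMaster([lambda i=i: i * i for i in range(5)], rm)
    print(job.run())
```

Because the resource manager knows nothing about MapReduce, any application that supplies its own master could request capacity from it, which is how the design opens the cluster to non-MapReduce apps.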

CDH5 Key Release Themes
Low-Latency SQL Analytics (Impala).
Stronger Recoverability (Snapshots).
Multi-Workload Resource Management.
Expanded Metadata Management.
More Granular Security/Access Control.

Cloudera Now Powered by Impala
[Diagram: before Impala versus with Impala; layers: user interface, batch processing, real-time access]
Unified storage: supports HDFS and HBase; flexible file formats.
Unified Metastore.
Unified Security.
Unified Client Interfaces: ODBC; SQL syntax; Hue Beeswax.
With Impala: real-time SQL queries; native distributed query engine; optimized for low latency.
Provides: answers as fast as you can ask; everyone can ask questions of all data; big data storage and analytics together.

Impala Near-Term Features
Today:
Nearly all of Hive's SQL, including insert, join, and subqueries.
Query results 4-35X faster than Hive for interactive queries.
Same open Hive metadata model => easy to create and change schema.
Support for HDFS and HBase storage.
HDFS file formats: TextFile, SequenceFile.
HDFS compression: Snappy, GZIP.
Low-latency scheduler (Sparrow).
Common ODBC driver with Hive.
Separate CLI from Hive.
Next few months:
Support for Avro, RCFile, and LZO-compressed text.
Trevni columnar format.
JDBC driver.
DDL.

Cloudera Manager 4.5
Patch & Update Management
Downtimeless rolling updates
Automation
Templating for different hardware generations
Rolling restarts
Expanded Monitoring
Diagnostics root cause analysis
Expanded HBase monitoring by table, region, column family
ZooKeeper monitoring
Impala monitoring

Use Case Examples
Retail: Price Optimization
Media: Content Targeting
Finance: Fraud Detection
Manufacturing: Diagnostics
Info Services: Satellite Imagery
Agriculture: Seed Optimization
Power: Smart Consumption

Core Benefits of the Platform for Big Data
1. Flexibility: store any data; run any analysis; keeps pace with the rate of change of incoming data.
2. Scalability: proven growth to PBs/1,000s of nodes; no need to rewrite queries, automatically scales; keeps pace with the rate of growth of incoming data.
3. Economics: cost per TB at a fraction of other options; keep all of your data alive in an active archive; powering the "data beats algorithm" movement.