Apache Hadoop: Past, Present, and Future




The 4th China Cloud Computing Conference, May 25th, 2012. Apache Hadoop: Past, Present, and Future. Dr. Amr Awadallah, Founder, Chief Technical Officer. aaa@cloudera.com, Twitter: @awadallah

Hadoop Past: Fastest sort of a TB, 62 seconds over 1,460 nodes. Sorted a PB in 16.25 hours over 3,658 nodes. 2

So What is Apache Hadoop? A scalable, fault-tolerant distributed system for data storage and processing (open source under the Apache license). Core Hadoop has two main systems: the Hadoop Distributed File System (self-healing, high-bandwidth clustered storage) and MapReduce (distributed, fault-tolerant resource management and scheduling coupled with a scalable data programming abstraction). Key business values: Flexibility: store any data, run any analysis. Scalability: start at 1 TB / 3 nodes, grow to petabytes / 1000s of nodes. Economics: cost per TB at a fraction of traditional options. 3

Hadoop Present (CDH3). [Component stack diagram] Build/test: Apache Bigtop. File system mount: FUSE-NFS. Web console: HUE. Data mining: Apache Mahout. BI connectivity: ODBC/JDBC. Job workflow: Apache Oozie. Metadata: Apache Hive. Data integration: Apache Flume, Apache Sqoop, Apache Whirr. Analytical languages: Apache Pig, Apache Hive. Coordination: Apache ZooKeeper. Fast read/write access: Apache HBase. Cloudera Manager Free Edition (installation wizard). 4

HDFS: Hadoop Distributed File System. A given file is broken down into blocks (default = 64 MB), then blocks are replicated across the cluster (default = 3 replicas). Optimized for: throughput; put/get/delete; appends. Block replication provides: durability; availability; throughput. Block replicas are distributed across servers and racks. 5
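
As a concrete illustration of the block and replication settings (an addition, not part of the original slides), here is a minimal sketch using the Java FileSystem API: it writes a small file with an explicit 64 MB block size and a replication factor of 3, then reads back what the NameNode recorded. The /tmp path and buffer size are illustrative, and fs.defaultFS is assumed to point at a reachable cluster.

import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: write a file to HDFS with an explicit block size and replication
// factor, then read back the values the NameNode recorded for it.
// Assumes fs.defaultFS in the classpath configuration points at a cluster.
public class HdfsBlockDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path path = new Path("/tmp/hdfs-block-demo.txt");   // illustrative path
        short replication = 3;                              // default from the slide
        long blockSize = 64L * 1024 * 1024;                 // 64 MB, the old default
        int bufferSize = 4096;

        try (FSDataOutputStream out =
                 fs.create(path, true, bufferSize, replication, blockSize)) {
            out.write("to be or not to be".getBytes(StandardCharsets.UTF_8));
        }

        FileStatus status = fs.getFileStatus(path);
        System.out.println("block size:  " + status.getBlockSize());
        System.out.println("replication: " + status.getReplication());
    }
}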

MapReduce: Resource Manager / Scheduler. A given job is broken down into tasks, then tasks are scheduled to be as close to the data as possible. Three levels of data locality: same server as the data (local disk); same rack as the data (rack/leaf switch); wherever there is a free slot (cross-rack). Optimized for: batch processing; failure recovery. The system detects laggard tasks and speculatively executes parallel tasks on the same slice of data. 6
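
Speculative execution is a per-job setting. The sketch below (an addition, not from the slides) toggles it for map and reduce tasks using the MRv2 property names mapreduce.map.speculative and mapreduce.reduce.speculative; the MRv1 keys of the CDH3 era were spelled differently, so treat the exact names as version-dependent. The job name is illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Sketch: toggling speculative execution per job. Speculation trades extra
// task slots for faster recovery from laggard (straggler) tasks.
public class SpeculationConfigDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setBoolean("mapreduce.map.speculative", true);
        conf.setBoolean("mapreduce.reduce.speculative", false);

        Job job = Job.getInstance(conf, "speculation-demo"); // illustrative name
        System.out.println("map speculation:    "
            + job.getConfiguration().getBoolean("mapreduce.map.speculative", false));
        System.out.println("reduce speculation: "
            + job.getConfiguration().getBoolean("mapreduce.reduce.speculative", false));
    }
}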

MapReduce: Computational Framework. Analogous to the shell pipeline: cat *.txt | mapper.pl | sort | reducer.pl > out.txt. [Dataflow diagram] Input splits 1..N of (docid, text) records (e.g. "To Be Or Not To Be?") feed Map tasks 1..M, which emit (word, count) pairs; the shuffle sorts and groups them by word; Reduce tasks 1..R emit (word, sum of counts) to output files 1..R (e.g. partial counts Be:12, Be:5, Be:7, Be:6 from the mappers sum to Be:30). Map(in_key, in_value) => list of (out_key, intermediate_value). Reduce(out_key, list of intermediate_values) => out_value(s). 7
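
The canonical word-count job, restated against the org.apache.hadoop.mapreduce Java API, follows the same shape as the shell pipeline above. This is a minimal sketch added for illustration, not something from the original slides; input and output paths come from the command line.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Word count: the mapper emits (word, 1) per token, the shuffle groups by
// word, and the reducer sums the counts per word.
public class WordCount {

    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);   // list of (out_key, intermediate_value)
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            result.set(sum);
            context.write(key, result);     // (out_key, out_value)
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}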

Hadoop High-Level Architecture. Hadoop Client: contacts the NameNode for data, or the JobTracker to submit jobs. NameNode: maintains the mapping of file names to blocks to DataNode slaves. JobTracker: tracks resources and schedules jobs across TaskTracker slaves. DataNode: stores and serves blocks of data. TaskTracker: runs tasks (work units) within a job. DataNodes and TaskTrackers share physical nodes. 8

Changes for Better Availability/Scalability. Federation partitions out the namespace; high availability comes via an active standby NameNode. Each job gets its own Application Master, and resource management is decoupled from MapReduce. [Diagram: the same client / NameNode / JobTracker / DataNode / TaskTracker layout as the previous slide, annotated with these changes.] 9

MapReduce Next Gen. The main idea is to split up the JobTracker's functions: cluster resource management (tracking and allocating nodes) and application life-cycle management (MapReduce scheduling and execution). This enables: high availability; better scalability; efficient slot allocation; rolling upgrades; non-MapReduce applications. 10
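
As a rough client-side illustration (an assumption based on later Hadoop 2 releases, not something shown in the slides), selecting the next-gen runtime is just a configuration choice: mapreduce.framework.name set to yarn, with the ResourceManager address pointing at the cluster. The host:port below is a placeholder.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Sketch: in the next-gen architecture a MapReduce job is just one kind of
// YARN application, and the runtime is chosen on the client side.
public class YarnSubmitDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("mapreduce.framework.name", "yarn");             // yarn, not local/classic
        conf.set("yarn.resourcemanager.address", "rm-host:8032"); // placeholder host:port

        Job job = Job.getInstance(conf, "yarn-submit-demo");      // illustrative name
        System.out.println("framework: "
            + job.getConfiguration().get("mapreduce.framework.name"));
    }
}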

HBase versus HDFS. HDFS is optimized for large files, sequential access (high throughput), and append-only writes; use it for fact tables that are mostly append-only and require sequential full-table scans. HBase is optimized for small records, random access (low latency), and atomic record updates; use it for dimension tables that are updated frequently and require random low-latency lookups. Not suitable for low-latency, interactive OLAP. 11
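
To illustrate the HBase access pattern described above (an addition, not part of the original slides), here is a minimal sketch using the later Connection/Table Java client API; the HTable-based API of the CDH3 era differed. The customers table and attr column family are illustrative names assumed to exist.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch: atomic row updates and low-latency point lookups keyed by row --
// the dimension-table access pattern from the slide.
public class HBaseLookupDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("customers"))) {

            // Random-access update of a single row (atomic at the row level).
            Put put = new Put(Bytes.toBytes("cust-0042"));
            put.addColumn(Bytes.toBytes("attr"), Bytes.toBytes("city"),
                          Bytes.toBytes("Beijing"));
            table.put(put);

            // Low-latency point lookup of the same row.
            Result result = table.get(new Get(Bytes.toBytes("cust-0042")));
            byte[] city = result.getValue(Bytes.toBytes("attr"), Bytes.toBytes("city"));
            System.out.println("city = " + Bytes.toString(city));
        }
    }
}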

What Next: The Long Term Vision A fully functional Big Data Operating System that allows you to: 1. Collect Data (in real-time). 2. Store Data (very reliably, securely). 3. Process Data (workload management). 4. Analyze Data (metadata management). 5. Serve Data (interactively, low-latency). 12

Hadoop in the Enterprise Data Stack. [Diagram] Users: engineers, data scientists, analysts, business users, data architects, system operators. Tools on top: IDEs, modeling tools, BI/analytics, enterprise reporting, metadata/ETL tools, Cloudera Manager; access via ODBC, JDBC, and NFS. Data flows in from logs, files, and web data via Flume; Sqoop moves data to and from the enterprise data warehouse, relational databases, and the online serving systems that back web/mobile applications for customers. 14

Three Deployment Options
Option 1: The Complete Solution. Cloudera Enterprise (Cloudera Manager + Support) + CDH. Fastest rollout; easiest to deploy, manage, and configure; fully supported; annual subscription.
Option 2: Free Download. Cloudera Manager Free Edition + CDH. No software cost; no support included; manage up to 50 nodes; limited functionality in Cloudera Manager.
Option 3: Raw Packages. CDH only. Self-install and configure; no software cost; no support included; more management and installation complexity; longer rollout for most customers.
Proprietary software + support: Cloudera Enterprise (Cloudera Manager, Cloudera Support); Cloudera Manager Free Edition (up to 50 nodes). 100% open-source solution: CDH (Cloudera's Distribution including Apache Hadoop), i.e. Apache Hadoop plus selected components. 15

Books 16