Where is Hadoop Going Next?




Where is Hadoop Going Next?
Owen O'Malley
owen@hortonworks.com
@owen_omalley
November 2014

Who am I?
- Worked at Yahoo Search
  - WebMap in a Week
  - Dreadnaught to Juggernaut to Hadoop
- MapReduce
- Security
- Hive
- Apache/Open Source Champion
- PhD in Software Engineering from UC Irvine

Topics
- Hadoop History
- Themes
  - Storage
  - Computation
  - Security

"A beginning is the time for taking the most delicate care that the balances are correct." - Herbert

What was the Problem?
- Yahoo needed to build WebMaps faster
  - Whole web analysis for Yahoo Search
  - WebMap in a Week
- WebMap used Dreadnaught
  - Roughly like MapReduce and HDFS
  - Scaled to 800 machines
  - Assigned nodes in backup pairs
  - Single application cluster
- Started on C++ DFS & MapReduce

What did Hadoop Do Right?
- Focus on a few customers
  - Helped Yahoo Search analytics team
  - Terasort benchmarks
- Expected Failures
  - Storage corrects automatically
  - Healthy in minutes instead of hours
  - Nodes are automatically assigned
- No chokepoints
  - Data never travels through singleton
  - RAM isn't large enough

What did Hadoop Do Right?
- Simplified FileSystem abstraction
  - No random writes
- Apache
  - Many companies working together
  - Open governance
- Open Source
  - Many hands and eyes
  - Use the source, Luke
- Open platform

Storage

"The more storage you have, the more stuff you accumulate." - Stewart

HDFS
- Phases
  - Single HDFS NameNode
  - Cross cluster references
  - Federated HDFS NameNodes
- Need HDFS Block Storage factored out
  - Wider variety of applications
- Need co-location of files
  - Not entire table, but sections of the table
  - ACID (and HBase) base and delta files
  - Correlated tables

File Formats
- Phases
  - Text and Sequence File
  - RCFile
  - Avro
  - ORC and Parquet
- Columnar formats
  - Type specific encoding
  - Self describing metadata at end

ORC
- Light-weight indexes
  - Predicate pushdown
  - Answers from metadata
  - Seeking within file
- Available from Hive, Pig, & MapReduce
- C++ reader/writer coming
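The idea behind light-weight indexes and predicate pushdown can be sketched as follows. This is a minimal illustration, not ORC's actual reader API or metadata layout: the class `MinMaxIndexSketch`, the `RowGroupStats` type, and the fixed-size row groups are all assumptions made for the example. Each group of rows records its minimum and maximum value, and a range predicate is checked against those statistics so that groups which cannot match are never read.

```java
import java.util.ArrayList;
import java.util.List;

class MinMaxIndexSketch {
    /** Hypothetical per-row-group statistics; not ORC's real metadata format. */
    static final class RowGroupStats {
        final int firstRow;
        final int rowCount;
        final long min;
        final long max;

        RowGroupStats(int firstRow, int rowCount, long min, long max) {
            this.firstRow = firstRow;
            this.rowCount = rowCount;
            this.min = min;
            this.max = max;
        }
    }

    /** Build min/max statistics for fixed-size row groups of one column. */
    static List<RowGroupStats> buildIndex(long[] column, int groupSize) {
        List<RowGroupStats> index = new ArrayList<>();
        for (int start = 0; start < column.length; start += groupSize) {
            int end = Math.min(start + groupSize, column.length);
            long min = column[start];
            long max = column[start];
            for (int i = start + 1; i < end; i++) {
                min = Math.min(min, column[i]);
                max = Math.max(max, column[i]);
            }
            index.add(new RowGroupStats(start, end - start, min, max));
        }
        return index;
    }

    /**
     * Pushed-down predicate "lo <= value <= hi": return only the row groups
     * whose [min, max] range overlaps it. All other groups can be skipped
     * without reading their data.
     */
    static List<RowGroupStats> groupsToRead(List<RowGroupStats> index, long lo, long hi) {
        List<RowGroupStats> hits = new ArrayList<>();
        for (RowGroupStats g : index) {
            if (g.max >= lo && g.min <= hi) {
                hits.add(g);
            }
        }
        return hits;
    }
}
```

The `firstRow` field stands in for "seeking within file": a reader would jump straight to the surviving groups. Statistics like these are also what lets some queries, such as a column minimum or maximum, be answered from metadata alone.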

Computation

"A process cannot be understood by stopping it. Understanding must move with the flow of the process, must join it and flow with it." - Herbert

Why does Hadoop Need ACID?
- Hadoop and Hive have always worked without ACID
  - Perceived as tradeoff for performance
  - Add or replace entire partitions
- But, your data isn't static
  - It changes daily, hourly, or faster
  - Managing change makes the user's life better
- Need consistent views of changing data!

Use Cases
- Updating a Dimension Table
  - Changing a customer's address
- Delete Old Records
  - Remove records for compliance
- Update/Restate Large Fact Tables
  - Fix problems after they are in the warehouse
- Streaming Data Ingest
  - A continual stream of data coming in

Longer Term Use Cases
- Multiple statement transactions
  - Group statements that need to work together
- Query tables as they appeared in past
  - Configurable length of history
- Row-level lineage
  - Track users and queries that updated each row

Design
- HDFS Does Not Allow Arbitrary Writes
  - Store changes as delta files
  - Stitched together by client on read
- Writes get a Transaction ID
  - Sequentially assigned by Metastore
- Reads get Committed Transactions
  - Provides snapshot consistency
  - No locks required
  - Provide a snapshot of data from start of query
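The design above can be sketched as a reader-side merge. This is an illustration under stated assumptions, not Hive's actual implementation: the class `DeltaMergeSketch`, the `Change` record shape, the use of a `null` value to mark a delete, and the in-memory maps standing in for base and delta files are all hypothetical. The reader starts from the base rows and applies, in transaction order, only the changes whose transaction ID was already committed when the query started.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

class DeltaMergeSketch {
    /** One change in a hypothetical delta file. A null value marks a delete. */
    static final class Change {
        final long txnId;
        final int rowKey;
        final String value;

        Change(long txnId, int rowKey, String value) {
            this.txnId = txnId;
            this.rowKey = rowKey;
            this.value = value;
        }
    }

    /**
     * Stitch base rows and delta changes together on read.
     * Only transactions in committedAtStart are visible, which gives the
     * query a consistent snapshot without taking any locks.
     */
    static Map<Integer, String> readSnapshot(Map<Integer, String> base,
                                             List<Change> deltas,
                                             Set<Long> committedAtStart) {
        Map<Integer, String> view = new TreeMap<>(base);

        // Apply changes in transaction-ID order (IDs are assigned sequentially).
        List<Change> ordered = new ArrayList<>(deltas);
        ordered.sort((a, b) -> Long.compare(a.txnId, b.txnId));

        for (Change c : ordered) {
            if (!committedAtStart.contains(c.txnId)) {
                continue; // open or later transaction: invisible to this snapshot
            }
            if (c.value == null) {
                view.remove(c.rowKey);       // delete
            } else {
                view.put(c.rowKey, c.value); // insert or update
            }
        }
        return view;
    }
}
```

Because the base and delta inputs are never modified, a writer can keep appending new deltas while an older query continues to see the snapshot it started with.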

Vectorization
- MapReduce's RecordReader
  - boolean next(K key, V value);
- Better to process 1000 rows at a time
  - Amortizes the cost of method calls
- Use primitive arrays for tight inner loops
  - No access methods
- Extremely important for operator trees
  - Branches (including virtual dispatch) kill pipelining
- Can run at 100 million rows/second
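The contrast between row-at-a-time and batched processing can be sketched as follows. The interfaces `RowSource` and `BatchSource`, the class `VectorizationSketch`, and the array-backed adapters are assumptions made for this example; they are not Hive's vectorized reader classes. The row-at-a-time loop pays one interface call per row, while the batched loop pays one call per 1000 rows and then runs a tight loop over a primitive array.

```java
class VectorizationSketch {
    static final int BATCH_SIZE = 1000;

    /** Row-at-a-time source, loosely modelled on a next()-style reader (hypothetical). */
    interface RowSource {
        boolean next(long[] holder); // writes one value into holder[0]
    }

    /** Batch source: fills a primitive array and returns the row count, 0 at end (hypothetical). */
    interface BatchSource {
        int nextBatch(long[] batch);
    }

    /** One virtual call per row. */
    static long sumRowAtATime(RowSource source) {
        long[] holder = new long[1];
        long sum = 0;
        while (source.next(holder)) {
            sum += holder[0];
        }
        return sum;
    }

    /** One virtual call per batch, then a branch-light loop over a primitive array. */
    static long sumVectorized(BatchSource source) {
        long[] batch = new long[BATCH_SIZE];
        long sum = 0;
        int n;
        while ((n = source.nextBatch(batch)) > 0) {
            for (int i = 0; i < n; i++) {
                sum += batch[i];
            }
        }
        return sum;
    }

    /** Demo adapter: expose an array one row at a time. */
    static RowSource rowsOf(long[] data) {
        int[] pos = {0};
        return holder -> {
            if (pos[0] >= data.length) {
                return false;
            }
            holder[0] = data[pos[0]++];
            return true;
        };
    }

    /** Demo adapter: expose an array in batches. */
    static BatchSource batchesOf(long[] data) {
        int[] pos = {0};
        return batch -> {
            int n = Math.min(batch.length, data.length - pos[0]);
            System.arraycopy(data, pos[0], batch, 0, n);
            pos[0] += n;
            return n;
        };
    }
}
```

Both methods compute the same result; the difference is only in how often control leaves the inner loop, which is the cost the slide says vectorization amortizes.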

Tez
- Replacing MapReduce as basis for Hive, Pig, Cascading
- Executes entire DAG of tasks
  - More options for shuffle
- Scales up and down dynamically
- Queries scheduled as one application instead of waves of jobs

Hive Cost Based Optimizer
- Current optimizer is a mess of rules
  - Rule interactions are complex
- Optiq provides a framework
  - YACC for optimizers
- Make better choices
  - Huge impact on performance
  - Obsoletes lots of old advice

LLAP: Live Long and Process
- Persistent Hive execution engine
  - JVM startup costs are huge
  - JIT cost alone is staggering
- Hot Table Data Caching
  - Keep hot columns and partitions in memory
- Sub-second answers

Security

"There is no such thing as perfect security, only varying levels of insecurity." - Rushdie

Audit and Authorization
- Three A's of security
  - Authentication, Authorization, and Audit
- Phases
  - No users
  - Users, but no authentication
  - Authorization
  - Next: centralized authorization and audit
  - Encryption

Encryption
- Underlying file system
  - Thief breaks into data center
- HDFS encryption
  - Parallels HDFS authorization
  - Prevents AFN attacks
- Column encryption
  - Encrypt just PII columns, rolling keys
- Value encryption
  - No salt (weak sauce) so joins work

Thank You!
Questions & Answers

© Hortonworks Inc. 2013