Big Data Trends and HDFS Evolution

Similar documents

Hadoop: Embracing future hardware

HDFS Federation. Sanjay Radia Founder and Hortonworks. Page 1

Reference Architecture and Best Practices for Virtualizing Hadoop Workloads Justin Murray VMware

Deploying Flash- Accelerated Hadoop with InfiniFlash from SanDisk

Take An Internal Look at Hadoop. Hairong Kuang Grid Team, Yahoo! Inc

THE HADOOP DISTRIBUTED FILE SYSTEM

Virtualizing Apache Hadoop. June, 2012

Big Data Technology Core Hadoop: HDFS-YARN Internals

HDFS Under the Hood. Sanjay Radia. Grid Computing, Hadoop Yahoo Inc.

Introduction to Cloud Computing

ENABLING GLOBAL HADOOP WITH EMC ELASTIC CLOUD STORAGE

Design and Evolution of the Apache Hadoop File System(HDFS)

Hadoop Architecture. Part 1

The Evolving Apache Hadoop Eco-System

Clodoaldo Barrera Chief Technical Strategist IBM System Storage. Making a successful transition to Software Defined Storage

The Hadoop Distributed File System

CSE-E5430 Scalable Cloud Computing Lecture 2

Hadoop Distributed File System. Jordan Prosch, Matt Kipps

The Inside Scoop on Hadoop

Hadoop on OpenStack Cloud. Dmitry Mescheryakov Software

Benchmarking Sahara-based Big-Data-as-a-Service Solutions. Zhidong Yu, Weiting Chen (Intel) Matthew Farrellee (Red Hat) May 2015

Hadoop & its Usage at Facebook

Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components

Zadara Storage Cloud A

Distributed File Systems

How To Run Apa Hadoop 1.0 On Vsphere Tmt On A Hyperconverged Network On A Virtualized Cluster On A Vspplace Tmter (Vmware) Vspheon Tm (

Oracle Database - Engineered for Innovation. Sedat Zencirci Teknoloji Satış Danışmanlığı Direktörü Türkiye ve Orta Asya

DISTRIBUTED SYSTEMS [COMP9243] Lecture 9a: Cloud Computing WHAT IS CLOUD COMPUTING? 2

Storage Architectures for Big Data in the Cloud

Microsoft Private Cloud Fast Track Reference Architecture

Hadoop & its Usage at Facebook

Apache Hadoop. Alexandru Costan

Journal of science STUDY ON REPLICA MANAGEMENT AND HIGH AVAILABILITY IN HADOOP DISTRIBUTED FILE SYSTEM (HDFS)

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

Big Data Use Case. How Rackspace is using Private Cloud for Big Data. Bryan Thompson. May 8th, 2013

Automating Big Data Benchmarking for Different Architectures with ALOJA

Cloud Computing Trends

PEPPERDATA IN MULTI-TENANT ENVIRONMENTS

HDP Hadoop From concept to deployment.

Virtualization and Cloud Computing

Hadoop & Spark Using Amazon EMR

RED HAT STORAGE PORTFOLIO OVERVIEW

2009 Oracle Corporation 1

Apache Ignite TM (Incubating) - In- Memory Data Fabric Fast Data Meets Open Source

Solving I/O Bottlenecks to Enable Superior Cloud Efficiency

Cloud Sure - Virtual Machines

CUMULUX WHICH CLOUD PLATFORM IS RIGHT FOR YOU? COMPARING CLOUD PLATFORMS. Review Business and Technology Series

Overview. Big Data in Apache Hadoop. - HDFS - MapReduce in Hadoop - YARN. Big Data Management and Analytics

Big Data With Hadoop

Distributed File Systems

IaaS Cloud Architectures: Virtualized Data Centers to Federated Cloud Infrastructures

On- Prem MongoDB- as- a- Service Powered by the CumuLogic DBaaS Platform

Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms

Cloud Computing. Chapter 1 Introducing Cloud Computing

White Paper Storage for Big Data and Analytics Challenges

STeP-IN SUMMIT June 18 21, 2013 at Bangalore, INDIA. Performance Testing of an IAAS Cloud Software (A CloudStack Use Case)

Scalable Architecture on Amazon AWS Cloud

Lecture 5: GFS & HDFS! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl

Introduction to Cloud : Cloud and Cloud Storage. Lecture 2. Dr. Dalit Naor IBM Haifa Research Storage Systems. Dalit Naor, IBM Haifa Research

Extending Hadoop beyond MapReduce

Big data blue print for cloud architecture

HadoopTM Analytics DDN

Het is een kleine stap naar een hybrid cloud

HDFS Users Guide. Table of contents

Savanna Hadoop on. OpenStack. Savanna Technical Lead

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM

Big Fast Data Hadoop acceleration with Flash. June 2013

Datacenters and Cloud Computing. Jia Rao Assistant Professor in CS

NoSQL and Hadoop Technologies On Oracle Cloud

EXPERIMENTATION. HARRISON CARRANZA School of Computer Science and Mathematics

Ground up Introduction to In-Memory Data (Grids)

Technology Insight Series

HDP Enabling the Modern Data Architecture

Oracle s Cloud Computing Strategy

Copyright 2012, Oracle and/or its affiliates. All rights reserved.

Apache HBase. Crazy dances on the elephant back

HDFS Architecture Guide

Nutanix Tech Note. Configuration Best Practices for Nutanix Storage with VMware vsphere

EMC IRODS RESOURCE DRIVERS

Enabling Database-as-a-Service (DBaaS) within Enterprises or Cloud Offerings

StorReduce Technical White Paper Cloud-based Data Deduplication

Nutanix Solutions for Private Cloud. Kees Baggerman Performance and Solution Engineer

The Pros and Cons of Erasure Coding & Replication vs. RAID in Next-Gen Storage Platforms. Abhijith Shenoy Engineer, Hedvig Inc.

Cloud computing - Architecting in the cloud

Accelerating Real Time Big Data Applications. PRESENTATION TITLE GOES HERE Bob Hansen

Hadoop Distributed File System. T Seminar On Multimedia Eero Kurkela

Oracle Database Public Cloud Services

Hadoop in the Hybrid Cloud

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney

The Virtualization Practice

Introduction to Cloud Computing

Scaling Out With Apache Spark. DTL Meeting Slides based on

The last 18 months. AutoScale. IaaS. BizTalk Services Hyper-V Disaster Recovery Support. Multi-Factor Auth. Hyper-V Recovery.

YARN Apache Hadoop Next Generation Compute Platform

Architecting for the next generation of Big Data Hortonworks HDP 2.0 on Red Hat Enterprise Linux 6 with OpenJDK 7

Microsoft Private Cloud Fast Track

Transcription:

Big Data Trends and HDFS Evolution Sanjay Radia Founder & Architect Hortonworks Inc Page 1

Hello Founder, Hortonworks Part of the Hadoop team at Yahoo! since 2007 Chief Architect of Hadoop Core at Yahoo! Long time Apache Hadoop PMC and Committer Designed and developed several key Hadoop features Prior Data center automation, virtualization, Java, HA, OSs, File Systems (Startup, Sun Microsystems, ) Ph.D., University of Waterloo Page 2

Overview Hadoop (HDFS, Yarn) Trends Hardware Applications Deployment Model Addressing the trends Storage architecture changes Archival Microservers Public Clouds Private Clouds and VMs Page 3

Hadoop: Scalable, Reliable, Manageable Scale IO, Storage, CPU Add commodity servers & JBODs 6K nodes in cluster, 120PB Switch ` Switch Core Switch Switch Hortonworks Inc. 2012 2013. Confidential and Proprietary. r r Fault Tolerant & Easy management r Built in redundancy r Tolerate disk and node failures r Automatically manage addition/ removal of nodes r One operator per 3K node!! Storage server used for computation r Move computation to data r High-bandwidth network access to data via Ethernet r Scalable file system (Not a SAN) r Read, Write, rename, append r No random writes r YARN: General Execution Engine r Not just MapReduce r SQL, ETL, Streaming, ML, Services Simplicity of design why a small team could build such a large system in the first place 4

Trends Cloud usage Public Private Applications Streaming ML, Platform Change Network Bandwidth Cores Memory sizes Flash storage Disk density VMs Micro-servers Usage Patterns Batch to Interactive Data sizes Hortonworks Inc. 2012 2013. Confidential and Proprietary. Page 6

Heterogeneous Storage Architectural Change at lower level that changes how HDFS views storage Page 10

Heterogeneous Storage Original HDFS Architecture DataNode is a single storage unit Storage is uniform - Only storage type Disk Storage types hidden from the file system All disks as a single storage New Architecture DataNode is a collection of storages Storage type exposed to NN and Clients Support different types of storages Disk, SSDs, Memory Support external storages S3, OpenStack Swift Collection of tiered storages S3 Swift SAN Filers Architecting Hortonworks Inc. 2013 the Future of Big Data Page 11

Heterogeneous Storage Support new hardware/storage types DataNodes configured with per-volume storage types Block report per storage better scalability for large and denser drives Mixed storage type support and Automatic Tiering Some replicas in SSD and some on disk Some replicas on disk in HDFS and some on S3/OpenStack Swift Move across tiers based on usage Memory is important storage tier Memory caching for hot files Memory for intermediate files Multi-tenant support quotas per storage type APIs to let HDFS manage the data migration across tiers HDFS evolves towards a Data-Fabric Architecting Hortonworks Inc. 2013 the Future of Big Data Page 12

Archival Hadoop makes it cost effective to store data for several years Only a portion of the data is hot/warm, but the rest is still online Two models 1. Separate cluster with low-power nodes and regular disks But Spindles are the bottleneck Better of putting those spindles on active clusters 2. Lower-cost slower disks in the same cluster Slower disks become another tier Move cold data to slower disks All replicas One or two of the replicas (if data is lukewarm) Hybrid some special nodes that are Disk intensive and low powered CPU Other: Erasure codes for lukewarm and cold data Page 13

Micro-servers Large number of small servers Lower power and unused servers can be shutdown Footprint Storage - Disks or partitions can be moved from one to another Challenges for HDFS/Hadoop Allocating entire spindles is better Allocating partitions may not be most suitable because of seek contention across DataNodes that share the disk seeks Shutting down a drive does not make sense Some challenges in moving a drive to another micro DataNode The OS on the other drive needs to take it over Recent changes due to Heterogeneous storage helps here But multiple replicas on same node will need to be handled Replica allocation will have to take this into account Page 14

Public Clouds Hortonworks Inc. 2012: DO NOT SHARE. CONTAINS HORTONWORKS CONFIDENTIAL & PROPRIETARY INFORMATION Hortonworks Inc. 2013 Page 15

Public Clouds A Spectrum Virtual Machines Hosted Deployed Managed Hadoop Clusters Hadoop App as a Service Run/Manage Your Hadoop Cluster On VMs Cluster deployed on VMs or private hosts with easy Cluster Management & Job Management tools Managed virtual cluster that uses a shared Hadoop. Just submit jobs/apps Just submit your job/app MS Azure AWS EC2 Rackspace Cloud MS HDInsight (VMs) Rackspace (VMs or Host in cage) Altiscale AWS EMR ATT You Spin up/down cluster Data on external store (S3, Azure storage) across Cluster lifecycles Hosted: Cluster/Data persists Shared HDFS but Isolated Namespaces Shared Yarn but isolation via Containers or VMs Automatic Spin up/down Data on external store (S3 Azure storage) across Cluster lifecycles, Hortonworks Inc. 2012 2013. Confidential and Proprietary. Page 16

Cluster Lifecycle and Data Persistence For a few: the Cluster and HDFS data persists Rackspace (Hosted and managed private cluster) Altiscale For most: clusters spun up and down Main challenge: need to persist data across cluster lifecycle Local data disappears when cluster is spun down External K-V store Access remote storage (slow) Copy, then process, process, process, spin down (e.g. Netflix) How can we do better? Interleaved storage cluster across rack (an example later) Shadow HDFS that caches K-V Store great for reading Use the K-V store as 4 th lazy HDFS replica works for writing Page 17

Use of External Storage when cluster is Spinup/Spindown S3 Checkpoint Image +Blocks Local HDFS Shadow Mount (Cache) S3 Lazy write Of 4 th replica DataNode Shadow mount external storage (S3) great for reading Cache one or two replicas at mount time when missing, fetch from source Hadoop apps access cached S3 data on local HDFS Use S3 to store blocks and check-pointed image works for writes Write lazy of 4 th replica When cluster is spun down, all block and image must be flushed to S3 Page 18

An Interesting Hadoop as service configuration Core Switch Switch ` Switch ` Switch ` Storage HDFS cluster - Stores data that persists when compute clusters are spun down Transient Hadoop cluster - Clusters (VMs) are spun up and down - Have transient local storage HDFS Cluster for data external to the compute cluster but stripped to allow rack level access Persists when the dynamic customer clusters are spun-down Much better performance than external S3 Page 19

VMs and Hadoop in Private Clouds Hortonworks Inc. 2012: DO NOT SHARE. CONTAINS HORTONWORKS CONFIDENTIAL & PROPRIETARY INFORMATION Hortonworks Inc. 2013 Page 20

Virtual Machines and Hadoop Traditional Motivation for VMs Easier to install and manage Elasticity expand and shrink service capacity Relies on expensive shared storage so that data can be accessed from any VM Infrastructure consolidation and improved utilization Some of the VM benefits require shared storage But external shared storage (SAN/NAS) does not make sense for Hadoop Potential Motivation for Hadoop on VMs Public clouds mostly use VMs for isolation reasons Motivation for VMs for private clouds Test, dev clusters-on-fly to allow independent version and cluster management Share infrastructure with other non-hadoop VMs Production clusters Elasticity? Multi-tenancy isolation? Page 21

Virtual Machines and Hadoop Elasticity in Hadoop via VMs Elastic DataNodes? - Not a good idea lots of data to be copied Elastic Compute Nodes works but some challenges Need to allocate local disk partition for shuffle data VM per-tenant will fragment your local storage - This can be fixed by moving tmp storage to HDFS will be enabled in the future Local read optimization (short circuit read) is not available to VM Resource management & Isolation excellent in Hadoop Yarn has flexible resource model (CPU, Memory, others coming) It manages across the nodes Yarn s Capacity scheduler gives capacity guarantees Isolation supported via Linux Containers Docker helps with image management Page 22

Guideline for VMs for Hadoop VMs for Test, Dev, Poc, Staging Clusters works well Can share Host with non-hadoop VMs For such use cases having independent clusters is fine Do not have lots of data or storage fragmentation is less concern Independent clusters allow each to have a different Hadoop version Some customers: a native shared Hadoop cluster for testing apps Production Hadoop VMs generally less attractive But if you do, consider motivation, and configure accordingly (next slide) Hadoop has built-in elasticity for CPU and Storage resource management capacity guarantees for multi-tenancy isolation for multi-tenancy Page 23

Configurations for Hadoop VMs CN DN Model 1 Base cluster: Data+Compute on each VM Expandd compute CN CN DN Model 2 Per-tenant compute cluster Base HDFS cluster: a DataNode on a VM Share the cluster unless you need per-tenant Hadoop-version or more isolation Model 1 is generally better ComputeNode/DataNode shortcut optimization Less storage fragmentation (intermediate shuffle data) Share cluster: Hadoop has excellent Tenant/Resource isolation Use Model 2 if you need need additional tenant isolation but shared data Page 25

Summary Hadoop Virtualizes the Cluster s HW Resources across Tenants Hadoop for the Public Cloud Most based on VMs due to strong complete isolation requirement Altiscale, Rackspace offer additional models Main challenge is data persistence as clusters are spun up/down We describe best practices and what is coming Hadoop for the Private Cloud For test, dev, Poc clusters Both Hadoop on raw machines and VMs are a good idea VMs allow cluster-on-the-fly plus ability to control version for each user For production clusters Hadoop natively provides excellent elasticity, resource management and multi-tenancy If you must use VMs Understand your motivations well and configure accordingly For VMs, follow the best practices we provided Page 26

Closing Remarks Storage is fundamental to Big Data HDFS: Central storage for an enterprise s data Data Lake Buil your Analytics, ETL, etc. around the centrally stored data HDFS provides a proven, rock-solid file system Some create FUD around HDFS but Proven reliability, scalability Has the key enterprise features HDFS has evolved and continues to evolve Improvements on features, scalability, performance will continue Evolve to a Data-fabric rather than mere file storage Tiering of storage type: memory, flash, disks, archival, S3 Data moves across tiers according to application needs Hooks for upper layers to influence how Hadoop handles data Page 29

Q & A Thank You Hortonworks Inc. 2012: DO NOT SHARE. CONTAINS HORTONWORKS CONFIDENTIAL & PROPRIETARY INFORMATION Hortonworks Inc. 2013 Page 30