Hadoop: Embracing future hardware



Suresh Srinivas (@suresh_m_s)

About Me
- Architect and founder at Hortonworks
- Long-time Apache Hadoop committer and PMC member
- Designed and developed many key Hadoop features
- Experience supporting many clusters, including some of the world's largest Hadoop clusters

Agenda
- Hadoop Overview
- Changing Hardware Landscape
- Changing Use Cases
- Hierarchical Storage
- HDFS Cache
- Future

Hadoop Quick Overview
[Slide diagram: NameNode and RM masters above a set of slave nodes, each running tasks and holding disks DISC 1..N]
- Cluster of commodity servers: 3 to 1000s of nodes
- HDFS uses slave nodes for storage and large aggregated I/O bandwidth
- MapReduce uses the same nodes for computation
- Move computation to the data

Architecture
[Slide diagram: The NameNode owns the file system namespace (file create, delete, modify) and block management (block delete, replicate); the RM owns compute resource management and task placement. Each slave node runs a DataNode that stores blocks on its disks (DISC 1..N) and sends heartbeats and block reports to the NameNode, plus a TaskManager that runs tasks. Clients read and write data directly to the DataNodes.]
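
To make the client read/write path concrete, here is a minimal sketch against the standard Hadoop FileSystem API; the file path and contents are illustrative only, and the comments describe the division of labor between NameNode and DataNodes shown in the diagram.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        // The client asks the NameNode for namespace and block metadata,
        // then streams the actual bytes directly to and from DataNodes.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/tmp/example.txt");  // hypothetical path

        // Write: the NameNode creates the file entry and allocates blocks;
        // the data is pipelined to the DataNodes holding the replicas.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello hdfs".getBytes("UTF-8"));
        }

        // Read: the NameNode returns block locations; the client reads the
        // block data from a nearby DataNode.
        byte[] buf = new byte[32];
        try (FSDataInputStream in = fs.open(file)) {
            int n = in.read(buf);
            System.out.println(new String(buf, 0, n, "UTF-8"));
        }
    }
}
```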

Commodity Hardware in 2008
- Commodity 1U server: 2 quad-core CPUs, 4x1TB SATA disks per node, 8 GB RAM per node
- 4000-node cluster at Yahoo: 1 gigabit Ethernet on each node, 40 nodes per rack, 4 gigabit Ethernet uplinks from each rack to the core
- Over 30,000 cores and 32 TB of memory, with nearly 16 PB of raw disk!
- 10 gigabit Ethernet was still expensive

Applications in 2008
- Predominantly batch oriented
- Only MapReduce for computation (Hive, Pig)
- Batch-oriented access patterns: sequential reads and writes, throughput over latency
- Low bar for durability: file deleted when not closed
- HBase: random reads, many features added, durability using sync

New Clusters
- CPU: Moore's law, CPU transistors/performance doubling every 2 years; more cores, faster memory channels, I/O interfaces, accelerators
- Storage: increasing storage density (3TB/4TB disks), but I/O throughput and latencies stay the same
- New storage choices: SSD, NVRAM; RAM as cache instead of just in-memory application data (100s of GBs)
- Commodity 1U/2U/4U server: more performance (2 CPUs with 8/12/16 cores), more storage (4x4TB / 6x4TB / 12x4TB SATA disks per node), more memory (72GB/128GB/196GB RAM per node)
- Cluster network: 10 gigabit intra-rack Ethernet, InfiniBand in some cases

New Clusters
- More storage, compute, memory, and network bandwidth per node
- A 2008 cluster of 4000 1U nodes = a cluster of 500 to 1000 1U nodes today
- Conversely, 4000 nodes in 2013 = ~12000 nodes in 2008!
- The software assumptions of 2008 need a revisit

Challenges
- Software needs better scalability: better processors, more {cores, storage, memory}
- HDFS: more files and blocks, more blocks per DataNode, bigger block reports
- Processing: a higher number of parallel tasks in a cluster, higher network bandwidth
- Current assumptions are no longer valid: storage is uniform, nodes are uniform

New Applications
- New applications/use cases: from batch processing to interactive query (Stinger/Tez); memory hungry, latency sensitive, random reads, memory cache
- New processing frameworks: from batch processing to streaming and graph processing
- New deployment models:
  - Cloud / Hadoop as a service: higher-density nodes (240x the 2008 node size!), VMs instead of physical machines
  - Appliance: higher-density nodes, InfiniBand

Scalability
- HDFS Federation: support for multiple independent namespaces, more files and blocks, more storage
- YARN: resource management separated from the compute framework
  - Support for new types of compute frameworks; MapReduce is just one of them
  - More parallel tasks in the cluster
  - Better resource management and isolation with cgroups

Heterogeneous Storage
- Current architecture: the DataNode is a single storage
  - Storage is uniform; the only storage type is disk
  - Storage types are hidden from the file system; all disks appear as a single storage
- New architecture: the DataNode is a collection of storages
  - Support different types of storage: disk, SSDs, memory
  - Support external storages: S3, OpenStack Swift, SAN, filers
  - A collection of tiered storages

Heterogeneous Storage
- Support new hardware/storage types
  - DataNodes configured with per-volume storage types
  - Block report per storage for better scalability
- Applications choose the storage tier appropriate for the workload (see the sketch below)
  - SSD for random reads: HBase and other query loads
  - Memory-only data: intermediate data
- Mixed storage type support
  - Some copies on SSD and some on disk
  - Some copies on disk in HDFS and some on S3/OpenStack Swift
  - Storage type preference per file; tiered storages
- Multi-tenant support: quotas per storage type
- Work in progress: HDFS-2832
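
To show how an application might pick a storage tier per path, here is a minimal client-side sketch using the per-path storage-policy API that later shipped in Apache Hadoop on top of this work (DistributedFileSystem.setStoragePolicy, Hadoop 2.6+). The paths are hypothetical, and the ALL_SSD / LAZY_PERSIST policy names come from those later releases rather than from the slide itself.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class StorageTierExample {
    public static void main(String[] args) throws Exception {
        // Assumes fs.defaultFS points at an HDFS cluster whose DataNodes are
        // configured with both DISK and SSD volumes (per-volume storage types).
        Configuration conf = new Configuration();
        DistributedFileSystem dfs = (DistributedFileSystem) FileSystem.get(conf);

        // Hypothetical paths, chosen to mirror the workloads on the slide.
        Path hbaseData = new Path("/apps/hbase/data");   // random-read query load
        Path scratch   = new Path("/tmp/intermediate");  // short-lived intermediate data

        // Keep all replicas of the HBase data on SSD volumes.
        dfs.setStoragePolicy(hbaseData, "ALL_SSD");

        // Write intermediate data to memory first, lazily persisting it to disk.
        dfs.setStoragePolicy(scratch, "LAZY_PERSIST");
    }
}
```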

HDFS Cache
- Coordinated caching of data in DataNodes
  - Cache dimension tables for Hive
  - Cache data required by multiple queries
- Mechanism
  - Off-heap memory; fast zero-copy access to data using DirectByteBuffer
  - Caching by pinning data in memory: mmap, mlock, munlock
  - Entire blocks are cached
- Administrator/user issues commands to cache the data (see the sketch below)
  - Caching is specified using paths to files or directories
  - A directory-based cache adds newly created files to the cache
  - After use, the admin issues a command to uncache the data
- Quotas using hierarchical pools; each pool has min/max bytes
- Work in progress: HDFS-4949
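
For illustration, a minimal sketch of driving the cache from a Java client, using the cache pool and cache directive API that shipped with HDFS-4949 (centralized cache management); the pool name, byte quota, and path are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.CacheDirectiveInfo;
import org.apache.hadoop.hdfs.protocol.CachePoolInfo;

public class HdfsCacheExample {
    public static void main(String[] args) throws Exception {
        DistributedFileSystem dfs =
                (DistributedFileSystem) FileSystem.get(new Configuration());

        // Create a cache pool with a byte limit (pools carry the quotas).
        dfs.addCachePool(new CachePoolInfo("hive-dimensions")
                .setLimit(64L * 1024 * 1024 * 1024));  // 64 GB cap, illustrative

        // Pin a dimension-table directory into DataNode memory; files created
        // directly under the directory are added to the cache as well.
        long directiveId = dfs.addCacheDirective(
                new CacheDirectiveInfo.Builder()
                        .setPath(new Path("/warehouse/dim_tables"))  // hypothetical path
                        .setPool("hive-dimensions")
                        .build());

        // ... run the queries that benefit from the cached blocks ...

        // Uncache the data once it is no longer needed.
        dfs.removeCacheDirective(directiveId);
    }
}
```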

Future
- Dynamic caching: access-pattern-based caching of hot data (LRU, LRU2); caching of partial blocks
- Dynamic migration of data between storage tiers
- Multiple network interface support: better aggregated bandwidth utilization, isolation of traffic
- NVRAM support: better durability without the write-performance cost; file system metadata on NVRAM for better throughput
- Hardware security modules: better key management; processing that requires higher security runs only on these nodes; an important requirement for financial services and healthcare

Q & A
Thank You!