HDFS Snapshots and Beyond

Similar documents

Sujee Maniyam, ElephantScale

Hadoop Distributed File System. Dhruba Borthakur June, 2007

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee June 3 rd, 2008

How Bigtop Leveraged Docker for Build Automation and One-Click Hadoop Provisioning

How To Use Cloudera Manager Backup And Disaster Recovery (Brd) On A Microsoft Hadoop (Clouderma) On An Ubuntu Or 5.3.5

The Hadoop Distributed File System

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee

Hadoop: Embracing future hardware

Take An Internal Look at Hadoop. Hairong Kuang Grid Team, Yahoo! Inc

HDFS Federation. Sanjay Radia Founder and Hortonworks. Page 1

HDFS Users Guide. Table of contents

Distributed File Systems

Apache Hadoop FileSystem and its Usage in Facebook

Extended Attributes and Transparent Encryption in Apache Hadoop

DistCp Guide. Table of contents. 3 Appendix Overview Usage Basic Options... 3

docs.hortonworks.com

HADOOP MOCK TEST HADOOP MOCK TEST I

HDFS Under the Hood. Sanjay Radia. Grid Computing, Hadoop Yahoo Inc.

CURSO: ADMINISTRADOR PARA APACHE HADOOP

Lecture 2 (08/31, 09/02, 09/09): Hadoop. Decisions, Operations & Information Technologies Robert H. Smith School of Business Fall, 2015

Design and Evolution of the Apache Hadoop File System(HDFS)

5 HDFS - Hadoop Distributed System

Introduction to HDFS. Prasanth Kothuri, CERN

<Insert Picture Here> Big Data

THE HADOOP DISTRIBUTED FILE SYSTEM

Apache Hadoop. Alexandru Costan

Snapshots in Hadoop Distributed File System

Distributed Filesystems

Cloudera Backup and Disaster Recovery

Hadoop & its Usage at Facebook

Hadoop & its Usage at Facebook

CDH AND BUSINESS CONTINUITY:

Deploying Hadoop with Manager

Comparative analysis of mapreduce job by keeping data constant and varying cluster size technique

Introduction to HDFS. Prasanth Kothuri, CERN

Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms

Certified Big Data and Apache Hadoop Developer VS-1221

Hadoop Distributed File System. Jordan Prosch, Matt Kipps

HDFS Space Consolidation

Non-Stop for Apache HBase: Active-active region server clusters TECHNICAL BRIEF

Hadoop Introduction. Olivier Renault Solution Engineer - Hortonworks

Hadoop Architecture. Part 1

Hadoop Distributed File System. T Seminar On Multimedia Eero Kurkela

Can Storage Fix Hadoop

Hadoop Scalability at Facebook. Dmytro Molkov YaC, Moscow, September 19, 2011

The Hadoop Distributed File System

Lecture 5: GFS & HDFS! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl

Big Data Operations Guide for Cloudera Manager v5.x Hadoop

HADOOP MOCK TEST HADOOP MOCK TEST

Big Data Technologies

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

Where is Hadoop Going Next?

Securing Hadoop in an Enterprise Context

Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney

Apache Hadoop new way for the company to store and analyze big data

Apache HBase. Crazy dances on the elephant back

Spectrum Scale HDFS Transparency Guide

Cloudera Manager Training: Hands-On Exercises

Protecting Big Data Data Protection Solutions for the Business Data Lake

Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop

Applied research on data mining platform for weather forecast based on cloud storage

Big Fast Data Hadoop acceleration with Flash. June 2013

MySQL and Hadoop: Big Data Integration. Shubhangi Garg & Neha Kumari MySQL Engineering

and HDFS for Big Data Applications Serge Blazhievsky Nice Systems

Distributed File Systems

International Journal of Advance Research in Computer Science and Management Studies

A very short Intro to Hadoop

What We Can Do in the Cloud (2) -Tutorial for Cloud Computing Course- Mikael Fernandus Simalango WISE Research Lab Ajou University, South Korea

How To Install Hadoop From Apa Hadoop To (Hadoop)

HDFS Installation and Shell

HDFS Architecture Guide

!"#$%&' ( )%#*'+,'-#.//"0( !"#$"%&'()*$+()',!-+.'/', 4(5,67,!-+!"89,:*$;'0+$.<.,&0$'09,&)"/=+,!()<>'0, 3, Processing LARGE data sets

EXPERIMENTATION. HARRISON CARRANZA School of Computer Science and Mathematics

Survey of the Benchmark Systems and Testing Frameworks For Tachyon-Perf

docs.hortonworks.com

Pivotal HAWQ Release Notes

Prepared By : Manoj Kumar Joshi & Vikas Sawhney

Cloudera Backup and Disaster Recovery

Back Up and Restore the Project Center and Info Exchange Servers. Newforma Project Center Server

Implementing the Hadoop Distributed File System Protocol on OneFS Jeff Hughes EMC Isilon

Reference Architecture and Best Practices for Virtualizing Hadoop Workloads Justin Murray VMware

Overview. Big Data in Apache Hadoop. - HDFS - MapReduce in Hadoop - YARN. Big Data Management and Analytics

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

Storage of Structured Data: BigTable and HBase. New Trends In Distributed Systems MSc Software and Systems

Hadoop and its Usage at Facebook. Dhruba Borthakur June 22 rd, 2009

Cloudera Enterprise Reference Architecture for Google Cloud Platform Deployments

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Solving performance and data protection problems with active-active Hadoop SOLUTIONS BRIEF

Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA

HDFS Design Principles

docs.hortonworks.com

Transcription:

HDFS Snapshots and Beyond Tsz-Wo (Nicholas) Sze Jing Zhao October 29, 2013 Page 1

About Us Tsz-Wo Nicholas Sze, Ph.D. Software Engineer at Hortonworks PMC Member at Apache Hadoop One of the most active contributors/committers of HDFS Started in 2007 Used Hadoop to compute Pi at the two-quadrillionth (2x10 15 th) bit It is the current World Record. Jing Zhao, Ph.D. Software Engineer at Hortonworks Committer at Apache Hadoop Active Hadoop contributor too Contributed ~150 patches in about a year Page 2

Before Snapshots Deleted files cannot be restored Trash is buggy and not well understood Trash works only for CLI based deletion No point-in-time recovery No periodic snapshots to restore from No admin/user managed snapshots Page 3

HDFS Snapshot Point-in-time image of the file system Read-only Copy-on-write Page 4

Use Cases Protection against user errors Backup Experimental/Test setups Page 5

Example: Periodic Snapshots for Backup A typical snapshot policy: Take a snapshot in every 15 mins and keep it for 24 hrs every 1 hr, keep 2 days every 1 day, keep 14 days every 1 week, keep 3 months every 1 month, keep 1 year Page 6

Design Goal: Efficiency Storage efficiency No block data copying No metadata copying for unmodified files Processing efficiency No additional costs for processing current data Cheap snapshot creation Must be fast and lightweight Must support for a very large number of snapshots Page 7

Design Goal: Features Read-only Files and directories in a snapshot are immutable Nothing can be added to or removed from directories Hierarchical snapshots Snapshots of the entire namespace Snapshots of subtrees User operation Users can take snapshots for their data Admins manage where users can take snapshots Page 8

HDFS-2802: Snapshot Development Available in Hadoop 2 GA release (v2.2.0) Community-driven Special thanks to who have provided for the valuable discussion and feedback on the feature requirements and the open questions 136 subtask JIRAs Mainly contributed by Hortonworks The merge patch has about 28k lines ~8 months of development Page 9

Hortonworks Inc. 2011 Page 10

NameNode Only Operation No complicated distributed mechanism Snapshot metadata stored in NameNode DataNodes have no knowledge of snapshots Block management layer also don t know about snapshots Page 11

Fast Snapshot Creation Snapshot Creation: O(1) It just adds a record to an inode / d1 d2 S1 f1 f2 f3 Page 12

Low Memory Overhead NameNode memory usage: O(M) M is the number of modified files/directories Additional memory is used only when modifications are made relative to a snapshot / d1 d2 S1 Modifications: 1. rm f3 2. add f4 f1 f4 f2 f3 Page 13

File Blocks Sharing Blocks in datanodes are not copied The snapshot files record the block list and the file size No data copying / d S2 S1 f f' f blk0 blk1 blk2 blk3 Page 14

Persistent Data Structures A well-known data structure for time travel Support querying previous version of the data Access slow down The additional time required for the data structure In traditional persistent data structures There is slow down on accessing current data and snapshot data In our implementation No slow down on accessing current data Slow down happens only on accessing snapshot data Page 15

No Slow Down on Accessing Current Data The current data can be accessed directly Modifications are recorded in reverse chronological order Snapshot data = Current data Modifications / ~ modifications d1 d2 S1 Modifications: 1. rm f3 2. add f4 d2 f1 f4 f2 f3 f2 f3 Page 16

Easy Management Snapshots can be taken on any directory Set the directory to be snapshottable Support 65,536 simultaneous snapshots No limit on the number of snapshottable directories Nested snapshottable directories are currently NOT allowed Page 17

Admin Ops Allow snapshots on a directory hdfs dfsadmin allowsnapshot <path> Reset a snapshottable directory hdfs dfsadmin disallowsnapshot <path> Example Page 18

User Ops Create/delete/rename snapshots hdfs dfs -createsnapshot <path> [<snapshotname>] hdfs dfs deletesnapshot <path> <snapshotname> hdfs dfs renamesnapshot <path> <oldname> <newname> Get snapshottable directory listing hdfs lssnapshottabledir Get snapshots difference report hdfs snapshotdiff <path> <from> <to> Page 19

Use snapshot paths in CLI All regular commands and APIs can be used against snapshot path /<snapshottabledir>/.snapshot/<snapshotname>/foo/bar List all the files in a snapshot ls /test/.snapshot/s4 List all the snapshots under that path ls <path>/.snapshot Page 20

Test Snapshot Functionalities ~100 unit tests ~1.4 million generated system tests Covering most combination of (snapshot + rename) operations Automated long-running tests for months Page 21

Future Work Restoring/Rolling back to a snapshot Read-write snapshots Excluding temp files Use snapshots beyond HDFS DistCp Hive exports HBase snapshots Page 22

DistCp Copy data to remote cluster List of files to copy à MapReduce No updates on source files while DistCp Take snapshot before DistCp Use snapshot paths for source files Use snapshots for incremental backup Only copy the snapshot diff Page 23

src distcp hdfs://cluster1/src/.snapshot/s1/ hdfs://cluster2/dst/ dst /src/.snapshot/s1/* 15 min later src distcp hdfs://cluster1/src/.snapshot/s2/ hdfs://cluster2/dst/ dst snapshot-diff (s2, s1) Hortonworks Inc. 2011 Page 24

Hive Export Current mechanism Copy data to a new directory in HDFS Export metadata: dump from mysql and store in HDFS Avoid data copy using snapshots Snapshot(s) on source directories Create symlinks for snapshot files/directories Page 25

HBase Snapshots Current mechanism Reference all the HFiles Copy the current table info, region info, pending recovered edits NameNode load concern Take HDFS snapshots Avoid creating references for all HFiles Page 26

Q & A Myths and misinformation of HDFS Not reliable (was never true) Namenode dies, all state is lost (was never true) Does not support disaster recovery (distcp in Hadoop0.15) Hard to operate for new comers Performance improvements (always ongoing) Major improvements in 1.2 and 2.x Namenode is a single point of failure Needs shared NFS storage for HA Does not have point in time recovery Thank You! Page 27