COSC 6397 Big Data Analytics
Distributed File Systems (II)
Edgar Gabriel, Spring 2014

HDFS Basics
- An open-source implementation of the Google File System
- Assumes that the node failure rate is high
- Assumes a small number of large files
- Write-once-read-many access pattern
- Reads are performed in a large, streaming fashion
- High throughput is favored over low latency
- Moving computation is easier than moving data

HDFS Components
- Namenode
  - Manages the file system's namespace, metadata, and file blocks
  - Runs on one machine (up to several machines)
- Datanode
  - Stores and retrieves data blocks
  - Reports to the Namenode
  - Runs on many machines
- Secondary Namenode
  - Performs housekeeping work so the Namenode doesn't have to
  - Requires hardware similar to the Namenode machine
  - Not used for high availability; it is not a backup for the Namenode
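Since the Namenode tracks all Datanodes, a client can ask it for the current set of live nodes. Below is a minimal sketch, assuming the default file system is HDFS (the cast fails otherwise); getDataNodeStats() belongs to DistributedFileSystem, not to the generic FileSystem class.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

public class ListDatanodes {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Datanode statistics are HDFS-specific, hence the cast
        DistributedFileSystem dfs = (DistributedFileSystem) fs;
        for (DatanodeInfo node : dfs.getDataNodeStats()) {
            System.out.println(node.getHostName());
        }
    }
}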

HDFS Blocks
- Files are split into blocks
  - Managed by the Namenode, stored by the Datanodes
  - Transparent to the user
- Blocks are traditionally either 64MB or 128MB
  - Default is 64MB
  - The motivation is to minimize the cost of seeks relative to the transfer rate
- The Namenode determines replica placement (see the sketch below)
  - Default replication factor is 3
  - 1st replica on the local rack
  - 2nd replica on the local rack, but on a different machine
  - 3rd replica on a different rack

Namenode
- Arbitrator and repository for all HDFS metadata
- Data never flows through the Namenode
- Executes file system namespace operations
  - open, close, rename files and directories
- Determines the mapping of blocks to Datanodes

Metadata in Memory
- The entire metadata is kept in main memory
- Types of metadata
  - List of files
  - List of blocks for each file
  - List of DataNodes for each block
  - File attributes, e.g. creation time, replication factor
- A transaction log
  - Records file creations, file deletions, etc.
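Block placement is not only internal bookkeeping; a client can query it. The sketch below, with /data/readMe.txt as a placeholder path, uses the standard getFileBlockLocations() call to print each block of a file together with the Datanodes holding its replicas.

import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlocks {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/data/readMe.txt"));
        // One BlockLocation per block; each lists the Datanodes
        // that hold a replica of that block
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + Arrays.toString(block.getHosts()));
        }
    }
}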

DataNode
- A block server
  - Stores data in the local file system (e.g. ext4)
  - Stores metadata of a block (e.g. CRC)
  - Serves data and metadata to clients
- Block report
  - Periodically sends a report of all existing blocks to the NameNode
- Facilitates pipelining of data
  - Forwards data to other specified DataNodes

Data Pipelining
- Client retrieves a list of DataNodes on which to place replicas of a block
- Client writes the block to the first DataNode
- The first DataNode forwards the data to the next node in the pipeline
- When all replicas are written, the client moves on to write the next block of the file
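The pipeline itself is internal to HDFS, but the client controls what triggers it. As a hedged sketch (the path is a placeholder), the overloaded create() call below requests three replicas and a 64MB block size, so the Namenode sets up a three-Datanode pipeline for every block written:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CreateReplicated {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // create(path, overwrite, bufferSize, replication, blockSize)
        FSDataOutputStream out = fs.create(new Path("/data/out.bin"),
                true, 4096, (short) 3, 64L * 1024 * 1024);
        out.write("hello".getBytes());
        out.close();
    }
}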

Rebalancer
- Goal: the percentage of disk in use should be similar across DataNodes
  - Usually run when new DataNodes are added
  - The cluster stays online while the Rebalancer is active
  - The Rebalancer is throttled to avoid network congestion
- Command line tool

HDFS Limitations
- Bad at handling a large number of small files
- Write limitations
  - Single writer per file
  - Writes only at the end of a file; no support for arbitrary offsets (see the sketch below)
- Low-latency reads
  - High throughput rather than low latency for small chunks of data
  - HBase addresses this issue
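To make the single-writer, end-of-file restriction concrete: the only way to modify an existing file is append(). This is a sketch under the assumption that the running Hadoop version has append support enabled (some releases disable it); there is no API at all for writing at an arbitrary offset.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AppendOnly {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // The returned stream is always positioned at the end of the file
        FSDataOutputStream out = fs.append(new Path("/data/log.txt"));
        out.write("one more line\n".getBytes());
        out.close();
    }
}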

DataNode (cont.)
- Serves read/write requests from clients
- Performs block creation, deletion, and replication upon instruction from the Namenode
- Has no knowledge of HDFS files
  - Stores HDFS data in files on the local file system
  - Determines the optimal file count per directory
  - Creates subdirectories automatically

Comparison of HDFS to PVFS2

                       PVFS2                            HDFS
  Metadata server      Distributed                      Federation of metadata servers in v2.2.0
  Data server          Stateless                        Unclear, probably stateful (because of replication)
  Default stripe size  64KB                             64MB
  POSIX support        No; kernel interfaces            No; similar interfaces available
                       implement similar semantics      through FUSE

Comparison of HDFS to PVFS2 (cont.)

                                        PVFS2                            HDFS
  Reliability                           No (high-availability PVFS2     Replication
                                        is experimental)
  Concurrent access to the same file    Yes                              No
  Locking                               No                               No
  Other features                                                         Atomic append

File System Java API
- org.apache.hadoop.fs.FileSystem
  - Abstract class that serves as a generic file system representation
  - Note: it's a class and not an interface
- Hadoop ships with multiple concrete implementations:
  - org.apache.hadoop.fs.LocalFileSystem
    - Good old native file system using local disk(s)
  - org.apache.hadoop.hdfs.DistributedFileSystem
    - Hadoop Distributed File System (HDFS)
    - We will mostly focus on this implementation
  - org.apache.hadoop.hdfs.HftpFileSystem
    - Access HDFS in read-only mode over HTTP
  - org.apache.hadoop.fs.ftp.FTPFileSystem
    - File system on an FTP server
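Which concrete implementation FileSystem.get() returns is driven by the URI scheme and the configuration. A small sketch (host and port are placeholders):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class PickFileSystem {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // hdfs:// selects DistributedFileSystem, file:// LocalFileSystem
        FileSystem hdfs = FileSystem.get(URI.create("hdfs://localhost:8020/"), conf);
        FileSystem local = FileSystem.get(URI.create("file:///"), conf);
        System.out.println(hdfs.getClass().getName());
        System.out.println(local.getClass().getName());
    }
}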

File System Java API

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SimpleLocalLs {
    public static void main(String[] args) throws Exception {
        Path path = new Path("/");
        if (args.length == 1) {
            path = new Path(args[0]);
        }
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // List the directory entries and print their names
        FileStatus[] files = fs.listStatus(path);
        for (FileStatus file : files) {
            System.out.println(file.getPath().getName());
        }
    }
}

Hadoop's Path object represents a file or a directory
- Not java.io.File, which is tightly coupled to the local file system
- A Path is really a URI on the FileSystem
  - HDFS: hdfs://localhost/user/file1
  - Local: file:///user/file1
- Examples:
  - new Path("/test/file1.txt");
  - new Path("hdfs://localhost:9000/test/file1.txt");

Reading data from HDFS
1. Create a FileSystem
2. Open an InputStream to a Path
3. Copy bytes using IOUtils
4. Close the stream

Reading data from HDFS

FileSystem fs = FileSystem.get(new Configuration());

- If you run with the yarn command, a DistributedFileSystem (HDFS) will be created
  - Utilizes the fs.default.name property from the configuration
  - Recall that the Hadoop framework loads core-site.xml, which sets the property to HDFS (hdfs://localhost:8020)
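The property can also be set programmatically instead of through core-site.xml. A fragment extending the line above (fs.default.name is the classic name of the property; newer releases spell it fs.defaultFS):

Configuration conf = new Configuration();
// Point the client at the Namenode without relying on core-site.xml
conf.set("fs.default.name", "hdfs://localhost:8020");
FileSystem fs = FileSystem.get(conf); // now a DistributedFileSystem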

Reading data from HDFS

InputStream input = null;
try {
    input = fs.open(fileToRead);
} finally {
    IOUtils.closeStream(input);
}

- fs.open returns an org.apache.hadoop.fs.FSDataInputStream
  - Other FileSystem implementations return their own custom implementation of InputStream
- Opens the stream with a default buffer of 4K
  - If you want to provide your own buffer size, use fs.open(Path f, int bufferSize)
- Utilize IOUtils to avoid the boilerplate code that catches IOException

Reading data from HDFS

IOUtils.copyBytes(inputStream, outputStream, buffer);

- Copies bytes from an InputStream to an OutputStream
- Hadoop's IOUtils makes the task simple
- The buffer parameter specifies the number of bytes to buffer at a time

Reading data from HDFS

import java.io.IOException;
import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ReadFile {
    public static void main(String[] args) throws IOException {
        Path fileToRead = new Path("/data/readMe.txt");
        FileSystem fs = FileSystem.get(new Configuration());
        InputStream input = null;
        try {
            // Copy the file to stdout, 4096 bytes at a time
            input = fs.open(fileToRead);
            IOUtils.copyBytes(input, System.out, 4096);
        } finally {
            IOUtils.closeStream(input);
        }
    }
}

Reading data - seek
- FileSystem.open returns an FSDataInputStream
  - Extension of java.io.DataInputStream
  - Supports random access and reading via the interfaces:
    - PositionedReadable: read chunks of the stream at a given offset
    - Seekable: seek to a particular position in the stream
- FSDataInputStream implements the Seekable interface
  - void seek(long pos) throws IOException
    - Seeks to a particular position in the file
    - The next read will begin at that position
    - If you attempt to seek past the end of the file, an IOException is thrown
    - Expensive operation: strive for streaming, not seeking
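Seekable is demonstrated on the next slide; PositionedReadable can be sketched as below (path and offset are placeholders). Unlike seek(), a positioned read does not move the stream's current position.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PositionedRead {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FSDataInputStream in = fs.open(new Path("/data/readMe.txt"));
        byte[] buf = new byte[16];
        // Read up to 16 bytes starting at absolute offset 11
        int n = in.read(11L, buf, 0, buf.length);
        System.out.println(new String(buf, 0, n));
        in.close();
    }
}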

Reading data - seek

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class SeekReadFile {
    public static void main(String[] args) throws IOException {
        Path fileToRead = new Path("/training/data/readMe.txt");
        FileSystem fs = FileSystem.get(new Configuration());
        FSDataInputStream input = null;
        try {
            input = fs.open(fileToRead);
            // Print the file from the start
            System.out.print("start position=" + input.getPos() + ": ");
            IOUtils.copyBytes(input, System.out, 4096, false);
            // Seek to offset 11 and print again from there
            input.seek(11);
            System.out.print("start position=" + input.getPos() + ": ");
            IOUtils.copyBytes(input, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(input);
        }
    }
}

Writing Data in HDFS
1. Create a FileSystem instance
2. Open an OutputStream
   a) An FSDataOutputStream in this case
   b) Open a stream directly to a Path from the FileSystem
   c) Creates all needed directories on the provided path
3. Copy data using IOUtils (a sketch follows below)
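The slides give no code for the write path, so here is a minimal sketch following the three steps above; the local source file and the HDFS target path are placeholders.

import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class WriteFile {
    public static void main(String[] args) throws Exception {
        // 1. Create a FileSystem instance
        FileSystem fs = FileSystem.get(new Configuration());
        InputStream input = null;
        FSDataOutputStream output = null;
        try {
            input = new FileInputStream("/tmp/localFile.txt");
            // 2. Open a stream directly to a Path;
            //    create() also makes any missing parent directories
            output = fs.create(new Path("/data/remoteFile.txt"));
            // 3. Copy the data, 4096 bytes at a time
            IOUtils.copyBytes(input, output, 4096);
        } finally {
            IOUtils.closeStream(input);
            IOUtils.closeStream(output);
        }
    }
}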

HDFS C API

#include "hdfs.h"
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>

int main(int argc, char **argv) {
    const char *filename = "/data/testfile.txt";   /* placeholder path */
    const char *message  = "Hello, HDFS!\n";
    tPort namenode_port  = 8020;                   /* placeholder port */

    hdfsFS fs = hdfsConnect("namenode_hostname", namenode_port);
    if (!fs) {
        fprintf(stderr, "Cannot connect to HDFS.\n");
        exit(-1);
    }

    /* hdfsExists() returns 0 if the path exists */
    int exists = hdfsExists(fs, filename);
    if (exists > -1) {
        fprintf(stdout, "File %s exists!\n", filename);
    } else {
        /* Create and open file for writing */
        hdfsFile outfile = hdfsOpenFile(fs, filename,
                                        O_WRONLY | O_CREAT, 0, 0, 0);
        if (!outfile) {
            fprintf(stderr, "Open failed: %s\n", filename);
            exit(-2);
        }
        hdfsWrite(fs, outfile, (void*)message, strlen(message));
        hdfsCloseFile(fs, outfile);
    }

HDFS C API (cont.)

    /* Open file for reading */
    hdfsFile infile = hdfsOpenFile(fs, filename, O_RDONLY, 0, 0, 0);
    if (!infile) {
        fprintf(stderr, "Failed to open %s for reading!\n", filename);
        exit(-2);
    }
    int size = 1024;                   /* read buffer size */
    char *data = malloc(size + 1);
    /* Read from file and null-terminate before printing */
    tSize readSize = hdfsRead(fs, infile, (void*)data, size);
    if (readSize < 0) readSize = 0;
    data[readSize] = '\0';
    fprintf(stdout, "%s\n", data);
    free(data);
    hdfsCloseFile(fs, infile);
    hdfsDisconnect(fs);
    return 0;
}
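A note on building: libhdfs wraps the Java client via JNI, so the exact flags vary by Hadoop version and distribution, but programs are typically compiled against the hdfs.h header, linked with -lhdfs and the JVM library, and run with a CLASSPATH that contains the Hadoop jars and configuration directory.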