HDFS
Kevin Swingler

Hadoop Distributed File System
A file system designed to store VERY large files
Streaming data access
Runs across clusters of commodity hardware
Resilient to node failure
Large files
Files can be many gigabytes or even terabytes in size
In fact, HDFS struggles when there are a great many small files, as the file structure data is held in memory
There are petabyte stores running Hadoop

Streaming Access
Write once, read many times design
Files can be constantly added to, but Hadoop works only by appending, not by inserting or updating records
Lots of different analyses may be run on the same data, usually involving the entire file
Examples: log files, transaction history, machine monitoring, sensor networks, ...
Commodity Hardware
Designed not to need specialist, high-cost computers
Runs over a flexible cluster of nodes
Normally a rack of x86 machines, but you can run it on pretty much anything

Resilience
Large clusters of nodes will suffer failures
Hadoop uses replication to cope with this
More on that later...
Key Concepts
A single name node is responsible for storing the location of every file in the system
Kept in memory for speed
Stored on disk for persistence
Many data nodes contain the data, stored in blocks
Data is stored in fixed-size blocks - each file is usually made up of many blocks

Blocks
Files are split into blocks of a fixed size (usually 64 MB or 128 MB)
A single file's blocks are spread across the network, so every file is distributed
Each machine stores part of several files
Replication means each block is written to r different nodes, where r is the replication factor (usually 3)
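To make the numbers concrete, here is a small sketch (plain Java, with a made-up 1 GB file and the usual defaults) of how many blocks such a file occupies and how much raw storage its replicas take up:

    public class BlockArithmetic {
        public static void main(String[] args) {
            long fileSize = 1_000_000_000L;        // hypothetical 1 GB file
            long blockSize = 128L * 1024 * 1024;   // 128 MB block size
            int r = 3;                             // typical replication factor

            long blocks = (fileSize + blockSize - 1) / blockSize;  // ceiling division -> 8 blocks
            long rawBytes = fileSize * r;          // the last block is not padded, so raw storage = size x r

            System.out.println(blocks + " blocks, " + rawBytes / (1024 * 1024) + " MB of raw cluster storage");
        }
    }

So a 1 GB file ends up as 8 blocks and roughly 3 GB of storage spread across the cluster.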
Data (Slave) Nodes
Store the blocks of data, plus local information about block location and file identity
Report back to the name node to list the blocks they are storing
All nodes contain replicated blocks - there are no primary/secondary data nodes
So, nodes are not replicated; blocks are

Name (Master) Node
Manages the namespace of the file system
Stores the file system tree and file metadata
Stores the location (which node) of all of the blocks in a file
Is replicated by a secondary name node for resilience
Client
The client accesses the data
The name node tells the client where the data can be found on the data nodes
The client then interacts directly with the data nodes
This happens 'under the hood'
Example client = HDFS command line

A Nice Picture
[Diagram: the client sends metadata operations to the name node (backed by a secondary name node) and reads/writes blocks directly on the data nodes; the blocks of Files 1-3 and their replicas (e.g. F1B1, F2B3) are spread across data nodes in Rack 1 and Rack 2.]
Name Node Federation
If a cluster is so large that a single name node cannot cover it all, the whole namespace is partitioned and shared among a number of name nodes
This is called federation

HDFS Command Line
The simplest way to interact with the files in HDFS is via the command line
Get a shell (via SSH, for example) on the server and type Unix-like commands to interact with the file system
Syntax is: hadoop fs -command or hdfs dfs -command
Examples
-ls            List directory contents
-cat           Copy file contents to stdout
-mkdir         Make a directory in DFS
-mv            Move a file within DFS
-appendToFile  Append a local file to a DFS file
-copyToLocal   Copy from DFS to local

File Locations
File locations are specified using POSIX style, e.g. /usr/kms/somedata/data.csv
Permissions work like in Linux, except you cannot execute a file in DFS, so there are only r (read) and w (write) permissions for users
You need to copy files (or data) from the local file system to HDFS explicitly - they are separate (see the sketch below)
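The same operations are also available programmatically. As a rough sketch (using the Java API covered later, and the example directory above), listing a DFS directory and its Linux-style permissions looks something like this:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ListDemo {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());  // HDFS, not the local file system
            for (FileStatus status : fs.listStatus(new Path("/usr/kms/somedata"))) {
                // permissions print as e.g. rw-r--r-- : there is no execute bit in DFS
                System.out.println(status.getPermission() + " " + status.getOwner()
                        + " " + status.getLen() + " " + status.getPath());
            }
        }
    }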
File Formats
Just like any file system, HDFS allows data to be stored in any format
Text: CSV, JSON etc.
Media: images, sound etc.
Additionally, it offers its own container formats
Files are split into blocks, so the format must support this

Text Files
CSV files are easy to split and can be processed row by row, so they are a good choice (a reading sketch follows below)
JSON and XML are more difficult to split, so special tools are needed
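A minimal sketch of row-by-row reading, assuming the CSV file from the earlier example has already been copied into HDFS:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CsvReadDemo {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path path = new Path("/usr/kms/somedata/data.csv");   // example path from earlier
            try (BufferedReader reader =
                     new BufferedReader(new InputStreamReader(fs.open(path)))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    String[] fields = line.split(",");            // one record per row
                    System.out.println(fields[0]);
                }
            }
        }
    }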
Hadoop File Types

Sequence files
Binary files containing key/value pairs (a writing sketch follows below)

Serialization Formats
Methods for turning program objects into data streams
In the Hadoop io library (org.apache.hadoop.io), there are a number of Writable classes that can be used from Java
Avro is an Apache project designed to provide language-independent serialization

Compression
Compressed sequence files are still splittable
The codec is stored in the header, so any compression method may be used
Choice between compression speed and degree of compression
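As an illustration of sequence files and Writable classes, the sketch below writes a few key/value pairs to a hypothetical file /usr/kms/demo.seq; block compression with the default codec is picked purely as an example:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class SequenceWriteDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Path path = new Path("/usr/kms/demo.seq");   // hypothetical output file
            try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(path),
                    SequenceFile.Writer.keyClass(Text.class),            // keys are Writables
                    SequenceFile.Writer.valueClass(IntWritable.class),   // so are values
                    SequenceFile.Writer.compression(SequenceFile.CompressionType.BLOCK))) {
                writer.append(new Text("sensor-1"), new IntWritable(42));
                writer.append(new Text("sensor-2"), new IntWritable(17));
            }
        }
    }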
Java Interface
Hadoop is written in Java and works most naturally in that language
There are ways of interacting with it in other languages too, though
Useful Java classes include org.apache.hadoop.fs and org.apache.hadoop.io
Access, create and write to files from Java (see the sketch below)

HDFS and Python
Hadoopy - a Python wrapper for Hadoop: www.hadoopy.com
Spotify have a nice tool called Snakebite: labs.spotify.com/2013/05/07/snakebite/
Allows calls to HDFS from Python
Quicker than hdfs from the command line, which launches a JVM for each command
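A minimal sketch using org.apache.hadoop.fs, with placeholder paths: it copies a local file into HDFS explicitly and then creates a small file directly from Java:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class WriteDemo {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());

            // explicit copy from the local file system into HDFS (placeholder paths)
            fs.copyFromLocalFile(new Path("data.csv"), new Path("/usr/kms/somedata/data.csv"));

            // create a new HDFS file and write to it directly
            try (FSDataOutputStream out = fs.create(new Path("/usr/kms/notes.txt"))) {
                out.writeUTF("written from Java");
            }
        }
    }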
Web Server Interface
HDFS also runs a web server, which provides information about data and jobs

REST API
There is also a REST API (WebHDFS) you can use to interact with HDFS
http://<host>:<port>/webhdfs/v1/<path>?op=
See hadoop.apache.org/docs/r2.4.0/hadoop-project-dist/hadoop-hdfs/webhdfs.html
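For example, a LISTSTATUS request (one of the standard WebHDFS operations) needs nothing more than an HTTP client; the host, port and path below are placeholders:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class WebHdfsDemo {
        public static void main(String[] args) throws Exception {
            // placeholder host/port; the path follows the /webhdfs/v1/<path>?op= pattern above
            URL url = new URL("http://namenode:50070/webhdfs/v1/usr/kms/somedata?op=LISTSTATUS");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("GET");
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream()))) {
                String line;
                while ((line = in.readLine()) != null) {
                    System.out.println(line);   // JSON description of the directory contents
                }
            }
        }
    }

The response comes back as JSON describing each entry in the directory.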
Higher Level Tools
Pig: high-level language for defining data flow
Hive: SQL-like query language for MapReduce jobs
ZooKeeper: coordination service for distributed systems
Spark: high-speed data analytics
Mahout: machine learning
http://hadoopecosystemtable.github.io/