Intro to Map/Reduce a.k.a. Hadoop
Based on: Mining of Massive Datasets by Rajaraman and Ullman, Cambridge University Press, 2011; Data Mining for the Masses by North, Global Text Project, 2012; Slides by Aung Oo, Suzanne McIntosh
Today
Introduction to the Map/Reduce CONCEPT and an IMPLEMENTATION of it (Apache Hadoop; be aware that there are also others)
Project Discussion
Massive Data-sets Intro
1 TB of data spread over 100 drives (HDD_1 ... HDD_100, 10 GB each), but how many computers should there be?
When N = 1, reading 1 TB at 100 MB/s requires about 2.8 hours (10,000 seconds).
What should N be in order to give us appreciable speed-up on reads?
Massive Data-sets Intro
Server 1 ... Server 100, each holding one drive (HDD_1 ... HDD_100).
Given: 10 GB per drive = 10,000,000,000 bytes per drive; 100 x 10 GB drives = 1 TB = 1,000,000,000,000 bytes; read rate is 100 MB/second.
One drive is read in 10,000,000,000 / 100,000,000 = 100 seconds.
If we read all 100 drives in parallel, the full 1 TB is read in those same 100 seconds, and the computers can process the data they read in parallel as well.
This is the architecture in which distributed computing frameworks shine: not only is the data read in parallel, it is processed in parallel too.
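The arithmetic on this slide can be checked with a few lines of Python (the drive count, drive size and read rate are the slide's own figures; the variable names are ours):

```python
# Figures from the slide: 100 drives of 10 GB each, read at 100 MB/s.
DRIVE_BYTES = 10_000_000_000       # 10 GB per drive
READ_RATE_BPS = 100_000_000        # 100 MB/s
NUM_DRIVES = 100

seconds_per_drive = DRIVE_BYTES // READ_RATE_BPS   # time to read one drive
serial_seconds = NUM_DRIVES * seconds_per_drive    # one computer reads all drives
parallel_seconds = seconds_per_drive               # 100 computers, one drive each

print(seconds_per_drive, serial_seconds, parallel_seconds)  # 100 10000 100
```

10,000 seconds is roughly 2.8 hours for the serial read, versus 100 seconds in parallel, which is the speed-up the rest of the lecture builds on.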
So really - how do we do this?
Coping with large data HPC?
There are many different HPC solutions: MPI, GPU computing, MapReduce, ...
No single solution is the best: analyze your problem and choose the best solution for your specific problem, resources, midterm goals, ...
M/R frameworks are aimed at processing huge volumes of data, on the order of tera- or petabytes, which fits many bioinformatics scenarios perfectly.
Coping with large data MapReduce
Example Map/Reduce
Weather data example Dataset
You are given a file containing data from weather stations from around the world.
Say there are 1,000 stations and each measures the temperature once a second. Over a day that sums up to 86,400,000 data points.
Per year we'll have about 3x10^10 points, or roughly 60 GB when stored as 16-bit numbers (far more as full text records).
Question: what was the maximum temperature for each year?
Sample records (year and temperature are embedded in each record):
0067011990999991950051507004+68750+023550FM-12+038299999V0203301N00671220001CN9999999N9+00001+99999999999
0043011990999991950051512004+68750+023550FM-12+038299999V0203201N00671220001CN9999999N9+00221+99999999999
0043011990999991950051518004+68750+023550FM-12+038299999V0203201N00261220001CN9999999N9-00111+99999999999
0043012650999991949032412004+62300+010750FM-12+048599999V0202701N00461220001CN0500001N9+01111+99999999999
0043012650999991949032418004+62300+010750FM-12+048599999V0202701N00461220001CN0500001N9+00781+99999999999
Input to the Mapper: (key, value) pairs, where the value is the record itself and the key is the byte offset of the start of each record in the file (the records are 105 characters long):
0, 0067011990999991950051507004+68750+023550FM-12+038299999V0203301N00671220001CN9999999N9+00001+99999999999
106, 0043011990999991950051512004+68750+023550FM-12+038299999V0203201N00671220001CN9999999N9+00221+99999999999
212, 0043011990999991950051518004+68750+023550FM-12+038299999V0203201N00261220001CN9999999N9-00111+99999999999
318, 0043012650999991949032412004+62300+010750FM-12+048599999V0202701N00461220001CN0500001N9+01111+99999999999
424, 0043012650999991949032418004+62300+010750FM-12+048599999V0202701N00461220001CN0500001N9+00781+99999999999
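The job can be simulated locally with plain Python functions. The field offsets (year in columns 15-18, signed air temperature in tenths of a degree Celsius in columns 87-91) match the sample records above; the function names are our own, and a real Hadoop job would express the map and reduce steps through the framework's API instead:

```python
from collections import defaultdict

def map_record(line):
    """Mapper: extract (year, temperature) from one fixed-width record.

    Offsets assume the NCDC-style format of the sample records: year in
    columns 15-18, signed temperature (tenths of a degree Celsius) in
    columns 87-91.
    """
    year = line[15:19]
    temp = int(line[87:92])        # the leading '+' or '-' parses fine
    if temp != 9999:               # 9999 marks a missing reading
        yield (year, temp)

def run_job(lines):
    """Simulate map -> shuffle -> reduce over a local list of records."""
    groups = defaultdict(list)     # shuffle: group values by key (year)
    for line in lines:
        for year, temp in map_record(line):
            groups[year].append(temp)
    # reduce: maximum temperature per year
    return {year: max(temps) for year, temps in groups.items()}
```

Run on the five sample records, this yields a maximum of 22 (i.e. 2.2 degrees C) for 1950 and 111 (11.1 degrees C) for 1949.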
More use cases
Hadoop Other examples
Yahoo! has more than 100,000 CPUs in >40,000 computers running Hadoop.
The biggest cluster: 4,500 nodes (2x4-CPU boxes with 4x1 TB disks and 16 GB RAM each).
Used to support research for Ad Systems and Web Search; also used for scaling tests to support development of Hadoop on larger clusters.
Facebook uses Hadoop to store copies of internal log and dimension data sources and as a source for reporting/analytics and machine learning.
Currently they have 2 major clusters (with a total of about 15 PB of storage):
A 1100-machine cluster with 8800 cores and about 12 PB raw storage.
A 300-machine cluster with 2400 cores and about 3 PB raw storage.
Each node has 8 cores and 12 TB of storage.
Hadoop Example: GATK, a genome analysis toolkit
Hadoop Example: Bowtie Crossbow, genome resequencing
Hadoop Example: CloudBurst, an NGS read mapper
Read Mapping k-mer Counting with Map/Reduce
Application developers focus on 2 (+1 internal) functions:
Map: input -> key, value pairs
Shuffle: group together pairs with the same key
Reduce: key, value-list -> output
Map, Shuffle & Reduce all run in parallel.
Example with k = 3 over the reads ATGAACCTTA, GAACAACTTA and TTTAGGCAAC:
Map emits one (k-mer, 1) pair per position, e.g. ATGAACCTTA -> ATG,1 TGA,1 GAA,1 AAC,1 ACC,1 CCT,1 CTT,1 TTA,1
Shuffle groups the pairs by k-mer, e.g. TTA -> 1,1,1 and AAC -> 1,1,1,1
Reduce sums each list, giving: AAC:4 ACA:1 ACC:1 ACT:1 AGG:1 ATG:1 CAA:2 CCT:1 CTT:2 GAA:2 GCA:1 GGC:1 TAG:1 TGA:1 TTA:3 TTT:1
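The k-mer counting pipeline above can be simulated locally as follows (the function names are illustrative; in a real job the shuffle step is performed by the framework between map and reduce):

```python
from collections import defaultdict

def map_kmers(read, k=3):
    """Mapper: emit a (k-mer, 1) pair for every k-mer in a read."""
    for i in range(len(read) - k + 1):
        yield (read[i:i + k], 1)

def shuffle(pairs):
    """Shuffle: group values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_count(kmer, ones):
    """Reducer: sum the 1s in the value-list to get the k-mer's count."""
    return (kmer, sum(ones))

reads = ["ATGAACCTTA", "GAACAACTTA", "TTTAGGCAAC"]
pairs = (kv for read in reads for kv in map_kmers(read))
counts = dict(reduce_count(k, v) for k, v in shuffle(pairs).items())
```

With the three reads from the slide, this reproduces the counts shown, e.g. AAC:4, TTA:3 and CAA:2.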
Read Mapping CloudBurst
1. Map: catalog k-mers. Emit the k-mers in the genome and in the reads.
2. Shuffle: collect seeds. Conceptually build a hash table of k-mers and their occurrences.
3. Reduce: end-to-end alignment. If a read aligns end-to-end with at most k errors, record the alignment.
Example (map, shuffle, reduce over human chromosome 1): Read 1 -> Chromosome 1, 12345-12365; Read 2 -> Chromosome 1, 12350-12370.
Hadoop Example: More NGS on Hadoop
Hadoop Other examples
Hadoop Overview
Hadoop A MapReduce implementation
Hadoop's MapReduce implementation is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
Hadoop A MapReduce implementation
A MapReduce job usually splits the input dataset into independent chunks which are processed by the map tasks in a completely parallel manner.
The framework sorts the outputs of the maps, which are then input to the reduce tasks.
The framework takes care of scheduling tasks, monitoring them and re-executing failed tasks.
Hadoop A MapReduce implementation
Typically the compute nodes and the storage nodes are the same, that is, the MapReduce framework and HDFS run on the same set of nodes.
This configuration allows the framework to effectively schedule tasks on the nodes where the data is already present, resulting in very high aggregate bandwidth across the cluster.
Hadoop A MapReduce implementation
The MapReduce framework consists of a single master JobTracker and one slave TaskTracker per cluster node.
The master is responsible for scheduling the jobs' component tasks on the slaves, monitoring them and re-executing failed tasks.
The slaves execute the tasks as directed by the master.
Hadoop A MapReduce implementation
HDFS
Distributed File Systems Intro Why a distributed file system? Holds a large amount of data Serves many network clients
Distributed File Systems What is HDFS?
HDFS = Hadoop Distributed File System.
Storage system part of Hadoop.
Its design is based on the Google File System (GFS) described by Google; HDFS is the open-source counterpart developed within the Hadoop project.
Protects against data loss from hardware failure.
Distributed File Systems HDFS
Stores very large files (100s of MB, GB, TB).
Provides high-performance streaming data access.
Uses off-the-shelf, non-custom hardware.
Continues working without noticeable interruption in case of a failure: it stores replicas of blocks to facilitate recovery from hardware errors.
File data is accessed in a write once, read many model.
Distributed File Systems Intro
HDFS is not good for:
Applications requiring low-latency access to data.
Lots of small files (lots of small files mean lots of metadata to hold in the NameNode's memory).
Multiple writers (only a single writer is supported, because writes are stream-based).
Updates at arbitrary offsets within a file.
HDFS So what?
Distributed File Systems HDFS
Block-structured file system: individual files are broken into blocks of a fixed size.
The HDFS block size is 64 MB by default.
HDFS blocks are large compared to disk blocks (512 bytes) or file system blocks (4 KB).
Optimal streaming is achieved by reducing the latency that many seeks would cause.
Blocks are stored across the cluster on one or more machines, the DataNodes.
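As a sketch of the block-structured layout described above, splitting a file into fixed-size blocks looks like this (64 MB default as stated on the slide; the function name is our own):

```python
# Classic HDFS default block size, as stated above.
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) for each block of a file of file_size bytes.

    Only the final block may be shorter than block_size; HDFS likewise
    does not pad the last block of a file to the full block size.
    """
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks
```

A 200 MB file, for example, occupies three full 64 MB blocks plus one 8 MB block, and those four blocks may live on four different DataNodes.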
Distributed File Systems Intro
In HDFS, a file can be made of several blocks, and they are not necessarily stored on the same machine.
Access to a file may therefore require the cooperation of multiple machines.
Advantage: support for files whose sizes exceed what one machine can accommodate.
HDFS stores files as a set of large blocks across several machines; these files are not part of the ordinary file system.
Typing ls on a machine running a DataNode daemon displays the contents of the ordinary Linux file system hosting the Hadoop services; files stored inside HDFS are not shown.
HDFS runs in a separate namespace and comes with its own utilities for file management.
The blocks that comprise the HDFS files are stored in a directory managed by the DataNode service.
Distributed File Systems Intro
When the blocks of a file are distributed across the cluster, several machines participate in serving up the file, and the loss of any one of those machines would make the file unavailable.
The solution is replication of each block across a number of machines (3 machines by default).
Distributed File Systems Intro
An HDFS cluster is comprised of two types of nodes: one NameNode (master) and multiple DataNodes (worker nodes, subservient to the NameNode).
In HDFS, file data is accessed in a write once, read many (WORM) model.
Metadata structures (names of files and directories) can be modified by many clients concurrently.
The metadata remains synchronized because a single machine manages it: the NameNode.
Distributed File Systems NameNode
Master: manages the file system namespace.
Maintains the file system tree and the metadata for all files and directories in the tree.
A low amount of metadata is stored per file: file names, permissions, and the locations (i.e. DataNodes) of each block of each file.
This information can be kept in the main memory of the NameNode for fast access.
Distributed File Systems NameNode Resilience
It is important that NameNodes are resilient to failure: without the NameNode, the file system cannot be used.
For recovery:
Metadata is persisted in the local file system and, optionally, to multiple backup file systems.
There is the option to run a secondary NameNode, whose role is different from the primary's: the secondary periodically merges the edit log into the namespace image (checkpointing).
The secondary NameNode therefore lags the primary, but it can be promoted to primary for recovery.
The NameNode marks bad blocks and creates new good replicas.
Distributed File Systems DataNodes
Worker nodes, subservient to the NameNode of the cluster.
Store and retrieve blocks on demand: one large file is split into multiple HDFS blocks, and each HDFS block is stored on a DataNode.
Report to the NameNode periodically with lists of the blocks they are storing.
Compute checksums over blocks and report checksum errors to the NameNode.
Distributed File Systems
To open a file in the HDFS file system:
The client retrieves from the NameNode the list of locations (DataNodes) for the blocks that comprise the file.
The client then reads the file data directly from the DataNode servers, possibly in parallel.
The NameNode is not directly involved in the bulk data transfer, keeping its overhead to a minimum.
If a DataNode fails: data can be retrieved from one of the replicas, and the cluster continues to operate.
If the NameNode fails: the cluster is inaccessible until it is manually restored.
Multiple redundant systems allow the NameNode to protect the file system's metadata in the event of NameNode failure.
NameNode failure is more severe for the cluster than DataNode failure.
Distributed File Systems How is distance between nodes measured?
Hadoop uses the tree structure of the nodes in the network to arrive at the distance between two nodes.
The distance is the sum of each node's distance to their nearest common ancestor.
When two processes are running on the same node, we identify both nodes the same way: /datacenter1/rack1/node1, or /d1/r1/n1 for short. The distance is given as: distance(/d1/r1/n1, /d1/r1/n1) = 0
When two processes are running on different nodes in the same rack, the distance is: distance(/d1/r1/n1, /d1/r1/n2) = 2
When two processes are running on nodes in different racks, the distance is: distance(/d1/r1/n1, /d1/r2/n3) = 4
When two processes are running on nodes in different datacenters*, the distance is: distance(/d1/r1/n1, /d2/r3/n4) = 6
*Note: Hadoop does not yet support this model.
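The distance rule above (sum of each node's hops up to the nearest common ancestor) can be sketched in a few lines; the function name is our own:

```python
def network_distance(a, b):
    """Distance between two nodes given tree paths like '/d1/r1/n1'.

    Each step from a node up toward the nearest common ancestor
    contributes 1 to the distance, on both sides.
    """
    pa = a.strip("/").split("/")
    pb = b.strip("/").split("/")
    common = 0                      # depth of the nearest common ancestor
    for x, y in zip(pa, pb):
        if x != y:
            break
        common += 1
    return (len(pa) - common) + (len(pb) - common)
```

This reproduces the four cases on the slide: same node 0, same rack 2, different racks 4, different datacenters 6.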
Distributed File Systems Coherency Model
Be aware that writes may not be visible, even after a flush.
The current block being written will not be visible to readers; once more than one block's worth of data has been written, the first block becomes visible to readers.
HDFS provides a sync() method to force all buffers to be synchronized to the DataNodes.
If your application does not call sync() and a failure occurs, all data of the block currently being written will be unrecoverable.
It is advisable to call sync() at appropriate points in your application, remembering that a call to sync() does incur some overhead.
NFS vs. HDFS
Distributed File Systems Another distributed file system: NFS
One of the oldest distributed file systems.
The NFS server makes a local file system visible to the network; once mounted, the fact that the files are remote is transparent to the client.
Files all reside on one machine, so there is a limit to how much data can be stored; NFS is not extensible.
No reliability guarantee.
All clients contend for service from the NFS server, and clients must copy the data locally to process it.
Distributed File Systems NFS vs. HDFS

                                          NFS          HDFS
Mature technology?                        Yes (1984)   Yes (2004)
Serves multiple clients?                  Yes          Yes
Number of machines                        1            Many
Size of file system                       Fixed        Extensible, scalable
Reliability guaranteed?                   No           Yes
Clients contend for service?              Yes          Yes, but clients are distributed across n servers
Clients copy data before processing it?   Yes          No
Supports very large file sizes?           No           Yes
Distributed File Systems Disadvantages of HDFS Not as general-purpose as NFS HDFS is not suitable for applications that perform random seeks to read from arbitrary locations within a file HDFS is not suitable for applications that perform random seeks to write to arbitrary locations within a file HDFS does not have support for multiple writers to a file
Summary
Distributed File Systems Intro
When running in standalone mode, the local file system is used, not HDFS.
When running in distributed mode, for example with HDFS as our distributed file system, Hadoop uses the data locality optimization when scheduling jobs.
Hadoop tries to run the map task on a node where the input data resides in HDFS, to minimize the amount of copying over the network: network bandwidth is precious.
If all three nodes hosting the HDFS block replicas for a given split are already running map tasks, the job scheduler will try to schedule the work on a node in a rack that already contains a replica.
Although the replica must then be copied, the copy is intra-rack, so it costs less than an inter-rack transfer.
Distributed File Systems Intro
Mappers write their data out to local disks, because it is intermediate data and replicas (via HDFS) are unnecessary.
Reduce tasks do not have the advantage of data locality: their input is the output of many Mappers.
The sorted Mapper outputs are transferred over the network to the Reducer nodes.
The output of the Reduce task(s) is written to HDFS for reliability (replicas).
There can be multiple Reduce tasks.
Summary
MAP/REDUCE is a concept for working with large data-sets.
MAP/REDUCE is implemented in several software packages, e.g. Apache Hadoop MapReduce.
The needed ecosystem consists of other elements, such as a file system (HDFS) and several control services (e.g. JobTracker and TaskTracker).
There exist several tools for easier usage, e.g. Apache Mahout, Apache Pig, Apache DataFu.
Alternative approaches include: Apache Spark, Apache Flink, Apache Hama, Facebook Corona, Twitter Storm, etc.
More: http://hadoopecosystemtable.github.io/