Intro to Map/Reduce a.k.a. Hadoop
Based on: Mining of Massive Datasets by Rajaraman and Ullman, Cambridge University Press, 2011; Data Mining for the Masses by North, Global Text Project, 2012; Slides by Aung Oo, Suzanne McIntosh
Today
Introduction to the Map/Reduce CONCEPT and an IMPLEMENTATION of it (Apache Hadoop; be aware that there are also others)
Project Discussion
Massive Data-sets Intro
1 TB of data spread over 100 drives (HDD_1 ... HDD_100, 10 GB each), but how many computers should there be?
When N = 1, reading 1 TB at 100 MB/s requires about 2.8 hours (10,000 seconds).
What should N be in order to give us appreciable speed-up on reads?
Massive Data-sets Intro
Server 1 ... Server 100, each holding one drive (HDD_1 ... HDD_100).
Given: 10 GB per drive = 10,000,000,000 bytes per drive; 100 x 10 GB drives = 1 TB = 1,000,000,000,000 bytes; read rate is 100 MB/second.
One drive is read in 10,000,000,000 / 100,000,000 = 100 seconds.
If we read all 100 drives in parallel, the full 1 TB is read in those same 100 seconds, and the computers can process the data they read in parallel as well.
This is the architecture in which distributed computing frameworks shine: not only is the data read in parallel, it is processed in parallel too.
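The arithmetic on this slide can be checked with a few lines of Python (the drive count, drive size and read rate are the slide's own figures; the variable names are ours):

```python
# Figures from the slide: 100 drives of 10 GB each, read at 100 MB/s.
DRIVE_BYTES = 10_000_000_000       # 10 GB per drive
READ_RATE_BPS = 100_000_000        # 100 MB/s
NUM_DRIVES = 100

seconds_per_drive = DRIVE_BYTES // READ_RATE_BPS   # time to read one drive
serial_seconds = NUM_DRIVES * seconds_per_drive    # one computer reads all drives
parallel_seconds = seconds_per_drive               # 100 computers, one drive each

print(seconds_per_drive, serial_seconds, parallel_seconds)  # 100 10000 100
```

10,000 seconds is roughly 2.8 hours for the serial read, versus 100 seconds in parallel, which is the speed-up the rest of the lecture builds on.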
So really - how do we do this?
Coping with large data HPC?
There are many different HPC solutions: MPI, GPU computing, MapReduce, ...
No single solution is the best: analyze your problem and choose the best solution for your specific problem, resources, midterm goals, ...
M/R frameworks are aimed at processing huge volumes of data, on the order of tera- or petabytes, which fits many bioinformatics scenarios perfectly.
Coping with large data MapReduce
Example Map/Reduce
Weather data example Dataset
You are given a file containing data from weather stations from around the world.
Say there are 1,000 stations and each measures the temperature once a second. Over a day that sums up to 86,400,000 data points.
Per year we'll have about 3x10^10 points, or roughly 60 GB when stored as 16-bit numbers (far more as full text records).
Question: what was the maximum temperature for each year?
Sample records (year and temperature are embedded in each record):
0067011990999991950051507004+68750+023550FM-12+038299999V0203301N00671220001CN9999999N9+00001+99999999999
0043011990999991950051512004+68750+023550FM-12+038299999V0203201N00671220001CN9999999N9+00221+99999999999
0043011990999991950051518004+68750+023550FM-12+038299999V0203201N00261220001CN9999999N9-00111+99999999999
0043012650999991949032412004+62300+010750FM-12+048599999V0202701N00461220001CN0500001N9+01111+99999999999
0043012650999991949032418004+62300+010750FM-12+048599999V0202701N00461220001CN0500001N9+00781+99999999999
Input to the Mapper: (key, value) pairs, where the value is the record itself and the key is the byte offset of the start of each record in the file (the records are 105 characters long):
0, 0067011990999991950051507004+68750+023550FM-12+038299999V0203301N00671220001CN9999999N9+00001+99999999999
106, 0043011990999991950051512004+68750+023550FM-12+038299999V0203201N00671220001CN9999999N9+00221+99999999999
212, 0043011990999991950051518004+68750+023550FM-12+038299999V0203201N00261220001CN9999999N9-00111+99999999999
318, 0043012650999991949032412004+62300+010750FM-12+048599999V0202701N00461220001CN0500001N9+01111+99999999999
424, 0043012650999991949032418004+62300+010750FM-12+048599999V0202701N00461220001CN0500001N9+00781+99999999999
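The job can be simulated locally with plain Python functions. The field offsets (year in columns 15-18, signed air temperature in tenths of a degree Celsius in columns 87-91) match the sample records above; the function names are our own, and a real Hadoop job would express the map and reduce steps through the framework's API instead:

```python
from collections import defaultdict

def map_record(line):
    """Mapper: extract (year, temperature) from one fixed-width record.

    Offsets assume the NCDC-style format of the sample records: year in
    columns 15-18, signed temperature (tenths of a degree Celsius) in
    columns 87-91.
    """
    year = line[15:19]
    temp = int(line[87:92])        # the leading '+' or '-' parses fine
    if temp != 9999:               # 9999 marks a missing reading
        yield (year, temp)

def run_job(lines):
    """Simulate map -> shuffle -> reduce over a local list of records."""
    groups = defaultdict(list)     # shuffle: group values by key (year)
    for line in lines:
        for year, temp in map_record(line):
            groups[year].append(temp)
    # reduce: maximum temperature per year
    return {year: max(temps) for year, temps in groups.items()}
```

Run on the five sample records, this yields a maximum of 22 (i.e. 2.2 degrees C) for 1950 and 111 (11.1 degrees C) for 1949.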
More use cases
Hadoop Other examples
Yahoo! has more than 100,000 CPUs in >40,000 computers running Hadoop.
The biggest cluster: 4,500 nodes (2x4-CPU boxes with 4x1 TB disks and 16 GB RAM each).
Used to support research for Ad Systems and Web Search; also used for scaling tests to support development of Hadoop on larger clusters.
Facebook uses Hadoop to store copies of internal log and dimension data sources and as a source for reporting/analytics and machine learning.
Currently they have 2 major clusters (with a total of about 15 PB of storage):
A 1100-machine cluster with 8800 cores and about 12 PB raw storage.
A 300-machine cluster with 2400 cores and about 3 PB raw storage.
Each node has 8 cores and 12 TB of storage.
Hadoop Example: GATK, a genome analysis toolkit
Hadoop Example: Bowtie Crossbow, genome resequencing
Hadoop Example: CloudBurst, an NGS read mapper
Read Mapping k-mer Counting with Map/Reduce
Application developers focus on 2 (+1 internal) functions:
Map: input -> key, value pairs
Shuffle: group together pairs with the same key
Reduce: key, value-list -> output
Map, Shuffle & Reduce all run in parallel.
Example with k = 3 over the reads ATGAACCTTA, GAACAACTTA and TTTAGGCAAC:
Map emits one (k-mer, 1) pair per position, e.g. ATGAACCTTA -> ATG,1 TGA,1 GAA,1 AAC,1 ACC,1 CCT,1 CTT,1 TTA,1
Shuffle groups the pairs by k-mer, e.g. TTA -> 1,1,1 and AAC -> 1,1,1,1
Reduce sums each list, giving: AAC:4 ACA:1 ACC:1 ACT:1 AGG:1 ATG:1 CAA:2 CCT:1 CTT:2 GAA:2 GCA:1 GGC:1 TAG:1 TGA:1 TTA:3 TTT:1
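The k-mer counting pipeline above can be simulated locally as follows (the function names are illustrative; in a real job the shuffle step is performed by the framework between map and reduce):

```python
from collections import defaultdict

def map_kmers(read, k=3):
    """Mapper: emit a (k-mer, 1) pair for every k-mer in a read."""
    for i in range(len(read) - k + 1):
        yield (read[i:i + k], 1)

def shuffle(pairs):
    """Shuffle: group values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_count(kmer, ones):
    """Reducer: sum the 1s in the value-list to get the k-mer's count."""
    return (kmer, sum(ones))

reads = ["ATGAACCTTA", "GAACAACTTA", "TTTAGGCAAC"]
pairs = (kv for read in reads for kv in map_kmers(read))
counts = dict(reduce_count(k, v) for k, v in shuffle(pairs).items())
```

With the three reads from the slide, this reproduces the counts shown, e.g. AAC:4, TTA:3 and CAA:2.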
Read Mapping CloudBurst
1. Map: catalog k-mers. Emit the k-mers in the genome and in the reads.
2. Shuffle: collect seeds. Conceptually build a hash table of k-mers and their occurrences.
3. Reduce: end-to-end alignment. If a read aligns end-to-end with at most k errors, record the alignment.
Example (map, shuffle, reduce over human chromosome 1): Read 1 -> Chromosome 1, 12345-12365; Read 2 -> Chromosome 1, 12350-12370.
Hadoop Example: More NGS on Hadoop
Hadoop Other examples
Hadoop Overview
Hadoop A MapReduce implementation
Hadoop's MapReduce implementation is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
Hadoop A MapReduce implementation
A MapReduce job usually splits the input dataset into independent chunks which are processed by the map tasks in a completely parallel manner.
The framework sorts the outputs of the maps, which are then input to the reduce tasks.
The framework takes care of scheduling tasks, monitoring them and re-executing failed tasks.
Hadoop A MapReduce implementation
Typically the compute nodes and the storage nodes are the same, that is, the MapReduce framework and HDFS run on the same set of nodes.
This configuration allows the framework to effectively schedule tasks on the nodes where the data is already present, resulting in very high aggregate bandwidth across the cluster.
Hadoop A MapReduce implementation
The MapReduce framework consists of a single master JobTracker and one slave TaskTracker per cluster node.
The master is responsible for scheduling the jobs' component tasks on the slaves, monitoring them and re-executing failed tasks.
The slaves execute the tasks as directed by the master.
Hadoop A MapReduce implementation
HDFS
Distributed File Systems Intro Why a distributed file system? Holds a large amount of data Serves many network clients
Distributed File Systems What is HDFS?
HDFS = Hadoop Distributed File System.
Storage system part of Hadoop.
Its design is based on the Google File System (GFS) described by Google; HDFS is the open-source counterpart developed within the Hadoop project.
Protects against data loss from hardware failure.
Distributed File Systems HDFS
Stores very large files (100s of MB, GB, TB).
Provides high-performance streaming data access.
Uses off-the-shelf, non-custom hardware.
Continues working without noticeable interruption in case of a failure: it stores replicas of blocks to facilitate recovery from hardware errors.
File data is accessed in a write once, read many model.
Distributed File Systems Intro
HDFS is not good for:
Applications requiring low-latency access to data.
Lots of small files (lots of small files mean lots of metadata to hold in the NameNode's memory).
Multiple writers (only a single writer is supported, because writes are stream-based).
Updates at arbitrary offsets within a file.
HDFS So what?
Distributed File Systems HDFS
Block-structured file system: individual files are broken into blocks of a fixed size.
The HDFS block size is 64 MB by default.
HDFS blocks are large compared to disk blocks (512 bytes) or file system blocks (4 KB).
Optimal streaming is achieved by reducing the latency that many seeks would cause.
Blocks are stored across the cluster on one or more machines, the DataNodes.
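As a sketch of the block-structured layout described above, splitting a file into fixed-size blocks looks like this (64 MB default as stated on the slide; the function name is our own):

```python
# Classic HDFS default block size, as stated above.
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) for each block of a file of file_size bytes.

    Only the final block may be shorter than block_size; HDFS likewise
    does not pad the last block of a file to the full block size.
    """
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks
```

A 200 MB file, for example, occupies three full 64 MB blocks plus one 8 MB block, and those four blocks may live on four different DataNodes.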
Distributed File Systems Intro
In HDFS, a file can be made of several blocks, and they are not necessarily stored on the same machine.
Access to a file may therefore require the cooperation of multiple machines.
Advantage: support for files whose sizes exceed what one machine can accommodate.
HDFS stores files as a set of large blocks across several machines; these files are not part of the ordinary file system.
Typing ls on a machine running a DataNode daemon displays the contents of the ordinary Linux file system hosting the Hadoop services; files stored inside HDFS are not shown.
HDFS runs in a separate namespace and comes with its own utilities for file management.
The blocks that comprise the HDFS files are stored in a directory managed by the DataNode service.
Distributed File Systems Intro
When the blocks of a file are distributed across the cluster, several machines participate in serving up the file, and the loss of any one of those machines would make the file unavailable.
The solution is replication of each block across a number of machines (3 machines by default).
Distributed File Systems Intro
An HDFS cluster is comprised of two types of nodes: one NameNode (master) and multiple DataNodes (worker nodes, subservient to the NameNode).
In HDFS, file data is accessed in a write once, read many (WORM) model.
Metadata structures (names of files and directories) can be modified by many clients concurrently.
The metadata remains synchronized because a single machine manages it: the NameNode.
Distributed File Systems NameNode
Master: manages the file system namespace.
Maintains the file system tree and the metadata for all files and directories in the tree.
A low amount of metadata is stored per file: file names, permissions, and the locations (i.e. DataNodes) of each block of each file.
This information can be kept in the main memory of the NameNode for fast access.
Distributed File Systems NameNode Resilience
It is important that NameNodes are resilient to failure: without the NameNode, the file system cannot be used.
For recovery:
Metadata is persisted in the local file system and, optionally, to multiple backup file systems.
There is the option to run a secondary NameNode, whose role is different from the primary's: the secondary periodically merges the edit log into the namespace image (checkpointing).
The secondary NameNode therefore lags the primary, but it can be promoted to primary for recovery.
The NameNode marks bad blocks and creates new good replicas.
Distributed File Systems DataNodes
Worker nodes, subservient to the NameNode of the cluster.
Store and retrieve blocks on demand: one large file is split into multiple HDFS blocks, and each HDFS block is stored on a DataNode.
Report to the NameNode periodically with lists of the blocks they are storing.
Compute checksums over blocks and report checksum errors to the NameNode.
Distributed File Systems
To open a file in the HDFS file system:
The client retrieves from the NameNode the list of locations (DataNodes) for the blocks that comprise the file.
The client then reads the file data directly from the DataNode servers, possibly in parallel.
The NameNode is not directly involved in the bulk data transfer, keeping its overhead to a minimum.
If a DataNode fails: data can be retrieved from one of the replicas, and the cluster continues to operate.
If the NameNode fails: the cluster is inaccessible until it is manually restored.
Multiple redundant systems allow the NameNode to protect the file system's metadata in the event of NameNode failure.
NameNode failure is more severe for the cluster than DataNode failure.
Distributed File Systems How is distance between nodes measured?
Hadoop uses the tree structure of the nodes in the network to arrive at the distance between two nodes.
The distance is the sum of each node's distance to their nearest common ancestor.
When two processes are running on the same node, we identify both nodes the same way: /datacenter1/rack1/node1, or /d1/r1/n1 for short. The distance is given as: distance(/d1/r1/n1, /d1/r1/n1) = 0
When two processes are running on different nodes in the same rack, the distance is: distance(/d1/r1/n1, /d1/r1/n2) = 2
When two processes are running on nodes in different racks, the distance is: distance(/d1/r1/n1, /d1/r2/n3) = 4
When two processes are running on nodes in different datacenters*, the distance is: distance(/d1/r1/n1, /d2/r3/n4) = 6
*Note: Hadoop does not yet support this model.
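The distance rule above (sum of each node's hops up to the nearest common ancestor) can be sketched in a few lines; the function name is our own:

```python
def network_distance(a, b):
    """Distance between two nodes given tree paths like '/d1/r1/n1'.

    Each step from a node up toward the nearest common ancestor
    contributes 1 to the distance, on both sides.
    """
    pa = a.strip("/").split("/")
    pb = b.strip("/").split("/")
    common = 0                      # depth of the nearest common ancestor
    for x, y in zip(pa, pb):
        if x != y:
            break
        common += 1
    return (len(pa) - common) + (len(pb) - common)
```

This reproduces the four cases on the slide: same node 0, same rack 2, different racks 4, different datacenters 6.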
Distributed File Systems Coherency Model
Be aware that writes may not be visible, even after a flush.
The current block being written will not be visible to readers; once more than one block's worth of data has been written, the first block becomes visible to readers.
HDFS provides a sync() method to force all buffers to be synchronized to the DataNodes.
If your application does not call sync() and a failure occurs, all data of the block currently being written will be unrecoverable.
It is advisable to call sync() at appropriate points in your application, remembering that a call to sync() does incur some overhead.
NFS vs. HDFS
Distributed File Systems Another distributed file system: NFS
One of the oldest distributed file systems.
The NFS server makes a local file system visible to the network; once mounted, the fact that the files are remote is transparent to the client.
Files all reside on one machine, so there is a limit to how much data can be stored; NFS is not extensible.
No reliability guarantee.
All clients contend for service from the NFS server, and clients must copy the data locally to process it.
Distributed File Systems NFS vs. HDFS

                                          NFS          HDFS
Mature technology?                        Yes (1984)   Yes (2004)
Serves multiple clients?                  Yes          Yes
Number of machines                        1            Many
Size of file system                       Fixed        Extensible, scalable
Reliability guaranteed?                   No           Yes
Clients contend for service?              Yes          Yes, but clients are distributed across n servers
Clients copy data before processing it?   Yes          No
Supports very large file sizes?           No           Yes
Distributed File Systems Disadvantages of HDFS Not as general-purpose as NFS HDFS is not suitable for applications that perform random seeks to read from arbitrary locations within a file HDFS is not suitable for applications that perform random seeks to write to arbitrary locations within a file HDFS does not have support for multiple writers to a file
Summary
Distributed File Systems Intro
When running in standalone mode, the local file system is used, not HDFS.
When running in distributed mode, for example with HDFS as our distributed file system, Hadoop uses the data locality optimization when scheduling jobs.
Hadoop tries to run the map task on a node where the input data resides in HDFS, to minimize the amount of copying over the network: network bandwidth is precious.
If all three nodes hosting the HDFS block replicas for a given split are already running map tasks, the job scheduler will try to schedule the work on a node in a rack that already contains a replica.
Although the replica must then be copied, the copy is intra-rack, so it costs less than an inter-rack transfer.
Distributed File Systems Intro
Mappers write their data out to local disks, because it is intermediate data and replicas (via HDFS) are unnecessary.
Reduce tasks do not have the advantage of data locality: their input is the output of many Mappers.
The sorted Mapper outputs are transferred over the network to the Reducer nodes.
The output of the Reduce task(s) is written to HDFS for reliability (replicas).
There can be multiple Reduce tasks.
Summary
MAP/REDUCE is a concept for working with large data-sets.
MAP/REDUCE is implemented in several software packages, e.g. Apache Hadoop MapReduce.
The needed ecosystem consists of other elements, such as a file system (HDFS) and several control services (e.g. JobTracker and TaskTracker).
There exist several tools for easier usage, e.g. Apache Mahout, Apache Pig, Apache DataFu.
Alternative approaches include: Apache Spark, Apache Flink, Apache Hama, Facebook Corona, Twitter Storm, etc.
More: http://hadoopecosystemtable.github.io/