Data Intensive Computing: MapReduce and Hadoop


Università degli Studi di Roma Tor Vergata, Dipartimento di Ingegneria Civile e Ingegneria Informatica
Data Intensive Computing: MapReduce and Hadoop
Distributed Systems and Cloud Computing course (Sistemi Distribuiti e Cloud Computing), A.Y. 2015/16
Valeria Cardellini

Why Big Data?
Source: What Happens in an Internet Minute?

How much data?
By 2020, 40 zettabytes of data will be created. How big is a zettabyte?
40 zettabytes = 40x10^21 bytes (roughly 40x2^70 bytes)
= 40,000 exabytes (40,000x10^18 bytes)
= 40,000,000 petabytes (40,000,000x10^15 bytes)
= 40,000,000,000 terabytes (40,000,000,000x10^12 bytes)
= 40,000,000,000,000,000,000,000 bytes!
See http://www.dailyinfographic.com/2016-the-year-of-thezettabyte-infographic
90% of all the data in the world has been generated over the last two years (as of 2013).
Source: The Four V's of Big Data

How much data? Some older statistics:
- Google processes more than 20 PB a day (2008)
- Facebook has 2.5 PB of user data, growing by 15 TB/day (4/2009); ~25 PB today
- eBay has more than 6.5 PB of user data, growing by 50 TB/day (5/2009)
- CERN's LHC generates 1 PB of data per second (2013)

How big? Growth rate
Big Data is growing fast.

How big? IoT impact
The Internet of Things (IoT) will largely contribute to increasing the Big Data challenges.

Big Data definitions
There are different definitions:
- "Big data exceeds the reach of commonly used hardware environments and software tools to capture, manage, and process it within a tolerable elapsed time for its user population." (Teradata Magazine, 2011)
- "Big data refers to data sets whose size is beyond the ability of typical database software tools to capture, store, manage and analyze." (The McKinsey Global Institute, 2012)
- "Big data is mostly about taking numbers and using those numbers to make predictions about the future. The bigger the data set you have, the more accurate the predictions about the future will be." (Anthony Goldbloom, Kaggle's founder)
- "Big data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools." (Wikipedia, 2014)

So, what is Big Data?
Big Data is similar to small data, but bigger; having bigger data requires different approaches (scale changes everything!): new methodologies, tools, and architectures that aim to solve new problems, or old problems in a better way.

The 3V model for Big Data (defined in 2001 by D. Laney)
- Volume: challenging to store and process (how to index and retrieve)
- Variety: different data types (text, audio, video, records) and degrees of structure (structured, semi-structured, unstructured data)
- Velocity: speed of generation, rate of analysis

The extended (3+n)V model
1. Volume (lots of data)
2. Variety (complexity, curse of dimensionality)
3. Velocity (rate of data and information flow)
4. Value (Big Data can generate huge competitive advantages): "Big data technologies describe a new generation of technologies and architectures, designed to economically extract value from very large volumes of a wide variety of data, by enabling high-velocity capture, discovery, and/or analysis." [Gan11]
5. Variability (data flows can be highly inconsistent, with periodic peaks)
6. Veracity (untrusted, uncleaned data)
7. Visualization

Big Data visualization
Presentation of data in a pictorial and graphical format.
Motivation: our brain processes images 60,000x faster than text.
Some examples:
- Flight patterns: www.aaronkoblin.com/work/flightpatterns/
- Hurricanes: uxblog.idvsolutions.com/2012/08/hurricanes-since-1851.html
- Relationships between actors who have won Oscars, the directors they have worked with, and all the other actors they have worked with: www.pitchinteractive.com/infovis/abstract.html

Some examples of Big Data applications
- Consumer product companies and retail organizations are monitoring social media like Facebook and Twitter to get an unprecedented view into customer behavior, preferences, and product perception.
- Manufacturers are monitoring minute vibration data from their equipment to predict the optimal time to replace or maintain it.
- Manufacturers are also monitoring social networks, but with a different goal than marketers: to detect aftermarket support issues before a warranty failure becomes publicly detrimental.
- Governments are making data public at the national, state, and city level so that users can develop new applications that generate public good (Open Data).

... and many other Big Data applications in very diverse sectors:
- Crime prevention in Los Angeles
- Diagnosis and treatment of genetic diseases
- Investments in the financial sector
- Generation of personalized advertising
- Astronomical discoveries
See the BBC video: gsis.mediacore.tv/media/bbc-horizon-the-age-of-big-data

Examples of real-time Big Data applications
- Real-time analytics over high-volume sensor data: analysis of energy consumption measurements (DEBS 2014), www.cse.iitb.ac.in/debs2014/?page_id=42
- Real-time analytics over high-volume geospatial data streams: analysis of taxi trips based on a stream of trip reports from New York City (DEBS 2015), www.debs2015.org/call-grand-challenge.html
- Finance: real-time stock market forecasting
- Medicine: epidemic tracking
- Security: fraud detection, DDoS attacks, behavioural pattern recognition
- Urban traffic management

Processing Big Data
Some approaches to deal with Big Data:
- Old fashioned: RDBMS, but not applicable at this scale
- MapReduce: store and process data sets at massive scale (especially Volume + Variety), also known as batch processing
- Data stream processing: process fast data (in real time) as it is being generated, without storing it
In this lesson we focus on MapReduce; in the next lesson we will examine data stream processing.

Parallel programming: background
Parallel programming: break processing into parts that can be executed concurrently on multiple processors.
Challenge: identify tasks that can run concurrently and/or groups of data that can be processed concurrently. Not all problems can be parallelized!

Parallel programming: background (2)
The simplest environment for parallel programming: no dependency among data; data can be split into equal-size chunks and each process works on one chunk.
Master/worker approach (see the sketch below):
- Master: initializes the array and splits it according to the number of workers; sends each worker its sub-array; receives the results from each worker
- Worker: receives a sub-array from the master; performs the processing; sends the results back to the master
Single Program, Multiple Data (SPMD): technique to achieve parallelism; the most common style of parallel programming.

Key idea behind MapReduce: divide and conquer
A feasible approach to tackling large-data problems:
- Partition a large problem into smaller sub-problems
- Execute the independent sub-problems in parallel
- Combine the intermediate results from each individual worker
The workers can be: threads in a processor core, cores in a multi-core processor, multiple processors in a machine, or many machines in a cluster.
The implementation details of divide and conquer are complex.
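A minimal single-machine sketch of the master/worker, divide-and-conquer idea in Java, assuming a simple sum over an integer array (the array contents and the worker count are illustrative assumptions): the master splits the data into equal-size chunks, every worker runs the same code on its own chunk (SPMD), and the master combines the partial results.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class MasterWorkerSum {
    public static void main(String[] args) throws Exception {
        // Illustrative data set: one million elements, all equal to 1
        int[] data = new int[1_000_000];
        for (int i = 0; i < data.length; i++) data[i] = 1;

        int workers = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        int chunk = (data.length + workers - 1) / workers;

        // Master: split the array and hand each worker its sub-array
        List<Future<Long>> partials = new ArrayList<>();
        for (int w = 0; w < workers; w++) {
            final int from = w * chunk;
            final int to = Math.min(data.length, from + chunk);
            // Worker: same code, different chunk (SPMD)
            partials.add(pool.submit(() -> {
                long sum = 0;
                for (int i = from; i < to; i++) sum += data[i];
                return sum;
            }));
        }

        // Master: receive and combine the partial results
        long total = 0;
        for (Future<Long> f : partials) total += f.get();
        pool.shutdown();
        System.out.println("total = " + total);
    }
}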

Divide and conquer: how?
- Decompose the original problem into smaller, parallel tasks
- Schedule the tasks on workers distributed in a cluster, taking into account data locality and resource availability
- Ensure that workers get the data they need
- Coordinate synchronization among workers
- Share partial results
- Handle failures

Key idea behind MapReduce: scale out, not up!
For data-intensive workloads, a large number of commodity servers is preferred over a small number of high-end servers:
- The cost of supercomputers does not scale linearly
- Datacenter efficiency is a difficult problem to solve, but there have been recent improvements
- Processing data is quick, I/O is very slow
Sharing vs. shared nothing:
- Sharing: manage a common/global state
- Shared nothing: independent entities, no common state
Sharing is difficult: synchronization and deadlocks, finite bandwidth to access data from a SAN, and complicated temporal dependencies (restarts).

MapReduce
A programming model for processing huge data sets over thousands of servers:
- Originally proposed by Google in 2004: "MapReduce: simplified data processing on large clusters"
- Based on a shared-nothing approach
- Also an associated implementation (framework) of the distributed system that runs the corresponding programs
Some examples of applications: Web indexing, reverse Web-link graph, distributed sort, Web access statistics.

MapReduce: programmer view
MapReduce hides system-level details; the key idea is to separate the what from the how. MapReduce abstracts away the distributed part of the system; such details are handled by the framework.
Programmers get a simple API and don't have to worry about handling parallelization, data distribution, load balancing, and fault tolerance.

Typical Big Data problem
- Iterate over a large number of records
- Extract something of interest from each record
- Shuffle and sort intermediate results
- Aggregate intermediate results
- Generate the final output
Key idea: provide a functional abstraction of the two operations, Map and Reduce.

MapReduce: model
Processing occurs in two phases, Map and Reduce, with functional programming roots (e.g., Lisp). Map and Reduce are defined by the programmer. Input and output are sets of key-value pairs.
Programmers specify the two functions map and reduce:
map(k1, v1) -> [(k2, v2)]
reduce(k2, [v2]) -> [(k3, v3)]
where (k, v) denotes a (key, value) pair and [ ] denotes a list.
Keys do not have to be unique: different pairs can have the same key. Normally the keys of the input elements are not relevant.

Map
Executes a function on a set of key-value pairs (an input shard) to create a new list of values:
map(in_key, in_value) -> list(out_key, intermediate_value)
Example: square x = x * x; map square [1,2,3,4,5] returns [1,4,9,16,25].
Map calls are distributed across machines by automatically partitioning the input data into M shards. The MapReduce library then groups together all intermediate values associated with the same intermediate key and passes them to the Reduce function.

Reduce
Combines the values in each set to create a new value:
reduce(out_key, list(intermediate_value)) -> list(out_value)
Example: sum = for each elem in list, total += elem; reduce sum [1,4,9,16,25] returns 55 (the sum of the squared elements).
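Just to make the map and reduce steps concrete on a single machine, the same square-then-sum example can be written with Java streams (plain Java, not a MapReduce framework):

import java.util.Arrays;
import java.util.List;

public class SquareSum {
    public static void main(String[] args) {
        List<Integer> input = Arrays.asList(1, 2, 3, 4, 5);
        int sum = input.stream()
                       .map(x -> x * x)          // "map": square each element
                       .reduce(0, Integer::sum); // "reduce": add everything up
        System.out.println(sum);                 // prints 55
    }
}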

MapReduce program
A MapReduce program, referred to as a job, consists of:
- The code for Map and Reduce packaged together
- Configuration parameters (where the input lies, where the output should be stored)
- The input data set, stored on the underlying distributed file system (the input will not fit on a single computer's disk)
Each MapReduce job is divided by the system into smaller units called tasks: Map tasks and Reduce tasks. The output of a MapReduce job is also stored on the underlying distributed file system.

MapReduce computation
1. Some number of Map tasks are each given one or more chunks of data from a distributed file system.
2. Each Map task turns its chunk into a sequence of key-value pairs. The way key-value pairs are produced from the input data is determined by the code written by the user for the Map function.
3. The key-value pairs from each Map task are collected by a master controller and sorted by key.
4. The keys are divided among all the Reduce tasks, so all key-value pairs with the same key wind up at the same Reduce task.
5. The Reduce tasks work on one key at a time, and combine all the values associated with that key in some way. The manner of combination of values is determined by the code written by the user for the Reduce function.
6. The output key-value pairs from each reducer are written persistently back onto the distributed file system.
7. The output ends up in r files, where r is the number of reducers. Such output may be the input to a subsequent MapReduce phase.

Where the magic happens
Implicit between the map and reduce phases is a distributed "group by" operation on the intermediate keys:
- Intermediate data arrive at each reducer in order, sorted by key
- No ordering is guaranteed across reducers (a small partitioning sketch follows below)
Intermediate keys are transient: they are not stored on the distributed file system, but are spilled to the local disk of each machine in the cluster.
[Figure: input splits of (k, v) pairs flow through the Map function into intermediate (k', v') pairs, which the Reduce function turns into the final output (k'', v'') pairs.]

MapReduce computation: the complete picture
[Figure: the complete picture of a MapReduce computation.]
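The routing of intermediate pairs to reducers is typically done by hashing the intermediate key; the sketch below mirrors what Hadoop's default HashPartitioner does (the key and the number of reduce tasks are arbitrary assumptions):

public class PartitionSketch {
    public static void main(String[] args) {
        int numReduceTasks = 4;          // assumed number of reducers
        String key = "hadoop";           // an intermediate key
        // All pairs sharing this key are routed to the same reduce task
        int partition = (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        System.out.println("key '" + key + "' goes to reducer " + partition);
    }
}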

A simplified view of MapReduce: example
Mappers are applied to all input key-value pairs to generate an arbitrary number of intermediate pairs. Reducers are applied to all intermediate values associated with the same intermediate key. Between the map and reduce phases lies a barrier that involves a large distributed sort and group by.

"Hello World" in MapReduce: WordCount
Problem: count the number of occurrences of each word in a large collection of documents.
- Input: a repository of documents; each document is an element
- Map: reads a document and emits a sequence of key-value pairs, where the keys are the words of the document and the values are all equal to 1: (w1, 1), (w2, 1), ..., (wn, 1)
- Grouping: groups by key and generates pairs of the form (w1, [1, 1, ..., 1]), ..., (wn, [1, 1, ..., 1])
- Reduce: adds up all the values and emits (w1, k), ..., (wn, l)
- Output: (w, m) pairs, where w is a word that appears at least once among all the input documents and m is the total number of occurrences of w among all those documents

WordCount: Map
Map emits each word in the document with an associated value equal to 1:

Map(String key, String value):
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1");

WordCount: Reduce
Reduce adds up all the 1s emitted for a given word:

Reduce(String key, Iterator values):
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));

This is pseudo-code; for the complete code of the example, see the MapReduce paper.
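For comparison with the pseudo-code, a Java version written against Hadoop's MapReduce API, along the lines of the official WordCount tutorial referenced later in these slides, could look roughly as follows (the job driver that configures the input/output paths and submits the job is omitted):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);    // emit (word, 1)
      }
    }
  }

  public static class IntSumReducer
       extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) sum += val.get();
      result.set(sum);
      context.write(key, result);    // emit (word, total count)
    }
  }
}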

MapReduce: execution overview
[Figure: execution overview of a MapReduce job.]

What is Apache Hadoop?
An open-source software framework for reliable, scalable, distributed data-intensive computing, originally developed by Yahoo!
- Goal: storage and processing of data sets at massive scale
- Infrastructure: clusters of commodity hardware
- Core components: HDFS (Hadoop Distributed File System) and Hadoop MapReduce
- Includes a number of related projects, among which Apache Pig, Apache Hive, and Apache HBase
- Used in production by Facebook, IBM, LinkedIn, Twitter, Yahoo!, and many others
- Provided by Amazon as a service (Elastic MapReduce, EMR) running on EC2

Hadoop core
HDFS:
- A distributed file system characterized by a master/worker architecture
- Data is replicated with redundancy across the cluster
- Servers can fail without aborting the computation
- Quite similar to the Google File System
Hadoop MapReduce:
- Makes it easy to write applications that process vast amounts of data (multi-terabyte data sets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner
- The powerhouse behind most of today's big data processing (e.g., at Facebook)

HDFS: file management
[Figure: HDFS file management.]

HDFS: architecture
An HDFS cluster has two types of nodes: multiple DataNodes and one NameNode.

HDFS concepts
- The DataNodes (workers) just store and retrieve the blocks (also called shards or chunks) when they are told to, by clients or by the NameNode
- The NameNode (master) manages the filesystem tree and the metadata for all the files and directories, and knows the DataNodes on which all the blocks for a given file are located
Without the NameNode, HDFS cannot be used, so it is important to make the NameNode resilient to failure.

HDFS: file read
The NameNode is only used to get the block locations. (Source: "Hadoop: The Definitive Guide")

HDFS: file write
Clients ask the NameNode for a list of suitable DataNodes. This list forms a pipeline: the first DataNode stores a copy of a block, then forwards it to the second, and so on. (Source: "Hadoop: The Definitive Guide")
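As a small illustration of the client side of a read, the sketch below uses Hadoop's FileSystem API to open and print a file stored in HDFS (the NameNode address and the file path are placeholder assumptions):

import java.io.InputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsCat {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020/"), conf);

    try (InputStream in = fs.open(new Path("/user/demo/input.txt"))) {
      // The client contacts the NameNode only for the block locations,
      // then streams the blocks directly from the DataNodes
      IOUtils.copyBytes(in, System.out, 4096, false);
    }
  }
}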

WordCount on Hadoop
Let's analyze the WordCount code on Hadoop:
hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html#Example:_WordCount_v1.0

Hadoop in the Cloud
Pros:
- Gain Cloud scalability and elasticity
- No need to manage and provision the infrastructure and the platform
Main challenges:
- Moving data to the Cloud: latency is not zero! (a minor issue is network bandwidth)
- Data security and privacy

Amazon Elastic MapReduce (EMR)
Distributes the computational work across a cluster of virtual servers running on EC2 instances; the cluster is managed with Hadoop. Input and output: Amazon S3, DynamoDB.

Hadoop ecosystem: a partial big picture
See hadoopecosystemtable.github.io for a longer and updated list.

Why an ecosystem
- Hadoop 1.0 was released in 2011 by the Apache Software Foundation
- It is a platform around which an entire ecosystem of capabilities has been and is being built: dozens of self-standing software projects (some are top-level projects), each addressing some part of the Big Data space and meeting different needs
- Hive and Pig simplify the development of applications employing MapReduce
- Spark improves performance for certain types of Big Data applications
It is an ecosystem: complex, evolving, and not easily parceled into neat categories.

Pig and Hive
Both are built upon Hadoop.
Pig:
- Idea: take the best of both SQL and MapReduce, combining high-level declarative querying with low-level procedural programming
- Uses MapReduce to execute all data processing: it compiles the Pig Latin scripts written by users into a series of one or more MapReduce jobs that are then executed
Hive:
- Makes unstructured data look like tables, regardless of how it is really laid out
- SQL-based queries can be run directly against these tables

MapReduce: weaknesses and limitations
Programming model:
- Hard to implement everything as a MapReduce program; multiple MapReduce steps can be needed even for simple operations
- Lack of control, structures and data types
Efficiency (recall how HDFS works):
- High communication cost
- Frequent writing of output to disk
- Limited exploitation of main memory
Real-time processing:
- A MapReduce job has to scan the entire input
- Stream processing and random access are impossible

Alternative programming models
- Based on directed acyclic graphs (DAGs): e.g., Spark and Storm
- Based on Bulk Synchronous Parallel (BSP), for graph analytics at massive scale and massive scientific computations (e.g., matrix, graph and network algorithms): Pregel, Hama, Giraph, GraphLab, GraphX
- SQL-based: Hive and Pig
- NoSQL databases: e.g., HBase

Spark
A separate, fast and general-purpose engine for large-scale data processing:
- Not a modified version of Hadoop; the leading candidate as successor to MapReduce
- In-memory data storage for very fast iterative queries (up to 40x faster than Hadoop)
- Suitable for general execution graphs and powerful optimizations
- Compatible with Hadoop's storage APIs: can read from and write to any Hadoop-supported system, including HDFS and HBase

Data sharing in MapReduce
Slow due to replication, serialization and disk I/O.

Data sharing in Spark
Distributed in-memory: much faster than disk and network.

Apache Mesos
How can different Big Data frameworks be used together on the same cluster, dynamically sharing the cluster resources among them? Mesos: a cluster operating system.
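To make the contrast with MapReduce data sharing concrete, here is a minimal word count sketch written against Spark's Java RDD API (Spark 2.x assumed; the input/output paths and the local master URL are placeholder assumptions). The intermediate RDDs stay in memory, so chaining further transformations does not require writing to HDFS in between:

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("WordCount").setMaster("local[*]");
    JavaSparkContext sc = new JavaSparkContext(conf);

    // Read the input from any Hadoop-supported storage, e.g., HDFS
    JavaRDD<String> lines = sc.textFile("hdfs://namenode:8020/user/demo/input.txt");

    JavaPairRDD<String, Integer> counts = lines
        .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator()) // words
        .mapToPair(word -> new Tuple2<>(word, 1))                      // (word, 1)
        .reduceByKey(Integer::sum);                                    // (word, count)

    counts.saveAsTextFile("hdfs://namenode:8020/user/demo/output");
    sc.stop();
  }
}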