BIG DATA, MAPREDUCE & HADOOP


BIG DATA, MAPREDUCE & HADOOP
LARGE SCALE DISTRIBUTED SYSTEMS
By Jean-Pierre Lozi. A tutorial for the LSDS class.

OBJECTIVES OF THIS LAB SESSION
The LSDS class has been mostly theoretical so far. The objective of this lab session is to get hands-on experience with Hadoop. I'll give you a short presentation (well under an hour); after that: exercises. This is all just for fun, no grades!

WHAT IS BIG DATA?
Companies & organisations gather *lots* of data nowadays. They're able to store it because storage has become very cheap! The New York Stock Exchange (NYSE) generates 1 terabyte of data each day. Facebook stores ~250 billion pictures from users: several petabytes of data! The Large Hadron Collider (LHC) generates 15 petabytes of data per year! Numbers from 2014, so probably even more now!

WHAT IS BIG DATA?
This data comes from various sources. For Facebook, it comes from users (as is often the case in the social web); for the LHC, it comes from machines: a particle accelerator here, but it can also come from other machines, such as sensor networks (e.g., monitoring temperatures in your server farms all across the world, taxi companies that want to know where all their cars are at any moment, transactions, log files...). This data is often not very well structured: images, text files, comments, health data (prescriptions...), etc. How do you store and process this data?

HOW DO YOU STORE THIS?
You need many machines that will each store a small part of the data. «Oh, that was easy!»
[Figure: a dataset split into parts (part 1) through (part 9), spread across machines]

HOW DO YOU STORE THIS?
But actually, you really need *lots* of machines. E.g., Apple uses its own solar power plant to power its iCloud server farm!
[Figure: the same parts (part 1) through (part 9), replicated across many machines]

HOW DO YOU STORE THIS?
Problem: with that many machines, some are bound to crash! You can't just naïvely partition the data, or you'll lose some!
[Figure: the partitioned data across many machines, with some machines failing]

HOW DO YOU STORE THIS?
So you need some replication. You can't handle it manually, of course. So you use a Distributed File System (DFS) that does the job for you! In Hadoop, this filesystem is called HDFS (Hadoop Distributed File System).
[Figure: HDFS architecture. The name node maps files to lists of blocks (foo.txt: 3, 9, 6; bar.data: 2, 4). A client asks the name node for block #2 of foo.txt, gets the answer (block 9), then reads block 9 directly from a data node; blocks are replicated across the data nodes. Source: UPenn]

HOW DO YOU PROCESS THIS?
Now we know how to store the data, but how do we process it? Historically, we've been using databases for this. It doesn't work anymore! First, because as we saw earlier: lack of structure! Images, comments, log files, prescriptions... You can't easily fit that into a database, with its fixed structure, tables, relations, etc. Second, databases don't scale very well... Try doubling the number of nodes with a (distributed) database: you won't be twice as fast (far from it).

HOW DO YOU PROCESS THIS?
So what's the alternative? Google invented MapReduce to make the indexer for their search engine scale! The idea of MapReduce: you write a function that you're going to run as a batch process on all of your data, and you want to get one result (which can be large). MapReduce is really good at doing this efficiently! This is a different use case from databases, which are better at accessing small bits of your data frequently, instead of all of your data once in a while in a batch process!

HOW DO YOU PROCESS THIS?
How does MapReduce manage to be so efficient at what it does? A very old idea: execute things locally as much as possible and avoid transfers between nodes as much as possible! MapReduce first runs a function f() on all data. Of course, if two nodes contain the same data, you're only going to run the function on one of the nodes (just ensure it's run on all of the data). And if a node is dead, you'll make sure you run the function on another node that has the same data. All of this is done automatically!

HOW DO YOU PROCESS THIS?
After the Map phase, you have partial results located on all nodes... So you want to gather and aggregate all of these results into global results! Intermediary phase: the Shuffle phase brings all data to one machine (which could be one of the previous ones).
[Figure: partial results (RES) from each node gathered onto a single machine]

HOW DO YOU PROCESS THIS?
The Shuffle phase naïvely concatenates the results together. Usually we want a new function g() that will take the concatenated data... and merge it in a smarter way, to produce the result.
[Figure: the gathered partial results (RES) merged by g() into a single OUTPUT]

HOW DO YOU PROCESS THIS?
MapReduce can seem a bit restrictive: you have to express all of your algorithms with two functions, Map and Reduce. And actually, it is: you can't express everything with MapReduce. But in practice, you will see that many operations that are executed on large amounts of data can be expressed following this paradigm! And if you're able to, you can very easily implement an algorithm that scales well... without having to worry about how you distribute, replicate, or transfer the data!

HOW DO YOU PROCESS THIS?
In practice, it can get more complicated than this. Among other things: you can alter the Shuffle phase with a Combiner that will prepare the data locally after the Map phase, before it's sent to the Reducer (useful for reducing the amount of data transferred). You can use several Reducers: each will produce part of the data, with the results stored on HDFS, so the results will just look like a bunch of files, which sometimes can be just what you want (just merge them or something)... But pretty often, when you have several Reducers, you'll want to combine the data again, so what do you do? You can run a Map phase on the output of your Reducers again!... And you can do this over and over again (iterative MapReduce).
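To see why a Combiner helps, here is a plain-Java sketch (just an illustration of the idea, not Hadoop code; the names are ours): if the job computes a maximum, each node can collapse its local (key, value) pairs to one (key, localMax) pair per key before anything crosses the network.

```java
import java.util.*;

public class CombinerSketch {
    // Local pre-aggregation: collapse each key's list of values to its
    // maximum, so only one pair per key is sent over the network.
    public static Map<String, Integer> combine(Map<String, List<Integer>> local) {
        Map<String, Integer> out = new TreeMap<>();
        local.forEach((key, values) -> out.put(key, Collections.max(values)));
        return out;
    }

    public static void main(String[] args) {
        // Three pairs for "foo" and one for "bar" on this node...
        Map<String, List<Integer>> nodeOutput =
            Map.of("foo", List.of(3, 5, 2), "bar", List.of(4));
        // ...become just two pairs to transfer:
        System.out.println(combine(nodeOutput)); // prints {bar=4, foo=5}
    }
}
```

This only works because taking a maximum is associative; a Combiner must not change the final result the Reducer would have computed.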

HOW DO YOU PROCESS THIS?
[Figure: iterative MapReduce. The output of the Reduce phase feeds a new Map phase, over and over again. Source: Twister4Azure]
Used e.g. for Google's PageRank.

WAIT, BUT THAT'S MAPREDUCE, WHAT'S HADOOP?
MapReduce was invented and is used by Google. Hadoop is a free, open-source implementation of MapReduce. A bit of history... Early 2000s: Doug Cutting develops two open-source search projects: Lucene, a search indexer used e.g. by Wikipedia, and Nutch, a spider/crawler (with Mike Cafarella). Nutch aims to become a web-scale, crawler-based search engine, written by a few part-time developers. It is distributed by necessity (too much data): able to parse 100 MB of web pages on 4 nodes, but can't scale to the whole web... Source: UPenn

WAIT, BUT THAT'S MAPREDUCE, WHAT'S HADOOP?
2003/2004: the Google File System (GFS) and MapReduce papers are published. SOSP 2003: Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung: "The Google File System". OSDI 2004: Jeffrey Dean and Sanjay Ghemawat: "MapReduce: Simplified Data Processing on Large Clusters". They directly addressed Nutch's scaling issues. Following this, GFS & MapReduce are added to Nutch, by two part-time developers over two years (2004-2006)... with 20 nodes. Much easier to program and run, scales to several hundred million web pages... but still far from web scale. Source: UPenn

WAIT, BUT THAT'S MAPREDUCE, WHAT'S HADOOP?
2006: Yahoo hires Doug Cutting. Yahoo provides engineers, clusters, users... A big boost for the project, worth tens of millions of dollars. Not without a price: a slightly different focus (e.g. security) than the rest of the project, which delays results... Following this, the Hadoop project splits out of Nutch! HDFS corresponds to Google's GFS. Hadoop finally hits web scale in 2008! Source: UPenn

WAIT, BUT THAT'S MAPREDUCE, WHAT'S HADOOP?
Cutting is now at Cloudera... Originally a startup, started by three top engineers from Google, Facebook, and Yahoo, and a former executive from Oracle. Cloudera has its own version of Hadoop; the software remains free, but the company sells support and consulting services. Cutting was elected chairman of the Apache Software Foundation, and Hadoop is now maintained by the Apache Foundation! Source: UPenn

HADOOP IN PRACTICE
Map and Reduce functions operate on (key, value) pairs. E.g., the Map function takes (key, value) pairs and produces (hopefully fewer!) pairs that are sent as the input of the Reduce function... The Shuffle phase concatenates the values that have the same key. For instance, if your Map phase outputs three pairs: ("foo", 3), ("bar", 4) and ("foo", 5), the Reduce phase will receive: ("foo", [3, 5]), ("bar", 4). The Reduce function takes these pairs and produces (key, value) pairs again... which means that your output will always be a list of (key, value) pairs! (It may need further processing.)
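To make the grouping concrete, here is a plain-Java sketch (again an illustration of the contract, not Hadoop code) of how the shuffle phase turns the map output above into the grouped reduce input:

```java
import java.util.*;

public class ShuffleSketch {
    // Group map-output values by key, with keys sorted, as the shuffle
    // phase does between Map and Reduce.
    public static SortedMap<String, List<Integer>> group(
            List<Map.Entry<String, Integer>> mapOutput) {
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : mapOutput) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                   .add(pair.getValue());
        }
        return grouped;
    }

    public static void main(String[] args) {
        // The three pairs from the slide: ("foo", 3), ("bar", 4), ("foo", 5).
        System.out.println(group(List.of(
            Map.entry("foo", 3), Map.entry("bar", 4), Map.entry("foo", 5))));
        // prints {bar=[4], foo=[3, 5]}
    }
}
```

Note that "foo" keeps its two values as a list, exactly what the Reduce function will receive.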

HADOOP IN PRACTICE
Let's start with an example: we have files that contain meteorological data. These files contain records; each record is one line, containing: the code of a weather station on five digits; the year when the temperature was recorded; and the average temperature for that year, times ten, on four digits (we'll suppose they're all positive to simplify things; the multiplication is there to avoid floats). There is only one data point per year here, so it's not really Big Data, but this is just a toy example: we could have a lot more records (one per hour, for instance) and many more fields, such as the wind speed, humidity, etc. An example of a record: 12345195001639362743...
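The field offsets implied by this record layout can be sketched in plain Java (the class name is ours; the trailing characters after position 13 stand for the other fields we ignore):

```java
public class RecordFormat {
    // Slice one fixed-width record into its fields:
    // station = chars 0-4, year = chars 5-8, temperature x 10 = chars 9-12.
    public static String[] parse(String line) {
        return new String[] {
            line.substring(0, 5),   // weather station code
            line.substring(5, 9),   // year
            line.substring(9, 13)   // average temperature times ten
        };
    }

    public static void main(String[] args) {
        String[] fields = parse("12345195001639362743");
        System.out.println(String.join(" ", fields)); // prints 12345 1950 0163
    }
}
```

So the example record belongs to station 12345 and says that in 1950 the average temperature was 16.3 degrees.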

HADOOP IN PRACTICE
The input data will look like this:
12345195001639362743
12123195001341892769
12111195001311271987
12094194902231212122
12093194901651209182
...
The data can be stored in many files: one per weather station, one per year... etc. We'll use Hadoop to calculate the maximum average temperature for each year!

HADOOP IN PRACTICE
What will the input of the Map function be? Each line produces a (key, value) pair. We can ignore the key (usually the character offset); the value is the contents of the line:
(0, 12345195001639362743)
(20, 12123195001341892769)
(40, 12111195001311271987)
(60, 12094194902231212122)
(80, 12093194901651209182)
...

HADOOP IN PRACTICE
What will the Map function do? It will discard the key, parse the value, and return (key, value) pairs where the key is the year and the value is the average temperature. The output will be:
(1950, 0163)
(1950, 0134)
(1950, 0131)
(1949, 0223)
(1949, 0165)
...
So basically our Map function will be a Java function that takes two parameters, the key (a number) and the value (a string); it will parse the string using the standard API, and produce the (key, value) pair as the output...

HADOOP IN PRACTICE
What will the Shuffle phase do? As we've seen earlier, it will concatenate the values for each key. Plus, keys are sorted:
(1949, [0223, 0165])
(1950, [0163, 0134, 0131])
...
We don't have to implement this phase; it's done automatically...

HADOOP IN PRACTICE
What will the Reduce phase do? It's just going to calculate the maximum of each list. The input was:
(1949, [0223, 0165])
(1950, [0163, 0134, 0131])
...
The output will be:
(1949, 0223)
(1950, 0163)
...
And that's it, we have the result we want! All we have to do is implement two very simple functions in Java, Map and Reduce, and everything else (distribution, replication, load-balancing of keys, fault tolerance with tasks rescheduled to machines that work, etc.) will be handled by Hadoop!
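The whole pipeline can be simulated on one machine in a few lines of plain Java (the class name is ours; Hadoop does the same thing, but distributed). Note that once parsed as integers, the temperatures lose their leading zeros:

```java
import java.util.*;

public class MaxTemperatureSketch {
    // Map: line -> (year, temp); Shuffle: group by year; Reduce: maximum.
    // merge() with Math::max folds the shuffle and reduce steps together.
    public static SortedMap<String, Integer> run(List<String> lines) {
        SortedMap<String, Integer> maxByYear = new TreeMap<>();
        for (String line : lines) {
            String year = line.substring(5, 9);
            int temp = Integer.parseInt(line.substring(9, 13));
            maxByYear.merge(year, temp, Math::max);
        }
        return maxByYear;
    }

    public static void main(String[] args) {
        List<String> input = List.of(
            "12345195001639362743", "12123195001341892769",
            "12111195001311271987", "12094194902231212122",
            "12093194901651209182");
        System.out.println(run(input)); // prints {1949=223, 1950=163}
    }
}
```

The point of Hadoop is that the same map and reduce logic keeps working unchanged when the input is terabytes spread over hundreds of nodes.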

HADOOP IN PRACTICE
One last thing before we start writing the code... Hadoop uses its own serialization because Java serialization is known to be inefficient. The result: a special set of data types, which all implement the Writable interface. The most common types are shown here... more specialized types exist (SortedMapWritable, ObjectWritable...).
Name            Description             JDK equivalent
IntWritable     32-bit integers         Integer
LongWritable    64-bit integers         Long
DoubleWritable  Floating-point numbers  Double
Text            Strings                 String
Source: UPenn

HADOOP IN PRACTICE
First thing to do when you write a MapReduce program: find the input types of the Map and Reduce functions. Map takes pairs like (0, 12345195001639362743) and produces pairs like (1950, 0163). Input types = (LongWritable, Text), output types = (Text, IntWritable), for instance. (We could use an IntWritable for the year too, but we never use its numerical properties.) Consequently, the Map class will extend:
Mapper<LongWritable, Text, Text, IntWritable>
And contain this function:
public void map(LongWritable key, Text value, Context context)

HADOOP IN PRACTICE
First thing to do when you write a MapReduce program: find the input types of the Map and Reduce functions. Reduce takes pairs like (1949, [0223, 0165]) and produces pairs like (1949, 0223). Input types = (Text, IntWritable), output types = (Text, IntWritable), for instance. Consequently, the Reduce class will extend:
Reducer<Text, IntWritable, Text, IntWritable>
And contain this function:
public void reduce(Text key, Iterable<IntWritable> values, Context context)

HADOOP IN PRACTICE

class MaxTemperature {
    static class MaxTemperatureMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            String year = line.substring(5, 9);
            int temp = Integer.parseInt(line.substring(9, 13));
            context.write(new Text(year), new IntWritable(temp));
        }
    }

    ... // Now write the reducer.
    ... // And create the main() function that creates the job and launches it.
}
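One possible completion of the two comments above, as a sketch: it assumes Hadoop's newer org.apache.hadoop.mapreduce API, the class names MaxTemperatureReducer and MaxTemperatureDriver are our own, and it only compiles against the Hadoop libraries (so it is not standalone code).

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

class MaxTemperatureReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // The shuffle phase hands us every temperature recorded for this
        // year; keep only the largest one.
        int max = Integer.MIN_VALUE;
        for (IntWritable value : values) {
            max = Math.max(max, value.get());
        }
        context.write(key, new IntWritable(max));
    }
}

class MaxTemperatureDriver {
    public static void main(String[] args) throws Exception {
        // args[0] = input directory on HDFS, args[1] = output directory.
        Job job = Job.getInstance(new Configuration(), "max temperature");
        job.setJarByClass(MaxTemperatureDriver.class);
        job.setMapperClass(MaxTemperature.MaxTemperatureMapper.class);
        job.setReducerClass(MaxTemperatureReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The driver is boilerplate: it tells Hadoop which classes implement Map and Reduce, the output (key, value) types, and where the input and output live on HDFS.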

YOUR TURN!
That's enough information to get you started! You can now start working on the exercises, which you will find here: http://i3s.unice.fr/~jplozi/hadooplab_lsds/hadooplab_lsds.pdf
You will probably need more information than just what we saw in these slides... you're expected to use Google and to figure things out on your own! Good luck!