Distributed Systems + Middleware Hadoop


Alessandro Sivieri
Dipartimento di Elettronica, Informazione e Bioingegneria
Politecnico, Italy

Contents
- Introduction to Hadoop
- History
- MapReduce
- HDFS
- Pig

INTRODUCTION TO HADOOP

Data
- We live in a digital world that produces data at an impressive speed
- As of 2012, 2.7 ZB of data exist (1 ZB = 10^21 bytes)
- NYSE produces 1 TB of data per day
- The Internet Archive grows by 20 TB per month
- The LHC produces 15 PB of data per year
- AT&T has a 300 TB database
- 100 TB of data are uploaded daily to Facebook

Data
- Personal data is growing, too
  - E.g., photos: a single photo taken with a commercial Nikon camera takes about 6 MB (default settings); a year of family photos takes about 8 GB of space
  - This adds to the personal content uploaded to social networks, video websites, blogs and more
- Machine-produced data is also growing
  - Machine logs
  - Sensor networks and monitored data

Data analysis
- Main problem: the ratio between disk read speed and disk capacity has not really improved
- Solution: parallelize the storage and read less data from each disk
- Problems: hardware replication, data aggregation
- Take, for example, an RDBMS (keeping in mind that disk seek time determines the latency of operations):
  - Updating a few records is fast: a B-Tree structure is efficient
  - Reading many records is slow: when access is dominated by seek time, it is faster to stream the entire disk (which operates at the transfer rate)
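The seek-versus-transfer argument can be checked with a little arithmetic. The sketch below is only illustrative: the 10 ms seek time, the 100 MB/s transfer rate, the 1 TB disk and the 100-byte records are assumptions chosen for the example, not figures from the slides.

```java
public class SeekVsTransfer {

    /** Seconds needed to fetch records one by one, paying one seek per record. */
    public static double randomAccessSeconds(long records, double seekMillis) {
        return records * seekMillis / 1000.0;
    }

    /** Seconds needed to stream the given number of bytes sequentially. */
    public static double sequentialSeconds(long bytes, double megabytesPerSecond) {
        return bytes / (megabytesPerSecond * 1_000_000.0);
    }

    public static void main(String[] args) {
        long diskBytes = 1_000_000_000_000L;   // assumed: 1 TB disk
        long recordBytes = 100;                // assumed: 100-byte records
        long touched = diskBytes / recordBytes / 100; // 1% of all records

        double seeking = randomAccessSeconds(touched, 10.0);   // assumed: 10 ms per seek
        double streaming = sequentialSeconds(diskBytes, 100.0); // assumed: 100 MB/s

        System.out.println("Seek-bound access to 1% of the records: " + seeking + " s");
        System.out.println("Streaming the whole disk:               " + streaming + " s");
    }
}
```

Under these assumptions, touching even 1% of the records through seeks takes about a hundred times longer than streaming the whole disk, which is the case MapReduce is designed for.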

Hadoop
- Reliable data storage: Hadoop Distributed File System
- Data analysis: MapReduce implementation
- Many tools for developers
  - Easy cluster administration
  - Query languages (some are similar to SQL)
  - Column-oriented distributed databases on top of Hadoop
  - Structured to unstructured repositories and back

Hadoop vs. The (existing) World
- RDBMS:
  - Disk seek time
  - Some types of data are not normalized (e.g., logs): MapReduce works well with unstructured data
  - MapReduce scales linearly (while an RDBMS does not)
- Volunteer computing (e.g., SETI@home):
  - Similar model, but Hadoop works in a localized cluster with high-bandwidth interconnects, while volunteer computing works over the Internet on untrusted computers that are performing other operations in the meantime

Hadoop vs. The (existing) World
- MPI:
  - Works well for compute-intensive jobs, but the network becomes the bottleneck when hundreds of GB of data have to be analyzed
  - Conversely, MapReduce does its best to exploit data locality by collocating the data with the compute node (network bandwidth is the most precious resource and must not be wasted)
  - MapReduce operates at a higher level than MPI: data flow is already taken care of
  - MapReduce implements failure recovery (in MPI the developer has to handle checkpoints and failure recovery)

HADOOP HISTORY

Brief history
- In 2002, Mike Cafarella and Doug Cutting started working on Apache Nutch, a new Web search engine
- In 2003, Google published a paper on the Google File System, a distributed filesystem, and Mike and Doug started working on a similar, open source project
- In 2004, Google published another paper, on the MapReduce computation model, and yet again Mike and Doug implemented an open source version in Nutch

Brief history
- In 2006, these two projects separated from Nutch and became Hadoop
- In the same year, Doug Cutting started working for Yahoo! and started using Hadoop there
- In 2008, Hadoop was used by Yahoo! (10,000-core cluster), Last.fm, Facebook and the NYT
- In 2009, Yahoo! broke the world record for sorting 1 TB of data, doing it in 62 seconds using Hadoop
- Since then, Hadoop has become mainstream in industry

Examples from the Real World
- Last.fm
  - Each user listening to a song (local or in streaming) generates a trace
  - Hadoop analyses these traces to produce charts
  - E.g., track statistics per user and per country, weekly top tracks
- Facebook
  - Daily and hourly summaries over user logs
  - Product usage, ad campaigns
  - Ad-hoc jobs over historical data
  - Long-term archival store
  - Integrity checks

Examples from the Real World
- Nutch search engine
  - Link inversion: find outgoing links that point to a specific Web page
  - URL fetching
  - Produce Lucene indexes (for text searches)
- Infochimps: explore network graphs
  - Social networks: Twitter analysis, measure communities
  - Biology: neuron connections in roundworms
  - Street connections: OpenStreetMap

Hadoop umbrella
- HDFS: distributed filesystem
- MapReduce: distributed data processing model
- MRUnit: unit testing of MapReduce applications
- Pig: data flow language to explore large datasets
- Hive: distributed data warehouse
- HBase: distributed, column-oriented db
- ZooKeeper: distributed coordination service
- Sqoop: efficient bulk transfers of data over HDFS

MAPREDUCE

MapReduce
- Model for analyzing large amounts of data
- Data has to be organized as a key-value dataset
- Two phases:
  - map(k1, v1) -> list(k2, v2), where the input domain is different from the output domain
  - (shuffle: intermediate phase to sort the output of map and group by key)
  - reduce(k2, list(v2)) -> list(v3), where the input and output domain is the same
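The three phases can be sketched in plain Java, without any Hadoop dependency. This is a single-process illustration of the model only, not how Hadoop implements it: the class name MapReduceSketch and the simplified "day temperature" input format are assumptions made for the example.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MapReduceSketch {

    /** map(k1, v1) -> list(k2, v2): one "day temperature" line becomes a (day, temperature) pair. */
    public static List<Map.Entry<String, Double>> map(long offset, String line) {
        List<Map.Entry<String, Double>> out = new ArrayList<>();
        String[] parts = line.trim().split("\\s+");
        if (parts.length == 2) {
            out.add(new SimpleEntry<String, Double>(parts[0], Double.valueOf(parts[1])));
        }
        return out; // lines that do not have exactly two fields produce no output
    }

    /** shuffle: groups the map output by key; the TreeMap keeps the keys sorted. */
    public static Map<String, List<Double>> shuffle(List<Map.Entry<String, Double>> pairs) {
        Map<String, List<Double>> grouped = new TreeMap<>();
        for (Map.Entry<String, Double> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    /** reduce(k2, list(v2)) -> v3: the mean of all temperatures recorded for one day. */
    public static double reduce(String day, List<Double> temperatures) {
        double sum = 0;
        for (double t : temperatures) {
            sum += t;
        }
        return sum / temperatures.size();
    }

    /** Runs map, shuffle and reduce in sequence over the input lines. */
    public static Map<String, Double> run(List<String> lines) {
        List<Map.Entry<String, Double>> mapped = new ArrayList<>();
        long offset = 0;
        for (String line : lines) {
            mapped.addAll(map(offset, line));
            offset += line.length() + 1;
        }
        Map<String, Double> result = new TreeMap<>();
        for (Map.Entry<String, List<Double>> group : shuffle(mapped).entrySet()) {
            result.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("20130101 20.0", "20130101 22.0", "20130102 18.0")));
    }
}
```

In a real job the map calls run in parallel on different nodes and the shuffle moves data over the network; only the shape of the three functions carries over.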

MapReduce on Hadoop
- Job: unit of work to be performed by the system
  - Input data
  - Map and Reduce implementations
  - Configuration
- Map task and Reduce task: smaller pieces of a job
- Jobtracker: node of the cluster coordinating the job
- Tasktracker: runs a task and reports to the jobtracker
- Split: part of the input

MapReduce on Hadoop
- The jobtracker splits the input into parts, and a map task is run for each split
  - Splits are processed in parallel on all nodes
- Hadoop tries hard to run the map task for a specific split on the same node where HDFS has stored that split
  - If that is not possible, at least in the same rack
- Map task output is written to local disk (not to HDFS: intermediate data does not need replication)
- Reduce tasks receive map output through the network (no data locality here)
- Final output is saved on HDFS

[Figure: MapReduce on Hadoop]

[Figure: MapReduce on Hadoop (with multiple reduce tasks)]

[Figure: MapReduce on Hadoop (no reduce)]

Intermediate data
- Combiner function
  - Aggregates data from several map outputs
  - To minimize network transfer
  - Hadoop decides whether to run it; it may not be executed at all
  - In a way, it can be seen as a local instance of the reduce function
- Shuffle
  - Sorts each map output
  - Transfers it to (one of the) reducers
  - Merges it with other map outputs, maintaining sorting
- Everything can be configured
  - Memory, buffer sizes, parallel copies: check the book!
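Seeing the combiner as a local reduce works directly for associative operations such as max or sum, but a mean needs some care: averaging the per-map means generally gives a different result than averaging all the values. A common workaround, sketched here in plain Java under the assumption that partial results travel as (sum, count) pairs, is to combine into such pairs and divide only in the final reduce. The names CombinerSketch and SumCount are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

public class CombinerSketch {

    /** Partial aggregate that a combiner could hand over to the reducer. */
    public static final class SumCount {
        public final double sum;
        public final long count;

        public SumCount(double sum, long count) {
            this.sum = sum;
            this.count = count;
        }

        /** Merging is associative, so it can be applied zero, one or many times. */
        public SumCount merge(SumCount other) {
            return new SumCount(sum + other.sum, count + other.count);
        }

        public double mean() {
            return sum / count;
        }
    }

    /** Combiner step: collapses the values of one map output into one partial aggregate. */
    public static SumCount combine(double... values) {
        double sum = 0;
        for (double v : values) {
            sum += v;
        }
        return new SumCount(sum, values.length);
    }

    /** Reducer step: merges all partial aggregates and divides only once, at the end. */
    public static double reduce(List<SumCount> partials) {
        SumCount total = new SumCount(0, 0);
        for (SumCount p : partials) {
            total = total.merge(p);
        }
        return total.mean();
    }

    public static void main(String[] args) {
        SumCount first = combine(10, 20, 30); // values seen by one map task
        SumCount second = combine(40);        // values seen by another map task
        System.out.println("mean from (sum, count) pairs: " + reduce(Arrays.asList(first, second)));
        System.out.println("mean of the two means:        " + (first.mean() + second.mean()) / 2);
    }
}
```

Because Hadoop may skip the combiner entirely, the reduce step has to give the same answer whether it receives raw values or merged partials, which is what the (sum, count) representation provides.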

MapReduce on Hadoop
- Everything is configurable
  - Number of map tasks
  - Number of reduce tasks
  - Data compression
  - Custom serialization
  - Memory management
- Profilers to understand performance in detail
- Tools to chain subsequent MapReduce runs into so-called workflows (Apache Oozie)
  - For complex problems

MapReduce on Hadoop: versions
- Version 1: first developed version of Hadoop MR
  - The runtime contains the previously mentioned Job and Tasktrackers
  - Still developed (latest version: 1.2)
- Version 2: the new Hadoop MR
  - The runtime, called YARN (Yet Another Resource Negotiator), replaces (and generalizes) the previous trackers
  - New HDFS features
  - Different configurations and APIs
- The book mixes the two versions here and there
- We will follow version 2 (=> Hadoop 2.2.0, the latest version)

MapReduce example
- Main interface is written in Java
  - Hadoop is written in Java
- There are interfaces in many other languages
  - Hadoop is able to work in streaming mode, using Unix pipes
  - Other languages take advantage of this: Python, Ruby, C++

MapReduce example
- A simple dataset
  - Temperature and humidity measured through WSN motes in two rooms
- Columns: Timestamp, Room, Temperature, Humidity, Battery

MapReduce example
- Dataset
  - Sampling time: 5 minutes
  - Total dataset: about 566 days (more than 1.5 years)
  - CSV (only a few megabytes: it doesn't really exploit Hadoop, but it will help us understand how it works)
- Calculations
  - Max temperature per day
  - Mean temperature per day

MapReduce example
- Classes to be implemented
  - Mapper
  - Reducer
  - Job handler (the main file, launching the execution and waiting for results); this will handle the configuration, too
- Configuration
  - Default filesystem (HDFS, but Hadoop can read from standard filesystems)
  - Default configuration for node and data managers

Hadoop configurations
- Standalone (default configuration)
  - Runs on a single JVM
  - No parallelization
  - Useful for debugging purposes
- Pseudo-distributed (single-host cluster)
  - Parallelization
  - Runs daemons on a single host
- Cluster
  - Full-scale Hadoop

Hadoop configuration
- We will see standalone and pseudo-distributed
  - I will include cluster configuration in the exercises, but it requires a certain amount of memory to run
- For standalone, you need to do almost nothing
  - Just set the JAVA_HOME environment variable in hadoop-env.sh before executing hadoop itself
- Notice that Windows is NOT supported in production mode for Hadoop
  - Examples have been tested on Linux and OS X

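Mapper

    import java.io.IOException;
    import java.text.DateFormat;
    import java.text.SimpleDateFormat;
    import java.util.Date;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class MeanTemperatureMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            String[] parts = line.split("\\s+");
            if (parts.length < 3) {
                return; // skip malformed lines
            }
            // The timestamp is in seconds; Date expects milliseconds
            long timestamp = Long.parseLong(parts[0]) * 1000;
            int mote = Integer.parseInt(parts[1]);
            if (mote != 18) {
                return; // keep only the readings of mote 18
            }
            // The temperature is stored in hundredths of a degree
            int temperatureInt = Integer.parseInt(parts[2]);
            double temperature = temperatureInt / 100.0;
            Date date = new Date(timestamp);
            DateFormat df = new SimpleDateFormat("yyyyMMdd");
            String dateKey = df.format(date);
            // Emit (day, temperature)
            context.write(new Text(dateKey), new DoubleWritable(temperature));
        }
    }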

[Figure: Mapper. Input columns (Timestamp, Room, Temperature, Humidity, Battery) are mapped to (Day, Temperature) pairs]

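Reducer

    import java.io.IOException;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class MeanTemperatureReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {

        @Override
        protected void reduce(Text key, Iterable<DoubleWritable> values, Context context)
                throws IOException, InterruptedException {
            double sum = 0;
            int counter = 0;
            // Accumulate all the temperatures recorded for this day
            for (DoubleWritable dw : values) {
                sum += dw.get();
                ++counter;
            }
            // Emit (day, mean temperature)
            context.write(key, new DoubleWritable(sum / counter));
        }
    }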

[Figure: Reducer. (Day, Temperature) values are reduced to (Day, Mean temperature)]

Job handler

    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class MeanTemperatureJob extends Configured implements Tool {

        @Override
        public int run(String[] args) throws Exception {
            if (args.length != 2) {
                System.err.printf("Usage: %s [generic options] <input> <output>\n",
                        getClass().getSimpleName());
                ToolRunner.printGenericCommandUsage(System.err);
                return -1;
            }
            Job job = Job.getInstance(getConf(), "Mean temperature");
            job.setJarByClass(getClass());
            // Input and output paths come from the command line
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            job.setMapperClass(MeanTemperatureMapper.class);
            job.setReducerClass(MeanTemperatureReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(DoubleWritable.class);
            // Block until the job finishes
            return job.waitForCompletion(true) ? 0 : 1;
        }

        public static void main(String[] args) throws Exception {
            int exitCode = ToolRunner.run(new MeanTemperatureJob(), args);
            System.exit(exitCode);
        }
    }

Run MapReduce
- Java classes are compiled through Maven
  - The example will contain a sample pom.xml file
- To run the example:
    mvn package
    export HADOOP_CLASSPATH=somename.jar
    hadoop MeanTemperatureJob -fs file:/// input/file.csv output
- The output directory must NOT exist
- Notice the -fs option

HDFS

HDFS
- The previous example used the OS filesystem
  - Hadoop does not have any problem with that
- But, if we want to move to pseudo-distributed and cluster modes, we need to start using HDFS
- The Hadoop Distributed File System is designed to store very large files and is suited to streaming data access

Name and data nodes
- HDFS has a single* namenode (master) and several datanodes (slaves)
- The namenode manages the filesystem namespace (i.e., metadata)
  - It maintains a namespace image and an edit log
  - It has a list of all the datanodes and which blocks they store
  - It does not store any file itself
- Back up namenodes!
* From version 2, Hadoop added the concept of HDFS federation

File distribution
- A datanode is a machine storing blocks of files, continuously communicating with the namenode
  - Without a namenode, a datanode is useless
- Each file uploaded to HDFS is split into blocks of 64 MB each
  - Minimize seek, maximize transfer rate
  - If a file (or the last block of a file) is smaller than 64 MB, it does not occupy 64 MB
  - This is different from OS filesystems
- A file can be bigger than any single disk in the network
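The block arithmetic can be made concrete with a few lines of plain Java. This is only a calculation sketch, assuming 64 MB means 64 * 1024 * 1024 bytes; the class name BlockMath is hypothetical and nothing here talks to HDFS.

```java
public class BlockMath {

    /** Block size used on the slides: 64 MB, taken here as 64 * 1024 * 1024 bytes. */
    public static final long BLOCK_SIZE = 64L * 1024 * 1024;

    /** Number of blocks needed to hold a file of the given size. */
    public static long blockCount(long fileBytes) {
        return (fileBytes + BLOCK_SIZE - 1) / BLOCK_SIZE; // ceiling division
    }

    /** Bytes actually occupied by the last block, which may be shorter than a full block. */
    public static long lastBlockBytes(long fileBytes) {
        if (fileBytes == 0) {
            return 0;
        }
        long remainder = fileBytes % BLOCK_SIZE;
        return remainder == 0 ? BLOCK_SIZE : remainder;
    }

    public static void main(String[] args) {
        long fileBytes = 150L * 1024 * 1024; // a 150 MB file, chosen for illustration
        System.out.println("blocks: " + blockCount(fileBytes));
        System.out.println("last block (MB): " + lastBlockBytes(fileBytes) / (1024 * 1024));
    }
}
```

A 150 MB file therefore takes two full blocks plus one 22 MB block, and the short block occupies only 22 MB on the datanode.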

[Figure: File distribution]

Interactions
- Command-line interface
  - ls, mkdir, copyFromLocal, cat
- Third-party tools
  - Fuse
- Java APIs

Writable datatypes
- Mappers and Reducers used particular types
  - Text, DoubleWritable, LongWritable
- Hadoop defines specific types wrapping Java types, optimizing network serialization
- To add new types, you can implement Writable and WritableComparable

MapReduce example (reprise)
- Let's assume we want to output the mean temperature per day and per mote
- The key of our key-value pairs must then be the pair (day, mote)
- Create a new class that implements WritableComparable
  - Comparable is needed because Hadoop sorts keys before reducing

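MapReduce example (reprise)

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.WritableComparable;

    public class RoomDayWritable implements WritableComparable<RoomDayWritable> {
        private Text date;
        private IntWritable mote;

        public RoomDayWritable(String date, int mote) {
            this.date = new Text(date);
            this.mote = new IntWritable(mote);
        }

        @Override
        public void write(DataOutput out) throws IOException {
            // Fields are written in the same order in which readFields reads them
            this.mote.write(out);
            this.date.write(out);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            this.mote.readFields(in);
            this.date.readFields(in);
        }

        // The bodies of the following methods are omitted on the slide
        public int hashCode() { ... }
        public boolean equals(Object obj) { ... }
        public int compareTo(RoomDayWritable other) { ... }
    }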
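The slide does not show the bodies of hashCode, equals and compareTo. As a rough idea of what a composite (day, mote) key has to provide, here is a plain-Java sketch with no Hadoop types. The class name DayMoteKey and the choice to order by date first and mote second are assumptions made for illustration, not the author's implementation.

```java
import java.util.Objects;

public class DayMoteKey implements Comparable<DayMoteKey> {
    private final String date; // e.g. "20130101"; yyyyMMdd strings sort chronologically
    private final int mote;

    public DayMoteKey(String date, int mote) {
        this.date = Objects.requireNonNull(date);
        this.mote = mote;
    }

    /** Orders by date first, then by mote (assumed ordering). */
    @Override
    public int compareTo(DayMoteKey other) {
        int byDate = date.compareTo(other.date);
        return byDate != 0 ? byDate : Integer.compare(mote, other.mote);
    }

    /** Two keys are equal when both fields match, consistently with compareTo. */
    @Override
    public boolean equals(Object obj) {
        if (this == obj) {
            return true;
        }
        if (!(obj instanceof DayMoteKey)) {
            return false;
        }
        DayMoteKey other = (DayMoteKey) obj;
        return mote == other.mote && date.equals(other.date);
    }

    /** Equal keys must produce equal hash codes. */
    @Override
    public int hashCode() {
        return Objects.hash(date, mote);
    }

    @Override
    public String toString() {
        return date + "\t" + mote;
    }
}
```

The three methods need to agree with each other: sorting relies on compareTo, grouping relies on keys comparing as equal, and Hadoop's default hash partitioner uses hashCode to choose the reducer for a key.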

[Figure: MapReduce example (reprise). (Day, Mote, Temperature) values are reduced to (Day, Mote, Mean temperature)]

Single node
- Configuration needs to be changed
  - Location of all the daemons is localhost
  - Number of replicas is 1
- Format HDFS
- Start the HDFS daemon
- Start the YARN daemon
- Run the demo

Configuration

core-site.xml

    <configuration>
      <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:9000</value>
      </property>
    </configuration>

hdfs-site.xml

    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>1</value>
      </property>
    </configuration>

mapred-site.xml

    <configuration>
      <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
      </property>
    </configuration>

yarn-site.xml

    <configuration>
      <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
      </property>
      <property>
        <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
      </property>
    </configuration>

Run the example
- Format HDFS
    hdfs namenode -format
- Start the daemons (check the logs!)
    start-dfs.sh
    start-yarn.sh
    mr-jobhistory-daemon.sh start historyserver (*)
    jps
- Copy the file to HDFS
    hadoop fs -mkdir -p /user/hadoop
    hadoop fs -copyFromLocal source.csv .

Run the example
- Run
    hadoop jar somename.jar MeanTemperatureJob source.csv output
- Again, the output directory (on HDFS!) must NOT exist
- Check the output
    hadoop fs -cat output/part-r

Cluster mode
- Configuration needs to be changed
  - See the example file on the course website
- A new file, called slaves, has to be created on the master node, containing the addresses of all the datanodes
- Running the example works in exactly the same way

PIG

Pig
- Language and runtime to perform complex queries on richer data structures
  - Pig Latin is the language to express data flows
  - Runtime to execute scripts on Hadoop clusters
- A script is a sequence of transformations on initial data
  - They will be translated into MapReduce jobs
- Faster development: an analysis can be written in a few lines and executed on terabytes of data
- User Defined Functions can extend Pig capabilities
- Performance is comparable with native MR code*

Pig
- Three execution modes
  - Script
  - Interactive shell (Grunt)
  - Embedded in Java
- IDE plugins
- Scripts can be run in local mode (single JVM) or on a cluster (pseudo or real)
- Pig is able to generate a (reasonably) complete and concise sample dataset for a script

MapReduce example with Pig

    REGISTER ./hadooptests-1.0.jar;
    raw = LOAD 'temperature-sorted.csv' USING PigStorage('\t')
          AS (timestamp:long, mote:int, temperature:int, humidity:int, battery:int);
    clean = FILTER raw BY mote != 1;
    day = FOREACH clean GENERATE me.sivieri.hadoop.pig.ExtractDate(timestamp) as date,
          mote, temperature / as temperature;
    grouped_date_mote = GROUP day BY (date, mote);
    mean_temp = FOREACH grouped_date_mote GENERATE group, AVG(day.temperature);
    DUMP mean_temp;

MapReduce example with Pig

    package me.sivieri.hadoop.pig;

    import java.io.IOException;
    import java.text.DateFormat;
    import java.text.SimpleDateFormat;
    import java.util.Date;
    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.Tuple;

    public class ExtractDate extends EvalFunc<String> {

        @Override
        public String exec(Tuple arg0) throws IOException {
            if (arg0 == null || arg0.size() == 0) {
                return null;
            }
            try {
                // The first field of the tuple is the timestamp
                Long timestamp = (Long) arg0.get(0);
                Date date = new Date(timestamp);
                DateFormat df = new SimpleDateFormat("yyyyMMdd");
                return df.format(date);
            } catch (Exception e) {
                System.err.println("ExtractDate: failed to process input; error - " + e.getMessage());
                return null;
            }
        }
    }

Bibliography
- Tom White, Hadoop: The Definitive Guide, 3rd Edition, O'Reilly
- Sanjay Ghemawat, Howard Gobioff and Shun-Tak Leung, The Google File System, Google, 2003
- Jeffrey Dean and Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, Google, 2004


More information

The Hadoop Eco System Shanghai Data Science Meetup

The Hadoop Eco System Shanghai Data Science Meetup The Hadoop Eco System Shanghai Data Science Meetup Karthik Rajasethupathy, Christian Kuka 03.11.2015 @Agora Space Overview What is this talk about? Giving an overview of the Hadoop Ecosystem and related

More information

MapReduce with Apache Hadoop Analysing Big Data

MapReduce with Apache Hadoop Analysing Big Data MapReduce with Apache Hadoop Analysing Big Data April 2010 Gavin Heavyside [email protected] About Journey Dynamics Founded in 2006 to develop software technology to address the issues

More information

Big Data Too Big To Ignore

Big Data Too Big To Ignore Big Data Too Big To Ignore Geert! Big Data Consultant and Manager! Currently finishing a 3 rd Big Data project! IBM & Cloudera Certified! IBM & Microsoft Big Data Partner 2 Agenda! Defining Big Data! Introduction

More information

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 1 Hadoop: A Framework for Data- Intensive Distributed Computing CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 2 What is Hadoop? Hadoop is a software framework for distributed processing of large datasets

More information

Hadoop. Dawid Weiss. Institute of Computing Science Poznań University of Technology

Hadoop. Dawid Weiss. Institute of Computing Science Poznań University of Technology Hadoop Dawid Weiss Institute of Computing Science Poznań University of Technology 2008 Hadoop Programming Summary About Config 1 Open Source Map-Reduce: Hadoop About Cluster Configuration 2 Programming

More information

How To Install Hadoop 1.2.1.1 From Apa Hadoop 1.3.2 To 1.4.2 (Hadoop)

How To Install Hadoop 1.2.1.1 From Apa Hadoop 1.3.2 To 1.4.2 (Hadoop) Contents Download and install Java JDK... 1 Download the Hadoop tar ball... 1 Update $HOME/.bashrc... 3 Configuration of Hadoop in Pseudo Distributed Mode... 4 Format the newly created cluster to create

More information

Jeffrey D. Ullman slides. MapReduce for data intensive computing

Jeffrey D. Ullman slides. MapReduce for data intensive computing Jeffrey D. Ullman slides MapReduce for data intensive computing Single-node architecture CPU Machine Learning, Statistics Memory Classical Data Mining Disk Commodity Clusters Web data sets can be very

More information

How To Use Hadoop

How To Use Hadoop Hadoop in Action Justin Quan March 15, 2011 Poll What s to come Overview of Hadoop for the uninitiated How does Hadoop work? How do I use Hadoop? How do I get started? Final Thoughts Key Take Aways Hadoop

More information

Big Data With Hadoop

Big Data With Hadoop With Saurabh Singh [email protected] The Ohio State University February 11, 2016 Overview 1 2 3 Requirements Ecosystem Resilient Distributed Datasets (RDDs) Example Code vs Mapreduce 4 5 Source: [Tutorials

More information

Data-Intensive Programming. Timo Aaltonen Department of Pervasive Computing

Data-Intensive Programming. Timo Aaltonen Department of Pervasive Computing Data-Intensive Programming Timo Aaltonen Department of Pervasive Computing Data-Intensive Programming Lecturer: Timo Aaltonen University Lecturer [email protected] Assistants: Henri Terho and Antti

More information

HADOOP ADMINISTATION AND DEVELOPMENT TRAINING CURRICULUM

HADOOP ADMINISTATION AND DEVELOPMENT TRAINING CURRICULUM HADOOP ADMINISTATION AND DEVELOPMENT TRAINING CURRICULUM 1. Introduction 1.1 Big Data Introduction What is Big Data Data Analytics Bigdata Challenges Technologies supported by big data 1.2 Hadoop Introduction

More information

!"#$%&' ( )%#*'+,'-#.//"0( !"#$"%&'()*$+()',!-+.'/', 4(5,67,!-+!"89,:*$;'0+$.<.,&0$'09,&)"/=+,!()<>'0, 3, Processing LARGE data sets

!#$%&' ( )%#*'+,'-#.//0( !#$%&'()*$+()',!-+.'/', 4(5,67,!-+!89,:*$;'0+$.<.,&0$'09,&)/=+,!()<>'0, 3, Processing LARGE data sets !"#$%&' ( Processing LARGE data sets )%#*'+,'-#.//"0( Framework for o! reliable o! scalable o! distributed computation of large data sets 4(5,67,!-+!"89,:*$;'0+$.

More information

Overview. Big Data in Apache Hadoop. - HDFS - MapReduce in Hadoop - YARN. https://hadoop.apache.org. Big Data Management and Analytics

Overview. Big Data in Apache Hadoop. - HDFS - MapReduce in Hadoop - YARN. https://hadoop.apache.org. Big Data Management and Analytics Overview Big Data in Apache Hadoop - HDFS - MapReduce in Hadoop - YARN https://hadoop.apache.org 138 Apache Hadoop - Historical Background - 2003: Google publishes its cluster architecture & DFS (GFS)

More information

Complete Java Classes Hadoop Syllabus Contact No: 8888022204

Complete Java Classes Hadoop Syllabus Contact No: 8888022204 1) Introduction to BigData & Hadoop What is Big Data? Why all industries are talking about Big Data? What are the issues in Big Data? Storage What are the challenges for storing big data? Processing What

More information

Introduc)on to the MapReduce Paradigm and Apache Hadoop. Sriram Krishnan [email protected]

Introduc)on to the MapReduce Paradigm and Apache Hadoop. Sriram Krishnan sriram@sdsc.edu Introduc)on to the MapReduce Paradigm and Apache Hadoop Sriram Krishnan [email protected] Programming Model The computa)on takes a set of input key/ value pairs, and Produces a set of output key/value pairs.

More information

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee [email protected] June 3 rd, 2008

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org June 3 rd, 2008 Hadoop Distributed File System Dhruba Borthakur Apache Hadoop Project Management Committee [email protected] June 3 rd, 2008 Who Am I? Hadoop Developer Core contributor since Hadoop s infancy Focussed

More information

Hadoop Distributed File System. T-111.5550 Seminar On Multimedia 2009-11-11 Eero Kurkela

Hadoop Distributed File System. T-111.5550 Seminar On Multimedia 2009-11-11 Eero Kurkela Hadoop Distributed File System T-111.5550 Seminar On Multimedia 2009-11-11 Eero Kurkela Agenda Introduction Flesh and bones of HDFS Architecture Accessing data Data replication strategy Fault tolerance

More information

CS54100: Database Systems

CS54100: Database Systems CS54100: Database Systems Cloud Databases: The Next Post- Relational World 18 April 2012 Prof. Chris Clifton Beyond RDBMS The Relational Model is too limiting! Simple data model doesn t capture semantics

More information

ITG Software Engineering

ITG Software Engineering Introduction to Apache Hadoop Course ID: Page 1 Last Updated 12/15/2014 Introduction to Apache Hadoop Course Overview: This 5 day course introduces the student to the Hadoop architecture, file system,

More information

Peers Techno log ies Pv t. L td. HADOOP

Peers Techno log ies Pv t. L td. HADOOP Page 1 Peers Techno log ies Pv t. L td. Course Brochure Overview Hadoop is a Open Source from Apache, which provides reliable storage and faster process by using the Hadoop distibution file system and

More information

Apache Hadoop. Alexandru Costan

Apache Hadoop. Alexandru Costan 1 Apache Hadoop Alexandru Costan Big Data Landscape No one-size-fits-all solution: SQL, NoSQL, MapReduce, No standard, except Hadoop 2 Outline What is Hadoop? Who uses it? Architecture HDFS MapReduce Open

More information

HDFS. Hadoop Distributed File System

HDFS. Hadoop Distributed File System HDFS Kevin Swingler Hadoop Distributed File System File system designed to store VERY large files Streaming data access Running across clusters of commodity hardware Resilient to node failure 1 Large files

More information

Lecture 5: GFS & HDFS! Claudia Hauff (Web Information Systems)! [email protected]

Lecture 5: GFS & HDFS! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl Big Data Processing, 2014/15 Lecture 5: GFS & HDFS!! Claudia Hauff (Web Information Systems)! [email protected] 1 Course content Introduction Data streams 1 & 2 The MapReduce paradigm Looking behind

More information

Hadoop & its Usage at Facebook

Hadoop & its Usage at Facebook Hadoop & its Usage at Facebook Dhruba Borthakur Project Lead, Hadoop Distributed File System [email protected] Presented at the The Israeli Association of Grid Technologies July 15, 2009 Outline Architecture

More information

L1: Introduction to Hadoop

L1: Introduction to Hadoop L1: Introduction to Hadoop Feng Li [email protected] School of Statistics and Mathematics Central University of Finance and Economics Revision: December 1, 2014 Today we are going to learn... 1 General

More information

Big Data Course Highlights

Big Data Course Highlights Big Data Course Highlights The Big Data course will start with the basics of Linux which are required to get started with Big Data and then slowly progress from some of the basics of Hadoop/Big Data (like

More information

Lambda Architecture. CSCI 5828: Foundations of Software Engineering Lecture 29 12/09/2014

Lambda Architecture. CSCI 5828: Foundations of Software Engineering Lecture 29 12/09/2014 Lambda Architecture CSCI 5828: Foundations of Software Engineering Lecture 29 12/09/2014 1 Goals Cover the material in Chapter 8 of the Concurrency Textbook The Lambda Architecture Batch Layer MapReduce

More information

Data-intensive computing systems

Data-intensive computing systems Data-intensive computing systems Hadoop Universtity of Verona Computer Science Department Damiano Carra Acknowledgements! Credits Part of the course material is based on slides provided by the following

More information

Workshop on Hadoop with Big Data

Workshop on Hadoop with Big Data Workshop on Hadoop with Big Data Hadoop? Apache Hadoop is an open source framework for distributed storage and processing of large sets of data on commodity hardware. Hadoop enables businesses to quickly

More information

Hadoop MapReduce Tutorial - Reduce Comp variability in Data Stamps

Hadoop MapReduce Tutorial - Reduce Comp variability in Data Stamps Distributed Recommenders Fall 2010 Distributed Recommenders Distributed Approaches are needed when: Dataset does not fit into memory Need for processing exceeds what can be provided with a sequential algorithm

More information

Hadoop: The Definitive Guide

Hadoop: The Definitive Guide Hadoop: The Definitive Guide Tom White foreword by Doug Cutting O'REILLY~ Beijing Cambridge Farnham Köln Sebastopol Taipei Tokyo Table of Contents Foreword Preface xiii xv 1. Meet Hadoop 1 Da~! 1 Data

More information

Lecture 2 (08/31, 09/02, 09/09): Hadoop. Decisions, Operations & Information Technologies Robert H. Smith School of Business Fall, 2015

Lecture 2 (08/31, 09/02, 09/09): Hadoop. Decisions, Operations & Information Technologies Robert H. Smith School of Business Fall, 2015 Lecture 2 (08/31, 09/02, 09/09): Hadoop Decisions, Operations & Information Technologies Robert H. Smith School of Business Fall, 2015 K. Zhang BUDT 758 What we ll cover Overview Architecture o Hadoop

More information

Hadoop and Eclipse. Eclipse Hawaii User s Group May 26th, 2009. Seth Ladd http://sethladd.com

Hadoop and Eclipse. Eclipse Hawaii User s Group May 26th, 2009. Seth Ladd http://sethladd.com Hadoop and Eclipse Eclipse Hawaii User s Group May 26th, 2009 Seth Ladd http://sethladd.com Goal YOU can use the same technologies as The Big Boys Google Yahoo (2000 nodes) Last.FM AOL Facebook (2.5 petabytes

More information

Hadoop and its Usage at Facebook. Dhruba Borthakur [email protected], June 22 rd, 2009

Hadoop and its Usage at Facebook. Dhruba Borthakur dhruba@apache.org, June 22 rd, 2009 Hadoop and its Usage at Facebook Dhruba Borthakur [email protected], June 22 rd, 2009 Who Am I? Hadoop Developer Core contributor since Hadoop s infancy Focussed on Hadoop Distributed File System Facebook

More information

Big Data. Donald Kossmann & Nesime Tatbul Systems Group ETH Zurich

Big Data. Donald Kossmann & Nesime Tatbul Systems Group ETH Zurich Big Data Donald Kossmann & Nesime Tatbul Systems Group ETH Zurich First, an Announcement There will be a repetition exercise group on Wednesday this week. TAs will answer your questions on SQL, relational

More information

A Brief Outline on Bigdata Hadoop

A Brief Outline on Bigdata Hadoop A Brief Outline on Bigdata Hadoop Twinkle Gupta 1, Shruti Dixit 2 RGPV, Department of Computer Science and Engineering, Acropolis Institute of Technology and Research, Indore, India Abstract- Bigdata is

More information

Hadoop: The Definitive Guide

Hadoop: The Definitive Guide FOURTH EDITION Hadoop: The Definitive Guide Tom White Beijing Cambridge Famham Koln Sebastopol Tokyo O'REILLY Table of Contents Foreword Preface xvii xix Part I. Hadoop Fundamentals 1. Meet Hadoop 3 Data!

More information

Open source Google-style large scale data analysis with Hadoop

Open source Google-style large scale data analysis with Hadoop Open source Google-style large scale data analysis with Hadoop Ioannis Konstantinou Email: [email protected] Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory School of Electrical

More information

MapReduce. Tushar B. Kute, http://tusharkute.com

MapReduce. Tushar B. Kute, http://tusharkute.com MapReduce Tushar B. Kute, http://tusharkute.com What is MapReduce? MapReduce is a framework using which we can write applications to process huge amounts of data, in parallel, on large clusters of commodity

More information

Introduction to Hadoop

Introduction to Hadoop Introduction to Hadoop 1 What is Hadoop? the big data revolution extracting value from data cloud computing 2 Understanding MapReduce the word count problem more examples MCS 572 Lecture 24 Introduction

More information

Intro to Map/Reduce a.k.a. Hadoop

Intro to Map/Reduce a.k.a. Hadoop Intro to Map/Reduce a.k.a. Hadoop Based on: Mining of Massive Datasets by Ra jaraman and Ullman, Cambridge University Press, 2011 Data Mining for the masses by North, Global Text Project, 2012 Slides by

More information

Hadoop 101. Lars George. NoSQL- Ma4ers, Cologne April 26, 2013

Hadoop 101. Lars George. NoSQL- Ma4ers, Cologne April 26, 2013 Hadoop 101 Lars George NoSQL- Ma4ers, Cologne April 26, 2013 1 What s Ahead? Overview of Apache Hadoop (and related tools) What it is Why it s relevant How it works No prior experience needed Feel free

More information

Cloudera Certified Developer for Apache Hadoop

Cloudera Certified Developer for Apache Hadoop Cloudera CCD-333 Cloudera Certified Developer for Apache Hadoop Version: 5.6 QUESTION NO: 1 Cloudera CCD-333 Exam What is a SequenceFile? A. A SequenceFile contains a binary encoding of an arbitrary number

More information

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Summary Xiangzhe Li Nowadays, there are more and more data everyday about everything. For instance, here are some of the astonishing

More information

Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop

Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop Lecture 32 Big Data 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop 1 2 Big Data Problems Data explosion Data from users on social

More information

Hadoop and Map-Reduce. Swati Gore

Hadoop and Map-Reduce. Swati Gore Hadoop and Map-Reduce Swati Gore Contents Why Hadoop? Hadoop Overview Hadoop Architecture Working Description Fault Tolerance Limitations Why Map-Reduce not MPI Distributed sort Why Hadoop? Existing Data

More information

Hadoop Framework. technology basics for data scientists. Spring - 2014. Jordi Torres, UPC - BSC www.jorditorres.eu @JordiTorresBCN

Hadoop Framework. technology basics for data scientists. Spring - 2014. Jordi Torres, UPC - BSC www.jorditorres.eu @JordiTorresBCN Hadoop Framework technology basics for data scientists Spring - 2014 Jordi Torres, UPC - BSC www.jorditorres.eu @JordiTorresBCN Warning! Slides are only for presenta8on guide We will discuss+debate addi8onal

More information

Programming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview

Programming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview Programming Hadoop 5-day, instructor-led BD-106 MapReduce Overview The Client Server Processing Pattern Distributed Computing Challenges MapReduce Defined Google's MapReduce The Map Phase of MapReduce

More information

Introduction to Cloud Computing

Introduction to Cloud Computing Introduction to Cloud Computing MapReduce and Hadoop 15 319, spring 2010 17 th Lecture, Mar 16 th Majd F. Sakr Lecture Goals Transition to MapReduce from Functional Programming Understand the origins of

More information

A very short Intro to Hadoop

A very short Intro to Hadoop 4 Overview A very short Intro to Hadoop photo by: exfordy, flickr 5 How to Crunch a Petabyte? Lots of disks, spinning all the time Redundancy, since disks die Lots of CPU cores, working all the time Retry,

More information

Big Data Analytics with MapReduce VL Implementierung von Datenbanksystemen 05-Feb-13

Big Data Analytics with MapReduce VL Implementierung von Datenbanksystemen 05-Feb-13 Big Data Analytics with MapReduce VL Implementierung von Datenbanksystemen 05-Feb-13 Astrid Rheinländer Wissensmanagement in der Bioinformatik What is Big Data? collection of data sets so large and complex

More information

Data-Intensive Computing with Map-Reduce and Hadoop

Data-Intensive Computing with Map-Reduce and Hadoop Data-Intensive Computing with Map-Reduce and Hadoop Shamil Humbetov Department of Computer Engineering Qafqaz University Baku, Azerbaijan [email protected] Abstract Every day, we create 2.5 quintillion

More information

Hadoop Architecture. Part 1

Hadoop Architecture. Part 1 Hadoop Architecture Part 1 Node, Rack and Cluster: A node is simply a computer, typically non-enterprise, commodity hardware for nodes that contain data. Consider we have Node 1.Then we can add more nodes,

More information

Tutorial- Counting Words in File(s) using MapReduce

Tutorial- Counting Words in File(s) using MapReduce Tutorial- Counting Words in File(s) using MapReduce 1 Overview This document serves as a tutorial to setup and run a simple application in Hadoop MapReduce framework. A job in Hadoop MapReduce usually

More information

CS2510 Computer Operating Systems

CS2510 Computer Operating Systems CS2510 Computer Operating Systems HADOOP Distributed File System Dr. Taieb Znati Computer Science Department University of Pittsburgh Outline HDF Design Issues HDFS Application Profile Block Abstraction

More information

CS2510 Computer Operating Systems

CS2510 Computer Operating Systems CS2510 Computer Operating Systems HADOOP Distributed File System Dr. Taieb Znati Computer Science Department University of Pittsburgh Outline HDF Design Issues HDFS Application Profile Block Abstraction

More information

BIG DATA APPLICATIONS

BIG DATA APPLICATIONS BIG DATA ANALYTICS USING HADOOP AND SPARK ON HATHI Boyu Zhang Research Computing ITaP BIG DATA APPLICATIONS Big data has become one of the most important aspects in scientific computing and business analytics

More information