Big Data Analytics with MapReduce VL Implementierung von Datenbanksystemen 05-Feb-13

Size: px
Start display at page:

Download "Big Data Analytics with MapReduce VL Implementierung von Datenbanksystemen 05-Feb-13"

Transcription

1 Big Data Analytics with MapReduce VL Implementierung von Datenbanksystemen 05-Feb-13 Astrid Rheinländer Wissensmanagement in der Bioinformatik

2 What is Big Data? collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications [ terabytes, petabytes in a few years exabytes (SKA telescope) Challenges Storage Analysis Search... Astrid Rheinländer: Big Data Analytics with MapReduce 2

3 Example Twitter 2012: 63 Billion Tweets, 8.5 TB data Business areas: The data sell access to content Advertisement [All images: Astrid Rheinländer: Big Data Analytics with MapReduce 3

4 Example Cern, LHC 2012: LHC experiments generated 22PB of data 99% have already been thrown away [ Analysis needs 100k modern processors to evaluate experiments Data is stored & processed in LHC Computing Grid 150 data & compute centers around the world Heterogeneous architecture [ Astrid Rheinländer: Big Data Analytics with MapReduce 4

5 Big Data Landscape Astrid Rheinländer: Big Data Analytics with MapReduce 5

6 Big Data Landscape Astrid Rheinländer: Big Data Analytics with MapReduce 6

7 Overview MapReduce Programming model Architecture and execution Distributed File System HDFS Error handling & performance issues MapReduce vs. parallel Databases Extensions of MapReduce Astrid Rheinländer: Big Data Analytics with MapReduce 7

8 Inspired by functional programming concepts map and reduce Operates on key-value pairs Programming Model Map Process key-value pairs individually with UDF Generates key/value pairs Example (LISP): (mapcar 1+ ( )) ( ) 1, x 1 map n, x n 1, y 1 t, y t Astrid Rheinländer: Big Data Analytics with MapReduce 8

9 Inspired by functional programming concepts map and reduce Operates on key-value pairs Programming Model Map Process key-value pairs individually with UDF Generates key/value pairs Example (LISP): (mapcar 1+ ( )) ( ) 1, x 1 map n, x n 1, y 1 t, y t Reduce Merges intermediate key-value pairs with same key Example (LISP): 1, a 1 reduce 1, result 1 1, z 1 (reduce + ( )) 10 Astrid Rheinländer: Big Data Analytics with MapReduce 9

10 Example Term Count Input data: Documents 1, to be, or not to be, that is the question: 2, whether 'tis nobler in the mind to suffer 3, the slings and arrows of outrageous fortune, 4, or to take arms against a sea of troubles Task: How often are terms contained in the set of documents? Astrid Rheinländer: Big Data Analytics with MapReduce 10

11 Example Term Count public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> { private final static IntWritable one = new IntWritable(1); private Text word = new Text(); public void map(longwritable key, Text value, OutputCollector<Text, IntWritable> output) { String line = value.tostring(); StringTokenizer tokenizer = new StringTokenizer(line); } } while (tokenizer.hasmoretokens()) { word.set(tokenizer.nexttoken()); output.collect(word, 1); } 1, to be, or not 2, whether 'tis 3, the slings 4, or to take map to, 1 be, 1 or 1 not, 1 whether,1 tis, 1 the, 1 slings, 1 or, 1 to, 1 take, 1 Astrid Rheinländer: Big Data Analytics with MapReduce 11

12 Example Term Count public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> { private final static IntWritable one = new IntWritable(1); private Text word = new Text(); public void map(longwritable key, Text value, OutputCollector<Text, IntWritable> output) { String line = value.tostring(); StringTokenizer tokenizer = new StringTokenizer(line); } } while (tokenizer.hasmoretokens()) { word.set(tokenizer.nexttoken()); output.collect(word, 1); } 1, to be, or not 2, whether 'tis 3, the slings 4, or to take map to, 1 be, 1 or 1 not, 1 whether,1 tis, 1 the, 1 slings, 1 or, 1 to, 1 take, 1 Astrid Rheinländer: Big Data Analytics with MapReduce 12

13 Example Term Count public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> { public void reduce(text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output) { int sum = 0; while (values.hasnext()) { sum += values.next().get(); } } } output.collect(key, new IntWritable(sum)); not, 1 tis, 1 whether,1 or 1 or, 1 be, 1 to, 1 to, 1 the, 1 reduce not,1 whether,1 tis, 1 the, 1 slings, 1 or, 2 be, 1 to, 2 take, 1 take, 1 slings, 1 Astrid Rheinländer: Big Data Analytics with MapReduce 13

14 Example Term Count public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> { public void reduce(text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output) { int sum = 0; while (values.hasnext()) { sum += values.next().get(); } } } output.collect(key, new IntWritable(sum)); not, 1 tis, 1 whether,1 or 1 or, 1 be, 1 to, 1 to, 1 the, 1 reduce not,1 whether,1 tis, 1 the, 1 slings, 1 or, 2 be, 1 to, 2 take, 1 take, 1 slings, 1 Astrid Rheinländer: Big Data Analytics with MapReduce 14

15 Example Term Count Programmer s point of view: doc 1 doc n map term 1, 1 term z, 1 reduce term 1, count 1 term z, count z MapReduce framework takes care of distributed execution Partition and distribute documents to mappers Store and group intermediate term counts Distribute intermediate data to reducers Collect and return final results Deal with failures and exceptions Astrid Rheinländer: Big Data Analytics with MapReduce 15

16 Overview MapReduce Programming model Architecture and execution Distributed File System HDFS Error handling & performance issues MapReduce vs. parallel Databases Extensions of MapReduce Astrid Rheinländer: Big Data Analytics with MapReduce 16

17 Cluster architecture Master/Worker architecture client Workers are commodity hardware Masters are usually well equipped Shared nothing Distributed processing Job Tracker master TaskTracker TaskTracker TaskTracker worker Examples: Google cluster with several thousand nodes 2008: 2 x86 CPUs, 4-8GB RAM, Linux, IDE HDDs Facebook 2010: 2000 nodes (8/16 cores), 32 GB RAM, 21PB storage in HDFS Astrid Rheinländer: Big Data Analytics with MapReduce 17

18 User Specification of MapReduce job Start execution engine Submit job to execution engine Execution engine Fork worker nodes Partition input MapReduce - Execution fork Execution engine master docs do cs chunk 1 chunk n worker worker Astrid Rheinländer: Big Data Analytics with MapReduce 18

19 Master Manages entire execution Assigns tasks and input splits to idle workers Tracks status of current job and all tasks (waiting, running, finished) Tracks status of worker nodes MapReduce - Execution Execution engine master chunk 1 chunk n worker worker Astrid Rheinländer: Big Data Analytics with MapReduce 19

20 Map-Worker Reads assigned splits Parses key-value pairs Executes map function for each pair MapReduce - Execution Buffers intermediate data in memory Execution engine master chunk 1 chunk n worker worker Astrid Rheinländer: Big Data Analytics with MapReduce 20

21 Map-Worker Buffered intermediate data are periodically written to local disks using some partition function Notify master about locations of intermediate data when map() is finished MapReduce - Execution Master pushes locations incrementally to reduce-workers Execution engine master chunk 1 chunk n Local write worker worker Astrid Rheinländer: Big Data Analytics with MapReduce 21

22 Reduce-Worker Read assigned splits from map workers disks (RPC) MapReduce - Execution Usually processes more than one key Sort data by intermediate key Execution engine master chunk 1 chunk n Remote read worker worker Astrid Rheinländer: Big Data Analytics with MapReduce 22

23 Reduce-Worker Executes reduce() function once for each intermediate key reduce() usually iterates over entire list of values per key aggregation Result is written to distributed FS Usually one file per reduce worker MapReduce - Execution Execution engine master chunk 1 chunk n result 1 worker worker result m Astrid Rheinländer: Big Data Analytics with MapReduce 23

24 All Map/Reduce jobs are finished Master notifies user program User gets access to result MapReduce - Execution Execution engine master master chunk 1 chunk n result 1 worker worker result m Astrid Rheinländer: Big Data Analytics with MapReduce 24

25 Overview MapReduce Programming model Architecture and execution Distributed File System HDFS Error handling & performance issues MapReduce vs. parallel Databases Extensions of MapReduce Astrid Rheinländer: Big Data Analytics with MapReduce 25

26 HDFS architecture HDFS: Hadoop distributed file system Why a special file system? Differend priorities/workload Goal: store very large data redundantly on commodity hardware manage much larger data compared to standard fs Assumptions: Nodes fails all the time Mostly read/append file operations, few rewrites streaming access pattern Network latency should not interrupt computations Relatively small number of large files Astrid Rheinländer: Big Data Analytics with MapReduce 26

27 HDFS architecture Master/Worker architecture client Master: Name node Distributed processing Distributed storage Access to DFS Replication Job Tracker Name node 2 nd Name node master Metadata TaskTracker Data node TaskTracker Data node TaskTracker Data node worker Workers: Data nodes Local storage in cluster I/O and maintenance operations Client talks to data nodes and name node Data does not pass name node Astrid Rheinländer: Big Data Analytics with MapReduce 27

28 HDFS Use blocks to store (parts of) files Default: 64MB Recommended: 128 MB Unix: 4KB Advantages: Fixed size Easy to calculate how many blocks fit on disk Files can be larger than any single disk in cluster Well suited for replication File Block1 Block2 Block3 Astrid Rheinländer: Big Data Analytics with MapReduce 28

29 Block1 HDFS Use blocks to store (parts of) files Default: 64MB Recommended: 128 MB Unix: 4KB Advantages: Fixed size Easy to calculate how many blocks fit on disk Files can be larger than any single disk in cluster Well suited for replication File Block1 Block2 Block3 Runs on top of OS file system OS fs blocks under the hood one HDFS block consists of multiple OS fs blocks HDFS OS fs block blocks Astrid Rheinländer: Big Data Analytics with MapReduce 29

30 Pros: Architecture - Summary Low costs Extensible Nodes are easily exchangeable But: Nodes fail regularly for various reasons Disk failures Broken controllers Cable breaks Network partitioning Network bandwith is a potential bottleneck Performance optimization and error handling needed Astrid Rheinländer: Big Data Analytics with MapReduce 30

31 Overview MapReduce Programming model Architecture and execution Distributed File System HDFS Error handling & performance issues MapReduce vs. parallel Databases Current trends Astrid Rheinländer: Big Data Analytics with MapReduce 31

32 Types of errors HDFS data node fails Worker node fails Master / Name node fails (Performance bottlenecks) Astrid Rheinländer: Big Data Analytics with MapReduce 32

33 Goal: Reliability by replication HDFS Replication blocks are replicated on 3 data nodes Rack 2 Rack 1 Placement 1 copy on local node 1 copy on remote node 1 copy on different node in same remote rack Additional copies randomly placed replication write Client Client reads from closest copy Integrity via checksums Astrid Rheinländer: Big Data Analytics with MapReduce 33

34 HDFS Replication Replication engine on name node Name node detects data node failures heartbeat messages Data node failure: move tasks to intact replicas Moving computation is cheaper than moving data Astrid Rheinländer: Big Data Analytics with MapReduce 34

35 Worker fails Master pings all workers regularly (heartbeat messages) Worker answer within timeout If worker does not answer, redistribute tasks to other workers Map/Reduce tasks with status assigned or running get status waiting for execution Finished Map tasks need to be executed again Finished Reduce tasks do not need to be re-run Astrid Rheinländer: Big Data Analytics with MapReduce 35

36 Master or Name node fails single points of failure Execution is aborted and needs to be restarted Solutions: Shadow master / secondary name node Checkpoints master master worker worker Astrid Rheinländer: Big Data Analytics with MapReduce 36

37 Master or Name node fails single points of failure Execution is aborted and needs to be restarted Solutions: Shadow master / secondary name node Checkpoints master master Resume from checkpoints worker Resume from checkpoints worker Astrid Rheinländer: Big Data Analytics with MapReduce 37

38 Performance Bottlenecks Stragglers Tasks which run much longer than others Backup-Tasks Astrid Rheinländer: Big Data Analytics with MapReduce 38

39 Backup tasks Problem: stragglers significantly slow down whole MapReduce job Reasons: Other jobs might consume resources on a machine Bad disks with correctable errors transfer data very slowly Not enough RAM machine starts swapping Solution: When MapReduce job is close to finishing Spawn copies of remaining in-progress tasks Keep results from task that finishes first Takes a few percent resource overhead Astrid Rheinländer: Big Data Analytics with MapReduce 39

40 Performance Bottlenecks Stragglers Network traffic Locality Astrid Rheinländer: Big Data Analytics with MapReduce 40

41 Locality Goal: conserve bandwidth Schedule map tasks to locations of processed chunks physically on same machine many machines can read data with local disk speed Worker 1 Map task A Worker 3 Chunk A Network switch Worker 2 Map task B Worker 4 Chunk B Master task A task B task C Map task X Chunk C Network switch Map Chunk K task D task I Worker 5 Map Chunk D Worker 6 Map Chunk F Astrid Rheinländer: Big Data Analytics with MapReduce 41

42 Locality Goal: conserve bandwidth Schedule map tasks to locations of processed chunks If same machine not possible: same rack / switch Worker 1 Map task A Chunk A Network switch Worker 2 Map task B Chunk B Worker 3 Worker 4 Master task A task B task C Map task X Chunk C Network switch Map task C Chunk K task D task I Worker 5 Map Chunk D Worker 6 Map Chunk F Astrid Rheinländer: Big Data Analytics with MapReduce 42

43 Overview MapReduce Programming model Architecture and execution Distributed File System HDFS Error handling & performance issues MapReduce vs. parallel Databases Current trends Astrid Rheinländer: Big Data Analytics with MapReduce 43

44 MapReduce vs. parallel DBMS DeWitt & Stonebraker: A major step backwards Astrid Rheinländer: Big Data Analytics with MapReduce 44

45 MapReduce vs. parallel DBMS DeWitt & Stonebraker: A major step backwards Criticism: Teradata has done this 20 years ago Not really new Functionality can be reached by UDFs in parallel DBMS Parallel DBMS provide good scalability Interface too low-level Write intermediate data to disk instead of pipelining/streaming Lack of schemata obstructs performance optimizations Astrid Rheinländer: Big Data Analytics with MapReduce 45

46 MapReduce!= parallel DBMS Focus is different! Parallel DBMS: Query very large data sets Transactions User management High reliability Very expensive MapReduce: ETL tasks Complex analytics Pragmatism instead of perfect reliability Much cheaper Open source systems Astrid Rheinländer: Big Data Analytics with MapReduce 46

47 Extensions of MapReduce Data flow languages on top of MapReduce Jaql, Hive, Pig, Record-based data model Optimizers Extensions of programming model Example: Stratosphere Astrid Rheinländer: Big Data Analytics with MapReduce 47

48 Stratosphere research unit Jointly carried out by TU, HU, and HPI Stratosphere system web-scale distributed data analytics system Database-inspired approach Analytical workload Astrid Rheinländer: Big Data Analytics with MapReduce 48

49 Stratosphere research unit Jointly carried out by TU, HU, and HPI Stratosphere system web-scale distributed data analytics system Database-inspired approach Analytical workload Astrid Rheinländer: Big Data Analytics with MapReduce 49

50 Stratosphere programming model PArallelization ConTracts (PACTs) Generalization of MapReduce Astrid Rheinländer: Big Data Analytics with MapReduce 50

51 Stratosphere programming model Parallelization Contracts Generalization of MapReduce Two inputs Matches tuples that share the same key Equi-join on key Each pair handed to UDF Astrid Rheinländer: Big Data Analytics with MapReduce 51

52 Stratosphere programming model Parallelization Contracts Generalization of MapReduce Two inputs Cartesian product of both data sets Each pair handed to UDF Astrid Rheinländer: Big Data Analytics with MapReduce 52

53 Stratosphere programming model Parallelization Contracts Generalization of MapReduce Reduce on two inputs Each key group is handed to UDF Astrid Rheinländer: Big Data Analytics with MapReduce 53

54 Thank you for your attention! Astrid Rheinländer: Big Data Analytics with MapReduce 54

Introduction to MapReduce and Hadoop

Introduction to MapReduce and Hadoop Introduction to MapReduce and Hadoop Jie Tao Karlsruhe Institute of Technology jie.tao@kit.edu Die Kooperation von Why Map/Reduce? Massive data Can not be stored on a single machine Takes too long to process

More information

Istanbul Şehir University Big Data Camp 14. Hadoop Map Reduce. Aslan Bakirov Kevser Nur Çoğalmış

Istanbul Şehir University Big Data Camp 14. Hadoop Map Reduce. Aslan Bakirov Kevser Nur Çoğalmış Istanbul Şehir University Big Data Camp 14 Hadoop Map Reduce Aslan Bakirov Kevser Nur Çoğalmış Agenda Map Reduce Concepts System Overview Hadoop MR Hadoop MR Internal Job Execution Workflow Map Side Details

More information

Data Science in the Wild

Data Science in the Wild Data Science in the Wild Lecture 3 Some slides are taken from J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 1 Data Science and Big Data Big Data: the data cannot

More information

Hadoop at Yahoo! Owen O Malley Yahoo!, Grid Team owen@yahoo-inc.com

Hadoop at Yahoo! Owen O Malley Yahoo!, Grid Team owen@yahoo-inc.com Hadoop at Yahoo! Owen O Malley Yahoo!, Grid Team owen@yahoo-inc.com Who Am I? Yahoo! Architect on Hadoop Map/Reduce Design, review, and implement features in Hadoop Working on Hadoop full time since Feb

More information

and HDFS for Big Data Applications Serge Blazhievsky Nice Systems

and HDFS for Big Data Applications Serge Blazhievsky Nice Systems Introduction PRESENTATION to Hadoop, TITLE GOES MapReduce HERE and HDFS for Big Data Applications Serge Blazhievsky Nice Systems SNIA Legal Notice The material contained in this tutorial is copyrighted

More information

HPCHadoop: MapReduce on Cray X-series

HPCHadoop: MapReduce on Cray X-series HPCHadoop: MapReduce on Cray X-series Scott Michael Research Analytics Indiana University Cray User Group Meeting May 7, 2014 1 Outline Motivation & Design of HPCHadoop HPCHadoop demo Benchmarking Methodology

More information

Hadoop WordCount Explained! IT332 Distributed Systems

Hadoop WordCount Explained! IT332 Distributed Systems Hadoop WordCount Explained! IT332 Distributed Systems Typical problem solved by MapReduce Read a lot of data Map: extract something you care about from each record Shuffle and Sort Reduce: aggregate, summarize,

More information

CS54100: Database Systems

CS54100: Database Systems CS54100: Database Systems Cloud Databases: The Next Post- Relational World 18 April 2012 Prof. Chris Clifton Beyond RDBMS The Relational Model is too limiting! Simple data model doesn t capture semantics

More information

Map Reduce / Hadoop / HDFS

Map Reduce / Hadoop / HDFS Chapter 3: Map Reduce / Hadoop / HDFS 97 Overview Outline Distributed File Systems (re-visited) Motivation Programming Model Example Applications Big Data in Apache Hadoop HDFS in Hadoop YARN 98 Overview

More information

Parallel Programming Map-Reduce. Needless to Say, We Need Machine Learning for Big Data

Parallel Programming Map-Reduce. Needless to Say, We Need Machine Learning for Big Data Case Study 2: Document Retrieval Parallel Programming Map-Reduce Machine Learning/Statistics for Big Data CSE599C1/STAT592, University of Washington Carlos Guestrin January 31 st, 2013 Carlos Guestrin

More information

Hadoop Architecture. Part 1

Hadoop Architecture. Part 1 Hadoop Architecture Part 1 Node, Rack and Cluster: A node is simply a computer, typically non-enterprise, commodity hardware for nodes that contain data. Consider we have Node 1.Then we can add more nodes,

More information

Hadoop/MapReduce. Object-oriented framework presentation CSCI 5448 Casey McTaggart

Hadoop/MapReduce. Object-oriented framework presentation CSCI 5448 Casey McTaggart Hadoop/MapReduce Object-oriented framework presentation CSCI 5448 Casey McTaggart What is Apache Hadoop? Large scale, open source software framework Yahoo! has been the largest contributor to date Dedicated

More information

Lecture 5: GFS & HDFS! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl

Lecture 5: GFS & HDFS! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl Big Data Processing, 2014/15 Lecture 5: GFS & HDFS!! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl 1 Course content Introduction Data streams 1 & 2 The MapReduce paradigm Looking behind

More information

Big Data Management and NoSQL Databases

Big Data Management and NoSQL Databases NDBI040 Big Data Management and NoSQL Databases Lecture 3. Apache Hadoop Doc. RNDr. Irena Holubova, Ph.D. holubova@ksi.mff.cuni.cz http://www.ksi.mff.cuni.cz/~holubova/ndbi040/ Apache Hadoop Open-source

More information

CSE-E5430 Scalable Cloud Computing Lecture 2

CSE-E5430 Scalable Cloud Computing Lecture 2 CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 14.9-2015 1/36 Google MapReduce A scalable batch processing

More information

Introduc)on to the MapReduce Paradigm and Apache Hadoop. Sriram Krishnan sriram@sdsc.edu

Introduc)on to the MapReduce Paradigm and Apache Hadoop. Sriram Krishnan sriram@sdsc.edu Introduc)on to the MapReduce Paradigm and Apache Hadoop Sriram Krishnan sriram@sdsc.edu Programming Model The computa)on takes a set of input key/ value pairs, and Produces a set of output key/value pairs.

More information

Parallel Processing of cluster by Map Reduce

Parallel Processing of cluster by Map Reduce Parallel Processing of cluster by Map Reduce Abstract Madhavi Vaidya, Department of Computer Science Vivekanand College, Chembur, Mumbai vamadhavi04@yahoo.co.in MapReduce is a parallel programming model

More information

Hadoop: Understanding the Big Data Processing Method

Hadoop: Understanding the Big Data Processing Method Hadoop: Understanding the Big Data Processing Method Deepak Chandra Upreti 1, Pawan Sharma 2, Dr. Yaduvir Singh 3 1 PG Student, Department of Computer Science & Engineering, Ideal Institute of Technology

More information

Big Data and Apache Hadoop s MapReduce

Big Data and Apache Hadoop s MapReduce Big Data and Apache Hadoop s MapReduce Michael Hahsler Computer Science and Engineering Southern Methodist University January 23, 2012 Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 1 / 23

More information

Jeffrey D. Ullman slides. MapReduce for data intensive computing

Jeffrey D. Ullman slides. MapReduce for data intensive computing Jeffrey D. Ullman slides MapReduce for data intensive computing Single-node architecture CPU Machine Learning, Statistics Memory Classical Data Mining Disk Commodity Clusters Web data sets can be very

More information

Data-Intensive Computing with Map-Reduce and Hadoop

Data-Intensive Computing with Map-Reduce and Hadoop Data-Intensive Computing with Map-Reduce and Hadoop Shamil Humbetov Department of Computer Engineering Qafqaz University Baku, Azerbaijan humbetov@gmail.com Abstract Every day, we create 2.5 quintillion

More information

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Created by Doug Cutting and Mike Carafella in 2005. Cutting named the program after

More information

MapReduce (in the cloud)

MapReduce (in the cloud) MapReduce (in the cloud) How to painlessly process terabytes of data by Irina Gordei MapReduce Presentation Outline What is MapReduce? Example How it works MapReduce in the cloud Conclusion Demo Motivation:

More information

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 1 Hadoop: A Framework for Data- Intensive Distributed Computing CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 2 What is Hadoop? Hadoop is a software framework for distributed processing of large datasets

More information

Hadoop and ecosystem * 本 文 中 的 言 论 仅 代 表 作 者 个 人 观 点 * 本 文 中 的 一 些 图 例 来 自 于 互 联 网. Information Management. Information Management IBM CDL Lab

Hadoop and ecosystem * 本 文 中 的 言 论 仅 代 表 作 者 个 人 观 点 * 本 文 中 的 一 些 图 例 来 自 于 互 联 网. Information Management. Information Management IBM CDL Lab IBM CDL Lab Hadoop and ecosystem * 本 文 中 的 言 论 仅 代 表 作 者 个 人 观 点 * 本 文 中 的 一 些 图 例 来 自 于 互 联 网 Information Management 2012 IBM Corporation Agenda Hadoop 技 术 Hadoop 概 述 Hadoop 1.x Hadoop 2.x Hadoop 生 态

More information

Cloud Computing at Google. Architecture

Cloud Computing at Google. Architecture Cloud Computing at Google Google File System Web Systems and Algorithms Google Chris Brooks Department of Computer Science University of San Francisco Google has developed a layered system to handle webscale

More information

Hadoop and Eclipse. Eclipse Hawaii User s Group May 26th, 2009. Seth Ladd http://sethladd.com

Hadoop and Eclipse. Eclipse Hawaii User s Group May 26th, 2009. Seth Ladd http://sethladd.com Hadoop and Eclipse Eclipse Hawaii User s Group May 26th, 2009 Seth Ladd http://sethladd.com Goal YOU can use the same technologies as The Big Boys Google Yahoo (2000 nodes) Last.FM AOL Facebook (2.5 petabytes

More information

Chapter 7. Using Hadoop Cluster and MapReduce

Chapter 7. Using Hadoop Cluster and MapReduce Chapter 7 Using Hadoop Cluster and MapReduce Modeling and Prototyping of RMS for QoS Oriented Grid Page 152 7. Using Hadoop Cluster and MapReduce for Big Data Problems The size of the databases used in

More information

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture DATA MINING WITH HADOOP AND HIVE Introduction to Architecture Dr. Wlodek Zadrozny (Most slides come from Prof. Akella s class in 2014) 2015-2025. Reproduction or usage prohibited without permission of

More information

Hadoop and Map-Reduce. Swati Gore

Hadoop and Map-Reduce. Swati Gore Hadoop and Map-Reduce Swati Gore Contents Why Hadoop? Hadoop Overview Hadoop Architecture Working Description Fault Tolerance Limitations Why Map-Reduce not MPI Distributed sort Why Hadoop? Existing Data

More information

Take An Internal Look at Hadoop. Hairong Kuang Grid Team, Yahoo! Inc hairong@yahoo-inc.com

Take An Internal Look at Hadoop. Hairong Kuang Grid Team, Yahoo! Inc hairong@yahoo-inc.com Take An Internal Look at Hadoop Hairong Kuang Grid Team, Yahoo! Inc hairong@yahoo-inc.com What s Hadoop Framework for running applications on large clusters of commodity hardware Scale: petabytes of data

More information

Big Data With Hadoop

Big Data With Hadoop With Saurabh Singh singh.903@osu.edu The Ohio State University February 11, 2016 Overview 1 2 3 Requirements Ecosystem Resilient Distributed Datasets (RDDs) Example Code vs Mapreduce 4 5 Source: [Tutorials

More information

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS By HAI JIN, SHADI IBRAHIM, LI QI, HAIJUN CAO, SONG WU and XUANHUA SHI Prepared by: Dr. Faramarz Safi Islamic Azad

More information

Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel

Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel Parallel Databases Increase performance by performing operations in parallel Parallel Architectures Shared memory Shared disk Shared nothing closely coupled loosely coupled Parallelism Terminology Speedup:

More information

Hadoop IST 734 SS CHUNG

Hadoop IST 734 SS CHUNG Hadoop IST 734 SS CHUNG Introduction What is Big Data?? Bulk Amount Unstructured Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per day) If a regular machine need to

More information

Hadoop. Scalable Distributed Computing. Claire Jaja, Julian Chan October 8, 2013

Hadoop. Scalable Distributed Computing. Claire Jaja, Julian Chan October 8, 2013 Hadoop Scalable Distributed Computing Claire Jaja, Julian Chan October 8, 2013 What is Hadoop? A general-purpose storage and data-analysis platform Open source Apache software, implemented in Java Enables

More information

Hadoop implementation of MapReduce computational model. Ján Vaňo

Hadoop implementation of MapReduce computational model. Ján Vaňo Hadoop implementation of MapReduce computational model Ján Vaňo What is MapReduce? A computational model published in a paper by Google in 2004 Based on distributed computation Complements Google s distributed

More information

Intro to Map/Reduce a.k.a. Hadoop

Intro to Map/Reduce a.k.a. Hadoop Intro to Map/Reduce a.k.a. Hadoop Based on: Mining of Massive Datasets by Ra jaraman and Ullman, Cambridge University Press, 2011 Data Mining for the masses by North, Global Text Project, 2012 Slides by

More information

map/reduce connected components

map/reduce connected components 1, map/reduce connected components find connected components with analogous algorithm: map edges randomly to partitions (k subgraphs of n nodes) for each partition remove edges, so that only tree remains

More information

Weekly Report. Hadoop Introduction. submitted By Anurag Sharma. Department of Computer Science and Engineering. Indian Institute of Technology Bombay

Weekly Report. Hadoop Introduction. submitted By Anurag Sharma. Department of Computer Science and Engineering. Indian Institute of Technology Bombay Weekly Report Hadoop Introduction submitted By Anurag Sharma Department of Computer Science and Engineering Indian Institute of Technology Bombay Chapter 1 What is Hadoop? Apache Hadoop (High-availability

More information

Internals of Hadoop Application Framework and Distributed File System

Internals of Hadoop Application Framework and Distributed File System International Journal of Scientific and Research Publications, Volume 5, Issue 7, July 2015 1 Internals of Hadoop Application Framework and Distributed File System Saminath.V, Sangeetha.M.S Abstract- Hadoop

More information

CS2510 Computer Operating Systems

CS2510 Computer Operating Systems CS2510 Computer Operating Systems HADOOP Distributed File System Dr. Taieb Znati Computer Science Department University of Pittsburgh Outline HDF Design Issues HDFS Application Profile Block Abstraction

More information

CS2510 Computer Operating Systems

CS2510 Computer Operating Systems CS2510 Computer Operating Systems HADOOP Distributed File System Dr. Taieb Znati Computer Science Department University of Pittsburgh Outline HDF Design Issues HDFS Application Profile Block Abstraction

More information

Apache Hadoop. Alexandru Costan

Apache Hadoop. Alexandru Costan 1 Apache Hadoop Alexandru Costan Big Data Landscape No one-size-fits-all solution: SQL, NoSQL, MapReduce, No standard, except Hadoop 2 Outline What is Hadoop? Who uses it? Architecture HDFS MapReduce Open

More information

Petabyte-scale Data with Apache HDFS. Matt Foley Hortonworks, Inc. mfoley@hortonworks.com

Petabyte-scale Data with Apache HDFS. Matt Foley Hortonworks, Inc. mfoley@hortonworks.com Petabyte-scale Data with Apache HDFS Matt Foley Hortonworks, Inc. mfoley@hortonworks.com Matt Foley - Background MTS at Hortonworks Inc. HDFS contributor, part of original ~25 in Yahoo! spin-out of Hortonworks

More information

Open source Google-style large scale data analysis with Hadoop

Open source Google-style large scale data analysis with Hadoop Open source Google-style large scale data analysis with Hadoop Ioannis Konstantinou Email: ikons@cslab.ece.ntua.gr Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory School of Electrical

More information

Google File System. Web and scalability

Google File System. Web and scalability Google File System Web and scalability The web: - How big is the Web right now? No one knows. - Number of pages that are crawled: o 100,000 pages in 1994 o 8 million pages in 2005 - Crawlable pages might

More information

How To Write A Mapreduce Program In Java.Io 4.4.4 (Orchestra)

How To Write A Mapreduce Program In Java.Io 4.4.4 (Orchestra) MapReduce framework - Operates exclusively on pairs, - that is, the framework views the input to the job as a set of pairs and produces a set of pairs as the output

More information

Large scale processing using Hadoop. Ján Vaňo

Large scale processing using Hadoop. Ján Vaňo Large scale processing using Hadoop Ján Vaňo What is Hadoop? Software platform that lets one easily write and run applications that process vast amounts of data Includes: MapReduce offline computing engine

More information

Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components

Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components of Hadoop. We will see what types of nodes can exist in a Hadoop

More information

Data Science Analytics & Research Centre

Data Science Analytics & Research Centre Data Science Analytics & Research Centre Data Science Analytics & Research Centre 1 Big Data Big Data Overview Characteristics Applications & Use Case HDFS Hadoop Distributed File System (HDFS) Overview

More information

Big Data Management. Big Data Management. (BDM) Autumn 2013. Povl Koch November 11, 2013 10-11-2013 1

Big Data Management. Big Data Management. (BDM) Autumn 2013. Povl Koch November 11, 2013 10-11-2013 1 Big Data Management Big Data Management (BDM) Autumn 2013 Povl Koch November 11, 2013 10-11-2013 1 Overview Today s program 1. Little more practical details about this course 2. Recap from last time (Google

More information

Distributed File Systems

Distributed File Systems Distributed File Systems Paul Krzyzanowski Rutgers University October 28, 2012 1 Introduction The classic network file systems we examined, NFS, CIFS, AFS, Coda, were designed as client-server applications.

More information

Overview. Big Data in Apache Hadoop. - HDFS - MapReduce in Hadoop - YARN. https://hadoop.apache.org. Big Data Management and Analytics

Overview. Big Data in Apache Hadoop. - HDFS - MapReduce in Hadoop - YARN. https://hadoop.apache.org. Big Data Management and Analytics Overview Big Data in Apache Hadoop - HDFS - MapReduce in Hadoop - YARN https://hadoop.apache.org 138 Apache Hadoop - Historical Background - 2003: Google publishes its cluster architecture & DFS (GFS)

More information

Distributed File Systems

Distributed File Systems Distributed File Systems Mauro Fruet University of Trento - Italy 2011/12/19 Mauro Fruet (UniTN) Distributed File Systems 2011/12/19 1 / 39 Outline 1 Distributed File Systems 2 The Google File System (GFS)

More information

Hadoop Cluster Applications

Hadoop Cluster Applications Hadoop Overview Data analytics has become a key element of the business decision process over the last decade. Classic reporting on a dataset stored in a database was sufficient until recently, but yesterday

More information

Design and Evolution of the Apache Hadoop File System(HDFS)

Design and Evolution of the Apache Hadoop File System(HDFS) Design and Evolution of the Apache Hadoop File System(HDFS) Dhruba Borthakur Engineer@Facebook Committer@Apache HDFS SDC, Sept 19 2011 Outline Introduction Yet another file-system, why? Goals of Hadoop

More information

Big Data Technology Map-Reduce Motivation: Indexing in Search Engines

Big Data Technology Map-Reduce Motivation: Indexing in Search Engines Big Data Technology Map-Reduce Motivation: Indexing in Search Engines Edward Bortnikov & Ronny Lempel Yahoo Labs, Haifa Indexing in Search Engines Information Retrieval s two main stages: Indexing process

More information

Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA

Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA http://kzhang6.people.uic.edu/tutorial/amcis2014.html August 7, 2014 Schedule I. Introduction to big data

More information

Architectures for Big Data Analytics A database perspective

Architectures for Big Data Analytics A database perspective Architectures for Big Data Analytics A database perspective Fernando Velez Director of Product Management Enterprise Information Management, SAP June 2013 Outline Big Data Analytics Requirements Spectrum

More information

Big Data Processing with Google s MapReduce. Alexandru Costan

Big Data Processing with Google s MapReduce. Alexandru Costan 1 Big Data Processing with Google s MapReduce Alexandru Costan Outline Motivation MapReduce programming model Examples MapReduce system architecture Limitations Extensions 2 Motivation Big Data @Google:

More information

A very short Intro to Hadoop

A very short Intro to Hadoop 4 Overview A very short Intro to Hadoop photo by: exfordy, flickr 5 How to Crunch a Petabyte? Lots of disks, spinning all the time Redundancy, since disks die Lots of CPU cores, working all the time Retry,

More information

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org dhruba@facebook.com

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org dhruba@facebook.com Hadoop Distributed File System Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org dhruba@facebook.com Hadoop, Why? Need to process huge datasets on large clusters of computers

More information

THE HADOOP DISTRIBUTED FILE SYSTEM

THE HADOOP DISTRIBUTED FILE SYSTEM THE HADOOP DISTRIBUTED FILE SYSTEM Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler Presented by Alexander Pokluda October 7, 2013 Outline Motivation and Overview of Hadoop Architecture,

More information

Hadoop Framework. technology basics for data scientists. Spring - 2014. Jordi Torres, UPC - BSC www.jorditorres.eu @JordiTorresBCN

Hadoop Framework. technology basics for data scientists. Spring - 2014. Jordi Torres, UPC - BSC www.jorditorres.eu @JordiTorresBCN Hadoop Framework technology basics for data scientists Spring - 2014 Jordi Torres, UPC - BSC www.jorditorres.eu @JordiTorresBCN Warning! Slides are only for presenta8on guide We will discuss+debate addi8onal

More information

Introduction to Hadoop

Introduction to Hadoop Introduction to Hadoop Miles Osborne School of Informatics University of Edinburgh miles@inf.ed.ac.uk October 28, 2010 Miles Osborne Introduction to Hadoop 1 Background Hadoop Programming Model Examples

More information

Apache Hadoop FileSystem and its Usage in Facebook

Apache Hadoop FileSystem and its Usage in Facebook Apache Hadoop FileSystem and its Usage in Facebook Dhruba Borthakur Project Lead, Apache Hadoop Distributed File System dhruba@apache.org Presented at Indian Institute of Technology November, 2010 http://www.facebook.com/hadoopfs

More information

Biomap Jobs and the Big Picture

Biomap Jobs and the Big Picture Lots of Data, Little Money. A Last.fm perspective Martin Dittus, martind@last.fm London Geek Nights, 2009-04-23 Big Data Little Money You have lots of data You want to process it For your product (Last.fm:

More information

A Comparison of Approaches to Large-Scale Data Analysis

A Comparison of Approaches to Large-Scale Data Analysis A Comparison of Approaches to Large-Scale Data Analysis Sam Madden MIT CSAIL with Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel Abadi, David DeWitt, and Michael Stonebraker In SIGMOD 2009 MapReduce

More information

Big Data and Hadoop. Sreedhar C, Dr. D. Kavitha, K. Asha Rani

Big Data and Hadoop. Sreedhar C, Dr. D. Kavitha, K. Asha Rani Big Data and Hadoop Sreedhar C, Dr. D. Kavitha, K. Asha Rani Abstract Big data has become a buzzword in the recent years. Big data is used to describe a massive volume of both structured and unstructured

More information

BigData. An Overview of Several Approaches. David Mera 16/12/2013. Masaryk University Brno, Czech Republic

BigData. An Overview of Several Approaches. David Mera 16/12/2013. Masaryk University Brno, Czech Republic BigData An Overview of Several Approaches David Mera Masaryk University Brno, Czech Republic 16/12/2013 Table of Contents 1 Introduction 2 Terminology 3 Approaches focused on batch data processing MapReduce-Hadoop

More information

Word count example Abdalrahman Alsaedi

Word count example Abdalrahman Alsaedi Word count example Abdalrahman Alsaedi To run word count in AWS you have two different ways; either use the already exist WordCount program, or to write your own file. First: Using AWS word count program

More information

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give

More information

Getting to know Apache Hadoop

Getting to know Apache Hadoop Getting to know Apache Hadoop Oana Denisa Balalau Télécom ParisTech October 13, 2015 1 / 32 Table of Contents 1 Apache Hadoop 2 The Hadoop Distributed File System(HDFS) 3 Application management in the

More information

HADOOP PERFORMANCE TUNING

HADOOP PERFORMANCE TUNING PERFORMANCE TUNING Abstract This paper explains tuning of Hadoop configuration parameters which directly affects Map-Reduce job performance under various conditions, to achieve maximum performance. The

More information

Hadoop & its Usage at Facebook

Hadoop & its Usage at Facebook Hadoop & its Usage at Facebook Dhruba Borthakur Project Lead, Hadoop Distributed File System dhruba@apache.org Presented at the The Israeli Association of Grid Technologies July 15, 2009 Outline Architecture

More information

Survey on Scheduling Algorithm in MapReduce Framework

Survey on Scheduling Algorithm in MapReduce Framework Survey on Scheduling Algorithm in MapReduce Framework Pravin P. Nimbalkar 1, Devendra P.Gadekar 2 1,2 Department of Computer Engineering, JSPM s Imperial College of Engineering and Research, Pune, India

More information

From GWS to MapReduce: Google s Cloud Technology in the Early Days

From GWS to MapReduce: Google s Cloud Technology in the Early Days Large-Scale Distributed Systems From GWS to MapReduce: Google s Cloud Technology in the Early Days Part II: MapReduce in a Datacenter COMP6511A Spring 2014 HKUST Lin Gu lingu@ieee.org MapReduce/Hadoop

More information

Networking in the Hadoop Cluster

Networking in the Hadoop Cluster Hadoop and other distributed systems are increasingly the solution of choice for next generation data volumes. A high capacity, any to any, easily manageable networking layer is critical for peak Hadoop

More information

Big Data Analytics* Outline. Issues. Big Data

Big Data Analytics* Outline. Issues. Big Data Outline Big Data Analytics* Big Data Data Analytics: Challenges and Issues Misconceptions Big Data Infrastructure Scalable Distributed Computing: Hadoop Programming in Hadoop: MapReduce Paradigm Example

More information

Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms

Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms Distributed File System 1 How do we get data to the workers? NAS Compute Nodes SAN 2 Distributed File System Don t move data to workers move workers to the data! Store data on the local disks of nodes

More information

Open source large scale distributed data management with Google s MapReduce and Bigtable

Open source large scale distributed data management with Google s MapReduce and Bigtable Open source large scale distributed data management with Google s MapReduce and Bigtable Ioannis Konstantinou Email: ikons@cslab.ece.ntua.gr Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory

More information

Prepared By : Manoj Kumar Joshi & Vikas Sawhney

Prepared By : Manoj Kumar Joshi & Vikas Sawhney Prepared By : Manoj Kumar Joshi & Vikas Sawhney General Agenda Introduction to Hadoop Architecture Acknowledgement Thanks to all the authors who left their selfexplanatory images on the internet. Thanks

More information

Lecture Data Warehouse Systems

Lecture Data Warehouse Systems Lecture Data Warehouse Systems Eva Zangerle SS 2013 PART C: Novel Approaches in DW NoSQL and MapReduce Stonebraker on Data Warehouses Star and snowflake schemas are a good idea in the DW world C-Stores

More information

A Brief Outline on Bigdata Hadoop

A Brief Outline on Bigdata Hadoop A Brief Outline on Bigdata Hadoop Twinkle Gupta 1, Shruti Dixit 2 RGPV, Department of Computer Science and Engineering, Acropolis Institute of Technology and Research, Indore, India Abstract- Bigdata is

More information

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE AGENDA Introduction to Big Data Introduction to Hadoop HDFS file system Map/Reduce framework Hadoop utilities Summary BIG DATA FACTS In what timeframe

More information

Hadoop MapReduce Tutorial - Reduce Comp variability in Data Stamps

Hadoop MapReduce Tutorial - Reduce Comp variability in Data Stamps Distributed Recommenders Fall 2010 Distributed Recommenders Distributed Approaches are needed when: Dataset does not fit into memory Need for processing exceeds what can be provided with a sequential algorithm

More information

Data Warehousing and Analytics Infrastructure at Facebook. Ashish Thusoo & Dhruba Borthakur athusoo,dhruba@facebook.com

Data Warehousing and Analytics Infrastructure at Facebook. Ashish Thusoo & Dhruba Borthakur athusoo,dhruba@facebook.com Data Warehousing and Analytics Infrastructure at Facebook Ashish Thusoo & Dhruba Borthakur athusoo,dhruba@facebook.com Overview Challenges in a Fast Growing & Dynamic Environment Data Flow Architecture,

More information

MAPREDUCE Programming Model

MAPREDUCE Programming Model CS 2510 COMPUTER OPERATING SYSTEMS Cloud Computing MAPREDUCE Dr. Taieb Znati Computer Science Department University of Pittsburgh MAPREDUCE Programming Model Scaling Data Intensive Application MapReduce

More information

Big Data Storage, Management and challenges. Ahmed Ali-Eldin

Big Data Storage, Management and challenges. Ahmed Ali-Eldin Big Data Storage, Management and challenges Ahmed Ali-Eldin (Ambitious) Plan What is Big Data? And Why talk about Big Data? How to store Big Data? BigTables (Google) Dynamo (Amazon) How to process Big

More information

Journal of science STUDY ON REPLICA MANAGEMENT AND HIGH AVAILABILITY IN HADOOP DISTRIBUTED FILE SYSTEM (HDFS)

Journal of science STUDY ON REPLICA MANAGEMENT AND HIGH AVAILABILITY IN HADOOP DISTRIBUTED FILE SYSTEM (HDFS) Journal of science e ISSN 2277-3290 Print ISSN 2277-3282 Information Technology www.journalofscience.net STUDY ON REPLICA MANAGEMENT AND HIGH AVAILABILITY IN HADOOP DISTRIBUTED FILE SYSTEM (HDFS) S. Chandra

More information

Introduc)on to Map- Reduce. Vincent Leroy

Introduc)on to Map- Reduce. Vincent Leroy Introduc)on to Map- Reduce Vincent Leroy Sources Apache Hadoop Yahoo! Developer Network Hortonworks Cloudera Prac)cal Problem Solving with Hadoop and Pig Slides will be available at hgp://lig- membres.imag.fr/leroyv/

More information

THE FLORIDA STATE UNIVERSITY COLLEGE OF ARTS AND SCIENCE COMPARING HADOOPDB: A HYBRID OF DBMS AND MAPREDUCE TECHNOLOGIES WITH THE DBMS POSTGRESQL

THE FLORIDA STATE UNIVERSITY COLLEGE OF ARTS AND SCIENCE COMPARING HADOOPDB: A HYBRID OF DBMS AND MAPREDUCE TECHNOLOGIES WITH THE DBMS POSTGRESQL THE FLORIDA STATE UNIVERSITY COLLEGE OF ARTS AND SCIENCE COMPARING HADOOPDB: A HYBRID OF DBMS AND MAPREDUCE TECHNOLOGIES WITH THE DBMS POSTGRESQL By VANESSA CEDENO A Dissertation submitted to the Department

More information

Big Data Technology Core Hadoop: HDFS-YARN Internals

Big Data Technology Core Hadoop: HDFS-YARN Internals Big Data Technology Core Hadoop: HDFS-YARN Internals Eshcar Hillel Yahoo! Ronny Lempel Outbrain *Based on slides by Edward Bortnikov & Ronny Lempel Roadmap Previous class Map-Reduce Motivation This class

More information

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY A PATH FOR HORIZING YOUR INNOVATIVE WORK A REVIEW ON HIGH PERFORMANCE DATA STORAGE ARCHITECTURE OF BIGDATA USING HDFS MS.

More information

Processing of massive data: MapReduce. 2. Hadoop. New Trends In Distributed Systems MSc Software and Systems

Processing of massive data: MapReduce. 2. Hadoop. New Trends In Distributed Systems MSc Software and Systems Processing of massive data: MapReduce 2. Hadoop 1 MapReduce Implementations Google were the first that applied MapReduce for big data analysis Their idea was introduced in their seminal paper MapReduce:

More information

Lecture 10 - Functional programming: Hadoop and MapReduce

Lecture 10 - Functional programming: Hadoop and MapReduce Lecture 10 - Functional programming: Hadoop and MapReduce Sohan Dharmaraja Sohan Dharmaraja Lecture 10 - Functional programming: Hadoop and MapReduce 1 / 41 For today Big Data and Text analytics Functional

More information

HDFS. Hadoop Distributed File System

HDFS. Hadoop Distributed File System HDFS Kevin Swingler Hadoop Distributed File System File system designed to store VERY large files Streaming data access Running across clusters of commodity hardware Resilient to node failure 1 Large files

More information

Hadoop Distributed File System. Dhruba Borthakur June, 2007

Hadoop Distributed File System. Dhruba Borthakur June, 2007 Hadoop Distributed File System Dhruba Borthakur June, 2007 Goals of HDFS Very Large Distributed File System 10K nodes, 100 million files, 10 PB Assumes Commodity Hardware Files are replicated to handle

More information

MapReduce. MapReduce and SQL Injections. CS 3200 Final Lecture. Introduction. MapReduce. Programming Model. Example

MapReduce. MapReduce and SQL Injections. CS 3200 Final Lecture. Introduction. MapReduce. Programming Model. Example MapReduce MapReduce and SQL Injections CS 3200 Final Lecture Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI'04: Sixth Symposium on Operating System Design

More information