Big Data Analytics with MapReduce VL Implementierung von Datenbanksystemen 05-Feb-13
|
|
- Dominick Burns
- 8 years ago
- Views:
Transcription
1 Big Data Analytics with MapReduce VL Implementierung von Datenbanksystemen 05-Feb-13 Astrid Rheinländer Wissensmanagement in der Bioinformatik
2 What is Big Data? collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications [ terabytes, petabytes in a few years exabytes (SKA telescope) Challenges Storage Analysis Search... Astrid Rheinländer: Big Data Analytics with MapReduce 2
3 Example Twitter 2012: 63 Billion Tweets, 8.5 TB data Business areas: The data sell access to content Advertisement [All images: Astrid Rheinländer: Big Data Analytics with MapReduce 3
4 Example Cern, LHC 2012: LHC experiments generated 22PB of data 99% have already been thrown away [ Analysis needs 100k modern processors to evaluate experiments Data is stored & processed in LHC Computing Grid 150 data & compute centers around the world Heterogeneous architecture [ Astrid Rheinländer: Big Data Analytics with MapReduce 4
5 Big Data Landscape Astrid Rheinländer: Big Data Analytics with MapReduce 5
6 Big Data Landscape Astrid Rheinländer: Big Data Analytics with MapReduce 6
7 Overview MapReduce Programming model Architecture and execution Distributed File System HDFS Error handling & performance issues MapReduce vs. parallel Databases Extensions of MapReduce Astrid Rheinländer: Big Data Analytics with MapReduce 7
8 Inspired by functional programming concepts map and reduce Operates on key-value pairs Programming Model Map Process key-value pairs individually with UDF Generates key/value pairs Example (LISP): (mapcar 1+ ( )) ( ) 1, x 1 map n, x n 1, y 1 t, y t Astrid Rheinländer: Big Data Analytics with MapReduce 8
9 Inspired by functional programming concepts map and reduce Operates on key-value pairs Programming Model Map Process key-value pairs individually with UDF Generates key/value pairs Example (LISP): (mapcar 1+ ( )) ( ) 1, x 1 map n, x n 1, y 1 t, y t Reduce Merges intermediate key-value pairs with same key Example (LISP): 1, a 1 reduce 1, result 1 1, z 1 (reduce + ( )) 10 Astrid Rheinländer: Big Data Analytics with MapReduce 9
10 Example Term Count Input data: Documents 1, to be, or not to be, that is the question: 2, whether 'tis nobler in the mind to suffer 3, the slings and arrows of outrageous fortune, 4, or to take arms against a sea of troubles Task: How often are terms contained in the set of documents? Astrid Rheinländer: Big Data Analytics with MapReduce 10
11 Example Term Count public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> { private final static IntWritable one = new IntWritable(1); private Text word = new Text(); public void map(longwritable key, Text value, OutputCollector<Text, IntWritable> output) { String line = value.tostring(); StringTokenizer tokenizer = new StringTokenizer(line); } } while (tokenizer.hasmoretokens()) { word.set(tokenizer.nexttoken()); output.collect(word, 1); } 1, to be, or not 2, whether 'tis 3, the slings 4, or to take map to, 1 be, 1 or 1 not, 1 whether,1 tis, 1 the, 1 slings, 1 or, 1 to, 1 take, 1 Astrid Rheinländer: Big Data Analytics with MapReduce 11
12 Example Term Count public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> { private final static IntWritable one = new IntWritable(1); private Text word = new Text(); public void map(longwritable key, Text value, OutputCollector<Text, IntWritable> output) { String line = value.tostring(); StringTokenizer tokenizer = new StringTokenizer(line); } } while (tokenizer.hasmoretokens()) { word.set(tokenizer.nexttoken()); output.collect(word, 1); } 1, to be, or not 2, whether 'tis 3, the slings 4, or to take map to, 1 be, 1 or 1 not, 1 whether,1 tis, 1 the, 1 slings, 1 or, 1 to, 1 take, 1 Astrid Rheinländer: Big Data Analytics with MapReduce 12
13 Example Term Count public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> { public void reduce(text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output) { int sum = 0; while (values.hasnext()) { sum += values.next().get(); } } } output.collect(key, new IntWritable(sum)); not, 1 tis, 1 whether,1 or 1 or, 1 be, 1 to, 1 to, 1 the, 1 reduce not,1 whether,1 tis, 1 the, 1 slings, 1 or, 2 be, 1 to, 2 take, 1 take, 1 slings, 1 Astrid Rheinländer: Big Data Analytics with MapReduce 13
14 Example Term Count public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> { public void reduce(text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output) { int sum = 0; while (values.hasnext()) { sum += values.next().get(); } } } output.collect(key, new IntWritable(sum)); not, 1 tis, 1 whether,1 or 1 or, 1 be, 1 to, 1 to, 1 the, 1 reduce not,1 whether,1 tis, 1 the, 1 slings, 1 or, 2 be, 1 to, 2 take, 1 take, 1 slings, 1 Astrid Rheinländer: Big Data Analytics with MapReduce 14
15 Example Term Count Programmer s point of view: doc 1 doc n map term 1, 1 term z, 1 reduce term 1, count 1 term z, count z MapReduce framework takes care of distributed execution Partition and distribute documents to mappers Store and group intermediate term counts Distribute intermediate data to reducers Collect and return final results Deal with failures and exceptions Astrid Rheinländer: Big Data Analytics with MapReduce 15
16 Overview MapReduce Programming model Architecture and execution Distributed File System HDFS Error handling & performance issues MapReduce vs. parallel Databases Extensions of MapReduce Astrid Rheinländer: Big Data Analytics with MapReduce 16
17 Cluster architecture Master/Worker architecture client Workers are commodity hardware Masters are usually well equipped Shared nothing Distributed processing Job Tracker master TaskTracker TaskTracker TaskTracker worker Examples: Google cluster with several thousand nodes 2008: 2 x86 CPUs, 4-8GB RAM, Linux, IDE HDDs Facebook 2010: 2000 nodes (8/16 cores), 32 GB RAM, 21PB storage in HDFS Astrid Rheinländer: Big Data Analytics with MapReduce 17
18 User Specification of MapReduce job Start execution engine Submit job to execution engine Execution engine Fork worker nodes Partition input MapReduce - Execution fork Execution engine master docs do cs chunk 1 chunk n worker worker Astrid Rheinländer: Big Data Analytics with MapReduce 18
19 Master Manages entire execution Assigns tasks and input splits to idle workers Tracks status of current job and all tasks (waiting, running, finished) Tracks status of worker nodes MapReduce - Execution Execution engine master chunk 1 chunk n worker worker Astrid Rheinländer: Big Data Analytics with MapReduce 19
20 Map-Worker Reads assigned splits Parses key-value pairs Executes map function for each pair MapReduce - Execution Buffers intermediate data in memory Execution engine master chunk 1 chunk n worker worker Astrid Rheinländer: Big Data Analytics with MapReduce 20
21 Map-Worker Buffered intermediate data are periodically written to local disks using some partition function Notify master about locations of intermediate data when map() is finished MapReduce - Execution Master pushes locations incrementally to reduce-workers Execution engine master chunk 1 chunk n Local write worker worker Astrid Rheinländer: Big Data Analytics with MapReduce 21
22 Reduce-Worker Read assigned splits from map workers disks (RPC) MapReduce - Execution Usually processes more than one key Sort data by intermediate key Execution engine master chunk 1 chunk n Remote read worker worker Astrid Rheinländer: Big Data Analytics with MapReduce 22
23 Reduce-Worker Executes reduce() function once for each intermediate key reduce() usually iterates over entire list of values per key aggregation Result is written to distributed FS Usually one file per reduce worker MapReduce - Execution Execution engine master chunk 1 chunk n result 1 worker worker result m Astrid Rheinländer: Big Data Analytics with MapReduce 23
24 All Map/Reduce jobs are finished Master notifies user program User gets access to result MapReduce - Execution Execution engine master master chunk 1 chunk n result 1 worker worker result m Astrid Rheinländer: Big Data Analytics with MapReduce 24
25 Overview MapReduce Programming model Architecture and execution Distributed File System HDFS Error handling & performance issues MapReduce vs. parallel Databases Extensions of MapReduce Astrid Rheinländer: Big Data Analytics with MapReduce 25
26 HDFS architecture HDFS: Hadoop distributed file system Why a special file system? Differend priorities/workload Goal: store very large data redundantly on commodity hardware manage much larger data compared to standard fs Assumptions: Nodes fails all the time Mostly read/append file operations, few rewrites streaming access pattern Network latency should not interrupt computations Relatively small number of large files Astrid Rheinländer: Big Data Analytics with MapReduce 26
27 HDFS architecture Master/Worker architecture client Master: Name node Distributed processing Distributed storage Access to DFS Replication Job Tracker Name node 2 nd Name node master Metadata TaskTracker Data node TaskTracker Data node TaskTracker Data node worker Workers: Data nodes Local storage in cluster I/O and maintenance operations Client talks to data nodes and name node Data does not pass name node Astrid Rheinländer: Big Data Analytics with MapReduce 27
28 HDFS Use blocks to store (parts of) files Default: 64MB Recommended: 128 MB Unix: 4KB Advantages: Fixed size Easy to calculate how many blocks fit on disk Files can be larger than any single disk in cluster Well suited for replication File Block1 Block2 Block3 Astrid Rheinländer: Big Data Analytics with MapReduce 28
29 Block1 HDFS Use blocks to store (parts of) files Default: 64MB Recommended: 128 MB Unix: 4KB Advantages: Fixed size Easy to calculate how many blocks fit on disk Files can be larger than any single disk in cluster Well suited for replication File Block1 Block2 Block3 Runs on top of OS file system OS fs blocks under the hood one HDFS block consists of multiple OS fs blocks HDFS OS fs block blocks Astrid Rheinländer: Big Data Analytics with MapReduce 29
30 Pros: Architecture - Summary Low costs Extensible Nodes are easily exchangeable But: Nodes fail regularly for various reasons Disk failures Broken controllers Cable breaks Network partitioning Network bandwith is a potential bottleneck Performance optimization and error handling needed Astrid Rheinländer: Big Data Analytics with MapReduce 30
31 Overview MapReduce Programming model Architecture and execution Distributed File System HDFS Error handling & performance issues MapReduce vs. parallel Databases Current trends Astrid Rheinländer: Big Data Analytics with MapReduce 31
32 Types of errors HDFS data node fails Worker node fails Master / Name node fails (Performance bottlenecks) Astrid Rheinländer: Big Data Analytics with MapReduce 32
33 Goal: Reliability by replication HDFS Replication blocks are replicated on 3 data nodes Rack 2 Rack 1 Placement 1 copy on local node 1 copy on remote node 1 copy on different node in same remote rack Additional copies randomly placed replication write Client Client reads from closest copy Integrity via checksums Astrid Rheinländer: Big Data Analytics with MapReduce 33
34 HDFS Replication Replication engine on name node Name node detects data node failures heartbeat messages Data node failure: move tasks to intact replicas Moving computation is cheaper than moving data Astrid Rheinländer: Big Data Analytics with MapReduce 34
35 Worker fails Master pings all workers regularly (heartbeat messages) Worker answer within timeout If worker does not answer, redistribute tasks to other workers Map/Reduce tasks with status assigned or running get status waiting for execution Finished Map tasks need to be executed again Finished Reduce tasks do not need to be re-run Astrid Rheinländer: Big Data Analytics with MapReduce 35
36 Master or Name node fails single points of failure Execution is aborted and needs to be restarted Solutions: Shadow master / secondary name node Checkpoints master master worker worker Astrid Rheinländer: Big Data Analytics with MapReduce 36
37 Master or Name node fails single points of failure Execution is aborted and needs to be restarted Solutions: Shadow master / secondary name node Checkpoints master master Resume from checkpoints worker Resume from checkpoints worker Astrid Rheinländer: Big Data Analytics with MapReduce 37
38 Performance Bottlenecks Stragglers Tasks which run much longer than others Backup-Tasks Astrid Rheinländer: Big Data Analytics with MapReduce 38
39 Backup tasks Problem: stragglers significantly slow down whole MapReduce job Reasons: Other jobs might consume resources on a machine Bad disks with correctable errors transfer data very slowly Not enough RAM machine starts swapping Solution: When MapReduce job is close to finishing Spawn copies of remaining in-progress tasks Keep results from task that finishes first Takes a few percent resource overhead Astrid Rheinländer: Big Data Analytics with MapReduce 39
40 Performance Bottlenecks Stragglers Network traffic Locality Astrid Rheinländer: Big Data Analytics with MapReduce 40
41 Locality Goal: conserve bandwidth Schedule map tasks to locations of processed chunks physically on same machine many machines can read data with local disk speed Worker 1 Map task A Worker 3 Chunk A Network switch Worker 2 Map task B Worker 4 Chunk B Master task A task B task C Map task X Chunk C Network switch Map Chunk K task D task I Worker 5 Map Chunk D Worker 6 Map Chunk F Astrid Rheinländer: Big Data Analytics with MapReduce 41
42 Locality Goal: conserve bandwidth Schedule map tasks to locations of processed chunks If same machine not possible: same rack / switch Worker 1 Map task A Chunk A Network switch Worker 2 Map task B Chunk B Worker 3 Worker 4 Master task A task B task C Map task X Chunk C Network switch Map task C Chunk K task D task I Worker 5 Map Chunk D Worker 6 Map Chunk F Astrid Rheinländer: Big Data Analytics with MapReduce 42
43 Overview MapReduce Programming model Architecture and execution Distributed File System HDFS Error handling & performance issues MapReduce vs. parallel Databases Current trends Astrid Rheinländer: Big Data Analytics with MapReduce 43
44 MapReduce vs. parallel DBMS DeWitt & Stonebraker: A major step backwards Astrid Rheinländer: Big Data Analytics with MapReduce 44
45 MapReduce vs. parallel DBMS DeWitt & Stonebraker: A major step backwards Criticism: Teradata has done this 20 years ago Not really new Functionality can be reached by UDFs in parallel DBMS Parallel DBMS provide good scalability Interface too low-level Write intermediate data to disk instead of pipelining/streaming Lack of schemata obstructs performance optimizations Astrid Rheinländer: Big Data Analytics with MapReduce 45
46 MapReduce!= parallel DBMS Focus is different! Parallel DBMS: Query very large data sets Transactions User management High reliability Very expensive MapReduce: ETL tasks Complex analytics Pragmatism instead of perfect reliability Much cheaper Open source systems Astrid Rheinländer: Big Data Analytics with MapReduce 46
47 Extensions of MapReduce Data flow languages on top of MapReduce Jaql, Hive, Pig, Record-based data model Optimizers Extensions of programming model Example: Stratosphere Astrid Rheinländer: Big Data Analytics with MapReduce 47
48 Stratosphere research unit Jointly carried out by TU, HU, and HPI Stratosphere system web-scale distributed data analytics system Database-inspired approach Analytical workload Astrid Rheinländer: Big Data Analytics with MapReduce 48
49 Stratosphere research unit Jointly carried out by TU, HU, and HPI Stratosphere system web-scale distributed data analytics system Database-inspired approach Analytical workload Astrid Rheinländer: Big Data Analytics with MapReduce 49
50 Stratosphere programming model PArallelization ConTracts (PACTs) Generalization of MapReduce Astrid Rheinländer: Big Data Analytics with MapReduce 50
51 Stratosphere programming model Parallelization Contracts Generalization of MapReduce Two inputs Matches tuples that share the same key Equi-join on key Each pair handed to UDF Astrid Rheinländer: Big Data Analytics with MapReduce 51
52 Stratosphere programming model Parallelization Contracts Generalization of MapReduce Two inputs Cartesian product of both data sets Each pair handed to UDF Astrid Rheinländer: Big Data Analytics with MapReduce 52
53 Stratosphere programming model Parallelization Contracts Generalization of MapReduce Reduce on two inputs Each key group is handed to UDF Astrid Rheinländer: Big Data Analytics with MapReduce 53
54 Thank you for your attention! Astrid Rheinländer: Big Data Analytics with MapReduce 54
Introduction to MapReduce and Hadoop
Introduction to MapReduce and Hadoop Jie Tao Karlsruhe Institute of Technology jie.tao@kit.edu Die Kooperation von Why Map/Reduce? Massive data Can not be stored on a single machine Takes too long to process
More informationIstanbul Şehir University Big Data Camp 14. Hadoop Map Reduce. Aslan Bakirov Kevser Nur Çoğalmış
Istanbul Şehir University Big Data Camp 14 Hadoop Map Reduce Aslan Bakirov Kevser Nur Çoğalmış Agenda Map Reduce Concepts System Overview Hadoop MR Hadoop MR Internal Job Execution Workflow Map Side Details
More informationData Science in the Wild
Data Science in the Wild Lecture 3 Some slides are taken from J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 1 Data Science and Big Data Big Data: the data cannot
More informationHadoop at Yahoo! Owen O Malley Yahoo!, Grid Team owen@yahoo-inc.com
Hadoop at Yahoo! Owen O Malley Yahoo!, Grid Team owen@yahoo-inc.com Who Am I? Yahoo! Architect on Hadoop Map/Reduce Design, review, and implement features in Hadoop Working on Hadoop full time since Feb
More informationand HDFS for Big Data Applications Serge Blazhievsky Nice Systems
Introduction PRESENTATION to Hadoop, TITLE GOES MapReduce HERE and HDFS for Big Data Applications Serge Blazhievsky Nice Systems SNIA Legal Notice The material contained in this tutorial is copyrighted
More informationHPCHadoop: MapReduce on Cray X-series
HPCHadoop: MapReduce on Cray X-series Scott Michael Research Analytics Indiana University Cray User Group Meeting May 7, 2014 1 Outline Motivation & Design of HPCHadoop HPCHadoop demo Benchmarking Methodology
More informationHadoop WordCount Explained! IT332 Distributed Systems
Hadoop WordCount Explained! IT332 Distributed Systems Typical problem solved by MapReduce Read a lot of data Map: extract something you care about from each record Shuffle and Sort Reduce: aggregate, summarize,
More informationCS54100: Database Systems
CS54100: Database Systems Cloud Databases: The Next Post- Relational World 18 April 2012 Prof. Chris Clifton Beyond RDBMS The Relational Model is too limiting! Simple data model doesn t capture semantics
More informationMap Reduce / Hadoop / HDFS
Chapter 3: Map Reduce / Hadoop / HDFS 97 Overview Outline Distributed File Systems (re-visited) Motivation Programming Model Example Applications Big Data in Apache Hadoop HDFS in Hadoop YARN 98 Overview
More informationParallel Programming Map-Reduce. Needless to Say, We Need Machine Learning for Big Data
Case Study 2: Document Retrieval Parallel Programming Map-Reduce Machine Learning/Statistics for Big Data CSE599C1/STAT592, University of Washington Carlos Guestrin January 31 st, 2013 Carlos Guestrin
More informationHadoop Architecture. Part 1
Hadoop Architecture Part 1 Node, Rack and Cluster: A node is simply a computer, typically non-enterprise, commodity hardware for nodes that contain data. Consider we have Node 1.Then we can add more nodes,
More informationHadoop/MapReduce. Object-oriented framework presentation CSCI 5448 Casey McTaggart
Hadoop/MapReduce Object-oriented framework presentation CSCI 5448 Casey McTaggart What is Apache Hadoop? Large scale, open source software framework Yahoo! has been the largest contributor to date Dedicated
More informationLecture 5: GFS & HDFS! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl
Big Data Processing, 2014/15 Lecture 5: GFS & HDFS!! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl 1 Course content Introduction Data streams 1 & 2 The MapReduce paradigm Looking behind
More informationBig Data Management and NoSQL Databases
NDBI040 Big Data Management and NoSQL Databases Lecture 3. Apache Hadoop Doc. RNDr. Irena Holubova, Ph.D. holubova@ksi.mff.cuni.cz http://www.ksi.mff.cuni.cz/~holubova/ndbi040/ Apache Hadoop Open-source
More informationCSE-E5430 Scalable Cloud Computing Lecture 2
CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 14.9-2015 1/36 Google MapReduce A scalable batch processing
More informationIntroduc)on to the MapReduce Paradigm and Apache Hadoop. Sriram Krishnan sriram@sdsc.edu
Introduc)on to the MapReduce Paradigm and Apache Hadoop Sriram Krishnan sriram@sdsc.edu Programming Model The computa)on takes a set of input key/ value pairs, and Produces a set of output key/value pairs.
More informationParallel Processing of cluster by Map Reduce
Parallel Processing of cluster by Map Reduce Abstract Madhavi Vaidya, Department of Computer Science Vivekanand College, Chembur, Mumbai vamadhavi04@yahoo.co.in MapReduce is a parallel programming model
More informationHadoop: Understanding the Big Data Processing Method
Hadoop: Understanding the Big Data Processing Method Deepak Chandra Upreti 1, Pawan Sharma 2, Dr. Yaduvir Singh 3 1 PG Student, Department of Computer Science & Engineering, Ideal Institute of Technology
More informationBig Data and Apache Hadoop s MapReduce
Big Data and Apache Hadoop s MapReduce Michael Hahsler Computer Science and Engineering Southern Methodist University January 23, 2012 Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 1 / 23
More informationJeffrey D. Ullman slides. MapReduce for data intensive computing
Jeffrey D. Ullman slides MapReduce for data intensive computing Single-node architecture CPU Machine Learning, Statistics Memory Classical Data Mining Disk Commodity Clusters Web data sets can be very
More informationData-Intensive Computing with Map-Reduce and Hadoop
Data-Intensive Computing with Map-Reduce and Hadoop Shamil Humbetov Department of Computer Engineering Qafqaz University Baku, Azerbaijan humbetov@gmail.com Abstract Every day, we create 2.5 quintillion
More informationOpen source software framework designed for storage and processing of large scale data on clusters of commodity hardware
Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Created by Doug Cutting and Mike Carafella in 2005. Cutting named the program after
More informationMapReduce (in the cloud)
MapReduce (in the cloud) How to painlessly process terabytes of data by Irina Gordei MapReduce Presentation Outline What is MapReduce? Example How it works MapReduce in the cloud Conclusion Demo Motivation:
More informationHadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh
1 Hadoop: A Framework for Data- Intensive Distributed Computing CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 2 What is Hadoop? Hadoop is a software framework for distributed processing of large datasets
More informationHadoop and ecosystem * 本 文 中 的 言 论 仅 代 表 作 者 个 人 观 点 * 本 文 中 的 一 些 图 例 来 自 于 互 联 网. Information Management. Information Management IBM CDL Lab
IBM CDL Lab Hadoop and ecosystem * 本 文 中 的 言 论 仅 代 表 作 者 个 人 观 点 * 本 文 中 的 一 些 图 例 来 自 于 互 联 网 Information Management 2012 IBM Corporation Agenda Hadoop 技 术 Hadoop 概 述 Hadoop 1.x Hadoop 2.x Hadoop 生 态
More informationCloud Computing at Google. Architecture
Cloud Computing at Google Google File System Web Systems and Algorithms Google Chris Brooks Department of Computer Science University of San Francisco Google has developed a layered system to handle webscale
More informationHadoop and Eclipse. Eclipse Hawaii User s Group May 26th, 2009. Seth Ladd http://sethladd.com
Hadoop and Eclipse Eclipse Hawaii User s Group May 26th, 2009 Seth Ladd http://sethladd.com Goal YOU can use the same technologies as The Big Boys Google Yahoo (2000 nodes) Last.FM AOL Facebook (2.5 petabytes
More informationChapter 7. Using Hadoop Cluster and MapReduce
Chapter 7 Using Hadoop Cluster and MapReduce Modeling and Prototyping of RMS for QoS Oriented Grid Page 152 7. Using Hadoop Cluster and MapReduce for Big Data Problems The size of the databases used in
More informationDATA MINING WITH HADOOP AND HIVE Introduction to Architecture
DATA MINING WITH HADOOP AND HIVE Introduction to Architecture Dr. Wlodek Zadrozny (Most slides come from Prof. Akella s class in 2014) 2015-2025. Reproduction or usage prohibited without permission of
More informationHadoop and Map-Reduce. Swati Gore
Hadoop and Map-Reduce Swati Gore Contents Why Hadoop? Hadoop Overview Hadoop Architecture Working Description Fault Tolerance Limitations Why Map-Reduce not MPI Distributed sort Why Hadoop? Existing Data
More informationTake An Internal Look at Hadoop. Hairong Kuang Grid Team, Yahoo! Inc hairong@yahoo-inc.com
Take An Internal Look at Hadoop Hairong Kuang Grid Team, Yahoo! Inc hairong@yahoo-inc.com What s Hadoop Framework for running applications on large clusters of commodity hardware Scale: petabytes of data
More informationBig Data With Hadoop
With Saurabh Singh singh.903@osu.edu The Ohio State University February 11, 2016 Overview 1 2 3 Requirements Ecosystem Resilient Distributed Datasets (RDDs) Example Code vs Mapreduce 4 5 Source: [Tutorials
More informationPLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS
PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS By HAI JIN, SHADI IBRAHIM, LI QI, HAIJUN CAO, SONG WU and XUANHUA SHI Prepared by: Dr. Faramarz Safi Islamic Azad
More informationParallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel
Parallel Databases Increase performance by performing operations in parallel Parallel Architectures Shared memory Shared disk Shared nothing closely coupled loosely coupled Parallelism Terminology Speedup:
More informationHadoop IST 734 SS CHUNG
Hadoop IST 734 SS CHUNG Introduction What is Big Data?? Bulk Amount Unstructured Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per day) If a regular machine need to
More informationHadoop. Scalable Distributed Computing. Claire Jaja, Julian Chan October 8, 2013
Hadoop Scalable Distributed Computing Claire Jaja, Julian Chan October 8, 2013 What is Hadoop? A general-purpose storage and data-analysis platform Open source Apache software, implemented in Java Enables
More informationHadoop implementation of MapReduce computational model. Ján Vaňo
Hadoop implementation of MapReduce computational model Ján Vaňo What is MapReduce? A computational model published in a paper by Google in 2004 Based on distributed computation Complements Google s distributed
More informationIntro to Map/Reduce a.k.a. Hadoop
Intro to Map/Reduce a.k.a. Hadoop Based on: Mining of Massive Datasets by Ra jaraman and Ullman, Cambridge University Press, 2011 Data Mining for the masses by North, Global Text Project, 2012 Slides by
More informationmap/reduce connected components
1, map/reduce connected components find connected components with analogous algorithm: map edges randomly to partitions (k subgraphs of n nodes) for each partition remove edges, so that only tree remains
More informationWeekly Report. Hadoop Introduction. submitted By Anurag Sharma. Department of Computer Science and Engineering. Indian Institute of Technology Bombay
Weekly Report Hadoop Introduction submitted By Anurag Sharma Department of Computer Science and Engineering Indian Institute of Technology Bombay Chapter 1 What is Hadoop? Apache Hadoop (High-availability
More informationInternals of Hadoop Application Framework and Distributed File System
International Journal of Scientific and Research Publications, Volume 5, Issue 7, July 2015 1 Internals of Hadoop Application Framework and Distributed File System Saminath.V, Sangeetha.M.S Abstract- Hadoop
More informationCS2510 Computer Operating Systems
CS2510 Computer Operating Systems HADOOP Distributed File System Dr. Taieb Znati Computer Science Department University of Pittsburgh Outline HDF Design Issues HDFS Application Profile Block Abstraction
More informationCS2510 Computer Operating Systems
CS2510 Computer Operating Systems HADOOP Distributed File System Dr. Taieb Znati Computer Science Department University of Pittsburgh Outline HDF Design Issues HDFS Application Profile Block Abstraction
More informationApache Hadoop. Alexandru Costan
1 Apache Hadoop Alexandru Costan Big Data Landscape No one-size-fits-all solution: SQL, NoSQL, MapReduce, No standard, except Hadoop 2 Outline What is Hadoop? Who uses it? Architecture HDFS MapReduce Open
More informationPetabyte-scale Data with Apache HDFS. Matt Foley Hortonworks, Inc. mfoley@hortonworks.com
Petabyte-scale Data with Apache HDFS Matt Foley Hortonworks, Inc. mfoley@hortonworks.com Matt Foley - Background MTS at Hortonworks Inc. HDFS contributor, part of original ~25 in Yahoo! spin-out of Hortonworks
More informationOpen source Google-style large scale data analysis with Hadoop
Open source Google-style large scale data analysis with Hadoop Ioannis Konstantinou Email: ikons@cslab.ece.ntua.gr Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory School of Electrical
More informationGoogle File System. Web and scalability
Google File System Web and scalability The web: - How big is the Web right now? No one knows. - Number of pages that are crawled: o 100,000 pages in 1994 o 8 million pages in 2005 - Crawlable pages might
More informationHow To Write A Mapreduce Program In Java.Io 4.4.4 (Orchestra)
MapReduce framework - Operates exclusively on pairs, - that is, the framework views the input to the job as a set of pairs and produces a set of pairs as the output
More informationLarge scale processing using Hadoop. Ján Vaňo
Large scale processing using Hadoop Ján Vaňo What is Hadoop? Software platform that lets one easily write and run applications that process vast amounts of data Includes: MapReduce offline computing engine
More informationWelcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components
Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components of Hadoop. We will see what types of nodes can exist in a Hadoop
More informationData Science Analytics & Research Centre
Data Science Analytics & Research Centre Data Science Analytics & Research Centre 1 Big Data Big Data Overview Characteristics Applications & Use Case HDFS Hadoop Distributed File System (HDFS) Overview
More informationBig Data Management. Big Data Management. (BDM) Autumn 2013. Povl Koch November 11, 2013 10-11-2013 1
Big Data Management Big Data Management (BDM) Autumn 2013 Povl Koch November 11, 2013 10-11-2013 1 Overview Today s program 1. Little more practical details about this course 2. Recap from last time (Google
More informationDistributed File Systems
Distributed File Systems Paul Krzyzanowski Rutgers University October 28, 2012 1 Introduction The classic network file systems we examined, NFS, CIFS, AFS, Coda, were designed as client-server applications.
More informationOverview. Big Data in Apache Hadoop. - HDFS - MapReduce in Hadoop - YARN. https://hadoop.apache.org. Big Data Management and Analytics
Overview Big Data in Apache Hadoop - HDFS - MapReduce in Hadoop - YARN https://hadoop.apache.org 138 Apache Hadoop - Historical Background - 2003: Google publishes its cluster architecture & DFS (GFS)
More informationDistributed File Systems
Distributed File Systems Mauro Fruet University of Trento - Italy 2011/12/19 Mauro Fruet (UniTN) Distributed File Systems 2011/12/19 1 / 39 Outline 1 Distributed File Systems 2 The Google File System (GFS)
More informationHadoop Cluster Applications
Hadoop Overview Data analytics has become a key element of the business decision process over the last decade. Classic reporting on a dataset stored in a database was sufficient until recently, but yesterday
More informationDesign and Evolution of the Apache Hadoop File System(HDFS)
Design and Evolution of the Apache Hadoop File System(HDFS) Dhruba Borthakur Engineer@Facebook Committer@Apache HDFS SDC, Sept 19 2011 Outline Introduction Yet another file-system, why? Goals of Hadoop
More informationBig Data Technology Map-Reduce Motivation: Indexing in Search Engines
Big Data Technology Map-Reduce Motivation: Indexing in Search Engines Edward Bortnikov & Ronny Lempel Yahoo Labs, Haifa Indexing in Search Engines Information Retrieval s two main stages: Indexing process
More informationTutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA
Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA http://kzhang6.people.uic.edu/tutorial/amcis2014.html August 7, 2014 Schedule I. Introduction to big data
More informationArchitectures for Big Data Analytics A database perspective
Architectures for Big Data Analytics A database perspective Fernando Velez Director of Product Management Enterprise Information Management, SAP June 2013 Outline Big Data Analytics Requirements Spectrum
More informationBig Data Processing with Google s MapReduce. Alexandru Costan
1 Big Data Processing with Google s MapReduce Alexandru Costan Outline Motivation MapReduce programming model Examples MapReduce system architecture Limitations Extensions 2 Motivation Big Data @Google:
More informationA very short Intro to Hadoop
4 Overview A very short Intro to Hadoop photo by: exfordy, flickr 5 How to Crunch a Petabyte? Lots of disks, spinning all the time Redundancy, since disks die Lots of CPU cores, working all the time Retry,
More informationHadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org dhruba@facebook.com
Hadoop Distributed File System Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org dhruba@facebook.com Hadoop, Why? Need to process huge datasets on large clusters of computers
More informationTHE HADOOP DISTRIBUTED FILE SYSTEM
THE HADOOP DISTRIBUTED FILE SYSTEM Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler Presented by Alexander Pokluda October 7, 2013 Outline Motivation and Overview of Hadoop Architecture,
More informationHadoop Framework. technology basics for data scientists. Spring - 2014. Jordi Torres, UPC - BSC www.jorditorres.eu @JordiTorresBCN
Hadoop Framework technology basics for data scientists Spring - 2014 Jordi Torres, UPC - BSC www.jorditorres.eu @JordiTorresBCN Warning! Slides are only for presenta8on guide We will discuss+debate addi8onal
More informationIntroduction to Hadoop
Introduction to Hadoop Miles Osborne School of Informatics University of Edinburgh miles@inf.ed.ac.uk October 28, 2010 Miles Osborne Introduction to Hadoop 1 Background Hadoop Programming Model Examples
More informationApache Hadoop FileSystem and its Usage in Facebook
Apache Hadoop FileSystem and its Usage in Facebook Dhruba Borthakur Project Lead, Apache Hadoop Distributed File System dhruba@apache.org Presented at Indian Institute of Technology November, 2010 http://www.facebook.com/hadoopfs
More informationBiomap Jobs and the Big Picture
Lots of Data, Little Money. A Last.fm perspective Martin Dittus, martind@last.fm London Geek Nights, 2009-04-23 Big Data Little Money You have lots of data You want to process it For your product (Last.fm:
More informationA Comparison of Approaches to Large-Scale Data Analysis
A Comparison of Approaches to Large-Scale Data Analysis Sam Madden MIT CSAIL with Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel Abadi, David DeWitt, and Michael Stonebraker In SIGMOD 2009 MapReduce
More informationBig Data and Hadoop. Sreedhar C, Dr. D. Kavitha, K. Asha Rani
Big Data and Hadoop Sreedhar C, Dr. D. Kavitha, K. Asha Rani Abstract Big data has become a buzzword in the recent years. Big data is used to describe a massive volume of both structured and unstructured
More informationBigData. An Overview of Several Approaches. David Mera 16/12/2013. Masaryk University Brno, Czech Republic
BigData An Overview of Several Approaches David Mera Masaryk University Brno, Czech Republic 16/12/2013 Table of Contents 1 Introduction 2 Terminology 3 Approaches focused on batch data processing MapReduce-Hadoop
More informationWord count example Abdalrahman Alsaedi
Word count example Abdalrahman Alsaedi To run word count in AWS you have two different ways; either use the already exist WordCount program, or to write your own file. First: Using AWS word count program
More informationIntroduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data
Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give
More informationGetting to know Apache Hadoop
Getting to know Apache Hadoop Oana Denisa Balalau Télécom ParisTech October 13, 2015 1 / 32 Table of Contents 1 Apache Hadoop 2 The Hadoop Distributed File System(HDFS) 3 Application management in the
More informationHADOOP PERFORMANCE TUNING
PERFORMANCE TUNING Abstract This paper explains tuning of Hadoop configuration parameters which directly affects Map-Reduce job performance under various conditions, to achieve maximum performance. The
More informationHadoop & its Usage at Facebook
Hadoop & its Usage at Facebook Dhruba Borthakur Project Lead, Hadoop Distributed File System dhruba@apache.org Presented at the The Israeli Association of Grid Technologies July 15, 2009 Outline Architecture
More informationSurvey on Scheduling Algorithm in MapReduce Framework
Survey on Scheduling Algorithm in MapReduce Framework Pravin P. Nimbalkar 1, Devendra P.Gadekar 2 1,2 Department of Computer Engineering, JSPM s Imperial College of Engineering and Research, Pune, India
More informationFrom GWS to MapReduce: Google s Cloud Technology in the Early Days
Large-Scale Distributed Systems From GWS to MapReduce: Google s Cloud Technology in the Early Days Part II: MapReduce in a Datacenter COMP6511A Spring 2014 HKUST Lin Gu lingu@ieee.org MapReduce/Hadoop
More informationNetworking in the Hadoop Cluster
Hadoop and other distributed systems are increasingly the solution of choice for next generation data volumes. A high capacity, any to any, easily manageable networking layer is critical for peak Hadoop
More informationBig Data Analytics* Outline. Issues. Big Data
Outline Big Data Analytics* Big Data Data Analytics: Challenges and Issues Misconceptions Big Data Infrastructure Scalable Distributed Computing: Hadoop Programming in Hadoop: MapReduce Paradigm Example
More informationDistributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms
Distributed File System 1 How do we get data to the workers? NAS Compute Nodes SAN 2 Distributed File System Don t move data to workers move workers to the data! Store data on the local disks of nodes
More informationOpen source large scale distributed data management with Google s MapReduce and Bigtable
Open source large scale distributed data management with Google s MapReduce and Bigtable Ioannis Konstantinou Email: ikons@cslab.ece.ntua.gr Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory
More informationPrepared By : Manoj Kumar Joshi & Vikas Sawhney
Prepared By : Manoj Kumar Joshi & Vikas Sawhney General Agenda Introduction to Hadoop Architecture Acknowledgement Thanks to all the authors who left their selfexplanatory images on the internet. Thanks
More informationLecture Data Warehouse Systems
Lecture Data Warehouse Systems Eva Zangerle SS 2013 PART C: Novel Approaches in DW NoSQL and MapReduce Stonebraker on Data Warehouses Star and snowflake schemas are a good idea in the DW world C-Stores
More informationA Brief Outline on Bigdata Hadoop
A Brief Outline on Bigdata Hadoop Twinkle Gupta 1, Shruti Dixit 2 RGPV, Department of Computer Science and Engineering, Acropolis Institute of Technology and Research, Indore, India Abstract- Bigdata is
More informationINTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE
INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE AGENDA Introduction to Big Data Introduction to Hadoop HDFS file system Map/Reduce framework Hadoop utilities Summary BIG DATA FACTS In what timeframe
More informationHadoop MapReduce Tutorial - Reduce Comp variability in Data Stamps
Distributed Recommenders Fall 2010 Distributed Recommenders Distributed Approaches are needed when: Dataset does not fit into memory Need for processing exceeds what can be provided with a sequential algorithm
More informationData Warehousing and Analytics Infrastructure at Facebook. Ashish Thusoo & Dhruba Borthakur athusoo,dhruba@facebook.com
Data Warehousing and Analytics Infrastructure at Facebook Ashish Thusoo & Dhruba Borthakur athusoo,dhruba@facebook.com Overview Challenges in a Fast Growing & Dynamic Environment Data Flow Architecture,
More informationMAPREDUCE Programming Model
CS 2510 COMPUTER OPERATING SYSTEMS Cloud Computing MAPREDUCE Dr. Taieb Znati Computer Science Department University of Pittsburgh MAPREDUCE Programming Model Scaling Data Intensive Application MapReduce
More informationBig Data Storage, Management and challenges. Ahmed Ali-Eldin
Big Data Storage, Management and challenges Ahmed Ali-Eldin (Ambitious) Plan What is Big Data? And Why talk about Big Data? How to store Big Data? BigTables (Google) Dynamo (Amazon) How to process Big
More informationJournal of science STUDY ON REPLICA MANAGEMENT AND HIGH AVAILABILITY IN HADOOP DISTRIBUTED FILE SYSTEM (HDFS)
Journal of science e ISSN 2277-3290 Print ISSN 2277-3282 Information Technology www.journalofscience.net STUDY ON REPLICA MANAGEMENT AND HIGH AVAILABILITY IN HADOOP DISTRIBUTED FILE SYSTEM (HDFS) S. Chandra
More informationIntroduc)on to Map- Reduce. Vincent Leroy
Introduc)on to Map- Reduce Vincent Leroy Sources Apache Hadoop Yahoo! Developer Network Hortonworks Cloudera Prac)cal Problem Solving with Hadoop and Pig Slides will be available at hgp://lig- membres.imag.fr/leroyv/
More informationTHE FLORIDA STATE UNIVERSITY COLLEGE OF ARTS AND SCIENCE COMPARING HADOOPDB: A HYBRID OF DBMS AND MAPREDUCE TECHNOLOGIES WITH THE DBMS POSTGRESQL
THE FLORIDA STATE UNIVERSITY COLLEGE OF ARTS AND SCIENCE COMPARING HADOOPDB: A HYBRID OF DBMS AND MAPREDUCE TECHNOLOGIES WITH THE DBMS POSTGRESQL By VANESSA CEDENO A Dissertation submitted to the Department
More informationBig Data Technology Core Hadoop: HDFS-YARN Internals
Big Data Technology Core Hadoop: HDFS-YARN Internals Eshcar Hillel Yahoo! Ronny Lempel Outbrain *Based on slides by Edward Bortnikov & Ronny Lempel Roadmap Previous class Map-Reduce Motivation This class
More informationINTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY
INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY A PATH FOR HORIZING YOUR INNOVATIVE WORK A REVIEW ON HIGH PERFORMANCE DATA STORAGE ARCHITECTURE OF BIGDATA USING HDFS MS.
More informationProcessing of massive data: MapReduce. 2. Hadoop. New Trends In Distributed Systems MSc Software and Systems
Processing of massive data: MapReduce 2. Hadoop 1 MapReduce Implementations Google were the first that applied MapReduce for big data analysis Their idea was introduced in their seminal paper MapReduce:
More informationLecture 10 - Functional programming: Hadoop and MapReduce
Lecture 10 - Functional programming: Hadoop and MapReduce Sohan Dharmaraja Sohan Dharmaraja Lecture 10 - Functional programming: Hadoop and MapReduce 1 / 41 For today Big Data and Text analytics Functional
More informationHDFS. Hadoop Distributed File System
HDFS Kevin Swingler Hadoop Distributed File System File system designed to store VERY large files Streaming data access Running across clusters of commodity hardware Resilient to node failure 1 Large files
More informationHadoop Distributed File System. Dhruba Borthakur June, 2007
Hadoop Distributed File System Dhruba Borthakur June, 2007 Goals of HDFS Very Large Distributed File System 10K nodes, 100 million files, 10 PB Assumes Commodity Hardware Files are replicated to handle
More informationMapReduce. MapReduce and SQL Injections. CS 3200 Final Lecture. Introduction. MapReduce. Programming Model. Example
MapReduce MapReduce and SQL Injections CS 3200 Final Lecture Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI'04: Sixth Symposium on Operating System Design
More information