1 PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS By HAI JIN, SHADI IBRAHIM, LI QI, HAIJUN CAO, SONG WU and XUANHUA SHI Prepared by: Dr. Faramarz Safi Islamic Azad University, Najafabad Branch, Esfahan, Iran.
2 Introduction Recently the computing world has been undergoing a significant transformation from the traditional non-centralized distributed system architecture, typified by distributed data and computation on different geographic areas, to a centralized cloud computing architecture, where the computations and data are operated somewhere in the cloud that is, data centers owned and maintained by third party. The interest in cloud computing has been motivated by many factors  such as the low cost of system hardware, the increase in computing power and storage capacity (e.g., the modern data center consists of hundred of thousand of cores and peta-scale storage), and the massive growth in data size generated by digital media (images/audio/video), Web authoring, scientific instruments, physical simulations, and so on. To this end, still the main challenge in the cloud is how to effectively store, query, analyze, and utilize these immense datasets. The traditional data-intensive system (data to computing paradigm) is not efficient for cloud computing due to the bottleneck of the Internet when transferring large amounts of data to a distant CPU . New paradigms should be adopted, where computing and data resources are colocated, thus minimizing the communication cost and benefiting from the large improvements in I/O speeds using local disks, as shown in Figure Alex Szalay and Jim Gray stated in a commentary on 2020 computing : In the future, working with large data sets will typically mean sending computations to data rather than copying the data to your work station.
4 Introduction Google has successfully implemented and practiced the new data-intensive paradigm in their Google MapReduce System (e.g., Google uses its MapReduce framework to process 20 petabytes of data per day ). The MapReduce system runs on top of the Google File System (GFS) , within which data are loaded, partitioned into chunks, and each chunk is replicated. Data processing is co-located with data storage: When a file needs to be processed, the job scheduler consults a storage metadata service to get the host node for each chunk and then schedules a map process on that node, so that data locality is exploited efficiently. At the time of writing, due to its remarkable features including simplicity, fault tolerance, and scalability, MapReduce is by far the most powerful realization of data-intensive cloud computing programming. It is often advocated as an easier-to-use, efficient and reliable replacement for the traditional data-intensive programming model for cloud computing. More significantly, MapReduce has been proposed to form the basis of the datacenter software stack .
5 Introduction MapReduce has been widely applied in various fields including data and compute-intensive applications, machine learning, graphic programming, multi-core programming, and so on. Moreover, many implementations have been developed in different languages for different purposes. Its popular open-source implementation, Hadoop , was developed primarily by Yahoo!, where it processes hundreds of terabytes of data on at least 10,000 cores , and is now used by other companies, including Facebook, Amazon, Last.fm, and the New York Times . Research groups from the enterprise and academia are starting to study the MapReduce model for better fit for the cloud, and they explore the possibilities of adapting it for more applications.
6 Motivation for MapReduce (why) Large-Scale Data Processing Want to use 1000s of CPUs But don t want hassle of managing things MapReduce Architecture provides Automatic parallelization & distribution Fault tolerance I/O scheduling Monitoring & status updates
7 What is Map/Reduce MapReduce is a software framework for solving many largescale computing problems. The MapReduce abstraction is inspired by the Map and Reduce functions, which are commonly used in functional languages such as Lisp . Many problems can be phrased this way Easy to distribute across nodes Nice retry/failure semantics The MapReduce system allows users to easily express their computation as map and reduce functions (more details can be found in Dean and Ghemawat ):
8 Map and Reduce Functions The map function, written by the user, processes a key/value pair to generate a set of intermediate key/value pairs: map (key1, value1) -> list (key2, value2) The reduce function, also written by the user, merges all intermediate values associated with the same intermediate key: reduce (key2, list (value2)) -> list (value2)
9 Map in LISP (Scheme) (map f list [list 2 list 3 ]) (map square ( )) ( )
10 Reduce in LISP (Scheme) (reduce f id list) (reduce + 0 ( )) (+ 16 (+ 9 (+ 4 (+ 1 0)) ) ) 30 (reduce + 0 (map square (map l 1 l 2 ))))
11 The Wordcount Example-1 We have a large file of words, one word to a line Count the number of times each distinct word appears in the file Sample application: analyze web server logs to find popular URLs
12 The Wordcount Example-2 Case 1: Entire file fits in memory Case 2: File too large for mem, but all <word, count> pairs fit in mem Case 3: File on disk, too many distinct words to fit in memory sort datafile uniq c
13 The Wordcount Example-3 To make it slightly harder, suppose we have a large corpus of documents Count the number of times each distinct word occurs in the corpus words(docs/*) sort uniq -c where words takes a file and outputs the words in it, one to a line The above captures the essence of MapReduce Great thing is it is naturally parallelizable
14 Distributed Word Count Very big data Split data Split data Split data count count count count count count merge merged count Split data count count
15 Map+Reduce Very big data M A P Partitioning Function R E D U C E Result Map: Accepts input key/value pair Emits intermediate key/value pair Reduce : Accepts intermediate key/value* pair Emits output key/value pair
16 Diagram (1)
17 Diagram (2)
18 MapReduce Input: a set of key/value pairs User supplies two functions: map(k,v) list(k1,v1) reduce(k1, list(v1)) v2 (k1,v1) is an intermediate key/value pair Output is the set of (k1,v2) pairs
19 Word Count using MapReduce map(key, value): // key: document name; value: text of document for each word w in value: emit(w, 1) reduce(key, values): // key: a word; values: an iterator over counts result = 0 for each count v in values: result += v emit(key,result)
20 Count map(key=url, val=contents): For each word w in contents, emit (w, 1 ) reduce(key=word, values=uniq_counts): Sum all 1 s in values list Emit result (word, sum) see bob run see spot throw see 1 bob 1 run 1 see 1 spot 1 throw 1 bob 1 run 1 see 2 spot 1 throw 1
21 The Wordcount Example The Wordcount application counts the number of occurrences of each word in a large collection of documents. The steps of the process are briefly described as follows: The input is read (typically from a distributed file system) and broken up into key/value pairs (e.g., the Map function emits a word and its associated count of occurrence, which is just 1 ). The pairs are partitioned into groups for processing, and they are sorted according to their key as they arrive for reduction. Finally, the key/value pairs are reduced, once for each unique key in the sorted list, to produce a combined result (e.g., the Reduce function sums all the counts emitted for a particular word).
22 The Wordcount Example
23 Main Features The main features of MapReduce for data-intensive application: Data-Aware. When the MapReduce-Master node is scheduling the Map tasks for a newly submitted job, it takes in consideration the data location information retrieved from the GFS-Master node. Simplicity. As the MapReduce runtime is responsible for parallelization and concurrency control, this allows programmers to easily design parallel and distributed applications. Manageability. In traditional data-intensive applications, where data are stored separately from the computation unit, we need two levels of management: (i) to manage the input data and then move these data and prepare them to be executed; (ii) to manage the output data. In contrast, in the Google MapReduce model, data and computation are allocated, taking advantage of the GFS, and thus it is easier to manage the input and output data.
24 Main Features Scalability. Increasing the number of nodes (data nodes) in the system will increase the performance of jobs with potentially only minor losses. Fault-tolerance and Reliability. The data in GFS are distributed on clusters with thousands of nodes. Thus any nodes with hardware failures can be handled by simply removing them and installing a new node in their place. Moreover, MapReduce, taking the advantages of the replication in GFS, can achieve high reliability by 1. Rerunning all the tasks (completed or in progress) when a host node is going off-line, 2. Rerunning failed tasks on another node, and 3. Launching backup tasks when these tasks are slowing down and causing a bottleneck to entire job.
25 Execution Overview
26 Distributed Execution Overview User Program fork fork fork assign map Master assign reduce Input Data Split 0 Split 1 Split 2 read Worker Worker Worker local write remote read, sort Worker Worker write Output File 0 Output File 1
27 Data flow Input, final output are stored on a distributed file system Scheduler tries to schedule map tasks close to physical storage location of input data Intermediate results are stored on local FS of map and reduce workers Output is often input to another map reduce task
28 Execution Overview The MapReduce library in the user program first splits the input files into M pieces of typically 16 to 64 (MB) per piece. It then starts many copies of the program on a cluster. One is the master and the rest are workers. The master is responsible for scheduling (assigns the map and reduce tasks to the worker) and monitoring (monitors the task progress and the worker health). When map tasks arise, the master assigns the task to an idle worker, taking into account the data locality. A worker reads the content of the corresponding input split and emits a key/value pairs to the user-defined Map function. The intermediate key/value pairs produced by the Map function are first buffered in memory and then periodically written to a local disk, partitioned into R sets by the partitioning function. The master passes the location of these stored pairs to the reduce worker, which reads the buffered data from the map worker using remote procedure calls (RPC). It then sorts the intermediate keys so that all occurrences of the same key are grouped together. For each key, the worker passes the corresponding intermediate value for its entire occurrence to the Reduce function. Finally, the output is available in R output files (one per reduce task).
29 How many Map and Reduce jobs? M map tasks, R reduce tasks Rule of thumb: Make M and R much larger than the number of nodes in cluster One DFS chunk per map is common Improves dynamic load balancing and speeds recovery from worker failure Usually R is smaller than M, because output is spread across R files
30 Combiners Often a map task will produce many pairs of the form (k,v1), (k,v2), for the same key k E.g., popular words in Word Count Can save network time by pre-aggregating at mapper combine(k1, list(v1)) v2 Usually the same as the reduce function Works only if reduce function is commutative and associative
31 Partition Function Inputs to map tasks are created by contiguous splits of input file For reduce, we need to ensure that records with the same intermediate key end up at the same worker System uses a default partition function e.g., hash(key) mod R Sometimes useful to override E.g., hash(hostname(url)) mod R ensures URLs from a host end up in the same output file
32 Execution Summary How is this distributed? 1. Partition input key/value pairs into chunks, run map() tasks in parallel 2. After all map()s are complete, consolidate all emitted values for each unique emitted key 3. Now partition space of output map keys, and run reduce() in parallel If map() or reduce() fails, reexecute!
33 Spotlight on Google MapReduce Implementation
34 MAJOR MAPREDUCE IMPLEMENTATIONS FOR THE CLOUD
35 MAJOR MAPREDUCE IMPLEMENTATIONS FOR THE CLOUD
36 MAJOR MAPREDUCE IMPLEMENTATIONS FOR THE CLOUD HADOOP The Apache Hadoop project develops open source software for reliable, scalable, distributed computing. Hadoop includes these subprojects: Hadoop Common: The common utilities that support the other Hadoop subprojects. Avro: A data serialization system that provides dynamic integration with scripting languages. Chukwa: A data collection system for managing large distributed systems. HBase: A scalable, distributed database that supports structured data storage for large tables. HDFS: A distributed file system that provides high throughput access to application data. Hive: A data warehouse infrastructure that provides data summarization and ad- hoc querying. MapReduce: A software framework for distributed processing of large data sets on compute clusters. Pig: A high level data flow language and execution framework for parallel computation. ZooKeeper: A high performance coordination service for distributed applications.
37 MAJOR MAPREDUCE IMPLEMENTATIONS FOR THE CLOUD HADOOP Hadoop Map-Reduce Overview: The Hadoop common , formerly Hadoop core, includes file System, RPC, and serialization libraries and provides the basic services for building a cloud computing environment with commodity hardware. The two fundamental subprojects are the MapReduce framework and the Hadoop Distributed File System (HDFS). The Hadoop Distributed File System is a distributed file system designed to run on clusters of commodity machines. It is highly fault-tolerant and is appropriate for data-intensive applications as it provides high speed access the application data. The Hadoop MapReduce framework is highly reliant on its shared file system (i.e., it comes with plug-ins for HDFS, CloudStore , and Amazon Simple Storage Service S3 ). The Map/Reduce framework has master/slave architecture. The master, called JobTracker, is responsible for (a) querying the NameNode for the block locations, (b) scheduling the tasks on the slave which is hosting the task s blocks, and (c) monitoring the successes and failures of the tasks. The slaves, called TaskTracker, execute the tasks as directed by the master. Hadoop Communities. Yahoo! uses Hadoop extensively in its Web search and advertising businesses . For example, in 2009, Yahoo! launched, according to them, the world s largest Hadoop production application, called Yahoo! Search Webmap. The Yahoo! Search Webmap runs on a more than 10,000 core Linux cluster and produces data that are now used in every Yahoo! Web search query . Besides Yahoo!, many other vendors have introduced and developed their own solutions for the enterprise cloud; these include IBM Blue Cloud , Cloudera , Open-solaris Hadoop Live CD  by Sun Microsystems, and Amazon Elastic MapReduce .
38 MAJOR MAPREDUCE IMPLEMENTATIONS FOR THE CLOUD HADOOP
39 MAJOR MAPREDUCE IMPLEMENTATIONS FOR THE CLOUD Alternative HADOOP Implementations
40 MAJOR MAPREDUCE IMPLEMENTATIONS FOR THE CLOUD Some Research Directions Longest Approximate Time to End (LATE) to improve the performance of Hadoop in a heterogeneous environment by running speculative tasks that is, looking for tasks that are running slowly and might possibly fail and replicating them on another node just in case they don t perform. In LATE, the slow tasks are prioritized based on how much they hurt job response time, and the number of speculative tasks is capped to prevent thrashing. The second one is driven by the increasing maturity of virtualization technology for example, the successful adoption and use of virtual machines (VMs) in various distributed systems such as grid  and HPC applications [42, 43]. To this end, some efforts have been proposed to efficiently run MapReduce on VM-based cluster, as in Cloudlet  and Tashi .
CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu 2 CPU Memory Machine Learning, Statistics Classical Data Mining Disk 3 20+ billion web pages x 20KB = 400+ TB
CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) Rezaul A. Chowdhury Department of Computer Science SUNY Stony Brook Spring 2016 MapReduce MapReduce is a programming model
Chapter 7 Using Hadoop Cluster and MapReduce Modeling and Prototyping of RMS for QoS Oriented Grid Page 152 7. Using Hadoop Cluster and MapReduce for Big Data Problems The size of the databases used in
Big Data and Apache Hadoop s MapReduce Michael Hahsler Computer Science and Engineering Southern Methodist University January 23, 2012 Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 1 / 23
Data-Intensive Computing with Map-Reduce and Hadoop Shamil Humbetov Department of Computer Engineering Qafqaz University Baku, Azerbaijan firstname.lastname@example.org Abstract Every day, we create 2.5 quintillion
Parallel Processing of cluster by Map Reduce Abstract Madhavi Vaidya, Department of Computer Science Vivekanand College, Chembur, Mumbai email@example.com MapReduce is a parallel programming model
HPC & Big Data Adam S.Z Belloum Software and Network engineering group University of Amsterdam 1 Introduc)on to MapReduce programing model 2 Content Introduc)on Master/Worker approach MapReduce Examples
CIS 4930/6930 Spring 2014 Introduction to Data Science /Data Intensie Computing Uniersity of Florida, CISE Department Prof. Daisy Zhe Wang Map/Reduce: Simplified Data Processing on Large Clusters Parallel/Distributed
Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Created by Doug Cutting and Mike Carafella in 2005. Cutting named the program after
Open source large scale distributed data management with Google s MapReduce and Bigtable Ioannis Konstantinou Email: firstname.lastname@example.org Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory
Open source Google-style large scale data analysis with Hadoop Ioannis Konstantinou Email: email@example.com Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory School of Electrical
1 Apache Hadoop Alexandru Costan Big Data Landscape No one-size-fits-all solution: SQL, NoSQL, MapReduce, No standard, except Hadoop 2 Outline What is Hadoop? Who uses it? Architecture HDFS MapReduce Open
Introduction to Hadoop New York Oracle User Group Vikas Sawhney GENERAL AGENDA Driving Factors behind BIG-DATA NOSQL Database 2014 Database Landscape Hadoop Architecture Map/Reduce Hadoop Eco-system Hadoop
Prepared By : Manoj Kumar Joshi & Vikas Sawhney General Agenda Introduction to Hadoop Architecture Acknowledgement Thanks to all the authors who left their selfexplanatory images on the internet. Thanks
Chapter 3: Map Reduce / Hadoop / HDFS 97 Overview Outline Distributed File Systems (re-visited) Motivation Programming Model Example Applications Big Data in Apache Hadoop HDFS in Hadoop YARN 98 Overview
1 Big Data Processing with Google s MapReduce Alexandru Costan Outline Motivation MapReduce programming model Examples MapReduce system architecture Limitations Extensions 2 Motivation Big Data @Google:
CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University firstname.lastname@example.org 14.9-2015 1/36 Google MapReduce A scalable batch processing
Page 1 sur 5 Hadoop From Wikipedia, the free encyclopedia Apache Hadoop is a free Java software framework that supports data intensive distributed applications.  It enables applications to work with
With Saurabh Singh email@example.com The Ohio State University February 11, 2016 Overview 1 2 3 Requirements Ecosystem Resilient Distributed Datasets (RDDs) Example Code vs Mapreduce 4 5 Source: [Tutorials
Hace7epe Üniversitesi Bilgisayar Mühendisliği Bölümü BBM467 Data Intensive ApplicaAons Dr. Fuat Akal firstname.lastname@example.org Problem How do you scale up applicaaons? Run jobs processing 100 s of terabytes
Note to other teachers and users of these slides: We would be delighted if you found this our material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit
For personal use only, not for distribution. 333 16.1 MAPREDUCE Initially designed by the Google labs and used internally by Google, the MAPREDUCE distributed programming model is now promoted by several
MapReduce (in the cloud) How to painlessly process terabytes of data by Irina Gordei MapReduce Presentation Outline What is MapReduce? Example How it works MapReduce in the cloud Conclusion Demo Motivation:
Hadoop http://hadoop.apache.org/ What Is Apache Hadoop? The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using
Introduction to MapReduce and Hadoop Jie Tao Karlsruhe Institute of Technology email@example.com Die Kooperation von Why Map/Reduce? Massive data Can not be stored on a single machine Takes too long to process
Hadoop implementation of MapReduce computational model Ján Vaňo What is MapReduce? A computational model published in a paper by Google in 2004 Based on distributed computation Complements Google s distributed
What We Can Do in the Cloud (2) -Tutorial for Cloud Computing Course- Mikael Fernandus Simalango WISE Research Lab Ajou University, South Korea Overview Riding Google App Engine Taming Hadoop Summary Riding
Take An Internal Look at Hadoop Hairong Kuang Grid Team, Yahoo! Inc firstname.lastname@example.org What s Hadoop Framework for running applications on large clusters of commodity hardware Scale: petabytes of data
Big Application Execution on Cloud using Hadoop Distributed File System Ashkan Vates*, Upendra, Muwafaq Rahi Ali RPIIT Campus, Bastara Karnal, Haryana, India ---------------------------------------------------------------------***---------------------------------------------------------------------
MapReduce with Apache Hadoop Analysing Big Data April 2010 Gavin Heavyside email@example.com About Journey Dynamics Founded in 2006 to develop software technology to address the issues
Hadoop Architecture Part 1 Node, Rack and Cluster: A node is simply a computer, typically non-enterprise, commodity hardware for nodes that contain data. Consider we have Node 1.Then we can add more nodes,
Comparison of Different Implementation of Inverted Indexes in Hadoop Hediyeh Baban, S. Kami Makki, and Stefan Andrei Department of Computer Science Lamar University Beaumont, Texas (hbaban, kami.makki,
Lecture 32 Big Data 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop 1 2 Big Data Problems Data explosion Data from users on social
MapReduce and Hadoop Aaron Birkland Cornell Center for Advanced Computing January 2012 Motivation Simple programming model for Big Data Distributed, parallel but hides this Established success at petabyte
marlabs driving digital agility WHITEPAPER Big Data and Hadoop Abstract This paper explains the significance of Hadoop, an emerging yet rapidly growing technology. The prime goal of this paper is to unveil
Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give
Hadoop/MapReduce Object-oriented framework presentation CSCI 5448 Casey McTaggart What is Apache Hadoop? Large scale, open source software framework Yahoo! has been the largest contributor to date Dedicated
The MapReduce Framework Luke Tierney Department of Statistics & Actuarial Science University of Iowa November 8, 2007 Luke Tierney (U. of Iowa) The MapReduce Framework November 8, 2007 1 / 16 Background
MapReduce and Implementation Hadoop Parallel Data Processing Kai Shen A programming interface (two stage Map and Reduce) and system support such that: the interface is easy to program, and suitable for
Introduction to Parallel Programming and MapReduce Audience and Pre-Requisites This tutorial covers the basics of parallel programming and the MapReduce programming model. The pre-requisites are significant
INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY A PATH FOR HORIZING YOUR INNOVATIVE WORK A REVIEW ON HIGH PERFORMANCE DATA STORAGE ARCHITECTURE OF BIGDATA USING HDFS MS.
The Hadoop Framework Nils Braden University of Applied Sciences Gießen-Friedberg Wiesenstraße 14 35390 Gießen firstname.lastname@example.org Abstract. The Hadoop Framework offers an approach to large-scale
Survey on Scheduling Algorithm in MapReduce Framework Pravin P. Nimbalkar 1, Devendra P.Gadekar 2 1,2 Department of Computer Engineering, JSPM s Imperial College of Engineering and Research, Pune, India
A Brief Outline on Bigdata Hadoop Twinkle Gupta 1, Shruti Dixit 2 RGPV, Department of Computer Science and Engineering, Acropolis Institute of Technology and Research, Indore, India Abstract- Bigdata is
Hadoop IST 734 SS CHUNG Introduction What is Big Data?? Bulk Amount Unstructured Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per day) If a regular machine need to
Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Summary Xiangzhe Li Nowadays, there are more and more data everyday about everything. For instance, here are some of the astonishing
MapReduce and Hadoop Distributed File System 1 V I J A Y R A O The Context: Big-data Man on the moon with 32KB (1969); my laptop had 2GB RAM (2009) Google collects 270PB data in a month (2007), 20000PB
Big Data Analytics with MapReduce VL Implementierung von Datenbanksystemen 05-Feb-13 Astrid Rheinländer Wissensmanagement in der Bioinformatik What is Big Data? collection of data sets so large and complex
Weekly Report Hadoop Introduction submitted By Anurag Sharma Department of Computer Science and Engineering Indian Institute of Technology Bombay Chapter 1 What is Hadoop? Apache Hadoop (High-availability
A bit about Hadoop Luca Pireddu CRS4Distributed Computing Group March 9, 2012 email@example.com (CRS4) Luca Pireddu March 9, 2012 1 / 18 Often seen problems Often seen problems Low parallelism I/O is
Session: Big Data get familiar with Hadoop to use your unstructured data Udo Brede Dell Software 22 nd October 2013 10:00 Sesión B - DB2 LUW 1 Agenda Big Data The Technical Challenges Architecture of Hadoop
Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA http://kzhang6.people.uic.edu/tutorial/amcis2014.html August 7, 2014 Schedule I. Introduction to big data
UC Berkeley Improving MapReduce Performance in Heterogeneous Environments Matei Zaharia, Andy Konwinski, Anthony Joseph, Randy Katz, Ion Stoica University of California at Berkeley Motivation 1. MapReduce
Large scale processing using Hadoop Ján Vaňo What is Hadoop? Software platform that lets one easily write and run applications that process vast amounts of data Includes: MapReduce offline computing engine
Introduction to Hadoop 1 What is Hadoop? the big data revolution extracting value from data cloud computing 2 Understanding MapReduce the word count problem more examples MCS 572 Lecture 24 Introduction
Intro to Map/Reduce a.k.a. Hadoop Based on: Mining of Massive Datasets by Ra jaraman and Ullman, Cambridge University Press, 2011 Data Mining for the masses by North, Global Text Project, 2012 Slides by
CS 2510 COMPUTER OPERATING SYSTEMS Cloud Computing MAPREDUCE Dr. Taieb Znati Computer Science Department University of Pittsburgh MAPREDUCE Programming Model Scaling Data Intensive Application MapReduce
Volume 5, Issue 7, July 2015 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Configuration Tuning
Hadoop Distributed File System T-111.5550 Seminar On Multimedia 2009-11-11 Eero Kurkela Agenda Introduction Flesh and bones of HDFS Architecture Accessing data Data replication strategy Fault tolerance
Accelerating and Simplifying Apache Hadoop with Panasas ActiveStor White paper NOvember 2012 1.888.PANASAS www.panasas.com Executive Overview The technology requirements for big data vary significantly
4 Overview A very short Intro to Hadoop photo by: exfordy, flickr 5 How to Crunch a Petabyte? Lots of disks, spinning all the time Redundancy, since disks die Lots of CPU cores, working all the time Retry,
Introduc)on to the MapReduce Paradigm and Apache Hadoop Sriram Krishnan firstname.lastname@example.org Programming Model The computa)on takes a set of input key/ value pairs, and Produces a set of output key/value pairs.
Lecture Data Warehouse Systems Eva Zangerle SS 2013 PART C: Novel Approaches in DW NoSQL and MapReduce Stonebraker on Data Warehouses Star and snowflake schemas are a good idea in the DW world C-Stores
Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop Kanchan A. Khedikar Department of Computer Science & Engineering Walchand Institute of Technoloy, Solapur, Maharashtra,
Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases Lecture 14 Big Data Management IV: Big-data Infrastructures (Background, IO, From NFS to HFDS) Chapter 14-15: Abideboul
April 17, 2012 Background: Hadoop Distributed File System (HDFS) Hadoop requires a Distributed File System (DFS), we utilize the Hadoop Distributed File System (HDFS). Background: Hadoop Distributed File
Using Hadoop for Webscale Computing Ajay Anand Yahoo! email@example.com Agenda The Problem Solution Approach / Introduction to Hadoop HDFS File System Map Reduce Programming Pig Hadoop implementation
CS2510 Computer Operating Systems HADOOP Distributed File System Dr. Taieb Znati Computer Science Department University of Pittsburgh Outline HDF Design Issues HDFS Application Profile Block Abstraction
CS2510 Computer Operating Systems HADOOP Distributed File System Dr. Taieb Znati Computer Science Department University of Pittsburgh Outline HDF Design Issues HDFS Application Profile Block Abstraction
Hadoop Scalable Distributed Computing Claire Jaja, Julian Chan October 8, 2013 What is Hadoop? A general-purpose storage and data-analysis platform Open source Apache software, implemented in Java Enables
NoSQL and Hadoop Technologies On Oracle Cloud Vatika Sharma 1, Meenu Dave 2 1 M.Tech. Scholar, Department of CSE, Jagan Nath University, Jaipur, India 2 Assistant Professor, Department of CSE, Jagan Nath
Verification and Validation of MapReduce Program model for Parallel K-Means algorithm on Hadoop Cluster Amresh Kumar Department of Computer Science & Engineering, Christ University Faculty of Engineering
Apache Hadoop FileSystem and its Usage in Facebook Dhruba Borthakur Project Lead, Apache Hadoop Distributed File System firstname.lastname@example.org Presented at Indian Institute of Technology November, 2010 http://www.facebook.com/hadoopfs
Introduction to Cloud Computing Cloud Computing I (intro) 15 319, spring 2010 2 nd Lecture, Jan 14 th Majd F. Sakr Lecture Motivation General overview on cloud computing What is cloud computing Services