Intro to Map/Reduce a.k.a. Hadoop
- Paulina Sutton
- 8 years ago
Transcription
1 Intro to Map/Reduce a.k.a. Hadoop. Based on: Mining of Massive Datasets by Rajaraman and Ullman, Cambridge University Press, 2011; Data Mining for the Masses by North, Global Text Project, 2012. Slides by Aung Oo, Suzanne McIntosh.
2 Today: Introduction to the Map/Reduce CONCEPT and an IMPLEMENTATION of it (Apache Hadoop; be aware that there are also others). Project discussion.
3 Massive Data-sets Intro: HDD_1 (10 GB) ... HDD_100 (10 GB) = 1 TB of data.
4 Massive Data-sets Intro: HDD_1 (10 GB) ... HDD_100 (10 GB). 1 TB of data sits on 100 HDDs, but how many computers should there be? When N=1, reading 1 TB requires roughly 2.5 to 3 HOURS. What should N be in order to give us an appreciable speed-up on reads?
5 Massive Data-sets Intro: Server 1 ... Server 100, HDD_1 (10 GB) ... HDD_100 (10 GB). Given: 10 GB per drive = 10,000,000,000 bytes per drive; 100 x 10 GB drives = 1 TB = 1,000,000,000,000 bytes; the read rate is 100 MB/second. The full 1 TB of data can be read in 100 seconds: 10 GB / 100 MB per second = 10,000,000,000 / 100,000,000 = 100 seconds to read one drive. We read all 100 drives in parallel, and the computers can process the data they read in parallel. This is the architecture in which distributed computing frameworks shine, because not only is the data read in parallel, it is processed in parallel as well.
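The arithmetic on this slide can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: it assumes the data is split evenly across N drives, that every drive sustains the slide's 100 MB/second, and that nothing else (network, CPU) is a bottleneck. The function name `read_seconds` is made up for illustration.

```python
# Back-of-the-envelope read times for the figures given on the slide.
TOTAL_BYTES = 1_000_000_000_000   # 1 TB
READ_RATE = 100_000_000           # 100 MB/second per drive (from the slide)


def read_seconds(total_bytes, n_readers, rate=READ_RATE):
    """Seconds to read total_bytes split evenly over n_readers drives read in parallel."""
    return total_bytes / n_readers / rate


for n in (1, 10, 100):
    secs = read_seconds(TOTAL_BYTES, n)
    print(f"N={n:>3}: {secs:>6.0f} s ({secs / 3600:.2f} h)")
```

With N=100 each server reads only its own 10 GB drive, which is where the 100 seconds come from.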
6 So really - how do we do this?
7 Coping with large data: HPC? There are many different HPC solutions: MPI, GPU computing, MapReduce, ... No single one is the best: analyze your problem and choose the best solution for your specific problem, resources, midterm goals, ... M/R frameworks are aimed at processing huge volumes of data (terabytes or petabytes), which fits perfectly in many bioinformatics scenarios.
8 Coping with large data: MapReduce
9 Example Map/Reduce
10 Weather data example. Dataset: You are given a file containing data from weather stations around the world. Say there are 1000 stations and each measures temperature once a second. Over a day that sums up to 86,400,000 data points. Per year we'll have about 3 x 10^10 points, or about 500 GB (this corresponds to roughly 16 bytes per data point; bare 16-bit values would need only about 63 GB). Question: what was the maximum temperature for each year? [Sample data: five fixed-width weather records; their digits were lost in extraction, leaving only the fragments "FM V N CN N".] Input to Mapper: (Key, Value) pairs. The key is the offset of the start of each record in the file (the records are 105 characters long); the value is the record, which contains the Year and Temp fields.
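The maximum-temperature question can be sketched as a tiny in-memory map/shuffle/reduce pipeline. This is not Hadoop code and not the slide author's implementation: to keep it short it assumes a made-up record format of `year,temperature` per line instead of the 105-character fixed-width records, and the sample values are invented for illustration.

```python
from collections import defaultdict


def map_record(offset, line):
    """Map: (offset, record) -> (year, temperature). The offset key is not needed."""
    year, temp = line.split(",")
    yield year, int(temp)


def shuffle(pairs):
    """Shuffle: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_max(year, temps):
    """Reduce: (year, [temperatures]) -> (year, maximum temperature)."""
    return year, max(temps)


def run(lines):
    pairs = []
    offset = 0
    for line in lines:
        pairs.extend(map_record(offset, line))
        offset += len(line) + 1  # +1 for the newline that ends each record
    return dict(reduce_max(year, temps) for year, temps in shuffle(pairs).items())


# Hypothetical sample records in the simplified "year,temperature" format.
sample = ["1949,111", "1949,78", "1950,0", "1950,22", "1950,-11"]
print(run(sample))
```

In a real cluster the map calls would run on many nodes at once and the shuffle would move data over the network; the three functions keep the same roles.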
15 More use cases
16 Hadoop: other examples. Yahoo! runs Hadoop on a very large number of CPUs and computers (the exact figures are missing from this transcript). Its biggest cluster is built from 2 x 4-CPU boxes with 4 x 1 TB disks and 16 GB RAM each (node count missing). It is used to support research for Ad Systems and Web Search, and also for scaling tests that support development of Hadoop on larger clusters. Facebook uses Hadoop to store copies of internal log and dimension data sources, and as a source for reporting/analytics and machine learning. Currently they have 2 major clusters (total storage figure missing): a 1100-machine cluster with 8800 cores and about 12 PB raw storage, and a 300-machine cluster with 2400 cores and about 3 PB raw storage. Each node has 8 cores and 12 TB of storage.
17 Hadoop Example: GATK, a genome analysis toolkit
18 Hadoop Example: Bowtie Crossbow, genome resequencing
19 Hadoop Example: CloudBurst, an NGS read-mapping tool
20 Read Mapping: k-mer counting with Map/Reduce. Application developers focus on 2 (+1 internal) functions. Map: input -> key, value pairs. Shuffle: group together pairs with the same key. Reduce: key, value-lists -> output. Map, Shuffle & Reduce all run in parallel. Worked example with 3-mers:
map: ATGAACCTTA -> ATG,1 TGA,1 GAA,1 AAC,1 ACC,1 CCT,1 CTT,1 TTA,1
map: GAACAACTTA -> GAA,1 AAC,1 ACA,1 CAA,1 AAC,1 ACT,1 CTT,1 TTA,1
map: TTTAGGCAAC -> TTT,1 TTA,1 TAG,1 AGG,1 GGC,1 GCA,1 CAA,1 AAC,1
shuffle (group 1): ACA -> 1; ATG -> 1; CAA -> 1,1; GCA -> 1; TGA -> 1; TTA -> 1,1,1
shuffle (group 2): ACT -> 1; AGG -> 1; CCT -> 1; GGC -> 1; TTT -> 1
shuffle (group 3): AAC -> 1,1,1,1; ACC -> 1; CTT -> 1,1; GAA -> 1,1; TAG -> 1
reduce (group 1): ACA:1 ATG:1 CAA:2 GCA:1 TGA:1 TTA:3
reduce (group 2): ACT:1 AGG:1 CCT:1 GGC:1 TTT:1
reduce (group 3): AAC:4 ACC:1 CTT:2 GAA:2 TAG:1
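The k-mer counting example from this slide can be reproduced with a short single-process sketch. It uses the three reads from the slide with k = 3; the function names are chosen for illustration, and a real Hadoop job would spread the map and reduce calls over many machines rather than run them in one loop.

```python
from collections import defaultdict

K = 3  # k-mer length used on the slide

READS = ["ATGAACCTTA", "GAACAACTTA", "TTTAGGCAAC"]


def map_read(read, k=K):
    """Map: one read -> a (k-mer, 1) pair for every position in the read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k], 1


def shuffle(pairs):
    """Shuffle: group the 1s by k-mer."""
    groups = defaultdict(list)
    for kmer, one in pairs:
        groups[kmer].append(one)
    return groups


def reduce_count(kmer, ones):
    """Reduce: (k-mer, [1, 1, ...]) -> (k-mer, count)."""
    return kmer, sum(ones)


pairs = [pair for read in READS for pair in map_read(read)]
counts = dict(reduce_count(kmer, ones) for kmer, ones in shuffle(pairs).items())

for kmer in sorted(counts):
    print(f"{kmer}:{counts[kmer]}")
```

The printed counts match the reduce output on the slide, for example AAC:4 and TTA:3.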
21 Read Mapping: CloudBurst. 1. Map: Catalog k-mers. Emit k-mers in the genome and reads. 2. Shuffle: Collect seeds. Conceptually build a hash table of k-mers and their occurrences. 3. Reduce: End-to-end alignment. If a read aligns end-to-end with at most k errors, record the alignment. [Figure: map over Human chromosome 1 and the reads, followed by shuffle and reduce; the output lists each read with its chromosome, e.g. "Read 1, Chromosome 1, ..." and "Read 2, Chromosome 1, ...".]
22 Hadoop Example: More NGS on Hadoop
23 Hadoop Other examples
24 Hadoop Overview
25 Hadoop, a MapReduce implementation: The Hadoop MapReduce implementation is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
26 Hadoop, a MapReduce implementation: A MapReduce job usually splits the input dataset into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. The framework takes care of scheduling tasks, monitoring them, and re-executing failed tasks.
27 Hadoop, a MapReduce implementation: Typically the compute nodes and the storage nodes are the same, that is, the MapReduce framework and HDFS run on the same set of nodes. This configuration allows the framework to effectively schedule tasks on the nodes where data is already present, resulting in very high aggregate bandwidth across the cluster.
28 Hadoop, a MapReduce implementation: The MapReduce framework consists of a single master JobTracker and one slave TaskTracker per cluster node. The master is responsible for scheduling the jobs' component tasks on the slaves, monitoring them, and re-executing failed tasks. The slaves execute the tasks as directed by the master.
29 Hadoop A MapReduce implementation
30 HDFS
31 Distributed File Systems Intro: Why a distributed file system? It holds a large amount of data and serves many network clients.
32 Distributed File Systems: What is HDFS? HDFS = Hadoop Distributed File System. It is the storage system part of Hadoop. It is modeled on the Google File System (GFS), which Google developed and described in a 2003 paper; the open-source implementation, first called NDFS in the Nutch project, was later renamed HDFS. It protects against data loss from hardware failure.
33 Distributed File Systems: HDFS stores very large files (100s of MB, GB, TB). It provides high-performance streaming data access. It uses off-the-shelf, non-custom hardware. It continues working without noticeable interruption if a component fails. It stores replicas of blocks to facilitate recovery from hardware errors. File data is accessed in a write-once, read-many model.
34 Distributed File Systems Intro: HDFS is not good for: applications requiring low-latency access to data; lots of small files (lots of small files mean lots of metadata to hold in the NameNode's memory); multiple writers (only a single writer is supported because writing is stream-based); updates at arbitrary offsets within a file.
35 HDFS: So what?
36 Distributed File Systems: HDFS is a block-structured file system. Individual files are broken into blocks of a fixed size. The HDFS block size is 64 MB by default (128 MB in Hadoop 2.x and later). HDFS blocks are large compared to disk blocks (512 bytes) or file system blocks (4 KB). Large blocks give better streaming performance by reducing the latency that many seeks would cause. Blocks are stored across the cluster on one or more machines, the DataNodes.
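How a file breaks down into fixed-size blocks can be illustrated with a small calculation. This is a sketch only, assuming the 64 MB block size from the slide and assuming that the last block holds just the remaining bytes; `split_into_blocks` is a name invented here, not an HDFS API.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the default block size quoted on the slide


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the length of each block of a file; the last block may be shorter."""
    full_blocks, remainder = divmod(file_size, block_size)
    return [block_size] * full_blocks + ([remainder] if remainder else [])


# A hypothetical 200 MB file: three full 64 MB blocks plus one 8 MB block.
blocks = split_into_blocks(200 * 1024 * 1024)
print(len(blocks), [b // (1024 * 1024) for b in blocks])
```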
37 Distributed File Systems Intro: In HDFS, a file can be made of several blocks, and they are not necessarily stored on the same machine. Access to a file may require the cooperation of multiple machines. Advantage: support for files whose sizes exceed what one machine can accommodate. HDFS stores files as a set of large blocks across several machines, and these files are not part of the ordinary file system. Typing ls on a machine running a DataNode daemon will display the contents of the ordinary Linux file system being used to host the Hadoop services; files stored inside HDFS are not shown. HDFS runs in a separate namespace and comes with its own utilities for file management. The blocks that make up the HDFS files are stored in a directory managed by the DataNode service.
38 Distributed File Systems Intro: When the blocks of a file are distributed across the cluster, several machines participate in serving up the file. Without further measures, the loss of any one of those machines would make the file unavailable. The solution is replication of each block across a number of machines (3 machines, by default).
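The effect of replication on availability can be made concrete with a toy model. The placement below is a deliberately simple round-robin scheme assumed for illustration; it is not HDFS's actual rack-aware placement policy, and the node names are hypothetical.

```python
REPLICATION = 3  # default replication factor quoted on the slide


def place_blocks(n_blocks, nodes, replication=REPLICATION):
    """Toy placement: put each block's replicas on consecutive nodes (round-robin)."""
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(n_blocks)
    }


def file_available(placement, failed_nodes):
    """A file is readable only if every block still has a replica on a live node."""
    return all(
        any(node not in failed_nodes for node in replicas)
        for replicas in placement.values()
    )


nodes = ["dn1", "dn2", "dn3", "dn4", "dn5"]  # hypothetical DataNode names
replicated = place_blocks(4, nodes)
unreplicated = place_blocks(4, nodes, replication=1)

print(file_available(unreplicated, {"dn1"}))       # one copy per block: file is lost
print(file_available(replicated, {"dn1"}))         # three copies: survives one failure
print(file_available(replicated, {"dn1", "dn2"}))  # and, in this layout, two failures
```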
39 Distributed File Systems Intro: An HDFS cluster is comprised of two types of nodes: one NameNode (master) and multiple DataNodes (worker nodes, subservient to the NameNode). In HDFS, file data is accessed in a write once, read many (WORM) model. Metadata structures (names of files and directories) can be modified by many clients concurrently. Metadata remains synchronized by using a single machine to manage it: the NameNode.
40 Distributed File Systems: NameNode. The master. It manages the file system namespace, maintains the file system tree, and maintains metadata for all files and directories in the tree. Only a small amount of metadata is stored per file: file names, permissions, and the locations (i.e. DataNodes) of each block of each file. This information can be kept in the main memory of the NameNode for fast access.
41 Distributed File Systems: NameNode resilience. It is important that the NameNode is resilient to failure: without the NameNode, one cannot use the file system. For recovery, metadata is persisted in the local file system and, optionally, to multiple backup file systems. There is also the option to run a secondary NameNode. The role of the secondary is different from that of the primary NameNode: the secondary periodically merges the namespace image with the edit log. The secondary NameNode lags the primary. The secondary NameNode can be promoted to primary for recovery. The NameNode marks bad blocks and creates new good replicas.
42 Distributed File Systems: DataNodes. Worker nodes, subservient to the NameNode of the cluster. They store and retrieve blocks on demand. One large file is split into multiple HDFS blocks, and each HDFS block is stored on a DataNode. DataNodes report to the NameNode periodically with lists of the blocks they are storing. They compute checksums over blocks and report checksum errors to the NameNode.
43 Distributed File Systems: To open a file in the HDFS file system, the client retrieves from the NameNode the list of locations (DataNodes) for the blocks that comprise the file. The client then reads file data directly from the DataNode servers, possibly in parallel. The NameNode is not directly involved in the bulk data transfer, which keeps its overhead to a minimum. If a DataNode fails, data can be retrieved from one of the replicas and the cluster continues to operate. If the NameNode fails, the cluster is inaccessible until it is manually restored. Multiple redundant systems allow the file system's metadata to be protected in the event of NameNode failure. NameNode failure is more severe for the cluster than DataNode failure.
44 Distributed File Systems: How is distance between nodes measured? Hadoop uses the tree structure of the nodes in the network to arrive at the distance between two nodes. The distance is the sum of the distances from each node to their nearest common ancestor. A node is identified by its path in the tree: /datacenter 1/rack 1/node 1, or /d1/r1/n1 for short. When two processes are running on the same node, the distance is: distance(/d1/r1/n1, /d1/r1/n1) = 0. When two processes are running on different nodes in the same rack, the distance is: distance(/d1/r1/n1, /d1/r1/n2) = 2. When two processes are running on nodes in different racks, the distance is: distance(/d1/r1/n1, /d1/r2/n3) = 4. When two processes are running on nodes in different datacenters*, the distance is: distance(/d1/r1/n1, /d2/r3/n4) = 6. *Note: Hadoop does not yet support this model.
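The distance rule can be written down directly from the description: count the steps from each node up to the nearest common ancestor and add them. The following is a minimal sketch using the slide's path notation; it is not Hadoop's own implementation, and the function name `distance` simply mirrors the slide.

```python
def distance(path_a, path_b):
    """Tree distance: steps from each node up to their nearest common ancestor."""
    parts_a = path_a.strip("/").split("/")
    parts_b = path_b.strip("/").split("/")

    # Length of the shared prefix, i.e. the depth of the nearest common ancestor.
    common = 0
    for a, b in zip(parts_a, parts_b):
        if a != b:
            break
        common += 1

    return (len(parts_a) - common) + (len(parts_b) - common)


print(distance("/d1/r1/n1", "/d1/r1/n1"))  # same node
print(distance("/d1/r1/n1", "/d1/r1/n2"))  # same rack
print(distance("/d1/r1/n1", "/d1/r2/n3"))  # same datacenter, different rack
print(distance("/d1/r1/n1", "/d2/r3/n4"))  # different datacenter
```

The four calls reproduce the values 0, 2, 4 and 6 given on the slide.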
45 Distributed File Systems: Coherency model. Be aware that writes may not be visible, even after a flush. The current block being written will not be visible to readers. Once more than one block's worth of data is written, the first block will become visible to readers. HDFS provides a sync() method (superseded by hflush() and hsync() in newer Hadoop releases) to force all buffers to be synchronized to the DataNodes. If your application does not call sync() and a failure occurs, all data of the block currently being written will be unrecoverable. It is advisable to call sync() at appropriate points in your application, remembering that a call to sync() does incur some overhead.
46 NFS vs. HDFS
47 Distributed File Systems: Another distributed file system: NFS. It is one of the oldest. The NFS server makes a local file system visible to the network. Once mounted, the fact that the files are remote is transparent to the client. The files all reside on one machine, so there is a limit to how much data can be stored. It is not extensible. There is no reliability guarantee. All clients contend for service from the NFS server. Clients must copy the data locally to process it.
48 Distributed File Systems: NFS vs. HDFS
Criterion                                 NFS          HDFS
Mature technology?                        Yes (1984)   Yes (2004)
Serves multiple clients?                  Yes          Yes
Number of machines                        1            Many
Size of file system                       Fixed        Extensible, scalable
Reliability guaranteed?                   No           Yes
Clients contend for service?              Yes          Yes, but clients are distributed across n servers
Clients copy data before processing it?   Yes          No
Supports very large file sizes            No           Yes
49 Distributed File Systems: Disadvantages of HDFS. It is not as general-purpose as NFS. HDFS is not suitable for applications that perform random seeks to read from arbitrary locations within a file. HDFS is not suitable for applications that perform random seeks to write to arbitrary locations within a file. HDFS does not support multiple writers to a file.
50 Summary
51 Distributed File Systems Intro: When running in standalone mode, the local file system is used, not HDFS. When running in distributed mode, for example with HDFS as our distributed file system, Hadoop uses data locality optimization when scheduling jobs. Hadoop tries to run the map task on a node where the input data resides in HDFS. This minimizes the amount of copying over the network, because network bandwidth is precious. If all three nodes hosting the HDFS block replicas for a given split are already running map tasks, the job scheduler will try to schedule the work on another node in a rack that already contains a replica. Although the replica must be copied, the copy is intra-rack, so it costs less than an inter-rack copy.
52 Distributed File Systems Intro: Mappers write their output to local disks because it is intermediate data, so replicas (via HDFS) are unnecessary. Reduce tasks do not have the advantage of data locality: their input is the output of many Mappers. The sorted Mapper outputs get transferred over the network to the Reducer nodes. The output of the Reduce task(s) is written to HDFS for reliability (replicas). There can be multiple Reduce tasks.
53 Summary: MAP/REDUCE is a concept for working with large data-sets. MAP/REDUCE is implemented in several software packages, e.g. Apache Hadoop, Apache MapReduce. The needed ecosystem consists of other elements, such as a file system (HDFS) and several control services (e.g. JobTracker and TaskTracker). There exist several tools for easier usage, e.g. Apache Mahout, Apache Pig, Apache DataFu. Alternative approaches include: Apache Spark, Apache Flink, Apache Hama, Facebook Corona, Twitter Storm, etc.
More informationHadoop Distributed File System. T-111.5550 Seminar On Multimedia 2009-11-11 Eero Kurkela
Hadoop Distributed File System T-111.5550 Seminar On Multimedia 2009-11-11 Eero Kurkela Agenda Introduction Flesh and bones of HDFS Architecture Accessing data Data replication strategy Fault tolerance
More informationKeywords: Big Data, HDFS, Map Reduce, Hadoop
Volume 5, Issue 7, July 2015 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Configuration Tuning
More informationand HDFS for Big Data Applications Serge Blazhievsky Nice Systems
Introduction PRESENTATION to Hadoop, TITLE GOES MapReduce HERE and HDFS for Big Data Applications Serge Blazhievsky Nice Systems SNIA Legal Notice The material contained in this tutorial is copyrighted
More informationHadoop: Embracing future hardware
Hadoop: Embracing future hardware Suresh Srinivas @suresh_m_s Page 1 About Me Architect & Founder at Hortonworks Long time Apache Hadoop committer and PMC member Designed and developed many key Hadoop
More information!"#$%&' ( )%#*'+,'-#.//"0( !"#$"%&'()*$+()',!-+.'/', 4(5,67,!-+!"89,:*$;'0+$.<.,&0$'09,&)"/=+,!()<>'0, 3, Processing LARGE data sets
!"#$%&' ( Processing LARGE data sets )%#*'+,'-#.//"0( Framework for o! reliable o! scalable o! distributed computation of large data sets 4(5,67,!-+!"89,:*$;'0+$.
More informationBBM467 Data Intensive ApplicaAons
Hace7epe Üniversitesi Bilgisayar Mühendisliği Bölümü BBM467 Data Intensive ApplicaAons Dr. Fuat Akal akal@hace7epe.edu.tr Problem How do you scale up applicaaons? Run jobs processing 100 s of terabytes
More informationBig Data and Apache Hadoop s MapReduce
Big Data and Apache Hadoop s MapReduce Michael Hahsler Computer Science and Engineering Southern Methodist University January 23, 2012 Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 1 / 23
More informationGoogle File System. Web and scalability
Google File System Web and scalability The web: - How big is the Web right now? No one knows. - Number of pages that are crawled: o 100,000 pages in 1994 o 8 million pages in 2005 - Crawlable pages might
More informationHadoop@LaTech ATLAS Tier 3
Cerberus Hadoop Hadoop@LaTech ATLAS Tier 3 David Palma DOSAR Louisiana Tech University January 23, 2013 Cerberus Hadoop Outline 1 Introduction Cerberus Hadoop 2 Features Issues Conclusions 3 Cerberus Hadoop
More informationOpen source large scale distributed data management with Google s MapReduce and Bigtable
Open source large scale distributed data management with Google s MapReduce and Bigtable Ioannis Konstantinou Email: ikons@cslab.ece.ntua.gr Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory
More informationGoogle Bing Daytona Microsoft Research
Google Bing Daytona Microsoft Research Raise your hand Great, you can help answer questions ;-) Sit with these people during lunch... An increased number and variety of data sources that generate large
More information<Insert Picture Here> Big Data
Big Data Kevin Kalmbach Principal Sales Consultant, Public Sector Engineered Systems Program Agenda What is Big Data and why it is important? What is your Big
More informationMASSIVE DATA PROCESSING (THE GOOGLE WAY ) 27/04/2015. Fundamentals of Distributed Systems. Inside Google circa 2015
7/04/05 Fundamentals of Distributed Systems CC5- PROCESAMIENTO MASIVO DE DATOS OTOÑO 05 Lecture 4: DFS & MapReduce I Aidan Hogan aidhog@gmail.com Inside Google circa 997/98 MASSIVE DATA PROCESSING (THE
More informationCertified Big Data and Apache Hadoop Developer VS-1221
Certified Big Data and Apache Hadoop Developer VS-1221 Certified Big Data and Apache Hadoop Developer Certification Code VS-1221 Vskills certification for Big Data and Apache Hadoop Developer Certification
More informationMapReduce and Hadoop Distributed File System
MapReduce and Hadoop Distributed File System 1 B. RAMAMURTHY Contact: Dr. Bina Ramamurthy CSE Department University at Buffalo (SUNY) bina@buffalo.edu http://www.cse.buffalo.edu/faculty/bina Partially
More informationHow To Use Hadoop
Hadoop in Action Justin Quan March 15, 2011 Poll What s to come Overview of Hadoop for the uninitiated How does Hadoop work? How do I use Hadoop? How do I get started? Final Thoughts Key Take Aways Hadoop
More informationCDH AND BUSINESS CONTINUITY:
WHITE PAPER CDH AND BUSINESS CONTINUITY: An overview of the availability, data protection and disaster recovery features in Hadoop Abstract Using the sophisticated built-in capabilities of CDH for tunable
More informationThe Recovery System for Hadoop Cluster
The Recovery System for Hadoop Cluster Prof. Priya Deshpande Dept. of Information Technology MIT College of engineering Pune, India priyardeshpande@gmail.com Darshan Bora Dept. of Information Technology
More informationInternational Journal of Advance Research in Computer Science and Management Studies
Volume 2, Issue 8, August 2014 ISSN: 2321 7782 (Online) International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online
More informationHadoop and its Usage at Facebook. Dhruba Borthakur dhruba@apache.org, June 22 rd, 2009
Hadoop and its Usage at Facebook Dhruba Borthakur dhruba@apache.org, June 22 rd, 2009 Who Am I? Hadoop Developer Core contributor since Hadoop s infancy Focussed on Hadoop Distributed File System Facebook
More informationScalable Cloud Computing Solutions for Next Generation Sequencing Data
Scalable Cloud Computing Solutions for Next Generation Sequencing Data Matti Niemenmaa 1, Aleksi Kallio 2, André Schumacher 1, Petri Klemelä 2, Eija Korpelainen 2, and Keijo Heljanko 1 1 Department of
More informationMap Reduce / Hadoop / HDFS
Chapter 3: Map Reduce / Hadoop / HDFS 97 Overview Outline Distributed File Systems (re-visited) Motivation Programming Model Example Applications Big Data in Apache Hadoop HDFS in Hadoop YARN 98 Overview
More informationMapReduce Job Processing
April 17, 2012 Background: Hadoop Distributed File System (HDFS) Hadoop requires a Distributed File System (DFS), we utilize the Hadoop Distributed File System (HDFS). Background: Hadoop Distributed File
More informationCloud Computing at Google. Architecture
Cloud Computing at Google Google File System Web Systems and Algorithms Google Chris Brooks Department of Computer Science University of San Francisco Google has developed a layered system to handle webscale
More informationhttp://www.wordle.net/
Hadoop & MapReduce http://www.wordle.net/ http://www.wordle.net/ Hadoop is an open-source software framework (or platform) for Reliable + Scalable + Distributed Storage/Computational unit Failures completely
More informationHadoop and Map-Reduce. Swati Gore
Hadoop and Map-Reduce Swati Gore Contents Why Hadoop? Hadoop Overview Hadoop Architecture Working Description Fault Tolerance Limitations Why Map-Reduce not MPI Distributed sort Why Hadoop? Existing Data
More informationBookKeeper. Flavio Junqueira Yahoo! Research, Barcelona. Hadoop in China 2011
BookKeeper Flavio Junqueira Yahoo! Research, Barcelona Hadoop in China 2011 What s BookKeeper? Shared storage for writing fast sequences of byte arrays Data is replicated Writes are striped Many processes
More informationGetting to know Apache Hadoop
Getting to know Apache Hadoop Oana Denisa Balalau Télécom ParisTech October 13, 2015 1 / 32 Table of Contents 1 Apache Hadoop 2 The Hadoop Distributed File System(HDFS) 3 Application management in the
More informationBig Data Storage Options for Hadoop Sam Fineberg, HP Storage
Sam Fineberg, HP Storage SNIA Legal Notice The material contained in this tutorial is copyrighted by the SNIA unless otherwise noted. Member companies and individual members may use this material in presentations
More informationHadoop Big Data for Processing Data and Performing Workload
Hadoop Big Data for Processing Data and Performing Workload Girish T B 1, Shadik Mohammed Ghouse 2, Dr. B. R. Prasad Babu 3 1 M Tech Student, 2 Assosiate professor, 3 Professor & Head (PG), of Computer
More informationBig Data and Hadoop. Sreedhar C, Dr. D. Kavitha, K. Asha Rani
Big Data and Hadoop Sreedhar C, Dr. D. Kavitha, K. Asha Rani Abstract Big data has become a buzzword in the recent years. Big data is used to describe a massive volume of both structured and unstructured
More informationHADOOP MOCK TEST HADOOP MOCK TEST II
http://www.tutorialspoint.com HADOOP MOCK TEST Copyright tutorialspoint.com This section presents you various set of Mock Tests related to Hadoop Framework. You can download these sample mock tests at
More informationApache Hadoop: Past, Present, and Future
The 4 th China Cloud Computing Conference May 25 th, 2012. Apache Hadoop: Past, Present, and Future Dr. Amr Awadallah Founder, Chief Technical Officer aaa@cloudera.com, twitter: @awadallah Hadoop Past
More informationR.K.Uskenbayeva 1, А.А. Kuandykov 2, Zh.B.Kalpeyeva 3, D.K.Kozhamzharova 4, N.K.Mukhazhanov 5
Distributed data processing in heterogeneous cloud environments R.K.Uskenbayeva 1, А.А. Kuandykov 2, Zh.B.Kalpeyeva 3, D.K.Kozhamzharova 4, N.K.Mukhazhanov 5 1 uskenbaevar@gmail.com, 2 abu.kuandykov@gmail.com,
More information