A BigData Tour HDFS, Ceph and MapReduce

1 A BigData Tour: HDFS, Ceph and MapReduce

These slides are possible thanks to these sources: Jonathan Dursi, SciNet Toronto, Hadoop Tutorial; Amir Payberah, Course in Data Intensive Computing, SICS; Yahoo! Developer Network MapReduce Tutorial.

Data Management and Processing
Data intensive computing:
- Concerns the production, manipulation and analysis of data in the range of hundreds of megabytes (MB) to petabytes (PB) and beyond.
- Relies on a range of supporting parallel and distributed computing technologies to deal with the challenges of data representation, reliable shared storage, efficient algorithms and scalable infrastructure to perform analysis.

2 Challenges Ahead
Challenges with data intensive computing:
- Scalable algorithms that can search and process massive datasets
- New metadata management technologies that can scale to handle complex, heterogeneous and distributed data sources
- Support for accessing in-memory multi-terabyte data structures
- High-performance, highly reliable petascale distributed file systems
- Techniques for data reduction and rapid processing
- Software mobility to move computation to where the data is located
- Hybrid interconnects with support for multi-gigabyte data streams
- Flexible and high-performance software integration techniques

Hadoop
- A family of related projects, best known for MapReduce and the Hadoop Distributed File System (HDFS).
[Figure: Hadoop project family, including Mahout (data mining)]

Data Intensive Computing
- Data volumes are increasing massively.
- Clusters and storage capacity are increasing massively.
- Disk speeds are not keeping pace.
- Seek speeds are even worse than read/write speeds.
[Figure: disk speed (MB/s) compared with CPU speed (MIPS); annotation: 1000x]

3 Scale-Out
- Disk streaming speed is ~50 MB/s.
- 3 TB = 17.5 hrs.
- 1 PB = 8 months.
- Scale-out (weak scaling): the filesystem distributes data on ingest.

Scale-Out
- Seeking is too slow: ~10 ms for a seek, which is enough time to read half a megabyte.
- Batch processing: go through the entire data set in one (or a small number of) passes.
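The figures on this slide can be checked with a short calculation. The sketch below assumes binary units (1 TB = 2^40 bytes), which is what reproduces the 17.5-hour figure; the function names are illustrative only.

```python
# Back-of-the-envelope check of the streaming figures on the slide.
# Assumption: binary units (1 MB = 2**20 bytes, 1 TB = 2**40, 1 PB = 2**50).
MB, TB, PB = 2**20, 2**40, 2**50
STREAM_RATE = 50 * MB  # ~50 MB/s disk streaming speed, from the slide


def stream_seconds(size_bytes, rate=STREAM_RATE):
    """Time to read size_bytes sequentially from a single disk."""
    return size_bytes / rate


def bytes_during_seek(seek_s=0.010, rate=STREAM_RATE):
    """Data that could have been streamed in the time one seek takes."""
    return seek_s * rate


hours_3tb = stream_seconds(3 * TB) / 3600
days_1pb = stream_seconds(1 * PB) / 86400

print(f"3 TB at 50 MB/s: {hours_3tb:.1f} hours")
print(f"1 PB at 50 MB/s: {days_1pb:.0f} days (~{days_1pb / 30:.0f} months)")
print(f"One 10 ms seek costs {bytes_during_seek() / MB:.1f} MB of streaming")
```

Spreading the same data over N disks that stream in parallel divides these times by N, which is the point of scale-out.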

4 Combining results
- Each node preprocesses its local data.
- Each node shuffles its data to a small number of other nodes.
- Final processing and output are done there.

Fault Tolerance
- Data is also replicated upon ingest.
- The runtime watches for dead tasks and restarts them on live nodes.
- Lost data is re-replicated.

5 Why Hadoop
Drivers:
- 500M+ unique users per month
- Billions of interesting events per day
- Data analysis is key
Need massive scalability:
- PBs of storage, millions of files, 1000s of nodes
Need to do this cost effectively:
- Use commodity hardware
- Share resources among multiple projects
- Provide scale when needed
Need reliable infrastructure:
- Must be able to deal with failures: hardware, software, networking
- Failure is expected rather than exceptional
- Transparent to applications: it is very expensive to build reliability into each application
The Hadoop infrastructure provides these capabilities.

Introduction to Hadoop
Apache Hadoop:
- Based on the 2004 Google MapReduce paper
- Originally composed of HDFS (distributed F/S), a core runtime and an implementation of MapReduce
Open source:
- Apache Foundation project
- Yahoo! is an Apache Platinum Sponsor
History:
- Started in 2005 by Doug Cutting
- Yahoo! became the primary contributor in 2006
- Yahoo! scaled it from 20-node clusters to 4000-node clusters today
Portable:
- Written in Java
- Runs on commodity hardware
- Linux, Mac OS X, Windows, and Solaris

6 HPC vs Hadoop
- HPC attitude: "The problem of disk-limited, loosely-coupled data analysis was solved by throwing more disks and using weak scaling."
- Flip side: a single novice developer can write real, scalable, node data-processing tasks in Hadoop-family tools in an afternoon.
- MPI... less so.

8 Data Distribution: Disk
- Hadoop and similar architectures handle the hardest part of parallelism for you: data distribution.
- On disk: HDFS distributes and replicates data as it comes in.
- HDFS keeps track of where the data is, so computations run local to the data.

Data Distribution: Network
- On the network: MapReduce (for example) works in terms of key-value pairs.
- The preprocessing (map) phase ingests data and emits (k,v) pairs.
- The shuffle phase assigns reducers and gets all pairs with the same key onto the same reducer.
- The programmer does not have to design communication patterns.

Map output: (key1,17) (key5,23) (key1,99) (key2,12) (key5,83) (key2,9)
After shuffle: (key1,[17,99]) (key5,[23,83]) (key2,[12,9])
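The shuffle step can be sketched in a few lines of single-process Python. This is only an illustration of the grouping idea, not Hadoop's implementation: the function names are made up here, and the input pairs are chosen to produce the grouped output shown on the slide.

```python
from collections import defaultdict


def shuffle(pairs):
    """Group values by key, as the shuffle phase does across reducers."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_phase(groups, reducer):
    """Apply a reducer function to each key and its list of values."""
    return {key: reducer(key, values) for key, values in groups.items()}


# (k, v) pairs as emitted by three hypothetical map tasks.
map_output = [
    ("key1", 17), ("key5", 23),
    ("key1", 99), ("key2", 12),
    ("key5", 83), ("key2", 9),
]

grouped = shuffle(map_output)
totals = reduce_phase(grouped, lambda key, values: sum(values))
print(grouped)
print(totals)
```

In a real cluster the groups live on different reducer nodes; the programmer still only writes the map and reduce functions.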

9 Built a reusable substrate
- The filesystem (HDFS) and the MapReduce layer were very well architected.
- This enables many higher-level tools: data analysis, machine learning, NoSQL DBs, ...
- Extremely productive environment.
- And Hadoop 2.x (YARN) is now much, much more than just MapReduce.

10 Hadoop vs HPC
- Not either-or anyway.
- Use HPC to generate big / many simulations, and Hadoop to analyze the results.
- Use Hadoop to preprocess huge input data sets (ETL), and HPC to do the tightly coupled computation afterwards.
- Besides, ...

Everything is converging 1/2

11 Everything is converging 2/2

Big Data Analytics Stack
[Figure: Big Data Analytics Stack]
Amir H. Payberah (SICS), Introduction, April 8

12 Big Data Storage (sans POSIX)
Big Data - Storage (Filesystem)
- Traditional filesystems are not well-designed for large-scale data processing systems.
- Efficiency has a higher priority than other features, e.g., directory service.
- The massive size of the data means it tends to be stored across multiple machines in a distributed way.
- HDFS, Amazon S3, ...

Big Data - Databases
- Relational Database Management Systems (RDBMS) were not designed to be distributed.
- NoSQL databases relax one or more of the ACID properties: BASE.
- Different data models: key/value, column-family, graph, document.
- Dynamo, Scalaris, BigTable, HBase, Cassandra, MongoDB, Voldemort, Riak, Neo4J, ...

13 Big Data - Resource Management
- Different frameworks require different computing resources.
- Large organizations need the ability to share data and resources between multiple frameworks.
- Resource management shares the resources in a cluster between multiple frameworks while providing resource isolation.
- Mesos, YARN, Quincy, ...

YARN 1/3
- To address Hadoop v1 deficiencies with scalability, memory usage and synchronization, the Yet Another Resource Negotiator (YARN) Apache sub-project was started.
- Previously a single JobTracker service handled both resource management and job scheduling for the cluster, with a TaskTracker on each node.
- Its roles were then split into separate daemons for:
  - Resource management
  - Job scheduling/monitoring
(Source: Hortonworks)

14 YARN 2/3
YARN splits the JobTracker's responsibilities into:
- Resource management: the global ResourceManager daemon
- A per-application ApplicationMaster
The ResourceManager and the per-node slave NodeManagers allow generic node management.
The ResourceManager has a pluggable scheduler.
(Source: Hortonworks)

YARN 3/3
- The Scheduler performs its scheduling function based on the resource requirements of the applications; it does so based on the abstract notion of a Resource Container, which incorporates resource elements such as memory, CPU, disk and network.
- The NodeManager is the per-machine slave, which is responsible for launching the applications' containers, monitoring their resource usage (CPU, memory, disk, network) and reporting it to the ResourceManager.
- The per-application ApplicationMaster has the responsibility of negotiating appropriate resource containers from the Scheduler, tracking their status and monitoring progress. From the system perspective, the ApplicationMaster itself runs as a normal container.
(Source: Hortonworks)

15 Big Data - Execution Engine
- Scalable and fault-tolerant parallel data processing on clusters of unreliable machines.
- Data-parallel programming model for clusters of commodity machines.
- MapReduce, Spark, Stratosphere, Dryad, Hyracks, ...

Big Data - Query/Scripting Languages
- Low-level programming of execution engines, e.g., MapReduce, is not easy for end users.
- A high-level language is needed to improve the query capabilities of execution engines.
- It translates user-defined functions to the low-level API of the execution engines.
- Pig, Hive, Shark, Meteor, DryadLINQ, SCOPE, ...

16 Big Data - Stream Processing
- Providing users with fresh and low-latency results.
- Database Management Systems (DBMS) vs. Stream Processing Systems (SPS).
- Storm, S4, SEEP, D-Stream, Naiad, ...

Big Data - Graph Processing
- Many problems are expressed using graphs: sparse computational dependencies, and multiple iterations to converge.
- Data-parallel frameworks, such as MapReduce, are not ideal for these problems: they are slow.
- Graph processing frameworks are optimized for graph-based problems.
- Pregel, Giraph, GraphX, GraphLab, PowerGraph, GraphChi, ...

17 Big Data - Machine Learning
- Implementing and consuming machine learning techniques at scale are difficult tasks for developers and end users.
- There exist platforms that address this by providing scalable machine-learning and data mining libraries.
- Mahout, MLBase, SystemML, Ricardo, Presto, ...

Hadoop Big Data Analytics Stack
[Figure: Hadoop Big Data Analytics Stack]

18 Spark Big Data Analytics Stack
[Figure: Spark Big Data Analytics Stack]

Hadoop Ecosystem
[Figure: Hadoop ecosystem (source: Hortonworks)]

19 Hadoop Ecosystem
- From 2008 onwards, usage exploded.
- Many tools were created on top of the Hadoop infrastructure.

The Need For Filesystems
What is a Filesystem?
- Controls how data is stored in and retrieved from disk.
Amir H. Payberah (SICS), Distributed Filesystems, April 8

20 Distributed Filesystems
- When data outgrows the storage capacity of a single machine: partition it across a number of separate machines.
- Distributed filesystems manage the storage across a network of machines.

21 Hadoop Distributed File System (HDFS)
- A distributed file system designed to run on commodity hardware.
- HDFS was originally built as infrastructure for the Apache Nutch web search engine project, with the aim of achieving fault tolerance, the ability to run on low-cost hardware and the ability to handle large datasets.
- It is now an Apache Hadoop subproject.
- It shares similarities with existing distributed file systems and supports traditional hierarchical file organization.
- Data replication is reliable, and data is accessible via a Web interface and Shell commands.
- Benefits: fault tolerance, high throughput, streaming data access, robustness and handling of large data sets.
- HDFS is not a general-purpose F/S.

Assumptions and Goals
- Hardware failures: detection of faults, quick and automatic recovery.
- Streaming data access: designed for batch processing rather than interactive use by users.
- Large data sets: applications that run on HDFS have large data sets, typically gigabytes to terabytes in size. Optimized for batch reads rather than random reads.
- Simple coherency model: applications need a write-once, read-many-times access model for files.
- Computation migration: computation is moved closer to where the data is located.
- Portability: easily portable between heterogeneous hardware and software platforms.

22 What HDFS is Not Good for
- Low-latency reads: HDFS favors high throughput rather than low latency for small chunks of data. HBase addresses this issue.
- Large numbers of small files: HDFS is better for millions of large files than for billions of small files.
- Multiple writers: there is a single writer per file. Writes go only at the end of the file; there is no support for writing at an arbitrary offset.

HDFS Architecture
The Hadoop Distributed File System (HDFS):
- Offers a way to store large files across multiple machines, rather than requiring a single machine to have disk capacity equal to or greater than the summed total size of the files.
- Is designed to be fault-tolerant, using data replication and distribution of data.
- When a file is loaded into HDFS, it is broken up into "blocks" of data, and the blocks are replicated.
- These blocks are stored across the cluster nodes designated for storage, a.k.a. DataNodes.

23 Files and Blocks 1/3
- Files are split into blocks.
- Blocks:
  - Single unit of storage: a contiguous piece of information on a disk.
  - Transparent to the user.
  - Managed by the Namenode, stored by the Datanodes.
  - Blocks are traditionally either 64 MB or 128 MB: the default is 64 MB.

Files and Blocks 2/3
- Why is a block in HDFS so large? To minimize the cost of seeks.
- Time to read a block = seek time + transfer time.
- Keeping the ratio (seek time / transfer time) small means we are reading data from the disk almost as fast as the physical limit imposed by the disk.
- Example: if the seek time is 10 ms and the transfer rate is 100 MB/s, then to make the seek time 1% of the transfer time we need a block size of around 100 MB.
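The block-size example can be reproduced with a small calculation. This is a sketch of the arithmetic only; the helper names are illustrative and not part of any HDFS API.

```python
import math


def block_size_mb(seek_ms, transfer_mb_per_s, seek_fraction):
    """Block size at which seek time is seek_fraction of the transfer time."""
    transfer_time_s = (seek_ms / 1000) / seek_fraction
    return transfer_time_s * transfer_mb_per_s


def num_blocks(file_mb, block_mb):
    """Number of blocks a file occupies; the last block may be partly filled."""
    return math.ceil(file_mb / block_mb)


# Slide example: 10 ms seek, 100 MB/s transfer, seek time = 1% of transfer time.
print(block_size_mb(seek_ms=10, transfer_mb_per_s=100, seek_fraction=0.01))

# A 1 GB (1024 MB) file with 64 MB blocks versus 128 MB blocks.
print(num_blocks(1024, 64), num_blocks(1024, 128))
```

A faster disk with the same seek time needs proportionally larger blocks to keep the same ratio, which is one reason the commonly used block sizes have grown over time.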

24 Files and Blocks 3/3
- The same block is replicated on multiple machines: the default is 3.
  - Replica placements are rack aware.
  - 1st replica on the local rack.
  - 2nd replica on the local rack but on a different machine.
  - 3rd replica on a different rack.
- The Namenode determines replica placement.

HDFS Daemons
An HDFS cluster is managed by three types of processes:
- Namenode
  - Manages the filesystem, e.g., namespace, meta-data, and file blocks.
  - Metadata is stored in memory.
- Datanode
  - Stores and retrieves data blocks.
  - Reports to the Namenode.
  - Runs on many machines.
- Secondary Namenode
  - Only for checkpointing.
  - Not a backup for the Namenode.

25 Hadoop Server Roles
[Figure: Hadoop server roles]

NameNode 1/3
- The HDFS namespace is a hierarchy of files and directories.
- These are represented in the NameNode using inodes.
- Inodes record attributes such as permissions, modification and access times, and namespace and disk space quotas.
- The file content is split into large blocks (typically 128 megabytes, but user selectable file-by-file), and each block of the file is independently replicated at multiple DataNodes (typically three, but user selectable file-by-file).
- The NameNode maintains the namespace tree and the mapping of blocks to DataNodes.
- A Hadoop cluster can have thousands of DataNodes and tens of thousands of HDFS clients, as each DataNode may execute multiple application tasks concurrently.

26 NameNode 2/3
- The inodes and the list of blocks that define the metadata of the name system are called the image (FsImage in the figure).
- The NameNode keeps the entire namespace image in RAM.
- Each client-initiated transaction is recorded in the journal, and the journal file is flushed and synced before the acknowledgment is sent to the client.
- The NameNode is a multithreaded system and processes requests from multiple clients simultaneously.

NameNode 3/3
HDFS requires:
- A NameNode process to run on one node in the cluster.
- A DataNode service to run on each "slave" node that will be processing data.
When data is loaded into HDFS:
- The data is split into blocks, and the blocks are replicated and distributed across the DataNodes.
- The NameNode is responsible for storage and management of metadata, so that when MapReduce or another execution framework calls for the data, the NameNode informs it where the needed data resides.

27 Where to Replicate?
- There is a tradeoff in choosing replication locations:
  - Close: faster updates, less network bandwidth.
  - Further: better failure tolerance.
- Default strategy: first copy on a different location on the same node, second on a different rack (switch), third on the same rack location but a different node.
- The strategy is configurable.
- The Hadoop file system needs to be configured to know the location of the nodes.
[Figure: rack1 and rack2 connected through switch 1 and switch 2]

DataNode 1/3
- Each block replica on a DataNode is represented by two files in the local native filesystem. The first file contains the data itself, and the second file records the block's metadata, including checksums for the data and the generation stamp.
- At startup each DataNode connects to a NameNode and performs a handshake. The handshake verifies that the DataNode belongs to the NameNode's namespace and runs the same version of the software.
- A DataNode identifies the block replicas in its possession to the NameNode by sending a block report. A block report contains the block ID, the generation stamp and the length for each block replica the server hosts.
- The first block report is sent immediately after the DataNode registration.
- Subsequent block reports are sent every hour and provide the NameNode with an up-to-date view of where block replicas are located on the cluster.

28 DataNode 2/3
- During normal operation DataNodes send heartbeats to the NameNode to confirm that the DataNode is operating and the block replicas it hosts are available.
- If the NameNode does not receive a heartbeat from a DataNode in ten minutes, it considers the DataNode to be out of service and the block replicas hosted by that DataNode to be unavailable.
- The NameNode then schedules creation of new replicas of those blocks on other DataNodes.
- Heartbeats from a DataNode also carry information about total storage capacity, fraction of storage in use, and the number of data transfers currently in progress. These statistics are used for the NameNode's block allocation and load balancing decisions.

DataNode 3/3
- The NameNode does not directly send requests to DataNodes. It uses replies to heartbeats to send instructions to the DataNodes.
- The instructions include commands to replicate blocks to other nodes, remove local block replicas, re-register and send an immediate block report, and shut down the node.
- These commands are important for maintaining the overall system integrity, and therefore it is critical to keep heartbeats frequent even on big clusters.
- The NameNode can process thousands of heartbeats per second without affecting other NameNode operations.
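The heartbeat rule above can be illustrated with a toy model. The class and method names below are hypothetical, times are passed in explicitly as seconds, and the target of three replicas is taken from the default replication factor; this is a sketch of the bookkeeping, not the NameNode's actual code.

```python
HEARTBEAT_TIMEOUT_S = 10 * 60  # ten minutes without a heartbeat, from the slide


class LivenessTracker:
    """Toy NameNode-side view of DataNode liveness and block replicas."""

    def __init__(self, timeout_s=HEARTBEAT_TIMEOUT_S):
        self.timeout_s = timeout_s
        self.last_heartbeat = {}   # datanode -> time of last heartbeat (s)
        self.block_locations = {}  # block id -> set of datanodes holding it

    def heartbeat(self, datanode, now_s):
        self.last_heartbeat[datanode] = now_s

    def add_replica(self, block_id, datanode):
        self.block_locations.setdefault(block_id, set()).add(datanode)

    def dead_datanodes(self, now_s):
        """DataNodes whose last heartbeat is older than the timeout."""
        return {dn for dn, seen in self.last_heartbeat.items()
                if now_s - seen > self.timeout_s}

    def under_replicated(self, now_s, target=3):
        """Map of block id -> number of new replicas to schedule."""
        dead = self.dead_datanodes(now_s)
        missing = {}
        for block_id, nodes in self.block_locations.items():
            live = len(nodes - dead)
            if live < target:
                missing[block_id] = target - live
        return missing


tracker = LivenessTracker()
for dn in ("dn1", "dn2", "dn3"):
    tracker.heartbeat(dn, now_s=0)
    tracker.add_replica("blk_1", dn)
tracker.heartbeat("dn1", now_s=500)
tracker.heartbeat("dn2", now_s=500)  # dn3 stays silent

print(tracker.dead_datanodes(now_s=700))
print(tracker.under_replicated(now_s=700))
```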

29 HDFS Client 1/3
- User applications access the filesystem using the HDFS client, a library that exports the HDFS filesystem interface.
- The user is oblivious to backend implementation details, e.g., the number of replicas and which servers have the appropriate blocks.

HDFS Client 2/3
- When an application reads a file, the HDFS client first asks the NameNode for the list of DataNodes that host replicas of the blocks of the file.
- The list is sorted by the network topology distance from the client.
- The client contacts a DataNode directly and requests the transfer of the desired block.
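Sorting replicas by network topology distance can be sketched as follows. The sketch assumes each node has a location path of the form /datacenter/rack/node and measures distance as the number of hops to the closest common ancestor from both sides; the paths and function names are made up for illustration.

```python
def topology_distance(a, b):
    """Hops from a up to the closest common ancestor and down to b."""
    path_a = a.strip("/").split("/")
    path_b = b.strip("/").split("/")
    common = 0
    for x, y in zip(path_a, path_b):
        if x != y:
            break
        common += 1
    return (len(path_a) - common) + (len(path_b) - common)


def sort_replicas(client, replicas):
    """Order replica locations from nearest to farthest from the client."""
    return sorted(replicas, key=lambda node: topology_distance(client, node))


client = "/dc1/rack1/node1"
replicas = ["/dc1/rack2/node5", "/dc1/rack1/node2", "/dc1/rack1/node1"]

for node in sort_replicas(client, replicas):
    print(topology_distance(client, node), node)
```

With this measure a replica on the same node has distance 0, one on the same rack has distance 2, and one on another rack in the same data center has distance 4, so the client reads from the first entry of the sorted list.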

30 Reading a file
Reading a file is shorter:
- Get block locations
- Read from a replica
[Figure: client reading lines from bigdata.dat; Namenode holding /user/ljdursi/diffuse with bigdata.dat; datanode1, datanode2, datanode3]
Step 1: open.
Step 2: get block locations.

31 Reading a file (continued)
Step 3: read blocks.

HDFS Client 3/3
- When a client writes, it first asks the NameNode to choose DataNodes to host replicas of the first block of the file.
- The client organizes a pipeline from node to node and sends the data.
- When the first block is filled, the client requests new DataNodes to be chosen to host replicas of the next block.
- A new pipeline is organized, and the client sends the further bytes of the file.
- The choice of DataNodes for each block is likely to be different.

32 Writing a file
Writing a file is a multiple-stage process:
- Create file
- Get nodes for blocks
- Start writing
- Data nodes coordinate replication
- Get ack back
- Complete
[Figure: client writing newdata.dat; Namenode holding /user/ljdursi/diffuse with bigdata.dat; datanode1, datanode2, datanode3]
Step 1: create.
Step 2: get nodes.

33 Writing a file (continued)
Step 3: start writing.
Step 4: data nodes coordinate replication.

34 Writing a file (continued)
Step 5: get ack back (while writing).
Step 6: complete.

35 HDFS Federation
- Hadoop 2+.
- Each Namenode will host part of the blocks.
- A Block Pool is a set of blocks that belong to a single namespace.
- Support for machine clusters.

File I/O and Leases in HDFS
- An application adds data to HDFS by creating a new file and writing data to it.
- Once the file is closed, its existing contents cannot be changed; new data can only be appended.
- HDFS implements a single-writer, multiple-reader model.
- Leases are granted by the NameNode to HDFS clients.
- Writer clients need to periodically renew the lease via a heartbeat to the NameNode.
- On file close, the lease is revoked.
- There are soft and hard limits for leases (the hard limit being an hour).
- A write lease does not prevent multiple readers from reading the file.

36 Data Pipelining for Writing Blocks 1/2
- An HDFS file consists of blocks.
- When there is a need for a new block, the NameNode allocates a block with a unique block ID and determines a list of DataNodes to host replicas of the block.
- The DataNodes form a pipeline, the order of which minimizes the total network distance from the client to the last DataNode.

Data Pipelining for Writing Blocks 2/2
- Bytes are pushed to the pipeline as a sequence of packets.
- The bytes that an application writes are first buffered at the client side.
- After a packet buffer is filled (typically 64 KB), the data are pushed to the pipeline.
- The next packet can be pushed to the pipeline before receiving the acknowledgment for the previous packets.
- The number of outstanding packets is limited by the outstanding packets window size of the client.
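The packet window can be sketched with a small simulation. This is a simplified, single-threaded model under stated assumptions: acknowledgments arrive in order, the sender waits for the oldest acknowledgment only when the window is full, and the window size of 2 is an arbitrary illustrative value. The function names are hypothetical.

```python
from collections import deque

PACKET_SIZE = 64 * 1024  # typical packet buffer size, from the slide


def packetize(data, packet_size=PACKET_SIZE):
    """Split the client-side buffer into fixed-size packets."""
    return [data[i:i + packet_size] for i in range(0, len(data), packet_size)]


def send_with_window(num_packets, window):
    """Simulate sending packets with at most `window` unacknowledged ones.

    Returns the largest number of packets in flight and the event log.
    """
    in_flight = deque()
    log = []
    max_in_flight = 0
    for seq in range(num_packets):
        if len(in_flight) == window:
            # Window is full: wait for the oldest outstanding packet's ack.
            log.append(("ack", in_flight.popleft()))
        in_flight.append(seq)
        log.append(("send", seq))
        max_in_flight = max(max_in_flight, len(in_flight))
    while in_flight:
        log.append(("ack", in_flight.popleft()))
    return max_in_flight, log


packets = packetize(b"x" * (200 * 1024))  # 200 KB of application data
peak, events = send_with_window(len(packets), window=2)
print(len(packets), "packets, peak in flight:", peak)
print(events)
```

Packets 0 and 1 are sent back to back without waiting, which is the pipelining the slide describes; only the third send has to wait for an acknowledgment.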

37 HDFS Interfaces
- There are many interfaces to interact with HDFS.
- The simplest way of interacting with HDFS is the command line.
- Two properties are set in the HDFS configuration:
  - Default Hadoop filesystem, fs.default.name: hdfs://localhost/. Used to determine the host (localhost) and port (8020) for the HDFS NameNode.
  - Replication factor, dfs.replication: the default is 3; disable replication by setting it to 1 (single datanode).
- Other HDFS interfaces:
  - HTTP: a read-only interface for retrieving directory listings and data over HTTP.
  - FTP: permits the use of the FTP protocol to interact with HDFS.

Replication in HDFS
Replica placement:
- Critical to improve data reliability, availability and network bandwidth utilization.
- The policy is rack-aware, as rack failure is far less likely than node failure.
- With the default replication factor (3), one replica is put on one node in the local rack, another on a node in a different (remote) rack, and the last on a different node in the same remote rack.
- One third of the replicas are on one node, two thirds of the replicas are on one rack, and the other third are evenly distributed across the remaining racks.
- The benefit is reduced inter-rack write traffic.
Replica selection:
- A read request is satisfied from a replica that is near the application.
- This minimizes global bandwidth consumption and read latency.
- If HDFS spans multiple data centers, a replica in the local data center is preferred over any remote replica.
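The placement rule for a replication factor of 3 can be sketched as below. The topology dictionary, the node names and the function are hypothetical, and the random choice of remote rack is an assumption made for the sketch rather than a description of how HDFS picks it.

```python
import random


def place_replicas(writer, topology, rng=random):
    """Pick three (rack, node) locations following the rule on the slide.

    writer:   (rack, node) of the client writing the block
    topology: dict mapping rack name -> list of node names
    """
    local_rack, local_node = writer
    # A remote rack must hold at least two nodes for replicas 2 and 3.
    candidates = [rack for rack, nodes in topology.items()
                  if rack != local_rack and len(nodes) >= 2]
    if not candidates:
        raise ValueError("no remote rack with two or more nodes")
    remote_rack = rng.choice(candidates)
    second, third = rng.sample(topology[remote_rack], 2)
    return [
        (local_rack, local_node),  # 1st replica: node in the local rack
        (remote_rack, second),     # 2nd replica: node in a remote rack
        (remote_rack, third),      # 3rd replica: other node, same remote rack
    ]


topology = {
    "rack1": ["n1", "n2", "n3"],
    "rack2": ["n4", "n5", "n6"],
    "rack3": ["n7", "n8"],
}
placement = place_replicas(("rack1", "n1"), topology, rng=random.Random(0))
print(placement)
```

Only one of the three writes crosses racks in this layout, which is the reduction in inter-rack write traffic the slide mentions, while the block still survives the loss of either rack.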

38 Communication Protocol
- All HDFS communication protocols are layered on top of the TCP/IP protocol.
- A client establishes a connection to a configurable TCP port on the NameNode machine and uses ClientProtocol.
- DataNodes talk to the NameNode using the DataNode protocol.
- A Remote Procedure Call (RPC) abstraction wraps both the ClientProtocol and the DataNode protocol.
- The NameNode never initiates an RPC; instead it only responds to RPC requests issued by DataNodes or clients.

Robustness
- The primary objective of HDFS is to store data reliably even during failures.
- Three common types of failures: NameNode, DataNode and network partitions.
- Data disk failure:
  - Heartbeat messages track the health of DataNodes.
  - The NameNode performs the necessary re-replication on DataNode unavailability, replica corruption or disk fault.
- Cluster rebalancing:
  - Automatically move data between DataNodes if the free space on a DataNode falls below a threshold or during sudden high demand.
- Data integrity:
  - Checksum checking on HDFS files, during file creation and retrieval.
- Metadata disk failure:
  - Manual intervention: no automatic recovery, restart or failover.
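Checksum checking at creation and retrieval can be illustrated with CRC32 from the Python standard library. The 512-byte chunk size and the helper names are assumptions for this sketch; it shows the idea of per-chunk checksums, not HDFS's on-disk format.

```python
import zlib

CHUNK_SIZE = 512  # bytes covered by each checksum (assumed for illustration)


def chunk_checksums(data, chunk_size=CHUNK_SIZE):
    """Compute one CRC32 per chunk when the data is written."""
    return [zlib.crc32(data[i:i + chunk_size])
            for i in range(0, len(data), chunk_size)]


def corrupt_chunks(data, expected, chunk_size=CHUNK_SIZE):
    """On read, return the indices of chunks whose checksum does not match."""
    actual = chunk_checksums(data, chunk_size)
    return [i for i, (a, e) in enumerate(zip(actual, expected)) if a != e]


original = bytes(range(256)) * 8      # 2048 bytes -> 4 chunks
stored_sums = chunk_checksums(original)

damaged = bytearray(original)
damaged[600] ^= 0xFF                  # flip bits in one byte of chunk 1

print(corrupt_chunks(original, stored_sums))
print(corrupt_chunks(bytes(damaged), stored_sums))
```

When a mismatch is found, a client can read the block from another replica, and the damaged replica becomes a candidate for re-replication.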

39 Software: Ceph
Ceph: An Alternative to HDFS in One Slide
[Figure: Ceph architecture diagram. Labels: APP, HOST / VM, Client; RadosGW (S3, Swift), RBD, CephFS; LibRados, Rados; MDS.1 ... MDS.n; MON.1 ... MON.n; Pool 1 ... Pool n; CRUSH map; PG 1 ... PG n; cluster nodes with OSDs.]
Source: activities/tf-storage/ws16/slides/low_cost_storage_cephopenstack_swift.pdf


More information

Hadoop Distributed File System. Jordan Prosch, Matt Kipps

Hadoop Distributed File System. Jordan Prosch, Matt Kipps Hadoop Distributed File System Jordan Prosch, Matt Kipps Outline - Background - Architecture - Comments & Suggestions Background What is HDFS? Part of Apache Hadoop - distributed storage What is Hadoop?

More information

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give

More information

Apache Hadoop. Alexandru Costan

Apache Hadoop. Alexandru Costan 1 Apache Hadoop Alexandru Costan Big Data Landscape No one-size-fits-all solution: SQL, NoSQL, MapReduce, No standard, except Hadoop 2 Outline What is Hadoop? Who uses it? Architecture HDFS MapReduce Open

More information

Hadoop Ecosystem B Y R A H I M A.

Hadoop Ecosystem B Y R A H I M A. Hadoop Ecosystem B Y R A H I M A. History of Hadoop Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open

More information

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee [email protected] [email protected]

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org dhruba@facebook.com Hadoop Distributed File System Dhruba Borthakur Apache Hadoop Project Management Committee [email protected] [email protected] Hadoop, Why? Need to process huge datasets on large clusters of computers

More information

GraySort and MinuteSort at Yahoo on Hadoop 0.23

GraySort and MinuteSort at Yahoo on Hadoop 0.23 GraySort and at Yahoo on Hadoop.23 Thomas Graves Yahoo! May, 213 The Apache Hadoop[1] software library is an open source framework that allows for the distributed processing of large data sets across clusters

More information

IJFEAT INTERNATIONAL JOURNAL FOR ENGINEERING APPLICATIONS AND TECHNOLOGY

IJFEAT INTERNATIONAL JOURNAL FOR ENGINEERING APPLICATIONS AND TECHNOLOGY IJFEAT INTERNATIONAL JOURNAL FOR ENGINEERING APPLICATIONS AND TECHNOLOGY Hadoop Distributed File System: What and Why? Ashwini Dhruva Nikam, Computer Science & Engineering, J.D.I.E.T., Yavatmal. Maharashtra,

More information

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 1 Hadoop: A Framework for Data- Intensive Distributed Computing CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 2 What is Hadoop? Hadoop is a software framework for distributed processing of large datasets

More information

Design and Evolution of the Apache Hadoop File System(HDFS)

Design and Evolution of the Apache Hadoop File System(HDFS) Design and Evolution of the Apache Hadoop File System(HDFS) Dhruba Borthakur Engineer@Facebook Committer@Apache HDFS SDC, Sept 19 2011 Outline Introduction Yet another file-system, why? Goals of Hadoop

More information

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture DATA MINING WITH HADOOP AND HIVE Introduction to Architecture Dr. Wlodek Zadrozny (Most slides come from Prof. Akella s class in 2014) 2015-2025. Reproduction or usage prohibited without permission of

More information

Hadoop. Apache Hadoop is an open-source software framework for storage and large scale processing of data-sets on clusters of commodity hardware.

Hadoop. Apache Hadoop is an open-source software framework for storage and large scale processing of data-sets on clusters of commodity hardware. Hadoop Source Alessandro Rezzani, Big Data - Architettura, tecnologie e metodi per l utilizzo di grandi basi di dati, Apogeo Education, ottobre 2013 wikipedia Hadoop Apache Hadoop is an open-source software

More information

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Created by Doug Cutting and Mike Carafella in 2005. Cutting named the program after

More information

Take An Internal Look at Hadoop. Hairong Kuang Grid Team, Yahoo! Inc [email protected]

Take An Internal Look at Hadoop. Hairong Kuang Grid Team, Yahoo! Inc hairong@yahoo-inc.com Take An Internal Look at Hadoop Hairong Kuang Grid Team, Yahoo! Inc [email protected] What s Hadoop Framework for running applications on large clusters of commodity hardware Scale: petabytes of data

More information

Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms

Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms Distributed File System 1 How do we get data to the workers? NAS Compute Nodes SAN 2 Distributed File System Don t move data to workers move workers to the data! Store data on the local disks of nodes

More information

Hadoop & its Usage at Facebook

Hadoop & its Usage at Facebook Hadoop & its Usage at Facebook Dhruba Borthakur Project Lead, Hadoop Distributed File System [email protected] Presented at the Storage Developer Conference, Santa Clara September 15, 2009 Outline Introduction

More information

BBM467 Data Intensive ApplicaAons

BBM467 Data Intensive ApplicaAons Hace7epe Üniversitesi Bilgisayar Mühendisliği Bölümü BBM467 Data Intensive ApplicaAons Dr. Fuat Akal [email protected] Problem How do you scale up applicaaons? Run jobs processing 100 s of terabytes

More information

Distributed File Systems

Distributed File Systems Distributed File Systems Mauro Fruet University of Trento - Italy 2011/12/19 Mauro Fruet (UniTN) Distributed File Systems 2011/12/19 1 / 39 Outline 1 Distributed File Systems 2 The Google File System (GFS)

More information

Apache Hadoop FileSystem and its Usage in Facebook

Apache Hadoop FileSystem and its Usage in Facebook Apache Hadoop FileSystem and its Usage in Facebook Dhruba Borthakur Project Lead, Apache Hadoop Distributed File System [email protected] Presented at Indian Institute of Technology November, 2010 http://www.facebook.com/hadoopfs

More information

Data-Intensive Computing with Map-Reduce and Hadoop

Data-Intensive Computing with Map-Reduce and Hadoop Data-Intensive Computing with Map-Reduce and Hadoop Shamil Humbetov Department of Computer Engineering Qafqaz University Baku, Azerbaijan [email protected] Abstract Every day, we create 2.5 quintillion

More information

Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components

Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components of Hadoop. We will see what types of nodes can exist in a Hadoop

More information

Hadoop: Embracing future hardware

Hadoop: Embracing future hardware Hadoop: Embracing future hardware Suresh Srinivas @suresh_m_s Page 1 About Me Architect & Founder at Hortonworks Long time Apache Hadoop committer and PMC member Designed and developed many key Hadoop

More information

Processing of massive data: MapReduce. 2. Hadoop. New Trends In Distributed Systems MSc Software and Systems

Processing of massive data: MapReduce. 2. Hadoop. New Trends In Distributed Systems MSc Software and Systems Processing of massive data: MapReduce 2. Hadoop 1 MapReduce Implementations Google were the first that applied MapReduce for big data analysis Their idea was introduced in their seminal paper MapReduce:

More information

Hadoop implementation of MapReduce computational model. Ján Vaňo

Hadoop implementation of MapReduce computational model. Ján Vaňo Hadoop implementation of MapReduce computational model Ján Vaňo What is MapReduce? A computational model published in a paper by Google in 2004 Based on distributed computation Complements Google s distributed

More information

MASSIVE DATA PROCESSING (THE GOOGLE WAY ) 27/04/2015. Fundamentals of Distributed Systems. Inside Google circa 2015

MASSIVE DATA PROCESSING (THE GOOGLE WAY ) 27/04/2015. Fundamentals of Distributed Systems. Inside Google circa 2015 7/04/05 Fundamentals of Distributed Systems CC5- PROCESAMIENTO MASIVO DE DATOS OTOÑO 05 Lecture 4: DFS & MapReduce I Aidan Hogan [email protected] Inside Google circa 997/98 MASSIVE DATA PROCESSING (THE

More information

Hadoop & its Usage at Facebook

Hadoop & its Usage at Facebook Hadoop & its Usage at Facebook Dhruba Borthakur Project Lead, Hadoop Distributed File System [email protected] Presented at the The Israeli Association of Grid Technologies July 15, 2009 Outline Architecture

More information

Hadoop IST 734 SS CHUNG

Hadoop IST 734 SS CHUNG Hadoop IST 734 SS CHUNG Introduction What is Big Data?? Bulk Amount Unstructured Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per day) If a regular machine need to

More information

Large scale processing using Hadoop. Ján Vaňo

Large scale processing using Hadoop. Ján Vaňo Large scale processing using Hadoop Ján Vaňo What is Hadoop? Software platform that lets one easily write and run applications that process vast amounts of data Includes: MapReduce offline computing engine

More information

Sector vs. Hadoop. A Brief Comparison Between the Two Systems

Sector vs. Hadoop. A Brief Comparison Between the Two Systems Sector vs. Hadoop A Brief Comparison Between the Two Systems Background Sector is a relatively new system that is broadly comparable to Hadoop, and people want to know what are the differences. Is Sector

More information

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) Rezaul A. Chowdhury Department of Computer Science SUNY Stony Brook Spring 2016 MapReduce MapReduce is a programming model

More information

!"#$%&' ( )%#*'+,'-#.//"0( !"#$"%&'()*$+()',!-+.'/', 4(5,67,!-+!"89,:*$;'0+$.<.,&0$'09,&)"/=+,!()<>'0, 3, Processing LARGE data sets

!#$%&' ( )%#*'+,'-#.//0( !#$%&'()*$+()',!-+.'/', 4(5,67,!-+!89,:*$;'0+$.<.,&0$'09,&)/=+,!()<>'0, 3, Processing LARGE data sets !"#$%&' ( Processing LARGE data sets )%#*'+,'-#.//"0( Framework for o! reliable o! scalable o! distributed computation of large data sets 4(5,67,!-+!"89,:*$;'0+$.

More information

NoSQL and Hadoop Technologies On Oracle Cloud

NoSQL and Hadoop Technologies On Oracle Cloud NoSQL and Hadoop Technologies On Oracle Cloud Vatika Sharma 1, Meenu Dave 2 1 M.Tech. Scholar, Department of CSE, Jagan Nath University, Jaipur, India 2 Assistant Professor, Department of CSE, Jagan Nath

More information

Parallel Processing of cluster by Map Reduce

Parallel Processing of cluster by Map Reduce Parallel Processing of cluster by Map Reduce Abstract Madhavi Vaidya, Department of Computer Science Vivekanand College, Chembur, Mumbai [email protected] MapReduce is a parallel programming model

More information

HDFS: Hadoop Distributed File System

HDFS: Hadoop Distributed File System Istanbul Şehir University Big Data Camp 14 HDFS: Hadoop Distributed File System Aslan Bakirov Kevser Nur Çoğalmış Agenda Distributed File System HDFS Concepts HDFS Interfaces HDFS Full Picture Read Operation

More information

How To Scale Out Of A Nosql Database

How To Scale Out Of A Nosql Database Firebird meets NoSQL (Apache HBase) Case Study Firebird Conference 2011 Luxembourg 25.11.2011 26.11.2011 Thomas Steinmaurer DI +43 7236 3343 896 [email protected] www.scch.at Michael Zwick DI

More information

MapReduce with Apache Hadoop Analysing Big Data

MapReduce with Apache Hadoop Analysing Big Data MapReduce with Apache Hadoop Analysing Big Data April 2010 Gavin Heavyside [email protected] About Journey Dynamics Founded in 2006 to develop software technology to address the issues

More information

NoSQL Data Base Basics

NoSQL Data Base Basics NoSQL Data Base Basics Course Notes in Transparency Format Cloud Computing MIRI (CLC-MIRI) UPC Master in Innovation & Research in Informatics Spring- 2013 Jordi Torres, UPC - BSC www.jorditorres.eu HDFS

More information

Deploying Hadoop with Manager

Deploying Hadoop with Manager Deploying Hadoop with Manager SUSE Big Data Made Easier Peter Linnell / Sales Engineer [email protected] Alejandro Bonilla / Sales Engineer [email protected] 2 Hadoop Core Components 3 Typical Hadoop Distribution

More information

HDFS Under the Hood. Sanjay Radia. [email protected] Grid Computing, Hadoop Yahoo Inc.

HDFS Under the Hood. Sanjay Radia. Sradia@yahoo-inc.com Grid Computing, Hadoop Yahoo Inc. HDFS Under the Hood Sanjay Radia [email protected] Grid Computing, Hadoop Yahoo Inc. 1 Outline Overview of Hadoop, an open source project Design of HDFS On going work 2 Hadoop Hadoop provides a framework

More information

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY A PATH FOR HORIZING YOUR INNOVATIVE WORK A COMPREHENSIVE VIEW OF HADOOP ER. AMRINDER KAUR Assistant Professor, Department

More information

Hadoop and Map-Reduce. Swati Gore

Hadoop and Map-Reduce. Swati Gore Hadoop and Map-Reduce Swati Gore Contents Why Hadoop? Hadoop Overview Hadoop Architecture Working Description Fault Tolerance Limitations Why Map-Reduce not MPI Distributed sort Why Hadoop? Existing Data

More information

Data-Intensive Programming. Timo Aaltonen Department of Pervasive Computing

Data-Intensive Programming. Timo Aaltonen Department of Pervasive Computing Data-Intensive Programming Timo Aaltonen Department of Pervasive Computing Data-Intensive Programming Lecturer: Timo Aaltonen University Lecturer [email protected] Assistants: Henri Terho and Antti

More information

A very short Intro to Hadoop

A very short Intro to Hadoop 4 Overview A very short Intro to Hadoop photo by: exfordy, flickr 5 How to Crunch a Petabyte? Lots of disks, spinning all the time Redundancy, since disks die Lots of CPU cores, working all the time Retry,

More information

BigData. An Overview of Several Approaches. David Mera 16/12/2013. Masaryk University Brno, Czech Republic

BigData. An Overview of Several Approaches. David Mera 16/12/2013. Masaryk University Brno, Czech Republic BigData An Overview of Several Approaches David Mera Masaryk University Brno, Czech Republic 16/12/2013 Table of Contents 1 Introduction 2 Terminology 3 Approaches focused on batch data processing MapReduce-Hadoop

More information

Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA

Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA http://kzhang6.people.uic.edu/tutorial/amcis2014.html August 7, 2014 Schedule I. Introduction to big data

More information

Hadoop and ecosystem * 本 文 中 的 言 论 仅 代 表 作 者 个 人 观 点 * 本 文 中 的 一 些 图 例 来 自 于 互 联 网. Information Management. Information Management IBM CDL Lab

Hadoop and ecosystem * 本 文 中 的 言 论 仅 代 表 作 者 个 人 观 点 * 本 文 中 的 一 些 图 例 来 自 于 互 联 网. Information Management. Information Management IBM CDL Lab IBM CDL Lab Hadoop and ecosystem * 本 文 中 的 言 论 仅 代 表 作 者 个 人 观 点 * 本 文 中 的 一 些 图 例 来 自 于 互 联 网 Information Management 2012 IBM Corporation Agenda Hadoop 技 术 Hadoop 概 述 Hadoop 1.x Hadoop 2.x Hadoop 生 态

More information

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Summary Xiangzhe Li Nowadays, there are more and more data everyday about everything. For instance, here are some of the astonishing

More information

HADOOP MOCK TEST HADOOP MOCK TEST I

HADOOP MOCK TEST HADOOP MOCK TEST I http://www.tutorialspoint.com HADOOP MOCK TEST Copyright tutorialspoint.com This section presents you various set of Mock Tests related to Hadoop Framework. You can download these sample mock tests at

More information

MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012

MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012 MapReduce and Hadoop Aaron Birkland Cornell Center for Advanced Computing January 2012 Motivation Simple programming model for Big Data Distributed, parallel but hides this Established success at petabyte

More information

Open source large scale distributed data management with Google s MapReduce and Bigtable

Open source large scale distributed data management with Google s MapReduce and Bigtable Open source large scale distributed data management with Google s MapReduce and Bigtable Ioannis Konstantinou Email: [email protected] Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory

More information

Hadoop at Yahoo! Owen O Malley Yahoo!, Grid Team [email protected]

Hadoop at Yahoo! Owen O Malley Yahoo!, Grid Team owen@yahoo-inc.com Hadoop at Yahoo! Owen O Malley Yahoo!, Grid Team [email protected] Who Am I? Yahoo! Architect on Hadoop Map/Reduce Design, review, and implement features in Hadoop Working on Hadoop full time since Feb

More information

Jeffrey D. Ullman slides. MapReduce for data intensive computing

Jeffrey D. Ullman slides. MapReduce for data intensive computing Jeffrey D. Ullman slides MapReduce for data intensive computing Single-node architecture CPU Machine Learning, Statistics Memory Classical Data Mining Disk Commodity Clusters Web data sets can be very

More information

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook Hadoop Ecosystem Overview CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook Agenda Introduce Hadoop projects to prepare you for your group work Intimate detail will be provided in future

More information

HDFS Users Guide. Table of contents

HDFS Users Guide. Table of contents Table of contents 1 Purpose...2 2 Overview...2 3 Prerequisites...3 4 Web Interface...3 5 Shell Commands... 3 5.1 DFSAdmin Command...4 6 Secondary NameNode...4 7 Checkpoint Node...5 8 Backup Node...6 9

More information

Chapter 7. Using Hadoop Cluster and MapReduce

Chapter 7. Using Hadoop Cluster and MapReduce Chapter 7 Using Hadoop Cluster and MapReduce Modeling and Prototyping of RMS for QoS Oriented Grid Page 152 7. Using Hadoop Cluster and MapReduce for Big Data Problems The size of the databases used in

More information

Big Data Storage Options for Hadoop Sam Fineberg, HP Storage

Big Data Storage Options for Hadoop Sam Fineberg, HP Storage Sam Fineberg, HP Storage SNIA Legal Notice The material contained in this tutorial is copyrighted by the SNIA unless otherwise noted. Member companies and individual members may use this material in presentations

More information

COSC 6397 Big Data Analytics. Distributed File Systems (II) Edgar Gabriel Spring 2014. HDFS Basics

COSC 6397 Big Data Analytics. Distributed File Systems (II) Edgar Gabriel Spring 2014. HDFS Basics COSC 6397 Big Data Analytics Distributed File Systems (II) Edgar Gabriel Spring 2014 HDFS Basics An open-source implementation of Google File System Assume that node failure rate is high Assumes a small

More information

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM Sneha D.Borkar 1, Prof.Chaitali S.Surtakar 2 Student of B.E., Information Technology, J.D.I.E.T, [email protected] Assistant Professor, Information

More information

Hadoop. http://hadoop.apache.org/ Sunday, November 25, 12

Hadoop. http://hadoop.apache.org/ Sunday, November 25, 12 Hadoop http://hadoop.apache.org/ What Is Apache Hadoop? The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using

More information

A Brief Outline on Bigdata Hadoop

A Brief Outline on Bigdata Hadoop A Brief Outline on Bigdata Hadoop Twinkle Gupta 1, Shruti Dixit 2 RGPV, Department of Computer Science and Engineering, Acropolis Institute of Technology and Research, Indore, India Abstract- Bigdata is

More information

EXPERIMENTATION. HARRISON CARRANZA School of Computer Science and Mathematics

EXPERIMENTATION. HARRISON CARRANZA School of Computer Science and Mathematics BIG DATA WITH HADOOP EXPERIMENTATION HARRISON CARRANZA Marist College APARICIO CARRANZA NYC College of Technology CUNY ECC Conference 2016 Poughkeepsie, NY, June 12-14, 2016 Marist College AGENDA Contents

More information

Comparative analysis of mapreduce job by keeping data constant and varying cluster size technique

Comparative analysis of mapreduce job by keeping data constant and varying cluster size technique Comparative analysis of mapreduce job by keeping data constant and varying cluster size technique Mahesh Maurya a, Sunita Mahajan b * a Research Scholar, JJT University, MPSTME, Mumbai, India,[email protected]

More information

Hadoop Distributed File System. Dhruba Borthakur June, 2007

Hadoop Distributed File System. Dhruba Borthakur June, 2007 Hadoop Distributed File System Dhruba Borthakur June, 2007 Goals of HDFS Very Large Distributed File System 10K nodes, 100 million files, 10 PB Assumes Commodity Hardware Files are replicated to handle

More information

International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February-2014 10 ISSN 2278-7763

International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February-2014 10 ISSN 2278-7763 International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February-2014 10 A Discussion on Testing Hadoop Applications Sevuga Perumal Chidambaram ABSTRACT The purpose of analysing

More information

Big Data Management and NoSQL Databases

Big Data Management and NoSQL Databases NDBI040 Big Data Management and NoSQL Databases Lecture 3. Apache Hadoop Doc. RNDr. Irena Holubova, Ph.D. [email protected] http://www.ksi.mff.cuni.cz/~holubova/ndbi040/ Apache Hadoop Open-source

More information

Weekly Report. Hadoop Introduction. submitted By Anurag Sharma. Department of Computer Science and Engineering. Indian Institute of Technology Bombay

Weekly Report. Hadoop Introduction. submitted By Anurag Sharma. Department of Computer Science and Engineering. Indian Institute of Technology Bombay Weekly Report Hadoop Introduction submitted By Anurag Sharma Department of Computer Science and Engineering Indian Institute of Technology Bombay Chapter 1 What is Hadoop? Apache Hadoop (High-availability

More information

Big Data Analytics(Hadoop) Prepared By : Manoj Kumar Joshi & Vikas Sawhney

Big Data Analytics(Hadoop) Prepared By : Manoj Kumar Joshi & Vikas Sawhney Big Data Analytics(Hadoop) Prepared By : Manoj Kumar Joshi & Vikas Sawhney General Agenda Understanding Big Data and Big Data Analytics Getting familiar with Hadoop Technology Hadoop release and upgrades

More information

What Is Datacenter (Warehouse) Computing. Distributed and Parallel Technology. Datacenter Computing Application Programming

What Is Datacenter (Warehouse) Computing. Distributed and Parallel Technology. Datacenter Computing Application Programming Distributed and Parallel Technology Datacenter and Warehouse Computing Hans-Wolfgang Loidl School of Mathematical and Computer Sciences Heriot-Watt University, Edinburgh 0 Based on earlier versions by

More information

<Insert Picture Here> Big Data

<Insert Picture Here> Big Data Big Data Kevin Kalmbach Principal Sales Consultant, Public Sector Engineered Systems Program Agenda What is Big Data and why it is important? What is your Big

More information

Lecture 2 (08/31, 09/02, 09/09): Hadoop. Decisions, Operations & Information Technologies Robert H. Smith School of Business Fall, 2015

Lecture 2 (08/31, 09/02, 09/09): Hadoop. Decisions, Operations & Information Technologies Robert H. Smith School of Business Fall, 2015 Lecture 2 (08/31, 09/02, 09/09): Hadoop Decisions, Operations & Information Technologies Robert H. Smith School of Business Fall, 2015 K. Zhang BUDT 758 What we ll cover Overview Architecture o Hadoop

More information

Apache Flink Next-gen data analysis. Kostas Tzoumas [email protected] @kostas_tzoumas

Apache Flink Next-gen data analysis. Kostas Tzoumas ktzoumas@apache.org @kostas_tzoumas Apache Flink Next-gen data analysis Kostas Tzoumas [email protected] @kostas_tzoumas What is Flink Project undergoing incubation in the Apache Software Foundation Originating from the Stratosphere research

More information

Map Reduce & Hadoop Recommended Text:

Map Reduce & Hadoop Recommended Text: Big Data Map Reduce & Hadoop Recommended Text:! Large datasets are becoming more common The New York Stock Exchange generates about one terabyte of new trade data per day. Facebook hosts approximately

More information

Extending Hadoop beyond MapReduce

Extending Hadoop beyond MapReduce Extending Hadoop beyond MapReduce Mahadev Konar Co-Founder @mahadevkonar (@hortonworks) Page 1 Bio Apache Hadoop since 2006 - committer and PMC member Developed and supported Map Reduce @Yahoo! - Core

More information

YARN Apache Hadoop Next Generation Compute Platform

YARN Apache Hadoop Next Generation Compute Platform YARN Apache Hadoop Next Generation Compute Platform Bikas Saha @bikassaha Hortonworks Inc. 2013 Page 1 Apache Hadoop & YARN Apache Hadoop De facto Big Data open source platform Running for about 5 years

More information

Accelerating and Simplifying Apache

Accelerating and Simplifying Apache Accelerating and Simplifying Apache Hadoop with Panasas ActiveStor White paper NOvember 2012 1.888.PANASAS www.panasas.com Executive Overview The technology requirements for big data vary significantly

More information