Hadoop Distributed FileSystem on Cloud
Giannis Kitsos, Antonis Papaioannou and Nikos Tsikoudis
Department of Computer Science, University of Crete
{kitsos, papaioan,

Abstract. The growing capability of cloud computing motivates developers to build their applications on cloud infrastructures that offer high performance and scalability. We propose a modified version of the Hadoop Distributed File System (HDFS) that runs efficiently in a cloud environment: it is able to replicate data blocks across different physical machines, not just among different virtual machines. In this way HDFS provides better reliability and availability of data. The choice of each replication target is based on the mapping between physical and virtual machines provided by the cloud provider. In order to export this mapping to HDFS, we implemented a web service that bridges HDFS and the cloud provider.

1 Introduction

In the last few years there has been a trend for enterprises to use cloud computing infrastructures, which provide users with the illusion of unlimited computational resources as well as large pools of data. As a result, cloud computing offers developers the opportunity to build extremely scalable applications, and many software frameworks built to run on clusters can also be used in cloud computing environments. Hadoop [1,8,9] is a software framework that supports data-intensive distributed applications running on a cluster as well as on a cloud. The Hadoop Distributed File System (HDFS) [7] is the file system component of Hadoop, responsible for distributing replicas of application data across the different nodes of the infrastructure that hosts the application. In a cloud environment a physical machine can host multiple virtual machines (VMs), and HDFS is aware only of the VMs, not of the physical machines that host them.
Consequently, there is always the possibility that all the VMs chosen to store the data replicas reside on the same physical machine. In this report we present a modification of the HDFS replication strategy that stores replicas of the same data block on VMs hosted by different physical machines of the cloud infrastructure. For that purpose the cloud provider has to export information about the correlation between the virtual and physical machines of the cloud system. To achieve this kind of bridge between the cloud provider and HDFS we built a web service that could be offered by any cloud provider. Finally, we enumerate possible shortcomings and present solutions for them.

(Authors are listed in alphabetical order.)

2 Background

Hadoop is a software framework that supports data-intensive distributed applications under a free license. It enables applications to work with thousands of nodes and petabytes of data by providing a distributed file system. A Hadoop cluster scales storage capacity and I/O bandwidth by simply adding commodity servers. Hadoop is a top-level Apache project built and used by a global community of contributors [2], using the Java programming language. Yahoo! has been the largest contributor to the project [4] and uses Hadoop extensively across its businesses [5].

HDFS is the file system component of Hadoop, and its interface is patterned after the UNIX file system. HDFS stores file system metadata and application data separately. A dedicated server, called the NameNode, stores the metadata of the file system. NameNode records include attributes such as permissions, modification and access times, namespace information, and disk space quotas. The servers where the application data are stored are called DataNodes. The file system uses the TCP/IP layer for communication between servers. HDFS is designed to handle very large files. The contents of a file are split into large blocks, and each block is independently replicated to multiple DataNodes. Each data block is represented by two files on a DataNode's file system: the first contains the actual data, and the other contains the block's metadata, such as its checksum and generation stamp.
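As a concrete illustration of the block splitting just described, the sketch below computes the block offsets for a file of a given length. It is an illustrative toy, not HDFS code; 64 MB is HDFS's historical default block size, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch: split a file of a given length into fixed-size blocks.
public class BlockSplitter {
    // HDFS's historical default block size.
    public static final long BLOCK_SIZE = 64L * 1024 * 1024;

    // Returns the starting offset of each block of the file.
    public static List<Long> blockOffsets(long fileLength) {
        List<Long> offsets = new ArrayList<>();
        for (long off = 0; off < fileLength; off += BLOCK_SIZE) {
            offsets.add(off);
        }
        return offsets;
    }
}
```

A 200 MB file, for example, yields four blocks: three full 64 MB blocks and one final partial block.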
To achieve data reliability and durability, HDFS replicates the contents of files across multiple DataNodes, following a specific storage and replication policy. This strategy has the added advantage that data transfer bandwidth is multiplied and there are more opportunities to locate computation near the needed data. User applications can read, write, and delete files, as well as create and delete directories, using the HDFS client: a code library that exports an interface through which user applications access the file system. When an application wants to read a file, the HDFS client asks the NameNode for the list of DataNodes that hold replicas of the file's blocks. The client then chooses one of the DataNodes and fetches the desired block by contacting it directly. Similarly, when a client wants to write a data block, it contacts the NameNode to get a list of DataNodes that will store replicas of the block. Next, the client organizes a pipeline and sends the data to be written to all the chosen DataNodes. The user application generally does not need to know that
the file system metadata and storage are on different servers, or that blocks have multiple replicas.

3 Challenges

By exploiting the cloud computing features that many enterprises provide, cloudified distributions of many services become extremely powerful and flexible. Building the Hadoop Distributed File System on a cloud therefore makes it able to deliver scalable storage with high-performance data processing. However, despite the obvious benefits of cloudifying such a service, several challenges arise that have to be addressed. Taking advantage of the benefits offered by a software platform that implements cloud computing, we present a more portable, flexible, and above all more scalable HDFS distribution. For our implementation we use Eucalyptus [3,6], an open-source software platform that offers high-performance cloud solutions. Using Eucalyptus, we create a cluster of virtual machines that constitute the DataNodes of the file system, where all data block replicas are stored. Such a cluster can scale by creating and removing virtual machines (VMs) as needed, based on the stored data load. According to the replication policy followed by the HDFS NameNode, the blocks that make up a file are replicated among different machines in order to maintain data reliability and availability despite possible DataNode failures. To build a virtual computing cluster we use physical computers, each of which can run many virtual machines. Because each physical machine can host tens of VMs, a serious problem arises when the file system runs on such a cluster: based on the NameNode's replication policy, data blocks are replicated among different virtual machines, and these VMs may reside on the same physical machine. As a result, if that physical machine fails, all data replicas will be lost. This failure impacts data reliability and availability.
Our goal is to achieve a more efficient distribution of file block replicas across different physical machines, not just among different VMs, thereby avoiding the loss of all replicas of a block. The basic challenge is to gather the information about the mapping of virtual machines to physical machines. This information should be exported by the cloud provider; by modifying the HDFS replication policy we can then store replicas of the same data block on different physical machines. An important decision concerns the way the mapping information is transferred from the cloud provider to the HDFS NameNode. This requirement may raise security and privacy concerns on the cloud provider's side about its policy. Such issues can be overcome by implementing a web service on the cloud provider side that exports only the information about which virtual machines run on the same physical machine. Fig. 1 presents an overview of HDFS where the NameNode, by calling the web service, learns which VMs to exclude at each file block storage. If a block is stored in a specific
Fig. 1. Overview of the Hadoop Distributed File System on a Cloud

virtual machine, the NameNode automatically excludes the VMs that belong to the same physical machine as the selected one from the next replication choices. More details about the web service are discussed in Sections 4 and 5.

4 Implementation

The HDFS consists of the NameNode and the DataNodes. The former serves as the namespace manager: it maintains the namespace and the mapping of file blocks to DataNodes, and it is responsible for applying the block placement and replication policy. The latter are the nodes where the file blocks are stored, following the policy dictated by the NameNode. The replicas of each data block are stored on different DataNodes, and each DataNode corresponds to a different virtual machine of the cloud. HDFS holds no information about the VMs except statistics about the data blocks they host. As a result, HDFS does not know whether the chosen DataNodes correspond to the same physical machine, so all the replicas may end up on the same physical machine. The cloud provider, on the other hand, holds all the necessary information about the mapping between virtual and physical machines. In our setup we use Eucalyptus, which hosts the virtual machines used by HDFS. Eucalyptus keeps log files that contain information about every VM; this information includes the virtual IP address of a VM and the IP address of the physical machine that hosts it, as well as the owner of the VM, among other things. In more detail, every 20 seconds Eucalyptus updates the file cloud-output.log, which contains information about the VMs running at that time. In order to gather this information we built a log parser, written in Java, that
scans the log file and creates the mapping between the VMs in use and the physical machines.

Fig. 2. IP addresses grouped according to the physical machines they belong to

HDFS needs to obtain this correlation between VMs and physical machines. For that purpose we have implemented a SOAP-based web service to bridge the Eucalyptus cloud provider and HDFS. The web service is implemented on the cloud provider side, because only the provider keeps track of the VMs and the physical machines. On the other side, we have implemented a web service client that is used by the NameNode. In this way we export the necessary information from the cloud provider to HDFS. In more detail, every VM running in Eucalyptus has an owner ID, so the client sends to the web service the owner ID of the VMs that are part of HDFS and host DataNodes. The web service, using the log parser, filters the logs and builds the correlation between the VMs with that specific owner ID and the physical machines. We use the owner ID because HDFS, and in general any cloud user, does not need to know anything about VMs that belong to other cloud users. The mapping is based on the IP addresses of the VMs and the physical machines. Moreover, the cloud provider may not want to reveal sensitive information such as the IP addresses of the physical machines. For this reason we hide the physical machines' IP addresses and return the IP addresses of the VMs grouped according to the physical machines they belong to. This grouping is illustrated in Fig. 2. The mapping is represented as an ArrayList whose buckets each contain another ArrayList holding the IP addresses of a group of VMs that belong to the same physical machine. This mapping, created by the web service, is sent to the client as an encoded string.
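The parser and grouping described above might look like the following sketch. The report does not give the exact format of the Eucalyptus log lines, so the `host=... vm=... owner=...` layout, class name, and method name are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of the log parser: group VM IPs by the physical host that runs
// them, keeping only VMs of the given owner. The line format is an
// assumption; the real Eucalyptus log layout may differ.
public class VmHostLogParser {
    private static final Pattern LINE =
        Pattern.compile("host=(\\S+)\\s+vm=(\\S+)\\s+owner=(\\S+)");

    // Returns: physical-host IP -> list of VM IPs owned by ownerId.
    public static Map<String, List<String>> parse(List<String> logLines,
                                                  String ownerId) {
        Map<String, List<String>> mapping = new LinkedHashMap<>();
        for (String line : logLines) {
            Matcher m = LINE.matcher(line);
            if (m.find() && m.group(3).equals(ownerId)) {
                mapping.computeIfAbsent(m.group(1), k -> new ArrayList<>())
                       .add(m.group(2));
            }
        }
        return mapping;
    }
}
```

Filtering by owner ID inside the parser, rather than afterwards, also keeps other users' VMs out of any structure that leaves the provider side.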
Once the client receives the necessary information from the web service, it reconstructs the same data structure as the web service, which can then be used by the NameNode to distribute the replicas of a data block to DataNodes that belong to different physical machines.
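The report leaves the string encoding of the mapping unspecified; the sketch below shows one plausible scheme (groups separated by ';', VM IPs within a group by ','), with both the web-service-side encoding and the client-side reconstruction. The delimiters and the class name are assumptions, not the actual wire format.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringJoiner;

// Sketch of one possible encoding for the VM grouping exchanged between
// the web service and the NameNode-side client.
public class MappingCodec {
    // Encode: each inner list (one physical machine's VMs) joined by ',',
    // groups joined by ';'.
    public static String encode(List<List<String>> groups) {
        StringJoiner outer = new StringJoiner(";");
        for (List<String> group : groups) {
            outer.add(String.join(",", group));
        }
        return outer.toString();
    }

    // Decode: rebuild the ArrayList-of-ArrayLists on the client side.
    public static List<List<String>> decode(String encoded) {
        List<List<String>> groups = new ArrayList<>();
        for (String part : encoded.split(";")) {
            groups.add(new ArrayList<>(Arrays.asList(part.split(","))));
        }
        return groups;
    }
}
```

Note that this O(N) decode step is exactly the restructuring cost discussed in Section 5.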
The basic component of the NameNode that performs most of the file system management is the FSNamesystem. When a file block needs to be stored or replicated, the FSNamesystem delegates to the Replication Target Chooser component to apply the replica placement strategy. The Replication Target Chooser is responsible for choosing the desired number of targets for placing the block replicas. If a file block needs to be replicated and stored, either it originates from a new file in the file system or it has been marked as an under-replicated block. There are several reasons a block may be marked as under-replicated. For example, after a DataNode failure, the blocks that had been replicated there are placed in the under-replicated queue. Other reasons are possible corruption of replicas, and the decommissioning phase of a DataNode, decided by the file system administrator, during which all the blocks replicated on it are marked as under-replicated. Under these circumstances the Replication Target Chooser chooses the desired targets based on the replication factor and policy, and returns a list of DataNodes to the NameNode so that the storage phase of the replicas can start. In our implementation, to achieve a reliable Hadoop Distributed File System built on a cloud, we modified the NameNode to maintain an extra data structure for the correlation between the virtual machines and the physical machines. The data structure consists of a table where each index points to a list containing all the virtual machines that belong to the same physical machine. A web service client is responsible for storing this information: it calls the web service as described above and updates the structure, informing the NameNode. For each block that needs replication, the Replication Target Chooser is executed until it finds all the desired targets for each replica according to the replication factor.
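The modified placement loop, which excludes co-located VMs after each chosen target, can be sketched as follows. The data-structure shape (one inner list per physical machine) follows the table described above; the class and method names are hypothetical, not the actual HDFS code.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of cloud-aware replica placement: after each chosen target,
// all VMs on the same physical machine are excluded from later choices.
public class CloudAwareChooser {
    // groups: each inner list holds the VM IPs of one physical machine.
    public static List<String> chooseTargets(List<List<String>> groups,
                                             int replicationFactor) {
        List<String> targets = new ArrayList<>();
        Set<String> excluded = new HashSet<>();
        for (List<String> group : groups) {
            if (targets.size() == replicationFactor) break;
            for (String vm : group) {
                if (!excluded.contains(vm)) {
                    targets.add(vm);           // pick one VM from this host...
                    excluded.addAll(group);    // ...then exclude its co-located VMs
                    break;
                }
            }
        }
        return targets;
    }
}
```

If the replication factor exceeds the number of physical machines, this sketch simply returns fewer targets; the text notes that this is the only case in which replicas of a block may have to share a physical machine.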
Each time it starts looking for a new target for a block replica, the Chooser finds, from the corresponding data structure, the virtual machines that belong to the same physical machine as an already chosen target for the same block. If there are any, the Chooser excludes these virtual machines from the next choices. As a result, all the replicas of the same block are stored on different physical machines. Only if the replication factor exceeds the number of available physical machines can replicas of the same block be stored on the same physical machine.

5 Shortcomings

Despite the advantages of the powerful, flexible, and scalable distribution we present, we also enumerate a list of possible shortcomings. These mainly concern the performance of the interaction between HDFS and the web service, and the complexity of maintaining and searching the data structure holding the mappings between VMs and physical machines. We evaluated our approach on a cluster of only four VMs, where the shortcomings are not visible, but they become quite obvious if one imagines a huge number of VMs and physical machines.
In more detail, as we presented in Section 4, we use a SOAP-based web service to export the mapping of VMs and physical machines from the cloud provider to the client, called by the NameNode. The returned result is an encoded String. After the client gets the result, it restructures the content into an ArrayList<ArrayList<String>>. The complexity of this restructuring is O(N), where N is the number of VMs; with a huge number of VMs this increases the time needed to create replicas. Another complexity issue arises every time the client wants to find, through that data structure, the VMs that belong to the same physical machine: the complexity of this search is also O(N). As a result, with a huge number of VMs the time to create replicas again increases. The last issue is that the client calls the cloud provider every time it wants to make new replicas of a data block. Combined with the above shortcomings, this can result in much lower performance of the Replication Target Chooser. A better approach would be for the client to keep creating replicas based on its existing data structure of correlations between VMs and physical machines until the cloud provider informs it that something has changed in the network topology, at which point it updates its data structure. In Section 6 we present some possible solutions to remove part of these drawbacks.

6 Future Work

In this section we discuss future approaches that will improve our work by eliminating the shortcomings enumerated in Section 5. Furthermore, we present the idea of modifying Hadoop's Balancer to work on a cloud. First of all, an important issue is to change the existing web service, which is based on the SOAP model, to another model in order to avoid the cost of restructuring data on the client's side. This can make replication much faster than in our existing approach.
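The O(N) search noted in Section 5 could also be avoided by indexing the grouping by VM IP once, then answering co-location queries in constant time. A minimal sketch of such an index, with illustrative names:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of an O(1) co-location lookup: a hash map from each VM IP to
// the index of its physical-machine group, built once from the mapping.
public class GroupIndex {
    private final Map<String, Integer> ipToGroup = new HashMap<>();
    private final List<List<String>> groups;

    public GroupIndex(List<List<String>> groups) {
        this.groups = groups;
        for (int i = 0; i < groups.size(); i++) {
            for (String ip : groups.get(i)) {
                ipToGroup.put(ip, i);   // O(N) once, at build time
            }
        }
    }

    // O(1) expected time: all VMs on the same physical machine as vmIp.
    public List<String> coLocated(String vmIp) {
        Integer idx = ipToGroup.get(vmIp);
        return idx == null ? List.of() : groups.get(idx);
    }
}
```

The one-time O(N) build cost is paid when the mapping is (re)fetched, not on every replica placement.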
As we mentioned before, another problem is the cost of searching our structure for the VMs that belong to the same physical machine. A more efficient data structure could help to avoid this cost. For example, instead of the presented data structure we could use an ArrayList<ArrayList<MyClass>>, where MyClass contains a VM's IP address and the index of the ArrayList in which the address is placed. With this kind of approach we are able to find the VMs that belong to the same physical machine with complexity O(1).

Hadoop's Balancer is an administrative tool that balances [ref] disk space usage on an HDFS cluster. The tool is deployed as an application program that can be run by the cluster administrator. It iteratively moves replicas from DataNodes with higher utilization to DataNodes with lower utilization. The Balancer guarantees that its decisions do not impact the storage and replication strategy. Our approach will focus on using the information about the correlations between VMs and physical machines in order to move replicas to VMs that do not belong
to the same physical machine. This modification would allow the Balancer to work more efficiently on the cloud.

7 Conclusions

In this report we present an extension of the Hadoop Distributed File System that can be deployed more efficiently on a cloud. The core idea, in order to become more scalable, powerful, and flexible, is a better distribution of data replicas across the machines that constitute a virtual cluster on a cloud. Each physical machine of the cloud can run tens of virtual machines. In our approach we distribute the data blocks across the physical machines, not just across the virtual machines. In the latter case it is possible for all the replicas of a block to be stored on different VMs but on the same physical machine, and a failure of this machine results in the loss of all replicas. To apply the above distribution, information about the correlation between the VMs and the physical machines is required. This information must be exported by the cloud provider, so we also present a solution that builds a web service to gather this information and send it to the NameNode of HDFS.

References

1. Apache Hadoop.
2. Applications and organizations using Hadoop.
3. Eucalyptus Systems.
4. Hadoop credits page.
5. Yahoo! launches world's largest Hadoop production application.
6. D. Nurmi, R. Wolski, C. Grzegorczyk, G. Obertelli, S. Soman, L. Youseff, and D. Zagorodnov. The Eucalyptus open-source cloud-computing system. In Proceedings of the IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGRID '09), Washington, DC, USA, 2009. IEEE Computer Society.
7. K. Shvachko, H. Kuang, S. Radia, and R. Chansler. The Hadoop Distributed File System. In Proceedings of the 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST '10), pages 1-10, Washington, DC, USA, 2010. IEEE Computer Society.
8. J. Venner. Pro Hadoop. Apress, 2009.
9. T. White. Hadoop: The Definitive Guide. O'Reilly Media / Yahoo! Press, 2009.
Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components of Hadoop. We will see what types of nodes can exist in a Hadoop
More informationOptimizing the Storage of Massive Electronic Pedigrees in HDFS
IoT 2012 1569619175 Optimizing the Storage of Massive Electronic Pedigrees in HDFS Yin Zhang 1, Weili Han 1,2, Wei Wang 1,Chang Lei 1 1. Software School, Fudan University, Shanghai, China 2. DNSLAB, China
More informationHDFS: Hadoop Distributed File System
Istanbul Şehir University Big Data Camp 14 HDFS: Hadoop Distributed File System Aslan Bakirov Kevser Nur Çoğalmış Agenda Distributed File System HDFS Concepts HDFS Interfaces HDFS Full Picture Read Operation
More informationSLA-aware Resource Scheduling for Cloud Storage
SLA-aware Resource Scheduling for Cloud Storage Zhihao Yao Computer and Information Technology Purdue University West Lafayette, Indiana 47906 Email: yao86@purdue.edu Ioannis Papapanagiotou Computer and
More informationHadoop Distributed File System Propagation Adapter for Nimbus
University of Victoria Faculty of Engineering Coop Workterm Report Hadoop Distributed File System Propagation Adapter for Nimbus Department of Physics University of Victoria Victoria, BC Matthew Vliet
More informationMyCloudLab: An Interactive Web-based Management System for Cloud Computing Administration
MyCloudLab: An Interactive Web-based Management System for Cloud Computing Administration Hoi-Wan Chan 1, Min Xu 2, Chung-Pan Tang 1, Patrick P. C. Lee 1 & Tsz-Yeung Wong 1, 1 Department of Computer Science
More informationFault Tolerance in Hadoop for Work Migration
1 Fault Tolerance in Hadoop for Work Migration Shivaraman Janakiraman Indiana University Bloomington ABSTRACT Hadoop is a framework that runs applications on large clusters which are built on numerous
More informationGeneric Log Analyzer Using Hadoop Mapreduce Framework
Generic Log Analyzer Using Hadoop Mapreduce Framework Milind Bhandare 1, Prof. Kuntal Barua 2, Vikas Nagare 3, Dynaneshwar Ekhande 4, Rahul Pawar 5 1 M.Tech(Appeare), 2 Asst. Prof., LNCT, Indore 3 ME,
More informationParallel Processing of cluster by Map Reduce
Parallel Processing of cluster by Map Reduce Abstract Madhavi Vaidya, Department of Computer Science Vivekanand College, Chembur, Mumbai vamadhavi04@yahoo.co.in MapReduce is a parallel programming model
More informationAccelerating and Simplifying Apache
Accelerating and Simplifying Apache Hadoop with Panasas ActiveStor White paper NOvember 2012 1.888.PANASAS www.panasas.com Executive Overview The technology requirements for big data vary significantly
More informationHadoop & its Usage at Facebook
Hadoop & its Usage at Facebook Dhruba Borthakur Project Lead, Hadoop Distributed File System dhruba@apache.org Presented at the Storage Developer Conference, Santa Clara September 15, 2009 Outline Introduction
More informationLSKA 2010 Survey Report I Device Drivers & Cloud Computing
LSKA 2010 Survey Report I Device Drivers & Cloud Computing Yu Huang and Hao-Chung Yang {r98922015, r98944016}@csie.ntu.edu.tw Department of Computer Science and Information Engineering March 31, 2010 Abstract
More informationHADOOP MOCK TEST HADOOP MOCK TEST I
http://www.tutorialspoint.com HADOOP MOCK TEST Copyright tutorialspoint.com This section presents you various set of Mock Tests related to Hadoop Framework. You can download these sample mock tests at
More informationHadoop: Embracing future hardware
Hadoop: Embracing future hardware Suresh Srinivas @suresh_m_s Page 1 About Me Architect & Founder at Hortonworks Long time Apache Hadoop committer and PMC member Designed and developed many key Hadoop
More informationIntroduction to Cloud Computing
Introduction to Cloud Computing Cloud Computing I (intro) 15 319, spring 2010 2 nd Lecture, Jan 14 th Majd F. Sakr Lecture Motivation General overview on cloud computing What is cloud computing Services
More informationRole of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop
Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop Kanchan A. Khedikar Department of Computer Science & Engineering Walchand Institute of Technoloy, Solapur, Maharashtra,
More informationData-Intensive Computing with Map-Reduce and Hadoop
Data-Intensive Computing with Map-Reduce and Hadoop Shamil Humbetov Department of Computer Engineering Qafqaz University Baku, Azerbaijan humbetov@gmail.com Abstract Every day, we create 2.5 quintillion
More informationLecture 2 (08/31, 09/02, 09/09): Hadoop. Decisions, Operations & Information Technologies Robert H. Smith School of Business Fall, 2015
Lecture 2 (08/31, 09/02, 09/09): Hadoop Decisions, Operations & Information Technologies Robert H. Smith School of Business Fall, 2015 K. Zhang BUDT 758 What we ll cover Overview Architecture o Hadoop
More informationOpen source Google-style large scale data analysis with Hadoop
Open source Google-style large scale data analysis with Hadoop Ioannis Konstantinou Email: ikons@cslab.ece.ntua.gr Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory School of Electrical
More informationHadoop Scalability at Facebook. Dmytro Molkov (dms@fb.com) YaC, Moscow, September 19, 2011
Hadoop Scalability at Facebook Dmytro Molkov (dms@fb.com) YaC, Moscow, September 19, 2011 How Facebook uses Hadoop Hadoop Scalability Hadoop High Availability HDFS Raid How Facebook uses Hadoop Usages
More informationA STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS
A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS Dr. Ananthi Sheshasayee 1, J V N Lakshmi 2 1 Head Department of Computer Science & Research, Quaid-E-Millath Govt College for Women, Chennai, (India)
More informationHADOOP MOCK TEST HADOOP MOCK TEST II
http://www.tutorialspoint.com HADOOP MOCK TEST Copyright tutorialspoint.com This section presents you various set of Mock Tests related to Hadoop Framework. You can download these sample mock tests at
More informationRecognization of Satellite Images of Large Scale Data Based On Map- Reduce Framework
Recognization of Satellite Images of Large Scale Data Based On Map- Reduce Framework Vidya Dhondiba Jadhav, Harshada Jayant Nazirkar, Sneha Manik Idekar Dept. of Information Technology, JSPM s BSIOTR (W),
More informationIJFEAT INTERNATIONAL JOURNAL FOR ENGINEERING APPLICATIONS AND TECHNOLOGY
IJFEAT INTERNATIONAL JOURNAL FOR ENGINEERING APPLICATIONS AND TECHNOLOGY Hadoop Distributed File System: What and Why? Ashwini Dhruva Nikam, Computer Science & Engineering, J.D.I.E.T., Yavatmal. Maharashtra,
More informationThe Recovery System for Hadoop Cluster
The Recovery System for Hadoop Cluster Prof. Priya Deshpande Dept. of Information Technology MIT College of engineering Pune, India priyardeshpande@gmail.com Darshan Bora Dept. of Information Technology
More informationA Cost-Evaluation of MapReduce Applications in the Cloud
1/23 A Cost-Evaluation of MapReduce Applications in the Cloud Diana Moise, Alexandra Carpen-Amarie Gabriel Antoniu, Luc Bougé KerData team 2/23 1 MapReduce applications - case study 2 3 4 5 3/23 MapReduce
More informationDistributed Filesystems
Distributed Filesystems Amir H. Payberah Swedish Institute of Computer Science amir@sics.se April 8, 2014 Amir H. Payberah (SICS) Distributed Filesystems April 8, 2014 1 / 32 What is Filesystem? Controls
More informationHDFS Federation. Sanjay Radia Founder and Architect @ Hortonworks. Page 1
HDFS Federation Sanjay Radia Founder and Architect @ Hortonworks Page 1 About Me Apache Hadoop Committer and Member of Hadoop PMC Architect of core-hadoop @ Yahoo - Focusing on HDFS, MapReduce scheduler,
More informationHadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh
1 Hadoop: A Framework for Data- Intensive Distributed Computing CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 2 What is Hadoop? Hadoop is a software framework for distributed processing of large datasets
More informationAPACHE HADOOP JERRIN JOSEPH CSU ID#2578741
APACHE HADOOP JERRIN JOSEPH CSU ID#2578741 CONTENTS Hadoop Hadoop Distributed File System (HDFS) Hadoop MapReduce Introduction Architecture Operations Conclusion References ABSTRACT Hadoop is an efficient
More informationIntroduction to HDFS. Prasanth Kothuri, CERN
Prasanth Kothuri, CERN 2 What s HDFS HDFS is a distributed file system that is fault tolerant, scalable and extremely easy to expand. HDFS is the primary distributed storage for Hadoop applications. Hadoop
More informationDATA SECURITY MODEL FOR CLOUD COMPUTING
DATA SECURITY MODEL FOR CLOUD COMPUTING POOJA DHAWAN Assistant Professor, Deptt of Computer Application and Science Hindu Girls College, Jagadhri 135 001 poojadhawan786@gmail.com ABSTRACT Cloud Computing
More informationHypertable Architecture Overview
WHITE PAPER - MARCH 2012 Hypertable Architecture Overview Hypertable is an open source, scalable NoSQL database modeled after Bigtable, Google s proprietary scalable database. It is written in C++ for
More informationPerformance Analysis of Book Recommendation System on Hadoop Platform
Performance Analysis of Book Recommendation System on Hadoop Platform Sugandha Bhatia #1, Surbhi Sehgal #2, Seema Sharma #3 Department of Computer Science & Engineering, Amity School of Engineering & Technology,
More informationWeekly Report. Hadoop Introduction. submitted By Anurag Sharma. Department of Computer Science and Engineering. Indian Institute of Technology Bombay
Weekly Report Hadoop Introduction submitted By Anurag Sharma Department of Computer Science and Engineering Indian Institute of Technology Bombay Chapter 1 What is Hadoop? Apache Hadoop (High-availability
More informationEvaluating Cassandra Data-sets with Hadoop Approaches
Evaluating Cassandra Data-sets with Hadoop Approaches Ruchira A. Kulkarni Student (BE), Computer Science & Engineering Department, Shri Sant Gadge Baba College of Engineering & Technology, Bhusawal, India
More informationA Multilevel Secure MapReduce Framework for Cross-Domain Information Sharing in the Cloud
A Multilevel Secure MapReduce Framework for Cross-Domain Information Sharing in the Cloud Thuy D. Nguyen, Cynthia E. Irvine, Jean Khosalim Department of Computer Science Ground System Architectures Workshop
More informationHDFS Architecture Guide
by Dhruba Borthakur Table of contents 1 Introduction... 3 2 Assumptions and Goals... 3 2.1 Hardware Failure... 3 2.2 Streaming Data Access...3 2.3 Large Data Sets... 3 2.4 Simple Coherency Model...3 2.5
More informationPrepared By : Manoj Kumar Joshi & Vikas Sawhney
Prepared By : Manoj Kumar Joshi & Vikas Sawhney General Agenda Introduction to Hadoop Architecture Acknowledgement Thanks to all the authors who left their selfexplanatory images on the internet. Thanks
More informationIntroduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data
Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give
More informationLecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop
Lecture 32 Big Data 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop 1 2 Big Data Problems Data explosion Data from users on social
More informationAn Industrial Perspective on the Hadoop Ecosystem. Eldar Khalilov Pavel Valov
An Industrial Perspective on the Hadoop Ecosystem Eldar Khalilov Pavel Valov agenda 03.12.2015 2 agenda Introduction 03.12.2015 2 agenda Introduction Research goals 03.12.2015 2 agenda Introduction Research
More informationDATA MINING WITH HADOOP AND HIVE Introduction to Architecture
DATA MINING WITH HADOOP AND HIVE Introduction to Architecture Dr. Wlodek Zadrozny (Most slides come from Prof. Akella s class in 2014) 2015-2025. Reproduction or usage prohibited without permission of
More informationLOCATION-AWARE REPLICATION IN VIRTUAL HADOOP ENVIRONMENT. A Thesis by. UdayKiran RajuladeviKasi. Bachelor of Technology, JNTU, 2008
LOCATION-AWARE REPLICATION IN VIRTUAL HADOOP ENVIRONMENT A Thesis by UdayKiran RajuladeviKasi Bachelor of Technology, JNTU, 2008 Submitted to Department of Electrical Engineering and Computer Science and
More informationApache Hadoop FileSystem Internals
Apache Hadoop FileSystem Internals Dhruba Borthakur Project Lead, Apache Hadoop Distributed File System dhruba@apache.org Presented at Storage Developer Conference, San Jose September 22, 2010 http://www.facebook.com/hadoopfs
More information5 HDFS - Hadoop Distributed System
5 HDFS - Hadoop Distributed System 5.1 Definition and Remarks HDFS is a file system designed for storing very large files with streaming data access patterns running on clusters of commoditive hardware.
More informationBig Data and Hadoop. Sreedhar C, Dr. D. Kavitha, K. Asha Rani
Big Data and Hadoop Sreedhar C, Dr. D. Kavitha, K. Asha Rani Abstract Big data has become a buzzword in the recent years. Big data is used to describe a massive volume of both structured and unstructured
More informationNetwork-Aware Scheduling of MapReduce Framework on Distributed Clusters over High Speed Networks
Network-Aware Scheduling of MapReduce Framework on Distributed Clusters over High Speed Networks Praveenkumar Kondikoppa, Chui-Hui Chiu, Cheng Cui, Lin Xue and Seung-Jong Park Department of Computer Science,
More informationINTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY
INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY A PATH FOR HORIZING YOUR INNOVATIVE WORK A REVIEW ON BIG DATA MANAGEMENT AND ITS SECURITY PRUTHVIKA S. KADU 1, DR. H. R.
More informationApache HBase. Crazy dances on the elephant back
Apache HBase Crazy dances on the elephant back Roman Nikitchenko, 16.10.2014 YARN 2 FIRST EVER DATA OS 10.000 nodes computer Recent technology changes are focused on higher scale. Better resource usage
More informationpwalrus: Towards Better Integration of Parallel File Systems into Cloud Storage
: Towards Better Integration of Parallel File Systems into Storage Yoshihisa Abe and Garth Gibson Department of Computer Science Carnegie Mellon University Pittsburgh, PA, USA {yoshiabe, garth}@cs.cmu.edu
More informationCLOUD COMPUTING USING HADOOP TECHNOLOGY
CLOUD COMPUTING USING HADOOP TECHNOLOGY DHIRAJLAL GANDHI COLLEGE OF TECHNOLOGY SALEM B.NARENDRA PRASATH S.PRAVEEN KUMAR 3 rd year CSE Department, 3 rd year CSE Department, Email:narendren.jbk@gmail.com
More informationHadoop Data Replication in HDFS
2741 QoS-Aware Data Replication in Hadoop Distributed File System Dr. Sunita Varma Department of ComputerTechnology and Application S. G. S. I. T. S. Indore, (M. P.), India sunita.varma19@gmail.com Ms.
More information