Review of Limitations on Namespace Distribution for Cloud Filesystems
ICT Innovations 2012 Web Proceedings

Review of Limitations on Namespace Distribution for Cloud Filesystems

Genti Daci, Frida Gjermeni
Polytechnic University of Tirana, Faculty of Information Technology, Tirana, Albania
[email protected], [email protected]

Abstract. There are many challenges today in storing, processing and transferring intensive amounts of data in a distributed, large-scale environment such as a cloud computing system, and Apache Hadoop is a recent, well-known platform that provides such services. Hadoop is organized around two key components: the Hadoop Distributed File System (HDFS) for file storage and MapReduce, a distributed processing system for data-intensive cloud applications. The main features of this structure are scalability, fault tolerance, efficiency and high performance for the whole system. Hadoop today supports scientific applications in fields such as high-energy physics, astronomy, genetics and even meteorology, and many organizations, including Yahoo! and Facebook, have successfully adopted Hadoop as their internal distributed platform. However, the HDFS architecture relies on a master/slave model in which a single NameNode server is responsible for managing the namespace and all metadata operations in the file system. This poses a limit on its growth and ability to scale, because the namespace must fit in the RAM available on the single namespace server; the same internal organization also makes the NameNode a bottleneck resource. Starting from the limitations of the HDFS architecture and its resource-distribution challenges, this paper reviews solutions for addressing those limitations by discussing and comparing two file systems: the Ceph Distributed File System and the Scalable Distributed File System. We discuss and evaluate their main features, strengths and weaknesses with regard to increased scalability and performance.
We also analyse and compare, from a qualitative standpoint, the algorithms implemented in these schemes: CRUSH and RADOS for Ceph, and the RA and RM algorithms for SDFS.

Keywords: algorithms, cloud filesystems, Hadoop Distributed File System, performance, Ceph File System, Scalable Distributed File System

1 Introduction

Apache Hadoop is a well-known platform suited to storing, processing and transferring large data sets in a cloud environment. The compact organization realised by the Hadoop Distributed File System (HDFS) [2] for storing data of varying complexity and variety, together with MapReduce [3] for processing intensive distributed data applications, gives Hadoop a fundamental role in the changing way information is stored and processed today. Designed to run on low-cost hardware, HDFS is highly fault tolerant thanks to its replication policies for recovering data blocks after hardware failures or data corruption. Efficiency and high performance come from combining these replication policies with streaming, write-once-read-many [2] access to data files. One of the main features of HDFS is its ability to scale across hundreds of nodes in cloud computing systems. This is achieved by decoupling metadata operations from data operations, since metadata operations are usually much faster than data operations. MapReduce exploits the scalability of the file system [7] by moving computation to the data a given application is interested in. Several important fields benefit from HDFS services, such as astronomy, mathematics and genetics [9], as well as economic applications in online commerce and fraud detection [7]. Important organizations such as Yahoo!, Facebook and eBay have successfully adopted Hadoop as their internal distributed platform. The HDFS structure is based on the master/slave model, where a single NameNode server (the master) manages the entire namespace, all metadata operations and client access to data files. Data files are split into blocks that are stored on DataNodes, usually one per node in a cluster. For speed, the whole file system metadata is kept in the NameNode's main memory.
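As a rough illustration of this RAM ceiling, consider a back-of-envelope sketch. The per-object byte cost, blocks-per-file ratio and block size below are our assumptions in the spirit of Shvachko's analysis [10], not figures from this paper:

```python
# Back-of-envelope estimate of how many namespace objects a single
# NameNode can hold in RAM. All constants are rough assumptions.

BYTES_PER_OBJECT = 200        # assumed heap cost of one inode or block record
BLOCKS_PER_FILE = 1.5         # assumed average number of blocks per file
BLOCK_SIZE = 64 * 1024**2     # assumed 64 MB block size
REPLICATION = 3               # default HDFS replication factor

def namespace_capacity(heap_bytes):
    """Return (files, raw_storage_bytes) addressable with the given heap."""
    objects = heap_bytes // BYTES_PER_OBJECT          # inodes + block records
    files = int(objects / (1 + BLOCKS_PER_FILE))
    raw = files * BLOCKS_PER_FILE * BLOCK_SIZE * REPLICATION
    return files, raw

# A 64 GB NameNode heap bounds both file count and total cluster storage:
files, raw = namespace_capacity(64 * 1024**3)
print(f"~{files / 1e6:.0f}M files, ~{raw / 1024**5:.1f} PiB of replicated storage")
```

Whatever the exact constants, the shape of the result is the same: the namespace, and with it the whole cluster, can only grow as far as one server's heap allows.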
This implies a limit on the capacity of the system to grow and scale beyond the number of nodes supported by the amount of RAM available in the NameNode server. The presence of this single server as the only metadata source also creates a bottleneck resource and a single point of failure if it goes down. The same scenario can occur when the number of concurrent metadata requests addressed to the namespace server reaches the limits of its capacity. The network bandwidth available for exchanging data between the actors in the file system is a further limiting factor for scalability and overall performance. The contributions of this paper are: 1) a review of the Ceph distributed file system and the new Scalable Distributed File System (SDFS) with regard to growth and scalability limits; 2) a comparison of their features, strengths and weaknesses; and 3) a qualitative analysis and comparison of the algorithms implemented in these schemes: CRUSH and RADOS for Ceph, and RA and RM for SDFS. The rest of the paper is organized as follows: Section 2 surveys related work. Section 3 describes the limitations of the existing HDFS architecture with regard to scalability, and existing attempts at improvement. Section 4 discusses and compares the features of the Ceph and SDFS file systems. Section 5 compares, from a qualitative standpoint, the algorithms implemented in Ceph (CRUSH, RADOS) and in SDFS (RA, RM). Section 6 concludes.

2 Related Work

A lot of work has been done in recent years in the field of distributed networked file systems. The Andrew File System (AFS) [15] is a distributed file system that presents several advantages over traditional networked file systems, especially in security and scalability. It uses a cluster of servers to provide a homogeneous, location-transparent namespace to all clients that interact with it. Although it can handle large numbers of file requests, it may not be convenient for large-scale and scientific workloads like those handled by HDFS.
The Network File System (NFS) [16] is another way to share files over a network. Its disadvantage is that it oversimplifies everything by making a remote file system appear as a local one; in situations like those handled by HDFS it would not be appropriate or even reliable. Other file systems used in distributed environments are Frangipani [17] and InterMezzo [18]. Frangipani is a distributed file system in which the disks of multiple machines are stored and managed as a single shared pool. Thanks to its simple internal structure, system recovery and load balancing are easily managed. In InterMezzo, local file systems serve as server storage and client cache, and the kernel file system driver is a wrapper around the local file system. These file systems do not offer good resource allocation with respect to dynamic link, storage and processing capacities. Another very well-known file system is the Google File System (GFS) [6], which resembles HDFS; a namespace distribution for it has recently been announced, although no public information has been provided yet.
3 HDFS and Scalability

The two main actors in the HDFS architecture execute different operations over the file system. Operations such as opening, closing and renaming the files and directories of the namespace are executed by the NameNode server, whereas read and write operations, which are in fact client requests over the data itself, are served by the DataNodes. As mentioned above, there are serious scalability limitations caused by the single NameNode server and the amount of RAM it has available for namespace storage. Furthermore, the work of Shvachko et al. [10] shows that 30% of the NameNode's processing capacity is dedicated to internal load, such as processing block reports and heartbeats [2]. The remaining 70% can easily be exhausted by the writers of a 10,000-node HDFS cluster, as calculated in [10]; the consequence would be a bottleneck and NameNode saturation. Any attempt to increase the number of DataNode servers, the number of files stored on the DataNodes, or the volume of client requests beyond the capacity of the single namespace server exposes a single point of failure in the file system. When a client requests access to a file for a read or write operation, it first contacts the namespace server for the locations, on the DataNodes, of the data blocks forming the desired file. Then the transfer to or from the DataNodes takes place, as described in Fig. 1.

Fig. 1. HDFS architecture, from [1]

Eventually, distributing the namespace server and decentralising the existing master/slave architecture becomes inevitable. That said, not all approaches that introduce several namespace servers are successful. One example is HDFS Federation [8], in which the number of clients and the amount of data stored can grow thanks to the presence of several NameNodes. But its static partitioning prevents redistribution when the growth of a volume reaches the capacity of its NameNode, so we again arrive at a lack of scalability. Whatever the system's need to distribute the namespace across a cluster of NameNode servers, the cost in money and time (even years, for building such a distributed environment) has to be considered. This means we have to think
about a distributed namespace rather than a distributed environment of NameNode servers. HBase [1] is such an approach, but even here the algorithms required for namespace partitioning, and the issue of renaming, remain hot topics of discussion.

4 Solutions Based on Distributed File Systems

In this section we analyse two file systems that address the HDFS limitations discussed above. The main features, strengths and weaknesses of the Ceph Distributed File System and the Scalable Distributed File System (SDFS) are examined with regard to namespace distribution, scalability improvement, higher performance and increased fault tolerance.

4.1 Ceph [11]: How and Why

Ceph is an open-source platform [6]. How can Ceph address the scalability limitations of HDFS? Simply by using Ceph as the file system for the Hadoop platform, as explained in [6]. Ceph is an object-based parallel file system whose internal organization [6] fulfills the main requirements HDFS places on its file system. One of the main strengths of Ceph is the distribution of all metadata operations and the file system namespace across a cluster of metadata servers (MDSs). The distribution is performed dynamically by the Dynamic Subtree Partitioning algorithm. Each MDS measures the popularity of metadata using counters: any operation that touches an inode increments counters on that inode and on all of its ancestors up to the root. This gives each MDS a weighted view of the recent load distribution. The resulting values are periodically compared and, as a result, appropriately sized subtrees are migrated to keep the workload evenly distributed. Besides balancing the workload, this also prevents hot spots for metadata operation requests; scalability and overall performance clearly benefit. When a client wants to interact with the file system, it first contacts an MDS, as can be seen in Fig. 2.
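The popularity-counter scheme just described can be sketched in a few lines. This is an illustrative simplification of ours, not Ceph's actual implementation (real Ceph uses exponentially decaying counters, among other refinements):

```python
# Sketch of Dynamic Subtree Partitioning's popularity counters: every
# metadata operation on an inode is charged to that inode and to all
# of its ancestors up to the root.

class Inode:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.counter = 0          # plain counter here; decaying in real Ceph

def record_op(inode):
    """Charge one metadata operation to the inode and all its ancestors."""
    node = inode
    while node is not None:
        node.counter += 1
        node = node.parent

root = Inode("/")
home = Inode("home", root)
alice = Inode("alice", home)

for _ in range(3):
    record_op(alice)              # three metadata ops on /home/alice

# Load is visible at every level of the hierarchy, so a busy subtree
# (here /home) can be identified and migrated to a less loaded MDS.
print(root.counter, home.counter, alice.counter)   # 3 3 3
```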
In response, the MDS returns the i-node number, the replication factor and the striping strategy of the requested file. The client can then calculate on its own the placement of the data objects and replicas of the requested file. OSDs (Object Storage Devices) are intelligent storage devices that store both data and metadata, but thanks to RADOS (Reliable Autonomic Distributed Object Store) [13] (also discussed in Section 5) they are seen as a single logical object store by the clients and MDSs (as explained in Fig. 2). RADOS is a key component of Ceph because it provides every communication initiator with a cluster map containing information about the OSDs, their status, and CRUSH (Controlled Replication Under Scalable Hashing) [12]. CRUSH [12] solves the problem of data distribution and location: it eliminates allocation tables by giving every party the ability to
calculate the data location. Despite the strengths RADOS and CRUSH offer, Ceph is still in an experimental phase. Besides its advantages in scalability, it introduces a higher degree of complexity into the file system, since it introduces objects as the storage entities; the abstraction that derives from this should not be overlooked.

Fig. 2. Internal architecture of Ceph, from [11]

4.2 Scalable Distributed File System: Another Alternative

The Scalable Distributed File System (SDFS) [4] is another distributed file system that attempts to resolve the limitations of the HDFS platform. One of the main innovations of SDFS is the introduction of a lightweight front-end server (FES) as the connection point between the NameNode servers (NNSs) and client requests. The NNS handling the current request is chosen by hashing the request ID. From a certain point of view this solution distributes the metadata workload, but how many client requests can the FES handle at once [4]? Would this not become another critical point for the overall performance of the file system? Even granting that the front-end server does not itself cause a resource bottleneck, since it is stateless, what about the load on a single NNS [4]? No information is provided about how many concurrent requests an NNS can handle. The questions raised above represent fragile points that need further development or alternative solutions. A strong point of SDFS is the introduction of a communication protocol that achieves better link utilisation during file system operations. This is realised by the Resource Allocator (RA) and Resource Monitor (RM) algorithms, which are discussed further in the next section. After a client connects to the FES, it is the FES's responsibility to find the appropriate NNS to handle the client's request. As can be seen in Fig. 3, the NNS finds the best block server (BS) after consulting the RA and the RM.
The RA acts like a software router, helped by the RM, which supplies the metric of every BS it is associated with. This design provides increased scalability deriving from full link utilisation and rational resource allocation. By the end of this section we can conclude that namespace distribution is the key to scalability improvement for cloud file systems.
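The SDFS request path described above can be sketched as follows. This is a hypothetical illustration: the function names, the hash choice and the "best metric" rule are our assumptions, not details from the SDFS paper [4]:

```python
# Hypothetical sketch of SDFS request routing: the stateless FES hashes
# the request ID onto an NNS, and the NNS then picks the block server
# (BS) with the best metric reported by the RM.

import hashlib

NNS_CLUSTER = ["nns-0", "nns-1", "nns-2"]

def pick_nns(request_id: str) -> str:
    """FES step: a stateless hash of the request ID selects an NNS."""
    digest = hashlib.sha1(request_id.encode()).hexdigest()
    return NNS_CLUSTER[int(digest, 16) % len(NNS_CLUSTER)]

# The RM periodically reports a per-BS metric (e.g. available link
# capacity, normalised to [0, 1]); the RA routes to the best BS.
bs_metrics = {"bs-0": 0.31, "bs-1": 0.87, "bs-2": 0.55}

def pick_bs(metrics: dict) -> str:
    """RA step: choose the block server with the highest reported metric."""
    return max(metrics, key=metrics.get)

nns = pick_nns("client-42:open:/data/logs")
bs = pick_bs(bs_metrics)
print(nns, bs)   # bs is "bs-1", the server with the best metric
```

Because the FES keeps no state, any FES replica computes the same NNS for the same request ID, which is what makes it fast to replace after a failure.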
Fig. 3. Overview of SDFS, from [4]

5 Comparing File Systems and Internal Algorithms

In this section we compare the main features and internal algorithms of the Ceph and SDFS file systems. While comparing them, we also draw conclusions and suggest further improvements for both.

Decoupled data and metadata. Ceph and SDFS both separate data management from metadata management. Since metadata operations are more complex and represent 30-80% of all file system operations, decoupling them from data operations gives clients faster access for data manipulation, which translates into a performance improvement.

Distribution of metadata operations. Both file systems offer a distributed cluster of servers for namespace management, thereby increasing scalability. SDFS uses hashing [14] to spread the metadata workload across the NameNode servers (NNSs). To some extent the workload is evenly distributed for well-behaved hash functions, but hot spots can still form around individual files located on a single NNS. Moreover, the hashing approach destroys hierarchical locality and the benefits that derive from it. Ceph offers a much better alternative, using dynamic subtree partitioning to distribute the workload across the MDS cluster. As explained above, this algorithm answers the dynamic needs of cloud environments more effectively and offers a better chance to scale.
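The loss of hierarchical locality under hashing can be made concrete. The two placement functions below are our own simplification, not taken from either system, but they show the contrast: hashing full paths scatters sibling files across servers, while subtree-based placement keeps a directory together:

```python
# Why hash-based metadata placement (the SDFS approach [14]) destroys
# hierarchical locality, while subtree placement (the Ceph approach)
# preserves it. Simplified placement functions of our own.

import hashlib

N_SERVERS = 4

def hash_placement(path: str) -> int:
    """Hash the full path: sibling files scatter across servers."""
    return int(hashlib.md5(path.encode()).hexdigest(), 16) % N_SERVERS

def subtree_placement(path: str) -> int:
    """Place by top-level directory: a whole subtree stays together."""
    top = path.lstrip("/").split("/")[0]
    return int(hashlib.md5(top.encode()).hexdigest(), 16) % N_SERVERS

files = [f"/home/alice/file{i:02d}" for i in range(32)]
hash_servers = {hash_placement(p) for p in files}
subtree_servers = {subtree_placement(p) for p in files}

# Listing /home/alice touches many servers under hashing, but exactly
# one under subtree placement (at the cost of possible hot spots there).
print(f"hashing spreads over {len(hash_servers)} servers, "
      f"subtree keeps the directory on {len(subtree_servers)}")
```

The trade-off is visible in both directions: hashing balances load but fragments directories; subtree placement preserves locality but needs the dynamic migration described above to avoid hot spots.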
Data storage. Since Ceph is an object-based file system, the key logical unit of the whole file system is the object. CRUSH [6] is the algorithm that distributes data objects pseudo-randomly across the OSDs. It allows any party to independently calculate the location of data, substituting calculation for lookup and avoiding all the overhead created by consulting allocation tables. Allocation tables are still present in SDFS, where data is stored on the BSs as data blocks directed by the NNSs. In general the NNSs use k-local allocation or even global allocation policies for storing new data blocks. These algorithms offer a reasonably fair distribution but cannot respond to the continuous evolution of cloud computing systems.

Client requests. In SDFS, client requests are handled by the front-end server, which directs them to an appropriate NNS. Even though this server is stateless, and bringing it back up after a failure is very fast, it still becomes a critical point and compromises performance: the number of requests it can handle concurrently is finite, which can limit exactly the scalability we want to increase. Ceph, on the other side, offers an alternative solution provided by RADOS. RADOS equips every client with a global cluster map for directing requests, and it also allows the OSDs to be seen as a single logical store. In this way a critical point like the FES is avoided. A basic advantage of SDFS over Ceph is that it takes full link utilisation into account for every operation in the file system. The RA and RM algorithms provide this by periodically monitoring resources and allocating them fairly on every request. The result is a higher data exchange rate, which leads to increased performance of the whole file system.
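The "calculation instead of lookup" idea behind CRUSH can be illustrated with a much simpler stand-in. The sketch below uses plain rendezvous (highest-random-weight) hashing, which is far less sophisticated than real CRUSH (no placement rules, failure domains or weights), but shows the key property: every party derives the same data location from the same inputs, with no allocation table:

```python
# Simplified CRUSH-style placement: a deterministic function over the
# object name and the OSD list replaces the allocation-table lookup.
# This is rendezvous hashing, a stand-in for (not a model of) CRUSH.

import hashlib

OSDS = ["osd-0", "osd-1", "osd-2", "osd-3", "osd-4"]

def place(object_name: str, replicas: int = 3) -> list:
    """Pick `replicas` OSDs for an object; computable by any client."""
    def score(osd):
        h = hashlib.sha256(f"{object_name}:{osd}".encode()).hexdigest()
        return int(h, 16)
    # The `replicas` highest-scoring OSDs hold the object's copies.
    return sorted(OSDS, key=score, reverse=True)[:replicas]

# Any client, MDS or OSD derives the same answer independently:
assert place("myfile.0001") == place("myfile.0001")
print(place("myfile.0001"))
```

As in CRUSH, only the compact cluster description (here the OSD list) has to be shared; the per-object location metadata that an allocation table would hold simply does not exist.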
We should underline that the Ceph Distributed File System introduces a higher level of abstraction into the system, but also a more intelligent way to store and distribute data than SDFS: Ceph is object-based, while the Scalable Distributed File System still keeps its data as blocks stored on the BSs [4]. This gives Ceph higher performance and an increased ability to scale. At the end of this section we can conclude that both pairs of algorithms (CRUSH and RADOS in Ceph; RA and RM in SDFS) belong to distributed file systems that are widely discussed today as candidates to serve Hadoop for increased scalability and performance. Even though there are substantial differences in how they behave and in the logical data units they operate on, a parallel comparison can be made on the issues of scalability and performance.

6 Conclusions and Future Work

This paper reviewed solutions to the problem of increasing scalability and performance in a well-known cloud file system, the Hadoop Distributed File System. Because a single NameNode server is the source of all metadata operations, all held in a finite amount of RAM, scalability is seriously compromised. Two file systems were analysed and compared, in their main features and internal algorithms, as answers to the problems of HDFS: 1) the Ceph Distributed File System and 2) the Scalable Distributed
File System (SDFS), together with their algorithms: 1) CRUSH and RADOS for Ceph and 2) RA and RM for SDFS. It is clear that to improve scalability the namespace has to be distributed across a cluster of servers, where fair metadata workload policies have to be implemented. Ceph seems to answer these needs better, with the dynamic subtree partitioning algorithm handling namespace distribution dynamically. The CRUSH and RADOS algorithms also handle data storage, client requests and file system operations in a way that suits the continuous evolution of cloud environments. However, the complexity is high, and further development is needed to provide full compatibility with the Hadoop platform. SDFS, on the other side, preserves a more traditional structure, with data stored in blocks and allocation tables used to retrieve them. This brings heavier workloads into the file system and therefore lower performance. An advantage of SDFS over Ceph is its new protocol, realised by the RA and RM algorithms, which achieves nearly full link utilisation and results in faster file system operations; implementing this approach in Ceph should be a subject of further research and development. A weak point of SDFS is the FES server, because of the finite number of concurrent requests it can handle; alternative solutions have to be provided for this scenario.

References

1. Apache, The Apache HBase Book, October 2010: http://hbase.apache.org/book.html
2. D. Borthakur, The Hadoop Distributed File System: Architecture and Design, The Apache Software Foundation, 2007.
3. J. Dean, S. Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, Proceedings of the 6th Symposium on Operating Systems Design and Implementation (OSDI 2004), USENIX Association, 2004.
4. D. Fesehaye, R. Malik, K. Nahrstedt, A Scalable Distributed File System for Cloud Computing, Technical Report, March 2010.
5. M.K. McKusick, S. Quinlan, GFS: Evolution on Fast-forward, ACM Queue, vol. 7, no. 7, ACM, New York, NY, August 2009.
6. C. Maltzahn, E. Molina-Estolano, A. Khurana, A.J. Nelson, S. Brandt, S. Weil, Ceph as a Scalable Alternative to the Hadoop Distributed File System, ;login:, vol. 35, no. 4, pp. 38-49, August 2010.
7. M. Olson, HADOOP: Scalable, Flexible Data Storage and Analysis, IQT Quarterly, vol. 1, no. 3, pp. 14-18, May 2010.
8. S. Radia, S. Srinivas, Scaling HDFS Cluster Using Namenode Federation, HDFS-1052, August 2010: /high-leveldesign.pdf
9. K.V. Shvachko, Apache Hadoop: The Scalability Update, ;login:, vol. 36, no. 3, pp. 7-13, June 2011.
10. K.V. Shvachko, HDFS Scalability: The Limits to Growth, ;login:, vol. 35, no. 2, pp. 6-16, April 2010.
11. S.A. Weil, S.A. Brandt, E.L. Miller, D.D.E. Long, C. Maltzahn, Ceph: A Scalable, High-Performance Distributed File System, Proceedings of the 7th Symposium on Operating Systems Design and Implementation (OSDI 2006), Seattle, WA, November 2006.
12. S.A. Weil, S.A. Brandt, E.L. Miller, C. Maltzahn, CRUSH: Controlled, Scalable, Decentralized Placement of Replicated Data, Proceedings of the 2006 ACM/IEEE Conference on Supercomputing (SC '06), Tampa, FL, November 2006.
13. S.A. Weil, A. Leung, S.A. Brandt, C. Maltzahn, RADOS: A Scalable, Reliable Storage Service for Petabyte-Scale Storage Clusters, Proceedings of the 2007 ACM Petascale Data Storage Workshop (PDSW '07), Reno, NV, November 2007.
14. S.A. Weil, K.T. Pollack, S.A. Brandt, E.L. Miller, Dynamic Metadata Management for Petabyte-Scale File Systems, Proceedings of the 2004 ACM/IEEE Conference on Supercomputing (SC '04), Pittsburgh, PA, November 2004.
15. J. Howard, M. Kazar, S. Menees, D. Nichols, M. Satyanarayanan, R. Sidebotham, M. West, Scale and Performance in a Distributed File System, ACM Transactions on Computer Systems (TOCS), vol. 6, no. 1, 1988.
16. S. Shepler, B. Callaghan, D. Robinson, R. Thurlow, C. Beame, M. Eisler, D. Noveck, Network File System (NFS) Version 4 Protocol, Request for Comments 3530, 2003.
17. C. Thekkath, T. Mann, E. Lee, Frangipani: A Scalable Distributed File System, ACM SIGOPS Operating Systems Review, vol. 31, no. 5, 1997.
18. P. Braam, M. Callahan, P. Schwan, The InterMezzo File System, Proceedings of the 3rd Perl Conference, O'Reilly Open Source Convention, 1999.
Hadoop Architecture. Part 1
Hadoop Architecture Part 1 Node, Rack and Cluster: A node is simply a computer, typically non-enterprise, commodity hardware for nodes that contain data. Consider we have Node 1.Then we can add more nodes,
HDFS Space Consolidation
HDFS Space Consolidation Aastha Mehta*,1,2, Deepti Banka*,1,2, Kartheek Muthyala*,1,2, Priya Sehgal 1, Ajay Bakre 1 *Student Authors 1 Advanced Technology Group, NetApp Inc., Bangalore, India 2 Birla Institute
Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components
Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components of Hadoop. We will see what types of nodes can exist in a Hadoop
Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data
Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give
Survey on Load Rebalancing for Distributed File System in Cloud
Survey on Load Rebalancing for Distributed File System in Cloud Prof. Pranalini S. Ketkar Ankita Bhimrao Patkure IT Department, DCOER, PG Scholar, Computer Department DCOER, Pune University Pune university
Hadoop Distributed File System. Jordan Prosch, Matt Kipps
Hadoop Distributed File System Jordan Prosch, Matt Kipps Outline - Background - Architecture - Comments & Suggestions Background What is HDFS? Part of Apache Hadoop - distributed storage What is Hadoop?
Suresh Lakavath csir urdip Pune, India [email protected].
A Big Data Hadoop Architecture for Online Analysis. Suresh Lakavath csir urdip Pune, India [email protected]. Ramlal Naik L Acme Tele Power LTD Haryana, India [email protected]. Abstract Big Data
HDFS Under the Hood. Sanjay Radia. [email protected] Grid Computing, Hadoop Yahoo Inc.
HDFS Under the Hood Sanjay Radia [email protected] Grid Computing, Hadoop Yahoo Inc. 1 Outline Overview of Hadoop, an open source project Design of HDFS On going work 2 Hadoop Hadoop provides a framework
Hadoop Distributed File System (HDFS) Overview
2012 coreservlets.com and Dima May Hadoop Distributed File System (HDFS) Overview Originals of slides and source code for examples: http://www.coreservlets.com/hadoop-tutorial/ Also see the customized
MANAGEMENT OF DATA REPLICATION FOR PC CLUSTER BASED CLOUD STORAGE SYSTEM
MANAGEMENT OF DATA REPLICATION FOR PC CLUSTER BASED CLOUD STORAGE SYSTEM Julia Myint 1 and Thinn Thu Naing 2 1 University of Computer Studies, Yangon, Myanmar [email protected] 2 University of Computer
Big Data With Hadoop
With Saurabh Singh [email protected] The Ohio State University February 11, 2016 Overview 1 2 3 Requirements Ecosystem Resilient Distributed Datasets (RDDs) Example Code vs Mapreduce 4 5 Source: [Tutorials
A Study on Workload Imbalance Issues in Data Intensive Distributed Computing
A Study on Workload Imbalance Issues in Data Intensive Distributed Computing Sven Groot 1, Kazuo Goda 1, and Masaru Kitsuregawa 1 University of Tokyo, 4-6-1 Komaba, Meguro-ku, Tokyo 153-8505, Japan Abstract.
Storage and Retrieval of Data for Smart City using Hadoop
Storage and Retrieval of Data for Smart City using Hadoop Ravi Gehlot Department of Computer Science Poornima Institute of Engineering and Technology Jaipur, India Abstract Smart cities are equipped with
Apache Hadoop FileSystem and its Usage in Facebook
Apache Hadoop FileSystem and its Usage in Facebook Dhruba Borthakur Project Lead, Apache Hadoop Distributed File System [email protected] Presented at Indian Institute of Technology November, 2010 http://www.facebook.com/hadoopfs
Comparative analysis of Google File System and Hadoop Distributed File System
Comparative analysis of Google File System and Hadoop Distributed File System R.Vijayakumari, R.Kirankumar, K.Gangadhara Rao Dept. of Computer Science, Krishna University, Machilipatnam, India, [email protected]
Hadoop & its Usage at Facebook
Hadoop & its Usage at Facebook Dhruba Borthakur Project Lead, Hadoop Distributed File System [email protected] Presented at the Storage Developer Conference, Santa Clara September 15, 2009 Outline Introduction
http://www.paper.edu.cn
5 10 15 20 25 30 35 A platform for massive railway information data storage # SHAN Xu 1, WANG Genying 1, LIU Lin 2** (1. Key Laboratory of Communication and Information Systems, Beijing Municipal Commission
IJFEAT INTERNATIONAL JOURNAL FOR ENGINEERING APPLICATIONS AND TECHNOLOGY
IJFEAT INTERNATIONAL JOURNAL FOR ENGINEERING APPLICATIONS AND TECHNOLOGY Hadoop Distributed File System: What and Why? Ashwini Dhruva Nikam, Computer Science & Engineering, J.D.I.E.T., Yavatmal. Maharashtra,
Hadoop Distributed File System. Dhruba Borthakur June, 2007
Hadoop Distributed File System Dhruba Borthakur June, 2007 Goals of HDFS Very Large Distributed File System 10K nodes, 100 million files, 10 PB Assumes Commodity Hardware Files are replicated to handle
INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY
INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY A PATH FOR HORIZING YOUR INNOVATIVE WORK A COMPREHENSIVE VIEW OF HADOOP ER. AMRINDER KAUR Assistant Professor, Department
HADOOP MOCK TEST HADOOP MOCK TEST I
http://www.tutorialspoint.com HADOOP MOCK TEST Copyright tutorialspoint.com This section presents you various set of Mock Tests related to Hadoop Framework. You can download these sample mock tests at
HDFS Architecture Guide
by Dhruba Borthakur Table of contents 1 Introduction... 3 2 Assumptions and Goals... 3 2.1 Hardware Failure... 3 2.2 Streaming Data Access...3 2.3 Large Data Sets... 3 2.4 Simple Coherency Model...3 2.5
Distributed Filesystems
Distributed Filesystems Amir H. Payberah Swedish Institute of Computer Science [email protected] April 8, 2014 Amir H. Payberah (SICS) Distributed Filesystems April 8, 2014 1 / 32 What is Filesystem? Controls
NoSQL and Hadoop Technologies On Oracle Cloud
NoSQL and Hadoop Technologies On Oracle Cloud Vatika Sharma 1, Meenu Dave 2 1 M.Tech. Scholar, Department of CSE, Jagan Nath University, Jaipur, India 2 Assistant Professor, Department of CSE, Jagan Nath
Scalable Multiple NameNodes Hadoop Cloud Storage System
Vol.8, No.1 (2015), pp.105-110 http://dx.doi.org/10.14257/ijdta.2015.8.1.12 Scalable Multiple NameNodes Hadoop Cloud Storage System Kun Bi 1 and Dezhi Han 1,2 1 College of Information Engineering, Shanghai
Load Rebalancing for File System in Public Cloud Roopa R.L 1, Jyothi Patil 2
Load Rebalancing for File System in Public Cloud Roopa R.L 1, Jyothi Patil 2 1 PDA College of Engineering, Gulbarga, Karnataka, India [email protected] 2 PDA College of Engineering, Gulbarga, Karnataka,
THE HADOOP DISTRIBUTED FILE SYSTEM
THE HADOOP DISTRIBUTED FILE SYSTEM Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler Presented by Alexander Pokluda October 7, 2013 Outline Motivation and Overview of Hadoop Architecture,
MapReduce and Hadoop Distributed File System
MapReduce and Hadoop Distributed File System 1 B. RAMAMURTHY Contact: Dr. Bina Ramamurthy CSE Department University at Buffalo (SUNY) [email protected] http://www.cse.buffalo.edu/faculty/bina Partially
Reduction of Data at Namenode in HDFS using harballing Technique
Reduction of Data at Namenode in HDFS using harballing Technique Vaibhav Gopal Korat, Kumar Swamy Pamu [email protected] [email protected] Abstract HDFS stands for the Hadoop Distributed File System.
Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh
1 Hadoop: A Framework for Data- Intensive Distributed Computing CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 2 What is Hadoop? Hadoop is a software framework for distributed processing of large datasets
Introduction to Hadoop
Introduction to Hadoop 1 What is Hadoop? the big data revolution extracting value from data cloud computing 2 Understanding MapReduce the word count problem more examples MCS 572 Lecture 24 Introduction
Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware
Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Created by Doug Cutting and Mike Carafella in 2005. Cutting named the program after
SOLVING LOAD REBALANCING FOR DISTRIBUTED FILE SYSTEM IN CLOUD
International Journal of Advances in Applied Science and Engineering (IJAEAS) ISSN (P): 2348-1811; ISSN (E): 2348-182X Vol-1, Iss.-3, JUNE 2014, 54-58 IIST SOLVING LOAD REBALANCING FOR DISTRIBUTED FILE
Big Data Storage Options for Hadoop Sam Fineberg, HP Storage
Sam Fineberg, HP Storage SNIA Legal Notice The material contained in this tutorial is copyrighted by the SNIA unless otherwise noted. Member companies and individual members may use this material in presentations
Apache Hadoop new way for the company to store and analyze big data
Apache Hadoop new way for the company to store and analyze big data Reyna Ulaque Software Engineer Agenda What is Big Data? What is Hadoop? Who uses Hadoop? Hadoop Architecture Hadoop Distributed File
Network-Aware Scheduling of MapReduce Framework on Distributed Clusters over High Speed Networks
Network-Aware Scheduling of MapReduce Framework on Distributed Clusters over High Speed Networks Praveenkumar Kondikoppa, Chui-Hui Chiu, Cheng Cui, Lin Xue and Seung-Jong Park Department of Computer Science,
Highly Available Hadoop Name Node Architecture-Using Replicas of Name Node with Time Synchronization among Replicas
IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661, p- ISSN: 2278-8727Volume 16, Issue 3, Ver. II (May-Jun. 2014), PP 58-62 Highly Available Hadoop Name Node Architecture-Using Replicas
Keywords: Big Data, HDFS, Map Reduce, Hadoop
Volume 5, Issue 7, July 2015 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Configuration Tuning
Big Data Analysis and Its Scheduling Policy Hadoop
IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661,p-ISSN: 2278-8727, Volume 17, Issue 1, Ver. IV (Jan Feb. 2015), PP 36-40 www.iosrjournals.org Big Data Analysis and Its Scheduling Policy
Data-Intensive Computing with Map-Reduce and Hadoop
Data-Intensive Computing with Map-Reduce and Hadoop Shamil Humbetov Department of Computer Engineering Qafqaz University Baku, Azerbaijan [email protected] Abstract Every day, we create 2.5 quintillion
Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA
Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA http://kzhang6.people.uic.edu/tutorial/amcis2014.html August 7, 2014 Schedule I. Introduction to big data
Critical Analysis of Distributed File Systems on Mobile Platforms
Critical Analysis of Distributed File Systems on Mobile Platforms Madhavi Vaidya Dr. Shrinivas Deshpande Abstract -The distributed file system is the central component for storing data infrastructure.
Efficient Data Replication Scheme based on Hadoop Distributed File System
, pp. 177-186 http://dx.doi.org/10.14257/ijseia.2015.9.12.16 Efficient Data Replication Scheme based on Hadoop Distributed File System Jungha Lee 1, Jaehwa Chung 2 and Daewon Lee 3* 1 Division of Supercomputing,
The Hadoop Distributed File System
The Hadoop Distributed File System The Hadoop Distributed File System, Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler, Yahoo, 2010 Agenda Topic 1: Introduction Topic 2: Architecture
GeoGrid Project and Experiences with Hadoop
GeoGrid Project and Experiences with Hadoop Gong Zhang and Ling Liu Distributed Data Intensive Systems Lab (DiSL) Center for Experimental Computer Systems Research (CERCS) Georgia Institute of Technology
Testing of several distributed file-system (HadoopFS, CEPH and GlusterFS) for supporting the HEP experiments analisys. Giacinto DONVITO INFN-Bari
Testing of several distributed file-system (HadoopFS, CEPH and GlusterFS) for supporting the HEP experiments analisys. Giacinto DONVITO INFN-Bari 1 Agenda Introduction on the objective of the test activities
A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM
A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM Sneha D.Borkar 1, Prof.Chaitali S.Surtakar 2 Student of B.E., Information Technology, J.D.I.E.T, [email protected] Assistant Professor, Information
Non-Stop for Apache HBase: Active-active region server clusters TECHNICAL BRIEF
Non-Stop for Apache HBase: -active region server clusters TECHNICAL BRIEF Technical Brief: -active region server clusters -active region server clusters HBase is a non-relational database that provides
Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 14
Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases Lecture 14 Big Data Management IV: Big-data Infrastructures (Background, IO, From NFS to HFDS) Chapter 14-15: Abideboul
The Hadoop Distributed File System
The Hadoop Distributed File System Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler Yahoo! Sunnyvale, California USA {Shv, Hairong, SRadia, Chansler}@Yahoo-Inc.com Presenter: Alex Hu HDFS
Seminar Presentation for ECE 658 Instructed by: Prof.Anura Jayasumana Distributed File Systems
Seminar Presentation for ECE 658 Instructed by: Prof.Anura Jayasumana Distributed File Systems Prabhakaran Murugesan Outline File Transfer Protocol (FTP) Network File System (NFS) Andrew File System (AFS)
Hadoop and Map-Reduce. Swati Gore
Hadoop and Map-Reduce Swati Gore Contents Why Hadoop? Hadoop Overview Hadoop Architecture Working Description Fault Tolerance Limitations Why Map-Reduce not MPI Distributed sort Why Hadoop? Existing Data
BookKeeper. Flavio Junqueira Yahoo! Research, Barcelona. Hadoop in China 2011
BookKeeper Flavio Junqueira Yahoo! Research, Barcelona Hadoop in China 2011 What s BookKeeper? Shared storage for writing fast sequences of byte arrays Data is replicated Writes are striped Many processes
Hadoop & its Usage at Facebook
Hadoop & its Usage at Facebook Dhruba Borthakur Project Lead, Hadoop Distributed File System [email protected] Presented at the The Israeli Association of Grid Technologies July 15, 2009 Outline Architecture
HDFS Federation. Sanjay Radia Founder and Architect @ Hortonworks. Page 1
HDFS Federation Sanjay Radia Founder and Architect @ Hortonworks Page 1 About Me Apache Hadoop Committer and Member of Hadoop PMC Architect of core-hadoop @ Yahoo - Focusing on HDFS, MapReduce scheduler,
Sunita Suralkar, Ashwini Mujumdar, Gayatri Masiwal, Manasi Kulkarni Department of Computer Technology, Veermata Jijabai Technological Institute
Review of Distributed File Systems: Case Studies Sunita Suralkar, Ashwini Mujumdar, Gayatri Masiwal, Manasi Kulkarni Department of Computer Technology, Veermata Jijabai Technological Institute Abstract
GraySort and MinuteSort at Yahoo on Hadoop 0.23
GraySort and at Yahoo on Hadoop.23 Thomas Graves Yahoo! May, 213 The Apache Hadoop[1] software library is an open source framework that allows for the distributed processing of large data sets across clusters
MinCopysets: Derandomizing Replication In Cloud Storage
MinCopysets: Derandomizing Replication In Cloud Storage Asaf Cidon, Ryan Stutsman, Stephen Rumble, Sachin Katti, John Ousterhout and Mendel Rosenblum Stanford University [email protected], {stutsman,rumble,skatti,ouster,mendel}@cs.stanford.edu
Cyber Forensic for Hadoop based Cloud System
Cyber Forensic for Hadoop based Cloud System ChaeHo Cho 1, SungHo Chin 2 and * Kwang Sik Chung 3 1 Korea National Open University graduate school Dept. of Computer Science 2 LG Electronics CTO Division
Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop
Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop Kanchan A. Khedikar Department of Computer Science & Engineering Walchand Institute of Technoloy, Solapur, Maharashtra,
Survey on Scheduling Algorithm in MapReduce Framework
Survey on Scheduling Algorithm in MapReduce Framework Pravin P. Nimbalkar 1, Devendra P.Gadekar 2 1,2 Department of Computer Engineering, JSPM s Imperial College of Engineering and Research, Pune, India
