A Taxonomy and Survey on Distributed File Systems
Fourth International Conference on Networked Computing and Advanced Information Management

A Taxonomy and Survey on Distributed File Systems

Tran Doan Thanh 1, Subaji Mohan 1, Eunmi Choi 1, SangBum Kim 2, Pilsung Kim 2
1 School of Business IT, Kookmin University, Seoul, Korea ([email protected], [email protected], [email protected])
2 SK Telecom Convergence and Internet R&D Center / SK T-Tower, Seoul, Korea ([email protected], [email protected])

Abstract

Applications that process large volumes of data (such as search engines, grid computing applications, data mining applications, etc.) require a backend infrastructure for storing data, and the distributed file system is the central component of that storage infrastructure. Many projects focused on network computing have designed and implemented distributed file systems with a variety of architectures and functionalities. In this paper, we develop a comprehensive taxonomy for describing distributed file system architectures and use this taxonomy to survey existing distributed file system implementations in very large-scale network computing systems such as grids and search engines. We use the taxonomy and the survey results to identify architectural approaches that have not been fully explored in distributed file system research.

1. Introduction

The Distributed File System (DFS) is used to build a hierarchical view of multiple file servers and shares on the network. Instead of having to remember a specific machine name for each set of files, the user only has to remember one name, which serves as the key to a list of shares found on multiple servers on the network. Permanent storage is a fundamental abstraction in computing. A permanent storage consists of a named set of objects that (1) come into existence by explicit creation, (2) are immune to temporary failures of the system, and (3) persist until explicitly destroyed.
The naming structure, the characteristics of the objects, and the set of operations associated with them characterize a specific refinement of the basic abstraction. A file system is one such refinement. A DFS is a file system that supports the sharing of files in the form of persistent storage over a set of network-connected nodes [4]. Many DFSs have been developed over the years, and almost two decades of research have not succeeded in producing a fully-featured DFS [1, 2, 3]. Multiple users who are physically dispersed in a network of autonomous computers share the use of a common file system. A useful way to view such a system is to think of it as a distributed implementation of the timesharing file system abstraction. The challenge is in realizing this abstraction in an efficient, secure, and robust manner. In addition, the issues of file location and availability assume significance. One way of increasing the availability of files within a DFS is by using replication of files. Most replication techniques can be divided into two main categories: optimistic replication and pessimistic replication [5]. Another major bottleneck in DFS performance is that storage speeds have not kept up with the dramatic improvements in processor speeds. To overcome this limitation, a DFS uses caches at various points [7], and these caches can be positioned either at the file server or at the client [6]. To provide a consistent view of the data seen by all clients in a DFS, and reliability in the case of failures, write operations are allowed to complete only after the data has been committed to stable storage. Therefore, the dominant loads on the file server are due to writes, and allowing write-backs from clients can reduce this write load on the server [7, 8]. The ability to use commodity devices to scale up easily and economically is now very important in DFSs because of the demands of large-scale distributed applications.
This includes incremental scalability, which is the ability to add more devices to scale up the system in an incremental fashion. The systems surveyed are Google File System (GFS) [15], Lustre [16], Kosmos File System [17], Hadoop Distributed File System (Hadoop) [14], Panasas [18], Parallel Virtual File System (PVFS2) [19], and Redhat Global File System (RGFS) [20]. Requirements for DFS were described and an abstract functional model developed. The requirements and model were used to develop the taxonomy. This helped in identifying some of the key DFS approaches and issues that are yet to be explored, and we expect such unexplored issues to become topics of future research. A comprehensive bibliography adds to the value of the paper. (Corresponding author: Eunmi Choi ([email protected]). This work is supported by an SKT research project fund.)

The structure of the paper is as follows. Section 2 covers background information about distributed file systems. In Section 3, the taxonomy of distributed file systems is reviewed in detail. In Section 4, an overview of the different distributed file systems and a comparison between them are given. In Section 5, findings related to DFS are presented, and in Section 6, the discussion and conclusion of the paper are outlined.

2. Background

This review started with the basic abstraction of DFS and developed a taxonomy of issues in the design of DFS. A major step in the evolution of DFSs was the recognition that access to remote files could be made to resemble access to local files. This property, called network transparency, implies that any operation that can be performed on a local file may also be performed on a remote file. The extent to which an actual implementation meets this ideal is an important measure of quality. The Newcastle Connection and Cocanet [9] are two early examples of systems that provided network transparency. In both cases the name of the remote site was a prefix of a remote file name. A DFS provides location transparency and redundancy to improve data availability in the face of failure or heavy load by allowing shares in multiple different locations to be logically grouped under one folder, or DFS root. Many DFS implementations perform very poorly compared to local file systems because they perform synchronous I/O operations for cache coherence and data safety [10]. File systems such as AFS [11] and NFS [12] present users with the abstraction of a single, coherent namespace shared across multiple clients.
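As a toy illustration of the coherence machinery behind such a shared namespace (our own sketch, not actual AFS or NFS code; all class and method names are invented), a server can track which clients cache a file and synchronously invalidate every cached copy before a write is acknowledged:

```python
# Illustrative sketch: write-invalidate cache coherence in a client-server
# file system. The server records which clients cache each path and must
# contact every one of them, synchronously, before a write completes.

class Client:
    def __init__(self):
        self.cache = {}

    def invalidate(self, path):
        self.cache.pop(path, None)          # drop the stale copy

class Server:
    def __init__(self):
        self.files = {}
        self.cachers = {}                   # path -> set of caching clients

    def read(self, path, client):
        self.cachers.setdefault(path, set()).add(client)
        client.cache[path] = self.files[path]
        return client.cache[path]

    def write(self, path, data, writer):
        # The write is acknowledged only after every other cached copy
        # has been invalidated: one synchronous round per caching client.
        for c in self.cachers.get(path, set()):
            if c is not writer:
                c.invalidate(path)
        self.files[path] = data
        self.cachers[path] = {writer}
        writer.cache[path] = data

srv = Server()
srv.files["/doc"] = "v1"
a, b = Client(), Client()
assert srv.read("/doc", a) == "v1"
assert srv.read("/doc", b) == "v1"
srv.write("/doc", "v2", a)
assert "/doc" not in b.cache                # b's stale copy was invalidated
assert srv.read("/doc", b) == "v2"
```

The invalidation loop is exactly the synchronous message exchange that limits DFS performance relative to local file systems.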
Although caching data on local clients improves performance, many file operations still use synchronous message exchanges between client and server to maintain cache consistency and protect against client or server failure. Structuring a distributed system is a demanding task, even if the size of the system is quite limited, and the work becomes much more difficult when the scale of the system is very large. We need to consider that any viable distributed system architecture must support the notion of autonomy if it is to scale at all in the real world [13]. Location transparency can be viewed as a binding issue. The binding of location to name is static and permanent when pathnames with embedded machine names are used. The binding is less permanent in a system like Sun NFS. It is most dynamic and flexible in Google and Hadoop. Usage experience has confirmed the benefits of a fully dynamic location mechanism in a large distributed environment. Another major issue that attracts attention in DFS is the failure of a machine (server or client), which cannot be distinguished from the failure of a communication link, or from slow responses due to extreme overloading. Therefore, when a site does not respond, one cannot determine whether the site has failed and stopped processing, or whether a communication link has failed and the site is still operational. One must then assume that the inaccessible site is still capable of processing file requests. The file system protocol should handle this case in such a way that the consistency and semantic guarantees of the system are not violated. Based on these background issues, we have developed our taxonomy of aspects that need to be considered when designing a DFS for grids, search engines, etc. The following section covers the taxonomy in detail.

3.
Taxonomy of Distributed File System

The motivation behind developing this taxonomy is to analyze the features that constitute a DFS and to help incorporate the most appropriate and suitable file system, one that performs well and is fault-tolerant and secure.

Architecture

The first issue considered during the study was to find the types of DFS architectures that are available. Different DFS architectures exist, such as client-server architectures (e.g., Sun Microsystems' Network File System), in which each server provides a standardized view of its local file system. This older style of DFS comes with a communication protocol that allows clients to access the files stored on a server, thus allowing a heterogeneous collection of processes running on different operating systems and machines to share a common file system. The advantage of this scheme is that it is largely independent of local file systems. An important limitation is that it cannot be used in MS-DOS due to its short file names. Another type of architecture is the cluster-based distributed file system, such as the Google File System. It consists of a single master along with multiple chunk servers, and files are divided into chunks of 64 MB each. The advantage is its simplicity, and it allows a single master to control a few hundred chunk servers. In cluster-based DFS, three important architectural features are usually considered during design: decoupled metadata and data, reliable autonomic distributed object storage, and dynamic distributed metadata management. A third type of architecture is the symmetric architecture, which is based on peer-to-peer technology. It uses a DHT-based system for distributing data, combined with a key-based lookup mechanism. In a symmetric file system, the clients also host the metadata manager code, resulting in all nodes understanding the disk structures. In contrast, an asymmetric file system is one in which there are one or more dedicated metadata managers that maintain the file system and its associated disk structures. Examples include Panasas ActiveScale, Lustre, and traditional NFS file systems. Finally, a parallel file system is one in which data blocks are striped, in parallel, across multiple storage devices on multiple storage servers. Support for parallel applications is provided by allowing all nodes access to the same files at the same time, thus providing concurrent read and write capabilities. Most current DFSs support this important feature. An important note is that all of the above definitions overlap: a DFS can be symmetric or asymmetric; its servers may be clustered or single servers; and it may or may not support parallel applications. Based on the survey, it has been identified that a multiple-layer architecture allows flexibility, so that protocol or functional layers can be easily added.

Processes

Even though DFS processes have no unusual properties, the important aspect here is whether they should be stateless or not. The primary advantage of the stateless approach is simplicity, but it can be difficult to carry through in an implementation because locking a file cannot be done easily by a stateless server.
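A minimal sketch of the stateless style (our own illustrative Python, not drawn from any of the surveyed systems): every request is self-describing, so the server keeps no per-client state and a restart needs no recovery.

```python
# Illustrative sketch of a stateless file server: each read request
# carries (path, offset, length), so there is no open-file table, no
# lock table, and nothing to reconstruct after a crash.

class StatelessServer:
    def __init__(self, files):
        self.files = files          # path -> bytes

    def read(self, path, offset, length):
        # All context lives in the request itself.
        return self.files[path][offset:offset + length]

s = StatelessServer({"/data/a": b"hello world"})
assert s.read("/data/a", 6, 5) == b"world"

# "Restarting" the server loses nothing the client depends on; the same
# self-describing request still works against a fresh instance.
s = StatelessServer({"/data/a": b"hello world"})
assert s.read("/data/a", 0, 5) == b"hello"
```

The flip side, as noted above, is that features requiring server-side memory of clients, such as file locking, do not fit this model naturally.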
Processes in some of the most commonly used DFSs were studied and their flaws analyzed. Except for PVFS2, almost all other DFSs support stateful processes. The major advantage of a stateless architecture is that clients can fail and resume without disturbing the system as a whole. This feature allows PVFS2 to scale to hundreds of servers and thousands of clients without being impacted by the overhead and complexity of tracking file state or locking information associated with these clients.

Communication

Most DFSs use the Remote Procedure Call (RPC) method to communicate, as it makes the system independent of the underlying operating systems, networks, and transport protocols. In the RPC approach, there are two communication protocols to consider: TCP and UDP. TCP is used by almost all DFSs; however, UDP is also considered for improving performance in Hadoop. A completely different approach to handling communication in a DFS is Plan 9. It is mainly a file-based distributed system in which all resources are accessed in the same way, namely with file-like syntax and operations, including even resources such as processes and network interfaces. In this respect, Lustre has adopted a more flexible architecture that provides network independence. Lustre can be used over a wide variety of networks due to its use of an open network abstraction layer; therefore, it provides unique support for heterogeneous networks.

Naming

Naming plays an important role, as each object has an associated logical path name and physical address. An aggregation of all the logical path names comprises a distributed name space, which can be logically partitioned into domains. The addresses of the objects are used to access the objects in order to retrieve information from the distributed system.
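The mapping from logical path names to physical addresses can be sketched as follows. This is our own illustration, loosely modeled on the cluster-based layout described earlier (a single metadata master and fixed 64 MB chunks); all server names and addresses are invented.

```python
# Illustrative sketch: a metadata master maps (path, chunk index) to the
# chunk servers that hold the data; clients compute the chunk index from
# the byte offset and then contact the chunk servers directly.

CHUNK_SIZE = 64 * 1024 * 1024        # 64 MB chunks, as in GFS

class Master:
    def __init__(self):
        # (path, chunk_index) -> list of chunk-server addresses
        self.locations = {}

    def lookup(self, path, offset):
        index = offset // CHUNK_SIZE         # which chunk holds this byte
        within = offset % CHUNK_SIZE         # offset inside that chunk
        return index, within, self.locations.get((path, index), [])

m = Master()
m.locations[("/logs/web.log", 2)] = ["cs-04:7000", "cs-11:7000"]

# A read at byte 130 MB falls in chunk 2, 2 MB past its start.
idx, within, servers = m.lookup("/logs/web.log", 130 * 1024 * 1024)
assert idx == 2
assert within == 2 * 1024 * 1024
assert servers == ["cs-04:7000", "cs-11:7000"]
```

Because the master answers only this small location query and never touches file data, decoupling metadata from data keeps the name space highly available while data traffic flows directly between clients and chunk servers.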
Key concerns are the naming structure of the file system, the application programming interface, the mapping of the file system abstraction onto physical storage media, and the integrity of the file system across power, hardware, media, and software failures. This can be seen in systems such as the Network File System, whose fundamental idea is to provide its clients completely transparent access to a remote file system. The currently common approach employs a central metadata server to manage the file name space; decoupling metadata and data in this way improves file namespace throughput and relieves the synchronization problem. Another approach distributes metadata across all nodes, resulting in all nodes understanding the disk structure. A serious implication, however, is that users do not share name spaces due to security issues, which makes file sharing harder. The different systems were studied and analyzed to define the most appropriate naming structure and method.

Synchronization

A vital issue to be analyzed in a DFS is synchronization. In a distributed system, the semantics of file sharing become a bit tricky when performance issues are at stake. When the same file is shared by two or more users, it is necessary to define the semantics of reading and writing precisely to avoid problems. Even though this looks conceptually simple, it is quite difficult to implement. A few approaches are available, such as UNIX semantics, session semantics, immutable semantics, and transactions. Apart from semantics, we also analyze the file locking system in a DFS. Depending on the purpose of the applications deployed on the DFS, it is developed with a different locking mechanism. Most usages require a write-once-read-many access model; however, applications such as search engines require a multiple-producer/single-consumer access model, of which GFS is a well-known example. To support their access models, some systems choose to give locks on objects to clients, and some choose to perform all operations synchronously on the server. Giving locks on objects to clients leads to a performance improvement through caching at the client. Lustre applies a hybrid solution for its file locking system: the locking mode is chosen differently depending on the resource contention level. The last issue we study under synchronization is using the server to serialize operations, which is the most common method to control parallel access to a DFS.

Consistency and Replication

To provide consistency, most DFSs employ checksums to validate data after sending it through the communication network. Besides this, caching and replication play an important role in DFS, most notably when systems are designed to operate over a wide-area network. This can be done in a few ways, such as client-side caching and server-side replication. There are two types of data to consider for replication: metadata replication and data object replication. Metadata is the most important part of the whole DFS; thus, all DFSs provide a mechanism to ensure the availability and recoverability of this data, such as backing up metadata and taking snapshots of metadata with transaction logs. For data objects, there are different approaches depending on the purpose of the applications. DFSs like Lustre and Panasas assume that a data object is available as long as the physical devices are available.
Hence, they consider a physical failure as an exception, and the object data can be lost. However, some systems like Lustre support a RAID0 model for storing data to reduce the probability of losing data and increase access performance. In the case of other DFSs like GFS and Hadoop, their applications require the availability of data as a critical condition, and failure will be the norm rather than the exception. Thus, data objects are replicated on different servers. Because this feature consumes high bandwidth, it leads to the asynchronous replication method named replication in pipeline, which is employed in GFS and Hadoop.

Fault Tolerance

Fault tolerance is closely related to the replication feature, because replication is created to provide availability and support transparency of failures to users. As mentioned in the Consistency and Replication section, there are two approaches to fault tolerance for object data: failure as exception and failure as norm. Failure-as-exception systems isolate the failed node or recover the system from the last normal running state. Failure-as-norm systems employ replication of all kinds of data and execute re-replication whenever the replication ratio becomes unsafe.

Security

Authentication and access control are some of the important security issues in DFSs that need to be analyzed. The impact of decentralized authentication was also taken into consideration during the survey of the DFSs. Most DFSs provide security by leveraging existing security systems. Yet some DFSs for specific purposes, such as GFS and Hadoop, rely on trust between all nodes and clients, so they do not employ a dedicated security mechanism in their architecture.

Other Issues

One important property of DFSs is the ability to use commodity devices to build up the system. The advantage of this capability is that whenever the commodity devices improve, the DFS automatically and naturally improves.
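The replication-in-pipeline method mentioned above can be sketched as follows (an illustrative toy, not GFS or Hadoop code; node names and the CRC32 checksum choice are ours): the client pushes a block to the first replica only, and each replica verifies a checksum and forwards the data to the next.

```python
# Illustrative sketch of pipelined replication with per-hop checksum
# validation: client -> r1 -> r2 -> r3. Each node stores the block and
# forwards it, so the client's outbound bandwidth is used only once.

import zlib

class ReplicaNode:
    def __init__(self, name, next_node=None):
        self.name = name
        self.next_node = next_node
        self.blocks = {}

    def receive(self, block_id, data, checksum):
        if zlib.crc32(data) != checksum:
            raise IOError(f"{self.name}: corrupt block {block_id}")
        self.blocks[block_id] = data
        if self.next_node:                  # forward down the pipeline
            self.next_node.receive(block_id, data, checksum)

# Build a 3-node pipeline.
r3 = ReplicaNode("r3")
r2 = ReplicaNode("r2", r3)
r1 = ReplicaNode("r1", r2)

data = b"block contents"
r1.receive("blk_0001", data, zlib.crc32(data))
assert all(n.blocks["blk_0001"] == data for n in (r1, r2, r3))
```

In a failure-as-norm system, a background process would watch the replica count per block and re-run such a pipeline whenever the replication ratio drops below the configured factor.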
Besides, it also becomes very cost-effective when there is a need to scale up the system.

4. Comparison of Distributed File Systems

The systems surveyed are Google File System (GFS) [15], Lustre [16], Kosmos File System (KFS) [17], Hadoop Distributed File System (Hadoop) [14], Panasas [18], Parallel Virtual File System (PVFS2) [19], and Redhat Global File System (RGFS) [20]. There are many more DFSs in the literature, such as NFS, AFS, QFS, and ZFS. However, due to space limitations and considerations of novelty and representativeness, we could not add them to our survey. We summarize the comparison in Table 1.
Table 1: Overall Comparison of Different Distributed File Systems

Architecture: GFS, KFS, and Hadoop are asymmetric, cluster-based, parallel, and object-based; Lustre and Panasas are asymmetric, parallel, and object-based; PVFS2 is parallel, symmetric, and aggregation-based; RGFS is parallel and block-based.
Processes: stateful in GFS, KFS, Hadoop, Lustre, Panasas, and RGFS; stateless in PVFS2.
Communication: RPC over TCP in GFS, KFS, Panasas, PVFS2, and RGFS; RPC over TCP and UDP in Hadoop; network independence in Lustre.
Naming: a central metadata server in GFS, KFS, Hadoop, Lustre, and Panasas; metadata distributed in all nodes in PVFS2 and RGFS.
Synchronization: GFS supports write-once-read-many and multiple-producer/single-consumer access and gives locks on objects to clients; KFS and Hadoop support write-once-read-many access and give locks on objects to clients; Lustre uses a hybrid locking mechanism; Panasas and RGFS give locks on objects to clients; PVFS2 uses no locking method.
Consistency and replication: GFS uses server-side replication, asynchronous replication, checksums, and relaxed consistency among replicas of data objects; KFS uses server-side replication, asynchronous replication, and checksums; Hadoop uses server-side replication, asynchronous replication, checksums, and client-side caching; Lustre and Panasas use server-side replication with metadata replication only; PVFS2 uses no replication and relaxed consistency semantics; RGFS uses no replication.
Fault tolerance: failure as norm in GFS, KFS, and Hadoop; failure as exception in Lustre, Panasas, PVFS2, and RGFS.
Security: no dedicated security mechanism in GFS, KFS, and Hadoop; Lustre, Panasas, PVFS2, and RGFS provide security by leveraging existing security systems.

5. Findings

Based on the survey and taxonomy, the following findings on the different DFSs can help in selecting an appropriate DFS according to the application and its requirements.

Lustre: Lustre is a shared-disk file system.
It is commonly used for large-scale cluster computing. It is an open-standard-based system with great modularity and compatibility with interconnects, networking components, and storage hardware. It is suitable for general-purpose file systems. Currently, it is only available for Linux.

Kosmos File System: The Kosmos Distributed File System (KFS) is a high-performance DFS that supports applications whose workload can be characterized as primarily write-once/read-many, with a few million large files, where each file is on the order of a few tens of MB to a few tens of GB in size, and with mostly sequential access. It provides high performance combined with availability and reliability. It is intended to be used as the backend storage infrastructure for data-intensive applications such as search engines, data mining, and grid computing.

Hadoop: Hadoop is a distributed, parallel, fault-tolerant file system. It is designed to reliably store very large files across machines in a large cluster. It is inspired by the Google File System. The Hadoop DFS stores each file as a sequence of blocks; all blocks in a file except the last block are the same size. Blocks belonging to a file are replicated for fault tolerance. The block size and replication factor are configurable per file. Files are write-once and have strictly one writer at any time.

Google File System: The Google File System is a proprietary DFS developed by Google for its own use. It is designed to provide efficient, reliable access to data using large clusters of commodity hardware. In GFS, files are huge by traditional standards and are divided into chunks of 64 megabytes. Most files are mutated by appending new data rather than overwriting existing data: once written, the files are only read, and often only sequentially. It is also optimized to run on computing clusters whose nodes consist of cheap, "commodity" computers, which means precautions must be taken against the high failure rate of individual nodes and the resulting data loss.

Panasas: Panasas implements its file system entirely in hardware. It is suitable for general-purpose file systems. To improve the overall utilization of storage systems and network performance and to increase access to vital data, Panasas has developed the ActiveScale Storage Cluster. By combining a DFS with smart hardware, the Panasas Storage Cluster scales dramatically in both capacity and performance and extends appliance-like ease of manageability to a virtually boundless storage system.

PVFS2: Data access is achieved without file or metadata locking. PVFS2 is best suited for I/O-intensive (i.e., scientific) applications. PVFS2 can be used for high-performance scratch storage where data is copied in and simulation results are written out from thousands of nodes simultaneously.

RGFS: RGFS is an open-standard-based system with great modularity and compatibility with interconnects, networking components, and storage hardware. Besides, it is a relatively low-cost, SAN-based technology. It is suitable for general-purpose file systems. However, it is only available on Red Hat Enterprise Linux.

6. Conclusion

The DFS is one of the most important and widely used forms of shared permanent storage. The continuing interest in DFS bears testimony to the robustness of this model of data sharing. As elaborated in the preceding sections, architecture, naming, synchronization, availability, heterogeneity, and support for databases are key issues to be taken into consideration when designing a DFS. Security will continue to be a serious concern and may, in fact, turn out to be the biggest concern for large distributed systems. In this paper, a taxonomy was developed for the DFS, and based on this taxonomy some of the most popular and common distributed file systems were reviewed and surveyed. The features, advantages, and disadvantages of each DFS are outlined in detail, along with findings that enable readers to select an appropriate system according to their needs.

7.
References

[1] Chandramohan A. Thekkath et al., "Frangipani: A Scalable Distributed File System", Systems Research Center, Digital Equipment Corporation, Palo Alto, CA.
[2] Barbara Liskov et al., "Replication in the Harp File System", Laboratory for Computer Science, MIT, Cambridge, MA.
[3] John Douceur and Roger Wattenhofer, "Optimizing File Availability in a Secure Serverless Distributed File System", in Proceedings of the 20th Symposium on Reliable Distributed Systems.
[4] Eliezer Levy and Abraham Silberschatz, "Distributed File Systems: Concepts and Examples", ACM Computing Surveys, Vol. 22, No. 4, December.
[5] Yasushi Saito and Marc Shapiro, "Optimistic Replication", ACM Computing Surveys, Vol. 37, No. 1, March 2005.
[6] Satyanarayanan, M., "A Survey of Distributed File Systems", Technical Report CMU-CS, Department of Computer Science, Carnegie Mellon University, 1989.
[7] Howard, J.H., et al., "Scale and Performance in a Distributed File System", ACM Transactions on Computer Systems, Vol. 6, Issue 1, February.
[8] Nelson, M.N., et al., "Caching in the Sprite Network File System", ACM Transactions on Computer Systems, February 1988.
[9] Rowe, L.A., Birman, K.P., "A Local Network Based on the Unix Operating System", IEEE Transactions on Software Engineering SE-8(2), March.
[10] Edmund B. Nightingale, Peter M. Chen, and Jason Flinn, "Speculative Execution in a Distributed File System", ACM SOSP '05, October 23-26, 2005, Brighton, United Kingdom.
[11] Howard, J.H., Kazar, M.L., Menees, S.G., Nichols, D.A., Satyanarayanan, M., Sidebotham, R.N., and West, M.J., "Scale and Performance in a Distributed File System", ACM Transactions on Computer Systems, Vol. 6, Issue 1, February 1988.
[12] Callaghan, B., Pawlowski, B., and Staubach, P., "NFS Version 3 Protocol Specification", Technical Report RFC 1813, IETF, June.
[13] Alonso, Rafael, and Luis L. Cova, "Resource Sharing in a Distributed Environment", Proceedings of the ACM SIGOPS European Workshop, Cambridge, England, September.
[14] The Hadoop Distributed File System.
[15] Ghemawat, S., Gobioff, H., Leung, S.T., "The Google File System", ACM SIGOPS Operating Systems Review, Volume 37, Issue 5, December.
[16] Braam, P.J., "The Lustre Storage Architecture", White Paper, Cluster File Systems, Inc., October.
[17] Kosmos Distributed File System.
[18] Nagle, D., Serenyi, D., Matthews, A., "The Panasas ActiveScale Storage Cluster: Delivering Scalable High Bandwidth Storage", Proceedings of the 2004 ACM/IEEE Conference on Supercomputing.
[19] Yu, W., Liang, Sh., Panda, D.K., "High Performance Support of Parallel Virtual File System (PVFS2) over Quadrics", Proceedings of the 19th Annual International Conference on Supercomputing.
[20] Red Hat Global File System, White Paper.
Index Terms : Load rebalance, distributed file systems, clouds, movement cost, load imbalance, chunk.
Load Rebalancing for Distributed File Systems in Clouds. Smita Salunkhe, S. S. Sannakki Department of Computer Science and Engineering KLS Gogte Institute of Technology, Belgaum, Karnataka, India Affiliated
Parallel Processing of cluster by Map Reduce
Parallel Processing of cluster by Map Reduce Abstract Madhavi Vaidya, Department of Computer Science Vivekanand College, Chembur, Mumbai [email protected] MapReduce is a parallel programming model
A Brief Analysis on Architecture and Reliability of Cloud Based Data Storage
Volume 2, No.4, July August 2013 International Journal of Information Systems and Computer Sciences ISSN 2319 7595 Tejaswini S L Jayanthy et al., Available International Online Journal at http://warse.org/pdfs/ijiscs03242013.pdf
Seminar Presentation for ECE 658 Instructed by: Prof.Anura Jayasumana Distributed File Systems
Seminar Presentation for ECE 658 Instructed by: Prof.Anura Jayasumana Distributed File Systems Prabhakaran Murugesan Outline File Transfer Protocol (FTP) Network File System (NFS) Andrew File System (AFS)
BlobSeer: Towards efficient data storage management on large-scale, distributed systems
: Towards efficient data storage management on large-scale, distributed systems Bogdan Nicolae University of Rennes 1, France KerData Team, INRIA Rennes Bretagne-Atlantique PhD Advisors: Gabriel Antoniu
Direct NFS - Design considerations for next-gen NAS appliances optimized for database workloads Akshay Shah Gurmeet Goindi Oracle
Direct NFS - Design considerations for next-gen NAS appliances optimized for database workloads Akshay Shah Gurmeet Goindi Oracle Agenda Introduction Database Architecture Direct NFS Client NFS Server
The Google File System
The Google File System By Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung (Presented at SOSP 2003) Introduction Google search engine. Applications process lots of data. Need good file system. Solution:
Lecture 5: GFS & HDFS! Claudia Hauff (Web Information Systems)! [email protected]
Big Data Processing, 2014/15 Lecture 5: GFS & HDFS!! Claudia Hauff (Web Information Systems)! [email protected] 1 Course content Introduction Data streams 1 & 2 The MapReduce paradigm Looking behind
Violin: A Framework for Extensible Block-level Storage
Violin: A Framework for Extensible Block-level Storage Michail Flouris Dept. of Computer Science, University of Toronto, Canada [email protected] Angelos Bilas ICS-FORTH & University of Crete, Greece
Comparative analysis of Google File System and Hadoop Distributed File System
Comparative analysis of Google File System and Hadoop Distributed File System R.Vijayakumari, R.Kirankumar, K.Gangadhara Rao Dept. of Computer Science, Krishna University, Machilipatnam, India, [email protected]
COSC 6374 Parallel Computation. Parallel I/O (I) I/O basics. Concept of a clusters
COSC 6374 Parallel Computation Parallel I/O (I) I/O basics Spring 2008 Concept of a clusters Processor 1 local disks Compute node message passing network administrative network Memory Processor 2 Network
HDFS Space Consolidation
HDFS Space Consolidation Aastha Mehta*,1,2, Deepti Banka*,1,2, Kartheek Muthyala*,1,2, Priya Sehgal 1, Ajay Bakre 1 *Student Authors 1 Advanced Technology Group, NetApp Inc., Bangalore, India 2 Birla Institute
Google File System. Web and scalability
Google File System Web and scalability The web: - How big is the Web right now? No one knows. - Number of pages that are crawled: o 100,000 pages in 1994 o 8 million pages in 2005 - Crawlable pages might
CS2510 Computer Operating Systems
CS2510 Computer Operating Systems HADOOP Distributed File System Dr. Taieb Znati Computer Science Department University of Pittsburgh Outline HDF Design Issues HDFS Application Profile Block Abstraction
CS2510 Computer Operating Systems
CS2510 Computer Operating Systems HADOOP Distributed File System Dr. Taieb Znati Computer Science Department University of Pittsburgh Outline HDF Design Issues HDFS Application Profile Block Abstraction
Running a Workflow on a PowerCenter Grid
Running a Workflow on a PowerCenter Grid 2010-2014 Informatica Corporation. No part of this document may be reproduced or transmitted in any form, by any means (electronic, photocopying, recording or otherwise)
Network Attached Storage. Jinfeng Yang Oct/19/2015
Network Attached Storage Jinfeng Yang Oct/19/2015 Outline Part A 1. What is the Network Attached Storage (NAS)? 2. What are the applications of NAS? 3. The benefits of NAS. 4. NAS s performance (Reliability
HDFS Under the Hood. Sanjay Radia. [email protected] Grid Computing, Hadoop Yahoo Inc.
HDFS Under the Hood Sanjay Radia [email protected] Grid Computing, Hadoop Yahoo Inc. 1 Outline Overview of Hadoop, an open source project Design of HDFS On going work 2 Hadoop Hadoop provides a framework
Snapshots in Hadoop Distributed File System
Snapshots in Hadoop Distributed File System Sameer Agarwal UC Berkeley Dhruba Borthakur Facebook Inc. Ion Stoica UC Berkeley Abstract The ability to take snapshots is an essential functionality of any
Massive Data Storage
Massive Data Storage Storage on the "Cloud" and the Google File System paper by: Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung presentation by: Joshua Michalczak COP 4810 - Topics in Computer Science
Cloud Computing at Google. Architecture
Cloud Computing at Google Google File System Web Systems and Algorithms Google Chris Brooks Department of Computer Science University of San Francisco Google has developed a layered system to handle webscale
Panasas at the RCF. Fall 2005 Robert Petkus RHIC/USATLAS Computing Facility Brookhaven National Laboratory. Robert Petkus Panasas at the RCF
Panasas at the RCF HEPiX at SLAC Fall 2005 Robert Petkus RHIC/USATLAS Computing Facility Brookhaven National Laboratory Centralized File Service Single, facility-wide namespace for files. Uniform, facility-wide
Transparency in Distributed Systems
Transparency in Distributed Systems By Sudheer R Mantena Abstract The present day network architectures are becoming more and more complicated due to heterogeneity of the network components and mainly
We mean.network File System
We mean.network File System Introduction: Remote File-systems When networking became widely available users wanting to share files had to log in across the net to a central machine This central machine
POWER ALL GLOBAL FILE SYSTEM (PGFS)
POWER ALL GLOBAL FILE SYSTEM (PGFS) Defining next generation of global storage grid Power All Networks Ltd. Technical Whitepaper April 2008, version 1.01 Table of Content 1. Introduction.. 3 2. Paradigm
Object Storage: A Growing Opportunity for Service Providers. White Paper. Prepared for: 2012 Neovise, LLC. All Rights Reserved.
Object Storage: A Growing Opportunity for Service Providers Prepared for: White Paper 2012 Neovise, LLC. All Rights Reserved. Introduction For service providers, the rise of cloud computing is both a threat
Analysis and Research of Cloud Computing System to Comparison of Several Cloud Computing Platforms
Volume 1, Issue 1 ISSN: 2320-5288 International Journal of Engineering Technology & Management Research Journal homepage: www.ijetmr.org Analysis and Research of Cloud Computing System to Comparison of
Storage Architectures for Big Data in the Cloud
Storage Architectures for Big Data in the Cloud Sam Fineberg HP Storage CT Office/ May 2013 Overview Introduction What is big data? Big Data I/O Hadoop/HDFS SAN Distributed FS Cloud Summary Research Areas
Shared Parallel File System
Shared Parallel File System Fangbin Liu [email protected] System and Network Engineering University of Amsterdam Shared Parallel File System Introduction of the project The PVFS2 parallel file system
Cray DVS: Data Virtualization Service
Cray : Data Virtualization Service Stephen Sugiyama and David Wallace, Cray Inc. ABSTRACT: Cray, the Cray Data Virtualization Service, is a new capability being added to the XT software environment with
High Availability with Windows Server 2012 Release Candidate
High Availability with Windows Server 2012 Release Candidate Windows Server 2012 Release Candidate (RC) delivers innovative new capabilities that enable you to build dynamic storage and availability solutions
A Study on Workload Imbalance Issues in Data Intensive Distributed Computing
A Study on Workload Imbalance Issues in Data Intensive Distributed Computing Sven Groot 1, Kazuo Goda 1, and Masaru Kitsuregawa 1 University of Tokyo, 4-6-1 Komaba, Meguro-ku, Tokyo 153-8505, Japan Abstract.
Research Article Hadoop-Based Distributed Sensor Node Management System
Distributed Networks, Article ID 61868, 7 pages http://dx.doi.org/1.1155/214/61868 Research Article Hadoop-Based Distributed Node Management System In-Yong Jung, Ki-Hyun Kim, Byong-John Han, and Chang-Sung
Distributed File Systems. Chapter 10
Distributed File Systems Chapter 10 Distributed File System a) A distributed file system is a file system that resides on different machines, but offers an integrated view of data stored on remote disks.
BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB
BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB Planet Size Data!? Gartner s 10 key IT trends for 2012 unstructured data will grow some 80% over the course of the next
Scaling Objectivity Database Performance with Panasas Scale-Out NAS Storage
White Paper Scaling Objectivity Database Performance with Panasas Scale-Out NAS Storage A Benchmark Report August 211 Background Objectivity/DB uses a powerful distributed processing architecture to manage
Client/Server Computing Distributed Processing, Client/Server, and Clusters
Client/Server Computing Distributed Processing, Client/Server, and Clusters Chapter 13 Client machines are generally single-user PCs or workstations that provide a highly userfriendly interface to the
What is Analytic Infrastructure and Why Should You Care?
What is Analytic Infrastructure and Why Should You Care? Robert L Grossman University of Illinois at Chicago and Open Data Group [email protected] ABSTRACT We define analytic infrastructure to be the services,
A Content-Based Load Balancing Algorithm for Metadata Servers in Cluster File Systems*
A Content-Based Load Balancing Algorithm for Metadata Servers in Cluster File Systems* Junho Jang, Saeyoung Han, Sungyong Park, and Jihoon Yang Department of Computer Science and Interdisciplinary Program
Big Data Storage Architecture Design in Cloud Computing
Big Data Storage Architecture Design in Cloud Computing Xuebin Chen 1, Shi Wang 1( ), Yanyan Dong 1, and Xu Wang 2 1 College of Science, North China University of Science and Technology, Tangshan, Hebei,
Prepared By : Manoj Kumar Joshi & Vikas Sawhney
Prepared By : Manoj Kumar Joshi & Vikas Sawhney General Agenda Introduction to Hadoop Architecture Acknowledgement Thanks to all the authors who left their selfexplanatory images on the internet. Thanks
Analisi di un servizio SRM: StoRM
27 November 2007 General Parallel File System (GPFS) The StoRM service Deployment configuration Authorization and ACLs Conclusions. Definition of terms Definition of terms 1/2 Distributed File System The
International Journal of Advance Research in Computer Science and Management Studies
Volume 2, Issue 8, August 2014 ISSN: 2321 7782 (Online) International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online
How To Understand The Concept Of A Distributed System
Distributed Operating Systems Introduction Ewa Niewiadomska-Szynkiewicz and Adam Kozakiewicz [email protected], [email protected] Institute of Control and Computation Engineering Warsaw University of
GeoGrid Project and Experiences with Hadoop
GeoGrid Project and Experiences with Hadoop Gong Zhang and Ling Liu Distributed Data Intensive Systems Lab (DiSL) Center for Experimental Computer Systems Research (CERCS) Georgia Institute of Technology
Diagram 1: Islands of storage across a digital broadcast workflow
XOR MEDIA CLOUD AQUA Big Data and Traditional Storage The era of big data imposes new challenges on the storage technology industry. As companies accumulate massive amounts of data from video, sound, database,
Survey on Load Rebalancing for Distributed File System in Cloud
Survey on Load Rebalancing for Distributed File System in Cloud Prof. Pranalini S. Ketkar Ankita Bhimrao Patkure IT Department, DCOER, PG Scholar, Computer Department DCOER, Pune University Pune university
Introduction to Gluster. Versions 3.0.x
Introduction to Gluster Versions 3.0.x Table of Contents Table of Contents... 2 Overview... 3 Gluster File System... 3 Gluster Storage Platform... 3 No metadata with the Elastic Hash Algorithm... 4 A Gluster
Deploying a distributed data storage system on the UK National Grid Service using federated SRB
Deploying a distributed data storage system on the UK National Grid Service using federated SRB Manandhar A.S., Kleese K., Berrisford P., Brown G.D. CCLRC e-science Center Abstract As Grid enabled applications
Jeffrey D. Ullman slides. MapReduce for data intensive computing
Jeffrey D. Ullman slides MapReduce for data intensive computing Single-node architecture CPU Machine Learning, Statistics Memory Classical Data Mining Disk Commodity Clusters Web data sets can be very
IDENTIFYING AND OPTIMIZING DATA DUPLICATION BY EFFICIENT MEMORY ALLOCATION IN REPOSITORY BY SINGLE INSTANCE STORAGE
IDENTIFYING AND OPTIMIZING DATA DUPLICATION BY EFFICIENT MEMORY ALLOCATION IN REPOSITORY BY SINGLE INSTANCE STORAGE 1 M.PRADEEP RAJA, 2 R.C SANTHOSH KUMAR, 3 P.KIRUTHIGA, 4 V. LOGESHWARI 1,2,3 Student,
Load Rebalancing for File System in Public Cloud Roopa R.L 1, Jyothi Patil 2
Load Rebalancing for File System in Public Cloud Roopa R.L 1, Jyothi Patil 2 1 PDA College of Engineering, Gulbarga, Karnataka, India [email protected] 2 PDA College of Engineering, Gulbarga, Karnataka,
RAID Storage, Network File Systems, and DropBox
RAID Storage, Network File Systems, and DropBox George Porter CSE 124 February 24, 2015 * Thanks to Dave Patterson and Hong Jiang Announcements Project 2 due by end of today Office hour today 2-3pm in
Journal of science STUDY ON REPLICA MANAGEMENT AND HIGH AVAILABILITY IN HADOOP DISTRIBUTED FILE SYSTEM (HDFS)
Journal of science e ISSN 2277-3290 Print ISSN 2277-3282 Information Technology www.journalofscience.net STUDY ON REPLICA MANAGEMENT AND HIGH AVAILABILITY IN HADOOP DISTRIBUTED FILE SYSTEM (HDFS) S. Chandra
Performance Characteristics of VMFS and RDM VMware ESX Server 3.0.1
Performance Study Performance Characteristics of and RDM VMware ESX Server 3.0.1 VMware ESX Server offers three choices for managing disk access in a virtual machine VMware Virtual Machine File System
ORACLE DATABASE 10G ENTERPRISE EDITION
ORACLE DATABASE 10G ENTERPRISE EDITION OVERVIEW Oracle Database 10g Enterprise Edition is ideal for enterprises that ENTERPRISE EDITION For enterprises of any size For databases up to 8 Exabytes in size.
Cloud Optimize Your IT
Cloud Optimize Your IT Windows Server 2012 The information contained in this presentation relates to a pre-release product which may be substantially modified before it is commercially released. This pre-release
Scalable filesystems boosting Linux storage solutions
Scalable filesystems boosting Linux storage solutions Daniel Kobras science + computing ag IT-Dienstleistungen und Software für anspruchsvolle Rechnernetze Tübingen München Berlin Düsseldorf Motivation
Comparative analysis of mapreduce job by keeping data constant and varying cluster size technique
Comparative analysis of mapreduce job by keeping data constant and varying cluster size technique Mahesh Maurya a, Sunita Mahajan b * a Research Scholar, JJT University, MPSTME, Mumbai, India,[email protected]
Client/Server and Distributed Computing
Adapted from:operating Systems: Internals and Design Principles, 6/E William Stallings CS571 Fall 2010 Client/Server and Distributed Computing Dave Bremer Otago Polytechnic, N.Z. 2008, Prentice Hall Traditional
Alternatives to HIVE SQL in Hadoop File Structure
Alternatives to HIVE SQL in Hadoop File Structure Ms. Arpana Chaturvedi, Ms. Poonam Verma ABSTRACT Trends face ups and lows.in the present scenario the social networking sites have been in the vogue. The
Berkeley Ninja Architecture
Berkeley Ninja Architecture ACID vs BASE 1.Strong Consistency 2. Availability not considered 3. Conservative 1. Weak consistency 2. Availability is a primary design element 3. Aggressive --> Traditional
How to Choose your Red Hat Enterprise Linux Filesystem
How to Choose your Red Hat Enterprise Linux Filesystem EXECUTIVE SUMMARY Choosing the Red Hat Enterprise Linux filesystem that is appropriate for your application is often a non-trivial decision due to
Scalable Windows Storage Server File Serving Clusters Using Melio File System and DFS
Scalable Windows Storage Server File Serving Clusters Using Melio File System and DFS Step-by-step Configuration Guide Table of Contents Scalable File Serving Clusters Using Windows Storage Server Using
Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 14
Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases Lecture 14 Big Data Management IV: Big-data Infrastructures (Background, IO, From NFS to HFDS) Chapter 14-15: Abideboul
THE HADOOP DISTRIBUTED FILE SYSTEM
THE HADOOP DISTRIBUTED FILE SYSTEM Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler Presented by Alexander Pokluda October 7, 2013 Outline Motivation and Overview of Hadoop Architecture,
Efficient Metadata Management for Cloud Computing applications
Efficient Metadata Management for Cloud Computing applications Abhishek Verma Shivaram Venkataraman Matthew Caesar Roy Campbell {verma7, venkata4, caesar, rhc} @illinois.edu University of Illinois at Urbana-Champaign
IJFEAT INTERNATIONAL JOURNAL FOR ENGINEERING APPLICATIONS AND TECHNOLOGY
IJFEAT INTERNATIONAL JOURNAL FOR ENGINEERING APPLICATIONS AND TECHNOLOGY Hadoop Distributed File System: What and Why? Ashwini Dhruva Nikam, Computer Science & Engineering, J.D.I.E.T., Yavatmal. Maharashtra,
Hadoop and Map-Reduce. Swati Gore
Hadoop and Map-Reduce Swati Gore Contents Why Hadoop? Hadoop Overview Hadoop Architecture Working Description Fault Tolerance Limitations Why Map-Reduce not MPI Distributed sort Why Hadoop? Existing Data
159.735. Final Report. Cluster Scheduling. Submitted by: Priti Lohani 04244354
159.735 Final Report Cluster Scheduling Submitted by: Priti Lohani 04244354 1 Table of contents: 159.735... 1 Final Report... 1 Cluster Scheduling... 1 Table of contents:... 2 1. Introduction:... 3 1.1
Filesystems Performance in GNU/Linux Multi-Disk Data Storage
JOURNAL OF APPLIED COMPUTER SCIENCE Vol. 22 No. 2 (2014), pp. 65-80 Filesystems Performance in GNU/Linux Multi-Disk Data Storage Mateusz Smoliński 1 1 Lodz University of Technology Faculty of Technical
BookKeeper. Flavio Junqueira Yahoo! Research, Barcelona. Hadoop in China 2011
BookKeeper Flavio Junqueira Yahoo! Research, Barcelona Hadoop in China 2011 What s BookKeeper? Shared storage for writing fast sequences of byte arrays Data is replicated Writes are striped Many processes
