A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM

Transcription

1 A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM Sneha D.Borkar 1, Prof.Chaitali S.Surtakar 2 Student of B.E., Information Technology, J.D.I.E.T, [email protected] Assistant Professor, Information Technology, J.D.I.E.T, [email protected] ABSTRACT Hadoop is a software framework that supports data intensive distributed application. Hadoop creates clusters of machine and coordinates the work among them. It include two major component, HDFS (Hadoop Distributed File System) and Map Reduce. HDFS is designed to store large amount of data reliably and provide high availability of data to user application running at client. It creates multiple data blocks and store each of the block redundantly across the pool of servers to enable reliable, extreme rapid computation. Map Reduce is software framework for the analyzing and transforming a very large data set in to desired output. This paper focus on how the replicas are managed in HDFS for providing high availability of data under extreme computational requirement.this paper focus on possible failure that will affect the Hadoop cluster and which are failover mechanism can be deployed for protecting the cluster. Keywords: Hadoop, HDFS, Map Reduce, Hadoop Technology, High Availability INTRODUCTION The Hadoop is in the parallel access to data that can reside on a single node or on thousands of nodes.the Hadoop Distributed File System (HDFS) is one of many different components and project contained within the community Hadoop ecosystem. The Apache Hadoop defines HDFS as: the primary storage system used by Hadoop applications. HDFS creates multiple replicas of data blocks and distributes them on compute nodes throughout a cluster to enable reliable, extremely rapid computations.hadoop utilizes a scale-out architecture that makes use of commodity servers configured as a cluster, where each server possesses inexpensive internal disk drives. Data in Hadoop is broken down into blocks and spread throughout a cluster. Once that happens, Map Reduce tasks can be carried out on the smaller subsets of data that may make up a very large dataset overall, thus accomplishing the type of scalability needed for big data processing [1].It Has The ability to carry out some degree of automatic redundancy and failover make it popular for modern businesses looking for data warehouse batch analytics solutions. 1.1What is Hadoop? Hadoop is an Open Source implementation of a large-scale batch processing system. It uses the Map Reduce framework introduced by Google by leveraging the concept of map and reduce functions well known used in Functional Programming. Although the Hadoop framework is written in Java, it allows developers to deploy custom- written programs coded in Java or any other language to process data in a parallel fashion across hundreds or thousands of commodity s ervers. It is optimized for contiguous read requests (streaming reads), where processing consists of scanning all the data. Depending on the complexity of the process and the volume of data, response time can vary from minutes to hours. While Hadoop can processes data fast, its key advantage is its massive scalability. Hadoop leverages a cluster of nodes to run Map Reduce programs massively in parallel [2]. This program consists of two steps: the Map step processes input data and the Reduce step assembles intermediate results into a final result. Each cluster node has a local file system and local CPU on which to run the Map Reduce programs. The local files contain the file system called Hadoop Distributed File System (HDFS). The number of nodes in each cluster varies from hundreds to thousands of machines. Hadoop can also allow for a certain set of fail-over scenarios.

2 Fig.1:- A high level overview of Hadoop cluster Hadoop is currently being used for index web searches, spam detection, recommendation engines, prediction in financial services, genome manipulation in life sciences, and for analysis of unstructured dat a such as log, text, and click stream. While many of these applications could in fact be implemented in a relational database, the main role of the Hadoop framework is functionally different from RDBMs 1.2 Map Reduce Map Reduce has emerged as a popular way to harness the power of large clusters of computers. It allows programmers to think in a data-centric fashion: they focus on applying transformations to sets of data records, and allow the details of distributed execution, network communication and fault tolerance to be handled by the Map Reduce framework. And also Map Reduce is typically applied to large batch-oriented computations that are concerned primarily with time to job completion. The Google Map Reduce framework and open-source Hadoop system reinforce this usage model through a batch-processing implementation strategy: the entire output of each map and reduce task is materialized to a local file before it can be consumed by the next stage. Materialization allows for a simple and elegant checkpoint/restart fault tolerance mechanism that is critical in large deployments, which have a high probability of slowdowns or failures at worker nodes. We propose a modified MapReduce architecture in which intermediate data is pipelined between operators, while preserving the programming interfaces and fault tolerance models of previous Map Reduce frameworks. To validate this design, R. Abbott and H. Garcia- Molina develop the Hadoop Online Prototype (HOP), a pipelining version of Hadoop [4]. Pipelining provides several important advantages for Map Reduce framework; it also raises new design challenges. 2. Design of Hadoop Technology 2.1 Software Architectural Bottlenecks Sometimes HDFS is not utilized to its full potential due to scheduling delays in the Hadoop architecture that result in cluster nodes waiting for new tasks. Instead of using the disk in a streaming manner, the access pattern is periodic. Some Data perfecting is not employed to improve performance, even though the typical Map Reduce streaming access pattern is highly predictable. 2.2 Portability Limitations: The performance enhancing features in the native file system are not available in Java in a platformindependent manner. It includes options such as bypassing the file system page cache and transferring data directly from disk into user buffers.the HDFS implementation runs less efficiently and has higher processor usage than would otherwise be necessary [3].

3 2.3 Portability Assumptions: The classic notion of software portability is simple: does the application run on multiple platforms? But, a broader notion of portability is: does the application perform well on multiple platforms? While HDFS is strictly portable, its performance is highly dependent on the behavior of underlying software layers, spe cifically the OS I/O scheduler and native file system allocation algorithm. Here, author quantify the impact and significance of these HDFS bottlenecks. Fig-2:Hadoop Architecture The study of Hadoop Architecture, understanding of its design and implementation and reasoning why hadoop possess has excellent scalability, good fault tolerance capability but moderate performance. Fig-3. Hadoop Architecture Design: The particular aspect of database design application level I/O scheduling exploits application access patterns to maximize storage bandwidth in a way that is not similarly exploitable by HDFS. Application -level I/O scheduling is frequently used to improve database performance by reducing seeks in systems with large numbers of concurrent queries. Because database workloads often have data re-use, storage usage can be reduced by sharing data between active queries. Here, part or the entire disk is continuously scanned in a sequential manner. Clients join

4 the scan stream in-flight, leave after they have received all necessary data and never interrupt the stream by triggering immediate seeks. The highest overall throughput can be maintained for all queries. This particular type of scheduling is only beneficial when multiple clients each access some portion of shared data, which is not common in many HDFS workloads [5]. 3. TECHNOLOGY Hadoop, is also called as Apache Hadoop [4], It is an Apache Software Foundation project and open source software platform for scalable, distributed computing. Also provide fast and reliable analysis of both structured data and unstructured data. Given its capabilities to handle large data sets, it's often associated with the phrase big data. Performing computation on large volumes of data has been done before, usually in a distributed setting but writing distributed systems is notoriously hard. Map Reduce systems such as Hadoop are used in large- scale deployments.eliminating HDFS bottlenecks will not only boost application performance, but also improve overall cluster efficiency, thereby reducing power and cooling costs and allowing more computation to be accomplished with the same number of cluster nodes. These solutions include improved I/O scheduling, adding pipelining and prefetching to both task scheduling and HDFS clients, pre-allocating file space on disk, and modifying or eliminating the local file system, among other methods. Apache Hadoop controls costs by storing data more affordably per terabyte than other platforms. Instead of thousands to tens of thousands per terabyte Hadoop delivers compute and storage for hundreds of dollars per terabyte. Fault tolerance is one of the most important advantages of using Hadoop. Even if individual nodes experience high rates of failure when running jobs on a large cluster, data is replicated across a cluster so that it can be recovered easily in the face of disk, node or track failures. The flexible way that data is stored in Apache Hadoop is one of its biggest assets enabling businesses to generate value from data that was previously considered too expensive to be stored and processed databases. 3.1 HDFS analysis After the analysis of the Hadoop with Architect, here s the dependency graph of project HDFS use mostly hadoop-common and profound libraries. When external libraries are used, it s better to check if we can easily change a third party lib by another one without impacting the whole application, there are many reasons that can encourage to change a third party libraries. As an example of jetty lib and search which methods from hdfs use it directly. Fig-4. HDFS Analysis 5. CONCLUSION

5 Hadoop is very flexible and permit us to change the behavior without changing the source code. Hadoop Map Reduce is a large scale, open source software framework dedicated to scalable, distributed, data -intensive computing. Using frameworks as user is very interesting, but going inside this framework could give us more info suitable to understand it better, and adapt it to our needs easily. Hadoop is a powerful framework used by many companies, and most of them need to customize it. REFERENCES [1] R. Abbott and H. Garcia-Molina. Scheduling I/O requests with dead-lines: A performance evaluation. In Proceedings of the 11th Real-Time Systems Symposium, pages , Dec [2] G. Candea, N. Polyzotis, and R. Vingralek. A scalable, predictable join Operator for highly concurrent data warehouses. In 35 th International Conference on Very Large Data Bases (VLDB), [3] Understanding Hadoop Clusters and the Network. Available at Accessed on June 1, [4] In-Database Map-Reduce [5] Yahoo! Hadoop Tutorial. Yahoo! Developer Network. Available at Accessed on June 4, [6]HDFS Java API: [7]HDFS source code: [8] Hadoop HDFS over HTTP - Documentation Sets alpha. Apache Software Foundation. Available at Accessed on June 5, 2013.

6