IJFEAT INTERNATIONAL JOURNAL FOR ENGINEERING APPLICATIONS AND TECHNOLOGY

Hadoop Distributed File System: What and Why?

Ashwini Dhruva Nikam, Computer Science & Engineering, J.D.I.E.T., Yavatmal, Maharashtra, India

Abstract

As the amount of captured data increases over the years, so do our storage needs. Companies are realising that data is king, but one question remains: how do we analyse this big data? Hadoop has emerged as one answer. It has become an important distributed processing model for large-scale, data-intensive applications such as data mining and web indexing. Hadoop offers two frameworks, the Hadoop Distributed File System and MapReduce, to store and analyse big data. The Hadoop Distributed File System at Yahoo! stores forty petabytes of application data across thirty thousand nodes. It is highly fault-tolerant and is designed to be deployed on low-cost hardware. It provides high-throughput access to application data and is suitable for applications that have large data sets.

Index Terms: Hadoop, Distributed File System, big data, MapReduce, Throughput.

1. INTRODUCTION

What is the Hadoop Distributed File System, and why use it? The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware to analyse and process big data. It is the underlying framework of Hadoop. Each machine in the cluster contributes its own local storage to the file system.

What is big data? As the name suggests, big data is data of very large volume. Social networking services such as Facebook, Google Plus, Twitter and LinkedIn generate a huge volume of data every day. In earlier days, the largest data sets were measured in gigabytes (GB). Today data is generated in terabytes (TB) to petabytes (PB), and in the future it will reach zettabytes (ZB). These figures cover the social networking industry alone. Other industries add to them. In finance, for example, there are hundreds of stock exchanges worldwide, and companies also produce their own financial data. Each such source generates gigabytes or terabytes of data per day, and if the full history is considered, the total reaches terabytes or petabytes.

Fig-1: Big data resources

Besides volume, big data has further characteristics.

1) Variety. The data generated may be:

a) Structured Data: data in a regular tabular format. It can be a CSV file, another delimiter-separated file or a byte-separated file in which each record has a first field, a second field, a third field and so on.

b) Semi-structured Data: for example, blog posts and similar content.

c) Unstructured Data: audio files, video files, images and so on. The health care industry, for example, generates many images as part of patient data, and people upload videos and images to social networking sites. All of this is unstructured data, which is difficult to process and analyse.

2) Velocity. As the name suggests, velocity is how fast the data is generated. Facebook receives huge amounts of data at every moment. Every day the stock market opens, a huge volume of data is generated. The medical industry generates data every millisecond. Mobile phones, GPS systems and the security infrastructure of computer networks are further sources. Data is therefore generated in many ways and at very high speed. The challenge is that data is growing very fast while processing speed has remained roughly steady, without any disruptive improvement. The gap between the two keeps widening, and because of it a great deal of valuable information in big data is being lost.

3) What valuable information is being lost? Consider stock market data, which has a huge volume. If one wants to process a long history, say 30 years of data, to build a risk model, it takes months on existing legacy systems to run the processing and generate the analytical model.

2. ASSUMPTIONS AND GOALS OF HADOOP DISTRIBUTED FILE SYSTEM

2.1. Hardware Failure

Hardware failure is the norm rather than the exception. An HDFS instance may consist of hundreds or thousands of server machines, each storing part of the file system's data. The fact that there are a huge number of components and that each component has a non-trivial probability of failure means that some component of HDFS is always non-functional.
Therefore, detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS.

2.2. Streaming Data Access

Applications that run on HDFS need streaming access to their data sets. They are not general-purpose applications that typically run on general-purpose file systems. HDFS is designed more for batch processing than for interactive use. The emphasis is on high throughput of data access rather than low latency. POSIX imposes many hard requirements that are not needed for applications targeted at HDFS, so POSIX semantics in a few key areas have been traded away to increase data throughput rates.

2.3. Large Data Sets

Applications that run on HDFS have large data sets. A typical file in HDFS is gigabytes to terabytes in size. Thus, HDFS is tuned to support large files. It should provide high aggregate data bandwidth and scale to hundreds of nodes in a single cluster. It should support tens of millions of files in a single instance.

2.4. Simple Coherency Model

HDFS applications need a write-once-read-many access model for files. A file once created, written, and closed need not be changed. This assumption simplifies data coherency issues and enables high-throughput data access. A Map/Reduce application or a web crawler application fits perfectly with this model. There is a plan to support appending writes to files in the future.

2.5. Moving Computation is Cheaper than Moving Data

A computation requested by an application is much more efficient if it is executed near the data it operates on. This is especially true when the data set is huge. Running the computation near the data minimizes network congestion and increases the overall throughput of the system. The assumption is that it is often better to migrate the computation closer to where the data is located than to move the data to where the application is running. HDFS provides interfaces for applications to move themselves closer to where the data is located.

2.6. Portability across Heterogeneous Hardware and Software Platforms

HDFS has been designed to be easily portable from one platform to another. This facilitates widespread adoption of HDFS as a platform of choice for a large set of applications.
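The idea of moving computation to the data can be illustrated with a small scheduling sketch. It is not the Hadoop scheduler: the function name `assign_tasks`, the slot model and the greedy strategy are assumptions made for illustration only. Each task processes one block and is placed, when possible, on a node that already stores a replica of that block.

```python
def assign_tasks(block_locations, free_slots):
    """Greedy, locality-aware task placement (illustrative sketch only).

    block_locations: dict mapping block id -> list of nodes holding a replica
    free_slots:      dict mapping node -> number of tasks it can still accept
    Returns a dict mapping block id -> (chosen node, True if the read is local).
    """
    slots = dict(free_slots)  # work on a copy so the caller's dict is untouched
    assignment = {}
    for block_id, holders in block_locations.items():
        # Prefer a node that already stores the block and has spare capacity.
        local = [n for n in holders if slots.get(n, 0) > 0]
        if local:
            node, is_local = max(local, key=lambda n: slots[n]), True
        else:
            # Otherwise fall back to any free node; the block must cross the network.
            remote = [n for n, s in slots.items() if s > 0]
            if not remote:
                raise RuntimeError("no free task slots in the cluster")
            node, is_local = max(remote, key=lambda n: slots[n]), False
        slots[node] -= 1
        assignment[block_id] = (node, is_local)
    return assignment
```

A task marked as non-local is the case the section argues against: its input block has to be shipped to the node before any work can start.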

3. HADOOP DISTRIBUTED FILE SYSTEM ARCHITECTURE

The Hadoop Distributed File System is based on the Google File System, and its architecture is very similar. HDFS has a master-slave architecture: there is always one master, the NameNode, and a number of slave nodes.

Fig-2: Hadoop Distributed File System Architecture

1.1. Hadoop Distributed File System has three daemon services running: a) NameNode, b) Secondary NameNode, c) DataNode

An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes that they run on. HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is split into one or more blocks, and these blocks are stored in a set of DataNodes. The NameNode executes file system namespace operations such as opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes. The DataNodes are responsible for serving read and write requests from the file system's clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.

The NameNode and DataNode are pieces of software designed to run on commodity machines. These machines typically run a GNU/Linux operating system (OS). HDFS is built using the Java language; any machine that supports Java can run the NameNode or the DataNode software. Usage of the highly portable Java language means that HDFS can be deployed on a wide range of machines. A typical deployment has a dedicated machine that runs only the NameNode software. Each of the other machines in the cluster runs one instance of the DataNode software. The architecture does not preclude running multiple DataNodes on the same machine, but in a real deployment that is rarely the case.

The existence of a single NameNode in a cluster greatly simplifies the architecture of the system. The NameNode is the arbitrator and repository for all HDFS metadata. The system is designed in such a way that user data never flows through the NameNode.

Features of Hadoop Distributed File System

1) The File System Namespace

HDFS supports a traditional hierarchical file organization. A user or an application can create directories and store files inside these directories. The file system namespace hierarchy is similar to that of most other existing file systems: one can create and remove files, move a file from one directory to another, or rename a file. HDFS does not yet implement user quotas or access permissions, and it does not yet support hard links or soft links. However, the HDFS architecture does not preclude implementing these features.

The NameNode, the master node, maintains the file system namespace. Any change to the file system namespace or its properties is recorded by the NameNode. An application can specify the number of replicas of a file that should be maintained by HDFS. The number of copies of a file is known as the replication factor of that file. This information is stored by the NameNode.

Fig-3: Name space edit log

2) Data Replication

HDFS is designed to reliably store very large files across machines in a large cluster. It stores each file as a sequence of blocks; all blocks in a file except the last block are the same size. The blocks of a file are replicated for fault tolerance. The block size and the replication factor are configurable per file. An application can specify the number of replicas of a file. The replication factor can be specified at file creation time and can be changed later. Files in HDFS are write-once and have strictly one writer at any time.

The NameNode makes all decisions regarding replication of blocks. It periodically receives a Heartbeat and a Block report from each of the DataNodes in the cluster. Receipt of a Heartbeat implies that the DataNode is functioning properly. A Block report contains a list of all blocks on a DataNode.

Fig-4: Block replication

a) Replica Placement: The First Baby Steps

The placement of replicas is critical to HDFS reliability and performance. Optimizing replica placement distinguishes HDFS from most other distributed file systems. This is a feature that needs a lot of tuning and experience. The purpose of a rack-aware replica placement policy is to improve data reliability, availability, and network bandwidth utilization. The current implementation of the replica placement policy is a first effort in this direction. The short-term goals of implementing this policy are to validate it on production systems, learn more about its behaviour, and build a foundation to test and research more sophisticated policies.

Large HDFS instances run on a cluster of computers that commonly spreads across many racks. Communication between two nodes in different racks has to go through switches. In most cases, network bandwidth between machines in the same rack is greater than network bandwidth between machines in different racks.

The NameNode determines the rack each DataNode belongs to via the process outlined in Rack Awareness. A simple but non-optimal policy is to place replicas on unique racks. This prevents losing data when an entire rack fails and allows use of bandwidth from multiple racks when reading data. This policy evenly distributes replicas in the cluster, which makes it easy to balance load on component failure. However, it increases the cost of writes because a write needs to transfer blocks to multiple racks.

For the common case, when the replication factor is three, HDFS's placement policy is to put one replica on one node in the local rack, another on a different node in the local rack, and the last on a different node in a different rack. This policy cuts the inter-rack write traffic, which generally improves write performance. The chance of rack failure is far less than that of node failure, so this policy does not impact data reliability and availability guarantees. However, it reduces the aggregate network bandwidth used when reading data, since a block is placed in only two unique racks rather than three. With this policy, the replicas of a file are not evenly distributed across the racks. One third of replicas are on one node, two thirds of replicas are on one rack, and the other third are evenly distributed across the remaining racks. This policy improves write performance without compromising data reliability or read performance. The current, default replica placement policy described here is a work in progress.

b) Replica Selection

To minimize global bandwidth consumption and read latency, HDFS tries to satisfy a read request from a replica that is closest to the reader. If there exists a replica on the same rack as the reader node, then that replica is preferred to satisfy the read request.
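The placement and selection rules above can be sketched in a few lines. This is an illustrative model, not HDFS code: the names `place_replicas`, `select_replica` and `rack_of` and the dictionary-based `topology` are hypothetical, the first replica is assumed to go on the writer's own node, and a replica on the reader's own node is assumed to count as closest.

```python
import random

def rack_of(node, topology):
    """Return the rack that contains the given node."""
    return next(rack for rack, nodes in topology.items() if node in nodes)

def place_replicas(writer, topology, rng=random):
    """Pick three nodes: two in the writer's rack, one in a different rack.

    topology: dict mapping rack name -> list of node names (hypothetical model).
    Assumption: the first replica is stored on the writer's own node.
    """
    local_rack = rack_of(writer, topology)
    same_rack = [n for n in topology[local_rack] if n != writer]
    other_racks = [n for r, nodes in topology.items() if r != local_rack for n in nodes]
    if not same_rack or not other_racks:
        raise ValueError("need two nodes in the local rack and one node elsewhere")
    return [writer, rng.choice(same_rack), rng.choice(other_racks)]

def select_replica(reader, replicas, topology):
    """Prefer a replica on the reader's node, then on its rack, then any replica."""
    if reader in replicas:
        return reader
    reader_rack = rack_of(reader, topology)
    for node in replicas:
        if rack_of(node, topology) == reader_rack:
            return node
    return replicas[0]
```

With three replicas placed this way, a block always occupies exactly two racks, which is the trade-off the text describes: cheaper writes in exchange for less aggregate read bandwidth than a three-rack placement.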
If an HDFS cluster spans multiple data centres, then a replica that is resident in the local data centre is preferred over any remote replica.

c) Safe mode

On start-up, the NameNode enters a special state known as Safe mode. Replication of data blocks does not occur while the NameNode is in Safe mode. The NameNode receives Heartbeat and Block report messages from the DataNodes. A Block report contains the list of data blocks that a DataNode is hosting. Each block has a specified minimum number of replicas. A block is considered safely replicated when the minimum number of replicas of that block has checked in with the NameNode. After a configurable percentage of safely replicated data blocks has checked in with the NameNode (plus an additional 30 seconds), the NameNode exits Safe mode. It then determines the list of data blocks (if any) that still have fewer than the specified number of replicas and replicates these blocks to other DataNodes.

3) The Persistence of File System Metadata

The HDFS namespace is stored by the NameNode. The NameNode uses a transaction log called the EditLog to persistently record every change that occurs to file system metadata. For example, a change in the replication factor of a file causes a new record to be inserted into the EditLog. The NameNode uses a file in its local host operating system's file system to store the EditLog. The entire file system namespace, including the mapping of blocks to files and file system properties, is stored in a file called the FsImage. The FsImage is also stored as a file in the NameNode's local file system.

The NameNode keeps an image of the entire file system namespace and the file Blockmap in memory. This key metadata item is designed to be compact, such that a NameNode with 4 gigabytes of RAM is plenty to support a huge number of files and directories. When the NameNode starts up, it reads the FsImage and EditLog from disk, applies all the transactions from the EditLog to the in-memory representation of the FsImage, and flushes this new version out into a new FsImage on disk. It can then truncate the old EditLog because all of its transactions have been applied to the persistent FsImage. This process is known as a checkpoint. In the current implementation, a checkpoint occurs only when the NameNode starts up. Work is in progress to support periodic checkpointing in the near future.

The DataNode stores HDFS data in files in its local file system. The DataNode has no knowledge of HDFS files. It stores each block of HDFS data in a separate file in its local file system. The DataNode does not create all files in the same directory. Instead, it uses a heuristic to determine the optimal number of files per directory and creates subdirectories appropriately. It is not optimal to create all local files in the same directory because the local file system might not be able to efficiently support a large number of files in a single directory. When a DataNode starts up, it scans through its local file system, generates a list of all HDFS data blocks that correspond to each of these local files and sends this report to the NameNode. This report is the Blockreport.

4) The Communication Protocols

All HDFS communication protocols are layered on top of the Transmission Control Protocol (TCP)/Internet Protocol (IP). A client establishes a connection to a configurable TCP port on the NameNode machine. It talks the ClientProtocol with the NameNode. The DataNodes talk to the NameNode using the DataNode Protocol. A Remote Procedure Call (RPC) abstraction wraps both the ClientProtocol and the DataNode Protocol. By design, the NameNode never initiates any RPCs. Instead, it only responds to RPC requests issued by DataNodes or clients.

Fig-5: Communication between HDFS clusters

5) Robustness

The primary objective of HDFS is to store data reliably even in the presence of failures. The three common types of failures are NameNode failures, DataNode failures and network failures.

a) Data Disk Failure, Heartbeats and Re-Replication

Each DataNode sends a Heartbeat to the NameNode periodically. A network failure can cause a subset of DataNodes to lose connectivity with the NameNode. The NameNode detects this condition by the absence of a Heartbeat. It marks DataNodes without recent Heartbeats as dead and does not forward any new I/O requests to them. Any data that was registered to a dead DataNode is no longer available to HDFS. DataNode death may cause the replication factor of some blocks to fall below their specified value.
The NameNode constantly tracks which blocks need to be replicated and initiates replication whenever necessary to restore the specified replication factor. The need for re-replication may arise for many reasons: a DataNode may become unavailable, a replica may become corrupted, a hard disk on a DataNode may fail, or the replication factor of a file may be increased.
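The heartbeat and re-replication logic described above can be modelled as two small functions. This is a sketch under stated assumptions, not NameNode code: the function names, the in-memory dictionaries and the `timeout` parameter are hypothetical, and the source does not give an actual timeout value.

```python
def find_dead_nodes(last_heartbeat, now, timeout):
    """Return the DataNodes whose most recent Heartbeat is older than `timeout`.

    last_heartbeat: dict mapping node -> time of its last Heartbeat (seconds)
    timeout: illustrative threshold; the real value is a deployment setting.
    """
    return {node for node, seen in last_heartbeat.items() if now - seen > timeout}

def blocks_to_replicate(block_map, dead_nodes, target):
    """Return a dict mapping block -> number of replicas that must be re-created.

    block_map: dict mapping block -> set of nodes that reported a replica
    target:    dict mapping block -> specified replication factor
    """
    missing = {}
    for block, holders in block_map.items():
        live = set(holders) - dead_nodes  # replicas on dead nodes no longer count
        shortfall = target[block] - len(live)
        if shortfall > 0:
            missing[block] = shortfall
    return missing
```

In this model a single dead DataNode can leave several blocks under-replicated at once, which is why the NameNode tracks the shortfall per block rather than per node.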

b) Cluster Rebalancing

The HDFS architecture is compatible with data rebalancing schemes. A scheme might automatically move data from one DataNode to another if the free space on a DataNode falls below a certain threshold. In the event of a sudden high demand for a particular file, a scheme might dynamically create additional replicas and rebalance other data in the cluster. These types of data rebalancing schemes are not yet implemented.

c) Data Integrity

It is possible that a block of data fetched from a DataNode arrives corrupted. This corruption can occur because of faults in a storage device, network faults, or buggy software. The HDFS client software implements checksum checking on the contents of HDFS files. When a client creates an HDFS file, it computes a checksum of each block of the file and stores these checksums in a separate hidden file in the same HDFS namespace. When a client retrieves file contents, it verifies that the data it received from each DataNode matches the checksum stored in the associated checksum file. If it does not, the client can opt to retrieve that block from another DataNode that has a replica of it.

d) Metadata Disk Failure

The FsImage and the EditLog are central data structures of HDFS. Corruption of these files can cause the HDFS instance to become non-functional. For this reason, the NameNode can be configured to maintain multiple copies of the FsImage and EditLog. Any update to either the FsImage or the EditLog causes each copy to be updated synchronously. This synchronous updating of multiple copies may reduce the rate of namespace transactions per second that a NameNode can support. However, this degradation is acceptable because, even though HDFS applications are very data intensive, they are not metadata intensive.
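The cost of keeping several EditLog copies in step can be seen in a minimal sketch. The record format, the file names and the function names `append_edit` and `read_edits` are assumptions made for illustration; the real EditLog is a binary format managed by the NameNode.

```python
import os

def append_edit(record, log_paths):
    """Append one record to every EditLog copy before returning.

    Each copy is flushed and synced to disk, so the caller only continues
    once all copies hold the record. That wait is the throughput cost the
    text mentions. The one-line text record format is hypothetical.
    """
    line = (record + "\n").encode("utf-8")
    for path in log_paths:
        with open(path, "ab") as log:
            log.write(line)
            log.flush()
            os.fsync(log.fileno())  # force the write through to the storage device

def read_edits(path):
    """Read back all records of one EditLog copy, in order."""
    with open(path, "rb") as log:
        return log.read().decode("utf-8").splitlines()
```

Placing the copies on different disks, or one of them on a remote mount, is what makes the redundancy useful; every additional copy adds one more synchronous write to each namespace transaction.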
When a NameNode restarts, it selects the latest consistent FsImage and EditLog to use. The NameNode machine is a single point of failure for an HDFS cluster. If the NameNode machine fails, manual intervention is necessary. Currently, automatic restart and automatic failover of the NameNode software to another machine are not supported.

e) Snapshots

Snapshots support storing a copy of data at a particular instant of time. One use of the snapshot feature may be to roll back a corrupted HDFS instance to a previously known good point in time. HDFS does not currently support snapshots, but support is expected in a future release.

C. Future Work

The Hadoop cluster is effectively unavailable if its NameNode is down. Given that HDFS is used primarily as a batch system, restarting the NameNode has been a satisfactory means of recovery. However, steps have been taken towards automated failover. Currently a BackupNode receives all transactions from the primary NameNode. This will allow a failover to a warm or even a hot BackupNode if block reports are sent to both the primary NameNode and the BackupNode. A few Hadoop users outside Yahoo! have experimented with manual failover. The plan is to use ZooKeeper, Yahoo!'s distributed consensus technology, to build an automated failover solution.

Scalability of the NameNode has been a key struggle. Because the NameNode keeps the entire namespace and all block locations in memory, the size of the NameNode heap has limited the number of files and the number of blocks addressable. The main challenge has been that when the NameNode's memory usage is close to the maximum, it becomes unresponsive due to Java garbage collection and sometimes requires a restart. While Yahoo! has encouraged users to create larger files, this has not happened since it would require changes in application behaviour.
Quotas were added to manage file system usage, and an archive tool has been provided. However, these measures do not fundamentally address the scalability problem. The near-term solution to scalability is to allow multiple namespaces to share the physical storage within a cluster. Block IDs are being extended with a block pool identifier as a prefix. Block pools are analogous to LUNs in a SAN storage system, and a namespace with its pool of blocks is analogous to a file system. This approach is fairly simple and requires minimal changes to the system. It offers a number of advantages besides scalability: it isolates the namespaces of different sets of applications and improves the overall availability of the cluster. It also generalizes the block storage abstraction to allow other services to use the block storage service, perhaps with a different namespace structure. Other approaches to scaling are planned for exploration in the future, such as storing only a partial namespace in memory and a truly distributed implementation of the NameNode. In particular, the assumption that applications will create a small number of large files was flawed. As previously stated, changing application behaviour is hard. Furthermore, the Yahoo! team is seeing new classes of applications for HDFS that need to store a huge number of smaller files.

The main drawback of multiple independent namespaces is the cost of managing them, especially if the number of namespaces is large. There are also plans to use application-centric or job-centric namespaces rather than cluster-centric namespaces. This is analogous to the per-process namespaces that were used to deal with remote execution in distributed systems in the late 80s and early 90s.

Currently the clusters at Yahoo! have fewer than 4000 nodes. The developers believe they can scale to much larger clusters with the solutions outlined above. However, they also believe it is more efficient to have multiple clusters rather than a single large cluster, as this allows much improved availability and isolation. To that end, they plan to provide greater cooperation between clusters.

CONCLUSION

This paper concludes that the Hadoop Distributed File System is a solution for social networking companies, software companies and financial institutions that need to store, process and analyse big data. HDFS provides a new and efficient way to work with big data. It is similar to earlier distributed file systems, but it differs from them in ways that make big data processing more efficient than with traditional techniques. The evolution of the Hadoop Distributed File System will lead to a corresponding evolution in cloud computing.