PERFORMANCE ANALYSIS OF A DISTRIBUTED FILE SYSTEM
SUBMITTED BY
DIBYENDU KARMAKAR
EXAMINATION ROLL NUMBER: M4SWE13-07
REGISTRATION NUMBER: of

A THESIS SUBMITTED TO THE FACULTY OF ENGINEERING & TECHNOLOGY OF JADAVPUR UNIVERSITY IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF ENGINEERING IN SOFTWARE ENGINEERING

UNDER THE SUPERVISION OF
MR. UTPAL KUMAR RAY
VISITING FACULTY
DEPARTMENT OF INFORMATION TECHNOLOGY
JADAVPUR UNIVERSITY
2013
DEPARTMENT OF INFORMATION TECHNOLOGY
FACULTY OF ENGINEERING & TECHNOLOGY
JADAVPUR UNIVERSITY

CERTIFICATE OF SUBMISSION

I hereby recommend that the thesis, entitled "Performance Analysis of a Distributed File System", prepared by Dibyendu Karmakar (Registration No. of ) under my supervision, be accepted in partial fulfillment of the requirements for the degree of Master of Engineering in Software Engineering from the Department of Information Technology under Jadavpur University.

(MR. UTPAL KUMAR RAY)
VISITING FACULTY
DEPARTMENT OF INFORMATION TECHNOLOGY
JADAVPUR UNIVERSITY

COUNTERSIGNED BY:

(HEAD OF THE DEPARTMENT)
INFORMATION TECHNOLOGY
JADAVPUR UNIVERSITY
DEPARTMENT OF INFORMATION TECHNOLOGY
FACULTY OF ENGINEERING & TECHNOLOGY
JADAVPUR UNIVERSITY

CERTIFICATE OF APPROVAL

The thesis at instance is hereby approved as a creditable study of an Engineering subject, carried out and presented in a manner satisfactory to warrant its acceptance as a prerequisite to the degree for which it has been submitted. It is understood that by this approval the undersigned do not necessarily endorse or approve any statement made, opinion expressed or conclusion drawn therein, but approve this thesis for the purpose for which it is submitted.

(DR. SASWAT CHAKRABARTI)
PROFESSOR AND HEAD
GS SANYAL SCHOOL OF TELECOMMUNICATION
IIT - KHARAGPUR

(MR. UTPAL KUMAR RAY)
VISITING FACULTY
DEPARTMENT OF INFORMATION TECHNOLOGY
JADAVPUR UNIVERSITY
DECLARATION OF ORIGINALITY AND COMPLIANCE OF ACADEMIC ETHICS

I hereby declare that this thesis contains a literature survey and original research work by me, as part of my Master of Engineering in Software Engineering studies. All information in this document has been obtained and presented in accordance with academic rules and ethical conduct. I also declare that, as required by these rules and conduct, I have fully cited and referenced all material and results that are not original to this work.

(SIGNATURE WITH DATE)

NAME: DIBYENDU KARMAKAR
EXAM ROLL NO: M4SWE13-07
THESIS TITLE: PERFORMANCE ANALYSIS OF A DISTRIBUTED FILE SYSTEM
ACKNOWLEDGEMENT

THE SUCCESS AND FINAL OUTCOME OF THIS PROJECT REQUIRED A LOT OF GUIDANCE AND ASSISTANCE FROM MANY PEOPLE, AND I AM EXTREMELY FORTUNATE TO HAVE RECEIVED IT ALL ALONG THE COMPLETION OF MY PROJECT WORK. WHATEVER I HAVE DONE IS ONLY DUE TO SUCH GUIDANCE AND ASSISTANCE, AND I WILL NOT FORGET TO THANK THEM.

I OWE MY PROFOUND GRATITUDE TO MY PROJECT GUIDE, PROF. UTPAL KUMAR RAY, WHO TOOK A KEEN INTEREST IN MY PROJECT WORK AND GUIDED ME ALL ALONG, TILL THE COMPLETION OF MY PROJECT WORK, BY PROVIDING ALL THE NECESSARY INFORMATION. I ALSO THANK THE HADOOP COMMUNITY MEMBERS FOR THEIR TIMELY SUPPORT AND ASSISTANCE. I WOULD ALSO LIKE TO THANK ALL OF MY CLASSMATES FOR THE CONSTANT SUPPORT AND HELP THEY PROVIDED ALL THE TIME.

LOCATION:
DATE:

REGARDS,
(DIBYENDU KARMAKAR)
M.E. IN SOFTWARE ENGINEERING
CLASS ROLL NO:
EXAM ROLL NO: M4SWE13-07
REGISTRATION NO: OF
DEDICATED TO MY BELOVED PARENTS
ABSTRACT

The need for large-scale data storage has gradually increased in recent years; individual companies are storing petabytes of data. The need to reduce the access time of this data is also evident. As the size and value of the stored data increase, the importance of fault tolerance and reliability increases as well. Distributed file systems have gradually become popular in this regard.

This thesis presents an approach to improve the performance of a distributed file system by analyzing and tuning a few of its configuration parameters. These parameters follow a particular curve or graph and are tunable for better performance of a distributed file system. The Hadoop Distributed File System (HDFS) has been taken as a representative distributed file system, and the basic working principles of Hadoop are highlighted along with the setup configuration. The experimental results and a suitable graph for each performance tuning parameter are shown, explaining how the distributed file system behaves with respect to these parameters. The conclusion specifies which parameters play a key role in increasing the performance of distributed file systems, along with tunable values for those parameters.
TABLE OF CONTENTS

Chapter 1: Introduction
    1.1 Motivation
    1.2 Focus
    1.3 Organization

Chapter 2: Introduction to Distributed File System
    2.1 What is a File System?
    2.2 Definition of Distributed File System
    2.3 Why Distributed File System

Chapter 3: Hadoop Concepts
    3.1 Architecture
        3.1.1 Cluster
        3.1.2 Namenode
        3.1.3 Datanode
        3.1.4 HDFS Client
        3.1.5 Image and Journal
        3.1.6 Checkpoint Node
        3.1.7 Backup Node
        3.1.8 File System Snapshots
    3.2 IO Operations and Replica Management
        3.2.1 File Read and Write
        3.2.2 Heartbeat and Block Report
        3.2.3 Staging
        3.2.4 Replication Pipelining
        3.2.5 Data Block Placement
        3.2.6 Replica Management
        3.2.7 Balancer
        3.2.8 Block Scanner
        3.2.9 Decommissioning
        3.2.10 Inter Cluster Data Copy

Chapter 4: Setting up the Hadoop Environment
    4.1 Hadoop Configuration
    4.2 Node Configuration
    4.3 System Configuration

Chapter 5: Hadoop Performance Tuning Parameters
    5.1 Cluster-Level Tunable Parameters
    5.2 Server-Level Tunable Parameters
    5.3 HDFS Tunable Parameters

Chapter 6: Performance Results & Analysis
    6.1 Scenario 1: Effect of Multiple Clients
    6.2 Scenario 2: Effect of Replication Factor (Replication Factor < No. of Available Datanodes)
    6.3 Scenario 3: Effect of Replication Factor (Replication Factor > No. of Available Datanodes)
    6.4 Scenario 4: Effect of Block Size (dfs.block.size)
    6.5 Scenario 5: Effect of IO Buffer Size (io.file.buffer.size)
    6.6 Scenario 6: Effect of dfs.access.time.precision
    6.7 Scenario 7: Effect of dfs.replication.interval
    6.8 Scenario 8: Effect of Heartbeat and Blockreport Intervals
    6.9 Scenario 9: Effect of Server and Block Level Threads
    (each scenario includes Performance Results and Performance Analysis subsections)

Chapter 7: Conclusion
    7.1 Conclusion
    7.2 Further Work

References

Appendix A: Hadoop Installation
Appendix B: Hadoop Shell Commands
Appendix C: Dealing with Installation Errors
Appendix D: Hadoop User and Admin Commands
CHAPTER 1
INTRODUCTION
1.1 MOTIVATION

In recent years the amount of data stored worldwide has increased by a factor of nine. Individual companies often store petabytes of data containing the business information that leads to their continued growth and success. However, this data is often too large to store and analyse in traditional relational databases, or the time required to analyse it is too long. Further, the business insight that can be gained from these large amounts of data may be valuable, but it is effectively inaccessible if the IT costs of reaching it are greater still.

Distributed software platforms have grown popular for managing and storing large amounts of data in a cost-effective way that satisfies the above needs. A distributed file system gives developers the opportunity to focus on high-level algorithms by providing high reliability, instant backup facilities, fault tolerance, etc. It is designed to run on a large cluster scaling to hundreds or thousands of nodes. Hence the need for a distributed file system to overcome the performance issue is evident.

1.2 FOCUS

The main focus of this thesis is to improve the performance of distributed file systems. The performance of a file system can be increased by reducing the computational time needed to perform the required operations. An obvious approach is to upgrade the processors of each individual machine, which is evidently not cost effective. This thesis presents an approach to gain better performance by analyzing and tuning a few configuration parameters of a distributed file system. These parameters follow a particular curve or graph and are tunable for better performance. The Hadoop Distributed File System (HDFS) has been taken as the reference distributed file system in this work.
So, all experiments are performed in HDFS.

1.3 ORGANIZATION

The organization of this thesis is as follows:

CHAPTER 2 defines file systems and distributed file systems, and explains why distributed file systems are in demand today.

CHAPTER 3 highlights the basic concepts of the Hadoop Distributed File System (HDFS), focusing on its architecture and operational aspects.
CHAPTER 4 describes the Hadoop environment used in this experiment, i.e. the cluster, the number and type of nodes in the cluster, the Hadoop version, the network bandwidth and the machine configuration.

CHAPTER 5 discusses the performance tuning parameters (a subset of the Hadoop configuration parameters).

CHAPTER 6 analyzes the parameters of Chapter 5 with respect to performance, showing the curve or graph that each parameter follows along with the experimental measurements.

CHAPTER 7 provides the conclusion, specifying which parameters play a key role in the performance improvement of Hadoop and which can be ignored. It is followed by the REFERENCES used in the project.

APPENDICES provide information about Hadoop installation, dealing with errors while installing Hadoop, and a list of all Hadoop commands.
CHAPTER 2
INTRODUCTION TO DISTRIBUTED FILE SYSTEM

Chapter Gist: This chapter defines file systems and distributed file systems, and thereafter describes the need for a distributed file system.
2.1 WHAT IS A FILE SYSTEM?

A file system[1] is a subsystem of an operating system that performs file management activities such as organization, storing, retrieval, naming, sharing and protection of files. File systems are used on data storage devices, such as hard disk drives, floppy disks, optical discs, or flash memory storage devices, to maintain the physical locations of the computer files and directories. Examples:

I. FAT (File Allocation Table)
II. NTFS (New Technology File System), used in Microsoft's Windows 7, Windows Vista and Windows XP

2.2 DEFINITION OF DISTRIBUTED FILE SYSTEM

A distributed file system[1][2][9] is a client/server based application that allows clients to access and process data stored on the server as if it were on their own computer. A distributed file system organizes the file and directory services of individual servers into a global directory in such a way that remote data access is not location-specific but is identical from any client. All files are accessible to all users of the global file system, and the organization is hierarchical and directory-based. As a whole, a distributed file system is any file system that allows access to files from multiple hosts/clients via a computer network. Examples:

I. Hadoop Distributed File System
II. Mobile Agent Based Distributed File System
III. Parallel Virtual File System
IV. Fraunhofer File System

2.3 WHY DISTRIBUTED FILE SYSTEM

Distributed file systems have been introduced due to several advantages[2] over centralized file systems, such as:

USER MOBILITY: The flexibility to work on different nodes at different times without the necessity of physically relocating secondary storage devices.
REMOTE INFORMATION SHARING: Transparent access to files by processes of any node (host/client), irrespective of the file's location.

AVAILABILITY: Files remain available in the event of temporary failure of one or more nodes, by means of replicas (copies of the original files).

PERFORMANCE: High performance can be achieved by executing the sub-processes of a particular process in parallel on multiple remote nodes.
CHAPTER 3
HADOOP CONCEPTS

Chapter Gist: This chapter highlights the basic concepts of the Hadoop Distributed File System (HDFS), focusing on its architecture and operational aspects.
Hadoop is an Apache project; all components are available via the Apache open source license. Yahoo has developed and contributed about 80% of the core of Hadoop. Hadoop provides a distributed file system and a framework for the analysis and transformation of very large data sets using the MapReduce paradigm. An important characteristic of Hadoop is the partitioning of data and computation across many (thousands of) hosts, and the execution of application computations in parallel, close to their data.

HDFS is the file system component of Hadoop. While the interface to HDFS is patterned after the UNIX file system, faithfulness to standards was sacrificed in favor of improved performance for the applications at hand. HDFS stores file system metadata and application data separately. As in other distributed file systems, like PVFS, Lustre and GFS, HDFS stores metadata on a dedicated server, called the NameNode. Application data are stored on other servers, called DataNodes. All servers are fully connected and communicate with each other using TCP-based protocols.

3.1 ARCHITECTURE

3.1.1 CLUSTER

The Hadoop Distributed File System is composed of two types of nodes: DataNodes[3][4][5][6] and NameNodes[3][4][5]. All nodes in this distributed file system are grouped into clusters. Each cluster contains one NameNode and multiple DataNodes.

3.1.2 NAMENODE

The HDFS namespace is a hierarchy of files and directories. Files and directories are represented on the NameNode by inodes, which record attributes like permissions, modification and access times, namespace and disk space quotas. The file content is split into large blocks (typically 128 megabytes, but user selectable file-by-file) and each block of the file is independently replicated at multiple DataNodes (typically three, but user selectable file-by-file). The NameNode maintains the namespace tree and the mapping of file blocks to DataNodes. 
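The per-file block size and replication factor described here are ordinary HDFS configuration parameters; dfs.block.size is one of the parameters tuned in Chapter 6. A minimal hdfs-site.xml fragment might look like the following sketch (the values shown are illustrative, not the experimental settings used later):

```xml
<!-- hdfs-site.xml fragment (illustrative values) -->
<configuration>
  <property>
    <name>dfs.block.size</name>
    <value>134217728</value> <!-- 128 MB block size, in bytes -->
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value>         <!-- default replication factor per block -->
  </property>
</configuration>
```

Both values can also be overridden per file by the client at creation time, which is what "user selectable file-by-file" means above.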
An HDFS client wanting to read a file first contacts the NameNode for the locations of the data blocks comprising the file, and then reads block contents from the DataNode closest to the client. When writing data, the client requests the NameNode to nominate a suite of three DataNodes to host the block replicas. The client then writes data to the DataNodes in a pipeline fashion. The current design has a single NameNode for each cluster. The cluster can have thousands of DataNodes and tens of thousands of HDFS clients, as each DataNode may execute multiple application tasks concurrently.

HDFS keeps the entire namespace in RAM. The inode data and the list of blocks belonging to each file comprise the metadata of the name system, called the image. The persistent record of
the image, stored in the local host's native file system, is called a checkpoint. The NameNode also stores the modification log of the image, called the journal, in the local host's native file system. For improved durability, redundant copies of the checkpoint and journal can be made at other servers. During restarts the NameNode restores the namespace by reading the checkpoint and replaying the journal. The locations of block replicas may change over time and are not part of the persistent checkpoint.

3.1.3 DATANODE

[Figure 3.1: HDFS Architecture]

Each block replica on a DataNode is represented by two files in the local host's native file system. The first file contains the data itself; the second file holds the block's metadata, including checksums for the block data and the block's generation stamp. The size of the data file equals the actual length of the block and does not require extra space to round it up to the nominal block size, as in traditional file systems. Thus, if a block is half full it needs only half of the space of a full block on the local drive.

During startup each DataNode connects to the NameNode and performs a handshake. The purpose of the handshake is to verify the namespace ID and the software version of the DataNode. If either does not match that of the NameNode, the DataNode automatically shuts down.
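The handshake logic can be sketched as a toy model. The dictionary-based "nodes" and their field names below are illustrative, not Hadoop code; the sketch also includes the rule, described in this section, that a newly formatted DataNode with no namespace ID yet may join and adopt the cluster's ID:

```python
# Toy model of the DataNode startup handshake: the DataNode may only join
# if its software version matches and its namespace ID matches the cluster's.
# A DataNode with no namespace ID (freshly formatted) joins and adopts the
# cluster's ID. All names here are illustrative.

def handshake(namenode, datanode):
    """Return True if the DataNode may join the cluster."""
    if datanode["software_version"] != namenode["software_version"]:
        return False  # version mismatch: the DataNode shuts itself down
    if datanode["namespace_id"] is None:
        # Newly initialized DataNode: adopt the cluster's namespace ID.
        datanode["namespace_id"] = namenode["namespace_id"]
        return True
    return datanode["namespace_id"] == namenode["namespace_id"]

nn = {"namespace_id": 471, "software_version": "1.0.4"}
fresh = {"namespace_id": None, "software_version": "1.0.4"}
stale = {"namespace_id": 12, "software_version": "1.0.4"}

assert handshake(nn, fresh) and fresh["namespace_id"] == 471
assert not handshake(nn, stale)  # foreign namespace ID: refused
```

Rejecting mismatched namespace IDs is what preserves file system integrity when a node from another cluster is accidentally pointed at this NameNode.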
The namespace ID is assigned to the file system instance when it is formatted, and is persistently stored on all nodes of the cluster. Nodes with a different namespace ID will not be able to join the cluster, thus preserving the integrity of the file system. The consistency of software versions is important because an incompatible version may cause data corruption or loss, and on large clusters of thousands of machines it is easy to overlook nodes that did not shut down properly prior to a software upgrade or were not available during the upgrade. A DataNode that is newly initialized and without any namespace ID is permitted to join the cluster and receive the cluster's namespace ID.

After the handshake the DataNode registers with the NameNode. DataNodes persistently store their unique storage IDs. The storage ID is an internal identifier of the DataNode, which makes it recognizable even if it is restarted with a different IP address or port. The storage ID is assigned to the DataNode when it registers with the NameNode for the first time and never changes after that.

A DataNode identifies the block replicas in its possession by sending a block report. DataNodes also send heartbeats to the NameNode, indicating the presence and proper functioning of the DataNode. The NameNode does not directly call DataNodes; it uses replies to heartbeats to send instructions to the DataNodes. The instructions include commands to:

- Replicate blocks to other nodes
- Remove local block replicas
- Reregister or shut down the node
- Send an immediate block report

These commands are important for maintaining the overall system integrity, and it is therefore critical to keep heartbeats frequent even on big clusters. The NameNode can process thousands of heartbeats per second without affecting other NameNode operations.

3.1.4 HDFS CLIENT

User applications[4][5][6][10][11] access the file system using the HDFS client, a code library that exports the HDFS file system interface. 
Similar to most conventional file systems, HDFS supports operations to read, write and delete files, and operations to create and delete directories. The user references files and directories by paths in the namespace. The user application generally does not need to know that file system metadata and storage are on different servers, or that blocks have multiple replicas.

When an application reads a file, the HDFS client first asks the NameNode for the list of DataNodes that host replicas of the blocks of the file. It then contacts a DataNode directly and requests the transfer of the desired block. When a client writes, it first asks the NameNode to choose DataNodes to host replicas of the first block of the file. The client organizes a pipeline from
node-to-node and sends the data. When the first block is filled, the client requests new DataNodes to be chosen to host replicas of the next block, and so on.

Unlike conventional file systems, HDFS provides an API that exposes the locations of a file's blocks. This allows applications like the MapReduce framework to schedule a task to where the data are located, thus improving the read performance. It also allows an application to set the replication factor of a file. By default a file's replication factor is three.

3.1.5 IMAGE AND JOURNAL

The namespace image[6][7] is the file system metadata that describes the organization of application data as directories and files. A persistent record of the image written to disk is called a checkpoint. The journal is a write-ahead commit log for changes to the file system that must be persistent. For each client-initiated transaction, the change is recorded in the journal, and the journal file is flushed and synced before the change is committed to the HDFS client. The checkpoint file is never changed by the NameNode. It is replaced in its entirety when a new checkpoint is created during restart, when requested by the administrator, or by the CheckpointNode described in the next section. During startup the NameNode initializes the namespace image from the checkpoint, and then replays changes from the journal until the image is up-to-date with the last state of the file system. A new checkpoint and an empty journal are written back to the storage directories before the NameNode starts serving clients.

If either the checkpoint or the journal is missing or becomes corrupt, the namespace information will be lost partly or entirely. In order to preserve this critical information, HDFS can be configured to store the checkpoint and journal in multiple storage directories. Recommended practice is to place the directories on different volumes, and for one storage directory to be on a remote NFS server. 
The first choice prevents loss from single-volume failures, and the second choice protects against failure of the entire node. If the NameNode encounters an error writing the journal to one of the storage directories, it automatically excludes that directory from the list of storage directories. The NameNode automatically shuts itself down if no storage directory is available.

The NameNode is a multithreaded system and processes requests simultaneously from multiple clients. Saving a transaction to disk becomes a bottleneck, since all other threads need to wait until the synchronous flush-and-sync procedure initiated by one of them is complete. In order to optimize this process, the NameNode batches multiple transactions initiated by different clients. When one of the NameNode's threads initiates a flush-and-sync operation, all transactions batched at that time are committed together. The remaining threads only need to check that their transactions have been saved and do not need to initiate a flush-and-sync operation.
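The checkpoint-plus-journal recovery scheme can be sketched in a few lines. This is a toy model, not Hadoop code: the namespace image is a dict, the checkpoint is a saved copy of it, and the journal is a list of create/delete records that are replayed in order on startup:

```python
# Toy model of NameNode startup: load the persisted checkpoint, then replay
# the write-ahead journal so the in-memory image reaches the last committed
# state of the file system.

def recover(checkpoint, journal):
    image = dict(checkpoint)           # start from the persisted image
    for op, path, meta in journal:     # replay journal entries in order
        if op == "create":
            image[path] = meta
        elif op == "delete":
            image.pop(path, None)
    return image

checkpoint = {"/a": {"replication": 3}}
journal = [
    ("create", "/b", {"replication": 2}),  # recorded after the checkpoint
    ("delete", "/a", None),
]
image = recover(checkpoint, journal)
assert image == {"/b": {"replication": 2}}
```

Because every change is journaled before it is acknowledged to the client, replaying the journal over the last checkpoint is guaranteed to reproduce the namespace as of the last committed transaction.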
3.1.6 CHECKPOINTNODE

The NameNode in HDFS, in addition to its primary role of serving client requests, can alternatively execute one of two other roles: a CheckpointNode[3][5][6][7] or a BackupNode[6][7][8]. The role is specified at node startup.

The CheckpointNode periodically combines the existing checkpoint and journal to create a new checkpoint and an empty journal. The CheckpointNode usually runs on a different host from the NameNode, since it has the same memory requirements as the NameNode. It downloads the current checkpoint and journal files from the NameNode, merges them locally, and returns the new checkpoint to the NameNode.

Creating periodic checkpoints is one way to protect the file system metadata. The system can start from the most recent checkpoint if all other persistent copies of the namespace image or journal are unavailable. Creating a checkpoint also lets the NameNode truncate the tail of the journal when the new checkpoint is uploaded to the NameNode. HDFS clusters run for prolonged periods of time without restarts, during which the journal constantly grows. If the journal grows very large, the probability of loss or corruption of the journal file increases. A very large journal also extends the time required to restart the NameNode: for a large cluster, it takes an hour to process a week-long journal. Good practice is to create a daily checkpoint.

3.1.7 BACKUPNODE

A recently introduced feature of HDFS is the BackupNode. Like a CheckpointNode, the BackupNode is capable of creating periodic checkpoints, but in addition it maintains an in-memory, up-to-date image of the file system namespace that is always synchronized with the state of the NameNode. The BackupNode accepts the journal stream of namespace transactions from the active NameNode, saves them to its own storage directories, and applies these transactions to its own namespace image in memory. 
The NameNode treats the BackupNode as a journal store, the same as it treats the journal files in its storage directories. If the NameNode fails, the BackupNode's image in memory and its checkpoint on disk are a record of the latest namespace state.

The BackupNode can create a checkpoint without downloading checkpoint and journal files from the active NameNode, since it already has an up-to-date namespace image in its memory. This makes the checkpoint process on the BackupNode more efficient, as it only needs to save the namespace into its local storage directories.

The BackupNode can be viewed as a read-only NameNode. It contains all file system metadata information except for block locations. It can perform all operations of the regular NameNode that do not involve modification of the namespace or knowledge of block locations. Use of a BackupNode provides the option of running the NameNode without persistent storage, delegating responsibility for persisting the namespace state to the BackupNode.
3.1.8 FILE SYSTEM SNAPSHOT

During software upgrades the possibility of corrupting the system due to software bugs or human mistakes increases. The purpose of creating snapshots in HDFS is to minimize potential damage to the data stored in the system during upgrades. The snapshot mechanism[8][7][6] lets administrators persistently save the current state of the file system, so that if the upgrade results in data loss or corruption it is possible to roll back the upgrade and return HDFS to the namespace and storage state as they were at the time of the snapshot.

The snapshot (only one can exist) is created at the cluster administrator's option whenever the system is started. If a snapshot is requested, the NameNode first reads the checkpoint and journal files and merges them in memory. Then it writes the new checkpoint and the empty journal to a new location, so that the old checkpoint and journal remain unchanged.

During the handshake the NameNode instructs the DataNodes whether to create a local snapshot. The local snapshot on the DataNode cannot be created by replicating the data file directories, as this would require doubling the storage capacity of every DataNode on the cluster. Instead, each DataNode creates a copy of the storage directory and hard links the existing block files into it. When the DataNode removes a block it removes only the hard link, and block modifications during appends use the copy-on-write technique. Thus old block replicas remain untouched in their old directories.

The cluster administrator can choose to roll back HDFS to the snapshot state when restarting the system. The NameNode recovers the checkpoint saved when the snapshot was created. DataNodes restore the previously renamed directories and initiate a background process to delete block replicas created after the snapshot was made. Having chosen to roll back, there is no provision to roll forward. 
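The hard-link trick used by the DataNode snapshot can be demonstrated with ordinary files: linking a block file into a snapshot directory costs no extra data space, and deleting the "current" link leaves the snapshot's view of the block intact. A small illustration in plain Python (not DataNode code; file names are made up, and hard links require a file system that supports them):

```python
import os
import tempfile

# Demonstrates the snapshot technique described above: block files are
# hard-linked into a snapshot directory, so removing the live link does
# not destroy the snapshot's copy of the block.
root = tempfile.mkdtemp()
current = os.path.join(root, "current")
snapshot = os.path.join(root, "snapshot")
os.makedirs(current)
os.makedirs(snapshot)

block = os.path.join(current, "blk_0001")   # hypothetical block file name
with open(block, "wb") as f:
    f.write(b"block data")

# "Take a snapshot": a hard link, not a data copy.
os.link(block, os.path.join(snapshot, "blk_0001"))

# "Delete" the block from the live tree: only the link goes away.
os.remove(block)

with open(os.path.join(snapshot, "blk_0001"), "rb") as f:
    assert f.read() == b"block data"  # old replica survives in the snapshot
```

This is why the snapshot does not double the storage requirement: both directory entries point at the same on-disk data until an append triggers copy-on-write.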
The cluster administrator can recover the storage occupied by the snapshot by commanding the system to abandon the snapshot, thus finalizing the software upgrade.

System evolution may lead to a change in the format of the NameNode's checkpoint and journal files, or in the data representation of block replica files on DataNodes. The layout version identifies the data representation formats and is persistently stored in the NameNode's and the DataNodes' storage directories. During startup each node compares the layout version of the current software with the version stored in its storage directories, and automatically converts data from older formats to the newer ones. The conversion requires the mandatory creation of a snapshot when the system restarts with the new software layout version.

HDFS does not separate layout versions for the NameNode and DataNodes, because snapshot creation must be an all-cluster effort rather than a node-selective event. If an upgraded NameNode, due to a software bug, purges its image, then backing up only the namespace state still results in total data loss, as the NameNode will not recognize the blocks reported by the DataNodes,
and will order their deletion. Rolling back in this case will recover the metadata, but the data itself will be lost. A coordinated snapshot is required to avoid such cataclysmic destruction.

3.2 FILE I/O OPERATIONS AND REPLICA MANAGEMENT

3.2.1 FILE READ AND WRITE

An application adds data to HDFS by creating a new file and writing the data to it. After the file is closed, the bytes written cannot be altered or removed, except that new data can be added to the file by reopening it for append. HDFS implements a single-writer, multiple-reader model.

The HDFS client that opens a file for writing is granted a lease for the file; no other client can write to the file. The writing client periodically renews the lease by sending a heartbeat to the NameNode. When the file is closed, the lease is revoked. The lease duration is bound by a soft limit and a hard limit. Until the soft limit expires, the writer is certain of exclusive access to the file. If the soft limit expires and the client fails to close the file or renew the lease, another client can preempt the lease. If the hard limit (one hour) expires and the client has failed to renew the lease, HDFS assumes that the client has quit; it will automatically close the file on behalf of the writer and recover the lease. The writer's lease does not prevent other clients from reading the file; a file may have many concurrent readers.

An HDFS file consists of blocks. When there is a need for a new block, the NameNode allocates a block with a unique block ID and determines a list of DataNodes to host replicas of the block. The DataNodes form a pipeline, the order of which minimizes the total network distance from the client to the last DataNode. Bytes are pushed to the pipeline as a sequence of packets. The bytes that an application writes are first buffered at the client side. After a packet buffer is filled (typically 64 KB), the data are pushed to the pipeline. 
The next packet can be pushed to the pipeline before receiving the acknowledgement for the previous packets. The number of outstanding packets is limited by the client's outstanding-packets window size.

In a cluster of thousands of nodes, failures of a node (most commonly storage faults) are daily occurrences. A replica stored on a DataNode may become corrupted because of faults in memory, disk, or network. HDFS generates and stores checksums for each data block of an HDFS file. Checksums are verified by the HDFS client while reading, to help detect any corruption caused either by the client, the DataNodes, or the network. When a client creates an HDFS file, it computes the checksum sequence for each block and sends it to a DataNode along with the data. A DataNode stores the checksums in a metadata file separate from the block's data file. When HDFS reads a file, each block's data and checksums are shipped to the client. The client computes the checksums for the received data and verifies that the newly computed checksums match the checksums it received. If not, the client notifies the NameNode of the corrupt replica and then fetches a different replica of the block from another DataNode.
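The checksum scheme can be sketched as follows. HDFS checksums fixed-size chunks of each block with CRC-32 (the chunk size is governed by the io.bytes.per.checksum parameter, 512 bytes by default); the sketch below follows that idea using Python's zlib.crc32, and is an illustration rather than Hadoop's actual implementation:

```python
import zlib

CHUNK = 512  # bytes per checksum, as in HDFS's default io.bytes.per.checksum

def checksums(block: bytes):
    """CRC-32 of each CHUNK-sized slice of the block."""
    return [zlib.crc32(block[i:i + CHUNK]) for i in range(0, len(block), CHUNK)]

def verify(block: bytes, stored):
    """Client-side check: recompute and compare against the stored checksums."""
    return checksums(block) == stored

block = b"x" * 1200
stored = checksums(block)             # computed at write time, kept by the DataNode
assert verify(block, stored)          # an intact replica passes
corrupted = b"y" + block[1:]
assert not verify(corrupted, stored)  # corruption detected; the client would then
                                      # fetch another replica and report this one
```

Storing the checksums in a separate metadata file, as described above, means a fault that damages the data file leaves the checksums intact to detect it.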
When a client opens a file to read, it fetches the list of blocks and the locations of each block replica from the NameNode. The locations of each block are ordered by their distance from the reader. When reading the content of a block, the client tries the closest replica first. If the read attempt fails, the client tries the next replica in sequence. A read may fail if the target DataNode is unavailable, the node no longer hosts a replica of the block, or the replica is found to be corrupt when the checksums are tested.

HDFS permits a client to read a file that is open for writing. When reading a file open for writing, the length of the last block, still being written, is unknown to the NameNode. In this case, the client asks one of the replicas for the latest length before starting to read its content.

The design of HDFS I/O is particularly optimized for batch processing systems, like MapReduce, which require high throughput for sequential reads and writes. However, much effort has been put into improving its read/write response time in order to support applications like Scribe, which provides real-time data streaming to HDFS, or HBase, which provides random, real-time access to large tables.

3.2.2 HEARTBEAT AND BLOCKREPORT

A DataNode identifies the block replicas in its possession to the NameNode by sending a block report. A block report contains the block ID, the generation stamp and the length of each block replica the server hosts. The first block report is sent immediately after the DataNode registration. Subsequent block reports are sent every hour and provide the NameNode with an up-to-date view of where block replicas are located on the cluster. During normal operation DataNodes send heartbeats to the NameNode to confirm that the DataNode is operating and the block replicas it hosts are available. The default heartbeat interval is three seconds. 
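The heartbeat and block report intervals just described are exposed as configuration parameters; these are the parameters varied in Scenario 8 of Chapter 6. A sketch of the relevant hdfs-site.xml fragment, shown with the default values mentioned in the text:

```xml
<!-- hdfs-site.xml fragment (defaults described above) -->
<configuration>
  <property>
    <name>dfs.heartbeat.interval</name>
    <value>3</value>                  <!-- seconds between DataNode heartbeats -->
  </property>
  <property>
    <name>dfs.blockreport.intervalMsec</name>
    <value>3600000</value>            <!-- one hour between block reports -->
  </property>
</configuration>
```

Shorter intervals give the NameNode a fresher view of the cluster at the cost of more NameNode load, which is exactly the trade-off examined experimentally in Chapter 6.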
If the NameNode does not receive a heartbeat from a DataNode within ten minutes, the NameNode considers the DataNode to be out of service and the block replicas hosted by that DataNode to be unavailable. The NameNode then schedules the creation of new replicas of those blocks on other DataNodes. Heartbeats[11][12] from a DataNode also carry information about the total storage capacity, the fraction of storage in use, and the number of data transfers currently in progress[1][2].

STAGING

A client request to create a file does not reach the NameNode immediately. Initially, the HDFS client caches the file data in a temporary local file, and application writes are transparently redirected to this temporary local file. When the local file accumulates data worth more than one HDFS block size, the client contacts the NameNode. The NameNode inserts the file name into the file system hierarchy and allocates a data block for it. The NameNode responds to the client request with the identity of the DataNode and the destination data block. The client then flushes the block of data from the local temporary file to the specified DataNode. When a file is closed, the
remaining un-flushed data in the temporary local file is transferred to the DataNode. The client then tells the NameNode that the file is closed. At this point, the NameNode commits the file creation operation into a persistent store. If the NameNode dies before the file is closed, the file is lost.

REPLICATION PIPELINING

When a client is writing data to an HDFS file, its data is first written to a local file, as explained in the previous section. Suppose the HDFS file has a replication factor of three. When the local file accumulates a full block of user data, the client retrieves a list of DataNodes from the NameNode. This list contains the DataNodes that will host a replica of that block. The client then flushes the data block to the first DataNode. The first DataNode starts receiving the data in small portions (4 KB), writes each portion to its local repository and transfers that portion to the second DataNode in the list. The second DataNode, in turn, starts receiving each portion of the data block, writes that portion to its repository and then flushes that portion to the third DataNode. Finally, the third DataNode writes the data to its local repository. Thus a DataNode can be receiving data from the previous node in the pipeline while simultaneously forwarding data to the next node in the pipeline; the data is pipelined from one DataNode to the next [3][5][7].

Figure 3.2 HDFS Replication Pipelining
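The pipelining idea above can be sketched with plain Python lists standing in for the DataNodes' local repositories. This is an illustrative model of the data flow only (real HDFS streams portions concurrently over the network; here the chain is walked sequentially per portion):

```python
PORTION = 4 * 1024  # the pipeline forwards the block in small portions (4 KB)

def pipeline_write(block: bytes, repositories: list):
    """Stream a block through a chain of DataNode 'repositories': for each
    portion, node k stores it locally and then hands it to node k+1, so every
    replica is built portion by portion instead of block by block."""
    for i in range(0, len(block), PORTION):
        portion = block[i:i + PORTION]
        for repo in repositories:  # store, then forward down the chain
            repo.append(portion)

dn1, dn2, dn3 = [], [], []          # three DataNodes for a replication factor of 3
block = bytes(10 * 1024)            # a 10 KB "block" splits into three portions
pipeline_write(block, [dn1, dn2, dn3])
assert b"".join(dn1) == b"".join(dn2) == b"".join(dn3) == block
assert len(dn1) == 3                # portions of 4 KB + 4 KB + 2 KB
```

The point of the real pipeline is that, unlike this sequential sketch, a DataNode receives portion i+1 while it is still forwarding portion i, so the total write time is close to that of a single transfer.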
DATA BLOCK PLACEMENT

HDFS nodes are spread across multiple racks. Nodes of a rack share a switch, and rack switches are connected by one or more core switches. Communication between two nodes in different racks has to go through multiple switches. In most cases, network bandwidth between nodes in the same rack is greater than network bandwidth between nodes in different racks.

Figure 3.3 HDFS Cluster

When a new block is created, HDFS places the first replica on the node where the writer is located, and the second and third replicas on two different nodes in a different rack. The remaining replicas are placed on random nodes, with the restrictions that no more than one replica is placed at any one node and no more than two replicas are placed in the same rack when the number of replicas is less than twice the number of racks.

REPLICA MANAGEMENT

The NameNode endeavors to ensure that each block always has the intended number of replicas. The NameNode detects that a block has become under- or over-replicated when a block report from a DataNode arrives. When a block becomes over-replicated, the NameNode chooses a replica to remove. The NameNode prefers, first, not to reduce the number of racks that host replicas and, second, to remove a replica from the DataNode with the least amount of available disk space. The goal is to balance storage utilization across DataNodes without reducing the block's availability. When a block becomes under-replicated, it is put into the replication priority queue. A block with only one replica has the highest priority, while a block with a number of replicas greater than two thirds of its replication factor has the lowest priority. A background thread periodically scans the head of the replication queue to decide where to place new replicas. Block replication follows a policy similar to that of new block placement. If the number of existing replicas is one, HDFS places the next replica on a different rack.
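The placement rules above can be sketched as a small selection function. This is a simplified model under stated assumptions, not the actual HDFS placement code: the topology map, node names, and the random tie-breaking are all illustrative.

```python
import random

def place_replicas(writer_node, topology, n_replicas, seed=0):
    """Pick DataNodes for a new block: replica 1 on the writer's node,
    replicas 2 and 3 on two nodes of one other rack, and any further replicas
    on random nodes, with at most one replica per node and at most two per
    rack. `topology` maps rack name -> list of node names."""
    rng = random.Random(seed)
    rack_of = {n: r for r, nodes in topology.items() for n in nodes}
    chosen = [writer_node]
    other_racks = [r for r in topology if r != rack_of[writer_node]]
    remote = rng.choice(other_racks)
    chosen += rng.sample(topology[remote], min(2, n_replicas - 1))
    while len(chosen) < n_replicas:
        cand = rng.choice([n for ns in topology.values() for n in ns])
        per_rack = sum(rack_of[c] == rack_of[cand] for c in chosen)
        if cand not in chosen and per_rack < 2:  # enforce the two placement limits
            chosen.append(cand)
    return chosen

topo = {"rackA": ["a1", "a2", "a3"], "rackB": ["b1", "b2", "b3"]}
placed = place_replicas("a1", topo, 3)
assert placed[0] == "a1"                             # first replica on the writer
assert {placed[1], placed[2]} <= set(topo["rackB"])  # second/third on the other rack
assert len(set(placed)) == 3                         # one replica per node
```

The design intent is visible in the sketch: one local replica keeps the write cheap, while the pair on a remote rack survives the loss of an entire rack.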
If the block has two existing replicas and they are on the same rack, the third replica is placed on a different
rack; otherwise, the third replica is placed on a different node in the same rack as an existing replica. Here the goal is to reduce the cost of creating new replicas.

BALANCER

The HDFS block placement strategy does not take DataNode disk space utilization into account. This avoids placing new data, which is more likely to be referenced, on a small subset of the DataNodes. As a result, data might not always be placed uniformly across DataNodes. Imbalance also occurs when new nodes are added to the cluster. The balancer is a tool that balances disk space usage on an HDFS cluster. It takes a threshold value as an input parameter, which is a fraction in the range (0, 1). A cluster is balanced if, for each DataNode, the utilization of the node (the ratio of used space at the node to the total capacity of the node) differs from the utilization of the whole cluster (the ratio of used space in the cluster to the total capacity of the cluster) by no more than the threshold value. The tool is deployed as an application program that can be run by the cluster administrator. It iteratively moves replicas from DataNodes with higher utilization to DataNodes with lower utilization. One key requirement for the balancer is to maintain data availability: when choosing a replica to move and deciding its destination, the balancer guarantees that the decision reduces neither the number of replicas nor the number of racks hosting them. The balancer optimizes the balancing process by minimizing inter-rack data copying. If the balancer decides that a replica A needs to be moved to a different rack and the destination rack happens to have a replica B of the same block, the data is copied from replica B instead of replica A. A second configuration parameter limits the bandwidth consumed by rebalancing operations.
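The balancedness criterion above translates directly into code. A minimal sketch (the node list and numbers are made up for illustration):

```python
def is_balanced(nodes, threshold):
    """A cluster is balanced when every DataNode's utilization (used/capacity)
    is within `threshold` of the overall cluster utilization.
    `nodes` is a list of (used, capacity) pairs; threshold is a fraction in (0, 1)."""
    cluster_util = sum(u for u, _ in nodes) / sum(c for _, c in nodes)
    return all(abs(u / c - cluster_util) <= threshold for u, c in nodes)

nodes = [(50, 100), (70, 100), (30, 100)]    # utilizations 0.5, 0.7, 0.3; cluster mean 0.5
assert is_balanced(nodes, threshold=0.25)    # every node is within 0.25 of 0.5
assert not is_balanced(nodes, threshold=0.1) # 0.7 and 0.3 are 0.2 away from the mean
```

The real balancer repeats a move-and-recheck loop against this criterion until the cluster passes or no safe move remains.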
The higher the allowed bandwidth, the faster a cluster can reach the balanced state, but at the cost of greater competition with application processes.

BLOCK SCANNER

Each DataNode runs a block scanner that periodically scans its block replicas and verifies that the stored checksums match the block data. If a client reads a complete block and checksum verification succeeds, it informs the DataNode, which treats this as a verification of the replica. Whenever a read client or a block scanner detects a corrupt block, it notifies the NameNode. The NameNode marks the replica as corrupt but does not schedule its deletion immediately. Instead, it starts to replicate a good copy of the block. Only when the good replica count reaches the replication factor of the block is the corrupt replica scheduled for removal [4][5][8].
DECOMMISSIONING

The cluster administrator specifies which nodes can join the cluster by listing the host addresses of nodes that are permitted to register and the host addresses of nodes that are not. The administrator can command the system to re-evaluate these include and exclude lists. A present member of the cluster that becomes excluded is marked for decommissioning. Once a DataNode is marked as decommissioning, it will not be selected as the target of replica placement, but it will continue to serve read requests. The NameNode starts to schedule replication of its blocks to other DataNodes. Once the NameNode detects that all blocks on the decommissioning DataNode have been replicated, the node enters the decommissioned state. It can then be safely removed from the cluster without jeopardizing data availability.

INTER-CLUSTER DATA COPY

When working with large datasets, copying data into and out of an HDFS cluster is daunting. HDFS provides a tool called DistCp for large inter/intra-cluster parallel copying. It is implemented as a MapReduce job; each map task copies a portion of the source data into the destination file system. The MapReduce framework automatically handles parallel task scheduling, error detection and recovery [5].
CHAPTER 4 SETTING UP THE HADOOP ENVIRONMENT Chapter Gist: This chapter describes the hadoop environment used in this experiment, i.e. the cluster, the number and type of nodes in the cluster, the hadoop version, the network bandwidth and the machine configuration.
4.1. Hadoop Configuration Hadoop consists of two components: a distributed file system and a Map-Reduce framework. The first component has two types of nodes, namely a single namenode and several datanodes. Data is stored in the datanodes, and metadata, i.e. the file system namespace, is stored in the namenode. The namenode also manages the replication factor of data blocks. There is a secondary namenode which keeps a copy of the namenode data and is used to restart the namenode in the event of failure. The second component has two processes, namely a Jobtracker and a separate Tasktracker for each datanode. The Jobtracker can be run on a dedicated node or on the namenode. The Tasktracker runs on each datanode. The Jobtracker schedules all jobs in the cluster. A job is split into several tasks which run on the datanodes. The Tasktracker is responsible for starting the scheduled tasks on the working nodes (i.e. the datanodes) and reporting progress to the Jobtracker. The Hadoop Distributed File System used in this work has the following configuration.

Hadoop Distributed File System
    Version:
    Non-default parameters:
        hdfs-site.xml:  dfs.replication
        slaves:         Node03-08
        masters:        Node01

Java
    Version:
    Description:  Java(TM) SE Runtime Environment (build b105)
                  Java HotSpot(TM) Client VM (build b105, mixed mode, sharing)

Table 4.1 Hadoop Configuration

4.2. Node Configuration The Hadoop Distributed File System was set up on 13 nodes. The nodes comprise a dedicated Namenode, a dedicated Jobtracker, a maximum of 6 datanodes and 5 clients. The following table shows the node configuration of the Hadoop environment.
Node         Configuration
Namenode     Single dedicated node
Datanode     Number of datanodes varies; maximum 6 datanodes
Jobtracker   Single dedicated node
Tasktracker  Runs on the datanodes
Client       Number of clients varies; maximum 5 clients

Table 4.2 Node Configuration

FIGURE 4.1 NODE CONFIGURATIONS
4.3. System Configuration The configuration details specifying the nodes, their hardware, operating system and network configuration are listed in Table 4.3.

Hardware Configuration
    Processor:       Intel Pentium Dual Core 2 CPU, 3.00 GHz (2.99 GHz reported)
    RAM:             504 MB (minimum)
    Hard Disk:       40 GB (minimum)

Operating System
    OS Type:         Linux
    OS Version:      Linux itl ELsmp #1 SMP Wed Jan 5 19:30:39 EST 2005 i686 i686 i386 GNU/Linux
    LSB Version:     1.3
    Distributor ID:  RedHatEnterpriseAS
    Description:     Red Hat Enterprise Linux AS release 4 (Nahant)
    Release:         4
    Codename:        Nahant

Network Configuration
    GCC Version:     (Red Hat EL4)
    LAN Bandwidth:   Ethernet, 100 Mbps

Table 4.3 System Configuration
CHAPTER 5 HADOOP PERFORMANCE TUNING PARAMETERS Chapter Gist: This chapter discusses the performance tuning parameters (a subset of the hadoop configuration parameters).
Hadoop Core is designed for running jobs that have large input data sets and medium to large outputs, on large sets of dissimilar machines. The framework has been heavily optimized for this use case: Hadoop Core is optimized for clusters of heterogeneous machines that are not highly reliable, and the HDFS file system is optimized for small numbers of very large files that are accessed sequentially. The Hadoop file system provides several tunable parameters, i.e. the performance of HDFS, along with its map-reduce framework, can be improved by optimizing these parameters. Different tunable parameters affect different components of the hadoop distributed file system; a few of them are discussed below.

CLUSTER-LEVEL TUNABLE PARAMETERS

The cluster-level tunable parameters[3][4] require a cluster restart to take effect. Some of them may require a restart of the HDFS portion of the cluster; others may require a restart of the MapReduce portion. These parameters take effect only when the relevant server starts.

SERVER-LEVEL PARAMETERS

The server-level parameters, shown in Table 5.1, affect the basic behavior of the servers. In general, they control the number of worker threads, which may improve the general responsiveness of the servers at the cost of increased CPU and memory use. The variables are generally configured by setting their values in the conf/hadoop-site.xml file. It is also possible to set them via command-line options for the servers, either in the conf/hadoop-env.sh file or by setting environment variables (as is done in conf/hadoop-env.sh). The nofile parameter is not a Hadoop configuration parameter but an operating system parameter. For users of the bash shell, it may be set or examined via the command ulimit -n [value to set]. Quite often, the operating-system-imposed limit is too low, and the administrator must increase it to a value that is a safe minimum for medium-size busy clusters.
Parameter                      Description
dfs.datanode.handler.count     The number of threads servicing DataNode block requests
dfs.namenode.handler.count     The number of threads servicing NameNode requests
tasktracker.http.threads       The number of threads for serving map output files to reduce tasks
ipc.server.listen.queue.size   The number of incoming network connections that may queue for a server
nofile                         The limit on the number of file descriptors a process can open (alter /etc/security/limits.conf on Linux machines). Default: 1024

Table 5.1 Server-Level Tuning Parameters

HDFS TUNABLE PARAMETERS

The most commonly tuned parameter for HDFS is the file system block size. The default block size is 64MB, specified in bytes in dfs.block.size. The larger this value, the fewer individual blocks will be stored on the DataNodes, and the larger the input splits will be. The DataNodes, at least through the Hadoop versions current at the time of writing, have a limit on the number of blocks that can be stored, which appears to be roughly 500,000 blocks. Beyond this size, the DataNode starts to drop in and out of the cluster, and if enough DataNodes have this problem, HDFS performance tends toward a full stop. When computing the number of tasks for a job, a task is created per input split, and input splits are created one per block of each input file by default. There is a maximum rate at which the JobTracker can start tasks: the more tasks there are to execute, the longer the JobTracker takes to schedule them, and the longer the TaskTrackers take to set up and tear down the tasks. The other reason for increasing the block size is that on modern machines, an I/O-bound task will read 64MB of data in a small number of seconds, so a small block size makes the ratio of task overhead to task runtime very large. A downside to increasing this value is that it sets the minimum amount of I/O that must be done to access a single record. If your access patterns are not linearly reading large chunks of data from the file, having a large block size will greatly increase the disk and network load required to service your I/O. The DataNode and NameNode parameters are presented in the following table.

Parameter                      Description
fs.default.name                The URI of the shared file system. This should be hdfs://namenodehostname:port
fs.trash.interval              The interval between trash checkpoints. If 0, the trash feature is disabled. The trash is used only for deletions done via the hadoop dfs -rm series of commands.
dfs.hosts                      The full path to a file containing the list of hostnames that are allowed to connect to the NameNode. If specified, only the hosts in this file are permitted to connect to the NameNode.
dfs.hosts.exclude              The path to a file containing a list of hosts to blacklist from the NameNode. If the file does not exist, no hosts are blacklisted. If a set of DataNode hostnames is added to this file while the NameNode is running and the command hadoop dfsadmin -refreshNodes is executed, the DataNodes listed will be decommissioned. Any blocks stored on them will be redistributed to other nodes in the cluster such that the default replication for the blocks is satisfied. It is best to have this point to an empty file that exists, so that DataNodes may be decommissioned as needed.
dfs.namenode.decommission.interval   The interval in seconds at which the NameNode checks whether a DataNode decommission has finished.
dfs.replication.interval       The period in seconds at which the NameNode computes the list of blocks needing replication.
dfs.access.time.precision      The precision in msec with which access times are maintained. If this value is 0, no access times are maintained. Setting this to 0 may increase performance on busy clusters where the bottleneck is the namenode edit log write speed.
dfs.max.objects                The maximum number of files, directories and blocks permitted.
dfs.replication                The number of replicas of each block stored in the cluster. Larger values allow more DataNodes to fail before blocks become unavailable, but increase the amount of network I/O required to store data and the disk space requirements. Larger values also increase the likelihood that a map task will have a local replica of its input split.
dfs.block.size                 The basic block size for the file system. This may be too small or too large for your cluster, depending on your job data access patterns.
dfs.datanode.handler.count     The number of threads handling block requests. Increasing this may increase DataNode throughput, particularly if the DataNode uses multiple separate physical devices for block storage.
dfs.replication.considerload   Consider the DataNode load when picking replication locations. Default: true
dfs.datanode.du.reserved       The amount of space that must be kept free in each location used for block storage. Default: 0.0
dfs.permissions                Whether permission checking is enabled for file access. Default: true
dfs.df.interval                The interval between disk usage statistics collections, in msec.
dfs.blockreport.intervalMsec   The amount of time between block reports. The block report scans every block stored on the DataNode and reports this information to the NameNode. This report blocks the DataNode from servicing requests while it runs, and is the cause of the congestion collapse of HDFS when more than 500,000 blocks are stored on a DataNode.
dfs.heartbeat.interval         The heartbeat interval with the NameNode.
dfs.namenode.handler.count     The number of server threads for the NameNode. This is commonly greatly increased in busy and large clusters. Default: 10
dfs.name.dir                   The location where the namenode metadata storage is kept. This may be a comma-separated list of directories; a copy is kept in each location, and writes to the locations are synchronous. If this data is lost, your entire HDFS data set is lost, so keep multiple copies on multiple machines. Default: ${hadoop.tmp.dir}/dfs/name, in /tmp by default
dfs.name.edits.dir             The location where metadata edits are synchronously written. This may be a comma-separated list of directories; ideally, it should hold multiple locations on separate physical devices. If this is lost, your last few minutes of changes are lost. Default: ${dfs.name.dir}
dfs.data.dir                   The comma-separated list of directories to use for block storage. The list is used in round-robin fashion for storing new data blocks. The locations should be on separate physical devices. Using multiple physical devices yields roughly 50% better performance than RAID 0 striping. Default: ${hadoop.tmp.dir}/dfs/data
dfs.safemode.threshold.pct     The percentage of blocks that must be minimally replicated before HDFS will start accepting write requests. This condition is examined only at HDFS startup.
dfs.balance.bandwidthPerSec    The amount of bandwidth, in bytes per second, that may be used to rebalance block storage among DataNodes.

Table 5.2 HDFS Tunable Parameters
CHAPTER 6 PERFORMANCE RESULTS AND ANALYSIS Chapter Gist: This chapter analyzes the parameters of Chapter 5 with respect to performance, showing the curve or graph that each parameter follows along with the experimental data (i.e. the measurements).
6.1. SCENARIO 1: EFFECT OF MULTIPLE CLIENTS

No of Clients   Operation        Size of the Transferred File (MB)   Estimated Time
1               hadoop fs -put   500                                 56 sec
2               hadoop fs -put   500                                 1 min 33 sec
3               hadoop fs -put   500                                 2 min 14 sec
4               hadoop fs -put   500                                 2 min 54 sec
5               hadoop fs -put   500                                 3 min 36 sec

Table 6.1 Performance Results in Scenario 1

PERFORMANCE ANALYSIS

The above test result highlights the fact that as the number of clients increases, the time required to write the data increases. Here, the clients perform the write operation (copying a 500 MB file from the local file system to the hadoop distributed file system) concurrently. The estimated time grows roughly as a straight line as the number of clients increases, as shown below in Figure 6.1. One can imagine how long a simple write operation would take when the number of clients approaches 100 or more. So, in order to maintain an acceptable access time, the data blocks must be spread over the network uniformly (i.e. at almost equal distance, in terms of time, from each block replica's requesters). It can also be shown that the increase in access time does not depend on the operation performed (i.e. the read time of a file in the hadoop distributed file system shows the same effect as the write time). There are several options for increasing the uniform availability of data over the cluster, such as increasing the replication factor, reducing the block size, and increasing the number of server and datanode-level threads.
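The "straight line" claim can be checked numerically with an ordinary least-squares fit over the measured times in Table 6.1. The extrapolation at the end is a naive linear projection, offered only as a rough illustration of why 100 concurrent clients would be problematic:

```python
# Measured write times from Table 6.1: concurrent clients, each copying
# a 500 MB file into HDFS with `hadoop fs -put`.
clients = [1, 2, 3, 4, 5]
seconds = [56, 93, 134, 174, 216]  # 56 s, 1:33, 2:14, 2:54, 3:36

# Ordinary least-squares fit of time = a + b * clients.
n = len(clients)
mx = sum(clients) / n
my = sum(seconds) / n
slope = sum((x - mx) * (y - my) for x, y in zip(clients, seconds)) / \
        sum((x - mx) ** 2 for x in clients)
intercept = my - slope * mx

assert abs(slope - 40.1) < 0.01        # roughly 40 extra seconds per additional client
predicted_100 = intercept + slope * 100  # naive extrapolation to 100 clients
assert predicted_100 > 3600              # over an hour for a single 500 MB write
```

The tight fit (each step adds 37 to 42 seconds) supports reading the curve in Figure 6.1 as linear over the measured range; the extrapolation assumes no additional contention effects, so the real figure at 100 clients would likely be worse.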
Figure 6.1 Effects of Multiple Clients

6.2. SCENARIO 2: EFFECT OF REPLICATION FACTOR (REPLICATION FACTOR < NO. OF AVAILABLE DATANODES)

No of Datanodes   Replication Factor   Estimated Time
1                 1                    2 min 7 sec
3                 3                    2 min 33 sec

Table 6.2 Effects of Replication Factor for Write Operation in Scenario 2

No of Datanodes   Replication Factor   Estimated Time
1                 1                    3 min 7 sec
3                 3                    2 min 56 sec

Table 6.3 Effects of Replication Factor for Read Operation in Scenario 2
PERFORMANCE ANALYSIS

The above performance results show the effect of the replication factor when it does not exceed the number of available datanodes in the cluster. There are two cases: Table 6.2 shows the effect for the write operation and Table 6.3 for the read operation. In the first case, the estimated time of the write operation increases from 2 min 7 sec to 2 min 33 sec as the replication factor, along with the number of datanodes, increases from 1 to 3, as shown in Table 6.2. This produces a straight line showing the increase in write time as the replication factor increases (shown in Figure 6.2).

Figure 6.2 Effects of increasing the Replication Factor (write operation)

In the second case, the estimated time is reduced from 3 min 7 sec to 2 min 56 sec as the replication factor, along with the number of available working nodes, increases, as shown in Table 6.3. The effect of this reduction is shown in Figure 6.3, again as a straight line. This clearly reveals that the read time of data can be reduced by increasing the replication factor, i.e. by having more copies of the data spread over the cluster. It is equally evident from the above result that the write time cannot be improved this way, because an increase in the replication factor entails the extra time required to write the extra copies of data to the datanodes.
Figure 6.3 Effects of increasing the Replication Factor (read operation)

6.3. SCENARIO 3: EFFECT OF REPLICATION FACTOR (REPLICATION FACTOR > NO. OF AVAILABLE DATANODES)

No of Datanodes   Replication Factor   Estimated Time
3                 3                    2 min 33 sec
3                 6                    2 min 46 sec

Table 6.4 Effects of Replication Factor for Write Operation in Scenario 3

No of Datanodes   Replication Factor   Estimated Time
3                 3                    2 min 56 sec
3                 6                    2 min 57 sec

Table 6.5 Effects of Replication Factor for Read Operation in Scenario 3
PERFORMANCE ANALYSIS

The above performance results show the effect when the replication factor goes beyond the number of datanodes available in the cluster. There are again two cases to consider: the write operation, shown in Table 6.4, and the read operation, shown in Table 6.5. In the first case, the estimated time of the write operation increases from 2 min 33 sec to 2 min 46 sec as the replication factor increases from 3 to 6, as shown in Table 6.4. In this scenario the number of available working nodes in the cluster remains 3 when the replication factor becomes 6. Figure 6.4 shows a straight line indicating the effect on write time in this situation. The reason for this increase in write time is exactly the same as in scenario 2: if the replication factor increases, the number of replicas that must be kept increases, and extra time is required to write those replicas. Hence, the conclusion that can be drawn from this test is that an increase in the replication factor results in an increase in the time required to write a file in the distributed file system, and this increase is independent of the number of datanodes running in the cluster.

Figure 6.4 Replication Factor beyond available datanodes (write operation)

In the second case (i.e. the read operation), the read time remains almost constant, as shown in Table 6.5: it goes from 2 min 56 sec to 2 min 57 sec as the replication factor increases from 3 to 6. In this case too, the number of available datanodes running in the cluster is kept constant at 3. So this situation clearly reveals that there is no gain in increasing the replication factor beyond the number of datanodes available. The effect on the read operation is shown in Figure 6.5.
Let's take a closer look at the situation. When the replication factor goes beyond the number of datanodes available in the cluster, each datanode will obviously store more than one replica of each data block. This leads to a situation where each datanode is able to serve each request from a separate block replica. But the time required to serve those requests does not improve, because the requests are still queued up at the local node for service. So it really does not matter whether each request has access to a common block replica or to separate block replicas.

Figure 6.5 Replication Factor beyond available datanodes (read operation)

However, there is an advantage in the case of datablock corruption. Each datablock in the hadoop distributed file system is associated with a separate checksum. If a datablock becomes corrupted, access can still be granted to the requesting process with no delay, since multiple block replicas exist on each datanode. But the chances of data block corruption are very rare, so this advantage is costly compared to the disadvantages it brings.
6.4. SCENARIO 4: EFFECT OF BLOCK SIZE (DFS.BLOCK.SIZE)

PERFORMANCE RESULTS

Block Size   Estimated Time
256M         1 min 49.5 sec
128M         1 min 47 sec
64M          1 min 45.5 sec
32M          1 min 39 sec
16M          1 min 38.5 sec
8M           2 min 3.5 sec

Table 6.6 Effects of Block Size on Write Operation

Block Size   Estimated Time
256M         2 min 20 sec
128M         2 min 18 sec
64M          2 min 16 sec
32M          1 min 51 sec
16M          1 min 59 sec
8M           2 min 20 sec

Table 6.7 Effects of Block Size on Read Operation

PERFORMANCE ANALYSIS

The above performance results highlight the effect of the block size on distributed file system performance. The default value of dfs.block.size (in conf/hdfs-site.xml of the hadoop distribution) is 64M, which gives an estimated time of 1 min 45.5 sec for the write operation and 2 min 16 sec for the read operation. As usual, we consider the read and write operations separately: Table 6.6 shows the effect for the write operation and Table 6.7 for the read operation.

For the write operation, the fifth row of Table 6.6, i.e. the operation performed with a custom block size of 16M, takes the minimum time. If we increase the block size beyond 16M, the estimated time increases slowly, following a straight line. But if we decrease the block size below 16M, the time increases rapidly, in an exponential manner. This produces the curve shown below in Figure 6.6. The curve shows a region from 10M to 20M where the measured time goes below 100 seconds, and the minimum time lies in this range.

Figure 6.6 Effects of dfs.block.size on Write Operation

For the read operation, the fourth row of Table 6.7, i.e. the operation performed with a custom block size of 32M, takes the minimum time. If we increase the block size beyond 32M up to 64M, the estimated time increases, and further increases do not seem to produce a good result either. If we decrease the block size below 32M, the time tends to increase linearly. This produces the curve shown below in Figure 6.7. If the block size is reduced, the number of available datablocks in the cluster increases, and thus the uniform availability of the datablocks also increases. But when the size of a datablock is too small, the overhead of keeping track of a large number of small datablocks and their associated checksums produces a huge traffic delay, reducing the overall performance of the distributed file system. The curve shows a region from 20M to 40M where the measured time goes below 120 seconds, and the minimum time lies in this range.
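Chapter 5 noted that, by default, one input split (and hence one map task) is created per block, so the block size also sets a job's task count. That arithmetic is worth sketching, since it is the other half of the block-size trade-off measured in this scenario (the 10 GB file here is a hypothetical example, not one of the measured transfers):

```python
import math

def num_map_tasks(file_size_bytes: int, block_size_bytes: int) -> int:
    """One input split per block, one map task per split (the defaults),
    so the task count is just the block count, rounded up."""
    return math.ceil(file_size_bytes / block_size_bytes)

GB = 1024 ** 3
MB = 1024 ** 2
assert num_map_tasks(10 * GB, 64 * MB) == 160   # 10 GB at the 64M default
assert num_map_tasks(10 * GB, 128 * MB) == 80   # doubling the block size halves the tasks
assert num_map_tasks(10 * GB, 8 * MB) == 1280   # small blocks multiply scheduling overhead
```

This is why the measured times climb steeply below 16M: the per-task and per-block bookkeeping overhead grows inversely with the block size, while the useful work per task shrinks.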
Figure 6.7 Effect of dfs.block.size on Read Operation

On the whole, the minimum estimated time is found for block sizes between 10M and 20M for the write operation, and between 20M and 40M for the read operation.

6.5. SCENARIO 5: EFFECT OF IO BUFFER SIZE (IO.FILE.BUFFER.SIZE)

PERFORMANCE RESULTS

The experiment was carried out with the default block size (i.e. dfs.block.size=64M), the default replication factor (i.e. dfs.replication=3) and a 400MB file transfer. The parameter examined here is io.file.buffer.size (in conf/core-site.xml). Table 6.8 shows the results for the write operation and Table 6.9 for the read operation.

Buffer Size in Bytes   Estimated Time
                       1 min 53 sec
32768                  1 min sec
                       1 min 52 sec
                       1 min sec
                       1 min sec

Table 6.8 Effects of IO Buffer Size on Write Operation
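The mechanical trade-off that io.file.buffer.size controls can be illustrated with a toy copy loop. This is pure Python, not Hadoop code: it only shows why a larger buffer means fewer I/O calls per stream, at the cost of more memory held per open stream.

```python
import io
import os

def buffered_copy(src: bytes, buffer_size: int):
    """Copy a byte stream through a fixed-size buffer, counting the read
    calls needed: smaller buffers mean more calls (more per-call overhead),
    larger buffers mean more memory per stream."""
    reader = io.BytesIO(src)
    out = io.BytesIO()
    calls = 0
    while True:
        chunk = reader.read(buffer_size)
        if not chunk:
            break
        out.write(chunk)
        calls += 1
    return out.getvalue(), calls

data = os.urandom(1024 * 1024)  # 1 MB of test data
copy_4k, calls_4k = buffered_copy(data, 4 * 1024)
copy_64k, calls_64k = buffered_copy(data, 64 * 1024)
assert copy_4k == copy_64k == data          # the buffer size never changes the data
assert calls_4k == 256 and calls_64k == 16  # 16x fewer calls with a 64 KB buffer
```

The measurements in Tables 6.8 and 6.9 show the same shape in wall-clock time: call overhead dominates at small buffers, while very large buffers stop paying off once per-call overhead is negligible.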
Buffer Size in Bytes   Estimated Time
                       min sec
                       min sec
                       2 min 23 sec
                       2 min 33 sec
                       2 min 38 sec

Table 6.9 Effect of IO Buffer Size on Read Operation

PERFORMANCE ANALYSIS

The second row of Table 6.8 shows the minimum estimated time, obtained where the IO buffer size is 32KB. Either increasing or decreasing the buffer size increases the estimated time for the write operation, though the improvement is small. Figure 6.8 shows the effect of the IO buffer size on the write operation; the estimated time is at its minimum in the region of 25K to 65K, as shown in the figure.

Figure 6.8 Effect of IO Buffer Size on Write Operation

Table 6.9 highlights the performance results for the read operation. The third row of Table 6.9 shows the minimum estimated time of 2 min 23 sec. Either increasing or decreasing the buffer size produces a noticeable increase in the estimated time. Figure 6.9 showcases the
effect of io.file.buffer.size in case of the read operation. The estimated time is minimum in the region of 50K to 80K, as shown in Figure 6.9, where the estimated time goes below 144 seconds.

Figure 6.9 Effect of io.file.buffer.size on read operation

In conclusion, the parameter io.file.buffer.size sets the buffer size for I/O (read/write) operations on sequence files stored on disk, i.e. it determines how much data is buffered in I/O pipes before being transferred to the next operation during reads and writes. It should be a multiple of the OS file system block size (normally 4KB). It has a significant effect on performance; a value that is too low or too high causes performance problems. Values of 32KB, 64KB and 128KB normally work well.

SCENARIO 6: EFFECT OF DFS.ACCESS.TIME.PRECISION

PERFORMANCE RESULTS

HDFS supports statistics that allow an administrator to determine when a file was last accessed. By default, access times are precise to the most recent hour boundary. The configuration parameter dfs.access.time.precision (in milliseconds) is used to control this precision. Setting a value of 0 disables persisting access times for HDFS files. The following experiment was carried out with the default values of replication factor, block size and IO buffer size, and a file size of 400MB. The involved parameter dfs.access.time.precision (in conf/hdfs-site.xml) has a default value of 3600000 msec (one hour).
dfs.access.time.precision    Estimated time
0                            2 min 16 sec
…                            … min 25 sec
…                            … min 23 sec

Table 6.10 Effect of dfs.access.time.precision on write operation

dfs.access.time.precision    Estimated time
0                            2 min 28 sec
…                            … min 33 sec
…                            … min 30 sec

Table 6.11 Effect of dfs.access.time.precision on read operation

PERFORMANCE ANALYSIS

The measured estimated times show that the improvement is greatest when dfs.access.time.precision has a value of 0. Figure 6.10 shows the corresponding graph for Table 6.10 and Figure 6.11 the graph for Table 6.11.

Figure 6.10 Effect of dfs.access.time.precision on write operation
Figure 6.11 Effect of dfs.access.time.precision on read operation

Access times are maintained for every file operation, i.e. a transaction log is written for every file operation. The Hadoop distributed file system breaks files into blocks of fixed size, so for each block access a transaction log is written. This evidently is a serious performance killer. Setting a value of zero, which disables the maintenance of access times, is therefore a good way to improve performance.

SCENARIO 7: EFFECT OF DFS.REPLICATION.INTERVAL

PERFORMANCE RESULTS

The following experiment was carried out with a 400 MB file transfer; the only varying parameter is dfs.replication.interval in conf/hdfs-site.xml, which is the period in seconds with which the Namenode computes the list of blocks needing replication.

dfs.replication.interval    Estimated Time
3 msec                      2 min 20 sec
1000 msec                   2 min 10 sec
6000 msec                   2 min 5 sec

Table 6.12 Effect of dfs.replication.interval on write operation
dfs.replication.interval    Estimated Time
3 msec                      2 min 35 sec
1000 msec                   2 min 35 sec
6000 msec                   2 min 34 sec

Table 6.13 Effect of dfs.replication.interval on read operation

PERFORMANCE ANALYSIS

The estimated time measurements in Table 6.12 show that increasing dfs.replication.interval reduces the write time, as shown in Figure 6.12.

Figure 6.12 Effect of dfs.replication.interval on write operation

dfs.replication.interval is the period at which the Namenode checks whether any datablock needs replication. Evidently, setting a large period or interval reduces the network traffic and thus reduces the write time. The reduction almost follows a straight line in the range of 0 to 1000 due to the reduction in network load. Further increases in the interval have only a small effect on the estimated time, because the additional reduction in traffic is very small compared to the traffic generated by the write operation itself.

The estimated time measurements in Table 6.13 are plotted in Figure 6.13. dfs.replication.interval has very little effect on the read time compared to the write time. The main reason is that replication pipelining takes place during the write operation, so replication traffic affects the write time far more than the read time.
Figure 6.13 Effect of dfs.replication.interval on read operation

6.8. EFFECT OF HEARTBEAT AND BLOCKREPORT INTERVALS

PERFORMANCE RESULTS

The following experiment was carried out with a 400MB file transfer, with all parameters at their default values except dfs.heartbeat.interval and dfs.blockreport.intervalmsec.

dfs.heartbeat.interval    Estimated Time
3 msec                    2 min … sec
1000 msec                 2 min 6.67 sec
6000 msec                 2 min 3 sec

Table 6.14 Effect of heartbeat interval on write operation

dfs.heartbeat.interval    Estimated Time
3 msec                    2 min 37 sec
1000 msec                 2 min 34 sec
6000 msec                 2 min 34 sec

Table 6.15 Effect of heartbeat interval on read operation
dfs.blockreport.intervalmsec    Estimated Time
3 msec                          3 min 1 sec
1000 msec                       2 min 24 sec
6000 msec                       2 min 5 sec

Table 6.16 Effect of blockreport interval on write operation

dfs.blockreport.intervalmsec    Estimated Time
3 msec                          2 min 43 sec
1000 msec                       2 min 39 sec
6000 msec                       2 min 37 sec

Table 6.17 Effect of blockreport interval on read operation

PERFORMANCE ANALYSIS

The effect of the heartbeat interval is shown in Figure 6.14 for the write operation and in Figure 6.15 for the read operation, corresponding to Tables 6.14 and 6.15. The effect of the blockreport interval is shown in Figure 6.16 (for the write operation) and Figure 6.17 (for the read operation), corresponding to Tables 6.16 and 6.17.

Figure 6.14 Effect of heartbeat interval on write operation
Figure 6.15 Effect of heartbeat interval on read operation

dfs.heartbeat.interval and dfs.blockreport.intervalmsec affect performance in larger clusters. Datanodes send a message to the namenode saying they are still alive every dfs.heartbeat.interval seconds, and after dfs.namenode.stale.datanode.interval milliseconds without a heartbeat, the namenode will mark that datanode as stale. Similarly, each datanode sends a list of all the blocks it holds every dfs.blockreport.intervalmsec milliseconds. For a cluster of 30 machines, this means the namenode receives a heartbeat, on average, every 0.1 seconds, and a block report every 6 minutes, which should be a negligible load and worth the extra reliability.

Figure 6.16 Effect of blockreport interval on write operation
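The heartbeat arithmetic above can be checked with a one-line model. The 30-node cluster and the 3-second default interval come from the text; the helper name is ours:

```python
def avg_arrival_gap(interval_s, num_datanodes):
    # If every datanode sends one message per interval, the namenode
    # receives num_datanodes messages per interval, so the average gap
    # between arrivals is interval / num_datanodes.
    return interval_s / num_datanodes

# 30 datanodes, default 3-second heartbeat interval:
print(avg_arrival_gap(3, 30))  # 0.1 seconds between heartbeats on average
```

The same formula applied to the blockreport interval gives the average spacing of block reports at the namenode.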
Figure 6.17 Effect of blockreport interval on read operation

The heartbeat and the blockreport thus produce high traffic delay, and having a large interval reduces the traffic, thereby reducing the write time and read time of a file. A close observation of these four graphs shows that they follow a common pattern in which the reduction in estimated time is large in the range 0 to 1000, as in the case of dfs.replication.interval. Further increases of these parameters result in only a very small improvement in the performance of the distributed file system. Since these intervals control major functions of the distributed file system, one must also be cautious while increasing them, as this may lead to improper functioning of the file system.

EFFECT OF SERVER AND BLOCK LEVEL THREADS

PERFORMANCE RESULTS

dfs.namenode.handler.count    Estimated Time
10                            2 min 7 sec
…                             … min 57 sec
…                             … min 54 sec

Table 6.18 Effect of server level threads on write operation
dfs.namenode.handler.count    Estimated Time
3                             2 min 31 sec
…                             … min 25 sec
…                             … min 21 sec

Table 6.19 Effect of server level threads on read operation

dfs.datanode.handler.count    Estimated Time
3                             1 min 26 sec
10                            1 min 19 sec
…                             … min 14 sec

Table 6.20 Effect of block level threads on write operation

dfs.datanode.handler.count    Estimated Time
3                             2 min 10 sec
10                            2 min 7 sec
…                             … min 6.67 sec

Table 6.21 Effect of block level threads on read operation

The above experiment was carried out with a 400MB file transfer, with all parameters at their default values except dfs.namenode.handler.count and dfs.datanode.handler.count. dfs.namenode.handler.count indicates the number of server level threads, i.e. namenode threads, and dfs.datanode.handler.count indicates the number of block level threads, i.e. threads in the datanodes.

PERFORMANCE ANALYSIS

dfs.namenode.handler.count and dfs.datanode.handler.count control how many concurrent threads the namenode and the datanodes have to handle incoming requests. The default values should be fine for smaller clusters, but with a lot of simultaneous HDFS operations, one can obtain performance gains by increasing these numbers. Figures 6.18, 6.19, 6.20 and 6.21 show the corresponding graphs for Tables 6.18, 6.19, 6.20 and 6.21 respectively.
Figure 6.18 Effect of server level threads on write operation

A close observation of these graphs shows that the estimated time decreases rapidly in the beginning (i.e. 0 to 200 for server level threads and 0 to 40 for datablock level threads), but beyond that the improvement is much less effective. The reason is that a server with a large number of threads can handle a large number of concurrent operations (including system operations such as receiving heartbeat and blockreport signals). But if the number of threads is greater than the maximum number of concurrent operations performed by the server or datanode, the additional threads have little effect on the performance of the distributed file system.

Figure 6.19 Effect of server level threads on read operation
Figure 6.20 Effect of block level threads on write operation

The improvement in the performance of the distributed file system is evident for both operations (i.e. the read operation and the write operation). To gain performance by increasing the number of server and datablock level threads, one has to make sure the machines have memory to spare and also adjust heap sizes accordingly.

Figure 6.21 Effect of block level threads on read operation
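The parameters examined in this chapter all live in conf/hdfs-site.xml (with io.file.buffer.size in conf/core-site.xml). A sketch of a tuned hdfs-site.xml combining the better-performing settings from the scenarios above; the concrete values are illustrative picks from the measured ranges, not universal recommendations:

```xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>dfs.block.size</name>
    <value>16777216</value> <!-- 16M, inside the 10M-20M write optimum -->
  </property>
  <property>
    <name>dfs.access.time.precision</name>
    <value>0</value> <!-- disable access-time logging -->
  </property>
  <property>
    <name>dfs.namenode.handler.count</name>
    <value>100</value> <!-- illustrative; raise only with memory to spare -->
  </property>
  <property>
    <name>dfs.datanode.handler.count</name>
    <value>10</value> <!-- illustrative -->
  </property>
</configuration>
```

As noted above, larger intervals and thread counts trade operational safety and memory for speed, so any such values should be validated on the target cluster.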
CHAPTER 7

CONCLUSION

Chapter Gist: This chapter provides the conclusion, specifying which parameters play a key role in the performance improvement of Hadoop and which can be ignored.
7.1. CONCLUSION

Whereas even a few years ago a terabyte was seen as a large amount of data, today individual companies are dealing with petabytes of data:

Over 100,000 hours of video per day are uploaded to YouTube, translating to 360 terabytes every day.
500 terabytes of new data per day are ingested into Facebook databases.
Even the sensors of a Boeing jet engine create 20 terabytes of data every hour.
IBM is building the largest data storage array ever: 120 petabytes.

Performance is going to be one of the major issues when dealing with such a large amount of data. Every approach towards better performance has its own advantages and disadvantages, and the approach of improving the performance of a distributed file system using tunable configuration parameters is no exception. There are quite a few interesting factors which stand out among all. They are given below:

1. The effective time required to write a file increases as the replication factor of the distributed file system increases, and this time is independent of the available working nodes in the cluster.

2. The time required to perform data manipulation operations (such as data retrieval, deletion, modification etc.) reduces as the replication factor increases, as long as the replication factor doesn't go beyond the available working nodes in the cluster.

3. There is no improvement in performance for either the read or the write operation if the replication factor goes beyond the available working nodes in the cluster.

4. The write time and read time of a distributed file system strongly depend on the block size used by the file system. A block size that is too large or too small hurts performance. Suitable values of the block size lie in the range of 16M to 32M.

5. The buffer size for I/O (read/write) operations also has a significant effect on performance; a value that is too low or too high causes performance problems. Values of 32KB, 64KB and 128KB normally work well.

6.
Maintaining access time by writing a transaction log for each file access is a serious performance killer. So, disabling access times by setting a zero value for dfs.access.time.precision increases the performance. In order to still maintain access times, one could instead:

a. Maintain an access bit associated with each file rather than a transaction log
b. Record the timestamp in namenode memory when an "open" occurs

7. The distributed file system receives a few periodic signals such as heartbeat, block report, disk usage statistics, datanode decommission checking, replication checking etc. We can improve the performance of the file system by extending the period of these checks, but too large a period can cause serious damage to the operational aspects of the file system. The parameters include:

a. dfs.replication.interval, to check the list of blocks needing replication
b. dfs.blockreport.intervalmsec, to receive block reports
c. dfs.heartbeat.interval, to receive heartbeats
d. dfs.namenode.decommission.interval, to check datanode decommissioning through dfs.hosts.exclude
e. dfs.df.interval, to check disk usage statistics

8. dfs.namenode.handler.count and dfs.datanode.handler.count control how many concurrent threads the server and the datanodes have to handle incoming requests. The default values should be fine for smaller clusters, but with a lot of simultaneous HDFS operations, one can obtain performance gains by increasing these numbers.

While the above mentioned approaches increase the performance of a distributed file system, there are some disadvantages to using this approach:

1. Since access times are disabled when dfs.access.time.precision equals zero, it is no longer possible to identify recently accessed files. Further, the facility to delete files that haven't been accessed for a long time is lost.

2. Secondly, the above mentioned parameters must be used with caution. For example, datanodes send a message to the namenode saying they are still alive every dfs.heartbeat.interval seconds, and after dfs.namenode.stale.datanode.interval milliseconds without a heartbeat, the namenode will mark that datanode as stale. So, if the value of dfs.heartbeat.interval goes beyond dfs.namenode.stale.datanode.interval, the normal operation of the distributed file system gets interrupted.

7.2. FURTHER WORK

Further work can be done in the following cases:

When a client performs a read operation, the request first goes to the Namenode. The Namenode then chooses the nearest datanode that can serve this request. So replication definitely helps, in the sense that a replica might be placed on a node nearer to the client. Otherwise, the request will go to the same datanode, as the namenode doesn't check whether the datanode is busy serving other requests or not. A possible implementation would therefore be for the namenode to redirect the request if the requested datanode is busy serving requests.
Another piece of work involves checking whether a file or a set of files is being accessed heavily during an interval. If so, the replication factor for those files can be increased temporarily during that interval in order to serve as many requests in parallel as possible. This would increase the performance and also the throughput of the distributed file system.
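The temporary-replication idea can be sketched as a simple policy function. The function name, threshold and cap below are hypothetical illustrations, not part of Hadoop:

```python
def temporary_replication(access_count, base_replication=3,
                          datanodes=10, threshold=100):
    # Raise the replication factor for heavily accessed files, but never
    # beyond the number of available datanodes, since extra replicas
    # beyond that give no benefit (conclusion point 3 above).
    if access_count <= threshold:
        return base_replication
    boosted = base_replication + access_count // threshold
    return min(boosted, datanodes)

print(temporary_replication(50))    # 3  (cold file keeps base replication)
print(temporary_replication(450))   # 7  (3 + 450 // 100)
print(temporary_replication(5000))  # 10 (capped at the datanode count)
```

When the busy interval ends, the replication factor would be set back to the base value so as not to waste cluster storage.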
APPENDICES
APPENDIX A
HADOOP INSTALLATION

A1. PREREQUISITES

SUPPORTED PLATFORMS

GNU/Linux is supported as a development and production platform. Hadoop has been demonstrated on GNU/Linux clusters with 2000 nodes. Win32 is supported as a development platform. Distributed operation has not been well tested on Win32, so it is not supported as a production platform.

REQUIRED SOFTWARE

Java 1.6.x, preferably from Sun, must be installed. ssh must be installed and sshd must be running in order to use the Hadoop scripts that manage remote Hadoop daemons.

ADDITIONAL REQUIREMENTS FOR WINDOWS INCLUDE:

Cygwin - required for shell support in addition to the required software above.

A2. INSTALLING SOFTWARE

The cluster must have the requisite software installed on it.

INSTALLATION ON UBUNTU LINUX:

$ sudo apt-get install ssh
$ sudo apt-get install rsync

On Windows, cygwin must be installed; start the cygwin installer and select the package openssh from the Net category.

A3. GET HADOOP DISTRIBUTION

Download a stable release of Hadoop from one of the Apache download mirrors.

A4. PREPARE TO START THE HADOOP CLUSTER

Unpack the downloaded Hadoop distribution using the following command.
$ tar -xvf hadoop-<version>.tar.gz

Note: All the installation instructions are specified with the Hadoop distribution. When installing a different hadoop distribution, the archive name must be changed to the distribution name.

In the hadoop distribution, find the location of the conf directory. This conf directory is the configuration directory where all the necessary changes to the hadoop configuration parameters can be made. This directory is usually located just under the home distribution directory.

Edit the file conf/hadoop-env.sh to define at least JAVA_HOME to be the root of your Java installation.

Try the following two commands:

$ cd hadoop
$ ll

This will list the contents of the Hadoop distribution directory.
Now try the following command:

$ bin/hadoop

This will display the usage documentation for the hadoop script.

Only if the above commands show the correct results should the installation of the Hadoop cluster proceed. The installation of the Hadoop cluster can be done in one of the following three supported modes:

Local (Standalone) Mode
Pseudo-Distributed Mode
Fully-Distributed Mode

A5. LOCAL (STANDALONE) MODE

By default, Hadoop is configured to run in a non-distributed mode, as a single Java process. This is useful for debugging. The following example copies the unpacked conf directory to use as input and then finds and displays every match of the given regular expression. Output is written to the given output directory.
$ mkdir input
$ cp conf/*.xml input
$ bin/hadoop jar hadoop-examples-*.jar grep input output 'dfs[a-z.]+'
$ cat output/*

A6. PSEUDO-DISTRIBUTED MODE

Hadoop can also be run on a single node in a pseudo-distributed mode where each Hadoop daemon runs in a separate Java process.

A6.1. CONFIGURATION

The Hadoop conf directory has the following five files that control the behavior of the hadoop distributed file system:

hdfs-site.xml
core-site.xml
mapred-site.xml
masters
slaves

The content of these files (for pseudo-distributed mode) is given below:

HDFS-SITE.XML

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

The property tag contains the hadoop configuration parameter to be specified. The name tag specifies the name of the hadoop configuration parameter and the value tag is used to state its value. An additional description tag can be used to describe the parameter, in case one uses many hadoop configuration parameters.
Hadoop uses a default replication factor of 3. Since this is a pseudo-distributed mode setup (i.e. Hadoop will be running on only one node), the replication factor should be one. So, hdfs-site.xml has been used here to declare a custom replication factor of 1 with the property name dfs.replication.

CORE-SITE.XML

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

MAPRED-SITE.XML

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>

MASTERS

localhost

SLAVES

localhost
The content of the last two files (i.e. masters and slaves) is localhost by default, so they don't need to be explicitly specified unless one is reconfiguring hadoop from fully-distributed mode to pseudo-distributed mode.

A6.2. SETUP PASSPHRASELESS SSH

The Hadoop distributed file system uses ssh to start datanodes and namenodes on local machines. Normally, when we use the ssh command, we need to provide a passphrase. With passphraseless SSH, one doesn't need to provide a passphrase to ssh to a local machine. So it is very important to have passphraseless SSH; otherwise, one has to type the passphrase for each daemon started by the hadoop distributed file system. For example, if a hadoop cluster contains 1000 datanodes and 1 namenode, then the administrator must type the passphrase 1001 times to start hadoop on the cluster properly.

Check whether it is possible to ssh to the localhost without a passphrase with the following command:

$ ssh localhost

If you cannot ssh to localhost without a passphrase, execute the following commands:

$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

A6.3. EXECUTION

1. Format a new distributed file system:

$ bin/hadoop namenode -format

2. Start the hadoop daemons:

$ bin/start-all.sh

The hadoop daemon log output is written to the ${HADOOP_LOG_DIR} directory. By default, this directory is ${HADOOP_HOME}/logs.

You need to format the hadoop file system only once, while setting up hadoop for the first time. The next time you want to start the hadoop daemons on the local machines, you don't need to format the file system. Formatting the file system too many
times can lead to errors, and the datanode or namenode may not even start due to that error.

3. Copy the input files into the distributed filesystem:

$ bin/hadoop fs -put conf input

4. Run some of the examples provided:

$ bin/hadoop jar hadoop-examples-*.jar grep input output 'dfs[a-z.]+'

5. Examine the output files. Copy the output files from the distributed filesystem to the local filesystem and examine them:

$ bin/hadoop fs -get output output
$ cat output/*

Or view the output files on the distributed file system:

$ bin/hadoop fs -cat output/*

6. When you're done, stop the daemons with the following command:

$ bin/stop-all.sh

A7. FULLY-DISTRIBUTED MODE

A7.1. CONFIGURATION

The fully distributed mode of hadoop means hadoop daemons running on several machines. So, separate copies of the hadoop distribution directory must be available on the nodes on which a particular daemon is to be run. The xml files in the conf directory must also have the respective content, and the content of an xml file may differ between nodes depending upon the requirement. The following files need to be configured:

hdfs-site.xml, core-site.xml, mapred-site.xml
masters, slaves
HDFS-SITE.XML

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
    <description>Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified at create time.</description>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>${hadoop_nn_dir}</value>
    <description>Determines where on the local filesystem the DFS name node should store the name table. If this is a comma-delimited list of directories then the name table is replicated in all of the directories, for redundancy.</description>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>${hadoop_dn_dir}</value>
    <description>Determines where on the local filesystem a DFS data node should store its blocks. If this is a comma-delimited list of directories, then data will be stored in all named directories, typically on different devices. Directories that do not exist are ignored.</description>
  </property>
</configuration>
CORE-SITE.XML

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://${hadoop_nn_host}:8020</value>
    <description>The name of the default file system. Either the literal string "local" or a host:port for NDFS.</description>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>${hadoop_temp_directory}</value>
    <description>Determines where on the local filesystem DFS is going to store its temporary directories.</description>
  </property>
  <property>
    <name>fs.trash.interval</name>
    <value>360</value>
    <description>Number of minutes between trash checkpoints. If zero, the trash feature is disabled.</description>
  </property>
</configuration>

Note: By default, Hadoop sets hadoop.tmp.dir to the /tmp folder, and /tmp is wiped by Linux after rebooting. Hence, it is essential to specify a local directory as hadoop.tmp.dir in order to avoid this problem. The fs.trash.interval property is not essential for the proper functioning of hadoop; it is not a minimal requirement to run hadoop, but it is a useful feature, so it is advised to include it in core-site.xml.
MAPRED-SITE.XML

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>${hadoop_jt_host}:8021</value>
    <final>true</final>
  </property>
</configuration>

MASTERS

The IP address of the Namenode.

SLAVES

The list of IP addresses of all Datanodes.

A7.2. HADOOP STARTUP

To start a Hadoop cluster you will need to start both the HDFS and the Map/Reduce cluster.

1. Format a new distributed filesystem using the following command:

$ bin/hadoop namenode -format

2. Start HDFS with the following command, run on the designated NameNode:

$ bin/start-dfs.sh

The bin/start-dfs.sh script also consults the ${HADOOP_CONF_DIR}/slaves file on the NameNode and starts the DataNode daemon on all the listed slaves.
3. Start Map/Reduce with the following command, run on the designated JobTracker:

$ bin/start-mapred.sh

The bin/start-mapred.sh script also consults the ${HADOOP_CONF_DIR}/slaves file on the JobTracker and starts the TaskTracker daemon on all the listed slaves.

A7.3. HADOOP SHUTDOWN

1. Stop HDFS with the following command, run on the designated NameNode:

$ bin/stop-dfs.sh

The bin/stop-dfs.sh script also consults the ${HADOOP_CONF_DIR}/slaves file on the NameNode and stops the DataNode daemon on all the listed slaves.

2. Stop Map/Reduce with the following command, run on the designated JobTracker:

$ bin/stop-mapred.sh

The bin/stop-mapred.sh script also consults the ${HADOOP_CONF_DIR}/slaves file on the JobTracker and stops the TaskTracker daemon on all the listed slaves.
APPENDIX B
HADOOP SHELL COMMANDS

FS SHELL

The FileSystem (FS) shell is invoked by bin/hadoop fs <args>. All the FS shell commands take path URIs as arguments. The URI format is scheme://authority/path. For HDFS the scheme is hdfs, and for the local filesystem the scheme is file. The scheme and authority are optional. If not specified, the default scheme specified in the configuration is used. An HDFS file or directory such as /parent/child can be specified as hdfs://namenodehost/parent/child or simply as /parent/child (given that your configuration is set to point to hdfs://namenodehost). Most of the commands in the FS shell behave like the corresponding UNIX commands. Differences are described with each of the commands. Error information is sent to stderr and the output is sent to stdout.

CAT

Usage: hadoop fs -cat URI [URI ...]

Copies source paths to stdout.

Example:

hadoop fs -cat hdfs://nn1.example.com/file1 hdfs://nn2.example.com/file2
hadoop fs -cat file:///file3 /user/hadoop/file4

Exit Code: Returns 0 on success and -1 on error.

CHGRP

Usage: hadoop fs -chgrp [-R] GROUP URI [URI ...]

Change the group association of files. With -R, make the change recursively through the directory structure. The user must be the owner of the files, or else a super-user.

CHMOD

Usage: hadoop fs -chmod [-R] <MODE[,MODE]... | OCTALMODE> URI [URI ...]

Change the permissions of files. With -R, make the change recursively through the directory structure. The user must be the owner of the file, or else a super-user.
CHOWN

Usage: hadoop fs -chown [-R] [OWNER][:[GROUP]] URI [URI ...]

Change the owner of files. With -R, make the change recursively through the directory structure. The user must be a super-user.

COPYFROMLOCAL

Usage: hadoop fs -copyFromLocal <localsrc> URI

Similar to the put command, except that the source is restricted to a local file reference.

COPYTOLOCAL

Usage: hadoop fs -copyToLocal [-ignorecrc] [-crc] URI <localdst>

Similar to the get command, except that the destination is restricted to a local file reference.

CP

Usage: hadoop fs -cp URI [URI ...] <dest>

Copy files from source to destination. This command also allows multiple sources, in which case the destination must be a directory.

Example:

hadoop fs -cp /user/hadoop/file1 /user/hadoop/file2
hadoop fs -cp /user/hadoop/file1 /user/hadoop/file2 /user/hadoop/dir

Exit Code: Returns 0 on success and -1 on error.

DU

Usage: hadoop fs -du URI [URI ...]

Displays the aggregate length of files contained in the directory, or the length of a file in case it's just a file.

Example:

hadoop fs -du /user/hadoop/dir1 /user/hadoop/file1 hdfs://nn.example.com/user/hadoop/dir1

Exit Code: Returns 0 on success and -1 on error.
DUS

Usage: hadoop fs -dus <args>

Displays a summary of file lengths.

EXPUNGE

Usage: hadoop fs -expunge

Empty the Trash. Refer to the HDFS Design document for more information on the Trash feature.

GET

Usage: hadoop fs -get [-ignorecrc] [-crc] <src> <localdst>

Copy files to the local file system. Files that fail the CRC check may be copied with the -ignorecrc option. Files and CRCs may be copied using the -crc option.

Example:

hadoop fs -get /user/hadoop/file localfile
hadoop fs -get hdfs://nn.example.com/user/hadoop/file localfile

Exit Code: Returns 0 on success and -1 on error.

GETMERGE

Usage: hadoop fs -getmerge <src> <localdst> [addnl]

Takes a source directory and a destination file as input and concatenates the files in src into the destination local file. Optionally addnl can be set to enable adding a newline character at the end of each file.

LS

Usage: hadoop fs -ls <args>

For a file, returns stat on the file in the following format:
filename <number of replicas> filesize modification_date modification_time permissions userid groupid

For a directory, it returns the list of its direct children, as in unix. A directory is listed as:
dirname <dir> modification_date modification_time permissions userid groupid
Example:
hadoop fs -ls /user/hadoop/file1 /user/hadoop/file2 hdfs://nn.example.com/user/hadoop/dir1 /nonexistentfile
Exit Code: Returns 0 on success and -1 on error.

LSR
Usage: hadoop fs -lsr <args>
Recursive version of ls. Similar to Unix ls -R.

MKDIR
Usage: hadoop fs -mkdir <paths>
Takes path URIs as arguments and creates directories. The behavior is much like Unix mkdir -p, creating parent directories along the path.
Example:
hadoop fs -mkdir /user/hadoop/dir1 /user/hadoop/dir2
hadoop fs -mkdir hdfs://nn1.example.com/user/hadoop/dir hdfs://nn2.example.com/user/hadoop/dir
Exit Code: Returns 0 on success and -1 on error.

MOVEFROMLOCAL
Usage: hadoop fs -moveFromLocal <src> <dst>
Displays a "not implemented" message.

MV
Usage: hadoop fs -mv URI [URI ...] <dest>
Moves files from source to destination. This command also allows multiple sources, in which case the destination needs to be a directory. Moving files across file systems is not permitted.
Example:
hadoop fs -mv /user/hadoop/file1 /user/hadoop/file2
hadoop fs -mv hdfs://nn.example.com/file1 hdfs://nn.example.com/file2 hdfs://nn.example.com/file3 hdfs://nn.example.com/dir1
Exit Code: Returns 0 on success and -1 on error.

PUT
Usage: hadoop fs -put <localsrc> ... <dst>
Copy a single src, or multiple srcs, from the local file system to the destination file system. Also reads input from stdin and writes to the destination file system.
hadoop fs -put localfile /user/hadoop/hadoopfile
hadoop fs -put localfile1 localfile2 /user/hadoop/hadoopdir
hadoop fs -put localfile hdfs://nn.example.com/hadoop/hadoopfile
hadoop fs -put - hdfs://nn.example.com/hadoop/hadoopfile (reads the input from stdin)
Exit Code: Returns 0 on success and -1 on error.

RM
Usage: hadoop fs -rm URI [URI ...]
Delete files specified as args. Deletes only files and empty directories; refer to rmr for recursive deletes.
Example:
hadoop fs -rm hdfs://nn.example.com/file /user/hadoop/emptydir
Exit Code: Returns 0 on success and -1 on error.

RMR
Usage: hadoop fs -rmr URI [URI ...]
Recursive version of delete.
Example:
hadoop fs -rmr /user/hadoop/dir
hadoop fs -rmr hdfs://nn.example.com/user/hadoop/dir
Exit Code: Returns 0 on success and -1 on error.
SETREP
Usage: hadoop fs -setrep [-R] <path>
Changes the replication factor of a file. The -R option recursively changes the replication factor of the files within a directory.
Example:
hadoop fs -setrep -w 3 -R /user/hadoop/dir1
Exit Code: Returns 0 on success and -1 on error.

STAT
Usage: hadoop fs -stat URI [URI ...]
Returns the stat information on the path.
Example:
hadoop fs -stat path
Exit Code: Returns 0 on success and -1 on error.

TAIL
Usage: hadoop fs -tail [-f] URI
Displays the last kilobyte of the file to stdout. The -f option can be used as in Unix.
Example:
hadoop fs -tail pathname
Exit Code: Returns 0 on success and -1 on error.

TEST
Usage: hadoop fs -test -[ezd] URI
Options:
-e check to see if the file exists. Return 0 if true.
-z check to see if the file is zero length. Return 0 if true.
-d return 1 if the path is a directory, else return 0.
Example:
hadoop fs -test -e filename

TEXT
Usage: hadoop fs -text <src>
Takes a source file and outputs the file in text format. The allowed formats are zip and TextRecordInputStream.

TOUCHZ
Usage: hadoop fs -touchz URI [URI ...]
Create a file of zero length.
Example:
hadoop fs -touchz pathname
Exit Code: Returns 0 on success and -1 on error.
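To tie the commands above together, a typical shell session might look like the following sketch. It assumes a running HDFS cluster and a local file named report.txt; all paths are hypothetical.

```shell
# Sketch of an FS shell session exercising the commands documented above.
hadoop fs -mkdir /user/hadoop/reports            # behaves like mkdir -p
hadoop fs -put report.txt /user/hadoop/reports   # copy local file into HDFS
hadoop fs -ls /user/hadoop/reports               # list the directory
hadoop fs -setrep -w 2 /user/hadoop/reports/report.txt   # change replication
hadoop fs -test -e /user/hadoop/reports/report.txt && echo "file exists"
hadoop fs -get /user/hadoop/reports/report.txt copy.txt  # copy back to local
hadoop fs -rmr /user/hadoop/reports              # recursive delete
```

Each command returns 0 on success, so they can be chained with && in scripts, as the -test line shows.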
APPENDIX C
DEALING WITH INSTALLATION ERRORS

C1. ERROR 1
INFO org.apache.hadoop.mapred.JobTracker: problem cleaning system directory: null
java.net.ConnectException: Call to localhost/ :8020 failed on connection exception: java.net.ConnectException: Connection refused
at org.apache.hadoop.ipc.Client.wrapException(Client.java:767)

SOLUTION
By default, Hadoop sets hadoop.tmp.dir to the /tmp folder. This is a problem because /tmp is wiped by Linux on reboot, which leads to this error from the JobTracker. Include the following property in core-site.xml:
<property>
<name>hadoop.tmp.dir</name>
<value>/hadoop_temp</value>
</property>
And create a directory named hadoop_temp under the hadoop distribution directory, i.e. the hadoop home directory.

C2. ERROR 2
INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/ :8020. Already tried 0 time(s).
INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/ :8020. Already tried 1 time(s).
INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/ :8020. Already tried 2 time(s).
...
INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/ :8020. Already tried 9 time(s).

SOLUTION
This error occurs when the namenode is down. The likely reason is that you have either forgotten to start the hadoop cluster or the namenode is failing to start. Use the following command to see which hadoop daemons are running on which machines:
jps
If no hadoop daemon is running, use the following command to start all hadoop daemons:
bin/start-all.sh or bin/start-dfs.sh
And if only the namenode is not starting on the hadoop cluster, use the following command to start the namenode:
bin/hadoop-daemon.sh start namenode
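The check-then-start procedure above can be scripted. This is a minimal sketch, assuming a Hadoop 1.x single-node setup where the namenode process appears in jps output as "NameNode" and the script is run from the hadoop home directory:

```shell
# Restart the namenode only if it is not already listed by jps.
if ! jps | grep -q NameNode; then
    echo "NameNode is not running - starting it"
    bin/hadoop-daemon.sh start namenode
fi
```

The same pattern works for the other daemons (DataNode, JobTracker, TaskTracker) by changing the grep pattern and the daemon name.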
C3. ERROR 3
WARN hdfs.DFSClient: DataStreamer Exception: org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /user/dibyendu_karmakar/a could only be replicated to 0 nodes, instead of 1
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1558)
at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:696)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:563)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1388)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1384)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1382)
at org.apache.hadoop.ipc.Client.call(Client.java:1070)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
at $Proxy1.addBlock(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
at $Proxy1.addBlock(Unknown Source)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:3510)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:3373)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2600(DFSClient.java:2589)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2829)
WARN hdfs.DFSClient: Error Recovery for block null bad datanode[0] nodes == null
WARN hdfs.DFSClient: Could not get block locations. Source file "/user/dibyendu_karmakar/a" - Aborting...
put: java.io.IOException: File /user/dibyendu_karmakar/a could only be replicated to 0 nodes, instead of 1
ERROR hdfs.DFSClient: Exception closing file /user/dibyendu_karmakar/a : org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /user/dibyendu_karmakar/a could only be replicated to 0 nodes, instead of 1
(the same server-side and client-side stack trace as above is printed again)

SOLUTION
You have probably noticed that the datanode reports this error while starting. The reason is that the datanode's storage directory is locked. Every datanode stores its blocks under dfs.data.dir, and this directory must be different for different datanodes: whenever a datanode starts, it locks its dfs.data.dir so that no other datanode can use the directory. This error typically occurs when the file system is mounted from a server such that the location of the directory is the same for all machines. Edit the following property in hdfs-site.xml:
<property>
<name>dfs.data.dir</name>
<value>/tmp/datanode_dir</value>
</property>
And create a directory named datanode_dir in the /tmp folder.

C4. 
ERROR 4
ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: java.io.IOException: Incompatible namespaceIDs in /User/Data/Hadoop/dfs/data: namenode namespaceID = ; datanode namespaceID = 
at org.apache.hadoop.hdfs.server.datanode.DataStorage.doTransition(DataStorage.java:233)
at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:148)
at org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:298)
at org.apache.hadoop.hdfs.server.datanode.DataNode.<init>(DataNode.java:216)
at org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1283)
at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1238)
at org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:1246)
at org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:1368)

SOLUTION
The namenode generates a new namespaceID every time HDFS is formatted, so the datanode's stored namespaceID no longer matches. Delete all the data, along with the directories, that belong to the previous namespace. Alternatively, you can change dfs.data.dir and dfs.name.dir and delete the previously used directories, or roll back to the previous namespace.
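The first remedy above can be carried out as follows. This is a sketch only: it destroys all HDFS data, and the storage paths are assumptions to be replaced by your own dfs.data.dir and dfs.name.dir values.

```shell
# Wipe the stale storage directories and reformat HDFS so the namenode
# and datanode agree on a single, freshly generated namespaceID.
# WARNING: this deletes all data stored in HDFS.
bin/stop-all.sh
rm -rf /tmp/datanode_dir /tmp/namenode_dir   # assumed storage directories
bin/hadoop namenode -format                  # generates a new namespaceID
bin/start-all.sh
```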
APPENDIX D
HADOOP USER AND ADMIN COMMANDS

D1. OVERVIEW
All hadoop commands are invoked by the bin/hadoop script. Running the hadoop script without any arguments prints the description of all commands.
Usage: hadoop [--config confdir] [COMMAND] [GENERIC_OPTIONS] [COMMAND_OPTIONS]
Hadoop has an option-parsing framework that handles generic options as well as running classes.
--config confdir : Overwrites the default configuration directory. Default is ${HADOOP_HOME}/conf.
GENERIC_OPTIONS : The common set of options supported by multiple commands.
COMMAND, COMMAND_OPTIONS : Various commands with their options are described in the following sections. The commands have been grouped into User Commands and Administration Commands.

D2. GENERIC OPTIONS
The following options are supported by dfsadmin, fs, fsck, job and fetchdt. Applications should implement Tool to support GenericOptions.
-conf <configuration file> : Specify an application configuration file.
-D <property=value> : Use the given value for the given property.
-fs <local|namenode:port> : Specify a namenode.
-jt <local|jobtracker:port> : Specify a job tracker. Applies only to job.
-files <comma separated list of files> : Specify comma-separated files to be copied to the map reduce cluster. Applies only to job.
-libjars <comma separated list of jars> : Specify comma-separated jar files to include in the classpath. Applies only to job.
-archives <comma separated list of archives> : Specify comma-separated archives to be unarchived on the compute machines. Applies only to job.
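Generic options are placed before the command-specific options on the command line. A small sketch (the property value, host and path are illustrative only):

```shell
# Point the fs client at a specific namenode and override one
# configuration property for this single invocation.
hadoop fs -D dfs.replication=2 -fs hdfs://nn.example.com:8020 -ls /user
```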
D3. USER COMMANDS
Commands useful for users of a hadoop cluster.

ARCHIVE
Creates a hadoop archive. More information can be found at Hadoop Archives.
Usage: hadoop archive -archivename NAME <src>* <dest>
-archivename NAME : Name of the archive to be created.
src : Filesystem pathnames which work as usual with regular expressions.
dest : Destination directory which will contain the archive.

DISTCP
Copy files or directories recursively. More information can be found at the Hadoop DistCp Guide.
Usage: hadoop distcp <srcurl> <desturl>
srcurl : Source URL.
desturl : Destination URL.

FS
Usage: hadoop fs [GENERIC_OPTIONS] [COMMAND_OPTIONS]
Runs a generic filesystem user client. The various COMMAND_OPTIONS can be found in the File System Shell Guide.

FSCK
Runs an HDFS file system checking utility. See Fsck for more info.
Usage: hadoop fsck [GENERIC_OPTIONS] <path> [-move | -delete | -openforwrite] [-files [-blocks [-locations | -racks]]]
<path> : Start checking from this path.
-move : Move corrupted files to /lost+found.
-delete : Delete corrupted files.
-openforwrite : Print out files opened for write.
-files : Print out files being checked.
-blocks : Print out the block report.
-locations : Print out locations for every block.
-racks : Print out network topology for data-node locations.

FETCHDT
Gets a Delegation Token from a Namenode. See fetchdt for more info.
Usage: hadoop fetchdt [GENERIC_OPTIONS] [--webservice <namenode_http_addr>] <path> <filename>
--webservice <https_address> : Use the http protocol instead of RPC.
<filename> : File name to store the token into.

JAR
Runs a jar file. Users can bundle their Map Reduce code in a jar file and execute it using this command.
Usage: hadoop jar <jar> [mainClass] args...
Streaming jobs are run via this command; examples can be referred from Streaming examples. The word count example is also run using the jar command; it can be referred from the Word count example.

JOB
Command to interact with Map Reduce jobs.
Usage: hadoop job [GENERIC_OPTIONS] [-submit <job-file>] [-status <job-id>] [-counter <job-id> <group-name> <counter-name>] [-kill <job-id>] [-events <job-id> <from-event-#> <#-of-events>] [-history [all] <jobOutputDir>] [-list [all]] [-kill-task <task-id>] [-fail-task <task-id>] [-set-priority <job-id> <priority>]
-submit <job-file> : Submits the job.
-status <job-id> : Prints the map and reduce completion percentage and all job counters.
-counter <job-id> <group-name> <counter-name> : Prints the counter value.
-kill <job-id> : Kills the job.
-events <job-id> <from-event-#> <#-of-events> : Prints the events' details received by the jobtracker for the given range.
-history [all] <jobOutputDir> : -history <jobOutputDir> prints job details, failed and killed tip details. More details about the job, such as successful tasks and task attempts made for each task, can be viewed by specifying the [all] option.
-list [all] : -list all displays all jobs. -list displays only jobs which are yet to complete.
-kill-task <task-id> : Kills the task. Killed tasks are NOT counted against failed attempts.
-fail-task <task-id> : Fails the task. Failed tasks are counted against failed attempts.
-set-priority <job-id> <priority> : Changes the priority of the job. Allowed priority values are VERY_HIGH, HIGH, NORMAL, LOW, VERY_LOW.

PIPES
Runs a pipes job.
Usage: hadoop pipes [-conf <path>] [-jobconf <key=value>, <key=value>, ...] [-input <path>] [-output <path>] [-jar <jar file>] [-inputformat <class>] [-map <class>] [-partitioner <class>] [-reduce <class>] [-writer <class>] [-program <executable>] [-reduces <num>]
-conf <path> : Configuration for the job.
-jobconf <key=value>, <key=value>, ... : Add/override configuration for the job.
-input <path> : Input directory.
-output <path> : Output directory.
-jar <jar file> : Jar filename.
-inputformat <class> : InputFormat class.
-map <class> : Java Map class.
-partitioner <class> : Java Partitioner.
-reduce <class> : Java Reduce class.
-writer <class> : Java RecordWriter.
-program <executable> : Executable URI.
-reduces <num> : Number of reduces.

QUEUE
Command to interact with and view Job Queue information.
Usage: hadoop queue [-list] | [-info <job-queue-name> [-showJobs]] | [-showacls]
-list : Gets the list of Job Queues configured in the system, along with the scheduling information associated with the job queues.
-info <job-queue-name> [-showJobs] : Displays the job queue information and associated scheduling information of a particular job queue. If the -showJobs option is present, a list of jobs submitted to the particular job queue is displayed.
-showacls : Displays the queue name and associated queue operations allowed for the current user. The list consists of only those queues to which the user has access.

VERSION
Prints the version.
Usage: hadoop version

CLASSNAME
The hadoop script can be used to invoke any class.
Usage: hadoop CLASSNAME
Runs the class named CLASSNAME.

CLASSPATH
Prints the class path needed to get the Hadoop jar and the required libraries.
Usage: hadoop classpath
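A typical interaction with the jar and job commands might look like the following sketch; the jar name, main class, paths and job id are hypothetical.

```shell
# Submit a bundled MapReduce job, then query and manage it by its job id.
hadoop jar wordcount.jar WordCount /user/hadoop/input /user/hadoop/output
hadoop job -list                                  # find the running job's id
hadoop job -status job_201301011200_0001          # completion % and counters
hadoop job -set-priority job_201301011200_0001 HIGH
hadoop job -kill job_201301011200_0001            # stop it if necessary
```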
D4. ADMINISTRATION COMMANDS
Commands useful for administrators of a hadoop cluster.

BALANCER
Runs a cluster balancing utility. An administrator can simply press Ctrl-C to stop the rebalancing process. See Rebalancer for more details.
Usage: hadoop balancer [-threshold <threshold>]
-threshold <threshold> : Percentage of disk capacity. This overwrites the default threshold.

DAEMONLOG
Get/set the log level for each daemon.
Usage: hadoop daemonlog -getlevel <host:port> <name>
Usage: hadoop daemonlog -setlevel <host:port> <name> <level>
-getlevel <host:port> <name> : Prints the log level of the daemon running at <host:port>. This command internally connects to http://<host:port>/logLevel?log=<name>.
-setlevel <host:port> <name> <level> : Sets the log level of the daemon running at <host:port>. This command internally connects to http://<host:port>/logLevel?log=<name>.

DATANODE
Runs an HDFS datanode.
Usage: hadoop datanode [-rollback]
-rollback : Rolls back the datanode to the previous version. This should be used after stopping the datanode and distributing the old hadoop version.

DFSADMIN
Runs an HDFS dfsadmin client.
Usage: hadoop dfsadmin [GENERIC_OPTIONS] [-report] [-safemode enter | leave | get | wait] [-refreshNodes] [-finalizeUpgrade] [-upgradeProgress status | details | force] [-metasave filename] [-setQuota <quota> <dirname>...<dirname>] [-clrQuota <dirname>...<dirname>] [-help [cmd]]
-report : Reports basic filesystem information and statistics.
-safemode enter | leave | get | wait : Safe mode maintenance command. Safe mode is a Namenode state in which it 1. does not accept changes to the name space (read-only) and 2. does not replicate or delete blocks. Safe mode is entered automatically at Namenode startup and left automatically when the configured minimum percentage of blocks satisfies the minimum replication condition. Safe mode can also be entered manually, but then it can only be turned off manually as well.
-refreshNodes : Re-read the hosts and exclude files to update the set of Datanodes that are allowed to connect to the Namenode and those that should be decommissioned or recommissioned.
-finalizeUpgrade : Finalize upgrade of HDFS. Datanodes delete their previous-version working directories, followed by the Namenode doing the same. This completes the upgrade process.
-upgradeProgress status | details | force : Request the current distributed upgrade status, a detailed status, or force the upgrade to proceed.
-metasave filename : Save the Namenode's primary data structures to <filename> in the directory specified by the hadoop.log.dir property. <filename> will contain one line for each of the following: 1. Datanodes heartbeating with the Namenode, 2. blocks waiting to be replicated, 3. blocks currently being replicated, 4. blocks waiting to be deleted.
-setQuota <quota> <dirname>...<dirname> : Set the quota <quota> for each directory <dirname>. The directory quota is a long integer that puts a hard limit on the number of names in the directory tree. Best effort for the directory, with faults reported if 1. N is not a positive integer, or 2. the user is not an administrator, or 3. the directory does not exist or is a file, or 4. the directory would immediately exceed the new quota.
-clrQuota <dirname>...<dirname> : Clear the quota for each directory <dirname>. Best effort for the directory, with a fault reported if 1. the directory does not exist or is a file, or 2. the user is not an administrator. It does not fault if the directory has no quota.
-help [cmd] : Displays help for the given command, or all commands if none is specified.

MRADMIN
Runs an MR admin client.
Usage: hadoop mradmin [GENERIC_OPTIONS] [-refreshQueueAcls]
-refreshQueueAcls : Refresh the queue ACLs used by hadoop to check access during submission and administration of jobs by the user. The properties present in mapred-queue-acls.xml are reloaded by the queue manager.

JOBTRACKER
Runs the MapReduce jobtracker node.
Usage: hadoop jobtracker [-dumpConfiguration]
-dumpConfiguration : Dumps the configuration used by the JobTracker, along with the queue configuration, in JSON format to standard output, and exits.

NAMENODE
Runs the namenode. More info about upgrade, rollback and finalize is at Upgrade Rollback.
Usage: hadoop namenode [-format] [-upgrade] [-rollback] [-finalize] [-importCheckpoint]
-format : Formats the namenode. It starts the namenode, formats it and then shuts it down.
-upgrade : The namenode should be started with the upgrade option after the distribution of a new hadoop version.
-rollback : Rolls back the namenode to the previous version. This should be used after stopping the cluster and distributing the old hadoop version.
-finalize : Finalize removes the previous state of the file system. The most recent upgrade becomes permanent and the rollback option is no longer available. After finalization it shuts the namenode down.
-importCheckpoint : Loads the image from a checkpoint directory and saves it into the current one. The checkpoint directory is read from the property fs.checkpoint.dir.

SECONDARYNAMENODE
Runs the HDFS secondary namenode. See Secondary Namenode for more info.
Usage: hadoop secondarynamenode [-checkpoint [force]] [-geteditsize]
-checkpoint [force] : Checkpoints the secondary namenode if EditLog size >= fs.checkpoint.size. If -force is used, checkpoint irrespective of EditLog size.
-geteditsize : Prints the EditLog size.

TASKTRACKER
Runs a MapReduce tasktracker node.
Usage: hadoop tasktracker
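An administrator's session combining several of the commands above might look like the following sketch; the threshold value and the host name are illustrative only.

```shell
# Inspect cluster state, then rebalance it with a custom threshold.
hadoop dfsadmin -report          # capacity, usage, and datanode status
hadoop dfsadmin -safemode get    # is the namenode currently in safe mode?
hadoop balancer -threshold 10    # move blocks until nodes are within 10%
hadoop daemonlog -getlevel nn.example.com:50070 org.apache.hadoop.hdfs.server.namenode.NameNode
```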
The Google File System
The Google File System By Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung (Presented at SOSP 2003) Introduction Google search engine. Applications process lots of data. Need good file system. Solution:
Hadoop: Embracing future hardware
Hadoop: Embracing future hardware Suresh Srinivas @suresh_m_s Page 1 About Me Architect & Founder at Hortonworks Long time Apache Hadoop committer and PMC member Designed and developed many key Hadoop
HDFS Design Principles
HDFS Design Principles The Scale-out-Ability of Distributed Storage SVForum Software Architecture & Platform SIG Konstantin V. Shvachko May 23, 2012 Big Data Computations that need the power of many computers
HADOOP MOCK TEST HADOOP MOCK TEST I
http://www.tutorialspoint.com HADOOP MOCK TEST Copyright tutorialspoint.com This section presents you various set of Mock Tests related to Hadoop Framework. You can download these sample mock tests at
CSE-E5430 Scalable Cloud Computing Lecture 2
CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University [email protected] 14.9-2015 1/36 Google MapReduce A scalable batch processing
Big Data With Hadoop
With Saurabh Singh [email protected] The Ohio State University February 11, 2016 Overview 1 2 3 Requirements Ecosystem Resilient Distributed Datasets (RDDs) Example Code vs Mapreduce 4 5 Source: [Tutorials
Introduction to Hadoop. New York Oracle User Group Vikas Sawhney
Introduction to Hadoop New York Oracle User Group Vikas Sawhney GENERAL AGENDA Driving Factors behind BIG-DATA NOSQL Database 2014 Database Landscape Hadoop Architecture Map/Reduce Hadoop Eco-system Hadoop
HADOOP MOCK TEST HADOOP MOCK TEST
http://www.tutorialspoint.com HADOOP MOCK TEST Copyright tutorialspoint.com This section presents you various set of Mock Tests related to Hadoop Framework. You can download these sample mock tests at
MASSIVE DATA PROCESSING (THE GOOGLE WAY ) 27/04/2015. Fundamentals of Distributed Systems. Inside Google circa 2015
7/04/05 Fundamentals of Distributed Systems CC5- PROCESAMIENTO MASIVO DE DATOS OTOÑO 05 Lecture 4: DFS & MapReduce I Aidan Hogan [email protected] Inside Google circa 997/98 MASSIVE DATA PROCESSING (THE
BigData. An Overview of Several Approaches. David Mera 16/12/2013. Masaryk University Brno, Czech Republic
BigData An Overview of Several Approaches David Mera Masaryk University Brno, Czech Republic 16/12/2013 Table of Contents 1 Introduction 2 Terminology 3 Approaches focused on batch data processing MapReduce-Hadoop
HADOOP MOCK TEST HADOOP MOCK TEST II
http://www.tutorialspoint.com HADOOP MOCK TEST Copyright tutorialspoint.com This section presents you various set of Mock Tests related to Hadoop Framework. You can download these sample mock tests at
Parallel Processing of cluster by Map Reduce
Parallel Processing of cluster by Map Reduce Abstract Madhavi Vaidya, Department of Computer Science Vivekanand College, Chembur, Mumbai [email protected] MapReduce is a parallel programming model
Data-Intensive Computing with Map-Reduce and Hadoop
Data-Intensive Computing with Map-Reduce and Hadoop Shamil Humbetov Department of Computer Engineering Qafqaz University Baku, Azerbaijan [email protected] Abstract Every day, we create 2.5 quintillion
Hadoop & its Usage at Facebook
Hadoop & its Usage at Facebook Dhruba Borthakur Project Lead, Hadoop Distributed File System [email protected] Presented at the The Israeli Association of Grid Technologies July 15, 2009 Outline Architecture
HDFS: Hadoop Distributed File System
Istanbul Şehir University Big Data Camp 14 HDFS: Hadoop Distributed File System Aslan Bakirov Kevser Nur Çoğalmış Agenda Distributed File System HDFS Concepts HDFS Interfaces HDFS Full Picture Read Operation
Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA
Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA http://kzhang6.people.uic.edu/tutorial/amcis2014.html August 7, 2014 Schedule I. Introduction to big data
Sunita Suralkar, Ashwini Mujumdar, Gayatri Masiwal, Manasi Kulkarni Department of Computer Technology, Veermata Jijabai Technological Institute
Review of Distributed File Systems: Case Studies Sunita Suralkar, Ashwini Mujumdar, Gayatri Masiwal, Manasi Kulkarni Department of Computer Technology, Veermata Jijabai Technological Institute Abstract
Google File System. Web and scalability
Google File System Web and scalability The web: - How big is the Web right now? No one knows. - Number of pages that are crawled: o 100,000 pages in 1994 o 8 million pages in 2005 - Crawlable pages might
IJFEAT INTERNATIONAL JOURNAL FOR ENGINEERING APPLICATIONS AND TECHNOLOGY
IJFEAT INTERNATIONAL JOURNAL FOR ENGINEERING APPLICATIONS AND TECHNOLOGY Hadoop Distributed File System: What and Why? Ashwini Dhruva Nikam, Computer Science & Engineering, J.D.I.E.T., Yavatmal. Maharashtra,
Fault Tolerance in Hadoop for Work Migration
1 Fault Tolerance in Hadoop for Work Migration Shivaraman Janakiraman Indiana University Bloomington ABSTRACT Hadoop is a framework that runs applications on large clusters which are built on numerous
CDH AND BUSINESS CONTINUITY:
WHITE PAPER CDH AND BUSINESS CONTINUITY: An overview of the availability, data protection and disaster recovery features in Hadoop Abstract Using the sophisticated built-in capabilities of CDH for tunable
Intro to Map/Reduce a.k.a. Hadoop
Intro to Map/Reduce a.k.a. Hadoop Based on: Mining of Massive Datasets by Ra jaraman and Ullman, Cambridge University Press, 2011 Data Mining for the masses by North, Global Text Project, 2012 Slides by
Lecture 2 (08/31, 09/02, 09/09): Hadoop. Decisions, Operations & Information Technologies Robert H. Smith School of Business Fall, 2015
Lecture 2 (08/31, 09/02, 09/09): Hadoop Decisions, Operations & Information Technologies Robert H. Smith School of Business Fall, 2015 K. Zhang BUDT 758 What we ll cover Overview Architecture o Hadoop
HDFS Installation and Shell
2012 coreservlets.com and Dima May HDFS Installation and Shell Originals of slides and source code for examples: http://www.coreservlets.com/hadoop-tutorial/ Also see the customized Hadoop training courses
Hadoop & its Usage at Facebook
Hadoop & its Usage at Facebook Dhruba Borthakur Project Lead, Hadoop Distributed File System [email protected] Presented at the Storage Developer Conference, Santa Clara September 15, 2009 Outline Introduction
International Journal of Advance Research in Computer Science and Management Studies
Volume 2, Issue 8, August 2014 ISSN: 2321 7782 (Online) International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online
International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February-2014 10 ISSN 2278-7763
International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February-2014 10 A Discussion on Testing Hadoop Applications Sevuga Perumal Chidambaram ABSTRACT The purpose of analysing
Running a Workflow on a PowerCenter Grid
Running a Workflow on a PowerCenter Grid 2010-2014 Informatica Corporation. No part of this document may be reproduced or transmitted in any form, by any means (electronic, photocopying, recording or otherwise)
How To Install An Aneka Cloud On A Windows 7 Computer (For Free)
MANJRASOFT PTY LTD Aneka 3.0 Manjrasoft 5/13/2013 This document describes in detail the steps involved in installing and configuring an Aneka Cloud. It covers the prerequisites for the installation, the
Hadoop and Map-Reduce. Swati Gore
Hadoop and Map-Reduce Swati Gore Contents Why Hadoop? Hadoop Overview Hadoop Architecture Working Description Fault Tolerance Limitations Why Map-Reduce not MPI Distributed sort Why Hadoop? Existing Data
Hadoop. Apache Hadoop is an open-source software framework for storage and large scale processing of data-sets on clusters of commodity hardware.
Hadoop Source Alessandro Rezzani, Big Data - Architettura, tecnologie e metodi per l utilizzo di grandi basi di dati, Apogeo Education, ottobre 2013 wikipedia Hadoop Apache Hadoop is an open-source software
BookKeeper. Flavio Junqueira Yahoo! Research, Barcelona. Hadoop in China 2011
BookKeeper Flavio Junqueira Yahoo! Research, Barcelona Hadoop in China 2011 What s BookKeeper? Shared storage for writing fast sequences of byte arrays Data is replicated Writes are striped Many processes
NoSQL and Hadoop Technologies On Oracle Cloud
NoSQL and Hadoop Technologies On Oracle Cloud Vatika Sharma 1, Meenu Dave 2 1 M.Tech. Scholar, Department of CSE, Jagan Nath University, Jaipur, India 2 Assistant Professor, Department of CSE, Jagan Nath
Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh
1 Hadoop: A Framework for Data- Intensive Distributed Computing CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 2 What is Hadoop? Hadoop is a software framework for distributed processing of large datasets
ZooKeeper. Table of contents
by Table of contents 1 ZooKeeper: A Distributed Coordination Service for Distributed Applications... 2 1.1 Design Goals...2 1.2 Data model and the hierarchical namespace...3 1.3 Nodes and ephemeral nodes...
A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM
A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM Sneha D.Borkar 1, Prof.Chaitali S.Surtakar 2 Student of B.E., Information Technology, J.D.I.E.T, [email protected] Assistant Professor, Information
Apache Hadoop FileSystem and its Usage in Facebook
Apache Hadoop FileSystem and its Usage in Facebook Dhruba Borthakur Project Lead, Apache Hadoop Distributed File System [email protected] Presented at Indian Institute of Technology November, 2010 http://www.facebook.com/hadoopfs
Outline. Failure Types
Outline Database Management and Tuning Johann Gamper Free University of Bozen-Bolzano Faculty of Computer Science IDSE Unit 11 1 2 Conclusion Acknowledgements: The slides are provided by Nikolaus Augsten
Big Data Technology Core Hadoop: HDFS-YARN Internals
Big Data Technology Core Hadoop: HDFS-YARN Internals Eshcar Hillel Yahoo! Ronny Lempel Outbrain *Based on slides by Edward Bortnikov & Ronny Lempel Roadmap Previous class Map-Reduce Motivation This class
Sujee Maniyam, ElephantScale
Hadoop PRESENTATION 2 : New TITLE and GOES Noteworthy HERE Sujee Maniyam, ElephantScale SNIA Legal Notice The material contained in this tutorial is copyrighted by the SNIA unless otherwise noted. Member
Processing of massive data: MapReduce. 2. Hadoop. New Trends In Distributed Systems MSc Software and Systems
Processing of massive data: MapReduce 2. Hadoop 1 MapReduce Implementations Google were the first that applied MapReduce for big data analysis Their idea was introduced in their seminal paper MapReduce:
DATA MINING WITH HADOOP AND HIVE Introduction to Architecture
DATA MINING WITH HADOOP AND HIVE Introduction to Architecture Dr. Wlodek Zadrozny (Most slides come from Prof. Akella s class in 2014) 2015-2025. Reproduction or usage prohibited without permission of
Hadoop Distributed File System Propagation Adapter for Nimbus
University of Victoria Faculty of Engineering Coop Workterm Report Hadoop Distributed File System Propagation Adapter for Nimbus Department of Physics University of Victoria Victoria, BC Matthew Vliet
Hadoop Scalability at Facebook. Dmytro Molkov ([email protected]) YaC, Moscow, September 19, 2011
Hadoop Scalability at Facebook Dmytro Molkov ([email protected]) YaC, Moscow, September 19, 2011 How Facebook uses Hadoop Hadoop Scalability Hadoop High Availability HDFS Raid How Facebook uses Hadoop Usages
DistCp Guide. Table of contents. 3 Appendix... 6. 1 Overview... 2 2 Usage... 2 2.1 Basic...2 2.2 Options... 3
Table of contents 1 Overview... 2 2 Usage... 2 2.1 Basic...2 2.2 Options... 3 3 Appendix... 6 3.1 Map sizing... 6 3.2 Copying between versions of HDFS... 6 3.3 MapReduce and other side-effects...6 1 Overview
IBM TSM DISASTER RECOVERY BEST PRACTICES WITH EMC DATA DOMAIN DEDUPLICATION STORAGE
White Paper IBM TSM DISASTER RECOVERY BEST PRACTICES WITH EMC DATA DOMAIN DEDUPLICATION STORAGE Abstract This white paper focuses on recovery of an IBM Tivoli Storage Manager (TSM) server and explores
Spectrum Scale HDFS Transparency Guide
Spectrum Scale Guide Spectrum Scale BDA 2016-1-5 Contents 1. Overview... 3 2. Supported Spectrum Scale storage mode... 4 2.1. Local Storage mode... 4 2.2. Shared Storage Mode... 4 3. Hadoop cluster planning...
Deploying Hadoop with Manager
Deploying Hadoop with Manager SUSE Big Data Made Easier Peter Linnell / Sales Engineer [email protected] Alejandro Bonilla / Sales Engineer [email protected] 2 Hadoop Core Components 3 Typical Hadoop Distribution
CA ARCserve and CA XOsoft r12.5 Best Practices for protecting Microsoft SQL Server
CA RECOVERY MANAGEMENT R12.5 BEST PRACTICE CA ARCserve and CA XOsoft r12.5 Best Practices for protecting Microsoft SQL Server Overview Benefits The CA Advantage The CA ARCserve Backup Support and Engineering
Reduction of Data at Namenode in HDFS using harballing Technique
Reduction of Data at Namenode in HDFS using harballing Technique Vaibhav Gopal Korat, Kumar Swamy Pamu [email protected] [email protected] Abstract HDFS stands for the Hadoop Distributed File System.
Big Data and Hadoop. Sreedhar C, Dr. D. Kavitha, K. Asha Rani
Big Data and Hadoop Sreedhar C, Dr. D. Kavitha, K. Asha Rani Abstract Big data has become a buzzword in the recent years. Big data is used to describe a massive volume of both structured and unstructured
HADOOP PERFORMANCE TUNING
PERFORMANCE TUNING Abstract This paper explains tuning of Hadoop configuration parameters which directly affects Map-Reduce job performance under various conditions, to achieve maximum performance. The
Diagram 1: Islands of storage across a digital broadcast workflow
XOR MEDIA CLOUD AQUA Big Data and Traditional Storage The era of big data imposes new challenges on the storage technology industry. As companies accumulate massive amounts of data from video, sound, database,
5 HDFS - Hadoop Distributed System
5 HDFS - Hadoop Distributed System 5.1 Definition and Remarks HDFS is a file system designed for storing very large files with streaming data access patterns running on clusters of commoditive hardware.
Data-intensive computing systems
Data-intensive computing systems Hadoop Universtity of Verona Computer Science Department Damiano Carra Acknowledgements! Credits Part of the course material is based on slides provided by the following
Hadoop IST 734 SS CHUNG
Hadoop IST 734 SS CHUNG Introduction What is Big Data?? Bulk Amount Unstructured Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per day) If a regular machine need to
CA XOsoft Replication for Windows
CA XOsoft Replication for Windows Microsoft SQL Server Operation Guide r12.5 This documentation and any related computer software help programs (hereinafter referred to as the Documentation ) is for the
Cloud Storage. Parallels. Performance Benchmark Results. White Paper. www.parallels.com
Parallels Cloud Storage White Paper Performance Benchmark Results www.parallels.com Table of Contents Executive Summary... 3 Architecture Overview... 3 Key Features... 4 No Special Hardware Requirements...
LOCATION-AWARE REPLICATION IN VIRTUAL HADOOP ENVIRONMENT. A Thesis by. UdayKiran RajuladeviKasi. Bachelor of Technology, JNTU, 2008
LOCATION-AWARE REPLICATION IN VIRTUAL HADOOP ENVIRONMENT A Thesis by UdayKiran RajuladeviKasi Bachelor of Technology, JNTU, 2008 Submitted to Department of Electrical Engineering and Computer Science and
Snapshots in Hadoop Distributed File System
Snapshots in Hadoop Distributed File System Sameer Agarwal UC Berkeley Dhruba Borthakur Facebook Inc. Ion Stoica UC Berkeley Abstract The ability to take snapshots is an essential functionality of any
