Data Availability and Durability with the Hadoop Distributed File System
ROBERT J. CHANSLER

Robert J. Chansler is Senior Manager of Hadoop Infrastructure at LinkedIn. This work draws on Rob's experience as manager of the HDFS development team at Yahoo!. A Caltech graduate, Rob earned a PhD in computer science at Carnegie Mellon University investigating distributed systems. After a detour through compilers, printing systems, electronic commerce, and network management, Rob returned to distributed systems, where many problems were still familiar but all the numbers had two or three more zeros. [email protected]

The Hadoop Distributed File System at Yahoo! stores 40 petabytes of application data across 30,000 nodes. The most conventional strategy for data protection, just make a copy somewhere else, is not practical for such large data sets. To be a good custodian of this much data, HDFS must continuously manage the number of replicas for each block, test the integrity of blocks, balance the usage of resources as the hardware infrastructure changes, report status to administrators, and be on guard for the unexpected. Furthermore, the system administrators must ensure that thousands of hosts are operational, have network connectivity, and are executing the HDFS Name Node and Data Node applications. How well has this worked in practice? HDFS is (almost) always available and (almost) never loses data.

A Survey of HDFS Availability

The Grid Operations team tracks each cluster loss-of-service incident. Many of these incidents are caused by facilities other than the file system. But for some incidents, the immediate problem is that HDFS is unavailable or not performing well. Sometimes the initial incident report originates with a user frustrated that HDFS is not performing as well as expected. Automated tools are better at monitoring whether individual host machines and network switches are operational. Yahoo! uses Nagios for alerting.
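HDFS publishes its own statistics over JMX, and later releases also expose the same beans as JSON through the NameNode's HTTP /jmx servlet. As a sketch of how such a feed can be consumed by a monitoring tool, the helper below extracts one attribute from a /jmx-style payload; the bean and attribute names in the sample are illustrative and vary by Hadoop release.

```python
import json

def jmx_attribute(jmx_payload, bean_name, attribute):
    """Return one attribute from a /jmx-style JSON payload, or None."""
    for bean in json.loads(jmx_payload)["beans"]:
        if bean.get("name") == bean_name:
            return bean.get(attribute)
    return None

# Trimmed, illustrative sample of a /jmx response; real bean and
# attribute names depend on the Hadoop release.
sample = json.dumps({"beans": [
    {"name": "Hadoop:service=NameNode,name=FSNamesystem",
     "MissingBlocks": 0,
     "UnderReplicatedBlocks": 152},
]})

under_replicated = jmx_attribute(
    sample, "Hadoop:service=NameNode,name=FSNamesystem",
    "UnderReplicatedBlocks")
```

A collector like Simon or Ganglia would poll such attributes periodically and alert on thresholds (for example, any nonzero MissingBlocks).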
An internal tool called Simon is used for time-series collection and presentation of data from Nagios; Ganglia is a popular tool used elsewhere for this purpose. HDFS itself provides several statistics streams, available via JMX and collected by Simon. An introduction to using Nagios and Ganglia for cluster monitoring is available from IBM developerWorks (see Resources). The logs from cluster hosts, both the operating system logs and the HDFS application logs, can be used for fault diagnosis.

Table 1 summarizes all of the grid down events recorded during the 500 days preceding 5 April 2011 where loss of HDFS service was the initial or primary complaint. The table includes some data about the aggregate size of the clusters for perspective. This somewhat arbitrary interval was chosen to be long enough to collect an interesting data set, but not so long as to include much data from when HDFS was, to be frank, not so good. Roughly, this corresponds to the deployment of the Hadoop 0.20 versions.

The HDFS Name Node is a Java application. As such, there is always concern that garbage collection might interrupt service. Indeed, garbage collection (GC) represents 44% of the reported loss-of-service incidents. In a GC event, HDFS is unavailable for a few (less than 10) minutes. The system resumes normal operation without intervention. The nature of the Java Virtual Machine is such that there can be no promise that GC will not interrupt service. Most GC events are close to the threshold of observability, about five minutes. During mid-2010, the biggest grids (about 4,000 nodes) had short interruptions once or twice a week, although only one report was generated each month. Careful provisioning and some thoughtful software changes have reduced the frequency of interruptions, so that today big clusters have fewer than one event every other month. GC issues were nagging problems from time to time, but among HDFS developers the consensus is that HDFS has benefited from the choice to implement the system in Java. Development efficiency was higher, and the inconvenient service interruptions that did occur were not the kind of faults that might threaten the integrity of the file system.

Table 2 summarizes each of the "Other HDFS Down Incidents"; HDFS was unavailable for an hour or two while the fault was repaired and the cluster restarted. HDFS ought to be excused from 16 of the events: there is not much the file system can do if LDAP, NFS, power, or a switch fails. Two of the events are attributed to software faults where code did the wrong thing. Nine events occurred when system storage resources were exhausted: either the data nodes ran out of available space, or the name system could not create additional names or blocks. User application behavior was implicated in five cases, where either the rate of requests was extraordinary or the required response to a request was exceptionally large. There is still no good protection from the former, but the latter problem has been eliminated. There is no convincing explanation for the remaining four cases. The longest unexcused absence was 3.2 hours.
[Table 1: Loss-of-HDFS-service incidents for the 500 days preceding 5 April 2011, by cluster class (research clusters, production clusters, and totals), with columns for GC incidents, other HDFS down incidents, nodes, blocks (x 10^6), and new files per day (x 10^6). Scheduled maintenance events are not included. Multiple Hadoop clusters have been aggregated into two classes: research clusters, generally available to the engineering community, and production clusters, restricted to approved jobs and serving more time-critical processes.]

Four continuing initiatives have helped to reduce the number of incidents and the duration of incidents that do occur.

1. Operations now have better guidance on what the practical limits are and can better manage the clusters to avoid encountering the hard limits.
2. Better-provisioned servers have been tested with higher practical limits.
3. The time necessary to restart the name system has been reduced. This involved both reducing the minimum time needed to restart the name system and building better tools to validate the integrity of the system before resuming service.
4. The garbage collection parameters for the Java VM have been more carefully tuned.

;login:, February 2012, vol. 37, no. 1
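Tuning the collector for a large NameNode heap typically means pinning heap and young-generation sizes and steering the JVM toward the concurrent collector, so that the promotion failures behind the reported GC incidents become less likely. A hypothetical hadoop-env.sh fragment in that style, with placeholder sizes rather than Yahoo!'s actual settings:

```
# Illustrative JVM options for a large NameNode heap; the sizes and
# thresholds here are placeholders, not production settings.
export HADOOP_NAMENODE_OPTS="-Xms48g -Xmx48g \
  -XX:NewSize=2g -XX:MaxNewSize=2g \
  -XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
  -XX:CMSInitiatingOccupancyFraction=70 \
  -XX:+UseCMSInitiatingOccupancyOnly \
  -XX:+CMSParallelRemarkEnabled"
```

Starting the concurrent cycle well before the old generation fills (the occupancy fraction above) leaves headroom for promotions and so reduces the chance of a long stop-the-world collection.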
Is HDFS Getting Better?

There were seven unexcused absences in the last 100 days of this survey, but only two in the earliest 100 days. But HDFS had over twice as many nodes and clusters under management at the end of the survey as at the beginning, and the introduction of the Security feature means HDFS is more dependent on infrastructure like LDAP and Kerberos servers. Certainly garbage collection is much better, as explained previously. And if resource usage is carefully managed, the likelihood of a surprise is reduced by half.

HDFS can be satisfyingly resistant to multiple insults. In one recent incident, a journal storage volume was lost when the volume filled up under extraordinary client demand, and a rack switch died. HDFS continued service and re-created the missing block replicas, although there was a user complaint that the file system was slow.

Down Incidents             Number  Details
Garbage collection         24      JVM promotion failures interrupted service.
Excused absence            16      Loss of power, hardware failure, misconfiguration, operator error, network faults, and the like; HDFS restarted.
Bugs                       2       Software did the wrong thing; system was restarted without software change.
Space/objects exhausted    9       Insufficient resources for users; sometimes the system continued when resources were freed.
User applications          5       HDFS was unable to service demand effectively, but the system recovered without intervention when excess demand was removed.
Unknown                    4       No convincing explanation; HDFS restarted.

Table 2: Loss-of-service incidents categorized. The apparent cause of each incident was determined by inspection of logs and reports from the Operations team.

An initiative is underway to bring conventional high availability to the HDFS name system. The general idea is that a second host is provisioned and able to execute the name system application with only a modest delay (a few seconds) and without interrupting user applications. This would not remedy all of the incidents that have been described here.
The loss of a critical infrastructure component would not be remediated, and some operational problems of the name system (resource exhaustion, say) would just be transferred to the other host with no benefit. Leaving aside the question of garbage collection events, 7 of the 16 excused absences would be remediated, as would 1 of the 4 unattributed events. Whether to fail over to the second host in case of a garbage collection event is a difficult policy question. The decision rests on whether the transfer really is a low-risk activity, and to what extent the transfer really is transparent to user applications. But if transfer in the case of garbage collection is a Good Thing, the number of incidents will not be reduced, but the duration of the interruption will be bounded by the transfer time.
The larger context for this discussion considers not just adventitious interruptions of the file system, but also the many other causes of service interruptions for a cluster. With our problem child as an example (a large cluster with an especially diverse user community), there were 13 reports of file system interruptions, but a total of 86 cluster service interruptions when scheduled maintenance activities are included along with surprises attributed to the job system and other facilities. Twelve of the maintenance windows were for deployment of Hadoop. If Hadoop, including HDFS, were engineered for continuous service during software updates, many of these interruptions could be eliminated. Also, some of the maintenance windows were scheduled to perform administrative functions (e.g., changing the VM size) that could have been performed with less interruption if a high-availability solution had already been at hand.

A Survey of HDFS Data Durability

Data stored in HDFS may become unavailable either because it is not possible to access the bytes on disk that are the data, or because the bytes have inadvertently been changed so that the bytes accessed are not the data originally stored. Since data storage in HDFS is organized as files composed of a sequence of large blocks (256 megabytes is typical), missing data is manifest as lost blocks. As the typical file has one or two blocks, the loss of a block is the same as the loss of a file for present purposes.

The failure of a data server does not result in the loss of blocks, as each block has replicas at other nodes. The usual case is that there are two other replicas. But if the application program requested exactly one replica, then loss of the node hosting the single replica must necessarily result in the loss of the block. The failure of a rack switch does not result in the loss of blocks, as each block has a replica at some other rack.
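The rack-aware placement that underlies this guarantee can be sketched as follows. This is a simplification of HDFS's default policy, described in the MSST 2010 paper cited in Resources: the first replica goes on the writer's node, and the other two on distinct nodes of a single other rack.

```python
import random

def place_replicas(writer_node, nodes_by_rack, rng=random.Random(0)):
    """Sketch of HDFS's default three-replica placement: one replica on
    the writer's node, two more on distinct nodes of one other rack."""
    writer_rack = next(r for r, ns in nodes_by_rack.items()
                       if writer_node in ns)
    remote_rack = rng.choice([r for r in nodes_by_rack if r != writer_rack])
    second, third = rng.sample(nodes_by_rack[remote_rack], 2)
    return [writer_node, second, third]

cluster = {"rack-a": ["a1", "a2", "a3"],
           "rack-b": ["b1", "b2", "b3"],
           "rack-c": ["c1", "c2", "c3"]}
replicas = place_replicas("a1", cluster)
```

Because the three replicas always span exactly two racks, no single rack switch failure can make a block unavailable, while two of the three replicas still share a rack and so keep the cross-rack write traffic modest.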
The core switches connecting racks are provisioned so that component failures do not isolate a slice of nodes across racks or among multiple racks. Since each node hosts about 50,000 block replicas, it is a near certainty that the loss of a slice of the cluster or of multiple racks will cause data to be unavailable. Fortunately, in the case of switch failures, the switch will be repaired and the lost block replicas will be available again.

When a node fails, new replicas are created for blocks that had a replica hosted on the failed node. All is well if the re-replication process keeps up with the occasional failure of a node. (One percent of nodes fail each month.) Of course, it might just happen that a number of nodes fail during a short interval. How likely is the loss of blocks due to uncorrelated failures of multiple nodes? A statistical model (Table 3) suggests that it is improbable that any cluster has observed the loss of a block with three replicas due to uncorrelated failure of nodes. Nor is there any record of this having happened. There can be correlated failures of nodes. Experience suggests that several nodes of a cluster will not survive a power-on restart. In that case a few (~10) blocks with three replicas will be lost, consistent with the statistical model for simultaneous failures.
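The flavor of such a model can be sketched with a simple combinatorial estimate, under the simplifying (and unrealistic) assumption that replicas are placed uniformly at random rather than rack-aware:

```python
from math import comb

def expected_lost_blocks(nodes, failed, blocks, replication=3):
    """Expected number of blocks whose every replica sits on a failed
    node, assuming replicas are spread uniformly at random over nodes."""
    # A given block loses all replicas iff its `replication` hosts are
    # all among the `failed` nodes.
    return blocks * comb(failed, replication) / comb(nodes, replication)

# Roughly the article's scale: 4,000 nodes and ~50,000 replicas per
# node gives about 4000 * 50000 / 3 distinct three-replica blocks.
three_replica_blocks = 4000 * 50_000 // 3
loss = expected_lost_blocks(4000, 20, three_replica_blocks)
```

With a couple dozen simultaneous failures on a 4,000-node cluster this estimate predicts on the order of ten lost blocks, consistent with the power-on-restart experience reported above; with only a handful of failures it predicts essentially none.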
[Table 3: Probability model for predicting data loss due to uncorrelated failure of nodes, giving the probability of losing a block on a large (4,000-node) cluster in the next 24 hours and in the next 365 days. The details of the model are in the HDFS issue tracking system; a link can be found in the Resources section of this article.]

The remaining opportunity for losing blocks is when HDFS just does the wrong thing. HDFS has certainly matured in the past couple of years, from when one bad circumstance might cause a block to be lost to where at least a couple of bad circumstances must occur before the loss of a block is threatened. Two aspects of the HDFS architecture are especially pertinent. First, the strategy of replicating blocks is critically dependent on the correctness of the replication monitor. Second, if a client application begins writing a file but later abandons it, the writer's lease for the file must be recovered. In the former case, the number of different circumstances that might be the proximate cause for needing a new replica is large, and so making the correct diagnosis as to which replica ought to be replaced requires difficult logic. In the latter case, the intervening hour presents a long interval when various misfortunes can happen, again complicating the logic.

One type of catastrophic failure requires special mention. If the file system metadata were ever lost, all data in the entire cluster would be compromised. Briefly, the metadata is actively managed by two hosts, the data is written to multiple volumes on different hosts, the integrity is periodically tested, and snapshots are retained for long periods. There is no record of having lost any piece of system metadata in the past two years. If the name system server fails and cannot be immediately restarted, a new host can begin service using the saved metadata. The metadata is managed with a write-ahead log, so name system server failure will not cause loss of even the most recent metadata records.
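The write-ahead discipline is simple to state: journal each mutation durably before applying it in memory, and replay the journal on restart. A minimal sketch of the idea (not the NameNode's actual edit-log format):

```python
import os
import tempfile

class TinyWAL:
    """Minimal write-ahead-log sketch: each mutation is appended and
    fsync'd to the journal before it is applied in memory, so a crash
    can lose at most work that was never acknowledged."""

    def __init__(self, path):
        self.path = path
        self.names = set()
        if os.path.exists(path):          # replay the journal on restart
            with open(path) as f:
                for line in f:
                    self._apply(line.strip())

    def _apply(self, record):
        op, name = record.split(" ", 1)
        (self.names.add if op == "CREATE" else self.names.discard)(name)

    def mutate(self, record):
        with open(self.path, "a") as f:   # journal first...
            f.write(record + "\n")
            f.flush()
            os.fsync(f.fileno())
        self._apply(record)               # ...then apply in memory

path = os.path.join(tempfile.mkdtemp(), "edits.log")
wal = TinyWAL(path)
wal.mutate("CREATE /user/alice/part-0")
wal.mutate("CREATE /user/alice/part-1")
wal.mutate("DELETE /user/alice/part-0")
recovered = TinyWAL(path)                 # simulate a restart
```

Because the journal is written to multiple volumes on different hosts in the real system, any surviving copy suffices to rebuild the name space up to the last acknowledged record.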
A Review of 2009

Table 4 summarizes the record of lost blocks for the clusters managed by the Grid Operations team during 2009. This data has the weakness that reporting is certainly incomplete. With that in mind, the total number of lost blocks in 2009 is likely to be less than 1.5 times the number reported here. To provide some perspective, the table indicates an estimate of the number of blocks managed by the clusters and the number of new files created each day (recall that a typical file is one or two blocks).

[Table 4: Reported loss-of-data events for 2009 (as of October 2009), by cluster class (research clusters, production clusters, and totals), with columns for blocks reported lost, nodes, blocks (x 10^6), and new files per day (x 10^6).]

In one curious case, an unexplained fault changed a host directory into a regular file. The greater number of nodes in Table 1 is due to the increased usage of Hadoop resources during 2010 and early 2011.
In one of the research clusters, for a short period (a few days), client applications were abandoning an extraordinary number of files. An error in recovering the lease for a small fraction (10^-4) of them resulted in losing the block being written. This fault accounted for 533 of the total number of lost blocks. On another cluster, a single user experimented with setting the replication factor for many files to one. He lost his gamble, and 91 blocks, when a node failed. On other clusters, two blocks were lost when a data node reported replicas as too big; four blocks were lost when nodes failed during a cluster cold start; four blocks were lost due to a faulty write pipeline recovery; an unreported (but tiny) number of blocks were lost when nodes returned to service with corrupt replicas. For 2009, maybe two dozen blocks were lost due to all other software faults.

A Review of 2010

It was not possible to repeat this analysis for 2010 (well, not impossible, but considerably less convenient), because the apparent loss of blocks became dominated by a new operating procedure. There is never too much space, and when space is short, the third replica of a block looks to be an excess of caution. Blocks with two replicas are, in fact, pretty durable. Furthermore, in our practice it is not uncommon for a file in one cluster to already have a copy on another cluster. If each of those files has only two replicas at each of two clusters, space is saved while durability is enhanced. Therefore we moved to a practice whereby many files have replication set to two.

Any two nodes are likely to each have a replica of some block. When many blocks have only two replicas, if any two nodes fail to start, some blocks (in proportion to the fraction of blocks with replication two) will be lost. In this context, it is important to consider two sources of correlated failures of data nodes.
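The effect of losing any pair of nodes can be quantified with a uniform-placement simplification; the cluster and block counts below are assumptions for illustration, not measurements from the article:

```python
from math import comb

def lost_two_replica_blocks(nodes, failed_nodes, two_replica_blocks):
    """Expected losses among blocks kept with replication factor two
    when `failed_nodes` hosts miss a restart, assuming replica pairs
    are spread uniformly over all node pairs: losses scale with the
    number of failed-node pairs, comb(failed_nodes, 2)."""
    return two_replica_blocks * comb(failed_nodes, 2) / comb(nodes, 2)

# Assumed illustrative numbers: a 4,000-node cluster in which
# 30 million blocks carry only two replicas.
estimate = lost_two_replica_blocks(4000, 30, 30_000_000)
```

A few dozen nodes failing to rejoin after a restart thus predicts on the order of thousands of missing two-replica blocks, which is the phenomenon described next.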
A power-on restart of a big cluster begins by switching on each of the cluster hosts, and if the hosts successfully start the operating system, the administrator will issue commands to begin execution of the HDFS Data Node and Name Node applications. In practice, a few dozen nodes will not immediately restart. Also, consider a circumstance where a host can continue service but cannot restart the operating system or Data Node application. This might occur if the system disk volume has failed, for instance. In this case even a warm restart of the cluster (without power interruption) will cause the un-restartable nodes to fail together.

When many blocks have only two replicas, if several nodes fail to join the cluster after a cluster start, the number of lost blocks will be in proportion to the number of pairs of failed nodes. The result may be thousands of missing blocks. It is probable that those blocks can be recovered by copying from another cluster or by persuading the failed nodes to join the cluster, but that effort requires a day or more of work by an administrator. We are investigating automated recovery! This process dominates reports of lost blocks; no one pays attention to other possible causes anymore. For over a year, no one has done a block-by-block analysis of missing blocks as was done for 2009. The new disk fail-in-place feature of HDFS helps to alleviate the problem of un-restartable nodes.

But what of those problems that were previously known causes of missing blocks? All have been fixed in Hadoop. Only one new cause of missing blocks has been discovered since. A data node that is unable to reliably read data from other nodes might attempt to read each replica of a block and report each replica as corrupt. In this case, the block looks to be missing even though all existing replicas are good. It is possible for an administrator to recover from such false reports. HDFS has adopted the policy that a report of a corrupt block is only accepted from a node that has read a good replica.

Know Your Neighbors

Can users do anything to improve the robustness and availability of their own data? There is a wide variance among clusters in the incidence of service or data unavailability. The clusters that host a more experimental user community experience more untoward events. The production clusters with restricted access are more robust than the research clusters available to all engineers. Especially with the introduction of permissions two years ago, HDFS users have substantial protection from the carelessness of others. The most immediate threat to data availability comes from an application job that exhausts some cluster resource. The quota features of HDFS are partial protections. How to manage system resources still better is an area of active investigation. That said, HDFS operated by skilled administrators is a reliable custodian of Yahoo!'s petabytes.

Resources

K. Shvachko, H. Kuang, S. Radia, and R. Chansler, "The Hadoop Distributed File System," 26th IEEE Symposium on Massive Storage Systems and Technologies (MSST 2010), May 2010.

Konstantin V. Shvachko, "HDFS Scalability: The Limits to Growth," ;login:, vol. 35, no. 2, April 2010.

Konstantin V. Shvachko, "Apache Hadoop: The Scalability Update," ;login:, vol. 36, no. 3, June 2011.

Using Ganglia and Nagios for cluster monitoring (IBM developerWorks): l-ganglia-nagios-1/.

"HDFS-2535: A Model for Data Durability," HDFS issue tracking system: https://issues.apache.org/jira/browse/HDFS-2535.
Techniques for implementing & running robust and reliable DB-centric Grid Applications
Techniques for implementing & running robust and reliable DB-centric Grid Applications International Symposium on Grid Computing 2008 11 April 2008 Miguel Anjo, CERN - Physics Databases Outline Robust
International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February-2014 10 ISSN 2278-7763
International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February-2014 10 A Discussion on Testing Hadoop Applications Sevuga Perumal Chidambaram ABSTRACT The purpose of analysing
Exchange Data Protection: To the DAG and Beyond. Whitepaper by Brien Posey
Exchange Data Protection: To the DAG and Beyond Whitepaper by Brien Posey Exchange is Mission Critical Ask a network administrator to name their most mission critical applications and Exchange Server is
Connectivity. Alliance Access 7.0. Database Recovery. Information Paper
Connectivity Alliance Access 7.0 Database Recovery Information Paper Table of Contents Preface... 3 1 Overview... 4 2 Resiliency Concepts... 6 2.1 Database Loss Business Impact... 6 2.2 Database Recovery
CDH AND BUSINESS CONTINUITY:
WHITE PAPER CDH AND BUSINESS CONTINUITY: An overview of the availability, data protection and disaster recovery features in Hadoop Abstract Using the sophisticated built-in capabilities of CDH for tunable
NoSQL and Hadoop Technologies On Oracle Cloud
NoSQL and Hadoop Technologies On Oracle Cloud Vatika Sharma 1, Meenu Dave 2 1 M.Tech. Scholar, Department of CSE, Jagan Nath University, Jaipur, India 2 Assistant Professor, Department of CSE, Jagan Nath
Hadoop & its Usage at Facebook
Hadoop & its Usage at Facebook Dhruba Borthakur Project Lead, Hadoop Distributed File System [email protected] Presented at the Storage Developer Conference, Santa Clara September 15, 2009 Outline Introduction
Hadoop Distributed File System. Jordan Prosch, Matt Kipps
Hadoop Distributed File System Jordan Prosch, Matt Kipps Outline - Background - Architecture - Comments & Suggestions Background What is HDFS? Part of Apache Hadoop - distributed storage What is Hadoop?
High Availability Essentials
High Availability Essentials Introduction Ascent Capture s High Availability Support feature consists of a number of independent components that, when deployed in a highly available computer system, result
BookKeeper. Flavio Junqueira Yahoo! Research, Barcelona. Hadoop in China 2011
BookKeeper Flavio Junqueira Yahoo! Research, Barcelona Hadoop in China 2011 What s BookKeeper? Shared storage for writing fast sequences of byte arrays Data is replicated Writes are striped Many processes
Creating A Highly Available Database Solution
WHITE PAPER Creating A Highly Available Database Solution Advantage Database Server and High Availability TABLE OF CONTENTS 1 Introduction 1 High Availability 2 High Availability Hardware Requirements
Fault Tolerance in Hadoop for Work Migration
1 Fault Tolerance in Hadoop for Work Migration Shivaraman Janakiraman Indiana University Bloomington ABSTRACT Hadoop is a framework that runs applications on large clusters which are built on numerous
An Industrial Perspective on the Hadoop Ecosystem. Eldar Khalilov Pavel Valov
An Industrial Perspective on the Hadoop Ecosystem Eldar Khalilov Pavel Valov agenda 03.12.2015 2 agenda Introduction 03.12.2015 2 agenda Introduction Research goals 03.12.2015 2 agenda Introduction Research
IBM PROTECTIER: FROM BACKUP TO RECOVERY
SOLUTION PROFILE IBM PROTECTIER: FROM BACKUP TO RECOVERY NOVEMBER 2011 When it comes to backup and recovery, backup performance numbers rule the roost. It s understandable really: far more data gets backed
An Oracle White Paper January 2013. A Technical Overview of New Features for Automatic Storage Management in Oracle Database 12c
An Oracle White Paper January 2013 A Technical Overview of New Features for Automatic Storage Management in Oracle Database 12c TABLE OF CONTENTS Introduction 2 ASM Overview 2 Total Storage Management
Beyond Web Application Log Analysis using Apache TM Hadoop. A Whitepaper by Orzota, Inc.
Beyond Web Application Log Analysis using Apache TM Hadoop A Whitepaper by Orzota, Inc. 1 Web Applications As more and more software moves to a Software as a Service (SaaS) model, the web application has
Hadoop. MPDL-Frühstück 9. Dezember 2013 MPDL INTERN
Hadoop MPDL-Frühstück 9. Dezember 2013 MPDL INTERN Understanding Hadoop Understanding Hadoop What's Hadoop about? Apache Hadoop project (started 2008) downloadable open-source software library (current
Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh
1 Hadoop: A Framework for Data- Intensive Distributed Computing CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 2 What is Hadoop? Hadoop is a software framework for distributed processing of large datasets
Connectivity. Alliance Access 7.0. Database Recovery. Information Paper
Connectivity Alliance 7.0 Recovery Information Paper Table of Contents Preface... 3 1 Overview... 4 2 Resiliency Concepts... 6 2.1 Loss Business Impact... 6 2.2 Recovery Tools... 8 3 Manual Recovery Method...
Hadoop Architecture. Part 1
Hadoop Architecture Part 1 Node, Rack and Cluster: A node is simply a computer, typically non-enterprise, commodity hardware for nodes that contain data. Consider we have Node 1.Then we can add more nodes,
CSE-E5430 Scalable Cloud Computing Lecture 2
CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University [email protected] 14.9-2015 1/36 Google MapReduce A scalable batch processing
Disaster Recovery Solutions for Oracle Database Standard Edition RAC. A Dbvisit White Paper
Disaster Recovery Solutions for Oracle Database Standard Edition RAC A Dbvisit White Paper Copyright 2011-2012 Dbvisit Software Limited. All Rights Reserved v2, Mar 2012 Contents Executive Summary... 1
Real-time Protection for Hyper-V
1-888-674-9495 www.doubletake.com Real-time Protection for Hyper-V Real-Time Protection for Hyper-V Computer virtualization has come a long way in a very short time, triggered primarily by the rapid rate
Chapter 7. Using Hadoop Cluster and MapReduce
Chapter 7 Using Hadoop Cluster and MapReduce Modeling and Prototyping of RMS for QoS Oriented Grid Page 152 7. Using Hadoop Cluster and MapReduce for Big Data Problems The size of the databases used in
WHITEPAPER. A Technical Perspective on the Talena Data Availability Management Solution
WHITEPAPER A Technical Perspective on the Talena Data Availability Management Solution BIG DATA TECHNOLOGY LANDSCAPE Over the past decade, the emergence of social media, mobile, and cloud technologies
A SWOT ANALYSIS ON CISCO HIGH AVAILABILITY VIRTUALIZATION CLUSTERS DISASTER RECOVERY PLAN
A SWOT ANALYSIS ON CISCO HIGH AVAILABILITY VIRTUALIZATION CLUSTERS DISASTER RECOVERY PLAN Eman Al-Harbi [email protected] Soha S. Zaghloul [email protected] Faculty of Computer and Information
Testing Big data is one of the biggest
Infosys Labs Briefings VOL 11 NO 1 2013 Big Data: Testing Approach to Overcome Quality Challenges By Mahesh Gudipati, Shanthi Rao, Naju D. Mohan and Naveen Kumar Gajja Validate data quality by employing
MyCloudLab: An Interactive Web-based Management System for Cloud Computing Administration
MyCloudLab: An Interactive Web-based Management System for Cloud Computing Administration Hoi-Wan Chan 1, Min Xu 2, Chung-Pan Tang 1, Patrick P. C. Lee 1 & Tsz-Yeung Wong 1, 1 Department of Computer Science
Software Execution Protection in the Cloud
Software Execution Protection in the Cloud Miguel Correia 1st European Workshop on Dependable Cloud Computing Sibiu, Romania, May 8 th 2012 Motivation clouds fail 2 1 Motivation accidental arbitrary faults
1.1 Difficulty in Fault Localization in Large-Scale Computing Systems
Chapter 1 Introduction System failures have been one of the biggest obstacles in operating today s largescale computing systems. Fault localization, i.e., identifying direct or indirect causes of failures,
The Hadoop Distributed File System
The Hadoop Distributed File System Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler Yahoo! Sunnyvale, California USA {Shv, Hairong, SRadia, Chansler}@Yahoo-Inc.com Abstract The Hadoop
A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM
A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM Sneha D.Borkar 1, Prof.Chaitali S.Surtakar 2 Student of B.E., Information Technology, J.D.I.E.T, [email protected] Assistant Professor, Information
Apache Hadoop FileSystem and its Usage in Facebook
Apache Hadoop FileSystem and its Usage in Facebook Dhruba Borthakur Project Lead, Apache Hadoop Distributed File System [email protected] Presented at Indian Institute of Technology November, 2010 http://www.facebook.com/hadoopfs
R.K.Uskenbayeva 1, А.А. Kuandykov 2, Zh.B.Kalpeyeva 3, D.K.Kozhamzharova 4, N.K.Mukhazhanov 5
Distributed data processing in heterogeneous cloud environments R.K.Uskenbayeva 1, А.А. Kuandykov 2, Zh.B.Kalpeyeva 3, D.K.Kozhamzharova 4, N.K.Mukhazhanov 5 1 [email protected], 2 [email protected],
BigData. An Overview of Several Approaches. David Mera 16/12/2013. Masaryk University Brno, Czech Republic
BigData An Overview of Several Approaches David Mera Masaryk University Brno, Czech Republic 16/12/2013 Table of Contents 1 Introduction 2 Terminology 3 Approaches focused on batch data processing MapReduce-Hadoop
Chapter 1 - Web Server Management and Cluster Topology
Objectives At the end of this chapter, participants will be able to understand: Web server management options provided by Network Deployment Clustered Application Servers Cluster creation and management
Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components
Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components of Hadoop. We will see what types of nodes can exist in a Hadoop
THE WINDOWS AZURE PROGRAMMING MODEL
THE WINDOWS AZURE PROGRAMMING MODEL DAVID CHAPPELL OCTOBER 2010 SPONSORED BY MICROSOFT CORPORATION CONTENTS Why Create a New Programming Model?... 3 The Three Rules of the Windows Azure Programming Model...
Prepared By : Manoj Kumar Joshi & Vikas Sawhney
Prepared By : Manoj Kumar Joshi & Vikas Sawhney General Agenda Introduction to Hadoop Architecture Acknowledgement Thanks to all the authors who left their selfexplanatory images on the internet. Thanks
Hadoop Cluster Applications
Hadoop Overview Data analytics has become a key element of the business decision process over the last decade. Classic reporting on a dataset stored in a database was sufficient until recently, but yesterday
Snapshots in Hadoop Distributed File System
Snapshots in Hadoop Distributed File System Sameer Agarwal UC Berkeley Dhruba Borthakur Facebook Inc. Ion Stoica UC Berkeley Abstract The ability to take snapshots is an essential functionality of any
Hadoop & its Usage at Facebook
Hadoop & its Usage at Facebook Dhruba Borthakur Project Lead, Hadoop Distributed File System [email protected] Presented at the The Israeli Association of Grid Technologies July 15, 2009 Outline Architecture
Energy Efficient MapReduce
Energy Efficient MapReduce Motivation: Energy consumption is an important aspect of datacenters efficiency, the total power consumption in the united states has doubled from 2000 to 2005, representing
EMC DOCUMENTUM xplore 1.1 DISASTER RECOVERY USING EMC NETWORKER
White Paper EMC DOCUMENTUM xplore 1.1 DISASTER RECOVERY USING EMC NETWORKER Abstract The objective of this white paper is to describe the architecture of and procedure for configuring EMC Documentum xplore
Running a Workflow on a PowerCenter Grid
Running a Workflow on a PowerCenter Grid 2010-2014 Informatica Corporation. No part of this document may be reproduced or transmitted in any form, by any means (electronic, photocopying, recording or otherwise)
Cloud Storage. Parallels. Performance Benchmark Results. White Paper. www.parallels.com
Parallels Cloud Storage White Paper Performance Benchmark Results www.parallels.com Table of Contents Executive Summary... 3 Architecture Overview... 3 Key Features... 4 No Special Hardware Requirements...
