Declustered RAID & HDFS Replicated Systems
- Lewis Hodges
1 Comparison of Data Durability in Parity Declustered RAID & HDFS Replicated Systems
& Dimitar Vlassarev
Cloud Modeling & Data Analytics, Seagate Technology
RMACC 4th High Performance Computing Symposium, Boulder, CO, August 12-13, 2014
2 Introduction
- HPC simulations generate a lot of data that are amenable to analysis in the Hadoop ecosystem
- This requires data migration from the HPC file system (FS) to HDFS (Hadoop File System)
- Can data migration be avoided and Hadoop run on the HPC FS?
- Meet Lustre, a high-performance parallel file system that is the mainstay of many HPC systems
- Hadoop on the Lustre FS should make it possible to avoid data migration
- Main question of interest: how do the performance and data durability of Lustre compare with those of HDFS for Hadoop applications?
- This talk focuses on data durability
3 HDFS Replication and Re-Replication
- HDFS breaks each file into blocks of a certain size (128MB current default, earlier 64MB) and stores replicas (three by default) of each block on different nodes
- Two replicas are stored in the same rack (using rack-awareness) to enhance read performance; the remaining replica helps ensure availability
- If a node is down and doesn't send a heartbeat to the NameNode, re-replication of the lost blocks is triggered and copies are made on any of the remaining DataNodes (a parallel rebuild that scales with increasing cluster size)
Pictures courtesy of HDFS Apache & bradhedlund.com
4 Parity Declustered (PD) RAID with Spare Blocks for Lustre

RAID-5
Disk 1   Disk 2   Disk 3   Disk 4   Disk 5
D0.0     D0.1     D0.2     P0       S0
D1.0     D1.1     P1       D1.2     S1
D2.0     P2       D2.1     D2.2     S2
P3       D3.0     D3.1     D3.2     S3

Parity Declustered RAID-5
Disk 1   Disk 2   Disk 3   Disk 4   Disk 5
D0.0     D0.1     D0.2     P0       S0
D1.0     D1.1     P1       S1       D1.2
D2.0     P2       S2       D2.2     D2.1
P3       S3       D3.1     D3.2     D3.0

- Involves stretching the record size across many drives while permuting the blocks of the original RAID configuration (N+M), where N is the number of data blocks and M is the number of parity blocks
- Failure of a drive leads to an independent rebuild of individual blocks instead of all blocks in a device as in RAID; the rebuild therefore extracts partial data from all remaining drives and writes to dispersed spare blocks, decreasing rebuild time drastically (parallel rebuild)
- The layouts above show examples for RAID-5 and Parity Declustered RAID-5 (3+1+1), with spare blocks S_i spread out among all disks
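5 Markov Model Framework
[Figure: Markov chain with states Healthy, -1 HDD, -2 HDDs, -3 HDDs and -4 HDDs, plus Data Loss. Failure transitions are labeled Tλ, (T-1)λ, (T-2)λ and (T-3)λ; repair transitions are labeled µ(1-D_1), µ(1-D_2), µ(1-D_3) and µ(1-D_4); the remaining transitions are labeled µD_0, µD_1, µD_2, µD_3 and µD_4.]
Definitions
- λ: HDD failure rate (hrs^-1)
- µ: HDD repair rate (hrs^-1)
- T: number of HDDs
- D_i: probability of repair failure
* Repair of hard drives occurs simultaneously in parallel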
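A chain of this shape can be evaluated numerically. The sketch below, in plain Python, computes the mean time to data loss (MTTDL) starting from the Healthy state. Several details are assumptions rather than statements from the slide: a successful repair is taken to return the system to Healthy, one failure beyond the deepest modeled state is treated as data loss, and the names `solve_linear` and `mttdl` are hypothetical.

```python
def solve_linear(a, b):
    """Solve a*x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def mttdl(total_hdds, lam, mu, d):
    """Mean time (hours) from Healthy to data loss.

    d[i - 1] is D_i, the repair-failure probability with i failed drives.
    """
    k = len(d)  # deepest modeled failure state
    a = [[0.0] * (k + 1) for _ in range(k + 1)]
    for i in range(k + 1):
        fail = (total_hdds - i) * lam
        a[i][i] += fail
        if i < k:
            a[i][i + 1] -= fail  # one more drive fails
        # assumption: a failure in state k goes straight to data loss
        if i >= 1:
            a[i][i] += mu  # total repair rate (success plus failure)
            a[i][0] -= mu * (1.0 - d[i - 1])  # assumption: back to Healthy
    # Expected absorption times t satisfy a * t = 1 for the transient states.
    return solve_linear(a, [1.0] * (k + 1))[0]
```

If the time to data loss is treated as roughly exponential, the first-year durability can be approximated as exp(-8760 / MTTDL); that approximation is also an assumption of this sketch, not something the slide states.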
6 Repair Failure Probability for Parity Declustered (PD) RAID

D(N,M,j), the probability of data loss for a particular block when j drives fail for an underlying RAID (N+M) with T total drives in the PD RAID set, is:

$$D(N,M,j) = \sum_{k=0}^{M} D_p(k) \cdot D_{br}^{\,M-k}$$

where D_p(k), the probability of losing a block due to placement, is given by:

= + = k i p T i M N j T i j k D 0 1 (

and D_br, the probability of losing data due to bit-rot, is given by:

$$D_{br} = 1 - (1 - \mathrm{UER})^{\text{block\_size (in bits)}}$$

D_j, the probability of data loss (at each PD-RAID level) due to j failed drives, is:

= + i M N T 0 ( _per_grid num_blocks,, ( 1 1 j M N D j D =

- Random placement of blocks assumed here
- All blocks of uniform size
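The bit-rot term, one minus (1 - UER) raised to the block size in bits, is easy to get wrong numerically because UER is tiny. The sketch below evaluates it with `math.log1p` and `math.expm1` to avoid rounding to zero. The function name is hypothetical, and reading "128MB" as 128 * 1024 * 1024 bytes is an assumption.

```python
import math


def bit_rot_probability(uer, block_size_bytes):
    """D_br = 1 - (1 - UER)^(block size in bits), computed stably."""
    bits = block_size_bytes * 8
    # (1 - uer)^bits == exp(bits * log1p(-uer)); expm1 keeps small results exact
    return -math.expm1(bits * math.log1p(-uer))


# UER = 1e-15 and a 128MB block, the parameter choices listed with the results
d_br = bit_rot_probability(1e-15, 128 * 1024 * 1024)
print(f"D_br for one block: {d_br:.3e}")
```

For such small values the result is close to UER multiplied by the number of bits, which gives a quick sanity check.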
7 Rebuild Time for Parity Declustered (PD) RAID
- When a drive fails, all remaining drives are used to rebuild the blocks of data (nblocks) that were on the failed drive
- This involves three operations:
  - Reading (N * nblocks) blocks from the remaining hard drives
  - Reconstructing the lost nblocks
  - Writing the nblocks to all the remaining hard drives
- So time to repair = T_repair = max(T_read, T_reconstruct, T_write)
  - T_read = (N * nblocks * block_size) / (read_speed * remaining_hdds)
  - T_write = (nblocks * block_size) / (write_speed * remaining_hdds)
  - T_reconstruct needs to be modeled (set to zero here)
- read_speed and write_speed are the speeds of reading from and writing to a disk (assumed constant here)
- So, repair rate = µ = 1 / T_repair
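The rebuild-time expressions above translate directly into a few lines of Python. This is a minimal sketch: the function name is hypothetical, decimal units (1 TB = 1e12 bytes, 1 MB = 1e6 bytes) are an assumption, and the example plugs in the disk size, block size and speed from the results slide for an 82-drive (8+2) set.

```python
def pd_raid_repair_time(n_data, nblocks, block_size, read_speed,
                        write_speed, remaining_hdds, t_reconstruct=0.0):
    """T_repair = max(T_read, T_reconstruct, T_write), in seconds."""
    t_read = n_data * nblocks * block_size / (read_speed * remaining_hdds)
    t_write = nblocks * block_size / (write_speed * remaining_hdds)
    return max(t_read, t_reconstruct, t_write)


disk_size = 4e12          # 4TB drive, in bytes (decimal units assumed)
block_size = 128e6        # 128MB block, in bytes
nblocks = disk_size / block_size
t_repair = pd_raid_repair_time(n_data=8, nblocks=nblocks,
                               block_size=block_size,
                               read_speed=100e6, write_speed=100e6,
                               remaining_hdds=81)
mu_per_hour = 3600.0 / t_repair   # repair rate in hrs^-1
print(f"T_repair = {t_repair:.0f} s, repair rate = {mu_per_hour:.3f} per hour")
```

With equal read and write speeds the read term dominates, because N blocks must be read for every block that is rewritten.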
8 Repair Failure Probability for HDFS

D(N,j), the probability of data loss for a particular block when j drives fail, with T total drives and an N-replication strategy in HDFS, is:

$$D(N,j) = \sum_{k=0}^{N} D_p(k) \cdot D_{br}^{\,N-k}$$

where D_p(k), the probability of losing a block due to placement, is given by:

$$D_p(k) = \binom{j}{k} \Big/ \binom{T}{k}$$

and D_br, the probability of losing data due to bit-rot, is given by:

$$D_{br} = 1 - (1 - \mathrm{UER})^{\text{block\_size (in bits)}}$$

D_j, the probability of data loss due to j failed drives, is:

$$D_j = 1 - \bigl(1 - D(N,j)\bigr)^{\text{total\_num\_blocks}}$$
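The HDFS expressions on this slide can be sketched in Python as follows. The formulas are written the way they appear to read here: placement probability as C(j, k) / C(T, k), block loss as a sum over k from 0 to N of the placement term times D_br to the power N - k, and system loss as one minus (1 - D(N, j)) to the power total_num_blocks. That reading, and all function names, are assumptions of this sketch.

```python
import math


def placement_loss(j, k, total_hdds):
    """D_p(k) = C(j, k) / C(T, k); zero when k exceeds j."""
    return math.comb(j, k) / math.comb(total_hdds, k)


def block_loss(n_replicas, j, total_hdds, d_br):
    """D(N, j) = sum over k = 0..N of D_p(k) * D_br^(N - k)."""
    return sum(placement_loss(j, k, total_hdds) * d_br ** (n_replicas - k)
               for k in range(n_replicas + 1))


def system_loss(n_replicas, j, total_hdds, d_br, total_num_blocks):
    """D_j = 1 - (1 - D(N, j))^total_num_blocks, computed stably."""
    d = block_loss(n_replicas, j, total_hdds, d_br)
    return -math.expm1(total_num_blocks * math.log1p(-d))
```

For example, `block_loss(3, 3, 82, 0.0)` reduces to 1 / C(82, 3): with no bit-rot, a block is lost only if all three replicas sit on the three failed drives.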
9 Rebuild Time for HDFS

Repair Speed = (2/3) * (2 racks * n_DNs * min(nwbw_DN-TOR * 0.93, (1/2) * HDDspeed * n_HDD/DN)) + (1/3) * (n_Racks * min(nwbw_TOR-TOR * 0.93, (1/2) * HDDspeed * n_HDD/RACK))

Assumptions for Repair Speed Calculations:
- 2/3 of the blocks in a drive have one copy in the same rack and one in another rack
- 1/3 of the blocks in a drive have both remaining copies in another rack
- This analysis is limited to the number of drives of the range of and beyond that durability will be affected
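The repair-speed expression can be written as a small Python function. This is a sketch under assumptions: the grouping of terms follows one plausible reading of the slide, the parameter names are hypothetical, the 0.93 factor is copied from the slide without interpretation, and the example values are made up purely to show the arithmetic.

```python
def hdfs_repair_speed(n_dns, n_racks, nwbw_dn_tor, nwbw_tor_tor,
                      hdd_speed, n_hdd_per_dn, n_hdd_per_rack):
    """Aggregate HDFS repair speed, in the same units as the inputs."""
    # 2/3 of the blocks: limited by DataNode-to-TOR bandwidth or local disks
    same_rack = 2 * n_dns * min(nwbw_dn_tor * 0.93,
                                0.5 * hdd_speed * n_hdd_per_dn)
    # 1/3 of the blocks: limited by TOR-to-TOR bandwidth or the rack's disks
    cross_rack = n_racks * min(nwbw_tor_tor * 0.93,
                               0.5 * hdd_speed * n_hdd_per_rack)
    return (2 / 3) * same_rack + (1 / 3) * cross_rack


# Hypothetical cluster, all speeds in MB/s
speed = hdfs_repair_speed(n_dns=10, n_racks=4,
                          nwbw_dn_tor=1000, nwbw_tor_tor=4000,
                          hdd_speed=100, n_hdd_per_dn=12,
                          n_hdd_per_rack=120)
print(f"repair speed: {speed:.0f} MB/s")
```

The `min` terms capture that each path is limited by whichever is slower, the network link or the disks behind it.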
10 Results: 1st Year System Level Durability Table

No. of Drives / Capacity  RAID6 (8+2)   RAID6 (4+2)   PD-RAID (8+2)  HDFS (3 Replica)
Storage overhead          25%           50%           25%            200%
82 (~0.3PB)               4 nines       5 nines       10 nines       7 nines
  usable storage          ~0.2PB        ~0.15PB       ~0.2PB         ~0.1PB
492 (~1.9PB)              3 nines       4 nines       9 nines        6 nines
  usable storage          ~1.4PB        ~0.9PB        ~1.4PB         ~0.6PB
1386 (~5.5PB)             3 nines       3 nines       8 nines        6 nines
  usable storage          ~4.1PB        ~2.7PB        ~4.1PB         ~1.8PB
(~277PB)                  1 nine        2 nines       7 nines        6 nines
  usable storage          ~207PB        ~138PB        ~207PB         ~92PB

- System level durability means the probability of not losing any data in 1 year
- Choices of parameters: MTBF = 1.4 Million, Speed = 100MB/sec, UER = 1e-15, Disk Size = 4TB, Block Size = 128MB
- PD-RAID delivers better or comparable system reliability in comparison to HDFS, but HDFS would beat PD-RAID at scales above several hundred PB
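Two of the table's quantities follow from simple arithmetic, sketched here in Python. Storage overhead is redundancy divided by data, and "nines" is read as the number of leading nines in the durability, computed from the loss probability. Both function names are hypothetical, and treating 3-way replication as one data unit plus two redundant copies is an assumption that reproduces the 200% figure.

```python
import math


def storage_overhead_pct(data_units, redundancy_units):
    """Redundant capacity as a percentage of data capacity."""
    return 100.0 * redundancy_units / data_units


def nines(loss_probability):
    """Number of leading nines in the durability 1 - loss_probability."""
    return math.floor(-math.log10(loss_probability))


for name, n, m in [("RAID6 (8+2)", 8, 2), ("RAID6 (4+2)", 4, 2),
                   ("PD-RAID (8+2)", 8, 2), ("HDFS (3 Replica)", 1, 2)]:
    print(f"{name}: {storage_overhead_pct(n, m):.0f}% overhead")
```

The ratio of the HDFS overhead to the (8+2) overhead is 200 / 25 = 8, which matches the ~8x reduction quoted in the conclusions.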
11 Conclusions
- Parity-declustered RAID helps achieve HDFS-level data durability
- At scales above several hundred PB, HDFS durability thrives, but the smaller scales are the ones relevant for almost all current HPC systems
- Switching from RAID6 (8+2) to RAID6 (4+2) could provide a marginal increase in durability at the cost of 25 percentage points more overhead (still better than the HDFS overhead)
- Bottom line: Parity Declustered RAID is a must to make Lustre attractive for Hadoop applications
- It also enables a storage overhead reduction by a factor of ~8x
THANK YOU!!
12 Synopsis
Comparison of Data Durability in Parity Declustered RAID and HDFS Replicated Systems
& Dimitar Vlassarev
Cloud Modeling and Data Analytics, Seagate Technology, 389 Disc Dr., Longmont, CO

Data stored on large-scale high-performance computing systems is in many situations suitable for analysis within the Hadoop ecosystem. To carry out that analysis efficiently, it is essential to avoid migrating data from RAID-backed storage systems to the replicated Hadoop File System (HDFS). For these applications, high-performance parallel file systems like Lustre can offer an appealing alternative to HDFS. Two important considerations in comparing the two systems are their performance and data durability. Here we compare the data durability of HDFS's replication strategy to that of a Parity Declustered RAID backed Lustre system. A continuous-time Markov chain model for the data durability of the two systems suggests that the Parity Declustered RAID backed Lustre solution can be as resilient as the replicated HDFS solution.

QUESTIONS??
