BlobSeer: Towards efficient data storage management on large-scale, distributed systems
|
|
- Logan Hodge
- 8 years ago
- Views:
Transcription
1 : Towards efficient data storage management on large-scale, distributed systems Bogdan Nicolae University of Rennes 1, France KerData Team, INRIA Rennes Bretagne-Atlantique PhD Advisors: Gabriel Antoniu and Luc Bougé December 1, 2010 Bogdan Nicolae : Efficient data storage on distributed systems 1/ 45
2 Outline Context Related work and its limitations Contribution: Principles High level description Zoom on metadata management Synthetic benchmarks as a storage backend for Hadoop MapReduce providing virtual machine image storage for clouds as a QoS enabled storage service for applications running on the cloud Bogdan Nicolae : Efficient data storage on distributed systems 2/ 45
3 Information overload Large-scale computing infrastructure Data storage at large scale We live in exponential times... Every two years the amount of information doubles Making something useful out of it becomes incresingly difficult We depend more and more on large-scale computing infrastructure Bogdan Nicolae : Efficient data storage on distributed systems 3/ 45
4 Information overload Large-scale computing infrastructure Data storage at large scale Dealing with information overload: Enterprise datacenters Tens of thousands of machines in huge clusters Leveraged directly by the owner Commodity hardware: minimizes per unit cost Easy to add, upgrade and replace Data-intensive applications Bogdan Nicolae : Efficient data storage on distributed systems 4/ 45
5 Information overload Large-scale computing infrastructure Data storage at large scale Dealing with information overload: Clouds Computing as utility rather than capital investment Driven by pay-as-you-go model Several levels of abstraction: IaaS, PaaS, SaaS Several advantages: low entry cost, elasticity, rapid development Bogdan Nicolae : Efficient data storage on distributed systems 5/ 45
6 Context Information overload Large-scale computing infrastructure Data storage at large scale Dealing with the information overload: HPC infrastructures Complex scientific and engineering applications High-end hardware Manipulate information at petabyte-scale and beyond Bogdan Nicolae : Efficient data storage on distributed systems 6/ 45
7 Information overload Large-scale computing infrastructure Data storage at large scale Data storage and management is a key issue Requirements Easy manipulation of data High access throughput Scalability Data Metadata Bogdan Nicolae : Efficient data storage on distributed systems 7/ 45
8 Information overload Large-scale computing infrastructure Data storage at large scale Current approaches: Parallel file systems Mostly used in HPC infrastructures POSIX access interface Data striping Advanced caching Pros Distributed data MPI optimizations Cons Locking-based Too many small files Bogdan Nicolae : Efficient data storage on distributed systems 8/ 45
9 Information overload Large-scale computing infrastructure Data storage at large scale Current approaches: Data-intensive oriented file systems Huge files Writes at random offsets are seldom Files grow by atomic appends Fine grain concurrent reads Pros No locking Data location aware Cons Centralized metadata Expensive updates Bogdan Nicolae : Efficient data storage on distributed systems 9/ 45
10 Information overload Large-scale computing infrastructure Data storage at large scale Current approaches: Cloud data storage services Virtualize storage resources Pay for duration, size and traffic Flat naming scheme Simple access model Pros High data availability Versioning Cons Limited object size Limited concurrency control Bogdan Nicolae : Efficient data storage on distributed systems 10/ 4
11 Information overload Large-scale computing infrastructure Data storage at large scale Limitations of existing approaches Issue Parallel FS Data-intensive FS Cloud store Too many small files Centralized metadata No versioning support No fine grain writes Addressed Addressed Addressed Addressed Addressed Bogdan Nicolae : Efficient data storage on distributed systems 11/ 4
12 Information overload Large-scale computing infrastructure Data storage at large scale Limitations of existing approaches Issue Parallel FS Data-intensive FS Cloud store??? Too many small files Centralized metadata No versioning support No fine grain writes Addressed Addressed Addressed Addressed Addressed Addressed Addressed Addressed Addressed Bogdan Nicolae : Efficient data storage on distributed systems 12/ 4
13 Principles High level description Metadata management Design issues Synthetic benchmarks Contribution: Data Sharing at Large Scale Bogdan Nicolae : Efficient data storage on distributed systems 13/ 4
14 Principles Context Principles High level description Metadata management Design issues Synthetic benchmarks BLOBs Eliminate need to keep many small files Provide fine-grain R/W access Data striping Distributes I/O workload Enables user to configure distribution strategy Decentralized metadata Distributes metadata at fine granularity Brings scalability and high avalability Versioning is a key design principle Bogdan Nicolae : Efficient data storage on distributed systems 14/ 4
15 Versioning as a key principle Principles High level description Metadata management Design issues Synthetic benchmarks Clients mutate BLOBs by submitting diffs A BLOB is never overwritten: a new snapshot is generated Only diffs are stored Clients see whole, fully independent snapshots Fine-grain read access to any past snapshot is possible Bogdan Nicolae : Efficient data storage on distributed systems 15/ 4
16 Principles High level description Metadata management Design issues Synthetic benchmarks Contribution: a versioning-oriented access interface id = CREATE() v = APPEND(id, size, buffer) v = WRITE(id, offset, size, buffer) (v, size) = GET RECENT(id) READ(id, v, offset, size, buffer) new id = CLONE(id, v) dv = MERGE(sid, sv, soffset, ssize, did, doffset) Bogdan Nicolae : Efficient data storage on distributed systems 16/ 4
17 Consistency semantics Context Principles High level description Metadata management Design issues Synthetic benchmarks Writes: are atomic and totally ordered No guarantee when they become visible to clients Finish before becoming visible to clients Reads: require a version explicitly GET RECENT does not guarantee latest version Can read any version older than GET RECENT Read exposed writes only Bogdan Nicolae : Efficient data storage on distributed systems 17/ 4
18 Advantages of this proposal Principles High level description Metadata management Design issues Synthetic benchmarks Access to historic data Revert to previous snapshots Track changes to data Exploit data parallelism better Avoid synchronization to achieve better throughput Build complex workflows Bogdan Nicolae : Efficient data storage on distributed systems 18/ 4
19 Architecture Context Principles High level description Metadata management Design issues Synthetic benchmarks Data providers Metadata providers Provider manager Allocation strategy Version manager Guarantees total ordering and atomicity Bogdan Nicolae : Efficient data storage on distributed systems 19/ 4
20 How does a read work? Context Principles High level description Metadata management Design issues Synthetic benchmarks 1 Select a version (optionally ask version manager for the most recently exposed version) 2 Fetch the corresponding metadata from the metadata providers 3 Contact providers in parallel and fetch the chunks into the local buffer Bogdan Nicolae : Efficient data storage on distributed systems 20/ 4
21 How does a write work? Context Principles High level description Metadata management Design issues Synthetic benchmarks 1 Get a list of providers, one for each chunk 2 Contact providers in parallel and write the chunks 3 Get a version number for the update 4 Add new metadata to consolidate the new version 5 Report the new version is ready Version manager will eventually expose it Bogdan Nicolae : Efficient data storage on distributed systems 21/ 4
22 Principles High level description Metadata management Design issues Synthetic benchmarks How to guarantee total ordering and atomicity? Clients ask for a version number Version manager assigns version numbers Clients write metadata concurrently Clients confirm completion Version manager exposes versions in order of assignment Bogdan Nicolae : Efficient data storage on distributed systems 22/ 4
23 Principles High level description Metadata management Design issues Synthetic benchmarks Zoom on metadata management: Motivation Present fully independent snapshots in spite of writing only diffs Access performance should not degrade with increasing number of diffs Proposed so far: B-Trees, Shadowing Difficult to maintain in a distributed fashion Expensive synchronization for concurrent updates Our goals: Easy to manage in a distributed fashion Efficient concurrent updates Bogdan Nicolae : Efficient data storage on distributed systems 23/ 4
24 Principles High level description Metadata management Design issues Synthetic benchmarks Contribution: Versioning over Distributed Segment Trees Deals with distributing the metadata while avoiding expensive management Binary tree is associated to each BLOB snapshot Reads descend towards leaves, writes build new trees bottom-up Key ideas Keep metadata immutable Share whole sub-trees Bogdan Nicolae : Efficient data storage on distributed systems 24/ 4
25 Principles High level description Metadata management Design issues Synthetic benchmarks Contribution: Versioning over Distributed Segment Trees Deals with distributing the metadata while avoiding expensive management Binary tree is associated to each BLOB snapshot Reads descend towards leaves, writes build new trees bottom-up Key ideas Keep metadata immutable Share whole sub-trees Bogdan Nicolae : Efficient data storage on distributed systems 24/ 4
26 Principles High level description Metadata management Design issues Synthetic benchmarks Contribution: Versioning over Distributed Segment Trees Deals with distributing the metadata while avoiding expensive management Binary tree is associated to each BLOB snapshot Reads descend towards leaves, writes build new trees bottom-up Key ideas Keep metadata immutable Share whole sub-trees Bogdan Nicolae : Efficient data storage on distributed systems 24/ 4
27 Principles High level description Metadata management Design issues Synthetic benchmarks Contribution: Versioning over Distributed Segment Trees Deals with distributing the metadata while avoiding expensive management Binary tree is associated to each BLOB snapshot Reads descend towards leaves, writes build new trees bottom-up Key ideas Keep metadata immutable Share whole sub-trees Bogdan Nicolae : Efficient data storage on distributed systems 24/ 4
28 Metadata forward references Principles High level description Metadata management Design issues Synthetic benchmarks Solve the problem of efficient concurrent updates to the metadata Key idea: precalculate children of lower versions instead of waiting Example Initial BLOB Three concurrent writers finished writing their chunks Black is faster but knows about blue and links to the not-yet-existing node After red and blue finish, metadata is consistent Bogdan Nicolae : Efficient data storage on distributed systems 25/ 4
29 Metadata forward references Principles High level description Metadata management Design issues Synthetic benchmarks Solve the problem of efficient concurrent updates to the metadata Key idea: precalculate children of lower versions instead of waiting Example Initial BLOB Three concurrent writers finished writing their chunks Black is faster but knows about blue and links to the not-yet-existing node After red and blue finish, metadata is consistent Bogdan Nicolae : Efficient data storage on distributed systems 25/ 4
30 Metadata forward references Principles High level description Metadata management Design issues Synthetic benchmarks Solve the problem of efficient concurrent updates to the metadata Key idea: precalculate children of lower versions instead of waiting Example Initial BLOB Three concurrent writers finished writing their chunks Black is faster but knows about blue and links to the not-yet-existing node After red and blue finish, metadata is consistent Bogdan Nicolae : Efficient data storage on distributed systems 25/ 4
31 Metadata forward references Principles High level description Metadata management Design issues Synthetic benchmarks Solve the problem of efficient concurrent updates to the metadata Key idea: precalculate children of lower versions instead of waiting Example Initial BLOB Three concurrent writers finished writing their chunks Black is faster but knows about blue and links to the not-yet-existing node After red and blue finish, metadata is consistent Bogdan Nicolae : Efficient data storage on distributed systems 25/ 4
32 Design considerations Context Principles High level description Metadata management Design issues Synthetic benchmarks Event-driven, layered design Callbacks instead of blocking Asynchronous RPC for interprocess communication Metadata providers form a DHT Custom implementation Plugin-able allocation strategy on provider manager Round-robin load-balancing by default More adaptive solutions possible Bogdan Nicolae : Efficient data storage on distributed systems 26/ 4
33 Fault tolerance Context Principles High level description Metadata management Design issues Synthetic benchmarks Clients die during write Before asking for a version: no problem After asking for a version: delegate to metadata provider Data and/or metadata providers die Replication of chunks and metadata pieces Data and metadata immutable: no sync between replicas needed Version manager and/or provider manager dies Distributed state machine using leader election Bogdan Nicolae : Efficient data storage on distributed systems 27/ 4
34 Experimental platform: Grid 5000 Principles High level description Metadata management Design issues Synthetic benchmarks Experimental testbed distributed in 9 sites around France A total of more than 5000 cores Reservation system grants exclusive access for experiments x86 64 CPUs, >2 GB RAM, locally attached disks Interconnect: Gigabit Ethernet, Myrinet, Infiniband Bogdan Nicolae : Efficient data storage on distributed systems 28/ 4
35 Results: synthetic benchmarks Principles High level description Metadata management Design issues Synthetic benchmarks Data striping Metadata decentralization Bogdan Nicolae : Efficient data storage on distributed systems 29/ 4
36 Results: synthetic benchmarks (2) Principles High level description Metadata management Design issues Synthetic benchmarks Increase write pressure while reading Increase read pressure while writing Bogdan Nicolae : Efficient data storage on distributed systems 30/ 4
37 Storage backend for Hadoop MapReduce Virtual machine image storage for IaaS clouds Quality-of-service enabled storage for cloud applications Storage backend for Hadoop MapReduce Efficient VM image deployment and snapshotting on clouds QoS enabled storage for clouds Bogdan Nicolae : Efficient data storage on distributed systems 31/ 4
38 Storage backend for Hadoop MapReduce Virtual machine image storage for IaaS clouds Quality-of-service enabled storage for cloud applications MapReduce Data-intensive oriented paradigm Covers a wide range of data-intensive application classes Users stick to a well-defined model Widely adopted: Google, Yahoo Popular open-source implementation: Hadoop Bogdan Nicolae : Efficient data storage on distributed systems 32/ 4
39 Storage backend for Hadoop MapReduce Virtual machine image storage for IaaS clouds Quality-of-service enabled storage for cloud applications as a storage backend for Hadoop MapReduce Proposal replaces HDFS (default storage backend) Design issues Implement Hadoop API Hierarchic namespace for BLOBs Data prefetching Affinity scheduling: exposing data location Bogdan Nicolae : Efficient data storage on distributed systems 33/ 4
40 Storage backend for Hadoop MapReduce Virtual machine image storage for IaaS clouds Quality-of-service enabled storage for cloud applications Results: synthetic benchmarks Single writer Concurrent readers Bogdan Nicolae : Efficient data storage on distributed systems 34/ 4
41 Storage backend for Hadoop MapReduce Virtual machine image storage for IaaS clouds Quality-of-service enabled storage for cloud applications Results: Real MapReduce applications Write intensive Read intensive Bogdan Nicolae : Efficient data storage on distributed systems 35/ 4
42 Storage backend for Hadoop MapReduce Virtual machine image storage for IaaS clouds Quality-of-service enabled storage for cloud applications In short Improvement of 11%-30% over HDFS for real MapReduce applications Potential to leverage versioning in Hadoop Further improve performance New features Bogdan Nicolae : Efficient data storage on distributed systems 36/ 4
43 Storage backend for Hadoop MapReduce Virtual machine image storage for IaaS clouds Quality-of-service enabled storage for cloud applications Virtual machine image storage for IaaS clouds On IaaS clouds users rent resources as VMs VM image customized with user application Two patterns: Multi-deployment: instantiate many VMs from the same image Multi-snapshotting: save state of VMs into independent images State-of-art: full pre-propagation, then copy images back to repository Our goal: reduce cost (execution time, storage space, network traffic) Bogdan Nicolae : Efficient data storage on distributed systems 37/ 4
44 Storage backend for Hadoop MapReduce Virtual machine image storage for IaaS clouds Quality-of-service enabled storage for cloud applications Proposal Store image in a striped fashion Leverage to store each image as a BLOB Mirror image contents locally Lazy scheme: only read from BLOB when needed Keep changes to image local Only the necessary parts are accessed Consolidate local changes into an independent image CLONE BLOB, then WRITE local changes to BLOB Provides illusion of independent images Only differences are stored Bogdan Nicolae : Efficient data storage on distributed systems 38/ 4
45 Storage backend for Hadoop MapReduce Virtual machine image storage for IaaS clouds Quality-of-service enabled storage for cloud applications Results: scalability of multi-deployment under concurrency VM boot: runtime speedup vs. pre-propagation VM boot: total network traffic vs. pre-propagation Bogdan Nicolae : Efficient data storage on distributed systems 39/ 4
46 Storage backend for Hadoop MapReduce Virtual machine image storage for IaaS clouds Quality-of-service enabled storage for cloud applications Results: performance of multi-snapshotting Avg. time to take snapshots Total network traffic (storage space) Bogdan Nicolae : Efficient data storage on distributed systems 40/ 4
47 Storage backend for Hadoop MapReduce Virtual machine image storage for IaaS clouds Quality-of-service enabled storage for cloud applications In short Large speedup and network traffic savings over pre-propagation for multi-deployment Efficient multi-snapshotting Portable approach: does not depend on hypervisor to manage diffs Bogdan Nicolae : Efficient data storage on distributed systems 41/ 4
48 Storage backend for Hadoop MapReduce Virtual machine image storage for IaaS clouds Quality-of-service enabled storage for cloud applications Quality-of-service enabled storage for cloud applications Storage for data-intensive applications deployed on IaaS clouds QoS impacted by several factors Multiple customers share the same storage service Hardware components prone to failures Application access pattern We need: High aggregated throughput under concurrency Stable throughput for individual data accesses Bogdan Nicolae : Efficient data storage on distributed systems 42/ 4
49 Storage backend for Hadoop MapReduce Virtual machine image storage for IaaS clouds Quality-of-service enabled storage for cloud applications Proposal: QoS improvement methodology Methodology 1 Monitor storage service 2 Collect app feedback 3 Identify + classify behavior patterns using GloBeM 4 Prevent undesired patterns Input: MapReduce access patterns + faults Applied methodology to find bottlenecks Output: Improved allocation strategy Bogdan Nicolae : Efficient data storage on distributed systems 43/ 4
50 Storage backend for Hadoop MapReduce Virtual machine image storage for IaaS clouds Quality-of-service enabled storage for cloud applications Results: improvement of throughput stability Clients separated from providers Clients co-deployed with providers Bogdan Nicolae : Efficient data storage on distributed systems 44/ 4
51 Storage backend for Hadoop MapReduce Virtual machine image storage for IaaS clouds Quality-of-service enabled storage for cloud applications In short General methodology to improve QoS for cloud storage Concretely: Reduction in standard deviation for read throughput of up to 25% Promising results for cloud providers to improve SLA for the same price Bogdan Nicolae : Efficient data storage on distributed systems 45/ 4
Leveraging BlobSeer to boost up the deployment and execution of Hadoop applications in Nimbus cloud environments on Grid 5000
Leveraging BlobSeer to boost up the deployment and execution of Hadoop applications in Nimbus cloud environments on Grid 5000 Alexandra Carpen-Amarie Diana Moise Bogdan Nicolae KerData Team, INRIA Outline
More informationBlobSeer: Enabling Efficient Lock-Free, Versioning-Based Storage for Massive Data under Heavy Access Concurrency
BlobSeer: Enabling Efficient Lock-Free, Versioning-Based Storage for Massive Data under Heavy Access Concurrency Gabriel Antoniu 1, Luc Bougé 2, Bogdan Nicolae 3 KerData research team 1 INRIA Rennes -
More informationGoing Back and Forth: Efficient Multideployment and Multisnapshotting on Clouds
Going Back and Forth: Efficient Multideployment and Multisnapshotting on Clouds Bogdan Nicolae INRIA Saclay France bogdan.nicolae@inria.fr John Bresnahan Argonne National Laboratory USA bresnaha@mcs.anl.gov
More informationComputing in clouds: Where we come from, Where we are, What we can, Where we go
Computing in clouds: Where we come from, Where we are, What we can, Where we go Luc Bougé ENS Cachan/Rennes, IRISA, INRIA Biogenouest With help from many colleagues: Gabriel Antoniu, Guillaume Pierre,
More informationA Cost-Evaluation of MapReduce Applications in the Cloud
1/23 A Cost-Evaluation of MapReduce Applications in the Cloud Diana Moise, Alexandra Carpen-Amarie Gabriel Antoniu, Luc Bougé KerData team 2/23 1 MapReduce applications - case study 2 3 4 5 3/23 MapReduce
More informationStorage Architectures for Big Data in the Cloud
Storage Architectures for Big Data in the Cloud Sam Fineberg HP Storage CT Office/ May 2013 Overview Introduction What is big data? Big Data I/O Hadoop/HDFS SAN Distributed FS Cloud Summary Research Areas
More informationDistributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms
Distributed File System 1 How do we get data to the workers? NAS Compute Nodes SAN 2 Distributed File System Don t move data to workers move workers to the data! Store data on the local disks of nodes
More informationJ. Parallel Distrib. Comput. BlobCR: Virtual disk based checkpoint-restart for HPC applications on IaaS clouds
J. Parallel Distrib. Comput. 73 (2013) 698 711 Contents lists available at SciVerse ScienceDirect J. Parallel Distrib. Comput. journal homepage: www.elsevier.com/locate/jpdc BlobCR: Virtual disk based
More informationAccelerating and Simplifying Apache
Accelerating and Simplifying Apache Hadoop with Panasas ActiveStor White paper NOvember 2012 1.888.PANASAS www.panasas.com Executive Overview The technology requirements for big data vary significantly
More informationTHE HADOOP DISTRIBUTED FILE SYSTEM
THE HADOOP DISTRIBUTED FILE SYSTEM Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler Presented by Alexander Pokluda October 7, 2013 Outline Motivation and Overview of Hadoop Architecture,
More informationBlobCR: Efficient Checkpoint-Restart for HPC Applications on IaaS Clouds using Virtual Disk Image Snapshots
BlobCR: Efficient Checkpoint-Restart for HPC Applications on IaaS Clouds using Virtual Disk Image Snapshots Bogdan Nicolae INRIA Saclay, Île-de-France, France bogdan.nicolae@inria.fr Franck Cappello INRIA
More informationWrite a technical report Present your results Write a workshop/conference paper (optional) Could be a real system, simulation and/or theoretical
Identify a problem Review approaches to the problem Propose a novel approach to the problem Define, design, prototype an implementation to evaluate your approach Could be a real system, simulation and/or
More informationHadoop Architecture. Part 1
Hadoop Architecture Part 1 Node, Rack and Cluster: A node is simply a computer, typically non-enterprise, commodity hardware for nodes that contain data. Consider we have Node 1.Then we can add more nodes,
More informationIntroduction to Cloud Computing
Introduction to Cloud Computing Cloud Computing I (intro) 15 319, spring 2010 2 nd Lecture, Jan 14 th Majd F. Sakr Lecture Motivation General overview on cloud computing What is cloud computing Services
More informationGeoGrid Project and Experiences with Hadoop
GeoGrid Project and Experiences with Hadoop Gong Zhang and Ling Liu Distributed Data Intensive Systems Lab (DiSL) Center for Experimental Computer Systems Research (CERCS) Georgia Institute of Technology
More informationGoogle File System. Web and scalability
Google File System Web and scalability The web: - How big is the Web right now? No one knows. - Number of pages that are crawled: o 100,000 pages in 1994 o 8 million pages in 2005 - Crawlable pages might
More informationLecture 5: GFS & HDFS! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl
Big Data Processing, 2014/15 Lecture 5: GFS & HDFS!! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl 1 Course content Introduction Data streams 1 & 2 The MapReduce paradigm Looking behind
More informationSistemi Operativi e Reti. Cloud Computing
1 Sistemi Operativi e Reti Cloud Computing Facoltà di Scienze Matematiche Fisiche e Naturali Corso di Laurea Magistrale in Informatica Osvaldo Gervasi ogervasi@computer.org 2 Introduction Technologies
More informationTowards the Magic Green Broker Jean-Louis Pazat IRISA 1/29. Jean-Louis Pazat. IRISA/INSA Rennes, FRANCE MYRIADS Project Team
Towards the Magic Green Broker Jean-Louis Pazat IRISA 1/29 Jean-Louis Pazat IRISA/INSA Rennes, FRANCE MYRIADS Project Team Towards the Magic Green Broker Jean-Louis Pazat IRISA 2/29 OUTLINE Clouds and
More informationHadoop IST 734 SS CHUNG
Hadoop IST 734 SS CHUNG Introduction What is Big Data?? Bulk Amount Unstructured Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per day) If a regular machine need to
More informationPerformance Analysis of Mixed Distributed Filesystem Workloads
Performance Analysis of Mixed Distributed Filesystem Workloads Esteban Molina-Estolano, Maya Gokhale, Carlos Maltzahn, John May, John Bent, Scott Brandt Motivation Hadoop-tailored filesystems (e.g. CloudStore)
More informationHPC performance applications on Virtual Clusters
Panagiotis Kritikakos EPCC, School of Physics & Astronomy, University of Edinburgh, Scotland - UK pkritika@epcc.ed.ac.uk 4 th IC-SCCE, Athens 7 th July 2010 This work investigates the performance of (Java)
More informationWill They Blend?: Exploring Big Data Computation atop Traditional HPC NAS Storage
Will They Blend?: Exploring Big Data Computation atop Traditional HPC NAS Storage Ellis H. Wilson III 1,2 Mahmut Kandemir 1 Garth Gibson 2,3 1 Department of Computer Science and Engineering, The Pennsylvania
More informationEvaluation Methodology of Converged Cloud Environments
Krzysztof Zieliński Marcin Jarząb Sławomir Zieliński Karol Grzegorczyk Maciej Malawski Mariusz Zyśk Evaluation Methodology of Converged Cloud Environments Cloud Computing Cloud Computing enables convenient,
More informationA REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM
A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM Sneha D.Borkar 1, Prof.Chaitali S.Surtakar 2 Student of B.E., Information Technology, J.D.I.E.T, sborkar95@gmail.com Assistant Professor, Information
More informationEnergy Efficient MapReduce
Energy Efficient MapReduce Motivation: Energy consumption is an important aspect of datacenters efficiency, the total power consumption in the united states has doubled from 2000 to 2005, representing
More informationDistributed File Systems
Distributed File Systems Paul Krzyzanowski Rutgers University October 28, 2012 1 Introduction The classic network file systems we examined, NFS, CIFS, AFS, Coda, were designed as client-server applications.
More informationChapter 7. Using Hadoop Cluster and MapReduce
Chapter 7 Using Hadoop Cluster and MapReduce Modeling and Prototyping of RMS for QoS Oriented Grid Page 152 7. Using Hadoop Cluster and MapReduce for Big Data Problems The size of the databases used in
More informationScala Storage Scale-Out Clustered Storage White Paper
White Paper Scala Storage Scale-Out Clustered Storage White Paper Chapter 1 Introduction... 3 Capacity - Explosive Growth of Unstructured Data... 3 Performance - Cluster Computing... 3 Chapter 2 Current
More informationDistributed File Systems
Distributed File Systems Mauro Fruet University of Trento - Italy 2011/12/19 Mauro Fruet (UniTN) Distributed File Systems 2011/12/19 1 / 39 Outline 1 Distributed File Systems 2 The Google File System (GFS)
More informationNoSQL and Hadoop Technologies On Oracle Cloud
NoSQL and Hadoop Technologies On Oracle Cloud Vatika Sharma 1, Meenu Dave 2 1 M.Tech. Scholar, Department of CSE, Jagan Nath University, Jaipur, India 2 Assistant Professor, Department of CSE, Jagan Nath
More informationViolin: A Framework for Extensible Block-level Storage
Violin: A Framework for Extensible Block-level Storage Michail Flouris Dept. of Computer Science, University of Toronto, Canada flouris@cs.toronto.edu Angelos Bilas ICS-FORTH & University of Crete, Greece
More informationEvaluating HDFS I/O Performance on Virtualized Systems
Evaluating HDFS I/O Performance on Virtualized Systems Xin Tang xtang@cs.wisc.edu University of Wisconsin-Madison Department of Computer Sciences Abstract Hadoop as a Service (HaaS) has received increasing
More informationIaaS Cloud Architectures: Virtualized Data Centers to Federated Cloud Infrastructures
IaaS Cloud Architectures: Virtualized Data Centers to Federated Cloud Infrastructures Dr. Sanjay P. Ahuja, Ph.D. 2010-14 FIS Distinguished Professor of Computer Science School of Computing, UNF Introduction
More informationBookKeeper. Flavio Junqueira Yahoo! Research, Barcelona. Hadoop in China 2011
BookKeeper Flavio Junqueira Yahoo! Research, Barcelona Hadoop in China 2011 What s BookKeeper? Shared storage for writing fast sequences of byte arrays Data is replicated Writes are striped Many processes
More informationHadoop Distributed File System. T-111.5550 Seminar On Multimedia 2009-11-11 Eero Kurkela
Hadoop Distributed File System T-111.5550 Seminar On Multimedia 2009-11-11 Eero Kurkela Agenda Introduction Flesh and bones of HDFS Architecture Accessing data Data replication strategy Fault tolerance
More informationPerformance Evaluation for BlobSeer and Hadoop using Machine Learning Algorithms
Performance Evaluation for BlobSeer and Hadoop using Machine Learning Algorithms Elena Burceanu, Irina Presa Automatic Control and Computers Faculty Politehnica University of Bucharest Emails: {elena.burceanu,
More informationOverview. Big Data in Apache Hadoop. - HDFS - MapReduce in Hadoop - YARN. https://hadoop.apache.org. Big Data Management and Analytics
Overview Big Data in Apache Hadoop - HDFS - MapReduce in Hadoop - YARN https://hadoop.apache.org 138 Apache Hadoop - Historical Background - 2003: Google publishes its cluster architecture & DFS (GFS)
More informationWelcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components
Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components of Hadoop. We will see what types of nodes can exist in a Hadoop
More informationAmazon EC2 Product Details Page 1 of 5
Amazon EC2 Product Details Page 1 of 5 Amazon EC2 Functionality Amazon EC2 presents a true virtual computing environment, allowing you to use web service interfaces to launch instances with a variety of
More informationDirect NFS - Design considerations for next-gen NAS appliances optimized for database workloads Akshay Shah Gurmeet Goindi Oracle
Direct NFS - Design considerations for next-gen NAS appliances optimized for database workloads Akshay Shah Gurmeet Goindi Oracle Agenda Introduction Database Architecture Direct NFS Client NFS Server
More informationSolving I/O Bottlenecks to Enable Superior Cloud Efficiency
WHITE PAPER Solving I/O Bottlenecks to Enable Superior Cloud Efficiency Overview...1 Mellanox I/O Virtualization Features and Benefits...2 Summary...6 Overview We already have 8 or even 16 cores on one
More informationOn- Prem MongoDB- as- a- Service Powered by the CumuLogic DBaaS Platform
On- Prem MongoDB- as- a- Service Powered by the CumuLogic DBaaS Platform Page 1 of 16 Table of Contents Table of Contents... 2 Introduction... 3 NoSQL Databases... 3 CumuLogic NoSQL Database Service...
More informationCloud Computing at Google. Architecture
Cloud Computing at Google Google File System Web Systems and Algorithms Google Chris Brooks Department of Computer Science University of San Francisco Google has developed a layered system to handle webscale
More informationCSE-E5430 Scalable Cloud Computing Lecture 2
CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 14.9-2015 1/36 Google MapReduce A scalable batch processing
More informationHDFS Users Guide. Table of contents
Table of contents 1 Purpose...2 2 Overview...2 3 Prerequisites...3 4 Web Interface...3 5 Shell Commands... 3 5.1 DFSAdmin Command...4 6 Secondary NameNode...4 7 Checkpoint Node...5 8 Backup Node...6 9
More informationBig data management with IBM General Parallel File System
Big data management with IBM General Parallel File System Optimize storage management and boost your return on investment Highlights Handles the explosive growth of structured and unstructured data Offers
More informationCS2510 Computer Operating Systems
CS2510 Computer Operating Systems HADOOP Distributed File System Dr. Taieb Znati Computer Science Department University of Pittsburgh Outline HDF Design Issues HDFS Application Profile Block Abstraction
More informationCS2510 Computer Operating Systems
CS2510 Computer Operating Systems HADOOP Distributed File System Dr. Taieb Znati Computer Science Department University of Pittsburgh Outline HDF Design Issues HDFS Application Profile Block Abstraction
More informationOutline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging
Outline High Performance Computing (HPC) Towards exascale computing: a brief history Challenges in the exascale era Big Data meets HPC Some facts about Big Data Technologies HPC and Big Data converging
More informationJournal of science STUDY ON REPLICA MANAGEMENT AND HIGH AVAILABILITY IN HADOOP DISTRIBUTED FILE SYSTEM (HDFS)
Journal of science e ISSN 2277-3290 Print ISSN 2277-3282 Information Technology www.journalofscience.net STUDY ON REPLICA MANAGEMENT AND HIGH AVAILABILITY IN HADOOP DISTRIBUTED FILE SYSTEM (HDFS) S. Chandra
More informationCan High-Performance Interconnects Benefit Memcached and Hadoop?
Can High-Performance Interconnects Benefit Memcached and Hadoop? D. K. Panda and Sayantan Sur Network-Based Computing Laboratory Department of Computer Science and Engineering The Ohio State University,
More informationThe Comprehensive Performance Rating for Hadoop Clusters on Cloud Computing Platform
The Comprehensive Performance Rating for Hadoop Clusters on Cloud Computing Platform Fong-Hao Liu, Ya-Ruei Liou, Hsiang-Fu Lo, Ko-Chin Chang, and Wei-Tsong Lee Abstract Virtualization platform solutions
More informationBENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB
BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB Planet Size Data!? Gartner s 10 key IT trends for 2012 unstructured data will grow some 80% over the course of the next
More informationHDFS Architecture Guide
by Dhruba Borthakur Table of contents 1 Introduction... 3 2 Assumptions and Goals... 3 2.1 Hardware Failure... 3 2.2 Streaming Data Access...3 2.3 Large Data Sets... 3 2.4 Simple Coherency Model...3 2.5
More informationHadoop and Map-Reduce. Swati Gore
Hadoop and Map-Reduce Swati Gore Contents Why Hadoop? Hadoop Overview Hadoop Architecture Working Description Fault Tolerance Limitations Why Map-Reduce not MPI Distributed sort Why Hadoop? Existing Data
More informationScalable Data-Intensive Processing for Science on Clouds: A-Brain and Z-CloudFlow
Scalable Data-Intensive Processing for Science on Clouds: A-Brain and Z-CloudFlow Lessons Learned and Future Directions Gabriel Antoniu, Inria Joint work with Radu Tudoran, Benoit Da Mota, Alexandru Costan,
More informationCloud Computing Trends
UT DALLAS Erik Jonsson School of Engineering & Computer Science Cloud Computing Trends What is cloud computing? Cloud computing refers to the apps and services delivered over the internet. Software delivered
More informationCIS 4930/6930 Spring 2014 Introduction to Data Science /Data Intensive Computing. University of Florida, CISE Department Prof.
CIS 4930/6930 Spring 2014 Introduction to Data Science /Data Intensie Computing Uniersity of Florida, CISE Department Prof. Daisy Zhe Wang Map/Reduce: Simplified Data Processing on Large Clusters Parallel/Distributed
More informationAssignment # 1 (Cloud Computing Security)
Assignment # 1 (Cloud Computing Security) Group Members: Abdullah Abid Zeeshan Qaiser M. Umar Hayat Table of Contents Windows Azure Introduction... 4 Windows Azure Services... 4 1. Compute... 4 a) Virtual
More informationHadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh
1 Hadoop: A Framework for Data- Intensive Distributed Computing CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 2 What is Hadoop? Hadoop is a software framework for distributed processing of large datasets
More informationOn the Duality of Data-intensive File System Design: Reconciling HDFS and PVFS
On the Duality of Data-intensive File System Design: Reconciling HDFS and PVFS Wittawat Tantisiriroj Seung Woo Son Swapnil Patil Carnegie Mellon University Samuel J. Lang Argonne National Laboratory Garth
More informationParallel Processing of cluster by Map Reduce
Parallel Processing of cluster by Map Reduce Abstract Madhavi Vaidya, Department of Computer Science Vivekanand College, Chembur, Mumbai vamadhavi04@yahoo.co.in MapReduce is a parallel programming model
More informationHadoop: Embracing future hardware
Hadoop: Embracing future hardware Suresh Srinivas @suresh_m_s Page 1 About Me Architect & Founder at Hortonworks Long time Apache Hadoop committer and PMC member Designed and developed many key Hadoop
More informationOpen Cloud System. (Integration of Eucalyptus, Hadoop and AppScale into deployment of University Private Cloud)
Open Cloud System (Integration of Eucalyptus, Hadoop and into deployment of University Private Cloud) Thinn Thu Naing University of Computer Studies, Yangon 25 th October 2011 Open Cloud System University
More informationTake An Internal Look at Hadoop. Hairong Kuang Grid Team, Yahoo! Inc hairong@yahoo-inc.com
Take An Internal Look at Hadoop Hairong Kuang Grid Team, Yahoo! Inc hairong@yahoo-inc.com What s Hadoop Framework for running applications on large clusters of commodity hardware Scale: petabytes of data
More informationA Brief Analysis on Architecture and Reliability of Cloud Based Data Storage
Volume 2, No.4, July August 2013 International Journal of Information Systems and Computer Sciences ISSN 2319 7595 Tejaswini S L Jayanthy et al., Available International Online Journal at http://warse.org/pdfs/ijiscs03242013.pdf
More informationCUMULUX WHICH CLOUD PLATFORM IS RIGHT FOR YOU? COMPARING CLOUD PLATFORMS. Review Business and Technology Series www.cumulux.com
` CUMULUX WHICH CLOUD PLATFORM IS RIGHT FOR YOU? COMPARING CLOUD PLATFORMS Review Business and Technology Series www.cumulux.com Table of Contents Cloud Computing Model...2 Impact on IT Management and
More informationParallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel
Parallel Databases Increase performance by performing operations in parallel Parallel Architectures Shared memory Shared disk Shared nothing closely coupled loosely coupled Parallelism Terminology Speedup:
More informationMixing Hadoop and HPC Workloads on Parallel Filesystems
Mixing Hadoop and HPC Workloads on Parallel Filesystems Esteban Molina-Estolano *, Maya Gokhale, Carlos Maltzahn *, John May, John Bent, Scott Brandt * * UC Santa Cruz, ISSDM, PDSI Lawrence Livermore National
More informationCSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)
CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) Rezaul A. Chowdhury Department of Computer Science SUNY Stony Brook Spring 2016 MapReduce MapReduce is a programming model
More informationPLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS
PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS By HAI JIN, SHADI IBRAHIM, LI QI, HAIJUN CAO, SONG WU and XUANHUA SHI Prepared by: Dr. Faramarz Safi Islamic Azad
More informationCloud Computing: Meet the Players. Performance Analysis of Cloud Providers
BASEL UNIVERSITY COMPUTER SCIENCE DEPARTMENT Cloud Computing: Meet the Players. Performance Analysis of Cloud Providers Distributed Information Systems (CS341/HS2010) Report based on D.Kassman, T.Kraska,
More informationSoftware Execution Protection in the Cloud
Software Execution Protection in the Cloud Miguel Correia 1st European Workshop on Dependable Cloud Computing Sibiu, Romania, May 8 th 2012 Motivation clouds fail 2 1 Motivation accidental arbitrary faults
More informationMaxDeploy Hyper- Converged Reference Architecture Solution Brief
MaxDeploy Hyper- Converged Reference Architecture Solution Brief MaxDeploy Reference Architecture solutions are configured and tested for support with Maxta software- defined storage and with industry
More informationManaged Virtualized Platforms: From Multicore Nodes to Distributed Cloud Infrastructures
Managed Virtualized Platforms: From Multicore Nodes to Distributed Cloud Infrastructures Ada Gavrilovska Karsten Schwan, Mukil Kesavan Sanjay Kumar, Ripal Nathuji, Adit Ranadive Center for Experimental
More informationMapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012
MapReduce and Hadoop Aaron Birkland Cornell Center for Advanced Computing January 2012 Motivation Simple programming model for Big Data Distributed, parallel but hides this Established success at petabyte
More informationBig Data Processing in the Cloud. Shadi Ibrahim Inria, Rennes - Bretagne Atlantique Research Center
Big Data Processing in the Cloud Shadi Ibrahim Inria, Rennes - Bretagne Atlantique Research Center Data is ONLY as useful as the decisions it enables 2 Data is ONLY as useful as the decisions it enables
More informationOptimize the execution of local physics analysis workflows using Hadoop
Optimize the execution of local physics analysis workflows using Hadoop INFN CCR - GARR Workshop 14-17 May Napoli Hassen Riahi Giacinto Donvito Livio Fano Massimiliano Fasi Andrea Valentini INFN-PERUGIA
More informationLecture 02a Cloud Computing I
Mobile Cloud Computing Lecture 02a Cloud Computing I 吳 秀 陽 Shiow-yang Wu What is Cloud Computing? Computing with cloud? Mobile Cloud Computing Cloud Computing I 2 Note 1 What is Cloud Computing? Walking
More informationComparative analysis of Google File System and Hadoop Distributed File System
Comparative analysis of Google File System and Hadoop Distributed File System R.Vijayakumari, R.Kirankumar, K.Gangadhara Rao Dept. of Computer Science, Krishna University, Machilipatnam, India, vijayakumari28@gmail.com
More informationCloud/SaaS enablement of existing applications
Cloud/SaaS enablement of existing applications GigaSpaces: Nati Shalom, CTO & Founder About GigaSpaces Technologies Enabling applications to run a distributed cluster as if it was a single machine 75+
More informationIBM Platform Computing Cloud Service Ready to use Platform LSF & Symphony clusters in the SoftLayer cloud
IBM Platform Computing Cloud Service Ready to use Platform LSF & Symphony clusters in the SoftLayer cloud February 25, 2014 1 Agenda v Mapping clients needs to cloud technologies v Addressing your pain
More informationVirtualizing Apache Hadoop. June, 2012
June, 2012 Table of Contents EXECUTIVE SUMMARY... 3 INTRODUCTION... 3 VIRTUALIZING APACHE HADOOP... 4 INTRODUCTION TO VSPHERE TM... 4 USE CASES AND ADVANTAGES OF VIRTUALIZING HADOOP... 4 MYTHS ABOUT RUNNING
More informationCOSC 6374 Parallel Computation. Parallel I/O (I) I/O basics. Concept of a clusters
COSC 6374 Parallel Computation Parallel I/O (I) I/O basics Spring 2008 Concept of a clusters Processor 1 local disks Compute node message passing network administrative network Memory Processor 2 Network
More informationDesign and Evolution of the Apache Hadoop File System(HDFS)
Design and Evolution of the Apache Hadoop File System(HDFS) Dhruba Borthakur Engineer@Facebook Committer@Apache HDFS SDC, Sept 19 2011 Outline Introduction Yet another file-system, why? Goals of Hadoop
More informationCOSC 6374 Parallel Computation. Parallel I/O (I) I/O basics. Concept of a clusters
COSC 6374 Parallel I/O (I) I/O basics Fall 2012 Concept of a clusters Processor 1 local disks Compute node message passing network administrative network Memory Processor 2 Network card 1 Network card
More informationA Very Brief Introduction To Cloud Computing. Jens Vöckler, Gideon Juve, Ewa Deelman, G. Bruce Berriman
A Very Brief Introduction To Cloud Computing Jens Vöckler, Gideon Juve, Ewa Deelman, G. Bruce Berriman What is The Cloud Cloud computing refers to logical computational resources accessible via a computer
More informationLoad Rebalancing for File System in Public Cloud Roopa R.L 1, Jyothi Patil 2
Load Rebalancing for File System in Public Cloud Roopa R.L 1, Jyothi Patil 2 1 PDA College of Engineering, Gulbarga, Karnataka, India rlrooparl@gmail.com 2 PDA College of Engineering, Gulbarga, Karnataka,
More informationIntroduction to OpenStack
Introduction to OpenStack Carlo Vallati PostDoc Reseracher Dpt. Information Engineering University of Pisa carlo.vallati@iet.unipi.it Cloud Computing - Definition Cloud Computing is a term coined to refer
More informationArchitecting for the next generation of Big Data Hortonworks HDP 2.0 on Red Hat Enterprise Linux 6 with OpenJDK 7
Architecting for the next generation of Big Data Hortonworks HDP 2.0 on Red Hat Enterprise Linux 6 with OpenJDK 7 Yan Fisher Senior Principal Product Marketing Manager, Red Hat Rohit Bakhshi Product Manager,
More informationCluster, Grid, Cloud Concepts
Cluster, Grid, Cloud Concepts Kalaiselvan.K Contents Section 1: Cluster Section 2: Grid Section 3: Cloud Cluster An Overview Need for a Cluster Cluster categorizations A computer cluster is a group of
More informationA Service for Data-Intensive Computations on Virtual Clusters
A Service for Data-Intensive Computations on Virtual Clusters Executing Preservation Strategies at Scale Rainer Schmidt, Christian Sadilek, and Ross King rainer.schmidt@arcs.ac.at Planets Project Permanent
More informationAn Alternative Storage Solution for MapReduce. Eric Lomascolo Director, Solutions Marketing
An Alternative Storage Solution for MapReduce Eric Lomascolo Director, Solutions Marketing MapReduce Breaks the Problem Down Data Analysis Distributes processing work (Map) across compute nodes and accumulates
More informationHadoop@LaTech ATLAS Tier 3
Cerberus Hadoop Hadoop@LaTech ATLAS Tier 3 David Palma DOSAR Louisiana Tech University January 23, 2013 Cerberus Hadoop Outline 1 Introduction Cerberus Hadoop 2 Features Issues Conclusions 3 Cerberus Hadoop
More informationENABLING GLOBAL HADOOP WITH EMC ELASTIC CLOUD STORAGE
ENABLING GLOBAL HADOOP WITH EMC ELASTIC CLOUD STORAGE Hadoop Storage-as-a-Service ABSTRACT This White Paper illustrates how EMC Elastic Cloud Storage (ECS ) can be used to streamline the Hadoop data analytics
More informationCloud Computing. Chapter 1 Introducing Cloud Computing
Cloud Computing Chapter 1 Introducing Cloud Computing Learning Objectives Understand the abstract nature of cloud computing. Describe evolutionary factors of computing that led to the cloud. Describe virtualization
More informationA Study on Analysis and Implementation of a Cloud Computing Framework for Multimedia Convergence Services
A Study on Analysis and Implementation of a Cloud Computing Framework for Multimedia Convergence Services Ronnie D. Caytiles and Byungjoo Park * Department of Multimedia Engineering, Hannam University
More informationCloud Computing through Virtualization and HPC technologies
Cloud Computing through Virtualization and HPC technologies William Lu, Ph.D. 1 Agenda Cloud Computing & HPC A Case of HPC Implementation Application Performance in VM Summary 2 Cloud Computing & HPC HPC
More information