Quantcast Petabyte Storage at Half Price with QFS!

1 Quantcast Petabyte Storage at Half Price with QFS
Presented by Silvius Rus, Director, Big Data Platforms
September 2013

2 Quantcast File System (QFS)
A high-performance alternative to the Hadoop Distributed File System (HDFS).
- Manages multi-petabyte Hadoop workloads with significantly faster I/O than HDFS and uses only half the disk space.
- Offers massive cost savings to large-scale Hadoop users (fewer disks = fewer machines).
- Production hardened at Quantcast under massive processing loads (multi-exabyte).
- Fully compatible with Apache Hadoop.
- 100% open source.

3 Quantcast Technology Innovation Timeline
- Product milestones: Quantcast Measurement launched; Quantcast Advertising launched; QFS launched.
- Data received: 1 TB/day, 10 TB/day, 20 TB/day, 40 TB/day.
- Data processed: 1 PB/day, 10 PB/day, 20 PB/day.
- Storage milestones: started using Hadoop; using and sponsoring KFS; turned off HDFS.

4 Architecture
Client
- Implements high-level file interface (read/write/delete).
- On write, RS-encodes chunks and distributes stripes to nine chunk servers.
- On read, collects RS stripes from six chunk servers and recomposes the chunk.
Metaserver
- Maps /file/paths to chunk IDs.
- Manages chunk locations.
- Directs clients to chunk servers.
Chunk server
- Handles I/O to locally stored 64 MB chunks.
- Monitors host file system health.
- Replicates and recovers chunks as the metaserver directs.
Diagram labels: read/write RS-encoded data from/to chunk servers; locate or allocate chunks; chunk replication and rebalancing instructions; copy/recover chunks. Chunk servers are shown in Rack 1 and Rack 2.

5 QFS vs. HDFS
Broadly comparable feature set, with significant storage efficiency advantages.

| Feature | QFS | HDFS |
|---|---|---|
| Scalable, distributed storage designed for efficient batch processing | Yes | Yes |
| Open source | Yes | Yes |
| Hadoop compatible | Yes | Yes |
| Unix-style file permissions | Yes | Yes |
| Error recovery mechanism | Reed-Solomon encoding | Multiple data copies |
| Disk space required (as a multiple of raw data) | 1.5x | 3x |
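
The 1.5x and 3x disk-space multiples follow directly from the two redundancy schemes. A minimal sketch of that arithmetic, assuming the 6 data + 3 parity layout described on the Reed-Solomon slide and 3-way replication for HDFS (the function names are illustrative, not part of QFS or HDFS):

```python
def erasure_overhead(data_stripes, parity_stripes):
    """Disk space used per byte of raw data with erasure coding."""
    return (data_stripes + parity_stripes) / data_stripes


def replication_overhead(copies):
    """Disk space used per byte of raw data with full replication."""
    return float(copies)


qfs_multiple = erasure_overhead(6, 3)     # six data stripes + three parity stripes
hdfs_multiple = replication_overhead(3)   # three full copies

print(f"QFS:  {qfs_multiple}x raw data")
print(f"HDFS: {hdfs_multiple}x raw data")
print(f"QFS uses {qfs_multiple / hdfs_multiple:.0%} of the disk space HDFS uses")
```

Under these assumptions the ratio is one half, which is where the "half the disk space" claim comes from.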

6 Reed-Solomon Error Correction: Leveraging high-speed modern networks
HDFS optimizes toward data locality for older networks. 10 Gbps networks are now common, making disk I/O a more critical bottleneck. QFS leverages faster networks to achieve better parallelism and encoding efficiency. Result: higher error tolerance, faster performance, with half the disk space.
Reed-Solomon Parallel Data I/O
1. Break original data into 64 KB stripes.
2. Reed-Solomon generates three parity stripes for every six data stripes.
3. Write those to nine different drives.
4. Up to three stripes can become unreadable, yet the original data can still be recovered.
Every write is parallelized across 9 drives, every read across 6.
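
The "any three of nine can be lost" property can be illustrated with a toy Reed-Solomon-style code. This is a sketch under stated assumptions, not the QFS codec: it works on one symbol per stripe, uses arithmetic modulo the prime 257 (production codecs typically work in GF(2^8) and use vector instructions), and all names are hypothetical. Six data symbols define a polynomial of degree below six; the three parity symbols are that polynomial evaluated at three more points, so any six surviving symbols determine the original data.

```python
from itertools import combinations

P = 257      # prime modulus; an assumption for illustration only
K, M = 6, 3  # data stripes, parity stripes


def lagrange_eval(points, x):
    """Evaluate at x the unique polynomial of degree < len(points)
    passing through the given (xi, yi) points, modulo P."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, -1, P)) % P
    return total


def encode(data):
    """Return the 6 data symbols followed by 3 parity symbols."""
    points = list(enumerate(data))
    return list(data) + [lagrange_eval(points, x) for x in range(K, K + M)]


def recover(surviving):
    """Rebuild the 6 data symbols from any 6 surviving {index: symbol} pairs."""
    points = list(surviving.items())[:K]
    return [lagrange_eval(points, x) for x in range(K)]


data = [81, 70, 83, 33, 0, 255]
stripes = encode(data)

# Drop every possible set of three stripes and check the data still comes back.
recovered_all = all(
    recover({i: s for i, s in enumerate(stripes) if i not in lost}) == data
    for lost in combinations(range(K + M), M)
)
print("recoverable after any 3 losses:", recovered_all)
```

The loop covers all 84 ways of losing three of the nine stripes. Losing a fourth stripe would leave only five points, which is not enough to pin down the polynomial.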

7 MapReduce on 6+3 Erasure Coded Files versus 3x Replicated Files
Positives
- Writing is ½ off, both in terms of space and time.
- Any 3 broken or slow devices will be tolerated vs. any 2 with 3-way replication.
- Re-executed stragglers run faster due to reading from multiple devices (striping).
Negatives
- There is no locality; reading will require the network.
- On read failure, recovery is needed; however, it's lightning fast on modern CPUs (2 GB/s per core).
- Writes don't achieve network line rate, as original + parity data is written by a single client.

8 Read/Write Benchmarks
[Chart: end-to-end time (minutes) for write and read; series: HDFS 64 MB, HDFS 2.5 GB, QFS 64 MB.]
End-to-end 20 TB write test and end-to-end 20 TB read test: 8,000 workers * 2.5 GB each. Tests ran as Hadoop MapReduce jobs.
Host network behavior during tests:
- QFS write = ½ disk I/O of HDFS write
- QFS write → network/disk = 8/9
- HDFS write → network/disk = 6/9
- QFS read → network/disk = 1
- HDFS read → network/disk = very small
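
One plausible reading of the network/disk ratios above, offered as an assumption rather than the presenter's stated derivation: on write, one of the nine QFS stripes (or one of the three HDFS replicas) lands on the writer's own host and the rest cross the network, while QFS reads fetch all six data stripes remotely. A short sketch of that arithmetic:

```python
from fractions import Fraction

# Units written to disk per 6 units of raw data.
qfs_disk_units = 6 + 3    # six data stripes plus three parity stripes
hdfs_disk_units = 6 * 3   # three full copies of the same six units

disk_io_ratio = Fraction(qfs_disk_units, hdfs_disk_units)

# Assumption: exactly one stripe or replica is written locally.
qfs_write = Fraction(9 - 1, 9)    # 8 of 9 stripes cross the network
hdfs_write = Fraction(3 - 1, 3)   # 2 of 3 replicas cross the network
qfs_read = Fraction(6, 6)         # no locality: all six stripes are remote

print("QFS write disk I/O vs HDFS:", disk_io_ratio)
print("QFS write network/disk:", qfs_write)
print("HDFS write network/disk:", hdfs_write, "(the slide writes this as 6/9)")
print("QFS read network/disk:", qfs_read)
```

HDFS reads are scheduled next to a local replica where possible, which is consistent with the "very small" ratio on the slide.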

9 Metaserver Performance
Test setup: Intel E GB RAM; 70 million directories.
[Chart: operations per second (thousands) for stat, rmdir, mkdir, and ls; series: QFS, HDFS.]

10 Production Hardening for Petascale
Continuous I/O Balancing
- Full feedback loop.
- Metaserver knows the I/O queue size of every device.
- Activity biased towards under-loaded chunkservers.
- Direct I/O = short loop.
Optimization
- Direct I/O and fixed buffer space = predictable RAM and storage device usage.
- C++, own memory allocation and layout.
- Vector instructions for Reed-Solomon coding.
Operations
- Hibernation.
- Evacuation through recovery.
- Continuous space/consistency rebalancing.
- Monitoring and alerts.
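
The feedback loop described above can be pictured as a placement decision driven by reported queue sizes. The sketch below is a deliberately simplified, hypothetical stand-in: it picks the nine chunk servers with the shortest queues, whereas the slide only says activity is biased towards under-loaded servers and does not describe the actual policy. The server names and queue depths are made up.

```python
import heapq


def pick_chunk_servers(queue_sizes, count=9):
    """Return `count` server names with the shortest reported I/O queues.

    `queue_sizes` maps a server name to its current queue depth.
    """
    if count > len(queue_sizes):
        raise ValueError("not enough chunk servers")
    return heapq.nsmallest(count, queue_sizes, key=queue_sizes.get)


# Hypothetical queue depths reported to the metaserver by twelve chunk servers.
depths = [5, 0, 9, 2, 7, 1, 3, 8, 4, 6, 12, 10]
queues = {f"cs{i:02d}": depth for i, depth in enumerate(depths)}

chosen = pick_chunk_servers(queues)
print("stripes go to:", chosen)
```

A real policy would likely add randomization and rack awareness so that the least-loaded servers are not all hit at once; those details are outside what the slide states.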

11 Use Case: Quantsort
- All I/O over QFS.
- Concurrent append: 10,000 writers append to the same file at once.
- Largest sort = 1 PB.
- Daily = 1 to 2 PB, max = 3 PB.

12 Use Case: Fast Broadcast through Wide Striping
[Chart: broadcast time (s) by configuration: HDFS Default, HDFS Small Blocks, QFS on Disk, QFS in RAM.]

13 Refreshingly Fast Command Line Tool
hadoop fs -ls / versus qfs ls /
HDFS Time (msec) 7 QFS Time (msec)

16 How Well Does It Work
Reliable at Scale
- Hundreds of days of metaserver uptime common.
- Quantcast MapReduce sorter uses QFS as distributed virtualized store instead of local disk.
- 8 petabytes of compressed data.
- Close to 1 billion chunks.
- 7,500 I/O devices.
Fast and Large
- Ran petabyte sort last weekend.
- Direct I/O not hurting fast scans: Sawzall query performance similar to Presto:

| | Presto/HDFS | Turbo/QFS |
|---|---|---|
| Seconds | | |
| Rows | 920 M | 970 M |
| Bytes | 31 G | 294 G |
| Rows/sec | 57.5 M | 60.6 M |
| Bytes/sec | 2.0 G | 18.4 G |

Easy to Use
- 1 Ops Engineer for QFS and MapReduce on 1,000+ node cluster.
- Neustar set up multi-petabyte instance without help from Quantcast.
- Migrate from HDFS using hadoop distcp.
- Hadoop MapReduce just works on QFS.

17 Metaserver Statistics in Production
QFS metaserver statistics over Quantcast production file systems in July
- High Availability is nice to have but not a must-have for MapReduce. There are certainly other use cases where High Availability is a must.
- Federation may be needed to support file systems beyond 10 PB, depending on file size.

18 Who will find QFS valuable?
Likely to benefit from QFS
- Existing Hadoop users with large-scale data clusters.
- Data-heavy, tech-savvy organizations for whom performance and efficient use of hardware are high priorities.
May find HDFS a better fit
- Small or new Hadoop deployments, as HDFS has been deployed in a broader variety of production environments.
- Clusters with slow or unpredictable network connectivity.
- Environments needing specific HDFS features such as head node federation or hot standby.

19 Summary: Key Benefits of QFS
- Delivers a stable, high-performance alternative to HDFS in a production-hardened 1.0 release.
- Offers high-performance management of multi-petabyte workloads.
- Faster I/O than HDFS with half the disk space.
- Fully compatible with Apache Hadoop.
- 100% open source.
Quantcast 2012

20 Future Work: What QFS Doesn't Have Just Yet
- Kerberos security: under development.
- HA: no strong case at Quantcast, but nice to have.
- Federation: not a strong case either at Quantcast.
Contributions welcome.

21 Thank You. Questions?
Download QFS for free at: github.com/quantcast/qfs
- San Francisco: 201 Third Street, San Francisco, CA
- New York: 432 Park Avenue South, New York, NY
- London: 48 Charlotte Street, London, W1T 2NS

The Quantcast File System

The Quantcast File System The Quantcast File System Michael Ovsiannikov Quantcast movsiannikov@quantcast Paul Sutter Quantcast psutter@quantcast.com Silvius Rus Quantcast srus@quantcast.com Sriram Rao Microsoft sriramra@microsoft.com

More information

Design and Evolution of the Apache Hadoop File System(HDFS)

Design and Evolution of the Apache Hadoop File System(HDFS) Design and Evolution of the Apache Hadoop File System(HDFS) Dhruba Borthakur Engineer@Facebook Committer@Apache HDFS SDC, Sept 19 2011 Outline Introduction Yet another file-system, why? Goals of Hadoop

More information

EXPERIMENTATION. HARRISON CARRANZA School of Computer Science and Mathematics

EXPERIMENTATION. HARRISON CARRANZA School of Computer Science and Mathematics BIG DATA WITH HADOOP EXPERIMENTATION HARRISON CARRANZA Marist College APARICIO CARRANZA NYC College of Technology CUNY ECC Conference 2016 Poughkeepsie, NY, June 12-14, 2016 Marist College AGENDA Contents

More information

Enable and Optimize Erasure Code for Big data on Hadoop. High Performance Computing, Intel Jun Jin

Enable and Optimize Erasure Code for Big data on Hadoop. High Performance Computing, Intel Jun Jin Enable and Optimize Erasure Code for Big data on Hadoop High Performance Computing, Intel Jun Jin (jun.i.jin@intel.com) Agenda Background and overview Codec performance Block placement Policy Performance

More information

The Google File System

The Google File System The Google File System By Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung (Presented at SOSP 2003) Introduction Google search engine. Applications process lots of data. Need good file system. Solution:

More information

Scala Storage Scale-Out Clustered Storage White Paper

Scala Storage Scale-Out Clustered Storage White Paper White Paper Scala Storage Scale-Out Clustered Storage White Paper Chapter 1 Introduction... 3 Capacity - Explosive Growth of Unstructured Data... 3 Performance - Cluster Computing... 3 Chapter 2 Current

More information

Big Fast Data Hadoop acceleration with Flash. June 2013

Big Fast Data Hadoop acceleration with Flash. June 2013 Big Fast Data Hadoop acceleration with Flash June 2013 Agenda The Big Data Problem What is Hadoop Hadoop and Flash The Nytro Solution Test Results The Big Data Problem Big Data Output Facebook Traditional

More information

Apache Hadoop. Alexandru Costan

Apache Hadoop. Alexandru Costan 1 Apache Hadoop Alexandru Costan Big Data Landscape No one-size-fits-all solution: SQL, NoSQL, MapReduce, No standard, except Hadoop 2 Outline What is Hadoop? Who uses it? Architecture HDFS MapReduce Open

More information

Optimizing Dell PowerEdge Configurations for Hadoop

Optimizing Dell PowerEdge Configurations for Hadoop Optimizing Dell PowerEdge Configurations for Hadoop Understanding how to get the most out of Hadoop running on Dell hardware A Dell technical white paper July 2013 Michael Pittaro Principal Architect,

More information

Lecture 5: GFS & HDFS! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl

Lecture 5: GFS & HDFS! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl Big Data Processing, 2014/15 Lecture 5: GFS & HDFS!! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl 1 Course content Introduction Data streams 1 & 2 The MapReduce paradigm Looking behind

More information

Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms

Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms Distributed File System 1 How do we get data to the workers? NAS Compute Nodes SAN 2 Distributed File System Don t move data to workers move workers to the data! Store data on the local disks of nodes

More information

Take An Internal Look at Hadoop. Hairong Kuang Grid Team, Yahoo! Inc hairong@yahoo-inc.com

Take An Internal Look at Hadoop. Hairong Kuang Grid Team, Yahoo! Inc hairong@yahoo-inc.com Take An Internal Look at Hadoop Hairong Kuang Grid Team, Yahoo! Inc hairong@yahoo-inc.com What s Hadoop Framework for running applications on large clusters of commodity hardware Scale: petabytes of data

More information

Hadoop & its Usage at Facebook

Hadoop & its Usage at Facebook Hadoop & its Usage at Facebook Dhruba Borthakur Project Lead, Hadoop Distributed File System dhruba@apache.org Presented at the Storage Developer Conference, Santa Clara September 15, 2009 Outline Introduction

More information

Hadoop Distributed File System. T-111.5550 Seminar On Multimedia 2009-11-11 Eero Kurkela

Hadoop Distributed File System. T-111.5550 Seminar On Multimedia 2009-11-11 Eero Kurkela Hadoop Distributed File System T-111.5550 Seminar On Multimedia 2009-11-11 Eero Kurkela Agenda Introduction Flesh and bones of HDFS Architecture Accessing data Data replication strategy Fault tolerance

More information

Hadoop & its Usage at Facebook

Hadoop & its Usage at Facebook Hadoop & its Usage at Facebook Dhruba Borthakur Project Lead, Hadoop Distributed File System dhruba@apache.org Presented at the The Israeli Association of Grid Technologies July 15, 2009 Outline Architecture

More information

Benchmarking Hadoop & HBase on Violin

Benchmarking Hadoop & HBase on Violin Technical White Paper Report Technical Report Benchmarking Hadoop & HBase on Violin Harnessing Big Data Analytics at the Speed of Memory Version 1.0 Abstract The purpose of benchmarking is to show advantages

More information

Data Warehousing and Analytics Infrastructure at Facebook. Ashish Thusoo & Dhruba Borthakur athusoo,dhruba@facebook.com

Data Warehousing and Analytics Infrastructure at Facebook. Ashish Thusoo & Dhruba Borthakur athusoo,dhruba@facebook.com Data Warehousing and Analytics Infrastructure at Facebook Ashish Thusoo & Dhruba Borthakur athusoo,dhruba@facebook.com Overview Challenges in a Fast Growing & Dynamic Environment Data Flow Architecture,

More information

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB Planet Size Data!? Gartner s 10 key IT trends for 2012 unstructured data will grow some 80% over the course of the next

More information

Parallel IO. Single namespace. Performance. Disk locality awareness? Data integrity. Fault tolerance. Standard interface. Network of disks?

Parallel IO. Single namespace. Performance. Disk locality awareness? Data integrity. Fault tolerance. Standard interface. Network of disks? PARALLEL IO Parallel IO Single namespace Network of disks? Performance Data replication Multiple I/O paths Disk locality awareness? Data integrity Multiple writers Locking? Fault tolerance Hardware failure

More information

Big Data Technology Core Hadoop: HDFS-YARN Internals

Big Data Technology Core Hadoop: HDFS-YARN Internals Big Data Technology Core Hadoop: HDFS-YARN Internals Eshcar Hillel Yahoo! Ronny Lempel Outbrain *Based on slides by Edward Bortnikov & Ronny Lempel Roadmap Previous class Map-Reduce Motivation This class

More information

An Alternative Storage Solution for MapReduce. Eric Lomascolo Director, Solutions Marketing

An Alternative Storage Solution for MapReduce. Eric Lomascolo Director, Solutions Marketing An Alternative Storage Solution for MapReduce Eric Lomascolo Director, Solutions Marketing MapReduce Breaks the Problem Down Data Analysis Distributes processing work (Map) across compute nodes and accumulates

More information

Use of Hadoop File System for Nuclear Physics Analyses in STAR

Use of Hadoop File System for Nuclear Physics Analyses in STAR 1 Use of Hadoop File System for Nuclear Physics Analyses in STAR EVAN SANGALINE UC DAVIS Motivations 2 Data storage a key component of analysis requirements Transmission and storage across diverse resources

More information

HADOOP ON ORACLE ZFS STORAGE A TECHNICAL OVERVIEW

HADOOP ON ORACLE ZFS STORAGE A TECHNICAL OVERVIEW HADOOP ON ORACLE ZFS STORAGE A TECHNICAL OVERVIEW 757 Maleta Lane, Suite 201 Castle Rock, CO 80108 Brett Weninger, Managing Director brett.weninger@adurant.com Dave Smelker, Managing Principal dave.smelker@adurant.com

More information

CS54100: Database Systems

CS54100: Database Systems CS54100: Database Systems Cloud Databases: The Next Post- Relational World 18 April 2012 Prof. Chris Clifton Beyond RDBMS The Relational Model is too limiting! Simple data model doesn t capture semantics

More information

An Oracle White Paper June 2012. High Performance Connectors for Load and Access of Data from Hadoop to Oracle Database

An Oracle White Paper June 2012. High Performance Connectors for Load and Access of Data from Hadoop to Oracle Database An Oracle White Paper June 2012 High Performance Connectors for Load and Access of Data from Hadoop to Oracle Database Executive Overview... 1 Introduction... 1 Oracle Loader for Hadoop... 2 Oracle Direct

More information

Petabyte Scale Data at Facebook. Dhruba Borthakur, Engineer at Facebook, SIGMOD, New York, June 2013

Petabyte Scale Data at Facebook. Dhruba Borthakur, Engineer at Facebook, SIGMOD, New York, June 2013 Petabyte Scale Data at Facebook Dhruba Borthakur, Engineer at Facebook, SIGMOD, New York, June 2013 Agenda 1 Types of Data 2 Data Model and API for Facebook Graph Data 3 SLTP (Semi-OLTP) and Analytics

More information

Apache Hadoop FileSystem and its Usage in Facebook

Apache Hadoop FileSystem and its Usage in Facebook Apache Hadoop FileSystem and its Usage in Facebook Dhruba Borthakur Project Lead, Apache Hadoop Distributed File System dhruba@apache.org Presented at Indian Institute of Technology November, 2010 http://www.facebook.com/hadoopfs

More information

Big + Fast + Safe + Simple = Lowest Technical Risk

Big + Fast + Safe + Simple = Lowest Technical Risk Big + Fast + Safe + Simple = Lowest Technical Risk The Synergy of Greenplum and Isilon Architecture in HP Environments Steffen Thuemmel (Isilon) Andreas Scherbaum (Greenplum) 1 Our problem 2 What is Big

More information

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture DATA MINING WITH HADOOP AND HIVE Introduction to Architecture Dr. Wlodek Zadrozny (Most slides come from Prof. Akella s class in 2014) 2015-2025. Reproduction or usage prohibited without permission of

More information

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM Sneha D.Borkar 1, Prof.Chaitali S.Surtakar 2 Student of B.E., Information Technology, J.D.I.E.T, sborkar95@gmail.com Assistant Professor, Information

More information

Using Hadoop to Expand Data Warehousing

Using Hadoop to Expand Data Warehousing Using Hadoop to Expand Data Warehousing Mike Peterson VP of Platforms and Data Architecture, Neustar Feb 28, 2013 1 Copyright Think Big Analytics and Neustar Inc. Why do this? Transforming to an Information

More information

THE HADOOP DISTRIBUTED FILE SYSTEM

THE HADOOP DISTRIBUTED FILE SYSTEM THE HADOOP DISTRIBUTED FILE SYSTEM Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler Presented by Alexander Pokluda October 7, 2013 Outline Motivation and Overview of Hadoop Architecture,

More information

Oracle Big Data SQL Technical Update

Oracle Big Data SQL Technical Update Oracle Big Data SQL Technical Update Jean-Pierre Dijcks Oracle Redwood City, CA, USA Keywords: Big Data, Hadoop, NoSQL Databases, Relational Databases, SQL, Security, Performance Introduction This technical

More information

Hadoop IST 734 SS CHUNG

Hadoop IST 734 SS CHUNG Hadoop IST 734 SS CHUNG Introduction What is Big Data?? Bulk Amount Unstructured Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per day) If a regular machine need to

More information

The Google File System

The Google File System The Google File System Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung SOSP 03 CSC 456 Operating Systems Seminar Presentation (11/11/2010) Elif Eyigöz, Walter Lasecki Outline Background Architecture

More information

Facebook Storage Tiers

Facebook Storage Tiers Facebook Storage Tiers How the fleet is divided: Facebook runs three main storage tiers: Type III MySQL Database (user data) Type IV Hadoop (site activity analytics and logs, messages) Type V Haystack

More information

VIRTUOZZO STORAGE VS. CEPH

VIRTUOZZO STORAGE VS. CEPH VIRTUOZZO STORAGE VS. CEPH I/O Performance Comparison April 29, 2016 Executive Summary Software-defined storage (SDS) is one of the key technologies IT organizations are looking toward as they explore

More information

GPFS Storage Server. Concepts and Setup in Lemanicus BG/Q system" Christian Clémençon (EPFL-DIT)" " 4 April 2013"

GPFS Storage Server. Concepts and Setup in Lemanicus BG/Q system Christian Clémençon (EPFL-DIT)  4 April 2013 GPFS Storage Server Concepts and Setup in Lemanicus BG/Q system" Christian Clémençon (EPFL-DIT)" " Agenda" GPFS Overview" Classical versus GSS I/O Solution" GPFS Storage Server (GSS)" GPFS Native RAID

More information

Apache Hadoop FileSystem Internals

Apache Hadoop FileSystem Internals Apache Hadoop FileSystem Internals Dhruba Borthakur Project Lead, Apache Hadoop Distributed File System dhruba@apache.org Presented at Storage Developer Conference, San Jose September 22, 2010 http://www.facebook.com/hadoopfs

More information

NoSQL Data Base Basics

NoSQL Data Base Basics NoSQL Data Base Basics Course Notes in Transparency Format Cloud Computing MIRI (CLC-MIRI) UPC Master in Innovation & Research in Informatics Spring- 2013 Jordi Torres, UPC - BSC www.jorditorres.eu HDFS

More information

Big Data Use Case. How Rackspace is using Private Cloud for Big Data. Bryan Thompson. May 8th, 2013

Big Data Use Case. How Rackspace is using Private Cloud for Big Data. Bryan Thompson. May 8th, 2013 Big Data Use Case How Rackspace is using Private Cloud for Big Data Bryan Thompson May 8th, 2013 Our Big Data Problem Consolidate all monitoring data for reporting and analytical purposes. Every device

More information

NextGen Infrastructure for Big DATA Analytics.

NextGen Infrastructure for Big DATA Analytics. NextGen Infrastructure for Big DATA Analytics. So What is Big Data? Data that exceeds the processing capacity of conven4onal database systems. The data is too big, moves too fast, or doesn t fit the structures

More information

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org dhruba@facebook.com

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org dhruba@facebook.com Hadoop Distributed File System Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org dhruba@facebook.com Hadoop, Why? Need to process huge datasets on large clusters of computers

More information

HDFS. Hadoop Distributed File System

HDFS. Hadoop Distributed File System HDFS Kevin Swingler Hadoop Distributed File System File system designed to store VERY large files Streaming data access Running across clusters of commodity hardware Resilient to node failure 1 Large files

More information

Hadoop: Embracing future hardware

Hadoop: Embracing future hardware Hadoop: Embracing future hardware Suresh Srinivas @suresh_m_s Page 1 About Me Architect & Founder at Hortonworks Long time Apache Hadoop committer and PMC member Designed and developed many key Hadoop

More information

Data-Intensive Computing with Map-Reduce and Hadoop

Data-Intensive Computing with Map-Reduce and Hadoop Data-Intensive Computing with Map-Reduce and Hadoop Shamil Humbetov Department of Computer Engineering Qafqaz University Baku, Azerbaijan humbetov@gmail.com Abstract Every day, we create 2.5 quintillion

More information

High Throughput WAN Data Transfer with Hadoop-based Storage

High Throughput WAN Data Transfer with Hadoop-based Storage High Throughput WAN Data Transfer with Hadoop-based Storage A Amin 2, B Bockelman 4, J Letts 1, T Levshina 3, T Martin 1, H Pi 1, I Sfiligoi 1, M Thomas 2, F Wüerthwein 1 1 University of California, San

More information

Driving IBM BigInsights Performance Over GPFS Using InfiniBand+RDMA

Driving IBM BigInsights Performance Over GPFS Using InfiniBand+RDMA WHITE PAPER April 2014 Driving IBM BigInsights Performance Over GPFS Using InfiniBand+RDMA Executive Summary...1 Background...2 File Systems Architecture...2 Network Architecture...3 IBM BigInsights...5

More information

Actian SQL in Hadoop Buyer s Guide

Actian SQL in Hadoop Buyer s Guide Actian SQL in Hadoop Buyer s Guide Contents Introduction: Big Data and Hadoop... 3 SQL on Hadoop Benefits... 4 Approaches to SQL on Hadoop... 4 The Top 10 SQL in Hadoop Capabilities... 5 SQL in Hadoop

More information

Accelerating and Simplifying Apache

Accelerating and Simplifying Apache Accelerating and Simplifying Apache Hadoop with Panasas ActiveStor White paper NOvember 2012 1.888.PANASAS www.panasas.com Executive Overview The technology requirements for big data vary significantly

More information

Windows Server 2008 R2 Essentials

Windows Server 2008 R2 Essentials Windows Server 2008 R2 Essentials Installation, Deployment and Management 2 First Edition 2010 Payload Media. This ebook is provided for personal use only. Unauthorized use, reproduction and/or distribution

More information

Hadoop Architecture. Part 1

Hadoop Architecture. Part 1 Hadoop Architecture Part 1 Node, Rack and Cluster: A node is simply a computer, typically non-enterprise, commodity hardware for nodes that contain data. Consider we have Node 1.Then we can add more nodes,

More information

Sawmill Log Analyzer Best Practices!! Page 1 of 6. Sawmill Log Analyzer Best Practices

Sawmill Log Analyzer Best Practices!! Page 1 of 6. Sawmill Log Analyzer Best Practices Sawmill Log Analyzer Best Practices!! Page 1 of 6 Sawmill Log Analyzer Best Practices! Sawmill Log Analyzer Best Practices!! Page 2 of 6 This document describes best practices for the Sawmill universal

More information

GraySort on Apache Spark by Databricks

GraySort on Apache Spark by Databricks GraySort on Apache Spark by Databricks Reynold Xin, Parviz Deyhim, Ali Ghodsi, Xiangrui Meng, Matei Zaharia Databricks Inc. Apache Spark Sorting in Spark Overview Sorting Within a Partition Range Partitioner

More information

Google File System. Web and scalability

Google File System. Web and scalability Google File System Web and scalability The web: - How big is the Web right now? No one knows. - Number of pages that are crawled: o 100,000 pages in 1994 o 8 million pages in 2005 - Crawlable pages might

More information

Extending Hadoop beyond MapReduce

Extending Hadoop beyond MapReduce Extending Hadoop beyond MapReduce Mahadev Konar Co-Founder @mahadevkonar (@hortonworks) Page 1 Bio Apache Hadoop since 2006 - committer and PMC member Developed and supported Map Reduce @Yahoo! - Core

More information

CC5212 1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2016. Lecture 4: DFS & MapReduce I. Aidan Hogan

CC5212 1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2016. Lecture 4: DFS & MapReduce I. Aidan Hogan CC5212 1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2016 Lecture 4: DFS & MapReduce I Aidan Hogan aidhog@gmail.com Fundamentals of Distributed Systems MASSIVE DATA PROCESSING (THE GOOGLE WAY ) Inside Google circa

More information

MASSIVE DATA PROCESSING (THE GOOGLE WAY ) 27/04/2015. Fundamentals of Distributed Systems. Inside Google circa 2015

MASSIVE DATA PROCESSING (THE GOOGLE WAY ) 27/04/2015. Fundamentals of Distributed Systems. Inside Google circa 2015 7/04/05 Fundamentals of Distributed Systems CC5- PROCESAMIENTO MASIVO DE DATOS OTOÑO 05 Lecture 4: DFS & MapReduce I Aidan Hogan aidhog@gmail.com Inside Google circa 997/98 MASSIVE DATA PROCESSING (THE

More information

Hadoop vs Apache Spark

Hadoop vs Apache Spark Innovate, Integrate, Transform Hadoop vs Apache Spark www.altencalsoftlabs.com Introduction Any sufficiently advanced technology is indistinguishable from magic. said Arthur C. Clark. Big data technologies

More information

High Performance NAS for Hadoop

High Performance NAS for Hadoop High Performance NAS for Hadoop HPC ADVISORY COUNCIL, STANFORD FEB 8, 2013 DR. BRENT WELCH, CTO, PANASAS Panasas and Hadoop PANASAS TECHNICAL DIFFERENTIATION Scalable Performance Balanced object-storage

More information

Big data management with IBM General Parallel File System

Big data management with IBM General Parallel File System Big data management with IBM General Parallel File System Optimize storage management and boost your return on investment Highlights Handles the explosive growth of structured and unstructured data Offers

More information

Tableau Server Scalability Explained

Tableau Server Scalability Explained Tableau Server Scalability Explained Author: Neelesh Kamkolkar Tableau Software July 2013 p2 Executive Summary In March 2013, we ran scalability tests to understand the scalability of Tableau 8.0. We wanted

More information

The Rise of Industrial Big Data. Brian Courtney General Manager Industrial Data Intelligence

The Rise of Industrial Big Data. Brian Courtney General Manager Industrial Data Intelligence The Rise of Industrial Big Data Brian Courtney General Manager Industrial Data Intelligence Agenda Introduction Big Data for the industrial sector Case in point: Big data saves millions at GE Energy Seeking

More information

The Business Intelligence for Hadoop Benchmark

The Business Intelligence for Hadoop Benchmark The Business Intelligence for Hadoop Benchmark Q1 2016 Table of Contents Hadoop as an Analytics Platform Executive Summary: Key Findings The Business Intelligence Evaluation Framework Benchmark Data Set

More information

CDH AND BUSINESS CONTINUITY:

CDH AND BUSINESS CONTINUITY: WHITE PAPER CDH AND BUSINESS CONTINUITY: An overview of the availability, data protection and disaster recovery features in Hadoop Abstract Using the sophisticated built-in capabilities of CDH for tunable

More information

Non-Stop Hadoop Paul Scott-Murphy VP Field Techincal Service, APJ. Cloudera World Japan November 2014

Non-Stop Hadoop Paul Scott-Murphy VP Field Techincal Service, APJ. Cloudera World Japan November 2014 Non-Stop Hadoop Paul Scott-Murphy VP Field Techincal Service, APJ Cloudera World Japan November 2014 WANdisco Background WANdisco: Wide Area Network Distributed Computing Enterprise ready, high availability

More information

Configuration Maximums VMware Infrastructure 3

Configuration Maximums VMware Infrastructure 3 Technical Note Configuration s VMware Infrastructure 3 When you are selecting and configuring your virtual and physical equipment, you must stay at or below the maximums supported by VMware Infrastructure

More information

POSIX and Object Distributed Storage Systems

POSIX and Object Distributed Storage Systems 1 POSIX and Object Distributed Storage Systems Performance Comparison Studies With Real-Life Scenarios in an Experimental Data Taking Context Leveraging OpenStack Swift & Ceph by Michael Poat, Dr. Jerome

More information

www.basho.com Technical Overview Simple, Scalable, Object Storage Software

www.basho.com Technical Overview Simple, Scalable, Object Storage Software www.basho.com Technical Overview Simple, Scalable, Object Storage Software Table of Contents Table of Contents... 1 Introduction & Overview... 1 Architecture... 2 How it Works... 2 APIs and Interfaces...

More information

CSE-E5430 Scalable Cloud Computing Lecture 2

CSE-E5430 Scalable Cloud Computing Lecture 2 CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 14.9-2015 1/36 Google MapReduce A scalable batch processing

More information

BlueArc unified network storage systems 7th TF-Storage Meeting. Scale Bigger, Store Smarter, Accelerate Everything

BlueArc's Heritage: Private Company, founded in 1998. Headquarters in San Jose, CA. Highest

Energy Efficient MapReduce

Motivation: Energy consumption is an important aspect of datacenter efficiency; the total power consumption in the United States doubled from 2000 to 2005, representing

SAS Grid Manager Testing and Benchmarking Best Practices for SAS Intelligence Platform

Introduction: Grid computing offers optimization of applications that analyze enormous amounts of data as well as load

HDFS Under the Hood. Sanjay Radia. Sradia@yahoo-inc.com Grid Computing, Hadoop Yahoo Inc.

Outline: Overview of Hadoop, an open source project; Design of HDFS; Ongoing work. Hadoop: Hadoop provides a framework

Uptime Infrastructure Monitor. Installation Guide

This guide will walk through each step of installation for Uptime Infrastructure Monitor software on a Windows server. Uptime Infrastructure Monitor is

RAID. Tiffany Yu-Han Chen. # The performance of different RAID levels # read/write/reliability (fault-tolerant)/overhead

Tiffany Yu-Han Chen (these slides modified from Hao-Hua Chu, National Taiwan University). RAID 0 - Striping

The Hadoop Distributed File System

Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler. Yahoo!, Sunnyvale, California, USA. {Shv, Hairong, SRadia, Chansler}@Yahoo-Inc.com. Presenter: Alex Hu. HDFS

SQL Server Business Intelligence on HP ProLiant DL785 Server

By Ajay Goyal (www.scalabilityexperts.com) and Mike Fitzner, Hewlett Packard (www.hp.com). Recommendations presented in this document should be thoroughly

Cloud Storage. Parallels. Performance Benchmark Results. White Paper. www.parallels.com

Table of Contents: Executive Summary; Architecture Overview; Key Features; No Special Hardware Requirements...

CS2510 Computer Operating Systems

Hadoop Distributed File System. Dr. Taieb Znati, Computer Science Department, University of Pittsburgh. Outline: HDFS Design Issues; HDFS Application Profile; Block Abstraction

An Oracle White Paper November 2010. Leveraging Massively Parallel Processing in an Oracle Environment for Big Data Analytics

Introduction: New applications such as web searches, recommendation engines,

Benchmarking Cassandra on Violin

Technical Report: Accelerating Cassandra Performance and Reducing Read Latency With Violin Memory Flash-based Storage Arrays. Version 1.0. Abstract

IBM Software Information Management Creating an Integrated, Optimized, and Secure Enterprise Data Platform:

IBM PureData System for Transactions with SafeNet's ProtectDB and DataSecure. Table of contents: 1. Data, Data, Everywhere... 2.

IBM Netezza High Capacity Appliance

Petascale Data Archival, Analysis and Disaster Recovery Solutions. Highlights: Allows querying and analysis of deep archival data

HDFS Users Guide. Table of contents

Table of contents: 1 Purpose; 2 Overview; 3 Prerequisites; 4 Web Interface; 5 Shell Commands; 5.1 DFSAdmin Command; 6 Secondary NameNode; 7 Checkpoint Node; 8 Backup Node; 9

Parallel Processing of cluster by Map Reduce

Madhavi Vaidya, Department of Computer Science, Vivekanand College, Chembur, Mumbai, vamadhavi04@yahoo.co.in. Abstract: MapReduce is a parallel programming model

GeoGrid Project and Experiences with Hadoop

Gong Zhang and Ling Liu, Distributed Data Intensive Systems Lab (DiSL), Center for Experimental Computer Systems Research (CERCS), Georgia Institute of Technology

Agenda. Enterprise Application Performance Factors. Current form of Enterprise Applications. Factors to Application Performance.

Agenda: Enterprise Performance Factors; Overall Enterprise Performance Factors; Best Practice for generic Enterprise; Best Practice for 3-tiers Enterprise; Hardware Load Balancer; Basic Unix Tuning; Performance

Configuration Maximums VMware Infrastructure 3: Update 2 and later for ESX Server 3.5, ESX Server 3i version 3.5, VirtualCenter 2.

Configuration Maximums, VMware Infrastructure 3: Update 2 and later for ESX Server 3.5, ESX Server 3i version 3.5, VirtualCenter 2.5. When you are selecting and configuring your virtual and physical equipment,

Open source Google-style large scale data analysis with Hadoop

Ioannis Konstantinou (Email: ikons@cslab.ece.ntua.gr, Web: http://www.cslab.ntua.gr/~ikons), Computing Systems Laboratory, School of Electrical

Scalable Cloud Computing Solutions for Next Generation Sequencing Data

Matti Niemenmaa (1), Aleksi Kallio (2), André Schumacher (1), Petri Klemelä (2), Eija Korpelainen (2), and Keijo Heljanko (1). (1) Department of

Get More Scalability and Flexibility for Big Data

Solution Overview: LexisNexis High-Performance Computing Cluster Systems Platform. What You Will Learn: Modern enterprises are challenged with the need to store and

Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 14

Lecture 14. Big Data Management IV: Big-data Infrastructures (Background, IO, From NFS to HDFS). Chapter 14-15: Abideboul

Traditional v/s CONVRGD

Traditional Virtualization Stack Converged Virtualization Infrastructure with HCE/HSE Data protection software applications PDU Backup Servers + Virtualization Storage Switch HA

PARALLELS CLOUD STORAGE

Performance Benchmark Results. Table of Contents: Executive Summary; Architecture Overview; Key Features; No Special Hardware Requirements...

Big Data: Conor Duffy: Enterprise Strategist. Ziad Najjar: National Director, ESG Specialist team April 2014

Mega Trends impacting IT Big Data Cloud 69% of CXOs expect to see a significant or complete

Four Orders of Magnitude: Running Large Scale Accumulo Clusters. Aaron Cordova Accumulo Summit, June 2014

Scale, Security, Schema. Scale: to scale 1 - (vt) to change the size of something: let's scale the

Direct NFS - Design considerations for next-gen NAS appliances optimized for database workloads Akshay Shah Gurmeet Goindi Oracle

Agenda: Introduction; Database Architecture; Direct NFS Client; NFS Server

Distributed File Systems

Mauro Fruet, University of Trento - Italy, 2011/12/19. Outline: 1 Distributed File Systems; 2 The Google File System (GFS)