Big Data: A Storage Systems Perspective
Muthukumar Murugan, Ph.D.
HP Storage Division




In this talk
- Big data storage: current trends
- Issues with current storage options
- Evolution of storage to support big data applications
- Hadoop is not a solution to a data problem!

Big Data: The Storage Concerns!
- Volume: petascale / exascale data
- Velocity: frequency of generation
- Variety: largely unstructured / semi-structured
- Value: frequency of analysis
- Computation model: parallel tasks, scale-out architecture
How much are you worth to Zuckerberg?

A typical big data ecosystem
- Data mining and analytics applications
- High-level language (e.g. Pig Latin, HiveQL)
- Structured databases (e.g. HBase, Hive)
- Storage framework (e.g. HDFS, Cassandra)
- Storage (DAS / networked)

Big Data Storage Model 1
- Centralized metadata node
- Datanodes store data in local disks
- Clients talk to the metadata node and then to the datanodes
- e.g. Hadoop
[Diagram labels: Client, Name, Data, Data, Data]
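The two-step read path on this slide (ask the metadata node, then fetch from the datanodes) can be sketched as below. This is a minimal in-memory illustration, not the HDFS API: the class names `MetadataNode` and `DataNode`, the round-robin placement, and the block-id format are all assumptions made for the example.

```python
# Hypothetical in-memory stand-ins for Model 1; not the HDFS API.
class MetadataNode:
    """Central index: file name -> ordered list of (block_id, datanode name)."""

    def __init__(self):
        self.block_map = {}

    def add_block(self, filename, block_id, datanode_name):
        self.block_map.setdefault(filename, []).append((block_id, datanode_name))

    def locate(self, filename):
        return list(self.block_map[filename])


class DataNode:
    """Holds block contents on its 'local disk' (a dict in this sketch)."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def write(self, block_id, data):
        self.blocks[block_id] = data

    def read(self, block_id):
        return self.blocks[block_id]


def write_file(meta, datanodes, filename, chunks):
    """Place chunks round-robin (an assumed policy) and register each one centrally."""
    names = sorted(datanodes)
    for i, chunk in enumerate(chunks):
        node = datanodes[names[i % len(names)]]
        block_id = f"{filename}#{i}"
        node.write(block_id, chunk)
        meta.add_block(filename, block_id, node.name)


def read_file(meta, datanodes, filename):
    """Step 1: look up block locations. Step 2: read each block from its datanode."""
    locations = meta.locate(filename)
    return b"".join(datanodes[name].read(block_id) for block_id, name in locations)


datanodes = {n: DataNode(n) for n in ("dn1", "dn2", "dn3")}
meta = MetadataNode()
write_file(meta, datanodes, "log.txt", [b"aa", b"bb", b"cc", b"dd"])
result = read_file(meta, datanodes, "log.txt")
```

Every read passes through `meta.locate` first, which is why the metadata node is the component that has to stay available and fast in this model.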

Big Data Storage Model 2
- No centralized metadata node
- Datanodes store data in local disks
- Clients are routed to the appropriate node based on hash prefix
- e.g. Cassandra
[Diagram labels: Client, hash prefix based routing, Data, Data, Data, Data]
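Hash prefix based routing can be illustrated with a few lines of Python. This is a simplified sketch, not Cassandra's actual partitioner: the node names, the choice of MD5, and the equal-sized ranges are assumptions. The point is only that any client can compute the owning node locally, with no metadata lookup.

```python
import hashlib

NODES = ["node0", "node1", "node2", "node3"]  # hypothetical cluster members


def route(key, nodes=NODES):
    """Map a key to a node using the first 32 bits of its hash.

    The 32-bit prefix space is split into len(nodes) equal ranges,
    and each node owns one range (an assumed, static assignment).
    """
    digest = hashlib.md5(key.encode("utf-8")).digest()
    prefix = int.from_bytes(digest[:4], "big")   # value in [0, 2**32)
    index = (prefix * len(nodes)) >> 32          # value in [0, len(nodes))
    return nodes[index]


owner = route("user:42")
```

Because the mapping depends only on the key and the node list, every client reaches the same answer independently. Real systems add replication and handle membership changes, which this sketch leaves out.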

Computation Model
[Diagram: four "Data + Compute" nodes, each running tasks (e.g. Map/Reduce), connected by shuffle steps]
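The map, shuffle, and reduce flow in this diagram can be sketched in a single process. The word-count job, the function names, and the byte-sum partitioner below are assumptions chosen for illustration; they do not come from the talk or from Hadoop's API.

```python
from collections import defaultdict


def map_task(lines):
    """Runs next to the data: emit (word, 1) for every word in the local lines."""
    for line in lines:
        for word in line.split():
            yield word, 1


def shuffle(mapped_outputs, num_reducers):
    """Group values by key and send each key to exactly one reducer partition."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for output in mapped_outputs:
        for key, value in output:
            target = sum(key.encode("utf-8")) % num_reducers  # assumed partitioner
            partitions[target][key].append(value)
    return partitions


def reduce_task(partition):
    """Combine all values collected for each key."""
    return {key: sum(values) for key, values in partition.items()}


def run_job(node_local_data, num_reducers=2):
    mapped = [list(map_task(lines)) for lines in node_local_data]
    counts = {}
    for partition in shuffle(mapped, num_reducers):
        counts.update(reduce_task(partition))
    return counts


# Each inner list stands for the data held by one "Data + Compute" node.
node_local_data = [["big data", "data storage"], ["storage storage"]]
counts = run_job(node_local_data)
```

Only the shuffle step moves data between nodes. The map tasks read what is already local, which is the coupling of data and compute that the later slides question.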

Big Data Storage Access Patterns
- Typically write-once, read-many workloads
- Metadata lookups, object reads
- Large blocks/objects: 64 MB to 128 MB (e.g. Hadoop MapReduce)
- Small accesses: e.g. HBase, Cassandra
[Diagram labels: Objects, Files, Get(), Put(), Objects, Local File System, Files, Local Disks]
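To give a feel for the gap between the large-block and small-access patterns, the sketch below counts how many requests it takes to cover the same amount of data. The 1 GiB dataset and the 4 KiB record size are assumptions for illustration; only the 64 MB and 128 MB block sizes come from the slide.

```python
MIB = 1024 ** 2
KIB = 1024


def request_count(total_bytes, access_bytes):
    """Number of fixed-size accesses needed to cover total_bytes (ceiling division)."""
    return -(-total_bytes // access_bytes)


dataset = 1024 * MIB  # 1 GiB, an assumed dataset size

large_block_reads = request_count(dataset, 128 * MIB)  # batch-style reads
small_block_reads = request_count(dataset, 64 * MIB)
record_reads = request_count(dataset, 4 * KIB)         # assumed 4 KiB records
```

Under these assumptions the record-oriented workload issues tens of thousands of times more requests than the block-oriented one, so per-request latency matters far more than streaming bandwidth.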

Issues with Existing Storage Architecture

DAS: Not so smart!
- Distributing data all over the cluster makes data management difficult
- Replicated data: wastage of storage space
- Tightly coupled computation and storage
- Inflexible infrastructure
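The storage cost of replication is simple arithmetic. The sketch below assumes a replication factor of 3, which is the HDFS default; the 100 TB figure is a made-up example.

```python
def raw_capacity_tb(logical_tb, replication_factor=3):
    """Raw disk space consumed when every block is stored replication_factor times."""
    return logical_tb * replication_factor


def overhead_tb(logical_tb, replication_factor=3):
    """Space spent on extra copies, beyond the single logical copy."""
    return raw_capacity_tb(logical_tb, replication_factor) - logical_tb


logical = 100  # TB of user data, an assumed example
raw = raw_capacity_tb(logical)
extra = overhead_tb(logical)
```

With three copies, two thirds of the raw capacity holds redundant data, which is the waste this slide points at.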

Networks vs Disks: The blame game
- Over the last decade, datacenter network speeds have dramatically improved
  - 10 Gb/s Ethernet, optical networks
  - Flat network topologies
  - Soon: 40 Gb/s and 100 Gb/s Ethernet will be common
- Disks are barely keeping up
- Takeaway: data locality will no longer be an issue!
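A back-of-the-envelope calculation shows why faster networks weaken the case for data locality. The link speeds are the ones named on the slide; the 100 MB/s sequential disk throughput and the 128 MB block size (decimal megabytes) are assumptions for illustration, and protocol overhead is ignored.

```python
def transfer_seconds(size_bytes, bits_per_second):
    """Ideal time to move size_bytes over a channel, ignoring protocol overhead."""
    return size_bytes * 8 / bits_per_second


BLOCK_BYTES = 128 * 10**6        # one 128 MB block
DISK_BPS = 100 * 10**6 * 8       # assumed 100 MB/s sequential disk read
LINKS_BPS = {"10GbE": 10e9, "40GbE": 40e9, "100GbE": 100e9}

disk_time = transfer_seconds(BLOCK_BYTES, DISK_BPS)
network_times = {name: transfer_seconds(BLOCK_BYTES, bps) for name, bps in LINKS_BPS.items()}

for name, seconds in network_times.items():
    print(f"{name}: {seconds:.4f} s on the wire vs {disk_time:.2f} s off the disk")
```

Under these assumptions the disk read dominates even at 10 Gb/s, so reading a block from a remote disk costs little more than reading it locally.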

Changing times, changing values
- Value of data is constantly changing
- Not all data is equally popular
  - Recent analysis of large-scale datacenters [1]: only 10-30% of data is most popular
- Differentiated storage for big data
  - Impossible with DAS
  - Needs sophisticated storage
[Diagram: spectrum from least valuable data to most valuable data (frequency of analysis, time of generation, etc.)]
[1] Ananthanarayanan et al., HotOS 2011.
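One way to picture differentiated storage is a policy that ranks datasets by how often they are analyzed and places the most popular fraction on a faster tier. The sketch below is an assumption-laden illustration, not a method from the talk: the tier names, the dataset names, the access counts, and the 20% cutoff (picked from within the 10-30% range quoted above) are all hypothetical.

```python
def assign_tiers(access_counts, hot_fraction=0.2):
    """Place the most frequently accessed datasets on the 'fast' tier.

    access_counts maps dataset name -> number of recent accesses.
    hot_fraction is the share of datasets treated as popular.
    """
    ranked = sorted(access_counts, key=access_counts.get, reverse=True)
    hot_n = max(1, round(len(ranked) * hot_fraction)) if ranked else 0
    hot = set(ranked[:hot_n])
    return {name: ("fast" if name in hot else "capacity") for name in ranked}


# Hypothetical access counts for five datasets.
access_counts = {
    "clicks_today": 950,
    "clicks_last_week": 120,
    "clicks_last_month": 40,
    "clicks_last_year": 5,
    "raw_logs_archive": 1,
}
tiers = assign_tiers(access_counts)
```

Since popularity shifts over time, a policy like this would have to be re-evaluated periodically and followed by data migration, which is hard to do when each block is pinned to a compute node's local disks.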

New applications, new requirements
- Traditionally
  - Sequential access, large blocks
  - Task-local data access, batch jobs
- Aging data, replication
- Remote accesses dominate
- Real-time queries and online jobs
- Row/record accesses in indexed NoSQL databases (e.g. Accumulo, Hypertable)

Revisiting Big Data Storage

Rethinking storage for big data
- Shared-nothing DAS vs shared storage
- Management vs scalability
- Storage bandwidth and latency capacities
- Converging multiple storage silos
[Diagram labels: Primary Cluster, Datacenter, Analytics Cluster, Storage Management Layer]

Sharing is a virtue!
- Shared nothing is extreme: inefficient but scalable
- Shared storage resources: spindles, caches, network bandwidth
- Scale-out storage systems: scale-out object/block/file storage systems
[Diagram labels: Shared Nothing, Big Data Storage, Traditional Enterprise Shared Storage]

HA with performance guarantees
- Performance guarantees: latency, bandwidth
- Data reliability and failure resilience guarantees
- Big data archival with relaxed performance numbers
  - Compression / deduplication
[Diagram labels: Archival, Low Perf., Storage Manager]

Storage Federation
- Federated storage management
  - Integrate multiple storage islands into an archipelago
  - Varying performance/cost characteristics
- Seamless data migration
  - Dynamic workload characteristics
  - Cost/value model
[Diagram label: Storage Manager Software]

Heterogeneous storage clients
- Primary workloads
- Offline batch processing analytics jobs
- Real-time online analytics queries
[Diagram labels: Primary workloads, Real time Analytics, Converged Storage System, Offline Analytics]

Data Management options
- Storage-aware big data infrastructure
- Storage managing big data blocks
  - Storage tracks blocks
  - Dynamically migrates blocks
- Big data application aided storage
[Diagram labels: Analytics and computation, Storage System]

Storage technology trends
- Flash: Flashcache, all-flash arrays, etc.
  - Interleaved accesses
- Non-volatile memory
  - Low-latency, persistent tier
- Fast SAN
  - Fibre Channel, 40 Gb/s iSCSI, etc.

Low level changes
- Revisit block device access semantics
  - Objects, files, blocks interactions
- NVM / flash
  - Access protocols, application modifications
- Shared caches, proportional caching
- Better I/O schedulers
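The slide does not define "proportional caching". One plausible reading is that a shared cache is divided among clients in proportion to assigned weights. The sketch below illustrates that reading only; the client names, the weights, and the rounding rule are assumptions.

```python
def proportional_shares(cache_blocks, weights):
    """Split cache_blocks among clients in proportion to their weights.

    Shares are rounded down, then any leftover blocks go one each to the
    highest-weighted clients so the total always equals cache_blocks.
    """
    total = sum(weights.values())
    shares = {client: cache_blocks * w // total for client, w in weights.items()}
    leftover = cache_blocks - sum(shares.values())
    for client in sorted(weights, key=weights.get, reverse=True)[:leftover]:
        shares[client] += 1
    return shares


# Hypothetical clients of a converged storage system and their weights.
weights = {"primary": 4, "realtime_analytics": 3, "offline_analytics": 1}
shares = proportional_shares(1000, weights)
```

Weighting the latency-sensitive clients more heavily keeps a scan-heavy batch job from evicting their working set out of the shared cache.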

Summary and Conclusion
- Needed: a change in big data storage perspective
- Converged storage solutions
- Changing big data application characteristics
- Emerging technologies and performance improvements
- Overhaul traditional disk access semantics and protocols