I/O Considerations in Big Data Analytics

Transcription:

I/O Considerations in Big Data Analytics
Library of Congress, 26 September 2011
Marshall Presser, Federal Field CTO, EMC Data Computing Division

Paradigms in Big Data

Structured (relational) data
- Very large databases (100s of TB and more)
- SQL is the access method
- Can be monolithic or distributed/parallel
- Vendors distribute software only or an appliance

Unstructured (non-relational) data
- Hadoop clusters (100s of nodes and more)
- Map/Reduce is the access method
- Vendors distribute software only (mostly)

Obstacles in Big Data

Both relational and non-relational approaches must deal with I/O issues:
- Latency
- Bandwidth
- Data movement in/out of the cluster
- Backup/recovery
- High availability

MPP (Massively Parallel Processing) Shared-Nothing Architecture

- MPP has extreme scalability on general-purpose systems
- Provides automatic parallelization: just load and query like any database; Map/Reduce jobs run in parallel
- All nodes can scan and process in parallel
- Extremely scalable and I/O optimized: linear scalability by adding nodes, each adding storage, query, processing, and loading performance

[Diagram: master servers handle query planning and dispatch; a network interconnect links them to segment servers, which handle query processing and data storage; external sources feed loading and streaming; both SQL and MapReduce run across the cluster]
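To make "just load and query like any database" concrete, here is a minimal sketch (not from the slides) of working with a Greenplum-style MPP database from Python via psycopg2. The connection string, table, and column names are hypothetical; DISTRIBUTED BY is Greenplum DDL telling the cluster how to hash rows across segment servers, after which ordinary SQL is parallelized automatically.

    # Minimal sketch: loading and querying an MPP database such as Greenplum.
    # Assumptions: psycopg2 is installed and the connection string points at a
    # running cluster; the "sales" table and its columns are hypothetical.
    import psycopg2

    conn = psycopg2.connect("host=mdw dbname=analytics user=gpadmin")  # hypothetical DSN
    cur = conn.cursor()

    # DISTRIBUTED BY is Greenplum-specific DDL: rows are hashed on customer_id
    # and spread across the segment servers automatically.
    cur.execute("""
        CREATE TABLE sales (
            customer_id integer,
            sale_date   date,
            amount      numeric
        ) DISTRIBUTED BY (customer_id)
    """)

    # Loading looks like any other INSERT; the master routes rows to segments.
    cur.executemany(
        "INSERT INTO sales VALUES (%s, %s, %s)",
        [(1, "2011-09-01", 19.99), (2, "2011-09-02", 7.50)],
    )

    # An ordinary SQL query; the planner parallelizes the scan and the
    # aggregation across all segments with no hints from the user.
    cur.execute("SELECT customer_id, sum(amount) FROM sales GROUP BY customer_id")
    print(cur.fetchall())

    conn.commit()
    cur.close()
    conn.close()

The point of the sketch is that the parallelism is invisible to the client: the same calls would work unchanged against a single-node PostgreSQL instance.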

Software and Appliances in Relational Big Data

- Greenplum DCA (EMC): software and appliance
- Netezza TwinFin (IBM): appliance only
- Teradata 2580 (Teradata): appliance only
- Vertica (HP): software and appliance

All of the above use distributed data with conventional I/O; Netezza and Teradata use proprietary network software.

- Exadata (Oracle): appliance only

Oracle is the only vendor with a shared-disk model, using InfiniBand to address latency and bandwidth issues.

Backing Up to/from an Appliance

Requirements:
- Parallel backup from all nodes, not just the master
- Incremental or dedup capability via NFS shares or similar
- Connected to the private network, not the public network

MPP Database Resilience Relies on In-Cluster Mirroring Logic

- The cluster comprises master servers and multiple segment servers
- Segment servers support multiple database instances: active primary instances and standby mirror instances
- 1:1 mapping between primary and mirror instances
- Synchronous mirroring

[Diagram: set of active segment instances]

SAN Mirror Configuration: Mirrors Placed on SAN Storage

[Diagram: four segment servers, each holding three primary instances (P1-P12) on local storage, with the corresponding mirror instances (M1-M12) placed on SAN storage]

Doesn't this violate the principle of all-local storage? Maybe, maybe not.

One Example: SAN Mirror to VMAX

[Diagram: segment primaries P1-P96 on internal storage, segment mirrors M1-M96 on VMAX SAN storage]

- The default DCA configuration has segment primaries and segment mirrors on internal storage
- SAN Mirror offloads the segment mirrors to VMAX SAN storage, doubling the effective capacity of a DCA
- Foundation of SAN leverage: seamless off-host backups, data replication
- No performance impact: primaries stay on internal storage; the SAN is sized for the load and for a failed segment server

One Example: SAN Mirror with SAN-Based Replication for DR

[Diagram: a Greenplum DCA and Symmetrix V-MAX at the primary site replicate over WAN or SAN via SRDF to a Symmetrix V-MAX and Greenplum DCA at the remote site; primary P1 is mirrored to M1/R1 on the local SAN and replicated to M1/R2 on the remote SAN]

- No consumption of server-based resources to perform distance copies
- Synchronous or asynchronous replication, performed by SRDF on the V-MAX
- Database on SAN

What is Hadoop?

Three major components:
- An infrastructure for running Map/Reduce jobs: mappers produce name/value pairs; reducers aggregate mapper output (see the sketch after this list)
- HDFS: a distributed file system for holding input data, output data, and intermediate results
- An ecosystem of higher-level tools overlaid on MapReduce and HDFS: Hive, Pig, HBase, ZooKeeper, Mahout, and others
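To illustrate the mapper/reducer split, here is a minimal word-count sketch (not from the slides) written for Hadoop Streaming, which lets mappers and reducers be ordinary programs that read lines on stdin and emit tab-separated name/value pairs on stdout. The file name and paths are illustrative.

    #!/usr/bin/env python
    # wordcount.py -- run as "wordcount.py map" for the mapper phase and
    # "wordcount.py reduce" for the reducer phase (Hadoop Streaming convention:
    # read lines on stdin, emit tab-separated name/value pairs on stdout).
    import sys

    def mapper():
        # Mappers produce name/value pairs: one ("word", 1) pair per word.
        for line in sys.stdin:
            for word in line.split():
                print("%s\t%d" % (word.lower(), 1))

    def reducer():
        # Reducers aggregate mapper output; Hadoop delivers the pairs sorted
        # by key, so all counts for a given word arrive together.
        current_word, count = None, 0
        for line in sys.stdin:
            word, value = line.rstrip("\n").split("\t")
            if word != current_word:
                if current_word is not None:
                    print("%s\t%d" % (current_word, count))
                current_word, count = word, 0
            count += int(value)
        if current_word is not None:
            print("%s\t%d" % (current_word, count))

    if __name__ == "__main__":
        mapper() if sys.argv[1] == "map" else reducer()

A job like this would typically be submitted through the Hadoop Streaming jar, along the lines of: hadoop jar hadoop-streaming.jar -input /data/in -output /data/out -mapper "wordcount.py map" -reducer "wordcount.py reduce" -file wordcount.py (the exact jar name and path vary by distribution).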

Why Hadoop?

- With the massive growth of unstructured data, Hadoop has quickly become an important new data platform and technology
- We've seen this first-hand with customers deploying Hadoop alongside relational databases
- A large number of major businesses and government agencies are evaluating Hadoop or have Hadoop in production
- Over 22 organizations run 1+ PB Hadoop clusters
- The average Hadoop cluster is 30 nodes and growing

Why Not Hadoop?

- Hadoop is still a roll-your-own technology; appliances are only just appearing (Sep/Oct 2011)
- Wide-scale acceptance requires better HA features, more performant I/O, and ease of use and management
- Access to HDFS goes through a single Name Node: a single point of failure, possible contention in large clusters, and all Name Node data held in memory, limiting the number of files in the cluster
- Unlike SQL, programming against the Hadoop API is a rare skill
- The Apache distribution is written in Java: good for portability, less good for speed of execution

Storage Layer Improvements to the Apache Hadoop Distribution

HDFS optimizations:
- Recoded in C, not Java, with a different I/O philosophy; completely API compatible
- NFS interface for data movement in/out of HDFS (see the sketch after this list)
- Distributed Name Node eliminates the SPOF for the Name Node
- Remote mirroring and snapshots for HA
- Multiple readers/writers, lockless storage
- Built-in transparent compression/encryption
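As a rough illustration of the NFS interface item above, the sketch below (not from the slides) moves data in and out of HDFS with ordinary file operations instead of Hadoop-specific tooling. It assumes an HDFS export is already mounted at /mnt/hdfs; that mount point and the file paths are hypothetical.

    # Minimal sketch: moving data in/out of HDFS through an NFS-mounted export.
    # Assumes the cluster's HDFS export is mounted at /mnt/hdfs (hypothetical).
    import shutil

    # Data in: an ordinary copy lands the file in HDFS via the NFS mount.
    shutil.copy("/var/log/webserver/access.log", "/mnt/hdfs/landing/access.log")

    # Data out: results written by a MapReduce job are read back the same way.
    with open("/mnt/hdfs/results/part-00000") as results:
        for line in results:
            print(line.rstrip())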

Thank You

Marshall Presser
240.401.1750 (cell)
marshall.presser@emc.com