Data movement for globally deployed Big Data Hadoop architectures

Data movement for globally deployed Big Data Hadoop architectures Scott Rudenstein VP Technical Services November 2015

WANdisco Background
WANdisco: Wide Area Network Distributed Computing
- Enterprise-ready, high-availability software solutions that enable globally distributed organizations to meet today's data challenges of secure storage, scalability and availability
- Leader in tools for software engineers (Subversion)
- Apache Software Foundation sponsor
- Highly successful IPO, London Stock Exchange, June 2012 (LSE:WAND)
- US patent for active-active replication technology granted, November 2012

Customers

Scott Rudenstein has worked in commercial software sales for 18 years in Application Lifecycle Management and High Performance Computing. Throughout his career in the US and UK, he specialized in replication, where data and environments need high availability, disaster recovery and backup capabilities. As WANdisco's VP of Technical Services, Scott works with partners, prospects and customers to help them understand and evolve the requirements for mission-critical, enterprise-ready Hadoop.

Phil is the WANdisco Central Region Director, and brings over 20 years of experience working with clients on enhancing their Big Data, business intelligence, PLM, ALM, search, analytics and visualization environments. He has worked for JP Morgan, CSC, Kodak, Dassault Systemes, and Attivio, with the last eight years focused on Big Data, and is an active DAMA member.

How Much Data We Generate in 60 Seconds

Even More Challenging with the Internet of Things

The Result: Hadoop is Under Pressure
Data-at-rest
- Reporting: trends, search indexes, KPIs, metrics and BI
- Exploration: data science, analysis of large data sets
Data-in-motion
- Streaming analysis: real-time summary and aggregation
- Transactional processing: per-event analytics, stream analysis

Before we move data
We've brought data into the cluster from multiple data sources, but before we move it we need to consider the following. We need to understand where the ingested data is going and what the data's purpose is:
- Backup
- Disaster recovery
- Compute constraints
- High availability
- Enterprise data warehouse
And we need to understand the data's relevance, that is, the timeliness required for the data to be actionable. Also in play: network capacity, security, and cluster utilization.
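Network capacity is often the binding constraint. A rough sanity check before replicating is to estimate the wall-clock time needed to move a data set over the WAN from its size and the usable link bandwidth. The sketch below is my own illustration, not from the slides; the 60% utilization factor modeling protocol overhead and competing traffic is an assumption to tune, not a measurement:

```python
def transfer_hours(data_tb: float, link_gbps: float, utilization: float = 0.6) -> float:
    """Rough wall-clock hours to push data_tb terabytes over a WAN link.

    utilization models protocol overhead and competing traffic (assumed 60%).
    """
    data_bits = data_tb * 8 * 10**12           # TB -> bits (decimal units)
    effective_bps = link_gbps * 10**9 * utilization
    return data_bits / effective_bps / 3600

# e.g. 50 TB over a 10 Gb/s link at 60% utilization
print(round(transfer_hours(50, 10), 1))        # -> 18.5 hours
```

Even this crude model makes the point of the slide: at tens of terabytes, a single bulk copy is an overnight job at best, which is why the sync strategy matters.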

Enterprise-Ready Hadoop
Characteristics of mission-critical Hadoop clusters sharing data:
- Require data availability: SLAs, regulatory compliance, surviving regional datacenter failure, data relevance
- Require Hadoop deployed globally: share data between data centers; data is consistent, not eventually consistent
- Ease the administrative burden: reduce operational complexity, simplify disaster recovery, lower RTO/RPO
- Allow maximum utilization of resources: within the data center and across data centers

Key Issue for Sharing Data Across Clusters: LAN / WAN

Considerations: questions to ask before transfers
What is the value of the data being copied?
- Customer-facing / business-critical
- Internal, medium priority
- Disaster recovery / backup
How important is it that the data is verified?
- Compliance: must be verified
- Recovery: as near to verified as possible
- Don't care: can re-fetch or regenerate the data
How many unique folders need to be replicated site to site?
- What is the complexity of the transfer pattern?
- Does the receiving site need to transfer data back to the originating site?
- How does your organization troubleshoot a failure in replication?
What is the disaster recovery site for?
- Running critical applications when the production site is lost
- Recovering data only
- Both: must run critical applications and provide recovery data
How is the disaster recovery site sized?
- Equal to the total production clusters
- Smaller footprint
- Larger footprint
What are the SLAs that must be met? Is some data (folders) higher priority than other data?
What are you using to control the copy process (e.g. DistCp, Falcon, BDR, home-grown)?
How many tasks does production use to copy the data (or will use once in production)?

The complexity issue

[Diagram sequence: clusters C1 and C2 first share a single replicated folder (repl1), then a second (repl2); once clusters C3 and C4 and HBase, HDFS and Hive data sets are added, the number of point-to-point replication relationships multiplies across every pair of clusters.]
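The growth in the diagrams is combinatorial: with point-to-point copying, every replicated folder needs a link between each pair of clusters that shares it. A toy calculation (my own illustration, assuming the worst case where all pairs share all folders) shows how quickly this escalates:

```python
def pairwise_links(clusters: int, folders: int, bidirectional: bool = True) -> int:
    """Replication relationships when every folder is copied point-to-point
    between every pair of clusters (worst case: all pairs share all folders)."""
    pairs = clusters * (clusters - 1) // 2
    return pairs * folders * (2 if bidirectional else 1)

print(pairwise_links(2, 1))   # 2 clusters, 1 folder  -> 2 links
print(pairwise_links(4, 3))   # 4 clusters, 3 folders -> 36 links
```

Each of those links is a separately scheduled, separately monitored copy job, which is the operational burden the slides are driving at.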

Analysis of common replication techniques

Multi Data Center Hadoop Today: what's wrong with the status quo
- Periodic synchronization: DistCp
- Parallel data ingest: load balancer, streaming

Multi Data Center Hadoop Today: hacks currently in use
Periodic synchronization (DistCp):
- Runs as a MapReduce job
- The DR data center is read-only
- Over time, the Hadoop clusters become inconsistent
- Manual, labor-intensive process to reconcile differences
- Inefficient use of the network: N-to-M datanode communication
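A typical periodic sync is a cron-driven DistCp invocation along these lines (the cluster addresses, paths and map count are hypothetical). `-update` copies only files that have changed and `-delete` removes destination files absent from the source, which is precisely the step that misbehaves once the clusters have diverged:

```shell
# Sketch of a nightly DistCp sync from a production cluster to a DR cluster.
# Hostnames and paths are made up for illustration.
SRC="hdfs://nn-prod:8020/data/events"
DST="hdfs://nn-dr:8020/data/events"

# -update: skip files whose size/checksum already match on the destination
# -delete: remove DST files missing from SRC (keeps DR a strict mirror)
# -m 20:   cap the copy at 20 parallel map tasks
CMD="hadoop distcp -update -delete -m 20 $SRC $DST"
echo "$CMD"    # in a real deployment cron would execute this, not echo it
```

Because the whole thing is a batch MapReduce job, everything written between two runs is unprotected, and a failed or partial run must be diagnosed and reconciled by hand.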

Multi Data Center Hadoop Today: hacks currently in use
Parallel data ingest (load balancer, streaming):
- A hiccup in either Hadoop cluster causes the two file systems to diverge
- Potential to run out of buffer when the WAN is down
- Requires constant attention and sysadmin hours to keep running
- Data created on the cluster itself is not replicated
- Streaming technologies (like Flume) redirect data, but only cover streaming sources

What we want in our Data Centers
What's in a data center today: a standby datacenter
- Idle resource: single data center ingest, disaster recovery only
- One-way synchronization: DistCp
- Error-prone: clusters can diverge over time
- Difficult to scale beyond 2 data centers: the complexity of sharing data increases
What we want: an active datacenter
- DR resource available: ingest at all data centers, run jobs in both data centers
- Multi-directional replication: Hive, HBase and others
- Absolute consistency
- N data center support: globally deployed Hadoop shares only the appropriate data

Scaling Hadoop Across Data Centers: continuous availability and disaster recovery over the WAN
- The system should appear, act and operate as a single cluster: continuous replication of data and metadata; RPO is as low as possible because replication is continuous rather than periodic
- Parts of the cluster in different data centers should have equal roles: data can be ingested or accessed through any of the centers
- Data creation and access should run at LAN speed: a job executed in one data center runs as if there were no other centers
- Failure scenarios: the system should keep providing service and remain consistent; a WAN partition does not cause a data center outage

Use Cases

Disaster Recovery
- Data is as current as possible (no periodic syncs): low RPO
- Virtually zero downtime to recover from a regional data center failure: active/active means you're already running
- Meets or exceeds strict regulatory compliance requirements around disaster recovery: auditability
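The RPO claim can be made concrete. With periodic synchronization, the worst-case data-loss window is the sync interval plus the copy time; with continuous replication it collapses to the replication lag. A back-of-the-envelope model (the interval, copy time and lag figures are illustrative, not vendor numbers):

```python
def worst_case_rpo_periodic(sync_interval_min: float, copy_minutes: float) -> float:
    """Worst-case minutes of data lost if the primary fails just before
    the next periodic sync completes."""
    return sync_interval_min + copy_minutes

def worst_case_rpo_continuous(replication_lag_sec: float) -> float:
    """Worst-case minutes of data lost with continuous replication."""
    return replication_lag_sec / 60

# Nightly DistCp (24 h interval, 45 min copy) vs. continuous replication
# with a few seconds of WAN lag:
print(worst_case_rpo_periodic(24 * 60, 45))   # -> 1485.0 minutes (~25 h)
print(worst_case_rpo_continuous(5))           # well under a minute
```

The comparison is the whole argument of the slide: no tuning of a batch sync schedule gets the loss window anywhere near what continuous replication gives by construction.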

Multi Data-Center Ingest and multi-tenant workloads
- Ingest and analyze anywhere, analyze everywhere: fraud detection, equity trading information, new business, etc.
- All data centers can be used for work: no idle resources

Cluster Zones
- Share data, not processing: isolate lower-priority (dev/test) work; share data, not resources
- Maximize resource utilization: no idle standby
- Mixed hardware profiles: memory, disk, CPU; isolate memory-hungry processing (Storm/Spark) from regular jobs
- Mixed vendors: CDH, PHD, HDP, MapR

Migration
- Simplifies and automates migration between competing distributions or legacy versions: MapR, CDH on HDFS, CDH on Isilon, HDP on HDFS, HDP on Isilon, PHD
- With easily configurable, real-time active-active replication, data stays secure and maintains its integrity during the migration process: eliminates manual labor and pain for IT
- Facilitates and expedites migrations and upgrades between distributions, reducing costly downtime and minimizing data loss
- During the migration, enterprises can access their data continuously: both sites remain operational ("data liveliness")
- Avoids vendor lock-in

Regulatory Compliance
- Basel III: consistency-of-data guidelines
- Data Privacy Directive: data sovereignty (data doesn't leave its country of origin)

Security Between Data Centers
- Hadoop clusters do not require direct communication with each other: no n x m communication among datanodes across datacenters; reduced firewall / SOCKS complexity
- Reduced attack surface
- Fast network protocols can keep up with demanding replication workloads
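The "no n x m communication" point is easy to quantify. If every datanode in one cluster must reach every datanode in the other, the firewall has to permit n times m flows; routing replication through a small number of proxy endpoints per site shrinks that to the product of the proxy counts. A toy comparison (the node and proxy counts are invented for illustration):

```python
def direct_flows(nodes_a: int, nodes_b: int) -> int:
    """Cross-datacenter flows when datanodes talk to each other directly."""
    return nodes_a * nodes_b

def proxied_flows(proxies_a: int, proxies_b: int) -> int:
    """Cross-datacenter flows when replication goes through per-site proxies."""
    return proxies_a * proxies_b

# Two 200-node clusters, direct datanode traffic vs. 2 proxies per site:
print(direct_flows(200, 200))   # -> 40000 firewall-permitted flows
print(proxied_flows(2, 2))      # -> 4
```

Fewer permitted flows means fewer firewall rules to audit and a smaller surface exposed across the WAN, which is the security argument being made here.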

WANdisco Fusion
- Non-intrusive
- Provides continuous replication across the LAN/WAN
- Active/active

What is WANdisco Fusion?
- Active-active replication for Hadoop Compatible File Systems: active-active coordination of updates to HCFS data
- Replication of data between similar or mixed HCFS clusters: HDP, CDH, MapR, Apache Hadoop HDFS, EMC Isilon, Amazon S3
- Management UI backed by a REST API
- On-premise and cloud

WANdisco Fusion Overview
Powered by WANdisco's patented DConE to enable:
- Active-active replication with guaranteed data consistency across clusters over LAN or WAN
- Unifies clusters running on CDH, HDP, PHD, MapR, EMC Isilon, NetApp, EMC ECS, Amazon S3, MS Azure, OpenStack Swift, Google Cloud Storage and others
- Provides a single virtual namespace across clusters any distance apart
- Totally non-invasive: proxy server(s) deployed with each cluster; no modification to the underlying storage
- An SDK (Software Development Kit) is available for developers to write plug-ins for additional data sources

The complexity issue, revisited

[Diagram: repeats the earlier four-cluster picture (C1-C4 with replicated HBase, HDFS and Hive folders).]

Note: slides for internal use by WANdisco and IBM teams.

QUESTIONS Please feel free to submit your questions