The State of Hadoop and Data Lifecycle Management



Similar documents
WHITEPAPER. A Technical Perspective on the Talena Data Availability Management Solution

Big Data must become a first class citizen in the enterprise

HDP Enabling the Modern Data Architecture

A Sumo Logic White Paper. Harnessing Continuous Intelligence to Enable the Modern DevOps Team

Hadoop in the Hybrid Cloud

DevOps Best Practices: Combine Coding with Collaboration

HDP Hadoop From concept to deployment.

Datameer Cloud. End-to-End Big Data Analytics in the Cloud

Building Your Big Data Team

CA Big Data Management: It s here, but what can it do for your business?

PEPPERDATA OVERVIEW AND DIFFERENTIATORS

Big Data and Hadoop for the Executive A Reference Guide

Virtual Machine Environments: Data Protection and Recovery Solutions

Big Data Support Services. Service Definition

Cisco IT Hadoop Journey

BMC Software's batch-job juggernaut gets hip with Hadoop support


Extending the Enterprise Data Warehouse with Hadoop Robert Lancaster. Nov 7, 2012

Modern IT Operations Management. Why a New Approach is Required, and How Boundary Delivers

Testing Big data is one of the biggest

The Inside Scoop on Hadoop

Ubuntu and Hadoop: the perfect match

The Power of Pentaho and Hadoop in Action. Demonstrating MapReduce Performance at Scale

OnX Big Data Reference Architecture

BITKOM& NIK - Big Data Wo liegen die Chancen für den Mittelstand?

Solution White Paper Connect Hadoop to the Enterprise

VMware vsphere Data Protection 5.8 TECHNICAL OVERVIEW REVISED AUGUST 2014

Quantium captures new niche in data analytics market

How Big Is Big Data Adoption? Survey Results. Survey Results Big Data Company Strategy... 6

Big Data Technology ดร.ช ชาต หฤไชยะศ กด. Choochart Haruechaiyasak, Ph.D.

High Availability & Disaster Recovery Development Project. Concepts, Design and Implementation

IBM QRadar Security Intelligence Platform appliances

Machine Data Analytics with Sumo Logic

Building & Optimizing Enterprise-class Hadoop with Open Architectures Prem Jain NetApp

Application Lifecycle Management White Paper. Source Code Management Best Practice: Applying Economic Logic to Migration ALM

THE JOURNEY TO A DATA LAKE

TIBCO StreamBase High Availability Deploy Mission-Critical TIBCO StreamBase Applications in a Fault Tolerant Configuration

Session 0202: Big Data in action with SAP HANA and Hadoop Platforms Prasad Illapani Product Management & Strategy (SAP HANA & Big Data) SAP Labs LLC,

SOLVING REAL AND BIG (DATA) PROBLEMS USING HADOOP. Eva Andreasson Cloudera

Cloudera Manager Introduction

Dominik Wagenknecht Accenture

Contents. Pentaho Corporation. Version 5.1. Copyright Page. New Features in Pentaho Data Integration 5.1. PDI Version 5.1 Minor Functionality Changes

Big Data Analytics OverOnline Transactional Data Set

HADOOP SOLUTION USING EMC ISILON AND CLOUDERA ENTERPRISE Efficient, Flexible In-Place Hadoop Analytics

Vistara Lifecycle Management

APPROACHABLE ANALYTICS MAKING SENSE OF DATA

Cloudera Enterprise Reference Architecture for Google Cloud Platform Deployments

DevOps. Production Operations - The Last Mile of a DevOps Strategy

Upgrade to Oracle E-Business Suite R12 While Controlling the Impact of Data Growth WHITE PAPER

Lab : Planning and Implementing a Virtual Machine Deployment and Management Strategy

Solution Brief Availability and Recovery Options: Microsoft Exchange Solutions on VMware

SUPPLY CHAIN SEGMENTATION 2.0: WHAT S NEXT. Rich Becks, General Manager, E2open. Contents. White Paper

Integrating a Big Data Platform into Government:

White Paper. Managing MapR Clusters on Google Compute Engine

XpoLog Competitive Comparison Sheet

Proficy Monitoring & Analysis. Software to harness the industrial internet

FLASH ARRAY MARKET TRENDS

VMware vsphere Data Protection 6.0

DATABASE ANALYST I DATABASE ANALYST II

Hadoop, the Data Lake, and a New World of Analytics

How To Turn Big Data Into An Insight

White. Paper. EMC Isilon: A Scalable Storage Platform for Big Data. April 2014

Addressing Open Source Big Data, Hadoop, and MapReduce limitations

Introduction to VMware vsphere Data Protection TECHNICAL WHITE PAPER

Welkom! Copyright 2014 Oracle and/or its affiliates. All rights reserved.

Platfora Big Data Analytics

Hortonworks & SAS. Analytics everywhere. Page 1. Hortonworks Inc All Rights Reserved

Increased Security, Greater Agility, Lower Costs for AWS DELPHIX FOR AMAZON WEB SERVICES WHITE PAPER

Analytics With Hadoop. SAS and Cloudera Starter Services: Visual Analytics and Visual Statistics

Data Discovery, Analytics, and the Enterprise Data Hub

Understanding Your Customer Journey by Extending Adobe Analytics with Big Data

CDH AND BUSINESS CONTINUITY:

Virtualization of CBORD Odyssey PCS and Micros 3700 servers. The CBORD Group, Inc. January 13, 2007

Real World Big Data Architecture - Splunk, Hadoop, RDBMS

Big Data Open Source Stack vs. Traditional Stack for BI and Analytics

MS Implementing an Advanced Server Infrastructure

Broadcloud improves competitive advantage with efficient, flexible and scalable disaster recovery services

VMware Solutions for Small and Midsize Business

Cyber security tackling the risks with new solutions and co-operation Miikka Pönniö

Amazon Web Services. For Government, Education, and Nonprofit Organizations. Jakob Huhn. Partner Manager Benelux, Public Sector

CA Technologies Big Data Infrastructure Management Unified Management and Visibility of Big Data

Non-Stop Hadoop Paul Scott-Murphy VP Field Techincal Service, APJ. Cloudera World Japan November 2014

Hosting as a Service (HaaS) Playbook. Version 0.92

巨 量 資 料 分 層 儲 存 解 決 方 案

Introduction to Apache Kafka And Real-Time ETL. for Oracle DBAs and Data Analysts

Apache Hadoop: Past, Present, and Future

Apache Hadoop: The Big Data Refinery

Transcription:

The State of Hadoop and Data Lifecycle Management September 15

INTRODUCTION Thought leaders and Big Data practitioners completed a Talena survey in which they detailed their adoption and use of Hadoop technologies. The respondents also highlighted the challenges associated with data lifecycle management processes for their Hadoop environment. The most important findings are outlined below. KEY TAKEAWAYS Over 5% of the respondents are actively looking for or implementing a data management solution for their Hadoop infrastructure 85% of Hadoop users highlight analytics as their primary use case Over 5% of the respondents have deployed three or fewer Hadoop clusters Budgets grow substantially as projects move from research to production 3% of the respondents store over 1 TB in their largest cluster Methodology We invited people whom we met at Hadoop conferences, thought leaders, or contacts made via our own research into the Hadoop community to fill out the survey. Over 1 responses were recorded. All the raw responses were converted into percentages for the purposes of the charts below, and for a few questions there were multiple responses allowed which made the percentage total above 1. 2

1 Disaster Recovery and Copying Data for Test/Dev Are Pain Points Respondents were asked to prioritize various data management processes associated with their Hadoop environment. Disaster recovery and using production data in test & dev environments were highlighted as the two most critical challenges with Hadoop for companies. While the former seems self-evident, the latter could be because the increased presence of DevOps shops makes it imperative to support rapid application iterations while still ensuring PII and other sensitive data are masked and protected. These issues play a role in delaying production rollouts. 35 3 32 32 25 15 1 19 15 5 DR Backup Test/Dev Archive - Cost What Is Your #1 Challenge With Hadoop As It Relates To Data Management? What is your #1 Hadoop challenge as it relates to data management? 2 Archive - Compliance 3

2 Scripting Is An Answer, But Is It The Solution? When asked how they were currently solving these challenges, over 6% of those looking for a solution for test/dev/analysis turned to scripting. Over 3% of those who saw disaster recovery, archiving or backups as their biggest issue used scripting while another % were actively looking for a solution. In short, implementing some form of data management (even scripting) is relevant to over 5% of the survey group, independent of use case. However, it brings up an interesting question as to whether these increased deployment budgets are wisely spent on a solution that neither scales not supports efficient deployments. Scripting Looking For A Solution Not a current requirement - maybe in future Not a current requirement - not in future 7 6 61 5 4 3 35 38 35 33 1 23 23 23 21 18 12 12 15 18 28 23 25 DR Backup Test/Dev How Are You Currently Solving These Data Management Challenges? Archive - Cost Archive - Compliance 4

3 There is a Strong Correlation Between Budget and Deployment Phase Nearly 9% of the enterprises who are still researching their Hadoop options have allocated less than $1K for their future project needs. As these projects move into pilot and production phases, the budgets grow substantially. Just over % of those companies that are in pilot or production have budgets less than $1K, while the majority spend significant amounts for infrastructure and engineers to build these nextgeneration applications (and the scripts for managing their data). $1K-$25K $51K-$1M Less than $1K $1K-$25K 6 6 22 28 88 % $51K-$1M 8 24 % 18 $1M+ Less than $1K $251K-$5K 15 Budget for Research 15 Budget for Pilot/Production 5

4 Analytics Remains Predominant Use Case With Log Analysis Surprisingly Prevalent Respondents labeled analytics as their #1 use case with over 85% of respondents pinpointing this need. 35% of the respondents use Hadoop for log analysis. Given the emergence of products purposed-built for log analysis like Splunk, Sumo Logic, ELK, and Loggly, we thought this number to be high. Evidently the processing power of Hadoop for large data sets like logs still provides users with compelling value despite all the real-time, machine learning capabilities of these other tools. 9 8 7 85 6 5 4 56 51 3 36 1 8 Analytics Storage/Archival ETL Log Analysis Other How is your organization using Hadoop?* How is your organization using Hadoop? (multiple responses allowed) *Multiple responses allowed 6

5 Cloudera Remains The Distribution of Choice Cloudera was by far the most favored distribution with a greater than 2x advantage over Hortonworks, and nearly 6x over MapR, reinforcing the popular opinion around the relative penetration of the different Hadoop distributions. 7 6 5 63 4 3 1 25 11 18 6 Cloudera Hortonworks MapR Apache Other What Hadoop Distribution Do You Use?* What Hadoop Distribution Do You Use? (multiple responses allowed) *Multiple responses allowed 7

6 Most Companies Currently Implementing Three or Fewer Clusters Over 5% of the companies are using three or less Hadoop clusters (whether production or non-production), while a small percentage of companies have over ten clusters in their environment. This supports the popular theory that there is a smaller group of very large Hadoop deployments, while the broader Hadoop ecosystem is built around more modest cluster sizes that, over time, will grow as companies obtain value from their Big Data applications. 45 4 35 3 4 25 15 1 16 18 18 5 8 1 2 to 3 4 to 6 7 to 9 1 or more How Many Hadoop Clusters Do You Operate? How Many Hadoop Clusters Do You Operate? 8

7 Nearly a third of the largest clusters contain over 1 TB of data While companies aren t necessarily deploying large numbers of clusters, they are starting to put large amounts of data in their existing clusters. Over 3% of our respondents store over 1 TB in their largest Hadoop cluster, presumably running analytics, ETL or log analysis on these data sets. 4 35 3 36 25 15 1 5 16 18 11 1-1 TB 11-25 TB 26-1 TB 11-499 TB 51+ TB How Much Data Resides In Your Largest Hadoop Cluster? How Much Data Resides In Your Largest Hadoop Cluster? 9

SUMMARY With regard to the general adoption of use of Hadoop, our findings support what has been written about in the popular press, with a greater percentage of users actually in production than the overall population. The latter is undoubtedly a result of our sample bias which explicitly includes Hadoop users or those who expressed a preference for Hadoop. About Talena Talena, the next-generation data availability management company, solves the problems associated with unavailable or lost data, and potential compliance risk related to Big Data applications. Our exabyte-scale solution automates backup, test/dev, archive and disaster-recovery functions. With Talena, companies enable rapid application iteration, save engineering and infrastructure resources, and prevent data loss from user error or application corruption. Please contact us for more information at info@talena-inc.com or visit us at www.talena-inc.com. Talena, Inc. 83 Hillview Court, Suite 138, Milpitas, CA 9535 +1.48.649.6338 talena-inc.com 15 Talena, Inc. All rights reserved. Talena and the Talena logo are trademarks of Talena in the US and in other countries. Information subject to change without notice. All other trademarks and service marks are property of their respective owners.