Big Data must become a first class citizen in the enterprise

An Ovum white paper for Cloudera
Publication Date: 14 January 2014
Author: Tony Baer

SUMMARY

Catalyst

Big Data analytics have caught the imagination of enterprises because of the opportunity to discover new insights from data beyond the reach of enterprise data warehouses, using a variety of approaches, some of which were not previously feasible with relational databases. Created by a community of developers from the Internet world, Hadoop has emerged as the leading new platform for Big Data analytics because of its scalability, flexibility, and reliance on low-cost commodity infrastructure. Not surprisingly for an emerging platform, early adopters typically deployed Hadoop on dedicated infrastructure, because of its unique resource consumption characteristics, and with dedicated teams, because of the need for highly specialized skills.

Ovum view

Clearly, this implementation pattern will not be sustainable for enterprises, which must accommodate Hadoop and Big Data analytics largely with the teams and IT infrastructure that they already have. Big Data -- and Hadoop -- must become first-class citizens in the enterprise. The technology must become accessible to the people and skills that already form the IT organization. Big Data platforms cannot exist as islands of their own; instead, they must map to existing data center infrastructure, policies, and practices for managing resources and capacity, meeting service-level requirements, and governing and securing data. In turn, Big Data projects must address competitive or operational issues that already face the organization. An "embrace and extend" strategy is essential, as Big Data will require new skillsets, adaptations to running the data center, and new approaches to analyzing data. With Hadoop rapidly evolving from a raw open source framework into an enterprise data platform, enterprises should evaluate each vendor's roadmap for promoting accessibility, along with integrating its offering within the data center and the existing data warehousing environment.

Key messages

- To address the needs of enterprises, Big Data must become a first-class citizen with the IT organization, the data center, and the business.
- Due to its scalability, flexibility, and economics, Hadoop has emerged as the leading analytic platform for Big Data.
- The most direct path to making Big Data -- and Hadoop -- a first-class citizen will be through an "embrace and extend" approach that not only maps to existing skill sets, data center policies and practices, and business use cases, but also extends them.
- Big Data platform vendors must design their offerings to deliver the same degree of manageability, security, and integration as established data warehousing systems.

FROM SWAT TEAM TO ENTERPRISE MAINSTREAM

Hadoop's emergence

The origins of modern Big Data implementations lie with Internet companies whose analytic compute needs overwhelmed the carrying capacity of established SQL relational database technology in several ways. The sheer volume of data overwhelmed existing relational data warehouses, with daily refreshes exceeding their batch windows, while the sheer variety of data was difficult to model because of volatility not only in data structures but also in analytic needs. Furthermore, as data volumes surged into the petabytes, the costs of licensing and maintaining traditional relational platforms grew unaffordable. Not surprisingly, the relational data warehousing model broke down for Internet companies seeking to build search indexes, optimize ad placement, or enhance online gamers' experiences. As a result, Internet firms created their own Big Data technology (primarily, but not exclusively, Hadoop) and open sourced it; running it required special expertise and dedicated infrastructure. There were few concerns over security, capacity utilization, data stewardship, or information lifecycle management, as the stakes for market dominance were high and resources deep.

Hadoop emerged as a data processing framework designed to solve unique, Internet-scale operational problems such as optimizing ad placement or building search indexes. In its early days, Hadoop lacked tooling, and its performance management and resource consumption characteristics were not well understood. Consequently, there were few practitioners available, and deployments were typically managed as separate projects tended by small, elite groups of programmers on clusters apart from the data center.

Significantly, the primary security concern with early installations was authenticating users who sought access to remote clusters to provision additional compute capacity.

Making the transition to the enterprise

With early successes, enterprises grew interested in applying the scalability and power of Hadoop to issues such as optimizing the customer experience; increasing operational efficiency; or improving risk mitigation, fraud detection, and compliance. Hadoop also started maturing as vendors began offering commercial support with value-added features such as simplified deployment; integrated monitoring; enhanced data ingestion and integration; authentication, authorization, and access control; data security; and support for new processing frameworks providing alternatives to MapReduce.

The "SWAT team" model used by early adopters for implementing Hadoop is clearly unsustainable for mainstream enterprises, which cannot afford to replace their SQL developers with new talent, run Hadoop clusters as separate islands, or treat every question as a unique data science exercise. Furthermore, as enterprises implement Hadoop, they must deal with the same constraints and requirements that are customary for any major business application or data platform, because nobody has unlimited capital budgets for opening or expanding data centers dedicated to Big Data and Hadoop. That entails policies regarding data access and utilization, protection of customer privacy, and the need to manage compute resources and maintain service levels in data centers with finite capacity.

BECOMING A FIRST CLASS CITIZEN

The goals are the same, but the means are different

Enterprise interest in Big Data, and in the Hadoop platform, is evolution, not revolution. It is about gaining insight to address competitive, strategic, or operational issues facing the organization. With Big Data, the difference is that there is now more data -- and more kinds of it -- from which to derive that insight. The goal remains the same; the means, however, may be different. For instance, queries can evolve with the organization's needs, because data does not have to be conformed to a schema until runtime. Queries can be run using SQL; or using other approaches such as MapReduce, for large-scale processing; streaming, for real-time operational decisions; search, which adds another technique for ad hoc analytics that is useful when starting with variably structured data; and so on. Big Data may also involve new platforms in addition to relational systems; Hadoop has emerged as the leading alternative to relational platforms for Big Data analytics on the strength of its low costs, flexibility, and scalability.
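The schema-on-read idea described above can be made concrete with a short sketch. The example below is illustrative rather than taken from the paper; it uses the modern PySpark API (newer than the paper's 2014 vintage), and the HDFS path and field names are hypothetical. The raw files are loaded with no upfront modeling, and a schema is inferred only when the query runs:

    # A minimal schema-on-read sketch (illustrative; path and field names
    # are hypothetical). The raw click logs sit in HDFS unmodeled;
    # structure is inferred at read time, not at load time.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

    # No schema was imposed when the data landed; Spark derives one now.
    clicks = spark.read.json("hdfs:///raw/clickstream/2014/01/")
    clicks.createOrReplaceTempView("clicks")

    # The same raw files can answer a different question tomorrow with a
    # different query -- no reload or remodeling required.
    spark.sql("""
        SELECT page, COUNT(*) AS views
        FROM clicks
        GROUP BY page
        ORDER BY views DESC
        LIMIT 10
    """).show()

Because the raw data is preserved, a new question means a new query over the same files rather than a new extract-transform-load cycle.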

Supporting the analytic value chain

As Hadoop becomes more enterprise-ready, its role is evolving from offline data storage and exploratory processing platform to one that could supplement, or claim outright, the role of supporting the analytic value chain from end to end. Hadoop's strength is not only its economics and scalability, but also its flexibility in managing data and its growing ability to execute multiple types of analytic and operational workloads. That dictates that Hadoop become an intrinsic part of the analytic value chain, not a separate island: it must become a first-class citizen with IT, the data center, and the enterprise, as shown in Table 1.

Table 1. Making Hadoop a first-class citizen

Customer:
- IT organization: Hadoop implementation becomes accessible to existing skillsets.
- Data center: Hadoop must be managed to support existing data center policies, practices, and constraints.
- Enterprise: Hadoop and Big Data analytics are performed to address familiar enterprise business issues.

Vendor and/or open source community:
- IT organization: Extend Hadoop platform features, making them accessible to developers skilled in SQL, Java, and popular scripting languages.
- Data center: Develop and improve data management and governance capabilities: tracking data consumption and lineage; security, including access control, authorization, and authentication; the ability to deliver predictable service levels, availability, and reliability; and support for full backup and disaster recovery.
- Enterprise: Support integration with existing and emerging Big Data analytic tools and applications.

Source: Ovum

Embrace and Extend

Based on the experiences of Ovum's enterprise clients, we have found that the most effective strategy for implementing Hadoop and Big Data analytics is an "embrace and extend" strategy that builds on existing competencies, policies, and analytics, and extends them to leverage the unique benefits that Big Data analytics and knowledge of the Hadoop platform provide (see Figure 1). Therefore, beyond mapping the Hadoop implementation to the existing IT organization's skills base, data center policies and practices, and enterprise business cases, it will require adaptation that:

- extends platform and analytics know-how;
- modifies data center operation to account for new forms and volumes of data; and
- extends the reach of analytics to address existing issues with new approaches or forms of querying.

Figure 1. Embrace and Extend (Source: Ovum)

For the IT organization

Embrace existing SQL, Java, Python, and similar programming language skill bases. Hadoop has long included features such as Hive (its SQL-like data warehousing layer) and Pig (a data flow language familiar to programmers), and new capabilities supporting interactive SQL are emerging. Likewise, Hadoop programming frameworks such as MapReduce and Spark were designed for Java, and can accommodate analytic programs written in other popular languages such as Python or R. In many cases, organizations adopting Hadoop can use many of their existing tools with it, as most BI, analytics, and data transformation tool providers have already extended support to the platform.

To take maximum advantage of the power of Hadoop, these skills should be extended to working with larger, more variable, and faster-changing sets of data. For instance, while schema remains essential, developers should take advantage of Hadoop's support for building schema at run time.
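As one illustration of embracing an existing skills base, Hadoop Streaming lets a developer write a MapReduce job in plain Python rather than Java. The sketch below is illustrative, not from the paper; the streaming jar path and HDFS directories are hypothetical and vary by distribution:

    # mapper.py -- a minimal Hadoop Streaming mapper in plain Python.
    # Emits "word<TAB>1" for every word read from standard input.
    import sys

    for line in sys.stdin:
        for word in line.split():
            print("%s\t1" % word)

    # reducer.py -- sums the counts for each word. Hadoop Streaming
    # delivers the mapper output to the reducer sorted by key.
    import sys

    current_word, count = None, 0
    for line in sys.stdin:
        word, n = line.rstrip("\n").split("\t")
        if word != current_word:
            if current_word is not None:
                print("%s\t%d" % (current_word, count))
            current_word, count = word, 0
        count += int(n)
    if current_word is not None:
        print("%s\t%d" % (current_word, count))

    # Submitting the job (jar path and directories are hypothetical):
    # hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
    #   -files mapper.py,reducer.py \
    #   -mapper mapper.py -reducer reducer.py \
    #   -input /data/raw/text -output /data/out/wordcount

The point is not the word count itself but that a Python developer can put an existing language skill to work on the cluster without learning Java first.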

Additionally, new techniques such as search, graph, and stream processing can add context to analytics, probe relationships between groups of people or things, and open a window onto closed-loop, real-time operational insight. In some cases, roles may be extended: power users could assume data curation roles, where they not only generate queries but also help identify potentially relevant sets of data, from internal and external sources, for analysis.

For the Data Center operation

Few enterprises have unlimited budgets when it comes to building and running their data centers. Likewise, many organizations may be subject to regulatory scrutiny regarding access to and usage of sensitive data. As such, Hadoop installations must embrace the rules, policies, and practices that are expected of any data platform -- especially since, in many cases, Hadoop may store the same types of structured data that have long been held in relational data warehouses (this is especially common with active archiving use cases). But it must also extend them to account for the unique demands of ingesting, storing, and consuming new types of data sets. This affects security, resource management, and data governance and stewardship, as described below:

Security: This encompasses managing access and authorization for different classes of end users, and strong measures for authenticating them. Depending on the sensitivity of the data, security may also involve protecting the integrity of data, safeguarding the privacy of customer records, and closely monitoring (and managing) how the data is used or transformed.

Resource management and service-level management: While a key benefit of Hadoop is its reliance on inexpensive commodity infrastructure, at some point there are limits to how much compute or storage can be allocated. Hadoop platforms (and/or third-party tools) must support resource management policies, rules, and practices that prioritize workloads, and must provide capabilities for managing service levels (encompassing monitoring performance, balancing load, and ensuring availability and reliability). On the horizon, there will be demand for managing the full lifecycle of data, from optimizing the tiering of hot data into memory through to archival or disposal.

Data governance and stewardship: Big Data does not change the need for data quality, but it may demand different approaches depending on the nature and sensitivity of the data and the types of queries that will be run against it (will the queries be exploratory in nature, or will they require precise answers?). For instance, some data types, such as machine data or log files, will not necessarily be cleansed, while others (e.g., social network or mobile device geolocation data) may become more valuable when correlated with existing customer master identities. Compared with traditional data warehousing practice, there will be a broader range of approaches to managing the quality of Big Data, from record-by-record cleansing to alternatives that utilize probabilistic matching, machine learning, crowdsourcing, and other approaches. Additionally, data lineage solutions that track data by source will become useful tools for assessing the quality of data by how it is used and by the reliability of its source.
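To make the contrast with record-by-record cleansing concrete, the toy sketch below shows the flavor of the probabilistic matching mentioned above: rather than requiring fields to match exactly, it blends per-field similarities into a score and applies a threshold. The records, weights, and threshold are all hypothetical, and production systems use far more sophisticated models:

    # A toy sketch of probabilistic record matching (hypothetical records,
    # weights, and threshold). difflib ships with the standard library.
    from difflib import SequenceMatcher

    def similarity(a, b):
        """String similarity in [0, 1] based on matching subsequences."""
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    def match_score(rec_a, rec_b):
        # Blend per-field similarities into one score; the weights are
        # illustrative, not tuned.
        return (0.7 * similarity(rec_a["name"], rec_b["name"])
                + 0.3 * similarity(rec_a["city"], rec_b["city"]))

    crm_record = {"name": "Jonathan Q. Smith", "city": "Boston"}
    weblog_record = {"name": "Jon Smith", "city": "boston"}

    # No field matches exactly, yet the records are probably one person.
    if match_score(crm_record, weblog_record) > 0.75:
        print("likely the same customer")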

For the Enterprise

One of the most frequent questions that Ovum receives from clients is how to get started with Big Data. We believe that is the wrong question to ask. The purpose is not to work with Big Data for its own sake, but to identify use cases where Big Data can pick up where conventional analytics leave off, providing better answers to existing competitive, operational, or compliance-related issues facing the enterprise. Making Big Data a first-class citizen in the enterprise means embracing the business cases that are already important to the enterprise, while gaining the ability to re-imagine analytics, free of the constraints imposed by relational systems, to reveal new answers. For instance, Hadoop's support for schema on read preserves the original raw data, allowing organizations to ask new questions of different pieces of that data as conditions in the marketplace change. Hadoop's scalability and flexibility enable organizations to extend their analytics across diverse sets of data that were traditionally not stored inside enterprise data warehouses, and to run types of queries (e.g., streaming or graph analytics) that were not feasible with SQL.

CLOUDERA'S STRATEGY FOR ENTERPRISE HADOOP

From offline data store to enterprise data hub

Cloudera was the first vendor to deliver commercial support for Hadoop, and its strategy has been consistent with Ovum's vision for making the platform a first-class citizen of the data center. Its positioning of Cloudera Distribution including Hadoop (CDH) as an enterprise data hub is a clear acknowledgement that Hadoop must become sufficiently robust to provide the platform for managing multiple forms of data, with the capability to run multiple types of workloads. Admittedly, the quest to furnish the logical and physical hub for enterprise data is, and will continue to be, a hotly contested one. The takeaway is that delivering such a hub will not be possible unless the platform can reside as a first-class citizen in the data center, providing full manageability and support for enterprise policies regarding data access, protection, utilization, stewardship, and governance.

Adding capabilities for data management, access, and query

Cloudera has been building towards this strategy by supporting (and contributing to) the relevant Apache open source projects, and by delivering value-added features of its own to make Hadoop more manageable. For instance, Cloudera Manager automates deployment and configuration of Hadoop platform components; manages rolling updates, restarts, and rollbacks; and provides features for monitoring system health and diagnostics. Recent enhancements include an automated backup and recovery feature that not only replicates data, but preserves all the metadata to ensure that data remains in sync even after restoration. Cloudera Navigator, another recently added capability, addresses data lineage by tracking the origin and use of data, and by selectively enforcing access to specific sets of data.

Cloudera is also making Hadoop more accessible to the large professional skills base of SQL developers. Having long partnered with leading ETL, BI, and data warehousing platform and tool providers to connect Hadoop with relational platforms, Cloudera has taken the next step with Impala, which supports interactive SQL queries with a high-performance, parallel processing framework that works against any Hadoop file format. Impala is intended to supplement, not replace, the enterprise data warehouse, providing an interface that can be used not only by SQL developers but also by familiar SQL-based query and BI tools from providers such as Tableau, QlikView, and MicroStrategy.
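As a rough illustration of what this looks like to a SQL-literate developer, the sketch below issues an interactive Impala query from Python via the impyla DB-API client, assuming that client is installed; the host, table, and columns are hypothetical, and 21050 is Impala's customary daemon port:

    # Querying Impala from Python through the impyla DB-API client
    # (host, table, and column names are hypothetical).
    from impala.dbapi import connect

    conn = connect(host="impala-daemon.example.com", port=21050)
    cur = conn.cursor()

    # Ordinary SQL, executed directly against files stored in Hadoop
    # rather than against a separate relational copy of the data.
    cur.execute("""
        SELECT region, SUM(amount) AS revenue
        FROM sales_events
        GROUP BY region
        ORDER BY revenue DESC
    """)
    for region, revenue in cur.fetchall():
        print(region, revenue)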

Cloudera is also working with other initiatives designed to make Hadoop more versatile and accessible. Cloudera Search optimizes Apache Solr for the Hadoop platform, enabling users to query Hadoop data through a Google-like search experience. Additionally, Cloudera's support of the Apache Spark project will provide a complementary in-memory programming model for analytics.
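Because Cloudera Search exposes a standard Solr interface, such a "Google-like" query can be issued over plain HTTP. The sketch below is illustrative only; the host, collection name, and document fields are hypothetical:

    # A free-text search against a Solr index over Hadoop data, using
    # Solr's standard HTTP query API (host, collection, and fields are
    # hypothetical). Requires the requests package.
    import requests

    resp = requests.get(
        "http://search-node.example.com:8983/solr/server_logs/select",
        params={"q": "error AND checkout", "rows": 5, "wt": "json"},
    )
    for doc in resp.json()["response"]["docs"]:
        print(doc.get("id"), doc.get("message"))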

RECOMMENDATIONS FOR ENTERPRISES

Big Data and Hadoop should be evolutionary moves that expand the scope of analytics. Ultimately, Ovum believes that most enterprises will implement Big Data analytics as part of an analytics ecosystem in which queries are directed at the right data sets, on the right platform, at the right time, based on parameters such as cost, priority, required service levels, and the location of the data. Such federated analytics will give enterprises the flexibility they need -- and that is only possible if Hadoop is integrated with the rest of their analytic data platform environment.

When evaluating Hadoop platforms, examine the vendor's roadmap for supporting data integration, along with the core management, security, and data management capabilities that are deemed essential for any data warehousing platform. Admittedly, Hadoop technology is a rapidly evolving, fast-moving target; while the platform may not currently have parity with established relational data warehousing systems, new capabilities are emerging rapidly from open source and vendor-specific technologies and innovations. Nonetheless, as the natural path for most organizations is to pilot, it is not essential that all capabilities be available on day one. In the long run, however, your enterprise should plan on Hadoop as an addition that will function inside your data center. Under an "embrace and extend" strategy, your Hadoop implementation should comply with your existing policies regarding data access, security, data quality, and lifecycle management; but at the same time, those policies and practices will have to be extended because of the unique characteristics (and benefits) of managing Big Data.

APPENDIX

Author

Tony Baer, Principal Analyst, Ovum IT Information Management
tony.baer@ovum.com

Ovum Consulting

We hope that this analysis will help you make informed and imaginative business decisions. If you have further requirements, Ovum's consulting team may be able to help you. For more information about Ovum's consulting capabilities, please contact us directly at consulting@ovum.com.

Disclaimer

All Rights Reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior permission of the publisher, Ovum (an Informa business).

The facts of this report are believed to be correct at the time of publication but cannot be guaranteed. Please note that the findings, conclusions, and recommendations that Ovum delivers will be based on information gathered in good faith from both primary and secondary sources, whose accuracy we are not always in a position to guarantee. As such, Ovum can accept no liability whatever for actions taken based on any information that may subsequently prove to be incorrect.