Cloudera in the Public Cloud

Deployment Options for the Enterprise Data Hub

Version: Q414-102

Table of Contents

Executive Summary
The Case for Public Cloud
Public Cloud vs. On-Premise
Public Cloud Deployment Patterns
Cloudera Director: Hadoop in the Cloud Without Compromise
The Cloudera Difference
About Cloudera

On-demand provisioning and elasticity in the public cloud open new possibilities for Cloudera's enterprise data hub, yet this deployment option does not fundamentally change the architecture.

Executive Summary

Information-driven enterprises have long held the common business and IT objective of unified data management to improve insight and build knowledge. For many, the conventional data warehouse and data mart built on relational technology offered the only avenue to enterprise-grade analytics, while storage arrays and archives provided the only methods for keeping diverse data accessible for longer time periods.

Today, these organizations have a better way to address the challenge of data management: an enterprise data hub (EDH). The Cloudera enterprise data hub, built with Apache Hadoop, provides a flexible, scalable, and economical data management platform that can perform a variety of enterprise workloads, including batch processing, interactive SQL, enterprise search, advanced analytics, and more, on a single, shared copy of data on a common storage substrate.

Enterprises are embracing the enterprise data hub as the centerpiece of their data management strategy, and they are evaluating the public cloud as a deployment option. While deployment choice does not fundamentally change the architecture of the enterprise data hub, the additional benefit of on-demand provisioning and elasticity in the public cloud does open new possibilities for this evolution in data management.

[Figure: the enterprise data hub. Labels: Process, Discover, Model, Serve; Security and Administration; Unlimited Storage.]

Organizations that realize an enterprise data hub with Cloudera gain numerous benefits, including the full technology stack and ecosystem offerings built for Hadoop, comprehensive system and data management tools, and limitless data storage that is fine-grained, durable, readily available, and cost-effective for all data. Moreover, enterprise IT teams receive mission-critical support for their EDH systems, so business users have confidence that the data and applications are ready and able to meet the challenges in today's environment.

With Cloudera, enterprises can bring this same EDH experience to their cloud installations with no restrictions on their choice of cloud vendor. Deploying a Cloudera EDH to the cloud means that organizations can leverage the elasticity and on-demand consumption models best suited to their particular business and processing needs, yet still profit from the advantages offered by the EDH.

Acute capacity constraints in the data center and the relative importance of time-to-value rather than performance are often key drivers of public cloud deployment decisions.

The Case for Public Cloud

The public cloud is a set of compute, storage, and networking resources, ranging from bare-bones architecture to fully automated infrastructure-as-a-service stacks, that a service provider offers to the general public through an on-demand model. The value and importance of the public cloud, and of cloud computing in general, have been growing as more enterprises discover the convenience and flexibility of this deployment platform. There are a number of key business drivers that enterprises consider when weighing the public cloud option.

Procurement and Capacity

Enterprise IT teams typically need flexibility with proofs of concept (POCs), pilots, and trials to demonstrate the proper architecture for an enterprise data hub. As a result, enterprises tend to build their production environments after the POC completes as a way to mitigate the capital risk associated with procurement. Public cloud environments meet these needs well: enterprises can provision and change their evaluation environments very quickly, use them for the duration of the POC, incur limited usage costs, and avoid misaligned hardware purchases. Thus IT teams can develop the right architecture and configuration with minimal capital exposure and then confidently procure and provision the on-premise production environment.

Enterprises that can procure infrastructure quickly to deploy an enterprise data hub in production sometimes encounter physical capacity constraints in their data center. These organizations often leverage the public cloud as a way to gain the needed capacity and avoid provisioning delays.
Furthermore, an organization's first foray into enterprise data hub deployments is typically non-production, where the focus of the effort is on evaluation as well as training for cluster management, data management, and the various frameworks in the EDH. A POC or pilot program typically needs limited hardware to get started, and in most cases, time-to-value rather than performance via hardware is the most important criterion.

Strategic Flexibility

Enterprises often consider new projects and systems, like an EDH, candidates for public cloud deployment after adopting an infrastructure-level or corporate-level decision to embrace a cloud model. Some of the corporate drivers for such decisions include cloud backup, instant geo-locality, and elasticity. The case for Hadoop in the public cloud can be even stronger if the data itself is generated in the cloud, because deploying there minimizes data movement.

Over time, enterprises might have clusters both in the public cloud and on-premise in order to find the set of features that best fits the business and technology needs, and thus the enterprise data hub will span these two environments. As enterprise IT leaders plan their enterprise data hub strategy, they will need to ensure that their choice of cloud vendor does not dictate the EDH strategy, and vice versa, and they should avoid having a different EDH in each cloud vendor. These deployment considerations might not be immediate but are critical to a forward-thinking and adaptable IT strategy.

Data location, like cloud-based storage, and types of workloads, like periodic batch processing, are strong influencers on the decision to deploy into the public cloud, yet many see the total cost of ownership in terms of rapid procurement and provisioning of resources and the associated opportunity costs as the most important motivator.

Public Cloud vs. On-Premise

The decision to use public cloud infrastructure for an enterprise data hub is a fairly simple one for IT teams who have an immediate need for storage and computing or who are driven by an organization-wide initiative. For those weighing their options between on-premise and public cloud, there are several criteria to consider in deciding on the best deployment route.

Data Location

Where is the data generated? Data can be viewed as having mass and thus can prove difficult (and expensive) to move from storage to computing. If the EDH is not the primary location for data, best practices suggest establishing the enterprise data hub as close as possible to where the data is generated or stored to help mitigate the costs and effort, especially for the large volumes that are common to EDH workloads. That said, IT teams should explore the nature and use of the data closely, as volume and velocity might allow for streaming in small quantities or transfers of large, single blocks to an on-premise environment. Often, if data is generated in the public cloud or if the data is stored long term in cloud storage, such as an object store for backup or geo-locality, public cloud deployment becomes a more natural choice.

Workload Types

What are the workload characteristics? For periodic batch workloads such as MapReduce jobs, enterprises can realize cost savings by running the cluster only for the duration of the job and paying for the usage as opposed to keeping the cluster activated at all times. This is especially true if the workload runs only a couple of hours a day or a couple of days a week.
For workloads that have continuous and long-running performance needs, such as Apache HBase and Cloudera Impala, the overhead of commissioning and decommissioning a cluster for each period of use may not be justified.

Performance Demands

What are the performance needs? One of the underlying tenets of Hadoop is tightly coupled units of compute and local storage that scale out linearly and simultaneously. This proximity of computation to data enables Hadoop to parallelize the workload and significantly accelerate the processing of massive amounts of data within a short period of time. However, a common foundation of cloud architectures is pools of shared storage and virtualized compute capacity that are connected via a network pipe. These capabilities scale independently, but the network adds latency, and shared storage can become a performance bottleneck for a high-throughput MapReduce job. The exact performance needs vary from workload to workload.

The ecosystem of cloud vendors offers enterprises many architectural options and configurations, from fully virtual instances to standalone, bare-metal systems, that can address more directly the particular needs of a workload. For example, IT teams should examine the proximity of storage to compute as well as the degree of shared resources within the service as potential factors in performance.

Performance is often an important criterion when processing the large volumes of data typical of Hadoop workloads. For non-production, development, or test workloads, this factor might be less of a concern, which makes running these workloads against shared storage a potentially viable option. For production workloads, public cloud environments are still viable, but IT teams need to be more deliberate in their choices, for example regarding storage proximity and resource contention, in order to meet the performance requirements.
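The Data Location question above ("data can be viewed as having mass") can be made concrete with a rough transfer-time estimate. The sketch below is illustrative only: the function name, the assumed 70% link efficiency, and the sample volumes are hypothetical and are not figures from Cloudera or any cloud vendor.

```python
def transfer_hours(data_tb: float, link_gbps: float, efficiency: float = 0.7) -> float:
    """Estimate the hours needed to move `data_tb` terabytes over a network link.

    `efficiency` is an assumed fraction of the nominal bandwidth that a
    sustained transfer actually achieves (protocol overhead, contention).
    """
    if data_tb < 0:
        raise ValueError("data_tb must be non-negative")
    if link_gbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link_gbps must be positive and efficiency in (0, 1]")
    bits = data_tb * 1e12 * 8                      # decimal terabytes -> bits
    effective_bits_per_second = link_gbps * 1e9 * efficiency
    return bits / effective_bits_per_second / 3600


if __name__ == "__main__":
    # Hypothetical volumes and link speed, for illustration only.
    for tb in (1, 50, 500):
        hours = transfer_hours(tb, link_gbps=1.0)
        print(f"{tb:>4} TB over a 1 Gbps link: about {hours:,.1f} hours")
```

Running the estimate for the volumes an EDH is expected to hold gives a quick sense of whether moving data to the compute is practical or whether the cluster should be placed where the data already lives.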

Separating metadata from data gives Hadoop a scalable design for achieving high availability and tunable replication without sacrificing performance.

Cloud TCO

What is the difference in Total Cost of Ownership (TCO)? Calculating the TCO of a public cloud deployment can extend beyond the options for compute, storage, data transfer, and the pricing thereof. A good starting point to narrow down the options is to use reference architectures from Cloudera for the cloud environment of choice. Based on the options from the reference architecture best suited for the workload or workloads, enterprises can further develop their expected usage patterns and arrive at a more accurate TCO for deploying an EDH in the public cloud. Cloudera and its partners can further assist with TCO evaluations for any environment, including those that span on-premise and public cloud.

Public Cloud Deployment Patterns

The decision to employ a public cloud as part of a company's IT strategy is typically driven by a number of independent factors, and an EDH is commonly a component of this larger process. However, there are a number of cases where a Hadoop-based EDH is especially well suited to the benefits provided by the elasticity of cloud computing and is itself the driver of a cloud deployment model. Examples such as the parallel processing desired for search indexing and interactive query and the temporary influx of workload for batch processing coalesce into two primary deployment patterns that take advantage of EDH cloud environments.

Long-Running Clusters

The full-fidelity data experience of the enterprise data hub is based on the concept of collocated storage and compute on a cluster of industry-standard servers. This tenet implies a long-running cluster within the cloud environment that provides the base storage for the data and the compute power for typical day-to-day activities, and this type of cluster is not very different from a typical on-premise deployment.

The EDH, once established in the cloud, is managed exactly as an on-premise deployment, but there are some unique benefits to the cloud environment. For example, one key advantage is that IT teams can provision new capacity with a few simple commands. In a matter of minutes, enterprise IT teams can bring online a new cluster that meets additional business needs or grow the storage or computing capacity of an existing cluster for a current business process. Enterprises gain IT agility without having to worry about data center capacity issues and long procurement processes.

A further benefit of a cloud environment is that enterprises are not restricted to current server or cluster configurations if business needs change. For a typical on-premise environment, IT teams must determine CPU, memory, and disk capacity at the time of procurement and often purchase servers with more capacity than currently necessary to future-proof the infrastructure investment. In the cloud model, however, IT administrators can provision servers with different configurations at will. Enterprises can therefore provision clusters exactly as needed for today, not tomorrow, thus maximizing working capital, yet also adapt to changing business needs by allocating new servers with more CPU, memory, or disk and decommissioning older or obsolete servers.

[Figure: long-running cluster. Labels: Business Services, Provisioned Servers, Data; Cloud, On-Premise.]
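As a starting point for the Cloud TCO discussion above, the monthly bill for a long-running cluster can be sketched as the sum of its compute, storage, and data transfer components. This is a minimal sketch, not a complete TCO model: the class names, fields, and every price in the usage note are hypothetical placeholders, and real pricing should come from the chosen vendor and the relevant reference architecture.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Prices:
    per_instance_hour: float    # hypothetical price of one instance for one hour
    per_gb_month: float         # hypothetical price of storing 1 GB for a month
    per_gb_transfer_out: float  # hypothetical price of moving 1 GB out of the cloud


@dataclass(frozen=True)
class MonthlyUsage:
    instances: int              # number of servers in the cluster
    hours_per_instance: float   # hours each server runs during the month
    stored_gb: float            # average data volume held in cloud storage
    transfer_out_gb: float      # data moved out of the cloud during the month


def monthly_cost(usage: MonthlyUsage, prices: Prices) -> Dict[str, float]:
    """Break a month's usage into compute, storage, and transfer costs."""
    compute = usage.instances * usage.hours_per_instance * prices.per_instance_hour
    storage = usage.stored_gb * prices.per_gb_month
    transfer = usage.transfer_out_gb * prices.per_gb_transfer_out
    return {
        "compute": compute,
        "storage": storage,
        "transfer": transfer,
        "total": compute + storage + transfer,
    }
```

For example, `monthly_cost(MonthlyUsage(10, 730, 20000, 500), Prices(0.50, 0.03, 0.09))` breaks down a ten-server cluster that runs all month under those placeholder prices. Items the text says extend beyond these three components, such as support and opportunity costs, are deliberately left out.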

Periodic and Transient Workloads

Even when operating a long-running cluster, businesses might need additional capacity for periodic workloads. Monthly or bi-weekly reporting processes are typical examples that represent additional computing capacity needs. Once an enterprise has established a production EDH in the cloud, IT teams can dynamically grow and shrink computing capacity in response to these periodic jobs. Administrators simply commission the new report servers as needed, process the reports, store the resulting information back into the EDH, and then decommission the servers. This periodic lifecycle translates into reduced costs: instead of paying for extra machines that are only partially utilized, an enterprise pays for only the hours utilized.

Some workloads are even more transient and might not require a long-running cluster. For example, an organization may have a large amount of data to process whose results might require significant time to interpret as useful or to determine the next task. Procuring servers for this kind of transient or sporadic activity might not make economic sense for some organizations. The cloud offers a compelling solution to this type of workload by combining rapid cluster provisioning and low-cost storage capabilities, such as Amazon S3. In this workload lifecycle, administrators provision a Hadoop cluster, import the data from a cloud object store, process the data, write the result back to the object store, and then decommission the cluster. This approach can be very cost-effective when processing massive amounts of data if the workload is highly transient.

[Figure: periodic and transient workloads. Labels: Reporting Task, Processing Task, Provisioned Servers, Temporary Servers, Reports, Cloud Storage, Import & Export.]

For the occasional execution of batch jobs, elastic cloud environments might be more cost-efficient than dedicated long-running clusters. However, IT administrators should consider that multiple users might run periodic, transient jobs against the same dataset that is stored in an object store, for example. In this situation, the aggregate utilization of the cluster is a more relevant metric for calculating the cost benefits. IT teams might discover that always-on clusters are more economical than ones repeatedly provisioned for each user.
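The trade-off between per-job transient clusters and an always-on cluster can be sketched as a simple break-even calculation. The model below is an illustration under stated assumptions: every transient run pays an overhead for provisioning and for importing data from the object store, the always-on cluster has the same size and hourly price and avoids that overhead, and all names and numbers are hypothetical.

```python
HOURS_PER_MONTH = 730.0  # average month: 8,760 hours per year / 12


def compare_costs(users: int, jobs_per_user: int, hours_per_job: float,
                  overhead_hours: float, nodes: int,
                  price_per_node_hour: float) -> dict:
    """Compare a month of per-job transient clusters with one always-on cluster.

    `overhead_hours` is the assumed time each transient run spends on
    provisioning and data import before useful processing starts.
    """
    total_jobs = users * jobs_per_user
    busy_hours = total_jobs * hours_per_job                       # useful work only
    billed_hours = total_jobs * (hours_per_job + overhead_hours)  # work plus overhead
    transient = billed_hours * nodes * price_per_node_hour
    always_on = HOURS_PER_MONTH * nodes * price_per_node_hour
    return {
        # Share of the month an always-on cluster would spend doing useful work.
        # A value above 1.0 means the jobs would not fit on one cluster of this size.
        "aggregate_utilization": busy_hours / HOURS_PER_MONTH,
        "transient_cost": transient,
        "always_on_cost": always_on,
        "cheaper": "transient" if transient < always_on else "always-on",
    }
```

Under this model a single user with a handful of jobs favors the transient pattern, while many users repeatedly importing the same dataset can push the billed hours past the always-on cost even though the cluster would be busy for only part of the month.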

The long-term vision of Cloudera is to embrace the potential and flexibility of the hybrid model, where the enterprise data hub can operate transparently between on-premises, private cloud, and public cloud deployments. By bringing together a diverse partner ecosystem of cloud providers, Cloudera is helping customers bring Hadoop and the EDH to more enterprise users and applications. Cloudera continues to be the industry standard for next-generation enterprise data management and analytics, wherever data and workloads live. To learn more about Cloudera's broad partner ecosystem, visit http://www.cloudera.com/content/cloudera/en/solutions/partner.html

Cloudera Director: Hadoop in the Cloud Without Compromise

Cloudera Director, part of Cloudera's platform, brings consistency and ease for users looking to deploy in the cloud, while still maintaining the benefits of Cloudera's enterprise data hub. Cloudera Director is the first portable, self-service solution for deploying and managing enterprise-grade Hadoop in the cloud. It provides a single-pane-of-glass administration experience that lets central IT reduce costs and deliver agility and lets end users provision clusters through self-service and scale them elastically, all while ensuring auditability. Because Cloudera Director is integrated with Cloudera's enterprise data hub, users not only get all the features necessary for cloud deployments but also continue to get all of the enterprise-grade features available with Cloudera's platform, including the security, governance, and administration necessary for production-ready deployments.

With Cloudera Director, users can deploy one or more clusters in their preferred VPC environment, running on an EC2 instance. Cloudera Director offers the choice of a simple web user interface, command line interface (CLI), or REST API for deploying and managing CDH or Cloudera Enterprise clusters.
The web UI provides a single dashboard view of all clusters deployed through Cloudera Director and includes a self-service experience for deploying, cloning, dynamically scaling, and terminating clusters. The CLI and API provide advanced support for more customized and complex cluster topologies that are well-suited for a wider variety of workloads. Additionally, both administrators and users can repeatedly deploy multiple clusters on-demand, using cluster blueprints. This reliable, cloud-centric experience can be leveraged across multiple cloud providers, with current support available with Amazon Web Services, and other cloud environments planned for future releases.

Key benefits of Cloudera Director include:

Customer Benefit: Simplify Cluster Lifecycle Management
Unique Capability: Simple UI to spin up, scale, and spin down clusters
Enabling Features: Self-service spin up/teardown; dynamic scaling for spiky workloads; simple cloning of clusters; cloud blueprints for repeatable deployments

Customer Benefit: Eliminate Lock-in
Unique Capability: Flexible, open platform
Enabling Features: 100% open source Hadoop distribution

Customer Benefit: Accelerate Time-to-Value
Unique Capability: Enterprise-ready security and administration
Enabling Features: Native support for hybrid deployments; third-party software deployment within same workflow; support for custom, workload-specific deployments; support for complex cluster topologies; minimum size cluster when capacity constrained; management tooling; compliance-ready security and governance

Customer Benefit: Reduce Support Costs
Unique Capability: Monitoring & metering tools
Enabling Features: Multi-cluster dashboard; backup and disaster recovery with an optimized cloud storage connector; instance tracking for account billing

The Cloudera Difference

Enterprises that deploy a Cloudera enterprise data hub in the public cloud can leverage several benefits unique to Cloudera. Business and technology teams gain the same full-fidelity EDH experience as an on-premise environment, from technology capabilities to system and data management tools, coupled with mission-critical support. Organizations also do not have to compromise on enterprise-grade capabilities such as data security, data governance, and the latest innovations in the Hadoop platform, such as Cloudera Impala, Apache Sentry, Cloudera Search, and others, when operating in the public cloud.

In addition, Cloudera has designed an expanded partner program that includes a cloud services and solution provider division, called Cloudera Connect: Cloud. The program can meet the growing needs of organizations looking to optimize Hadoop deployments in cloud environments for unified data management and analytics, like the EDH, by offering the utmost flexibility in deployment, consumption, and choice of vendor.

Enterprises now have a choice of multiple pricing and support models for the enterprise data hub in the cloud. Organizations can choose either a traditional subscription model or a usage-based model for Cloudera's offerings while purchasing infrastructure separately from the cloud partner. Alternatively, organizations can purchase both Cloudera products and cloud infrastructure directly through their cloud vendor of choice as one offering and pay one bill.

Moreover, IT strategists should anticipate EDH deployments in any environment, from on-premise to cloud, in order to meet more fully the particular demands and restrictions of a workload, data set, or business user. In all of these situations, the full-fidelity experience of an EDH and the continuity of the experience, no matter the environment, are critical to achieving maximum efficiency of applications and personnel. Cloudera is unique in providing this advantage to enterprises while leaving the choice of cloud vendor to the customer.

With upcoming enhancements to the Cloudera product suite that streamline cloud operations, enterprises can easily leverage the elasticity and on-demand consumption models of the public cloud for their Hadoop installations and consider platforms like OpenStack and VMware for private cloud deployments. Organizations need to consider multiple factors when deciding what part of the EDH footprint resides where. Cloudera is well positioned to help enterprises explore these factors and enable all deployment options available. With Cloudera, enterprises can take full advantage of the enterprise data hub and the next generation in data management across all deployment options and environments, from on-premise to public cloud.

About Cloudera

Cloudera is revolutionizing enterprise data management by offering the first unified Platform for Big Data, an enterprise data hub built on Apache Hadoop. Cloudera offers enterprises one place to store, access, process, secure, and analyze all their data, empowering them to extend the value of existing investments while enabling fundamental new ways to derive value from their data. Cloudera's open source Big Data platform is the most widely adopted in the world, and Cloudera is the most prolific contributor to the open source Hadoop ecosystem. As the leading educator of Hadoop professionals, Cloudera has trained over 22,000 individuals worldwide. Over 1,200 partners and a seasoned professional services team help deliver greater time to value. Finally, only Cloudera provides proactive and predictive support to run an enterprise data hub with confidence. Leading organizations in every industry plus top public sector organizations globally run Cloudera in production. www.cloudera.com

cloudera.com
1-888-789-1488 or 1-650-362-0488
Cloudera, Inc., 1001 Page Mill Road, Palo Alto, CA 94304, USA

© 2015 Cloudera, Inc. All rights reserved. Cloudera and the Cloudera logo are trademarks or registered trademarks of Cloudera Inc. in the USA and other countries. All other trademarks are the property of their respective companies. Information is subject to change without notice.