Cloudera in the Public Cloud

Deployment Options for the Enterprise Data Hub
Version: Q414-102

Table of Contents

Executive Summary
The Case for Public Cloud
Public Cloud vs. On-Premise
Public Cloud Deployment Patterns
Cloudera Director: Hadoop in the Cloud Without Compromise
The Cloudera Difference
About Cloudera

On-demand provisioning and elasticity in the public cloud open new possibilities for Cloudera's enterprise data hub, yet this deployment option does not fundamentally change the architecture.

Executive Summary

Information-driven enterprises have long held the common business and IT objective of unified data management to improve insight and build knowledge. For many, the conventional data warehouse and data mart built on relational technology offered the only avenue to enterprise-grade analytics, while storage arrays and archives provided the only methods for keeping diverse data accessible for longer time periods.

Today, these organizations have a better way to address the challenge of data management: an enterprise data hub (EDH). The Cloudera enterprise data hub, built with Apache Hadoop, provides a flexible, scalable, and economical data management platform that can perform a variety of enterprise workloads, including batch processing, interactive SQL, enterprise search, advanced analytics, and more, on a single, shared copy of data on a common storage substrate.

Enterprises are embracing the enterprise data hub as the centerpiece of their data management strategy, and they are evaluating the public cloud as a deployment option. While deployment choice does not fundamentally change the architecture of the enterprise data hub, the additional benefit of on-demand provisioning and elasticity in the public cloud does open new possibilities for this evolution in data management.

[Figure: enterprise data hub components: Process, Discover, Model, Serve; Security and Administration; Unlimited Storage]

Organizations that realize an enterprise data hub with Cloudera gain numerous benefits, including the full technology stack and ecosystem offerings built for Hadoop, comprehensive system and data management tools, and limitless data storage that is fine-grained, durable, readily available, and cost-effective for all data. Moreover, enterprise IT teams receive mission-critical support for their EDH systems, so business users have confidence that the data and applications are ready and able to meet the challenges in today's environment.

With Cloudera, enterprises can bring this same EDH experience to their cloud installations with no restrictions on their choice of cloud vendor. Deploying a Cloudera EDH to the cloud means that organizations can leverage the elasticity and on-demand consumption models best suited to their particular business and processing needs, yet still profit from the advantages offered by the EDH.

Acute capacity constraints in the data center and the relative importance of time-to-value rather than performance are often key drivers of public cloud deployment decisions.

The Case for Public Cloud

The public cloud is a set of compute, storage, and networking resources, ranging from bare-bones architecture to fully automated infrastructure-as-a-service stacks, that a service provider offers to the general public through an on-demand model. The value and importance of the public cloud, and of cloud computing in general, have been growing as more enterprises discover the convenience and flexibility of this deployment platform. There are a number of key business drivers that enterprises consider when weighing the public cloud option.

Procurement and Capacity

Enterprise IT teams typically need flexibility with proofs of concept (POCs), pilots, and trials to demonstrate the proper architecture for an enterprise data hub. As a result, enterprises tend to build their production environments after the POC completes as a way to mitigate the capital risk associated with procurement. Public cloud environments meet these needs well: enterprises can provision and change their evaluation environments very quickly, use them for the duration of the POC, incur only limited usage costs, and avoid misaligned hardware purchases. Thus IT teams can develop the right architecture and configuration with minimal capital exposure and then confidently procure and provision the on-premise production environment.

Enterprises that can procure infrastructure quickly to deploy an enterprise data hub in production sometimes encounter physical capacity constraints in their data center. These organizations often leverage the public cloud as a way to gain the needed capacity and avoid provisioning delays.
Furthermore, an organization's first foray into enterprise data hub deployments is typically non-production, where the focus of the effort is on evaluation as well as training for cluster management, data management, and the various frameworks in the EDH. A POC or pilot program typically needs limited hardware to get started, and in most cases, time-to-value rather than performance via hardware is the most important criterion.

Strategic Flexibility

Enterprises often consider new projects and systems, like an EDH, candidates for public cloud deployment after adopting an infrastructure-level or corporate-level decision to embrace a cloud model. Some of the corporate drivers for such decisions include cloud backup, instant geo-locality, and elasticity. The case for Hadoop in the public cloud can be even stronger if the data itself is generated in the cloud, because deploying there minimizes data movement.

Over time, enterprises might have clusters both in the public cloud and on-premise in order to find the proper set of features that best fit the business and technology needs, and thus the enterprise data hub will span these two environments. As enterprise IT leaders plan their enterprise data hub strategy, they will need to ensure that their choice of cloud vendor does not dictate the EDH strategy, and vice versa, and they should avoid having a different EDH with each cloud vendor. These deployment considerations might not be immediate but are critical to a forward-thinking and adaptable IT strategy.

Data location, like cloud-based storage, and types of workloads, like periodic batch processing, are strong influencers on the decision to deploy into the public cloud, yet many see the total cost of ownership in terms of rapid procurement and provisioning of resources and the associated opportunity costs as the most important motivator.

Public Cloud vs. On-Premise

The decision to use public cloud infrastructure for an enterprise data hub is a fairly simple one for IT teams who have an immediate need for storage and computing or who are driven by an organization-wide initiative. For those weighing their options between on-premise and public cloud, there are several criteria to consider in deciding on the best deployment route.

Data Location

Where is the data generated? Data can be viewed as having mass and thus can prove difficult (and expensive) to move from storage to computing. If the EDH is not the primary location for data, best practices suggest establishing the enterprise data hub as close as possible to where the data is generated or stored to help mitigate the cost and effort of moving it, especially for the large volumes that are common to EDH workloads. That said, IT teams should explore the nature and use of the data closely, as volume and velocity might allow for streaming in small quantities or transfers of large, single blocks to an on-premise environment. Often, if data is generated in the public cloud or if the data is stored long term in cloud storage, such as an object store for backup or geo-locality, public cloud deployment becomes a more natural choice.

Workload Types

What are the workload characteristics? For periodic batch workloads such as MapReduce jobs, enterprises can realize cost savings by running the cluster only for the duration of the job and paying for the usage, as opposed to keeping the cluster activated at all times. This is especially true if the workload is run only a couple of hours a day or a couple of days a week.
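As a rough illustration of the batch-workload trade-off described above, the following Python sketch compares the weekly cost of keeping a cluster running at all times with the cost of running it only for the duration of each job. The node count, hourly rate, and per-run provisioning overhead are hypothetical placeholders rather than vendor pricing, and the model deliberately ignores storage and data transfer charges.

```python
# Minimal cost sketch, assuming a flat hourly price per node (hypothetical).
HOURS_PER_WEEK = 24 * 7


def weekly_cost_always_on(nodes, rate_per_node_hour):
    """Cost of a cluster that stays up for the whole week."""
    return nodes * rate_per_node_hour * HOURS_PER_WEEK


def weekly_cost_on_demand(nodes, rate_per_node_hour, job_hours,
                          runs_per_week, overhead_hours_per_run=0.0):
    """Cost of a cluster that exists only while each job runs.

    overhead_hours_per_run is an assumed allowance for provisioning
    and teardown time that is billed but does no useful work.
    """
    billed_hours = (job_hours + overhead_hours_per_run) * runs_per_week
    return nodes * rate_per_node_hour * billed_hours


if __name__ == "__main__":
    # Hypothetical workload: 10 nodes, a 2-hour job once a day,
    # half an hour of overhead per run, 0.50 currency units per node-hour.
    always_on = weekly_cost_always_on(10, 0.50)
    on_demand = weekly_cost_on_demand(10, 0.50, 2.0, 7, 0.5)
    print(f"always-on: {always_on:.2f}")  # 840.00
    print(f"on-demand: {on_demand:.2f}")  # 87.50
```

Under these assumed numbers the on-demand cluster is billed for 17.5 hours a week instead of 168, which is where the savings for short, periodic jobs come from. The gap narrows as the job runs longer or more often.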
For workloads that have continuous and long-running performance needs, such as Apache HBase and Cloudera Impala, the overhead of commissioning and decommissioning a cluster for the term of the event may not be justified.

Performance Demands

What are the performance needs? One of the underlying tenets of Hadoop is tightly coupled units of compute and local storage that scale out linearly and simultaneously. This computation proximity enables Hadoop to parallelize the workload and significantly accelerate the processing of massive amounts of data within a short period of time. However, a common foundation of cloud architectures is pools of shared storage and virtualized compute capacity that are connected via a network pipe. These capabilities scale independently, but the network adds latency, and shared storage can become a performance bottleneck for a high-throughput MapReduce job. The exact performance needs vary from workload to workload.

The ecosystem of cloud vendors offers enterprises many architectural options and configurations that can address more directly the particular needs of a workload. For example, IT teams should examine the proximity of storage to compute as well as the degree of shared resources within the service as potential factors in performance, from fully virtual instances to standalone, bare-metal systems.

Performance is often an important criterion when processing the large volumes of data typical of Hadoop workloads. For non-production, development, or test workloads, this factor might be less of a concern, which makes running these workloads against shared storage a potentially viable option. For production workloads, public cloud environments are still viable, but IT teams need to be more deliberate about factors such as storage proximity and resource contention in order to meet the performance requirements.

Separating metadata from data gives Hadoop a scalable design for achieving high availability and tunable replication without sacrificing performance.

Cloud TCO

What is the difference in Total Cost of Ownership (TCO)? Calculating the TCO of a public cloud deployment can extend beyond the options for compute, storage, data transfer, and the pricing thereof. A good starting point to narrow down the options is to use reference architectures from Cloudera for the cloud environment of choice. Based on the options from the reference architecture best suited for the workload or workloads, enterprises can further develop their expected usage patterns and arrive at a more accurate TCO for deploying an EDH in the public cloud. Cloudera and its partners can further assist with TCO evaluations for any environment, including those that span on-premise and public cloud.

Public Cloud Deployment Patterns

The decision to employ a public cloud as part of a company's IT strategy is typically driven by a number of independent factors, and an EDH is commonly a component of this larger process. However, there are a number of cases where a Hadoop-based EDH is especially well suited to the benefits provided by the elasticity of cloud computing and is itself the driver of a cloud deployment model. Examples such as the parallel processing desired for search indexing and interactive query and the temporary influx of workload for batch processing coalesce into two primary deployment patterns that take advantage of EDH cloud environments.

Long-Running Clusters

The full-fidelity data experience of the enterprise data hub is based on the concept of collocated storage and compute on a cluster of industry-standard servers.
This tenet implies a long-running cluster within the cloud environment that provides the base storage for the data and the compute power for typical day-to-day activities, and this type of cluster is not very different from a typical on-premise deployment.

The EDH, once established in the cloud, is managed exactly like an on-premise deployment, but there are some unique benefits to the cloud environment. For example, one key advantage is that IT teams can provision new capacity with a few simple commands. In a matter of minutes, enterprise IT teams can bring online a new cluster that meets additional business needs or grow the storage or computing capacity of an existing cluster for a current business process. Enterprises gain IT agility without having to worry about data center capacity issues and long procurement processes.

A further benefit of a cloud environment is that enterprises are not restricted to current server or cluster configurations if business needs change. For a typical on-premise environment, IT teams must determine CPU, memory, and disk capacity at the time of procurement and often purchase servers with more capacity than currently necessary to future-proof the infrastructure investment. In the cloud model, however, IT administrators can provision servers with different configurations at will. Enterprises can therefore provision clusters exactly as needed for today, not tomorrow, thus maximizing working capital, yet also adapt to changing business needs by allocating new servers with more CPU, memory, or disk and decommissioning older or obsolete servers.

[Figure: Long-Running deployment. Labels: Business Services, Provisioned Servers, Data, Cloud, On-Premise]

Periodic and Transient Workloads

Even when operating a long-running cluster, businesses might need additional capacity for periodic workloads. Monthly or bi-weekly reporting processes are typical examples that represent additional computing capacity needs. Once an enterprise has established a production EDH in the cloud, IT teams can dynamically grow and shrink computing capacity in response to these periodic jobs. Administrators simply commission the new report servers as needed, process the reports, store the resulting information back into the EDH, and then decommission the servers. This periodic lifecycle translates into reduced costs, because instead of paying for extra machines that are only partially utilized, an enterprise pays only for the hours used.

[Figure: Periodic and Transient workloads. Labels: Reporting Task, Processing Task, Provisioned Servers, Temporary Servers, Reports, Cloud Storage, Import & Export]

Some workloads are even more transient and might not require a long-running cluster. For example, an organization may have a large amount of data to process whose results might require significant time to interpret or to determine the next task. Procuring servers for this kind of transient or sporadic activity might not make economic sense for some organizations. The cloud offers a compelling solution to this type of workload by combining rapid cluster provisioning and low-cost storage capabilities, such as Amazon S3. In this workload lifecycle, administrators provision a Hadoop cluster, import the data from a cloud object store, process the data, write the result back to the object store, and then decommission the cluster. This approach can be very cost-effective when processing massive amounts of data if the workload is highly transient. For the occasional execution of batch jobs, elastic cloud environments might be more cost-efficient than dedicated long-running clusters.
However, IT administrators should consider that multiple users might run periodic, transient jobs against the same dataset, stored in an object store for example. In this situation, the aggregate utilization of the cluster is a more relevant metric for calculating the cost benefits. IT teams might discover that always-on clusters are more economical than ones repeatedly provisioned for each user.
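The aggregate-utilization argument can be made concrete with a small Python sketch. It sums the cluster-hours billed when every user provisions a transient cluster for each run, then compares that total with the hours an always-on cluster of the same size would be billed over the same period. The job figures, the 720-hour month, and the assumptions that both options use the same cluster size and hourly rate and that a shared cluster could absorb all jobs are hypothetical simplifications, not measurements.

```python
from dataclasses import dataclass

HOURS_PER_MONTH = 720.0  # assumed 30-day month


@dataclass(frozen=True)
class TransientJob:
    """One user's recurring job on a cluster provisioned per run."""
    user: str
    run_hours: float       # time spent processing per run
    overhead_hours: float  # provisioning plus import/export per run
    runs_per_month: int


def billed_cluster_hours(jobs):
    """Total cluster-hours billed across all per-user transient runs."""
    return sum((j.run_hours + j.overhead_hours) * j.runs_per_month
               for j in jobs)


def transient_to_always_on_ratio(jobs, hours_in_period=HOURS_PER_MONTH):
    """Above 1.0, per-user transient clusters cost more than one
    always-on cluster of the same size and hourly rate."""
    return billed_cluster_hours(jobs) / hours_in_period


def cheaper_option(jobs, hours_in_period=HOURS_PER_MONTH):
    """Name the cheaper option under the stated assumptions."""
    ratio = transient_to_always_on_ratio(jobs, hours_in_period)
    return "always-on" if ratio > 1.0 else "transient"


if __name__ == "__main__":
    # Hypothetical users sharing one dataset in an object store.
    jobs = [
        TransientJob("analyst-a", 3.0, 1.0, 60),  # 240 cluster-hours
        TransientJob("analyst-b", 5.0, 1.0, 50),  # 300 cluster-hours
        TransientJob("analyst-c", 2.0, 1.0, 80),  # 240 cluster-hours
    ]
    print(billed_cluster_hours(jobs))  # 780.0
    print(cheaper_option(jobs))        # always-on
```

Each user alone stays well below the break-even point, but together they are billed for 780 cluster-hours against the 720 hours of a single always-on cluster, so the shared cluster comes out cheaper in this example.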

The long-term vision of Cloudera is to embrace the potential and flexibility of the hybrid model, where the enterprise data hub can operate transparently between on-premises, private cloud, and public cloud deployments. By bringing together a diverse partner ecosystem of cloud providers, Cloudera is helping customers bring Hadoop and the EDH to more enterprise users and applications. Cloudera continues to be the industry standard for next-generation enterprise data management and analytics, wherever data and workloads live. To learn more about Cloudera's broad partner ecosystem, visit http://www.cloudera.com/content/cloudera/en/solutions/partner.html

Cloudera Director: Hadoop in the Cloud Without Compromise

Cloudera Director, part of Cloudera's platform, brings consistency and ease to users looking to deploy in the cloud, while still maintaining the benefits of Cloudera's enterprise data hub. Cloudera Director is the first portable, self-service solution for deploying and managing enterprise-grade Hadoop in the cloud. It provides a single-pane-of-glass administration experience that lets central IT reduce costs and deliver agility and lets end users provision and elastically scale clusters through self-service, all while ensuring auditability. Because Cloudera Director is integrated with Cloudera's enterprise data hub, users not only get all the features necessary for cloud deployments, but also continue to get all of the enterprise-grade features available with Cloudera's platform, including the security, governance, and administration necessary for production-ready deployments.

With Cloudera Director, users can deploy one or more clusters in their preferred VPC environment, running on an EC2 instance. Cloudera Director offers the choice of a simple web user interface, command line interface (CLI), or REST API for deploying and managing CDH or Cloudera Enterprise clusters.
The web UI provides a single dashboard view of all clusters deployed through Cloudera Director and includes a self-service experience for deploying, cloning, dynamically scaling, and terminating clusters. The CLI and API provide advanced support for more customized and complex cluster topologies that are well suited to a wider variety of workloads. Additionally, both administrators and users can repeatedly deploy multiple clusters on demand, using cluster blueprints. This reliable, cloud-centric experience can be leveraged across multiple cloud providers, with support currently available for Amazon Web Services and other cloud environments planned for future releases.

Key benefits of Cloudera Director include:

Customer Benefit: Simplify Cluster Lifecycle Management
Unique Capability: Simple UI to spin up, scale, and spin down clusters
Enabling Features: Self-service spin up/teardown; dynamic scaling for spiky workloads; simple cloning of clusters; cloud blueprints for repeatable deployments

Customer Benefit: Eliminate Lock-in
Unique Capability: Flexible, open platform
Enabling Features: 100% open source Hadoop distribution

Customer Benefit: Accelerate Time-to-Value
Unique Capability: Enterprise-ready security and administration
Enabling Features: Native support for hybrid deployments; third-party software deployment within same workflow; support for custom, workload-specific deployments; support for complex cluster topologies; minimum size cluster when capacity constrained; management tooling; compliance-ready security and governance

Customer Benefit: Reduce Support Costs
Unique Capability: Monitoring & metering tools
Enabling Features: Multi-cluster dashboard; backup and disaster recovery with an optimized cloud storage connector; instance tracking for account billing

The Cloudera Difference

Enterprises that deploy a Cloudera enterprise data hub in the public cloud can leverage several benefits unique to Cloudera. Business and technology teams gain the same full-fidelity EDH experience as in an on-premise environment, from technology capabilities to system and data management tools, coupled with mission-critical support. Organizations also do not have to compromise on enterprise-grade capabilities such as data security, data governance, and the latest innovations in the Hadoop platform, such as Cloudera Impala, Apache Sentry, Cloudera Search, and others, when operating in the public cloud.

In addition, Cloudera has designed an expanded partner program that includes a cloud services and solution provider division, called Cloudera Connect: Cloud. This program can meet the growing needs of organizations looking to optimize Hadoop deployments in cloud environments for unified data management and analytics, like the EDH, by offering the utmost flexibility in deployment, consumption, and choice of vendor.

Enterprises now have a choice of multiple pricing and support models for the enterprise data hub in the cloud. Organizations can choose either a traditional subscription model or a usage-based model for Cloudera's offerings while purchasing infrastructure separately from the cloud partner. Alternatively, organizations can purchase both Cloudera products and cloud infrastructure directly through their cloud vendor of choice as one offering and pay one bill.

Moreover, IT strategists should anticipate EDH deployments in any environment, from on-premise to cloud, in order to meet more fully the particular demands and restrictions of a workload, data set, or business user. In all of these situations, the full-fidelity experience of an EDH and the continuity of the experience, no matter the environment, are critical to achieving maximum efficiency of applications and personnel.
Cloudera is unique in providing this advantage to enterprises while leaving the choice of cloud vendor to the customer. With upcoming enhancements to the Cloudera product suite that streamline cloud operations, enterprises can easily leverage the elasticity and on-demand consumption models of the public cloud for their Hadoop installations and consider platforms like OpenStack and VMware for private cloud deployments.

Organizations need to consider multiple factors when deciding what part of the EDH footprint resides where. Cloudera is well positioned to help enterprises explore these factors and enable all deployment options available. With Cloudera, enterprises can take full advantage of the enterprise data hub and the next generation in data management across all deployment options and environments, from on-premise to public cloud.

About Cloudera

Cloudera is revolutionizing enterprise data management by offering the first unified Platform for Big Data, an enterprise data hub built on Apache Hadoop. Cloudera offers enterprises one place to store, access, process, secure, and analyze all their data, empowering them to extend the value of existing investments while enabling fundamental new ways to derive value from their data. Cloudera's open source Big Data platform is the most widely adopted in the world, and Cloudera is the most prolific contributor to the open source Hadoop ecosystem. As the leading educator of Hadoop professionals, Cloudera has trained over 22,000 individuals worldwide. Over 1,200 partners and a seasoned professional services team help deliver faster time to value. Finally, only Cloudera provides proactive and predictive support to run an enterprise data hub with confidence. Leading organizations in every industry plus top public sector organizations globally run Cloudera in production. www.cloudera.com

1-888-789-1488 or 1-650-362-0488
Cloudera, Inc., 1001 Page Mill Road, Palo Alto, CA 94304, USA
© 2015 Cloudera, Inc.
All rights reserved. Cloudera and the Cloudera logo are trademarks or registered trademarks of Cloudera Inc. in the USA and other countries. All other trademarks are the property of their respective companies. Information is subject to change without notice.