Deploying an Operational Data Store Designed for Big Data

A fast, secure, and scalable data staging environment with no data volume or variety constraints

Version: 102
Table of Contents

Introduction
Challenges with a Traditional Operational Data Store
How Cloudera Can Help
The Modern Operational Data Store
Introduction

An operational data store (ODS) is a data landing zone that integrates data from multiple sources for operational and analytical purposes. The ODS is intended to:

- Gather data from dissimilar source systems
- Process the data for analytical and operational use
- Store the data for future access and reuse

Many information professionals today are watching the burgeoning growth in data generation overwhelm their operational data stores. Traditional architectures struggle with large data volumes and unstructured data, such as information gleaned from log data or social media, and the amount of data these data stores must ingest and process today is creating performance bottlenecks. Organizations cannot afford these bottlenecks, and they are scrambling to maintain an efficient ODS as the amount of information available and necessary to perform business-critical analyses steadily grows.

When the data warehouse performs multiple simultaneous complex transformations (which can consume up to 80% of a traditional system's resources) and bogs down the system, everyone suffers. A user who wants to check reports while this is happening might not get critical information in time because the infrastructure is busy with resource-intensive processing. Similarly, analysts might be forced to work with outdated or incomplete data because the system fails or takes too long to return answers. The problem only worsens for traditional systems as your organization adds larger volumes of diverse data.

The explosive growth of data has created a new paradigm, one that requires organizations to extend their traditional information architecture in order to handle larger volumes of diverse data. What does that augmented architecture look like? It includes an enterprise data hub (EDH).
Challenges with a Traditional ODS

Consider the top three issues businesses face with a traditional ODS and the temporary fixes they often resort to in the hope of addressing them.

1. Limited ingest - Collecting and ingesting a wide variety of diverse data is not a simple task. As businesses acquire more data, IT departments add systems to their existing architecture to increase capacity. To avoid overburdening the traditional ODS, some organizations have chosen to devalue certain data, making only the data deemed most valuable accessible to end users and migrating less valuable data to archives or deleting it, sometimes without ever using it. This limits analysts' ability to perform agile experiments and include complementary data in their analytics and operations.

2. Inefficient processing - Most organizations need to process large volumes of diverse data efficiently. These processing pipelines not only take months to set up, but can also take resources away from other mission-critical workloads. And if this processing fails or takes so long that results don't reach end users in a timely manner, users will be forced to work with outdated or incomplete information.

3. Discarded/archived data - As volumes of data grow larger and more diverse, and systems reach capacity, IT professionals often archive data that is deemed of lesser value or that is older than some arbitrary time period. Some organizations even delete information if it isn't touched within a certain timeframe. Data that migrates offline to an archive does the business no good and may even do harm. If, for example, analysts are trying to find patterns in historic data but can't see the whole picture because some of the information is offline, their analyses will suffer. Archiving or deleting data reduces the return on the investment made in collecting it.
[Figure: Traditional Architecture. Labels recoverable from the diagram include: Data Sources (Structured, Unstructured), Ingest, Operational Data Store, Storage #1, Storage #2, Storage N, Archive, Process, ETL, Load, Enterprise Data Warehouse, Serve, Applications, BI System, Modeling, Reporting. Numbered callouts 1, 2, and 3 appear in the diagram.]

Deploying Cloudera's enterprise data hub as an ODS addresses these challenges.
How Cloudera Can Help

Cloudera's implementation of an enterprise data hub (EDH), powered by Apache Hadoop, can store unlimited data, cost-effectively and reliably, for as long as you need, and lets users access that data in a variety of ways. Data can be collected, stored, processed, explored, modeled, and served in one secure, unified platform.

[Figure: Modern Architecture. Labels recoverable from the diagram include: Data Sources (Structured, Unstructured), Ingest, Operational Data Store, EDH, ETL, Archive, Load, Enterprise Data Warehouse, Serve, Applications, BI System, Modeling, Reporting.]

When organizations deploy an EDH as an ODS, they can provide a flexible, agile data staging environment that supports a fast-paced business. When evaluating an ODS solution for your organization, consider these high-level capabilities:

Scalable storage/ingest

The system should accommodate current and future data needs as users and applications require larger volumes of diverse data. Make sure the system enables growth, agility, and speed as business users continue to ask for new data sources and dimensions.

An EDH leverages the Hadoop Distributed File System (HDFS), a fault-tolerant and self-healing distributed file system that stores data in full fidelity with no predefined schema required, is optimized for high-bandwidth streaming, and scales to proven deployments of 100 PB and beyond. On top of HDFS, Cloudera runs Apache HBase, a distributed, scalable NoSQL database.

Distributed by design to leverage the cost-effective capabilities of commodity hardware, HDFS allows you to store any volume of diverse data (including complete data sets) and integrates with existing systems such as ETL tools, relational databases, NoSQL databases, and EDWs. This lets you offload outdated data from these systems to Cloudera to optimize system performance while keeping all data online.
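Storing data "with no predefined schema required" means that structure is applied when the data is read, not when it is written. The following is a minimal sketch of that idea in plain Python. It does not use HDFS, HBase, or any Cloudera API; the sample records, the `ingest` and `read_with_schema` helpers, and the field names are all hypothetical and exist only for illustration.

```python
import json

# Hypothetical raw records as they might arrive in a staging area.
# They are stored exactly as received, including the malformed one.
RAW_LINES = [
    '{"user": "alice", "amount": "19.99", "channel": "web"}',
    '{"user": "bob", "amount": "5", "referrer": "email"}',
    'not valid json at all',
]


def ingest(lines):
    """Keep every line as-is; no schema check happens at write time."""
    return list(lines)


def read_with_schema(stored, schema):
    """Apply a schema at read time.

    `schema` maps a field name to a converter function. Lines this reader
    cannot parse are skipped here but remain in storage, so a different
    reader can still interpret them later.
    """
    rows = []
    for line in stored:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(record, dict):
            continue
        row = {}
        for field, convert in schema.items():
            value = record.get(field)
            # Missing fields become None instead of rejecting the record.
            row[field] = convert(value) if value is not None else None
        rows.append(row)
    return rows


stored = ingest(RAW_LINES)

# Two different "schemas" read the same stored data without re-ingesting it.
purchases = read_with_schema(stored, {"user": str, "amount": float})
channels = read_with_schema(stored, {"user": str, "channel": str})
```

The design point is that `stored` never changes: a new question only needs a new schema passed to the reader, not a new load from the source system.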
Data Modeling

Determining the best data model requires the ability to discover net-new data and data patterns in large data sets in order to implement large-scale, repeatable processing workflows.
SCHEMA-ON-READ (SOR) VS. SCHEMA-ON-WRITE (SOW)

SOW: This is the traditional database schema, with static tables for structured data that must conform in order to be loaded. If the data doesn't fit the schema, it is not ingested.

SOR: Schema-on-read allows organizations to ingest complete structured and unstructured data sets. This model is built for large scans, so you can move through the data quickly to find what you are looking for. It also allows for transformations on the fly without having to request more data from the source system.

Cloudera Impala is a fully integrated, state-of-the-art analytic database that collects and ingests any data type or volume of data, in full fidelity. Impala allows analysts to discover new patterns in new data to facilitate large-scale processing.

Parallel processing

It is critical that organizations process data efficiently so that applications don't experience latency and end users can rely on the most current information in their analyses. By offloading heavy workloads to an EDH for parallel processing, you can significantly reduce processing time on large volumes of data, from days to hours.

Cloudera's implementation of an EDH employs MapReduce and Apache Spark to divide workloads into multiple tasks that can be executed in parallel. With MapReduce, storage and computation coexist on the same physical nodes in the cluster, so data doesn't need to travel to the compute location for execution. This data proximity allows MapReduce to process exceedingly large amounts of data unencumbered by traditional bottlenecks like network bandwidth. Apache Spark, an open source, parallel data processing framework, makes it easier to develop stream and batch processing pipelines with less code while delivering results 10 to 100 times faster than MapReduce.

Comprehensive security

As more data flows through the system, the chance that sensitive information will be uploaded increases.
If data is not protected, enterprises expose themselves to risk. Cloudera's enterprise data hub is the only solution with a comprehensive security package. This includes complete governance and data protection, with integrated authentication, authorization, encryption, key management, audit, and lineage, allowing you to track data and manage user interactions. Cloudera's governance capability not only ensures complete data security, it also provides compliance-readiness for regulated industries.

Leveraging industry-standard Kerberos, LDAP/AD, and SAML, Cloudera Navigator Encrypt provides strong but manageable authentication. Navigator Key Trustee, a virtual safe-deposit box, offers robust key management policies that prevent cloud and operating system administrators, hackers, and other unauthorized personnel from accessing cryptographic keys and sensitive data.

Production-ready administration

Monitoring an EDH and keeping mission-critical workloads operating at peak performance is crucial. You need to guarantee smooth operation as your EDH grows, but without proper management tools in place, you risk exposing your enterprise to downtime.

With Cloudera Manager, the industry's first and most sophisticated management application for Apache Hadoop and the EDH, you can easily deploy, manage, monitor, and perform diagnostics on your Hadoop cluster. Cloudera Manager provides mission-critical enterprise features like backup/disaster recovery, proactive support, comprehensive security, and zero-downtime upgrades. The application automates the installation process, reducing deployment time from weeks to minutes; gives you a cluster-wide, real-time view of the nodes and services running; provides a central console to enact configuration changes across your cluster; and incorporates a full range of reporting and diagnostic tools to help you optimize performance and utilization through heatmaps, proactive health checks, and alerts (via existing enterprise monitoring tools such as SNMP, SMTP, and a comprehensive API).

ETL vendor integration

The system should protect your investment in existing extract, transform, and load (ETL) tools. Organizations that have invested in ETL vendors should be able to plug them directly into the new ODS. Cloudera lets you keep your existing tools while you deploy this new solution. If your organization has invested in Informatica, Pentaho, Syncsort, or some other ETL solution, Cloudera integrates with it, protecting your investment while allowing you to grow and retain the talent and skills you have on staff.

The Modern Operational Data Store

Compared to the traditional ODS, the Cloudera ODS solution streamlines data pipelines, limits data movement, scales data storage capacity, and accelerates time to data access through batch and stream parallel processing.

Deploying Cloudera's EDH as an ODS allows your architecture to expand as your business demands faster access to larger volumes of diverse data. Because the platform scales, there is never a reason to archive or delete data. Historic data can remain on the platform in full fidelity, giving organizations the agility to provide the right data to end users faster.

When making system decisions as an enterprise architect or IT professional, you must consider your business's immediate needs as well as its future needs. Deploying an EDH as an ODS provides a flexible, scalable platform that not only complements your current data investments but also grows with your business.

Customer Spotlight

Company Overview: Experian Marketing Services (EMS) helps marketers connect with customers through relevant communications across a variety of channels, driven by advanced analytics on an extensive database of geographic, demographic, and lifestyle data.

Challenge: Data volumes were slowing down applications, forcing end users to make decisions based on stale data. Traditional systems could not process omni-channel data fast enough, limiting customers to monthly reports.

Solution: Cloudera provided Experian with a landing zone where they could process and store large volumes of disparate data at scale. Plugging Cloudera into their existing ecosystem complemented their existing work and allowed them to meet their performance goals.

Benefit:
- Process data 50X faster
- Increase consumer report frequency from monthly to weekly
- Process 28K records per second
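The parallel processing model described earlier, in which a workload is divided into tasks that run in parallel and whose results are then combined, can be illustrated with a small sketch. This is plain Python using a local thread pool, not Hadoop MapReduce or Apache Spark, and it runs on one machine rather than a cluster. The event data and every function name here are hypothetical and chosen only to show the map and reduce steps.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(records, chunk_size):
    """Divide the workload into independent chunks, one per task."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


def map_chunk(chunk):
    """Map task: count events per channel within a single chunk."""
    return Counter(channel for channel, _event in chunk)


def reduce_counts(partials):
    """Reduce step: merge the partial counts from all map tasks."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def count_by_channel(records, chunk_size=2, workers=4):
    """Run the map tasks on a pool of workers, then reduce the results."""
    chunks = split_into_chunks(records, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_chunk, chunks))
    return reduce_counts(partials)


# Hypothetical omni-channel events as (channel, event) pairs.
EVENTS = [
    ("web", "view"),
    ("email", "open"),
    ("web", "click"),
    ("store", "purchase"),
    ("web", "view"),
]

totals = count_by_channel(EVENTS)
```

Because each map task touches only its own chunk, the tasks do not need to coordinate with one another; a distributed framework exploits the same property to run them on the nodes that already hold the data.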
About Cloudera

Cloudera is revolutionizing enterprise data management by offering the first unified Platform for Big Data, an enterprise data hub built on Apache Hadoop. Cloudera offers enterprises one place to store, access, process, secure, and analyze all their data, empowering them to extend the value of existing investments while enabling fundamental new ways to derive value from their data. Cloudera's open source Big Data platform is the most widely adopted in the world, and Cloudera is the most prolific contributor to the open source Hadoop ecosystem. As the leading educator of Hadoop professionals, Cloudera has trained over 22,000 individuals worldwide. Over 1,400 partners and a seasoned professional services team help deliver faster time to value. Finally, only Cloudera provides proactive and predictive support to run an enterprise data hub with confidence. Leading organizations in every industry, plus top public sector organizations globally, run Cloudera in production.

www.cloudera.com
1-888-789-1488 or 1-650-362-0488
Cloudera, Inc. 1001 Page Mill Road, Palo Alto, CA 94304, USA

© 2015 Cloudera, Inc. All rights reserved. Cloudera and the Cloudera logo are trademarks or registered trademarks of Cloudera Inc. in the USA and other countries. All other trademarks are the property of their respective companies. Information is subject to change without notice.