An Enterprise Data Hub, the Next Gen Operational Data Store



Table of Contents

- Summary
- The ODS in Practice
- Drawbacks of the ODS Today
- The Case for ODS on an EDH
- Conclusion
- About the Author

Summary

Enterprise computing platforms are being redesigned to take advantage of the economics, scalability and performance of distributed computing systems such as Apache Hadoop. Most systems in place today rely on relational databases that are brittle, expensive and limited in scale and performance (though, within expectations, they are considered useful and important). As organizations begin migrating to Hadoop, or augmenting existing systems with it, many questions remain unanswered about the nature of applications and data management going forward. This is especially true when examining systems that evolved from an era of scarcity, such as data warehouses and Operational Data Stores (ODS), which were developed to provide reporting and analysis separate from the primary operational systems. The economics of enterprise computing have substantially changed since then.

Systems designed for reporting and business intelligence have historically been constrained by the cost and complexity of resources and methods. This "managing from scarcity" produced some ingenious workarounds, but today, with scarcity mostly a thing of the past, many of those approaches are due for review. One in particular is the ODS. A popular architecture for achieving reasonable reporting and analysis performance was the Corporate Information Factory (CIF) 1, a complex design that separated data, data flows, governance and metadata in order to provide acceptable performance; the ODS was an integral part of it. In a typical CIF there may be as many as 20 databases and more than 20 schemas, each serving its own purpose. Today, most of these separate components can be collapsed into an enterprise data hub (EDH). As the diagram below shows, the CIF was a complicated set of structures and processes with a great deal of physical data movement between components.
If designed well, it delivered adequate load processing and acceptable query processing, but the relentless rate of change in organizations added a large burden to efforts to keep it running.

Figure 1: The Corporate Information Factory (Inmon & Imhoff)

1. Mastering Data Warehouse Design, Claudia Imhoff, Nicholas Galemmo, Jonathan G. Geiger. ISBN 978-0-471-32421-8

The ODS in Practice

The ODS was developed 2 as a means to provide access to data from live operational systems without disturbing the operational processing itself, and to overcome limitations of data warehouses, particularly their slow batch loading and limited scaling potential. At first glance the ODS looks like a good candidate to reside in an enterprise data hub, but there are some subtleties that need to be addressed.

There are conflicting definitions for the ODS, but the original specification called for a subject-oriented, current-valued, integrated and detailed design. One common misconception is that the ODS was merely a staging area for further refinement of data for a data warehouse. Another is that it is part of a data warehouse, or the data warehouse itself; it is not, and it serves a different purpose. Taking each property in turn:

- Subject oriented: An ODS is not a dumping ground for all sorts of data. Each is designed around a major set of data such as CUSTOMER or PRODUCT, not a business process such as Sales, Replenishment or Yield.
- Current valued: This distinguishes an ODS from a data warehouse: it contains only the current period, however that is defined (day, week, etc.). It does not retain history the way a data warehouse does.
- Integrated: Even if the ODS covers only a single subject, that data may reside in multiple source systems, and the separate feeds must be integrated to give a coherent view. Without the historical component, this is a much easier task than in a data warehouse.
- Detailed: As a result of managing from scarcity, the most detailed data was often brought into the ODS, while more summarized data flowed to the data warehouse. The data warehouse typically had multiple integrated subject areas, a much longer historical perspective, and a multi-level physical schema to support various activities requiring indexing, aggregation and duplication of data.
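The "integrated, current-valued" idea above can be sketched in miniature: two source-system feeds for the same CUSTOMER subject are merged into a single current-period view, with no history retained. All field and source names below are hypothetical, for illustration only.

```python
# Hypothetical sketch: integrating two source feeds into a current-valued
# CUSTOMER view. Source systems and field names are illustrative only.

def integrate_customer_feeds(crm_feed, billing_feed):
    """Merge per-customer records from two systems into one current view.

    Only the current period is kept -- an update simply overwrites the
    prior value, unlike a data warehouse, which would retain history.
    """
    ods = {}
    for rec in crm_feed:                      # e.g. name/address from CRM
        ods.setdefault(rec["customer_id"], {}).update(rec)
    for rec in billing_feed:                  # e.g. balance from billing
        ods.setdefault(rec["customer_id"], {}).update(rec)
    return ods

crm = [{"customer_id": 1, "name": "Acme Corp"}]
billing = [{"customer_id": 1, "balance": 1250.0},
           {"customer_id": 2, "balance": 80.0}]

view = integrate_customer_feeds(crm, billing)
# Customer 1 now has a single, integrated record drawn from both systems.
```

Note that the merge is trivially simple precisely because there is no historical component to reconcile, which is the point made above.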
The ODS can be quite large, but it is more lightweight than a data warehouse. An ODS is a useful solution when certain situations prevail:

- There is a need for detailed internal data, such as transactions from operational systems, at a level of detail finer than the data warehouse provides (data warehouse size is usually limited by both cost and performance).
- There is a requirement for timely reporting of operations, especially if it requires integrating data from more than one system.
- The volume of data is large, a measure that depends on the economics of the existing technology.
- Updates are near-real-time, so data can be used almost immediately, though timing differences from data trickling in have to be dealt with in the reporting application. There is no provision for versioning of data.

A common application for the ODS was to support live access to integrated customer service data when the operational systems lacked the functionality to do so, but the full breadth of the data warehouse was unnecessary and its performance was inadequate.

2. The earliest book on the subject is Building the Operational Data Store, W. H. Inmon, Claudia Imhoff and Greg Battas (October 27, 1995).
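The point about data "trickling in" deserves a concrete illustration. One common tactic in the reporting application is a watermark: only include records whose event time is at or before the latest point for which all sources have reported. This is a hypothetical sketch, not a prescription from the original ODS literature; record and source names are invented.

```python
# Hypothetical sketch: near-real-time data trickles in, so a report must
# decide which records are complete enough to count. A watermark is the
# slowest source's high-water mark; anything later may still be arriving.

def complete_as_of(records, source_high_marks):
    """Return records no later than the slowest source's high-water mark."""
    watermark = min(source_high_marks.values())
    return [r for r in records if r["event_time"] <= watermark]

records = [
    {"id": "a", "event_time": 10, "amount": 5},
    {"id": "b", "event_time": 25, "amount": 7},   # beyond the watermark
]
# Source B has only reported up to t=20, so nothing after t=20 is stable yet.
high_marks = {"source_a": 30, "source_b": 20}

stable = complete_as_of(records, high_marks)   # only the t=10 record
```

Record "b" is not lost; it simply waits until source B's high-water mark passes t=25, at which point a rerun of the report includes it.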

Drawbacks of the ODS Today

Because ODSs were typically built on the predominant relational database technologies and platforms, an ODS was an expensive proposition in terms of hardware, proprietary software complexity, and labor. In addition, the ODS was integrated, meaning data from various sources had to be blended and cleansed, so despite its proposed role, overcoming latency remained a difficult and ongoing challenge.

Reporting, data discovery and analytics performance of the ODS depended on the physical design and the workload. An ODS always carried a number of schemas, such as a transactional 3NF design for ingesting data quickly and a dimensional schema for reasonably good query performance, and all of its structures were carefully designed and configured around assumptions about usage patterns. Any modification to the schema meant effort to modify the upstream and downstream processes as well.

Some relational database technologies could scale to meet the demands of the ODS and data warehouse using massively parallel processing, but many popular offerings could not. A major limitation of all relational database technologies is that the query parser, optimizer and compute layers cannot be separated, which limits their scalability. As a result, all data structures were carefully designed using parsimonious techniques to limit scale and usage as much as possible. All were quite costly. This is not the case with Hadoop.

The Case for ODS on an EDH

As organizations become aware of the value of Big Data, with data flowing from a myriad of internal and external sources in a variety of formats that a relational database cannot process, the relational-based ODS becomes untenable. Once the managing-from-scarcity element is removed from ODS design (for reporting, business intelligence and analytics), many of the limitations of the relational ODS no longer exist.
Scale and latency constraints go away, so there is no need to physically separate subject areas into different ODSs, nor to flush history. However, the ODS and the EDH remain separate concepts: while ODS data may reside in the EDH, ODS processing is only one part of the EDH's portfolio. The EDH, built on Hadoop technology (and economics), becomes the obvious choice for the ODS because of:

- Economics: The Cloudera platform is cost-efficient versus existing relational-based architectures.
- Scale: Hadoop can scale to handle enormous volumes of data and concurrent work.
- Unlimited choices: XXX
- Performance: The EDH's data, with its associated tools for performance, security and scalability, allows for far fewer data structures and far less maintenance of physical optimizations such as aggregation and indexing. In other words, the same physical copy of the data can support many virtual operations.
- Avoidance of design (schema): The concept of schema on read reduces the need to design and maintain structures for performance and latency, as the process of parsing queries, optimizing them and presenting results can be separated into their logical locations.
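The schema-on-read idea, and the claim that one physical copy of the data can support many virtual operations, can be sketched in miniature. The record layout and field names below are hypothetical; in an EDH the raw lines would live in HDFS and the "views" would be tables in a SQL-on-Hadoop engine, but the principle is the same.

```python
import json

# Hypothetical sketch of schema on read: raw records land once, untyped,
# and each consumer applies its own schema at read time. No table had to
# be designed before the data arrived.

raw_lines = [
    '{"customer_id": 1, "event": "order", "amount": "19.99", "region": "EU"}',
    '{"customer_id": 2, "event": "order", "amount": "5.00", "region": "US"}',
]

def read_orders(lines):
    """One virtual view: typed order facts for reporting."""
    for line in lines:
        rec = json.loads(line)
        yield {"customer_id": int(rec["customer_id"]),
               "amount": float(rec["amount"])}

def read_regions(lines):
    """A second virtual view over the SAME physical data: region rollup."""
    counts = {}
    for line in lines:
        region = json.loads(line)["region"]
        counts[region] = counts.get(region, 0) + 1
    return counts

orders = list(read_orders(raw_lines))     # schema applied at read time
by_region = read_regions(raw_lines)       # different schema, same bytes
```

Adding a third consumer with a different schema requires no change to the stored data and no coordination with the first two, which is exactly the maintenance burden the physical ODS schemas imposed.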

Conclusion

One other thing to consider: the EDH cannot operate as a useful enterprise tool without metadata and governance. Hadoop entered the scene as an unruly tool for singular use by data scientists. Now that it is maturing into a platform for enterprise computing, that unruliness is no longer acceptable. Cloudera provides many tools to facilitate an enterprise solution architecture, with Cloudera Navigator for governance, Impala for HDFS-based relational capabilities, and a growing collection of other tools for security, development and performance.

With an adequate metadata management system, the ODS can be a purely virtual structure. Since an EDH contains all of the data needed for the ODS (and more, of course), the ODS structures and schemas can be strictly virtual. Hadoop has the processing power to present the data on request without layers of integration and physical data movement. All of the physical structures that support existing ODSs take time to maintain; with an EDH they can be replaced by virtual structures, with many applications using the same data.

About the Author

Neil Raden, based in Santa Fe, NM, is an industry analyst, active consultant, widely published author and speaker, and the founder of Hired Brains Research LLC. Hired Brains provides research, advisory and consulting services in Analytics, Big Data, and Decision for clients worldwide. Neil is also the co-author of the Dresner Advisory Services Wisdom of BI series on Advanced and Predictive Analytics. He was a contributing author to one of the first (1995) books on designing data warehouses and is more recently the co-author of Smart (Enough) Systems: How to Deliver Competitive Advantage by Automating Hidden Decisions (Prentice-Hall). He is a contributor to publications such as Wall Street Week, Forbes, InformationWeek and Computerworld.

About Cloudera

Cloudera is revolutionizing enterprise data management by offering the first unified platform for big data: an enterprise data hub built on Apache Hadoop. Cloudera offers enterprises one place to store, access, process, secure, and analyze all their data, empowering them to extend the value of existing investments while enabling fundamental new ways to derive value from their data. Cloudera's open source big data platform is the most widely adopted in the world, and Cloudera is the most prolific contributor to the open source Hadoop ecosystem. As the leading educator of Hadoop professionals, Cloudera has trained over 22,000 individuals worldwide. Over 1,400 partners and a seasoned professional services team help deliver greater time to value. Finally, only Cloudera provides proactive and predictive support to run an enterprise data hub with confidence. Leading organizations in every industry, plus top public sector organizations globally, run Cloudera in production.

For additional information, please visit us at: 1-888-789-1488 or 1-650-362-0488

Cloudera, Inc. 1001 Page Mill Road, Palo Alto, CA 94304, USA

2015 Cloudera, Inc. All rights reserved. Cloudera and the Cloudera logo are trademarks or registered trademarks of Cloudera Inc. in the USA and other countries. All other trademarks are the property of their respective companies. Information is subject to change without notice.