An Enterprise Data Hub, the Next Gen Operational Data Store Version: 101
Table of Contents
Summary
The ODS in Practice
Drawbacks of the ODS Today
The Case for ODS on an EDH
Conclusion
About the Author
Summary

Enterprise computing platforms are being redesigned to make use of the economics, scalability and performance of distributed computing systems such as Apache Hadoop. Most systems in place today rely on relational databases that are brittle, expensive and limited in scale and performance (though, within those limits, they are considered useful and important). As organizations begin migrating to Hadoop or augmenting existing systems with it, there are many unanswered questions about the nature of applications and data management going forward. This is especially true of systems that evolved from an era of scarcity, such as data warehouses and Operational Data Stores (ODS). These systems were developed to provide reporting and analysis separate from the primary operational systems.

The economics of enterprise computing have substantially changed. One popular arrangement from the era of scarcity was the Corporate Information Factory (CIF) [1], a complex design that separated data, data flows, governance and metadata in order to provide acceptable performance. In a typical CIF, there may be as many as 20 databases and more than 20 schemas for each purpose. Today, most of these separate components can be collapsed into an enterprise data hub (EDH).

Systems designed for reporting and business intelligence have historically been constrained by the cost and complexity of resources and methods. This managing from scarcity produced some ingenious workarounds, but today, with scarcity mostly a thing of the past, many of these approaches are in need of review. One in particular is the ODS, which was an integral part of the CIF. As Figure 1 shows, the CIF was a complicated set of structures and processes with a great deal of physical data movement between the components.
If designed well, it delivered adequate load processing and acceptable query processing, but the relentless rate of change in organizations added a large burden to efforts to keep it running.

[Figure 1: The Corporate Information Factory (Inmon & Imhoff). Labeled components include: External, ERP, Internal and Other Operational Systems; Data Acquisition; Data Warehouse; Operational Data Store; Data Delivery; Exploration Warehouse; Data Mining Warehouse; OLAP Data Mart; Oper Mart; Meta Data; Operation & Administration; Information Workshop (Library & Toolbox, Workbench); Information Feedback.]

[1] Mastering Data Warehouse Design, Claudia Imhoff, Nicholas Galemmo, Jonathan G. Geiger. ISBN: 978-0-471-32421-8
The ODS in Practice

The ODS was developed [2] as a means to provide access to data from live operational systems without disturbing the processing of the operations themselves, and to overcome limitations of data warehouses, particularly their slow batch loading and limited scaling potential. At first glance it is reasonable to expect that the ODS is a good candidate to reside in an enterprise data hub, but there are some subtleties that need to be addressed.

There are conflicting definitions of the ODS, but the original specification called for a subject-oriented, current-valued, integrated and detailed design. One common misconception about the ODS was that it was merely a staging area for further refinement of data for a data warehouse. Another was that it was part of a data warehouse, or the data warehouse itself; it is neither, and it serves a different purpose. For example:

Subject oriented: An ODS is not a dumping ground for all sorts of data. Each is designed for a major set of data such as CUSTOMER or PRODUCT, not for a business process such as Sales, Replenishment or Yield.
Current valued: This distinguishes an ODS from a data warehouse. An ODS contains only the current period, however that is defined (day, week, etc.). It does not retain history as a data warehouse does.
Integrated: Even if the ODS contains data about a single subject, that data may reside in multiple source systems, and the separate feeds must be integrated to give a coherent view. Keep in mind that, without the historical component, this is a much easier task than it is for a data warehouse.
Detailed: As a result of managing from scarcity, the most detailed data was often brought into the ODS, while more summarized data flowed to the data warehouse.

The data warehouse typically had multiple, integrated subject areas, a much longer historical perspective and a multi-level physical schema to support various activities requiring indexing, aggregation and duplication of data.
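The "integrated" and "current valued" properties can be illustrated with a minimal sketch. It is not taken from any ODS product: the record type, the source names ("crm", "billing") and the last-update-wins merge rule are all assumptions made for illustration. It merges detailed feed records about one subject (CUSTOMER) into a single view that keeps only the latest value of each attribute and retains no history.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class FeedRecord:
    """One detailed record from a hypothetical source-system feed."""
    customer_id: str
    source: str            # illustrative source-system name, e.g. "crm"
    updated_at: datetime
    attributes: dict


def integrate_current(records):
    """Build a current-valued CUSTOMER view from several feeds.

    Attributes from all sources are merged per customer. When two records
    supply the same attribute, the most recently updated value wins
    (an assumed rule). Only the latest value is kept; no history.
    """
    current = {}
    # Apply records oldest first so that newer values overwrite older ones.
    for rec in sorted(records, key=lambda r: r.updated_at):
        view = current.setdefault(rec.customer_id, {})
        view.update(rec.attributes)
    return current


# Hypothetical feed data from two source systems.
feeds = [
    FeedRecord("C1", "crm", datetime(2015, 3, 1, 9, 0),
               {"name": "Acme", "tier": "silver"}),
    FeedRecord("C1", "billing", datetime(2015, 3, 1, 9, 5),
               {"balance": 120.0}),
    FeedRecord("C1", "crm", datetime(2015, 3, 1, 9, 10),
               {"tier": "gold"}),
]
customer_ods = integrate_current(feeds)
```

Because earlier values are simply overwritten, the integration step stays small. A data warehouse would additionally have to keep and reconcile every historical version.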
The ODS can be quite large, but it can be considered more lightweight than a data warehouse. The ODS is a useful solution when certain situations prevail:

There is a need for access to detailed internal data, such as transactions from operational systems, at a level of detail finer than the data warehouse provides (because data warehouse size is usually limited by both cost and performance).
There is a requirement for timely reporting of operations, especially if it requires integration of data from more than one system.
The volume of data is large, a measure that depends on the economics of the existing technology.
The data is updated in near real time. This means that data can be used almost immediately, though timing differences from data trickling in have to be dealt with in the reporting application.
There is no provision for versioning of data.

A common application for the ODS was to support live access to integrated customer service data when the operational systems lacked the functionality to do so, but the breadth of information in the data warehouse was not necessary and its performance was not adequate.

[2] The earliest book on the subject is Building the Operational Data Store, October 27, 1995, by W. H. Inmon, Claudia Imhoff and Greg Battas.
Drawbacks of the ODS Today

Since ODSs were typically built using the predominant relational database technologies and platforms, the ODS was an expensive proposition in terms of hardware profiles, proprietary software complexities and labor. In addition, the ODS was integrated, meaning the data from various sources had to be blended and cleansed; so despite its proposed role, overcoming latency was still a difficult and ongoing challenge.

The performance of the ODS for reporting, data discovery and analytics depended on the physical design and the workload. There were always a number of schemas in the ODS, such as a transactional 3NF design for ingesting data quickly and a dimensional schema for providing reasonably good query performance. All structures in the ODS were highly designed and configured based on assumed usage patterns, and any modification to a schema generated effort to modify the upstream and downstream processes.

Some relational database technologies were able to scale to meet the demands of the ODS and data warehouse using massively parallel processing, but many popular offerings could not. A major limitation of all relational database technologies is that the query parser, optimizer and compute layers cannot be separated, which limited their scalability. As a result, all data structures were carefully designed using parsimonious techniques to limit scale and usage as much as possible. All were quite costly. This is not the case with Hadoop.

The Case for ODS on an EDH

As organizations become aware of the value of Big Data (data flowing from a myriad of internal and external sources, in a variety of formats that a relational database cannot process), the relational-based ODS becomes untenable. Once the managing-from-scarcity element is eliminated from ODS design (reporting, business intelligence and analytics), many of the limitations of the relational ODS no longer exist.
Limits of scale and latency go away, so there is no need to physically separate subject areas into different ODSs. Nor is there a need to flush history. However, the ODS and the EDH remain separate concepts. While ODS data may reside in the EDH, ODS processing is only one part of the portfolio of the EDH. The EDH, built on Hadoop technology (and economics), becomes the obvious choice for the ODS because:

Economics: The Cloudera platform is cost-efficient versus existing relational-based architectures.
Scale: Hadoop can scale to handle enormous volumes of data and concurrent work.
Unlimited choices: XXX
Performance: The EDH, with its associated tools for performance, security and scalability, allows for far fewer data structures and far less maintenance of physical optimizations such as aggregation and indexing. In other words, the same physical copy of the data can support many virtual operations.
Avoidance of design (schema): Schema on read reduces the need to design and maintain structures for performance and latency, as the processes of parsing queries, optimizing them and presenting results can be separated and placed where each logically fits.
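The ideas of schema on read and of one physical copy serving many virtual operations can be sketched in a few lines. This is plain Python rather than any Hadoop or Impala API, and the sample events, field names and function names are hypothetical. The raw data is stored once, untyped. Each consumer supplies its own schema at read time, so a current-period view and a full-history view share the same records.

```python
import json

# Raw events stored once, exactly as received (hypothetical sample data).
RAW_EVENTS = [
    '{"customer": "C1", "type": "order", "amount": "40.50", "day": "2015-03-01"}',
    '{"customer": "C1", "type": "order", "amount": "10.00", "day": "2015-03-02"}',
    '{"customer": "C2", "type": "call", "day": "2015-03-02"}',
]


def read_with_schema(raw_lines, schema):
    """Apply a schema at read time: select fields and cast their types.

    Records that lack a field required by this schema are skipped for
    this reader only; nothing is rejected or reshaped when data is stored.
    """
    for line in raw_lines:
        record = json.loads(line)
        if all(field in record for field in schema):
            yield {field: cast(record[field]) for field, cast in schema.items()}


def current_orders(raw_lines, day):
    """Virtual current-valued view: orders for one period only."""
    schema = {"customer": str, "type": str, "amount": float, "day": str}
    return [row for row in read_with_schema(raw_lines, schema)
            if row["type"] == "order" and row["day"] == day]


def order_totals(raw_lines):
    """Virtual historical view: order totals over all retained data."""
    schema = {"customer": str, "type": str, "amount": float}
    totals = {}
    for row in read_with_schema(raw_lines, schema):
        if row["type"] == "order":
            totals[row["customer"]] = totals.get(row["customer"], 0.0) + row["amount"]
    return totals


todays_orders = current_orders(RAW_EVENTS, "2015-03-02")
lifetime_totals = order_totals(RAW_EVENTS)
```

Neither view copies or restructures the stored events. Adding a third consumer means writing another read-time schema, not another physical structure.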
Conclusion

One other thing to consider: the EDH can't operate as a useful enterprise tool without metadata and governance. Hadoop entered the scene as an unruly tool for singular use by data scientists. Now that it is maturing into a platform for enterprise computing, that unruliness is no longer acceptable. Cloudera provides many tools to facilitate an enterprise solution architecture, including Cloudera Navigator (for governance), Impala (for HDFS-based relational capabilities) and a growing collection of other tools for security, development and performance.

With an adequate metadata management system, the ODS can be a purely virtual structure. Because an EDH contains all of the data needed for the ODS (and more, of course), the ODS structures and schemas need not exist physically. Hadoop has the processing power to present the data on request without layers of integration and physical data movement. All of the physical structures that support existing ODSs take time to maintain; with an EDH they can be replaced by virtual structures, with many applications using the same data.

About the Author

Neil Raden, based in Santa Fe, NM, is an industry analyst and active consultant, widely published author and speaker, and the founder of Hired Brains Research LLC, http://www.hiredbrains.com. Hired Brains provides research, advisory and consulting services in Analytics, Big Data, and Decision for clients worldwide. Neil is also the co-author of the Dresner Advisory Services Wisdom of BI series on Advanced and Predictive Analytics. Neil was a contributing author to one of the first (1995) books on designing data warehouses, and he is more recently the co-author of Smart (Enough) Systems: How to Deliver Competitive Advantage by Automating Hidden Decisions, Prentice-Hall. He is a contributor to publications such as Wall Street Week, Forbes, Information Week and ComputerWorld. He welcomes your comments at firstname.lastname@example.org or at his blog at http://hiredbrains.wordpress.com
About Cloudera

Cloudera is revolutionizing enterprise data management by offering the first unified platform for big data, an enterprise data hub built on Apache Hadoop. Cloudera offers enterprises one place to store, access, process, secure, and analyze all their data, empowering them to extend the value of existing investments while enabling fundamental new ways to derive value from their data. Cloudera's open source big data platform is the most widely adopted in the world, and Cloudera is the most prolific contributor to the open source Hadoop ecosystem. As the leading educator of Hadoop professionals, Cloudera has trained over 22,000 individuals worldwide. Over 1,400 partners and a seasoned professional services team help deliver greater time to value. Finally, only Cloudera provides proactive and predictive support to run an enterprise data hub with confidence. Leading organizations in every industry plus top public sector organizations globally run Cloudera in production.

For additional information, please visit us at www.cloudera.com.

cloudera.com
1-888-789-1488 or 1-650-362-0488
Cloudera, Inc. 1001 Page Mill Road, Palo Alto, CA 94304, USA

© 2015 Cloudera, Inc. All rights reserved. Cloudera and the Cloudera logo are trademarks or registered trademarks of Cloudera Inc. in the USA and other countries. All other trademarks are the property of their respective companies. Information is subject to change without notice.