VALITY TECHNOLOGY INCORPORATED
Integrity in All Your Information

Enterprise Intelligence - Enabling High Quality in the Data Warehouse/DSS Environment

by Bill Inmon

WPS.INM.E.399.1.e
Introduction

In a few short years, data warehousing has passed from theory to conventional wisdom. In the explosive growth that has transpired, a body of thought has developed around it. From the beginning, data warehousing was never a theoretical exercise; it has always been rooted in pragmatism. But given the breathtaking growth that has been the lot of data warehousing, an organized, thorough intellectual framework has inevitably begun to grow around both its infrastructure and its rationale.

There are many aspects to this intellectual framework. One of the important considerations, critical to the infrastructure, is the quality of the data that courses through the veins of the components of the warehouse. Indeed, quality in many different forms is one of the cornerstones of data warehousing. If the data warehouse is ever to achieve the lofty goal of becoming a foundation for enterprise intelligence, data quality must become a reality. It is simply unthinkable that analysis for important corporate decisions should proceed on the basis of incorrect and incomplete data. Therefore, a de facto prerequisite for enterprise intelligence is quality throughout the data warehouse environment.

The corporate information factory

Before there can be a discussion of the quality of data in the data warehouse/DSS environment, there needs to be a discussion of the structure of the data warehouse environment and of its infrastructure. The data warehouse has grown from a database separate and apart from transaction processing into a sophisticated structure known as the "corporate information factory." Figure 1 (on the following page) depicts the corporate information factory.

The genesis of data in the corporate information factory is the application environment. Here, detailed data is gathered, audited, transacted, and stored. The application is written for specific requirements.
The essence of the application environment is transactions, which typically execute very quickly and operate on small amounts of data. Once data is gathered into the application environment, it is passed through a layer of programs called the "integration and transformation" layer. These programs integrate and convert the application data into a corporate format. The integration and transformation programs that a corporation writes usually represent its largest expense and effort in developing the data warehouse.

Once data passes through the integration and transformation layer, it heads in one of two directions: to the data warehouse or to the ODS (operational data store). Data that heads to the ODS enters an environment that is a hybrid DSS/operational structure. The ODS is a place where it is possible to achieve high-performance OLTP response time and, at the same time, to access and analyze integrated data and, on occasion, to do DSS processing. Not all companies have a need for an ODS. But where there is a need for an ODS, a business is served well by having one. Eventually, data that passes into the ODS also passes into the data warehouse.
Figure 1: The corporate information factory. Its components include the applications, the integration/transformation layer, the ODS, the enterprise data warehouse, the data marts, the exploration warehouse, and near line storage.

The data warehouse is then fed integrated data from either the integration and transformation layer or the ODS. The data warehouse is the heart of the DSS infrastructure. It is the place where the integrated, granular data of the corporation resides. It contains historical data, sometimes as much as ten years' worth, depending on the business of the corporation. It represents the single "source of truth" for the data residing in the corporation, and it represents the ultimate basis for reconciling any discrepancies that a corporation might have. There is almost always a large volume of data residing in the data warehouse, and that volume grows at a breathtaking rate.

Data emanates from the data warehouse in many directions. Data marts are created from the granular data found in the data warehouse. Data marts reflect departmental views of the corporation: each data mart selects and shapes the granular data to its own needs. Consequently, the data marts are significantly smaller than the data warehouse. As such, they can take advantage of specialized technology such as multi-dimensional technology and cube technology.

Another extension of the data warehouse is the exploration warehouse. This is built for the explorers of the corporation. By creating a separate facility for explorers, companies avoid disrupting the regular work of the data warehouse. The exploration warehouse environment is best served by technology unique to it.

There is one other important component of the corporate information factory: near line storage. Near line storage exists to house bulk and infrequently used data. It allows the cost of warehousing to be driven down to a relatively small expenditure.
By introducing near line storage into the corporate information factory, the designer is free to take data down to the lowest level of granularity desired.
Issues of quality

What, then, are the issues of quality that arise in creating and operating a corporate information factory? The heart of the corporate information factory is the data warehouse. The first major issue of data quality in the corporate information factory is how to ensure that data arrives in the data warehouse with the highest degree of quality. Figure 2 shows that there are three opportunities for ensuring data quality as data is prepared for loading into the data warehouse. Each of these opportunities has its own considerations. In fact, it is recommended that all three be used together for maximum effectiveness.

Figure 2: The three opportunities for quality in the data warehouse environment. These opportunities are:

1. Cleansing data at the source, the application environment,
2. Cleansing data as it is integrated upon leaving the applications and entering the integration and transformation programs, and
3. Cleansing and auditing data after it has been loaded into the warehouse.

Data cleansing at the application level

At first glance, the application appears to be the most natural place for assuring data quality. Data first enters the corporate information factory when it is captured in the application. Indeed, the cleaner the data at the point of entry, the better off the corporate information factory. One theory says that if the data is perfectly cleansed at the application level, it need not be cleansed elsewhere. Unfortunately, this is not the case at all.
Several complicating factors prevent the application from being the panacea for data quality. The first difficulty is the state of the application itself. In many cases, the application is old and undocumented. Application programmers are legitimately scared to go back into old application code and alter it in any significant way. The fear is that fixing one problem may give rise to two others, setting off a cascade of new problems. The result is that the application is worse off than it was before it was maintained.

The second reason why application developers are loath to go back into old code is that they see no benefit in doing so. Application developers focus on immediate requirements; they see no urgency, or for that matter any motivation, in going back into old code and modifying it to solve someone else's problems. Politics then enters the picture of what is and is not a priority. There is, then, both a motivational and an organizational problem in trying to get changes made at the application level.

But even if you could magically and easily do anything you wished at the application level, you would still need to cleanse data elsewhere in the corporate information factory. The reason why cleansing in the integration and transformation layer is still necessary, even when application data is perfect, is that application data is not integrated. Data may be just fine in the eyes of a single application developer or user, but the data residing in the application still needs to be integrated across the corporate information factory. There is a big difference between cleansing application data and integrating application data. Only AFTER the data comes out of the application is there a need and an opportunity to integrate it. The first opportunity for integration arises as data passes into the integration and transformation layer.
Data cleansing in the integration and transformation layer

Multiple applications pass data into the integration and transformation layer. Each application has its own interpretation of data, as originally specified by the application designer. Keys, attributes, structures, and encoding conventions are all different across the many applications. But in order for the data warehouse to contain integrated data, the many application structures and conventions must be integrated into a single, cohesive set of structures and conventions.

There is, then, a complex task in store for the integration and transformation processing. Not only are keys, structures, and encoding conventions different across the many applications but, in many cases, relationships between data within systems, as well as across systems, go undetected. Legacy information is often buried and floating within free-form text fields such as name and address lines, comment fields, and other fields that have become a storage closet for meanings and relationships not accounted for in the original system. Data relationships may be hidden because the initial systems did not provide a key structure that linked all relevant records; for example, multiple account numbers might obscure the fact that all the records belong to subsidiaries of the same company. Data anomalies in names, addresses, part descriptions, and account codes are another area to rectify. And inconsistencies between field definitions in the metadata and the applications tend to surface over time as the application systems become part of the operational fabric of an organization: commercial names mixed with personal names, addresses with missing information, truncated information, use of special characters as separators, missing values, abbreviations,
etc. These quality issues can be found in a single set of application data, are multiplied when data is integrated across multiple applications, and can put at risk the effectiveness of the resulting data warehouse in delivering enterprise intelligence.

The result of the tedious and difficult integration and transformation processing is integrated data. The process of integrating the many applications together is certainly one form of cleansing data. It is noteworthy that this form of cleansing is not possible until the data has passed out of the application. Therefore, there is a second, separate opportunity for data quality beyond cleansing data in the application. But there is also a third place where data quality needs to be addressed: after the data has been loaded into the data warehouse.

Data quality inside the data warehouse

Suppose that you could create perfect applications and perfect integration and transformation programs. Would you still need a data quality facility within the data warehouse itself? The answer is "yes." First of all, as new application data is added to the data warehouse environment, all the integration and transformation layer issues will be re-addressed, and the new data may also uncover more hidden anomalies and relationships even in the warehouse itself. Another key reason, however, is that the data warehouse contains data collected over a spectrum of time. In some cases, the spectrum is as long as ten years. The problem with data collected over time is that the data itself changes over time. In some cases, the changes are slow and subtle. In other cases, the changes are fast and radical. In any case, it is simply a fact of life that data changes over time. And with these changes comes the need to integrate data over time within the data warehouse after it has already been loaded.
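One common way to integrate coded data over time, sketched below, is a mapping table with effective-date ranges that translates historical codes into the current standard. This is an illustrative approach, not one prescribed by this paper, and every code, date, and name in it is hypothetical.

```python
# Sketch: integrating coded data over time inside the warehouse by
# translating historical codes into the current standard at query time.
# All codes, dates, and mappings below are hypothetical.
from datetime import date

# (old_code, valid_from, valid_to, current_code)
CODE_MAP = [
    ("ACCT-100", date(1995, 1, 1), date(1997, 12, 31), "GL-4000"),
    ("ACCT-200", date(1995, 1, 1), date(1997, 12, 31), "GL-5000"),
]

def to_current_code(code, as_of):
    """Translate a historical code to today's standard; pass through if unmapped."""
    for old, start, stop, new in CODE_MAP:
        if code == old and start <= as_of <= stop:
            return new
    return code
```

With such a table in place, query results drawn from different years can be expressed in a single, current code set instead of mixing old and new conventions.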
Even if data is entered perfectly from the applications and the integration and transformation programs, there will still be a need to examine data quality inside the data warehouse over time. But has the data remained constant over those years? Hardly.

Figure 3: In a data warehouse, data is loaded into the warehouse over time (for example, from 1995 through 1998).
Figure 4 shows some common changes that have occurred over the years.

Figure 4: There are plenty of examples where data undergoes a fundamental change over time. In one, a standard chart of accounts (1995 through 1997) gives way to SAP (1998); in another, local currencies such as the franc, pound, and peseta (1995 through 1997) give way to the euro (1998).

In the case shown in Figure 4, there was a standard chart of accounts until 1998. Then, in 1998, SAP was brought into the corporation, and a new chart of accounts was created. Trying to use the chart of accounts codes from 1996 to 1999 based on data in the warehouse produces very misleading and inaccurate results. As another example, money is measured in the local currency prior to 1998. But in 1999, money is measured in euros. Trying to perform a cash analysis from 1996 to 1999 will be very difficult because the underlying meaning of the data has changed.

Therefore, even if data quality has been perfected elsewhere, it must be perfected one more time after the data enters the warehouse, simply because data ages inside the warehouse.

Referential integrity in the data warehouse environment

There is another form of data quality that deserves mention: the quality of the relationships among types of data inside the data warehouse. This type of data quality has long been known as "referential integrity." As a simple example of referential integrity in the classical operational environment, consider a common relationship between two elements of data, A and B: the parent/child relationship. In this relationship, when data element B relates to data element A in a parent/child manner, if A is deleted, then B is also deleted by the referential integrity facility. And if a user wants to insert data element B, it cannot be inserted unless the data element A that it relates to already exists.
The facility for referential integrity exists in order to ensure that the relationships that have been defined are held intact by the database management system.
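As a minimal sketch of the classical facility just described, the two rules (cascade on delete of the parent, and rejection of a child whose parent does not exist) can be expressed in a few lines. The in-memory "tables" here are hypothetical stand-ins; a real DBMS enforces this declaratively, typically through foreign-key constraints.

```python
# Sketch of classical parent/child referential integrity.
# The in-memory tables are hypothetical stand-ins for DBMS-managed data.

class ReferentialError(Exception):
    """Raised when an operation would violate a defined relationship."""

parents = set()     # keys of data element A instances
children = {}       # child key -> parent key (data element B -> data element A)

def insert_child(child_key, parent_key):
    """B cannot be inserted unless the A it relates to exists."""
    if parent_key not in parents:
        raise ReferentialError("parent %r does not exist" % parent_key)
    children[child_key] = parent_key

def delete_parent(parent_key):
    """Deleting A also deletes every B that relates to it (cascade delete)."""
    parents.discard(parent_key)
    for child_key in [c for c, p in children.items() if p == parent_key]:
        del children[child_key]
```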
Referential integrity applies to the world of data warehouses just as it applies to operational systems. However, referential integrity is implemented quite differently in the data warehouse environment. There are several reasons why:

- The volumes of data in the data warehouse are significantly larger than the volumes of data found in the operational environment.
- Snapshots are created in the data warehouse, whereas updates of data are done in the operational environment.
- Data in the warehouse represents a spectrum of time, while data in the operational environment is usually taken to be current valued - that is, current as of the moment of online access.

For these and other reasons, referential integrity in the data warehouse environment is implemented quite differently than it is in the operational environment. As a simple example of the difference in the way that referential integrity is maintained in the DSS data warehouse environment, consider the parent/child relationship again. In the data warehouse environment, this relationship would be framed by some parameters of time: a START TIME and a STOP TIME. Suppose the relationship between A and B is valid from January 1 to February 15. The data warehouse referential integrity facility would first check the moment in time being considered. If this moment lies outside the dates defined for the relationship, say July 20, then there would be no implication of a relationship between A and B. But if the dates being considered lie between the START TIME and the STOP TIME, say between January 18 and February 2, then the relationship between A and B would be enforced.

Three places for data quality

It is interesting to compare the three places where the quality of data needs to be addressed in the data warehouse environment. In the application arena, there is the need to see that data is entered and recorded correctly.
Data quality standards for applications include ensuring that data is entered correctly and that information is not buried and floating within free-form fields. Clear routines for data defect detection are critical to ensure that misspellings do not result in duplicate customer or product entries, and that relationships between entities, such as subsidiaries or multiple accounts for a single client, are maintained.

In the integration and transformation layer, it is necessary to see that data has been integrated. In most environments this is the most difficult of all data quality audits. Integrating data involves determining relationships across disparate data files with multiple formats, as well as complex matching and consolidation, particularly where there are relationships among non-keyed fields.

And once inside the data warehouse, it is necessary to examine the integration of data over time. In many cases there will be no differences. But where there are differences, there is the question of what to do about the discrepancies.
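The time-bounded referential integrity check described in the earlier section on the warehouse environment can be sketched directly. The relationship table below is illustrative: the paper gives January 1 to February 15 as the validity window, and the year used here is arbitrary.

```python
# Sketch of warehouse-style referential integrity: a relationship between
# data elements A and B is enforced only between its START TIME and STOP TIME.
# The specific keys and dates are illustrative.
from datetime import date

# (parent, child) -> (start_time, stop_time)
relationships = {
    ("A", "B"): (date(1999, 1, 1), date(1999, 2, 15)),
}

def relationship_in_effect(parent, child, moment):
    """Return True only if the moment falls inside the defined time window."""
    window = relationships.get((parent, child))
    if window is None:
        return False  # no relationship defined at all
    start, stop = window
    return start <= moment <= stop
```

A moment of July 20 falls outside the window and implies no relationship between A and B; a moment of January 18 or February 2 falls inside it, and the relationship is enforced.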
As the volume and capacity of the data warehouse grow, massive conversions are not a viable option. Data quality tools such as Vality's INTEGRITY(TM) Data Reengineering Environment use parsing and matching technology to automatically unify and correct disparate formats, creative data entry, spelling and keying anomalies, and undiscovered data inconsistencies. Regular batch runs for cleansing, together with real-time defect detection and correction, can provide a continuing high degree of data integrity.

Figure 5: Quality of data becomes a different issue after it is acquired. The line of demarcation falls between the applications, the integration/transformation layer, the ODS, and the enterprise data warehouse on one side, and the data marts and exploration warehouse on the other.

Analytical data quality

It is one thing to ensure that the data residing in the data warehouse is of the highest quality. It is another thing to say that the data used for analysis is also of the highest quality. There is an important split in the corporate information factory that delimits the difference in the approach to data quality. Figure 5 shows this line of demarcation: there are two divisions in the corporate information factory in relation to data quality. To the left are the application arena, the integration and transformation arena, and the data warehouse itself. In this arena the objective is to cleanse and purify the data as much as possible. But to the right is the data mart and exploration warehouse arena. In this arena there is a choice as to what data is best used for analysis.
In the data mart and exploration warehouse arena, the issue of ensuring quality becomes an issue of ensuring that the right data is being used for analysis. The strongest guarantee that the best data is indeed being used for analysis is the analyst himself or herself. The analyst needs to be sure of what the data means, where it came from, and how fresh it is. He or she needs to understand the data intimately in order to use it most effectively, and he or she is responsible for the interpretation of the data. The best aid the analyst can have is accurate and robust metadata.

Figure 6 shows the metadata that can be very useful to the analyst. The metadata that describes the data residing in the different components of the corporate information factory is varied in content. Typically, it contains descriptors for:

- Table descriptions
- Attribute descriptions
- Sources of data
- Definitions of data
- Relationships of data, and so forth

Figure 6: Upon analysis of the data, metadata becomes the central issue. Metadata (md) accompanies each component of the corporate information factory, including the data marts, the exploration warehouse, and near line storage.
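The descriptors listed above can be pictured as one record in a metadata repository. The structure, field names, and sample values below are purely illustrative, not an actual repository schema:

```python
# Sketch: one metadata record carrying the descriptors an analyst relies on.
# All field names and sample values are hypothetical.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class TableMetadata:
    table: str                          # table description
    attributes: Dict[str, str]          # attribute name -> attribute description
    source: str                         # source of the data
    definition: str                     # business definition of the data
    related_tables: List[str] = field(default_factory=list)  # relationships

customer = TableMetadata(
    table="CUSTOMER: integrated customer records",
    attributes={"cust_id": "surrogate key", "cust_name": "standardized full name"},
    source="order-entry application, via integration/transformation",
    definition="one row per distinct customer across all subsidiaries",
    related_tables=["ACCOUNT", "ORDER_HISTORY"],
)
```

With records like this at hand, the analyst can tell at a glance what a table means, where its data came from, and which other tables relate to it.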
Metadata mining provides an automated means to surface essential business information buried within legacy systems or the data warehouse. This information, unreachable by metadata movement tools, is needed both by data warehouse data modelers and by business users of information systems. Metadata mining is a low-level investigation of operational data. It analyzes each and every data value within each record occurrence in order to assign a data type to each value and to perform entity identification. The ability to process data at the value/instance level is the fundamental prerequisite for solving the type identification, entity identification, and quality problems at the heart of the enterprise information architecture. Anything that can make the job of the analyst easier and more organized is welcome. Once the analyst has a clear idea of what is available and how one set of data differs from another, he or she is prepared to make the most concise and incisive analysis.

Summary

In order to achieve enterprise intelligence, data quality must be achieved at both the data level and the metadata level. There are three different places for ensuring quality in a data warehouse environment: in the source or application environment; during the integration and transformation stage, when data is moving into the data warehouse; and routinely within the data warehouse itself, in order to address changes in data values over time.

About Bill Inmon

Bill Inmon is widely recognized as the father of the data warehouse concept. He has more than 26 years of database technology management experience and data warehouse design expertise. He has published 36 books and more than 350 articles in the major computer journals. Bill is also the author of DM Review magazine's "Information Management: Charting the Course" column. Before founding Pine Cone Systems, Bill was a co-founder of Prism Solutions, Inc. Mr.
Inmon is responsible for the high-level design of Pine Cone products, as well as for the architecture of planned and future products. Mr. Inmon has consulted with a large number of Fortune 1000 clients, offering data warehouse design and database management services.

About Vality

Vality Technology is the leading provider of data quality tools and services for enterprise intelligence and the industry's leading supplier of data standardization and matching software and consulting services. Our customers are Global 5000 corporations in finance, healthcare, insurance, manufacturing, retail, telecommunications, energy, and utilities. These companies rely on our flagship product, the INTEGRITY Data Re-engineering Environment, to help them uncover patterns and relationships within their data and optimize their strategic information assets in areas such as data quality, data warehouse and business intelligence systems, ERP conversions, enterprise relationship management systems, and electronic commerce.

For more information

For more information about Vality Technology Inc. and the INTEGRITY Data Re-engineering Environment, please call 617-338-0300 or visit the Vality Web site at http://www.vality.com.

Copyright, Vality Technology Inc. All rights reserved.