Enterprise Intelligence - Enabling High Quality in the Data Warehouse/DSS Environment
by Bill Inmon
Vality Technology Incorporated




Introduction

In a few short years, data warehousing has passed from theory to conventional wisdom. In the explosive growth that has transpired, a body of thought has developed around it. From the beginning, data warehousing was never a theoretical exercise; it has always been rooted in pragmatism. But as is inevitable given the breathtaking growth that data warehousing has enjoyed, an organized, thorough intellectual framework has begun to grow around both its infrastructure and its rationale.

There are many aspects to this intellectual framework. One of the important considerations, critical to the infrastructure, is the quality of the data that courses through the veins of the warehouse's components. Indeed, quality in many different forms is one of the cornerstones of data warehousing. If the data warehouse is ever to achieve the lofty goal of becoming a foundation for enterprise intelligence, data quality must become a reality. It is simply unthinkable that analysis for important corporate decisions should proceed on the basis of incorrect and incomplete data. Therefore, a de facto prerequisite for enterprise intelligence is quality throughout the data warehouse environment.

The corporate information factory

Before there can be a discussion of the quality of data in the data warehouse/DSS environment, there needs to be a discussion of the structure of the data warehouse environment and of its infrastructure. The data warehouse has grown from a database separate and apart from transaction processing into a sophisticated structure known as the "corporate information factory." Figure 1 depicts the corporate information factory.

Figure 1: The Corporate Information Factory (components: applications; integration/transformation; ODS; enterprise data warehouse; data marts; exploration warehouse; near line storage)

The genesis of data in the corporate information factory is the application environment. Here, detailed data is gathered, audited, transacted, and stored. The application is written for specific requirements. The essence of the application environment is transactions, which typically execute very quickly, operating on small amounts of data.

Once data is gathered into the application environment, it is passed through a layer of programs called the "integration and transformation" layer. These programs integrate and convert the application data into a corporate format. The integration and transformation programs that a corporation writes usually represent its largest expense and effort in developing the data warehouse.

Once data passes through the integration and transformation layer, it heads in one of two directions: to the data warehouse or to the ODS (operational data store). When the data heads to the ODS, it goes to an environment that is a hybrid DSS/operational structure. The ODS is a place where it is possible to achieve high-performance OLTP response time. At the same time, it is possible to access and analyze integrated data there and, on occasion, to do DSS processing. Not all companies have a need for an ODS, but where there is a need, a business is served well by having one. Eventually, data that passes into the ODS also passes into the data warehouse.
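As a rough illustration of what the integration and transformation layer does, the following sketch converts two applications' customer records into a single corporate format. The record layouts, field names, and encoding conventions are hypothetical, invented for the sketch rather than drawn from any real application.

# Hypothetical sketch: unifying two applications' record formats into one
# corporate format before loading the data warehouse.

def transform_app_a(rec: dict) -> dict:
    """Application A stores gender as 'm'/'f' and balances in cents."""
    return {
        "customer_key": f"A-{rec['cust_no']}",            # prefix keeps keys unique across sources
        "gender": {"m": "M", "f": "F"}.get(rec["gender"], "U"),
        "balance": rec["balance_cents"] / 100.0,           # corporate format uses whole currency units
        "source": "APP_A",
    }

def transform_app_b(rec: dict) -> dict:
    """Application B stores gender as 1/0 and balances in whole currency units."""
    return {
        "customer_key": f"B-{rec['id']}",
        "gender": {1: "M", 0: "F"}.get(rec["sex"], "U"),
        "balance": float(rec["balance"]),
        "source": "APP_B",
    }

if __name__ == "__main__":
    warehouse_rows = [
        transform_app_a({"cust_no": 1001, "gender": "f", "balance_cents": 250000}),
        transform_app_b({"id": 77, "sex": 1, "balance": 1800}),
    ]
    for row in warehouse_rows:
        print(row)

The point of the sketch is simply that each source keeps its own conventions, and the conversion to a common format happens in one place, outside the applications themselves.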

The data warehouse is then fed integrated data from either the integration and transformation layer or the ODS. The data warehouse is the heart of the DSS infrastructure. It is the place where the integrated, granular data of the corporation resides. It contains historical data, sometimes up to ten years of it, depending on the business of the corporation. It represents the single "source of truth" for the data residing in the corporation, and it represents the ultimate basis for reconciling any discrepancies that a corporation might have. There is almost always a large volume of data residing in the data warehouse, and the volume of data found there grows at a breathtaking rate.

Data emanates from the data warehouse in many directions. Data marts are created from the granular data found in the data warehouse. Data marts reflect departmental views of the corporation: each data mart selects and shapes the granular data to its own needs. Consequently, the data marts are significantly smaller than the data warehouse. As such, they can take advantage of specialized technology such as multidimensional and cube technology.

Another extension of the data warehouse is the exploration warehouse. This is built for the explorers of the corporation. By creating a separate facility for explorers, companies avoid disrupting the regular work of the data warehouse. The exploration warehouse environment is best served by technology unique to it.

There is one other important component of the corporate information factory: near line storage. Near line storage exists to house bulk and infrequently used data. It allows the cost of warehousing to be driven down to a relatively small expenditure. By introducing near line storage into the corporate information factory, the designer is free to take data down to the lowest level of granularity desired.
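Before turning to the issues of quality themselves, here is a minimal sketch of the way a data mart "selects and shapes" the warehouse's granular data. The sales records and the departmental summary are purely illustrative.

from collections import defaultdict

# Hypothetical granular sales facts as they might sit in the enterprise data warehouse.
warehouse_sales = [
    {"date": "1998-03-01", "region": "EAST", "product": "P1", "amount": 120.0},
    {"date": "1998-03-01", "region": "WEST", "product": "P1", "amount": 75.0},
    {"date": "1998-03-02", "region": "EAST", "product": "P2", "amount": 40.0},
]

def build_marketing_mart(rows):
    """A departmental mart keeps only what the department needs:
    here, monthly sales totals by region rather than individual transactions."""
    totals = defaultdict(float)
    for r in rows:
        month = r["date"][:7]                       # '1998-03'
        totals[(month, r["region"])] += r["amount"]
    return [{"month": m, "region": reg, "total_sales": amt}
            for (m, reg), amt in sorted(totals.items())]

print(build_marketing_mart(warehouse_sales))

Because the mart carries only the summarized, departmental view, it is much smaller than the warehouse and can sit comfortably on specialized multidimensional technology.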

Issues of quality

What, then, are the issues of quality that arise in creating and operating a corporate information factory? The heart of the corporate information factory is the data warehouse, and the first major issue of data quality is how to ensure that data arrives in the data warehouse with the highest degree of quality. Figure 2 shows that there are three opportunities for ensuring data quality as data is prepared for loading into the data warehouse. Each of these opportunities has its own considerations; in fact, it is recommended that all three be used together for maximum effectiveness.

Figure 2: The three opportunities for quality in the data warehouse environment

These three opportunities are:

1. Cleansing data at the source, in the application environment;
2. Cleansing data as it is integrated, upon leaving the applications and entering the integration and transformation programs; and
3. Cleansing and auditing data after it has been loaded into the warehouse.

Data cleansing at the application level

At first glance, it appears that the most natural place for assuring data quality is in the application. Data first enters the corporate information factory and is captured in the application. Indeed, the cleaner the data at the point of entry, the better off the corporate information factory. One theory holds that if the data is perfectly cleansed at the application level, it need not be cleansed elsewhere. Unfortunately, this is not the case at all.
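Application-level cleansing is still worth doing, even though it is not sufficient on its own. As a small sketch of what point-of-entry edits can look like, consider the routine below; the field names and rules are hypothetical, not taken from any particular application.

import re

# Hypothetical point-of-entry edits for a customer record. Rules are illustrative only.
def validate_customer(rec: dict) -> list:
    errors = []
    if not rec.get("name", "").strip():
        errors.append("name is required")
    if not re.fullmatch(r"\d{5}(-\d{4})?", rec.get("zip", "")):
        errors.append("zip must be 5 digits or ZIP+4")
    if rec.get("state", "").upper() not in {"MA", "NY", "CA", "TX"}:   # abbreviated list for the sketch
        errors.append("unknown state code")
    return errors

record = {"name": "Acme Corp", "zip": "02110", "state": "ma"}
problems = validate_customer(record)
print("clean" if not problems else problems)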

Several mitigating factors prevent the application from being the panacea for data quality. The first difficulty is the state of the application itself. In many cases, the application is old and undocumented. Application programmers are legitimately scared to go back into old application code and alter it in any significant way. The fear is that one problem may be fixed, but two others may arise; fixing one problem might set off a cascade of other problems, leaving the application worse off than it was before it was maintained.

The second reason why application developers are loath to go back into old code is that they see no benefit in doing so. Application developers focus on immediate requirements, and they see no urgency, or for that matter any motivation, in going back into old code and modifying it to solve someone else's problems. Politics then enters the picture of what is and is not a priority. There is, then, both a motivational and an organizational problem in trying to get changes made at the application level.

But even if you could magically and easily do anything you wished at the application level, you would still need to cleanse data elsewhere in the corporate information factory. The reason why integration and transformation cleansing is still necessary, even when application data is perfect, is that application data is not integrated. Data may be just fine in the eyes of a single application developer or user, but the data residing in the application still needs to be integrated across the corporate information factory. There is a big difference between cleansing application data and integrating application data. Only after the data comes out of the application is there a need and an opportunity for integrating it. The first opportunity for integration arises as data passes into the integration and transformation layer.

Data cleansing in the integration and transformation layer

Multiple applications pass data into the integration and transformation layer. Each application has its own interpretation of data, as originally specified by the application designer. Keys, attributes, structures, and encoding conventions are all different across the many applications. But in order for the data warehouse to contain integrated data, the many application structures and conventions must be integrated into a single, cohesive set of structures and conventions. There is, then, a complex task in store for the integration and transformation processing.

Not only are keys, structures, and encoding conventions different across the many applications; in many cases, relationships between data within systems, as well as across systems, go undetected. Legacy information is often buried and floating within free-form text fields such as name and address lines, comment fields, and other data fields that have become a storage closet for meanings and relationships not accounted for in the original system. Data relationships may be hidden because the initial systems did not provide a key structure that linked all relevant records; for example, multiple account numbers might mask the fact that all the records belong to subsidiaries of the same company. Data anomalies in names, addresses, part descriptions, and account codes are another area to rectify. And inconsistencies between metadata field definitions and the applications tend to surface over time as the application systems become part of the operational fabric of an organization: commercial names mixed with personal names, addresses with missing information, truncated information, use of special characters as separators, missing values, abbreviations, and so on.
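A rough sketch of the kind of standardization and matching these anomalies call for is shown below. It is illustrative Python only, not Vality's INTEGRITY product: it normalizes free-form company names so that records carrying different account numbers can be recognized as the same underlying organization.

import re

# Illustrative sketch: normalize free-form company names so that records with
# different account numbers can be recognized as the same organization.
ABBREVIATIONS = {"inc": "incorporated", "corp": "corporation", "co": "company",
                 "mfg": "manufacturing", "intl": "international", "&": "and"}

def standardize_name(raw: str) -> str:
    tokens = re.findall(r"[a-z0-9&]+", raw.lower())          # strip punctuation and case
    tokens = [ABBREVIATIONS.get(t, t) for t in tokens]        # expand common abbreviations
    return " ".join(tokens)

accounts = [
    {"acct_no": "A-1001", "name": "ACME Mfg. Co., Inc."},
    {"acct_no": "B-2093", "name": "Acme Manufacturing Company Incorporated"},
    {"acct_no": "C-0007", "name": "Globex Corp"},
]

# Group accounts whose standardized names agree (a crude stand-in for real matching).
groups = {}
for a in accounts:
    groups.setdefault(standardize_name(a["name"]), []).append(a["acct_no"])

for name, accts in groups.items():
    if len(accts) > 1:
        print(f"possible same organization: {accts} -> '{name}'")

Real matching technology goes far beyond exact agreement on a standardized string, but even this crude sketch surfaces the hidden relationship between the first two accounts.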

These quality issues can be found in a single set of application data, are multiplied when data is integrated across multiple applications, and can put the effectiveness of the resulting data warehouse, and its ability to deliver enterprise intelligence, at risk.

The result of the tedious and difficult integration and transformation processing is integrated data, and the process of integrating the many applications together is certainly one form of cleansing data. It is noteworthy that this form of cleansing is not possible until the data has passed out of the application. Therefore, there is a separate opportunity for data quality beyond cleansing data in the application. But there is also a third place where data quality needs to be addressed: after the data has been loaded into the data warehouse.

Data quality inside the data warehouse

Suppose that you could create perfect application programs and perfect integration and transformation programs. Would you still need a data quality facility within the data warehouse itself? The answer is "yes." First of all, as new application data is added to the data warehouse environment, all of the integration and transformation layer issues will be re-addressed, and the new data may also uncover more hidden anomalies and relationships in the warehouse itself. Another key reason is that the data warehouse contains data collected over a spectrum of time, in some cases as long as ten years. The problem with data collected over time is that data itself changes over time. In some cases the changes are slow and subtle; in other cases they are fast and radical. In any case, it is simply a fact of life that data changes over time, and with these changes comes the need to integrate data over time within the data warehouse after it has already been loaded. Even if data is entered perfectly from the applications and the integration and transformation programs, there will still be a need to examine data quality inside the data warehouse over time. But has the data remained constant over those years? Hardly.

Figure 3: In a data warehouse, data is loaded into the warehouse over time (the figure shows loads for 1995 through 1998).

Figure 4 shows some common changes that have occurred over the years.

Figure 4: There are plenty of examples where data undergoes a fundamental change over time (one panel of the figure shows a standard chart of accounts giving way to SAP; the other shows national currencies such as the franc, pound, and peseta giving way to the euro).

In the case shown in Figure 4, there was a standard chart of accounts until 1998. Then, in 1998, SAP was brought into the corporation, and a new chart of accounts was created. Trying to use chart of accounts codes from 1996 through 1999 based on data in the warehouse produces very misleading and inaccurate results. As another example, money is measured in the local currency prior to 1998, but in 1999 money is measured in euros. Trying to perform a cash analysis from 1996 through 1999 will be very difficult because the underlying meaning of the data has changed. Therefore, even if data quality has been perfected elsewhere, it remains to be perfected one more time after the data enters the warehouse, simply because data ages inside the warehouse.

Referential integrity in the data warehouse environment

There is another form of data quality that deserves mention: the quality of relationships among types of data inside the data warehouse. This type of data quality has long been known as "referential integrity." As a simple example of referential integrity in the classical operational environment, consider a common relationship between two elements of data, A and B: the parent/child relationship. In this relationship, when data element A exists and data element B relates to it in a parent/child manner, then if A is deleted, data element B is also deleted by the referential integrity facility. Likewise, if a user wants to insert data element B, it cannot be inserted unless the data element A that it relates to already exists. The referential integrity facility exists to ensure that the relationships defined are held intact by the database management system.
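A minimal sketch of the classical operational behavior just described is shown below, with in-memory dictionaries standing in for DBMS tables: inserting a child fails unless its parent exists, and deleting a parent cascades to its children.

# Minimal sketch of classical referential integrity. A (parent) and B (child)
# follow the example in the text; the dictionaries stand in for database tables.
parents = {"A1": {"desc": "parent row"}}
children = {}   # child_key -> parent_key

def insert_child(child_key: str, parent_key: str) -> None:
    if parent_key not in parents:
        raise ValueError(f"cannot insert {child_key}: parent {parent_key} does not exist")
    children[child_key] = parent_key

def delete_parent(parent_key: str) -> None:
    # Cascade: deleting the parent also deletes its children, as the text describes.
    for ck in [k for k, pk in children.items() if pk == parent_key]:
        del children[ck]
    parents.pop(parent_key, None)

insert_child("B1", "A1")        # succeeds: the parent exists
delete_parent("A1")             # removes A1 and cascades to B1
print(parents, children)        # {} {}
try:
    insert_child("B2", "A1")    # fails: the parent no longer exists
except ValueError as e:
    print(e)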

Referential integrity applies to the world of data warehouses just as it applies to operational systems. However, referential integrity is implemented quite differently in the data warehouse environment, for several reasons:

- The volumes of data in the data warehouse are significantly larger than the volumes found in the operational environment.
- Snapshots are created in the data warehouse, whereas data is updated in place in the operational environment.
- Data in the warehouse represents a spectrum of time, while data in the operational environment is usually taken to be current-valued, that is, current as of the moment of online access.

For these and other reasons, referential integrity in the data warehouse environment is implemented quite differently than in the operational environment. As a simple example of the difference, consider the parent/child relationship again. In the data warehouse environment, this relationship would be framed by parameters of time: a START TIME and a STOP TIME. The relationship between A and B might be valid from January 1 to February 15. The data warehouse referential integrity facility would first check the moment in time being considered. If that moment lies outside the dates defined for the relationship, say July 20, then no relationship between A and B is implied. But if the date in question lies between the START TIME and the STOP TIME, say between January 18 and February 2, then the relationship between A and B is enforced. (A small sketch of such a time-bounded check appears below, after the comparison of the three places for data quality.)

Three places for data quality

It is interesting to compare the three places where the quality of data needs to be addressed on its way into the data warehouse. In the application arena, there is the need to see that data is entered and recorded correctly. Data quality standards for applications include ensuring that data is entered correctly and that information is not buried and floating within free-form fields. Clear routines for data defect detection are critical to ensure that misspellings do not result in duplicate customer or product entries, and that relationships between entities, such as subsidiaries or multiple accounts for a single client, are maintained.

In the integration and transformation layer, it is necessary to see that data has been integrated. In most environments this is the most difficult of all data quality audits. Integrating data involves determining relationships across disparate data files with multiple formats, as well as complex matching and consolidation, particularly where there are relationships among non-keyed fields.

And once inside the data warehouse, it is necessary to examine the integration of data over time. In many cases there will be no differences. But where there are differences, there is the question of what to do about the discrepancies.
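Returning to the time-bounded parent/child relationship described in the referential integrity discussion above, the sketch below is illustrative only (the year is assumed, since the text gives only month and day). It shows a warehouse-style check that consults the START TIME and STOP TIME of a relationship before deciding whether it applies.

from datetime import date

# Illustrative sketch: warehouse-style referential integrity is framed by time.
# The relationship between A and B is only asserted between its start and stop dates.
RELATIONSHIPS = [
    # (parent, child, start_time, stop_time): valid January 1 to February 15, as in the text.
    ("A", "B", date(1998, 1, 1), date(1998, 2, 15)),
]

def relationship_holds(parent: str, child: str, as_of: date) -> bool:
    """Return True only if a parent/child relationship is defined for the date in question."""
    return any(p == parent and c == child and start <= as_of <= stop
               for p, c, start, stop in RELATIONSHIPS)

print(relationship_holds("A", "B", date(1998, 7, 20)))   # False: July 20 lies outside the window
print(relationship_holds("A", "B", date(1998, 1, 18)))   # True: January 18 lies inside the window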

As the volume and capacity of the data warehouse grow, massive conversions are not a viable option. Data quality tools such as Vality's INTEGRITY(TM) Data Re-engineering Environment use parsing and matching technology to automatically unify and correct disparate formats, creative data entry, spelling and keying anomalies, and undiscovered data inconsistencies. Regular batch runs for cleansing, together with real-time defect detection and correction, can provide a continuing high degree of data integrity.

Figure 5: Quality of data becomes a different issue after it is acquired (the figure repeats the corporate information factory diagram, with the line of demarcation falling between the data warehouse on one side and the data marts and exploration warehouse on the other).

Analytical data quality

It is one thing to ensure that the data residing in the data warehouse is of the highest quality. It is another thing to say that the data used for analysis is also of the highest quality. There is an important split in the corporate information factory that delimits the difference in the approach to data quality; Figure 5 shows this line of demarcation.

Figure 5 shows that there are two divisions in the corporate information factory in relation to data quality. To the left are the application arena, the integration and transformation arena, and the data warehouse itself. In this arena the objective is to cleanse and purify the data as much as possible. To the right is the data mart and exploration warehouse arena. In this arena there is a choice as to what data is best used for analysis.

In the data mart and exploration warehouse arena, the issue of ensuring quality switches to one of ensuring that the right data is being used for analysis. The strongest guarantee that the best data is indeed being used is the analyst himself or herself. The analyst needs to be sure of what the data means, where it came from, and how fresh it is. He or she needs to understand the data intimately in order to use it most effectively, and he or she is responsible for the interpretation of the data.

The best aid the analyst can have is accurate and robust metadata. Figure 6 shows the metadata that can be very useful to the analyst. The metadata that describes the data residing in the different components of the corporate information factory is varied in content. Typically, it contains descriptors for:

- Table descriptions
- Attribute descriptions
- Sources of data
- Definitions of data
- Relationships of data, and so forth

Figure 6: Upon analysis of the data, metadata becomes the central issue (the figure shows metadata attached to each component of the corporate information factory, including the data marts, the exploration warehouse, and near line storage).
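As a rough sketch of what such descriptive metadata might look like for a single warehouse table, consider the structure below; the table name, attributes, sources, and definitions are hypothetical.

# Hypothetical metadata entry for one warehouse table, covering the kinds of
# descriptors listed above: table and attribute descriptions, sources,
# definitions, relationships, and freshness.
CUSTOMER_METADATA = {
    "table": "DW_CUSTOMER",
    "description": "Integrated customer records, one row per customer per snapshot date",
    "attributes": {
        "customer_key": "Surrogate key assigned in the integration/transformation layer",
        "snapshot_date": "Date the snapshot was loaded into the warehouse",
        "balance": "Account balance expressed in corporate currency units",
    },
    "sources": ["APP_A.CUSTMAST", "APP_B.CLIENTS"],        # operational systems feeding this table
    "relationships": {"DW_ACCOUNT": "one customer to many accounts"},
    "freshness": "loaded nightly",
}

# With metadata like this, an analyst (or a tool) can answer
# "where did this data come from?" directly:
print(CUSTOMER_METADATA["sources"])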

Metadata mining provides an automated means to surface essential business information buried within legacy systems or the data warehouse. This information, unreachable by metadata movement tools, is needed both by data warehouse data modelers and by business users of information systems. Metadata mining is a low-level investigation of operational data: it analyzes each and every data value within each record occurrence in order to assign a data type to each value and to perform entity identification. The ability to process data at the value/instance level is the fundamental prerequisite for solving the type identification, entity identification, and quality problems at the heart of the enterprise information architecture.

Anything that can make the job of the analyst easier and more organized is welcome. Once the analyst has a clear idea of what is available and what differs from one set of data to another, he or she is prepared to make the most concise and incisive analysis.

Summary

In order to achieve enterprise intelligence, data quality must be achieved at both the data level and the metadata level. There are three different places for ensuring quality in a data warehouse environment: in the source or application environment, during the integration and transformation stage as data moves into the data warehouse, and routinely within the data warehouse itself in order to address changes in data values over time.

About Bill Inmon

Bill Inmon is widely recognized as the father of the data warehouse concept. He has more than 26 years of database technology management experience and data warehouse design expertise. He has published 36 books and more than 350 articles in the major computer journals. Bill is also the author of DM Review magazine's "Information Management: Charting the Course" column. Before founding Pine Cone Systems, Bill was a co-founder of Prism Solutions, Inc. Mr. Inmon is responsible for the high-level design of Pine Cone products, as well as for the architecture of planned and future products. Mr. Inmon has consulted with a large number of Fortune 1000 clients, offering data warehouse design and database management services.

About Vality

Vality Technology is the leading provider of data quality tools and services for enterprise intelligence and the industry's leading supplier of data standardization and matching software and consulting services. Our customers are Global 5000 corporations in finance, healthcare, insurance, manufacturing, retail, telecommunications, energy, and utilities. These companies rely on our flagship product, the INTEGRITY Data Re-engineering Environment, to help them uncover patterns and relationships within their data and to optimize their strategic information assets in areas such as data quality, data warehouse and business intelligence systems, ERP conversions, enterprise relationship management systems, and electronic commerce.

For more information

For more information about Vality Technology Inc. and the INTEGRITY Data Re-engineering Environment, please call 617-338-0300 or visit the Vality Web site at http://www.vality.com.

Copyright Vality Technology Inc. All rights reserved.