Data Quality Assessment Approach




Prepared By: Sanjay Seth

Introduction

Data quality is crucial to the success of Business Intelligence initiatives. Unless the data in source systems is accurate and reliable, BI users will spend more effort on manual activities and rework than on business-related activities. To improve the quality of data, many companies initiate data quality assessment programs and form data stewardship groups. Yet in the absence of a comprehensive methodology, measuring data quality remains an elusive concept: it proves easier to produce hundreds or thousands of data error reports than to make any sense of them.

The purpose of this document is to provide a methodology for managing data quality issues. The process is described in the order in which the events should occur, from the initial capture of data quality issues to presenting the findings to the data owners for further action.

Data quality can be defined as the state of completeness, consistency, timeliness and accuracy that makes the data appropriate for a specific use. The dimensions of data quality are:

- Accuracy: facts and dimensions are loaded correctly
- Completeness: all relevant data is stored
- Consistency: a uniform format is used for storing the data
- Timeliness: data is stored within the required time frame

Data quality should therefore ensure that the data loaded into the target destination is timely, accurate, complete and consistent.

Data quality issues typically result from:

- Incorrect manual entry of data in the source system
- Lack of common data standards across business divisions when integrating data from multiple sources into a data warehouse
- Lack of a proper business process; in some cases, root cause analysis of a data quality issue may point to a business process that needs to be re-designed to mitigate the issue

The main sources of data quality issues, as reported in a TDWI survey, are categorized and ranked in Figure 1.

Figure 1: Sources of Data Quality Problems

Benefits of Improved Data Quality

The following benefits result from improved information quality. Data quality benefits can be classified as either soft benefits or hard benefits.

Soft benefits are those that are evident and clearly have an effect on productivity, yet are difficult to measure. These include:

- Building user confidence and trust in the data disseminated by the Data Warehouse solution: good data quality promotes use of the Data Warehouse.
- Improved throughput for volume processing: by reducing the delays associated with detecting and correcting data errors, and the rework associated with that correction, more transactions can be processed, resulting in greater volume processing and a lower cost per transaction.
- Improved customer profiling: more compliant customer information allows the business intelligence process to provide more accurate customer profiling, which in turn can lead to increased sales, better customer service, and improved retention of valued customers.
- Decreased resource requirements: redundant data, correction and rework put an unnecessary strain on an organization's resource pool. Eliminating redundant data and reducing the amount of rework reduces that strain and allows better resource allocation and utilization.

Hard benefits are those that can be estimated and/or measured. These include:

- Customer attrition, which occurs when a customer's reaction to poor data quality results in the customer's complete cessation of business

- Costs of error detection and correction: detection costs are incurred when a system error or processing failure occurs and a process is invoked to track down the problem, together with the extra time it takes to correct data problems; correction costs are associated with the actual correction of the problem as well as the restarting of any failed processes or activities, with the time spent on the failed activity and any extraneous employee activity rolled up into them
- Costs of data maintenance, e.g. maintaining spreadsheets to meet information requirements
- Extra resources needed to correct data problems
- Time and effort required to re-run jobs that abend
- Time wasted arguing over inconsistent reports
- Lost business opportunities due to unavailable data
- Fines paid for noncompliance with government regulations
- Shipping products to the wrong customers
- Bad public relations with customers, leading to alienated and lost customers

The Data Warehousing Institute (TDWI) estimates that poor quality customer data costs U.S. businesses a staggering $611 billion a year in postage, printing, and staff overhead. Organizations can frustrate and alienate loyal customers by incorrectly addressing letters or failing to recognize them when they call or visit a store or Web site. Once a company loses its loyal customers, it loses its base of sales and referrals, and future revenue potential.

These benefits can be realized by following the Data Quality Assessment Methodology described in the following section.

Data Quality Assessment Methodology

The Data Quality Assessment methodology consists of a five-stage process for assessing and improving the data quality of the solution being addressed:

1. Extract - identify and analyze the source data being cleansed
2. Discover - identify and understand the current data issues
3. Cleanse - implement checks to rectify data errors
4. Monitor - regularly validate the data through defined processes
5. Prevention - fix the processes by which data errors are introduced

Figure 2: Data Quality Methodology

Step 1 - Extract: deploy automated profiling tools for accelerated data analysis.
Step 2 - Discover: understand current data issues; investigate data errors and identify the root cause of failure; quantify the gap between current and desired quality levels; prioritize data quality levels to meet objectives.
Step 3 - Cleanse: implement validation checks and business rules to detect data errors and ensure logical consistency of the data; rectify erroneous data using ETL rules; manage data exceptions manually.
Step 4 - Monitor: monitor and identify data gaps and plan for maintaining and enhancing the quality of data.
Step 5 - Prevention: communicate errors and their impact to data providers.

1. Extract

Data profiling is the first step towards ensuring data quality. Data profiling is the assessment of data to understand its content, structure, quality and dependencies. It covers standard column analysis such as frequency, NULL checks, cardinality, etc., and will usually expose some of the more glaring data quality problems. Some of the common methods in data profiling are listed below:

- Structure discovery: check whether the different patterns of data are valid, e.g. the format of zip code, phone number or address.
- Data discovery: is the data value correct, error-free and valid? e.g. check whether mandatory attributes contain incorrect data.
- Relationship discovery: verify that all key relationships are maintained and that end-to-end linkage is possible, e.g. the link between a child and parent table.
- Data redundancy: check whether the same data is represented multiple times.

The results of the profiling should be discussed with the client to determine which of the issues relate to business problems.
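The column-level checks described above can be scripted even before a dedicated profiling tool is deployed. The following minimal sketch, written in Python with pandas, illustrates structure discovery, data discovery and redundancy checks; the file name and column names (customers.csv, customer_id, zip_code) are illustrative assumptions, not part of any specific solution.

import re
import pandas as pd

# Hypothetical extract of customer records from a source system
df = pd.read_csv("customers.csv", dtype=str)

# Column-level profiling: completeness, cardinality and redundancy
profile = {}
for col in df.columns:
    profile[col] = {
        "rows": len(df),
        "nulls": int(df[col].isna().sum()),             # completeness
        "distinct": int(df[col].nunique()),             # cardinality
        "duplicates": int(df[col].duplicated().sum()),  # redundancy
    }

# Structure discovery: do values follow the expected pattern?
bad_zip = ~df["zip_code"].fillna("").str.match(r"^\d{5}(?:-\d{4})?$")
profile["zip_code"]["bad_format"] = int(bad_zip.sum())

# Data discovery: a mandatory attribute must be populated and unique
profile["customer_id"]["violations"] = int(
    df["customer_id"].isna().sum() + df["customer_id"].duplicated().sum()
)

# Relationship discovery would compare keys across tables, e.g.
# orphans = ~orders["customer_id"].isin(df["customer_id"])

for col, stats in profile.items():
    print(col, stats)

In practice an automated profiling tool replaces a script of this kind, but the same checks (null counts, cardinality, pattern conformance, key linkage) underlie both.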

Some examples of data profiling analysis are shown in Figure 3.

Figure 3: Data Profiling Analysis

- Completeness - What data is missing or unusable? (Null checks, uniqueness checks)
- Conformity - What data is stored in a non-standard format? (Standard code sets, rules)
- Consistency - What data values give conflicting information? (Relationship analysis)
- Accuracy - What data is incorrect or out of date? (Domain validation, range validation)
- Duplicates - What data records or attributes are repeated? (Redundancy evaluation)
- Integrity - What data is missing or not referenced? (Referential integrity, cardinality analysis)

2. Discover

After the problems have been identified, the various errors need to be corrected. Common activities are:

- Data standardization: the same data is represented in different formats; identify these and map them to common standard values.
- Pattern standardization: a particular attribute may contain data following different patterns; define a common pattern for the data, e.g. standardize phone numbers stored as 999-9999999, 999(9999999) or (999) 999-9999 into a single format (a scripted example of such standardization appears after the cleansing rules below).
- Data verification: verify the correctness of data and reduce ambiguities, e.g. customer address data can be checked to confirm it is correct.

3. Cleanse

Through cleansing, data quality defects are corrected using the appropriate data cleansing action. The common actions that can be taken during data cleansing are:

- Filter data to remove rule violations
- Correct data to repair rule violations

Filter data: the purpose of filtering is to remove problematic data. This action is typically applied when data is considered defective to a degree that makes it unusable.

Correct data: the purpose of correction is to fix defective data. Correction alters the values of individual fields. The replacement value may be determined using the following techniques:

1) Identifying errors while integrating data:
- Inserting a default value that indicates the absence of reliable data
- Removing redundant information, e.g. when the same data is available in two different source systems, or when the same data is represented in two different formats in two source systems (address information, for example, can be represented in different formats)

2) Searching alternative sources to find a replacement value, e.g. incorporating additional external data to add value to existing records; if customer data is appended with more business details, a better understanding of the customer can be obtained.

Data quality rules are implemented in the following order:

- Assess data quality objectives
- Identify and define data cleansing rules
- Implement data validation routines on the source feed files
- Implement data cleansing rules

For issues such as missing data values or data elements, the solution is to identify cleansing rules for the various subject areas (customer, products, etc.), define functional/program specifications for the cleansing rules, and implement the validation and cleansing rules.

Figure 4: Data Quality Rules Implementation
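The sketch below, again in Python with pandas, is a minimal illustration of how such standardization and cleansing rules might be scripted. The source file, the column names and the chosen standard phone format are assumptions made for the example; the rules themselves (standardize a pattern, correct with a default, filter unusable records) follow the actions described above.

import re
import pandas as pd

df = pd.read_csv("customers.csv", dtype=str)

# Pattern standardization: normalize every phone number to 999-999-9999
def standardize_phone(value):
    if not isinstance(value, str):
        return None  # missing value; left for manual exception handling
    digits = re.sub(r"\D", "", value)
    if len(digits) == 10:
        return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"
    return None  # cannot be standardized reliably

df["phone"] = df["phone"].map(standardize_phone)

# Correct data: insert a default value that indicates absence of reliable data
df["country"] = df["country"].fillna("UNKNOWN")

# Filter data: records defective to the point of being unusable are removed
# (rule: a record without a customer identifier cannot be loaded)
rejects = df[df["customer_id"].isna()]
clean = df[df["customer_id"].notna()]

clean.to_csv("cleansed_customers.csv", index=False)
rejects.to_csv("rejected_customers.csv", index=False)

Rejected records are written to a separate file so that they can be reviewed and routed back to the data providers rather than silently discarded.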

Data Warehouse solutions have a role to play in ensuring the quality of data. They ensure that only data that is fit for use is loaded and made available to consumers. These solutions can also use dedicated data cleansing tools for activities such as address cleansing, standardization and de-duplication. Data Warehouse solutions have various ways of validating that the source data is of good quality and of conditioning the data where appropriate, prior to loading it into the final target tables. They can also provide automated, proactive alerting.

4. Monitor

The monitoring step involves measuring data quality and tracking errors. It ensures that all reported anomalies are corrected and monitored. Data quality monitoring is a process that focuses on improving the quality of data: it ensures that the data is valid, that proper standardization is attained, and that redundant data is identified and eliminated. Once data is corrected, regular monitoring is necessary to avoid errors and ambiguities. This can be done by:

- Creating reports on a regular basis
- Creating rules to validate the data
- Generating events to correct the data

Data monitoring includes creating a list of critical data quality problems as well as the business problems to which they relate. Data monitoring should not only measure information compliance with defined business rules, but also measure the actual costs associated with noncompliance.

Data quality monitoring can be implemented either as an interim solution or as a long-term solution. In the interim solution, data from the various sources is profiled and analyzed for anomalies, and once the data quality rules have been determined from the profiling analysis, they are deployed to overcome these anomalies. For a long-term solution, data quality metrics and audit reports can be created to measure the overall data quality improvement. Audit report structure details are provided in the Appendix.

Quality metrics provide the means to quantify data quality. Measures are needed to capture the current state of data quality and to evaluate the progress made towards improving it. Various attributes (format, range, domain, etc.) of the data elements can be measured, and the measurements can be rolled up or aggregated into metrics; for example, the number of defective addresses, invalid phone numbers and incorrectly formatted email addresses can all be measured and rolled up into one metric that represents the quality of just the contact data.
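As an illustration of such a roll-up, the sketch below computes three contact-data measurements and aggregates them into a single contact data quality metric. The file name, column names and the specific validation patterns are assumptions made for the example.

import pandas as pd

df = pd.read_csv("cleansed_customers.csv", dtype=str)

# Individual measurements on the contact-data attributes
checks = {
    "defective_address": df["address"].isna(),
    "invalid_phone": ~df["phone"].fillna("").str.match(r"^\d{3}-\d{3}-\d{4}$"),
    "bad_email_format": ~df["email"].fillna("").str.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
}
measurements = {name: int(flags.sum()) for name, flags in checks.items()}

# Roll-up: share of records that pass every contact-data rule
failed = pd.concat(list(checks.values()), axis=1).any(axis=1)
contact_quality = 1 - failed.sum() / max(len(df), 1)

print(measurements)
print(f"Contact data quality metric: {contact_quality:.1%}")

A metric of this kind, captured on each load and trended over time, is what feeds the data quality scorecard and audit reports described in the Appendix.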

Figure 5: Data Quality Monitoring

5. Prevention

The purpose of prevention is to remove the causes of defective data by fixing the processes through which defects are introduced. Prevention determines the root causes of defective data and takes steps to eliminate them. By providing error reports, audit reports and reconciliation reports to the source system providers for correction, data quality issues can be reduced and eventually prevented over a period of time. These reports give the data owners visibility into the errors, their causes and the corrective action that needs to be taken.

Reconciliation is a process through which data from the source and target systems is compared and analyzed; validation scripts are run on both source and target data for comparison. Data quality issues should ideally be addressed and resolved in the source systems themselves, which helps ensure that data in the Data Warehouse is always in sync with data in the source systems.
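A simple reconciliation script of this kind might compare record counts and a key measure between the source extract and the warehouse, as in the hypothetical sketch below (the file names, key column and measure column are assumptions for illustration).

import pandas as pd

# Hypothetical daily extracts from the source system and the data warehouse
source = pd.read_csv("source_policies.csv", dtype={"policy_id": str})
target = pd.read_csv("dw_policies.csv", dtype={"policy_id": str})

# Record reconciliation: counts and keys present on one side only
report = {
    "source_rows": len(source),
    "target_rows": len(target),
    "missing_in_target": len(set(source["policy_id"]) - set(target["policy_id"])),
    "unexpected_in_target": len(set(target["policy_id"]) - set(source["policy_id"])),
}

# Measure reconciliation: the total premium should match between the two systems
report["premium_difference"] = round(
    float(source["premium"].sum()) - float(target["premium"].sum()), 2
)

# The report is shared with the source-system owners as part of prevention
print(report)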

Conclusion

Strong frameworks and processes are required for controlling data quality and for managing data. Additional validation procedures, such as data-level reconciliation, go a long way towards delivering solutions with high data quality.

Case Study

For a Life Insurance client, the objective of the data quality assessment was to provide an approach for increasing the accuracy of the data in the data warehouse and thereby building business confidence in using it. The client had multiple source systems loading customer and product data into the data warehouse, so the requirement was to provide an approach by which the data quality within the warehouse could be improved and the business issues that were having an impact on business and overall cost could be addressed.

Figure 6: Data Quality Relationship between Impact and Cost

The data quality assessment was conducted using the Data Quality methodology described above, and a solution was provided for improving data quality for data access. The solution delivered the following benefits:

- A data warehouse environment that enabled:
  o a 360-degree view of the customer
  o integrated product information from the various source systems
  o the ability to satisfy the reporting needs of users
- Improved operational efficiency by making data available to users when they need it, via a single standard framework, so that they can make informed decisions effectively
- An opportunity for analysts to spend more quality time on analysis and less time on data quality issue resolution

The solution also provided recommendations in the following domains.

Data Architecture
- Created source-to-target mappings
- Enabled a single version of entities and metrics
- Structured standards (e.g. naming standards)

ETL Processes
- Helped define robust data validation, rejection and reconciliation mechanisms built into the ETL processes
- Processes that need to be defined in the ETL: data integration rules, data standardization, data rejection, data reconciliation

Data Steward Participation during Functional Testing
The Data Steward, along with the test team, needs to be involved in the following data quality aspects of functional testing of the ETL and reporting applications:
- Ensuring that the sample data used for testing represents all kinds of irregularities and peculiarities of the source data
- Ensuring that the ETL is able to identify, handle and notify all types of defined data issues
- Ensuring that the reports and queries used for testing cover the required data samples
The activities carried out by a Data Steward are described in the Appendix.

Data Quality Monitoring
The recommended data quality monitoring program checks on data purity levels and involves:
- A data quality scorecard to measure the purity levels of the data warehouse, identify issues proactively and plan projects to address them
- Proactive analysis of the quality of source system data to identify new corruption issues and modification of the ETL to handle them
- Periodic assessment of the ETL error notification process to the source systems and how effective it is at getting the issues resolved

Appendix

Data Quality Audit

The purpose of auditing is to understand the degree to which data quality problems exist, i.e. the extent and severity of data defects. Audit procedures examine content, structure, completeness and other factors to detect and report violations of integrity and correctness rules. Data auditing is a process for identifying errors and checking the health of the overall system with regard to the quality of its data. As the amount of data and the number of processes escalate over time, the amount of inaccurate data also increases and data quality declines. Audit reports can be created to measure progress in achieving data quality goals and complying with service level agreements.

It is very important to understand the error-prone areas in the BI solution, and a data quality scorecard can show the overall quality status. A template for the Audit Summary is shown below. The audit summary shows the number of occurrences of each type of error, as per the error stack, at the various stages in the system: the type of error is shown on one axis, and the stage where the error occurred on the other. This gives an overall picture of error occurrence; for example, 500 data entry errors occurred in the source system. It gives a fair idea of the areas where data quality is poor or strong.
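As an illustration, the audit summary matrix can be produced from an error log that records the error type and the stage at which each error was detected. The sketch below assumes a hypothetical error_log.csv with columns error_type and stage; the figures in the sample output are illustrative only (apart from the 500 data entry errors cited above).

import pandas as pd

# Hypothetical error log captured by the validation and ETL routines;
# columns: error_type, stage, record_key
errors = pd.read_csv("error_log.csv")

# Audit summary: occurrences of each error type at each stage of the flow
audit_summary = pd.crosstab(errors["error_type"], errors["stage"])
print(audit_summary)

# Illustrative output:
#                        source_system  staging  warehouse
# data_entry_error                 500       12          0
# referential_integrity              0       37          4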

Data Steward

The data steward is responsible for driving organizational agreement on data definitions, business rules and domain values for the data warehouse data, and for publishing and reinforcing these definitions and rules. The data steward is responsible for the following:

- Acting as the primary guardian of data while it is being created or maintained
- Creating standards and procedures to ensure that policies and business rules are known and followed
- Enforcing adherence to the policies and business rules that govern the data while the data is in their custody
- Periodically monitoring (auditing) the quality of the data in their custody

Creating a data governance body, together with processes and tools for managing data quality, will help to establish a robust framework for development.

Figure 7: Data Governance Framework

Data Quality and Data Profiling Tools

Some of the leading data quality vendors/products are provided below.

Sanjay Seth

Sanjay Seth, a Senior Architect with the Business Intelligence Practice of a leading IT consulting firm, has 8 years of extensive experience in the data warehousing/business intelligence space.