DATA CONSISTENCY, COMPLETENESS AND CLEANING. By B.K. Tyagi and P. Philip Samuel, CRME, Madurai




DATA QUALITY (DATA CONSISTENCY, COMPLETENESS)

High-quality data needs to pass a set of quality criteria. Those include:
- Accuracy: an aggregated value over the criteria of integrity, consistency and density
- Integrity: an aggregated value over the criteria of completeness and validity
- Completeness: achieved by correcting data containing anomalies
- Validity: approximated by the amount of data satisfying integrity constraints
- Consistency: concerns contradictions and syntactical anomalies
- Uniformity: directly related to irregularities and compliance with the set 'unit of measure'
- Density: the quotient of missing values in the data over the number of total values that ought to be known
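As a small illustration, the density criterion above, read literally as the share of missing values over the total values that ought to be known, can be computed with a short script. The records and field names here are made up for illustration.

```python
# Sketch: density read as missing values over total values ought to be known.
# Records and field names are hypothetical.

records = [
    {"name": "Asha",  "city": "Madurai", "age": 34},
    {"name": "Ravi",  "city": None,      "age": None},
    {"name": "Meena", "city": "Chennai", "age": 41},
]

def missing_share(records, field):
    """Quotient of missing values over the total number of values for a field."""
    missing = sum(1 for r in records if r.get(field) is None)
    return missing / len(records)

for field in ("name", "city", "age"):
    print(field, round(missing_share(records, field), 2))
```

A field with no missing values scores 0.0; here 'city' and 'age' each have one missing value out of three.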

DATA CLEANSING

Data auditing: The data is audited with the use of statistical methods to detect anomalies and contradictions. This eventually gives an indication of the characteristics of the anomalies and their locations.

Workflow specification: The detection and removal of anomalies is performed by a sequence of operations on the data known as the workflow. It is specified after the process of auditing the data and is crucial in achieving the end product of high-quality data. In order to achieve a proper workflow, the causes of the anomalies and errors in the data have to be closely considered.

Workflow execution: In this stage, the workflow is executed after its specification is complete and its correctness is verified. The implementation of the workflow should be efficient, even on large sets of data, which inevitably poses a trade-off because the execution of a data-cleansing operation can be computationally expensive.

Post-processing and controlling: After executing the cleansing workflow, the results are inspected to verify correctness. Data that could not be corrected during execution of the workflow is manually corrected, if possible. The result is a new cycle in the data-cleansing process where the data is audited again.
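The audit-specify-execute-control cycle above can be sketched as a minimal loop. The audit rule (a non-positive or missing age is an anomaly) and the correction (drop the record) are hypothetical stand-ins for whatever a real workflow would specify.

```python
# Minimal sketch of the cleansing cycle: audit, execute the specified
# correction, then re-audit until no anomalies remain.
# The anomaly rule and correction are illustrative only.

def audit(records):
    """Auditing: flag indices of records with missing or non-positive age."""
    return [i for i, r in enumerate(records) if r["age"] is None or r["age"] <= 0]

def clean(records, anomalies):
    """Workflow execution: apply the specified correction (drop flagged rows)."""
    return [r for i, r in enumerate(records) if i not in anomalies]

records = [{"age": 30}, {"age": -1}, {"age": 45}, {"age": None}]

# Post-processing and controlling: re-audit after each pass.
while (anomalies := audit(records)):
    records = clean(records, anomalies)

print(records)  # only records with valid ages survive
```

In a real workflow the correction step would often repair rather than drop records, and uncorrectable cases would go to manual review.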

DATA QUALITY

Data quality is not linear; it has many dimensions, such as accuracy, completeness, consistency, timeliness and auditability. Quality on one dimension alone is as good as no quality at all. No single data quality dimension is complete by itself, and the dimensions often overlap.

DATA ACCURACY

Accuracy means the recorded value matches the real-world value:
- The address of a customer in the customer database is the customer's real address.
- The temperature recorded by the thermometer is the real temperature.
- The bank balance in a customer's account is the real value the customer is owed by the bank.

DATA COMPLETENESS

Data completeness is defined as 'expected completeness'. It is possible for data to be unavailable yet still be considered complete, as long as it meets the expectations of the user. Every data requirement has 'mandatory' and 'optional' aspects. For example, a customer's mailing address is mandatory and must be available; because the customer's office address is optional, it is acceptable for it to be missing.
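The mandatory/optional distinction above lends itself to a simple check: a record is complete if every mandatory field is present, regardless of optional fields. The field names and sample record are hypothetical.

```python
# Sketch: "expected completeness" -- complete means all mandatory fields
# are present; optional fields may be missing. Field names are illustrative.

MANDATORY = {"mailing_address"}
OPTIONAL = {"office_address"}

def is_complete(record):
    """True if every mandatory field holds a non-empty value."""
    return all(record.get(f) not in (None, "") for f in MANDATORY)

customer = {"mailing_address": "12 West Masi St, Madurai", "office_address": None}
print(is_complete(customer))  # True: the optional office address may be absent
```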

DATA CONSISTENCY

Consistency of data means that data across the enterprise should be in sync. Examples of data inconsistency:
- An agent is inactive, but his disbursement account is still active.
- A credit card is cancelled and inactive, but the card billing status shows 'due'.

Data can be accurate (i.e., it represents what happened in the real world) but still inconsistent. For example, an airline promotion campaign's closure date is Jan 31, yet a passenger ticket is booked under the campaign on Feb 2.

Data is also inconsistent when it is in sync within the narrow domain of one system but not across the organization. For example, the collection management system shows a cheque status of 'cleared', but in the accounting system the money is not shown as credited to the bank account. The reason for this kind of inconsistency is that system interfaces are synchronized only during the end-of-day batch runs.

Data can be complete but inconsistent: data for all the packets dispatched from NEW DELHI to CHENNAI are available, but some of the packages are still shown in 'under bar-coding' status.
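The airline example above is a cross-field consistency rule: a booking made under a campaign must not be dated after the campaign closes. This sketch encodes that rule; the dates, ticket IDs and field names are invented.

```python
# Sketch of a consistency check based on the airline promotion example:
# flag bookings dated after the campaign closure date. Data are made up.

from datetime import date

campaign_close = date(2015, 1, 31)

bookings = [
    {"ticket": "T-001", "booked": date(2015, 1, 20)},
    {"ticket": "T-002", "booked": date(2015, 2, 2)},   # booked after closure
]

inconsistent = [b["ticket"] for b in bookings if b["booked"] > campaign_close]
print(inconsistent)  # tickets that violate the campaign rule
```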

DATA TIMELINESS

'Data delayed' is 'data denied'. The timeliness of data is extremely important. This is reflected in:
- Companies being required to publish their quarterly results within a given frame of time.
- Customer service providing up-to-date information to customers.
- Credit systems checking on credit card account activity.

Timeliness depends on user expectation: online availability of data may be required for a room-allocation system in hospitality, while overnight data is fine for a billing system.

DATA AUDITABILITY

Data auditability means that any transaction, report, accounting entry, bank statement, etc. can be tracked to its originating transaction. This requires a common identifier that stays with a transaction as it undergoes transformation, aggregation and reporting.

DATA CLEANSING

Data cleansing, data cleaning, or data scrubbing is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database. Used mainly in databases, the term refers to identifying incomplete, incorrect, inaccurate, or irrelevant parts of the data and then replacing, modifying, or deleting this dirty data. After cleansing, a data set will be consistent with other similar data sets in the system. The inconsistencies detected or removed may have been originally caused by user entry errors, by corruption in transmission or storage, or by different data dictionary definitions of similar entities in different stores. Data cleansing differs from data validation in that validation almost invariably means data is rejected from the system at entry time, rather than being processed in batches. The actual process of data cleansing may involve removing typographical errors or validating and correcting values against a known list of entities. The validation may be strict (such as rejecting any address that does not have a valid postal code) or fuzzy (such as correcting records that partially match existing, known records).

Data Cleaning is the First Step in Data Processing

Data cleaning is the process of detecting and correcting (or removing) incomplete, incorrect, inaccurate and irrelevant parts of a dataset by replacing, modifying or deleting the bad data. It is the first and most important step in any data processing. It aims to provide access to reliable data, so as to avoid false and misdirected conclusions.

Data Descriptive Document

A document should be developed alongside the raw data containing the following information:
- Variable name
- Variable type
- Variable description
- Variable value
- Missing values
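A data descriptive document like the one above can also be kept machine-readable, so validity checks can be scripted against it. The entry below (a gender variable with valid codes F and M) is illustrative.

```python
# Sketch: the data descriptive document as a dictionary, mirroring the
# items listed above. The gender entry and its codes are illustrative.

data_dictionary = {
    "gender": {
        "type": "character",
        "description": "Patient gender",
        "valid_values": {"F", "M"},
        "missing_code": "",
    },
}

def is_valid(variable, value):
    """Check a value against the documented valid values for a variable."""
    return value in data_dictionary[variable]["valid_values"]

print(is_valid("gender", "X"))  # False: not a documented value
```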

Using Excel for Character Data

- Select the variable of interest, for example gender.
- From the main toolbar go to Data; from there select Filter and then AutoFilter.
- Click on the auto-filter arrows and a box will show all the available values of the variable.
- Check the variable values against the data description document to determine the valid values.
- Use the auto-filter to select the questionable values.
- Excel can give you the case ID of each questionable value. Refer to the case ID, then check and correct the questionable value by going back to the medical record.
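The same filter-by-value check can be scripted. This pandas sketch mirrors the Excel steps: list the values of gender, keep only the ones not in the data description document, and report their case IDs. The data frame is invented for illustration.

```python
# Sketch: the auto-filter check done in pandas -- pull case IDs whose
# gender code is not among the documented valid values. Data are made up.

import pandas as pd

df = pd.DataFrame({
    "case_id": [101, 102, 103, 104, 105],
    "gender":  ["F", "M", "X", "f", "M"],
})

valid = {"F", "M"}
questionable = df[~df["gender"].isin(valid)]
print(questionable["case_id"].tolist())  # case IDs to check in the records
```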

Another Approach: Using Frequencies

Checking for Invalid Character Values (1)

Run frequencies on all character variables that represent a limited number of categories, such as gender, residence, hospital department, occupation, etc.

GENDER          Frequency
2               1
F               300
M               440
X               1
f               3
Missing values  5

Checking for Invalid Character Values (2)

Three categories do not fit our valid data values:

GENDER          Frequency
2               1
F               300
M               440
X               1
f               3
Missing values  5

Checking for Invalid Character Values (3)

The 2 and the X are inappropriate values. Depending on the situation, f could be considered an error or not.

GENDER          Frequency
2               1    (occurs once)
F               300
M               440
X               1    (occurs once)
f               3
Missing values  5

Correcting Invalid Character Values

- If the lower-case values were entered into the file by mistake but the value, aside from the case, was correct, we consider the value correct and change each of these lower-case values to upper case.
- For the 2 and X values, we need to identify the location of these errors and correct them after checking the medical records.
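The two correction rules above can be sketched in a few lines: upper-case any code that is valid apart from its case, and flag the positions of the remaining undocumented codes for manual review. The value list is invented.

```python
# Sketch of the correction rules: fix case errors in place, flag
# undocumented codes ("2", "X") for checking against the medical records.

values = ["F", "M", "f", "2", "X", "M"]
valid = {"F", "M"}

corrected, to_review = [], []
for i, v in enumerate(values):
    if v.upper() in valid:
        corrected.append(v.upper())   # case error: correct in place
    else:
        corrected.append(v)           # leave as-is ...
        to_review.append(i)           # ... and record position for review

print(corrected, to_review)
```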

Checking Missing Data

- Check each of the cases with missing data (here, on gender).
- See whether there is information in the case that allows the variable to be entered (e.g. the patient's name will generally indicate gender).

Checking for Invalid Numeric Values

The techniques for checking invalid numeric data are quite different from the techniques used with character data:
- Examine minimum and maximum values for each numeric variable.
- Use internal consistency methods: if most of the data values fall within a certain range, then any values that fall far enough outside the range may be data errors.
- Run a univariate analysis, focusing especially on:
  - the number of non-missing observations, the number of observations not equal to zero, and the number of observations greater than zero (of most interest at this stage);
  - extremes: the five lowest and five highest values for each numeric variable;
  - quantiles;
  - mean;
  - standard deviation, to decide what constitutes reasonable cutoffs for low and high data values;
  - range;
  - graphic displays: a stem-and-leaf plot, a box plot and a normal probability plot.
- Check the medical records for the extreme values and write a note to the data center about the findings to help in further cleaning of these data.
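The numeric screening steps above can be scripted in pandas: summary statistics, the five lowest and highest values, and a mean-and-standard-deviation cutoff. The length-of-stay figures and the two-standard-deviation cutoff are invented for illustration; a real cutoff would be chosen from the data.

```python
# Sketch of numeric screening: describe(), extremes, and a cutoff based
# on the mean and standard deviation. The figures are invented.

import pandas as pd

los = pd.Series([3, 5, 4, 6, 2, 4, 5, 90, 3, 4], name="length_of_stay")

print(los.describe())             # count, mean, std, min, quartiles, max
print(los.nsmallest(5).tolist())  # five lowest values
print(los.nlargest(5).tolist())   # five highest values: 90 stands out

# Flag values beyond mean +/- 2 standard deviations as possible errors
lo = los.mean() - 2 * los.std()
hi = los.mean() + 2 * los.std()
suspect = los[(los < lo) | (los > hi)]
print(suspect.tolist())           # extreme values to check in the records
```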

Dates: Hospitalization (1)

We can create a variable by subtracting the date of admission from the date of discharge, and call it total hospitalization 1. This variable will detect any wrong data entry for dates, such as case number 6014.

Dates: Hospitalization (2)

We can create a variable by adding the days the patient spent in the ICU, ward and private room, and call it total hospitalization 2.

Dates: Hospitalization (3)

To check for inconsistency we can create a variable, let's call it 'difference', by subtracting total hospitalization 2 (created by summing the days spent in the ICU, ward and private room) from total hospitalization 1 (created from the dates of admission and discharge). We need to check any value other than zero by using the auto-filter command and rechecking the medical records.
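The three hospitalization checks above can be combined in one pass: compute the stay length from the dates, compute it again from the ward totals, and flag any case where the difference is non-zero. The case records below are invented; case 2 deliberately has ward days that exceed its date span.

```python
# Sketch of the hospitalization consistency check: total hospitalization 1
# (from dates) vs total hospitalization 2 (from ward-day totals).
# Case data are invented; a non-zero difference flags a record to recheck.

from datetime import date

cases = [
    {"id": 1, "admit": date(2015, 3, 1), "discharge": date(2015, 3, 8),
     "icu": 2, "ward": 4, "private": 1},
    {"id": 2, "admit": date(2015, 3, 5), "discharge": date(2015, 3, 9),
     "icu": 1, "ward": 5, "private": 0},   # ward days exceed the date span
]

to_recheck = []
for c in cases:
    total1 = (c["discharge"] - c["admit"]).days    # from dates
    total2 = c["icu"] + c["ward"] + c["private"]   # from ward-day totals
    if total1 - total2 != 0:
        to_recheck.append(c["id"])

print(to_recheck)  # case IDs whose two totals disagree
```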