DATA CONSISTENCY, COMPLETENESS AND CLEANING By B.K. Tyagi and P. Philip Samuel, CRME, Madurai
DATA QUALITY (DATA CONSISTENCY, COMPLETENESS)
High-quality data must pass a set of quality criteria. Those include:
Accuracy: An aggregated value over the criteria of integrity, consistency and density
Integrity: An aggregated value over the criteria of completeness and validity
Completeness: Achieved by correcting data containing anomalies
Validity: Approximated by the amount of data satisfying integrity constraints
Consistency: Concerns contradictions and syntactical anomalies
Uniformity: Directly related to irregularities and compliance with the set 'unit of measure'
Density: The quotient of missing values in the data to the total number of values that ought to be known
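The density criterion above can be computed directly. A minimal sketch in Python, where the patient records and field names are invented for illustration:

```python
def density(records, fields):
    """Density per the definition above: missing values divided by
    the total number of values that ought to be known."""
    total = len(records) * len(fields)
    missing = sum(1 for r in records for f in fields
                  if r.get(f) in (None, ""))
    return missing / total

# Invented example records with two cells missing
patients = [
    {"id": 1, "gender": "F", "age": 34},
    {"id": 2, "gender": "",  "age": 51},
    {"id": 3, "gender": "M", "age": None},
]
print(density(patients, ["gender", "age"]))  # 2 missing of 6 -> 0.333...
```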
DATA CLEANSING
Data auditing: The data is audited with statistical methods to detect anomalies and contradictions. This gives an indication of the characteristics of the anomalies and their locations.
Workflow specification: The detection and removal of anomalies is performed by a sequence of operations on the data known as the workflow. It is specified after the data has been audited and is crucial in achieving the end product of high-quality data. To achieve a proper workflow, the causes of the anomalies and errors in the data have to be closely considered.
Workflow execution: The workflow is executed once its specification is complete and its correctness has been verified. The implementation of the workflow should be efficient, even on large sets of data, which inevitably poses a trade-off because the execution of a data-cleansing operation can be computationally expensive.
Post-processing and controlling: After the cleansing workflow has been executed, the results are inspected to verify correctness. Data that could not be corrected during execution of the workflow is manually corrected where possible. The result is a new cycle in the data-cleansing process, where the data is audited again.
DATA QUALITY
Data quality is not linear; it has many dimensions, such as Accuracy, Completeness, Consistency, Timeliness and Auditability. Having data quality on only one dimension is as good as no quality. No data quality dimension is complete by itself, and the dimensions often overlap.
DATA ACCURACY
Accuracy means recorded values match reality:
The address of a customer in the customer database is the real address.
The temperature recorded by the thermometer is the real temperature.
The bank balance in the customer's account is the real value the customer is owed by the bank.
DATA COMPLETENESS
Data completeness is defined as 'expected completeness'. Data may be unavailable yet still be considered complete if it meets the expectations of the user. Every data requirement has 'mandatory' and 'optional' aspects. For example, a customer's mailing address is mandatory and must be available, but because the customer's office address is optional, it is acceptable if it is unavailable.
DATA CONSISTENCY
Consistency of data means that data across the enterprise should be in sync. Examples of data inconsistency:
An agent is inactive, but his disbursement account is still active.
A credit card is cancelled and inactive, but the card billing status shows 'due'.
Data can be accurate (i.e., it represents what happened in the real world) but still inconsistent: an airline promotion campaign's closure date is Jan 31, yet a passenger ticket is booked under the campaign on Feb 2.
Data is inconsistent when it is in sync within the narrow domain of one system but not across the organization. For example, the collection management system shows the cheque status as 'cleared', but in the accounting system the money is not shown as credited to the bank account. The reason for this kind of inconsistency is that system interfaces are synchronized only during end-of-day batch runs.
Data can be complete but inconsistent: data for all the packets dispatched from NEW DELHI to CHENNAI are available, but some of the packages are also shown in 'under bar-coding' status.
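Checks like the cancelled-card example can be automated as simple cross-field rules. A sketch in Python, with invented card records and field names:

```python
# Invented records: a cancelled card should not still carry a 'due' billing status
cards = [
    {"card_id": "A1", "status": "active",    "billing": "due"},
    {"card_id": "B2", "status": "cancelled", "billing": "due"},      # inconsistent
    {"card_id": "C3", "status": "cancelled", "billing": "settled"},
]

inconsistent = [c["card_id"] for c in cards
                if c["status"] == "cancelled" and c["billing"] == "due"]
print(inconsistent)  # ['B2']
```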
DATA TIMELINESS
'Data delayed' is 'data denied'. The timeliness of data is extremely important. This is reflected in:
Companies being required to publish their quarterly results within a given time frame.
Customer service providing up-to-date information to customers.
Credit systems checking on credit card account activity.
Timeliness depends on user expectation: online availability of data may be required for a room allocation system in hospitality, but overnight data is fine for a billing system.
DATA AUDITABILITY
Data auditability means that any transaction, report, accounting entry, bank statement, etc. can be tracked back to its originating transaction. This requires a common identifier that stays with a transaction as it undergoes transformation, aggregation and reporting.
DATA CLEANSING
Data cleansing, data cleaning, or data scrubbing is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database. Used mainly in databases, the term refers to identifying incomplete, incorrect, inaccurate, or irrelevant parts of the data and then replacing, modifying, or deleting this dirty data. After cleansing, a data set will be consistent with other similar data sets in the system. The inconsistencies detected or removed may have been originally caused by user entry errors, by corruption in transmission or storage, or by different data dictionary definitions of similar entities in different stores. Data cleansing differs from data validation in that validation almost invariably means data is rejected from the system at entry time, rather than being processed in batches. The actual process of data cleansing may involve removing typographical errors or validating and correcting values against a known list of entities. The validation may be strict (such as rejecting any address that does not have a valid postal code) or fuzzy (such as correcting records that partially match existing, known records).
Data Cleaning is the First Step in Data Processing
Data cleaning is the process of detecting and correcting (or removing) incomplete, incorrect, inaccurate and irrelevant parts of a dataset by replacing, modifying or deleting the bad data. It is the first and most important step in any data processing. It aims to provide access to reliable data and so avoid false and misdirected conclusions.
Data Descriptive Document
A document should be developed alongside the raw data containing the following information for each variable:
Variable name
Variable type
Variable description
Variable values
Missing values
Using Excel for Character Data
Select the variable of interest, for example gender.
From the main toolbar go to Data, then select Filter and then AutoFilter.
Click on the AutoFilter arrows; a box will show all the available values of the variable.
Check the variable values against the data description document to determine the valid values.
Use the AutoFilter to select the questionable values.
Excel can give you the case ID of each questionable value. Using the case ID, check and correct the questionable value by going back to the medical record.
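The same filtering can be scripted outside Excel. A sketch in Python, where the case IDs, records and the set of valid codes are invented for illustration:

```python
# Valid gender codes taken from the data description document (assumed here)
VALID = {"F", "M"}

# Invented case records
rows = [
    {"case_id": 101, "gender": "F"},
    {"case_id": 102, "gender": "2"},
    {"case_id": 103, "gender": "M"},
    {"case_id": 104, "gender": "X"},
]

# Keep the case ID alongside each questionable value so the
# medical record can be looked up, just as with Excel's AutoFilter
questionable = [(r["case_id"], r["gender"])
                for r in rows if r["gender"] not in VALID]
print(questionable)  # [(102, '2'), (104, 'X')]
```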
Another Approach: Using Frequencies
Checking for Invalid Character Values (1)
Run frequencies on all character variables that represent a limited number of categories, such as gender, residence, hospital's department, occupation, etc.
GENDER   Frequency
2        1
F        300
M        440
X        1
f        3
Missing  5
Checking for Invalid Character Values (2)
Three categories do not fit the valid data values:
GENDER   Frequency
2        1
F        300
M        440
X        1
f        3
Missing  5
Checking for Invalid Character Values (3)
'2' and 'X' are inappropriate values. 'f', depending on the situation, could be considered an error or not.
GENDER   Frequency
2        1    Occurs once
F        300
M        440
X        1    Occurs once
f        3
Missing  5
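This frequency check can be reproduced in Python with a counter. The gender column below is reconstructed from the table above; the set of valid codes is an assumption:

```python
from collections import Counter

# Reconstruct the gender column from the frequency table above
genders = ["2"] * 1 + ["F"] * 300 + ["M"] * 440 + ["X"] * 1 + ["f"] * 3 + [None] * 5

freq = Counter(genders)
valid = {"F", "M"}  # assumed valid codes from the data description document

# Non-missing categories that fall outside the valid set
suspect = {v: n for v, n in freq.items() if v is not None and v not in valid}
print(suspect)        # {'2': 1, 'X': 1, 'f': 3}
print(freq[None])     # 5 missing values
```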
Correcting Invalid Character Values
If the lower-case values were entered into the file by mistake but the value, aside from the case, was correct, we consider the value correct and change each of these lower-case values to upper case.
For the '2' and 'X' values, we need to identify the location of these errors and correct them after checking the medical records.
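Both rules can be combined in one correction pass: upper-case merely miscased values, and route everything else to manual review. A sketch, with the valid codes and status labels being assumptions:

```python
VALID = {"F", "M"}  # assumed valid codes

def clean_gender(raw):
    """Upper-case a miscased but otherwise valid value;
    flag anything else for manual review against the medical record."""
    if raw is None:
        return None, "missing"
    v = raw.upper()
    if v in VALID:
        return v, "ok" if raw == v else "case-fixed"
    return raw, "review"  # e.g. '2' or 'X': check the medical record

print(clean_gender("f"))  # ('F', 'case-fixed')
print(clean_gender("M"))  # ('M', 'ok')
print(clean_gender("2"))  # ('2', 'review')
```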
Checking Missing Data
Check each of the cases with missing data (here on gender).
See whether there is information in the case that allows that variable to be entered (e.g. the patient's name will generally indicate gender).
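One way to use the name field is a courtesy-title lookup. This is an invented heuristic, not part of the original workflow; the title-to-gender mapping and example names are assumptions, and any case it cannot resolve should still go back to the medical record:

```python
# Invented heuristic: a courtesy title in the name field often indicates gender
TITLE_TO_GENDER = {"Mr.": "M", "Mrs.": "F", "Ms.": "F", "Miss": "F"}

def infer_gender_from_name(name):
    """Return an inferred gender code, or None when no usable title is present."""
    title = name.split()[0]
    return TITLE_TO_GENDER.get(title)

print(infer_gender_from_name("Mrs. Meena Kumar"))  # F
print(infer_gender_from_name("A. Kumar"))          # None -> check the record
```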
Checking for Invalid Numeric Values
The techniques for checking invalid numeric data are quite different from those used with character data:
Examine minimum and maximum values for each numeric variable.
Use internal consistency methods: if most of the data values fall within a certain range, then any values that fall far enough outside that range may be data errors.
Run a univariate analysis, focusing especially on:
The number of non-missing observations, the number of observations not equal to zero and the number of observations greater than zero, which are of most interest at this stage.
Extremes: the five lowest and five highest values of each numeric variable.
Quantiles.
Mean.
Standard deviation, to decide what constitute reasonable cutoffs for low and high data values.
Range.
Graphic displays: a stem-and-leaf plot, a box plot and a normal probability plot.
Check the medical records for the extreme values and write a note to the data center about the findings to help in further cleaning of these data.
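The range and quantile checks above can be sketched with the standard library. The age values are invented, and the 1.5 × IQR fence is one common choice of cutoff, not the only one:

```python
import statistics

# Invented ages; 260 is presumably an entry error
ages = [23, 31, 28, 45, 39, 260, 34, 52, 0, 41]

print(min(ages), max(ages))  # 0 260

# Quartiles and an interquartile-range fence for "far enough outside the range"
q1, q2, q3 = statistics.quantiles(ages, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

suspects = [a for a in ages if a < low or a > high]
print(suspects)  # [260] -> check this case's medical record
```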
Dates: Hospitalization (1)
We can create a variable by subtracting the date of admission from the date of discharge, and call it total hospitalization 1.
This variable will detect any wrong data entry for dates, such as case number 6014.
Dates: Hospitalization (2)
We can create a variable by adding the days the patient spent in ICU, ward and private room, and call it total hospitalization 2.
Dates: Hospitalization (3)
To check inconsistency we can create a variable, let's call it difference, by subtracting total hospitalization 2 (the sum of days spent in ICU, ward and private room) from total hospitalization 1 (the difference between the dates of discharge and admission).
We need to check any value other than zero by using the AutoFilter command and recheck the medical records.
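The three steps above can be sketched together. The case records and dates are invented; case 6014 stands in for the entry error mentioned earlier:

```python
from datetime import date

# Invented case records; case 6014 has discharge before admission (entry error)
cases = [
    {"case": 6001, "admit": date(2023, 1, 5),  "discharge": date(2023, 1, 12),
     "icu": 2, "ward": 4, "private": 1},
    {"case": 6014, "admit": date(2023, 2, 10), "discharge": date(2023, 2, 1),
     "icu": 3, "ward": 5, "private": 1},
]

for c in cases:
    hosp1 = (c["discharge"] - c["admit"]).days       # total hospitalization 1
    hosp2 = c["icu"] + c["ward"] + c["private"]      # total hospitalization 2
    c["difference"] = hosp1 - hosp2                  # should be zero

flagged = [c["case"] for c in cases if c["difference"] != 0]
print(flagged)  # [6014] -> recheck the medical record
```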