Measuring Data Quality for Ongoing Improvement
A Data Quality Assessment Framework
Laura Sebastian-Coleman
ELSEVIER
AMSTERDAM BOSTON HEIDELBERG LONDON NEW YORK OXFORD PARIS SAN DIEGO SAN FRANCISCO SINGAPORE SYDNEY TOKYO
Morgan Kaufmann is an imprint of Elsevier
Contents
Acknowledgments xxiii
Foreword xxv
Author Biography xxvii
Introduction: Measuring Data Quality for Ongoing Improvement xxix
SECTION 1 CONCEPTS AND DEFINITIONS
CHAPTER 1 Data 3
  Purpose 3
  Data 3
  Data as Representation 4
  The Implications of Data's Semiotic Function 6
  Semiotics and Data Quality 8
  Data as Facts 11
  Data as a Product 11
  Data as Input to Analyses 12
  Data and Expectations 13
  Information 14
  Concluding Thoughts 15
CHAPTER 2 Data, People, and Systems 17
  Purpose 17
  Enterprise or Organization 17
  IT and the Business 18
  Data Producers 19
  Data Consumers 19
  Data Brokers 20
  Data Stewards and Data Stewardship 20
  Data Owners 21
  Data Ownership and Data Governance 21
  IT, the Business, and Data Owners, Redux 22
  Data Quality Program Team 23
  Stakeholder 24
  Systems and System Design 24
  Concluding Thoughts 25
CHAPTER 3 Data Management, Models, and Metadata 27
  Purpose 27
  Data Management 27
  Database, Data Warehouse, Data Asset, Dataset 28
  Source System, Target System, System of Record 29
  Data Models 30
  Types of Data Models 31
  Physical Characteristics of Data 32
  Metadata 32
  Metadata as Explicit Knowledge 35
  Data Chain and Information Life Cycle 36
  Data Lineage and Data Provenance 37
  Concluding Thoughts 37
CHAPTER 4 Data Quality and Measurement 39
  Purpose 39
  Data Quality 39
  Data Quality Dimensions 40
  Measurement 41
  Measurement as Data 42
  Data Quality Measurement and the Business/IT Divide 43
  Characteristics of Effective Measurements 44
  Measurements must be Comprehensible and Interpretable 44
  Measurements must be Reproducible 45
  Measurements must be Purposeful 46
  Data Quality Assessment 46
  Data Quality Dimensions, DQAF Measurement Types, Specific Data Quality Metrics 47
  Data Profiling 49
  Data Quality Issues and Data Issue Management 49
  Reasonability Checks 49
  Data Quality Thresholds 50
  Process Controls 52
  In-line Data Quality Measurement and Monitoring 53
  Concluding Thoughts 53
SECTION 2 DQAF CONCEPTS AND MEASUREMENT TYPES
CHAPTER 5 DQAF Concepts 57
  Purpose 57
  The Problem the DQAF Addresses 57
  Data Quality Expectations and Data Management 58
  The Scope of the DQAF 60
  DQAF Quality Dimensions 61
  Completeness 62
  Timeliness 62
  Validity 62
  Consistency 63
  Integrity 63
  The Question of Accuracy 63
  Defining DQAF Measurement Types 64
  Metadata Requirements 64
  Objects of Measurement and Assessment Categories 65
  Functions in Measurement: Collect, Calculate, Compare 66
  Concluding Thoughts 69
CHAPTER 6 DQAF Measurement Types 71
  Purpose 71
  Consistency of the Data Model 71
  Ensuring the Correct Receipt of Data for Processing 72
  Inspecting the Condition of Data upon Receipt 72
  Assessing the Results of Data Processing 74
  Assessing the Validity of Data Content 74
  Assessing the Consistency of Data Content 76
  Comments on the Placement of In-line Measurements 78
  Periodic Measurement of Cross-table Content Integrity 81
  Assessing Overall Database Content 82
  Assessing Controls and Measurements 83
  The Measurement Types: Consolidated Listing 83
  Concluding Thoughts 91
SECTION 3 DATA ASSESSMENT SCENARIOS
CHAPTER 7 Initial Data Assessment 97
  Purpose 97
  Initial Assessment 97
  Input to Initial Assessments 97
  Data Expectations 98
  Data Profiling 100
  Column Property Profiling 100
  High-Frequency Values 101
  Low-Frequency Values 101
  Date Data 101
  Observations about Column Population 102
  Structure Profiling 105
  Incomplete Referential Relationships 105
  Missing or Unexpected Dependency Relationships and Rules 107
  Differences in Granularity and Precision 108
  Relationship Cardinality Different from Expectations 109
  Profiling an Existing Data Asset 109
  From Profiling to Assessment 110
  Deliverables from Initial Assessment 110
  Concluding Thoughts 111
CHAPTER 8 Assessment in Data Quality Improvement Projects 113
  Purpose 113
  Data Quality Improvement Efforts 113
  Measurement in Improvement Projects 114
CHAPTER 9 Ongoing Measurement 117
  Purpose 117
  The Case for Ongoing Measurement 117
  Example: Health Care Data 119
  Inputs for Ongoing Measurement 121
  Criticality and Risk 123
  Automation 123
  Controls 124
  Periodic Measurement 125
  Deliverables from Ongoing Measurement 126
  In-Line versus Periodic Measurement 126
  Concluding Thoughts 127
SECTION 4 APPLYING THE DQAF TO DATA REQUIREMENTS
CHAPTER 10 Requirements, Risk, Criticality 133
  Purpose 133
  Business Requirements 133
  Data Quality Requirements and Expected Data Characteristics 136
  Data Quality Requirements and Risks to Data 139
  Factors Influencing Data Criticality 140
  Specifying Data Quality Metrics 141
  Subscriber Birth Date Example 142
  Additional Examples 146
  Concluding Thoughts 149
CHAPTER 11 Asking Questions 151
  Purpose 151
  Asking Questions 151
  Understanding the Project 152
  Learning about Source Systems 153
  Source Goals and Source Data Consumers 154
  Source Data Processing 155
  Individual Data Attributes and Rules 155
  Your Data Consumers' Requirements 156
  The Condition of the Data 157
  The Data Model, Transformation Rules, and System Design 158
  Measurement Specification Process 158
  Concluding Thoughts 162
SECTION 5 A STRATEGIC APPROACH TO DATA QUALITY
CHAPTER 12 Data Quality Strategy 165
  Purpose 165
  The Concept of Strategy 165
  Systems Strategy, Data Strategy, and Data Quality Strategy 166
  Data Quality Strategy and Data Governance 168
  Decision Points in the Information Life Cycle 169
  General Considerations for Data Quality Strategy 170
  Concluding Thoughts 171
CHAPTER 13 Directives for Data Quality Strategy 173
  Purpose 173
  Directive 1: Obtain Management Commitment to Data Quality 176
  Assessing Management Commitment 176
  Directive 2: Treat Data as an Asset 177
  Characterizing an Organization's Data Assets 178
  Directive 3: Apply Resources to Focus on Quality 178
  Assessing Readiness to Commission a Data Quality Team 179
  Directive 4: Build Explicit Knowledge of Data 180
  Assessing the Condition of Explicit Knowledge and Knowledge Sharing 180
  Directive 5: Treat Data as a Product of Processes that can be Measured and Improved 181
  Assessing Organizational Understanding of Data as a Product 182
  Directive 6: Recognize Quality is Defined by Data Consumers 182
  Assessing How Data Consumers Define Data Quality 183
  Directive 7: Address the Root Causes of Data Problems 184
  Assessing Organizational Ability to Address Root Causes 186
  Directive 8: Measure Data Quality, Monitor Critical Data 186
  Assessing Organizational Readiness for Ongoing Measurement and Monitoring 187
  Directive 9: Hold Data Producers Accountable for the Quality of their Data (and Knowledge about that Data) 188
  Assessing Options for Accountability 188
  Directive 10: Provide Data Consumers with the Knowledge they Require for Data Use 189
  Directive 11: Data Needs and Uses will Evolve - Plan for Evolution 189
  Developing a Plan for Evolution 190
  Directive 12: Data Quality Goes beyond the Data - Build a Culture Focused on Quality 191
  Building a Culture Focused on Data Quality 191
  Concluding Thoughts: Using the Current State Assessment 192
SECTION 6 THE DQAF IN DEPTH
CHAPTER 14 Functions of Measurement: Collection, Calculation, Comparison 197
  Purpose 197
  Functions in Measurement: Collect, Calculate, Compare 197
  Collecting Raw Measurement Data 199
  Calculating Measurement Data 199
  Comparing Measurements to Past History 201
  Statistics 201
  Measures of Central Tendency 202
  Measures of Variability 202
  The Control Chart: A Primary Tool for Statistical Process Control 205
  The DQAF and Statistical Process Control 206
  Concluding Thoughts 207
CHAPTER 15 Features of the DQAF Measurement Logical Model 209
  Purpose 209
  Metric Definition and Measurement Result Tables 209
  Common Key Fields 211
  Optional Fields 212
  Denominator Fields 213
  Automated Thresholds 215
  Manual Thresholds 216
  Emergency Thresholds 216
  Manual or Emergency Thresholds and Results Tables 217
  Additional System Requirements 217
  Support Requirements 218
  Concluding Thoughts 218
CHAPTER 16 Facets of the DQAF Measurement Types 219
  Purpose 219
  Facets of the DQAF 219
  Organization of the Chapter 221
  Measurement Type #1: Dataset Completeness - Sufficiency of Metadata and Reference Data 224
    Definition 224
    Business Concerns 224
    Measurement Methodology 224
    Programming 225
    Support Processes and Skills 225
  Measurement Type #2: Consistent Formatting in One Field 225
    Definition 225
    Business Concerns 226
    Measurement Methodology 226
    Programming 226
    Support Processes and Skills 226
  Measurement Type #3: Consistent Formatting, Cross-table 227
    Definition 227
    Business Concerns 227
    Measurement Methodology 227
    Programming 227
    Support Processes and Skills 227
  Measurement Type #4: Consistent Use of Default Value in One Field 227
    Definition 227
    Business Concerns 228
    Measurement Methodology 228
    Programming 228
    Support Processes and Skills 228
  Measurement Type #5: Consistent Use of Default Values, Cross-table 228
    Definition 228
    Business Concerns 229
    Measurement Methodology 229
    Programming 229
    Support Processes and Skills 229
  Measurement Type #6: Timely Delivery of Data for Processing 229
    Definition 229
    Business Concerns 229
    Measurement Methodology 230
    Programming 230
    Support Processes and Skills 230
    Measurement Logical Data Model 231
  Measurement Type #7: Dataset Completeness - Availability for Processing 232
    Definition 232
    Business Concerns 232
    Measurement Methodology 233
    Programming 233
    Support Processes and Skills 233
  Measurement Type #8: Dataset Completeness - Record Counts to Control Records 233
    Definition 233
    Business Concerns 234
    Measurement Methodology 234
    Programming 234
    Support Processes and Skills 234
  Measurement Type #9: Dataset Completeness - Summarized Amount Field Data 234
    Definition 234
    Business Concerns 234
    Measurement Methodology 235
    Programming 235
    Support Processes and Skills 235
  Measurement Type #10: Dataset Completeness - Size Compared to Past Sizes 235
    Definition 235
    Business Concerns 235
    Measurement Methodology 236
    Programming 236
    Support Processes and Skills 236
    Measurement Logical Data Model 236
  Measurement Type #11: Record Completeness - Length 237
    Definition 237
    Business Concerns 237
    Measurement Methodology 237
    Programming 238
    Support Processes and Skills 238
  Measurement Type #12: Field Completeness - Non-Nullable Fields 238
    Definition 238
    Business Concerns 238
    Measurement Methodology 238
    Programming 238
    Support Processes and Skills 239
  Measurement Type #13: Dataset Integrity - De-Duplication 239
    Definition 239
    Business Concerns 239
    Measurement Methodology 239
    Programming 239
    Support Processes and Skills 239
  Measurement Type #14: Dataset Integrity - Duplicate Record Reasonability Check 240
    Definition 240
    Business Concerns 240
    Measurement Methodology 240
    Programming 240
    Support Processes and Skills 240
    Measurement Logical Data Model 241
  Measurement Type #15: Field Content Completeness - Defaults from Source 241
    Definition 241
    Business Concerns 242
    Measurement Methodology 242
    Programming 242
    Support Processes and Skills 243
    Measurement Logical Data Model 243
  Measurement Type #16: Dataset Completeness Based on Date Criteria 244
    Definition 244
    Business Concerns 244
    Measurement Methodology 244
    Programming 244
    Support Processes and Skills 244
  Measurement Type #17: Dataset Reasonability Based on Date Criteria 245
    Definition 245
    Business Concerns 245
    Measurement Methodology 245
    Programming 245
    Support Processes and Skills 245
    Measurement Logical Data Model 245
  Measurement Type #18: Field Content Completeness - Received Data is Missing Fields Critical to Processing 247
    Definition 247
    Business Concerns 247
    Measurement Methodology 247
    Programming 247
    Support Processes and Skills 247
  Measurement Type #19: Dataset Completeness - Balance Record Counts Through a Process 248
    Definition 248
    Business Concerns 248
    Measurement Methodology 248
    Programming 248
    Support Processes and Skills 249
  Measurement Type #20: Dataset Completeness - Reasons for Rejecting Records 249
    Definition 249
    Business Concerns 249
    Measurement Logical Data Model 249
  Measurement Type #21: Dataset Completeness Through a Process - Ratio of Input to Output 250
    Definition 250
    Business Concerns 250
    Measurement Methodology 251
    Programming 251
    Support Processes and Skills 251
    Measurement Logical Data Model 251
  Measurement Type #22: Dataset Completeness Through a Process - Balance Amount Fields 251
    Definition 251
    Business Concerns 252
    Measurement Methodology 252
    Programming 253
    Support Processes and Skills 253
  Measurement Type #23: Field Content Completeness - Ratio of Summed Amount Fields 253
    Definition 253
    Business Concerns 253
    Measurement Methodology 253
    Programming 254
    Support Processes and Skills 254
    Measurement Logical Data Model 255
  Measurement Type #24: Field Content Completeness - Defaults from Derivation 255
    Definition 255
    Business Concerns 255
    Measurement Methodology 256
    Programming 256
    Support Processes and Skills 256
    Measurement Logical Data Model 256
  Measurement Type #25: Data Processing Duration 257
    Definition 257
    Business Concerns 257
    Measurement Methodology 257
    Programming 258
    Support Processes and Skills 258
    Measurement Logical Data Model 258
  Measurement Type #26: Timely Availability of Data for Access 259
    Definition 259
    Business Concerns 259
    Measurement Methodology 260
    Programming 260
    Support Processes and Skills 260
    Measurement Logical Data Model 260
  Measurement Type #27: Validity Check, Single Field, Detailed Results 261
    Definition 261
    Business Concerns 261
    Measurement Methodology 261
    Programming 262
    Support Processes and Skills 263
    Measurement Logical Data Model 263
  Measurement Type #28: Validity Check, Roll-up 264
    Definition 264
    Business Concerns 264
    Measurement Methodology 265
    Programming 265
    Support Processes and Skills 265
    Measurement Logical Data Model 265
  Measurement Type #29: Validity Check, Multiple Columns within a Table, Detailed Results 266
    Definition 266
    Business Concerns 266
    Measurement Methodology 267
    Programming 267
    Support Processes and Skills 267
    Measurement Logical Data Model 267
  Measurement Type #30: Consistent Column Profile 267
    Definition 267
    Business Concerns 269
    Measurement Methodology 269
    Programming 269
    Support Processes and Skills 269
    Measurement Logical Data Model 270
  Measurement Type #31: Consistent Dataset Content, Distinct Count of Represented Entity, with Ratios to Record Counts 270
    Definition 270
    Business Concerns 271
    Measurement Methodology 271
    Programming 271
    Support Processes and Skills 272
    Measurement Logical Data Model 272
  Measurement Type #32: Consistent Dataset Content, Ratio of Distinct Counts of Two Represented Entities 272
    Definition 272
    Business Concerns 273
    Measurement Methodology 273
    Programming 273
    Support Processes and Skills 273
    Measurement Logical Data Model 273
  Measurement Type #33: Consistent Multicolumn Profile 274
    Definition 274
    Business Concerns 274
    Measurement Methodology 275
    Programming 275
    Support Processes and Skills 275
    Measurement Logical Data Model 276
  Measurement Type #34: Chronology Consistent with Business Rules within a Table 277
    Definition 277
    Business Concerns 278
    Other Facets 278
  Measurement Type #35: Consistent Time Elapsed (hours, days, months, etc.) 279
    Definition 279
    Business Concerns 279
    Measurement Methodology 279
    Programming 279
    Support Processes and Skills 280
    Measurement Logical Data Model 280
  Measurement Type #36: Consistent Amount Field Calculations Across Secondary Fields 281
    Definition 281
    Business Concerns 281
    Measurement Methodology 282
    Programming 282
    Support Processes and Skills 282
    Measurement Logical Data Model 283
  Measurement Type #37: Consistent Record Counts by Aggregated Date 284
    Definition 284
    Business Concerns 284
    Measurement Methodology 285
    Programming 285
    Support Processes and Skills 285
    Measurement Logical Data Model 286
  Measurement Type #38: Consistent Amount Field Data by Aggregated Date 286
    Definition 286
    Business Concerns 287
    Measurement Methodology 287
    Programming 287
    Support Processes and Skills 287
    Measurement Logical Data Model 288
  Measurement Type #39: Parent/Child Referential Integrity 288
    Definition 288
    Business Concerns 289
    Measurement Methodology 289
    Programming 289
    Support Processes and Skills 290
  Measurement Type #40: Child/Parent Referential Integrity 290
    Definition 290
    Business Concerns 290
    Measurement Methodology 290
    Programming 290
    Support Processes and Skills 290
  Measurement Type #41: Validity Check, Cross Table, Detailed Results 291
    Definition 291
    Business Concerns 291
    Measurement Methodology 291
    Programming 291
    Support Processes and Skills 291
  Measurement Type #42: Consistent Cross-table Multicolumn Profile 292
    Definition 292
    Business Concerns 292
    Measurement Methodology 292
    Programming 292
    Support Processes and Skills 293
  Measurement Type #43: Chronology Consistent with Business Rules Across-tables 293
    Definition 293
    Business Concerns 293
    Measurement Methodology and Programming 293
    Support Processes and Skills 293
  Measurement Type #44: Consistent Cross-table Amount Column Calculations 293
    Definition 293
    Business Concerns 294
    Measurement Methodology 294
    Programming 294
    Support Processes and Skills 294
  Measurement Type #45: Consistent Cross-Table Amount Columns by Aggregated Dates 294
    Definition 294
    Business Concerns 295
    Measurement Methodology 295
    Programming 295
    Support Processes and Skills 295
  Measurement Type #46: Consistency Compared to External Benchmarks 295
    Definition 295
    Business Concerns 296
    Measurement Methodology 296
    Programming 296
    Support Processes and Skills 296
  Measurement Type #47: Dataset Completeness - Overall Sufficiency for Defined Purposes 296
    Definition 296
    Business Concerns 296
    Measurement Methodology 297
    Programming 297
    Support Processes and Skills 297
  Measurement Type #48: Dataset Completeness - Overall Sufficiency of Measures and Controls 297
    Definition 297
    Business Concerns 297
    Measurement Methodology 298
    Programming 298
    Support Processes and Skills 298
  Concluding Thoughts: Know Your Data 298
Glossary 301
Bibliography 313
Index 319
Online Materials:
  Appendix A: Measuring the Value of Data doi:10.1016/B978-0-12-397033-6.00028-6 e1
  Appendix B: Data Quality Dimensions doi:10.1016/b978-0-12-397033-6.00029-8 e5
  Appendix C: Completeness, Consistency, and Integrity of the Data Model doi:10.1016/b978-0-12-397033-6.00030-4 e11
  Appendix D: Prediction, Error, and Shewhart's Lost Disciple, Kristo Ivanov doi:10.1016/b978-0-12-397033-6.00031-6 e21
  Appendix E: Quality Improvement and Data Quality doi:10.1016/B978-0-12-397033-6.00013-4 e27