Practical Considerations for Rapidly Improving Quality in Large Data Collections


By Peter Aiken, Founder, Data Blueprint (www.datablueprint.com)

Abstract: While data quality has been a subject of interest for many years, only recently has the research output begun to converge with the needs of organizations faced with these challenges. This paper addresses the fundamental issues inherent in existing approaches to improving DoD data quality (DQ). It briefly discusses our collective motivation and examines three root causes preventing more rapid DQ improvement. An examination of "newly perceived" realities in this area leads to a discussion of several considerations that will improve related efforts.

Motivation

The situation is getting worse! A recent, voluminous book on the subject has documented more than $13 billion in costs of poor-quality government information attributed directly to the Pentagon, and more than $700 billion attributed to governmental challenges generally [English 2009]. When we couple these costs with recent attempts to determine how much DQ measurement is actually occurring, the results indicate that these two numbers are probably very low. This is in spite of the fact that DoD has been objectively determined to be at the relative forefront of these types of efforts (see Figure 1).

Figure 1: Objective comparison across four major (anonymous) DoD data management programs indicates that some DoD efforts outperform average private-sector organizations, whose performance is roughly indicated by the dotted line.

Figure 2 shows the results of a 2009 survey from Information Management Magazine [Waddington 2009]. Highlights from this and other recent survey data include:

- One-third of respondents rate their data quality as poor at best, and only 4 percent as excellent.
- Forty-two percent of organizations make no attempt to measure the quality of their data.
- Only 15 percent of organizations are very confident of the data received from other organizations.

The only reasonable conclusion is that, absent a formal data quality assessment effort, all data in an organization is of unknown quality!

Figure 2: Percentage of organizations reporting various levels of data quality (bars) and percentage of organizations proactively measuring the quality of their own data (pie chart).

With the advent of truly big data challenges, the problem continues to worsen. Recent articles such as this year's special report from The Economist have helped to increase awareness of the challenges of dealing with yottabytes of data [Economist 2010].

Most organizations are still approaching data quality problems from stove-piped perspectives! It has been the classic case of the blind men and the elephant, illustrated in Figure 3. Most organizations approach data quality problems in the same way that the blind men approached the elephant: people tend to see only the data that is in front of them. Little cooperation exists across boundaries, just as the blind men were unable to convey their impressions to one another and so recognize the entire animal.

Figure 3: No universal conception of data quality exists; instead, many differing perspectives compete.

To be effective, data quality engineering must achieve a more complete picture and facilitate cross-boundary communications. Whether you believe that the solution should

come in the form of TQM, six-sigma, standards-related work, or tiger teams, it remains clear that one solution cannot satisfy all aspects of the challenge.

Root Cause Analysis

Three root causes do seem common to DQ problems.

Many DQ challenges are unique and/or context-specific! After dealing with data quality problems for more than 25 years, I hold two strong opinions. First, prevention is more cost-effective than treating the symptoms. This is Tom Redman's well-repeated story about eliminating the sources of water pollution for any given "lake" of data, as opposed to forever attempting to clear the lake of polluted data. It should be obvious that preventing data quality problems will be less expensive than correcting them forever. Second, data quality problems are more unique than similar. This prevents the resolution of these challenges from following programmatic solution-development practices, and it mandates the development of specialized data quality engineering specialists within organizations (more on this in the solutions section of this paper). Particular evidence of this second point can be seen when we examine the practices of "experienced" data migration specialists, "experienced" here meaning that those surveyed had each accomplished four or more data migrations. Collectively, this group of experienced professionals underestimated the cost of future data migration projects by a factor of 10, as shown in Figure 4 [Hudicka 2005].

Figure 4: Median projected cost versus median actual expense. Experienced IT professionals are not yet able to use past expertise to accurately forecast project costs!

Educational institutions are not addressing the challenge! Computer engineering/information systems/computer science (CEISCS) students are not being taught data quality concepts, and non-CEISCS students (such as business majors) receive virtually no exposure to data concepts at all.
With a few notable exceptions (including MIT's and UALR's data quality programs), university-level programs are not addressing data quality in CEISCS curricula. Indeed, the most prevalent data-related skill taught by these programs is how to develop new databases, probably the very least desired skill set when considering organizational legacy systems environments. At the research level there is also only a short history: it was not until 2006 that the first academic journal dedicated to data quality was created.
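Assessing the quality of existing data, as opposed to building new databases, need not be daunting. The sketch below is a minimal illustration of the kind of baseline profiling the survey results above suggest most organizations skip; all field names, records, and the SSN conformity rule are invented for illustration, not drawn from any real system.

```python
import re

# Hypothetical personnel records; None marks a missing value.
records = [
    {"emp_id": "A001", "ssn": "123-45-6789", "grade": "GS-11"},
    {"emp_id": "A002", "ssn": None,          "grade": "GS-09"},
    {"emp_id": "A003", "ssn": "123456789",   "grade": "GS-12"},  # non-conforming SSN
    {"emp_id": "A001", "ssn": "123-45-6789", "grade": "GS-11"},  # duplicate key
]

SSN_PATTERN = re.compile(r"^\d{3}-\d{2}-\d{4}$")  # assumed conformity rule

def profile(rows):
    """Return simple completeness, duplication, and conformity rates."""
    n = len(rows)
    complete = sum(1 for r in rows if all(v is not None for v in r.values()))
    unique_keys = len({r["emp_id"] for r in rows})
    conforming = sum(1 for r in rows if r["ssn"] and SSN_PATTERN.match(r["ssn"]))
    return {
        "completeness": complete / n,        # rows with no missing fields
        "duplication": 1 - unique_keys / n,  # share of rows repeating a key
        "ssn_conformity": conforming / n,    # rows matching the SSN format
    }

print(profile(records))
# {'completeness': 0.75, 'duplication': 0.25, 'ssn_conformity': 0.5}
```

Even a toy assessment like this converts "data of unknown quality" into measured rates that can be tracked over time.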

Vendors are incented not to address the challenges proactively! When contracting for a highway project (at least in the Commonwealth of Virginia), the contractor is offered a bonus for completing the project ahead of schedule, the contracted amount for finishing on time, and a penalty for finishing behind schedule. In DoD systems contracting, by contrast, vendors actually plan on cost overruns and are rewarded for achieving them. Anecdotal evidence indicates that data is the primary area where these overruns occur. I have spent considerable time as an expert witness and otherwise in litigation support. Virtually all IT upgrades, migrations, and/or consolidations involve movement of data. When new systems don't work, one party blames the problems on poor-quality data from the source system. Without a baseline assessment of the quality of the data before the movement/consolidation/transformation, it is impossible to defend against this charge. Yet data quality is typically not addressed, formally or informally, as part of IT contracts. Vendors currently are incented to "discover" data quality problems after contracts are signed, a practice that is literally indefensible, wasteful, and costly.

New Realities

Data quality is now acknowledged as a major source of organizational risk by certified risk professionals! Data quality is now widely acknowledged as a major source of corporate risk. The DoD should take note of the advent of two new C-level executives in private industry: the Chief Risk Officer (CRO) and the Chief Data Officer (CDO). The CDO is an acknowledgement that the CIO concept has been hijacked to focus on areas far beyond the original focus on corporate information as an asset. Indeed, many organizations are properly relabeling these individuals as Chief Technology Officers (CTOs) in light of their more broadly technology-focused roles, and refocusing the data assets under the control of a CDO.
From the business side, CROs are being groomed to understand how all aspects of risk play into strategic failures. These professionals understand the role that data quality plays in risk mitigation and can often be the best allies of CDOs in the business management hierarchy.

A body of knowledge has been developed! While this paper has focused on several challenges that relate to the relative immaturity of data quality engineering as a professional discipline, there is some hopeful news. In 2009, DAMA International released A Guide to the Data Management Body of Knowledge [DAMA 2009]. While it isn't as detailed as a body of knowledge (BOK) focused specifically on data quality, it does elevate the field of data management to the status enjoyed by the project management discipline (PMBOK) and software engineering (SWEBOK). Also, there is much reference material in the DMBOK that focuses specifically on data quality.

Much more analysis is required before we can implement repeatable solutions to today's data quality challenges! Just as the experienced IT professionals noted above cannot yet predict data migration costs well, those of us experienced in developing data quality engineering solutions understand that the relative newness of this discipline precludes implementation of repeatable (much less optimized) solutions. Even so, it is remarkable how fast progress has been made in this area. Consider, for example, our concept of the data life cycle. As originally proposed [Redman 1993], the data life cycle consists of three phases: data acquisition, data storage, and data use (see Figure 5).

Figure 5: Original data acquisition and use cycle [Levitan 1993]

Just five years later, we acknowledged the data life cycle as more complex (Figure 6).

Figure 6: Refined data life cycle [Finkelstein 1999]

Another relatively recent development focuses on the expansion of the canonical list of data quality attributes. An original formulation consisted of a list of terms such as completeness, conformity, consistency, accuracy, duplication, and integrity. We now know (see Figure 7) that data quality attributes extend to the data models that produce and govern production datasets, and even to organizational data architectures [GAO 2007].

Figure 7: A complete list of data quality attributes includes data model and data architecture attributes as well as data representation and data value quality attributes [Yoon 1999]

Finally, I'm reminded of events that occurred more than 15 years ago within the DoD. The Office of the Secretary of Defense (OSD) would routinely send out requests for information to the various branches and services, referred to then as "data calls." One data call might ask various organizations, "How many employees do you have?" On the surface this might seem a simple and innocent query. But as I observed the mechanics of the response patterns, they were generally of the form, "What do you mean by an employee?" As a data person, this was a reasonable clarifying question: since the 37 systems that paid civilians at the time were not designed to maintain the same information types, they did not. A careful respondent might ask this question to ensure valid comparisons could be made across the responses. After all, in those days it was somewhat common for a service member to work part time for another agency at night, or when otherwise off duty, to earn vacation money or contribute a needed source of expertise. After seeing the various response patterns repeated, I became aware that data quality is a socio-technical discipline.
The various respondents had no intention of providing the OSD with any information, and the various questions, while legitimate, were also designed to ensure that no numbers were provided back to the head office. If no numbers were provided, then OSD couldn't tell the respondents to take any action based on the numbers. So we had to incorporate some social engineering into our future data calls.
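The "how many employees do you have?" problem above is easy to reproduce mechanically: two systems answering the same data call with different, each internally defensible, definitions of "employee" return different numbers. A toy sketch with entirely invented records:

```python
# Invented payroll rows shared by two hypothetical reporting systems.
payroll = [
    {"name": "Jones", "status": "full-time", "moonlights_elsewhere": False},
    {"name": "Smith", "status": "part-time", "moonlights_elsewhere": False},
    {"name": "Lee",   "status": "full-time", "moonlights_elsewhere": True},
]

# Definition A: anyone on the payroll is an employee.
count_a = len(payroll)

# Definition B: only full-time staff whose primary duty is here count.
count_b = sum(1 for p in payroll
              if p["status"] == "full-time" and not p["moonlights_elsewhere"])

print(count_a, count_b)  # 3 1 -- same people, two defensible answers
```

Until the definition is agreed upon, neither answer is "wrong," which is exactly why the clarifying questions in the data-call story were so effective at stalling.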

"Solution" Considerations

Our understanding of the nature of this socio-technical challenge is evolving! It is the relative velocity of the developments outlined above that forces us to acknowledge that right now we know just a bit, and we still don't know what we don't know, about data quality. We are on the discovery curve, and attempts to over-formalize our various approaches will result in brittle solutions. We do know that application of scientific and engineering disciplines can produce better data quality solutions than previous attempts. But for now, it is better to concentrate our efforts on high-level application of policies and principles as opposed to detailed specifications.

Our toolset is improving! Since the development of formalized data reverse engineering and the invention of data profiling (both DoD-funded initiatives [Aiken 1996]) in the early 1990s, our collective data quality engineering tool kit has matured considerably. A multitude of products are now available to help with various analyses and tasks. The most common problem now facing DoD is the widespread perception that tools alone will accomplish data quality improvements and that purchasing a package will solve data quality problems. This has always been, and will always be, false.

The best approaches combine manual and automated reconciliation! As we continue to learn more about data quality, solutions engineering, and related issues, one thing remains clear: the best data quality engineering solutions will continue to be a combination of selected tools and specific analysis tasks, and the primary challenge as we attempt to improve will be determining the proper mix of human and automated solutions. Figure 8 below was developed by one of my heroes, J. C. R. Licklider. His insight about the relative capabilities of humans versus machines was prescient and is as correct now as it was when published in 1960.
HUMANS GENERALLY BETTER
- Sense low-level stimuli
- Detect stimuli in noisy background
- Recognize constant patterns in varying situations
- Sense unusual and unexpected events
- Remember principles and strategies
- Retrieve pertinent details without a priori connection
- Draw upon experience and adapt decisions to the situation
- Select alternatives if the original approach fails
- Reason inductively; generalize from observations
- Act in unanticipated emergencies and novel situations
- Apply principles to solve varied problems
- Make subjective evaluations

MACHINES GENERALLY BETTER
- Sense stimuli outside the human range
- Count or measure physical quantities
- Store quantities of coded information accurately
- Monitor prespecified events, especially infrequent ones
- Make rapid and consistent responses to input signals
- Recall quantities of detailed information accurately
- Retrieve pertinent details without a priori connection
- Process quantitative data in prespecified ways
- Perform repetitive preprogrammed actions reliably
- Exert great, highly controlled physical force

HUMANS GENERALLY BETTER (continued)
- Develop new solutions
- Concentrate on important tasks when overload occurs
- Adapt physical response to changes in situation

MACHINES GENERALLY BETTER (continued)
- Perform several activities simultaneously
- Maintain operations under heavy operational load
- Maintain performance over extended periods of time

Figure 8: Licklider's relative capabilities of humans and machines

A simple example illustrates this point. At one point in the Defense Logistics Agency's business modernization program, someone realized that much of their data was poorly stored in the clear-text/comment fields of their old SAMMS system. DLA thought that a manual approach would be required to clean and restructure the data to prepare it for use in the new SAP system. A simple set of calculations indicated that the time required to implement this manual approach to data quality engineering for approximately 2 million NSNs/SKUs (a subset of the entire inventory) would run into person-centuries (see Figure 9).

Figure 9: Illustration of how data cleansing of 2 million NSNs/SKUs would require 93 person-years even if the task took only 5 minutes per NSN/SKU; real estimates were much greater.

Instead, as Figure 10 illustrates, a combination of automated processing reduced the "problem space" from a 100 percent manual approach to a much smaller task requiring manual attention to less than 7.5 percent of the original NSN/SKU inventory. Of perhaps equal importance, we were able to demonstrate that we could objectively identify the point of diminishing returns, where more work on the automated approach did not produce greater time/effort savings. This kind of synergistic approach is common to most data quality engineering challenges.
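The arithmetic behind Figure 9 is worth reproducing. Assuming roughly 1,800 working hours per person-year (an assumed divisor; the paper does not state one), 2 million items at 5 minutes each comes to about 93 person-years, and leaving under 7.5 percent for manual review cuts that to single digits:

```python
ITEMS = 2_000_000               # NSNs/SKUs needing cleansing
MINUTES_PER_ITEM = 5            # optimistic per-item effort from Figure 9
HOURS_PER_PERSON_YEAR = 1_800   # assumed working hours per person-year

def person_years(items, minutes_each):
    """Convert a per-item manual effort into total person-years."""
    return items * minutes_each / 60 / HOURS_PER_PERSON_YEAR

manual = person_years(ITEMS, MINUTES_PER_ITEM)
semi_automated = person_years(ITEMS * 0.075, MINUTES_PER_ITEM)  # <7.5% left manual

print(f"all-manual:     {manual:.0f} person-years")
print(f"semi-automated: {semi_automated:.1f} person-years")
```

At realistic per-item times well above 5 minutes, the all-manual figure does indeed climb into person-centuries, which is the point of Figure 9's caveat.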

Figure 10: Semi-automating the data cleansing of DLA's SAMMS data saved literally person-centuries of effort, not to mention millions of taxpayer dollars.

Data quality must be approached as a specialized discipline! Given all of the above, it remains clear that the best approach to resolving some of DoD's data quality challenges is to form specialized data quality teams dedicated to resolving challenges wherever and whenever they occur. Only in this manner can DoD effectively concentrate its strengths on processes that can be matured from heroic, to repeatable, to documented, to managed, and finally to improvable processes. Failure to do so will dilute the intellectual strength of data quality engineers with respect to their subject matter knowledge, their tools expertise, and their ability to select and apply appropriate automated solutions to appropriate challenges.

About Data Blueprint

Data Blueprint is a data management and IT consulting firm that empowers organizations to gain more value from their data assets. We offer a full suite of services, including data assessments, data management, data solutions, and data education. Our industry-leading

methodologies have improved our clients' data quality, reduced implementation costs, and decreased time-to-market for strategic IT projects. Learn more at www.datablueprint.com.

Contact Information

Lewis Broome
Chief Operating Officer
10124 W. Broad Street, Suite C
Glen Allen, VA 23060
804.640.0414
lbroome@datablueprint.com

This article includes significant contributions from Daniel Behm, analyst at Data Blueprint. Mr. Behm provided his extensive experience and research results from projects performed for the United States Marine Corps and the Department of Defense National Bone Marrow Program.

References

[Aiken 1996] Aiken, P. Data Reverse Engineering: Slaying the Legacy Dragon. 1996.
[DAMA 2009] DAMA International. A Guide to the Data Management Body of Knowledge. 2009. Available at amazon.com.
[Economist 2010] "Data, Data, Everywhere." A special report on managing information, The Economist, February 27, 2010.
[English 2009] English, L. Information Quality Applied. 2009.
[Finkelstein 1999] Finkelstein, C. and Aiken, P.H. Building Corporate Portals Using XML. New York: McGraw-Hill, 1999. 530 pages (ISBN 0-07-913705-9).
[GAO 2007] DHS Enterprise Architecture Continues to Evolve but Improvements Needed. GAO-07-564.
[GAO 2008] Key Navy Programs' Compliance with DOD's Federated Business Enterprise Architecture Needs to Be Adequately Demonstrated. GAO-08-972.
[Hudicka 2005] Hudicka, J.R. "Why ETL and Data Migration Projects Fail." Oracle Developers Technical Users Group Journal, June 2005, pp. 29-31.
[Waddington 2009] Waddington, D. "The Sad State of Data Quality: Results from the Information Difference survey document initiatives and the state of data quality today." Information Management Magazine, November 1, 2009.
[Yoon 1999] Yoon, Y., Aiken, P., and Guimaraes, T. "Managing Organizational Data Resources: Quality Dimensions." Information Resources Management Journal 13(3), July-September 2000, pp. 5-13.