Practical Considerations for Rapidly Improving Quality in Large Data Collections

By Peter Aiken, Founder, Data Blueprint

Abstract: While data quality has been a subject of interest for many years, only recently has research output begun to converge with the needs of organizations facing these challenges. This paper addresses the fundamental issues inherent in existing approaches to improving DoD data quality (DQ). It briefly discusses our collective motivation and examines three root causes preventing more rapid DQ improvement. An examination of newly perceived realities in this area leads to a discussion of several considerations that will improve related efforts.

Motivation

The situation is getting worse! A recent, voluminous book on the subject has documented more than $13 billion in costs of poor-quality government information attributed directly to the Pentagon, and more than $700 billion attributed to governmental challenges more broadly [English, 2009]. When we couple these costs with recent attempts to determine how much DQ measurement is actually occurring, the results indicate that these two numbers are probably very low. This is in spite of the fact that DoD has been objectively determined to be at the relative forefront of these types of efforts (see Figure 1).

FIGURE 1: OBJECTIVE COMPARISON ACROSS FOUR MAJOR (ANONYMOUS) DOD DATA MANAGEMENT PROGRAMS INDICATES THAT SOME DOD EFFORTS OUTPERFORM AVERAGE PRIVATE SECTOR ORGANIZATIONS, WHOSE PERFORMANCE IS ROUGHLY INDICATED BY THE DOTTED LINE.
Figure 2 shows the results of a 2009 survey from Information Management Magazine. Highlights from this and other recent survey data include:

- One-third of respondents rate their data quality as poor at best, and only 4 percent as excellent.
- Forty-two percent of organizations make no attempt to measure the quality of their data.
- Only 15 percent of organizations are very confident of the data received from other organizations.

The only reasonable conclusion is that, absent a formal data quality assessment effort, all data in an organization is of unknown quality!

FIGURE 2: PERCENTAGE OF ORGANIZATIONS REPORTING VARIOUS LEVELS OF DATA QUALITY (BARS) AND PERCENTAGE OF ORGANIZATIONS PROACTIVELY MEASURING THE QUALITY OF THEIR OWN DATA (PIE CHART).

With the advent of truly big data challenges, the problem continues to worsen. Recent articles, such as this year's special report from The Economist, have helped to increase awareness of the challenges of dealing with yottabytes of data [Economist 2010].

Most organizations are still approaching data quality problems from stove-piped perspectives! In spite of these challenges, many still deal with them from various stove-piped perspectives. It is the classic case of the blind men and the elephant illustrated in Figure 3: most organizations approach data quality problems in the same way that the blind men approached the elephant, with people tending to see only the data that is in front of them. Little cooperation exists across boundaries, just as the blind men were unable to combine their individual impressions of the elephant to recognize the entire animal.

FIGURE 3: NO UNIVERSAL CONCEPTION OF DATA QUALITY EXISTS; INSTEAD, MANY DIFFERING PERSPECTIVES COMPETE.

In order to be effective, data quality engineering must achieve a more complete picture and facilitate cross-boundary communication. Whether you believe that the solution should
come in the form of TQM, six-sigma, standards-related work, or tiger teams, it remains clear that no single solution can satisfy all aspects of the challenge.

Root Cause Analysis

Three root causes do seem common to DQ problems.

Many DQ challenges are unique and/or context-specific! After dealing with data quality problems for more than 25 years, I hold two strong opinions. First, prevention is more cost effective than treating the symptoms. This is Tom Redman's well-repeated story about eliminating the sources of water pollution for any given "lake" of data, as opposed to attempting to continually clear the lake of polluted data. It should be obvious that correcting the sources of data quality problems is less expensive than fixing the same data forever. Second, data quality problems are more unique than they are similar. This prevents their resolution from following programmatic solution-development practices, and it mandates the development of specialized data quality engineers within organizations (more on this in the solutions section of this paper).

Particular evidence of this second point can be seen when we examine the practices of "experienced" data migration specialists ("experienced" here meaning that those surveyed had each accomplished four or more data migrations). Collectively, this group of experienced professionals underestimated the cost of future data migration projects by a factor of 10, as shown in Figure 4 [Hudicka 2005].

FIGURE 4: MEDIAN PROJECTED COST VERSUS MEDIAN ACTUAL EXPENSE. EXPERIENCED IT PROFESSIONALS ARE NOT YET ABLE TO USE PAST EXPERTISE TO ACCURATELY FORECAST PROJECT COSTS!

Educational institutions are not addressing the challenge! Computer engineering/information systems/computer science (CEISCS) students are not being taught data quality concepts, and non-CEISCS students (such as business majors) receive virtually no exposure to data concepts at all.
With a few notable exceptions (including MIT's and UALR's data quality programs), university-level programs are not addressing data quality in CEISCS curricula. Indeed, the most prevalent data-related skill taught by these programs is how to develop new databases, probably the least desired skill set when considering organizational legacy systems environments. At the research level, the history is also short: it was only in 2006 that the first academic journal dedicated to data quality was created.
Page 4 of 11 Vendors are incented to not address the challenges proactively! When contracting for a highway project (at least in the Commonwealth of Virginia) the contractor is offered a bonus for completing the project ahead of schedule, the contracted amount for finishing the project on time, and is penalized for completing the project behind schedule. In DoD systems contracting, vendors actually plan on cost overruns and are bonused for the achievement of these overruns. Anecdotal evidence indicates that data is the primary area where these overruns occur. I have spent considerable time expert- witnessing or otherwise in litigation support. Virtually all IT upgrades, migrations, and/or consolidations involve movement of data. When new systems don't work, one party blames the problems on poor quality data from the source system. Without a baseline assessment of the quality of the data before the movement/consolidation/transformation, it is impossible to defend against this charge. Yet, data quality is typically not addressed formally or informally as part of IT contracts. Vendors currently are incented to "discover" data quality problems after contracts are signed a practice that is literally indefensible, wasteful, and costly. New realities Data quality is now acknowledged as a major source of organizational risk by certified risk professionals! Data quality is now widely acknowledged as a major source of corporate risk. The DoD should take note of the advent of two new C- level executives in private industry: the Chief Risk Officer (CRO) and the Chief Data Officer (CDO). The CDO is an acknowledgement that the CIO concept has been hijacked to focus on areas far beyond the original focus of corporate information as an asset. Indeed many organizations are properly relabeling these individuals as Chief Technology Officers (CTOs) in light of their more broadly technology focused roles, and refocusing the data assts under the control of a CDO. 
From the business side, CROs are being groomed to understand how all aspects of risk play into strategic failures. These professionals understand the role that data quality plays in risk mitigation and can often be the best allies of CDOs in the business management hierarchy.

A body of knowledge has been developed! While this paper has focused on several challenges that relate to the relative immaturity of data quality engineering as a professional discipline, there is some hopeful news. In 2009, DAMA International released A Guide to the Data Management Body of Knowledge [DAMA 2009]. While it isn't as detailed as a Body of Knowledge (BOK) focused specifically on data quality, it does elevate the field of data management to the status enjoyed by the project management discipline (PMBOK) and software engineering (SWEBOK). Also, there is much reference material in the DMBOK that focuses specifically on data quality.
Page 5 of 11 Much more analysis is required before we can implement repeatable solutions to today's data quality challenges! Similar to the point noted above, experienced IT professionals cannot well predict data migration costs, those of us experienced with developing data quality engineering solutions understand that the relative newness of this discipline precludes implementation of repeatable (much less optimized) solutions. Indeed, it is amazing how fast progress has been made in this area. Consider, for example, our concept of the data life cycle. As originally proposed [Redman 1993] the data life cycle consists of three phases: data acquisition, data storage and data use (see Figure 5). FIGURE 5 ORIGINAL DATA ACQUISITION AND USE CYCLE [LEVITAN 1993] Just five years later we acknowledge the data life cycle as more complex (Figure 6). FIGURE 6 REFINED DATA LIFE CYCLE [FINKELSTEIN 1999]
Another relatively recent development is the expansion of the canonical list of data quality attributes. An original formulation consisted of a list of terms applying to data values (such as completeness, conformity, consistency, accuracy, duplication, and integrity). We now know (see Figure 7) that data quality attributes extend to the data models that produce and govern production datasets, and even to organizational data architectures [GAO 2007].

FIGURE 7: A COMPLETE LIST OF DATA QUALITY ATTRIBUTES INCLUDES DATA MODEL AND DATA ARCHITECTURE ATTRIBUTES AS WELL AS DATA REPRESENTATION AND DATA VALUE QUALITY ATTRIBUTES [YOON 1999]

Finally, I'm reminded of events that occurred more than 15 years ago within DoD. The Office of the Secretary of Defense (OSD) would routinely send requests for information to the various branches and services. These were referred to then as "data calls." One data call might ask various organizations, "How many employees do you have?" On the surface this might seem a simple and innocent query. But as I observed the mechanics of the response patterns, they were generally of the form, "What do you mean by an employee?" To a data person, this was a reasonable clarifying question: the 37 systems that paid civilians at the time were not designed to maintain the same information types, and so they did not. A careful respondent might ask this question to ensure valid comparisons could be made across the responses. After all, in those days it was somewhat common for a service member to work part time for another agency at night, or when otherwise off duty, to earn vacation money or to contribute a needed source of expertise. After seeing the various response patterns repeated, I became aware that data quality is a socio-technical discipline.
The various respondents had no intention of providing OSD with any information, and the various questions, while legitimate, were also designed to ensure that no numbers were provided back to the head office. If no numbers were provided, then OSD couldn't tell the respondents to take any action based on the numbers. So we had to incorporate some social engineering into our future data calls.
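The "what do you mean by an employee?" problem can be made concrete in a few lines. This is a hypothetical sketch (the records, field names, and thresholds are all invented, not drawn from the actual DoD systems): two equally defensible definitions applied to the same payroll records produce two different answers to the same data call.

```python
# Hypothetical illustration of the "data call" definition problem.
# The same records yield different counts under different definitions
# of "employee". All data below are invented.

payroll = [
    {"name": "A", "status": "civilian",       "hours_per_week": 40},
    {"name": "B", "status": "civilian",       "hours_per_week": 40},
    {"name": "C", "status": "service member", "hours_per_week": 8},   # off-duty part-timer
    {"name": "D", "status": "contractor",     "hours_per_week": 40},
]

# Definition 1: anyone appearing on the payroll is an employee.
count_on_payroll = len(payroll)

# Definition 2: only full-time civilians are employees.
count_full_time_civilians = sum(
    1 for p in payroll
    if p["status"] == "civilian" and p["hours_per_week"] >= 35
)

print(count_on_payroll)           # 4
print(count_full_time_civilians)  # 2
```

Both answers are "correct" under their own definitions, which is precisely why comparisons across 37 differently designed payroll systems were invalid without an agreed definition, and why the clarifying question, however obstructive its intent, was technically legitimate.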
"Solution" Considerations

Our understanding of the nature of this socio-technical challenge is evolving! It is the relative velocity of the developments outlined above that forces us to acknowledge that, right now, we know just a bit, and we still don't know what we don't know about data quality. We are early on the discovery curve, and attempts to over-formalize our various approaches will result in brittle solutions. We do know that the application of scientific and engineering disciplines can produce better data quality solutions than previous attempts. But for now, it is better to concentrate our efforts on high-level application of policies and principles, as opposed to detailed specifications.

Our toolset is improving! Since the development of formalized data reverse engineering and the invention of data profiling (both DoD-funded initiatives [Aiken 1996]) in the early 1990s, our collective data quality engineering tool kit has matured considerably. A multitude of products are now available to help with various analyses and tasks. The most common problem now facing DoD is the widespread perception that tools alone will accomplish data quality improvements and that the purchase of a package solves data quality problems. This, of course, has been and will always be false.

The best approaches combine manual and automated reconciliation! As we continue to learn more about data quality, solutions engineering, and related issues, one thing remains clear: the best data quality engineering solutions will continue to be a combination of selected tools and specific analysis tasks, and the primary challenge as we attempt to improve will be determining the proper mix of human and automated solutions. Figure 8 below was developed by one of my heroes, J. C. R. Licklider. His insight about the relative capabilities of humans versus machines was prescient, and it is as correct now as it was when published in 1960.
HUMANS GENERALLY BETTER
- Sense low-level stimuli
- Detect stimuli in noisy background
- Recognize constant patterns in varying situations
- Sense unusual and unexpected events
- Remember principles and strategies
- Retrieve pertinent details without a priori connection
- Draw upon experience and adapt decisions to the situation
- Select alternatives if the original approach fails
- Reason inductively; generalize from observations
- Act in unanticipated emergencies and novel situations
- Apply principles to solve varied problems
- Make subjective evaluations

MACHINES GENERALLY BETTER
- Sense stimuli outside the human range
- Count or measure physical quantities
- Store quantities of coded information accurately
- Monitor prespecified events, especially infrequent ones
- Make rapid and consistent responses to input signals
- Recall quantities of detailed information accurately
- Retrieve pertinent details without a priori connection
- Process quantitative data in prespecified ways
- Perform repetitive preprogrammed actions reliably
- Exert great, highly controlled physical force
(HUMANS GENERALLY BETTER, continued)
- Develop new solutions
- Concentrate on important tasks when overload occurs
- Adapt physical response to changes in situation

(MACHINES GENERALLY BETTER, continued)
- Perform several activities simultaneously
- Maintain operations under heavy operational load
- Maintain performance over extended periods of time

FIGURE 8: LICKLIDER'S RELATIVE CAPABILITIES

A simple example will illustrate this point. At one point in the Defense Logistics Agency's business modernization program, someone realized that much of the agency's data was poorly stored in the clear-text/comment fields of its old SAMMS system. DLA thought that a manual approach would be required to clean and restructure the data to prepare it for use in the new SAP system. A simple set of calculations indicated that the time required to implement this manual approach to data quality engineering for approximately 2 million NSN/SKUs (a subset of the entire inventory) would run into person-centuries (see Figure 9).

FIGURE 9: ILLUSTRATION OF HOW DATA CLEANSING OF 2 MILLION NSN/SKUS WOULD REQUIRE 93 PERSON-YEARS IF THE TASK TOOK ONLY 5 MINUTES PER NSN/SKU; REAL ESTIMATES WERE MUCH GREATER.

Instead, Figure 10 illustrates that a combination of automated processing was able to reduce the "problem space" from a 100% manual approach to a much smaller task requiring manual attention to less than 7.5% of the original NSN/SKU inventory. Of perhaps equal importance, we were able to demonstrate that we could objectively identify the point of diminishing returns, where more work on the automated approach did not produce greater time/effort savings. This kind of synergistic approach is common to most data quality engineering challenges.
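The back-of-the-envelope arithmetic behind the DLA figures can be reproduced in a few lines. The SKU count, 5-minutes-per-item effort, and the under-7.5% manual residue come from the text above; the figure of roughly 1,790 working hours per person-year is my assumption (it is the value consistent with the 93-person-year result).

```python
# Reproducing the person-year arithmetic behind the DLA SAMMS example.
# Assumption: one person-year is roughly 1,790 working hours.
HOURS_PER_PERSON_YEAR = 1790

skus = 2_000_000      # NSN/SKUs needing cleansing (subset of full inventory)
minutes_per_sku = 5   # deliberately optimistic manual effort per item

total_hours = skus * minutes_per_sku / 60          # about 166,667 hours
person_years_manual = total_hours / HOURS_PER_PERSON_YEAR
print(round(person_years_manual))                  # 93 person-years, 100% manual

# Semi-automation left under 7.5% of items needing manual attention.
person_years_semi = person_years_manual * 0.075
print(round(person_years_semi, 1))                 # roughly 7 person-years
```

Even under the optimistic 5-minutes-per-item assumption, the fully manual approach is infeasible, while the semi-automated residue is a staffable project, which is the whole argument for mixing automated and manual reconciliation.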
FIGURE 10: SEMI-AUTOMATING THE DATA CLEANSING OF DLA'S SAMMS DATA SAVED LITERALLY PERSON-CENTURIES, NOT TO MENTION MILLIONS OF TAXPAYER DOLLARS.

Data quality must be approached as a specialized discipline! Given all of the above, it remains clear that the best approach to resolving some of DoD's data quality challenges is to form specialized data quality teams dedicated to resolving challenges wherever and whenever they occur. Only in this manner can DoD effectively concentrate its strengths on processes that can be matured from heroic, to repeatable, to documented, to managed, and finally to improvable. Failure to do so will dilute the intellectual strength of data quality engineers with respect to their subject-matter knowledge, their tools expertise, and their ability to select and apply appropriate automated solutions to appropriate challenges.

About Data Blueprint

Data Blueprint is a data management and IT consulting firm that empowers organizations to gain more value from their data assets. We offer a full suite of services, including data assessments, data management, data solutions, and data education. Our industry-leading
methodologies have improved our clients' data quality, reduced implementation costs, and decreased time-to-market for strategic IT projects. Learn more at www.datablueprint.com.

Contact Information
Lewis Broome
Chief Operating Officer
10124 W. Broad Street, Suite C
Glen Allen, VA 23060
804.640.0414
lbroome@datablueprint.com

This article includes significant contributions from Daniel Behm, analyst at Data Blueprint. Mr. Behm provided his extensive experience and research results from projects performed for the United States Marine Corps and the Department of Defense National Bone Marrow Program.

References

[Aiken 1996] Aiken, P. Data Reverse Engineering: Slaying the Legacy Dragon. McGraw-Hill, 1996.

[DAMA 2009] DAMA International. The DAMA Guide to the Data Management Body of Knowledge. 2009.
[Economist 2010] "Data, Data, Everywhere." Special report on managing information, The Economist, February 27, 2010.

[English 2009] English, L. Information Quality Applied. 2009.

[Finkelstein 1999] Finkelstein, C. and Aiken, P.H. Building Corporate Portals Using XML. New York: McGraw-Hill, 1999. 530 pages (ISBN: 0-07-913705-9).

[GAO 2007] DHS Enterprise Architecture Continues to Evolve but Improvements Needed. GAO-07-564.

[GAO 2008] Key Navy Programs' Compliance with DOD's Federated Business Enterprise Architecture Needs to Be Adequately Demonstrated. GAO-08-972.

[Hudicka 2005] Hudicka, J.R. "Why ETL and Data Migration Projects Fail." Oracle Developers Technical Users Group Journal, June 2005, pp. 29-31.

[Waddington 2009] Waddington, D. "The Sad State of Data Quality: Results from the Information Difference Survey." Information Management Magazine, November 1, 2009.

[Yoon 1999] Yoon, Y., Aiken, P., and Guimaraes, T. "Managing Organizational Data Resources: Quality Dimensions." Information Resources Management Journal 13(3), July-September 2000, pp. 5-13.