BI Consultants White Paper
Managing Data Quality Successfully

Data quality is essential. It is the key to the acceptance of IT solutions. And: poor data is expensive. Companies that have recognised this embrace central initiatives to guarantee a high level of data quality, often in connection with data governance projects. Admittedly not at no cost, as Kurt Häusermann and Marcus Pilz expound in the following text.

About the authors: Kurt Häusermann is the founder of BI Consultants GmbH in Zurich. He has been working for more than 20 years in data management, analytics and business intelligence, especially for life science companies. Marcus Pilz is a board member of the Data Warehousing Institute. He has been working as a project leader in the BI environment for approximately 20 years. He is an experienced speaker at international BI symposiums and a technical adviser as well as an evaluator for the technical magazine BI-Spektrum.

Corrupted, incomplete and inconsistent data, and the information derived from them, lead to problems in every business process in which these data are used. Invalid data delay daily work and cause additional effort and hence costs. It can become critical for a company when a strategic decision rests on an inadequate data basis owing to a lack of data quality. Furthermore, the significance of data quality continues to rise due to increasing requirements in the area of compliance.

The negative influence of bad data quality on companies has been investigated in several studies. Thomas Redman, a well-known authority on data quality, estimates the effect of bad data quality at 8 to 12 percent of revenue (Redman, 1996). The consequences of bad data quality in a company are often obscure, are rarely quantified and are frequently accepted by managers as a normal cost of doing business (English, 1999). Flawed data in operative systems are often not even regarded as erroneous, because such data errors play a minor role in the business process. Later, however, when the data are transferred to a data warehouse and analysed, the bad data quality shows itself immediately. Categories then appear in the reports that clearly should not exist, or that are repeatedly present with differing designations. The result: incorrect aggregation of the data. Such problems undermine acceptance by the business users, who do not want to base important business decisions on erroneous data. A laissez-faire attitude towards data quality therefore causes direct and indirect costs for the company, which should be taken seriously.

The cost of data quality

Prof. Martin Eppler and Markus Helfert, both from the University of St. Gallen, designed a cost model for data quality in 2004. First they determine the cost of bad data quality. This comprises the direct costs (costs for the verification of the data, the correction of invalid data and the consequences thereof, as well as costs incurred because, for example, clients cannot be reached as a result of incorrect address details) and the indirect costs, such as expenditures due to incorrect decisions, missed opportunities, loss of image or customer dissatisfaction caused by wrong deliveries.

Subsequently, they determine the cost of improving, or more precisely of assuring, a sufficient data quality. This includes the costs of prevention, detection and repair. The prevention costs cover the measures necessary so that fewer errors occur, such as improving data capture through plausibility tests, documented standards, better training of personnel or better coordination between subprocesses. The detection costs cover the measures that lead to the discovery of errors already present in the data, such as the analysis of existing databases with the aid of rules in order to detect invalid or inconsistent data. The repair costs include all activities necessary to correct the detected errors in the databases.

[Figure: classification of data quality costs]
Costs caused by low data quality
  Direct: verification, re-entry, compensation
  Indirect: costs based on lower reputation, costs based on wrong decisions or actions, sunk investment
Costs of improving or assuring data quality
  Prevention: training, monitoring, standard development and deployment
  Detection: analysis, reporting
  Repair: repair planning, repair implementation
(Source: Eppler, Helfert: A Framework for the Classification of Data Quality and an Analysis of their Progression)

Now the costs for improving and assuring data quality can be compared with those of bad data quality. The main purpose is to find an optimum at which the costs of bad data quality are reduced without the cost of quality improvement itself becoming too large a cost factor.
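The trade-off described above can be sketched numerically. The following is an illustrative model only: the two cost curves and their coefficients are invented for demonstration and are not taken from the Eppler/Helfert paper; the point is merely that the sum of both curves has a minimum somewhere below 100 percent quality.

```python
# Illustrative sketch of the cost trade-off: total cost is the sum of the
# cost of poor data quality (falling as quality rises) and the cost of
# assuring quality (rising, typically steeply near 100 %). The curve shapes
# and coefficients are invented for illustration only.

def cost_of_poor_quality(q: float) -> float:
    """Direct and indirect costs caused by low quality, q in [0, 1)."""
    return 100.0 * (1.0 - q) ** 2

def cost_of_assurance(q: float) -> float:
    """Prevention, detection and repair costs; they explode near q = 1."""
    return 10.0 * q / (1.0 - q)

def total_cost(q: float) -> float:
    return cost_of_poor_quality(q) + cost_of_assurance(q)

# Search the quality range for the economic optimum.
levels = [i / 1000 for i in range(1, 1000)]
optimum = min(levels, key=total_cost)
print(f"optimal quality level ~ {optimum:.2f}, "
      f"total cost ~ {total_cost(optimum):.1f}")
```

With these invented curves the optimum lies at roughly 63 percent quality; pushing further to the right costs more in assurance than it saves in error costs, which is exactly the point the cost model makes.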
The search for the optimal data quality

The illustration below shows that there is an optimum for data quality which must be found in practice. Significantly, an excessively ambitious data quality target can lead to higher overall costs. However, economic arguments are not the only justification for better data quality. In order to achieve compliance, an effort that lies far above the economic optimum may be necessary, but it must nevertheless be made. Ultimately, the well-known formulation "fitness for use" by Joseph Juran applies to data quality as well. It means: quality must suffice for the purpose of the application. Consequently quality, and thus also data quality, must align itself with the requirements. This also applies to the rationale for data quality cited above.

[Figure: cost curves with an economic optimum for data quality. Source: Eppler, Helfert: A Framework for the Classification of Data Quality and an Analysis of their Progression]

Causes of bad data quality

Bad data quality begins with the very first capture of the data. Data are entered incorrectly and are checked incompletely or not at all by the system. The personnel responsible for capturing the data have little training, and there are often no standards for data capture, or only rudimentary ones. In addition, an awareness of the consequences of data errors is lacking, because the personnel do not understand what the data will be used for later. An invoice amount may seem important, but what meaning do department names have, for example? This shows itself much later, during reporting, of which these personnel mostly never learn. In practice a feedback loop, in which the data capturers are informed about data errors, is often missing. In systems with less structured entries, ambiguities and misinterpretations persist during data capture.

Business processes change, but operative systems cannot always be adapted synchronously. Fields are therefore, for the sake of simplicity, repurposed so that the operative system can be kept running. What is not kept in mind are the consequences this casualness may have later in follow-up systems and in the data warehouse.

Further sources of bad data quality lie in an inadequate data architecture of the source systems. These often originated autonomously and embody varying perceptions of the company.
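Capture-time plausibility tests of the kind mentioned above can be very simple and still catch many errors before they spread. The sketch below is a minimal illustration; the field names and rules are invented, not taken from any particular system.

```python
# Minimal sketch of capture-time plausibility tests with a feedback loop.
# Field names and rules are invented for illustration.
import re
from datetime import date

def check_record(record: dict) -> list:
    """Return a list of error messages; an empty list means the record passes."""
    errors = []
    if not re.fullmatch(r"\d{4}", record.get("postal_code", "")):
        errors.append("postal_code: expected four digits (Swiss format)")
    if record.get("invoice_amount", 0) <= 0:
        errors.append("invoice_amount: must be positive")
    if record.get("invoice_date", date.max) > date.today():
        errors.append("invoice_date: must not lie in the future")
    return errors

# The feedback loop is as important as the check itself: reject the record
# at entry and tell the person capturing it exactly what is wrong.
record = {"postal_code": "80006", "invoice_amount": 250.0,
          "invoice_date": date(2009, 3, 1)}
for message in check_record(record):
    print("rejected:", message)
```

Rejecting the record at the point of capture, with a concrete message, is precisely the missing feedback loop the text describes: the data capturer learns immediately, instead of the error surfacing months later in a report.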
The resulting data representations are thus diverse, which can greatly impede integration. If a source system has already been migrated once, there is a high risk that migration errors are present which have so far gone unnoticed in operative activities. Finally, there is the danger during the integration of data stocks that the
data contents are not accurately defined or that the compiled documentation no longer reflects the current state. In such cases, data with varying semantics are brought together, which can lead to a systematic falsification of the data. Certain data stocks can also be forgotten during integration: missing data from offshore branches or subsidiaries can cause an aggregation at company level to be calculated incorrectly.

Data quality dimensions

Before improving data quality, one should define the dimensions against which quality will be measured. Richard Wang, who has been researching data quality at the Massachusetts Institute of Technology (MIT) for 20 years, pointed out in his widely adopted article "Beyond accuracy: What data quality means to data consumers" (1996) that data quality is not only about correctness and accuracy, but also comprises other dimensions. Most authors assume a 360-degree business-user point of view. The following table shows a selection of possible data quality dimensions, with definitions after T. Redman (2001):

Accuracy: Degree of agreement between a data value, or a collection of data values, and a source agreed to be correct.
Consistency: Degree to which a set of data satisfies business rules.
Completeness (attribute level): Degree to which data values are present for required attributes, or the degree to which required data records are present.
Timeliness: Degree to which an information chain or process is completed within a prespecified date or time.
Relevance: Degree to which data are relevant to a particular task or decision.
Clear definition: A datum is clearly defined if it is unambiguously defined using simple terms.
Identifiability: A good data model calls for each distinct entity to be uniquely identified.

Which dimensions are meaningful for a particular purpose, and how each dimension should be measured, depends on the concrete goals.
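Two of these dimensions, completeness and consistency, lend themselves directly to measurement as rates over a record set. The sketch below shows one possible way to do this; the field names, records and the VAT business rule are invented for illustration.

```python
# Sketch: turning completeness and consistency into measurable rates.
# Records, field names and the business rule are invented for illustration.

records = [
    {"customer_id": 1, "country": "CH", "vat_rate": 7.7},
    {"customer_id": 2, "country": "CH", "vat_rate": None},  # incomplete
    {"customer_id": 3, "country": "DE", "vat_rate": 7.7},   # inconsistent
]

def completeness(rows, attribute):
    """Share of rows in which a required attribute is present."""
    return sum(r[attribute] is not None for r in rows) / len(rows)

def consistency(rows, rule):
    """Share of rows satisfying a business rule (skipping incomplete rows)."""
    checked = [r for r in rows if r["vat_rate"] is not None]
    return sum(rule(r) for r in checked) / len(checked)

# Invented business rule: each country carries its expected VAT rate.
expected_vat = {"CH": 7.7, "DE": 19.0}
vat_rule = lambda r: r["vat_rate"] == expected_vat[r["country"]]

print(f"completeness(vat_rate) = {completeness(records, 'vat_rate'):.2f}")
print(f"consistency(vat_rule)  = {consistency(records, vat_rule):.2f}")
```

Measured this way, each quality policy becomes a number that can be tracked over time, which is what makes the dimensions operational rather than merely descriptive.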
What is important is that the quality dimensions are oriented towards the objectives, not towards what a particular tool happens to support.

Data quality improvement

Basically, it must be assumed that the operative systems continually produce a portion of erroneous data, and that large volumes of partially faulty data already exist in key databases. A strategy to improve data quality must address both areas: on the one hand, the constant further accrual of flawed data must be prevented; on the other hand, the existing data must be cleansed either in the databases themselves or in a step prior to loading into the data warehouse. Projects have shown that the responsibility
for the data is often insufficiently regulated. The creation of explicit responsibilities for data sources is therefore an important first step in data quality projects. In addition, it is sensible to create the role of data steward. Since data quality cannot be measured without an accurate description of the data semantics, a review of the metadata, and of the conformity between the metadata and their effective usage, belongs to the standard preparation for a data quality undertaking. Only then can the dimensions relevant to the undertaking be determined and quality policies be defined to which the data must conform. This concerns predominantly the completeness, correctness and consistency of the data.

Data profiling

An important approach is data profiling: the largely automated, systematic analysis and technical assessment of data as a basis for corrective measures. Ideally, data profiling is carried out at the beginning of a project, or is performed generally, asynchronously to the data warehouse processes, on an independent hardware infrastructure. For this purpose the complete data portfolio is extracted into a separate environment, since data profiling is runtime-intensive owing to the large data volumes, and consistent databases must be maintained over a longer period for the analyses. Modern tools support the data profiling team in the analysis; such a team should be a small group of about three people combining interdisciplinary IT and business skills. The analysis initially consists of preparing the results so that they can be evaluated in workshops with the business side and secured in the form of business rules. As a result of the profiling process, the user receives a list of possible problem areas in the data in use and can assess whether a problem must be corrected and what outlay must be planned for this. A variety of data profiling tools are available on the market.
Such tools can be used as an alternative to in-house development; they offer the advantage of being quickly deployable, of supporting many data formats and of delivering consistent results. In-house developments offer the advantage of better integration into the ETL processes: the profiling test routines can be configured so that they are extended, at a later stage, into inspection and approval steps within the ETL process. For this purpose the profiling methods and their target values are saved in the metadata, and the ETL process is enhanced with these inspection and approval steps. A time series analysis to forecast record counts or value ranges thereby becomes possible under operating conditions. Reporting chains to those responsible for data quality on the business side can be implemented via SMS or e-mail, for the prompt clarification or initiation of corrections and the avoidance of faulty runs.
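Such an in-process check can be sketched as follows. This is a simplified stand-in for the time series analysis described above: it compares today's record count against the recent history and triggers a notification before a faulty load propagates. The numbers are invented, and the notify function is a placeholder for the e-mail or SMS channel mentioned in the text.

```python
# Sketch of an ETL inspection step: flag loads whose record count deviates
# strongly from the recent past. History and thresholds are invented.
import statistics

def check_record_count(history, todays_count, tolerance=3.0):
    """Accept a load only if its count lies within tolerance * stdev of the mean."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    return abs(todays_count - mean) <= tolerance * stdev

def notify(message):
    # Placeholder: a real implementation would send e-mail or SMS
    # to the responsible data steward.
    print("ALERT:", message)

history = [10120, 10250, 9980, 10300, 10110]
todays_count = 4200  # e.g. a source system delivered only a partial extract
if not check_record_count(history, todays_count):
    notify(f"record count {todays_count} outside expected range; load halted")
```

Halting the load and alerting a named responsible person closes the loop: the correction is initiated promptly, and a faulty run never reaches the reports.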
References

Apel, D. et al.: Datenqualität erfolgreich steuern. Hanser, München, 2009.
English, L.: Improving Data Warehouse and Business Information Quality: Methods for Reducing Costs and Increasing Profits. Wiley, New York, 1999.
Eppler, M. and M. Helfert: A Framework for the Classification of Data Quality and an Analysis of their Progression. http://www.computing.dcu.ie/~mhelfert/Research/publication/2004/EpplerHelfert_ICIQ2004.pdf
Lee, Y.W. et al.: Journey to Data Quality. The MIT Press, Cambridge, 2006.
Redman, T.: Data Quality for the Information Age. Artech, Boston, 1996.
Redman, T.: Data Quality: The Field Guide. Digital Press, Boston, 2001.

BI Consultants GmbH
Hadlaubstrasse 124
CH-8006 Zürich
Switzerland
tel +41 44 350 40 51
mob +41 79 332 87 15
info@bi-consultants.ch
www.bi-consultants.ch