
DATA QUALITY, DATABASE QUALITY, INFORMATION SYSTEM QUALITY

The content of this document is the exclusive property of REVER. Its purpose is to provide information and it should in no case be considered a commitment of REVER. Any use, including referencing part or all of the document, requires the prior written agreement of REVER. REVER SA ; Belgium ; +32 81 72 51 33 ; http://www.rever.eu

Table of contents
1 Introduction
2 Facts
3 Encountered difficulties
4 REVER's solutions
4.1 The REVER approach
4.2 Major results provided by REVER
4.2.1 Data quality
4.2.2 Database quality
4.2.2.1 Measures on the structures
4.2.2.2 Measures on the database as used by the programs
4.2.3 Information system quality
5 The added value of REVER
6 To know more

1 Introduction

To assist the reading of this document, it is useful to recall a few definitions:

- A data item is a succession of characters representing an elementary object of an information system; as such, a data item has no particular meaning. In computerized information systems, data are stored in electronic files.
- Information is a data item, or a set of data items, endowed with semantics.
- A database is a set of files allowing a large quantity of data to be stored and managed (addition, update, retrieval). This set of files is structured and organized to provide users with the information needed to cover their domain of activity. To reach this objective, the availability of a database is not enough: programs are needed to process the data and to provide the interface with the users. This whole (database plus programs) is called an application.
- An information system is an organized set of resources (personnel, databases, procedures, hardware, software, ...) used to acquire, store, structure and communicate information (in the form of text, images, sounds and coded company data). The goal is to coordinate these activities so that the organization can reach its objectives. The information system is the vehicle of communication within the organization.

The quality of its information system is a major challenge for an organization: without credible, relevant and coherent information, it is difficult to act on a daily basis and, a fortiori, to make the important decisions that are vital for the organization. Given this strategic stake, measures of data quality and database quality are essential elements in evaluating information system quality.

These definitions also illustrate the important differences between the concepts: the quality of a given piece of information and/or of the information system is a concern for the organization and the users, whereas the quality of data and databases is a concern for the IT department.

2 Facts

Quality problems are in fact daily matters:

- Users are confronted with wrong or contradictory data (typically, two screens giving two different addresses for the same customer);
- Programs abort because of incoherent data;
- A data migration process is blocked when loading the data because the source data have formats, or types, incompatible with the structure and rules of the target database;
- Wrong results are communicated to customers and have to be corrected later by sending them letters.

Looking at these examples, the diversity of situations, multiplied by the frequency of "anomalies" and/or technical crashes, leads to undesirable consequences:

- Unforeseen workload leading to budget overruns (sometimes significant);
- Delays in application delivery, with direct and/or indirect consequences on the activities of the organization;
- Loss of confidence of the users in the IT system.

Organizations long considered that this type of trouble had a minor impact on their operations and, as a consequence, tolerated it. The situation has however changed drastically in recent years: automated information systems have become, over time, the central tool for operating an organization. Any stoppage, delay or error in the data leads to damage which can be significant, in particular when these dysfunctions have a direct impact on the customer base. In this context, and being conscious of the importance of the challenge, strong actions are needed.

3 Encountered difficulties

If the dysfunctions are easy to report, carrying out the solutions is unfortunately not simple. The reasons below are not exhaustive:

Organizational reasons:
- Which department (or persons) is responsible for quality?
- Which objectives have been assigned to it?
- What quality level is expected by management (no stoppage, no incoherence, ...)?
- What budgets are available to achieve quality?
- What are the mission and the authority of the quality team? Is it only a technical one?

Operational reasons:
- Where to start? With the information system? With the technical aspects?
- Is there a data "dictionary" of the organization? If yes, is it up to date? If not, how can it be built?
- Considering that data are in a permanent and continuous flow, entering and leaving the organization, how can the obtained results be perpetuated?
- What are the technical solutions and/or the available products? Are they compatible with the technological platforms?

Anyone will understand that it is not simple. It is even more complex when analyzing the different solutions on the market. Regarding data quality, two main trends can nowadays be distinguished, depending on whether the question is approached through the solutions or through the results. The first trend is expressed in a document of the Gartner Group ("Magic Quadrant for Data Quality Tools", 29 June 2007) and is summarized in the following table:

Table 1: Gartner Group functions

- Parsing and standardization: decomposition of text fields into component parts and formatting of values into consistent layouts, based on industry standards, local standards (for example, postal authority standards for address data), user-defined business rules, and knowledge bases of values and patterns.
- Generalized cleansing: modification of data values to meet domain restrictions, integrity constraints or other business rules that define sufficient data quality for the organization.
- Matching: identification, linking or merging of related entries within or across sets of data.
- Profiling: analysis of data to capture statistics (metadata) that provide insight into the quality of the data and aid in the identification of data quality issues.
- Monitoring: deployment of controls to ensure ongoing conformance of data to the business rules that define data quality for the organization.
- Enrichment: enhancing the value of internally held data by appending related attributes from external sources (for example, consumer demographic attributes or geographic descriptors).

The second trend is expressed by the ISO standardization body (ISO/IEC JTC1/SC7 3792, 28 June 2007). In its approach, ISO defines 16 measured characteristics and distinguishes, for each of them, between an inherent data characteristic and an extended one. More precisely:

IN = Inherent data quality: the capability of data to satisfy stated and implied needs when data are used under specified conditions, independently of other system components. Inherent data quality refers to the data itself; it provides the criteria to ensure and verify the quality of data taking into account:
- data domain values and possible restrictions (e.g. business rules governing the currentness, accuracy, precision, etc. required for a given application);
- metadata;
- relationships between data (e.g. integrity constraints).

EX = Extended data quality: the extent to which data can satisfy stated and implied needs when data are used under specified conditions, using some capabilities of computer system components. Extended data quality refers to properties inherited from the technological environment through the implementation of system capabilities; it provides criteria to ensure and verify the quality of data for:
- data acquisition (e.g. data entry, data loading, data updating in accordance with the parsing, enriching and transforming rules of the data management);
- data management (e.g. backup/restore capabilities in terms of capacity and store/retrieve times);
- data access (e.g. obscuring, for security purposes, the first twelve digits of a credit card number on a purchase receipt).

Table 2: ISO norm characteristics (IN = inherent, EX = extended)

- Consistency (IN): the extent to which data are coherent with other data in the same context of use. Inconsistency can be verified on the same or on different entities.
- Currentness (IN): the extent to which the data are of the right age. Currentness is critical for volatile data.
- Completeness (IN): the extent to which the subject data associated with an entity have values for all expected attributes and related entity occurrences in a specific context of use. Completeness also includes the capability of data to represent the context observed by users.
- Precision (IN, EX): the extent to which the data provide the depth of information needed.
- Accuracy (IN): the extent to which the data correctly represent an attribute of a real-world object, concept or event. Accuracy has two main aspects: syntactic accuracy, defined as the closeness of the data values to a set of values defined in a domain considered syntactically correct, and semantic accuracy, defined as the closeness of the data values to a set of values defined in a domain considered semantically correct.
- Confidentiality (IN, EX): the extent to which data can be accessed and interpreted only by authorized users.
- Availability (EX): the extent to which data are retrievable by authorized users.
- Recoverability (EX): the extent to which data maintain and preserve a specified level of operations and their physical and logical integrity, even in the event of failure.
- Understandability (IN, EX): the extent to which the data, stored in their native format, can be read and easily interpreted by users, and are expressed in appropriate languages, symbols and units.
- Efficiency (IN, EX): the extent to which data can be processed (accessed, acquired, updated, managed, used, etc.) and provide the expected levels of performance using the appropriate amounts and types of resources under stated conditions.
- Changeability (EX): the extent to which data can be modified, for instance modification of their type, length or assigned value.
- Portability (EX): the extent to which data can be moved from one platform to another; this also includes the possibility to install and replace data on the destination platform.
- Traceability (IN, EX): the extent to which data provide an audit trail of their origin and of any changes made to them.
- Credibility (IN): the extent to which data are regarded as true and believable by users.
- Accessibility (IN, EX): the extent to which data can be reached, particularly by people who need supporting technology or a special configuration because of some disability.
- Compliance (IN, EX): the extent to which the data adhere to standards, conventions or regulations in force and to similar rules relating to data quality.

4 REVER's solutions

REVER's activities are based on a Model Driven Data Engineering (MDDE) approach. In particular, the methods and technologies of REVER allow the reconstruction of the data model (physical, logical and semantic) of an application or of a set of applications. For REVER, the data models contain:

- the data structures (entities and attributes);
- the relations linking the entities with each other;
- the data rules, meaning the business rules which, when not respected, create incoherence in the persistent objects.

The modeling tools allow content elements of the model, and in particular the data rules, to be added or modified at any time. Likewise, from a modeling point of view, it is possible to construct a complete model of the organization's information system by placing side by side, and by linking, the data models of each application that is part of the information system.

4.1 The REVER approach

In an MDDE approach, as proposed by REVER, it is natural to approach data quality problems in two phases:

- In a first step, measure the data quality against the data model which structures the application(s);
- In a second step, evaluate the capacity of the data model to meet the users' and/or the organization's requirements.

More precisely, as for any qualitative measure, measuring the quality of an information system, of a database or of data requires one or more reference elements against which the measure can be appreciated. In this context, and as shown in the schema below:

- information system quality has to be measured against the users' and the organization's needs;
- database quality has to be measured against its use by the programs, according to criteria such as complexity, performance and evolution capacity;
- data quality has to be measured against the application data models and, in particular, the data rules of the IT application.

From this point of view, the solutions and technologies proposed by REVER complement existing solutions rather than substituting for them. This allows qualitatively better results to be produced faster and at a lower cost. To illustrate this complementarity, let us take the two following examples.

The splitting of a data zone into elementary fields: it is frequent (mainly in applications using non-relational databases) that one or more zones are defined globally and not described in the form of elementary attributes (typically, an address field described in the database as "address", alphanumeric, length 163). To know whether a more precise structure exists or not (e.g. N, numeric, length 5; STREET, alpha, length 100; COUNTRY, alpha, length 3; POSTAL CODE, numeric, length 5; CITY, alpha, length 50), data profiling tools perform a statistical analysis of the data and report the different formats found during the analysis. Obviously, in such an approach, if a significant proportion (say more than 30%) of the data does not respect these rules, it will be difficult to appreciate the data quality. In an MDDE approach, as proposed by REVER, the principle is radically different: the exhaustive analysis of all program source code will indicate the existence of at least one program (for example, one containing an address data entry screen) which takes the splitting into account. If that is the case, the splitting used by the program becomes part of the model. Later on, the data validation programs, which are generated from the model, make it possible to know in detail which addresses do not respect the splitting rules. Of course, if no program reveals the splitting rules, then the data profiling technique, with its capacity to analyze character strings, brings an added value which cannot be provided by the models; a sketch of this kind of format analysis is shown below.
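As a rough illustration of what such a statistical format analysis can look like, the Python sketch below (an illustration only, not the behaviour of any particular profiling product; the sample values and the mask convention are assumptions) derives a format mask for each value of a raw address field and reports how often each mask occurs:

```python
from collections import Counter

def format_mask(value: str) -> str:
    """Map each character to a class: 9 for digits, A for letters, other characters kept as-is."""
    return "".join("9" if c.isdigit() else "A" if c.isalpha() else c for c in value)

def profile_column(values):
    """Count how often each format mask occurs in a column (here: a raw address field)."""
    masks = Counter(format_mask(v) for v in values)
    total = len(values)
    return [(mask, count, round(100.0 * count / total, 1))
            for mask, count in masks.most_common()]

addresses = [
    "12 RUE HAUTE 5000 NAMUR",
    "7 AVENUE LOUISE 1050 BRUXELLES",
    "BOITE POSTALE 14",              # does not follow the number/street/code/city layout
]
for mask, count, pct in profile_column(addresses):
    print(f"{mask}: {count} value(s), {pct}%")
```

In this spirit, a profiling tool reports the most frequent masks and their proportions, leaving it to the analyst to decide which layout, if any, can be considered the rule.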

Many programming languages allow "redefinitions" of fields (REDEFINES in COBOL, for example). This mechanism aims at saving room in the database by authorizing the use of the same physical location for different, mutually exclusive concepts. For example, in a record concerning persons, the field "name" will be used to store the company name when the record describes a legal entity, and will be redefined into two fields, "name" and "birth name", for a physical person (with a rule stating that the birth name zone is only used for married women). Clearly, in these circumstances, an indicator necessarily exists in the record (in our example, a zone "type of person") which allows the programs to know which structure has to be taken into consideration. In an MDDE approach, all of these redefinitions are recovered during the analysis of the program source code and integrated into the model (the schema below shows the model of an application in which a single field represents 77 different concepts). As a consequence, the data analysis takes into account, for each record of the structure, the concept which is actually stored. Without precise indications about the existing concepts, a data profiling tool, in these circumstances, cannot deliver relevant results regarding data quality. This example clarifies the main added value of MDDE in assessing data quality, database quality and, more generally, the quality of information systems.
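A minimal sketch, in Python, of how a validation generated from such a model can take the redefinitions into account once the indicator field is known (the record layout, the "type of person" values and the rules are hypothetical and chosen for illustration, not REVER's generated code):

```python
def check_person_record(record: dict) -> list:
    """Validate a record whose shared zone is interpreted according to 'type_of_person'."""
    errors = []
    kind = record.get("type_of_person")
    if kind == "LEGAL":
        # The shared zone holds a single company name; the redefined birth name must stay empty.
        if not record.get("name"):
            errors.append("missing company name")
        if record.get("birth_name"):
            errors.append("birth_name must be empty for a legal entity")
    elif kind == "PHYSICAL":
        # The shared zone is split into name + birth name.
        if not record.get("name"):
            errors.append("missing person name")
    else:
        errors.append("unknown type_of_person: %r" % kind)
    return errors

print(check_person_record({"type_of_person": "LEGAL", "name": "REVER SA", "birth_name": "X"}))
# -> ['birth_name must be empty for a legal entity']
```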

The tables below take up the two approaches mentioned previously, indicating for each function and each characteristic whether MDDE, as applied by REVER, is suited to complementing existing solutions (marked cells), and for which aspects: the data, the application's database and programs, or the information system.

Table 1 (Gartner Group functions): parsing and standardization, cleansing, matching, profiling, monitoring and enrichment, against the aspects concerned.

Table 2 (ISO characteristics): the sixteen characteristics listed above, with their inherent/extended classification, against the same aspects.

4.2 Major results provided by REVER

Through the MDDE approach, the REVER solutions have an impact at each of the three levels of quality measurement: data quality, database quality and information system quality.

4.2.1 Data quality

In concrete terms, the REVER tools make it possible, for an application and from the existing technical elements, to:

- reconstruct the data model, including the detailed structures of the database (entities, attributes with their type, length and position), the relations between entities and the other data rules;
- identify the redundant attributes and check their values;
- automatically generate the controls that verify that the data respect the model rules;
- locate the modules and programs using the data and those updating them.

As an example, the screens below show the results of a database content control. The controls applied were generated automatically from the data model. The first screen shows the global results of the database analysis, such as the total number of requests (1), the number of tables containing errors (2) and the list of entities (3). The second screen provides the list of attributes per entity (1), the number of analyzed records (2) and the number of inconsistent data values.

The third screen shows the data values (1) which are not consistent with the rules (e.g. 31 September is not a valid date) as well as the identifiers of the records concerned (2).
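By way of illustration, a generated control of that kind (rejecting impossible dates and reporting the offending record identifiers) could resemble the Python sketch below; this is a hand-written stand-in, not code actually produced by the REVER tools, and the column layout and identifiers are assumptions:

```python
from datetime import date

def invalid_dates(rows):
    """Return (record_id, raw_value) pairs whose date value violates the calendar rule."""
    offenders = []
    for record_id, raw in rows:
        try:
            year, month, day = (int(part) for part in raw.split("-"))
            date(year, month, day)            # raises ValueError for e.g. 2008-09-31
        except (ValueError, AttributeError):  # malformed string or impossible date
            offenders.append((record_id, raw))
    return offenders

rows = [(101, "1975-04-12"), (102, "2008-09-31"), (103, "1990-13-01")]
print(invalid_dates(rows))   # -> [(102, '2008-09-31'), (103, '1990-13-01')]
```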

Beyond the simple control of conformity between the data and the data models, the solutions proposed by REVER, starting from the list of modules (programs) updating the data, also make it possible in particular:

- to identify the reasons for divergence between the data model and the programmed rules, leading to appropriate solutions and corrections;
- to correct possible errors in the source modules;
- to put in place control and alert mechanisms at data entry time when the model rules are not respected;
- to modify the model by adding data rules, which allows a simple validation of a rule that is not yet clearly established.

4.2.2 Database quality

The database quality measures, besides the data quality discussed above, are based on two elements:

- the database structures (entities, attributes, relations, ...);
- the database as used by the programs.

4.2.2.1 Measures on the structures

4.2.2.2 Measures on the database as used by the programs

These measures aim at providing elements describing how the programs use the database. Two main measures are produced:

Dependency measures: the objective is to know the degree of dependency between data and programs. In concrete terms, the REVER tools identify, for each procedural object, the list of accessed entities and the access types (read, write, etc.). These elements are grouped in a table in which each point indicates that program X uses entity Y. Of course, this table can be built for reads only, for writes only, or for both, depending on the needs. Such a table is presented below.

Criticality measures: the objective is to determine the risks induced by the programs when they use the data. In concrete terms:

- in the model, a weight is given to each entity depending on its number of parents and children;
- in the programs, for each database access verb, a weight is attributed depending on the action type (read, write, delete, ...).

Once these parameters are fixed, the following are obtained by a simple calculation:

- the weight of a specific access, calculated as a function of the weight of the verb and the weight of the entity;
- the weight of a module, which is the sum of the weights of the accesses performed in the module;
- the weight of a program, which is the sum of the weights of the modules composing it.

An example of this measure is illustrated in the graph below.
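A minimal Python sketch of such a criticality computation (the verb weights, the entity weighting function and all names are assumptions chosen for illustration, not the values used by the REVER tools):

```python
# Assumed verb weights: destructive accesses weigh more than reads.
VERB_WEIGHT = {"read": 1, "write": 3, "delete": 5}

def entity_weight(parents: int, children: int) -> int:
    """Weight of an entity, growing with its number of parent and child entities."""
    return 1 + parents + children

def access_weight(verb: str, entity: str, entities: dict) -> int:
    """Weight of one database access: a function of the verb weight and the entity weight."""
    parents, children = entities[entity]
    return VERB_WEIGHT[verb] * entity_weight(parents, children)

def module_weight(accesses, entities) -> int:
    """Weight of a module: the sum of the weights of its accesses."""
    return sum(access_weight(verb, entity, entities) for verb, entity in accesses)

def program_weight(modules, entities) -> int:
    """Weight of a program: the sum of the weights of the modules composing it."""
    return sum(module_weight(accesses, entities) for accesses in modules)

# entity -> (number of parents, number of children)
entities = {"CUSTOMER": (0, 3), "ORDER": (1, 1)}
# One program made of two modules; each module is a list of (verb, entity) accesses.
program = [[("read", "CUSTOMER"), ("write", "ORDER")], [("delete", "ORDER")]]
print(program_weight(program, entities))   # (1*4 + 3*3) + 5*3 = 28
```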

Moreover, with the help of the utilization ratio of each program, it is possible to draw a graph of the programs' risks related to data, as illustrated below. In this graph, the upper right quadrant isolates the high-risk programs (the most frequently used and the "heaviest"), while the lower left quadrant isolates the least risky ones. This type of result is widely used for scheduling the programs to be tested.
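The quadrant reading can be reproduced in a few lines: given, for each program, its utilization ratio and its criticality weight, split both axes on their median (a simplified sketch; the thresholds and program names are hypothetical):

```python
from statistics import median

def risk_quadrants(programs: dict) -> dict:
    """programs: {name: (usage_ratio, weight)} -> {name: risk label}."""
    usage_split = median(usage for usage, _ in programs.values())
    weight_split = median(weight for _, weight in programs.values())
    labels = {}
    for name, (usage, weight) in programs.items():
        if usage >= usage_split and weight >= weight_split:
            labels[name] = "high risk (test first)"
        elif usage < usage_split and weight < weight_split:
            labels[name] = "low risk"
        else:
            labels[name] = "medium risk"
    return labels

programs = {"BILLING": (0.9, 28), "ARCHIVE": (0.1, 5), "REPORTS": (0.7, 8), "BATCH_LOAD": (0.2, 30)}
print(risk_quadrants(programs))
```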

4.2.3 Information system quality

Regarding information system quality, the REVER tools provide users with a complete and detailed description of the information system managed by an application. From the model, one can then determine the evolutions to be made in order to meet the needs. Moreover, the REVER tools allow the results produced for one given application to be generalized to all of them. They offer the following possibilities:

- to compare and bring together the data descriptions which are identical and/or look alike;
- to compare the data values when they concern identical pieces of information;
- to rationalize the data descriptions of the organization;
- to put in place an architecture based on "data services", keeping identical data simultaneously updated in several databases.

As an illustration, the screen below is an example of a multi-database dictionary allowing identical, or look-alike, data to be identified in order to connect them.
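A minimal sketch of how identical or look-alike data descriptions could be paired across the dictionaries of several databases (plain string similarity, for illustration only; the column names and the 0.8 threshold are assumptions, not REVER's matching algorithm):

```python
from difflib import SequenceMatcher

def look_alike_columns(dictionaries: dict, threshold: float = 0.8):
    """dictionaries: {database: [column names]} -> list of (db.col, db.col, similarity)."""
    matches = []
    databases = list(dictionaries)
    for i, db_a in enumerate(databases):
        for db_b in databases[i + 1:]:
            for col_a in dictionaries[db_a]:
                for col_b in dictionaries[db_b]:
                    score = SequenceMatcher(None, col_a.lower(), col_b.lower()).ratio()
                    if score >= threshold:
                        matches.append((f"{db_a}.{col_a}", f"{db_b}.{col_b}", round(score, 2)))
    return matches

dictionaries = {
    "CRM":     ["CUST_NAME", "CUST_ADDRESS", "PHONE"],
    "BILLING": ["CUSTOMER_NAME", "CUST_ADDR", "INVOICE_NO"],
}
print(look_alike_columns(dictionaries))
# -> pairs such as CRM.CUST_NAME / BILLING.CUSTOMER_NAME with their similarity score
```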

5 The added value of REVER

The approach proposed by REVER provides the following benefits:

- it separates the technical aspects from the organizational ones, making the elements to be measured for each of them clearer;
- it allows a controlled approach to identifying problems and their solutions, significantly simplifying the global process;
- it can be applied to one application or to all applications of the organization, making it possible to progressively extend the scope of the approach according to the expected results;
- last but not least, and this is not its least attractive aspect, it provides concrete solutions for measuring the quality of the different elements while supporting and integrating the two main market approaches.

6 To know more

Additional information regarding the methods and tools used by REVER is provided in the following documents:

- Reverse engineering of databases
- DB-MAIN