Reporting MDM Data Attribute Inconsistencies for the Enterprise Using DataFlux
Ernesto Roco, Hyundai Capital America (HCA), Irvine, CA

ABSTRACT
The purpose of this paper is to demonstrate how we use the DataFlux Data Management Studio platform to provide our organization with a tool that, at the entity record and attribute level, identifies inconsistencies between the MDM (Master Data Manager) master data and the same entity record as it exists in other contributing source systems. For this conference, I want to share several key aspects of our design. Our design made possible a weekly data inconsistency report covering millions of MDM cluster records within an enterprise MDM implementation, without significant overhead or development effort. First, we believe that our design, with its metadata-driven approach, can serve as a template for other organizations looking to implement the same capability for their enterprise MDM. Second, by using parallel processing nodes to maximize the utilization of available processing power, we have been able to significantly reduce overall processing time. Lastly, the design can produce output in different formats, which contributes to the overall usability of the tool. With the report output, we have been able to quickly identify areas where data inconsistencies exist and implement processes to minimize future occurrences.

INTRODUCTION
As you may already know, one of the main functions of an MDM solution is to provide the single best version of an entity record that is consistent and shared across the entire enterprise. At HCA, we believe that achieving a very high level of data consistency across all source systems is a high-priority task in our effort to maximize the ROI of our MDM solution.
As the MDM administrator for our organization, one of the ways I support this task is by providing the enterprise with a tool that reports back to the business the current state of MDM master data inconsistency across all source systems. Data Stewards can then use this report to take the appropriate action to resolve data inconsistencies. By working together with Data Stewards and infrastructure teams, we can identify any needed improvements within the system in order to minimize the recurrence of data inconsistencies. This paper assumes that the reader has a basic understanding of MDM principles and their purpose for the enterprise, along with some familiarity with the DataFlux Data Management Studio development platform.

THE DATA INCONSISTENCY PROBLEM
A data inconsistency exists for an entity record when one or more attributes of a master record do not match the same entity record and/or attribute existing in another source system.

SOURCE  FIRST_NM  LAST_NM  EMAIL           PHONE     SSN        ADDR         CITY    ST  ZIP
MDM     JOHN      MILLER   jm@yahoo.com    555-1213  345157896  55 BAY DR.   HEMET   CA  93478
CRM     ED        MILLER   jm@hotmail.com  555-1213  345157896  55 BAY DR.   HEMET   CA  93478
LEASE   JOHN      MILLER   jm@yahoo.com    555-1213  345157896  55 BAY DR.   HEMET   CA  93478
RETAIL  JOHN      MILLER   jm@yahoo.com    222-6789  345157896  123 ELM ST.  IRVINE  CA  92614

Table 1. A typical data inconsistency example for an HCA customer (Note: all data used in this presentation is for example purposes only and does not reflect actual HCA customer information)

In the example above, we have a customer that exists across all three of our source systems (CRM, LEASE, and RETAIL), where the first name and email attributes from CRM are inconsistent with the MDM values. The same customer also has inconsistent values in our RETAIL system for the phone and address fields.
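The comparison illustrated in Table 1 can be sketched in a few lines of Python (a hypothetical illustration, not the DataFlux implementation; the function name and record shape are assumptions for the example):

```python
# Hypothetical sketch: flag the attributes of a source-system record that differ
# from the MDM master record, using the Table 1 sample data.
MDM = {"FIRST_NM": "JOHN", "LAST_NM": "MILLER", "EMAIL": "jm@yahoo.com",
       "PHONE": "555-1213", "ADDR": "55 BAY DR.", "CITY": "HEMET",
       "ST": "CA", "ZIP": "93478"}

def inconsistent_attributes(source_record, master_record):
    """Return the attribute names whose source value differs from the master value."""
    return [attr for attr, value in source_record.items()
            if master_record.get(attr) != value]

retail = {"FIRST_NM": "JOHN", "LAST_NM": "MILLER", "EMAIL": "jm@yahoo.com",
          "PHONE": "222-6789", "ADDR": "123 ELM ST.", "CITY": "IRVINE",
          "ST": "CA", "ZIP": "92614"}
print(inconsistent_attributes(retail, MDM))  # ['PHONE', 'ADDR', 'CITY', 'ZIP']
```

Every attribute returned by such a comparison is one inconsistency that the tool must surface and a Data Steward must eventually resolve.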
With almost 4 million master data records, each having 65 attributes, we need an up-to-date and reliable way to identify each instance where a data inconsistency exists and to minimize problems going forward.
Admittedly, data inconsistency problems are a reality of having an MDM solution, particularly for hybrid MDM implementations with complex real-time integration services and multiple source systems. Even with the best intentions and procedures in place, network, hardware, code errors, and other system failures can occur, causing outages and service interruptions. Because of these outages, transactions may be processed successfully in the source systems yet never reach MDM. As a result, records are not mastered and/or are not propagated back to the source systems. In addition, updates to entity records outside of the normal accepted processes can and will occur; these may bypass MDM processes, creating another source of data inconsistency. The challenge is to keep the number of data inconsistencies as low as possible while taking both proactive and reactive action. This is a work-in-progress type of data maintenance task that requires continual effort across multifunctional technical MDM support, infrastructure, source system administrator, and Data Governance teams. If left unchecked, these inconsistencies may remain in an incorrect state, which has a negative impact on our customer service, billing, correspondence, and collections functions and, more importantly, on state and federal compliance and regulatory functions.

MAIN CHALLENGES
Here are some of the main challenges in developing an attribute inconsistency tool:
1. Programming/Coding
Comparing 65 attributes across multiple source systems can be a very tedious and cumbersome task. Typing up hundreds of these attribute comparisons also increases the likelihood of developers introducing a coding error into the tool.
2. Maintenance
Hard-coded comparison logic is problematic to maintain in the event that attribute changes occur between MDM and/or any of the 3 source systems, which does occasionally happen.
3. Processing Limitations
Comparing 65 attributes across 3 different source systems with 4 million records may take several days to process, not hours. Our requirement is to have the processing completed within 24 hours for this tool.

OUR SOLUTION OVERVIEW
Display 1. Main Process Job for Attribute Inconsistency Report
1. Staging
This is where the de-normalized data from the source systems is extracted and then staged into the MDM staging area.
2. Parallel Compare
Parallel processing is used to compare attributes from all source systems against MDM attributes, improving performance and greatly reducing total processing time.
3. Attribute Exceptions
This process inserts and formats the results of all the comparisons, converting them from a record-level to an attribute-level format. This greatly enhances our ability to quickly spot problem areas at an attribute level, much like a spreadsheet pivot table function would, but with additional features specific to our particular needs.

The focus of this paper will be on items 2 and 3 only, since this is where our solution can be used as a template for other enterprises to follow.

METADATA APPROACH
DataFlux provides out of the box four Expression Engine Language (EEL) scripting functions that we used to implement this approach: for(), fieldcount(), fieldname(), and fieldvalue(). Used in combination, these four functions give us the ability to loop through each column in the data stream while bypassing the need to hard code each column comparison. This metadata approach solves the first two main challenges identified above. Please note that each record in the data stream is a join between a single source system entity record and an MDM master record, based on our specific MDM clustering criteria and the source system record's unique identifier. With 65 attributes for the MDM master record and the same 65 attributes coming from the entity record in the source system, we have a total of approximately 130 columns in the data stream.

Display 2. Metadata Approach in Action (step 1)

Step 1: (Loop through each master record attribute)
In the example above, we demonstrate how we utilized a for function to loop as many times as there are columns in the data stream, as returned by the fieldcount function. The fieldname function, which returns the column name, is then used to filter out column names that are not required for our attribute comparison.
The fieldvalue function is then used to extract the master record attribute value of the column for comparison against the corresponding source system record attribute value. This first step populates a variable we named gr_value with the value of the master record attribute being compared, and another variable, named gr_name, with the master record column name. For this example, we will use CUSTOMER_FIRST_NAME_GR as the column name of the master record attribute; the same attribute from the source record appears in the data stream under the column name CUSTOMER_FIRST_NAME, which is essentially the same column name minus the trailing characters _GR. This was done so we can programmatically
differentiate between the master record and source record versions of the same column. In doing so, we can then compare the two as a single attribute of the master entity record.

Step 2: (For this example, loop until the source column CUSTOMER_FIRST_NAME is found)
Display 3. Metadata Approach in Action (step 2)

In step 2, we are in a sub-loop from step 1, where for the purposes of this example we loop through each column until we match the column name CUSTOMER_FIRST_NAME. Once we get a match on the column name, we use the fieldvalue function again to extract the value of the column just before comparison, populating the variable non_gr_value. The comparison to determine whether an inconsistency exists is now simply the condition gr_value = non_gr_value. If the values match, the actual field value for CUSTOMER_FIRST_NAME is replaced with the string <match>; conversely, if there is a mismatch, the inconsistent attribute value is left intact and the user variable errors is incremented by 1. This variable is initialized to 0 at the beginning of each record's processing in the data stream. The errors variable is used by the program to determine whether the current data stream record (again, a single entity record joining the MDM master entity with the same record existing in a source system) contains any attribute inconsistencies. If errors > 0, the record is inserted into the inconsistencies table; if not, processing moves on to the next record. As you can see, by using the metadata approach described above, we are able to perform every single attribute comparison in the data stream without having to hard code a single column-to-column comparison. This has been an instrumental DataFlux feature for us in meeting our business and technical requirements for this tool, while addressing the first two main challenges listed above.
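The EEL logic of steps 1 and 2 can be sketched in Python roughly as follows (a simplified illustration, not the actual DataFlux code; the _GR suffix convention, the <match> marker, and the errors counter come from the description above, while the function name and dictionary record shape are assumptions for the example):

```python
# Simplified Python sketch of the metadata-driven comparison (steps 1 and 2 above).
# A "record" is the joined data stream row: master columns end in _GR, and each has
# a matching source column with the same name minus the _GR suffix.
MATCH_MARKER = "<match>"

def compare_record(record):
    """Compare every master (_GR) attribute against its source counterpart.

    Returns the modified record (matching source values replaced by "<match>")
    and the number of inconsistencies found, mirroring the `errors` variable.
    """
    errors = 0  # initialized to 0 at the start of each record, as in the EEL script
    for gr_name in list(record):          # analogous to for() over fieldcount()/fieldname()
        if not gr_name.endswith("_GR"):
            continue                      # filter out columns not needed for comparison
        non_gr_name = gr_name[:-3]        # CUSTOMER_FIRST_NAME_GR -> CUSTOMER_FIRST_NAME
        if non_gr_name not in record:
            continue
        gr_value = record[gr_name]        # analogous to fieldvalue() on the master column
        non_gr_value = record[non_gr_name]
        if gr_value == non_gr_value:
            record[non_gr_name] = MATCH_MARKER
        else:
            errors += 1                   # leave the inconsistent value intact
    return record, errors

row = {"CUSTOMER_FIRST_NAME_GR": "JOHN", "CUSTOMER_FIRST_NAME": "ED",
       "CUSTOMER_LAST_NAME_GR": "MILLER", "CUSTOMER_LAST_NAME": "MILLER"}
row, errors = compare_record(row)
print(errors)                         # 1 -> record would go to the inconsistencies table
print(row["CUSTOMER_LAST_NAME"])      # <match>
```

Because the loop is driven entirely by the column metadata, adding or removing an attribute changes nothing in this code, which is the maintenance benefit described above.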
PARALLEL PROCESSING
Display 4. DataFlux parallel processing node in action

Another DataFlux out-of-the-box tool is the Fork node. In order to meet our processing time requirements, we had to find a way to utilize more of the processing power available to us and reduce the overall processing time. By using the Fork node, we were able to split the work into 9 separate parallel processes, enabling us to complete the processing in a matter of hours, not days. In the future, we may re-evaluate the number of processes, depending on the amount of data we have to process, which grows larger every day.

Display 5. DataFlux parallel processing node splits into 9 parallel processes above

If we drill down on the Parallel_Compare Fork node above (Display 4), it takes us to what you see in Display 5. This sub-process contains the nine processes we developed to execute attribute comparisons in parallel across the 3 source systems. At this time, our best available option for splitting the work into nine similarly sized batches is to use the first digit of the customer ID (a numeric, non-sequential string ranging from 0-9); statistically, this gives us a very reliable way of creating several batches without losing any records, while maintaining relatively the same distribution ratio of records today and into the future.
As you can also see from our screen capture above, our three source systems have different record volumes to process. Our CRM source system has significantly more records, thus requiring more parallel processes to complete in the same amount of time versus the other two source systems. Another out-of-the-box node that we used to facilitate parallel processing is the Global Get/Set node, which is used as a placeholder to pass the hard-coded first digit of the Customer ID to the SQL used to retrieve the source entity records from the staging tables. Each process operating in parallel uses a custom SQL query to retrieve its records from the source system, as shown in the example below:

Display 6. Sample SQL used to retrieve records for source entity records in staging area

Each one of the nine nodes running in parallel uses a SQL query similar to what is shown in Display 6 above to extract data for processing from the staging tables. As demonstrated above, the hard-coded first digit of the Customer ID is passed as a parameter value from %%CUST_ID%% inside an IN clause. The value contained in the parameter CUST_ID comes from the Get/Set node, giving us as close to an even-sized distribution of records as possible. By using the Fork node to process our comparisons in parallel across our three source systems, we are able to meet the time requirement of completing the attribute inconsistency report processing every week. This functionality addresses the third main challenge identified above. There are many ways an organization can get creative in choosing how to implement its own parallel processing strategy. The goal must be to keep the distribution of records across the parallel nodes as even as possible, while meeting the processing time requirements and remaining scalable for the future.
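The partitioning idea can be illustrated outside of DataFlux as well. The following is a hypothetical Python sketch (not the DataFlux job itself; the batch-to-digit assignments, table name, and function names are assumptions for the example) of splitting work by the first digit of the customer ID and running the batches concurrently:

```python
# Hypothetical sketch of first-digit partitioning for parallel comparison work.
# Each worker owns a set of leading digits, mirroring the hard-coded first digit
# passed via the Get/Set node into the SQL IN clause (%%CUST_ID%%).
from concurrent.futures import ThreadPoolExecutor

def build_batch_sql(digits):
    """Build the per-worker staging query; loosely mirrors Display 6's IN clause."""
    in_list = ", ".join(f"'{d}'" for d in digits)
    return f"SELECT * FROM STAGING WHERE SUBSTR(CUST_ID, 1, 1) IN ({in_list})"

def run_batch(digits, records):
    """Stand-in for one fork-node branch: process only this batch's records."""
    return [r for r in records if r["CUST_ID"][0] in digits]

# Nine batches over the digits 0-9 (an assumed split; in practice the batches are
# sized so the largest source system gets proportionally more branches).
batches = [("0",), ("1",), ("2",), ("3",), ("4",), ("5",), ("6",), ("7",), ("8", "9")]
records = [{"CUST_ID": "4523"}, {"CUST_ID": "9120"}, {"CUST_ID": "0077"}]

with ThreadPoolExecutor(max_workers=len(batches)) as pool:
    results = list(pool.map(lambda b: run_batch(b, records), batches))

print(sum(len(r) for r in results))   # 3 -> no records lost across the batches
print(build_batch_sql(("8", "9")))
```

The key property, as noted above, is that the union of the batches covers every record exactly once, so parallelism never drops or duplicates a comparison.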
As a cautionary step, an enterprise preparing to implement a similar parallel processing approach should perform some benchmarking tests to determine the optimum number of processes its MDM infrastructure can support. At some point, too many parallel processes running at one time can overwhelm the servers, causing diminishing returns. There are also further options for exploiting parallel processing here at HCA. With our two-server DM Server configuration, we have the option to double the number of processes we can run at one time from nine to eighteen, simply by utilizing both servers to process the data. It is good to know that there is still room for growth as we scale up to meet our business requirements now and into the future.

REPORTING OUTPUT
At HCA, we report the inconsistencies found for each source entity record in two stages: first at the record level, and then, as if drilling down from that record, at the attribute level. Immediately after each comparison between a source entity record and an MDM master data record where an inconsistency is found to exist for one or more attributes, we insert the entire data stream contents into a record exceptions table, as shown in the example below.
SOURCE  FIRST_NM_MD  FIRST_NM  LAST_NM_MD  LAST_NM  EMAIL_MD         EMAIL              PHONE_MD  PHONE     EXC_ID
CRM     MARY         MARY      JANE        SMITH    mjane@abc.com    msmith@abc.com     778-4578  777-1245  111
CRM     BOB          ROBERT    DOE         DOE      rdoe@xyz.com     rdoe@xyz.com       867-5309  867-5309  112
RETAIL  JOE          JOE       KERR        KERR     jkerr@yahoo.com  joker@hotmail.com  555-7878  555-7878  113

Table 2. Record level exceptions table example

Record Level Exceptions Table Details
Much like the data stream naming convention, the column naming convention used for the record level exceptions table gives the attribute name from the MDM record a trailing _MD to denote that the attribute comes from the master data, while the source column name of the same attribute has no suffix. For example, FIRST_NM_MD is the MDM attribute name, while FIRST_NM without the _MD is the source entity name of the same attribute. Included in the record level exceptions table are the EXC_ID (Exception ID) column, which is a unique key for each record in the table; SRC_REC_ID (not shown in the example), which is the primary key of the entity record from the source system; and a flag column X_PASSED, which is the status of the row with the following 4 possible values:
1. Missing Account Cluster Record - an account record exists in the source system but not in MDM
2. Missing Customer Cluster Record - a customer record exists in the source system but not in MDM
3. Missing both Account and Customer Cluster Records - both records are NOT in MDM
4. Inconsistent attribute exists - at least one attribute of the master data is inconsistent with the source system

If none of the above conditions apply to the current record, then no insert into the exceptions table takes place, as no data issue was found for the specific record.
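The X_PASSED status decision can be sketched as follows (a hypothetical Python illustration; the status labels come from the list above, while the function signature and boolean inputs are assumptions for the example):

```python
# Hypothetical sketch of the X_PASSED status decision described above.
def classify(account_in_mdm, customer_in_mdm, error_count):
    """Return the X_PASSED status for a record, or None when no issue was found."""
    if not account_in_mdm and not customer_in_mdm:
        return "Missing both Account and Customer Cluster Records"
    if not account_in_mdm:
        return "Missing Account Cluster Record"
    if not customer_in_mdm:
        return "Missing Customer Cluster Record"
    if error_count > 0:
        return "Inconsistent attribute exists"
    return None  # no insert into the exceptions table

print(classify(True, True, 3))   # Inconsistent attribute exists
print(classify(True, False, 0))  # Missing Customer Cluster Record
print(classify(True, True, 0))   # None
```

A None result corresponds to the clean case above: the record produces no exceptions row at all.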
Record Level Attribute Inconsistency Report
In the first example record above, you will see that HCA customer MARY JANE may have had a name change submitted in CRM, where customer information such as the email and phone was also updated along with the name columns. For some reason, this successful update from the CRM source system never reached MDM and thus was also never propagated back to the other 2 source systems. The result is a customer record that is updated only in CRM, while MDM and the 2 other source systems now hold outdated customer information. In the second example, we have a CRM entity record where the MDM FIRST_NM value is inconsistent and everything else is a match. In the last example, the RETAIL source system's EMAIL attribute is the only inconsistency with MDM. This could be an instance where the update was successful in CRM and reached MDM successfully, but did not propagate from MDM back to the other source systems as it was supposed to. This is also a typical scenario for us and does occasionally happen due to various network issues or data anomalies. We have also used this reporting table to obtain various statistics, such as the number of records that have inconsistent attributes with MDM versus the entire data set, or counts for any of the 4 possible outcomes listed above. We routinely deliver to the business the top 10 inconsistent attributes by source system to highlight any emerging patterns. As you can see, we can also tell when an entity record exists in the source system but has no MDM master record. These capabilities bring additional value by helping identify data issues where MDM is missing entire master records.
EXC_ID  MD_CLUSTER_ID  ATTR_NAME   CRM_VALUE        RETAIL_VALUE       LEASE_VALUE    MDM_VALUE        SRC_REC_ID
111     4523           LAST_NAME   SMITH            JANE               JANE           JANE             6901
111     4523           EMAIL       msmith@abc.com   mjane@abc.com      mjane@abc.com  mjane@abc.com    6901
111     4523           PHONE       777-1245         778-4578           778-4578       778-4578         6901
112     3698           FIRST_NAME  ROBERT           <null>             BOB            BOB              5743
113     2079           EMAIL       jkerr@yahoo.com  joker@hotmail.com  <null>         jkerr@yahoo.com  3763

Table 3. Attribute level exceptions table example
Attribute Level Exceptions Table Details
Included in the attribute level report are columns showing the entity attribute value from each source system alongside the MDM master data value. Since a record level exceptions table record is only a direct comparison between a single MDM value and a single source entity value, an additional query must be run for each record during attribute level exception reporting, in order to fetch the attribute values from the other 1 or 2 source systems. The record level table primary key (EXC_ID) is also included so we can link each attribute level record back to the record level exceptions table.

Attribute Level Inconsistency Report
Immediately after the record level exceptions table is populated, the next step processes its contents, breaking them down into an attribute level reporting output format. We use standard DataFlux functionality to accomplish this task, which at this time does not utilize parallel processing like the attribute comparisons do. However, this may be a consideration for the near future, as the number of master records continues to grow by almost 50% per year. Table 3 drills down from the record level exceptions table to the attribute level exceptions table. Table 2 contains three distinct record level sample exceptions, which break down into five distinct attribute level exception records. Each record in the attribute exceptions table represents a single attribute from a record level exception. For example, record level exception EXC_ID = 111 contains three CRM attributes inconsistent with MDM (LAST_NAME, EMAIL, and PHONE), and each of these becomes one record in our attribute level exceptions output.
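The record-to-attribute breakdown amounts to an unpivot operation, sketched below in Python (a simplified illustration under the _MD naming convention described earlier, not the DataFlux job; it omits the additional per-source lookup query, and the function name and output column names are assumptions):

```python
# Simplified sketch: break one record-level exception row into attribute-level rows.
# Master columns carry the _MD suffix; only attributes whose source value differs
# from the master value produce an attribute-level record.
def to_attribute_level(exc_row, key_cols=("SOURCE", "EXC_ID")):
    attr_rows = []
    for col, src_value in exc_row.items():
        if col.endswith("_MD") or col in key_cols:
            continue
        md_value = exc_row.get(col + "_MD")
        if src_value != md_value:
            attr_rows.append({"EXC_ID": exc_row["EXC_ID"], "ATTR_NAME": col,
                              "SRC_VALUE": src_value, "MDM_VALUE": md_value})
    return attr_rows

exc_row = {"SOURCE": "CRM", "EXC_ID": 111,
           "LAST_NM_MD": "JANE", "LAST_NM": "SMITH",
           "EMAIL_MD": "mjane@abc.com", "EMAIL": "msmith@abc.com",
           "PHONE_MD": "778-4578", "PHONE": "777-1245"}
for r in to_attribute_level(exc_row):
    print(r["ATTR_NAME"], r["SRC_VALUE"], r["MDM_VALUE"])
# LAST_NM SMITH JANE
# EMAIL msmith@abc.com mjane@abc.com
# PHONE 777-1245 778-4578
```

As in the EXC_ID = 111 example above, one record-level exception with three inconsistent attributes yields three attribute-level rows.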
Each record in our attribute level report has the CRM, RETAIL, LEASE, and MDM values for each attribute that is inconsistent. This format greatly enhances our Data Stewards' ability to quickly identify which source system(s) is inconsistent with MDM.

Main Steps for Attribute Level Exception Reporting
1. Use a SQL input node to extract data from the record level exceptions table, as shown below.
Display 7. Input SQL node to extract data from record exceptions table
A simple SQL query is used to extract each record for attribute level processing.
2. Use an Expression node to extract the attribute values by source system along with the MDM value.
Display 8. Populate variables
The Expression node script on the left is used to populate three variables, preparing the data from MDM and the source systems for each attribute. Note that the crm_value variable on the left is still a null value; it will be populated later in step 3.
3. Use the Expression node again to run a dynamic SQL statement that extracts the CRM attribute value and populates the variable crm_value. Since the attribute name is held in a variable, a dynamic SQL statement is used, as shown in Display 9.
Display 9. Extract CRM value of the attribute using dynamic SQL
4. Use Clustering and Survivorship nodes to guarantee the uniqueness of records inserted into the attribute level reporting table.
Display 10. Clustering and Survivorship nodes ensure only unique records are inserted
The combination of the attribute ID and the master data cluster ID should be unique prior to inserting the data stream into the attribute level reporting table. A duplicate is quite possible when an attribute is inconsistent in more than one source system. Since we already report the attribute value for all source systems along with the MDM value, there is no need to insert more than a single occurrence of the attribute ID and master data cluster ID combination into the table.
5. Insert the records into the attribute level exceptions table using a Data Insert output node.

CONCLUSION
DataFlux Data Management Studio has given us the necessary tools to implement a very reliable and accurate attribute inconsistency reporting tool, though it was up to us to come up with creative ways to harness the power of the platform to overcome our technical challenges. Although the techniques demonstrated here are specific to our own business requirements, they can easily be retrofitted to fit other MDM solutions as well. Additionally, the MDM solution does not have to be the DataFlux qMDM product for these techniques to work. Implementing MDM was a significant expense and undertaking for our organization, and quite possibly for others as well. As the MDM administrator, I am committed to doing whatever I can to get the most value from our investment. This is just one way we as MDM administrators have contributed toward that goal.

CONTACT INFORMATION
Your comments and questions are valued and encouraged. You may contact the author at:
Name: Ernesto Roco
Enterprise: Hyundai Capital America
Address: 3161 Michelson Drive, Suite 1900
City, State, ZIP: Irvine, CA 92612
Work Phone: 949-526-7728
Fax:
E-mail: eroco@hcamerica.com
Web: http://www.hyundaicapitalamerica.com/

SAS and all other SAS Institute Inc.
product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies.