How Matching Technology Improves Data Quality

White Paper
Table of Contents

How Matching Technology Improves Data Quality
What is Matching?
Benefits of Matching
Matching Use Cases
What is Matched?
Standardization before Matching
Matching Technology
Probabilistic
Deterministic versus Probabilistic
Blocking
Matching Process
What to do with Matches
Conclusion
About Talend Data Quality
How Matching Technology Improves Data Quality

Enterprise applications like Master Data Management, Customer Data Integration, and Data Warehouse projects rely on clean, duplicate-free data to be truly effective business tools. Companies have sought the single source of truth in their customer data, transactional data and even in their metadata to make these applications most effective. Matching plays an important role in achieving a single view of customers, parts, transactions and almost any other type of data.

For decades, software vendors and computer scientists have devised strategies and technologies for finding relationships within data. Some of the first published work on matching strategies appeared as early as 1946, when Halbert Dunn, MD, Chief of the National Office of Vital Statistics for the U.S. Public Health Service, wrote a paper describing how the pages of a person's medical records could be linked to create a "Book of Life." It was an idea ahead of its time, since the technology of the era was certainly not up to the task. In 1969, Ivan Fellegi and Alan Sunter formalized probabilistic matching techniques in a breakthrough research paper. Over the years, others have refined algorithms for matching records. The strategies for matching records are mature and well-documented, although not always simple. This white paper looks into the topic of matching: what it is, how it works, and the different methods for matching data.

What is Matching?

Matching is the process of putting together similar or identical records in order to identify or remove duplicates from the data.
Matching is often used to link together records that have some sort of relationship. Since data doesn't always tell us the relationship between two data elements, matching technology lets us define rules for items that might be related. How relationships are interpreted and used, whether in business-to-business data or in business-to-consumer data, depends on the context of the project and the needs of the business users.

Commonly, corporations use matching to remove duplicate customer records and thereby optimize marketing programs, but there are many uses for this optimized data beyond marketing. Data may contain business names rather than households, and relationships need to be created between IBM Corp., International Business Machines, and I.B.M., for example. Data to be matched may also come from supply chain and ERP systems, where the matcher relies on patterns to match, for example, the part numbers XL-12345 and XL123-45. Data may be descriptive, where the matcher needs to find a relationship between "Frozen Carrots" and "Car, Frzn". The identification of these relationships contributes to the organization's data management strategy and effectiveness.

Matching is also called linking, because the end result of finding two related records in your data might not be to delete one of the records or to combine two or more records. Rather, the solution to understanding your customers may simply be to link the records together using keys. A family that lives in one household and does business with one bank is an example: its members' records should remain separate, but linked by household.
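As a rough sketch of linking rather than merging, the snippet below assigns a shared household key to records that share a surname and address, while leaving the individual records intact. The record layout and the surname-plus-address rule are illustrative assumptions, not a prescribed design.

```python
# Toy customer records: distinct individuals who share an address.
records = [
    {"id": 1, "first": "John",  "last": "Doe", "address": "12 Oak St"},
    {"id": 2, "first": "Jane",  "last": "Doe", "address": "12 Oak St"},
    {"id": 3, "first": "Alice", "last": "Lee", "address": "9 Elm Ave"},
]

# Assign a shared household key to records with the same surname and
# address; the individual records stay separate, but are now linked.
household_ids = {}
for rec in records:
    key = (rec["last"].lower(), rec["address"].lower())
    if key not in household_ids:
        household_ids[key] = len(household_ids) + 1
    rec["household_id"] = household_ids[key]

for rec in records:
    print(rec["id"], rec["first"], rec["last"], "-> household", rec["household_id"])
```

John and Jane Doe end up linked under one household key while remaining separate records, which is exactly the outcome the banking example above calls for.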
Benefits of Matching

All corporations can benefit from matching, because the benefits are plentiful. The top benefits include:

Billing and Credit - Removal of duplicate records and householding can lead to benefits such as unified billing, accurate revenue accounting, accurate contract billings, unified credit management and reduced mailing costs.

Direct Mail/Marketing - Companies can decrease the costs associated with direct mail by mailing one and only one promotion to any given household.

Relationships with Customers - Organizations with more accurate, duplicate-free records have better relationships with their customers.

Supply Chain and Inventory Efficiency - By matching inventory in a warehouse that is physically identical, but seemingly unconnected in the database, the company can lower carrying costs. There is less inventory to put away, less space to rent, lower insurance and taxes on inventory, lower costs to physically count inventory and lower risk from obsolete inventory.

Vendor Cost Savings - With cleaner vendor and inventory data, buyers have more accurate information on the amount purchased from any given vendor. Armed with accurate data, the buyer can apply pressure on the vendor to lower costs.
Overall Corporate Efficiency - With duplicate-free data, users are more likely to adopt systems that will improve corporate efficiency. Storing fewer gigabytes of more accurate data is also simply more efficient.

Matching Use Cases

Matching software is commonly used in several different configurations, including but not limited to:

One-time matching project - Companies perform a one-time removal of duplicates from a single database, or a one-time linking of two or more databases.

Real-time single database - Often accomplished with an initial identification of duplicates, then real-time matching to ensure that no duplicate records are added to the database (see the sketch after this list).

Matching of multiple databases at a regular interval - Most commonly part of a data warehouse, where data is loaded nightly into the warehouse to support business intelligence metrics.

Linking multiple databases via a master index - The flow of data enters a central master data management hub. In such a configuration, data quality is a service, called by the master data management application.
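As a hedged illustration of the real-time, single-database configuration, the sketch below checks each incoming record against the existing data before inserting it. Reducing a record to a normalized name-and-email key is a deliberate oversimplification of what a real matcher would do.

```python
def normalize(rec):
    """Reduce a record to a comparable key (toy rule: casefolded name and email)."""
    return (rec["name"].casefold().strip(), rec["email"].casefold().strip())

def insert_if_new(rec, database, seen_keys):
    """Insert rec only if no existing record normalizes to the same key."""
    key = normalize(rec)
    if key in seen_keys:
        return False  # likely duplicate: route to review instead of inserting
    database.append(rec)
    seen_keys.add(key)
    return True

existing = [{"name": "Bob Smith", "email": "bob@example.com"}]
seen = {normalize(r) for r in existing}

print(insert_if_new({"name": "BOB SMITH", "email": "bob@example.com"}, existing, seen))  # False
print(insert_if_new({"name": "Ann Jones", "email": "ann@example.com"}, existing, seen))  # True
```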
What is Matched?

Powerful matching technology will match a variety of types of data, including but not limited to:

Individuals - Individuals living at the same address. The software finds individuals even if the data is mistyped or if nicknames are involved; it knows that Bob and Rob are derived from the same root name.

Households - Members of the same household living at one address. This is especially powerful for finding the head of household and contacting only one person with marketing offers, for example.

Businesses - Companies with the same or similar names, with the ability to recognize EMC, E.M.C and E M C Corp as the same company, for example.

Inventory or Supply Chain Items - Companies looking to consolidate parts and items with the same or similar names. The software understands that "bolt, one half inch" might be a match for "1/2 bolt", or that "carrots, frozen" might be the same as "Frz Car". Since this data varies so much from industry to industry, some standardization may be necessary before finding these types of matches.

Standardization before Matching

Nearly all experts agree that standardization is absolutely necessary before matching. Standardization is a process by which an agreed standard is defined for any type of data. The rules are offered as part of a business rules engine and applied to the data.
For example, the postal services in various countries offer standards for name and address data. One such standard in the United States is that in addresses the word "street" is always abbreviated "ST", not "Str" or "Street". Users may opt to standardize data shapes, too. In an ERP or supply chain system, for example, a company may decide to always designate part numbers as NN-AAAAA, where N is a number and A is an alphanumeric character. In this scenario, part number 12-HGAJS would be valid, while 12HGAJS_2 would be subject to standardization.

For name data, nicknames can also be problematic. Attempting to match Steve with Stephen, for example, requires standardization. The strategy here is to create a root name attribute in the database that stores the root of Steve, which is Stephen. This keeps the original names intact in the database, but gives Steve Williams and Stephen Williams the opportunity to match.

Data standardization can be achieved with Talend Data Quality by using your own business rules, regular expressions and even public domain and government sources like the US Census, data.gov and geonames.org. This standardization is integral not only to data quality, but to the effectiveness of master data management, CRM, ERP and many other business applications.

Standardization also helps when data is misfielded, for example, when a name is inadvertently typed into an address line. This commonly occurs during data migration, especially from legacy systems where data tends to be less structured. Some billing systems, for example, may contain "Attn: Accounts Payable" on various lines of the database, and it is up to standardization to sort this out. By the same token, similar records won't come together when comparing "Steven A. Smith" to "25 Main St.". Profiling the data ahead of time is the best way to ensure that the correct data exists in the correct field and that apples are being compared to apples.

The standardization process improves matching results even when implemented alongside very simple matching algorithms. More exact matches will exist once addresses have been standardized and root names have been found, and once part numbers and descriptions have been standardized. In combination with advanced matching techniques, standardization improves information quality even more.
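A minimal sketch of these standardization ideas, using toy reference data; a production system would draw on postal standards, census name tables, and industry part-numbering rules rather than the hand-built tables below.

```python
import re

# Toy reference data standing in for real standardization rules.
NICKNAME_ROOTS = {"steve": "Stephen", "bob": "Robert", "rob": "Robert"}
STREET_ABBREVIATIONS = {"street": "ST", "str": "ST", "avenue": "AVE"}
PART_SHAPE = re.compile(r"^\d{2}-[A-Z0-9]{5}$")  # the NN-AAAAA shape from the text

def root_name(first_name: str) -> str:
    """Return the root name stored alongside the original, e.g. Steve -> Stephen."""
    return NICKNAME_ROOTS.get(first_name.casefold(), first_name)

def standardize_address(address: str) -> str:
    """Replace street-type words with their standard abbreviations."""
    words = [STREET_ABBREVIATIONS.get(w.casefold(), w) for w in address.split()]
    return " ".join(words)

def conforms_to_part_shape(part_number: str) -> bool:
    """Check a part number against the agreed NN-AAAAA shape."""
    return PART_SHAPE.match(part_number) is not None

print(root_name("Steve"))                      # Stephen
print(standardize_address("25 Main Street"))   # 25 Main ST
print(conforms_to_part_shape("12-HGAJS"))      # True
print(conforms_to_part_shape("12HGAJS_2"))     # False
```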
Matching Technology

After standardization, Talend Data Quality matching uses algorithms to determine when two or more records match. It identifies matching records referring to the same business, household, individual/contact, product, etc., and identifies relationships linking a contact to a business, an individual to a household, a product to a product class, and so on. The strength of matching technology is defined by how powerful its algorithms are at establishing the match. Solutions offer powerful routines specially designed to compare names, addresses, strings and partial strings, business names, spelling errors, postal codes, tax ID numbers, data that sounds similar such as "Phig" and "Fig", and more.

There are two common types of matching technology on the market today: deterministic and probabilistic. Deterministic, or rules-based, matching is where records are compared using fuzzy algorithms. The various algorithms tolerate a little bit of slop in the data, so that if there are typos or phonetic similarities (like "ph" and "f"), the algorithms can still identify linkage. Ultimately, the user decides which fields to compare and what algorithm to use on each. Each field can carry a weight, so that a user might decide that tax ID number (social security number) counts for more than last name, for example.
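A hedged sketch of field weighting: the field names, the weights, and the exact-compare rule are illustrative stand-ins, and a real engine would combine such weights with the fuzzy algorithms described below.

```python
# Field weights: the user judges tax ID far more telling than last name.
WEIGHTS = {"tax_id": 0.6, "last_name": 0.25, "zip": 0.15}

def weighted_score(a: dict, b: dict) -> float:
    """Fraction of total weight carried by fields that agree exactly."""
    agreeing = sum(w for f, w in WEIGHTS.items() if a.get(f) and a.get(f) == b.get(f))
    return agreeing / sum(WEIGHTS.values())

rec1 = {"tax_id": "123-45-6789", "last_name": "Smith", "zip": "02139"}
rec2 = {"tax_id": "123-45-6789", "last_name": "Smyth", "zip": "02139"}
print(weighted_score(rec1, rec2))  # 0.75: tax ID and ZIP agree, surname differs
```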
The user can choose from among these common fuzzy matching algorithms:

Exact Match - An exact matching algorithm finds exact duplicates: Smith will match Smith, and only Smith, with no variations. After records have been standardized, a certain number of new exact matches should come as a natural result.

SoundEx - Used to index names in the US census as early as the 1930s, SoundEx is a phonetic algorithm for indexing names by sound, as pronounced in English. The algorithm mainly encodes consonants; a vowel is not encoded unless it is the first letter. Improvements to SoundEx are the basis for many of the modern phonetic algorithms that followed.

Metaphone and Double Metaphone - Recognizing that SoundEx was limited, Metaphone was developed in the 1990s, using a larger set of rules for English pronunciation. Later, Double Metaphone was developed to provide even more power. It returns both a primary and a secondary code to account for the many variations of surnames with common ancestry. For example, encoding the name "Smith" yields a primary code of SM0 and a secondary code of XMT, while the name "Schmidt" yields a primary code of XMT and a secondary code of SMT; both have XMT in common. Double Metaphone does a better job because it uses a much more complex rule set than SoundEx or Metaphone.

Levenshtein - In the 1960s, the Russian scientist Vladimir Levenshtein devised the Levenshtein distance, a measure of the similarity between two strings: the minimum number of deletions, insertions, or substitutions required to transform one into the other. Users define the maximum distance they will accept as a match. For example, the distance between Smith and Smith is 0, because no transformations are needed. The distance between Smith and Smyth is 1, because one substitution is needed; Smith and Smythe have a distance of 2, and so on.

Jaro-Winkler - Jaro-Winkler is similar in function to Levenshtein, in that it measures the differences between strings. However, characters at the beginning of the string are given more weight than those at the end. Jaro-Winkler delivers a score between 0 and 1, with 1 being a perfect match.
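To make the Levenshtein entry concrete, here is a straightforward dynamic-programming implementation that reproduces the distances from the examples above.

```python
def levenshtein(s: str, t: str) -> int:
    """Minimum number of insertions, deletions, or substitutions
    needed to turn s into t (classic dynamic-programming form)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,         # deletion
                            curr[j - 1] + 1,     # insertion
                            prev[j - 1] + cost)) # substitution
        prev = curr
    return prev[-1]

print(levenshtein("Smith", "Smith"))   # 0
print(levenshtein("Smith", "Smyth"))   # 1
print(levenshtein("Smith", "Smythe"))  # 2
```

Python's standard library also offers difflib.SequenceMatcher for a related similarity ratio, and dedicated packages provide faster implementations of both Levenshtein and Jaro-Winkler.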
Probabilistic

The second category of matching technology is called probabilistic, based on the very theories that Fellegi and Sunter wrote about back in 1969. The intricacies of probabilistic matching run well beyond the scope of this white paper; however, statistical analysis and advanced algorithms are key to its success. The algorithm is smart enough to know that a common last name like Jones should play a smaller role in matching than a less common last name like Jimmerson. How does it know? Probabilistic matching technology performs statistical analysis on the data, determining the frequency of each value. It then uses that analysis to weight the match, much the way a user applies weights to fields in deterministic matching.
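The full Fellegi-Sunter model is beyond the scope of this paper, but the frequency intuition can be sketched crudely: the rarer a value, the more evidence its agreement provides. The log-inverse-frequency weight below is an illustrative simplification, not the actual model.

```python
import math
from collections import Counter

# Toy surname distribution: Jones is common, Jimmerson is rare.
surnames = ["Jones"] * 500 + ["Smith"] * 400 + ["Jimmerson"] * 2
counts = Counter(surnames)
total = len(surnames)

def agreement_weight(value: str) -> float:
    """Rarer values are stronger evidence that two records truly match."""
    return math.log(total / counts[value])

print(round(agreement_weight("Jones"), 2))      # ~0.59: common, weak evidence
print(round(agreement_weight("Jimmerson"), 2))  # ~6.11: rare, strong evidence
```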
Deterministic versus Probabilistic

Data quality solutions often offer both types of matching, since one is not necessarily superior to the other. While deterministic matching doesn't take a holistic view of the data set, it will produce a good many matches and is much easier for data management professionals to understand and tune. Probabilistic matching may be superior in its holistic view of the data, but the ability to understand and track why records did or did not match is hindered by its complex algorithm. If you are doing real-time matching, with one incoming record compared against a master data set, deterministic matching also offers performance benefits: probabilistic matching relies on statistical analysis of the data, and that can slow real-time jobs.

Blocking

No matter which algorithms you use, comparing a large number of records against themselves to find matches is a daunting task, both in resources and in time. The number of comparisons grows quadratically with the size of the data: a million rows of data means roughly half a trillion record pairs. Even comparing a single record against a large database takes significant time. That's why most software vendors recommend first making blocks, or grouping keys, part of the matching process. By creating a key so that only records with some basic similarities are compared, matching performance improves with little or no effect on matching accuracy. The key might consist of part of a last name, a postal code, a street name, or sex. Only record pairs with identical keys are grouped for more in-depth matching.
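A minimal sketch of blocking: the key recipe (part of the surname plus the postal code) follows the text, while the records and the two-letter prefix length are illustrative choices.

```python
from collections import defaultdict
from itertools import combinations

records = [
    {"last": "Smith", "zip": "02139", "first": "Robert"},
    {"last": "Smyth", "zip": "02139", "first": "Bob"},
    {"last": "Jones", "zip": "90210", "first": "Ann"},
    {"last": "Jonas", "zip": "90210", "first": "Anne"},
]

def grouping_key(rec: dict) -> str:
    """First two letters of the surname plus the postal code."""
    return rec["last"][:2].upper() + rec["zip"]

blocks = defaultdict(list)
for rec in records:
    blocks[grouping_key(rec)].append(rec)

all_pairs = len(records) * (len(records) - 1) // 2
blocked_pairs = sum(len(b) * (len(b) - 1) // 2 for b in blocks.values())
print(all_pairs, "pairs without blocking,", blocked_pairs, "with blocking")

# Only pairs inside a block proceed to detailed (and expensive) comparison.
for block in blocks.values():
    for a, b in combinations(block, 2):
        print("compare:", a["first"], a["last"], "<->", b["first"], b["last"])
```

Even in this four-record toy, blocking cuts six candidate pairs down to two; at a million records the same idea is what makes matching tractable at all.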
Organizations often employ a multi-match strategy, in which matching is analyzed from various angles. For name and address data, organizations might rely heavily on tax ID (Social Security) number where it exists, while relying on other factors, such as address, last name, city, and state, where tax ID is missing from the customer records.

Matching Process

If we were to take a journey through a matching system from a record's perspective, it would go something like this. The matcher starts with the entire database and quickly whittles down the list of possible matches by establishing the grouping key. Only those records that are somewhat alike (those sharing the same key) are compared more precisely. This step is extremely fast: in a 100,000-record database, it might reduce the list of match possibilities to, say, fifty candidates. In phase two, the fifty remaining records are scrutinized more carefully with the solution's more powerful algorithms.

All of the major matching engines on the market use a similar two-step process for matching.
The wide variation among them seems to be in the actual algorithms for detailed matching: more specifically, in the efficiency with which they find correct matches, the rate at which they avoid bad matches, their ability to handle a wide variety of data domains and types, and the speed at which they complete the task.

The overall process looks like this:

Profile - Profile the data to understand data quality issues. Issues can be categorized, so that misfielded data can go through one process, incomplete records through another, and so on.

Standardize - Use a standardization process to optimize match efficiency. Be certain that data conforms to standards where they exist, that data is fielded correctly, and that nicknames, data shapes, and abbreviations are standardized.

Identify fields to compare - Perform the match on fields that identify records, of any field type. In this example, a straight name and address match is performed; matching can use any data available, however, including tax ID number, customer number, e-mail address, etc.

Match grouping keys - To deliver matching that is both accurate and high-performance, Talend recommends first making grouping keys part of the matching process, as described under Blocking above. For example, generate a set of keys from features of each customer record, such as part of a last name, a postal code, a street name, or sex; only record pairs with identical keys are grouped for more in-depth matching.
Match - Talend provides algorithms specifically designed for name comparison, address comparison, spelling errors, items that sound alike such as "phish" and "fish", and more. Match results are then grouped into pass, suspect and fail patterns. These match patterns allow users to know exactly why records were brought together, which is crucial for tuning the matching rules. Users can experiment with different scores and weightings to find more matches.

What to do with Matches

Once data has been processed through the matcher, there are several possible outcomes. Between any two given records in the same group, the matcher may find:

No relationship - the records do not appear to be related.

Match - the matcher found a definite match based on the criteria given.

Suspect - the matcher thinks it found a match but is not confident. The results should be manually reviewed.
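One hedged sketch of how a numeric match score might be banded into these three outcomes; the 0.90 and 0.75 cutoffs are arbitrary placeholders, since real thresholds are tuned against each data set.

```python
MATCH_THRESHOLD = 0.90    # at or above: definite match
SUSPECT_THRESHOLD = 0.75  # between the two: route to manual review

def classify(score: float) -> str:
    """Band a pairwise match score into the three outcomes described above."""
    if score >= MATCH_THRESHOLD:
        return "match"
    if score >= SUSPECT_THRESHOLD:
        return "suspect"
    return "no relationship"

for s in (0.95, 0.80, 0.40):
    print(s, "->", classify(s))
```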
The matcher does not stop when it finds a match; in large data sets, an individual often exists in many different forms. Mitigating the suspect matches is the most time-consuming follow-up task after matching is complete. Because of this, some tools offer utilities and strategies for dealing with them: they present the suspect matches in a graphical user interface and let users decide which relationships are accurate and which are not.

Conclusion

Matching is vital to providing data that is fit for use in enterprise applications. Several key strategies are outlined in this white paper. Profile and standardize data first, making sure that addresses are compared to addresses and not to names, for example. Use grouping keys to keep comparisons fast as data volumes grow. Finally, use powerful yet transparent routines to perform the match, so that any data brought together can be easily reconciled.
About Talend Data Quality

Talend offers a complete Data Quality solution composed of two products. The open source data profiling tool, Talend Open Profiler, is available on the Talend web site, ready for download and free to use. Alternatively, you may choose the powerful Talend Data Quality suite for the improvement and corporate management of Data Quality. The suite includes the foundation tools for data quality, including data profiling, correction, issue mitigation, advanced reporting and an integrated Data Integration tool for quick and easy data transformations.

Talend Data Quality includes the following tools:

Data Profiling - Provides deep analysis of Data Quality problems and measures the evolution of Data Quality over time. It includes a report management framework that compares current and historical statistics to determine data improvement or degradation.

Data Explorer - Lets users drill directly down into the tables of the analyzed databases to correlate data more precisely.

Data Cleansing - Improves Data Quality by using standard reference data and cross-checking your data against other databases and reference data. It also enriches data by providing value-add information that improves the quality and usefulness of existing data.

Data Matching - Helps you identify hidden duplicates in the data, offering a single view of customers, part numbers or almost any other data domain.

Data Quality Portal - An analytical web application that lets business users share and capitalize on analysis results and reports.
All functionality is completely integrated with Talend's data management solutions: Talend Integration Suite and Talend MDM. Take what you've learned from profiling and use the analysis in your Data Integration or MDM workflow. A single user interface, repository and deployment environment provides all you need to complete your data management tasks.

Talend Data Quality - Cleanse and track: specific components, reports, and the Data Quality Portal.

Talend Open Profiler - Identify Data Quality problems: free, GPL, no limitations, custom indicators.

For more information on Talend open source solutions: http://www.talend.com
Contact Talend in your region: http://www.talend.com/contact

© 2010 Talend. All rights reserved.