Getting Started with SAP Data Quality (Part 2)




Data Quality, part of SAP Data Services, is a powerful tool for cleansing and matching customers, businesses, postal addresses, and much more. However, the wide range of options and capabilities can be intimidating for new users. In this post, we will discuss configuration of the address cleanse and data cleanse transforms.

Configuring the AddressCleanse Transform

The cleanse transform editor is very similar to that of the Query transform, with the input schema in the upper left, the output schema in the upper right, and a set of tabs below. In a cleanse, however, there are only three tabs: Input, Options, and Output.

Input Tab

The Input tab contains a list of fields on which the work will be performed. Each input column you wish to cleanse must be assigned to one of these fields. For example, you might drag your 5-digit zip code column down to POSTCODE1 and your city column to LOCALITY1. While some of the mappings are clear, Data Quality's terminology can often be confusing. For an initial, basic cleanse, we recommend using the following:

FIRM or NAME - Company name or individual name
ADDRESS_LINE - Full street address, with or without unit number
UNIT_NUMBER - Suite, apartment, or other unit number

LOCALITY1 - City
REGION1 - State
POSTCODE1 or POSTCODE_FULL - 5-digit or 9-digit postal code

If your address data is not yet standardized to this degree, you may find the following generic or combined fields useful:

ADDRESS_LINE - As above, full street address with or without unit
LASTLINE - City, state, and zip code
MULTILINE1-12 - Any data

Because the MULTILINE fields can contain anything, even varying from record to record, Data Quality will simply make a best-effort parse. While the results of this parse are generally good, they can vary widely depending on the quality and consistency of the input data and should be checked carefully.

Options Tab

The Options tab defines which operations Data Quality will perform on your data and how those operations are configured. Since the list of options is extensive and most can safely be left at their defaults, we will focus on those most relevant to an initial cleanse.

The Reference Files section specifies the locations of the files Data Quality uses to validate and complete postal addresses. If your reference files reside in the default installation folder, these values likely do not need to change. However, if you need to run multiple sets of files or verify their locations, you will use these settings.

Assignment Options contains the flags telling Data Quality which operations to run. For an initial cleanse, we recommend DPV (Delivery Point Validation) and LACSLink (Locatable Address Conversion System). DPV is the foundation of postal verification, indicating whether an address is valid for delivery or the precise reason it failed. LACSLink supplements DPV by updating addresses when the physical location has not moved but some part of the address assignment has changed, for example, when a rural route is converted to a named street or when a direction is appended to a street name. Together, DPV and LACSLink meet the minimum USPS address processing requirements for CASS (Coding Accuracy Support System) certification.

The remaining fields in Assignment Options supplement address verification by tracking moves (NCOALink) and new addresses (EWS), determining whether an address is residential (RDI), appending business suite numbers (SuiteLink), assigning the mail carrier delivery route (eLOT), and assigning geocodes (Geo Mode).

The Standardization Options section contains most of the settings affecting output format. By default, these fields are configured to produce the output preferred by the USPS, which is suitable for most business needs. You might need to alter them if you want to preserve certain parts of the input data within the standardized output that would otherwise be discarded or placed in separate fields. For example, you may want to retain the order of dual addresses to meet business requirements, or preserve place names due to customer preference. We recommend considering the following:

Standardize Unassigned Address - Whether or not to standardize and reformat addresses for which Data Quality could not find a valid assignment. This is useful if you want consistent formatting even for incorrect addresses.
Use USPS Primary Name Abbreviation - Attempts to abbreviate the address line to 30 characters or fewer using USPS standards. For addresses that must be shortened, this provides more meaningful output than a simple truncation.
Capitalization - The USPS prefers all uppercase, but businesses and customers often prefer mixed case.
Preserve Dual Address Order - Must be set to Yes if the input order should be retained regardless of address assignment. Otherwise, the assigned address will be placed immediately before the locality, giving it precedence in delivery. Only useful if you are using dual addresses, e.g. PO Box and street.
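To see why Use USPS Primary Name Abbreviation beats a plain truncation, consider this rough Python sketch. It is not the transform's actual algorithm; the suffix table is just a small sample in the spirit of the USPS Publication 28 abbreviations:

```python
# Illustration only: abbreviate street suffixes before falling back to a
# hard truncation. The suffix table is a small, hypothetical sample.
SUFFIXES = {"STREET": "ST", "AVENUE": "AVE", "BOULEVARD": "BLVD",
            "DRIVE": "DR", "HIGHWAY": "HWY", "PARKWAY": "PKWY"}

def shorten(line: str, limit: int = 30) -> str:
    # Replace each word with its USPS-style abbreviation when one exists.
    words = [SUFFIXES.get(w.upper(), w) for w in line.split()]
    out = " ".join(words)
    return out[:limit]  # last resort: truncate

print(shorten("12345 MARTIN LUTHER KING BOULEVARD"))
```

A naive 30-character cut alone would have produced "12345 MARTIN LUTHER KING BOULE", while the abbreviated form "12345 MARTIN LUTHER KING BLVD" stays meaningful and fits.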
Output Tab

The Output tab contains a list of all fields exposed as results of the cleansing, but, depending on your input and options, only some will contain data. Generally, you will want to look for fields with names similar to the assigned input mappings, plus status and detail fields for the operations you selected in Assignment Options. You will usually be able to compile a full, cleansed address with some combination of the following:

FIRM - Firm name.
ADDRESS_LINE - Full street and unit information. In most cases, this is identical to PRIMARY_SECONDARY_ADDRESS.
PRIMARY_ADDRESS - Street or PO Box information without the secondary address, e.g. unit number. This can also be constructed using the individual PRIMARY_* output fields.
SECONDARY_ADDRESS - Floor, unit number, etc.
LOCALITY1_OFFICIAL - Official city or town name.
COUNTY_NAME - County name.
REGION1 - State or territory.
POSTCODE_FULL - The 9-digit postal code. This can also be constructed using POSTCODE1 and POSTCODE2.

Unlike the fields above, status codes don't contain any address information. Instead, they're your windows into the cleansing operations themselves. Status codes are the best way to see detailed data on whether or not a record was successfully cleansed, why it failed, or which fields were changed. Many of the status codes are cryptic alphanumeric indicators, but you can find their translations in the Data Services technical manuals. For DPV and LACSLink, we recommend using these fields:

STATUS_CODE - Six-digit code indicating exactly which parts of the input address were changed.
UNDELIVERABLE_INDICATOR - True (T) or False (F), indicating whether the address is deliverable per the USPS. Note that an address can be valid while still being undeliverable.

DPV_FOOTNOTE - A set of two-character codes either confirming that the address passed DPV verification or identifying the reasons it failed. The codes and their meanings are found in the Data Services technical manuals.
DPV_STATUS - A single-character code indicating the general result of DPV verification.
LACSLINK_INDICATOR - A single-character code indicating the general result of LACSLink verification.
LACSLINK_RETURN_CODE - A two-character code either confirming that the address passed LACSLink verification or identifying the reasons it failed.

In addition to a name, each field also has a class (FIELD_CLASS), address class (FIELD_ADDRCLASS), category (FIELD_CATEGORY), type, and content type.

FIELD_CLASS - Field classes are vaguely defined in the documentation, but they generally represent the different levels of processing used to generate the field. For example, you can choose to output a field after it has been parsed, after it has been fully standardized and corrected, or both. BEST represents the highest level of processing and is usually a safe choice.
FIELD_ADDRCLASS - The address class specifies the type of address for the generated field. DELIVERY is a safe default and indicates the standardized address used for delivery assignment. DUAL is only used when both a street and a postal address are input, and it indicates the address not used for delivery assignment. OFFICIAL is similar to DELIVERY, but always uses the USPS-preferred option when alternates are available.
FIELD_CATEGORY - The category defines the generated field as either an individual, processed address component (COMPONENT), a standardized version of an input field (STANDARDIZED), or a member of a suggestion list (SUGGESTION). COMPONENT should be your default.
Type - The data type of the generated field.
Content Type - Describes the kind of data in the field.

Checking the box to the left of a field name automatically adds it to the output schema with default type and mapping. The default name of the output column includes the field class, address class, and category.
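Downstream of the cleanse, these output fields are typically recombined into whatever shape the target needs. The following Python sketch shows one way a consumer might assemble a mailing label and flag undeliverable records. The field names come from the list above; the row dict and label format are illustrative assumptions, not Data Services syntax:

```python
# Sketch of a downstream consumer of AddressCleanse output fields.
def to_label(row: dict) -> str:
    """Assemble a mailing label from cleansed output fields."""
    lines = [
        row.get("FIRM", ""),
        row.get("ADDRESS_LINE", ""),
        "{}, {} {}".format(row.get("LOCALITY1_OFFICIAL", ""),
                           row.get("REGION1", ""),
                           row.get("POSTCODE_FULL", "")),
    ]
    # Skip empty lines, e.g. records with no firm name.
    return "\n".join(line for line in lines if line.strip())

# Hypothetical cleansed record.
row = {
    "FIRM": "ACME CORP",
    "ADDRESS_LINE": "100 MAIN ST STE 4",
    "LOCALITY1_OFFICIAL": "SPRINGFIELD",
    "REGION1": "IL",
    "POSTCODE_FULL": "627010001",
    "UNDELIVERABLE_INDICATOR": "F",  # F = deliverable per the USPS
}
label = to_label(row)
deliverable = row["UNDELIVERABLE_INDICATOR"] == "F"
```

In practice you would also branch on the status fields, for example routing records with a failing UNDELIVERABLE_INDICATOR to a review table instead of the mail file.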

In addition to the generated fields, you can add fields directly from the input schema by dragging and dropping them into the output, as you would in a Query transform. This allows you not only to output fields without sending them through the cleanse, but also to preserve the original versions of your input fields side by side with the cleansed results.

Configuring the DataCleanse Transform

The DataCleanse transform is very similar to AddressCleanse, but simpler in most respects. Again, the transform is organized with an input schema, an output schema, and a set of tabs.

Input Tab

The general behavior of the input tab is the same as described above. The main difference is that the fields are tailored to non-address customer and business data such as name, phone, email, date, etc. Also, you can include up to six different emails, dates, phones, and names in each record, whereas AddressCleanse only accepts a single address. Assuming a single set of customer data per record, we recommend the following:

PERSON1_GIVEN_NAME1 - Given name (i.e. first name, forename)
PERSON1_GIVEN_NAME2 - Second given name (i.e. middle name)
PERSON1_FAMILY_NAME1 - Family name (i.e. last name, surname)
PERSON1_MATURITY_POSTNAME - Maturity indicator, such as Jr or Sr
FIRM_NAME1 - Business or organization name
FIRM_LOCATION1 - Location within a business or organization
DATE1 - Any date, often used for date of birth
EMAIL1 - Any email address
PHONE1 - Any phone number
SSN1 - U.S. Social Security number

If the name or firm is not yet parsed, you may want to use the general or composite fields:

NAME_LINE1 - Full name, including title, maturity, etc.
FIRM_LINE1 - Firm name, location, or both
NAME_OR_FIRM_LINE1 - Any combination of name and firm data
MULTILINE1 - Any data

Again, Data Quality will make a best effort to determine the type and content of these fields, but it may make incorrect assumptions.

Options Tab

The list of options is much shorter and more intuitive than in AddressCleanse. Most important is the Cleansing Package Name; the dropdown will pre-populate with the installed cleansing packages. It's likely the only option available will be PERSON_FIRM, which is appropriate.

Standardization Options contains most of the non-date customizations. Here you define whether the middle and family names are used to assign gender, whether prenames (e.g. Mr., Mrs.) not included in the input should be assigned, and the capitalization to be applied to the output. You may also define the formats applied to phone numbers and SSNs. Two of these options are a bit more confusing, but important:

Name Order - Indicates the order in which the names will be delivered as input. If you can guarantee a first-last or last-first order for all records, setting this to the appropriate _STRICT setting may greatly improve cleansing accuracy. If only a majority of names follow one order, you may get some benefit from a _SUGGEST setting instead.
Enable Presumptive Name Parsing - Determines whether the data in the NAME_LINE field will be parsed as names even if the cleanse cannot make an assignment. If this option is set to No, unassigned input will be placed unparsed in the EXTRA field. If you are sure NAME_LINE will contain only names, this option should be set to Yes. (Note also the firm version of this option, Enable Presumptive Firm Parsing.)

Finally, Date Options allow you to fine-tune date formats. This is particularly useful for compatibility with target systems expecting specific patterns. The cleanse can usually produce the necessary output, removing the need for additional reformatting downstream.

Output Tab

As with the input tab, the output tab is set up similarly to AddressCleanse, with a list of fields and checkboxes. However, the fields are organized differently. Here, each is assigned to a parent component (PARENT_COMPONENT), a logical grouping of related data. For example, all of the data related to the first person in the input (first and last name, gender, title, etc.) can be found under the PERSON1 component. Each firm, person, phone number, email address, and date has its own parent component.

Also unlike AddressCleanse, DataCleanse output fields do not have address classes or categories, only field classes (FIELD_CLASS). The PARSED class generates data that has been parsed into discrete components but not further modified. The STANDARDIZED class modifies data to meet the standardization options and should be your default. The available output consists of high-level customer and firm fields as well as the individual elements contained within them. For example, you can choose to output the full date, the separate month/day/year components, or both.
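When a target system expects a date pattern the cleanse was not configured for, the downstream reformatting that Date Options can spare you would look roughly like this generic Python sketch (not Data Services syntax; the source and target patterns are assumptions for illustration):

```python
from datetime import datetime

# Generic downstream date reformat; configuring Date Options in the
# cleanse itself usually makes this extra step unnecessary.
def reformat_date(value: str, src: str = "%m/%d/%Y",
                  dst: str = "%Y-%m-%d") -> str:
    # Parse with the source pattern, emit with the target pattern.
    return datetime.strptime(value, src).strftime(dst)

print(reformat_date("04/29/2014"))  # prints 2014-04-29
```

Pushing this formatting into the cleanse keeps the dataflow simpler and keeps one authoritative definition of the date format.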
Start with a basic set of fields:

GIVEN_NAME1 - First name or forename
GIVEN_NAME2 - Middle or second given name
FAMILY_NAME1 - Last name or surname
FIRM - Business or organization name
FIRM_LOCATION - Location within a business or organization
NORTH_AMERICAN_PHONE - Full 10-digit North American phone number. Components: NORTH_AMERICAN_PHONE_AREA_CODE, NORTH_AMERICAN_PHONE_PREFIX, NORTH_AMERICAN_PHONE_LINE
DATE - Full date. Components: DATE_DAY, DATE_MONTH, DATE_YEAR
EMAIL - Full email. Components: EMAIL_USER, EMAIL_DOMAIN_ALL, EMAIL_DOMAIN_TOP
SSN - Full 9-digit Social Security number. Components: SSN_AREA, SSN_GROUP, SSN_SERIAL

The cleanse also predicts gender (GENDER and GENDER_ID) with varying degrees of confidence. The possible values indicate the gender itself, the confidence of the prediction, and whether single or multiple names were used. This may be useful, but it isn't guaranteed to be correct, so be careful using it on external-facing data.

Finally, the transform generates match standard fields (*_MATCH_STD1-6) for names, firms, and titles. These are standard aliases for data successfully located in the dictionary. For example, Chris might have the standard aliases Christopher and Christine, while PhD has Ph.D. These fields are intended for use in downstream match transforms.

Conclusion

Using these settings, you should be able to quickly configure a cleansing dataflow that produces useful output for addresses, customers, and businesses. In the next post, we will examine the output to determine which data might need to be tweaked or augmented and how best to configure it as input for match transforms.

Bruce Labbate
Business Intelligence Consultant
Decision First Technologies
bruce.labbate@decisionfirst.com

Bruce Labbate is a business intelligence consultant specializing in data warehousing, data quality, and ETL development. Bruce delivers customized SAP Data Services solutions for customers across all industries. With Decision First Technologies, Bruce utilizes Data Integrator, Data Quality, Information Design Tool, and a variety of database technologies, including SQL Server, DB2, Oracle, Netezza, and HANA.