Match/Consolidate User's Guide to Record Matching


Match/Consolidate User's Guide to Record Matching
Match/Consolidate 8.00c
April 2009

Copyright information

© 2009 SAP BusinessObjects. All rights reserved. SAP BusinessObjects and its logos, BusinessObjects, Crystal Reports, SAP BusinessObjects Rapid Mart, SAP BusinessObjects Data Insight, SAP BusinessObjects Desktop Intelligence, SAP BusinessObjects Rapid Marts, SAP BusinessObjects Watchlist Security, SAP BusinessObjects Web Intelligence, and Xcelsius are trademarks or registered trademarks of Business Objects, an SAP company and/or affiliated companies in the United States and/or other countries. SAP is a registered trademark of SAP AG in Germany and/or other countries. All other names mentioned herein may be trademarks of their respective owners.

Contents

Preface

Chapter 1: Fundamentals of record matching
    Terms
    Benefits of Match/Consolidate
    Data, rules, and results

Chapter 2: Record matching overview
    Summary of the record matching process
    About your first job
    Prepare your files for Match/Consolidate
    Set up your Match/Consolidate job
    Read records and create Match Sets
    Find matching records
    Process the match results
    Match/Consolidate features

Chapter 3: Define your input files and lists
    Input files and lists
    Input files
    Re-use processed input (key data) with reference files
    Group your records with lists
    Use lists to control the matching process
    List types
    Dupe search within this list
    List Break Priority
    Three approaches to defining lists
    When a record doesn't fit into any list
    Create groups of lists (super lists)
    Reports on your lists

Chapter 4: Prioritize and suppress records
    Record priorities and types
    Record priority and suppression
    Prioritize or suppress records based on list membership
    Penalize records that contain blank fields
    Prioritize records based on the contents of one field
    Reports about record ranking and priorities

Chapter 5: Purge input files or create output files
    Match/Consolidate results
    Purge bad records or post good records
    Purge the input file
    Create an output file or post data to the input file
    Data that you can post
    Choose the best records for your output file
    Custom sort your output records
    Create a multi-buyer file
    Create a multi-occurrence file
    Select a sample of records
    Reports about your purging or output process

Chapter 6: Reports and statistics files
    Introduction to reports and report files
    Statistics files
    How statistics files relate to Match/Consolidate reports
    How Match/Consolidate counts intra-list and inter-list matches
    Use super lists for report data
    Print reports
    Duplicate Records Report (.dup)
    Executive Summary Report (.exs)
    Input File Summary Report (.ifs)
    Input List Summary Report (.ils)
    Job Summary Report (.mjs)
    List-by-List Match Report (.llm)
    List Duplicates Reports (.ldr)
    List Match Reports (.lm)
    List Quality Report (.lqr)
    Match Results Report (.mrr)
    Multi-List Report (.mlr)
    Output File Reports (.ofr)
    Posted Dupe Groups Report (.pdg)
    Purge by List Reports (.prl)
    Sorted Records Report (.sor)
    Unparsed Records Report (.unp)
    Job statistics file
    Input statistics file
    List match statistics file
    List statistics file
    Output statistics file
    Purge statistics file
    Super list match statistics file
    Multi-buyer statistics file
    List subordinates statistics file

Chapter 7: Use group posting to consolidate data
    The basics of group posting
    Introduction to group posting
    Post data sources and destinations
    Group posting depends on your fields
    Group posting more than once per destination record
    Example: post a new phone number
    Example: additive information
    Examples of group posting strategies
    When group posting is all you want to do
    Group post with an input purge
    Reports on group posting

Chapter 8: Record matching
    Introduction
    Choose between standard and extended matching
    Factors that affect comparison time
    Matching strategies
    Implement a matching strategy
    Rule matching
    Automatic matching
    Advanced matching
    Use reports to examine the matching process

Chapter 9: Engineer key data
    Key files
    Include record keys only as needed
    Define key fields
    Standardize key data for lastline information
    Standardize key data for people's names
    Standardize key data for firm (company) names

Chapter 10: Engineer break groups
    Form break groups
    Break strategies
    Prioritize your break group records
    Break-group analysis

Chapter 11: Engineer your match setup
    Compare record keys: the driver record
    What makes records match
    Simscore
    How close is close enough
    How record order affects comparisons
    Control record comparisons
    Match with unparsed addresses, last lines, names, and firms
    Matching options
    How blank fields affect matching
    Fine-tune your matching process

Chapter 12: Advanced matching
    Terms
    Match Sets
    Multi level matching
    Combine match set

Chapter 13: Constant Key ID
    Use Constant Key ID

Appendix A: Match/Consolidate and Match programs
    Product-line overview
Appendix B: Calculate the size of your work files
Appendix C: Analyze your matching strategies
Appendix D: Match/Consolidate Wizard

Index

Preface

Purpose and contents of this manual

This guide explains how Match/Consolidate (MCD) programs perform record matching. Beginning with an entry-level orientation on the basics of record matching, this guide progresses through the common record matching functions and an explanation of the features that comprise the current technology of record matching.

Our examples and illustrations are based on actual MCD jobs set up and run through the MCD Views program on a Windows NT platform. If you are not using Views, look for similarly named parameters in the corresponding block of your job file.

We assume that you are familiar with your operating system and have a general understanding of database management.

Conventions

This document follows these conventions:

Bold: We use bold type for file names, paths, emphasis, and text that you should type exactly as shown. For example, Type cd\dirs.

Italics: We use italics for emphasis and text for which you should substitute your own data or values. For example, Type a name for your file, and the .txt extension (testfile.txt).

Menu commands: We indicate commands that you choose from menus in the following format: Menu Name > Command Name. For example, Choose File > New.

Symbols: We use three margin symbols: one to alert you to important information and potential problems, one to point out special cases that you should know about, and one to draw your attention to tips that may be useful to you.

Documentation

Documents related to this manual include the following:

System Administrator's Guide: Explains how to install your software.

Database Prep: Explains how to prepare input files for processing, including how to create DEF, FMT, and DMT files.

Match/Consolidate Extended Matching Reference: Contains the operational how-to instructions for setting up extended matching.

Match Library Programmer's Reference: This is a reference manual for the Match Library.

Match/Consolidate Library Reference: This is a reference for programmers working with the Match/Consolidate Library.

Quick Reference: Contains descriptions of the input and output fields, and the command line for the MCD job file.

Access the latest documentation

You can access documentation in several places:

On your computer. Release notes, manuals, and other documents for each product that you've installed are available in the Documentation folder. Choose Start > Programs > Business Objects Applications > Documentation.

On the SAP Service Market Place. Go to the SAP Service Market Place and then click the Business Objects tab. Here, you can search for your products' documentation.

Chapter 1: Fundamentals of record matching

This chapter explains some of the fundamentals of record matching. It describes how to use Match/Consolidate (MCD) to match your records.

Terms

This guide references the following terms.

Consolidation, group posting, salvaging data: Consolidation (or group posting) means copying or accumulating data from one matched record to another. Often, it means merging matched records to form a single best record. Some users migrate information from one record to another, but do not specifically seek to merge the records. This is a follow-up process, which occurs after records are identified as members of match groups.

Dupe group, match group: The terms dupe group and match group are used interchangeably in this guide. Each refers to two or more records that were found to match each other.

Match key: Name, address, or other data that is broken down into components, standardized, and ready for comparison. For example:

    Raw data
        Name_Line1 = George F Hayes
        Address = 100 Main St #5
        Last_line = Edna, MN

    Match key
        First_name for 8 characters = GEORGE
        Mid_name for 3 characters = F
        Last_name for 10 characters = HAYES
        Prim_range for 10 characters = 100
        Prim_name for 15 characters = MAIN
        Suffix for 6 characters = ST
        Sec_range for 6 characters = 5
        ZIP for 5 characters =
        Actual Key = GEORGE F HAYES 100 MAIN ST

Match field: Data that is part of the match key and is compared during the matching process. The First_name data is one of the match fields in the example above. Middle name (Mid_name) is another, as are Last_name and the remaining fields.

Break group: A group of keys, formed by sorting, whose records are likely to match. Break groups speed the duplicate detection process by eliminating comparisons of records that have no likelihood of matching. Only records within the same break group are compared to one another.

Benefits of Match/Consolidate

The benefits of using MCD begin with record matching. That means comparing name, address, and other customer data to find matching records. In other words, it means deciding whether, within your rules, Record A and Record B represent the same person, household, or company. We can help you get started with typical matching rules; eventually you will probably want to adjust them or make new rules.

Once you've identified pairs or groups of records that match, what do you want to do? Eliminate redundant records? Migrate customer data from one file to another? Consider the possibilities listed in the following tables.

Terms and definitions

Extended parsing: Apply the parsing and standardization capabilities of ACE and TrueName to prepare the cleanest, most complete data for match keys.

Extended matching: Highly tunable rule-based matching. You can prioritize your match fields and make decisions for a match or non-match on a per-field basis.

Consolidation: We offer two approaches to consolidation. With each, you create your own rules for comparing and consolidating records. You can consolidate matched records into a best record, or migrate data among your files.

Reference files: When you repeatedly match against the same static database, there's no need to regenerate match keys each time. Some people call this feature durable or re-usable match keys.

Advanced matching: Advanced matching lets you find up to three levels of matches in one pass and find associated matches between separate data sets. For example, you can find families and individuals as well as separate residents all in one pass and give a unique number for each level on output. Association means finding persons who live at different residences at different times of the year by using a common data field.

Constant key: Constant key lets you create an ID that is unique to a record or group of duplicate records. The ID is sequential and static, and it will not change when records are updated or re-processed through MCD. When you append new records to the database, MCD tags any that belong to a group with an existing ID with that same ID.

Features and options

Input purge or create output file: Most users choose to send desirable records to an output database. Or, if disk space is a concern, you can drop undesirable records from the input database(s).

Multi-buyer: Let's say you're bringing together customer lists from several other direct marketers or publishers. Your best prospects may be the people whose names appear on two or more lists, indicating they may be most receptive to your offer.

Custom sorting and selection: You can perform Nth-select and/or limit your output to a certain number of records. Within your maximum-records limit, you can select your best prospects using a variety of custom sorting strategies.

Business-to-business: MCD isn't just for consumer marketing. For example, with the proper setup and multiple passes, you can perform N-per-firm selection. In other words, you can limit output so that only a certain number of individuals in each company will receive your offer. That helps you spend your advertising dollars most effectively.

Group posting: When you're working with several lists, take advantage of the best of each list. Use the MCD group posting feature to salvage the best data (data that's missing from your records) from those duplicate records that won't be included in your final output.

Suppression lists: You can work with suppression lists (for example, your own bad-account file, or no-mail lists provided by the government or a direct-marketing association (DMA)) to prevent wasted mailings and avoid offending consumers.
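
To make the multi-buyer idea concrete, the following Python sketch picks out dupe groups whose records come from two or more lists. It is only an illustration of the concept, not how MCD implements the feature; the data layout, the list names, and the multi_buyer_groups function are hypothetical.

```python
from collections import defaultdict

# Hypothetical match results: one (dupe_group_id, list_name) pair per matched record.
matched = [
    (1, "ListA"), (1, "ListB"),
    (2, "ListA"), (2, "ListA"),
    (3, "ListA"), (3, "ListB"), (3, "ListC"),
]

def multi_buyer_groups(rows, min_lists=2):
    """Return {group_id: sorted list names} for groups spanning at least min_lists lists."""
    lists_by_group = defaultdict(set)
    for group_id, list_name in rows:
        lists_by_group[group_id].add(list_name)
    return {
        group_id: sorted(names)
        for group_id, names in lists_by_group.items()
        if len(names) >= min_lists
    }

result = multi_buyer_groups(matched)
for group_id, names in result.items():
    print(group_id, names)
```

Group 2 is left out because both of its records come from the same list; an intra-list duplicate does not indicate a multi-buyer.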

Data, rules, and results

The keys to successful MCD use involve data, rules, and results.

Data

Clean, complete name and address data will make a big difference in your success. If you have data from several sources or from outside your organization, then there may be issues about format and consistency. We can help. Use ACE, TrueName, DataRight, or DataRight IQ tools to break data down into components, correct errors and inconsistencies, and fill in missing data.

Rules

Rules refers to your matching rules: your criteria for when two records should be called a match, and when they should not. You'll need to think carefully about which fields will be evaluated, how they will be compared, and any special or exceptional circumstances that might override your normal criteria.

For Views users, we provide five default sets of rules within your match criteria to help you get started with individual, family, household, business, or business-individual matching. We recommend that you start your learning and testing with one of our rule sets, then adjust as necessary. That may mean a cyclical process in which you run the search for matches, check your reports, make rule changes, and run the search again.

Results

Consider the results, or outputs, that you want at the end of the process. Do you want to create an output database? If so, plan your criteria for the records to be included in that file. If you want to consolidate records, write lists of fields to consolidate and how to evaluate or combine each source. Finally, think about what reports you will need for yourself and your clients.


Chapter 2: Record matching overview

This chapter summarizes the Match/Consolidate (MCD) process and explains why preparation, setup, and step-by-step execution of your job are vital to getting the results you want from MCD.

Summary of the record matching process

To help you understand the MCD process, consider the following five-step process. You perform the first two steps, and Match/Consolidate performs steps 3, 4, and 5. Here we concentrate on the basics, so we ignore many of the features that you can include to tailor your MCD job to your requirements. The other chapters of this guide further explain these features.

1. Prepare your files for Match/Consolidate
2. Set up your Match/Consolidate job
3. Read records and create Match Sets
4. Find matching records
5. Process the match results

One step at a time

This chapter describes the steps one at a time. As you learn more about MCD and set up your own MCD jobs, you can do all the processing steps at once.

Match/Consolidate is a batch process. That means you set up an MCD job (define what records to use and what to do with them), and then start that job. Match/Consolidate runs the job according to the job settings, in one batch.

Checking your results

During the MCD batch process, your interaction is limited to reading progress messages (if you so choose). However, once the process is complete, you can check your results by checking MCD reports and/or output files.

Match/Consolidate can produce 16 different pre-formatted reports, containing statistics about the process and actual record data for your analysis. In addition, MCD can produce many statistics files in which you can find almost any data pertinent to your MCD job.

Normally, you will create reports for every job (select the Create Reports option at the Execution Options window). Look carefully at the appropriate reports. If you don't see the results you want, change your settings and re-process the job. Do this at each step until you get the results you want.

Disk space for generated files

As it runs, MCD generates work files. If you run out of disk space for those files, the program will stop. Note that, depending on your operating system, you may get a variety of errors.

For details on estimating disk space requirements, refer to Calculate the size of your work files (Appendix B).

[Figure: overview of a Match/Consolidate job.
Preparation: Prepare your files for Match/Consolidate; Set up your Match/Consolidate job (input files).
Execution process, Read Records and Create Match Sets: all input records are read into the key file. Reports: Input File Summary, Input List Summary, Sorted Records report, Unparsed Records report.
Execution process, Find Matching Records: keys are compared (for example, Mary Jane Smith, Mary Smith, M. Smith, Maryjane Smith) and records are identified as dupes or uniques. Reports: Duplicate Records report, List Duplicates report, List Match reports, Sorted Records report, and more.
Execution process, Process the Match Results: post records, purge records, or post data to produce Match/Consolidate, Multi-Occurrence, All Duplicates, or Custom output. Reports: Output File report, Posted Dupe Groups report, Purge by List report, statistics files.]

By guiding you through a job, this chapter provides an overview of the three main Match/Consolidate processes. For specific details about the processes, refer to the remaining chapters of this guide.

Match/Consolidate reports are a most valuable source of information about your job. Study them carefully to see if you should adjust your job settings and rerun the process.

Note that your MCD job can be run in one execution; it need not be run in separate phases as shown in this illustration.

About your first job

If you are new to MCD, we recommend that you make your first MCD job simple, to familiarize yourself with the overall MCD job processes.

As an introductory job, and as a quick check to be sure your program is properly installed, we supply a collection of files that are automatically copied to your program's samples directory when you install MCD.

We provide the quik_mpg.dat file to serve as your database for the sample job. We also provide the quik_mpg.def and quik_mpg.fmt support files for that database. The database file contains 1000 records, each record having name and address data. If you look through the file, you can find some blank fields, and you may note that some of the records have addresses (or names, or both) similar to those of other records.

Depending on your operating system, we provide a job file named quikunix.mpg or quikwin.mpg. This job is preset to read the records of the quik_mpg.dat database, process them to find duplicate records, and produce an MCD Output File.

If you have not used the standard directory structure prompted by the installation program, then before you run the introductory job, you may have to make some small changes to the Auxiliary Files settings, so that your program will be able to find the directories it needs for processing. Refer to your System Administrator's Guide for additional details.

Subdirectories

In addition, many users prefer to keep their jobs' output and reports in separate subdirectories. If you want to separate your output and reports like this, you'll have to do two things:

First, create the additional directories.

Then, in your job setup (quikunix.mpg or quikwin.mpg), modify the file paths that are set in your reports and output file blocks to correspond to those directories.

[Figure: example directory structure showing the PW, MPG, Samples, Template, Work, Output, and Reports directories.]

Prepare your files for Match/Consolidate

You need to have two types of files ready before you run MCD:

Input files: The records you want in the job. You can input up to 255 files for your MCD job, and they can be of varying types, including ASCII, dBASE3, EBCDIC, and delimited.

Supporting files: These files include the DEF file, which interprets your input data for MCD, and format files such as FMT, DMT, or EBC.

For details about input files and support files, refer to Database Prep.

Input files

The best way to prepare your input files for MCD is to standardize your input data by using name and address correction software, like our DataRight, TrueName, ACE, and IACE software. Standardized data increases the speed and accuracy of the match process.

If your data is not standardized, the MCD job can perform extended parsing for name and address data. Using extended parsing produces results equivalent to those derived from using DataRight, TrueName, ACE, and IACE (U.S. engine) software. However, extended parsing is an extra-cost option, and it may increase overall processing time. Note that data is standardized in the key data for the purpose of matching only.

If you are running our sample job (quikunix.mpg or quikwin.mpg), then your input file, quik_mpg.dat, is in your program's samples directory.

Re-run the same job

If you just changed settings and now want to re-run the same job, you may be able to speed up the process by using reference files. For details, see Re-use processed input (key data) with reference files on page 30.

Set up your Match/Consolidate job

Once you have prepared your data for MCD, you need to set up your job so that the MCD program will know:

    Which input file records to include
    How to parse data from the input record
    What key data to store for each record
    What makes a match or no-match
    What result (output) to produce from this job
    Which reports you want to create

Three options

If you are using MCD Views, you have the three options explained below for setting up an MCD job. If you do not have Views, then you must use the third alternative shown here.

1. Use the Views Wizard

The MCD Views job setup Wizard prompts you through a setup for your job. The Wizard does not control all the features available in MCD; however, it does get the job started with the input, output, and processing options common for most MCD users. Once you've initially set up your job through the Wizard, you can use Views to add any additional sophistication needed to produce exactly the results you want from MCD.

2. Design your job in Views

You can define and design your entire job setup through Views. Use the Views windows to select the options and define the setup parameters that produce the results you want. Views currently includes standard, extended, and advanced matching processes through the Match Criteria and Match Options windows.

3. Copy and edit a job file

If you are already familiar with MCD or other record matching programs, and like working in a text-only environment, you may want to set up your job by directly editing the job file.

For the introductory sample job (quikunix.mpg or quikwin.mpg), your setup will be minimal: you only need to accommodate any differences from the standard MCD installation (refer to About your first job on page 18).

Read records and create Match Sets

As you learn to use MCD, you may want to consider performing the first process, that of reading the input records and creating match sets, as a separate step. Once you have developed a better understanding of MCD, you may be more comfortable combining all of the processes of your job in one execution.

To run our sample job (quikunix.mpg or quikwin.mpg), issue the MCD start command from your command prompt, or, if using Views, open the sample job and run the job from within Views. (Refer to our Quick Reference for command line options and format requirements.)

The key file is a working file that MCD uses to hold the data that's used in placing, matching, and ranking (prioritizing) your records. You won't read or use this file; only the MCD process will.

The match process compares data from one record to corresponding data from another record. However, comparing all the record data would take far too much time for most purposes. Additionally, comparing some parts of the data might actually be counterproductive.

Therefore, instead of using all the record data, your matching process uses key data: data that you, the MCD user, identify as the significant parts of the record to use for finding matches. That data is stored in the MCD key file.

Each key represents a record

The key file contains a string of data for each record to be processed. You identify each field and the number of characters to use in the key. For example, you may want to store 12 characters of the last name data, 30 characters of firm data, 10 characters of primary range data, and so on.

    Raw data
        Name_Line1 = George F Hayes
        Address = 100 Main St #5
        Last_line = Edna, MN

    Match key
        First_name for 8 characters = GEORGE
        Mid_name for 3 characters = F
        Last_name for 10 characters = HAYES
        Prim_range for 10 characters = 100
        Prim_name for 15 characters = MAIN
        Suffix for 6 characters = ST
        Sec_range for 6 characters = 5
        ZIP for 5 characters =
        Actual Key = GEORGE F HAYES 100 MAIN ST
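
As a rough illustration of how a fixed-width key could be assembled from parsed fields, here is a minimal Python sketch. The field names and widths are taken from the example above; the build_key function, the uppercase-and-pad behavior, and the record layout are assumptions made for illustration and are not MCD's actual key format. The ZIP value is missing from the example, so the sketch leaves it blank.

```python
# Field names and widths from the example above; everything else is illustrative.
KEY_LAYOUT = [
    ("First_name", 8), ("Mid_name", 3), ("Last_name", 10),
    ("Prim_range", 10), ("Prim_name", 15), ("Suffix", 6),
    ("Sec_range", 6), ("ZIP", 5),
]

def build_key(parsed):
    """Uppercase each field, truncate it to its width, and pad it with spaces."""
    parts = []
    for field, width in KEY_LAYOUT:
        value = parsed.get(field, "").upper()[:width]
        parts.append(value.ljust(width))
    return "".join(parts)

# Parsed components of the George F Hayes record (ZIP left empty, as in the source).
parsed = {
    "First_name": "George", "Mid_name": "F", "Last_name": "Hayes",
    "Prim_range": "100", "Prim_name": "Main", "Suffix": "St",
    "Sec_range": "5", "ZIP": "",
}

key = build_key(parsed)
print(repr(key))
```

Because every field occupies a fixed position, two keys can be compared field by field simply by slicing the same character ranges from each string.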

Match sets represent a match strategy

When you set up your matching criteria, which determine whether two records will match, you define your matching strategy. Match/Consolidate collects the records that it compares using this match strategy into a match set.

Match/Consolidate can evaluate more than one match set; however, this is an advanced feature. For more information, refer to Advanced matching on page 219. If you have defined only one match strategy in your Match/Consolidate job, then MCD automatically creates the match set.

Once the key data is assembled, MCD can move to the next process: finding matching records.

Find matching records

Once MCD has read your input records and created the key file, it performs the next main processing step: finding matching records. For detailed information about matching, refer to Engineer your match setup (Chapter 11).

Summary of the matching process

Normally, the match process step involves these three phases:

1. Match/Consolidate places records into small groups to avoid comparing records that have no reasonable likelihood of matching. This process is often referred to as forming break groups or sorting keys.

2. Next, MCD compares each key of a specific group to every other key in that group. When two or more keys match, MCD identifies their records as members of a dupe group (a duplicate record group). Note that the number of records in a dupe group can vary widely, depending on the quality of your data and your matching setup.

3. Then MCD sorts the keys of each group, to prioritize them and to categorize each record as a unique record, or a master or subordinate dupe.

In short, for the records of a break group, MCD assigns each record to a group of records that it has determined to match each other, and then ranks each record within that dupe group.
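
The three phases above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not MCD's algorithm: the break key (ZIP), the is_match rule (same last name and same first initial), and the ranking rule (longest name wins) are all stand-ins chosen to keep the example small, and the sample records are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical records; only name and ZIP are used in this sketch.
records = [
    {"id": 1, "name": "MARY JANE SMITH", "zip": "55987"},
    {"id": 2, "name": "MARY SMITH", "zip": "55987"},
    {"id": 3, "name": "M SMITH", "zip": "55987"},
    {"id": 4, "name": "JOHN CASILLO", "zip": "01501"},
    {"id": 5, "name": "MARK SMITH", "zip": "56308"},
]

def is_match(a, b):
    """Stand-in match rule: same last name and same first initial."""
    a_parts, b_parts = a["name"].split(), b["name"].split()
    return a_parts[-1] == b_parts[-1] and a_parts[0][0] == b_parts[0][0]

def classify(recs):
    # Phase 1: form break groups (assumed break key: ZIP).
    break_groups = defaultdict(list)
    for rec in recs:
        break_groups[rec["zip"]].append(rec)

    # Phase 2: compare every pair inside each break group and merge
    # matching records into dupe groups with a small union-find.
    parent = {rec["id"]: rec["id"] for rec in recs}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for group in break_groups.values():
        for a, b in combinations(group, 2):
            if is_match(a, b):
                parent[find(b["id"])] = find(a["id"])

    dupe_groups = defaultdict(list)
    for rec in recs:
        dupe_groups[find(rec["id"])].append(rec)

    # Phase 3: rank each group and label records
    # (assumed ranking: longest name first, then lowest id).
    status = {}
    for members in dupe_groups.values():
        if len(members) == 1:
            status[members[0]["id"]] = "unique"
            continue
        ranked = sorted(members, key=lambda r: (-len(r["name"]), r["id"]))
        status[ranked[0]["id"]] = "master"
        for rec in ranked[1:]:
            status[rec["id"]] = "subordinate"
    return status

status = classify(records)
print(status)
```

Record 5 (MARK SMITH) would satisfy the stand-in match rule against the other Smith records, but it is never compared with them because it falls in a different break group. That is the trade-off break groups make: fewer comparisons in exchange for possibly missing matches that cross group boundaries.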

Process the match results

After MCD has determined which records match, you need to have MCD do something productive with its conclusions. Normally, that something will be one of the following:

    Purge the input file.
    Create a new output file.
    Update existing records.

Whichever outcome you want, MCD checks each record of the job, one after another. Match/Consolidate acts on the record based on the results of the matching process and your choice of processing options.

For details about input and output files, refer to Purge input files or create output files on page 61.

Choose an output

For detailed information about available options, refer to Purge input files or create output files on page 61.

The purpose of your job will usually make it obvious what you need from MCD. The most common use of MCD results is to produce one of the following two output files. The introductory sample job that we provide with MCD (quikunix.mpg or quikwin.mpg) is set up to produce the MCD output file.

The MCD output file contains all the unique records as well as all master records (master dupes). This type of output file could be used as a mailing list.

The All Duplicates output file contains all the records that matched any others. It includes every member of every dupe group, but none of the unique records. This file might be used in further database maintenance activities or quality control functions. This type of output file might have other uses as well (refer to Output file on page 64).
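
The difference between the two output files can be expressed as a simple selection over record status. The following Python sketch assumes each record already carries a status of unique, master, or subordinate; the select_output function, the kind names, and the record layout are hypothetical and only mirror the descriptions above.

```python
def select_output(records, kind="standard"):
    """Pick records for an output file based on their match status.

    'standard' keeps uniques and masters (the MCD output file described above);
    'all_duplicates' keeps every member of every dupe group and no uniques.
    """
    if kind == "standard":
        keep = {"unique", "master"}
    elif kind == "all_duplicates":
        keep = {"master", "subordinate"}
    else:
        raise ValueError(f"unknown output kind: {kind}")
    return [rec for rec in records if rec["status"] in keep]

# Hypothetical match results.
results = [
    {"id": 1, "status": "master"},
    {"id": 2, "status": "subordinate"},
    {"id": 3, "status": "subordinate"},
    {"id": 4, "status": "unique"},
]

standard_ids = [rec["id"] for rec in select_output(results)]
all_dupe_ids = [rec["id"] for rec in select_output(results, "all_duplicates")]
print(standard_ids, all_dupe_ids)
```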

Match/Consolidate features

When you're done with your first, introductory job, you'll probably be ready to learn more about some of the features that you can incorporate in your MCD jobs, such as lists and data posting. Here's where you'll find these subjects in this guide:

Task
    Categorizing input records by source or field value
    Logically including or excluding records, based on field data
    Consolidating or copying data among matching (duplicate) records
    Tracking what happens to (or with) records from various sources
    Selecting the highest quality records for output

MCD feature and page number
    Lists 27
    Filters
    Functions 28
    Group posting 147
    Super lists 41
    Multi-Level matching
    Match sets
    Combined match set
    Nth select
    Custom sorting

You'll probably also want to learn about these features.

Task                                           MCD feature                                            Page number
Finding more matching records                  Key data                                               172
Speeding up the match process                  Break groups                                           188
Controlling the match process                  Standard matching, Extended matching, Advanced matching
Identifying the best of the matching records   Ranking or prioritizing records

Chapter 3: Define your input files and lists

This chapter describes how to define your input: how to define and limit the files to be used as input, how to re-use already processed files as reference files, and how to characterize records through the use of input lists.

Match/Consolidate (MCD) uses the words file and list interchangeably. Even if you do not set up lists, MCD considers each input file a list.

Input files and lists

Terms

The following table describes the various input files and lists.

Term               Description
Input file         Your records. The database you want MCD to process.
Reference file     A re-usable file that results from MCD reading input records.
List               A grouping of records based on a common data characteristic.
Normal list        A list of records that MCD should consider to be eligible records.
Suppression list   A list of records MCD uses to prevent matching records of other lists from being sent to the output.
Special list       A list of records that should be treated as transparent, like seed lists. They are not counted in determining how to characterize a match group (for example, multi-list or single-list).
Super list         A group of lists. For example, a super list may consist of three lists rented from one broker.

Set up your lists

The following list summarizes how to set up lists using the MCD Job-File and Views.

Input files and Reference files
    Set up an input file block for each file you want included in this job.

Lists
    In your DEF file, define PW.List_ID. To manually set up lists, set up one input list description block for each list. To automatically generate lists, use the Input List Default block.

Select records based on a value in a field
    In the DEF file, define PW.List_ID as the field containing your list identification data. For example, if you have a database field named List_Code that contains a useful value, use PW.List_ID = List_Code.

Select records based on any criteria
    In the Input List Description section of your job, specify your selection criteria at the List Filter parameter.

[Figure: sample input records showing name, firm, and address lines]

Input files

Before MCD can decide whether or not two records match, it must read those records from your database file(s) and convert them into key data. Identify all files that you want included in your MCD job.

Determine which input file records to include

Match/Consolidate processes records from your input files one at a time. First it decides whether the record should be included in the job. Perhaps the record has been marked for deletion, for example, or perhaps you want to limit the number or type of records to use from an input file. You can set file-by-file limits on which records should be used with these methods:

A starting record number.
A maximum number of records from the input file.
Filters that apply to records of this input file. Filters are formal, logical statements that MCD can act on as it reads your input record. For example, you might want to exclude (filter out) any record that is not from a particular state. Refer to filter information in the Quick Reference.
An input processing exit function.

[Figure: three input files. The first has no limits (use all records). The second starts at record #100 with a maximum of 3000 records. The third reads all input records, uses the records that pass the filter, and does not use the rest.]

Parse data from the input file

When it reads your input record, MCD identifies specific parts of the record, such as first name, last name, address, city, and so on. This is called parsing. Later chapters explain the various parsing options.

The parsing process is only for internal program use, to improve the detection of matching records. Match/Consolidate stores parsing results in working files that MCD will use in creating the key file. Parsing does not change the data in your input file, nor does it affect the data that will be in your output file (if you choose to create one).

For more information about matching records, refer to Find matching records on page 23.
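The per-file limits described above (a starting record number, a maximum number of records, and a filter) can be sketched in Python as follows. This is a rough model, not MCD's implementation: the function and parameter names are hypothetical, and the order in which the limits are applied (start position, then filter, then maximum) is an assumption made for the example.

```python
from itertools import islice


def select_input(records, start=1, max_records=None, record_filter=None):
    """Return the records of one input file that pass its limits.

    start is a 1-based record number; record_filter is a callable that
    returns True for records to keep.
    """
    # Skip everything before the starting record number.
    selected = islice(records, start - 1, None)
    # Assumption: the filter is applied before the maximum is counted.
    if record_filter is not None:
        selected = (rec for rec in selected if record_filter(rec))
    if max_records is not None:
        selected = islice(selected, max_records)
    return list(selected)
```

For example, select_input(records, start=100, max_records=3000) models the second file in the figure, and passing only record_filter models the third.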

Re-use processed input (key data) with reference files

A reference file is a specialized work file that contains all the key data for an input file. Create the reference file during your first MCD process. For subsequent passes, MCD uses that reference file as the input data instead of using its associated input file.

Reference files are controlled by settings (parameters) of the Input File block of your MCD job setup. Refer to the Job-File Reference manual for details about how to create or use them.

When you can use a reference file rather than an input file, you save the time that would have been spent repeatedly reading input data and creating key files. As such, reference files can be a valuable substitute for large, frequently used input files, such as mailer suppression lists. For example, many mailers use the DMA's MPS file, which lists about 3 million people who don't want to receive direct mail. Including this file as input suppresses these people from appearing on any mailing list produced by the MCD job.

When using reference files, you can change your matching and breaking setup in subsequent MCD passes or jobs. However, you must stay within the bounds of the key data that was captured when the reference file was created. The reference file can't accommodate changes in the key data, or changes in list or input filter restrictions that apply to that file.

Lists and priorities

Reference files inherit from their input file the settings that are used in their corresponding input lists. (Lists are explained later in this chapter; priorities are explained in the following chapter.) Therefore, a reference file must be regenerated if your job changes either of the following:

List_ID
    Changing to different List_ID field values. A reference file inherits the List_ID of its input file, whether the List_ID is defined in the DEF file as a constant or as a field. If the input file has no List_ID, then neither will the reference file.

Priority field
    Changing the priority field to a different field.

When you produce a reference file, generate the Job Summary report so that you have a record of all the relevant job settings, and include any options that you may want to use in jobs that rely on this reference file.

Purge an input file

When your MCD job includes input posting or group posting during an input file purge, MCD posts to both the input file and its associated reference file. For details about input file purging with reference files, refer to Purge the input file.

Group your records with lists

A list is a grouping of records on the basis of some data characteristic that you can identify. A list might be all records from one input file, or all records that contain a particular value in a particular field. Lists are abstract and arbitrary: there is no physical boundary line between lists. List membership can cut across input files as well as distinguish among records within a file, based on how you define the list.

Your MCD job can include up to 2,000 lists. However, if you are willing to treat all your input records as normal, eligible records with equal priority, then you do not need to include lists in your MCD job.

Typically, an MCD user expects some characteristic or combination of characteristics to be significant, either for selecting the best matching record, or for deciding which records to include in or exclude from the job output. Lists enable you to attach those characteristics to a record, by virtue of that record's membership in its particular list.

Before getting to the details about how to set up and use lists, here are some of the many reasons you might want to include lists in your job:

To give one set of records priority over others. For example, you might want to give the records of your master file priority over the records from an update file. For more information, refer to Prioritize or suppress records based on list membership on page 52.

To identify a set of records that MCD uses to exclude other records from the output of your job. These are suppression-list records. For more information, refer to Prioritize or suppress records based on list membership on page 52.

To set up a set of records that should not be counted toward multi-buyer status. For example, some mailers use a seed list of potential buyers who report back to the mailer when they receive a mail piece so that the mailer can measure delivery. These are special-type records.

To save processing time by canceling the dupe search within a set of records that you know contains no matching records. You must be certain that there are no matching records within the list, although there may still be matches among lists. You could set up several such lists and cancel searching within each one.

To get separate report statistics for a set of records within an input file, or to get report statistics for groups of lists. Refer to Statistics files on page 84 for details about report statistics and Use super lists for report data on page 91 for details about super lists.

Use lists to control the matching process

This chapter focuses on lists, rather than on the matching process. Because of that, we'll concentrate here on how to set up your lists, how to establish their list properties (see the table below), and, in general, what those properties do. For instruction about how to fine-tune your match setup with these and other controls, refer to Chapters 8, 9, 10, and 11 of this guide.

For each list, you can set the properties (or characteristics) shown in the table below. Each record of the list then assumes those characteristics as they are set for the list. When MCD deals with a record, its list settings affect the results as shown below. The following pages provide details about each of these settings.

Setting                         Effect on matching
List Type                       MCD includes three types of lists: normal, suppress, and special. In the matching process, a record is treated differently, depending on its type (refer to page 33).
Dupe Search Within This List    If you know a record has no matches within the records of its list, you can direct MCD to exclude this record from the search for duplicates within this list, but continue to search for duplicates among records from other lists. This can save processing time (refer to page 35).
List Break Priority             You can direct MCD to prefer records of certain lists to be the driver records for comparisons (refer to page 36).
List Match Priority             You can direct MCD to prefer records of certain lists to become the master record from among matching records.
Suppress Apply Blank Priority   You can independently control whether MCD uses or ignores blank priority for suppression-list records.
Perform Data Salvage            You can independently control data salvaging in comparisons with any type of list (refer to Fine-tune your matching process on page 218).
Use List to Assign New ID       This lets you generate a value for AP.ID_INC_NO on a per-list basis. You might want to enable or disable generating a value for AP.ID_INC_NO if some incoming records already have a valid ID and you do not want to assign them a new one.

List types

Match/Consolidate lets you identify each list as one of three different types: Normal, Suppression, or Special. Match/Consolidate can process your records differently depending on their list type.

List          Description
Normal        A list of records that MCD should consider to be good, eligible records.
Suppression   A list of records that should not be used. A list of records MCD uses to prevent matching records of other lists from being sent to the output.
Special       A list of records that should be treated as transparent, such as seed lists. They are not counted in determining how to characterize a match group (for example, multi-list or single-list).

The reason for identifying the list type is to set that identity for each of the records that are members of the list. List type plays an important role in how MCD processes matching records (the members of dupe groups) and how MCD produces output (that is, whether it includes or excludes a record from its output).

If Match/Consolidate sets the list type

If you elect to have MCD automatically generate lists from your PW.List_ID fields, then you can also have MCD set the list type for each list. Here are your alternatives:

If you'd like all the records of a file to have the same list type, you can add a PW.List_Type entry to the file's definition (DEF) file.

If types of records are mixed in your input file, and if the list type is stored in one of the database fields, then you can use that field to identify each record's type to MCD. In the file's definition file, set PW.List_Type to that database field. The first letter of the contents of that field must be N, P, or S (for Normal, Suppress/Purge, and Special).
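The first-letter rule for a field-based PW.List_Type can be illustrated with a small Python sketch. The function name is hypothetical, and two details are assumptions rather than documented MCD behavior: that leading spaces are ignored and that the letter is matched without regard to case.

```python
# First letter of the field contents -> list type, as described above.
LIST_TYPES = {"N": "Normal", "P": "Suppression", "S": "Special"}


def list_type_from_field(value):
    """Return the list type for a field value, or None if undetermined."""
    # Assumption: leading blanks are skipped and case is ignored.
    first_letter = value.strip()[:1].upper()
    return LIST_TYPES.get(first_letter)
```

A return value of None stands for the case where the type cannot be determined from the field, which a real job would resolve through its default list settings.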

If you set the list type

If you elect to manually set up your list(s), assign the list type in your Setup Input List block. Refer to Prioritize and suppress records on page 47 for information about how the list type affects ranking and suppression of records.

Note that if MCD cannot assign a list based on the PW.List_ID as explained on the previous page, it assigns the list according to the undetermined list options setting in the Input List Defaults. Note also that if 2,000 lists have already been automatically generated, any records that cannot be assigned to one of those 2,000 lists are also assigned from the Input List Defaults.
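The interaction between automatic list generation and the 2,000-list limit can be sketched in Python. This is a simplified model under stated assumptions: the function name is hypothetical, a blank List_ID is treated as undeterminable, and None stands for "fall back to the Input List Defaults".

```python
MAX_LISTS = 2000  # MCD limit on the number of lists in a job


def auto_assign(list_ids, max_lists=MAX_LISTS):
    """Map each record's List_ID value to a generated list name.

    Returns one entry per record: the list name, or None when the record
    must be handled by the Input List Defaults instead.
    """
    generated = set()
    assignments = []
    for list_id in list_ids:
        name = list_id.strip()
        if not name:
            # Blank List_ID: no list can be determined.
            assignments.append(None)
        elif name in generated:
            assignments.append(name)
        elif len(generated) < max_lists:
            # A new value creates a new list while there is room.
            generated.add(name)
            assignments.append(name)
        else:
            # Limit reached: a new value cannot create another list.
            assignments.append(None)
    return assignments
```

Passing a small max_lists makes the overflow behavior easy to see without generating 2,000 distinct values.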

Dupe search within this list

Your job may include some records that you are certain have no matching records within their list. For example, you may have an input file that has already been de-duped by processing it with MCD. For these records, any time that MCD spends looking for matches within already de-duped records is wasted. This list property lets you avoid wasting that time by directing MCD not to search for duplicate records within this list.

If Match/Consolidate sets the dupe search

If you elect to have MCD automatically generate lists from your PW.List_ID fields, then you can also have MCD set this dupe search value for each list. Here are your alternatives:

If you'd like all the records of a file to be treated the same way in terms of the dupe search, you can add a PW.List_Srch entry to the file's definition (DEF) file, either Y or N (for Yes or No).

If your input file contains a mix of records, some of which should be included in the search for duplicates and others which should be excluded, then you may be able to use a database field to identify each record's dupe search status to MCD. In the file's definition file, set PW.List_Srch to that database field. The first letter of the contents of that field must be Y or N (for Yes or No).

When MCD performs the duplicate search process for a record whose PW.List_Srch value is Y, it compares that record to other records of its list. However, for records with a PW.List_Srch value of N, the comparison process ignores the other records of its list.

If MCD cannot assign the value based on the PW.List_Srch as explained above, it assigns the default value from the Input List Defaults.

If you set up the lists

If you elect to manually set up your list(s), set list search in the Setup Input List block of your MCD job.
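The effect of the dupe search setting on record comparisons can be expressed as a small predicate. The following Python sketch is illustrative only: the dictionary keys list_id and list_srch are hypothetical stand-ins for the PW.List_ID and PW.List_Srch values, and skipping the pair when either record is marked N is an assumption about how a mixed pair would be handled.

```python
def should_compare(rec_a, rec_b):
    """Return True if two records should be compared for duplicates."""
    same_list = rec_a["list_id"] == rec_b["list_id"]
    # A record marked N ignores the other records of its own list,
    # but is still compared to records from other lists.
    if same_list and "N" in (rec_a["list_srch"], rec_b["list_srch"]):
        return False
    return True
```

Records from different lists are always candidates, which is why an already de-duped file can still be matched against the rest of the job.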

List Break Priority

By assigning a break priority value to a list, you can influence which record of a break group is identified as the driver record. The driver record is the record to which others are compared during the duplicate detection process.

There are various reasons why you may want MCD to use the records of a particular list (or lists) as driver records. For example, you may want your best records driving the matching process. The details of the matching process are complex, and the selection of the driver record can affect the results. For details about the driver record and how the comparisons are made, refer to Comparisons start with the driver record on page 196.

If Match/Consolidate sets the list break priority

If you elect to have MCD automatically generate lists from your PW.List_ID fields, then you can also have MCD set the break priority value for each list. Here are your alternatives:

If you'd like all the records of a file to have the same break priority, you can add a PW.Driv_Prior entry to the file's definition (DEF) file.

If your input file contains a mix of records that should be prioritized differently as drivers, then you may be able to use a database field to identify each record's break priority status to MCD. In the file's definition file, set PW.Driv_Prior to that database field. The contents of that field must be a number from 0 to 255. When MCD processes the records within the break group, it uses the value it finds in that field for each record. Keep in mind that the lower the number, the higher the priority.

If MCD cannot assign the value based on the PW.Driv_Prior as explained above, it assigns the default value from the Input List Defaults.

If you set the list break priority

If you elect to manually set up your list(s), set the break priority value in the Setup Input List block of your MCD job.
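Because a lower break priority number means a higher priority, picking the preferred driver from a break group amounts to taking the minimum. The Python sketch below illustrates that rule only; the driv_prior key is a hypothetical stand-in for the PW.Driv_Prior value, and resolving ties in favor of the earliest record is an assumption, since the actual selection of the driver record is more involved.

```python
def pick_driver(break_group):
    """Return the record with the highest break priority (lowest number)."""
    for rec in break_group:
        if not 0 <= rec["driv_prior"] <= 255:
            raise ValueError("break priority must be a number from 0 to 255")
    # min() keeps the first record among equal priorities (tie rule assumed).
    return min(break_group, key=lambda rec: rec["driv_prior"])
```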

Three approaches to defining lists

There are three different approaches to defining lists. You can use any or all of these approaches within your MCD job.

Treat an entire input file as a list

A common way of defining lists is to treat each input file as a list. For example, suppose your job includes a master file and two update files. In such a case, you may prefer to use the records of your master file over any matching records from your update files. That is, if records from different files match, you may want MCD to use your house record instead of an update record.

[Figure: three input files (Master file, Update file 1, Update file 2), each mapped to its own list (Master List, Update 1, Update 2)]

To do this, define each input file as a list and set each list's priority so that MCD will prioritize your house records over those of the update lists.

Link PW.List_ID to an input file

First, you'll need to establish a constant value in the input file's definition (DEF) file. For example, if you intend that all the records of input file acme.dbf be considered members of a list, then in the acme.def file, set PW.List_ID to a constant value, such as "house". The quotation marks around "house" mark it as a constant rather than a field name.

    DATABASE TYPE = DBASE3
    NAME_LINE = NAME
    FIRM = COMPANY
    ADDRESS = ADDRESS
    LAST_LINE = CITY&STATE&ZIP
    list_id = "house"

For more information about DEF files, refer to your Database Prep manual.

Set your job for the PW.List_ID

Then you'll need to set your job to recognize and act on that List_ID. You can set MCD to automatically generate lists from List_ID values, and you can also manually control all or part of the list generation process.

To have MCD automatically generate lists from List_ID values, turn on the Auto Generate control of the Input List Defaults block.

To manually control what lists are generated, turn that control off and set up an Input List block for each list you want to use.

The result of this approach is that MCD generates a list for the records of each input file, as shown below:

    Input file, List_ID = house    List: house
    Input file, List_ID = renta    List: renta
    Input file, List_ID = rentb    List: rentb

Select records based on a value in a field

But suppose you don't want all the records of an input file to belong to the same list. Instead, you have records of three different lists together in one file. In this case, you can use the value in one of your database fields to identify the list to which each record belongs.

For example, for an input file acme.dbf with a List_Code database field that contains a value of A, B, or C, that database field value can be used to identify the list to which each record belongs.

This approach is not limited to just one input file. The same lists, or additional ones, can be set up for additional input files.

Link PW.List_ID to the field

First, identify the significant field in the input file's definition (DEF) file. From the example above, set PW.List_ID to the List_Code field.

    DATABASE TYPE = DBASE3
    NAME_LINE = NAME
    FIRM = COMPANY
    ADDRESS = ADDRESS
    LAST_LINE = CITY&STATE&ZIP
    list_id = lst_code

For complete information about DEF files, refer to Database Prep.

Set your job for the PW.List_ID

Set your job to recognize and act on the value of that List_ID. You can set MCD to automatically generate lists from List_ID values, and you can also manually control all or part of the list generation process.

To have MCD automatically generate lists from List_ID values, enable the Auto Generate List from List_ID control of the Input List Defaults block.

To manually control what lists are generated, turn that control off and set up an Input List block for each list you want to use. In this case, you'd need a list for each predicted value that this List_ID might include.

As a result, MCD generates a list for each different value of the List_ID field, up to the MCD limit of 2,000 lists:

    Input file, Lst_Code = A    List: A
    Input file, Lst_Code = B    List: B
    Input file, Lst_Code = C    List: C

Select records based on criteria

A third approach to defining lists is to establish a record's membership in a list based on some database-derived criteria that you design. This approach uses the MCD filter capability.

In this approach, you create lists (that is, you define list membership) based on the result of filters that are identified for each list. Typically, the filter sets a range of values that qualifies a record for membership in the list.

In this approach, you need not define PW.List_ID in your DEF files; instead, you define a filter statement for each list. Note that you cannot both define a List ID and use a filter to define a list in the same list block.

For example, if your database has a field that contains an annual income value (we'll call that field DB.Income), you could define lists for ranges of annual income. You might want to set three lists:

List_1 for records with an annual income below $20,000
List_2 for records with an income between $20,000 and $30,000
List_3 for records with an annual income above $30,000

Link a filter to a list

Define each of these three lists with Input List blocks, and include list filters such as those shown below. (Refer to Database Prep for complete details about filters and functions.)

    For List_1: val(db.income) < 20000
    For List_2: val(db.income) >= 20000 .and. val(db.income) <= 30000
    For List_3: val(db.income) > 30000

The order of your lists is important when you use filters to define them. Once a record is assigned to a list, it is not eligible to be assigned to any other list. Match/Consolidate assigns lists in the order of the Input List blocks. The first filter that evaluates to true puts the record into that list. In this example, if the List_2 filter did not include val(db.income) <= 30000, then the records that you would want in List_3 would be made members of List_2 instead.
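The first-match rule for filter-defined lists can be demonstrated with a short Python sketch. The filters mirror the income ranges in the example above; the record layout (an income key) and the function name are hypothetical, and the lambdas only stand in for MCD filter statements.

```python
# Lists in job order: the first filter that evaluates to true wins.
INPUT_LISTS = [
    ("List_1", lambda rec: rec["income"] < 20000),
    ("List_2", lambda rec: 20000 <= rec["income"] <= 30000),
    ("List_3", lambda rec: rec["income"] > 30000),
]


def assign_list(rec, input_lists=INPUT_LISTS):
    """Return the name of the first list whose filter accepts the record."""
    for name, list_filter in input_lists:
        if list_filter(rec):
            return name
    # No filter matched: the record is undetermined.
    return None
```

To see why order matters, replace the List_2 filter with one that has no upper bound, such as lambda rec: rec["income"] >= 20000. A record with an income of 45000 then lands in List_2, and List_3 never receives any records.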

When a record doesn't fit into any list

Regardless of which approach you use to assign list membership, you need to tell MCD what to do with records that do not belong to any defined list. This can happen for a variety of reasons, such as a defined PW.List_ID field being blank, field data that was not properly entered or is inconsistent with the list definition, or filter data that is not present or usable.

For those records that do not meet the criteria for any of your defined lists, you have three choices:

Action           Description
Ignore           Leave the record out of the job.
Abort            Halt processing and issue an error.
Assign Default   Assign the record to a list that you set as the default list.

For example, you might elect to assign all such undetermined records to a List_4 list. If so, you would select the Assign Default option at the Undetermined List control of the Input List Defaults block, and identify List_4 as the Default List Name. Note that the default list must be defined with an Input List block as well.
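The three choices for undetermined records can be modeled as a small dispatch function. This Python sketch is illustrative: the function name, the use of None to mean "no list could be determined" or "drop the record", and the exception types are all assumptions made for the example, not MCD behavior.

```python
def resolve_list(list_name, action="Ignore", default_list=None):
    """Apply the undetermined-list action to one record's list assignment.

    list_name is None when no defined list fits the record. Returns the
    list to use, or None when the record should be left out of the job.
    """
    if list_name is not None:
        return list_name
    if action == "Ignore":
        return None  # leave the record out of the job
    if action == "Abort":
        raise RuntimeError("record does not belong to any defined list")
    if action == "Assign Default":
        if default_list is None:
            raise ValueError("Assign Default needs a default list name")
        return default_list
    raise ValueError(f"unknown action: {action}")
```

With action="Assign Default" and default_list="List_4", every undetermined record ends up in List_4, as in the example above.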

Create groups of lists (super lists)

The super list capability adds a higher level of list management. For example, suppose you rented several files from two brokers. You define five lists to be used in ranking the records. In addition, you would like to see your job's statistics broken down by broker as well as by file. To do this, you can define a group of lists (a super list) for each broker.

Define each super list with a Super List Description block, such as those shown below.

  Broker A Files (super list for Broker A): List_1, List_2, List_3
  Broker B Files (super list for Broker B): List_4, List_5

Super lists primarily affect reports. However, you can also use super lists to select multi-buyers based on the number of super lists in which a name occurs. This means that you can use super list membership to control output. For details, refer to Use super lists to find multi-buyers on page 75.

Bear in mind that you cannot use super lists in the same way you use lists. For example, you cannot give one super list priority over another, nor can you cancel matching within a super list.
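To make the multi-buyer idea concrete, here is a minimal Python sketch that counts how many super lists contributed records to one match group. The function names and the `minimum` threshold are hypothetical; the guide does not describe how MCD implements this internally.

```python
# Super list membership from the broker example above.
SUPER_LISTS = {
    "Broker A Files": {"List_1", "List_2", "List_3"},
    "Broker B Files": {"List_4", "List_5"},
}


def super_lists_hit(group_lists, super_lists=SUPER_LISTS):
    """Return the super lists that supplied at least one record to the group."""
    present = set(group_lists)
    return {name for name, members in super_lists.items() if members & present}


def is_multi_buyer(group_lists, minimum=2):
    """Treat a name as a multi-buyer if it occurs in `minimum` or more super lists."""
    return len(super_lists_hit(group_lists)) >= minimum


if __name__ == "__main__":
    print(is_multi_buyer(["List_1", "List_2"]))  # same broker only
    print(is_multi_buyer(["List_1", "List_4"]))  # both brokers
```

A name found on two lists from the same broker counts once, because both lists belong to the same super list.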

Reports on your lists

Match/Consolidate includes a wide range of reports that record what the program has done in working with your lists. These reports provide your primary insight into how your results compare to your expectations, and they provide the clues for making adjustments to improve your results. As with the other steps of the process, study the reports that show what MCD has done. If your results show any trends that could be improved by adjusting your settings, change those settings and re-process the job.

What reports can tell you

These reports provide information that you can study to determine what adjustments, if any, you should make to your list setup, or to other aspects of your job setup, to optimize your results. Regarding input lists, here are the sorts of questions that you can answer with the various MCD reports.

Have the records of my lists been read and has their data been appropriately included in the job?

Reports that show input record quality

The Input File Summary shows the number of records in each file and the number of records that were input. That report can also show the number of records that were not input because they could not be identified with a list (list drops). To show list drops, set the Undetermined List Action option in the Input List Defaults block to Ignore.

  [Sample report: Input File Summary Report. Columns: Input File, Gross Input, Delete Drops, Filter Drops, List Drops, Sample Drops, Net Input. Rows: house.txt, mail_sup.txt, house_fm.txt, update_1.txt, rent_mag.txt, Totals.]

The Input List Summary shows the number of input records from each of the job's input lists. Two columns distinguish records assigned by default from those identified through your list identification controls.

  [Sample report: Input List Summary Report. Columns: List, Matched Id Records, Default Records, Net Input. Rows: house, firms, no_mail, select, update, Totals.]

By correlating the file information with the list information, you can see whether your records have been read, and whether the records have been assigned to lists as you expected.

Reports that show input record quality

The List Quality report shows how well your record data was parsed. It gives the raw numbers and percentages that reflect the name, firm, and address quality of your records, by list, so you can see whether each list's records were read and parsed successfully.

The Unparsed Records report goes beyond statistics to show the content of records that could not be parsed and the reason they were unparsed. This report is especially useful for troubleshooting records whose dupe detection was affected by unparsed data (refer to Match with unparsed addresses, last lines, names, and firms on page 208). List membership is identified for each record on this report, as well.

Among matching records, which records have come from which lists?

List identity is on the Duplicate Records report, the Sorted Records report, and the Unparsed Records report. On the Duplicate Records report, for each dupe group, you can see which lists its records came from.

  [Sample report: Duplicate Records Report. Columns: Code, List, File, Record, LIST_ID, NAME_LINE, ADDRESS, CITY, FIRM. Each dupe group shown combines records from the house and firms lists, for example H. V. JACOBSEN, P.O. BOX C, SANTA ANA.]

What matching records have been found among and between the lists?

To get a clear picture of the matching that was detected within and among your lists, you can study several reports that were designed for exactly that purpose. The List by List Match report, the List Match report, and the Multi-List report show, for each input list, how many of its records were found to match records in the lists of the job (inter-list and intra-list matches). Each report uses a different format, so choose the one most useful for your purposes.

The Summary version of the List Match report shows the number of matches that MCD found for records that are members of each list. If you are surprised to find no intra-list matches, check your setting of the Search for Dupes Within This List option.

  [Sample report: List Match Report, Summary Information. Columns: List, Net Input, Intra List Matches, Inter List Matches, Total Matches, Percent of Net Input. Rows: house, firms, no_mail, select, update, Totals.]

How have my lists affected the results (the output) of my job?

Because lists are so important in categorizing and ranking the members of match groups, you can use list statistics and other report information to better assess the results of your job.

  [Sample report: List Duplicates Report, Summary Information. Columns: List Name, List_id, List Type, List Priority, Total Dupes, Pct of Net, Total Non Dupes, Pct of Net, Net Input. Rows: house (NORM), firms (NORM), no_mail (SUPP), select (NORM), update (NORM).]

The List Duplicates report (Summary version) shows the numbers of records, by list, that have been designated for each match status, and will therefore be kept or dropped as your output.

The Multi-List report can be very useful when creating multi-occurrence files. For example, if you want to create a multi-buyer file, this report shows the number of records from each list that were matched to records from other lists (the inter-list matches). Refer to How Match/Consolidate counts intra-list and inter-list matches on page 88 for a detailed explanation of inter-list and intra-list matches.

  [Sample report: Multi-List Report. Rows: house, firms, no_mail, select, update, Totals.]

The Output File report shows, on a list-by-list basis, the number of records that were included in the job's output file(s).

  [Sample report: Output File Report, Summary Information, for C:\pw\mpg\Work\output\Out_MPG.txt. Columns: List Name, Net Input, List_id, List Type, List Priority, Filter Drops, Pct of Net Input, Net Output, Pct of Net Input. Rows: house (net input 1000, NORM), firms (net input 1000, NORM), Totals, Totals (Including Suppression Records and After Filter).]

If your job includes an input purge, the Purge by List report shows, on a list-by-list basis, the number of records that were purged, or marked for deletion. The following example shows a report generated for a job that predicted a purge, rather than performing it. For more information, see Predict a purge.

  [Sample report: Purge By List Report, Detail Information (PREDICTION). Columns: List Name, Net Input, Filter Drops, Suppress Dupes, Single List Dupes, Multiple List Dupes, Uniques, Single List Masters, Multiple List Masters, Suppress List Uniques, Suppress List Masters, Suppress List Subord. Rows: house, firms, no_mail, select, update, Totals.]

Chapter 4: Prioritize and suppress records

This chapter explains how to control the ranking of records within groups of matching records (dupe groups). Ranking of records affects which records will be master records and which records will be grouped to other records.

Record priorities and types

This chapter focuses on the ranking of matching records. Match/Consolidate (MCD) ranks records within a group after assigning matching records to a dupe group. The following tables list general terms used throughout this chapter, and the categories into which you can rank records.

Terms

  Priority
      The ranking of records within a match group. The record that has the highest priority in a match group becomes the master duplicate. You can rank records according to list membership, completeness, the contents of a particular field, or randomly.
      The ranking of record keys within a break group. The higher the ranking, the more likely that the record key will become a driver record for match comparisons. Within break groups, records can be ranked only by List Break Priority.
      The score a record receives to determine its rank. Priority is scored on a penalty system: the fewer penalty points a record receives, the higher its priority.

  Record (match key)
      In this chapter, when we use the term record, keep in mind that we are referring to the match key for a record. The match key usually does not contain the entire data from a record.

  Suppression list (Suppress-type list)
      A list of records MCD uses to prevent matching records of other lists from being sent to the output. For example, this might be delinquent accounts or consumers who have requested suppression of advertising mail. For information about other list types (normal and special), refer to List types on page 33.

  Suppression record
      A record that came from a suppression list.

  Master dupe
      The highest ranking member of a match (dupe) group.

  Subordinate dupe
      Any member of a match (dupe) group that is not the highest ranking member.

Categories

  Suppression list dupe
      Subordinate member of a dupe group that includes a higher-priority record that came from a Suppress-type list. Can be from a normal- or special-type list.

  Single list dupe
      Subordinate members of a dupe group whose members all came from the same list. Can be from normal- or special-type lists.

  Multiple list dupe
      Subordinate members of a dupe group whose members came from two or more lists. Can be from normal- or special-type lists.

  Unique record
      Records that are not members of any dupe group. No matching records were found. Can be from normal- or special-type lists.

  Single list master
      Highest ranking member of a dupe group whose members all came from the same list. Can be from normal- or special-type lists.

  Multiple list master
      Highest ranking member of a dupe group whose members came from two or more lists. Can be from normal- or special-type lists.

  Suppression list unique
      Records that came from a Suppress-type list, and for which no matching records were found.

  Suppression list master
      A record that came from a Suppress-type list and is the highest ranking member of a dupe group.

  Suppression list subordinate
      A record that came from a Suppress-type list and is a subordinate member of a dupe group.

Control priority

The following table summarizes the steps that may be involved in setting up priority and suppression in your MCD product.

Product: MCD Job and Views, Standard Matching

  List Break Priority: If manually defining lists, in the Input List Description block or window, set a break priority number from 0 to 255. If automatically generating lists with PW.List_ID, use PW.Driv_Prior to set the break priority. Used to determine the driver record.
  List match priority: In the Input List Description section, set a list match priority number from 0 to 999. To create a suppression list, set the List Type parameter to Suppress. If you are automatically generating lists with PW.List_ID, use PW.List_Prior to set the list priority.
  Blank-field priority: In the Match Criteria section, set a number from -999 to 999 as the blank priority for a key field. This is valid only when used with blank matching.
  Field priority: In the Match Options section, set the priority to ascending or descending. In your DEF file(s), define the priority field as PW.Priority; for example, PW.Priority = ExpireDate.
  Random priority: In the Match Options section, set random sortation to Yes.

Product: MCD Job, Extended Matching

  List Break Priority: In the Input List Description block or window, set a break priority number from 0 to 255.
  List match priority: In addition to the job-file setup described above, in the Prioritize Matches section of the extended matching file, set the Type to List, Fld, List_Fld, or Fld_List.
  Blank-field priority: In the Prioritize Matches section of the extended matching file, set up a Blank Priority parameter with the field name and a number from -999 to 999.
  Field priority: In the Parsing and Key Options section of the extended matching file, set Store Priority Field to Yes. In the Prioritize Matches section, specify ascending or descending order at the Priority Field Order parameter. In your DEF file(s), define the priority field as PW.Priority.
  Random priority: In the Prioritize Matches section of the extended matching file, set Break Priority Ties Randomly to Yes.

Product: MCD Library with configuration files

  List Break Priority: See MP_List_Config_File (mplist.cfg).
  List match priority: See MP_List_Config_File (mplist.cfg).
  Blank-field priority: See MP_List_Config_File (mplist.cfg).
  Field priority: See MP_Reference_Config_File (mpref.cfg).

Product: MCD Library without configuration files

  List Break Priority: Call mp_list_set_list_attr() with MP_LIST_ATTR_BREAK_PRIORITY.
  List match priority: Call mp_list_set_list_priority() with MP_LIST_ATTR_PRIORITY.
  Blank-field priority: Call mp_misc_set_blank_field_priority().
  Field priority: To create a priority field, call mp_refcreate_set_option() and set a value for MP_REFCREATE_OPTION_PRIORITY_FLD_LEN. To set the priority order to ascending or descending, call mp_misc_set_option_info() with MP_MISC_OPTION_PRIORITY_FIELD_ORDER.

Record priority and suppression

Priority within match groups

When two records match, they are assigned to a match group, along with any other records that were found to match the same record. In this chapter, we will not go into detail on the match or dupe detection process. A complete explanation of matching (how MCD determines whether two records are matches or dupes) begins in Record matching on page 163. Our focus here is how MCD ranks records within a match group. How does MCD identify the best record from among the matching records?

After MCD finishes the search for matches and all the match groups are formed, MCD sorts the records within each match group. The highest ranking record in each group is the master record. All other members of the group are subordinate records. For most purposes, you can consider the master record to be the best record of the dupe group.

  Master record
    Subordinate #1
    Subordinate #2
    Subordinate #3

You can control how MCD sorts records within a match group. For example, you might want to prefer records from a file you own over records from rented lists. Or, you may want to prefer newer records over older records, or more complete records over those with blank fields. Whatever your preference, the way to express it to MCD is through priorities.

Different types of priorities

There are four different types of priority you can use in MCD:

  Priority              Brief description
  List match priority   Prefer records from one input list over another.
  Blank-field priority  Assign a lower priority to records in which a particular field is blank.
  Field priority        Rank records in ascending or descending order based on the contents of a particular field.
  Random priority       To break ties, assign a random number to each record and assign priority based on that random number. If you do not use random priority, ties are broken in favor of the driver record, then by input file and record number.

Priorities are assessed in sequence

With standard matching, priorities work in the following sequence, or hierarchy:

  1. List match priority + blank-field priority
  2. Field priority
  3. Random priority

With standard matching, list match priority therefore has precedence over field priority. However, with Extended Matching, you can reverse this precedence. For more information, refer to the Prioritize Matches section in Chapter 2 of the Extended Matching Reference manual.

Field priority is used only as a tie-breaker if two records have the same score for list and blank-field priority. Likewise, random priority is used only as a tie-breaker if two records are tied for list, blank-field, and field priority.
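The standard-matching hierarchy can be pictured as a multi-level sort. The Python sketch below is an illustration under stated assumptions, not MCD's algorithm: the `Rec` class and its field names are hypothetical, random priority is left out, and ties fall back to the driver record, then input file and record number, as described above.

```python
from dataclasses import dataclass


@dataclass
class Rec:
    name: str
    list_penalty: int         # list match priority (penalty points)
    blank_penalty: int        # blank-field priority (penalty points)
    priority_field: str = ""  # contents of the PW.Priority field
    is_driver: bool = False
    file_no: int = 0
    record_no: int = 0


def rank_group(records, field_order="ascend"):
    """Rank a match group: list + blank score, then field, then tie-breakers."""
    # Python's sort is stable, so sort on the least significant key first.
    ranked = sorted(records, key=lambda r: (not r.is_driver, r.file_no, r.record_no))
    ranked = sorted(ranked, key=lambda r: r.priority_field,
                    reverse=(field_order == "descend"))
    ranked = sorted(ranked, key=lambda r: r.list_penalty + r.blank_penalty)
    return ranked  # ranked[0] is the master record


if __name__ == "__main__":
    group = [
        Rec("a", 100, 0, "2009-01-01", file_no=1, record_no=5),
        Rec("b", 100, 0, "2010-01-01", file_no=2, record_no=1),
        Rec("c", 200, 0, "2011-01-01"),
        Rec("d", 95, 5, "2010-01-01", is_driver=True, file_no=3, record_no=9),
    ]
    print([r.name for r in rank_group(group, "descend")])
```

Records a, b, and d tie at 100 combined penalty points, so field priority separates them; b and d then tie on the field value, and the driver record d wins.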

Random priority

Random priority is an option that assigns a random number to each record and sorts on that random number. This means that if you run the same job twice, you may get a different set of surviving records each time. If you do not elect random sortation (many users do not), ties are broken in favor of the driver record, then by input file and record number.

Suppressing and priority

In addition to preferring certain kinds of records, you can actively suppress certain records. That is, you can take steps to exclude records from the output of your MCD job. The way to suppress records is to identify them as members of a suppression list. In this chapter, we explain how to prioritize records both for the purpose of preference and for the purpose of suppressing them from your results.

When you use more than one input file and elect not to dupe search within a list, the results of your duplicate records search can be compromised by the order of your input files. If one of your input files is a suppress-type list, your output could include records that you wanted (and expected) to be suppressed.

In general, if you are using a suppress-type list, you should dupe search within your other lists. That ensures that all the dupes of those lists are suppressed when any is found to duplicate a record on the suppress list, regardless of which record is the driver record.

If you cannot dupe search within all your lists, you may find a workaround by reordering your input files. If you can, place the input file that includes your suppress records ahead of the other input files (as Input File blocks in your job file or setup). Then, assuming you have not set up list break priorities, those suppression records will be more likely to be driver records for your match comparisons.

Prioritize or suppress records based on list membership

List match priority

You can prioritize or suppress records based on list membership. For example, suppose you are a charitable foundation mailing a solicitation to your current donors and to names from two rented lists. If a name appears on your house list and a rented list, you prefer to use the name from your house list.

For one of the rented lists, List B, suppose also that you can negotiate a rebate for any records you do not use. You want to use as few records as possible from List B so that you can get the largest possible rebate. Therefore, you want records from List B to have the lowest preference, or priority, among the three lists.

  List           Priority
  House List     Highest
  Rented List A  Medium
  Rented List B  Lowest

Penalty scoring system

When you set up your lists, you can assign a priority to each list. Think of priority as a penalty-scoring system. You assign the most penalty points to the least desirable list, and the fewest penalty points to the most desirable list. Fewer penalty points means higher priority.

For example, suppose we want to take records from our house list first, then rented List A, then rented List B. To do this, we'll assign the fewest penalty points to our house list and the most penalty points to List B:

  List           List match priority (penalty points)
  House List     100
  Rented List A  200
  Rented List B  300

You can assign any score between -999 and 999, using any combination of numbers: for example, 1/2/3, 10/20/30, or 100/200/300. Assess a higher penalty to the least desirable list, and a lower penalty to the most desirable list.

Blank-field priority

List match priority interacts with blank-field priority, but we'll explain list match priority first. Therefore, the examples that follow ignore blank-field priority. For details about blank-field priority, refer to Penalize records that contain blank fields.

Suppression lists often have a high priority

In most cases, you will want suppression list records to have a high priority (that is, a low penalty score). This makes it likely that normal records that match a suppress record will be subordinate duplicates, and will therefore be suppressed as well. Within each match group, any record with a lower priority than a suppression list record is considered a suppress dupe.

For example, suppose you are running your files against the DMA Mail Preference File (a list of people who do not want to receive advertising mailings). You would identify the DMA list as a suppression list and assign a list match priority of zero.

  List                  List match priority (penalty points)
  DMA Suppression List  0
  House List            100
  Rented List A         200
  Rented List B         300

Suppose MCD found four matching records among the input records, and therefore established the following dupe group.

  Matching record (name fields only)  List    List match priority
  Maria Ramirez                       House   100
  Ms. Ramirez                         List B  300
  Ms. Maria A Ramirez                 List A  200
  Ms. Maria A Ramirez                 DMA     0

Based on their list match priority, MCD would rank the records as follows. As a result, the record from the suppression file (the DMA file) would be the master record, and the others would be subordinate suppress dupes, and thus suppressed as well.

  Rank        List    List match priority
  1 (Master)  DMA     0
  2           House   100
  3           List A  200
  4           List B  300
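The ranking in the Ramirez example can be reproduced with a short Python sketch. It is a simplified model of the rule stated above (any record ranked below a suppression-list record is a suppress dupe), not MCD code; the `Rec` tuple, `label_group`, and the role labels are hypothetical.

```python
from typing import NamedTuple


class Rec(NamedTuple):
    name: str
    list_name: str
    penalty: int                    # list match priority (penalty points)
    from_suppress_list: bool = False


def label_group(group):
    """Rank a dupe group by penalty and label each record's role."""
    ranked = sorted(group, key=lambda r: r.penalty)
    suppress_scores = [r.penalty for r in ranked if r.from_suppress_list]
    best_suppress = min(suppress_scores) if suppress_scores else None
    labels = []
    for position, rec in enumerate(ranked):
        if position == 0:
            role = "master"
        elif (best_suppress is not None and not rec.from_suppress_list
              and rec.penalty > best_suppress):
            role = "suppress dupe"  # ranked below a suppression-list record
        else:
            role = "subordinate"
        labels.append((rec.list_name, role))
    return labels


if __name__ == "__main__":
    ramirez = [
        Rec("Maria Ramirez", "House", 100),
        Rec("Ms. Ramirez", "List B", 300),
        Rec("Ms. Maria A Ramirez", "List A", 200),
        Rec("Ms. Maria A Ramirez", "DMA", 0, from_suppress_list=True),
    ]
    for list_name, role in label_group(ramirez):
        print(list_name, role)
```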

Penalize records that contain blank fields

Blank-field priority

Given two records, you may prefer to keep the record that contains the most complete data. You can use blank-field priority to penalize records that contain blank fields.

Use with blank matching

Blank-field priority is appropriate if you feel that a blank in a given field shouldn't disqualify one record from matching another. For example, suppose you are willing to accept a record as a match even if the prename, first name, middle name, street suffix, or secondary range is blank. Even though you accept these records into your match groups, you can assign them a lower priority for each blank field.

Penalty scoring system

As with list match priority, blank-field priority is a penalty-scoring system. For each blank field, you can assess a penalty of up to 999 points. You can assess the same penalty for each blank field, or assess a higher penalty for fields you consider more important. For example, if you were targeting a mailing to college students, who primarily live in apartments or dormitories, you might assess a higher penalty for a blank first name or apartment number.

  Field                        Blank-field priority (penalty points)
  Prename                      5
  First name                   20
  Middle name                  5
  Street suffix                5
  Secondary range (apartment)  20

As a result, the records below would be ranked in the order shown (assume they are from the same list, so list match priority is not a factor). Even though the first record has blank prename, middle name, and street suffix fields, we want it as the master record because it does contain the data we consider more important: first name and apartment.

  [Table: three matching records ranked by blank-field penalty. Columns: Prename (5), First (20), Middle (5), Last, Range, Street, Suffix (5), Apt (20), Blank-field penalty; the numbers in parentheses are the penalty points assessed if the field is blank. Rows: Maria Ramirez, 100 Main: 5 + 5 + 5 = 15. Ms. Maria A Ramirez, 100 Main St: 20. Ms. Ramirez, 100 Main St: ranked last.]

Blank-field priority interacts with list match priority

When records are ranked, the list match priority and blank-field priority scores are added together and considered as one score. Therefore, you'll need to consider how blank-field priority and list match priority interact.

For example, suppose you want records from your house list to have high priority, but you also want records with blank fields to have low priority. Is list membership more important, even if some fields are blank? Or is it more important to have as complete a record as possible, even if it is not from the house list?

Most users want their house records to have priority, and would not want blank fields to override that priority. To make this happen, set a high penalty for membership in a rented list, and lower penalties for blank fields:

  List              List match priority (penalty points)
  Suppression List  0
  House List        100
  Rented List A     200
  Rented List B     300
  Rented List C     400

  Field          Blank-field priority (penalty points)
  Prename        5
  First name     20
  Middle name    5
  Street suffix  5
  Apartment      20

With this scoring system, a record from the house list will always receive priority over a record from a rented list, even if the house record has blank fields. For example, suppose the records below were in the same match group. Even though the house record contains five blank fields, it receives only 155 penalty points (100 + 55), while the record from List A receives 200 penalty points. The house record therefore has the lower penalty and the higher priority.

  [Table: three matching records with their List, Blank, and Total priority scores. Columns: List, Pre, First, Mid, Last, Range, Street, Suffix, Apt, ZIP, List priority, Blank priority, Total. Rows: House, Terranova, 100 Bren (total 155). List A, Ms. Rita A Terranova, 100 Bren Rd 12A (total 200). List B, Rita Terranova, 100 Bren Rd.]

You can manipulate the scores to set priority exactly as you'd like. In the example above, suppose you prefer a rented record containing first-name data over a house record without first-name data. You could set the first-name blank-field priority score to 500 so that a blank first-name field would weigh more heavily than any list membership.
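The combined score is simple arithmetic, sketched here in Python. The dictionaries mirror the penalty tables above; the function name `total_penalty` and the record layout are hypothetical, chosen only to show how the list and blank-field scores add up.

```python
LIST_PENALTY = {"Suppression": 0, "House": 100, "List A": 200,
                "List B": 300, "List C": 400}
BLANK_PENALTY = {"prename": 5, "first": 20, "middle": 5, "suffix": 5, "apt": 20}


def total_penalty(rec, blank_penalty=BLANK_PENALTY):
    """Add the list match penalty to the penalty for every blank field."""
    blanks = sum(points for field, points in blank_penalty.items()
                 if not (rec.get(field) or "").strip())
    return LIST_PENALTY[rec["list"]] + blanks


house = {"list": "House", "last": "Terranova"}  # five blank fields
list_a = {"list": "List A", "prename": "Ms.", "first": "Rita", "middle": "A",
          "last": "Terranova", "suffix": "Rd", "apt": "12A"}

if __name__ == "__main__":
    print(total_penalty(house), total_penalty(list_a))   # 155 200
    # Raising the first-name penalty to 500 lets the rented record win.
    heavy = dict(BLANK_PENALTY, first=500)
    print(total_penalty(house, heavy), total_penalty(list_a, heavy))  # 635 200
```

With the default penalties the house record keeps the lower score; with the first-name penalty raised to 500, its blank first name costs more than any list membership.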

Prioritize records based on the contents of one field

Sometimes you may want to prioritize records based on data in a particular field. For example, given two matching records, you might prefer the record with the larger donation, the larger credit limit, or the later expiration date.

For example, suppose you are consolidating a file of recent subscribers into your database. If two records match, you want to keep the record with the later expiration date. You can sort records in descending order by date:

  [Table: two matching records, Craig R Andrews and Mr Craig R Andrews, both at 1234 Main St, compared on the Expire Date column. Columns: Prename, First, Middle, Last, Range, Street, Suffix, Apt, ZIP, Expire Date.]

In such a situation, there are two things you must do:

  1. In your DEF file(s), define the PW field Priority, based on your amount or date field. For example, if you have an Enroll_Date database field, your DEF file should include a line like this: PW.Priority = Enroll_Date
  2. Set Field Priority to Ascend or Descend to set the direction of the prioritization. If you are using standard matching, this setting is in the job's Match Options block. If you are using extended matching, your extended matching file should include a Prioritize Matches block.

Ascending or descending

To determine priority, you can sort records in ascending order or descending order. When you set Field Priority to ASCEND, the sort sequence is 0-9, A-Z, a-z. When you set it to DESCEND, the sequence is z-a, Z-A, 9-0.

If you set priority on an amount field, select ascending order to prefer the record bearing the lesser amount. Select descending order to prefer the record bearing the greater amount.

If you set priority on a date field, select ascending order to prefer the record with the earlier date. Select descending order to prefer the later date.

To be sorted correctly, numbers may have to be right-justified or pre-padded with zeroes. For example, when sorting in ascending order, 02 comes before 10, but 2 (left-justified) comes after 10. In the MCD job-file product, the PW.Priority field is always a character-type field, regardless of your database field type.

Is field priority most important?

With standard matching, MCD uses field priority only as a tie-breaker when two records have the same list match priority and blank-field priority.

Field priority may be more important to you than list match priority or blank-field priority. For example, you might be willing to take the record with the later expiration date no matter which list it comes from. If so, assign the same list match priority to all lists, and do not use blank priority. Because all records will tie on list match priority and blank-field priority, field priority will always be used to break the tie.

Alternatively, with extended matching, you can set the priority type so that MCD uses the list match priority or blank-field priority as a tie-breaker when two records have the same field priority.
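Because the priority field is compared as characters, justification and padding matter. The Python sketch below illustrates this with ordinary string sorting, which follows the same 0-9, A-Z, a-z sequence for ASCII data. The `prefer` and `pad_amount` helpers are hypothetical, and the expiration dates are invented for the example (they assume dates stored in year-month-day order so that character order matches date order).

```python
def prefer(records, order="descend"):
    """Return the record that ranks first on its character-type priority field."""
    ranked = sorted(records, key=lambda rec: rec["priority"],
                    reverse=(order == "descend"))
    return ranked[0]


def pad_amount(value, width):
    """Right-justify a numeric string with leading zeroes."""
    return value.strip().zfill(width)


# Hypothetical expiration dates, stored as YYYYMMDD text.
subscribers = [
    {"name": "Craig R Andrews", "priority": "20090630"},
    {"name": "Mr Craig R Andrews", "priority": "20101231"},
]

if __name__ == "__main__":
    print(prefer(subscribers, "descend")["name"])         # later date wins
    print(sorted(["2", "10"]))                            # ['10', '2']
    print(sorted(pad_amount(v, 2) for v in ["2", "10"]))  # ['02', '10']
```

Unpadded, the character "2" sorts after "10" because comparison starts with the first character; padded to the same width, the values sort in numeric order.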

Reports about record ranking and priorities

Match/Consolidate has two ways to show you how it has ranked your records. To see the records themselves, produce the Duplicate Records report. Or, to see the numbers about the various matching and ranking categories, produce the List Duplicates report.

Study these reports to see what MCD has done. If your results show any trends that could be improved by adjusting your settings, change those settings and re-process the job. For example, if the records chosen as master records are not as good as the subordinate dupes, check your priorities and, if necessary, change them.

Duplicate Records report

Produce the PW version of the Duplicate Records report to see matching records (that is, the records themselves) as grouped in their dupe groups. You may not want this report to include all the dupe groups, unless there aren't many. But you may want to include at least a reasonable sample, so you can see lists like the one shown below.

The master record of each dupe group is listed first, with subordinate dupes following the master. If, to your eyes, the order within each dupe group looks right (that is, the first record appears to be the best record), then your setup is right, and no priority adjustments are needed. Check the list identifiers, too, as well as the field content, because that reflects your priority settings.

  [Sample report: Duplicate Records Report. Columns: Code, List, File, Record, LIST_ID, NAME_LINE, ADDRESS, CITY, ST, ZIP, FIRM. Dupe groups shown include H. V. JACOBSEN (SANTA ANA CA), GERALD KRYWICKI (SPRINGFIELD MA), ANGEL J RODRIGUEZ (GUAYNABO PR), DREW D HAMMOND (ST THOMAS VI), MR WILLOUGHBY LEWIS (ST THOMAS VI), MS DENISE COSTELLO (a purge group led by a no_mail record), and MARSTON ADAMS (KINGSHILL VI).]

  Code definitions
    M = Multi List
    S = Single List
    P = Purge Group
    * = Driver

  List  Listname
  1     house
  2     firms
  3     no_mail
  4     select
  5     update

  File  Filename
  1     C:\pw\mpg\Work\house.txt
  2     C:\pw\mpg\Work\mail_sup.txt
  3     C:\pw\mpg\Work\house_fm.txt
  4     C:\pw\mpg\Work\update_1.txt
  5     C:\pw\mpg\Work\rent_mag.txt

Note: The bottom of the report shows the names of the lists and files involved in the job, as well as the code definitions.

Note: The asterisk (*) in the code indicates the driver record. How the driver record affects matching is covered elsewhere in this guide.

List Duplicates report

If you want to see numbers (statistics) about the job, look at the List Duplicates report. That report shows which list your master records came from. These numbers can help confirm that your list priorities are right, or alert you to potential problems.

For example, from the List Duplicates report, you can see how many of your suppression-list records were identified as master records of dupe groups. You can see which lists are supplying the records that are master records, and which are supplying records that are subordinate dupes. You can also see how widely the matches are dispersed among your lists' records.

[Sample report: List Duplicates Report, Detail Information. Match/Consolidate, Page 1, tekpubs Firstlogic, Inc Technical Publications Sample Report. One row for each list (house, firms, no_mail, select, update), followed by a Totals row and a Totals (Including Suppression Records) row. The columns give, for each list, the total input and the counts of single-list and multiple-list dupes, uniques, masters, and the suppression-list uniques, masters, dupes, and subordinates.]

Chapter 5: Purge input files or create output files

This chapter explains how to use the results of your record matching process to produce your choice of four specialized output files or to refine your input file(s) by purging unwanted records.

Match/Consolidate results

This subsection explains how to control the results of your Match/Consolidate (MCD) job, such as producing output files, purging input files, and making use of MCD matching.

Terms

This chapter uses the following terms.

Term: Description

Input purge: Marking for deletion the records of your input files that your job has identified as subordinate dupes, that is, records which have matching records that are better.

Input posting: Adding MCD application fields or PW field data to the records of your input files.

Output file: The production of a new file that contains data as described in your output file description. This can be one of four types.

Multi-buyer file, Multi-occurrence file: A file of names that occurred more than once. Typically, this is shown by names that occur on more than one input list; they are frequent or repetitive customers.

Master duplicate, Master record, Master: The highest ranked member of a match group. This is normally considered the best record of the match group.

Subordinate duplicate, Subordinate dupe: All members of a match group except the master.

Unique record: A record that is not a member of any match group.

Single-list duplicate, Single-list dupe: A record from a match group whose members all came from the same input list.

Multiple-list duplicate, Multiple-list dupe: A record from a match group whose members came from multiple input lists.

Suppression list: A list of records MCD uses to prevent matching records of other lists from being sent to the output.

N per firm: Include a limited number of records (N per firm). You can select whether the output includes N records per individual, N records per department, or N records per firm.

Nth select: Selecting eligible records at fixed or random intervals.

This table shows the nine categories into which records can be placed.

Records to purge: Description

Unique records: Unique records are not members of any match group. This designation does not include records that are members of suppression lists, which are categorized as suppression-list uniques.

Single-list masters: Master records from match groups whose members all came from the same input list. (Does not include records from suppression lists.)

Single-list dupes: Subordinate records from match groups whose members all came from the same input list. (Does not include records from suppression lists.)

Multiple-list masters: Master records from match groups whose members came from multiple input lists. (Does not include records from suppression lists.)

Multiple-list dupes: Subordinate records from match groups whose members came from multiple input lists. (Does not include records from suppression lists.)

Suppression-list uniques: Unique records from suppression lists.

Suppression-list masters: Master records that came from a suppression list.

Suppress dupes: Records that came from a normal list or special list, but matched a record from a suppression list and had a lower priority than that suppression-list record.

Suppression-list subordinates: Subordinate records that came from a suppression list.

Producing results

This table lists the ways you can set up your desired results according to the product.

Product: MCD Job and Views

How to set up your desired results:

- Input purge (Job): In the Execution block, set the Purge or Custom Purge parameter to Y (Yes). Set the Protect Input File From Purge parameter to N (No).
- Input purge (Views): In the Execution Options window, select either Purge or Custom Purge from the Input File Options. To exempt an input file from the purge, set the Protect from Purge control at the Input File window.
- Input posting (Job): In the Execution block, set the Post to Input File parameter to Y (Yes). Set up a Post to Input File block for each input file to which you want data posted.
- Input posting (Views): In the Execution Options window, select Post to Input File from the Input File Options. Set up a Post to Input File window for each input file to which you want data posted.
- Output file (Job): In the Execution block, set the appropriate output file parameter(s) to Y (Yes). For each desired output file, set up a Create File For Output block and the appropriate Output File block.
- Output file (Views): In the Execution Options window, select the appropriate output file(s). For each, set up a Create File For Output window and the appropriate Output File window.

Product: MCD Library

Your application controls how records are handled after matching. Use the mp_duperes_*() functions to retrieve results about the matching process.
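The nine record categories described in this section can be illustrated with a short Python sketch. This is not MCD code or the MCD Library API; the Record class, its field names, and the classify_group function are hypothetical, and the sketch assumes that the record with the lowest priority number is the master and that a "suppress dupe" is any non-suppression record outranked by a suppression-list record in its match group.

```python
from dataclasses import dataclass


@dataclass
class Record:
    record_id: int
    list_name: str
    priority: int        # assumption: lower number = higher priority
    suppression: bool    # True when the record's list is a suppression list


def classify_group(records):
    """Map record_id -> category for one match group (or one unmatched record)."""
    if len(records) == 1:
        only = records[0]
        return {only.record_id: "suppression-list unique" if only.suppression else "unique"}

    ranked = sorted(records, key=lambda r: r.priority)   # best record first
    master = ranked[0]
    span = "multiple-list" if len({r.list_name for r in records}) > 1 else "single-list"

    categories = {
        master.record_id: "suppression-list master" if master.suppression else f"{span} master"
    }
    for rec in ranked[1:]:
        # Simplified reading of "suppress dupe": a suppression record outranks this one.
        outranked_by_suppression = any(
            other.suppression and other.priority < rec.priority for other in records
        )
        if rec.suppression:
            categories[rec.record_id] = "suppression-list subordinate"
        elif outranked_by_suppression:
            categories[rec.record_id] = "suppress dupe"
        else:
            categories[rec.record_id] = f"{span} dupe"
    return categories


group = [
    Record(1, "house", 2, False),
    Record(2, "no_mail", 1, True),
    Record(3, "firms", 3, False),
]
print(classify_group(group))
```

In this hypothetical group, the no_mail record has the best priority, so the house and firms records are treated as suppress dupes.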

Purge bad records or post good records

After you have completed the search for matching records, you are ready to separate the good records from the bad ones. Usually, "good records" refers to unique records and the master record from each match group, and "bad records" refers to subordinate records from match groups. To keep the good records and discard the bad ones, you can either purge bad records from the input file, or post good records to an output file.

Input purge

You can delete bad records from your input file(s). This might involve removal of the record data, or merely a non-destructive delete mark. You might elect this method if disk space is limited. Non-destructive delete marking for a purge is sometimes faster than output posting.

Output file

You can copy good records from your input file(s) to another file. The output file may be a new file, or you may append good records to an existing file.

Factors to consider

Consider the following points if you're not sure whether to purge bad records from the input file, or post good records to an output file.

- If disk space is limited, purge records from the input file. However, before doing this, be sure to create a backup of the input file.
- Each method's processing time depends on your files, your machine, and on the percentage of matches in your job. In most cases, a purge is faster.
- If you use an input filter, you should probably create an output file. Records that fail the filter are not processed at all, so they cannot be affected by an input purge.
- If you have strict criteria for what are good records, then good records may be a small percentage of the input. For example, if you're preparing a multi-buyer list, good records might be just 10 percent of the total input. In this case, consider creating an output file, because it might be more efficient for MCD to post the good 10 percent of records to an output file than to review the input files and delete 90 percent of the records.

Purge the input file

Use input purging to delete unwanted records from your input files. The normal purging process is based on the premise that the matching records of your match groups represent unwanted duplicate records, and that you want to eliminate such duplicate records from your file(s).

Conventional or custom purge

When you use conventional purging to purge an input file, MCD removes all subordinate duplicates, whether from a normal list or suppression list. What remains in your input file? Unique records and master duplicates.

With conventional purging, MCD does not consider whether a record came from a match group whose members are all from the same list or from a match group whose members are from multiple lists. Match/Consolidate treats single-list and multiple-list records the same.

Match/Consolidate also offers the custom-purge option. This lets you design your own purge by selecting, from the nine different categories of records, which should be purged from your input file(s). The following Custom Purge Input File(s) block shows those categories as you would set them for a conventional purge.

You can also incorporate a filter. Records for which the filter evaluates to true are deleted; those for which the filter evaluates to false are kept. A record may be deleted either by falling into one of the purge categories or by passing the delete filter.

Contrast the purge controls with those of the custom output file. If you're purging input files, you specify which records to delete. If you're creating an output file, you specify which records to keep.

If you elect to purge input files

If you elect to purge your input files, MCD includes three features to make the process easier. Be sure to make a backup of the input file on tape or disk.

- You can set your MCD job to predict the purge results before you actually purge your input file.
- You can protect any or all of your input file(s) from the purge process.
- You can have MCD create a backup of your input file(s).
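The purge rule described above (a record is deleted if it falls into a purge category or passes the delete filter) can be sketched in Python. This is an illustration, not MCD's implementation. The category names, the should_delete function, and the contents of CONVENTIONAL_PURGE are assumptions; in particular, treating suppress dupes as part of a conventional purge is a reading of "all subordinate duplicates" rather than something the guide states category by category.

```python
# Assumed category set for a conventional purge: every kind of subordinate record.
CONVENTIONAL_PURGE = frozenset({
    "single-list dupe",
    "multiple-list dupe",
    "suppress dupe",
    "suppression-list subordinate",
})


def should_delete(category, record, purge_categories=CONVENTIONAL_PURGE, delete_filter=None):
    """Return True if the record would be purged.

    A record is deleted when its category is selected for purging, or when
    the optional delete filter evaluates to true for it.
    """
    if category in purge_categories:
        return True
    return delete_filter is not None and bool(delete_filter(record))


# Hypothetical custom purge: also drop unique records, plus anything in state "VI".
custom = CONVENTIONAL_PURGE | {"unique"}
in_virgin_islands = lambda rec: rec.get("state") == "VI"

print(should_delete("unique", {"state": "CA"}))                           # conventional purge
print(should_delete("unique", {"state": "CA"}, purge_categories=custom))  # custom categories
print(should_delete("single-list master", {"state": "VI"}, delete_filter=in_virgin_islands))
```

Note the contrast with an output file, where the same kind of selection would name the categories to keep rather than the categories to delete.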

Predict a purge

You can choose to predict a purge before actually running the process and risking any record losses. Predicting lets you generate reports and view the results of the predicted purge to make sure they are satisfactory. The Purge by List report (detail and summary versions) shows you how many records would be deleted from the input file if you actually purge the file.

If, after studying the reports, you decide you need to adjust your settings, you can do so and predict again to check the new results. Use the prediction feature as often as you like to fine-tune your input file purge.

Non-destructive marking

When purging dBASE3 files, MCD uses non-destructive delete marking. A deleted record is not literally removed. It is simply marked, and removing it requires another operation. If you realize you have made an error, you can use your database program to remove the delete marks.

When you process ASCII files, you can mimic this feature. You will need a one-character field in your input file(s) to store the mark. Then, in your DEF file(s), define a PW.Delete field. To mark a record for deletion, MCD places an asterisk in this field.

After purging, MCD deletes all work files. If you want any reports, set them up before you run the purge. If you change your mind after a purge, you must restore the input file(s) from backup and re-run the job.
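The idea of a non-destructive delete mark in a fixed-width ASCII record can be sketched in Python. The functions below are hypothetical helpers, not part of MCD; they assume the one-character delete field sits at a known zero-based column and that a blank means "not deleted".

```python
def mark_deleted(line: str, delete_col: int, mark: str = "*") -> str:
    """Return the record with `mark` written into the one-character delete field."""
    if len(mark) != 1:
        raise ValueError("the delete mark must be exactly one character")
    if not 0 <= delete_col < len(line):
        raise ValueError("delete_col is outside the record")
    return line[:delete_col] + mark + line[delete_col + 1:]


def unmark(line: str, delete_col: int) -> str:
    """Undo the mark by writing a blank back into the delete field."""
    return mark_deleted(line, delete_col, " ")


def is_marked(line: str, delete_col: int) -> bool:
    """True if the record currently carries the asterisk delete mark."""
    return line[delete_col] == "*"


record = "MARSTON ADAMS       KINGSHILL  VI "   # last column is the delete field
col = len(record) - 1
marked = mark_deleted(record, col)
print(repr(marked), is_marked(marked, col))
print(unmark(marked, col) == record)
```

Because the record keeps its length and all other data, the mark can be reversed without restoring from backup, which is the property the dBASE3 behavior provides.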

Create an output file or post data to the input file

Four kinds of output files

You can create four different kinds of output files:

Output file: Contents of the output file

MCD output file: Unique records and master duplicates. Suppression records are not included.

All-duplicates output file: All master and subordinate duplicates from all match groups. Suppression records are included. Unique records are not included.

Multi-occurrence output file: All master duplicates which, in essence, means one record per dupe group (match group). Unique records, subordinate duplicates, and suppression records are not included. You end up with a file of names that occurred more than once, for example, frequent or repetitive customers or donors.

Custom MCD output file: You specify which types of records to keep.

Output file structure

You can create a new output file or append records to an existing file. If you create a new output file, you have three choices for the file structure.

- You can clone, or copy, the structure of an existing file.
- Clone the structure of an existing file and append new fields.
- Define all the fields yourself.

Note that MCD creates a DEF file to go with your new file, though that DEF file contains only the Database Type parameter, no PW fields. You can elect to have MCD not create that DEF file. For details about file structures and output DEF files, refer to Database Prep.

Post data back to the input file

You can use MCD to post data back to your input file. Your input file must contain a field ready to receive the data that you post. You cannot append new fields to input records; if you need to append new fields, you'll need to create an output file.

After input posting, MCD deletes all work files. If you want reports, be sure to set them up before you run the job.

If you want to perform both input posting and a purge, make sure you perform them both in the same batch run. Because the work files are deleted, you cannot post during one run and purge during another.

Data that you can post

You can post several kinds of data to your input or output file.

Input data: DB and PW fields

You can use database and PW fields to copy raw data from your input file(s) to an output file. These fields are identified by the prefix DB or PW. For example: DB.Soc_Sec_No and PW.Name_Line.

Database or PW fields need not be common to all of your input files. When posting a record that does not have the named source field, MCD simply places blanks in the output field. For example, suppose you post DB.Soc_Sec_No to the SSN field in your output file. If one of the input files does not contain a Soc_Sec_No field, records from that file will have a blank SSN field in the output file.

You can post DB or PW fields only to an output file. You cannot post DB or PW fields to an input file. For a list of the PW fields, refer to the Quick Reference.

MCD data: AP fields

You can post data that was generated during MCD processing. These fields are identified by the prefix AP, for example, AP.Group_Cnt. For a complete list of MCD AP fields, refer to the Quick Reference.

Constants

A constant is a data string that does not change from one record to the next. For example, you might post today's date to a date field. When you post a constant, enclose it in quotation marks.

Manipulate data before posting it

You can use functions to check or manipulate data before posting it to the output field. For example, you could check the name field and, if it's empty, post "Current Resident". Your function might look like this:

iif(empty(DB.Name), "Current Resident", DB.Name)

When posting to your input file, do not use DB or PW fields in filter or function expressions. However, you can use DB and PW fields when posting to output files.
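The two posting behaviors described above (blank fill when a source field is missing, and a fallback value when the name is empty) can be mirrored in Python for readers who want to check their expectations outside MCD. The function names, the dictionary record layout, and the field widths are hypothetical; the sketch also assumes that a name consisting only of spaces counts as empty.

```python
def post_field(record: dict, source_field: str, width: int) -> str:
    """Copy a source field into a fixed-width output field.

    If the record has no such field, the output field is filled with blanks.
    """
    value = str(record.get(source_field, ""))
    return value[:width].ljust(width)


def name_or_default(record: dict, default: str = "Current Resident") -> str:
    """Python analogue of: iif(empty(DB.Name), "Current Resident", DB.Name)."""
    name = str(record.get("Name", ""))
    return default if name.strip() == "" else name


with_ssn = {"Name": "GERALD KRYWICKI", "Soc_Sec_No": "123456789"}
without_ssn = {"Name": "   "}

print(repr(post_field(with_ssn, "Soc_Sec_No", 9)))
print(repr(post_field(without_ssn, "Soc_Sec_No", 9)))
print(name_or_default(with_ssn))
print(name_or_default(without_ssn))
```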

Choose the best records for your output file

In some MCD jobs, especially jobs that prepare for a mailing, you must limit the output to a certain number of records. For example, the mailing might be limited by the client's contract or by the number of pieces that the printer actually produced.

After you eliminate duplicates and suppression records, you might still have more eligible records than you need. In that case, you can pick the best records from the pool of eligible records. To select the best records, first sort all eligible records in order from best to worst. Then start at the top and take the best records for your output file.

Sort all eligible records

To select the best records, first sort eligible records so that the best records can be selected first. You decide what criteria should be used to sort the records. You can sort records in either ascending order (0-9, A-Z) or descending order (9-0, Z-A). The following table lists the available sort options.

Sort by: Description

File: Sort records in the same order they appear in the input file(s). This is the fastest option.

Random order: Sort randomly. This is useful for abbreviated jobs, like testing output.

Match group: Sort by match group. This makes it easier to relate members of the same match group.

MCD field: Sort by a field that you choose, such as an account number field or affluence rating. You define the MCD field in your DEF file(s).

Geographically: Sort by state, city, ZIP Code, street name, street range, and so on.

Priority field: Sort based on the total of list match priority plus blank-field priority. For more information about priority, see Prioritize and suppress records on page 47.

List count: Sort based on how many lists the record belongs to. Use this option to sort multi-buyer lists.

Dupe group size: Sort based on how many records are in this record's match group. Use this option to sort multi-occurrence lists.

Custom: Sort based on your own layered sortation.

MCD sorts based on key-field data

Match/Consolidate sorts records based on the data in key fields, not the data in your database fields. Therefore, key-data standardization settings can affect the sorting results. For example, if you standardize data for firm keys, the original firm data is not used for sorting; the standardized data is used.

Note that, when sorting by names, MCD uses input data rather than standardized data. Your setting at the Standardize Name Keys parameter will not affect the output sort.

Use reports for feedback

If you do not understand your sort results, generate a Sorted Records report, Duplicate Records report, or Unparsed Records report in the Key version. These reports should provide enough data to determine if adjustments should be made to your sorting setup.

Use the best records

After you sort eligible records in order from best to worst, use the best records for your output file. For example, suppose you printed 50,000 copies of a catalog. You could tell MCD to place a maximum of 50,000 records in the output file. Match/Consolidate would select those records from your sorted list, starting with the first (best) record.

As shown at right, regardless of which type of output file you are creating, the controls for selecting the best records are at that output file's Output File block.

Example 1: Most affluent

Suppose your records include an INCOME field that contains an actual income figure. You want to use this information to send your mailing to the 50,000 most affluent people (after matching). First, tell MCD which field to use for sorting. This is a two-step process:

1. First, define the Income field as PW.Merg_Purg1 in your DEF file:

   PW.Merg_Purg1 = Income

2. Then, direct MCD to sort on the PW.Merg_Purg1 field. To do this, go to your output file block (the Custom MCD Output File block is shown below) and set the Sort By option to MP1.

To sort with the MP1 option, be sure the Merg_Purg1 field is included in the match key. If you are using standard matching, that's done at the Matching Criteria block. For extended matching, it's done with the Key Length parameter of the Parsing and Key Options block.

3. Next, because you want higher incomes first, set the output sort order to Descend. Finally, to select 50,000 records, set 50,000 as the maximum number of records to output.

This method works for actual income figures. If the field contains a demographic code, you can use it if the codes are in logical sequence, for example, A-K representing lowest to highest incomes. If codes are not sequential, you will need to adjust them. You could create a sequential code using the search-and-replace features of DataRight.

Example 2: Highest priority

Suppose you are processing your house database and three rented files. Given a house record and a rented record, you prefer to select the house record. Express this preference by setting up list match priority (see Prioritize or suppress records based on list membership on page 52). Then, select output records based on priority. For example, to select the 10,000 highest-priority records, you would do the following.

1. First, direct MCD to sort on the records' priority. To do this, go to your output file block (the MCD Output File block is shown below) and set the Sort By option to LB_Prior.

2. Next, because you want higher priorities first, set the output sort order to Ascend. (Remember, a lower number indicates a higher priority.)

3. Finally, to select 10,000 records, set 10,000 as the maximum number of records to output.

When you select the sort option LB_Prior, you are sorting on a priority number. If you want, you can post that same number to your output file by using the application field AP.LB_Prior. For more details about setting priorities, refer to Prioritize and suppress records on page 47.
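Both examples follow the same pattern: sort the eligible records best-first, then keep no more than the maximum number. The Python sketch below illustrates that pattern only; select_best and the record dictionaries are hypothetical and do not represent MCD parameters. It uses a descending sort for income and an ascending sort for priority, because a lower priority number indicates a higher priority.

```python
def select_best(records, sort_key, descending, max_records):
    """Sort eligible records best-first, then keep at most max_records."""
    ranked = sorted(records, key=sort_key, reverse=descending)
    return ranked[:max_records]


eligible = [
    {"name": "A", "income": 42000, "lb_prior": 3},
    {"name": "B", "income": 98000, "lb_prior": 1},
    {"name": "C", "income": 61000, "lb_prior": 2},
]

# Example 1: most affluent first (descending income).
most_affluent = select_best(eligible, lambda r: r["income"], descending=True, max_records=2)

# Example 2: highest priority first (ascending priority number).
highest_priority = select_best(eligible, lambda r: r["lb_prior"], descending=False, max_records=2)

print([r["name"] for r in most_affluent])
print([r["name"] for r in highest_priority])
```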

Custom sort your output records

With MCD, you can also sort output records based on the contents of up to 16 fields. For example, assuming your database included these fields, you can sort first by an INCOME field, then by an AGE field, then by a DONOR field. You can sort in ascending or descending order for each sortation level. Consider the following examples:

- By INCOME in descending order.
- By AGE in descending order.
- By DONOR data in ascending order.

Be sure to define the fields in your DEF file(s). For the example described above, your DEF file(s) would need the following entries.

PW.Merg_Purg1 = Income
PW.Merg_Purg2 = Age
PW.Merg_Purg3 = Donor

Be sure all three of the sorting fields (Merg_Purg1, Merg_Purg2, and Merg_Purg3) are included in the match key. If you are using standard matching, that's done at the Matching Criteria block. For extended matching, it's done with the Key Length parameter of the Parsing and Key Options block.

Sort fields

You can sort by any field defined in your match criteria or Parsing and Key Options, or any of the following application fields. Refer to the following page for information about setting up this custom sort process.

AP.File_No
AP.Group_Cnt
AP.Group_No
AP.Group_Ord
AP.Parse
AP.LB_Prior
AP.List_Cnt
AP.List_No
AP.Record_No
AP.Super_Cnt
AP.Unique_No

To set up your custom sortation, follow these two steps.

1. Tell MCD which fields to use for sorting. Use the Custom Output Sorting block, as shown below. Be sure to use the right order for your sort levels; MCD will sort in the order of your Custom Sort Fields as you set them here.

Note that this example (from the previous page) relates the database to the Merg_Purg fields as follows.

PW.Merg_Purg1 = Income
PW.Merg_Purg2 = Age
PW.Merg_Purg3 = Donor

2. Direct MCD to sort on those fields. At your output file block (the MCD Output File block is shown below), set the Sort By option to Custom. Then, enter the Custom Sort Name (from the Custom Output Sorting block).
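A layered sort with a separate direction for each level can be sketched in Python as follows. This is an illustration of the concept, not of how MCD sorts internally; custom_sort and the lowercase field names are hypothetical. The sketch relies on Python's sort being stable, so sorting by the least significant level first and the most significant level last yields the layered order.

```python
def custom_sort(records, levels):
    """Sort by several fields, each with its own direction.

    levels: list of (field_name, descending) pairs, most significant first.
    """
    ordered = list(records)
    # Apply the least significant level first; stable sorting preserves
    # the order established by earlier passes among equal keys.
    for field, descending in reversed(levels):
        ordered.sort(key=lambda rec: rec[field], reverse=descending)
    return ordered


records = [
    {"id": 1, "income": 50, "age": 40, "donor": "B"},
    {"id": 2, "income": 90, "age": 30, "donor": "A"},
    {"id": 3, "income": 90, "age": 60, "donor": "C"},
    {"id": 4, "income": 90, "age": 60, "donor": "A"},
]

# INCOME descending, then AGE descending, then DONOR ascending.
levels = [("income", True), ("age", True), ("donor", False)]
print([rec["id"] for rec in custom_sort(records, levels)])
```

The order of the entries in levels matters in the same way as the order of the Custom Sort Fields: changing which field comes first changes the result.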

Create a multi-buyer file

In the direct-mail industry, a multi-buyer is someone whose name appears on two, three, or more lists; someone who, by their appearance on several different lists, demonstrates a pattern of consumption or affluence. To prepare a multi-buyer list, you scan a large pool of input records for names that appear more than once. In this situation, you hope for matches, because those are the names of frequent buyers.

Target multi-buyers

Suppose you are mailing catalogs of radio equipment. The printing, handling, and postage cost is $4.75 per copy, so you have to be selective. You rent several mailing lists:

List: Number of records

Ham radio operators: 458,087
Ham News subscribers: 148,879
Proceedings of the Amateur Radio Society: 252,789
SIC Code 5731: Radio, TV, and Electronics Stores: 53,976

From the total input of 913,731 records, you want to select the best prospects. Your assumption is that the more lists on which a name appears, the more active that person is in amateur radio, and the more likely they will be to order from the catalog. Therefore, you might want to include only those names which appear on at least two lists.

Create a multi-buyer output file

To create a multi-buyer output file, set up your job to search all four input lists for matches. Then, create a Custom MCD Output File, to output records for anyone who appeared on at least three of the four lists. See the next page for details about setting up this output file.

As output, select only the Output Multiple List Masters option. Employ an output filter to select only those records whose list count is 3 or more. You can get list-count data from the MCD field AP.List_Cnt. Your output filter will look like this:

val(ap.list_cnt) >= 3
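The effect of combining the Output Multiple List Masters option with a list-count filter can be sketched in Python. The data layout here is hypothetical: each match group is a list of records with the master first, and list_count stands in for the value MCD supplies in AP.List_Cnt.

```python
def list_count(group):
    """Number of distinct input lists represented in a match group."""
    return len({rec["list"] for rec in group})


def multi_buyers(groups, min_lists=3):
    """Return one record (the master, assumed to be first) per match group
    whose members came from at least min_lists different lists."""
    return [group[0] for group in groups if list_count(group) >= min_lists]


groups = [
    # On three lists: qualifies when min_lists is 3.
    [{"name": "SMITH", "list": "ham_operators"},
     {"name": "SMITH", "list": "ham_news"},
     {"name": "SMITH", "list": "proceedings"}],
    # On two lists only.
    [{"name": "JONES", "list": "ham_operators"},
     {"name": "JONES", "list": "sic_5731"}],
    # Twice on the same list: list count is 1.
    [{"name": "DOE", "list": "ham_news"},
     {"name": "DOE", "list": "ham_news"}],
]

print([rec["name"] for rec in multi_buyers(groups)])
print([rec["name"] for rec in multi_buyers(groups, min_lists=2)])
```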

Select the best multi-buyers

Suppose you printed 10,000 catalogs, so you want to select the 10,000 best prospects. It makes sense to choose those names that appeared on the largest number of lists. To select the 10,000 most frequent buyers, you would do the following:

1. Set 10,000 as the maximum number of records to output.

2. Sort records by list count, the number of lists in which the record appears. To do this, set the sort-by option to List_Cnt.

3. Sort records in descending order (highest to lowest) so that records with the highest list count will be selected first.

Use super lists to find multi-buyers

Suppose you rented several lists from two different brokers, Able Direct and Baker Marketing. In addition, you are using your house file. To consider someone a multi-buyer, you want that person's name to be found in at least two of your three sources: your house file, Able Direct, and Baker Marketing. If a name simply appears in two different Able Direct lists, you don't want to consider that person a multi-buyer.

This output can be produced in essentially the same way as in the prior example. However, instead of using AP.List_Cnt, use AP.Super_Cnt.

1. Create a super list for each source (House, Able, and Baker):

   - House: the house file
   - Able: the lists rented from Able Direct
   - Baker: the lists rented from Baker Marketing

2. Create a Custom MCD Output File, to output records for anyone who appeared on at least two of the three super lists. To do this, base your output filter on super-list count, which you can retrieve from the MCD field AP.Super_Cnt:

   val(ap.super_cnt) >= 2
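The difference between counting lists and counting super lists can be shown with a small Python sketch. The list names, the SUPER_LIST mapping, and super_count are hypothetical; super_count stands in for the value MCD supplies in AP.Super_Cnt.

```python
# Hypothetical mapping of each input list to its super list (its source).
SUPER_LIST = {
    "house": "House",
    "able_golf": "Able",
    "able_travel": "Able",
    "baker_books": "Baker",
}


def super_count(group, super_of=SUPER_LIST):
    """Number of distinct super lists represented in a match group."""
    return len({super_of[rec["list"]] for rec in group})


two_able_lists = [{"list": "able_golf"}, {"list": "able_travel"}]
house_and_able = [{"list": "house"}, {"list": "able_golf"}]

print(super_count(two_able_lists))   # two lists, but only one source
print(super_count(house_and_able))   # two lists from two sources
```

A name found only on two Able Direct lists has a list count of 2 but a super-list count of 1, so a filter on super-list count excludes it.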

Create a multi-occurrence file

Multi-occurrence vs. multi-buyer

A multi-occurrence file is similar to a multi-buyer file, because we look for a buying pattern by searching for matches. The difference is this:

- For a multi-buyer file, we count the number of input lists on which a name appeared. In effect, we count the number of sources or companies from which a person has purchased goods, services, or memberships.
- For a multi-occurrence file, we count the total number of times a name appears, without concern for the number of lists. This is appropriate if you are willing to say that appearing twice on the same list is just as good as appearing once each on two separate lists.

Create a multi-occurrence file

Suppose you rent a file of Porsche owners and a file of home owners. Mary Smith's name appears once on each list, because she owns a Porsche and a home. John Doe's name also appears twice because he bought two Porsches, but his name doesn't appear in the home-owners file because John rents.

Use the Multi-Occurrence Output File capability of MCD for this situation. In that block, specify the minimum number of times a name must occur to be included in the output file.

Select the best multi-occurrences

To select the names that occur most frequently, look for the records that have the most matching records. For example, to select the 10,000 most frequently matched names, type 10,000 as the maximum number of records to output. Sort records by the number of records in the match group (dupe group), by setting the Sort By option to Group_Cnt. Set the Sort Order to Descend so records with the highest group count will be selected first.

If you would like to post the group count to a field in your output file, post AP.Group_Cnt.
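The Mary Smith and John Doe example can be worked through in a few lines of Python to show how the two counts differ. The record layout and function names are hypothetical; group_count and list_count stand in for the values MCD supplies in AP.Group_Cnt and AP.List_Cnt.

```python
def group_count(group):
    """Total number of records in the match group (multi-occurrence measure)."""
    return len(group)


def list_count(group):
    """Number of distinct input lists in the match group (multi-buyer measure)."""
    return len({rec["list"] for rec in group})


mary_smith = [{"list": "porsche_owners"}, {"list": "home_owners"}]
john_doe = [{"list": "porsche_owners"}, {"list": "porsche_owners"}]

for name, group in (("Mary Smith", mary_smith), ("John Doe", john_doe)):
    print(name, "group count:", group_count(group), "list count:", list_count(group))


def multi_occurrences(groups, min_occurrences=2):
    """Keep one record (the first, assumed to be the master) per qualifying group."""
    return [g[0] for g in groups if group_count(g) >= min_occurrences]
```

Both names have a group count of 2, so both qualify for a multi-occurrence file with a minimum of two occurrences, while only Mary Smith would qualify as a multi-buyer on two lists.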

Select a sample of records

Would you like to post a limited number of records to your output file? Many users do, for a number of reasons, for example, to set up test mailings or to split output into multiple mail segments.

One way to limit output is to set a Maximum records to output number at your output file block. In addition, MCD lets you output records at intervals: every third record, every fifth record, every tenth record, and so on. This approach to MCD output is called Nth Select. All Nth Select settings are done at your selected output file block.

MCD implements Nth Select downstream of sorting and of any filter. Records that don't pass a filter are not included. Of course, MCD also respects your Maximum records to output setting.

In addition to the advantages for your actual job output, Nth Select helps you test output, because your record sample can span a wider range of input files. However, plan for extra processing time if you use the Auto or Random type of Nth Select with an output filter. In that circumstance, MCD must count the records that pass the output filter before it outputs the records.

Three types of Nth select

You can select from three types of Nth select:

Type: Remarks

User: MCD selects every Nth record. You set the value of N.

Auto: You need not set a value for N; MCD calculates that increment based on the Maximum records to output setting.

Random: MCD selects at random which records to output, up to the Maximum records to output setting.

Here's a simplified example of all three types. Suppose there are 100 records available for output. The output filter allows the following records to pass: 1, 2, 23, 29, 44, 78, 80, 82, 90, 97. The figure below shows how Nth selection would act. In all three cases, Maximum records to output is set to 4.

[Figure: Nth selection for the User (N = 2), Auto, and Random types]

Record #90 is the next selection, but it won't be used, because four records have already been selected. With Random output, any four of the records could be chosen.
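The three Nth select types can be approximated in Python using the ten records from the example. These functions are hypothetical and only illustrate the idea. Two details are assumptions rather than documented MCD behavior: selection starts with the first passing record (which is consistent with the remark that record #90 would be the next selection), and the Auto increment is computed as the number of passing records divided by the maximum, rounded down.

```python
import random


def nth_select_user(records, n, max_records):
    """User type: take every Nth record, starting with the first, up to the maximum."""
    return records[::n][:max_records]


def nth_select_auto(records, max_records):
    """Auto type: derive the increment from the maximum (assumed rule: floor division)."""
    if max_records <= 0 or not records:
        return []
    step = max(1, len(records) // max_records)
    return records[::step][:max_records]


def nth_select_random(records, max_records, seed=None):
    """Random type: choose up to max_records at random, kept in their original order."""
    rng = random.Random(seed)
    count = min(max_records, len(records))
    chosen = rng.sample(range(len(records)), count)
    return [records[i] for i in sorted(chosen)]


# Records that passed the output filter, already in sorted output order.
passed = [1, 2, 23, 29, 44, 78, 80, 82, 90, 97]

print(nth_select_user(passed, n=2, max_records=4))
print(nth_select_auto(passed, max_records=4))
print(nth_select_random(passed, max_records=4, seed=1))
```

Because the selection operates on the list of records that passed the filter, the number of passing records must be known before an Auto increment can be computed, which matches the extra processing time mentioned above.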

Reports about your purging or output process

Input file purge report

How do you know the results of your input purge? Run your job, then check the Purge by List report.

The Purge by List report shows the numbers of records that were deleted from the job's input file(s), marked for deletion, or predicted for deletion, depending on your job setup. These numbers provide a clear picture of the results of an input file purge. This report is especially useful if you've included lists, because it shows results on a list-by-list basis. For information about lists, see Define your input files and lists on page 27.

What should you expect to see? That depends on the type of purge you set up. Because a Custom Purge enables you to select any (or all) record categories, you'll have to check your Custom Purge setup to see which categories you were trying to purge and which you wanted to keep.

[Sample report: Purge By List Report, Detail Information (PREDICTION). Columns: List Name, Net Input, Filter Drops, Suppress Dupes, Single List Dupes, Multiple List Dupes, Uniques, Single List Masters, Multiple List Masters, Suppress List Uniques, Suppress List Masters, Suppress List Subord. Rows: house, firms, no_mail, select, update, Totals.]

[Sample report: Purge By List Report, Summary Information (PREDICTION).]

The Output File report

Your basic check of MCD output is to look at the output itself: the content of the output file. In addition, MCD can produce the Output File report, which shows the numbers of output records that fit into the various matching and ranking categories and that are therefore included in or excluded from your output.

Study your Output File report to see what MCD has done. If your results show any trends that could be improved by adjustments to your settings, change those settings and re-run the job. For example, if your filter drops are higher than you think is right, check your filter setup.

Keep in mind that you can sort the data in different ways. For example, you can display the rows and pages of the report by State, by ZIP Code or other key field, by list, or by super list. For details, see Output File Reports (.ofr).

The different output files include different types of records. Your Output File report should show the following, based on the type of output file:

MCD Output File
  Contents: All unique records and master dupes are copied to the output file (from Normal and Special input lists).
  Categories of records included: Unique records, Single-List Masters, Multiple-List Masters.

All-Duplicates Output File
  Contents: All master and subordinate dupes, representing names that appear more than once (for example, frequent or repetitive customers or donors), are copied to the output file. Unique records are omitted.
  Categories of records included: Single-List Masters, Single-List Dupes, Multiple-List Masters, Multiple-List Dupes, Suppress List Masters, Suppress List Subordinates, Suppress Dupes.

Multi-Occurrence Output File
  Contents: All master dupes of all dupe groups are copied to the output file. Unique records are omitted.
  Categories of records included: Single-List Masters, Multiple-List Masters.

Custom Output File
  Contents: Any or all of the following types of records may be included. Design the contents to suit your purpose: unique records, single list masters, single list dupes, multiple list masters, multiple list dupes, suppress list uniques, suppress list masters, suppress list subordinates, suppress list dupes.
  Categories of records included: You specify the types of records to include.

The following figure shows the top of a Detail version of the Output File report, sorted by State and list, for an MCD Output File.
For this output file, you'd expect to see numbers in the following categories: Unique records, Single-List Masters, and Multiple-List Masters.

[Sample report: Output File Report, Detail Information: C:\pw\mpg\Work\output\Out_MPG.txt, page 1, Output Results for California. Columns: List Name, Net Input, Filter Drops, Suppress Dupes, Single List Dupes, Multiple List Dupes, Uniques, Single List Masters, Multiple List Masters, Suppress List Uniques, Suppress List Masters, Suppress List Subord. Rows: house, firms, Totals.]

[Sample report: the same report, page 2, Output Results for Colorado, with the same columns and rows.]

Chapter 6: Reports and statistics files

This chapter provides a sample of each available Match/Consolidate (MCD) report, arranged alphabetically: first reports, then statistics files. These examples show the content and format of the reports.

Introduction to reports and report files

How do you know what's happened in your MCD job? What records have been input? Have any filters done what you expected? How many matches were found, and what records were found to match? If you batch process files with MCD Job or MCD Views, consider using their report capabilities.

As it processes your job, the program gathers the data needed for your reports. Then, after processing is complete, it formats that data into your choice of reports. Different types of data are collected during different job processes, so some reports may not be available if you haven't included their associated process in your job setup. The following table shows the report data that's generated during each phase of your MCD job.

  Processing step                           Reports generated
  Read records and create match sets        Input File Summary Report
                                            Input List Summary Report
                                            List Quality Report
                                            Unparsed Records Report
                                            Sorted Records Report
  Find duplicates                           Duplicate Records Report
                                            List-by-List Match Report
                                            Multi-List Report
                                            List Duplicates Report
                                            Match Results Report (if extended matching)
  Create output file(s)                     Output File Report and, if you perform Group Posting, Posted Dupe Groups Report
  Purge or post to input file(s)            Purge by List Report and, if you perform Group Posting, Posted Dupe Groups Report
  Updated at each step throughout the job   Job Summary Report
                                            Executive Summary Report

You can send MCD reports directly to a printer. However, many users prefer to save reports in files on disk so that they can preview reports before committing them to paper.

One file or many

You can direct the program to write each report to a separate file or send all the reports to one file. Many users write each report to a separate file. This approach gives you more files to handle, but it's easier to find a particular report. Also, the files are smaller and you have more control over printing them.

Some users prefer to combine all the reports into one file. This file can be quite large, but you can insert banner pages to help you organize it.

File names based on job name

To save time and to keep file names manageable, name your report files with $job. Consider the following examples:

  Job-file name   Report type         Report file name in the job setup   Report file produced
  my_job.mpg      Executive Summary   $job.exs                            my_job.exs
                  Job Summary         $job.mjs                            my_job.mjs
                  Duplicate Records   $job.dup                            my_job.dup
                  Unparsed Records    $job.unp                            my_job.unp

Report defaults

Another time-saver in setting up your reports is the use of defaults for many of your report settings. Nearly all the report settings can be controlled with defaults, including destinations, number of copies, and page format.

Three versions of record listings

The record listings are the Unparsed Records, Duplicate Records, and Sorted Records reports. These reports provide information on how well your job setup has performed, on a record-by-record basis.

When you create any of these record listings, you can choose the type of data that you want from each record. You can elect to create the report with PW field data or with key data, or you can design your own custom version of the report.

PW
  You can choose to use the PW fields on the record listing. This format, which is the default, displays the raw data that was input to MCD with PW fields in the DEF file(s).

Key
  Before searching for duplicates, MCD converts the raw input data into keys. This includes parsing the address, standardizing names, and so on. MCD uses that processed key data in its search for matching duplicate records. To report key data, choose the key format for your listings.

Custom
  This option gives you flexibility to choose which fields will be printed, their sequence, and the title over each column of data. The Custom option does require more setup, because you must identify each field that you want included in the report. As a source for your data, you can use application (AP), database (DB), or PW fields. You must also set the length of each column on the report. Most often this will equal the length of the source field, but you may make the column wider or narrower. MCD will insert one blank space between columns. You can also place a title over each column.

Statistics files

You can choose to generate up to seven statistics files to store data associated with your job. The statistics files can be brought into a database, spreadsheet, or word-processing program, so you can create your own reports. You can provide your business or clients with reports, spreadsheets, and even graphs based on an MCD job.

The statistics files give you reporting flexibility, so you can present job information in the format that will best suit the needs of your business or clients.

Create statistics files

The data generated and stored in the statistics files depends on your processing steps. For example, if you aren't performing a purge, there is no need to set up the Purge statistics file, because purge information will not be generated.

Individual statistics files may contain one or many records, depending on the number of files or lists used in the job. As a result, the length of each statistics file can vary according to the number of records it contains.

Each of the statistics files holds data available in a variety of MCD reports. In some cases, the information in the file corresponds exactly with a specific report (for example, all of the data in the Output Statistics file can be found in the Output File report). In other cases, the information is borrowed from more than one report.

Due primarily to field width limitations, any filters used are not shown in the statistics files.

Valid file types

The following are valid file types for statistics files:

- ASCII
- dbase3
- delimited
- EBCDIC

If the statistics file is dbase3 and there are more than 128 fields in the file, you'll receive a verification error. If this occurs, switch to a different file type.

If your file type is other than dbase3, MCD will create a format file for each statistics file that you are generating. The format file (FMT for ASCII files, DMT for delimited files, EBC for EBCDIC files) will contain the field names, lengths, and data types as shown in this chapter. For ASCII files, the new line character (EOR or End-of-Record) will also be included.

Name statistics files

Consider using the following names for statistics files.

  Job Statistics File                $jobj.sfj
  List Statistics File               $jobl.sfl
  Input Statistics File              $jobi.sfi
  Output Statistics File             $jobo.sfo
  Purge Statistics File              $jobp.sfp
  List Match Statistics File         $jobm.sfm
  Super List Match Statistics File   $jobs.sfs

These names are default entries in the master.mpg file and in the Statistics Files window of the MCD Views program. Note that the example base file names end with the same last character as the file extension.

If your file type is other than dbase3, we recommend that you use the seven statistics file names shown above to prevent the format files that MCD creates (FMT, DMT, or EBC) from automatically overwriting each other.

For example, if you are creating ASCII statistics files, and if you name the Job Statistics File promo.sfj and the List Statistics File promo.sfl (using the macro $job), MCD names both of the FMT files it creates promo.fmt. As MCD creates each FMT file, it overwrites the previous one.

Unsuccessful FMT file creation

  Job file name   Statistics file name   FMT file name
  promo.mpg       promo.sfj              promo.fmt
                  promo.sfl              promo.fmt

Successful FMT file creation

  Job file name   Statistics file name   FMT file name
  promo.mpg       promoj.sfj             promoj.fmt
                  promol.sfl             promol.fmt
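The overwrite problem can be checked ahead of time with a small script. The sketch below is a hedged illustration, not part of MCD: the helper names are hypothetical, and it assumes only what the tables above show, namely that $job expands to the job file's base name and that the format file takes the statistics file's base name plus the FMT, DMT, or EBC extension.

```python
from pathlib import PurePath

# Format-file extension per statistics file type (dbase3 has no format file).
FORMAT_EXT = {"ascii": ".fmt", "delimited": ".dmt", "ebcdic": ".ebc"}

def expand(template, job_file):
    """Replace the $job macro with the job file's base name (assumed behavior)."""
    return template.replace("$job", PurePath(job_file).stem)

def format_file_for(stat_file, file_type):
    """Name of the format file that would accompany one statistics file."""
    return PurePath(stat_file).stem + FORMAT_EXT[file_type]

def find_collisions(stat_files, file_type):
    """Map each format-file name to the statistics files that would share it."""
    by_format = {}
    for name in stat_files:
        by_format.setdefault(format_file_for(name, file_type), []).append(name)
    return {fmt: names for fmt, names in by_format.items() if len(names) > 1}

risky = [expand(t, "promo.mpg") for t in ("$job.sfj", "$job.sfl")]
safe = [expand(t, "promo.mpg") for t in ("$jobj.sfj", "$jobl.sfl")]
risky_collisions = find_collisions(risky, "ascii")
safe_collisions = find_collisions(safe, "ascii")
```

For the risky names the sketch reports that promo.sfj and promo.sfl would both produce promo.fmt; the recommended names produce no collisions.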

How statistics files relate to Match/Consolidate reports

The following tables show how the data collected in MCD statistics files relates to data shown on MCD reports. For details about the data, see the descriptions of each MCD report and statistics file in this chapter.

MCD input questions
(MCD report: Input File Summary Report. Statistics file: Input Statistics File.)

  Question                                                    Report column title   Statistics file field name
  From this input file, how many records were/will be input?  Gross Input           gross_in
  From this input file, how many records were not input, because:
    they were marked for deletion?                            Delete Drops          del_drops
    they did not pass an input filter?                        Filter Drops          filt_drops
    their list was not determined?                            List Drops            list_drops
    they were outside the range?                              Sample Drops          samp_drops
  From this input file, how many records were/will be used?   Net Input             net_in

Input list records questions
(Statistics file: List Statistics File.)

  Question                                                    Report column title   Statistics file field name
  Of the input records, how many matched this list's list_id? (Input List Summary Report)
                                                              Matched Id Records    num_mtchid
  Of the input records, how many were assigned to this list by default action? (Input List Summary Report)
                                                              Default Records       num_defaul
  Of this list's records, from how many could MCD not parse: (List Quality Report)
    any data?                                                 No Parse Count        num_nopars
    address data?                                             No Address Count      num_noaddr
    firm data?                                                No Firm Count         num_nofirm
    title data?                                               No Title Count        num_notitl
    last name data?                                           No Last Name Count    num_nolnam
    first name data?                                          No First Name Count   num_nofnam
  Of this list's input records, how many were: (List Duplicates Reports)
    suppress dupes?                                           Suppress Dupes        suppr_dups
    single-list dupes?                                        Single List Dupes     singl_dups
    multiple list dupes?                                      Multiple List Dupes   milti_dups
    dupes of all types?                                       Total Dupes           tot_dups
    unique records?                                           Uniques               num_uniq
    single-list masters?                                      Single List Masters   singl_mas
    multiple-list masters?                                    Multiple List Masters multi_mas
    suppress-list uniques?                                    Suppress List Uniques supprl_uni
    suppress-list masters?                                    Suppress List Masters supprl_mas
    suppress-list subordinates?                               Suppress List Subord  supprl_sub
    uniques and masters of all types?                         Total Non Dupes       num_nondup
  Of the input records, how many were assigned to this list? (List Quality Report, Input List Summary Report, List Duplicates Reports)
                                                              Net Input             net_in

Matching questions
(MCD report: List Match Reports. Statistics files: List Match Statistics File and Super List Match Statistics File.)

  Question                                                    Report column title   Statistics file field name
  How many records of this list were input to this job?       Net Input             net_in (List Match Statistics File)
  How many records of this list matched other records of this list?
                                                              Matches (list1)       list1 (List Match Statistics File)
                                                                                    super1 (Super List Match Statistics File)
  How many records of this list matched records of list2?     Matches (list2)       list2 (List Match Statistics File)
                                                                                    super2 (Super List Match Statistics File)
  How many records of this list matched other records of listn?
                                                              Matches (listn)       listn (List Match Statistics File)
                                                                                    supern (Super List Match Statistics File)

Questions about MCD results: output or input file purging
(MCD reports: Output File Reports and Purge by List Reports. Statistics files: Output Statistics File and Purge Statistics File.)

  Question                                                    Report column title   Statistics file field name
  How many records of this list were input to this job?       Net Input             net_in
  Of this list's output or purged records, how many were categorized as the following:
    suppress dupes?                                           Suppress Dupes        suppr_dups
    single-list dupes?                                        Single List Dupes     singl_dups
    multiple list dupes?                                      Multiple List Dupes   milti_dups
    unique records?                                           Uniques               num_uniq
    single-list masters?                                      Single List Masters   singl_mas
    multiple-list masters?                                    Multiple List Masters multi_mas
    suppress-list uniques?                                    Suppress List Uniques supprl_uni
    suppress-list masters?                                    Suppress List Masters supprl_mas
    suppress-list subordinates?                               Suppress List Subord  supprl_sub
    filter drops?                                             Filter Drops          filt_drops
  Of this list, how many records were output or purged?
    Output File Reports                                       Net Output            net_out (Output Statistics File)
    Purge by List Reports                                     Total Deletes         total_del (Purge Statistics File)

Overall questions about your job

  MCD report                                                  Statistics file
  Executive Summary, Job Summary                              Job Statistics File
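When you bring a statistics file into your own reporting tool, a lookup table built from the mappings above can restore the report column titles. The Python sketch below is a minimal illustration under stated assumptions: the field names and titles are copied from the Output/Purge table, while the relabel helper and the sample record values are hypothetical, and reading the statistics file itself is left out.

```python
# Field name -> report column title, copied from the Output/Purge table above.
REPORT_COLUMN = {
    "net_in": "Net Input",
    "suppr_dups": "Suppress Dupes",
    "singl_dups": "Single List Dupes",
    "milti_dups": "Multiple List Dupes",   # field name spelled as documented
    "num_uniq": "Uniques",
    "singl_mas": "Single List Masters",
    "multi_mas": "Multiple List Masters",
    "supprl_uni": "Suppress List Uniques",
    "supprl_mas": "Suppress List Masters",
    "supprl_sub": "Suppress List Subord",
    "filt_drops": "Filter Drops",
    "net_out": "Net Output",
    "total_del": "Total Deletes",
}

def relabel(record):
    """Return a copy of one statistics record keyed by report column titles.

    Fields without a known title keep their original name.
    """
    return {REPORT_COLUMN.get(field, field): value for field, value in record.items()}

# Hypothetical record, as it might look after loading an Output Statistics File.
sample = {"net_in": 500, "num_uniq": 420, "singl_mas": 30, "net_out": 450}
labeled = relabel(sample)
```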

How Match/Consolidate counts intra-list and inter-list matches

For most multiple-list jobs, MCD users want to know about record matches between and within the lists. To facilitate that, MCD can track both types of matches:

- Intra-list matches are matches between records of the same input list.
- Inter-list matches are matches between records of different lists.

The table below is a simplified job of three lists. Consider that each first name represents a record, and that the matching first names represent matching records.

  List 1   List 2   List 3
  John     John     John
  John     John     Mary
  John     Mary
  John     Mary
  Sam      Mary
  Sam
  Sam

Intra-list matches

When MCD counts intra-list matches, it looks at only one list at a time to find the number of records in that list that matched another record within the same list. The calculation is as follows: the number of matching records minus the number of dupe groups.

For every group of matching records within the list, count the number of records within that group that matched the first record of the group. In List 1 above, there are seven matching records. Among those matching records, three John records matched the first John record that MCD found, and two Sam records matched the first Sam record that MCD found. This calculates to five intra-list matches, or the number of matching records (seven) minus the number of dupe groups (two).

The following list shows the results of the job represented by the above table.

- List 1 contains seven matching records (the four John records plus the three Sam records). Subtract two dupe groups (the John group and the Sam group), for a result of five intra-list matches.
- List 2 contains five matching records (two John records plus three Mary records). There are two groups (the John group and the Mary group), which results in three intra-list matches.
- List 3 has no records that match any others in the list, so there are no intra-list matches. (Remember, for intra-list matches, we just look inside the one list; we do not look at any records from any other lists.)

Inter-list matches

When MCD counts inter-list matches, it looks for dupe groups with records in more than one list. First, the program drops those records that were already counted as intra-list dupes. Then, for each list, it counts the number of times that a record in that list matched a record in other lists. The tables on the following page show the results for this job.

[Figure: the three-list table repeated once for each list, marking that list's inter-list matches.]

List 1 has three inter-list matches.
List 2 has three inter-list matches.
List 3 has two inter-list matches.

Observations

Counts of matches are not counts of records. As you can see from even this small example, 14 records produced 16 matches (8 intra-list and 8 inter-list).

This match information shows relative duplication within and among lists. Do not use this data to predict the results of a purge or output operation, because the data does not in any way consider which records are masters and which are dupes.
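The intra-list calculation (matching records minus dupe groups) can be reproduced with a short Python sketch for the three-list example. As in the example itself, identical first names stand in for matching records; the function name is hypothetical, and inter-list counting is not covered here.

```python
from collections import Counter

def intra_list_matches(names):
    """Matching records in one list minus the number of dupe groups in it."""
    counts = Counter(names)
    # A dupe group is any name that occurs more than once within this list.
    groups = {name: n for name, n in counts.items() if n > 1}
    matching_records = sum(groups.values())
    return matching_records - len(groups)

lists = {
    "List 1": ["John"] * 4 + ["Sam"] * 3,
    "List 2": ["John"] * 2 + ["Mary"] * 3,
    "List 3": ["John", "Mary"],
}
intra = {name: intra_list_matches(records) for name, records in lists.items()}
total_records = sum(len(records) for records in lists.values())
total_intra = sum(intra.values())
```

The sketch gives five, three, and zero intra-list matches for the three lists, which adds up to the eight intra-list matches quoted for the 14 records.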

Use super lists for report data

A super list is a way to prepare a second set of reports on matching, combining the statistics for two or more regular lists. We see two situations in which you might set up super lists:

- Suppose you have four mailing lists stored in a single database, with a database field identifying the list to which each record belongs. In this situation, you may want to have two sets of reports: one containing separate statistics for each list and a second set giving statistics for the file as a whole.

  [Figure: House file, with Reports for List 1, Reports for List 2, Reports for List 3, and Reports for List 4, plus a Super list (reports for the entire file, as a whole).]

- Suppose, in addition, that you rent multiple lists from two different list brokers (or other sources). You'll want to see match statistics for each individual list, of course. But you might also want a summary for each broker. That's a total of nine input lists, plus three super lists: one super list for your house file (above), one for the two Able Direct files, and one for the three Baker Marketing files.

  [Figure: lists rented from Able Direct and lists rented from Baker Marketing, each group with its own super list.]

When you use super lists, MCD will automatically append a second report to your matching reports (List Match, List-by-List Match, and Multi-List).

Keep these details in mind when using super lists:

- A super list might be related to an input file, or a list vendor, or any other system of binding lists together.
- Super lists affect only the way that match statistics are reported. They do not affect matching or priorities at all.

Print reports

Before printing reports, you'll set up several options for their appearance. These include page dimensions, margins, and header lines. You can set these options once and make them apply to all reports through the Report Defaults. Then, if you want a particular report to look a little different, you can override your default settings when you set up that report.

Printer control

Match/Consolidate Job does not use any printer-driver software. Reports are formatted as ASCII text, with line-break commands appropriate for your operating system and form feeds between the pages of the reports. For proper alignment of report data, set your printer to use a non-proportional font such as Courier.

Printable area

Because of margins, report text cannot occupy the entire sheet of paper. Remember to subtract your margins from the height and width of your paper to determine the printable area.

For example, if you are printing at 12 characters per inch and want 0.5-inch margins, the printable area of a normal sheet is just 7.5 inches wide (90 characters), not 8.5 inches wide (102 characters). Note that all of your MCD report settings are specified in characters (CPI) rather than in inches.

[Figure: an 8.5-by-11-inch sheet; half-inch margins reduce the printable area to 7.5 by 10 inches.]

Some reports require a wide printable area, including the List Quality, List Match, Multi-List, List Duplicates, Output File, and Purge By List reports. In fact, the Duplicate Records, Sorted Records, and Unparsed Records reports are formatted at 240 characters wide. For the wide reports, you might have to set up your printer to use a condensed font or landscape orientation. We recommend using a wide-carriage printer and 11-by-14-inch paper.
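The printable-width arithmetic from the example can be written as a one-line calculation. The Python sketch below is illustrative only; the function name is hypothetical, and it assumes the same margin on both sides of the sheet.

```python
def printable_chars(paper_inches, margin_inches, chars_per_inch):
    """Characters that fit on one line after subtracting both side margins."""
    usable_inches = paper_inches - 2 * margin_inches
    return int(round(usable_inches * chars_per_inch))

# 8.5-inch-wide sheet at 12 characters per inch.
with_margins = printable_chars(8.5, 0.5, 12)
without_margins = printable_chars(8.5, 0.0, 12)
# Width in inches needed at 12 CPI for a 240-character record listing.
wide_report_inches = 240 / 12
```

At 12 characters per inch, a 240-character record listing needs 20 inches of printable width, which is why a condensed font or landscape orientation is suggested for the wide reports.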

Duplicate Records Report (.dup)

This report lists each record of each dupe group (that is, groups of matching records), separated by blank lines between the dupe groups. This listing can help you decide if your match criteria are too loose. If you see records in a dupe group which, based on what you see here, are not really duplicate records, then tighten up your criteria to eliminate those matches.

The report data is generated during MCD's duplicate detection (matching) process.

Dupe groups are listed in the order in which they were found. For each group, the master record is shown first, followed by the subordinate dupes, in the order of their priority within their dupe group. If a record does not have data for a field/column, that space will be blank on the report.

The driver record for each dupe group is coded with an asterisk in the first column. This code data is not available for the Custom format report.

Options

You can limit the size of this report by setting a maximum number of records to print and by setting a starting record number.

You can choose from three versions of this report, based on the field types that you want to print.

- You can elect to show the records' PW data. The report will show a column for each PW field of your job. The example on the following page is the PW version of this report.
- You can elect to show the records' key data. The report will show a column for each key field that you have set up in your job. The key version shows the exact data that was used for comparing the records.
- You can design your own Custom format. With the Custom format report, you can select (from PW, database, and MCD application fields) the field data for each column of your report. You pick the fields, the order, and the heading for the column.

If your job includes lists, then the PW and key data versions show list data in the first two columns of the report (see codes at the bottom of the report). We show the names of the lists as defined in the job setup at the bottom of the report. This saves some valuable space; look for the corresponding list number in the second column of the report.

Use Simscore (see Simscore on page 199) to compare the driver record data to the data for a record that shouldn't have been in the dupe group. The Simscore similarity score will help you determine how to change your match levels to prevent a false dupe from appearing in a group.

[Sample report: Duplicate Records Report. Column headings: Code, List, File, Record, LIST_ID, NAME_LINE, ADDRESS, CITY, ST, ZIP, FIRM. The List, File, Record, and ZIP values are not reproduced here.]

  Code  LIST_ID  NAME_LINE          ADDRESS                   CITY         ST  FIRM

  M     house    H. V. JACOBSEN     P.O. BOX C                SANTA ANA    CA
  M     house    HAROLD JACOBSEN    P O BOX C                 SANTA ANA    CA
  *M    firms    H. V. JACOBSEN     P.O. BOX C                SANTA ANA    CA  PANEL CONCEPTS
  M     firms    H V JACOBSEN       P O BOX C                 SANTA ANA    CA  PANEL CONCEPTS

  M     house    GERALD KRYWICKI    PO BOX NO 2978            SPRINGFIELD  MA
  *M    firms    GERALD KRYWICKI    PO BOX NO 2978            SPRINGFIELD  MA  HEATBATH CORP.

  M     house    ANGEL J RODRIGUEZ  URB TERRAZAS DE GUAYNABO  GUAYNABO     PR
  *M    firms    ANGEL J RODRIGUEZ  URB TERRAZAS DE GUAYNABO  GUAYNABO     PR

  M     house    MR S L SMEAD       124 SWITZER AVE           SPRINGFIELD  MA
  M     house    S. L. SMEAD        124 SWITZER AVE           SPRINGFIELD  MA
  *M    firms    MR S L SMEAD       124 SWITZER AVE           SPRINGFIELD  MA  MOTORACE
  M     firms    S. L. SMEAD        124 SWITZER AVE           SPRINGFIELD  MA  MOTORACE

[This part of the report is deleted from the picture so we can show you the bottom of the report.]

  Code Definitions
  M = Multi List   S = Single List   P = Purge Group   * = Driver

  List  Listname
  1     house
  2     firms
  3     no_mail
  4     select
  5     update

  File  Filename
  1     C:\pw\mpg\Work\house.txt
  2     C:\pw\mpg\Work\mail_sup.txt
  3     C:\pw\mpg\Work\house_fm.txt
  4     C:\pw\mpg\Work\update_1.txt
  5     C:\pw\mpg\Work\rent_mag.txt

Executive Summary Report (.exs)

The Executive Summary is a concise listing of the most vital results of an MCD job. The report summarizes facts that appear in more detail in other reports, such as List Quality, List Duplicates, and Purge by List. Although you may use the other reports as well, you will likely find that the Executive Summary is more suitable for presentation to clients and for your own records.

Data for the Executive Summary is generated through all the phases of your MCD job, from input through output.

The format of the report is the same for all jobs. The Input File Summary Report (.ifs) on page 96 illustrates the six parts of the report. The Duplicate Detection and Non-Duplicate Records sections correspond with data columns on the List Duplicates Report (detail version).

For ADV matching users: in the following example report, the Duplicate Detection and Non Duplicate Records sections report on a match-level basis only. The other sections report on an entire match-set basis.

[Sample report: Executive Summary Report. Most counts and all percentages are not reproduced here.]

  Input (with Pct of Gross column)
    Number of Input Files: 5
    Number of Reference Files: 0
    Number of Input Lists: 5
    Number of Suppression Lists: 1
    Number of Suppression Records: 100
    Gross Input Records: 2375
    Records Dropped (Filtered, etc):
    Net Input Records:

  List Quality (with Pct of Net column)
    No Name Data Parsed:
    No Firm Data Parsed:
    No Address Data Parsed:
    Invalid Addresses:
    No Last Line Data Parsed:
    Invalid Last Lines:
    Foreign Last Lines:
    Total Unparsed Records:

  Duplicate Detection
    Suppressed Duplicates:
    Single List Duplicates:
    Multiple List Duplicates:
    Suppress List Subordinates:
    Total Duplicates:

  Non Duplicate Records
    Unique Records:
    Single List Masters:
    Multiple List Masters:
    Suppress List Uniques:
    Suppress List Masters:
    Total Non Dupes:

  Input Posting/Purging
    Total Input Records Posted: 0
    Total Input Records Purged: 0

  Output
    Number of Output Files: 1
    Total Records Output: 770
    Group Posted Records: 0

Input File Summary Report (.ifs)

The Input File Summary shows input records per input file. Each line of the report represents an input file. The entries show the gross number of records, the number dropped for various reasons, and the resulting net number of records that were in fact read as input.

This report is valuable for verifying that your input records have actually been read and will be processed. You can also use this report to quickly gauge the effect of any input filters.

Data for this report is generated during the input phase of your MCD job, when you have included the Read Records and Create Match Sets execution option.

The format of this report is always the same. The number of lines, of course, depends upon the number of input files that you include in your MCD job. The example below shows that five input files were included in this job.

- Gross Input is the number of records physically present in the file.
- Delete Drops is the number of records excluded because they had been previously marked for deletion.
- Filter Drops is the number of records excluded because they did not pass an input filter.
- List Drops is the number of records excluded because they could not be assigned to any of the input lists. This can only happen when the Undetermined List Action control is set to Ignore.
- Sample Drops is the number of records excluded because they fell outside the range of record numbers that you set for that input file (see the Starting Record Number and Maximum # of Records to Input controls).
- Net Input is the actual number of records that will be processed.

[Sample report: Input File Summary Report. Columns: Input File, Gross Input, Delete Drops, Filter Drops, List Drops, Sample Drops, Net Input. Rows: house.txt, mail_sup.txt, house_fm.txt, update_1.txt, rent_mag.txt, Totals.]
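The relationship among the columns can be expressed as a small consistency check. This Python sketch assumes that Net Input equals Gross Input minus the four drop counts, which is how the column descriptions above read; the class is hypothetical, its attribute names are borrowed from the Input Statistics File fields, and the record counts are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class InputFileCounts:
    gross_in: int        # Gross Input
    del_drops: int = 0   # Delete Drops
    filt_drops: int = 0  # Filter Drops
    list_drops: int = 0  # List Drops
    samp_drops: int = 0  # Sample Drops

    def net_in(self):
        """Net Input, assumed to be gross input minus every kind of drop."""
        drops = self.del_drops + self.filt_drops + self.list_drops + self.samp_drops
        return self.gross_in - drops

# Hypothetical counts for two input files.
files = {
    "house.txt": InputFileCounts(gross_in=1000, del_drops=10, filt_drops=40),
    "rent_mag.txt": InputFileCounts(gross_in=500, list_drops=5, samp_drops=95),
}
net_by_file = {name: counts.net_in() for name, counts in files.items()}
total_net = sum(net_by_file.values())
```

Comparing a figure computed this way with the Net Input column is a quick way to confirm that you have accounted for every drop category.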

Input List Summary Report (.ils)

The Input List Summary shows input records per input list. Each line of the report represents an input list. The entries show the total number of records assigned to each list, and subdivide that total into two parts to identify those assigned by default.

This report is valuable for verifying that your input records' list membership has been identified, and that the lists will be reflected in your job process.

Data for this report is generated during the input phase of your MCD job, when you have included the Read Records and Create Match Sets execution option.

The format of this report is always the same. The number of lines, of course, depends upon the number of input lists that you define in your MCD job. The example below shows that five input lists were included in this job. A totals row follows the lists.

- Matched ID Records were assigned to each list based on the PW field List_ID, or on passing the List Filter. Refer to the parameter value in PW Field List_ID, or see the list filter setup in your Input List Description block.
- Default Records were assigned to the default list. Refer to the Undetermined List Action block in your job file.
- Net Input is the number of records that will be processed from this list. The Net Input total here should agree with the Net Input total from your Input File Summary for this job.

[Sample report: Input List Summary Report. Columns: List, Matched Id Records, Default Records, Net Input. Rows: house, firms, no_mail, select, update, Totals.]
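The two agreement rules described above lend themselves to an automated cross-check. The Python sketch below is a hedged illustration: the function and its row layout are hypothetical, the counts are invented, and it assumes that each list's Net Input is the sum of its Matched Id Records and Default Records.

```python
def check_list_summary(list_rows, file_net_total):
    """Return a list of problems found; an empty list means the totals agree.

    list_rows: (list_name, matched_id_records, default_records, net_input) tuples.
    file_net_total: Net Input total taken from the Input File Summary report.
    """
    problems = []
    for name, matched_id, default, net in list_rows:
        # Assumed rule: the two parts add up to the list's Net Input.
        if matched_id + default != net:
            problems.append(f"{name}: {matched_id} + {default} != {net}")
    list_net_total = sum(row[3] for row in list_rows)
    if list_net_total != file_net_total:
        problems.append(f"list total {list_net_total} != file total {file_net_total}")
    return problems

# Hypothetical counts for two lists.
rows = [("house", 900, 0, 900), ("update", 400, 50, 450)]
ok = check_list_summary(rows, file_net_total=1350)
mismatch = check_list_summary(rows, file_net_total=1400)
```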

98 Job Summary Report (.mjs) The Job Summary presents processing statistics, reflecting the process settings of your MCD job. The report concisely summarizes your job setup, processing performance, files used, and reports issued. Use this to verify and record all the pertinent data about your job. Data for this report is generated throughout all the phases of your MCD job. Data will be shown only for those phases of a job that have been performed. For example, if you have not elected to create output files, entries relating to that step are blank in this report. The format of this report is always the same. There are several pages, and many sections to the report. We explain each section of the report in an example below, looking at each section, one at a time. Job Status The Job Status section of the report lists processing steps completed, with the date and time of each. If a step is repeated, the date and time reflect the most recent run. You can find each of the entries of this section as execution options in the Execution block of your job setup. Job Summary Report Match/Consolidate x.xx Page 1 tekpubs Firstlogic, Inc Technical Publications Sample Report Job Description: TekPubs.mpg Job Owner: TechPubs Program Version: x.xx Job Status Read Records & Create Key File: Done Tue May 28 15:01: Find Duplicates: Done Tue May 28 15:01: Create Match/Consolidate Output File: Done Tue May 28 15:01: Create Multi-Occurrence Output File: Create All-Duplicates Output File: Create Custom Match/Consolidate Output File: Post to Input File(s): Purge: Custom Purge: Create Reports: Done Tue May 28 15:01: Create Report Statistics Files: No Save Work Files: Yes Process Statistics The Processing Statistics section breaks down performance statistics for the five major processes involved in a MCD job: 1. Creating key files 2. Finding duplicates 3. Creating output files 4. Posting to input files 5. 
Purging input files For each process, the following numbers are shown: The elapsed time that was used in performing each major process. The total number of records or comparisons processed.

99 The rate-per-hour that was achieved. At the Find Duplicates entry, the total number of duplicates found. The Elapsed Time of This Run is the time from the start of the run to its completion. Because there is time between processes, don t expect the sum of the elapsed times for all the processes to equal the elapsed time of the run. Processing Statistics Create Key File Elapsed Time (hrs:mins:secs): 00:00:06 Total Records Read: 2375 Records Read Per Hour: Find Duplicates Elapsed Time (hrs:mins:secs): 00:00:02 Total Comparisons: 4910 Comparisons Per Hour: Total Duplicates Found: 1507 Create Output File(s) Elapsed Time (hrs:mins:secs): 00:00:02 Total Records Output: 770 Records Output Per Hour: Post to Input File(s) Elapsed Time (hrs:mins:secs): 00:00:00 Total Records Posted: 0 Records Posted Per Hour: 0 Purge Input File(s) Elapsed Time (hrs:mins:secs): 00:00:00 Total Records Purged: 0 Records Purged Per Hour: 0 Elapsed Time of This Run: 00:00:20 Elapsed Time of This Job: 00:00:20 Auxiliary Files This section of the Job Summary shows the directories and dictionaries used in the process. These entries show all the files that have been included in your job setup. It may be that some of these files are not actually used in the job. For example, you may have identified extended name, title, and firm parsing dictionaries (parsing.dct), but not included extended name, title, and firm parsing in your job. If so, the files will be shown here, but they have no effect on the job. 
Auxiliary Files Address Dictionary: Last Line Dictionary: City Directory: ZCF Directory: Zip+4 Directory 1: Zip+4 Directory 2: Rev Zip+4 Directory: Firm Line Dictionary: Capitalization Dictionary: Standard Pre-name Dictionary: Standard Name Dictionary: Standard Pre-lastname Dictionary: Standard Post-name Dictionary: Multi-line Rules: Firm Rules: Parsing Dictionary: Match Pct Dictionary: Ext Match Blocks: Default ASCII FMT: Default DEF: C:\pw\mpg\addrln.dct C:\pw\mpg\lastln.dct C:\pw\dirs\city07.dir C:\pw\dirs\zcf07.dir C:\pw\dirs\zip4us.dir C:\pw\dirs\revzip4.dir C:\pw\mpg\firmln.dct C:\pw\mpg\pwcas.dct C:\pw\mpg\prename.dct C:\pw\mpg\name.dct C:\pw\mpg\prelname.dct C:\pw\mpg\postname.dct C:\pw\mpg\mlrules.gcf C:\pw\mpg\fprules.gcf C:\pw\mpg\parsing.dct C:\pw\mpg\matchpct.dct Chapter 6: Reports and statistics files 99

100 Input and Key File Information The Input and Key File Information section includes the number of input files and lists, gross and net input records, options related parsing and matching data of the key file, and field IDs included in the key file. It also shows details about the characteristics of the records, including gender, and the number of names in each record. The specific entries that you ll see in this section of the report vary with the choice of extended or standard matching for your MCD job. If your job uses standard matching If your job uses standard matching, you ll see entries like the following in this section of your Job Summary report. Input and Key File Information Number of Input Files: 5 Number of Input Lists: 5 Total Input Records (Gross): 2375 Records Dropped (Filtered, etc): 0 Net Input Records: 2375 Standardize Name Keys: No Standardize Firm Keys: No Standardize Lastline Keys: Yes Include Second Name: No Priority Field: off Fields Included in Key File: Last_Name, 12 Street Range, 10 Street Pre-directional, 2 Street Primary Name, 22 Street Suffix, 4 Street Post-directional, 2 Street Secondary Range, 6 PO Box, 6 Rural Route Number, 2 Rural Route Box, 6 State, 2 ZIP, 5 Gender: Unassigned 2375 Strong Male 0 Strong Female 0 Weak Male 0 Weak Female 0 Ambiguous 0 Multiple Names - Mixed 0 Multiple Names - Male 0 Multiple Names - Female 0 Multiple Names - Ambiguous 0 Parsed as: Business 0 Residential 2375 Number of Parsed Names (per record): One name 2348 Two names 1 Three names 0 Four names 0 Five names 0 Six names Match/Consolidate User s Guide

101 If your job uses extended matching On the other hand, if your job uses extended matching, you ll see entries like the following in this section of your Job Summary report. With extended matching in your job setup, this section includes the number of input files and lists, gross and net input records, and the settings of the Parsing and Key Options block for your included Extended Matching file. Because this report was generated from a different job, the data in this sample does not correspond to that of the other reports in this chapter. Input and Key File Information Number of Input Files: 5 Number of Input Lists: 5 Total Input Records (Gross): 2375 Records Dropped (Filtered, etc): 0 Net Input Records: 2375 Auto Generate Key Lengths: NONE Name, Title, & Firm Parsing: STD Std Number of Names to Store: 1 Std Number of First_Name Stds: 0 Std Number of Mid_Name Stds: 0 Ext Number of Names to Store: 1 Ext Number of First_Name Stds: 1 Ext Number of Mid_Name Stds: 0 Ext Store Pre/Post Name: ORIG Ext Store Title: ORIG Ext Number of Firms to Store: 1 Store Firm: ORIG Ext Number of Firm_Locs to Store: 1 Ext Store Firm_Loc: ORIG Address & Last Line Parsing: ORIG Standardize Last Line Fields: Yes Upper Case Match/Consolidate Fields: No Store Priority Field: No Fields Included in Key File: Last_Name, 16 Prim_Range, 8 Predir, 2 Prim_Name, 12 Suffix, 4 Postdir, 2 Sec_Range, 4 PO_Box, 8 RR_Number, 4 RR_Box, 8 Unp_Addr, 30 ZIP, 5 Gender: Unassigned 2375 Strong Male 0 Strong Female 0 Weak Male 0 Weak Female 0 Ambiguous 0 Multiple Names - Mixed 0 Multiple Names - Male 0 Multiple Names - Female 0 Multiple Names - Ambiguous 0 Parsed as: Business 0 Residential 2375 Number of Parsed Names (per record): One name 2348 Two names 1 Three names 0 Four names 0 Five names 0 Six names 0 Chapter 6: Reports and statistics files 101

102 Duplicate matching information The Duplicate Matching Information section lists all the significant information about your breaking and matching processes. The break fields and breaking results will help you evaluate and improve your breaking strategy. If your MCD job uses standard matching, you ll see the settings of your Match Options block in this section. On the other hand, if your job uses extended matching, the match options aren t shown in this section, because they are shown in the previous section. Duplicate Matching Information Net Input Records: 2375 Matching Method: Std Breaking Fields used for Breaking: Street Primary Name, 3 ZIP, 5 Maximum Work Buffer Keys: Total Break Groups: 708 Largest Break Group (# of Records): 19 Break Field Value of Largest Group: "01701 " Comparisons Theoretical Maximum, Without Breaking: Total Comparisons Actually Performed: 4910 Comparisons Per Hour: Hours Saved by Breaking (Estimate): 0.32 Match Options Household Match Type: resident Firm Match Type: firm Check for Transposed Letters: No (...) Random Priority: No Fields used for Matching: Last_Name, 12, m Street Range, 10, t Street Pre-directional, 2, 100% Street Primary Name, 22, m Street Suffix, 4, t Street Post-directional, 2, 100% Street Secondary Range, 6, t PO Box, 6, t Rural Route Number, 2, t Rural Route Box, 6, t Output file information The Output File Information section lists each output file that has been included in this job. In this example, the MCD output file was included. In addition to listing the options for the output file such as name, file access, group posting, and so on the section also lists the number of records that are output to this file. Output File Information Match/Consolidate Output File: Done Tue May 28 15:01: Name: C:\pw\mpg\Work\output\Out_MPG.txt File Access: replace Do Dupe Group Posting: No Number of Records Output: Match/Consolidate User s Guide
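The comparison figures in the Duplicate Matching Information section can be approximated with simple arithmetic. The sketch below is an assumption about how such numbers could be derived, not a description of MCD's internal calculation. It takes the theoretical maximum to be every possible pair of records, n(n-1)/2, and estimates the hours saved from the observed comparison rate. With the sample job's 2375 records, 4910 comparisons, and 2-second Find Duplicates time, the estimate comes to about 0.32 hours, which agrees with the sample report.

```python
def max_comparisons(net_records):
    """Every possible pair of records: n * (n - 1) / 2 (assumed definition)."""
    return net_records * (net_records - 1) // 2


def per_hour(count, elapsed_seconds):
    """Scale a count observed over elapsed_seconds to an hourly rate."""
    return count * 3600 / elapsed_seconds


def hours_saved(net_records, comparisons_done, elapsed_seconds):
    """Estimate the time that the avoided comparisons would have taken."""
    rate = per_hour(comparisons_done, elapsed_seconds)
    avoided = max_comparisons(net_records) - comparisons_done
    return avoided / rate


print(max_comparisons(2375))                 # 2819125
print(per_hour(4910, 2))                     # 8838000.0
print(round(hours_saved(2375, 4910, 2), 2))  # 0.32
```

The same `per_hour` helper applies to the other Processing Statistics entries, for example records read per hour.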

103 Report and statistics files information Each report is listed with its path and file name and the date and time it was created. If you repeat a processing step, and run new reports for that process, the most recent run is the one reflected here. If you choose to create Statistics files, they are listed here. If not, this section isn t in your report. Report Information Input File Summary Report: Done Tue May 28 15:01: Name: C:\pw\mpg\work\reports\tekpubs.ifs Input List Summary Report: Done Tue May 28 15:01: Name: C:\pw\mpg\work\reports\tekpubs.ils Unparsed Records Report: Done Tue May 28 15:01: Name: C:\pw\mpg\work\reports\tekpubs.unp List Quality Report: Done Tue May 28 15:01: Name: C:\pw\mpg\work\reports\tekpubs.lqr Duplicate Records Report: Done Tue May 28 15:01: Name: C:\pw\mpg\work\reports\tekpubs.dup (...) Statistics Files Information Job Statistics: Name: List Statistics: Name: (...) Purge information If your process includes a purge, a summary of the purge is included here. In this example, the purge prediction was executed, so no files are listed as subject to, or protected from the purge operation. For details, refer to the Purge by List Report. Purge Information Net Input Records: 2375 Total Records Purged: 0 Do Dupe Group Posting: File(s) Subject to Purge: File(s) Protected from Purge: No C:\pw\mpg\Work\mail_sup.txt C:\pw\mpg\Work\rent_mag.txt Reference file information Finally, if reference files are included in your job, reference file information is summarized at the bottom of the report. Reference File Information Reference File: Name: C:\pw\mpg\Work\ref_house.ref Associated input file: Name: C:\pw\mpg\Work\house.txt Input file date: Date: 05/28/2002 Input file size (bytes): Size: Starting Record Number: 1 Maximum Number of Records to Input Input Filter (to 512 chars): NONE LIST_ID constant: "house" LIST_ID field: Field Priority Field: Chapter 6: Reports and statistics files 103

List-by-List Match Report (.llm)

This report is designed to answer questions about matching between and among all the lists of your job. For example, how many of your list1 records were found to match a list2 record? How many list3 records matched other list3 records? This report can help answer those questions and show you how the records of one list relate to those on other lists.

This report, however, cannot by itself tell you which records will be dropped or maintained, because it doesn't reflect the priorities that dictate master and subordinate records. See the List Duplicates report for that sort of list-by-list detail.

The data for this report is generated after the Find Duplicates step of your MCD job.

The format of this report is as shown in the example on the following page. The input lists are arranged in a matrix, with a row for each list of your job and a column for each list as well. You'll see the same numbers whether you read across a row or down a column. For example, if you wanted to see how many list2 records matched records of other lists, you could read across the list2 row or down the list2 column. The numbers will be the same, relative to the lists of the job.

The following numbers correspond with the example List-By-List Match Report on the following page.

The default width for this report is 80 characters. If your job includes more than five lists, you may want to increase the width to keep more list columns in the same table.

This is the number of times a record that was a member of list4 (this is the list4 row) matched a record that was a member of list1 (this is the list1 column). These are inter-list matches, because the matching records were from different lists.

This is the number of times a record that was a member of a list matched another record that was a member of this same list. Notice that the list row is the same as the list column. These are intra-list matches.
There is no direct correlation between the intra-list match counts on this report and the counts on the List Duplicates Report. Refer to page 88 for an explanation of how MCD counts inter-list and intra-list matches. When you include super lists in your job, MCD includes them in a separate page of this report, as shown in our example. Here, statistics for the individual lists are totaled together consistent with your super list setup.

List-By-List Match Report    Match/Consolidate x.xx    Page 1
tekpubs    Firstlogic, Inc Technical Publications    Sample Report

List 1    List 2    List 3     List 4    List 5
house     firms     no_mail    select    update

List-By-List Match Report    Match/Consolidate x.xx    Page 2
tekpubs    Firstlogic, Inc Technical Publications    Sample Report

Super List 1    Super List 2
no-fee          contracted

The number shown for one list's comparison to itself (the cells in the sample above where the row and the column name the same list) shows intra-list matches. If you were to de-dupe a list, you could expect that number of duplicate records to be eliminated from the list, and the master duplicates and unique records kept. For example, if you de-duped the "house" list above, you could expect 157 duplicate records to be dropped from that list. Note that this assumes that driver records remain the same (see Compare record keys: the driver record on page 196 for details about record comparisons).

What this report does not show

Master or subordinate dupe status: The numbers on this report do not show which records may be ranked as master dupes and which are subordinate dupes. For example (see above), of the 249 matches found between "select" list records and "house" list records, there is no indication of which may be master dupes and which are subordinate dupes. So, from this report alone, there is no way of knowing which records would be eliminated or included in the output from the job.

The size of the dupe groups: Whether a dupe group contains 2 records or 200, each match counts as one in this report. The numbers equate to subtracting one from each dupe group, then adding the results of all the dupe groups.
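The counting rule for dupe groups (each group contributes its size minus one) can be sketched as follows. The group data is hypothetical, and the function only illustrates the arithmetic. It is not MCD's implementation.

```python
def match_count(dupe_groups):
    """Sum (group size - 1) over all dupe groups, skipping empty groups."""
    return sum(len(group) - 1 for group in dupe_groups if group)


# Hypothetical dupe groups of 2, 3, and 200 records.
groups = [
    ["r1", "r2"],
    ["r3", "r4", "r5"],
    ["big"] * 200,
]
print(match_count(groups))  # 1 + 2 + 199 = 202
```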

List Duplicates Reports (.ldr)

The List Duplicates report can be produced in two versions: detail and/or summary. The reports show the numbers of records, by match status, from each list of your job, so you can see the exact numbers of records that will be kept and dropped. It's especially useful to review these reports before you perform an input file purge or produce your output file.

The List Duplicates report data is generated during the matching process (the Find Duplicates phase of your MCD job). The columns are as shown in our examples below and on the next page. The report shows totals both with and without suppression records, so you can see the separate values and compare them with your net input totals.

Summary version

This version shows a row for each list of your job. Two totals rows show values both with and without any suppression records, so you can compare those values with the net input totals. Quantities for suppression records only appear in the second totals row and in the Net Input column.

List Duplicates Report, Summary Information    Match/Consolidate x.xx    Page 1
tekpubs    Firstlogic, Inc Technical Publications    Sample Report

List Name    List_id    List Type    List Priority    Total Dupes    Pct of Net    Total Non Dupes    Pct of Net    Net Input
house        house      NORM
firms        firms      NORM
no_mail      no_mail    SUPP
select       select     NORM
update       update     NORM
Totals
Totals (Including Suppression Records)

Total Dupes is the sum of single list dupes, multiple list dupes, suppression-list dupes, and (in the bottom Totals row) suppression-list subordinates.

Pct of Net (percentage of net) refers to the Total Dupes column and equals Total Dupes divided by Net Input, times 100.

Total Non Dupes is the sum of unique records, single list masters, multiple list masters, and (in the bottom Totals row) suppression list uniques and masters. Refer to the detail version of the List Duplicates report for a complete breakdown of non-dupes.
Pct of Net (percentage of net) here refers to the Total Non Dupes column and equals Total Non-Dupes divided by Net Input, times 100. Net Input is the number of records processed.
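The summary columns for a Normal-type list can be reproduced with a short calculation. This sketch follows the definitions of Total Dupes, Total Non Dupes, and Pct of Net given for the summary version. It additionally assumes that Net Input for such a list equals Total Dupes plus Total Non Dupes, and all names and figures are hypothetical.

```python
def summary_row(single_dupes, multi_dupes, suppress_dupes,
                uniques, single_masters, multi_masters):
    """Build the summary columns for one Normal-type list."""
    total_dupes = single_dupes + multi_dupes + suppress_dupes
    total_non_dupes = uniques + single_masters + multi_masters
    # Assumption: every input record is either a dupe or a non-dupe.
    net_input = total_dupes + total_non_dupes
    return {
        "total_dupes": total_dupes,
        "pct_dupes": 100 * total_dupes / net_input,
        "total_non_dupes": total_non_dupes,
        "pct_non_dupes": 100 * total_non_dupes / net_input,
        "net_input": net_input,
    }


row = summary_row(10, 20, 5, 50, 10, 5)
print(row["total_dupes"], row["pct_dupes"])          # 35 35.0
print(row["total_non_dupes"], row["pct_non_dupes"])  # 65 65.0
```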

Detail version

This version shows a row for each list, with columns showing the matching status of each list's records. As in the summary version of the report, two totals rows show values both with and without suppression records. The breakdown of records shows the number of each list's records that fall into the following matching categories (in the order shown on the report):

Category    Description

Net input    The number of input records in this list.

Suppress Dupes    Subordinate member of a dupe group that includes a higher-priority record that came from a Suppress-type list. These can be from Normal- or Special-type lists.

Single List Dupes    Subordinate members of a dupe group whose members all came from the same list. These can be from Normal- or Special-type lists.

Multiple List Dupes    Subordinate members of a dupe group whose members came from two or more lists. These can be from Normal- or Special-type lists.

Uniques    Records that are not members of any dupe group. No matching records were found. These can be from Normal- or Special-type lists.

Single List Masters    Highest ranking member of a dupe group whose members all came from the same list. These can be from Normal- or Special-type lists.

Multiple List Masters    Highest ranking member of a dupe group whose members came from more than one list. These can be from Normal- or Special-type lists.

Suppress list Uniques    Records that came from a Suppress-type list, and for which no matching records were found.

Suppress list Masters    A record that came from a Suppress-type list and that is the highest ranking member of a dupe group.

Suppress list Subord    A record that came from a Suppress-type list and that is a subordinate member of a dupe group.
List Duplicates Report, Detail Information Match/Consolidate x.xx Page 1 tekpubs Firstlogic, Inc Technical Publications Sample Report Single Multiple Single Multiple Total Suppress Suppress Suppress Net Suppress List List Total List List Non List List List List Name Input Dupes Dupes Dupes Dupes Uniques Masters Masters Dupes Uniques Masters Subord house firms no_mail select update Totals Totals (Including Suppression Records) Total Dupes sums the dupes in the three columns before this one. Total Non Dupes sums the uniques and masters in the three columns before this one. The last three columns break down the suppression list records. List Distribution Summary The List Duplicates report now offers a way for you to analyze which records from one list match those records in other lists. Chapter 6: Reports and statistics files 107

For example, suppose you have two suppress lists (A and B) and are matching these against a normal list. With the new List Distribution Report section of the List Duplicates Report, you can see how many records from each subordinate list matched a particular master list.

List Distribution Summary sample

In this example, you will note the following:

The Master List Name represents the name of the list that the master record in a match group belongs to.

The Subordinate List_id represents the List_id of a subordinate record in the match group.

The Duplication Count represents the number of subordinate records from the Subordinate List_id column that matched the master list.

List Distribution Report, Summary Information    Match/Consolidate 7.70c    Page

Master List Name    Subordinate List_id    Duplication Count

The first entry in this report shows you that:

A record from List 3 was the master record in a match group.

Two records from the list "2999" were found as subordinate matches to master records from list "3".

Three records from the list "1025" were found as subordinate matches to master records from list "3".
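The Duplication Count can be thought of as a tally of subordinate records keyed by master list and subordinate List_id. The sketch below illustrates that idea with hypothetical match groups that reproduce the counts described for list "3". The data structure and function name are assumptions, not part of MCD.

```python
from collections import Counter


def list_distribution(match_groups):
    """Tally subordinate records per (master list, subordinate List_id).

    match_groups: iterable of (master_list_name, [subordinate_list_ids]).
    """
    counts = Counter()
    for master_list, subordinate_ids in match_groups:
        for list_id in subordinate_ids:
            counts[(master_list, list_id)] += 1
    return counts


# Hypothetical match groups whose master records all come from list "3".
groups = [
    ("3", ["2999", "1025"]),
    ("3", ["2999", "1025", "1025"]),
]
dist = list_distribution(groups)
print(dist[("3", "2999")])  # 2
print(dist[("3", "1025")])  # 3
```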

List Match Reports (.lm)

The List Match report can be produced in two versions, the detail version and/or the summary version. The reports show the numbers of records, in raw numbers and percentages, that match others, both among and between each list of your job. These numbers provide a clear picture of where your duplicate records are coming from.

Compare these reports to the List-by-List Match report, which displays similar data in a different format. These reports also contain percentage data that does not appear on the List-by-List Match report.

The List Match report data is generated during the matching process (the Find Duplicates phase of your MCD job). The columns are as shown in our examples. The report shows totals both with and without suppression records, so you can see the separate values and compare them with your net input totals.

Summary version

The summary version of the List Match report shows a row for each list of your job, plus a totals row. The first page of the summary shows your input lists. If you have included super lists in your job, each of these is shown on an additional page. Our sample on the next page shows that five lists were in the job (page 1 of the report), plus two super lists (page 2 of the report).

The report columns show the following statistics:

Statistics    Description

Net Input    The number of input records that are members of this list.

Intra List Matches    The number of matches between one member of this list and another member of this list. For example, 157 records of the "house" list match other "house" records.

Inter List Matches    The number of matches between a member of this list and a member of another list. For example, in 1190 instances, records of the "house" list were found to match records in other lists. To find out how these matches were distributed among the different lists, see the detail version of the report.

Total Matches    The sum of the previous two columns.

Percent of Net Input    Total matches divided by net input. This can be higher than 100 percent because a record may match records in more than one other list. For example, 1000 house records were input, and there were 1347 total matches found. So, the percent of net input is 1347 divided by 1000, or 134.7 percent.

Note: The totals row for this column does not sum the entries of its list rows, as do the other columns. Instead, the total matches entry of the totals row is divided by the net input entry of the totals row to produce the percent of net input for the totals row.
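The summary arithmetic can be checked with a short sketch. The "house" figures (157 intra-list matches, 1190 inter-list matches, 1000 net input) come from the example in the text. The second list and the helper names are hypothetical, added only to show how the totals row differs from a simple sum of percentages.

```python
def percent_of_net_input(total_matches, net_input):
    """Total matches divided by net input, as a percentage."""
    return 100 * total_matches / net_input


def totals_row_percent(rows):
    """Percent of Net Input for the totals row.

    rows: iterable of (net_input, total_matches) per list. The totals are
    divided by each other; the per-list percentages are not summed.
    """
    total_net = sum(net for net, _ in rows)
    total_matches = sum(matches for _, matches in rows)
    return percent_of_net_input(total_matches, total_net)


house_total = 157 + 1190                        # intra-list + inter-list
print(house_total)                              # 1347
print(percent_of_net_input(house_total, 1000))  # 134.7

# Hypothetical second list with 250 records and 100 matches.
print(totals_row_percent([(1000, 1347), (250, 100)]))  # 115.76
```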

110 Summary version of the List Match report List Match Report, Summary Information Match/Consolidate x.xx Page 1 tekpubs Firstlogic, Inc Technical Publications Sample Report Net Intra List Inter List Total Percent of List Input Matches Matches Matches Net Input house firms no_mail select update Totals List Match Report, Summary Information Match/Consolidate x.xx Page 2 tekpubs Firstlogic, Inc Technical Publications Sample Report Net Intra List Inter List Total Percent of Super List Input Matches Matches Matches Net Input no-fee contracted Totals Detail information The detail version of the List Match report (our sample is on the next two pages) shows a table for each list of your job, plus a totals row. The table breaks down the matches among each of the lists of the job. First, a table is shown for each input list. Then, if you have included super lists in your job, a table for each of those is shown on an additional page. The table for each list shows the list name (the Match Comparison List) and its net input (the number of input records that are members of this list). Each table includes a row for each input list of the job. In the sample, nine lists are included in the job, these are named list1 through list9. These same rows are repeated for each of the nine list tables in the report. In the section of the report dealing with super lists, the names of the four super lists suppress, 2-3, 4-5-6, and are repeated for each of the four super list tables in the report. Each table of this report is statistically independent of the others, so don t try to relate percentages from one table to those of any other table. There is only one value that correlates a table to any other the matches entry from the table s list to the row s list. For example, in the first table (the "house" list table) the number of matches for a row relates to that same number in the house row of the corresponding table. 110 Match/Consolidate User s Guide

111 List Match report Detail information List Match Report, Detail Information Match/Consolidate x.xx Page 1 tekpubs Firstlogic, Inc Technical Publications Sample Report Match Comparison List: house Net Input: 1000 Percent of Percent of List Matches Net Input Matches house firms no_mail select update Totals Match Comparison List: firms Net Input: 1000 Percent of Percent of List Matches Net Input Matches house firms no_mail select update Totals Match Comparison List: no_mail Net Input: 100 Percent of Percent of List Matches Net Input Matches house firms no_mail select update Totals Match Comparison List: select Net Input: 250 Percent of Percent of List Matches Net Input Matches house firms no_mail select update Totals Match Comparison List: update Net Input: 25 Percent of Percent of List Matches Net Input Matches house firms no_mail select update Totals (the report continues on the following page) Chapter 6: Reports and statistics files 111

Match Comparison Super List: no-fee    Net Input: 2000

Super List    Matches    Percent of Net Input    Percent of Matches
no-fee
contracted
Totals

Match Comparison Super List: contracted    Net Input: 375

Super List    Matches    Percent of Net Input    Percent of Matches
no-fee
contracted
Totals

The columns of the table show the following statistics:

Statistics    Description

List    The number of input records that are members of this list.

Matches    The number of matches between this table's Match Comparison List and a member of the list named in the row. For example, in the table for "house" records (the first table of the report) the house row shows that there are 157 "house" records that match other "house" records. These are intra-list matches: matches between records that are members of the same list. The firms row shows the number of instances in which "house" records match "firms" records. These are inter-list matches: matches between records from different lists.

Percent of net input    The number of matches as shown in the previous column divided by the net input that this table's list contributed to the job. For example, the first table of the report (for the "house" list) shows that 249 records of the "house" list's 1000 input records were found to match records within the "select" list. These 249 records represent 24.9 percent of the "house" list's input records (249 divided by 1000 equals 0.249, or 24.9 percent). The percent of net input in the totals row is the sum of the values above it in the table, representing total matches divided by the net input from this table's list.

Percent of matches    The number of this row's matches (as shown in the second column) divided by the total matches of this table's list (as shown in the totals row for this table). For example, the first table of the report (for the "house" list) shows that, of this list's 1347 total matches, 249 were found among the records of the select list. These 249 matches represent 18.49 percent of the total "house" matches (249 divided by 1347 equals 0.1849, or 18.49 percent). The percent of matches in the totals row sums the values above it in the table, and always equals 100 percent.
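The two percentage columns of a detail table use different denominators, which a short calculation makes explicit. The figures below are the "house" and "select" numbers quoted in the text (249 matches, 1000 net input, 1347 total matches). The function names are hypothetical.

```python
def percent_of_net(row_matches, table_net_input):
    """Row matches relative to the net input of the table's own list."""
    return 100 * row_matches / table_net_input


def percent_of_matches(row_matches, table_total_matches):
    """Row matches relative to all matches found for the table's list."""
    return 100 * row_matches / table_total_matches


print(percent_of_net(249, 1000))                # 24.9
print(round(percent_of_matches(249, 1347), 2))  # 18.49
```

Because Percent of matches divides every row by the same table total, the rows of one table always add up to 100 percent.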

List Quality Report (.lqr)

The List Quality report shows, for each input list, the number of records that lacked important data, or in which the data could not be found by the parsing software.

This report verifies that the records of your input lists contain the field data needed for useful comparisons and duplicate detection. If you see that a large number of records lack important data, you may want to adjust your matching or breaking strategy and expectations.

If you have included the Read Records and Create Match Sets execution option, MCD generates data for this report during the input phase of your job.

The first part of the report is a wide table, with a row for each of the lists included in your MCD job. The columns are the same, regardless of your job setup. Following that first table, a No Parse Record Summary is shown for each list, to detail reasons for parsing problems.

The six columns of the first table show the information below. For each, the Pct column equals the Count divided by the list's Net Input, times 100.

Statistics    Description

List, Net Input    The name of the list as defined in your MCD job, and the number of records from that list that were read and processed as input.

No Parse    The number of input records from this list from which MCD could not parse data, either because a field was blank or the field data was bad. For example, in the table for "house" list records (the first line of the first table) 28 records contained at least one field that could not be parsed. The No Parse Record Summary for List house does not have a single corresponding entry.

No Address    The number of input records from this list from which MCD could not parse address data, either because an address field was blank or the field data was bad. The No Parse Record Summary for List house shows that 15 "house" list records had bad address data, and none had blank address data.

No Firm    The number of input records from this list from which MCD could not parse Firm data, either because the firm field was blank or the field data was bad. The No Parse Record Summary for List house shows that none of the "house" list records had blank firm data.

No Title    The number of input records from this list from which MCD could not parse Title data, either because the title field was blank or the field data was bad. The "house" list count shows that none of its 1000 records had title data.

No Last Name    The number of input records from this list from which MCD could not parse Last Name data, either because the field was blank or the field data was bad. The No Parse Record Summary for List house shows that 13 "house" list records had blank name data.

No First Name    The number of input records from this list from which MCD could not parse First Name data, either because the field was blank or the field data was bad. The "house" list count shows that 14 "house" list records had no first name.

If 100 percent of records are counted as lacking a field, check your DEF files. For example, if 100 percent of records lack a Title, perhaps you did not define the PW field Title. Of course, this is not necessary unless you intend to use Title for matching (we didn't in our sample job).

114 If the input file is multiline, firm, title, and name information will be parsed only if the extended Name, Title, and Firm parsing process has been selected.

The numbers in each pair of columns (Count and Pct) are separate calculations from those in other pairs of columns. Each record may have more than one parsing problem; therefore, the No Parse count need not equal the sum of the counts of other columns. The same is true for the No Parse Record Summary for each list.

The records that are reflected here may not have errors recorded. For example, if Firm is not defined in this job's key file, none of the records that appear in the No Firm columns would have AP.Firm_Error set.

List Quality Report    Match/Consolidate x.xx    Page 1
tekpubs Firstlogic, Inc    Technical Publications Sample Report

Columns: List, Net Input, then Count and Pct for each of No Parse, No Address, No Firm, No Title, No Last Name, No First Name
Rows: house, firms, no_mail, select, update, Totals

No Parse Record Summary for List house
  Blank Names        13
  Blank Firms        0
  Blank Addresses    0
  Bad Addresses      15
  Blank Lastlines    0
  Bad Lastlines      0
  Foreign Lastlines  0

No Parse Record Summary for List firms
  Blank Names        13
  Blank Firms        0
  Blank Addresses    0
  Bad Addresses      15
  Blank Lastlines    0
  Bad Lastlines      0
  Foreign Lastlines  0

No Parse Record Summary for List select
  Blank Names        0
  Blank Firms        0
  Blank Addresses    0
  Bad Addresses      2
  Blank Lastlines    0
  Bad Lastlines      0
  Foreign Lastlines

115 Match Results Report (.mrr)

The Match Results report shows, in detail, the results of your extended matching process. This report is not generated for standard matching. The report shows how many comparisons were made, how many matches were found, how many no-matches, and so on. From this data, you may see ways to improve the speed or effectiveness of your matching process. For example, matching decisions made early in the process save comparisons.

Data for this report is generated during the duplicate detection phase.

This report lists the number of comparisons made, then shows either two or three tables: Non Match Spec Results, Match Spec Results, and, if you are using a rule match session, Rule Match Results Summary.

Non-Match Spec Results

Not Compared - Forced No Match: The number of input records that were not compared because they have multiline address data that MCD was unable to parse.

Not Compared - Already a Match: The number of input records that were not compared because they had already been identified as a match for a record that had been compared previous to this record.

Not Compared - Internal List Compares Disabled: The number of input records that were not compared because the job setup specified that records should not be compared to other records within the same list.

Pre-Compare Exit Match Decisions: (MCD Custom only) The number of records that were not compared because a Pre-Compare exit returned a match decision to MCD.

Pre-Compare Exit No Match Decisions: (MCD Custom only) The number of records that were not compared because a Pre-Compare exit returned a no-match decision to MCD.

Match Post-Compare Exit No Match Decisions: (MCD Custom only) The number of records that were not compared because a Pre-Compare exit returned a match decision to MCD.

No Match Post-Compare Exit Match Decisions: (MCD Custom only) The number of records that were not compared because a Pre-Compare exit returned a match decision to MCD.

Match Spec Results

Match Spec Name: The name of the match specification as defined in your MCD job setup, in the Auto Match Spec or Rule Match Spec block.

Match Spec Type: The type of matching, from your Auto Match Spec or Rule Match Spec block.

Match Attempts: The number of times this match specification was used to compare two records. This is not a count of total field comparisons. You will also see this number in the Attempts Made column of the first field comparison in the Rule Match Results Summary.

Rule Match Decisions: The number of times this match specification determined that two records matched.

Rule No Match Decisions: The number of times this match specification determined that two records did not match.

Undecided: The number of times this match specification could not decide whether two records matched.

116 Rule Match Results summary

Note that this is not used with auto matching. This section lists a row for each rule and, for your convenience, lists the field involved in the rule. Each row shows how that field comparison affected the results of the job's matching process.

Total comparisons equals the Non-Match Spec Results total plus the Match Attempts.

Attempts Made: The number of times this rule was used in comparing records. The first rule is used in all comparisons. Any decisions produced by that rule (shown in the next two columns) reduce the use of the next rule. In our example, the first rule is used 14365 times, producing 12890 No Match Decisions. Therefore, the second rule is only used in 1475 comparisons (14365 minus 12890). The same is true for each subsequent rule, as well.

Match Decision: The number of matches decided by this rule. The sum of the entries in this column equals the Rule Match Decision (above).

No Match Decision: The number of no matches decided by this rule. The sum of the entries in this column equals the Rule No Match Decision (above).

Percent Decision: Of this job's Total Comparisons, the percent that were decided by this rule. In the first row of our example, 12890 decisions were produced (Match [0] plus No Match [12890]); that count is reported as a percent of the Total Comparisons.

The data in this sample does not correspond to that of the other reports in this chapter, since this report was generated from a different job.
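The way each rule's decisions reduce the work left for the next rule can be sketched in a few lines of Python. This is an illustration of the arithmetic only, not MCD code; the function names are hypothetical. The figures are those quoted in the example (14365 attempts, 0 Match and 12890 No Match decisions for the first rule).

```python
def next_rule_attempts(attempts_made: int, match: int, no_match: int) -> int:
    """Comparisons left undecided by a rule, passed on to the next rule."""
    return attempts_made - (match + no_match)


def percent_decision(match: int, no_match: int, total_comparisons: int) -> float:
    """Share of the job's Total Comparisons decided by one rule."""
    return (match + no_match) / total_comparisons * 100


# First rule of the example in the text.
second_rule_attempts = next_rule_attempts(14365, match=0, no_match=12890)
print(second_rule_attempts)
```

Applying `next_rule_attempts` repeatedly, rule by rule, reproduces the Attempts Made column from top to bottom.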
Match Results Report    Match/Consolidate x.xx    Page 1
tekpubs Firstlogic, Inc    Technical Publications Sample Report

Total Comparisons:

Non Match Spec Results
  Not Compared - Forced No Match: 0
  Not Compared - Already a Match:
  Not Compared - Compares Disabled: 0
  Pre-compare Exit Match Decisions: 0
  Pre-compare Exit No Match Decisions: 0
  Match Post-compare Exit No Match Decisions: 0
  No Match Post-compare Exit Match Decisions: 0

Match Spec Results
  Match Spec Name: Family Matching
  Match Spec Type: rule
  Match Attempts:
  Rule Match Decisions: 1403
  Rule No Match Decisions:
  Undecided: 0

Rule Match Results Summary
  Columns: Rule Number, Field, Rule Type, Attempts Made, Match Decision, No Match Decision, Percent Decision
  1 Last_Name Prim_Range PO_Box RR_Box Prim_Name RR_Number Address WEIGHTED

117 Multi-List Report (.mlr)

The Multi-List Report shows, for each input list, how many of its records were found to match records in other lists. This can be a most useful report when your MCD job includes creating a Multi-Occurrence Output file (for example, if you want to create a multi-buyer file). The data for this report is generated during the Find Duplicates phase of the MCD process.

The format of the Multi-List report is always the same. There is a row for each of the input lists of the job. The columns show, first, the name of the list, then the total number of the list's records that appeared on more than one list. The remaining columns show how many records in that list were found in 2 lists, 3 lists, 4 lists, and so on. The Multi List entry is the sum of the entries across the remaining columns for the entire row. If you created super lists, multi-list matches for the super lists are included in a separate table in the report.

If a record from list1 matches a record from list2, then that record is included in the number in the 2 List column. If a record from list1 matches a record from list2 and also a record from list4, then that record is included in the number in the 3 List column.

The entry in each column shows the number of multi-buyers; that is, how many records appeared on more than one list, not how many times they appeared. For example, if a record from list1 matches three records from list2, then that record adds one to list1's total in the 2 List column. It is not added to the 4 List column, nor is three added to the 2 List column.

When determining the number of lists on which a record appeared, the program does not count single-list matches, or any matches to records from special lists.

If you want to output multi-list values for each record, you can use the fields AP.List_Cnt and AP.Super_Cnt. For more information, refer to the MCD Output Fields in the Quick Reference.
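The counting rules above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the program's actual logic: the function names and the input shape (for each record, the names of the lists its matches came from) are hypothetical, and the "10+ List" label is taken from the sample report header.

```python
from collections import Counter


def lists_appeared_on(own_list, matched_lists, special_lists=frozenset()):
    """Number of distinct lists a record appears on.

    Matches within the record's own list and matches to special lists
    are ignored, as the text describes.
    """
    others = {
        name for name in matched_lists
        if name != own_list and name not in special_lists
    }
    return 1 + len(others)


def multi_list_row(own_list, matches_per_record, special_lists=frozenset()):
    """Build one report row: counts per 'N List' column plus 'Multi List'."""
    columns = Counter()
    for matched_lists in matches_per_record:
        n = lists_appeared_on(own_list, matched_lists, special_lists)
        if n >= 2:
            columns["10+ List" if n >= 10 else f"{n} List"] += 1
    columns["Multi List"] = sum(columns.values())
    return dict(columns)


# Hypothetical list1 records, following the cases in the text.
row = multi_list_row("list1", [
    ["list2"],                    # counted once in the 2 List column
    ["list2", "list4"],           # counted once in the 3 List column
    ["list2", "list2", "list2"],  # still one entry in the 2 List column
    ["list1"],                    # single-list match: not counted
    [],                           # no matches: not counted
])
print(row)
```

Counting distinct list names with a set is what keeps a record that matches three list2 records from being counted three times.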

118 Multi-List Report    Match/Consolidate x.xx    Page 1
tekpubs Firstlogic, Inc    Technical Publications Sample Report

Columns: List, Multi List, 2 List, 3 List, 4 List, 8 List, 9 List, 10+ List
Rows: house, firms, no_mail, select, update, Totals

Multi-List Report    Match/Consolidate x.xx    Page 2
tekpubs Firstlogic, Inc    Technical Publications Sample Report

Columns: Super List, Multi List, 2 List, 3 List, 4 List, 8 List, 9 List, 10+ List
Rows: no-fee, contracted, Totals

In this example, we've removed three of the columns (from 5 List through 7 List) to fit the report on the page. Our example had zeros in all those columns, and all columns are formatted in the same way.

119 Output File Reports (.ofr)

The Output File report can be produced in two versions, the detail version and the summary version. You can elect to produce either version or both. The reports show the number of records that were included in the job's output file(s). The numbers on these reports provide a clear picture of the source of your output file records.

The examples we've used here are for an MCD Output File; the format of the reports doesn't change for the different output file types (All-Duplicates Output File, Custom MCD Output File, etc.). The report shows totals both with and without suppression records, so you can see the separate values and compare them with your net input totals.

Data for the report is generated during the output process of your job.

The columns are always as shown in our examples, but you can vary the content of both the rows and pages of the Output File report. In effect, you can perform a two-level sort of the report's data, to show rows and pages by List, Super List, State, or Key Field. Some popular sorts are explained below.

In order to use the State or Key Field sorting, the sorting field(s) must be set up as key fields. For example, if you want to vary pages by State, State must be set up as a key field (normally with a length of two characters). If you want to vary rows by a Sales Zone field of your database, you must set up that Sales Zone field as a Merg_Purge key field, with a length great enough for your largest Sales Zone data. You need not match on such a field, but it must be included in the record key.

Sort by broker
Create a page for each data broker, and, on each page, show a row for each list involved in your job. To produce this report, set the controls of your Output File Report to vary rows by LIST and vary pages by SUPER_LIST. For this arrangement, of course, you must set up Super Lists to reflect your data brokers (refer to Create groups of lists (super lists) on page 41).

Sort by state
Have the rows of your report list the output record count by state. To produce this style report, set the controls of your Output File Report to vary rows by STATE. If you'd like, vary pages by LIST, as well.

Sort by ZIP Code
Have the rows of your report list the output record count by ZIP Code. To produce this style report, set the controls of your Output File Report to vary rows by KEYFLD, with the Row Custom Field set to ZIP. If you'd like, vary pages by LIST, as well.
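The two-level arrangement of pages and rows amounts to grouping output records by one field and then by another. Here is a minimal Python sketch of that idea; it is not MCD job syntax, and the function name, the dictionary keys, and the sample records are all hypothetical.

```python
from collections import defaultdict


def tally_output(records, vary_pages_by, vary_rows_by):
    """Count output records per page value, then per row value."""
    pages = defaultdict(lambda: defaultdict(int))
    for record in records:
        pages[record[vary_pages_by]][record[vary_rows_by]] += 1
    return {page: dict(rows) for page, rows in pages.items()}


# Hypothetical output records; keys are illustrative, not MCD field names.
records = [
    {"LIST": "house", "SUPER_LIST": "no-fee", "STATE": "CA"},
    {"LIST": "firms", "SUPER_LIST": "no-fee", "STATE": "CO"},
    {"LIST": "select", "SUPER_LIST": "contracted", "STATE": "CA"},
    {"LIST": "house", "SUPER_LIST": "no-fee", "STATE": "CA"},
]

# "Sort by broker": one page per super list, one row per list.
by_broker = tally_output(records, vary_pages_by="SUPER_LIST", vary_rows_by="LIST")
# "Sort by state" with pages varied by list.
by_state = tally_output(records, vary_pages_by="LIST", vary_rows_by="STATE")
print(by_broker)
print(by_state)
```

Swapping the two arguments is all it takes to move between the sorts described above, which is why the report columns stay the same in every arrangement.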

120 Summary version

The summary version of the Output File report shows a row for each list of your job, plus two totals rows. The second totals row includes suppression records and adjusts for output filter results. The data in the report columns is as follows.

List Name: The name of the list as defined in your MCD job setup.

Net Input: The number of records from this list that were processed as input to this job.

List_ID: The value that is used to identify members of this list.

List Type: The category of records in this list: Normal, Suppress, or Special.

List match priority: The numerical value of the list match priority. The higher the number, the lower the priority.

Filter Drops: Records that were excluded from the output file because they did not pass the output filter.

Pct of Net Input: The number of filter dropped records divided by net input, times 100. (This column and the Filter Drops column will always show zeros in the first Totals row because filtered records do not appear in an output file.)

Net Output: The number of records copied to the output file.

Pct of Net Input: Percent of Net Input refers to the Net Output column. It is the number of records output divided by net input, times 100.

The two Totals rows deserve special attention, to understand what their data represents. We provide the two rows of data to show the effects of any output filter and to enable users to see results both with and without records from Suppress-type lists. The first Totals row ignores the results of Suppress-type lists on the output; the second Totals row includes those results. In addition, the first Totals row ignores the impact of any output filter on the results; the second Totals row shows the results after the effect of the filter.

The following sample shows the normal report setup with rows sorted by list and no page sorting.
Output File Report, Summary Information: C:\pw\mpg\Work\output\Out_MPG.txt
Match/Consolidate x.xx
tekpubs Firstlogic, Inc    Technical Publications Sample Report

Columns: List Name, Net Input, List_id, List Type, List Priority, Filter Drops, Pct of Net Input, Net Output, Pct of Net Input

house     1000   house     NORM
firms     1000   firms     NORM
no_mail   100    no_mail   SUPP
select    250    select    NORM
update    25     update    NORM
Totals
Totals (Including Suppression Records and After Filter)

Output File Type: Match/Consolidate Output File

121 Output File report: Detail version

The detail version of the Output File report shows a row for each list of your job (depending on your sort by row setting, as explained earlier), plus a totals row. Our sample below included five lists. Each list row shows the total number of its records that were processed as input to this job (the Net Input column) and the total that were dropped because they did not pass an output filter (the Filter Drops column). Entries in the Totals row are simply the sum of the values for that column.

The remaining columns show the number of each list's records that fall into the following matching categories (in the order shown on the report).

Suppress Dupes: Subordinate member of a dupe group that includes a higher-priority record that came from a Suppress-type list. Can be from Normal or Special type lists.

Single List Dupes: Subordinate members of a dupe group whose members all came from the same list. These can be from lists with a Normal or Special list type.

Multiple List Dupes: Subordinate members of a dupe group whose members came from two or more lists. These can be from lists with a Normal or Special list type.

Uniques: Records that are not members of any dupe group. No matching records were found. These can be from lists with a Normal or Special list type.

Single List Masters: Highest ranking member of a dupe group whose members all came from the same list. Can be from Normal or Special type lists.

Multiple List Masters: Highest ranking member of a dupe group whose members came from more than one list. Can be from Normal or Special type lists.

Suppression List Uniques: Records that came from a Suppress-type list, and for which no matching records were found.

Suppression List Masters: A record that came from a Suppress-type list and that is the highest ranking member of a dupe group.

Suppression List Subord: A record that came from a Suppress-type list and that is a subordinate member of a dupe group.
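The category definitions can be read as a small decision procedure. The Python sketch below is one possible reading, not MCD's implementation. It assumes a dupe group is given as a list ordered by rank (master first), that a unique record is a group of one, that the type code "SUPP" marks a Suppress-type list as in the sample report, and that Suppress Dupes takes precedence over the single-list and multiple-list dupe categories. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rec:
    list_name: str
    list_type: str = "NORM"  # "SUPP" marks a Suppress-type list


def categorize(group, index):
    """Return the report category for group[index].

    group is ordered by rank, highest-ranking record (the master) first.
    """
    rec = group[index]
    from_suppress = rec.list_type == "SUPP"
    if len(group) == 1:
        return "Suppression List Uniques" if from_suppress else "Uniques"
    is_master = index == 0
    if from_suppress:
        return "Suppression List Masters" if is_master else "Suppression List Subord"
    multiple_lists = len({r.list_name for r in group}) > 1
    if is_master:
        return "Multiple List Masters" if multiple_lists else "Single List Masters"
    # Assumption: a higher-ranking suppression record takes precedence.
    if any(r.list_type == "SUPP" for r in group[:index]):
        return "Suppress Dupes"
    return "Multiple List Dupes" if multiple_lists else "Single List Dupes"


group = [Rec("no_mail", "SUPP"), Rec("house"), Rec("firms")]
print([categorize(group, i) for i in range(len(group))])
```

Each record falls into exactly one category, which is why the category columns of a row can be compared against that row's Net Input.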
The following sample shows the normal report setup, with rows sorted by list and no page sorting.

Output File Report, Detail Information: C:\pw\mpg\Work\output\Out_MPG.txt
Match/Consolidate x.xx    Page 1
tekpubs Firstlogic, Inc    Technical Publications Sample Report

Columns: List Name, Net Input, Filter Drops, Suppress Dupes, Single List Dupes, Multiple List Dupes, Uniques, Single List Masters, Multiple List Masters, Suppress List Uniques, Suppress List Masters, Suppress List Subord
Rows: house, firms, no_mail, select, update, Totals

Output File Type: Match/Consolidate Output File

122 Output File report options: Super List sort

Here are two more versions of the same output file reports, showing different report sorting options. This is how the report is organized when sorting rows by list and pages by Super List. The detail information is shown first, followed by the summary information.

Output File Report, Detail Information: C:\pw\mpg\Work\output\Out_MPG.txt
Match/Consolidate x.xx    Page 1
tekpubs Firstlogic, Inc    Technical Publications Sample Report
Output Results for no-fee
Columns: List Name, Net Input, Filter Drops, Suppress Dupes, Single List Dupes, Multiple List Dupes, Uniques, Single List Masters, Multiple List Masters, Suppress List Uniques, Suppress List Masters, Suppress List Subord
Rows: house, firms, Totals

Output File Report, Detail Information: C:\pw\mpg\Work\output\Out_MPG.txt
Match/Consolidate x.xx    Page 2
tekpubs Firstlogic, Inc    Technical Publications Sample Report
Output Results for contracted
Columns: same as on Page 1
Rows: no_mail, select, update, Totals
Output File Type: Match/Consolidate Output File

Output File Report, Summary Information: C:\pw\mpg\Work\output\Out_MPG.txt
Match/Consolidate x.xx    Page 1
tekpubs Firstlogic, Inc    Technical Publications Sample Report
Output Results for no-fee
Columns: List Name, Net Input, List_id, List Type, List Priority, Filter Drops, Pct of Net Input, Net Output, Pct of Net Input
house     1000   house     NORM
firms     1000   firms     NORM
Totals
Totals (Including Suppression Records and After Filter)

Output File Report, Summary Information: C:\pw\mpg\Work\output\Out_MPG.txt
Match/Consolidate x.xx    Page 2
tekpubs Firstlogic, Inc    Technical Publications Sample Report
Output Results for contracted
Columns: same as on Page 1
no_mail   100    no_mail   SUPP
select    250    select    NORM
update    25     update    NORM
Totals
Totals (Including Suppression Records and After Filter)
Output File Type: Match/Consolidate Output File

123 Output File report options: State sort

This is how the report looks when sorting rows by list and pages by State. Again, detail information first, then summary information.

Output File Report, Detail Information: C:\pw\mpg\Work\output\Out_MPG.txt
Match/Consolidate x.xx    Page 1
tekpubs Firstlogic, Inc    Technical Publications Sample Report
Output Results for California
Columns: List Name, Net Input, Filter Drops, Suppress Dupes, Single List Dupes, Multiple List Dupes, Uniques, Single List Masters, Multiple List Masters, Suppress List Uniques, Suppress List Masters, Suppress List Subord
Rows: house, firms, Totals

Output File Report, Detail Information: C:\pw\mpg\Work\output\Out_MPG.txt
Match/Consolidate x.xx    Page 2
tekpubs Firstlogic, Inc    Technical Publications Sample Report
Output Results for Colorado
Columns: same as on Page 1
Rows: house, firms, Totals

This report would include a detail page for each state included in the job. We've deleted the rest to save space here.

Output File Report, Summary Information: C:\pw\mpg\Work\output\Out_MPG.txt
Match/Consolidate x.xx    Page 1
tekpubs Firstlogic, Inc    Technical Publications Sample Report
Output Results for California
Columns: List Name, Net Input, List_id, List Type, List Priority, Filter Drops, Pct of Net Input, Net Output, Pct of Net Input
house     8      house     NORM
firms     8      firms     NORM
Totals
Totals (Including Suppression Records and After Filter)

Output File Report, Summary Information: C:\pw\mpg\Work\output\Out_MPG.txt
Match/Consolidate x.xx    Page 2
tekpubs Firstlogic, Inc    Technical Publications Sample Report
Output Results for Colorado
Columns: same as on Page 1
house     2      house     NORM
firms     2      firms     NORM
Totals
Totals (Including Suppression Records and After Filter)

This report would include a detail page for each state included in the job. We've deleted the rest to save space here.

124 Posted Dupe Groups Report (.pdg)

The Posted Dupe Groups report contains a table of statistics for each input file or output file for which a group posting operation has been included in the MCD job. Each table lists a row for each posting operation (that is, each Group Posting block) that was executed in this job, for this file. On each line, when you sum up completed operations and all drops, the result should equal the number of attempts.

Post Name is the name for the operation as you defined it in the Group Posting block of your MCD job setup.

Post Attempts is the number of records that were eligible for this group posting operation.

Purge Drops are operations that were canceled because the destination record was a member of a Suppress-type list. (For details, refer to the Group Post to Suppression Lists option or parameter.)

Destination Field Drops are operations that were canceled because the destination record did not contain the destination field. This value also includes operations that were canceled because the group posting operation was set to allow group posting only once per destination record.

Filter Drops are operations that were canceled because the Group Post Filter returned False.

Post Completes are group posting operations that were successfully completed.

Posted Dupe Groups Report    Match/Consolidate x.xx    Page 1
tekpubs Firstlogic, Inc    Technical Publications Sample Report
File : C:\pw\mpg\Work\output\Out_mpg.txt

Columns: Post Name, Post Attempts, Purge Drops, Dst Field Drops, Filter Drops, Post Completes
Rows: mpg
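The statement that completed operations plus all drops should equal the number of attempts gives a quick consistency check for each row. A minimal Python sketch follows; the function name and the row values are hypothetical, since the sample report's numbers are not reproduced here.

```python
def posting_row_balances(attempts, purge_drops, dst_field_drops,
                         filter_drops, completes):
    """True when completes plus every kind of drop equals Post Attempts."""
    return attempts == purge_drops + dst_field_drops + filter_drops + completes


# Hypothetical row values, for illustration only.
balanced = posting_row_balances(
    attempts=120, purge_drops=5, dst_field_drops=10, filter_drops=15, completes=90
)
print(balanced)
```

A row that fails this check suggests the figures were transcribed incorrectly or come from different runs.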

125 Purge by List Reports (.prl)

The Purge by List report can be produced in two versions, the detail version and the summary version. You can elect to print either version or both. The reports show, on a list-by-list basis, the numbers of records that were deleted from the job's input file(s), marked for deletion, or predicted for deletion, depending on your job setup. These numbers provide a clear picture of the results of an input file purge.

The report data is generated during the input purge process of your MCD job.

There are no options for the format of either report. The columns are as shown in our examples. If you perform a purge prediction (as in our examples), PREDICTION appears at the top of the report. The report shows totals both with and without suppression records, so you can see the separate values and compare them with your net input totals.

Summary version

The summary version of the Purge by List report shows a row for each list of your job, plus a totals row. Our sample below includes five lists. The data in the report columns is as follows.

List Name: The name of the list as defined in your MCD job setup.

Net Input: The number of records from this list that were processed as input to this job.

List_ID: The value that is used to identify members of this list.

List Type: The category of records in this list: Normal, Suppress, or Special.

List priority: The numerical value of the list match priority. The higher the number, the lower the priority.

Filter Drops: Records that were deleted because they passed a delete filter that was included in the Custom Purge Input File(s) block.

Pct of Net Input: The number of filter dropped records divided by net input, times 100.

Total Deletes: The number of records deleted (or predicted to be deleted) from the list.

Pct of Net Input: Percent of Net Input refers to the Total Deletes column. It is the number of deleted records, divided by net input, times 100.
Purge By List Report, Summary Information (PREDICTION)
Match/Consolidate x.xx    Page 1
tekpubs Firstlogic, Inc    Technical Publications Sample Report

Columns: List Name, Net Input, List_id, List Type, List Priority, Filter Drops, Pct of Net Input, Total Deletes, Pct of Net Input

house     1000   house     NORM
firms     1000   firms     NORM
no_mail   100    no_mail   SUPP
select    250    select    NORM
update    25     update    NORM
Totals

126 Detail version

The detail version of the Purge by List report shows a row for each list of your job, plus a totals row. Our sample below included five lists. List names are shown as they are defined in your job setup. The Net Input column shows the number of records from this list that were processed as input for this job. The Filter Drops column shows the number of records from this list that were deleted because they passed a delete filter that was included in the Custom Purge Input File(s) block.

The rest of the columns show the number of that list's records that fall into the following matching categories, in the order shown on the report:

Suppress Dupes: Subordinate member of a dupe group that includes a higher-priority record that came from a Suppress-type list. Can be from Normal or Special type lists. These records are deleted when a conventional purge is performed.

Single List Dupes: Subordinate members of a dupe group whose members all came from the same list. These can be from lists with a Normal or Special list type. These records are deleted when a conventional purge is performed.

Multiple List Dupes: Subordinate members of a dupe group whose members came from two or more lists. These can be from lists with a Normal or Special list type. These records are deleted when a conventional purge is performed.

Uniques: Records that are not members of any dupe group. No matching records were found. These can be from lists with a Normal or Special list type.

Single List Masters: Highest ranking member of a dupe group whose members all came from the same list. Can be from Normal or Special type lists.

Multiple List Masters: Highest ranking member of a dupe group whose members came from more than one list. Can be from Normal or Special type lists.

Suppression list Uniques: Records that came from a Suppress-type list, and for which no matching records were found.

Suppression list Masters: A record that came from a Suppress-type list and that is the highest ranking member of a dupe group.

Suppression list Subord: A record that came from a Suppress-type list and that is a subordinate member of a dupe group.

Entries in the Totals row are the sum of the values for that column.

Purge By List Report, Detail Information (PREDICTION)
Match/Consolidate x.xx    Page 1
tekpubs Firstlogic, Inc    Technical Publications Sample Report

Columns: List Name, Net Input, Filter Drops, Suppress Dupes, Single List Dupes, Multiple List Dupes, Uniques, Single List Masters, Multiple List Masters, Suppress List Uniques, Suppress List Masters, Suppress List Subord
Rows: house, firms, no_mail, select, update, Totals
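Three of the categories above are described as deleted when a conventional purge is performed. The Python sketch below sums those three columns for one detail row. It is an illustration only: the function name and row values are hypothetical, and whether the report's Total Deletes figure also includes Filter Drops is not stated in the text, so the sketch leaves Filter Drops out.

```python
# Categories the text marks as deleted in a conventional purge.
PURGED_CATEGORIES = ("Suppress Dupes", "Single List Dupes", "Multiple List Dupes")


def conventional_purge_deletes(detail_row):
    """Sum the dupe categories that a conventional purge deletes."""
    return sum(detail_row.get(category, 0) for category in PURGED_CATEGORIES)


# Hypothetical detail row for one list.
row = {
    "Suppress Dupes": 4,
    "Single List Dupes": 12,
    "Multiple List Dupes": 30,
    "Uniques": 900,
    "Single List Masters": 10,
    "Multiple List Masters": 44,
}
deletes = conventional_purge_deletes(row)
print(deletes)
```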

127 Sorted Records Report (.sor)

You can use the Sorted Records report to check your match criteria. One typical approach is to select the geographical sort version of this report, to place likely matches close together in the report. This report can show you if your matching criteria need to be adjusted. If you see records that appear to match, but which MCD has not identified as members of the same dupe group, then you may want to loosen your matching criteria so the Find Duplicates process will find such matches.

Match/Consolidate generates the report data during the Read Records and Create Match Sets process.

Option: Limit the size

You can limit the size of this report by setting a maximum number of records to print and by setting a starting record number. Limited data may be able to show you trends; however, when studying a report that you have limited, keep in mind that what you are looking for may not be there because of the limit that you have imposed.

Option: Select your field types

You can choose from three versions of this report, based on the field types that you want to see. The versions affect only what data appears on the report. Whichever of the three you use, the records will appear in the same order.

PW data: You can elect to show the records' PW data. The report will show a column for each PW field of your job. The first two examples on the following pages show PW versions of this report.

Key data: You can elect to show the records' key data. The report will show a column for each key field that you have set up in your job. The key version shows the exact data that was used for comparing the records. The key version of our example reports is shown after the PW versions that are shown on the following pages.

Custom: You can design your own Custom format. With the Custom format, you can select (from PW, database, and MCD application fields) the field data for each column of your report. You pick the fields, the order, and the heading for the columns.

You can also elect to show subordinate dupes and suppression records, or to not show them in the report.

128 Option: Select the sortation of the records

You can also choose how the records of this report are sorted. There are eight established sorts, as explained in the chart that follows, plus the Custom option that enables you to establish your own multi-layer sortation. To select a sortation, select from among the Sort Options of the Report Options for the Sorted Records report block or Views window. If you want to use a Custom sort, name the sort routine in the Sorted Records block, and define that sortation in a Custom Output Sorting block or Views window.

File (FILE): Sort records in the same order they appear in the input file(s). This is the fastest option.

Random order (RANDOM): Sort randomly. Use for abbreviated jobs, such as testing output.

Match group (DUP): Sort by match group. This makes it easier to relate members of the same match group.

MCD field (MP1): Sort by a field that you choose, such as an account number field or affluence rating. You define the MCD field in your DEF file(s).

Geographically (GEO): Sort by ZIP Code, state, city, street name, street range, and so on.

Priority field (LB_PRIOR): Sort based on the total of list match priority plus blank-field priority. For more information about priority, see Prioritize and suppress records on page 47.

List count (LIST_CNT): Sort based on how many lists the record belongs to. Use this option to sort multi-buyer lists. For more information, refer to Create a multi-buyer file on page 74.

Dupe group size (GROUP_CNT): Sort based on how many records are in this record's match group. Use this option to sort multi-occurrence lists.

Custom (CUSTOM): Sort based on your own layered sortation.

If your job includes lists, then list data is shown in the first two columns of the report. Refer to the code definitions at the bottom of the report. We show the names of the lists as defined in the job setup at the bottom of the report. Showing these at the bottom saves some valuable space; look for the corresponding list number in the second column of the report.

Some examples of the Sorted Records report are shown on the following pages. The first example (on the following page) shows the top of the Sorted Records report for our sample job. This report displays PW data and is sorted geographically, which is the default sortation routine. The report includes subordinate dupes and suppression records.
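A layered sort such as the geographic one is a sort on several fields in order of significance. The Python sketch below shows the idea with a tuple sort key. It is not how MCD implements the GEO sort; the field names and the sample records are hypothetical.

```python
def geo_sort_key(record):
    """Layered key: ZIP Code, then state, city, street name, street range."""
    return (
        record["zip"],
        record["state"],
        record["city"],
        record["street_name"],
        record["street_range"],
    )


# Hypothetical records; field names and values are illustrative only.
records = [
    {"name": "C", "zip": "20002", "state": "ST", "city": "TOWN B",
     "street_name": "MAIN", "street_range": 10},
    {"name": "A", "zip": "10001", "state": "ST", "city": "TOWN A",
     "street_name": "STATE", "street_range": 1295},
    {"name": "B", "zip": "10001", "state": "ST", "city": "TOWN A",
     "street_name": "SOUTH", "street_range": 136},
]

sorted_names = [r["name"] for r in sorted(records, key=geo_sort_key)]
print(sorted_names)
```

Records that share a ZIP Code and street end up next to each other, which is what makes likely matches easy to spot in the report.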

Sorted Records Report    Match/Consolidate x.xx    Page 1
tekpubs    Firstlogic, Inc    Technical Publications    Sample Report

Code Dup Group List File Record LIST_ID NAME_LINE ADDRESS CITY ST ZIP FIRM
M house H. V. JACOBSEN P.O. BOX C SANTA ANA CA
M house HAROLD JACOBSEN P O BOX C SANTA ANA CA
M firms H. V. JACOBSEN P.O. BOX C SANTA ANA CA PANEL CONCEPTS
M firms H V JACOBSEN P O BOX C SANTA ANA CA PANEL CONCEPTS
M house GERALD KRYWICKI PO BOX NO 2978 SPRINGFIELD MA
M firms GERALD KRYWICKI PO BOX NO 2978 SPRINGFIELD MA HEATBATH CORP.
M house PHIL AYERS PALMER AVE SPRINGFIELD MA
M firms PHIL AYERS PALMER AVE SPRINGFIELD MA LONGVIEW FIBRE CO
M house SUSAN J. COELHO 136 SOUTH STREET SPRINGFIELD MA
M firms SUSAN J. COELHO 136 SOUTH STREET SPRINGFIELD MA BERKSHIRE MINIATURES
M house MS MARGARET NUMMY 1295 STATE ST DEPT F029 SPRINGFIELD MA
M firms MS MARGARET NUMMY 1295 STATE ST DEPT F029 SPRINGFIELD MA MASS MUTUAL LIFE INS
M house BURVIN E PUGH 1295 STATE ST D 21 SPRINGFIELD MA
M firms BURVIN E PUGH 1295 STATE ST D 21 SPRINGFIELD MA MASS MUTUAL
M house PETER VOGIAN 1295 STATE STREET SPRINGFIELD MA
M firms PETER VOGIAN 1295 STATE STREET SPRINGFIELD MA MASS MUTUAL LIFE INS CO
M house ANGEL J RODRIGUEZ URB TERRAZAS DE GUAYNABO GUAYNABO PR
M firms ANGEL J RODRIGUEZ URB TERRAZAS DE GUAYNABO GUAYNABO PR
M house GRDN HLS PLZ/1353 CARR GUAYNABO PR
M firms GRDN HLS PLZ/1353 CARR GUAYNABO PR EXECUTIVE DYNAMICS
M house JOSE FLASINI VILLA BOX 1024 PONCE PR
M firms JOSE FLASINI VILLA BOX 1024 PONCE PR PONCE FEDERAL BANK
M house ISRAEL HILERIO HC 01 BOX 5106 AGUADILLA PR
M firms ISRAEL HILERIO HC 01 BOX 5106 AGUADILLA PR CERVECERIA INDIA INC
M house ANGEL F CORDERO SOTO HC 02 BOX 7270 CAMUY PR
M firms ANGEL F CORDERO SOTO HC 02 BOX 7270 CAMUY PR TROPICAL PET CENTER
M house WALDO SANCHEZ ROUTE 698 #12 DORADO PR
M firms WALDO SANCHEZ ROUTE 698 #12 DORADO PR CPI CARDIAC PACE MAKER INC
M house SAMUEL GONZALEZ DORADO DEL MAR DORADO PR
M firms SAMUEL GONZALEZ DORADO DEL MAR DORADO PR UNILENES DE PR INC

This example shows the top of the Sorted Records report for the same sample job, again with PW data but with File sortation selected. This report also includes subordinate dupes and suppression records.

Sorted Records Report    Match/Consolidate x.xx    Page 1
tekpubs    Firstlogic, Inc    Technical Publications    Sample Report

Code Dup Group List File Record LIST_ID NAME_LINE ADDRESS CITY ST ZIP FIRM
P house JOHN CASILLO 12 SAINT MARK ST AUBURN MA
M house ROBERT BRADLEY 61 SUMMIT AVE SOUTH ADAMS MA
M house JOSEPHINE LAMER 1414 MASSACHUSETTS AVE BOXBORO MA
M house MR BILL HANDRICH PO BOX 220 HATFIELD MA
M house MR GREG HAMMOND 106 LOWLAND ST HOLLISTON MA
M house MARY PETERS 165 FRONT ST CHICOPEE MA
M house HECTOR R RODRIGUEZ AVE DEGETAU A-7 SAN ALFONSO CAGUAS PR
M house CONSTANSA F FOSTER PO BOX 169 COLLEGE POINT NY
M house TIM GLAZE 358 BAKER AVE CONCORD MA
P house CLAIRE MONAHAN 50 OTIS ST WESTBOROUGH MA
M house ROBERT FINE PO BOX 6146 TRENTON NJ
M house S DONGELO 151 RADDIN RD GROTON MA
M house MR MOE L CURLY 372 PASCO RD SPRINGFIELD MA
M house LANCE R DUNHAM 232 TAYLOR ST LITTLETON MA
M house MR PETER BEYETTE 1459 NIAGARA FALLS BLVD BUFFALO NY
M house JAY SPUTNIK 580 MAIN ST BOLTON MA
M house JAN PAINTER 537 GREAT RD LITTLETON MA
P house BERNIE VITTI 59 ROUTE 10 EAST HANOVER NJ
M house LUIS PABON CALL BOX SAN JUAN PR
M house KAREN MCFADDEN 17 WALDRON AVE GLEN ROCK NJ
M house MAUREEN DABERNARDI 23 BRADFORD ST CONCORD MA
M house JEANNE WEINTRAUB 200 STAGE RD SOUTH DEERFIELD MA
M house MR BRADFORD W PHOENIX BOX HOLYOKE MA
P house MS SUZANNE MC KIERNAN 100 SOUTH ST WORCESTER MA
M house AL DIGREGORIONSON 154 CENTRAL ST SOUTHBRIDGE MA
M house DENNIS R MILLS 461 TONAWANDA ST BUFFALO NY
M house MS MARILYN GERMAINE 1170 HIGHWAY 36 HAZLET NJ
P house PATRICIA TOBIN 911 CENTRAL AVE ALBANY NY
M house TIM HAMMOND 121 LYMAN ST SPRINGFIELD MA
M house MR PHIL GORSKI 18 SAYBROOK RD FRAMINGHAM MA
M house FRED HOERNER 511 FARBER LAKES DR BUFFALO NY
M house KENNETH WITTIS PO BOX 336 HOLDEN MA
M house MS MARY T MC NAMARA 495 LINDEN ST BOYLSTON MA
M house MR ED SUGGSTON 317 E MOUNTAIN ST WORCESTER MA
M house ROSE A BIBEAU 52 RACETTE AVE GARDNER MA

This example shows the top of the Sorted Records report for the same sample job, but with Key data, rather than PW data. The default geographical sortation was selected. This report also includes subordinate dupes and suppression records.

In the Key data version, the heading over each of the data columns is the name of a key field set up in the Match Criteria block. The width of each column depends on the length of the key field. Field names are truncated to fit the column width.

Sorted Records Report    Match/Consolidate x.xx    Page 1
tekpubs    Firstlogic, Inc    Technical Publications    Sample Report

Code Dup Group List File Record Last Range Pr St Name Suff Po Unit Box RR RR Box St ZIP
M JACOBSEN C-2910 CA
M JACOBSEN C-2910 CA
M JACOBSEN C-2910 CA
M JACOBSEN C-2910 CA
M KRYWICKI NO MA
M KRYWICKI NO MA
M AYERS PALMER AVE MA
M AYERS PALMER AVE MA
M COELHO 136 SOUTH ST MA
M COELHO 136 SOUTH ST MA
M NUMMY 1295 STATE ST F029 MA
M NUMMY 1295 STATE ST F029 MA
M PUGH 1295 STATE ST D21 MA
M PUGH 1295 STATE ST D21 MA
M VOGIAN 1295 STATE ST MA
M VOGIAN 1295 STATE ST MA
M RODRIGUEZ PR
M RODRIGUEZ PR
M GRDN HLS PLZ 1353CA PR
M GRDN HLS PLZ 1353CA PR
M FLASINI 1024 VILLA BOX PR
M FLASINI 1024 VILLA BOX PR
M HILERIO PR
M HILERIO PR
M SOTO PR
M SOTO PR
M SANCHEZ PR
M SANCHEZ PR
M GONZALEZ PR
M GONZALEZ PR

This example shows the top of the Sorted Records report for the same sample job, again with Key data but with the dupe group (DUP) sortation selected.

Like the others shown in this section, this report includes subordinate dupes and suppression records.

Sorted Records Report    Match/Consolidate x.xx    Page 1
tekpubs    Firstlogic, Inc    Technical Publications    Sample Report

Code Dup Group List File Record Last Range Pr St Name Suff Po Unit Box RR RR Box St ZIP
CURLY 1760 N KENNEDY BLVD MA
BEYETTE 310 HOLLYWOOD DR NY
GORSKI 1389 SAYBROOK RD MA
LOMBARDI 455 OAK DR MA
KELLY 5553 OLD WINDING WAY NY
BUBELLO 435 CONNECTICUT AVE MA
BERGERON 874 WASHINGTON ST MA
PORTA 943 W CUTLER ST NY
MCGOURTY 325 POPPY RD MA
YEREMIAN 428 JEAN HARLOW ST MA
MUELLER 267 N PARK DR MA
TERRELS 3253 DIVISION ST NJ
JAY 333 NEW YORK DR MA
PETERSON 842 POLYANNA DR MA
SANTUCCI 392 BOXER RD 410 NY
SHIPPER 335 ROYAL COACH RD NJ
CUTLER 2566 HAPHAZARD RD 9313 MA
KLINE 7743 LINCOLN AVE NJ
ROBBINS 3232 S LEANDER ST MA
MALO 5320 GEORGE ST MA
ROWELL 4436 MACHINE ROW MA
BROWN 887 HOOCH RD MA
LINEHAN 444 PILLSBURY ST MA
RICHARDS 344 LOWLAND ST MA
HOLCOMB 88 W FOREST AVE NJ
M JACOBSEN C-2910 CA
M JACOBSEN C-2910 CA
M JACOBSEN C-2910 CA
M JACOBSEN C-2910 CA
M KRYWICKI NO MA
M KRYWICKI NO MA
M RODRIGUEZ PR
M RODRIGUEZ PR
M GRDN HLS PLZ 1353CA PR
M GRDN HLS PLZ 1353CA PR
M AYERS PALMER AVE MA
M AYERS PALMER AVE MA
M COELHO 136 SOUTH ST MA
M COELHO 136 SOUTH ST MA
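The sort options used in the Sorted Records examples can be pictured as ordinary sort keys. The sketch below is only an illustration in Python, not how MCD implements its sorts: the record dictionaries, their keys, and the function names are all hypothetical, and both the placement of unique records first (as in the DUP example above) and the descending order for group size are assumptions.

```python
from collections import Counter

# Hypothetical records; these keys are illustrative, not MCD field names.
RECORDS = [
    {"name": "AYERS", "state": "MA", "dup_group": 2},
    {"name": "CURLY", "state": "MA", "dup_group": None},  # unique record
    {"name": "JACOBSEN", "state": "CA", "dup_group": 1},
    {"name": "AYERS", "state": "MA", "dup_group": 2},
]


def sort_by_match_group(records):
    """DUP-style sort: unique records first, then each match group together."""
    return sorted(
        records,
        key=lambda r: (r["dup_group"] is not None, r["dup_group"] or 0),
    )


def sort_by_group_size(records):
    """GROUP_CNT-style sort: largest match groups first (direction assumed)."""
    sizes = Counter(r["dup_group"] for r in records if r["dup_group"] is not None)
    return sorted(records, key=lambda r: sizes.get(r["dup_group"], 1), reverse=True)


def sort_custom(records, layers):
    """Custom-style sort: compare on each named field in turn."""
    return sorted(records, key=lambda r: tuple(r[field] for field in layers))
```

A layered sort such as sort_custom(RECORDS, ["state", "name"]) orders by state first and by name within each state, which is the idea behind a multi-layer Custom sortation.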

Unparsed Records Report (.unp)

This report lists the records that contain address data that MCD could not parse. This list is especially important, because unparsed data can negatively affect the matching process. You can use this report to show clients deficiencies in their record data.

Data for this report is generated during the input phase of your MCD job, when you have included the Read Records and Create Match Sets execution option. Records are listed in the order in which MCD attempted to read them. If a record does not have data for a field/column, that space will be blank on the report.

Options

You can limit the size of this report by setting a maximum number of records to print and by setting a starting record number. You can also use a record filter to limit records. For example, if you do not intend to perform any corrective actions for Firm data not parsed, you may want to filter out any records which show only that error code. For more information, refer to the output fields section of the Quick Reference.

You can choose from three versions of this report, based on the field types that you want to print.

You can elect to show the records' PW data. The report will show a column for each PW field of your job. The example that follows is the PW version of this report.

You can elect to show the records' key data. The report will show a column for each key field that you have set up in your job. The key version shows the exact data that was used for comparing the records.

You can design your own Custom format. With the Custom format, you can select (from PW, database, and MCD application fields) the field data for each column of your report. You pick the fields, the order, and the heading for the column.

If your job includes lists, then list data is shown in the first two columns of the report. Refer to the code definitions at the bottom of the report. The names of the lists, as defined in the job setup, are also shown at the bottom of the report. Showing them there saves valuable space; look for the corresponding list number in the second column of the report.

The example that follows shows the top of the Unparsed Records report for our sample job. This is the PW version, so the PW-defined fields are shown across the top of the report.

Unparsed Records Report    Match/Consolidate x.xx    Page 1
tekpubs    Firstlogic, Inc    Technical Publications    Sample Report

Code List File Record LIST_ID NAME_LINE ADDRESS CITY ST ZIP FIRM
E house HECTOR R RODRIGUEZ AVE DEGETAU A-7 SAN ALFONSO CAGUAS PR
E house 647 CHANDLER ST WORCESTER MA
E house 324 GROVE ST WORCESTER MA
E house GRDN HLS PLZ/1353 CARR GUAYNABO PR
E house VINCENT RAVIXICO VISTA FOOD EXCHANGE INC BRONX NY
E house MS LISA A PYENSON DRAKE INGLESI MILARDO WESTBOROUGH MA
E house 405 BOSTER TURNPIKE RTE 9 SHREWSBURY MA
E house 46 SUFFIELD ST AGAWAM MA
E house BRIAN A MACDONALD CENTRAL WHARF BOSTON MA
E house MS MELISSA BAILEY ABATE MAIL CODE D 19 WORCESTER MA
E house MS JANE M MC AULIFFE QUALITY MANAGMNT DPT WORCESTER MA
E house ANGEL J RODRIGUEZ URB TERRAZAS DE GUAYNABO GUAYNABO PR
E house MR PETER F STANSKY R. F. D. 146A UXBRIDGE MA
E house 588 EMBARCADERO SAN FRANCISCO CA
E house MIGUEL A PEREZ ST 19B-20L-15 SABANA GARDE CAROLINA PR
E house SAMUEL GONZALEZ DORADO DEL MAR DORADO PR
E house 440 LINCOLN ST # TV2 WORCESTER MA
E house MR HARRY LEVINE NATNL SLS TRNNG MGR FRAMINGHAM MA
E house SALEM SQUARE WORCESTER MA
E house 29TH SUMMIT SIOUX FALLS SD
E house 324 GROVE ST WORCESTER MA
E house 3 E MOUNTAIN ST WORCESTER MA
E house MR BRIAN A MACDONALD CENTRAL WHARF BOSTON MA
E house MR VINNY RAVIXICO VISTA FOOD EXCHANGE INC BRONX NY
E house MR. JOE PETTIER P.O. BOX MASTERMAN'S AUBURN MA
E house LEE MORAN AVE LOMAS VERDES ESQ LAS AMERI RIO PIEDRAS PR
E house 100 CROSBY DR BEDFORD MA
E house 3M COMPANY SAINT PHIL MN
E firms HECTOR R RODRIGUEZ AVE DEGETAU A-7 SAN ALFONSO CAGUAS PR IMPRESOS ALFA
E firms 647 CHANDLER ST WORCESTER MA TATNUCK BOOKSELLER & SON
E firms 324 GROVE ST WORCESTER MA HOUSEHOLD FINANCE CORP
E firms GRDN HLS PLZ/1353 CARR GUAYNABO PR EXECUTIVE DYNAMICS
E firms VINCENT RAVIXICO VISTA FOOD EXCHANGE INC BRONX NY VISTA FOOD EXCHANGE INC
E firms MS LISA A PYENSON DRAKE INGLESI MILARDO WESTBOROUGH MA DRAKE INGLESI MILARDO
E firms 405 BOSTER TURNPIKE RTE 9 SHREWSBURY MA FAIR LANES BOWLING CENTERS

Job statistics file

The job statistics file contains a single record, with 60 fields (61 if ASCII file type; see the last field). The statistics in this file represent almost all of the significant aspects of this MCD job. The record's length will be 641 to 643 bytes (refer to the last field length).

Field name    Length  Description
ms_name       20      Holds the match set name
job_descr     80      Job description
job_owner     20      Job owner
date          11      Date
elapsed_t     8       Total elapsed time of job
num_infl      3       Number of input files
num_refs      3       Number of reference files
num_lists     3       Number of input lists
num_supprl    3       Number of suppression lists
num_suppr     10      Number of suppression records
gross_in      10      Gross input
del_rec       10      Delete drops
filt_recs     10      Filter drops
ign_recs      10      List drops
samp_drops    10      Sample drops
net_in        10      Net input
largest_bg    10      Largest break group (number of records)
total_comp    10      Total comparisons performed
blank_name    10      Records with blank name data
blank_firm    10      Records with blank firm data
blank_addr    10      Records with blank address data
bad_addr      10      Records with bad address data
blnk_lline    10      Records with blank lastline
bad_lline     10      Records with bad lastline
for_lline     10      Records with foreign lastline
bad_parse     10      Total unparsed records
gen_unasgn    10      Records with unassigned gender
gen_s_male    10      Records with strong male gender
gen_w_male    10      Records with weak male gender
gen_ambig     10      Records with ambiguous gender
gen_w_fem     10      Records with weak female gender
gen_s_fem     10      Records with strong female gender
gen_mn_mix    10      Multi-name records with mixed gender
gen_mn_mle    10      Multi-name records with male gender
gen_mn_fem    10      Multi-name records with female gender
gen_mn_amb    10      Multi-name records of ambiguous gender
bus_recs      10      Business (company) records
res_recs      10      Residence records
num_names1    10      Records with 1 name
num_names2    10      Records with 2 names
num_names3    10      Records with 3 names
num_names4    10      Records with 4 names
num_names5    10      Records with 5 names
num_names6    10      Records with 6 names
suppr_dups    10      Suppressed duplicates
singl_dups    10      Single list duplicates
multl_dups    10      Multiple list duplicates
supprl_sub    10      Suppression list subordinates
total_dups    10      Total duplicates
num_uniq      10      Unique records
singl_mas     10      Single list masters
multl_mas     10      Multiple list masters
supprl_uni    10      Suppression list uniques
supprl_mas    10      Suppression list masters
tot_nondup    10      Total non duplicates
ipost_recs    10      Total input records posted
ipurge_recs   10      Total input records purged (or predicted)
num_outfl     10      Number of output files
trecs_out     10      Total records output
gpost_recs    10      Group posted records
eor           1 or 2  (Present only if ASCII file type is used) End of record; 2 bytes (CR LF) on Windows, 1 byte (LF) on UNIX

Input statistics file

This type of statistics file contains 10 fields for each record (11 if ASCII file type; see the last field). The file contains one record for each input file included in this MCD job. The total length of each record will be 124 to 126 bytes (see the last field length).

Field name    Length  Description
ms_name       20      Holds the match set name
ms_results    3       Holds the match set match results
if_name       32      File name
ref_name      32      Reference file name
gross_in      10      Gross input
del_drops     10      Delete drops
filt_drops    10      Filter drops
list_drops    10      List drops
samp_drops    10      Sample drops
net_in        10      Net input
eor           1 or 2  (Present only if ASCII file type is used) End of record; 2 bytes (CR LF) on Windows, 1 byte (LF) on UNIX
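As an illustration of how a fixed-width statistics record like this might be read, here is a minimal Python sketch. It assumes that the fields sit back to back in the order and with the lengths listed in the table above, and that the file is the ASCII type with one record per line; those assumptions, and the names INPUT_STATS_LAYOUT, parse_record, and parse_text, are ours rather than part of MCD. Verify the lengths against your own file's format definition before relying on the offsets.

```python
# (field name, length in bytes), copied from the input statistics table.
INPUT_STATS_LAYOUT = [
    ("ms_name", 20), ("ms_results", 3), ("if_name", 32), ("ref_name", 32),
    ("gross_in", 10), ("del_drops", 10), ("filt_drops", 10),
    ("list_drops", 10), ("samp_drops", 10), ("net_in", 10),
]


def parse_record(line, layout=INPUT_STATS_LAYOUT):
    """Slice one fixed-width record into a dict of whitespace-trimmed strings."""
    record, offset = {}, 0
    for name, length in layout:
        record[name] = line[offset:offset + length].strip()
        offset += length
    return record


def parse_text(text, layout=INPUT_STATS_LAYOUT):
    """Parse every non-blank line of an ASCII statistics file.

    str.splitlines() accepts both the CR LF (Windows) and LF (UNIX)
    end-of-record markers, so the eor field needs no special handling.
    """
    return [parse_record(line, layout) for line in text.splitlines() if line.strip()]
```

Because the layout is passed in as data, the same two functions could be reused for the other statistics files by supplying the field list from the relevant table.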

List match statistics file

This type of statistics file will contain one record for each list in the job. The file has a minimum of 11 fields (12 if ASCII type; see the last field). The actual number of fields depends on the number of lists in your job. If there are no lists in your job, only the list1 fields are shown. Each additional list beyond the first adds another record to the file and an additional 10-byte field to each record. Depending on the file type and number of lists in the job, the length of each record can be from 87 to 2631 bytes (see the last field length).

The list_tag identifies each list by a sequence number, making the statistics file grid easier to read when many lists have been processed.

The following table shows the format of a list match statistics file for a job that included three lists. The list1, list2, and list3 fields are the ones that vary (up to listn), depending upon the number of lists in the job.

Field name    Length  Description
ms_name       20      Holds the match set name
ms_level      20      Holds the match set level name
ms_results    3       Holds the match set match results
list_name     20      List name
list_id       20      List ID
list_type     8       List type
list_pri      3       List match priority
asuper_name   20      Super list name defined in the job
super_name    20      Super list name
net_in        10      Net input
list_tag      10      List number tag
list1         10      Matches with List1 records
list2         10      Matches with List2 records
list3         10      Matches with List3 records
eor           1 or 2  (Present only if ASCII file type is used) End of record; 2 bytes (CR LF) on Windows, 1 byte (LF) on UNIX

List statistics file

This type of statistics file contains 28 fields in each record (29 if ASCII file type; see the last field). The file contains one record for each list of the job. The total length of each record is bytes (see the last field length).

Field name    Length  Description
ms_name       20      Holds the match set name
ms_level      20      Holds the match set level name
ms_results    3       Holds the match set match results
list_name     20      List name
list_id       20      List ID
list_type     8       List type
list_pri      3       List match priority
asuper_name   20      Super list name defined in the job
super_name    20      Super list name
num_mtchid    10      Matched ID records
num_defaul    10      Default records
num_nopars    10      No parse
num_noaddr    10      No address
num_nofirm    10      No firm
num_notitl    10      No title
num_nolnam    10      No last name
num_nofnam    10      No first name
suppr_dups    10      Suppressed duplicates
singl_dups    10      Single list duplicates
multl_dups    10      Multiple list duplicates
tot_dups      10      Total duplicates
num_uniq      10      Uniques
singl_mas     10      Single list masters
multl_mas     10      Multiple list masters
supprl_uni    10      Suppression list uniques
supprl_mas    10      Suppression list masters
supprl_sub    10      Suppression list subordinates
num_nondup    10      Total non dupe records
net_in        10      Net input
eor           1 or 2  (Present only if ASCII file type is used) End of record; 2 bytes (CR LF) on Windows, 1 byte (LF) on UNIX

Output statistics file

This type of statistics file has 23 fields in each record (24 if ASCII file type; see the last field). The file contains one record per list per output file. The total length of each record will be bytes (see the last field length).

To produce this file, it's necessary to generate an Output File report, in addition to setting up this statistics file in your MCD job setup.

Field name    Length  Description
ms_name       20      Holds the match set name
ms_level      20      Holds the match set level name
ms_results    3       Holds the match set match results
of_num        10      Output file number
of_name       32      Output file name
of_type       30      Output file type
list_name     20      List name
list_id       20      List ID
list_type     8       List type
list_pri      3       List match priority
asuper_name   20      Super list name defined in the job
super_name    20      Super list name
net_in        10      Net input
suppr_dups    10      Suppressed duplicates
singl_dups    10      Single list duplicates
multl_dups    10      Multiple list duplicates
num_uniq      10      Uniques
singl_mas     10      Single list masters
multl_mas     10      Multiple list masters
supprl_uni    10      Suppression list uniques
supprl_mas    10      Suppression list masters
supprl_sub    10      Suppression list subordinates
filt_drops    10      Filter drops
net_out       10      Net output
eor           1 or 2  (Present only if ASCII file type is used) End of record; 2 bytes (CR LF) on Windows, 1 byte (LF) on UNIX

Purge statistics file

The number of fields in this type of statistics file is 20 (18 if ASCII file type; see the last field). The total length of each record is characters (refer to the last field length). The file contains a record for each list of the job.

Field name    Length  Description
ms_name       20      Holds the match set name
ms_level      20      Holds the match set level name
ms_results    3       Holds the match set match results
list_name     20      List name
list_id       20      List ID
list_type     8       List type
list_pri      3       List match priority
asuper_name   20      Super list name defined in the job
super_name    20      Super list name
net_in        10      Net input
suppr_dups    10      Suppressed duplicates
singl_dups    10      Single list duplicates
multl_dups    10      Multiple list duplicates
num_uniq      10      Uniques
singl_mas     10      Single list masters
multl_mas     10      Multiple list masters
supprl_uni    10      Suppression list uniques
supprl_mas    10      Suppression list masters
supprl_sub    10      Suppression list subordinates
filt_drops    10      Filter drops
total_del     10      Total deletes
eor           1 or 2  (Present only if ASCII file type is used) End of record; 2 bytes (CR LF) on Windows, 1 byte (LF) on UNIX

Super list match statistics file

This type of statistics file will contain one record for each super list in the job. The file has a minimum of three fields (four if ASCII file type; see the last field). The actual number of fields depends on the number of super lists in your job. If there are no super lists in your job, only the super1 fields are shown. Each additional super list beyond the first adds a record to the file and an additional 10-byte field to each record. Depending on the file type and number of super lists in the job, the length of each record can be from 87 to 2580 bytes (see the last field length).

The list_tag identifies each super list by a sequence number, making the statistics file grid easier to read when many super lists have been processed.

The following table shows the format of a super list match statistics file for a job that included three super lists. The asuper1 through super3 fields are the ones that vary (up to supern), depending upon the number of super lists in the job.

Field name    Length  Description
ms_name       20      Holds the match set name
ms_level      20      Holds the match set level name
ms_results    3       Holds the match set match results
asuper_name   20      Super list name defined in the job
super_name    20      Super list name
list_tag      10      Super list number tag
asuper1       10      Matches with asuper1 records
super1        10      Matches with super1 records
asuper2       10      Matches with asuper2 records
super2        10      Matches with super2 records
asuper3       10      Matches with asuper3 records
super3        10      Matches with super3 records
eor           1 or 2  (Present only if ASCII file type is used) End of record; 2 bytes (CR LF) on Windows, 1 byte (LF) on UNIX

Multi-buyer statistics file

The multi-buyer statistics file provides a multiple-buyer count, which is a count of normal lists (as opposed to suppress or special lists). To determine the multi-buyer count of a match group, count the number of normal lists represented in the match group; ignore all records from the suppress lists and special lists. If a suppress record appears in the match group before the first normal record, MCD sets the multi-buyer count to zero.

The following table shows the format of the multi-buyer statistics file.

Field name    Length  Description
ms_name       20      Holds the match set name
ms_level      20      Holds the match set level name
ms_results    3       Holds the match set match results
list_name     20      List name
list_id       20      List ID
list_type     8       List type
list_pri      3       List priority
asuper_name   20      Super list name defined in the job
super_name    20      Super list name
mbuy_num      10      Total number of multi-buyers
mbuy_                 buyer multi-buyers
mbuy_                 buyer multi-buyers
mbuy_                 buyer multi-buyers
mbuy_                 buyer multi-buyers
mbuy_                 buyer multi-buyers
mbuy_                 buyer multi-buyers
mbuy_                 buyer multi-buyers
mbuy_                 buyer multi-buyers
mbuy_                 buyer multi-buyers
eor           1 or 2  (Present only if ASCII file type is used) End of record; 2 bytes (CR LF) on Windows, 1 byte (LF) on UNIX
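The counting rule described for the multi-buyer count can be sketched in a few lines of Python. This is only one reading of the rule, not MCD's own code: the function name multi_buyer_count, the representation of a match group as (list name, list type) pairs in match-group order, and the type labels "normal", "suppress", and "special" are all assumptions made for the illustration.

```python
def multi_buyer_count(group):
    """Return the multi-buyer count for one match group.

    group: sequence of (list_name, list_type) pairs, in the order the
    records appear in the match group. list_type is "normal", "suppress",
    or "special" (labels assumed for this sketch).
    """
    normal_lists = set()
    for list_name, list_type in group:
        if list_type == "suppress" and not normal_lists:
            # A suppress record precedes the first normal record.
            return 0
        if list_type == "normal":
            # Count each normal list once, however many records it contributes.
            normal_lists.add(list_name)
        # Suppress records after a normal record, and special records, are ignored.
    return len(normal_lists)
```

For example, a group holding two records from list A and one from list B, all normal, would have a count of 2, while the same group headed by a suppress record would have a count of 0.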

List subordinates statistics file

The list subordinates statistics file provides a count of subordinate records between lists. The file will contain one record per list used. There will be 12 to 2011 fields in each file; the number of fields listed in the FMT file varies due to the ASCII new line character (the field name for the character is eor) and the number of lists. The list_tag fields will identify each list, making the statistics file grid easier to read when MCD processes many lists.

The following table shows the format of the list subordinates statistics file.

Field name    Length  Description
ms_name       20      Holds the match set name
ms_level      20      Holds the match set level name
ms_results    3       Holds the match set match results
list_name     20      List name
list_id       20      List ID
list_type     8       List type
list_pri      3       List priority
asuper_name   20      Super list name defined in the job
super_name    20      Super list name
net_in        10      Net input
list_tag      10      List number tag
master_rec    10      Number of master records
List1         10      List 1 subordinates of this list
List2         10      List 2 subordinates of this list
ListN         10      List N subordinates of this list
eor           1 or 2  (Present only if ASCII file type is used) End of record; 2 bytes (CR LF) on Windows, 1 byte (LF) on UNIX

Chapter 7: Use group posting to consolidate data

This chapter explains how you can use Match/Consolidate (MCD) group posting functions to salvage data from matching records (that is, members of dupe groups) and post that data to a best record, or to all matching records. This is a key component in most data consolidation efforts.

The basics of group posting

Terms

Term: Consolidation (group posting)
Description: Consolidation (or group posting) means copying or accumulating data from one matched record to another. Sometimes, this means taking data from matched records to form a single best record. Some users use group posting to migrate information from one record to another. This process occurs after MCD identifies records as members of match groups.

Performing group posting

Product: MCD Job
Performing group posting: Set up a Group Posting block for each posting operation you want included in this job. Include copy instructions, source/destination data, and any filters you want to control the posting operation. A group posting operation can include more than one data posting action (copy line).
If group posting with an input file purge, in the Execution block, set Group Post to Purged Files to Y (Yes).
If group posting to an output file, at each output file block, set the Group Posting parameter to All so that all group posting blocks post to the output file. Or, set the Group Posting parameter to Select and use the Select Group Posting parameter to have only specific group posting blocks post to the output file.

Product: MCD Views
Performing group posting: Set up a Group Posting window for each posting operation you want included in this job. Include copy instructions, source/destination data, and any filters you want to control the posting operation. A group posting operation can include more than one data posting action (copy line).
If group posting with an input file purge, select Group Post to Purged File(s) from the Input File Options of the Execution Options screen.
If group posting to an output file, select the Group Posting option at the output file screen.

Introduction to group posting

Group posting is a way to save data from matching records before discarding them, or a way to post data from one record to all matching records.

Group posting happens within match groups

Group posting enables you to update information in record fields based on their membership in a match group, and based on the priority and completeness of the records within that match group. We call it group posting because the data posting occurs within match groups. There are two common group posting operations:

You can use group posting to salvage useful data from duplicate records before discarding them. For example, when running a driver-license file against your house file, you might pick up gender or date-of-birth data to add to your house record.

You can post updated data (for example, the most recent phone number) to all of the records in a match group.

Start with our template

There are more examples that you can use as a starting point. For MCD Job, refer to the resource file group.mpg (look in the template subdirectory).

Posting is done with input purging or output files

In addition to designing the group posting operation, you must specify whether MCD is to perform the group posting operation on records bound for each output file. You must also specify whether MCD is to perform group posting to the input files. You can exempt your input files from purging if your job involves group posting to input files.

If group posting is the only result you want from your MCD job (no output files, no input file purge), see When group posting is all you want to do on page 159.

Refer to the Group Posting block in your Job-File Reference or Views online help for job setup information. In addition, see all four output file blocks and the Input File block for instructions on performing the group posting operation for each output file or exempting your input file from group posting.

Post data sources and destinations

You can choose to post data to the master record, to all the subordinate members of the match group, or to all members of the match group. The data can be posted to the input record or you can post it to the record as it is being output.

[Figure: a match group made up of a master dupe and Subordinate #1, Subordinate #2, and Subordinate #3]

Field types in posting

When you consolidate data with MCD's group posting function, you can work with six data types. These six types represent a source and destination identifier for each of three data types that you're already familiar with: your database fields, MCD application fields, and our PW fields. In addition, if you group post to a new field in the output record, the data type DBOUT is also available.

The group posting data type is used with the appropriate field name as the identifier for posting data. For example, to use the PW field Address as a source of data for a group posting operation, you would designate that data as PWSRC.Address. The table below shows all the data types.

If you include more than one copy parameter in a Group Posting block or window, they all must copy group posting data to either DBDST or PWDST or to DBOUT; you can't mix the copy parameters.

Data type  The group posting operation uses
DBSRC      Your database field as the source of the posting data.
DBDST      Your database field as the destination of the posted data.
APSRC      An MCD-generated field as the source of the posting data.
APDST      An MCD-generated destination field used in filters or expressions.
PWSRC      A PW field as the source of the posting data.
PWDST      A PW field as the destination of the posted data.
DBOUT      An output file field as the destination of the posted data.

Refer to the Quick Reference for a master list of PW fields and for the MCD Output Fields table that lists MCD AP fields.

Use filters

To design complex group posting operations, you may want to include a filter. Filters give more precise control over the posting operation. For simpler operations, the controls of your job setup can suffice. For information on filters and functions, see your Database Prep manual.

Post higher priority records first

Match/Consolidate always starts the group posting operation with the highest priority member of the dupe group (the master) and works its way down to the last subordinate, one at a time. This ensures that data can be salvaged from the higher-priority record to the lower-priority record.
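To make the data-type identifiers (such as PWSRC.Address) and the rule against mixing copy destinations concrete, here is a small Python sketch. It is not MCD syntax or an MCD API; split_identifier, check_copy_destinations, and the example field names Phone and Gender are hypothetical, and the check simply restates the rule that every copy parameter in one Group Posting block must target the same one of DBDST, PWDST, or DBOUT.

```python
# Destination types that a copy parameter may post to, per the data type table.
COPY_DESTINATION_TYPES = {"DBDST", "PWDST", "DBOUT"}


def split_identifier(identifier):
    """Split 'PWSRC.Address' into ('PWSRC', 'Address')."""
    data_type, dot, field = identifier.partition(".")
    if not dot or not field:
        raise ValueError(f"expected TYPE.Field, got {identifier!r}")
    return data_type.upper(), field


def check_copy_destinations(destinations):
    """Return the single destination type shared by all copy parameters.

    Raises ValueError if a type is not a copy destination, or if the
    copy parameters mix destination types.
    """
    types = {split_identifier(d)[0] for d in destinations}
    invalid = types - COPY_DESTINATION_TYPES
    if invalid:
        raise ValueError(f"not a copy destination type: {sorted(invalid)}")
    if len(types) > 1:
        raise ValueError(f"copy parameters mix destination types: {sorted(types)}")
    return types.pop() if types else None
```

With this check, ["DBDST.Phone", "DBDST.Gender"] is accepted, while ["DBDST.Phone", "PWDST.Address"] is rejected because it mixes two destination types.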

Group posting depends on your fields

Discrete fields

Group posting is set up on a field-by-field basis. This means, for example, that you can work with first names only if you have a discrete field for first name. If you have a Name_Line field containing first and last, you cannot manipulate the first name through group posting.

Non-common fields

Sometimes, you may work with files that do not all include the same fields. MCD may find that a destination field mentioned in your group-posting setup does not exist in some records. In this instance, MCD simply cancels the operation for those records. The group-posting report includes the number of operations canceled for this reason.

When a field is missing from the source record, all is not lost and the operation is not canceled. In your filter and posting expressions, a missing database field is evaluated as an empty character string (length 0), and a PW field not defined is evaluated as a string full of spaces.

PW fields

You can apply group posting to database fields or PW fields. If you make any PW field the destination of group posting, that PW field must be based on one database field, not on two or more concatenated database fields. For example, if you have the following lines in your DEF file

    Address = Address
    Last_Line = City & State & ZIP

then you can group post to the field PWDST.Address (a destination-type PW address field). However, you will receive an error message if you attempt to group post to PWDST.Last_Line, because that field concatenates fields.

Input fields

If you perform group posting while creating an output file, remember that group posting is performed just before fields are copied to the output record. References to DB and PW fields (DBSRC, DBDST, PWSRC, and PWDST) pertain to fields in the input records. If you define additional fields in the output file, those additional fields are available as DBOUT fields for your group-posting operation.

Group posting more than once per destination record

You may want your group posting operation to stop after the first time it posts data to the destination record, or you may want it to continue with the other dupe group records as well. Your choice depends on the nature of the data you're posting and the records you're posting to. The two examples that follow (refer to page 151 and page 154) illustrate each case.

If you post only once to each destination record, then once data is posted for a particular record, MCD moves on to either perform the next group posting operation (if more than one is defined) or produce the next output record.

If you don't limit the group posting in this way, MCD works through each member of the destination record's dupe group, in priority order, performing this posting action each time the filter is passed. Only then does it move on to the next group posting operation (if more than one is defined) or to the next destination record.

Regardless of this setting, MCD always works through the dupe group members in priority order.

When posting to record #1 in the dupe group below, without limiting the posting to only once, here is what happens:

Record #1 (master)
Record #2 (subordinate)
Record #3 (subordinate)
Record #4 (subordinate)

First, the posting operation is attempted using, as a source, that record from among the other dupe group records that has the highest priority (record #2). Next, the posting operation is attempted with the next highest priority record (record #3) as the source. Finally, the posting operation is attempted with the lowest priority record (record #4) as the source.

The results: In the case above, record #4 was the last source for the posting action, and therefore could be a source of data for the output record.
However, if you set your group posting to post only once per destination record, here is what happens:

Record #1 (master)
Record #2 (subordinate)
Record #3 (subordinate)
Record #4 (subordinate)

First, the posting operation is attempted using, as a source, that record from among the other dupe group records that has the highest priority (record #2). If this attempt is successful, MCD considers this dupe group posting operation to be complete and moves to the next group posting operation (if there is one) in the job setup, or to the next output record. If this attempt is not successful, MCD moves to the dupe group member with the next highest priority and attempts the posting operation.

In this case, record #2 was the source last used for the posting action, and so is the source of posted data in the output record.
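The difference between the two settings can be sketched in a few lines of Python. This is only an illustration of the behavior described above, not MCD's implementation: the names `group_post`, `passes`, `copy`, and `demo`, and the record layout, are all hypothetical.

```python
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]


def group_post(
    dest: Record,
    others: List[Record],
    passes: Callable[[Record, Record], bool],
    copy: Callable[[Record, Record], None],
    post_only_once: bool,
) -> List[int]:
    """Post to `dest` from the other dupe group members, in priority order.

    Returns the ids of the source records whose data was actually posted.
    """
    used: List[int] = []
    # Sources are always visited from highest to lowest priority.
    for src in sorted(others, key=lambda r: r["group_ord"]):
        if not passes(src, dest):
            continue  # filter failed; try the next source
        copy(src, dest)
        used.append(src["id"])
        if post_only_once:
            break  # the first successful post ends the operation
    return used


def demo(post_only_once: bool):
    """Post to record #1 from records #2..#4 with no filter (hypothetical data)."""
    dest = {"id": 1, "group_ord": 1, "value": "from #1"}
    others = [{"id": n, "group_ord": n, "value": f"from #{n}"} for n in (2, 3, 4)]
    used = group_post(
        dest,
        others,
        passes=lambda src, dst: True,  # no filter: every attempt succeeds
        copy=lambda src, dst: dst.update(value=src["value"]),
        post_only_once=post_only_once,
    )
    return used, dest["value"]
```

Calling `demo(False)` leaves record #1 holding data from record #4 (the last source), while `demo(True)` stops after record #2.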

Example: post a new phone number

You may want to post the latest phone number to all duplicate records. In your job setup, set list and record priorities (including PW.Priority) so that the most recent records become the master records within each dupe group (refer to Prioritize records based on the contents of one field on page 56 for details about prioritizing with field data). Your dupe detection process would then find dupe groups such as the following:

Record Name Phone Date Dupe status
#1 John Smith 11 Apr 2001 Master
#2 John Smyth Oct 1999 Subordinate
#3 John E. Smith Feb 1997 Subordinate
#4 J. Smith Subordinate

In your group posting operation, set up a filter to use the order of the records in the dupe group (that's the MCD application field AP.Group_Ord) to indicate the more recent phone number, and a Copy statement to post that number to the Phone field of each record. In this case, your group posting operation should post only once per destination record, because any additional posting action would presumably post less recent data.

As this example shows, there may be more recent records without phone numbers, so your filter should accommodate that possibility. In addition, your filter should prevent posting an older phone number over a newer number. The following filter statement satisfies both concerns:

.not. empty(DBSRC.Phone) .and. (empty(DBDST.Phone) .or. (APSRC.Group_Ord < APDST.Group_Ord))

What happens with record #1

When record #1 is about to be output, MCD detects that it is a member of a dupe group, and that this group post operation calls for all members of the dupe group to be posted. Match/Consolidate finds that record #2 has the highest priority among the other members of this record's dupe group. Match/Consolidate applies the group post filter, with record #1 as the destination record and record #2 as the source record.

.not. empty(DBSRC.Phone) .and. (empty(DBDST.Phone) .or. (APSRC.Group_Ord < APDST.Group_Ord))

Because record #2's original database Phone field is not empty, and record #1's Phone field is empty, the filter passes, and MCD performs the group post Copy... function (DBSRC.Phone, DBDST.Phone). The phone number from record #2 is posted to the Phone field of record #1.

Match/Consolidate detects that the data should be posted only once, so it performs no further actions on this record with this group post operation. Record #1 is therefore output with the phone number from record #2 (the most recent phone number among the records in the dupe group) posted to its Phone field.

Record to output: record #1 (empty Phone)
Other dupe group member   Posting action
record #2                 Phone posted
record #3                 none (posting already done)
record #4                 none (posting already done)

What happens with record #2

When record #2 is about to be output, MCD detects that it is a member of a dupe group, and that this group post operation calls for all members of the dupe group to be posted. Match/Consolidate determines that the highest priority member of this record's dupe group is record #1. Match/Consolidate applies the group post filter function, with record #2 as the destination record and record #1 as the source record.

.not. empty(DBSRC.Phone) .and. (empty(DBDST.Phone) .or. (APSRC.Group_Ord < APDST.Group_Ord))

Because record #1's original database Phone field is empty, the filter does not pass. Moving to the next highest priority member of this record's dupe group, MCD applies the filter to record #3. This time, the source Group_Ord value is higher than the destination Group_Ord value, so the filter again fails. Next, the same thing happens with record #4. As a result, the group posting operation can't be done. Record #2 is therefore output with its Phone field unchanged. That's good, because this is the most recent phone number among the records in the dupe group.

Record to output: record #2
Other dupe group member   Posting action
record #1 (empty Phone)   filter fails
record #3                 filter fails
record #4                 filter fails

What happens with record #3

When record #3 is about to be output, MCD detects that it is a member of a dupe group, and that this group post operation calls for all members of the dupe group to be posted. Match/Consolidate finds that record #1 has the highest priority among the other members of this record's dupe group. Match/Consolidate applies the group post filter function, with record #3 as the destination record and record #1 as the source record.

.not. empty(DBSRC.Phone) .and. (empty(DBDST.Phone) .or. (APSRC.Group_Ord < APDST.Group_Ord))

Because record #1's original database Phone field is empty, the filter does not pass. Moving to the next highest priority member of this record's dupe group, MCD applies the filter to record #2. This time the filter passes: the source Phone field is not empty and the source Group_Ord value (2) is lower than the destination Group_Ord value (3). Match/Consolidate therefore performs group post Copy... DBSRC.Phone, DBDST.Phone. The phone number from record #2 is posted to the Phone field of record #3.

With the group post operation Post Only Once... parameter set to yes (Y), MCD halts actions on this record with this group post operation. Record #3 is therefore output with the phone number from record #2 (the most recent phone number among the records in the dupe group) posted to its Phone field.

Record to output: record #3
Other dupe group member   Posting action
record #1 (empty Phone)   filter fails
record #2                 Phone posted
record #4                 none (posting already done)

What happens with record #4

When record #4 is about to be output, MCD performs the same actions described above for record #3. As a result, record #4 is also output with the phone number from record #2, the most recent phone number among the records in the dupe group.

In a case like this, you must post only once to your destination record. Otherwise, MCD repeats the posting process from highest priority record through lowest priority duplicate. That could result in the final posted value for an output record having come from a record with older data than might be available from higher priority records within the dupe group.

For example, when about to output record #4 in the example above, Match/Consolidate would post Phone data from record #2, just as shown above. Then, instead of outputting the posted record #4, MCD repeats the group post process using record #3 as the source. The filter again passes, and MCD posts Phone data from record #3. As a result, record #4 would be output with Phone data that's older than was available from a higher priority record (record #2) in the dupe group.
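The walkthrough above can be replayed with a short Python sketch. It is a simplified model, not MCD code: the phone values `PHONE-2`, `PHONE-3`, and `PHONE-4` are placeholders invented for the illustration, the helper names are hypothetical, and the sketch assumes (as when creating an output file) that source records keep their original data.

```python
def empty(value: str) -> bool:
    """Mimic the filter's empty() test: blank or all-space strings count as empty."""
    return value.strip() == ""


def filter_passes(src: dict, dst: dict) -> bool:
    # .not. empty(DBSRC.Phone) .and.
    #   (empty(DBDST.Phone) .or. (APSRC.Group_Ord < APDST.Group_Ord))
    return (not empty(src["phone"])) and (
        empty(dst["phone"]) or src["group_ord"] < dst["group_ord"]
    )


def post_phone(group: list, dest_ord: int, post_only_once: bool) -> str:
    """Return the Phone value that the destination record is output with."""
    # Work on a copy so the input records (the sources) stay unmodified.
    dst = dict(next(r for r in group if r["group_ord"] == dest_ord))
    for src in sorted(group, key=lambda r: r["group_ord"]):
        if src["group_ord"] == dest_ord:
            continue  # a record is never its own source
        if filter_passes(src, dst):
            dst["phone"] = src["phone"]
            if post_only_once:
                break
    return dst["phone"]


# Placeholder data: record #1 is the newest record but has no phone number.
GROUP = [
    {"group_ord": 1, "phone": ""},
    {"group_ord": 2, "phone": "PHONE-2"},
    {"group_ord": 3, "phone": "PHONE-3"},
    {"group_ord": 4, "phone": "PHONE-4"},
]
```

With posting limited to once, every record in `GROUP` ends up with record #2's number. With `post_only_once=False`, `post_phone(GROUP, 4, False)` ends with record #3's older number, which is the problem described above.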

Example: additive information

Posting a new total

In another typical situation, you may want to post additional information to the members of a dupe group; for example, to merge credit balances from several rented credit lists. For your output file, you'd like each record to include the sum of all those balances.

In your job setup, establish your output file to include the credit amount (Balance) field and any other mailing data your job demands. Your dupe detection process would then find dupe groups such as the one shown below:

Record  Lastline      Date    Account     Balance  Dupe status
#1      La Crosse WI  Apr 95  FirstBank   3,456    Master
#2      La Crosse WI  Oct 93  SecondBank  1,234    Subordinate
#3      La Crosse WI  Oct 93  SecondBank  6,789    Subordinate
#4      La Crosse WI  Mar 90  LastBank    2,345    Subordinate

Design your group posting operation to post to all records within the dupe group and do not limit the posting to once per destination record. Create the following Copy function:

DBSRC.Balance + DBDST.Balance, DBDST.Balance

(This function is based on numeric field data, not character data.)

The results

As MCD is about to output each record, it detects that group posting should occur to all records and that there is no filter to impose before posting. Therefore, for each output record, three posting actions would occur, to make the value of DBDST.Balance as follows:

Record  First post results          Second post results  Third post results
#1      (Rec #2 + Rec #1) = 4,690   (+ Rec #3) = 11,479  (+ Rec #4) = 13,824
#2      (Rec #1 + Rec #2) = 4,690   (+ Rec #3) = 11,479  (+ Rec #4) = 13,824
#3      (Rec #1 + Rec #3) = 10,245  (+ Rec #2) = 11,479  (+ Rec #4) = 13,824
#4      (Rec #1 + Rec #4) = 5,801   (+ Rec #2) = 7,035   (+ Rec #3) = 13,824

If you were to post only once per destination record (as in the previous example), MCD would perform the Copy function only once for each record to be output, and would therefore output the records with the results that you see in the first column above (first post results).
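The running totals in the table can be checked with a small Python sketch. The function name `running_totals` is hypothetical, and the sketch assumes that record numbers double as the dupe group order and that source balances are read from the unmodified input records.

```python
# Balances from the example dupe group, keyed by record number.
BALANCES = {1: 3456, 2: 1234, 3: 6789, 4: 2345}


def running_totals(dest: int, balances: dict, post_only_once: bool = False) -> list:
    """Return DBDST.Balance after each posting action for one destination record."""
    total = balances[dest]
    steps = []
    # Sources are visited in priority order, skipping the destination itself.
    for src in sorted(balances):
        if src == dest:
            continue
        total += balances[src]  # Copy: DBSRC.Balance + DBDST.Balance -> DBDST.Balance
        steps.append(total)
        if post_only_once:
            break
    return steps
```

For example, `running_totals(3, BALANCES)` gives `[10245, 11479, 13824]`, the third row of the table, and `running_totals(3, BALANCES, post_only_once=True)` gives only the first post result.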

Examples of group posting strategies

The following are some examples of group posting strategies that you might want to consider. For each, the resource file group.mpg includes a job-file block preset for its use. To use, copy any block from that file into your own job file. Or, if you're using MCD Views, set the Group Posting controls as listed in the descriptions below. For details about functions and operators, refer to Database Prep.

Phone number salvage

In our first two examples, we merge several rented mailing lists. Some of these lists contain a Phone field. We're going to salvage the telephone number by copying it from subordinate to master dupe. However, first, we use a filter to answer two questions:

- Is the field in the master dupe empty?
- Does the subordinate dupe contain data?

The operation continues only if the answer to both questions is yes. This protects us from any net loss of data. This group posting operation in group.mpg is Phone Number Salvage.

Posting destination             Master
Post only once per destination  Yes
Group post to suppress list(s)  No
Group posting filter            empty(DBDST.Phone) .and. .not. empty(DBSRC.Phone)
Copy                            DBSRC.Phone, DBDST.Phone

Phone number update

In this example, we want to post the newest phone number to all the members of the dupe group. The filter verifies that the source dupe contains some data. It also ensures that, unless the destination dupe field is empty, the priority of the source dupe is higher than that of the destination dupe, so we don't overwrite a newer phone number with an older one. With appropriate modification to the field name, this approach could be used with any other dated field, as well. This group posting operation in group.mpg is Phone Number Update.

Set your list and record priorities so that the most recent phone number is assigned the highest priority.

Posting destination             All
Post only once per destination  Yes
Group post to suppress list(s)  Yes
Group posting filter            .not. empty(DBSRC.Phone) .and. (empty(DBDST.Phone) .or. (APSRC.Group_Ord < APDST.Group_Ord))
Copy                            DBSRC.Phone, DBDST.Phone

Add one field to another

In this example, several of our files have a field that contains a credit limit (the maximum balance on a credit account). We want to sum all those limits, so each output record (that is, each Master dupe) contains the total credit exposure. This group posting operation in group.mpg is Total Credit Limit.

Because both Max_Credit fields contain character-type data, you must apply the val function before performing arithmetic on them.

Posting destination             Master
Post only once per destination  No
Group post to suppress list(s)  No
Group posting filter            None
Copy                            val(DBSRC.Max_Credit) + val(DBDST.Max_Credit), DBDST.Max_Credit

Total balance to all

In this example, the value of the Balance fields of all the records is added, and the total is posted to all the records of the dupe group. This group posting operation in group.mpg is Total Balance To All.

Because both Balance fields contain character-type data, you must apply the val function before performing arithmetic on them.

Posting destination             All
Post only once per destination  No
Group post to suppress list(s)  No
Group posting filter            None
Copy                            val(DBSRC.Balance) + val(DBDST.Balance), DBDST.Balance

Append one field to another

Here, we merge a catalog marketer's house file with a file of today's orders. (The order-entry system generates a database record.) The house file has top priority, so we know that the house record will be the master dupe. The house file includes a long field for a history of items that the customer has purchased.

We don't want this history to contain redundant numbers (when a customer buys the same item a second time). So before appending the item ordered today onto the history field, we check with a filter to see whether the same item number already exists in the history field. This group posting operation in group.mpg is Item Number History.

Posting destination             Master
Post only once per destination  No
Group post to suppress list(s)  No
Group posting filter            .not. ALLTRIM(DBSRC.Item) $ DBDST.History
Copy                            DBDST.History & DBSRC.Item, DBDST.History

Save the greater of two amounts

In the same job as the previous example, we also want to store the record-setting purchase: the greatest amount the customer has ever spent on a single order. We compare the amount of today's order with the previous record (from the house file). If today's amount is greater, we post it, thereby replacing the old amount with the new one. This operation is named Record Purchase Amount in the group.mpg file.

Posting destination             Master
Post only once per destination  No
Group post to suppress list(s)  No
Group posting filter            DBSRC.Amt > DBDST.Max_Amt
Copy                            DBSRC.Amt, DBDST.Max_Amt

Save the number of orders

Our catalog marketer wants to maintain, in the house file, a count of the number of times each customer has ordered. This field will be used later to determine which customers are eligible for special treatment. We must increase the value of the destination field by one. No data from the source record is actually used. This operation is named Number of Orders in the group.mpg file.

Posting destination             Master
Post only once per destination  No
Group post to suppress list(s)  No
Group posting filter            None
Copy                            val(DBDST.Orders) + 1, DBDST.Orders

Average donation

Here we merge the donor files of two charitable foundations. Each file contains a numeric-type field for the amount pledged. If there is a pledge in both files, we want to post the average of the two pledges; if there is a pledge in only one, we want to post that amount. We use the iif and max functions to determine what will be posted. This operation is named Average Donation in the group.mpg file.

Posting destination             Master
Post only once per destination  No
Group post to suppress list(s)  No
Group posting filter            None
Copy                            iif(DBSRC.Amt > 0 .AND. DBDST.Amt > 0, (DBSRC.Amt + DBDST.Amt) / 2, max(DBSRC.Amt, DBDST.Amt)), DBDST.Amt
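For readers less familiar with the iif function, the Average Donation expression can be read as an ordinary conditional. The Python sketch below restates it for one source/destination pair; the function name `average_donation` is hypothetical and the sketch is not part of group.mpg.

```python
def average_donation(src_amt: float, dst_amt: float) -> float:
    """Value posted to DBDST.Amt for one source/destination pair."""
    # iif(DBSRC.Amt > 0 .AND. DBDST.Amt > 0, <average>, <larger of the two>)
    if src_amt > 0 and dst_amt > 0:
        return (src_amt + dst_amt) / 2
    # Only one pledge (or none) is present: keep whichever amount is larger.
    return max(src_amt, dst_amt)
```

With pledges of 100 and 50 the posted amount is 75; when one of the two amounts is zero, the other amount is posted unchanged.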

Best first name

Some records contain full first names, while others have only the first initial or an abbreviation. We compare lengths of the first-name data. If the first name in the master dupe is shorter, we replace it with the first name from the subordinate dupe. This procedure requires a discrete field for first name. This operation is named Best First Name in the group.mpg file.

Posting destination             Master
Post only once per destination  No
Group post to suppress list(s)  No
Group posting filter            len(rtrim(PWSRC.First_Name)) > len(rtrim(PWDST.First_Name))
Copy                            PWSRC.First_Name, PWDST.First_Name

Combine first names

Here we want to compare first names and, if they're different, string them together with an ampersand. (For example, John Doe and Mary Doe would become John & Mary Doe.) This process can be fooled by nicknames and misspellings. This operation is named First Name Combiner in the group.mpg file.

Posting destination             Master
Post only once per destination  No
Group post to suppress list(s)  No
Group posting filter            None
Copy                            iif(empty(PWDST.First_Name), PWSRC.First_Name, iif(rtrim(PWDST.First_Name) $ rtrim(PWSRC.First_Name), PWSRC.First_Name, iif(rtrim(PWSRC.First_Name) $ rtrim(PWDST.First_Name), PWDST.First_Name, PWDST.First_Name & "&" & PWSRC.First_Name))), PWDST.First_Name
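The two first-name strategies can be paraphrased in Python to make the nested iif logic easier to follow. This is a rough sketch under assumptions: the function names are hypothetical, the `$` operator is treated as a substring test, and Python's `in` treats an empty string as contained in any string, which may differ from `$`, so the sketch is meant for non-blank source names.

```python
def best_first_name(dst: str, src: str) -> str:
    """Best First Name: keep whichever first name is longer after trimming."""
    if len(src.rstrip()) > len(dst.rstrip()):
        return src  # filter passed: the source name replaces the destination
    return dst


def combine_first_names(dst: str, src: str) -> str:
    """First Name Combiner: the nested iif() written as a chain of tests."""
    d, s = dst.rstrip(), src.rstrip()
    if not d:        # empty(PWDST.First_Name)
        return src
    if d in s:       # destination is contained in source, e.g. "J" in "John"
        return src
    if s in d:       # source is contained in destination
        return dst
    return d + "&" + s  # different names: join them with an ampersand
```

For instance, `combine_first_names("John", "Mary")` joins the two names, while `combine_first_names("J", "John")` simply keeps the fuller name.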

When group posting is all you want to do

There may be times when you want to run a group posting operation without purging your input file(s) or creating any output files. But when a job includes group posting, the group posting is always performed during an input purge or the creation of an output file.

The answer: Perform an input file purge without actually deleting any records. How is that possible? Protect the input file(s) with your input file setup. Each input file that you set up for your job includes a parameter or switch that enables you to protect that file from a purge. For each of your input files, set that control to protect the file.

Producing the group posting report

When you run the job, be sure to generate the Posted Dupe Groups report, as well, because you won't be able to come back and generate it later.

In the Execution Options of your job setup, include the following:

- The Group Post to Purged Files Input File Option
- The Purge or Custom Purge Input File Option
- The Create Reports Options

When you run the job, MCD performs your group posting operations. It attempts to perform the purge, only to discover that every single input record has been protected from the purge. With this setup, MCD purges no records from the input file; however, MCD performs group posting on the input files.

Group post with an input purge

Your group posting operation may face an additional challenge in jobs that involve an input purge. The potential complication only exists under the following combination of conditions:

- The job includes group posting in an input purge.
- The group posting operation includes posting to subordinate dupes or all records.

In this situation, source fields for your group posting operation may or may not have already been posted by the group posting operation(s). This means you must design your group post operation to work properly in both situations: when the source field's record has been posted to, and when it has not.

Typically, group posting to subordinate dupes or all records in conjunction with an input purge is done with a custom purge of the input file. This allows the retention of subordinate dupes, as well as master dupes.

Why not in other cases

The potential of reading previously group-posted source fields is only present in the specific combination of conditions listed above. In all other cases, you can be sure source-field data has not been modified by a group posting operation. Here's why:

- When you produce an output file (rather than purge the input file), input record data is never modified by the group posting operation, so source fields always contain unmodified data.
- When you group post only to master records, subordinate records always provide the source fields. Because subordinate records are not a posting destination, their fields always contain unmodified data.

However, when group posting to subordinates as part of input purging, MCD group posts data back to input records. This does not affect the first group post operation for the dupe group (always the master record), because none of the subordinates have yet been posted.
As the group posting operation progresses through the subordinate dupe group members, though, any record higher in the dupe group order has been the destination for the group posting action; any record lower in the group order has not yet been the destination for the group posting action.

Therefore, when posting to subordinate records in conjunction with input purging, make sure you design your group posting operation (including any filters and functions) to work predictably whether dealing with records that have already been the destination of the operation or not.

One thing you can depend upon: A group posting operation in conjunction with input purging is always performed master first, then dupes in dupe group order.

The process

The table below reflects the potential status of source field data for a four-record dupe group as a group posting operation is performed with the group posting set to post to all records, in conjunction with input purging. Apply this information when constructing filters and functions for your group posting operations in this situation.

When dest record is  Other (source) records        Has input record been destination record?
record #1            record #2, record #3, record #4  no, no, no
record #2            record #1, record #3, record #4  yes, no, no
record #3            record #1, record #2, record #4  yes, yes, no
record #4            record #1, record #2, record #3  yes, yes, yes
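The yes/no pattern follows directly from the master-first processing order. The Python sketch below derives it for a dupe group of any size; the function name `already_posted_status` is hypothetical, and the sketch assumes records are numbered in dupe group order and every record is a posting destination.

```python
def already_posted_status(group_size: int) -> dict:
    """Map each destination record to {source record: already a destination?}."""
    table = {}
    done = set()  # records that have already been a posting destination
    # Destinations are processed master first, then dupes in dupe group order.
    for dest in range(1, group_size + 1):
        table[dest] = {
            src: (src in done)
            for src in range(1, group_size + 1)
            if src != dest
        }
        done.add(dest)
    return table
```

For a four-record group, `already_posted_status(4)[3]` reports that records #1 and #2 have already been posted to, while record #4 has not.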

Reports on group posting

How do you know the results of your group posting operations? One option, of course, is to look at the content of your output file: What data (if any) was posted to your records? However, it may be more useful to see statistics about the results, and that's the reason for having the Posted Dupe Groups report.

From the Posted Dupe Groups report, you may find clues for making any adjustments to further improve your results. If your results show any trends that could be improved by adjustments to your settings, then change those settings and re-process the step.

The example below shows a portion of a report for a job that included one group posting operation, with the name mpg.

Posted Dupe Groups Report
File : C:\pw\mpg\Work\output\Out_mpg.txt

Post  Post      Purge  Dst Field  Filter  Post
Name  Attempts  Drops  Drops      Drops   Completes
mpg

Attempts minus drops equals completions.

How many group posting operations were tried, and how many were completed?

Check the numbers in the first column to see how many times your group posting operation was attempted, and the last column to see how many times the posting operation was accomplished.

These numbers show attempts and completions, not simply the number of destination records. If your group posting operation is set to post more than once per destination record, the numbers may be larger than the number of destination records.

If you expected more attempts, you may want to recheck the details of your group posting operation, to be sure the fields are named right and, if you have a filter, that it's right. In addition, if you haven't done so, check to see that the records you expected to see posted were, in fact, found to be dupes.

How many posting operations were canceled?

Check the Purge Drops, Dst Field Drops, and Filter Drops columns to see how many group posting operations were canceled because of these three reasons:

- The destination record was a member of a Suppress-type list.
- The destination record did not have the appropriate field or the group posting operation was set to group post only once per destination.
- The group post filter returned false (didn't pass).
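The relationship among the report columns (attempts minus drops equals completions) can be written as a quick sanity check. In the Python sketch below, the class name `PostStats` and the sample figures are hypothetical; they are not taken from an actual report.

```python
from dataclasses import dataclass


@dataclass
class PostStats:
    """One row of the Posted Dupe Groups report (column names paraphrased)."""
    attempts: int
    purge_drops: int
    dst_field_drops: int
    filter_drops: int

    @property
    def completes(self) -> int:
        # Attempts minus all three kinds of drops equals completions.
        drops = self.purge_drops + self.dst_field_drops + self.filter_drops
        return self.attempts - drops


# Hypothetical figures, for illustration only.
row = PostStats(attempts=100, purge_drops=5, dst_field_drops=10, filter_drops=25)
```

Here `row.completes` is 60. If the Post Completes figure in your own report does not equal attempts minus the three drop columns, recheck how you are reading the columns.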

Chapter 8: Record matching

In this chapter, and the remaining chapters of this book, you will learn how to fine-tune your breaking and matching settings to achieve the matching results that you want.

Introduction

Record matching overview on page 15 explained how to use our templates to set up a matching job. In the remaining chapters of this book, you'll learn how to fine-tune settings to achieve the matching results you want.

It's important to understand that record matching is not a black-and-white process with a one-size-fits-all solution. The quality of your record matching is directly related to the level of knowledge and skill brought to the task and the time allotted to it. Therefore, for most Match/Consolidate (MCD) users, the challenge is to invest enough effort to get good, useful results without reaching the point of diminishing returns.

Three phases

There are three phases of record matching:

1. First, MCD places the input records into small groups to avoid comparing records that have no reasonable likelihood of matching. These are referred to as break groups. Refer to Engineer key data on page 175 for information about how you can control the formation of break groups.

2. Next, the software compares each key of a break group to each other key in that break group. When two or more record keys match, MCD identifies them as members of a dupe group (a duplicate record group). The number of records in a dupe group can vary widely, depending on the quality of your data and how stringent your matching setup was. Refer to Engineer break groups on page 187 for information about how you can control what constitutes a match.

3. With all the dupe groups formed, MCD sorts the keys of each group, to prioritize the records. Once sorting is done, MCD can categorize each record key as one of the following:

   - Unique records: those keys that did not match any other key in their break group.
   - Master dupes: those keys ranked at the top of their dupe group.
   - Subordinate dupes: those keys ranked second or lower in their dupe group.
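The three phases can be pictured with a compact Python sketch. It is a conceptual illustration only, not how MCD is implemented: the function `match_records`, its parameters, and the sample records (including the use of a ZIP code as the break key and an exact last-name comparison as the match rule) are all assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations


def match_records(records, break_key, is_match, priority):
    """Classify each record id as 'unique', 'master', or 'subordinate'."""
    # Phase 1: place records into break groups so unrelated records are never compared.
    break_groups = defaultdict(list)
    for rec in records:
        break_groups[break_key(rec)].append(rec)

    results = {}
    for members in break_groups.values():
        # Phase 2: compare every pair in the break group; merge matches into dupe groups.
        group_of = {id(r): [r] for r in members}
        for a, b in combinations(members, 2):
            if is_match(a, b) and group_of[id(a)] is not group_of[id(b)]:
                merged = group_of[id(a)] + group_of[id(b)]
                for r in merged:
                    group_of[id(r)] = merged
        # Phase 3: rank each dupe group and categorize its members.
        seen = set()
        for r in members:
            grp = group_of[id(r)]
            if id(grp) in seen:
                continue
            seen.add(id(grp))
            if len(grp) == 1:
                results[grp[0]["id"]] = "unique"
                continue
            ranked = sorted(grp, key=priority)  # lowest value = highest priority
            results[ranked[0]["id"]] = "master"
            for sub in ranked[1:]:
                results[sub["id"]] = "subordinate"
    return results


# Hypothetical sample data.
RECORDS = [
    {"id": 1, "zip": "54601", "last": "SMITH", "priority": 2},
    {"id": 2, "zip": "54601", "last": "SMITH", "priority": 1},
    {"id": 3, "zip": "54601", "last": "JONES", "priority": 1},
    {"id": 4, "zip": "10001", "last": "SMITH", "priority": 1},
]

RESULT = match_records(
    RECORDS,
    break_key=lambda r: r["zip"],
    is_match=lambda a, b: a["last"] == b["last"],
    priority=lambda r: r["priority"],
)
```

In this sample, records 1 and 2 form a dupe group with record 2 as master, record 3 is unique within its break group, and record 4 is never compared with the others because it falls into a different break group.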

Choose between standard and extended matching

The MCD standard matching method incorporates match logic that is useful to many MCD users, across varied job circumstances. Complete instructions for using standard matching are in your Match/Consolidate Job-File Reference.

However, some MCD users want more control over the match comparison process. The extended matching method lets you engineer your own match logic. Although the job setup is more involved, the extended matching method lets you precisely control record comparisons. There are two ways to implement extended matching: automatic and rule-based. Each is explained a little later in this chapter, so we won't contrast them here. Instructions for using extended matching are in your Match/Consolidate Extended Matching Reference.

What might prompt you to use extended matching?

If the standard matching method produces what you need, in terms of results and processing speed, continue using that method. Setup is less involved, and (especially for those already familiar with this method) the learning curve is more easily managed. However, if standard matching does not produce the results you want, then the benefits of extended matching make it well worth using.

Faster overall job processing

You have far more control over the matching process with the extended matching method. For example, you can establish the order in which fields are compared, set thresholds for including or excluding a key field comparison in the overall record score, and set overrides for calling a record match (or no-match) based just on a single field comparison.

With the extended matching method, you can dramatically reduce the record matching time, by eliminating field comparisons that, on a record-by-record basis, should not affect your match results. Refer to the examples on the following page.
Positive matching by field

Many MCD users have business rules that include positive match or no-match data fields: fields such as Social Security number (SSN), phone number, or account number that override all other fields in importance. In cases like these, two records should be identified as matches (or no-matches) based entirely on that field comparison alone.

The extended matching method makes it possible for you to create a matching process that calls record match results based on specific field comparisons, and cancels any remaining field comparisons for that pair of records. Depending on your data, processing time for record comparisons can be significantly reduced.

For instance, you may want a match on SSN to produce a match (or a no-match) decision regardless of any other field comparisons. Using the extended matching method, you could do either of the following:

- Design a match process that first performs the SSN comparison. If SSNs match, call the records a match, eliminate further comparisons for this pair of records, and go on to the next record pair. Further key field comparisons would only be made when the SSNs do not match.
- Design a match process to first perform the SSN comparison. If SSNs do not match, call the records a no-match, eliminate further key field comparisons for this pair of records, and go on to the next record pair. Further key field comparisons would only be made when the SSNs match.

With the standard matching method, if any field comparison fails to meet its threshold, the two records are not a match, and if all the field comparisons meet their thresholds, the two records match. All your key field comparisons are performed, regardless of the results of the SSN field comparison.

Weighting of field match scores

You may want to have some key field comparisons count for more than other comparisons in the overall record score. The extended matching method enables you to set different weighting factors for each comparison.

For example, some MCD users may want an address field to make up 25 percent of the overall record match score. Others may want that comparison to make up 50 percent of the overall record match score. Weighting makes it possible for your record pair to match, even though some key fields of lesser importance did not meet their match threshold.

Better control of blank field matching

If you use a lot of different input files, with varied fields and less predictable content (for example, many blank fields), the extended matching method might better serve you. Extended matching allows you to set conditions for key field comparisons in which one or both fields are blank. You can set blank scores, or direct that the blank comparison not count in the overall record score.
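The ideas above (a single-field override, weighted field scores, and blank comparisons that do not count) can be combined in one small Python sketch. It illustrates the concepts only and is not the extended matching engine: the function names, the 0.8 threshold, the weights, and the use of `difflib.SequenceMatcher` as a similarity measure are all assumptions made for the example.

```python
from difflib import SequenceMatcher


def similarity(x: str, y: str) -> float:
    """Stand-in similarity score between 0.0 and 1.0 (not MCD's algorithm)."""
    return SequenceMatcher(None, x.lower(), y.lower()).ratio()


def is_match(a: dict, b: dict, weights: dict, threshold: float = 0.8) -> bool:
    # Positive match by field: identical, non-blank SSNs decide the outcome
    # immediately, and the remaining field comparisons are skipped.
    if a.get("ssn") and a.get("ssn") == b.get("ssn"):
        return True

    total_weight = 0.0
    weighted_score = 0.0
    for field, weight in weights.items():
        x, y = a.get(field, ""), b.get(field, "")
        if not x or not y:
            continue  # blank comparison: do not count it in the record score
        total_weight += weight
        weighted_score += weight * similarity(x, y)

    # With nothing left to compare, the pair is treated as a no-match.
    return total_weight > 0 and weighted_score / total_weight >= threshold


# Hypothetical weights: the address counts for half of the record score.
WEIGHTS = {"address": 0.5, "last": 0.3, "first": 0.2}
```

Putting the SSN test first means the costlier field comparisons run only for pairs that the SSN comparison could not decide, which is where the processing-time savings described above would come from.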

Factors that affect comparison time

The speed of the comparison process depends on how many fields are used for matching, the length of those fields, and whether you require exact matches. When you require exact matches, the length of match fields does not affect performance. However, when you select near matching (tight, medium, or loose), field length does affect performance.

Matching strategies

To get the best results from a match process, you must have a clear idea of your purpose. The purpose should dictate the details of the process, that is, how you set operator controls for the match process. For example, you would look for different data similarities when finding matches based on persons (individuals) than you would when finding matches based on households.

Most matching processes serve one of the five purposes shown below. As demonstrated in our introductory job in Chapter 2, you choose a template to quickly set the match engine controls for each strategy. For most users, these strategies produce the results needed from the record matching process without additional effort. In other cases, they serve as a good starting point that, with fine-tuning, can produce the results you need from your process.

Match strategy: Family
What defines a match: Whether two records represent people who should be considered members of the same family.
What is compared: Last name, address data

Match strategy: Individual
What defines a match: Whether two records represent the same person.
What is compared: First name, last name, address data

Match strategy: Household
What defines a match: Whether two records represent people who should be considered members of the same household. Note that the family match strategy includes last name matching; the household match strategy does not.
What is compared: Address data

Match strategy: Firm
What defines a match: Whether two records reflect the same firm or company.
What is compared: Firm data, address data

Match strategy: Firm/Individual
What defines a match: Whether two records represent the same person, at the same firm or company.
What is compared: First name, last name, firm name, address data

Implement a matching strategy

All of the match programs provide templates. Therefore, regardless of the program (MCD Job or MCD Views), you can use these templates to design your strategy. Our templates are preset to a configuration that has proven to give good matching results with typical input data. You simply have to integrate the template into your job setup.

The table below summarizes how to implement the match strategy in our different match programs. Details for using templates vary from program to program, so refer to the Match/Consolidate Job-File Reference or the Match/Consolidate Extended Matching Reference. The templates are starting points that you can use without further modification; however, you can change the settings. Chapters 9 through 11 explain modifications you may want to consider.

Product: MCD Job File
Implementing a match strategy:
- Standard matching: From the match.mpg file, copy and paste the appropriate Match Criteria and Match Options blocks into your job file.
- Extended matching: Copy the appropriate extended matching file (for example, family.mpg) into your job file. Do not include Match Criteria and Match Options blocks in the job file.
- Enter std, ext, or adv at the Matching Method parameter of the Execution block.

Product: MCD Views
Implementing a match strategy:
- Standard matching: At the Matching Criteria window, click the Defaults button, then select the appropriate option from the five strategies displayed in that window. Update match options by clicking Yes at the prompt that follows.
- Extended matching: Copy the appropriate extended matching file (for example, family.mpg) with a new file name. Do not include Match Criteria and Match Options blocks in the job file.
- Select Standard, Extended, or Advanced as the Matching Method at the Execution Options window.

Rule matching

Rule matching gives you control of defining, on a field-by-field basis, what qualifies as a match when comparisons are made. Rules are used in sets, as a rule matching session. Each match rule specifies a key field to be compared and how the comparison will be scored. The result of the session as a whole determines whether or not the two records match.

A rule for each field

Normally, you will need a rule for each field included in your match key, that is, the fields that are significant to your matching strategy. For example, if you use a family match strategy, you might want to compare the last name and the address elements, so those fields would be included in your match key, and you would define a rule for each of those fields.

For its field comparison, each rule can control the following:

- How this field comparison affects the overall score. How much weight should this field similarity score carry in the overall score? If this field similarity score is lower than a set cutoff, should it be ignored, or should it be counted as zero in the overall score?
- How blank field data affects the comparison. If there is blank data in one or both fields, should the comparison be ignored or performed? What score should be assigned as the field similarity score?
- Whether the keys should be considered a match or a no-match based solely on this field comparison. For example, if the field similarity score is more than 75, should the records be designated a match? Or, if the field similarity score is less than 50, should they be called a no-match?

The Match engine applies the rules one at a time to the key comparison. Generally, each rule produces either a score for the field comparison or a conclusion that the records should be considered a match or not a match. If a rule produces a match decision rather than a field score, the comparison process stops, and the Match engine awaits the next pair of keys.
For example, the rule for the last name field in family matching might include a cutoff similarity score (our template uses 75 percent) below which the records are determined not to match, based solely on the last name comparison. If that determination is made, the comparison process stops and the Match engine looks for the next set of keys to compare.

Overall boundaries

In addition to the rules for the field comparisons, the rule session can include thresholds for the overall weighted score (the sum of all the field comparisons). These thresholds define the lowest score that qualifies as a match and the highest score that determines the records do not match. Refer to "Compare record keys: the driver record" on page 196 for details.
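The flow of a rule session described above can be sketched in Python. This is a minimal illustration, not the Match engine's implementation: the `Rule` class, the `run_session` function, the "undecided" outcome, the default thresholds of 85 and 60, and the use of `difflib` as a similarity measure are all assumptions made for the example.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Callable, Optional


def exact(a: str, b: str) -> float:
    """Score 100 for identical values, 0 otherwise."""
    return 100.0 if a == b else 0.0


def fuzzy(a: str, b: str) -> float:
    """Stand-in similarity measure on a 0-100 scale (not MCD's own scoring)."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


@dataclass
class Rule:
    field: str
    weight: float
    similarity: Callable[[str, str], float]
    match_above: Optional[float] = None      # field score above this: match, stop
    no_match_below: Optional[float] = None   # field score below this: no-match, stop
    blank_score: Optional[float] = None      # None: ignore comparisons with blanks


def run_session(rules, key_a, key_b, match_at=85.0, no_match_at=60.0):
    """Apply rules one at a time; stop early when a rule yields a decision."""
    total, weight_sum = 0.0, 0.0
    for rule in rules:
        a, b = key_a.get(rule.field, ""), key_b.get(rule.field, "")
        if not a or not b:
            if rule.blank_score is None:
                continue                      # blank comparison does not count
            score = rule.blank_score
        else:
            score = rule.similarity(a, b)
            if rule.match_above is not None and score > rule.match_above:
                return "match"
            if rule.no_match_below is not None and score < rule.no_match_below:
                return "no-match"
        total += score * rule.weight
        weight_sum += rule.weight
    if weight_sum == 0:
        return "undecided"
    overall = total / weight_sum              # overall weighted score
    if overall >= match_at:
        return "match"
    if overall <= no_match_at:
        return "no-match"
    return "undecided"


# Hypothetical family-style session: SSN decides alone when it matches,
# and a last name score under 75 ends the comparison as a no-match.
family_rules = [
    Rule("ssn", weight=0.0, similarity=exact, match_above=99.0),
    Rule("last_name", weight=0.5, similarity=fuzzy, no_match_below=75.0),
    Rule("address", weight=0.5, similarity=fuzzy),
]
```

Placing the rule most likely to produce a decision first means that many record pairs are settled after a single field comparison, which is where the time savings come from.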

Automatic matching

Rather than design your own rules for matching, you may want to take advantage of the built-in automatic matching abilities of your match program. With automatic matching, you normally need less setup to get started and, especially if you are new to MCD, less learning to use your MCD job.

For example, when you select the automatic setting of MCD Job's extended matching method, you need only select a match strategy (such as resident or family) and a match threshold (exact through loose). The match program can do the rest, deciding what data to compare, the order of the comparisons, the thresholds for matches, and so on.

You can use automatic matching in the extended matching method of MCD Job. You can also fine-tune your automatic matching process by setting additional match options.

Match strategy

It's important to identify the strategy for your match process because it should govern the comparison process (what fields to compare, what fields to ignore) and the end result of the comparison (should the records be considered to match or not match). When you identify the match strategy, the Match engine automatically integrates key field settings, match criteria, and selected match options. For a description of the match strategies provided with MCD products, refer to "Matching strategies" on page 168.

Match threshold

The match threshold setting answers the question "How similar should the keys be to consider their records a match?" In automatic matching, the answer is expressed as a category: exact, tight, medium, or loose. The Match engine measures the similarity of the match key contents, then compares that measurement to the match threshold that you set.

Match options

Match options give you a greater degree of control, even with automatic matching. Your selection of the match strategy sets option defaults appropriate for most users of that strategy and threshold.
However, you are not forced to live with every detail of the automatic settings; you can change match options to fine-tune your match results if necessary. Refer to Chapters 9 through 11 for explanations of many of your match options.

Set up automatic matching

Refer to the following table for information about setting up automatic matching.

Product: MCD Job and Views, standard matching
Setting up automatic matching: As an Execution option, select the Standard (STD) matching method. Use the Match Criteria and Match Options blocks of your job file. MCD automatically controls the order of the field comparisons. In Views, at Matching Criteria, select one of the five preset matching strategies by clicking the Defaults button. Or, if directly editing the job file, copy the appropriate Match Criteria block from the match.mpg file (from your \templates subdirectory). You can edit the defaults set by your selected strategy.

Product: MCD Job and Views, extended matching
Setting up automatic matching: As an Execution option, select the Extended (EXT) matching method. Copy the lines of the auto.mpg extended matching file (from your \templates subdirectory) that are appropriate for your match strategy into your job. In addition, specify the type of matching and the match threshold in your Auto Match Spec block. Refer to the Extended Matching Reference manual for details.

Advanced matching

Match/Consolidate has incorporated new functionality called Advanced Matching. It is activated with the Advanced (ADV) Matching Method in the Execution Options of the job. This type of matching, which can incorporate both standard match criteria and rule-based (extended) matching, eliminates the need for multi-pass processing by allowing you to use multiple match criteria and levels of matching within any given job. Combining the results of different match criteria within the same job lets you do association-type matching of records that previously required multiple passes.

Multi criteria matching

Multi criteria matching lets you use multiple matching strategies within a single job. For example, you can process a file of domestic addresses and another file of international addresses in the same job by setting up separate match criteria for each file.

Multi criteria matching also lets you do association-type matching. A typical application would be the need to bring together records of people who maintain households in multiple locations, by having one match criteria set to match on name and address, and another match criteria for matching on name and address or cell phone number. You can then get the overlapped results of the multiple criteria (name and address overlapping with name and address) in a single, combined match set or dupe group. This is commonly referred to as association.

Multi level matching

Multi level matching gives you the capability of performing n-per-family and n-per-firm applications in a single pass. Perhaps you need to select two individual members from each family in your database. Using multi level matching, you can set up one criteria for matching on address (resident-level), another criteria for matching on last name (family-level), and a third criteria for matching on first and middle name (individual-level).
Once MCD matches records to an address, they are passed down to the next level (family) to determine how many family names exist at the resident level. From there, records with the same address and last name are passed down to the third level (individual) to determine the number of individuals at the same address. You can use this same basic concept when you find it necessary to select a specific number of individuals per company (n-per-firm).

Constant Key ID

You can use the Constant Key ID feature to assign sequential identification numbers to each new record when adding records to a data warehouse. For example, the largest number assigned in a particular job can be carried over as the beginning identification number (plus 1) to be used in the assignment of new sequential IDs. This occurs when MCD processes the next purchased list against the data warehouse file.
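The n-per-family idea from the multi level matching discussion can be sketched in Python as three nested levels of grouping. This is only an illustration under simplifying assumptions: exact equality stands in for MCD's match criteria at every level, and the function name and record field names are hypothetical.

```python
from collections import defaultdict


def select_n_per_family(records, n):
    """Keep at most `n` distinct individuals per family at each address."""
    # Levels 1 and 2: group by address (resident), then by last name (family).
    families = defaultdict(dict)
    for rec in records:
        family_key = (rec["address"], rec["last"])
        individual_key = (rec["first"], rec["middle"])
        # Level 3: one entry per individual; the first record seen is kept.
        families[family_key].setdefault(individual_key, rec)

    selected = []
    for individuals in families.values():
        selected.extend(list(individuals.values())[:n])
    return selected


# Hypothetical input: one family of three (with a duplicate) and one single.
people = [
    {"address": "12 MAIN ST", "last": "SMITH", "first": "JAMES", "middle": "W"},
    {"address": "12 MAIN ST", "last": "SMITH", "first": "STACEY", "middle": "K"},
    {"address": "12 MAIN ST", "last": "SMITH", "first": "JAMES", "middle": "W"},
    {"address": "12 MAIN ST", "last": "SMITH", "first": "ANN", "middle": ""},
    {"address": "61 SUMMIT AVE", "last": "BRADLEY", "first": "LEE", "middle": ""},
]
two_per_family = select_n_per_family(people, 2)
```

The same structure would apply to n-per-firm selection by grouping on firm data instead of last name.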

Use reports to examine the matching process

Match/Consolidate can produce a number of reports that detail what happened in the matching process. Some of these reports show you statistics, and others show you the actual data used in the matching process. The following table summarizes the reports that will be most useful to you in assessing the results of your match process.

Report: Duplicate Records
Description: This report lists each record of each dupe group, that is, each group of matching records. For each group, the master record is shown first, followed by the subordinate dupes. This listing can help you decide if your match criteria are too loose. If you see records in a dupe group that, based on what you see here, are not really duplicate records, then tighten up your criteria to eliminate those matches.

Report: Sorted Records
Description: This report lists input records as sorted in any of the eight preset sortations from which you can choose, or in a custom sortation of your own design. The default sort is geographic (that is, by ZIP Code, state, city, address, and so on). When the records are sorted geographically, it's likely that matching records will be listed close together in this report. This report can help show you if your criteria are too tight. If you see records that appear to be duplicates but which MCD has not identified as members of the same dupe group, then loosen up your criteria to identify more matches.

Reports: List-by-List Match, Multi-List Match, List Duplicates
Description: These reports answer questions about matching between and among the lists of your job. For example, how many of your list1 records were found to match a list2 record? How many list3 records matched other list3 records? Here you can see the exact numbers of records that will be kept and dropped. Refer to Chapter 3 for information about lists.

Report: Match Results
Description: This report details the results of an extended matching process. If you are not using extended matching, this report can't be generated. Refer to the Extended Matching Reference manual for information about extended matching. The Match Results report shows the numbers that describe what part of your matching process generated results: how many comparisons were made, how many matches were found, how many no-matches, and so on. From the data on this report, you can assess how your match setup performed, and you may see ways to improve the speed or effectiveness of your matching process.

Chapter 9: Engineer key data

To speed comparison time and to more precisely control the matching results, Match/Consolidate (MCD) looks at keys rather than looking at the whole record. This chapter explains keys and how to use them.

Key files

The key file is a working file that MCD uses to hold the data that's used in placing, matching, and ranking (prioritizing) your records. You won't read or use this file; only the MCD process will. However, the Key Information versions of the Sorted Records report, Duplicate Records report, and Unparsed Records report can show you what's in the key file.

The match process compares data from one record to corresponding data from another record. However, comparing all the record data would take far too much time for most purposes. Additionally, comparing some parts of the data might actually be counterproductive. For example, comparing telephone numbers, which frequently change, might prevent many records from being identified as matches.

Therefore, instead of using all the record data, your matching process normally depends on key data: data that you, the MCD user, identify as the significant parts of the record to use for finding matches. That data is stored in the MCD key file.

Each key represents a record

The key file contains a string of data for each record to be processed. You identify each field and the number of characters to use in the key. For example, you may want to store 12 characters of the last name data, 30 characters of firm data, 10 characters of primary range data, and so on.

[Figure: records from several input files pass through your key field settings, which define the key stored for each record in the key file.]

Your main concern should be to make sure the key file contains all the data it needs to complete the process. However, you don't want to overload the key file: if you avoid copying unnecessary data into it, you can speed up your job and save disk space.
Key contents

Each key field may contain data directly from the record, or it may also contain other MCD-generated data, depending on which matching options have been selected. For example, if you elect to use extended address parsing rather than standard address parsing, your address-related key data may be different from your input file data for some fields and for some records.
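The idea of a key as a fixed-length string of selected fields can be sketched in Python. The layout below uses the field lengths from the example above (12, 30, and 10 characters); the function name, field names, upper-casing, and space padding are assumptions made for illustration and do not describe the actual format of the MCD key file.

```python
# Hypothetical key layout: (field name, number of characters kept in the key).
KEY_LAYOUT = [("last_name", 12), ("firm", 30), ("primary_range", 10)]


def build_key(record, layout=KEY_LAYOUT):
    """Build one fixed-length key string from selected record fields."""
    parts = []
    for field, length in layout:
        value = record.get(field, "").upper()
        # Truncate long values and pad short ones so every key has the
        # same length and every field starts at the same offset.
        parts.append(value[:length].ljust(length))
    return "".join(parts)


sample_key = build_key(
    {"last_name": "Casillo", "firm": "", "primary_range": "12", "phone": "555-0100"}
)
```

Fields that are not in the layout, such as the phone number in the sample record, never reach the key, which keeps the key file small.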

Define key fields

To define key fields, map the database fields to be used for matching to PW fields (refer to Database Prep for support file requirements).

With standard matching, you can control which of the listed key fields are included in your key file and the length of each of those key fields. In Views, at Matching Criteria, select one of the five preset matching strategies by clicking the Defaults button. Or, if directly editing the job file, copy the appropriate Match Criteria block from the match.mpg file (from your \templates subdirectory). You can edit the defaults set by your selected strategy.

With extended matching, you control which of the key fields are included and the length of each of those key fields. Define each individually with Key Length parameters in the Parsing and Key Options block of your extended matching file, or have MCD do it for you via the Auto Generate Key Lengths parameter of that block.

Include record keys only as needed

You may be able to save some processing time and file space by excluding from the key file those record keys that will not be members of break groups, and therefore can never be identified as matches.

When this would help

The advantage of limiting the key file in this way increases with the size of your input database and decreases with the number of potentially matching records in the database. For very large files with few matching records, you may see an appreciable reduction in the time required to run your job.

Here is a simplified illustration. Suppose your MCD job is comparing your transaction database (a smaller, regional file) with a large, national database that includes 15 records in each of 50,000 ZIP Codes. Further assume that you want to form break groups based only on the ZIP Code.

region.dbf: 1,500 records in 40 ZIP Codes
nation.dbf: 750,000 records in 50,000 ZIP Codes

Without the As Needed option, MCD reads all records of both files to create and store keys for each record.
Keys stored: region.dbf 1,500; nation.dbf 750,000; total 751,500

With the As Needed option, the key file includes only the region.dbf keys plus those nation.dbf record keys having the same ZIP Code as found in the region.dbf record keys.
Keys stored: region.dbf 1,500; nation.dbf about 600 (40 x 15); total 2,100

These are the results if the records are evenly distributed across all 50,000 ZIP Codes and all 40 of the ZIP Codes in region.dbf are represented in nation.dbf as well.

What record keys would not be included in the key file? Any from nation.dbf that would not be assigned to a break group that was formed when inputting record keys from region.dbf. In this example, that means any with a ZIP Code different from those found in region.dbf.
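The arithmetic in this illustration can be reproduced with a short Python sketch. It models only the counting: ZIP Codes are represented by integers, the function name is hypothetical, and the data is the evenly distributed case assumed above.

```python
def count_keys_as_needed(all_file_zips, as_needed_file_zips):
    """Count keys stored when the second file is read with As Needed.

    Break groups are formed from the ZIP Codes of the file set to All;
    a record from the As Needed file gets a key only if its ZIP Code
    already has a break group.
    """
    break_groups = set(all_file_zips)
    kept = sum(1 for zip_code in as_needed_file_zips if zip_code in break_groups)
    return len(all_file_zips) + kept


# region.dbf: 1,500 records spread over 40 ZIP Codes (numbered 0..39 here).
region_zips = [i % 40 for i in range(1500)]

# nation.dbf: 15 records in each of 50,000 ZIP Codes (numbered 0..49999).
nation_zips = [zip_code for zip_code in range(50000) for _ in range(15)]

keys_without_option = len(region_zips) + len(nation_zips)
keys_with_option = count_keys_as_needed(region_zips, nation_zips)
```

In this model, the key file shrinks from 751,500 keys to 2,100 keys, matching the totals in the illustration.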

For example, the figure below shows which record keys (by ZIP Code) would be included in the key file.

[Figure: region.dbf record keys and nation.dbf record keys by ZIP Code, each marked Yes or No for inclusion in the key file.]

Set up this option

In each Input File block or window of your job, set the control for Keys for Break Groups to All or As_Needed. However, do not set all your input files to the As_Needed option. Set your first input file to enable storage of keys for all its records (the All setting for the Keys for Break Groups control). Match/Consolidate issues an error message during verification if it does not find an input file set to All in your job setup.

Do not set an input file's Keys for Break Groups control to As_Needed unless only a small portion of that file's records will belong to break groups. If a large portion of the file's records in fact belong to break groups, As_Needed processing will take more time than would have been required to simply create the keys for all the records.

When this parameter is set to As_Needed, records from the As_Needed file that do not match the break key criteria of records from a smaller file are dropped from input. They will not be processed and will not be passed through as unique records.

Run the process

The MCD program issues an error message during verification if you have set any input file to As Needed and the job does not include breaking, for example:

- If no break fields have been set up
- If the Find Duplicates execution option is set to No

Match/Consolidate first processes the input files that are set to All, then processes any set to As Needed. If this changes the input order, then the order of the records within their break groups may be affected, resulting in a change in the selection of driver records for the match comparisons. Refer to "How record order affects comparisons" on page 204 for details about how this might affect your matching results.

Define key fields

Simple match programs might use a match key of just a few characters from selected fields. More sophisticated programs, like applications based on MCD's extended matching method, can save much more data in each key, resulting in more intelligent matching.

Field length

Field length is the maximum number of bytes of data that you want to store in a key field. A field length setting for first name data would have to be at least 8 bytes in order to store the complete data for the first name Jennifer. If the length were set to 4 bytes, Bill would be stored completely, but Jennifer would be stored as Jenn, which might affect matching results later.

Blank field settings

The data in one or more of the fields to be compared may be empty. In such cases, you can set the response of your MCD job or library application. For example, with the standard matching method in your MCD job, you can call such field comparisons matches, or call them non-matches. Using the extended matching method, you can also define scores to be assessed for such comparisons. For more information, refer to "Blank-field priority" on page 52.

Field count

The field count is the number of times a field might be found in the key. For example, you might have more than one person in a record, and you would like to use both of them for matching. The following fields can have a value greater than one: prename, first name, middle name, last name, postname maturity, postname other, and gender.

Alternates

An alternate is a variation of the field data for a key field type. For example, an alternate for Bill might be William. When making comparisons, you may want to use the original data and one or more alternates. If the first names are compared but don't match, the alternates will then be compared. If the alternates match, the two records will still have a chance of matching, rather than failing because the original first names were not considered a match.
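The effects of field length and alternates can be sketched in Python. The function names are hypothetical, upper-casing is an assumption, and exact equality stands in for whatever comparison your match settings would actually perform.

```python
def store_in_key(value, length):
    """Keep at most `length` characters of a field value in the key."""
    return value.upper()[:length]


def first_names_match(key_a, key_b):
    """Compare two first-name key entries of the form (original, alternates).

    The originals are compared first. If they differ, each name and its
    alternates are compared against the other name and its alternates.
    """
    original_a, alternates_a = key_a
    original_b, alternates_b = key_b
    if original_a == original_b:
        return True
    names_a = {original_a, *alternates_a}
    names_b = {original_b, *alternates_b}
    return bool(names_a & names_b)


short_bill = store_in_key("Bill", 4)          # stored completely
short_jennifer = store_in_key("Jennifer", 4)  # truncated to four characters
```

For example, `first_names_match(("BILL", ["WILLIAM"]), ("WILLIAM", []))` succeeds through the alternate even though the original first names differ.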

Standardize key data for lastline information

Standardizing your records' city, state, and ZIP Code information (lastline data) can improve the performance of your record matching program. Note that different names for the same place are not uncommon. Examples are unincorporated towns and what the post office calls vanity names and place names. For example, to the Postal Service, a Hollywood address may be a Los Angeles address.

Many users standardize their lastline data before MCD processing by processing their records through an address correction program like ACE or International ACE. Addresses that have been assigned in this way have proper spelling of U.S. city names and the proper postcode or ZIP Code. If the lastline data in your records has been standardized, then your key data will, in effect, be standardized. However, if your record data has not been standardized through ACE, you can direct MCD to standardize the lastline data for your record keys.

When your records' lastline key data is standardized, you can increase the precision of your match tolerances. That is, you can use tighter matching. With tighter matching you should see the following:

- Faster processing, from a significantly reduced volume of comparisons
- More accurate results, from fewer false matches

In fact, with standardized data, exact matching criteria can often be used. That setting results in a much faster processing algorithm than the other values (tight, medium, and loose).

Directories and dictionaries

To standardize key data, MCD looks up your input address and lastline data in dictionaries and directories that are provided with MCD, as shown in the chart that follows. As the chart shows, MCD uses different resource files, depending on which type of address and lastline parsing you choose: standard or extended.

If it finds a match standard for your record's address and lastline data, MCD generates the key data from that standard, rather than from the raw input address data.
If it does not find a match standard (or if the job is not set to standardize lastline key data), then MCD generates key data from the input address and lastline data.

Resources for address and lastline parsing:

City directory file: city0x.dir
ZCF directory file: zcf0x.dir
Address dictionary file: addrln.dct
Lastline dictionary file: lastln.dct
International directories: gaz...

If you use extended parsing, MCD also uses the zip4us.dir and revzip4.dir directories.

When you select standard address and lastline parsing, MCD can standardize the spelling of suffix and directional information in your address data. In addition, it can perform the processes described below to find match standards for your lastline (city, state, ZIP Code) data.

Your records include city and state, but no ZIP Code

In this situation, MCD reads the city and state and looks for that data in the city0x.dir file. If it finds that entry, MCD looks for a match standard for that place. If it finds a match standard, MCD copies the city name, state name, and ZIP Code data for the match standard into MCD work files for use in generating that record's match key. If it does not find a match standard, MCD copies the record's raw data for city name and state into MCD work files for use in generating that record's match key.

Your records include city, state, and ZIP Code

In this situation, MCD reads the ZIP Code first and looks for that entry in the zcf0x.dir file. If it finds that entry, MCD compares the city and state of the record data to that found in the zcf0x.dir file. If the elements agree, MCD copies the match standard city, state, and ZIP Code into work files for that record's match key.

If the city, state, and ZIP Code data do not agree, MCD looks for agreement in two of the three elements. If the city name and state agree, but the ZIP Code doesn't, MCD assumes the record's ZIP Code is wrong and looks in the city0x.dir file for the city name and state found in the record. If it finds a match standard, MCD copies that data into work files for use in generating the record's match key.

When you select extended address and lastline parsing, MCD can standardize many of the components of your address data (such as directionals, suffixes, and street names) because it looks up the standardized address in postal directories. In addition, it standardizes the lastline data (city, state, ZIP Code) from standardized lastline data in postal directories.
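The lookup order for standard lastline parsing can be sketched in Python. This is a simplified model, not MCD's implementation: two small dictionaries with made-up places stand in for the city0x.dir and zcf0x.dir files, the function name is hypothetical, and cases that the description above does not spell out (for example, ZIP Code and state agreeing while the city differs) simply fall back to the raw data.

```python
# Hypothetical stand-ins for the directory files, with fictional entries.
# CITY_DIR maps (city, state) to a match standard (city, state, ZIP Code).
CITY_DIR = {
    ("SPRINGVALE", "XX"): ("SPRINGVALE", "XX", "00010"),
    ("OLD SPRINGVALE", "XX"): ("SPRINGVALE", "XX", "00010"),
}
# ZCF_DIR maps a ZIP Code to the (city, state) it belongs to.
ZCF_DIR = {"00010": ("SPRINGVALE", "XX")}


def standardize_lastline(city, state, zip_code, city_dir=CITY_DIR, zcf_dir=ZCF_DIR):
    """Return the (city, state, ZIP Code) to use when generating key data."""
    city, state = city.upper(), state.upper()

    # ZIP Code present: check it first, then confirm city and state agree.
    if zip_code and zcf_dir.get(zip_code) == (city, state):
        return (city, state, zip_code)

    # No ZIP Code, or the ZIP Code is assumed wrong: look up city and state.
    standard = city_dir.get((city, state))
    if standard is not None:
        return standard

    # No match standard found: use the raw input data.
    return (city, state, zip_code)
```

Because the key is built from the returned values, two records that name the same place differently can still end up with identical lastline key data.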

Standardize key data for people's names

Standardizing key data improves the results of the MCD matching process by improving the handling of nicknames (like Cathy and Catherine) and alternate spellings (like Karyn and Karen). MCD finds a more formal name from which the nickname was probably derived, identifies that name as the match standard, and uses that match standard for matching.

Match standards do not replace the actual data in your records. As key data, the match standards just help out the actual data during the matching process.

Matching can be more accurate

Standardizing your key data does not change anything about the matching process. Instead, it refines the data that's fed into the matching process. With standardized key data, the matching process can be more accurate.

For example, if your database includes the name James W. Smith and your update data includes a record with the name Jim Smith, your matching level for first name would have to be set very loose to discern that these two names are the same person. Set very loose, there would be many false matches.

Standardizing the key data can fix this nickname dilemma. James and Jim have something in common: a match standard of James. By telling MCD to standardize name keys, you can keep the match level as tight as you want to eliminate false matches.

As shown below, the matching process includes the original and the standardized forms of the name in the comparison. Match/Consolidate would consider the two records to match if the key data was standardized and, in most situations, to be different if the key data was not standardized.

Input data James: unstandardized key JAMES, standardized key JAMES
Input data Jim: unstandardized key JIM, standardized key JAMES

Match/Consolidate looks up name-related information in a set of dictionary files. These auxiliary files (some for extended matching) help MCD parse elements out of name and firm fields. The extended parsing dictionaries produce more concise parsing from name fields, as well as from floating data in multi-line fields.

The following table lists the files that contain the name-related dictionaries. When you install the software, MCD copies these dictionaries to your MCD directory.

Name and firm parsing option: Standard
Pre-name dictionary: prename.dct
Name dictionary: name.dct
Pre-last name dictionary: prelname.dct
Post-name dictionary: postname.dct
Firm dictionary: (none)

Name and firm parsing option: Extended
Name dictionaries: parsing.dct, mlrules.gcf
Firm dictionary: fprules.gcf, firmln.dct

A match standard is a one-way relationship. For example, Al is likely to be a nickname that derives from any of six more formal names (Albert, Alan, Alfred, Alexander, Alphonse, and Almon), but it's most unlikely that any of those more formal names would be a nickname that derives from Al. Therefore, in the names dictionary, the match standard listed for each of the more formal names is only that name itself. As the match standard for Al, all six of the more formal names are listed. When more than one name is listed, they are listed in their order of frequency within the U.S. population.

Match/Consolidate can deal with different name formats because it's able to determine the different parts of names. For example, from this name data (first line below), MCD can identify two persons (second and third lines below) and standardize the key data for each word of each name.

Mr. and Mrs. James W. and Stacey K. Smith

Mr. James W. Smith
Mrs. Stacey K. Smith
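The one-way nature of match standards can be sketched in Python with a small lookup table. The table below is a hypothetical excerpt built from the names mentioned in this chapter, not the contents of the name dictionary, and the order of the names for Al is illustrative rather than a frequency ranking. The function names are also assumptions.

```python
# Hypothetical excerpt of a names dictionary: name -> list of match standards.
MATCH_STANDARDS = {
    "AL": ["ALBERT", "ALAN", "ALFRED", "ALEXANDER", "ALPHONSE", "ALMON"],
    "ALBERT": ["ALBERT"],   # a formal name lists only itself
    "ALAN": ["ALAN"],
    "JIM": ["JAMES"],
    "JAMES": ["JAMES"],
}


def standards_for(name):
    """Return the match standards for a name (the name itself if unlisted)."""
    name = name.upper()
    return MATCH_STANDARDS.get(name, [name])


def names_match(name_a, name_b):
    """Match on the original names or on any shared match standard."""
    if name_a.upper() == name_b.upper():
        return True
    return bool(set(standards_for(name_a)) & set(standards_for(name_b)))
```

Under this table, Jim matches James and Al matches Albert, while Albert and Alan do not match each other, because neither formal name lists the other as a match standard.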

Standardize key data for firm (company) names

Standardizing the key data for firm or company names improves matching by eliminating many of the differences in the ways people remember the names of businesses (and in the way companies name themselves, too). Standardizing firm data usually helps minimize false matches.

In standardizing key data for company names, MCD does the following:

1. It removes noise words, such as The, Corporation, and Limited (and their equivalents).
2. Then, if Extended Name, Title, and Firm parsing is used in this job, MCD looks up each significant word of the company name in a dictionary provided with MCD. If a match standard is found in the firm dictionary, MCD generates the key data from that standard, rather than from the raw input company name. If it does not find a match standard (or if the job isn't set to standardize firm key data), MCD generates the key data from the input company name.

The standardized firm key data does not replace the data in your records; rather, MCD uses the standardized data in place of that data during the matching process. The following are examples for some typical company names:

Input data firm name                    Standardized key data
Community Care Inc                      Community Care
Community Care                          Community Care
The Community Care Clinic               Community Care Clinic
Leary Management Corp                   Leary Mgmt
Leary Mgt                               Leary Mgmt
Leary Mgt, Inc.                         Leary Mgmt
The Center for Effective Living, Ltd    Center for Effective Living
Effective Living Center                 Effective Living Center
Effective Living Limited                Effective Living

Note that, in the Community Care examples, the length of the key data is limited to the size of the key field for that data. In this case, if you set 15 characters for your firm key data, then the key data for all three records would be the same: Community Care.

In the Effective Living examples, MCD compares each word of the company name. As a result, a different order of words does not keep a record from matching another whose key data includes those same words in a different order.
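The firm standardization steps above can be sketched as follows. This is an illustration under stated assumptions, not MCD's implementation: the noise-word set, the standards table, and the `firm_key` function are hypothetical stand-ins for the dictionaries shipped with the product, and the key is upper-cased here purely for convenience.

```python
import re

# Hypothetical stand-ins for the product's dictionaries.
NOISE_WORDS = {"THE", "INC", "CORP", "CORPORATION", "LTD", "LIMITED"}
FIRM_STANDARDS = {"MANAGEMENT": "MGMT", "MGT": "MGMT"}


def firm_key(firm_name, key_length=None):
    """Drop noise words, apply match standards, then truncate to the key size."""
    words = re.findall(r"[A-Za-z0-9]+", firm_name.upper())
    kept = [FIRM_STANDARDS.get(word, word) for word in words
            if word not in NOISE_WORDS]
    key = " ".join(kept)
    if key_length is not None:
        key = key[:key_length].rstrip()
    return key


if __name__ == "__main__":
    for name in ("Leary Management Corp", "Leary Mgt, Inc.",
                 "The Community Care Clinic"):
        print(name, "->", firm_key(name, key_length=15))
```

With a 15-character key field, the three Community Care names collapse to the same key, which matches the behavior described in the text.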

Chapter 10: Engineer break groups

This chapter defines and explains break groups and how to plan your breaking strategies.

Form break groups

Breaking places records that are likely to match into groups. Although it is not necessary to incorporate breaking in your job, it can save time for all but the smallest jobs.

In general, on any field where you use exact matching, you can use breaking. Fields commonly used for breaking are ZIP Codes, account numbers, or the first two positions of a street name. The following example shows the effects of breaking on the first four digits of a ZIP Code.

Input records: Record 01 through Record 23

Break groups:
Group 1: Records 01, 07, 10, 21, 22
Group 2: Records 02, 03, 14, 15
Group 3: Records 04, 09, 11, 12, 13, 16
Group 4: Records 17, 18, 19, 20, 23
Group 5: Records 05, 06, 08

In order for record keys to be assigned to the same break group, break field data must match exactly. Based on a break priority you can assign to each list of your job, you can also have Match/Consolidate (MCD) sort the records within each break group.

The result of break grouping is that every input record key becomes a member of a break group. Each break group has at least one record key. Even though we call them groups, there may be only one record key in a break group. The size of each break group depends on what you've selected to break on, and how diverse that value is in the input records.

Increase the speed of the matching process

Forming break groups can increase the speed of the matching process because it eliminates a great number of comparisons. The following example shows how breaking can help you reduce your processing time. However, your breaking strategy can affect your match results, too (refer to Break strategies on page 190).
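As a rough illustration of exact-match breaking on part of a field, the sketch below groups records by the leading characters of one key field. The record layout (dictionaries with `id` and `zip` keys) and the function name `form_break_groups` are assumptions made for the example; they are not MCD job parameters.

```python
from collections import defaultdict


def form_break_groups(records, field, start=0, length=3):
    """Group records whose break-field substring matches exactly."""
    groups = defaultdict(list)
    for record in records:
        break_value = record[field][start:start + length]
        groups[break_value].append(record)
    return dict(groups)


if __name__ == "__main__":
    # Hypothetical input records.
    records = [
        {"id": 1, "zip": "54601"},
        {"id": 2, "zip": "54603"},
        {"id": 3, "zip": "55101"},
    ]
    # Break on the first four digits of the ZIP Code.
    for value, members in form_break_groups(records, "zip", 0, 4).items():
        print(value, [r["id"] for r in members])
```

Every record lands in exactly one group, and a group may hold a single record, as the text notes.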

For detailed information about the record key comparison process, refer to Engineer your match setup on page 195.

MCD performs your job's comparison process on only the records within the same break group. Records in one break group are never compared with records in another break group.

[Figure: two break groups. The first holds Records 0001, 0007, 0010, 0021, and 0022; the second holds Records 0004, 0009, 0011, 0012, 0013, and 0016. Match/Consolidate compares the records of a break group to each other. It does not compare them with records of another break group.]

The number of comparisons grows roughly with the square of the number of input record keys. As shown in the following example, as the number of records increases, the number of comparisons increases significantly.

Records to be searched    Total number of comparisons
1,000                     499,500
10,000                    49,995,000
100,000                   4,999,950,000
1,000,000                 499,999,500,000

The formula for the number of comparisons (with N the number of input records) is:

$\frac{N^2 - N}{2}$

Control breaking

If you are using MCD Job or Views, your breaking options vary, based on whether you are using standard, advanced, or extended matching, as shown below. Refer to your Job-File Reference (or Views online help) and the Extended Matching Reference for details.

Standard and advanced matching
Set the break fields in the Matching Criteria block or screen of the job. Break on up to 16 of the MCD key fields, and specify the length of the break field and the starting position of that field. For example, break on the first three characters of the ZIP Code (that is, start on character one and use characters one, two, and three). If you want to set the list break priority, do so in the Input List or Input List Defaults block or screen.

Extended matching
Set the break fields in the parameters of the extended matching file to be used with this job. Break on any MCD key field. Specify the length of the break field and the starting position of that field. For example, break on the first three characters of the ZIP Code (that is, start on character one and use characters one, two, and three). You control the order in which the fields are used. Or, with auto match, direct that MCD automatically set break fields for you. In addition, you can control the size of the break groups by combining small break groups into larger ones, up to the size of your work buffer. Set an upper limit on the number of keys to combine in making these larger break groups.
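The comparison formula, and the saving that breaking brings, can be checked with a few lines of arithmetic. This is a sketch only: the function names are made up for the example, and the group sizes in the usage section (23 records split into groups of 5, 4, 6, 5, and 3) are illustrative.

```python
def total_comparisons(n):
    """Pairwise comparisons among n record keys: (N^2 - N) / 2."""
    return (n * n - n) // 2


def comparisons_with_breaking(group_sizes):
    """Comparisons when records are compared only within their break group."""
    return sum(total_comparisons(size) for size in group_sizes)


if __name__ == "__main__":
    for n in (1_000, 10_000, 100_000, 1_000_000):
        print(f"{n:>9,} records -> {total_comparisons(n):,} comparisons")

    sizes = [5, 4, 6, 5, 3]  # illustrative break-group sizes, 23 records in all
    print("without breaking:", total_comparisons(sum(sizes)))
    print("with breaking:   ", comparisons_with_breaking(sizes))
```

The worst-case count assumes no matches are found; as the next chapter explains, matches found along the way reduce the number of comparisons actually made.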

Break strategies

When you choose a field (or fields) for breaking, consider the quality of your input data and the geographical spread of your addresses. Some matches may be missed if you break on any of the following:

- a field that contains unstandardized data
- a field that is blank in some records
- name, firm, or address-line components
- 5-digit ZIP Code

Unstandardized data

If possible, the field on which you break should contain standardized data (such as from ACE and DataRight). Otherwise, typing errors or inconsistencies may cause missed dupes. In the example below, if you break on the first three characters of the Last Name key, you will fail to catch these matching records. The same risk pertains to street name and other fields.

Jane Mc Donald        Jane McDonald
100 Main St           100 Main St
La Crosse WI          La Crosse WI

Match/Consolidate can standardize city, state, and ZIP data for reliable breaking. For reliable results, a 3-digit or 5-digit ZIP Code makes a good choice for breaking. If you have processed addresses with ACE, Street Primary Name may also be used for breaking.

Blank fields

Breaking is risky on any field that is empty in some records. All records that are empty in the break field are lumped together in one catch-all break group. For example, with the records below, if you break on Social Security Number and the field is blank in one of the records, you wouldn't catch these records as dupes because they would never be compared.

Jane Mc Donald        Jane McDonald
100 Main St           100 Main St

Match options

Some breaking strategies may make your match options irrelevant. For example, with the records below, if you break on the first two or three characters of firm name, the Ignore Firm if Names Match option will be ineffectual, and you would miss these dupes.

Rita Terranova        Rita Terranova
ETI                   Eco Technologies
100 Bren Rd           Bren Rd

Next, consider the matching option Match on Street, RR, or PO Box. If you break on street range, MCD cannot find the match in the following example because the two records will be placed in separate break groups. A PO box address has a blank Street Primary Range key.

[Example: two Acme Hardware records, one with the address 100 Elm Ave and PO Box 300, the other with a PO box address.]

5-digit ZIP Code

Some ZIPs serve only PO boxes. This is important when you are matching business addresses, which sometimes use a street address and sometimes use a PO box. As shown in this example, if you break on all five digits of the ZIP Code, you may fail to find the following match.

Acme Hardware                 Acme Hardware
100 Main St                   PO Box 100
(La Crosse, WI streets)       (PO Boxes)

Size of break groups

The size of your break groups may have a significant impact on performance for larger jobs because the larger your break groups become, the more comparisons MCD must make.

For example, let's assume that you have a very large file and all of the records fall within the same ZIP Code range. Breaking only on ZIP Code is not going to be beneficial in reducing comparisons. You will want to identify another field within your database that you can use for breaking to create more, smaller break groups. Depending on your data fields and your business rules for matching, you can use the first few characters of the primary street name, firm, or account number.

Prioritize your break group records

When Match/Consolidate assigns break groups, the input order does not affect the composition of the different break groups. Regardless of the order in which records are input, a break group will have the same members. The order of the records within the break group normally reflects the order in which they are input into the break group process.

By using the List Break Priority parameter in the List Description block, you can control which records drive the matching process. You may want to control break groups in order to have your best or most complete records driving the matching process. Another reason may be that you want your suppression records as drivers. Or, if you have a large, previously de-duped file, you may want to allow a smaller update file to lead the matching process and find possible dupes that went undetected in your large file.

Why order is important

Engineer your match setup on page 195 explains in detail how the order of the records within the break groups can affect your match results. In summary, MCD sends records to the matching process in the order of the records within their break group. That order determines which records become driver records and which do not. The better the driver record, the better your match results. (Refer to Compare record keys: the driver record on page 196 for details about the driver record and match results.)

Control the input file order

The program inputs records in the order that the Input File blocks are listed in your job file. So, to change the order, rearrange the Input File blocks in your job file. If you are not using lists in your job, this method is the best way to affect the break group record order.

You may be able to use your database program to reorder the records within your input file(s), so your better records are input before your less preferable records. If you are using only one input file, and do not include lists in your job, this is the only way you can affect the order of the records within the break groups.

Assigning break group priority

If your job includes lists, you can assign a break group priority to your records, based on their list membership. Match/Consolidate can then use that priority to re-order the records within each break group.

If you have defined a list in your job with an Input List Description block, assign its records break priority with the List Break Priority parameter. If you use Views, set the break priority with the List Break Priority control at the Setup Input List window.

If you direct MCD to automatically set up lists based on the field LIST_ID, then you can assign the break priority through the DRIV_PRIOR field in your input file's definition (DEF) file.

To assign the same break priority to all the input file's records, assign a constant (a number from 0 to 255) to DRIV_PRIOR. If your input file includes more than one list, and you'd like to assign different break priorities to the records of those lists, then you can assign DRIV_PRIOR to a field that contains appropriate data (that is, a number from 0 to 255).

Break-group analysis

To save processing time and perhaps improve matching performance, evaluate your break-group setup before running the full matching process. Using the Predict option of the Find Duplicates control, in the Execution options block, run the matching process in steps:

1. In the Execution section, set Find Duplicates to Predict. Run the job. Match/Consolidate will form break groups, but will not perform the full matching process.
2. Check the Job Summary report and adjust your break-group setup as explained below.
3. Set Find Duplicates to Yes (instead of Predict) and run the job again. Match/Consolidate will perform the full matching process.

Job Summary report

To assess breaking, check the Duplicate Matching Information section of the Job Summary report (see below).

Net Input Records:
Breaking
  Fields used for Breaking: Street Name, 3  ZIP, 5
  Maximum Work Buffer Keys:
  Total Break Groups: 2497
  Largest Break Group (# of Records):

Adjust your break groups

Consider the following tips when adjusting your break groups.

Do you have enough memory?

On the Job Summary report, compare the Maximum Work Buffer Keys with the Largest Break Group. If the greater number is Maximum Work Buffer Keys, you're fine. If the Largest Break Group is greater, then at least one of your break groups is too large to fit into memory.

MCD runs the matching process much faster if it can fit an entire break group into memory. If a break group is larger than the memory available, and part remains on hard disk, there is more disk-access time, and the process slows down. Either adjust your breaking to make more, smaller groups, or increase the memory available (add RAM, check Max Work Buffer Size in the Execution section).

Is your breaking strategy effective?

On the Job Summary report, compare the Net Input Records with the Total Break Groups and Largest Break Group. Suppose your Net Input Records is 1,000,000. Consider the extremes:

If the report says you have one break group and your (largest) group is 1,000,000 records, you're not getting any breaking at all, and the matching process probably is going to take far longer than necessary.

Conversely, if you have 1,000,000 break groups and the (largest) group is one record, you're not going to find any matches at all.

Somewhere between those two extremes is the right compromise between finding matches and saving processing time.
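The two checks described above (memory fit and breaking effectiveness) amount to a few comparisons of Job Summary figures. The sketch below is hypothetical: `assess_breaking` is not an MCD tool, the message texts are invented, and the numbers in the usage section are made-up values rather than figures from a real report.

```python
def assess_breaking(total_break_groups, largest_break_group,
                    max_work_buffer_keys):
    """Return a list of warnings derived from Job Summary report figures."""
    findings = []
    if largest_break_group > max_work_buffer_keys:
        findings.append("largest break group does not fit in the work buffer")
    if total_break_groups == 1:
        findings.append("no breaking: all records are in one break group")
    if largest_break_group == 1:
        findings.append("every break group holds one record: no matches possible")
    return findings


if __name__ == "__main__":
    # Made-up figures for a job with 1,000,000 net input records.
    print(assess_breaking(total_break_groups=1,
                          largest_break_group=1_000_000,
                          max_work_buffer_keys=50_000))
    print(assess_breaking(total_break_groups=2_497,
                          largest_break_group=1_200,
                          max_work_buffer_keys=50_000))
```

An empty result only means neither extreme applies; finding the right compromise between the extremes still takes judgment about your data.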

Chapter 11: Engineer your match setup

This chapter explains the controls for the match process. Use this information to adjust match settings for your particular data and needs.

Compare record keys: the driver record

In this second phase of the duplicate detection step, the Match engine compares one record key to another record key in order to determine whether the two records should be considered a match. You may be familiar with the term matching records; others use the terms dupes or duplicate records. Whichever terminology you use, the process and the results are the same.

Before explaining how the Match engine decides which record keys match and which don't, let's look at the overall scope of the comparison activity (that is, the comparisons Match/Consolidate (MCD) must make, and the order in which the records are compared).

If you need to find out why your results weren't what you expected, or when you don't understand why two records were called a match (or not), quickly finding the best answer often depends upon knowing which record is driving the comparisons, and which records were, in fact, compared. Note that the primary tool for assessing duplicate detection, the Duplicates Records report, indicates the driver record with an asterisk (*).

Comparisons start with the driver record

The driver record is what we call the record to which other records within a break group are compared. Based on their similarity to the driver record, the Match engine designates each other record as a matching record or not.

Here's the duplicate record search order for a break group of 10 records. The numbers inside the chart show a sequence number for each comparison.

[Chart: duplicate record search order for a break group of 10 records, with one row and one column per record and a sequence number for each comparison. An annotation marks the ninth comparison.]

When matches are found among the records of the dupe group, then some comparisons can be eliminated to save processing time. Before we explain how that happens, first consider the simplest scenario: What happens if no matches are found in the comparisons?

When there are no matches

Record #1 is the driver record for the first round of comparisons. Record #1 is compared with the non-driver records #2, #3, #4, and so on through all the records of the break group.

Then, after those comparisons, record #2 becomes the driver record. It is compared with record #3, #4, and so on, through #10. Then #3 becomes the driver record for comparisons 18 through 24, and so on. Finally, record #9 is the driver record for its comparison to record #10.

But what if some records are found to be matches?

Once a record is identified as a duplicate record, it is not used as the driver record for any subsequent comparisons. Here's an illustration of the comparisons made in the same 10-record dupe group when matches are detected. For example, let's assume records #2, #5, and #7 are found to match record #1 when the first round of comparisons is made, with record #1 as the driver.

Because record #2 matches record #1, record #2 will not become a driver record for comparisons with the other records. Nor will records #5 and #7, because records 2, 5, and 7 are eliminated from the process. As a result, the comparisons will be as shown in the figure below.

[Chart: comparison sequence for the same 10-record group when records #2, #5, and #7 match driver record #1.]

Also, any matching records found in the remaining comparisons would further reduce the number of comparisons needed for this dupe group. For example, if record #8 were found to match driver record #6, then comparisons 27 and 28 would not be made.
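The driver-record sequence can be simulated in a few lines. Treat this as one reading of the text rather than a description of the Match engine: it assumes a matched record never becomes a driver but is still listed as a non-driver in later rounds, an assumption chosen because it reproduces the comparison numbers quoted above (18 through 24 for driver #3, and 27 and 28 for driver #8). The function name and the way matches are supplied are hypothetical.

```python
def comparison_sequence(n_records, matches=()):
    """Return the ordered (driver, other) pairs for one break group.

    matches: iterable of (a, b) record-number pairs the engine would call dupes.
    """
    match_pairs = {frozenset(pair) for pair in matches}
    matched = set()   # records already identified as duplicates
    sequence = []
    for driver in range(1, n_records + 1):
        if driver in matched:
            continue  # a duplicate record never becomes a driver
        for other in range(driver + 1, n_records + 1):
            sequence.append((driver, other))
            if frozenset((driver, other)) in match_pairs:
                matched.add(other)
    return sequence


if __name__ == "__main__":
    no_matches = comparison_sequence(10)
    print(len(no_matches), "comparisons when nothing matches")

    with_matches = comparison_sequence(10, [(1, 2), (1, 5), (1, 7)])
    print(len(with_matches), "comparisons when #2, #5, #7 match #1")
    print("comparisons 27 and 28:", with_matches[26], with_matches[27])
```

Adding the pair (6, 8) to the matches removes record #8 as a driver, so the two comparisons it would have driven drop out of the sequence.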

What makes records match

How does the Match engine decide that a pair of record keys match? The specifics of the comparison process vary with the method of matching that you set up. In all cases, though, the Match engine compares one record key to another, one pair at a time, and within the two keys, one field at a time. If the fields' key data match closely enough, the pair are considered dupes. If not, another record key pair is fed to the Match engine for comparison.

What's close enough?

You decide what's close enough to call a match. You can decide field by field, or let MCD decide for you, or use a combination of both; see page 196 for details. For now, we'll just say that you set a minimum similarity level for each key field comparison, and that the similarity level you set is called match criteria.

You control the match criteria

Your match criteria determine the effectiveness of the dupe search. If your criteria are very restrictive, the program may fail to match some records that truly are matching records. On the other hand, if your criteria aren't restrictive enough, the program may match some records that are not actually duplicates.

Simscore

With Simscore, you present two data strings, and Simscore returns a percentage score that reflects the similarity of those two strings. With data from key versions of your reports, Simscore can help you find out why a pair of records have been designated matches or why not.

For example, say you notice from the Sorted Records Report that a match was missed, maybe because the last names didn't match (for example, Vanderhoeven vs. Vonderhooven). How far must you turn down the match level on Last Name to catch that dupe? Use the Simscore executable file, which is stored in your MCD directory, to find out.

Windows

The Views version of Simscore presents the same data in an easy-to-use Views-style window. Click Start > Run, then browse to simscvws.exe. To start Simscore from within Views, click the Simscore icon in the Views toolbar, or select Simscore from the Tools menu.

UNIX or other non-Windows platforms

Enter simscore at your system prompt.

When Simscore starts, you will be prompted for several settings (one at a time), as shown in the figure below. At each prompt, type a number, then press the Enter key. To accept the default value, simply press the Enter key. Set Simscore so it mirrors your match settings (see details on the next page).
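To get a feel for what a percentage similarity score looks like, the sketch below scores two strings with Python's standard `difflib.SequenceMatcher`. This is a stand-in chosen for illustration only: it is not the Simscore algorithm, its scores will not agree with Simscore's, and it has none of the settings that Simscore prompts for. The function name `similarity_percent` is hypothetical.

```python
from difflib import SequenceMatcher


def similarity_percent(string_a, string_b):
    """Score two strings from 0 to 100 using difflib (not Simscore's method)."""
    ratio = SequenceMatcher(None, string_a.upper(), string_b.upper()).ratio()
    return round(100 * ratio)


if __name__ == "__main__":
    pairs = [("Vanderhoeven", "Vonderhooven"), ("Smith", "Smith")]
    for left, right in pairs:
        print(left, "vs", right, "->", similarity_percent(left, right))
```

The idea carries over to the real tool: score the pair that was missed, then compare that score with the match level set for the field to see how far the level would have to be lowered.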