IDENTITY MATCHING AT ERDC. John Sabel, ERDC Research Coordination Committee September 6, 2013

Similar documents
Master Your Data and Your Business Using Informatica MDM. Ravi Shankar Sr. Director, MDM Product Marketing

Data Quality and Stewardship in the Veterans Health Administration

Data Domain Discovery in Test Data Management

Data Model Management

BUSINESS RULES AND GAP ANALYSIS

MDM Multidomain Edition (Version 9.6.0) For Microsoft SQL Server Performance Tuning

Enable Business Agility and Speed Empower your business with proven multidomain master data management (MDM)

Principal MDM Components and Capabilities

Office of Court Administration Automated Registry (AR) Interface Design Document for DSHS - Clinical Management for Behavioral Health Services (CMBHS)

WHO LEAVES TEACHING AND WHERE DO THEY GO? Washington State Teachers

Relational Database Basics Review

Master Data Management. Zahra Mansoori

Import Clients. Importing Clients Into STX

Thank you for attending the MDM for the Enterprise Seminar Series!

Vermont Enterprise Architecture Framework (VEAF) Master Data Management Design

Data Warehouse and Business Intelligence Testing: Challenges, Best Practices & the Solution

How. Matching Technology Improves. White Paper

Bringing Strategy to Life Using an Intelligent Data Platform to Become Data Ready. Informatica Government Summit April 23, 2015

Mapping Analyst for Excel Guide

What s New with Informatica Data Services & PowerCenter Data Virtualization Edition

Smarter Balanced Assessment Consortium. Request for Information Test Delivery Certification Package

Anonymous Oracle Applications HR Data

Improving your Data Warehouse s IQ

James Serra Data Warehouse/BI/MDM Architect JamesSerra.com


P-20 Data Warehouse Project

Each Ad Hoc query is created by making requests of the data set, or View, using the following format.

OBIEE DEVELOPER RESUME

Combining structured data with machine learning to improve clinical text de-identification

Informatica and our product strategy

In the Supreme Court of British Columbia NOTICE OF FAMILY CLAIM

Master Data Management and Data Warehousing. Zahra Mansoori

Presentation at 2006 DAMA / Wilshire Metadata Conference. John R. Friedrich, II, PhD Friedrich@metaintegration.net

Clever + DreamBox SFTP Instructions

A Simplified Framework for Data Cleaning and Information Retrieval in Multiple Data Source Problems

Data Quality Assurance

InCompass, Privacy Impact Assessment (PIA) 8/3/2011

What's New in SAS Data Management

Data Masking for HIPAA Compliance

Class News. Basic Elements of the Data Warehouse" 1/22/13. CSPP 53017: Data Warehousing Winter 2013" Lecture 2" Svetlozar Nestorov" "

Hyatt MDM Case Study: Increasing Revenue with Better Customer Insight. Chris Brogan VP Business Strategy Analytics Hyatt Hotel Corporation

Reporting MDM Data Attribute Inconsistencies for the Enterprise Using DataFlux

A. SYSTEM DESCRIPTION

Data Discovery & Documentation PROCEDURE

Master Data Management (MDM) in the Public Sector

Reverse Engineering in Data Integration Software

MDM PROJECT BEST PRACTICE INFORMATICA DAY 4 TH JUNE Presented by Arhis and EDF

Tamr on Google Cloud Platform: E-Commerce Tutorial

The Value of Advanced Data Integration in a Big Data Services Company. Presenter: Flavio Villanustre, VP Technology September 2014

A basic create statement for a simple student table would look like the following.

Privacy Impact Assessment. For Rehabilitation Services Administration Management Information System (RSA-MIS) Date: November 19, 2014

PRIVACY IMPACT ASSESSMENT

Data warehouse Architectures and processes

SCORE An Overview. State of Colorado Registration and Election Management

MDM and Data Warehousing Complement Each Other

Addressing Risk Data Aggregation and Risk Reporting Ben Sharma, CEO. Big Data Everywhere Conference, NYC November 2015

Data Vault and The Truth about the Enterprise Data Warehouse

Early Learning Compensation Rates Comparison

Texas Records Exchange System (TREx)

Click HERE for Online Help (must be a registered user)

Cognos Security Policy. A look at the security policy that is in place for the ODS/Cognos Reporting Environment at the College of Charleston

Customer Centricity Master Data Management and Customer Golden Record. Sava Vladov, VP Business Development, Adastra BG

Announcements. SQL is hot! Facebook. Goal. Database Design Process. IT420: Database Management and Organization. Normalization (Chapter 3)

You re one step closer to working more efficiently, increasing performance, and gaining clean, enhanced data.

Who Is Your Insurance Web Service

Employers Guide to Online Recruiting

Using the SQL Procedure

Methodology for creation of the HES Patient ID (HESID)

SQL Server Master Data Services A Point of View

Transcription:

IDENTITY MATCHING AT ERDC John Sabel, ERDC Research Coordination Committee September 6, 2013

2 Identity Matching at ERDC Identities come in from more than two sources Matching builds across time, not a one-time process Each source of data has their own set of matching rules due to differences in data elements

P20W Data Warehouse Project K-12 PCHEES SBCTC UI Wage ECEAP Conceptual Data Flow Staging Data is organized in a manner similar to the source; contains PII MDM Identity Matching, Linking and Cohort Management Processes P20W ID is assigned using PII data kept outside the system ODS P20W Data Model Data is organized into common tables; No PII Data Warehouse Data is organized optimally for analysis and reporting; No PII Data Marts Feedback Reports Researc h Briefs Data Sets 3 Output for users Output has no PII or P20W ID, Research ID only

4 Why Informatica? Their MDM Hub provides a central repository of identifiers (e.g. names, DOBs, SSNs) over time for every source. Provides deterministic and probabilistic matching tools for matching source data. Has a mechanism for assigning P20IDs (Person IDs) to collections of data. Preserves histories of merges. Has mechanism for unmerging data. Other non-identity matching criteria for using Informatica: PowerCenter dataflow programming language, Metadata Manager, etc.

5 Overview of Identity Matching Process Data moves from source files to Stage database. Identifiers are cleaned and standardized. Tokens are assigned (more on tokens later) to each record. Names standardized using a set of rigorous set of business rules incorporated into an Informatica mapplet. Data moves from Stage database, to MDM Hub. In MDM Hub, new data (tokens) are merged with existing data. If a person exists already in data warehouse, their new data (tokens) are assigned their existing P20ID. New people get new P20IDs. Their new data (tokens) are assigned to their data (tokens).

6 Matching Process: Automerge First Data is first merged using Automerge rules. Automerge rules are conservative rules for merging identities, merging P20IDs. As ERDC has implemented them, they largely deterministic, similar to using SQL to merge data. False negative rates not too important. Rather, Automerge rules are designed to ensure a very, very low false positive rates.

7 Automerge Sample Rule Set Rule set for automerging data: Rule Number Search Level Matching Columns 1 Typical Gender DOB Complete Person Name (fuzzy) SSN 2 Typical Complete Person Name (fuzzy) DOB College and Student ID 3* Typical Gender DOB First Name Complete Person Name (fuzzy) * As of 9/6/2013, rule still undergoing validation.

8 Matching Process: Manual Merge Last Data is next subjected to a manual merge rule set. Manual merge rule sets are looser. They are designed to so that the false negative rate is low. False positives are not much of a concern (since potential matches will be manually reviewed). Manual merge rule sets creates a manual review table of pairs of identities. This table is brought into Excel, and a human evaluates each pair of identities to determine if there is a match or not.* Matches are uploaded to the MDM hub, and results merged. * ERDC custom process. Informatica comes with a tool for side-by-side evaluation of match pairs, but side-by-side comparisons do not scale well.

9 Manual Merge Sample Rule Set Rule set for creating potential match pairs: Rule Number Search Level Matching Columns 1 Typical Complete Person Name (fuzzy) College and Student ID 2 Typical Complete Person Name (fuzzy) SSN 3 Typical Complete Person Name (fuzzy) DOB

10 Manual Review in Excel Example Pairwise evaluation of matches: Match P20ID Class LastName FirstName MiddleName DOB College ID SSN GENDER COUNTY 1 1392 1 OVAK STEPHEN J 9/21/1987 111 55555555 M Grays Harbor 92827 1 OVAK STEPHEN J 9/21/1987 111 55555555 M Jefferson 0 27284 2 ILI LISA 9/14/1989 234 F Douglas 43323 2 ILI LISA M 9/14/1989 233 F Douglas 1 767 3 COOK ALICE M 12/13/1990 123 53333333 F Jefferson 28579 3 COOK ALICE 12/13/1990 123 F Jefferson 0 767 4 COOK ALICE M 12/13/1990 123 53333333 F Jefferson 46342 4 COOK ALICE LUDMILA 12/13/1990 124 F Jefferson

11 Merging People, Easy; Unmerging, Hard Identities are easy to merge. For example, to merge information for Jane Smith with Jan Smith just assign all instances of P20ID Jane Smith to PersonID Jan Smith. Unmerging is hard, even impossible. Say Jane and Jan are now found to really be two people. Now their data is comingled. How can you tease their data apart? How then can a data warehouse unmerge these two identities? Use tokens!

12 Unmerging: Made Possible Using Tokens Tokens are a set of identifiers that clumps data into the big lumps where each token is guaranteed to only be associated with only one person. K12 Token components example: K12 School District ID: 17892 K12 Student ID: 0014353 Name: Jane Smith DOB 1/5/1995 P20IDs are composed of one to many Tokens. Merging and unmerging Tokens into existing and new P20IDs is what the MDM Hub does.

13 Unmerging Example, the Problem ERDC initially thought that Jane Smith and Jan Smith were one and the same, so they were merged under P20ID = 354: Jane Smith/Jan Smith P20ID Source TokenID Definition* Data Warehouse 354 OSPI Jane-Smith 1/5/1995 17111 0014353 Grade 8 data 354 OSPI Jane-Smith 1/5/1995 17892 0014353 Grade 9 to 12 data 354 SBCTC Jan-Smith 1/5/1995 111 W4543935 SBCTC data * Actual TokenIDs in the data warehouse are surrogate keys. These surrogate keys are tied to the TokenID definitions that exist only in the MDM Hub. But later, ERDC realized that Jane Smith and Jan Smith were not one and the same.

14 Unmerging Example, the Solution Using tokens, it is straightforward to separate all the data associated with Jane Smith and assign a new P20ID to that data: Jane Smith P20ID Source TokenID Definition Data Warehouse 354 OSPI Jane-Smith 1/5/1995 17111 0014353 Grade 8 data 354 OSPI Jane-Smith 1/5/1995 17892 0014353 Grade 9 to 12 data Jan Smith P20ID Source TokenID Definition Data Warehouse 57289 SBCTC Jan-Smith 1/5/1995 111 W4543935 SBCTC data Jane Smith and Jan Smith are now unmerged.

15 Tokens in the MDM Hub Database e.g. Jane Smith, 1/1/1995, 111, W4543935

16 Other Sources of Data Marriage and Divorce data from DOH Name change data from AOC Death data from DOH and SSA

Questions? 17

18 Contact Us Washington Education Research & Data Center www.erdc.wa.gov John Sabel john.sabel@ofm.wa.gov