IDENTITY MATCHING AT ERDC. John Sabel, ERDC Research Coordination Committee September 6, 2013
|
|
|
- Corey Thomas
- 9 years ago
- Views:
Transcription
1 IDENTITY MATCHING AT ERDC John Sabel, ERDC Research Coordination Committee September 6, 2013
2 2 Identity Matching at ERDC Identities come in from more than two sources Matching builds across time, not a one-time process Each source of data has their own set of matching rules due to differences in data elements
3 P20W Data Warehouse Project K-12 PCHEES SBCTC UI Wage ECEAP Conceptual Data Flow Staging Data is organized in a manner similar to the source; contains PII MDM Identity Matching, Linking and Cohort Management Processes P20W ID is assigned using PII data kept outside the system ODS P20W Data Model Data is organized into common tables; No PII Data Warehouse Data is organized optimally for analysis and reporting; No PII Data Marts Feedback Reports Researc h Briefs Data Sets 3 Output for users Output has no PII or P20W ID, Research ID only
4 4 Why Informatica? Their MDM Hub provides a central repository of identifiers (e.g. names, DOBs, SSNs) over time for every source. Provides deterministic and probabilistic matching tools for matching source data. Has a mechanism for assigning P20IDs (Person IDs) to collections of data. Preserves histories of merges. Has mechanism for unmerging data. Other non-identity matching criteria for using Informatica: PowerCenter dataflow programming language, Metadata Manager, etc.
5 5 Overview of Identity Matching Process Data moves from source files to Stage database. Identifiers are cleaned and standardized. Tokens are assigned (more on tokens later) to each record. Names standardized using a set of rigorous set of business rules incorporated into an Informatica mapplet. Data moves from Stage database, to MDM Hub. In MDM Hub, new data (tokens) are merged with existing data. If a person exists already in data warehouse, their new data (tokens) are assigned their existing P20ID. New people get new P20IDs. Their new data (tokens) are assigned to their data (tokens).
6 6 Matching Process: Automerge First Data is first merged using Automerge rules. Automerge rules are conservative rules for merging identities, merging P20IDs. As ERDC has implemented them, they largely deterministic, similar to using SQL to merge data. False negative rates not too important. Rather, Automerge rules are designed to ensure a very, very low false positive rates.
7 7 Automerge Sample Rule Set Rule set for automerging data: Rule Number Search Level Matching Columns 1 Typical Gender DOB Complete Person Name (fuzzy) SSN 2 Typical Complete Person Name (fuzzy) DOB College and Student ID 3* Typical Gender DOB First Name Complete Person Name (fuzzy) * As of 9/6/2013, rule still undergoing validation.
8 8 Matching Process: Manual Merge Last Data is next subjected to a manual merge rule set. Manual merge rule sets are looser. They are designed to so that the false negative rate is low. False positives are not much of a concern (since potential matches will be manually reviewed). Manual merge rule sets creates a manual review table of pairs of identities. This table is brought into Excel, and a human evaluates each pair of identities to determine if there is a match or not.* Matches are uploaded to the MDM hub, and results merged. * ERDC custom process. Informatica comes with a tool for side-by-side evaluation of match pairs, but side-by-side comparisons do not scale well.
9 9 Manual Merge Sample Rule Set Rule set for creating potential match pairs: Rule Number Search Level Matching Columns 1 Typical Complete Person Name (fuzzy) College and Student ID 2 Typical Complete Person Name (fuzzy) SSN 3 Typical Complete Person Name (fuzzy) DOB
10 10 Manual Review in Excel Example Pairwise evaluation of matches: Match P20ID Class LastName FirstName MiddleName DOB College ID SSN GENDER COUNTY OVAK STEPHEN J 9/21/ M Grays Harbor OVAK STEPHEN J 9/21/ M Jefferson ILI LISA 9/14/ F Douglas ILI LISA M 9/14/ F Douglas COOK ALICE M 12/13/ F Jefferson COOK ALICE 12/13/ F Jefferson COOK ALICE M 12/13/ F Jefferson COOK ALICE LUDMILA 12/13/ F Jefferson
11 11 Merging People, Easy; Unmerging, Hard Identities are easy to merge. For example, to merge information for Jane Smith with Jan Smith just assign all instances of P20ID Jane Smith to PersonID Jan Smith. Unmerging is hard, even impossible. Say Jane and Jan are now found to really be two people. Now their data is comingled. How can you tease their data apart? How then can a data warehouse unmerge these two identities? Use tokens!
12 12 Unmerging: Made Possible Using Tokens Tokens are a set of identifiers that clumps data into the big lumps where each token is guaranteed to only be associated with only one person. K12 Token components example: K12 School District ID: K12 Student ID: Name: Jane Smith DOB 1/5/1995 P20IDs are composed of one to many Tokens. Merging and unmerging Tokens into existing and new P20IDs is what the MDM Hub does.
13 13 Unmerging Example, the Problem ERDC initially thought that Jane Smith and Jan Smith were one and the same, so they were merged under P20ID = 354: Jane Smith/Jan Smith P20ID Source TokenID Definition* Data Warehouse 354 OSPI Jane-Smith 1/5/ Grade 8 data 354 OSPI Jane-Smith 1/5/ Grade 9 to 12 data 354 SBCTC Jan-Smith 1/5/ W SBCTC data * Actual TokenIDs in the data warehouse are surrogate keys. These surrogate keys are tied to the TokenID definitions that exist only in the MDM Hub. But later, ERDC realized that Jane Smith and Jan Smith were not one and the same.
14 14 Unmerging Example, the Solution Using tokens, it is straightforward to separate all the data associated with Jane Smith and assign a new P20ID to that data: Jane Smith P20ID Source TokenID Definition Data Warehouse 354 OSPI Jane-Smith 1/5/ Grade 8 data 354 OSPI Jane-Smith 1/5/ Grade 9 to 12 data Jan Smith P20ID Source TokenID Definition Data Warehouse SBCTC Jan-Smith 1/5/ W SBCTC data Jane Smith and Jan Smith are now unmerged.
15 15 Tokens in the MDM Hub Database e.g. Jane Smith, 1/1/1995, 111, W
16 16 Other Sources of Data Marriage and Divorce data from DOH Name change data from AOC Death data from DOH and SSA
17 Questions? 17
18 18 Contact Us Washington Education Research & Data Center John Sabel
Master Your Data and Your Business Using Informatica MDM. Ravi Shankar Sr. Director, MDM Product Marketing
Master Your and Your Business Using Informatica MDM Ravi Shankar Sr. Director, MDM Product Marketing 1 Driven Enterprise Timely Trusted Relevant 2 Agenda Critical Business Imperatives Addressed by MDM
Data Quality and Stewardship in the Veterans Health Administration
Data Quality and Stewardship in the Veterans Health Administration ABSTRACT The mission of the Veterans Health Administration (VHA) is to serve the needs of America's Veterans by providing primary care,
Data Domain Discovery in Test Data Management
Data Domain Discovery in Test Data Management 1993-2016 Informatica Corporation. No part of this document may be reproduced or transmitted in any form, by any means (electronic, photocopying, recording
Data Model Management
Whitemarsh Information Systems Corporation 2008 Althea Lane Bowie, Maryland 20716 Tele: 301-249-1142 Email: [email protected] Web: www.wiscorp.com Table of Contents 1. Objective...1 2. Topics Covered...1
BUSINESS RULES AND GAP ANALYSIS
Leading the Evolution WHITE PAPER BUSINESS RULES AND GAP ANALYSIS Discovery and management of business rules avoids business disruptions WHITE PAPER BUSINESS RULES AND GAP ANALYSIS Business Situation More
MDM Multidomain Edition (Version 9.6.0) For Microsoft SQL Server Performance Tuning
MDM Multidomain Edition (Version 9.6.0) For Microsoft SQL Server Performance Tuning 2014 Informatica Corporation. No part of this document may be reproduced or transmitted in any form, by any means (electronic,
Enable Business Agility and Speed Empower your business with proven multidomain master data management (MDM)
Enable Business Agility and Speed Empower your business with proven multidomain master data management (MDM) Customer Viewpoint By leveraging a well-thoughtout MDM strategy, we have been able to strengthen
Principal MDM Components and Capabilities
Principal MDM Components and Capabilities David Loshin Knowledge Integrity, Inc. 1 Agenda Introduction to master data management The MDM Component Layer Model MDM Maturity MDM Functional Services Summary
Office of Court Administration Automated Registry (AR) Interface Design Document for DSHS - Clinical Management for Behavioral Health Services (CMBHS)
Office of Court Administration Automated Registry (AR) Interface Design Document for DSHS - Clinical Management for Behavioral Health Services (CMBHS) August 04, 2009 Interface Design Document for CMBHS
WHO LEAVES TEACHING AND WHERE DO THEY GO? Washington State Teachers
www.erdc.wa.gov ERDC Research Brief 2011-01 Longitudinal Studies January 2011 WHO LEAVES TEACHING AND WHERE DO THEY GO? Washington State Teachers The Washington State Education Research & Data Center (ERDC)
Relational Database Basics Review
Relational Database Basics Review IT 4153 Advanced Database J.G. Zheng Spring 2012 Overview Database approach Database system Relational model Database development 2 File Processing Approaches Based on
Master Data Management. Zahra Mansoori
Master Data Management Zahra Mansoori 1 1. Preference 2 A critical question arises How do you get from a thousand points of data entry to a single view of the business? We are going to answer this question
Import Clients. Importing Clients Into STX
Import Clients Importing Clients Into STX Your Client List If you have a list of clients that you need to load into STX, you can import any comma-separated file (CSV) that contains suitable information.
Thank you for attending the MDM for the Enterprise Seminar Series!
Thank you for attending the MDM for the Enterprise Seminar Series! Please do not distribute this presentations without permission from the speaker (see contact information within.) This is just intended
Vermont Enterprise Architecture Framework (VEAF) Master Data Management Design
Vermont Enterprise Architecture Framework (VEAF) Master Data Management Design EA APPROVALS Approving Authority: REVISION HISTORY Version Date Organization/Point
Data Warehouse and Business Intelligence Testing: Challenges, Best Practices & the Solution
Warehouse and Business Intelligence : Challenges, Best Practices & the Solution Prepared by datagaps http://www.datagaps.com http://www.youtube.com/datagaps http://www.twitter.com/datagaps Contact [email protected]
How. Matching Technology Improves. White Paper
How Matching Technology Improves Data Quality White Paper Table of Contents How... 3 What is Matching?... 3 Benefits of Matching... 5 Matching Use Cases... 6 What is Matched?... 7 Standardization before
Bringing Strategy to Life Using an Intelligent Data Platform to Become Data Ready. Informatica Government Summit April 23, 2015
Bringing Strategy to Life Using an Intelligent Platform to Become Ready Informatica Government Summit April 23, 2015 Informatica Solutions Overview Power the -Ready Enterprise Government Imperatives Improve
Mapping Analyst for Excel Guide
Mapping Analyst for Excel Guide Informatica PowerCenter (Version 8.6.1) Informatica Mapping Analyst for Excel Guide Version 8.6.1 March 2009 Copyright (c) 1998 2009 Informatica Corporation. All rights
What s New with Informatica Data Services & PowerCenter Data Virtualization Edition
1 What s New with Informatica Data Services & PowerCenter Data Virtualization Edition Kevin Brady, Integration Team Lead Bonneville Power Wei Zheng, Product Management Informatica Ash Parikh, Product Marketing
Smarter Balanced Assessment Consortium. Request for Information 2013-31. Test Delivery Certification Package
Office of Superintendent of Public Instruction Smarter Balanced Assessment Consortium Request for Information 2013-31 Test Delivery Certification Package September 4, 2013 Table of Contents Introduction...
Anonymous Oracle Applications HR Data
Abstract Information in many test and development Oracle Applications HR systems must be rendered anonymous for security purposes. Such data masking helps prevent the unauthorized visibility of the data
Improving your Data Warehouse s IQ
Improving your Data Warehouse s IQ Derek Strauss Gavroshe USA, Inc. Outline Data quality for second generation data warehouses DQ tool functionality categories and the data quality process Data model types
James Serra Data Warehouse/BI/MDM Architect [email protected] JamesSerra.com
James Serra Data Warehouse/BI/MDM Architect [email protected] JamesSerra.com Agenda Do you need Master Data Management (MDM)? Why Master Data Management? MDM Scenarios & MDM Hub Architecture Styles
P-20 Data Warehouse Project
P-20 Data Warehouse Project Project Charter Version 1.5 STATE OF WASHINGTON Project Management Office OFFICE OF FINANCIAL MANAGEMENT INFORMATION SERVICES DIVISION Revision History Date Version # Author
Each Ad Hoc query is created by making requests of the data set, or View, using the following format.
Contents Ad Hoc Reporting... 2 What is Ad Hoc Reporting?... 2 SQL Basics... 2 Select... 2 From... 2 Where... 2 Sorting and Grouping... 2 The Ad Hoc Query Builder... 4 The From Tab... 7 The Select Tab...
OBIEE DEVELOPER RESUME
1 of 5 05/01/2015 13:14 OBIEE DEVELOPER RESUME Java Developers/Architects Resumes Please note that this is a not a Job Board - We are an I.T Staffing Company and we provide candidates on a Contract basis.
Combining structured data with machine learning to improve clinical text de-identification
Combining structured data with machine learning to improve clinical text de-identification DT Tran Scott Halgrim David Carrell Group Health Research Institute Clinical text contains Personally identifiable
Informatica and our product strategy
Informatica and our product strategy Piotr Skowronski April 2015 2015 Sales Plays: Power the -Ready Enterprise Business Imperatives Improved Decisions Cross-Sell Up-Sell Mergers & Acquisitions Business
In the Supreme Court of British Columbia NOTICE OF FAMILY CLAIM
Form F3 (Rule 4-1(1)) In the Supreme Court of British Columbia Court File No.: Court Registry: Vancouver Claimant: Respondent: JANE JANICE DOE NOTICE OF FAMILY CLAIM This family law case has been started
Master Data Management and Data Warehousing. Zahra Mansoori
Master Data Management and Data Warehousing Zahra Mansoori 1 1. Preference 2 IT landscape growth IT landscapes have grown into complex arrays of different systems, applications, and technologies over the
Presentation at 2006 DAMA / Wilshire Metadata Conference. John R. Friedrich, II, PhD [email protected]
Metadata Management Best Practices and Lessons Learned Presentation at 2006 DAMA / Wilshire Metadata Conference Denver, CO John R. Friedrich, II, PhD [email protected] Slide 1 of??? Outline
Clever + DreamBox SFTP Instructions
Clever + DreamBox SFTP Instructions 1. Introduction Clever is a service for transferring school information in a secure manner from a school database to applications. This document explains how to use
A Simplified Framework for Data Cleaning and Information Retrieval in Multiple Data Source Problems
A Simplified Framework for Data Cleaning and Information Retrieval in Multiple Data Source Problems Agusthiyar.R, 1, Dr. K. Narashiman 2 Assistant Professor (Sr.G), Department of Computer Applications,
Data Quality Assurance
CHAPTER 4 Data Quality Assurance The previous chapters define accurate data. They talk about the importance of data and in particular the importance of accurate data. They describe how complex the topic
InCompass, Privacy Impact Assessment (PIA) 8/3/2011
DEPARTMENT OF TREASURY Washington, D.C. 20220 InCompass, Privacy Impact Assessment (PIA) 8/3/2011 A. Identification System Name: InCompass Former System Name: Integrated Talent Management (ITM) OMB Unique
What's New in SAS Data Management
Paper SAS034-2014 What's New in SAS Data Management Nancy Rausch, SAS Institute Inc., Cary, NC; Mike Frost, SAS Institute Inc., Cary, NC, Mike Ames, SAS Institute Inc., Cary ABSTRACT The latest releases
Data Masking for HIPAA Compliance
The Safe Harbor Method: Abstract The Health Insurance Portability and Accountability Act of 1996 (HIPAA) Privacy Rule mandates the de-identification of specific types of Protected Health Information (PHI)
Class News. Basic Elements of the Data Warehouse" 1/22/13. CSPP 53017: Data Warehousing Winter 2013" Lecture 2" Svetlozar Nestorov" "
CSPP 53017: Data Warehousing Winter 2013 Lecture 2 Svetlozar Nestorov Class News Class web page: http://bit.ly/wtwxv9 Subscribe to the mailing list Homework 1 is out now; due by 1:59am on Tue, Jan 29.
Hyatt MDM Case Study: Increasing Revenue with Better Customer Insight. Chris Brogan VP Business Strategy Analytics Hyatt Hotel Corporation
Hyatt MDM Case Study: Increasing Revenue with Better Customer Insight Chris Brogan VP Business Strategy Analytics Hyatt Hotel Corporation About Hyatt Global Hospitality organization Several brands - Hyatt
Reporting MDM Data Attribute Inconsistencies for the Enterprise Using DataFlux
Reporting MDM Data Attribute Inconsistencies for the Enterprise Using DataFlux Ernesto Roco, Hyundai Capital America (HCA), Irvine, CA ABSTRACT The purpose of this paper is to demonstrate how we use DataFlux
A. SYSTEM DESCRIPTION
NOTE: The following reflects the information entered in the PIAMS website. A. SYSTEM DESCRIPTION Authority: Office of Management Budget (OMB) Memorandum (M) 03-22, OMB Guidance for Implementing the Privacy
Data Discovery & Documentation PROCEDURE
Data Discovery & Documentation PROCEDURE Document Version: 1.0 Date of Issue: June 28, 2013 Table of Contents 1. Introduction... 3 1.1 Purpose... 3 1.2 Scope... 3 2. Option 1: Current Process No metadata
Master Data Management (MDM) in the Public Sector
Master Data Management (MDM) in the Public Sector Don Hoag Manager Agenda What is MDM? What does MDM attempt to accomplish? What are the approaches to MDM? Operational Analytical Questions 2 What is Master
Reverse Engineering in Data Integration Software
Database Systems Journal vol. IV, no. 1/2013 11 Reverse Engineering in Data Integration Software Vlad DIACONITA The Bucharest Academy of Economic Studies [email protected] Integrated applications
MDM PROJECT BEST PRACTICE INFORMATICA DAY 4 TH JUNE 2014. Presented by Arhis and EDF
MDM PROJECT BEST PRACTICE INFORMATICA DAY 4 TH JUNE 2014 Presented by Arhis and EDF 1 AGENDA 1. Presenting Arhis 1. Arhis presentation and value proposition 2. Arhis MDM projects approach. 3. Arhis MDM
Tamr on Google Cloud Platform: E-Commerce Tutorial
Tamr on Google Cloud Platform: E-Commerce Tutorial Overview In this tutorial, we ll be working with sources from an e-commerce company s customer account database and web analytics data warehouse. In particular,
The Value of Advanced Data Integration in a Big Data Services Company. Presenter: Flavio Villanustre, VP Technology September 2014
The Value of Advanced Data Integration in a Big Data Services Company Presenter: Flavio Villanustre, VP Technology September 2014 About LexisNexis We are among the largest providers of risk solutions in
A basic create statement for a simple student table would look like the following.
Creating Tables A basic create statement for a simple student table would look like the following. create table Student (SID varchar(10), FirstName varchar(30), LastName varchar(30), EmailAddress varchar(30));
Privacy Impact Assessment. For Rehabilitation Services Administration Management Information System (RSA-MIS) Date: November 19, 2014
For Rehabilitation Services Administration Management Information System (RSA-MIS) Date: November 19, 2014 Point of Contact and Author: Ken Schellenberg [email protected] System Owner: Ed Anthony
PRIVACY IMPACT ASSESSMENT
PRIVACY IMPACT ASSESSMENT Employee Benefits Management Services December 2013 FDIC External Service Table of Contents System Overview Personally Identifiable Information (PII) in EBMS Purpose & Use of
Data warehouse Architectures and processes
Database and data mining group, Data warehouse Architectures and processes DATA WAREHOUSE: ARCHITECTURES AND PROCESSES - 1 Database and data mining group, Data warehouse architectures Separation between
SCORE An Overview. State of Colorado Registration and Election Management
SCORE An Overview State of Colorado Registration and Election Management Table of Contents The Voter Registration Module 3 The Voter Search Module 4 The Voter Merge Module 5 The Batch Scan/Commit Batch
MDM and Data Warehousing Complement Each Other
Master Management MDM and Warehousing Complement Each Other Greater business value from both 2011 IBM Corporation Executive Summary Master Management (MDM) and Warehousing (DW) complement each other There
Addressing Risk Data Aggregation and Risk Reporting Ben Sharma, CEO. Big Data Everywhere Conference, NYC November 2015
Addressing Risk Data Aggregation and Risk Reporting Ben Sharma, CEO Big Data Everywhere Conference, NYC November 2015 Agenda 1. Challenges with Risk Data Aggregation and Risk Reporting (RDARR) 2. How a
Data Vault and The Truth about the Enterprise Data Warehouse
Data Vault and The Truth about the Enterprise Data Warehouse Roelant Vos 04-05-2012 Brisbane, Australia Introduction More often than not, when discussion about data modeling and information architecture
Early Learning Compensation Rates Comparison
Report to the Legislature Early Learning Compensation Rates Comparison January 2015 Table of Contents 1 Introduction Early Childhood Education and Assistance Program (ECEAP) 2 ECEAP Staff Qualification
Texas Records Exchange System (TREx)
Texas Records Exchange System (TREx) TREx User Guide Phase 3 Texas Education Agency 1 Contents Introduction 3 Reset a forgotten password 16 Searching TREx transactions 17 Home page 19 Inbound Requests
Click HERE for Online Help (must be a registered user)
Click HERE for Online Help (must be a registered user) Click HERE for an online consultation for the correct Background check(s) for your state. Click HERE for access to online documents. 2009 Servant
Cognos Security Policy. A look at the security policy that is in place for the ODS/Cognos Reporting Environment at the College of Charleston
Cognos Security Policy A look at the security policy that is in place for the ODS/Cognos Reporting Environment at the College of Charleston Mary L. Person 11/13/2012 Table of Contents Overview... 2 Initiating
Customer Centricity Master Data Management and Customer Golden Record. Sava Vladov, VP Business Development, Adastra BG
Customer Centricity Master Data Management and Customer Golden Record Sava Vladov, VP Business Development, Adastra BG 23 April 2015 What is this presentation about? Customer-centricity for Banking and
Announcements. SQL is hot! Facebook. Goal. Database Design Process. IT420: Database Management and Organization. Normalization (Chapter 3)
Announcements IT0: Database Management and Organization Normalization (Chapter 3) Department coin design contest deadline - February -week exam Monday, February 1 Lab SQL SQL Server: ALTER TABLE tname
You re one step closer to working more efficiently, increasing performance, and gaining clean, enhanced data.
chases down and eliminates duplicate data FREE TRIAL! CONGRATULATIONS! You re one step closer to working more efficiently, increasing performance, and gaining clean, enhanced data. You ll find that Cloudingo
Who Is Your Insurance Web Service
Who Is Your Insurance Web Service Contents Service address...2 Insurance search...2 Authentication Request...2 Structure of the authentication message...2...2 Authentication Response...3 Structure of the
Employers Guide to Online Recruiting
Employers Guide to Online Recruiting Jobs.utah.gov Utah Department of Workforce Services Table of Contents Employer Login Process (Existing Username & Password).....0 Forgot Username Process (Can t remember
Using the SQL Procedure
Using the SQL Procedure Kirk Paul Lafler Software Intelligence Corporation Abstract The SQL procedure follows most of the guidelines established by the American National Standards Institute (ANSI). In
Methodology for creation of the HES Patient ID (HESID)
Methodology for creation of the HES Patient ID (HESID) Version Control Version Date Author Reason for change 2.0 21/05/2014 HES Development Team Reviewed & updated text explaining the 3 pass matching process
SQL Server Master Data Services A Point of View
SQL Server Master Data Services A Point of View SUBRAHMANYA V SENIOR CONSULTANT [email protected] Abstract Is Microsoft s Master Data Services an answer for low cost MDM solution? Will
