Enterprise Master Patient Index

Healthcare data are captured in many different settings such as hospitals, clinics, labs, and physician offices. According to a report by the CDC, patients in the United States made an estimated 1.1 billion visits to physician offices, hospital outpatient departments and hospital emergency departments in 2006 alone. This corresponds to a rate of about four visits per person. In addition to the large volume of visit data generated on an annual basis, the data are also distributed across different healthcare settings as patients visit hospitals and primary and specialty physicians, and as they move across the country. In order to put together the longitudinal health record of a patient, all of their data need to be integrated accurately and efficiently despite the fact that the data are captured using disparate and heterogeneous systems. The heterogeneity of the systems used to capture patient data at the different healthcare organizations causes patient records to have multiple unrelated patient identifiers assigned to them, with the possibility of multiple identifiers being assigned to a given patient even within a single institution. The lack of precise standards on the format of patient identifying and patient demographic data results in incomplete data sharing among healthcare professionals, patients, and data repositories. In addition to the syntactic heterogeneity of the data, the data capture process is often not carefully controlled for quality, nor necessarily defined in a common way across different data sources, resulting in unavoidable and all too common data entry errors which further exacerbate the inconsistency of the data. Common data management design issues include lack of normalization or de-normalization and missing integrity constraints, whereas improper data handling results in wrong or missing data or uncontrolled data duplication.
In order to integrate all this healthcare data, the various patient identifiers assigned to a given patient, either at different institutions or erroneously by a single institution, must be linked together despite the presence of syntactic and semantic differences in the associated demographic data captured for the patient. This problem has been known for more than five decades as the record linkage or the record matching problem. The goal in the record matching problem is to identify records that refer to the same real world entity, even if the records do not match completely. If each record carried a unique, universal, and error-free identification code, the only problem would be to find an optimal search sequence that would minimize the total number of record comparisons. The syntactic and semantic differences between the data sources, as well as the data entry errors introduced during capture coupled with poor quality control measures, result in the need to use identification codes that are neither unique nor error-free. The following example illustrates a situation where, even though two records refer to the same patient, semantic and syntactic variations in the two records mean that the task of matching them cannot be easily automated. It makes clear that only a sophisticated record matching algorithm could automatically decide whether to link the identifiers assigned to these two patient records.

Name              Address                 Age
Javier Martinez   49 E. Applecross Road   33
Haver Marteenez   49 Applecross Road      36
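To make the difficulty concrete, the short Python sketch below compares the two example records field by field. It is only an illustration: the use of `difflib.SequenceMatcher` as the similarity measure and the helper name `similarity` are assumptions made for this sketch, not something prescribed by the report.

```python
from difflib import SequenceMatcher

# The two example records from the table above.
record_a = {"name": "Javier Martinez", "address": "49 E. Applecross Road", "age": 33}
record_b = {"name": "Haver Marteenez", "address": "49 Applecross Road", "age": 36}


def similarity(x: str, y: str) -> float:
    """Return a ratio in [0, 1]; 1.0 means the two strings are identical."""
    return SequenceMatcher(None, x.lower(), y.lower()).ratio()


# A naive exact comparison rejects the pair outright.
exact_match = record_a == record_b

# Field-level similarity shows how close the two records actually are.
name_similarity = similarity(record_a["name"], record_b["name"])
address_similarity = similarity(record_a["address"], record_b["address"])
age_gap = abs(record_a["age"] - record_b["age"])

print(f"exact match:        {exact_match}")
print(f"name similarity:    {name_similarity:.2f}")
print(f"address similarity: {address_similarity:.2f}")
print(f"age difference:     {age_gap} years")
```

The exact comparison fails even though every field is close, which is why the decision has to rest on graded similarity rather than equality.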

Sophisticated record matching algorithms approach this complex problem by decomposing the overall process into three tasks: the data preparation phase, the searching phase and the matching phase. The data preparation phase is a pre-processing phase which parses and transforms the data in an effort to remove the syntactic and semantic heterogeneity of the patient demographic data. In the absence of unique patient identifiers, the matching algorithms must use patient demographic data as the matching variables, so in the data preparation phase the individual fields are transformed in order for them to conform to the data types of their corresponding domains. In the searching phase, the algorithms search through the data to identify candidates for potential matches. The brute-force approach of an exhaustive search across the Cartesian product of the data sets is of quadratic order with respect to the number of records, so it is not feasible for linking large data sets. For example, attempting to link two data sets with only 10,000 records each using an exhaustive search would require 100 million comparisons. To reduce the total number of comparisons, blocking algorithms are used which partition the full Cartesian product of possible record pairs into smaller subsets. The topic of blocking algorithms is addressed in more detail in a separate section. In the matching phase, the record pairs identified during the searching phase are compared to identify matches. Typically, a subset of the patient's demographic attributes, which are referred to as the matching variables, is used to identify matches. The corresponding matching variables for each pair of records are compared with one another, forming the comparison vector. The matching decision must determine whether a pair of records should be regarded as linked, not linked or possibly linked, depending upon the various agreements or disagreements of items of identifying information.
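The data preparation phase described at the start of this paragraph can be sketched as follows. This is a minimal illustration under stated assumptions: the field names (`first_name`, `last_name`, `birth_date`), the two accepted date formats and the helper names are all hypothetical and chosen only for the example.

```python
import re
from datetime import date, datetime
from typing import Optional


def standardize_name(raw: str) -> str:
    """Upper-case a name, drop stray punctuation and collapse whitespace."""
    cleaned = re.sub(r"[^A-Za-z\s'-]", "", raw)
    return re.sub(r"\s+", " ", cleaned).strip().upper()


def parse_birth_date(raw: str) -> Optional[date]:
    """Try a few assumed date formats; return None when none of them applies."""
    for fmt in ("%Y-%m-%d", "%m/%d/%Y"):
        try:
            return datetime.strptime(raw.strip(), fmt).date()
        except ValueError:
            continue
    return None


def prepare(record: dict) -> dict:
    """Transform each raw field so that it conforms to its domain's data type."""
    return {
        "first_name": standardize_name(record.get("first_name", "")),
        "last_name": standardize_name(record.get("last_name", "")),
        "birth_date": parse_birth_date(record.get("birth_date", "")),
    }


raw_record = {"first_name": " javier ", "last_name": "Martinez,", "birth_date": "03/15/1977"}
print(prepare(raw_record))
```

Invalid values are mapped to `None` rather than guessed, so that the later matching phase can treat them as missing instead of as disagreements.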
The specification of a record linking procedure requires both a method for measuring closeness of agreement between records and an algorithm that uses this measure for deciding when to classify records as matches. For example, if we are linking person records, a possible measurement would be to compare the family names of the two records and assign the value of 1 to those pairs where there is absolute agreement and 0 to those pairs where there is absolute disagreement. Values between 0 and 1 are most frequently used to indicate how close the values in the two records are. Matching algorithms are classified as deterministic or probabilistic. The section on matching algorithms defines these two classifications in detail and describes the advantages and disadvantages of each approach.

Blocking Algorithms

In most record linkage cases, the two datasets that need to be integrated do not possess a unique identifier, so a collection of record attributes must be used to match records, and record pairs from the two datasets need to be compared to one another to identify matching records. For example, a typical scenario is to use the first name, last name, gender and birth date of each record as the identifying attributes for a distinct patient record. The next step is to compare record pairs from the two data sources and apply a matching algorithm to detect whether the two records of a pair match or not. The naïve approach of comparing every possible record pair from the two datasets is of quadratic complexity and is therefore computationally infeasible for large sets. As was mentioned previously, comparing all the record pairs in an effort to match two data sets with 10,000 records each requires 100 million comparisons, making it prohibitively expensive. In addition to the computational complexity, comparing this many records increases the demands on the memory subsystem to the point where the data sets cannot be kept entirely in memory. The algorithm will need to page records into memory as needed, which complicates the implementation.

To reduce the number of record pairs that need to be compared, record linkage systems employ blocking algorithms. The most commonly used blocking algorithm uses one or more record attributes, referred to as the blocking key. Examples of blocking keys are the first four characters of the last name, or the zip code attribute combined with an age category. The attributes of the blocking key are used to split the datasets into blocks. Then only records from blocks with the same blocking key value need to be compared for possible matches. There is a cost-benefit trade-off to be considered in choosing the blocking keys. If the number of records in each block is very large, as would be the case when using the gender as the only blocking attribute, then more record pairs than necessary will be generated. If the number of records in each block is very small, many potential record matches will be missed since the matching algorithm will only compare records that fall in the same block. An example of this situation would be using the SSN as the only attribute in the blocking key. In this case only records with the same SSN would end up in the same block, which means that records with incorrect or missing values in the SSN field would not match. A blocking algorithm therefore has two conflicting goals: while attempting to reduce the number of record pairs that need to be compared, it must not erroneously assign potentially matching records to two different blocks, because then they will never be compared with one another.
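The standard blocking scheme just described can be sketched in a few lines of Python. The blocking key used here, the first four characters of the last name, is one of the examples from the text; the function names and the tiny sample datasets are hypothetical and serve only to show the trade-off.

```python
from collections import defaultdict
from itertools import product


def blocking_key(record: dict) -> str:
    """Blocking key: the first four characters of the last name, upper-cased."""
    return record["last_name"][:4].upper()


def build_blocks(records: list) -> dict:
    """Group records by their blocking key value."""
    blocks = defaultdict(list)
    for record in records:
        blocks[blocking_key(record)].append(record)
    return blocks


def candidate_pairs(dataset_a: list, dataset_b: list):
    """Yield only those pairs whose records share a blocking key value."""
    blocks_a = build_blocks(dataset_a)
    blocks_b = build_blocks(dataset_b)
    for key in blocks_a.keys() & blocks_b.keys():
        yield from product(blocks_a[key], blocks_b[key])


dataset_a = [
    {"first_name": "Javier", "last_name": "Martinez"},
    {"first_name": "Anna", "last_name": "Smith"},
    {"first_name": "Lee", "last_name": "Johnson"},
]
dataset_b = [
    {"first_name": "Haver", "last_name": "Marteenez"},
    {"first_name": "Anna", "last_name": "Smyth"},
    {"first_name": "Lee", "last_name": "Johnson"},
    {"first_name": "Paul", "last_name": "Martin"},
]

pairs = list(candidate_pairs(dataset_a, dataset_b))
print(f"candidate pairs: {len(pairs)} of {len(dataset_a) * len(dataset_b)} possible")
```

In this sample the Smith and Smyth records land in different blocks (`SMIT` versus `SMYT`) and are never compared, which is exactly the risk the paragraph above warns about.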
It is preferable to use the least error-prone attributes in the blocking key and, as a further safeguard, multiple passes of blocking are sometimes used with different blocking keys in each pass so that potential matches that are missed with a certain blocking key will be considered for matching in a subsequent pass. A number of more advanced blocking algorithms have been proposed over the years that aim to reduce the number of record pairs that need to be compared while also reducing the errors made in separating potential matches.

The sorted neighborhood blocking algorithm partitions each dataset using the blocking key, sorts the records using the blocking key values and then moves a sliding window across the records. Records within the window are then paired with each other and included in the candidate record pair list. For a window of size w and a total number of records of n, this limits the number of possible record pair comparisons for each record to 2w-1 and the total number of generated record pairs to O(wn).

Another blocking algorithm is the Bigram Indexing method. The Bigram Indexing method converts blocking keys into a list of bigrams and generates one block for each bigram value in the list. It then assigns records with the same blocking key value to all the blocks labeled with one of the bigram values of that blocking key.

The last advanced blocking algorithm we describe here is the Canopy Clustering algorithm, which forms clusters by choosing a record at random and putting in its cluster all the records within a certain distance of it. The record chosen at random and any records within a certain tight threshold distance of it are then removed from the candidate set of records. Figure 1 illustrates an example of four data clusters and the canopies that cover them. Points belonging to the same cluster are colored in the same shade of gray. Point A was selected at random and forms a canopy consisting of all points within the outer (solid) threshold. Points inside the inner (dashed) threshold are excluded from being the center of, and thus from forming, new canopies. Canopies for B, C, D, and E were formed similarly to A. Note that the optimality condition holds: for each cluster there exists at least one canopy that completely contains that cluster. Note also that while there is some overlap, there are many points excluded by each canopy. Expensive distance measurements will only be made between pairs of points in the same canopies, far fewer than all possible pairs in the data set [MCCAL]. The number of record pair comparisons resulting from canopy clustering is $O(fn^2/c)$, where n is the number of records in each of the two data sets, c is the number of canopies and f is the average number of canopies a record belongs to. The threshold parameter should be set so that f is small and c is large, in order to reduce the amount of computation.

Figure 1: Illustration of the Canopy Clustering Blocking Algorithm

Matching Algorithms

The subject of record linkage has received active attention from the research community for more than fifty years and during that time many algorithms have been proposed. At the same time, no algorithm has emerged as the most appropriate for use in every possible population of patient demographic data. This plethora of matching algorithms can be classified into two major categories: deterministic or probabilistic algorithms.

Deterministic (or exact or all-or-none) algorithms employ a set of rules based on exact agreement/disagreement results between corresponding fields in potential record pairs. Deterministic algorithms are the most appropriate choice when records in both sources of data that need to be integrated contain a variable or characteristic of an individual that is ideally (i) universally available, (ii) fixed, (iii) easily recorded, (iv) unique to each individual, and (v) readily verifiable. Few, if any, variables meet all these requirements, although some come close enough to be usable. The advantage of deterministic algorithms is that they are easy to implement, but their disadvantage is that they only produce accurate matching results after careful analysis and extensive preprocessing of the data sets that need to be matched. In one of the few published studies of the application of deterministic matching algorithms, the authors describe in detail the process that they employed in order to achieve accurate matching results using a deterministic algorithm. Their objective was to match records from two hospital systems' patient registries with the Social Security Death Master File. The source data was preprocessed using field-specific approaches such as encoding names using phonetic compression algorithms, imputation of missing gender field values using the first name to look up gender-specific names in the Census, and parsing of birth dates into month, day and year followed by correction or elimination of invalid values. They used number theory-based algorithms to detect cases where the order of the names varied between sources. After the preprocessing phase, the authors extracted samples from the two data sets and manually reviewed the matches found using a social security number-only matching phase in order to generate a gold standard for measuring the error rates of linkage variables and for comparing the matching accuracy of various combinations of these variables.
After extensive analysis of various combinations of matching variables, they concluded that the best matching attributes for their dataset were the SSN, the first name transformed by the NYSIIS phonetic encoding algorithm, month of birth and gender. A more recent study that compares a deterministic with a probabilistic matching algorithm using an empirical evaluation comes to similar conclusions regarding the advantages and disadvantages of deterministic algorithms. Deterministic matching algorithms can produce good results if the quality of the source data is fairly high, the data is preprocessed carefully and the matching attributes are selected after a detailed analysis of their performance in small subsets of the data.

Probabilistic algorithms have gained popularity due to their applicability in the most common scenarios, where the data sets that need to be integrated do not possess one or more attributes with the characteristics described earlier. Most data sets have attributes whose values contain errors and omissions, and they typically do not possess a unique and universally available high quality identifier. Probabilistic record matching relies on calculating scores to determine whether two records are a match, and those underlying scores are based on probabilities. A set of record attributes is first selected after analyzing the data set, and those attributes form what is referred to as the match key. Candidate record pairs are then compared to one another by using a distance metric to calculate the similarity between corresponding fields of the match key. The values of the distance metric for each record pair form a comparison vector. The probability of each comparison vector is then evaluated to determine whether the two records are close enough to each other to be classified as a match. The probabilistic approach to record linkage originated with the work of geneticist Howard Newcombe, who introduced odds ratios of frequencies and the decision rules for delineating matches and non-matches. Fellegi and Sunter then organized Newcombe's ideas into a rigorous mathematical framework which formalized the record linkage problem in probabilistic terms.

Before presenting the rigorous formulation of the record linkage problem, some notation needs to be introduced. The assumption is that there are two sets of records that need to be linked to one another. The two sets of records are denoted as $A$ and $B$, and a single record from each set is denoted by the lower case letters $a$ and $b$, respectively. The set of all possible record pairs consists of the Cartesian product of the two sets

$$A \times B = \{(a, b);\ a \in A,\ b \in B\}$$

The objective of the matching algorithm is to partition the set of record pairs into the two subsets

$$M = \{(a, b) \in A \times B \mid a = b\}$$

of matching pairs and

$$U = \{(a, b) \in A \times B \mid a \neq b\}$$

of non-matching pairs. Note that since this is a partitioning of the set $A \times B$, it follows that $M \cup U = A \times B$ and $M \cap U = \emptyset$. Let us assume that the match key consists of $K$ attributes. The two records in each record pair are compared with one another, forming a comparison vector $\gamma$ of length $K$. The space of all possible values of $\gamma$ is $\Gamma$:

$$\gamma^j = [\gamma^j_1, \gamma^j_2, \ldots, \gamma^j_K]^T \quad \text{with } \gamma^j_i \in \{0, 1\}.$$

In particular, $\gamma^j_i = 1$ indicates that the corresponding values for field $i$ in the two records in record pair $j$ are close enough, based on the selected distance metric, to be considered a match, and $\gamma^j_i = 0$ otherwise.
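As a concrete illustration of the binary comparison vector defined above, the sketch below builds $\gamma$ for one record pair. The match key, the choice of `difflib.SequenceMatcher` as the distance metric, the agreement threshold of 0.8 and the function names are assumptions made for this example only.

```python
from difflib import SequenceMatcher

# Hypothetical match key with K = 4 attributes.
MATCH_KEY = ("first_name", "last_name", "gender", "birth_date")


def field_similarity(x: str, y: str) -> float:
    """Similarity in [0, 1] between two field values (1.0 means identical)."""
    return SequenceMatcher(None, str(x).lower(), str(y).lower()).ratio()


def comparison_vector(a: dict, b: dict, threshold: float = 0.8) -> list:
    """Return gamma: 1 where a field is close enough to agree, 0 otherwise."""
    return [
        1 if field_similarity(a[field], b[field]) >= threshold else 0
        for field in MATCH_KEY
    ]


record_a = {"first_name": "Javier", "last_name": "Martinez",
            "gender": "M", "birth_date": "1977-03-15"}
record_b = {"first_name": "Haver", "last_name": "Marteenez",
            "gender": "M", "birth_date": "1977-03-15"}

gamma = comparison_vector(record_a, record_b)
print(dict(zip(MATCH_KEY, gamma)))
```

The threshold controls how much variation a field may show before it counts as a disagreement; lowering it makes more fields agree, at the cost of admitting more coincidental similarities.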
If we assume that the observation of $\gamma^j$ is an event generated from some probability distribution, we can then consider the conditional probability of observing $\gamma$ given that the record pair is a match, and denote that distribution as

$$m(\gamma) = P(\gamma \mid (a, b) \in M) = P(\gamma \mid M)$$

as well as the conditional probability of observing $\gamma$ given that the record pair is a non-match, denoted as

$$u(\gamma) = P(\gamma \mid (a, b) \in U) = P(\gamma \mid U)$$

Given this model, a matching algorithm is simply a mapping or decision rule that, upon observing the comparison vector $\gamma$ for record pair $(a, b)$, decides whether it is a matched pair, labeling the record pair as $A_1$, or an unmatched pair, labeling it as $A_3$. There will be some cases in which neither label can be assigned with sufficient confidence, and those cases will be labeled as $A_2$. A linkage rule can now be defined as a mapping from $\Gamma$, the comparison space, onto a set of random decision functions $D = \{d(\gamma)\}$ where

$$d(\gamma) = \{P(A_1 \mid \gamma),\ P(A_2 \mid \gamma),\ P(A_3 \mid \gamma)\};\quad \gamma \in \Gamma$$

and

$$\sum_{i=1}^{3} P(A_i \mid \gamma) = 1.$$

There are two types of error associated with a linkage rule. The first occurs when an unmatched record pair is assigned to be a match (also referred to as a Type I error), and the probability of this error is

$$P(A_1 \mid U) = \sum_{\gamma \in \Gamma} u(\gamma)\, P(A_1 \mid \gamma).$$

The second type of error occurs when a matched pair is assigned to be a non-match (also referred to as a Type II error):

$$P(A_3 \mid M) = \sum_{\gamma \in \Gamma} m(\gamma)\, P(A_3 \mid \gamma).$$

For fixed values of the false match rate $\mu$ and false non-match rate $\lambda$, Fellegi and Sunter define the optimal linkage rule at levels $\mu$ and $\lambda$, denoted by $L(\mu, \lambda, \Gamma)$, as the rule for which $P(A_2 \mid L) \leq P(A_2 \mid L')$ over all possible rules $L'(\mu, \lambda, \Gamma)$. They then define a linkage rule by first assigning an ordering to all comparison vectors $\gamma$ such that the corresponding sequence of ratios $m(\gamma)/u(\gamma)$ is monotone decreasing, indexing the ordered set $\{\gamma_i\}$ by the subscript $i$ $(i = 1, 2, \ldots, N_\Gamma)$ and writing $u_i = u(\gamma_i)$, $m_i = m(\gamma_i)$, where $N_\Gamma = |\Gamma|$. They finally prove that if

$$\mu = \sum_{i=1}^{n} u_i,\qquad \lambda = \sum_{i=n'}^{N_\Gamma} m_i,\qquad n < n',$$

then $L_0(\mu, \lambda, \Gamma)$ is the optimal algorithm at the levels $(\mu, \lambda)$, where the decision rule is defined as

$$d(\gamma_i) = \begin{cases} (1, 0, 0) & \text{if } 1 \leq i \leq n \\ (0, 1, 0) & \text{if } n < i < n' \\ (0, 0, 1) & \text{if } n' \leq i \leq N_\Gamma \end{cases}$$

The proof of the optimality of the algorithm, as well as details on its implementation, can be found in the original paper by Fellegi and Sunter.

1.1. OpenEMPI

OpenEMPI is an Open Source Enterprise Master Patient Index (EMPI) that originated from the remnants of the Care Data Exchange software that were turned over to the open source community after the Santa Barbara County (California) Care Data Exchange (SBCCDE) ceased operations. Since it was originally released it has undergone considerable refactoring and redesign in an effort to achieve the following goals:

a. decompose the overall system into a collection of services,
b. make it extensible so that new algorithms can be easily embedded into the system over time,
c. optimize the data model to support instances with large numbers of patients, and
d. provide standards-based integration access points into the system so that OpenEMPI can be easily integrated into existing healthcare environments.

To achieve the first goal, the system was redesigned so that the overall architecture of the software is now based on Service Oriented Architecture (SOA) principles; the overall system architecture consists of a collection of loosely coupled components, and interaction among components only takes place through well defined interfaces. Figure 2 illustrates the new architecture of OpenEMPI, with only some of the services included in the figure for conciseness. The figure also illustrates the layered nature of the architecture, where services are allocated to the data access layer, the service layer or the UI layer, and where each layer builds on top of the layer below. Layered architectures provide flexibility by allowing entire layers to be removed and replaced without affecting the rest of the system. For example, moving OpenEMPI to utilize storage in the cloud would simply require implementation of a new data layer to be deployed on a Storage-as-a-Service infrastructure, without requiring modifications to the rest of the application. Some of the key services that comprise the architecture of OpenEMPI include the Blocking Service, which abstracts the algorithms that reduce the number of record pairs that need to be compared for matching purposes; the Matching Service, which provides an abstraction for the algorithms that determine whether two or more patient records in the system refer to the same patient; the String Comparison Service, which determines the measure of similarity between two patient demographic attributes; and the Standardization Service, which supports the data preparation phase and transforms patient attributes into a standard format for the purpose of improving matching performance. By decomposing the system into components and specifying interfaces as the only means of interaction between those components, the new architecture achieved the extensibility goal.
The system can be easily extended by simply implementing a different version of a service and plugging it into the system, without requiring any modifications to the rest of the system. This capability of plugging new and interchangeable algorithms into any of the services is crucial, since it allows OpenEMPI to become a platform for the testing and validation of various intelligent algorithms in any of the three phases of the record linkage process.
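The plug-in idea can be illustrated with a small sketch. It is written in Python purely for illustration and does not reproduce OpenEMPI's actual interfaces; the names `StringComparator`, `ExactComparator`, `CaseInsensitiveComparator` and `MatchingService` are hypothetical.

```python
from typing import Protocol, Sequence


class StringComparator(Protocol):
    """Hypothetical interface: any comparator only has to provide similarity()."""

    def similarity(self, left: str, right: str) -> float:
        ...


class ExactComparator:
    """Scores 1.0 only when the two values are character-for-character equal."""

    def similarity(self, left: str, right: str) -> float:
        return 1.0 if left == right else 0.0


class CaseInsensitiveComparator:
    """Alternative implementation that ignores case and surrounding whitespace."""

    def similarity(self, left: str, right: str) -> float:
        return 1.0 if left.strip().lower() == right.strip().lower() else 0.0


class MatchingService:
    """Depends only on the comparator interface, never on a concrete class."""

    def __init__(self, comparator: StringComparator, fields: Sequence[str]) -> None:
        self.comparator = comparator
        self.fields = fields

    def is_match(self, a: dict, b: dict) -> bool:
        """Declare a match only when every configured field agrees fully."""
        return all(
            self.comparator.similarity(a[field], b[field]) == 1.0
            for field in self.fields
        )


record_a = {"first_name": "Javier", "last_name": "Martinez"}
record_b = {"first_name": "JAVIER ", "last_name": "martinez"}
fields = ("first_name", "last_name")

strict = MatchingService(ExactComparator(), fields)
relaxed = MatchingService(CaseInsensitiveComparator(), fields)
print(strict.is_match(record_a, record_b), relaxed.is_match(record_a, record_b))
```

Swapping the comparator changes the matching behavior without touching `MatchingService`, which is the property the service-based decomposition is meant to provide.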

Figure 2: SOA-based Architecture of OpenEMPI

An EMPI internally utilizes algorithms from the field of record linkage to detect and link duplicate records. Over the years many algorithms have been proposed with different operating assumptions and performance characteristics, as described in previous sections of this report. By utilizing the extensibility of OpenEMPI, multiple alternative matching algorithms may be implemented and the most appropriate choice may be selected during deployment. The software distribution currently includes a fully functional implementation of both a simple deterministic algorithm and a probabilistic matching algorithm, which is an implementation of the Fellegi-Sunter algorithm that uses Expectation-Maximization (EM) for estimating the marginal probabilities of the model. The same extensibility mechanism is also available in the blocking service implementation and in the string comparison service. The current string comparison service provides a number of string distance metrics including an exact string comparator and the Jaro, Jaro-Winkler, and Levenshtein distance metrics, among others.

The next objective for the redesign of OpenEMPI was the optimization of the data model to support the persistence and retrieval of patient identifying and demographic data. This objective was selected for two reasons. First, the intent is for OpenEMPI to be suitable for implementation in large-scale healthcare environments with hundreds of thousands to millions of patient records, where the system needs to sustain intense user workloads. This goal can only be achieved if the data model is designed from the beginning with high performance as a key criterion. The second reason for the focus on optimization of the data model was to form a solid basis on top of which multiple additional matching algorithms could be developed, tested and deployed in a production environment. With a data model where query performance is not considered during design, it would be impossible to develop matching algorithms that are efficient and suitable for implementation in large-scale environments.
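To connect the probabilistic matcher mentioned above with the Fellegi-Sunter formulation, the sketch below scores a comparison vector and assigns one of the three decisions. It rests on assumptions that are not taken from the report: the fields are treated as conditionally independent so that the ratio $m(\gamma)/u(\gamma)$ factorizes per field, the m- and u-probabilities are invented illustrative numbers rather than EM estimates, and the two thresholds are arbitrary.

```python
import math

# Illustrative per-field probabilities (assumed values, not estimated from data):
# m = P(field agrees | pair is a match), u = P(field agrees | pair is a non-match).
M_PROBS = {"first_name": 0.90, "last_name": 0.90, "gender": 0.95, "birth_date": 0.95}
U_PROBS = {"first_name": 0.05, "last_name": 0.02, "gender": 0.50, "birth_date": 0.01}


def match_weight(gamma: dict) -> float:
    """Sum of log2 likelihood ratios, assuming conditionally independent fields."""
    total = 0.0
    for field, agrees in gamma.items():
        m, u = M_PROBS[field], U_PROBS[field]
        if agrees:
            total += math.log2(m / u)
        else:
            total += math.log2((1 - m) / (1 - u))
    return total


def classify(weight: float, lower: float = 0.0, upper: float = 12.0) -> str:
    """Map a weight to A1 (match), A2 (possible match) or A3 (non-match)."""
    if weight >= upper:
        return "A1"
    if weight <= lower:
        return "A3"
    return "A2"


all_agree = {field: 1 for field in M_PROBS}
all_disagree = {field: 0 for field in M_PROBS}
first_name_differs = {**all_agree, "first_name": 0}

for label, gamma in [("all agree", all_agree),
                     ("first name differs", first_name_differs),
                     ("all disagree", all_disagree)]:
    weight = match_weight(gamma)
    print(f"{label}: weight={weight:.2f} decision={classify(weight)}")
```

Ordering comparison vectors by this weight is equivalent to ordering them by the ratio $m(\gamma)/u(\gamma)$, and the two thresholds play the role of the cut-off indices that separate the three decision regions.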

The final objective was the provision of integration points into OpenEMPI in order to facilitate its integration into the Information Technology (IT) infrastructure of existing healthcare environments. To achieve this goal, OpenEMPI provides support for the Patient Identifier Cross-Referencing (PIX) and Patient Demographics Query (PDQ) standards defined by the Integrating the Healthcare Enterprise (IHE) organization. IHE defines workflows and specifies the standards to be used in the implementation of those workflows in order to promote the sharing of medical information. The PIX profile supports the cross-referencing of patient identifiers from multiple Patient Identifier Domains, and the PDQ profile provides ways for multiple distributed applications to query a patient information server for a list of patients, based on user-defined search criteria, and retrieve a patient's demographic information. OpenEMPI utilizes the open source implementation of the PIX/PDQ profiles in OpenPIXPDQ, and the two projects combined have been tested successfully at both the 2009 and 2010 IHE Connectathons. To further simplify the integration of OpenEMPI into existing IT infrastructure, there are plans for the development of both a REST-based and a SOAP-based web services interface to the full functionality offered by OpenEMPI.