A Conditional Model of Deduplication for Multi-Type Relational Data

Transcription

1 A Conditionl Model of Dedupliction for Multi-Type Reltionl Dt Aron Culott, Andrew McCllum Deprtment of Computer Science University of Msschusetts Amherst, MA {culott, Astrct Record dedupliction is the tsk of merging dtse records tht refer to the sme underlying entity. In reltionl dtses, ccurte dedupliction for records of one type is often dependent on the merge decisions mde for records of other types. Wheres nerly ll previous pproches hve merged records of different types independently, this work models these inter-dependencies explicitly to collectively deduplicte records of multiple types. We construct conditionl rndom field model of dedupliction tht cptures these reltionl dependencies, nd then employ novel reltionl prtitioning lgorithm to jointly deduplicte records. We evlute the system on two cittion mtching dtsets, for which we deduplicte oth ppers nd venues. We show tht y collectively deduplicting pper nd venue records, we otin up to 30% error reduction in venue dedupliction, nd up to 20% error reduction in pper dedupliction over competing methods. 1 Introduction A common prerequisite for knowledge discovery is ccurtely comining dt from multiple, heterogeneous sources into unified, minele dtse. An importnt step in creting such dtse is record dedupliction: consolidting multiple records tht refer to the sme strct entity. The difficulty in this tsk rises oth from dt errors (e.g. misspellings nd missing fields) nd from vrints in field vlues (e.g. revitions). Most historicl pproches hve frmed the dedupliction prolem s set of independent decisions. For ech pir of records, similrity score is clculted, nd the records re merged if the similrity is ove some threshold [8]. The decisions re comined y tking the trnsitive closure of the resulting djcency mtrix. More recently, McCllum nd Wellner [13] nd Prg nd Domingos [20] hve demonstrted tht mking multiple dedupliction decisions collectively cn provide etter results thn historicl pproches. These models re types of conditionl rndom fields (CRFs) [9], where the oserved nodes re mentions, nd the predicted nodes re the dedupliction decisions for ech pir of nodes. By frming inference s n instnce of grph prtitioning, the models re collective in the sense tht mentions re clustered sed not only on their distnce to ech other, ut lso on their distnce from ll other prtitions. By treting 1

2 dedupliction decisions in dependent reltion to ech other, inconsistencies nd noise in the similrity metric my e overcome. This pper presents model for collective dedupliction, extended to the importnt nd uiquitous cse of reltionl dtses, where records hve types, nd where there exist reltions etween records of different types. These reltions provide useful evidence for dedupliction decisions ecuse the identity of record often depends on the identities of relted records. For exmple, consider dtse of reserch ppers, where records cn e of type pper, venue, or uthor. If two pper records re leled s duplictes, then it follows tht the venue records corresponding to those ppers should lso e leled s duplictes. The reverse is more sutly true: if two venues re duplictes, then this my slightly increse the proility tht their corresponding ppers re duplictes. We propose model tht leverges these sutle interdependencies to mke dedupliction decisions collectively cross multiple record types. In prticulr, we present CRF for the cittion domin tht provides conditionl proilistic model of dedupliction decisions over records of multiple types given oserved record mentions nd the reltions mong them. We propose novel, reltionl grph prtitioning lgorithm for inference tht not only ensures tht dedupliction decisions mde for different record types re consistent, ut lso llows the decisions from one record type to inform the decisions for nother record type. Prmeter estimtion consists of mximizing the product of locl mrginls for pirs of records of different types. Tht is, we prmeterize the CRF to lern weights over 4-tuples consisting of record pir nd relted record pir of different type. In the cittion domin, these 4-tuples consist of pir of pper records, nd pir of relted venue records. In this wy, the model lerns prmeters to trde-off pper nd venue dedupliction decisions. We provide results on dtse of reserch ppers, where we show tht modeling dedupliction of pper nd venue records collectively improves dedupliction performnce for ech type, providing up to 30% error reduction in venue dedupliction, nd up to 20% error reduction in pper dedupliction over previously proposed collective model [13] tht does not model the dependencies etween record types. 2 Relted Work To the est of our knowledge, this is the first pper to present discrimintive, collective model of dedupliction for multiple, relted record types nd demonstrte empiriclly the performnce gins ttinle over independent models. We riefly review clssicl work in dedupliction, then discuss recent efforts in collective dedupliction. Record dedupliction, known vriously s record linkge, coreference resolution, dedupliction, nd identity uncertinty, is prevlent in mny fields, including computer vision, dtses, nd nturl lnguge processing. 2

3 Originlly introduced in the dtse community s record linkge [17], record dedupliction ws lter formlized y Fellegi nd Sunter [8] s the computtion over fetures etween pirs of records, nd further extended y Winkler [25, 26]. This previous work clcultes similrity score for record pirs, collpses those ove similrity threshold, then performs trnsitive closure. It is not reltionl in the sense tht one dedupliction decision does not directly ffect nother. More recent record linkge work hs considered the dedupliction of ctegoricl dt, llowing ttriutes to e deduplicted long with records [1]; however, tht work does not utilize mchine lerning nd requires thresholds to e set mnully. Methods of lerning etter similrity score hve een investigted recently in the dtse community [5, 7]. Similr trends exist in nturl lnguge processing for the tsk of coreference resolution, where reserch hs focused on lerning more useful similrity metrics nd pplying them to thresholding techniques nlogous to those found in the dtse community [18, 16]. Only recently hve collective dedupliction models een investigted. Milch et. l hve introduced genertive models to reson in worlds with n unknown numer of ojects, enling proility distriutions to e defined over reltionl dt with mny oject types [11, 15]. While these models hve ppeling forml semntics, their genertive nture forces the model to mke conditionl independence ssumptions mong fetures. Our work cn e viewed s extensions to recent models pplying conditionl rndom fields to the dedupliction tsk [13, 24, 20]. McCllum nd Wellner [13] hve presented CRFs which perform collective coreference y equting inference in the CRF s grph prtitioning prolem, resulting in collective coreference of records of one type. This previous work demonstrted the dvntges collective coreference hs over clssicl pproches; however, it does not model multiple types of coreferent ojects. Prg nd Domingos [20] present CRF model similr to tht in McCllum nd Wellner [13] in tht it collectively deduplictes records of the sme type using grph prtitioning. Additionlly, this method llows informtion to propgte etween records y wy of their shred ttriutes. However, the Prg nd Domingos model does not tret ttriutes s first-clss ojects. In prticulr, their model collpses string identicl ttriute nodes nd cretes informtion nodes to model whether or not ttriutes mtch. The model does not explicitly optimize dedupliction decisions for ttriutes; rther, the informtion nodes cn e viewed s n input vriles to record dedupliction. An importnt distinction with our work is tht in the Prg nd Domingos model, joint dedupliction only occurs mong records shring n identicl ttriute. This is often not the cse in rel dt. In sense, the Prg nd Domingos model cn e viewed s discrimintive version of recently proposed hierrchicl model for dedupliction y Rvikumr nd Cohen [22], which introduces ltent mtch nodes for ttriutes. Here gin, determining whether ttriute vlues re coreferent is viewed s locl decision used s input to the record dedupliction decision. Our model insted 3

4 PID Author Title Venue VID 0 X. Li Predicting the stock mrket CIKM 10 1 X. Li Predicting the stock mrket Conf on Informtion Mngement 20 2 J. Smith Semi-Definite Progrmming CIKM 30 3 Smith, J. Semi-Definte Progrming Conference on Info Mngement 40 Tle 1: An exmple of four ppers with the sme venues in pulictions dtse. PID is the pper id, nd VID is the venue id. trets ttriutes s records themselves, performing full dedupliction on them s well s their relted records. Other recent work in knowledge discovery hs leverged reltionl informtion to perform coreference [3, 4]. These models define similrity metric etween records tht considers the identity of relted records. This is similr in spirit to our model, since it uses dedupliction decisions of relted records to clculte the similrity etween records. However, their model is minly concerned with deduplicting uthors lone, nd does not explicitly model dedupliction of multiple record types. Also, the trining methods descried in those models do not cpture the rich set of fetures ville to the model presented in this pper. Our model cn lso e descried s type of reltionl Mrkov network (RMN) [23], which hve een employed successfully in reltionl domins, lthough not multi-type dedupliction. Another importnt distinction is tht our trining nd inference methods differ sustntilly from the loopy elief propgtion lgorithm commonly used in RMNs. 3 Motivting Exmple We first provide n exmple to motivte the potentil enefits of collective dedupliction for dtses with multiple record types. Consider gin dtse of reserch ppers, with uthor, pper, nd venue records. Our tsk is to deduplicte the vrious mentions of these records into unique entities. Tle 1 shows dtse of four ppers nd four venues, ech with unique ids. Ppers 0 nd 1 nd ppers 2 nd 3 should e merged; ll the venues should e merged. Imgine n gglomertive dedupliction system which egins y ssuming ech record is unique. Suppose the system first considers merging ppers 0 nd 1. Although the venues do not mtch, ll the other fields re exct mtches, so it is fesile tht the system my overcome this discrepncy. After merging ppers 0 nd 1, the system lso merges the corresponding venues 10 nd 20 into the sme cluster, since the venues of duplicte ppers must themselves e duplictes. Imgine the system next merges venues 10 nd 30 ecuse they re string identicl. The system must now decide if ppers 2 nd 3 re duplictes. Treted in isoltion, system my hve hrd time correctly detecting tht 2 nd 3 re duplictes: the uthors re highly similr, ut the title contins two 4

5 r ii x i y ij y jk x k x i y ij y jk x k x i y ij y jk x k x i y ij y jk x k x i r jj y ij y jk x k r kk x i y ij y jk x k () () (c) r jj Figure 1: Three incresingly complex models of pper nd venue dedupliction. X nd X re pper nd venue records, respectively. Y is inry rndom vrile indicting duplicte records, nd R ij denotes whether pper X i ws pulished in venue Xj. misspellings, nd the venues re extremely dissimilr. However, the system we hve descried so fr is fortunte to hve more informtion t its disposl. It hs lredy merged venues 10 nd 20, which re highly similr to venues 30 nd 40. By consulting its dtse of deduplicted venues, it could determine tht 30 nd 40 re in fct the sme venue. With this informtion in hnd, it my e more forgiving of the spelling mistkes in the title, finlly merging ppers 2 nd 3 correctly. This exmple illustrtes the notion tht the identity of n oject is dependent on the identity of relted ojects. Notice tht y using reltionl informtion, the system not only merged ppers it might not hve otherwise (2 nd 3), ut lso merged venues it might not hve (10, 20 nd 30, 40). Indeed, the chin of dedupliction decisions which led to optiml performnce interleved pper nd venue decisions: (0,1), (10,20), (30,40), (2,3). The work presented here descries system tht models the dedupliction decisions of relted records collectively, enling the sort of proilistic trdeoffs instrumentl to the success of the system in this exmple. 4 Model The model is n instnce of conditionl rndom field tht jointly models the conditionl proility of multiple dedupliction decisions given n oserved reltionl dtse. We egin with rief review of conditionl rndom fields, followed y forml description of the model. We then descrie the pproximtions used to mke inference nd prmeter estimtion trctle for this model. 5

6 4.1 Conditionl Rndom Fields Conditionl rndom fields (CRFs) [9] re undirected grphicl models encoding the conditionl proility of set of output vriles Y given set of evidence vriles X. The set of distriutions expressile y CRF is specified y n undirected grph G, where ech vertex corresponds to rndom vrile. If C = {{y c, x c }} is the set of cliques in G, then the conditionl proility of y given x is p Λ (y x) = 1 φ c (y c, x c ; Λ) Z x c C where φ is potentil function prmeterized y Λ nd Z x = y c C φ(y c, x c ) is normliztion fctor. We ssume φ c fctorizes s log-liner comintion of ritrry fetures computed over clique c, therefore ( ) φ c (y c, x c ; Λ) = exp λ k f k (y c, x c ) The model prmeters Λ = {λ k } re set of rel-vlued weights typiclly lerned from leled trining dt y mximum likelihood estimtion. 4.2 CRFs for multi-type dedupliction Let X e collection of rndom vriles representing oserved record mentions in dtse tht requires dedupliction. For clrity, ssume there re only two types of records, X = (X, X ), where X = (X1,..., Xn), X = (X1,..., Xm). The gol of dedupliction is to prtition X into clusters of records tht ll refer to the sme strct entity. To this end, we define collection of inry rndom vriles Y = (Y, Y ) tht indicte whether or not two records re duplictes. For exmple, Yij indictes whether or not records Xi nd Xj re coreferent. We lso define the inry rndom vriles R, where Rij indictes whether some ritrry reltion R holds etween record mentions Xi nd Xj. For exmple, in reserch pper dtse, X represents the set of pper records, X represents the set venue records, Yij indictes whether X i nd Xj re duplictes, nd Rij indictes whether pper X i ws pulished t venue X j. In the generl cse where R is unoserved, one could construct the conditionl distriution P (Y, Y, R X). With this model, one cn infer from n oserved set of records the most prole set of duplicte records nd the most prole set of reltions etween records. For exmple, in the pulictions domin, one my wnt to model the dvisor of reltion etween uthors, while lso modeling uthor dedupliction. We postpone this investigtion for future work, nd insted focus on the cse where R is oserved. For instnce, in cittion dt, we know which venues records re relted to which pper records. Thus, we desire to model the conditionl distriution P (Y, Y X, R). k 6

7 Figure 1 displys three incresingly complex grphicl models of this conditionl distriution. Vertices re rndom vriles, edges indicte possile proilistic dependence etween vriles, nd shded vertices indicte oserved vriles. Model () corresponds to the clssicl pproch, which trets ech dupliction decision independently. Model () is the pproch evluted in McCllum nd Wellner [13], extended here to the cse of multiple record types. Note tht in this model dedupliction decisions for records of the sme type re mde collectively. Model (c) is the one this pper dvoctes. Not only re dedupliction decisions for records of the sme type mde collectively, ut lso the decisions for one type of record re dependent on decisions mde for relted records. For ese of presenttion, we hve only included the oserved reltion vriles R which re true. We now provide more precise description of model (c). Let x ij = x i,, x i, x j e pir of oserved pper record mentions nd their corresponding venue records. To cpture the dependence etween yij nd yij, we fctorize the potentil functions to consider them jointly, resulting in the model: p(y, y x, r) = 1 Z x exp ( i,j,l λ l f l (x ij, y ij, y ij, r ij ) + ) λ f (yij, yjk, yik, yij, yjk, yik) i,j,k where the fetures f re consistency checking functions used to enforce trnsitivity mong dedupliction decisions. For exmple, if ppers x i nd x j re coreferent, nd x j nd x k re coreferent, then not only must ppers x i nd x k e coreferent, ut venues x i, x j, x k must lso e coreferent. (Note f is of nottionl use only in prctice, the inference lgorithm simply voids these impossile configurtions.) Becuse oth yij nd y ij re rguments to the feture functions f l, these potentils cpture the cross-product of pper nd venue dedupliction decisions. This llows the lerned weights to encourge merging pper records which hve equivlent venue records, nd to discourge merging ppers with different venues. The cost of modeling these interdependencies is highly connected grphicl model, which necessittes pproximtions in oth inference nd prmeter estimtion. We descrie these pproximtions elow. 4.3 Inference Inference in this model corresponds to finding the solution to y = (y, y ) = rgmx p Λ (y, y x, x, r) y 7

8 tht is, finding the most prole dedupliction decisions y given x, x, r nd the lerned prmeters Λ. Exct inference in this model is intrctle ecuse the spce of possile y is exponentil in the numer of records x, nd the the high connectivity of the grph precludes fesile dynmic progrm to mke this serch trctle. One common pproximte inference technique for such predicment is to perform loopy elief propgtion; tht is, perform stndrd elief propgtion [21], ignoring the messge doule-counting cused y the cycles in the grph. However, the severe cyclicity of this model my require prohiitive mount of time for elief propgtion to converge, if it converges t ll. Insted, we follow recent work which finds n equivlence etween grph prtitioning lgorithms nd inference in certin undirected grphicl models [6]. We first trnsform our grph to weighted, undirected grph tht only contins vertices for vriles x nd hs edges weighted y the (log) clique potentil for ech pir of vertices. The vlue on these edges depends on which type of records they join. For pper edges, we define the weight w ij = y ij {0,1} ( l l λ l f l (x ij, y ij = 1, y ij, r ij ) ) λ l f l (x ij, yij = 0, yij, rij ) nd similrly for venue edges: w ij = y ij {0,1} ( l l λ l f l (x ij, y ij, y ij = 1, r ij ) ) λ l f l (x ij, yij, yij = 0, rij ) Intuitively, the pper weights wij cn e thought of s the comptiility of ppers x i,, summed over possile dedupliction decisions for venues x i, x j. Similrly, the venue weights wij cn e thought of s the comptiility of venues x i, x j, summed over the possile dedupliction decisions for ppers x i,. Interpreting the weights s the similrity etween two records, we cn see tht the similrity of pper records considers the similrity of their venue records, nd vice vers. This results in weighted, undirected grph with edge weights rnging from to +. It cn e shown tht finding n optiml prtitioning of this grph corresponds to finding the optiml configurtion y in the originl undirected grphicl model. Here, the numer of prtitions is unknown, s it corresponds to the numer of unique records. 8

9 Although grph prtitioning with positive nd negtive edge weights is NPhrd, there exist severl good pproximtions, including recent work in correltion clustering [2]. Additionlly, McCllum nd Wellner [13] hve found tht greedy gglomertive clustering with n verge link criterion works well in prctice. However, trditionl prtitioning lgorithms would not ccount for the known dependencies etween clusters tht exist in our dt. Therefore, we develop novel, reltionl gglomertive clustering lgorithm tht exploits these dependencies. Trditionl greedy gglomertive clustering first initilizes ech vertex to its own cluster, then itertively merges the clusters tht re closest, where the distnce etween clusters is often defined s the verge of the edge weights connecting the two clusters. We ugment this lgorithm with two enhncements. First, we must enforce the constrint tht duplicte ppers hve duplicte venues. This is stright-forwrdly enforced y the following rule: Whenever pir of pper clusters re merged, their corresponding venue clusters must lso e merged. The second enhncement redefines the distnce etween clusters to more ccurtely reflect the impct of the first enhncement. Let Ci, C j e two pper clusters tht re cndidtes to e merged, nd let Ci, C j e the venue clusters corresponding to these ppers. The first enhncement requires tht if we merge Ci, C j, we must lso merge C i, C j. However, the current distnce metric etween Ci, C j does not reflect this fct. To remedy this, we redefine the distnce etween two pper clusters (Ci, C j ) to e the verge of (1) the trditionl distnce etween the pper clusters (Ci, C j ) nd (2) the trditionl distnce etween their corresponding venue clusters (Ci, C j ). (Note tht we choose the verge rther thn the sum to del with ppers tht hve no venue informtion.) This metric is likely to etter pproximte the effect merging Ci, C j will hve on the ojective function p Λ (y x, r), since it ccounts for the merger of the corresponding venue clusters. This new clustering lgorithm provides enefits to oth pper nd venue dedupliction tht would e unville in n independent clustering lgorithm. As illustrted in our motivting exmple in Section 3, it is often the cse tht pper duplictes re not detected ecuse they hve venues with decidedly different surfce forms (e.g. CIKM nd Conference on Knowledge nd Informtion Mngement ). The second enhncement ddresses this prolem y using the evidence from previous venue clusterings to inform pper dedupliction. Specificlly, if CIKM hs lredy een resolved with Conference on Knowledge nd Informtion Mngement, then merging ppers with venues CIKM nd Conference on Knowledge nd Informtion Mngement will e encourged, since there will e high similrity etween their ssocited venue clusters. Conversely, y the hrd constrint introduced in the first enhncement, difficult venue dedupliction decisions re informed y confident pper dedupliction decisions, s ws lso illustrted in Section 3. In this wy, dedupliction decisions for oth record types simultneously grow more ccurte. 9

10 4.4 Prmeter Estimtion Given leled corpus of fully clustered dt, mximum likelihood prmeter estimtion corresponds to finding the prmeters Λ which mximize the loglikelihood of the leled trining dt. Exct estimtion is intrctle here ecuse it requires clculting the normliztion term Z x, sum over ll possile vlues of y, which is sum over ll possile prtitionings of the dt. Due to the nture of the dt nd the high connectivity of the grph, this cnnot e efficiently computed with dynmic progrm. One could perform stochstic grdient scent on n pproximtion of the likelihood. However, it hs een noted tht mximizing product of locl mrginls performs t lest s well s this pproximtion on similr coreference tsk, if not etter [24, 12]. Wheres in [24] the locl mrginls re over single coreference decisions, here we mximize product of joint conditionl proilities for decisions yij nd y ij, which we define s P Λ (yij, yij x, x, r) = 1 Z x exp λ l f l (x ij, yij, yij, r ij ) i,j,l where Z x is normliztion constnt summing over possile vlues for the pir yij, y ij. For leled dtset D, the log-likelihood is defined s L Λ (D) = log y ij,y ij D P Λ (y ij, y ij x, x, r) We perform grdient scent on L y mximizing its derivtive: L λ l = x,y,r D ( i,j,l y ij,y i,j,l λ l f l (x ij, y ij, y ij, r ij ) ij P Λ (y ij, y ij x, x, r) λ l f l (x ij, y ij, y ij, r ij ) Becuse the defined likelihood is convex function, we cn perform grdient scent using ny suitle optimiztion lgorithm. In prticulr, we use limited-memory BFGS, which itertively pproximtes second-order curvture informtion to speed up convergence [19]. The estimtion method cn lso e viewed s lerning distnce metric etween pper-venue pirs. As explined in Section 4.3, this metric is used to weight the edges in the dedupliction grph, which is then prtitioned t inference time. ) 10

11 5 Experiments We evlute our model on two dtsets of reserch pper cittions. The first is from Citeseer [10], contining pproximtely 1500 cittions, with 900 unique ppers nd 350 unique venues. The second is from the Cor Computer Science Reserch Pper Engine 1, contining out 1800 cittions, with 600 unique ppers nd 200 unique venues. Both dtsets re mnully leled for oth pper coreference nd venue coreference, s well s mnully segmented into fields, such s uthor, title, etc. The dt were collected y serching for certin uthors nd topic, nd they re split into susets with non-overlpping ppers for the ske of cross-vlidtion experiments. We used numer of feture functions, including exct nd pproximte string mtch 2 on normlized nd unnormlized vlues for the following cittion fields: title, ooktitle, journl, uthors, venue, dte, editors, institution, nd the entire unsegmented cittion string. We lso clculted n unweighted cosine similrity etween tokens in the title nd uthor fields. Additionl fetures include whether or not the ppers hve the sme puliction type (e.g. journl or conference), s well s the numericl distnce etween fields such s yer nd volume. All rel vlues were inned nd converted into inry-vlued fetures. To evlute performnce, we compre the clusters output y our system with the true clustering using pirwise metrics. Pirwise precision is the frction of pirs in the sme cluster tht re coreferent; pirwise recll is the frction of coreferent ppers tht were plced in the sme cluster. Pirwise F1 is the hrmonic men of pirwise precision nd pirwise recll. Tles 2 nd 3 show the F1 performnce of two systems: joint is the system we hve dvocted in this pper, nd indep is the system which deduplictes records of different types independently. Note tht this system corresponds to model () in Figure 1, so dedupliction decisions re mde collectively for records of the sme type, s in McCllum nd Wellner [13]. Since the McCllum nd Wellner model hs een shown to consistently outperform the clssicl trnsitive closure model, we do not compre with the clssicl model here. Results re listed y the nme of ech test set; the remining sections re used for trining. Venue performnce improves considerly in the joint model, which is plusile considering the strong influence pper dedupliction hs on venue dedupliction. Becuse pper dedupliction often hs more evidence t its disposl thn does venue dedupliction, the joint model drmticlly enhnces venue recll, otining 5% solute recll oost in Citeseer, nd 9% oost in Cor dt. This is especilly noticele when pper dedupliction performnce is high: The hrd constrint requiring the venues of duplicte ppers to e merged often merges venues tht otherwise would hve seemed too dissimilr to merge on their own. Indeed, error nlysis confirms tht mny of the venue 1 mccllum/dt/cor-refs.tr.gz 2 We used the Secondstring pckge, found t 11

12 Pper Venue indep joint indep joint constrint reinforce fce reson Micro Avg Tle 2: Pirwise F1 dedupliction performnce on Citeseer dt. Pper Venue indep joint indep joint kil fhl utgo Micro Avg Tle 3: Pirwise F1 dedupliction performnce on Cor dt. dedupliction errors our model voids re those where venues re dissimilr in form, ut re relted to ppers tht re similr in form. More interestingly, noticele improvement in pper dedupliction is ttined y the collective model. Prt of this is due to the precision enhncement provided y the constrined clustering lgorithm. Workshop nd technicl report versions of journl or conference ppers with the sme title re correctly not merged when the venues re ccurtely identified. Also, error nlysis suggests tht ppers tht would not hve een otherwise merged were merged ecuse their venues were determined to e coreferent. It is worth noting tht mny of the errors mde y the joint model hve cuses similr to those tht on verge hve improved performnce. For exmple, if pper dedupliction ccurcy is poor, the reltionl clustering lgorithm cn result in mny venues eing merged tht would not hve een otherwise. Future work should investigte how to detect poor pper ccurcy nd djust ccordingly. 5.1 Sclility While the dtsets used in our experiments re of resonle size, we would ultimtely like to pply this model to lrge dtses. Here we riefly discuss performnce issues nd descrie methods to scle our model to rel-world dt. Becuse prmeter estimtion mximizes the product of locl node potentils, it is likely to e much fster thn glol pproximte trining method such s loopy elief propgtion. For the dt used in these experiments, trining time verged out 15 minutes on dul-processor, 3.06 GHz Xeon mchines 12

13 with 4 GB of RAM. Inference time rnged from 20 minutes to out n hour, depending on the size of the testing dt. To mke inference sclle, prcticl implementtion would mke use of cnopies [14]. This technique reduces the connectivity of the grph of records y defining chep similrity metric etween records (often using n inverted index or tf-idf). When constructing the record grph, edges re only dded etween those records tht hve similrity ove some threshold, where similrity is the output of the chep metric. In this wy, records tht re very unlikely to e duplictes re not considered y the model. This provides n efficient, ccurte wy of pruning the serch spce. Besides performnce issues, there is reson to elieve tht the dvntges of the model presented in this pper will e even more noticele in lrger dt sets, where there is more heterogeneity in field vlues, more interesting reltionl ptterns, nd lrger record clusters. In fct, the dt used in experiments presented here contin mny singleton clusters, which is not truly reflective of the lrge clusters of records found in rel-world dt. 6 Conclusions We hve introduced collective model for dedupliction of relted records of multiple types nd demonstrted empiriclly the dvntges it hs over methods tht do not ddress the interdependencies inherent in reltionl dt. Bsed on these results, two promising res of future reserch re (1) extending the model to dtses with more thn two types of records, nd (2) modeling the reltion vriles R. In ddition to pper nd venue dedupliction, uthor dedupliction is lso difficult prolem tht would likely enefit from this pproch, nd we re in the process of hrvesting dt to llow us to model uthor, venue, nd pper dedupliction jointly. In the pulictions domin, the connections etween uthor, venue, nd pper dedupliction ecome more interesting with lrger dtses, where communities nd reltions ecome more visile. In prticulr, exciting chllenges include uilding model to predict dvisor of reltions etween uthors, suggest possile venues for pper, identify fruitful uthor collortions, mtch recent grdutes with potentil reserch ls, nd discover the dynmics of reserch communities. The chllenge s usul will lie in developing model tht is complex enough to model these long-distnce reltions, ut is still trctle enough to perform on rel dt. We feel tht the model proposed here is productive step in tht long-term direction. 7 Acknowledgments This work ws supported in prt y the Center for Intelligent Informtion Retrievl, in prt y U.S. Government contrct #NBCH through sucon- 13

14 trct with BBNT Solutions LLC, in prt y The Centrl Intelligence Agency, the Ntionl Security Agency nd Ntionl Science Foundtion under NSF grnt #IIS , nd in prt y the Defense Advnced Reserch Projects Agency (DARPA), through the Deprtment of the Interior, NBC, Acquisition Services Division, under contrct numer NBCHD Any opinions, findings nd conclusions or recommendtions expressed in this mteril re the uthor(s) nd do not necessrily reflect those of the sponsor. References [1] R. Annthkrishn, S. Chudhuri, nd V. Gnti. Eliminting fuzzy duplictes in dt wrehouses. In In Proceedings of the 28th Interntionl Conference on Very Lrge Dtses (VLDB 2002), [2] Nikhil Bnsl, Avrim Blum, nd Shuchi Chwl. Correltion clustering. Mchine Lerining, 56:89 113, [3] Indrjit Bhttchry nd Lise Getoor. Dedupliction nd groupd detection using links. In 10th ACM SIGKDD Workshop on Link Anlysis nd Group Detection (LinkKDD-04), [4] Indrjit Bhttchry nd Lise Getoor. Itertive record linkge for clening nd integrtion. In Proceedings of the SIGMOD 2004 Workshop on Reserch Issues on Dt Mining nd Knowledge Discovery, June [5] Mikhil Bilenko nd Rymond J. Mooney. Adptive duplicte detection using lernle string similrity mesures. In 9th ACM SIGKDD Interntionl Conference on Knowledge Discovery nd Dt Mining (KDD-2003), pges 39 48, [6] Yuri Boykov, Olg Veksler, nd Rmin Zih. Fst pproximte energy minimiztion vi grph cuts. In IEEE trnsctions on Pttern Anlysis nd Mchine Intelligence (PAMI), 23(11): , [7] W. Cohen nd J. Richmn. Lerning to mtch nd cluster entity nmes. In ACM SIGIR 01 workshop on Mthemticl / Forml Methods in IR, [8] I. P. Fellegi nd A. B. Sunter. A theory for record linkge. Journl of the Americn Sttisticl Assocition, 64: , [9] John Lfferty, Andrew McCllum, nd Fernndo Pereir. Conditionl rndom fields: Proilistic models for segmenting nd leling sequence dt. In Proc. 18th Interntionl Conf. on Mchine Lerning, pges Morgn Kufmnn, Sn Frncisco, CA, [10] S. Lwrence, C. L. Giles, nd K. Bollker. Digitl lirries nd utonomous cittion indexing. IEEE Computer, 32:67 71,

15 [11] Bhskr Mrthi, Brin Milch, nd Sturt Russell. First-order proilistic models for informtion extrction. In Proceedings of the IJCAI-2003 Workshop on Lerning Sttisticl Models from Reltionl Dt, pges 71 78, Acpulco, Mexico, August [12] Andrew McCllum nd Chrles Sutton. Piecewise trining with prmeter independence digrms: Compring glolly- nd loclly-trined linerchin crfs. In NIPS 2004 Workshop on Lerning with Structured Outputs, [13] Andrew McCllum nd Ben Wellner. Conditionl models of identity uncertinty with ppliction to noun coreference. In Lwrence K. Sul, Yir Weiss, nd Léon Bottou, editors, Advnces in Neurl Informtion Processing Systems 17. MIT Press, Cmridge, MA, [14] Andrew K. McCllum, Kml Nigm, nd Lyle Ungr. Efficient clustering of high-dimensionl dt sets with ppliction to reference mtching. In Proceedings of the Sixth Interntionl Conference On Knowledge Discovery nd Dt Mining (KDD-2000), Boston, MA, [15] Brin Milch, Bhskr Mrthi, nd Sturt Russell. Blog: Reltionl modeling with unknown ojects. In ICML 2004 Workshop on Sttisticl Reltionl Lerning nd Its Connections to Other Fields, [16] Thoms Morton. Coreference for NLP pplictions. In ACL, [17] H. B. Newcome, J. M. Kennedy, S. J. Axford, nd A.P. Jmes. Automtic linkge of vitl records. Science, 130:954 9, [18] Vincent Ng nd Clire Crdie. Improving mchine lerning pproches to coreference resolution. In Proceedings of the 40th Annul Meeting of the Assocition for Computtionl Linguistics, [19] J. Nocedl. Updting qusi-newton mtrices with limited storge. Mthemtics of Computtion, 95: , [20] Prg nd Pedro Domingos. Multi-reltionl record linkge. In Proceedings of the KDD-2004 Workshop on Multi-Reltionl Dt Mining, pges 31 48, August [21] Jude Perl. Proilistic Resoning in Intelligent Systems: Networks of Plusile Inference. Morgn Kufmn, [22] Prdeep Rvikumr nd Willim W. Cohen. A hierrchicl grphicl model for record linkge. In UAI 2004, [23] Ben Tskr, Aeel Pieter, nd Dphne Koller. Discrimintive proilistic models for reltionl dt. In Uncertinty in Artificil Intelligence: Proceedings of the Eighteenth Conference (UAI-2002), pges , Sn Frncisco, CA, Morgn Kufmnn Pulishers. 15

16 [24] Ben Wellner, Andrew McCllum, Fuchun Peng, nd Michel Hy. An integrted, conditionl model of informtion extrction nd coreference with ppliction to cittion mtching. In Conference on Uncertinty in Artificil Intelligence (UAI), [25] Willim E. Winkler. Improved decision rules in the fellegi-sunter model of record linkge. Technicl report, Sttisticl Reserch Division, U.S. Census Bureu, Wshington, DC, [26] Willim E. Winkler. Methods for record linkge nd Byesin networks. Technicl report, Sttisticl Reserch Division, U.S. Census Bureu, Wshington, DC,