KD2R: a Key Discovery method for semantic Reference Reconciliation
Danai Symeonidou, Nathalie Pernelle and Fatiha Saïs
LRI (University Paris-Sud XI)
February 8th, 2013
Linked Open Data cloud (LOD)
- The LOD contains all the RDF sources on the Web and the links between them.
- SameAs is the most important type of link: it combines the information given in different data sources.
- The number of already existing links is very small.
- How can links be created automatically?
Reference Reconciliation Problem
Records drawn from Dataset1 and Dataset2:
- FirstName: Michael, LastName: Jackson, SSN: 011223456, Job: Singer
- FirstName: Michael, LastName: Jackson, SSN: 011223456, Job: Singer
- FirstName: Michael, LastName: Jackson, SSN: 444223456, Job: Teacher
The two Singer records with SSN 011223456 are linked by SameAs; the Teacher record with SSN 444223456 is a different person.
Reference Reconciliation Problem
How do we decide if two identifiers refer to the same real-world entity?

SOURCE1
ID  Name                   Located  incountry  TicketPrice
11  Madame Tussauds        London   UK         O
12  Royal Academy of Arts  London   UK         O

SOURCE2
ID  Name                   Located  incountry  TicketPrice
21  Tate Britain           London   England    Free
22  Royal Academy of Arts  London   England    Free

- Without keys: Sim(12, 22) = 0.5
- With Name declared as a key: Sim(12, 22) = 1 → SameAs
- Solution → use keys to reconcile data
Reference Reconciliation with or without key constraints
- No knowledge given about the properties: all the properties have the same importance.
- Knowledge given by an expert:
  - Specific expert rules [Arasu et al. 09, Low et al. 01, Volz et al. 09 (Silk)]
    Example: max(jaro(phone-number, phone-number), jaro-winkler(ssn, ssn)) > 0.88
  - Key constraints [Saïs, Pernelle and Rousset 09]
    Example: haskey( () (museumname, museumaddress) )
- Problem: when data sources contain numerous data and/or complex ontologies,
  - some keys are not obvious for the expert to find;
  - erroneous keys can be given by the expert.
- Aim: automatic discovery of a complete set of keys from RDF data.
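The expert rule quoted on this slide can be made concrete with a small sketch. The Jaro and Jaro-Winkler functions below follow the standard textbook definitions (prefix scale 0.1, prefix capped at 4 characters); the record layout, the function name `expert_rule` and the phone numbers are hypothetical, and only the SSNs come from the earlier slides.

```python
def jaro(s1: str, s2: str) -> float:
    """Standard Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    matched1 = [False] * len(s1)
    matched2 = [False] * len(s2)
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len(s1)):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len(s1) + matches / len(s2) + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    """Jaro similarity boosted by the length of the common prefix (max 4)."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * p * (1 - base)


def expert_rule(r1: dict, r2: dict) -> bool:
    """The rule from the slide: max(jaro(phone), jaro-winkler(ssn)) > 0.88."""
    return max(jaro(r1["phone-number"], r2["phone-number"]),
               jaro_winkler(r1["ssn"], r2["ssn"])) > 0.88


singer_a = {"phone-number": "5550001111", "ssn": "011223456"}  # phone is made up
singer_b = {"phone-number": "9992223333", "ssn": "011223456"}  # phone is made up
teacher = {"phone-number": "9992223333", "ssn": "444223456"}   # phone is made up
print(expert_rule(singer_a, singer_b), expert_rule(singer_a, teacher))
```

Such a rule has to be written and tuned by hand for each schema, which is the cost that automatic key discovery tries to avoid.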
Key discovery methods
- Supervised → learn keys using a set of reconciled data
- Unsupervised → no additional information is given
- Property-based → guided by the properties
  - Suchanek et al. 2011 (only single keys)
  - Atencia et al. 2012 (CWA)
- Instance-based → guided by the instances
  - Symeonidou et al. 2011 (multi keys, OWA)
Key definition
- RDF data conform to an OWL2 RL ontology.
- Key for a class expression: a combination of (inverse) properties which uniquely identifies an entity.
- HasKey( CE ( OPE_1 ... OPE_m ) ( DPE_1 ... DPE_n ) ):

\[
\forall x, y, z_1, \dots, z_m, w_1, \dots, w_n:\quad
\begin{aligned}
&\text{if } x \in (CE)^{C} \text{ and } \mathit{ISNAMED}_O(x) \text{ and } y \in (CE)^{C} \text{ and } \mathit{ISNAMED}_O(y) \text{ and}\\
&(x, z_i) \in (OPE_i)^{OP} \text{ and } (y, z_i) \in (OPE_i)^{OP} \text{ and } \mathit{ISNAMED}_O(z_i) \text{ for each } 1 \le i \le m \text{ and}\\
&(x, w_j) \in (DPE_j)^{DP} \text{ and } (y, w_j) \in (DPE_j)^{DP} \text{ for each } 1 \le j \le n\\
&\text{then } x = y
\end{aligned}
\]

- Example: if we consider haskey(city (Inverse(IsInCity)) ()) as a key and the dataset contains isincity(restaurant1, city1), isincity(restaurant1, city2), isincity(restaurant2, city2), then we infer that city1 = city2, because both cities share the value restaurant1 for Inverse(IsInCity).
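The city example can be checked mechanically. The sketch below assumes the key consists of the single inverse property Inverse(IsInCity), so two cities are inferred equal as soon as they share one restaurant; the function name `infer_same_as` is illustrative.

```python
from itertools import combinations

# isincity(restaurant, city) facts from the slide.
IS_IN_CITY = [
    ("restaurant1", "city1"),
    ("restaurant1", "city2"),
    ("restaurant2", "city2"),
]


def infer_same_as(is_in_city):
    """Apply the key (Inverse(IsInCity)) to the class of cities:
    two cities sharing a value of the inverse property are the same."""
    restaurants_of = {}
    for restaurant, city in is_in_city:
        restaurants_of.setdefault(city, set()).add(restaurant)
    same = set()
    for c1, c2 in combinations(sorted(restaurants_of), 2):
        if restaurants_of[c1] & restaurants_of[c2]:
            same.add((c1, c2))
    return same


print(infer_same_as(IS_IN_CITY))
```

The example also shows why an erroneous key is dangerous: if a restaurant chain really is present in two cities, this key merges the two cities.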
Key Discovery Problem in OWA
- A set of RDF data sources: each data source conforms to an OWL 2 ontology.
- Multivalued properties may exist.
- Open world assumption (incomplete data).

ID  name     firstname  hasfriend
i1  Atencia  Manuel     i2, i3
i2  Atencia  Madalina
i3  David    Jérôme     i2, i4
i4  Chein    Michel

How do we discover keys when we don't know whether:
- i1 =? i2 =? i3 =? i4
- hasfriend(i1, i4)? hasfriend(i2, i3)?
- firstname(i1, Elodie)?
Key Discovery Problem: our assumptions
- Unique Name Assumption (UNA): two distinct URIs refer to two different real-world entities.
  - In the LOD, we consider the data sources generated from relational databases or those built in such a way that the UNA is fulfilled (Yago).
  - i1 <> i2 <> i3 <> i4
- Two literals that are syntactically different are semantically different (e.g. Napoleon Bonaparte <> Napoleon).
- Heuristic 1 - Pessimistic:
  - Not instantiated property → all the values are possible. Example: hasfriend(i2, i3), hasfriend(i2, i4) are possible.
  - Instantiated property → only given values are considered. Example: not hasfriend(i1, i4).
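Key Discovery Problem: our assumptions
- A set of property expressions {pe_1, ..., pe_n} is a non key for the class c in a data source s_i if two distinct instances of c in s_i share a given value for every pe_j in the set.
  Example: {name} and {hasfriend} are non keys.
- A set of property expressions {pe_1, ..., pe_n} is a key for the class c in a data source s_i if, for every pair of distinct instances of c in s_i, at least one pe_j is instantiated for both instances and they share no value for it.
  Example: {firstname}, {name, firstname}, {firstname, hasfriend} are keys.
- {hasfriend, name} is neither a key nor a non key; it is called an undetermined key.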
Key Discovery Problem: our assumptions
- Heuristic 2 - Optimistic:
  - Not instantiated property → the value is not one of the already existing ones. Example: not hasfriend(i2, i3), not hasfriend(i2, i1), not hasfriend(i2, i4).
  - Instantiated property → only given values are considered. Example: not hasfriend(i1, i4).
- The definition of non keys is the same.
- A set of property expressions {pe_1, ..., pe_n} is a key for the class c in a data source s_i if, for every pair of distinct instances X, Y of c, there is a pe_j for which X and Y share no value and which is instantiated for X or for Y (∃Z pe_j(X, Z) or ∃W pe_j(Y, W)).
  Example: {firstname}, {name, firstname}, {firstname, hasfriend} are keys.
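The two heuristics can be compared on the four-instance table (name, firstname, hasfriend). The sketch below is one reading of the definitions, not the authors' code: a pair of instances is "distinguished" by a property when their value sets are known to be disjoint, and the treatment of a property missing on both sides under the optimistic heuristic (kept as "maybe") is an assumption of this sketch.

```python
from itertools import combinations

# Instances from the slide; an absent property means "not instantiated".
DATA = {
    "i1": {"name": {"Atencia"}, "firstname": {"Manuel"}, "hasfriend": {"i2", "i3"}},
    "i2": {"name": {"Atencia"}, "firstname": {"Madalina"}},
    "i3": {"name": {"David"}, "firstname": {"Jérôme"}, "hasfriend": {"i2", "i4"}},
    "i4": {"name": {"Chein"}, "firstname": {"Michel"}},
}


def compare(vx, vy, heuristic):
    """Return 'same', 'diff' or 'maybe' for one property of one pair."""
    if vx and vy:
        return "same" if vx & vy else "diff"
    if heuristic == "optimistic" and (vx or vy):
        return "diff"   # unknown values are assumed to be new ones
    return "maybe"      # unknown values could coincide


def classify(props, data=DATA, heuristic="pessimistic"):
    """Classify a property set as 'key', 'non key' or 'undetermined'."""
    result = "key"
    for x, y in combinations(sorted(data), 2):
        outcomes = {
            compare(data[x].get(p, set()), data[y].get(p, set()), heuristic)
            for p in props
        }
        if "diff" in outcomes:
            continue                 # this pair is told apart
        if outcomes == {"same"}:
            return "non key"         # a witnessed counterexample
        result = "undetermined"      # the pair might still coincide
    return result


for props in [{"name"}, {"firstname"}, {"hasfriend", "name"}]:
    print(sorted(props), classify(props))
```

Under the pessimistic heuristic this reproduces the slide examples: {name} and {hasfriend} are non keys, {firstname} is a key, and {hasfriend, name} is undetermined because i2 has no hasfriend value.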
KD2R approach
- Find all minimal keys that are valid w.r.t. the previous definition, in all the considered data sources.
- Scalability:
  - do not check all the combinations of properties;
  - partially scan the data.
- Find first the set of maximal non keys and undetermined keys (inspired from Gordian [Y. Sismanis et al. 2006]) → derive keys from this set.
- Unlike Gordian, KD2R:
  - is ontology based: the subsumption relation is exploited to inherit keys;
  - considers multi-valued properties and incomplete information.
KD2R approach
- Topological sort of the classes (subsumption).
- The keys are obtained by selecting the minimal keys of the Cartesian product (w.r.t. mappings) of the minimal key sets discovered in the sources S1 and S2.
- Example:
  K1 = {{name, firstname}, {hasfriend}}
  K2 = {{firstname}}
  K1-2 = {{name, firstname}, {hasfriend, firstname}}
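The merge of K1 and K2 can be sketched in a few lines. The sketch assumes the mappings between the two sources are already applied, so both key sets use the same property names; `merge_keys` and `minimal` are illustrative names.

```python
from itertools import product


def minimal(sets):
    """Keep only the sets that have no proper subset in the collection."""
    sets = set(sets)
    return {s for s in sets if not any(t < s for t in sets)}


def merge_keys(k1, k2):
    """Union every key of source 1 with every key of source 2,
    then keep the minimal results."""
    return minimal(a | b for a, b in product(k1, k2))


K1 = {frozenset({"name", "firstname"}), frozenset({"hasfriend"})}
K2 = {frozenset({"firstname"})}
print(merge_keys(K1, K2))
```

Each merged key contains a key of S1 and a key of S2, so it is valid in both sources.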
KD2R approach: Key Finder
- The set of maximal non keys and undetermined keys is computed on a prefix tree (a compact representation of the data of one class).
- Key derivation:
  - computation of the complement set of each non key and undetermined key;
  - computation of the Cartesian product of the complement sets;
  - selection of the minimal keys.
- Time complexity: quadratic in terms of the number of discovered keys.
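The three key-derivation steps can be sketched directly. The first example uses the five museum attributes and the UNKey {incountry, located, contains} that appears in the UNKeyFinder walkthrough; the second example with properties a to d is made up. `derive_keys` is an illustrative name.

```python
from itertools import product


def derive_keys(properties, unkeys):
    """Derive minimal keys from maximal non keys / undetermined keys.

    A key must contain at least one property outside every UNKey, so one
    property is picked from each complement and the minimal picks are kept.
    """
    complements = [frozenset(properties) - frozenset(u) for u in unkeys]
    candidates = {frozenset(choice) for choice in product(*complements)}
    return {k for k in candidates if not any(other < k for other in candidates)}


MUSEUM_PROPS = {"incountry", "located", "contains", "Name", "Address"}
print(derive_keys(MUSEUM_PROPS, [{"incountry", "located", "contains"}]))

# Hypothetical second example with two overlapping UNKeys.
print(derive_keys({"a", "b", "c", "d"}, [{"a", "b"}, {"b", "c"}]))
```

If a UNKey covers all the properties, its complement is empty, the product is empty, and no key is derived.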
Pessimistic: Prefix-tree Creation - Step1

ID  incountry  located  contains      museumname       museumaddress
M1  Greece     City1    -             Archaeological   44 Patission Street
M2  France     -        S1_p4, S1_p5                   19 rue Beaubourg
M3  France     City3    -             Musee d'Orsay    62, rue de Lille
M4  England    City4    -             Madame Tussauds  Marylebone Road

- Each level represents an attribute of a class.
- Each node describes instances that share the same father-cell value.
- Each cell contains a value and a list of identifiers (URI List).

Prefix tree after inserting the four instances (one level per attribute):
incountry: Greece {M1} | France {M2, M3} | England {M4}
located:   City1 {M1} | Null, City3 {M3} | City4 {M4}
contains:  Null {M1} | P4, P5 (under Null); Null {M3} (under City3) | Null {M4}
Name:      Archaeological {M1} | Musee d'Orsay {M3} | Madame Tussauds {M4}
Address:   44 Patission Street {M1} | rue Beaubourg, rue Beaubourg, rue de Lille {M3} | Marylebone Road {M4}
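Step 1 can be sketched as plain insertion into nested dictionaries. This is an illustration, not the KD2R implementation: each cell is a dict with a URI list and a child node, a non-instantiated property becomes a Null (`None`) cell, and a multi-valued property creates one branch per value. The slide text does not preserve a museum name for M2, so the sketch treats it as not instantiated (an assumption).

```python
ATTRS = ["incountry", "located", "contains", "museumname", "museumaddress"]

# Values copied from the slide table; absent entries are not instantiated.
MUSEUMS = {
    "M1": {"incountry": ["Greece"], "located": ["City1"],
           "museumname": ["Archaeological"], "museumaddress": ["44 Patission Street"]},
    "M2": {"incountry": ["France"], "contains": ["S1_p4", "S1_p5"],
           "museumaddress": ["19 rue Beaubourg"]},  # name unknown: assumption
    "M3": {"incountry": ["France"], "located": ["City3"],
           "museumname": ["Musee d'Orsay"], "museumaddress": ["62, rue de Lille"]},
    "M4": {"incountry": ["England"], "located": ["City4"],
           "museumname": ["Madame Tussauds"], "museumaddress": ["Marylebone Road"]},
}


def insert(node, uri, record, attrs, depth=0):
    """Add one instance to the tree, one level per attribute."""
    if depth == len(attrs):
        return
    values = record.get(attrs[depth]) or [None]   # None stands for Null
    for value in values:
        cell = node.setdefault(value, {"uris": set(), "child": {}})
        cell["uris"].add(uri)
        insert(cell["child"], uri, record, attrs, depth + 1)


def build(records, attrs):
    root = {}
    for uri, record in records.items():
        insert(root, uri, record, attrs)
    return root


tree = build(MUSEUMS, ATTRS)
print(sorted(tree["France"]["uris"]))                      # instances sharing France
print(sorted(tree["France"]["child"], key=str))            # located cells under France
```

Instances that share a prefix of values share the corresponding cells, which is what makes the tree a compact representation of one class.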
Pessimistic: Prefix-tree Creation - Step2
Starting from the tree built in Step 1:
- Merging the cells of a node
- Merging nodes

Final prefix tree (one level per attribute):
incountry: Greece {M1} | France {M2, M3} | England {M4}
located:   City1 {M1} | City3 {M2, M3} | City4 {M4}
contains:  Null {M1} | P4 {M2, M3}, P5 {M2, M3} | Null {M4}
Name:      Archaeological {M1} | Musee d'Orsay {M3} (under P4), Musee d'Orsay {M3} (under P5) | Madame Tussauds {M4}
Address:   44 Patission Street {M1} | rue Beaubourg, rue de Lille {M3} (under P4); rue Beaubourg, rue de Lille {M3} (under P5) | Marylebone Road {M4}
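One way to read Step 2 is that, under the pessimistic heuristic, a Null cell may stand for any sibling value, so it is folded into every sibling cell and the corresponding child nodes are merged. The sketch below implements that reading on the first three attributes only (incountry, located, contains); it is an assumption about the procedure, not the authors' code, and the function names are illustrative.

```python
ATTRS = ["incountry", "located", "contains"]
MUSEUMS = {
    "M1": {"incountry": ["Greece"], "located": ["City1"]},
    "M2": {"incountry": ["France"], "contains": ["S1_p4", "S1_p5"]},
    "M3": {"incountry": ["France"], "located": ["City3"]},
    "M4": {"incountry": ["England"], "located": ["City4"]},
}


def insert(node, uri, record, attrs, depth=0):
    """Step 1: one level per attribute, None for a missing value."""
    if depth == len(attrs):
        return
    for value in record.get(attrs[depth]) or [None]:
        cell = node.setdefault(value, {"uris": set(), "child": {}})
        cell["uris"].add(uri)
        insert(cell["child"], uri, record, attrs, depth + 1)


def merge_nodes(n1, n2):
    """Union two nodes value by value, merging URI lists and subtrees."""
    empty = {"uris": set(), "child": {}}
    merged = {}
    for value in set(n1) | set(n2):
        c1, c2 = n1.get(value, empty), n2.get(value, empty)
        merged[value] = {"uris": c1["uris"] | c2["uris"],
                         "child": merge_nodes(c1["child"], c2["child"])}
    return merged


def merge_nulls(node):
    """Step 2: fold the Null cell of a node into each sibling, top-down."""
    if None in node and len(node) > 1:
        null_cell = node[None]
        node = {v: {"uris": c["uris"] | null_cell["uris"],
                    "child": merge_nodes(c["child"], null_cell["child"])}
                for v, c in node.items() if v is not None}
    return {v: {"uris": set(c["uris"]), "child": merge_nulls(c["child"])}
            for v, c in node.items()}


root = {}
for uri, record in MUSEUMS.items():
    insert(root, uri, record, ATTRS)
final = merge_nulls(root)
city3 = final["France"]["child"]["City3"]
print(sorted(city3["uris"]), sorted(city3["child"]))
```

On these three levels the result agrees with the final prefix tree of the slide: City3 {M2, M3}, then P4 {M2, M3} and P5 {M2, M3}, while a Null cell without siblings (Greece, England) is left unchanged.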
UNKeyFinder
Pipeline: RDF data (Wax(S1_m2), museumname(s1_m2, Wax), ...) → Prefix tree creation → UNKey Finder → Maximal undetermined keys and non keys
- Input: one dataset, one class, a set of known keys.
- Output: the set of maximal non keys and undetermined keys.
- Examination of each possible subset of attributes.
- Recursive method; the traversal is top-down and left-first.
- When URI List > 1: two or more instances share the same value for a specific subset of attributes → the subset of attributes belongs to a UNKey.
- Different prunings:
  - Key monotonicity: detection of paths describing one entity; use existing inherited keys to avoid exploring sub-trees in the prefix tree.
  - Non key anti-monotonicity: use the already computed non keys to avoid exploring sub-trees in the prefix tree.
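The effect of the anti-monotonicity pruning can be illustrated without a prefix tree. The sketch below swaps in a plain enumeration of attribute subsets, from the largest to the smallest, and skips every subset of an already found UNKey; it is not the recursive tree traversal of UNKeyFinder. It uses the name / firstname / hasfriend table and the pessimistic heuristic, and all function names are illustrative.

```python
from itertools import combinations

DATA = {
    "i1": {"name": {"Atencia"}, "firstname": {"Manuel"}, "hasfriend": {"i2", "i3"}},
    "i2": {"name": {"Atencia"}, "firstname": {"Madalina"}},
    "i3": {"name": {"David"}, "firstname": {"Jérôme"}, "hasfriend": {"i2", "i4"}},
    "i4": {"name": {"Chein"}, "firstname": {"Michel"}},
}


def is_unkey(props, data):
    """Pessimistic test: some pair is not told apart by any property,
    i.e. no property is instantiated on both sides with disjoint values."""
    for x, y in combinations(sorted(data), 2):
        told_apart = any(
            data[x].get(p) and data[y].get(p) and not (data[x][p] & data[y][p])
            for p in props
        )
        if not told_apart:
            return True
    return False


def maximal_unkeys(properties, data):
    """Enumerate subsets by decreasing size; subsets of a known UNKey are
    UNKeys too (anti-monotonicity), so they are skipped without scanning."""
    found = []
    checked = 0
    for size in range(len(properties), 0, -1):
        for subset in combinations(sorted(properties), size):
            candidate = frozenset(subset)
            if any(candidate <= known for known in found):
                continue                      # pruned
            checked += 1
            if is_unkey(candidate, data):
                found.append(candidate)
    return found, checked


unkeys, checked = maximal_unkeys({"name", "firstname", "hasfriend"}, DATA)
print(unkeys, checked)
```

Of the seven non-empty subsets, two ({hasfriend} and {name}) are never tested against the data, because they are contained in the maximal UNKey {hasfriend, name}.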
UNKeyFinder Example
Attribute order: incountry, located, contains, Name, Address

Prefix tree explored (one level per attribute):
incountry: Greece {M1} | France {M2, M3} | England {M4}
located:   City1 {M1} | City3 {M2, M3} | City4 {M4}
contains:  Null {M1} | P4 {M2, M3}, P5 {M2, M3} | Null {M4}
Name:      Archaeological {M1} | Musee d'Orsay {M3} (under P4), Musee d'Orsay {M3} (under P5) | Madame Tussauds {M4}
Address:   44 Patission Street {M1} | rue Beaubourg, rue de Lille {M3} (under P4); rue Beaubourg, rue de Lille {M3} (under P5) | Marylebone Road {M4}

Steps:
- We call UNKeyFinder for the highlighted node. Since the URI List is 1, we stop: pruning step (key monotonicity).
- We call UNKeyFinder for the next highlighted nodes in turn.
- In the next step we follow the left child of the highlighted node.
- Cell with URI List = 1: pruning step (1). Cell Musee d'Orsay with URI List = 1: pruning step (1).
- Now we have to merge the children of the node and call UNKeyFinder for the merged node.
- Since there is a cell with URI List > 1, the curUNKey is a UNKey: incountry, located, contains.
Experiments: OAEI 10 datasets

Dataset              RDF file         #instances
Restaurants Dataset  Restaurant1.rdf  339
                     Restaurant2.rdf  1390
Person Dataset       Person11.rdf     1000
                     Person12.rdf     1000
                     Person21.rdf     1200

Experiments executed to compare:
- KD2R keys
- Expert keys

Dataset                Class       Property set
Restaurants (2 files)  Restaurant  name, phonenumber, hascategory, hasaddress
                       Address     street, city, Inverse(hasAddress)
Person (3 files)       Person      givenname, state, surname, dateofbirth, socsecurityid, phonenumber, age, hasaddress
                       Address     street, housenumber, postcode, isinsuburb
Person Dataset
- The Person dataset consists of 2000 instances of the classes Person and Address.

Restaurant Dataset
- The Restaurant dataset describes 1729 instances (classes Restaurant and Address).

ChefMoz Dataset
- 32586 instances (class Restaurant).
- 1575 instances of the class Restaurant.

DBpedia Dataset
- DBpedia Person → 6 discovered keys (763644 instances, 5639680 RDF triples)
- Natural Places → 21 discovered keys (49887 instances, 1604347 RDF triples)
- Subclasses of Natural Places:
  - Lake → 6 discovered keys
  - BodyOfWater → 17 discovered keys
Conclusion
- An approach that discovers composite keys in RDF datasets:
  - different ontologies (aligned);
  - Unique Name Assumption.
- Experiments:
  - discovered keys improve the data linking;
  - KD2R is scalable thanks to the pruning techniques (e.g. DBpedia Natural Places: 5% of the data explored).
Future work
- DAVI approach
- Keys with N exceptions: keys for which N instances violate the definition of the key
- Conditional keys
Questions?
Thank you!