How to port a collection to the Semantic Web
Presenter: Borys Omelayenko
Contributors: A. Tordai, G. Schreiber

Datasets
- Content
  - Collection: metadata describing works
  - Local thesauri
    - terms, such as materials, styles, techniques
    - geographical places, such as villages
    - directory of people involved
- Representation
  - independent solutions
  - a variety of databases (via SQL), XML dumps, text files
  - different schemas and values

Process model

                                Thesaurus    Metadata
  Convert schema to a standard  1            2
  Align data values to terms    4            3

1. Convert thesaurus schema to SKOS
- SKOS is a W3C model for expressing the structure of thesauri and vocabularies
- Preferred labels, alternative labels, broader and narrower relations

  <skos:Concept rdf:about="Nederland">
    <skos:prefLabel xml:lang="nl">Nederland</skos:prefLabel>
    <skos:prefLabel xml:lang="en">Netherlands</skos:prefLabel>
    <skos:broader rdf:resource="Europe"/>
  </skos:Concept>

- Pretty straightforward: just a few concepts involved

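The conversion in step 1 can be sketched in a few lines of Python. This is only an illustration and not AnnoCultor code: the input record layout (`id`, `labels`, `broader`) and the function names are assumptions made for the example. It uses the standard `xml.etree.ElementTree` module and, like the slide, relative resource names.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
SKOS = "http://www.w3.org/2004/02/skos/core#"
XML = "http://www.w3.org/XML/1998/namespace"

# Make the serializer emit the usual rdf: and skos: prefixes.
ET.register_namespace("rdf", RDF)
ET.register_namespace("skos", SKOS)


def term_to_skos(term):
    """Turn one thesaurus record (hypothetical dict layout) into a skos:Concept."""
    concept = ET.Element(f"{{{SKOS}}}Concept", {f"{{{RDF}}}about": term["id"]})
    # One preferred label per language.
    for lang, label in term["labels"].items():
        pref = ET.SubElement(concept, f"{{{SKOS}}}prefLabel", {f"{{{XML}}}lang": lang})
        pref.text = label
    # Links to broader terms.
    for parent in term.get("broader", []):
        ET.SubElement(concept, f"{{{SKOS}}}broader", {f"{{{RDF}}}resource": parent})
    return concept


def thesaurus_to_rdf(terms):
    """Wrap all concepts in one rdf:RDF element and return it as a string."""
    root = ET.Element(f"{{{RDF}}}RDF")
    for term in terms:
        root.append(term_to_skos(term))
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(thesaurus_to_rdf([
        {"id": "Nederland",
         "labels": {"nl": "Nederland", "en": "Netherlands"},
         "broader": ["Europe"]},
    ]))
```

Alternative labels and narrower relations would follow the same pattern with `skos:altLabel` and `skos:narrower`.
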
2. Convert metadata schema to VRA
- VRA is a de facto standard for describing visual resources
- It is a specialization of Dublin Core for visual resources

  <vra:work rdf:about="SK-C-5">
    <vra:creator rdf:resource="Rembrandt_van_Rijn"/>
    <vra:title xml:lang="en">The company of Captain ...</vra:title>
    <vra:title xml:lang="nl">Het korporaalschap ...</vra:title>
    <vra:date>1642</vra:date>
  </vra:work>

- Conversion may be complicated, as museums tend to use their own data models (database schemas) to describe works

Technology: AnnoCultor
- Java-based platform
  - infrastructure
  - basic conversion rules
  - open to custom rules
  - open to other systems, such as GATE
  - GPL
- A converter is a Java program built according to a template that invokes rules

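The idea of a converter that invokes a list of rules can be illustrated with a minimal sketch. It is written in Python for brevity and does not reproduce the AnnoCultor API: the `Rule` class, the `convert` function and the source field names (`inv`, `maker`, `jaar`) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Triple = Tuple[str, str, str]


def clean(value: str) -> str:
    """Default transformation: trim surrounding whitespace."""
    return value.strip()


@dataclass
class Rule:
    """Copy one source field to one target property (hypothetical rule type)."""
    source_field: str
    target_property: str
    transform: Callable[[str], str] = clean

    def apply(self, subject: str, record: Dict[str, str]) -> List[Triple]:
        value = record.get(self.source_field)
        if value is None or not value.strip():
            return []  # nothing to convert for this record
        return [(subject, self.target_property, self.transform(value))]


def convert(records: List[Dict[str, str]], id_field: str, rules: List[Rule]) -> List[Triple]:
    """Run every rule over every record and collect the resulting triples."""
    triples: List[Triple] = []
    for record in records:
        subject = record[id_field]
        for rule in rules:
            triples.extend(rule.apply(subject, record))
    return triples


if __name__ == "__main__":
    rules = [Rule("maker", "vra:creator"), Rule("jaar", "vra:date")]
    records = [{"inv": "SK-C-5", "maker": " Rembrandt_van_Rijn ", "jaar": "1642"}]
    for triple in convert(records, "inv", rules):
        print(triple)
```

In such a design a custom rule is just another object with an `apply` method, which is one way to keep the converter open to extension.
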
Complexity of schema conversion
(counts given as collection + thesaurus + directory)

  Museum         Rules           Types of rules
  Rijksmuseum    56 + 13 + 17    11 + 6 + 7
  Tropenmuseum   19 + 6 + 4      8 + 6 + 4
  Volkenkunde    22 + 19 + NA    9 + 8 + NA
  Bibliopolis    34 + 15 + NA    11 + 5 + NA

3. Metadata: value alignment
Look at each metadata value (word) and find a corresponding vocabulary term:
  1. Concept from the local thesaurus
  2. Concept from another known thesaurus
  3. Concept from an implicit thesaurus
  4. Value that should become part of some thesaurus in the future
  5. Typed literals
  6. Just literals

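The six outcomes above can be read as a cascade that is tried in order. The sketch below is one possible reading, not the method used in the project: the thesaurus indexes are plain dictionaries, the identifiers are made up, the `controlled_field` flag is an assumption about how outcome 4 is told apart from a literal, and a four-digit year stands in for typed literals in general.

```python
import re


def normalize(value):
    """Collapse whitespace and lowercase, so lookups are less brittle."""
    return re.sub(r"\s+", " ", value).strip().lower()


def align_value(value, thesauri, controlled_field=False):
    """Return (kind, result) for one metadata value.

    thesauri: ordered list of (kind, index) pairs, e.g. local thesaurus first,
              then other known thesauri, then implicit ones.
    controlled_field: True if the field is expected to hold thesaurus terms.
    """
    key = normalize(value)
    # Outcomes 1-3: the first thesaurus that knows the value wins.
    for kind, index in thesauri:
        if key in index:
            return (kind, index[key])
    # Outcome 4: an unknown value in a controlled field is a candidate term.
    if controlled_field:
        return ("new-term", value.strip())
    # Outcome 5: a value with a recognizable type (here only a 4-digit year).
    if re.fullmatch(r"\d{4}", key):
        return ("typed-literal", int(key))
    # Outcome 6: leave everything else as a plain literal.
    return ("literal", value.strip())


if __name__ == "__main__":
    thesauri = [
        ("local", {"olieverf": "local:olieverf"}),    # made-up identifiers
        ("other", {"etretat": "places:etretat"}),
    ]
    for v in [" Olieverf ", "Etretat", "1642", "Zeegezicht aan de kust"]:
        print(v, "->", align_value(v, thesauri))
```
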
3. Value alignment: Rijksmuseum
29,000 (online) records; 43,000 terms in the local thesaurus
  1. Concept from local thesaurus: 150,000
  2. Concept from other thesaurus:
     1. concepts: 2,000
     2. places in descriptions: 7,800
  3. Concept from implicit thesaurus: NA
  4. New terms: 8,600 2,000
  5. Interpret typed literals: NA
  6. Leave as literal: NA

Usefulness: Etretat
- Title: Zeegezicht aan de kust (Seascape at the coast)
- Description: Gezicht op het strand en de krijtrotsen bij Etretat in Normandië (View of the beach and the chalk cliffs near Etretat in Normandy)

4. Thesaurus alignment
- Local thesauri overlap with standard ones
- Sometimes thesauri explicitly borrow parts from others
- Focus on correct matches
  - each mapping mistake is propagated with every use of the term
  - manual verification is an option

Alignment technology
- Shared information extraction technique to find terms in thesauri/collection entries
  - based on advanced tailored string matching plus term context, e.g. Rembrandt in 1920
  - open to other methods, e.g. NLP
- Efficient human interaction
  - show the top unmapped terms, e.g. foto
- Successful partly due to structured text and context
  - fails when different words are used

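Two of the ideas above lend themselves to a small sketch: using context to accept or reject a string match, and ranking unmapped terms for a human reviewer. The reading of the "Rembrandt in 1920" example as "a date in the record can rule out a name match" is an interpretation, and the directory layout, identifiers and function names below are hypothetical.

```python
from collections import Counter


def match_person(name, year, directory):
    """Return ids of directory entries with this name who were alive in `year`.

    The year is the context: a matching name alone is not enough.
    """
    key = name.strip().lower()
    return [
        entry["id"]
        for entry in directory
        if entry["name"].lower() == key and entry["born"] <= year <= entry["died"]
    ]


def top_unmapped(values, mapping, n=10):
    """Count values (case-insensitively) and return the n most frequent unmapped ones."""
    counts = Counter(v.strip().lower() for v in values)
    unmapped = Counter({term: c for term, c in counts.items() if term not in mapping})
    return unmapped.most_common(n)


if __name__ == "__main__":
    directory = [
        {"id": "people:rembrandt", "name": "Rembrandt", "born": 1606, "died": 1669},
    ]
    print(match_person("Rembrandt", 1642, directory))  # name and date agree
    print(match_person("Rembrandt", 1920, directory))  # date rules the match out
    print(top_unmapped(["foto", "Foto", "doek", "foto"], {"doek": "local:doek"}, n=1))
```

Exact string comparison is used here only to keep the example short; it also shows the failure mode named on the slide, since a record that uses a different word for the same thing will not match.
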
Success of thesauri alignment (mapped / total, in thousands)
Target thesauri: AAT, TGN, ULAN
  Rijksmuseum: 6 / 43, 13 / 56
  Ethnographic: 4 / 5, 3 / 5.5
  RKD, Bibliopolis: 0.4 / 1.1, 41 / 300, 0.25 / 1

Conclusions
- Methodology for porting collections
  - convert thesauri to SKOS
  - convert collections to DC/VRA
  - align thesauri
  - search for terms in collections
- Open and flexible technology
  - Java-based and open for extensions
- Cost/success stories
  - a collection requires 50-70 rules or 1-2 weeks
  - alignment can be costly
  - can align up to 80% of records automatically
http://annocultor.sourceforge.net