from the Natural History Domain Computational Linguistics Saarland University 11 October 2007
Background
  The MITCH project: Mining for Information in Texts from the Cultural Heritage
    joint research project between Tilburg University and Naturalis (Dutch National Museum of Natural History)
    text mining and information extraction for natural history data
    part of the CATCH programme (10 projects, funded by NWO)
Naturalis
  more than 10 million specimens:
    5,250,000 insects
    2,290,000 invertebrates
    1,000,000 vertebrates
    1,160,000 fossils
    440,000 stones and minerals
  150,000 species
  10% of the Earth's biodiversity
Data and Meta-Data
Data Sources and Tasks
  Two main data sources:
    (handwritten) fieldbooks (digitised and externally transcribed)
    specimen databases (manually created by curators, incomplete)
  Tasks:
    converting transcribed fieldbooks into structured records
    data cleaning for specimen databases (error detection, data completion)
Digitisation of Fieldbooks
Transcription of Fieldbooks
  all fieldbooks relating to the Reptiles and Amphibians Collection
  15,000 handwritten pages
  manually transcribed by typists at Combiwerk
  simple guidelines on how to deal with:
    non-ASCII characters
    text written in the margins
    illegible passages
    etc.
  transcriptions completed in around 8 months
Specimen Databases
  manually compiled from fieldbooks
  designed by biologists, not by database experts
  maintained by several people
  rows in the database correspond to fieldbook entries (usually one specimen per row)
  columns give information about the specimen and the circumstances of its collection (when, where, by whom etc.)
  columns come in a variety of formats:
    numbers (e.g., collection date, registration number)
    short text (e.g., collector, genus)
    free text (e.g., biotope, place, remarks)
Example Columns
  place:
    10 km. N. of Lucie Base Bivuac near De Kock Mountain
    weg van Lozoya naar Navarradonda
  location:
    07.15 h. on small tree in deciduous tropical forest (now in full leaf), 150 cm. above ground, Dewlap orange with red around rim.
    kwakend op grashalm in poel lang weg
  biotope:
    under stone on moist, calcareous loam
    275 m, forest floor among leaf litter, near Arroyo
  special remarks:
    aangereden (nog niet dood)
    according to information from R. Heyer, Smithsonian, Washington, this is likely to be L. knudseni, considering the short dorso-lateral folds and chest spines (e-mail to J. W. Arntzen January 2004)
Task 1: Converting Fieldbooks into Structured Records
Structure of Fieldbook Entries
  Segment types: Number, Genus, Species, Biotope, Collection Time, Reg. Num.
  Example entries:
    1 ex. Leptodactylus wagneri At base of tree on small island, primary forest, 20.45-22.00 u. RMNH 23865
    Lithodytes lineatus, Brownsberg, aan voet, onder stuk rot hout, 13.07.1968, 8.45 u., RMNH 26076 Dorsolateraal strepen heldergeel, tekening op dijen vuurrood, veel feller als bij P. femoralis.
    Gonyocephalus auritus Meyer, 3 ex. (1 juv.), Misool. Hoedt 1867. RMNH 17656
    Eleutherodactylus zeuctotylus 1 [vrouw] Lelygebergte, 4 km N.O. van airstrip, distr. Marowijne, Suriname, 19-VIII-1975, onder stuk hout, 610m, l [plus] d M. S. Hoogmoed.
Converting Fieldbook Entries into Structured Records
  Aim:
    make the inherent structure of an entry explicit (i.e., find segments conveying different types of information)
  Motivation:
    enable more sophisticated search
    raw data only allows keyword search
    enriched data allows querying of specific types of information
Example
  Aim: find all specimens of Phyllobates femoralis
  Query: fieldbook entry contains the string "Phyllobates femoralis"
  Result:
    Phyllobates femoralis, post Conini, Coeroenirivier, bosgrond, 25-IV-1968, 8:30-13:30 u. RMNH 26127-26129
    Lithodytes lineatus, Brownsberg, aan voet, onder stuk rot hout, 13.07.1968, 8.45 u., RMNH 26076 Dorsolateraal strepen heldergeel, tekening op dijen vuurrood, veel feller als bij Phyllobates femoralis.
  The second hit is a Lithodytes lineatus specimen whose remark merely mentions Phyllobates femoralis, so plain keyword search returns a false positive.
Modelled as Sequence Labelling
  Raw entry:
    1 ex. Leptodactylus wagneri At base of tree on small island, primary forest, 20.45-22.00 u. RMNH 23865
  Labelled tokens (token/label):
    1/num ex./num Leptodactylus/genus wagneri/species At/bio base/bio of/bio tree/bio on/bio small/bio island/bio ,/bio primary/bio forest/bio ,/o 20/time ./time 45/time -/time 22/time ./time 00/time u/time ./time RMNH/reg num 23865/reg num
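A minimal sketch of how a labelled token sequence like the one above could be turned back into a structured record. It assumes whitespace-separated `token/label` items and writes the registration-number label as `reg_num` (an adaptation for easy splitting, not the notation used on the slide); the function names are hypothetical.

```python
from itertools import groupby

# Token/label pairs from the labelled example. "o" marks a token outside any segment;
# "reg_num" is written with an underscore here so that items can be split on spaces.
TAGGED = (
    "1/num ex./num Leptodactylus/genus wagneri/species At/bio base/bio of/bio "
    "tree/bio on/bio small/bio island/bio ,/bio primary/bio forest/bio ,/o "
    "20/time ./time 45/time -/time 22/time ./time 00/time u/time ./time "
    "RMNH/reg_num 23865/reg_num"
)

def parse(tagged_text):
    """Split 'token/label' items into (token, label) pairs."""
    return [tuple(item.rsplit("/", 1)) for item in tagged_text.split()]

def segments(pairs):
    """Merge runs of identically labelled tokens into (label, tokens) segments."""
    result = []
    for label, run in groupby(pairs, key=lambda pair: pair[1]):
        tokens = [token for token, _ in run]
        if label != "o":  # tokens outside any segment are dropped
            result.append((label, tokens))
    return result

for label, tokens in segments(parse(TAGGED)):
    print(f"{label:8s} {' '.join(tokens)}")
```

Because runs are merged purely by label, two adjacent segments of the same type would be fused; a begin/inside encoding would be needed to keep them apart.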
Supervised Machine Learning
  ... could use a supervised learner: Hidden Markov Models, Conditional Random Fields etc.
  However:
    requires manual annotation of data (by domain experts)
    annotation needs to be re-done for each new domain (e.g., archaeological field reports)
    ... or even sub-domain (reptiles vs. crustaceans)
Bootstrapping from Existing Resources (joint work with Sander Canisius)
  Specimen databases are readily available:
    database entry = fieldbook entry
    database column labels = fieldbook segment labels
  Caveat: databases are only derived from fieldbooks
    some information in fieldbooks is not in the database and vice versa
    re-writings, systematic differences (format of dates, cue words such as RMNH for registration number, etc.)
    segment sequence probabilities are lost in databases
Converting Fieldbooks into Structured Records
  Three approaches:
    database look-up
    supervised ML trained on data automatically created from the database
    HMMs plus language modelling trained on the database
Database Look-Up
  (a) unigram look-up: assign each token its most frequent column label in the database
  (b) trigram look-up: assign each token the most frequent column label of the trigram centred on it, backing off to bi- and unigrams
  (c) trigram look-up plus voting: assign labels to trigrams in a sliding window (each token receives 3 labels) and vote over them
  (d) exact match: check fieldbook entries for substrings which are exact matches of database cells
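Variants (a) and (b) can be sketched as follows. This is an illustration only: the toy `DATABASE`, the function names, and the exact back-off order (centred trigram, left bigram, right bigram, unigram) are assumptions, and voting (c) and exact match (d) are not shown.

```python
from collections import Counter, defaultdict

# Toy stand-in for the specimen database: column label -> cell contents (hypothetical).
DATABASE = {
    "genus": ["Leptodactylus", "Lithodytes"],
    "species": ["wagneri", "lineatus"],
    "biotope": ["at base of tree", "primary forest", "under stone on forest floor"],
    "place": ["Brownsberg", "Lelygebergte"],
}

def build_index(database, max_n=3):
    """For every n-gram (n <= max_n) in the database, count the columns it occurs in."""
    index = defaultdict(Counter)
    for column, cells in database.items():
        for cell in cells:
            tokens = cell.lower().split()
            for n in range(1, max_n + 1):
                for i in range(len(tokens) - n + 1):
                    index[tuple(tokens[i:i + n])][column] += 1
    return index

def lookup(index, ngram):
    """Most frequent column label of an n-gram, or None if it is not in the database."""
    counts = index.get(tuple(ngram))
    return counts.most_common(1)[0][0] if counts else None

def label_unigram(index, tokens):
    """(a) Each token gets its most frequent column label; unknown tokens get 'o'."""
    return [lookup(index, [token]) or "o" for token in tokens]

def label_trigram(index, tokens):
    """(b) Use the trigram centred on each token, backing off to bigrams and unigrams."""
    labels = []
    for i in range(len(tokens)):
        candidates = []
        if 0 < i < len(tokens) - 1:
            candidates.append(tokens[i - 1:i + 2])  # trigram centred on token i
        if i > 0:
            candidates.append(tokens[i - 1:i + 1])  # left bigram
        if i < len(tokens) - 1:
            candidates.append(tokens[i:i + 2])      # right bigram
        candidates.append([tokens[i]])              # unigram
        found = (lookup(index, candidate) for candidate in candidates)
        labels.append(next((label for label in found if label), "o"))
    return labels

INDEX = build_index(DATABASE)
ENTRY = "Leptodactylus wagneri at base of tree Brownsberg".lower().split()
print(list(zip(ENTRY, label_trigram(INDEX, ENTRY))))
```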
Training on Automatically Created Data
  Training data: concatenation of (the contents of) database fields
    (a) with uniform probabilities (random)
    (b) with probabilities taken from 10 manually labelled fieldbook entries (biased)
  Machine learning set-up:
    memory-based learner (TiMBL)
    training data: 18,000 database entries
    test data: 150 fieldbook entries
    107 features:
      sliding window of 5 tokens
      typographic features
      tfidf similarities between n-grams and database column labels (in a window of 3 tokens around the focus token)
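The "random" variant of training-data creation might look roughly like this: the non-empty fields of a database record are concatenated in a uniformly random order, and every token inherits its column name as label. The record contents and the function name are hypothetical; the "biased" variant, which would draw field orders from a handful of labelled entries, is not shown.

```python
import random

def synthesize_entry(record, rng):
    """Concatenate the non-empty fields of one record in random order.

    Returns parallel lists: the tokens of the pseudo fieldbook entry and,
    for each token, the database column it came from.
    """
    columns = [column for column, value in record.items() if value]
    rng.shuffle(columns)  # uniform over all field orders
    tokens, labels = [], []
    for column in columns:
        for token in record[column].split():
            tokens.append(token)
            labels.append(column)
    return tokens, labels

# Hypothetical database record; the empty "remarks" field is skipped.
RECORD = {
    "genus": "Lithodytes",
    "species": "lineatus",
    "place": "Brownsberg",
    "biotope": "onder stuk rot hout",
    "remarks": "",
}

tokens, labels = synthesize_entry(RECORD, random.Random(0))
print(list(zip(tokens, labels)))
```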
HMM plus Language Modelling
  Segmentation:
    look for bigrams which are unlikely within a field
    language modelling (based on the database) plus Viterbi to find the most likely segmentation
  Labelling:
    HMM applied to segmented data
    initial and transition probabilities estimated in unsupervised fashion (Baum-Welch algorithm)
    emission probabilities based on a language model of the database
    ... plus a few domain-independent rules to deal with systematic differences between databases and fieldbooks
Results
                Token    Segment
                Acc.     Prec.    Rec.     F β=1
  ExactB        16.0     25.7     23.1     24.3
  UniB          27.0      8.9     22.8     12.8
  TriB          43.8     12.9     24.8     16.9
  TriB+Vote     45.1     14.9     27.8     19.4
  MBL rand.     44.6      7.1     19.2     10.4
  MBL bias      53.4     12.1     32.0     17.6
  HMM           56.9     62.7     58.1     60.3
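For reference, the F β=1 column is consistent with the usual harmonic mean of segment-level precision P and recall R; the HMM row serves as a worked check.

```latex
F_{\beta=1} = \frac{2 \, P \, R}{P + R},
\qquad
\text{HMM: } \frac{2 \cdot 62.7 \cdot 58.1}{62.7 + 58.1} \approx 60.3
```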
Task 2: Cleaning Textual Databases
Automatic Database Cleaning
  Errors and missing values:
    unavoidable, even in well-maintained databases
    negatively affect information retrieval
    manual error correction is extremely time-consuming
  Traditional data clean-up methods:
    not geared towards text databases
    treat fields as atoms
    but tokens within a text string can provide valuable cues (e.g., "km" is frequent in the location column but may indicate an error in other columns)
Errors in Specimen Databases
  Error types:
    typos: 1%
    content errors (information is wrong, e.g., Surinam instead of Indonesia): 5.4%
    dispreferred synonyms: 6.2%
    wrong-column errors (information is correct but should be in a different column, e.g., location instead of special remarks): 3.5%
    missing values: up to 90% of a given column
Semi-Automatic Database Cleaning
  Subtasks:
    predict missing values
    detect and correct wrong values (typos, content errors, dispreferred synonyms)
    detect and correct wrong-column errors
  Semi-automatic set-up:
    tools search the database to
      predict missing values
      detect potential errors and find possible corrections
    each new value, or potential error and its correction, is flagged to a domain expert
Predicting Missing Values
  Method:
    exploit interdependencies between different fields (e.g., location: Tafel Mountain and country: South Africa)
    train a classifier to predict the value of a field given the values of the other fields
Predicting Missing Values
  Set-up:
    one classifier per column
    training data automatically generated from the database
    split data into 80% training, 20% testing
  Example database rows:
    Author           Determinator    Family        Genus     Country     Location
    (Daudin, 1802)                   Bataguridae   Anolis    Cambodia
    (Schlegel)       G. vd. Boog     Colubridae    Geophis   Indonesia   B. Hoeksema's garden
    Schneider        M.S. Hoogmoed                 Bufo      Suriname    near airfield
  Training instances for the Country classifier:
    Features                                                              Label
    (Daudin, 1802), Bataguridae, Anolis                                   Cambodia
    (Schlegel), G. vd. Boog, Colubridae, Geophis, B. Hoeksema's garden    Indonesia
    Schneider, M.S. Hoogmoed, Bufo, near airfield                         Suriname
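The instance creation above can be sketched in a few lines. The nearest-neighbour predictor with a plain overlap metric is only a stand-in for the memory-based learner used in the talk (TiMBL); `make_instances`, `predict` and the query record are hypothetical.

```python
from collections import Counter

COLUMNS = ["author", "determinator", "family", "genus", "country", "location"]

# Rows of the example table; None marks an empty cell.
RECORDS = [
    dict(zip(COLUMNS, ["(Daudin, 1802)", None, "Bataguridae", "Anolis", "Cambodia", None])),
    dict(zip(COLUMNS, ["(Schlegel)", "G. vd. Boog", "Colubridae", "Geophis", "Indonesia",
                       "B. Hoeksema's garden"])),
    dict(zip(COLUMNS, ["Schneider", "M.S. Hoogmoed", None, "Bufo", "Suriname", "near airfield"])),
]

def make_instances(records, target):
    """One (features, label) pair per record whose target cell is filled."""
    feature_columns = [c for c in COLUMNS if c != target]
    return [
        (tuple(record[c] for c in feature_columns), record[target])
        for record in records
        if record[target] is not None
    ]

def predict(instances, features, k=1):
    """k-nearest-neighbour vote, counting matching non-empty feature values."""
    def overlap(a, b):
        return sum(1 for x, y in zip(a, b) if x is not None and x == y)
    ranked = sorted(instances, key=lambda inst: overlap(inst[0], features), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

country_instances = make_instances(RECORDS, "country")
# Hypothetical record with a missing country: only family and genus are known.
query = (None, None, "Colubridae", "Geophis", None)
print(predict(country_instances, query))
```

One such instance set, and one classifier, would be built per target column.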
Results
  Generally:
    fairly high prediction accuracies, even for free-text fields
    well above baselines (random (rnd) and majority value (maj))
  Accuracies for different columns:
    column             # values (types)    ML        maj       rnd
    family                  83             97.37%    18.59%    1.92%
    genus                  649             91.95%    10.13%    0.35%
    species              1,351             88.65%     6.18%    0.07%
    collector            1,079             85.25%    30.44%    0.09%
    special remarks      2,537             76.21%     4.07%    0.03%
    location               653             67.66%    22.46%    0.15%
    biotope                700             63.02%     4.63%    0.14%
More Sophisticated Approach
  the simple approach treats values of fields as atoms
  not ideal for free-text fields
  Example: predict the value of country from place
    Venezuela co-occurs with 56 distinct values in place
    30 of those contain the string El Dorado, e.g.: La Escalera, Z. van El Dorado, weg El Dorado - Sta El Dorado, Estado Bolivar Las Claritas, 85 km Z. van El Dorado El Dorado, Estado Bolivar, 4 km N van El Dorado
More Sophisticated Approach
  Alternative feature representations (instead of representing fields as atomic strings):
    (a) represent only named entities (e.g., binary features indicating presence of various NEs)
    (b) represent fields by the unigram with the highest tfidf
    (c) represent fields by the unigram for which the mutual information with the values in the target column is highest
  Pilot study indicates that (c) works best:
    representation            Acc.
    full string               78.55%
    NEs (binary features)     83.13%
    max. tfidf                82.41%
    max. MI                   84.30%
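Representation (c) could be computed along these lines. The sketch assumes mutual information between a binary "token occurs in the field" variable and the target value, estimated from co-occurrence counts; the four training rows, the tokenisation, and the function names are hypothetical (only some place strings are borrowed from the slides).

```python
import math
from collections import Counter

def tokenize(text):
    """Lower-case, split on whitespace, strip surrounding commas and full stops."""
    return [token.strip(".,") for token in text.lower().split() if token.strip(".,")]

def mutual_information(rows, token):
    """MI in bits between 'token occurs in the field' and the target value."""
    n = len(rows)
    joint = Counter((token in tokens, target) for tokens, target in rows)
    p_x = Counter(token in tokens for tokens, _ in rows)
    p_y = Counter(target for _, target in rows)
    return sum(
        (count / n) * math.log2((count / n) / ((p_x[x] / n) * (p_y[y] / n)))
        for (x, y), count in joint.items()
    )

def best_unigram(rows, field_text):
    """The token of the field with the highest MI with the target column."""
    return max(tokenize(field_text), key=lambda token: mutual_information(rows, token))

# Hypothetical training rows: (tokens of the place field, country).
ROWS = [
    (tokenize("El Dorado, Estado Bolivar"), "Venezuela"),
    (tokenize("Las Claritas, 85 km Z. van El Dorado"), "Venezuela"),
    (tokenize("Brownsberg, aan voet"), "Suriname"),
    (tokenize("4 km N.O. van airstrip"), "Suriname"),
]

print(best_unigram(ROWS, "Las Claritas, 85 km Z. van El Dorado"))
```

In this toy data "el" and "dorado" separate the two countries perfectly (1 bit each), while "km" and "van" occur with both countries and carry no information.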
Detecting Content Errors
  Apply value prediction to filled fields: blank out a filled cell, predict its value from the other fields, and compare the prediction with the actual value.
  Example (Genus cell of one record):
    Actual Value: Geophis
    Predicted Value: Rhabdophis
  Example table, values per column:
    Author: (Daudin, 1802); (Schlegel); Schneider; (Horst, 1883)
    Determinator: G. vd. Boog; M.S. Hoogmoed; Tyler, M.J.
    Family: Bataguridae; Colubridae; Hylidae
    Genus: Anolis; Geophis; Bufo; Litoria
    Country: Cambodia; Indonesia; Suriname
    Conservation: (shell, dry); alcohol
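The detection loop can be sketched as a leave-one-out pass over the filled cells of one column. The peer-voting predictor below is a deliberately simple stand-in for the trained classifier, and the six records are hypothetical data built around the Geophis/Rhabdophis example.

```python
from collections import Counter

def predict_from_peers(records, index, target, feature_columns):
    """Predict records[index][target] from all other records (leave-one-out).

    Each other record votes for its own target value, weighted by the number
    of feature columns in which it matches the record under test.
    """
    current = records[index]
    votes = Counter()
    for j, other in enumerate(records):
        if j == index or other[target] is None:
            continue
        overlap = sum(
            1 for c in feature_columns if current[c] is not None and current[c] == other[c]
        )
        if overlap:
            votes[other[target]] += overlap
    return votes.most_common(1)[0][0] if votes else None

def flag_content_errors(records, target, feature_columns):
    """Return (row index, actual value, predicted value) where the two disagree."""
    flags = []
    for i, record in enumerate(records):
        if record[target] is None:
            continue  # empty cells belong to missing-value prediction
        predicted = predict_from_peers(records, i, target, feature_columns)
        if predicted is not None and predicted != record[target]:
            flags.append((i, record[target], predicted))
    return flags

# Hypothetical records.
RECORDS = [
    {"family": "Colubridae", "country": "Indonesia", "genus": "Rhabdophis"},
    {"family": "Colubridae", "country": "Indonesia", "genus": "Rhabdophis"},
    {"family": "Colubridae", "country": "Indonesia", "genus": "Rhabdophis"},
    {"family": "Colubridae", "country": "Indonesia", "genus": "Geophis"},
    {"family": "Hylidae", "country": "Australia", "genus": "Litoria"},
    {"family": "Hylidae", "country": "Australia", "genus": "Litoria"},
]

for row, actual, predicted in flag_content_errors(RECORDS, "genus", ["family", "country"]):
    print(f"row {row}: actual={actual} predicted={predicted}")
```

In the semi-automatic set-up, each flagged triple would be shown to a domain expert rather than applied automatically.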
Detecting Content Errors
  Experimental set-up:
    test on taxonomic fields
    potential errors can be checked by a non-expert against a gold standard taxonomy
    (correction) precision calculated by manual checking
    (detection) recall estimated on artificial errors
Detecting Content Errors
  Detection recall (estimated):
    column    recall
    class     95.56%
    order     96.82%
    family    96.15%
    genus     93.09%
  Correction precision:
                               precision
    column    items flagged    incl. synonyms    excl. synonyms
    class           2          50.00%            50.00%
    order          26          57.00%            38.00%
    family         33          45.45%             9.09%
    genus         135          10.37%             5.93%
Detecting Wrong-Column Errors
  Aim:
    detect information that was entered in the wrong column (e.g., "died in captivity" is in location but should be in special remarks)
  Method:
    recast as a text classification problem
    train a classifier to predict the column of a text string
    apply to cell contents
    signal a potential error if: predicted column ≠ actual column
Detecting Wrong-Column Errors
  Training data:
    generated automatically from the database
  Features:
    typographical (number of tokens, capitalisation, punctuation, numbers, units of measurement etc.)
    similarity with each column (i.e., tfidf-weighted token overlap)
Detecting Wrong-Column Errors
  Training set creation: each cell yields one instance, consisting of its feature vector and its column as the label.
    Cell              Features                         Label
    Daudin, 1802      0.785 0.983 1 3 1 1 1 0 0 0      Author
    M. S. Hoogmoed    0.219 0.886 1 3 0 1 1 1 1 0      Determinator
    Polychrotidae     0.560 0.432 1 1 0 0 1 0 0 0      Family
  Source table columns: Author, Determinator, Family, Place, Genus, Province
  Source table cell values: Daudin, 1802; (Peters, 1867); (Schlegel); M. S. Hoogmoed; Polychrotidae; Bigisanti beach; Anolis; A. H. Bol, 1972/73; Agamidae; Gonocephalus; M Java, Wonosobo; Elaphe; Ned. Nieuw Guinea
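A rough sketch of the two feature groups. The tf-idf weighting (each column treated as one document), the selection of typographic features, the assignment of the example strings to columns, and the final "most similar column" rule are all assumptions for illustration; the talk trains a classifier on such features instead of taking the maximum.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def column_profiles(database):
    """tf-idf weight of every token per column, treating each column as one document."""
    tf = {
        column: Counter(token for cell in cells for token in tokenize(cell))
        for column, cells in database.items()
    }
    df = Counter(token for counts in tf.values() for token in counts)
    n = len(database)
    return {
        column: {token: count * math.log(n / df[token]) for token, count in counts.items()}
        for column, counts in tf.items()
    }

def similarity(profile, text):
    """tf-idf-weighted overlap between the tokens of a string and a column."""
    return sum(profile.get(token, 0.0) for token in set(tokenize(text)))

def features(profiles, text):
    """Typographic features plus one similarity feature per column."""
    feats = {
        "n_tokens": len(text.split()),
        "capitalised": int(text[:1].isupper()),
        "has_digit": int(any(ch.isdigit() for ch in text)),
        "has_punctuation": int(any(ch in ".,:;()" for ch in text)),
    }
    for column, profile in profiles.items():
        feats["sim_" + column] = similarity(profile, text)
    return feats

def most_similar_column(profiles, text):
    return max(profiles, key=lambda column: similarity(profiles[column], text))

# Hypothetical toy database (strings borrowed from the slides, column assignment assumed).
DATABASE = {
    "location": ["12 km NE of Elmali", "Huys te Linschoten", "bosbivak Zanderij"],
    "biotope": ["roadside bordering secondary forest", "in roadside pool",
                "clay soil with reed vegetation"],
    "special remarks": ["died in captivity", "injured before capture by observers"],
}
PROFILES = column_profiles(DATABASE)

cell, actual_column = "in field near roadside", "location"
predicted_column = most_similar_column(PROFILES, cell)
if predicted_column != actual_column:
    print(f"potential wrong-column error: {cell!r} ({actual_column} -> {predicted_column})")
```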
Detecting Wrong-Column Errors
  Example: the classifier is applied to a cell in the Location column.
    Actual Value: Location
    Predicted Value: Biotope
  Example table columns: Conservation, Location, Biotope, Special Remarks
  Example table cell values: alcohol alcohol Huys te Linschoten roadside bordering secondary forest in roadside pool Bodemvallen Parkbos aanplant Eik died in captivity alcohol bosbivak Zanderij in betonnen met water biotoop: zwampig terrein gevulde bak met zandboden on base of tree along trail in terra firme forest geen verdere gegevens bekend alcohol 12 km NE of Elmali clay soil with reed vegetation Tank 8 alcohol in field near roadside injured before capture by observers
Detecting Wrong-Column Errors
  Experimental set-up:
    leave-one-out testing
    manual annotation for 4 free-text columns: biotope, location, publication, special remarks
  Results:
                       items      detection    detection    correction
    column             flagged    recall       precision    accuracy
    biotope            234        89.1%        24.4%        91.2%
    location           286        77.6%        18.2%        51.9%
    publication         58        100%          6.9%        25.0%
    special remarks    298        24.0%        20.1%        61.7%
Detecting Wrong-Column Errors, Examples
  Good corrections:
    string                                  original column      predicted column
    on a tree 2.5 m above ground            special remarks      biotope
    25 km N.N.W Antalya                     special remarks      location
    1700 m                                  biotope              altitude
    died in captivity 23.09.1994            location             special remarks
    roadside bordering secondary forest     location             biotope
    Suriname Exp. 1970                      collection number    collector
  Not so good:
    string                                  original column      predicted column
    (Kikkervisje)                           special remarks      author
    N.W. van Meknes                         location             collector
Conclusions
  Summary:
    for the cultural heritage domain, manual annotation of training data is usually not feasible
    but it is possible to go a long way by exploiting existing resources:
      exploit existing databases to bootstrap a fieldbook segmenter
      exploit redundancy and interdependencies to detect database errors
  Software:
    Error Detection Demo (http://ls0135.uvt.nl/)
    Timpute: a TiMBL wrapper for semi-automatic error detection in databases (to be released soon)
Collaborators
  Antal van den Bosch, Sander Canisius, Marieke van Erp, Steve Hunt, Tijn Porcelijn
Links
  Error Detection Demo: http://ls0135.uvt.nl/
  MITCH project: http://ilk.uvt.nl/mitch/
  CATCH programme: http://www.nwo.nl/catch/
  Museum Naturalis: http://www.naturalis.nl/