How To Create A Specimen Database

1 from the Natural History Domain
Computational Linguistics, Saarland University
11 October 2007

2 Background
The MITCH project: Mining for Information in Texts from the Cultural Heritage
  joint research project between Tilburg University and Naturalis (Dutch National Museum of Natural History)
  text mining and information extraction for natural history data
  part of the CATCH programme (10 projects, funded by NWO)

3 Naturalis

4 Naturalis
more than 10 million specimens:
  5,250,000 insects
  2,290,000 invertebrates
  1,000,000 vertebrates
  1,160,000 fossils
  440,000 stones and minerals
150,000 species
10% of the Earth's biodiversity

5 Naturalis

6 Data and Meta-Data

14 Data Sources and Tasks
Two main data sources...
  (handwritten) fieldbooks (digitised and externally transcribed)
  specimen databases (manually created by curators, incomplete)
Tasks...
  converting transcribed field books into structured records
  data cleaning for specimen databases (error detection, data completion)

15 Digitisation of Fieldbooks

16 Transcription of Fieldbooks
all fieldbooks relating to the Reptiles and Amphibians Collection
15,000 handwritten pages
manually transcribed by typists at Combiwerk
simple guidelines on how to deal with
  non-ASCII characters
  text written in the margins
  illegible passages
  etc.
transcriptions completed in around 8 months

17 Specimen Databases
manually compiled from field books
designed by biologists, not by database experts
maintained by several people
rows in the database correspond to fieldbook entries (usually one specimen per row)
columns give information about specimens and the circumstances of their collection (when, where, by whom etc.)
columns in a variety of formats:
  numbers (e.g., collection date, registration number)
  short text (e.g., collector, genus)
  free text (e.g., biotope, place, remarks)

18 Example Columns
place:
  10 km. N. of Lucie Base
  Bivuac near De Kock Mountain
  weg van Lozoya naar Navarradonda
location:
  h. on small tree in deciduous tropical forest (now in full leaf), 150 cm. above ground, Dewlap orange with red around rim.
  kwakend op grashalm in poel lang weg
biotope:
  under stone on moist, calcareous loam
  275 m, forest floor among leaf litter, near Arroyo
special remarks:
  aangereden (nog niet dood)
  according to information from R. Heyer, Smithsonian, Washington, this is likely to be L. knudseni, considering the short dorso-lateral folds and chest spines ( to J. W. Arntzen January 2004)

19 Task 1: Converting Fieldbooks into Structured Records

26 Structure of Fieldbook Entries
Number, Genus, Species, Biotope, Collection Time, Reg. Num.
  1 ex. Leptodactylus wagneri At base of tree on small island, primary forest, 20.45-22.00 u. RMNH 23865
  Lithodytes lineatus, Brownsberg, aan voet, onder stuk rot hout, , 8.45 u., RMNH Dorsolateraal strepen heldergeel, tekening op dijen vuurrood, veel feller als bij P. femoralis.
  Gonyocephalus auritus Meyer, 3 ex. (1 juv.), Misool. Hoedt RMNH
  Eleutherodactylus zeuctotylus 1 [vrouw] Lelygebergte, 4 km N.O. van airstrip, distr. Marowijne, Suriname, 19-VIII-1975, onder stuk hout, 610m, l [plus] d M. S. Hoogmoed.

27 Converting Fieldbook Entries into Structured Records
Aim
  make inherent structure of entry explicit (i.e., find segments conveying different types of information)
Motivation
  enable more sophisticated search
  raw data only allows keyword search
  enriched data allows querying of specific types of information

28 Example
Aim: find all specimens of Phyllobates femoralis
Query: fieldbook entry contains the string "Phyllobates femoralis"
Result:
  Phyllobates femoralis, post Conini, Coeroenirivier, bosgrond, 25-IV-1968, 8:30-13:30 u. RMNH
  Lithodytes lineatus, Brownsberg, aan voet, onder stuk rot hout, , 8.45 u., RMNH Dorsolateraal strepen heldergeel, tekening op dijen vuurrood, veel feller als bij Phyllobates femoralis.
(the second hit is a Lithodytes lineatus specimen that only mentions Phyllobates femoralis in a comparison)
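
The difference between keyword search on raw entries and a query on enriched data can be sketched as follows. This is a minimal illustration, not the project's software: the record layout and the field names `genus`, `species` and `raw` are assumptions, and the raw strings are shortened versions of the two entries above.

```python
# Two hypothetical, already segmented records; "raw" holds the shortened original entry.
records = [
    {"genus": "Phyllobates", "species": "femoralis",
     "raw": "Phyllobates femoralis, post Conini, Coeroenirivier, bosgrond"},
    {"genus": "Lithodytes", "species": "lineatus",
     "raw": "Lithodytes lineatus, Brownsberg, veel feller als bij Phyllobates femoralis."},
]

def keyword_search(records, query):
    """Return every record whose raw text contains the query string anywhere."""
    return [r for r in records if query in r["raw"]]

def field_search(records, genus, species):
    """Return only the records whose genus and species fields match."""
    return [r for r in records if r["genus"] == genus and r["species"] == species]

keyword_hits = keyword_search(records, "Phyllobates femoralis")
field_hits = field_search(records, "Phyllobates", "femoralis")
print(len(keyword_hits), len(field_hits))
```

The keyword query also returns the Lithodytes entry, because the species name occurs in its free-text remark; the field query does not.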

29 Modelled as Sequence Labelling
1 ex. Leptodactylus wagneri At base of tree on small island, primary forest, 20.45-22.00 u. RMNH 23865

30 Modelled as Sequence Labelling
1/num ex./num Leptodactylus/genus wagneri/species At/bio base/bio of/bio tree/bio on/bio small/bio island/bio ,/bio primary/bio forest/bio ,/o 20/time ./time 45/time -/time 22/time ./time 00/time u/time ./time RMNH/reg_num 23865/reg_num
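
To show how a token/label sequence like this maps back onto record fields, here is a small sketch that groups adjacent tokens with the same label into segments. The `token/label` string format and the label name `reg_num` (written with an underscore so that it is a single label) are assumptions made for the example.

```python
from itertools import groupby

labelled = ("1/num ex./num Leptodactylus/genus wagneri/species At/bio base/bio of/bio "
            "tree/bio on/bio small/bio island/bio ,/bio primary/bio forest/bio ,/o "
            "20/time ./time 45/time -/time 22/time ./time 00/time u/time ./time "
            "RMNH/reg_num 23865/reg_num")

def parse(labelled):
    """Split 'token/label' items into (token, label) pairs, splitting on the last slash."""
    pairs = []
    for item in labelled.split():
        token, _, label = item.rpartition("/")
        pairs.append((token, label))
    return pairs

def to_segments(pairs):
    """Merge runs of adjacent tokens that carry the same label into one segment."""
    return [(label, " ".join(token for token, _ in group))
            for label, group in groupby(pairs, key=lambda pair: pair[1])]

segments = to_segments(parse(labelled))
for label, text in segments:
    print(f"{label:8} {text}")
```

Note that this grouping merges two neighbouring segments if they happen to share a label; a begin/inside tagging scheme would be needed to keep them apart.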

31 Supervised Machine Learning
... could use a supervised learner: Hidden Markov Models, Conditional Random Fields etc.
However:
  requires manual annotation of data (by domain experts)
  annotation needs to be re-done for each new domain (e.g., archaeological field reports)
  ... or even sub-domain (reptiles vs. crustaceans)

32 Bootstrapping from Existing Resources
Specimen databases
  readily available
  database entry = fieldbook entry
  database column labels = fieldbook segment labels
Caveat: databases are only derived from fieldbooks
  some information in fieldbooks is not in the database and vice versa
  re-writings, systematic differences (format of dates, cue words (e.g., RMNH for registration number) etc.)
  segment sequence probabilities are lost in databases
(joint work with Sander Canisius)

33 Converting Fieldbooks into Structured Records
Three approaches...
  database look-up
  supervised ML trained on data automatically created from database
  HMMs plus language modelling trained on database

34 Database Look-Up
(a) assign each token its most frequent column label in the database (unigram look-up)
(b) assign each token the most frequent column label of the trigram centred on it, backing off to bi- and unigrams (trigram look-up)
(c) assign labels to trigrams in a sliding window (each token receives 3 labels) and vote over them (trigram look-up plus voting)
(d) check field book entries for substrings which are exact matches of database cells (exact match)
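
Variants (a) to (c) can be sketched roughly as below. This is one possible reading of the slide, not the original implementation: the toy database, whitespace tokenisation, the fallback label `o` for unseen tokens and the unigram fallback in the voting variant are all assumptions.

```python
from collections import Counter, defaultdict

def build_index(database, max_n=3):
    """Map every n-gram (n <= max_n) seen in a database cell to a Counter of column labels."""
    index = defaultdict(Counter)
    for column, cells in database.items():
        for cell in cells:
            tokens = cell.split()
            for n in range(1, max_n + 1):
                for i in range(len(tokens) - n + 1):
                    index[tuple(tokens[i:i + n])][column] += 1
    return index

def best_label(index, ngram):
    counts = index.get(ngram)
    return counts.most_common(1)[0][0] if counts else None

def unigram_lookup(index, tokens):
    """(a) most frequent column label of each token; 'o' if the token is unseen."""
    return [best_label(index, (t,)) or "o" for t in tokens]

def trigram_lookup(index, tokens):
    """(b) label of the trigram centred on the token, backing off to bigrams, then the unigram."""
    labels = []
    for i in range(len(tokens)):
        candidates = []
        if 0 < i < len(tokens) - 1:
            candidates.append(tuple(tokens[i - 1:i + 2]))
        if i > 0:
            candidates.append(tuple(tokens[i - 1:i + 1]))
        if i < len(tokens) - 1:
            candidates.append(tuple(tokens[i:i + 2]))
        candidates.append((tokens[i],))
        labels.append(next((best_label(index, c) for c in candidates if c in index), "o"))
    return labels

def trigram_vote(index, tokens):
    """(c) every known trigram votes for its label on its three tokens; majority wins."""
    votes = [Counter() for _ in tokens]
    for i in range(len(tokens) - 2):
        label = best_label(index, tuple(tokens[i:i + 3]))
        if label:
            for j in range(i, i + 3):
                votes[j][label] += 1
    fallback = unigram_lookup(index, tokens)   # assumption: unigram label if no trigram voted
    return [v.most_common(1)[0][0] if v else fb for v, fb in zip(votes, fallback)]

# Toy database: column name -> cell contents (invented for the example).
database = {
    "genus": ["Leptodactylus", "Lithodytes"],
    "species": ["wagneri", "lineatus"],
    "biotope": ["primary forest", "on small island", "at base of tree"],
}
index = build_index(database)
tokens = "Leptodactylus wagneri at base of tree on small island".split()
print(unigram_lookup(index, tokens))
print(trigram_lookup(index, tokens))
print(trigram_vote(index, tokens))
```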

35 Training on Automatically Created Data
Training Data
  concatenation of (contents of) database fields:
  (a) with uniform probabilities (random)
  (b) with probabilities taken from 10 manually labelled field book entries (biased)
Machine Learning Set-Up
  memory-based learner (TiMBL)
  training data: 18,000 database entries
  test data: 150 field book entries
  107 features:
    sliding window of 5 tokens
    typographic features
    tfidf similarities between n-grams and database column labels (in a window of 3 tokens around focus token)
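
Variant (a) of the training data creation might look like the sketch below: the non-empty fields of a database record are concatenated in a random order and each token inherits the label of its column. The record layout and the function name `make_training_entry` are hypothetical; variant (b) would instead sample the field order from probabilities estimated on the manually labelled entries.

```python
import random

def make_training_entry(record, rng):
    """Concatenate the non-empty fields of one record in random order and
    label every token with the column it came from."""
    columns = [column for column, value in record.items() if value]
    rng.shuffle(columns)              # variant (a): every field order is equally likely
    tokens, labels = [], []
    for column in columns:
        for token in record[column].split():
            tokens.append(token)
            labels.append(column)
    return tokens, labels

# Hypothetical database record; the empty collector field is skipped.
record = {"genus": "Leptodactylus", "species": "wagneri",
          "biotope": "primary forest", "collector": ""}
tokens, labels = make_training_entry(record, random.Random(0))
print(list(zip(tokens, labels)))
```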

36 HMM plus Language Modelling
Segmentation
  look for bigrams which are unlikely within a field
  language modelling (based on database) plus Viterbi to find most likely segmentation
Labelling
  HMM applied to segmented data
  initial and transition probabilities estimated in unsupervised fashion (Baum-Welch algorithm)
  emission probabilities based on language model of database
... plus a few domain-independent rules to deal with systematic differences between databases and fieldbooks
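
The labelling step relies on decoding the most likely state sequence of an HMM. A generic log-space Viterbi decoder is sketched below for illustration. All probabilities and the two emission functions are invented toy values; in the set-up described on the slide they would come from Baum-Welch estimation and from a language model of the database, neither of which is shown here.

```python
import math

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable state sequence for the observations (log-space Viterbi)."""
    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    # best[t][s] = (log-probability of the best path ending in state s at time t, previous state)
    best = [{s: (log(start_p[s]) + log(emit_p[s](observations[0])), None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            prev, score = max(((p, best[-1][p][0] + log(trans_p[p][s])) for p in states),
                              key=lambda item: item[1])
            column[s] = (score + log(emit_p[s](obs)), prev)
        best.append(column)

    # Trace back from the best final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return path[::-1]

# Toy model over already segmented text; all numbers are made up.
states = ["genus", "biotope"]
start_p = {"genus": 0.7, "biotope": 0.3}
trans_p = {"genus": {"genus": 0.1, "biotope": 0.9},
           "biotope": {"genus": 0.3, "biotope": 0.7}}
emit_p = {
    "genus": lambda seg: 0.9 if seg.istitle() and " " not in seg else 0.1,
    "biotope": lambda seg: 0.8 if "forest" in seg else 0.2,
}
path = viterbi(["Leptodactylus", "primary forest"], states, start_p, trans_p, emit_p)
print(path)
```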

37 Results
Measures: Token Acc.; Segment Prec., Rec., F(β=1)
Systems: ExactB, UniB, TriB, TriB+Vote, MBL rand, MBL bias, HMM

41 Task 2: Cleaning Textual Databases

42 Automatic Database Cleaning
Errors and Missing Values
  unavoidable, even in well-maintained databases
  negatively affect information retrieval
  manual error correction extremely time consuming
Traditional Data Clean-Up Methods
  not geared towards text databases
  treat fields as atoms
  but tokens within a text string can provide valuable cues, e.g. "km" is frequent in the location column but may indicate an error in other columns

43 Errors in Specimen Databases
Error types
  typos: 1%
  content errors (information is wrong, e.g. Surinam instead of Indonesia): 5.4%
  dispreferred synonyms: 6.2%
  wrong-column errors (information is correct but should be in a different column, e.g., location instead of special remarks): 3.5%
  missing values: up to 90% of a given column

44 Semi-Automatic Database Cleaning
Subtasks
  predict missing values
  detect and correct wrong values (typos, content errors, dispreferred synonyms)
  detect and correct wrong-column errors
Semi-Automatic set-up
  tools search the database to
    predict missing values
    detect potential errors and find possible corrections
  new value/potential error and correction are flagged to domain expert

45 Predicting Missing Values
Method
  exploit interdependencies between different fields, e.g. location: Tafel Mountain & country: South Africa
  train classifier to predict the value of a field given the values of the other fields

50 Predicting Missing Values
Set-Up
  one classifier per column
  training data automatically generated from database
  split data into 80% training, 20% testing
Training instances for the Country classifier
  Features                                                           Label
  (Daudin, 1802)  Bataguridae  Anolis                                Cambodia
  (Schlegel)  G. vd. Boog  Colubridae  Geophis  B. Hoeksema's garden  Indonesia
  Schneider  M.S. Hoogmoed  Bufo  near airfield                      Suriname
Database excerpt ("-" = empty cell)
  Author          Determinator   Family       Genus    Country    Location
  (Daudin, 1802)  -              Bataguridae  Anolis   Cambodia   -
  (Schlegel)      G. vd. Boog    Colubridae   Geophis  Indonesia  B. Hoeksema's garden
  Schneider       M.S. Hoogmoed  -            Bufo     Suriname   near airfield
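
The set-up can be sketched in a few lines. The sketch below uses a tiny 1-nearest-neighbour classifier with an overlap metric as a stand-in for a memory-based learner; it is not TiMBL and does not use its API. The rows repeat three column values from the excerpt above so that an 80/20 split is possible, and all function names are hypothetical.

```python
import random

def make_instances(rows, target):
    """One instance per row with a value in the target column:
    features = all other columns, label = the target value."""
    return [({c: v for c, v in row.items() if c != target}, row[target])
            for row in rows if row.get(target)]

def overlap(a, b):
    """Number of columns on which two feature dicts agree (empty values never match)."""
    return sum(1 for column, value in a.items() if value and b.get(column) == value)

def predict(train, features):
    """Label of the most similar training instance (1-nearest neighbour)."""
    return max(train, key=lambda instance: overlap(features, instance[0]))[1]

def evaluate(rows, target, seed=0):
    """Shuffle, split 80% training / 20% testing, and return the accuracy."""
    instances = make_instances(rows, target)
    random.Random(seed).shuffle(instances)
    cut = int(0.8 * len(instances))
    train, test = instances[:cut], instances[cut:]
    return sum(predict(train, f) == label for f, label in test) / len(test)

rows = ([{"family": "Bataguridae", "genus": "Anolis", "country": "Cambodia"}] * 5
        + [{"family": "Colubridae", "genus": "Geophis", "country": "Indonesia"}] * 5
        + [{"family": "", "genus": "Bufo", "country": "Suriname"}] * 5)

accuracy = evaluate(rows, "country")
guess = predict(make_instances(rows, "country"), {"family": "Colubridae", "genus": "Geophis"})
print(accuracy, guess)
```

One such classifier would be trained per column, each time treating that column as the label and the remaining columns as features.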

51 Results
Generally...
  fairly high prediction accuracies, even for free-text fields
  well above baselines (random (rnd) and majority value (maj))
Accuracies for different columns ("..." marks figures missing from the transcript)
  column            # values (types)   ML     maj      rnd
  family            ...                ...%   18.59%   1.92%
  genus             ...                ...%   10.13%   0.35%
  species           1,...              ...%    6.18%   0.07%
  collector         1,...              ...%   30.44%   0.09%
  special remarks   2,...              ...%    4.07%   0.03%
  location          ...                ...%   22.46%   0.15%
  biotope           ...                ...%    4.63%   0.14%
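
The slide does not define the two baselines. A common reading, assumed here, is that maj always predicts the most frequent value of the column and rnd guesses uniformly among the distinct values (types). Under that assumption a column with 52 types would give 1/52, about 1.92%, which is the rnd figure shown for family; that match is an inference, not something the slide states.

```python
from collections import Counter

def baselines(values):
    """Majority and random baseline accuracy for one column, given its non-empty values."""
    counts = Counter(values)
    majority = counts.most_common(1)[0][1] / len(values)   # always predict the most frequent value
    rnd = 1 / len(counts)                                   # guess uniformly among the value types
    return majority, rnd

# Toy column with two value types.
majority, rnd = baselines(["Colubridae", "Colubridae", "Colubridae", "Hylidae"])
print(f"maj = {majority:.2%}, rnd = {rnd:.2%}")
```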

52 More Sophisticated Approach
simple approach treats values of fields as atoms
not ideal for free-text fields
Example: predict value of country from place
  Venezuela co-occurs with 56 distinct values in place
  30 of those contain the string El Dorado, e.g.:
  La Escalera, Z. van El Dorado, weg El Dorado - Sta El Dorado, Estado Bolivar Las Claritas, 85 km Z. van El Dorado El Dorado, Estado Bolivar, 4 km N van El Dorado

53 More Sophisticated Approach
Alternative feature representations
  instead of representing fields as atomic strings:
  (a) represent only named entities (e.g., binary features indicating presence of various NEs)
  (b) represent fields by the unigram with the highest tfidf
  (c) represent fields by the unigram for which the mutual information with the values in the target column is highest
Pilot study indicates that (c) works best ("..." marks a figure missing from the transcript):
  representation          Acc.
  full string             ...%
  NEs (binary features)   83.13%
  max. tfidf              82.41%
  max. MI                 84.30%
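
Representation (c) could be computed along these lines. The sketch measures mutual information between the binary event "token occurs in the field" and the target value, then keeps the token with the highest score; the exact formulation used in the pilot study is not given on the slide, so this definition, the `\w+` tokenisation and the four toy (place, country) pairs are assumptions.

```python
import math
import re
from collections import Counter

def tokens(text):
    return re.findall(r"\w+", text)

def mutual_information(pairs, token):
    """MI in bits between 'token occurs in the field' and the target value."""
    n = len(pairs)
    joint = Counter((token in tokens(text), label) for text, label in pairs)
    p_x, p_y = Counter(), Counter()
    for (x, y), c in joint.items():
        p_x[x] += c
        p_y[y] += c
    # sum over observed (x, y) of p(x,y) * log2( p(x,y) / (p(x) p(y)) )
    return sum((c / n) * math.log2(c * n / (p_x[x] * p_y[y]))
               for (x, y), c in joint.items())

def best_unigram(pairs, text):
    """Represent a free-text field by its single token with the highest MI."""
    candidates = tokens(text)
    return max(candidates, key=lambda t: mutual_information(pairs, t)) if candidates else None

# Toy (place, country) pairs built from strings that appear in the slides.
pairs = [
    ("El Dorado, Estado Bolivar", "Venezuela"),
    ("Las Claritas, 85 km Z. van El Dorado", "Venezuela"),
    ("Brownsberg", "Suriname"),
    ("4 km N.O. van airstrip", "Suriname"),
]
chosen = best_unigram(pairs, "Las Claritas, 85 km Z. van El Dorado")
print(chosen, mutual_information(pairs, "km"))
```

In this toy data "km" and "van" occur with both countries and carry no information, whereas "El" and "Dorado" separate the two countries perfectly.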

56 Detecting Content Errors
Apply Value Prediction to filled fields...
Actual Value: Geophis
Predicted Value: Rhabdophis
Database excerpt (values listed per column; "?" is the cell being predicted)
  Author: (Daudin, 1802); (Schlegel); Schneider; (Horst, 1883)
  Determinator: G. vd. Boog; M.S. Hoogmoed; Tyler, M.J.
  Family: Bataguridae; Colubridae; Hylidae
  Genus: Anolis; ?; Bufo; Litoria
  Country: Cambodia; Indonesia; Suriname
  Conservation: (shell, dry); alcohol
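
The idea of applying value prediction to filled cells can be illustrated with a leave-one-out check: hide a cell, predict it from the rest of the database, and flag the cell when prediction and actual value disagree. The predictor below is a deliberately simple majority vote over rows sharing one feature value, standing in for the trained classifier; the rows and the function name are invented for the example.

```python
from collections import Counter

def flag_content_errors(rows, target, feature):
    """Leave-one-out: predict each row's target from the other rows that share its
    feature value; flag the row when the prediction differs from the actual value."""
    flagged = []
    for i, row in enumerate(rows):
        votes = Counter(other[target] for j, other in enumerate(rows)
                        if j != i and other[feature] == row[feature])
        if not votes:
            continue                      # no evidence, nothing to flag
        predicted = votes.most_common(1)[0][0]
        if predicted != row[target]:
            flagged.append((i, row[target], predicted))
    return flagged

# Invented rows; row 3 carries a family that disagrees with the other Geophis rows.
rows = ([{"genus": "Geophis", "family": "Colubridae"}] * 3
        + [{"genus": "Geophis", "family": "Hylidae"}]
        + [{"genus": "Litoria", "family": "Hylidae"}] * 2)

flagged = flag_content_errors(rows, target="family", feature="genus")
print(flagged)   # (row index, actual value, predicted value)
```

In the semi-automatic set-up, each flagged cell and its predicted value would be shown to a domain expert rather than corrected automatically.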

59 Detecting Content Errors
Experimental Set-Up
  test on taxonomic fields
  potential errors can be checked by a non-expert against a gold standard taxonomy
  (correction) precision calculated by manual checking
  (detection) recall estimated on artificial errors

60 Detecting Content Errors
Detection Recall (estimated)
  column   recall
  class    95.56%
  order    96.82%
  family   96.15%
  genus    93.09%
Correction Precision ("..." marks figures missing from the transcript)
  column   items flagged   precision incl. synonyms   precision excl. synonyms
  class    ...             ...%                       50.00%
  order    ...             ...%                       38.00%
  family   ...             ...%                        9.09%
  genus    ...             ...%                        5.93%

61 Detecting Wrong-Column Errors
Aim
  detect information that was entered in the wrong column, e.g., "died in captivity" is in location but should be in special remarks
Method
  recast as a text classification problem
  train classifier to predict the column of a text string
  apply to cell contents
  signal potential error if: predicted column ≠ actual column

62 Detecting Wrong-Column Errors
Training Data
  generated automatically from database
Features
  typographical (number of tokens, capitalisation, punctuation, numbers, units of measurement etc.)
  similarity with each column (i.e., tfidf weighted token overlap)
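
A reduced version of this idea is sketched below. It uses only the similarity feature: each column is treated as one document, tokens get tf-idf weights, and a cell is assigned to the column with the highest weighted token overlap. This nearest-profile rule replaces the trained classifier and leaves out the typographical features, and it does not remove a cell from its own column's profile, so it is a simplification rather than the method from the slides. The toy database is made up from strings in the slides plus a few invented cells.

```python
import math
from collections import Counter

def column_profiles(database):
    """tf-idf weight of every token per column, treating each column as one document."""
    tf = {column: Counter(tok for cell in cells for tok in cell.lower().split())
          for column, cells in database.items()}
    df = Counter(tok for counts in tf.values() for tok in counts)
    n = len(database)
    return {column: {tok: count * math.log(n / df[tok]) for tok, count in counts.items()}
            for column, counts in tf.items()}

def predict_column(profiles, cell):
    """Column whose profile has the highest tf-idf weighted overlap with the cell's tokens."""
    cell_tokens = set(cell.lower().split())
    scores = {column: sum(weights.get(tok, 0.0) for tok in cell_tokens)
              for column, weights in profiles.items()}
    return max(scores, key=scores.get)

def wrong_column_candidates(database):
    """Flag every cell whose predicted column differs from its actual column."""
    profiles = column_profiles(database)
    flagged = []
    for column, cells in database.items():
        for cell in cells:
            predicted = predict_column(profiles, cell)
            if predicted != column:
                flagged.append((cell, column, predicted))
    return flagged

database = {
    "location": ["12 km NE of Elmali", "4 km N of airstrip", "died in captivity"],
    "biotope": ["clay soil with reed vegetation", "primary forest floor"],
    "special remarks": ["died in captivity", "injured before capture",
                        "died in captivity after capture"],
}
candidates = wrong_column_candidates(database)
print(candidates)   # (cell, actual column, predicted column)
```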

65 Detecting Wrong-Column Errors
Training Set Creation
  Features (from cell string)   Label
  Daudin, 1802                  Author
  M. S. Hoogmoed                Determinator
  Polychrotidae                 Family
Database excerpt
  columns: Author, Determinator, Family, Place, Genus, Province
  cell contents (flattened): Daudin, 1802 (Peters, 1867) (Schlegel) M. S. Hoogmoed Polychrotidae Bigisanti beach Anolis A. H. Bol, 1972/73 Agamidae Gonocephalus M Java, Wonosobo Elaphe Ned. Nieuw Guinea

69 Detecting Wrong-Column Errors
Actual Value: Location
Predicted Value: Biotope
Database excerpt
  columns: Conservation, ? (the column being predicted), Biotope, Special Remarks
  cell contents (flattened): alcohol alcohol Huys te Linschoten roadside bordering secondary forest in roadside pool Bodemvallen Parkbos aanplant Eik died in captivity alcohol bosbivak Zanderij in betonnen met water biotoop: zwampig terrein gevulde bak met zandboden on base of tree along trail in terra firme forest geen verdere gegevens bekend alcohol 12 km NE of Elmali clay soil with reed vegetation Tank 8 alcohol in field near roadside injured before capture by observers

70 Detecting Wrong-Column Errors
Experimental Set-Up
  leave-one-out testing
  manual annotation for 4 free-text columns: biotope, location, publication, special remarks
Results ("..." marks figures missing from the transcript)
  column            items flagged   detection recall   detection precision   correction accuracy
  biotope           ...             ...%               24.4%                 91.2%
  location          ...             ...%               18.2%                 51.9%
  publication       ...             ...%                6.9%                 25.0%
  special remarks   ...             ...%               20.1%                 61.7%

71 Detecting Wrong-Column Errors, Examples
Good corrections:
  string                                original column     predicted column
  on a tree 2.5 m above ground          special remarks     biotope
  25 km N.N.W Antalya                   special remarks     location
  1700 m                                biotope             altitude
  died in captivity                     location            special remarks
  roadside bordering secondary forest   location            biotope
  Suriname Exp                          collection number   collector
Not so good:
  string                                original column     predicted column
  (Kikkervisje)                         special remarks     author
  N.W. van Meknes                       location            collector

72 Conclusions
Summary
  for the cultural heritage domain, manual annotation of training data is usually not feasible
  but it is possible to go a long way by exploiting existing resources:
    exploit existing databases to bootstrap a fieldbook segmenter
    exploit redundancy and interdependencies to detect database errors
Software
  Error Detection Demo
  Timpute: a TiMBL wrapper for semi-automatic error detection in databases (to be released soon)

73 Collaborators
Antal van den Bosch, Sander Canisius, Marieke van Erp, Steve Hunt, Tijn Porcelijn
Links
  Error Detection Demo
  MITCH project
  CATCH programme
  Museum Naturalis