How To Create A Specimen Database




from the Natural History Domain
Computational Linguistics, Saarland University
11 October 2007

Background
The MITCH project
- Mining for Information in Texts from the Cultural Heritage
- joint research project between Tilburg University and Naturalis (Dutch National Museum of Natural History)
- text mining and information extraction for natural history data
- part of the CATCH programme (10 projects, funded by NWO)

Naturalis

Naturalis
- more than 10 million specimens:
  - 5,250,000 insects
  - 2,290,000 invertebrates
  - 1,000,000 vertebrates
  - 1,160,000 fossils
  - 440,000 stones and minerals
- 150,000 species
- 10% of the Earth's biodiversity


Data and Meta-Data


Data Sources and Tasks
Two main data sources...
- (handwritten) fieldbooks (digitised and externally transcribed)
- specimen databases (manually created by curators, incomplete)
Tasks...
- converting transcribed fieldbooks into structured records
- data cleaning for specimen databases (error detection, data completion)

Digitisation of Fieldbooks

Transcription of Fieldbooks
- all fieldbooks relating to the Reptiles and Amphibians Collection
- 15,000 handwritten pages
- manually transcribed by typists at Combiwerk
- simple guidelines on how to deal with
  - non-ASCII characters
  - text written in the margins
  - illegible passages
  - etc.
- transcriptions completed in around 8 months

Specimen Databases
- manually compiled from fieldbooks
- designed by biologists, not by database experts
- maintained by several people
- rows in the database correspond to fieldbook entries, usually one specimen per row
- columns give information about the specimen and the circumstances of its collection (when, where, by whom etc.)
- columns in a variety of formats:
  - numbers (e.g., collection date, registration number)
  - short text (e.g., collector, genus)
  - free text (e.g., biotope, place, remarks)

Example Columns
- place: 10 km. N. of Lucie Base Bivuac near De Kock Mountain weg van Lozoya naar Navarradonda
- location: 07.15 h. on small tree in deciduous tropical forest (now in full leaf), 150 cm. above ground, Dewlap orange with red around rim. kwakend op grashalm in poel lang weg
- biotope: under stone on moist, calcareous loam 275 m, forest floor among leaf litter, near Arroyo
- special remarks: aangereden (nog niet dood) according to information from R. Heyer, Smithsonian, Washington, this is likely to be L. knudseni, considering the short dorso-lateral folds and chest spines (e-mail to J. W. Arntzen January 2004)

Task 1: Converting Fieldbooks into Structured Records

Structure of Fieldbook Entries
Segment types: Number, Genus, Species, Biotope, Collection Time, Reg. Num.
Example entries:
- 1 ex. Leptodactylus wagneri At base of tree on small island, primary forest, 20.45-22.00 u. RMNH 23865
- Lithodytes lineatus, Brownsberg, aan voet, onder stuk rot hout, 13.07.1968, 8.45 u., RMNH 26076 Dorsolateraal strepen heldergeel, tekening op dijen vuurrood, veel feller als bij P. femoralis.
- Gonyocephalus auritus Meyer, 3 ex. (1 juv.), Misool. Hoedt 1867. RMNH 17656
- Eleutherodactylus zeuctotylus 1 [vrouw] Lelygebergte, 4 km N.O. van airstrip, distr. Marowijne, Suriname, 19-VIII-1975, onder stuk hout, 610m, l [plus] d M. S. Hoogmoed.

Converting Fieldbook Entries into Structured Records
Aim
- make the inherent structure of an entry explicit (i.e., find segments conveying different types of information)
Motivation
- enable more sophisticated search
  - raw data only allows keyword search
  - enriched data allows querying of specific types of information

Example
Aim: find all specimens of Phyllobates femoralis
Query: fieldbook entry contains the string "Phyllobates femoralis"
Result:
- Phyllobates femoralis, post Conini, Coeroenirivier, bosgrond, 25-IV-1968, 8:30-13:30 u. RMNH 26127-26129
- Lithodytes lineatus, Brownsberg, aan voet, onder stuk rot hout, 13.07.1968, 8.45 u., RMNH 26076 Dorsolateraal strepen heldergeel, tekening op dijen vuurrood, veel feller als bij Phyllobates femoralis.
The second hit is a Lithodytes lineatus specimen that only mentions Phyllobates femoralis in a remark, so keyword search returns a false positive.
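The difference between the two kinds of search can be sketched in a few lines of Python. This is only an illustration: the field split of the two entries is hand-made, and the function names `keyword_search` and `field_search` are hypothetical.

```python
# Two entries from the example above; the split into fields is a hand-made
# illustration, not output of the system described in the talk.
entries = [
    {"genus": "Phyllobates", "species": "femoralis",
     "remarks": "post Conini, Coeroenirivier, bosgrond"},
    {"genus": "Lithodytes", "species": "lineatus",
     "remarks": "veel feller als bij Phyllobates femoralis"},
]


def keyword_search(records, phrase):
    """Return records whose concatenated text contains the phrase."""
    return [r for r in records if phrase in " ".join(r.values())]


def field_search(records, genus, species):
    """Return records whose genus and species fields match exactly."""
    return [r for r in records
            if r["genus"] == genus and r["species"] == species]


keyword_hits = keyword_search(entries, "Phyllobates femoralis")
field_hits = field_search(entries, "Phyllobates", "femoralis")
print(len(keyword_hits), len(field_hits))
```

The keyword query matches both entries, while the field query matches only the entry whose genus and species fields hold the name.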

Modelled as Sequence Labelling
Input:
1 ex. Leptodactylus wagneri At base of tree on small island, primary forest, 20.45-22.00 u. RMNH 23865

Labelled tokens (token/label):
1/num ex./num Leptodactylus/genus wagneri/species At/bio base/bio of/bio tree/bio on/bio small/bio island/bio ,/bio primary/bio forest/bio ,/o 20/time ./time 45/time -/time 22/time ./time 00/time u/time ./time RMNH/reg num 23865/reg num
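As a small sketch of how a labelled token sequence maps back to segments, the Python below merges adjacent tokens with the same label. The input is a shortened version of the example; the label is written `regnum` (without a space) so that each item splits cleanly at its last slash, and the helper names are assumptions.

```python
from itertools import groupby

# Shortened version of the labelled example; "regnum" replaces "reg num".
tagged = ("1/num ex./num Leptodactylus/genus wagneri/species At/bio base/bio "
          "of/bio tree/bio RMNH/regnum 23865/regnum")


def to_pairs(text):
    """Split whitespace-separated 'token/label' items into (token, label)."""
    return [tuple(item.rsplit("/", 1)) for item in text.split()]


def to_segments(pairs):
    """Merge runs of adjacent tokens that share a label into one segment."""
    return [(label, " ".join(token for token, _ in run))
            for label, run in groupby(pairs, key=lambda pair: pair[1])]


segments = to_segments(to_pairs(tagged))
for label, text in segments:
    print(f"{label:8s} {text}")
```

Segment-level precision and recall are then computed over such (label, text) units rather than over single tokens.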

Supervised Machine Learning
... could use a supervised learner: Hidden Markov Models, Conditional Random Fields etc.
However:
- requires manual annotation of data (by domain experts)
- annotation needs to be re-done for each new domain (e.g., archaeological field reports)
- ... or even sub-domain (reptiles vs. crustaceans)

Bootstrapping from Existing Resources
Specimen databases
- readily available
- database entry = fieldbook entry
- database column labels = fieldbook segment labels
Caveat: databases are only derived from fieldbooks
- some information in fieldbooks is not in the database and vice versa
- re-writings, systematic differences (format of dates, cue words (e.g., RMNH for registration number) etc.)
- segment sequence probabilities are lost in databases
(joint work with Sander Canisius)

Converting Fieldbooks into Structured Records
Three approaches...
- database look-up
- supervised ML trained on data automatically created from the database
- HMMs plus language modelling trained on the database

Database Look-Up
(a) assign each token its most frequent column label in the database (unigram look-up)
(b) assign each token the most frequent column label of the trigram centred on it, backing off to bi- and unigrams (trigram look-up)
(c) assign labels to trigrams in a sliding window (each token receives 3 labels) and vote over them (trigram look-up plus voting)
(d) check fieldbook entries for substrings which are exact matches of database cells (exact match)
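Variant (a) can be sketched as follows. The three database rows are toy data echoing values from the slides, `o` is assumed as the label for tokens never seen in the database, and the function names are hypothetical.

```python
from collections import Counter, defaultdict

# Toy database rows (column -> cell text); values echo the slides.
rows = [
    {"genus": "Leptodactylus", "species": "wagneri", "biotope": "primary forest"},
    {"genus": "Lithodytes", "species": "lineatus", "biotope": "forest floor"},
    {"genus": "Leptodactylus", "species": "knudseni", "biotope": "under stone"},
]


def build_lookup(rows):
    """Count, for every token, how often it occurs in each column."""
    counts = defaultdict(Counter)
    for row in rows:
        for column, cell in row.items():
            for token in cell.lower().split():
                counts[token][column] += 1
    return counts


def unigram_label(tokens, counts, unknown="o"):
    """Label each token with its most frequent column, or `unknown`."""
    labels = []
    for token in tokens:
        seen = counts.get(token.lower())
        labels.append(seen.most_common(1)[0][0] if seen else unknown)
    return labels


counts = build_lookup(rows)
entry = "Leptodactylus wagneri primary forest RMNH".split()
labels = unigram_label(entry, counts)
print(list(zip(entry, labels)))
```

The cue word RMNH never occurs in the toy database, so it stays unlabelled. This is the kind of systematic difference between fieldbooks and databases that limits pure look-up.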

Training on Automatically Created Data
Training Data
- concatenation of (contents of) database fields:
  (a) with uniform probabilities (random)
  (b) with probabilities that were taken from 10 manually labelled fieldbook entries (biased)
Machine Learning Set-Up
- memory-based learner (TiMBL)
- training data: 18,000 database entries
- test data: 150 fieldbook entries
- 107 features:
  - sliding window of 5 tokens
  - typographic features
  - tf-idf similarities between n-grams and database column labels (in a window of 3 tokens around the focus token)
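A minimal sketch of variant (a), assuming that "uniform probabilities" means every order of the filled columns is equally likely. The row is toy data and `synthesize_entry` is a hypothetical name; variant (b) would instead sample the order from probabilities estimated on the labelled entries and is not shown.

```python
import random


def synthesize_entry(row, rng):
    """Concatenate the non-empty cells of one row in a random column order.

    Returns (token, column) pairs, i.e. a labelled pseudo fieldbook entry.
    """
    columns = [column for column, cell in row.items() if cell]
    rng.shuffle(columns)  # uniform over column orders: the "random" variant
    return [(token, column)
            for column in columns
            for token in row[column].split()]


row = {"genus": "Lithodytes", "species": "lineatus", "place": "Brownsberg",
       "biotope": "onder stuk rot hout", "remarks": ""}
example = synthesize_entry(row, random.Random(0))
print(example)
```

Every token inherits the label of the column it came from, so no manual annotation is needed to produce training data.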

HMM plus Language Modelling
Segmentation
- look for bigrams which are unlikely within a field
- language modelling (based on database) plus Viterbi to find the most likely segmentation
Labelling
- HMM applied to segmented data
- initial and transition probabilities estimated in unsupervised fashion (Baum-Welch algorithm)
- emission probabilities based on a language model of the database
... plus a few domain-independent rules to deal with systematic differences between databases and fieldbooks

Results
            Token   Segment
            Acc.    Prec.   Rec.    F(β=1)
ExactB      16.0    25.7    23.1    24.3
UniB        27.0     8.9    22.8    12.8
TriB        43.8    12.9    24.8    16.9
TriB+Vote   45.1    14.9    27.8    19.4
MBL rand.   44.6     7.1    19.2    10.4
MBL bias    53.4    12.1    32.0    17.6
HMM         56.9    62.7    58.1    60.3
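The F(β=1) column is the harmonic mean of segment precision and recall. As a quick check, the sketch below recomputes two rows of the table; `f_beta` is a helper name introduced here.

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


hmm_f1 = f_beta(62.7, 58.1)    # HMM row
exact_f1 = f_beta(25.7, 23.1)  # ExactB row
print(round(hmm_f1, 1), round(exact_f1, 1))
```

Rounded to one decimal place, both values agree with the table.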

Task 2: Cleaning Textual Databases

Automatic Database Cleaning
Errors and Missing Values
- unavoidable, even in well-maintained databases
- negatively affect information retrieval
- manual error correction is extremely time consuming
Traditional Data Clean-Up Methods
- not geared towards text databases
- treat fields as atoms
- but tokens within a text string can provide valuable cues, e.g. "km" is frequent in the location column but may indicate an error in other columns

Errors in Specimen Databases
Error types
- typos: 1%
- content errors (information is wrong, e.g. Surinam instead of Indonesia): 5.4%
- dispreferred synonyms: 6.2%
- wrong-column errors (information is correct but should be in a different column, e.g., location instead of special remarks): 3.5%
- missing values: up to 90% of a given column

Semi-Automatic Database Cleaning
Subtasks
- predict missing values
- detect and correct wrong values (typos, content errors, dispreferred synonyms)
- detect and correct wrong-column errors
Semi-Automatic Set-Up
- tools search the database to
  - predict missing values
  - detect potential errors and find possible corrections
- new value / potential error and correction are flagged to a domain expert

Predicting Missing Values
Method
- exploit interdependencies between different fields, e.g. location: Tafel Mountain and country: South Africa
- train a classifier to predict the value of a field given the values of the other fields

Predicting Missing Values
Set-Up
- one classifier per column
- training data automatically generated from database
- split data into 80% training, 20% testing

Database excerpt
Author          Determinator   Family       Genus    Country    Location
(Daudin, 1802)  (empty)        Bataguridae  Anolis   Cambodia   (empty)
(Schlegel)      G. vd. Boog    Colubridae   Geophis  Indonesia  B. Hoeksema's garden
Schneider       M.S. Hoogmoed  (empty)      Bufo     Suriname   near airfield

Training instances for the Country classifier (features -> label)
(Daudin, 1802), Bataguridae, Anolis -> Cambodia
(Schlegel), G. vd. Boog, Colubridae, Geophis, B. Hoeksema's garden -> Indonesia
Schneider, M.S. Hoogmoed, Bufo, near airfield -> Suriname
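The per-column set-up can be sketched with a very small memory-based classifier. This is a simplified stand-in, not TiMBL: it uses a single nearest neighbour and an unweighted count of matching fields, and the names `overlap` and `predict` are hypothetical. The three training rows are the ones from the slide.

```python
def overlap(a, b, features):
    """Number of features on which two records share a non-empty value."""
    return sum(1 for f in features if a.get(f) and a.get(f) == b.get(f))


def predict(record, training, target, features):
    """Copy the target value of the most similar training record (1-NN)."""
    candidates = [r for r in training if r.get(target)]
    nearest = max(candidates, key=lambda r: overlap(record, r, features))
    return nearest[target]


features = ["author", "determinator", "family", "genus", "location"]
training = [
    {"author": "(Daudin, 1802)", "determinator": "", "family": "Bataguridae",
     "genus": "Anolis", "country": "Cambodia", "location": ""},
    {"author": "(Schlegel)", "determinator": "G. vd. Boog",
     "family": "Colubridae", "genus": "Geophis", "country": "Indonesia",
     "location": "B. Hoeksema's garden"},
    {"author": "Schneider", "determinator": "M.S. Hoogmoed", "family": "",
     "genus": "Bufo", "country": "Suriname", "location": "near airfield"},
]

# A record whose country is missing.
query = {"family": "Colubridae", "genus": "Geophis"}
predicted = predict(query, training, "country", features)
print(predicted)
```

One such classifier would be trained per column, each time treating that column as the target and the remaining columns as features.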

Results
Generally...
- fairly high prediction accuracies, even for free-text fields
- well above baselines (random (rnd) and majority value (maj))
Accuracies for different columns
                                     Accuracy
column            # values (types)   ML        maj       rnd
family            83                 97.37%    18.59%    1.92%
genus             649                91.95%    10.13%    0.35%
species           1,351              88.65%     6.18%    0.07%
collector         1,079              85.25%    30.44%    0.09%
special remarks   2,537              76.21%     4.07%    0.03%
location          653                67.66%    22.46%    0.15%
biotope           700                63.02%     4.63%    0.14%

More Sophisticated Approach
- the simple approach treats values of fields as atoms
- not ideal for free-text fields
Example: predict the value of country from place
- Venezuela co-occurs with 56 distinct values in place
- 30 of those contain the string El Dorado, e.g.: La Escalera, Z. van El Dorado, weg El Dorado - Sta El Dorado, Estado Bolivar Las Claritas, 85 km Z. van El Dorado El Dorado, Estado Bolivar, 4 km N van El Dorado

More Sophisticated Approach
Alternative feature representations
instead of representing fields as atomic strings:
(a) represent only named entities (e.g., binary features indicating presence of various NEs)
(b) represent fields by the unigram with the highest tf-idf
(c) represent fields by the unigram for which the mutual information with the values in the target column is highest
Pilot study indicates that (c) works best:
representation           Acc.
full string              78.55%
NEs (binary features)    83.13%
max. tf-idf              82.41%
max. MI                  84.30%
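Representation (c) can be sketched as below, assuming that mutual information is computed between a binary "unigram occurs in the field" variable and the target value. The four (place, country) pairs are toy data built from strings on the slides, the comma-stripping tokenizer is an assumption, and ties are broken alphabetically.

```python
import math
from collections import Counter


def tokenize(text):
    """Whitespace tokens with commas removed (a deliberately crude choice)."""
    return text.replace(",", " ").split()


def mutual_information(pairs, token):
    """I(X;Y) in bits: X = token occurs in the text, Y = target value."""
    n = len(pairs)
    joint = Counter((token in tokenize(text), label) for text, label in pairs)
    px, py = Counter(), Counter()
    for (x, y), count in joint.items():
        px[x] += count
        py[y] += count
    return sum((count / n) * math.log2(count * n / (px[x] * py[y]))
               for (x, y), count in joint.items())


def best_unigram(text, pairs):
    """Unigram of `text` with the highest MI (first alphabetically on ties)."""
    return max(sorted(set(tokenize(text))),
               key=lambda token: mutual_information(pairs, token))


pairs = [
    ("El Dorado, Estado Bolivar", "Venezuela"),
    ("Las Claritas, 85 km Z. van El Dorado", "Venezuela"),
    ("Brownsberg", "Suriname"),
    ("4 km N.O. van airstrip", "Suriname"),
]
feature = best_unigram("Las Claritas, 85 km Z. van El Dorado", pairs)
print(feature, mutual_information(pairs, "km"))
```

In this toy sample "Dorado" separates the two countries perfectly (1 bit), whereas "km" occurs equally often with both and carries no information.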

Detecting Content Errors
Apply value prediction to filled fields...
- hide the value of a filled cell (shown as "?" on the slide) and predict it from the other fields
- Actual Value: Geophis
- Predicted Value: Rhabdophis
Database excerpt (values listed column by column)
- Author: (Daudin, 1802), (Schlegel), Schneider, (Horst, 1883)
- Determinator: G. vd. Boog, M.S. Hoogmoed, Tyler, M.J.
- Family: Bataguridae, Colubridae, Hylidae
- Genus: Anolis, Geophis (the hidden cell), Bufo, Litoria
- Country: Cambodia, Indonesia, Suriname
- Conservation: (shell, dry), alcohol
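The flagging step can be sketched independently of the classifier. In the code below `predict` is any callable that guesses a value from the remaining fields; the dictionary-based `toy_predict` and its family-to-genus table are invented stand-ins for a trained classifier.

```python
def flag_suspects(records, column, predict):
    """Yield (index, actual, predicted) for filled cells that disagree
    with the value predicted from the other fields."""
    for index, record in enumerate(records):
        actual = record.get(column)
        if not actual:
            continue  # empty cells belong to the missing-value task
        masked = {k: v for k, v in record.items() if k != column}
        predicted = predict(masked)
        if predicted is not None and predicted != actual:
            yield index, actual, predicted


# Stand-in predictor: a fixed family -> genus table (illustration only).
toy_table = {"Colubridae": "Rhabdophis", "Hylidae": "Litoria"}


def toy_predict(masked_record):
    return toy_table.get(masked_record.get("family"))


records = [
    {"family": "Colubridae", "genus": "Geophis"},
    {"family": "Hylidae", "genus": "Litoria"},
    {"family": "Bataguridae", "genus": ""},
]
flags = list(flag_suspects(records, "genus", toy_predict))
print(flags)
```

A flag is only a suggestion: a domain expert decides whether the actual or the predicted value is correct.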

Detecting Content Errors
Experimental Set-Up
- test on taxonomic fields
- potential errors can be checked by a non-expert against a gold standard taxonomy
- (correction) precision calculated by manual checking
- (detection) recall estimated on artificial errors

Detecting Content Errors
Detection Recall (estimated)
column   recall
class    95.56%
order    96.82%
family   96.15%
genus    93.09%

Correction Precision
                          precision
column   items flagged    incl. synonyms   excl. synonyms
class    2                50.00%           50.00%
order    26               57.00%           38.00%
family   33               45.45%            9.09%
genus    135              10.37%            5.93%

Detecting Wrong-Column Errors
Aim
- detect information that was entered in the wrong column, e.g., "died in captivity" is in location but should be in special remarks
Method
- recast as a text classification problem
- train a classifier to predict the column of a text string
- apply to cell contents
- signal a potential error if: predicted column ≠ actual column

Detecting Wrong-Column Errors
Training Data
- generated automatically from database
Features
- typographical (number of tokens, capitalisation, punctuation, numbers, units of measurement etc.)
- similarity with each column (i.e., tf-idf weighted token overlap)
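Both feature groups can be sketched in plain Python. The concrete feature set, the tf-idf weighting (term frequency within a column times log inverse column frequency) and the toy column contents are assumptions made for illustration; the talk does not spell them out.

```python
import math
import re
from collections import Counter


def typographic_features(text):
    """A few surface features of a cell string (illustrative selection)."""
    tokens = text.split()
    return {
        "n_tokens": len(tokens),
        "n_capitalised": sum(t[0].isupper() for t in tokens),
        "has_digit": int(any(ch.isdigit() for ch in text)),
        "has_punct": int(bool(re.search(r"[.,;:()]", text))),
        "has_unit": int(bool(re.search(r"\b(km|m|cm)\b", text))),
    }


def column_profiles(columns):
    """tf-idf weight of every token per column (name -> list of cells)."""
    tf = {name: Counter(tok for cell in cells for tok in cell.lower().split())
          for name, cells in columns.items()}
    df = Counter(tok for counts in tf.values() for tok in counts)
    n = len(columns)
    return {name: {tok: c * math.log(n / df[tok]) for tok, c in counts.items()}
            for name, counts in tf.items()}


def similarity(text, profile):
    """Summed tf-idf weight of the tokens shared with a column."""
    return sum(profile.get(tok, 0.0) for tok in set(text.lower().split()))


columns = {
    "location": ["12 km NE of Elmali", "10 km N. of Lucie"],
    "biotope": ["clay soil with reed vegetation",
                "forest floor among leaf litter"],
    "special remarks": ["died in captivity", "injured before capture"],
}
profiles = column_profiles(columns)
cell = "roadside bordering secondary forest"
scores = {name: similarity(cell, profile) for name, profile in profiles.items()}
best_column = max(scores, key=scores.get)
print(typographic_features("12 km NE of Elmali"), best_column)
```

A cell stored under location whose best-matching column is biotope would be signalled as a potential wrong-column error.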

Detecting Wrong-Column Errors
Training Set Creation
Each cell yields one training instance: its feature vector, labelled with the column it was taken from.

String           Features                                  Label
Daudin, 1802     0.785 0.983 1 3 1 1 1 0 0 0               Author
M. S. Hoogmoed   0.219 0.886 1 3 0 1 1 1 1 0               Determinator
Polychrotidae    0.560 0.432 1 1 0 0 1 0 0 0               Family

Database excerpt
- columns: Author, Determinator, Family, Place, Genus, Province
- cell values: Daudin, 1802 (Peters, 1867) (Schlegel) M. S. Hoogmoed Polychrotidae Bigisanti beach Anolis A. H. Bol, 1972/73 Agamidae Gonocephalus M Java, Wonosobo Elaphe Ned. Nieuw Guinea

Detecting Wrong-Column Errors
Example: the classifier is applied to a cell of the Location column (shown as "?" on the slide)
- Actual Value: Location
- Predicted Value: Biotope
Database excerpt
- columns: Conservation, Location, Biotope, Special Remarks
- cell values: alcohol alcohol Huys te Linschoten roadside bordering secondary forest in roadside pool Bodemvallen Parkbos aanplant Eik died in captivity alcohol bosbivak Zanderij in betonnen met water biotoop: zwampig terrein gevulde bak met zandboden on base of tree along trail in terra firme forest geen verdere gegevens bekend alcohol 12 km NE of Elmali clay soil with reed vegetation Tank 8 alcohol in field near roadside injured before capture by observers

Detecting Wrong-Column Errors
Experimental Set-Up
- leave-one-out testing
- manual annotation for 4 free-text columns: biotope, location, publication, special remarks
Results (detection and correction)
column            items flagged   recall    precision   accuracy
biotope           234             89.1%     24.4%       91.2%
location          286             77.6%     18.2%       51.9%
publication       58              100%       6.9%       25.0%
special remarks   298             24.0%     20.1%       61.7%

Detecting Wrong-Column Errors, Examples
Good corrections:
string                                 original column     predicted column
on a tree 2.5 m above ground           special remarks     biotope
25 km N.N.W Antalya                    special remarks     location
1700 m                                 biotope             altitude
died in captivity 23.09.1994           location            special remarks
roadside bordering secondary forest    location            biotope
Suriname Exp. 1970                     collection number   collector

Not so good:
string                                 original column     predicted column
(Kikkervisje)                          special remarks     author
N.W. van Meknes                        location            collector

Conclusions
Summary
- for the cultural heritage domain, manual annotation of training data is usually not feasible
- but it is possible to go a long way by exploiting existing resources:
  - exploit existing databases to bootstrap a fieldbook segmenter
  - exploit redundancy and interdependencies to detect database errors
Software
- Error Detection Demo (http://ls0135.uvt.nl/)
- Timpute: a TiMBL wrapper for semi-automatic error detection in databases, to be released soon

Collaborators
Antal van den Bosch, Sander Canisius, Marieke van Erp, Steve Hunt, Tijn Porcelijn
Links
- Error Detection Demo: http://ls0135.uvt.nl/
- MITCH project: http://ilk.uvt.nl/mitch/
- CATCH programme: http://www.nwo.nl/catch/
- Museum Naturalis: http://www.naturalis.nl/