Comparison between historical population archives and decentralized databases




Marijn Schraagen, Dionysius Huijsmans
Leiden Institute of Advanced Computer Science (LIACS), Leiden University, The Netherlands
LaTeCH Workshop 2013

Research subject
Historical databases have increasingly become digitized
  Census data, civil registry, church records, trade records, ...
  Millions of interrelated records form historical social networks
  However, the network structure is not given
Alternative data sources: personal and local archives
  Family trees, legal archives, ...
  Small amount of information
  Relations between records are generally indicated and verified
Research goal: combine the information from different sources

Outline
  1 Introduction
  2 Matching
  3 Verification
  4 Application
  5 Conclusion

Motivation
Links between (historical) records are important for a wide range of applications
  Data Mining: graph traversal algorithms, community detection
  Humanities: migration patterns, family size, occupational development
  Linguistics: stability of spelling, morphology, phonetics
  Onomastics: name inheritance, geographical name distribution

Overview
First match records from databases X and Y, then identify complementary or conflicting links

  birth record X1  <-- match? -->  birth record Y1
        |                                |
     link La    <--  compare  -->     link Lb
        |                                |
  death record X2  <-- match? -->  death record Y2

Example: if X1 = Y1 but X2 ≠ Y2, then either La or Lb or both are wrong.

Data formats
Large-scale historical databases
  Syntax: usually structured (XML, SQL, comma-separated)
    Occasionally structured natural language is used
  Semantics: generally based on events
    Birth, marriage, baptism, change of ownership
    Exception: census records
Family databases
  Syntax: often the legacy Gedcom format
    Hierarchical level numbers and tags
  Semantics: generally based on individuals and families

Example historical databases
Genlias civil certificate database
  Official registration of birth, marriage and death
  The Netherlands, 1811-1920
  15 million certificates (events)
Gedcom family archive
  Hand-compiled from various sources
  Mostly northern part of the Netherlands, 1600-now
  1750 records (individuals and families)
Overlap: 1100 events, of which 600 births

Data formats example

Civil certificate
  Type: birth certificate
  Serial number: 176
  Date: 16-05-1883
  Place: Wonseradeel
  Child: Sierk Rolsma
  Father: Sjoerd Rolsma
  Mother: Agnes Weldring

Family archive
  0 @F294@ FAM
  1 HUSB @I840@
  1 WIFE @I787@
  1 CHIL @I848@
  1 CHIL @I849@
  0 @I787@ INDI
  1 NAME Agnes/Welderink/
  0 @I849@ INDI
  1 NAME Sierk/Rolsma/
  1 BIRT
  2 DATE 16 MAY 1883


Parser
Grammar (applied to the family archive shown in the data formats example)
  birth  -> [FAM:CHIL]:child, father, mother.
  child  -> bdate, bplace, name.
  father -> [FAM:HUSB]:name.
  mother -> [FAM:WIFE]:name.
  bdate  -> [INDI:BIRT:DATE].
  bplace -> [INDI:BIRT:PLAC].
  name   -> [INDI:NAME].
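To make the grammar concrete, the following is a minimal Python sketch of how individual-based Gedcom lines could be turned into event-based birth records. It is not the authors' parser: the function names `parse_gedcom` and `birth_events`, the dictionary layout, and the handling of missing individuals (fields set to `None`) are all assumptions made for illustration.

```python
from collections import defaultdict

# Sample taken from the family archive example in the slides.
GEDCOM = """\
0 @F294@ FAM
1 HUSB @I840@
1 WIFE @I787@
1 CHIL @I848@
1 CHIL @I849@
0 @I787@ INDI
1 NAME Agnes/Welderink/
0 @I849@ INDI
1 NAME Sierk/Rolsma/
1 BIRT
2 DATE 16 MAY 1883
"""

def parse_gedcom(text):
    """Return {xref: {tag_path: [values]}}, e.g. records['@I849@']['INDI:BIRT:DATE']."""
    records = {}
    current, path = None, []
    for line in text.splitlines():
        parts = line.strip().split(" ", 2)
        if len(parts) < 2 or not parts[0].isdigit():
            continue  # skip blank or malformed lines
        level = int(parts[0])
        if level == 0:
            if len(parts) == 3 and parts[1].startswith("@"):
                current = records.setdefault(parts[1], defaultdict(list))
                path = [parts[2]]  # record type: FAM or INDI
            else:
                current = None  # header or trailer line without a record id
            continue
        if current is None:
            continue
        value = parts[2] if len(parts) == 3 else ""
        path = path[:level] + [parts[1]]  # the level number gives the nesting depth
        current[":".join(path)].append(value)
    return records

def birth_events(records):
    """Apply the grammar: one birth event per FAM:CHIL entry (hypothetical layout)."""
    def first(xref, key):
        values = records.get(xref, {}).get(key, [])
        return values[0] if values else None

    def name(xref):
        raw = first(xref, "INDI:NAME")
        return " ".join(raw.replace("/", " ").split()) if raw else None

    events = []
    for fam_id, fam in records.items():
        for child in fam.get("FAM:CHIL", []):
            events.append({
                "child": name(child),
                "bdate": first(child, "INDI:BIRT:DATE"),
                "bplace": first(child, "INDI:BIRT:PLAC"),
                "father": name(first(fam_id, "FAM:HUSB")),
                "mother": name(first(fam_id, "FAM:WIFE")),
            })
    return events

events = birth_events(parse_gedcom(GEDCOM))
```

In the sample, individuals @I840@ and @I848@ are not listed, so the sketch leaves the corresponding fields empty rather than guessing.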

Record similarity measure
The parser provides uniform data for matching two records using similarity requirements for selected fields.
Example: birth certificate similarity
  Out of the four names of child and mother, at least two names are exactly equal.
  The year of birth is equal, or the difference in year of birth is within a small margin and the edit distance between the names is below some threshold.
If multiple candidates for matching a record are found, then the candidate with the smallest edit distance is selected.
Note that the definition is domain specific.

Matching example (birth certificate similarity)

Civil certificate
  Date: 16-05-1883
  Child: Sierk Rolsma
  Mother: Agnes Weldring

Family archive
  Date: 16 MAY 1883
  Child: Sierk Rolsma
  Mother: Agnes Welderink

Three out of four names equal (Sierk, Rolsma, Agnes), year of birth equal (1883): match

Matching results
Birth matches: 361/611 (59%)
  Civil certificate database still in digitization phase
  Family database contains many peripheral individuals for which parent names and birth date are unknown
  Similarity measure could be improved
Cf. results for marriage certificate matching: 154/176 (88%)

Verification
Ideal case: gold standard
  Generally not available for historical databases
Large variation in domain and data quality
  Performance of matching algorithms obtained on one database is not indicative of performance on other databases
  Unlike, e.g., newspaper archives, e-mail archives, co-author networks, ...
Possible solution: internal verification

Internal verification
A similarity measure does not necessarily use all record fields for matching
Unused fields can provide a support level for a match
Example: the birth similarity measure used person names and year of birth
  Location, exact date of birth, and serial number can be used for verification
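As an illustration of internal verification, the sketch below compares the fields left unused by the birth similarity measure and reports a support pattern with one mark per field. The field names, the `+`/`-` encoding, and the choice to count a missing or unparseable value as non-supporting are assumptions; the slides do not specify these details.

```python
MONTHS = {"JAN": 1, "FEB": 2, "MAR": 3, "APR": 4, "MAY": 5, "JUN": 6,
          "JUL": 7, "AUG": 8, "SEP": 9, "OCT": 10, "NOV": 11, "DEC": 12}

def normalize_date(text):
    """Map '16-05-1883' or '16 MAY 1883' to (1883, 5, 16); None if unparseable."""
    parts = text.replace("-", " ").split()
    if len(parts) != 3:
        return None
    day, month, year = parts
    month_number = int(month) if month.isdigit() else MONTHS.get(month.upper())
    if month_number is None or not (day.isdigit() and year.isdigit()):
        return None
    return (int(year), month_number, int(day))

# Fields not used by the similarity measure (hypothetical field names).
VERIFY_FIELDS = ("serial", "place", "date")

def support_pattern(x, y):
    """Return e.g. '+-+': one mark per verification field, '+' when both agree."""
    marks = []
    for field in VERIFY_FIELDS:
        a, b = x.get(field), y.get(field)
        if field == "date" and a and b:
            a, b = normalize_date(a), normalize_date(b)
        # Assumption: a missing value on either side gives no support.
        marks.append("+" if a is not None and a == b else "-")
    return "".join(marks)

civil = {"serial": "176", "place": "Wonseradeel", "date": "16-05-1883"}
family = {"date": "16 MAY 1883"}  # the sample family record lists no place or serial
pattern = support_pattern(civil, family)
```

A match supported by more unused fields can then be trusted more than one supported by none, which is the idea behind the support categories.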

Verification results

  support marks               dist   birth   marriage
  (serial, location, date)
  + + +                               177        69
  + - +                                31         2
  + +                                  21        41
  +                                    33         0
  +                                     7         2
  +                                     3        10
  (none)                                6         2
                              3         4        20
                              > 3      79         8
  total                               361       154

Interpretation of support categories

  support marks               dist   count   mean % unique   interpretation
  (serial, location, date)
  + + +                               177        100         ok
  + - +                                31        100         ok
  + +                                  21         99.1       ok
  +                                    33         98.7       ok
  +                                     3         98.1       ok
  (none)                                6         94.4       likely ok
  +                                     7         90.0       manual check
                              3         4         74.0       manual check
                              > 3      79         74.0       incorrect
  total                               361

Application: link comparison
First match records from databases X and Y, then identify complementary or conflicting links

  record X1  <-- match? -->  record Y1
      |                          |
   link La  <-- compare -->   link Lb
      |                          |
  record X2  <-- match? -->  record Y2

Application: compare links from the Gedcom family archive (given) to links between civil certificates (computed)
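The link comparison step can be sketched as follows, given the links in each database and the record matches. This is an illustrative reading rather than the authors' tool: the function name `compare_links`, the category names, and the simplification that each Y record has at most one outgoing link are assumptions.

```python
def compare_links(links_x, links_y, match):
    """Classify links given as (source, target) pairs.

    match maps record ids in X to matched record ids in Y.
    Assumption: each Y record has at most one outgoing link.
    """
    target_y = dict(links_y)
    result = {"confirmed": [], "conflicting": [], "only_x": [], "only_y": []}
    seen_y = set()
    for x1, x2 in links_x:
        y1 = match.get(x1)
        if y1 is None or y1 not in target_y:
            result["only_x"].append((x1, x2))       # complementary: link known only in X
            continue
        seen_y.add(y1)
        if match.get(x2) == target_y[y1]:
            result["confirmed"].append((x1, x2))    # X1 = Y1 and X2 = Y2
        else:
            result["conflicting"].append((x1, x2))  # X1 = Y1 but X2 differs from Y2
    for y1, y2 in links_y:
        if y1 not in seen_y:
            result["only_y"].append((y1, y2))       # complementary: link known only in Y
    return result

# Hypothetical record ids for illustration.
links_x = [("xb1", "xd1"), ("xb2", "xd2"), ("xb3", "xd3")]
links_y = [("yb1", "yd1"), ("yb2", "yd9"), ("yb4", "yd4")]
match = {"xb1": "yb1", "xd1": "yd1", "xb2": "yb2", "xd2": "yd2"}
result = compare_links(links_x, links_y, match)
```

A conflicting pair signals that at least one of the two links is wrong, while links found in only one database are candidates for completing the other.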

Visualization tool
  @F100@   01-05-1824  Sikke Sasses van der Zee Aafke Klazes de Boer Afke de Boer
  @F171@   13-05-1848  Sjoerd Riemerts Riemersma Johanna Sikkes van der Zee
  @F15@    09-05-1857  Jan Johannes Altena Klaaske Sikkes van der Zee
  @F16@    02-07-1892  Johannes Altena Elisabeth Vonk
  @F17@    16-11-1889  Eke Foekema Aaltje Altena
  @F18@    09-01-1896  Sikke Altena Cornelia Verkooyen Cornelia Verkooijen
  @F13@    13-06-1896  Ruurd Altena Anna Jans Rolsma
  @F19@    ~1900       H Wesseling Agatha Altena
  9797998  08-05-1895  Hendrikus Wesseling Agatha Altena
  @F122@   ~1920       Sikkes? IJbeltje Altena
  @F123@   ~1925       Bartolomeus Mathias van Oerle Klaaske Altena
  @F124@   18-05-1923  Sikke Altena Trijntje Homminga
A tool is developed to explore the link tree
Red and blue: matched certificates have differences

Visualization tool
Only red or blue: marriage from family archive without match in civil certificates, or vice versa

Visualization tool
Records F19 and 9797998 are a false negative match

Visualization tool
Records F122, F123, F124 are outside of the civil certificate timeframe

Summary
Combining information from different databases in the same domain
Syntactic and semantic parsing: from records based on individuals to records based on events
Matching using domain-specific similarity measures
Match validation using additional record fields
Application: visualization of link comparison

Future work
Scale up to more and larger databases
  Crowdsourcing is particularly suited to obtain data
Refine matching procedure
Public release of visualization tool

Acknowledgment
This work is part of the research programme LINKS, which is financed by the Netherlands Organisation for Scientific Research (NWO), grant 640.004.804. The authors would like to thank Tom Altena for the use of his Gedcom database.