Comparison between historical population archives and decentralized databases
Marijn Schraagen, Dionysius Huijsmans
Leiden Institute of Advanced Computer Science (LIACS), Leiden University, The Netherlands
LaTeCH Workshop 2013
Research subject
  - Historical databases are increasingly being digitized
      - Census data, civil registry, church records, trade records, ...
      - Millions of interrelated records that form historical social networks
      - However, the network structure is not given
  - Alternative data sources: personal and local archives
      - Family trees, legal archives, ...
      - Small amount of information
      - Relations between records are generally indicated and verified
  - Research goal: combine the information from different sources
Outline
  1. Introduction
  2. Matching
  3. Verification
  4. Application
  5. Conclusion
Motivation
  Links between (historical) records are important for a wide range of applications:
  - Data mining: graph traversal algorithms, community detection
  - Humanities: migration patterns, family size, occupational development
  - Linguistics: stability of spelling, morphology, phonetics
  - Onomastics: name inheritance, geographical name distribution
Overview
  - First match records from databases X and Y, then identify complementary or conflicting links
  - Diagram: birth record X_1 and death record X_2 are connected by link L_a; birth record Y_1 and death record Y_2 are connected by link L_b. X_1 is matched against Y_1 and X_2 against Y_2, after which L_a and L_b are compared.
  - Example: if X_1 = Y_1 but X_2 ≠ Y_2, then L_a, L_b, or both are wrong
Data formats
  - Large-scale historical databases
      - Syntax: usually structured (XML, SQL, comma-separated)
      - Occasionally structured natural language is used
      - Semantics: generally based on events (birth, marriage, baptism, change of ownership)
      - Exception: census records
  - Family databases
      - Syntax: often the legacy Gedcom format (hierarchical level numbers and tags)
      - Semantics: generally based on individuals and families
Example historical databases
  - Genlias civil certificate database
      - Official registration of birth, marriage and death
      - The Netherlands, 1811-1920
      - 15 million certificates (events)
  - Gedcom family archive
      - Hand-compiled from various sources
      - Mostly the northern part of the Netherlands, 1600-now
      - 1750 records (individuals and families)
  - Overlap: 1100 events, of which 600 births
Data formats example
  Civil certificate:
    Type: birth certificate
    Serial number: 176
    Date: 16-05-1883
    Place: Wonseradeel
    Child: Sierk Rolsma
    Father: Sjoerd Rolsma
    Mother: Agnes Weldring
  Family archive:
    0 @F294@ FAM
    1 HUSB @I840@
    1 WIFE @I787@
    1 CHIL @I848@
    1 CHIL @I849@
    0 @I787@ INDI
    1 NAME Agnes/Welderink/
    0 @I849@ INDI
    1 NAME Sierk/Rolsma/
    1 BIRT
    2 DATE 16 MAY 1883
Parser
  Grammar (applied to the family archive record from the data formats example):
    birth  -> [FAM:CHIL]:child, father, mother.
    child  -> bdate, bplace, name.
    father -> [FAM:HUSB]:name.
    mother -> [FAM:WIFE]:name.
    bdate  -> [INDI:BIRT:DATE].
    bplace -> [INDI:BIRT:PLAC].
    name   -> [INDI:NAME].
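To illustrate how such a grammar could turn the individual-based Gedcom record into an event-based birth record, here is a minimal Python sketch. It is not the parser used in this work: the function names (`parse_gedcom`, `birth_events`, `iso_date`) and the output dictionary layout are assumptions made for the example, and only the tag paths named in the grammar are handled.

```python
# Sketch of a Gedcom-to-birth-event conversion. Function names and the output
# layout are illustrative assumptions, not the parser from the presentation.

GEDCOM_SAMPLE = """\
0 @F294@ FAM
1 HUSB @I840@
1 WIFE @I787@
1 CHIL @I848@
1 CHIL @I849@
0 @I787@ INDI
1 NAME Agnes/Welderink/
0 @I849@ INDI
1 NAME Sierk/Rolsma/
1 BIRT
2 DATE 16 MAY 1883
"""

MONTHS = {name: number for number, name in enumerate(
    "JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC".split(), start=1)}


def parse_gedcom(text):
    """Index level-0 records by id; map tag paths like "INDI:BIRT:DATE" to value lists."""
    records, current, path = {}, None, []
    for line in text.splitlines():
        parts = line.strip().split(" ", 2)
        if len(parts) < 2:
            continue  # blank or malformed line
        level = int(parts[0])
        if level == 0:
            if len(parts) < 3:
                current = None  # level-0 line without an id, e.g. a header
                continue
            xref, tag = parts[1], parts[2]
            current = records.setdefault(xref, {"_type": tag})
            path = [tag]
        elif current is not None:
            value = parts[2] if len(parts) > 2 else ""
            path = path[:level] + [parts[1]]
            current.setdefault(":".join(path), []).append(value)
    return records


def first(record, key):
    """First value stored under a tag path, or None if absent."""
    values = record.get(key) if record else None
    return values[0] if values else None


def clean_name(name):
    """Turn "Sierk/Rolsma/" into "Sierk Rolsma"."""
    return " ".join(name.replace("/", " ").split()) if name else None


def iso_date(gedcom_date):
    """Convert "16 MAY 1883" to "1883-05-16"; return other values unchanged."""
    parts = (gedcom_date or "").split()
    if (len(parts) == 3 and parts[0].isdigit() and parts[2].isdigit()
            and parts[1].upper() in MONTHS):
        return f"{int(parts[2]):04d}-{MONTHS[parts[1].upper()]:02d}-{int(parts[0]):02d}"
    return gedcom_date


def birth_events(records):
    """Apply the birth rule: one event per [FAM:CHIL] whose INDI record exists."""
    events = []
    for fam in records.values():
        if fam["_type"] != "FAM":
            continue
        father = records.get(first(fam, "FAM:HUSB"))
        mother = records.get(first(fam, "FAM:WIFE"))
        for child_ref in fam.get("FAM:CHIL", []):
            child = records.get(child_ref)
            if child is None:
                continue  # child id referenced but not defined in this excerpt
            events.append({
                "child": clean_name(first(child, "INDI:NAME")),
                "father": clean_name(first(father, "INDI:NAME")),
                "mother": clean_name(first(mother, "INDI:NAME")),
                "bdate": iso_date(first(child, "INDI:BIRT:DATE")),
                "bplace": first(child, "INDI:BIRT:PLAC"),
            })
    return events


events = birth_events(parse_gedcom(GEDCOM_SAMPLE))
print(events)
```

In this excerpt the father (@I840@) and the first child (@I848@) have no INDI record, so the sketch yields a single birth event with an unknown father and birth place.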
Record similarity measure
  - The parser provides uniform data for matching two records, using similarity requirements for selected fields
  - Example: birth certificate similarity
      - Out of the four names of child and mother, at least two names are exactly equal
      - The year of birth is equal, or the difference in year of birth is within a small margin and the edit distance between the names is below some threshold
  - If multiple candidates for matching a record are found, the candidate with the smallest edit distance is selected
  - Note that the definition is domain specific
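The birth certificate similarity can be sketched in Python as follows. This is an illustration under stated assumptions rather than the implementation behind the reported results: the two requirements are read as a conjunction, the name distance is taken as the summed Levenshtein distance over the four name fields, and the values `year_margin=2` and `max_distance=3` are hypothetical, since the source only speaks of "a small margin" and "some threshold".

```python
# Sketch of the birth certificate similarity. Field names, the summed name
# distance, and the default margin/threshold values are assumptions.

NAME_FIELDS = ("child_first", "child_last", "mother_first", "mother_last")


def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def name_distance(x, y):
    """Summed edit distance over the four child and mother name fields."""
    return sum(edit_distance(x[f], y[f]) for f in NAME_FIELDS)


def is_birth_match(x, y, year_margin=2, max_distance=3):
    """Both requirements must hold (assumed reading of the definition)."""
    if sum(x[f] == y[f] for f in NAME_FIELDS) < 2:
        return False
    year_gap = abs(x["year"] - y["year"])
    if year_gap == 0:
        return True
    return year_gap <= year_margin and name_distance(x, y) < max_distance


def best_match(record, candidates, **params):
    """Among matching candidates, pick the one with the smallest name distance."""
    matches = [c for c in candidates if is_birth_match(record, c, **params)]
    return min(matches, key=lambda c: name_distance(record, c), default=None)


civil = {"child_first": "Sierk", "child_last": "Rolsma",
         "mother_first": "Agnes", "mother_last": "Weldring", "year": 1883}
family = {"child_first": "Sierk", "child_last": "Rolsma",
          "mother_first": "Agnes", "mother_last": "Welderink", "year": 1883}

print(is_birth_match(civil, family), name_distance(civil, family))
```

Keeping the margin and threshold as parameters reflects the note that the definition is domain specific: other certificate types or databases would need their own fields and values.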
Matching example
  Birth certificate similarity as defined above, applied to:
  Civil certificate:
    Date: 16-05-1883
    Child: Sierk Rolsma
    Mother: Agnes Weldring
  Family archive:
    Date: 16 MAY 1883
    Child: Sierk Rolsma
    Mother: Agnes Welderink
  Three out of four names are equal (Sierk, Rolsma, Agnes) and the year of birth is equal (1883), so the records match
Matching results
  - Birth matches (birth certificate similarity as defined above): 361/611 (59%)
      - Civil certificate database is still in the digitization phase
      - Family database contains many peripheral individuals for whom parent names and birth date are unknown
      - Similarity measure could be improved
  - Cf. results for marriage certificate matching: 154/176 (88%)
Verification
  - Ideal case: gold standard
      - Generally not available for historical databases
  - Large variation in domain and data quality
      - Performance of matching algorithms obtained on one database is not indicative of performance on other databases
      - Unlike, e.g., newspaper archives, e-mail archives, co-author networks, ...
  - Possible solution: internal verification
Internal verification
  - A similarity measure does not necessarily use all record fields for matching
  - Unused fields can provide a support level for a match
  - Example: the birth similarity measure uses person names and year of birth
      - Location, exact date of birth, and serial number can be used for verification
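A minimal Python sketch of this idea: for a pair of records that the similarity measure has already matched, compare the fields that were not used for matching and count how many agree. The field names, the "?" mark for a field missing from either record, and the use of a simple count as the support level are assumptions made for illustration.

```python
# Sketch of internal verification on fields unused by the similarity measure.
# Field names, the "?" mark and the counting scheme are assumptions.

VERIFICATION_FIELDS = ("serial", "location", "date")


def support_pattern(x, y, fields=VERIFICATION_FIELDS):
    """Mark each unused field: "+" agrees, "-" disagrees, "?" not comparable."""
    pattern = {}
    for field in fields:
        a, b = x.get(field), y.get(field)
        if a is None or b is None:
            pattern[field] = "?"
        else:
            pattern[field] = "+" if a == b else "-"
    return pattern


def support_level(pattern):
    """Number of verification fields that agree."""
    return sum(mark == "+" for mark in pattern.values())


# Hypothetical matched pair: the family archive entry lacks a serial number.
civil = {"serial": 176, "location": "Wonseradeel", "date": "1883-05-16"}
family = {"serial": None, "location": "Wonseradeel", "date": "1883-05-16"}

pattern = support_pattern(civil, family)
print(pattern, support_level(pattern))
```

Because the verification fields play no part in the matching decision, agreement on them gives evidence for a match that does not depend on a gold standard.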
Verification results
  marks (serial, location, date)   dist   birth   marriage
  + + +                                     177         69
  + - +                                      31          2
  + +                                        21         41
  +                                          33          0
  +                                           7          2
  +                                           3         10
                                              6          2
                                   3          4         20
                                   > 3       79          8
  total                                     361        154
Interpretation of support categories
  marks (serial, location, date)   dist   count   mean % unique   interpretation
  + + +                                     177           100     ok
  + - +                                      31           100     ok
  + +                                        21          99.1     ok
  +                                          33          98.7     ok
  +                                           3          98.1     ok
                                              6          94.4     likely ok
  +                                           7          90.0     manual check
                                   3          4          74.0     manual check
                                   > 3       79          74.0     incorrect
  total                                     361
Application: link comparison
  - First match records from databases X and Y, then identify complementary or conflicting links
  - Diagram: records X_1 and X_2 are connected by link L_a; records Y_1 and Y_2 are connected by link L_b. X_1 is matched against Y_1 and X_2 against Y_2, after which L_a and L_b are compared.
  - Application: compare links from the Gedcom family archive (given) to links between civil certificates (computed)
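The comparison step can be sketched in Python as below. This is an illustration, not the tool described here: links are assumed to be functional (each source record has at most one target, as with birth to death), the record identifiers are hypothetical, and the four output categories are names chosen for the example.

```python
# Sketch of link comparison between two databases after record matching.
# Identifiers, category names and the one-target-per-source assumption are
# illustrative only.

def compare_links(links_x, links_y, match):
    """Classify links given a record match from X identifiers to Y identifiers.

    links_x, links_y: dict mapping a source record to its linked target record.
    match: dict mapping matched X records to Y records.
    """
    report = {"confirmed": [], "conflicting": [], "only_x": [], "only_y": []}
    compared_y = set()
    for x1, x2 in links_x.items():
        y1 = match.get(x1)
        if y1 is None or y1 not in links_y:
            report["only_x"].append((x1, x2))  # complementary link from X
            continue
        compared_y.add(y1)
        y2 = links_y[y1]
        # X_1 = Y_1 holds here; the links agree only if X_2 = Y_2 as well.
        kind = "confirmed" if match.get(x2) == y2 else "conflicting"
        report[kind].append(((x1, x2), (y1, y2)))
    for y1, y2 in links_y.items():
        if y1 not in compared_y:
            report["only_y"].append((y1, y2))  # complementary link from Y
    return report


links_x = {"X1": "X2", "X3": "X4", "X5": "X6"}
links_y = {"Y1": "Y2", "Y3": "Y9", "Y7": "Y8"}
match = {"X1": "Y1", "X2": "Y2", "X3": "Y3", "X4": "Y4"}

report = compare_links(links_x, links_y, match)
print(report)
```

A conflicting pair only signals that at least one of the two links is wrong; deciding which one still requires inspection, which is where a visual exploration of the link tree helps.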
Visualization tool
  Link tree (marriage records):
    @F100@ 01-05-1824 Sikke Sasses van der Zee Aafke Klazes de Boer Afke de Boer
    @F171@ 13-05-1848 Sjoerd Riemerts Riemersma Johanna Sikkes van der Zee
    @F15@ 09-05-1857 Jan Johannes Altena Klaaske Sikkes van der Zee
    @F16@ 02-07-1892 Johannes Altena Elisabeth Vonk
    @F17@ 16-11-1889 Eke Foekema Aaltje Altena
    @F18@ 09-01-1896 Sikke Altena Cornelia Verkooyen Cornelia Verkooijen
    @F13@ 13-06-1896 Ruurd Altena Anna Jans Rolsma
    @F19@ ~1900 H Wesseling Agatha Altena
    9797998 08-05-1895 Hendrikus Wesseling Agatha Altena
    @F122@ ~1920 Sikkes? IJbeltje Altena
    @F123@ ~1925 Bartolomeus Mathias van Oerle Klaaske Altena
    @F124@ 18-05-1923 Sikke Altena Trijntje Homminga
  - A tool has been developed to explore the link tree
  - Red and blue: matched certificates have differences
Visualization tool (same link tree)
  - Only red or blue: marriage from the family archive without a match in the civil certificates, or vice versa
Visualization tool (same link tree)
  - Records F19 and 9797998 are a false negative match
Visualization tool (same link tree)
  - Records F122, F123, F124 are outside of the civil certificate timeframe
Summary
  - Combining information from different databases in the same domain
  - Syntactic and semantic parsing of records based on individuals into records based on events
  - Matching using domain-specific similarity measures
  - Match validation using additional record fields
  - Application: visualization of link comparison
Future work
  - Scale up to more and larger databases
      - Crowdsourcing is particularly suited to obtaining data
  - Refine the matching procedure
  - Public release of the visualization tool
Acknowledgment This work is part of the research programme LINKS, which is financed by the Netherlands Organisation for Scientific Research (NWO), grant 640.004.804. The authors would like to thank Tom Altena for the use of his Gedcom database.