German Record Linkage Center Microdata Computation Centre (MiCoCe) Workshop Nuremberg, 29 April 2014 Johanna Eberle FDZ of BA at IAB
Agenda: Basic information on German RLC; Services & Software; Projects (past/present); Access to linked IAB data; Conclusion
German Record Linkage Center: Basic information. Established: 2011. Directors: Prof. Dr. Rainer Schnell (University of Duisburg-Essen), Stefan Bender (Research Data Centre, Nuremberg). Current staff (N): Dr. Manfred Antoni, Dr. Christopher-Johannes Schild, Johanna Eberle. Funding: DFG (German Research Foundation), Scientific Library Services and Information Systems (LIS). Funding period: 2011-2014 (follow-up grant proposal submitted in January 2014).
Objectives: Sustained increase in the number and quality of Record Linkage applications in scientific research (in various fields); development of new (linked) data sources for research; performing service tasks (FDZ Nuremberg) and research on technical solutions (University of Duisburg-Essen).
Activities of the GRLC's two locations. FDZ Nuremberg (focus: service facility): project advisory center; conducting data linkage; recruitment of junior researchers. University of Duisburg-Essen (focus: research unit): development and evaluation of algorithms; development of linkage software; dissemination of current research results; recruitment of junior researchers.
Services of G-RLC: Individual advice during the planning and realization stages of data linkage projects; conducting data linkages as commissioned work; updating and maintaining the record linkage software MTB (Merge ToolBox); acting as a trustee for the linkage of sensitive datasets; organization of regular workshops and tutorials on Record Linkage, participation in sessions (JSM, ISI, ESRA, SHIP, IHDL), and presence at national and international conferences.
Webpage www.record-linkage.de: Basic information on record linkage (concepts, literature, current research); Record Linkage bibliography; overview of past and present projects and partners; publications of G-RLC staff and Working Paper Series; downloads: MTB, Safelink, TDGen (test data generator).
German RLC Working Paper Series. Current volumes 2014: Gramlich T 2014. STROKES Record Linkage der Schlaganfälle in Hessen 2007-2010. German RLC Working Paper No. wp-grlc-2014-03. Schild CJ, Antoni M 2014. Linking Survey Data with Administrative Social Security Data - the Project Interactions Between Capabilities in Work and Private Life. German RLC Working Paper No. wp-grlc-2014-02. Kroll M 2014. A Graph Theoretic Linkage Attack on Microdata in a Metric Space. German RLC Working Paper No. wp-grlc-2014-01. All working papers 2011-2014 are available for free download at www.record-linkage.de
Merge ToolBox (MTB): Collection of Java programs and one GUI; platform-independent (tested on Windows, MacOS, Unix); MTB can be downloaded for free (non-commercial use only) from www.record-linkage.de; MTB is widely used in Germany (e.g., evaluation of cancer registry systems).
Merge ToolBox (MTB) Features: Probabilistic Record Linkage with EM estimation; many different string similarity functions (e.g., Jaro, n-gram, Levenshtein); array matching; fuzzy blocking; privacy-preserving Record Linkage with Bloom filters (Safelink). References: Schnell, R., Bachteler, T. & Bender, S. (2004): A Toolbox for Record Linkage. In: Austrian Journal of Statistics, Vol. 33(1-2), pp. 125-133. Schnell, R., Bachteler, T. & Reiher, J. (2009): Privacy-preserving record linkage using Bloom filters. In: BMC Medical Informatics and Decision Making, Vol. 9, 41.
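To make the Bloom-filter idea behind privacy-preserving record linkage (Schnell, Bachteler & Reiher 2009) more concrete, here is a minimal Python sketch. It is not MTB or Safelink code. The filter length, the number of hash functions, the key, the choice of SHA-1 and SHA-256 as the two keyed hashes, and all function names are assumptions made purely for illustration. Each name is split into bigrams, every bigram sets several bit positions via keyed hashes, and two encodings are compared with the Dice coefficient.

```python
import hashlib
import hmac

FILTER_BITS = 1000              # assumed filter length (illustrative)
NUM_HASHES = 30                 # assumed number of hash functions (illustrative)
SECRET_KEY = b"shared-secret"   # hypothetical key agreed on by the data owners


def bigrams(name):
    """Return the set of 2-grams of the lower-cased, blank-padded name."""
    padded = " " + name.strip().lower() + " "
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def bloom_encode(name):
    """Return the set of bit positions that the name's bigrams switch on."""
    positions = set()
    for gram in bigrams(name):
        data = gram.encode("utf-8")
        # Two keyed hashes combined by double hashing: (h1 + i * h2) mod length
        h1 = int.from_bytes(hmac.new(SECRET_KEY, data, hashlib.sha1).digest(), "big")
        h2 = int.from_bytes(hmac.new(SECRET_KEY, data, hashlib.sha256).digest(), "big")
        for i in range(NUM_HASHES):
            positions.add((h1 + i * h2) % FILTER_BITS)
    return positions


def dice(a, b):
    """Dice coefficient of two sets of bit positions (1.0 = identical)."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


if __name__ == "__main__":
    print(dice(bloom_encode("Meyer"), bloom_encode("Meier")))    # similar names
    print(dice(bloom_encode("Meyer"), bloom_encode("Schmidt")))  # dissimilar names
```

In this sketch only the bit positions would be exchanged between data owners, and similar spellings still yield a higher Dice score than unrelated names.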
[Figure: Merge ToolBox (MTB) screenshot]
Linkage Projects. Current focus: Linkage of data on individuals or establishments with administrative data of the German Federal Employment Agency (BA) / Institute for Employment Research (IAB). Advancement of methods: Further development of preprocessing and data cleaning routines (currently: Stata, R; future: Perl); speed-up of preprocessing and linkage processes.
Linkage of the German SAVE study with administrative employment biographies (past project). Linkage of two data sets: (1) Wave 9 of the study SAVE (Saving and old-age provision in Germany), conducted by the Munich Center for the Economics of Aging (MEA), a survey on households' saving and asset choices with a special focus on old-age provision; (2) administrative Integrated Employment Biographies (IEB) data of the Institute for Employment Research. Purpose: Link survey data with administrative information about periods of employment and social security contributions; enhance information from the household survey with administrative data on the labour market biographies of respondents and (if applicable) their partners. Linkage performed by the G-RLC on behalf of MEA.
Linkage of Bureau van Dijk company data and IAB establishment data (current project). Linkage of Bureau van Dijk enterprise data (German financial company information and business intelligence) with administrative establishment data of the Institute for Employment Research. Task: Identification of establishments (IAB) within enterprises (BvD) using company name and legal form. Aims: New encompassing data product combining information on establishments and company background (company-level linked employer-employee data); opening up new research questions: company-level vs. establishment-level factors; relationship between labor and productivity / capital output.
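Matching on company name and legal form typically starts with standardizing both fields. The following Python sketch shows one possible way to split a legal form off the end of a company name and normalize the remainder into a comparison key. It is not the project's actual routine. The short list of legal forms, the transliteration rules, and the function names are assumptions for illustration.

```python
import re
from typing import Optional, Tuple

# Small illustrative list of German legal forms (assumption, far from complete)
LEGAL_FORMS = {
    "gmbh & co. kg": "GMBH_CO_KG",
    "gmbh": "GMBH",
    "ag": "AG",
    "kg": "KG",
    "ohg": "OHG",
    "e.k.": "EK",
}

# Transliteration of German special characters (assumed cleaning rule)
UMLAUTS = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"})


def split_legal_form(raw_name: str) -> Tuple[str, Optional[str]]:
    """Split a trailing legal form off the lower-cased company name."""
    name = " ".join(raw_name.lower().split())
    # Try longer legal forms first so "gmbh & co. kg" wins over "kg"
    for form in sorted(LEGAL_FORMS, key=len, reverse=True):
        if name.endswith(" " + form):
            return name[: -len(form)].strip(" ,"), LEGAL_FORMS[form]
    return name, None


def normalize_name(name: str) -> str:
    """Transliterate umlauts, drop punctuation, collapse whitespace."""
    name = name.translate(UMLAUTS)
    name = re.sub(r"[^a-z0-9 ]", " ", name)
    return " ".join(name.split())


def match_key(raw_name: str) -> Tuple[str, Optional[str]]:
    """Return (normalized name without legal form, legal form code)."""
    core, form = split_legal_form(raw_name)
    return normalize_name(core), form


if __name__ == "__main__":
    print(match_key("Müller  Maschinenbau GmbH"))
    print(match_key("Mueller Maschinenbau, GmbH"))
```

Both example spellings produce the same key, so they could be compared exactly or passed on to an approximate string comparison.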
Consulting for the project Record Linkage between IAB-SOEP Migration Sample and administrative data. Project head: P. Trübswetter (Institute for Employment Research). Draw a sample of households with migration background from Federal Employment Agency data; integrate the subsample into the GSOEP survey (German Socio-Economic Panel Study); link survey data to administrative data of the Institute for Employment Research. Advantages: Precise data on the employment history of participants over time; longitudinal analyses possible already after wave 1 of the survey.
Access to linked IAB data. Currently 2 modes to access data linked to sensitive IAB data: 1) On-site use at FDZ and remote data access; 2) Data transfer to partner institution. Both require judicial approval by the German Ministry of Social Affairs; mode 2 is more extensive. Mode 1 requires data sets to be transferred to the FDZ. So far there is no ad-hoc way of linking micro data (for data protection reasons), and data sets cannot be stored at separate locations. Linkage process: Name and address data are separated from other contents; anonymous linkage (PPRL) is possible and does not require identifiers to be transferred.
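The separation of name and address data from other contents can be pictured with a short Python sketch. It only illustrates the general idea and is not a description of the procedure actually used at the FDZ. The field names, the random linkage ID, and the function name are hypothetical.

```python
import secrets

# Assumed names of the identifying fields (hypothetical)
IDENTIFIER_FIELDS = ("name", "address")


def split_records(records):
    """Split each record into an identifier part and a content part.

    Both parts share a random linkage ID that carries no information
    about the person, so the linkage can work on identifiers only and
    the analysis can work on content only.
    """
    identifiers, contents = [], []
    for record in records:
        link_id = secrets.token_hex(8)
        ident = {"link_id": link_id}
        content = {"link_id": link_id}
        for field, value in record.items():
            if field in IDENTIFIER_FIELDS:
                ident[field] = value
            else:
                content[field] = value
        identifiers.append(ident)
        contents.append(content)
    return identifiers, contents


if __name__ == "__main__":
    survey = [{"name": "Erika Mustermann", "address": "Musterweg 1", "q1": 3}]
    ids, data = split_records(survey)
    print(sorted(ids[0]), sorted(data[0]))
```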
Conclusions: Foster the creation of linked data sets for scientific research. The German Record Linkage Center gathers knowledge and resources regarding the linkage of micro data: data cleaning procedures, approximate string matching algorithms, blocking strategies, choice of matching parameters; software and hardware requirements; legal provisions regarding linkage of micro data (e.g., informed consent questions). Privacy-preserving Record Linkage might reduce privacy concerns.
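How a blocking strategy, an approximate string comparison, and a matching parameter interact can be sketched in a few lines of Python. This is a toy example, not MTB's implementation. The record layout, the blocking key (first letter of the surname plus year of birth), and the distance threshold are assumptions chosen for illustration.

```python
from collections import defaultdict


def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def block_key(record):
    """Assumed blocking key: first letter of surname and year of birth."""
    return (record["surname"][:1].lower(), record["birth_year"])


def candidate_pairs(file_a, file_b):
    """Yield only those pairs of records that share a blocking key."""
    blocks = defaultdict(list)
    for rec in file_b:
        blocks[block_key(rec)].append(rec)
    for rec_a in file_a:
        for rec_b in blocks.get(block_key(rec_a), []):
            yield rec_a, rec_b


def link(file_a, file_b, max_distance=1):
    """Return ID pairs whose surnames differ by at most max_distance edits."""
    return [(a["id"], b["id"])
            for a, b in candidate_pairs(file_a, file_b)
            if levenshtein(a["surname"].lower(), b["surname"].lower()) <= max_distance]


if __name__ == "__main__":
    survey = [{"id": 1, "surname": "Meyer", "birth_year": 1970}]
    admin = [{"id": "a", "surname": "Meier", "birth_year": 1970},
             {"id": "b", "surname": "Meyer", "birth_year": 1980}]
    print(link(survey, admin))
```

Blocking keeps the number of comparisons manageable, while the threshold `max_distance` stands in for the matching parameters that have to be chosen for each project.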
Thank you for your attention! Visit the German Record Linkage Center online: www.record-linkage.de Or contact us by email: recordlinkage@iab.de www.iab.de
BACKUP
Technical equipment: Compute / file server embedded in a secure IT environment at the Institute for Employment Research (multi-core processor, huge RAM, large disk space). Software: Statistical software packages: Stata, R; Merge ToolBox. Routines: Data cleaning: scripts in Stata & R.