BIG DATA: National data linkage infrastructure. James Boyd
What defines Big Data? Data whose scale, diversity and complexity require new architecture, techniques, algorithms and analytics to manage it and to extract value and hidden knowledge from it.
Characteristics of Big Data. Volume: data volumes increasing exponentially over time. Variety (Complexity): various formats, types and structures. Velocity: fast processing of data to ensure it is representative.
Administrative Data Collections (Big-ish data?). Administrative data are usually collected by government for some administrative purpose, not primarily for research. Records of life events can be generated across a number of different government areas (Health, Education, Criminal Justice etc.). The databases are often population based, so important population subgroups are not missed.
Data Sharing. The ability to share the same data resource with multiple applications or users: data and information are used and reused to inform significant decisions; key elements are brought together across Government; the value of information gained from a single source is enhanced.
Data Sharing - Challenges. Governance arrangements for sharing: custodian requirements; control of data release; user agreements. Confidentiality, privacy and security: protecting confidentiality and privacy; ensuring data security throughout.
Data Linkage - Overview. The aim is to establish efficiently and accurately which records belong to the same individual. Personal identifying information makes data linkage possible: family name, first name(s), date of birth, postcode.
Matching Techniques. Exact matching can lead to inexact results: requiring an exact match on a number of fields (e.g. surname, first initial, date of birth, sex) can be expected to produce at least 10-15% errors because of discrepancies. Probabilistic matching is more accurate: it quantifies levels of agreement and disagreement, with 2% of true links missed.
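To illustrate why exact matching can miss true links, here is a minimal Python sketch. The `Person` class, the field names and the two example records are invented for illustration and are not taken from the talk. It compares a strict all-fields key with a field-by-field agreement pattern of the kind a probabilistic method would go on to weigh.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Person:
    # Hypothetical record layout; real collections will differ.
    surname: str
    first_name: str
    dob: str  # ISO date string, e.g. "1970-03-14"
    sex: str


def exact_key(p: Person) -> tuple:
    """Key used for exact matching: every component must be identical."""
    return (p.surname.upper(), p.first_name[:1].upper(), p.dob, p.sex)


def agreement_pattern(a: Person, b: Person) -> dict:
    """Field-by-field agreement, the raw input to a probabilistic method."""
    return {
        "surname": a.surname.upper() == b.surname.upper(),
        "first_initial": a.first_name[:1].upper() == b.first_name[:1].upper(),
        "dob": a.dob == b.dob,
        "sex": a.sex == b.sex,
    }


# Two invented records for the same person with a surname spelling variant.
hospital = Person("Macdonald", "Anne", "1970-03-14", "F")
registry = Person("McDonald", "Anne", "1970-03-14", "F")

exact_match = exact_key(hospital) == exact_key(registry)
pattern = agreement_pattern(hospital, registry)

print("exact match:", exact_match)
print("fields agreeing:", sum(pattern.values()), "of", len(pattern))
```

A single spelling discrepancy defeats the exact key even though three of the four fields agree. That partial agreement is the information probabilistic matching quantifies.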
How Does Linkage Work? (1) Bring together the pairs of records to be compared. (2) Quantify the relative probability that the two records belong to the same person. (3) Make the linkage decision.
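The three steps can be sketched in Python as follows. The slides do not name a specific model, so this sketch assumes one common approach: agreement and disagreement weights computed as log-likelihood ratios, in the style of Fellegi-Sunter. The m/u probabilities, the blocking on year of birth, the threshold and the sample records are all hypothetical values chosen for illustration. None of this describes the NLS implementation.

```python
import math
from itertools import combinations

# Hypothetical (m, u) per field:
#   m = P(field agrees | same person), u = P(field agrees | different people)
M_U = {
    "surname": (0.95, 0.01),
    "first_name": (0.90, 0.02),
    "dob": (0.97, 0.001),
    "postcode": (0.80, 0.05),
}


def pair_weight(a: dict, b: dict) -> float:
    """Step 2: sum of log2 likelihood ratios over the compared fields."""
    total = 0.0
    for field, (m, u) in M_U.items():
        if a[field] == b[field]:
            total += math.log2(m / u)              # agreement weight
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement weight
    return total


def link(records: list, threshold: float = 10.0) -> list:
    # Step 1: bring candidate pairs together (assumed blocking on birth year).
    blocks = {}
    for rec in records:
        blocks.setdefault(rec["dob"][:4], []).append(rec)

    links = []
    for block in blocks.values():
        for a, b in combinations(block, 2):
            weight = pair_weight(a, b)
            # Step 3: make the linkage decision against a threshold.
            if weight >= threshold:
                links.append((a["id"], b["id"], round(weight, 2)))
    return links


records = [
    {"id": 1, "surname": "Smith", "first_name": "John",
     "dob": "1980-05-02", "postcode": "6000"},
    {"id": 2, "surname": "Smith", "first_name": "John",
     "dob": "1980-05-02", "postcode": "6102"},  # same person, moved house
    {"id": 3, "surname": "Jones", "first_name": "Mary",
     "dob": "1980-11-23", "postcode": "6000"},
]

links = link(records)
print(links)
```

Records 1 and 2 disagree on postcode but still score well above the threshold. Record 3 shares a postcode with record 1 but scores below it, so only the first pair is linked.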
Development of linkage methodology Matching Challenges?
Population Health Research Network (PHRN): a collaborative network developing data linkage capability within and between Australian jurisdictions. 6 state/territory linkage units: 2 existing (WA, NSW/ACT) + 4 new (Qld, Vic, Tas & SA/NT). Program Office in Perth providing coordination and national client services. National linkage (Centre for Data Linkage at Curtin University and AIHW Commonwealth Data Integration). Secure Unified Research Environment (SURE) and secure data delivery (Sax Institute).
Centre for Data Linkage (CDL). Building national data linkage infrastructure. Facilitate linkages that span state/territory borders. Link these datasets with research datasets. Secure linkage of datasets. Research & development into data linkage methods.
Why develop a new linkage system? Address weaknesses & gaps in existing systems (complexity, scale, performance, functionality, administration). Provide an enterprise-grade platform that is reliable, easy to maintain and operate, with auditing capabilities. Automate functions that traditionally require manual intervention. Tackle emerging problems, e.g. privacy-preserving linkage.
What differentiates the NLS? Large data volume (linkage, data management, output, scalability). Manages multiple linkage & extraction projects. Manages new, amended and deleted records (open file handling). Handles diverse linkage & DC needs, e.g. enduring vs project linkage; researchers wishing to link their own data; DCs imposing restrictions on linkage. Secure & auditable.
Managing change over time. How to handle change over time (new, amends & deletes)? NLS handles new, amended & deleted records (open file handling). NLS handles deletion of data collections, data providers and linkage projects. NLS differentiates between end-dated & deleted records.
Any point in time referencing. NLS stores the full history of records and groups. Groups are dynamic entities. Linkage structure can be recreated for any record at any (previous) point in time.
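One way to picture point-in-time referencing, and the difference between end-dating and deleting, is a history table in which group memberships are closed off rather than overwritten. The Python sketch below is an assumption made for illustration: the `Membership` and `LinkageHistory` names, the integer timestamps and the storage layout are hypothetical and are not the NLS design.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Membership:
    record_id: str
    group_id: str
    valid_from: int                 # hypothetical integer timestamps
    valid_to: Optional[int] = None  # None means "still current"


class LinkageHistory:
    def __init__(self) -> None:
        self._rows = []

    def end_date(self, record_id: str, at: int) -> None:
        """Close the current membership but keep it in the history."""
        for row in self._rows:
            if row.record_id == record_id and row.valid_to is None:
                row.valid_to = at

    def assign(self, record_id: str, group_id: str, at: int) -> None:
        """New or amended record: end-date the old row, append a new one."""
        self.end_date(record_id, at)
        self._rows.append(Membership(record_id, group_id, at))

    def delete(self, record_id: str) -> None:
        """Deletion removes every trace, unlike end-dating."""
        self._rows = [r for r in self._rows if r.record_id != record_id]

    def group_of(self, record_id: str, at: int) -> Optional[str]:
        """Recreate the group a record belonged to at time `at`."""
        for row in self._rows:
            if (row.record_id == record_id and row.valid_from <= at
                    and (row.valid_to is None or at < row.valid_to)):
                return row.group_id
        return None


h = LinkageHistory()
h.assign("A", "g1", at=1)
h.assign("B", "g2", at=1)
h.assign("B", "g1", at=5)   # later evidence moves B into A's group
h.end_date("A", at=8)       # A is end-dated: history is retained
h.assign("C", "g3", at=2)
h.delete("C")               # C is deleted: history is gone

print(h.group_of("B", 3), h.group_of("B", 6))
print(h.group_of("A", 7), h.group_of("A", 9))
print(h.group_of("C", 3))
```

In this sketch an end-dated record can still be queried for any time before its end date. A deleted record returns nothing for any time.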
Graph of Matching Group
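The figure itself is not reproduced here. As a rough illustration of how a matching group can be viewed as a graph, the Python sketch below treats records as nodes and accepted pairwise links as edges, and forms groups as connected components using a small union-find. Treating groups as connected components is an assumption of this sketch, and the function name and sample data are hypothetical.

```python
def build_groups(record_ids: list, links: list) -> list:
    """Group records into connected components of the link graph."""
    parent = {r: r for r in record_ids}

    def find(x):
        # Follow parent pointers to the root, compressing the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two components

    groups = {}
    for r in record_ids:
        groups.setdefault(find(r), set()).add(r)
    # Sort for a stable, readable order.
    return sorted(groups.values(), key=lambda g: sorted(g))


groups = build_groups(
    ["a", "b", "c", "d", "e"],
    [("a", "b"), ("b", "c"), ("d", "e")],
)
print(groups)
```

Here records a and c end up in the same group through b even though they were never linked directly.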
PHRN: Proof of Concept Project. Hospital-related mortality. CDL created linkage keys using demographic data from WA, NSW, SA and QLD hospital morbidity and mortality data collections. Linkage of around 45,000,000 event records. Linkage processing completed within 10 days. Over 2 billion pair relationships.
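For a sense of scale, the short Python calculation below uses only the two figures quoted on the slide (about 45,000,000 records, completed within 10 days). It computes how many pairs an exhaustive all-against-all comparison would involve and the minimum average throughput implied by the 10-day bound. The slide does not say how its "pair relationships" were counted, so the results should not be read as a breakdown of that figure.

```python
records = 45_000_000   # approximate record count quoted on the slide
days = 10              # upper bound on processing time quoted on the slide

# Every record compared with every other record exactly once: n(n-1)/2.
all_pairs = records * (records - 1) // 2

# Lower bound on average throughput if the run took the full 10 days.
min_rate = records / (days * 24 * 60 * 60)

print(f"exhaustive pairs: {all_pairs:,}")
print(f"minimum average rate: {min_rate:.1f} records per second")
```

The exhaustive figure is on the order of 10^15 pairs, which is why bringing together only plausible candidate pairs matters at this volume.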
Contact Details: James Boyd, Centre for Data Linkage, Curtin University, j.boyd@curtin.edu.au