Technical Approaches for Protecting Privacy in the PCORnet Distributed Research Network V1.0




Guidance Document
Prepared by: PCORnet Data Privacy Task Force
Submitted to the PMO: March 31, 201
Approved by the PMO: April 2, 201
Submitted to PCORI: April 3, 201
Accepted by PCORI: June 4, 201

TABLE OF CONTENTS
EXECUTIVE SUMMARY
1.0 MINIMUM THRESHOLD
2.0 PERTURBATION OF QUERY RESULTS
3.0 OBFUSCATION OF IDENTIFIERS FOR RECORD LINKAGE
4.0 DE-IDENTIFICATION OF RECORD-LEVEL DATA
A. CAPRICORN APPROACHES
B. NEPHCURE PPRN'S APPROACHES TO DE-IDENTIFICATION
C. PEDSNET APPROACHES TO DE-IDENTIFICATION
TABLES AND FIGURES
REFERENCES

EXECUTIVE SUMMARY

PCORnet is a federated network, with PCORnet network partners retaining discretion and responsibility with respect to the collection, access, use, and disclosure of patient information; network partners also make determinations about when they will participate in any particular PCORnet query. The Data Privacy Task Force is working collectively with the CDRNs and PPRNs to develop a set of privacy policies to govern data sharing by PCORnet.

This guidance is intended to augment the PCORnet policies by providing examples of methods to reduce the risk of re-identification with respect to the generation, collection, maintenance, or return of Network Data. Terms used in this guidance are defined in the PCORnet policies. This guidance is intended to be modified over time as the PCORnet Distributed Research Network gains experience.

The guidance covers the following privacy protective techniques:
(Threshold) Minimum count thresholds for Aggregate Data;
(Perturb) Perturbation of PCORnet Data;
(Obfuscate) Obfuscation of identifiers for record linkage; and
(De-identify) De-identification of record-level research participant information.

1.0 MINIMUM THRESHOLD

One way in which personal information can be exploited for re-identification is triangulation on small groups of individuals. To mitigate such attacks, PCORnet policy currently states that Network Data Affiliates cannot release Network Data with cell counts of five or less, unless authorized by the research protocol and the IRB(s) approving the query (see PCORnet Policy 6.2.2). PCORnet policies permit network partners to apply their local rules for masking cell counts, or for rejecting queries where the return of results would not meet their thresholds for releasing Aggregate Data. Such local policies must be consistent with commitments made to patients/data subjects with respect to use of their information. Other examples of thresholds are shown in Table 1.

2.0 PERTURBATION OF QUERY RESULTS

Another way in which personal information can be exploited for re-identification is by issuing overlapping queries and subtracting one result from the other, which discloses the individuals outside the overlap. Consider an example of how this might be achieved. First, an Authorized User issues a query for how many juvenile diabetics were on drug A with an adverse outcome, and the answer is X, which, for this case, let us assume is 31. The User then issues a subsequent query asking how many juvenile diabetics were on drug A but not drug B with an adverse outcome, and the answer is now 30. At this point, the User learns that there is only 1 juvenile diabetic on both drug A and drug B with the adverse outcome, even though each individual answer is well above the minimum threshold.

There are a number of ways in which this type of attack could be prevented. In practice, systems tend to apply either 1) rounding (or coarsening) or 2) injection of a certain degree of noise to the query result. As noted in PCORnet policies, the PCORnet query should specify the approach to be used to de-identify data or reduce re-identification risks (see PCORnet Policy.2.1.1).
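The interaction between the minimum threshold rule and the overlapping-query attack can be illustrated with a minimal sketch. The cohort, the field names, and the function `mask_small_cell` are hypothetical, and the two query definitions (patients on drug A, and patients on drug A but not drug B) are an assumption chosen so that the counts 31 and 30 are consistent with one another. The sketch shows that both counts pass the threshold individually while their difference isolates a single patient.

```python
from typing import Optional

MIN_CELL_COUNT = 5  # counts of five or less are withheld, as in the policy text


def mask_small_cell(count: int, threshold: int = MIN_CELL_COUNT) -> Optional[int]:
    """Return the count, or None when it is at or below the threshold."""
    return None if count <= threshold else count


# Hypothetical cohort: 31 patients on drug A with the adverse outcome,
# exactly one of whom is also on drug B.
patients = [{"id": i, "drug_a": True, "drug_b": (i == 0)} for i in range(31)]

on_a = sum(1 for p in patients if p["drug_a"])
on_a_not_b = sum(1 for p in patients if p["drug_a"] and not p["drug_b"])

# Both answers exceed the threshold, so both are released unchanged.
released_1 = mask_small_cell(on_a)
released_2 = mask_small_cell(on_a_not_b)

# The requester subtracts the two released answers.
inferred = released_1 - released_2

print(released_1, released_2, inferred)
print(mask_small_cell(inferred))  # the inferred cell itself would have been withheld
```

A threshold applied to each answer in isolation therefore does not stop this inference, which is why the text turns to rounding and noise injection.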
If a rounding (or coarsening) approach is used, the result X could be rounded to the nearest multiple of 10. For instance, in the above scenario, the answers to both queries would be 30. However, it should be noted that the utility of the query answers is tied directly to the rounding value: the coarser the rounding, the less precise the answers. An initial rounding value of 10 is recommended.

An alternative to rounding is the injection of a certain amount of noise into the results. This is the strategy that query-response tools such as i2b2 [Murphy 2009] (specifically in SHRINE [Lowe 2009]) apply. In this scheme, a true result of 30 would be reported as 30 + ε, where ε is a random value selected from a known distribution. This distribution could be uniform, Gaussian, Laplacian, or something else. It should be noted that i2b2 applies a Gaussian distribution. If random noise is to be added, the approach needs to specify the standard deviation of the distribution from which the value is selected.
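Both perturbation options can be sketched in a few lines. This is an illustration under stated assumptions, not the i2b2 or SHRINE implementation: the function names are hypothetical, the standard deviation `sigma` is a placeholder that a real query specification would have to fix, and rounding the noisy value to a non-negative integer is a choice made here for readability.

```python
import random


def round_to_multiple(count: int, base: int = 10) -> int:
    """Round a non-negative count to the nearest multiple of base (halves round up)."""
    return base * ((count + base // 2) // base)


def add_gaussian_noise(count: int, sigma: float, rng: random.Random) -> int:
    """Report count + epsilon, with epsilon drawn from a Gaussian with mean 0.

    Assumption: the noisy value is rounded to an integer and clamped at zero.
    """
    noisy = count + rng.gauss(0.0, sigma)
    return max(0, round(noisy))


# With a rounding value of 10, the two answers from the example coincide.
print(round_to_multiple(31), round_to_multiple(30))

# Noise injection; sigma=3.0 is a hypothetical value, not a recommendation.
rng = random.Random(2024)
print(add_gaussian_noise(31, sigma=3.0, rng=rng))
print(add_gaussian_noise(30, sigma=3.0, rng=rng))
```

One design point worth noting: if fresh noise is drawn every time the same query is repeated, a requester can average the answers and recover the true count, so a deployment would also need a policy for repeated queries.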

3.0 OBFUSCATION OF IDENTIFIERS FOR RECORD LINKAGE

To mitigate bias in investigations, it is important to determine when a patient's data resides in multiple resources. This process, called record linkage, is non-trivial because a patient's record often contains typographical and semantic errors. Sophisticated record linkage strategies have been proposed to resolve these problems, but they rely on patient identifiers, such as personal name and Social Security Number. To overcome this barrier, a growing list of techniques has been proposed to support private record linkage (PRL). From a high level, the PRL process has a lifecycle that entails (but is not necessarily limited to) the following steps [Toth 2014]:

1. Generation and storage of keys for cryptosystems, or salt values for hash functions, invoked in a PRL protocol;
2. Communication of keys and salt to the entities encoding the records upon request;
3. Transformation of identifiers into their protected form as specified by the protocol;
4. Separation of salt hosting and de-duplication trusted entities for enhanced security;
5. Execution of the record linkage framework (e.g., feature weighting, blocking, and comparison of record pairs to predict which correspond to the same individual); and
6. Transfer of records and parameters related to the linkage protocol (i.e., all communication between parties).

Under no circumstances can the keys or salt values be disclosed to any entity beyond PCORnet network partners.

A number of network partners are exploring different approaches to private record linkage. Some network partners report using NIH's Global Unique Identifier (GUID) Tool (https://fitbir.nih.gov/jsp/contribute/guid-overview.jsp). The CAPriCORN Clinical Data Research Network has developed private record de-duplication software [insert link to JAMIA paper when it is available]. The Secure Open Master Patient Indexing System (SOEMPI), developed by researchers at Vanderbilt University and the University of Texas at Dallas, is another approach. Private companies also offer de-duplication software options.

Although it is too early to require that all PCORnet participants adopt a specific approach, evolving to the same approach would be beneficial, as it would allow centralized de-duplication to occur, rather than having network participants individually engage in these efforts. To apply such an approach, PCORnet would need to agree on:

1. Who is the third party (trusted party A) who generates the keys/salt values of the functions?
2. Who is the third party (trusted party B) who gets to perform the linkage?
3. Who gets to see the linkage results? In other words, do the member sites get to know when their constituents went to other sites?
4. What is the similarity threshold by which we could claim that two records correspond to the same individual?

There are no standards and no standard software available at this time. SOEMPI is one option, but it will require either PCORnet or some other organization to adopt the source code and support its operations. An alternative solution would be to piggyback on the software developed by the Chicago CDRN; the paper describing this system is under review at JAMIA and is provided separately. There are benefits and drawbacks to both systems in their design and linkage algorithms.
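The transformation and linkage steps of a PRL protocol can be sketched as follows. This is a minimal illustration, not the SOEMPI or CAPriCORN method: it assumes a keyed hash (HMAC-SHA256) as the protected form, a hypothetical shared key issued by trusted party A, and exact matching of tokens by trusted party B. All names and sample records are invented for the example.

```python
import hashlib
import hmac
import unicodedata
from typing import Iterable


def normalize(value: str) -> str:
    """Strip accents, punctuation, and case so trivial variants encode identically."""
    decomposed = unicodedata.normalize("NFKD", value)
    return "".join(ch for ch in decomposed if ch.isalnum()).upper()


def protect(fields: Iterable[str], key: bytes) -> str:
    """Transform identifiers into a keyed-hash token (the 'protected form')."""
    message = "|".join(normalize(f) for f in fields).encode("utf-8")
    return hmac.new(key, message, hashlib.sha256).hexdigest()


# Hypothetical key generated by trusted party A and sent only to the encoding sites.
SHARED_KEY = b"hypothetical-secret-from-trusted-party-A"

site_1 = [("José", "Smith", "1990-01-02"), ("Ann", "Lee", "1985-07-30")]
site_2 = [("Jose", "SMITH", "1990-01-02"), ("Bob", "Ray", "1970-03-15")]

# Each site encodes locally; only tokens leave the site.
tokens_1 = {protect(record, SHARED_KEY) for record in site_1}
tokens_2 = {protect(record, SHARED_KEY) for record in site_2}

# Trusted party B never sees the key or the identifiers, only the tokens.
matches = tokens_1 & tokens_2
print(len(matches))
```

Exact matching on hashed tokens tolerates only the variants removed by `normalize`; a single typographical error yields an unrelated token. Supporting a similarity threshold, as raised in question 4 above, requires an encoding that preserves approximate similarity, for example Bloom-filter encodings of name substrings.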

4.0 DE-IDENTIFICATION OF RECORD-LEVEL DATA

A predominant model for research using the PCORnet Distributed Research Network is one where the individual, record-level or patient-level data remains under the control of the network partner (or Network Data Affiliate); the research query is run on the Network Data, and only Aggregate Data is returned in response. This privacy-preserving architecture reduces the need to adopt de-identification strategies for data shared in response to a query [Mini Sentinel 2012]. However, PCORnet policies recognize that at times, responses to queries may require the sharing of record- or patient-level de-identified data. In addition, network partners (particularly those consisting of disparate organizations) may choose as a matter of local policy to create de-identified datasets for research purposes.

There are a number of ways by which de-identification can be achieved. Follow this link for the latest guidance from the HHS Office for Civil Rights on HIPAA de-identification: http://www.hhs.gov/ocr/privacy/hipaa/understanding/coveredentities/de-identification/guidance.html

In circumstances where the query requires the return of de-identified data, PCORnet policies require the query to specify the definition and approach or procedures required to de-identify data. In addition, some network partners may be required to abide by NIH's recently released Genomic Data Sharing Policy, which includes specifications on the de-identification approach to be used: http://gds.nih.gov/03policy2.html. For initial queries requiring the return of de-identified data, the PCORnet Coordinating Center (CC), with input from network partners participating in the queries, may need to set the approach to be used; however, PCORnet should develop a robust set of policies and best practices that over time may reduce or eliminate the need for CC control.
These approaches focus on reducing risk of re-identification using demographic identifiers; future iterations of the guidance may need to deal with risk of re-identification from exposure of clinical data. PCORnet network partners are invited to share their approaches to de-identification of record-level data, in order to share resources and begin to develop a library of best practices. The following record-level de-identification approaches have been shared and are also available on the PCORnet Central Desktop:

A. CAPRICORN APPROACHES

CAPriCORN proposes initially to validate and use limited data sets with randomly seeded, time-shifted temporal references and geographical references restricted to the first three digits of ZIP codes. Expert statistical determination will be sought for the method of time-stamping events to confirm that it also meets the Safe Harbor de-identification criteria of the HIPAA Privacy Rule. Until such determination has been achieved, the data sets will be considered limited, rather than de-identified, datasets. In the event that this proves infeasible, CAPriCORN will adhere to Safe Harbor until the situation has evolved and use of date shifting is accepted.

A separate important piece of information useful for epidemiologic investigations is geographic location. We may need to incorporate these data through IRB approval of limited data sets rather than addresses that can be geocoded. ZIP code level data will need to be considered when applying our minimum threshold and perturbation of query rules.

B. NEPHCURE PPRN'S APPROACHES TO DE-IDENTIFICATION

1. Encrypted hash (SHA1) on a sequential ID number assigned as the surveys come in.
2. Randomizing birth dates within six months, with a new random birth date generated for each query.
3. The Common Data Model has been constructed as views in a separate schema, so no queries can get to the underlying data.

C. PEDSNET APPROACHES TO DE-IDENTIFICATION

1. Institution replaces PHI with a site encrypted identifier, and maintains the link between the two.
2. DCC replaces the site encrypted identifier with a PEDSnet encrypted identifier (PEI) to ensure uniqueness across sites.
3. All datasets stored or sent out of the DCC use the PEI.

What this means in the study context is that the investigator gets a set of PEIs in response to a case-finding query. If they want to re-identify patients, they tell the DCC, which translates each PEI back to a site and site encrypted identifier, and sends that back to the site of origin. That site is then able to link to PHI and re-contact the patient or provide additional data (e.g., chart review). We're planning to cycle a test of this process in December, if the DUAs get sorted by then.
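Two of the approaches above, date shifting with three-digit ZIP codes and the two-tier identifier scheme, can be sketched as follows. This is an illustration under stated assumptions rather than any network partner's implementation: the per-patient offset derived from a secret key, the 365-day window, the random tokens with lookup tables standing in for "encrypted identifiers", and all names and sample values are hypothetical.

```python
import datetime as dt
import hashlib
import hmac
import secrets


def patient_offset_days(patient_id: str, key: bytes, max_days: int = 365) -> int:
    """Derive a stable per-patient shift in [-max_days, max_days] from a secret key."""
    digest = hmac.new(key, patient_id.encode("utf-8"), hashlib.sha256).digest()
    return int.from_bytes(digest[:4], "big") % (2 * max_days + 1) - max_days


def shift_date(date: dt.date, offset_days: int) -> dt.date:
    """Shift a date; one offset per patient preserves intervals between events."""
    return date + dt.timedelta(days=offset_days)


def zip3(zip_code: str) -> str:
    """Keep only the first three digits of a ZIP code.

    Note: HIPAA Safe Harbor additionally requires that the three-digit area
    contain more than 20,000 people; that check is not implemented here.
    """
    return zip_code.strip()[:3]


class Pseudonymizer:
    """Maps identifiers to random tokens and keeps the reverse link privately."""

    def __init__(self) -> None:
        self._forward = {}
        self._reverse = {}

    def encode(self, identifier):
        if identifier not in self._forward:
            token = secrets.token_hex(8)
            self._forward[identifier] = token
            self._reverse[token] = identifier
        return self._forward[identifier]

    def decode(self, token):
        return self._reverse[token]


# Date shifting and ZIP truncation for one hypothetical patient.
offset = patient_offset_days("MRN-001", key=b"hypothetical-site-secret")
admit, discharge = dt.date(2014, 3, 1), dt.date(2014, 3, 9)
print(shift_date(discharge, offset) - shift_date(admit, offset))  # interval is kept
print(zip3("60614"))

# Two-tier identifiers: the site holds MRN <-> site token, the DCC holds
# (site, site token) <-> PEI.
site_a = Pseudonymizer()
dcc = Pseudonymizer()
site_token = site_a.encode("MRN-001")
pei = dcc.encode(("site_a", site_token))

# Re-identification path: investigator -> DCC -> site of origin.
origin_site, returned_token = dcc.decode(pei)
print(origin_site, site_a.decode(returned_token))
```

In this arrangement neither party can re-identify a patient alone: the DCC never sees the PHI, and the investigator sees only the PEI.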

TABLES AND FIGURES

Table 1. Examples of thresholds applied in the minimum threshold rule

AGENCY
Washington State Department of Health [WA 2012]
Centers for Disease Control Healthy People 2010 [Klein 2002]
Arkansas HIV/AIDS Data Release Policy [AR 2012]
Colorado State Department of Public Health and Environment [CO 2012]
National Center for Health Statistics [NCHS 2004]
UK Department of Enterprise, Trade, and Investment [DETI 2012]
Utah State Department of Health [UT 200]
Iowa Department of Public Health [IA 200]
NASA [SEDAC 200]

MINIMUM THRESHOLD
10-10 4 3

REFERENCES

[AR 2010] Arkansas HIV/AIDS Surveillance Section. Arkansas HIV/AIDS Data Release Policy. Available online: http://www.healthy.arkansas.gov/programsservices/healthstatistics/documents/stdsurveillance/datadeissemination.pdf. First published: May 2010. Last accessed: April 29, 2014.

[CO 2010] Colorado State Department of Public Health and Environment. Guidelines for working with small numbers. Available online: http://www.cdphe.state.co.us/cohid/smnumguidelines.html. Last accessed: April 29, 2014.

[DETI 2010] U.K. Department of Enterprise, Trade, and Investment. DETI Data Confidentiality Statement. Available online: http://www.detini.gov.uk/deti-stats-index/stats-national-statistics/data-security.htm. Last accessed: April 29, 2014.

[Klein 2002] R. KLEIN, S. PROCTOR, M. BOUDREAULT, K. TURCZYN. Healthy People 2010 criteria for data suppression. Centers for Disease Control Statistical Notes Number 24. 2002.

[Mini Sentinel 2012] J. RASSEN, et al. Mini Sentinel Methods: Evaluating Strategies for Data Sharing and Analyses in Distributed Data Settings. November 2012. Available online: http://www.mini-sentinel.org/work_products/statistical_methods/mini-Sentinel_Methods_Evaluating-Strategies-for-Data-Sharing-and-Analyses.pdf.

[Murphy 2009] S. MURPHY, et al. Strategies for maintaining patient privacy in i2b2. Journal of the American Medical Informatics Association. 2011; 18: 103-108.

[SEDAC] Socioeconomic Data and Applications Center. Confidentiality issues and policies related to the utilization and dissemination of geospatial data for public health application; a report to the Public Health Applications of Earth Science Program, National Aeronautics and Space Administration, Science Mission Directorate, Applied Sciences Program. 200. Available online: http://www.ciesin.org/pdf/sedac_confidentialityreport.pdf. Last accessed: April 29, 2014.

[Toth 2014] C. TOTH, et al. SOEMPI: A Secure Open Master Patient Index Software Toolkit for private record linkage. Proceedings of the 2014 American Medical Informatics Association Annual Symposium. 2014: in press.

[UT 200] Utah State Department of Health. Data release policy for Utah's IBIS-PH web-based query system, Utah Department of Health. Available online: http://health.utah.gov/opha/ibishelp/datareleasepolicy.pdf. First published: 200. Last accessed: April 29, 2014.

[WA 2012] Washington State Department of Health. Guidelines for working with small numbers. Available online: http://www.doh.wa.gov/portals/1/documents/00/smallnumbers.pdf. First published 2001, last updated October 1, 2012. Last accessed: April 29, 2014.