1R01HG0007078: Privacy-Preserving Sharing and Analysis of Human Genomic Data
XiaoFeng Wang and Haixu Tang, IUB
Project Objectives
- Study scalable, privacy-preserving data analysis techniques, particularly those for public clouds
- Study privacy-preserving data dissemination techniques for data sharing through NIH data centers
Project Progress Summary
- Secure Data Analysis
  - Improving our privacy-preserving read-mapping techniques
  - Exploring use of the technique for microbial filtering
- Secure Data Sharing
  - Development of privacy-preserving data selection techniques
  - Identification of the most promising way for supporting privacy-preserving GWAS studies
  - Organization of the first Critical Assessment of Data Privacy and Protection (CADPP)
- Others
  - Evaluating privacy risks in releasing clinical proteomic data
  - Survey study on human genome privacy and building of a portal to support follow-up research
Hybrid-cloud Secure Data Analysis
- Partition the computation between the private and public clouds to support secure computation outsourcing
- [Diagram labels: Private Cloud, Public Cloud]
- For microbial filtering, we found that the computation can be performed almost entirely in the public cloud on encrypted data.
Data Analysis and Dissemination [diagram; recoverable labels: Services, Programs, Results]
Technical Highlights
- LD-based noise-adding technique
  - Dimension reduction
  - Noise adding that preserves utility in pilot data
- Utility assessment mechanism
  - Identifies the data source most likely to have usable data
- Evaluation
  - Used 4 popular association tests
  - Released a locus with 180 SNPs
  - In the vast majority of cases, the user can find the most useful dataset without leaking any information
- Product
  - A paper is under consideration at JAMIA (revision submitted).
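As a rough illustration of the noise-adding and utility-assessment ideas, the sketch below perturbs per-SNP minor-allele counts with Laplace noise and measures how many of the top-k SNPs survive the perturbation. It is a generic stand-in, not the LD-based technique from the slides: the function names, the `scale` parameter, and the top-k overlap metric are all assumptions made for illustration, and no formal privacy guarantee is claimed for any particular `scale`.

```python
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise; scale must be positive.

    The difference of two i.i.d. exponentials with mean `scale`
    follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def perturb_counts(counts, total_alleles, scale, rng):
    """Add Laplace noise to each minor-allele count, clipped to [0, total_alleles]."""
    noisy = []
    for count in counts:
        value = count + laplace_noise(scale, rng)
        noisy.append(min(max(value, 0.0), float(total_alleles)))
    return noisy


def frequency_gap(case_counts, ctrl_counts, total_alleles):
    """Per-SNP absolute difference in allele frequency (a simple hypothetical score)."""
    return [abs(a - b) / total_alleles for a, b in zip(case_counts, ctrl_counts)]


def top_k(scores, k):
    """Indices of the k highest-scoring SNPs."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(order[:k])


def top_k_overlap(true_scores, noisy_scores, k):
    """Fraction of the true top-k SNPs that remain in the noisy top-k."""
    return len(top_k(true_scores, k) & top_k(noisy_scores, k)) / k


# Synthetic counts for five SNPs, 400 alleles per group (made-up numbers).
rng = random.Random(0)
case = [120, 80, 60, 150, 90]
ctrl = [100, 82, 90, 149, 60]
true_scores = frequency_gap(case, ctrl, 400)
noisy_scores = frequency_gap(
    perturb_counts(case, 400, 5.0, rng), perturb_counts(ctrl, 400, 5.0, rng), 400
)
print("top-2 overlap after noise:", top_k_overlap(true_scores, noisy_scores, 2))
```

A data holder could run such a check on pilot data to judge whether a perturbed release is still likely to be useful before any data is shared.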
The CADPP Competition
- Evaluate how effective the best security technologies could be in protecting patient privacy and preserving data utility
- The first challenge focuses on the tasks for sharing aggregate SNP data (allele frequencies) for GWAS studies
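To make "aggregate SNP data" concrete, here is a minimal sketch of how allele frequencies and a standard allelic chi-square association statistic can be computed from per-group allele counts. The function names and the example counts are illustrative assumptions; the slides do not specify which statistics the competition used.

```python
def allele_frequency(minor_count, total_alleles):
    """Minor-allele frequency for one SNP in one group."""
    return minor_count / total_alleles


def allelic_chi_square(case_minor, case_total, ctrl_minor, ctrl_total):
    """Pearson chi-square statistic (1 degree of freedom) for the 2x2 table
    of minor/major allele counts in cases versus controls."""
    a, b = case_minor, case_total - case_minor
    c, d = ctrl_minor, ctrl_total - ctrl_minor
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # degenerate table: no variation to test
    return n * (a * d - b * c) ** 2 / denom


# Made-up example: 60 of 200 case alleles and 40 of 200 control alleles are minor.
print(allele_frequency(60, 200), allele_frequency(40, 200))
print(allelic_chi_square(60, 200, 40, 200))
```

Because such statistics depend only on aggregate counts, the aggregate data is sufficient for this style of association testing, which is why its release is the natural object of a privacy assessment.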
Research Tasks
- Development of techniques for analyzing the level of information leakage from computation results
- Preliminary exploration of approaches to secure data dissemination
Teams and Tasks
- 6 teams: U. Oklahoma, UT Dallas, McGill University, CMU, UT Austin, IU (Baseline)
- Scenarios: Privacy Protection for GWAS
  - Task 1: raw data sharing
  - Task 2: outcome release
What Has Been Learnt
- Task 1
  - Privacy-preserving sharing of aggregate human genomic data, while maintaining its utility in genome-wide association studies (GWAS), remains a challenge.
  - Even for a single genomic locus involving a few hundred SNPs, the utility of the data was largely damaged by the noise added to ensure privacy protection.
  - It is unlikely that current privacy-preserving techniques will scale well to sharing whole human genomic data.
- Task 2
  - Privacy-preserving techniques work surprisingly well for publishing outcomes of GWAS-like analyses.
  - High accuracy can be achieved when, from the user's perspective, only a small number of the most significant SNPs are of concern.
  - This task is well aligned with the centralized data/computing model: the center hosts human genomic data as well as services for customized analyses on these data, and releases only the results of these analyses to users.
  - We encourage the community to improve the approaches to this task!
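The centralized model described for Task 2 can be sketched as a small service that keeps allele counts private and releases only the indices of the most significant SNPs, ranked by a noise-perturbed test statistic. Everything here is a hypothetical illustration: the class name `GenomicDataCenter`, the choice of an allelic chi-square statistic, and the Laplace `noise_scale` parameter are assumptions, not the methods used by the competing teams.

```python
import random


class GenomicDataCenter:
    """Hypothetical data center: holds per-SNP allele counts privately and
    releases only the indices of the top-k SNPs by noisy test statistic."""

    def __init__(self, case_minor, ctrl_minor, case_total, ctrl_total, seed=None):
        self._case = list(case_minor)
        self._ctrl = list(ctrl_minor)
        self._case_total = case_total
        self._ctrl_total = ctrl_total
        self._rng = random.Random(seed)

    def _chi_square(self, a, c):
        # Allelic chi-square statistic for one SNP (2x2 allele-count table).
        b = self._case_total - a
        d = self._ctrl_total - c
        n = a + b + c + d
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom

    def _laplace(self, scale):
        # Difference of two i.i.d. exponentials with mean `scale` is Laplace(0, scale).
        return self._rng.expovariate(1.0 / scale) - self._rng.expovariate(1.0 / scale)

    def release_top_snps(self, k, noise_scale):
        """Return indices of the k SNPs with the largest noisy statistics.
        Raw counts and exact statistics never leave the center."""
        noisy = [
            self._chi_square(a, c) + self._laplace(noise_scale)
            for a, c in zip(self._case, self._ctrl)
        ]
        ranked = sorted(range(len(noisy)), key=lambda i: noisy[i], reverse=True)
        return ranked[:k]


# Made-up counts for four SNPs, 200 alleles per group.
center = GenomicDataCenter([60, 50, 100, 52], [40, 50, 40, 50], 200, 200, seed=7)
print(center.release_top_snps(k=2, noise_scale=1.0))
```

When the gap between the strongest signals and the rest is large relative to the noise, the released ranking is stable, which is consistent with the observation that accuracy is high when only a few top SNPs matter.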
Other Outcomes
- BMC special issue on Human Genome Privacy
- Development of a web service for automatic evaluation of privacy-preserving GWAS techniques (https://humangenomeprivacy.ucsd-dbmi.org/)
- Competition results are made public: http://www.humangenomeprivacy.org/
Risks in releasing clinical proteomic data
- Identifiable information is present in clinical proteomic data
- The risks could be mitigated by pre-processing that removes such data
- Further study is needed to understand the privacy/utility balance
Dissemination of Research Outcomes
- Organization of the 1st Workshop on Genome Privacy (GenoPri): July 15, Amsterdam
- Review paper on Privacy Risk and Mitigation Techniques on Genomic Research: http://arxiv.org/abs/1405.1891
Next Steps
- Privacy-preserving data analysis
  - Complete the development of the read-mapping system
  - Further study on microbial filtering
  - Research on other crypto techniques for genomic data analysis on the public cloud
- Privacy-preserving data sharing
  - Improve the scalability of data-selection techniques
  - Analyze risks of information leaks at the data center
- Organization of the 2nd CADPP competition with iDASH
  - A tentative topic is to evaluate the performance of techniques for data analysis on encrypted data (e.g., homomorphic encryption, secure multi-party computation)
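For readers unfamiliar with analysis on encrypted data, the sketch below shows additively homomorphic encryption using a toy Paillier scheme: two sites encrypt their minor-allele counts, an untrusted party multiplies the ciphertexts, and the key holder decrypts the sum. The primes are tiny and the code is for illustration only; it is insecure as written, it is not the project's system, and the function names are assumptions. It requires Python 3.8+ for the modular inverse via `pow`.

```python
import math
import random


def keygen(p=1789, q=1861):
    """Toy Paillier key generation with small fixed primes (insecure)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    g = n + 1
    # With g = n + 1, L(g^lam mod n^2) = lam mod n, so mu is lam^-1 mod n.
    mu = pow(lam, -1, n)
    return (n, g), (lam, mu)


def encrypt(public_key, message, rng):
    """Encrypt an integer 0 <= message < n."""
    n, g = public_key
    n_sq = n * n
    while True:
        r = rng.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(g, message, n_sq) * pow(r, n, n_sq)) % n_sq


def decrypt(public_key, private_key, ciphertext):
    """Recover the plaintext using L(u) = (u - 1) // n."""
    n, _ = public_key
    lam, mu = private_key
    u = pow(ciphertext, lam, n * n)
    return ((u - 1) // n * mu) % n


def add_encrypted(public_key, c1, c2):
    """Multiplying ciphertexts adds the underlying plaintexts (mod n)."""
    n, _ = public_key
    return (c1 * c2) % (n * n)


# Two sites contribute made-up minor-allele counts for one SNP.
rng = random.Random(42)
public_key, private_key = keygen()
c_site_a = encrypt(public_key, 120, rng)
c_site_b = encrypt(public_key, 85, rng)
c_total = add_encrypted(public_key, c_site_a, c_site_b)  # computed without decryption
print(decrypt(public_key, private_key, c_total))  # prints 205
```

Additive schemes like this support sums and scalar multiples only; evaluating richer statistics on ciphertexts generally calls for fully homomorphic encryption or secure multi-party computation, which is the kind of trade-off the proposed competition topic would measure.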