Balancing Big Data for Security, Collaboration and Performance Sai Balu Lineberger Cancer Center UNC Chapel Hill Oct 14, 2014
About UNC Oldest Public University - 1793 Top 5 Public University 46th Worldwide Clinical Translational Science Award NCTraCS Institute Carolina Data Warehouse - Hospital/Research School of Medicine - 6th in NIH funding
About Lineberger NCI Designated Comprehensive Cancer Center Largest Research Entity at UNC - $190 million/year in external grants 300 Scientists, 1200 Staff across UNC Campus 250 Clinical Trials offered NC Cancer Hospital: Clinical Home University Cancer Research Fund - $25 million in 2007 and $42 million/year in 2014
About UNC Hospitals Not-for-Profit Integrated Health Care Teaching Mission State of the Art Patient Care EMR and Cancer Registry WebCIS Epic
About RENCI A Leader in Cyber Technologies Scientific Discoveries & Business Innovations Medicine & Genomics Environmental Sciences Data Management Technologies: iRODS
Bioinformatics Core at Lineberger Infrastructure for Data Management and Data Analysis Integrated Data Analysis - Genomic & Clinical & public annotations Supporting Instruments
Big Data Velocity The rate of data generation, rate of change Volume The size of data Variety The underrepresented V, but not today!
TCGA The Cancer Genome Atlas Project Study Molecular Basis for Cancer 20+ tumor types studied Expression, Copy Number, DNA/RNA, miRNA UNC is Gene Expression Center Dr. Chuck Perou 10K samples processed
TCGA Analysis Tumor Working Groups & Data Freezes Exposure to Variety Types of data, Security, Sources, Performance, Sharing, Analysis.
UNC Cancer Survivorship Goal: Enroll 10K Patients! Collect Biospecimens and medical records, and follow up with questionnaires
UNCseq Genetic Profiling Cancer Patient Specimens Support Treatment Decisions Target ~200 genes of potential clinical utility All known druggable targets Genes of interest confirmed by experts
Big Data - Variety! 1. Clinic Schedules 2. ICD codes 3. Consent Status 4. Tissue Banking and Annotations 5. Questionnaires - 2 different languages 6. EMR - Pathology as an example 7. Public data - Clinical Trials, Oncotator, Death Indexes 8. Ancillary Studies 9. Workflows 10. Metadata 11. Analysis - exome, survival, spatial 12. Instruments - robots, sequencing - Sequenom, SNP arrays
Big Data - Variety! Variety of Sources Epic, SAS-Health Outcome Analytics, Death Indexes Variety of Security Public Data to CLIA to FISMA compliance Variety of Standards +1 standards
SAS - HOA Private partnership to create Cancer Data Mart Patient Counts - 155,078 Pathology Report Types - 33 Pathology Report Datapoints - 21,347,023 Lab Tests - 387,495 Lab Test Observations - 34,168,986
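To put these counts in proportion, rough per-patient averages can be derived from the figures on this slide. This is only a sketch: it assumes every count refers to the same patient population, which the slide does not state, and the variable and function names are illustrative.

```python
# Counts copied from the SAS - HOA slide.
patients = 155_078
pathology_datapoints = 21_347_023
lab_observations = 34_168_986


def per_patient(total: int, n_patients: int = patients) -> float:
    """Average number of records per patient (assumes one shared cohort)."""
    return total / n_patients


pathology_avg = per_patient(pathology_datapoints)
lab_avg = per_patient(lab_observations)

print(f"Pathology datapoints per patient: {pathology_avg:.1f}")
print(f"Lab observations per patient: {lab_avg:.1f}")
```

Under that assumption, each patient carries on the order of a few hundred structured datapoints, which illustrates why variety, rather than raw volume, dominates the integration effort.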
Security, Collaboration and Performance Balancing is an art Institutional Policies Develop Trust Develop standard verification processes Develop Training materials
Security HIPAA Sensitive Data FISMA Moderate Claims Data Secure Medical Workspaces Secure Cluster Computing
Performance Sustained 15 Gbps over the network for many hours - largest network traffic seen within UNC campus Transferring to Data Coordination Centers - BitTorrent-style software
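For a sense of scale, the sustained rate can be converted into data volume moved per hour. The sketch below assumes the quoted figure means 15 gigabits per second with decimal prefixes (10^9 bits); the function name is illustrative and not part of any transfer tool.

```python
def terabytes_transferred(rate_gbps: float, hours: float) -> float:
    """Convert a sustained rate in gigabits/second into terabytes moved.

    Assumes decimal units: 1 Gb = 1e9 bits, 1 TB = 1e12 bytes.
    """
    bits = rate_gbps * 1e9 * hours * 3600  # total bits over the interval
    return bits / 8 / 1e12                 # bits -> bytes -> terabytes


one_hour = terabytes_transferred(15, 1)
print(f"{one_hour:.2f} TB per hour")
```

Under these assumptions, one hour at the quoted rate moves several terabytes, so "many hours" implies tens of terabytes per transfer session.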
Collaboration Through Data Sharing Without Duplication with different ACLs Bring Compute to Data iRODS - A Possible Solution
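The idea of sharing one copy of the data under different ACLs can be sketched as a single stored object with per-group permissions. This is a toy model for illustration only: it is not the iRODS API, and the class, group names, and path are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class SharedDataset:
    """One stored copy of a dataset plus per-group permissions (toy model)."""

    path: str
    acl: dict[str, set[str]] = field(default_factory=dict)

    def grant(self, group: str, *permissions: str) -> None:
        """Add permissions for a group without copying the data."""
        self.acl.setdefault(group, set()).update(permissions)

    def can(self, group: str, permission: str) -> bool:
        """Return True if the group holds the given permission."""
        return permission in self.acl.get(group, set())


# Hypothetical path and groups: the file exists once, access differs by group.
dataset = SharedDataset(path="/example/project/sample.bam")
dataset.grant("analysts", "read")
dataset.grant("data_stewards", "read", "write")

print(dataset.can("analysts", "write"))       # analysts were granted read only
print(dataset.can("data_stewards", "write"))  # stewards were granted write
```

Because access is decided by the ACL attached to the single object, adding a new collaborator means adding an entry rather than creating another copy of the data.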
Data Governance Identify Stewards Identify Custodians Identify Users Develop Policies Create Workgroups
Acknowledgement UNCseq Team Health Registry Team TCGA at UNC Team DDN Thank you!