Genomics and Health Data Standards: Lessons from the Past and Present for a Genome-enabled Future

Genomics and Health Data Standards: Lessons from the Past and Present for a Genome-enabled Future Daniel Masys, MD Professor and Chair Department of Biomedical Informatics Professor of Medicine Vanderbilt University School of Medicine Principal Investigator electronic Medical Records and Genomics (emerge) Coordination Center

Topics The emerge consortium: on the frontier of genomes and phenotypes derived from clinical sources Lessons for future standards learned from the last 40 years of biomedical informatics

RFA HG-07-005: Genome-Wide Studies in Biorepositories with Electronic Medical Record Data 2007 NIH Request for Applications from the National Human Genome Research Institute The purpose of this funding opportunity is to provide support for investigative groups affiliated with existing biorepositories to develop necessary methods and procedures for, and then to perform, if feasible, genome-wide studies in participants with phenotypes and environmental exposures derived from electronic medical records, with the aim of widespread sharing of the resulting individual genotype-phenotype data to accelerate the discovery of genes related to complex diseases.

emerge: an NHGRI-funded consortium for Biobanks linked to EMR data Consortium members Group Health of Puget Sound Marshfield Clinic Mayo Clinic Northwestern University Vanderbilt University Condition of NIH funding: contribute genomic and EMR-derived phenotype data to dbgap database at NCBI

Dementia Cataracts Type II diabetes Peripheral vascular disease Coordinating center QRS duration

Each site includes DNA linked to electronic medical records Each project includes community engagement, genome science, natural language processing capability for EMR data Research plans include identifying a phenotype of interest in 3,000 subjects and conduct of a genomewide association study at each center: Σ=18,000 Supplemental funding provided for cross-network phenotypes

Supplement phenotypes using genotyped samples from primary phenotypes* *The benefit of data from routine clinical testing results recorded in EHR RBC/WBC Diabetic Retinopathy Lipid Levels & Height GFR GHC 3,579 230 3,114 1,713 Marshfield 3,865 213 3,693 3,929 Mayo 3,346 806 3,175 3,340 NU 2,484 139 2,816 1,485 VU 2,650 1,449 1,631 2,679

Informatics Issues in emerge Determination of comparability of patient populations across institutions Data exchange standards for phenotype data Representation of change over time (repeated measures) and clinical uncertainty for EMR-derived observations (definite probable possible for assertion and negation) Re-identification potential of clinical data and associated samples: maximizing scientific value while complying with federal privacy protection policies

Lessons learned from emerge Structured data alone (ICD9 codes, labs, meds) is not enough to accurately define clinical phenotypes in EMR data. Natural language processing of clinician notes improves both sensitivity and specificity of phenotype selection logic Once selection logic has been validated at one site, it can be successfully shared as pseudocode and implemented at other sites with widely varying EMR systems architectures

The problem with ICD codes ICD code sets contain both false negatives and false positives False negatives: Outpatient billing limited to 4 diagnoses/visit Outpatient billing done by physicians (e.g., takes too long to find the unknown ICD9) Inpatient billing done by professional coders: omit codes that don t pay well can only code problems actually explicitly mentioned in documentation False positives Diagnoses evolve over time -- physicians may initially bill for suspected diagnoses that later are determined to be incorrect Billing the wrong code (perhaps it is easier to find for a busier clinician) Physicians may bill for a different condition if it pays for a given treatment Example: Anti-TNF biologics (e.g., infliximab) originally not covered for psoriatic arthritis, so rheumatologists would code the patient as having rheumatoid arthritis

Observations about clinical genomics Genomic data is the current poster child for complexity in healthcare No practitioner can absorb and remember more than a tiny fraction of the knowledge base of human variation Therefore, computerized clinical decision support is the only effective way to insert genomic variation-based guidance into clinical care

Need for Patient-Specific Decision Support Assistance Facts per Decision 1000 100 Proteomics and other effector molecules Functional Genetics: Gene expression profiles Decisions by clinical phenotype i.e., traditional health care 10 1990 2000 2010 2020 Structural Genetics: e.g. SNPs, haplotypes Human Cognitive Capacity

Observations about genomic data in general Predictably, some future uses of genomic data will be so creative that nobody could have predicted them. Raw data e.g., primary DNA sequence, specific SNP values, should be retained when adding clinical interpretations, not discarded in the synthesis and reporting of those interpretations

Lessons learned about standards The most successful and widely used standards are messaging and communications standards e.g., TCP/IP, http, smtp, ftp, HL7. Knowledge representation standards tend to be widely used only if driven by financial processes, government mandate (e.g., ICD-9), or sustained government investment (e.g., MeSH, SNOMED, UMLS Metathesaurus)

Lessons learned about standards, cont d Familiar and usable wins over theoretically correct but too complex e.g., HL7 v.2 vs. v.3 Lindberg s axiom: Systems that get used get better.

URL: www.gwas.net

Commander Willard Decker: V'ger... expects an answer. Captain James T. Kirk: An answer? I don't even know the question. Star Trek, the movie 1979