INTEGRATION OF ELECTRONIC HEALTH RECORDS AND PUBLIC BIOLOGICAL REPOSITORIES ILLUMINATES HUMAN PATHOPHYSIOLOGY AND UNDERLYING MOLECULAR RELATIONSHIPS


A DISSERTATION SUBMITTED TO THE PROGRAM IN BIOMEDICAL INFORMATICS AND THE COMMITTEE ON GRADUATE STUDIES OF STANFORD UNIVERSITY IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

DAVID P. CHEN
AUGUST 2011

© 2011 by David Pei-Ann Chen. All Rights Reserved. Re-distributed by Stanford University under license with the author. This work is licensed under a Creative Commons Attribution-Noncommercial 3.0 United States License. This dissertation is online at:

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Atul Butte, Primary Adviser

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Russ Altman

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Michael Walker

Approved for the Stanford University Committee on Graduate Studies.
Patricia J. Gumport, Vice Provost Graduate Education

This signature page was generated electronically upon submission of this dissertation in electronic format. An original signed hard copy of the signature page is on file in University Archives.


ABSTRACT

Secondary use of electronic health record (EHR) data has the potential to unlock novel insight into human pathophysiology. While EHR data has often been used for retrospective studies, public health management, and patient safety improvement, its use in discovering underlying molecular mechanisms of human disease and pathophysiology has been limited. Much of this can be attributed to the differing priorities of healthcare providers and basic biological researchers. The advent of biobanks that collect physiological measurements as well as tissue samples and molecular measurements promises to address this issue. However, the sheer number of different biological and clinical measurement modalities hinders the generation of a truly complete view of the human organism.

The increased adoption of EHRs as well as growing biological data repositories enables researchers to answer biological questions applicable to the human population. The goal is not to treat humans as experimental organisms, but rather to gain as much knowledge as possible from every patient seen. By viewing EHRs as a repository of perturbations and their associated physiological consequences, we can begin to design experiments that leverage EHR data to generate hypotheses that can be further evaluated.

This thesis aims to describe methods to summarize EHR biomarker data in a systematic way to enable downstream analysis as well as methods for integrating EHR data and disparate biological data. I will describe the creation of the clinarray and its application to specific disease populations to differentiate patients by severity and to discover latent physiological factors associated with disease. I will also describe how to aggregate and analyze clinarrays from across the EHR to build models of aging. Finally, I will discuss the use of diseases to integrate EHR data with gene expression data from a disparate biological data source to discover genes related to aging and to generate hypotheses for relationships between biomarkers and genes.

The integration of readily available clinical and biological data promises to improve our understanding of phenomics without impacting patient care or adding an unnecessary burden to the healthcare system. It is important for biological research to leverage the increased amount of molecular and environmental data stored in EHRs to build a more complete view of the human organism.

ACKNOWLEDGEMENTS

Throughout my graduate career I have had the pleasure and privilege to work with many extraordinary individuals in the Biomedical Informatics Training Program and the greater Stanford community who have made my time here both intellectually stimulating and life-changing. I would first like to recognize and thank some of my fellow students in the Biomedical Informatics Training Program: Alex Morgan, Sarah Aerni, Noah Zimmerman, Marina Sirota, Chirag Patel, and Joel Dudley for their collaboration, stimulating conversation, advice, and support throughout these years. I would like to thank the following members of the Butte lab who have created and maintained the various repositories and computational resources that have enabled my research: Rong Chen and Alex Skrenchuk. I would also like to thank Susan Aptekar for making the Butte lab run smoothly. I would like to thank my academic advisor, David Paik, for keeping me on track during the early years of my graduate career. I would also like to thank the members of my thesis defense committee for their time and academic advice: Russ Altman, Michael Walker, Henry Lowe, and Gary Peltz. I would like to thank Kenneth Weinberg for his collaboration and for bringing a critical biological perspective to computational results. I would like to especially thank the many members of the Biomedical Informatics community at Stanford without whom nothing would occur smoothly or on time: Larry Fagan, Betty Cheng, and Mary Jeanne Oliva. I would also like to thank Darlene Vian, who, while no longer with us, has an enduring legacy in the BMI program.

I would also like to thank my close friends and family who have supported me throughout these years. First, I would like to thank my partner Justin for always being present, loving, and supportive. Next, I'd like to thank my parents Tom and Emily, and my brother Brian for the examples they set and a lifetime of love, guidance, encouragement, and support. I would also like to thank my Aunt Blossom, Uncle Frank, and cousins Peimei and Peili for always being there for me. Lastly, I would like to thank my mentor and advisor Atul Butte. From the moment I met Atul I knew that he was who I wanted to emulate. Thank you for being such a great person, mentor, and life coach.

TABLE OF CONTENTS

List of Tables
List of Figures
Chapter 1: Natural Experiments Using Electronic Medical Records Enable Biological Discoveries
Chapter 2: The Current Paradigm for Using Electronic Health Records for Molecular Discoveries
    EHR Use in Biobanks, Academic Research Institutions, and Primary Care Institutions
    Hurdles of Using EHR Data: Data Quality and Privacy
Chapter 3: Clinical Arrays of Laboratory Measures, or Clinarrays, Built from an Electronic Health Record Enable Disease Subtyping by Severity
    Methods
    Data Collection, Processing and the Clinarray
    Metric for Severity
    Hierarchical Clustering of Patients and Lab Tests
    Normalization
    Results
    Discussion
Chapter 4: Latent Physiological Factors of Complex Human Diseases Revealed by Independent Component Analysis of Clinarrays
    Methods
    Building Clinarrays from Patient Lab Tests
    Independent Component Analysis
    Extracting Physiological Factors
    Results
    Creation of Disease-Specific Clinarrays
    Independent Component Analysis of Clinarrays
    Discussion
    Conclusions
Chapter 5: Validating Pathophysiological Models of Aging Using Clinical Electronic Medical Records
    Methods and Results
    Model Building Using NHANES
    Model Evaluation
    Model Validation on Clinical Samples
    Discussion
    Conclusion
Chapter 6: Novel Integration of Hospital Electronic Medical Records and Gene Expression Measurements to Identify Genetic Markers of Maturation
    Methods
    Data Collection and Processing
    Finding Biomarkers for Maturation Using Analysis of Variance
    Using Diseases to Model Maturation
    Results
    Clinical Biomarker for Maturation
    Finding Maturation and Aging Related Genes
    Discussion
Chapter 7: Bedside to Bench Reverse Translation Enables Discovery of Phenomic Relationships Between Immune Related Molecules and Pathophysiology
    Methods
    Gene Expression Data and Annotation
    Mapping UMLS CUIs To ICD-9-CM Codes
    Generating Disease-Biomarker State Vectors
    Weighted Least Squares Regression Between Biomarkers and Genes
    Multiple Hypothesis Testing
    Results
    Evaluation
    Discussion
Chapter 9: Conclusion

LIST OF TABLES

Table 1: Sample biological questions and their associated problem solving methods
Table 2: Diseases and the number of patients used in our clinarray study
Table 3: Population after data pruning
Table 4: Number of patients and biomarkers remaining after pruning
Table 5: Significant biomarkers after ICA analysis
Table 6: Top 10 most informative biomarkers
Table 7: Results between predicted and expected chronological age
Table 8: Diseases and relevant GEO data sets
Table 9: Top aging related genes
Table 10. MamPhEA Enrichment Results
Table 11. Biomarkers compared between IL-4R -/- and Balb/c mice

LIST OF FIGURES

Figure 1: Information flow between clinicians and basic science researchers
Figure 2: Example Clinarrays
Figure 3: Hierarchical cluster of laboratory tests
Figure 4: Clustered clinarrays for cystic fibrosis
Figure 5: Visual schematic of the ICA model of disease pathophysiology
Figure 6: Schematic of the model building and prediction pipeline
Figure 7: Aging model feature selection
Figure 8: Actual age versus predicted age
Figure 9: Comparisons between error distributions across datasets
Figure 10: Distribution of laboratory measurements at different ages
Figure 11: Comparison of different rates of decay across 11 diseases and the baseline
Figure 12: Schematic for integrating clinical and gene expression data
Figure 13: Clinarray creation
Figure 14: Relationship between Neutrophil(%) and IL-4R
Figure 15: Gene-Biomarker networks


CHAPTER 1: NATURAL EXPERIMENTS USING ELECTRONIC MEDICAL RECORDS ENABLE BIOLOGICAL DISCOVERIES

Imagine a major primate core facility at a top-tier academic medical school. This primate facility has studied 60 thousand primates over the past seven years, modeling over 8000 different phenotypes. Now imagine that this facility is linked to a data warehouse, where 7 million quantitative measurements of nearly 1500 different types have already been made, and 5 million new measurements are made, time-stamped, and stored each year. The quality of the measurements made on these primates is ensured, and repeatedly checked and verified. More recently, this facility has begun recording all interventions ordered and performed on these primates, including medication doses administered, imaging studies conducted, and procedures performed, with 10 million recorded interventions of 2800 different types, and 20 million new interventions time-stamped and recorded each year. Though easily imaginable, this primate facility would dwarf nearly every existing animal care facility in the United States in terms of breadth, depth, and relevance. The basic research that could be enabled by this facility would already be boundless. But the most amazing component of this facility is that it covers humans.

In fact, this facility is exactly what is provided at an academic medical hospital. For over thirty years patient phenotypes, measurements, and interventions have been collected by physicians, nurses, healthcare technicians, social workers, and other allied health professionals, and stored in electronic health records (EHRs), the depth and breadth of which grow exponentially year upon year. While the primary use of the data has been and continues to be clinical documentation, patient management, decision support, and reimbursement [1], the secondary uses have historically been limited to performing retrospective clinical studies, monitoring public health, and improving technologies that impact patient care and safety [2, 3]. Due to the increasing adoption of the EHR and the ever-growing amounts of biological data being captured, there now exists the important potential for reuse of clinical data for biological research relevant to humans. By realizing that data stored in EHRs can be used for natural experiments and by developing new problem solving methods, we can begin to take advantage of EHR data to better understand underlying human biology. Deviating from the primate core facility analogy, the goal is not to treat humans as experimental organisms, but rather to gain as much knowledge as possible from every patient seen.

Natural experiments are observational studies where the experimenter observes how nature impacts subjects, rather than methodically introducing perturbations, as is the norm with controlled experiments [4]. The main difference between a natural experiment and an observational study is that, in a natural experiment, one can claim only that assignment to specific groups is "as if" random [5]. Despite this seeming limitation, major biomedical discoveries have resulted from natural experiments. The classic example of such a natural experiment is John Snow's 1855 study that linked sewage dumped into the Thames river to cholera outbreaks in London [6]. Similarly, the discovery of blood groups that provide resistance to smallpox can be thought of as a natural experiment. Four decades ago, Vogel and colleagues measured rural Indian villagers' blood groups and observed who was more susceptible to the disease in a natural smallpox epidemic [7]. Epidemiologists, economists, and social scientists have used the concept of the natural experiment to infer relationships between variables that often cannot be manipulated, or that would be immoral or illegal to manipulate [8]. One cannot simply inject individuals with most viruses and observe their response.

Whereas the smallpox study involved one disease and one physiological marker, the data stored in an EHR represents thousands of physiological changes from thousands of perturbations arising from different conditions, interventions, demographics, and environmental effects. Almost none of these perturbations are related to any experimental protocol and, with the right precautions, they could arguably be considered "as if" random. The voluminous amounts of data stored in an EHR consist of demographic information, physicians' notes, laboratory values, imaging data, pathology reports, pharmacological data, insurance claims, and much more. Included within these data are perturbations associated with an individual. As an example, SNOMED pathology or ICD-9-CM codes associated with a medical visit could represent the patient's condition. Similarly, smoking status may be found in a physician's note [9]. As an illustration of the breadth of data that is available, laboratory data collected from Lucile Packard Children's Hospital and Stanford Hospitals and Clinics consists of over 3000 different laboratory tests. Much of this data is seldom used after a patient encounter. Furthermore, an increasing amount of molecular data is available to individuals outside the clinical environment. The technologies that produce it include genome profiling, provided by companies like 23andMe, and complete genome sequencing, which is moving quickly towards consumer uptake. While these types of data may reside in separate databases, the virtual unified view of the EHR encapsulates these different elements.

Even though complete integration between diverse measurements of human physiology is currently lacking, the current data stored in EHRs suggest what can be accomplished. As an example, we can consider the act of bone marrow donation (ICD-9-CM code V59.3) as a differentiating factor that can separate a paired population into two groups, pre-donation and post-donation. Using this design we can begin to examine characteristics of cell populations that differ between pre- and post-donation. In fact, we see that the percentage of neutrophils in the blood increases post-donation while platelet count decreases (Figure 1b). While these are only two biological examples that show significant differences, we can extend this analysis to systematically compare other types of biomarkers stored in an EHR to see if there are more differences that may be associated with bone marrow donation.
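The paired pre- and post-donation design described above can be illustrated with a small sketch. The text does not say which statistical test was used, so the exact sign test below is an assumption chosen for simplicity, and the function name `paired_sign_test` and all measurement values are hypothetical.

```python
from math import comb
from statistics import median


def paired_sign_test(pre, post):
    """Two-sided exact sign test on paired pre/post measurements.

    Returns the median paired change and the p-value. Ties are dropped.
    """
    diffs = [b - a for a, b in zip(pre, post) if b != a]
    n = len(diffs)
    k = sum(d > 0 for d in diffs)  # number of donors whose value increased
    # Probability of a split at least as lopsided as k of n when an
    # increase and a decrease are equally likely.
    tail = sum(comb(n, i) for i in range(min(k, n - k) + 1)) / 2 ** n
    p_value = min(1.0, 2 * tail)
    return (median(diffs) if diffs else 0.0), p_value


# Made-up neutrophil percentages for the same eight donors,
# before and after donation (illustrative values only).
pre_neutrophil = [52.0, 48.5, 60.1, 55.3, 47.9, 58.2, 50.4, 53.7]
post_neutrophil = [61.2, 55.0, 66.4, 59.8, 57.1, 63.0, 58.9, 60.2]

median_change, p = paired_sign_test(pre_neutrophil, post_neutrophil)
print(f"median change = {median_change:.2f}, p = {p:.4f}")
```

The same function could be looped over every biomarker recorded for the donors, which is the systematic extension the paragraph suggests; in that case the resulting p-values would need a multiple-testing correction.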

Figure 1: Information flow between clinicians and basic science researchers. a) Biological areas that may benefit from cross-domain analyses of de-identified EHR data. b) Bone marrow donor data used to examine differences in cell populations between paired individuals pre- and post-donation.

A key strength of using EHR data for natural experimentation is the ability to ask questions that span clinical specialties (Figure 1a). A prime example is a study by Andrey Rzhetsky and colleagues, in which 1.5 million patient records from the Columbia University Medical Center were used to infer genetic overlap between 161 disorders [10]. The results of their study showed many known and previously unknown correlations between diseases and suggest that autism, bipolar disorder, and schizophrenia share significant genetic overlap.

The study of human aging is another example of a research area that would benefit from aggregating data across different clinical departments. Most biological studies of aging are currently done in model organisms including yeast, worm, and mouse. As a consequence, many of the discoveries have questionable relevance to humans. Using human physiological data, we and others have built models of human development and aging [11, 12]. In combination, alkaline phosphatase, creatinine, hematocrit, and mean red cell volume are the best predictors of male development between years of age, while alkaline phosphatase, creatinine, and total serum globulin are the best for female development. Other examples of this methodology include phenome-wide association studies conducted by Vanderbilt to detect associations between single nucleotide polymorphisms and disease [13] and genome-wide association studies from the Mayo Clinic examining red blood cell traits [14], both of which use EHR data and genotyped individuals.

As molecular diagnostics are increasingly incorporated into the standard of care, experiments that were previously accomplished in controlled settings can now be done by analyzing EHR data. For example, studies have been published showing how gender and genetic differences affect white blood cell counts, distribution of cell types, and other physiological biomarkers [15]. Additionally, investigators have probed the relationship between clinical imaging features and gene expression [16]. It is therefore ideal that more investigators become informed of the types of analyses that can be done. It is also crucial that the infrastructure and resources are developed for these types of analyses in those clinical settings that have already adopted EHRs.

While there are many challenges in using EHR data to its fullest potential, including data quality and privacy concerns which will be discussed in subsequent chapters, perhaps the most significant challenge is determining the new kinds of questions that are enabled by this data. Knowledge of the specific medical domain or domains of interest, as well as the types of EHR data available, is essential. This differs from the traditional scientific approach in which a hypothesis is first generated, followed by data acquisition and analysis. A researcher who is planning a natural experiment with EHR data must first know the types, amount, and coverage of the data collected. Next the researcher can, given domain expertise or consultation with a domain expert, propose biological questions that could be tested with this data. The researcher must make sure that selected groups are otherwise representative and equal and, if not, take into consideration the covariates involved. Finally, the researcher can test their hypotheses on the data with the appropriate quantitative methods. While results may show statistically significant relationships, it is still crucial that controlled experiments are subsequently used to validate any observations and to fully understand the mechanism. Table 1 provides a glimpse at possible questions and problem solving methods that take advantage of EHR data. The advantage of using natural experiments is that they can provide unexpected relationships that are not just an incremental gain on the knowledge of human biology, but can fundamentally change what we know about it.

In order to prepare the next generation of researchers to fully use the potential of EHRs, better collaborative dialogue between clinicians and scientists must be achieved. Scientists need to understand the types, amounts, and pitfalls of the EHR data available in order to propose valid questions. Clinicians need to be actively engaged in the research endeavor, to teach investigators about diseases and how they affect humans, and to facilitate access to clinical data. Another significant consideration for clinicians is to be cognizant of and to improve the data quality as it is entered into EHRs in order to facilitate downstream analyses. Training programs that are proponents of this systems medicine approach to research should be adapted to teach graduate students about the clinical environment, perhaps including morning rounds or even weeklong rotations in a hospital. These programs should also have greater medical student participation and interaction so that collaborative dialogue can be fostered. From a sociological perspective, it is often the case that collaboration leads to questions and resources that would be difficult to come by otherwise [17].

Table 1: Sample biological questions and their associated problem solving methods

Question: How do genotypes in HLA regions affect lifespan?
Method: Examine genotype data on HLA regions from normal healthy bone marrow donors to see if any genotype is over-represented in an older population. Look for the relationship between genotype and immune senescence.

Question: How does the size of the hypothalamus affect BMI?
Method: Retrieve all patients that have an MRI scan of their hypothalamus and have BMI recorded. Use methods similar to those of Cypess and colleagues, which linked brown adipose tissue to BMI [18].

Question: What are the relationships between gene expression and physiological biomarkers (e.g. red blood cell count, blood urea nitrogen, etc.)?
Method: Examine the relationship between physiological biomarker differences and gene expression differences using disease groups as an intermediary to join clinical and experimental data [19].

Question: What chronic diseases accelerate human development and aging?
Method: Build a disease-specific aging model using patients diagnosed with specific ICD-9-CM codes [20].

Question: What non-chemotherapeutic drug significantly reduces WBC in the short term?
Method: Remove chemotherapeutic drugs from the drug corpus and examine differences between white blood cell count before and after administration of all other drugs.

Question: Is there any correlation between physiological pain and laboratory measurements?
Method: Examine correlations between pain scores and measured laboratory values.

In the following chapters I will present examples of informatics methods that allow for the aggregation, evaluation, and use of electronic health records to better understand human health and underlying pathophysiology. In chapter two I will give a brief overview of the current state of integrating electronic health records and molecular data for reverse translational medicine, and examine some potential hurdles of using EHR data. In chapter three I will discuss the creation of the clinarray, a representation of individuals' biomarker data aggregated across an EHR, and its application to differentiate individuals by disease severity. The creation of the clinarray as a virtual platform enables many methods that have been widely used in molecular data analyses to be applied to clinical data. We leverage this platform in chapter four, where I will discuss the application of independent component analysis to disease-specific aggregations of clinarrays for the discovery of known and unknown physiological factors. The goal in this chapter is twofold: first, to show that methods used in molecular research can be applied to clinical data, and second, to shed new light on physiological processes for multifactorial and complex diseases.

Whereas chapters three and four focus on specific disease conditions, the remaining chapters will discuss the use of EHR data in aggregate from many different patients across many different conditions. While I've shown that molecular methods can be applied to clinical data, the question of consistency in the results of data analyses using clinical data still remains. In chapter five I address this concern in the context of age prediction. I will discuss the use of normal biomarker values found in EHR data to validate models of maturation that were built using data from the National Health and Nutrition Examination Survey. The final two chapters will present examples of problem solving methods that incorporate molecular data with clinical data to shed new light on physiological processes and their underlying mechanisms. Both of these chapters use the idea that diseases are perturbations of human physiology to derive relationships between physiological changes and molecular changes. Chapter six will discuss methods to intersect EHR data and gene expression data from a nonintersecting population to find genes related to maturation and aging. Chapter seven will describe methods for the integration of EHR data and gene expression data to discover novel relationships between molecules and immune related pathophysiology. It is my hope that the following chapters will lay the groundwork for the understanding of the value of electronic health records and present problem solving methods that have enabled us to discover new clinical and molecular features of diseases.

CHAPTER 2: THE CURRENT PARADIGM FOR USING ELECTRONIC HEALTH RECORDS FOR MOLECULAR DISCOVERIES

EHR Use in Biobanks, Academic Research Institutions, and Primary Care Institutions

The most comprehensive stores of coupled clinical and molecular data currently reside in biobanks, repositories of biological samples that are connected to clinical data, the majority of which are found at academic research institutions. For example, Vanderbilt University, Mayo Clinic, Marshfield Clinic, Northwestern University, and the Group Health Cooperative/University of Washington host biobanks that have recently become involved in the eMERGE network, an NHGRI-funded consortium that explores the utility of DNA repositories linked to EHRs [21, 22]. The data stored in these biobanks have enabled numerous genome-wide association studies (GWAS) that examine the relationship between clinical features and single nucleotide polymorphisms (SNPs) that may affect them. In 2009 Minerva Carrasquillo and colleagues from Mayo Clinic determined that a genetic variant in PCDH11X is associated with susceptibility to late onset Alzheimer's disease [23]. In this study, 313,504 SNPs were examined for 844 cases and 1,255 controls aggregated from clinically ascertained individuals from two different Mayo clinics and the Mayo brain bank. Other examples include the investigation into blood cell traits associated with various SNPs [14]. Blood cell trait data was collected from EHR records spanning 15 years for 3,012 patients. Features of the EHR, including billing codes, were used to control for hematological diseases and comorbidities that may affect downstream analyses. The results identified 11 significant SNPs within 4 genomic loci (HBS1L/MYB, TMPRSS6, HFE, and SLC17A1) associated with 4 blood cell traits (RBC, MCV, MCH, and MCHC). Three of these four loci were previously identified in a GWAS using a much larger number of participants.

While the majority of these studies have dealt with blood markers and specific diseases, one can envision these methods being applied to other clinical disciplines like radiology and pathology, in which researchers can examine imaging and tissue features in conjunction with polymorphisms. The lack of studies that integrate these domains and high throughput molecular measurements stems from difficulties in extracting and digitizing features from radiological images and pathology reports. While radiological images are stored in Picture Archiving and Communication Systems, the interpretations of the images, once they are read, are usually the only information stored in an EHR that is easily accessible. These interpretations, while useful for diagnosis and treatment, are the tip of the iceberg with regard to the information that can be extracted from raw image files. The storage of pathology data in electronic format lags behind even radiology, as systems that store such data are currently in development. However, with digitization and the ability to extract richer features from these commonly used clinical diagnostic tools, their integration into GWAS will provide greater insight into the physiological effects of polymorphisms.

Although biobanks have an enormous amount of genetic, clinical, and sometimes environmental data, the diversity of molecular measurements is somewhat lacking. Most of the emphasis has been placed on the collection of genomic information and storage of tissue samples for future analyses. There exist many other molecular modalities that are continuously being developed and used for basic molecular research that provide different perspectives on disease and pathophysiology.

The lack of infrastructure and institutional support for the creation of more academic biobanks, in which high throughput measurements are taken in tandem with clinical measurements gathered from standard healthcare practices, has not precluded these academic research hospitals and institutions from taking advantage of EHRs. While there are no omnibus plans at these institutions to gather molecular data from patients, individual investigators have collected and used treasure troves of molecular data from research projects and clinical trials, including genetic sequence, gene expression, mass spectrometry, flow cytometry, and microRNA data, to name a few. These data, in conjunction with clinical EHR data, have enabled researchers to examine the relationships between clinical and molecular features beyond that of genetic sequence. In 2007 Eran Segal showed, in an example of non-invasive molecular profiling, that 28 imaging traits from liver CT scans can be predictive of 78% of the global liver cancer gene expression profile [16]. While recently there have been many GWAS examining blood cell traits and SNPs, Whitney and colleagues in 2003 profiled these traits as well as circadian cycles in the context of gender and age using gene expression microarrays [15].

The greatest amount of EHR data is collected by primary care facilities that are not related to academic research institutions. These institutions solely focus on the care and treatment of patients. However, there exist primary care facilities that are using their ability to gather clinical, molecular, and environmental information to better understand disease processes and the effects of treatments. Kaiser Permanente's Research Program on Genes, Environment, and Health is a prime example of this [24]. Their current plan is to collect clinical data from EHRs, environmental exposure and behavioral data, and genetic information for 500,000 consenting members to examine the genetic and environmental factors that influence common diseases. As of 2010, over 130,000 members have consented, and projects studying bipolar disorder among different ethnicities and prostate cancer in African American men are now being initiated.

While the prevalence of EHR adoption in the United States is relatively low (~1.5 percent of all U.S. hospitals have comprehensive electronic records [1]), many European countries have implemented nationwide EHRs. Denmark, for example, has a national health network, MedCom, which is used by over three quarters of the healthcare sector, more than 5,000 different organizations [25]. Over 98 percent of primary care practices use clinical EHRs. Patient data, including laboratory information and pharmacy orders, are easily accessible to healthcare providers as well as patients. The connectivity of this amount of information can potentially be an enabling factor for molecular research when molecular diagnostics become the standard of care.

Hurdles of Using EHR Data: Data Quality and Privacy

While the discussion about EHR data quality and the privacy concerns regarding the use of EHR data goes beyond the scope of this thesis, I will attempt to briefly outline the major concerns and how they are being addressed in practice. Much of the concern with EHR data quality involves incorrect data entry and coding, institution-specific coding practices, the quality of natural language processing with regard to unstructured text and uncontrolled terminologies, and covariates including comorbidities, drug usage, procedures, environmental effects, socioeconomic groups, etc.

The amount of data stored in an EHR plays an important role when dealing with some of these data quality issues. Because of the large amount of data, researchers can be very stringent when devising exclusionary rules. For example, Kullo and colleagues, when examining the relationships between SNPs and blood cell traits, developed an algorithm that uses billing codes and natural language processing of unstructured clinical notes to exclude data affected by comorbidities, medications, or blood loss [14]. Using International Classification of Diseases, Ninth Revision, Clinical Modification (ICD-9-CM) codes, procedural ICD-9 codes, and Current Procedural Terminology codes, Kullo excluded data from patients with comorbidities that included hematological and solid-organ malignancies, hereditary anemias, cirrhosis, etc., and with medications that included chemotherapeutic agents and immunosuppressive drugs. As a result, 12,864 values in 1,165 patients were excluded from the original 35,159 RBC trait values. This resulted in excluding 200 patients out of the original 3,411. The heuristics that are used to exclude data are domain dependent and require expertise with the question that is being asked. The large amounts of data also enable methods like propensity score matching, which attempts to derive similar populations based on selected covariates. Because clinical measurements are often non-normally distributed [26] and prone to outliers, analyses should focus on distributions that more closely fit the data and on more robust statistics, such as the median and quantiles, rather than the mean and variance.

Privacy concerns remain a significant issue when dealing with the use of EHR data. However, institutions have shown that, with informed consent, the number of people who allow their health information to be used for research is not insubstantial (up to 75,000 individuals at Vanderbilt to over 130,000 at Kaiser Permanente). There are multiple safeguards in place at many academic and non-academic institutions, including the Health Insurance Portability and Accountability Act and Institutional Review Boards, that ensure that data from people who opt in to these programs are being used appropriately.

De-identified data also represents a valuable data source that can be used for molecular research. An example of such a data source is NHANES, the National Health and Nutrition Examination Survey, a biannual survey conducted by the Centers for Disease Control and Prevention on a sample of the noninstitutionalized population of the United States [27]. This data set consists of behavioral, environmental, and clinical data that is freely available for download. Access to matched genetic information must be separately requested. Private institutions have also started to use and release de-identified health data. The Heritage Health Prize, for example, is a competition that aims to use aggregated de-identified medical claims data for 100,000 individuals to predict hospital admissions.

The drawbacks of using de-identified data, however, include the risk of re-identification if the de-identification process is not complete enough. This data will undoubtedly be less complete due to the exclusion of certain data types, like physicians' notes, that are extremely difficult to de-identify. De-identification may also lead to abstraction of data. For example, rather than using specific ICD-9-CM codes, the de-identification process may abstract them into high-level terms, such as disease groups, which may or may not be suitable for a particular study.

While technical, sociological, and legal obstacles prevent the use of EHRs to their fullest potential, we believe that many of them can be surmounted. New methods, such as the text mining of physician notes to generate useful computational data [28, 29], continue to be developed and tested on clinical data. Given the concerns about healthcare privacy, new methodologies for the de-identification of EHR records need to be prioritized. Institutions should maintain repositories and implement procedures to streamline the use of de-identified data in accordance with the Health Insurance Portability and Accountability Act and institutional requirements. Institutions need to build systems that facilitate scientific inquiry and data retrieval, examples of which include the Stanford Translational Research Integrated Database Environment [30] and i2b2 [31]. Healthcare institutions such as Vanderbilt [32], federal institutions like the Department of Health and Human Services and the Centers for Disease Control and Prevention, and private corporations like Cerner have started to collect, use, and make available clinical and health data for research. Funding mechanisms like the Clinical and Translational Science Awards promise to democratize this approach to others [33]. National structures that aggregate de-identified clinical data, similar to the NCBI Gene Expression Omnibus for gene expression experiments [34] or dbGaP for genome-wide association studies [35], would enable more scientists to use this kind of data for research.

As EHR data becomes more accessible, higher in quality, and richer, it promises to contribute significantly to biological research. The ability to integrate EHR data for biological research will fundamentally change how human biological research is performed. It is imperative for researchers to recognize this potential value and work to put into place policies and procedures that facilitate the use of EHR data. As the United States moves from 1.5% [1] to 100% adoption of EHRs, along with the rest of the world, and as more individualized health data become available, scientific leveraging of this data will shed new light on basic science and ultimately improve human health.

CHAPTER 3: CLINICAL ARRAYS OF LABORATORY MEASURES, OR CLINARRAYS, BUILT FROM AN ELECTRONIC HEALTH RECORD ENABLE DISEASE SUBTYPING BY SEVERITY

This work has been done in collaboration with Susan Weber, Philip Constantinou, Todd Ferris and Henry Lowe. Susan provided clinical laboratory data from the Stanford Translational Research Integrated Database Environment system under the supervision of Philip and the leadership of Henry. Todd ensured that de-identification met institutional requirements.

This work has been published: Chen DP, Weber SC, Constantinou PS, Ferris TA, Lowe HJ, Butte AJ (2007) Clinical Arrays of Laboratory Measures, or Clinarrays, Built from an Electronic Health Record Enable Disease Subtyping by Severity. American Medical Informatics Association

The conceptualization and application of biological methods and techniques to clinical data can help narrow the gap between basic science and its clinical relevance, a goal espoused as one of the underpinnings of translational research. For the past decade, a major modality of research in the biosciences has been microarray technology. Microarrays and gene expression profiling have been used to gain valuable insight into biological processes through the measurement of tens of thousands of genes and have paved the way for novel prognostic tests and disease-subclass determination [36]. This platform has provided the ability to quantify gene expression under differing experimental conditions, and these measurements can be used by various algorithms to classify, learn, or predict biologically relevant processes.

In 1999, Todd Golub and colleagues showed that supervised clustering of microarray samples could distinguish between acute myeloid leukemia and acute lymphoblastic leukemia [37]. Alizadeh and colleagues used an unsupervised algorithm to discover subtypes with differing severities from samples of a single disorder, B-cell lymphomas, a difference that can directly affect clinical outcome [38]. Laura van 't Veer and colleagues used supervised classification of gene expression to determine a signature that is indicative of the clinical outcome of breast cancer [39]. More recently, Nathan Price and colleagues used gene expression data to create a highly accurate two-gene classifier for differentiating between gastrointestinal stromal tumors and leiomyosarcomas [40].

The application of supervised and unsupervised algorithms to high-bandwidth gene expression data has had a direct impact at both the bench and the bedside, furthering our understanding and treatment of singular human diseases [41]. However, diseases like cystic fibrosis or Crohn's disease, which have environmental influences or social influences or both, make classification based on microarray data imprecise [42]. Hence, the prediction of clinical outcome for patients with more complex diseases must examine other variables that are often considered qualitatively.

An often-overlooked metric that is a direct measurement of phenotypic information is clinical laboratory data. Stoll and colleagues previously created physiological profiles from measurements in rats, but this approach has yet to be translated to humans [43]. While data collected during clinical care were prone to transcription errors in the past, the movement towards electronic medical records (EMR) has improved data quality by eliminating transcription and omission errors [44]. In this paper we propose the aggregation of clinical laboratory tests gathered from EMR data on a per-patient basis to create what we term a clinarray, enabling quantitative methods traditionally used on gene expression microarrays to now be applied to clinical data. The clinarray is a platform that allows for the quantification of phenotypic expression, across a panel of pathophysiological measurements obtained through clinical laboratory tests, for a patient, in the same way that the microarray is a platform used to quantify genome-wide expression for an experiment.

We first show that we can apply unsupervised clustering methods to aggregations of clinarrays to retrieve pathophysiologically relevant laboratory groupings. We then show that unsupervised methods used in microarray analysis can be directly applied to clinarrays to distinguish patients with severe and less-severe forms of cystic fibrosis and Crohn's disease.

Methods

Data collection, processing and the clinarray

Quantitative clinical laboratory data, consisting of 317,338 measurements across 553 distinct lab tests, originally obtained at the Lucile Packard Children's Hospital, were collected in a de-identified manner from the Stanford Translational Research Integrated Database Environment (STRIDE). In total, this data represented 966 patients across all ages who were diagnosed with one or more of 3 chronic diseases (Table 2). The use of de-identified clinical laboratory data in this manner was approved by the Institutional Review Board of the Stanford University School of Medicine.

Diseases            Number of Patients
Crohn's disease     154
Cystic fibrosis     449
Down Syndrome       366

Table 2: Diseases and the number of patients used in our clinarray study

We averaged the values for each individual lab test across all time points subsequent to a patient being diagnosed with any of the three diseases. Each average represents one value in what we term the clinarray. The clinarray thus represents the collection of average laboratory values for one patient.

Metric for Severity

Measurements of severity have often been derived from direct clinical or pathological examination of patients or patient samples. The drawback of using de-identified quantitative laboratory measurements is that direct indicators of disease severity are not available for use. However, as it has been previously shown that the number of blood samples drawn for laboratory tests increases for intensive care patients with more severe illness, based on APACHE III scores [45], we believe that we can calculate a similar proxy for severity using de-identified laboratory test data.

Our proxy for the severity of chronic disease is the average number of laboratory tests measured on a patient per year after their first recorded diagnosis of a disease. For each patient, we sum the number of laboratory tests measured on that patient regardless of type. We then divide by the number of years over which the patient has had laboratory measurements taken. We propose that the greater the number of laboratory tests measured on a patient per year, the more severe the form of chronic disease. We associate this severity score with each patient.

Hierarchical clustering of patients and lab tests

After construction, all clinarrays were grouped by disease type. Clinarrays for patients with more than one disorder were considered in each disorder. For each disease, we created a disease-specific matrix in which columns represented individual clinarrays and rows represented laboratory tests. Each cell represented a clinarray value for that specific patient/laboratory pair.

Normalization

For each disease-specific matrix, we normalized the values by laboratory type. We calculated the mean measurement for each laboratory test among all clinarrays. We then assigned a z-score to each cell in our matrix by calculating the number of standard deviations a particular laboratory/patient clinarray value was from its respective laboratory mean. Any value more than three standard deviations from the mean was set to three standard deviations. We then removed labs if no patients had the lab measured (Figure 2).

We used the normalized laboratory values to examine the coherence of applying hierarchical clustering to laboratory tests across the clinarrays. We calculated pairwise Pearson's correlation coefficients (cor) as a measure of similarity between laboratory tests within each disease-specific matrix [46]. As our disease-specific matrix is sparse, correlations between individual clinarrays may not always be possible. To rectify this, we pruned each disease-specific matrix by removing clinarrays that had fewer than three overlapping laboratory tests with all other clinarrays, thereby yielding a disease-specific matrix in which all correlations between clinarrays were meaningful. We then removed any laboratory test missing values in 20% or more of the clinarrays in each disease-specific matrix. The resulting disease-specific matrices were fairly dense.

We then applied hierarchical clustering algorithms using average agglomeration methods to examine the clustering of laboratory tests and to distinguish between patients with differing disease severities. We first clustered laboratory tests across each disease-specific matrix, with a distance measure of 1 minus the correlation coefficient, where negative correlations were considered as zero. Clusters of laboratory tests were then manually examined.
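The processing steps above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the code used in this study: the function names, the `(patient, lab, value)` record layout, the use of the population standard deviation, the symmetric cap at plus or minus three standard deviations, and the rule that two labs need at least three shared patients to be correlated are all choices made for the example. The pruning of clinarrays by overlap is omitted for brevity.

```python
import math
from collections import defaultdict
from statistics import mean, pstdev


def build_clinarrays(records):
    """Average each lab per patient: (patient, lab, value) -> {patient: {lab: mean}}."""
    grouped = defaultdict(lambda: defaultdict(list))
    for patient, lab, value in records:
        grouped[patient][lab].append(value)
    return {p: {lab: mean(v) for lab, v in labs.items()} for p, labs in grouped.items()}


def prune_labs(clinarrays, max_missing=0.20):
    """Drop labs that are missing in max_missing (20%) or more of the clinarrays."""
    n = len(clinarrays)
    counts = defaultdict(int)
    for labs in clinarrays.values():
        for lab in labs:
            counts[lab] += 1
    keep = {lab for lab, c in counts.items() if (n - c) / n < max_missing}
    return {p: {l: v for l, v in labs.items() if l in keep} for p, labs in clinarrays.items()}


def zscore_by_lab(clinarrays, cap=3.0):
    """Z-score each lab across patients; values beyond +/- cap are set to the cap
    (symmetric capping and population SD are assumptions of this sketch)."""
    by_lab = defaultdict(list)
    for labs in clinarrays.values():
        for lab, v in labs.items():
            by_lab[lab].append(v)
    stats = {lab: (mean(vs), pstdev(vs)) for lab, vs in by_lab.items()}
    out = {}
    for p, labs in clinarrays.items():
        out[p] = {}
        for lab, v in labs.items():
            mu, sd = stats[lab]
            z = 0.0 if sd == 0 else (v - mu) / sd
            out[p][lab] = max(-cap, min(cap, z))
    return out


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input has no variance."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return 0.0 if sxx == 0 or syy == 0 else sxy / math.sqrt(sxx * syy)


def lab_distance(z, lab_a, lab_b, min_shared=3):
    """Distance = 1 - correlation, with negative correlations treated as zero."""
    pairs = [(labs[lab_a], labs[lab_b]) for labs in z.values()
             if lab_a in labs and lab_b in labs]
    if len(pairs) < min_shared:  # too few shared patients (assumed threshold)
        return 1.0
    xs, ys = zip(*pairs)
    return 1.0 - max(0.0, pearson(xs, ys))


def average_linkage(items, dist):
    """Agglomerative clustering with average linkage.
    Returns the merges in order as (cluster_a, cluster_b, distance)."""
    clusters = [(item,) for item in items]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = mean(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges
```

Calling `average_linkage` with a list of lab names and a distance such as `lambda a, b: lab_distance(z, a, b)` yields the merge order for laboratory tests. The same function can cluster patients if given a patient-level distance built the same way. The quadratic search over cluster pairs keeps the sketch short and would be replaced by an optimized library routine for matrices of realistic size.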

Figure 2: Example Clinarrays

Top: Clinarrays from patients with cystic fibrosis. Columns correspond to patients as represented by clinarrays and rows represent laboratory tests. Laboratory tests that were not measured in any cystic fibrosis patient were removed. Clinarrays missing a specific laboratory test measurement are white. Gray scale indicates the degree of deviation from the mean.

Bottom: Magnified portion of the matrix.

We then hierarchically clustered each disease-specific matrix to search for natural subtypes of disease. The distance between clinarrays was again computed as 1 minus the correlation coefficient, where negative correlations were considered as zero. We then examined major subtypes for each disease by comparing the severity scores assigned to patients found in each cluster, to assess whether the major clusters significantly distinguished between patients with differing disease severities.

Results

Three disease-specific matrices were created with rows representing laboratories and columns representing clinarrays. We pruned our matrices as described above, which removed a number of patients and laboratories (Table 3).

Diseases            # patients after pruning    # of labs at 80% threshold
Crohn's disease     141 (92%)                   29
Cystic fibrosis     352 (78%)                   32
Down Syndrome       320 (87%)                   9

Table 3: Population after data pruning. Diseases, number of patients after pruning, and number of laboratory tests for which at least 80% of patients have measurements

We first clustered laboratory tests using the cystic fibrosis disease matrix by correlating measurements of labs across all clinarrays to see if logical and coherent clusters could be retrieved (Figure 3). As expected, liver function tests clustered together, as did blood markers.
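The severity proxy and the comparison of severity scores between two patient clusters can be sketched as follows. This is a hedged illustration, not the analysis code behind the reported results. It assumes that measurement times are given in years since diagnosis, that a patient with a very short record is assigned a minimum observation span of one year (the hypothetical `min_years` parameter), and that the comparison is a two-sided rank-sum (Mann-Whitney) test using the normal approximation with a tie correction and no continuity correction.

```python
import math


def severity_score(test_times_in_years, min_years=1.0):
    """Number of lab tests per year of observation.
    test_times_in_years holds one entry per lab test, regardless of test type.
    min_years is an assumed floor so very short records do not inflate the score."""
    if not test_times_in_years:
        return 0.0
    span = max(test_times_in_years) - min(test_times_in_years)
    return len(test_times_in_years) / max(span, min_years)


def rank_sum_test(xs, ys):
    """Two-sided rank-sum test with normal approximation.
    Returns (U statistic for xs, p-value)."""
    pooled = sorted([(v, 0) for v in xs] + [(v, 1) for v in ys])
    n1, n2 = len(xs), len(ys)
    n = n1 + n2
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = avg_rank
        t = j - i
        tie_term += t ** 3 - t
        i = j
    r1 = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    u1 = r1 - n1 * (n1 + 1) / 2.0
    mu = n1 * n2 / 2.0
    var = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var == 0:
        return u1, 1.0
    z = (u1 - mu) / math.sqrt(var)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail of the standard normal
    return u1, p
```

In use, each patient's `severity_score` would be computed once, the scores would be split by cluster membership, and `rank_sum_test` would be applied to the two lists. For small clusters an exact test would be preferable to the normal approximation used here.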

Figure 3: Hierarchical cluster of laboratory tests

We next examined whether clustering patients by correlating clinarrays yielded significant subtypes of disease, and whether these subtypes corresponded to patients with differing disease severity. After calculating a severity score for each patient, we applied hierarchical clustering with average agglomeration to each disease-specific matrix to cluster patients based on their clinarrays, as described above (Figure 4). The resulting hierarchical clustering of patients broadly demonstrated two subtypes of disease in each of the three chronic diseases. We retrieved the severity scores for all patients within both subtypes and applied the Wilcoxon test to determine whether there was any significant difference between the two groups. We find that the patients in the two discovered disease subtypes have statistically significant differences in severity of cystic fibrosis (mean severe = , mean less-severe = 50.81, p = 4.29 × 10^-9) as