Big Challenges of Big Data - What are the statistical tasks for the precision medicine era?

Transcription

1 Big Challenges of Big Data - What are the statistical tasks for the precision medicine era? Oct 18, 2015 Yu Shyr, Ph.D. Vanderbilt Center for Quantitative Sciences

2 Highlights: Overview of the BIG data in biomedical research. Future of the BIG data in biomedical research. Statistical challenges & tasks. Vanderbilt University Precision Medicine Initiative.

3

4

5 President Obama's Precision Medicine Initiative, January 30th, 2015. The President's 2016 Budget will provide a $215 million investment to support this effort, including: $130 million to NIH for development of a voluntary national research cohort of a million or more volunteers to propel our understanding of health and disease and set the foundation for a new way of doing research through engaged participants and open, responsible data sharing. $70 million to the National Cancer Institute (NCI), part of NIH, to scale up efforts to identify genomic drivers in cancer and apply that knowledge in the development of more effective approaches to cancer treatment.

6 President Obama's Precision Medicine Initiative, January 30th, 2015. $10 million to FDA to acquire additional expertise and advance the development of high quality, curated databases to support the regulatory structure needed to advance innovation in precision medicine and protect public health. $5 million to ONC to support the development of interoperability standards and requirements that address privacy and enable secure exchange of data across systems.

7 Omics biomedical research Microarray: cDNA (about 5,000 variables), Affymetrix U133 Plus 2.0 (about 45,000 variables). SNPs (about 500,000 to 2,000,000 variables). Next Generation Sequencing (?)

8 Storage of the Data? cDNA, Microarray, SNPs. NGseq raw imaging data: > 2 TB per sample. RNAseq or Exome seq data: 10 GB per sample (raw data), GB during the processing. Whole genome seq: 200 GB per sample (raw data), GB during the processing.

9 Raw 1:N:0:ATCACG NTGGAGTCCTAGGCACAGCTCTAAGCCTCCTTATTCGAGCCGAGCTGGGCC + #4=DDDDDDDDDDE<DAEEEIDFEIEIEIEIIIIIIDEDDDDA@DDDDII@
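
The raw read above looks like a FASTQ record: a header, the base calls, a "+" separator, and one quality character per base. As a minimal sketch, assuming the common Phred+33 quality encoding (the slide does not state the encoding) and using hypothetical function names and a hypothetical toy record, such a record could be decoded like this:

```python
def phred33_scores(quality):
    """Decode a quality string into integer scores, assuming the Phred+33 offset."""
    return [ord(ch) - 33 for ch in quality]


def parse_fastq_record(lines):
    """Parse one four-line FASTQ record into a dict."""
    if len(lines) != 4:
        raise ValueError("a FASTQ record has exactly four lines")
    header, sequence, separator, quality = (line.strip() for line in lines)
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("malformed FASTQ record")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality strings differ in length")
    return {
        "id": header[1:],
        "sequence": sequence,
        "scores": phred33_scores(quality),
    }


# Hypothetical toy record (not the read shown on the slide).
record = parse_fastq_record(["@read1 1:N:0:ATCACG", "NTGGA", "+", "#4=DI"])
print(record["scores"])  # [2, 19, 28, 35, 40]
```

Low scores such as the leading 2 typically accompany an uncalled base (N), which is one reason raw reads are filtered or trimmed before analysis.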

10

11 RNA Sequencing

12 Why is RNAseq data more difficult to analyze? There are a lot of zeros in the data (count data). The range of the count data is very wide. Large variation. Usually a small sample size. Need to ensure fair comparisons between conditions, sometimes also between genes.
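
To illustrate why fair comparisons between conditions need care, here is a minimal sketch that rescales raw counts to counts per million (CPM). CPM is an assumption chosen for illustration, not a method named on the slide, and the two samples are hypothetical. They differ tenfold in sequencing depth but have identical relative expression, which only becomes visible after scaling.

```python
def counts_per_million(counts):
    """Scale raw read counts by library size so samples are comparable."""
    library_size = sum(counts)
    if library_size == 0:
        raise ValueError("sample has no reads")
    return [count * 1_000_000 / library_size for count in counts]


# Hypothetical samples: same composition, different sequencing depth.
sample_a = [0, 10, 990]      # 1,000 reads in total
sample_b = [0, 100, 9900]    # 10,000 reads in total

cpm_a = counts_per_million(sample_a)
cpm_b = counts_per_million(sample_b)
print(cpm_a)  # [0.0, 10000.0, 990000.0]
print(cpm_b)  # [0.0, 10000.0, 990000.0]
```

Note that the zero count stays zero under any scaling, so the excess of zeros listed above is not removed by normalization alone.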

13 NGS Data Analysis

14 Culture of Reproducibility In 2015, the Institute of Medicine of the National Academies formed a committee to study the Clinical Development and Use of Biomarkers for Molecularly Targeted Therapies. In testimony before Congress on March 5th, 2013, Bruce Alberts, then the editor-in-chief of Science, outlined what needs to be done to bolster the credibility of the scientific enterprise. Journals must do more to enforce standards. Budding scientists must be taught technical skills, including statistics, and must be imbued with skepticism towards their own results and those of others.

15 This should have been a warning that the big data were over-fitting the small number of cases, a standard concern in data analysis.

16 Using the NCI60 to Predict Sensitivity Potti et al (2006), Nature Medicine, 12: The main conclusion is that we can use microarray data from cell lines (the NCI60) to define drug response signatures, which can be used to predict whether patients will respond. They provide examples using 7 commonly used agents.

17 Top Headlines The Cancer Letter (7/23/2010) Thirty-three biostatisticians sent a letter to NCI Director Harold Varmus urging the organization to suspend three trials until a more rigorous investigation of Potti's work is completed.

18 Top Headlines The Cancer Letter (7/23/2010) A Baron, K Bandeen-Roche, D Berry, J Bryan, V Carey, K Chaloner, M Delorenzi, B Efron, R Elston, D Ghosh, J Goldberg, S Goodman, F Harrell, S Hilsenbeck, W Huber, R Irizarry, C Kendziorski, M Kosorok, T Louis, JS Marron, M Newton, M Ochs, G Parmigiani, J Quackenbush, G Rosner, I Ruczinski, Y Shyr, S Skates, TP Speed, JD Storey, Z Szallasi, R Tibshirani, S Zeger

19 From: William T Barry [mailto:bill.barry@duke.edu] Sent: Thursday, November 18, :10 AM To: Shyr, Yu Subject: Request from Duke University's Institute for Genome Sciences and Policy Dear Dr Shyr, Duke University's Institute for Genome Sciences and Policy (Duke IGSP) currently has 3 actively enrolling genomics cancer trials that are monitored by an independent, 5-member Data Safety and Monitoring Board-Oversight Committee (DSMB-OC). The primary objective of these trials is validation of genomic biomarkers in a prospective clinical setting. I invite your participation to serve on this Board. Duke IGSP seeks members with specific professional expertise and who are completely independent of financial or scientific interest or other potential conflict of interest with the clinical genomic studies or Duke University. The DSMB-OC meets three times a year not only to assure patient safety by reviewing enrollment and safety data, but also to review trial procedures and processes. Duke IGSP would welcome your participation to serve on its DSMB-OC.

20 What did we learn? The most common mistakes are simple. Confounding in the Experimental Design: Mixing up the sample labels. Mixing up the gene labels. Mixing up the group labels. 26 papers in very top journals were withdrawn (13 complete and 13 partial withdrawals). You need at least one quantitative scientist in your team.
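
Label mix-ups of the kind listed above can often be caught by a mechanical check before any modeling starts. The following is a minimal sketch, assuming sample identifiers are available both as expression-matrix column names and as rows of a phenotype table; the function name and the identifiers are hypothetical.

```python
def check_sample_alignment(expression_ids, phenotype_ids):
    """Raise ValueError unless both ID lists match in content and order."""
    expression_ids = list(expression_ids)
    phenotype_ids = list(phenotype_ids)
    if len(set(expression_ids)) != len(expression_ids):
        raise ValueError("duplicate sample IDs in expression data")
    unmatched = set(expression_ids) ^ set(phenotype_ids)
    if unmatched:
        raise ValueError(f"IDs present in only one table: {sorted(unmatched)}")
    if expression_ids != phenotype_ids:
        raise ValueError("same samples, different order: labels would be shifted")


# Hypothetical identifiers.
check_sample_alignment(["S1", "S2", "S3"], ["S1", "S2", "S3"])  # passes silently

try:
    check_sample_alignment(["S1", "S2", "S3"], ["S2", "S1", "S3"])
except ValueError as error:
    message = str(error)
    print(message)
```

A check like this cannot detect labels that were already wrong at the bench, but it does guard against the simple switches and offsets that arise when tables are merged by position rather than by identifier.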

21 The log files of the statistical analyses (not the results) should be added to the supplemental data. This will help readers understand the detailed statistical analysis procedures.

22 Recent issues in the reproducibility of computational research have surfaced: Scientific papers commonly leave out experimental details necessary for reproduction. Studies have shown difficulty replicating published experimental results. Recent increase in retracted papers. High number of failing clinical trials.

23 Culture of Reproducibility To increase the trust in computational research, it is necessary for individual researchers, institutions, funding bodies, and journals to establish a culture of reproducibility. At a minimum, research should be sufficiently documented for the researchers themselves to reproduce their results.

24 Rule 1: For every result, keep track of how it was produced. Rule 2: Avoid manual data manipulation steps. Rule 3: Archive the exact versions of all external programs used. Rule 4: Version control all custom scripts (Subversion, Git). Rule 5: Record all intermediate results, when possible in standardized formats.

25 Rule 6: For analyses with randomness, note underlying random seed. Rule 7: Always store raw data behind plots. Rule 8: Generate hierarchical analysis output, allowing layers of increasing detail to be inspected. Rule 9: Connect textual statements to underlying results. Rule 10: Provide public access to scripts, runs, and results.
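
Several of these rules can be followed with very little code. The sketch below is one possible illustration, not a prescribed workflow: it fixes and records the random seed (Rule 6), records the interpreter version (Rule 3), and keeps the raw values behind a summary statistic in a standard format (Rules 5 and 7). The function name, the seed value, and the record layout are assumptions made for the example.

```python
import json
import platform
import random


def run_analysis(seed):
    """Run a toy stochastic analysis and return a self-describing record."""
    rng = random.Random(seed)  # local generator, so the seed fully determines the run
    values = [rng.gauss(0.0, 1.0) for _ in range(5)]
    return {
        "seed": seed,
        "python_version": platform.python_version(),
        "raw_values": values,  # the data behind any plot or summary
        "mean": sum(values) / len(values),
    }


first_run = run_analysis(seed=2015)
second_run = run_analysis(seed=2015)

# The same seed reproduces the same numbers exactly.
print(first_run == second_run)  # True

# A JSON record like this could be archived alongside the results.
log_text = json.dumps(first_run, indent=2)
```

Writing `log_text` to a file next to the results gives readers the kind of analysis log that can be attached as supplemental data.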

26 Microbiome and PheWAS

27

28 The launch of the US BRAIN and European Human Brain Projects coincides with growing international efforts toward transparency and increased access to publicly funded research in the neurosciences. However, big science efforts are not the only drivers of data-sharing needs, as neuroscientists across the full spectrum of research grapple with the overwhelming volume of data being generated daily and a scientific environment that is increasingly focused on collaboration.

29 The authors consider the issue of sharing of the richly diverse and heterogeneous small data sets produced by individual neuroscientists, so-called long-tail data. They discuss the utility of these data, the diversity of repositories and options available for sharing such data, and emerging best practices.

30 Ridge Regression Analysis Ridge regression reduces the variability of the coefficient estimates by shrinking the coefficients, resulting in better prediction accuracy at the cost of usually only a small increase in bias. In Ridge regression, the coefficients are shrunken towards zero, but will never become exactly zero. So, when the number of predictors is large, Ridge regression will not provide a sparse model that is easy to interpret.

31 Regression Analysis The Lasso was developed by Tibshirani (1996) to improve both prediction accuracy and model interpretability by combining the nice features of Ridge regression and subset selection. The Lasso reduces the variability of the estimates by shrinking the coefficients and at the same time produces interpretable models by shrinking some coefficients to exactly zero.

32 Elastic Net Analysis Zou and Hastie (2005) proposed the Elastic Net to overcome the limitations of the Lasso in some situations. The Elastic Net also combines shrinkage and variable selection, and in addition encourages grouping of variables: groups of highly correlated variables tend to be selected together, where the Lasso would only select one variable of the group.

33 Regression Analysis Also, in the case P >> N, Lasso algorithms are limited because at most N variables can be selected. Zou and Hastie (2005) conjecture that, whenever Ridge regression improves on OLS, the Elastic Net will improve on the Lasso.

34 Lasso and Elastic Net Elastic net is a related technique. Elastic net is a hybrid of ridge regression and lasso regularization. Like lasso, elastic net can generate reduced models by generating zero-valued coefficients. Empirical studies have suggested that the elastic net technique can outperform lasso on data with highly correlated predictors.
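
The behavior with highly correlated predictors can be seen in a toy case. The following is a minimal sketch of cyclic coordinate descent, assuming centered data (so no intercept), the objective (1/(2N)) * RSS + lam * (alpha * L1 + (1 - alpha)/2 * squared L2), and hypothetical data in which the first two predictors are identical. The function names are illustrative; this is not a production solver.

```python
def soft_threshold(z, gamma):
    """Shrink z toward zero by gamma, returning exactly 0.0 inside [-gamma, gamma]."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


def elastic_net(X, y, lam, alpha, n_sweeps=200):
    """Cyclic coordinate descent; alpha=1 gives the lasso, alpha=0 gives ridge."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # equals y - X @ beta while beta is all zeros
    mean_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            denom = mean_sq[j] + lam * (1.0 - alpha)
            if denom == 0.0:
                continue  # all-zero column under a pure L1 penalty
            # Correlation of predictor j with the residual that excludes its own fit.
            rho = sum(X[i][j] * (residual[i] + X[i][j] * beta[j]) for i in range(n)) / n
            updated = soft_threshold(rho, lam * alpha) / denom
            delta = updated - beta[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                beta[j] = updated
    return beta


# Hypothetical data: columns 0 and 1 are identical, column 2 is unrelated to y.
X = [[-1.0, -1.0, -1.0], [1.0, 1.0, -1.0], [-1.0, -1.0, 1.0], [1.0, 1.0, 1.0]]
y = [-2.0, 2.0, -2.0, 2.0]

lasso_beta = elastic_net(X, y, lam=0.5, alpha=1.0)
enet_beta = elastic_net(X, y, lam=0.5, alpha=0.5)
print(lasso_beta)  # [1.5, 0.0, 0.0]
print([round(b, 4) for b in enet_beta])  # [0.7778, 0.7778, 0.0]
```

In this run the lasso keeps only the first of the two identical predictors, while the elastic net splits the weight equally between them, which is the grouping effect described by Zou and Hastie. With exactly duplicated columns the lasso solution is not unique, so which duplicate survives depends on the update order.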

35 Definition of Ridge Regression, Lasso, EN The loss functions for Ridge regression, the Lasso, and the Elastic Net can be viewed as constrained versions of the ordinary least squares (OLS) regression loss function. In Ridge regression, the sum of squares of the coefficients is constrained as follows:

36 Definition of Ridge Regression, Lasso, EN The Lasso constrains the sum of the absolute values of the coefficients: with t_1 the Lasso tuning parameter.

37 Definition of Ridge Regression, Lasso, EN Finally, the Elastic Net combines the Ridge regression and the Lasso constraints:
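
The constraint equations on these three slides did not survive transcription. A reconstruction under the standard formulation is given below. It assumes N observations, p predictors, and centered data; t_1 is the Lasso tuning parameter named in the text, while the symbol t_2 for the Ridge bound is an assumption.

```latex
\begin{align*}
\text{Ridge:} \quad & \min_{\beta} \sum_{i=1}^{N} \Bigl( y_i - \sum_{j=1}^{p} x_{ij}\beta_j \Bigr)^2
  \quad \text{subject to} \quad \sum_{j=1}^{p} \beta_j^2 \le t_2 \\
\text{Lasso:} \quad & \min_{\beta} \sum_{i=1}^{N} \Bigl( y_i - \sum_{j=1}^{p} x_{ij}\beta_j \Bigr)^2
  \quad \text{subject to} \quad \sum_{j=1}^{p} \lvert \beta_j \rvert \le t_1 \\
\text{Elastic Net:} \quad & \min_{\beta} \sum_{i=1}^{N} \Bigl( y_i - \sum_{j=1}^{p} x_{ij}\beta_j \Bigr)^2
  \quad \text{subject to} \quad \sum_{j=1}^{p} \lvert \beta_j \rvert \le t_1
  \ \text{ and } \ \sum_{j=1}^{p} \beta_j^2 \le t_2
\end{align*}
```

All three share the OLS sum of squares; only the constraint set differs.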

38 Summary Lasso The lasso technique solves this regularization problem. For a given value of λ, a nonnegative parameter, lasso solves the problem

39 Summary Lasso As λ increases, the number of nonzero components of β decreases. The lasso problem involves the L1 norm of β, as contrasted with the elastic net algorithm.
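
The lasso problem itself is missing from the transcript. One common way to write it, assuming N observations, an intercept β_0, and predictor vectors x_i of length p (this notation is an assumption, not recovered from the slide), is:

```latex
\[
\min_{\beta_0,\,\beta} \left( \frac{1}{2N} \sum_{i=1}^{N} \bigl( y_i - \beta_0 - x_i^{T}\beta \bigr)^2
  + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert \right)
\]
```

The penalty is λ times the L1 norm of β, which is what drives coefficients to exactly zero as λ grows.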

40 Summary Elastic Net The elastic net technique solves this regularization problem. For an α strictly between 0 and 1, and a nonnegative λ, elastic net solves the problem where

41 Summary Elastic Net Elastic net is the same as lasso when α = 1. As α shrinks toward 0, elastic net approaches ridge regression. For other values of α, the penalty term P_α(β) interpolates between the L1 norm of β and the squared L2 norm of β.
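
The elastic net problem and its penalty are also missing from the transcript. A form consistent with the statements above (lasso at α = 1, ridge as α approaches 0) is sketched below, assuming N observations, an intercept β_0, and p predictors; the exact scaling used on the original slide is an assumption.

```latex
\[
\min_{\beta_0,\,\beta} \left( \frac{1}{2N} \sum_{i=1}^{N} \bigl( y_i - \beta_0 - x_i^{T}\beta \bigr)^2
  + \lambda P_{\alpha}(\beta) \right),
\qquad
P_{\alpha}(\beta) = \frac{1-\alpha}{2} \lVert \beta \rVert_2^2 + \alpha \lVert \beta \rVert_1
  = \sum_{j=1}^{p} \left( \frac{1-\alpha}{2} \beta_j^2 + \alpha \lvert \beta_j \rvert \right)
\]
```

Setting α = 1 removes the squared L2 term and leaves the lasso penalty; setting α = 0 leaves only the ridge term.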

42 Limitations of the lasso The group lasso and sparse group lasso act like the lasso at the group level, depending on λ. In fact if the group sizes are all one, it reduces to the lasso. In group lasso, if a group of parameters is non-zero, they will all be non-zero. The sparse group lasso yields sparsity at both the group and individual feature levels, in order to select groups and predictors within a group.

43 Definition of Ridge Regression, Lasso, EN These constrained loss functions can also be written as penalized loss functions:
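
The penalized loss functions are missing from the transcript. Under the standard Lagrangian formulation they could be written as follows, where the penalty weights λ_1 (for the absolute values) and λ_2 (for the squares) are assumed symbols, and y and X denote the centered response vector and predictor matrix:

```latex
\begin{align*}
L_{\text{ridge}}(\beta) &= \lVert y - X\beta \rVert_2^2 + \lambda_2 \sum_{j=1}^{p} \beta_j^2 \\
L_{\text{lasso}}(\beta) &= \lVert y - X\beta \rVert_2^2 + \lambda_1 \sum_{j=1}^{p} \lvert \beta_j \rvert \\
L_{\text{EN}}(\beta) &= \lVert y - X\beta \rVert_2^2 + \lambda_1 \sum_{j=1}^{p} \lvert \beta_j \rvert
  + \lambda_2 \sum_{j=1}^{p} \beta_j^2
\end{align*}
```

Each penalty weight corresponds to a constraint bound: a larger λ is equivalent to a tighter bound on the corresponding sum.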

44 NATURE REVIEWS CANCER VOLUME 13 NOVEMBER 2013

45 Microbiome research is just one of many flavors of the big data projects that have become ubiquitous in the life sciences. Brain scientists are attempting to map all of the 86 billion neurons in the human brain and catalog the trillions of connections they make with other neurons. As science moves toward big data endeavors, so grows the concern that much of what is discovered is fool's gold.

46 Studying microbiome: 16S rDNA gene sequencing. 16S rRNA gene is found in all bacterial species. Variable sequence can be thought of as a molecular fingerprint. Can be used to identify bacterial genera and species. Degenerate primers are designed from the conserved region. Large public databases available for comparison.

47 Sequence clustering into OTUs (Operational Taxonomic Units)
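
As a rough illustration of the idea, the sketch below clusters reads greedily: each read joins the first existing cluster whose representative it matches closely enough, and otherwise starts a new cluster. It assumes equal-length reads compared position by position, which real OTU pipelines do not require because they align sequences first. The function names, the toy reads, and the 0.9 threshold are hypothetical (a 97% identity threshold is a common convention but is not stated on the slide).

```python
def identity(seq_a, seq_b):
    """Fraction of matching positions between two equal-length sequences."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(1 for a, b in zip(seq_a, seq_b) if a == b)
    return matches / len(seq_a)


def greedy_otu_clustering(sequences, threshold):
    """Assign each sequence to the first centroid it matches, else open a new OTU."""
    centroids = []
    assignments = []
    for seq in sequences:
        for otu_index, centroid in enumerate(centroids):
            if identity(seq, centroid) >= threshold:
                assignments.append(otu_index)
                break
        else:  # no centroid was similar enough
            centroids.append(seq)
            assignments.append(len(centroids) - 1)
    return centroids, assignments


# Hypothetical toy reads: the first two differ at one position out of ten.
reads = ["ACGTACGTAC", "ACGTACGTAA", "TTTTGGGGCC", "TTTTGGGGCC"]
centroids, assignments = greedy_otu_clustering(reads, threshold=0.9)
print(assignments)  # [0, 0, 1, 1]
```

Because the procedure is greedy, the result depends on read order; pipelines often sort reads by abundance first so that frequent sequences become the representatives.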

48

49

50 Statistical methods Sparse Dirichlet-multinomial Regression for simultaneous selection of microbiome-associated variables and their affected taxa. Kernel-based Regression Methods for testing the effect of microbiome composition on the clinical/biological outcome(s). Network analysis.

51 END

52 Questions