Big Challenges of Big Data - What are the statistical tasks for the precision medicine era?

1 Big Challenges of Big Data - What are the statistical tasks for the precision medicine era?
Oct 18, 2015
Yu Shyr, Ph.D.
Vanderbilt Center for Quantitative Sciences

2 Highlights
- Overview of the BIG data in biomedical research
- Future of the BIG data in biomedical research
- Statistical challenges & tasks
- Vanderbilt University Precision Medicine Initiative

5 President Obama's Precision Medicine Initiative (January 30th, 2015)
The President's 2016 Budget will provide a $215 million investment to support this effort, including:
- $130 million to NIH for development of a voluntary national research cohort of a million or more volunteers to propel our understanding of health and disease and set the foundation for a new way of doing research through engaged participants and open, responsible data sharing.
- $70 million to the National Cancer Institute (NCI), part of NIH, to scale up efforts to identify genomic drivers in cancer and apply that knowledge in the development of more effective approaches to cancer treatment.

6 President Obama's Precision Medicine Initiative (January 30th, 2015)
- $10 million to FDA to acquire additional expertise and advance the development of high quality, curated databases to support the regulatory structure needed to advance innovation in precision medicine and protect public health.
- $5 million to ONC to support the development of interoperability standards and requirements that address privacy and enable secure exchange of data across systems.

7 Omics biomedical research
- Microarray: cDNA (about 5,000 variables), Affymetrix U133 Plus 2.0 (about 45,000 variables)
- SNPs (about 500,000 to 2,000,000 variables)
- Next Generation Sequencing (?)

8 Storage of the Data?
- cDNA, Microarray, SNPs
- NGseq raw imaging data: > 2 TB per sample
- RNAseq or Exome seq data: 10 GB per sample (raw data), GB during the processing.
- Whole genome seq: 200 GB per sample (raw data), GB during the processing.

9 Raw
1:N:0:ATCACG
NTGGAGTCCTAGGCACAGCTCTAAGCCTCCTTATTCGAGCCGAGCTGGGCC
+
#4=DDDDDDDDDDE<DAEEEIDFEIEIEIEIIIIIIDEDDDDA@DDDDII@
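
The raw read above has the four-line layout of a FASTQ record: header, bases, a "+" separator, and one quality character per base. A minimal Python sketch of how such a record could be parsed, assuming the common Phred+33 quality encoding (the slide does not state the encoding) and using a shortened, hypothetical record with an invented read name:

```python
def phred33_scores(quality):
    """Convert a Phred+33 quality string to integer quality scores."""
    return [ord(ch) - 33 for ch in quality]


def parse_fastq_record(lines):
    """Parse one 4-line FASTQ record into a dict (assumes Phred+33)."""
    header, sequence, separator, quality = (line.strip() for line in lines)
    if not header.startswith("@"):
        raise ValueError("FASTQ header must start with '@'")
    if not separator.startswith("+"):
        raise ValueError("third line of a FASTQ record must start with '+'")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality strings differ in length")
    return {
        "name": header[1:],
        "sequence": sequence,
        "scores": phred33_scores(quality),
    }


# Hypothetical, shortened record; the read name "READ_1" is invented.
record = ["@READ_1 1:N:0:ATCACG", "NTGGAGTC", "+", "#4=DDDII"]
parsed = parse_fastq_record(record)
print(parsed["scores"])
```

Under this encoding a low score such as the one attached to the leading "N" marks a base call the sequencer could not make reliably.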

11 RNA Sequencing

12 Why is RNAseq data more difficult to analyze?
- There are a lot of zeros in the data (count data)
- The range of the count data is very wide
- Large variation
- Usually a small sample size
- Need to ensure fair comparisons between conditions, sometimes also between genes.
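
One reason comparisons between conditions can be unfair is that samples are sequenced to different depths. A minimal Python sketch of counts-per-million (CPM) scaling, which is only one simple normalization and is not named on the slide; the gene names and counts are hypothetical, and dedicated methods such as TMM or median-of-ratios handle cases this sketch does not:

```python
def counts_per_million(counts_by_sample):
    """Scale raw read counts by each sample's library size (total reads)."""
    cpm = {}
    for sample, counts in counts_by_sample.items():
        library_size = sum(counts.values())
        cpm[sample] = {
            gene: 1e6 * count / library_size for gene, count in counts.items()
        }
    return cpm


# Hypothetical counts: "treated" was sequenced twice as deeply as "control".
raw_counts = {
    "control": {"geneA": 10, "geneB": 0, "geneC": 990},
    "treated": {"geneA": 20, "geneB": 0, "geneC": 1980},
}
cpm = counts_per_million(raw_counts)
print(cpm["control"]["geneA"], cpm["treated"]["geneA"])
```

The raw count of geneA doubles between samples, but after scaling both samples show the same CPM, so the apparent difference was due to sequencing depth alone. Comparing between genes would additionally require accounting for gene length.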

13 NGS Data Analysis

14 Culture of Reproducibility
In 2015, the Institute of Medicine of the National Academies formed a committee to study the Clinical Development and Use of Biomarkers for Molecularly Targeted Therapies.
In testimony before Congress on March 5th, 2013, Bruce Alberts, then the editor-in-chief of Science, outlined what needs to be done to bolster the credibility of the scientific enterprise:
- Journals must do more to enforce standards.
- Budding scientists must be taught technical skills, including statistics, and must be imbued with skepticism towards their own results and those of others.

15 This should have been a warning that the big data were over-fitting the small number of cases, a standard concern in data analysis.
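
The over-fitting concern can be illustrated with pure noise. The Python sketch below is an illustration under stated assumptions, not an analysis from the talk: it draws a random outcome and many random features that are unrelated to it by construction, then reports the strongest correlation found. The sample sizes, feature count, and seed are arbitrary choices.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def best_noise_correlation(n_cases, n_features, seed=0):
    """Largest |correlation| between a random outcome and pure-noise features."""
    rng = random.Random(seed)
    outcome = [rng.gauss(0.0, 1.0) for _ in range(n_cases)]
    best = 0.0
    for _ in range(n_features):
        feature = [rng.gauss(0.0, 1.0) for _ in range(n_cases)]
        best = max(best, abs(pearson(feature, outcome)))
    return best


few_cases = best_noise_correlation(n_cases=10, n_features=2000)
many_cases = best_noise_correlation(n_cases=200, n_features=2000)
print(round(few_cases, 2), round(many_cases, 2))
```

With only 10 cases, the best of 2,000 noise features looks strongly "predictive"; with 200 cases the best spurious correlation is much weaker. Selecting features on the same small data set used to judge them produces exactly this kind of false signal.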

16 Using the NCI60 to Predict Sensitivity
Potti et al. (2006), Nature Medicine, 12:
The main conclusion is that we can use microarray data from cell lines (the NCI60) to define drug response signatures, which can be used to predict whether patients will respond. They provide examples using 7 commonly used agents.

17 Top Headlines The Cancer Letter (7/23/2010) Thirty-three biostatisticians sent a letter to NCI Director Harold Varmus urging the organization to suspend three trials until a more rigorous investigation of Potti's work is completed.

18 Top Headlines The Cancer Letter (7/23/2010) A Baron, K Bandeen-Roche, D Berry, J Bryan, V Carey, K Chaloner, M Delorenzi, B Efron, R Elston, D Ghosh, J Goldberg, S Goodman, F Harrell, S Hilsenbeck, W Huber, R Irizarry, C Kendziorski, M Kosorok, T Louis, JS Marron, M Newton, M Ochs, G Parmigiani, J Quackenbush, G Rosner, I Ruczinski, Y Shyr, S Skates, TP Speed, JD Storey, Z Szallasi, R Tibshirani, S Zeger

19
From: William T Barry [mailto:bill.barry@duke.edu]
Sent: Thursday, November 18, :10 AM
To: Shyr, Yu
Subject: Request from Duke University's Institute for Genome Sciences and Policy
Dear Dr Shyr,
Duke University's Institute for Genome Sciences and Policy (Duke IGSP) currently has 3 actively enrolling genomics cancer trials that are monitored by an independent, 5-member Data Safety and Monitoring Board-Oversight Committee (DSMB-OC). The primary objective of these trials is validation of genomic biomarkers in a prospective clinical setting. I invite your participation to serve on this Board. Duke IGSP seeks members with specific professional expertise and who are completely independent of financial or scientific interest or other potential conflict of interest with the clinical genomic studies or Duke University. The DSMB-OC meets three times a year not only to assure patient safety by reviewing enrollment and safety data, but also to review trial procedures and processes. Duke IGSP would welcome your participation to serve on its DSMB-OC.

20 What did we learn?
The most common mistakes are simple.
Confounding in the Experimental Design:
- Mixing up the sample labels
- Mixing up the gene labels
- Mixing up the group labels
26 papers in very top journals were withdrawn (13 completely and 13 partially).
You need at least one quantitative scientist on your team.
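
Label mix-ups of this kind often come from pairing tables by row position. A small Python sketch of one defensive habit, joining measurements and group labels by explicit sample ID and refusing to proceed on any mismatch; the function name, IDs, and values are hypothetical and not from the talk:

```python
def align_by_sample_id(expression, phenotype):
    """Pair each sample's measurements with its group label by ID, never by position.

    expression: dict mapping sample ID -> list of measurements
    phenotype:  dict mapping sample ID -> group label
    """
    missing_labels = sorted(set(expression) - set(phenotype))
    missing_data = sorted(set(phenotype) - set(expression))
    if missing_labels or missing_data:
        raise ValueError(
            "sample ID mismatch: no label for %s, no data for %s"
            % (missing_labels, missing_data)
        )
    return [(sid, expression[sid], phenotype[sid]) for sid in sorted(expression)]


# Hypothetical example: the two tables list the samples in different orders.
expression = {"S2": [5.1, 0.3], "S1": [4.8, 0.2], "S3": [9.7, 3.1]}
phenotype = {"S1": "sensitive", "S3": "resistant", "S2": "sensitive"}
aligned = align_by_sample_id(expression, phenotype)
for sample_id, values, group in aligned:
    print(sample_id, group, values)
```

A check like this does not detect labels that were wrong at the source, but it does stop silent off-by-one shifts and reordering between files.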

21 The log files of the statistical analyses (not the results) should be added to the supplemental data. This will help readers understand the detailed statistical analysis procedures.

22 Recent issues in the reproducibility of computational research have surfaced:
- Scientific papers commonly leave out experimental details necessary for reproduction
- Studies have shown difficulty replicating published experimental results
- Recent increase in retracted papers
- High number of failing clinical trials

23 Culture of Reproducibility To increase the trust in computational research, it is necessary for individual researchers, institutions, funding bodies, and journals to establish a culture of reproducibility. At a minimum, research should be sufficiently documented for the researchers themselves to reproduce their results.

24
Rule 1: For every result, keep track of how it was produced
Rule 2: Avoid manual data manipulation steps
Rule 3: Archive the exact versions of all external programs used
Rule 4: Version control all custom scripts (Subversion, Git)
Rule 5: Record all intermediate results, when possible in standardized formats

25
Rule 6: For analyses with randomness, note underlying random seed
Rule 7: Always store raw data behind plots
Rule 8: Generate hierarchical analysis output, allowing layers of increasing detail to be inspected
Rule 9: Connect textual statements to underlying results
Rule 10: Provide public access to scripts, runs, and results
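
Several of these rules (tracking how a result was produced, archiving versions, noting the random seed) can be partly automated. A minimal Python sketch, assuming a hypothetical toy analysis and an arbitrary seed; the record fields are illustrative choices, not a prescribed format:

```python
import hashlib
import json
import platform
import random
import sys


def run_analysis(seed):
    """Toy stand-in for an analysis with randomness: mean of 100 random draws."""
    rng = random.Random(seed)  # Rule 6: the seed fully determines the draws
    sample = [rng.gauss(0.0, 1.0) for _ in range(100)]
    return sum(sample) / len(sample)


def provenance_record(seed, script_text, result):
    """Collect what is needed to trace a result back to how it was produced."""
    return {
        "seed": seed,
        "python_version": sys.version.split()[0],  # Rule 3: exact versions
        "platform": platform.platform(),
        "script_sha256": hashlib.sha256(script_text.encode("utf-8")).hexdigest(),
        "result": result,  # Rule 1: result stored next to its provenance
    }


SEED = 20151018  # arbitrary value chosen for this sketch
first_run = run_analysis(SEED)
second_run = run_analysis(SEED)
record = provenance_record(SEED, "run_analysis(seed)", first_run)
print(json.dumps(record, indent=2, sort_keys=True))
```

Re-running with the recorded seed reproduces the same number, and the JSON record can be archived alongside the scripts under version control (Rule 4).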

26 Microbiome and PheWAS

28 The launch of the US BRAIN and European Human Brain Projects coincides with growing international efforts toward transparency and increased access to publicly funded research in the neurosciences. However, big science efforts are not the only drivers of data-sharing needs, as neuroscientists across the full spectrum of research grapple with the overwhelming volume of data being generated daily and a scientific environment that is increasingly focused on collaboration.

29 The authors consider the issue of sharing the richly diverse and heterogeneous small data sets produced by individual neuroscientists, so-called long-tail data. They discuss the utility of these data, the diversity of repositories and options available for sharing such data, and emerging best practices.

30 Ridge Regression Analysis Ridge regression reduces the variability of the coefficient estimates by shrinking the coefficients, resulting in better prediction accuracy at the cost of usually only a small increase in bias. In Ridge regression, the coefficients are shrunken towards zero but never become exactly zero. So, when the number of predictors is large, Ridge regression will not provide a sparse model that is easy to interpret.
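
The "shrunken but never exactly zero" behaviour is easy to see with a single predictor. A minimal Python sketch, assuming centered data, no intercept, and the penalized objective sum((y - b*x)^2) + lam*b^2, whose minimizer is b = sum(x*y) / (sum(x^2) + lam); the data values are hypothetical:

```python
def ridge_slope(x, y, lam):
    """Ridge estimate for one centered predictor: sum(xy) / (sum(x^2) + lam)."""
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    return sxy / (sxx + lam)


# Hypothetical centered toy data with a true slope near 2.
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [-3.9, -2.1, 0.1, 1.9, 4.0]

penalties = [0.0, 1.0, 10.0, 100.0, 1000.0]
slopes = [ridge_slope(x, y, lam) for lam in penalties]
for lam, slope in zip(penalties, slopes):
    print(lam, round(slope, 4))
```

With lam = 0 the estimate is the OLS slope. Increasing lam pulls it toward zero, but the numerator never changes, so the estimate stays nonzero for every finite penalty.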

31 Regression Analysis The Lasso was developed by Tibshirani (1996) to improve both prediction accuracy and model interpretability by combining the nice features of Ridge regression and subset selection. The Lasso reduces the variability of the estimates by shrinking the coefficients and at the same time produces interpretable models by shrinking some coefficients to exactly zero.
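
The step that lets the Lasso set coefficients to exactly zero is soft-thresholding. A minimal Python sketch of cyclic coordinate descent, assuming the objective (1/(2N)) * RSS + lam * sum(|b_j|) with no intercept (a common scaling, not taken from the slide). The toy design is hypothetical and has orthogonal columns so the answer can be checked by hand:

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; exactly 0.0 inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_sweeps=100):
    """Cyclic coordinate descent for (1/(2N))*RSS + lam*sum|b_j| (no intercept)."""
    n, p = len(y), len(X[0])
    beta = [0.0] * p
    for _ in range(n_sweeps):
        for j in range(p):
            rho = 0.0  # correlation of feature j with the partial residual
            z = 0.0    # squared norm of feature j
            for i in range(n):
                others = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - others)
                z += X[i][j] ** 2
            beta[j] = soft_threshold(rho / n, lam) / (z / n)
    return beta


# Hypothetical design: y = 3*x0 + 0.2*x1 + 0*x2, columns mutually orthogonal.
X = [[1.0, 1.0, 1.0], [1.0, -1.0, -1.0], [-1.0, 1.0, -1.0], [-1.0, -1.0, 1.0]]
y = [3.2, 2.8, -2.8, -3.2]

unpenalized = lasso_coordinate_descent(X, y, lam=0.0)
sparse = lasso_coordinate_descent(X, y, lam=0.5)
empty = lasso_coordinate_descent(X, y, lam=4.0)
print(unpenalized, sparse, empty)
```

At lam = 0.5 the strong coefficient is shrunk from about 3 to about 2.5, while the weak and null coefficients become exactly 0.0. At lam = 4.0 every coefficient is zero.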

32 Elastic Net Analysis Zou and Hastie (2005) proposed the Elastic Net to overcome the limitations of the Lasso in some situations. The Elastic Net also combines shrinkage and variable selection, and in addition encourages grouping of variables: groups of highly correlated variables tend to be selected together, where the Lasso would only select one variable of the group.
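
The grouping effect can be seen in the extreme case of two identical predictors. A minimal Python sketch, assuming the objective (1/(2N)) * RSS + lam * ((1 - alpha)/2 * sum(b_j^2) + alpha * sum(|b_j|)) with no intercept, solved by cyclic coordinate descent; the data are hypothetical. Setting alpha = 1 gives the lasso:

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; exactly 0.0 inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def elastic_net_cd(X, y, lam, alpha, n_sweeps=200):
    """Coordinate descent for (1/(2N))*RSS + lam*((1-alpha)/2*|b|_2^2 + alpha*|b|_1)."""
    n, p = len(y), len(X[0])
    beta = [0.0] * p
    for _ in range(n_sweeps):
        for j in range(p):
            rho = 0.0
            z = 0.0
            for i in range(n):
                others = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - others)
                z += X[i][j] ** 2
            # L1 part soft-thresholds; L2 part inflates the denominator.
            beta[j] = soft_threshold(rho / n, lam * alpha) / (
                z / n + lam * (1.0 - alpha)
            )
    return beta


# Hypothetical data: two perfectly correlated (identical) predictors.
x = [1.0, -1.0, 1.0, -1.0]
X = [[v, v] for v in x]
y = [2.0 * v for v in x]

lasso_fit = elastic_net_cd(X, y, lam=0.5, alpha=1.0)
enet_fit = elastic_net_cd(X, y, lam=0.5, alpha=0.5)
print(lasso_fit, enet_fit)
```

In this run the lasso puts all weight on the first copy and zero on the second. With identical columns the lasso solution is not unique, and which copy is kept depends on update order. The elastic net objective is strictly convex, so it has one solution, and that solution gives both copies the same coefficient.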

33 Regression Analysis Also, in the case P >> N, Lasso algorithms are limited because at most N variables can be selected. Zou and Hastie (2005) conjecture that, whenever Ridge regression improves on OLS, the Elastic Net will improve on the Lasso.

34 Lasso and Elastic Net Elastic net is a related technique. Elastic net is a hybrid of ridge regression and lasso regularization. Like lasso, elastic net can generate reduced models by generating zero-valued coefficients. Empirical studies have suggested that the elastic net technique can outperform lasso on data with highly correlated predictors.

35 Definition of Ridge Regression, Lasso, EN The loss functions for Ridge regression, the Lasso, and the Elastic Net can be viewed as constrained versions of the ordinary least squares (OLS) regression loss function. In Ridge regression, the sum of squares of the coefficients is constrained as follows:
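
In standard notation, assuming N observations, p coefficients, and writing t_2 for the Ridge tuning parameter (a symbol chosen here by analogy with the Lasso's t_1), the constrained Ridge problem can be written as:

```latex
\min_{\beta_0,\,\beta}\; \sum_{i=1}^{N} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Bigr)^{2}
\quad \text{subject to} \quad \sum_{j=1}^{p} \beta_j^{2} \le t_2 .
```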

36 Definition of Ridge Regression, Lasso, EN The Lasso constrains the sum of the absolute values of the coefficients: $\sum_{j=1}^{p} |\beta_j| \le t_1$, with $t_1$ the Lasso tuning parameter.

37 Definition of Ridge Regression, Lasso, EN Finally, the Elastic Net combines the Ridge regression and the Lasso constraints:
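
With t_1 the Lasso tuning parameter and t_2 an assumed name for the Ridge tuning parameter, the combined Elastic Net constraints on the OLS loss can be written as:

```latex
\min_{\beta_0,\,\beta}\; \sum_{i=1}^{N} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Bigr)^{2}
\quad \text{subject to} \quad
\sum_{j=1}^{p} |\beta_j| \le t_1
\;\; \text{and} \;\;
\sum_{j=1}^{p} \beta_j^{2} \le t_2 .
```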

38 Summary Lasso The lasso technique solves this regularization problem. For a given value of λ, a nonnegative parameter, lasso solves the problem
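
A commonly used form of this problem, assuming N observations, p predictors, and an unpenalized intercept β_0 (the 1/(2N) scaling is a convention and may differ from the slide's), is:

```latex
\min_{\beta_0,\,\beta}\; \left( \frac{1}{2N} \sum_{i=1}^{N} \bigl( y_i - \beta_0 - x_i^{\top}\beta \bigr)^{2}
+ \lambda \sum_{j=1}^{p} |\beta_j| \right)
```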

39 Summary Lasso As λ increases, the number of nonzero components of β decreases. The lasso problem involves the L1 norm of β, as contrasted with the elastic net algorithm.

40 Summary Elastic Net The elastic net technique solves this regularization problem. For an α strictly between 0 and 1, and a nonnegative λ, elastic net solves the problem where
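
A commonly used form of the problem and its penalty term, assuming the same 1/(2N) scaling as is conventional for the lasso and consistent with the limiting cases α = 1 (lasso) and α → 0 (ridge), is:

```latex
\min_{\beta_0,\,\beta}\; \left( \frac{1}{2N} \sum_{i=1}^{N} \bigl( y_i - \beta_0 - x_i^{\top}\beta \bigr)^{2}
+ \lambda\, P_{\alpha}(\beta) \right),
\qquad
P_{\alpha}(\beta) = \frac{1-\alpha}{2}\,\lVert \beta \rVert_2^{2} + \alpha\,\lVert \beta \rVert_1
= \sum_{j=1}^{p} \Bigl( \frac{1-\alpha}{2}\,\beta_j^{2} + \alpha\,|\beta_j| \Bigr).
```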

41 Summary Elastic Net Elastic net is the same as lasso when α = 1. As α shrinks toward 0, elastic net approaches ridge regression. For other values of α, the penalty term P_α(β) interpolates between the L1 norm of β and the squared L2 norm of β.

42 Limitations of the lasso The group lasso and sparse group lasso act like the lasso at the group level, depending on λ. In fact, if the group sizes are all one, the group lasso reduces to the lasso. In the group lasso, if a group of parameters is non-zero, they will all be non-zero. The sparse group lasso yields sparsity at both the group and individual feature levels, in order to select groups and predictors within a group.
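
One common way to write the sparse group lasso, assuming the predictors are split into L groups with coefficient blocks β_l and design blocks X_l, and using two assumed penalty parameters λ_1 and λ_2 (group-size weights are omitted here for simplicity), is:

```latex
\min_{\beta}\; \Bigl\lVert y - \sum_{l=1}^{L} X_l \beta_l \Bigr\rVert_2^{2}
+ \lambda_1 \sum_{l=1}^{L} \lVert \beta_l \rVert_2
+ \lambda_2 \lVert \beta \rVert_1
```

Setting λ_2 = 0 gives a group lasso and setting λ_1 = 0 gives the lasso. When every group has size one, the norm of each block equals the absolute value of its single coefficient, so the group penalty becomes an L1 penalty.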

43 Definition of Ridge Regression, Lasso, EN These constrained loss functions can also be written as penalized loss functions:
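
In penalized (Lagrangian) form, writing RSS(β) for the OLS loss and using λ_1 and λ_2 as assumed names for the penalty parameters that correspond to the constraint bounds t_1 and t_2, the three criteria can be written as:

```latex
\begin{aligned}
\text{Ridge:} \quad & \mathrm{RSS}(\beta) + \lambda_2 \sum_{j=1}^{p} \beta_j^{2} \\
\text{Lasso:} \quad & \mathrm{RSS}(\beta) + \lambda_1 \sum_{j=1}^{p} |\beta_j| \\
\text{Elastic Net:} \quad & \mathrm{RSS}(\beta) + \lambda_1 \sum_{j=1}^{p} |\beta_j| + \lambda_2 \sum_{j=1}^{p} \beta_j^{2}
\end{aligned}
\qquad
\mathrm{RSS}(\beta) = \sum_{i=1}^{N} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Bigr)^{2}
```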

44 NATURE REVIEWS CANCER VOLUME 13 NOVEMBER 2013

45 Microbiome research is just one of many flavors of the big data projects that have become ubiquitous in the life sciences. Brain scientists are attempting to map all of the 86 billion neurons in the human brain and catalog the trillions of connections they make with other neurons. As science moves toward big data endeavors, so grows the concern that much of what is discovered is fool's gold.

46 Studying the microbiome: 16S rDNA gene sequencing
- The 16S rRNA gene is found in all bacterial species.
- The variable sequence can be thought of as a molecular fingerprint.
- It can be used to identify bacterial genera and species.
- Degenerate primers are designed from the conserved region.
- Large public databases are available for comparison.

47 Sequence clustering into OTUs (Operational Taxonomic Units)
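
The idea of grouping reads into OTUs can be sketched as greedy clustering against centroid sequences. The Python example below is a toy illustration under strong assumptions: reads are already aligned and of equal length, identity is the fraction of matching positions, and the threshold is a parameter (97% is a conventional choice, not one stated on the slide). Real OTU pickers align sequences and handle far more cases.

```python
def identity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("this sketch assumes equal-length, pre-aligned reads")
    matches = sum(1 for base_a, base_b in zip(a, b) if base_a == base_b)
    return matches / len(a)


def greedy_otu_clustering(sequences, threshold=0.97):
    """Assign each read to the first centroid it matches, else start a new OTU."""
    centroids = []
    clusters = []
    for seq in sequences:
        for index, centroid in enumerate(centroids):
            if identity(seq, centroid) >= threshold:
                clusters[index].append(seq)
                break
        else:
            # No existing centroid is similar enough: this read seeds a new OTU.
            centroids.append(seq)
            clusters.append([seq])
    return clusters


# Hypothetical 10-base reads; a 90% threshold is used so one mismatch is allowed.
reads = ["ACGTACGTAC", "ACGTACGTAA", "TTTTTTTTTT", "ACGTACGTAC"]
otus = greedy_otu_clustering(reads, threshold=0.9)
print(len(otus), [len(cluster) for cluster in otus])
```

Because assignment is greedy, the result depends on read order; tools in practice often sort reads by abundance or length first.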

50 Statistical methods
- Sparse Dirichlet-multinomial Regression for simultaneous selection of microbiome-associated variables and their affected taxa
- Kernel-based Regression Methods for testing the effect of microbiome composition on the clinical/biological outcome(s)
- Network analysis

51 END

52 Questions


Protein Protein Interaction Networks

Protein Protein Interaction Networks Functional Pattern Mining from Genome Scale Protein Protein Interaction Networks Young-Rae Cho, Ph.D. Assistant Professor Department of Computer Science Baylor University it My Definition of Bioinformatics

More information

Model Validation Techniques

Model Validation Techniques Model Validation Techniques Kevin Mahoney, FCAS kmahoney@ travelers.com CAS RPM Seminar March 17, 2010 Uses of Statistical Models in P/C Insurance Examples of Applications Determine expected loss cost

More information

Correlational Research

Correlational Research Correlational Research Chapter Fifteen Correlational Research Chapter Fifteen Bring folder of readings The Nature of Correlational Research Correlational Research is also known as Associational Research.

More information

2019 Healthcare That Works for All

2019 Healthcare That Works for All 2019 Healthcare That Works for All This paper is one of a series describing what a decade of successful change in healthcare could look like in 2019. Each paper focuses on one aspect of healthcare. To

More information

FlipFlop: Fast Lasso-based Isoform Prediction as a Flow Problem

FlipFlop: Fast Lasso-based Isoform Prediction as a Flow Problem FlipFlop: Fast Lasso-based Isoform Prediction as a Flow Problem Elsa Bernard Laurent Jacob Julien Mairal Jean-Philippe Vert September 24, 2013 Abstract FlipFlop implements a fast method for de novo transcript

More information

Service courses for graduate students in degree programs other than the MS or PhD programs in Biostatistics.

Service courses for graduate students in degree programs other than the MS or PhD programs in Biostatistics. Course Catalog In order to be assured that all prerequisites are met, students must acquire a permission number from the education coordinator prior to enrolling in any Biostatistics course. Courses are

More information

BIOINF 525 Winter 2016 Foundations of Bioinformatics and Systems Biology http://tinyurl.com/bioinf525-w16

BIOINF 525 Winter 2016 Foundations of Bioinformatics and Systems Biology http://tinyurl.com/bioinf525-w16 Course Director: Dr. Barry Grant (DCM&B, bjgrant@med.umich.edu) Description: This is a three module course covering (1) Foundations of Bioinformatics, (2) Statistics in Bioinformatics, and (3) Systems

More information

Data Mining Builds Process Understanding for Vaccine Manufacturing

Data Mining Builds Process Understanding for Vaccine Manufacturing Data Mining Builds Process Understanding for Vaccine Manufacturing WCBP 2009 Current Topics in Vaccine Development January 14, 2009 Julia O Neill, Principal Engineer Merck & Co., Inc. Global Vaccine Technology

More information

Gene Expression Analysis

Gene Expression Analysis Gene Expression Analysis Jie Peng Department of Statistics University of California, Davis May 2012 RNA expression technologies High-throughput technologies to measure the expression levels of thousands

More information

Clustering. Adrian Groza. Department of Computer Science Technical University of Cluj-Napoca

Clustering. Adrian Groza. Department of Computer Science Technical University of Cluj-Napoca Clustering Adrian Groza Department of Computer Science Technical University of Cluj-Napoca Outline 1 Cluster Analysis What is Datamining? Cluster Analysis 2 K-means 3 Hierarchical Clustering What is Datamining?

More information

PREDA S4-classes. Francesco Ferrari October 13, 2015

PREDA S4-classes. Francesco Ferrari October 13, 2015 PREDA S4-classes Francesco Ferrari October 13, 2015 Abstract This document provides a description of custom S4 classes used to manage data structures for PREDA: an R package for Position RElated Data Analysis.

More information

Big data in macroeconomics Lucrezia Reichlin London Business School and now-casting economics ltd. COEURE workshop Brussels 3-4 July 2015

Big data in macroeconomics Lucrezia Reichlin London Business School and now-casting economics ltd. COEURE workshop Brussels 3-4 July 2015 Big data in macroeconomics Lucrezia Reichlin London Business School and now-casting economics ltd COEURE workshop Brussels 3-4 July 2015 WHAT IS BIG DATA IN ECONOMICS? Frank Diebold claimed to have introduced

More information

Towards a Big Data Taxonomy. Bill Mandrick, PhD Data Tactics Version 26_August_2013

Towards a Big Data Taxonomy. Bill Mandrick, PhD Data Tactics Version 26_August_2013 Towards a Big Data Taxonomy Bill Mandrick, PhD Data Tactics Version 26_August_2013 Scientific Taxonomies Represent Types of Processes Types of Objects Physical Objects Information Artifacts Types of Characteristics

More information

Guidance for Industry

Guidance for Industry Guidance for Industry Q2B Validation of Analytical Procedures: Methodology November 1996 ICH Guidance for Industry Q2B Validation of Analytical Procedures: Methodology Additional copies are available from:

More information

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R Practical Data Science with Azure Machine Learning, SQL Data Mining, and R Overview This 4-day class is the first of the two data science courses taught by Rafal Lukawiecki. Some of the topics will be

More information

Ethical Principles in Clinical Research. Christine Grady Department of Bioethics NIH Clinical Center

Ethical Principles in Clinical Research. Christine Grady Department of Bioethics NIH Clinical Center Ethical Principles in Clinical Research Christine Grady Department of Bioethics NIH Clinical Center 1 Ethical principles Are these studies ethical? How do we know? Ethics of clinical research The goal

More information

MASCOT Search Results Interpretation

MASCOT Search Results Interpretation The Mascot protein identification program (Matrix Science, Ltd.) uses statistical methods to assess the validity of a match. MS/MS data is not ideal. That is, there are unassignable peaks (noise) and usually

More information

Genomic Medicine The Future of Cancer Care. Shayma Master Kazmi, M.D. Medical Oncology/Hematology Cancer Treatment Centers of America

Genomic Medicine The Future of Cancer Care. Shayma Master Kazmi, M.D. Medical Oncology/Hematology Cancer Treatment Centers of America Genomic Medicine The Future of Cancer Care Shayma Master Kazmi, M.D. Medical Oncology/Hematology Cancer Treatment Centers of America Personalized Medicine Personalized health care is a broad term for interventions

More information

Florida Study of Career and Technical Education

Florida Study of Career and Technical Education Florida Study of Career and Technical Education Final Report Louis Jacobson, Ph.D. Christine Mokher, Ph.D. 2014 IRM-2014-U-008790 Approved for Distribution Unlimited This document represents the best opinion

More information

Machine Learning Methods for Demand Estimation

Machine Learning Methods for Demand Estimation Machine Learning Methods for Demand Estimation By Patrick Bajari, Denis Nekipelov, Stephen P. Ryan, and Miaoyu Yang Over the past decade, there has been a high level of interest in modeling consumer behavior

More information

6.2.8 Neural networks for data mining

6.2.8 Neural networks for data mining 6.2.8 Neural networks for data mining Walter Kosters 1 In many application areas neural networks are known to be valuable tools. This also holds for data mining. In this chapter we discuss the use of neural

More information

How Can Institutions Foster OMICS Research While Protecting Patients?

How Can Institutions Foster OMICS Research While Protecting Patients? IOM Workshop on the Review of Omics-Based Tests for Predicting Patient Outcomes in Clinical Trials How Can Institutions Foster OMICS Research While Protecting Patients? E. Albert Reece, MD, PhD, MBA Vice

More information

Focusing on results not data comprehensive data analysis for targeted next generation sequencing

Focusing on results not data comprehensive data analysis for targeted next generation sequencing Focusing on results not data comprehensive data analysis for targeted next generation sequencing Daniel Swan, Jolyon Holdstock, Angela Matchan, Richard Stark, John Shovelton, Duarte Mohla and Simon Hughes

More information

Summary of Discussion on Non-clinical Pharmacology Studies on Anticancer Drugs

Summary of Discussion on Non-clinical Pharmacology Studies on Anticancer Drugs Provisional Translation (as of January 27, 2014)* November 15, 2013 Pharmaceuticals and Bio-products Subcommittees, Science Board Summary of Discussion on Non-clinical Pharmacology Studies on Anticancer

More information

Industry Environment and Concepts for Forecasting 1

Industry Environment and Concepts for Forecasting 1 Table of Contents Industry Environment and Concepts for Forecasting 1 Forecasting Methods Overview...2 Multilevel Forecasting...3 Demand Forecasting...4 Integrating Information...5 Simplifying the Forecast...6

More information

BIOS 6660: Analysis of Biomedical Big Data Using R and Bioconductor, Fall 2015 Computer Lab: Education 2 North Room 2201DE (TTh 10:30 to 11:50 am)

BIOS 6660: Analysis of Biomedical Big Data Using R and Bioconductor, Fall 2015 Computer Lab: Education 2 North Room 2201DE (TTh 10:30 to 11:50 am) BIOS 6660: Analysis of Biomedical Big Data Using R and Bioconductor, Fall 2015 Computer Lab: Education 2 North Room 2201DE (TTh 10:30 to 11:50 am) Course Instructor: Dr. Tzu L. Phang, Assistant Professor

More information

Current reporting in published research

Current reporting in published research Current reporting in published research Doug Altman Centre for Statistics in Medicine, Oxford, UK and EQUATOR Network Research article A published research article is a permanent record that will be used

More information

Using the Grid for the interactive workflow management in biomedicine. Andrea Schenone BIOLAB DIST University of Genova

Using the Grid for the interactive workflow management in biomedicine. Andrea Schenone BIOLAB DIST University of Genova Using the Grid for the interactive workflow management in biomedicine Andrea Schenone BIOLAB DIST University of Genova overview background requirements solution case study results background A multilevel

More information

The degrees of freedom of the Lasso in underdetermined linear regression models

The degrees of freedom of the Lasso in underdetermined linear regression models The degrees of freedom of the Lasso in underdetermined linear regression models C. Dossal (1), M. Kachour (2), J. Fadili (2), G. Peyré (3), C. Chesneau (4) (1) IMB, Université Bordeaux 1 (2) GREYC, ENSICAEN

More information

Wir schaffen Wissen heute für morgen. Workshop Research Integrity at PSI 2013 Data management Tuesday June 4 2013, 13.30 17.00. Louis Tiefenauer, PSI

Wir schaffen Wissen heute für morgen. Workshop Research Integrity at PSI 2013 Data management Tuesday June 4 2013, 13.30 17.00. Louis Tiefenauer, PSI Wir schaffen Wissen heute für morgen Workshop Research Integrity at PSI 2013 Data management Tuesday June 4 2013, 13.30 17.00 Louis Tiefenauer, PSI PSI, 10. Juni 2013 Program Dur. End Welcome by Thierry

More information

October 17, 2005. Elias Zerhouni, M.D. Director National Institutes of Health One Center Drive Suite 126 MSC 0148 Bethesda, MD 20892

October 17, 2005. Elias Zerhouni, M.D. Director National Institutes of Health One Center Drive Suite 126 MSC 0148 Bethesda, MD 20892 October 17, 2005 Elias Zerhouni, M.D. Director National Institutes of Health One Center Drive Suite 126 MSC 0148 Bethesda, MD 20892 Dear Dr. Zerhouni: The undersigned nonprofit medical and scientific societies

More information

Data Analysis on the ABI PRISM 7700 Sequence Detection System: Setting Baselines and Thresholds. Overview. Data Analysis Tutorial

Data Analysis on the ABI PRISM 7700 Sequence Detection System: Setting Baselines and Thresholds. Overview. Data Analysis Tutorial Data Analysis on the ABI PRISM 7700 Sequence Detection System: Setting Baselines and Thresholds Overview In order for accuracy and precision to be optimal, the assay must be properly evaluated and a few

More information

Double Degree Track in Neuroscience &International Public Policy at the University of Wisconsin-Madison

Double Degree Track in Neuroscience &International Public Policy at the University of Wisconsin-Madison Double Degree Track in Neuroscience &International Public Policy at the University of Wisconsin-Madison Purpose: The Neuroscience and Public Policy Program offers a double degree track that leads to the

More information

RT 2 Profiler PCR Array: Web-Based Data Analysis Tutorial

RT 2 Profiler PCR Array: Web-Based Data Analysis Tutorial RT 2 Profiler PCR Array: Web-Based Data Analysis Tutorial Samuel J. Rulli, Jr., Ph.D. qpcr-applications Scientist Samuel.Rulli@QIAGEN.com Pathway Focused Research from Sample Prep to Data Analysis! -2-

More information

STATE OF MICHIGAN DEPARTMENT OF INSURANCE AND FINANCIAL SERVICES Before the Director of Insurance and Financial Services

STATE OF MICHIGAN DEPARTMENT OF INSURANCE AND FINANCIAL SERVICES Before the Director of Insurance and Financial Services STATE OF MICHIGAN DEPARTMENT OF INSURANCE AND FINANCIAL SERVICES Before the Director of Insurance and Financial Services In the matter of: Petitioner, v Blue Care Network of Michigan, Respondent. File

More information

Integrated Resource Plan

Integrated Resource Plan Integrated Resource Plan March 19, 2004 PREPARED FOR KAUA I ISLAND UTILITY COOPERATIVE LCG Consulting 4962 El Camino Real, Suite 112 Los Altos, CA 94022 650-962-9670 1 IRP 1 ELECTRIC LOAD FORECASTING 1.1

More information

From Reads to Differentially Expressed Genes. The statistics of differential gene expression analysis using RNA-seq data

From Reads to Differentially Expressed Genes. The statistics of differential gene expression analysis using RNA-seq data From Reads to Differentially Expressed Genes The statistics of differential gene expression analysis using RNA-seq data experimental design data collection modeling statistical testing biological heterogeneity

More information

Shiny Server Pro: Regulatory Compliance and Validation Issues

Shiny Server Pro: Regulatory Compliance and Validation Issues Shiny Server Pro: Regulatory Compliance and Validation Issues A Guidance Document for the Use of Shiny Server Pro in Regulated Clinical Trial Environments June 19, 2014 RStudio, Inc. 250 Northern Ave.

More information

200627 - AC - Clinical Trials

200627 - AC - Clinical Trials Coordinating unit: Teaching unit: Academic year: Degree: ECTS credits: 2014 200 - FME - School of Mathematics and Statistics 715 - EIO - Department of Statistics and Operations Research MASTER'S DEGREE

More information

Basic Analysis of Microarray Data

Basic Analysis of Microarray Data Basic Analysis of Microarray Data A User Guide and Tutorial Scott A. Ness, Ph.D. Co-Director, Keck-UNM Genomics Resource and Dept. of Molecular Genetics and Microbiology University of New Mexico HSC Tel.

More information

G E N OM I C S S E RV I C ES

G E N OM I C S S E RV I C ES GENOMICS SERVICES THE NEW YORK GENOME CENTER NYGC is an independent non-profit implementing advanced genomic research to improve diagnosis and treatment of serious diseases. capabilities. N E X T- G E

More information

Introduction to transcriptome analysis using High Throughput Sequencing technologies (HTS)

Introduction to transcriptome analysis using High Throughput Sequencing technologies (HTS) Introduction to transcriptome analysis using High Throughput Sequencing technologies (HTS) A typical RNA Seq experiment Library construction Protocol variations Fragmentation methods RNA: nebulization,

More information