Open array data analysis: miRNA profiling in blood samples from patients suffering from heart diseases


CRG BIOINFORMATICS CORE FACILITIES

May 2015
Users: Begona Benito and Marta Tajes
Users center: IMIM
Analyst: Sarah Bonnin
Group leader: Julia Ponomarenko

1. Project

Scientific background, limitations:
Blood samples were collected from patients with either preserved or reduced ejection fraction, with or without atrial fibrillation. Some studies suggest that some miRNAs could be directly involved in the development of the disease.

Limitations:
* Because target molecules are highly diluted in blood samples, miRNAs are present at low concentration in plasma and are therefore difficult to detect.
* One of the challenges of miRNA profiling from serum or plasma is the lack of established housekeeping genes for data normalization.

Technology:
Platform: OpenArray, Life Technologies
Array: TaqMan OpenArray Human MicroRNA Panel, QuantStudio 12K Flex
Catalog number: 4470187
Content: 754 miRNAs + 4 controls, each control replicated 16 times

3 samples are loaded on each array, for a total of 15 arrays run in 4 different batches on the following dates:
- batch A, 3 arrays: 13/02/2015
- batch B, 4 arrays: 05/03/2015
- batch C, 4 arrays: 10/03/2015
- batch D, 4 arrays: 11/03/2015

Data and goal:
The experiment consists of 45 samples, divided into 5 experimental groups:
* 9 technical replicate samples: 9 samples pooled together and then run as technical replicates, to be able to study the technical variation (TechCtrl)
* 9 preserved ejection fraction (PEF = 4)
* 9 preserved ejection fraction with atrial fibrillation (PEF+AF = 3)
* 9 reduced ejection fraction (REF = 2)
* 9 reduced ejection fraction with atrial fibrillation (REF+AF = 1)

The users are mostly interested in the change of miRNA expression in the presence of atrial fibrillation, within each ejection fraction situation. Hence, we will first focus on comparing PEF+AF vs PEF and REF+AF vs REF.

Table 1 shows, for each sample, which experimental group it belongs to, on which array it was run, and in which batch.

Sample ID   Experimental group   Array   Batch
138-33      1 = REF+AF           ROK49   B
535-13      1 = REF+AF           ROK58   B
549-5       1 = REF+AF           ROK62   B
602-25      1 = REF+AF           ROK50   C
204-9       1 = REF+AF           RON52   C
271-24      1 = REF+AF           RON60   C
678-1       1 = REF+AF           RON50   D
686-17      1 = REF+AF           ROK51   D
1029-29     1 = REF+AF           ROK67   D
1104-30     2 = REF              OMZ01   A
117-10      2 = REF              OMZ18   A
1037-18     2 = REF              OMZ54   A
174-34      2 = REF              ROL05   B
75-6        2 = REF              ROK49   B
85-14       2 = REF              ROK62   B
970-26      2 = REF              ROL1    C
859-21      2 = REF              RON60   C
612-2       2 = REF              ROK51   D
866-31      3 = PEF+AF           OMZ18   A
495-19      3 = PEF+AF           OMZ01   A
829-11      3 = PEF+AF           OMZ54   A
1088-15     3 = PEF+AF           ROL05   B
299-7       3 = PEF+AF           ROK58   B
1049-22     3 = PEF+AF           ROL1    C
1016-27     3 = PEF+AF           RON52   C
1113-3      3 = PEF+AF           RON42   D
1057-35     3 = PEF+AF           RON50   D
819-16      4 = PEF              ROK62   B
146-12      4 = PEF              ROK50   C
924-23      4 = PEF              RON52   C
670-28      4 = PEF              RON60   C
480-8       4 = PEF              RON42   D
446-20      4 = PEF              RON50   D
1099-32     4 = PEF              ROK51   D
516-4       4 = PEF              ROK67   D
647-36      4 = PEF              ROK67   D
12Q         TechCtrl             RON42   D
13Q         TechCtrl             ROL05   B
14Q         TechCtrl             OMZ01   A
15Q         TechCtrl             ROK49   B
16Q         TechCtrl             OMZ18   A
17Q         TechCtrl             ROK50   C
18Q         TechCtrl             ROL1    C
19Q         TechCtrl             OMZ54   A
20Q         TechCtrl             ROK58   B

Table 1

2. Preprocessing

Extraction of Ct data:
Ct data for each miRNA and each sample were extracted from the analysis_result.txt file (part of the raw data provided by the users). All analyses were performed in the R/Bioconductor environment. In particular, the Bioconductor package HTqPCR was used, as it is designed for the analysis of high-throughput qPCR data.

Quality control:
Figure 1 shows the raw Ct distribution for each sample.

Figure 1

We observe two clear Ct density peaks: one whose summit is located around Ct=25, the other around Ct=40.

miRNA transcripts with a Ct around 40 are expressed at too low a level to be considered actually expressed: we will filter out some of these features in order not to lose too much detection power.

Figure 2 shows a hierarchical clustering of samples using all miRNAs. Colors represent the experimental groups the samples belong to (a.), or the batches (b.) in which samples were run.

a. Clustering colored per experimental group.
b. Clustering colored per batch.
Figure 2. Dendrograms using raw data.

Figure 2.b shows a slight batch effect: the samples that were run in batch A all group together. This bias is often found when arrays are not all processed in the same batch and/or on the same day.

It should be remembered that such technical biases are more visible when features are lowly expressed or when few differences are expected between experimental groups. We will try to correct for that bias.

Features filtering:
Features are tagged as Undetermined if their Ct is above 38, and as Unreliable if their Ct is below 10 or if their standard deviation is above 0.9 across all samples of the same experimental group. We then filter out features that are Undetermined/Unreliable in 36 samples or more (we consider that a feature can potentially be expressed in only one experimental group, i.e. 9 samples here, and not expressed in the 36 remaining samples). Using this filter, 411 features were removed, and we will work with the 407 remaining ones. Figure 3 shows the density plot (same as Figure 1) of the remaining filtered data: the second peak of lowly expressed features is clearly reduced.

Figure 3
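The tagging and filtering rule above can be sketched in a few lines. This is a plain-Python illustration only (the report itself used R and HTqPCR); the function names are hypothetical, and it assumes that a within-group standard deviation above the cutoff marks every sample of that group as Unreliable for the feature, which is one possible reading of the rule.

```python
from statistics import stdev

CT_MAX = 38.0   # Ct above this value: "Undetermined"
CT_MIN = 10.0   # Ct below this value: "Unreliable"
SD_MAX = 0.9    # within-group standard deviation above this value: "Unreliable"


def flag_feature(ct_by_group):
    """Flag every sample of one feature; ct_by_group maps group -> list of Ct."""
    flags = {}
    for group, cts in ct_by_group.items():
        # Assumption: a noisy group marks all of its samples as unreliable.
        noisy = len(cts) > 1 and stdev(cts) > SD_MAX
        group_flags = []
        for ct in cts:
            if ct > CT_MAX:
                group_flags.append("Undetermined")
            elif ct < CT_MIN or noisy:
                group_flags.append("Unreliable")
            else:
                group_flags.append("OK")
        flags[group] = group_flags
    return flags


def keep_feature(ct_by_group, max_flagged):
    """Keep the feature unless max_flagged samples or more are not "OK"."""
    flags = flag_feature(ct_by_group)
    n_flagged = sum(f != "OK" for fl in flags.values() for f in fl)
    return n_flagged < max_flagged


# Toy values, not taken from the study: expressed in G1 only.
toy = {"G1": [25.0, 25.2, 24.9], "G2": [40.0, 40.0, 40.0]}
print(keep_feature(toy, max_flagged=4))  # True: only the 3 G2 samples are flagged
```

With the 45 samples of this experiment, the threshold described in the text corresponds to max_flagged=36.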

Figure 4 shows the dendrograms (as in Figure 2) using the features remaining after filtering. Colors represent the experimental groups the samples belong to (a.), or the batches (b.) in which samples were run.

a. Clustering colored per experimental group.
b. Clustering colored per batch.
Figure 4. Dendrograms using filtered data.

Batch effect correction:
The ComBat method (Bioconductor package sva) was applied to try to correct for the batch effect we observe. ComBat adjusts for batch effects in a dataset where the batch covariate is known, which is the case here.
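To give an intuition of what a batch adjustment does, the sketch below shifts each batch of one feature so that its mean matches the overall mean. It is deliberately much simpler than ComBat and is not the method used in this report: ComBat also adjusts the batch variance, shrinks the batch estimates across features with an empirical Bayes step, and can protect biological covariates. The function name and the toy values are hypothetical.

```python
from statistics import mean


def center_batches(cts, batches):
    """Location-only batch adjustment for one feature.

    cts:     Ct values, one per sample
    batches: batch label of each sample, in the same order
    """
    grand_mean = mean(cts)
    batch_means = {
        b: mean(ct for ct, label in zip(cts, batches) if label == b)
        for b in set(batches)
    }
    # Remove the batch-specific offset, keep the overall level.
    return [ct - batch_means[b] + grand_mean for ct, b in zip(cts, batches)]


# Toy values: batch B reads 4 cycles later than batch A.
adjusted = center_batches([24.0, 26.0, 28.0, 30.0], ["A", "A", "B", "B"])
print(adjusted)  # [26.0, 28.0, 26.0, 28.0]
```

A plain centering like this one would also remove genuine group differences whenever experimental groups are unevenly spread over batches, which is one reason a model-based method such as ComBat is preferred.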

Figures 5 and 6 show, as in previous steps, the Ct density per sample and the dendrograms based on filtered and corrected data, respectively.

Figure 5

a. Clustering colored by experimental group.
b. Clustering colored by batch.
Figure 6. Dendrograms using filtered and batch corrected data.

Figure 6b shows that samples from batch A no longer all cluster together as previously observed, so the batch effect seems to have been corrected. Figure 6a does not show a much improved clustering of samples per experimental group, except perhaps slightly for the group of replicated controls (TechCtrl).

Normalization:
A commonly used and validated method for qPCR normalization is the deltaCt intra-sample normalization: one or more features within the array (sufficiently expressed and stable in expression across the whole experiment) are chosen and used as reference feature(s) for raw Ct correction. The Ct of the reference feature(s) is then subtracted from the Ct of all other features, to adjust for intra-sample variability and make samples more comparable.

Selection of reference features
Four control features are provided within this array, each repeated 16 times in each array:
000338_ath-miR159a_B
001006_RNU48_B
001094_RNU44_B
001973_U6

We will first check their levels of expression and variability within and across samples (on raw data, before filtering and ComBat correction).
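The deltaCt normalization described above amounts to one subtraction per sample. The following plain-Python sketch (the report used R and HTqPCR) assumes that, when several reference features are chosen, their arithmetic mean Ct is the value subtracted; the function name and the feature names are hypothetical.

```python
from statistics import mean


def delta_ct(sample_ct, reference_ids):
    """Return deltaCt values for one sample.

    sample_ct:     dict mapping feature name -> Ct in this sample
    reference_ids: names of the reference features
    """
    # Assumption: the references are summarised by their arithmetic mean.
    reference_ct = mean(sample_ct[f] for f in reference_ids)
    return {feature: ct - reference_ct for feature, ct in sample_ct.items()}


# Toy sample with two references and one target feature.
sample = {"ref1": 20.0, "ref2": 22.0, "miR-x": 28.0}
print(delta_ct(sample, ["ref1", "ref2"])["miR-x"])  # 7.0
```

Because the subtraction is done within each sample, a sample that gives globally later Ct values (for example because less RNA was loaded) is shifted back in line with the others.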

Figure 7 shows boxplots displaying the Ct distribution of each control feature per sample. Results are displayed for only 4 samples but show the main trends.

Figure 7. Boxplots of Ct per control feature (U6, ath-miR159a, RNU44, RNU48); panel title: 1016-27_Ct.txt.

Figure 8 shows the expression profiles of these control features across samples.

Figure 8. Ct of the control features (000338_ath-miR159a_B, 001006_RNU48_B, 001094_RNU44_B, 001973_U6) across samples; panel title: control genes.

Of the 4 control features, 000338_ath-miR159a_B and 001094_RNU44_B have very high Ct values, i.e. very low transcript expression (hence unreliable). 001006_RNU48_B is generally more highly expressed, but its expression varies considerably across samples. 001973_U6 is the most stable across samples and is sufficiently expressed.

Next, we looked for miRNAs within the array that would be suitable (and better than the default controls) as references for deltaCt normalization: miRNAs whose maximum Ct is at most 35 and whose coefficient of variation is below 0.1 across all samples are selected. This results in the selection of 61 miRNAs. From these 61 miRNAs, we select the top 10, i.e. the ones with the lowest variation across samples (smallest coefficient of variation):
002315_hsa-miR-10b#_B
002148_hsa-miR-144#_B
002838_HSA-MIR-1291_B
000512_hsa-miR-210_A
000387_hsa-miR-10a_A
002281_hsa-miR-193a-5p_A
001515_hsa-miR-660_A
000416_hsa-miR-30a-3p_B
002340_hsa-miR-423-5p_A
001984_hsa-miR-590-5p_A

Figure 9 shows the Ct profiles of these 10 miRNAs across samples (a.) and their intra-experimental group variation (b.).

a. Ct of the 10 tested miRNAs across samples; panel title: tested mirna for use as controls.
b. Ct values per experimental group (1, 2, 3, 4, TechCtrl) for each of the 10 miRNAs.
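The selection rule for candidate reference miRNAs (maximum Ct of at most 35, coefficient of variation below 0.1, then the lowest coefficients of variation first) can be sketched as follows. This is a plain-Python illustration, not the code used for the report; the function names and toy values are hypothetical, and the coefficient of variation is assumed to be the sample standard deviation divided by the mean.

```python
from statistics import mean, stdev


def coefficient_of_variation(cts):
    """Sample standard deviation divided by the mean (assumed definition)."""
    return stdev(cts) / mean(cts)


def select_references(ct_table, max_ct=35.0, max_cv=0.1, top_n=10):
    """Return the top_n most stable features, most stable first.

    ct_table maps feature name -> list of Ct values across all samples.
    """
    candidates = []
    for feature, cts in ct_table.items():
        if max(cts) > max_ct:
            continue  # not sufficiently expressed in at least one sample
        cv = coefficient_of_variation(cts)
        if cv < max_cv:
            candidates.append((cv, feature))
    candidates.sort()  # smallest coefficient of variation first
    return [feature for _, feature in candidates[:top_n]]


# Toy table, not taken from the study.
toy = {
    "stable": [25.0, 25.1, 24.9],
    "variable": [20.0, 25.0, 30.0],
    "late": [36.0, 36.0, 36.1],
    "fair": [28.0, 29.0, 28.5],
}
print(select_references(toy, top_n=2))  # ['stable', 'fair']
```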

Figure 9. The 10 most stable miRNAs, which will be used for normalization.

These 10 miRNAs are used for normalization of our data (filtered and ComBat corrected) using the deltaCt method.

3. Analysis

Differential expression analysis:
The remaining control probes (000338_ath-miR159a_B, 001006_RNU48_B, 001094_RNU44_B, 001973_U6) are removed from the dataset before performing differential expression analysis, which is hence performed on 375 miRNAs. A method from HTqPCR based on limma (linear models for microarray data) was used; it applies a moderated t-test to assess differential expression of miRNAs between experimental groups.

Results:
Results (Excel file) can be found at:
http://public-docs.crg.es/biocore/sbonnin/begona_benito/2015-05_openarray/
using the following credentials:
Login: mtajes
Password: marta15

Brief description of the columns found in the results file:
t.test: The result of the t-test.
p.value: The corresponding p-values.
adj.p.value: P-values after correcting for multiple testing using the Benjamini-Hochberg method.
ddCt: The deltadeltaCt values: deltadeltaCt = deltaCt(target) - deltaCt(calibrator).
FC: The fold change: 2^(-ddCt).
Target/Calibrator: the first/last experimental group in a pairwise comparison, respectively; for G1 vs G2, G1 is the target, G2 the calibrator.

Mean columns: The average Ct across the target/calibrator samples for the given gene.
Category columns: all results are assigned to a category, either "OK" or "Unreliable", depending on the input Ct values: the result will be "OK" unless at least half of the Ct values for a given gene are unreliable/undetermined.

Filtering the data using the adjusted p-value (< 0.05) does not yield any result. Table 2 lists the miRNAs found when filtering the data using the (unadjusted) p-value (< 0.05).

G1 vs G2 (REF+AF vs REF)     G3 vs G4 (PEF+AF vs PEF)
22 miRNAs                    12 miRNAs
000409_hsa-miR-27b_A         000377_hsa-let-7a_A
000443_hsa-miR-107_A         000443_hsa-miR-107_A
001286_hsa-miR-539_A         000482_hsa-miR-181c_A
001597_hsa-miR-645_B         000500_hsa-miR-199b_A
001610_hsa-miR-411_A         000533_hsa-miR-302c_A
001988_hsa-miR-598_A         001557_hsa-miR-624_B
002087_hsa-miR-505#_B        002182_hsa-miR-939_B
002088_hsa-miR-636_A         002202_hsa-miR-889_A
002222_hsa-miR-1_A           002239_hsa-miR-654-3p_A
002231_hsa-miR-9#_B          002361_hsa-miR-146b-3p_A
002233_hsa-miR-331-5p_A      002393_hsa-miR-520d-5p_A
002248_hsa-miR-142-5p_A      002409_hsa-miR-589_A
002301_hsa-miR-22#_B
002302_hsa-miR-425#_B
002305_hsa-miR-30d#_B
002339_hsa-miR-483-3p_B
002352_hsa-miR-652_A
002358_hsa-miR-489_A
002434_hsa-miR-628-3p_B
002437_hsa-miR-20a#_B
002642_HSA-MIR-151-5P_B
002801_HSA-MIR-1255B_B

Table 2.
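The ddCt, FC and adjusted p-value columns described above can be illustrated with a short plain-Python sketch (the report itself used R, HTqPCR and limma). The function names and toy numbers are hypothetical. The adjustment shown is the Benjamini-Hochberg step-up procedure, assumed here to be the correction applied; the moderated t-test itself is not reproduced.

```python
from statistics import mean


def ddct_and_fold_change(target_dct, calibrator_dct):
    """ddCt = mean deltaCt(target) - mean deltaCt(calibrator); FC = 2^(-ddCt)."""
    ddct = mean(target_dct) - mean(calibrator_dct)
    return ddct, 2 ** (-ddct)


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest to the smallest p-value so that the adjusted
    # values never increase when the raw p-value decreases.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy deltaCt values: the target group has lower deltaCt, i.e. higher expression.
ddct, fc = ddct_and_fold_change([5.0, 6.0], [6.5, 7.5])
print(ddct)  # -1.5
print(round(fc, 3))  # 2.828

print([round(p, 3) for p in benjamini_hochberg([0.01, 0.04, 0.03, 0.005])])
# [0.02, 0.04, 0.04, 0.02]
```

A negative ddCt therefore corresponds to a fold change above 1 (higher expression in the target group). With several hundred features tested, the adjustment multiplies the smallest p-values by a large factor, which is why a list obtained at unadjusted p < 0.05 can be empty after correction.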

4. References

OpenArray: https://www.lifetechnologies.com/order/catalog/product/4470187?cid=search-4470187

R project: R Development Core Team (2008). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0.

Bioconductor: Huber W, Carey VJ, Gentleman R, Anders S, Carlson M, Carvalho BS, Bravo HC, Davis S, Gatto L, Girke T, Gottardo R, Hahne F, Hansen KD, Irizarry RA, Lawrence M, Love MI, MacDonald J, Obenchain V, Oleś AK, Pagès H, Reyes A, Shannon P, Smyth GK, Tenenbaum D, Waldron L and Morgan M (2015). Orchestrating high-throughput genomic analysis with Bioconductor. Nature Methods, 12(2), pp. 115-121.

HTqPCR: Dvinge H and Bertone P (2009). HTqPCR: High-throughput analysis and visualization of quantitative real-time PCR data in R. Bioinformatics, 25(24), pp. 3325.

ComBat: Johnson WE, Rabinovic A and Li C (2007). Adjusting batch effects in microarray expression data using Empirical Bayes methods. Biostatistics, 8(1), pp. 118-127.

limma: Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W and Smyth GK (2015). limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research, 43(7), pp. e47.

Shaffer J, Schlumpberger M and Lader E. miRNA profiling from blood: challenges and recommendations. From Qiagen: http://www.sabiosciences.com/manuals/whitepaper_serumplasma.pdf