OpenArray data analysis: miRNA profiling in blood samples from patients suffering from heart diseases

CRG BIOINFORMATICS CORE FACILITIES
May 2015
Users: Begona Benito and Marta Tajes
Users center: IMIM
Analyst: Sarah Bonnin
Group leader: Julia Ponomarenko

1. Project

Scientific background, limitations:
Blood samples were collected from patients with either preserved or reduced ejection fraction, with or without atrial fibrillation. Some studies suggest that some miRNAs could be directly involved in the development of the disease.

Limitations:
* Because target molecules are highly diluted in blood samples, miRNAs are present at low concentration in plasma and are therefore difficult to detect.
* One of the challenges of miRNA profiling from serum or plasma is the lack of established housekeeping genes for data normalization.

Technology:
Platform: OpenArray, Life Technologies
Array: TaqMan OpenArray Human MicroRNA Panel, QuantStudio 12K Flex
Catalog number: 4470187
754 miRNAs + 4 controls, each control replicated 16 times
3 samples are loaded on each array, for a total of 15 arrays run in 4 different batches on the following dates:
- batch A, 3 arrays: 13/02/2015
- batch B, 4 arrays: 05/03/2015
- batch C, 4 arrays: 10/03/2015
- batch D, 4 arrays: 11/03/2015

Data and goal:
The experiment consists of 45 samples, divided into 5 experimental groups:
* 9 technical replicate samples: 9 samples pooled together and then run as technical replicates, to be able to study the technical variation (TechCtrl)
* 9 preserved ejection fraction (PEF = 4)
* 9 preserved ejection fraction with atrial fibrillation (PEF+AF = 3)
* 9 reduced ejection fraction (REF = 2)

* 9 reduced ejection fraction with atrial fibrillation (REF+AF = 1)

The users are mostly interested in the change of miRNA expression in case of atrial fibrillation in each ejection fraction situation. Hence, we will first focus on comparing PEF+AF vs PEF and REF+AF vs REF.

Table 1 shows, for each sample, which experimental group it belongs to, on which array it was run, and in which batch.

Sample ID   Experimental group   Array   Batch
138-33      1 = REF+AF           ROK49   B
535-13      1 = REF+AF           ROK58   B
549-5       1 = REF+AF           ROK62   B
602-25      1 = REF+AF           ROK50   C
204-9       1 = REF+AF           RON52   C
271-24      1 = REF+AF           RON60   C
678-1       1 = REF+AF           RON50   D
686-17      1 = REF+AF           ROK51   D
1029-29     1 = REF+AF           ROK67   D
1104-30     2 = REF              OMZ01   A
117-10      2 = REF              OMZ18   A
1037-18     2 = REF              OMZ54   A
174-34      2 = REF              ROL05   B
75-6        2 = REF              ROK49   B
85-14       2 = REF              ROK62   B
970-26      2 = REF              ROL1    C
859-21      2 = REF              RON60   C
612-2       2 = REF              ROK51   D
866-31      3 = PEF+AF           OMZ18   A
495-19      3 = PEF+AF           OMZ01   A
829-11      3 = PEF+AF           OMZ54   A
1088-15     3 = PEF+AF           ROL05   B
299-7       3 = PEF+AF           ROK58   B
1049-22     3 = PEF+AF           ROL1    C
1016-27     3 = PEF+AF           RON52   C
1113-3      3 = PEF+AF           RON42   D
1057-35     3 = PEF+AF           RON50   D
819-16      4 = PEF              ROK62   B
146-12      4 = PEF              ROK50   C
924-23      4 = PEF              RON52   C
670-28      4 = PEF              RON60   C
480-8       4 = PEF              RON42   D
446-20      4 = PEF              RON50   D
1099-32     4 = PEF              ROK51   D
516-4       4 = PEF              ROK67   D
647-36      4 = PEF              ROK67   D
12Q         TechCtrl             RON42   D
13Q         TechCtrl             ROL05   B
14Q         TechCtrl             OMZ01   A
15Q         TechCtrl             ROK49   B
16Q         TechCtrl             OMZ18   A
17Q         TechCtrl             ROK50   C
18Q         TechCtrl             ROL1    C
19Q         TechCtrl             OMZ54   A
20Q         TechCtrl             ROK58   B
Table 1

2. Preprocessing

Extraction of Ct data:
Ct data for all miRNAs and all samples were extracted from the analysis_result.txt file (part of the raw data handed over by the users). All analysis was performed in the R/Bioconductor environment. In particular, the Bioconductor package HTqPCR was used, as it is designed for the analysis of high-throughput qPCR data.

Quality control:
Figure 1 shows the raw Ct distribution for each sample.

Figure 1

We observe two clear Ct density peaks: one whose summit is located around Ct=25, the other around Ct=40.

miRNA transcripts with a Ct around 40 are expressed too weakly to be considered actually expressed: we will filter out some of these features, while trying not to lose too much detection power.

Figure 2 shows a hierarchical clustering of samples using all miRNAs. Colors represent the experimental groups the samples belong to (a.), or the batches in which samples were run (b.).

a. Clustering colored per experimental group.
b. Clustering colored per batch.
Figure 2. Dendrograms using raw data.

Figure 2.b shows a slight batch effect: samples that were run in batch A all group together. This is a bias often found when arrays are not all processed in the same batch and/or on the same day.

Note that such technical biases are more visible when features are weakly expressed or when few differences are expected between experimental groups. We will try to correct for that bias.

Features filtering:
Features are tagged as Undetermined if their Ct is above 38, and Unreliable if their Ct is below 10 or if their standard deviation is above 0.9 across all samples of the same experimental group. We then filter out features that are Undetermined/Unreliable in 36 samples or more (we consider that a feature can potentially be expressed in only one experimental group, i.e. 9 samples here, and not expressed in the 36 remaining samples). With that filtering, 411 features were removed and we will be working with the 407 remaining ones.

Figure 3 shows the density plot (same as Figure 1) of the remaining filtered data: we can see that the second peak of weakly expressed features is much reduced.

Figure 3
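The tagging and filtering rule described above can be sketched in a few lines. This is only an illustration in Python, not the HTqPCR code used for the report: the function names, the data layout (one dictionary of Ct values per feature) and the literal reading of the 0.9 standard deviation threshold are assumptions.

```python
from statistics import stdev

CT_MAX = 38.0  # Ct above this value: "Undetermined"
CT_MIN = 10.0  # Ct below this value: "Unreliable"
SD_MAX = 0.9   # within-group SD above this value: "Unreliable" (literal reading, assumption)


def flag_feature(ct_by_sample, group_of):
    """Return {sample: category} for one feature (hypothetical helper).

    ct_by_sample maps sample id -> Ct; group_of maps sample id -> group label.
    """
    # Gather the Ct values of each experimental group to compute its SD.
    by_group = {}
    for sample, ct in ct_by_sample.items():
        by_group.setdefault(group_of[sample], []).append(ct)
    noisy_groups = {g for g, cts in by_group.items()
                    if len(cts) > 1 and stdev(cts) > SD_MAX}

    flags = {}
    for sample, ct in ct_by_sample.items():
        if ct > CT_MAX:
            flags[sample] = "Undetermined"
        elif ct < CT_MIN or group_of[sample] in noisy_groups:
            flags[sample] = "Unreliable"
        else:
            flags[sample] = "OK"
    return flags


def keep_feature(flags, max_flagged=36):
    """Keep a feature unless it is flagged in max_flagged samples or more."""
    n_flagged = sum(1 for category in flags.values() if category != "OK")
    return n_flagged < max_flagged


# Toy data: 4 samples in 2 groups (made-up values, not from the experiment).
group_of = {"s1": "A", "s2": "A", "s3": "B", "s4": "B"}
flags = flag_feature({"s1": 25.0, "s2": 25.4, "s3": 39.5, "s4": 31.0}, group_of)
```

With 45 samples and the default threshold of 36, a feature that is reliably detected in at least 10 samples is kept, which is consistent with the reasoning that a feature may be expressed in a single group of 9 samples only if at least one further sample supports it.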

Figure 4 shows the dendrograms (as in Figure 2) using the features remaining after filtering. Colors represent the experimental groups the samples belong to (a.), or the batches in which samples were run (b.).

a. Clustering colored per experimental group.
b. Clustering colored per batch.
Figure 4. Dendrograms using filtered data.

Batch effect correction:
The ComBat method (Bioconductor package sva) was applied to try to correct for the batch effect we observe. ComBat adjusts for batch effects in a dataset where the batch covariate is known, which is the case here.
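To give an intuition of what adjusting for a known batch covariate means, the sketch below applies a much simpler stand-in than ComBat: for one feature, each batch is shifted so that its mean Ct matches the overall mean. It is not the ComBat algorithm, which additionally estimates per-batch scale and shrinks the batch estimates with empirical Bayes; the function name and data layout are assumptions made for illustration.

```python
from statistics import mean


def center_batches(ct_by_sample, batch_of):
    """Shift one feature's Ct values so that every batch mean equals the overall mean.

    Simplified location-only adjustment (hypothetical helper, NOT ComBat).
    """
    overall = mean(ct_by_sample.values())
    by_batch = {}
    for sample, ct in ct_by_sample.items():
        by_batch.setdefault(batch_of[sample], []).append(ct)
    batch_mean = {batch: mean(cts) for batch, cts in by_batch.items()}
    return {sample: ct - batch_mean[batch_of[sample]] + overall
            for sample, ct in ct_by_sample.items()}


# Toy data: batch A reads about 4 cycles higher than batch B (made-up values).
batch_of = {"s1": "A", "s2": "A", "s3": "B", "s4": "B"}
corrected = center_batches({"s1": 30.0, "s2": 32.0, "s3": 26.0, "s4": 28.0}, batch_of)
```

After centering, the two batches have the same mean while the differences between samples of the same batch are unchanged. Such a naive shift is only reasonable when experimental groups are evenly spread over batches, which is one reason a dedicated method was preferred here.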

Figures 5 and 6 show, as in previous steps, the Ct density per sample and the dendrograms based on filtered and corrected data, respectively.

Figure 5

a. Clustering colored by experimental group.
b. Clustering colored by batch.
Figure 6. Dendrograms using filtered and batch corrected data.

Figure 6b shows that samples from batch A no longer all cluster together as previously observed, so the batch effect seems to have been corrected. Figure 6a does not show much improvement in the clustering of samples per experimental group, except perhaps slightly for the group of replicated controls (TechCtrl).

Normalization:
A commonly used and validated method for qPCR normalization is the deltaCt intra-sample normalization: one or more features within the array are chosen (sufficiently expressed and stable in expression across the whole experiment), and are used as reference feature(s) for raw Ct correction. The Ct of the reference feature(s) is then subtracted from the Ct of all other features, to adjust for intra-sample variability and make samples more comparable.

Selection of reference features
4 control features are provided within this array, each repeated 16 times in each array:
000338_ath-miR159a_B
001006_RNU48_B
001094_RNU44_B
001973_U6
We will first check their levels of expression and variability within and across samples (on raw data, before filtering and ComBat correction).
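As an illustration of the deltaCt normalization described under Normalization, the following Python sketch subtracts the reference Ct from every feature of one sample. The function name and feature names are hypothetical, and averaging the Ct of several reference features is an assumption about how multiple references are combined.

```python
from statistics import mean


def delta_ct(sample_ct, reference_features):
    """Return deltaCt values for one sample (hypothetical helper).

    sample_ct maps feature name -> Ct; the mean Ct of the reference
    features (assumption) is subtracted from every feature.
    """
    reference = mean(sample_ct[feature] for feature in reference_features)
    return {feature: ct - reference for feature, ct in sample_ct.items()}


# Toy sample with two reference features (made-up values).
sample = {"ref_1": 24.0, "ref_2": 26.0, "mir_x": 30.0, "mir_y": 22.5}
normalized = delta_ct(sample, ["ref_1", "ref_2"])
```

Here the reference level is 25, so mir_x gets a deltaCt of 5 (less abundant than the references) and mir_y a deltaCt of -2.5 (more abundant). Because the correction is computed within each sample, it compensates for sample-wide shifts such as differences in input amount.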

Figure 7 shows boxplots displaying the Ct distribution of each control feature per sample. Results are displayed for only 4 samples but show the main trends.

[Figure 7 panel recovered from the plot: 1016 27_Ct.txt; y-axis: Ct; boxes: U6, athmir159, RNU44, RNU48]
Figure 7.

Figure 8 shows the expression profiles of these control features across samples.

[Figure 8 plot: control genes; y-axis: Ct; x-axis: samples; series: 000338_ath mir159a_b, 001006_RNU48_B, 001094_RNU44_B, 001973_U6 rrna_b]
Figure 8

Of the 4 control features, 000338_ath-miR159a_B and 001094_RNU44_B have very high Ct values, i.e. very low transcript expression (hence unreliable). 001006_RNU48_B is generally more highly expressed, but its expression varies considerably across samples. 001973_U6 is the most stable in expression across samples, and is sufficiently expressed.

Next, we looked for miRNAs within the array that would be suitable (and better than the default controls) as references for deltaCt normalization: miRNAs whose maximum Ct is at most 35 and whose coefficient of variation across all samples is below 0.1 are selected. This results in the selection of 61 miRNAs. From these 61 miRNAs, we select the top 10, i.e. the ones that show the lowest levels of variation across samples (smallest coefficient of variation):
002315_hsa-miR-10b#_B
002148_hsa-miR-144#_B
002838_HSA-MIR-1291_B
000512_hsa-miR-210_A
000387_hsa-miR-10a_A
002281_hsa-miR-193a-5p_A
001515_hsa-miR-660_A
000416_hsa-miR-30a-3p_B
002340_hsa-miR-423-5p_A
001984_hsa-miR-590-5p_A

Figure 9 shows the Ct profiles of these 10 miRNAs (tested for use as controls) across samples (a.) and their intra-experimental group variation, i.e. Ct values per group 1, 2, 3, 4 and TechCtrl (b.).
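The selection rule for reference miRNAs (maximum Ct at most 35, coefficient of variation below 0.1, then the 10 most stable) can be sketched as follows. This is a Python illustration under assumptions: the function name and data layout are hypothetical, and the coefficient of variation is taken as the sample standard deviation divided by the mean.

```python
from statistics import mean, stdev


def pick_references(ct_table, max_ct=35.0, max_cv=0.1, top_n=10):
    """Return up to top_n feature names, most stable first (hypothetical helper).

    ct_table maps feature name -> list of Ct values, one per sample.
    """
    candidates = []
    for feature, cts in ct_table.items():
        cv = stdev(cts) / mean(cts)  # coefficient of variation (assumed definition)
        if max(cts) <= max_ct and cv < max_cv:
            candidates.append((cv, feature))
    candidates.sort()  # smallest coefficient of variation first
    return [feature for _, feature in candidates[:top_n]]


# Toy table with 3 samples per feature (made-up values).
table = {
    "mir_a": [25.0, 25.5, 24.5],  # stable and well expressed
    "mir_b": [28.0, 30.0, 32.0],  # acceptable, but more variable than mir_a
    "mir_c": [34.0, 36.0, 35.0],  # rejected: maximum Ct above 35
    "mir_d": [20.0, 28.0, 24.0],  # rejected: coefficient of variation above 0.1
}
references = pick_references(table)
```

On this toy table the function returns mir_a followed by mir_b. On the real data, the same two-step logic (threshold first, then ranking) is what reduces the 61 candidates to 10.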

Figure 9. The 10 most stable miRNAs that will be used for normalization.

These 10 miRNAs are used for normalization of our data (filtered and ComBat corrected) using the deltaCt method.

3. Analysis

Differential expression analysis:
The remaining control probes (000338_ath-miR159a_B, 001006_RNU48_B, 001094_RNU44_B, 001973_U6) are removed from the dataset before performing differential expression analysis, which is hence performed on 375 miRNAs. A method from HTqPCR based on limma (linear models for microarray data) was used; it applies a moderated t-test to assess differential expression of miRNAs between experimental groups.

Results:
Results (Excel file) can be found at: http://public-docs.crg.es/biocore/sbonnin/begona_benito/2015-05_openarray/
Using the following credentials:
Login: mtajes
Password: marta15

Brief description of the columns found in the results file:
t.test: The result of the t-test.
p.value: The corresponding p-values.
adj.p.value: P-values after correcting for multiple testing using the Benjamini-Hochberg method.
ddct: The deltadeltaCt values; deltadeltaCt = deltaCt(target) - deltaCt(calibrator).
FC: The fold change; 2^(-ddCt).
Target/Calibrator: the first/last experimental group in a pairwise comparison, respectively; for G1 vs G2, G1 is the target, G2 the calibrator.
Mean columns: The average Ct across the target/calibrator samples for the given gene.
Category columns: all results are assigned to a category, either "OK" or "Unreliable", depending on the input Ct values: the result will be "OK" unless at least half of the Ct values for a given gene are unreliable/undetermined.

Filtering the data using the adjusted p-value (< 0.05) does not yield any result. Table 2 lists the miRNAs found when filtering the data using the (unadjusted) p-value (< 0.05).

G1 vs G2 (REF+AF vs REF)    G3 vs G4 (PEF+AF vs PEF)
22 miRNAs                   12 miRNAs
000409_hsa-miR-27b_A        000377_hsa-let-7a_A
000443_hsa-miR-107_A        000443_hsa-miR-107_A
001286_hsa-miR-539_A        000482_hsa-miR-181c_A
001597_hsa-miR-645_B        000500_hsa-miR-199b_A
001610_hsa-miR-411_A        000533_hsa-miR-302c_A
001988_hsa-miR-598_A        001557_hsa-miR-624_B
002087_hsa-miR-505#_B       002182_hsa-miR-939_B
002088_hsa-miR-636_A        002202_hsa-miR-889_A
002222_hsa-miR-1_A          002239_hsa-miR-654-3p_A
002231_hsa-miR-9#_B         002361_hsa-miR-146b-3p_A
002233_hsa-miR-331-5p_A     002393_hsa-miR-520d-5p_A
002248_hsa-miR-142-5p_A     002409_hsa-miR-589_A
002301_hsa-miR-22#_B
002302_hsa-miR-425#_B
002305_hsa-miR-30d#_B
002339_hsa-miR-483-3p_B
002352_hsa-miR-652_A
002358_hsa-miR-489_A
002434_hsa-miR-628-3p_B
002437_hsa-miR-20a#_B
002642_HSA-MIR-151-5P_B
002801_HSA-MIR-1255B_B
Table 2.
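To make the ddct and FC columns of the results file concrete, here is a small worked example in Python. It is a sketch only: the function name is hypothetical, and taking the group-level deltaCt as the mean of the per-sample deltaCt values is an assumption.

```python
from statistics import mean


def ddct_and_fc(dct_target, dct_calibrator):
    """Return (ddCt, fold change) for one miRNA (hypothetical helper).

    dct_target and dct_calibrator are lists of deltaCt values, one per sample,
    for the target and calibrator groups respectively.
    """
    ddct = mean(dct_target) - mean(dct_calibrator)
    fold_change = 2 ** (-ddct)
    return ddct, fold_change


# Toy comparison: target group G1 against calibrator group G2 (made-up values).
ddct, fc = ddct_and_fc([4.0, 5.0, 6.0], [6.0, 6.5, 5.5])
```

In this example the mean deltaCt is 5 in the target and 6 in the calibrator, so ddCt is -1 and the fold change is 2: a lower Ct means earlier detection, hence the miRNA is about twice as abundant in the target group.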

4. References

OpenArray: https://www.lifetechnologies.com/order/catalog/product/4470187?cid=search-4470187

R project: R Development Core Team (2008). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0

Bioconductor: Huber W, Carey VJ, Gentleman R, Anders S, Carlson M, Carvalho BS, Bravo HC, Davis S, Gatto L, Girke T, Gottardo R, Hahne F, Hansen KD, Irizarry RA, Lawrence M, Love MI, MacDonald J, Obenchain V, Oleś AK, Pagès H, Reyes A, Shannon P, Smyth GK, Tenenbaum D, Waldron L and Morgan M (2015). Orchestrating high-throughput genomic analysis with Bioconductor. Nature Methods, 12(2), pp. 115-121.

HTqPCR: Dvinge H and Bertone P (2009). HTqPCR: High-throughput analysis and visualization of quantitative real-time PCR data in R. Bioinformatics, 25(24), pp. 3325.

ComBat: Johnson WE, Rabinovic A and Li C (2007). Adjusting batch effects in microarray expression data using Empirical Bayes methods. Biostatistics, 8(1), pp. 118-127.

limma: Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W and Smyth GK (2015). limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research, 43(7), pp. e47.

Shaffer J, Schlumpberger M and Lader E. miRNA profiling from blood: challenges and recommendations. From Qiagen: http://www.sabiosciences.com/manuals/whitepaper_serumplasma.pdf