OpenArray data analysis: miRNA profiling in blood samples from patients suffering from heart disease


CRG BIOINFORMATICS CORE FACILITIES
May 2015
Users: Begona Benito and Marta Tajes
Users center: IMIM
Analyst: Sarah Bonnin
Group leader: Julia Ponomarenko

1. Project

Scientific background, limitations:

Blood samples from patients with either preserved or reduced ejection fraction, with or without atrial fibrillation. Some studies suggest that some miRNAs could be directly involved in the development of the disease.

Limitations:
* Because target molecules are highly diluted in blood samples, miRNAs are present at low concentration in plasma and are therefore difficult to detect.
* One of the challenges of miRNA profiling from serum or plasma is the lack of established housekeeping genes for data normalization.

Technology:

Platform: OpenArray, Life Technologies
Array: TaqMan OpenArray Human MicroRNA Panel, QuantStudio 12K Flex
Catalog number:
miRNAs + 4 controls replicated 16 times

3 samples are loaded on each array, for a total of 15 arrays run in 4 different batches on the following dates:
- batch A, 3 arrays: 13/02/
- batch B, 4 arrays: 05/03/
- batch C, 4 arrays: 10/03/
- batch D, 4 arrays: 11/03/2015

Data and goal:

The experiment consists of 45 samples, divided into 5 experimental groups:
* 9 technical replicate samples: 9 samples pooled together and then run as technical replicates, so that the technical variation can be studied (TechCtrl)
* 9 preserved ejection fraction (PEF = 4)
* 9 preserved ejection fraction with atrial fibrillation (PEF+AF = 3)
* 9 reduced ejection fraction (REF = 2)

* 9 reduced ejection fraction with atrial fibrillation (REF+AF = 1)

The users are mostly interested in the change in miRNA expression associated with atrial fibrillation in each ejection fraction situation. Hence, we will first focus on comparing PEF+AF vs PEF and REF+AF vs REF.

Table 1 shows, for each sample, which experimental group it belongs to, on which array it was run, and in which batch.

Sample ID   Experimental group   Array   Batch
            1 = REF+AF           ROK49   B
            1 = REF+AF           ROK58   B
            1 = REF+AF           ROK62   B
            1 = REF+AF           ROK50   C
            1 = REF+AF           RON52   C
            1 = REF+AF           RON60   C
            1 = REF+AF           RON50   D
            1 = REF+AF           ROK51   D
            1 = REF+AF           ROK67   D
            2 = REF              OMZ01   A
            2 = REF              OMZ18   A
            2 = REF              OMZ54   A
            2 = REF              ROL05   B
            2 = REF              ROK49   B
            2 = REF              ROK62   B
            2 = REF              ROL1    C
            2 = REF              RON60   C
            2 = REF              ROK51   D
            3 = PEF+AF           OMZ18   A
            3 = PEF+AF           OMZ01   A
            3 = PEF+AF           OMZ54   A
            3 = PEF+AF           ROL05   B
            3 = PEF+AF           ROK58   B
            3 = PEF+AF           ROL1    C
            3 = PEF+AF           RON52   C
            3 = PEF+AF           RON42   D
            3 = PEF+AF           RON50   D
            4 = PEF              ROK62   B
            4 = PEF              ROK50   C
            4 = PEF              RON52   C
            4 = PEF              RON60   C
            4 = PEF              RON42   D
            4 = PEF              RON50   D
            4 = PEF              ROK51   D
            4 = PEF              ROK67   D
            4 = PEF              ROK67   D
12Q         TechCtrl             RON42   D
13Q         TechCtrl             ROL05   B
14Q         TechCtrl             OMZ01   A
15Q         TechCtrl             ROK49   B
16Q         TechCtrl             OMZ18   A
17Q         TechCtrl             ROK50   C
18Q         TechCtrl             ROL1    C
19Q         TechCtrl             OMZ54   A
20Q         TechCtrl             ROK58   B

Table 1

2. Preprocessing

Extraction of Ct data:

Ct data for each miRNA and each sample were extracted from the analysis_result.txt file (part of the raw data handed over by the users). All analyses were performed in the R/Bioconductor environment. In particular, the Bioconductor package HTqPCR was used, as it is designed for the analysis of high-throughput qPCR data.

Quality control:

Figure 1 shows the raw Ct distribution for each sample.

Figure 1

We observe two clear Ct density peaks: one with its summit around Ct=25, the other around Ct=40.

miRNA transcripts whose Ct is around 40 are expressed at too low a level to be considered actually expressed: we will try to filter out some of these features so as not to lose too much detection power.

Figure 2 shows a hierarchical clustering of samples using all miRNAs. Colors represent the experimental groups the samples belong to (a.), or the batches (b.) in which samples were run.

a. Clustering colored per experimental group.
b. Clustering colored per batch.

Figure 2. Dendrograms using raw data.

Figure 2b shows a slight batch effect: the samples that were run in batch A all group together. This bias is often found when arrays are not all processed in the same batch and/or on the same day.

Note that such technical biases are more visible when features are lowly expressed or when few differences are expected between experimental groups. We will try to correct for that bias.

Features filtering:

Features are tagged as Undetermined if their Ct is beyond 38, and as Unreliable if their Ct is below 10 or if their standard deviation is above 0.9 across all samples of the same experimental group. We then filter out features that are Undetermined/Unreliable in 36 samples or more (we consider that a feature can potentially be expressed in only one experimental group, i.e. 9 samples here, and not expressed in the 36 remaining samples). With this filtering, 411 features were removed, and we will work with the 407 remaining ones.

Figure 3 shows the density plot (same as Figure 1) of the remaining filtered data: the second peak, made of lowly expressed features, is clearly reduced.

Figure 3
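The tagging and filtering rules above can be sketched as follows. This is a minimal Python illustration, not the R/HTqPCR code used for the analysis: the function names are hypothetical, the thresholds are the ones quoted in the text, and reading the 0.9 cut-off as a per-group standard deviation is an assumption.

```python
from statistics import stdev

CT_MAX = 38.0      # Ct beyond this value: "Undetermined"
CT_MIN = 10.0      # Ct below this value: "Unreliable"
SD_MAX = 0.9       # assumed cut-off on the within-group standard deviation
MAX_FLAGGED = 36   # a feature flagged in this many samples or more is dropped


def flag_feature(ct_by_sample, group_of):
    """Tag each sample of one feature as OK, Undetermined or Unreliable.

    ct_by_sample: {sample_id: Ct}; group_of: {sample_id: experimental group}.
    """
    # Collect the Ct values of this feature per experimental group.
    per_group = {}
    for sample, ct in ct_by_sample.items():
        per_group.setdefault(group_of[sample], []).append(ct)
    # Groups in which the feature varies too much between samples.
    noisy = {g for g, v in per_group.items() if len(v) > 1 and stdev(v) > SD_MAX}

    flags = {}
    for sample, ct in ct_by_sample.items():
        if ct > CT_MAX:
            flags[sample] = "Undetermined"
        elif ct < CT_MIN or group_of[sample] in noisy:
            flags[sample] = "Unreliable"
        else:
            flags[sample] = "OK"
    return flags


def keep_feature(flags, max_flagged=MAX_FLAGGED):
    """Keep a feature unless it is flagged in max_flagged samples or more."""
    n_flagged = sum(1 for f in flags.values() if f != "OK")
    return n_flagged < max_flagged
```

With 45 samples and the default of 36, a feature that is reliably detected in a single experimental group of 9 samples has at most 35 flagged samples and is therefore kept.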

Figure 4 shows the dendrogram (as in Figure 2) using the features remaining after filtering. Colors represent the experimental groups the samples belong to (a.), or the batches (b.) in which samples were run.

a. Clustering colored per experimental group.
b. Clustering colored per batch.

Figure 4. Dendrograms using filtered data.

Batch effect correction:

The ComBat method (Bioconductor package sva) was applied to try to correct for the batch effect we observe. ComBat adjusts for batch effects in a dataset where the batch covariate is known, which is the case here.
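ComBat is an empirical Bayes method and is not reproduced here. As a much simpler stand-in, the idea of adjusting for a known batch covariate can be illustrated by per-batch mean-centering of one feature. This Python sketch is an assumption-laden illustration (hypothetical function name, no shrinkage, no protection of the experimental groups) and is not what the sva package does.

```python
from statistics import mean


def center_batches(ct_by_sample, batch_of):
    """Shift each batch so that its mean Ct equals the overall mean Ct.

    ct_by_sample: {sample_id: Ct} for a single feature.
    batch_of: {sample_id: batch label}.
    Simplified location-only adjustment, NOT ComBat.
    """
    overall = mean(ct_by_sample.values())
    per_batch = {}
    for sample, ct in ct_by_sample.items():
        per_batch.setdefault(batch_of[sample], []).append(ct)
    batch_mean = {b: mean(values) for b, values in per_batch.items()}
    return {
        sample: ct - batch_mean[batch_of[sample]] + overall
        for sample, ct in ct_by_sample.items()
    }


# Toy example: batch A reads about 4 cycles higher than batch B.
adjusted = center_batches(
    {"a1": 30.0, "a2": 32.0, "b1": 26.0, "b2": 28.0},
    {"a1": "A", "a2": "A", "b1": "B", "b2": "B"},
)
```

In this design the groups are not evenly spread over the batches (per Table 1, batch A holds only REF, PEF+AF and TechCtrl samples), so plain centering could also remove genuine group differences; this is one reason to prefer a model-based method such as ComBat.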

Figures 5 and 6 show, as in previous steps, the Ct density per sample and the dendrograms based on filtered and corrected data, respectively.

Figure 5

a. Clustering colored by experimental group.

b. Clustering colored by batch.

Figure 6. Dendrograms using filtered and batch-corrected data.

Figure 6b shows that samples from batch A no longer all cluster together as previously observed, so the batch effect seems to have been corrected. Figure 6a does not show a much improved clustering of samples per experimental group, except perhaps slightly for the group of replicated controls (TechCtrl).

Normalization:

A commonly used and validated method for qPCR normalization is the deltaCt intra-sample normalization: one or more features within the array (sufficiently expressed and stable in expression across the whole experiment) are chosen and used as reference feature(s) for raw Ct correction. The Ct of the reference feature(s) is then subtracted from the Ct of all other features, to adjust for intra-sample variability and make samples more comparable.

Selection of reference features

4 control features are provided within this array, each repeated 16 times per array: _ath-miR159a_B, _RNU48_B, _RNU44_B, _U6. We will first check their expression levels and variability within and across samples (on raw data, before filtering and ComBat correction).
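The deltaCt normalization described above can be sketched for one sample as follows. This is a minimal Python illustration with a hypothetical function name and made-up feature names; averaging the Ct of several reference features before subtracting is an assumption about how multiple references are combined.

```python
from statistics import mean


def delta_ct(sample_ct, reference_features):
    """Subtract the mean reference Ct from every other feature of one sample.

    sample_ct: {feature: Ct} for a single sample.
    reference_features: features used as references (assumed to be averaged).
    """
    ref_ct = mean(sample_ct[f] for f in reference_features)
    return {
        feature: ct - ref_ct
        for feature, ct in sample_ct.items()
        if feature not in reference_features
    }


# Toy sample with one reference feature.
normalized = delta_ct({"ref": 24.0, "miR-a": 28.0, "miR-b": 22.0}, ["ref"])
```

A positive deltaCt means the feature needs more cycles than the reference, i.e. it is less abundant than the reference in that sample.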

Figure 7 shows boxplots displaying the Ct distribution of each control feature per sample. Results are displayed for only 4 samples but show the main trends.

Figure 7. Boxplots for the control features U6, athmir159, RNU44 and RNU48.

Figure 8 shows the expression profiles of these control features across samples.

Figure 8. Ct of the control genes (_ath-miR159a_B, _RNU48_B, _RNU44_B, _U6 rRNA_B) across samples.

Of the 4 control features:
* _ath-miR159a_B and _RNU44_B have very high Ct values, i.e. very low transcript expression (hence unreliable)
* _RNU48_B is generally more highly expressed, but its expression varies considerably across samples
* _U6 is the most stable in expression across samples, and is sufficiently expressed.

Next, we looked for miRNAs within the array that would be suitable (and better than the default controls) as references for deltaCt normalization: miRNAs whose maximum Ct is 35 or below and whose coefficient of variation across all samples is below 0.1 are selected. This method selects 61 miRNAs. From these 61 miRNAs, we select the top 10, i.e. the ones that show the lowest variation across samples (smallest coefficient of variation):

* _hsa-miR-10b#_B
* _hsa-miR-144#_B
* _HSA-MIR-1291_B
* 000512_hsa-miR-210_A
* _hsa-miR-10a_A
* _hsa-miR-193a-5p_A
* _hsa-miR-660_A
* _hsa-miR-30a-3p_B
* _hsa-miR-423-5p_A
* _hsa-miR-590-5p_A

Figure 9 shows the Ct profiles of these 10 miRNAs across samples (a.) and their intra-experimental group variation (b.).

a. Ct of the miRNAs tested for use as controls, across samples.
b. Ct values per miRNA within experimental groups (e.g. TechCtrl).
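The reference selection rule (maximum Ct of 35 or below, coefficient of variation below 0.1, then the 10 smallest coefficients of variation) can be sketched as follows. This is a minimal Python illustration with a hypothetical function name and toy data; computing the coefficient of variation as sample standard deviation divided by mean is an assumption.

```python
from statistics import mean, stdev


def select_references(ct_matrix, max_ct=35.0, max_cv=0.1, top_n=10):
    """Return the top_n most stable features that pass both thresholds.

    ct_matrix: {feature: [Ct in sample 1, Ct in sample 2, ...]}.
    """
    candidates = []
    for feature, values in ct_matrix.items():
        cv = stdev(values) / mean(values)  # coefficient of variation
        if max(values) <= max_ct and cv < max_cv:
            candidates.append((cv, feature))
    candidates.sort()  # smallest coefficient of variation first
    return [feature for _, feature in candidates[:top_n]]


# Toy data: only "stable" and "ok" pass both thresholds.
toy = {
    "stable": [25.0, 25.1, 24.9],
    "noisy": [20.0, 30.0, 25.0],      # coefficient of variation too high
    "too_late": [36.0, 36.1, 36.0],   # maximum Ct above 35
    "ok": [28.0, 29.0, 27.0],
}
selected = select_references(toy)
```

Ranking by coefficient of variation favours features whose Ct barely changes across all 45 samples, which is the property a normalization reference needs.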

Figure 9. The 10 most stable miRNAs, which will be used for normalization.

These 10 miRNAs are used to normalize our data (filtered and ComBat corrected) using the deltaCt method.

3. Analysis

Differential expression analysis:

The remaining control probes (000338_ath-miR159a_B, _RNU48_B, _RNU44_B, _U6) are removed from the dataset before differential expression analysis, which is hence performed on 375 miRNAs. A method from HTqPCR based on limma (linear models for microarray data) was used; it applies a moderated t-test to assess differential expression of miRNAs between experimental groups.

Results:

Results (Excel file) can be found in:
Using the following credentials:
Login: mtajes
Password: marta15

Brief description of the columns found in the results file:

t.test: The result of the t-test.
p.value: The corresponding p-values.
adj.p.value: P-values after correcting for multiple testing using the Benjamini-Hochberg method.
ddCt: The deltadeltaCt values; deltadeltaCt = deltaCt(target) - deltaCt(calibrator).
FC: The fold change; 2^(-ddCt).
Target/Calibrator: the first/last experimental group in a pairwise comparison, respectively; for G1 vs G2, G1 is the target, G2 the calibrator.

Mean columns: The average Ct across the target/calibrator samples for the given gene.
Category columns: all results are assigned to a category, either "OK" or "Unreliable", depending on the input Ct values: the result is "OK" unless at least half of the Ct values for a given gene are unreliable/undetermined.

Filtering the data using the adjusted p-value (< 0.05) does not yield any result. Table 2 lists the miRNAs found when filtering the data using the (unadjusted) p-value (< 0.05).

G1 vs G2 (22 miRNAs)    G3 vs G4 (12 miRNAs)
_hsa-miR-27b_A          _hsa-let-7a_A
_hsa-miR-107_A          _hsa-miR-107_A
_hsa-miR-539_A          _hsa-miR-181c_A
_hsa-miR-645_B          _hsa-miR-199b_A
_hsa-miR-411_A          _hsa-miR-302c_A
_hsa-miR-598_A          _hsa-miR-624_B
_hsa-miR-505#_B         _hsa-miR-939_B
_hsa-miR-636_A          _hsa-miR-889_A
_hsa-miR-1_A            _hsa-miR-654-3p_A
_hsa-miR-9#_B           _hsa-miR-146b-3p_A
_hsa-miR-331-5p_A       _hsa-miR-520d-5p_A
_hsa-miR-142-5p_A       _hsa-miR-589_A
_hsa-miR-22#_B
_hsa-miR-425#_B
_hsa-miR-30d#_B
_hsa-miR-483-3p_B
_hsa-miR-652_A
_hsa-miR-489_A
_hsa-miR-628-3p_B
_hsa-miR-20a#_B
_HSA-MIR-151-5P_B
_HSA-MIR-1255B_B

Table 2
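The ddCt, FC and adj.p.value columns described above can be illustrated with a short sketch. This is plain Python with hypothetical function names, not the HTqPCR/limma code used for the analysis; it assumes that ddCt is the difference of group mean deltaCt values and that the multiple-testing correction is the Benjamini-Hochberg procedure.

```python
from statistics import mean


def ddct_and_fold_change(dct_target, dct_calibrator):
    """Return (ddCt, FC) for one feature.

    ddCt = mean deltaCt(target) - mean deltaCt(calibrator); FC = 2^(-ddCt).
    """
    ddct = mean(dct_target) - mean(dct_calibrator)
    return ddct, 2 ** (-ddct)


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy feature: the target needs one cycle more than the calibrator.
ddct, fc = ddct_and_fold_change([5.0, 6.0], [4.0, 5.0])
```

A ddCt of 1 gives a fold change of 0.5, i.e. the feature is half as abundant in the target group as in the calibrator group. Because the adjustment scales each p-value by the number of tests over its rank, small unadjusted p-values can all end up above 0.05 after correction, as observed here.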

4. References

OpenArray:

R project: R Development Core Team (2008). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN

Bioconductor: Huber W, Carey VJ, Gentleman R, Anders S, Carlson M, Carvalho BS, Bravo HC, Davis S, Gatto L, Girke T, Gottardo R, Hahne F, Hansen KD, Irizarry RA, Lawrence M, Love MI, MacDonald J, Obenchain V, Oleś AK, Pagès H, Reyes A, Shannon P, Smyth GK, Tenenbaum D, Waldron L and Morgan M (2015). Orchestrating high-throughput genomic analysis with Bioconductor. Nature Methods, 12(2), pp

HTqPCR: Dvinge H and Bertone P (2009). HTqPCR: High-throughput analysis and visualization of quantitative real-time PCR data in R. Bioinformatics, 25(24), pp

ComBat: Johnson WE, Rabinovic A and Li C (2007). Adjusting batch effects in microarray expression data using Empirical Bayes methods. Biostatistics 8(1):

limma: Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W and Smyth GK (2015). limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research, 43(7), pp. e47.

Shaffer J, Schlumpberger M and Lader E. miRNA profiling from blood: challenges and recommendations. From Qiagen: