Identification of rheumatoid arthritis and osteoarthritis patients by transcriptome-based rule set generation

Bering Limited
Report generated on September 19, 2014

Contents
1 Dataset summary
  1.1 Project description
2 Array processing and normalization
3 Quality control
  3.1 Outlier detection
    3.1.1 Principal Component Analysis
    3.1.2 Signal density and box plots
    3.1.3 Array similarity heatmap and hierarchical clustering
  3.2 Batch correction
  3.3 Quality control summary
4 Differential expression analysis
5 Gene Ontology enrichment
  5.1 Biological Process
  5.2 Cellular Component
  5.3 Molecular Function
6 Reactome pathway enrichment

1 Dataset summary

Number of samples: 30
Number of chip identifiers: 506944
Comparison: ra vs. control

1.1 Project description

ArrayExpress accession number: E-GEOD-55235.

Discrimination of rheumatoid arthritis (RA) patients from patients with other inflammatory/degenerative joint diseases or healthy individuals purely on the basis of genes differentially expressed in high-throughput data has proven very difficult. Thus, the present study sought to achieve such discrimination by employing a novel unbiased approach using rule-based classifiers. Three multi-center genome-wide transcriptomic data sets (Affymetrix HG-U133 A/B) from a total of 79 individuals were used, including 20 healthy controls (control group - CG), as well as 26 osteoarthritis (OA) and 33 RA patients.

Reference: Woetzel D., et al. Identification of rheumatoid arthritis and osteoarthritis patients by transcriptome-based rule set generation. Arthritis Research & Therapy 2014, 16:R84

file.name               phenotype
GSM1332201 ND 1 S...
GSM1332202 ND 2 S...
GSM1332203 ND 3 S...
GSM1332204 ND 4 S...
GSM1332205 ND 5 S...
GSM1332206 ND 6 S...
GSM1332207 ND 7 S...
GSM1332208 ND 8 S...
GSM1332209 ND 9 S...
GSM1332210 ND 10...
GSM1332211 OA 1 S...
GSM1332212 OA 2 S...
GSM1332213 OA 3 S...
GSM1332214 OA 4 S...
GSM1332215 OA 5 S...
GSM1332216 OA 6 S...
GSM1332217 OA 7 S...
GSM1332218 OA 8 S...
GSM1332219 OA 9 S...
GSM1332220 OA 10...
GSM1332221 RA 1 S...    ra
GSM1332222 RA 2 S...    ra
GSM1332223 RA 3 S...    ra
GSM1332224 RA 4 S...    ra
GSM1332225 RA 5 S...    ra
GSM1332226 RA 6 S...    ra
GSM1332227 RA 7 S...    ra
GSM1332228 RA 8 S...    ra
GSM1332229 RA 9 S...    ra
GSM1332230 RA 10...     ra

Table 1: Sample-data relationships

2 Array processing and normalization

After array normalization and detection of present probes, 14595 probes were retained.

3 Quality control

The GeneProfiler pipeline aims to identify outliers, batch effects, and overly noisy experiments. Automated quality control is carried out using the arrayQualityMetrics Bioconductor package.

3.1 Outlier detection

Outlier detection is carried out using three distinct approaches:

Box plot: Each box corresponds to one array. Typically, one expects the boxes to have similar positions and widths. If the distribution of an array is very different from the others, this may indicate an experimental problem. Outlier detection is performed by computing the Kolmogorov-Smirnov statistic K_a between each array's distribution and the distribution of the pooled data.

Signal density plot: Typically, the distributions of the arrays should have similar shapes and ranges. Arrays whose distributions are very different from the others should be considered for possible problems.

Inter-sample correlation heatmap: Patterns in this plot can indicate clustering of the arrays, either because of intended biological factors or because of unintended experimental factors (batch effects). The distance between two arrays is computed as the mean absolute difference between the data of the arrays (using the data from all probes without filtering). Outlier detection is performed by looking for arrays for which the sum of the distances to all other arrays is exceptionally large.

Results of outlier detection are shown in Table 2. Each table column contains the results of one outlier detection test. A FALSE value indicates that an array is not an outlier, while a TRUE value indicates that it is. An array is considered an outlier, and labeled as such in the Vote column, if it is called an outlier by at least two methods.

                        Boxplot  Density  Heatmap  Vote
GSM1332201 ND 1 S...    FALSE    FALSE    FALSE    FALSE
GSM1332202 ND 2 S...    TRUE     FALSE    FALSE    FALSE
GSM1332203 ND 3 S...    FALSE    FALSE    FALSE    FALSE
GSM1332204 ND 4 S...    FALSE    FALSE    FALSE    FALSE
GSM1332205 ND 5 S...    FALSE    FALSE    FALSE    FALSE
GSM1332206 ND 6 S...    FALSE    FALSE    FALSE    FALSE
GSM1332207 ND 7 S...    TRUE     FALSE    FALSE    FALSE
GSM1332208 ND 8 S...    TRUE     FALSE    FALSE    FALSE
GSM1332209 ND 9 S...    FALSE    FALSE    FALSE    FALSE
GSM1332210 ND 10...     FALSE    FALSE    FALSE    FALSE
GSM1332211 OA 1 S...    FALSE    FALSE    FALSE    FALSE
GSM1332212 OA 2 S...    FALSE    FALSE    FALSE    FALSE
GSM1332213 OA 3 S...    FALSE    FALSE    FALSE    FALSE
GSM1332214 OA 4 S...    FALSE    FALSE    FALSE    FALSE
GSM1332215 OA 5 S...    FALSE    FALSE    FALSE    FALSE
GSM1332216 OA 6 S...    FALSE    FALSE    FALSE    FALSE
GSM1332217 OA 7 S...    FALSE    FALSE    FALSE    FALSE
GSM1332218 OA 8 S...    FALSE    FALSE    FALSE    FALSE
GSM1332219 OA 9 S...    FALSE    FALSE    FALSE    FALSE
GSM1332220 OA 10...     TRUE     FALSE    FALSE    FALSE
GSM1332221 RA 1 S...    FALSE    FALSE    FALSE    FALSE
GSM1332222 RA 2 S...    FALSE    FALSE    FALSE    FALSE
GSM1332223 RA 3 S...    FALSE    FALSE    FALSE    FALSE
GSM1332224 RA 4 S...    FALSE    FALSE    FALSE    FALSE
GSM1332225 RA 5 S...    FALSE    FALSE    FALSE    FALSE
GSM1332226 RA 6 S...    FALSE    FALSE    FALSE    FALSE
GSM1332227 RA 7 S...    FALSE    FALSE    FALSE    FALSE
GSM1332228 RA 8 S...    FALSE    FALSE    FALSE    FALSE
GSM1332229 RA 9 S...    FALSE    FALSE    FALSE    FALSE
GSM1332230 RA 10...     FALSE    FALSE    FALSE    FALSE

Table 2: Outlying arrays.
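The outlier tests and the voting rule described in this section can be sketched in plain Python. This is only an illustration under stated assumptions, not the arrayQualityMetrics implementation: the function names are hypothetical, each array is assumed to be a list of intensities, and the cutoffs that turn a statistic into a TRUE/FALSE flag are not given in this report and are therefore left out.

```python
from bisect import bisect_right


def ks_statistic(sample, pooled):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the empirical CDF of one array and that of the pooled data."""
    a, b = sorted(sample), sorted(pooled)
    gap = 0.0
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def distance_sums(arrays):
    """For each array, the sum of its distances to all other arrays, where
    the distance is the mean absolute difference over all probes."""
    sums = {}
    for name, values in arrays.items():
        total = 0.0
        for other, other_values in arrays.items():
            if other != name:
                diffs = (abs(x - y) for x, y in zip(values, other_values))
                total += sum(diffs) / len(values)
        sums[name] = total
    return sums


def vote(boxplot, density, heatmap):
    """An array is an outlier only if at least two tests flag it."""
    return sum([boxplot, density, heatmap]) >= 2


# Flags of GSM1332202 in Table 2: one TRUE out of three, so no outlier call.
print(vote(True, False, False))  # False
```

Under this rule the four arrays flagged by the box plot test alone receive a Vote of FALSE, which matches Table 2.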

3.1.1 Principal Component Analysis

Figure 1: Scatterplot visualising Principal Component Analysis for 30 arrays. Outliers (if any) are shown in red.

Principal Component Analysis (PCA) plots were used to visualize the overall quality of the microarray dataset. Each point in a PCA plot corresponds to an array, and dissimilar arrays lie further apart.
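To make the idea of placing arrays on principal components concrete, the following sketch computes the coordinates of each array on the first two components. It is an assumption-laden illustration rather than the pipeline's code: arrays are taken as rows and probes as columns, the function names are hypothetical, and a simple power iteration on the array-by-array Gram matrix stands in for the library routine a real pipeline would use.

```python
import math


def center_columns(rows):
    """Subtract the per-probe mean so each probe has mean zero across arrays."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def top_eigenpair(matrix, iterations=500):
    """Power iteration: dominant eigenvalue and unit eigenvector of a
    symmetric positive semi-definite matrix."""
    n = len(matrix)
    vec = [1.0 + i for i in range(n)]  # arbitrary non-constant start vector
    value = 0.0
    for _ in range(iterations):
        nxt = [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            return 0.0, vec
        vec = [x / norm for x in nxt]
        value = norm
    return value, vec


def pca_scores(rows, components=2):
    """Coordinates of each array on the leading principal components."""
    x = center_columns(rows)
    gram = [[sum(a * b for a, b in zip(r, s)) for s in x] for r in x]
    scores = []
    for _ in range(components):
        value, vec = top_eigenpair(gram)
        scores.append([v * math.sqrt(value) for v in vec])
        # Deflate: remove the component just found before the next pass.
        gram = [[gram[i][j] - value * vec[i] * vec[j]
                 for j in range(len(vec))] for i in range(len(vec))]
    return list(zip(*scores))


# Three made-up arrays with two probes each (illustrative values only).
toy_scores = pca_scores([[1.0, 1.0], [2.0, 2.0], [6.0, 6.0]])
```

In the toy input the third array is far from the other two, so it also ends up far from them along the first component, which is the pattern the PCA plot is meant to expose.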

3.1.2 Signal density and box plots

Figure 2: Boxplots and signal intensity densities for 30 arrays. Outliers (if any) are shown in red.

3.1.3 Array similarity heatmap and hierarchical clustering

Hierarchical clustering was used to determine whether sample clusters correspond to the experimental sample groups rather than to technical sources of variation.

Figure 3: Array similarity heatmap for 30 arrays. The color scale is chosen to cover the range of distances encountered in the dataset. There were 0 outlying arrays.

3.2 Batch correction

If batches are specified, batch effects are corrected.

3.3 Quality control summary

Of 22283 probes, 12917 passed quality control protocols. All 30 samples passed the outlier detection criteria.

4 Differential expression analysis

Differential expression analysis was carried out comparing ra vs. control. There were 692 up-regulated and 801 down-regulated genes (p-value threshold 0.05, FDR correction: No). The top 10 differentially expressed genes are shown in Table 3.

Symbol    Name                                           logfc     P.Value
CXCL13    Chemokine (C-X-C motif) ligand 13               1.1E+01  2.4E-10
SLAMF8    SLAM family member 8                            7.4E+00  4.6E-12
TPD52L1   Tumor protein D52-like 1                       -4.8E+00  1.4E-11
ADAMDEC1  ADAM-like, decysin 1                            4.7E+00  6.4E-10
SERPINA1  Serpin peptidase inhibitor, clade A             4.6E+00  2.0E-09
          (alpha-1 antiproteinase, antitrypsin),
          member 1
NOVA1     Neuro-oncological ventral antigen 1            -3.0E+00  4.3E-10
CCL13     Chemokine (C-C motif) ligand 13                 4.5E+00  5.1E-08
ISG20     Interferon stimulated exonuclease gene 20kDa    2.3E+00  2.2E-10
CD27      CD27 molecule                                   2.9E+00  1.6E-09
CRLF1     Cytokine receptor-like factor 1                -4.7E+00  3.2E-07

Table 3: Top 10 differentially expressed genes.
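The split into up-regulated and down-regulated genes can be illustrated with the Table 3 values. The sketch below assumes that a gene counts as differentially expressed when its raw p-value is strictly below 0.05 and that the sign of logfc decides the direction; the report states the threshold but not the exact comparison, and the function name is hypothetical.

```python
# (symbol, logfc, p_value) copied from Table 3.
TOP_GENES = [
    ("CXCL13", 1.1e+01, 2.4e-10),
    ("SLAMF8", 7.4e+00, 4.6e-12),
    ("TPD52L1", -4.8e+00, 1.4e-11),
    ("ADAMDEC1", 4.7e+00, 6.4e-10),
    ("SERPINA1", 4.6e+00, 2.0e-09),
    ("NOVA1", -3.0e+00, 4.3e-10),
    ("CCL13", 4.5e+00, 5.1e-08),
    ("ISG20", 2.3e+00, 2.2e-10),
    ("CD27", 2.9e+00, 1.6e-09),
    ("CRLF1", -4.7e+00, 3.2e-07),
]

P_THRESHOLD = 0.05  # raw p-value cutoff; no FDR correction, as in the report


def classify(logfc, p_value, threshold=P_THRESHOLD):
    """Label a gene as 'up', 'down', or 'not significant'.
    Assumption: strict inequality on the raw p-value."""
    if p_value >= threshold:
        return "not significant"
    return "up" if logfc > 0 else "down"


counts = {"up": 0, "down": 0, "not significant": 0}
for symbol, logfc, p_value in TOP_GENES:
    counts[classify(logfc, p_value)] += 1

print(counts)  # {'up': 7, 'down': 3, 'not significant': 0}
```

Applied to the full probe set rather than the ten rows shown here, the same rule would yield the 692 up-regulated and 801 down-regulated genes reported above, provided the pipeline uses an equivalent criterion.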

Figure 4: Volcano plot of all differentially expressed genes in ra vs. control. The top 5 differentially expressed genes are labeled.

5 Gene Ontology enrichment

The 1493 differentially expressed genes were tested for enrichment of Gene Ontology (GO) Biological Process (BP), Cellular Component (CC), and Molecular Function (MF) terms. All microarray genes (n=8934) were used as background. In Tables 4, 5, and 6, the column headers Significant and P.Value refer to the number of significant genes annotated with a term and the corresponding enrichment p-value, respectively.

5.1 Biological Process

Term                                           Significant  P.Value
Immune Response                                277          1.20E-30
Immune System Process                          372          8.10E-25
Defense Response                               254          2.90E-20
Regulation Of Immune System Process            219          1.00E-19
Regulation Of Immune Response                  165          5.30E-19
Positive Regulation Of Immune System Process   155          2.10E-18
Positive Regulation Of Response To Stimulus    258          1.20E-17
Regulation Of Response To Stimulus             418          2.20E-17
Signal Transduction                            600          3.80E-17
Signaling                                      638          1.10E-16

Table 4: Top enriched Gene Ontology Biological Process terms.

5.2 Cellular Component

Term                                           Significant  P.Value
Cell Periphery                                 491          1.00E-18
Plasma Membrane                                480          1.40E-18
Membrane                                       774          1.50E-17
Extracellular Region                           421          1.40E-13
Membrane Part                                  566          2.10E-11
Intrinsic Component Of Membrane                473          3.90E-11
Integral Component Of Membrane                 465          1.00E-10
Extracellular Region Part                      355          1.70E-10
Extracellular Space                            154          6.10E-10
Side Of Membrane                               63           8.60E-10

Table 5: Top enriched Gene Ontology Cellular Component terms.

5.3 Molecular Function

Term                                           Significant  P.Value
Receptor Activity                              158          8.00E-12
Signal Transducer Activity                     167          5.50E-09
Molecular Transducer Activity                  167          5.50E-09
Receptor Binding                               178          8.30E-09
Transmembrane Signaling Receptor Activity      106          1.70E-08
Signaling Receptor Activity                    116          7.60E-08
Antigen Binding                                24           9.40E-08
Sulfur Compound Binding                        41           8.70E-07
Heparin Binding                                32           3.30E-06
Chemokine Activity                             16           3.60E-06

Table 6: Top enriched Gene Ontology Molecular Function terms.
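The report does not state which statistical test produced the enrichment p-values in Tables 4 to 6. One common choice for this kind of analysis is a one-sided hypergeometric test, sketched below as an assumption. The function name and its parameters are hypothetical, and the number of background genes annotated with a term is not given in this report, so the second call uses a made-up value purely to show the call shape.

```python
from math import comb


def enrichment_p_value(background, annotated, selected, overlap):
    """Upper-tail hypergeometric probability P(X >= overlap).

    background: number of genes in the background set
    annotated:  background genes annotated with the term
    selected:   number of differentially expressed genes
    overlap:    differentially expressed genes annotated with the term
    """
    total = comb(background, selected)
    upper = min(annotated, selected)
    tail = sum(
        comb(annotated, k) * comb(background - annotated, selected - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Small worked case: 10 genes, 4 annotated, 3 selected, at least 2 overlapping.
# (C(4,2)*C(6,1) + C(4,3)*C(6,0)) / C(10,3) = 40 / 120
small = enrichment_p_value(10, 4, 3, 2)

# Report-sized call: 8934 background genes, 1493 selected, 277 overlapping
# (Immune Response, Table 4). ANNOTATED_IN_BACKGROUND is a made-up value.
ANNOTATED_IN_BACKGROUND = 900
example = enrichment_p_value(8934, ANNOTATED_IN_BACKGROUND, 1493, 277)
```

Because the annotated count is invented, the second result should not be compared with the p-values in the tables; it only shows that the exact integer arithmetic remains practical at this scale.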

6 Reactome pathway enrichment

The 1493 differentially expressed genes were tested for enrichment of Reactome pathways. The top 10 enriched pathways are shown in Table 7.

Description                                     P.Value   Count  Activity.Score
Immune System                                   1.50E-03  85      0.6
Adaptive Immune System                          5.40E-03  42      0.6
Phosphorylation of CD3 and TCR zeta chains      1.90E-02  11      1.8
Lipid and lipoprotein metabolism                1.90E-02  7      -1.9
Hemostasis                                      2.00E-02  31     -0.2
TCR signaling                                   2.00E-02  12      1.7
Cytokine Signaling in Immune system             2.60E-02  31      1.0
Platelet activation, signaling and aggregation  2.70E-02  29     -0.2
Antigen Activates B Cell Receptor Leading to    3.00E-02  24     -0.0
  Generation of Second Messengers
Alternative complement activation               3.10E-02  26     -0.5

Table 7: Top enriched Reactome Pathways. Column P.Value refers to raw enrichment significance p-values. Column Count gives the total number of differentially expressed genes assigned to a specific pathway. Column Activity.Score corresponds to the average pathway fold change in the ra vs. control comparison.
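The Activity.Score column is described only as an average pathway fold change. A minimal sketch of one reading of that description follows, assuming the score is the arithmetic mean of logfc over the differentially expressed genes assigned to a pathway. The function name is hypothetical, the logfc values come from Table 3, and the gene set is invented for illustration and is not a real Reactome pathway.

```python
from statistics import mean

# logfc values copied from Table 3.
LOGFC = {
    "CXCL13": 11.0,
    "SLAMF8": 7.4,
    "TPD52L1": -4.8,
    "CCL13": 4.5,
    "CD27": 2.9,
}


def activity_score(pathway_genes, logfc):
    """Mean logfc over the pathway members that are differentially expressed.
    Returns None when no pathway member has a logfc value."""
    values = [logfc[gene] for gene in pathway_genes if gene in logfc]
    if not values:
        return None
    return mean(values)


# Made-up gene set; NOT_DE_GENE stands for a member without a logfc value.
EXAMPLE_SET = ["CXCL13", "CCL13", "NOT_DE_GENE"]
score = activity_score(EXAMPLE_SET, LOGFC)
print(score)  # 7.75
```

Read this way, a positive score indicates that the pathway's differentially expressed genes are on average higher in ra, and a negative score, as for Lipid and lipoprotein metabolism in Table 7, indicates the opposite.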