Systematic assessment of cancer missense mutation clustering in protein structures

Atanas Kamburov, Michael Lawrence, Paz Polak, Ignaty Leshchiner, Kasper Lage, Todd R. Golub, Eric S. Lander, Gad Getz

SI Appendix
Supplemental Methods

Collapsing consecutive mutated residues

To examine the effect of consecutive mutated residues on CLUMPS results, we implemented a variant of CLUMPS in which two or more mutated residues that were consecutive in the protein sequence were combined into a single "meta-residue". The 3-D location of the centroid of the new meta-residue [used for Euclidean distance measurements to other mutated (meta-)residues] was calculated from the 3-D locations of the individual member residues and also depended linearly on their mutational recurrence. For example, if both residues P[k] and P[k+1] of protein P are found mutated and P[k] is mutated much more frequently than P[k+1], then the centroid of the new meta-residue P[k:k+1] will be closer to the centroid of P[k] than to the centroid of P[k+1]. Unlike in the original CLUMPS implementation, (meta-)residues were not allowed to be immediately next to each other in the protein sequence during the permutations.

Comparison of methods for cancer gene identification

Per-gene p-values calculated with MutSig and its components MutSig-CL, MutSig-FN and MutSig-CV were obtained from the original PanCancer study [1]. To enable comparison of these per-gene p-values with the CLUMPS p-values (which are calculated per structure), we considered, for each protein, the smallest CLUMPS p-value among its representative structures.

Protein interaction interfaces

Information about human protein residues forming interaction interfaces with other human proteins, small molecule/ion ligands, DNA or RNA (based on co-complex structures from PDB) was obtained from the PDBsum database [2] on 27 July 2014. All residues of a protein predicted by PDBsum to be involved in any type of contact (e.g., hydrogen or disulphide bonds or non-bonded contacts) with the interaction partner were considered interface residues. Only interfaces with at least one mutation were analyzed.
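As an illustration of the meta-residue construction described under "Collapsing consecutive mutated residues", the following minimal Python sketch groups consecutive mutated positions and places each meta-residue centroid at the recurrence-weighted mean of its members' centroids. The weighted mean is only one way to realize a linear dependence on recurrence, and summing the member counts to obtain the meta-residue recurrence is likewise an assumption; the function name `collapse_consecutive` and the input layout are hypothetical.

```python
from itertools import groupby


def collapse_consecutive(residues):
    """Collapse runs of consecutive mutated positions into meta-residues.

    residues: dict mapping sequence position -> (centroid (x, y, z), mutation count).
    Returns a list of (member positions, centroid, total count) tuples.
    ASSUMPTION: the meta-residue centroid is the recurrence-weighted mean of the
    member centroids, and its recurrence is the sum of the member counts.
    """
    positions = sorted(residues)
    meta = []
    # Within a run of consecutive positions, (position - rank) is constant.
    for _, run in groupby(enumerate(positions), key=lambda pair: pair[1] - pair[0]):
        members = [pos for _, pos in run]
        total = sum(residues[p][1] for p in members)
        centroid = tuple(
            sum(residues[p][0][axis] * residues[p][1] for p in members) / total
            for axis in range(3)
        )
        meta.append((members, centroid, total))
    return meta


if __name__ == "__main__":
    example = {
        10: ((0.0, 0.0, 0.0), 3),  # P[k], mutated in 3 samples
        11: ((4.0, 0.0, 0.0), 1),  # P[k+1], mutated in 1 sample
        20: ((1.0, 1.0, 1.0), 2),  # isolated residue, left unchanged
    }
    for members, centroid, total in collapse_consecutive(example):
        print(members, centroid, total)
```

In the example, the meta-residue formed from positions 10 and 11 lies closer to position 10, which is the more frequently mutated member, in line with the P[k]/P[k+1] example in the text.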
In cases where multiple co-complex structures were available for a given pair of interactors, we selected the structure maximizing interface size and sequence coverage of the protein interactor(s), as well as the number of mutations at the interface. As expected, factoring the number of mutations in interaction interfaces into the selection process, and especially restricting the analysis to interfaces with at least one observed mutation, led to some inflation in a Q-Q plot (SI Appendix, Fig. S15); however, we aimed to avoid missing interesting biological interactions due to false-negative contact residue predictions in PDBsum. Mutually similar (in terms of interface residues) protein-ligand interfaces were grouped together, and from each group only one representative interface was analyzed (i.e., the one comprising the most residues). This was done to avoid separately testing interfaces like KRAS-GTP, KRAS-GDP, KRAS-inhibitor, etc.

In the case of protein-protein interactions, we focused only on heteromers, since for many homomeric co-complex structures it is unclear whether the corresponding protein forms oligomers in solution or whether the observed residue contacts are attributable only to the way the protein was crystallized ("crystal-packing interactions") [3]. Moreover, in many instances one of the interactors was not annotated with a UniProt identifier in PDB/SIFTS despite the existence of a non-standard protein name annotation. To recover missing UniProt annotations, we aligned all non-annotated sequences that were found in protein complexes with human proteins against UniProt/SwissProt-human using WU-BLAST (http://www.ebi.ac.uk/tools/sss/wublast/). A given query sequence was annotated with the UniProt reference identifier corresponding to the smallest BLASTP alignment p-value, but only if at least 90% of the query was aligned to the reference with at least 90% sequence identity.

Protein/RNA expression and copy number data

Matched TCGA RPPA, RNAseq and copy number data from endometrial [4] and colorectal [5] tumor samples (used for quantifying the expression of SPOP substrates and CCNE1, respectively) were downloaded from the Broad GDAC portal (http://gdac.broadinstitute.org/). The samples were divided into several groups according to SPOP/FBXW7 mutation and substrate copy number statuses (SI Appendix, Fig. S6B and Main Text Fig. 5). Before plotting, protein and RNA expression levels in each sample were normalized by subtracting the median and dividing by the standard deviation of the corresponding expression level distributions of samples with no SPOP/FBXW7 somatic mutations and no substrate copy number changes. A gene was considered amplified (or deleted) if it was in a genomic segment, supported by at least 3 SNP probes, with a mean above 0.3 (or below -0.3) in the copy number data.

References

1. Lawrence MS et al. (2014) Discovery and saturation analysis of cancer genes across 21 tumour types. Nature 505:495-501.
2. de Beer TAP, Berka K, Thornton JM, Laskowski RA (2014) PDBsum additions. Nucleic Acids Res. 42:D292-296.
3. Janin J (1997) Specific versus non-specific contacts in protein crystals. Nat. Struct. Biol. 4:973-974.
4. The Cancer Genome Atlas Network (2013) Integrated genomic characterization of endometrial carcinoma. Nature 497:67-73.
5. The Cancer Genome Atlas Network (2012) Comprehensive molecular characterization of human colon and rectal cancer. Nature 487:330-337.
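Returning to the normalization and copy number rules described under "Protein/RNA expression and copy number data", a minimal Python sketch of both steps might look as follows. The function names and input layout are hypothetical, the use of the sample (n-1) standard deviation is an assumption, and so is treating segments with fewer than 3 probes as copy-neutral.

```python
from statistics import median, stdev


def normalize_to_reference(values, is_reference):
    """Center and scale expression values using a reference group of samples.

    values: expression levels, one per sample.
    is_reference: booleans marking samples with no SPOP/FBXW7 somatic mutation
    and no substrate copy number change (the reference group).
    ASSUMPTION: the sample (n-1) standard deviation is used.
    """
    reference = [v for v, ref in zip(values, is_reference) if ref]
    center = median(reference)
    spread = stdev(reference)
    return [(v - center) / spread for v in values]


def copy_number_status(segment_mean, n_probes, threshold=0.3, min_probes=3):
    """Classify a gene by the mean of the genomic segment that contains it.

    ASSUMPTION: segments supported by fewer than min_probes SNP probes are
    reported as "neutral" here.
    """
    if n_probes < min_probes:
        return "neutral"
    if segment_mean > threshold:
        return "amplified"
    if segment_mean < -threshold:
        return "deleted"
    return "neutral"


if __name__ == "__main__":
    levels = [1.0, 2.0, 3.0, 10.0]
    reference_mask = [True, True, True, False]
    print(normalize_to_reference(levels, reference_mask))
    print(copy_number_status(0.45, n_probes=12))
```

Centering on the reference median rather than on all samples keeps the baseline independent of how many mutated or copy-altered samples are present in a cohort.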
Figure S1: Overview of our CLUMPS approach for identifying significant mutation clustering in protein structures. WAP: weighted average proximity score; $d_{q,r}$: spatial (Euclidean) distance between the centroids of residues q and r; $n_q$ and $n_r$: normalized number of samples with missense mutations impacting residues q and r, respectively; t: soft distance threshold (see Materials and Methods in the Main Text for details).
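To make the quantities in this legend concrete, the sketch below computes a WAP-style score in Python. The authoritative definition is in Materials and Methods of the Main Text. The functional forms used here are assumptions: a Gaussian-like distance kernel exp(-d^2 / (2 t^2)) with t = 6, a Hill-type normalization N^m / (theta^m + N^m) with theta = 2 and m = 3, and a sum over unordered residue pairs. The function names are hypothetical.

```python
import math


def hill(n_samples, theta=2.0, m=3.0):
    """Normalized mutation count n_q (ASSUMED Hill-type saturation function)."""
    return n_samples ** m / (theta ** m + n_samples ** m)


def wap(centroids, counts, t=6.0):
    """Weighted average proximity over all unordered pairs of mutated residues.

    centroids: list of (x, y, z) residue centroids.
    counts: number of samples with a missense mutation at each residue.
    t: soft distance threshold, in the same units as the coordinates.
    ASSUMPTION: each pair contributes n_q * n_r * exp(-d_qr^2 / (2 t^2)).
    """
    weights = [hill(c) for c in counts]
    score = 0.0
    for q in range(len(centroids)):
        for r in range(q + 1, len(centroids)):
            d = math.dist(centroids[q], centroids[r])
            score += weights[q] * weights[r] * math.exp(-d ** 2 / (2 * t ** 2))
    return score


if __name__ == "__main__":
    coords = [(0.0, 0.0, 0.0), (6.0, 0.0, 0.0), (40.0, 0.0, 0.0)]
    recurrences = [2, 2, 1]
    print(wap(coords, recurrences))
```

Under these assumptions, nearby pairs of recurrently mutated residues dominate the score, while distant pairs contribute almost nothing.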
Figure S2: Quantile-quantile plot of empirical p-values calculated with CLUMPS for all tested (representative) protein structures (Dataset 1). Significant and near-significant protein structures are labeled; purple label color indicates tumor suppressors and green color indicates oncoproteins.
Figure S3: TumorPortal (http://tumorportal.org) screenshot showing the positions of mutations in NUF2. Missense mutations are shown as green circles, with color intensity scaling with evolutionary conservation. The portion of the NUF2 protein sequence covered by the structure shown in Fig. 3 (Main Text) is highlighted in black. Annotations in the figure mark a missense hotspot (p.S340L) and a splice-site hotspot.
Figure S4: Several non-recurrent mutations in STK11 impact residues at the active site, forming a spatial (3-D) cluster. A) TumorPortal (http://tumorportal.org) screenshot showing the positions of mutations in the linear STK11 protein sequence. Missense mutations are shown as green circles, with color intensity scaling with evolutionary conservation. B) Structure of STK11 (PDB: 2WTK) with mutated residues shown as red lines. Mutations that cluster together at the active site are labeled; p.N181 and p.D194 were found mutated in two samples each, the rest of the labeled residues in one sample each. Shown in blue is phosphoaminophosphonic acid-adenylate ester, an analog of the substrate ATP.
Figure S5: Comparison of CLUMPS p-values (denoted "Spatial clustering") against p-values calculated for the corresponding genes using the MutSig suite of tools for detecting cancer genes. MutSig provides three p-values corresponding to three different statistical tests (MutSig-CL: linear clustering of mutations; MutSig-CV: overall mutation burden, taking into account covariates like replication timing and expression level; and MutSig-FN: the relative frequency of mutations at evolutionarily conserved and likely functional DNA bases), as well as a combined p-value (MutSig-integrated). The plots correspond to a comparison of each of these four MutSig p-values against the CLUMPS p-value for the corresponding gene (the most significant CLUMPS p-value is considered if there are multiple representative protein structures). Spearman's correlation coefficient ρ is provided in each plot. Dashed red lines correspond to nominal significance thresholds (p = 0.01). Genes detected as significant or near-significant with CLUMPS, but not with MutSig or its separate components, are labeled.
Figure S6: Clusters of endometrial and prostate cancer mutations in SPOP. A) TumorPortal (http://tumorportal.org) screenshot showing the positions of mutations in SPOP. Missense mutations are shown as green circles, whose color intensity scales with evolutionary conservation. The portion of the SPOP protein sequence covered by the structure shown in Fig. 4 (Main Text) is highlighted in black. Annotations in the panel mark Cluster E (endometrial only; newly identified) and Cluster S (substrate-binding pocket). B) Protein and RNA levels of the SPOP substrates MAPK8 and PTEN in endometrial tumors with mutations from both Clusters E and S compared to SPOP-wildtype endometrial tumors (protein and RNA expression levels correspond to RPPA and RNAseq measurements by TCGA, respectively).
Figure S7: PPP2R1A (grey) bound to PPP2R5C (green) (PDB: 2NYL). Mutated residues in both proteins are highlighted in red, with color intensity scaling with the number of samples harboring missense mutations impacting the corresponding residue. Recurrent mutations (≥3 samples) are shown as sticks, non-recurrent mutations as thin lines. PPP2R1A mutations at the interface are labeled.
Figure S8: HRAS (grey) bound to RASA1 (green) (PDB: 1WQ1). Mutations in both proteins are colored in red, with color intensity scaling with recurrence. Recurrent mutations (≥3 samples) are shown as sticks, non-recurrent mutations as thin lines. Mutated interface residues in both proteins are labeled (black label: HRAS residues, green label: RASA1 residues).
Figure S9: OGT (grey) bound to an HCFC1 fragment (orange) (PDB: 4N3B). Residues in both proteins that are impacted by missense mutations are highlighted in red; those at the common interaction interface are labeled (black label: OGT residues, brown label: HCFC1 residues).
Figure S10: Distribution of the relative reference (UniProt) protein sequence coverage of all 3-D structures of proteins used in the full CLUMPS analysis (prior to selecting the representative structures per protein). SI Appendix, Fig. S12 shows the corresponding distribution after the selection of representative structures.
Figure S11: Protein sequence coverage by individual PDB structures is depicted for the top 20 proteins that showed significant or near-significant 3-D mutation clustering. The proteins are ordered on the x-axis and the length of each protein sequence is normalized to unity. The y-axis shows $\log_{10}$(CLUMPS p-value). Each blue line corresponds to a PDB structure/chain; its x-dimensions show the relative coverage of the protein sequence and its y-dimension shows the mutation clustering p-value for that structure/chain. Many overlapping lines are shown as a single thicker line. Red lines correspond to the structure selected by our greedy search algorithm (see Materials and Methods in the Main Text).
Figure S12: Distribution of the overall relative reference (UniProt) protein sequence coverage (= total residues covered by all selected 3-D structures for a protein, divided by the number of residues in the protein) for all proteins used in the full CLUMPS analysis.
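The overall coverage defined in this legend can be sketched in Python as the fraction of reference positions covered by at least one selected structure. The function name and the representation of each structure as a 1-based, inclusive (start, end) segment are assumptions made for illustration.

```python
def sequence_coverage(segments, protein_length):
    """Fraction of reference residues covered by at least one structure.

    segments: (start, end) pairs in reference coordinates, 1-based and
    inclusive (ASSUMED layout); overlapping segments are counted once.
    protein_length: number of residues in the reference (UniProt) sequence.
    """
    covered = set()
    for start, end in segments:
        # Clip each segment to the reference sequence before adding it.
        covered.update(range(max(start, 1), min(end, protein_length) + 1))
    return len(covered) / protein_length


if __name__ == "__main__":
    # Two overlapping structures on a 100-residue protein.
    print(sequence_coverage([(1, 50), (40, 80)], 100))
```

Taking the union of positions rather than summing segment lengths prevents overlapping structures from inflating the coverage above 1.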
Figure S13: Plots of functions used for calculating the Weighted Average Proximity (WAP) score: A) $f(d;\, t = 6) = e^{-d^{2}/(2t^{2})}$; B) $h(N;\, \Theta = 2,\, m = 3) = \dfrac{N^{m}}{\Theta^{m} + N^{m}}$.
Figure S14: Comparison of p-values obtained with the original implementation of CLUMPS, which weights mutated residues according to recurrence (see Materials and Methods) (black dots) against corresponding p-values obtained with a version of CLUMPS that weights all mutated residues equally (red stars). The top scoring 300 structures from Dataset 1 are shown.
Figure S15: Quantile-quantile plot of empirical p-values corresponding to mutation enrichment in interaction interfaces. Red dots represent significant interfaces (q ≤ 0.1; see Table 2 in the Main Text and Datasets 8, 10, 11, 12). The apparent slight inflation arises because interfaces were pre-filtered to keep only those with at least one mutation, and because, to increase sensitivity, the interface selection strategy favors interfaces with more mutations among different PDB instances of similar interfaces (see Materials and Methods in the Main Text).