Systematic assessment of cancer missense mutation clustering in protein structures


Atanas Kamburov, Michael Lawrence, Paz Polak, Ignaty Leshchiner, Kasper Lage, Todd R. Golub, Eric S. Lander, Gad Getz

SI Appendix

Supplemental Methods

Collapsing consecutive mutated residues

To examine the effect of consecutive mutated residues on CLUMPS results, we implemented a variant of CLUMPS in which two or more mutated residues that were consecutive in the protein sequence were combined into a single "meta-residue". The 3-D location of the centroid of the new meta-residue [used for Euclidean distance measurements to other mutated (meta-)residues] was calculated from the 3-D locations of the individual member residues and depended linearly on their mutational recurrence. For example, if both residues P[k] and P[k+1] of protein P are found mutated and P[k] is mutated much more frequently than P[k+1], then the centroid of the new meta-residue P[k:k+1] will be closer to the centroid of P[k] than to the centroid of P[k+1]. Unlike in the original CLUMPS implementation, (meta-)residues were not allowed to be immediately adjacent to each other in the protein sequence during the permutations.

Comparison of methods for cancer gene identification

Per-gene p-values calculated with MutSig and its components MutSig-CL, MutSig-FN and MutSig-CV were obtained from the original PanCancer study [1]. To enable comparison of the per-gene p-values calculated with these methods with the CLUMPS p-values (calculated per structure), we considered the smallest CLUMPS p-value of the representative structures for each protein.

Protein interaction interfaces

Information about human protein residues forming interaction interfaces with other human proteins, small molecule/ion ligands, DNA or RNA (based on co-complex structures from PDB) was obtained from the PDBsum database [2] on 27 July 2014. All residues of a protein predicted by PDBsum to be involved in any type of contact (e.g., hydrogen or disulphide bonds or non-bonded contacts) with the interaction partner were considered interface residues. Only interfaces with at least one mutation were analyzed.
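The meta-residue construction described under "Collapsing consecutive mutated residues" can be illustrated with a minimal Python sketch. It assumes that the linear dependence on recurrence is a recurrence-weighted mean of the member centroids and that the recurrence of a meta-residue is the sum of its members' recurrences; neither detail is stated above. The `Residue` class and the function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Residue:
    position: int    # 1-based position in the protein sequence
    centroid: tuple  # (x, y, z) coordinates of the residue centroid
    n_samples: int   # number of samples with a missense mutation here (> 0)


def _merge(run):
    """Combine a run of sequence-adjacent residues into one meta-residue."""
    total = sum(r.n_samples for r in run)
    # Assumption: the centroid is the recurrence-weighted mean of member centroids.
    centroid = tuple(
        sum(r.n_samples * r.centroid[axis] for r in run) / total
        for axis in range(3)
    )
    # Assumption: the meta-residue recurrence is the sum of member recurrences.
    return {
        "positions": [r.position for r in run],
        "centroid": centroid,
        "n_samples": total,
    }


def collapse_consecutive(residues):
    """Merge mutated residues that are consecutive in sequence."""
    merged, run = [], []
    for res in sorted(residues, key=lambda r: r.position):
        if run and res.position != run[-1].position + 1:
            merged.append(_merge(run))
            run = []
        run.append(res)
    if run:
        merged.append(_merge(run))
    return merged
```

With this weighting, a residue mutated in three samples pulls the meta-residue centroid three times as strongly as a neighbour mutated in one sample, which matches the P[k], P[k+1] example above.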
In cases where multiple co-complex structures were available for a given pair of interactors, we selected the structure maximizing interface size and sequence coverage of the protein interactor(s), as well as the number of mutations at the interface. As expected, factoring the number of mutations in interaction interfaces into the selection process, and especially restricting the analysis to interfaces with at least one observed mutation, led to some inflation in a Q-Q plot (SI Appendix, Fig. S15); however, we aimed to avoid missing interesting biological interactions due to false-negative contact residue predictions in PDBsum.

Mutually similar (in terms of interface residues) protein-ligand interfaces were grouped together, and from each group only one representative interface was analyzed (i.e., the one comprising the most residues). This was done to avoid separately testing interfaces such as KRAS-GTP, KRAS-GDP, KRAS-inhibitor, etc.

In the case of protein-protein interactions, we focused only on heteromers because, for many homomeric co-complex structures, it is unclear whether the corresponding protein forms oligomers in solution or whether the observed residue contacts are attributable only to the way the protein was crystallized ("crystal-packing interactions") [3]. Moreover, in many instances one of the interactors was not annotated with a UniProt identifier in PDB/SIFTS despite the existence of a non-standard protein name annotation. To recover missing UniProt annotations, we aligned all non-annotated sequences that were found in protein complexes with human proteins against UniProt/SwissProt-human using WU-BLAST (http://www.ebi.ac.uk/tools/sss/wublast/). A given query sequence was annotated with the UniProt reference identifier corresponding to the smallest BLASTP alignment p-value, but only if at least 90% of the query was aligned to the reference with at least 90% sequence identity.

Protein/RNA expression and copy number data

Matched TCGA RPPA, RNAseq and copy number data from endometrial [4] and colorectal tumor samples [5] (used for quantifying the expression of SPOP substrates and CCNE1, respectively) were downloaded from the Broad GDAC portal (http://gdac.broadinstitute.org/). The samples were divided into several groups according to SPOP/FBXW7 mutation and substrate copy number statuses (SI Appendix, Fig. S6 B and Main Text Fig. 5). Before plotting, protein and RNA expression levels in each sample were normalized by subtracting the median and dividing by the standard deviation of the corresponding expression level distributions of samples with no SPOP/FBXW7 somatic mutations and no substrate copy number changes. A gene was considered amplified or deleted if it was in a genomic segment, supported by at least 3 SNP probes, with a mean above 0.3 or below -0.3, respectively, in the copy number data.

References

1. Lawrence MS et al. (2014) Discovery and saturation analysis of cancer genes across 21 tumour types. Nature 505:495-501.
2. de Beer TAP, Berka K, Thornton JM, Laskowski RA (2014) PDBsum additions. Nucleic Acids Res. 42:D292-296.
3. Janin J (1997) Specific versus non-specific contacts in protein crystals. Nat. Struct. Biol. 4:973-974.
4. The Cancer Genome Atlas Network (2013) Integrated genomic characterization of endometrial carcinoma. Nature 497:67-73.
5. The Cancer Genome Atlas Network (2012) Comprehensive molecular characterization of human colon and rectal cancer. Nature 487:330-337.
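The expression normalization and copy-number calls described under "Protein/RNA expression and copy number data" can be sketched as follows. This is an illustration under stated assumptions, not the code used for the analysis: the function names are hypothetical, the sample standard deviation (n - 1 denominator) is assumed because the text does not specify which estimator was used, and the thresholds are treated as strict inequalities.

```python
import statistics


def normalize_to_reference(values, reference_values):
    """Center on the reference median and scale by the reference SD.

    `reference_values` are expression levels from samples with no
    SPOP/FBXW7 somatic mutations and no substrate copy number changes.
    """
    med = statistics.median(reference_values)
    sd = statistics.stdev(reference_values)  # assumption: sample SD
    return [(v - med) / sd for v in values]


def copy_number_status(segment_mean, n_probes, min_probes=3, threshold=0.3):
    """Call a gene amplified/deleted from the segment that contains it."""
    if n_probes < min_probes:
        return "neutral"  # segment not supported by enough SNP probes
    if segment_mean > threshold:
        return "amplified"
    if segment_mean < -threshold:
        return "deleted"
    return "neutral"
```

Normalizing against the unaltered group expresses every sample as a deviation from the typical wild-type, copy-neutral level, so groups with mutations or copy-number changes can be compared on a common scale.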

Figure S1: Overview of our CLUMPS approach for identifying significant mutation clustering in protein structures. WAP: weighted average proximity score; $d_{q,r}$: spatial (Euclidean) distance between the centroids of residues $q$ and $r$; $n_q$ and $n_r$: normalized number of samples with missense mutations impacting residues $q$ and $r$, respectively; $t$: soft distance threshold (see Materials and Methods in the Main Text for details).
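The quantities named in this caption can be combined into a small illustrative sketch of a WAP-style score and a permutation p-value. Several details are assumptions rather than statements from this appendix: the Gaussian distance kernel, the Hill-type saturation of recurrence counts (with the parameter values t = 6, Θ = 2 and m = 3 given in Fig. S13), the sum over unordered residue pairs, the null model (mutated positions redistributed uniformly over residues in the structure) and the add-one p-value estimator. All function names are hypothetical; the Main Text remains the authoritative definition.

```python
import math
import random
from itertools import combinations


def hill(n, theta=2.0, m=3.0):
    """Assumed saturating normalization of a residue's sample count."""
    return n ** m / (theta ** m + n ** m)


def wap(centroids, counts, t=6.0):
    """Weighted average proximity over all pairs of mutated residues."""
    weights = [hill(c) for c in counts]
    score = 0.0
    for q, r in combinations(range(len(centroids)), 2):
        d = math.dist(centroids[q], centroids[r])  # Euclidean distance d_{q,r}
        # Assumed Gaussian kernel with soft distance threshold t.
        score += weights[q] * weights[r] * math.exp(-d ** 2 / (2 * t ** 2))
    return score


def empirical_p(all_centroids, mutated, n_perm=1000, seed=0):
    """Permutation p-value; `mutated` maps residue index -> sample count."""
    rng = random.Random(seed)
    indices = list(mutated)
    counts = [mutated[i] for i in indices]
    observed = wap([all_centroids[i] for i in indices], counts)
    hits = 0
    for _ in range(n_perm):
        # Assumed null: place the same counts on randomly chosen residues.
        shuffled = rng.sample(range(len(all_centroids)), len(indices))
        if wap([all_centroids[i] for i in shuffled], counts) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)
```

Because the kernel decays smoothly with distance, residue pairs slightly beyond t still contribute a little, which is what makes the threshold "soft".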

Figure S2: Quantile-quantile plot of empirical p-values calculated with CLUMPS for all tested (representative) protein structures (Dataset 1). Significant and near-significant protein structures are labeled; purple label color indicates tumor suppressors and green color indicates oncoproteins.

Figure S3: TumorPortal (http://tumorportal.org) screenshot showing the positions of mutations in NUF2. Missense mutations are shown as green circles, with color intensity scaling with evolutionary conservation. The portion of the NUF2 protein sequence covered by the structure shown in Fig. 3 (Main Text) is highlighted in black. Figure annotations: missense hotspot (p.S340L); splice-site hotspot.

Figure S4: Several non-recurrent mutations in STK11 impact residues at the active site, forming a spatial (3-D) cluster. A) TumorPortal (http://tumorportal.org) screenshot showing the positions of mutations in the linear STK11 protein sequence. Missense mutations are shown as green circles, with color intensity scaling with evolutionary conservation. B) Structure of STK11 (PDB: 2WTK) with mutated residues shown as red lines. Mutations that cluster together at the active site are labeled; p.N181 and p.D194 were found mutated in two samples each, the rest of the labeled residues in one sample each. Shown in blue is phosphoaminophosphonic acid-adenylate ester, an analog of the substrate ATP.

Figure S5: Comparison of CLUMPS p-values (denoted "Spatial clustering") against p-values calculated for the corresponding genes using the MutSig suite of tools for detecting cancer genes. MutSig provides three p-values corresponding to three different statistical tests (MutSig-CL: linear clustering of mutations; MutSig-CV: overall mutation burden, taking into account covariates like replication timing and expression level; and MutSig-FN: the relative frequency of mutations at evolutionarily conserved and likely functional DNA bases), as well as a combined p-value (MutSig-integrated). The plots correspond to a comparison of each of these four MutSig p-values against the CLUMPS p-value for the corresponding gene (the most significant CLUMPS p-value is considered if there are multiple representative protein structures). Spearman's correlation coefficient ρ is provided in each figure. Dashed red lines correspond to nominal significance thresholds (p=0.01). Genes detected as significant or near-significant with CLUMPS, but not with MutSig or its separate components, are labeled.

Figure S6: Clusters of endometrial and prostate cancer mutations in SPOP. A) TumorPortal (http://tumorportal.org) screenshot showing the positions of mutations in SPOP. Missense mutations are shown as green circles, whose color intensity scales with evolutionary conservation. The portion of the SPOP protein sequence covered by the structure shown in Fig. 4 (Main Text) is highlighted in black. Panel annotations: Cluster E (endometrial only; newly identified) and Cluster S (substrate-binding pocket). B) Protein and RNA levels of the SPOP substrates MAPK8 and PTEN in endometrial tumors with mutations from both Clusters E and S compared to SPOP-wildtype endometrial tumors (protein and RNA expression levels correspond to RPPA and RNAseq measurements by TCGA, respectively).

Figure S7: PPP2R1A (grey) bound to PPP2R5C (green) (PDB: 2NYL). Mutated residues in both proteins are highlighted in red, with color intensity scaling with the number of samples harboring missense mutations impacting the corresponding residue. Recurrent mutations (≥3 samples) are shown as sticks, non-recurrent mutations as thin lines. PPP2R1A mutations at the interface are labeled.

Figure S8: HRAS (grey) bound to RASA1 (green) (PDB: 1WQ1). Mutations in both proteins are colored in red, with color intensity scaling with recurrence. Recurrent mutations (≥3 samples) are shown as sticks, non-recurrent mutations as thin lines. Mutated interface residues in both proteins are labeled (black label: HRAS residues, green label: RASA1 residues).

Figure S9: OGT (grey) bound to an HCFC1 fragment (orange) (PDB: 4N3B). Residues in both proteins that are impacted by missense mutations are highlighted in red; those at the common interaction interface are labeled (black label: OGT residues, brown label: HCFC1 residues).

Figure S10: Distribution of the relative reference (UniProt) protein sequence coverage of all 3-D structures of proteins used in the full CLUMPS analysis (prior to selecting the representative structures per protein). SI Appendix, Fig S12 shows a corresponding distribution after the selection of representative structures.

Figure S11: Protein sequence coverage by individual PDB structures is depicted for the top 20 proteins that showed significant or near-significant 3-D mutation clustering. The proteins are ordered on the x-axis and the length of each protein sequence is normalized to unity. The y-axis shows $\log_{10}$(CLUMPS p-value). Each blue line corresponds to a PDB structure/chain; its x-dimensions show the relative coverage of the protein sequence and its y-dimension shows the mutation clustering p-value for that structure/chain. Many overlapping lines are shown as a single thicker line. Red lines correspond to the structure selected by our greedy search algorithm (see Materials and Methods in the Main Text).

Figure S12: Distribution of the overall relative reference (UniProt) protein sequence coverage (= total residues covered by all selected 3D structures for a protein over the number of residues in the protein) for all proteins used in the full CLUMPS analysis.
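The overall relative coverage defined in this caption can be computed as the size of the union of covered residues divided by the protein length. The sketch below assumes that each selected structure is summarized as inclusive, 1-based (start, end) residue ranges on the UniProt reference sequence; the function name and this input format are hypothetical.

```python
def relative_coverage(segments, protein_length):
    """Fraction of the reference sequence covered by any selected structure.

    `segments` is an iterable of inclusive, 1-based (start, end) residue
    ranges; overlapping ranges are counted only once.
    """
    covered = set()
    for start, end in segments:
        # Clip each range to the reference sequence before taking the union.
        covered.update(range(max(start, 1), min(end, protein_length) + 1))
    return len(covered) / protein_length
```

For example, two structures covering residues 1-50 and 40-80 of a 100-residue protein together cover 80 distinct residues, giving a relative coverage of 0.8.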

Figure S13: Plots of functions used for calculating the Weighted Average Proximity (WAP) score: A) $f(d;\, t = 6) = e^{-d^2/(2t^2)}$; B) $h(N;\, \Theta = 2,\, m = 3) = \frac{N^m}{\Theta^m + N^m}$.

Figure S14: Comparison of p-values obtained with the original implementation of CLUMPS, which weights mutated residues according to recurrence (see Materials and Methods) (black dots) against corresponding p-values obtained with a version of CLUMPS that weights all mutated residues equally (red stars). The top scoring 300 structures from Dataset 1 are shown.

Figure S15: Quantile-quantile plot of empirical p-values corresponding to mutation enrichment in interaction interfaces. Red dots represent significant interfaces (q ≤ 0.1; see Table 2 in the Main Text and Datasets 8, 10, 11, 12). The apparent slight inflation is due to the pre-filtering of interfaces to select only those with at least one mutation and because the interface selection strategy favors interfaces with more mutations among different PDB instances of similar interfaces in order to increase sensitivity (see Materials and Methods in the Main Text).