Systematic assessment of cancer missense mutation clustering in protein structures

Atanas Kamburov, Michael Lawrence, Paz Polak, Ignaty Leshchiner, Kasper Lage, Todd R. Golub, Eric S. Lander, Gad Getz

SI Appendix

Supplemental Methods

Collapsing consecutive mutated residues

To examine the effect of consecutive mutated residues on CLUMPS results, we implemented a variant of CLUMPS in which two or more mutated residues that were consecutive in the protein sequence were combined into a single "meta-residue". The 3-D location of the centroid of the new meta-residue [used for Euclidean distance measurements to other mutated (meta-)residues] was calculated from the 3-D locations of the individual member residues and also depended linearly on their mutational recurrence. For example, if both residues P[k] and P[k+1] of protein P are found mutated and P[k] is mutated much more frequently than P[k+1], then the centroid of the new meta-residue P[k:k+1] will be closer to the centroid of P[k] than to the centroid of P[k+1]. Unlike in the original CLUMPS implementation, (meta-)residues were not allowed to be immediately next to each other in the protein sequence during the permutations.

Comparison of methods for cancer gene identification

Per-gene p-values calculated with MutSig and its components MutSig-CL, MutSig-FN and MutSig-CV were obtained from the original PanCancer study [1]. To compare these per-gene p-values with the CLUMPS p-values (calculated per structure), we took, for each protein, the smallest CLUMPS p-value among its representative structures.

Protein interaction interfaces

Information about human protein residues forming interaction interfaces with other human proteins, small molecule/ion ligands, DNA or RNA (based on co-complex structures from PDB) was obtained from the PDBsum database [2]. All residues of a protein predicted by PDBsum to be involved in any type of contact (e.g., hydrogen or disulphide bonds or non-bonded contacts) with the interaction partner were considered interface residues. Only interfaces with at least one mutation were analyzed.
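The meta-residue construction described under "Collapsing consecutive mutated residues" can be illustrated with a short Python sketch. It is not the CLUMPS implementation: it assumes that "depended linearly on their mutational recurrence" means a recurrence-weighted mean of the member centroids, and the function names, coordinates and sample counts are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Coord = Tuple[float, float, float]


def group_consecutive(positions: Sequence[int]) -> List[List[int]]:
    """Split mutated residue positions into runs that are consecutive in sequence."""
    runs: List[List[int]] = []
    for pos in sorted(set(positions)):
        if runs and pos == runs[-1][-1] + 1:
            runs[-1].append(pos)   # extends the current run
        else:
            runs.append([pos])     # starts a new run
    return runs


def meta_residue_centroid(
    members: Sequence[int],
    centroids: Dict[int, Coord],
    recurrence: Dict[int, int],
) -> Coord:
    """Recurrence-weighted mean of member centroids (assumed weighting scheme)."""
    total = sum(recurrence[p] for p in members)
    x = sum(recurrence[p] * centroids[p][0] for p in members) / total
    y = sum(recurrence[p] * centroids[p][1] for p in members) / total
    z = sum(recurrence[p] * centroids[p][2] for p in members) / total
    return (x, y, z)


# Made-up example: residues 10 and 11 are consecutive, residue 25 stands alone.
centroids = {10: (0.0, 0.0, 0.0), 11: (4.0, 0.0, 0.0), 25: (9.0, 9.0, 9.0)}
recurrence = {10: 3, 11: 1, 25: 2}  # hypothetical numbers of mutated samples

runs = group_consecutive(list(centroids))
meta = [meta_residue_centroid(run, centroids, recurrence) for run in runs]
```

In this toy input the meta-residue for positions 10 and 11 lies closer to residue 10, the more frequently mutated member, which is the behaviour the P[k:k+1] example describes.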
In cases where multiple co-complex structures were available for a given pair of interactors, we selected the structure maximizing interface size and sequence coverage of the protein interactor(s), as well as the number of mutations at the interface. As expected, factoring the number of mutations in interaction interfaces into the selection process, and especially restricting the analysis to interfaces with at least one observed mutation, led to some inflation in a Q-Q plot (SI Appendix, Fig. S15); however, we aimed to avoid missing interesting biological interactions due to false-negative contact residue predictions in PDBsum.

Mutually similar (in terms of interface residues) protein-ligand interfaces were grouped together, and from each group only one representative interface was analyzed (i.e., the one comprising the most residues). This was done to avoid separately testing interfaces like KRAS-GTP, KRAS-GDP, KRAS-inhibitor, etc.

In the case of protein-protein interactions, we focused only on heteromers, since for many homomeric co-complex structures it is unclear whether the corresponding protein forms oligomers in solution or whether the observed residue contacts are attributable only to the way the protein was crystallized ("crystal-packing interactions") [3]. Moreover, in many instances one of the interactors was not annotated with a UniProt identifier in PDB/SIFTS despite the existence of a non-standard protein name annotation. To recover missing UniProt annotations, we aligned all non-annotated sequences that were found in protein complexes with human proteins against UniProt/SwissProt-human using WU-BLAST. A given query sequence was annotated with the UniProt reference identifier corresponding to the smallest BLASTP alignment p-value, but only if at least 90% of the query was aligned to the reference with at least 90% sequence identity.

Protein/RNA expression and copy number data

Matched TCGA RPPA, RNAseq and copy number data from endometrial [4] and colorectal tumor samples [5] (used for quantifying the expression of SPOP substrates and CCNE1, respectively) were downloaded from the Broad GDAC portal. The samples were divided into several groups according to SPOP/FBXW7 mutation and substrate copy number statuses (SI Appendix, Fig. S6B and Main Text Fig. 5). Before plotting, protein and RNA expression levels in each sample were normalized by subtracting the median and dividing by the standard deviation of the corresponding expression level distributions of samples with no SPOP/FBXW7 somatic mutations and no substrate copy number changes. A gene was considered amplified (deleted) if it lay in a genomic segment, supported by at least 3 SNP probes, with a mean above 0.3 (below -0.3) in the copy number data.

References

1. Lawrence MS et al. (2014) Discovery and saturation analysis of cancer genes across 21 tumour types. Nature 505:
2. de Beer TAP, Berka K, Thornton JM, Laskowski RA (2014) PDBsum additions. Nucleic Acids Res. 42:D
3. Janin J (1997) Specific versus non-specific contacts in protein crystals. Nat. Struct. Biol. 4:
4. The Cancer Genome Atlas Network (2013) Integrated genomic characterization of endometrial carcinoma. Nature 497:
5. The Cancer Genome Atlas Network (2012) Comprehensive molecular characterization of human colon and rectal cancer. Nature 487:
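The normalization and copy number calls described under "Protein/RNA expression and copy number data" can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, the sample standard deviation (n - 1 denominator) is assumed because the text does not specify the estimator, and segments with fewer than 3 probes are labeled "neutral" here only for convenience.

```python
from statistics import median, stdev
from typing import List, Sequence


def normalize_to_reference(values: Sequence[float], reference: Sequence[float]) -> List[float]:
    """Subtract the reference median and divide by the reference standard deviation.

    `reference` holds expression levels from samples with no SPOP/FBXW7 mutation
    and no substrate copy number change.
    """
    ref_median = median(reference)
    ref_sd = stdev(reference)  # assumption: sample SD; the source does not say which
    return [(v - ref_median) / ref_sd for v in values]


def copy_number_status(segment_mean: float, n_probes: int,
                       threshold: float = 0.3, min_probes: int = 3) -> str:
    """Call a gene amplified/deleted from the segment that contains it."""
    if n_probes < min_probes:
        return "neutral"  # assumption: unsupported segments are not called
    if segment_mean > threshold:
        return "amplified"
    if segment_mean < -threshold:
        return "deleted"
    return "neutral"


# Made-up numbers for illustration only.
reference_levels = [1.0, 2.0, 3.0]
normalized = normalize_to_reference([2.0, 4.0], reference_levels)
calls = [copy_number_status(0.5, 5), copy_number_status(-0.4, 3), copy_number_status(0.5, 2)]
```

Using the unaltered, wild-type samples as the reference puts every plotted value on a common scale: a normalized value of 2 means two reference standard deviations above the reference median.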

Figure S1: Overview of our CLUMPS approach for identifying significant mutation clustering in protein structures. WAP: weighted average proximity score; $d_{q,r}$: spatial (Euclidean) distance between the centroids of residues q and r; $n_q$ and $n_r$: normalized number of samples with missense mutations impacting residues q and r, respectively; t: soft distance threshold (see Materials and Methods in the Main Text for details).
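To make the quantities in this caption concrete, here is a minimal Python sketch of a WAP-style score. It rests on assumptions: a Gaussian-type distance kernel exp(-d^2 / (2 t^2)) and a Hill-function normalization of sample counts, chosen to match the parameters named in the caption of Fig. S13 (t = 6, Θ = 2, m = 3), and a sum over unordered residue pairs. The Main Text remains the authoritative definition; all names and input values below are hypothetical.

```python
import math
from itertools import combinations
from typing import Dict, Tuple

Coord = Tuple[float, float, float]


def hill(n_samples: float, theta: float = 2.0, m: float = 3.0) -> float:
    """Assumed Hill-function normalization of the number of mutated samples."""
    return n_samples ** m / (theta ** m + n_samples ** m)


def wap_score(centroids: Dict[int, Coord], counts: Dict[int, int], t: float = 6.0) -> float:
    """Sum of n_q * n_r * exp(-d_qr^2 / (2 t^2)) over unordered mutated residue pairs."""
    score = 0.0
    for q, r in combinations(sorted(centroids), 2):
        d = math.dist(centroids[q], centroids[r])  # Euclidean distance between centroids
        score += hill(counts[q]) * hill(counts[r]) * math.exp(-d ** 2 / (2 * t ** 2))
    return score


# Made-up example: two mutated residues whose centroids are 6 units apart.
example_centroids = {12: (0.0, 0.0, 0.0), 61: (6.0, 0.0, 0.0)}
example_counts = {12: 2, 61: 2}
example_wap = wap_score(example_centroids, example_counts)
```

In CLUMPS the observed score is compared against scores from permuted mutation positions to obtain an empirical p-value; that permutation step is omitted from this sketch.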

Figure S2: Quantile-quantile plot of empirical p-values calculated with CLUMPS for all tested (representative) protein structures (Dataset 1). Significant and near-significant protein structures are labeled; purple label color indicates tumor suppressors and green color indicates oncoproteins.

Figure S3: TumorPortal screenshot showing the positions of mutations in NUF2. Missense mutations are shown as green circles, with color intensity scaling with evolutionary conservation. The portion of the NUF2 protein sequence covered by the structure shown in Fig. 3 (Main Text) is highlighted in black. Annotations in the figure: missense hotspot p.S340L; splice-site hotspot.

Figure S4: Several non-recurrent mutations in STK11 impact residues at the active site, forming a spatial (3-D) cluster. A) TumorPortal screenshot showing the positions of mutations in the linear STK11 protein sequence. Missense mutations are shown as green circles, with color intensity scaling with evolutionary conservation. B) Structure of STK11 (PDB: 2WTK) with mutated residues shown as red lines. Mutations that cluster together at the active site are labeled; p.N181 and p.D194 were found mutated in two samples each, the rest of the labeled residues in one sample each. Shown in blue is phosphoaminophosphonic acid-adenylate ester, an analog of the substrate ATP.

Figure S5: Comparison of CLUMPS p-values (denoted "Spatial clustering") against p-values calculated for the corresponding genes using the MutSig suite of tools for detecting cancer genes. MutSig provides three p-values corresponding to three different statistical tests (MutSig-CL: linear clustering of mutations; MutSig-CV: overall mutation burden, taking into account covariates like replication timing and expression level; and MutSig-FN: the relative frequency of mutations at evolutionarily conserved and likely functional DNA bases), as well as a combined p-value (MutSig-integrated). The plots compare each of these four MutSig p-values against the CLUMPS p-value for the corresponding gene (the most significant CLUMPS p-value is used if there are multiple representative protein structures). Spearman's correlation coefficient ρ is provided in each panel. Dashed red lines correspond to nominal significance thresholds (p = 0.01). Genes detected as significant or near-significant with CLUMPS, but not with MutSig or its separate components, are labeled.
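The per-gene reduction used for this comparison (keeping the most significant CLUMPS p-value when a gene has several representative structures) can be sketched in Python. The function name, structure identifiers and p-values are hypothetical and serve only to show the bookkeeping.

```python
from typing import Dict, Iterable, Tuple


def best_pvalue_per_gene(structure_pvalues: Iterable[Tuple[str, str, float]]) -> Dict[str, float]:
    """Map each gene to the smallest p-value among its representative structures.

    Each input record is (gene, structure_id, p_value).
    """
    best: Dict[str, float] = {}
    for gene, _structure_id, p_value in structure_pvalues:
        if gene not in best or p_value < best[gene]:
            best[gene] = p_value
    return best


# Made-up records for illustration only.
records = [
    ("GENE_A", "structure_1", 0.020),
    ("GENE_A", "structure_2", 0.004),
    ("GENE_B", "structure_3", 0.300),
]
per_gene = best_pvalue_per_gene(records)
```

The resulting per-gene values are what would be paired with the MutSig p-values of the same genes before computing a rank correlation.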

Figure S6: Clusters of endometrial and prostate cancer mutations in SPOP. A) TumorPortal screenshot showing the positions of mutations in SPOP. Missense mutations are shown as green circles, whose color intensity scales with evolutionary conservation. The portion of the SPOP protein sequence covered by the structure shown in Fig. 4 (Main Text) is highlighted in black. Annotations in the panel: Cluster E (endometrial only; newly identified) and Cluster S (substrate-binding pocket). B) Protein and RNA levels of the SPOP substrates MAPK8 and PTEN in endometrial tumors with mutations from both Clusters E and S compared to SPOP-wildtype endometrial tumors (protein and RNA expression levels correspond to RPPA and RNAseq measurements by TCGA, respectively).

Figure S7: PPP2R1A (grey) bound to PPP2R5C (green) (PDB: 2NYL). Mutated residues in both proteins are highlighted in red, with color intensity scaling with the number of samples harboring missense mutations impacting the corresponding residue. Recurrent mutations (≥3 samples) are shown as sticks, non-recurrent mutations as thin lines. PPP2R1A mutations at the interface are labeled.

Figure S8: HRAS (grey) bound to RASA1 (green) (PDB: 1WQ1). Mutations in both proteins are colored in red, with color intensity scaling with recurrence. Recurrent mutations (≥3 samples) are shown as sticks, non-recurrent mutations as thin lines. Mutated interface residues in both proteins are labeled (black label: HRAS residues, green label: RASA1 residues).

Figure S9: OGT (grey) bound to an HCFC1 fragment (orange) (PDB: 4N3B). Residues in both proteins that are impacted by missense mutations are highlighted in red; those at the common interaction interface are labeled (black label: OGT residues, brown label: HCFC1 residues).

Figure S10: Distribution of the relative reference (UniProt) protein sequence coverage of all 3-D structures of proteins used in the full CLUMPS analysis (prior to selecting the representative structures per protein). SI Appendix, Fig. S12 shows the corresponding distribution after the selection of representative structures.

Figure S11: Protein sequence coverage by individual PDB structures is depicted for the top 20 proteins that showed significant or near-significant 3-D mutation clustering. The proteins are ordered on the x-axis and the length of each protein sequence is normalized to unity. The y-axis shows $\log_{10}$(CLUMPS p-value). Each blue line corresponds to a PDB structure/chain; its x-extent shows the relative coverage of the protein sequence and its y-position shows the mutation clustering p-value for that structure/chain. Many overlapping lines are shown as a single thicker line. Red lines correspond to the structure selected by our greedy search algorithm (see Materials and Methods in the Main Text).

Figure S12: Distribution of the overall relative reference (UniProt) protein sequence coverage (= total residues covered by all selected 3-D structures for a protein, divided by the number of residues in the protein) for all proteins used in the full CLUMPS analysis.
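The coverage measure defined in this caption can be sketched in Python. The sketch assumes each selected structure covers a contiguous, 1-based, inclusive range of UniProt residue positions, and that residues covered by more than one structure are counted once; the function name and the example ranges are hypothetical.

```python
from typing import Iterable, Set, Tuple


def relative_coverage(covered_ranges: Iterable[Tuple[int, int]], protein_length: int) -> float:
    """Fraction of the reference sequence covered by at least one selected structure."""
    covered: Set[int] = set()
    for start, end in covered_ranges:
        # Clip each range to the protein and collect the covered positions.
        covered.update(range(max(start, 1), min(end, protein_length) + 1))
    return len(covered) / protein_length


# Made-up example: two overlapping structures on a 100-residue protein.
example_coverage = relative_coverage([(1, 50), (40, 80)], protein_length=100)
```

Counting each residue once keeps the value between 0 and 1 even when several selected structures overlap.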

Figure S13: Plots of the functions used for calculating the Weighted Average Proximity (WAP) score: A) $f(d;\, t = 6) = e^{-d_{q,r}^{2}/(2t^{2})}$; B) $h(N;\, \Theta = 2,\, m = 3) = \frac{N^{m}}{\Theta^{m} + N^{m}}$.

Figure S14: Comparison of p-values obtained with the original implementation of CLUMPS, which weights mutated residues according to recurrence (see Materials and Methods) (black dots), against corresponding p-values obtained with a version of CLUMPS that weights all mutated residues equally (red stars). The 300 top-scoring structures from Dataset 1 are shown.

Figure S15: Quantile-quantile plot of empirical p-values corresponding to mutation enrichment in interaction interfaces. Red dots represent significant interfaces (q ≤ 0.1; see Table 2 in the Main Text and Datasets 8, 10, 11, 12). The apparent slight inflation arises because interfaces were pre-filtered to keep only those with at least one mutation, and because the interface selection strategy, in order to increase sensitivity, favors interfaces with more mutations among different PDB instances of similar interfaces (see Materials and Methods in the Main Text).


More information

9. Text & Documents. Visualizing and Searching Documents. Dr. Thorsten Büring, 20. Dezember 2007, Vorlesung Wintersemester 2007/08

9. Text & Documents. Visualizing and Searching Documents. Dr. Thorsten Büring, 20. Dezember 2007, Vorlesung Wintersemester 2007/08 9. Text & Documents Visualizing and Searching Documents Dr. Thorsten Büring, 20. Dezember 2007, Vorlesung Wintersemester 2007/08 Slide 1 / 37 Outline Characteristics of text data Detecting patterns SeeSoft

More information

Supplementary Figures S1 - S11

Supplementary Figures S1 - S11 1 Membrane Sculpting by F-BAR Domains Studied by Molecular Dynamics Simulations Hang Yu 1,2, Klaus Schulten 1,2,3, 1 Beckman Institute, University of Illinois, Urbana, Illinois, USA 2 Center of Biophysics

More information

2.500 Threshold. 2.000 1000e - 001. Threshold. Exponential phase. Cycle Number

2.500 Threshold. 2.000 1000e - 001. Threshold. Exponential phase. Cycle Number application note Real-Time PCR: Understanding C T Real-Time PCR: Understanding C T 4.500 3.500 1000e + 001 4.000 3.000 1000e + 000 3.500 2.500 Threshold 3.000 2.000 1000e - 001 Rn 2500 Rn 1500 Rn 2000

More information

Bioinformatics Resources at a Glance

Bioinformatics Resources at a Glance Bioinformatics Resources at a Glance A Note about FASTA Format There are MANY free bioinformatics tools available online. Bioinformaticists have developed a standard format for nucleotide and protein sequences

More information

MultiExperiment Viewer Quickstart Guide

MultiExperiment Viewer Quickstart Guide MultiExperiment Viewer Quickstart Guide Table of Contents: I. Preface - 2 II. Installing MeV - 2 III. Opening a Data Set - 2 IV. Filtering - 6 V. Clustering a. HCL - 8 b. K-means - 11 VI. Modules a. T-test

More information

Linear Sequence Analysis. 3-D Structure Analysis

Linear Sequence Analysis. 3-D Structure Analysis Linear Sequence Analysis What can you learn from a (single) protein sequence? Calculate it s physical properties Molecular weight (MW), isoelectric point (pi), amino acid content, hydropathy (hydrophilic

More information

Multivariate Analysis of Ecological Data

Multivariate Analysis of Ecological Data Multivariate Analysis of Ecological Data MICHAEL GREENACRE Professor of Statistics at the Pompeu Fabra University in Barcelona, Spain RAUL PRIMICERIO Associate Professor of Ecology, Evolutionary Biology

More information

MASCOT Search Results Interpretation

MASCOT Search Results Interpretation The Mascot protein identification program (Matrix Science, Ltd.) uses statistical methods to assess the validity of a match. MS/MS data is not ideal. That is, there are unassignable peaks (noise) and usually

More information

Algorithms in Computational Biology (236522) spring 2007 Lecture #1

Algorithms in Computational Biology (236522) spring 2007 Lecture #1 Algorithms in Computational Biology (236522) spring 2007 Lecture #1 Lecturer: Shlomo Moran, Taub 639, tel 4363 Office hours: Tuesday 11:00-12:00/by appointment TA: Ilan Gronau, Taub 700, tel 4894 Office

More information

Analysis of gene expression data. Ulf Leser and Philippe Thomas

Analysis of gene expression data. Ulf Leser and Philippe Thomas Analysis of gene expression data Ulf Leser and Philippe Thomas This Lecture Protein synthesis Microarray Idea Technologies Applications Problems Quality control Normalization Analysis next week! Ulf Leser:

More information

Introduction To Real Time Quantitative PCR (qpcr)

Introduction To Real Time Quantitative PCR (qpcr) Introduction To Real Time Quantitative PCR (qpcr) SABiosciences, A QIAGEN Company www.sabiosciences.com The Seminar Topics The advantages of qpcr versus conventional PCR Work flow & applications Factors

More information

Protein Prospector and Ways of Calculating Expectation Values

Protein Prospector and Ways of Calculating Expectation Values Protein Prospector and Ways of Calculating Expectation Values 1/16 Aenoch J. Lynn; Robert J. Chalkley; Peter R. Baker; Mark R. Segal; and Alma L. Burlingame University of California, San Francisco, San

More information

Computing the maximum similarity bi-clusters of gene expression data

Computing the maximum similarity bi-clusters of gene expression data BIOINFORMATICS ORIGINAL PAPER Vol. 23 no. 1 2007, pages 50 56 doi:10.1093/bioinformatics/btl560 Gene expression Computing the maximum similarity bi-clusters of gene expression data Xiaowen Liu and Lusheng

More information

Software and Methods for the Analysis of Affymetrix GeneChip Data. Rafael A Irizarry Department of Biostatistics Johns Hopkins University

Software and Methods for the Analysis of Affymetrix GeneChip Data. Rafael A Irizarry Department of Biostatistics Johns Hopkins University Software and Methods for the Analysis of Affymetrix GeneChip Data Rafael A Irizarry Department of Biostatistics Johns Hopkins University Outline Overview Bioconductor Project Examples 1: Gene Annotation

More information

Hierarchical Bayesian Modeling of the HIV Response to Therapy

Hierarchical Bayesian Modeling of the HIV Response to Therapy Hierarchical Bayesian Modeling of the HIV Response to Therapy Shane T. Jensen Department of Statistics, The Wharton School, University of Pennsylvania March 23, 2010 Joint Work with Alex Braunstein and

More information

Analysis of ChIP-seq data in Galaxy

Analysis of ChIP-seq data in Galaxy Analysis of ChIP-seq data in Galaxy November, 2012 Local copy: https://galaxy.wi.mit.edu/ Joint project between BaRC and IT Main site: http://main.g2.bx.psu.edu/ 1 Font Conventions Bold and blue refers

More information

Exercise with Gene Ontology - Cytoscape - BiNGO

Exercise with Gene Ontology - Cytoscape - BiNGO Exercise with Gene Ontology - Cytoscape - BiNGO This practical has material extracted from http://www.cbs.dtu.dk/chipcourse/exercises/ex_go/goexercise11.php In this exercise we will analyze microarray

More information

Medical Information Management & Mining. You Chen Jan,15, 2013 You.chen@vanderbilt.edu

Medical Information Management & Mining. You Chen Jan,15, 2013 You.chen@vanderbilt.edu Medical Information Management & Mining You Chen Jan,15, 2013 You.chen@vanderbilt.edu 1 Trees Building Materials Trees cannot be used to build a house directly. How can we transform trees to building materials?

More information

From Reads to Differentially Expressed Genes. The statistics of differential gene expression analysis using RNA-seq data

From Reads to Differentially Expressed Genes. The statistics of differential gene expression analysis using RNA-seq data From Reads to Differentially Expressed Genes The statistics of differential gene expression analysis using RNA-seq data experimental design data collection modeling statistical testing biological heterogeneity

More information

ProteinPilot Report for ProteinPilot Software

ProteinPilot Report for ProteinPilot Software ProteinPilot Report for ProteinPilot Software Detailed Analysis of Protein Identification / Quantitation Results Automatically Sean L Seymour, Christie Hunter SCIEX, USA Pow erful mass spectrometers like

More information

Introduction to Exploratory Data Analysis

Introduction to Exploratory Data Analysis Introduction to Exploratory Data Analysis A SpaceStat Software Tutorial Copyright 2013, BioMedware, Inc. (www.biomedware.com). All rights reserved. SpaceStat and BioMedware are trademarks of BioMedware,

More information

RNAseq / ChipSeq / Methylseq and personalized genomics

RNAseq / ChipSeq / Methylseq and personalized genomics RNAseq / ChipSeq / Methylseq and personalized genomics 7711 Lecture Subhajyo) De, PhD Division of Biomedical Informa)cs and Personalized Biomedicine, Department of Medicine University of Colorado School

More information

Technical Note. Roche Applied Science. No. LC 18/2004. Assay Formats for Use in Real-Time PCR

Technical Note. Roche Applied Science. No. LC 18/2004. Assay Formats for Use in Real-Time PCR Roche Applied Science Technical Note No. LC 18/2004 Purpose of this Note Assay Formats for Use in Real-Time PCR The LightCycler Instrument uses several detection channels to monitor the amplification of

More information

Current Motif Discovery Tools and their Limitations

Current Motif Discovery Tools and their Limitations Current Motif Discovery Tools and their Limitations Philipp Bucher SIB / CIG Workshop 3 October 2006 Trendy Concepts and Hypotheses Transcription regulatory elements act in a context-dependent manner.

More information

Formalin fixation at low temperature better preserves nucleic acid integrity. Gianni Bussolati. University of Turin

Formalin fixation at low temperature better preserves nucleic acid integrity. Gianni Bussolati. University of Turin Formalin fixation at low temperature better preserves nucleic acid integrity Gianni Bussolati University of Turin Disclosure of interests: G.B. was originally responsible for the invention of the Cold

More information

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 by Tan, Steinbach, Kumar 1 What is Cluster Analysis? Finding groups of objects such that the objects in a group will

More information

3D structure visualization and high quality imaging. Chimera

3D structure visualization and high quality imaging. Chimera 3D structure visualization and high quality imaging. Chimera Vincent Zoete 2008 Contact : vincent.zoete@isb sib.ch 1/27 Table of Contents Presentation of Chimera...3 Exercise 1...4 Loading a structure

More information

Interpreting Data in Normal Distributions

Interpreting Data in Normal Distributions Interpreting Data in Normal Distributions This curve is kind of a big deal. It shows the distribution of a set of test scores, the results of rolling a die a million times, the heights of people on Earth,

More information

Statistical Applications in Genetics and Molecular Biology

Statistical Applications in Genetics and Molecular Biology Statistical Applications in Genetics and Molecular Biology Volume 4, Issue 1 2005 Article 17 A General Framework for Weighted Gene Co-Expression Network Analysis Bin Zhang Steve Horvath Departments of

More information

Integrating DNA Motif Discovery and Genome-Wide Expression Analysis. Erin M. Conlon

Integrating DNA Motif Discovery and Genome-Wide Expression Analysis. Erin M. Conlon Integrating DNA Motif Discovery and Genome-Wide Expression Analysis Department of Mathematics and Statistics University of Massachusetts Amherst Statistics in Functional Genomics Workshop Ascona, Switzerland

More information

Discovery & Modeling of Genomic Regulatory Networks with Big Data

Discovery & Modeling of Genomic Regulatory Networks with Big Data Discovery & Modeling of Genomic Regulatory Networks with Big Data Hamid Bolouri Division of Human Biology Fred Hutchinson Cancer Research Center labs.fhcrc.org/bolouri I have no financial relationships

More information

Discovering Bioinformatics

Discovering Bioinformatics Discovering Bioinformatics Sami Khuri Natascha Khuri Alexander Picker Aidan Budd Sophie Chabanis-Davidson Julia Willingale-Theune English version ELLS European Learning Laboratory for the Life Sciences

More information

Visual Structure Analysis of Flow Charts in Patent Images

Visual Structure Analysis of Flow Charts in Patent Images Visual Structure Analysis of Flow Charts in Patent Images Roland Mörzinger, René Schuster, András Horti, and Georg Thallinger JOANNEUM RESEARCH Forschungsgesellschaft mbh DIGITAL - Institute for Information

More information

Interaktionen von RNAs und Proteinen

Interaktionen von RNAs und Proteinen Sonja Prohaska Computational EvoDevo Universitaet Leipzig June 9, 2015 Studying RNA-protein interactions Given: target protein known to bind to RNA problem: find binding partners and binding sites experimental

More information

PROC. CAIRO INTERNATIONAL BIOMEDICAL ENGINEERING CONFERENCE 2006 1. E-mail: msm_eng@k-space.org

PROC. CAIRO INTERNATIONAL BIOMEDICAL ENGINEERING CONFERENCE 2006 1. E-mail: msm_eng@k-space.org BIOINFTool: Bioinformatics and sequence data analysis in molecular biology using Matlab Mai S. Mabrouk 1, Marwa Hamdy 2, Marwa Mamdouh 2, Marwa Aboelfotoh 2,Yasser M. Kadah 2 1 Biomedical Engineering Department,

More information

Big Data Visualization for Genomics. Luca Vezzadini Kairos3D

Big Data Visualization for Genomics. Luca Vezzadini Kairos3D Big Data Visualization for Genomics Luca Vezzadini Kairos3D Why GenomeCruzer? The amount of data for DNA sequencing is growing Modern hardware produces billions of values per sample Scientists need to

More information

IGV Hands-on Exercise: UI basics and data integration

IGV Hands-on Exercise: UI basics and data integration IGV Hands-on Exercise: UI basics and data integration Verhaak, R.G. et al. Integrated Genomic Analysis Identifies Clinically Relevant Subtypes of Glioblastoma Characterized by Abnormalities in PDGFRA,

More information

STATISTICA Formula Guide: Logistic Regression. Table of Contents

STATISTICA Formula Guide: Logistic Regression. Table of Contents : Table of Contents... 1 Overview of Model... 1 Dispersion... 2 Parameterization... 3 Sigma-Restricted Model... 3 Overparameterized Model... 4 Reference Coding... 4 Model Summary (Summary Tab)... 5 Summary

More information

DESCRIPTIVE STATISTICS. The purpose of statistics is to condense raw data to make it easier to answer specific questions; test hypotheses.

DESCRIPTIVE STATISTICS. The purpose of statistics is to condense raw data to make it easier to answer specific questions; test hypotheses. DESCRIPTIVE STATISTICS The purpose of statistics is to condense raw data to make it easier to answer specific questions; test hypotheses. DESCRIPTIVE VS. INFERENTIAL STATISTICS Descriptive To organize,

More information

CNV Univariate Analysis Tutorial

CNV Univariate Analysis Tutorial CNV Univariate Analysis Tutorial Release 8.1 Golden Helix, Inc. March 18, 2014 Contents 1. Overview 2 2. CNAM Optimal Segmenting 4 A. Performing CNAM Optimal Segmenting..................................

More information