Bioinformatics in LC-MS based Proteomics and Glycomics




Kevin Minkun Wang
Ressom Lab, Dept. of Oncology, Georgetown University
CBIL, Dept. of Electrical & Computer Engineering, Virginia Tech
BIST-532, 2015 Fall

Kevin M. Wang (Washington D.C., USA), Ressom Lab & CBIL, BIST-532, 2015 Fall

Outline
1. Introduction
   - Proteomics and Glycomics
   - LC-MS
2. LC-MS based Proteomic and Glycomic Data Analysis
   - Challenges
   - Pipeline and Software Tools
3. Application
   - Biomarker Discovery in Cancer Studies
   - Integrative Analysis of Proteomics and Glycomics
   - Computational Purification of LC-MS Data Using a Topic Model

1. Introduction: Proteomics and Glycomics

Biomolecules and Omics Cascade

General idea for measurement of biomolecules
High-Performance Liquid Chromatography - Mass Spectrometry (HPLC-MS): biomolecules → elute → ionize → measurement → spectrum
Reference: Oliver Kohlbacher and Sven Nahnsen. Computational Proteomics and Metabolomics, University of Tübingen.

Proteomics

Protein and Peptide

Protein and PTMs

Glycans
Glycosylation is one of the most common PTMs of proteins.

1. Introduction: LC-MS

How to measure them
Modern proteomics and glycomics studies are based on liquid chromatography (LC) coupled with mass spectrometry (MS).

Instrument: LC-MS

LC-MS: Liquid Chromatography (LC)

LC-MS: Mass Spectrometry (MS)
- Mass spectrometry (MS) is an analytical technique that measures the mass (more precisely, the mass-to-charge ratio, m/z) of an analyte.
- MS has a long history in physics and chemistry and is today the key technology in multiomics, including proteomics, glycomics, and metabolomics.
- Soft ionization methods enable its application in the biosciences.
- For omics analyses, MS is usually coupled to a second separation technique (e.g., LC for proteomics and glycomics, and LC/GC for metabolomics).
- There are various types of mass spectrometers.
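To make the mass-to-charge ratio concrete, here is a minimal sketch that converts between a neutral mass and the m/z of a protonated ion [M + zH]^z+, as commonly observed with soft ionization. The function names and the example mass are illustrative assumptions and do not come from the slides; only the proton mass is a standard constant.

```python
PROTON_MASS = 1.007276  # Da, mass of one proton (standard constant)


def mz_of_protonated_ion(neutral_mass: float, charge: int) -> float:
    """Return m/z of an ion formed by adding `charge` protons to a neutral analyte."""
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    return (neutral_mass + charge * PROTON_MASS) / charge


def neutral_mass_from_mz(mz: float, charge: int) -> float:
    """Invert the relation above: recover the neutral mass from m/z and charge."""
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    return charge * (mz - PROTON_MASS)


if __name__ == "__main__":
    # Hypothetical peptide of 1000 Da observed at charge states 1 to 3.
    for z in (1, 2, 3):
        print(z, round(mz_of_protonated_ion(1000.0, z), 4))
```

The same analyte therefore appears at several m/z values, one per charge state, which is why charge assignment matters in the preprocessing steps discussed later.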

Various Mass Spectrometers

What does LC-MS raw data look like?

2. LC-MS based Proteomic and Glycomic Data Analysis: Challenges

Challenges in Data Analysis

Challenges
- Big data sets (up to TBs per experiment)
- Ambiguity in protein inference
- Variance across samples (RT correction, normalization)
- Assignment of glycan structures
- Ambiguity of masses for small molecules

Upstream
- Deconvolution/deisotoping
- Feature finding and peak detection
- Retention time alignment
- Peptide/protein identification and quantification
- Adducts clustering

Downstream
- Statistical analysis
- Gene Ontology analysis
- Pathway analysis
- Multiomic data integration

Feature Finding
- Identify ion peaks.
- Integrate ion peaks to sticks.
- Deisotope: map a cluster of sticks back to the monoisotopic ion (reduces redundant spectra).
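The deisotoping step can be sketched as follows. This is a deliberately simplified, greedy illustration, not the method used by any tool named in these slides: it only checks that neighboring sticks are spaced by about 1.003355/z in m/z (the approximate 13C-12C mass difference divided by the charge) and ignores isotope intensity patterns. The function name, the tolerance, and the maximum charge are assumptions.

```python
ISOTOPE_SPACING = 1.003355  # Da, approximate 13C - 12C mass difference


def deisotope(sticks, max_charge=4, tol=0.01):
    """Group (mz, intensity) sticks of one spectrum into isotope clusters.

    Returns a list of (monoisotopic_mz, charge, summed_intensity).
    Sticks that fit no cluster are reported with charge 0 (unknown).
    """
    peaks = sorted(sticks)
    used = [False] * len(peaks)
    features = []
    for i, (mz, intensity) in enumerate(peaks):
        if used[i]:
            continue
        best = None  # (charge, indices of sticks in the cluster)
        for z in range(max_charge, 0, -1):
            chain = [i]
            expected = mz + ISOTOPE_SPACING / z
            for j in range(i + 1, len(peaks)):
                if used[j]:
                    continue
                if abs(peaks[j][0] - expected) <= tol:
                    chain.append(j)
                    expected = peaks[j][0] + ISOTOPE_SPACING / z
                elif peaks[j][0] > expected + tol:
                    break  # sorted input: no later stick can match
            # Keep the charge state that explains the most sticks.
            if len(chain) >= 2 and (best is None or len(chain) > len(best[1])):
                best = (z, chain)
        if best is None:
            used[i] = True
            features.append((mz, 0, intensity))
        else:
            z, chain = best
            for j in chain:
                used[j] = True
            features.append((mz, z, sum(peaks[j][1] for j in chain)))
    return features


if __name__ == "__main__":
    spectrum = [(500.0, 100), (500.5017, 60), (501.0034, 20), (620.3, 50)]
    print(deisotope(spectrum))
```

Collapsing each cluster to one monoisotopic entry is what reduces the redundancy mentioned above; production tools additionally validate the intensity envelope of the cluster.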

Quantification
- How to define a feature? Ion? Spectrum? Peak? Group of peaks?
- How to define volume? Ion counts? Spectrum counts? Area under curve?
- It depends on which level you are investigating and on the order of the preprocessing steps (identification first or quantification first).
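Two of the volume definitions listed above, ion counts and area under the curve, can be compared on a single elution profile. The sketch below uses trapezoidal integration for the area; the function names and the toy profile are assumptions for illustration only.

```python
def total_ion_count(intensity):
    """Volume as the plain sum of intensities over the elution profile."""
    return sum(intensity)


def area_under_curve(rt, intensity):
    """Volume as the trapezoidal area of intensity over retention time."""
    if len(rt) != len(intensity):
        raise ValueError("rt and intensity must have the same length")
    area = 0.0
    for k in range(1, len(rt)):
        area += 0.5 * (intensity[k] + intensity[k - 1]) * (rt[k] - rt[k - 1])
    return area


if __name__ == "__main__":
    rt = [0.0, 1.0, 2.0, 3.0]          # retention time (arbitrary units)
    intensity = [0.0, 10.0, 10.0, 0.0]  # toy elution profile
    print(total_ion_count(intensity), area_under_curve(rt, intensity))
```

Unlike the plain sum, the area accounts for uneven scan spacing along the retention time axis.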

Feature Model
- Typically, after deisotoping, the 3D signal can be simplified into a 2D elution profile peak.
- How to mathematically model the profile and identify the boundaries of the peak?
- How to deal with contamination and overlapping peaks?
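One simple way to model a 2D elution profile is to assume a Gaussian shape and estimate its center and width from intensity-weighted moments, then place the peak boundaries a fixed number of standard deviations from the center. The slides do not say which peak model is used, so the Gaussian assumption, the function name, and the 3-sigma boundary are all illustrative choices; overlapping peaks would need a mixture model instead.

```python
import math


def gaussian_peak_summary(rt, intensity, n_sigma=3.0):
    """Estimate center, width, and boundaries of a single elution peak.

    Uses intensity-weighted moments, which match a Gaussian fit only for
    clean, isolated, roughly symmetric peaks.
    """
    total = sum(intensity)
    if total <= 0:
        raise ValueError("profile has no signal")
    center = sum(t * i for t, i in zip(rt, intensity)) / total
    variance = sum(i * (t - center) ** 2 for t, i in zip(rt, intensity)) / total
    sigma = math.sqrt(variance)
    boundaries = (center - n_sigma * sigma, center + n_sigma * sigma)
    return center, sigma, boundaries


if __name__ == "__main__":
    rt = [1.0, 2.0, 3.0, 4.0, 5.0]
    intensity = [1.0, 4.0, 6.0, 4.0, 1.0]
    print(gaussian_peak_summary(rt, intensity))
```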

Alignment
- Use a feature as the unit or a spectrum as the unit?
- It depends on the order of the preprocessing steps (peak detection first or alignment first).
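For feature-level alignment, a minimal sketch is a linear retention time correction fitted on landmark features (for example, internal standards) that were matched between a run and a reference run. This is a swapped-in, much simpler technique than the Bayesian alignment model cited in the references; the function names and example values are assumptions.

```python
def fit_linear_rt_map(run_rt, ref_rt):
    """Least-squares fit of ref_rt ~ slope * run_rt + intercept on matched landmarks."""
    n = len(run_rt)
    if n != len(ref_rt) or n < 2:
        raise ValueError("need at least two matched landmark pairs")
    mean_x = sum(run_rt) / n
    mean_y = sum(ref_rt) / n
    sxx = sum((x - mean_x) ** 2 for x in run_rt)
    if sxx == 0:
        raise ValueError("landmark retention times must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(run_rt, ref_rt))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def align(rts, slope, intercept):
    """Map retention times of one run onto the reference time axis."""
    return [slope * t + intercept for t in rts]


if __name__ == "__main__":
    slope, intercept = fit_linear_rt_map([10.0, 20.0, 30.0], [12.0, 22.0, 32.0])
    print(align([15.0, 25.0], slope, intercept))
```

Real LC drift is often nonlinear, so a linear map is only a first approximation; spectrum-level alignment would instead warp whole chromatograms before peak detection.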

Peptide Identification

Protein Inference

Glycomic Data Preprocessing Pipeline
- Extracted ion chromatogram (EIC): two usages, quantification and clustering.
- Main topic: feature clustering.

Peak Detection and Quantification

Glycan Identification/Annotation

2. LC-MS based Proteomic and Glycomic Data Analysis: Pipeline and Software Tools

Incomplete List of Tools

Proteomics:
- Scaffold (spectral counts)
- MaxQuant (area under curve)
- Mascot, Andromeda (search engines)
- Skyline, MRMer (targeted analysis)

Glycomics:
- Glycan Profile Annotation (Ressom Lab, including adducts clustering)
- MultiGlycan (Indiana University)
- GlycReSoft (Boston University)

3. Application: Biomarker Discovery in Cancer Studies

LC-MS based analysis of N-glycans in Liver Cancer

Candidate N-Glycan Biomarkers

Identification and Differential Analysis

Other Observations

LC-MS based analysis of proteins in Liver Cancer

Biomarker Discovery and Downstream Analysis

3. Application: Integrative Analysis of Proteomics and Glycomics

Integrative Analysis: SVM-RFE

Classification Accuracy

Integration: Network Analysis

Network Analysis: Intra Association

Network Analysis: Inter Association

3. Application: Computational Purification of LC-MS Data Using a Topic Model

Challenges: sample heterogeneity in biomarker discovery
- Specimens (e.g., tumor tissues) are typically mixtures of cells with distinct states and types, and usually only part of the constituents is relevant to the biological question of interest.
- In some cancer studies, heterogeneity is due to the co-existence of multiple cancerous subtypes.
- The proportion of cancerous, other disease-related, and healthy components varies across individual samples preselected using pathological estimates.
- Experimental purification methods: costly and time-consuming.
- Computational purification methods: inexpensive and efficient to implement. They can be applied to data already generated, without any modification of the experimental procedures.

Probabilistic modeling: hypothesize a way to generate the heterogeneous profiles

Terminology
- $\{t_d\}, d = 1, \dots, D$: expression profile of a heterogeneous sample. [Observed]
- $\{\gamma_d\}, d = 1, \dots, D$: true/pure cancerous origin. [Latent]
- $\{\beta_m\}, m = 1, \dots, M$: non-cancerous contaminants (unwanted sources). [Observed]
- $\{\theta_d\}, d = 1, \dots, D$: sample-specific mixture proportion. [Latent]
- $\{z_{d,n}\}$: source indicator for each ion in each sample. [Latent]
- $\gamma$: average cancer origin (whole-collection level). [Latent]
- $\alpha, \eta, \kappa$: hyperparameters of Dirichlet priors.

Mixture of multiple sources: $\beta, \gamma, \theta \Rightarrow t$

Intuitive explanation
- $t$ is treated as an article with $N$ words, representing $N$ measured ions.
- $\{\gamma_d\}$ and $\{\beta_m\}$ play the role of underlying topics in generating each article in the corpus.
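The article-and-topics analogy can be illustrated with a small generative sketch: for each of the N ions in a sample, a source indicator z is drawn from the mixture proportion θ, and the ion's biomolecule is then drawn from that source's distribution (the cancerous origin γ or one of the contaminants β). This is a simplified assumption of how such a process might look. It omits the Dirichlet priors and is not the authors' implementation; the function name and the toy distributions are hypothetical.

```python
import random


def generate_profile(theta, sources, n_ions, rng):
    """Simulate ion counts per biomolecule for one heterogeneous sample.

    theta   : mixture proportions over the sources (sums to 1)
    sources : list of probability distributions over biomolecules,
              e.g. [gamma_d, beta_1, ..., beta_M]
    n_ions  : number of measured ions (the "words" of the article)
    rng     : random.Random instance, for reproducibility
    """
    n_molecules = len(sources[0])
    counts = [0] * n_molecules
    for _ in range(n_ions):
        # Source indicator z: which source generated this ion.
        z = rng.choices(range(len(sources)), weights=theta)[0]
        # Biomolecule identity drawn from the chosen source.
        w = rng.choices(range(n_molecules), weights=sources[z])[0]
        counts[w] += 1
    return counts


if __name__ == "__main__":
    gamma_d = [0.6, 0.3, 0.1]   # hypothetical pure cancer profile
    beta_1 = [0.1, 0.3, 0.6]    # hypothetical contaminant profile
    rng = random.Random(0)
    print(generate_profile([0.7, 0.3], [gamma_d, beta_1], 1000, rng))
```

Purification is the inverse problem: given the observed counts and the contaminant profiles, infer θ and the pure profile.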

Probabilistic modeling: hypothesize a way to generate the heterogeneous profiles

Three assumptions in our study
1. $\{\beta_m\}$ and $t_d$: the source contaminants in each expression profile $t_d$ come from the control group $\{\beta_m\}, m = 1, \dots, M$. It has been observed that the cancerous tissues within tumor samples are typically surrounded by adjacent non-cancerous tissues.
2. $\gamma$ and $\gamma_d$: the cancerous origins $\{\gamma_d\}, d = 1, \dots, D$ share an average cancer profile $\gamma$. Each individual cancerous profile can be treated as a noisy version of the average cancer profile in the same group (i.e., the HCC group).
3. $\{\beta_m\}$ and $\gamma$: the average cancer profile has patterns similar to those of the non-cancerous profiles, except for some sites (biomolecules) that are differentially expressed between the case and control groups. This holds within the same cohort.

Mathematically, $\{\beta_m\}$, $\gamma$, and $\{\gamma_d\}$ represent multinomial (probabilistic) distributions over the vocabulary (biomolecules).

Probabilistic modeling: Topic Model
- A 3-level generative probabilistic model derived from latent Dirichlet allocation (LDA).
- (1) Deterministic → stochastic; (2) frequentist → Bayesian.
- Inference and estimation: maximize the complete (joint) likelihood function via a variational expectation-maximization (variational EM) algorithm.
Wang et al., IEEE BIBM 2015 (under review)

Purification on LC-MS profiles: simulation
A set of synthetic data is generated by artificially mixing real homogeneous LC-MS data.

Estimated mixture proportions $\hat{\theta}$
- Top: radar charts with 10 spokes, each representing a source in the topic panel. The proportion of each source is indicated by the length of the colored line.
- Bottom: scatter plots of the corresponding proportions in $\hat{\theta}$ (estimated) and $\theta$ (true).

Inferred pure cancer profile $\gamma$
Scatter plots of unpurified and purified cancer profiles vs. true cancer profiles.

Distance between estimation and ground truth

Define the estimation error ratio for a single sample:

$$\zeta_d(\hat{\theta}, \theta) = \frac{\lVert \hat{\theta}_d - \theta_d \rVert_1}{\lVert \theta_d \rVert_1} \times 100\%, \quad d = 1, \dots, 30$$

Results:
- $\zeta_d(\hat{\theta}, \theta) = 2.33\%$
- $\zeta_d(\hat{\gamma}, \gamma) = 6.51\% < \zeta_d(t, \gamma) = 16.57\%$
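The error ratio can be computed in a few lines. The sketch below reads the formula as the L1 norm of the difference between estimate and ground truth, divided by the L1 norm of the ground truth, expressed as a percentage; that reading of the slide's formula and the function name are assumptions.

```python
def error_ratio(estimate, truth):
    """Relative L1 error (in percent) between an estimated and a true vector."""
    if len(estimate) != len(truth):
        raise ValueError("vectors must have the same length")
    denominator = sum(abs(t) for t in truth)
    if denominator == 0:
        raise ValueError("ground truth must have non-zero L1 norm")
    numerator = sum(abs(e - t) for e, t in zip(estimate, truth))
    return 100.0 * numerator / denominator


if __name__ == "__main__":
    theta_true = [0.4, 0.6]  # hypothetical true mixture proportions
    theta_hat = [0.5, 0.5]   # hypothetical estimate
    print(error_ratio(theta_hat, theta_true))
```

The same function applies to profiles: comparing the unpurified profile t and the purified estimate against the true cancer profile shows how much purification reduces the error.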

Purification on LC/GC-MS profiles: experimental data

116 LC-MS based serum proteomic profiles
- All 116 patients were diagnosed with liver cirrhosis, and 57 of them developed HCC.
- It is not clear how the development of a tumor in the liver directly affects alterations in blood. We hypothesize that cirrhotic constituents contribute to the HCC profile in serum. The contamination may occur in an indirect way.
- 101 proteins were quantified through LC-MRM-MS.

15 GC-MS based tissue metabolomic profiles
- 15 liver tissues were collected from 10 participants in the pilot project.
- 559 metabolites were identified and quantified after preprocessing the GC-MS raw data.

Experimental Data
- CTSA samples: 105 GC-MS based tissue metabolomic profiles
- 10 patients with HCC (10 HCC-C vs. ADJ-C)
- 25 independent patients with cirrhosis
- 30 patients with HCC (30 HCC-N vs. ADJ-N)
- 726 metabolites were identified and quantified after preprocessing the GC-MS raw data. Intensities were normalized based on the measurements of extracted proteins. Missing values were imputed with one sixth of the minimum value in each group.
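The missing-value rule (one sixth of the minimum value in each group) can be sketched as below. The slide does not say whether the minimum is taken per metabolite or over the whole group, so this sketch assumes the input holds one metabolite's intensities within one group, with `None` marking missing values; the function name is hypothetical.

```python
def impute_with_min_fraction(values, divisor=6):
    """Replace missing entries (None) with min(observed) / divisor."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a group with no observed values")
    fill = min(observed) / divisor
    return [fill if v is None else v for v in values]


if __name__ == "__main__":
    hcc_group = [6.0, None, 12.0]  # toy intensities of one metabolite
    print(impute_with_min_fraction(hcc_group))
```

Filling with a value below the observed minimum reflects the common assumption that intensities are missing because they fell under the detection limit.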

Performance: proteomic and pilot project
Data sets: LC-MS proteomics and GC-MS metabolomics.
(1) Treat independent cirrhotic profiles as contaminants of HCC profiles: 0 7 (FDR-adjusted p-value 0.05).
(2) Treat HCC profiles as contaminants of adjacent cirrhotic profiles: $\zeta(\psi, \beta) = 28.3\%$, $\zeta(\psi, \beta) = 24.9\%$.
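FDR-adjusted p-values such as those referred to above are often obtained with the Benjamini-Hochberg procedure. The slides do not state which FDR procedure was used, so the choice of Benjamini-Hochberg here is an assumption, as are the function name and the example p-values.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    raw = [0.01, 0.04, 0.03, 0.005]  # hypothetical per-biomolecule p-values
    adj = benjamini_hochberg(raw)
    significant = [i for i, p in enumerate(adj) if p < 0.05]
    print(adj, significant)
```

Counting the biomolecules whose adjusted p-value falls below 0.05, before and after purification, is one way to quantify a change in discrimination between groups.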

CTSA project: multiple comparisons (GC-MS metabolomics)
Purify (1) HCC-N with ADJ-N; (2) HCC-C with ADJ-C; (3) HCC-C with CIRR; (4) ADJ-C with HCC-C.

To Be Continued
- A topic model based inference method computationally addresses the heterogeneity issue in clinical samples analyzed by LC/GC-MS.
- The model gives a probabilistic explanation of the corpus of LC/GC-MS based profiles.
- Simulation demonstrated the model's capacity to estimate mixture proportions and retrieve the underlying pure cancer profile.
- Increased discrimination between case and control groups was observed.
- More biologically meaningful pathways were found.

Ongoing work for the next version
- Choose appropriate forms of regularization on the parameters to address the limitation due to small sample size.
- Add label information to give the model a prediction function.
- Apply the model to clustering of disease subtypes.

Thank you!

References
- Wang M, Yu G, Ressom HW (2015). Integrative analysis of LC-MS based glycomic and proteomic data. To appear in the Proceedings of the 37th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, Milano, Italy, August 25-29, 2015.
- Tsai TH, Tadesse MG, Di Poto C, Pannell LK, Mechref Y, Wang Y, Ressom HW (2013). Multi-profile Bayesian alignment model for LC-MS data analysis with integration of internal standards. Bioinformatics 29(21):2774-80. PMID: 24013927.
- Wang M, Yu G, Mechref Y, Ressom HW (2013). GPA: an algorithm for LC/MS based glycan profile annotation. Proceedings of the 2013 IEEE International Conference on Bioinformatics and Biomedicine Workshop (BIBMW), Shanghai, China, December 2013, pp. 16-22.
- Tsai TH, Song E, Zhu R, Di Poto C, Wang M, Luo Y, Varghese RS, Tadesse MG, Ziada DH, Desai CS, Shetty K, Mechref Y, Ressom HW (2015). LC-MS/MS based Serum Proteomics for Identification of Candidate Biomarkers for Hepatocellular Carcinoma. Proteomics 15(13):2369-2381. PMID: 25778709.
- Wang M, Tsai TH, Yu G, Ressom HW (2015). Purification of LC/GC-MS based Biomolecular Expression Profiles Using a Topic Model. To appear in the Proceedings of the 2015 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Washington D.C., USA, November 9-12, 2015.
- Tsai TH, Wang M, Di Poto C, Hu Y, Zhou S, Zhao Y, Varghese RS, Luo Y, Tadesse MG, Ziada DH, Desai CS, Shetty K, Mechref Y, Ressom HW (2014). LC-MS Profiling of N-Glycans Derived from Human Serum Samples for Biomarker Discovery in Hepatocellular Carcinoma. J Proteome Res. PMID: 25077556.
- Tsai TH, Wang M, Ressom HW (2015). Preprocessing and Analysis of LC-MS-Based Proteomic Data. In: Statistical Analysis in Proteomics (Methods in Molecular Biology), Editor: Klaus Jung, 1st Edition. ISBN-13: 978-1493931057.

Quiz/Homework. Good luck!

1. Which of the following information CANNOT be obtained after the preprocessing (i.e., deisotoping, peak detection, identification, and quantification) of LC-MS data? (2 pt)
   A. Retention time
   B. Compound mass
   C. Time of flight
   D. Charge states
   E. Intensity of glycans (if glycomics)
   F. Intensity of peptides (if proteomics)
   G. Intensity of proteins (if proteomics)
   H. Intensity of amino acids (if proteomics)

2. Which of the following monosaccharides constitute the core of N-glycans? (2 pt)
   A. N-acetyl galactosamine (GalNAc)
   B. Galactose (Gal)
   C. Neuraminic acid (NeuAc)
   D. N-acetyl glucosamine (GlcNAc)
   E. Fucose (Fuc)
   F. Mannose (Man)

3. What is targeted analysis (LC-MRM-MS) designed for? (2 pt)
   A. Accurate identification
   B. Accurate quantification
   C. Both A and B
   D. None of the above

4. Explain the reason for using Dirichlet priors for the multinomial distributions in the topic model. (See slide 59, "Probabilistic modeling: Topic Model"; 4 pt)