Un (bref) aperçu des méthodes et outils de fouilles et de visualisation de données «omics»



Similar documents
A brief introduction to Cytoscape

Quantitative proteomics background

Statistical Analysis. NBAF-B Metabolomics Masterclass. Mark Viant

Integrated Data Mining Strategy for Effective Metabolomic Data Analysis

OplAnalyzer: A Toolbox for MALDI-TOF Mass Spectrometry Data Analysis

Tutorial for proteome data analysis using the Perseus software platform

泛 用 蛋 白 質 體 學 之 質 譜 儀 資 料 分 析 平 台 的 建 立 與 應 用 Universal Mass Spectrometry Data Analysis Platform for Quantitative and Qualitative Proteomics

AGILENT S BIOINFORMATICS ANALYSIS SOFTWARE

La Protéomique : Etat de l art et perspectives

ProteinScape. Innovation with Integrity. Proteomics Data Analysis & Management. Mass Spectrometry

Using Ontologies in Proteus for Modeling Data Mining Analysis of Proteomics Experiments

The Open2Dprot Proteomics Project for n-dimensional Protein Expression Data Analysis

MRMPilot Software: Accelerating MRM Assay Development for Targeted Quantitative Proteomics

Application Note # LCMS-81 Introducing New Proteomics Acquisiton Strategies with the compact Towards the Universal Proteomics Acquisition Method

Preprocessing, Management, and Analysis of Mass Spectrometry Proteomics Data

Session 1. Course Presentation: Mass spectrometry-based proteomics for molecular and cellular biologists

Aiping Lu. Key Laboratory of System Biology Chinese Academic Society

Global and Discovery Proteomics Lecture Agenda

Data, Measurements, Features

MultiQuant Software 2.0 for Targeted Protein / Peptide Quantification

MarkerView Software for Metabolomic and Biomarker Profiling Analysis

Statistical Analysis Strategies for Shotgun Proteomics Data

ProteinPilot Report for ProteinPilot Software

Thermo Scientific SIEVE Software for Differential Expression Analysis

Introduction to mass spectrometry (MS) based proteomics and metabolomics

Tutorial for Proteomics Data Submission. Katalin F. Medzihradszky Robert J. Chalkley UCSF

Mass Spectra Alignments and their Significance

The Scheduled MRM Algorithm Enables Intelligent Use of Retention Time During Multiple Reaction Monitoring

Building a Collaborative Informatics Platform for Translational Research: Prof. Yike Guo Department of Computing Imperial College London

Increasing the Multiplexing of High Resolution Targeted Peptide Quantification Assays

In-Depth Qualitative Analysis of Complex Proteomic Samples Using High Quality MS/MS at Fast Acquisition Rates

A Streamlined Workflow for Untargeted Metabolomics

Protein Protein Interaction Networks

SELDI-TOF Mass Spectrometry Protein Data By Huong Thi Dieu La

Thermo Scientific PepFinder Software A New Paradigm for Peptide Mapping

BIOINF 585 Fall 2015 Machine Learning for Systems Biology & Clinical Informatics

Challenges in Computational Analysis of Mass Spectrometry Data for Proteomics

Functional Data Analysis of MALDI TOF Protein Spectra

Introduction to Proteomics 1.0

Already said. Already said. Outlook. Look at LC-MS data. A look at data for quantitative analysis using MSight and Phenyx. What data for quantitation?

Pep-Miner: A Novel Technology for Mass Spectrometry-Based Proteomics

Visualizing Networks: Cytoscape. Prat Thiru

Introduction to Proteomics

Data Mining and Machine Learning in Bioinformatics

Service courses for graduate students in degree programs other than the MS or PhD programs in Biostatistics.

Learning Objectives:

Alignment and Preprocessing for Data Analysis

Mascot Search Results FAQ

Effects of Intelligent Data Acquisition and Fast Laser Speed on Analysis of Complex Protein Digests

Search and Data Mining: Techniques. Applications Anya Yarygina Boris Novikov

Statistical issues in the analysis of microarray data

Research-grade Targeted Proteomics Assay Development: PRMs for PTM Studies with Skyline or, How I learned to ditch the triple quad and love the QE

Statistics for BIG data

Tutorial 9: SWATH data analysis in Skyline

AB SCIEX TOF/TOF 4800 PLUS SYSTEM. Cost effective flexibility for your core needs

InSyBio BioNets: Utmost efficiency in gene expression data and biological networks analysis

Introduction to Big Data Analytics p. 1 Big Data Overview p. 2 Data Structures p. 5 Analyst Perspective on Data Repositories p.

Final Project Report

University of Glasgow - Programme Structure Summary C1G MSc Bioinformatics, Polyomics and Systems Biology

Mass Frontier Version 7.0

Introduction to Data Mining

Is a Data Scientist the New Quant? Stuart Kozola MathWorks

BIOINF 525 Winter 2016 Foundations of Bioinformatics and Systems Biology

Protein Prospector and Ways of Calculating Expectation Values

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014

Introduction to Data Mining

WebFOCUS RStat. RStat. Predict the Future and Make Effective Decisions Today. WebFOCUS RStat

Pinpointing phosphorylation sites using Selected Reaction Monitoring and Skyline

2015 Workshops for Professors

Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data

DMPK: Experimentation & Data

Supervised Feature Selection & Unsupervised Dimensionality Reduction

Maschinelles Lernen mit MATLAB

Building innovative drug discovery alliances. Evotec Munich. Quantitative Proteomics to Support the Discovery & Development of Targeted Drugs

BBSRC TECHNOLOGY STRATEGY: TECHNOLOGIES NEEDED BY RESEARCH KNOWLEDGE PROVIDERS

NEURAL NETWORKS IN DATA MINING

DeCyder Extended Data Analysis module Version 1.0

Statistics Graduate Courses

PeptidomicsDB: a new platform for sharing MS/MS data.

Dr Alexander Henzing

University of Manchester Health Data Science Masters Modules

Analyzing Research Data Using Excel

MassMatrix Web Server User Manual

life science data mining

Computational Drug Repositioning by Ranking and Integrating Multiple Data Sources

Lecture/Recitation Topic SMA 5303 L1 Sampling and statistical distributions

Vad är bioinformatik och varför behöver vi det i vården? a bioinformatician's perspectives

In this presentation, you will be introduced to data mining and the relationship with meaningful use.

Azure Machine Learning, SQL Data Mining and R

Medical Informatics II

Retrospective Analysis of a Host Cell Protein Perfect Storm: Identifying Immunogenic Proteins and Fixing the Problem

Machine Learning with MATLAB David Willingham Application Engineer

Choices, choices, choices... Which sequence database? Which modifications? What mass tolerance?

Introduction to Pattern Recognition

Advantages of the LTQ Orbitrap for Protein Identification in Complex Digests

BIOINFORMATICS Supporting competencies for the pharma industry

Agilent Metabolomics Workflow

Pentaho Data Mining Last Modified on January 22, 2007

Transcription:

Un (bref) aperçu des méthodes et outils de fouilles et de visualisation de données «omics» Workshop «Protéomique & Maladies rares» 25 th September 2012, Paris yves.vandenbrouck@cea.fr CEA Grenoble irtsv BGE (U1038)

A pressing need for: analyzing integration visualization browsing querying editing (annotation) F60-64 env13 ZT50 es4075 MaxEnt 3 8 [Ev-36586,It50,En1] (0.050,200.00,0.200,1400.00,2,Cmp) 1: TOF MSMS 631.30ES+ 100 % 0 (596.29) F G L (329.16) ymax 201.11 819.38 y2 312.12 330.13 187.12 b 762.36 932.44 1033.48 516.25 y1 y3 175.09 401.24 825.42 1259.57 M/z 200 400 600 800 1000 1200

Data mining: more than a field! From Fayyad, 1996

Data mining in the context of MS-based proteomics analysis Käll L, Vitek O. 2011 PLoS Comput Biol

Current challenges in software solutions for MS-based quantitative proteomics (Cappadona et al, 2012) Software usability Data reduction (MS and MS/MS data) => peptide <m/z, Rt, I> Feature detection (desiotoping, isobaric interference ) Noise rejection (random, chemical, contaminants) Retention time alignment (time shifts) Peptide identification (FDR, peptide modifications) => search engines and decoy strategies Normalization of peptide abundances Protein inference (crucial for accurate quantification) Protein quantification Statistical significance analysis and data mining

EDyP lab IT setup: workflow and automation epims http://www.grenoble.prabi.fr/protehome : software download area

Identification : spectrum peptide correspondence 519.79 1037.565448 1039.541016-1.975568 1 40.12 1 KASDVHEVR 519.79 1037.565448 1039.424011-1.858563 0 14.92 2 MTTDEGTGGR 519.79 1037.565448 1039.475632-1.910184 0 14.07 3 EGSFMSLNR 519.79 1037.565448 1035.618881 1.946567 1 10.8 4 RLQHLLEK 519.79 1037.565448 1038.480392-0.914944 0 10.55 4 FALACNASDK 519.79 1037.565448 1039.52327-1.957822 0 10.4 4 ADICVHLNR 519.79 1037.565448 1035.618866 1.946582 1 9.94 7 IRGLPPEVR 519.79 1037.565448 1037.623291-0.057843 1 9.44 8 IISPKVEPR 519.79 1037.565448 1037.525375 0.040073 0 9.3 8 LHPATETGGR 519.79 1037.565448 1038.654922-1.089474 2 8.91 10 KLVDKAPLR YESLTDPSK AMKYLASKK QNHLASLEK QYSSPAGDSK GFTGMLQSR FCKDVVSNK MKCLGQSKK HLKEGARTK YTNLMRPK ASEGSDSGSDK KPYM_HUMAN Q5T9W 7_HUMAN Q5CAQ6_HUMAN ENOA_HUMAN Q53G99_HUMAN PLSL_HUMAN TBB2_HUMAN PGK1_HUMAN G3P2_HUMAN PDIA1_HUMAN TBA6_HUMAN ACTA_HUMAN P1 P3 pe6 pe1 pe3 pe8 pe4 pe2 pe5 pe7 P11 P2 P2 P31

From peptides identification to protein validation: IRMa Filtering rules Score/rank Mass tolerance Peptides groups Significant, duplicated, ambiguous Expertise Result visualisation Spectra list Proteins list spectrum/peptide corr. FDR computation Open-source software http://www.grenoble.prabi.fr/protehome Dupierris et al. Bioinformatics, 2009

Mass spectrometry based measurements Käll L, Vitek O. 2011 PLoS Comput Biol Spectral counting (# of MS/MS identifications assigned to protein) vs. intensity-based (summarizing signals from spectral peaks) A diverse range of methods has been proposed and there is no single generally accepted procedure today. => Depending on your sample & your biological question, expertise is strongly required

Data preprocessing for quantification: role of normalization Changes in relative peptides abundance may reflect not just true biological differences but also systematic bias and random noise From F. Combes et al Samples-based (Global normalization) or control-based (e.g. spiked-in internal standards)

Peptide and protein significance analysis: Statistical analysis and experimental design are closely intertwined! 2 groups >2 groups Independant samples t-test Mann-Whitney ANOVA Kruskall-Wallis Non-independant (paired samples) t-student Wilcoxon ANOVA Friedman ANOVA Assumptions and good practice: Independence of samples Normality (not essential; transformation) Variance homogeneity (=homoscedasticity); essential but rarely justified ANOVA does not maintain a significance level (p-val) against comparisons in multiple dimensions => p-value adjustements are required (e.g. FDR)

The DECanBIo project: bladder cancer candidates biomarkers discovery (FP7 EU funded coord. J. Garin) 98 patients in discovery sub-cohort Label-free LC-MS analysis Association of 2 statistical methods (peptide and protein-centric) Healthy vs. controls C. Masselon et al., HUPO 2011 25 19 13 Spectral lndex permutations Mixed-ANOVA Clough et al., 2009, JPR Fu et al., 2008, JPR

Data mining : various methods Descriptive (unsupervised) => discovery factorial analysis, automatic classification (clustering), association rules Predictive (supervised) => inference qualitative variable discriminant analysis (logistic regression), decision trees, neural networks quantitative variable linear regression (simple & multiple), ANOVA, MANOVA, ANCOVA, GLM, decision tree How to choose the right method? Two criteria: accuracy - robustness

The challenge of defining «valid» proteomic biomarkers Is the change (frequency or abundance) of a certain molecule observed in a proteomics study of disease, the result of the disease, or does it merely reflect an artefact due to technical variability in the pre-analytical steps or in the analysis, biological variability, or bias introduced in the study (e.g. due to lifestyle, age, and gender)? How should we estimate the number of samples required for the definition of likely valid biomarkers? Which algorithms can be employed to combine biomarkers into a multi-marker classifier, and how can the validity of a multi-marker classifier be assessed? Is validation in an independent test set necessary? Dakna et al, 2010 BMC bioinformatics

Data integration and visualization Protein phenotype: are essential proteins more connected? Lethal Viable Jeong et al, 2001, Nature revisited by C. Hermann (TAGC, Luminy, FR)

Network analysis platform Cline et al, Nature Protocols 2007

http://www.cytoscape.org

Visualizing biological data now and in the future: possible integrated visualization environment Many challenges: Large and complex dataset Multiscale representation and navigation Visual analytics Innovative representation O'Donoghue Nat.Meth.S7, S2 - S4 (2010) => «the revolution in biological data visualization hasn t yet started!»

Take home message «A good statistical analysis will not correct a poor experimental design» (O. vitek) «Valid proteomic biomarkers for diagnosis and prognosis only can be defined by applying proper statistical data mining procedures». => Embedded statisticians from the beginninng Data mining techniques: apply the «simplest model» rule; assessment in an independent dataset is essential (or cross-validation) Omics data are largely undermined (unified repositories, knowledgebase, workbench user-friendly, a room for the improvement of methods for optimal design and data mining )

Acknowledgements: EDyP lab (C. Bruley) part of the BGE unit (U1038 J. Garin) IBISA Proteomics platform (ISO9001 certified): yohann.coute@cea.fr PROteomics French Infrastructure (PROFI) Member of PRIME-XS (FP7, European Proteomics Research Infrastructure)

Further reading and useful links Bibliography: -Cappadona S, Baker PR, Cutillas PR, Heck AJ, van Breukelen B. Current challenges in software solutions for mass spectrometry-based quantitative proteomics. Amino Acids. 2012. 43:1087-108. - Käll L, Vitek O. Computational mass spectrometry-based proteomics. PLoS Comput Biol. 2011. 7:e1002277. - Somorjai RL, Dolenko B, Baumgartner R. Class prediction and discovery using gene microarray and proteomics mass spectroscopy data: curses, caveats, cautions. Bioinformatics. 2003. 19:1484-91 - Special issue in Nature Methods, March 2010, Volume 7 No 3s pps1-s68 Open-source system for data mining tools: R community: http://cran.r-project.org/ Bioconducor: http://www.bioconductor.org/ RapidMiner: http://rapid-i.com/content/view/181/196/ «Omics» data visualization, integration: Cytoscape: http://www.cytoscape.org