Medical Informatics II




Zlatko Trajanoski
Institute for Genomics and Bioinformatics, Graz University of Technology
http://genome.tugraz.at
zlatko.trajanoski@tugraz.at

Medical Informatics II
  Introduction
  Computational Methods
  Support Vector Machines
  Principal Component Analysis
  Decision Trees
  Neural Networks
  Self-Organizing Maps
  (Survival Analysis)

Medical Informatics Research
  Data Integration
  Cancer Diagnostics
    Colorectal Cancer
    Ovarian Cancer

Data Integration
  Levels of data: DNA, RNA, Protein, Cell, Tissue, Organ, Organism, Population

Data Integration
  DNA: human genome, 3,000,000,000 nucleotides
  RNA: 21,000 genes, n conditions
  Protein: 21,000 genes, 100,000 gene products, 1,000,000 proteins, n conditions
  Cell: 320 cell types; k genes, l proteins, m metabolites, n conditions
  Tissue, Organ, Organism, Population

  New technologies for gene and protein expression profiling
    RNA: Microarrays
    Protein: MALDI-TOF, LC-MS/MS
    Tissue: Tissue microarrays
  Complementary technologies; the real value lies in integrating diverse datasets
  Data management and analysis?

Data Integration: Drowning in data, starving for information?
  Microarray data (n=1): Affymetrix HG U133A2 chip
    Raw data: 80 MB per sample (incl. TIFF)
    MAGE-ML: 30 MB
    Normalized data: 5-10 MB (Excel table or text file)
  Proteomics data (n=1): Kislinger et al., Cell 2006, 125:173-186; one organ (heart), one organelle (cytosol)
    Raw data: 1.55 GB (mzXML format)
    Sequest search folders: 235 MB
    Results in PRIDE format: 320 MB
    Results incl. protein sequences: 374 KB

Data Integration (n>100)
  Genomics data (SNPs)
  Expression data (Microarrays)
  Proteomics data (LC-MS/MS)
  Phenotype data (Clinical parameters)
  Pharmacology data (Pharmacokinetics/dynamics)
  Medical Images (CT, MR, PET, Ultrasound)
  Literature data (PubMed, Cochrane)
  ...

  Centralized database? A Herculean task:
    Few standards
    System incompatibilities
    Organizational issues
    Specific requirements in specific institutions

Data Integration
  [Diagram: a database with analytical tools at each data level (RNA, Protein, Cell)]
  Use standards like MIAME, MIAPE, ICD-O (as good as it gets)
  Use state-of-the-art software technology: a three-tier architecture with database layer (Oracle), middle layer (J2EE), and presentation layer
  Build interfaces for analytical tools

Methods
  Java web-based platform
  Database backend (Oracle, PostgreSQL, MySQL)
  Java 2 Enterprise Edition and Struts
  AndroMDA for code generation
  JClusterService (Computing Cluster)

Lessons we learned
  Data management systems: imperative!
  De-centralized databases for primary data and preprocessing
  Centralized database for processed data
  Issues
    Software/database development
    Changes of software technology
    Database maintenance

Case Study: Cancer Immunology
  Role of the immune system in early metastasis in colorectal cancer?

Database for Clinical and Genomic Data
  Patients with colorectal cancer (n>1000)
    Clinical data (n>1000)
    qPCR of 500 genes (n>100)
    qPCR of 150 miRNA
    T cell repertoire of 500 parameters (n>50)
    Phenotypic analysis of 410 parameters using FACS (n=50)
    Tissue microarrays (n>500)
  Dedicated database for clinical and genomic data (http://tme.tugraz.at)
  Tools: R, Genesis*, Cytoscape**, ARACNE***
    *Sturn et al., Bioinformatics 2002
    **Shannon et al., Genome Res 2003
    ***Basso et al., Nat Genet 2005

Phenotypes of Tumor-Infiltrating Immune Cells
  Significantly different markers between invasion-positive (VELIPI+) and invasion-negative (VELIPI-) patients
  VELIPI: vascular emboli (VE), lymphatic invasion (LI), perineural invasion (PI)
  [Figure: color scale from min. expression to max. expression]
  Pagès et al., N Engl J Med, 353:2654-2666, 2005

Effector Memory T-cells and Survival
  Disease-free and overall survival of CD45RO-hi patients
  Tissue MicroArray (TMA) analysis of CD45RO (n=353 patients)
  [Figure, two panels: % Survival vs. Survival (months) and % Recurrence-Free vs. Disease Free Survival (months); curves for CD45RO-hi and CD45RO-lo, P<0.001 in both panels]
  Pagès et al., N Engl J Med, 353:2654-2666, 2005

Adaptive Immunity has a Beneficial Effect on Clinical Outcome
  Galon et al., Science, 313:1960-1964, 2006
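Survival curves of the kind shown for CD45RO-hi and CD45RO-lo patients are commonly drawn with the Kaplan-Meier (product-limit) estimator. The slides do not state which estimator was used, so the following is only a minimal sketch under that assumption. The function name and the toy follow-up times are hypothetical and are not taken from Pagès et al.

```python
def kaplan_meier(times, events):
    """Product-limit estimate of the survival function.

    times:  follow-up time of each patient (e.g. months)
    events: 1 if the event (death or recurrence) was observed, 0 if censored
    Returns a list of (time, survival probability), one entry per event time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)          # patients still under observation
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all patients sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= removed       # events and censored patients leave the risk set
    return curve


# Made-up toy cohort: five patients, two of them censored.
toy_times = [5, 10, 10, 20, 30]
toy_events = [1, 1, 0, 1, 0]
toy_curve = kaplan_meier(toy_times, toy_events)
```

Computing one such curve per group (for example CD45RO-hi and CD45RO-lo) gives the two step functions that a survival plot compares; a significance test such as the log-rank test would be a separate step.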

Combined Analysis of Tumor Regions Improves Prediction of Patient Survival
  Galon et al., Science, 313:1960-1964, 2006

Patient Stratification
  [Figure: patient groups labeled I, II, III, IV]
  Galon et al., Science, 313:1960-1964, 2006

Conclusion
  Adaptive immunity has a beneficial effect on clinical outcome in colorectal cancer
  Combined analysis of tumor regions improves prediction of patient survival

Feature Selection and Cancer Classification from Mass Spectrometry Data using Wavelet Analysis and Support Vector Machine

For Public Health
  Ovarian Cancer
    The traditional biomarker CA125 is not satisfactory
    High death rate
  Breast Cancer
  Prostate Cancer
  Written in Blood

Study datasets
  Provided by NCI, 2004:
    Low-resolution SELDI-TOF MS data (dimension=15,154; 91 controls vs 162 cancers)
    High-resolution SELDI-TOF MS data (dimension=373,401; 95 controls vs 121 cancers)
  Provided by the University of Innsbruck:
    MELDI-TOF-MS data (dimension=54,005; 213 controls vs 178 cancers)

Raw High-resolution MS Data
  Can you see any difference?

Binned MS Data
  Bin the original MS data with unit interval
  The dimension of the feature space is reduced from 373,401 to 11,301 (high-resolution data)
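Binning with a unit interval can be sketched as follows. The slide does not say how intensities falling into the same bin are combined, so summing them is an assumption here, and the function name and toy spectrum are hypothetical.

```python
import math
from collections import defaultdict


def bin_spectrum(mz, intensity, width=1.0):
    """Collapse a spectrum onto bins of `width` m/z units.

    Returns a dict mapping the lower edge of each occupied bin to the
    summed intensity inside it (summing is an assumption, see lead-in).
    """
    if len(mz) != len(intensity):
        raise ValueError("mz and intensity must have the same length")
    bins = defaultdict(float)
    for m, value in zip(mz, intensity):
        bins[math.floor(m / width) * width] += value
    return dict(sorted(bins.items()))


# Made-up toy spectrum with six raw m/z points.
toy_mz = [700.12, 700.48, 700.91, 701.05, 701.60, 703.33]
toy_intensity = [1.0, 2.0, 0.5, 4.0, 1.5, 3.0]
toy_binned = bin_spectrum(toy_mz, toy_intensity)
```

In this toy case six raw points collapse into three bins, which is the same kind of reduction the slide reports on a much larger scale.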

Strategies of Data Reduction
  Two-sample Kolmogorov-Smirnov goodness-of-fit test (KS-test for short, a nonparametric method): remove those m/z ratios at which the healthy and cancer samples have the same distribution
  Restriction of the coefficient of variation (for a positive random variable X, CV = sd(X)/E(X))
  Wavelet analysis (good at handling discontinuities and sharp spikes)
  Examples of the KS-test follow on the next slide

[Figures: KS-test on Raw Data; KS-test on Binned Data]
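The KS-based filter from the data reduction strategies can be sketched in plain Python. This is an illustration only: the p-value uses the asymptotic Kolmogorov series, which is an approximation for small samples, and the function names, the 5% default, and the toy data are assumptions rather than the authors' implementation.

```python
import math


def ks_statistic(a, b):
    """Largest absolute gap between the empirical CDFs of two samples."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


def ks_pvalue(d, n, m, terms=100):
    """Asymptotic two-sided p-value (approximate for small n and m)."""
    lam = d * math.sqrt(n * m / (n + m))
    if lam < 0.2:
        return 1.0  # the series converges slowly here; the true value is ~1
    s = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
            for k in range(1, terms + 1))
    return max(0.0, min(1.0, 2.0 * s))


def select_features(healthy, cancer, alpha=0.05):
    """Keep the feature indices whose distributions differ between groups."""
    kept = []
    for f in range(len(healthy[0])):
        h = [row[f] for row in healthy]
        c = [row[f] for row in cancer]
        if ks_pvalue(ks_statistic(h, c), len(h), len(c)) < alpha:
            kept.append(f)
    return kept


# Made-up data: feature 0 differs between groups, feature 1 does not.
toy_healthy = [[float(i), float(i % 4)] for i in range(8)]
toy_cancer = [[float(i + 10), float(i % 4)] for i in range(8)]
toy_kept = select_features(toy_healthy, toy_cancer)
```

Features whose p-value stays above the chosen level are treated as having the same distribution in both groups and are dropped, which is the removal rule stated on the slide.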

Restriction of CV
  Significance level 5%, t=0.4
  Dimension is reduced to 6,757

Discrete Wavelet Transformation
  Samples are separated, but overlap is still there and the data size is too big!
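Both steps named here can be illustrated with a short sketch. Two things are assumptions: the slide does not say which side of the threshold t is kept, so the direction is left as a parameter, and it does not name the wavelet family, so the Haar wavelet is used as the simplest example. All function names and toy values are hypothetical.

```python
import math
import statistics


def coefficient_of_variation(values):
    """CV = sd(X) / E(X), defined here for data with a positive mean."""
    mean = statistics.fmean(values)
    if mean <= 0:
        raise ValueError("CV requires a positive mean")
    return statistics.stdev(values) / mean


def cv_filter(columns, t=0.4, keep_below=True):
    """Indices of features whose CV lies on the chosen side of t."""
    kept = []
    for f, values in enumerate(columns):
        if (coefficient_of_variation(values) <= t) == keep_below:
            kept.append(f)
    return kept


def haar_step(signal):
    """One level of the orthonormal Haar transform: (approximation, detail)."""
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    s = math.sqrt(2.0)
    pairs = range(0, len(signal), 2)
    approx = [(signal[k] + signal[k + 1]) / s for k in pairs]
    detail = [(signal[k] - signal[k + 1]) / s for k in pairs]
    return approx, detail


def haar_dwt(signal, levels):
    """Multi-level Haar DWT: [approx_L, detail_L, ..., detail_1]."""
    approx = list(signal)
    details = []
    for _ in range(levels):
        approx, detail = haar_step(approx)
        details.insert(0, detail)
    return [approx] + details


# Made-up example: a single sharp spike in an otherwise flat signal.
toy_signal = [0.0, 0.0, 0.0, 0.0, 0.0, 8.0, 0.0, 0.0]
toy_coeffs = haar_dwt(toy_signal, levels=3)
```

The spike shows up as a few large detail coefficients while the rest are exactly zero, which is why wavelet coefficients are attractive as compact features for spectra with sharp peaks.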

Support Vector Machine
  [Figure: support vectors]

Other classifiers tested
  Support Vector Machine
  Trees (AD3, J48, Bayesian Trees)
  Neural Networks
  Voted Perceptron
  Nearest Neighbour
  Linear/Quadratic Discriminant Analysis
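To make the idea of a maximum-margin classifier concrete, here is a minimal linear SVM trained by batch subgradient descent on the regularized hinge loss. It is a teaching sketch under stated assumptions: the slides do not say which SVM implementation, kernel, or parameters were used, and the hyperparameters and toy points below are made up.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Minimize lam/2 * |w|^2 + mean(max(0, 1 - y * (w.x + b))).

    X: list of feature vectors, y: labels in {-1, +1}.
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1.0:  # only margin violators contribute
                for j in range(d):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Class label (+1 or -1) from the sign of the decision function."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


# Made-up, linearly separable toy data (+1 and -1 as the two classes).
toy_X = [(2.0, 2.0), (3.0, 3.0), (2.0, 3.0),
         (-2.0, -2.0), (-3.0, -3.0), (-2.0, -3.0)]
toy_y = [1, 1, 1, -1, -1, -1]
toy_w, toy_b = train_linear_svm(toy_X, toy_y)
```

The training points that end up on or inside the margin are the ones that keep shaping the decision boundary; they play the role of the support vectors shown in the figure.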

Some Results: Ovarian Cancer

  Method      2xv HEALTHY   2xv CANCER   10xv HEALTHY   10xv CANCER
  IBk         83.21         89.35        85.98          91.15
  AdaBoost    89.59         91.92        92.05          94.03
  VotedP      92.54         95.27        94.45          96.99
  SVM         93.30         97.38        94.06          98.19

Some Results: Prostate Cancer

  Method      2xv HEALTHY   2xv CANCER   10xv HEALTHY   10xv CANCER
  SVM 2,000   94.65         88.97        97.02          90.19
  SVM 3,500   93.39         85.97        95.52          88.50
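The following sketch shows how per-class figures like those in the tables could be produced. It assumes that 2xv and 10xv denote 2-fold and 10-fold cross-validation and that the HEALTHY and CANCER columns are percentages of correctly classified samples per class; neither is stated on the slides. A 1-nearest-neighbour rule stands in for the classifiers, stratified folds are a design choice of this sketch, and the data are made up.

```python
def stratified_folds(labels, k):
    """Deal the samples of each class round-robin into k folds."""
    folds = [[] for _ in range(k)]
    seen = {}
    for i, label in enumerate(labels):
        folds[seen.get(label, 0) % k].append(i)
        seen[label] = seen.get(label, 0) + 1
    return folds


def nearest_neighbour(train_X, train_y, x):
    """Label of the training point closest to x (squared Euclidean distance)."""
    best = min(range(len(train_X)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)))
    return train_y[best]


def cross_validate(X, y, k):
    """Per-class percentage of held-out samples classified correctly."""
    correct, total = {}, {}
    for fold in stratified_folds(y, k):
        held_out = set(fold)
        train = [i for i in range(len(X)) if i not in held_out]
        train_X = [X[i] for i in train]
        train_y = [y[i] for i in train]
        for i in fold:
            pred = nearest_neighbour(train_X, train_y, X[i])
            total[y[i]] = total.get(y[i], 0) + 1
            correct[y[i]] = correct.get(y[i], 0) + (pred == y[i])
    return {label: 100.0 * correct[label] / total[label] for label in total}


# Made-up, well-separated toy data: 10 samples per class.
toy_X = [[0.1 * i, 0.0] for i in range(10)] + [[10 + 0.1 * i, 1.0] for i in range(10)]
toy_y = ["HEALTHY"] * 10 + ["CANCER"] * 10
toy_2xv = cross_validate(toy_X, toy_y, k=2)
toy_10xv = cross_validate(toy_X, toy_y, k=10)
```

With fewer folds each model is trained on less data, which is one common reason why 2-fold figures come out lower than 10-fold figures on real datasets.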