Medical Informatics II
Zlatko Trajanoski
Institute for Genomics and Bioinformatics
Graz University of Technology
http://genome.tugraz.at
zlatko.trajanoski@tugraz.at

Medical Informatics II
Introduction
Computational Methods
  Support Vector Machines
  Principal Component Analysis
  Decision Trees
  Neural Networks
  Self-Organizing Maps
  (Survival Analysis)
Medical Informatics Research
Data Integration
Cancer Diagnostics
  Colorectal Cancer
  Ovarian Cancer

Data Integration
[Figure labels: DNA, RNA, Protein, Cell, DATA, Tissue, Organ, Organism, Population]
Data Integration
[Figure labels: DNA, RNA, Protein, Cell, Tissue, Organ, Organism]
DNA: human genome, 3,000,000,000 nucleotides
RNA: 21,000 genes, n conditions
Protein: 21,000 genes, 100,000 gene products, 1,000,000 proteins, n conditions
Cell: 320 cell types; k genes, l proteins, m metabolites, n conditions
Population

Data Integration
New technologies for gene and protein expression profiling
  RNA: microarrays
  Protein: MALDI-TOF, LC-MS/MS
  Tissue: tissue microarrays
Complementary technologies: the real value lies in integrating diverse datasets
Data management and analysis?
Data Integration
Drowning in data, starving for information?
Microarray data (n=1): Affymetrix HG U133A2 chip
  Raw data: 80 MB per sample (incl. TIFF)
  MAGE-ML: 30 MB
  Normalized data: 5-10 MB (Excel table or text file)

Data Integration
Drowning in data, starving for information?
Proteomics data (n=1): Kislinger et al., Cell 2006, 125:173-186
  One organ (heart), one organelle (cytosol)
  Raw data: 1.55 GB (mzXML format)
  Sequest search folders: 235 MB
  Results in PRIDE format: 320 MB
  Results incl. protein sequences: 374 KB
Data Integration
n>100
  Genomics data (SNPs)
  Expression data (Microarrays)
  Proteomics data (LC-MS/MS)
  Phenotype data (Clinical parameters)
  Pharmacology data (Pharmacokinetics/dynamics)
  Medical images (CT, MR, PET, Ultrasound)
  Literature data (PubMed, Cochrane)
  ...
Centralized database?
  Herculean task
  Few standards

Data Integration
System incompatibilities
Organizational issues
Specific requirements in specific institutions
Data Integration
[Figure labels: Database, Analytical Tools, RNA, Protein, Cell]
Use standards like MIAME, MIAPE, ICD-O (as good as it gets)
Use state-of-the-art software technology
  Three-tier architecture: database layer (Oracle), middle layer (J2EE), presentation layer
Build interfaces for analytical tools

Methods
Java web-based platform
Database backend (Oracle, PostgreSQL, MySQL)
Java 2 Enterprise Edition and Struts
AndroMDA for code generation
JClusterService (Computing Cluster)
Lessons we learned
Data management systems: imperative!
De-centralized databases for primary data and preprocessing
Centralized database for processed data
Issues
  Software/database development
  Changes of software technology
  Database maintenance

Case Study: Cancer Immunology
Role of the immune system in early metastasis in colorectal cancer?
Database for Clinical and Genomic Data
Patients with colorectal cancer (n>1000)
  Clinical data (n>1000)
  qPCR of 500 genes (n>100)
  qPCR of 150 miRNAs
  T cell repertoire of 500 parameters (n>50)
  Phenotypic analysis of 410 parameters using FACS (n=50)
  Tissue microarrays (n>500)
Dedicated database for clinical and genomic data (http://tme.tugraz.at)
Tools: R, Genesis*, Cytoscape**, ARACNE***
  *Sturn et al., Bioinformatics 2002
  **Shannon et al., Genome Res 2003
  ***Basso et al., Nat Genet 2005

Phenotypes of Tumor-Infiltrating Immune Cells
Significantly different markers between invasion-positive (VELIPI+) and invasion-negative (VELIPI-) patients
VELIPI: vascular emboli (VE), lymphatic invasion (LI), perineural invasion (PI)
[Figure: color scale from min. expression to max. expression]
Pagès et al. N Engl J Med, 353:2654-2666, 2005
Effector Memory T-cells and Survival
Disease-free and overall survival of CD45RO-hi patients
Tissue MicroArray (TMA) analysis (n=353 patients)
[Figure: two panels of survival curves comparing CD45RO-hi and CD45RO-lo patients. One panel plots % Survival against Survival (months), the other % Recurrence-Free against Disease Free Survival (months). Axes run from 0 to 150 months; P<0.001 in each panel.]
Pagès et al. N Engl J Med, 353:2654-2666, 2005

Adaptive Immunity has a Beneficial Effect on Clinical Outcome
Galon et al. Science, 313:1960-1964, 2006
Combined Analysis of Tumor Regions Improves Prediction of Patient Survival
Galon et al. Science, 313:1960-1964, 2006

Patient Stratification
[Figure: groups labeled I, II, III, IV]
Galon et al. Science, 313:1960-1964, 2006
Conclusion
Adaptive immunity has a beneficial effect on clinical outcome in colorectal cancer
Combined analysis of tumor regions improves prediction of patient survival

Feature Selection and Cancer Classification from Mass Spectrometry Data using Wavelet Analysis and Support Vector Machine
For Public Health
Ovarian Cancer
  Traditional biomarker CA125 is not satisfactory
  High death rate
Breast Cancer
Prostate Cancer

Study datasets (provided by NCI, 2004)
  Low-resolution SELDI-TOF MS data (dimension = 15,154; 91 controls vs. 162 cancers)
  High-resolution SELDI-TOF MS data (dimension = 373,401; 95 controls vs. 121 cancers)
MELDI-TOF-MS (provided by Innsbruck University) (dimension = 54,005; 213 controls vs. 178 cancers)
Written in Blood
Raw High-resolution MS Data
Can you see any difference?

Binned MS Data
Bin the original MS data with unit interval
The dimension of the feature space is reduced from 373,401 to 11,301 (high-resolution data)
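The binning step can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the study: the slide does not say how intensities inside one bin are combined, so summing them is an assumption, and the function name `bin_spectrum` and the toy numbers are made up.

```python
from collections import defaultdict

def bin_spectrum(mz_values, intensities, width=1.0):
    """Combine all points whose m/z falls into the same bin.

    Bin k covers the half-open interval [k*width, (k+1)*width).
    Intensities within a bin are summed (an assumption, see lead-in).
    Returns a dict {bin_start: summed_intensity} ordered by bin_start.
    """
    if len(mz_values) != len(intensities):
        raise ValueError("mz_values and intensities must have equal length")
    bins = defaultdict(float)
    for mz, intensity in zip(mz_values, intensities):
        index = int(mz // width)
        bins[index * width] += intensity
    return dict(sorted(bins.items()))

# Toy spectrum (made-up numbers): six raw points collapse into three unit bins.
mz = [100.2, 100.7, 101.1, 101.9, 103.0, 103.4]
intensity = [1.0, 2.0, 0.5, 0.5, 4.0, 1.0]
binned = bin_spectrum(mz, intensity)
print(binned)
```

Applied to every spectrum with the same bin width, this gives all samples a common and much shorter feature vector, which is the effect the slide reports for the high-resolution data.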
Strategies of Data Reduction
Two-sample Kolmogorov-Smirnov goodness-of-fit test (KS-test for short, a nonparametric method): remove those m/z ratios at which healthy and cancer samples have the same distribution
Restriction of the coefficient of variation (for a positive random variable X, CV = sd(X)/E(X))
Wavelet analysis (good at handling discontinuities and sharp spikes)
Examples of KS-test →
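The first two strategies can be sketched with the Python standard library. This is an illustration under stated assumptions: the KS decision uses the common large-sample critical value rather than an exact p-value, all function names and the toy data are hypothetical, and since the slides do not say which side of a CV threshold is retained, the sketch only computes the CV.

```python
import math
from statistics import mean, stdev

def ks_statistic(sample_a, sample_b):
    """Largest absolute gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d

def ks_rejects(sample_a, sample_b, alpha=0.05):
    """True if the large-sample KS-test rejects 'same distribution'."""
    n, m = len(sample_a), len(sample_b)
    critical = math.sqrt(-0.5 * math.log(alpha / 2)) * math.sqrt((n + m) / (n * m))
    return ks_statistic(sample_a, sample_b) > critical

def select_features(healthy, cancer, alpha=0.05):
    """Indices of m/z positions where the two groups differ."""
    keep = []
    for k in range(len(healthy[0])):
        h = [row[k] for row in healthy]
        c = [row[k] for row in cancer]
        if ks_rejects(h, c, alpha):
            keep.append(k)
    return keep

def coefficient_of_variation(values):
    """CV = sd(X) / E(X), estimated with the sample sd and sample mean."""
    return stdev(values) / mean(values)

# Made-up data: 10 healthy and 10 cancer spectra with two m/z positions each.
# Position 0 differs between the groups, position 1 is identical.
healthy = [[float(i), float(i)] for i in range(1, 11)]
cancer = [[float(i + 10), float(i)] for i in range(1, 11)]
kept = select_features(healthy, cancer)
cv = coefficient_of_variation([2.0, 4.0, 6.0])
print(kept, cv)
```

Only the m/z positions returned by `select_features` would be passed on to later steps; positions where the test cannot tell the groups apart are dropped.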
KS-test on Raw Data

KS-test on Binned Data
Restriction of CV
Significance level 5%, t=0.4
Dimension is reduced to 6,757

Discrete Wavelet Transformation
Samples are separated, but they still overlap and the data size is too big!
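A discrete wavelet transformation splits a signal into a coarse approximation and detail coefficients that respond to spikes. The slides do not name the wavelet family, so the sketch below assumes the Haar wavelet, the simplest case; the function names and the toy signal are made up for illustration.

```python
import math

def haar_dwt(signal):
    """One level of the orthonormal Haar DWT.

    Returns (approximation, detail), each half as long as the input.
    """
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    s = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail

def haar_decompose(signal, levels):
    """Apply haar_dwt repeatedly to the approximation part."""
    approx = list(signal)
    details = []
    for _ in range(levels):
        approx, d = haar_dwt(approx)
        details.append(d)
    return approx, details

# Made-up signal: flat baseline with one sharp spike at position 4.
signal = [4.0, 4.0, 4.0, 4.0, 10.0, 0.0, 4.0, 4.0]
approx, details = haar_decompose(signal, levels=2)
print(approx)
print(details[0])  # first-level details: non-zero only at the spike
```

The detail coefficients are zero wherever the signal is flat and large where it jumps, which is why wavelet coefficients are a compact way to describe peaks in a spectrum.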
Support Vector Machine
[Figure label: Support vectors]

Other classifiers tested
Support Vector Machine
Trees (AD3, J48, Bayesian Trees)
Neural Networks
Voted Perceptron
Nearest Neighbour
Linear/Quadratic Discriminant Analysis
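To make the idea of a support vector machine concrete, here is a minimal sketch of a linear soft-margin SVM trained by batch subgradient descent on the regularized hinge loss. The slides do not state which kernel or solver was used, so the linear form, the training procedure, the parameter values and the toy data are all assumptions chosen for illustration.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Batch subgradient descent on lam/2 * ||w||^2 + mean hinge loss.

    X is a list of feature lists; y holds labels in {-1, +1}.
    """
    n, dim = len(X), len(X[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]  # gradient of the regularizer
        grad_b = 0.0
        for xi, yi in zip(X, y):
            score = sum(wj * xj for wj, xj in zip(w, xi)) + b
            if yi * score < 1:  # inside the margin or misclassified
                for j in range(dim):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def predict(w, b, x):
    """Class label from the side of the hyperplane w.x + b = 0."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if score >= 0 else -1

# Made-up, centered 2-D data: the +1 and -1 classes mirror each other.
X = [[2.0, 1.0], [1.0, 2.0], [3.0, 3.0],
     [-2.0, -1.0], [-1.0, -2.0], [-3.0, -3.0]]
y = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(X, y)
predictions = [predict(w, b, x) for x in X]
print(w, b, predictions)
```

In this toy set the points closest to the separating line, such as [2, 1] and [1, 2], are the ones that keep violating the margin condition during training and therefore determine the final boundary; they play the role of the support vectors.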
Some Results: Ovarian Cancer
Method      2xv HEALTHY  2xv CANCER  10xv HEALTHY  10xv CANCER
IBk         83.21        89.35       85.98         91.15
AdaBoost    89.59        91.92       92.05         94.03
VotedP      92.54        95.27       94.45         96.99
SVM         93.30        97.38       94.06         98.19

Some Results: Prostate Cancer
Method      2xv HEALTHY  2xv CANCER  10xv HEALTHY  10xv CANCER
SVM 2,000   94.65        88.97       97.02         90.19
SVM 3,500   93.39        85.97       95.52         88.50
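Numbers of this kind can be produced by k-fold cross-validation with a separate score per class. The sketch below reads "2xv" and "10xv" as 2-fold and 10-fold cross-validation and the HEALTHY and CANCER columns as percent correct per class, which is an interpretation of the table headers. It uses a 1-nearest-neighbour classifier as a simple stand-in, made-up data, and deterministic folds without shuffling; a real evaluation would usually shuffle or stratify the samples.

```python
def nearest_neighbour_predict(train_X, train_y, x):
    """Label of the training point closest to x (squared Euclidean distance)."""
    best = min(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)),
    )
    return train_y[best]

def cross_validate(X, y, k):
    """k-fold cross-validation; returns percent correct for each class."""
    indices = list(range(len(X)))
    folds = [indices[i::k] for i in range(k)]  # every k-th sample per fold
    correct = {label: 0 for label in set(y)}
    total = {label: 0 for label in set(y)}
    for fold in folds:
        held_out = set(fold)
        train_idx = [i for i in indices if i not in held_out]
        train_X = [X[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        for i in fold:
            total[y[i]] += 1
            if nearest_neighbour_predict(train_X, train_y, X[i]) == y[i]:
                correct[y[i]] += 1
    return {label: 100.0 * correct[label] / total[label] for label in total}

# Made-up 1-D data: two well-separated groups of ten samples each.
X = [[0.1 * i] for i in range(10)] + [[10.0 + 0.1 * i] for i in range(10)]
y = ["HEALTHY"] * 10 + ["CANCER"] * 10
result_2xv = cross_validate(X, y, k=2)
result_10xv = cross_validate(X, y, k=10)
print(result_2xv, result_10xv)
```

Reporting the two classes separately, as the tables do, shows whether a classifier is better at recognizing healthy or cancer samples, which a single overall accuracy would hide.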