Data Mining in Bioinformatics. Erwin M. Bakker LIACS Leiden University



Similar documents
Bachelorclass

An Overview of Knowledge Discovery Database and Data mining Techniques

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014

A Review of Data Mining Techniques

The Scientific Data Mining Process

Master s thesis topics in Data Science Gaming Parallel Programming

GENETIC DATA ANALYSIS

Data, Measurements, Features

Information Management course

Big Data Text Mining and Visualization. Anton Heijs

Data Mining On Diabetics

REGULATIONS FOR THE DEGREE OF BACHELOR OF SCIENCE IN BIOINFORMATICS (BSc[BioInf])

Data Mining System, Functionalities and Applications: A Radical Review

Introduction to Data Mining

Sanjeev Kumar. contribute

Introduction to Data Mining

Course DSS. Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization

ISSN: A Review: Image Retrieval Using Web Multimedia Mining

Data Integration. Lectures 16 & 17. ECS289A, WQ03, Filkov

Feature Selection using Integer and Binary coded Genetic Algorithm to improve the performance of SVM Classifier

Web-Based Genomic Information Integration with Gene Ontology

The Extension of the DICOM Standard to Incorporate Omics

Chapter 5 Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization

Overview. Swarms in nature. Fish, birds, ants, termites, Introduction to swarm intelligence principles Particle Swarm Optimization (PSO)

Euro-BioImaging European Research Infrastructure for Imaging Technologies in Biological and Biomedical Sciences

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015

Machine Learning/Data Mining for Cancer Genomics

Analysis of microarray gene expression data using R/BioC and web tools Organised by MolMed 12 th edition, Erasmus MC, June 25-29, 2012 vs

IEEE International Conference on Computing, Analytics and Security Trends CAST-2016 (19 21 December, 2016) Call for Paper

Search and Data Mining: Techniques. Applications Anya Yarygina Boris Novikov

BlueFuse Multi Analysis Software for Molecular Cytogenetics

BUSINESS ANALYTICS. Overview. Lecture 0. Information Systems and Machine Learning Lab. University of Hildesheim. Germany

Chapter 5. Warehousing, Data Acquisition, Data. Visualization

International Journal of Scientific & Engineering Research, Volume 5, Issue 4, April ISSN

SURVIVABILITY ANALYSIS OF PEDIATRIC LEUKAEMIC PATIENTS USING NEURAL NETWORK APPROACH

Data deluge (and it s applications) Gianluigi Zanetti. Data deluge. (and its applications) Gianluigi Zanetti

Introduction to Pattern Recognition

Data Warehousing and Data Mining in Business Applications

Ph.D. in Bioinformatics and Computational Biology Degree Requirements

Semantic Video Annotation by Mining Association Patterns from Visual and Speech Features

Analysis of gene expression data. Ulf Leser and Philippe Thomas

Inner Classification of Clusters for Online News

Preparing the scenario for the use of patient s genome sequences in clinic. Joaquín Dopazo

Adding New Level in KDD to Make the Web Usage Mining More Efficient. Abstract. 1. Introduction [1]. 1/10

Graduate Co-op Students Information Manual. Department of Computer Science. Faculty of Science. University of Regina

An Introduction to Data Mining

Data Mining: Overview. What is Data Mining?

How To Use Neural Networks In Data Mining

An EVIDENCE-ENHANCED HEALTHCARE ECOSYSTEM for Cancer: I/T perspectives

Euro-BioImaging European Research Infrastructure for Imaging Technologies in Biological and Biomedical Sciences

How To Change Medicine

DMBI: Data Management for Bio-Imaging.

Database Marketing, Business Intelligence and Knowledge Discovery

Mining Signatures in Healthcare Data Based on Event Sequences and its Applications

Comparative genomic hybridization Because arrays are more than just a tool for expression analysis

Integration of biospecimen data with clinical data mining

Principles of Data Mining by Hand&Mannila&Smyth

UF EDGE brings the classroom to you with online, worldwide course delivery!

Acceleration for Personalized Medicine Big Data Applications

Doctor of Philosophy in Computer Science

SPATIAL DATA CLASSIFICATION AND DATA MINING

IMPROVING DATA INTEGRATION FOR DATA WAREHOUSE: A DATA MINING APPROACH

Thermo Scientific ArrayScan XTI High Content Analysis Reader. revolutionizing cell biology with the power of high content

Use of Data Mining in the field of Library and Information Science : An Overview

CAP4773/CIS6930 Projects in Data Science, Fall 2014 [Review] Overview of Data Science

A Systemic Artificial Intelligence (AI) Approach to Difficult Text Analytics Tasks

An Ontology Based Method to Solve Query Identifier Heterogeneity in Post- Genomic Clinical Trials

Steven C.H. Hoi. School of Computer Engineering Nanyang Technological University Singapore

IEEE JAVA TITLES

Big Data Analytics for Healthcare

Data Mining Part 5. Prediction

An Introduction to the Use of Bayesian Network to Analyze Gene Expression Data

Chapter 6. The stacking ensemble approach

Predicting the Risk of Heart Attacks using Neural Network and Decision Tree

A SURVEY ON WEB MINING TOOLS

Graph Mining and Social Network Analysis

Course Requirements for the Ph.D., M.S. and Certificate Programs

A Performance Study of Load Balancing Strategies for Approximate String Matching on an MPI Heterogeneous System Environment

TIETS34 Seminar: Data Mining on Biometric identification

DATA PREPARATION FOR DATA MINING

REGULATIONS FOR THE DEGREE OF MASTER OF SCIENCE IN COMPUTER SCIENCE (MSc[CompSc])

Computational Drug Repositioning by Ranking and Integrating Multiple Data Sources

Protein Protein Interaction Networks

Healthcare Measurement Analysis Using Data mining Techniques

IC05 Introduction on Networks &Visualization Nov

A Survey on Pre-processing and Post-processing Techniques in Data Mining

Using the Grid for the interactive workflow management in biomedicine. Andrea Schenone BIOLAB DIST University of Genova

An Introduction to WEKA. As presented by PACE

Integrated Data Mining Strategy for Effective Metabolomic Data Analysis

Appendix 2 Molecular Biology Core Curriculum. Websites and Other Resources

Industrial Challenges for Content-Based Image Retrieval

Data Driven Discovery In the Social, Behavioral, and Economic Sciences

Data Mining for Customer Service Support. Senioritis Seminar Presentation Megan Boice Jay Carter Nick Linke KC Tobin

Transcription:

Data Mining in Bioinformatics Erwin M. Bakker LIACS Leiden University

Overview Introduction Bioinformatics: Data Analyses Data Modeling DIAL CMSB Phenotype Genotype Integration CYTTRON Sub-Graph Mining Conclusion 2

Leiden - Delft CS Bioinformatics track www.delftleiden.nl/bio Organized by: LIACS Leiden University EEMCS, the faculty of Electrical Engineering, Mathematics and Computer Science of the Delft University of Technology co-operation with three of the four centres of excellence of the Nationaal Regie Orgaan Genomics: the Kluyver Centre for Genomics of Industrial Fermentation in Delft the Cancer Genomics Consortium (of which DUT is a member) the Centre for Medical Systems Biology in Leiden. Focus on: Data Analyses and Data Modeling 3

Data Analyses and Data Modeling Zebra Fish Atlas (dr F. Verbeek) Applied optimization techniques: EA, GA, NN, etc. (prof T. Bäck) Content Based Indexing and Retrieval (dr M.S. Lew) Integrating Protein Databases: Collecting and Analyzing Natural Variants in G Protein-Coupled Receptors (drs. M. van Iterson, drs J. Kazius (LACDR)) Mining Phenotype Genotype Data (drs. F. Colas, LUMC) Data Mining, VL-e (prof J. Kok), Etc. 4

Data Mining Data Mining and Knowledge Discovery in Databases (KDD) are used interchangeably The process of discovery of interesting, meaningful and actionable patterns hidden in large amounts of data Multidisciplinary field originating from artificial intelligence, pattern recognition, statistics, machine learning, bioinformatics, econometrics, 5

Data Mining in Bioinformatics Problem: Leukemia (different types of Leukemia cells look very similar) Given data for a number of samples (patients), can we Accurately diagnose the disease? Predict outcome for given treatment? Recommend best treatment? Solution Data mining on micro-array data 6

CGH-DB Center for Medical Systems Biology (www.cmsb.nl) Data Integration and Logistics (DIAL) CGH-DB a CGH Database Consolidation of Experimental Data Integration of CGH data with: Other CGH Experiments Genome Databases Expression Databases Phenotype data Etc. Publication, validation, repetition, etc. 7

Groups Involved Micro Array Core Facility, VUMC: Bauke Ylstra, José Luis Costa, Anders Svensson, Paul vden IJssel, Mark van de Wiel, Sjoerd Vosse Center for Human and Clinical Genetics, LUMC: Judith Boer, Peter Taschner, and others Department of Molecular Cell Biology, Laboratory for Cytochemistry and Cytometry: Karoly Szuhai Leiden Institute of Advanced Computer Science, LIACS: Joost Kok, Floris Sicking, Erwin Bakker, Sven Groot, Michiel Ranshuysen, Harmen vder Spek 8

CGH-DB Goals A Secure, Reliable, and Scalable database/data management solution for storing the vast amounts of experimental micro array comparative genomic hybridization (CGH) data and images from the different CMSB research groups. Data Consolidation: through standard control mechanisms for data quality, data preprocessing, data referencing (BAC), and meta data (CGH MIAME), it is ensured that the stored data represent the original experimental data in an accurate and highly accessible way. Data Integration: the applied standards for normalization, smoothing, (BAC) referencing, and MIAME CGH annotation must support multiple experiment integration over various platforms, and a controlled interface with further analysis and visualization tools. 9

Comparative Genomic Hybridization (CGH) Mantripragada et a,l Trends in Genetics 2003 10

At BAC, or Oligo positions: Normal Gains Losses 11

Micro Array CGH Data Flow I CGH Oligo 12

Micro Array CGH Data Flow II CGH Oligo 13

Multiple Experiments: Viewing, Analyses 14

15

Visual Feedback: normalization; smoothing; BAC reference file version; 16

Integration of different experiments: BAC, Oligo, Bluefuse, Imagene, etc. 17

CGH Databases Data Explosion BAC 3500 data points Oligo s 20000 to 60000 data points 1000 experiments/year (currently) 200k and 500k expected in the near future Within some years: 5M data points routine diagnosis 200MB - 1GB Images Storage and Computational Requirements 18

Challenges Integration of Genomic Data Micro Array Expression Data mrna levels, Human Genome, Chimp, Rhesus, Mouse, etc. Semantic integration Scale up of routine analysis Scale up of research analysis over integrated data sets Data mining for hidden relations 19

Phenotype Genotype Integration Genotype data Annotated genome databases CGH Database Expression databases Etc. Phenotype data (Multimodal) Blood samples Weight, height, fat %, fat type, etc. Echo, CT, MRI scans Photographs Etc. 20

Longevity Studies at LUMC Group headed by E. Slagboom (LUMC) Current data mining studies by Fabrice Colas (LIACS) Mining genetic data sets 1-, 2-, and 3-itemsets (frequent item sets) Solving the problems in reasonable time was only possible using parallel computing (DAS3) 21

Towards a Classification of Osteo Arthritis subtypes in Subjects with Symptomatic OA at Multiple Joint Sites. F. Colas et al NBIC-ISNB2007 GARP study of OA (Osteo Arthritis) subtypes Identifying genetic factors Assist in development of new treatments Genetic etiology is difficult because of the clinical heterogeneity of the disease Identification of homogeneous subgroups of OA Identify and characterize potentially new disease subtypes using machine learning techniques Parallel Computation (DAS3) 22

Content Based Indexing and Retrieval Techniques Image Databases Speech Databases Video Databases Multimodal Databases Face recognition, bimodal emotion recognition (N. Sebe, UVA), Semantic Audio Indexing, etc. 23

CYTTRON Headed by prof J.P. Abrahams (LIC), www.cyttron.nl. Within the CYTTRON project various modes of imaging biological structures and processes will be integrated in a common visualization platform. The success of the integration and use of the bioimage data strongly relies on new bio-image processing techniques and searching methods. The research focus is on new visual search tools for bioimage queries, handling multi dimensional image data sets. 24

CYTTRON Different Bio-Imaging Techniques: - Light Microscopy -MRM - Confocal laser Microscopy - EM, Cryo, 3D EM -NMR - Crystallography -Etc. 25

Example: White blood cell 10-4 m 10-5 m 10-6 m 10-7 m 10-8 m 10-9 m 26

CYTTRON Integration Different modalities 2D, 3D, Noisy, Model, random projections Poor annotation Database design Content Based Searching Algorithms Feature Based Annotation Automatic Learning: relevance feedback, training sets, etc. Computational needs 27

Content-based image retrieval Database Searching for images based on content only, using an image as a query Using text search for images requires every image to be annotated. This has some disadvantages: Annotating images is time-consuming Annotation can be incomplete Annotation can be almost impossible 28

Image annotation difficulties How would you describe these images? 29

Basic CBIR paradigm Average color (23, 37, 241) Describe a specific visual property (feature) of an image as a vector RGB Historgrams Local Binary Patterns Etc. Extract features for all database images Extract features from query image Calculate distance between query image and all database images Rank images by distance 30

Relevance Feedback Very relevant Database Relevant Iterative search process: Search for query image Ask user to evaluate the search results Use feedback to adjust the query Repeat process until user is satisfied Update weights / adjust vectors Irrelevant 31

Sub-image search Let the user select one or more parts of the query image For each database image, calculate the number of sub-images matching (are close to) the selected parts Rank results based on number of matching subimages 32

Automatic Registration of Microtubule Images Feiyang Yu, Ard Oerlemans Erwin M. Bakker and Michael S. Lew (Artificial images. The original images could not be used due to copyright.) Microtubule Movie 33

Challenges CYTTRON Large number of images Insufficient or no annotation Multiresolution images (different scales) Images made by different types of imaging devices LML Projects High performance feature space computation and indexing (Images, Video s; batch usage) Interactive robust content based indexing techniques: emotion recognition, object recognition, who is talking, what is audible, etc. can be batch usage, but optimally we would like real time usage of DAS3 (!?) 34

Sub-Graph Mining Proteins: structure is function 1D and 2D structure computable from models, 3D structure difficult to predict Protein sequences => molecular description => structural encoding in graphs Existing protein databases can be encoded as graphs New sequences can then be encoded as graphs and used to search the graph database Mine the graph database => frequent patterns => see if these frequent patterns indicated groups of proteins with the same functionality 35

GASTON S. Nijssen, J.Kok 04 www.liacs.nl/~snijssen/gaston/iccs.html Applications: Molecular databases Protein databases Acces-patterns Web-links Etc. 36

Frequent Pattern Trees Develop new parallel versions for frequent item set mining Currently research on Closed and Constrained Frequent Item Set mining Biological Semantics Biological Relevance Evaluation experiments will be run on DAS3 37

Conclusion Data mining in Bioinformatics offer many challenging tasks in which DAS3 plays an essential role: research on novel scalable high performance segmentation of high dimensional and high volume feature spaces. Development and evaluation of novel high performance techniques for data mining research on novel scalable data(base) structures for efficient data querying, analysis and mining of high volume data sets 38