Distributed Data Mining in Discovery Net. Dr. Moustafa Ghanem Department of Computing Imperial College London



Similar documents
Using the Grid for the interactive workflow management in biomedicine. Andrea Schenone BIOLAB DIST University of Genova

RETRIEVING SEQUENCE INFORMATION. Nucleotide sequence databases. Database search. Sequence alignment and comparison

Just the Facts: A Basic Introduction to the Science Underlying NCBI Resources

A Primer of Genome Science THIRD

Bioinformatics Grid - Enabled Tools For Biologists.

Using Ontologies in Proteus for Modeling Data Mining Analysis of Proteomics Experiments

Core Bioinformatics. Degree Type Year Semester Bioinformàtica/Bioinformatics OB 0 1

Vad är bioinformatik och varför behöver vi det i vården? a bioinformatician's perspectives

A Multiple DNA Sequence Translation Tool Incorporating Web Robot and Intelligent Recommendation Techniques

EMBL Identity & Access Management

Linear Sequence Analysis. 3-D Structure Analysis

THE CCLRC DATA PORTAL

Cloud-Based Big Data Analytics in Bioinformatics

IEEE International Conference on Computing, Analytics and Security Trends CAST-2016 (19 21 December, 2016) Call for Paper

Syllabus of B.Sc. (Bioinformatics) Subject- Bioinformatics (as one subject) B.Sc. I Year Semester I Paper I: Basic of Bioinformatics 85 marks

Integrating Bioinformatics, Medical Sciences and Drug Discovery

Processing Genome Data using Scalable Database Technology. My Background

BIO 3350: ELEMENTS OF BIOINFORMATICS PARTIALLY ONLINE SYLLABUS

Bioinformatics Resources at a Glance

Lecture 11 Data storage and LIMS solutions. Stéphane LE CROM

Guide for Bioinformatics Project Module 3

Teaching Bioinformatics to Undergraduates

BIOINF 525 Winter 2016 Foundations of Bioinformatics and Systems Biology

Bioinformatics: course introduction

REGULATIONS FOR THE DEGREE OF MASTER OF SCIENCE IN COMPUTER SCIENCE (MSc[CompSc])

HETEROGENEOUS DATA INTEGRATION FOR CLINICAL DECISION SUPPORT SYSTEM. Aniket Bochare - aniketb1@umbc.edu. CMSC Presentation

An approach to grid scheduling by using Condor-G Matchmaking mechanism

SeqScape Software Version 2.5 Comprehensive Analysis Solution for Resequencing Applications

Euro-BioImaging European Research Infrastructure for Imaging Technologies in Biological and Biomedical Sciences

A Platform for Collaborative e-science Applications. Marian Bubak ICS / Cyfronet AGH Krakow, PL bubak@agh.edu.pl

Final Project Report

The Human Genome Project

Cluster, Grid, Cloud Concepts

Biological Databases and Protein Sequence Analysis

AGILENT S BIOINFORMATICS ANALYSIS SOFTWARE

BBSRC TECHNOLOGY STRATEGY: TECHNOLOGIES NEEDED BY RESEARCH KNOWLEDGE PROVIDERS

Novel Mining of Cancer via Mutation in Tumor Protein P53 using Quick Propagation Network

Protein Protein Interaction Networks

An Introduction to Genomics and SAS Scientific Discovery Solutions

Three data delivery cases for EMBL- EBI s Embassy. Guy Cochrane

How To Use The Assembly Database In A Microarray (Perl) With A Microarcode) (Perperl 2) (For Macrogenome) (Genome 2)

Dr Alexander Henzing

The Lattice Project: A Multi-Model Grid Computing System. Center for Bioinformatics and Computational Biology University of Maryland

From Data to Foresight:

SGI. High Throughput Computing (HTC) Wrapper Program for Bioinformatics on SGI ICE and SGI UV Systems. January, Abstract. Haruna Cofer*, PhD

Big Data in BioMedical Sciences. Steven Newhouse, Head of Technical Services, EMBL-EBI

Anwendungsintegration und Workflows mit UNICORE 6

An agent-based layered middleware as tool integration

BIO 3352: BIOINFORMATICS II HYBRID COURSE SYLLABUS

Web-Based Genomic Information Integration with Gene Ontology

What s New in Pathway Studio Web 11.1

PARALLEL & CLUSTER COMPUTING CS 6260 PROFESSOR: ELISE DE DONCKER BY: LINA HUSSEIN

Introduction to Genome Annotation

Tutorial for proteome data analysis using the Perseus software platform

Search and Data Mining: Techniques. Applications Anya Yarygina Boris Novikov

Efficient Parallel Execution of Sequence Similarity Analysis Via Dynamic Load Balancing

Cloud Computing Solutions for Genomics Across Geographic, Institutional and Economic Barriers

Integrated Rule-based Data Management System for Genome Sequencing Data

org.rn.eg.db December 16, 2015 org.rn.egaccnum is an R object that contains mappings between Entrez Gene identifiers and GenBank accession numbers.

Sequence Formats and Sequence Database Searches. Gloria Rendon SC11 Education June, 2011

A W orkflow Management System for Bioinformatics Grid

Scientific and Technical Applications as a Service in the Cloud

A Reliable and Fast Data Transfer for Grid Systems Using a Dynamic Firewall Configuration

Databases and mapping BWA. Samtools

Human Genome and Human Genome Project. Louxin Zhang

Sequencing the Human Genome

Technical. Overview. ~ a ~ irods version 4.x

Big Data Analytics and Healthcare

Classic Grid Architecture

Pipeline Pilot Enterprise Server. Flexible Integration of Disparate Data and Applications. Capture and Deployment of Best Practices

Big Data Mining Services and Knowledge Discovery Applications on Clouds

School of Nursing. Presented by Yvette Conley, PhD

Big Data in Drug Discovery

Preparing the scenario for the use of patient s genome sequences in clinic. Joaquín Dopazo

Genome Explorer For Comparative Genome Analysis

Check Your Data Freedom: A Taxonomy to Assess Life Science Database Openness

Informatics and Knowledge Management at the Novartis Institutes for BioMedical Research (NIBR)

Transcription:

Distributed Data Mining in Discovery Net Dr. Moustafa Ghanem Department of Computing Imperial College London

1. What is Discovery Net 2. Distributed Data Mining for Compute Intensive Tasks 3. Distributed Data Mining for Sensor Grids 4. Knowledge Discovery from Naturally Distributed Data Sources 5. What Do Scientists Really Want?

1. What is Discovery Net

What is Discovery Net? Funding : One of the eight UK national e-science Pilot Projects funded by EPSRC ( 2.2M) Start Oct 2001, End March 2005 Goal :Construct the World s first Infrastructure for Global Knowledge Discovery Services Key Technologies: Open Service Computing High Throughput Devices and Real Time Data Mining Real Time Data Integration & Information Structuring Cross Domain Knowledge Discovery and Management Discovery Workflow and Discovery Planning

Discovery Net Applications Life Sciences High throughput genomics and proteomics Distributed Databases and Applications Environmental Modelling High throughput dispersed air sensing technology Sensor Grids 10 9 8 7 6 5 4 3 2 1 A B C D E F G H I J K M L N Real time geo-hazard modelling Earthquake modelling through satellite imagery High performance Distributed Computation

Discovery Net Architecture DPML Web/Grid Services OGSA D-Net Clients: End-user applications and user interface allowing scientists to construct and drive knowledge discovery activities D-Net Middleware: Provides services and execution logic for distributed knowledge discovery and access to distributed resources and services High Performance Communication Protocol (GridFTP, DSTP..) Grid Infrastructure (GSI) Goal: Plug & Play Data Sources, Analysis Components & Knowledge Discovery Processes Computation & Data Resources: Distributed databases, compute servers and scientific devices.

Discovery Net Data Mining Components Generic Data Mining Classification, Clustering, Associations,.. Unstructured-Data Mining Text Mining, Image Mining Domain-specific Mining Bioinformatics, Cheminformatics,..

2. Distribution of Compute Intensive Tasks a. Distributed Data Mining for Geo-hazard Prediction

Grid-based Geo-hazard Data Mining Grid-based HPC Computation Automatically co-register a stack of imagery layers at high precision and speed. Workflow to Coordinate Grid Computation Data Warehousing & Modelling Co-registration & geo-rectification Image features extraction Cluster & classification Grid-based Data Access and Integration

Normalised cross-correlation (NCC) template algorithm Image before Image after Reading Data set Setting comparing window Significant correlation coefficient Reading Data set Setting search window Setting comparing window N Operating on a remotely accessed MPI UNIX parallel computer through fast network with DNet interface. Slow but high accuracy: 24 processors 10 hours for one scene of Landsat-7 ETM+ Pan imagery data. The algorithm also run on GRID. Y Delta X Delta X Correlation coefficient

2. Distribution of Compute Intensive Tasks b. Distributed Clustering

Workflows for Distributed Data Clustering

3. Distributed Mining over Sensor Grid Data Distributed Spatial Data Mining for Air Pollution Modelling

Sensor Specification The GUSTO Project - Update (Generic UV Sensors Technologies & Observations) High throughput open path spectrometer system Robust algorithm for pollutant concentration retrievals Measures SO2, NO, NO2,O3 & Benzene to ppb levels every few seconds Geared for networking of multiple GUSTO units within a GRID Infrastructure Can support Remote Sensing data for (contour) mapping of pollutants www.gusto-systems.com

Networking of Multiple GUSTO Units GUSTO unit 1 GUSTO unit 2 GUSTO unit 3 GUSTO unit 4 HTTP, SOAP, GSI Wireless connectivity Data upload service Sensor registry & control service SensorML Archived weather data Warehouse Archived health data Data access service Monitoring and control software HTTP, SOAP, GSI Public access Web visualizer Visualisation and Data Mining GRID Infrastructure www.gusto-systems.com

Pollution analysis

4. Knowledge Discovery from Naturally Distributed Data Sources Distributed Data Mining in Life Sciences

Distributed Data Mining for Life Sciences secondary structure tertiary structure polymorphism patient records epidemiology expression patterns physiology sequences alignments ATGCAAGTCCCT AAGATTGCATAA GCTCGCTCAGTT receptors signals pathways linkage maps cytogenetic maps physical maps

Information Integration Gene Expression Warehouse OMIM ExPASy SwissProt PDB ExPASy Enzyme Disease Protein Enzyme Affy Fragment LocusLink Known Gene MGD Sequence Metabolite Sequence Cluster SNP Pathway SPAD Genbank NCBI dbsnp KEGG NMR UniGene Given a collection of microarray generated gene expression data, what kind of questions the users wish to pose. Design an integration schema?

From Data Integration to Knowledge Unification In Silico Experiment D-World I-World K-World

Life Science Application: SC2002 HPC Challenge High Throughput Sequencers Identify Organism Chromosomes Identify Organism s DNA D-Net based Global Collaborative Real- Time Genome Annotation Nucleotide-level Annotation Genes Gene markers Regulatory Regions Segmental Duplication Literature References trnas, rrnas Non-translated RNAs Repetitive Elements SNP Variations.. EMBL TIGR NCBI SNP genscan grail E-PCR blast Repeat Masker genscan Protein-level Annotation Identify Proteins Functional Characteisation Domain Classify into Protein Families Homologues 3-D Structure Inter Pro SMART Inter Pro SWISS PROT blast PFAM 3D-PSSM Motif Search Genome Annotation Fold Prediction Literature References Secondary structure.. predator DSC Process-level Annotation Relate Cell Cycle Drugs Cell death Literature References Metabolism Biological Process.. Embryogenesis.. GO KEGG CSNDB GK Pathway Maps AmiGO virtual chip Ontologies GeneMaps GenNav 15 DBs 21 Applications

HPC Challenge SC2002 Download sequence from Reference Server Nucleotide Annotation Workflows Interactive Editor & Visualisation Real-time sequencing in London Inter Pro EMBL SMART NCBI KEGG SWISS PROT Save to Distributed Annotation Server TIGR SNP GO Distributed data and computation 1800 clicks 500 Web access 200 copy/paste 3 weeks work in 1 workflow and few second execution Execute distributed annotation workflow

Discovery Net in Action: China SARS Virtual Lab Homology search against viral genome DB Homology search against protein DB Genbank Annotation using Artemis and GenSense Gene prediction Predicted genes Annotation using Artemis and GenSense Homology search against motif DB Key word search GeneSense Ontology Exon prediction Splice site prediction Multiple sequence alignment Phylogenetic analysis Immunogenetics D-Net: Integration, interpretation, and discovery Relationship between SARS and other virus Mutual regions identification Protein localization site prediction Protein interaction prediction Relationship between SARS virus and human receptors prediction Microarray analysis Epidemiological analysis SARS patients diagnosis Classification and secondary structure prediction Bibliographic databases Bibliographic databases

Discovery Net in Action: SARS Virus Mutation Analysis

5. What do Scientist Really Want? Does it really work?

Towards Compositional Grid Services Native MPI OGSA-service Condor-G Web Service Service Browsing Workflow Warehousing Workflow Authoring Composing services Resource Mapping Sun Grid Engine Service Abstraction Oralce 10g Unicore Workflow Execution A compositional GRID Web Wrapper Workflow Management Collaborative Knowledge Management Workflow Deployment: Grid Service and Portal

Discovery Net Service Composition

Full Workflow

Executing Protein Annotation Workflow

Deployment of Node

Deploying Protein Annotation Workflow

Executing Deployed Service

Locating & Executing Deployed Service from Discovery Net

Workflow Provenance

Workflow Warehousing

Discovery Net Snapshot Scientific Information In Real Time Scientific Discovery Literature Real Time Data Integration Discovery Services Service Workflow Databases Operational Data Dynamic Application Integration Integrative Knowledge Management Using Distributed Resources Images Instrument Data