Distributed Data Mining in Discovery Net Dr. Moustafa Ghanem Department of Computing Imperial College London
1. What is Discovery Net 2. Distributed Data Mining for Compute Intensive Tasks 3. Distributed Data Mining for Sensor Grids 4. Knowledge Discovery from Naturally Distributed Data Sources 5. What Do Scientists Really Want?
1. What is Discovery Net
What is Discovery Net? Funding : One of the eight UK national e-science Pilot Projects funded by EPSRC ( 2.2M) Start Oct 2001, End March 2005 Goal :Construct the World s first Infrastructure for Global Knowledge Discovery Services Key Technologies: Open Service Computing High Throughput Devices and Real Time Data Mining Real Time Data Integration & Information Structuring Cross Domain Knowledge Discovery and Management Discovery Workflow and Discovery Planning
Discovery Net Applications Life Sciences High throughput genomics and proteomics Distributed Databases and Applications Environmental Modelling High throughput dispersed air sensing technology Sensor Grids 10 9 8 7 6 5 4 3 2 1 A B C D E F G H I J K M L N Real time geo-hazard modelling Earthquake modelling through satellite imagery High performance Distributed Computation
Discovery Net Architecture DPML Web/Grid Services OGSA D-Net Clients: End-user applications and user interface allowing scientists to construct and drive knowledge discovery activities D-Net Middleware: Provides services and execution logic for distributed knowledge discovery and access to distributed resources and services High Performance Communication Protocol (GridFTP, DSTP..) Grid Infrastructure (GSI) Goal: Plug & Play Data Sources, Analysis Components & Knowledge Discovery Processes Computation & Data Resources: Distributed databases, compute servers and scientific devices.
Discovery Net Data Mining Components Generic Data Mining Classification, Clustering, Associations,.. Unstructured-Data Mining Text Mining, Image Mining Domain-specific Mining Bioinformatics, Cheminformatics,..
2. Distribution of Compute Intensive Tasks a. Distributed Data Mining for Geo-hazard Prediction
Grid-based Geo-hazard Data Mining Grid-based HPC Computation Automatically co-register a stack of imagery layers at high precision and speed. Workflow to Coordinate Grid Computation Data Warehousing & Modelling Co-registration & geo-rectification Image features extraction Cluster & classification Grid-based Data Access and Integration
Normalised cross-correlation (NCC) template algorithm Image before Image after Reading Data set Setting comparing window Significant correlation coefficient Reading Data set Setting search window Setting comparing window N Operating on a remotely accessed MPI UNIX parallel computer through fast network with DNet interface. Slow but high accuracy: 24 processors 10 hours for one scene of Landsat-7 ETM+ Pan imagery data. The algorithm also run on GRID. Y Delta X Delta X Correlation coefficient
2. Distribution of Compute Intensive Tasks b. Distributed Clustering
Workflows for Distributed Data Clustering
3. Distributed Mining over Sensor Grid Data Distributed Spatial Data Mining for Air Pollution Modelling
Sensor Specification The GUSTO Project - Update (Generic UV Sensors Technologies & Observations) High throughput open path spectrometer system Robust algorithm for pollutant concentration retrievals Measures SO2, NO, NO2,O3 & Benzene to ppb levels every few seconds Geared for networking of multiple GUSTO units within a GRID Infrastructure Can support Remote Sensing data for (contour) mapping of pollutants www.gusto-systems.com
Networking of Multiple GUSTO Units GUSTO unit 1 GUSTO unit 2 GUSTO unit 3 GUSTO unit 4 HTTP, SOAP, GSI Wireless connectivity Data upload service Sensor registry & control service SensorML Archived weather data Warehouse Archived health data Data access service Monitoring and control software HTTP, SOAP, GSI Public access Web visualizer Visualisation and Data Mining GRID Infrastructure www.gusto-systems.com
Pollution analysis
4. Knowledge Discovery from Naturally Distributed Data Sources Distributed Data Mining in Life Sciences
Distributed Data Mining for Life Sciences secondary structure tertiary structure polymorphism patient records epidemiology expression patterns physiology sequences alignments ATGCAAGTCCCT AAGATTGCATAA GCTCGCTCAGTT receptors signals pathways linkage maps cytogenetic maps physical maps
Information Integration Gene Expression Warehouse OMIM ExPASy SwissProt PDB ExPASy Enzyme Disease Protein Enzyme Affy Fragment LocusLink Known Gene MGD Sequence Metabolite Sequence Cluster SNP Pathway SPAD Genbank NCBI dbsnp KEGG NMR UniGene Given a collection of microarray generated gene expression data, what kind of questions the users wish to pose. Design an integration schema?
From Data Integration to Knowledge Unification In Silico Experiment D-World I-World K-World
Life Science Application: SC2002 HPC Challenge High Throughput Sequencers Identify Organism Chromosomes Identify Organism s DNA D-Net based Global Collaborative Real- Time Genome Annotation Nucleotide-level Annotation Genes Gene markers Regulatory Regions Segmental Duplication Literature References trnas, rrnas Non-translated RNAs Repetitive Elements SNP Variations.. EMBL TIGR NCBI SNP genscan grail E-PCR blast Repeat Masker genscan Protein-level Annotation Identify Proteins Functional Characteisation Domain Classify into Protein Families Homologues 3-D Structure Inter Pro SMART Inter Pro SWISS PROT blast PFAM 3D-PSSM Motif Search Genome Annotation Fold Prediction Literature References Secondary structure.. predator DSC Process-level Annotation Relate Cell Cycle Drugs Cell death Literature References Metabolism Biological Process.. Embryogenesis.. GO KEGG CSNDB GK Pathway Maps AmiGO virtual chip Ontologies GeneMaps GenNav 15 DBs 21 Applications
HPC Challenge SC2002 Download sequence from Reference Server Nucleotide Annotation Workflows Interactive Editor & Visualisation Real-time sequencing in London Inter Pro EMBL SMART NCBI KEGG SWISS PROT Save to Distributed Annotation Server TIGR SNP GO Distributed data and computation 1800 clicks 500 Web access 200 copy/paste 3 weeks work in 1 workflow and few second execution Execute distributed annotation workflow
Discovery Net in Action: China SARS Virtual Lab Homology search against viral genome DB Homology search against protein DB Genbank Annotation using Artemis and GenSense Gene prediction Predicted genes Annotation using Artemis and GenSense Homology search against motif DB Key word search GeneSense Ontology Exon prediction Splice site prediction Multiple sequence alignment Phylogenetic analysis Immunogenetics D-Net: Integration, interpretation, and discovery Relationship between SARS and other virus Mutual regions identification Protein localization site prediction Protein interaction prediction Relationship between SARS virus and human receptors prediction Microarray analysis Epidemiological analysis SARS patients diagnosis Classification and secondary structure prediction Bibliographic databases Bibliographic databases
Discovery Net in Action: SARS Virus Mutation Analysis
5. What do Scientist Really Want? Does it really work?
Towards Compositional Grid Services Native MPI OGSA-service Condor-G Web Service Service Browsing Workflow Warehousing Workflow Authoring Composing services Resource Mapping Sun Grid Engine Service Abstraction Oralce 10g Unicore Workflow Execution A compositional GRID Web Wrapper Workflow Management Collaborative Knowledge Management Workflow Deployment: Grid Service and Portal
Discovery Net Service Composition
Full Workflow
Executing Protein Annotation Workflow
Deployment of Node
Deploying Protein Annotation Workflow
Executing Deployed Service
Locating & Executing Deployed Service from Discovery Net
Workflow Provenance
Workflow Warehousing
Discovery Net Snapshot Scientific Information In Real Time Scientific Discovery Literature Real Time Data Integration Discovery Services Service Workflow Databases Operational Data Dynamic Application Integration Integrative Knowledge Management Using Distributed Resources Images Instrument Data