Distributed knowledge sharing and production through collaborative e-science platforms

1 Distributed knowledge sharing and production through collaborative e-science platforms PhD Defense - Alban Gaignard Advisor: Johan Montagnat CNRS, University of Nice Sophia Antipolis, I3S Laboratory, MODALIS research group 1

2 Translational research & e-science Research laboratory Healthy population Target population Scanner Medical data Data processing Legacy database 2

3 Translational research & e-science Research laboratory 2

4 Translational research & e-science Research laboratory... 2

5 Translational research & e-science Research laboratory... e-science platform 2

6 Translational research & e-science Research laboratory Sharing e-science platform 2

7 Translational research & e-science Research laboratory Sharing 2 Processing e-science platform 2

8 Translational research & e-science Heterogeneity Research laboratory Distribution Dynamicity Scalability 1 Sharing Knowledge 2 Processing e-science platform 2

9 Challenges & Hypotheses Questions: Scalability/Distribution: how to efficiently search over large distributed data sources? Dynamicity/Heterogeneity: how to cope with legacy/non-relocatable data? how to dynamically combine several independent data sources? Knowledge: how to share/search for data and processing tools with high expressivity? better results interpretation? 3

10 Challenges & Hypotheses Questions: Scalability/Distribution: how to efficiently search over large distributed data sources? Dynamicity/Heterogeneity: how to cope with legacy/non-relocatable data? how to dynamically combine several independent data sources? Knowledge: how to share/search for data and processing tools with high expressivity? better results interpretation? Hypotheses: H1: Domain ontologies H2: Data sources are distributed and autonomous H3: e-science platforms allow sharing & producing scientific resources 3

11 Challenges & Hypotheses Questions: Scalability/Distribution: how to efficiently search over large distributed data sources? Dynamicity/Heterogeneity: how to cope with legacy/non-relocatable data? how to dynamically combine several independent data sources? Knowledge: how to share/search for data and processing tools with high expressivity? better results interpretation? Hypotheses: H1: Domain ontologies H2: Data sources are distributed and autonomous H3: e-science platforms allow sharing & producing scientific resources Scientific areas Knowledge engineering: reasoning on semantic description of data & processing tools e-science: computing infra. to process/share/re-purpose scientific resources 3

12 Challenges & Hypotheses Questions: Scalability/Distribution: how to efficiently search over large distributed data sources? Dynamicity/Heterogeneity: how to cope with legacy/non-relocatable data? how to dynamically combine several independent data sources? Knowledge: how to share/search for data and processing tools with high expressivity? better results interpretation? Hypotheses: H1: Domain ontologies H2: Data sources are distributed and autonomous H3: e-science platforms allow sharing & producing scientific resources Scientific areas Knowledge engineering: reasoning on semantic description of data & processing tools e-science: computing infra. to process/share/re-purpose scientific resources semantic e-science reducing "time-to-discovery" 3

13 Thesis Objectives 4

14 Thesis Objectives Coherent sharing and production of distributed knowledge in Life-Science: Knowledge sharing: coping with semantic data volume, distribution, heterogeneity Knowledge production: extracting meaningful & long-term data from large & technical datasets 4

15 Main contributions 1. Knowledge base federation Transparent, efficient, and expressive semantic federated querying Abstract Knowledge Graphs [Web Intelligence'12] [IC'12 workshop] [MICCAI'12 workshop] 2. Semantic Workflows Characterization of semantically annotated services (Nature and Role) Semantic experiment summaries [KEOD'11] [IC'10 workshop] [TMI'13] [CBMS'11] 5

16 E-Science 1 : data integration 6

17 E-Science 1 : data integration
Materialized Data Integration (Extract - Transform - Load): sources E-T-L1 ... E-T-Ln are loaded into a data warehouse; centralized querying. Criteria: Efficiency, Scalability, Dynamicity. Hardly relocatable data? 6

18 E-Science 1 : data integration
Materialized Data Integration (Extract - Transform - Load): sources E-T-L1 ... E-T-Ln are loaded into a data warehouse; centralized querying. Criteria: Efficiency, Scalability, Dynamicity. Hardly relocatable data?
Virtualized Data Integration (Distributed Query Processing): federated querying through sub-querying of each source. Criteria: Efficiency, Scalability (Load/Volume), Dynamicity. Data kept at source. 6

19 E-Science 1 : distributed semantic querying
Compared engines: DARQ, Splendid, SemWiq, SPARQL-DQP, FedX, KGRAM. Comparison criteria: Distribution, Performance, Heterogeneity, Dynamicity, Expressivity.

20 E-Science 1 : distributed semantic querying
Compared engines: DARQ, Splendid, SemWiq, SPARQL-DQP, FedX, KGRAM. Comparison criteria: Distribution, Performance, Heterogeneity, Dynamicity, Expressivity.
Missing expressivity (subset of SPARQL): only SELECT queries on Basic Graph Patterns, no PATH expressions, no bound subjects for SemWiq, etc. 7

21 E-Science 1 : distributed semantic querying
Compared engines: DARQ, Splendid, SemWiq, SPARQL-DQP, FedX, KGRAM, KGRAM-DQP. Comparison criteria: Distribution, Performance, Heterogeneity, Dynamicity, Expressivity.
Missing expressivity (subset of SPARQL): only SELECT queries on Basic Graph Patterns, no PATH expressions, no bound subjects for SemWiq, etc.
Balancing Expressivity & Performance 7

22 E-Science 1 : semantic data handling with KGRAM Representing, querying and reasoning on Knowledge Graphs Generic engine Expressivity: SPARQL 1.1 compliant Versatility: several data models (RDF, XML, SQL) Reasoning: RDFS entailments + Inference rules 8

23 E-Science 1 : semantic data handling with KGRAM
Representing, querying and reasoning on Knowledge Graphs. Generic engine. Expressivity: SPARQL 1.1 compliant. Versatility: several data models (RDF, XML, SQL). Reasoning: RDFS entailments + Inference rules.
KGRAM abstract machine (diagram labels): query language, query parsing, query engine, query evaluation, data producer, query rewriting, native QL, data source, native data, data transforming, data matching & filtering.
AQL: Abstract Query Language. AKG: Abstract Knowledge Graph. 8
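The "several data models" idea on this slide can be illustrated with a small sketch. It is not KGRAM's actual API: the names `Producer`, `TripleProducer`, `SqlProducer` and `get_edges` are hypothetical, and the mapping of a relational row to (row key, column name, value) edges is an assumption made only for illustration. The point is that a query engine can work on abstract edges while each producer hides its native data model.

```python
import sqlite3
from typing import Iterable, List, Protocol, Tuple

Edge = Tuple[str, str, str]  # (subject, predicate, object)


class Producer(Protocol):
    """Hypothetical producer interface: returns the edges matching a predicate."""

    def get_edges(self, predicate: str) -> Iterable[Edge]: ...


class TripleProducer:
    """Producer backed by an in-memory list of RDF-like triples."""

    def __init__(self, triples: Iterable[Edge]) -> None:
        self.triples = list(triples)

    def get_edges(self, predicate: str) -> List[Edge]:
        return [t for t in self.triples if t[1] == predicate]


class SqlProducer:
    """Producer exposing one relational table as edges (key, column, value)."""

    def __init__(self, conn: sqlite3.Connection, table: str, key: str,
                 columns: Iterable[str]) -> None:
        self.conn, self.table, self.key = conn, table, key
        self.columns = set(columns)

    def get_edges(self, predicate: str) -> List[Edge]:
        if predicate not in self.columns:
            return []  # this source knows nothing about the predicate
        rows = self.conn.execute(
            f'SELECT "{self.key}", "{predicate}" FROM "{self.table}"')
        return [(str(k), predicate, str(v)) for k, v in rows if v is not None]


# Usage: the same call works on both data models.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (id TEXT, name TEXT)")
conn.execute("INSERT INTO person VALUES ('p2', 'Bob')")
producers = [TripleProducer([("p1", "name", "Alice")]),
             SqlProducer(conn, "person", "id", ["name"])]
all_names = [e for p in producers for e in p.get_edges("name")]
```

The engine-side code only sees `Edge` tuples, which is the role the slide gives to the Abstract Knowledge Graph.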

24 E-Science 2 : scientific workflows
Semantic workflow (WF) environments (METEOR-S ; Taverna/FETA ; BioCatalogue ; BioMOBY) target WF design/sharing. WF results interpretation through Provenance standards (Provenir, OPM, PROV-*).
Compared approaches: e-BioInfra, NeuGrid, RDFProv, ProvBase, Linked Provenance Data, Wings/Pegasus, PaCE, Taverna/Janus. Comparison criteria: Standards, Scalability, Linked Data approach, Domain knowledge.
PROV-O published as a W3C Candidate Recommendation (11 December 2012) 9

25 E-Science 2 : scientific workflows
Semantic workflow (WF) environments (METEOR-S ; Taverna/FETA ; BioCatalogue ; BioMOBY) target WF design/sharing. WF results interpretation through Provenance standards (Provenir, OPM, PROV-*).
Compared approaches: e-BioInfra, NeuGrid, RDFProv, ProvBase, Linked Provenance Data, Wings/Pegasus, PaCE, Taverna/Janus, NeuSemStore. Comparison criteria: Standards, Scalability, Linked Data approach, Domain knowledge.
PROV-O published as a W3C Candidate Recommendation (11 December 2012) 9

26 Contribution 1 Knowledge Sharing (for e-science platforms)

27 Efficient & expressive sharing of knowledge graphs 11

28 Efficient & expressive sharing of knowledge graphs Objectives Transparent federated semantic engine Heterogeneity + Dynamicity Balancing expressivity and performance Distribution + Scalability + Knowledge 11

30 Efficient & expressive sharing of knowledge graphs Objectives Transparent federated semantic engine Heterogeneity + Dynamicity Balancing expressivity and performance Distribution + Scalability + Knowledge Methods Abstract Knowledge Graphs Distributed Query Processing techniques Static and dynamic optimization 11

31 KGRAM-DQP: distributed query processing
Architecture (diagram): the KGRAM-DQP federator (query evaluator, parallel MetaProducer) holds one Web service client per remote data producer. Each data producer (#1, #2) is a SPARQL Web service endpoint with a Producer over its data source (#1, #2). Rewritten queries are sent to the endpoints; native results are returned. 12

32 KGRAM-DQP: distributed query processing
Architecture (diagram): the KGRAM-DQP federator (query evaluator, parallel MetaProducer) holds one Web service client per remote data producer. Each data producer (#1, #2) is a SPARQL Web service endpoint with a Producer over its data source (#1, #2). Rewritten queries are sent to the endpoints; native results are returned.
Concerns: cost of network communication; distributed query processing performance; service parallelism / optimizations. 12

33 KGRAM-DQP: parallel evaluation
Algorithm 6 illustrates how to distribute the query over a set of federated knowledge bases, exploiting the parallelism of each remote producer. Results follow the SPARQL W3C recommendation and are represented as a set of Results, each of them encoding a set of Mappings between variables and values.
Algorithm 6: Fine-grained parallel distributed query processing, with an explicit wait condition.
Data: Producers, the set of SPARQL endpoints; EdgeReq, the set of edge requests forming the SPARQL query; scheduler, a thread pool allowing parallel execution.
Result: Results, the set of SPARQL results.
1 foreach (e ∈ EdgeReq) do
2   foreach (p ∈ Producers) do in parallel
3     scheduler.submit(p.getedges(e)) ;
4   wait for scheduler ;
5   foreach (task ∈ scheduler.getfinished()) do
6     Results ← task.getresults() ;
(a) Synch. barrier (b) Pipelining
The principle consists in iterating over each edge request forming the initial SPARQL query (line 1). Then, for each edge request, all federated SPARQL endpoints are queried concurrently (line 3). The federator then waits for all federated endpoints to finish [...] 13
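A minimal Python sketch of the synchronization-barrier variant (a) of Algorithm 6 follows. It is an illustration under assumptions, not the KGRAM-DQP implementation: `InMemoryProducer`, `get_edges` and `evaluate` are hypothetical names, an edge request is reduced to a predicate string, and remote SPARQL endpoints are simulated by in-memory triple lists.

```python
from concurrent.futures import ThreadPoolExecutor, wait


class InMemoryProducer:
    """Stand-in for a remote SPARQL endpoint (assumption: matches on predicate only)."""

    def __init__(self, triples):
        self.triples = list(triples)

    def get_edges(self, edge_request):
        return [t for t in self.triples if t[1] == edge_request]


def evaluate(edge_requests, producers, max_workers=8):
    """Query every producer in parallel for each edge request, with a barrier."""
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as scheduler:
        for edge in edge_requests:                                # line 1
            tasks = [scheduler.submit(p.get_edges, edge)          # lines 2-3
                     for p in producers]
            wait(tasks)                                           # line 4: barrier
            for task in tasks:                                    # lines 5-6
                results.extend(task.result())
    return results


# Usage with two simulated data sources.
source1 = InMemoryProducer([("ex:a", "foaf:name", "Bobby A")])
source2 = InMemoryProducer([("ex:a", "dbpedia:birthdate", "1970-01-01"),
                            ("ex:b", "foaf:name", "Carol")])
edges = evaluate(["foaf:name", "dbpedia:birthdate"], [source1, source2])
```

The pipelining variant (b) would consume each task as soon as it completes (for instance with `concurrent.futures.as_completed`) instead of waiting for all of them.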

34 Static optimization: pushing applicable FILTERs
Filter out irrelevant results as early as possible (lighter network communications): add the FILTER to each single triple pattern (if applicable).
Input SPARQL query (Listing 4.3: Full SPARQL query distributed over remote KGRAM endpoints):
  PREFIX foaf: <...>
  PREFIX dbpedia: <...>
  SELECT DISTINCT ?x ?name ?date WHERE {
    ?x foaf:name ?name .
    ?x dbpedia:birthdate ?date .
    FILTER (CONTAINS (?name, "Bobby A"))
  }
Listing 4.4: Generated SPARQL query encapsulating a single edge request through the naive rewriting strategy:
  PREFIX foaf: <...>
  CONSTRUCT { ?x foaf:name ?name } WHERE { ?x foaf:name ?name . }
14

35 Static optimization: pushing applicable FILTERs
Filter out irrelevant results as early as possible (lighter network communications): add the FILTER to each single triple pattern (if applicable).
Input SPARQL query (Listing 4.3: Full SPARQL query distributed over remote KGRAM endpoints):
  PREFIX foaf: <...>
  PREFIX dbpedia: <...>
  SELECT DISTINCT ?x ?name ?date WHERE {
    ?x foaf:name ?name .
    ?x dbpedia:birthdate ?date .
    FILTER (CONTAINS (?name, "Bobby A"))
  }
Rewritten sub-query (Listing 4.4: Generated SPARQL query encapsulating a single edge request through the naive rewriting strategy):
  PREFIX foaf: <...>
  CONSTRUCT { ?x foaf:name ?name } WHERE { ?x foaf:name ?name . }
14

36 Static optimization: pushing applicable FILTERs
Filter out irrelevant results as early as possible (lighter network communications): add the FILTER to each single triple pattern (if applicable).
Input SPARQL query (Listing 4.3: Full SPARQL query distributed over remote KGRAM endpoints):
  PREFIX foaf: <...>
  PREFIX dbpedia: <...>
  SELECT DISTINCT ?x ?name ?date WHERE {
    ?x foaf:name ?name .
    ?x dbpedia:birthdate ?date .
    FILTER (CONTAINS (?name, "Bobby A"))
  }
Rewritten sub-query (Listing 4.4: Generated SPARQL query encapsulating a single edge request through the naive rewriting strategy):
  PREFIX foaf: <...>
  CONSTRUCT { ?x foaf:name ?name } WHERE { ?x foaf:name ?name . }
[...] results behind federated endpoints, and thus the load of their processing by the federator.
Optimized sub-query (Listing 4.5: Optimized SPARQL query encapsulating [...] rewriting strategy):
  PREFIX foaf: <...>
  CONSTRUCT { ?x foaf:name ?name } WHERE {
    ?x foaf:name ?name .
    FILTER (CONTAINS (?name, "Bobby A"))
  }
14
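The filter-pushing rewriting above can be sketched in a few lines of Python. This is an illustrative sketch, not the KGRAM-DQP rewriter: the helper names `variables` and `rewrite` are hypothetical, triple patterns and filters are handled as plain strings, and "applicable" is taken to mean that every variable of the filter appears in the triple pattern. Variable detection by regular expression is a simplification that would be fooled by a `?` inside a string literal.

```python
import re


def variables(text):
    """Return the SPARQL variables (?name) occurring in a query fragment."""
    return set(re.findall(r"\?\w+", text))


def rewrite(pattern, filters, push_filters=True):
    """Wrap one triple pattern in a CONSTRUCT sub-query.

    With push_filters=True, every filter whose variables are all bound by
    the pattern is copied into the sub-query, so that irrelevant results
    are discarded at the remote endpoint.
    """
    body = pattern + " ."
    if push_filters:
        for f in filters:
            if variables(f) <= variables(pattern):
                body += " FILTER (" + f + ")"
    return "CONSTRUCT { " + pattern + " } WHERE { " + body + " }"


# Usage on the two triple patterns of the input query.
filters = ['CONTAINS(?name, "Bobby A")']
name_query = rewrite("?x foaf:name ?name", filters)
date_query = rewrite("?x dbpedia:birthdate ?date", filters)
naive_query = rewrite("?x foaf:name ?name", filters, push_filters=False)
```

Only the `foaf:name` sub-query receives the filter; the `dbpedia:birthdate` pattern does not bind `?name`, so its sub-query is left unchanged.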

37 Dynamic optimization: pushing values
Avoid re-evaluation by exploiting intermediate results (communication of already known values saved) [Bind joins]. Replacing variables by their known values for each single triple pattern.
Input SPARQL query (Listing 4.3: Full SPARQL query distributed over remote KGRAM endpoints):
  PREFIX foaf: <...>
  PREFIX dbpedia: <...>
  SELECT DISTINCT ?x ?name ?date WHERE {
    ?x foaf:name ?name .
    ?x dbpedia:birthdate ?date .
    FILTER (CONTAINS (?name, "Bobby A"))
  }
Listing 4.4: Generated SPARQL query encapsulating a single edge request through the naive rewriting strategy:
  PREFIX foaf: <...>
  CONSTRUCT { ?x foaf:name ?name } WHERE { ?x foaf:name ?name . }
Intermediate result: ?x = <...>
15

38 Dynamic optimization: pushing values
Avoid re-evaluation by exploiting intermediate results (communication of already known values saved) [Bind joins]. Replacing variables by their known values for each single triple pattern.
Input SPARQL query (Listing 4.3: Full SPARQL query distributed over remote KGRAM endpoints):
  PREFIX foaf: <...>
  PREFIX dbpedia: <...>
  SELECT DISTINCT ?x ?name ?date WHERE {
    ?x foaf:name ?name .
    ?x dbpedia:birthdate ?date .
    FILTER (CONTAINS (?name, "Bobby A"))
  }
Intermediate result: ?x = <...>
Rewritten sub-query (Listing 4.7: Optimized SPARQL query encapsulating [...] rewriting strategy):
  PREFIX dbpedia: <...>
  CONSTRUCT { ?x dbpedia:birthdate ?date } WHERE { ?x dbpedia:birthdate ?date }
Listing 4.8: Optimized SPARQL query encapsulating [...] rewriting strategy:
  PREFIX dbpedia: <...>
  CONSTRUCT { <...> dbpedia[...] } WHERE { <...> dbpedia[...] }
15

39 Dynamic optimization: pushing values
Avoid re-evaluation by exploiting intermediate results (communication of already known values saved) [Bind joins]. Replacing variables by their known values for each single triple pattern.
Input SPARQL query (Listing 4.3: Full SPARQL query distributed over remote KGRAM endpoints):
  PREFIX foaf: <...>
  PREFIX dbpedia: <...>
  SELECT DISTINCT ?x ?name ?date WHERE {
    ?x foaf:name ?name .
    ?x dbpedia:birthdate ?date .
    FILTER (CONTAINS (?name, "Bobby A"))
  }
Intermediate result: ?x = <...>
Rewritten sub-query (Listing 4.7: Optimized SPARQL query encapsulating a single edge [...] rewriting strategy):
  PREFIX dbpedia: <...>
  CONSTRUCT { ?x dbpedia:birthdate ?date } WHERE { ?x dbpedia:birthdate ?date }
Optimized sub-query (Listing 4.8: Optimized SPARQL query encapsulating a single edge [...] rewriting strategy):
  PREFIX dbpedia: <...>
  CONSTRUCT { <...> dbpedia:birthdate ?date } WHERE { <...> dbpedia:birthdate ?date }
The experiments presented in section 4.3 will show that [...] 15
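The value-pushing (bind join) step can be sketched as follows. This is a simplified illustration, not the KGRAM-DQP code: `bind` and `bound_subqueries` are hypothetical helper names, triple patterns are whitespace-separated strings, and the URI `<http://example.org/person/1>` is a made-up placeholder for the intermediate value that is not legible on the slide.

```python
def bind(pattern, binding):
    """Replace each variable of a triple pattern by its known value, if any."""
    return " ".join(binding.get(term, term) for term in pattern.split())


def bound_subqueries(pattern, intermediate_results):
    """Generate one CONSTRUCT sub-query per distinct intermediate binding."""
    seen, queries = set(), []
    for binding in intermediate_results:
        bound = bind(pattern, binding)
        if bound not in seen:          # do not ask the same thing twice
            seen.add(bound)
            queries.append("CONSTRUCT { %s } WHERE { %s }" % (bound, bound))
    return queries


# Usage: ?x is already known from the evaluation of the foaf:name pattern.
intermediate = [{"?x": "<http://example.org/person/1>"},
                {"?x": "<http://example.org/person/1>"}]
queries = bound_subqueries("?x dbpedia:birthdate ?date", intermediate)
```

Each generated sub-query targets a single subject, so an endpoint returns only the triples that can join with what the federator already holds.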

40 Experiment: large-scale benchmarking (1/2)
Objective: performance assessment.
Material and Methods: FedBench from the FedX team (50M triples ; 7 life-science SPARQL queries); Grid'5000 Computing Infrastructure; FedX + Fuseki endpoints; KGRAM-DQP.
FedBench Life-Science datasets:
  Data source   Linked Data collection   Size (triples)
  #1            ChEBI                    7.3M
  #2            DBpedia sub-set #1       25.3M
  #3            DBpedia sub-set #2       18.3M
  #4            DrugBank                 0.7M
  #5            KEGG Drug                1M
Table 8.4: Updated FedBench Life Science data collections (52M triples) fragmented ov[...]
FedBench Life-Science query #7 (Listing 10.7: LS7 FedBench query):
  SELECT $drug $transform $mass WHERE {
    { $drug <...berlin.de/drugbank/resource/drugbank/affectedorganism> "Humans and other mammals" .
      $drug <...berlin.de/drugbank/resource/drugbank/casregistrynumber> $cas .
      $keggdrug <...> $cas .
      $keggdrug <...> $mass
      FILTER ( $mass > 5 ) }
    OPTIONAL { $drug <...berlin.de/drugbank/resource/drugbank/biotransformation> $transform . }
  }
[...] was reserved for the execution of the FedX federation engine, the othe[...] reserved to expose the 5 data sources through a Fuseki SPARQL endpoint. It has to be noted that due to some incompatibilities, it has not been po[...] FedX with OpenRDF-Sesame SPARQL endpoints (versions [...] and [...] described in [Schwarte et al., 2011]).
To evaluate KGRAM, we deployed a similar environment as previou[...] experiment 1 through the physical federation. We reserved 6 nodes of the S[...] cluster, one of which was dedicated to the KGRAM federation engine, an[...]ing nodes exposing the 5 data sources through KGRAM endpoints. 16

41 Experiment: large-scale benchmarking (2/2) Real distributed computing infrastructure Mean evaluation time over 10 runs 17

42 Experiment: large-scale benchmarking (2/2) Real distributed computing infrastructure Mean evaluation time over 10 runs (plot annotations: 50%, timeout, variability) 17

43 Highlights & short-term perspectives
Highlights: Transparent federated semantic querying. No prior knowledge on data source content. Performance between DARQ / Splendid and FedX [Distribution / Dynamicity]. Expressive approach: SPARQL 1.1 support (Optional, Negation, Property path, aggregates) [Scalability] [Knowledge].
Short-term perspectives: Coarse-grain DQP (dynamic triple pattern grouping in SERVICE clauses): prototype algorithm, but possibly ineffective (query planning). Relational database mediation: prototype SQL data producer in KGRAM-DQP [Scalability] [Heterogeneity]. 18

44 Contribution 2 Knowledge Production (for e-science platforms)

45 Scientific workflow issues 20

46 Scientific workflow issues 1. Editing data-links between processes? 20

47 Scientific workflow issues 1. Editing data-links between processes? 2. Identifying the cause of failures or atypical results 20

48 Scientific workflow issues 1. Editing data-links between processes? 2. Identifying the cause of failures or atypical results Knowledge-oriented WF environments ease workflow design propagate knowledge on results 20

49 Design issues Workflow design issue, close-up (diagram): two MRI inputs, Registration, Matrix, Re-sampling, MRI output. Several natures of treatment or data, not explicit at technical level. Only considering nature: ambiguity. 21

50 Design issues Workflow design issue, close-up (diagram): MRI with "reference" role and MRI with "patient" role, Registration, Matrix, Re-sampling, MRI output. Several natures of treatment or data, not explicit at technical level. Only considering nature: ambiguity. Need for Roles to relate data to processing tools! 21

51 Runtime issues Results exploitation issue, close-up (diagram): provenance graph relating Registration and Re-sampling to their data through used and wasGeneratedBy edges, plus a domain-specific can-be-superimposed-with relation. Need for non-ambiguous service annotations to produce new domain-specific statements. 22

52 Issues & Objectives Issues: (i) How to make the semantics of data processing explicit? (ii) How to benefit from this knowledge at experiment design-time and at experiment runtime? Objectives: (i) reduce the complexity of designing an e-science experiment (workflow) ; (ii) improve the exploitation of results produced during data-intensive experiments. 23

53 Methods Several kinds of knowledge: Technical knowledge (OWL-S, OPM) ; Domain knowledge: 1.Nature of data and services ; 2.Role of data from the service point of view. Our contribution: 1.Domain-specific Role Taxonomy: clarifying bindings between technical service descriptions and domain concepts ; 2.Produce new valuable knowledge through inferences along platform exploitation. Supported by the OntoNeuroLOG domain ontology and the OPM provenance ontology. 24

55 Neuroimaging Role taxonomy
Domain-specific extension of the OPM Role class. Roles to disambiguate the annotation of service parameters.
[...] concepts and Role concepts when annotating semantic service parameters by relying on a domain-specific role taxonomy. Figure 6.3 illustrates the taxonomy of roles dedicated to the characterization of the relationships between neuroimaging data and their dedicated processing. Role concepts are organized following the main classes of neuroimaging processing, similarly to the OntoNeuroLOG dataset processing ontology.
Figure 6.3: A domain-specific role taxonomy characterizing how neuroimaging data can be related to neuroimaging processing tools.
A. Gaignard, PhD defense, 15 March 2013, Sophia Antipolis 25

56 Neuroimaging Role taxonomy
Domain-specific extension of the OPM Role class. Roles to disambiguate the annotation of service parameters.
Diagram: Registration with roles As-reference, As-floating and As-transformation; Re-sampling with roles As-unprocessed, As-transformation and As-resampled.
Figure 6.3: A domain-specific role taxonomy characterizing how neuroimaging data can be related to neuroimaging processing tools.
This taxonomy illustrates another example of disambiguation in the context of resampling processes. Indeed, the two roles As-affine-transformation and As-transformation-field specify how a matrix should be interpreted by a resampling process. If we consider two 3 × 3 matrices, they could share the same nature and representation format. However, one could be interpreted as a set of parameters for translation, rotation and scaling, in the context of an affine geometrical transformation, whereas the other one could be interpreted as a deformation field in the context of a non-rigid transformation.
Relying on this taxonomy of roles, we are now able to precisely annotate the input and output parameters of our image registration service considered in the running example (figure 6.2) with both Natural and Role concepts. Both input images are characterized by the same Natural concept, T1 weighted magnetic resonance image (T1-MR). T1-MR can be considered as a Natural concept because it stands on its own and does not characterize how input data are related to any other entities.
On the other hand, service input parameters can be annotated with two distinct Role concepts to characterize how input data are related to the registration process. The service input parameter interpreting data as floating (the moving data, that will finally be realigned) is annotated with role As-floating-[...] 25

57 Inference rule example
Inference rules to produce semantic annotations: provenance-based knowledge propagation.
We propose in this chapter a methodology for producing and deducing new meaningful statements. If we consider the result of the registration workflow p[...] in Figure 6.1, it would be interesting to associate the atlas used as input in the [...]tion process to the registered image produced. More generally, our approach [...] the propagation of the effect of services (or sub-parts of workflow) to the produc[...]. For instance, we would like to automate the generation of a fact saying that a [...] can be superimposed with another one, because in some cases, processing too[...] require that their input data are expressed in the same coordinate system, and th[...] beforehand been registered.
Rule illustrated on the slide (Registration is an instance of Registration-Class, Resampling of Resampling-Class):
  Registration used Atlas (As_reference_image)
  Matrix wasGeneratedBy Registration (As_affine_transformation)
  Resampling used Matrix (As_affine_transformation)
  Result wasGeneratedBy Resampling (As_resampled_image)
  inferred: Atlas can_be_superimposed_with Result
Figure 6.2: Linking data and processes through generic and domain-specific relations. 26
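The rule on this slide can be read as a pattern match over role-annotated provenance edges. The sketch below is one possible reading of the slide, not the rule engine used in the thesis: provenance is reduced to tuples `(subject, relation, object, role)`, the function name `infer_superimposable` is hypothetical, and the assumption that Registration is the process using the atlas as reference image comes from the role diagram of the previous slide.

```python
def infer_superimposable(edges):
    """Infer can_be_superimposed_with facts from role-annotated provenance.

    edges: iterable of (subject, relation, object, role) where
      (process, "used", artifact, role) and
      (artifact, "wasGeneratedBy", process, role) follow OPM conventions.
    """
    used = [(s, o, r) for s, rel, o, r in edges if rel == "used"]
    generated = [(s, o, r) for s, rel, o, r in edges if rel == "wasGeneratedBy"]
    inferred = set()
    for registration, atlas, role in used:
        if role != "As_reference_image":
            continue
        for matrix, process, m_role in generated:
            if process != registration or m_role != "As_affine_transformation":
                continue
            for resampling, artifact, u_role in used:
                if artifact != matrix or u_role != "As_affine_transformation":
                    continue
                for result, process2, r_role in generated:
                    if process2 == resampling and r_role == "As_resampled_image":
                        inferred.add((atlas, "can_be_superimposed_with", result))
    return inferred


# Usage on the running example of the slide.
provenance = [
    ("Registration", "used", "Atlas", "As_reference_image"),
    ("Matrix", "wasGeneratedBy", "Registration", "As_affine_transformation"),
    ("Resampling", "used", "Matrix", "As_affine_transformation"),
    ("Result", "wasGeneratedBy", "Resampling", "As_resampled_image"),
]
facts = infer_superimposable(provenance)
```

Without the role annotations the four edges would be ambiguous (the matrix could be a deformation field, the atlas could be the floating image), which is why the rule tests roles and not only the used / wasGeneratedBy relations.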

58 Experiment: inferring VIP experiment summaries (real-life)
Objectives: inferring meaningful experiment summaries from WF runs & domain knowledge; coping with provenance as distributed Linked Data.
Material & Methods: VIP e-science platform (Moteur WF engine ; OntoVIP ontology); service annotations (Roles), OPM provenance, inference rules.
Figure 1.2 (diagram labels: VIP Portal, VIP Execution Service, simulation workflows, VIP Data Service, organ models, simulated data, distributed computing infrastructure): The VIP platform, easing the access to medical image simulators, organ models, and leveraging the EGI distributed computing infrastructure to handle heavy simulation. 27

60 Inferring VIP experiment summaries (real-life)
8.4.2 Results and discussion. Semantic experiment summaries.
The main result of this experiment is a [...] meaningful statements inferred from the execution of a medical image simulation experiment. These new statements provide a high-level, and concise semantic experiment summary. We consider the experiment summary as a high-level descript[...] only involves domain-specific classes and properties defined in the VIP ontology, compared to the generic and technical entities provided by the OPM provenanc[...]. We also consider the experiment summary as concise since only 7 statements [...] produced, compared to the 15 thousand statements produced through the M[...] provenance plugin.
Fine-grained & technical provenance (workflow labels): phantom, protocol, parsetextprotocol, compileprotocole, generatejobs, sorteosingles (6 invocations), sorteoemission (6 invocations), Lmf2RawSino, sinogram.
Inference rules lead to coarse-grained & meaningful provenance (summary labels): phantom rdf:type PET-simulation-compatible-model; protocol rdf:type Parameter-set; PET-Simulation is-a Simulation; Simulation workflow run; sinogram is-a PET-Sinogram, is-a Simulated-data; relations derives-from-model, derives-from-parameter-set, is-a-result-of-at.
Figure 8.11: New inferred meaningful statements (dashed arrows) constituting the semant[...] summary. 28

61 Inferring VIP experiment summaries: material & methods
Inference rule:
  PREFIX rdf: <...rdf-syntax-ns#>
  PREFIX rdfs: <...schema#>
  PREFIX opmo: <...>
  PREFIX opmv: <...>
  PREFIX ws: <...service-owl-lite.owl#>
  PREFIX iec: <...owl-lite.owl#>
  PREFIX vip-model: <...model.owl#>
  PREFIX vip-simulation: <...simulation.owl#>
  PREFIX vip-simulated-data: <...simulated-data.owl#>
  CONSTRUCT {
    ?out vip-model:derives-from-model ?inphantom
    #...
  } WHERE {
    ?agent (iec:refers-to/rdf:type) vip-simulation:image-reconstruction-simulator-component .
    ?wcb opmo:cause ?agent .
    ?wcb opmo:effect ?x .
    ?x rdf:type opmv:process .
    ?wgb opmo:cause ?x .
    ?wgb opmo:effect ?out .

    ?agent2 (iec:refers-to/rdf:type) vip-simulation:parameters-generation-simulator-component .
    ?wcb2 opmo:cause ?agent2 .
    ?wcb2 opmo:effect ?y .
    ?y rdf:type opmv:process .

    ?used1 opmo:cause ?inphantom .
    ?used1 opmo:effect ?y .
    ?used1 opmo:role/rdfs:label ?techrolephantom .
    ?agent2 ws:has-input ?inportphantom .
    ?inportphantom (iec:refers-to/rdf:type) vip-model:geometrical-phantom-object-model .
    ?inportphantom rdfs:comment ?techrolephantom .
    ?inphantom opmo:avalue ?vinphantom .
    ?vinphantom opmo:content ?cinphantom .
    #...
  }
Inferred meaningful experiment summary (labels): phantom rdf:type PET-simulation-compatible-model; protocol rdf:type Parameter-set; PET-Simulation is-a Simulation; Simulation workflow run; sinogram is-a PET-Sinogram, is-a Simulated-data; relations derives-from-model, derives-from-parameter-set, is-a-result-of-at.
Figure 8.11: New inferred meaningful statements (dashed arrows) constituting the semant[...] summary. 29

62 Inferring VIP experiment summaries: results
Semantic experiment summaries (8.4. A real-life medical imaging simulation workflow: semantic mash-up to infer meaningful experiment summaries)
BIG, fine-grained, meaningless provenance: the resulting graph is composed of 4523 nodes and edges. Figure 8.10 represents a simplified graph in which some nodes have been removed, such as the unique instance of OPM Account, which allows retrieving all instances generated in the context of a single workflow execution. From this simplified graph we can distinguish two main nodes, including …oteur.processor/sorteo_singles, which correspond to services with a large number of invocations.
Figure 8.10: A filtered OPM provenance graph with removed rdf:type properties for the main OPM classes such as Artifact, Used, WasGeneratedBy, etc.
Due to its fine granularity and its size, the OPM model leads to complex graphs involving large amounts of generic and technical elements. Interpreting these OPM graphs is difficult. To address this issue, we segmented the produced semantic annotations through two distinct semantic repositories. First, a short-term repository temporarily stores OPM statements, which are the necessary input data to infer new meaningful statements. Second, a long-term repository permanently stores the new statements resulting from inferences involving domain-specific entities provided by the VIP ontology.
FEW meaningful statements, results interpretation: the main result of this experiment is a set of meaningful statements inferred from the execution of a medical image simulation experiment. These new statements provide a high-level and concise semantic experiment summary. We consider the experiment summary high-level because it only involves domain-specific classes and properties defined in the VIP ontology, compared to the generic and technical entities provided by the OPM provenance. We also consider it concise because only 7 statements were produced, compared to the 15 thousand statements produced through the provenance plugin.
Figure 8.11: New inferred meaningful statements (dashed arrows) constituting the semantic experiment summary. Labels in the figure: PET-simulationcompatible-model, Parameter-set, Simulation, PET-Simulation, Simulated-data, PET-Sinogram, phantom, protocol, sinogram, workflow run; properties rdf:type, is-a, derives-from-parameter-set, derives-from-model, is-a-result-of-at. 30
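The two-repository approach described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in VIP: triples are plain Python tuples, the OPM Used and WasGeneratedBy edges are flattened into direct properties, and every name (`opm:used`, `vip:Model`, `infer_summary`, the `run42/...` identifiers) is hypothetical. The single toy rule only loosely mirrors the `derives-from-model` property visible in Figure 8.11.

```python
# Hypothetical vocabulary and identifiers, for illustration only.
Triple = tuple[str, str, str]


def infer_summary(short_term: set[Triple]) -> set[Triple]:
    """Toy rule: if a process used a model and generated an artifact,
    state that the artifact derives from that model."""
    used = {(s, o) for s, p, o in short_term if p == "opm:used"}
    generated = {(s, o) for s, p, o in short_term if p == "opm:wasGeneratedBy"}
    models = {s for s, p, o in short_term
              if p == "rdf:type" and o == "vip:Model"}

    summary: set[Triple] = set()
    for artifact, process in generated:
        for proc, used_input in used:
            if proc == process and used_input in models:
                summary.add((artifact, "vip:derives-from-model", used_input))
    return summary


# Short-term repository: fine-grained provenance of one workflow run.
short_term: set[Triple] = {
    ("run42/sino", "opm:wasGeneratedBy", "run42/sorteo"),
    ("run42/sorteo", "opm:used", "phantom1"),
    ("run42/sorteo", "opm:used", "run42/tmpfile"),
    ("phantom1", "rdf:type", "vip:Model"),
}

# Long-term repository: only the inferred domain-level statements are kept.
long_term: set[Triple] = set()
long_term |= infer_summary(short_term)
short_term.clear()  # provenance is discarded once the summary is inferred

print(long_term)
```

The design point is that the long-term store never sees the technical provenance: it only receives what the rules produce, which is why it stays small.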

63 Inferring VIP experiment summaries: results
Distributed linked provenance data & inference rules: Grid'5000 infrastructure (3 OPM data sources) + KGRAM-DQP.
Reusable inference rules: they adapt to simulator component evolutions; they do not adapt to workflow structure evolutions.
Figure 8.12: Updated Sorteo workflow involving a refined Lmf2RawSino service. Labels in the figure: phantom, protocol, compileprotocole, parsetextprotocol, generatejobs, sorteosingles, sorteoemission, Lmf2RawSino, sinogram; two subsumedby links involving Lmf2RawSino_v2 and Lmf2RawSino_v3. 31
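One way to picture why a rule survives a component evolution is subsumption-aware matching: a rule written against Lmf2RawSino also fires for a refined version declared as subsumed by it. The sketch below is an assumption-laden illustration, not the KGRAM rule engine. The `SUBSUMED_BY` table, its direction (v3 under v2 under Lmf2RawSino) and the function names are hypothetical; the slide only shows that two subsumedby links exist.

```python
# Hypothetical hierarchy: child service -> parent service (assumed direction).
SUBSUMED_BY = {
    "Lmf2RawSino_v2": "Lmf2RawSino",
    "Lmf2RawSino_v3": "Lmf2RawSino_v2",
}


def is_subsumed_by(service: str, target: str) -> bool:
    """Walk up the hierarchy until the target is found or the root is passed."""
    current = service
    while current is not None:
        if current == target:
            return True
        current = SUBSUMED_BY.get(current)
    return False


def rule_applies(invoked_service: str, rule_service: str = "Lmf2RawSino") -> bool:
    """A rule written for rule_service also matches its refinements."""
    return is_subsumed_by(invoked_service, rule_service)


print(rule_applies("Lmf2RawSino_v3"))  # refined component: rule still fires
print(rule_applies("sorteosingles"))   # unrelated component: rule does not fire
```

A change to the workflow structure (for example a new intermediate step) is not covered by this mechanism, which matches the limitation stated on the slide.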

64 Inferring VIP experiment summaries: results
1 week of VIP operation / 18 possible inference rules:
118 simulations (15K triples each): 1.7 M triples
118 experiment summaries: 2656 triples
Simulation types: US simulations, MR simulations, CT simulations
Scalability 32
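The scalability argument rests on simple arithmetic, which can be checked from the figures on the slide. The variable names are illustrative, and 15K triples per simulation is the rounded value shown on the slide, so the results are approximate.

```python
simulations = 118
triples_per_simulation = 15_000  # "15K triples each", rounded on the slide
summary_triples = 2656           # total size of the 118 experiment summaries

provenance_triples = simulations * triples_per_simulation
reduction_factor = provenance_triples / summary_triples
triples_per_summary = summary_triples / simulations

print(f"provenance triples: {provenance_triples}")
print(f"reduction factor:   {reduction_factor:.0f}x")
print(f"triples per summary: {triples_per_summary:.1f}")
```

With these inputs the raw provenance comes to about 1.77 million triples, in line with the 1.7 M on the slide, while the summaries are several hundred times smaller and average a little over twenty triples each.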

65 Highlights & short-term perspectives
Highlights: clear delineation between Role and Natural concepts; domain ontology at workflow design-time and run-time; scalable annotation of analyzed data through semantic experiment summaries; reusable inference rules.
Short-term perspectives: integration of neuro-imaging roles in a sound domain ontology; from OPM ontology to PROV-O; publishing experiment summaries as Linked Open Data. 33

66 Summary
Enhance e-science platforms with Knowledge Engineering (and Semantic Web technologies).
Scalable and expressive Knowledge Sharing approach through distributed query processing techniques and abstract knowledge graphs.
Smart Knowledge Production: "few but meaningful data".
Deployment into real-life platforms: 2 software tools (NeuSemStore and KGRAM-DQP) in production in 2 ANR projects (NeuroLOG and VIP). 34

67 Future directions
1. Towards high performance federated semantic querying: triple pattern grouping & query planning; "elastic" SPARQL endpoint for massive knowledge graphs.
2. Towards highly expressive federated semantic querying: FedBench extensions with more expressive queries; towards distributed reasoning (optimal plan for inferences? materialization?).
3. Towards versatile and reliable knowledge base federations: R2RML-based mediation of SQL databases; generalized provenance, from processed data to the originating data sources (explanation).
4. Towards reduced information overload in e-science: semantic experiment summaries & (goal-driven) conceptual workflows [Cerezo et al., 2011]; eased inference rules design by relying on WF goals; annotated data to help in WF design. 35
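To make "triple pattern grouping" concrete, here is a minimal sketch of one common strategy, sometimes called exclusive grouping: patterns that only a single source can answer are bundled and sent to that source as one sub-query, while patterns relevant to several sources are handled individually. This is an assumption about what such grouping could look like, not a description of KGRAM-DQP. The function name, pattern tuples and site names are all hypothetical.

```python
from collections import defaultdict


def group_patterns(patterns, relevant_sources):
    """Bundle patterns answerable by exactly one source; keep the rest apart.

    patterns: list of (subject, predicate, object) tuples.
    relevant_sources: dict mapping each pattern to the set of sources
    that may hold matching triples (assumed to be known beforehand).
    """
    exclusive = defaultdict(list)
    shared = []
    for pattern in patterns:
        sources = relevant_sources[pattern]
        if len(sources) == 1:
            exclusive[next(iter(sources))].append(pattern)
        else:
            shared.append((pattern, sorted(sources)))
    return dict(exclusive), shared


p_type = ("?d", "rdf:type", "ex:Dataset")
p_subject = ("?d", "ex:hasSubject", "?s")
p_age = ("?s", "ex:hasAge", "?a")

groups, shared = group_patterns(
    [p_type, p_subject, p_age],
    {p_type: {"siteA"}, p_subject: {"siteA"}, p_age: {"siteA", "siteB"}},
)
print(groups)  # patterns bundled per exclusive source
print(shared)  # patterns that must be evaluated against several sources
```

Bundling the first two patterns lets siteA evaluate their join locally, so fewer intermediate results cross the network.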

68 Thank you!
O. Corby, A. Gaignard, C. Faron Zucker, J. Montagnat. KGRAM versatile data graphs querying and inference engine, WI'12 (International Conference on Web Intelligence), Macao.
A. Gaignard, J. Montagnat, B. Wali, B. Gibaud. Characterizing semantic service parameters with Role concepts to infer domain-specific knowledge at runtime, KEOD'11 (International Conference on Knowledge Engineering and Ontology Development), Paris.
A. Gaignard, J. Montagnat, C. Faron Zucker, O. Corby. Semantic Federation of Distributed Neurodata, MICCAI-DCICTAI workshop (Data- and Compute-Intensive Clinical and Translational Imaging Applications), Nice.
A. Gaignard, J. Montagnat, C. Faron Zucker, O. Corby. Fédération multi-sources en neurosciences : intégration de données relationnelles et sémantiques, IC'12 (Ingénierie des Connaissances), workshop "Ingénierie des connaissances pour l'inter-opérabilité sémantique en e-santé", Paris.
T. Glatard, C. Lartizien, B. Gibaud, R. Ferreira da Silva, G. Forestier, F. Cervenansky, M. Alessandrini, H. Benoit-Cattin, O. Bernard, S. Camarasu-Pop, N. Cerezo, P. Clarysse, A. Gaignard, P. Hugonnard, H. Liebgott, S. Marache, A. Marion, J. Montagnat, J. Tabary and D. Friboulet. A Virtual Imaging Platform for multi-modality medical image simulation, IEEE Transactions on Medical Imaging (TMI), 32 (1), 2013.