Simon Miles, King's College London, Architecture Tutorial



Common provenance questions across e-science experiments
Simon Miles, King's College London

Outline
- Gathering Provenance Use Cases
- Our View of Provenance
- Sample Case Studies
- Generalised Questions about Provenance
- General Issues in Creating an Infrastructure for Provenance
- Further Details

Projects and Colleagues
- Started gathering use cases 5 years ago
- Provenance-Aware Service-Oriented Architecture (PASOA) project
- EU Provenance project
- Case studies used as requirements for a general infrastructure, a subset of which was implemented
- Many collaborators, and particularly:
  - Luc Moreau, University of Southampton
  - Paul Groth, Information Sciences Institute, University of Southern California

Gathering Use Cases about Provenance

Methodology
- Describe the idea and model of provenance
- Give others use cases as examples
- Ask what information (about the past) is being determined and/or used in each case

Use Case Expression
- Describe preceding actions by scientist(s)
- State what the scientist determines
- Example: A bioinformatician, B, downloads sequence data of a human chromosome from GenBank and performs an experiment. B later performs the same experiment on data of the same chromosome, again downloaded from GenBank. B compares the two experiment results and notices a difference. B determines whether the difference was caused by the experimental process or configuration having been changed, or by the chromosome data being different (or both).
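To make the GenBank use case concrete, here is a minimal sketch of how B might tell the possible causes apart. It assumes each run records a fingerprint of its input data, script and configuration. The `RunRecord` class, its field names and the sample byte strings are hypothetical, not part of any infrastructure described in this talk.

```python
import hashlib
from dataclasses import dataclass


def digest(content: bytes) -> str:
    """Fingerprint content so two runs can be compared cheaply."""
    return hashlib.sha256(content).hexdigest()


@dataclass(frozen=True)
class RunRecord:
    """Hypothetical record of one experiment run (field names are assumptions)."""
    input_digest: str    # fingerprint of the downloaded sequence data
    script_digest: str   # fingerprint of the experiment script
    config_digest: str   # fingerprint of the configuration


def record_run(input_data: bytes, script: bytes, config: bytes) -> RunRecord:
    return RunRecord(digest(input_data), digest(script), digest(config))


def explain_difference(first: RunRecord, second: RunRecord) -> list[str]:
    """List which recorded aspects differ between two runs."""
    causes = []
    if first.input_digest != second.input_digest:
        causes.append("input data")
    if first.script_digest != second.script_digest:
        causes.append("experimental process")
    if first.config_digest != second.config_digest:
        causes.append("configuration")
    return causes


# Invented sample content: only the sequence data changes between the runs.
run1 = record_run(b"ACGT", b"analyse.sh v1", b"window=10")
run2 = record_run(b"ACGA", b"analyse.sh v1", b"window=10")
print(explain_difference(run1, run2))
```

An empty list would mean that none of the recorded aspects changed, so the cause lies in something the records do not capture.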

Provenance

What Provenance Is
- Oxford English Dictionary:
  - the fact of coming from some particular source or quarter; origin, derivation
  - the history or pedigree of a work of art, manuscript, rare book, etc.; concretely, a record of the passage of an item through its various owners
- Provenance is important for:
  - Interpretation
  - Judging value

Causation
- Everything that is part of the provenance of an item is a cause of that item being as it is
- For example, the provenance of a bottle of wine includes:
  - Grapes from which it is made
  - Where those grapes grew
  - Steps in the wine's preparation
  - How the wine was stored
  - Between which parties the wine was transported, e.g. producer to distributor to retailer

Causal graphs
(Figure built up over several slides; the final graph contains these connections.)
- Donor Organ Decision: Yes (decision based on Family Consent Decision: Yes and Blood Test Results: -ve)
- Family Consent Decision: Yes (response to Family Consent Request: 432)
- Blood Test Results: -ve (response to Blood Test Request: 432)
- Family Consent Request: 432 (triggered by Patient Brain Death: PID 432)
- Blood Test Request: 432 (triggered by Patient Brain Death: PID 432)
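The organ donation causal graph above can be sketched as a small data structure, with the provenance of an item being everything reachable by following its causes. The dictionary layout and the `provenance` function are illustrative assumptions, not the representation used by PASOA or the EU Provenance project.

```python
# Each effect maps to (relation, cause) pairs, as read from the slide.
CAUSES = {
    "Donor Organ Decision: Yes": [
        ("decision based on", "Family Consent Decision: Yes"),
        ("decision based on", "Blood Test Results: -ve"),
    ],
    "Family Consent Decision: Yes": [("response to", "Family Consent Request: 432")],
    "Blood Test Results: -ve": [("response to", "Blood Test Request: 432")],
    "Family Consent Request: 432": [("triggered by", "Patient Brain Death: PID 432")],
    "Blood Test Request: 432": [("triggered by", "Patient Brain Death: PID 432")],
}


def provenance(item, graph=CAUSES):
    """Return every direct and indirect cause of item (depth-first traversal)."""
    seen = []
    stack = [item]
    while stack:
        current = stack.pop()
        for _relation, cause in graph.get(current, []):
            if cause not in seen:  # a cause reached by two paths is listed once
                seen.append(cause)
                stack.append(cause)
    return seen


for cause in provenance("Donor Organ Decision: Yes"):
    print(cause)
```

Keeping the relation label on each edge means a query can later be restricted to, say, only the "decision based on" links.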

Causal Connections
(Figure labels: Donation operation; Patient after donation; with two kidneys)
- Causes and effects are occurrences:
  - Occurrence of a process or event, or
  - Occurrence of a data item or physical artefact being in a particular state
- Counter-factual definition: the effect would not have occurred if the cause had not occurred

Sample Case Studies

Bioinformatics
- Klaus-Peter Zauner at the University of Southampton
- Analysing the complexity (information content) of gene and protein sequences
- Purely electronic experiment implemented as UNIX shell scripts calling local executables
- Inputs downloaded from RefSeq, GenBank
- Output data is a graph plot graphics file

Provenance Questions
Questions included:
- What sequences led to the production of this output graph?
- I ran what I thought was the same experiment (same configuration, same input data) on multiple occasions, but the output looks different. What was different?

Proteomics
- Centre for Proteomic Research at the University of Southampton
- Identifying proteins within biological samples
- A lab-based experiment to extract data used as evidence for identification
- Followed by a search of public and local databases for proteins matching this evidence

Provenance Questions
Questions included:
- What machine settings did I use in obtaining this successful identification (so I can try similar settings in a later experiment)?
- What was the perceived reliability of the pieces of evidence and database entries used to identify this protein?

Particle Physics
- ATLAS experiment at the Large Hadron Collider, CERN
- Identifying traces of particles produced by the collision of particles at high energies
- Much data processing, first at CERN, then by physicists around the world
- Much of the processing operates on large sets of data, of which only subsets may be used in any one experiment

Provenance Questions
Questions included:
- Has the data set from which I extracted the subset I am experimenting on been updated?
- Were these results produced by processing involving a version of a library now known to have bugs?

Organ Transplant Management
- Inter-hospital organ transplant management with software support
- Governed by the Catalan Health Authority
- Patients build up healthcare records through check-ups, tests, surgery
- When a donor dies, standardised procedures guide the transplant process, involving tests of the donor organ and recipient, and making use of healthcare records

Provenance Questions
Questions included:
- Who made the critical decisions which led to this donor organ being accepted/denied for transplantation?
- Where were the time lags in getting from donation to transplant?
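The time-lag question above can be illustrated with a short sketch, assuming every recorded event carries a timestamp. The event names and times below are invented for illustration and do not come from the Catalan case study.

```python
from datetime import datetime

# Hypothetical timestamped events for one transplant case (invented data).
events = [
    ("donation", datetime(2008, 1, 1, 8, 0)),
    ("blood test requested", datetime(2008, 1, 1, 8, 30)),
    ("blood test results", datetime(2008, 1, 1, 11, 30)),
    ("organ decision", datetime(2008, 1, 1, 12, 0)),
    ("transplant", datetime(2008, 1, 1, 16, 0)),
]


def time_lags(timeline):
    """Return (from_event, to_event, gap) for consecutive events, longest gap first."""
    ordered = sorted(timeline, key=lambda event: event[1])
    lags = [
        (a_name, b_name, b_time - a_time)
        for (a_name, a_time), (b_name, b_time) in zip(ordered, ordered[1:])
    ]
    return sorted(lags, key=lambda lag: lag[2], reverse=True)


lags = time_lags(events)
for start, end, gap in lags:
    print(f"{start} -> {end}: {gap}")
```

This treats the process as a single chain of events. In a real causal graph, lags would be measured along cause-effect edges rather than between events that merely follow one another in time.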

And more...
- Genetic diseases
- Aircraft simulation
- Police databases
- Social planning
- Chemicals and lasers
- Grid service reliability
- Brain image analysis
- Healthcare records
- Ecological simulation
- Medical images
- Aerospace aftersales
- Chemical prediction
- Galaxy formation
- Near-earth objects

Generalised Common Questions

Generalised Questions
- How did I (or someone else) come by this result? (genetic diseases, aerospace examples)
- What was common and relevant in the history of this set of successful outcomes? (proteomics, social planning examples)
- Was the process claimed to be performed the one which was actually performed? (organ transplant, chemistry examples)

Generalised Questions
- What inputs were used to derive this output? (bioinformatics, particle physics examples)
- What software produced this data? (particle physics, genetic diseases examples)
- Can I generalise from the process by which this result was produced to a reusable plan? (chemistry example)

Generalised Questions
- Were these regulations followed in producing this result? (proteomics, transplant examples)
- Are these two independent conclusions actually based on the same faulty assumption/input? (grid reliability, policing examples)
- What differed between the way these two results were produced? (social planning, bioinformatics examples)
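The question of whether two conclusions rest on the same assumption or input reduces to intersecting their ancestors in the derivation graph. The following is a minimal sketch under that reading. The `DERIVED_FROM` records and all item names are hypothetical.

```python
# Hypothetical derivation records: each item maps to the items it was derived from.
DERIVED_FROM = {
    "conclusion A": ["analysis 1"],
    "conclusion B": ["analysis 2"],
    "analysis 1": ["sensor reading X", "assumption Q"],
    "analysis 2": ["survey Y", "assumption Q"],
}


def ancestors(item, graph):
    """Collect everything the item was directly or indirectly derived from."""
    found = set()
    pending = [item]
    while pending:
        for source in graph.get(pending.pop(), []):
            if source not in found:
                found.add(source)
                pending.append(source)
    return found


def shared_sources(first, second, graph=DERIVED_FROM):
    """Return the sources that appear in the history of both items."""
    return ancestors(first, graph) & ancestors(second, graph)


print(shared_sources("conclusion A", "conclusion B"))
```

The same two ancestor sets also address the neighbouring question of what differed between two results: their symmetric difference lists the sources used by one result but not the other.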

Generalised Questions
- Were tools or services used in a meaningful way? (bioinformatics examples)
- What effect do the tools used have on my rights to patent or publish? (bioinformatics examples)
- Which inputs have a pronounced effect on the output? (social planning, galaxy formation examples)

Generalised Questions
- Were the inputs to this experiment of reliable quality? (chemical prediction, biodiversity examples)
- Who was the source of this decision or input fact? (organ transplant examples)

General Issues for Provenance Infrastructures

Infrastructure Issues
- Record or infer connections between data, processes, events (and plan in advance)
- Naming data, processes, events and states of artefacts that no longer exist
- Scalability of storage for large data sets
- Privacy infringement through the ability to infer
- Requirements to delete old data
- Querying vast causal graphs
- Post-processing for the most appropriate answers

Extra Resources

More Detail, More Use Cases
- "The Requirements of Using Provenance in e-Science Experiments" by Miles, Groth, Branco and Moreau, Journal of Grid Computing
- http://twiki.pasoa.ecs.soton.ac.uk (see the Use Cases section)
- http://www.gridprovenance.org (see the Applications section)

More Detail, More Use Cases
- The Provenance Challenge
  - First and Second used a brain image analysis case study
  - Third (current) uses a near-earth object detection (astronomy) case study
  - Workflow-oriented but trying to make connections with database provenance
- http://twiki.ipaw.info/

Credits
Thanks to the many who were interviewed and supplied the use cases (see papers and websites for all the credits)