Data advertising and management system for biobanks
  A use case for the eGenVar data management system
  Sabry Razick (24 October 2014, ESBB)
  Department of Cancer Research and Molecular Medicine
  Norwegian University of Science and Technology, Norway
  bigr.medisin.ntnu.no/data/egenvar/ www.ntnu.no
eGenVar data management system (EGDMS)
  Provides options for:
    managing data/metadata for internal and external reuse
    advertising these data without exposing the data themselves
  [Diagram labels: Samples, Biobank, Derived data, Associated data]
Data not accessed or moved
  Cataloguing and advertising
  Link LIMS and derived data
  Tags to describe data
  Establish connections using shared
    Data <-> Metadata
    Genotype <-> Phenotype
    Private <-> Public
    Description <-> Tag
    Processing <-> Relationships
    Big data <-> Location
Examples of information that can be attached as tags
  What to describe:
    What type of samples
    How they were collected
    Primary/secondary
    Processing performed
    How they are stored/preserved
    Protocols followed
    Techniques used
    Instruments used
    Phenotypes investigated
    File type
    Parameters
    Software
    Types of processing
    Relevant publications
  Tag sources:
    Experimental Factor Ontology (EFO)
    Ontology for Biomedical Investigations (OBI)
    Sample Processing and Separation Techniques Ontology (SEP)
    Research Resource Ontology
    Human Physiology Simulation Ontology (HuPSON)
    Human Disease Ontology (DOID)
    National Cancer Institute Thesaurus
  Stages tagged: Experiment, Raw data, Processed data, Interpretations
  Example tags:
    Blood specimen (OBI:0000655)
    venous blood (SNOMEDCT:53130003)
    fasting:(xco_0000102)
    1M-Duo Infinium HD BeadChip (OBI_0002006)
    Serum alanine aminotransferase (EFO_0004735)
    Colorimetric detection (SEP:00165)
    HDL cholesterol (SNOMEDCT/102737005)
    GenomeStudio V2011.1 (ION:225)
    ANOVA (SWO:0000014)
    linear mixed model: (SCAIVPH_00000687)
    cardiovascular system disease (DOID:1287)
    myocardial infarction(doid_5844)
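The example tags on this slide pair a free-text label with an ontology identifier, but the separator between ontology prefix and number varies (":", "_" and "/" all occur). The following is a minimal Python sketch of how such strings could be normalised into a common form. It is not part of EGDMS: the names Tag, parse_tag and group_by_source are hypothetical, and upper-casing the prefix is an assumption made only so that "doid" and "DOID" group together.

```python
import re
from typing import Dict, Iterable, List, NamedTuple, Optional


class Tag(NamedTuple):
    """Hypothetical normalised form of a tag as written on the slide."""
    label: str    # free-text label, e.g. "Blood specimen"
    source: str   # ontology prefix, upper-cased (assumption), e.g. "OBI"
    term_id: str  # numeric identifier, kept as a string to preserve leading zeros


# label, optional ":", then "(PREFIX<sep>DIGITS)" where <sep> is ":", "_" or "/"
TAG_PATTERN = re.compile(
    r"^\s*(?P<label>.+?)\s*:?\s*"
    r"\(\s*(?P<source>[A-Za-z]+)[:_/](?P<term_id>[0-9]+)\s*\)\s*$"
)


def parse_tag(text: str) -> Optional[Tag]:
    """Return a Tag, or None if the text carries no ontology identifier."""
    match = TAG_PATTERN.match(text)
    if match is None:
        return None
    return Tag(match.group("label"), match.group("source").upper(), match.group("term_id"))


def group_by_source(texts: Iterable[str]) -> Dict[str, List[Tag]]:
    """Group parsable tags by ontology prefix; unparsable strings are skipped."""
    groups: Dict[str, List[Tag]] = {}
    for text in texts:
        tag = parse_tag(text)
        if tag is not None:
            groups.setdefault(tag.source, []).append(tag)
    return groups


if __name__ == "__main__":
    for raw in ["Blood specimen (OBI:0000655)", "fasting:(xco_0000102)",
                "HDL cholesterol (SNOMEDCT/102737005)"]:
        print(parse_tag(raw))
```

Keeping the identifier as a string rather than an integer preserves leading zeros, which matter when the identifier is later matched against the ontology.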
Advertising
Public web-interface
  [Screenshot with callouts 1, 2, 3.A, 3.B]
  Log in to see more details / request more information using the link or
Private web-interface
  Use filters to further refine the results
  nd. e.g. get all ions of the files
  Get more details, e.g. people involved, grouping
  Tags, i.e. what is attributed using tags
  Hierarchical information, i.e. parents and children
One of many ways to navigate: the graph browser
Data management
Search for a tag using the terminal
  egenv.sh -exact -search tag="pre-eclampsia"
  #URL(DISEASE_ONTOLOGY_TAGSOURCE)=https://ans-180230.stolav.ntnu.no:8185/eGenVar_web/Search/SearchResults3?TABLETOUSE_DISEASE_ONTOLOGY_TAGSOURCE=572cbcd98dbe48fdcc68aea6861e24173d27d9ee
  1) DISEASE_ONTOLOGY_TAGSOURCE.PARENT_ID=1005
  DISEASE_ONTOLOGY_TAGSOURCE.NAME=pre-eclampsia
  DISEASE_ONTOLOGY_TAGSOURCE.OBO_ID=DOID:10591
  DISEASE_ONTOLOGY_TAGSOURCE.DEFINITION=\N
  DISEASE_ONTOLOGY_TAGSOURCE.PATH=disease.obo>disease>disease of anatomical entity>cardiovascular system disease>vascular disease>artery disease>hypertension>pre-eclampsia(disease_ontology_tagsource=946)
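The search result above is a set of TABLE.FIELD=value pairs, with the ontology hierarchy encoded in PATH as terms separated by ">". As an illustration of how such output could be consumed by a script, here is a small Python sketch. It assumes one field per line (the line layout was lost in this copy of the slide) and treats "\N" as a missing value, which is also an assumption; parse_record and path_terms are hypothetical helper names, not part of egenv.sh.

```python
import re
from typing import Dict, List, Optional

# Sample text copied from the slide; one field per line is an assumption.
SAMPLE = r"""
1) DISEASE_ONTOLOGY_TAGSOURCE.PARENT_ID=1005
DISEASE_ONTOLOGY_TAGSOURCE.NAME=pre-eclampsia
DISEASE_ONTOLOGY_TAGSOURCE.OBO_ID=DOID:10591
DISEASE_ONTOLOGY_TAGSOURCE.DEFINITION=\N
DISEASE_ONTOLOGY_TAGSOURCE.PATH=disease.obo>disease>disease of anatomical entity>cardiovascular system disease>vascular disease>artery disease>hypertension>pre-eclampsia(disease_ontology_tagsource=946)
"""


def parse_record(text: str) -> Dict[str, Optional[str]]:
    """Turn TABLE.FIELD=value lines into a {FIELD: value} dictionary."""
    record: Dict[str, Optional[str]] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and "#URL..." comment lines
        line = re.sub(r"^\d+\)\s*", "", line)  # drop the "1) " result counter
        key, sep, value = line.partition("=")  # split on the first "=" only
        if not sep:
            continue
        field = key.split(".", 1)[-1]  # strip the table name prefix
        record[field] = None if value == r"\N" else value  # "\N" as missing: assumption
    return record


def path_terms(path: str) -> List[str]:
    """Split a PATH value into its terms, root first."""
    path = re.sub(r"\([^()]*\)\s*$", "", path)  # drop the trailing "(table=id)" suffix
    return [term.strip() for term in path.split(">")]


if __name__ == "__main__":
    record = parse_record(SAMPLE)
    print(record["NAME"], record["OBO_ID"])
    print(" -> ".join(path_terms(record["PATH"])))
```

Splitting on the first "=" only matters here because the PATH value itself ends with a "(…=946)" suffix.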
Retrieve file paths using sample details tags
  egenv.sh -search tag="4535811005.r01c01" "sampledetails files2paths.filepath"
  ..waiting for the server....
  #URL for this result:
  #URL(FILES2PATH)=https://ans-180230.stolav.ntnu.no:8185/eGenVar_web/Search/SearchResults3?TABLETOUSE_FILES2PATH=e93e2fcda6979d2bb022403b6f4d78b2c8d5cb05
  1) FILES2PATH.FILEPATH=/...nt_20100504.idats/4535811005/4535811005_R01C01_Grn.idat
  2) FILES2PATH.FILEPATH=/...nt_20100504.idats/4535811005/4535811005_R01C01_Red.idat
  Ended in 2 seconds
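Once the file paths are returned, a typical next step is to hand them to downstream tooling. The Python sketch below shows one way to pull the paths out of the output above and group the two files by sample and colour channel. It assumes one numbered result per line and that file names follow the sample_channel.idat pattern seen in this single example; extract_paths and by_channel are hypothetical names, and the elided "/..." path prefix is kept exactly as shown on the slide.

```python
import posixpath
import re
from typing import Dict, List

# Output copied from the slide; the "/..." prefix is elided in the source.
SAMPLE_OUTPUT = """\
1) FILES2PATH.FILEPATH=/...nt_20100504.idats/4535811005/4535811005_R01C01_Grn.idat
2) FILES2PATH.FILEPATH=/...nt_20100504.idats/4535811005/4535811005_R01C01_Red.idat
Ended in 2 seconds
"""


def extract_paths(output: str) -> List[str]:
    """Return the FILEPATH values of all numbered result lines."""
    pattern = r"^\s*\d+\)\s*FILES2PATH\.FILEPATH=(.+?)\s*$"
    return re.findall(pattern, output, flags=re.MULTILINE)


def by_channel(paths: List[str]) -> Dict[str, Dict[str, str]]:
    """Group paths as {sample: {channel: path}} using the file name (assumed pattern)."""
    grouped: Dict[str, Dict[str, str]] = {}
    for path in paths:
        stem = posixpath.basename(path).rsplit(".", 1)[0]  # e.g. 4535811005_R01C01_Grn
        sample, _, channel = stem.rpartition("_")
        grouped.setdefault(sample, {})[channel] = path
    return grouped


if __name__ == "__main__":
    for sample, channels in by_channel(extract_paths(SAMPLE_OUTPUT)).items():
        print(sample, sorted(channels))
```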
[Diagram labels: Type; Content description; Content generation details; Relationships; Location and ownership; Provenance; Track; Relocation; Virtual rearrangements; Information consolidation; Record changes and relocation; Phenotypes; Instruments and de; Protocols; Sample information; People/Experts; Controls]
Acknowledgments
  Pål Sætrom
  Oddgeir Lingaas Holmen
  Rok Mocnik
  Laurent Thomas
  Einar Ryeng
  Finn Drabløs
  HUNT biobank
  Kristian Hveem
  Research Council of Norway
  Email: sabryr@gmail.com
  Project link: http://bigr.medisin.ntnu.no/data/egenvar/
  Publication (PMID: 24682735): The eGenVar data management system: cataloguing and sharing sensitive data and metadata for the life sciences.
[Diagram: sample and file lineage]
  Samples: Donor, Original sample, Sample A, Sample B, B.1, B.2
  Files and steps: Experiments/Instrument, Raw data (F1), Post-processing, Groomed data (F2), Analysis, Processed data (F3), Filter, Filtered data (F4), Interpretations, results (F5)
  Annotation: Annotation resources (FA), Annotate
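The lineage in this diagram (raw data through to interpretations) is the kind of parent/child relationship that can be recorded as provenance. As a rough illustration, the Python sketch below stores "derived from" links and walks them to list everything a file depends on. The structure and the function names are hypothetical, not the EGDMS data model, and the exact edges are an assumption: the chain F1 to F5 follows the order of the labels, while attaching the annotation resources (FA) to the processed data (F3) is a guess, since the arrows did not survive in this copy.

```python
from collections import defaultdict
from typing import Dict, List

# child -> parents. Edges are assumptions reconstructed from the slide labels.
DERIVED_FROM: Dict[str, List[str]] = {
    "F2": ["F1"],        # groomed data from raw data (post-processing)
    "F3": ["F2", "FA"],  # processed data from groomed data plus annotation (assumed)
    "F4": ["F3"],        # filtered data from processed data
    "F5": ["F4"],        # interpretations/results from filtered data
}


def ancestors(node: str, derived_from: Dict[str, List[str]]) -> List[str]:
    """Return every file the given node was directly or indirectly derived from."""
    found: List[str] = []
    stack = list(derived_from.get(node, []))
    while stack:
        parent = stack.pop()
        if parent not in found:
            found.append(parent)
            stack.extend(derived_from.get(parent, []))
    return found


def children_index(derived_from: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Invert the mapping so that each parent lists its direct children."""
    children: Dict[str, List[str]] = defaultdict(list)
    for child, parents in derived_from.items():
        for parent in parents:
            children[parent].append(child)
    return dict(children)


if __name__ == "__main__":
    print(sorted(ancestors("F5", DERIVED_FROM)))
    print(children_index(DERIVED_FROM))
```

Storing only the child-to-parent direction and deriving the inverse on demand keeps a single source of truth for the relationships.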
Data sharing strategies
  [Figure: comparison of data sharing approaches, tools and the work each requires]
  Approaches: Email; Compress/FTP; SSH with ls/find/sed; advanced file systems and federated systems; interface
  Axis labels: Data, User
  Tools and resources: Galaxy, EGDMS, PubMed, ArrayExpress, iRODS, ISA tools, GenBank, NCBI, Taverna, GEO (Gene Expression Omnibus), TwinNET (Finland), IntAct, Synapse, dbGaP (The database of Genotypes and Phenotypes), Graz biobank (Austria), UCSC Genome Browser, Ensembl Genome Browser, Nature Scientific Data, GigaScience, ENCODE, Dataverse, 1000 Genomes, DRYAD
  Tasks and components: Write a detailed description; Create a summary / Export to schema; Parse files/extract data; Central/local repository; Extract details by the host system (e.g. file system); Work flow management; Access management; Privilege management; Collect metadata; Record provenance with relationships; Describe content; Organise; Describe using tags; Internal information management system + metadata server
eGenVar data management system (EGDMS)
  Sensitive data cannot be freely exchanged, hosted on public repositories or transformed in the same ways as public data. As a result, such data stay hidden and are not reused.
  Data advertising
    Advertise data using an extended set of metadata as tags.
    Keep the data where they are, with existing access restrictions, and disclose their presence.
  Data management
    Local data management to keep track of what is advertised.
    Facilitate locating data for internal and external reuse.