Euro-BioImaging European Research Infrastructure for Imaging Technologies in Biological and Biomedical Sciences


WP11 Data Storage and Analysis
Task 11.1 Coordination
Deliverable 11.3 Selected Standards and Architecture for Data Storage and Access Services supporting Euro-BioImaging

Task leaders UNIVDUN
Additional Task Contributors:
October 2013

1. Report Summary

WP11 Objectives

To define a roadmap towards the construction of a European Biomedical Imaging Data Storage and Analysis Infrastructure. The key objectives of this infrastructure will be:

- to support efficient and standardized storage for and access to curated biomedical image data.
- to support open-source software for biomedical image analysis through coordination of community efforts, provision of an actively maintained repository of state-of-the-art validated algorithms for quantitative image analysis and thorough training.
- to interface with high performance computing facilities for high-throughput and/or computation-intensive image analysis.
- to provide seamless collaboration and access to other relevant computing and data resources in ESFRI and in European and national infrastructures.

Digital imaging is now routinely used across both the life and biomedical sciences and has become an essential tool for all aspects of research, training and clinical practice. As image data volumes and complexity grow, it becomes increasingly difficult to access, view, share and analyse datasets using standard desktop-based solutions. In addition, the need for interdisciplinary collaboration, where datasets are shared with consortia of scientists with expertise in experimental biology, data analysis, image processing, modelling, physiology and/or medicine, has grown as well. In this new age, processing, analysis and sharing of image data based on conventional desktop-based solutions is simply no longer possible. Multi-dimensional images from unique clinical cohorts must be securely shared among defined collaborations to enable full analysis and query, while ensuring that identifiable data are never made publicly available. In biological imaging, technologies such as multi-dimensional fluorescence or high-content screening are becoming standard approaches to reveal fundamental biological mechanisms that explain human physiology and disease.

Datasets produced by these technologies are routinely tens to hundreds of GB and, in some cases, many TB. Table 1 lists examples: the dataset sizes reported by Euro-BioImaging WP6 and WP7 Proof of Concept Study (PCS) Users. To deliver the potential of these data, and the ambition and potential of European science, technology that enables access to image data regardless of where it is produced has become a critical scientific need, and one which the Euro-BioImaging infrastructure must address.

Table 1. Data volumes recorded by Euro-BioImaging Users during the WP6 and WP7 Proof of Concept Studies. Data recorded using different technologies are shown.

User ID   Data Size (GB)   Technology
3         1                STED-microscopy
22        1                High-throughput microscopy (Assay development)
2         2                STED-microscopy
4         2                STED-microscopy, CLSM
9         5                FRAP, FRET, PLA and IF
1         8                AFM, STED-microscopy, CLSM
5         16               STED-microscopy, STED-deconvolution
6         16               STED-microscopy
13        20               Laser Nanosurgery on confocal microscope
16        20               Functional Imaging of Living Cells, FRAP + FRET
11        22.6             High Content Screening and Confocal Microscopy
12        25               Laser Nanosurgery on confocal microscope
14        45               Spectral imaging
10        51.1             Laser scanning and spinning disc microscopy
17        60               LSM, Microinjection, Electron microscopy
8         100              OMX 3DSIM
7         174.3            CLEM and electron tomography
15        1000             CLSM, 2P LSCM, FRAP, photoactivation
18        1000             High-throughput microscopy
19        1000             High-throughput microscopy
20        1200             High-throughput microscopy
21        1500             High-throughput microscopy

In this deliverable we define the requirements for Euro-BioImaging image database systems, and in particular consider the importance of standardised software interfaces to access and process data in these systems. We propose the creation of a central Euro-BioImaging Image Data Repository (EB-IDR) for scientific image data, to be used as the community resource for access, mining, standards, benchmarking and publication of image data.

2. Image databases and repositories

2.1 Data repositories at Euro-BioImaging Nodes

All data recorded at Euro-BioImaging Nodes will be initially stored at the Node. Depending on the type of data collected and its future use, the responsibility for storing and providing access to the data may or may not shift to the User; the initial Euro-BioImaging data policy is that data belongs to the User. For example, with current (late 2013) network and portable storage capacities, datasets up to 100 GB can be transferred using various on-line file transfer protocols, and datasets between 100 GB and 1 TB can be transferred by portable media. Datasets larger than a few TB, however, are currently not portable either on-line or physically, and in most cases will have to remain at the Node until analysis is completed and the User either transfers them to a local archive at their home institution or decides to submit them to the EB-IDR. Besides the sheer volume of the data, the expertise, tools and/or resources needed to fully process and analyse the data may not be easily transferred from Node to User. In each of these cases, image databases that store data and enable sharing, processing, access and, where necessary, data transfer will be a critical part of the operation plan for each Node.

OMERO, CellBase, Bisque and OSIRIX (described in detail in D11.2) are examples of well-developed open source image database solutions that Euro-BioImaging Nodes can choose to use for these functions. They are all under continuing development, but should currently provide the capabilities required by those Euro-BioImaging Nodes that collect large datasets.

Euro-BioImaging Nodes and Users will have tools at their disposal to access and process the data they generate. Given the range of image data included in Euro-BioImaging, it is unlikely that Euro-BioImaging can specify a single standardised solution for managing access to data at all Nodes. However, given the advanced state of the applications described in D11.2, Euro-BioImaging must ensure that Nodes have sufficient capabilities to deploy and, if at all possible, contribute to the development of these applications.
By ensuring the deployment and use of such applications, Euro-BioImaging Nodes can help drive their development, to the benefit of the whole Euro-BioImaging community. An example of this type of usage is shown in Figure 1. Datasets recorded as part of the Euro-BioImaging WP6 Proof-of-Concept Study using the University of Dundee 3DSIM OMX microscope have been loaded into an OMERO server. Images are accessible remotely to the PCS Users using OMERO clients (see http://www.openmicroscopy.org/site/support/omero4/). Data is uploaded by the facility manager into accounts owned by each User.
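
The size thresholds quoted in Section 2.1 can be read as a simple rule of thumb for how a dataset might reach its User. The sketch below is illustrative only and not part of the deliverable: the function name `transfer_route`, the decimal convention 1 TB = 1000 GB, and the choice to treat everything above 1 TB as staying at the Node (the text only says "larger than a few TB") are assumptions. It applies the rule to the dataset sizes listed in Table 1.

```python
from collections import Counter

# Thresholds from Section 2.1 (late-2013 capacities), in GB.
ONLINE_MAX_GB = 100      # up to 100 GB: on-line file transfer
PORTABLE_MAX_GB = 1000   # 100 GB to 1 TB: portable media (assumes 1 TB = 1000 GB)

# Data sizes (GB) reported in Table 1, in table order.
TABLE1_SIZES_GB = [1, 1, 2, 2, 5, 8, 16, 16, 20, 20, 22.6, 25, 45, 51.1,
                   60, 100, 174.3, 1000, 1000, 1000, 1200, 1500]


def transfer_route(size_gb):
    """Return a hypothetical delivery route for a dataset of the given size."""
    if size_gb <= 0:
        raise ValueError("dataset size must be positive")
    if size_gb <= ONLINE_MAX_GB:
        return "online"
    if size_gb <= PORTABLE_MAX_GB:
        return "portable media"
    # Assumption: the range between 1 TB and "a few TB" is not specified in
    # the text, so anything above 1 TB is treated as remaining at the Node.
    return "remain at Node"


# Count how many Table 1 datasets fall into each route.
route_counts = Counter(transfer_route(size) for size in TABLE1_SIZES_GB)
for route, count in route_counts.items():
    print(f"{route}: {count} of {len(TABLE1_SIZES_GB)} datasets")
```

Under these assumptions, 16 of the 22 PCS datasets fall within the on-line range, 4 would need portable media, and 2 would stay at the Node until analysed or submitted to the EB-IDR.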

Euro-BioImaging 262023

Figure 1. Data collected as part of the WP6 PCS at the University of Dundee loaded into an OMERO system, accessible by PCS Users who collected the data.

2.2 Public Image data Repositories

The Euro-BioImaging community will have several specific needs for remote on-line access to image data. For this purpose, a Euro-BioImaging Image Data Repository (EB-IDR) will be required to deliver several functions:

Benchmark Datasets. Alongside the Euro-BioImaging web portal that describes resources for image processing (see D11.6), test and/or benchmark datasets will be required to validate analysis tools and compare their performance on different data types. These data may comprise datasets acquired at Euro-BioImaging Nodes or others submitted by the community. Regardless, these datasets, which will amount to several TB, must be housed in an on-line repository, annotated, searchable, licensed and available for community use, and flanked by cloud compute services that can host user algorithms for the benchmarking activity. Once deployed, these datasets may be accessed by simple download or, for large multi-TB datasets, linked to fixed or cloud-based compute resources for processing (see D11.2).

Datasets for Collaborative Research. Most data collected at Euro-BioImaging Nodes will be related to specific experiments, and therefore should be stored temporarily at the Euro-BioImaging Node for quality control and analysis and then archived locally via appropriate mechanisms at the User's home institution in line with funders' requirements. Some portion of these data may be published alongside the paper reporting the results, but in general, these data will remain with the Euro-BioImaging User.

However, a smaller proportion of datasets are critical community resources. They may be the foundation for national or trans-national research collaborations, or reference data linked to other resources (e.g., biological sequence, expression or other on-line databases). These datasets, which we call reference images, consist of the images themselves plus experimental and/or analytic metadata. The definition of these datasets is somewhat fluid, but some examples are:

1. Cellular Phenotypes
2. Cellular structure data (derived from super-resolution light microscopy and 3D electron microscopy)
3. Cellular Atlases
4. Model Organism Atlases
5. Human Atlases

Given the scientific value, technical sophistication or sheer expense associated with creating reference image data, public access to these data must be enabled. Similar approaches and collaborative datasets exist in the area of pathology image data acquired from biobank patient samples or mouse clinic image data of animal disease models. Euro-BioImaging is collaborating with the ESFRI infrastructures Infrafrontier and BBMRI, which operate in these domains, to ensure interoperability and compatible standards of these datasets in the framework of the BioMedBridges biomedical sciences research infrastructure cluster project.

2.3 Scale of the Euro-BioImaging Data Resource (EB-IDR)

When proposing a resource like the EB-IDR, careful planning of scale and scope is critical. Fortunately, Euro-BioImaging's Community Survey, the very successful PCS programme, and the results from the first Open Call for Expression of Interest (EoI) for Euro-BioImaging Nodes together provide substantial real-world experience that gives a sound basis for constructing the future image data infrastructure.
These data allow an estimate of the number of Euro-BioImaging Nodes, the number of Users and the size of the datasets they collect. The results of these projections are shown in Table 2. Depending on the assumptions made, the projected values vary by only about 2x. We also estimate that 15% - 35% of all data generated by normal Euro-BioImaging Nodes and by big Euro-BioImaging Nodes that offer flagship technologies with already established data standards, e.g. in CLEM and high-throughput microscopy, will be deposited in the EB-IDR, a fraction likely to increase as data and annotation standards spread from initial reference datasets to more image-based science domains. Furthermore, it is expected that data produced at the Nodes would be temporarily mounted to allow remote user access and analysis (ca. 50% of reference dataset storage capacity).

The key conclusion is that the sizes of datasets are certainly large, and likely to grow rapidly in the future. This puts image data at the same scale as biomolecular data, i.e. genome sequences, and therefore similar in scale to other data infrastructures and repositories (e.g., ELIXIR and EBI).

Table 2. Scaled Data Storage Model for Euro-BioImaging. Estimated projections for size of EB-IDR, in TB stored per year. Number of Nodes, Users, and Dataset sizes based on response from PCS providers and Euro-BioImaging Node EoI applications.

                                             Years 1-2    Years 3-4             Years 5-6
                                             (15 Nodes)   (25 Nodes)            (40 Nodes)
Users/Yr                                     525*         875*                  1500*
Users/Node/Yr                                30           30                    30
Nodes                                        15           25                    40
Data Size Split (Fraction of Normal)         0.8          0.7                   0.6
No. Normal Nodes                             12           17.5                  24
  (Dataset = 1 TB/user, scale over time)
No. Big Nodes                                3            7.5                   16
  (Dataset = 10 TB/user, scale over time)
Total Data Stored @ Normal Node/Yr (TB)      35           35                    37.5
Total Data Stored @ Big Node/Yr (TB)         350          350                   375
Total Data all Users/all Nodes/Yr (TB)       1470         3238                  6900
Steady state cloud provision at EB-IDR       111          515                   1723
  (ca. 50% of reference dataset
  storage capacity)***
Storage capacity for reference dataset,      221          (221 + 809) = 1030    (1030 + 2415) = 3445
  published data, data for benchmarking
  etc. at EB-IDR**

* Assumes 1.3 Users/two weeks, excluding holiday, downtime, etc.
** Assumes ca. 15%/25%/35% of total data recorded at Nodes is published/committed to IDR.
*** The steady-state cloud provision will need to be reviewed in the future, and the actual implementation will depend on emerging technologies and capabilities.

4. Conclusion

The main requirement of the Euro-BioImaging Data Infrastructure will be the provision of image databases that store data and enable sharing, processing, access and, where necessary, transfer of data. It is expected that Euro-BioImaging Nodes will generate datasets ranging in size from 0.1 to 20 TB on average. Delivering this data to Users will therefore require more sophisticated methods than USB sticks or writing DVDs. Several applications are now available that enable enterprise-scale data management and remote access, and Euro-BioImaging Nodes will need to invest in the hardware infrastructure and expert staff required to deliver these capabilities to their Users. A Euro-BioImaging Image Data Repository can hold and serve data linked to publications as well as reference datasets, to promote large-scale collaborations, benchmarking and data mining. The Euro-BioImaging Survey and PCS provide solid evidence that defines the scale and capabilities of such a resource.
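
The arithmetic behind the Table 2 projections can be reproduced with a short sketch. This is one reading of the table, not the authors' own model: the names `PHASES`, `project` and `round_half_up` are hypothetical, half-up rounding is an assumed convention, and users per Node is derived as Users/Yr divided by Nodes (35, 35 and 37.5), since that is what the per-Node totals in the table correspond to at 1 TB and 10 TB per User. Deposited data is accumulated across phases, and the cloud provision is taken as half of the accumulated reference capacity.

```python
from fractions import Fraction
from math import floor

TB_PER_USER_NORMAL = 1   # Table 2: dataset = 1 TB/user at normal Nodes
TB_PER_USER_BIG = 10     # Table 2: dataset = 10 TB/user at big Nodes

# (label, nodes, users per year, % normal Nodes, % of data deposited in EB-IDR)
PHASES = [
    ("Years 1-2", 15, 525, 80, 15),
    ("Years 3-4", 25, 875, 70, 25),
    ("Years 5-6", 40, 1500, 60, 35),
]


def round_half_up(value):
    """Round to the nearest integer, with .5 rounding up (assumed convention)."""
    return floor(value + Fraction(1, 2))


def project(phases):
    """Recompute the Table 2 totals; exact fractions avoid float rounding noise."""
    rows, capacity_tb = [], 0
    for label, nodes, users, normal_pct, deposit_pct in phases:
        users_per_node = Fraction(users, nodes)
        normal_nodes = Fraction(nodes * normal_pct, 100)
        big_nodes = nodes - normal_nodes
        total_tb = users_per_node * (normal_nodes * TB_PER_USER_NORMAL
                                     + big_nodes * TB_PER_USER_BIG)
        # Deposited data accumulates from one phase to the next.
        capacity_tb += round_half_up(total_tb * deposit_pct / 100)
        rows.append({
            "phase": label,
            "total_tb_per_year": round_half_up(total_tb),
            "reference_capacity_tb": capacity_tb,
            "cloud_provision_tb": round_half_up(Fraction(capacity_tb, 2)),
        })
    return rows


for row in project(PHASES):
    print(row)
```

With these inputs the sketch returns yearly totals of 1470, 3238 and 6900 TB, reference capacities of 221, 1030 and 3445 TB, and cloud provisions of 111, 515 and 1723 TB, matching the corresponding rows of Table 2.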