Euro-BioImaging
European Research Infrastructure for Imaging Technologies in Biological and Biomedical Sciences

WP11 Data Storage and Analysis
Task 11.1 Coordination
Deliverable 11.3
Selected Standards and Architecture for Data Storage and Access Services supporting Euro-BioImaging

Task leaders: UNIVDUN
Additional Task Contributors:
October 2013
1. Report Summary

WP11 Objectives

To define a roadmap towards the construction of a European Biomedical Imaging Data Storage and Analysis Infrastructure. The key objectives of this infrastructure will be:

- to support efficient and standardized storage for and access to curated biomedical image data.
- to support open-source software for biomedical image analysis through coordination of community efforts, provision of an actively maintained repository of state-of-the-art validated algorithms for quantitative image analysis and thorough training.
- to interface with high performance computing facilities for high-throughput and/or computation-intensive image analysis.
- to provide seamless collaboration and access to other relevant computing and data resources in ESFRI and in European and national infrastructures.

Digital imaging is now routinely used across both the life and biomedical sciences and has become an essential tool for all aspects of research, training and clinical practice. As image data volumes and complexity grow, it becomes increasingly difficult to access, view, share and analyse datasets using standard desktop-based solutions. In addition, the need for interdisciplinary collaboration, where datasets are shared with consortia of scientists with expertise in experimental biology, data analysis, image processing, modelling, physiology and/or medicine, has grown as well. In this new age, processing, analysing and sharing image data with conventional desktop-based solutions is simply no longer possible. Multi-dimensional images from unique clinical cohorts must be securely shared among defined collaborations to enable full analysis and query, while ensuring that identifiable data are never made publicly available.

In biological imaging, technologies such as multi-dimensional fluorescence or high-content screening are becoming standard approaches to reveal fundamental biological mechanisms that explain human physiology and disease.
Datasets produced by these technologies are routinely tens to hundreds of GB and, in some cases, many TB. Examples are given in Table 1, which lists the dataset sizes reported by Euro-BioImaging WP6 and WP7 PCS Users. To deliver the potential of these data, and the ambition and potential of European science, technology that enables access to image data regardless of where it is produced has become a critical scientific need, and one which the Euro-BioImaging infrastructure must address.
Table 1. Data volumes recorded by Euro-BioImaging Users during the WP6 and WP7 Proof of Concept Studies. Data recorded using different technologies are shown.

User ID   Data Size (GB)   Technology
3         1                STED-microscopy
22        1                High-throughput microscopy (Assay development)
2         2                STED-microscopy
4         2                STED-microscopy, CLSM
9         5                FRAP, FRET, PLA and IF
1         8                AFM, STED-microscopy, CLSM
5         16               STED-microscopy, STED-deconvolution
6         16               STED-microscopy
13        20               Laser Nanosurgery on confocal microscope
16        20               Functional Imaging of Living Cells, FRAP + FRET
11        22.6             High Content Screening and Confocal Microscopy
12        25               Laser Nanosurgery on confocal microscope
14        45               Spectral imaging
10        51.1             Laser scanning and spinning disc microscopy
17        60               LSM, Microinjection, Electron microscopy
8         100              OMX 3DSIM
7         174.3            CLEM and electron tomography
15        1000             CLSM, 2P LSCM, FRAP, photoactivation
18        1000             High-throughput microscopy
19        1000             High-throughput microscopy
20        1200             High-throughput microscopy
21        1500             High-throughput microscopy

In this deliverable we define the requirements for Euro-BioImaging image database systems, and in particular consider the importance of standardised software interfaces to access and process data in these systems. We propose the creation of a central Euro-BioImaging Image Data Repository (EB-IDR) for scientific image data, to be used as the community resource for access, mining, standards, benchmarking and publication of image data.

2. Image databases and repositories

2.1 Data repositories at Euro-BioImaging Nodes

All data recorded at Euro-BioImaging Nodes will be initially stored at the Node. Depending on the type of data collected and its future use, the responsibility for storing and providing access to
the data may or may not shift to the User; the initial Euro-BioImaging data policy is that data belong to the User. For example, with current (late 2013) network and portable storage capacities, datasets up to 100 GB can be transferred using various on-line file transfer protocols, and datasets between 100 GB and 1 TB can be transferred on portable media. Datasets larger than a few TB, however, are currently portable neither on-line nor physically, and thus in most cases will have to remain at the Node until analysis is completed and the User transfers them to a local archive at their home institution or decides to submit them to the EB-IDR. Besides the sheer volume of the data, the expertise, tools and/or resources needed to fully process and analyse the data may not be easily transferred from Node to User. In each of these cases, image databases that store data and enable sharing, processing, access and, where necessary, data transfer will be a critical part of the operation plan for each Node.

OMERO, CellBase, Bisque and OSIRIX (described in detail in D11.2) are examples of well-developed open source image database solutions that Euro-BioImaging Nodes can choose to use for these functions. They are all under continuing development, but should currently provide the capabilities required by those Euro-BioImaging Nodes that collect large datasets.

Euro-BioImaging Nodes and Users will have tools at their disposal to access and process the data they generate. Given the range of image data included in Euro-BioImaging, it is unlikely that Euro-BioImaging can specify a single standardised solution for managing access to data at all Nodes. However, given the advanced state of the applications described in D11.2, Euro-BioImaging must ensure that Nodes have sufficient capabilities to deploy and, if at all possible, contribute to the development of these applications.
By ensuring the deployment and use of such applications, Euro-BioImaging Nodes can help drive their development, to the benefit of the whole Euro-BioImaging community. An example of this type of usage is shown in Figure 1. Datasets recorded as part of the Euro-BioImaging WP6 Proof-of-Concept Study using the University of Dundee 3DSIM OMX microscope have been loaded into an OMERO server. Images are accessible remotely to the PCS Users using OMERO clients (see http://www.openmicroscopy.org/site/support/omero4/). Data is uploaded by the facility manager into accounts owned by each User.
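The transfer guidance given earlier in Section 2.1 (on-line transfer up to 100 GB, portable media between 100 GB and 1 TB, larger datasets remaining at the Node) can be read as a simple decision rule. The following is a minimal sketch of that rule, not part of any Euro-BioImaging software. The names `TransferMode` and `suggest_transfer_mode` are hypothetical, sizes are taken in decimal GB, and datasets above 1 TB are assumed to remain at the Node, although the text only states this for datasets "larger than a few TB".

```python
from enum import Enum


class TransferMode(Enum):
    """Hypothetical labels for the three cases described in Section 2.1."""
    ONLINE = "on-line file transfer"
    PORTABLE_MEDIA = "portable media"
    REMAIN_AT_NODE = "remain at Node until analysed or submitted to EB-IDR"


# Thresholds quoted in the text for late 2013 (decimal units assumed).
ONLINE_LIMIT_GB = 100.0
PORTABLE_LIMIT_GB = 1000.0  # 1 TB


def suggest_transfer_mode(size_gb: float) -> TransferMode:
    """Return the transfer route suggested by the late-2013 thresholds.

    Assumption: anything above 1 TB is treated as staying at the Node;
    the report itself leaves the 1 TB to "a few TB" range unspecified.
    """
    if size_gb < 0:
        raise ValueError("dataset size must be non-negative")
    if size_gb <= ONLINE_LIMIT_GB:
        return TransferMode.ONLINE
    if size_gb <= PORTABLE_LIMIT_GB:
        return TransferMode.PORTABLE_MEDIA
    return TransferMode.REMAIN_AT_NODE


if __name__ == "__main__":
    # A few dataset sizes (GB) taken from Table 1.
    for size in (8, 100, 174.3, 1500):
        print(f"{size:>7} GB -> {suggest_transfer_mode(size).value}")
```

Because the thresholds depend on network and storage capacities at a given time, they are kept as named constants so that a Node could revise them without touching the rule itself.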
Euro-BioImaging 262023

Figure 1. Data collected as part of the WP6 PCS at the University of Dundee loaded into an OMERO system, accessible by the PCS Users who collected the data.

2.2 Public Image Data Repositories

The Euro-BioImaging community will have several specific needs for remote on-line access to image data. For this purpose, a Euro-BioImaging Image Data Repository (EB-IDR) will be required to deliver several functions:

Benchmark Datasets. Alongside the Euro-BioImaging web portal that describes resources for image processing (see D11.6), test and/or benchmark datasets will be required to validate analysis tools and compare their performance on different data types. These data may comprise datasets acquired at Euro-BioImaging Nodes or others submitted by the community. Regardless, these datasets, which will amount to several TB, must be housed in an on-line repository, annotated, searchable, licensed and available for community use, and flanked by cloud compute services that can host user algorithms for the benchmarking activity. Once deployed, these datasets may be accessed by simple download or, for large multi-TB datasets, linked to fixed or cloud-based compute resources for processing (see D11.2).

Datasets for Collaborative Research. Most data collected at Euro-BioImaging Nodes will be related to specific experiments, and therefore should be stored temporarily at the Euro-BioImaging Node for quality control and analysis and then archived locally via appropriate mechanisms at the
User's home institution in line with funders' requirements. Some portion of these data may be published alongside the paper reporting the results, but in general, these data will remain with the Euro-BioImaging User.

However, a smaller proportion of datasets are critical community resources. They may be the foundation for national or trans-national research collaborations, or reference data linked to other resources (e.g., biological sequence, expression or other on-line databases). These datasets, which we call reference images, consist of the images themselves plus experimental and/or analytic metadata. The definition of these datasets is somewhat fluid, but some examples are:

1. Cellular Phenotypes
2. Cellular structure data (derived from super-resolution light microscopy and 3D electron microscopy)
3. Cellular Atlases
4. Model Organism Atlases
5. Human Atlases

Given the scientific value, technical sophistication or sheer expense associated with creating reference image data, public access to these data must be enabled.

Similar approaches and collaborative datasets exist in the area of pathology image data acquired from biobank patient samples or mouse clinic image data of animal disease models. Euro-BioImaging is collaborating with the ESFRI infrastructures Infrafrontier and BBMRI, which operate in these domains, to ensure interoperability and compatible standards for these datasets in the framework of the BioMedBridges biomedical sciences research infrastructure cluster project.

2.3 Scale of the Euro-BioImaging Data Resource (EB-IDR)

When proposing a resource like the EB-IDR, careful planning of scale and scope is critical. Fortunately, Euro-BioImaging's Community Survey, the very successful PCS programme, and the results from the first Open Call for Expression of Interest (EoI) for Euro-BioImaging Nodes together provide substantial real-world experience that gives a sound basis on which to construct the future image data infrastructure.
These data allow an estimate of the number of Euro-BioImaging Nodes, the number of Users and the size of the datasets they collect. The results of these projections are shown in Table 2. Depending on the assumptions made, the projected values vary by only about a factor of two. We also estimate that 15% to 35% of all data generated by normal Euro-BioImaging Nodes and by big Euro-BioImaging Nodes that offer flagship technologies with already established data standards, e.g. in CLEM and high-throughput microscopy, will be deposited in the EB-IDR, a fraction likely to increase as data and annotation
standards spread from initial reference datasets to more image-based science domains. Furthermore, it is expected that data produced at the Nodes would be temporarily mounted to allow remote user access and analysis (ca. 50% of reference dataset storage capacity). The key conclusion is that the datasets are certainly large, and likely to grow rapidly in the future. This puts image data at the same scale as biomolecular data, i.e. genome sequences, and therefore at a scale similar to that of other data infrastructures and repositories (e.g., ELIXIR and EBI).

Table 2. Scaled Data Storage Model for Euro-BioImaging. Estimated projections for size of EB-IDR, in TB stored per year. Number of Nodes, Users and Dataset sizes based on responses from PCS providers and Euro-BioImaging Node EoI applications.

                                                          Years 1-2     Years 3-4             Years 5-6
                                                          (15 Nodes)    (25 Nodes)            (40 Nodes)
Users/Yr                                                  525*          875*                  1500*
Users/Node/Yr                                             30            30                    30
Nodes                                                     15            25                    40
Data Size Split (Fraction of "Normal")                    0.8           0.7                   0.6
No. Normal Nodes (Dataset = 1 TB/user, scale over time)   12            17.5                  24
No. Big Nodes (Dataset = 10 TB/user, scale over time)     3             7.5                   16
Total Data Stored @ Normal Node/Yr (TB)                   35            35                    37.5
Total Data Stored @ Big Node/Yr (TB)                      350           350                   375
Total Data all Users/all Nodes/Yr (TB)                    1470          3238                  6900
Steady state cloud provision at EB-IDR (ca. 50% of        111           515                   1723
  reference dataset storage capacity)***
Storage capacity for reference dataset, published data,   221           (221 + 809) = 1030    (1030 + 2415) = 3445
  data for benchmarking etc. at EB-IDR**

* Assumes 1.3 Users/two weeks, excluding holiday, downtime, etc.
** Assumes ca. 15%/25%/35% of total data recorded at Nodes is published/committed to IDR.
*** The steady-state cloud provision will need to be reviewed in the future, and the actual implementation will depend on emerging technologies and capabilities.

4. Conclusion

The main requirement of the Euro-BioImaging Data Infrastructure will be the provision of image databases that store data and enable sharing, processing, access and, where necessary, transfer of data. It is expected that Euro-BioImaging Nodes will generate datasets ranging in size from 0.1 to 20 TB on average. Delivering these data to Users will therefore require more sophisticated methods than USB sticks or writing DVDs. Several applications are now available that enable enterprise-scale data
management and remote access, and Euro-BioImaging Nodes will need to invest in the hardware infrastructure and expert staff required to deliver these capabilities to their Users. A Euro-BioImaging Image Data Repository can hold and serve data linked to publications and also reference datasets, to promote large-scale collaborations, benchmarking and data mining. The Euro-BioImaging Survey and PCS provide solid evidence that defines the scale and capabilities of such a resource.
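The storage projections in Table 2 follow from a few multiplications, and reproducing them makes the underlying assumptions explicit. The sketch below is illustrative only. All names (`Period`, `Projection`, `project`) are hypothetical, and the number of Users per Node per year is taken as Users/Yr divided by Nodes (35, 35 and 37.5), which is an assumption of this sketch chosen because it matches the per-Node totals listed in the table. Intermediate values are not rounded, so the results agree with Table 2 only to within about 1 TB.

```python
from dataclasses import dataclass
from typing import List, NamedTuple

# Dataset sizes per User, as stated in Table 2.
NORMAL_TB_PER_USER = 1.0
BIG_TB_PER_USER = 10.0
# Temporary cloud provision: ca. 50% of reference dataset storage capacity.
CLOUD_FRACTION = 0.5


@dataclass(frozen=True)
class Period:
    label: str
    nodes: int
    users_per_node: float    # assumption: Users/Yr divided by Nodes
    normal_fraction: float   # "Data Size Split (Fraction of Normal)"
    deposit_fraction: float  # share of data committed to the EB-IDR


class Projection(NamedTuple):
    label: str
    total_tb: float       # Total Data all Users/all Nodes/Yr
    repository_tb: float  # cumulative storage capacity at EB-IDR
    cloud_tb: float       # steady-state cloud provision


PERIODS = [
    Period("Years 1-2", 15, 525 / 15, 0.8, 0.15),
    Period("Years 3-4", 25, 875 / 25, 0.7, 0.25),
    Period("Years 5-6", 40, 1500 / 40, 0.6, 0.35),
]


def total_tb_per_year(p: Period) -> float:
    """Data recorded at all Nodes in one year of the given period."""
    normal_nodes = p.nodes * p.normal_fraction
    big_nodes = p.nodes - normal_nodes
    return p.users_per_node * (
        normal_nodes * NORMAL_TB_PER_USER + big_nodes * BIG_TB_PER_USER
    )


def project(periods: List[Period]) -> List[Projection]:
    """Accumulate deposited data across periods, as in the last table row."""
    rows = []
    repository_tb = 0.0
    for p in periods:
        total = total_tb_per_year(p)
        repository_tb += p.deposit_fraction * total
        rows.append(
            Projection(p.label, total, repository_tb, CLOUD_FRACTION * repository_tb)
        )
    return rows


if __name__ == "__main__":
    for row in project(PERIODS):
        print(f"{row.label}: total {row.total_tb:.1f} TB, "
              f"EB-IDR {row.repository_tb:.1f} TB, cloud {row.cloud_tb:.1f} TB")
```

Changing a single input, for example the deposit fraction or the share of big Nodes, shows directly how sensitive the EB-IDR capacity estimate is to that assumption.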