EUDAT: Towards a pan-European Collaborative Data Infrastructure
Willem Elbers, EUDAT / MPI-TLA
Focus meeting: Data repositories, SURF, Utrecht, March 3, 2014
Outline
- EUDAT project
- EUDAT services
- Summary and conclusion
Data Deluge
Exponential growth: Gigabytes, Terabytes, Petabytes, Exabytes, Zettabytes
Increasing complexity and variety
- Where to store it?
- How to find it?
- How to make the most of it?
Consortium
EUDAT's Mission: Collaborative Data Infrastructure
- Data generators and users: user-focused functionality, data capture & transfer, VREs
- Community support services: data discovery & navigation, workflow creation, annotation, interpretability
- Common data services: persistent storage, identification, authenticity, workflow execution, mining
- Trust and data curation
... implementing services initially motivated by early community use cases
EUDAT addressing all data
Large volumes of data (big data)
- more uniform in terms of formats and quality
- lots of automatic processing
- high reduction as goal
Irregular big data
- automatically derived data
- aggregated data
- semi-automatic processing
Long tail data
- large variety (complexity)
- many sources, many owners
- difficult to manage
The CDI network architecture
- Generic data centres
- Community data sites (repositories) may join the data infrastructure or just use EUDAT services
Domain of registered data
Data in the EUDAT domain must have:
- (descriptive) metadata
- a persistent identifier
Ingest points define the boundary between domains
- Joining EUDAT: community center
- Using EUDAT: EUDAT data center
- Specific case: B2SHARE, where EUDAT center(s) act as the repository
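The two entry requirements above can be illustrated with a small check at an ingest point. This is only a sketch under stated assumptions: the `DataObject` class, its field names, and both helper functions are hypothetical and are not part of any EUDAT service or API.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DataObject:
    """Hypothetical record for a data object presented at an ingest point."""
    name: str
    pid: Optional[str] = None                     # persistent identifier, if one was assigned
    metadata: dict = field(default_factory=dict)  # descriptive metadata


def missing_requirements(obj: DataObject) -> List[str]:
    """List the requirements the object does not yet meet."""
    missing = []
    if not obj.metadata:
        missing.append("descriptive metadata")
    if not obj.pid:
        missing.append("persistent identifier")
    return missing


def can_enter_registered_domain(obj: DataObject) -> bool:
    """An object may cross the ingest boundary only if nothing is missing."""
    return not missing_requirements(obj)
```

An object carrying both a PID and at least one metadata field passes the check; an object with neither is reported as missing both requirements.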
[Figure: domain of registered data and the Identifier Service. Activity labels: generation, acquisition, description, enrichment, processing, reduction, analysis, publication, preservation. Value labels: individual value (short timescale), community value (medium timescale), society value (long timescale).]
EUDAT Services Portfolio
- Metadata Catalogue: aggregated EUDAT metadata domain; data inventory
- Data Staging: dynamic replication to HPC workspace for processing
- Safe Replication: data preservation, access optimization
- Simple Store: researcher data store (simple upload, share and access)
- PID: identity, integrity, authenticity, locations
- AAI: network of trust among authentication and authorization actors
Replication from repositories to data storages in different administrative domains
- (long-term) archiving and preservation
- optimize access for users from different regions
- bring data closer to powerful computers for data analytics
Typical policies triggered by Community Data Managers:
- Replicate collection X from my repository to data centres A and B
- Store the replica safely for N years
- Check the integrity of the replica every M years
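The three example policies above can be written down as a small data structure. The sketch below is illustrative only and is not the EUDAT replication service: `ReplicationPolicy`, its field names, and the helper functions are hypothetical, and using SHA-256 for the integrity comparison is an assumption rather than something the slides specify.

```python
import hashlib
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ReplicationPolicy:
    """Hypothetical policy: replicate a collection, keep it, re-check it."""
    collection: str              # "collection X"
    targets: Tuple[str, ...]     # "data centres A and B"
    retention_years: int         # "N years"
    check_interval_years: int    # "every M years"


def sha256_of(data: bytes) -> str:
    """Checksum used here to compare an original with its replica (assumed algorithm)."""
    return hashlib.sha256(data).hexdigest()


def replica_is_intact(original: bytes, replica: bytes) -> bool:
    """A replica is considered intact when both checksums match."""
    return sha256_of(original) == sha256_of(replica)


def integrity_check_years(policy: ReplicationPolicy, start_year: int) -> List[int]:
    """Years in which an integrity check falls due during the retention period."""
    end_year = start_year + policy.retention_years
    first_check = start_year + policy.check_interval_years
    return list(range(first_check, end_year + 1, policy.check_interval_years))
```

For example, `ReplicationPolicy("X", ("A", "B"), retention_years=10, check_interval_years=2)` starting in 2014 yields checks in 2016, 2018, 2020, 2022 and 2024.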
Transferring data from EUDAT storages to compute facilities
- reliable, efficient, easy-to-use tools to manage data transfers
- ingest data into the EUDAT domain of registered data
[Map: enabled EUDAT sites. Legend: repositories, replica storages.]
B2SHARE
- Offering a simple self-service registration for data providers
- Lowering barriers to allow registered users to upload and store smaller scientific data sets in the B2SHARE repository
- Enabling users to share their data with other researchers
B2FIND
- Make collections of scientific data easy to find
- Provide access to those data collections through the references given in the metadata
- Commenting functionality
Summary
- The EUDAT project is driven by community requirements, bridging the gap between community support services and common data services
- The EUDAT project provides services to safely and easily store your data, make it discoverable, and run HPC analysis on it, all within a domain of registered data
Thank you
- B2SAFE: eudat-safereplication@postit.csc.fi
- B2STAGE: eudat-b2stage@postit.csc.f
- B2FIND: http://b2find.eudat.eu/
- B2SHARE: https://b2share.eudat.eu/
- www.eudat.eu