Italian Scientific Big Data Initiative



Italian Scientific Big Data Initiative
Sanzio Bassini
Director of Supercomputing Application & Innovation Department
S.Bassini@cineca.it
Via Magnanelli 6/3, 40033 Casalecchio di Reno (BO)
051 6171411
www.cineca.it

Cineca Big Data 1

CINECA Interuniversity Consortium

- 69 Italian universities
- CNR, OGS, SISSA/ISAS Trieste
- Ministry of University and Research

Mission

A private not-for-profit organization, founded in 1969 by the Ministry of Public Education and now under the control of the Ministry of Education, University and Research.

Main activities:
- Promote the use of the most advanced HPC systems to support public and private scientific and technological research
- Services for universities, national research institutes, private enterprises and ministries

Three operational sites:
- Bologna (consortium headquarters), Milan, Rome
- 711 employees

Supercomputing, Application & Innovation Dpt.:
- Tier-0 and Tier-1 supercomputing services
- Data processing and data management for scientific big data
- Enabling specialist support for application development and exploitation

The focus

Data as an asset

One of the most significant changes of the past decade has been the widespread recognition of data as an asset rather than the refuse of research.

G8 + O5 White Paper: 5 Principles for an Open Data Infrastructure

1. Discoverability: Implementation of appropriate persistent identifier frameworks, adoption of descriptive metadata standards, and the use of appropriate data formats and taxonomies.
2. Accessibility: Publicly funded scientific research data should be made openly available with as few restrictions as possible. Ethical, legal or commercial constraints may be imposed on the use of research data to ensure that the research process is not damaged by inappropriate release of data.
3. Understandability: Scientific data sets must be understandable in order to be effectively used. A set of numbers, texts, pictures or even videos alone cannot be understood without additional context, semantics, data analysis tools, and algorithms.
4. Manageability: Data management policies and plans must make it clear who is responsible for maintaining the availability of data and how the associated costs are to be met, including issues associated with curation, storage and services.
5. People: A global approach to research data infrastructure requires a highly skilled and adaptable workforce, and a culture that is able to capture the available data and make it available to those able to use it appropriately.

G8+O5 Data Working Group and GSO Report: Recommendations 9 & 10

Recommendation 9 - e-infrastructure: Global research infrastructure initiatives should recognize the utility of the integrated use of advanced e-infrastructures, of services for accessing, processing and curating data, and of remote participation (interaction) and access to scientific experiments.

Recommendation 10 - Data exchange: Global scientific data infrastructure providers and users should recognise the utility of data exchange and interoperability of data across disciplines and national boundaries as a means of broadening the scientific reach of individual data sets.

Research model: data lifecycle

- Re-use data: follow-up research, new research, undertake research reviews, scrutinize findings, teach and learn
- Create data: enter data, digitize, transcribe, translate
- Process data: check, validate and clean data, anonymise data where necessary, describe data, manage and store data
- Analyse data: interpret data, derive data, produce research outputs, author publications, prepare data for preservation
- Give access to data: distribute data, share data, control access, establish copyright, promote data
- Preserve data: migrate data to best format and suitable medium, back up and store data, create metadata and documentation, archive data

http://www.data-archive.ac.uk/create-manage/life-cycle

Data domain

- Scientists' personal data
- Homeless scientists
- Citizen scientists
- Community repositories
- Institute repositories

EUDAT available services

- Safe replication and archiving
- Data stage in & out
- Metadata indexing and search
- User-friendly service for data upload & download

National Scientific Data Repository

Aim of the project: the creation of the national scientific data repository.

Objectives: service provision for national and EU-funded projects:
- EUDAT (European Data Infrastructure)
- V-MUST (Virtual Museum Transnational Network)
- Connectomics - ICON Foundation
- LENS - Human Brain Project
- Fluid dynamics - COST Action MP0806 "Particles in turbulence"
- EuHIT - facilities for turbulence research
- Astrophysics - Planck mission, Gaia mission
- Geophysics - INGV RI, EPOS Project
- NFFA / Sincrotrone - Nanoscience Foundries and Fine Analysis
- Elixir - Italian national infrastructure
- Epigen - national epigenetics infrastructure
- Nextdata - national environmental data infrastructure

New business marketplace; improve the processing of data (including poorly structured data).

CINECA current infrastructure

[Diagram: external data sources (PRACE, EUDAT) and internal data sources (labs and projects) feed the compute systems and the data store.]

- Compute: FERMI Tier-0 (> 2 PF); next PLX Tier-1 (> 1 PF); FERMI high throughput; NUBES/PLX cloud services; HPDA front-end cluster (64 thin nodes, 8 fat nodes for visualization and big memory)
- Data store: workspace > 5 PB, repository > 5 PB, high-IOPS SSD > 50 TB, tape > 12 PB
- Services: data mover, DB, web services, processing, web archive, FTP

EUDAT-EPOS success story

The whole archive of INGV's data center in Rome is replicated to CINECA, the EUDAT data center in Bologna (years 1990-2013). Granularity is at the single-file level (the collection of seismic waveforms related to one sensor for one day). Metadata are also replicated, as files (dataless). The replicas are stored, synchronized and accessed by a single user, the INGV community manager.
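The file-level granularity described above (one sensor, one day) can be sketched as follows. The SDS-style naming convention and the helper names are illustrative assumptions, not INGV's or EUDAT's actual tooling:

```python
from datetime import date

def daily_waveform_name(network, station, channel, day):
    """Illustrative name for one sensor's waveforms for one day,
    loosely modelled on the SeisComP Data Structure (SDS) convention."""
    return f"{network}.{station}..{channel}.D.{day.year}.{day.timetuple().tm_yday:03d}"

def missing_replicas(source_files, replica_files):
    """Files present in the source archive but not yet at the replica site."""
    return source_files - replica_files

# Three days recorded at the source, two already replicated:
src = {daily_waveform_name("IV", "AQU", "HHZ", date(2013, 1, d)) for d in (1, 2, 3)}
dst = {daily_waveform_name("IV", "AQU", "HHZ", date(2013, 1, d)) for d in (1, 2)}
print(sorted(missing_replicas(src, dst)))  # → ['IV.AQU..HHZ.D.2013.003']
```

A synchronization pass then only needs to transfer the set difference, which is what makes per-sensor, per-day files a convenient replication unit.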

SmartDATA project

CINECA's project to deal with Big Data and HPC: a Big Data Analytics engine for OpenDATA catalogues, coherent with EUDAT.

Initial pre-conditions: data are archived, catalogued, searchable and accessible:
- Robust and reliable infrastructure
- Interoperability and adoption of standards

SmartDATA satisfies these requirements and, in addition:
- makes available an HPC environment with analysis tools that foster the re-use of such data;
- eases the exchange of information among experts from different disciplines, from the point of view of both methods and data.

Hence data sharing creates new knowledge.

SmartDATA project

[Diagram: SmartDATA users reach the SmartDATA portal and apps; an orchestrator drives the HPC front-end, job scheduler and workflows; ingestion feeds a Data Registry (iRODS) for OpenDATA sources and a repository; data are staged to scratch for the supercomputer analytic engine.]
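The flow in the diagram above can be sketched minimally; every name here (Registry, stage_in, analyse) is hypothetical, standing in for the iRODS data registry, the scratch staging step and the analytic engine:

```python
class Registry:
    """Stands in for the iRODS-backed data registry of OpenDATA sources."""
    def __init__(self):
        self._catalogue = {}

    def ingest(self, identifier, payload):
        self._catalogue[identifier] = payload

    def lookup(self, identifier):
        return self._catalogue[identifier]

def stage_in(registry, identifier, scratch):
    """Copy a registered data set into the fast scratch area for processing."""
    scratch[identifier] = registry.lookup(identifier)

def analyse(scratch, identifier):
    """Placeholder for the HPC analytic engine: here, just a mean."""
    values = scratch[identifier]
    return sum(values) / len(values)

registry, scratch = Registry(), {}
registry.ingest("opendata:temperatures", [21.0, 23.5, 19.5])
stage_in(registry, "opendata:temperatures", scratch)
print(round(analyse(scratch, "opendata:temperatures"), 3))  # → 21.333
```

The point of the sketch is the separation of concerns: the registry only catalogues, the staging step only moves data, and the analytic engine only ever sees scratch space.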

EuHIT project

EuHIT is a consortium that aims at integrating cutting-edge European facilities for turbulence research across national boundaries (2013-2017). It includes 25 research institutes and 2 industrial partners from 10 European countries, giving access to a total of 14 cutting-edge turbulence research infrastructures.

CINECA will provide the technical and hardware infrastructure for the turbulence data digital library service (DLTD), offering:
- A service to manage data coming from numerical simulations and experiments in the field of fluid dynamics
- A storage space of 150 TB online
- A helpdesk and specialized support

EuHIT project

TurBase, a freely accessible living knowledge base for high-quality turbulence data, will rely on CINECA's DLTD service, which will be implemented using the same technology adopted by EUDAT, to foster interoperability and accessibility.

Human Brain Project

The Human Brain Project (HBP) is a large-scale European research project which aims to build the world's first and most exact simulation of the human brain on a supercomputer. Bringing together 86 different European institutions, the project is planned to last ten years (2013-2023).

To image an entire brain, many parallel stacks of images are acquired; they are afterwards merged with a custom-made algorithm suited to very large data sets (~1 TB).

Human Brain Project

WP7.5 High Performance Computing Platform: integration and operations. CINECA is responsible for Task 7.5.4, the HBP supercomputer for massive data analytics: implement and operate a data-centric HPC facility providing efficient storage, processing and management of the large volumes of data generated by the HBP.

The HBP Massive Data Analytics Supercomputer at CINECA will provide Tier-0 and Tier-1 service, integrated with a mass storage facility of more than 5 Petabytes of working space with an aggregated file-system bandwidth of about 70 GB/s in a full production environment. The system will be integrated with a Data Facility of more than 5 Petabytes of online disk storage repository and more than 10 Petabytes for long-term data archiving, providing efficient data life-cycle management of the structured and unstructured data generated by the HBP.
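As a back-of-the-envelope check of the figures above (assuming decimal units, 1 PB = 10^6 GB): streaming the full 5 PB working space at the quoted 70 GB/s aggregate bandwidth would take roughly 20 hours.

```python
working_space_pb = 5   # working space quoted on the slide
bandwidth_gb_s = 70    # aggregated file-system bandwidth quoted on the slide

seconds = working_space_pb * 1_000_000 / bandwidth_gb_s  # 1 PB = 10^6 GB (decimal)
print(f"{seconds / 3600:.1f} hours")  # → 19.8 hours
```

In other words, the quoted bandwidth is sized so that a full sweep of the working set fits within about a day of sustained I/O.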

Sample preparation

1. Transcardiac mouse perfusion with PBS 1X followed by paraformaldehyde.
2. Extraction of the brain from the skull and post-fixing in PFA 4% overnight at 4 °C.
3. (optional) Excision of the desired part of the brain.
4. Embedding of the sample in a 0.5% w/w agarose gel.
5. Dehydration in a graded ethanol series.
6. (for whole brains) Additional dehydration in hexane 100% (1 h).
7. Incubation in clearing solution (1:2 Benzyl Alcohol / Benzyl Benzoate).

[Images: uncleared mouse brain vs. cleared mouse brain]

Whole Brain Imaging

[Image: cerebellum from a P10 L7-GFP mouse; panels TV, SV, CV; scale bars: 1 mm]

Total volume 73 mm³, voxel size 0.8 × 0.8 × 1 µm³, acquisition time 24 h (1.3 MegaVoxels/s).

Whole Brain Imaging

[Image: whole brain from a P15 thy1-GFP-M mouse; portions of the hippocampus and superior colliculus are magnified; scale bars: 200 µm]

Total volume 223 mm³, voxel size 0.8 × 0.8 × 1 µm³, acquisition time 3 days (1.3 MegaVoxels/s).
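The acquisition times on the two imaging slides are consistent with the quoted voxel size (0.8 × 0.8 × 1 µm³ = 0.64 µm³) and the 1.3 MegaVoxels/s rate; a quick check (the helper name is illustrative):

```python
def acquisition_hours(volume_mm3, voxel_um3, megavoxels_per_s):
    """Hours needed to scan a volume at a given voxel size and acquisition rate."""
    voxels = volume_mm3 * 1e9 / voxel_um3            # 1 mm^3 = 10^9 um^3
    return voxels / (megavoxels_per_s * 1e6) / 3600  # voxels / (voxels per second)

voxel_um3 = 0.8 * 0.8 * 1.0                          # 0.64 um^3 per voxel
print(round(acquisition_hours(73, voxel_um3, 1.3), 1))        # → 24.4 (slide: 24 h)
print(round(acquisition_hours(223, voxel_um3, 1.3) / 24, 1))  # → 3.1 (slide: 3 days)
```

Both slide figures therefore follow directly from volume, voxel size and acquisition rate.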


Initiatives (HPC + HPDA)

- HPC: Tier-0 (FERMI) > 2 PF; Tier-1 > 1 PF; open access via peer review (PRACE, ISCRA, LISA)
- HPDA: Nubes cloud, remote visualization, structured and unstructured data processing; 80-node Linux server cluster
- Repository: access, reference, curation, replica, preservation
- Storage: 10 PB / 5 PB / 5 PB for workspace and repository, 12 PB for long-term archiving

Business model for the (Cineca) RI

- Research infrastructure supported by the Ministry and by national and European structured funds
- Scientific community contribution for marginal and operational costs
- Standard service-provision best practices are in common use by Cineca, which is fully ISO certified
- Partnership and scientific collaboration
- Vertical disciplinary projects should (have to) define in their budgets the contribution to the persistency of the (Cineca) RI, on the basis of their programmatic access
- A pay-for-service model (applicable since Cineca is a private organization) implies VAT costs and could not be widely applicable
- Open tenders for service procurement may be very difficult to apply, because of the difficulty of defining suitable terms-of-reference documentation for in-progress scientific activities
- The skills and competence profiles need a continuous training programme, partly on the job
- Technology transfer process, also in collaboration with scientific communities, for added-value services towards private organizations and industries

Many thanks for your attention!

Bologna - Rome - Milan