Building an Open Data Infrastructure for Science: Turning Policy into Practice



Similar documents
Integrating Research Information: Requirements of Science Research

Shibbolized irods (and why it matters)

Computing Strategic Review. December 2015

Research Data Alliance: Current Activities and Expected Impact. SGBD Workshop, May 2014 Herman Stehouwer

INFRASTRUCTURE PROGRAMME

Linking raw data with scientific workflow and software repository: some early

Global Research Data Infrastructure: Path Forward for Progress. Dr. Chris Greer Senior Executive for Cyber Physical Systems

Exploring the roles and responsibilities of data centres and institutions in curating research data a preliminary briefing.

Benefits of managing and sharing your data

Workprogramme

Nuclear Physics Technology Showcase Event 26 September 2013, IoP London. STFC External Innovations Programme Funding Schemes

Data sharing and Big Data in the physical sciences. 2 October 2015

Second EUDAT Conference, October 2013 Data Management Plans and Certification Motivation: increasing importance of Data Management Planning

The APS and synchrotron facilities in the US and worldwide: a network perspective

Content. > How to ensure sustainability of RIs? > RIs generate knowledge > Socio-economic impact of RIs

Pan-European ESS Data Management and Software Centre Takes Form in Copenhagen

Data Management Planning

OpenAIRE Research Data Management Briefing paper

House of Lords Science & Technology Committee: Scientific Infrastructure

Workshop on Data Visualisation, Reduction and Analysis at Australia s Replacement Research Reactor Workshop Report and Recommendations Background

ESRC Big Data Network Phase 2: Business and Local Government Data Research Centres Welcome, Context, and Call Objectives

Horizon Research e-infrastructures Excellence in Science Work Programme Wim Jansen. DG CONNECT European Commission

Scientific Data Infrastructure: activities in the Capacities Programme of FP7

Metadata for Data Discovery: The NERC Data Catalogue Service. Steve Donegan

Human Brain Project -

The data landscape lessons from UK

Open Science and Property Knowledge:

BIG data big problems big opportunities Rudolf Dimper Head of Technical Infrastructure Division ESRF

Creating a Data Management Plan for your Research

Deposition and use of raw diffraction images

Jean Sykes The UK Research Data Service Project

THE UNIVERSITY OF LEEDS. Vice Chancellor s Executive Group Funding for Research Data Management: Interim

IT Needs of and Vision for Photon / Neutron Community

Exploitation of ISS scientific data

The BEAR Management Group will report to the University Research Committee.

LJMU Research Data Policy: information and guidance

Research Data Storage and the University of Bristol

European Data Infrastructure - EUDAT Data Services & Tools

The UK INCF Node and the CARMEN project. Presenter: Leslie Smith

Building an Ecosystem to Accelerate Data-Driven Innovation

THE CCLRC DATA PORTAL

STFC s Science and Technology Strategy

UK-EOF Data Solutions Workshop

An Introduction to Managing Research Data

Executive summary. Key points

Global Scientific Data Infrastructures: The Big Data Challenges. Capri, May, 2011

EDISON: Coordination and cooperation to establish new profession of Data Scientist for European Research and Industry

Survey of Canadian and International Data Management Initiatives. By Diego Argáez and Kathleen Shearer

data infrastructures framework for action for H2020

Cloud and Big Data Standardisation

ICSU World Data System Implementation Plan

Implementing selection and appraisal policies at the UK Data Archive

What can be learned from the UK CARMEN project? Presenter: Leslie Smith

Overview of state of art in Data management. Stefano Cozzini CNR/IOM and exact lab srl

NT1: An example for future EISCAT_3D data centre and archiving?

Scholarly Communication A Matter of Public Policy

Research Data Alliance - Research Data Sharing without barriers Big Data & Open Data Workshop 2014, Brussels 7-8 May 2014

Research information meets research data management... in the library?

Big Data Standardisation in Industry and Research

The 100,000 genomes project

Big Data and the Earth Observation and Climate Modelling Communities: JASMIN and CEMS

9360/15 FMA/AFG/cb 1 DG G 3 C

NIH Commons Overview, Framework & Pilots - Version 1. The NIH Commons

International Consortium for Harmonization of Clinical Laboratory Results. Operating Procedures

8970/15 FMA/AFG/cb 1 DG G 3 C

RESEARCH DATA MANAGEMENT POLICY

Optimising Data Management: full listing of deliverables. College storage approach

Capitalizing on Big Data

Report of the DTL focus meeting on Life Science Data Repositories

SURFsara Data Services

New Jersey Big Data Alliance

The cansas Format for Storage and Interchange of Reduced Multi-Dimensional Small-Angle Scattering Data

Life Sciences and Large Data Challenges

Data at NIST: A View from the Office of Data and Informatics

EU H2020 funding opportunities. Mauro Morandin INFN PD

CYBERINFRASTRUCTURE FRAMEWORK FOR 21 st CENTURY SCIENCE AND ENGINEERING (CIF21)

Computing Advisory Panel (CAP) Preliminary enquiry in respect of future PRACE membership. Jan- 2014

Taking Forward The Curl Vision: The Year s Highlights

Data Intensive Research Initiative for South Africa (DIRISA)

THE EFFECTS OF BIG DATA ON INFRASTRUCTURE. Sakkie Janse van Rensburg Dr Dale Peters UCT 1

Integrating archive data storage with institutional repositories

Accelerating Experimental Elementary Particle Physics with the Gordon Supercomputer. Frank Würthwein Rick Wagner August 5th, 2013

CHPC initiatives in Data and HPC in South Africa. An initiative of the Department of Science and Technology Managed by the CSIR Meraka Institute

Research Data Services at Cambridge and our motivations for the Pilot

Large scale Data Storage Services for Science at STFC. David Corney

NERC Data Policy Guidance Notes

Open Research Data: The Future of Science - Forum Rolex Learning Center (EPFL Campus) Christopher Brown

PaNdata WP3 F2F Meeting. PaNdata F2F Meeting, MAX-lab, March 13/14, 2013 Heinz J Weyer, PSI 1

EUROPEAN COMMISSION Directorate-General for Research & Innovation. Guidelines on Data Management in Horizon 2020

Grids, e-business and e-utilities. Tony Hey Director of the UK e-science Core Programme EPSRC and DTI

Data Intensive Science and Computing

How To Build An Open Source Data Infrastructure

(Possible) HEP Use Case for NDN. Phil DeMar; Wenji Wu NDNComm (UCLA) Sept. 28, 2015

Research Data Understanding your choice for data placement

Data Management and Standardisation in Distributed Systems Biology Research

Deploying Multiscale Applications on European e-infrastructures

Jeff Wolf Deputy Director HPC Innovation Center

Procurement Innovation for Cloud Services in Europe

Research Data Management: The library s role

DATA LIFE CYCLE & DATA MANAGEMENT PLANNING

Transcription:

Building an Open Infrastructure for Science: Turning Policy into Practice Juan Bicarregui Head of Services Division STFC Department of Scientific Computing Franco-British Workshop on Big in Science, November 2012, London

Overview Introduction What is STFC? What do we need from our data infrastructure? An example project The PaNdata Collaboration Fostering Collaboration on a Global Scale The Research Alliance

What is STFC? 250m Programme includes: Neutron and Muon Source Square Kilometre Array Synchrotron Radiation Source Lasers Space Science Particle Physics Compuing and Management Microstructures Nuclear Physics Radio Communications Large Hadron Collider Daresbury Laboratory ESRF & ILL, Grenoble

What is the science?

Putting Policy into Practice RCUK Principles on Policy Seven (fairly) orthogonal principles: Public good Preservation Discoverability Confidentiality First use Recognition Costs } } Access } Rights Similar policy work going on at EU and Global levels

centric view of research the researcher acts through ingest and access Archival Virtual Research Environment Creation Access Curation Services the researcher shouldn t have to Network worry about the information infrastructure Storage Compute Information Infrastructure

The 7 C s Collaboration Communication Creation Creation Archival Access Collection Curation Services Network Storage Compute Capacity Curation Computation

Creation Linked systems for: Proposal submission User management acquisition Metadata carried from each system to the next Detectors moving from Hz to KHz, towards MHz,... Examining the detectors on MAPS instrument on ISIS

Capacity 10 x Moore s Law (2 years) 2 x Moore s Law (1.5 years) Moore s law for us 10.000 Total Stored (TeraBytes) } 20PB is about 15 months 1.000 Currently store about 100 20 PetaBytes of data 10 1 1997 1998 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2012 Moore s Law x1000 in 13 years Doubling every 1.3 years

Computation Fitting of experimental data to model Compute intensive components on HPC (Blue Gene/Q at DL, Emerald GPGPU cluster at RAL) Real-time diagnostics of instrument performance and data flow pipeline. Computational derivation of properties from theory

Facility Archives All ISIS data (~25 years) > 3,000,000 files All Diamond (~5 years) > 100,000,000 files LHC Tier 1 UK hub for LHC data (11PB) Other Research Councils NERC JASMIN+CEMS super data cluster ( 4.5M) BBSRC Institutes data archive MRC Support service JISC JISCmail (1Million users) National Grid Service Universities Curation Imperial College - National Service for Computational Chemistry Software - EPSRC funded Oxford, UCL, Southampton Bristol, Emerald GPU cluster ( 1M) (EPSRC) Others: The SCARF compute cluster for RAL Facilities, Users and Collaborators The STFC Publications Archive The CCPs (Collaborative Computational Projects) The Chemical base Service The IBM Blue Gene/Q ICE-CSE The StorageTek tape robots 100PB Capacity JASMIN & CEMS 4PB Parallel disk

Collection Proposal Record Publication Scientist submits application for beamtime Approval Scheduling Experiment cleansing analysis Subsequent publication registered with facility Facility committee approves application Facility registers, trains, and schedules scientist s visit Scientists visits, facility run s experiment Raw data filtered and cleansed Tools for processing made available

Immense Expectations! Web enables: access to everything Everything on-line Interlinking enables: Validation of results Repetition of experiment Communication The web has changed everything... Discovery enables: new knowledge from old STFC s e-pubs Institutional Repository has records of 30,000 publications spanning 25 years

Collaboration Technology integration facilitates scientific collaboration Cross facility/beamline Cross disciplinary Technology integration improves facility efficiency PaN-data Photon and Neutron infrastructure ICAT also used in Australian Synchrotron and Oak Ridge National Lab Research Alliance

The 7 C s Collaboration Communication Creation Creation Archival Access Collection Curation Services Network Storage Compute Capacity Curation Computation

Overview Introduction What is STFC? What do we need from our data infrastructure? An example project - PaNdata Fostering Collaboration on a Global Scale - RDA

The PaNdata Collaboration Established 2007 with 4 partners Expanded since to 13 organisations (see next slide) Aims:...to construct and operate a shared data infrastructure for Neutron and Photon laboratories... 2007 2008 2009 2010 2011 2012 2013 2014 EDNS (4) EDNP (10) PaNdataEurope(11) Pandata ODI(11)

PaN-data Partners PaN-data bring together 13 major European Research Infrastructures ISIS is the world s leading pulsed spallation neutron source ILL operates the most intense slow neutron source in the world PSI operates the Swiss Light Source, SLS, and Neutron Spallation Source, SINQ, and is developing the SwissFEL Free Electron Laser HZB operates the BER II research reactor the BESSY II synchrotron CEA/LLB operates neutron scattering spectrometers from the Orphée fission reactor JCNS Juelich Centre for Neutron Science ESRF is a third generation synchrotron light source jointly funded by 19 European countries Diamond is new 3rd generation synchrotron funded by the UK and the Wellcome Trust DESY operates two synchrotrons, Doris III and Petra III, and the FLASH free electron laser Soleil is a 2.75 GeV synchrotron radiation facility in operation since 2007 ELETTRA operates a 2-2.4 GeV synchrotron and is building the FERMI Free Electron Laser ALBA is a new 3 GeV synchrotron facility due to become operational in 2010 MaxLab, Max IV Synchrotron PaN-data is coordinated by the STFC Department of Scientific Computing

The Science we do - Structure of materials Visit facility on research campus Place sample in beam Diffraction pattern from sample Fitting experimental data to model Structure of cholesterol in crude oil Over 30,000 user visitors each year: physics, chemistry, biology, medicine, energy, environmental, materials, culture pharmaceuticals, petrochemicals, microelectronics Over 5.000 high impact publications per year But so far no integrated data repositories Lacking sustainability & traceability Longitudinal strain in aircraft wing Hydrogen storage for zero emission vehicles Bioactive glass for bone growth Magnetic moments in electronic storage

PaNdata Vision Single Infrastructure Single User Experience Raw Analysed Publication Different Infrastructures Analysis Different User Experiences Catalogue Catalogue Catalogue Publications Catalogue Raw Facility 1 Analysis Analysed Publication Publication s Raw Facility 2 Analysis Analysed Publication Publication s Raw Facility 3 Capacity Storage Analysis Software Repositories Analysed Repositories Publication Publication s Publications Repositories EDNS - European Infrastructure for Neutron and Synchrotron Sources

PaN-data Europe building a sustainable data infrastructure for Neutron and Photon laboratories PaN-data Standardisation PaN-data Europe is undertaking 5 standardisation activities: 1. Development of a common data policy framework 2. Agreement on protocols for shared user information exchange 3. Definition of standards for common scientific data formats 4. Strategy for the interoperation of data analysis software enabling the most appropriate software to be used independently of where the data is collected 5. Integration and cross-linking of research outputs completing the lifecycle of research, linking all information underpinning publications, and supporting the long-term preservation of the research outputs

The Research Lifecycle a personal view the researcher acts through ingest and access Meta/ Archival Catalogues Research Environment User Info feed DAQ Creation feed Analysis feed Provenanced Research Portals Access Services EGI GEANT e-infrastructure the researcher shouldn t have to worry about the information infrastructure Network Storage Compute Local resources Information Infrastructure

Overview Introduction What is STFC? What do we need from our data infrastructure? An example project PaNdata Fostering Collaboration on a Global Scale - RDA

Research Alliance Emerging international organization Currently supported by: EU NSF Australian National Service To accelerate data-driven innovation through research data sharing and exchange. Infrastructure, Policy, Practice and Standards

Research Alliance Vision and Purpose Vision Researchers around the world sharing and using research data without barriers. Purpose to accelerate international data-driven innovation and discovery by facilitating research data sharing and exchange, use and re-use, standards harmonization, and discoverability. through the development and adoption of infrastructure, policy, practice, standards, and other deliverables.

RDA Principles Openness Membership is open to all interested organizations, all meetings are public, RDA processes are transparent, and all RDA products are freely available to the public; Consensus The RDA moves forward by achieving consensus and resolves disagreements through appropriate voting mechanisms; Balance The RDA is organized on the principle of balanced representation for individual organizations and stakeholder communities; Harmonization The RDA works to achieve harmonization across standards, policies, technologies, tools, and other data infrastructure elements; Voluntary The RDA is not a government organization or regulatory body and, instead, is a public body responsive to its members; and Non-profit RDA is not a commercial organization and will not design, promote, endorse, or sell commercial products, technologies, or services.

Research Alliance: Building Bridges Bridges to the future data preservation Bridges to research partners Bridges across disciplines Bridges across regions Bridges to integration to solve new problems Bridges across communities 27

RDA role Two bridges we can build: Connecting Connecting People What kind of organisation do we need to do this?

Practitioners Domain Administrative Domain Working Groups - Carry out work of RDA - Reach consensus on outputs - May suggest BoFs about new topics - Open to all but - some commitment expected Plenary - Open to all persons involved in RDA - Hears and comments on reports from WGs - Suggests new BoFs - Hears candidates for TAC Technical Advice Committee - advise on WG work activities - Interacting directly with working groups - advise on new WGs and new BoFs - Give implementation suggestions to strategic direction from council Online Open Interaction Fora - use for all kinds of activities, open to all RDA members Admistration and Management Team -Implement strategic direction set by council -Supports the activities of the RDA -Arrange plenary meetings -Run the on-line for a -Manage documents -Convene nominating committees for - Council and TAC -Monitor and controls finances -Prepare reports for - Council, funders,. Council - Set strategic direction - Final vote on governance matters - Approve new WGs (TAC advised) - control balanced WG approach

Research Alliance Status in November 2012 Initial meetings held in Munich and Washington ~200 Delagates ~12 vanguard Working Groups being established Council and Secretariat formed Launch planned for March 17 in Guttenberg Website: rd-alliance.org/ Washington Meeting: d2i.indiana.edu/data2012/researchalliance

RDA The Knowledge Lifecycle Collaboration Communication Creation e-science/ infrastructure Capacity Creation Archival Curation Services Network Storage Compute Access Collection Curation Scientific Computing/ e-infrastructure Computation

Overview Introduction What is STFC? What do we need from our data infrastructure? An example project PaNdata Fostering Collaboration on a Global Scale - RDA

Thank You www.stfc.ac.uk/scd www.pan-data.eu www.rd-alliance.org/

The End