
Open Science, Big Data and Research Reproducibility. Tony Hey, Senior Data Science Fellow, eScience Institute, University of Washington. tony.hey@live.com

The Vision of Open Science

Vision for a New Era of Research Reporting: Reproducible Research, Collaboration, Reputation & Influence, Interactive Data, Dynamic Documents. (Thanks to Bill Gates, SC05)

Jim Gray's Vision: All Scientific Data Online. Many disciplines overlap and use data from other sciences. The Internet can unify all literature and data: go from literature to computation to data and back to literature. Information at your fingertips, for everyone, everywhere. Increase scientific information velocity: literature, derived and recombined data, raw data. Huge increase in science productivity. (From Jim Gray's last talk)

OSTP Memo: Open Science and Open Access

US OSTP Memorandum: a directive requiring the major Federal funding agencies to develop a plan to support increased public access to the results of research funded by the Federal Government. The memorandum defines digital data as the digital recorded factual material commonly accepted in the scientific community as necessary to validate research findings, including data sets used to support scholarly publications, but not laboratory notebooks, preliminary analyses, drafts of scientific papers, plans for future research, peer review reports, communications with colleagues, or physical objects such as laboratory specimens. 22 February 2013

The US National Library of Medicine. The NIH Public Access Policy ensures that the public has access to the published results of NIH-funded research. It requires scientists to submit final peer-reviewed journal manuscripts that arise from NIH funds to the digital archive PubMed Central upon acceptance for publication. The policy requires that these papers be accessible to the public on PubMed Central no later than 12 months after publication. [Diagram: the Entrez cross-database search tool linking publishers, PubMed abstracts, taxon phylogeny, nucleotide sequences, complete genomes, 3-D structures (MMDB), protein sequences, and genome centers.]
Ø Jim Gray worked with David Lipman and his team at NCBI to create a portable version of PubMed Central
Ø This is now deployed in Europe and elsewhere

US NIH Open Access Policy. Once posted to PubMed Central, results of NIH-funded research become more prominent, integrated and accessible, making it easier for all scientists to pursue NIH's research priority areas competitively. PubMed Central materials are integrated with large NIH research databases such as GenBank and PubChem, which helps accelerate scientific discovery. The policy allows NIH to monitor, mine, and develop its portfolio of taxpayer-funded research more effectively, and archive its results in perpetuity.

NIH Open Access Compliance? PMC compliance rate: before the legal mandate, compliance was 19%. The mandate was signed into law by George W. Bush in 2007; after it, compliance rose to 75%. NIH took the further step of announcing that, starting in 2013, it would hold processing of non-competing continuation awards if publications arising from grant awards were not in compliance with the Public Access Policy. NIH has now implemented this continuation-award policy, and the compliance rate has been increasing by about ½% per month. By November 2014, the compliance rate had reached 86%.

Open Access to Scholarly Publications and Data: 2013 as the Tipping Point? US OSTP Memorandum, 22 February 2013. Global Research Council Action Plan, 30 May 2013. G8 Science Ministers Joint Statement, 12 June 2013. European Union Parliament, 13 June 2013.

University of California approves Open Access. UC is the largest public research university in the world and its faculty members receive roughly 8% of all research funding in the U.S. UC produces 40,000 publications per annum, corresponding to about 2–3% of all peer-reviewed articles in the world each year. UC policy requires all 8,000 faculty to deposit full-text copies of their research papers in the UC eScholarship repository unless they specifically choose to opt out. 2 August 2013

Data Curation

New Requirements for DOE Research Data

EPSRC Expectations for Data Preservation. Research organisations will ensure that EPSRC-funded research data is securely preserved for a minimum of 10 years from the date that any researcher privileged access period expires. Research organisations will ensure that effective data curation is provided throughout the full data lifecycle, with data curation and data lifecycle being as defined by the Digital Curation Centre.

Progress in Data Curation? Professor James Frew (UCSB): the biggest change is the funding agency mandate. NSF's Data Management Plan requirement for all proposals has made scientists (pretend?) to take data curation seriously. There are better curated databases and metadata now, but it is not clear that the quality fraction is increasing! Frew's first law: scientists don't write metadata. Frew's second law: any scientist can be forced to write bad metadata.
Ø Should automate creation of metadata as far as possible
Ø Scientists need to work with metadata specialists with domain knowledge
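Frew's laws suggest automating whatever metadata can be generated by machine. A minimal sketch of that idea, using only the Python standard library — the CSV layout, filename, and field names here are hypothetical, not from the slide:

```python
import csv
import hashlib
import json
import os
from datetime import datetime, timezone

def harvest_metadata(path):
    """Automatically extract basic descriptive metadata from a CSV file.

    Captures the facts a scientist is unlikely to record by hand:
    checksum, file size, column names, row count, harvest timestamp.
    """
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    with open(path, newline="") as f:
        reader = csv.reader(f)
        columns = next(reader)               # header row
        row_count = sum(1 for _ in reader)   # remaining data rows
    return {
        "file": os.path.basename(path),
        "size_bytes": os.path.getsize(path),
        "sha256": digest,
        "columns": columns,
        "rows": row_count,
        "harvested": datetime.now(timezone.utc).isoformat(),
    }

# Hypothetical example: write a tiny data file, then harvest its metadata
with open("obs.csv", "w") as f:
    f.write("site,temp_c\nA,11.2\nB,9.8\n")
meta = harvest_metadata("obs.csv")
print(json.dumps(meta, indent=2))
```

A domain specialist would still need to add the semantic metadata (units, instrument, provenance) that no script can infer.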

Open Science and Research Reproducibility

Jon Claerbout and the Stanford Exploration Project (SEP) with the oil and gas industry. Jon Claerbout is the Cecil Green Professor Emeritus of Geophysics at Stanford University. He was one of the first scientists to recognize that the reproducibility of his geophysics research required access not only to the text of the paper but also to the data being analyzed and the software used to do the analysis. His 1992 paper introduced an early version of an executable paper.

Serious problems of research reproducibility in bioinformatics. A review of 2,047 retracted articles indexed in PubMed in May of 2012 concluded that 21.3% were retracted because of errors and 67.4% because of scientific misconduct: fraud or suspected fraud (43.4%), duplicate publication (14.2%), plagiarism (9.8%). Studies by the pharma companies Bayer and Amgen concluded that between 60% and 70% of biomedicine studies may be non-reproducible. Amgen scientists were only able to reproduce 7 out of 53 cancer results published in Science and Nature.

Computational Science and Reproducibility

eScience and the Fourth Paradigm. A thousand years ago: Experimental Science, description of natural phenomena. Last few hundred years: Theoretical Science, Newton's Laws, Maxwell's Equations. Last few decades: Computational Science, simulation of complex phenomena, e.g. (ȧ/a)² = 4πGρ/3 − Kc²/a². Today: Data-Intensive Science, scientists overwhelmed with data sets from many different sources: data captured by instruments, data generated by simulations, data generated by sensor networks. eScience is the set of tools and technologies to support data federation and collaboration: for analysis and data mining, for data visualization and exploration, for scholarly communication and dissemination. (With thanks to Jim Gray)

2012 ICERM Workshop on Reproducibility in Computational and Experimental Mathematics. The workshop participants noted that computational science poses a challenge to the usual notions of research reproducibility. Experimental scientists are taught to maintain lab books that contain details of the experimental design, procedures, equipment, raw data, processing and analysis, but few computational experiments are documented so carefully:
Ø Typically there is no record of the workflow, no listing of the software used to generate the data, and inadequate details of the computer hardware the code ran on, the parameter settings and any compiler flags that were set
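One lightweight remedy is to emit a machine-readable run record with every computational experiment, capturing exactly the details the ICERM workshop found missing. A minimal sketch — the field names and example parameters are illustrative, not from the slide:

```python
import json
import platform
import sys
from datetime import datetime, timezone

def run_record(params):
    """Record the computational environment alongside the run parameters.

    A fuller version would also record the code version (e.g. a git
    commit hash) and any compiler flags, per the ICERM recommendations.
    """
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "machine": platform.machine(),
        "os": platform.platform(),
        "python": sys.version.split()[0],
        "parameters": params,
    }

# Hypothetical run of a simulation with explicit parameter settings
record = run_record({"grid_size": 256, "tolerance": 1e-8, "solver": "cg"})
print(json.dumps(record, indent=2))
```

Saving this JSON next to the output data makes the "no record of the workflow" failure mode much harder to fall into.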

Best Practices for Researchers Publishing Computational Results. Data must be available and accessible. In this context the term "data" means the raw data files used as a basis for the computations, which are necessary for others to regenerate published computational findings. Code and methods must be available and accessible. The traditional methods section in a typical publication does not communicate sufficient detail for a knowledgeable reader to replicate computational results. A necessary action is making the complete set of instructions, typically in the form of computer scripts or workflow pipelines, conveniently available. Citation: do it. If you use data you did not collect from scratch, or code you did not write, however little, cite it. Citation standards for code and data are still being discussed, but it is less important to get the citation perfect than it is to make sure the work is cited at all. Copyright and publisher agreements. Publishers, almost uniformly, request that authors transfer all ownership rights over the article to them. All they really need is the authors' permission to publish. Supplemental materials. Publishers should establish style guides for supplemental sections, and authors should organize their supplemental materials following best practices. From http://wiki.stodden.net

But Sustainability of Data Links? 44% of data links from 2001 were broken by 2011 (Pepe et al. 2012).

Challenge of Numerical Reproducibility? Numerical round-off error and numerical differences are greatly magnified as computing simulations are scaled up to run on highly parallel systems. As a result, it is increasingly difficult to determine whether a code has been correctly ported to a new system, because computational results quickly diverge from standard benchmark cases. And it is doubly difficult for other researchers, using independently written codes and distinct computer systems, to reproduce published results. David Bailey: Fooling the Masses: Reproducibility in High-Performance Computing (http://www.davidhbailey.com)
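Bailey's point can be seen with nothing more than summation order: floating-point addition is not associative, so a parallel reduction that groups terms differently from a serial loop can produce a different answer from the same inputs. A minimal illustration:

```python
# Floating-point addition is not associative: regrouping the same three
# terms, as a parallel reduction might, changes the computed sum.
a, b, c = 1e16, 1.0, -1e16

serial = (a + b) + c      # left-to-right: the 1.0 is absorbed into 1e16
regrouped = (a + c) + b   # a different grouping preserves the 1.0

print(serial, regrouped)  # prints: 0.0 1.0
```

At scale, millions of such tiny discrepancies accumulate, which is why bit-for-bit agreement between a serial run and a highly parallel run of the same code cannot be expected.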

Same Physics, Different Programs? Different programs written by different researchers can be used to explore the physics of the same complex system. The programs may use different algorithms and/or different numerical methods. The codes are different but the target physics problem is the same, so one cannot insist on exact numerical agreement.
Ø Computational reproducibility involves finding similar quantitative results for the key physical parameters of the system being explored
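In this setting, "reproduced" means agreement within a stated tolerance rather than bit-for-bit equality. A sketch of such a check, assuming two codes report the same key physical parameters — the parameter names and values below are hypothetical:

```python
import math

def parameters_agree(run_a, run_b, rel_tol=1e-3):
    """Compare key physical parameters from two independent codes,
    requiring agreement within a relative tolerance, not bitwise equality."""
    return all(
        math.isclose(run_a[k], run_b[k], rel_tol=rel_tol)
        for k in run_a
    )

# Hypothetical outputs from two independently written codes
code_a = {"ground_state_energy": -7.4521, "critical_temp": 2.2692}
code_b = {"ground_state_energy": -7.4518, "critical_temp": 2.2697}

print(parameters_agree(code_a, code_b))  # True: same physics, different codes
```

The tolerance itself is a scientific judgment: it should reflect the known discretization and round-off errors of the methods being compared.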

Big Science and the Long Tail

Big Science and the Long Tail. Big Science (few researchers, huge data volume): extremely large data sets, expensive to move, domain standards, high computational needs (supercomputers, HPC, grids), e.g. high energy physics, astronomy. Middle tier: large data sets, some standards within domains, shared datacenters and clusters, research collaborations, e.g. genomics, finance. The Long Tail (many researchers, smaller data): medium and small data sets, flat files and Excel, widely diverse data with few standards, local servers and PCs, e.g. social sciences, humanities. [Chart: data volume vs. number of researchers.]

Project Pyramids. In most disciplines there are a few giga projects, several mega consortia, and then many small labs. Often some instrument creates the need for a giga- or mega-project: polar station, accelerator, telescope, remote sensor, genome sequencer, supercomputer. Tier 1, 2, 3 facilities to use the instrument and data: international, multi-campus, single lab. (Slide from Jim Gray, 2007)

Experiment Budgets: ¼ to ½ is software. Software for instrument scheduling, instrument control, data gathering, data reduction, database, analysis, modeling, visualization. Millions of lines of code, repeated for experiment after experiment, with not much sharing or learning. CS can change this: build generic tools, workflow schedulers, databases and libraries, analysis packages, visualizers. (Slide from Jim Gray, 2007)

Computing and Big Science. Computational science, in the sense of performing computer simulations of complex systems, is only one aspect of computing in physics. For Big Science projects, generation and analysis of the data would not be possible without large-scale computing resources. Examples: NASA MODIS satellite data pre-processing; LSST, the Large Synoptic Survey Telescope; the LHC (Large Hadron Collider) experiments.
Ø New challenges for open science

NASA MODIS Satellite: Image Processing Pipeline. Data collection stage: downloads requested input tiles from NASA FTP sites; includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile. Reprojection stage: converts source tile(s) to intermediate-result sinusoidal tiles, using simple nearest neighbor or spline algorithms. Derivation reduction stage: the first stage visible to the scientist; computes ET in our initial use. Analysis reduction stage: an optional second stage visible to the scientist; enables production of science analysis artifacts such as maps, tables, and virtual sensors. [Diagram: source imagery download sites feed the data collection stage via a download queue; reprojection, reduction #1 and reduction #2 queues connect the stages; scientists submit requests and download results through the AzureMODIS service web role portal.] http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx (Slide thanks to Catharine van Ingen)
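The staged, queue-connected structure described above can be sketched in miniature. This is an illustrative toy, not the AzureMODIS code: the stage functions are hypothetical stand-ins for tile download, reprojection, and reduction, and each stage drains its input queue into the next stage's queue.

```python
from queue import Queue

# Hypothetical stand-ins for the real pipeline stages (FTP download,
# sinusoidal reprojection, ET computation); each tags the data it touches.
def download(tile):
    return f"raw({tile})"

def reproject(data):
    return f"sinusoidal({data})"

def reduce_stage(data):
    return f"ET({data})"

def run_pipeline(tiles):
    """Run each stage over a queue of work items, feeding the next queue."""
    stages = [download, reproject, reduce_stage]
    q = Queue()
    for t in tiles:
        q.put(t)
    for stage in stages:
        next_q = Queue()
        while not q.empty():
            next_q.put(stage(q.get()))
        q = next_q
    return [q.get() for _ in range(q.qsize())]

results = run_pipeline(["h08v05", "h09v05"])
print(results)
```

The queue-per-stage design is what lets the real service scale each stage independently: slow reprojection just means more workers draining that one queue.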

The ATLAS Software (thanks to Gordon Watts). Software written by around 700 postdocs and grad students. ATLAS software is 6M lines of code: 4.5M in C++ and 1.5M in Python. A typical reconstruction task has 400 to 500 software modules. The software system begins with data acquisition of collision events from 100M readout channels and then reconstructs particle trajectories. The reconstruction process requires a detailed Monte Carlo simulation of the ATLAS detector, taking account of the geometries, properties and efficiencies of each subsystem of the detector. It produces values for the energy and momentum of the tracks observed in the detector. Then find the Higgs boson!

ATLAS Software Engineering Methodologies. Automated integration testing of modules. Candidate release code versions are tested in depth by running long jobs, producing standard plots, and detailed comparison with reference data sets. ATLAS uses the JIRA tool for bug tracking. After every observed difference has been investigated and resolved, the new version of the code is released to the whole ATLAS collaboration. ATLAS uses the Apache Subversion (SVN) version-control system. With over 2,000 software packages to be tracked, ATLAS uses release management software developed by the collaboration.

Research Reproducibility at the LHC? At the LHC there are two experiments, ATLAS and CMS, looking for the Higgs and physics beyond it. The detectors and the software used by these two experiments are very different. The two experiments are at different intersection points of the LHC and generate different data sets. Research reproducibility is addressed by having the same physics observed in different experiments: e.g. seeing the Higgs boson at the same mass value in both experiments. Making meaningful data available to the public is difficult, but the new CERN Open Data portal is now making a start.

http://opendata.cern.ch

Medium-Scale Science: An Example from the UK. In the UK, the research funding agency RCUK (approximately the NSF equivalent) supports a dozen or so Collaborative Computational Projects (CCPs). Each CCP is focused on a small number of scientific codes and provides support to a specific community. A team of scientific software experts at Daresbury Lab helps university researchers develop and maintain codes.

Long Tail Science: Small research groups. Typically a faculty professor and a few postdocs and graduate students. The researchers usually have little or no formal training in software development and software engineering technologies. Fortran is still used by many Long Tail scientists, although there is increasing use of Python. Small research groups have no software budget, so they use packages like MATLAB or FLUENT. Excel is used extensively for data analysis, management and visualization.

Software engineering and software carpentry

Software Quality and Software Sustainability. Open source is not a panacea! Commercial open source projects can produce high quality software, but too often scientific software is of poor quality, undocumented and unmaintained. The NSF now recognizes the importance of developing high quality, maintainable scientific software in its SI2 program: 1. Scientific Software Elements (SSE): SSE awards target small groups that will create and deploy robust software elements for which there is a demonstrated need that will advance one or more significant areas of science and engineering. 2. Scientific Software Integration (SSI): SSI awards target larger, interdisciplinary teams organized around the development and application of common software infrastructure aimed at solving common research problems faced by NSF researchers in one or more areas of science and engineering. SSI awards will result in a sustainable community software framework serving a diverse community or communities. 3. Scientific Software Innovation Institutes (S2I2): S2I2 awards will focus on the establishment of long-term hubs of excellence in software infrastructure and technologies, which will serve a research community of substantial size and disciplinary breadth.

Data Science in the Future?

What is a Data Scientist? Data Engineer: people who are expert at operating at low levels close to the data, and who write the code that manipulates it. They may have some machine learning background. Large companies may have teams of them in-house, or they may look to third-party specialists to do the work. Data Analyst: people who explore data through statistical and analytical methods. They may know programming, or may be spreadsheet wizards; either way, they can build models based on low-level data. They eat and drink numbers; they know which questions to ask of the data. Every company will have lots of these. Data Steward: people who attend to managing, curating, and preserving data. They are information specialists, archivists, librarians and compliance officers. This is an important role: if data has value, you want someone to manage it, make it discoverable, look after it and make sure it remains usable. ("What is a data scientist?", Microsoft UK Enterprise Insights Blog, Kenji Takeda, http://blogs.msdn.com/b/microsoftenterpriseinsight/archive/2013/01/31/what-is-a-data-scientist.aspx)

Conclusions?

Some final thoughts. Computational research reproducibility is complicated! Open science needs access to software and data, but are executable papers the answer? Making all research data open is not sensible or feasible, e.g. for the LHC experiments. Maintaining persistent links between publications and data is not easy. We certainly need better training for research software developers and data scientists. There is a need for culture change in the university research ecosystem to recognize and provide career paths for research scientists who specialize in software development and data science.

Slide thanks to Bryan Lawrence

See the opinion piece in Nature Physics last month: http://www.nature.com/nphys/journal/v11/n5/full/nphys3313.html. Available open access! Thank you for listening!