How To Manage Research Data At Columbia
|
|
|
- Anna Sheryl Gray
- 5 years ago
- Views:
Transcription
1 An experience/position paper for the Workshop on Research Data Management Implementations *, March 13-14, 2013, Arlington Rajendra Bose, Ph.D., Manager, CUIT Research Computing Services Amy Nurnberger, Research Data Manager, Center for Digital Research and Scholarship, Columbia Libraries/Information Services Columbia University, New York NY March 1, 2013 Introduction Columbia researchers in all areas generally develop their own storage solutions or use storage infrastructure that is provisioned at the school, department or center level by one of several local IT groups. At least one faculty- led center has a dedicated research computing/it team that provides data storage and other services to hundreds of users across a number of departments and labs, and another similar group may be formed in At the same time, central divisions including Information Technology, the Libraries, and Research Initiatives, together with new faculty committees, have been working to evolve research data storage strategy at the university during the past year. Current planning underscores the complete research data lifecycle and represents a change in philosophy at the university. In 2011 Columbia s Executive Vice President for Research formed a Research Computing Executive Committee of senior administrative and academic leadership. This group in turn charged a faculty- led Shared Research Computing Policy Advisory Committee to design a coordinated governance and financial plan for research computing. In the past months, a subcommittee of this group focusing on data storage and security has recommended moving forward with a research storage pilot project, and this recommendation has been submitted to the parent Research Computing Executive Committee for consideration. If approved, this research storage pilot would initially target a small subset of research faculty, would plan to offer storage at rates roughly comparable to commercial cloud systems such as Amazon S3, and would be tied to a larger central IT division (CUIT) project to acquire an expandable storage system for a variety of institutional needs. With this storage acquisition, CUIT could begin to enact a plan to provide centrally- managed and administered storage to the wider Columbia community. The research storage pilot would also be connected to a Core Research Computing Facility construction project to upgrade infrastructure within the university data center. * This work is licensed under a Creative Commons Attribution- NonCommercial 3.0 Unported (CC BY- NC 3.0) License This project is supported by NIH Research Facility Improvement Grant 1G20RR , awarded April 15, 2010, and associated New York State funds.
2 2 The Academic Commons repository hosted by the Columbia University Libraries / Information Services (CUL/IS) Center for Digital Research and Scholarship (CDRS) continues to not only collect and preserve the scholarship of the faculty, but also to make it accessible through search and discovery tools. The scholarly products preserved by Academic Commons are not limited to datasets and raw data, but may additionally include works based on those data such as articles, book chapters, essays, monographs, working papers, technical reports, conference presentations, multimedia creations (e.g., simulations, three- dimensional maps), and other materials in digital formats. As the scale and types of data being generated by research are constantly, and respectively, growing and evolving, CUL/IS is developing policies, procedures, communications, and training to accommodate these transformations. Central IT perspective In the last decades, CUIT has been responsible primarily for the administrative computing and storage needs for the university. Although CUIT has concentrated activity on managing the campus network as well as financial and student record systems, and does not currently offer storage services suited for research, a five- person research computing group was established four years ago. This group, Research Computing Services, now coordinates centrally- managed high- performance computing resources to some research groups on campus and has worked with others on reviewing their research storage needs. A legacy service provides all active Columbia associates with a small default amount of home directory file storage on the central Unix system. This service includes regular backups of home directory content with an offsite component. Aside from this home directory storage, researchers are expected to use storage infrastructure provided by their local IT group for day- to- day work. If these resources are not available or adequate, researchers frequently use their own research funds to build their own storage solutions, which could include hiring part- time, full- time or shared IT staff (or using research staff as IT staff), in addition to purchasing and configuring hardware and software and designing a backup regimen. Many faculty, staff and students now use commercial storage solutions like Dropbox to collaborate with colleagues on research and to backup work documents. The CUIT Research Computing Services group, in collaboration with several central IT teams, initiated a shared high performance computing (HPC) pilot system in 2009, which now serves roughly 200 researchers. This system includes 75 TB of non- backed- up scratch storage available for users running HPC jobs. Researchers using our HPC system use other file systems over the network to store their data sets and computational results permanently. Faculty committees have recommended future planning to make it easy for HPC users to move data sets from centrally- managed, general research storage to and from HPC scratch storage to accommodate researcher workflows. Libraries perspective Historically, academic libraries have focused on the preservation and archiving activities related to the end of the data life cycle. These activities have usually been concerned with
3 3 special collections materials, often focusing on the unique or the rare. With the growing attention accorded to research data, libraries have seized the opportunity to employ their expertise in the realms of metadata development, information organization, and curation and begun to explore expansion of their roles in supporting data management in the academy. Active exploration and discussion of this realm at Columbia was initiated in March 2008 when an e- Science Task Force was convened. The findings of that task force indicated a need for Columbia to invest in a university strategy for research data and the cyberinfrastructure it depends on. Following on these findings, CUL/IS conducted interviews with representative members of the science and engineering faculty to better understand researchers data management needs. These interviews found that the area of data management is one where there are few best practices and little investment in developing robust services. In partial response to the e- Science Task Force and the data interviews, Columbia instituted a Research Data Manager position. The Research Data Manager is tasked with: expanding data management planning services; developing practices, outreach, and education for each of the different stages in the data management lifecycle; and initiating policy development around the multiple issues involved in a comprehensive strategy for data management. This is a collaborative role, and the Research Data Manager works with others who share a vested interest in issues around data management and requirements to develop robust services supporting the needs of researchers at Columbia. The establishment of this position emphasizes the increased importance of research data management at Columbia and is indicative of Columbia s dedication to developing research data support and services throughout the digital data lifecycle. As a part of supporting the entire data life cycle, CUL/IS continues to consider options for collaborative and working data storage spaces. While an earlier pilot of Alfresco has been discontinued, other platforms, such as Google Apps, are being considered. In concert with this, CUL/IS has continued to assist researchers in developing data management plans (DMPs) by maintaining a library of agency and directory specific templates. These DMPs are further supported by the planned increase of fee- free storage for research products, such as data, provided by Columbia s institutional repository Academic Commons. This summer, Academic Commons will be supporting file sizes of up to 10GB free of charge. Funded research data files beyond that will require $10/GB for their storage. Academic Commons is housed on a Fedora- based digital preservation system. CUL/IS is in the process of moving from a storage system employing disks and tapes with offsite support by the NYSERNet Data Center located in Syracuse, New York, and managed using the Sun StorageTek Storage Archive Manager (SAM) software to one based on two Isilon clusters. This will be supported with auxiliary storage through a partnership with Indiana University. Each cluster will contain four nodes providing 432TB of raw disk space, equating to 292TB of usable storage. One cluster will be located in Columbia s data center and the other in NYSERNet Data Center. The data placed in the local cluster is immediately and automatically replicated to the NYSERNet cluster. Additionally, different data classes may be configured to survive some number of local failures beyond the default of two
4 4 simultaneous disk failures or a single node failure from the cluster. The partnership with Indiana University permits CUL/IS to place a third copy of the data in their large Hierarchical File System (HFS) setup, which streams data off to tape. This partnership allows CUL/IS not only to have greater geographic distribution for storage, but also to protect against potential problems with the Isilon replication system by having the data stored on a completely separate system. Building on the Academic Commons Fedora- based digital preservation system, CDRS has implemented a SOLR- Lucene enterprise search server to index the metadata associated with the repository s contents and provide enhanced discovery and access to its materials. In addition to this, Blacklight, a Ruby on Rails application, was instituted in April 2011 to further increase the visibility of items residing in the repository by better exposing the index metadata to search engines. Further refinements enabling access and enhancing visibility include serializing the metadata in multiple formats (e.g. RSS, ATOM, and OAI- PMH), generating XML site maps, creating structured metaheaders supporting Google Scholar indexing, and implementing schema.org microdata (RDFa). In addition to the work with Academic Commons, CUL/IS also has memoranda of understanding (MOU) with the Integrated Earth Data Applications facility (IEDA) at Columbia s Lamont Doherty Earth Observatory, whereby CUL/IS is providing a dark archiving service for IEDA data sets, and the Center for International Earth Science Information Network, providing long term storage and access. CUL/IS has also agreed to act as a failover agency for long- term archiving of the Socioeconomic Data and Applications Center (SEDAC) Long Term Archive **. There continue to be consortia efforts to build a Trustworthy Repositories Audit & Certification (TRAC) certified repository, as well as considerations of participation in the Academic Preservation Trust (APTrust) and the Digital Preservation Network (DPN). CUL/IS will continue to seek opportunities to partner and collaborate, as demonstrated by its gold level sponsorship of the Duraspace Consortium. In the recent past, CUL/IS has looked to collaborate with its researchers and research centers by assessing needs through monitoring and analyzing active grants and agency required data management plans. Going forward with the knowledge that the needs of researchers are constantly changing, as demonstrated in fields as disparate as neuroscience and digital humanities, CUL/IS is looking beyond its monitor and respond pattern of activities. Acknowledging that demands for data storage and guidance will increase, as indicated by the recent memorandum released by the Office of Science and Technology Policy CUL/IS has the goal of anticipating and providing forward- looking support of needs, and being the solution researchers did not know they needed. This goal requires a continued collaboration with campus stakeholders, and a consistent message to university administration and to outside funders that increased support for data management is required. A Research Data Symposium was recently hosted by CUL/IS **
5 5 including national and international research, funding agency and publishing industry representatives. This event confirmed that partnership and collaboration outside of the university is intrinsic to institutional success in data storage and is an imperative part of data storage services. Faculty- led efforts Since 2005 the Columbia University Center for Computational Biology and Bioinformatics (C2B2), located on the medical center campus, has included a research computing support team of roughly nine staff. This group provides fee- based services, including data storage for research computing applications as well as central file storage with backup, for about 15 labs, or roughly 200 researchers and staff. C2B2 services are a model for research computing/it for the Mortimer B. Zuckerman Mind Brain Behavior Institute which will occupy the Jerome L. Greene Science Center building on the newest Columbia Manhattanville campus. The Greene Science Center is scheduled for occupancy in 2016 and is expected to house a total of 800 researchers and administrators. Concluding Remarks Efforts by central IT, Library, and Research Initiative divisions to evolve research storage strategy at Columbia having been moving forward over the past year at the same time as faculty- led research centers develop their own research computing teams and resources. These parallel activities will benefit from closer and continued communication and coordination between central university divisions and distributed research IT staff. In addition, several other major university initiatives including the creation of an Institute for Data Science and Engineering that will bring in new faculty and a focus on big data issues, the growth of the New York Genome Center that provides computing, data storage and other services for biomedical researchers at member institutions, and the launch of eight global centers outside of New York promise to keep research storage at the forefront of technology planning at Columbia for many years to come. The combined understanding and efforts of the researchers, campus IT leaders, and library/archive specialists at the NSF- funded Workshop on Research Data Management Implementions, and related future events, will help shape Columbia s approach to research data storage.
Columbia University Digital Library Architecture. Robert Cartolano, Director Library Information Technology Office October, 2009
Columbia University Digital Library Architecture Robert Cartolano, Director Library Information Technology Office October, 2009 Agenda Technology Architecture Off-site NYSERNet Facility Ingest, curation
OpenAIRE Research Data Management Briefing paper
OpenAIRE Research Data Management Briefing paper Understanding Research Data Management February 2016 H2020-EINFRA-2014-1 Topic: e-infrastructure for Open Access Research & Innovation action Grant Agreement
Local Loading. The OCUL, Scholars Portal, and Publisher Relationship
Local Loading Scholars)Portal)has)successfully)maintained)relationships)with)publishers)for)over)a)decade)and)continues) to)attract)new)publishers)that)recognize)both)the)competitive)advantage)of)perpetual)access)through)
Digital Repository Initiative
Digital Repository Initiative 01 April 2014 Ray Frohlich Director, Enterprise Systems and Infrastructure with Euan Cochrane Digital Preservation Manager, Preservation Michael Friscia Manager of Digital
Research Data Storage, Sharing, and Transfer Options
Research Data Storage, Sharing, and Transfer Options Principal investigators should establish a research data management system for their projects including procedures for storing working data collected
Brown University Libraries Technology Plan, 2015-2017
Brown University Libraries Technology Plan, 2015-2017 Technology Vision Brown University Library creates, develops, promotes, and uses technology to further the Library s mission and strategic directions
OJS @ Queen s Open Journal System (OJS) Business Case
The centrality of the library in the academy enables it to act as a primary catalyst for change in the scholarly communication domain. Libraries understand the culture of scholarship and are strategically
Best Practices for Research Data Management. October 30, 2014
Best Practices for Research Data Management October 30, 2014 Presenters Andrew Johnson Research Data Librarian University Libraries Shelley Knuth Research Data Specialist Research Computing Outline What
Library Strategic Planning
Library Strategic Planning Develop campus partnerships to collect, manage, share, and preserve Georgia Tech digital research data 1. Needs Assessment Research data survey & interviews NSF DMP content analysis
Geospatial Data Stewardship at an Interdisciplinary Data Center
Geospatial Data Stewardship at an Interdisciplinary Data Center Robert R. Downs, PhD Senior Digital Archivist and Senior Staff Associate Officer or Research Acting Head of Cyberinfrastructure and Informatics
6. FINDINGS AND SUGGESTIONS
6. FINDINGS AND SUGGESTIONS 6.1 Introduction: The advancements in ICT and their proper utilization by research and academic librarians are not only strengthening the capabilities of libraries but also
Integrated Rule-based Data Management System for Genome Sequencing Data
Integrated Rule-based Data Management System for Genome Sequencing Data A Research Data Management (RDM) Green Shoots Pilots Project Report by Michael Mueller, Simon Burbidge, Steven Lawlor and Jorge Ferrer
Data Management at UT
Data Management at UT Maria Esteva, TACC, [email protected] Colleen Lyon, UT Libraries, [email protected] Angela Newell, ITS, [email protected] What is data management? systematic organization
University of Bristol. Research Data Storage Facility (the Facility) Policy Procedures and FAQs
University of Bristol Research Data Storage Facility (the Facility) Policy Procedures and FAQs This FAQs should be read in conjunction with the RDSF usage FAQs - https://www.acrc.bris.ac.uk/acrc/rdsf-faqs.html
Big Data to Knowledge (BD2K)
Big Data to Knowledge () potential funding agency synergies Jennie Larkin, PhD Office of the Associate Director of Data Science National Institutes of Health idash-pscanner meeting UCSD September 16, 2014
Best Practices for Data Management. RMACC HPC Symposium, 8/13/2014
Best Practices for Data Management RMACC HPC Symposium, 8/13/2014 Presenters Andrew Johnson Research Data Librarian CU-Boulder Libraries Shelley Knuth Research Data Specialist CU-Boulder Research Computing
Data Storage Options for Research
Research IT Office Data Storage Options for Research By Ashok Mudgapalli Director of Research IT Agenda Current Research Data Storage Current Data Backup Strategies Available Storage Solution: Enterprise
Library and University Collections Vision, Values, Strategic Goals and Key Performance Areas, 2013-2017
Library and University Collections Vision, Values, Strategic Goals and Key Performance Areas, 2013-2017 Library and University Collections in a nutshell Library and University Collections (L&UC) is responsible
Columbia University Libraries / Information Services
Stephen Davis, October 28, 2010 Columbia University Libraries / Information Services Digital Asset Management Digital Preservation Digital Publishing Introductions Stephen Paul Davis Director, Libraries
Let s Talk Digital An Approach to Managing, Storing, and Preserving Time-Based Media Art Works. DAMS Branch Manager Smithsonian Institution, OCIO
Let s Talk Digital An Approach to Managing, Storing, and Preserving Time-Based Media Art Works DAMS Branch Manager Smithsonian Institution, OCIO 1 Love and Marriage Digital Asset Management?? Digital Diamonds
Emerging Career Trends for Information Professionals: A Snapshot of Job Titles in Summer 2013
Emerging Career Trends for Information Professionals: A Snapshot of Job Titles in Summer 2013 Introduction This report provides an informal snapshot regarding some of the latest career trends for information
Digital Asset Management Developing your Institutional Repository
Digital Asset Management Developing your Institutional Repository Manny Bekier Director, Biomedical Communications Clinical Instructor, School of Public Health SUNY Downstate Medical Center Why DAM? We
Harvard Library Preparing for a Trustworthy Repository Certification of Harvard Library s DRS.
National Digital Stewardship Residency - Boston Project Summaries 2015-16 Residency Harvard Library Preparing for a Trustworthy Repository Certification of Harvard Library s DRS. Harvard Library s Digital
Checklist and guidance for a Data Management Plan
Checklist and guidance for a Data Management Plan Please cite as: DMPTuuli-project. (2016). Checklist and guidance for a Data Management Plan. v.1.0. Available online: https://wiki.helsinki.fi/x/dzeacw
The National Consortium for Data Science (NCDS)
The National Consortium for Data Science (NCDS) A Public-Private Partnership to Advance Data Science Ashok Krishnamurthy PhD Deputy Director, RENCI University of North Carolina, Chapel Hill What is NCDS?
Assessing a Scientific Data Center as a Trustworthy Digital Repository
Assessing a Scientific Data Center as a Trustworthy Digital Repository Robert R. Downs 1 and Robert S. Chen 2 1 [email protected] 2 [email protected] NASA Socioeconomic Data and Applications
Server Virtualization with Windows Server Hyper-V and System Center
Course 20409B: Server Virtualization with Windows Server Hyper-V and System Center Course Details Course Outline Module 1: Evaluating the Environment for Virtualization This module provides an overview
Databases & Data Infrastructure. Kerstin Lehnert
+ Databases & Data Infrastructure Kerstin Lehnert + Access to Data is Needed 2 to allow verification of research results to allow re-use of data + The road to reuse is perilous (1) 3 Accessibility Discovery,
CYBERINFRASTRUCTURE FRAMEWORK FOR 21 st CENTURY SCIENCE AND ENGINEERING (CIF21)
CYBERINFRASTRUCTURE FRAMEWORK FOR 21 st CENTURY SCIENCE AND ENGINEERING (CIF21) Goal Develop and deploy comprehensive, integrated, sustainable, and secure cyberinfrastructure (CI) to accelerate research
Amazon Cloud Storage Options
Amazon Cloud Storage Options Table of Contents 1. Overview of AWS Storage Options 02 2. Why you should use the AWS Storage 02 3. How to get Data into the AWS.03 4. Types of AWS Storage Options.03 5. Object
POSITION DETAILS. Digitisation & Digital Services
HR191 JOB DESCRIPTION NOTES Forms must be downloaded from the UCT website: http://www.uct.ac.za/depts/sapweb/forms/forms.htm This form serves as a template for the writing of job descriptions. A copy of
Trinity College Library
Trinity College Library Strategic 2013/14-2015/16 Introduction This document draws on discussions of the expanded unit heads group during the academic year 2012/2013. It combines the goals articulated
New Developments in Data Sharing, Remote Access, Secure Data, and Documentation at the Cornell Institute for Social and Economic Research (CISER)
New Developments in Data Sharing, Remote Access, Secure Data, and Documentation at the Cornell Institute for Social and Economic Research (CISER) William C. Block and Lars Vilhuber 4 th Workshop on Data
Johns Hopkins University Data Management Services
Johns Hopkins University Data Management Services Dave Fearon, Betsy Gunia, and Tim DiLauro [email protected] http://dmp.data.jhu.edu/ JHU Data Management Services How we started: NSF s Data Management
RESEARCH DATA MANAGEMENT POLICY
Document Title Version 1.1 Document Review Date March 2016 Document Owner Revision Timetable / Process RESEARCH DATA MANAGEMENT POLICY RESEARCH DATA MANAGEMENT POLICY Director of the Research Office Regular
Data Management using irods
Data Management using irods Fundamentals of Data Management September 2014 Albert Heyrovsky Applications Developer, EPCC [email protected] 2 Course outline Why talk about irods? What is irods?
Adding Robust Digital Asset Management to Oracle s Storage Archive Manager (SAM)
Adding Robust Digital Asset Management to Oracle s Storage Archive Manager (SAM) Oracle's Sun Storage Archive Manager (SAM) self-protecting file system software reduces operating costs by providing data
A federated data infrastructure: the Dutch way forward
Data Archiving and Networked Services A federated data infrastructure: the Dutch way forward Ingrid Dillo (DANS) DeIC Conference Middelfart, 30 September 2014 DANS is an institute of KNAW en NWO The Dutch
NSF Data Management Plan Template Duke University Libraries Data and GIS Services
NSF Data Management Plan Template Duke University Libraries Data and GIS Services NSF Data Management Plan Requirement Overview The Data Management Plan (DMP) should be a supplementary document of no more
The UK INCF Node and the CARMEN project. Presenter: Leslie Smith [email protected]
The UK INCF Node and the CARMEN project Presenter: Leslie Smith [email protected] Content Events and activities at UK INCF Node Recent Forthcoming The CARMEN project History, current status, ways
Software services competence in research and development activities at PSNC. Cezary Mazurek PSNC, Poland
Software services competence in research and development activities at PSNC Cezary Mazurek PSNC, Poland Workshop on Actions for Better Participation of New Member States to FP7-ICT Timişoara, 18/19-03-2010
Scott D. Bacon. Web Services and Emerging Technologies Librarian Assistant Professor Kimbel Library, Coastal Carolina University
Scott D. Bacon Web Services and Emerging Technologies Librarian Assistant Professor Kimbel Library, Coastal Carolina University Education Indiana University Bloomington, May 2011 - MLS with Digital Libraries
CAREER TRACKS PHASE 1 UCSD Information Technology Family Function and Job Function Summary
UCSD Applications Programming Involved in the development of server / OS / desktop / mobile applications and services including researching, designing, developing specifications for designing, writing,
REACCH PNA Data Management Plan
REACCH PNA Data Management Plan Regional Approaches to Climate Change (REACCH) For Pacific Northwest Agriculture 875 Perimeter Drive MS 2339 Moscow, ID 83844-2339 http://www.reacchpna.org [email protected]
UNIVERSITY OF NAMIBIA
UNIVERSITY OF NAMIBIA SCHOLARLY COMMUNICATIONS POLICY FOR THE UNIVERSITY OF NAMIBIA Custodian /Responsible Executive Responsible Division Status Recommended by Pro Vice-Chancellor: Academic Affairs and
Information Services hosted services and costs
Information Services hosted services and costs 1. Overview Information Services (I.S.) has provided a popular hosting service for specialized departmental servers for many years. Servers hosted by I.S.
PROGRAM DESIGN. Our design process, philosophy and values.
PROGRAM Our design process, philosophy and values. Program Design and Development is one of our core services in which we work with faculty to design and develop courses that are part of a new or redesigned
CloudCenter Full Lifecycle Management. An application-defined approach to deploying and managing applications in any datacenter or cloud environment
CloudCenter Full Lifecycle Management An application-defined approach to deploying and managing applications in any datacenter or cloud environment CloudCenter Full Lifecycle Management Page 2 Table of
Archive Data Retention & Compliance. Solutions Integrated Storage Appliances. Management Optimized Storage & Migration
Solutions Integrated Storage Appliances Management Optimized Storage & Migration Archive Data Retention & Compliance Services Global Installation & Support SECURING THE FUTURE OF YOUR DATA w w w.q sta
Curriculum Vitae Rajendra Bose
Curriculum Vitae Rajendra Bose Mortimer B. Zuckerman Mind Brain Behavior Institute Columbia University New York, NY 10027 Email: [email protected] Web: http://www.columbia.edu/~rb2568 Education University
Now that you have a Microsoft private cloud, what the heck are you going to do with it?
Now that you have a Microsoft private cloud, what the heck are you going to do with it? Tony Bradley Microsoft MVP, CISSP-ISSAP Principal Analyst, Bradley Strategy Group Abstract Choosing and building
Backup of NAS devices with Avamar
Backup of NAS devices with Avamar Extremely fast / no load Video describing NAS backup using Avamar based on this ppt: https://youtu.be/swg1ejldgmw The most fresh version of this document, you will find
CUL Strategic Plan Goals for September 2014 2016
CUL Strategic Plan Goals for September 2014 2016 The planned renovation is a major factor driving many goals, and should be considered as we prioritize goals and consider new projects. Space Assess the
NET+ INFRASTRUCTURE AND PLATFORM SERVICES PORTFOLIO STATUS & UPDATE
NET+ INFRASTRUCTURE AND PLATFORM SERVICES PORTFOLIO STATUS & UPDATE Andrew Keating, Eric Jeanes, Sean O Brien NET+ Cloud Services 2014 Internet2 NET+ IPS Portfolio Update CONTENTS Goals and Updates Portfolio
European Data Infrastructure - EUDAT Data Services & Tools
European Data Infrastructure - EUDAT Data Services & Tools Dr. Ing. Morris Riedel Research Group Leader, Juelich Supercomputing Centre Adjunct Associated Professor, University of iceland BDEC2015, 2015-01-28
