The National Consortium for Data Science (NCDS)
A Public-Private Partnership to Advance Data Science
Ashok Krishnamurthy, PhD
Deputy Director, RENCI
University of North Carolina, Chapel Hill
What is NCDS?
NCDS is a public-private partnership to advance data science.
Mission: Provide leadership in data science research and education, and help industry use the power of data to drive economic growth.
Vision: A focused, multi-sector, multidisciplinary data science community that solves big data challenges and drives the field forward.
Goals:
  Engage broad communities of data experts
  Coordinate data science research priorities that span disciplines and industries
  Facilitate development of education and training programs
  Support development of technical, ethical and policy standards
  Apply NCDS expertise to data challenges in science, business and government
NCDS Members
The Big Data Frontier
Why a Consortium?
Time: A consortium can plant a stake in the ground quickly; significant funding and full-time staff are not essential to launch.
Participation: Shared vision, the ability to have your voice heard and to define the issues to be tackled.
Flexibility: Able to try different models, key projects and core foci to see what works best, and to respond to changing and varied needs and interests.
Community: A consortium is a way of building a community that can eventually become the foundation for a center (a physical place).
NCDS Components
Data Lab & Observatory:
  Shared, distributed infrastructure housing large organized data
  Serves as a platform for data R&D and data science education (graduate certificates, MS)
Data Fellows Program:
  Seed grants for faculty to work on consortium-approved projects; NCDS review panel evaluates proposals
  Industry internships for graduate students
  Visiting industry data scientists at member universities
Working Groups:
  Year-long deep dive into topics of interest to members
  Produces position papers, workshops, software, events, etc.
Data Science Events:
  Leadership Summits (Spring)
  Data Matters Short Courses (Summer)
  Student Career events (Fall/Spring)
  Invited lectures and outreach (ongoing)
Accomplishments: 2013-2014
Organizational: Bylaws passed; steering committee; kickoff featuring Dr. Eric Green (Director, NHGRI) and US Rep. David Price; 10 paid memberships so far.
Programmatic: NCDS Leadership Summit (April 2013); five Faculty Fellows appointed (October 2013); Student-Industry-Faculty career awareness event (April 2014); Data Innovation Showcase (May 2014); Data Matters short course series (June 2014); Observatory active with data sets (June 2014).
Upcoming: Tech Talks with UNC Computer Science and UNC Career Services (October 2014); new Data Fellows CFP (October 2014); Working Groups (Fall 2014).
NCDS Data Cyberinfrastructure
Secure Research Workspace / Secure Medical Research Workspace: Secured virtual environments
ExoGENI/ADAMANT: Federated Infrastructure as a Service
iRODS: Policy-driven data management
DataBridge: Social media-like discovery of useful data sets
Genomic Medical Workflow Engine: Informatics and HPC in High Throughput Sequencing
Key: Infrastructure that adapts to problems
What is iRODS?
iRODS is open source data grid middleware that sits between the files and the user.
Open source: free to use, free to modify, free to contribute
Data Discovery: metadata
Workflow Automation: policies (any condition; any action)
Secure Collaboration: sharing without losing control
Data Virtualization: file system flexibility
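The "any condition; any action" idea behind policy-driven data management can be illustrated with a minimal sketch. This is plain Python, not the iRODS rule language or any iRODS API; the names `DataObject`, `Policy`, `apply_policies`, the sample path and the two example policies are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class DataObject:
    """Hypothetical stand-in for a file plus its metadata (not an iRODS type)."""
    path: str
    size_bytes: int
    metadata: Dict[str, str] = field(default_factory=dict)


@dataclass
class Policy:
    """Pairs an arbitrary condition with an arbitrary action."""
    name: str
    condition: Callable[[DataObject], bool]
    action: Callable[[DataObject], None]


def apply_policies(obj: DataObject, policies: List[Policy]) -> List[str]:
    """Run every policy whose condition holds; return the names that fired."""
    fired = []
    for policy in policies:
        if policy.condition(obj):
            policy.action(obj)
            fired.append(policy.name)
    return fired


def mark_for_archive(obj: DataObject) -> None:
    # Illustrative action: record a storage tier in the object's metadata.
    obj.metadata["tier"] = "archive"


def mark_restricted(obj: DataObject) -> None:
    # Illustrative action: record an access restriction in the metadata.
    obj.metadata["access"] = "restricted"


# Two made-up policies: one conditioned on file size, one on metadata.
policies = [
    Policy("archive-large-files", lambda o: o.size_bytes > 1_000_000_000, mark_for_archive),
    Policy("restrict-clinical", lambda o: o.metadata.get("domain") == "clinical", mark_restricted),
]

sample = DataObject("/zone/home/lab/run42.bam", 5_000_000_000, {"domain": "genomics"})
fired = apply_policies(sample, policies)
print(fired, sample.metadata)
```

Because conditions and actions are ordinary callables, a new policy is added by appending one entry to the list rather than by changing the code that stores or moves data.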
iRODS 4.0: Ready for Enterprise
Product of nearly 20 years of research and development, funded by DARPA, DOE, NASA, NSF, NARA, and NOAA.
Sustainability: Formation of the iRODS Consortium
  6 members at present: developers, users, storage vendors
  Provides interaction between the user and developer communities
  Professional integration services, technical support, training and certification
Enterprise Quality: Starting with iRODS 4.0, the entire codebase has been reviewed and restructured.
  Plug-in architecture
  Each change is verified with a test case in a continuous integration suite
  Pre-compiled binary packages are available for several Linux distributions and multiple database management systems
Who Uses iRODS?
The Wellcome Trust Sanger Institute manages 2 PB of data with iRODS.
Data discovery and workflow automation: Data is tagged with processing history and checksums
Secure collaboration: Workgroups can share with each other while independently maintaining archiving and access policies
Data virtualization: Data is replicated for redundancy and high availability
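Tagging data with checksums and processing history can be sketched in a few lines. This is a generic Python illustration using only the standard library, not the Sanger Institute's pipeline and not an iRODS API; the in-memory `catalog` dictionary, the helper names and the step labels are assumptions made for the example.

```python
import hashlib
import os
import tempfile
from datetime import datetime, timezone


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute a SHA-256 checksum by streaming the file in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def tag_file(catalog: dict, path: str, step: str) -> dict:
    """Record the current checksum and append a processing-history entry."""
    entry = catalog.setdefault(path, {"checksum": None, "history": []})
    entry["checksum"] = sha256_of(path)
    entry["history"].append(
        {"step": step, "at": datetime.now(timezone.utc).isoformat()}
    )
    return entry


def is_intact(catalog: dict, path: str) -> bool:
    """Check that the file still matches the checksum stored in the catalog."""
    return catalog[path]["checksum"] == sha256_of(path)


catalog = {}  # hypothetical in-memory metadata catalog
with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "reads.txt")
    with open(path, "wb") as f:
        f.write(b"ACGT\n")

    entry = tag_file(catalog, path, "sequenced")
    tag_file(catalog, path, "aligned")
    intact_before = is_intact(catalog, path)

    # Simulate silent corruption after the last tagging step.
    with open(path, "ab") as f:
        f.write(b"corruption")
    intact_after = is_intact(catalog, path)

print(intact_before, intact_after)
```

Storing the checksum alongside the history lets a later step detect that a file changed outside the recorded workflow, which is the kind of check that metadata tagging makes possible.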
Who Uses iRODS?
The iPlant Collaborative uses iRODS to manage over 112M files (>750 TB) with over 20,000 users.
Data discovery: Templates guide application of metadata according to international data curation standards
Workflow automation: Fine-grained user permissions conditioned on domain, group, file size, metadata
Data virtualization: Data is easily moved between storage and compute resources, always maintaining a specified level of redundancy
Data Science Education
Modular courses for an 11-month program
Graduate Certificate in Data Science (half time)
MS in Data Science (full time)
Conclusion
Developing data science will:
  Develop the next generation of data science experts and leaders
  Create strategies, practices and scientific methods for understanding data
  Enable more collaborations among data and domain scientists, business, academia and government
  Assist those who are struggling to collect, analyze, manage and use data
  Establish methodologies for measuring the value and impact of data
THANK YOU!