Addressing research data challenges at the University of Colorado Boulder

Thomas Hauser
Director, Research Computing
University of Colorado Boulder
thomas.hauser@colorado.edu

Research Data Challenges

- Research data can be big in different ways [1]
  - Volume that challenges our computing, storage and network infrastructure
  - Lasting significance, e.g. clinical trial, environmental data
  - Descriptive challenges, e.g. experimental setup
- Research Computing (RC) services to address research data challenges
  - Partner with faculty and researchers
  - RC-DMZ for large data transfers
  - PetaLibrary storage infrastructure
  - Large Scale Compute
- Research Data Services
  - Collaboration between RC and the CU-Boulder Libraries
  - Under development

[1] C. Lynch, "Big data: How do your data grow?", Nature, vol. 455, no. 7209, pp. 28-29, Sep. 2008.
Data intensive projects at CU-Boulder

- National Snow and Ice Data Center makes cryospheric and other data accessible and useful to researchers around the world
- High Energy Physics (HEP) group runs an Open Science Grid (OSG) site as part of the Large Hadron Collider (LHC) experiment
- BioFrontiers Institute collaborates with the Anschutz Medical Center in Denver
- Researchers in the Department of Computer Science collect and data-mine data from social networks such as Twitter to model and understand how social networks are used in emergencies

Scientific Data Movement

- The Task: Large Data Transfer
  - End to end: Disk, Network Card, OS, Application Protocols, LAN, WAN
  - Topologically and physically complex (e.g. multi-domain)
- The Concerns
  - Machine/OS/Protocol tuning (e.g. TCP as the typical choice in this space)
  - HEP is a prime example that collaboration, end-to-end tuning and performance debugging are necessary
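The TCP tuning concern can be made concrete with the bandwidth-delay product (BDP): the amount of data that must be in flight to keep a path full. The sketch below is only an illustration; the 10 Gb/s rate, the 50 ms round-trip time and the 64 KiB window are assumed example values, not measurements from the CU-Boulder network, and the function names are hypothetical.

```python
def bandwidth_delay_product_bytes(bandwidth_gbps: float, rtt_ms: float) -> float:
    """Bytes that must be in flight to fill a path of the given rate and RTT."""
    bits_in_flight = bandwidth_gbps * 1e9 * rtt_ms / 1e3
    return bits_in_flight / 8


def window_limited_throughput_gbps(window_bytes: float, rtt_ms: float) -> float:
    """Upper bound on single-stream TCP throughput for a fixed window size."""
    return window_bytes * 8 / (rtt_ms / 1e3) / 1e9


if __name__ == "__main__":
    # Assumed example path: 10 Gb/s link with a 50 ms round-trip time.
    bdp = bandwidth_delay_product_bytes(10, 50)
    print(f"Buffer needed to fill the path: {bdp / 1e6:.1f} MB")

    # Assumed untuned host: 64 KiB TCP window on the same path.
    limit = window_limited_throughput_gbps(64 * 1024, 50)
    print(f"Throughput cap with a 64 KiB window: {limit:.4f} Gb/s")
```

Under these assumptions, a single stream with a small fixed window can use only a small fraction of a 10 Gb/s wide-area path, which is why host and protocol tuning appears among the concerns above.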
Science DMZ

- Concept involves some important players:
  - Architectural split: Enterprise vs. Science use
    - Migration of big data off of the LAN (mutually benefits the regular users too)
  - Security and Networking
    - Paradigm shift: learn to trust things vs. distrust of everything
    - Router filters are faster than firewalls
  - Using the Right Tools
    - Monitoring of the network: perfSONAR
    - Dynamic allocation of bandwidth: DYNES, OSCARS, OpenFlow
    - Proper data movement applications: SCP = Bad, GridFTP = Good
    - Well tuned servers (hardware, software, and protocol stack)

CU-Boulder's Science DMZ: Current and future

- RC-DMZ: Collaboration between the OIT Networking group and RC
- Current RC-DMZ
  - Campus-wide dedicated 10 Gb
  - Dedicated uplinks
  - Single path, single points of failure
- Next year (CC-NIE)
  - Upgrade border routers to 100G with OpenFlow capability
  - Add redundant paths and more paths between key sites using Arista switches and MLAG
  - Dedicated perfSONAR and Bro nodes
GridFTP and Globus Online

- GridFTP resources (Globus Online)
  - Four Dell PowerEdge R710s as GridFTP servers
  - Dedicated 10 Gb Ethernet per node
  - External access via Science DMZ: colorado#gridftp
  - Internal access via dedicated private VLANs: colorado#jila, colorado#nsidc
    --data-interface <vlan>
Projects Enabled by RC-DMZ

- DYNES
- High speed access to central research data storage
- Globus Online
- Enabling different groups to have data-driven, high-speed workflow applications
  - HEP Open Science Grid node
  - BioFrontiers Institute: Moving large amounts of data between CU-Boulder and CU-Anschutz campuses
  - NSIDC: sharing of data
  - JILA: Transfer to and from XSEDE resources

Performance numbers

  Data transferred from colorado#gridftp           122.5 TB
  Data transferred to colorado#gridftp              21.6 TB
  Peak transfer rate between distinct endpoints     2.9 Gb/s
  Peak transfer rate to/from Janus (disk)           5.9 Gb/s
  Peak transfer rate to/from Janus (memory)         9.5 Gb/s
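To put the peak rates in perspective, the following sketch estimates how long a dataset would take to move if a transfer could hold each reported peak rate for its whole duration. The rates are the ones listed above; the 1 TB dataset size is an assumed example, and real transfers rarely sustain their peak, so the results are best read as lower bounds. The function name is hypothetical.

```python
def transfer_time_seconds(size_bytes: float, rate_gbps: float) -> float:
    """Idealized transfer time for size_bytes at a constant rate in Gb/s."""
    return size_bytes * 8 / (rate_gbps * 1e9)


# Peak rates reported in the performance numbers above (Gb/s).
PEAK_RATES_GBPS = {
    "between distinct endpoints": 2.9,
    "to/from Janus (disk)": 5.9,
    "to/from Janus (memory)": 9.5,
}

if __name__ == "__main__":
    dataset_bytes = 1e12  # assumed example: a 1 TB dataset
    for label, rate in PEAK_RATES_GBPS.items():
        minutes = transfer_time_seconds(dataset_bytes, rate) / 60
        print(f"1 TB {label}: about {minutes:.0f} min at {rate} Gb/s")
```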
Storage at CU-Boulder

- Storage Resources
  - 80 TB of mirrored and snapshotted project spaces on IBM N series (NetApp)
- Janus Supercomputer
  - 16,416 Westmere cores
  - 850 TB of Lustre scratch
- NSF-funded PetaLibrary project
  - Grant provides the infrastructure
  - Researcher pays for the medium
  - Development of a long-term business model
  - Global file system: DDN with GPFS
  - Capacity of several PB
  - Performance tier
  - Science data cloud storage

Data Management Support

- Data management task force
  - Members: RC, Libraries, Faculty
  - Report to the campus leadership and faculty by October
  - Set of recommendations to make support of Data Management a priority for the campus
- Research Data Services (RDS)
  - Collaboration between CU-Boulder Libraries and RC
  - Data.colorado.edu
  - Basic services with existing staffing
  - Develop a business model
  - Build full data management services
Contributions

- Conan Moore, OIT Networking Engineer
- Daniel Milroy, RC System Administrator
- Jazcek Braden, RC Senior System Administrator
- Kimberly Stacey, RC Data Manager
- Jason Zurawski, Internet2

Questions?