The PetaStore: A Business Model for Big Data on a Small Budget. Patrick Calhoun, Petascale Storage Administrator; Henry Neeman, Director, OU Supercomputing Center for Education & Research (OSCER); University of Oklahoma Information Technology. XSEDE'14, Wednesday July 16, 2014
Co-authors: David Akin, OU; Joshua Alexander, OU; Brett Zimmerman, OU; Fred Keller, OU; Brandon George, OU 2
Outline. We'll have time for these: Context, Business Model, Technology. We might have time for this: User Interface. We won't have time for these, but feel free to look at the slides on your own: Implementation, Maintenance, Sociology. Please feel free to ask questions at any time. We like interacting. 3
Context
Overview [diagram: (6) IBM System x3650 servers, DCS9900 disk, TS3500 tape; 15 Faculty & Staff, 12 projects / 10 departments; OSCER-related funding to date: $259M total, $145M to OU] 5
Large Data Volume Choices. I've got tens of TB of data (or hundreds of TB, or PB, or ...). Why can't I just buy a bunch of USB drives at my local big box store (or online)? 6
Large Data Volume Choices. You can enter a NASCAR race on a riding lawnmower, but: you probably won't win; you probably will get killed. http://express.howstuffworks.com/gif/exp-nascar-2.jpg http://uslmra.org/wp-content/uploads/2009/09/howardlawnmowerracing.jpg 7
Why Not Roll-Your-Own? If a research team's data sizes are small, roll-your-own is perfectly reasonable: USB disk drives are cheap: 4 TB USB 3.0 = $139 (pricewatch.com 7/15/2014). Buy two and copy everything to both drives (getting user compliance on the secondary copy isn't necessarily trivial). Slightly bigger than that: you can do a small, cheap RAID enclosure for mirroring or RAID6 (RAID5 probably isn't robust enough for large drives, given rebuild times), BUT: price per TB starts going way up; you need much more expertise to configure and manage; risk is higher, because a failed system loses lots of data (or you buy two, doubling your costs). 8
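A back-of-the-envelope sketch of the cost arithmetic behind the roll-your-own option, using only the $139 / 4 TB price quoted on this slide; the RAID-enclosure, labor, and maintenance costs it glosses over are exactly the hidden costs discussed later.

```python
# Cost-per-TB arithmetic for the roll-your-own USB option, using the
# $139 / 4 TB USB 3.0 price quoted above (pricewatch.com, 7/15/2014).
usb_price_usd = 139.0
usb_capacity_tb = 4.0

cost_per_tb = usb_price_usd / usb_capacity_tb    # single copy: ~$34.75/TB
cost_per_tb_mirrored = 2 * cost_per_tb           # buy two, copy to both: ~$69.50/TB

print(f"Single USB copy: ${cost_per_tb:.2f} per TB")
print(f"Two-copy mirror: ${cost_per_tb_mirrored:.2f} per TB")
```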
Jargon. Backup: nightly incremental (just the files that are new or have changed in the past 24 hours), AND occasional full dump (every week, every month, whatever). Archive: Write Once, Read Seldom if Ever. 9
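As a minimal illustration of what "nightly incremental" means in practice, the sketch below selects only the files created or modified in the past 24 hours; the source directory is a hypothetical example, not a PetaStore location.

```python
# Minimal sketch of the "nightly incremental" idea: walk a directory tree and
# collect only the files created or modified in the past 24 hours.
import os
import time

SOURCE = "/home/username/data"      # hypothetical directory to back up
cutoff = time.time() - 24 * 3600    # 24 hours ago

changed = []
for dirpath, _dirnames, filenames in os.walk(SOURCE):
    for name in filenames:
        path = os.path.join(dirpath, name)
        if os.path.getmtime(path) >= cutoff:
            changed.append(path)

print(f"{len(changed)} files would go into tonight's incremental backup")
```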
How Do Researchers Behave in the Wild? Territoriality Affordability No data management strategy Why? 10
Territoriality. Some researchers like to hug their toys because they don't trust others (a) to provide shared resources to a large community, while simultaneously (b) serving each user's specific needs well (and at high priority). http://enterprise.media.seagate.com/files/2009/09/computerhug460x276-300x180.jpg http://gigaom2.files.wordpress.com/2012/10/jason-server-hug.jpeg 11
Affordability. Some researchers perceive roll-your-own as cheaper than a central resource, even when it's actually more expensive because of non-obvious (non-hardware) costs. Space, power, cooling: rack-in-a-closet isn't plausible any more. Labor: requires expertise far beyond a typical grad student. Maintenance: not cheap, especially after 3 to 4 years. But they have to stretch their research funds as far as possible. 12
No Data Mgmt Strategy. Some research teams store their research data on a single hard drive in the PC under a grad student's desk, in which case: the faculty member doesn't know what format the data is in, where to find it, or how to read it, so when the grad student graduates, the data essentially becomes unusable. It may be rarely if ever backed up. 13
Why? Some researchers perceive their administrations (especially but not only central IT) as barriers to their progress, instead of partners in their progress. In some cases, this is based on direct negative experience and/or advice/anecdotes from colleagues. For some users, the bulk of their hands-on computing experience is with personal computing (PCs, laptops, tablets, phones), which typically is relatively straightforward to manage with tiny capital, labor and expertise costs (e.g., increase phone storage by inserting a MicroSD card; install software with a few taps, for a few dollars or free). Grad student labor is (relatively) cheap. At some institutions, faculty incentives are based on graduating students, publishing papers and getting external funding, NOT on having well-managed IT resources. 14
How to Be, and Seem, Cheaper? Distribute the costs among multiple entities. That way, no one has to bear the whole burden. Therefore, the cost for each becomes affordable. Find ways to leverage the funding to get other funding. 15
Business Model
OneOCII. The PetaStore is part of a statewide initiative known as the OneOklahoma Cyberinfrastructure Initiative (OneOCII), co-led by: University of Oklahoma (OU, Norman); Oklahoma State University (OSU, Stillwater); Langston University (Langston); Tandy Supercomputing Center (part of the Oklahoma Innovation Institute, a non-profit in Tulsa); OneNet (Oklahoma's research, education & government network). 17
OK PetaStore Technology Strategy. Distribute the costs among a research funding agency, the institution, and the research teams. Archive, not live storage: write once, read seldom if ever. Independent, standalone system; not part of a cluster. Spend grant funds on many media slots but few media (tape cartridges, disk drives). Most of the media that the grant has purchased have been allocated to the research projects in the proposal; media slots are available on a first-come, first-served basis. Software cost should be a small fraction of total cost. Under the OneOklahoma Cyberinfrastructure Initiative, this is also true for academic institutions statewide (and also many non-academic institutions). Maximize media longevity. 18
Business Model. Grant: hardware, software, 3-year warranties on everything. Institution (CIO + VPR): space, power, cooling, labor, maintenance after the 3-year warranty period. Researchers: media (tape cartridges, disk drives). Compared to roll-your-own disk, for researchers PetaStore tape is: cheaper; more reliable; less labor; requires less training (~1 hour); can be faster (~200 MB/sec to write, ~140 MB/sec to read). Compared to roll-your-own disk, PetaStore disk is: more expensive, but otherwise like tape. 19
Business Model. Grant: hardware, software, 3-year warranties on everything. Institution (CIO + VPR): space, power, cooling, labor, maintenance after the 3-year warranty period. Researchers: media (tape cartridges, disk drives). Compared to roll-your-own disk, for researchers PetaStore tape is cheaper (33% cheaper per TB raw, as of 7/15/2014). By the way, LTO-5 tape is also faster than SATA disk drives (140 MB/sec vs 25-50 MB/sec). LTO-5 has an unrecoverable read error ("bit rot") rate of 1 in 10^17 bits, compared to 1 in 10^14 for SATA and 1 in 10^15 for SAS. 20
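A worked check of the figures on this slide: the unrecoverable-read-error rates imply how many bad bits to expect when reading back a large dataset, and the sustained transfer rates imply how long a terabyte takes to stream. A minimal sketch using only the numbers quoted above:

```python
# Expected unrecoverable read errors when reading back 100 TB once,
# at the error rates quoted on this slide.
tb_read = 100
bits_read = tb_read * 1e12 * 8                 # 100 TB expressed in bits

expected_errors_sata = bits_read / 1e14        # 1 error per 10^14 bits -> ~8
expected_errors_lto5 = bits_read / 1e17        # 1 error per 10^17 bits -> ~0.008
print(f"SATA : ~{expected_errors_sata:.3f} expected errors")
print(f"LTO-5: ~{expected_errors_lto5:.3f} expected errors")

# Time to stream 1 TB at the quoted sustained rates.
for label, mb_per_s in [("LTO-5 tape", 140), ("SATA disk (best case)", 50)]:
    hours = 1e12 / (mb_per_s * 1e6) / 3600
    print(f"{label:22s}: ~{hours:.1f} hours per TB")
```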
NSF MRI Grant. Acquisition of Extensible Petascale Storage for Data Intensive Research. National Science Foundation grant no. OCI-1039829, 10/1/2010-9/30/2013, no-cost extension to 9/30/2014. 21
NSF MRI Grant: Summary. OU was awarded a National Science Foundation (NSF) Major Research Instrumentation (MRI) grant in 2010. It features 15 faculty and staff from 12 projects in 10 departments. We've purchased and deployed a combined disk/tape bulk storage archive from IBM: the NSF budget paid for most of the hardware and software, plus warranties/maintenance for 3 years; OU cost share and institutional commitment pay for space, power, cooling and labor, as well as maintenance after the 3-year project period; individual users (e.g., faculty across Oklahoma) pay for the media (disk drives and tape cartridges). 22
Data Management Plans. Beginning mid-January 2011, ALL proposals to the NSF had to have 2-page data management plans. (The plan could be an argument that no data management plan is needed.) The National Institutes of Health has a similar requirement. OSCER has worked with the Office of the VP for Research to create boilerplate text that includes a description of the PetaStore. This doesn't address issues such as metadata, provenance, etc., but it does cover physical data management. 23
Longevity. The current PetaStore system will reach end-of-life in roughly 2017. Faculty may not have funds for purchasing more media in PetaStore II. How to handle the tape? 24
Longevity Strategy. PetaStore II has to be backward-compatible with PetaStore I, in the sense of allowing LTO, including LTO-5 and LTO-6 (it could also allow non-LTO, if desired). Tape cartridges are good for the earliest of: 15 years; 5,000 load/unload cycles; 200 complete tape read/writes. So far, only 6 tape cartridges (< 1%) are in danger of wearing out in less than 15 years. PetaStore II must include a couple of LTO-6 drives, which can read and write both LTO-6 and LTO-5. 25
Longevity Transition 1. Acquire PetaStore II, including a small amount of LTO-7 media. 2. Put PetaStore II into full production. 3. Put PetaStore I into read-only mode. 4. Copy a modest amount of data off tape cartridges on PetaStore I to PetaStore II. 5. Empty those PetaStore I tape cartridges. 6. Export them from PetaStore I and import them into PetaStore II. 7. Repeat steps 4-6 until all tape cartridges have been moved over. 8. Decommission PetaStore I. 9. May want to copy old data from new media to old media. 26
Technology
Hardware [diagram: (6) IBM System x3650 servers, DCS9900 disk, TS3500 tape] 28
Hardware & Software. Hardware: Disk: IBM DCS9900 (rebranded DDN S2A9900); Tape: IBM TS3500. Software: Disk: IBM's General Parallel File System (GPFS); Tape: IBM's Tivoli Storage Manager (TSM). 29
Hardware: Disk. IBM DCS9900 (rebranded DataDirect Networks S2A9900): 2 controllers; 20 enclosures of 60 disk drive slots each (1200 slots total), NOT EXPANDABLE. Initially purchased 300 disk drives (minimum allowed), ~477 TB usable; currently at 530 disk drives, ~842 TB usable. Cost-saving strategy: maximize the ratio of disk drive slots to controllers (controllers are expensive). Peak speed 5.4 GB/sec, benchmarked at ~4 GB/sec (idealized test): faster than our then-cluster's parallel filesystem, similar to the current cluster's speed. Speed wasn't a goal: it's an archive! 30
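The slide doesn't state the drive capacity or RAID geometry behind the ~477 TB figure, but a hedged sanity check is possible: assuming 2 TB drives organized in 8+2 RAID 6 tiers (an assumption here, not a documented spec), the initial 300-drive purchase works out to roughly the quoted number.

```python
# Hedged sanity check of the ~477 TB usable figure. The slide does not state
# the drive size or RAID geometry; 2 TB drives in 8+2 RAID 6 tiers is an
# assumption here, chosen because it roughly reproduces the quoted number.
drives = 300
drive_tb = 2.0            # assumed drive capacity
data_fraction = 8 / 10    # 8 data + 2 parity drives per tier (assumed)

usable_tb = drives * drive_tb * data_fraction
print(f"~{usable_tb:.0f} TB raw-usable before filesystem overhead")  # ~480 TB
```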
Hardware: Tape. IBM TS3500: 4 x LTO-5 tape drives; 2859 tape cartridge slots. Initially 100 tape cartridges; has grown to 960 so far. Expandable to over 22,600 tape cartridge slots (over 55 PB at LTO-6, which is coming soon!). Planning to buy 2 x LTO-6 tape drives soon. LTO-X can write LTO-(X-1) and read LTO-(X-2). 31
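Where the "over 55 PB" figure comes from: the maximum slot count times the native (uncompressed) 2.5 TB capacity of an LTO-6 cartridge. A one-liner's worth of arithmetic:

```python
# Maximum slot count times the native (uncompressed) capacity of an
# LTO-6 cartridge, 2.5 TB.
slots = 22_600
lto6_native_tb = 2.5

total_pb = slots * lto6_native_tb / 1000
print(f"~{total_pb:.1f} PB native at full expansion")   # ~56.5 PB
```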
Software: Disk. IBM's General Parallel File System (GPFS). Charged per server, not per tape slot or per TB. VERY VANILLA! NOTE: High Energy Physics has a separate Lustre partition (200 of the 530 disk drives). 32
Software: Tape. Tivoli Storage Manager (TSM). Integrates well with GPFS. Originally designed for backups; archive capability was added on later. Not ideal for managing billions of files in an archive. Priced per server, and we have only 6 servers. Most tape software has one or both of the following: a per-cartridge-slot activation upcharge; a per-TB capacity charge. These charges would have wrecked the project. Not IBM's first choice for archive software: they'd prefer us to use HPSS (common at national centers), but HPSS's cost would have consumed the entire budget. 33
Tape Software Summary. We chose a terribly risky software strategy, because all of the alternatives to this high risk would have guaranteed failure. We got very lucky: it actually works! Configuring the software took weeks of hard labor, including a 2-week onsite intervention by an IBM expert. BUT: now that we know how to do this, we can help others. 34
User Interface
Interface Methods: Linux shell on the computing cluster; SCP/SFTP (GUI or character terminal); GridFTP and GlobusOnline. 36
Filesystem Layout. Consistency: all files belong in the directory tree /archive/; /archive/... contains the same data, regardless of interface. Transparency: data redundancy and target media types are obvious in the path. Leverage existing skills: users LITERALLY use standard POSIX and GNU commands (plus some optional supplemental commands). 37
Duplication Policies. Comparable base path names to our computing cluster (/home/username, /scratch/username, /work/username, /work/project): /archive/username/disk_1copy_unsafe; /archive/username/disk_1copy_tape_1copy; /archive/username/tape_1copy_unsafe; /archive/username/tape_2copies (project often substitutes for username). The user has to choose the policy for each file (or collection of files), and has to type the implication of a dicey choice. 38
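A hypothetical helper (not part of the PetaStore software) that maps the four duplication policies on this slide to their /archive paths, and makes the caller spell out the implication of a dicey choice, in the spirit of the rule above:

```python
# Hypothetical helper that maps a duplication policy to its /archive/...
# directory, using the four policy names from this slide, and makes the
# caller spell out a dicey choice before proceeding.
import os

POLICIES = (
    "disk_1copy_unsafe",
    "disk_1copy_tape_1copy",
    "tape_1copy_unsafe",
    "tape_2copies",
)

def archive_path(username: str, policy: str) -> str:
    if policy not in POLICIES:
        raise ValueError(f"unknown policy: {policy}")
    if policy.endswith("_unsafe"):
        answer = input(f"{policy} keeps only ONE copy. Type 'unsafe' to continue: ")
        if answer != "unsafe":
            raise SystemExit("aborted")
    return os.path.join("/archive", username, policy)

# Example: a safe, two-media choice for a hypothetical user.
print(archive_path("username", "disk_1copy_tape_1copy"))
```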
Offsite Copies. These two policies can benefit from off-site copies: disk_1copy_tape_1copy; tape_2copies. Periodic export and reclamation. Weekly SneakerNet from South Campus to OU IT's Disaster Recovery data center (~5 miles). In Oklahoma, natural disasters tend to be highly localized: tornadoes (common); flash floods (occasional); ice storms (non-disruptive of storage, especially tape); earthquakes, but nothing strong in the past 15+ years. 39
Current Groups Using the PetaStore. Currently, 26 research groups have capacity on the PetaStore (plus OSCER itself, which consumes 10% of the original disk space as a landing pad for files that are to be on tape only). Of these, 9 are from the grant proposal and 14 aren't. Footprints (as of Apr 6, 2014): disk_1copy_unsafe: 11.5 TB; disk_1copy_tape_1copy: 40.4 TB (per copy); tape_1copy_unsafe: 60.4 TB; tape_2copies: 240.8 TB (per copy). 40
Weird Constraints. File sizes: prefer 10-100 GB; accept 1 GB or larger. This keeps the number of files manageable (avoiding flakiness, excessive database traversal times, "shoeshining"). Retrieval time is ~4 minutes for 10 GB, ~22 minutes for 100 GB (excluding time pending in the queue until a drive is available, if any). File types: unless individual files are 10-100 GB AFTER COMPRESSION, we prefer that they be zip files or gzipped tar files. Compression is a good thing. Replacing many small files with one big file is a good thing. NOTE: there is no autocompression when copying from disk to tape. 41
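A minimal sketch (standard library only, hypothetical paths) of the recommended workflow: bundle many small files into one gzipped tarball, then check that the result lands in the preferred 1-100 GB window before copying it to /archive.

```python
# Sketch of the "many small files -> one big gzipped tarball" rule, using
# Python's standard tarfile module. Paths are hypothetical examples.
import os
import tarfile

SOURCE_DIR = "/scratch/username/run_output"      # hypothetical small files
BUNDLE = "/scratch/username/run_output.tar.gz"   # what actually goes to tape

with tarfile.open(BUNDLE, "w:gz") as tar:
    tar.add(SOURCE_DIR, arcname=os.path.basename(SOURCE_DIR))

size_gb = os.path.getsize(BUNDLE) / 1e9
if size_gb < 1:
    print(f"{size_gb:.2f} GB: below the 1 GB minimum -- bundle more runs together")
elif size_gb > 100:
    print(f"{size_gb:.2f} GB: above the 100 GB ceiling -- split into several tarballs")
else:
    print(f"{size_gb:.2f} GB: within the preferred range, ready for /archive")
```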
Background slides
NSF MRI Research Projects 1. Numerical Prediction and Data Assimilation for Convection Storms, Tornadoes and Hurricanes: Xue, Meteorology and Center for Analysis & Prediction of Storms (CAPS) 2. ATLAS Tier 2 High Energy Physics: Strauss, Skubic, Severini, Physics & Astronomy, Oklahoma Center for High Energy Physics 3. Earth Observations for Biogeochemistry, Climate and Global Health: Xiao, Botany & Microbiology, Center for Spatial Analysis 4. Adaption of Robust Kernel Methods to Geosciences: Trafalis, Industrial Engr; Richman, Leslie, Meteorology 5. 3D Synthetic Spectroscopy of Astrophysical Objects: Baron, Physics & Astronomy 6. Credibility Assessment Research Initiative: Jensen, Management Information Systems, Center for Applied Social Research 43
NSF MRI Research Projects 7. Developing Spatiotemporal Relational Models to Anticipate Tornado Formation: McGovern, Computer Science (CS), Interaction, Discovery, Exploration, Adaptation (IDEA) Lab 8. Coastal Hazards Modeling: Kolar, Dresback, Civil Engineering & Environmental Science (CEES), Natural Hazards Center 9. High Resolution Polarimetric Radar Studies Using OU- PRIME Radar: Palmer, Meteorology & Atmospheric Radar Research Center 10. Perceptual and cognitive capacity: Modeling Behavior and Neurophysiology: Wenger, Psychology 11. Multiscale Transport in Micro- and Nano-structures: Papavassiliou, Chemical, Biological & Materials Engr 12. Electron Transfer Cofactors and Charge Transport: Wheeler, Chemistry & Biochemistry 44
Implementation
Storage Admin Labor Cost. How much labor does Patrick average per month? May 2011 (delivery) to Feb 2012 (full production): ~140 hours per month (~80% FTE). Ongoing maintenance labor: ~9 hours per month. Ongoing user training labor: ~75 hours per month. 46
TSM Server Setup. TSM servers are set up as follows: vanilla RHEL Linux 5.7 (may upgrade to RHEL 6.2+). Can be configured either as a GPFS server (expensive) or GPFS client (cheap); guess which we picked... LTO tape drives in the TS3500 use the lin_tape kernel module; lin_tape handles tape drive multipathing. 47
GPFS Server Setup. GPFS servers are set up as follows: vanilla RHEL Linux 5.7 (may upgrade to RHEL 6.2+). Use Linux Device Mapper Multipath for GPFS NSD LUNs. Each NSD is owned by one primary and one secondary GPFS server. 48
Minor Components Required. In addition to the DCS9900 and TS3500: SAN: 8 Gb FibreChannel (campus FC backbone); GPFS servers: 4 x IBM System x3650; TSM servers: 2 x IBM System x3650 (active-passive); separate data and management networks: 10 Gb, GigE; client systems: GPFS/NFS/SFTP. 49
SAN Layout. Dual (redundant-path) fabrics. Separate physical interfaces for disk and tape. Zoning: exactly 2 endpoints per zone, so LOTS OF ZONES: 32 zones for GPFS Servers <-> DCS9900; 16 zones for TSM Servers <-> TS3500; 16 zones for TSM Servers <-> DCS9900. HBA: QLogic QLE2562 8Gb FC dual-port HBA for System x. 50
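The zone counts follow directly from the "exactly 2 endpoints per zone" rule: one zone per (initiator port, target port) pair. The port counts in the sketch below are assumptions chosen to reproduce the totals on this slide, not documented values.

```python
# With exactly two endpoints per zone, the zone count is simply
# (initiator ports) x (target ports). The port counts below are assumptions
# chosen to reproduce the totals on this slide, not documented values.
def zones(initiator_ports: int, target_ports: int) -> int:
    return initiator_ports * target_ports

print(zones(4 * 2, 4))   # 4 GPFS servers x 2 HBA ports, 4 disk ports  -> 32
print(zones(2 * 2, 4))   # 2 TSM servers x 2 HBA ports, 4 tape drives  -> 16
print(zones(2 * 2, 4))   # 2 TSM servers x 2 HBA ports, 4 disk ports   -> 16
```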
SAN Zones 51
Ethernet Connectivity. One public network for data (10 Gb); one private network for management (GigE). 52
Optional Client Servers. Lustre servers, for opting out of HSM (unsupported). Remote mount via sftp (unsupported); this allows for encrypted filesystem support, for example. 53
Maintenance
Monitoring. DDN's UNSUPPORTED s2mon; custom scripts to check statuses; user reports. 55
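A hedged sketch of the kind of "custom script to check statuses" mentioned here: count the scratch tapes in the library and warn when the pool runs low, which is the first failure mode on the next slide. The dsmadmc invocation and its output format are assumptions and would need adapting to the local TSM setup.

```python
# Hedged monitoring sketch: count scratch tapes and warn when the pool runs
# low. The dsmadmc command line and its output format are assumptions here,
# not a documented PetaStore script; adapt to the local TSM setup.
import subprocess

MIN_SCRATCH = 20   # hypothetical threshold

cmd = [
    "dsmadmc", "-id=monitor", "-password=XXXX", "-dataonly=yes",
    "select count(*) from libvolumes where status='Scratch'",
]
out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
scratch = int(out.strip().splitlines()[-1])   # assumes the count is the last line

if scratch < MIN_SCRATCH:
    print(f"WARNING: only {scratch} scratch tapes left -- migrations may fail")
else:
    print(f"OK: {scratch} scratch tapes available")
```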
Exception Handling. Historically due to implementation oversights. Maximum number of used scratch tapes: led to insufficient available tapes and failed migrations. TSM log backup aging policy: led to insufficient available tapes and failed migrations. No disk quota bound on number of inodes: led to failed stat queries and locked systems. Inconsistent tape drive enumeration: led to broken TSM paths, inaccessible user data, stale file handles. No data lost. 56
Sociology
User Training. Takes about an hour. We currently train each new user one-on-one before letting them on. 58
User Training. Orientation outline: Description and intent of the PetaStore. Inquisition of the user's use case. System rules: 1. Files MUST be 1 GB or larger. 2. Files SHOULD NOT exceed 100 GB. 3. All media must be purchased through our approved channels. The 4 duplication policies (and directory names). The 3 interfaces (compute cluster, SCP/SFTP, GridFTP). Supplemental commands. How to zip or tar+gzip a collection of files. Any other training for this user's use case. 59
Use Agreement Before a user can buy media and/or log in, they have to sign and submit a PetaStore Use Agreement. 60
Use Agreement 1. I will not store on the PetaStore any files that are subject to the US federal Health Insurance Portability and Accountability Act (HIPAA). 2. I will not store on the PetaStore any files that are subject to the US federal Family Educational Rights and Privacy Act (FERPA). 3. I will not store on the PetaStore any files that are classified. 4. If any of the files that I store on the PetaStore are subject to one or more agreements with any Institutional Review Board (IRB), including but not limited to the IRB of the University, then I will take full responsibility for ensuring full compliance with such agreement(s). 61
Use Agreement 5. If I am collaborating with colleagues who are at institutions outside of the United States of America (that is, outside of both US states and US territories), then I will take full responsibility for ensuring that those colleagues do not access the PetaStore themselves, but rather I and/or or other members of my team who are at US institutions will access the PetaStore on behalf of the entire team. 6. I understand that, if and when I cease to be employed by and/or a student at an institution in the United States of America, then access to my files on the PetaStore will be available only to those of my collaborators who are employed by and/or students at US institutions. 62
Use Agreement 7. I will take full responsibility for ensuring that my use of the PetaStore is in full compliance with the most current version of the University s Acceptable Use Policy, currently accessible at http://www.ou.edu/content/dam/it/security/acceptable_use_ Policy.pdf. 8. If I am one of the Principal/Co-Principal investigators of a team, then I will take full responsibility for ensuring that any student members of the team are likewise in full compliance. 63
Use Agreement 9. I understand that the ability of the University to provide the PetaStore is contingent on continued National Science Foundation funding and cooperation; that the University provides the PetaStore on an as-is basis, and while every reasonable and good faith effort will be made to ensure the reliability and availability of the PetaStore and of the files stored on it, the University makes no guarantees with respect to its reliability or continued availability. 10. In the event that the University ceases providing the PetaStore or any comparable resource, then I will take full responsibility for transferring any and all relevant files to other storage resources, and in a timely manner. 64
Use Agreement 11. I will take full responsibility for ensuring that I keep abreast of and comply with changes to any of the relevant laws, policies and circumstances described above. 65
Acknowledgements NSF MRI Participants and External Advisory Group OSCER Operations Team: Brandon George, David Akin, Brett Zimmerman, Joshua Alexander OU CIO/VPIT Loretta Early, Asst VPIT Eddie Huebsch OU VP for Research Kelvin Droegemeier OU IT: Fred Keller, Gensheng Qian, cable crew, etc. IBM: Jim Herzig (now retired), Mike Kane (now at Verizon), Frank Lee, Tu Nguyen, Ray Paden 66
Acknowledgements Portions of this material are based upon work supported by the National Science Foundation under the following grant: Grant No. OCI-1039829, MRI: Acquisition of Extensible Petascale Storage for Data Intensive Research. 67
Thanks for your attention! QUESTIONS? 68