HPSS Best Practices Erich Thanhardt, Bill Anderson, Marc Genty
Overview The idea is to look under the hood of HPSS to help you better understand best practices. We will expose you to concepts, architecture, and tape technology, and cite best practices in context along the way. The talk ends with references to further resources. The talk is interactive, so please ask questions along the way.
HPSS - What is it? The acronym stands for High Performance Storage System. HPSS is software that manages petabytes of data on disk and robotic tape libraries. Quoted from: http://www.hpss-collaboration.org
HPSS - What makes it different? Hardware: the use of tape technology is a distinguishing characteristic of HPSS. Use case: HPSS is an archive, not a (parallel) file system; the system is remote, not cross-mounted, and the operation set is limited to metadata operations and file transfers. Best Practice: be aware of what makes HPSS very different from GLADE - its intended use.
Archive HPSS Main Use Cases Data is stored and preserved indefinitely while system components come and go: model data and observational data collections. Disaster Recovery: leverage dual sites for geographic separation, providing an additional level of archival preservation.
HPSS Software Architecture [Diagram: an end user on a Linux/Unix host runs the HSI/HTAR client interface (CLI); authentication (AUTH) and control pass through 4x gateway servers, with metadata and data paths into HPSS.]
HPSS Software Architecture Best Practice: when reporting errors via an EV ticket, include your username, host, date/time, and -d4 error-tracing output. The gateway servers handle authentication problems and those pesky parallel file transfer limits; they are your guaranteed on-ramp to the system, and their data bandwidth allocation will be increasing over the next few months.
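A minimal sketch of capturing that -d4 trace output for a ticket (the transfer shown and its paths are hypothetical):
# Re-run the failing transfer with -d4 debug tracing and save the output
hsi -d4 "put /glade/scratch/$USER/run.tar : /home/$USER/run.tar" 2> hsi_trace.log
# Attach hsi_trace.log, plus your username, host, and date/time, to the EV ticket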
HPSS Software Architecture Best Practice: validate that a file was written. Run ls -l both locally and on HPSS and compare the pathname and size; seeing the pathname alone (ls) is not sufficient. Here is what can happen: creating the pathname in HPSS happens first, then the data transfers between client and HPSS, and that transfer can be interrupted.
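A minimal sketch of that check, using a hypothetical file name:
# The sizes reported locally and by HPSS must match
ls -l /glade/scratch/$USER/run.tar
hsi "ls -l /home/$USER/run.tar"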
HPSS - One System/Two Sites [Diagram: ARCHIVE at NWSC in Cheyenne and DISASTER RECOVERY at the MLCF in Boulder, each with an HPSS disk cache, Oracle tape drives + media, and an Oracle SL8500 tape library.]
HPSS Libraries - Oracle SL8500
HPSS Tape Libraries [Photo: frontal view of the MLCF ACSLS server and SL8500 tape library.]
HPSS Libraries [Photo: top view of the tape library.]
HPSS Libraries - Photos
Oracle Drive & Media
Small File Problem Cost of a random read: robot retrieval, mount, and seek average ~70 seconds per file, while the transfer data rate is 240 MB/sec. A 184 MB file transfers in under a second, so the read is ~99% latency and ~1% transfer. Add the cost of returning the tape - an indirect cost to you - and the latency doubles, so even a 368 MB file is still ~99% latency and ~1% transfer. Compare these sizes with the archive's average file size of 166 MB.
Small File Problem Best Practice: it is best to avoid small files, but where they are needed, aggregate them with htar, as in the sketch below.
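A minimal sketch of htar aggregation (the directory and member names are hypothetical):
# Bundle a directory of small files into a single archive file in HPSS
htar -cvf /home/$USER/case01.tar case01/
# List the members later without retrieving the whole archive
htar -tvf /home/$USER/case01.tar
# Extract a single member on demand
htar -xvf /home/$USER/case01.tar case01/restart.nc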
Deleting Files File deletion: deleting data on tape creates unusable gaps, because tape is written linearly and continuously; recovering that space requires system data migrations (repacking). Best Practice: delete un-needed files, but also avoid temporary files in the archive (whether rewriting in place or create/delete cycles).
Repeated Reads and Writes Best Practice: avoid both repeated reads from and repeated writes to an archive file - bring the file out once and park it somewhere else (e.g., GLADE), as sketched below.
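A minimal sketch of the park-it pattern (paths hypothetical):
# Retrieve the archive file once to GLADE...
hsi "get /glade/scratch/$USER/data.tar : /home/$USER/data.tar"
# ...then run all repeated reads and edits against the GLADE copy, not HPSS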
File Rescue Adopting orphaned files from others: a user/project combination goes invalid after a period of time, and someone needs to take ownership and pay the storage costs. Best Practice: never use cp to copy data internally in order to move it; if you don't have the proper permissions, open a ticket.
Optimizing Reads Best Practice: if you are reading back data at large scale, contact the Helpdesk at cislhelp@ucar.edu for ways to order your requests - it can be done! The process is not perfect but usually has a positive effect.
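One hedged example of helping the system group tape mounts: issue many retrievals in a single hsi session using HSI's in command, which reads commands from a file (the file list shown is hypothetical). Create a file getlist containing one get per line:
get /glade/scratch/user/jan.nc : /home/user/jan.nc
get /glade/scratch/user/feb.nc : /home/user/feb.nc
Then run the whole list in one session, so HPSS sees all the requests together:
hsi "in getlist"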
Storage Hierarchy Concept [Diagram: a pyramid from fastest/smallest to slowest/largest - CPU, memory, disk, tape.]
Attributes of Storage Hierarchy Cost & characteristics; speed & capacity; persistence & reliability (hardware, RAID/RAIT, dual copy); availability (online/nearline/offline); location (onsite/offsite).
HPSS Storage Pyramid [Diagram: disk cache (Disk) on top; tape libraries, robotics, and drives & media (Tape) below.]
Hierarchical Storage Manager (HSM) [Diagram: files migrate from disk to tape, are purged from disk, and are staged from tape back to disk on access.]
User Interaction with HPSS [Diagram: the same purge/stage/migrate cycle from the user's point of view - reading a tape-resident file triggers a stage back to the disk cache.]
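A hedged sketch of working with that cycle explicitly: HSI provides a stage command that can pull tape-resident files back to the disk cache ahead of a bulk read (the -w wait-for-completion option and the path shown are assumptions):
# Stage the file from tape to the disk cache and wait until it is resident
hsi "stage -w /home/user/jan.nc"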
Basic Stats Jun-Aug 2014 Writes/reads ratio: ~4-5 to 1. User response times: ~116 sec/read vs. ~9-10 sec/write, a read/write response-time ratio of ~13 to 1.
Tape Technology Upgrades [Diagram: the same purge/stage/migrate cycle, with an additional tape-to-tape migration as media generations are upgraded.]
Data Services Pyramid - Workflow [Diagram: PFS GLADE (GPFS) at 90 GB/sec on top; Archive/DR HPSS at 9 GB/sec below.]
Workflow - Optimal Create data on GLADE/GPFS; post-process (new data plus deletes); commit data selectively to HPSS. Best Practice!
Workflow - Realistic Create data on GLADE/GPFS; commit to HPSS (back it up); post-process (new data); commit post-processed data (selectively?) to HPSS. A sketch follows.
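A minimal sketch of this realistic workflow, with hypothetical names throughout:
# 1) Create data on GLADE, then back the raw output up to HPSS
htar -cvf /home/$USER/run42_raw.tar /glade/scratch/$USER/run42
# 2) Post-process on GLADE, then commit the products selectively
hsi "put /glade/scratch/$USER/run42_means.nc : /home/$USER/run42_means.nc"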
Workflow - To Avoid Create data on GLADE/GPFS; commit to HPSS (back it up); delete from GLADE/GPFS; time passes; stage from HPSS back to GLADE/GPFS; process the staged data. Best Practice: if you find yourself in this pattern, contact cislhelp@ucar.edu.
Additional Resources CISL Support & Allocations: Helpdesk & CISL Consulting - send email to cislhelp@ucar.edu. HPSS documentation: http://www2.cisl.ucar.edu/docs/hpss. Best Practices doc: http://www2.cisl.ucar.edu/docs/best_practices
The End