Research Data Storage Infrastructure (RDSI) Project DaSh Straw-Man
Recap from the Node Workshop (Cherry-picked)
* Higher Tiered DCs cost roughly twice as much as Lower Tiered DCs.
* However, a robust, Higher-Tier-like service can be provided:
  * Using co-operating Lower Tiered DCs.
  * With distributed and/or replicated mechanisms.
  * If a service (partially) fails, another DC can temporarily provide it.
  * If a DC fails, other DCs can provide its services temporarily.
* Loss of service is pardonable. Loss of data is unforgivable.
* Need to provide concrete assurances to the end user.
What's DaSh all about?
* Developing sufficient elements of potential technical architectures for data interoperability and sharing.
  * So that their use can be appropriately specified in the call-for-nodes proposal.
* A mile-high view of the technical architectures needed to get data into and out of the RDSI node(s).
* Ensure (meta)data durability and curation.
  * Loss of (meta)data is a capital offence.
* Ensure data scalability.
  * Storage capacity, and moving data into and out of the node(s).
* Ensure end-user usability.
  * Provide a good end-user experience.
* The DaSh straw-man seeks community opinion on the various possible architectures.
Building Blocks. [Diagram: re-exported file systems (NFS, CIFS, WebDAV, FUSE), HSM/tiered storage classes with protocol negotiation (SRM), wide-area transfers (gsiftp, https, dcap, DPM, xrootd), and cloud/grid interfaces (REST, S3).]
iRODS and Federation
* Federation is a feature by which separate iRODS Zones (iRODS instances) can be integrated.
  * When zones 'A' and 'B' are federated, they work together.
  * Each zone continues to be separately administered.
  * Users in the federated zones, if given permission, can access data and metadata in the other zones.
  * No user passwords are exchanged.
  * Zone admins set up trust relationships with other zones.
* A client-side sketch of cross-zone access is shown below.
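As a minimal sketch of what federation looks like from the user's side, the snippet below assumes the python-irodsclient package and uses illustrative host, user and zone names. Once zones 'A' and 'B' are federated, a user authenticated only against their home zone can address objects in the remote zone by logical path (the `user#zone` form denotes the remote-zone home collection of a home-zone user); exact calls may differ between client versions.

```python
# Sketch only: python-irodsclient access across a federation.
# Hosts, credentials, zone names and paths below are illustrative.
from irods.session import iRODSSession

with iRODSSession(host="icat.zonea.example.org", port=1247,
                  user="alice", password="secret", zone="ZoneA") as session:
    # /ZoneB/home/alice#ZoneA is alice@ZoneA's home collection inside ZoneB.
    obj = session.data_objects.get("/ZoneB/home/alice#ZoneA/results.csv")
    with obj.open("r") as f:
        print(f.read(100))   # first 100 bytes, fetched across the federation
```

No ZoneB password is involved; the trust relationship set up by the zone admins lets ZoneA vouch for its own users.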
ARCS Data Fabric. [Diagram: a single ICAT, hosted on the NeCTAR NSP, fronting iRODS servers at multiple sites, several of them backed by tape.]
Node's-eye view (N=6). No federation.
Node's-eye view (N=6). Too much federation. Too much confusion!
Node's-eye view (N=6). Just-right federation. [Diagram: one Master ICAT with six Slave ICATs.]
Distributed vs Federated. [Diagram: a distributed fault-tolerant parallel FS spanning the N=6 nodes, sitting beneath the same building blocks as before — re-exported file systems (NFS, CIFS, WebDAV, FUSE), HSM/tiered storage classes with protocol negotiation (SRM), wide-area transfers (gsiftp, https, dcap, DPM, xrootd), and cloud/grid interfaces (REST, S3).]
Distributed Pros and Cons
* Distributed over a larger number of nodes.
  * Geographic scaling as well as node scaling.
  * Inherent data replication.
* Fault tolerant.
  * A storage brick takes a licking but the service keeps on ticking.
  * A node takes a licking but the service keeps on ticking.
* Parallel I/O.
  * All nodes can participate in moving data. High aggregate bandwidth.
* Single global namespace.
  * Rather than separate logical namespaces.
* Cost effective.
  * Use cheap hardware. Big disks over fast disks.
  * Design to expect failures.
File Replication
* Whole file
  * Duplicated and stored on multiple bricks.
* Slices of a file
  * File sliced and diced; slices stored on multiple bricks.
  * A single brick may not contain the whole file.
* Erasure codes
  * Parity blocks (as used in RAID).
  * Reed-Solomon: an oversampled polynomial constructed from the data.
* Add erasure codes and slice the file
  * Need any M of N pieces to recover the file (M < N). See the sketch below.
  * Can store a slice on multiple bricks for extra redundancy.
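To make the M-of-N idea concrete, here is a minimal sketch of the simplest possible erasure code: k data slices plus a single XOR parity slice, so any k of the k + 1 pieces rebuild the file. Real deployments use Reed-Solomon codes that survive the loss of several slices, but the recovery principle is the same; all names here are illustrative, not part of any RDSI system.

```python
def xor_slices(slices):
    """XOR a list of equal-length byte strings together."""
    out = bytearray(len(slices[0]))
    for s in slices:
        for i, b in enumerate(s):
            out[i] ^= b
    return bytes(out)

def encode(data: bytes, k: int) -> list[bytes]:
    """Split data into k equal (zero-padded) slices and append one XOR parity slice."""
    slice_len = -(-len(data) // k)                     # ceil(len / k)
    padded = data.ljust(k * slice_len, b"\x00")
    slices = [padded[i * slice_len:(i + 1) * slice_len] for i in range(k)]
    slices.append(xor_slices(slices))                  # parity slice, index k
    return slices

def decode(available: dict[int, bytes], k: int, length: int) -> bytes:
    """Rebuild the original data from any k of the k + 1 slices."""
    missing = [i for i in range(k + 1) if i not in available]
    if missing and missing[0] < k:                     # a data slice was lost: XOR the rest
        available[missing[0]] = xor_slices(
            [available[i] for i in range(k + 1) if i != missing[0]])
    return b"".join(available[i] for i in range(k))[:length]

# One brick fails, yet the file is still recoverable from the survivors.
original = b"Loss of service pardonable. Loss of data unforgivable."
bricks = dict(enumerate(encode(original, k=4)))        # 5 slices on 5 bricks
del bricks[2]                                          # brick 2 takes a licking
assert decode(bricks, k=4, length=len(original)) == original
```

With a proper M-of-N code the same pattern tolerates N - M simultaneous brick (or node) losses, which is what makes "design to expect failures" workable on cheap hardware.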
SurfNET Survey of Wide Area Distributed Storage (Circa 2010) [1/4] http://www.surfnet.nl/nl/innovatieprogramma's/gigaport3/documents/eds-3r%20open-storage-scouting-v1.0.pdf
Requirements:
* Scalable.
  * Capacity, performance and concurrent access.
  * Expandable storage without degrading performance.
* High availability.
  * Keeps data available to apps and clients.
  * Even in the event of a malfunction or a system reconfiguration.
  * Needs to replicate data to multiple locations.
SurfNET Survey of Wide Area Distributed Storage (Circa 2010) [2/4] http://www.surfnet.nl/nl/innovatieprogramma's/gigaport3/documents/eds-3r%20open-storage-scouting-v1.0.pdf
* Durability.
  * No data is lost from a single software or hardware failure.
  * Automatically maintain a minimum number of replicas.
  * Support backup to tape.
* Performance at traditional SAN/NAS level.
  * Comparable performance to traditional non-distributed SAN/NAS.
* Dynamic operation.
  * Availability, durability and performance configurable per application.
  * Reduce costs by not running at the highest support level all the time.
  * Allow users, apps and sysadmins to balance cost vs features.
  * System should be self-configurable and self-tunable.
  * Support data movement between different storage technologies.
  * Tiered functionality. Classes of storage.
SurfNET Survey of Wide Area Distributed Storage (Circa 2010) [3/4] http://www.surfnet.nl/nl/innovatieprogramma's/gigaport3/documents/eds-3r%20open-storage-scouting-v1.0.pdf
* Cost effective.
  * Must be possible to build, configure, run and maintain in a cost-effective manner.
  * Must work with commodity hardware.
    * Hardware may not be as reliable as high-end hardware.
  * Configuration of the system and its maintenance must be easy and straightforward.
  * Operation of the system is energy efficient.
  * Licence fees for software, when applicable, must be limited.
* Generic interfaces.
  * System offers generic interfaces to apps and clients.
  * POSIX interface. POSIX/NFSv4.1 semantics.
  * Block device (iSCSI, etc).
SurfNET Survey of Wide Area Distributed Storage (Circa 2010) [4/4] http://www.surfnet.nl/nl/innovatieprogramma's/gigaport3/documents/eds-3r%20open-storage-scouting-v1.0.pdf
* Protocols based on open standards.
  * System built using open protocols.
  * Reduces vendor lock-in.
  * More economical in the long run.
* Multi-party access.
  * System must support access by multiple geographically dispersed parties at the same time.
  * Promotes collaboration between these parties.
SurfNET Survey of Wide Area Distributed Storage (Circa 2010) http://www.surfnet.nl/nl/innovatieprogramma's/gigaport3/documents/eds-3r%20open-storage-scouting-v1.0.pdf
Candidates:
* Lustre
* GlusterFS
* GPFS
* Ceph
* + dcache
Non-candidates:
* XtreemFS
* MogileFS
* NFS v4.1 (pNFS)
* ZFS
* VERITAS FS
* Parascale
* CAStor
* Tahoe-LAFS
* DRBD
Nordic DataGrid Facility (dcache)
The DEISA Global File System at European Scale (Multi-Cluster General Parallel File System)
TeraGrid (GPFS & Lustre)
SurfNET Survey of Wide Area Distributed Storage + dcache

|                | Lustre | GlusterFS | GPFS | Ceph | dcache |
|----------------|--------|-----------|------|------|--------|
| Owner          | Oracle | Gluster | IBM | Newdream | dcache.org |
| Licence        | GNU GPL | GNU GPL | Commercial | GNU GPL | DESY |
| Data primitive | Object (file) | Object (file) | Block | Object (file) | Object (file) |
| Data placement | Round robin + free-space heuristics | Different strategies via modules | Policy based | Placement groups, random mappings | Policy based |
| Metadata       | Max 2 metadata servers | Stored with file | Distributed over storage servers | Multiple metadata servers | pnfs (postgresql) |
| Storage tiers  | Pools of object targets | Unknown | Policy defined | CRUSH rules | Policy defined |
SurfNET Survey of Wide Area Distributed Storage + dcache

|                        | Lustre | GlusterFS | GPFS | Ceph | dcache |
|------------------------|--------|-----------|------|------|--------|
| Failure handling       | Assuming reliable nodes | Assuming unreliable nodes | Assuming reliable nodes, failure groups | Assuming unreliable nodes | Assuming reliable nodes |
| Replication            | Server side (failover pairs) | Client side | Server side | Server side | Server side |
| WAN deployment example | TeraGrid | City Cloud (Swedish IaaS provider) | TeraGrid, DEISA | Unknown | Fermilab, Swegrid, NDGF |
| Client interface       | Native client, FUSE, CIFS, NFS | Native client, FUSE | Native client; exports NFSv3, CIFS, pcifs, WebDAV, SRM (StoRM) | Native client, FUSE | NFSv4.1, HTTP, WebDAV, GridFTP, Xrootd, SRM, dcap |
| Node types             | Clients, metadata, objects | Clients, data | Clients, data | Clients, metadata, objects | Clients, metadata, objects |
WAN Data Caching and Performance. Bringing data closer to where it is consumed.
* Researchers are naturally distributed across the city and the country.
* Some may not benefit from the high-speed networks provided by AARNet and the NRN due to their location.
* Can RDSI help these spatially disenfranchised? Yes (sort of).
* Take the model of Content Delivery Networks.
  * e.g. Akamai, Amazon CloudFront, etc.
  * Web content, videos etc. are cached close to the end user.
* But focus on data caching rather than content caching.
* May not provide the same experience as the spatially franchised enjoy.
  * But every bit helps!
WAN Data Caching with GPFS.
WAN Data Caching Continued
* dcache is a distributed cache system.
  * Locate a dcache pool close to the spatially disenfranchised.
  * A dcache admin can push the required data collections out to them using standard SRM processes.
  * Potentially a (reasonably) fast parallel transfer.
* BioTorrents <http://www.biotorrents.net>
  * Allows scientists to rapidly share their results, datasets and software using the popular BitTorrent file-sharing technology.
  * All data is open access, and illegal file sharing is not allowed on BioTorrents.
  * Alternatively, RDSI could run BitTorrent seeders from its own nodes.
  * Ignoring the bad press, BitTorrent is very good at what it does.
Data Durability. Things that go bump in the night (or not!)
* Data durability is an absolute necessity.
* RDSI must provide a safe and enduring home for research data.
  * This is more difficult than it appears!
* The enemy is physics.
  * The world is a complex quantum/probabilistic system.
  * And so is all of your computing and storage infrastructure.
* Random events in your infrastructure will create bit rot and silent corruptions.
* But you can engineer around the laws of physics.
Data Durability. Sources of Bit Rot and Silent Corruptions. [Diagram: the full I/O stack — user space, VM, memory, filesystems, block layer, SCSI layer, low-level drivers, controller firmware, storage firmware, disk mechanics and physical magnetic media, plus all interconnecting cables — annotated with error sources at each layer: ECC errors, corrupted metadata and data, inter-op issues, firmware bugs, wear-out, flipped bits, latent sector errors, cosmic rays/sun spots, EM radiation, and lost, torn and misdirected writes.] From Silent Corruptions, Peter Kelemen, CERN.
Data Durability. Expected Background Bit Error Rate (BER)
* NIC/Link/HBA: 10^-10 (1 bit in ~1.1 GB)
  * Check-summed; retransmit if necessary.
* Memory: 10^-12 (1 bit in ~116 GB)
  * ECC.
* SATA disk: 10^-14 (1 bit in ~11.3 TB)
  * Various error-correction codes.
* Enterprise disk: 10^-15 (1 bit in ~113 TB)
  * Various error-correction codes.
* Tape: 10^-18 (1 bit in ~111 PB)
  * Various error-correction codes.
* Data may be encoded five or more times as it travels between user space and physical disk/tape.
* At petascale, incredibly infrequent events happen all the time. (A quick check of these figures follows below.)
From Silent Corruptions, Peter Kelemen, CERN
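A minimal worked check of the "1 bit in ~X" figures above: the expected volume of traffic per bit error is simply 1/BER bits. The BER values are the ones quoted on the slide; output units are binary (GiB, TiB, PiB).

```python
# Expected data volume per single bit error, for the BER figures quoted above.
def human(nbytes: float) -> str:
    """Render a byte count in binary units."""
    for unit in ("B", "KiB", "MiB", "GiB", "TiB", "PiB", "EiB"):
        if nbytes < 1024:
            return f"{nbytes:.1f} {unit}"
        nbytes /= 1024
    return f"{nbytes:.1f} ZiB"

for name, ber in [("NIC/link/HBA", 1e-10), ("ECC memory", 1e-12),
                  ("SATA disk", 1e-14), ("Enterprise disk", 1e-15),
                  ("Tape", 1e-18)]:
    # 1/BER bits between errors, divided by 8 to get bytes.
    print(f"{name:15s} BER {ber:.0e}: one bit error per ~{human(1 / ber / 8)}")
```

Running this reproduces the slide's figures (~1.2 GiB, ~116 GiB, ~11.4 TiB, ~114 TiB, ~111 PiB), which is why the tape line above is quoted as ~111 PB.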
Data Durability. The errors you know. The errors you don't know.
There are known errors; there are errors we know we know. We also know there are known unknown errors; that is to say, we know there are some things we do not know. But there are also unknown unknown errors: the ones we don't know we don't know.
Paraphrased from Donald Rumsfeld. From Silent Corruptions, Peter Kelemen, CERN
Data Durability. The errors you know. The errors you don't know.
* There are data errors that you will know about.
  * Log messages.
  * SMART messages.
  * Detection: SW/HW level, with error messages.
  * Correction: SW/HW level, with warnings.
  * If you're really lucky your kernel will panic, so you'll know something happened.
* There are data errors that you will never know about.
  * As far as your storage infrastructure knows, that write/read was executed perfectly.
  * In reality you will probably never know the data has been corrupted.
  * (Unless you design for this eventuality.)
From Silent Corruptions, Peter Kelemen, CERN
Data Durability. How to discover the unknown unknowns.
* Checksums (CRC32, MD5, SHA1, ...).
  * Checksum the (meta)data.
  * Transport the checksum with the (meta)data for later comparison. (See the sketch below.)
* Error detection and correction codings.
  * Detect errors caused by noise, etc. (See checksums.)
  * Correct detected errors and reconstruct the original, error-free data.
  * Backward error correction: automatic retransmit on error detection.
  * Forward error correction: encode extra redundant data; regenerate data from the forward error codes.
* Multiple copies with quorum.
From Silent Corruptions, Peter Kelemen, CERN
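A minimal sketch of the checksum idea: compute a digest at ingest, carry it with the (meta)data, and recompute it on later reads or scrubs. It uses only the Python standard library (hashlib); the function names, file paths and the `stored_digest`/`flag_silent_corruption` references are illustrative, not part of any RDSI system.

```python
import hashlib

def file_checksum(path: str, algo: str = "sha1", chunk: int = 1 << 20) -> str:
    """Stream a file through a hash in 1 MiB chunks and return the hex digest."""
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

# At ingest: store the digest alongside the object's metadata.
# On later reads or periodic scrubs: recompute and compare, e.g.
#   if file_checksum("/data/collection/object.dat") != stored_digest:
#       flag_silent_corruption("/data/collection/object.dat")   # hypothetical handler
```

The comparison is what turns an "unknown unknown" into a known error: without the stored digest, a flipped bit on disk simply reads back as valid data.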
Data Durability. Silent Corruptions and CERN
* Circa 2007: 9 PB tape, 4 PB disk, 6000 nodes, 20000 drives, 1200 RAID.
* Probabilistic storage integrity check (fsprobe) on 4000 nodes. (A minimal sketch of the idea follows below.)
  * Write a known bit pattern.
  * Read it back.
  * Compare, and alert when a mismatch is found.
  * 6 cycles, over 1 hour each.
  * Low I/O footprint for background operation on a 2 GB file.
  * Keep complexity to a minimum: use static buffers.
  * Attempt to preserve details about detected corruptions for further analysis.
From Silent Corruptions, Peter Kelemen, CERN
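For concreteness, here is a minimal sketch of that write/read/compare probe. It is an illustration of the idea only, not CERN's actual fsprobe tool; the test path and sizes are placeholders, and a real probe would bypass the page cache (e.g. O_DIRECT) so the read actually exercises the storage path.

```python
import os

PATTERN = bytes(range(256)) * 4096   # 1 MiB static buffer with known content
CHUNKS = 2048                        # 2048 x 1 MiB = 2 GiB test file

def probe(path: str) -> int:
    """Write a known pattern, read it back, and report any mismatching chunks."""
    with open(path, "wb") as f:
        for _ in range(CHUNKS):
            f.write(PATTERN)
        f.flush()
        os.fsync(f.fileno())         # make sure the data reaches the storage stack
    mismatches = 0
    with open(path, "rb") as f:      # NOTE: may be served from page cache; see lead-in
        for offset in range(CHUNKS):
            if f.read(len(PATTERN)) != PATTERN:
                mismatches += 1
                print(f"corruption detected in chunk {offset} of {path}")
    os.remove(path)
    return mismatches

if __name__ == "__main__":
    probe("/scratch/fsprobe.test")   # placeholder path on the filesystem under test
```

Run periodically in the background on every node, even a crude probe like this surfaces corruptions that the hardware and filesystem would otherwise report as perfectly successful I/O.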
Data Durability. Silent Corruptions and CERN
* 2000 incidents reported over 97 PB of traffic.
  * 6/day observed on average!
  * 192 MB of silent data corruption.
* 320 nodes affected, over 27 hardware types.
* Multiple types of corruptions.
* Some corruptions are transient.
* Overall BER, considering all the links in the chain: ~3x10^-7.
  * Not the 10^-12 to 10^-14 spec'd rates.
From Silent Corruptions, Peter Kelemen, CERN
Data Durability. Types of Silent Corruptions
* Type I
  * Single/double bit-flip errors. Usually persistent.
  * Usually bad memory (RAM, cache, etc.).
  * Happens with expensive ECC memory too.
* Type II
  * Small, 2^n-sized random chunks (128-512 bytes) of unknown origin.
  * Usually transient.
  * Possibly the OOM killer or a corrupted SLAB/SLUB allocator.
* Type III
  * Multiple large 64K chunks of old file data. I/O command timeouts.
  * Usually persistent.
* Type IV
  * Various-sized chunks of zeros.
From Silent Corruptions, Peter Kelemen, CERN
Data Durability. What Can Be Done?
* Self-examining/self-healing hardware.
* WRITE-READ cycles before ACK.
* Check-summing, though not necessarily enough on its own.
* End-to-end check-summing.
* Store multiple copies.
* Regular scrubbing of RAID arrays.
* Data refresh: re-read cycles on tapes.
* Generally, accept and prepare for corruptions.
From Silent Corruptions, Peter Kelemen, CERN
Data Durability. The solutions. ZFS. The Good.
* Developed by Sun (now Oracle) on Solaris.
* Designed from the ground up with a focus on data integrity.
* Combined filesystem and logical volume manager.
* RAID-Z, RAID-Z2, RAID-Z3, or mirrored.
* Copy-on-write. Transactional operation.
* Built-in end-to-end data integrity.
  * Data/metadata checksummed all the way to the root.
* Always consistent on disk. No fsck or journaling.
* Automatic self-healing.
* Intelligent online scrubbing and resilvering.
* Very large filesystem limits. Max. 256 ZB per FS.
* Deduplication. Snapshots. And much, much more.
Data Durability. The solutions. ZFS. The Bad.
* Supported on Solaris only.
  * OpenSolaris is no more.
* Kernel ports exist for FreeBSD and NetBSD.
  * Using the OpenSolaris kernel source code.
* Linux port via ZFS-FUSE.
  * Kernel space good; user space not so good.
* ZFS on Linux.
  * Supported by Lawrence Livermore National Laboratory.
  * Issues with CDDL and GPL licence compatibility in the kernel.
  * The Solaris Portability Layer/shim to the rescue.
  * Currently v0.6.0-rc4. It worked for me, but it is not production grade yet.
Data Durability. The solutions. ZFS for Lustre.
* 1999: Peter Braam from CMU creates Lustre.
  * A GPL massively parallel distributed file system.
* 2003: Braam creates Cluster File Systems Inc to continue the work.
* 2007: Sun acquires Cluster File Systems Inc.
  * Works to combine ZFS and Lustre.
  * A high-performance parallel FS with end-to-end data integrity.
  * But only supported on Solaris.
* 2009: LLNL starts porting the ZFS kernel code to Linux.
  * Oracle acquires Sun.
* 2010: Oracle announces ZFS/Lustre for Solaris only.
* 2011: LLNL starts the ZFS/Lustre port for Linux.
* Late 2011: LLNL plans a ZFS/Lustre FS.
  * 50 PB. 512 GB/s to 1 TB/s bandwidth.
Data Durability. The solutions. DataDirect Networks S2S Technology.
* SATA storage with:
  * Enterprise-class performance.
  * Reliability and data integrity.
* Automatic self-healing.
  * Detects anomalies and begins journaling all writes while recovering operations.
* Dynamic MAID (D-MAID).
  * Saves additional power and cooling by powering down the platters,
  * where over 80% of the power is consumed.
  * DC friendly.
Community Input Time.
* Are we barking up the right tree?
* Are we barking up the wrong tree?
* Is there even a tree in the first place?
* You decide.
Building Blocks
* Are the base building blocks sufficient?
  * If not, what should be added?
* Is there a need for additional data transfer protocols?
  * If so, what should be added?
* Is there a need for additional file system protocols?
  * If so, what should be added?
* What additional public cloud storage infrastructure should RDSI consider?
* What additional private cloud storage infrastructure should RDSI consider?
Federated vs Distributed.
* Should RDSI continue to embrace the federated iRODS model?
* Should RDSI embrace the distributed FS model?
* Should RDSI embrace both the federated and the distributed models?
Distributed Fault-Tolerant Parallel Filesystems.
* If RDSI chooses to use a distributed fault-tolerant parallel filesystem component, are there such systems that we have not yet considered?
WAN Data Caching
There are always going to be researchers who may not be able to benefit from the high-speed networks provided by AARNet and the NRN. WAN data caching may partially eliminate their disadvantage, but at a cost.
* Should RDSI consider the use of WAN data caches?
* If so, which sites would benefit from these data caches?
Data Durability.
Data durability is one of the foremost challenges for RDSI. However, it seems impossible to entirely eliminate the various issues of bit rot and silent corruption.
* Given this fact of nature, what level of data durability is the research community willing to accept?