High Performance Computing Specialists: ZFS Storage as a Solution for Big Data and Flexibility
Introducing VA Technologies: a UK-based system integrator specialising in high-performance ZFS storage, and a partner of E4 Computer Engineering delivering ZFS storage and HPC solutions.
VA HPC, Powered by E4. New HPC solutions: ARKA ARM and Quadro + Tegra 3 blades. Joint new solutions: Lustre on ZFS & Hadoop on ZFS.
But first: ZFS. 6 Great Reasons to Love ZFS: Architecture; Data Integrity; Redundancy; Transactional Copy on Write (COW); Snapshots; Hybrid Storage (mixing SSD with HDD).
Architecture
Architecture: ZFS Pool Layout. A pool holds its configuration plus datasets: file systems, zvols, their snapshots, and clones.
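The pool layout above maps directly onto the standard ZFS administration commands. A minimal sketch; the pool, dataset, and device names (`tank`, `sda`, etc.) are illustrative, not from the slides:

```shell
# Create a pool (the pooled storage layer), then carve datasets out of it
zpool create tank mirror sda sdb            # pool backed by a mirror vdev
zfs create tank/projects                    # a POSIX file system dataset
zfs create -V 10G tank/vol0                 # a zvol (block-device emulation)
zfs snapshot tank/projects@monday           # read-only point-in-time snapshot
zfs clone tank/projects@monday tank/copy    # writable clone of that snapshot
zpool status tank                           # show the pool configuration
```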
Architecture: ZFS Layer View. Consumers at the top: raw access (swap, dump, iSCSI), the ZFS POSIX Layer (ZPL) exporting NFS/CIFS (and e.g. pNFS), and the ZFS Volume Emulator (zvol). Beneath them sit the Transactional Object Layer, the Pooled Storage Layer, and the Block Device Driver, backed by HDD, SSD, iSCSI, or FC devices.
Data Integrity
Data Integrity: The Designer's Quote. "The job of any file system boils down to this: when asked to read a block, it should return the same data that was previously written to that block. If it can't do that -- because the disk is offline or the data has been damaged or tampered with -- it should detect this and return an error." Jeff Bonwick, father of ZFS, Dec 2008.
Data Integrity: Merkle Trees & Checksums www.va-technologies.com
Data Integrity: Validating Checksums
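Because every block's checksum is stored in its parent (the Merkle-tree structure from the previous slide), the whole tree can be re-validated on demand with a scrub. A sketch; the pool name `tank` is illustrative:

```shell
zpool scrub tank             # re-read every allocated block and verify its checksum
zpool status -v tank         # checksum errors appear in the CKSUM column
zfs set checksum=sha256 tank # optionally use SHA-256 instead of the default fletcher4
```

Checksums are also verified on every normal read, so a scrub simply forces validation of data that is rarely accessed.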
Data Integrity: Do you have a Write Hole?
Redundancy
Redundancy: Mirrored Disks in 2 vdevs. The root vdev has two top-level logical vdevs of type mirror (children[0] and children[1]); each mirror's children are physical (leaf) vdevs of type disk.
Redundancy: RAID-Z2 in 3 vdevs. The zpool stripes across top-level RAID-Z2 vdevs (vdev-0, vdev-1, ... vdev-x), each built from six HDDs as physical (leaf) vdevs.
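Both layouts above are expressed on the `zpool create` command line, one top-level vdev per keyword. A sketch; device names are illustrative:

```shell
# Two top-level mirror vdevs (the mirrored layout)
zpool create tank mirror sda sdb mirror sdc sdd

# Three top-level RAID-Z2 vdevs of six disks each
zpool create tank \
  raidz2 sda sdb sdc sdd sde sdf \
  raidz2 sdg sdh sdi sdj sdk sdl \
  raidz2 sdm sdn sdo sdp sdq sdr
```

Writes are then striped dynamically across whichever top-level vdevs have free space.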
Redundancy: Dynamic Striping (RAID-0). Example: recordsize = 128 KB gives a column size of 128 KB; across three vdevs the stripe width is 384 KB, and a total write of 2816 KB is distributed dynamically across the vdevs.
Transactional Copy on Write (COW)
Transactional Copy on Write (COW): no need for journaling or FSCK.
1. Initial block tree. 2. COW some data. 3. COW metadata. 4. Update uberblocks & free old blocks.
Snapshots
Snapshots
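Because COW already keeps old block trees around until the uberblock is updated, a snapshot is just a retained block tree, so creation is instant and initially consumes almost no space. A sketch of everyday snapshot use; names and hosts are illustrative:

```shell
zfs snapshot tank/home@before-upgrade   # instant, near-zero space at creation
zfs list -t snapshot                    # list snapshots
zfs rollback tank/home@before-upgrade   # revert the dataset to the snapshot
# Replicate a snapshot to another pool/host
zfs send tank/home@before-upgrade | ssh backuphost zfs receive pool/home
```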
Hybrid Storage Pools: Mixing SSD with HDD
Hybrid Storage Pools: Mixing SSD with HDD (diagram: old tiered approaches vs. the new ZFS hybrid pool).
Hybrid Storage Pools: Mixing SSD with HDD. The Adaptive Replacement Cache (ARC) in RAM; a separate ZFS Intent Log (ZIL) on a write-optimized device (SSD); the main pool on HDDs; and a Level 2 ARC (L2ARC) on a read-optimized device (SSD).
Hybrid Storage Pools: the ARC. The ARC in RAM combines a recent cache (MRU, which evicts its oldest single-use entry, LRU-style) and a frequent cache (MFU, which evicts its oldest multiply-accessed entry, LFU-style). A miss fills the recent cache; repeated hits promote a block to the frequent cache.
Hybrid Storage Pools: L2ARC. Data soon to be evicted from the ARC is sent to the Level 2 ARC (L2ARC), which is usually an SSD cache vdev. Works well when the cache vdev is optimized for fast reads at lower latency than the pool disks: an inexpensive way to improve read performance. The SSD cache vdev can be striped for better performance. Non-persistent: requires a rebuild after power off (soon to change!).
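Adding L2ARC capacity is a single pool operation. A sketch; pool and NVMe device names are illustrative:

```shell
# Add a read-optimized SSD as an L2ARC cache vdev
zpool add tank cache nvme0n1
# Reads are spread across multiple cache devices
zpool add tank cache nvme1n1
zpool iostat -v tank    # cache-device usage appears under "cache"
```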
Hybrid Storage Pools: ZIL/SLOG. Often described as the write cache; non-volatile. Stores small (<32 KB, configurable) sync writes in high-speed persistent storage or on a separate log device (SLOG). Flushes to the disk backend periodically as a sequential write stream, as part of the transaction group (TXG). Assigned on a per-pool basis. A perfect use case for RAM-based SAS devices.
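A SLOG is attached as a log vdev; since it holds sync writes that have not yet reached the main pool, mirroring it is common practice. A sketch; pool, dataset, and device names are illustrative:

```shell
# Add a mirrored SLOG (it holds not-yet-flushed sync writes)
zpool add tank log mirror nvme0n1 nvme1n1
# Per-dataset tuning of how the ZIL is used
zfs set sync=always tank/db              # force all writes through the ZIL
zfs set logbias=throughput tank/scratch  # bypass the SLOG for large streams
```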
ZFS Limitations: Parallel access? Distributed file system? (pNFS)
Relevance to the HPC Community
Hadoop on ZFS: better rebuild times; utilize SSDs with HDDs; better administration; single-drive replacement; integrated management, warning, and reporting.
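The single-drive replacement point comes from resilvering: ZFS rebuilds only allocated blocks rather than the whole raw device, so rebuild time tracks used capacity. A sketch; device names are illustrative:

```shell
# Replace a failed drive; only live data is resilvered
zpool replace tank sdc sdx
zpool status tank              # watch resilver progress
zpool add tank spare sdy       # keep a hot spare in the pool
zpool set autoreplace=on tank  # replace failed devices automatically
```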
Linux and Lustre on ZFS: a native kernel module is now available.
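With the native module, a Lustre target can be formatted directly on a ZFS backend. A sketch of the general shape; the filesystem name, pool/dataset, MGS address, and mount point are all illustrative:

```shell
# Format a Lustre OST on a ZFS dataset (hypothetical names throughout)
mkfs.lustre --fsname=lfs --ost --index=0 --backfstype=zfs \
  --mgsnode=mgs@tcp ostpool/ost0
mount -t lustre ostpool/ost0 /mnt/ost0   # bring the OST online
```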
Linux and Lustre on ZFS: 55 PB now active for Sequoia users; Lustre + ZFS fully configured with 768 OSSs & OSTs.
Lustre on ZFS: Write Performance. Single shared file IOR (10G block, 1M transfers), Sequoia workload: 768 OSS nodes, 2048 tasks per OSS, 1,572,864 compute cores. (Chart: MB/s vs. tasks per OSS for LDISKFS+RAID6, ZFS+RAID6, and ZFS+RAIDZ2.) LDISKFS: increased tasks per OSS degrades performance. ZFS: constant performance; increase I/O size for RAID-Z2.
Lustre on ZFS: Read Performance. (Chart: read bandwidth vs. tasks per OSS for LDISKFS+RAID6, ZFS+RAID6, and ZFS+RAIDZ2.) LDISKFS: mballoc allows larger I/O. ZFS: 128 KB maximum block size; IOPS limited for ZFS+RAID6. A perfect opportunity for read caching.
Lustre on ZFS, Coming Soon: new hardware optimised for Lustre on ZFS; low power consumption OSS & OST; customisable for your Lustre deployment; full Lustre and hardware support.
Thanks Very Much! Ryan Tyler, VA Technologies, ryan.tyler@va-technologies.com, @ryanjamestyler