The Panasas Parallel Storage Cluster
What Is The Panasas ActiveScale Storage Cluster?
- A complete hardware and software storage solution
- Implements an asynchronous, parallel, object-based, POSIX-compliant filesystem
- Provides a global namespace
- Enforces strict client cache coherency
Physically, How Is It Organized?
- A shelf is 4U high and contains slots for 11 blades
- 0-3 DirectorBlades per shelf, with the remaining slots for StorageBlades
- A typical shelf: 1 DirectorBlade and 10 StorageBlades
Terminology
- Metadata: the information that describes the data contained in files (size, create time, modify time, location on disk, permissions)
- Block-based filesystem: a filesystem in which the client accesses files by their physical location on disk
- File-based filesystem: a filesystem in which a client requests a file by name
- Object-based filesystem: the filename is abstracted into an identifier; we will discuss this later
- RAID: multiple disks arranged into one logical disk, tuned for redundancy or speed
- JBOD (just a bunch of disks): multiple disks accessed directly rather than in an array
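The file-based vs. object-based distinction above can be sketched in a few lines. This is a minimal illustration, not a Panasas API: the class names, the flat dictionary "namespace", and the object ID are all assumptions for the example.

```python
# Illustrative sketch (not a Panasas API): contrast file-based access,
# where every read goes through a server by pathname, with object-based
# access, where the path resolves once to an opaque object ID and I/O
# then addresses the object directly.

class ObjectStore:
    """Object-based: data is addressed by an opaque object ID."""
    def __init__(self):
        self.objects = {}        # object_id -> bytes

    def read(self, object_id):
        return self.objects[object_id]

class MetadataServer:
    """Maps pathnames to object IDs; stays out of the data path."""
    def __init__(self):
        self.namespace = {}      # path -> object_id

    def lookup(self, path):
        return self.namespace[path]

# After one metadata lookup, the client talks to storage directly.
mds, store = MetadataServer(), ObjectStore()
mds.namespace["/data/run1.out"] = 0x2A
store.objects[0x2A] = b"results"

oid = mds.lookup("/data/run1.out")   # control path
data = store.read(oid)               # data path: no file server involved
```

The point of the abstraction is the last two lines: the name-to-ID mapping is a one-time control-path operation, after which data flows without an intermediary.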
Direct Attached Storage (local filesystem)
- Private storage for a host operating system
  - IDE-connected internal hard drives
  - Serial ATA or SCSI attached drives
  - USB drives
- Examples: ext3, reiserfs, NTFS, ufs, FAT32
- This discussion is mostly about distributed file systems: problems of scale require many storage computers working together
Network Attached Storage
- File server exports storage at the file level
- NFS/CIFS are widely deployed; NFS is the only official file system standard
- Scalability limited by server hardware
  - Moderate number of clients (10s to 100s)
  - Moderate amount of storage (a few TB)
- A nice model until it runs out of steam
  - Islands of storage
  - Bandwidth to a file limited by its server
- Examples: NetApp (ONTAP 7.x), Sun, HP, SnapServer, EMC Celerra, StorEdge NAS, IBM TotalStorage NAS, whitebox Linux NAS head
Clustered NAS
- More scalable than single-headed NAS: multiple NAS heads share back-end storage
- In-band NAS head still limits performance and drives up cost
- Two primary architectures
  - Forward requests to the owner head
  - Export NAS from a shared file system
- NFS does not provide a good mechanism for dynamic load balancing; clients permanently mount a particular head
- Examples: GPFS, Isilon OneFS, IBRIX, PolyServe, NetApp GX, BlueArc, Exanet ExaStore, ONStor, Pillar Data, IBM/Transarc AFS, IBM DFS
Storage Area Network
- Common management and provisioning for host storage
- Block devices (JBOD or RAID) accessible via an iSCSI or FC network
- Wire-speed/RAID-speed performance potential
- Proprietary solutions for shared file systems
- Scalability limited by block management on the metadata server(s) (e.g., 32 nodes)
- NAS access provided by a file head that re-exports the SAN file system
- Asymmetric (pictured) or symmetric implementations
Object-Based Storage Clusters
- Block and file interfaces replaced with an object abstraction
- Block management pushed all the way out to the disks
- Allows parallel and direct access to disks
- Requires a non-standards-based client
- Examples: Lustre, Panasas
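The "parallel and direct access" claim above can be sketched as a two-phase flow: one metadata round-trip returns the object's stripe map, then the client reads every stripe unit directly and concurrently from the storage devices. The names (`get_layout`, the three-OSD layout, the object ID) are illustrative assumptions, not the actual OSD protocol.

```python
# Illustrative sketch (not the real OSD wire protocol): control path
# touches the metadata server once; the data path then fans out in
# parallel to the object storage devices.
from concurrent.futures import ThreadPoolExecutor

class OSD:
    """One object storage device holding stripe units keyed by (oid, index)."""
    def __init__(self):
        self.units = {}

    def read(self, oid, index):
        return self.units[(oid, index)]

def get_layout(mds, path):
    """Metadata server maps a path to (object id, list of OSDs)."""
    return mds[path]

# Hypothetical file striped round-robin across three OSDs.
osds = [OSD(), OSD(), OSD()]
for i, chunk in enumerate([b"AAAA", b"BBBB", b"CC"]):
    osds[i % 3].units[(7, i)] = chunk
mds = {"/scratch/out": (7, osds)}

oid, devices = get_layout(mds, "/scratch/out")      # control path (once)
with ThreadPoolExecutor() as pool:                  # data path (parallel)
    parts = list(pool.map(lambda i: devices[i % 3].read(oid, i), range(3)))
print(b"".join(parts))
```

Because each stripe unit lives on a different device, aggregate bandwidth grows with the number of OSDs rather than being capped by a single server's NIC.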
pNFS: Standard Storage Clusters
- pNFS is an extension to the Network File System v4 protocol standard (NFSv4.1)
- Allows parallel and direct access
  - From Parallel Network File System clients
  - To storage devices, over multiple storage protocols: block (FC), object (OSD), or file (NFS)
- Moves the Network File System server out of the data path
RAID: Redundant Array of Independent Drives
- Many physical disks bound together with hardware or software
- Multiple layouts to accommodate performance and fault-tolerance requirements
- Used to create larger filesystems out of standard drive technology
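The fault-tolerant layouts mentioned above rest on a simple idea. A minimal illustration of RAID-5-style parity (function name and toy block sizes are ours): parity is the XOR of the data blocks, so any single lost block can be rebuilt from the survivors.

```python
# XOR parity: P = D0 ^ D1 ^ D2, so a missing Di = XOR of the rest plus P.
from functools import reduce

def xor_blocks(blocks):
    """XOR a list of equal-length byte strings together."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

data = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]   # three data disks
parity = xor_blocks(data)                         # parity disk

# Disk 1 fails: rebuild its contents from the survivors plus parity.
rebuilt = xor_blocks([data[0], data[2], parity])
assert rebuilt == data[1]
```

Mirroring (RAID-1) trades capacity for simplicity instead; parity layouts spend only one disk's worth of space per stripe on redundancy.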
Comparing Technology
How does an object-based, parallel filesystem compare to traditional storage solutions?
- vs. Direct Attached Storage
  - Separate control and data paths; metadata and data workloads are distributed
  - Multiple access points for redundancy and scalability
  - No need to balance expensive server resources between applications and storage access
- vs. Network Attached Storage
  - Scalability and ease of management in very large installations
- vs. Storage Area Networks
  - Clients access storage directly, with no intermediary gateway
  - All communication is IP based: choose your infrastructure
    - Low cost, high bandwidth: Gigabit or 10-Gigabit Ethernet
    - Higher cost, low latency: InfiniBand
Panasas Object-Based Storage Cluster
- Consists of two primary components
  - Object Storage Devices (OSD): StorageBlades
  - Metadata managers: DirectorBlades
- Directors implement file system semantics: access control, cache consistency, user identity, etc.
- Directors have rights to perform these object operations:
  - Create, delete, create group, delete group
  - Get attributes and set attributes
  - Clone group, with copy-on-write support for snapshots
- Clients perform direct I/O with these object operations:
  - Read, write
  - Get attributes, set (some) attributes
Panasas StorageBlade (OSD)
- Balanced storage device: CPU, SDRAM, GE NIC, and 2 spindles (2x2TB SATA)
- Commodity parts drive low cost
- Performance scales with capacity
- A single, seamless namespace!
DirectFLOW Client
- DirectFLOW client is a kernel-loadable filesystem module
- Implements the standard vnode interface
- Uses native Panasas network protocols (RPC and iSCSI)
- Caches data, directories, attributes, and capabilities
- Responds to callbacks for cache consistency
- Performs RAID I/O directly to StorageBlades with iSCSI/OSD
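"Does RAID I/O directly to StorageBlades" means the client, not a RAID controller, splits each write into stripe units, computes parity itself, and sends every unit to a different blade. A simplified sketch, with assumed names and a toy 4-byte stripe unit; real layouts rotate the parity position across devices, which is fixed here for brevity.

```python
# Client-driven RAID sketch: the client computes parity and writes each
# stripe unit directly to its StorageBlade (modeled as a dict per blade).

STRIPE_UNIT = 4          # bytes per stripe unit (toy value)

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def client_raid_write(data, osds):
    """Stripe `data` as len(osds)-1 data units + 1 parity unit per stripe."""
    width = len(osds) - 1
    for stripe, base in enumerate(range(0, len(data), STRIPE_UNIT * width)):
        units = [data[base + i * STRIPE_UNIT: base + (i + 1) * STRIPE_UNIT]
                     .ljust(STRIPE_UNIT, b"\0")          # pad short tail
                 for i in range(width)]
        parity = units[0]
        for u in units[1:]:
            parity = xor(parity, u)
        for osd, unit in zip(osds, units + [parity]):
            osd[stripe] = unit                           # direct write to blade

osds = [{}, {}, {}]                # two data blades + one parity blade
client_raid_write(b"ABCDEFGH", osds)
print(osds[0][0], osds[1][0], osds[2][0])
```

Because parity work happens on the client, aggregate RAID throughput scales with the number of clients instead of being serialized through a central RAID engine, which is the point made on the next slide.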
DirectorBlades
- Metadata manager
  - Realm control: admit blades, start/stop services, failover
  - File manager: access control, cache consistency, file system semantics
  - Storage manager: file virtualization (maps), recovery, reconstruction
- Management console
  - Web-based GUI or command-line interface (CLI)
  - Status, charts, reporting
  - Storage management
- Gateway function (NFS/CIFS) collocated on the DirectorBlade
- Fast processor and large main memory
- Multiple DirectorBlades allow service replication for fault tolerance
Environment
- AC power
  - Each shelf has dual power supplies and a battery
  - Automatic graceful shutdown on loss of AC power
  - Masks brownouts and short (5-sec) power glitches
- Thermal: 800 watts in 4U!
  - Power supplies and batteries have fans that cool the shelf
  - Blades, power supplies, batteries, and network cards all monitor temperature
  - Warnings generated near the temperature limit
  - Unilateral blade shutdown if a blade gets very hot
  - Graceful shutdown of a whole shelf if multiple blades are hot
Bladesets and Volumes
- A bladeset is a storage (OSD) failure domain
  - A single OSD failure results in degraded operation and reconstruction
  - Two OSD failures result in data unavailability
- Bladesets can be expanded or merged (but not unmerged) for growth
- Capacity balancing occurs within a bladeset
- A volume is a file hierarchy with a quota
  - One or more volumes compete for space within a bladeset
  - No physical boundaries between volumes, except quota limits
- A volume is the unit of DirectFLOW metadata work
  - Each DirectorBlade manages one or more volumes
- NFS/CIFS gateway workload is orthogonal to DirectFLOW metadata
  - All DirectorBlades provide uniform/symmetric NFS/CIFS access
What Problems Does It Solve?
It's all about removing the bottlenecks in traditional storage.
- No RAID engine bottleneck
  - Client-driven RAID scales as the number of clients increases
  - Multiple volumes or DirectorBlades for scalable reconstruction
- No network uplink bottleneck
  - 10GigE port or 4-port Gig-E link aggregation group per shelf
- Flexible, per-file layouts (SDK required)
  - RAID1/5 for large streaming I/O
  - RAID10 for N-to-1 writes or random I/O
  - Customizable stripe width and depth: control the number of spindles and the parity overhead
- Global namespace
  - Single, web-browser-based management interface for 100s of TBs
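The stripe-width/parity-overhead trade-off above can be made concrete with a little arithmetic (illustrative values, not Panasas defaults): a RAID-5-style stripe with one parity unit per `width` units spends 1/width of raw capacity on parity, so wider stripes cost less capacity but touch more spindles per I/O.

```python
# Parity overhead vs. stripe width for a single-parity (RAID-5-style) layout.

def parity_overhead(width):
    """Fraction of raw capacity consumed by parity (1 parity unit per stripe)."""
    return 1 / width

for width in (3, 6, 9, 11):
    print(f"stripe width {width:2d}: {parity_overhead(width):5.1%} parity overhead")
```

This is why "customizable stripe width and depth" matters: narrow stripes waste more capacity on parity but limit how many spindles each file depends on, while wide stripes do the opposite.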
Customizing For Your Environment
- Pick your protocol: DirectFLOW, NFS, CIFS, or any combination at one time
  - More DirectorBlades for NFS/CIFS performance
  - More StorageBlades for DirectFLOW performance
- Interactive vs. batch processing
  - ActiveStor 5000 with larger cache sizes on StorageBlades for interactive work
- Fault tolerance
  - Configurable spares for multiple sequential StorageBlade failures
  - Configurable bladeset sizes to mitigate simultaneous blade-failure risk
  - Redundant network links
- Storage capacity options
  - Smaller-capacity blades: more spindles, less data to reconstruct, more shelves
  - Larger-capacity blades: fewer shelves, reduced double-disk failure risk
Logically, How Does Data Flow?
[Diagram: Linux clients with the DirectFLOW filesystem client and NFS/CIFS clients connect over an IP network to DirectorBlades (metadata manager, NFS/CIFS gateway) and StorageBlades]
Logically, How Does Data Flow? An Example Six-Shelf System
[Diagram: the same data flow shown across six shelves of DirectorBlades and StorageBlades]
Logically, How Does Data Flow? An Example Six-Shelf System with Three Bladesets
[Diagram: the six shelves' StorageBlades grouped into Bladesets 1-3]
Logically, How Does Data Flow? An Example Six-Shelf System with Three Bladesets and Eight Volumes
[Diagram: volumes Vol1-Vol8 distributed across the three bladesets; each shelf holds 10 StorageBlades]
How Do I Manage 100s of TB?
All from a single web (http://) or command-line interface.
- PanActive Manager: single GUI for entire namespace management
- Simple out-of-box experience: seamlessly adopt new blades
- Capacity and load balancing
- Volumes and quotas
- Snapshots
- 1-touch reporting capabilities for capacity trends, asset ID, and performance
- Email and/or pager notification of errors and warnings
- Scriptable CLI for all features