Enterprise Storage Management with Red Hat Enterprise Linux for Reduced TCO Dave Wysochanski (dave.wysochanski@redhat.com)
Agenda Enterprise Storage Problems Red Hat Enterprise Linux Solutions Device Mapper Architecture LVM2 Multipath Low-Level Drivers / Transports (FC, iSCSI, etc.) Management Tools Summary
Not Covered Filesystems (tiny bit) Clustering, GFS Surfing (web, surfboard)
Enterprise Storage Problems Primary Data Sizes Unknown / Unpredictable Availability (component servicing, failures) Performance Backup Secondary Vendor incompatibility Problem resolution ping/pong Technology evolution / incompatibility
Enterprise Storage Support out of the box Ext3: the standard open source journaling filesystem (online resizing, performance improvements) SAN support: F/C for popular Emulex and Qlogic F/C HBAs; iSCSI software initiator and iSCSI HBAs Native (DM) multipathing: availability in the face of path / component failures LVM2: kernel-level storage virtualization, now includes: Dynamic volume resizing (unknown data size) Snapshots (backup, availability) Striping / RAID0 (performance) Mirroring / RAID1 (availability)
Filesystems
Ext3 / Ext4 Today (ext3): max filesystem size 8TB; max file size 8TB (x86/amd64/em64t) & 16TB (Itanium2/POWER); online growing (system-config-lvm or the command line, e.g. 4.x: ext2online); directory indexing (hash-tree directories improve performance); block reservations (improved read/write performance). Futures: RHEL5 has 16TB ext3 support (tech preview today; previous limit was 8TB); ext4 enters RHEL5 as a tech preview (extents, preallocation, delayed allocation, large filesystems, etc.). Reference: Ted Ts'o's ext4 talk
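The online growing mentioned above can be exercised from the command line; a minimal sketch, assuming the ext3 filesystem lives on an LVM logical volume named /dev/vg0/lv_data (names and sizes are illustrative):
  lvextend -L +5G /dev/vg0/lv_data     # grow the underlying logical volume first
  ext2online /dev/vg0/lv_data          # RHEL 4: grow the mounted ext3 filesystem to fill the LV
  # On RHEL 5, resize2fs performs the online grow of a mounted ext3 instead.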
Device Mapper (DM) Architecture
DM Architecture Overview (diagram): userspace tools (dmsetup, lvm2, dmraid, multipath, kpartx) use libdevmapper, which talks through the dm ioctl interface to the kernel DM core; the core dispatches I/O to targets (e.g. dm-raid1).
Device Mapper General-purpose kernel block I/O redirection subsystem. Logical devices are maps of specified sectors on underlying devices, according to the rules implemented in a target. Contains the DM core and DM targets; the core is only about 2000 lines of code and provides generic infrastructure for registering targets. Targets: multipath, linear, striped, snapshot, mirror, etc.
DM Kernel Architecture (diagram): the device-mapper core sits between the ioctl/control interface and the filesystem/block interfaces, and exposes a mapping/target interface to the targets (linear, mirror, snapshot, multipath); supporting pieces include the dirty-region log, kcopyd, path selectors (round-robin) and hardware handlers (e.g. EMC).
Device Mapper No concept of volume groups, logical or physical volumes; it simply maps from one block device to another. Does not know about on-disk formats (e.g. LVM2 metadata, filesystems). The user<->kernel interface is via ioctl(), encapsulated in libdevmapper. Kernel component of dm multipath, LVM2, dmraid, etc. DM devices can be stacked, for example: a snapshot of a mirror whose components are multipath devices
Device Mapper Targets (diagram): a target maps a logical device such as /dev/dm-0 onto underlying devices (/dev/sda, /dev/sdb, /dev/hda) or onto another DM device (/dev/dm-1), so mappings can be stacked.
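To make the mapping idea concrete, a minimal dmsetup sketch (device names and sector counts are made up) that concatenates two disks into one logical device using the linear target:
  # Table line format: <start> <length> linear <device> <offset>  (all values in 512-byte sectors)
  printf '0 409600 linear /dev/sdb 0\n409600 204800 linear /dev/sdc 0\n' | dmsetup create concat
  dmsetup table concat    # show the resulting map; the device appears as /dev/mapper/concat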
Device Mapper Targets and Corresponding Userspace Subsystems
Surfing Dos / Don'ts Do First learn to swim Get a lesson from a professional Start in a location with smaller waves (<2m) Try to ride broken waves first Use a long board Learn to dive under waves Have fun Don't Think you know it all Start in a location like Hawaii with very large (>>3m) waves Get in the way of experts
DM Multipath: Today Modular design Storage vendor plugins path checker, path priorities (userspace) error code handling / path initialization (kernel) Path selection policies Only round robin currently UUID calculation Path grouping Fail over / fail back policies Broad hardware support HP, Hitachi, SUN, EMC, NetApp, IBM, etc Active / Active, Active / Passive
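A minimal usage sketch for DM multipath (service and option names are as shipped with RHEL 4/5-era multipath-tools; paths and arrays will differ):
  modprobe dm-multipath          # load the multipath target
  service multipathd start       # daemon that monitors paths and handles failback
  multipath -v2                  # scan block devices and create the multipath maps
  multipath -ll                  # list maps, path groups and path states
  # Per-array settings (path checker, priorities, failback policy) live in /etc/multipath.conf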
DM Multipath: Future Multipath root Alternative load balancing policies Request based multipath (NEC) Support newest storage arrays Active / Passive arrays? Requests?
Logical Volume Management Volume Management creates a layer of abstraction over the physical storage. Physical Volumes (disks) are combined into Volume Groups. Volume Groups are divided into Logical Volumes, like the way in which disks are divided into partitions. Logical Volumes are used by Filesystems and Applications (e.g. databases).
Logical Volume Management (diagram): disks become Physical Volumes (pvcreate), Physical Volumes are combined into a Volume Group (vgcreate), and Logical Volumes are carved out of the Volume Group (lvcreate).
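Following the layering in the diagram, a minimal sketch (disk, VG and LV names are examples):
  pvcreate /dev/sdb /dev/sdc           # initialize disks as Physical Volumes
  vgcreate vg0 /dev/sdb /dev/sdc       # combine the PVs into a Volume Group
  lvcreate -L 20G -n lv_data vg0       # carve a Logical Volume out of the VG
  mkfs.ext3 /dev/vg0/lv_data           # filesystems and applications sit on top of LVs
  mount /dev/vg0/lv_data /data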
Advantages of LVM Filesystems can extend across multiple disks. Hardware storage configuration is hidden from software and can change without stopping applications or unmounting filesystems. Data can be rearranged on disks, e.g. emptying a hot-swappable disk before removing it. Device snapshots can be taken for consistent backups or to test the effect of changes without affecting the real data
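For example, emptying a hot-swappable disk before removing it might look like this (a sketch; vg0 and /dev/sdb are placeholders):
  pvmove /dev/sdb          # migrate all allocated extents off the PV while the LVs stay online
  vgreduce vg0 /dev/sdb    # remove the now-empty PV from the Volume Group
  pvremove /dev/sdb        # wipe the PV label before physically pulling the disk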
LVM2 Features Concatenation, striping (RAID 0), mirroring (RAID 1); additional RAID levels (3, 5, 6, 10, 0+1) planned for the future. Snapshots (writeable). Provides the underpinnings for cluster-wide logical volume management (CLVM), with the same on-disk metadata format. Integrated into the Anaconda installer to allow configuration at installation time. Replaces LVM1, which was provided in RHEL 3; easier to use and more configurable (/etc/lvm/lvm.conf). Clean separation of application and kernel runtime mapping
LVM2 Features Single tool binary designed to contain all tool functions. Column-based display tools with tailored output. LVM2 metadata: concurrent use of more than one on-disk format; human-readable text-based format; changes happen atomically; redundant copies of metadata; upwardly compatible with LVM1, enabling easy upgrades; transactional (journaled) changes. pvmove based on temporary mirrors (core dirty log)
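A quick sketch of the column-based reporting tools (the field names shown are standard lvs/pvs/vgs report fields):
  lvs -o lv_name,vg_name,lv_size,segtype              # choose exactly the columns to display
  pvs -o pv_name,vg_name,pv_size,pv_free --units g    # report sizes in gigabytes
  vgs --noheadings -o vg_name,vg_free                 # script-friendly output, no header row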
LVM2: DM Striped / Linear linear target: maps a range of the logical device directly onto an underlying device at an offset; table line <start> <length> linear <device> <offset>. striped target: <start> <length> striped <#stripes> <chunk size> followed by pairs of device name and starting sector. error target: causes any I/O to the mapped sectors to fail; useful for defining gaps in the new logical device (e.g. to fake a huge device)
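For illustration, a raw striped table line following the format above (device names, length and chunk size are examples; all values are in sectors):
  # <start> <length> striped <#stripes> <chunk size> <dev1> <offset1> <dev2> <offset2>
  echo "0 1024000 striped 2 128 /dev/sdb 0 /dev/sdc 0" | dmsetup create stripe0
  dmsetup table stripe0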
LVM2: DM Mirror (raid1) Maintains identical copies of data on devices. Divides the device being copied into regions typically 512KB in size. Maintains a (small) log with one bit per region indicating whether or not each region is in sync. Two logs are available core or disk. Parameters are: mirror <log type> <#log parameters> [<log parameters>] <#mirrors> <device> <offset> <device> <offset>... The disk log parameters are: <log device> <region size> [<sync>] <sync> can be sync or nosync to indicate whether or not an initial sync from the first device is required.
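A sketch of a raw mirror table line following the parameter layout above (devices and sizes are examples; the region size is given in sectors, so 1024 sectors = 512KB):
  # <start> <len> mirror <log type> <#log args> <log args...> <#mirrors> <dev> <offset> <dev> <offset>
  echo "0 2097152 mirror core 1 1024 2 /dev/sdb 0 /dev/sdc 0" | dmsetup create mirror0
  dmsetup status mirror0    # reports how many regions are in sync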
LVM2: DM Mirror (today) Single node mirror Clustered mirroring (4.5) HA LVM
LVM2: DM Mirror (future) Extend an active mirror Snapshots of mirrors Install/boot from mirror RAID 10 and RAID 01 >2-legged mirrors Clustered mirror (5.x) Read balancing, handling failing devices automatically Robustness and performance
LVM2: DM Snapshot An implementation of writable snapshots. Makes a snapshot of the state of a device at a particular instant. The first time each block is changed after that, it copies the data prior to the change, so that the state of the device at that instant can be reconstructed. Run fsck on a snapshot of a mounted filesystem to test its integrity and find out whether the real device needs fsck or not. Test applications against production data by taking a snapshot and running the tests against the snapshot, leaving the real data untouched. Take backups from a snapshot for consistency.
LVM2: DM Snapshot Copy on write Not a backup substitute Requires minimal storage (5% of origin) Preserves origin Allows experimentation Dropped if full Resizeable
LVM2: Snapshot Example Uses Backup on a live system fsck a snapshot of a live filesystem Test applications against real data Xen domUs Others?
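A minimal sketch of the backup/fsck uses listed above (VG, LV and mount point names are examples):
  lvcreate -s -L 1G -n lv_data_snap /dev/vg0/lv_data   # copy-on-write snapshot of the origin
  fsck.ext3 -n /dev/vg0/lv_data_snap                   # check integrity without touching the origin
  mount -o ro /dev/vg0/lv_data_snap /mnt/snap          # take a consistent backup from the snapshot
  lvremove -f vg0/lv_data_snap                         # drop the snapshot when finished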
LVM2: Snapshot Futures Merge changes to a writeable snapshot back into its read-only origin Robustness and performance Memory efficiency Clustered snapshots LiveCD + USB Snapshots of mirrors
Device Mapper Futures
LVM2: DM Raid4/5 (dm-raid45) Features Failed device replacement Selectable parity device with RAID 4 Selectable allocation algorithms (data and parity): left/right, symmetric/asymmetric Stripe cache (data and parity) XOR algorithm for parity Written by Heinz Mauelshagen http://people.redhat.com/heinzm/sw/dm/dm-raid45/
DM Block Caching Target: dm-cache Write-back or write-through local disk cache; intended use is remote block devices (iSCSI, ATAoE). Written by Ming Zhao; technical report on IBM's CyberDigest http://tinyurl.com/35qzcg
DM Block Caching Target: dm-hstore dm-hstore ("Hierarchical store"), similar to dm-cache; a building block for HSM systems, single-host caching and remote replication. Written by Heinz Mauelshagen. Features: caches reads and writes to an origin device; writes data back to the origin device; keeps the state of extents (e.g. uptodate, dirty, ...) on disk; background initialization (instantaneous creation); supports read-only origins
DM Misc Futures Add a netlink-based mechanism for communication with userspace (Mike Anderson, IBM) Reduce kernel stack usage for some targets/paths Automatically detect and handle changes to physical capacity A lot of other stuff Requests?
Transports / Low Level Drivers
iSCSI Initiator Low-cost enterprise SAN connectivity. RHEL 3 U4+: linux-iscsi (open source Cisco iSCSI initiator, plus Qlogic/Adaptec HBA drivers); RHEL 4 U2+: linux-iscsi (rewritten for the 2.6 kernel, based on the 2.4 driver); RHEL 5+: open-iscsi.org. Qualified with major storage vendors (NetApp, EMC, EqualLogic). (Diagram: a Red Hat Enterprise Linux host with a NIC or iSCSI adapter reaches an iSCSI storage controller, e.g. NetApp, over TCP/IP, or goes through an iSCSI-to-Fibre-Channel bridge and FC switch into a SAN alongside FC hosts.)
iSCSI Initiator: Today open-iscsi.org RFC 3720 compliant Flexible transport design: software iSCSI, hardware iSCSI, iSER Command-line management (iscsiadm), a building block for GUIs Basic iSNS support
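Typical iscsiadm usage with open-iscsi looks roughly like this (the portal address and target IQN are placeholders):
  iscsiadm -m discovery -t sendtargets -p 192.168.0.10                        # ask the array which targets it exports
  iscsiadm -m node -T iqn.1992-08.com.example:tgt1 -p 192.168.0.10 --login    # log in; LUNs show up as /dev/sd*
  iscsiadm -m session                                                         # list active sessions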
iSCSI Initiator: Future Fully integrated hardware iSCSI (full offload, partial offload) Improved / more flexible management tools Install to software iSCSI Performance improvements Improved iSNS BIOS / OF boot Requests?
Fibre Channel Device Drivers RHEL4/5: driver versions track upstream submissions very closely; the goal is to keep them as current as possible (e.g. 4Gb F/C, SATA 2). Greatly increased support, with over 4,000 SCSI devices/paths (was 256 in RHEL 3). Each update contains current drivers. Actively coordinate with Qlogic, Emulex and system vendors; integrate key bug fixes; aid partners in keeping their open source drivers current upstream. Driver update model (Jon Masters)
Management Tools
Management Tools: system-config-lvm Transport / protocol agnostic Simplifies resizing Future: iSCSI management plugin (currently in 4.5)
Management Tools: Conga Web-browser front end (luci); agent (ricci) serializes requests Single-node or cluster management Future: Storage server, clustered NFS, clustered Samba
Management Tools: Conga (screenshots)
Summary Out of the box support for multipathing, LVM, etc DM provides very flexible, extensible architecture New DM targets being developed Active communities around DM, LVM2, etc Good management tools (CLI, GUI)
More Information LVM2 http://sourceware.org/lvm2/ Device Mapper http://sourceware.org/dm Multipathing dm-devel@redhat.com http://christophe.varoqui.free.fr/wiki/wakka.php?wiki=home iSCSI http://www.open-iscsi.org Conga http://sourceware.org/cluster/conga/
More Information Red Hat Enterprise Linux LVM Administrator's Guide http://www.redhat.com/docs/manuals/enterprise/rhel-5-manual/cluster_logical_volum Presenter Dave Wysochanski (dave.wysochanski@redhat.com) This presentation http://people.redhat.com/dwysocha/talks
Backup Slides
Tagging LVM2 supports two sorts of tags. Tags can be attached to objects such as PVs, VGs, LVs and segments. Tags can be attached to hosts, for example in a cluster configuration. Tags are strings using [A-Za-z0-9_+.-] of up to 128 characters and they cannot start with a hyphen. On the command line they are normally prefixed by @. LVM1 objects cannot be tagged as the metadata does not support it.
Tagging Object Tags Use --addtag or --deltag with lvchange, vgchange, pvchange, lvcreate or vgcreate. Only objects in a Volume Group can be tagged; PVs lose their tags when removed from a VG, because tags are stored as part of the VG metadata. Snapshots cannot be tagged. Wherever a list of objects is accepted on the command line, a tag can be used, e.g. lvs @database lists all the LVs with the 'database' tag. Display tags with lvs -o +tags or pvs -o +tags etc.
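A short object-tag sketch (the VG and LV names are examples):
  lvchange --addtag @database vg1/lvol1    # attach the 'database' tag to an LV
  lvchange --addtag @database vg1/lvol2
  lvs @database                            # report only LVs carrying the tag
  lvs -o +tags                             # add a tags column to the normal report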
Tagging Host Tags You can define host tags in the configuration files. If you set tags { hosttags = 1 }, a host tag is automatically defined using the machine's hostname. This lets you use a common config file on all your machines. For each host tag, an extra config file is read if it exists: lvm_<hosttag>.conf; if that file defines new tags, further config files are appended to the list of files to read. tags { tag1 { } tag2 { host_list = [ "host1" ] } } This always defines tag1, and defines tag2 if the hostname is host1.
Tagging Controlling Activation You can specify in the config file that only certain LVs should be activated on that host, e.g. activation { volume_list = [ "vg1/lvol0", "@database" ] }. This acts as a filter for activation requests (like vgchange -ay) and on that host only activates vg1/lvol0 and any LVs or VGs carrying the 'database' tag in their metadata. There is a special match @* which matches only if some metadata tag matches a host tag on that machine.
Tagging Simple Example Every machine in the cluster has tags { hosttags = 1 }. You want to activate vg1/lvol2 only on host db2. Run lvchange --addtag @db2 vg1/lvol2 from any host in the cluster, then run lvchange -ay vg1/lvol2. This solution involves storing hostnames inside the VG metadata.
dmsetup A command-line wrapper for communication with the Device Mapper. Provides complete access to the ioctl commands via libdevmapper. Examples: dmsetup version; dmsetup create vol1 table1; dmsetup ls; dmsetup info vol1; dmsetup table vol1; dmsetup info -c
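Beyond the commands listed above, a typical table-replacement sequence looks roughly like this (the device name and table are examples):
  dmsetup suspend vol1                                        # quiesce I/O to the mapped device
  echo "0 409600 linear /dev/sdc 0" | dmsetup reload vol1     # load a replacement table into the inactive slot
  dmsetup resume vol1                                         # switch to the new table and resume I/O
  dmsetup remove vol1                                         # tear the mapping down when no longer needed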