Lustre SMB Gateway: Integrating Lustre with Windows
Hardware: Old vs New
Old compute: 60 x Dell PowerEdge 1950 - 8 x 2.6 GHz cores, 16 GB RAM, 500 GB SATA, 1 GbE - Windows 7 x64
Old storage: 1 x Dell R510 - 12 x 2 TB SATA, RAID5, 1 GbE - CentOS 5
New compute: 63 x Dell 7910 rack workstations - 24 x 2.5 GHz cores, 64 GB RAM, 4 x 900 GB 10k SAS, 256 GB SSD, 2 x 10 GbE - ESXi 5.5u2 hypervisor with a Windows 7 VM
New storage: 4 x Dell R630 - 8 x 2.4 GHz cores, 64 GB RAM, 4 x 600 GB 10k SAS, 2 x 10 GbE, 1 x dual-port Mellanox ConnectX-3 - 1 x MD3460 array, 42 x 600 GB 10k, 1 TB flash - Red Hat EL6
Network: Cisco Nexus 2232TM-E Fabric Extenders, Cisco Nexus 6001 switches, QLogic 12300 QDR InfiniBand switches
What is CTDB?
A clustered implementation of the Trivial Database (TDB) system
A high-availability service for a clustered file system
Why use CTDB?
Compute nodes had to be Windows 7 x64, but no native Windows Lustre client existed
Save costs on InfiniBand network hardware
The NFS client in Windows is average
Opportunity to leverage existing network infrastructure in the datacentre
If Windows could do something well, CIFS/SMB access would be one of them
Differences in CTDB vs SMB
CTDB:
  Many hosts serve a single file system
  The CTDB service can manage SMB, NMB and Winbind
  Host resiliency built in
  Recovery file lock shared between CTDB hosts
  Shared password database
SMB alone:
  Single SMB host per file system
  No failover
  Less potential bandwidth
How we implemented CTDB
2 x physical nodes
Simple tdbsam password database
Bonded 2 x 10 GbE per host
Single QDR link per host to Lustre
Local config files [smb.conf, public_addresses, nodes, etc.]
Shared config / working files [*.tdb, recovery_lock, etc.]
Round-robin DNS for the public IPs (see the sketch below)
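As an illustration of the round-robin DNS piece, here is a minimal sketch of BIND-style zone records (an assumption for illustration; the two service IPs are those in the public_addresses file later in this deck, and the record name mirrors the netbios name from smb.conf):

render-archive    IN  A   136.186.52.26
render-archive    IN  A   136.186.52.27

With both A records published under the same name, clients resolving render-archive are spread across the two CTDB service IPs.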
HPC Render Cluster
Lustre storage: 12 x object servers, 2 x metadata servers, QDR InfiniBand network
VMware compute cluster: 63 x Dell 7910 rack workstations, 10 GbE connectivity per host, 1:1 VM to host mapping
CTDB hosts: 4 x Dell R630 - 2 x 10 GbE in 802.3ad for data, 2 x 1 GbE in 802.3ad for heartbeat, 1 x Mellanox 2-port ConnectX-3 IB (ctdb-01 to ctdb-04, CTDB heartbeat and data networks)
Networking: in-rack Cisco Nexus 2232TM-E FEX, 4 x 10 Gbit uplink per FEX pod; Cisco Nexus 6001 switch, 6 x 10 Gbit to the core network
HPC Render Cluster
Lustre storage: 12 x object servers, 2 x metadata servers, QDR InfiniBand network
VMware compute cluster: 63 x Dell 7910 rack workstations, 10 GbE connectivity per host, 1:1 VM to host mapping
CTDB hosts: 2 x Dell R630 - 2 x 10 GbE in 802.3ad for data, 1 x 1 GbE heartbeat (crossover), 1 x Mellanox 2-port IB (ctdb-01, ctdb-02, CTDB heartbeat)
HA-SMB hosts: 2 x Dell R630 - 2 x 10 GbE in 802.3ad for data, 1 x 1 GbE heartbeat (crossover) (HA-smb-01, HA-smb-02, CCS heartbeat)
Networking: in-rack Cisco Nexus 2232TM-E FEX, 4 x 10 Gbit uplink per FEX pod; Cisco Nexus 6001 switch, 6 x 10 Gbit to the core network
Storage Services
Scratch volume: Red Hat EL6 hosts, 17 TB direct-attached storage (42 x 600 GB DDP, 1 TB flash), two-node cluster providing an SMB service with HA failover, clustered LVM with an EXT4 file system, bandwidth 20 Gbit
Archive volume: Red Hat EL6 hosts, Samba gateway to a 3.0 PB Lustre file system, Clustered TDB (CTDB) used to provide the SMB service, Lustre FS direct-mounted on each host, CTDB shares a Lustre directory, bandwidth up to 40 Gbit
How the storage is used
Storing 8k IMAX film renders
Min. 5 MB per image frame
50,000 frames
~275 MB/s stream rate
Around 250-300 GB of renders for 35 minutes of footage
Streaming from the Lustre archive back into final film cuts
Sample 8k render (8192 x 6144, RGB)
How does CTDB perform in practice?
Approximately 30 seconds for a complete takeover
Tools and documentation are helpful (see the status check below)
Measured throughput is good: ~600 MB/s for an archive sync from the scratch disk over Ethernet
Normal system load on the CTDB hosts with many connections
The basic config is reliable
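Day-to-day cluster state can be checked with the standard ctdb tools; a minimal sketch (output abbreviated):

[root@rm01 ~]# ctdb status    # per-node state: OK, UNHEALTHY, DISCONNECTED, ...
[root@rm01 ~]# ctdb ip        # which node currently hosts each public address

During a takeover the public addresses reported by ctdb ip move to the surviving node, which is the event behind the ~30 second figure above.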
What did not work well?
Load balancing: TCP connections are killed as part of failover; round-robin DNS had limited effect in practice
Some notify.sh script handling needed for monitoring
Limited set of default handlers for alerting on cluster state
Integrating with existing SMB services: internal CTDB service scripts modify Ethernet adapter state; splitting out NMB & SMB pids, configs and init scripts was problematic
Ideas to improve our CTDB Lustre service
Load balancing: a hardware load balancer for the initial connection (but not the transfers), or write a software service to manage connections; possibly the Red Hat EL6 Load Balancer Add-On?
Write more comprehensive scripts to alert on cluster state (a sketch follows below)
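One way to approach the alerting scripts is the CTDB_NOTIFY_SCRIPT hook shown in the config later in this deck; a minimal sketch, assuming CTDB invokes the script with the event name as its first argument (the mail address is hypothetical):

#!/bin/sh
# /etc/ctdb/notify.sh - ctdbd calls this with an event name such as "healthy" or "unhealthy"
event="$1"
case "$event" in
    unhealthy)
        logger -t ctdb-notify "CTDB node $(hostname) became unhealthy"
        echo "CTDB node $(hostname) became unhealthy" | mail -s "CTDB alert" hpc-admin@example.com
        ;;
    healthy)
        logger -t ctdb-notify "CTDB node $(hostname) returned to healthy"
        ;;
esac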
Test results: Lustre disk
2 x CTDB hosts, 12 x clients, 4 GB data size
Benchmarks: sequential read, sequential write, random 512k read, random 512k write
[charts]
Test results: Render Node FEXs [charts]
Test results: CTDB Node FEXs [charts]
Conclusions
Reasonable performance
A reliable and manageable storage service
Scalable (as a gateway service)
Provides greater accessibility to Lustre for non-intensive data applications
Basic CTDB Config
Mounted Lustre with -o flock:
[root@rm01 ~]# mount | grep lustre
192.168.55.129@o2ib:192.168.55.130@o2ib:/lustre on /lustre type lustre (rw,flock)
[root@rm01 ~]#
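For reference, the equivalent persistent mount would be an /etc/fstab entry along these lines (a sketch, not taken from the original hosts; _netdev is the usual extra option so the mount waits for the network):

192.168.55.129@o2ib:192.168.55.130@o2ib:/lustre  /lustre  lustre  flock,_netdev  0 0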
Basic CTDB Config
Created data and private directories for CTDB operations:
[root@rm01 ~]# ls -la /lustre/ctdb
drwx------ 3 root root 4096 Jan 15 2015 .ctdb
drwxr-xr-x 3 root root 4096 Feb  3 2015 data
[root@rm01 ~]#
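The layout above could be created with something like the following (a sketch; the exact commands were not part of the original):

[root@rm01 ~]# mkdir -p /lustre/ctdb/.ctdb /lustre/ctdb/data
[root@rm01 ~]# chmod 700 /lustre/ctdb/.ctdb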
Basic CTDB Config
Configured /etc/sysconfig/ctdb:
[root@rm01 ~]# egrep -v '#' /etc/sysconfig/ctdb
CTDB_RECOVERY_LOCK=/lustre/ctdb/.ctdb/recovery_lock
CTDB_NODES=/etc/ctdb/nodes
CTDB_PUBLIC_ADDRESSES=/etc/ctdb/public_addresses
CTDB_MANAGES_SAMBA=yes
ulimit -n 10000
CTDB_LOGFILE=/var/log/log.ctdb
CTDB_DEBUGLEVEL=NOTICE
CTDB_NOTIFY_SCRIPT=/etc/ctdb/notify.sh
[root@rm01 ~]#
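With CTDB_MANAGES_SAMBA=yes, smbd/nmbd are started and stopped by CTDB itself, so on EL6 the usual pattern is to disable the standalone Samba init scripts and enable only ctdb (a sketch of the typical commands, not shown in the original):

[root@rm01 ~]# chkconfig smb off
[root@rm01 ~]# chkconfig nmb off
[root@rm01 ~]# chkconfig ctdb on
[root@rm01 ~]# service ctdb start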
Basic CTDB Config
Configured the nodes file (heartbeat network):
[root@rm01 ~]# cat /etc/ctdb/nodes
192.168.100.10
192.168.100.20
[root@rm01 ~]#
Basic CTDB Config
Configured the public_addresses file (service IPs):
[root@rm01 ~]# cat /etc/ctdb/public_addresses
136.186.52.26/24 intelbond0
136.186.52.27/24 intelbond0
[root@rm01 ~]#
Basic CTDB Config
Set up smb.conf:
[root@rm01 ~]# egrep -v '#' /etc/samba/smb.conf
[global]
    workgroup = render
    server string = Samba Server Version %v
    netbios name = render-archive
    interfaces = intelbond0
    hosts allow = 127. 136.186.226. 136.186.52. 136.186.12. 136.186.53.
    clustering = yes
    ctdbd socket = /var/run/ctdb/ctdbd.socket
    cluster addresses = 136.186.52.26 136.186.52.27
    idmap backend = tdb2
    bind interfaces only = no
    pid directory = /var/run/samba/ctdb
    private dir = /lustre/ctdb/.ctdb/privdir
    fileid:mapping = fsname
    use mmap = no
    nt acl support = yes
    ea support = yes
    log file = /var/log/samba/ctdb/log.%m
    max log size = 50
    security = user
    passdb backend = tdbsam
    load printers = no
    printing = bsd
    printcap name = /dev/null

[archive]
    comment = Lustre Archive
    browseable = no
    writable = yes
    path = /lustre/ctdb/data/archive
[root@rm01 ~]#
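As a quick functional check, the share can then be exercised from Linux and from a Windows 7 render node (a sketch; the renderuser account is hypothetical and would first need adding to the tdbsam database with smbpasswd -a renderuser):

[root@rm01 ~]# smbclient //render-archive/archive -U renderuser
C:\> net use Z: \\render-archive\archive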