LISTSERV in a High-Availability Environment
DRAFT -- Revised 2010-01-11

Introduction

For many L-Soft customers, LISTSERV is a critical network application. Such customers often have policies dictating uptime and availability requirements. This document addresses how to use LISTSERV in high-availability environments and how to minimize application downtime in case of a hardware or software failure. It is not intended as a comprehensive document on achieving high network availability, but is focused on the LISTSERV application itself. It is assumed that you will incorporate these suggestions into a broader plan of network failover, backup, and recovery.

Using LISTSERV in a High-Availability Environment

LISTSERV has several features that lend themselves to rapid failure recovery and high availability. Discussed below are LISTSERV's SMTP forwarding features and how to use filesystem mirroring for failover.

Outbound SMTP Forwarding

LISTSERV's SMTP forwarding features allow for a high degree of load balancing and failover for LISTSERV's outbound e-mail delivery. These features are controlled by way of the SMTP_FORWARD and SMTP_FORWARD_n site configuration keywords. At their simplest, the SMTP forward settings might look as follows (the examples use UNIX format; see the documentation for the syntax on other platforms):

    SMTP_FORWARD_1="mail1.example.net"

This example tells LISTSERV to use MAIL1.EXAMPLE.NET for its outbound e-mail delivery. If spawning a sub-process for SMTP delivery fails for some reason, LISTSERV will mail to MAIL1.EXAMPLE.NET using the main LISTSERV process. Let's look at some ways we can scale this up for greater redundancy:
    SMTP_FORWARD_1="5*mail1.example.net"

In this example, all outbound e-mail still goes through MAIL1.EXAMPLE.NET, but instead of a single process handling all mail, we have five processes delivering mail in parallel. This helps to empty LISTSERV's spool faster, but doesn't help if MAIL1.EXAMPLE.NET is unavailable. For that, we need failover:

    SMTP_FORWARD_1="5*mail1.example.net;mail2.example.net"

In this example, we still deliver all mail through MAIL1.EXAMPLE.NET as long as it's available. But if MAIL1.EXAMPLE.NET becomes unavailable, LISTSERV will automatically switch to delivering through MAIL2.EXAMPLE.NET. If MAIL1.EXAMPLE.NET comes back online, LISTSERV will automatically switch back to it. This gives us good failover, but what if we want to load balance between the two servers instead? Then we do this:

    SMTP_FORWARD_1="5*mail1.example.net"
    SMTP_FORWARD_2="5*mail2.example.net"

In this example, we get five connections to MAIL1 and five connections to MAIL2, and LISTSERV splits the load between them. Note that this doesn't necessarily mean that both servers will get the same number of messages. If MAIL1 is able to receive mail more quickly than MAIL2, it will end up with a greater portion of the total mail traffic. Also, at low volume -- when all recipients fit in a single queue file -- the entire queue file will go to one server or the other. If one server becomes unavailable, outbound mail will continue to go out through the other. We can combine the load balancing and failover settings:

    SMTP_FORWARD_1="5*mail1.example.net;mail3.example.net"
    SMTP_FORWARD_2="5*mail2.example.net;mail4.example.net"

In the example above, the load is balanced between MAIL1 and MAIL2, and if either fails, its traffic will go out through MAIL3 or MAIL4, respectively. Any number of SMTP_FORWARD_n lines may be defined to share load balancing and failover among any number of outbound mail servers.
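However the keywords are arranged, every server listed in SMTP_FORWARD_n must accept and relay mail handed to it by the LISTSERV host. As a simple sanity check (a sketch only; the hostnames are the hypothetical ones from the examples above, and it assumes a netcat ('nc') client is installed on the LISTSERV server), you can at least confirm that each relay is reachable on the SMTP port:

    #!/bin/sh
    # Basic reachability check for the outbound relays used in the examples
    # above (hypothetical hostnames). Success only shows that the SMTP port
    # is open; it does not prove the server will relay for LISTSERV --
    # check your mail server's relay/access rules for that.
    for host in mail1.example.net mail2.example.net mail3.example.net mail4.example.net
    do
        if nc -z -w 5 "$host" 25
        then
            echo "$host: SMTP port 25 reachable"
        else
            echo "$host: SMTP port 25 NOT reachable"
        fi
    done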
It is also possible to schedule SMTP_FORWARD rotation. For example, if our corporate mail goes out through CORP.EXAMPLE.NET and we only want to use it to help with LISTSERV mail between 8:00pm and 6:00am, we can do:

    SMTP_FORWARD_1="5*mail1.example.net;mail3.example.net"
    SMTP_FORWARD_2="5*mail2.example.net;mail4.example.net"
    SMTP_FORWARD_3="5*corp.example.net(20:00-06:00)"

In this example, LISTSERV ignores CORP.EXAMPLE.NET except between the hours of 8:00pm and 6:00am, during which it will employ five additional SMTP workers to deliver mail through that server.

Failover for LISTSERV

The examples above all address how to build redundancy and failover into LISTSERV's outbound SMTP delivery. But what if the LISTSERV server itself should become unavailable and we need to fail over to a backup server? To address that, we need two pieces: a copy of the LISTSERV filesystem, and a way to route mail (and possibly web requests) to a backup server.

Mirroring the LISTSERV Filesystem

LISTSERV cannot do its own filesystem mirroring, so we need to rely on an external application to do that. There are many ways we could accomplish mirroring. A relatively simple method would be to use a Network File System (NFS) mount or a Network Attached Storage (NAS) system to store the necessary LISTSERV files. If the primary LISTSERV server becomes unavailable, then we simply mount the NFS or NAS partition(s) on a backup server. If NFS or NAS is not available, or not recommended for performance reasons, then we could do remote synchronization instead. This could be something as simple as running a scheduled 'rsync' command on a UNIX server, or something more complex like the many commercial products for mirroring a Windows filesystem.
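As an illustration, a scheduled synchronization on UNIX might look something like the following sketch, run from cron on the primary server. The paths, the backup hostname, and the use of SSH are assumptions to be adapted to your installation; the transient spool and tmp directories are excluded, as discussed below.

    #!/bin/sh
    # Hypothetical nightly mirror of the LISTSERV home directory to a
    # backup server; the hostname and paths are examples only.
    # The transient spool and tmp directories are excluded.
    rsync -az --delete \
        --exclude 'spool/' \
        --exclude 'tmp/' \
        /home/listserv/ backup.example.net:/home/listserv/

How often such a job runs is a trade-off between how much recent subscription activity you are prepared to lose in a failover and the overhead of the synchronization itself.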
On Windows, the LISTSERV filesystem usually looks as follows:

    \LISTSERV\LISTS
    \LISTSERV\LOG
    \LISTSERV\MAIN
    \LISTSERV\OUT
    \LISTSERV\SPOOL
    \LISTSERV\TMP
    \LISTSERV\WWW

There is generally no need to mirror the OUT, SPOOL, or TMP directories, as they contain transient queue and temp files. It is generally enough to mirror the MAIN, LISTS, LOG, and WWW directories to a failover server. (The WWW directory is the recommended location for LISTSERV web archive files. If you store these files in some other location, such as under the IIS INETPUB directory, then you would need to mirror that location instead.)

On UNIX, you should mirror the entire ~listserv directory and its subdirectories, minus the spool and tmp directories. If using LISTSERV with NFS, it is recommended that you put the spool on a local disk for performance reasons. The spool contains transient queue files that are usually not worth the performance overhead of mirroring.

The amount of data to be mirrored can be reduced if list memberships are stored in a DBMS instead of LISTSERV's internal list format. In that case, both the primary and backup systems would need to be configured for DBMS access. Instead of updating the *.list files with subscription changes, the DBMS would be updated in real time, reducing the amount of file synchronization needed between the primary and backup systems. It is also possible to store list notebook archives on a networked filesystem (such as an NFS or NAS mount) instead of the local disk. LISTSERV's application files are generally quite small; list notebook archives are usually much larger, depending on the amount of archived content.

For either platform, mirroring the actual HTML and index files in the web archive directory is not important; LISTSERV will rebuild those files on restart as long as the directory structure itself exists. You may mirror the log directory or not at your discretion, balancing the value of the log files against the overhead of mirroring them. Unlike (for example) most database management systems, LISTSERV does not usually have open file issues, so it is safe to copy a running installation to the backup server.

It is important to keep drive letters and directory paths the same between the main server and the backup server. The site configuration and list configurations both use directory paths for various settings (e.g., the notebook archive path), so those paths must not change on the backup server. Additionally, the web server and any SMTP services should be configured identically on the primary and backup servers. If there is a local firewall on the backup server, it needs to be configured identically to the primary server's firewall as well.

You will need to configure the backup server to start LISTSERV on demand. For Windows, this means running 'X:\LISTSERV\MAIN\SMTP DEFINE' and 'X:\LISTSERV\MAIN\LSV DEFINE' to create the Windows services, and then editing the service definitions for manual (instead of automatic) startup. For UNIX, this means copying your existing LISTSERV init script to the backup server and disabling startup during the default runlevel. You do not want the backup server to start LISTSERV automatically whenever it is rebooted; it should only start when the primary server fails.
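As a sketch only, on a Linux server using System V-style init scripts this might look like the commands below. The script name 'listserv', the hostname, and the choice of chkconfig versus update-rc.d are assumptions; substitute whatever your installation and distribution actually use.

    # On the backup server: install the init script, but keep it out of
    # the default runlevels so that LISTSERV does not start at boot.
    # (Script name, hostname, and tools are examples only.)
    scp primary.example.net:/etc/init.d/listserv /etc/init.d/listserv
    chkconfig --add listserv           # Red Hat / CentOS style
    chkconfig listserv off
    # update-rc.d -f listserv remove   # Debian / Ubuntu equivalent

    # During an actual failover, start the service by hand:
    # /etc/init.d/listserv start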
Failing Over the LISTSERV Service

Once a copy of the LISTSERV filesystem is available on the backup server and the startup scripts are in place, the failover procedure itself needs to be created. The details of how to do that are OS-specific, but in general, you'll need to:

1. Monitor the primary server. Typically this means running a process, either on the backup machine or on some third server, that keeps track of the primary server by way of periodic pings through ICMP or TCP. Alternately, the primary server can send out a periodic 'heartbeat' to the backup server. After some number of failed pings or missed heartbeats, the primary server will be considered offline.

2. Transfer the IP address. If the IP address of the primary server is no longer reachable, it can be transferred to the backup server. The means for doing so will vary by operating system.

3. Start services on the backup server. Once the IP address is in place, start the web server, the LISTSERV service, and the SMTP listener (Windows) or mail server (UNIX) on the backup machine. Mail and web requests should now be routed to the backup server.

A minimal sketch of such a procedure appears at the end of this document.

Once the backup server takes over, you need to be sure that the primary server will not try to reclaim the IP address when it is restarted, or you'll have an IP conflict. This can be most easily accomplished by simply pulling the network cable from the primary server before powering it back on. For sites using NAT, the conflict can be avoided more easily by leaving the internal IP addresses alone and skipping step 2 above; instead, change the NAT rule to route the external IP address to the internal IP address of the backup server rather than the primary. Then the primary server can be brought back online with no IP conflict on the internal network.
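To make steps 1 through 3 concrete, the following is a minimal sketch of a monitor that could run on a UNIX backup server. The addresses, interface name, and init script names are placeholders, and the ping and ip commands shown are Linux-flavored. A production deployment would normally use a dedicated clustering or heartbeat package rather than a hand-rolled script, but the sequence of events is the same.

    #!/bin/sh
    # Minimal failover monitor sketch. All addresses, the interface name,
    # and the init scripts below are examples only; adapt to your site.
    PRIMARY=192.0.2.10        # primary LISTSERV server (example address)
    SERVICE_IP=192.0.2.20/24  # shared service IP to take over (example)
    IFACE=eth0                # interface that should carry the service IP
    FAILED=0

    while true
    do
        # Step 1: monitor the primary with periodic pings.
        if ping -c 1 -W 5 "$PRIMARY" >/dev/null 2>&1
        then
            FAILED=0
        else
            FAILED=$((FAILED + 1))
        fi

        if [ "$FAILED" -ge 3 ]
        then
            # Step 2: take over the service IP (NAT'ed sites would instead
            # repoint the NAT rule on the firewall at this point).
            ip addr add "$SERVICE_IP" dev "$IFACE"

            # Step 3: start the services on the backup machine.
            /etc/init.d/listserv start
            /etc/init.d/httpd start
            /etc/init.d/sendmail start
            break
        fi
        sleep 30
    done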