A High Performance System Prototype for Large-scale SMTP Services

Sethalat Rodhetbhai g4165238@ku.ac.th
Yuen Poovarawan yuen@ku.ac.th
Department of Computer Engineering, Kasetsart University, Bangkok, Thailand

Abstract

To serve the fast-growing number of Internet users, electronic mail systems now carry a heavier workload: both message sizes and message volumes keep increasing. We therefore propose a prototype of a scalable SMTP (Simple Mail Transfer Protocol) service system built on a distributed system architecture to provide better performance for all users. The prototype consists of a group of dispatcher nodes that accept all SMTP requests from clients via RR-DNS (Round Robin Domain Name Service) load distribution. A dispatcher node, acting as the front node of a cluster of mail servers, dynamically redirects each connection, using a packet-rewriting approach, to an appropriate mail server or to the dispatcher node of a less loaded cluster (selected by a scheduling algorithm driven by load information). The selected mail server processes the request and replies directly to the client. The performance evaluation of the first-phase implementation shows that the prototype operates correctly and improves the round-trip time between the system and its clients.

Keywords: load balance, cluster, SMTP, mail server, packet-rewriting, scheduling

1. Introduction

Electronic mail remains one of the most important communication services on the Internet. For transferring electronic mail messages over TCP/IP networks such as the Internet, the de facto standard client/server protocol is the Simple Mail Transfer Protocol (SMTP), described in RFC 821 [1]. Mail systems use it at the application layer of the Open System Interconnection (OSI) stack [2], as shown in Figure 1.
Its operation is quite simple: after a reliable connection is established, the client initiates a brief handshake sequence and then sends one or more messages to the server. The common SMTP (Internet mail system) model for electronic mail transfer is illustrated in Figure 2. A sending user creates an electronic mail message with a Mail User Agent (MUA), which submits that message to a Mail Transfer Agent (MTA). The MTA determines how the message must be routed to reach the receiving user. It passes (or relays) the message along to another MTA on another machine, which passes it to yet another machine, and so forth, until it reaches the destination MTA closest to the ultimate receiver. The final

MTA passes the message to an appropriate Mail Delivery Agent, which saves it into a local message store. The receiving user may then invoke a MUA for subsequent access, such as reading, saving, or replying to the message [1].

Figure 1: Simple Mail Transfer Protocol on the Open System Interconnection Model

However, the content of most electronic mail messages today contains not only plain text but also large multimedia data such as images, audio, and video. To serve the fast-growing number of Internet users, electronic mail systems now carry a heavier workload of larger and more numerous messages. For example, over 100 million users around the world have active Hotmail accounts [3].

Figure 2: The Common SMTP Model for Electronic Mail Transfer

Because of this enormous request volume, which is a serious problem for SMTP systems, a mail system must be efficient enough to serve all of these requests well. We pay particular attention to this issue. Therefore, our study focuses on the design and implementation of a scalable SMTP service system prototype in a distributed system, called D-EMS (Distributed Electronic Mail System), that provides better performance between clients and MTAs.
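The client-side half of this exchange can be sketched as a small Python function that builds the SMTP (RFC 821) command sequence a MUA or relaying MTA would send for one message. This is a conceptual sketch only; the hostnames and addresses are illustrative assumptions, not taken from the paper.

```python
def smtp_commands(helo_host, sender, recipients, body):
    """Build the client-side SMTP (RFC 821) command sequence for one message."""
    cmds = ["HELO %s" % helo_host,          # brief handshake after connecting
            "MAIL FROM:<%s>" % sender]      # envelope sender
    for rcpt in recipients:                 # one RCPT per envelope recipient
        cmds.append("RCPT TO:<%s>" % rcpt)
    cmds.append("DATA")                     # message content follows
    cmds.append(body + "\r\n.")             # content terminated by a lone dot
    cmds.append("QUIT")                     # close the session
    return cmds

# Example session for a single-recipient message (illustrative addresses):
session = smtp_commands("client.example.org", "alice@example.org",
                        ["bob@example.net"], "Hello Bob")
```

Each command in a real session is acknowledged by a numeric server reply before the next one is sent; the sketch shows only the client side of the dialogue.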

In the remainder of this paper, we review background and related work in Section 2. The objectives of the system design are presented in Section 3. The system (D-EMS) architecture is introduced in Section 4. The test bed of the system implementation is described in Section 5, followed by experiments and results in Section 6. Finally, conclusions are presented in Section 7.

2. Background and Related Work

Generally, to alleviate heavy workload on a single server, most current schemes distribute requests among multiple servers in the system. Designing such a system involves deciding how the best server is selected for a request, so that the client receives its response in minimum time, and how the request is directed to that server. We can categorize request distribution mechanisms by the entity that routes the request: client-based, DNS-based, dispatcher-based, and server-based.

Client-based approach: a client-side entity is responsible for selecting the server, so no server-side processing is required for server selection. The selection among multiple servers is made by client software, a client-side DNS, or proxy servers.

DNS-based approach: the server-side authoritative DNS maps the domain name to the IP address of one of the servers, based on various scheduling policies. Selection of the target server occurs at the server-side DNS, so this approach does not suffer from the applicability problems of client-side mechanisms. However, DNS has limited control over the requests reaching each server, because IP address mappings are cached at several levels: by client software, the local DNS resolver, and intermediate name servers. Along with the mapping, a validity period for the server IP address mapping, known as the Time-To-Live (TTL), is also supplied. After the TTL expires, the mapping request is again forwarded to the authoritative DNS.
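The interplay between round-robin selection and TTL caching can be sketched with a minimal model. This is an illustration of the mechanism only, not actual name-server behavior; the addresses and TTL value are assumptions.

```python
import itertools

class AuthoritativeDNS:
    """Server-side DNS: round-robins over server addresses on each fresh query."""
    def __init__(self, addrs):
        self._cycle = itertools.cycle(addrs)
    def resolve(self):
        return next(self._cycle)

class CachingResolver:
    """Client-side resolver: reuses a cached answer until its TTL expires."""
    def __init__(self, upstream, ttl):
        self.upstream, self.ttl = upstream, ttl
        self.cached, self.expires = None, 0
    def resolve(self, now):
        if self.cached is None or now >= self.expires:
            self.cached = self.upstream.resolve()   # forwarded to authoritative DNS
            self.expires = now + self.ttl
        return self.cached

dns = AuthoritativeDNS(["10.0.0.1", "10.0.0.2"])
resolver = CachingResolver(dns, ttl=60)
# Queries within the TTL all reach the same server; round-robin only
# advances once the cached mapping expires.
a = resolver.resolve(now=0)    # fresh lookup
b = resolver.resolve(now=30)   # served from cache, same address as `a`
c = resolver.resolve(now=61)   # TTL expired, next address in the rotation
```

This also illustrates why a tiny TTL is the only lever a pure DNS-based scheme has over load skew, which motivates the dispatcher-based designs discussed next.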
Setting the TTL very small or to zero does not work, because of non-cooperative intermediate name servers and client-level caching; it also increases network traffic, and the DNS itself can become a bottleneck.

Dispatcher-based approach: this approach gives the server-side entity full control over client requests. The DNS returns the address of a dispatcher that routes all client requests to other servers in the cluster; the dispatcher thus acts as a centralized scheduler on the server side that controls all client request distribution. It presents a single IP address to the outside world and is therefore much more transparent. These mechanisms can be categorized as follows: 1) packet single-rewriting by the dispatcher; 2) packet double-rewriting by the dispatcher; 3) packet forwarding by the dispatcher [4]; 4) ONE-IP address [5].

Server-based approach: this approach allows two-level dispatching, first by the cluster DNS, after which each server may reassign a received request to one of the other servers in the cluster. This solves the problems of non-uniform load distribution of client requests and of limited DNS control. One technique of this approach is packet forwarding by the server: first-level scheduling is done by the round-robin DNS mechanism, and second-level dispatching is done by a packet-rewriting mechanism that is transparent to users. The first request thus reaches an arbitrary node in the cluster. If that node determines that another node is better

for serving this request, the node uses the selected node's MAC address to reroute the packet to it.

There is prior work on improving electronic mail service systems, notably the cluster-based mail systems NinjaMail from the University of California at Berkeley and Porcupine from the University of Washington. NinjaMail [6] is a distributed, clustered e-mail system built on top of UC Berkeley's Ninja cluster and OceanStore wide-area data storage architectures. The single-cluster NinjaMail architecture provides a scalable and fault-tolerant service without sacrificing user features or limiting the available mail access modes. Porcupine [7] is a scalable e-mail system based on store-and-forward functionality. Porcupine uses a single-cluster model in which all nodes are functionally homogeneous, so any node can execute part or all of any transaction, e.g., the delivery or retrieval of mail. Based on this principle, Porcupine uses three techniques to meet its scalability goals. First, every transaction is dynamically scheduled to ensure that work is uniformly distributed across all nodes in the cluster. Second, the system automatically reconfigures whenever nodes are added or removed, even transiently. Third, system and user data are automatically replicated across a number of nodes to ensure availability.

3. System Design Objectives

Our target prototype is an electronic mail service system in a distributed system environment. The system consists of several clusters, each with a large number of SMTP server nodes and a mechanism to distribute incoming client requests among those nodes. The primary objectives of the system are listed below. The system should be compatible with the standard SMTP protocol and network elements, i.e., it can be deployed within the current infrastructure and protocol suite, with no changes needed to client-side components or to the SMTP server daemon. The system should be scalable, i.e.,
we can easily increase or decrease the number of SMTP server nodes in a cluster as needed. The system should give better performance in terms of client-perceived latency, i.e., the time between the client's submission request and the transmission of the mail content to the server should be minimized. The system should be transparent to the user, i.e., each request should be diverted to an appropriate SMTP server node automatically. The system should be fault tolerant, i.e., it should continue working even if some servers are down or off-line. The system should avoid overloading any server node, i.e., requests beyond a node's capacity should not reach it, because the node may crash. Finally, the system should not impose significant additional overhead of its own, in terms of either computation or generated network traffic.

4. System Architecture

We have designed an SMTP service system called D-EMS (Distributed Electronic Mail System). D-EMS consists of a group of dispatcher nodes that accept all SMTP requests from clients; these clients can be both MUAs and outside MTAs. Each dispatcher node, acting as the front node of a cluster of MTA service nodes, diverts requests to an appropriate MTA in its cluster or to another, more appropriate dispatcher node. The overview of the D-EMS architecture is shown in Figure 3.

Figure 3: Overview of the Distributed Electronic Mail System (D-EMS) Model (M dispatcher nodes, each fronting a cluster of MTA service nodes, all reachable over a TCP/IP network)

In the transition from a normal MTA to D-EMS, client-side machines need no modification of their MUA or MTA to send a mail message into D-EMS; the transactions between clients and D-EMS use the usual standard SMTP protocol. The operation of D-EMS can be described as follows. Before a client initiates an SMTP transaction, it first asks its local DNS resolver for the IP address of the target SMTP server (a dispatcher node in our D-EMS system). If the local or intermediate resolvers do not have this mapping, or the TTL of the cached information has expired, the request reaches the server-side authoritative DNS server, which, in D-EMS, replies with the IP address of one of the dispatcher nodes. The authoritative DNS server uses the round-robin scheduling algorithm to balance the selection among these dispatcher nodes. Next, the client sends its request to a dispatcher node using the IP address obtained in the previous step. The dispatcher

node decides which MTA service node in the cluster should serve the request and redirects it to that node. The selected service node then replies on behalf of the dispatcher node directly to the client. The system uses a single-way packet-rewriting technique for this redirection, as shown in Figure 4. All consecutive IP packets that arrive at the dispatcher node from the client have their IP headers modified by the dispatcher: the destination address field is replaced by the IP address of the selected MTA service node, and the modified packet is rerouted to that MTA server. The MTA daemon on the service node receives the packet, reacts, and sends a reply packet back to the client. Before the reply packet leaves the service node, however, its source address field is rewritten from the service node's IP address to the dispatcher's IP address. The packet is then sent directly to the client, which believes the reply came from the dispatcher, not from the service node.
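The two rewriting steps can be sketched with a toy packet model. This is a conceptual illustration only; the real system rewrites live IP headers on the kernel path, and the addresses below are assumptions.

```python
from dataclasses import dataclass

@dataclass
class IPPacket:
    src: str
    dst: str

# Illustrative addresses for one client, one dispatcher, one MTA node.
CLIENT, DISPATCHER, MTA = "192.0.2.10", "203.0.113.1", "203.0.113.11"

def dispatcher_rewrite(pkt, mta_ip):
    """Inbound at the dispatcher: retarget the packet at the chosen MTA node."""
    return IPPacket(src=pkt.src, dst=mta_ip)

def service_node_rewrite(reply, dispatcher_ip):
    """Outbound at the MTA node: masquerade the reply as coming from the dispatcher."""
    return IPPacket(src=dispatcher_ip, dst=reply.dst)

inbound = dispatcher_rewrite(IPPacket(src=CLIENT, dst=DISPATCHER), MTA)
reply   = service_node_rewrite(IPPacket(src=MTA, dst=CLIENT), DISPATCHER)
# The client only ever sees the dispatcher's address on replies,
# so the redirection stays transparent.
```

Because only one leg of the round trip passes through the dispatcher, the dispatcher handles the (small) inbound request packets while the (large) reply traffic bypasses it.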
Figure 4: Single-Way Packet Rewriting Mechanism (the dispatcher rewrites the packet's destination to the MTA service node's IP; the service node rewrites the reply's source back to the dispatcher's IP and replies to the client directly)

The system uses several metrics to estimate the workload on a service node: the number of packets redirected to each node; the number of active SMTP connections on each node; the number of mail messages in each node's mail queue; each node's CPU utilization, free RAM, buffer RAM, number of disk accesses, free swap, and number of processes; the number of requests served by each node; the number of bytes transferred by each node; the number of open TCP connections on each node at any given moment; and the number of active sockets on each node. Within each cluster, every service node periodically sends its load information to the dispatcher node. Similarly, every dispatcher node aggregates the load information of its service nodes and periodically distributes its cluster's information to the other dispatcher nodes for inter-cluster redirection decisions.
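One way a dispatcher could combine such metrics into a scheduling decision is a weighted load score, picking the node with the lowest score. The weights, metric names, and sample values below are illustrative assumptions; the paper does not specify its exact scheduling formula.

```python
# Hypothetical per-node metric samples, as periodically reported to the dispatcher.
nodes = {
    "mta-1:1": {"smtp_conns": 40, "queue_len": 120, "cpu_util": 0.80},
    "mta-1:2": {"smtp_conns": 10, "queue_len":  15, "cpu_util": 0.30},
    "mta-1:3": {"smtp_conns": 25, "queue_len":  60, "cpu_util": 0.55},
}

# Assumed relative weights; cpu_util is scaled up since it is a 0..1 fraction.
WEIGHTS = {"smtp_conns": 1.0, "queue_len": 0.5, "cpu_util": 100.0}

def load_score(metrics):
    """Weighted sum of the reported load metrics; lower means less loaded."""
    return sum(WEIGHTS[k] * v for k, v in metrics.items())

def pick_node(nodes):
    """Choose the least-loaded MTA service node for the next request."""
    return min(nodes, key=lambda name: load_score(nodes[name]))

target = pick_node(nodes)
```

The same scoring could be applied one level up, comparing cluster-wide aggregates when a dispatcher decides whether to hand a request to another cluster's dispatcher.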

5. Test Bed of the System Implementation

The D-EMS test bed is still in an early stage of implementation, though most of the main features are already supported. At present, the system consists of two clusters, each composed of one dispatcher node and three MTA service nodes. All machines in the system are HP Vectra PCs: Pentium III 933/133 MHz with 256 MB RAM, a 40 GB hard disk, and a built-in 3Com 10/100 Mbps Ethernet adapter. An Intel Express 510T Fast Ethernet switch connects these machines. Dispatcher nodes and MTA service nodes run the Red Hat Linux 7.2 operating system with kernel 2.4.18. All MTA service nodes run the same SMTP daemon, Sendmail version 8.11.6, with identical configurations.

We implemented packet redirection on the dispatcher and service nodes using the netfilter and iptables components [10]. Both are part of the framework inside the Linux 2.4.x kernel that enables packet filtering, network address translation (NAT), and other packet mangling. netfilter is a set of hooks inside the Linux 2.4.x kernel's network stack that allows kernel modules to register callback functions invoked every time a network packet traverses one of those hooks. iptables is a generic table structure for defining rulesets; each rule in an IP table consists of a number of classifiers (matches) and one connected action (target). We add iptables rules on each dispatcher node to hook IP packets destined for the standard SMTP service port (25) out of the network stack so that they can be mangled in userspace, where the queued packets can be managed freely by a user-defined program. We developed a daemon that performs the packet rewriting for the redirection phase. These daemons on the dispatcher and MTA service nodes also communicate with one another to exchange load information for load balancing.
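On a 2.4-series kernel, rules along these lines could divert inbound SMTP packets into a userspace queue via the ip_queue module, where a rewriting daemon can mangle them. This is an illustrative sketch under those assumptions, not the paper's actual ruleset.

```shell
# Load the userspace packet-queuing module (Linux 2.4.x).
modprobe ip_queue

# On the dispatcher: hand every inbound packet destined for the SMTP
# port (25) to the userspace daemon via the QUEUE target, so it can
# rewrite the destination address before the packet is re-routed.
iptables -t mangle -A PREROUTING -p tcp --dport 25 -j QUEUE
```

The userspace daemon would then read packets from the queue (e.g., via the libipq interface that accompanies ip_queue), rewrite the headers, and reinject them.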
6. Experiments and Results

For experiments on the test bed, we developed a simple SMTP request generator to simulate workload in many situations, because users at our site do not receive enough email to drive the system into an overload condition. Specifically, we model a mean message size of 4.7 KB. Figure 5 shows the average latency of one message transmission as the number of simultaneous mail transmissions to the system is varied. We compare a system with only one MTA service node against our D-EMS system. The results show that at low volume (fewer than 100 messages), a single MTA service node handles the messages better than our system does. When many mails enter the system, however, the single service node slows down, and D-EMS handles the requests better. The D-EMS configuration with 2 clusters of 3 MTA nodes works very well and achieves better performance than the configuration with only 1 cluster of 3 MTA nodes.

Figure 5: The Experimental Results: average latency of one message transmission (sec., 0 to 1) versus the number of messages in concurrent transmission (1 to 150), for a single MTA service node, D-EMS with 1 cluster of 3 MTA nodes, and D-EMS with 2 clusters of 3 MTA nodes

7. Conclusion

Although the D-EMS prototype is only in its first phase of implementation, it already shows better performance than a single MTA service system. Further phases of the D-EMS implementation will continue to improve performance.

References

[1] J. B. Postel, Simple Mail Transfer Protocol, IETF RFC 821, 1982. (http://www.ietf.org/rfc/rfc0821.txt)
[2] Spyros Sakellariadis, The Microsoft Exchange Server Internet Mail Connector, 29th Street Press, 1998.
[3] Microsoft Corp., PressPass, May 2001. (http://www.microsoft.com/presspass/press/2001/may01/05-14hotmail100pr.asp)
[4] G. D. H. Hunt, G. S. Goldszmidt, R. King, and R. Mukherjee, Network Dispatcher: A Connection Router for Scalable Internet Services, Proceedings of the 7th Int'l World Wide Web Conference, April 1998.
[5] O. Damani, P. Chung, and C. Kintala, ONE-IP: Techniques for Hosting a Service on a Cluster of Machines, Proceedings of the 41st IEEE Computer Society Int'l Conference, February 1996, pp. 85-92.
[6] J. Robert von Behren, Steven Czerwinski, Anthony D. Joseph, Eric A. Brewer, and John Kubiatowicz, NinjaMail: The Design of a High-Performance Clustered, Distributed E-mail System, Proceedings of the International Workshops on Parallel Processing 2000, August 21-24, 2000, Toronto, Canada.
[7] Yasushi Saito, Brian N. Bershad, and Henry M. Levy, Manageability, Availability and Performance in Porcupine: A Highly Scalable, Cluster-based Mail Service, 17th ACM Symposium on Operating Systems Principles (SOSP '99), published as Operating Systems Review 34(5):1-15, Dec. 1999.

[8] Wael R. Elwasif, James S. Plank, Micah Beck, and Rich Wolski, IBP-Mail: Controlled Delivery of Large Mail Files, NetStore '99: Network Storage Symposium, Seattle, WA, October 1999.
[9] Richard Reich, Sendmail V8: A (Smoother) Engine Powers Network Email, tutorial on the UnixWorld website. (http://www.networkcomputing.com/unixword/tutorial/008/008.txt.html)
[10] Website of netfilter and iptables. (http://www.netfilter.org)