GridFTP: A Data Transfer Protocol for the Grid Grid Forum Data Working Group on GridFTP Bill Allcock, Lee Liming, Steven Tuecke ANL Ann Chervenak USC/ISI <tuecke@mcs.anl.gov> Introduction In Grid environments, access to distributed data is typically as important as access to distributed computational resources. Distributed scientific and engineering applications require: transfers of large amounts of data (terabytes or petabytes) between storage systems, and access to large amounts of data (gigabytes or terabytes) by many geographically distributed applications and users for analysis, visualization, etc. Unfortunately, the lack of standard protocols for transfer and access of data in the Grid has led to a fragmented Grid storage community. Users who wish to access different storage systems are forced to use multiple protocols and/or APIs, and it is difficult to efficiently transfer data between these different storage systems. We propose a common data transfer and access protocol called GridFTP that provides secure, efficient data movement in Grid environments. This protocol, which extends the standard FTP protocol, provides a superset of the features offered by the various Grid storage systems currently in use. We chose the FTP protocol because it is the most commonly used protocol for data transfer on the Internet, and of the existing candidates from which to start it comes closest to meeting the Grid s needs. The GridFTP protocol includes the following features: Grid Security Infrastructure (GSI) and Kerberos support Third-party control of data transfer Parallel data transfer Striped data transfer Partial file transfer Automatic negotiation of TCP buffer/window sizes Support for reliable and restartable data transfer Integrated instrumentation Some of these features are supported by FTP extensions that have already been standardized in the IETF, but which are currently seldom implemented. Other features are new extensions FTP. 1
The Grid Forum GridFTP working group was formed within the Data working group to foster the development of a GridFTP protocol standard, as well as interoperable storage clients and servers that implement the GridFTP protocol. Motivation for a common protocol There are already a number of storage systems in use by the Grid community. These storage systems have been created in response to specific needs for storing and accessing large datasets. They each focus on a distinct set of requirements and provide distinct services to their clients. For example, some storage systems (DPSS, HPSS) focus on high-performance access to data and utilize parallel data transfer streams and/or striping across multiple servers to improve performance. i ii Other systems (DFS) focus on supporting high-volume usage and utilize dataset replication and local caching to divide and balance server load. iii The SRB system connects heterogeneous data collections and provides a uniform client interface to these repositories, and also provides metadata for use in identifying and locating data within the storage system. iv Still other systems (HDF5) focus on the structure of the data, and provide client support for accessing structured data from a variety of underlying storage systems. v Unfortunately, most of these storage systems utilize incompatible, an often unpublished protocols for accessing data, and therefore require the use of their own client libraries to access data. The use of multiple incompatible protocols and client libraries for accessing storage effectively partitions the datasets available on the grid. Applications that require access to data stored in different storage systems must either choose to only use a subset of storage systems, or must use multiple methods to retrieve data from the various storage systems. One approach to breaking down partitions created by these mutually incompatible storage system protocols is to build a layered client or gateway which can present the user with one interface, but which translates requests into the various storage system protocols and/or client library calls. This approach is attractive to existing storage system providers because it does not require them adopt support for a new protocol. But it also has significant disadvantages, including: Performance: Costly translations are often required between the layered client and storage system specific client libraries and protocols. In addition, it can be challenging to efficiently transfer a dataset from one storage system to another. Complexity: Building and maintaining a client or gateway that supports numerous storage systems is considerable work. In addition, staying up to date as each storage system independently evolves is very difficult. This is further exacerbated by the need to provide support for multiple client languages, such as C/C++, Java, Perl, Python, shells, etc. It would be mutually advantageous to both storage providers and users to have a common level of interoperability between all of these disparate systems: a common but extensible underlying data transfer protocol. Storage providers would gain a broader user base, because their data would be available to any client. Storage users would gain access to a broader range of storage systems and data. In addition, these benefits can be gained without the performance and complexity problems of the layered client or gateway approach. 2
Furthermore, establishing a common data transfer protocol would eliminate the current duplication of effort in developing unique data transfer capabilities for different storage systems. A pooling of effort in the data transfer protocol area would lead to greater reliability, performance, and overall features that would then be available to all distributed storage systems. Characteristics of the data transfer protocol In order to make a common data transfer protocol attractive to users and developers of existing storage systems, we must provide a transfer protocol that offers a superset of the features offered by systems currently in regular use. In addition, the protocol must be extensible, in order to support future innovations by storage system users and developers. We have observed that the FTP protocol is the protocol most commonly used for data transfer on the Internet, and the most likely candidate for meeting the Grid s needs. It is attractive in particular for the following reasons. It is a widely implemented and well-understood IETF standard protocol. There is a large code base and expertise from which to build. It provides a well-defined architecture for protocol extensions, and supports dynamic discovery of the extensions supported by a particular implementation. Numerous groups have added various extensions through the IETF. Some of these extensions are particularly useful in the Grid. In addition to client/server transfers (i.e. put/get, or remote read/write ), it also supports transfers directly between two servers, mediated by a third party client (i.e. third party transfer ). The separation of data and control channels onto different sockets allows for easier extensibility for parallel and striped transfers, efficiently transiting firewalls, etc. Most current FTP implementations support only a subset of the features defined in the FTP protocol (RFC 969) and its accepted extensions. Some of the seldom-implemented features are useful to Grid applications. But the standards also lack several features which Grid applications require. We intend to select a subset of the existing FTP standards and further extend it, adding the following features. We believe that the resulting protocol will be a suitable candidate for the common data transfer protocol for the grid, which we call GridFTP. Grid Security Infrastructure (GSI) and Kerberos support Robust and flexible authentication, integrity, and confidentiality features are critical when transferring or accessing files. GridFTP must support GSI and Kerberos authentication, with user controlled setting of various levels of data integrity and/or confidentiality. GridFTP provides this capability by implementing the GSSAPI authentication mechanisms defined by RFC 2228, FTP Security Extensions. Third-party control of data transfer In order to manage large data sets for large distributed communities, it is necessary to provide third-party control of transfers between storage servers. GridFTP provides this capability by 3
adding GSSAPI security to the existing third-party transfer capability defined in the FTP standard. Parallel data transfer On wide-area links, using multiple TCP streams (even between the same source and destination) can improve aggregate bandwidth over using a single TCP stream. This is required both between a single client and a single server, and between two servers. GridFTP supports parallel data transfer through FTP command extensions and data channel extensions defined in the Grid Forum draft. Striped data transfer Using multiple TCP streams to transfer data that is partitioned across multiple servers (ala. DPSS) can further improve aggregate bandwidth. GridFTP supports striped data transfers through extensions defined in the Grid Forum draft. Partial file transfer Many applications require the transfer of partial files. However, standard FTP requires the application to transfer the entire file, or the remainder of a file starting at a particular offset. GridFTP introduces new FTP commands, as defined in the Grid Forum draft, to support transfers of regions of a file. Automatic negotiation of TCP buffer/window sizes Manually setting TCP buffer/window sizes is an error-prone process (particularly for nonexperts) and is often simply not done. GridFTP extends the standard FTP command set and data channel protocol to support both manual setting and automatic negotiation of TCP buffer sizes both for large files and large sets of small files. Support for reliable data transfer Reliable transfer is important for many applications that manage data. Fault recovery methods for handling transient network failures, server outages, etc. are needed. The FTP standard includes basic features for restarting failed transfer that are not widely implemented. The GridFTP protocol exploits these features, and extends them cover the new data channel protocol. Implementation Status The GridFTP protocol has been adopted, in full or in part, in the following software systems: Globus Toolkit: The next version of the Globus Toolkit, which is currently available as an alpha release, includes a reference implementation of the full GridFTP protocol. Client and server SDKs, command line programs, and sample servers are included. wuftpd: The Washington University ftpd server has been modified to support GSI, partial files, and parallel transfers. This work was done by the ACES group at NCSA and the Globus Project group at Argonne. ncftp: The ncftp client has been modified to support GSI. This work was done by the Globus Project group at Argonne. 4
HPSS pftpd: The HPSS pftpd server has been modified to support GSI. This work was done by SDSC. Unitree ftpd: The Unitree ftpd server has been modified to support GSI. This work was done by the ACES group at NCSA. References i B. Tierney, W. Johnston, J. Lee, G. Hoo, and M. Thompson. End-to-end performance analysis of high speed distributed storage systems in wide area ATM networ ks. In NASA/Goddard Conference on Mass Storage Systems and Technologies, 1996, LBNL-39064. ii R.W. Watson and R.A. Coyne. The parallel I/O architecture of the high-performance storage system (HPSS). In IEEE MSS Symposium, 1995. iii It is interesting to note that DFS also provides for cache coherence in the face of multiple writers. This is one of its drawbacks for data grid applications, since this feature forces unnecessary overhead. Data grid applications seldom need the ability to change existing datasets. iv Information on SRB is available on the World Wide Web at http://www.npaci.edu/dice/srb/. v Information about HDF/HDF5 is available on the World Wide Web at http://hdf.ncsa.uiuc.edu/. vi Information about the Globus gsiftp is available on the World Wide Web at http://www.globus.org/security/v1.1/. 5