HP-UX 11i TCP/IP Performance White Paper




Contents

1 Introduction
1.1 Intended Audience
1.2 Organization of the document
1.3 Related Documents
1.4 Acknowledgements
2 Out of the Box TCP/IP Performance Features for HP-UX Servers
2.1 TCP Window Size and Window Scale Option (RFC 1323)
2.2 Selective Acknowledgement (RFC 2018)
2.3 Limited Transmit (RFC 3042)
2.4 Large Initial Congestion Window (RFC 3390)
2.5 TCP Segmentation Offload (TSO)
2.6 Packet Trains for IP Fragments
3 Advanced Out of the Box Scalability and Performance Features
3.1 TOPS
3.1.1 Configuration Scenario for TOPS
3.1.2 socket_enable_tops Tunable
3.2 STREAMS NOSYNC Level Synchronization
3.2.1 IP NOSYNC Synchronization
3.3 Protection from Packet Storms
3.3.1 Detect and Strobe Solution
3.3.2 HP-UX Networking Responsiveness Features
3.3.3 Responsiveness Tuning
3.4 Interrupt Binding and Migration
3.4.1 Configuration Scenario for Interrupt Migration
3.4.2 Cache Affinity Improvement
4 Improving HP-UX Server Performance
4.1 Tuning Application and Database Servers
4.1.1 Tuning Application Servers
4.1.2 Tuning Database Servers
4.2 Tuning Web Servers
4.2.1 Network Server Accelerator HTTP
4.2.2 Socket Caching for TCP Connections
4.2.3 Tuning tcphashsz
4.2.4 Tuning the Listen Queue Limit
4.2.5 Using the MSG_EOF Flag for TCP Applications
4.3 Tuning Servers in Wireless Networks
4.3.1 Smoothed RTO Algorithm
4.3.2 Forward-Retransmission Timeout (F-RTO)
5 Tuning Applications Using Programmatic Interfaces
5.1 sendfile()
5.2 Polling Events
5.3 send() and recv() Socket Buffers
5.3.1 Data Buffering in Sockets
5.3.2 Controlling Socket Buffer Limits
5.3.3 System Socket Buffer Tunables
5.4 Effective Use of the listen Backlog Value
6 Monitoring Network Performance
6.1 Monitoring Network Statistics
6.1.1 Monitoring TCP Connections with netstat -an
6.1.2 Monitoring Protocol Statistics with netstat -p
6.1.3 Monitoring Link-Level Statistics with lanadmin
6.2 Monitoring System Resource Utilization
6.2.1 Monitoring CPU Utilization Using Glance
6.2.2 Monitoring CPU Statistics Using Caliper
6.2.3 Monitoring Memory Utilization Using Glance
6.2.4 Monitoring Memory Utilization Using vmstat
6.2.5 Monitoring Cache Miss Latency
6.2.6 Monitoring Other Resources
6.3 Measuring Network Throughput
6.3.1 Measuring Throughput with Netperf Bulk Data Transfer
6.3.2 Measuring Transaction Rate with Netperf Request/Response
6.3.3 Key Issues for Throughput with Netperf Traffic
6.4 Additional Monitoring Tools
Appendix A: Annotated Output of netstat -s (TCP, UDP, IP, ICMP)
Appendix B: Annotated Output of ndd -h and Discussion of the TCP/IP Tunables
Table 1: Summary of TCP/IP Tunables
Table 2: Operating System Support for TCP/IP Tunables
Revision History

1 Introduction

This white paper is intended as a guide to tuning networking performance at the network and transport layers. This includes IPv4, IPv6, TCP, UDP, and related protocols. Some topics touch on other areas, including socket interfaces, network interface drivers, and application protocols; however, those are not the focus of this paper. Other information is available for these subsystems as referenced below.

1.1 Intended Audience

This white paper is intended for the following:

- Administrators responsible for supporting or tuning the internal workings of the HP-UX networking stack
- Network programmers, for example those who directly write to the TCP or UDP protocols using socket system calls
- HP-UX network and system administrators who want to supplement their knowledge of HP-UX configuration options

NOTE: This white paper is specific to performance tuning, and is not a general guide to HP-UX network administration.

1.2 Organization of the document

This document is organized as follows:

- Chapter 1: Introduction.
- Chapter 2: Provides information on out of the box TCP/IP performance features.
- Chapter 3: Provides information on advanced out of the box scalability and performance features.
- Chapter 4: Provides recommendations on tuning HP-UX servers.
- Chapter 5: Provides information on tuning applications using programmatic interfaces.
- Chapter 6: Provides information on how to monitor and troubleshoot network performance on HP-UX.
- Appendix A: Provides a detailed description of netstat statistics and related tuning.
- Appendix B: Provides a detailed description of TCP/IP ndd tunables.

1.3 Related Documents

The following documentation supplements information in this document:

- HP-UX 11i v3 performance further increases your productivity for improved IT business value
  http://h71028.www7.hp.com/erc/downloads/4aa1-0961enw.pdf
- Performance and Recommended Use of AB287A 10 Gigabit Ethernet Cards
  http://docs.hp.com/en/10gigewhitepaper.pdf/10gige_arches_whitepaper_version5.pdf
- HP Auto Port Aggregation Performance and Scalability White Paper
  http://docs.hp.com/en/7662/new-apa-white-paper.pdf
- Network Server Accelerator White Paper
  http://www.docs.hp.com/en/nsawp-90902/nsawp-90902.pdf
- RFCs related to TCP/IP performance: http://www.ietf.org/rfc.html
  RFC 1323: TCP Extensions for High Performance
  RFC 2001: TCP Slow Start, Congestion Avoidance, Fast Retransmit, and Fast Recovery Algorithms

  RFC 2018: TCP Selective Acknowledgement Options
  RFC 2861: TCP Congestion Window Validation
  RFC 3042: Enhancing TCP's Loss Recovery Using Limited Transmit
  RFC 3390: Increasing TCP's Initial Window
  RFC 3782: The NewReno Modification to TCP's Fast Recovery Algorithm
  RFC 4138: Forward RTO-Recovery (F-RTO): An Algorithm for Detecting Spurious Retransmission Timeouts

1.4 Acknowledgements

Most of the information in Appendix A and Appendix B has been derived from the "Annotated Output of ndd -h" and "Annotated Output of netstat -s" documents written by Rick Jones. You can find these documents at:

  ftp://ftp.cup.hp.com/dist/networking/briefs/annotated_ndd.txt
  ftp://ftp.cup.hp.com/dist/networking/briefs/annotated_netstat.txt

2 Out of the Box TCP/IP Performance Features for HP-UX Servers

The HP-UX Networking Stack is especially engineered and tested for optimum performance in an enterprise mission-critical environment. HP-UX 11i v3 exhibits excellent NFS server performance and performs well in the TPC-C benchmark, a measurement of intensive online transaction processing (OLTP) in a database environment. Typically, OLTP includes a mixture of read-only or update, short or long, and interactive or deferred database transactions. There are significant networking performance tuning improvements and optimizations in 11i v3 for database applications, as demonstrated by the benchmark result.

There are many out of the box performance features introduced in HP-UX 11i. Users do not need to configure or tune any attributes in order to see the performance improvement with these out of the box performance features. The networking stack gracefully adapts to different networking needs in an enterprise, from noisy low-bandwidth wireless environments to high-bandwidth, high-throughput datacenter environments. The TCP/IP performance features described in this chapter improve the performance of HP-UX servers, including database servers, application servers, NFS servers, web servers, mail servers, DNS servers, ftp servers, DHCP servers, gateways, and firewall systems.

2.1 TCP Window Size and Window Scale Option (RFC 1323)

TCP performance depends not only on the transfer rate itself, but also on the product of the link bit rate and the round-trip delay, or latency. This "bandwidth-delay product" measures the amount of data that would "fill the pipe"; it is the buffer space required on the sender and receiver systems to obtain maximum throughput on the TCP connection over the path, i.e., the amount of unacknowledged data that TCP must handle in order to keep the pipeline full. TCP performance problems arise when the bandwidth-delay product is large. We refer to an Internet path operating in this region as a "long, fat pipe".

In order to improve the performance of a network with a large bandwidth-delay product, the TCP window size needs to be sufficiently large. HP-UX supports the TCP window scale option (RFC 1323), which increases the maximum TCP window size up to approximately 1 gigabyte, or 1,073,725,440 bytes (65,535 * 2^14).

When HP-UX initiates a TCP connection (an active open), HP-UX always sends a SYN segment with the window scale option. Even when the real window size is less than 65,536, the window scale option is used with the scale factor set to 0. This is because advertising a 64K window with a window scale factor of 0 is better than advertising a 64K window without the option, as it tells the peer that the window scale option is supported. When HP-UX responds to a connection request (a passive open), HP-UX accepts SYN segments with the window scale option.

For the receiving TCP, the default receive window size is set by the ndd tunable tcp_recv_hiwater_def. Applications can change the receive window size with the SO_RCVBUF socket option. To fully utilize the bandwidth, the receive window needs to be sufficiently large for a given bandwidth-delay product. For the sending TCP, the default send buffer size is set by the ndd tunable tcp_xmit_hiwater_def, and applications can change the size with the SO_SNDBUF setsockopt() option. By setting the send socket buffer sufficiently large for a given bandwidth-delay product, the transport is better positioned to take full advantage of the remote TCP's advertised window.
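As a worked example, a 1 Gbit/s path with a 20 ms round-trip time has a bandwidth-delay product of roughly 125 MB/s * 0.02 s = 2.5 MB, far larger than the default buffer sizes. The commands below are a minimal sketch of how the defaults could be inspected and raised with ndd for such a path; the 2.5 MB value is an illustrative assumption, not an HP recommendation, and the permitted range for each tunable should be verified with ndd -h (see Appendix B) before changing a production system:

# ndd -get /dev/tcp tcp_recv_hiwater_def
# ndd -get /dev/tcp tcp_xmit_hiwater_def
# ndd -set /dev/tcp tcp_recv_hiwater_def 2621440
# ndd -set /dev/tcp tcp_xmit_hiwater_def 2621440

Applications that know their own requirements can instead set SO_RCVBUF and SO_SNDBUF per socket with setsockopt(), leaving the system-wide defaults unchanged.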

2.2 Selective Acknowledgement (RFC 2018)

TCP may experience poor performance when multiple packets are lost from one window of data. Selective Acknowledgment (SACK), described in RFC 2018, is effective in recovering from the loss of multiple segments in a window. It accomplishes this by extending TCP's original, simple "ACK to the first hole in the data" algorithm with one that describes holes past the first lost segment. This information, sent from the receiver to the sender as an option field in the TCP header, allows the sender to retransmit lost segments sooner. In addition, the acknowledgment of segments after the first hole in sequence space using SACK allows the sender to avoid retransmission of segments which were not lost.

SACK is configured in HP-UX with the ndd tunable tcp_sack_enable, which can be set to the following values:

- 0: Never initiate nor accept the use of SACK
- 1: Always ask for, and accept, the use of SACK
- 2: Do not ask for, but accept, the use of SACK (Default)

The default value of 2 is somewhat conservative, as the system will not initiate the use of SACK on a connection. It may be necessary to keep this value in some cases, as other TCP implementations which do not support SACK may be improperly implemented and may not ignore this option when it is requested in a TCP SYN segment. However, if the remote initiates the connection and asks for SACK, HP-UX will honor that request. A tcp_sack_enable value of 1 should be used if you want the system to use SACK for connections initiated from the system itself (i.e., applications on the system calling connect() or t_connect()).

2.3 Limited Transmit (RFC 3042)

HP-UX implements TCP Limited Transmit (RFC 3042), which provides faster recovery from packet loss. When a segment is lost, exercising the TCP Fast Retransmit algorithm (RFC 2001) is much faster than waiting for the TCP retransmit timeout. In order to trigger TCP Fast Retransmit, three duplicate acknowledgments need to be received. However, when the congestion window is small, enough duplicate acknowledgments may not be produced. The Limited Transmit feature attempts to induce the necessary duplicate acknowledgments in such situations. For each of the first two duplicate ACKs, Limited Transmit sends a new data segment (if new data is available). If a previous segment has in fact been lost, these new segments will induce additional duplicate acknowledgments. This improves the chances of initiating Fast Retransmit. Limited Transmit can be used either with or without the TCP selective acknowledgement (SACK) mechanism.

2.4 Large Initial Congestion Window (RFC 3390)

The congestion window is the flow control imposed by the sending TCP entity. When TCP starts a new connection or restarts transmission after a long idle period, it starts conservatively by sending a few segments, i.e., the initial congestion window, and does not utilize the whole window advertised by the receiving TCP.

The large initial congestion window (RFC 3390) increases the permitted initial window from one or two segments to four segments or 4380 bytes, whichever is less. For example, when the MSS is 1460 bytes, the TCP connection starts with three segments (3 * 1460 = 4380). By default, HP-UX uses the large initial congestion window. This is configured by the ndd tunable tcp_cwnd_initial.

The large initial congestion window is especially effective for connections that need to send a small quantity of data. For example, sending 4KB of data needs just one RTT to transmit the data, while, without the large initial window, it requires an extra round-trip time (RTT), which could have a significant performance impact on long-delay networks.

2.5 TCP Segmentation Offload (TSO)

TCP Segmentation Offload (TSO) refers to a mechanism by which the TCP host stack offloads certain portions of outbound TCP packet processing to the Network Interface Card (NIC). This reduces the host CPU utilization. It allows the HP-UX transport implementation to create packets up to 32120 bytes in length that can be passed down to the driver in one write. This feature is also referred to as Large Send Offload, Segmentation Offload, Multidata Transmit, or Re-segmentation.

TSO increases the efficiency of the HP-UX kernel by allowing 22 segments with the TCP maximum segment size (MSS) of 1460 bytes to be processed at one time, saving 21 passes down the stack. Using a Jumbo Frame MTU of 9000 bytes, this translates to 3 to 4 passes for a 32 KB send. This feature can significantly reduce the server load for applications that transmit large amounts of data from the system. Examples of such applications include web servers, NFS, and file transfer applications. If the link is primarily used for bulk data transfers, then turning on this feature improves CPU utilization. The performance gain is less optimal for shorter application messages transmitted over interactive streams.

The NIC must be capable of providing this feature. To enable this feature, use the following commands:

11i v1 and 11i v2:
# lanadmin -X send_cko_on ppa
# lanadmin -X vmtu 32160 ppa

11i v3:
# nwmgr -s -A tx_cko=on -c interface_name
# nwmgr -s -A vmtu=32160 -c interface_name

If the card is not TSO-capable, the "vmtu" option will not be supported. For more information on TSO-enhanced cards and drivers, go to http://www.hp.com/go/softwaredepot and search for TSO.

2.6 Packet Trains for IP fragments

Packet Trains are used when sending IP fragments. This solves the problem where a driver may not be able to handle a burst of IP fragments. Previously, when processing a large datagram, IP would fragment the datagram to the current MTU size and send down each fragment to the driver before processing the rest of the datagram. This could cause a problem if the driver is unable to process one or more of the IP fragments during outbound processing of these individual fragments.

A single fragment dropped by the driver will cause the entire datagram to be unrecoverable. When the remote machine picks up the remaining fragments, they will be queued in its reassembly queue, according to the IP protocol. If this happens frequently, the entire IP reassembly queue on the receiving side will be exhausted. This, in turn, would result in good packets being dropped because of the full buffer on the receiving side.

To mitigate this problem, HP-UX uses Packet Trains. As each fragment is carved off, it is linked with the other fragments for this write to form a packet train, until the entire datagram is processed. Then a request is made to the driver to ensure that all of the fragments can be accommodated in one request. If so, IP passes down the packet train and the driver sends it to the card. If the driver cannot accommodate the entire packet train, then the entire train is discarded. This reduces the host CPU utilization.

This feature is enabled by default, provided that the driver is capable of handling this request. Currently only 1000 Mbit or faster interfaces support this feature. To see if a driver has this feature enabled, enter the following command:

# ndd -get /dev/ip ip_ill_status

If the output includes the keyword TRAIN, the driver supports this feature. For example:

ILL rq wq upcnt mxfrg err memavail ilmcnt name
00000000517990a8 000000005001e400 000000005001e580 00001 01500 000 RUNNING BROADCAST CKO MULTICAST CKO_IN TRAIN 00000000 000001 lan0

3 Advanced Out of the Box Scalability and Performance Features

The HP-UX Networking Stack has been engineered for best scalability and performance on high-end servers. It can gracefully scale from a few processors up to 256 processors, and from 10Base-T to 10 Gigabit Ethernet. Due to the varying configuration requirements of different types of workloads on high-end servers, HP-UX provides the following advanced performance features for a highly scalable TCP/IP stack:

- TOPS
- NOSYNC
- Protection from Packet Storms
- Interrupt Binding

3.1 TOPS

Thread-Optimized Packet Scheduling (TOPS) increases the scalability and performance of TCP and UDP socket applications sharing a high-bandwidth network interface on multiprocessor systems. The goal is to move inbound packet processing to the same processor that runs the receiving user application.

IP networking stacks, such as the stack implemented on HP-UX, operate as multiplexers, which route packets between network interface cards (NICs) and a set of user endpoints. HP-UX achieves excellent scalability by scheduling multiple applications across a set of processors, and, for outbound data, applications scale well when sharing a NIC. However, for inbound data, the configuration of each NIC determines which processor it interrupts. For most NICs, a single processor is interrupted as packets come in from the network. In the absence of TOPS, this processor will do the protocol processing for each incoming packet. Since a single high-speed NIC can process incoming data for many connections, the processor interrupted by this NIC can easily become a bottleneck. This prevents the maximum network throughput or packet rate from being realized.

In order to improve scalability in this case, the TOPS mechanism allows the driver to quickly hand off packets to the processor where the application is most likely running, and return to processing packets coming from the wire. In most cases, a single processor will then perform all memory accesses to the application data inside each packet. This leads to a more efficient use of the memory and cache subsystems. The TOPS mechanism is used by all TCP and UDP sockets without application modification or recompilation.

3.1.1 Configuration Scenario for TOPS

TOPS is most beneficial for system configurations where the number of CPUs is much greater than the number of NICs, such as a 16-way system with one or two Gigabit cards. Inbound packet processing is spread among the CPUs based on where the socket application processes are scheduled, leading to a more even distribution of the processing load in MP-scalable and network-intensive applications.

3.1.2 socket_enable_tops Tunable

TOPS is enabled by default on HP-UX 11i, and requires no action on the part of an application to take advantage of this feature. On the more recent patches of 11i v1 and 11i v2, the ndd tunable socket_enable_tops is available to turn off or alter the behavior of TOPS. In 11i v3, the equivalent tunable will be provided in a future patch. This may be useful in the cases described below where specific conditions make the TOPS default less than optimal. Refer to Table 2 (at the end of Appendix B) for the patch level information for the ndd tunable socket_enable_tops.

It should not be necessary to disable TOPS. However, there are cases where the scalability issue addressed by TOPS does not exist. When there are multiple NICs on a system, it is possible that no NIC interrupt will become a processing bottleneck even with TOPS disabled (socket_enable_tops = 0). In these cases, there may be some efficiency gained by avoiding the overhead of TOPS, and allowing more of the processing to be done in the NIC interrupt context before switching to the processor running the application. In the most efficient, highest-performing case of the application and NIC being assigned to the same processor, however, there is no need for TOPS to switch processors, and therefore the TOPS tunable setting will have no effect on performance.

Another consideration for TOPS tuning is whether the NIC is configured for checksum offload (CKO) on inbound data. If CKO is enabled, TOPS will provide less benefit for the memory cache, as there will not be a need to read the payload data during the inbound TCP/UDP processing.

As an application is rescheduled over time between different processors, or in cases where threads executing on different processors share a socket, TOPS may not operate optimally in determining which processor to switch to in order to match where the system call will execute to receive the data. In most cases, the default TOPS setting for 11i v3 (socket_enable_tops = 2) will work best in following the application to its current CPU. In cases where sockets are being opened and closed at a high rate, it may be possible to gain some efficiency by fixing the processor assigned to each connection by TOPS, using the ndd setting socket_enable_tops = 1, which is the default for 11i v1 and 11i v2. However, these cases may be rare, and can only be identified by experimentation, or by detailed measurement and analysis of the performance of the HP-UX kernel. As a result, changing from the default setting to socket_enable_tops = 2 on 11i v1 and 11i v2 will provide equal or better performance in the majority of cases.
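As a minimal sketch, the tunable can be inspected and changed at run time with ndd. The /dev/sockets device path used here is an assumption based on the other socket_* tunables discussed later in this paper (such as socket_caching_tcp), and the setting should only be changed after measurement on the specific workload:

# ndd -get /dev/sockets socket_enable_tops
# ndd -set /dev/sockets socket_enable_tops 2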

3.2 STREAMS NOSYNC Level Synchronization

Previously, the STREAMS framework supported execution of only one instance of the put procedure at a time for a given STREAMS queue. For multiple requests to the same queue, STREAMS synchronized the requests depending on the synchronization level of a module. Synchronization ensured that only one request was executed at a time. With high-speed I/O, the synchronization limits imposed by STREAMS could easily lead to a performance bottleneck.

The restriction imposed by these previous STREAMS synchronization methods has been removed by providing a new synchronization method, NOSYNC, in 11i v3 and the latest patches for 11i v1 and 11i v2. If a module uses NOSYNC level synchronization, the STREAMS framework can concurrently execute multiple instances of its queue's put procedure and a single instance of the same queue's service procedure. This requires the modules to protect any module-specific data that is shared between multiple instances of put procedures, or between the put and service procedures.

3.2.1 IP NOSYNC synchronization

With NOSYNC level synchronization, the IP module can handle requests simultaneously when multiple requests arrive on the same queue. This feature significantly improves network throughput, reaching near link speed for high-speed network interfaces such as multi-port Gigabit cards in an Auto Port Aggregation (APA) configuration or 10 Gigabit cards. To realize the performance gain from this feature, all modules (e.g., DLPI, IPFilter) on the networking stack between the IP layer and the LAN driver must have NOSYNC enabled. HP recommends that providers of modules pushed on the DLPI stream create or modify the modules to operate at the NOSYNC synchronization level so that the NOSYNC performance gain is not lost.

For more details about writing a NOSYNC module/driver, refer to the STREAMS/UX Programmer's Guide, available at http://docs.hp.com/en/netcom.html#streams/ux

Patch level information for the NOSYNC feature:

11i v1:
- STREAMS: PHNE_35453 or higher
- ARPA Transport: PHNE_35351 or higher
- DLPI: PHNE_33704 or higher
- IPFilter: A.03.05.12 or later

11i v2:
- STREAMS: PHNE_34788 or higher
- ARPA Transport: PHNE_35765 or higher
- DLPI: PHNE_33429 or higher
- IPFilter: A.03.05.12 or later
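One way to check whether these prerequisites are already installed is to query the installed-software database with swlist. This is a sketch using the 11i v1 patch IDs listed above; substitute the IDs for your release:

# swlist -l patch | grep PHNE_35453
# swlist -l patch | grep PHNE_35351
# swlist -l patch | grep PHNE_33704
# swlist | grep -i ipfilter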

3.3 Protection from Packet Storms

When the network is overloaded, or when defective network components send out a flood of packets, a server can see an inbound packet storm of network traffic. This can have a serious impact on the performance of a mission-critical server, as a large share of CPU power is consumed and an excessive amount of time is spent in interrupt context to process these packets. HP-UX has an extensive set of features to minimize the negative effects of many types of packet storms. Using the default capabilities of HP-UX 11i v3, the system will be well protected against this, as described below.

3.3.1 Detect and Strobe Solution

The Detect and Strobe kernel functionality is described in the HP-UX 11i v3 Release Notes, Chapter 5 (http://docs.hp.com/en/5991-6469/index.html). This feature is designed to limit the amount of processor time spent in interrupt context to a maximum percentage over time. This provides better responsiveness for time-sensitive applications and high-priority kernel threads that could otherwise be delayed by interrupt activity. A tunable parameter, documented in the intr_strobe_ics_pct(5) man page, is provided to control the operation of Detect and Strobe. It is enabled by default, and it is documented that only HP Field Engineers should change the value of this tunable.

3.3.2 HP-UX Networking Responsiveness Features

Several features of the networking kernel code contribute to protecting the system from packet storms, and these have been improved in HP-UX 11i v3. In general, synchronization points exist in the protocol layers to serialize the processing of packets when required. For example, to maintain the state of a particular TCP connection, inbound and outbound segments are processed serially as they are received by the upper or lower protocol layers, and queuing can occur. The queued backlog of packets could become a responsiveness issue, particularly when processed in an interrupt context. However, by setting reasonable limits on queue length, and by eliminating points of contention to allow more parallelism in TCP/IP processing, HP-UX has eliminated many causes of delay in the kernel, even when the system is under extreme load. In addition, the Detect and Strobe feature will be activated if the incoming traffic is more than the system can handle.

3.3.3 Responsiveness Tuning

The cost of providing responsiveness for the overall system in the case of packet storms is that incoming network interrupts can be delayed or even dropped. This will usually occur in a case where the incoming packets would eventually be dropped anyway due to a kernel queue overflow, memory shortage, or network protocol timeout. Given that dropping packets is inevitable, dropping them as soon as possible in the NIC uses fewer operating system resources, and is therefore the most desirable response. This is particularly true in the case of packet storms consisting of unwanted packets, for example from a malfunctioning switch, where the loss of the packets themselves is of little or no consequence. In the case of reliable protocols such as TCP, the dropped data should be recovered through a retransmission, and the protocol should help relieve the overflow by slowing down the connection using the TCP congestion window. In the case of an unreliable datagram protocol such as UDP, the loss of data may be noticeable at the user or application level.

The logging messages described in intr_strobe_ics_pct(5) can be used to determine when the Detect and Strobe feature has been activated due to excessive interrupt activity. In addition, HP Support can retrieve network-specific kernel statistics that can determine whether packets are being dropped due to queue overflows.

If responsiveness is not critical on the system, it may be possible to gain a small amount of performance by tuning intr_strobe_ics_pct(5) to allow a higher maximum percentage of interrupt processing. Other approaches to increasing responsiveness and performance include using Interrupt Migration, as described in section 3.4, and binding critical applications to processors or processor sets where interrupt activity is less likely to be a problem.

Many of the features described above are available in HP-UX 11i v1 and 11i v2 at the most recent patch levels. The HP-UX Reference for HP-UX 11i v2 September 2004 describes intr_strobe_ics_pct(5), which is disabled by default in that version. In September 2006, an 11i v2 Detect and Strobe solution was released based on the May 2005 update release, with some additional recommended patches. A similar responsiveness solution was released for HP-UX 11i v1 as a set of patches and optional products. The Interrupt Migration product for 11i v1 is one of these optional products, and is available without cost from software.hp.com. Because of the set of patches required, and the recommendation that only the HP Field modify intr_strobe_ics_pct(5), HP Support should be contacted if a responsiveness solution is required on any HP-UX 11i release.

3.4 Interrupt Binding and Migration

In HP-UX 11i, the system administrator has the ability to assign interface cards to interrupt specific processors, overriding the default assignment performed by the operating system. The command used for this assignment, called Interrupt Migration, is intctl(1M). The default assignments done by the operating system at boot time spread interrupts evenly across a set of processors, and work well in most cases. However, for optimal network performance, it may be necessary to change this, taking into consideration the overall system and application workload.

3.4.1 Configuration Scenario for Interrupt Migration

A significant amount of network protocol processing for inbound packets is done as part of the interrupt from the network interface. In order to avoid a CPU bottleneck when there is heavy network traffic, Interrupt Migration can be used to move interrupts away from heavily loaded processors. Examples of this load balancing include configuring two busy network interfaces to interrupt separate processors, or scheduling network interrupts away from a processor which is busy with unrelated application processing. In the case of an IP subnet configured using Auto Port Aggregation (APA), maximum throughput can be achieved by assigning interrupts for each interface in the aggregate to a separate processor.

The 10 Gigabit Ethernet driver (ixgbe) for HP-UX provides load balancing through the destination-port based multiqueue feature. This allows multiple processors to be interrupted by the 10 Gigabit card, and the incoming traffic can be separated into multiple flows based on the TCP destination port. Only TCP is supported by the destination-port multiqueue feature. This increases the maximum throughput of the 10 Gigabit card, which would otherwise be limited by the interrupt processing speed of a single CPU. The "10GigEthr-00 (ixgbe) 10 Gigabit Ethernet Driver" release notes (http://docs.hp.com/en/j6379-90003/j6379-90003.pdf) explain the configuration of the multiqueue feature.

3.4.2 Cache Affinity Improvement

Network protocols are layered, and data and control structures are shared between these layers. When these structures are brought into a processor's cache, less time is spent stalling on cache misses as the remaining protocol layers process the packet. Since interrupts for a NIC are bound to a processor, there is a good possibility that some structures will still be in the correct processor's cache when the next packet for a given connection arrives. However, when an application receives the data, there is the possibility of additional cache misses, as the HP-UX scheduler assigns application threads to processors independently of the interrupt bindings.

To get the most efficient operation from a cache standpoint, it is beneficial to have the interrupt assigned where the busiest applications are consuming the data. Using mpctl(2) on a per-application basis, and optionally defining processor sets, applications can be restricted to run on specific processors. If this does not result in a CPU bottleneck, it is most efficient both for the application and from a system-wide perspective. On the other hand, there is little cache sharing between network interfaces, so there will be little benefit from cache affinity if multiple network interfaces interrupt the same processor.

4 Improving HP-UX Server Performance

4.1 Tuning Application and Database Servers

Many enterprise applications today are architected and built using the J2EE framework, which is designed for the mainframe-scale computing typical of large enterprises. The J2EE framework provides a way to architect solutions which are distributed, multi-tiered, and scalable. The diagram below shows an overview of the multi-tiered J2EE application architecture.

[Figure: Multi-tiered J2EE application architecture in an enterprise data center, with web clients on the Internet connecting to Tier 1 (Web Server), Tier 2 (App Servers), and Tier 3 (DB Server).]

In such an architecture, the client tier typically consists of web browsers or traditional terminals used at points of sale. In a typical deployment, the web and business tiers are either separate or hosted within a single physical server. Application servers normally run the business logic of an enterprise, and they communicate with backend database servers using application programming interfaces such as the Java Database Connectivity (JDBC) interfaces. Though both web servers and application servers can be hosted on a single physical server, the common practice is to separate them and run them on different physical servers for better performance and scalability of applications. In an actual deployment, there may be components such as a network load balancer which help balance the load among multiple application servers and/or web servers. The next diagram shows a typical physical view of such a deployment.

[Figure: Typical physical deployment in an enterprise data center: Internet/leased-line traffic (HTTP/HTTPS) passes through a firewall to the Web server, load balancer, application server, and DB server, arranged across the Open Zone, DMZ, and MZ.]

4.1.1 Tuning Application Servers

The network traffic characteristics of a physical server used as an application server vary based on its usage context and the nature of the applications (business logic) it runs. Web applications are normally implemented using technologies such as servlets and JSP scripts. For example, users first connect to the Web server, which in turn forwards the request to run an application. Such an application may be implemented as a servlet on an application server. Based on the application logic, the application server may need to access the back-end database server.

Typically, application servers communicate with front-end web servers or back-end database servers using a shared set of TCP connections, an approach known as connection pooling. Web servers reuse these connections for forwarding requests from different clients at different points in time. For performance reasons, connection pooling is preferred over creating new connections on demand. The number of TCP connections in the pool is often configurable and is based on the number of concurrent users that the system needs to support during peak load conditions.

Application server vendors usually suggest a set of networking-related tunable parameters that are best suited to run the application server on a given OS platform. In this section we provide a set of guidelines on tuning network tunable parameters to run application servers on HP-UX 11i. Most of the tunable parameters discussed below are queried or set using the ndd command on HP-UX. Please refer to Appendix B for more details on these tunable parameters.

4.1.1.1 tcp_time_wait_interval

A physical server has to support a large number of concurrent TCP connections if it is used to run both an application server and a Web server simultaneously. tcp_time_wait_interval controls how long connections stay in the TIME_WAIT state before closing down. The frequent opening and closing of a large number of TCP connections, as is the case with Web servers, may result in a large number of connections staying in the TIME_WAIT state before getting closed. Application server vendors typically suggest tuning this parameter related to TCP's TIME_WAIT timer. With the default value of 60 seconds for tcp_time_wait_interval on HP-UX, the HP-UX stack can track literally millions of TIME_WAIT connections with no particular decrease in performance and only a slight cost in terms of memory. Please refer to Appendix B for further discussion on this tunable parameter.
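To gauge whether TIME_WAIT buildup is even a factor, the current count of TIME_WAIT connections and the interval in effect (reported by ndd in milliseconds, so 60000 corresponds to the 60-second default) can be checked before considering any change:

# netstat -an | grep TIME_WAIT | wc -l
# ndd -get /dev/tcp tcp_time_wait_interval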

4.1.1.2 tcp_conn_request_max

Depending upon the configuration of a physical server, application servers typically need to accept a large number of concurrent connections. The number of connections that can be queued by the networking stack is the minimum of the listen backlog and the tunable parameter tcp_conn_request_max. Application server vendors may suggest setting this tunable parameter to 4096. On HP-UX the default value for this tunable is already 4096, so it may not need a change. Use netstat -p tcp to monitor any connections dropped due to listen queue full conditions, and increase the value of this tunable parameter if necessary. Refer to section 4.2.4 for a detailed description of this tunable parameter.

4.1.1.3 tcp_xmit_hiwater_def

This parameter controls the amount of unsent data that triggers write-side flow control. For typical OLTP types of transactions (short request and short response) this parameter needs no tuning. Increasing this tunable enables large buffer writes. For Decision Support System (DSS) workloads (i.e., small query and large response), we recommend setting this tunable parameter to 65536 (the default is 32768). Please refer to Appendix B for further discussion on this tunable parameter.

4.1.1.4 tcp_ip_abort_interval

In certain deployment scenarios, backend database servers may be used in a highly available cluster configuration. tcp_ip_abort_interval is the maximum amount of time a sending TCP will wait before concluding that the receiver is not reachable. Application servers may use this mechanism to detect node or link failure and automatically switch the traffic over to a working database server in a cluster configuration. In a typical deployment, application servers may be communicating with database servers and Web servers which are physically close. To speed up detection and make use of failover features in such environments, it may be desirable to set this tunable parameter to a value lower than the default of 10 minutes. However, it is not recommended to set this parameter lower than tcp_time_wait_interval. Please refer to Appendix B for further discussion on this tunable parameter.

4.1.1.5 tcp_keepalive_interval

When there is no activity on a connection and the application has requested that the keepalive timer be enabled on the connection, TCP sends keepalive probes at tcp_keepalive_interval intervals to make sure that the remote host is still reachable and responding. Application servers may make use of this feature (SO_KEEPALIVE) to quickly fail over in cluster configurations when there is not much network traffic. As application servers typically maintain a pool of long-standing TCP connections open to both Web servers and database servers, it is desirable to detect node or link failures and fail over earlier during periods of very low network traffic. The default value is 2 hours; however, some application server vendors suggest tuning this parameter to a much lower value (e.g., 900 seconds) than the default.
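Taken together, a hedged sketch of applying the vendor-style suggestions from sections 4.1.1.3 through 4.1.1.5 is shown below. The interval tunables take values in milliseconds, and the specific numbers (a 2-minute abort interval and a 900-second keepalive) are illustrative assumptions rather than HP defaults; validate each value with ndd -h and with your application vendor before use:

# ndd -set /dev/tcp tcp_xmit_hiwater_def 65536
# ndd -set /dev/tcp tcp_ip_abort_interval 120000
# ndd -set /dev/tcp tcp_keepalive_interval 900000

Settings made with ndd -set do not persist across a reboot; they are normally recorded in /etc/rc.config.d/nddconf so that they are reapplied at startup, for example:

TRANSPORT_NAME[0]=tcp
NDD_NAME[0]=tcp_keepalive_interval
NDD_VALUE[0]=900000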

4.1.1.6 tcphashsz

tcphashsz controls the size of several hash tables maintained within the kernel. When there is a large number of concurrent connections in the system, larger tables give better performance at the expense of more memory. On modern-day servers memory may not be a major constraint. When the Web server and application server run on the same physical machine, the suggested value for this tunable parameter is 32768. If the Web server and application server run on different machines, the number of concurrent connections on the application server may not be very large, and the default value (number of CPUs * 1024) should suffice. This parameter is set using the following command:

# kctune tcphashsz=32768

Please note that the system has to be rebooted for the new value to take effect. Otherwise, the system will continue to use the current value of the tcphashsz parameter. Refer to section 4.2.3 for more discussion on tuning tcphashsz.

4.1.2 Tuning Database Servers

There are several different database systems deployed today on HP-UX. Typically, networking may not be a bottleneck on a database server as compared to I/O. Nevertheless, the following tuning may help improve the overall efficiency from a networking perspective.

4.1.2.1 tcp_xmit_hiwater_def

This parameter controls the amount of unsent data that can be queued to the connection before subsequent attempts by the application to send data will cause the call to block, or to return EWOULDBLOCK/EAGAIN if the socket is marked non-blocking. For typical OLTP types of transactions (short requests and short responses) this parameter needs no tuning. However, you may consider increasing it from the default value of 32768 to 65536 for DSS (Decision Support System) or BI (Business Intelligence) workloads that require a large amount of data to be transferred from the database server. Furthermore, this may help with data backups from database servers to an external storage device over network-attached storage (NAS).

4.1.2.2 socket_udp_rcvbuf_default

Cluster-based database technologies are becoming popular. Typically, nodes of such database clusters communicate among themselves using UDP. A large amount of data may be exchanged between server nodes in a database cluster connected through an interconnect. In this case you may want to consider increasing the tunable parameter socket_udp_rcvbuf_default, which defines the default receive buffer size for UDP sockets. If the command netstat -p udp shows socket overflows, it might be desirable to increase this tunable parameter. Note that increasing the size of the socket buffer only helps if the overload condition is short and the burst of traffic is smaller than the socket buffer. Increasing the socket buffer size will not help if the overload is sustained.

4.1.2.3 socket_udp_sndbuf_default

As mentioned above, cluster-based database technologies may use UDP to communicate between nodes in the cluster. This tunable parameter sets the default send buffer size for UDP sockets. The default value for this tunable parameter is 65536, which is optimal for the cluster-based database technologies used on HP-UX.
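As a sketch of the workflow described in section 4.1.2.2, check for UDP socket overflows first and raise the default receive buffer only if overflows are actually occurring. The 262144-byte value and the /dev/sockets device path are illustrative assumptions; confirm the tunable's range with ndd -h before changing it:

# netstat -p udp | grep -i overflow
# ndd -get /dev/sockets socket_udp_rcvbuf_default
# ndd -set /dev/sockets socket_udp_rcvbuf_default 262144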

4.2 Tuning Web Servers

As the demand for faster and more scalable web service increases, it is desirable to improve web server performance and scalability by integrating web server functionality into the operating system. Web server workloads are characterized by many short-lived connections, opening and closing TCP connections at a very fast rate. In a busy web server environment, there can be tens of thousands of TCP connections per second. The following features and configurations are recommended for optimizing Web servers:

- Network Server Accelerator HTTP
- Socket Caching for TCP connections
- Increase the tcphashsz value
- Increase the listen queue limit
- MSG_EOF for TCP applications

4.2.1 Network Server Accelerator HTTP

The Network Server Accelerator HTTP (NSA HTTP) is a product that provides an in-kernel cache of web pages in HP-UX 11i. This section describes the performance improvements achievable with NSA HTTP and the system tuning needed to achieve these performance improvements. The following list highlights the techniques NSA HTTP implements to achieve superior Web server performance on HP-UX 11i:

- Serving content from RAM (main memory) eliminates disk latency and I/O bandwidth constraints.
- In-kernel implementation decreases transitions between kernel and user mode.
- Tight integration with the TCP protocol stack allows efficient event notification and data transfer. In particular, a zero-copy send interface reduces data transfer by allowing responses to be sent directly from the RAM-based cache.
- Deferred-interrupt context processing removes the overhead associated with threads.
- Re-use of data structures and other resources reduces the lengths of critical code paths.

The HTTP-specific portion of NSA HTTP is implemented as a DLKM module. In addition, the nsahttp utility is provided to configure and administer NSA HTTP. For a detailed description of the utility, refer to the nsahttp(1) man page. The NSA HTTP product is supported on HP-UX 11i and is available from http://www.software.hp.com.

4.2.1.1 Usage Scenarios

There are a number of ways NSA HTTP can be used in a Web server environment. We briefly describe two scenarios to highlight the most typical usage.

4.2.1.1.1 Single System Web Server with NSA HTTP

The simplest scenario uses NSA HTTP and a conventional user-level Web-server process co-located on a single system. In this topology, NSA HTTP increases server capacity by increasing the efficiency of processing static requests. NSA HTTP provides a fast path in the kernel that bypasses normal processing of static requests at the user level. This fast path entails having NSA HTTP parse each HTTP request to determine whether it can be served from the kernel. Requests that cannot be served from the kernel are passed to the user-level server process. Adding the fast path in the kernel therefore introduces additional parsing and processing to the path for requests served at the user level. This overhead is not significant and is more than compensated for by the increased efficiency when serving static requests.

4.2.1.1.2 Multiple Web Servers with Partitioned Content

High-traffic Web sites typically feature multiple servers that are dedicated to specific purposes. A given set of servers, for example, may serve specific content such as images, advertisements, audio, or video. Dedicating servers to specific content types limits the total working set that must be delivered by any single server and allows the server's hardware configuration to be tailored to its content. One common approach to partitioning content is to separate static and dynamic content. Servicing static content requests typically requires more I/O bandwidth and memory than servicing dynamic content; servicing dynamic content requests typically requires greater CPU capacity. A second typical usage scenario, usually associated with very-high-traffic Web sites, is to deploy NSA HTTP on multiple web servers dedicated to serving static content, with a load balancer and/or web switch that routes user requests to the appropriate server. This approach is typically viable when the content of a site has already been manually partitioned among a set of specialized servers.

4.2.1.2 Tuning Recommendations

This section describes the NSA HTTP operating parameters that you can tune to improve performance.

4.2.1.2.1 Maximum NSA HTTP Cache Percentage (max cache percentage)

The maximum NSA HTTP URI (Uniform Resource Identifier, a term that encompasses URL) cache size is configured as a percentage of system memory. You can set the value for this parameter by editing /etc/rc.config.d/nsahttpconf or by using the nsahttp(1) command:

# nsahttp -C max_cache_percentage

By default, max_cache_percentage is 50 (50% of system memory). You should set the value for max_cache_percentage in conjunction with the system file cache settings (see filecache_max(5) for 11i v3, and dbc_min_pct(5) for 11i v1 and 11i v2). See sendfile() in section 5.1 of this document for additional information on file cache settings, as sendfile caching is done directly in the file cache, separately from the caching done by NSA HTTP.

4.2.1.2.2 Cache Entry Timeout

NSA HTTP has a URI cache entry timeout value. If an entry is not accessed for a period longer than the timeout value, NSA may re-use (write over) the entry. For best performance, an optimal value for the timeout must be found. If it is too high, the cache may contain many stale entries. If it is too low, there may be excessive cache entry timeouts and increased cache misses. You can set the cache timeout by editing /etc/rc.config.d/nsahttpconf or by using the following nsahttp command:

# nsahttp -e cache_timeout

The cache_timeout value is set in seconds. For example, the command nsahttp -e 7200 sets the cache entry timeout to 7200 seconds (two hours).

4.2.1.2.3 Maximum URI Page Size

NSA HTTP allows you to limit the maximum size of each of the URI objects (web pages) stored in the cache. You can tune this value to optimize cache usage. You can set the maximum URI page size by editing /etc/rc.config.d/nsahttpconf or by using the following nsahttp command:

# nsahttp -m max_uri_page_size

The max_uri_page_size is specified in bytes. For example, the command nsahttp -m 2097152 causes NSA HTTP to cache only web pages of 2 MB or less.

4.2.1.3 Performance Data

A simulated web server environment was used to measure the performance of NSA HTTP. The workload was a mix of static content (70%) and dynamic content (30%). The measurements were taken using Web servers that implement copy avoidance when servicing static requests. The performance improvement was about 13-17%. On workloads with only static content, the performance improvement was approximately 60-70%. The performance improvements can be significantly greater when NSA HTTP is used with web servers that do not implement copy avoidance for servicing static requests.

4.2.2 Socket Caching for TCP Connections

There is a finite amount of operating system overhead in opening and closing a TCP connection (for example, in the processing of the socket(), accept(), and close() system calls) that exists regardless of any data transfer over the lifetime of the connection. For long-lived connections, the cost of opening and closing a TCP connection is not significant when amortized over the life of the connection. However, for a short-lived connection, the overhead of opening and closing a connection can have a significant performance impact.

HP-UX 11i implements a socket caching feature for better performance of short-lived connections such as web connections. A considerable amount of kernel resources (such as TCP and IP level data structures and STREAMS data structures) is allocated for each new TCP endpoint. By avoiding the allocation of these resources each time an application opens a socket, or receives a socket with a new connection through the accept() system call, a server can proceed more quickly to the data transfer phase of the connection. The socket caching feature for TCP connections saves the endpoint resources instead of freeing them, speeding up the closing function. Once the cache is populated, new TCP connections can use cached resources, speeding up the opening of a connection. TCP endpoint resources cached when one TCP connection closes can be reused to open a new TCP connection by any application.

HP-UX 11i v3 has been enhanced to cache both IPv4 and IPv6 TCP connections. HP-UX 11i v1 and HP-UX 11i v2 support caching of IPv4 TCP connections only. HP-UX does not currently cache other transport protocols such as UDP.

4.2.2.1 Tuning Recommendation

Socket caching is enabled by default for IPv4 and IPv6 TCP connections. The default number of cached TCP endpoint resources is 512 per processor. The number of cached elements (TCP endpoint resources) can be changed using the ndd tunable socket_caching_tcp. For example, to set the number of cached elements to 1024:

# ndd -set /dev/sockets socket_caching_tcp 1024