White Paper

Monitoring Techniques for Cisco Network Registrar

© 2011 Cisco and/or its affiliates. All rights reserved. This document is Cisco Public Information.
Contents

Introduction
Statistics Collection
    Collect DHCP Server Statistics
    Collect DNS Server Statistics
    Collect TFTP Server Statistics
    Collect Host Statistics
Interpret the Collected Statistics
    Interpret DHCP Server Statistics
    Interpret DNS Server Statistics
    Interpret TFTP Server Statistics
Related Information
Introduction

This document describes techniques that you can use to collect and interpret the statistics required to monitor a Cisco Network Registrar (CNR) deployment. You must collect the statistics and store them in a usable format to make decisions about both CNR and the network in these areas:

• Capacity planning
• Attack detection
• Misconfiguration

The techniques in this document use these features to access the statistics counters that are available in the CNR version 6.1 servers:

• Server logging
• CLI commands
• API features

You can apply these techniques to earlier versions that support the same statistics counters, logging functions, CLI commands, and API calls.

Techniques for collecting host statistics (such as CPU, disk, and network usage) are outside the scope of this paper. The minimum set of host statistics that you should collect is listed in the Collect Host Statistics section.

Statistics Collection

Collecting statistics is an essential step in monitoring a CNR deployment. You can use the statistics to analyze trends and identify trouble spots.

Convert the collected statistics so that they all share a common time reference and file format. Comma-separated value (CSV) text files work well as an intermediate format for merging the inputs, and you can import them into reporting and charting products for analysis. Each line in the file carries a time stamp; if the deployment is geographically distributed, normalize the time stamps to a single time zone.

The recommended interval between measurements is five minutes. Each line in the CSV file records the data items measured during that period (the last five minutes). For example, if you measure the number of request packets processed by the DHCP server, each line should hold the number of packets received during that interval, not the cumulative number processed since the server started, which is the value that the server reports. Calculate the interval value as the difference between two successive readings.
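The per-interval calculation described above can be sketched in Python as follows. This is a minimal illustration, not part of CNR: it assumes you have already polled a cumulative counter (for example, total-requests) into a list of (time stamp, value) pairs, and it treats a drop in the cumulative value as a server restart, in which case the new reading is used as the interval count. The function names interval_deltas and to_csv are hypothetical.

```python
import csv
import io
from datetime import datetime, timezone


def interval_deltas(samples):
    """Convert (timestamp, cumulative_count) samples into per-interval counts.

    samples: list of (datetime, int) pairs ordered by time; each int is a
    cumulative counter as reported by the server (counted from server start).
    Assumption: a drop in the cumulative value means the server restarted,
    so the new reading itself is taken as the count for that interval.
    """
    rows = []
    for (_, v_prev), (t_cur, v_cur) in zip(samples, samples[1:]):
        delta = v_cur - v_prev if v_cur >= v_prev else v_cur
        # Normalize the time stamp to UTC so collectors in different
        # time zones produce comparable lines.
        rows.append((t_cur.astimezone(timezone.utc).isoformat(), delta))
    return rows


def to_csv(rows, column):
    """Render the per-interval rows as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["timestamp_utc", column])
    writer.writerows(rows)
    return buf.getvalue()
```

Writing UTC ISO 8601 time stamps is one way to meet the normalization requirement for distributed deployments; any single reference zone works as long as every collector uses it.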
Collect DHCP Server Statistics

The DHCP server collects basic statistics while it processes incoming requests. No configuration is required to enable this feature.

Extensions that run in the CNR DHCP server are an important performance factor to consider when you collect statistics. Any additional statistics that are available depend on the capabilities of the extension.

CNR API

Use the CNR API to collect the total statistics since the last server start. To sample the statistics, your program must poll the server and record the changes since the previous polling interval. Use the GetServerStats API call, which returns these attributes in an SCP object of type DHCPServerStats:
Attribute          Description
start-time         The date and time the server was last reloaded
total-discovers    Total DHCPDISCOVER packets received
total-requests     Total DHCPREQUEST packets received
total-releases     Total DHCPRELEASE packets received
total-offers       Total DHCPOFFER packets sent
total-acks         Total DHCPACK packets sent
total-naks         Total DHCPNAK packets sent
total-declines     Total DHCPDECLINE packets received

To retrieve these counters with the CLI, use the server getstats command:

nrcmd> dhcp getstats

DHCP Log Files

You can also use the DHCP activity summary log to collect statistics from the server. To enable the activity summary log, set the activity-summary flag in the DHCP log-settings attribute:

nrcmd> dhcp set log-settings=activity-summary

Set the report interval with the activity-summary-interval attribute:

nrcmd> dhcp set activity-summary-interval=1m

The counters are reported in message #05321. For example:

02/08/2004 16:00:02 name/dhcp/1 Activity Server 0 05320 DHCP activity, 60 seconds: Discovers: 20000, Offers: 20000, Requests: 20000, Acks: 20000, Nacks: 0, Rel.: 0, Decl.: 0, Exp.: 0, In use: Resp: 518, Req: 1, Acks/Second: 333.

If you enable failover, the failover-related counters are reported in message #05322.
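If you collect from the activity summary log rather than the API, the counters in an activity line like the one above can be extracted with a regular expression. This sketch assumes the "Label: value" layout shown in the example; the layout may differ between CNR versions, and parse_dhcp_activity is a hypothetical helper, not a CNR function.

```python
import re

# Matches the interval and the counter list of a "DHCP activity" line.
ACTIVITY_RE = re.compile(r"DHCP activity, (\d+) seconds: (.*)")
# Matches "Label: 123" pairs; labels such as "Rel." and "Acks/Second"
# contain dots and slashes. "In use:" has no number and is skipped.
PAIR_RE = re.compile(r"([A-Za-z./]+):\s*(\d+)")


def parse_dhcp_activity(line):
    """Return a dict of counters from an activity summary line, or None."""
    m = ACTIVITY_RE.search(line)
    if m is None:
        return None
    counters = {
        label.rstrip("."): int(value)  # "Rel." -> "Rel"
        for label, value in PAIR_RE.findall(m.group(2))
    }
    counters["interval_seconds"] = int(m.group(1))
    return counters
```

Each parsed dictionary already holds per-interval values, so it can be written to the CSV file directly, without the differencing needed for API totals.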
For example:

02/08/2004 16:00:02 name/dhcp/1 Activity Server 0 05322 Failover RECEIVED: 324, bndupd 0, ack 318, nak 0, pool 0, poll 6, updreq 0, upddone 0, SENT: 839, bndupd 833, ack 0, nak 0, pool 0, poll 6, updreq 0, upddone 0, MISSED: 0

Category   Label         Description
Activity   Discovers     DHCPDISCOVER packets received during the reporting interval
           Offers        DHCPOFFER packets sent during the reporting interval
           Requests      DHCPREQUEST packets received during the reporting interval
           Acks          DHCPACK packets sent during the reporting interval
           Nacks         DHCPNAK packets sent during the reporting interval
           Rel           DHCPRELEASE packets received during the reporting interval
           Decl          DHCPDECLINE packets received during the reporting interval
           Exp           Number of leases that expired during the reporting interval
           Resp          Number of DHCP server response buffers in use at the end of the reporting interval
           Req           Number of DHCP server request buffers in use at the end of the reporting interval
           Acks/Second   Average acks per second for the reporting interval, if greater than 0
Failover   Received      Failover packets received during the reporting interval
           Bndupd        Bind update packets received during the reporting interval
           Ack           Bind ack packets received during the reporting interval
           Nak           Bind nak packets received during the reporting interval
           Pool          Backup pool messages received during the reporting interval
           Poll          Polling (keep-alive) messages received during the reporting interval
           Updreq        Update request messages received during the reporting interval
           Upddone       Update done messages received during the reporting interval
           Sent          Failover packets sent during the reporting interval
           Bndupd        Bind update packets sent during the reporting interval
           Ack           Bind ack packets sent during the reporting interval
           Nak           Bind nak packets sent during the reporting interval
           Pool          Backup pool messages sent during the reporting interval
           Poll          Polling (keep-alive) messages sent during the reporting interval
           Updreq        Update request messages sent during the reporting interval
           Upddone       Update done messages sent during the reporting interval
           Missed        Failover packets dropped during the reporting interval

Collect DNS Server Statistics

The DNS server collects basic statistics, modeled after RFC 1611, during normal server processing. No configuration is required to enable this feature.

The server collects enhanced statistics in two forms: totals since the last server start and, when sample counters are enabled, counters for the current sample interval. The counters fall into these five categories:

• Performance
• Query
• Errors
• Security
• Maxcounters

To enable the sample counters for these groups, enable the collect-sample-counters attribute and set the sample interval:

nrcmd> dns enable collect-sample-counters
nrcmd> dns set activity-counter-interval=1m

The default sample interval is five minutes. No configuration is required to collect the total counters (measured from the last server start or administrative reset).
The DNS server counters include:

id: String identifier for this DNS server.
config-recurs: The recursion services offered by this server: available(1), the server performs recursion on requests from clients; restricted(2), the server performs recursion only on requests from certain clients; unavailable(3), recursion is not available.
config-up-time: The elapsed time since the DNS server process was started.
config-reset-time: The elapsed time since the DNS server was last reset (restarted).
config-reset: The server state: other(1), the server is in some unknown state; initializing(3), the server is (re)initializing; running(4), the server is currently running.
counter-auth-ans: The number of queries that were authoritatively answered.
counter-auth-no-names: The number of queries for which authoritative "no such name" responses were made.
counter-auth-no-data-resps: The number of queries for which authoritative "no such data" (empty answer) responses were made.
counter-non-auth-datas: The number of queries that were non-authoritatively answered (cached data).
counter-non-auth-no-datas: The number of queries that were non-authoritatively answered with no data (empty answer).
counter-referrals: The number of requests that were referred to other servers.
counter-errors: The number of requests the server has processed that were answered with errors (RCODE values other than 0 and 3; see RFC 1035, section 4.1.1).
counter-rel-names: The number of requests received for names that are only one label long (text form, no internal dots).
counter-req-refusals: The number of DNS requests refused by the server.
counter-req-unparses: The number of requests received that could not be parsed.
counter-other-errors: The number of requests that were aborted because of other (local) server errors.
counter-reset-time: The time stamp of the last administrative reset of the DNS counters.
sample-time: The time stamp of the last sample. Applies only when sample counters are enabled.
sample-interval: The counter sampling interval. Applies only when sample counters are enabled.

These are the enhanced counters, by category:

Performance
update-rrs: Total number of RRs added or deleted, including administrative updates from SCP. A single update can contain multiple deletes and adds.
update-packets: Total number of update packets successfully processed.
ixfrs-out: Number of successful outbound incremental zone transfers.
ixfrs-in: Number of successful inbound incremental zone transfers, including full zone responses.
ixfrs-full-resp: Number of successful outbound full zone transfers that originated from IXFR requests but required a full zone response because of IXFR errors, because the requested serial history was not available, or because there were too many changes in the zone.
axfrs-out: Number of successful outbound full zone transfers, including full zone responses to IXFR requests.
axfrs-in: Number of successful inbound full zone transfers.
queries: Number of query responses, including name queries, IXFR/AXFR query responses, and query forward responses, but excluding update replies.
xfrs-out-at-limit: Number of times the number of outbound zone transfers reached the concurrent limit (set by the DNS server visibility 3 attribute xfer-server-concurrent-limit, which defaults to five).
xfrs-in-at-limit: Number of times the number of inbound zone transfers reached the concurrent limit (set by the DNS server visibility 3 attribute xfer-server-concurrent-limit, which defaults to five).
notifies-out: Number of outbound NOTIFY packets.
notifies-in: Number of inbound NOTIFY packets.

Query
auth-answers: Number of queries that were authoritatively answered (RFC 1611).
auth-no-names: Number of queries for which authoritative "no such name" responses were made (RFC 1611).
auth-no-data-responses: Number of queries for which authoritative "no such data" (empty answer) responses were made (RFC 1611).
nonauth-answers: Number of queries that were non-authoritatively answered from cached data (RFC 1611).
nonauth-no-data-responses: Number of queries that were non-authoritatively answered with no data, that is, an empty answer (RFC 1611).
referrals: Number of requests that were referred to other servers (RFC 1611).
relative-name-requests: Number of requests received for names that were only one text label long (RFC 1611).
lame-delegations: Number of lame delegations.
mem-cache-hits: Number of internal memory cache lookup hits.
mem-cache-misses: Number of internal memory cache lookup misses.
mem-cache-writes: Number of cache record writes to the persistent cache DB.

Security
rcvd-tsig-packets: Number of received packets containing a TSIG record.
detected-tsig-bad-time: Bad TSIG time detected from the contents of an incoming packet.
detected-tsig-bad-key: Bad TSIG key detected from the contents of an incoming packet.
detected-tsig-bad-sig: Bad TSIG signature detected from the contents of an incoming packet.
rcvd-tsig-bad-time: Bad TSIG time reported in the TSIG error field of an incoming packet.
rcvd-tsig-bad-key: Bad TSIG key reported in the TSIG error field of an incoming packet.
rcvd-tsig-bad-sig: Bad TSIG signature reported in the TSIG error field of an incoming packet.
unauth-xfer-reqs: Number of restrict-xfer-acl ACL authorization failures for zones with restrict-xfer enabled.
unauth-update-reqs: Number of DNS update failures caused by update-acl ACL authorization failures or by zones configured with the dynamic attribute disabled.
restrict-query-acl: Number of query failures caused by restrict-query-acl ACL authorization failures.

Errors
update-errors: Number of errors detected in update packets, excluding TSIG errors.
ixfr-in-errors: Number of inbound IXFR errors, excluding packet format errors.
ixfr-out-errors: Number of outbound IXFR errors, excluding packet format errors.
axfr-in-errors: Number of inbound AXFR errors, excluding packet format errors.
axfr-out-errors: Number of outbound AXFR errors, excluding packet format errors.
sent-total-errors: Number of requests the server has processed that were answered with errors (RCODE values other than 0, 3, 6, 7, and 8; see RFC 1611).
rcvd-format-errors: Number of incoming packets received with the RCODE error field set to FORMERR.
sent-format-errors: Number of requests received that could not be parsed and resulted in a FORMERR response (RFC 1611).
sent-other-errors: Number of requests that were aborted because of other (local) server errors (RFC 1611).

Maxcounters
concurrent-xfrs-in: The maximum number of concurrent threads used for inbound zone transfers during this reporting interval.
concurrent-xfrs-out: The maximum number of concurrent threads used for outbound zone transfers during this reporting interval.
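Raw counters are often easier to interpret as ratios. As one possibility (a choice made for this sketch, not a CNR feature), the Query counters above can be reduced to a memory cache hit ratio and the non-authoritative share of responses, the two quantities weighed later when deciding whether a larger cache would help. Counter names are used as dictionary keys; how you obtain the values (API, CLI, or log) is up to you, and the function names are hypothetical.

```python
def cache_hit_ratio(hits, misses):
    """Fraction of internal memory cache lookups that hit (None if no lookups)."""
    lookups = hits + misses
    return hits / lookups if lookups else None


def nonauth_fraction(counters):
    """Share of query responses that were non-authoritative (from cache).

    counters: dict keyed by the Query-category counter names listed above.
    Missing keys count as zero.
    """
    auth = (counters.get("auth-answers", 0)
            + counters.get("auth-no-names", 0)
            + counters.get("auth-no-data-responses", 0))
    nonauth = (counters.get("nonauth-answers", 0)
               + counters.get("nonauth-no-data-responses", 0))
    total = auth + nonauth
    return nonauth / total if total else None
```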
CNR API

You can use the CNR API to collect either the total statistics since the last server start or the sample counters. Use the getcnrdnsserverstats API call, which returns these statistics:

dns-server-stats: The current DNS server statistics. Data type: SCP object of type DNSServerStats.
total-counters: The category counters measured since the last server start or administrative reset. Data type: SCP list of counter objects of type DNSServerPerformanceStats, DNSServerQueryStats, DNSServerSecurityStats, DNSServerErrorsStats, and DNSServerMaxCounterStats.
sample-counters: The sample counters measured during the last sampling interval. Data type: SCP list of counter objects of type DNSServerPerformanceStats, DNSServerQueryStats, DNSServerSecurityStats, DNSServerErrorsStats, and DNSServerMaxCounterStats.

To retrieve these counters with the CLI, use the server getstats command:

nrcmd> dns getstats
nrcmd> dns getstats all total
nrcmd> dns getstats all sample

DNS Log Files

You can use the DNS activity summary log to collect periodic statistics from the server. To enable the activity summary log, set these items:

• The activity-summary flag in the DNS log-settings attribute
• The report interval
• The categories to be logged

For example:

nrcmd> dns set log-settings=activity-summary
nrcmd> dns set activity-summary-interval=1m
nrcmd> dns set activity-counter-log-settings=total,sample,performance,query

The default report interval is five minutes.

Note: You must enable sample counters for counters to be reported for the sample interval.

The counters are reported, both as totals and for the latest sample interval, in messages 03523, 03573, 03574, 03575, 03576, 03577, 03578, 03579, 03580, and 03603. For example:

02/20/2004 15:48:41 name/dns/1 Info Server 0 03523 [Stats-Perform] Total since Fri Feb 20 15:42:15 2004 - update-rrs:0, update-packets:0, ixfrs-out:0, ixfrs-in:0, ixfrs-full-resp:5, axfrs-out:10, axfrs-in:0, queries:10, xfrs-out-at-limit:0, xfrs-in-at-l notifies-out:10, notifies-in:0.
02/20/2004 15:48:41 name/dns/1 Info Server 0 03573 [Stats-Perform] sampled at Fri Feb 20 15:43:41 2004 with interval of 300 sec - update-rrs:0, update-packets:0, ixfrs-out:0, ixfrs-in:0, ixfrs-full-resp:5, axfrs-out:10, axfrs-in:0, queries:10, xfrs-out xfrs-in-at-limit:0, notifies-out:10, notifies-in:0.
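The [Stats-*] log lines share a key:value layout, so one parser can handle both the performance lines above and the query lines that follow. This is a minimal sketch, assuming the header wording shown in these examples ("Total since" or "sampled at ... with interval of N sec"); parse_dns_stats is a hypothetical name, and truncated entries that lack a value are simply skipped.

```python
import re

HEADER_RE = re.compile(r"\[Stats-(\w+)\]\s+(Total since|sampled at)")
INTERVAL_RE = re.compile(r"with interval of (\d+) sec")
# Counter names are lowercase words joined by hyphens, followed by ":<digits>".
# Time stamps such as 15:42:15 start with a digit and are not matched.
PAIR_RE = re.compile(r"([a-z][a-z-]*):(\d+)")


def parse_dns_stats(line):
    """Return category, kind (total or sample), counters, and interval, or None."""
    m = HEADER_RE.search(line)
    if m is None:
        return None
    body = line[m.end():]
    record = {
        "category": m.group(1),  # for example "Perform" or "Query"
        "kind": "total" if m.group(2) == "Total since" else "sample",
        "counters": {k: int(v) for k, v in PAIR_RE.findall(body)},
    }
    interval = INTERVAL_RE.search(body)
    if interval:
        record["interval_seconds"] = int(interval.group(1))
    return record
```

Sample records already cover one interval, while total records need the same differencing as the API totals before they go into the CSV file.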
02/20/2004 14:28:42 name/dns/1 Info Server 0 03574 [Stats-Query] Total since Fri Feb 20 13:54:17 2004: auth-answers:1, auth-no-names:0, auth-no-data-responses:0, nonauth-answers:3, nonauth-no-data-responses:0, referrals:1, relative-name-requests:0, refusals:0, lame-delegations:0, mem-cache-hits:316, mem-cache-misses:124, mem-cache-writes

02/20/2004 14:29:42 name/dns/1 Info Server 0 03575 [Stats-Query] sampled at Fri Feb 20 14:28:42 2004 with interval of 60 sec: auth-answers:1, auth-no-names:0, auth-no-data-responses:0, nonauth-answers:3, nonauth-no-data-responses:3, referrals:0, relative-name-requests:1, refusals:0, lame-delegations:0, mem-cache-hits:0, mem-cache-misses:18, mem-cache-writes:4.

Collect TFTP Server Statistics

CNR API

You can use the CNR API to collect statistics since the last server start. To sample statistics, your program must poll the server and record the changes since the previous polling interval. Use the GetServerStats API call, which returns these attributes in an SCP object of type TFTPServerStats:

id: String identifier for this TFTP server.
server-start-time: The start time of the server.
server-reset-time: The time the server was last restarted or reloaded.
server-state: The server state: other(1), the server is in some unknown state; initializing(3), the server is (re)initializing; running(4), the server is currently running.
server-time-since-start: The elapsed time since the TFTP server process was started.
server-time-since-reset: The elapsed time since the TFTP server was last reset (restarted or reloaded).
total-packets-in-pool: Maximum number of packet buffers that the server can use.
total-packets-in-use: Total number of packet buffers currently in use by the server.
total-packets-received: Total number of packets the server has received since the last server reset.
total-packets-sent: Total number of packets the server has sent since the last server reset.
total-packets-drained: Total number of packets drained (read and discarded) since the last server reset. A packet is drained when the TFTP server is overwhelmed and all of its packet buffers are already in use, so none is available to process the incoming packet.
total-packets-dropped: Total number of packets the server has dropped, for any reason, since the last server reset. This includes packets that are unknown to the server, malformed, duplicated, or drained.
total-packets-malformed: Total number of malformed packets the server has received since the last server reset.
total-read-requests: Total number of read requests the server has received since the last server reset.
total-read-requests-completed: Total number of read requests that were completed since the last server reset.
total-read-requests-refused: Total number of read requests that the server refused since the last server reset.
total-read-requests-ignored: Total number of read requests that the server ignored since the last server reset.
total-read-requests-timed-out: Total number of read requests that timed out since the last server reset.
total-write-requests: Total number of write requests the server has received since the last server reset.
total-write-requests-completed: Total number of write requests that were completed since the last server reset.
total-write-requests-refused: Total number of write requests that the server refused since the last server reset.
total-write-requests-ignored: Total number of write requests that the server ignored since the last server reset.
total-write-requests-timed-out: Total number of write requests that timed out since the last server reset.
total-docsis-requests: Total number of CSRC 1.0 dynamic DOCSIS requests the server has received since the last server reset.
total-docsis-requests-completed: Total number of CSRC 1.0 dynamic DOCSIS requests that were completed since the last server reset.
total-docsis-requests-refused: Total number of CSRC 1.0 dynamic DOCSIS requests that the server refused since the last server reset.
total-docsis-requests-ignored: Total number of CSRC 1.0 dynamic DOCSIS requests that the server ignored since the last server reset.
total-docsis-requests-timed-out: Total number of CSRC 1.0 dynamic DOCSIS requests that timed out since the last server reset.
read-requests-per-second: Number of read requests per second processed during this reporting interval.
write-requests-per-second: Number of write requests per second processed during this reporting interval.
docsis-requests-per-second: Number of CSRC 1.0 dynamic DOCSIS requests per second processed during this reporting interval.

To retrieve these counters with the CLI, use the server getstats command:

nrcmd> tftp getstats

Collect Host Statistics

You must collect host statistics to plan host capacity and tune system performance. The mechanism used to collect this information is system dependent and is not in the scope of this document. At a minimum, collect these statistics:

• CPU usage: user, system, and wait
• Network usage: send and receive
• Disk usage: read and write
• Uptime

Collect them at the same rate and in the same manner as the server statistics, and record usage values as percentages.

Interpret the Collected Statistics

Interpreting the collected statistics is more an art than a science. You must develop heuristics and adjust them over time, because they depend on your deployment and its history. Calculate the steady state for every statistic that you collect. The following sections describe what to look for in these categories:

• Capacity planning
• Attack detection
• Misconfiguration

Each server has statistics that highlight what the server does; these are its performance indicators. Track the uptime of the servers and the machines, because unexpected restarts can warn of errors. Restarts can occur for maintenance.
However, if they occur frequently, you might have a problem that requires investigation.
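Uptime readings collected at each interval make restarts easy to spot: a reading lower than the previous one means a restart occurred in between. This sketch assumes uptime in seconds, oldest reading first, and a sampling interval shorter than typical uptime; the max_expected allowance for planned maintenance restarts is a hypothetical, site-chosen value, and the function names are illustrative.

```python
def count_restarts(uptimes):
    """Count restarts in a series of uptime readings (seconds), oldest first.

    A reading lower than the previous one means the process or machine
    restarted between the two samples. A restart followed by more uptime
    than the previous reading within one interval would be missed, which is
    why the sampling interval should be short compared with normal uptime.
    """
    return sum(1 for prev, cur in zip(uptimes, uptimes[1:]) if cur < prev)


def restarts_look_frequent(uptimes, max_expected=1):
    """True if the window holds more restarts than planned maintenance explains.

    max_expected is a hypothetical allowance chosen by the site.
    """
    return count_restarts(uptimes) > max_expected
```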
You can merge the collected host statistics with the collected server statistics to help plan capacity. With one CSV file that contains both machine and server statistics, you can create charts that map server performance against CPU, network, and disk usage.

Attack detection covers both malicious attempts to break your network and benign processes that cause more load than expected. You must know the steady-state rates for each server to determine whether an attack is occurring.

Misconfiguration can occur in both the server and the network configuration. You must know the deployment architecture to monitor the configuration.

Interpret DHCP Server Statistics

The number of DHCP messages processed is the main performance indicator for the DHCP server. Request and response buffers indicate the traffic load on the server. Failover counters indicate the state of failover synchronization.

Capacity Planning

Using the merged CSV text file, you can chart the CPU, network, and disk usage against the performance indicators of the DHCP server. From these charts you can determine which combination of machine resources affects the performance of your server, plan capacity, and tune the performance of the machine.

The number of request buffers in use indicates how many simultaneous requests the server is handling. When the network operates at a steady state, this value remains relatively constant. When a large reboot event occurs, the value jumps to the configured maximum. Further incoming packets are dropped, and the server accepts new requests only as pending requests complete. This algorithm leverages the fact that DHCP clients time out and retry if they do not receive a response. Dropping the extra requests lets the server devote its processing to only as many packets as it can answer within the client timeout, which minimizes the total time required to bring all clients online. Once the reboot event is over, the buffers in use return to steady-state values.
Note: Because the same pool of request buffers is used for both lease activity and failover activity, the number of request buffers in use never reaches 0 when failover is enabled, even in the absence of client activity.

The default value for max-dhcp-requests is 500, but you can tune it to the capacity of the server. Server capacity is defined in terms of the lease rate and the average latency of a lease transaction. For example, if the maximum capacity of the server is 1000 leases per second and, on average, leases are returned to the client in 500 ms, then a value of 500 is sufficient for the server to respond to clients at this rate. A lower value throttles the server below this capacity. A higher value increases latency during burst events but allows more clients to be serviced without retries. Given the typical client timeout of four seconds, an average latency of two seconds can be tolerated without additional client timeouts under the traffic load. You can measure the maximum leases per second as the rate at which CPU use reaches 100% or latency exceeds the maximum threshold.

The number of response buffers in use indicates how many simultaneous requests the server is completing. When the network operates at a steady state, this value remains constant and tracks the number of request buffers in use. If the server reaches its configured maximum, it can no longer respond to events. This should not occur and indicates a serious network problem. Because the same pool of response buffers is used for both lease activity and failover activity, the server adjusts this value to at least four times the number of request buffers, to ensure that sufficient resources are available to process all pending client and failover activity simultaneously.
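The max-dhcp-requests sizing example above is an instance of Little's law: the number of requests in progress equals the arrival rate multiplied by the average time each request spends in the server. The sketch below applies it, together with the four-to-one response-to-request buffer rule stated above. The function names are hypothetical, and the results are starting points to verify against measured CPU use and latency, not values computed by CNR.

```python
import math


def request_buffers_needed(leases_per_sec, avg_latency_sec):
    """Requests in flight = arrival rate x average latency (Little's law).

    Rounded up, because a fractional buffer is not possible.
    """
    return math.ceil(leases_per_sec * avg_latency_sec)


def min_response_buffers(request_buffers):
    """Mirror the rule that response buffers are at least 4x request buffers."""
    return 4 * request_buffers
```

For the example in the text, request_buffers_needed(1000, 0.5) gives 500. With the two-second average latency that clients tolerate, the same 1000 leases per second would keep about 2000 requests in progress.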
The performance of the DHCP server depends on the performance of external systems if LDAP client lookups are used or if the server is integrated into a Broadband Access Center (BAC) provisioning system. If the external systems operate at capacity, the DHCP server appears slower. If CPU utilization on the DHCP server does not reach 100% before the latency threshold is reached, this can indicate a performance problem in the provisioning system. In this case, fix the problem in the provisioning system, not at the DHCP server.

Attack Detection

One method of detecting a possible attack is to compare the rates of incoming DHCP messages with the steady-state rates of incoming requests. A large number of dropped packets, declines, or nacks can also be indicators; however, these can also indicate a misconfiguration. A large increase in requests can indicate that a CMTS was rebooted or that some portion of the network restarted after a power outage.

Misconfiguration

The presence of decline messages indicates a network configuration error or a misbehaving client. Addresses are marked unavailable when a decline is received and are reclaimed once the unavailable timeout period expires. However, the addresses continue to cycle through the unavailable state until the network problem is resolved. The DHCP server logs contain additional entries for specific error conditions that are encountered, and you can use them to help isolate the problem.

Some nacks are normal when failover partners resynchronize after an outage. An excessive number of nacks can indicate a configuration mismatch between the servers that prevents them from agreeing on the state of a lease. In this case, the servers might fail to complete resynchronization. You can use the failover configuration feature in the CNR web UI to verify and correct failover configuration issues.
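The steady-state comparison described under Attack Detection above can be automated. This sketch flags intervals whose count exceeds a multiple of the steady-state mean. The multiplier of 3 is a hypothetical default that you would tune from your own deployment history, and flag_deviations is an illustrative name. A flag only prompts investigation, since a CMTS reboot or a power restoration produces the same signature as an attack.

```python
from statistics import mean


def flag_deviations(baseline, observed, factor=3.0):
    """Return indexes of intervals whose count exceeds factor x baseline mean.

    baseline: per-interval counts from a period known to be steady state.
    observed: per-interval counts to check (for example, requests per interval).
    factor:   hypothetical multiplier; tune it from your deployment history.
    """
    steady = mean(baseline)
    return [i for i, value in enumerate(observed) if value > factor * steady]
```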
If the number of dropped DHCP messages rises above its steady state, a configuration error might exist in the provisioning system that prevents the server from assigning the client a valid address that matches its client-class assignment. The DHCP server logs contain additional entries for the specific conditions encountered.

Interpret DNS Server Statistics

The main performance indicators for the DNS server are the numbers of query, zone transfer, and update messages processed.

Capacity Planning

Using the merged CSV text file, you can chart the CPU, network, and disk usage against the performance indicators of the DNS server. From this information you can determine which combination of machine resources affects the performance of your server, plan capacity, and tune the performance of the machine.

A high number of memory cache misses can indicate that you should increase the size allocated to the cache to support a higher volume of queries. However, if most query responses are non-authoritative, cache misses can indicate that the TTLs of these records are too short for them to be usefully cached. In this case, a larger cache has little impact.

The performance counters xfrs-out-at-limit and xfrs-in-at-limit indicate the number of times the server was throttled by its configured limit while processing zone transfers. If the main function of the server is to support zone transfers (for example, it is a secondary server configured to service zone requests from a group of second-tier secondary servers that serve client queries), you can increase this limit to reduce the latency of zone updates. Take care when you change this value on general-purpose servers, since an increase in zone transfer responsiveness decreases the responsiveness of update and query processing.

You can use the query, zone transfer, and update performance counters to assess the primary role of each server in the network.

Attack Detection

One method of detecting a possible attack is to compare the rates of incoming query messages with the steady-state rates of incoming requests. A large number of no-such-data responses or ACL authorization failures can also be indicators; however, these can also indicate a misconfiguration. A large increase in queries can also indicate that some portion of the network restarted after a power outage.

Misconfiguration

The presence of lame delegation errors indicates a misconfiguration in the network that must be corrected on the originating name server. An excessive number of error packets received can indicate configuration problems in other DNS servers in the network. The DNS server logs contain additional entries for specific error conditions that you encounter, and you can use them to help isolate the problem.

An excessive number of no-such-data responses can indicate a configuration mismatch between the domain information provided to the client in a DHCP response and the zones configured on the DNS server. Correct these configuration errors at the DHCP server or the originating provisioning system, as appropriate for the deployment.

A large number of ACL authorization failures can indicate a configuration mismatch between servers, or a situation where clients issue requests from a new network that has not been added to the authorized list. The DNS server logs contain additional entries for specific error conditions that are encountered, and you can use them to help isolate the problem.

Interpret TFTP Server Statistics

The main performance indicators for the TFTP server are the numbers of read and write request messages processed. Packet buffers in use indicate the traffic load on the server.
Capacity Planning

Using the merged CSV text file, you can chart the CPU, network, and disk usage against the performance indicators of the TFTP server. From this information you can determine which combination of machine resources affects the performance of your server, plan capacity, and tune the performance of the machine.

The number of packet buffers in use indicates how many simultaneous requests the server is handling. When the network is operating at a steady state, this value should remain relatively constant. When a large reboot event occurs, this value can jump to the configured maximum (the server default is 512). Further incoming packets are dropped, and the server accepts new requests only as pending requests complete. This algorithm leverages the fact that clients time out and retry if they do not receive a response. Dropping the extra requests lets the server devote its processing to only the packets it can answer within the client timeout, which minimizes the total time required to bring all clients online. Once the reboot event is over, the buffers in use return to steady-state values.

This value does not need to be tuned. Because the TFTP protocol starts a new connection for each client request, configuring the server to accept more simultaneous connections can quickly exhaust server resources and degrade overall performance. The maximum value that can be configured is 1000.
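Buffer pressure on the TFTP server can be tracked from the GetServerStats attributes listed in the Collect TFTP Server Statistics section. This is a minimal sketch, assuming the attributes have already been fetched into a dictionary keyed by attribute name; the helper names are hypothetical. Utilization is total-packets-in-use divided by total-packets-in-pool, and a rise in total-packets-drained between two polls shows that the server ran out of buffers during that interval.

```python
def buffer_utilization(stats):
    """Fraction of TFTP packet buffers currently in use (0.0 if the pool is empty).

    stats: dict keyed by GetServerStats attribute names (assumed layout).
    """
    pool = stats["total-packets-in-pool"]
    return stats["total-packets-in-use"] / pool if pool else 0.0


def drained_since(prev, cur):
    """Packets drained between two polls.

    A non-zero result means the server had no free buffers at some point in
    the interval. If the counter went down, the server was reset, so the new
    reading is taken as the interval count.
    """
    before = prev["total-packets-drained"]
    after = cur["total-packets-drained"]
    return after - before if after >= before else after
```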
Attack Detection

One method of detecting a possible attack is to compare the rates of incoming read and write requests with the steady-state rates. A large number of refused or ignored requests can also be indicators; however, these can also indicate a misconfiguration. A large increase in requests can indicate that a CMTS was rebooted or that some portion of the network restarted after a power outage.

Misconfiguration

An excessive number of ignored read requests can indicate a configuration mismatch between the file information provided to the client in a DHCP or BOOTP response and the files available on the TFTP server. Correct such configuration errors at the DHCP server or the originating provisioning system, as appropriate for the deployment. The TFTP server logs contain additional entries for specific error conditions that are encountered, and you can use them to help isolate the problem.

Related Information

Technical Support - Cisco Systems

Printed in USA C11-682725-00 08/11