Cloudera Manager Health Checks
Cloudera, Inc.
Page Mill Road, Palo Alto, CA
Important Notice

Cloudera, Inc. All rights reserved.

Cloudera, the Cloudera logo, Cloudera Impala, Impala, and any other product or service names or slogans contained in this document, except as otherwise disclaimed, are trademarks of Cloudera and its suppliers or licensors, and may not be copied, imitated or used, in whole or in part, without the prior written permission of Cloudera or the applicable trademark holder. Hadoop and the Hadoop elephant logo are trademarks of the Apache Software Foundation. All other trademarks, registered trademarks, product names and company names or logos mentioned in this document are the property of their respective owners. Reference to any products, services, processes or other information, by trade name, trademark, manufacturer, supplier or otherwise does not constitute or imply endorsement, sponsorship or recommendation thereof by us.

Complying with all applicable copyright laws is the responsibility of the user. Without limiting the rights under copyright, no part of this document may be reproduced, stored in or introduced into a retrieval system, or transmitted in any form or by any means (electronic, mechanical, photocopying, recording, or otherwise), or for any purpose, without the express written permission of Cloudera.

Cloudera may have patents, patent applications, trademarks, copyrights, or other intellectual property rights covering subject matter in this document. Except as expressly provided in any written license agreement from Cloudera, the furnishing of this document does not give you any license to these patents, trademarks, copyrights, or other intellectual property.

The information in this document is subject to change without notice. Cloudera shall not be liable for any damages resulting from technical errors or omissions which may be present in this document, or from use of this document.

Version: 4.7
Date: September 5, 2013
Contents

Activity Monitor: Activity Monitor Pipeline; Activity Tree Pipeline; File Descriptors; Host Health; Log Directory Free Space; Process Status; Unexpected Exits; Web Server Status
Flume Agent: File Descriptors; Host Health; Log Directory Free Space; Process Status; Unexpected Exits
Alert Publisher: File Descriptors; Host Health; Log Directory Free Space; Process Status; Unexpected Exits
Cloudera Management Services: Activity Monitor Health; Alert Publisher Health; Event Server Health; Host Monitor Health; Navigator Health; Reports Manager Health; Service Monitor Health
DataNode: Block Count; Data Directory Status; File Descriptors; Free Space; GC Duration; Host Health; Log Directory Free Space; NameNode Connectivity; Process Status; Unexpected Exits; Web Server Status
Event Server: Event Store Size; File Descriptors; Host Health; Index Directory Free Space; Log Directory Free Space; Process Status; Unexpected Exits; Web Server Status; Write Pipeline
Failover Controller: File Descriptors; Host Health; Log Directory Free Space; Process Status; Unexpected Exits
Flume: Agents Health
HBase: Active HBase Master Health; Backup HBase Master Health; RegionServers Health
HBase REST Server: File Descriptors; Host Health; Log Directory Free Space; Process Status; Unexpected Exits
HBase Thrift Server: File Descriptors; Host Health; Log Directory Free Space; Process Status; Unexpected Exits
HDFS: Active NameNode Health; Canary; Corrupt Blocks; Corrupt Replicas; DataNodes Health; Free Space; Missing Blocks; NameNode Health; Standby NameNode Health; Under-Replicated Blocks
Host: Agent Log Directory; Agent Parcel Directory; Agent Process Directory; Agent Status; Clock Offset; DNS Resolution; DNS Resolution Duration; Frame Errors; Network Interface Speed; Swapping
Host Monitor: File Descriptors; Host Health; Host Pipeline; Log Directory Free Space; Process Status; Unexpected Exits; Web Server Status
HttpFS: File Descriptors; Host Health; Log Directory Free Space; Process Status; Unexpected Exits
Impala: Assignment Locality; Daemons Health; StateStore Health
Impala Daemon: File Descriptors; Host Health; Log Directory Free Space; Process Status; Resident Set Size; StateStore Connectivity; Unexpected Exits; Web Server Status
Impala StateStore Daemon: File Descriptors; Host Health; Log Directory Free Space; Process Status; Resident Set Size; Unexpected Exits; Web Server Status
JobTracker: File Descriptors; GC Duration; Host Health; Log Directory Free Space; Process Status; Unexpected Exits; Web Server Status
JournalNode: Edits Directory Free Space; File Descriptors; GC Duration; Host Health; Log Directory Free Space; Process Status; Sync Status; Unexpected Exits; Web Server Status
MapReduce: Active JobTracker Health; Job Failure Ratio; JobTracker Health; Map Task Backlog; Map Task Locality; Reduce Task Backlog; Standby JobTracker Health; TaskTrackers Health
Master: File Descriptors; GC Duration; HBase Master Canary; Host Health; Log Directory Free Space; Process Status; Unexpected Exits; Web Server Status
NameNode: Checkpoint Status; Data Directories Free Space; File Descriptors; GC Duration; Host Health; JournalNode Sync Status; Log Directory Free Space; Name Directory Status; Process Status; RPC Latency; Safe Mode Status; Unexpected Exits; Upgrade Status; Web Server Status
Navigator Server: File Descriptors; Host Health; Log Directory Free Space; Process Status; Unexpected Exits
RegionServer: Cluster Connectivity; Compaction Queue Size; File Descriptors; Flush Queue Size; GC Duration; HDFS Read Latency; HDFS Sync Latency; Host Health; Log Directory Free Space; Memstore Size; Process Status; Store File Index Size; Unexpected Exits; Web Server Status
Reports Manager: File Descriptors; Host Health; Log Directory Free Space; Process Status; Scratch Directory Free Space; Unexpected Exits
SecondaryNameNode: Checkpoint Directories Free Space; File Descriptors; GC Duration; Host Health; Log Directory Free Space; Process Status; Unexpected Exits; Web Server Status
ZooKeeper Server: Connection Count; Data Directory Free Space; Data Log Directory Free Space; File Descriptors; GC Duration; Host Health; Log Directory Free Space; Maximum Request Latency; Outstanding Requests; Process Status; Quorum Membership; Unexpected Exits
Service Monitor: File Descriptors; Host Health; Log Directory Free Space; Process Status; Role Pipeline; Unexpected Exits; Web Server Status
TaskTracker: Blacklisted Status; File Descriptors; GC Duration; Host Health; JobTracker Connectivity; Log Directory Free Space; Process Status; Unexpected Exits; Web Server Status
ZooKeeper: Canary; Servers Health; ZXID Rollover
Activity Monitor Activity Monitor Pipeline

Details: This Activity Monitor health check verifies that no messages are being dropped by the activity monitor stage of the Activity Monitor pipeline. A failure of this health check indicates a problem with the Activity Monitor, such as a configuration problem or a bug in the Activity Monitor. This test can be configured using the Activity Monitor Activity Monitor Pipeline Monitoring Time Period monitoring setting.

Short Name: Activity Monitor Pipeline

Settings:

Activity Monitor Activity Monitor Pipeline Monitoring: The health check for monitoring the Activity Monitor activity monitor pipeline. This specifies the number of dropped messages that will be tolerated over the monitoring time period. (Property: activitymonitor_activity_monitor_pipeline_; Default: critical:any, warning:never)

Activity Monitor Activity Monitor Pipeline Monitoring Time Period: The time period over which the Activity Monitor activity monitor pipeline will be monitored for dropped messages. (Property: activitymonitor_activity_monitor_pipeline_window; Default: 5 MINUTES)

Activity Monitor Activity Tree Pipeline

Details: This Activity Monitor health check verifies that no messages are being dropped by the activity tree stage of the Activity Monitor pipeline. A failure of this health check indicates a problem with the Activity Monitor, such as a configuration problem or a bug in the Activity Monitor. This test can be configured using the Activity Monitor Activity Tree Pipeline Monitoring Time Period monitoring setting.

Short Name: Activity Tree Pipeline

Settings:

Activity Monitor Activity Tree Pipeline Monitoring: The health check for monitoring the Activity Monitor activity tree pipeline. This specifies the number of dropped messages that will be tolerated over the monitoring time period. (Property: activitymonitor_activity_tree_pipeline_; Default: critical:any, warning:never)

Activity Monitor Activity Tree Pipeline Monitoring Time Period: The time period over which the Activity Monitor activity tree pipeline will be monitored for dropped messages. (Property: activitymonitor_activity_tree_pipeline_window; Default: 5 MINUTES)

Activity Monitor File Descriptors

Details: This Activity Monitor health check verifies that the number of file descriptors used does not rise above some percentage of the Activity Monitor's file descriptor limit. A failure of this health check may indicate a bug in either Hadoop or Cloudera Manager; contact Cloudera support. This test can be configured using the File Descriptor Monitoring Activity Monitor monitoring setting.

Short Name: File Descriptors

Settings:

File Descriptor Monitoring: The health check of the number of file descriptors used, specified as a percentage of the file descriptor limit. (Property: activitymonitor_fd_; Default: critical: , warning: )

Activity Monitor Host Health

Details: This Activity Monitor health check factors in the health of the host on which the Activity Monitor is running. A failure of this check means that the host running the Activity Monitor is experiencing some problem; see that host's status page for more details. This test can be enabled or disabled using the Activity Monitor Host Health Check monitoring setting.

Short Name: Host Health

Settings:

Activity Monitor Host Health Check: When computing the overall Activity Monitor health, consider the host's health. (Property: activitymonitor_host_health_enabled)

Activity Monitor Log Directory Free Space

Details: This Activity Monitor health check verifies that the filesystem containing the log directory of this Activity Monitor has sufficient free space. This test can be configured using the Log Directory Free Space Monitoring Absolute and Log Directory Free Space Monitoring Percentage Activity Monitor monitoring settings.

Short Name: Log Directory Free Space

Settings:

Log Directory Free Space Monitoring Absolute: The health check for monitoring of free space on the filesystem that contains this role's log directory. (Property: log_directory_free_space_absolute_; Default: critical: , warning: ; Unit: bytes)

Log Directory Free Space Monitoring Percentage: The health check for monitoring of free space on the filesystem that contains this role's log directory, specified as a percentage of the capacity of that filesystem. This setting is not used if a Log Directory Free Space Monitoring Absolute setting is configured. (Property: log_directory_free_space_percentage_; Default: critical:never, warning:never)

Activity Monitor Process Status

Details: This Activity Monitor health check verifies that the Cloudera Manager Agent on the Activity Monitor host is heartbeating correctly and that the process associated with the Activity Monitor role is in the state expected by Cloudera Manager. A failure of this health check may indicate a problem with the Activity Monitor process, a lack of connectivity to the Cloudera Manager Agent on the Activity Monitor host, or a problem with the Cloudera Manager Agent. This check can fail either because the Activity Monitor has crashed or because the Activity Monitor will not start or stop in a timely fashion. Check the Activity Monitor logs for more details. If the check fails because of problems communicating with the Cloudera Manager Agent on the Activity Monitor host, check the status of the Cloudera Manager Agent by running /etc/init.d/cloudera-scm-agent status on that host, or look in the Cloudera Manager Agent logs on that host for more details. This test can be enabled or disabled using the Activity Monitor Process Health Check monitoring setting.

Short Name: Process Status

Settings:

Activity Monitor Process Health Check: Enables the health check that the Activity Monitor's process state is consistent with the role configuration. (Property: activitymonitor_scm_health_enabled)

Activity Monitor Unexpected Exits

Details: This Activity Monitor health check verifies that the Activity Monitor has not recently exited unexpectedly. The check returns "Bad" health if the number of unexpected exits goes above a critical threshold. For example, if this check is configured with a critical threshold of 1, it returns "Good" health if there have been no unexpected exits recently, and "Bad" health if there have been one or more.
This test can be configured using the Unexpected Exits and Unexpected Exits Monitoring Period Activity Monitor monitoring settings.

Short Name: Unexpected Exits

Settings:

Unexpected Exits Monitoring Period: The period to review when computing unexpected exits. (Property: unexpected_exits_window; Default: 5 MINUTES)

Unexpected Exits: The health check for unexpected exits encountered within a recent period specified by the unexpected_exits_window configuration for the role. (Property: unexpected_exits_; Default: critical:any, warning:never)

Activity Monitor Web Server Status

Details: This Activity Monitor health check verifies that the web server of the Activity Monitor is responding quickly to requests from the Cloudera Manager Agent, and that the Cloudera Manager Agent can collect metrics from the web server. A failure of this health check may indicate a problem with the Activity Monitor's web server, a misconfiguration of the Activity Monitor, or a problem with the Cloudera Manager Agent. Consult the Cloudera Manager Agent logs and the Activity Monitor logs for more detail. If the test's failure message indicates a communication problem, the Cloudera Manager Agent's HTTP requests to the Activity Monitor's web server are failing or timing out; these requests are completely local to the Activity Monitor's host, and so should never fail under normal conditions. If the test's failure message indicates an unexpected response, the Activity Monitor's web server responded to the Cloudera Manager Agent's request, but the Cloudera Manager Agent could not interpret the response. This test can be configured using the Web Metric Collection Activity Monitor monitoring setting.

Short Name: Web Server Status

Settings:

Web Metric Collection: Enables the health check that the Cloudera Manager Agent can successfully contact and gather metrics from the web server. (Property: activitymonitor_web_metric_collection_enabled)
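The settings above express thresholds as expressions such as "critical:any, warning:never" or an integer count. As a rough sketch of how such an expression classifies an observed event count (the function and result names here are illustrative, not Cloudera Manager's implementation):

```python
def classify(count, critical, warning):
    """Classify an observed event count against threshold rules.

    Each rule is "any" (one or more events trips the level), "never"
    (the level is disabled), or an integer threshold. Illustrative
    sketch only -- not Cloudera Manager's actual code.
    """
    def trips(rule):
        if rule == "never":
            return False
        if rule == "any":
            return count >= 1
        return count >= int(rule)

    if trips(critical):
        return "Bad"
    if trips(warning):
        return "Concerning"
    return "Good"

# Unexpected Exits default is critical:any, warning:never --
# no exits in the window is "Good", a single exit is "Bad":
print(classify(0, "any", "never"))  # Good
print(classify(1, "any", "never"))  # Bad
```

This matches the worked example in the Unexpected Exits description: with a critical threshold of 1, zero recent exits yields "Good" health and one or more yields "Bad" health.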
Flume Agent File Descriptors

Details: This Flume Agent health check verifies that the number of file descriptors used does not rise above some percentage of the Flume Agent's file descriptor limit. A failure of this health check may indicate a bug in either Hadoop or Cloudera Manager; contact Cloudera support. This test can be configured using the File Descriptor Monitoring Agent monitoring setting.

Short Name: File Descriptors

Settings:

File Descriptor Monitoring: The health check of the number of file descriptors used, specified as a percentage of the file descriptor limit. (Property: flume_agent_fd_; Default: critical: , warning: )

Flume Agent Host Health

Details: This Flume Agent health check factors in the health of the host on which the Flume Agent is running. A failure of this check means that the host running the Flume Agent is experiencing some problem; see that host's status page for more details. This test can be enabled or disabled using the Flume Agent Host Health Check monitoring setting.

Short Name: Host Health

Settings:

Flume Agent Host Health Check: When computing the overall Flume Agent health, consider the host's health. (Property: flume_agent_host_health_enabled)

Flume Agent Log Directory Free Space

Details: This Flume Agent health check verifies that the filesystem containing the log directory of this Flume Agent has sufficient free space. This test can be configured using the Log Directory Free Space Monitoring Absolute and Log Directory Free Space Monitoring Percentage Agent monitoring settings.

Short Name: Log Directory Free Space
Settings:

Log Directory Free Space Monitoring Absolute: The health check for monitoring of free space on the filesystem that contains this role's log directory. (Property: log_directory_free_space_absolute_; Default: critical: , warning: ; Unit: bytes)

Log Directory Free Space Monitoring Percentage: The health check for monitoring of free space on the filesystem that contains this role's log directory, specified as a percentage of the capacity of that filesystem. This setting is not used if a Log Directory Free Space Monitoring Absolute setting is configured. (Property: log_directory_free_space_percentage_; Default: critical:never, warning:never)

Flume Agent Process Status

Details: This Flume Agent health check verifies that the Cloudera Manager Agent on the Flume Agent host is heartbeating correctly and that the process associated with the Flume Agent role is in the state expected by Cloudera Manager. A failure of this health check may indicate a problem with the Flume Agent process, a lack of connectivity to the Cloudera Manager Agent on the Flume Agent host, or a problem with the Cloudera Manager Agent. This check can fail either because the Flume Agent has crashed or because the Flume Agent will not start or stop in a timely fashion. Check the Flume Agent logs for more details. If the check fails because of problems communicating with the Cloudera Manager Agent on the Flume Agent host, check the status of the Cloudera Manager Agent by running /etc/init.d/cloudera-scm-agent status on that host, or look in the Cloudera Manager Agent logs on that host for more details. This test can be enabled or disabled using the Flume Agent Process Health Check monitoring setting.

Short Name: Process Status

Settings:

Flume Agent Process Health Check: Enables the health check that the Flume Agent's process state is consistent with the role configuration. (Property: flume_agent_scm_health_enabled)

Flume Agent Unexpected Exits

Details: This Flume Agent health check verifies that the Flume Agent has not recently exited unexpectedly. The check returns "Bad" health if the number of unexpected exits goes above a critical threshold. For example, if this check is configured with a critical threshold of 1, it returns "Good" health if there have been no unexpected exits recently, and "Bad" health if there have been one or more. This test can be configured using the Unexpected Exits and Unexpected Exits Monitoring Period Agent monitoring settings.

Short Name: Unexpected Exits

Settings:

Unexpected Exits Monitoring Period: The period to review when computing unexpected exits. (Property: unexpected_exits_window; Default: 5 MINUTES)

Unexpected Exits: The health check for unexpected exits encountered within a recent period specified by the unexpected_exits_window configuration for the role. (Property: unexpected_exits_; Default: critical:any, warning:never)

Alert Publisher File Descriptors

Details: This Alert Publisher health check verifies that the number of file descriptors used does not rise above some percentage of the Alert Publisher's file descriptor limit. A failure of this health check may indicate a bug in either Hadoop or Cloudera Manager; contact Cloudera support. This test can be configured using the File Descriptor Monitoring Alert Publisher monitoring setting.

Short Name: File Descriptors

Settings:

File Descriptor Monitoring: The health check of the number of file descriptors used, specified as a percentage of the file descriptor limit. (Property: alertpublisher_fd_; Default: critical: , warning: )

Alert Publisher Host Health

Details: This Alert Publisher health check factors in the health of the host on which the Alert Publisher is running. A failure of this check means that the host running the Alert Publisher is experiencing some problem; see that host's status page for more details. This test can be enabled or disabled using the Alert Publisher Host Health Check monitoring setting.

Short Name: Host Health

Settings:

Alert Publisher Host Health Check: When computing the overall Alert Publisher health, consider the host's health. (Property: alertpublisher_host_health_enabled)

Alert Publisher Log Directory Free Space

Details: This Alert Publisher health check verifies that the filesystem containing the log directory of this Alert Publisher has sufficient free space. This test can be configured using the Log Directory Free Space Monitoring Absolute and Log Directory Free Space Monitoring Percentage Alert Publisher monitoring settings.

Short Name: Log Directory Free Space

Settings:

Log Directory Free Space Monitoring Absolute: The health check for monitoring of free space on the filesystem that contains this role's log directory. (Property: log_directory_free_space_absolute_; Default: critical: , warning: ; Unit: bytes)

Log Directory Free Space Monitoring Percentage: The health check for monitoring of free space on the filesystem that contains this role's log directory, specified as a percentage of the capacity of that filesystem. This setting is not used if a Log Directory Free Space Monitoring Absolute setting is configured. (Property: log_directory_free_space_percentage_; Default: critical:never, warning:never)

Alert Publisher Process Status

Details: This Alert Publisher health check verifies that the Cloudera Manager Agent on the Alert Publisher host is heartbeating correctly and that the process associated with the Alert Publisher role is in the state expected by Cloudera Manager. A failure of this health check may indicate a problem with the Alert Publisher process, a lack of connectivity to the Cloudera Manager Agent on the Alert Publisher host, or a problem with the Cloudera Manager Agent. This check can fail either because the Alert Publisher has crashed or because the Alert Publisher will not start or stop in a timely fashion. Check the Alert Publisher logs for more details. If the check fails because of problems communicating with the Cloudera Manager Agent on the Alert Publisher host, check the status of the Cloudera Manager Agent by running /etc/init.d/cloudera-scm-agent status on that host, or look in the Cloudera Manager Agent logs on that host for more details. This test can be enabled or disabled using the Alert Publisher Process Health Check monitoring setting.

Short Name: Process Status

Settings:

Alert Publisher Process Health Check: Enables the health check that the Alert Publisher's process state is consistent with the role configuration. (Property: alertpublisher_scm_health_enabled)

Alert Publisher Unexpected Exits

Details: This Alert Publisher health check verifies that the Alert Publisher has not recently exited unexpectedly. The check returns "Bad" health if the number of unexpected exits goes above a critical threshold. For example, if this check is configured with a critical threshold of 1, it returns "Good" health if there have been no unexpected exits recently.
If there have been one or more unexpected exits recently, the check returns "Bad" health. This test can be configured using the Unexpected Exits and Unexpected Exits Monitoring Period Alert Publisher monitoring settings.

Short Name: Unexpected Exits

Settings:

Unexpected Exits Monitoring Period: The period to review when computing unexpected exits. (Property: unexpected_exits_window; Default: 5 MINUTES)

Unexpected Exits: The health check for unexpected exits encountered within a recent period specified by the unexpected_exits_window configuration for the role. (Property: unexpected_exits_; Default: critical:any, warning:never)

Cloudera Management Services Activity Monitor Health

Details: This Cloudera Management Services service-level health check verifies the presence of a running, healthy Activity Monitor. The check returns "Bad" health if the service is running and the Activity Monitor is not running; in all other cases it returns the health of the Activity Monitor. A failure of this health check indicates a stopped or unhealthy Activity Monitor; check the status of the Activity Monitor for more information. This test can be enabled or disabled using the Activity Monitor Role Health Check service-wide monitoring setting.

Short Name: Activity Monitor Health

Settings:

Activity Monitor Role Health Check: When computing the overall Management Service health, consider the Activity Monitor's health. (Property: mgmt_activitymonitor_health_enabled)

Cloudera Management Services Alert Publisher Health

Details: This Cloudera Management Services service-level health check verifies the presence of a running, healthy Alert Publisher. The check returns "Bad" health if the service is running and the Alert Publisher is not running; in all other cases it returns the health of the Alert Publisher. A failure of this health check indicates a stopped or unhealthy Alert Publisher; check the status of the Alert Publisher for more information. This test can be enabled or disabled using the Alert Publisher Role Health Check service-wide monitoring setting.

Short Name: Alert Publisher Health

Settings:

Alert Publisher Role Health Check: When computing the overall Management Service health, consider the Alert Publisher's health. (Property: mgmt_alertpublisher_health_enabled)

Cloudera Management Services Event Server Health

Details: This Cloudera Management Services service-level health check verifies the presence of a running, healthy Event Server. The check returns "Bad" health if the service is running and the Event Server is not running; in all other cases it returns the health of the Event Server. A failure of this health check indicates a stopped or unhealthy Event Server; check the status of the Event Server for more information. This test can be enabled or disabled using the Event Server Role Health Check service-wide monitoring setting.

Short Name: Event Server Health

Settings:

Event Server Role Health Check: When computing the overall Management Service health, consider the Event Server's health. (Property: mgmt_eventserver_health_enabled)

Cloudera Management Services Host Monitor Health

Details: This Cloudera Management Services service-level health check verifies the presence of a running, healthy Host Monitor. The check returns "Bad" health if the service is running and the Host Monitor is not running; in all other cases it returns the health of the Host Monitor. A failure of this health check indicates a stopped or unhealthy Host Monitor; check the status of the Host Monitor for more information. This test can be enabled or disabled using the Host Monitor Role Health Check service-wide monitoring setting.

Short Name: Host Monitor Health

Settings:

Host Monitor Role Health Check: When computing the overall Management Service health, consider the Host Monitor's health. (Property: mgmt_hostmonitor_health_enabled)

Cloudera Management Services Navigator Health

Details: This Cloudera Management Services service-level health check verifies the presence of a running, healthy Navigator Server. The check returns "Bad" health if the service is running and the Navigator Server is not running; in all other cases it returns the health of the Navigator Server. A failure of this health check indicates a stopped or unhealthy Navigator Server; check the status of the Navigator Server for more information. This test can be enabled or disabled using the Navigator Role Health Check service-wide monitoring setting.

Short Name: Navigator Health

Settings:

Navigator Role Health Check: When computing the overall Management Service health, consider Navigator's health. (Property: mgmt_navigator_health_enabled)

Cloudera Management Services Reports Manager Health

Details: This Cloudera Management Services service-level health check verifies the presence of a running, healthy Reports Manager. The check returns "Bad" health if the service is running and the Reports Manager is not running; in all other cases it returns the health of the Reports Manager. A failure of this health check indicates a stopped or unhealthy Reports Manager; check the status of the Reports Manager for more information. This test can be enabled or disabled using the Reports Manager Role Health Check service-wide monitoring setting.

Short Name: Reports Manager Health

Settings:

Reports Manager Role Health Check: When computing the overall Management Service health, consider the Reports Manager's health. (Property: mgmt_reportsmanager_health_enabled)
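Each of these service-level checks follows the same rule: "Bad" if the Management Service is running while the monitored role is not, and otherwise the role's own health. A minimal sketch of that rollup logic (the function and value names are ours, for illustration only):

```python
def service_level_health(service_running, role_running, role_health):
    """Roll a role's health up to the service level.

    Mirrors the rule described above: if the service is running and
    the role is not, the check is "Bad"; in all other cases it
    reports the role's own health. Illustrative sketch, not
    Cloudera Manager's implementation.
    """
    if service_running and not role_running:
        return "Bad"
    return role_health

# A stopped role under a running service is always "Bad":
print(service_level_health(True, False, "Good"))  # Bad
# Otherwise the role's own health passes through:
print(service_level_health(True, True, "Concerning"))  # Concerning
```

The pass-through behavior explains why a "Concerning" role makes the whole Management Service "Concerning" even though the role is running.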
Cloudera Management Services Service Monitor Health

Details: This Cloudera Management Services service-level health check verifies the presence of a running, healthy Service Monitor. The check returns "Bad" health if the service is running and the Service Monitor is not running; in all other cases it returns the health of the Service Monitor. A failure of this health check indicates a stopped or unhealthy Service Monitor; check the status of the Service Monitor for more information. This test can be enabled or disabled using the Service Monitor Role Health Check service-wide monitoring setting.

Short Name: Service Monitor Health

Settings:

Service Monitor Role Health Check: When computing the overall Management Service health, consider the Service Monitor's health. (Property: mgmt_servicemonitor_health_enabled)

DataNode Block Count

Details: This DataNode health check verifies that the DataNode does not have too many blocks. Having too many blocks on a DataNode may affect the DataNode's performance, and an increasing block count may require additional heap space to prevent long garbage collection pauses. This test can be configured using the DataNode Block Count monitoring setting.

Short Name: Block Count

Settings:

DataNode Block Count: The health check of the number of blocks on a DataNode. (Property: datanode_block_count_; Default: critical:never, warning: )

DataNode Data Directory Status

Details: This DataNode health check verifies that the DataNode has not reported any failed volumes. A failure of this health check indicates a problem with one or more volumes on the DataNode. See the DataNode system for more information. This test can be configured using the DataNode Volume Failures monitoring setting.

Short Name: Data Directory Status
25 DataNode File Descriptors DataNode Volume Failures The health check of failed volumes in a DataNode. datanode_volume_ failures_ critical:any, warning:never DataNode File Descriptors Details: This DataNode health check checks that the number of file descriptors used does not rise above some percentage of the DataNode file descriptor limit. A failure of this health check may indicate a bug in either Hadoop or Cloudera Manager. Contact Cloudera support. This test can be configured using the File Descriptor Monitoring DataNode monitoring setting. Short Name: File Descriptors File Descriptor Monitoring The health check of the number of file descriptors used. Specified as a percentage of file descriptor limit. datanode_fd_ critical: , warning: DataNode Free Space Details: This is a DataNode health check that checks that the amount of free space available for HDFS block data on the DataNode does not fall below some percentage of total configured capacity of the DataNode. A failure of this health check may indicate a capacity planning problem. Try adding more disk capacity and additional data directories to the DataNode, or add additional DataNodes and take steps to rebalance your HDFS cluster. This test can be configured using the DataNode Free Space Monitoring DataNode monitoring setting. Short Name: Free Space DataNode Free Space Monitoring The health check of free space in a DataNode. Specified as a percentage of the capacity on the DataNode. datanode_free_ space_ critical: , warning: Cloudera Manager 4.7 Health Checks 15
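Several of the checks above, such as DataNode Free Space, reduce to computing a percentage and comparing it against warning and critical thresholds. A minimal sketch of that computation using only the Python standard library; the path and the 10%/5% threshold values are illustrative assumptions, not Cloudera Manager defaults:

```python
# Sketch of a free-space percentage check in the spirit of DataNode Free Space
# Monitoring. Path and thresholds are illustrative, not Cloudera defaults.
import shutil

def free_space_percent(path):
    """Free space on the filesystem containing `path`, as a percentage."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.free / usage.total

def free_space_state(pct, warning=10.0, critical=5.0):
    """Map a free-space percentage onto the check's three health states."""
    if pct < critical:
        return "Bad"
    if pct < warning:
        return "Concerning"
    return "Good"

print(free_space_state(free_space_percent("/")))
```

The same shape (a metric, a warning threshold, a critical threshold) underlies the file descriptor and block count checks as well, with "never" meaning the corresponding threshold is disabled.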
DataNode GC Duration

Details: This DataNode health check checks that the DataNode is not spending too much time performing Java garbage collection: no more than some percentage of recent time may be spent in garbage collection. A failure of this health check may indicate a capacity planning problem or a misconfiguration of the DataNode. This test can be configured using the DataNode Garbage Collection Duration and DataNode Garbage Collection Duration Monitoring Period settings.

Short Name: GC Duration

Settings:
  DataNode Garbage Collection Duration Monitoring Period: The period to review when computing the moving average of garbage collection time. (Property: datanode_gc_duration_window; Default: 5 MINUTES)
  DataNode Garbage Collection Duration: The health check for the weighted average time spent in Java garbage collection, specified as a percentage of elapsed wall-clock time. See DataNode Garbage Collection Duration Monitoring Period. (Property: datanode_gc_duration_; Default: critical: …, warning: …)

DataNode Host Health

Details: This DataNode health check factors in the health of the host on which the DataNode is running. A failure of this check means that the host running the DataNode is experiencing some problem; see that host's status page for more details. This test can be enabled or disabled using the DataNode Host Health Check setting.

Short Name: Host Health

Settings:
  DataNode Host Health Check: When computing the overall DataNode health, consider the host's health. (Property: datanode_host_health_enabled)
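The GC Duration check above reviews a window of recent time and compares the share of it spent in garbage collection against thresholds. A minimal sketch of that computation; the sample data is illustrative, and real inputs would come from the role's JVM metrics:

```python
# Sketch of a windowed GC-duration check: percentage of recent wall-clock time
# spent in GC over a 5-minute review period. Sample data is illustrative.
WINDOW_SECONDS = 5 * 60  # mirrors the datanode_gc_duration_window default

def gc_time_percent(gc_events, now):
    """gc_events: iterable of (timestamp, pause_seconds) samples."""
    in_window = [pause for ts, pause in gc_events if now - ts <= WINDOW_SECONDS]
    return 100.0 * sum(in_window) / WINDOW_SECONDS

# 30 pauses of 0.3 s inside the window -> 9 s of GC out of 300 s
events = [(t, 0.3) for t in range(0, 300, 10)]
print(round(gc_time_percent(events, now=300.0), 1))  # -> 3.0
```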
DataNode Log Directory Free Space

Details: This DataNode health check checks that the filesystem containing the log directory of this DataNode has sufficient free space. This test can be configured using the Log Directory Free Space Monitoring Absolute and Log Directory Free Space Monitoring Percentage settings.

Short Name: Log Directory Free Space

Settings:
  Log Directory Free Space Monitoring Absolute: The health check for monitoring free space on the filesystem that contains this role's log directory. (Property: log_directory_free_space_absolute_thresholds; Default: critical: … BYTES, warning: …)
  Log Directory Free Space Monitoring Percentage: The health check for monitoring free space on the filesystem that contains this role's log directory, specified as a percentage of the capacity of that filesystem. This setting is not used if a Log Directory Free Space Monitoring Absolute setting is configured. (Property: log_directory_free_space_percentage_thresholds; Default: critical:never, warning:never)

DataNode NameNode Connectivity

Details: This DataNode health check checks that all running NameNodes in the HDFS service consider the DataNode alive. A failure of this health check may indicate that the DataNode is having trouble communicating with some or all NameNodes in the service; look in the DataNode logs for more details. This test can be enabled or disabled using the DataNode Connectivity Health Check setting. The DataNode Connectivity Tolerance at Startup DataNode monitoring setting and the Health Check Startup Tolerance NameNode monitoring setting can be used to control the check's tolerance windows around DataNode and NameNode restarts, respectively.

Short Name: NameNode Connectivity
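The precedence documented above (the percentage setting is not used when an absolute setting is configured) can be sketched as follows; the function name and threshold values are illustrative, not Cloudera Manager's implementation:

```python
# Sketch of the documented precedence for log-directory free-space thresholds:
# the percentage threshold is ignored whenever an absolute threshold is set.
# Values are illustrative. None models a threshold configured as "never".
def log_dir_free_space_state(free_bytes, total_bytes,
                             absolute_critical=None,
                             percentage_critical=None):
    if absolute_critical is not None:      # absolute setting wins
        return "Bad" if free_bytes < absolute_critical else "Good"
    if percentage_critical is not None:
        pct = 100.0 * free_bytes / total_bytes
        return "Bad" if pct < percentage_critical else "Good"
    return "Good"                          # both thresholds disabled

# 5 GiB free of 100 GiB: fine by the 1% rule, but the absolute 10 GiB
# threshold takes precedence and marks the check "Bad".
GiB = 1 << 30
print(log_dir_free_space_state(5 * GiB, 100 * GiB,
                               absolute_critical=10 * GiB,
                               percentage_critical=1.0))  # -> Bad
```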
Settings (DataNode NameNode Connectivity):
  DataNode Connectivity Health Check: Enables the health check that verifies the DataNode is connected to the NameNode. (Property: datanode_connectivity_health_enabled)
  DataNode Connectivity Tolerance at Startup: The amount of time to wait for the DataNode to fully start up and connect to the NameNode before enforcing the connectivity check. (Property: datanode_connectivity_tolerance; Default: 180 SECONDS)
  Health Check Startup Tolerance: The amount of time allowed after this role is started during which failures of health checks that rely on communication with this role will be tolerated. (Property: namenode_startup_tolerance; Default: 5 MINUTES)

DataNode Process Status

Details: This DataNode health check checks that the Cloudera Manager Agent on the DataNode host is heartbeating correctly and that the process associated with the DataNode role is in the state expected by Cloudera Manager. A failure of this health check may indicate a problem with the DataNode process, a lack of connectivity to the Cloudera Manager Agent on the DataNode host, or a problem with the Cloudera Manager Agent. This check can fail either because the DataNode has crashed or because the DataNode will not start or stop in a timely fashion; check the DataNode logs for more details. If the check fails because of problems communicating with the Cloudera Manager Agent on the DataNode host, check the status of the Agent by running /etc/init.d/cloudera-scm-agent status on the DataNode host, or look in the Cloudera Manager Agent logs on that host for more details. This test can be enabled or disabled using the DataNode Process Health Check setting.

Short Name: Process Status

Settings:
  DataNode Process Health Check: Enables the health check that the DataNode's process state is consistent with the role configuration. (Property: datanode_scm_health_enabled)

DataNode Unexpected Exits

Details: This DataNode health check checks that the DataNode has not recently exited unexpectedly. The check returns "Bad" health if the number of unexpected exits rises above a critical threshold. For example, if this check is configured with a critical threshold of 1, it returns "Good" health if there have been no unexpected exits recently, and "Bad" health if there have been one or more. This test can be configured using the Unexpected Exits and Unexpected Exits Monitoring Period settings.

Short Name: Unexpected Exits

Settings:
  Unexpected Exits Monitoring Period: The period to review when computing unexpected exits. (Property: unexpected_exits_window; Default: 5 MINUTES)
  Unexpected Exits: The health check for unexpected exits encountered within a recent period specified by the unexpected_exits_window configuration for the role. (Property: unexpected_exits_; Default: critical:any, warning:never)

DataNode Web Server Status

Details: This DataNode health check checks that the DataNode's web server is responding quickly to requests by the Cloudera Manager Agent, and that the Agent can collect metrics from the web server. A failure of this health check may indicate a problem with the DataNode's web server, a misconfiguration of the DataNode, or a problem with the Cloudera Manager Agent; consult the Cloudera Manager Agent logs and the DataNode logs for more detail. If the test's failure message indicates a communication problem, the Cloudera Manager Agent's HTTP requests to the DataNode's web server are failing or timing out; these requests are completely local to the DataNode's host and so should never fail under normal conditions. If the failure message indicates an unexpected response, the DataNode's web server responded to the Agent's request, but the Agent could not interpret the response. This test can be configured using the Web Metric Collection setting.

Short Name: Web Server Status

Settings:
  Web Metric Collection: Enables the health check that the Cloudera Manager Agent can successfully contact and gather metrics from the web server. (Property: datanode_web_metric_collection_enabled)

Event Server Event Store Size

Details: This Event Server health check checks that the event store size has not grown too far above the configured event store capacity. A failure of this health check indicates that the Event Server is having a problem performing cleanup; this may indicate a configuration problem or a bug in the Event Server. This test can be configured using the Event Store Capacity Monitoring setting.

Short Name: Event Store Size

Settings:
  Event Store Capacity Monitoring: The health check on the number of events in the event store, specified as a percentage of the maximum number of events in the Event Server store. (Property: eventserver_capacity_; Default: critical: …, warning: …)

Event Server File Descriptors

Details: This Event Server health check checks that the number of file descriptors used does not rise above some percentage of the Event Server file descriptor limit. A failure of this health check may indicate a bug in either Hadoop or Cloudera Manager; contact Cloudera support. This test can be configured using the File Descriptor Monitoring setting.

Short Name: File Descriptors
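The Web Server Status checks above rely on the Agent making local HTTP requests to a role's web server, where a hung server and an unparseable reply are distinct failure modes. A rough sketch of such a collection step; the port argument and the /jmx endpoint are assumptions (Hadoop-family daemons conventionally expose metrics at /jmx), and this is not the Agent's actual code:

```python
# Illustrative local metric fetch in the spirit of Web Metric Collection.
# A short timeout distinguishes a hung web server (communication problem)
# from a reply the collector cannot parse (unexpected response).
import json
import urllib.request

def collect_web_metrics(port, timeout=5.0):
    url = f"http://localhost:{port}/jmx"   # endpoint is an assumption
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return json.load(resp)         # raises ValueError on a bad payload
    except OSError as exc:                 # refused, timed out, unreachable
        return {"error": str(exc)}
```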
Settings (Event Server File Descriptors):
  File Descriptor Monitoring: The health check of the number of file descriptors used, specified as a percentage of the file descriptor limit. (Property: eventserver_fd_; Default: critical: …, warning: …)

Event Server Host Health

Details: This Event Server health check factors in the health of the host on which the Event Server is running. A failure of this check means that the host running the Event Server is experiencing some problem; see that host's status page for more details. This test can be enabled or disabled using the Event Server Host Health Check setting.

Short Name: Host Health

Settings:
  Event Server Host Health Check: When computing the overall Event Server health, consider the host's health. (Property: eventserver_host_health_enabled)

Event Server Index Directory Free Space

Details: This Event Server health check checks that the filesystem containing the index directory of this Event Server has sufficient free space. This test can be configured using the Index Directory Free Space Monitoring Absolute and Index Directory Free Space Monitoring Percentage settings.

Short Name: Index Directory Free Space

Settings:
  Index Directory Free Space Monitoring Absolute: The health check for monitoring free space on the filesystem that contains the index directory. (Property: eventserver_index_directory_free_space_absolute_; Default: critical: …, warning: … BYTES)
  Index Directory Free Space Monitoring Percentage: The health check for monitoring free space on the filesystem that contains the index directory, specified as a percentage of the capacity of that filesystem. This setting is not used if an Index Directory Free Space Monitoring Absolute setting is configured. (Property: eventserver_index_directory_free_space_percentage_; Default: critical:never, warning:never)

Event Server Log Directory Free Space

Details: This Event Server health check checks that the filesystem containing the log directory of this Event Server has sufficient free space. This test can be configured using the Log Directory Free Space Monitoring Absolute and Log Directory Free Space Monitoring Percentage settings.

Short Name: Log Directory Free Space

Settings:
  Log Directory Free Space Monitoring Absolute: The health check for monitoring free space on the filesystem that contains this role's log directory. (Property: log_directory_free_space_absolute_thresholds; Default: critical: … BYTES, warning: …)
  Log Directory Free Space Monitoring Percentage: The health check for monitoring free space on the filesystem that contains this role's log directory, specified as a percentage of the capacity of that filesystem. This setting is not used if a Log Directory Free Space Monitoring Absolute setting is configured. (Property: log_directory_free_space_percentage_thresholds; Default: critical:never, warning:never)

Event Server Process Status

Details: This Event Server health check checks that the Cloudera Manager Agent on the Event Server host is heartbeating correctly and that the process associated with the Event Server role is in the state expected by Cloudera Manager. A failure of this health check may indicate a problem with the Event Server process, a lack of connectivity to the Cloudera Manager Agent on the Event Server host, or a problem with the Cloudera Manager Agent. This check can fail either because the Event Server has crashed or because the Event Server will not start or stop in a timely fashion; check the Event Server logs for more details. If the check fails because of problems communicating with the Cloudera Manager Agent on the Event Server host, check the status of the Agent by running /etc/init.d/cloudera-scm-agent status on the Event Server host, or look in the Cloudera Manager Agent logs on that host for more details. This test can be enabled or disabled using the Event Server Process Health Check setting.

Short Name: Process Status

Settings:
  Event Server Process Health Check: Enables the health check that the Event Server's process state is consistent with the role configuration. (Property: eventserver_scm_health_enabled)

Event Server Unexpected Exits

Details: This Event Server health check checks that the Event Server has not recently exited unexpectedly. The check returns "Bad" health if the number of unexpected exits rises above a critical threshold. For example, if this check is configured with a critical threshold of 1, it returns "Good" health if there have been no unexpected exits recently, and "Bad" health if there have been one or more. This test can be configured using the Unexpected Exits and Unexpected Exits Monitoring Period settings.

Short Name: Unexpected Exits
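The Unexpected Exits checks in this document all follow the same windowed pattern. A minimal sketch, using the 5-minute default window; modeling the default critical threshold of "any" as the value 1 is an interpretation, not Cloudera Manager's implementation:

```python
# Sketch of an Unexpected Exits evaluation: count exits that fall inside the
# review window (unexpected_exits_window) and compare against the critical
# threshold. Treating "any" as a threshold of 1 is an assumption.
WINDOW_SECONDS = 5 * 60

def unexpected_exits_state(exit_times, now, critical=1):
    recent = [t for t in exit_times if now - t <= WINDOW_SECONDS]
    return "Bad" if len(recent) >= critical else "Good"

print(unexpected_exits_state([100.0], now=200.0))  # -> Bad (exit 100 s ago)
print(unexpected_exits_state([100.0], now=700.0))  # -> Good (outside window)
```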
Settings (Event Server Unexpected Exits):
  Unexpected Exits Monitoring Period: The period to review when computing unexpected exits. (Property: unexpected_exits_window; Default: 5 MINUTES)
  Unexpected Exits: The health check for unexpected exits encountered within a recent period specified by the unexpected_exits_window configuration for the role. (Property: unexpected_exits_; Default: critical:any, warning:never)

Event Server Web Server Status

Details: This Event Server health check checks that the Event Server's web server is responding quickly to requests by the Cloudera Manager Agent, and that the Agent can collect metrics from the web server. A failure of this health check may indicate a problem with the Event Server's web server, a misconfiguration of the Event Server, or a problem with the Cloudera Manager Agent; consult the Cloudera Manager Agent logs and the Event Server logs for more detail. If the test's failure message indicates a communication problem, the Cloudera Manager Agent's HTTP requests to the Event Server's web server are failing or timing out; these requests are completely local to the Event Server's host and so should never fail under normal conditions. If the failure message indicates an unexpected response, the Event Server's web server responded to the Agent's request, but the Agent could not interpret the response. This test can be configured using the Web Metric Collection setting.

Short Name: Web Server Status

Settings:
  Web Metric Collection: Enables the health check that the Cloudera Manager Agent can successfully contact and gather metrics from the web server. (Property: eventserver_web_metric_collection_enabled)

Event Server Write Pipeline

Details: This Event Server health check checks that no messages are being dropped by the writer stage of the Event Server pipeline. A failure of this health check indicates a problem with the Event Server; this may indicate a configuration problem or a bug in the Event Server. This test can be configured using the Event Server Write Pipeline Monitoring Time Period setting.

Short Name: Write Pipeline

Settings:
  Event Server Write Pipeline Monitoring: The health check for monitoring the Event Server write pipeline. This specifies the number of dropped messages that will be tolerated over the monitoring time period. (Property: eventserver_write_pipeline_; Default: critical:any, warning:never)
  Event Server Write Pipeline Monitoring Time Period: The time period over which the Event Server write pipeline will be monitored for dropped messages. (Property: eventserver_write_pipeline_window; Default: 5 MINUTES)

Failover Controller File Descriptors

Details: This Failover Controller health check checks that the number of file descriptors used does not rise above some percentage of the Failover Controller file descriptor limit. A failure of this health check may indicate a bug in either Hadoop or Cloudera Manager; contact Cloudera support. This test can be configured using the File Descriptor Monitoring setting.

Short Name: File Descriptors

Settings:
  File Descriptor Monitoring: The health check of the number of file descriptors used, specified as a percentage of the file descriptor limit. (Property: failovercontroller_fd_; Default: critical: …, warning: …)
Failover Controller Host Health

Details: This Failover Controller health check factors in the health of the host on which the Failover Controller is running. A failure of this check means that the host running the Failover Controller is experiencing some problem; see that host's status page for more details. This test can be enabled or disabled using the FailoverController Host Health Check setting.

Short Name: Host Health

Settings:
  FailoverController Host Health Check: When computing the overall FailoverController health, consider the host's health. (Property: failovercontroller_host_health_enabled)

Failover Controller Log Directory Free Space

Details: This Failover Controller health check checks that the filesystem containing the log directory of this Failover Controller has sufficient free space. This test can be configured using the Log Directory Free Space Monitoring Absolute and Log Directory Free Space Monitoring Percentage settings.

Short Name: Log Directory Free Space

Settings:
  Log Directory Free Space Monitoring Absolute: The health check for monitoring free space on the filesystem that contains this role's log directory. (Property: log_directory_free_space_absolute_thresholds; Default: critical: …, warning: … BYTES)
  Log Directory Free Space Monitoring Percentage: The health check for monitoring free space on the filesystem that contains this role's log directory, specified as a percentage of the capacity of that filesystem. This setting is not used if a Log Directory Free Space Monitoring Absolute setting is configured. (Property: log_directory_free_space_percentage_thresholds; Default: critical:never, warning:never)

Failover Controller Process Status

Details: This Failover Controller health check checks that the Cloudera Manager Agent on the Failover Controller host is heartbeating correctly and that the process associated with the Failover Controller role is in the state expected by Cloudera Manager. A failure of this health check may indicate a problem with the Failover Controller process, a lack of connectivity to the Cloudera Manager Agent on the Failover Controller host, or a problem with the Cloudera Manager Agent. This check can fail either because the Failover Controller has crashed or because the Failover Controller will not start or stop in a timely fashion; check the Failover Controller logs for more details. If the check fails because of problems communicating with the Cloudera Manager Agent on the Failover Controller host, check the status of the Agent by running /etc/init.d/cloudera-scm-agent status on the Failover Controller host, or look in the Cloudera Manager Agent logs on that host for more details. This test can be enabled or disabled using the FailoverController Process Health Check setting.

Short Name: Process Status

Settings:
  FailoverController Process Health Check: Enables the health check that the FailoverController's process state is consistent with the role configuration. (Property: failovercontroller_scm_health_enabled)

Failover Controller Unexpected Exits

Details: This Failover Controller health check checks that the Failover Controller has not recently exited unexpectedly. The check returns "Bad" health if the number of unexpected exits rises above a critical threshold. For example, if this check is configured with a critical threshold of 1, it returns "Good" health if there have been no unexpected exits recently, and "Bad" health if there have been one or more. This test can be configured using the Unexpected Exits and Unexpected Exits Monitoring Period settings.

Short Name: Unexpected Exits

Settings:
  Unexpected Exits Monitoring Period: The period to review when computing unexpected exits. (Property: unexpected_exits_window; Default: 5 MINUTES)
  Unexpected Exits: The health check for unexpected exits encountered within a recent period specified by the unexpected_exits_window configuration for the role. (Property: unexpected_exits_; Default: critical:any, warning:never)

Flume Agents Health

Details: This Flume service-level health check checks that enough of the Flume agents in the cluster are healthy. The check returns "Concerning" health if the number of healthy Flume agents falls below a warning threshold, expressed as a percentage of the total number of Flume agents. The check returns "Bad" health if the number of healthy and "Concerning" Flume agents falls below a critical threshold, expressed as a percentage of the total number of Flume agents. For example, if this check is configured with a warning threshold of 80% and a critical threshold of 60% for a cluster of five Flume agents, it returns "Good" health if four or more agents have good health, "Concerning" health if at least three agents have either "Good" or "Concerning" health, and "Bad" health if more than two agents have bad health. A failure of this health check indicates unhealthy Flume agents; check the status of the individual Flume agents for more information. This test can be configured using the Healthy Flume Agent Monitoring service-wide monitoring setting.

Short Name: Flume Agents Health

Settings:
  Healthy Flume Agent Monitoring: The health check of the overall Flume agents' health. The check returns "Concerning" health if the percentage of "Healthy" Flume agents falls below the warning threshold, and is unhealthy if the total percentage of "Healthy" and "Concerning" Flume agents falls below the critical threshold. (Property: flume_agents_healthy_; Default: critical:never, warning: …)
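The five-agent worked example above (warning 80%, critical 60%) can be sketched as a percentage roll-up; the function name and signature are illustrative:

```python
# Sketch of the Flume Agents Health roll-up described above: "Bad" when
# healthy plus "Concerning" agents fall below the critical percentage,
# "Concerning" when healthy agents fall below the warning percentage.
def flume_agents_state(healthy, concerning, total,
                       warning_pct=80.0, critical_pct=60.0):
    if 100.0 * (healthy + concerning) / total < critical_pct:
        return "Bad"
    if 100.0 * healthy / total < warning_pct:
        return "Concerning"
    return "Good"

# The worked example from the text, for a cluster of five agents:
print(flume_agents_state(4, 0, 5))  # -> Good
print(flume_agents_state(3, 0, 5))  # -> Concerning
print(flume_agents_state(2, 0, 5))  # -> Bad
```

The HBase RegionServers Health check later in this document uses the same roll-up with different default percentages.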
HBase Active HBase Master Health

Details: This HBase service-level health check checks for the presence of an active, running, healthy HBase Master. The check returns "Bad" health if the service is running and a running, active Master cannot be found. In all other cases it returns the health of the running, active Master. A failure of this health check may indicate stopped or unhealthy Master roles, or it may indicate a problem with communication between the Cloudera Manager Service Monitor and the HBase service. Check the status of the HBase service's Master roles and look in the Cloudera Manager Service Monitor's log files for more information when this check fails. This test can be enabled or disabled using the Active Master Health Check service-wide monitoring setting. In addition, the HBase Active Master Detection Window can be used to adjust the amount of time that the Cloudera Manager Service Monitor has to detect the active HBase Master before this health check fails.

Short Name: Active HBase Master Health

Settings:
  Active Master Health Check: When computing the overall HBase cluster health, consider the active HBase Master's health. (Property: hbase_master_health_enabled)
  HBase Active Master Detection Window: The tolerance window used in HBase service tests that depend on detection of the active HBase Master. (Property: hbase_active_master_detecton_window; Default: 3 MINUTES)

HBase Backup HBase Master Health

Details: This HBase service-level health check checks for running, healthy HBase Masters in backup mode. The check is disabled if the HBase service is not configured with multiple HBase Masters. Otherwise, the check returns "Concerning" health if either of two conditions is met: first, if there is no HBase Master running in backup mode; second, if any of the HBase Masters running in backup mode are in less than "Good" health. The second condition is included because a failure of the active HBase Master leads to a race between all backup HBase Masters. If a less than healthy backup HBase Master won such a race and became the active Master, the HBase service could end up with a less than healthy active HBase Master. A failure of this health check may indicate one or more stopped or unhealthy backup HBase Masters, or it may indicate a problem with communication between the Cloudera Manager Service Monitor and the HBase service. Check the status of the HBase service's Master roles and the Cloudera Manager Service Monitor's log files for more information when this check fails. This test can be enabled or disabled using the Backup Masters Health Check service-wide monitoring setting. In addition, the HBase Active Master Detection Window can be used to adjust the amount of time that the Cloudera Manager Service Monitor has to detect the active HBase Master before this health check fails.

Short Name: Backup HBase Master Health

Settings:
  Backup Masters Health Check: When computing the overall HBase cluster health, consider the health of the backup HBase Masters. (Property: hbase_backup_masters_health_enabled)
  HBase Active Master Detection Window: The tolerance window used in HBase service tests that depend on detection of the active HBase Master. (Property: hbase_active_master_detecton_window; Default: 3 MINUTES)

HBase RegionServers Health

Details: This HBase service-level health check checks that enough of the RegionServers in the cluster are healthy. The check returns "Concerning" health if the number of healthy RegionServers falls below a warning threshold, expressed as a percentage of the total number of RegionServers. The check returns "Bad" health if the number of healthy and "Concerning" RegionServers falls below a critical threshold, expressed as a percentage of the total number of RegionServers. For example, if this check is configured with a warning threshold of 95% and a critical threshold of 90% for a cluster of 100 RegionServers, it returns "Good" health if 95 or more RegionServers have good health, "Concerning" health if at least 90 RegionServers have either "Good" or "Concerning" health, and "Bad" health if more than 10 RegionServers have bad health. A failure of this health check indicates unhealthy RegionServers; check the status of the individual RegionServers for more information. This test can be configured using the Healthy HBase Region Servers Monitoring service-wide monitoring setting.

Short Name: RegionServers Health
41 HBase REST Server File Descriptors Healthy HBase Region Servers Monitoring The health check of the overall HBase Region Servers health. The check returns "Concerning" health if the percentage of "Healthy" HBase Region Servers falls below the warning threshold. The check is unhealthy if the total percentage of "Healthy" and "Concerning" HBase Region Servers falls below the critical threshold. hbase_regionservers_ healthy_ critical: , warning: HBase REST Server File Descriptors Details: This HBase REST Server health check checks that the number of file descriptors used does not rise above some percentage of the HBase REST Server file descriptor limit. A failure of this health check may indicate a bug in either Hadoop or Cloudera Manager. Contact Cloudera support. This test can be configured using the File Descriptor Monitoring HBase REST Server monitoring setting. Short Name: File Descriptors File Descriptor Monitoring The health check of the number of file descriptors used. Specified as a percentage of file descriptor limit. hbaserestserver_fd_ critical: , warning: HBase REST Server Host Health Details: This HBase REST Server health check factors in the health of the host upon which the HBase REST Server is running. A failure of this check means that the host running the HBase REST Server is experiencing some problem. See that host's status page for more details.this test can be enabled or disabled using the HBase Rest Server Host Health Check HBase REST Server monitoring setting. Short Name: Host Health Cloudera Manager 4.7 Health Checks 31
HBase Rest Server Host Health Check: When computing the overall HBase Rest Server health, consider the host's health. hbaserestserver_host_health_enabled

HBase REST Server Log Directory Free Space

Details: This HBase REST Server health check verifies that the filesystem containing the log directory of this HBase REST Server has sufficient free space. This test can be configured using the Log Directory Free Space Monitoring Absolute and Log Directory Free Space Monitoring Percentage HBase REST Server monitoring settings.

Short Name: Log Directory Free Space

Log Directory Free Space Monitoring Absolute: The health check for monitoring of free space on the filesystem that contains this role's log directory. log_directory_free_space_absolute_ critical: , warning: BYTES

Log Directory Free Space Monitoring Percentage: The health check for monitoring of free space on the filesystem that contains this role's log directory. Specified as a percentage of the capacity on that filesystem. This setting is not used if a Log Directory Free Space Monitoring Absolute setting is configured. log_directory_free_space_percentage_ critical:never, warning:never

HBase REST Server Process Status

Details: This HBase REST Server health check verifies that the Cloudera Manager Agent on the HBase REST Server host is heartbeating correctly and that the process associated with the HBase REST Server role is in the state expected by Cloudera Manager. A failure of this health check may indicate a problem with the HBase REST Server process, a lack of connectivity to the Cloudera Manager Agent on the HBase REST Server host, or a problem with the Cloudera Manager Agent. This check can fail either because the HBase REST Server has crashed or because the HBase REST Server will not start or stop in a timely fashion. Check the HBase REST Server logs for more details. If the check fails because of problems communicating with the Cloudera Manager Agent on the HBase REST Server host, check the status of the Cloudera Manager Agent by running /etc/init.d/cloudera-scm-agent status on the HBase REST Server host, or look in the Cloudera Manager Agent logs on the HBase REST Server host for more details. This test can be enabled or disabled using the HBase Rest Server Process Health Check HBase REST Server monitoring setting.

Short Name: Process Status

HBase Rest Server Process Health Check: Enables the health check that the HBase Rest Server's process state is consistent with the role configuration. hbaserestserver_scm_health_enabled

HBase REST Server Unexpected Exits

Details: This HBase REST Server health check verifies that the HBase REST Server has not recently exited unexpectedly. The check returns "Bad" health if the number of unexpected exits goes above a critical threshold. For example, if this check is configured with a critical threshold of 1, this check would return "Good" health if there have been no unexpected exits recently. If there have been one or more unexpected exits recently, this check would return "Bad" health. This test can be configured using the Unexpected Exits and Unexpected Exits Monitoring Period HBase REST Server monitoring settings.

Short Name: Unexpected Exits

Unexpected Exits Monitoring Period: The period to review when computing unexpected exits.
unexpected_exits_window 5 MINUTES

Unexpected Exits: The health check for unexpected exits encountered within a recent period specified by the unexpected_exits_window configuration for the role. unexpected_exits_ critical:any, warning:never
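The unexpected-exits checks throughout this document share the same sliding-window logic: count exits inside the configured window and go "Bad" once the critical threshold is reached. A minimal sketch, assuming exit events are recorded as timestamps (the default "critical:any" means a single exit in the window is enough):

```python
import time

def unexpected_exits_health(exit_times, now, window_s=5 * 60, critical=1):
    """Sliding-window evaluation of unexpected exits, in the spirit of
    the Unexpected Exits check: only exits inside the recent window
    (unexpected_exits_window, default 5 MINUTES) are counted, and the
    check returns "Bad" once the count reaches the critical threshold."""
    recent = [t for t in exit_times if now - t <= window_s]
    return "Bad" if len(recent) >= critical else "Good"

now = time.time()
print(unexpected_exits_health([now - 3600], now))  # exit an hour ago: "Good"
```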
HBase Thrift Server File Descriptors

Details: This HBase Thrift Server health check verifies that the number of file descriptors used does not rise above some percentage of the HBase Thrift Server file descriptor limit. A failure of this health check may indicate a bug in either Hadoop or Cloudera Manager. Contact Cloudera support. This test can be configured using the File Descriptor Monitoring HBase Thrift Server monitoring setting.

Short Name: File Descriptors

File Descriptor Monitoring: The health check of the number of file descriptors used. Specified as a percentage of the file descriptor limit. hbasethriftserver_fd_ critical: , warning:

HBase Thrift Server Host Health

Details: This HBase Thrift Server health check factors in the health of the host on which the HBase Thrift Server is running. A failure of this check means that the host running the HBase Thrift Server is experiencing some problem. See that host's status page for more details. This test can be enabled or disabled using the HBase Thrift Server Host Health Check HBase Thrift Server monitoring setting.

Short Name: Host Health

HBase Thrift Server Host Health Check: When computing the overall HBase Thrift Server health, consider the host's health. hbasethriftserver_host_health_enabled

HBase Thrift Server Log Directory Free Space

Details: This HBase Thrift Server health check verifies that the filesystem containing the log directory of this HBase Thrift Server has sufficient free space. This test can be configured using the Log Directory Free Space Monitoring Absolute and Log Directory Free Space Monitoring Percentage HBase Thrift Server monitoring settings.

Short Name: Log Directory Free Space

Log Directory Free Space Monitoring Absolute: The health check for monitoring of free space on the filesystem that contains this role's log directory. log_directory_free_space_absolute_ critical: , warning: BYTES

Log Directory Free Space Monitoring Percentage: The health check for monitoring of free space on the filesystem that contains this role's log directory. Specified as a percentage of the capacity on that filesystem. This setting is not used if a Log Directory Free Space Monitoring Absolute setting is configured. log_directory_free_space_percentage_ critical:never, warning:never

HBase Thrift Server Process Status

Details: This HBase Thrift Server health check verifies that the Cloudera Manager Agent on the HBase Thrift Server host is heartbeating correctly and that the process associated with the HBase Thrift Server role is in the state expected by Cloudera Manager. A failure of this health check may indicate a problem with the HBase Thrift Server process, a lack of connectivity to the Cloudera Manager Agent on the HBase Thrift Server host, or a problem with the Cloudera Manager Agent. This check can fail either because the HBase Thrift Server has crashed or because the HBase Thrift Server will not start or stop in a timely fashion. Check the HBase Thrift Server logs for more details. If the check fails because of problems communicating with the Cloudera Manager Agent on the HBase Thrift Server host, check the status of the Cloudera Manager Agent by running /etc/init.d/cloudera-scm-agent status on the HBase Thrift Server host, or look in the Cloudera Manager Agent logs on the HBase Thrift Server host for more details. This test can be enabled or disabled using the HBase Thrift Server Process Health Check HBase Thrift Server monitoring setting.
Short Name: Process Status

HBase Thrift Server Process Health Check: Enables the health check that the HBase Thrift Server's process state is consistent with the role configuration. hbasethriftserver_scm_health_enabled

HBase Thrift Server Unexpected Exits

Details: This HBase Thrift Server health check verifies that the HBase Thrift Server has not recently exited unexpectedly. The check returns "Bad" health if the number of unexpected exits goes above a critical threshold. For example, if this check is configured with a critical threshold of 1, this check would return "Good" health if there have been no unexpected exits recently. If there have been one or more unexpected exits recently, this check would return "Bad" health. This test can be configured using the Unexpected Exits and Unexpected Exits Monitoring Period HBase Thrift Server monitoring settings.

Short Name: Unexpected Exits

Unexpected Exits Monitoring Period: The period to review when computing unexpected exits. unexpected_exits_window 5 MINUTES

Unexpected Exits: The health check for unexpected exits encountered within a recent period specified by the unexpected_exits_window configuration for the role. unexpected_exits_ critical:any, warning:never

HDFS Active NameNode Health

Details: This is an HDFS service-level health check that checks for the presence of an active, running, and healthy NameNode. The check returns "Bad" health if the service is running and a running, active NameNode cannot be found. In all other cases it returns the health of the running, active NameNode. A failure of this health check may indicate stopped or unhealthy NameNode roles, the need to issue a failover command to make some NameNode active, or a problem with communication between the Cloudera Manager Service Monitor and one or more NameNodes. Check the status of the HDFS service's NameNode roles and look in the Cloudera Manager Service Monitor's log files for more information when this check fails. This test can be enabled or disabled using the Active NameNode Role Health Check HDFS service-wide monitoring setting.
In addition, the Active NameNode Detection Window can be used to adjust the amount of time that the Cloudera Manager Service Monitor has to detect the active HDFS NameNode before this health check fails, and the NameNode Activation Startup Tolerance can be used to adjust the amount of time around NameNode startup that the check allows for a NameNode to be made active.

Short Name: Active NameNode Health

Active NameNode Detection Window: The tolerance window that will be used in HDFS service tests that depend on detection of the active NameNode. hdfs_active_namenode_detecton_window 3 MINUTES

Active NameNode Role Health Check: When computing the overall HDFS cluster health, consider the active NameNode's health. hdfs_namenode_health_enabled

NameNode Activation Startup Tolerance: The amount of time after NameNode(s) start that the lack of an active NameNode will be tolerated. This is intended to allow either the auto-failover daemon to make a NameNode active, or a specifically issued failover command to take effect. hdfs_namenode_activation_startup_tolerance 180 SECONDS

HDFS Canary

Details: This is an HDFS service-level health check that verifies that basic client operations are working and completing in a reasonable amount of time. This check reports the results of a periodic "canary" test that performs the following sequence of operations. First, it creates a file. By default, the path is /tmp/.cloudera_health_monitoring_canary_<timestamp>. The canary test then writes a small amount of data to that file, reads that data back, and verifies that the data is correct. Lastly, the canary test removes the created file. The check returns "Bad" health if any of the basic operations fail. The check returns "Concerning" health if the canary test runs too slowly. A failure of this health check may indicate that the cluster is failing to satisfy basic client requests correctly or in a timely fashion. Check the status of the NameNode and other service-level health checks. Look in the Service Monitor logs for log messages from the canary test.
Look in the NameNode logs for more details about the processing of the canary test requests. This test can be enabled or disabled using the HDFS Canary Health Check HDFS service-wide monitoring setting.

Short Name: HDFS Canary

HDFS Canary Health Check: Enables the health check that a client can create, read, write, and delete files. hdfs_canary_health_enabled

HDFS Corrupt Blocks

Details: This is an HDFS service-level health check that verifies that the number of corrupt blocks does not rise above some percentage of the cluster's total blocks. A block is called corrupt by HDFS if it has at least one corrupt replica along with at least one live replica. As such, a corrupt block does not indicate unavailable data, but it does indicate an increased chance that data may become unavailable. If none of a block's replicas are live, the block is called a missing block by HDFS, not a corrupt block. HDFS automatically fixes corrupt blocks in the background. A failure of this health check may indicate a problem with the underlying storage or filesystem of a DataNode. Use the HDFS fsck command to identify which files contain corrupt blocks. This test can be configured using the Blocks With Corrupt Replicas Monitoring HDFS service-wide monitoring setting.

Short Name: Corrupt Blocks

Blocks With Corrupt Replicas Monitoring: The health check of the number of blocks that have at least one corrupt replica. Specified as a percentage of the total number of blocks. hdfs_blocks_with_corrupt_replicas_ critical: , warning:

HDFS Corrupt Replicas

Details: This is an HDFS service-level health check that verifies that the number of corrupt replicas does not rise above some percentage of the cluster's total blocks. A block in HDFS is usually made up of multiple replicas, so a corrupt replica does not by itself indicate unavailable data. Unavailable data is indicated by missing blocks. Corrupt replicas do indicate an increased chance that data may become unavailable. If none of a block's replicas are live, the block is called a missing block by HDFS. HDFS automatically fixes corrupt replicas in the background.
A failure of this health check may indicate a problem with the underlying storage or filesystem of a DataNode. Use the HDFS fsck command to identify which files contain corrupt blocks. This test can be configured using the Corrupt Replicas Monitoring HDFS service-wide monitoring setting. Note that the percentage being thresholded is computed as replicas divided by blocks, so it can exceed 100% in some cases.

Short Name: Corrupt Replicas

Corrupt Replicas Monitoring: The health check of the number of corrupt replicas. Specified as a percentage of the total number of blocks. Note that there are more replicas than blocks, so it is theoretically possible for this to be over one hundred percent. hdfs_corrupt_blocks_ critical: , warning:

HDFS DataNodes Health

Details: This is an HDFS service-level health check that verifies that enough of the DataNodes in the cluster are healthy. The check returns "Concerning" health if the number of healthy DataNodes falls below a warning threshold, expressed as a percentage of the total number of DataNodes. The check returns "Bad" health if the number of healthy and "Concerning" DataNodes falls below a critical threshold, expressed as a percentage of the total number of DataNodes. For example, if this check is configured with a warning threshold of 95% and a critical threshold of 90% for a cluster of 100 DataNodes, this check would return "Good" health if 95 or more DataNodes have good health. This check would return "Concerning" health if fewer than 95 DataNodes are healthy but at least 90 DataNodes have either "Good" or "Concerning" health. If more than 10 DataNodes have bad health, this check would return "Bad" health. A failure of this health check indicates unhealthy DataNodes. Check the status of the individual DataNodes for more information. This test can be configured using the Healthy DataNodes Monitoring HDFS service-wide monitoring setting.

Short Name: DataNodes Health

Healthy DataNodes Monitoring: The health check of the overall DataNodes health. The check returns "Concerning" health if the percentage of "Healthy" DataNodes falls below the warning threshold. The check is unhealthy if the total percentage of "Healthy" and "Concerning" DataNodes falls below the critical threshold. hdfs_datanodes_healthy_ critical: , warning:

HDFS Free Space

Details: This is an HDFS service-level health check that verifies that the amount of free space in the HDFS cluster does not fall below some percentage of total configured capacity. A failure of this health check may indicate a capacity planning problem, or a loss of DataNodes. This test can be configured using the HDFS Free Space Monitoring HDFS service-wide monitoring setting.

Short Name: Free Space

HDFS Free Space Monitoring: The health check of free space in HDFS. Specified as a percentage of total HDFS capacity. hdfs_free_space_ critical: , warning:

HDFS Missing Blocks

Details: This is an HDFS service-level health check that verifies that the number of missing blocks does not rise above some percentage of the cluster's total blocks. A missing block is a block with no live replicas. All replicas are either missing or corrupt. This may happen because of corruption or because DataNodes are offline or being decommissioned. A failure of this health check may indicate the loss of several DataNodes at once. If there are files stored in the cluster with a replication factor value of 1, you may see missing blocks with the loss or malfunction of a single DataNode. Use the HDFS fsck command to identify which files contain missing blocks. This test can be configured using the Missing Block Monitoring HDFS service-wide monitoring setting.

Short Name: Missing Blocks

Missing Block Monitoring: The health check of the number of missing blocks. Specified as a percentage of the total number of blocks. hdfs_missing_blocks_ critical:any, warning:never

HDFS NameNode Health

Details: This HDFS service-level health check checks for the presence of a running, healthy NameNode. The check returns "Bad" health if the service is running and the NameNode is not running. In all other cases it returns the health of the NameNode. A failure of this health check indicates a stopped or unhealthy NameNode. Check the status of the NameNode for more information. This test can be enabled or disabled using the Active NameNode Role Health Check NameNode service-wide monitoring setting.

Short Name: NameNode Health

Active NameNode Role Health Check: When computing the overall HDFS cluster health, consider the active NameNode's health. hdfs_namenode_health_enabled

HDFS Standby NameNode Health

Details: This is an HDFS service-level health check that checks for a running, healthy NameNode in standby mode. The check is disabled if the HDFS service is not configured with multiple NameNodes. Otherwise, the check returns "Concerning" health if either of two conditions is met: first, if no NameNode is running in standby mode; second, if the running standby NameNode is in less than "Good" health. A failure of this health check may indicate one or more stopped or unhealthy NameNodes, or it may indicate a problem with communication between the Cloudera Manager Service Monitor and some or all of the HDFS NameNodes. Check the status of the HDFS service's NameNode roles and the Cloudera Manager Service Monitor's log files for more information when this check fails. This test can be enabled or disabled using the Standby NameNode Health Check HDFS service-wide monitoring setting. In addition, the Active NameNode Detection Window can be used to adjust the amount of time that the Cloudera Manager Service Monitor has to detect the active NameNode before this health check fails.
Short Name: Standby NameNode Health
Active NameNode Detection Window: The tolerance window that will be used in HDFS service tests that depend on detection of the active NameNode. hdfs_active_namenode_detecton_window 3 MINUTES

Standby NameNode Health Check: When computing the overall HDFS cluster health, consider the health of the standby NameNode. hdfs_standby_namenodes_health_enabled

HDFS Under-Replicated Blocks

Details: This is an HDFS service-level health check that verifies that the number of under-replicated blocks does not rise above some percentage of the cluster's total blocks. A failure of this health check may indicate a loss of DataNodes. Use the HDFS fsck command to identify which files contain under-replicated blocks. This test can be configured using the Under-replicated Block Monitoring HDFS service-wide monitoring setting.

Short Name: Under-Replicated Blocks

Under-replicated Block Monitoring: The health check of the number of under-replicated blocks. Specified as a percentage of the total number of blocks. hdfs_under_replicated_blocks_ critical: , warning:

Host Agent Log Directory

Details: This is a host health check that verifies that the filesystem containing the Cloudera Manager Agent's log directory has sufficient free space. This test can be configured using the Cloudera Manager Agent Log Directory Free Space Monitoring Absolute and Cloudera Manager Agent Log Directory Free Space Monitoring Percentage host monitoring settings.

Short Name: Agent Log Directory
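The free-space checks that recur throughout this document (log, parcel, and process directories) all share one rule: when an absolute threshold is configured, the percentage setting is ignored. A minimal sketch of that precedence, assuming illustrative threshold values and using the host's own filesystem statistics:

```python
import os

def free_space_ok(path, min_free_bytes=None, min_free_pct=None):
    """Directory free-space check in the spirit of the Absolute/Percentage
    monitoring settings: if an absolute byte threshold is configured it is
    used and the percentage setting is ignored, matching the documented
    precedence. Threshold values here are illustrative."""
    st = os.statvfs(path)
    free = st.f_bavail * st.f_frsize          # bytes available to non-root
    total = st.f_blocks * st.f_frsize         # total filesystem capacity
    if min_free_bytes is not None:            # absolute setting wins
        return free >= min_free_bytes
    if min_free_pct is not None:
        return 100.0 * free / total >= min_free_pct
    return True                               # nothing configured: healthy

print(free_space_ok("/tmp", min_free_bytes=1))
```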
Cloudera Manager Agent Log Directory Free Space Monitoring Absolute: The health check for monitoring of free space on the filesystem that contains the Cloudera Manager agent's log directory. host_agent_log_directory_free_space_absolute_ critical: , warning: BYTES

Cloudera Manager Agent Log Directory Free Space Monitoring Percentage: The health check for monitoring of free space on the filesystem that contains the Cloudera Manager agent's log directory. Specified as a percentage of the capacity on that filesystem. This setting is not used if a Cloudera Manager Agent Log Directory Free Space Monitoring Absolute setting is configured. host_agent_log_directory_free_space_percentage_ critical:never, warning:never

Host Agent Parcel Directory

Details: This is a host health check that verifies that the filesystem containing the Cloudera Manager Agent's parcel directory has sufficient free space. This test can be configured using the Cloudera Manager Agent Parcel Directory Free Space Monitoring Absolute and Cloudera Manager Agent Parcel Directory Free Space Monitoring Percentage host monitoring settings.

Short Name: Agent Parcel Directory

Cloudera Manager Agent Parcel Directory Free Space Monitoring Absolute: The health check for monitoring of free space on the filesystem that contains the Cloudera Manager agent's parcel directory. host_agent_parcel_directory_free_space_absolute_ critical: , warning: BYTES
Cloudera Manager Agent Parcel Directory Free Space Monitoring Percentage: The health check for monitoring of free space on the filesystem that contains the Cloudera Manager agent's parcel directory. Specified as a percentage of the capacity on that filesystem. This setting is not used if a Cloudera Manager Agent Parcel Directory Free Space Monitoring Absolute setting is configured. host_agent_parcel_directory_free_space_percentage_ critical:never, warning:never

Host Agent Process Directory

Details: This is a host health check that verifies that the filesystem containing the Cloudera Manager Agent's process directory has sufficient free space. The process directory contains the configuration files for the processes which the Cloudera Manager Agent starts. This test can be configured using the Cloudera Manager Agent Process Directory Free Space Monitoring Absolute and Cloudera Manager Agent Process Directory Free Space Monitoring Percentage host monitoring settings.

Short Name: Agent Process Directory

Cloudera Manager Agent Process Directory Free Space Monitoring Absolute: The health check for monitoring of free space on the filesystem that contains the Cloudera Manager agent's process directory. host_agent_process_directory_free_space_absolute_ critical: , warning: BYTES

Cloudera Manager Agent Process Directory Free Space Monitoring Percentage: The health check for monitoring of free space on the filesystem that contains the Cloudera Manager agent's process directory. Specified as a percentage of the capacity on that filesystem. This setting is not used if a Cloudera Manager Agent Process Directory Free Space Monitoring Absolute setting is configured. host_agent_process_directory_free_space_percentage_ critical:never, warning:never

Host Agent Status

Details: This is a host health check that verifies that the host's Cloudera Manager Agent is heartbeating correctly and has the correct software version. A failure of this health check may indicate a lack of connectivity with the host's Cloudera Manager Agent, a problem with the Cloudera Manager Agent, or that the Cloudera Manager Agent or Host Monitor software is out of date. Check the status of the Cloudera Manager Agent by running /etc/init.d/cloudera-scm-agent status on the host, or look in the host's Cloudera Manager Agent logs for more details. If this check reports a software version mismatch between the Cloudera Manager Agent and the Host Monitor, check the version of each component by consulting the appropriate logs or the appropriate status web pages. This test can be enabled or disabled using the Host Process Health Check host configuration setting.

Short Name: Agent Status

Host Process Health Check: Enables the health check that the host's process state is consistent with the role configuration. host_scm_health_enabled

Host Clock Offset

Details: This is a host health check that checks whether the host's system clock appears to be out of sync. A failure of this health check may indicate that the host's system clock needs to be synchronized with other cluster nodes. It is recommended that NTP be used to synchronize system clocks across the cluster. Check the Cloudera Manager agent log for more details. The clock offset metric is calculated by comparing system clock values after accounting for measured network latency. In normal operating conditions the estimated offset will be non-zero and may vary slightly due to measurement error. This test can be configured using the Host Clock Offset host configuration setting.

Short Name: Clock Offset

Host Clock Offset: The health check for the host clock offset. host_clock_offset_ critical: , warning: MILLISECONDS

Host DNS Resolution

Details: This is a host health check that verifies that the host's hostname and canonical name are consistent when checked from a Java process. A failure of this health check may indicate that the host's DNS configuration is not correct. Check the Cloudera Manager Agent log for the names that were detected by this check. The hostname and canonical name are considered to be consistent if the hostname, or the hostname plus a domain name, is the same as the canonical name. This health check uses domain names from the domain and search lines in /etc/resolv.conf. This health check does not consult /etc/nsswitch.conf and may give incorrect results if /etc/resolv.conf is not used by the host. There may be a delay of up to 5 minutes before this health check picks up changes to /etc/resolv.conf. This test can be configured using the Hostname and Canonical Name Health Check host configuration setting.

Short Name: DNS Resolution

Hostname and Canonical Name Health Check: Whether the hostname and canonical names for this host are consistent when checked from a Java process. host_dns_resolution_enabled

Host DNS Resolution Duration

Details: This is a host health check that verifies that the host's DNS resolution completes in a timely manner. The DNS resolution duration is calculated by measuring the time that a call to getLocalHost in a Java process takes on this host. Please note that DNS information may be cached on the host, and this caching may affect the reported resolution duration. A failure of this health check may indicate that the host's DNS configuration is set incorrectly or that the host's DNS server is responding slowly.
This test can be configured using the Host DNS Resolution Duration host configuration setting.

Short Name: DNS Resolution Duration
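The duration measured by this check is simply a timed local-hostname lookup. A rough Python analogue of timing such a lookup (the document's check times a Java getLocalHost call; the warning threshold here is illustrative, and results may be served from a cache, as the description notes):

```python
import socket
import time

def dns_resolution_duration(warning_ms=1000.0):
    """Time a local hostname-to-canonical-name lookup, analogous to the
    DNS Resolution Duration check. Returns the resolved name, the elapsed
    milliseconds, and a Good/Concerning status against an illustrative
    warning threshold (the default is "critical:never, warning: ...")."""
    start = time.monotonic()
    name = socket.getfqdn()                     # hostname -> canonical name
    elapsed_ms = (time.monotonic() - start) * 1000.0
    status = "Concerning" if elapsed_ms > warning_ms else "Good"
    return name, elapsed_ms, status

name, ms, status = dns_resolution_duration()
print(status)
```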
Host DNS Resolution Duration: The health check for the host DNS resolution duration. host_dns_resolution_duration_ critical:never, warning: MILLISECONDS

Host Frame Errors

Details: This is a host health check that checks for network frame errors across all network interfaces. A failure of this health check may indicate a problem with network hardware (e.g. switches) and can potentially cause other service- or role-level performance problems. Check the host and network hardware logs for more details. This test can be configured using the Host Network Frame Error Percentage, Host Network Frame Error Check Window, and Host Network Frame Error Test Minimum Required Packets host configuration settings.

Short Name: Frame Errors

Host Network Frame Error Check Window: The amount of time over which the host frame error test checks for frame errors. host_network_frame_errors_window 15 MINUTES

Host Network Frame Error Percentage: The health check for the percentage of received packets that are frame errors. host_network_frame_errors_ critical: , warning:any

Host Network Frame Error Test Minimum Required Packets: The minimum number of packets that must be received within the test window for this test to return "Bad" health. If fewer than this number of packets is received during the test window, the health check will never return "Bad" health. host_network_frame_errors_floor 0
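The floor setting above changes how the error percentage is interpreted: below the minimum packet count the check can degrade to "Concerning" at most, never "Bad". A minimal sketch of that interaction, with illustrative threshold values (the document's critical percentage was lost in extraction, and its default warning threshold is "any"):

```python
def frame_errors_health(frame_errors, packets_received,
                        floor=0, critical_pct=1.0):
    """Frame-error check sketch: compare the frame-error percentage to a
    critical threshold, but never report "Bad" when fewer packets than
    the configured floor were received in the window. Any error at all
    yields at least "Concerning", mirroring the "warning:any" default."""
    if packets_received <= 0:
        return "Good"                      # nothing received, nothing to judge
    pct = 100.0 * frame_errors / packets_received
    if packets_received >= floor and pct >= critical_pct:
        return "Bad"
    if frame_errors > 0:                   # default warning threshold is "any"
        return "Concerning"
    return "Good"

print(frame_errors_health(0, 10_000))  # prints "Good"
```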
Host Network Interface Speed

Details: This is a host health check that checks for network interfaces that appear to be operating at less than full speed. A failure of this health check may indicate that one or more network interfaces are configured incorrectly, which can cause performance problems. Use the ethtool command to check and configure the host's network interfaces to use the fastest available link speed and duplex mode. This test can be configured using the Host's Network Interfaces Slow Link Modes, Network Interface Expected Link Speed, and Network Interface Expected Duplex Mode host configuration settings.

Short Name: Network Interface Speed

Host's Network Interfaces Slow Link Modes: The health check of the number of network interfaces that appear to be operating at less than full speed. host_network_interfaces_slow_mode_ critical:never, warning:any

Network Interface Expected Duplex Mode: The expected duplex mode for network interfaces. host_nic_expected_duplex_mode Full

Network Interface Expected Link Speed: The expected network interface link speed. host_nic_expected_speed 1000

Host Swapping

Details: This is a health check that verifies that the host has not swapped out more than a certain number of pages over the last fifteen minutes. A failure of this health check may indicate misconfiguration of the host operating system, or too many processes running on the host. Try reducing vm.swappiness, or add more memory to the host. This test can be configured using the Host Memory Swapping and Host Memory Swapping Check Window host configuration settings.

Short Name: Swapping

Host Memory Swapping Check Window: The amount of time over which the memory swapping test checks for pages swapped. host_memswap_window 15 MINUTES

Host Memory Swapping: The health check of the number of pages swapped out on the host in the last 15 minutes. host_memswap_ critical:never, warning:any PAGES

Host Monitor File Descriptors

Details: This Host Monitor health check verifies that the number of file descriptors used does not rise above some percentage of the Host Monitor file descriptor limit. A failure of this health check may indicate a bug in either Hadoop or Cloudera Manager. Contact Cloudera support. This test can be configured using the File Descriptor Monitoring Host Monitor monitoring setting.

Short Name: File Descriptors

File Descriptor Monitoring: The health check of the number of file descriptors used. Specified as a percentage of the file descriptor limit. hostmonitor_fd_ critical: , warning:

Host Monitor Host Health

Details: This Host Monitor health check factors in the health of the host on which the Host Monitor is running. A failure of this check means that the host running the Host Monitor is experiencing some problem. See that host's status page for more details. This test can be enabled or disabled using the Host Monitor Host Health Check Host Monitor monitoring setting.

Short Name: Host Health

Host Monitor Host Health Check: When computing the overall Host Monitor health, consider the host's health. hostmonitor_host_health_enabled
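The file-descriptor checks that recur for each role, including the Host Monitor one above, all compare descriptors in use against the process limit as a percentage. A minimal sketch for the current process, assuming a Linux host (/proc/self/fd) and illustrative thresholds, since the document's default values were lost in extraction:

```python
import os
import resource

def fd_usage_health(warning_pct=50.0, critical_pct=70.0):
    """File-descriptor check sketch: count open descriptors for this
    process (via the Linux-specific /proc/self/fd) and compare them to
    the soft RLIMIT_NOFILE limit as a percentage. Thresholds are
    illustrative stand-ins for the warning/critical settings."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    used = len(os.listdir("/proc/self/fd"))   # assumption: Linux procfs
    pct = 100.0 * used / soft
    if pct >= critical_pct:
        return "Bad", pct
    if pct >= warning_pct:
        return "Concerning", pct
    return "Good", pct

status, pct = fd_usage_health()
print(status)
```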
Host Monitor Host Pipeline

Details: This Host Monitor health check verifies that no messages are being dropped by the host stage of the Host Monitor pipeline. A failure of this health check indicates a problem with the Host Monitor, which may be a configuration problem or a bug in the Host Monitor. This test can be configured using the Host Monitor Host Pipeline Monitoring Time Period monitoring setting.

Short Name: Host Pipeline

Host Monitor Host Pipeline Monitoring
  Description: The health check for monitoring the Host Monitor host pipeline. This specifies the number of dropped messages that will be tolerated over the monitoring time period.
  Property: hostmonitor_host_pipeline_
  Default: critical:any, warning:never

Host Monitor Host Pipeline Monitoring Time Period
  Description: The time period over which the Host Monitor host pipeline will be monitored for dropped messages.
  Property: hostmonitor_host_pipeline_window
  Default: 5 MINUTES

Host Monitor Log Directory Free Space

Details: This Host Monitor health check verifies that the filesystem containing the log directory of this Host Monitor has sufficient free space. This test can be configured using the Log Directory Free Space Monitoring Absolute and Log Directory Free Space Monitoring Percentage Host Monitor monitoring settings.

Short Name: Log Directory Free Space

Log Directory Free Space Monitoring Absolute
  Description: The health check for monitoring of free space on the filesystem that contains this role's log directory.
  Property: log_directory_free_space_absolute_
  Default: critical: , warning:
  Unit: BYTES

Log Directory Free Space Monitoring Percentage
  Description: The health check for monitoring of free space on the filesystem that contains this role's log directory, specified as a percentage of the capacity of that filesystem. This setting is not used if a Log Directory Free Space Monitoring Absolute setting is configured.
  Property: log_directory_free_space_percentage_
  Default: critical:never, warning:never

Host Monitor Process Status

Details: This Host Monitor health check verifies that the Cloudera Manager Agent on the Host Monitor host is heartbeating correctly and that the process associated with the Host Monitor role is in the state expected by Cloudera Manager. A failure of this health check may indicate a problem with the Host Monitor process, a lack of connectivity to the Cloudera Manager Agent on the Host Monitor host, or a problem with the Cloudera Manager Agent itself. This check can fail either because the Host Monitor has crashed or because the Host Monitor will not start or stop in a timely fashion. Check the Host Monitor logs for more details. If the check fails because of problems communicating with the Cloudera Manager Agent on the Host Monitor host, check the status of the Agent by running /etc/init.d/cloudera-scm-agent status on that host, or look in the Cloudera Manager Agent logs there for more details. This test can be enabled or disabled using the Host Monitor Process Health Check Host Monitor monitoring setting.

Short Name: Process Status

Host Monitor Process Health Check
  Description: Enables the health check that the Host Monitor's process state is consistent with the role configuration.
  Property: hostmonitor_scm_health_enabled
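The log directory free space checks use two thresholds, with a documented precedence: the percentage setting is ignored whenever an absolute setting is configured. A sketch of that evaluation (all threshold values here are illustrative parameters, not Cloudera defaults):

```python
import shutil

def log_dir_free_space_health(path, absolute_critical=None, absolute_warning=None,
                              pct_critical=None, pct_warning=None):
    """Evaluate free space under a log directory.

    Mirrors the documented precedence: if absolute thresholds are
    configured, the percentage thresholds are not used.
    """
    usage = shutil.disk_usage(path)
    if absolute_critical is not None or absolute_warning is not None:
        if absolute_critical is not None and usage.free < absolute_critical:
            return "Bad"
        if absolute_warning is not None and usage.free < absolute_warning:
            return "Concerning"
        return "Good"
    free_pct = 100.0 * usage.free / usage.total
    if pct_critical is not None and free_pct < pct_critical:
        return "Bad"
    if pct_warning is not None and free_pct < pct_warning:
        return "Concerning"
    return "Good"
```

The same precedence rule applies to every Log Directory Free Space Monitoring pair in the sections that follow.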
Host Monitor Unexpected Exits

Details: This Host Monitor health check verifies that the Host Monitor has not recently exited unexpectedly. The check returns "Bad" health if the number of unexpected exits rises above a critical threshold. For example, if this check is configured with a critical threshold of 1, it returns "Good" health if there have been no unexpected exits recently, and "Bad" health if there have been one or more. This test can be configured using the Unexpected Exits and Unexpected Exits Monitoring Period Host Monitor monitoring settings.

Short Name: Unexpected Exits

Unexpected Exits Monitoring Period
  Description: The period to review when computing unexpected exits.
  Property: unexpected_exits_window
  Default: 5 MINUTES

Unexpected Exits
  Description: The health check for unexpected exits encountered within a recent period specified by the unexpected_exits_window configuration for the role.
  Property: unexpected_exits_
  Default: critical:any, warning:never

Host Monitor Web Server Status

Details: This Host Monitor health check verifies that the Host Monitor's web server is responding quickly to requests from the Cloudera Manager Agent, and that the Agent can collect metrics from the web server. A failure of this health check may indicate a problem with the Host Monitor's web server, a misconfiguration of the Host Monitor, or a problem with the Cloudera Manager Agent. Consult the Cloudera Manager Agent logs and the Host Monitor logs for more detail. If the test's failure message indicates a communication problem, the Cloudera Manager Agent's HTTP requests to the Host Monitor's web server are failing or timing out. These requests are completely local to the Host Monitor's host, and so should never fail under normal conditions.
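The unexpected-exits checks, which repeat for every role in this document, count exits inside a sliding monitoring window. A sketch of that bookkeeping; the 5-minute window mirrors the unexpected_exits_window default, and the critical threshold of 1 repeats the documented example:

```python
import time
from collections import deque

class UnexpectedExitTracker:
    """Count unexpected exits inside a sliding time window."""

    def __init__(self, window_seconds=300, critical=1):
        self.window = window_seconds
        self.critical = critical
        self.exits = deque()

    def record_exit(self, now=None):
        self.exits.append(time.time() if now is None else now)

    def health(self, now=None):
        now = time.time() if now is None else now
        # Discard exits that have fallen outside the monitoring window.
        while self.exits and now - self.exits[0] > self.window:
            self.exits.popleft()
        return "Bad" if len(self.exits) >= self.critical else "Good"
```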
If the test's failure message indicates an unexpected response, the Host Monitor's web server responded to the Cloudera Manager Agent's request, but the Agent could not interpret the response for some reason. This test can be configured using the Web Metric Collection Host Monitor monitoring setting.

Short Name: Web Server Status

Web Metric Collection
  Description: Enables the health check that the Cloudera Manager Agent can successfully contact and gather metrics from the web server.
  Property: hostmonitor_web_metric_collection_enabled

HttpFS File Descriptors

Details: This HttpFS health check verifies that the number of file descriptors in use does not rise above a configured percentage of the HttpFS file descriptor limit. A failure of this health check may indicate a bug in either Hadoop or Cloudera Manager; contact Cloudera support. This test can be configured using the File Descriptor Monitoring HttpFS monitoring setting.

Short Name: File Descriptors

File Descriptor Monitoring
  Description: The health check of the number of file descriptors used, specified as a percentage of the file descriptor limit.
  Property: httpfs_fd_
  Default: critical: , warning:

HttpFS Host Health

Details: This HttpFS health check factors in the health of the host on which the HttpFS role is running. A failure of this check means that the host running the HttpFS role is experiencing some problem; see that host's status page for more details. This test can be enabled or disabled using the HttpFS Host Health Check HttpFS monitoring setting.

Short Name: Host Health

HttpFS Host Health Check
  Description: When computing the overall HttpFS health, consider the host's health.
  Property: httpfs_host_health_enabled
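The web server status checks distinguish two failure modes: a communication problem (the Agent's local HTTP request fails or times out) and an unexpected response (the server replies but the body cannot be interpreted). A sketch of a collection probe making that distinction; the URL, timeout, and return convention are all illustrative, not the Cloudera Manager Agent's actual behavior:

```python
import urllib.request

def collect_web_metrics(url, timeout_seconds=5.0):
    """Probe a role's metrics page the way an agent-side check might.

    Returns ("ok", body), ("communication problem", exc), or
    ("unexpected response", None); an empty body stands in for any
    response the caller cannot interpret.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout_seconds) as resp:
            body = resp.read()
    except OSError as exc:  # covers timeouts and connection failures
        return ("communication problem", exc)
    if not body:
        return ("unexpected response", None)
    return ("ok", body)
```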
HttpFS Log Directory Free Space

Details: This HttpFS health check verifies that the filesystem containing the log directory of this HttpFS role has sufficient free space. This test can be configured using the Log Directory Free Space Monitoring Absolute and Log Directory Free Space Monitoring Percentage HttpFS monitoring settings.

Short Name: Log Directory Free Space

Log Directory Free Space Monitoring Absolute
  Description: The health check for monitoring of free space on the filesystem that contains this role's log directory.
  Property: log_directory_free_space_absolute_
  Default: critical: , warning:
  Unit: BYTES

Log Directory Free Space Monitoring Percentage
  Description: The health check for monitoring of free space on the filesystem that contains this role's log directory, specified as a percentage of the capacity of that filesystem. This setting is not used if a Log Directory Free Space Monitoring Absolute setting is configured.
  Property: log_directory_free_space_percentage_
  Default: critical:never, warning:never

HttpFS Process Status

Details: This HttpFS health check verifies that the Cloudera Manager Agent on the HttpFS host is heartbeating correctly and that the process associated with the HttpFS role is in the state expected by Cloudera Manager. A failure of this health check may indicate a problem with the HttpFS process, a lack of connectivity to the Cloudera Manager Agent on the HttpFS host, or a problem with the Cloudera Manager Agent itself. This check can fail either because the HttpFS role has crashed or because it will not start or stop in a timely fashion. Check the HttpFS logs for more details. If the check fails because of problems communicating with the Cloudera Manager Agent on the HttpFS host, check the status of the Agent by running /etc/init.d/cloudera-scm-agent status on that host, or look in the Cloudera Manager Agent logs there for more details. This test can be enabled or disabled using the HttpFS Process Health Check HttpFS monitoring setting.

Short Name: Process Status
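The process status checks repeatedly point at /etc/init.d/cloudera-scm-agent status as the first diagnostic step. By init-script convention a status command exits 0 when the service is running, so a wrapper only needs the exit code; the argv shown in the docstring is the command from the text, while the wrapper itself is an illustration, not a Cloudera API:

```python
import subprocess

def service_is_running(status_cmd):
    """Return True if a status command exits 0 (the init-script
    convention for "running"), e.g.
    service_is_running(["/etc/init.d/cloudera-scm-agent", "status"]).
    """
    result = subprocess.run(status_cmd, capture_output=True)
    return result.returncode == 0
```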
HttpFS Process Health Check
  Description: Enables the health check that the HttpFS role's process state is consistent with the role configuration.
  Property: httpfs_scm_health_enabled

HttpFS Unexpected Exits

Details: This HttpFS health check verifies that the HttpFS role has not recently exited unexpectedly. The check returns "Bad" health if the number of unexpected exits rises above a critical threshold. For example, if this check is configured with a critical threshold of 1, it returns "Good" health if there have been no unexpected exits recently, and "Bad" health if there have been one or more. This test can be configured using the Unexpected Exits and Unexpected Exits Monitoring Period HttpFS monitoring settings.

Short Name: Unexpected Exits

Unexpected Exits Monitoring Period
  Description: The period to review when computing unexpected exits.
  Property: unexpected_exits_window
  Default: 5 MINUTES

Unexpected Exits
  Description: The health check for unexpected exits encountered within a recent period specified by the unexpected_exits_window configuration for the role.
  Property: unexpected_exits_
  Default: critical:any, warning:never

Impala Assignment Locality

Details: This Impala service-level health check verifies that a sufficient percentage of recent assignments are operating on local data. The check returns "Concerning" or "Bad" health if the percentage of assignments operating on local data falls below a warning or critical threshold. The number of assignments observed during the check's window must be above a configured minimum before the check is enabled. A failure of this health check may indicate problems with the configuration of the Impala service. Check that each Impala Daemon is co-located with a DataNode, and that the IP address of each Impala Daemon matches the IP address of its co-located DataNode. This test can be configured using the Assignment Locality Ratio, Assignment Locality Monitoring Period and Assignment Locality Minimum Assignments Impala service-wide monitoring settings.

Short Name: Assignment Locality

Assignment Locality Minimum Assignments
  Description: The minimum number of assignments that must occur during the test time period before the threshold values will be checked. Until this number of assignments has been observed in the test time period, the health check is disabled.
  Property: impala_assignment_locality_minimum

Assignment Locality Monitoring Period
  Description: The time period over which to compute the assignment locality ratio. Specified in minutes.
  Property: impala_assignment_locality_window
  Unit: MINUTES

Assignment Locality Ratio
  Description: The health check for assignment locality, specified as a percentage of total assignments.
  Property: impala_assignment_locality_
  Default: critical: , warning:

Impala Daemons Health

Details: This Impala service-level health check verifies that enough of the Impala Daemons in the cluster are healthy. The check returns "Concerning" health if the number of healthy Impala Daemons falls below a warning threshold, expressed as a percentage of the total number of Impala Daemons. The check returns "Bad" health if the number of healthy and "Concerning" Impala Daemons falls below a critical threshold, expressed as a percentage of the total number of Impala Daemons. For example, if this check is configured with a warning threshold of 95% and a critical threshold of 90% for a cluster of 100 Impala Daemons, the check returns "Good" health if 95 or more Impala Daemons have good health, "Concerning" health if at least 90 Impala Daemons have either "Good" or "Concerning" health, and "Bad" health if more than 10 Impala Daemons have bad health. A failure of this health check indicates unhealthy Impala Daemons; check the status of the individual Impala Daemons for more information. This test can be configured using the Healthy Impala Daemon Monitoring Impala service-wide monitoring setting.

Short Name: Impala Daemons Health
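Both the assignment locality ratio and the Impala Daemons health check are "falls below" percentage thresholds. A simplified sketch of that classification; the real Daemons check compares two different percentages (healthy, and healthy plus concerning) against the two thresholds, whereas a single value is used here, and the 95/90 values repeat the documented example:

```python
def percent_health(value, warning=None, critical=None):
    """Classify a percentage against "falls below" thresholds.

    Thresholds are parameters, mirroring settings such as
    impala_impalads_healthy_ and the assignment locality ratio.
    """
    if critical is not None and value < critical:
        return "Bad"
    if warning is not None and value < warning:
        return "Concerning"
    return "Good"
```

For example, with warning=95 and critical=90, a value of 96 is "Good", 93 is "Concerning", and 89 is "Bad".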
Healthy Impala Daemon Monitoring
  Description: The health check of the overall Impala Daemons health. The check returns "Concerning" health if the percentage of "Healthy" Impala Daemons falls below the warning threshold. The check is unhealthy if the total percentage of "Healthy" and "Concerning" Impala Daemons falls below the critical threshold.
  Property: impala_impalads_healthy_
  Default: critical: , warning:

Impala StateStore Health

Details: This Impala service-level health check checks for the presence of a running, healthy Impala StateStore Daemon. The check returns "Bad" health if the service is running but the Impala StateStore Daemon is not; in all other cases it returns the health of the Impala StateStore Daemon. A failure of this health check indicates a stopped or unhealthy Impala StateStore Daemon; check the status of the Impala StateStore Daemon for more information. This test can be enabled or disabled using the StateStore Role Health Check Impala service-wide monitoring setting.

Short Name: StateStore Health

StateStore Role Health Check
  Description: When computing the overall Impala cluster health, consider the StateStore's health.
  Property: impala_statestore_health_enabled

Impala Daemon File Descriptors

Details: This Impala Daemon health check verifies that the number of file descriptors in use does not rise above a configured percentage of the Impala Daemon file descriptor limit. A failure of this health check may indicate a bug in either Hadoop or Cloudera Manager; contact Cloudera support. This test can be configured using the File Descriptor Monitoring Impala Daemon monitoring setting.

Short Name: File Descriptors
File Descriptor Monitoring
  Description: The health check of the number of file descriptors used, specified as a percentage of the file descriptor limit.
  Property: impalad_fd_
  Default: critical: , warning:

Impala Daemon Host Health

Details: This Impala Daemon health check factors in the health of the host on which the Impala Daemon is running. A failure of this check means that the host running the Impala Daemon is experiencing some problem; see that host's status page for more details. This test can be enabled or disabled using the Impala Daemon Host Health Check Impala Daemon monitoring setting.

Short Name: Host Health

Impala Daemon Host Health Check
  Description: When computing the overall Impala Daemon health, consider the host's health.
  Property: impalad_host_health_enabled

Impala Daemon Log Directory Free Space

Details: This Impala Daemon health check verifies that the filesystem containing the log directory of this Impala Daemon has sufficient free space. This test can be configured using the Log Directory Free Space Monitoring Absolute and Log Directory Free Space Monitoring Percentage Impala Daemon monitoring settings.

Short Name: Log Directory Free Space

Log Directory Free Space Monitoring Absolute
  Description: The health check for monitoring of free space on the filesystem that contains this role's log directory.
  Property: log_directory_free_space_absolute_
  Default: critical: , warning:
  Unit: BYTES

Log Directory Free Space Monitoring Percentage
  Description: The health check for monitoring of free space on the filesystem that contains this role's log directory, specified as a percentage of the capacity of that filesystem. This setting is not used if a Log Directory Free Space Monitoring Absolute setting is configured.
  Property: log_directory_free_space_percentage_
  Default: critical:never, warning:never

Impala Daemon Process Status

Details: This Impala Daemon health check verifies that the Cloudera Manager Agent on the Impala Daemon host is heartbeating correctly and that the process associated with the Impala Daemon role is in the state expected by Cloudera Manager. A failure of this health check may indicate a problem with the Impala Daemon process, a lack of connectivity to the Cloudera Manager Agent on the Impala Daemon host, or a problem with the Cloudera Manager Agent itself. This check can fail either because the Impala Daemon has crashed or because the Impala Daemon will not start or stop in a timely fashion. Check the Impala Daemon logs for more details. If the check fails because of problems communicating with the Cloudera Manager Agent on the Impala Daemon host, check the status of the Agent by running /etc/init.d/cloudera-scm-agent status on that host, or look in the Cloudera Manager Agent logs there for more details. This test can be enabled or disabled using the Impala Daemon Process Health Check Impala Daemon monitoring setting.

Short Name: Process Status

Impala Daemon Process Health Check
  Description: Enables the health check that the Impala Daemon's process state is consistent with the role configuration.
  Property: impalad_scm_health_enabled
Impala Daemon Resident Set Size

Details: This Impala Daemon health check verifies that the size of the process's resident set does not rise above a configured threshold value. A failure of this health check may indicate that the Impala Daemon process is consuming more memory than expected. Unexpected memory consumption may lead to swapping and decreased performance for processes running on the same host as this Impala Daemon. Increased Impala Daemon memory consumption may be caused by an increased workload on the Impala service, or by a bug in the Impala Daemon software. To avoid failures of this health check, free up additional memory for this Impala Daemon process and increase the Resident Set Size monitoring setting. This test can be configured using the Resident Set Size Impala Daemon monitoring setting.

Short Name: Resident Set Size

Resident Set Size
  Description: The health check on the resident size of the process.
  Property: process_resident_set_size_
  Default: critical:never, warning:never
  Unit: BYTES

Impala Daemon StateStore Connectivity

Details: This Impala Daemon health check verifies that the StateStore considers the Impala Daemon alive. A failure of this health check may indicate that the Impala Daemon is having trouble communicating with the StateStore; look in the Impala Daemon logs for more details. This check may return an unknown result if the Service Monitor is not able to communicate with the StateStore web server; in that case, check the status of the StateStore web server and the Service Monitor logs. This test can be enabled or disabled using the Impala Daemon Connectivity Health Check Impala Daemon monitoring setting. The Impala Daemon Connectivity Tolerance at Startup Impala Daemon monitoring setting and the Health Check Startup Tolerance StateStore monitoring setting can be used to control the check's tolerance windows around Impala Daemon and StateStore restarts, respectively.

Short Name: StateStore Connectivity

Health Check Startup Tolerance
  Description: The amount of time allowed after this role is started during which failures of health checks that rely on communication with this role will be tolerated.
  Property: statestore_startup_tolerance
  Default: 5 MINUTES
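The resident set size checks compare the process RSS, in bytes, against configured thresholds. On Linux the value can be read from /proc; this is an illustrative sketch, not how Cloudera Manager actually collects the metric:

```python
def resident_set_size_bytes(pid="self"):
    """Read a process's resident set size from /proc (Linux only).

    VmRSS is reported in kB; the conversion to bytes matches the
    BYTES unit used by the process_resident_set_size_ setting.
    Returns None if the VmRSS field is absent.
    """
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1]) * 1024
    return None
```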
Impala Daemon Connectivity Health Check
  Description: Enables the health check that verifies the Impala Daemon is connected to the StateStore.
  Property: impalad_connectivity_health_enabled

Impala Daemon Connectivity Tolerance at Startup
  Description: The amount of time to wait for the Impala Daemon to fully start up and connect to the StateStore before enforcing the connectivity check.
  Property: impalad_connectivity_tolerance
  Default: 180 SECONDS

Impala Daemon Unexpected Exits

Details: This Impala Daemon health check verifies that the Impala Daemon has not recently exited unexpectedly. The check returns "Bad" health if the number of unexpected exits rises above a critical threshold. For example, if this check is configured with a critical threshold of 1, it returns "Good" health if there have been no unexpected exits recently, and "Bad" health if there have been one or more. This test can be configured using the Unexpected Exits and Unexpected Exits Monitoring Period Impala Daemon monitoring settings.

Short Name: Unexpected Exits

Unexpected Exits Monitoring Period
  Description: The period to review when computing unexpected exits.
  Property: unexpected_exits_window
  Default: 5 MINUTES

Unexpected Exits
  Description: The health check for unexpected exits encountered within a recent period specified by the unexpected_exits_window configuration for the role.
  Property: unexpected_exits_
  Default: critical:any, warning:never

Impala Daemon Web Server Status

Details: This Impala Daemon health check verifies that the Impala Daemon's web server is responding quickly to requests from the Cloudera Manager Agent, and that the Agent can collect metrics from the web server. A failure of this health check may indicate a problem with the Impala Daemon's web server, a misconfiguration of the Impala Daemon, or a problem with the Cloudera Manager Agent. Consult the Cloudera Manager Agent logs and the Impala Daemon logs for more detail. If the test's failure message indicates a communication problem, the Cloudera Manager Agent's HTTP requests to the Impala Daemon's web server are failing or timing out. These requests are completely local to the Impala Daemon's host, and so should never fail under normal conditions. If the test's failure message indicates an unexpected response, the Impala Daemon's web server responded to the Cloudera Manager Agent's request, but the Agent could not interpret the response for some reason. This test can be configured using the Web Metric Collection Impala Daemon monitoring setting.

Short Name: Web Server Status

Web Metric Collection
  Description: Enables the health check that the Cloudera Manager Agent can successfully contact and gather metrics from the web server.
  Property: impalad_web_metric_collection_enabled

Impala StateStore Daemon File Descriptors

Details: This Impala StateStore Daemon health check verifies that the number of file descriptors in use does not rise above a configured percentage of the Impala StateStore Daemon file descriptor limit. A failure of this health check may indicate a bug in either Hadoop or Cloudera Manager; contact Cloudera support. This test can be configured using the File Descriptor Monitoring Impala StateStore Daemon monitoring setting.

Short Name: File Descriptors

File Descriptor Monitoring
  Description: The health check of the number of file descriptors used, specified as a percentage of the file descriptor limit.
  Property: statestore_fd_
  Default: critical: , warning:
Impala StateStore Daemon Host Health

Details: This Impala StateStore Daemon health check factors in the health of the host on which the Impala StateStore Daemon is running. A failure of this check means that the host running the Impala StateStore Daemon is experiencing some problem; see that host's status page for more details. This test can be enabled or disabled using the StateStore Host Health Check Impala StateStore Daemon monitoring setting.

Short Name: Host Health

StateStore Host Health Check
  Description: When computing the overall StateStore health, consider the host's health.
  Property: statestore_host_health_enabled

Impala StateStore Daemon Log Directory Free Space

Details: This Impala StateStore Daemon health check verifies that the filesystem containing the log directory of this Impala StateStore Daemon has sufficient free space. This test can be configured using the Log Directory Free Space Monitoring Absolute and Log Directory Free Space Monitoring Percentage Impala StateStore Daemon monitoring settings.

Short Name: Log Directory Free Space

Log Directory Free Space Monitoring Absolute
  Description: The health check for monitoring of free space on the filesystem that contains this role's log directory.
  Property: log_directory_free_space_absolute_
  Default: critical: , warning:
  Unit: BYTES

Log Directory Free Space Monitoring Percentage
  Description: The health check for monitoring of free space on the filesystem that contains this role's log directory, specified as a percentage of the capacity of that filesystem. This setting is not used if a Log Directory Free Space Monitoring Absolute setting is configured.
  Property: log_directory_free_space_percentage_
  Default: critical:never, warning:never
Impala StateStore Daemon Process Status

Details: This Impala StateStore Daemon health check verifies that the Cloudera Manager Agent on the Impala StateStore Daemon host is heartbeating correctly and that the process associated with the Impala StateStore Daemon role is in the state expected by Cloudera Manager. A failure of this health check may indicate a problem with the Impala StateStore Daemon process, a lack of connectivity to the Cloudera Manager Agent on the Impala StateStore Daemon host, or a problem with the Cloudera Manager Agent itself. This check can fail either because the Impala StateStore Daemon has crashed or because it will not start or stop in a timely fashion. Check the Impala StateStore Daemon logs for more details. If the check fails because of problems communicating with the Cloudera Manager Agent on the Impala StateStore Daemon host, check the status of the Agent by running /etc/init.d/cloudera-scm-agent status on that host, or look in the Cloudera Manager Agent logs there for more details. This test can be enabled or disabled using the StateStore Process Health Check Impala StateStore Daemon monitoring setting.

Short Name: Process Status

StateStore Process Health Check
  Description: Enables the health check that the StateStore's process state is consistent with the role configuration.
  Property: statestore_scm_health_enabled

Impala StateStore Daemon Resident Set Size

Details: This Impala StateStore Daemon health check verifies that the size of the process's resident set does not rise above a configured threshold value. A failure of this health check may indicate that the Impala StateStore Daemon process is consuming more memory than expected. Unexpected memory consumption may lead to swapping and decreased performance for processes running on the same host as this Impala StateStore Daemon. Increased Impala StateStore Daemon memory consumption may be caused by an increased workload on the Impala service, or by a bug in the Impala StateStore Daemon software. To avoid failures of this health check, free up additional memory for this Impala StateStore Daemon process and increase the Resident Set Size monitoring setting. This test can be configured using the Resident Set Size Impala StateStore Daemon monitoring setting.

Short Name: Resident Set Size

Resident Set Size
  Description: The health check on the resident size of the process.
  Property: process_resident_set_size_
  Default: critical:never, warning:never
  Unit: BYTES

Impala StateStore Daemon Unexpected Exits

Details: This Impala StateStore Daemon health check verifies that the Impala StateStore Daemon has not recently exited unexpectedly. The check returns "Bad" health if the number of unexpected exits rises above a critical threshold. For example, if this check is configured with a critical threshold of 1, it returns "Good" health if there have been no unexpected exits recently, and "Bad" health if there have been one or more. This test can be configured using the Unexpected Exits and Unexpected Exits Monitoring Period Impala StateStore Daemon monitoring settings.

Short Name: Unexpected Exits

Unexpected Exits Monitoring Period
  Description: The period to review when computing unexpected exits.
  Property: unexpected_exits_window
  Default: 5 MINUTES

Unexpected Exits
  Description: The health check for unexpected exits encountered within a recent period specified by the unexpected_exits_window configuration for the role.
  Property: unexpected_exits_
  Default: critical:any, warning:never

Impala StateStore Daemon Web Server Status

Details: This Impala StateStore Daemon health check verifies that the Impala StateStore Daemon's web server is responding quickly to requests from the Cloudera Manager Agent, and that the Agent can collect metrics from the web server. A failure of this health check may indicate a problem with the Impala StateStore Daemon's web server, a misconfiguration of the Impala StateStore Daemon, or a problem with the Cloudera Manager Agent. Consult the Cloudera Manager Agent logs and the Impala StateStore Daemon logs for more detail. If the test's failure message indicates a communication problem, the Cloudera Manager Agent's HTTP requests to the Impala StateStore Daemon's web server are failing or timing out. These requests are completely local to the Impala StateStore Daemon's host, and so should never fail under normal conditions. If the test's failure message indicates an unexpected response, the Impala StateStore Daemon's web server responded to the Cloudera Manager Agent's request, but the Agent could not interpret the response for some reason. This test can be configured using the Web Metric Collection Impala StateStore Daemon monitoring setting.

Short Name: Web Server Status

Web Metric Collection
  Description: Enables the health check that the Cloudera Manager Agent can successfully contact and gather metrics from the web server.
  Property: statestore_web_metric_collection_enabled

JobTracker File Descriptors

Details: This JobTracker health check verifies that the number of file descriptors in use does not rise above a configured percentage of the JobTracker file descriptor limit. A failure of this health check may indicate a bug in either Hadoop or Cloudera Manager; contact Cloudera support. This test can be configured using the File Descriptor Monitoring JobTracker monitoring setting.

Short Name: File Descriptors

File Descriptor Monitoring
  Description: The health check of the number of file descriptors used, specified as a percentage of the file descriptor limit.
  Property: jobtracker_fd_
  Default: critical: , warning:

JobTracker GC Duration

Details: This JobTracker health check verifies that the JobTracker is not spending too much time performing Java garbage collection: no more than a configured percentage of recent time may be spent in garbage collection. A failure of this health check may indicate a capacity planning problem or a misconfiguration of the JobTracker. This test can be configured using the JobTracker Garbage Collection Duration and JobTracker Garbage Collection Duration Monitoring Period JobTracker monitoring settings.

Short Name: GC Duration
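The GC duration check computes the share of recent wall clock time spent in garbage collection over the monitoring period. An illustrative aggregation over (elapsed, gc) time samples, not the JobTracker's actual bookkeeping:

```python
def gc_time_percent(samples):
    """Percentage of elapsed wall-clock time spent in GC over a window.

    samples is a sequence of (elapsed_seconds, gc_seconds) pairs, e.g.
    one pair per collection interval inside the 5-minute monitoring
    period.  Returns 0.0 for an empty window.
    """
    elapsed = sum(e for e, _ in samples)
    gc = sum(g for _, g in samples)
    return 100.0 * gc / elapsed if elapsed else 0.0
```

For example, 6 seconds of GC inside two one-minute intervals is 5% of wall clock time, which would then be compared against the configured warning and critical percentages.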
JobTracker Garbage Collection Duration Monitoring Period
  Description: The period to review when computing the moving average of garbage collection time.
  Property: jobtracker_gc_duration_window
  Default: 5 MINUTES

JobTracker Garbage Collection Duration
  Description: The health check for the weighted average time spent in Java garbage collection, specified as a percentage of elapsed wall clock time. See JobTracker Garbage Collection Duration Monitoring Period.
  Property: jobtracker_gc_duration_
  Default: critical: , warning:

JobTracker Host Health

Details: This JobTracker health check factors in the health of the host on which the JobTracker is running. A failure of this check means that the host running the JobTracker is experiencing some problem; see that host's status page for more details. This test can be enabled or disabled using the JobTracker Host Health Check JobTracker monitoring setting.

Short Name: Host Health

JobTracker Host Health Check
  Description: When computing the overall JobTracker health, consider the host's health.
  Property: jobtracker_host_health_enabled

JobTracker Log Directory Free Space

Details: This JobTracker health check verifies that the filesystem containing the log directory of this JobTracker has sufficient free space. This test can be configured using the Log Directory Free Space Monitoring Absolute and Log Directory Free Space Monitoring Percentage JobTracker monitoring settings.

Short Name: Log Directory Free Space

Log Directory Free Space Monitoring Absolute
  Description: The health check for monitoring of free space on the filesystem that contains this role's log directory.
  Property: log_directory_free_space_absolute_
  Default: critical: , warning:
  Unit: BYTES

Log Directory Free Space Monitoring Percentage
  Description: The health check for monitoring of free space on the filesystem that contains this role's log directory, specified as a percentage of the capacity of that filesystem. This setting is not used if a Log Directory Free Space Monitoring Absolute setting is configured.
  Property: log_directory_free_space_percentage_
  Default: critical:never, warning:never

JobTracker Process Status

Details: This JobTracker health check verifies that the Cloudera Manager Agent on the JobTracker host is heartbeating correctly and that the process associated with the JobTracker role is in the state expected by Cloudera Manager. A failure of this health check may indicate a problem with the JobTracker process, a lack of connectivity to the Cloudera Manager Agent on the JobTracker host, or a problem with the Cloudera Manager Agent itself. This check can fail either because the JobTracker has crashed or because the JobTracker will not start or stop in a timely fashion. Check the JobTracker logs for more details. If the check fails because of problems communicating with the Cloudera Manager Agent on the JobTracker host, check the status of the Agent by running /etc/init.d/cloudera-scm-agent status on that host, or look in the Cloudera Manager Agent logs there for more details. This test can be enabled or disabled using the JobTracker Process Health Check JobTracker monitoring setting.

Short Name: Process Status

JobTracker Process Health Check
  Description: Enables the health check that the JobTracker's process state is consistent with the role configuration.
  Property: jobtracker_scm_health_enabled
JobTracker Unexpected Exits

Details: This JobTracker health check verifies that the JobTracker has not recently exited unexpectedly. The check returns "Bad" health if the number of unexpected exits rises above a critical threshold. For example, if this check is configured with a critical threshold of 1, the check returns "Good" health if there have been no unexpected exits recently, and "Bad" health if there have been one or more. This test can be configured using the Unexpected Exits and Unexpected Exits Monitoring Period JobTracker monitoring settings.

Short Name: Unexpected Exits

Unexpected Exits Monitoring Period: The period to review when computing unexpected exits. [Property: unexpected_exits_window; Default: 5 MINUTES]

Unexpected Exits: The health check for unexpected exits encountered within a recent period specified by the unexpected_exits_window configuration for the role. [Property: unexpected_exits_; Default: critical:any, warning:never]

JobTracker Web Server Status

Details: This JobTracker health check verifies that the JobTracker's web server is responding quickly to requests from the Cloudera Manager Agent, and that the Cloudera Manager Agent can collect metrics from the web server. A failure of this health check may indicate a problem with the JobTracker's web server, a misconfiguration of the JobTracker, or a problem with the Cloudera Manager Agent. Consult the Cloudera Manager Agent logs and the JobTracker logs for more detail. If the test's failure message indicates a communication problem, the Cloudera Manager Agent's HTTP requests to the JobTracker's web server are failing or timing out. These requests are completely local to the JobTracker's host, and so should never fail under normal conditions.
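Failures of such a web-server probe fall into two classes: the HTTP request itself failing or timing out, versus the server answering with a body the Agent cannot interpret. A small sketch can make the distinction concrete; this is illustrative only, not the Cloudera Manager Agent's actual logic, and the assumption that metrics are served as JSON is mine.

```python
import json
from urllib import request, error

def classify_metrics_body(body):
    # The monitoring agent must be able to interpret the payload; an
    # unparseable body is an "unexpected response" (JSON assumed here).
    try:
        json.loads(body)
    except ValueError:
        return "unexpected response"
    return "ok"

def probe_web_server(url, timeout=5.0):
    # A request that fails outright or times out is a
    # "communication problem"; anything else is classified by body.
    try:
        with request.urlopen(url, timeout=timeout) as resp:
            body = resp.read()
    except (error.URLError, OSError):
        return "communication problem"
    return classify_metrics_body(body)
```

An HTML error page from a misconfigured server, for instance, would be classified as an "unexpected response" rather than a communication problem.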
If the test's failure message indicates an unexpected response, the JobTracker's web server responded to the Cloudera Manager Agent's request, but the Cloudera Manager Agent could not interpret the response for some reason. This test can be configured using the Web Metric Collection JobTracker monitoring setting.

Short Name: Web Server Status

Web Metric Collection: Enables the health check that the Cloudera Manager Agent can successfully contact and gather metrics from the web server. [Property: jobtracker_web_metric_collection_enabled]

JournalNode Edits Directory Free Space

Details: This JournalNode health check verifies that the filesystem containing the edits directory of this JournalNode has sufficient free space. This test can be configured using the Edits Directory Free Space Monitoring Absolute and Edits Directory Free Space Monitoring Percentage JournalNode monitoring settings.

Short Name: Edits Directory Free Space

Edits Directory Free Space Monitoring Absolute: The health check for monitoring of free space on the filesystem that contains the JournalNode's edits directory. [Property: journalnode_edits_directory_free_space_absolute_; Default: critical: , warning: BYTES]

Edits Directory Free Space Monitoring Percentage: The health check for monitoring of free space on the filesystem that contains the JournalNode's edits directory, specified as a percentage of the capacity on that filesystem. This setting is not used if an Edits Directory Free Space Monitoring Absolute setting is configured. [Property: journalnode_edits_directory_free_space_percentage_; Default: critical:never, warning:never]

JournalNode File Descriptors

Details: This JournalNode health check verifies that the number of file descriptors used does not rise above some percentage of the JournalNode's file descriptor limit. A failure of this health check may indicate a bug in either Hadoop or Cloudera Manager; contact Cloudera support. This test can be configured using the File Descriptor Monitoring JournalNode monitoring setting.

Short Name: File Descriptors

File Descriptor Monitoring: The health check of the number of file descriptors used, specified as a percentage of the file descriptor limit. [Property: journalnode_fd_; Default: critical: , warning: ]

JournalNode GC Duration

Details: This JournalNode health check verifies that the JournalNode is not spending too much time performing Java garbage collection: no more than some percentage of recent time may be spent performing Java garbage collection. A failure of this health check may indicate a capacity planning problem or a misconfiguration of the JournalNode. This test can be configured using the JournalNode Garbage Collection Duration and JournalNode Garbage Collection Duration Monitoring Period JournalNode monitoring settings.

Short Name: GC Duration

JournalNode Garbage Collection Duration Monitoring Period: The period to review when computing the moving average of garbage collection time. [Property: journalnode_gc_duration_window; Default: 5 MINUTES]

JournalNode Garbage Collection Duration: The health check for the weighted average time spent in Java garbage collection, specified as a percentage of elapsed wall clock time. See JournalNode Garbage Collection Duration Monitoring Period. [Property: journalnode_gc_duration_; Default: critical: , warning: ]
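The GC Duration checks for the various roles all follow the same recipe: over the monitoring period (5 minutes by default), compute the share of elapsed wall-clock time that the JVM spent in garbage collection, and compare it to the warning and critical thresholds. A rough sketch of that computation follows; the sampling scheme (periodic cumulative-GC-time deltas) is an assumption for illustration, not Cloudera Manager's internals.

```python
def gc_duration_percent(samples, window_seconds=300):
    """Percentage of recent wall-clock time spent in GC.

    `samples` is a list of (timestamp_seconds, gc_millis_since_prev)
    pairs, oldest first; only samples inside the monitoring window
    (default 5 minutes) contribute to the average.
    """
    if len(samples) < 2:
        return 0.0
    now = samples[-1][0]
    recent = [(t, gc) for t, gc in samples if now - t <= window_seconds]
    if len(recent) < 2:
        return 0.0
    # Skip the first recent sample: its GC delta belongs to the
    # interval before the window started.
    gc_ms = sum(gc for _, gc in recent[1:])
    elapsed_ms = (recent[-1][0] - recent[0][0]) * 1000.0
    return 100.0 * gc_ms / elapsed_ms
```

For example, 3000 ms of GC over a 120-second span comes out to 2.5%, which a deployment could then compare against its configured warning and critical percentages.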
JournalNode Host Health

Details: This JournalNode health check factors in the health of the host on which the JournalNode is running. A failure of this check means that the host running the JournalNode is experiencing some problem. See that host's status page for more details. This test can be enabled or disabled using the JournalNode Host Health Check JournalNode monitoring setting.

Short Name: Host Health

JournalNode Host Health Check: When computing the overall JournalNode health, consider the host's health. [Property: journalnode_host_health_enabled]

JournalNode Log Directory Free Space

Details: This JournalNode health check verifies that the filesystem containing the log directory of this JournalNode has sufficient free space. This test can be configured using the Log Directory Free Space Monitoring Absolute and Log Directory Free Space Monitoring Percentage JournalNode monitoring settings.

Short Name: Log Directory Free Space

Log Directory Free Space Monitoring Absolute: The health check for monitoring of free space on the filesystem that contains this role's log directory. [Property: log_directory_free_space_absolute_; Default: critical: , warning: BYTES]

Log Directory Free Space Monitoring Percentage: The health check for monitoring of free space on the filesystem that contains this role's log directory, specified as a percentage of the capacity on that filesystem. This setting is not used if a Log Directory Free Space Monitoring Absolute setting is configured. [Property: log_directory_free_space_percentage_; Default: critical:never, warning:never]

JournalNode Process Status

Details: This JournalNode health check verifies that the Cloudera Manager Agent on the JournalNode host is heartbeating correctly and that the process associated with the JournalNode role is in the state expected by Cloudera Manager. A failure of this health check may indicate a problem with the JournalNode process, a lack of connectivity to the Cloudera Manager Agent on the JournalNode host, or a problem with the Cloudera Manager Agent itself. This check can fail either because the JournalNode has crashed or because the JournalNode will not start or stop in a timely fashion. Check the JournalNode logs for more details. If the check fails because of problems communicating with the Cloudera Manager Agent on the JournalNode host, check the status of the Cloudera Manager Agent by running /etc/init.d/cloudera-scm-agent status on the JournalNode host, or look in the Cloudera Manager Agent logs on that host for more details. This test can be enabled or disabled using the JournalNode Process Health Check JournalNode monitoring setting.

Short Name: Process Status

JournalNode Process Health Check: Enables the health check that the JournalNode's process state is consistent with the role configuration. [Property: journalnode_scm_health_enabled]

JournalNode Sync Status

Details: This JournalNode health check verifies that the active NameNode is in sync with this JournalNode. The check returns "Bad" health if the active NameNode is out of sync with the JournalNode, and is disabled when there is no active NameNode. This test can be configured using the Active NameNode Sync Status Health Check and Active NameNode Sync Status Startup Tolerance JournalNode monitoring settings.

Short Name: Sync Status

Active NameNode Sync Status Health Check: Enables the health check that verifies the active NameNode's sync status to the JournalNode. [Property: journalnode_sync_status_enabled]

Active NameNode Sync Status Startup Tolerance: The amount of time at JournalNode startup allowed for the active NameNode to get in sync with the JournalNode. [Property: journalnode_sync_status_startup_tolerance; Default: 180 SECONDS]

JournalNode Unexpected Exits

Details: This JournalNode health check verifies that the JournalNode has not recently exited unexpectedly. The check returns "Bad" health if the number of unexpected exits rises above a critical threshold. For example, if this check is configured with a critical threshold of 1, the check returns "Good" health if there have been no unexpected exits recently, and "Bad" health if there have been one or more. This test can be configured using the Unexpected Exits and Unexpected Exits Monitoring Period JournalNode monitoring settings.

Short Name: Unexpected Exits

Unexpected Exits Monitoring Period: The period to review when computing unexpected exits. [Property: unexpected_exits_window; Default: 5 MINUTES]

Unexpected Exits: The health check for unexpected exits encountered within a recent period specified by the unexpected_exits_window configuration for the role. [Property: unexpected_exits_; Default: critical:any, warning:never]

JournalNode Web Server Status

Details: This JournalNode health check verifies that the JournalNode's web server is responding quickly to requests from the Cloudera Manager Agent, and that the Cloudera Manager Agent can collect metrics from the web server. A failure of this health check may indicate a problem with the JournalNode's web server, a misconfiguration of the JournalNode, or a problem with the Cloudera Manager Agent. Consult the Cloudera Manager Agent logs and the JournalNode logs for more detail. If the test's failure message indicates a communication problem, the Cloudera Manager Agent's HTTP requests to the JournalNode's web server are failing or timing out. These requests are completely local to the JournalNode's host, and so should never fail under normal conditions. If the test's failure message indicates an unexpected response, the JournalNode's web server responded to the Cloudera Manager Agent's request, but the Cloudera Manager Agent could not interpret the response for some reason. This test can be configured using the Web Metric Collection JournalNode monitoring setting.

Short Name: Web Server Status

Web Metric Collection: Enables the health check that the Cloudera Manager Agent can successfully contact and gather metrics from the web server. [Property: journalnode_web_metric_collection_enabled]

MapReduce Active JobTracker Health

Details: This MapReduce service-level health check verifies the presence of an active, running, and healthy JobTracker. The check returns "Bad" health if the service is running and a running, active JobTracker cannot be found. In all other cases it returns the health of the running, active JobTracker. A failure of this health check may indicate stopped or unhealthy JobTracker roles, the need to issue a failover command to make some JobTracker active, or a problem with communication between the Cloudera Manager Service Monitor and one or more JobTrackers. Check the status of the MapReduce service's JobTracker roles and look in the Cloudera Manager Service Monitor's log files for more information when this check fails. This test can be enabled or disabled using the JobTracker Role Health Check MapReduce service-wide monitoring setting.
In addition, the Active JobTracker Detection Window can be used to adjust the amount of time that the Cloudera Manager Service Monitor has to detect the active MapReduce JobTracker before this health check fails, and the JobTracker Activation Startup Tolerance can be used to adjust the amount of time around JobTracker startup that the check allows for a JobTracker to be made active.

Short Name: Active JobTracker Health

Active JobTracker Detection Window: The tolerance window that will be used in MapReduce service tests that depend on detection of the active JobTracker. [Property: mapreduce_active_jobtracker_detecton_window; Default: 3 MINUTES]
JobTracker Activation Startup Tolerance: The amount of time after JobTracker(s) start that the lack of an active JobTracker will be tolerated. This is intended to allow either the auto-failover daemon to make a JobTracker active, or a specifically issued failover command to take effect. [Property: mapreduce_jobtracker_activation_startup_tolerance; Default: 180 SECONDS]

JobTracker Role Health Check: When computing the overall MapReduce cluster health, consider the JobTracker's health. [Property: mapreduce_jobtracker_health_enabled]

MapReduce Job Failure Ratio

Details: This MapReduce service-level health check verifies that no more than some percentage of recently completed jobs have failed. A failure of this health check may indicate problems with the MapReduce service or with the failing jobs. Check the status of the MapReduce service for more details. This test can be configured using the Job Failure Ratio, Job Failure Ratio Minimum Failing Jobs, and Job Failure Ratio Monitoring Period MapReduce service-wide monitoring settings.

Short Name: Job Failure Ratio

Job Failure Ratio Minimum Failing Jobs: The minimum number of jobs that must fail during the test time period before the threshold values will be checked. Until this number of jobs have failed in the test time period, the health check will continue to return good health. [Property: mapreduce_job_failure_ratio_minimum_jobs; Default: 0]
Job Failure Ratio Monitoring Period: The time period to review when computing the job failure ratio. Specified in minutes. [Property: mapreduce_job_failure_ratio_window; Default: 5 MINUTES]

Job Failure Ratio: The health check of the number of recently failed jobs, specified as a percentage of recently completed jobs. See Job Failure Ratio Monitoring Period. [Property: mapreduce_job_failure_ratio_; Default: critical:never, warning:never]

MapReduce JobTracker Health

Details: This MapReduce service-level health check verifies the presence of a running, healthy JobTracker. The check returns "Bad" health if the service is running and the JobTracker is not. In all other cases it returns the health of the JobTracker. A failure of this health check indicates a stopped or unhealthy JobTracker. Check the status of the JobTracker for more information. This test can be enabled or disabled using the JobTracker Role Health Check MapReduce service-wide monitoring setting.

Short Name: JobTracker Health

JobTracker Role Health Check: When computing the overall MapReduce cluster health, consider the JobTracker's health. [Property: mapreduce_jobtracker_health_enabled]

MapReduce Map Task Backlog

Details: This MapReduce service-level health check verifies that the number of waiting map tasks in the cluster does not rise above some percentage of the total number of available map slots. The behavior of this health check depends on how the MapReduce cluster is being used. In some scenarios it may be normal to have large numbers of waiting map tasks; in such scenarios this test should be disabled. In other scenarios, a failure of this health check may indicate a capacity planning problem or a problem with the MapReduce service. Check the status of the MapReduce service for more details. This test can be configured using the MapReduce Map Task Backlog MapReduce service-wide
monitoring setting.

Short Name: Map Task Backlog

MapReduce Map Task Backlog: The health check of the number of map tasks in the backlog, specified as a percentage of the total number of map slots. [Property: mapreduce_map_backlog_; Default: critical:never, warning:never]

MapReduce Map Task Locality

Details: This MapReduce service-level health check verifies that no more than some percentages of recently completed maps were operating on rack-local or other-local data. The check returns "Concerning" health if the number of rack-local maps is above a configured minimum number of maps and greater than the warning threshold, or if the number of other-local maps is above a configured minimum number of maps and greater than the warning threshold. The test never returns "Bad" health. A failure of this health check may indicate problems with the configuration of the MapReduce service. Consider using the fair scheduler and changing its delay configuration, mapred.fairscheduler.locality.delay. In some scenarios it may be normal to have a large number of non-local maps; for example, data import maps are always non-local. In such scenarios, consider disabling one or more of the thresholds used by this test. This test can be configured using the Rack-Local Map Task, Maps Locality Minimum Rack-Local Maps, Other-Local Map Task, Maps Locality Minimum Other-Local Maps, and Map Tasks Locality Monitoring Period MapReduce service-wide monitoring settings.

Short Name: Map Task Locality

Map Tasks Locality Monitoring Period: The time period to monitor when computing health test results for map task locality. Specified in minutes. [Property: mapreduce_maps_locality_window; Default: 15 MINUTES]

Maps Locality Minimum Other-Local Maps: The minimum number of non-local maps that must complete during the test time period before the threshold values will be checked. Until this number of non-local maps have completed in the test time period, the health check will continue to return good health. [Property: mapreduce_maps_locality_minimum_other_locality_maps; Default: 0]

Maps Locality Minimum Rack-Local Maps: The minimum number of rack-local maps that must complete during the test time period before the threshold values will be checked. Until this number of rack-local maps have completed in the test time period, the health check will continue to return good health. [Property: mapreduce_maps_locality_minimum_rack_local_maps; Default: 0]

Other-Local Map Task: The health check of the number of map tasks using non-local data, specified as a percentage of other-local map tasks in the total number of map tasks. [Property: mapreduce_other_local_; Default: critical:never, warning:never]

Rack-Local Map Task: The health check of the number of map tasks using non-local data, specified as a percentage of rack-local map tasks in the total number of map tasks. [Property: mapreduce_rack_local_; Default: critical:never, warning:never]

MapReduce Reduce Task Backlog

Details: This MapReduce service-level health check verifies that the number of waiting reduce tasks in the cluster does not rise above some percentage of the total number of available reduce slots. The behavior of this health check depends on how the MapReduce cluster is being used. In some scenarios it may be normal to have large numbers of waiting reduce tasks; in such scenarios this test should be disabled. In other scenarios, a failure of this health check may indicate a capacity planning problem or a problem with the MapReduce service. Check the status of the MapReduce service for more details. This test can be configured using the MapReduce Reduce Task Backlog MapReduce service-wide monitoring setting.

Short Name: Reduce Task Backlog
MapReduce Reduce Task Backlog: The health check for the number of reduce tasks in the backlog, specified as a percentage of the total number of reduce slots. [Property: mapreduce_reduce_backlog_; Default: critical:never, warning:never]

MapReduce Standby JobTracker Health

Details: This MapReduce service-level health check verifies that there is a running, healthy JobTracker in standby mode. The check is disabled if the MapReduce service is not configured with multiple JobTrackers. Otherwise, the check returns "Concerning" health if either of two conditions is met: there is no JobTracker running in standby mode, or the running standby JobTracker is in less than "Good" health. A failure of this health check may indicate one or more stopped or unhealthy JobTrackers, or a problem with communication between the Cloudera Manager Service Monitor and some or all of the MapReduce JobTrackers. Check the status of the MapReduce service's JobTracker roles and the Cloudera Manager Service Monitor's log files for more information when this check fails. This test can be enabled or disabled using the Standby JobTracker Health Check MapReduce service-wide monitoring setting. In addition, the Active JobTracker Detection Window can be used to adjust the amount of time that the Cloudera Manager Service Monitor has to detect the active JobTracker before this health check fails.

Short Name: Standby JobTracker Health

Active JobTracker Detection Window: The tolerance window that will be used in MapReduce service tests that depend on detection of the active JobTracker. [Property: mapreduce_active_jobtracker_detecton_window; Default: 3 MINUTES]

Standby JobTracker Health Check: When computing the overall cluster health, consider the health of the standby JobTracker. [Property: mapreduce_standby_jobtrackers_health_enabled]

MapReduce TaskTrackers Health

Details: This MapReduce service-level health check verifies that enough of the TaskTrackers in the cluster are healthy. The check returns "Concerning" health if the number of healthy TaskTrackers falls below a warning threshold, expressed as a percentage of the total number of TaskTrackers. The check returns "Bad" health if the number of healthy and "Concerning" TaskTrackers falls below a critical threshold, expressed as a percentage of the total number of TaskTrackers. For example, if this check is configured with a warning threshold of 95% and a critical threshold of 90% for a cluster of 100 TaskTrackers, this check returns "Good" health if 95 or more TaskTrackers have good health, "Concerning" health if at least 90 TaskTrackers have either "Good" or "Concerning" health, and "Bad" health if more than 10 TaskTrackers have bad health. A failure of this health check indicates unhealthy TaskTrackers. Check the status of the individual TaskTrackers for more information. This test can be configured using the Healthy TaskTracker Monitoring MapReduce service-wide monitoring setting.

Short Name: TaskTrackers Health

Healthy TaskTracker Monitoring: The health check of the overall TaskTrackers' health. The check returns "Concerning" health if the percentage of "Healthy" TaskTrackers falls below the warning threshold. The check is unhealthy if the total percentage of "Healthy" and "Concerning" TaskTrackers falls below the critical threshold. [Property: mapreduce_tasktrackers_healthy_; Default: critical: , warning: ]

Master File Descriptors

Details: This Master health check verifies that the number of file descriptors used does not rise above some percentage of the Master's file descriptor limit. A failure of this health check may indicate a bug in either Hadoop or Cloudera Manager; contact Cloudera support. This test can be configured using the File Descriptor Monitoring Master monitoring setting.

Short Name: File Descriptors

File Descriptor Monitoring: The health check of the number of file descriptors used, specified as a percentage of the file descriptor limit. [Property: master_fd_; Default: critical: , warning: ]

Master GC Duration

Details: This Master health check verifies that the Master is not spending too much time performing Java garbage collection: no more than some percentage of recent time may be spent performing Java garbage collection. A failure of this health check may indicate a capacity planning problem or a misconfiguration of the Master. This test can be configured using the HBase Master Garbage Collection Duration and HBase Master Garbage Collection Duration Monitoring Period Master monitoring settings.

Short Name: GC Duration

HBase Master Garbage Collection Duration Monitoring Period: The period to review when computing the moving average of garbage collection time. [Property: master_gc_duration_window; Default: 5 MINUTES]

HBase Master Garbage Collection Duration: The health check for the weighted average time spent in Java garbage collection, specified as a percentage of elapsed wall clock time. See HBase Master Garbage Collection Duration Monitoring Period. [Property: master_gc_duration_; Default: critical: , warning: ]

Master HBase Master Canary

Details: This HBase Master health check verifies that a client can connect to and get basic information from the Master in a reasonable amount of time. The check returns "Bad" health if the connection to or basic queries of the Master fail, and "Concerning" health if the connection attempt or queries do not complete in a reasonable time. A failure of this health check may indicate that the Master is failing to satisfy basic client requests correctly or in a timely fashion. Check the status of the Master, and look in the Master logs for more details. This test can be enabled or disabled using the HBase Master Canary Health Check HBase Master monitoring setting.

Short Name: HBase Master Canary
HBase Master Canary Health Check: Enables the health check that a client can connect to the HBase Master. [Property: master_canary_health_enabled]

Master Host Health

Details: This Master health check factors in the health of the host on which the Master is running. A failure of this check means that the host running the Master is experiencing some problem. See that host's status page for more details. This test can be enabled or disabled using the HBase Master Host Health Check Master monitoring setting.

Short Name: Host Health

HBase Master Host Health Check: When computing the overall HBase Master health, consider the host's health. [Property: master_host_health_enabled]

Master Log Directory Free Space

Details: This Master health check verifies that the filesystem containing the log directory of this Master has sufficient free space. This test can be configured using the Log Directory Free Space Monitoring Absolute and Log Directory Free Space Monitoring Percentage Master monitoring settings.

Short Name: Log Directory Free Space

Log Directory Free Space Monitoring Absolute: The health check for monitoring of free space on the filesystem that contains this role's log directory. [Property: log_directory_free_space_absolute_; Default: critical: , warning: BYTES]

Log Directory Free Space Monitoring Percentage: The health check for monitoring of free space on the filesystem that contains this role's log directory, specified as a percentage of the capacity on that filesystem. This setting is not used if a Log Directory Free Space Monitoring Absolute setting is configured. [Property: log_directory_free_space_percentage_; Default: critical:never, warning:never]

Master Process Status

Details: This Master health check verifies that the Cloudera Manager Agent on the Master host is heartbeating correctly and that the process associated with the Master role is in the state expected by Cloudera Manager. A failure of this health check may indicate a problem with the Master process, a lack of connectivity to the Cloudera Manager Agent on the Master host, or a problem with the Cloudera Manager Agent itself. This check can fail either because the Master has crashed or because the Master will not start or stop in a timely fashion. Check the Master logs for more details. If the check fails because of problems communicating with the Cloudera Manager Agent on the Master host, check the status of the Cloudera Manager Agent by running /etc/init.d/cloudera-scm-agent status on the Master host, or look in the Cloudera Manager Agent logs on that host for more details. This test can be enabled or disabled using the HBase Master Process Health Check Master monitoring setting.

Short Name: Process Status

HBase Master Process Health Check: Enables the health check that the HBase Master's process state is consistent with the role configuration. [Property: master_scm_health_enabled]

Healthy HBase Region Servers Monitoring: The health check of the overall HBase Region Servers' health. The check returns "Concerning" health if the percentage of "Healthy" HBase Region Servers falls below the warning threshold. The check is unhealthy if the total percentage of "Healthy" and "Concerning" HBase Region Servers falls below the critical threshold. [Property: hbase_regionservers_healthy_; Default: critical: , warning: ]

Master Unexpected Exits

Details: This Master health check verifies that the Master has not recently exited unexpectedly. The check returns "Bad" health if the number of unexpected exits rises above a critical threshold. For example, if this check is configured with a critical threshold of 1, the check returns "Good" health if there have been no unexpected exits recently, and "Bad" health if there have been one or more. This test can be configured using the Unexpected Exits and Unexpected Exits Monitoring Period Master monitoring settings.

Short Name: Unexpected Exits

Unexpected Exits Monitoring Period: The period to review when computing unexpected exits. [Property: unexpected_exits_window; Default: 5 MINUTES]

Unexpected Exits: The health check for unexpected exits encountered within a recent period specified by the unexpected_exits_window configuration for the role. [Property: unexpected_exits_; Default: critical:any, warning:never]

Master Web Server Status

Details: This Master health check verifies that the Master's web server is responding quickly to requests from the Cloudera Manager Agent, and that the Cloudera Manager Agent can collect metrics from the web server. A failure of this health check may indicate a problem with the Master's web server, a misconfiguration of the Master, or a problem with the Cloudera Manager Agent. Consult the Cloudera Manager Agent logs and the Master logs for more detail. If the test's failure message indicates a communication problem, the Cloudera Manager Agent's HTTP requests to the Master's web server are failing or timing out. These requests are completely local to the Master's host, and so should never fail under normal conditions.
If the test's failure message indicates an unexpected response, the Master's web server responded to the Cloudera Manager Agent's request, but the Cloudera Manager Agent could not interpret the response for some reason. This test can be configured using the Web Metric Collection Master monitoring setting.

Short Name: Web Server Status

Web Metric Collection: Enables the health check that the Cloudera Manager Agent can successfully contact and gather metrics from the web server. [Property: master_web_metric_collection_enabled]

NameNode Checkpoint Status

Details: This NameNode health check verifies that the NameNode's filesystem checkpoint is no older than some percentage of the Filesystem Checkpoint Period, and that the number of transactions that have occurred since the last filesystem checkpoint does not exceed some percentage of the Filesystem Checkpoint Transaction Limit. A failure of this health check may indicate a problem with the active NameNode or its configured checkpointing role, i.e. either a standby NameNode or a SecondaryNameNode. Check the NameNode logs or the logs of the checkpointing role for more details. This test can be configured using the Filesystem Checkpoint Age Monitoring and Filesystem Checkpoint Transactions Monitoring NameNode monitoring settings.

Short Name: Checkpoint Status

Filesystem Checkpoint Age Monitoring: The health check of the age of the HDFS namespace checkpoint, specified as a percentage of the configured checkpoint interval. [Property: namenode_checkpoint_age_; Default: critical: , warning: ]

Filesystem Checkpoint Transactions Monitoring: The health check of the number of transactions since the last HDFS namespace checkpoint, specified as a percentage of the configured checkpointing transaction limit. [Property: namenode_checkpoint_transactions_; Default: critical: , warning: ]
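Both checkpoint settings above express their thresholds as a percentage of a configured limit: checkpoint age as a percentage of the checkpoint interval, and transaction count as a percentage of the transaction limit. A sketch of how the two quantities might combine into one health result follows; the function, its shape, and the warning/critical percentages are illustrative assumptions, not Cloudera Manager's actual defaults or logic.

```python
def checkpoint_health(age_seconds, txns_since_checkpoint,
                      checkpoint_period_seconds, txn_limit,
                      warn_pct=200.0, crit_pct=400.0):
    """Illustrative sketch of the Checkpoint Status logic described
    above. warn_pct and crit_pct are placeholder assumptions.
    """
    # Express each quantity as a percentage of its configured limit.
    age_pct = 100.0 * age_seconds / checkpoint_period_seconds
    txn_pct = 100.0 * txns_since_checkpoint / txn_limit
    # The worse of the two drives the overall result.
    worst = max(age_pct, txn_pct)
    if worst >= crit_pct:
        return "Bad"
    if worst >= warn_pct:
        return "Concerning"
    return "Good"
```

With a one-hour checkpoint period and a 1,000,000-transaction limit, a checkpoint that is one hour old with 500,000 pending transactions sits at 100% and 50% of its limits respectively, so under these placeholder thresholds the result would be "Good".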
NameNode Data Directories Free Space

Details: This NameNode health check verifies that the filesystems containing the data directories have sufficient free space. This test can be configured using the Data Directories Free Space Monitoring Absolute and Data Directories Free Space Monitoring Percentage NameNode monitoring settings.

Short Name: Data Directories Free Space

Data Directories Free Space Monitoring Absolute: The health check for monitoring of free space on the filesystems that contain this role's data directories. (Property: namenode_data_directories_free_space_absolute_; default: critical: , warning: BYTES)

Data Directories Free Space Monitoring Percentage: The health check for monitoring of free space on the filesystems that contain this role's data directories. Specified as a percentage of the capacity of the filesystem. This setting is not used if a Data Directories Free Space Monitoring Absolute setting is configured. (Property: namenode_data_directories_free_space_percentage_; default: critical:never, warning:never)

NameNode File Descriptors

Details: This NameNode health check verifies that the number of file descriptors in use does not rise above some percentage of the NameNode's file descriptor limit. A failure of this health check may indicate a bug in either Hadoop or Cloudera Manager; contact Cloudera support. This test can be configured using the File Descriptor Monitoring NameNode monitoring setting.

Short Name: File Descriptors

File Descriptor Monitoring: The health check of the number of file descriptors used. Specified as a percentage of the file descriptor limit. (Property: namenode_fd_; default: critical: , warning: )

NameNode GC Duration

Details: This NameNode health check verifies that the NameNode is not spending too much time performing Java garbage collection: no more than some percentage of recent time may be spent in Java garbage collection. A failure of this health check may indicate a capacity planning problem or a misconfiguration of the NameNode. This test can be configured using the NameNode Garbage Collection Duration and NameNode Garbage Collection Duration Monitoring Period NameNode monitoring settings.

Short Name: GC Duration

NameNode Garbage Collection Duration Monitoring Period: The period to review when computing the moving average of garbage collection time. (Property: namenode_gc_duration_window; default: 5 MINUTES)

NameNode Garbage Collection Duration: The health check for the weighted average time spent in Java garbage collection. Specified as a percentage of elapsed wall clock time. See NameNode Garbage Collection Duration Monitoring Period. (Property: namenode_gc_duration_; default: critical: , warning: )

NameNode Host Health

Details: This NameNode health check factors in the health of the host on which the NameNode is running. A failure of this check means that the host running the NameNode is experiencing some problem. See that host's status page for more details. This test can be enabled or disabled using the NameNode Host Health Check NameNode monitoring setting.

Short Name: Host Health
NameNode Host Health Check: When computing the overall NameNode health, consider the host's health. (Property: namenode_host_health_enabled)

NameNode JournalNode Sync Status

Details: This NameNode health check verifies that the NameNode is not out of sync with too many JournalNodes. This check is disabled if Quorum-based storage is not in use. A failure of this health check may indicate a problem with the JournalNodes or a communication problem between the NameNode and the JournalNodes. Check the NameNode and JournalNode logs for more details. This test can be configured using the NameNode Out-Of-Sync JournalNodes NameNode monitoring setting.

Short Name: JournalNode Sync Status

NameNode Out-Of-Sync JournalNodes: The health check for the number of out-of-sync JournalNodes for this NameNode. (Property: namenode_out_of_sync_journal_nodes_; default: critical:any, warning:never)

NameNode Log Directory Free Space

Details: This NameNode health check verifies that the filesystem containing the log directory of this NameNode has sufficient free space. This test can be configured using the Log Directory Free Space Monitoring Absolute and Log Directory Free Space Monitoring Percentage NameNode monitoring settings.

Short Name: Log Directory Free Space

Log Directory Free Space Monitoring Absolute: The health check for monitoring of free space on the filesystem that contains this role's log directory. (Property: log_directory_free_space_absolute_; default: critical: , warning: BYTES)
Log Directory Free Space Monitoring Percentage: The health check for monitoring of free space on the filesystem that contains this role's log directory. Specified as a percentage of the capacity of that filesystem. This setting is not used if a Log Directory Free Space Monitoring Absolute setting is configured. (Property: log_directory_free_space_percentage_; default: critical:never, warning:never)

NameNode Name Directory Status

Details: This NameNode health check checks whether the NameNode has reported any failed data directories. A failure of this health check indicates that there is a problem with one or more data directories on the NameNode. See the NameNode system web UI for more information. Note that unless dfs.namenode.name.dir.restore is enabled, the NameNode requires a restart to recognize data directories that have been restored (e.g., after an NFS outage). This test can be configured using the NameNode Directory Failures NameNode monitoring setting.

Short Name: Name Directory Status

NameNode Directory Failures: The health check of failed status directories in a NameNode. (Property: namenode_directory_failures_; default: critical:any, warning:never)

NameNode Process Status

Details: This NameNode health check verifies that the Cloudera Manager Agent on the NameNode host is heartbeating correctly and that the process associated with the NameNode role is in the state expected by Cloudera Manager. A failure of this health check may indicate a problem with the NameNode process, a lack of connectivity to the Cloudera Manager Agent on the NameNode host, or a problem with the Cloudera Manager Agent. This check can fail either because the NameNode has crashed or because the NameNode will not start or stop in a timely fashion. Check the NameNode logs for more details. If the check fails because of problems communicating with the Cloudera Manager Agent on the NameNode host, check the status of the Agent by running /etc/init.d/cloudera-scm-agent status on the NameNode host, or look in the Cloudera Manager Agent logs on that host for more details. This test can be enabled or disabled using the NameNode Process Health Check NameNode monitoring setting.

Short Name: Process Status

NameNode Process Health Check: Enables the health check that the NameNode's process state is consistent with the role configuration. (Property: namenode_scm_health_enabled)

NameNode RPC Latency

Details: This NameNode health check verifies that a moving average of the average time it takes the NameNode to respond to requests does not exceed some value. A failure of this health check could indicate a misconfiguration of the NameNode, a problem writing to one of its data directories, or a capacity planning problem. Check the NameNode's RpcQueueTime_avg_time; if it indicates that the bulk of the RPC latency is spent with requests queued, try increasing the NameNode Handler Count. If the NameNode's RpcProcessingTime_avg_time indicates that the bulk of the RPC latency is due to request processing, check that each of the directories in which the HDFS metadata is stored is performing adequately. This test can be configured using the NameNode RPC Latency and NameNode RPC Latency Monitoring Window NameNode monitoring settings.

Short Name: RPC Latency

NameNode RPC Latency Monitoring Window: The period to review when computing the moving average of the NameNode's RPC latency. (Property: namenode_rpc_latency_window; default: 5 MINUTES)

NameNode RPC Latency: The health check of the NameNode's RPC latency. (Property: namenode_rpc_latency_; default: critical: , warning: MILLISECONDS)

NameNode Safe Mode Status

Details: This NameNode health check verifies that the NameNode is not in safemode. A failure of this health check indicates that the NameNode is in safemode. Look in the NameNode logs for more details. This test can be enabled or disabled using the NameNode Safemode Health Check NameNode monitoring setting. The Health Check Startup Tolerance NameNode monitoring setting also controls the amount of time after NameNode startup during which safemode is tolerated.

Short Name: Safe Mode Status

Health Check Startup Tolerance: The amount of time allowed after this role is started during which failures of health checks that rely on communication with this role will be tolerated. (Property: namenode_startup_tolerance; default: 5 MINUTES)

NameNode Safemode Health Check: Enables the health check that the NameNode is not in safemode. (Property: namenode_safe_mode_enabled)

NameNode Unexpected Exits

Details: This NameNode health check verifies that the NameNode has not recently exited unexpectedly. The check returns "Bad" health if the number of unexpected exits rises above a critical threshold. For example, if this check is configured with a critical threshold of 1, it returns "Good" health if there have been no unexpected exits recently, and "Bad" health if there have been one or more unexpected exits recently. This test can be configured using the Unexpected Exits and Unexpected Exits Monitoring Period NameNode monitoring settings.

Short Name: Unexpected Exits

Unexpected Exits Monitoring Period: The period to review when computing unexpected exits. (Property: unexpected_exits_window; default: 5 MINUTES)

Unexpected Exits: The health check for unexpected exits encountered within a recent period specified by the unexpected_exits_window configuration for the role. (Property: unexpected_exits_; default: critical:any, warning:never)

NameNode Upgrade Status

Details: This NameNode health check checks for the presence of an unfinalized HDFS metadata upgrade. If an unfinalized HDFS metadata upgrade is detected, this check returns "Concerning" health. A failure of this health check indicates that a previously performed HDFS upgrade needs to be finalized. This can be done via the Finalize Metadata Upgrade NameNode command using Cloudera Manager. This test can be enabled or disabled using the HDFS Upgrade Status Health Check NameNode monitoring setting.

Short Name: Upgrade Status

HDFS Upgrade Status Health Check: Enables the health check of the upgrade status of the NameNode. (Property: namenode_upgrade_status_enabled)

NameNode Web Server Status

Details: This NameNode health check verifies that the NameNode's web server is responding quickly to requests from the Cloudera Manager Agent, and that the Agent can collect metrics from the web server. A failure of this health check may indicate a problem with the NameNode's web server, a misconfiguration of the NameNode, or a problem with the Cloudera Manager Agent. Consult the Cloudera Manager Agent logs and the NameNode logs for more detail. If the test's failure message indicates a communication problem, the Cloudera Manager Agent's HTTP requests to the NameNode's web server are failing or timing out. These requests are completely local to the NameNode's host, and so should never fail under normal conditions. If the test's failure message indicates an unexpected response, the NameNode's web server responded to the Cloudera Manager Agent's request, but the Agent could not interpret the response.
This test can be configured using the Web Metric Collection NameNode monitoring setting.

Short Name: Web Server Status
Web Metric Collection: Enables the health check that the Cloudera Manager Agent can successfully contact and gather metrics from the web server. (Property: namenode_web_metric_collection_enabled)

Navigator Server File Descriptors

Details: This Navigator Server health check verifies that the number of file descriptors in use does not rise above some percentage of the Navigator Server's file descriptor limit. A failure of this health check may indicate a bug in either Hadoop or Cloudera Manager; contact Cloudera support. This test can be configured using the File Descriptor Monitoring Navigator Server monitoring setting.

Short Name: File Descriptors

File Descriptor Monitoring: The health check of the number of file descriptors used. Specified as a percentage of the file descriptor limit. (Property: navigator_fd_; default: critical: , warning: )

Navigator Server Host Health

Details: This Navigator Server health check factors in the health of the host on which the Navigator Server is running. A failure of this check means that the host running the Navigator Server is experiencing some problem. See that host's status page for more details. This test can be enabled or disabled using the Navigator Host Health Check Navigator Server monitoring setting.

Short Name: Host Health

Navigator Host Health Check: When computing the overall Navigator health, consider the host's health. (Property: navigator_host_health_enabled)
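File descriptor checks like the one above compare descriptors in use against the process limit as a percentage. A sketch with illustrative thresholds (the shipped defaults are not preserved in this text) and a Linux-specific way of counting open descriptors:

```python
import os
import resource

def fd_usage_percent(pid="self"):
    """File descriptors in use as a percentage of the soft RLIMIT_NOFILE.
    Counting entries under /proc/<pid>/fd is Linux-specific."""
    soft_limit, _hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
    in_use = len(os.listdir("/proc/%s/fd" % pid))
    return 100.0 * in_use / soft_limit

def fd_health(pct, warn_pct=50.0, crit_pct=70.0):
    # warn_pct/crit_pct are illustrative, not Cloudera Manager's defaults.
    if pct >= crit_pct:
        return "Bad"
    if pct >= warn_pct:
        return "Concerning"
    return "Good"
```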
Navigator Server Log Directory Free Space

Details: This Navigator Server health check verifies that the filesystem containing the log directory of this Navigator Server has sufficient free space. This test can be configured using the Log Directory Free Space Monitoring Absolute and Log Directory Free Space Monitoring Percentage Navigator Server monitoring settings.

Short Name: Log Directory Free Space

Log Directory Free Space Monitoring Absolute: The health check for monitoring of free space on the filesystem that contains this role's log directory. (Property: log_directory_free_space_absolute_; default: critical: , warning: BYTES)

Log Directory Free Space Monitoring Percentage: The health check for monitoring of free space on the filesystem that contains this role's log directory. Specified as a percentage of the capacity of that filesystem. This setting is not used if a Log Directory Free Space Monitoring Absolute setting is configured. (Property: log_directory_free_space_percentage_; default: critical:never, warning:never)

Navigator Server Process Status

Details: This Navigator Server health check verifies that the Cloudera Manager Agent on the Navigator Server host is heartbeating correctly and that the process associated with the Navigator Server role is in the state expected by Cloudera Manager. A failure of this health check may indicate a problem with the Navigator Server process, a lack of connectivity to the Cloudera Manager Agent on the Navigator Server host, or a problem with the Cloudera Manager Agent. This check can fail either because the Navigator Server has crashed or because the Navigator Server will not start or stop in a timely fashion. Check the Navigator Server logs for more details.
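When triaging this kind of failure, the first question is whether the role process died or the agent itself is unreachable, and checking an init script's status action is a common first step. A sketch of reading that status programmatically, assuming the LSB exit-code convention (0 = running, 3 = stopped); the command tuple is a parameter because the script path here is only the conventional one:

```python
import subprocess

def service_status(cmd=("/etc/init.d/cloudera-scm-agent", "status")):
    """Interpret an init script's 'status' action by its exit code,
    following the LSB convention: 0 means running, 3 means stopped.
    Any other code is reported as indeterminate."""
    rc = subprocess.call(list(cmd),
                         stdout=subprocess.DEVNULL,
                         stderr=subprocess.DEVNULL)
    if rc == 0:
        return "running"
    if rc == 3:
        return "stopped"
    return "unknown (exit code %d)" % rc
```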
If the check fails because of problems communicating with the Cloudera Manager Agent on the Navigator Server host, check the status of the Agent by running /etc/init.d/cloudera-scm-agent status on the Navigator Server host, or look in the Cloudera Manager Agent logs on that host for more details. This test can be enabled or disabled using the Navigator Process Health Check Navigator Server monitoring setting.

Short Name: Process Status
Navigator Process Health Check: Enables the health check that the Navigator's process state is consistent with the role configuration. (Property: navigator_scm_health_enabled)

Navigator Server Unexpected Exits

Details: This Navigator Server health check verifies that the Navigator Server has not recently exited unexpectedly. The check returns "Bad" health if the number of unexpected exits rises above a critical threshold. For example, if this check is configured with a critical threshold of 1, it returns "Good" health if there have been no unexpected exits recently, and "Bad" health if there have been one or more unexpected exits recently. This test can be configured using the Unexpected Exits and Unexpected Exits Monitoring Period Navigator Server monitoring settings.

Short Name: Unexpected Exits

Unexpected Exits Monitoring Period: The period to review when computing unexpected exits. (Property: unexpected_exits_window; default: 5 MINUTES)

Unexpected Exits: The health check for unexpected exits encountered within a recent period specified by the unexpected_exits_window configuration for the role. (Property: unexpected_exits_; default: critical:any, warning:never)

RegionServer Cluster Connectivity

Details: This RegionServer health check verifies that the Master considers the RegionServer alive. A failure of this health check may indicate that the RegionServer is having trouble communicating with at least the HBase Master, and possibly with the entire HBase cluster. Look in the RegionServer logs for more details. This test can be enabled or disabled using the HBase RegionServer to Master Connectivity Check RegionServer monitoring setting. The HBase Region Server Connectivity Tolerance at Startup RegionServer monitoring setting and the Health Check Startup Tolerance Master monitoring setting control the check's tolerance windows around RegionServer and Master restarts, respectively.
Short Name: Cluster Connectivity
HBase Region Server Connectivity Tolerance at Startup: The amount of time to wait for the HBase RegionServer to fully start up and connect to the HBase Master before enforcing the connectivity check. (Property: regionserver_connectivity_tolerance; default: 180 SECONDS)

HBase RegionServer to Master Connectivity Check: Enables the health check that the RegionServer is connected to the Master. (Property: regionserver_master_connectivity_enabled)

Health Check Startup Tolerance: The amount of time allowed after this role is started during which failures of health checks that rely on communication with this role will be tolerated. (Property: master_startup_tolerance; default: 5 MINUTES)

RegionServer Compaction Queue Size

Details: This RegionServer health check verifies that a moving average of the size of the RegionServer's compaction queue does not exceed some value. A failure of this health check indicates a high write load on the RegionServer. Try reducing the write load on the RegionServer, increasing capacity by adding additional disks to the RegionServer, or adding additional RegionServers. This test can be configured using the HBase RegionServer Compaction Queue Monitoring and HBase RegionServer Compaction Queue Monitoring Period RegionServer monitoring settings.

Short Name: Compaction Queue Size

HBase RegionServer Compaction Queue Monitoring Period: The period over which to compute the moving average of the compaction queue size. (Property: regionserver_compaction_queue_window; default: 5 MINUTES)

HBase RegionServer Compaction Queue Monitoring: The health check of the weighted average size of the HBase RegionServer compaction queue over a recent period. See HBase RegionServer Compaction Queue Monitoring Period. (Property: regionserver_compaction_queue_; default: critical:never, warning: )

RegionServer File Descriptors

Details: This RegionServer health check verifies that the number of file descriptors in use does not rise above some percentage of the RegionServer's file descriptor limit. A failure of this health check may indicate a bug in either Hadoop or Cloudera Manager; contact Cloudera support. This test can be configured using the File Descriptor Monitoring RegionServer monitoring setting.

Short Name: File Descriptors

File Descriptor Monitoring: The health check of the number of file descriptors used. Specified as a percentage of the file descriptor limit. (Property: regionserver_fd_; default: critical: , warning: )

RegionServer Flush Queue Size

Details: This RegionServer health check verifies that a moving average of the size of the RegionServer's flush queue does not exceed some value. A failure of this health check indicates a high write load on the RegionServer. Try reducing the write load on the RegionServer, increasing capacity by adding additional disks to the RegionServer, or adding additional RegionServers. This test can be configured using the HBase RegionServer Flush Queue Monitoring and HBase RegionServer Flush Queue Monitoring Period RegionServer monitoring settings.

Short Name: Flush Queue Size

HBase RegionServer Flush Queue Monitoring Period: The period over which to compute the moving average of the flush queue size. (Property: regionserver_flush_queue_window; default: 5 MINUTES)
HBase RegionServer Flush Queue Monitoring: The health check of the average size of the HBase RegionServer flush queue over a recent period. See HBase RegionServer Flush Queue Monitoring Period. (Property: regionserver_flush_queue_; default: critical:never, warning: )

RegionServer GC Duration

Details: This RegionServer health check verifies that the RegionServer is not spending too much time performing Java garbage collection: no more than some percentage of recent time may be spent in Java garbage collection. A failure of this health check may indicate a capacity planning problem or a misconfiguration of the RegionServer. This test can be configured using the HBase Region Server Garbage Collection Duration and HBase Region Server Garbage Collection Duration Monitoring Period RegionServer monitoring settings.

Short Name: GC Duration

HBase Region Server Garbage Collection Duration Monitoring Period: The period to review when computing the moving average of garbage collection time. (Property: regionserver_gc_duration_window; default: 5 MINUTES)

HBase Region Server Garbage Collection Duration: The health check for the weighted average time spent in Java garbage collection. Specified as a percentage of elapsed wall clock time. See HBase Region Server Garbage Collection Duration Monitoring Period. (Property: regionserver_gc_duration_; default: critical: , warning: )
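GC duration checks like this one express time spent in garbage collection as a percentage of elapsed wall-clock time over the monitoring period. A simplified (unweighted) sketch over (timestamp, cumulative GC time) samples; the real check uses a weighted moving average, and the sample shape here is an assumption:

```python
def gc_duration_percent(samples, window_secs=300):
    """Percentage of recent wall-clock time spent in Java GC, computed
    from (timestamp_secs, cumulative_gc_millis) samples. window_secs
    mirrors the 5-minute monitoring period."""
    if not samples:
        return 0.0
    newest = samples[-1][0]
    # Keep only samples inside the monitoring window.
    recent = [s for s in samples if s[0] >= newest - window_secs]
    if len(recent) < 2:
        return 0.0
    (t0, gc0), (t1, gc1) = recent[0], recent[-1]
    elapsed_millis = (t1 - t0) * 1000.0
    return 100.0 * (gc1 - gc0) / elapsed_millis
```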
RegionServer HDFS Read Latency

Details: This RegionServer health check verifies that the average time it takes the RegionServer to read from HDFS does not exceed some value. The health check computes a moving average of the average HDFS read time for the RegionServer over a configurable window and compares that moving average to a configured threshold value. A moving average is used to avoid alerting on short-lived HDFS read latency spikes. The behavior of this health check depends on how the HBase cluster is being used. A failure of this health check may indicate a problem with HDFS, extensive load on the cluster, or a capacity management problem. For example, it is possible to see average read latency increase while executing MapReduce jobs on the cluster. In such situations, if the read latency increase is unacceptable, consider running the MapReduce jobs with fewer map or reduce slots. This test can be configured using the HBase RegionServer HDFS Read Latency and HBase RegionServer HDFS Read Latency Monitoring Period RegionServer monitoring settings.

Short Name: HDFS Read Latency

HBase RegionServer HDFS Read Latency Monitoring Period: The period over which to compute the moving average of the HDFS read latency of the HBase RegionServer. (Property: regionserver_read_latency_window; default: 5 MINUTES)

HBase RegionServer HDFS Read Latency: The health check of the latency that the RegionServer sees for HDFS read operations. (Property: regionserver_read_latency_; default: critical: , warning: MILLISECONDS)

RegionServer HDFS Sync Latency

Details: This RegionServer health check verifies that the average time it takes the RegionServer to perform HDFS syncs does not exceed some value. The health check computes a moving average of the average HDFS sync time for the RegionServer over a configurable window and compares that moving average to a configured threshold value. A moving average is used to avoid alerting on short-lived HDFS sync latency spikes. The behavior of this health check depends on how the HBase cluster is being used. A failure of this health check may indicate a problem with HDFS, extensive load on the cluster, or a capacity management problem. For example, it is possible to see average sync latency increase while executing MapReduce jobs on the cluster. In such situations, if the sync latency increase is unacceptable, consider running the MapReduce jobs with fewer map or reduce slots. This test can be configured using the HBase RegionServer HDFS Sync Latency and HBase RegionServer HDFS Sync Latency Monitoring Period RegionServer monitoring settings.

Short Name: HDFS Sync Latency

HBase RegionServer HDFS Sync Latency Monitoring Period: The period over which to compute the moving average of the HDFS sync latency of the HBase RegionServer. (Property: regionserver_sync_latency_window; default: 5 MINUTES)

HBase RegionServer HDFS Sync Latency: The health check for the latency of HDFS write operations that the RegionServer detects. (Property: regionserver_sync_latency_; default: critical: , warning: MILLISECONDS)

RegionServer Host Health

Details: This RegionServer health check factors in the health of the host on which the RegionServer is running. A failure of this check means that the host running the RegionServer is experiencing some problem. See that host's status page for more details. This test can be enabled or disabled using the HBase Region Server Host Health Check RegionServer monitoring setting.

Short Name: Host Health

HBase Region Server Host Health Check: When computing the overall HBase Region Server health, consider the host's health. (Property: regionserver_host_health_enabled)

RegionServer Log Directory Free Space

Details: This RegionServer health check verifies that the filesystem containing the log directory of this RegionServer has sufficient free space. This test can be configured using the Log Directory Free Space Monitoring Absolute and Log Directory Free Space Monitoring Percentage RegionServer monitoring settings.

Short Name: Log Directory Free Space

Log Directory Free Space Monitoring Absolute: The health check for monitoring of free space on the filesystem that contains this role's log directory. (Property: log_directory_free_space_absolute_; default: critical: , warning: BYTES)

Log Directory Free Space Monitoring Percentage: The health check for monitoring of free space on the filesystem that contains this role's log directory. Specified as a percentage of the capacity of that filesystem. This setting is not used if a Log Directory Free Space Monitoring Absolute setting is configured. (Property: log_directory_free_space_percentage_; default: critical:never, warning:never)

RegionServer Memstore Size

Details: This RegionServer health check verifies that the amount of the RegionServer's memory devoted to memstores does not exceed some percentage of the RegionServer's configured hbase.regionserver.global.memstore.upperLimit. When a RegionServer's memstores reach this maximum size, new updates are blocked while the RegionServer flushes. A failure of this health check indicates a high write load on the RegionServer. Try reducing the write load on the RegionServer, increasing capacity by adding additional disks to the RegionServer, or adding additional RegionServers. This test can be configured using the HBase RegionServer Memstore Size RegionServer monitoring setting.

Short Name: Memstore Size

HBase RegionServer Memstore Size: The health check of the total size of the RegionServer's memstores. Specified as a percentage of the configured upper limit. See Maximum Size of All Memstores in RegionServer. (Property: regionserver_memstore_size_; default: critical: , warning: )

RegionServer Process Status

Details: This RegionServer health check verifies that the Cloudera Manager Agent on the RegionServer host is heartbeating correctly and that the process associated with the RegionServer role is in the state expected by Cloudera Manager. A failure of this health check may indicate a problem with the RegionServer process, a lack of connectivity to the Cloudera Manager Agent on the RegionServer host, or a problem with the Cloudera Manager Agent. This check can fail either because the RegionServer has crashed or because the RegionServer will not start or stop in a timely fashion. Check the RegionServer logs for more details. If the check fails because of problems communicating with the Cloudera Manager Agent on the RegionServer host, check the status of the Agent by running /etc/init.d/cloudera-scm-agent status on the RegionServer host, or look in the Cloudera Manager Agent logs on that host for more details. This test can be enabled or disabled using the HBase Region Server Process Health Check RegionServer monitoring setting.

Short Name: Process Status

HBase Region Server Process Health Check: Enables the health check that the HBase Region Server's process state is consistent with the role configuration. (Property: regionserver_scm_health_enabled)

RegionServer Store File Index Size

Details: This RegionServer health check verifies that the sum of the sizes of all store file indexes does not exceed some percentage of the RegionServer's maximum heap size. A failure of this health check indicates that the RegionServer is using a significant portion of its memory for store file indexes. If the amount of memory devoted to these indexes is undesirably high, the size of the indexes can be reduced by increasing the HBase block size, by using smaller key values, or by using fewer columns.
Each of these choices involves trade-offs. Contact Cloudera Support for more information on this topic. This test can be configured using the Percentage of Heap Used by HStoreFile Index RegionServer monitoring setting.

Short Name: Store File Index Size
Percentage of Heap Used by HStoreFile Index: The health check of the size used by the HStoreFile index. Specified as a percentage of the total heap size. (Property: regionserver_store_file_idx_size_; default: critical:never, warning: )

RegionServer Unexpected Exits

Details: This RegionServer health check verifies that the RegionServer has not recently exited unexpectedly. The check returns "Bad" health if the number of unexpected exits rises above a critical threshold. For example, if this check is configured with a critical threshold of 1, it returns "Good" health if there have been no unexpected exits recently, and "Bad" health if there have been one or more unexpected exits recently. This test can be configured using the Unexpected Exits and Unexpected Exits Monitoring Period RegionServer monitoring settings.

Short Name: Unexpected Exits

Unexpected Exits Monitoring Period: The period to review when computing unexpected exits. (Property: unexpected_exits_window; default: 5 MINUTES)

Unexpected Exits: The health check for unexpected exits encountered within a recent period specified by the unexpected_exits_window configuration for the role. (Property: unexpected_exits_; default: critical:any, warning:never)

RegionServer Web Server Status

Details: This RegionServer health check verifies that the RegionServer's web server is responding quickly to requests from the Cloudera Manager Agent, and that the Agent can collect metrics from the web server. A failure of this health check may indicate a problem with the RegionServer's web server, a misconfiguration of the RegionServer, or a problem with the Cloudera Manager Agent. Consult the Cloudera Manager Agent logs and the RegionServer logs for more detail. If the test's failure message indicates a communication problem, the Cloudera Manager Agent's HTTP requests to the RegionServer's web server are failing or timing out. These requests are completely local to the RegionServer's host, and so should never fail under normal conditions. If the test's failure message indicates an unexpected response, the RegionServer's web server responded to the Cloudera Manager Agent's request, but the Agent could not interpret the response. This test can be configured using the Web Metric Collection RegionServer monitoring setting.

Short Name: Web Server Status

Web Metric Collection: Enables the health check that the Cloudera Manager Agent can successfully contact and gather metrics from the web server. (Property: regionserver_web_metric_collection_enabled)

Reports Manager File Descriptors

Details: This Reports Manager health check verifies that the number of file descriptors in use does not rise above some percentage of the Reports Manager's file descriptor limit. A failure of this health check may indicate a bug in either Hadoop or Cloudera Manager; contact Cloudera support. This test can be configured using the File Descriptor Monitoring Reports Manager monitoring setting.

Short Name: File Descriptors

File Descriptor Monitoring: The health check of the number of file descriptors used. Specified as a percentage of the file descriptor limit. (Property: reportsmanager_fd_; default: critical: , warning: )

Reports Manager Host Health

Details: This Reports Manager health check factors in the health of the host on which the Reports Manager is running. A failure of this check means that the host running the Reports Manager is experiencing some problem. See that host's status page for more details. This test can be enabled or disabled using the Reports Manager Host Health Check Reports Manager monitoring setting.

Short Name: Host Health
Setting: Reports Manager Host Health Check
Description: When computing the overall Reports Manager health, consider the host's health.
Property: reportsmanager_host_health_enabled

Reports Manager Log Directory Free Space
Details: This Reports Manager health check verifies that the filesystem containing the log directory of this Reports Manager has sufficient free space. This test can be configured using the Log Directory Free Space Monitoring Absolute and Log Directory Free Space Monitoring Percentage Reports Manager monitoring settings.
Short Name: Log Directory Free Space

Setting: Log Directory Free Space Monitoring Absolute
Description: The health check for monitoring of free space on the filesystem that contains this role's log directory.
Property: log_directory_free_space_absolute_
Default: critical: , warning: BYTES

Setting: Log Directory Free Space Monitoring Percentage
Description: The health check for monitoring of free space on the filesystem that contains this role's log directory. Specified as a percentage of the capacity on that filesystem. This setting is not used if a Log Directory Free Space Monitoring Absolute setting is configured.
Property: log_directory_free_space_percentage_
Default: critical:never, warning:never

Reports Manager Process Status
Details: This Reports Manager health check verifies that the Cloudera Manager Agent on the Reports Manager host is heartbeating correctly and that the process associated with the Reports Manager role is in the state expected by Cloudera Manager. A failure of this health check may indicate a problem with the Reports Manager process, a lack of connectivity to the Cloudera Manager Agent on the Reports Manager host, or a problem with the Cloudera Manager Agent. This check can fail either because the Reports Manager has crashed or because the Reports Manager will not start or stop in a timely fashion. Check the Reports Manager logs for more details. If the check fails because of problems communicating with the Cloudera Manager Agent on the Reports Manager host, check the status of the Cloudera Manager Agent by running /etc/init.d/cloudera-scm-agent status on the Reports Manager host, or look in the Cloudera Manager Agent logs on the Reports Manager host for more details. This test can be enabled or disabled using the Reports Manager Process Health Check Reports Manager monitoring setting.
Short Name: Process Status

Setting: Reports Manager Process Health Check
Description: Enables the health check that the Reports Manager's process state is consistent with the role configuration.
Property: reportsmanager_scm_health_enabled

Reports Manager Scratch Directory Free Space
Details: This Reports Manager health check verifies that the filesystem containing the scratch directory of this Reports Manager has sufficient free space. This test can be configured using the Scratch Directory Free Space Monitoring Absolute and Scratch Directory Free Space Monitoring Percentage Reports Manager monitoring settings.
Short Name: Scratch Directory Free Space

Setting: Scratch Directory Free Space Monitoring Absolute
Description: The health check for monitoring of free space on the filesystem that contains the scratch directory.
Property: reportsmanager_scratch_directory_free_space_absolute_
Default: critical: , warning: BYTES

Setting: Scratch Directory Free Space Monitoring Percentage
Description: The health check for monitoring of free space on the filesystem that contains the scratch directory. Specified as a percentage of the capacity on that filesystem. This setting is not used if a Scratch Directory Free Space Monitoring Absolute setting is configured.
Property: reportsmanager_scratch_directory_free_space_percentage_
Default: critical:never, warning:never

Reports Manager Unexpected Exits
Details: This Reports Manager health check verifies that the Reports Manager has not recently exited unexpectedly. The check returns "Bad" health if the number of unexpected exits rises above a critical threshold. For example, if this check is configured with a critical threshold of 1, it returns "Good" health if there have been no unexpected exits recently, and "Bad" health if there have been one or more. This test can be configured using the Unexpected Exits and Unexpected Exits Monitoring Period Reports Manager monitoring settings.
Short Name: Unexpected Exits

Setting: Unexpected Exits Monitoring Period
Description: The period to review when computing unexpected exits.
Property: unexpected_exits_window
Default: 5 MINUTES

Setting: Unexpected Exits
Description: The health check for unexpected exits encountered within a recent period specified by the unexpected_exits_window configuration for the role.
Property: unexpected_exits_
Default: critical:any, warning:never

SecondaryNameNode Checkpoint Directories Free Space
Details: This Secondary NameNode health check verifies that the filesystems containing the checkpoint directories have sufficient free space. This test can be configured using the Checkpoint Directories Free Space Monitoring Absolute and Checkpoint Directories Free Space Monitoring Percentage Secondary NameNode monitoring settings.
Short Name: Checkpoint Directories Free Space

Setting: Checkpoint Directories Free Space Monitoring Absolute
Description: The health check for monitoring of free space on the filesystems that contain this role's checkpoint directories.
Property: secondarynamenode_checkpoint_directories_free_space_absolute_
Default: critical: , warning: BYTES

Setting: Checkpoint Directories Free Space Monitoring Percentage
Description: The health check for monitoring of free space on the filesystems that contain this role's checkpoint directories. Specified as a percentage of the capacity on the filesystem. This setting is not used if a Checkpoint Directories Free Space Monitoring Absolute setting is configured.
Property: secondarynamenode_checkpoint_directories_free_space_percentage_
Default: critical:never, warning:never

SecondaryNameNode File Descriptors
Details: This SecondaryNameNode health check verifies that the number of file descriptors used does not rise above some percentage of the SecondaryNameNode's file descriptor limit. A failure of this health check may indicate a bug in either Hadoop or Cloudera Manager; contact Cloudera Support. This test can be configured using the File Descriptor Monitoring SecondaryNameNode monitoring setting.
Short Name: File Descriptors

Setting: File Descriptor Monitoring
Description: The health check of the number of file descriptors used. Specified as a percentage of the file descriptor limit.
Property: secondarynamenode_fd_
Default: critical: , warning:

SecondaryNameNode GC Duration
Details: This SecondaryNameNode health check verifies that the SecondaryNameNode is not spending too much time performing Java garbage collection. It checks that no more than some percentage of recent time is spent performing Java garbage collection. A failure of this health check may indicate a capacity planning problem or a misconfiguration of the SecondaryNameNode. This test can be configured using the Secondary NameNode Garbage Collection Duration and Secondary NameNode Garbage Collection Duration Monitoring Period SecondaryNameNode monitoring settings.
Short Name: GC Duration

Setting: Secondary NameNode Garbage Collection Duration Monitoring Period
Description: The period to review when computing the moving average of garbage collection time.
Property: secondarynamenode_gc_duration_window
Default: 5 MINUTES

Setting: Secondary NameNode Garbage Collection Duration
Description: The health check for the weighted average time spent in Java garbage collection. Specified as a percentage of elapsed wall clock time. See Secondary NameNode Garbage Collection Duration Monitoring Period.
Property: secondarynamenode_gc_duration_
Default: critical: , warning:

SecondaryNameNode Host Health
Details: This SecondaryNameNode health check factors in the health of the host on which the SecondaryNameNode is running. A failure of this check means that the host running the SecondaryNameNode is experiencing some problem; see that host's status page for more details. This test can be enabled or disabled using the Secondary NameNode Host Health Check SecondaryNameNode monitoring setting.
Short Name: Host Health

Setting: Secondary NameNode Host Health Check
Description: When computing the overall Secondary NameNode health, consider the host's health.
Property: secondarynamenode_host_health_enabled

SecondaryNameNode Log Directory Free Space
Details: This SecondaryNameNode health check verifies that the filesystem containing the log directory of this SecondaryNameNode has sufficient free space. This test can be configured using the Log Directory Free Space Monitoring Absolute and Log Directory Free Space Monitoring Percentage SecondaryNameNode monitoring settings.
Short Name: Log Directory Free Space

Setting: Log Directory Free Space Monitoring Absolute
Description: The health check for monitoring of free space on the filesystem that contains this role's log directory.
Property: log_directory_free_space_absolute_
Default: critical: , warning: BYTES

Setting: Log Directory Free Space Monitoring Percentage
Description: The health check for monitoring of free space on the filesystem that contains this role's log directory. Specified as a percentage of the capacity on that filesystem. This setting is not used if a Log Directory Free Space Monitoring Absolute setting is configured.
Property: log_directory_free_space_percentage_
Default: critical:never, warning:never

SecondaryNameNode Process Status
Details: This SecondaryNameNode health check verifies that the Cloudera Manager Agent on the SecondaryNameNode host is heartbeating correctly and that the process associated with the SecondaryNameNode role is in the state expected by Cloudera Manager. A failure of this health check may indicate a problem with the SecondaryNameNode process, a lack of connectivity to the Cloudera Manager Agent on the SecondaryNameNode host, or a problem with the Cloudera Manager Agent. This check can fail either because the SecondaryNameNode has crashed or because the SecondaryNameNode will not start or stop in a timely fashion. Check the SecondaryNameNode logs for more details. If the check fails because of problems communicating with the Cloudera Manager Agent on the SecondaryNameNode host, check the status of the Cloudera Manager Agent by running /etc/init.d/cloudera-scm-agent status on the SecondaryNameNode host, or look in the Cloudera Manager Agent logs on the SecondaryNameNode host for more details. This test can be enabled or disabled using the Secondary NameNode Process Health Check SecondaryNameNode monitoring setting.
Short Name: Process Status

Setting: Secondary NameNode Process Health Check
Description: Enables the health check that the Secondary NameNode's process state is consistent with the role configuration.
Property: secondarynamenode_scm_health_enabled

SecondaryNameNode Unexpected Exits
Details: This SecondaryNameNode health check verifies that the SecondaryNameNode has not recently exited unexpectedly. The check returns "Bad" health if the number of unexpected exits rises above a critical threshold. For example, if this check is configured with a critical threshold of 1, it returns "Good" health if there have been no unexpected exits recently, and "Bad" health if there have been one or more. This test can be configured using the Unexpected Exits and Unexpected Exits Monitoring Period SecondaryNameNode monitoring settings.
Short Name: Unexpected Exits

Setting: Unexpected Exits Monitoring Period
Description: The period to review when computing unexpected exits.
Property: unexpected_exits_window
Default: 5 MINUTES

Setting: Unexpected Exits
Description: The health check for unexpected exits encountered within a recent period specified by the unexpected_exits_window configuration for the role.
Property: unexpected_exits_
Default: critical:any, warning:never

SecondaryNameNode Web Server Status
Details: This SecondaryNameNode health check verifies that the web server of the SecondaryNameNode is responding quickly to requests from the Cloudera Manager Agent, and that the Agent can collect metrics from the web server. A failure of this health check may indicate a problem with the SecondaryNameNode's web server, a misconfiguration of the SecondaryNameNode, or a problem with the Cloudera Manager Agent. Consult the Cloudera Manager Agent logs and the SecondaryNameNode logs for more detail. If the test's failure message indicates a communication problem, the Cloudera Manager Agent's HTTP requests to the SecondaryNameNode's web server are failing or timing out. These requests are completely local to the SecondaryNameNode's host, and so should never fail under normal conditions. If the test's failure message indicates an unexpected response, the SecondaryNameNode's web server responded to the Cloudera Manager Agent's request, but the Agent could not interpret the response. This test can be configured using the Web Metric Collection SecondaryNameNode monitoring setting.
Short Name: Web Server Status

Setting: Web Metric Collection
Description: Enables the health check that the Cloudera Manager Agent can successfully contact and gather metrics from the web server.
Property: secondarynamenode_web_metric_collection_enabled

ZooKeeper Server Connection Count
Details: This ZooKeeper Server role-level health check verifies that a moving average of the ZooKeeper Server's connection count does not exceed some value. A failure of this health check indicates a high connection load on the ZooKeeper Server. This test can be configured using the ZooKeeper Server Connection Count and ZooKeeper Server Connection Count Monitoring Period ZooKeeper Server monitoring settings.
Short Name: Connection Count
Setting: ZooKeeper Server Connection Count Monitoring Period
Description: The period to review when computing the moving average of the connection count. Specified in minutes.
Property: zookeeper_server_connection_count_window
Default: 3 MINUTES

Setting: ZooKeeper Server Connection Count
Description: The health check of the weighted average of the ZooKeeper Server connection count over a recent period. See ZooKeeper Server Connection Count Monitoring Period.
Property: zookeeper_server_connection_count_
Default: critical:never, warning:never

ZooKeeper Server Data Directory Free Space
Details: This ZooKeeper Server health check verifies that the filesystem containing the data directory of this ZooKeeper Server has sufficient free space. The data directory contains the database snapshots of the ZooKeeper Server. This test can be configured using the Data Directory Free Space Monitoring Absolute and Data Directory Free Space Monitoring Percentage ZooKeeper Server monitoring settings.
Short Name: Data Directory Free Space

Setting: Data Directory Free Space Monitoring Absolute
Description: The health check for monitoring of free space on the filesystem that contains the ZooKeeper Server's data directory.
Property: zookeeper_server_data_directory_free_space_absolute_
Default: critical: , warning: BYTES

Setting: Data Directory Free Space Monitoring Percentage
Description: The health check for monitoring of free space on the filesystem that contains the ZooKeeper Server's data directory. Specified as a percentage of the capacity on that filesystem. This setting is not used if a Data Directory Free Space Monitoring Absolute setting is configured.
Property: zookeeper_server_data_directory_free_space_percentage_
Default: critical:never, warning:never

ZooKeeper Server Data Log Directory Free Space
Details: This ZooKeeper Server health check verifies that the filesystem containing the data log directory of this ZooKeeper Server has sufficient free space. The data log directory contains the transaction logs of the ZooKeeper Server. This test can be configured using the Data Log Directory Free Space Monitoring Absolute and Data Log Directory Free Space Monitoring Percentage ZooKeeper Server monitoring settings.
Short Name: Data Log Directory Free Space

Setting: Data Log Directory Free Space Monitoring Absolute
Description: The health check for monitoring of free space on the filesystem that contains the ZooKeeper Server's data log directory.
Property: zookeeper_server_data_log_directory_free_space_absolute_
Default: critical: , warning: BYTES

Setting: Data Log Directory Free Space Monitoring Percentage
Description: The health check for monitoring of free space on the filesystem that contains the ZooKeeper Server's data log directory. Specified as a percentage of the capacity on that filesystem. This setting is not used if a Data Log Directory Free Space Monitoring Absolute setting is configured.
Property: zookeeper_server_data_log_directory_free_space_percentage_
Default: critical:never, warning:never
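The free-space checks above, like the others in this document, pair an absolute threshold with a percentage threshold, and the percentage setting is ignored once an absolute setting is configured. A minimal sketch of that precedence rule (the function name and threshold handling are illustrative, not Cloudera Manager code):

```python
import os

def free_space_status(path, absolute_critical=None, percent_critical=None):
    """Evaluate free space on the filesystem containing `path`.

    If an absolute threshold (in bytes) is configured it takes precedence,
    mirroring the "not used if ... Absolute setting is configured" rule.
    """
    st = os.statvfs(path)
    free_bytes = st.f_bavail * st.f_frsize        # space available to non-root
    total_bytes = st.f_blocks * st.f_frsize
    free_pct = 100.0 * free_bytes / total_bytes

    if absolute_critical is not None:             # absolute setting wins
        return "Bad" if free_bytes < absolute_critical else "Good"
    if percent_critical is not None:
        return "Bad" if free_pct < percent_critical else "Good"
    return "Good"                                 # both thresholds disabled

print(free_space_status("/tmp", absolute_critical=10 * 1024 ** 2))
```

The same two-setting pattern applies to the log, scratch, checkpoint, data, and data log directory checks; only the property prefix changes.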
ZooKeeper Server File Descriptors
Details: This ZooKeeper Server health check verifies that the number of file descriptors used does not rise above some percentage of the ZooKeeper Server's file descriptor limit. A failure of this health check may indicate a bug in either Hadoop or Cloudera Manager; contact Cloudera Support. This test can be configured using the File Descriptor Monitoring ZooKeeper Server monitoring setting.
Short Name: File Descriptors

Setting: File Descriptor Monitoring
Description: The health check of the number of file descriptors used. Specified as a percentage of the file descriptor limit.
Property: zookeeper_server_fd_
Default: critical: , warning:

ZooKeeper Server GC Duration
Details: This ZooKeeper Server health check verifies that the ZooKeeper Server is not spending too much time performing Java garbage collection. It checks that no more than some percentage of recent time is spent performing Java garbage collection. A failure of this health check may indicate a capacity planning problem or a misconfiguration of the ZooKeeper Server. This test can be configured using the ZooKeeper Server Garbage Collection Duration and ZooKeeper Server Garbage Collection Duration Monitoring Period ZooKeeper Server monitoring settings.
Short Name: GC Duration

Setting: ZooKeeper Server Garbage Collection Duration Monitoring Period
Description: The period to review when computing the moving average of garbage collection time.
Property: zookeeper_server_gc_duration_window
Default: 5 MINUTES

Setting: ZooKeeper Server Garbage Collection Duration
Description: The health check for the weighted average time spent in Java garbage collection. Specified as a percentage of elapsed wall clock time. See ZooKeeper Server Garbage Collection Duration Monitoring Period.
Property: zookeeper_server_gc_duration_
Default: critical: , warning:
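The GC Duration checks compare the share of wall-clock time spent in Java garbage collection, averaged over the monitoring period, against a percentage threshold. A rough sketch of such a windowed computation (the sampling scheme, class name, and 60% threshold are illustrative assumptions, not the monitor's actual implementation):

```python
from collections import deque

class GcDurationCheck:
    """Track (gc_millis, wall_millis) samples over a sliding window and
    flag when the windowed GC share of wall-clock time exceeds a threshold."""

    def __init__(self, window_samples=5, critical_pct=60.0):
        # e.g. one sample per minute gives a 5-minute window, matching
        # the 5 MINUTES default of the monitoring-period settings
        self.samples = deque(maxlen=window_samples)
        self.critical_pct = critical_pct

    def record(self, gc_millis, wall_millis):
        self.samples.append((gc_millis, wall_millis))

    def status(self):
        gc = sum(g for g, _ in self.samples)
        wall = sum(w for _, w in self.samples)
        if wall == 0:
            return "Good"                      # no data yet
        pct = 100.0 * gc / wall
        return "Bad" if pct > self.critical_pct else "Good"

check = GcDurationCheck(window_samples=5, critical_pct=60.0)
check.record(gc_millis=500, wall_millis=60_000)  # ~0.8% of a minute in GC
print(check.status())
```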
ZooKeeper Server Host Health
Details: This ZooKeeper Server health check factors in the health of the host on which the ZooKeeper Server is running. A failure of this check means that the host running the ZooKeeper Server is experiencing some problem; see that host's status page for more details. This test can be enabled or disabled using the ZooKeeper Server Host Health Check ZooKeeper Server monitoring setting.
Short Name: Host Health

Setting: ZooKeeper Server Host Health Check
Description: When computing the overall ZooKeeper Server health, consider the host's health.
Property: zookeeper_server_host_health_enabled

ZooKeeper Server Log Directory Free Space
Details: This ZooKeeper Server health check verifies that the filesystem containing the log directory of this ZooKeeper Server has sufficient free space. This test can be configured using the Log Directory Free Space Monitoring Absolute and Log Directory Free Space Monitoring Percentage ZooKeeper Server monitoring settings.
Short Name: Log Directory Free Space

Setting: Log Directory Free Space Monitoring Absolute
Description: The health check for monitoring of free space on the filesystem that contains this role's log directory.
Property: log_directory_free_space_absolute_
Default: critical: , warning: BYTES

Setting: Log Directory Free Space Monitoring Percentage
Description: The health check for monitoring of free space on the filesystem that contains this role's log directory. Specified as a percentage of the capacity on that filesystem. This setting is not used if a Log Directory Free Space Monitoring Absolute setting is configured.
Property: log_directory_free_space_percentage_
Default: critical:never, warning:never
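Threshold defaults throughout this document use the form critical:X, warning:Y, where never disables a level and any fires on any occurrence. A small sketch of one way to read that notation into the Good / Concerning / Bad states used by these checks (this illustrates the notation only; the exceeds-versus-falls-below direction varies by check, and this is not Cloudera Manager's implementation):

```python
def evaluate(value, critical="never", warning="never"):
    """Map a metric value to a health state using critical:/warning: thresholds.

    "never" disables a level; "any" triggers on any value greater than zero;
    a number triggers when the value exceeds it (assumed direction).
    """
    def fires(threshold):
        if threshold == "never":
            return False
        if threshold == "any":
            return value > 0
        return value > float(threshold)

    if fires(critical):
        return "Bad"
    if fires(warning):
        return "Concerning"
    return "Good"

print(evaluate(1, critical="any"))             # unexpected-exits style default
print(evaluate(85, critical=90, warning=80))   # between warning and critical
```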
ZooKeeper Server Maximum Request Latency
Details: This ZooKeeper Server health check verifies that the ratio of the maximum request latency to the maximum client-negotiable session timeout does not exceed some value. Note that the maximum client-negotiable timeout need not be the actual session timeout used by a client, but is the upper bound of such client session timeouts. As a result, this check being in "Good" health does not preclude clients from experiencing timeouts based on the particular session timeout value they have negotiated with the server. A failure of this health check likely indicates a high load on the ZooKeeper Server. This test can be configured using the Maximum Latency Monitoring ZooKeeper Server monitoring setting.
Short Name: Maximum Request Latency

Setting: Maximum Latency Monitoring
Description: The percentage of the ratio of the maximum request latency to the maximum client-negotiable session timeout since the server was started.
Property: zookeeper_server_max_latency_
Default: critical: , warning:

ZooKeeper Server Outstanding Requests
Details: This ZooKeeper Server role-level health check verifies that a moving average of the size of the ZooKeeper Server's outstanding requests queue does not exceed some value. Outstanding requests are the requests queued in the server; their number increases when the server is under load and is receiving more sustained requests than it can process. A failure of this health check indicates a high connection load on the ZooKeeper Server. This test can be configured using the ZooKeeper Server Outstanding Requests and ZooKeeper Server Outstanding Requests Monitoring Period ZooKeeper Server monitoring settings.
Short Name: Outstanding Requests

Setting: ZooKeeper Server Outstanding Requests Monitoring Period
Description: The period to review when computing the moving average of the outstanding requests queue size. Specified in minutes.
Property: zookeeper_server_outstanding_requests_window
Default: 3 MINUTES

Setting: ZooKeeper Server Outstanding Requests
Description: The health check of the weighted average size of the ZooKeeper Server outstanding requests queue over a recent period. See ZooKeeper Server Outstanding Requests Monitoring Period.
Property: zookeeper_server_outstanding_requests_
Default: critical:never, warning:never

ZooKeeper Server Process Status
Details: This ZooKeeper Server health check verifies that the Cloudera Manager Agent on the ZooKeeper Server host is heartbeating correctly and that the process associated with the ZooKeeper Server role is in the state expected by Cloudera Manager. A failure of this health check may indicate a problem with the ZooKeeper Server process, a lack of connectivity to the Cloudera Manager Agent on the ZooKeeper Server host, or a problem with the Cloudera Manager Agent. This check can fail either because the ZooKeeper Server has crashed or because the ZooKeeper Server will not start or stop in a timely fashion. Check the ZooKeeper Server logs for more details. If the check fails because of problems communicating with the Cloudera Manager Agent on the ZooKeeper Server host, check the status of the Cloudera Manager Agent by running /etc/init.d/cloudera-scm-agent status on the ZooKeeper Server host, or look in the Cloudera Manager Agent logs on the ZooKeeper Server host for more details. This test can be enabled or disabled using the ZooKeeper Server Process Health Check ZooKeeper Server monitoring setting.
Short Name: Process Status

Setting: ZooKeeper Server Process Health Check
Description: Enables the health check that the ZooKeeper Server's process state is consistent with the role configuration.
Property: zookeeper_server_scm_health_enabled

ZooKeeper Server Quorum Membership
Details: This ZooKeeper Server role-level health check verifies that the server is part of a quorum. This check is disabled if the ZooKeeper Server is in standalone mode. The check returns "Concerning" health as long as the quorum membership status for the ZooKeeper Server was determined within the detection window and it is in leader election.
The check returns "Bad" health if the ZooKeeper Server is not part of a quorum, or if the quorum status of the ZooKeeper Server could not be determined for the entire detection window. A failure of this health check may indicate a communication problem between this ZooKeeper Server and the rest of its peers, or between the Cloudera Manager Service Monitor and the ZooKeeper Server. Check the ZooKeeper Server and Cloudera Manager Service Monitor logs for additional information. This test can be enabled or disabled using the Enable the Quorum Membership Check ZooKeeper Server monitoring setting. In addition, the Quorum Membership Detection Window setting can be used to adjust the time that the Cloudera Manager Service Monitor has to detect the ZooKeeper Server quorum membership status before this health check fails.
Short Name: Quorum Membership

Setting: Enable the Quorum Membership Check
Description: Enables the quorum membership check for this ZooKeeper Server.
Property: zookeeper_server_quorum_membership_enabled

Setting: Quorum Membership Detection Window
Description: The tolerance window that will be used in the detection of a ZooKeeper Server's membership in a quorum. Specified in minutes.
Property: zookeeper_server_quorum_membership_detection_window
Default: 3 MINUTES

ZooKeeper Server Unexpected Exits
Details: This ZooKeeper Server health check verifies that the ZooKeeper Server has not recently exited unexpectedly. The check returns "Bad" health if the number of unexpected exits rises above a critical threshold. For example, if this check is configured with a critical threshold of 1, it returns "Good" health if there have been no unexpected exits recently, and "Bad" health if there have been one or more. This test can be configured using the Unexpected Exits and Unexpected Exits Monitoring Period ZooKeeper Server monitoring settings.
Short Name: Unexpected Exits

Setting: Unexpected Exits Monitoring Period
Description: The period to review when computing unexpected exits.
Property: unexpected_exits_window
Default: 5 MINUTES

Setting: Unexpected Exits
Description: The health check for unexpected exits encountered within a recent period specified by the unexpected_exits_window configuration for the role.
Property: unexpected_exits_
Default: critical:any, warning:never

Service Monitor File Descriptors
Details: This Service Monitor health check verifies that the number of file descriptors used does not rise above some percentage of the Service Monitor's file descriptor limit. A failure of this health check may indicate a bug in either Hadoop or Cloudera Manager; contact Cloudera Support. This test can be configured using the File Descriptor Monitoring Service Monitor monitoring setting.
Short Name: File Descriptors

Setting: File Descriptor Monitoring
Description: The health check of the number of file descriptors used. Specified as a percentage of the file descriptor limit.
Property: servicemonitor_fd_
Default: critical: , warning:

Service Monitor Host Health
Details: This Service Monitor health check factors in the health of the host on which the Service Monitor is running. A failure of this check means that the host running the Service Monitor is experiencing some problem; see that host's status page for more details. This test can be enabled or disabled using the Service Monitor Host Health Check Service Monitor monitoring setting.
Short Name: Host Health

Setting: Service Monitor Host Health Check
Description: When computing the overall Service Monitor health, consider the host's health.
Property: servicemonitor_host_health_enabled

Service Monitor Log Directory Free Space
Details: This Service Monitor health check verifies that the filesystem containing the log directory of this Service Monitor has sufficient free space. This test can be configured using the Log Directory Free Space Monitoring Absolute and Log Directory Free Space Monitoring Percentage Service Monitor monitoring settings.
Short Name: Log Directory Free Space

Setting: Log Directory Free Space Monitoring Absolute
Description: The health check for monitoring of free space on the filesystem that contains this role's log directory.
Property: log_directory_free_space_absolute_
Default: critical: , warning: BYTES

Setting: Log Directory Free Space Monitoring Percentage
Description: The health check for monitoring of free space on the filesystem that contains this role's log directory. Specified as a percentage of the capacity on that filesystem. This setting is not used if a Log Directory Free Space Monitoring Absolute setting is configured.
Property: log_directory_free_space_percentage_
Default: critical:never, warning:never

Service Monitor Process Status
Details: This Service Monitor health check verifies that the Cloudera Manager Agent on the Service Monitor host is heartbeating correctly and that the process associated with the Service Monitor role is in the state expected by Cloudera Manager. A failure of this health check may indicate a problem with the Service Monitor process, a lack of connectivity to the Cloudera Manager Agent on the Service Monitor host, or a problem with the Cloudera Manager Agent. This check can fail either because the Service Monitor has crashed or because the Service Monitor will not start or stop in a timely fashion. Check the Service Monitor logs for more details. If the check fails because of problems communicating with the Cloudera Manager Agent on the Service Monitor host, check the status of the Cloudera Manager Agent by running /etc/init.d/cloudera-scm-agent status on the Service Monitor host, or look in the Cloudera Manager Agent logs on the Service Monitor host for more details. This test can be enabled or disabled using the Service Monitor Process Health Check Service Monitor monitoring setting.
Short Name: Process Status
Setting: Service Monitor Process Health Check
Description: Enables the health check that the Service Monitor's process state is consistent with the role configuration.
Property: servicemonitor_scm_health_enabled

Service Monitor Role Pipeline
Details: This Service Monitor health check verifies that no messages are being dropped by the role stage of the Service Monitor pipeline. A failure of this health check indicates a problem with the Service Monitor; this may indicate a configuration problem or a bug in the Service Monitor. This test can be configured using the Service Monitor Role Pipeline Monitoring Time Period monitoring setting.
Short Name: Role Pipeline

Setting: Service Monitor Role Pipeline Monitoring
Description: The health check for monitoring the Service Monitor role pipeline. This specifies the number of dropped messages that will be tolerated over the monitoring time period.
Property: servicemonitor_role_pipeline_
Default: critical:any, warning:never

Setting: Service Monitor Role Pipeline Monitoring Time Period
Description: The time period over which the Service Monitor role pipeline will be monitored for dropped messages.
Property: servicemonitor_role_pipeline_window
Default: 5 MINUTES

Service Monitor Unexpected Exits
Details: This Service Monitor health check verifies that the Service Monitor has not recently exited unexpectedly. The check returns "Bad" health if the number of unexpected exits rises above a critical threshold. For example, if this check is configured with a critical threshold of 1, it returns "Good" health if there have been no unexpected exits recently, and "Bad" health if there have been one or more. This test can be configured using the Unexpected Exits and Unexpected Exits Monitoring Period Service Monitor monitoring settings.
Short Name: Unexpected Exits
Unexpected Exits Monitoring Period
  Description: The period to review when computing unexpected exits.
  Property: unexpected_exits_window
  Default: 5 MINUTES

Unexpected Exits
  Description: The health check for unexpected exits encountered within a recent period specified by the unexpected_exits_window configuration for the role.
  Property: unexpected_exits_
  Default: critical:any, warning:never

Service Monitor Web Server Status

Details: This Service Monitor health check verifies that the Service Monitor's web server is responding quickly to requests from the Cloudera Manager Agent, and that the agent can collect metrics from the web server. A failure of this health check may indicate a problem with the Service Monitor's web server, a misconfiguration of the Service Monitor, or a problem with the Cloudera Manager Agent. Consult the Cloudera Manager Agent logs and the Service Monitor logs for more detail. If the test's failure message indicates a communication problem, the Cloudera Manager Agent's HTTP requests to the Service Monitor's web server are failing or timing out. These requests are completely local to the Service Monitor's host, and so should never fail under normal conditions. If the failure message indicates an unexpected response, the Service Monitor's web server responded to the agent's request, but the agent could not interpret the response.

This test can be configured using the Web Metric Collection Service Monitor monitoring setting.

Short Name: Web Server Status

Web Metric Collection
  Description: Enables the health check that the Cloudera Manager Agent can successfully contact and gather metrics from the web server.
  Property: servicemonitor_web_metric_collection_enabled
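The distinction the check draws between a communication problem and an unexpected response can be sketched as follows; the metrics URL and the JSON payload format are assumptions for illustration, not the Service Monitor's actual endpoint:

```python
import json
import urllib.error
import urllib.request

# Classify a metric-collection attempt the way the check's failure
# messages do: a "communication problem" means the HTTP request itself
# failed or timed out, while an "unexpected response" means the server
# answered but the reply could not be interpreted. The URL and JSON
# payload format are assumptions, not the actual metrics endpoint.
def collect_metrics(url, timeout=10):
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read()
    except (urllib.error.URLError, OSError) as exc:
        return ("communication problem", str(exc))
    try:
        return ("ok", json.loads(body))
    except ValueError:
        return ("unexpected response", body[:100])
```

Separating the two failure modes mirrors the troubleshooting guidance above: the first points at the agent's connectivity, the second at the web server's output.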
TaskTracker Blacklisted Status

Details: This TaskTracker health check verifies that the JobTracker has not blacklisted the TaskTracker. A failure of this health check indicates that the JobTracker has blacklisted the TaskTracker because the failure rate of tasks on the TaskTracker is significantly higher than the average cluster failure rate. Check the TaskTracker logs for more details.

This test can be enabled or disabled using the TaskTracker Blacklisted Health Check TaskTracker monitoring setting.

Short Name: Blacklisted Status

TaskTracker Blacklisted Health Check
  Description: Enables the health check that the TaskTracker is not blacklisted.
  Property: tasktracker_blacklisted_health_enabled

TaskTracker File Descriptors

Details: This TaskTracker health check verifies that the number of file descriptors used does not rise above some percentage of the TaskTracker's file descriptor limit. A failure of this health check may indicate a bug in either Hadoop or Cloudera Manager. Contact Cloudera support.

This test can be configured using the File Descriptor Monitoring TaskTracker monitoring setting.

Short Name: File Descriptors

File Descriptor Monitoring
  Description: The health check of the number of file descriptors used. Specified as a percentage of the file descriptor limit.
  Property: tasktracker_fd_
  Default: critical: , warning:

TaskTracker GC Duration

Details: This TaskTracker health check verifies that the TaskTracker is not spending too much time performing Java garbage collection. It checks that no more than some percentage of recent time is spent performing Java garbage collection. A failure of this health check may indicate a capacity planning problem or a misconfiguration of the TaskTracker.

This test can be configured using the TaskTracker Garbage Collection Duration and TaskTracker Garbage Collection Duration Monitoring Period TaskTracker monitoring settings.

Short Name: GC Duration
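The file-descriptor measurement behind this check can be approximated on Linux by comparing open descriptors against the process's soft limit; this is an illustration of the metric, not Cloudera Manager's collection code:

```python
import os
import resource

# Open file descriptors as a percentage of the process's soft limit,
# approximating the metric behind the File Descriptors check. Counting
# entries under /proc/self/fd is Linux-specific; an illustration only.
def fd_usage_percent():
    used = len(os.listdir("/proc/self/fd"))
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        return 0.0  # no limit to measure against
    return 100.0 * used / soft
```

The resulting percentage is what the critical and warning thresholds of File Descriptor Monitoring are compared against.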
TaskTracker Garbage Collection Duration Monitoring Period
  Description: The period to review when computing the moving average of garbage collection time.
  Property: tasktracker_gc_duration_window
  Default: 5 MINUTES

TaskTracker Garbage Collection Duration
  Description: The health check for the weighted average time spent in Java garbage collection. Specified as a percentage of elapsed wall clock time. See TaskTracker Garbage Collection Duration Monitoring Period.
  Property: tasktracker_gc_duration_
  Default: critical: , warning:

TaskTracker Host Health

Details: This TaskTracker health check factors in the health of the host on which the TaskTracker is running. A failure of this check means that the host running the TaskTracker is experiencing some problem. See that host's status page for more details.

This test can be enabled or disabled using the TaskTracker Host Health Check TaskTracker monitoring setting.

Short Name: Host Health

TaskTracker Host Health Check
  Description: When computing the overall TaskTracker health, consider the host's health.
  Property: tasktracker_host_health_enabled

TaskTracker JobTracker Connectivity

Details: This TaskTracker health check verifies that the JobTracker considers the TaskTracker alive. A failure of this health check may indicate that the TaskTracker is having trouble communicating with the JobTracker. Look in the TaskTracker logs for more details.

This test can be enabled or disabled using the TaskTracker Connectivity Health Check TaskTracker monitoring setting. The TaskTracker Connectivity Tolerance at Startup TaskTracker monitoring setting and the Health Check Startup Tolerance JobTracker monitoring setting control the check's tolerance windows around TaskTracker and JobTracker restarts, respectively.

Short Name: JobTracker Connectivity
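The GC Duration computation described above, the share of elapsed wall-clock time spent in garbage collection over a monitoring window, can be sketched from (gc_millis, elapsed_millis) samples; the exact weighting Cloudera Manager applies over the window may differ:

```python
# Share of elapsed wall-clock time spent in Java garbage collection over
# a monitoring window, from (gc_millis, elapsed_millis) samples. A sketch
# of the documented metric; Cloudera Manager's actual weighting may differ.
def gc_duration_percent(samples):
    gc_ms = sum(gc for gc, _ in samples)
    wall_ms = sum(elapsed for _, elapsed in samples)
    return 100.0 * gc_ms / wall_ms if wall_ms else 0.0
```

The resulting percentage is what the TaskTracker Garbage Collection Duration thresholds are compared against.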
Health Check Startup Tolerance
  Description: The amount of time allowed after this role is started during which failures of health checks that rely on communication with this role will be tolerated.
  Property: jobtracker_startup_tolerance
  Default: 5 MINUTES

TaskTracker Connectivity Health Check
  Description: Enables the health check that the TaskTracker is connected to the JobTracker.
  Property: tasktracker_connectivity_health_enabled

TaskTracker Connectivity Tolerance at Startup
  Description: The amount of time to wait for the TaskTracker to fully start up and connect to the JobTracker before enforcing the connectivity check.
  Property: tasktracker_connectivity_tolerance
  Default: 180 SECONDS

TaskTracker Log Directory Free Space

Details: This TaskTracker health check verifies that the filesystem containing the log directory of this TaskTracker has sufficient free space.

This test can be configured using the Log Directory Free Space Monitoring Absolute and Log Directory Free Space Monitoring Percentage TaskTracker monitoring settings.

Short Name: Log Directory Free Space

Log Directory Free Space Monitoring Absolute
  Description: The health check for monitoring of free space on the filesystem that contains this role's log directory.
  Property: log_directory_free_space_absolute_
  Default: critical: , warning: BYTES

Log Directory Free Space Monitoring Percentage
  Description: The health check for monitoring of free space on the filesystem that contains this role's log directory. Specified as a percentage of the capacity on that filesystem. This setting is not used if a Log Directory Free Space Monitoring Absolute setting is configured.
  Property: log_directory_free_space_percentage_
  Default: critical:never, warning:never

TaskTracker Process Status

Details: This TaskTracker health check verifies that the Cloudera Manager Agent on the TaskTracker host is heartbeating correctly and that the process associated with the TaskTracker role is in the state expected by Cloudera Manager. A failure of this health check may indicate a problem with the TaskTracker process, a lack of connectivity to the Cloudera Manager Agent on the TaskTracker host, or a problem with the Cloudera Manager Agent itself. The check can fail either because the TaskTracker has crashed or because the TaskTracker will not start or stop in a timely fashion. Check the TaskTracker logs for more details. If the check fails because of problems communicating with the Cloudera Manager Agent on the TaskTracker host, check the status of the agent by running /etc/init.d/cloudera-scm-agent status on the TaskTracker host, or look in the Cloudera Manager Agent logs on that host for more details.

This test can be enabled or disabled using the TaskTracker Process Health Check TaskTracker monitoring setting.

Short Name: Process Status

TaskTracker Process Health Check
  Description: Enables the health check that the TaskTracker's process state is consistent with the role configuration.
  Property: tasktracker_scm_health_enabled

TaskTracker Unexpected Exits

Details: This TaskTracker health check verifies that the TaskTracker has not recently exited unexpectedly. The check returns "Bad" health if the number of unexpected exits rises above a critical threshold. For example, if this check is configured with a critical threshold of 1, it returns "Good" health if there have been no unexpected exits recently, and "Bad" health if there have been one or more.

This test can be configured using the Unexpected Exits and Unexpected Exits Monitoring Period TaskTracker monitoring settings.

Short Name: Unexpected Exits

Unexpected Exits Monitoring Period
  Description: The period to review when computing unexpected exits.
  Property: unexpected_exits_window
  Default: 5 MINUTES

Unexpected Exits
  Description: The health check for unexpected exits encountered within a recent period specified by the unexpected_exits_window configuration for the role.
  Property: unexpected_exits_
  Default: critical:any, warning:never

TaskTracker Web Server Status

Details: This TaskTracker health check verifies that the TaskTracker's web server is responding quickly to requests from the Cloudera Manager Agent, and that the agent can collect metrics from the web server. A failure of this health check may indicate a problem with the TaskTracker's web server, a misconfiguration of the TaskTracker, or a problem with the Cloudera Manager Agent. Consult the Cloudera Manager Agent logs and the TaskTracker logs for more detail. If the test's failure message indicates a communication problem, the Cloudera Manager Agent's HTTP requests to the TaskTracker's web server are failing or timing out. These requests are completely local to the TaskTracker's host, and so should never fail under normal conditions. If the failure message indicates an unexpected response, the TaskTracker's web server responded to the agent's request, but the agent could not interpret the response.

This test can be configured using the Web Metric Collection TaskTracker monitoring setting.

Short Name: Web Server Status

Web Metric Collection
  Description: Enables the health check that the Cloudera Manager Agent can successfully contact and gather metrics from the web server.
  Property: tasktracker_web_metric_collection_enabled
ZooKeeper Canary

Details: This is a ZooKeeper service-level health check that verifies that basic client operations are working and completing in a reasonable amount of time. It reports the results of a periodic "canary" test that performs the following sequence of operations. First, the canary connects to the ZooKeeper service, establishes a session (the root session), and creates a permanent znode to serve as the root of all canary operations. The canary then connects to each ZooKeeper server of the service and establishes a session with each (the child sessions). Each child session is used to create an ephemeral child znode under the canary root. After the child znodes have been created, watches that await znode deletion events are registered on each child znode for each child session. The canary then deletes each child znode and verifies that each child session receives deletion notifications for each child znode. Finally, the canary closes all the child sessions, deletes the root znode, and closes the root session.

The check returns "Bad" health if establishment of the root session with the ZooKeeper service fails, creation of znodes (permanent or ephemeral) fails, deletion of znodes fails, or retrieval of the root znode's children fails. The check returns "Concerning" health when the canary test succeeds but one or more servers could not participate in the canary operations, or when the canary test runs too slowly.

A failure of this health check may indicate that ZooKeeper is failing to satisfy client requests correctly or in a timely fashion. Check the status of the ZooKeeper servers, and look in the ZooKeeper server logs for more details.

This test can be enabled or disabled using the ZooKeeper Canary Health Check ZooKeeper service monitoring setting. The ZooKeeper Canary Root Znode Path, ZooKeeper Canary Connection Timeout, ZooKeeper Canary Session Timeout, and ZooKeeper Canary Operation Timeout settings control the operation of the canary.

Short Name: ZooKeeper Canary

ZooKeeper Canary Connection Timeout
  Description: Configures the timeout used by the canary for connection establishment with ZooKeeper servers.
  Property: zookeeper_canary_connection_timeout
  Default: MILLISECONDS

ZooKeeper Canary Health Check
  Description: Enables the health check that a client can connect to ZooKeeper and perform basic operations.
  Property: zookeeper_canary_health_enabled

ZooKeeper Canary Operation Timeout
  Description: Configures the timeout used by the canary for ZooKeeper operations.
  Property: zookeeper_canary_operation_timeout
  Default: MILLISECONDS
ZooKeeper Canary Root Znode Path
  Description: Configures the path of the root znode under which all canary updates are performed.
  Property: zookeeper_canary_root_path
  Default: /cloudera_manager_zookeeper_canary

ZooKeeper Canary Session Timeout
  Description: Configures the timeout used by the canary for sessions with ZooKeeper servers.
  Property: zookeeper_canary_session_timeout
  Default: MILLISECONDS

ZooKeeper Servers Health

Details: This is a ZooKeeper service-level health check that verifies that enough of the ZooKeeper servers in the cluster are healthy. The check returns "Concerning" health if the number of healthy ZooKeeper servers falls below a warning threshold, expressed as a percentage of the total number of ZooKeeper servers. The check returns "Bad" health if the number of healthy and "Concerning" ZooKeeper servers falls below a critical threshold, expressed as a percentage of the total number of ZooKeeper servers. For example, if this check is configured with a warning threshold of 80% and a critical threshold of 60% for a cluster of 5 ZooKeeper servers, the check returns "Good" health if 4 or more ZooKeeper servers have good health, "Concerning" health if at least 3 ZooKeeper servers have either "Good" or "Concerning" health, and "Bad" health if more than 2 ZooKeeper servers have bad health.

A failure of this health check indicates unhealthy ZooKeeper servers. Check the status of the individual ZooKeeper servers for more information.

This test can be configured using the Healthy ZooKeeper Server Monitoring ZooKeeper service-wide monitoring setting.

Short Name: ZooKeeper Servers Health

Healthy ZooKeeper Server Monitoring
  Description: The health check of the overall ZooKeeper service health. The check returns "Concerning" health if the percentage of "Healthy" ZooKeeper servers falls below the warning threshold. The check is unhealthy if the total percentage of "Healthy" and "Concerning" ZooKeeper servers falls below the critical threshold.
  Property: zookeeper_servers_healthy_
  Default: critical: , warning:

ZooKeeper ZXID Rollover

Details: This ZooKeeper service-level health check monitors the current zxid to ensure that its xid component does not roll over. The zxid is a 64-bit number maintained by ZooKeeper and is made up of two parts: the higher-order 32 bits are the epoch and the lower-order 32 bits are the xid. This check concerns itself with the xid portion, which has a maximum possible value of 0xffffffff. If the xid reaches this value, a rollover can occur. The check returns "Concerning" or "Bad" health if the current xid is above a warning threshold or critical threshold, respectively. The threshold is expressed as a percentage of the maximum possible xid. For example, if this check is configured with a warning percentage threshold of 80% and a critical percentage threshold of 95% for a ZooKeeper service, the check returns "Good" health if the current xid is less than 0xcccccccc, "Concerning" health if the current xid is between 0xcccccccc and 0xf3333333, and "Bad" health if the current xid is above 0xf3333333.

A failure of this health check indicates that an overflow of the xid may occur in the near future unless the corrective action of forcing a leader election is taken. This test is disabled by default because rollover of the xid is a concern only in releases prior to CDH3u4; for those releases, the test needs to be enabled explicitly.

This test can be configured using the ZooKeeper Current Zxid Monitoring Percentage ZooKeeper service-wide monitoring setting.

Short Name: ZXID Rollover

ZooKeeper Current Zxid Monitoring Percentage
  Description: The health check for monitoring of the xid portion of the current zxid of the service. Specified as a percentage of the maximum possible xid value of 0xffffffff.
  Property: zookeeper_current_zxid_percentage_thresholds
  Default: critical:never, warning:never
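The three ZooKeeper service-level evaluations described above (canary outcome, servers health, and zxid rollover) reduce to simple threshold logic. A sketch of that logic as documented, using the example thresholds from the text as defaults; the field names are illustrative, not Cloudera Manager's data model:

```python
XID_MAX = 0xFFFFFFFF  # maximum possible xid (low-order 32 bits of the zxid)

# Canary: any failed operation is Bad; success with non-participating
# servers or a slow run is Concerning; otherwise Good.
def canary_health(ops_succeeded, servers_total, servers_participating,
                  duration_ms, operation_timeout_ms):
    if not ops_succeeded:
        return "Bad"
    if servers_participating < servers_total or duration_ms > operation_timeout_ms:
        return "Concerning"
    return "Good"

# Servers health: Good while the healthy percentage meets the warning
# threshold; Concerning while healthy plus concerning servers still meet
# the critical threshold; otherwise Bad. Defaults match the 80%/60%
# example in the text, not necessarily the product defaults.
def servers_health(good, concerning, total, warning_pct=80.0, critical_pct=60.0):
    if 100.0 * good / total >= warning_pct:
        return "Good"
    if 100.0 * (good + concerning) / total >= critical_pct:
        return "Concerning"
    return "Bad"

# ZXID rollover: mask off the epoch (high 32 bits), express the xid as a
# percentage of XID_MAX, and compare against the warning and critical
# percentages. Defaults match the 80%/95% example in the text.
def zxid_health(zxid, warning_pct=80.0, critical_pct=95.0):
    pct = 100.0 * (zxid & XID_MAX) / XID_MAX
    if pct >= critical_pct:
        return "Bad"
    if pct >= warning_pct:
        return "Concerning"
    return "Good"
```

With the 5-server example above, servers_health(4, 0, 5) is "Good", servers_health(3, 0, 5) is "Concerning", and servers_health(2, 0, 5) is "Bad".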
Take An Internal Look at Hadoop. Hairong Kuang Grid Team, Yahoo! Inc [email protected]
Take An Internal Look at Hadoop Hairong Kuang Grid Team, Yahoo! Inc [email protected] What s Hadoop Framework for running applications on large clusters of commodity hardware Scale: petabytes of data
Hadoop Distributed File System. Jordan Prosch, Matt Kipps
Hadoop Distributed File System Jordan Prosch, Matt Kipps Outline - Background - Architecture - Comments & Suggestions Background What is HDFS? Part of Apache Hadoop - distributed storage What is Hadoop?
Qsoft Inc www.qsoft-inc.com
Big Data & Hadoop Qsoft Inc www.qsoft-inc.com Course Topics 1 2 3 4 5 6 Week 1: Introduction to Big Data, Hadoop Architecture and HDFS Week 2: Setting up Hadoop Cluster Week 3: MapReduce Part 1 Week 4:
HDFS Under the Hood. Sanjay Radia. [email protected] Grid Computing, Hadoop Yahoo Inc.
HDFS Under the Hood Sanjay Radia [email protected] Grid Computing, Hadoop Yahoo Inc. 1 Outline Overview of Hadoop, an open source project Design of HDFS On going work 2 Hadoop Hadoop provides a framework
Hadoop Distributed File System (HDFS) Overview
2012 coreservlets.com and Dima May Hadoop Distributed File System (HDFS) Overview Originals of slides and source code for examples: http://www.coreservlets.com/hadoop-tutorial/ Also see the customized
BIG DATA & HADOOP DEVELOPER TRAINING & CERTIFICATION
FACT SHEET BIG DATA & HADOOP DEVELOPER TRAINING & CERTIFICATION BIGDATA & HADOOP CLASS ROOM SESSION GreyCampus provides Classroom sessions for Big Data & Hadoop Developer Certification. This course will
Weekly Report. Hadoop Introduction. submitted By Anurag Sharma. Department of Computer Science and Engineering. Indian Institute of Technology Bombay
Weekly Report Hadoop Introduction submitted By Anurag Sharma Department of Computer Science and Engineering Indian Institute of Technology Bombay Chapter 1 What is Hadoop? Apache Hadoop (High-availability
Pipeliner CRM Phaenomena Guide Sales Pipeline Management. 2015 Pipelinersales Inc. www.pipelinersales.com
Sales Pipeline Management 2015 Pipelinersales Inc. www.pipelinersales.com Sales Pipeline Management Learn how to manage sales opportunities with Pipeliner Sales CRM Application. CONTENT 1. Configuring
Taming Operations in the Apache Hadoop Ecosystem. Jon Hsieh, [email protected] Kate Ting, [email protected] USENIX LISA 14 Nov 14, 2014
Taming Operations in the Apache Hadoop Ecosystem Jon Hsieh, [email protected] Kate Ting, [email protected] USENIX LISA 14 Nov 14, 2014 $ whoami Jon Hsieh, Cloudera Software engineer HBase Tech Lead Apache
The Greenplum Analytics Workbench
The Greenplum Analytics Workbench External Overview 1 The Greenplum Analytics Workbench Definition Is a 1000-node Hadoop Cluster. Pre-configured with publicly available data sets. Contains the entire Hadoop
Pivotal HD Enterprise 1.0 Stack and Tool Reference Guide. Rev: A03
Pivotal HD Enterprise 1.0 Stack and Tool Reference Guide Rev: A03 Use of Open Source This product may be distributed with open source code, licensed to you in accordance with the applicable open source
How to Install and Configure EBF15328 for MapR 4.0.1 or 4.0.2 with MapReduce v1
How to Install and Configure EBF15328 for MapR 4.0.1 or 4.0.2 with MapReduce v1 1993-2015 Informatica Corporation. No part of this document may be reproduced or transmitted in any form, by any means (electronic,
HADOOP MOCK TEST HADOOP MOCK TEST
http://www.tutorialspoint.com HADOOP MOCK TEST Copyright tutorialspoint.com This section presents you various set of Mock Tests related to Hadoop Framework. You can download these sample mock tests at
Journal of science STUDY ON REPLICA MANAGEMENT AND HIGH AVAILABILITY IN HADOOP DISTRIBUTED FILE SYSTEM (HDFS)
Journal of science e ISSN 2277-3290 Print ISSN 2277-3282 Information Technology www.journalofscience.net STUDY ON REPLICA MANAGEMENT AND HIGH AVAILABILITY IN HADOOP DISTRIBUTED FILE SYSTEM (HDFS) S. Chandra
Hadoop and ecosystem * 本 文 中 的 言 论 仅 代 表 作 者 个 人 观 点 * 本 文 中 的 一 些 图 例 来 自 于 互 联 网. Information Management. Information Management IBM CDL Lab
IBM CDL Lab Hadoop and ecosystem * 本 文 中 的 言 论 仅 代 表 作 者 个 人 观 点 * 本 文 中 的 一 些 图 例 来 自 于 互 联 网 Information Management 2012 IBM Corporation Agenda Hadoop 技 术 Hadoop 概 述 Hadoop 1.x Hadoop 2.x Hadoop 生 态
Lecture 2 (08/31, 09/02, 09/09): Hadoop. Decisions, Operations & Information Technologies Robert H. Smith School of Business Fall, 2015
Lecture 2 (08/31, 09/02, 09/09): Hadoop Decisions, Operations & Information Technologies Robert H. Smith School of Business Fall, 2015 K. Zhang BUDT 758 What we ll cover Overview Architecture o Hadoop
Important Notice. (c) 2010-2015 Cloudera, Inc. All rights reserved.
Cloudera Security Important Notice (c) 2010-2015 Cloudera, Inc. All rights reserved. Cloudera, the Cloudera logo, Cloudera Impala, and any other product or service names or slogans contained in this document
Big Data With Hadoop
With Saurabh Singh [email protected] The Ohio State University February 11, 2016 Overview 1 2 3 Requirements Ecosystem Resilient Distributed Datasets (RDDs) Example Code vs Mapreduce 4 5 Source: [Tutorials
Pipeliner CRM Phaenomena Guide Getting Started with Pipeliner. 2015 Pipelinersales Inc. www.pipelinersales.com
Getting Started with Pipeliner 05 Pipelinersales Inc. www.pipelinersales.com Getting Started with Pipeliner Learn How to Get Started with Pipeliner Sales CRM Application. CONTENT. Setting up Pipeliner
Processing of massive data: MapReduce. 2. Hadoop. New Trends In Distributed Systems MSc Software and Systems
Processing of massive data: MapReduce 2. Hadoop 1 MapReduce Implementations Google were the first that applied MapReduce for big data analysis Their idea was introduced in their seminal paper MapReduce:
Hadoop. Apache Hadoop is an open-source software framework for storage and large scale processing of data-sets on clusters of commodity hardware.
Hadoop Source Alessandro Rezzani, Big Data - Architettura, tecnologie e metodi per l utilizzo di grandi basi di dati, Apogeo Education, ottobre 2013 wikipedia Hadoop Apache Hadoop is an open-source software
Non-Stop Hadoop Paul Scott-Murphy VP Field Techincal Service, APJ. Cloudera World Japan November 2014
Non-Stop Hadoop Paul Scott-Murphy VP Field Techincal Service, APJ Cloudera World Japan November 2014 WANdisco Background WANdisco: Wide Area Network Distributed Computing Enterprise ready, high availability
CRM to Exchange Synchronization
CRM to Exchange Synchronization End-User Instructions VERSION 2.0 DATE PREPARED: 1/1/2013 DEVELOPMENT: BRITE GLOBAL, INC. 2013 Brite Global, Incorporated. All rights reserved. The information contained
Introduction to HDFS. Prasanth Kothuri, CERN
Prasanth Kothuri, CERN 2 What s HDFS HDFS is a distributed file system that is fault tolerant, scalable and extremely easy to expand. HDFS is the primary distributed storage for Hadoop applications. HDFS
Polycom CMA System Upgrade Guide
Polycom CMA System Upgrade Guide 5.0 May 2010 3725-77606-001C Trademark Information Polycom, the Polycom Triangles logo, and the names and marks associated with Polycom s products are trademarks and/or
HDFS Design Principles
HDFS Design Principles The Scale-out-Ability of Distributed Storage SVForum Software Architecture & Platform SIG Konstantin V. Shvachko May 23, 2012 Big Data Computations that need the power of many computers
... ... PEPPERDATA OVERVIEW AND DIFFERENTIATORS ... ... ... ... ...
..................................... WHITEPAPER PEPPERDATA OVERVIEW AND DIFFERENTIATORS INTRODUCTION Prospective customers will often pose the question, How is Pepperdata different from tools like Ganglia,
Hadoop Distributed File System. Dhruba Borthakur June, 2007
Hadoop Distributed File System Dhruba Borthakur June, 2007 Goals of HDFS Very Large Distributed File System 10K nodes, 100 million files, 10 PB Assumes Commodity Hardware Files are replicated to handle
Apache Hadoop new way for the company to store and analyze big data
Apache Hadoop new way for the company to store and analyze big data Reyna Ulaque Software Engineer Agenda What is Big Data? What is Hadoop? Who uses Hadoop? Hadoop Architecture Hadoop Distributed File
How To Set Up A Load Balancer With Windows 2010 Outlook 2010 On A Server With A Webmux On A Windows Vista V2.2.5.2 (Windows V2) On A Network With A Server (Windows) On
Load Balancing Exchange 2010 OWA for External Access using WebMux Published: April 2011 Information in this document, including URL and other Internet Web site references, is subject to change without
Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware
Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Created by Doug Cutting and Mike Carafella in 2005. Cutting named the program after
COURSE CONTENT Big Data and Hadoop Training
COURSE CONTENT Big Data and Hadoop Training 1. Meet Hadoop Data! Data Storage and Analysis Comparison with Other Systems RDBMS Grid Computing Volunteer Computing A Brief History of Hadoop Apache Hadoop
